**Author**: Marcello Victorino<br>
**Date**: 04/16 - 
<hr>

# Data Wrangling: Armenian Online Job Posting

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
# Introduction

The [dataset](https://www.kaggle.com/udacity/armenian-online-job-postings) consists of about 19,000 job postings published between 2004 and 2015, described by 24 columns.

**Variables Description:**
1. **jobpost**: The original job post.<br>
+ **date**: The date it was posted in the group.<br>
+ **Title**: Job title.<br>
+ **Company**: Employer name. <br>
+ **AnnouncementCode**: Announcement code, which is some internal code and is usually missing.<br>
+ **Term**: Full-Time, Part-time, etc.<br>
+ **Eligibility**: Eligibility of the candidates.<br>
+ **Audience**: Who can apply? <br>
+ **StartDate**: Start date of work.<br>
+ **Duration**: Duration of the employment.<br>
+ **Location**: Employment location.<br>
+ **JobDescription**: Job Description.<br>
+ **JobRequirment**: Job requirements.<br>
+ **RequiredQual**: Required qualifications.<br>
+ **Salary**: Job salary.<br>
+ **ApplicationP**: Application procedure.<br>
+ **OpeningDate**: Opening date of the job announcement.<br>
+ **Deadline**: Deadline for the job announcement. <br>
+ **Notes**: Additional notes.<br>
+ **AboutC**: About the company.<br>
+ **Attach**: Attachments.<br>
+ **Year**: Year of the announcement (derived from the field date). <br>
+ **Month**: Month of the announcement (derived from the field date). <br>
+ **IT**: TRUE if the job is an IT job. This variable is created by a simple search of IT job titles within the "Title" column.<br>

The goal of this project is to wrangle the dataset: assess its quality and clean it before performing any EDA or modeling.

Let's import the necessary libraries and read the data into a dataframe:

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns


<a id='wrangling'></a>
# Data Wrangling

## Gather

In [2]:
df = pd.read_csv('online-job-postings.csv')
df.head(5)

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,...,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
0,AMERIA Investment Consulting Company\r\nJOB TI...,"Jan 5, 2004",Chief Financial Officer,AMERIA Investment Consulting Company,,,,,,,...,,"To apply for this position, please submit a\r\...",,26 January 2004,,,,2004,1,False
1,International Research & Exchanges Board (IREX...,"Jan 7, 2004",Full-time Community Connections Intern (paid i...,International Research & Exchanges Board (IREX),,,,,,3 months,...,,Please submit a cover letter and resume to:\r\...,,12 January 2004,,The International Research & Exchanges Board (...,,2004,1,False
2,Caucasus Environmental NGO Network (CENN)\r\nJ...,"Jan 7, 2004",Country Coordinator,Caucasus Environmental NGO Network (CENN),,,,,,Renewable annual contract\r\nPOSITION,...,,Please send resume or CV toursula.kazarian@......,,20 January 2004\r\nSTART DATE: February 2004,,The Caucasus Environmental NGO Network is a\r\...,,2004,1,False
3,Manoff Group\r\nJOB TITLE: BCC Specialist\r\n...,"Jan 7, 2004",BCC Specialist,Manoff Group,,,,,,,...,,Please send cover letter and resume to Amy\r\n...,,23 January 2004\r\nSTART DATE: Immediate,,,,2004,1,False
4,Yerevan Brandy Company\r\nJOB TITLE: Software...,"Jan 10, 2004",Software Developer,Yerevan Brandy Company,,,,,,,...,,Successful candidates should submit\r\n- CV; \...,,"20 January 2004, 18:00",,,,2004,1,True


## Assess

In [3]:
# Missing Values
df.isna().any()

jobpost             False
date                False
Title                True
Company              True
AnnouncementCode     True
Term                 True
Eligibility          True
Audience             True
StartDate            True
Duration             True
Location             True
JobDescription       True
JobRequirment        True
RequiredQual         True
Salary               True
ApplicationP         True
OpeningDate          True
Deadline             True
Notes                True
AboutC               True
Attach               True
Year                False
Month               False
IT                  False
dtype: bool
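
The boolean summary above only says whether a column has gaps, not how many. As a minimal sketch, assuming a small hypothetical stand-in frame (`sample` and its values are illustrative, not taken from the real data), the count and share of missing values per column could be tabulated like this:

```python
import pandas as pd

# Hypothetical miniature stand-in for the job postings dataframe
sample = pd.DataFrame({
    'Title': ['Chief Financial Officer', 'Country Coordinator', None, 'BCC Specialist'],
    'Term': [None, 'Full time', None, None],
    'Year': [2004, 2004, 2004, 2004],
})

# Count NaN per column, add the share of rows affected, and sort worst-first
missing = (
    sample.isna().sum()
    .to_frame('n_missing')
    .assign(pct_missing=lambda t: (t['n_missing'] / len(sample) * 100).round(1))
    .sort_values('n_missing', ascending=False)
)
print(missing)
```

Running the same expression on `df` instead of `sample` should rank the real columns by how incomplete they are.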

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19001 entries, 0 to 19000
Data columns (total 24 columns):
jobpost             19001 non-null object
date                19001 non-null object
Title               18973 non-null object
Company             18994 non-null object
AnnouncementCode    1208 non-null object
Term                7676 non-null object
Eligibility         4930 non-null object
Audience            640 non-null object
StartDate           9675 non-null object
Duration            10798 non-null object
Location            18969 non-null object
JobDescription      15109 non-null object
JobRequirment       16479 non-null object
RequiredQual        18517 non-null object
Salary              9622 non-null object
ApplicationP        18941 non-null object
OpeningDate         18295 non-null object
Deadline            18936 non-null object
Notes               2211 non-null object
AboutC              12470 non-null object
Attach              1559 non-null object
Year              

In [5]:
# Inspect the distinct values of a variable and how often each one occurs
var = df.Term
for value, count in var.value_counts().items():
    # Truncate long values to 50 characters; str() guards against non-string values
    print(f'{str(value)[:50]} | {count}')

Full time | 5348
Full-time | 1078
Part time | 136
Full Time | 104
Long term | 102
Part-time or Full-time | 92
Part-time | 68
Permanent | 51
Long-term | 22
Full time salaried - 40 hours per week | 15
Full-Time | 14
ASAP | 13
Long Term | 13
Short term | 9
Up to 5 years | 9
Part time/ Full time | 8
Full time or part time | 8
Fixed term | 8
Part-time/ Full-time | 7
Full time/ Part time | 7
Full-tim | 6
Part time or full time | 6
Full time (8 working hours per day). | 6
Long term with 3 months probation period | 6
Full term | 6
Long term, with 3 months probation period | 6
Part-time or full-time | 5
Full time, flex time | 5
Full time (40 hrs/week) | 5
Service Contract | 5
Full time, 40 hours/week | 5
Short-term | 5
Full-time; 40 hours/week | 4
Contract | 4
Full time (6 days, 9:00-18:00, 18:00-24:00) | 4
Night shift | 4
Part-time (Full-time preferable) | 4
Full time/ Part time, flexible hours | 4
50 hrs per week | 4
According to current curricula | 4
Morning, night and afternoon shifts | 4
C

## Assess

+ **Irrelevant Information:**  either brings no informative value or has few relevant groups
    - jobpost: original data scraped into columns
    - Eligibility
    - Audience
    - Notes
    - AboutC (company)
    - Attach
    - AnnouncementCode
    - JobDescription
    - JobRequirment
    - Salary: competitive!?
    - ApplicationP (process)


+ **Potentially Informative:**
    - RequiredQual: requires scraping for key-words identification and count
    - date: periods of time with spike in job posting?
    - Title: hottest professions
    - Company: Biggest hirers
    - Term: correlation between contract kind (Full vs Part time) and other variables (job, requirement, salary)
    - StartDate: planned hiring or necessity?
    - Duration: correlation between duration and other variables
    - Location: hottest places per sector
    - OpeningDate - Deadline: application window


+ **Redundant Values:**
    - Term: Full time, Full-time ...
    - StartDate: ASAP, Immediately, As soon as Possible ...
    - Duration: Long term, Long-term ...
    - Location: Yerevan, Armenia + description
    - Salary: many descriptions
    - OpeningDate: formatting (blank space)
    - Deadline: Rolling, Rolling Groups, value + description
    - Title: requires aggregation (Software Developer, Java developer, Web developer...)


+ **Data Type:**
    - date: object -> datetime
    - OpeningDate: object -> datetime
    - Deadline: object -> datetime

+ **Column names:**
    - standardize string format (lowercase, with `_` instead of spaces or CamelCase)
    
    
+ **Missing values**: NaN present in 19 columns; reassess after the initial cleaning
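
The data type and column name items above could be approached roughly as follows. This is only a sketch under stated assumptions: `sample` is a hypothetical miniature frame, the `date` and `Deadline` formats mimic the values shown by `df.head()`, the `OpeningDate` values and the `'Rolling'` deadline are made up for illustration, and `to_snake_case` is a helper introduced here, not part of the notebook.

```python
import re
import pandas as pd

# Hypothetical miniature frame (values are illustrative)
sample = pd.DataFrame({
    'date': ['Jan 5, 2004', 'Jan 7, 2004', 'Jan 10, 2004'],
    'OpeningDate': ['05 January 2004', '07 January 2004', ' 10 January 2004'],
    'Deadline': ['26 January 2004', 'Rolling', '20 January 2004'],
})

def to_snake_case(name):
    """Convert CamelCase or space-separated names to lower snake_case."""
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name.strip())
    return re.sub(r'\s+', '_', name).lower()

# Column names: OpeningDate -> opening_date, Deadline -> deadline
sample.columns = [to_snake_case(c) for c in sample.columns]

# object -> datetime; values that do not fit the format become NaT
sample['date'] = pd.to_datetime(sample['date'], format='%b %d, %Y', errors='coerce')
for col in ['opening_date', 'deadline']:
    cleaned = sample[col].str.strip()  # drop stray blank spaces first
    sample[col] = pd.to_datetime(cleaned, format='%d %B %Y', errors='coerce')

# Application window as a timedelta (NaT where either date is missing)
sample['application_window'] = sample['deadline'] - sample['opening_date']
print(sample)
```

Using `errors='coerce'` keeps the conversion from failing on free-text entries, at the cost of turning them into missing values that need a separate decision later.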

## Cleaning Plan

### Issue 1
#### Define
Remove irrelevant variables from dataframe: ['jobpost', 'Eligibility', 'Audience', 'Notes', 'AboutC', 'Attach', 'AnnouncementCode', 'JobDescription', 'JobRequirment', 'Salary', 'ApplicationP']

#### Code

In [6]:
drop_variables = ['jobpost', 'Eligibility', 'Audience', 'Notes', 'AboutC',
            'Attach', 'AnnouncementCode', 'JobDescription', 'JobRequirment',
            'Salary', 'ApplicationP']

df.drop(columns=drop_variables, inplace=True)

#### Test

In [7]:
# Check each dropped name individually (a list is never an element of the column list)
assert not any(col in df.columns for col in drop_variables)

### Issue 2
#### Define
Term: standardize values into representative groups: 'Full-time', 'Part-time', 'Full/Part-time', 'Contract', and 'Other'

#### Code

In [24]:
# Verification of outcome before actually replacing the dataframe
def check_replacement(column, pattern):
    print(column.str.contains(pattern).sum())
    return column.replace(regex=pattern, value='Test', inplace=False).value_counts()

In [9]:
# Substitute line breaks with underscores so that '.' in the regex patterns can match across them
df['Term'] = df['Term'].replace('\n', '_', regex=True)
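
Why the line breaks matter: by default `.` in a Python regex does not match `\n`, so a pattern ending in `.*` stops at the first line break and leaves the rest of the value behind. A small illustration, using a simplified hypothetical pattern and a made-up value rather than the notebook's own:

```python
import re

pattern = '(?i)full.*'          # simplified, hypothetical pattern
raw = 'Full time\nPOSITION'     # made-up value containing a line break

# '.*' stops at the newline, so the tail of the value survives the replacement
leftover = re.sub(pattern, 'Full-time', raw)
print(repr(leftover))

# After swapping the newline for an underscore, the whole value is captured
flattened = raw.replace('\n', '_')
clean = re.sub(pattern, 'Full-time', flattened)
print(repr(clean))
```

An alternative would be compiling the patterns with `re.DOTALL`; replacing the line breaks up front simply keeps the patterns themselves unchanged.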

In [None]:
# Regex patterns to capture the entire value for each category to be aggregated
full_or_part_regex = '(?i)(?=full.*part).*|(?=part.*full).*|both.*' # 194 cases, both full and part
full_time_regex = '(?i)(full|long|permanent)(?!/).*' # 6,963 cases
part_time_regex = '(?i)(?<!/)part.*' # 251
contract_regex = '(?i).*(contract|freelance).*' # 49 cases, contract or freelance

In [26]:
# check_replacement(df.Term, full_or_part_regex)

In [None]:
# Actually saving value replacement (assign back instead of chained inplace)
df['Term'] = df['Term'].replace(regex={full_or_part_regex: 'Full/Part-time',
                                       full_time_regex: 'Full-time',
                                       part_time_regex: 'Part-time',
                                       contract_regex: 'Contract'})

term_categories_list = ['Full/Part-time', 'Full-time', 'Part-time', 'Contract']

# Rename everything else (including missing values) as 'Other'
df['Term'] = df['Term'].where(df['Term'].isin(term_categories_list), 'Other')
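
To see how the four patterns interact, here is a minimal sketch that applies them one after another to a handful of values taken from the `value_counts` listing. It uses plain `re.sub` rather than pandas, so it only approximates the replacement step, and the `categorize` helper is hypothetical. The lookarounds `(?!/)` and `(?<!/)` are what keep an already assigned 'Full/Part-time' label from being rewritten by the later patterns.

```python
import re

# Patterns copied from the notebook; they are applied in this order
replacements = [
    ('(?i)(?=full.*part).*|(?=part.*full).*|both.*', 'Full/Part-time'),
    ('(?i)(full|long|permanent)(?!/).*', 'Full-time'),
    ('(?i)(?<!/)part.*', 'Part-time'),
    ('(?i).*(contract|freelance).*', 'Contract'),
]
labels = [label for _, label in replacements]

def categorize(term):
    """Apply each pattern in turn, then map anything unmatched to 'Other'."""
    for pattern, label in replacements:
        term = re.sub(pattern, label, term)
    return term if term in labels else 'Other'

samples = ['Full time', 'Part-time or Full-time', 'Part time', 'Long term',
           'Service Contract', 'Night shift', 'Full time/ Part time']
for raw in samples:
    print(f'{raw!r:28} -> {categorize(raw)}')
```

The sketch handles strings only; missing values would need separate treatment before calling `categorize`.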

#### Test

In [50]:
term_new_categories = ['Full/Part-time', 'Full-time','Part-time','Contract', 'Other']
term_categories_df = df.Term.value_counts().keys().tolist()
assert all([(category in term_new_categories) for category in term_categories_df])



<a id='eda'></a>
# Exploratory Data Analysis



<a id='conclusions'></a>
# Conclusions
