**Author**: Marcello Victorino<br>
**Date**: 04/16 - 
<hr>

# Data Wrangling: Armenian Online Job Posting

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
# Introduction

The [dataset](https://www.kaggle.com/udacity/armenian-online-job-postings) consists of about 19,000 job postings from 2004 to 2015, described by 24 columns.

**Variables Description:**
1. **jobpost**: The original job post.<br>
+ **date**: The date it was posted in the group.<br>
+ **Title**: Job title.<br>
+ **Company**: Employer name. <br>
+ **AnnouncementCode**: Announcement code, which is some internal code and is usually missing.<br>
+ **Term**: Full-Time, Part-time, etc.<br>
+ **Eligibility**: Eligibility of the candidates.<br>
+ **Audience**: Who can apply? <br>
+ **StartDate**: Start date of work.<br>
+ **Duration**: Duration of the employment.<br>
+ **Location**: Employment location.<br>
+ **JobDescription**: Job Description.<br>
+ **JobRequirment**: Job requirements.<br>
+ **RequiredQual**: Required qualifications.<br>
+ **Salary**: Job salary.<br>
+ **ApplicationP**: Application procedure.<br>
+ **OpeningDate**: Opening date of the job announcement.<br>
+ **Deadline**: Deadline for the job announcement. <br>
+ **Notes**: Additional notes.<br>
+ **AboutC**: About the company.<br>
+ **Attach**: Attachments.<br>
+ **Year**: Year of the announcement (derived from the field date). <br>
+ **Month**: Month of the announcement (derived from the field date). <br>
+ **IT**: TRUE if the job is an IT job. This variable is created by a simple search of IT job titles within the "Title" column.<br>

The goal of this project is to wrangle the dataset: assess its quality and clean it before performing any EDA or modeling.

Let's import the necessary libraries and read the data into a dataframe:

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns


<a id='wrangling'></a>
# Data Wrangling

## Gather

In [2]:
df = pd.read_csv('online-job-postings.csv')
df.head(5)

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,...,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
0,AMERIA Investment Consulting Company\r\nJOB TI...,"Jan 5, 2004",Chief Financial Officer,AMERIA Investment Consulting Company,,,,,,,...,,"To apply for this position, please submit a\r\...",,26 January 2004,,,,2004,1,False
1,International Research & Exchanges Board (IREX...,"Jan 7, 2004",Full-time Community Connections Intern (paid i...,International Research & Exchanges Board (IREX),,,,,,3 months,...,,Please submit a cover letter and resume to:\r\...,,12 January 2004,,The International Research & Exchanges Board (...,,2004,1,False
2,Caucasus Environmental NGO Network (CENN)\r\nJ...,"Jan 7, 2004",Country Coordinator,Caucasus Environmental NGO Network (CENN),,,,,,Renewable annual contract\r\nPOSITION,...,,Please send resume or CV toursula.kazarian@......,,20 January 2004\r\nSTART DATE: February 2004,,The Caucasus Environmental NGO Network is a\r\...,,2004,1,False
3,Manoff Group\r\nJOB TITLE: BCC Specialist\r\n...,"Jan 7, 2004",BCC Specialist,Manoff Group,,,,,,,...,,Please send cover letter and resume to Amy\r\n...,,23 January 2004\r\nSTART DATE: Immediate,,,,2004,1,False
4,Yerevan Brandy Company\r\nJOB TITLE: Software...,"Jan 10, 2004",Software Developer,Yerevan Brandy Company,,,,,,,...,,Successful candidates should submit\r\n- CV; \...,,"20 January 2004, 18:00",,,,2004,1,True


## Assess

In [3]:
# Missing Values
df.isna().any()

jobpost             False
date                False
Title                True
Company              True
AnnouncementCode     True
Term                 True
Eligibility          True
Audience             True
StartDate            True
Duration             True
Location             True
JobDescription       True
JobRequirment        True
RequiredQual         True
Salary               True
ApplicationP         True
OpeningDate          True
Deadline             True
Notes                True
AboutC               True
Attach               True
Year                False
Month               False
IT                  False
dtype: bool

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19001 entries, 0 to 19000
Data columns (total 24 columns):
jobpost             19001 non-null object
date                19001 non-null object
Title               18973 non-null object
Company             18994 non-null object
AnnouncementCode    1208 non-null object
Term                7676 non-null object
Eligibility         4930 non-null object
Audience            640 non-null object
StartDate           9675 non-null object
Duration            10798 non-null object
Location            18969 non-null object
JobDescription      15109 non-null object
JobRequirment       16479 non-null object
RequiredQual        18517 non-null object
Salary              9622 non-null object
ApplicationP        18941 non-null object
OpeningDate         18295 non-null object
Deadline            18936 non-null object
Notes               2211 non-null object
AboutC              12470 non-null object
Attach              1559 non-null object
Year              

In [5]:
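# Inspect the distinct values of a variable and how often each occurs
var = df.Term
try:
    # Truncate long labels to 50 characters for readability
    for k, v in var.value_counts().items():
        print(f'{k[:50]} | {v}')
except TypeError:
    # Non-string labels (e.g. numbers) cannot be sliced
    print(var.value_counts())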

Full time | 5348
Full-time | 1078
Part time | 136
Full Time | 104
Long term | 102
Part-time or Full-time | 92
Part-time | 68
Permanent | 51
Long-term | 22
Full time salaried - 40 hours per week | 15
Full-Time | 14
Long Term | 13
ASAP | 13
Short term | 9
Up to 5 years | 9
Fixed term | 8
Full time or part time | 8
Part time/ Full time | 8
Full time/ Part time | 7
Part-time/ Full-time | 7
Part time or full time | 6
Long term, with 3 months probation period | 6
Full term | 6
Full-tim | 6
Long term with 3 months probation period | 6
Full time (8 working hours per day). | 6
Service Contract | 5
Full time, flex time | 5
Part-time or full-time | 5
Full time (40 hrs/week) | 5
Short-term | 5
Full time, 40 hours/week | 5
Morning, night and afternoon shifts | 4
Contract based | 4
50 hrs per week | 4
Contract | 4
Full-time; 40 hours/week | 4
According to current curricula | 4
Full time (6 days, 9:00-18:00, 18:00-24:00) | 4
Night shift | 4
Part-time (Full-time preferable) | 4
Full time/ Part time, f

## Assess

+ **Irrelevant Information:**  either brings no informative value or has few relevant groups
    - jobpost: original data scraped into columns
    - Eligibility
    - Audience
    - Notes
    - AboutC (company)
    - Attach
    - AnnouncementCode
    - JobDescription
    - JobRequirment
    - Salary: competitive!?
    - ApplicationP (process)


+ **Potentially Informative:**
    - RequiredQual: requires scraping for key-words identification and count
    - date: periods of time with spike in job posting?
    - Title: hottest professions
    - Company: Biggest hirers
    - Term: correlation between contract kind (Full vs Part time) and other variables (job, requirement, salary)
    - StartDate: planned hiring or necessity?
    - Duration: correlation between duration and other variables
    - Location: hottest places per sector
    - OpeningDate - Deadline: application window


+ **Redundant Values:**
    - Term: Full time, Full-time ...
    - Start Date: ASAP , Immediately, As soon as Possible ...
    - Duration: Long term, Long-term ...
    - Location: Yerevan, Armenia + description
    - Opening Date: formatting (blank space)
    - Deadline: Rolling, Rolling Groups, value + description
    - Title: requires aggregation (Software Developer, Java developer, Web developer...)


+ **Data Type:**
    - date: object -> datetime
    - OpeningDate: object -> datetime
    - Deadline: object -> datetime

+ **Columns name:**
    - standardize names to snake_case (lowercase, `_` instead of spaces or CamelCase)
    
    
+ **Missing values**: NaN present in 19 columns; reassess after the initial cleaning
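The column-name standardization listed in the assessment is not implemented in the cleaning steps that follow. A minimal sketch of how it could be done is shown here; the helper `to_snake_case` and the toy frame `demo` are hypothetical names introduced only for illustration.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Convert a CamelCase or space-separated column name to snake_case."""
    # Insert an underscore at each lowercase-to-uppercase boundary,
    # e.g. 'StartDate' -> 'Start_Date'
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name.strip())
    # Replace runs of whitespace with one underscore, then lowercase
    return re.sub(r'\s+', '_', name).lower()

# Toy frame using a few of the dataset's column names
demo = pd.DataFrame(columns=['jobpost', 'AnnouncementCode', 'StartDate', 'AboutC', 'IT'])
demo = demo.rename(columns=to_snake_case)
print(demo.columns.tolist())
```

Passing the function to `rename(columns=...)` applies it to every column label, so the same call could be used on the full dataframe once the cleaning code no longer relies on the original names.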

## Cleaning Plan

### Issue 1
#### Define
Remove irrelevant variables from dataframe: ['jobpost', 'Eligibility', 'Audience', 'Notes', 'AboutC', 'Attach', 'AnnouncementCode', 'JobDescription', 'JobRequirment', 'Salary', 'ApplicationP']

#### Code

In [6]:
drop_variables = ['jobpost', 'Eligibility', 'Audience', 'Notes', 'AboutC',
            'Attach', 'AnnouncementCode', 'JobDescription', 'JobRequirment',
            'Salary', 'ApplicationP']

df.drop(columns=drop_variables, inplace=True)

#### Test

In [7]:
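# Check each dropped name individually; testing the whole list against the columns always passes
assert not any(column in df.columns for column in drop_variables)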

### Issue 2
#### Define
Term: standardize values in representative groups | 'Full-time', 'Part-time', 'Full/Part-time', 'Contract', and 'Other'
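The order in which the patterns are applied matters: the combined category has to be detected before the single ones, and the lookarounds on `/` keep an already assigned `Full/Part-time` label from being relabeled. The sketch below illustrates this on a handful of raw values taken from the value counts above. It uses `Series.str.replace` in an explicit loop, which is an assumption made for clarity and not necessarily how the notebook applies the patterns.

```python
import pandas as pd

# Toy sample of raw Term values (plus one missing value)
term = pd.Series(['Full time', 'Full-tim', 'Part time/ Full time',
                  'Permanent', 'Part-time', 'Service Contract', 'ASAP', None])

# Patterns as defined in the notebook, applied in this order
patterns = [
    (r'(?i)(?=full.*part).*|(?=part.*full).*|both.*', 'Full/Part-time'),
    (r'(?i)(full|long|permanent)(?!/).*', 'Full-time'),
    (r'(?i)(?<!/)part.*', 'Part-time'),
    (r'(?i).*(contract|freelance).*', 'Contract'),
]

for pattern, label in patterns:
    term = term.str.replace(pattern, label, regex=True)

# Anything not mapped to a known label (including missing values) becomes 'Other'
labels = [label for _, label in patterns]
term = term.where(term.isin(labels), 'Other')
print(term.tolist())
```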

#### Code

In [8]:
# Verification of outcome before actually replacing the dataframe
def check_replacement(column, pattern):
    print(column.str.contains(pattern).sum())
    return column.replace(regex=pattern, value='Test', inplace=False).value_counts()

In [9]:
# Substitute line-breaks with underscore so regex can 'search' properly - entire dataframe
df.replace('\n','_', inplace=True, regex=True)

In [10]:
# Regex patterns to capture the entire value for each category to be aggregated
full_or_part_regex = '(?i)(?=full.*part).*|(?=part.*full).*|both.*' # 194 cases, both full and part
full_time_regex = '(?i)(full|long|permanent)(?!/).*' # 6,963 cases
part_time_regex = '(?i)(?<!/)part.*' # 251
contract_regex = '(?i).*(contract|freelance).*' # 49 cases, contract or freelance

In [11]:
# check_replacement(df.Term, full_or_part_regex)

In [12]:
# Actually saving value replacement
df.Term.replace(regex={full_or_part_regex: 'Full/Part-time',
                      full_time_regex: 'Full-time',
                       part_time_regex: 'Part-time',
                      contract_regex: 'Contract'}, inplace=True)

term_categories_list =  ['Full/Part-time', 'Full-time','Part-time', 'Contract']

# Rename everything else as 'Other'
df.Term.where(np.isin(df.Term, term_categories_list) ,'Other', inplace=True)

#### Test

In [13]:
# Visually check if category aggregation worked
df.Term.value_counts()

Other             11544
Full-time          6963
Part-time           251
Full/Part-time      194
Contract             49
Name: Term, dtype: int64

In [14]:
term_new_categories = ['Full/Part-time', 'Full-time','Part-time','Contract', 'Other']
term_categories_df = df.Term.value_counts().keys().tolist()
assert all([(category in term_new_categories) for category in term_categories_df])

### Issue 3
#### Define
StartDate: standardize values in representative groups | 'ASAP' and 'Other'

#### Code

In [15]:
# asap_regex = '(?i).*(asap|immediat*|soon|upon hiring).*'
# check_replacement(df.StartDate, asap_regex)

In [16]:
asap_regex = '(?i).*(asap|immediat*|soon|upon hiring).*'

# Actually saving value replacement
df.StartDate.replace(regex={asap_regex: 'ASAP'}, inplace=True)

# Rename everything else as 'Other'
df.StartDate.where(df.StartDate == 'ASAP', 'Other', inplace=True)

In [17]:
df.StartDate.value_counts()

Other    12148
ASAP      6853
Name: StartDate, dtype: int64

#### Test

In [18]:
StartDate_categories_new = ['ASAP', 'Other']
StartDate_categories_df = df.StartDate.value_counts().keys().tolist()

assert all([category in StartDate_categories_new for category in StartDate_categories_df])

### Issue 4
#### Define
Duration: standardize values in representative groups | 'Permanent', 'Up to 1 Year', '2-5 years', 'Other'

In [19]:
# mid_regex = '(?i).*([2-5].*year*).*'
# check_replacement(df.Duration, mid_regex)

In [20]:
permanent_regex = '(?i).*(long|perma*|open|indef*|term.*less|unlimited).*'
short_regex = '(?i).*(short|month*|1.*year|one.*year|week*|day*|temp*).*'
mid_regex = '(?i).*([2-5].*year*|(two|three|four|five) years).*' # 270 cases

# Actually saving value replacement
df.Duration.replace(regex={permanent_regex: 'Permanent',
                           short_regex: 'Up to 1 Year',
                          mid_regex: '2-5 years'}, inplace=True)

In [21]:
# Rename everything else as 'Other'
duration_categories_list = ['Permanent', 'Up to 1 Year', '2-5 years']
df.Duration.where(np.isin(df.Duration, duration_categories_list), 'Other', inplace=True)

In [22]:
df.Duration.value_counts()

Other           8556
Permanent       8381
Up to 1 Year    1848
2-5 years        216
Name: Duration, dtype: int64

#### Test

In [23]:
Duration_categories_new = ['Permanent', 'Up to 1 Year', '2-5 years', 'Other']
Duration_categories_df = df.Duration.value_counts().keys().tolist()

assert all([category in Duration_categories_new for category in Duration_categories_df])

### Issue 5
#### Define
Location: standardize values in representative groups | Main cities

In [24]:
# description_regex = '(?i).*(description).*'
# check_replacement(df.Location, description_regex)

In [25]:
yerevan_regex = '(?i).*(yerevan).*' # Capital city of Armenia
description_regex = '(?i).*(description|location|position|detail*).*' # Bad entries with full description

# Actually saving value replacement
df.Location.replace(regex={yerevan_regex: 'Yerevan, Armenia',
                          description_regex: 'Other'}, inplace=True)

In [26]:
# Replace all that do not have "city, country" pattern as "Other"
# (non-capturing group: str.contains warns when the pattern has capture groups)
general_cities_regex = '(?i).*(?:[a-z]*,.*[a-z]).*'
general_city_pattern = df.Location.str.contains(general_cities_regex, na=False)

df.Location.where(general_city_pattern, 'Other', inplace=True)


#### Test

In [27]:
# Visually assessing categories
df.Location.value_counts()

Yerevan, Armenia                                                                                                            17581
Other                                                                                                                         152
Gyumri, Armenia                                                                                                               102
Abovyan, Armenia                                                                                                               95
Tbilisi, Georgia                                                                                                               74
Vanadzor, Armenia                                                                                                              72
Kapan, Armenia                                                                                                                 68
Dilijan, Armenia                                                                          

### Issue 6
#### Define
OpeningDate: 
+ fix formatting | extra blank space and special character
+ Then convert datatype to Datetime
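As a rough illustration of these two steps, the sketch below normalizes whitespace and then converts to datetime on a few made-up values. The sample strings and the explicit `format='%d %B %Y'` are assumptions based on the `26 January 2004` style seen in the data preview, not a description of every value in the column.

```python
import pandas as pd

# Toy values imitating the formatting problems (assumed examples)
opening = pd.Series(['26  January 2004', ' 12 January 2004 ', 'Immediately', None])

# Collapse any run of whitespace to a single space and trim both ends
opening = opening.str.replace(r'\s+', ' ', regex=True).str.strip()

# An explicit format keeps parsing predictable;
# unparseable or missing values become NaT
opening = pd.to_datetime(opening, format='%d %B %Y', errors='coerce')
print(opening)
```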

#### Code

In [28]:
# Fixing double spacing and special character
df.OpeningDate.replace(regex={'\r_': ' ', '  ':' '}, inplace=True)

In [29]:
# Convert to datetime; errors='coerce' turns missing or unparseable values into NaT
df.OpeningDate = pd.to_datetime(df.OpeningDate, errors='coerce')

#### Test

In [30]:
assert df.OpeningDate.dtype == np.dtype('datetime64[ns]')

### Issue 7
#### Define
Deadline: 
+ Remove irrelevant text to clean Date Format
+ Aggregate text values that are not dates (e.g. "open" or "rolling")
+ Then convert datatype to Datetime, setting non date as NaT
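One possible alternative to deleting the trailing text is to extract only the leading date token. The sketch below shows this on a few values modeled on the data preview. The sample values and the `day Month year` format are assumptions, and this variant discards any time of day such as `18:00`.

```python
import pandas as pd

deadline = pd.Series(['26 January 2004',
                      '20 January 2004\r_START DATE: February 2004',
                      '20 January 2004, 18:00',
                      'Rolling'])

# Keep only a leading 'day Month year' token; rows without one become NaN
date_part = deadline.str.extract(r'^\s*(\d{1,2} [A-Za-z]+ \d{4})', expand=False)

# Non-date entries such as 'Rolling' end up as NaT
deadline_clean = pd.to_datetime(date_part, format='%d %B %Y', errors='coerce')
print(deadline_clean)
```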

In [31]:
# not_date_regex = '(?i)(open|roll*).*'
# check_replacement(df.Deadline, not_date_regex)

In [32]:
description_regex = '(?i)(\r_|close|cob|cet).*' # general text mixed with date to be removed
not_date_regex = '(?i)(open|roll*|for).*'

# Actually saving value replacement
df.Deadline.replace(regex={description_regex: '',
                          not_date_regex: 'Other'}, inplace=True)

In [33]:
# Convert to datetime; errors='coerce' turns missing or non-date values into NaT
df.Deadline = pd.to_datetime(df.Deadline, errors='coerce')

#### Test

In [34]:
assert df.Deadline.dtype == np.dtype('datetime64[ns]')
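With both OpeningDate and Deadline stored as datetime, the application window mentioned in the assessment can be derived by subtraction. The following is a minimal sketch on a toy frame; `jobs`, `window_days` and `suspicious` are hypothetical names, and the dates are illustrative.

```python
import pandas as pd

# Toy frame; in the notebook both columns are already datetime at this point
jobs = pd.DataFrame({
    'OpeningDate': pd.to_datetime(['2004-06-03', '2004-06-03', None]),
    'Deadline': pd.to_datetime(['2004-06-18', '2004-06-01', '2004-06-11']),
})

# Subtracting datetime columns yields timedeltas; .dt.days gives whole days
# (NaN where either date is missing)
jobs['window_days'] = (jobs['Deadline'] - jobs['OpeningDate']).dt.days

# A negative window suggests a data-entry problem worth a closer look
suspicious = jobs[jobs['window_days'] < 0]
print(jobs)
```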

### Issue 8
#### Define
Title: aggregate into new categories | ..............
+ Remove any educational offerings (not job related)

In [35]:
# regex = '(?i).*(developer|software|QA|programmer).*'
# check_replacement(df.Title, regex)

In [36]:
finance_regex = '(?i).*(account*|finan*|audit*|loan|credit|contract).*'
not_job_regex = '(?i).*(course*).*'
developer_regex = '(?i).*(developer|software|QA|programmer).*'
administrative_regex = '(?i).*(office|project|administr*|manag*|leader|superv*).*'
marketing_regex = '(?i).*(market*|sale*|brand|merchand*|commercial|PR|advert*).*'
designer_regex = '(?i).*(design*|user experience).*'
receptionist_regex = '(?i).*(reception*|assistant|secret*).*'
legal_regex = '(?i).*(legal|law*).*'
engineer_regex = '(?i).*(engin*).*'
analyst_regex = '(?i).*(analy*|business consultant).*'
medical_regex = '(?i).*(medic*|pharmac*|nurs*).*'
it_regex = '(?i).*(it|technical support).*'
high_management_regex = '(?i).*(director|chief).*'
support_regex = '(?i).*(hr|human|procurement|customer|call center).*'

# Actually saving value replacement
df.Title.replace(regex={finance_regex: 'Finance',not_job_regex: 'remove',
                        developer_regex: 'Developer',administrative_regex: 'Administrative',
                       marketing_regex: 'Marketing/Sales',designer_regex: 'Designer',
                       receptionist_regex: 'Receptionist/Assistant', legal_regex: 'Legal',
                       engineer_regex: 'Misc. Engineer', analyst_regex: 'Analyst',
                       medical_regex: 'Healthcare', it_regex: 'IT',
                       high_management_regex: 'Higher Management',
                       support_regex: 'Supporting Services'}, inplace=True)

In [37]:
# Remove entries offering Course
df.drop(index=df.query('Title == "remove"').index, inplace=True)

In [38]:
# Rename everything else as 'Other'
title_categories_list = ['Finance', 'Developer', 'Marketing/Sales', 'Receptionist/Assistant',
                        'Misc. Engineer', 'Healthcare', 'Higher Management', 'Supporting Services']
df.Title.where(np.isin(df.Title, title_categories_list), 'Other', inplace=True)

In [39]:
df.Title.value_counts()

Other                     9337
Developer                 2956
Finance                   2376
Marketing/Sales           2279
Misc. Engineer             778
Receptionist/Assistant     565
Higher Management          253
Supporting Services        176
Healthcare                 104
Name: Title, dtype: int64
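Short case-insensitive tokens in the title patterns, such as `it`, `PR` or `hr`, can also match inside longer words. The sketch below compares an unanchored pattern with a word-boundary version on made-up titles; the titles are assumptions for illustration, and whether a given posting is affected also depends on which earlier pattern matched it first.

```python
import pandas as pd

titles = pd.Series(['IT Specialist', 'Credit Officer', 'Senior Auditor', 'Waiter'])

# Unanchored and case-insensitive: 'it' is found inside
# 'Credit', 'Auditor' and 'Waiter' as well
loose = titles.str.contains(r'(?i)it')

# \b restricts the match to the standalone token 'IT'
strict = titles.str.contains(r'(?i)\bit\b')

print(pd.DataFrame({'title': titles, 'loose': loose, 'strict': strict}))
```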

#### Test

In [40]:
Title_categories_new = ['Finance', 'Developer', 'Marketing/Sales', 'Receptionist/Assistant',
                        'Misc. Engineer', 'Healthcare', 'Higher Management', 'Supporting Services',
                       'Other']
Title_categories_df = df.Title.value_counts().keys().tolist()
assert all([(category in Title_categories_new) for category in Title_categories_df])

### Issue 9
#### Define
Data type: set date from object to Datetime

In [41]:
df2 = df.copy()

In [42]:
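# Dry run on a copy: coerce unparseable dates to NaT, then list the affected rows
df2.date = pd.to_datetime(df2.date, errors='coerce')
n_unparsed = len(df2) - df2.date.count()  # number of dates that failed to parse
df[df2.date.isnull()]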

Unnamed: 0,date,Title,Company,Term,StartDate,Duration,Location,RequiredQual,OpeningDate,Deadline,Year,Month,IT
311,Jun 1 10:13 PM,Other,World Vision Armenin,Other,Other,Other,"Yerevan, Armenia","- A university degree in HR, social sciences o...",NaT,2004-06-11 00:00:00,2004,6,False
313,Jun 3 11:31 AM,Other,IREX Armenia,Other,Other,Other,"Yerevan, Armenia",- Must be a graduate or a last year student \r...,NaT,2004-07-15 00:00:00,2004,6,False
314,Jun 3 11:36 AM,Other,World Vision Armenia,Other,ASAP,Other,"Yerevan, Armenia","The following knowledge, skills and abilities\...",2004-06-03,2004-06-18 00:00:00,2004,6,False
315,Jun 3 11:37 AM,Other,World Vision Armenia,Other,ASAP,Other,"Yerevan, Armenia",General: \r_The successful candidate will poss...,2004-06-03,2004-06-18 00:00:00,2004,6,False
316,Jun 3 10:23 PM,Other,Valetta Ltd,Other,Other,Other,"Yerevan, Armenia",- A university degree preferably in economics ...,NaT,2004-06-10 00:00:00,2004,6,False
317,Jun 4 11:37 PM,Other,Valetta Ltd,Other,ASAP,Other,"Yerevan, Armenia","- A University degree preferably in Economics,...",NaT,2004-06-11 00:00:00,2004,6,False
318,Jun 7 4:59 AM,Developer,B & S Ltd,Other,Other,Other,"Yerevan, Armenia","- Languages: C/C++, JAVA, C-Sharp, Visual Basi...",NaT,2004-07-15 00:00:00,2004,6,True
319,Jun 7 10:57 PM,Other,"DiOr (Design of Interiors, Offices and Rosariums)",Other,Other,Other,"Yerevan, Armenia","Higher education, let the job be either hobby\...",NaT,2004-07-01 00:00:00,2004,6,False
320,Jun 7 10:38 PM,Other,Training and Development Ltd.,Other,Other,Other,"Yerevan, Armenia",- Demonstrated experience creating presentatio...,2001-07-01,2004-06-21 00:00:00,2004,6,False
321,Jun 7 11:53 PM,Developer,Boomerang Software LLC,Other,Other,Other,"Yerevan, Armenia","- Excellent proficiency in ASP, C#, ASP.NET, J...",NaT,2004-06-18 00:00:00,2004,6,True
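The rows listed above carry a date without a year, so a plain conversion turns them into NaT. Since the Year column is available, one possible way to recover them is a second parsing pass that appends the year. This is a sketch on toy data under the assumption that the failing values all follow the `Jun 1 10:13 PM` pattern; `jobs`, `first_pass` and `second_pass` are hypothetical names.

```python
import pandas as pd

jobs = pd.DataFrame({
    'date': ['Jan 5, 2004', 'Jun 1 10:13 PM', 'Jun 7 4:59 AM'],
    'Year': [2004, 2004, 2004],
})

# First pass: the regular 'Mon D, YYYY' format; year-less rows become NaT
first_pass = pd.to_datetime(jobs['date'], format='%b %d, %Y', errors='coerce')

# Second pass: append the Year column and parse the 'Mon D H:MM AM/PM YYYY' form;
# rows in the regular format fail here and become NaT
with_year = jobs['date'] + ' ' + jobs['Year'].astype(str)
second_pass = pd.to_datetime(with_year, format='%b %d %I:%M %p %Y', errors='coerce')

# Combine both passes and drop the time of day for a uniform granularity
jobs['date'] = first_pass.fillna(second_pass).dt.normalize()
print(jobs)
```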


In [50]:
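# Unparseable values (such as the year-less entries above) become NaT;
# infer_datetime_format is omitted because it is deprecated since pandas 2.0
df.date = pd.to_datetime(df.date, errors='coerce')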

#### Test

In [55]:
assert df.date.dtype == np.dtype('datetime64[ns]')


<a id='conclusions'></a>
## Conclusions
