## This exercise outlines the entire data wrangling process

### Gather

In [2]:
import zipfile
import pandas as pd

In [3]:
# Extract all contents from zip file
with zipfile.ZipFile('archive.zip', 'r') as myzip:
    myzip.extractall()

FileNotFoundError: [Errno 2] No such file or directory: 'archive.zip'

In [None]:
# Read CSV (comma-separated) file into DataFrame
df = pd.read_csv('online-job-postings.csv')

### Assess

#### Visual assessment

In [None]:
df

##### Issues
- missing values (NaN)
- start date inconsistencies

#### Programmatic Assessment

In [None]:
df.info()

In [None]:
df.dtypes

In [None]:
df.value_counts()

- Nondescriptive column headers (ApplicationP, AboutC, RequiredQual ...) and a misspelled one (JobRequirment) need to be fixed

In [None]:
df.StartDate.unique()

#### More problems with our dataset
The year and month data are represented more than once. For the data to be tidy, each posting should store one day, one month, and one year, so that a date can be updated in a single place.
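
As a rough illustration of keeping the date in one place, the sketch below stores a single datetime column and derives the year and month from it on demand. The `postings` frame, its `date` column name, and its values are assumptions made up for this example, not the actual dataset.

```python
import pandas as pd

# Hypothetical postings with one date column; names and values are made up.
postings = pd.DataFrame({'Title': ['Accountant', 'Developer'],
                         'date': ['2004-01-05', '2005-02-10']})
postings['date'] = pd.to_datetime(postings['date'])

# Year and month are derived when needed rather than stored as extra columns.
year = postings['date'].dt.year
month = postings['date'].dt.month
print(year.tolist(), month.tolist())
```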

There are two types of observational units in this dataset: job posting data and company data. To make this tidy, we would have two tables: a job posting table with everything except AboutC, and a second company table with only the company column and an About Company column.
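
One possible way to split the two observational units is sketched below on a toy frame. The `Company` column name and all values are assumptions for illustration; only `AboutC` is taken from the dataset's headers.

```python
import pandas as pd

# Toy stand-in for the job postings data; values are made up for illustration.
postings = pd.DataFrame({
    'Title': ['Accountant', 'Developer', 'Analyst'],
    'Company': ['Acme', 'Globex', 'Acme'],
    'AboutC': ['Acme makes anvils.', 'Globex builds software.', 'Acme makes anvils.'],
})

# Company table: one row per company, keeping only company-level columns.
company = (postings[['Company', 'AboutC']]
           .drop_duplicates()
           .rename(columns={'AboutC': 'AboutCompany'})
           .reset_index(drop=True))

# Job posting table: everything except the company description.
job_postings = postings.drop(columns='AboutC')
```

The `Company` column stays in both tables so that the two can be joined again when needed.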

### Clean

#### Define
- Select all records in the StartDate column that are variants of "as soon as possible", and replace them with "ASAP"

- Select all nondescriptive and misspelled column headers, and replace them with full, more descriptive words.

> We will not bother with replacing the NaN values because of the nature of the dataset.

> We don't want spaces in the column headers, because columns whose names contain spaces can't be accessed using dot notation in pandas.
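
A small sketch of why spaces get in the way of dot notation, using a made-up one-row frame (the `'Start Date'` header is hypothetical and only there for contrast):

```python
import pandas as pd

# Made-up frame with one spaced header and one header without spaces.
demo = pd.DataFrame({'Start Date': ['ASAP'], 'StartDate': ['ASAP']})

# A header without spaces is a valid Python identifier, so dot notation works.
print(demo.StartDate.iloc[0])

# A header with a space is not a valid identifier (demo.Start Date is a
# syntax error), so it is only reachable through bracket notation.
print('Start Date'.isidentifier())
print(demo['Start Date'].iloc[0])
```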

#### Programmatic cleaning

#### Issue 1

#### Define
- Select all nondescriptive and misspelled column headers, and replace them with full, more descriptive words.

#### Code

In [None]:
df_clean = df.copy()

In [None]:
# Rename on the copy so that the original df stays untouched
df_clean = df_clean.rename(columns = {'ApplicationP' : 'ApplicationProcedure',
                                      'AboutC' : 'AboutCompany',
                                      'RequiredQual' : 'RequiredQualifications',
                                      'JobRequirment' : 'JobRequirement'})

#### Test

In [None]:
df_clean.info()

#### Issue 2

#### Define

- Select all records in the StartDate column that are variants of "as soon as possible", and replace them with "ASAP"

#### Code

In [None]:
df_clean.StartDate.unique()

In [None]:
asap_list = ['Immediately', 'As soon as possible', 'Upon hiring',
             'Immediate', 'Immediate employment', 'As soon as possible.', 'Immediate job opportunity',
             '"Immediate employment, after passing the interview."',
             'ASAP preferred', 'Employment contract signature date',
             'Immediate employment opportunity', 'Immidiately', 'ASA',
             'Asap', '"The position is open immediately but has a flexible start date depending on the candidates earliest availability."',
             'Immediately upon agreement', '20 November 2014 or ASAP',
             'immediately', 'Immediatelly',
             '"Immediately upon selection or no later than November 15, 2009."',
             'Immediate job opening', 'Immediate hiring', 'Upon selection',
             'As soon as practical', 'Immadiate', 'As soon as posible',
             'Immediately with 2 months probation period',
             '12 November 2012 or ASAP', 'Immediate employment after passing the interview',
             'Immediately/ upon agreement', '01 September 2014 or ASAP',
             'Immediately or as per agreement', 'as soon as possible',
             'As soon as Possible', 'in the nearest future', 'immediate',
             '01 April 2014 or ASAP', 'Immidiatly', 'Urgent',
             'Immediate or earliest possible', 'Immediate hire',
             'Earliest  possible', 'ASAP with 3 months probation period.',
             'Immediate employment opportunity.', 'Immediate employment.',
             'Immidietly', 'Imminent', 'September 2014 or ASAP', 'Imediately']

In [None]:
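# Replace every variant in one call and assign the result back to the column;
# replace(..., inplace=True) on df_clean.StartDate may leave df_clean unchanged in recent pandas versions
df_clean['StartDate'] = df_clean['StartDate'].replace(asap_list, 'ASAP')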

#### Test

In [None]:
df_clean.StartDate.value_counts()

In [None]:
for phrase in asap_list:
    assert phrase not in df_clean.StartDate.values
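
The same test could also be written without a loop. The sketch below uses `Series.isin` on a made-up miniature of the column; `sample_phrases` and `start_dates` are hypothetical stand-ins for `asap_list` and `df_clean.StartDate`.

```python
import pandas as pd

# Hypothetical miniature of the phrase list and the cleaned column.
sample_phrases = ['Immediately', 'As soon as possible']
start_dates = pd.Series(['ASAP', '01 March 2015', None, 'ASAP'])

# isin checks every row against the whole list in one pass.
leftover = start_dates[start_dates.isin(sample_phrases)]
assert leftover.empty
```

If the assertion fails, `leftover` holds the rows that still need cleaning, which is handy for debugging.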

## Extra Credit

## Analysis and Visualization

In [None]:
# Number of ASAP start dates 
asap_counts = df_clean.StartDate.value_counts()['ASAP']
asap_counts

In [None]:
non_empty_counts = df_clean.StartDate.count()
non_empty_counts

In [None]:
asap_counts/non_empty_counts
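
As a cross-check, the same proportion can be read from `value_counts(normalize=True)`, which divides each count by the number of non-null entries. The series below is made up for illustration and is not the real column.

```python
import pandas as pd

# Made-up start dates, including one missing value.
start_dates = pd.Series(['ASAP', 'ASAP', '01 March 2015', None])

# Missing values are dropped by default, so shares are relative to non-null entries.
shares = start_dates.value_counts(normalize=True)
asap_share = shares['ASAP']

# Manual version, mirroring the two cells above.
manual_share = start_dates.value_counts()['ASAP'] / start_dates.count()
```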

In [None]:
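%matplotlib inline
import numpy as np
# One blank label per distinct start date, so the many small slices stay unlabeled
labels = np.full(len(df_clean.StartDate.value_counts()), "", dtype=object)
# value_counts() sorts by frequency, so the first slice is ASAP only if ASAP is the most common value
labels[0] = 'ASAP'
df_clean.StartDate.value_counts().plot(kind="pie", labels=labels)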