# Data Cleaning (Poly)

# Table of Contents

- ## [Data Cleaning](./Data_Cleaning.ipynb)
    - ### [Import Libraries](./Data_Cleaning.ipynb#Import-Libraries)
    - ### [Import Data](./Data_Cleaning.ipynb#Import-Data)
    - ### [Clean the "Average Salary" Column](./Data_Cleaning.ipynb#Clean-the-Average-Salary-Column)
    - ### [Create Custom Stop Words](./Data_Cleaning.ipynb#Create-Custom-Stop-Words)
    - ### [Prepare words to be vectorized](./Data_Cleaning.ipynb#Tokenize%2C-Remove-Stop-Words%2C-Remove-Punctuation%2C-Lemmatize)
    - ### [Vectorize Word Data](#Vectorize-Word-Data)
- ## [Modeling](./Models.ipynb)
    - ### [Import Libraries](./Models.ipynb#Import-Libraries)
    - ### [Models](./Models.ipynb#Models)
      - #### [Linear Regression](./Models.ipynb#Linear-Regression)
      - #### [Lasso](./Models.ipynb#Lasso)
      - #### [Ridge](./Models.ipynb#Ridge)
      - #### [Random Forest Regressor](./Models.ipynb#Random-Forest-Regressor)
      - #### [Gradient Boost Regressor](./Models.ipynb#Gradient-Boost-Regressor)
      - #### [Neural Network](./Models.ipynb#Neural-Network)

### Import Libraries

In [69]:
import pandas as pd
import numpy as np
import pickle

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer, HashingVectorizer, ENGLISH_STOP_WORDS
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA

import nltk
from nltk import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

import clean as cl
import matplotlib.pyplot as plt
%matplotlib inline

### Import Data

In [48]:
all_jobs_df = pd.read_csv('../../get_data/Jobs/full_jobs_df.csv')


### Clean the Average Salary Column

In [49]:
all_jobs_df['avg_salary']=[cl.avg(i) for i in all_jobs_df.salary]
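
The `cl.avg` helper lives in the local `clean` module, which is not shown in this notebook. As a rough illustration only, a helper of this kind might average the dollar figures found in a salary string. The function name `avg_salary`, the input format (`"$80,000 - $100,000 a year"`), and the handling of missing values are all assumptions, not the actual implementation.

```python
import re
from typing import Optional


def avg_salary(raw) -> Optional[float]:
    """Hypothetical stand-in for cl.avg: mean of the numbers in a salary string."""
    if not isinstance(raw, str):
        return None  # NaN or other missing values carry no salary
    # Pull out figures such as "80,000" or "52.50" and drop thousands separators
    numbers = [float(n.replace(",", ""))
               for n in re.findall(r"\d[\d,]*(?:\.\d+)?", raw)]
    if not numbers:
        return None
    return sum(numbers) / len(numbers)


example = avg_salary("$80,000 - $100,000 a year")
```

This sketch treats every figure as an annual amount; hourly or monthly rates would need their own conversion.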

##### Drop all rows with Null values in the Salary Column

In [50]:
df = all_jobs_df.dropna(subset = ['salary'])

In [51]:
df.reset_index(drop = True,inplace=True)

### Create Custom Stop Words

In [52]:
my_list_of_stop_words = {'york'}

In [53]:
def add_stop_words(word):
    my_list_of_stop_words.add(word)

In [54]:
for i in ENGLISH_STOP_WORDS:
    add_stop_words(i)

In [55]:
for i in ['york','religion','identity','sexual','orientation','veteran','ll',
          'status','equal','national','gender','expression','real','affirmative',
          'race', 'color','age', 'belief', 'chance', 'disability', 'ethnic',
          'fair', 'lyft', 'nationality',
          'ordinance', 'ordinances', 'origin', 'policy',
          'prohibited', 'pursuant', 'sex','kind','conviction']:
    add_stop_words(i)

In [56]:
basic_extras = '''is an Equal Employment Opportunity that proudly 
and hires a diverse does not make hiring or   on the basis of race, color, religion or religious belief, 
ethnic or national origin, nationality, sex, gender, gender-identity, 
sexual orientation, disability, age, military or veteran status, or any 
other basis protected by local, state, or federal laws or 
prohibited by policy. also strives for a healthy and safe 
and strictly prohibits harassment of any kind Pursuant to the 
Fair Chance Ordinance and other similar state laws and local 
ordinances, and its policy, will also consider for applicants with arrest and
'''

In [57]:
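# split() with no argument also splits on newlines and drops empty strings;
# strip('.,') removes punctuation attached to tokens such as "race," and "policy."
for i in basic_extras.split():
    add_stop_words(i.lower().strip('.,'))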
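
A note on tokenizing a multi-line string like `basic_extras`: `str.split(" ")` and `str.split()` behave differently, and tokens keep any attached punctuation unless it is stripped. The short sample string below is made up for illustration.

```python
import string

sample = "race, color,\nreligion or  belief."

# Splitting on a single space keeps newlines inside tokens and yields '' for double spaces
naive = sample.split(" ")

# Splitting on any whitespace, then stripping punctuation, gives clean lowercase words
cleaned = [w.strip(string.punctuation).lower() for w in sample.split()]
cleaned = [w for w in cleaned if w]  # drop tokens that were punctuation only
```

Here `naive` contains entries such as `'color,\nreligion'` and `''`, neither of which will ever match a token produced by a vectorizer, while `cleaned` holds only plain words.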

In [58]:
pickle.dump( my_list_of_stop_words, open( "../models/word cleaning/custom_stop_words.pkl", "wb" ) )

### Tokenize, Remove Stop Words, Remove Punctuation, Lemmatize

In [59]:
corpus_cleaned = [cl.token_stop_lemm(i,my_list_of_stop_words) for i in df.body]

In [60]:
X = corpus_cleaned
pickle.dump( X, open( "../corpus_for_skills.pkl", "wb" ) )

## Vectorize Word Data

In [61]:
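# Newer scikit-learn releases require stop_words to be a list rather than a set
cvec = TfidfVectorizer(stop_words=list(my_list_of_stop_words), ngram_range=(1, 3), min_df=.15, max_df=.9)
vectors = cvec.fit_transform(X)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
word_df = pd.DataFrame(data=vectors.toarray(), columns=cvec.get_feature_names_out())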

pickle.dump( cvec, open( "../models/word cleaning/body.pkl", "wb" ) )
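
When `min_df` and `max_df` are floats, the vectorizer reads them as proportions of documents: a term is kept only if it appears in at least `min_df` and at most `max_df` of the corpus. The sketch below reproduces that filter in plain Python on a made-up toy corpus; the function name `filter_by_doc_freq` is hypothetical and only mirrors the idea, not scikit-learn's internals.

```python
from collections import Counter


def filter_by_doc_freq(docs, min_df, max_df):
    """Keep terms whose document frequency lies within [min_df, max_df] (as proportions)."""
    n_docs = len(docs)
    # Count each term once per document, however often it repeats inside the document
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return sorted(term for term, count in doc_freq.items()
                  if min_df * n_docs <= count <= max_df * n_docs)


toy_docs = [
    ["python", "sql"],
    ["python", "spark"],
    ["python", "sql", "excel"],
    ["python", "tableau"],
]
kept = filter_by_doc_freq(toy_docs, min_df=0.5, max_df=0.9)
```

In this toy corpus `python` appears in every document and is dropped by `max_df`, the single-document terms are dropped by `min_df`, and only `sql` survives.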

In [62]:
cvec_titles = TfidfVectorizer(stop_words=list(my_list_of_stop_words), min_df=.05, ngram_range=(1, 2))
vectors_titles = cvec_titles.fit_transform(df['title'])

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
title_df = pd.DataFrame(data=vectors_titles.toarray(), columns=cvec_titles.get_feature_names_out())

pickle.dump( cvec_titles, open( "../models/word cleaning/title.pkl", "wb" ) )

In [63]:
cvec_location = TfidfVectorizer(stop_words=list(my_list_of_stop_words), min_df=.05, ngram_range=(2, 4))
vectors_location = cvec_location.fit_transform(df['location'])

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
location_df = pd.DataFrame(data=vectors_location.toarray(), columns=cvec_location.get_feature_names_out())

pickle.dump( cvec_location, open( "../models/word cleaning/location.pkl", "wb" ) )

In [64]:
full_df = word_df
full_df = pd.merge(full_df,title_df,how = 'outer', left_index=True,right_index=True)
full_df = pd.merge(full_df,location_df,how = 'outer', left_index=True,right_index=True)

## Feature Engineering

In [65]:
poly = PolynomialFeatures(degree = 2)

X = poly.fit_transform(full_df)

pickle.dump( poly, open( "../models/poly/poly_features.pkl", "wb" ) )
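
With its default settings, `PolynomialFeatures(degree = 2)` adds a bias column, every original column, every squared column, and every pairwise product. The sketch below lists those terms for three made-up column names and counts them, which shows how quickly the feature matrix grows. The helper `poly_terms` is hypothetical and written only for this illustration.

```python
from itertools import combinations_with_replacement
from math import comb


def poly_terms(names, degree=2):
    """List the terms of a full polynomial expansion, bias term included."""
    terms = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append(" * ".join(combo) if combo else "1")
    return terms


terms = poly_terms(["a", "b", "c"])

# For n input columns and degree 2 the term count is C(n + 2, 2)
count_for_100_columns = comb(100 + 2, 2)
```

Three columns expand to 10 terms, and 100 columns would expand to 5,151, which is one reason a dimensionality-reduction step is useful afterwards.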



## Feature Selection

In [66]:
pca = PCA(.95)

X_pca = pca.fit_transform(X)

pickle.dump( pca, open( "../models/poly/pca.pkl", "wb" ),protocol=-1)
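
Passing a float such as `.95` to `PCA` asks scikit-learn to keep the smallest number of components whose cumulative explained variance exceeds that share. The plain-Python sketch below shows that selection rule on made-up per-component variances; `components_needed` is a hypothetical helper, not part of scikit-learn.

```python
from itertools import accumulate


def components_needed(variances, target):
    """Smallest k such that the first k components explain more than `target` of the variance."""
    total = sum(variances)
    for k, running in enumerate(accumulate(variances), start=1):
        if running / total > target:
            return k
    return len(variances)


# Hypothetical variances, already sorted from largest to smallest
toy_variances = [6.0, 3.0, 0.7, 0.2, 0.1]
k = components_needed(toy_variances, 0.95)
```

The first two components explain 90% of the variance here, so a third is needed to pass the 95% threshold. On the fitted object, `pca.n_components_` reports the number actually kept.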

## Set X and y Variables
- Because y is a dollar value, we model its natural log rather than the raw salary
- To turn a prediction back into a salary, we compute $e ^ n$, where $n$ is the value predicted by our model
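
The round trip between dollars and log-dollars can be sketched as follows. The salary figures are made up, and the standard-library `math` module is used here in place of `np.log` and `np.exp` to keep the example dependency-free.

```python
import math

# Hypothetical salaries in dollars
salaries = [60000.0, 90000.0, 150000.0]

# Training target: natural log of each salary
log_targets = [math.log(s) for s in salaries]

# Back-transform: e ** n recovers the dollar value from a log-scale prediction n
recovered = [math.exp(n) for n in log_targets]
```

Errors on the log scale correspond to relative (percentage) errors in dollars, so a fixed log-scale error means a larger dollar error for high salaries than for low ones.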

In [67]:
X = pd.DataFrame(X_pca)
y = np.log(df['avg_salary']).tolist()

In [68]:
pickle.dump( X, open( "../models/poly/X_95.pkl", "wb" ) )
pickle.dump( y, open( "../models/poly/y.pkl", "wb" ) )

# Go To:
[Original Modeling Process](../original/Models.ipynb) 

[Original Data Cleaning](../original/Data_Cleaning.ipynb)

[Poly Modeling Process](../poly/Models.ipynb)

[Poly Data Cleaning](../poly/Data_Cleaning.ipynb)