# Preprocessing

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from library.sb_utils import save_file

As always, I'll start by loading the data.

In [2]:
data = pd.read_csv("../data/clean_data.csv")

In [3]:
data.head(3)

| Unnamed: 0 | salary_range | telecommuting | has_company_logo | has_questions | employment_type | required_experience | required_education | function | fraudulent | text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | Other | Internship | Unspecified | Marketing | 0 | marketing intern u ny new york we re food52 we... |
| 1 | 0 | 0 | 1 | 0 | Full-time | Not Applicable | Unspecified | Customer Service | 0 | customer service cloud video production nz auc... |
| 2 | 0 | 0 | 1 | 0 | Other | Not Applicable | Unspecified | Other | 0 | commissioning machinery assistant cma u ia wev... |
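
The `Unnamed: 0` column in the output above usually appears when a DataFrame's index was written to the CSV and then read back as an ordinary column. A minimal sketch of one way to avoid it, using a small made-up in-memory CSV rather than the project's file, is to pass `index_col=0` to `pd.read_csv`:

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for clean_data.csv; the leading
# empty header cell is what pandas reports as "Unnamed: 0".
csv_text = ",fraudulent,text\n0,0,marketing intern\n1,1,work from home\n"

# Default read: the saved index comes back as a regular column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

Whether this applies here depends on how `clean_data.csv` was written, which this notebook does not show.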


As nice as it was to look at all the numerical features during EDA, those features are not going to be needed in this project since I am dealing with text data. Perhaps if I had identified clear correlations between the numerical data and the fraudulent classification, I would consider using them.

In [None]:
# Drop the numerical columns; the categorical columns are kept because they are encoded below.
data = data.drop(columns=['salary_range', 'telecommuting', 'has_company_logo', 'has_questions'])

## Encoding

There are several categorical features here that need to be encoded. I'll start by refreshing myself on the values each feature contains.

In [None]:
for col in ['employment_type', 'required_experience', 'required_education', 'function']:
    print(col + ":", set(data[col]), "\n")
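
`set` shows the distinct values but not how often each one occurs. As a small sketch on made-up rows (the column names mirror the notebook, the values do not), `value_counts` reports both:

```python
import pandas as pd

# Made-up sample; only the column names are taken from the notebook.
sample = pd.DataFrame({
    "employment_type": ["Full-time", "Other", "Full-time", "Part-time"],
    "function": ["Marketing", "Other", "Marketing", "Sales"],
})

for col in ["employment_type", "function"]:
    # value_counts lists each distinct value with its frequency,
    # most common first.
    print(sample[col].value_counts(), "\n")

counts = sample["employment_type"].value_counts()
```

Seeing the frequencies can help spot rare categories before encoding them.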

### Label Encoding

From the dataset, there are two columns that would require a label encoder: `employment_type` and `function`.

In [None]:
le = LabelEncoder()

In [None]:
le.fit(data['employment_type'])
data['employment_type'] = le.transform(data['employment_type'])

In [None]:
le.fit(data['function'])
data['function'] = le.transform(data['function'])
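
Refitting the same `LabelEncoder` on a second column overwrites the mapping learned for the first, so the integer codes can no longer be translated back. A minimal sketch, on made-up rows, of keeping one encoder per column (the `encoders` dictionary is a hypothetical helper, not part of this notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample; only the column names are taken from the notebook.
sample = pd.DataFrame({
    "employment_type": ["Full-time", "Other", "Part-time", "Other"],
    "function": ["Marketing", "Other", "Sales", "Marketing"],
})

# Hypothetical helper: one fitted encoder per column, kept for later decoding.
encoders = {}
for col in ["employment_type", "function"]:
    enc = LabelEncoder()
    sample[col] = enc.fit_transform(sample[col])
    encoders[col] = enc

# classes_ is sorted, so the integer codes follow alphabetical order.
print(list(encoders["employment_type"].classes_))

# inverse_transform maps the integer codes back to the original labels.
decoded = encoders["employment_type"].inverse_transform(sample["employment_type"])
print(list(decoded))
```

The scikit-learn documentation describes `LabelEncoder` as intended for target values rather than input features, so this is a sketch of the bookkeeping only, not a recommendation over other encoders.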

### Ordinal Encoder

In [None]:
set(data['required_experience'])

In [None]:
# Experience levels ordered from least to most senior; each level is listed exactly once.
experience_levels = ['Not Applicable', 'Internship', 'Entry level', 'Associate',
                     'Mid-Senior level', 'Director', 'Executive']
oe = OrdinalEncoder(categories=[experience_levels])
oe.fit(data[['required_experience']])
data['required_experience'] = oe.transform(data[['required_experience']])
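
To make the effect of the `categories` argument concrete, here is a small sketch on made-up rows. It assumes a seniority order with each level listed once; the position of a level in the list becomes its code:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed seniority order, lowest to highest; the list index becomes the code.
levels = ["Not Applicable", "Internship", "Entry level", "Associate",
          "Mid-Senior level", "Director", "Executive"]

# Made-up sample; only the column name is taken from the notebook.
sample = pd.DataFrame(
    {"required_experience": ["Director", "Internship", "Not Applicable"]}
)

enc = OrdinalEncoder(categories=[levels])
# fit_transform expects a 2-D input, hence the double brackets.
codes = enc.fit_transform(sample[["required_experience"]])

print(codes.ravel().tolist())
```

With the default settings, a value in the data that is missing from the list raises an error at fit time, which is a useful check that the list is complete.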

In [None]:
set(data['required_education'])

In [None]:
edu_types = ['Unspecified', 'Some High School Coursework', 'High School or equivalent',
            'Vocational - HS Diploma', 'Some College Coursework Completed', 'Associate Degree',
            'Vocational', 'Professional', 'Certification', 'Vocational - Degree',
            'Bachelor\'s Degree', 'Master\'s Degree', 'Doctorate']
oe = OrdinalEncoder(categories=[edu_types])
oe.fit(data[['required_education']])
data['required_education'] = oe.transform(data[['required_education']])

In [None]:
data.head(3)

## Text Processing

Now all that is left to do is deal with the text data. I have already consolidated the text data into one column, converted everything to lowercase, and removed all the stop words.

In [None]:
vectorizer = TfidfVectorizer()

In [None]:
vectorizer.fit(data['text'])

In [None]:
vector = vectorizer.transform(data['text'])

In [None]:
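# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
features = vectorizer.get_feature_names_out()
# toarray() yields a dense NumPy array, which pandas accepts directly.
dense = vector.toarray()
df = pd.DataFrame(dense, columns=features)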

In [None]:
df