# Fake job Prediction: Investigating the pattern to avoid the job scam

# Notebook 5  Neural network and Modeling

**Fifth step: Modeling with pre-trained word vectors and a neural network**

By: Polly Pang

In this portion of the notebook, I will use pre-trained word vectors to convert the text column into a feature matrix, then fit traditional ML models as well as a neural network model.

<a id="top"></a>
<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home"> Contents</h3>

* [1. Libraries](#1)
* [2. Load data](#2)  
* [3. Word Embeddings](#3)
    - [3.1 Word2Vec](#3.1)
    - [3.2 vectorization](#3.2)
* [4. Modeling](#3)
    - [4.1 oversampling](#3.2)

    
    
    
* [7. End of Notebook 5](#7)

# 1. Libraries <a id="1"></a>

In [101]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import joblib
from sklearn.model_selection import train_test_split

# NLP
import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import gensim
from gensim.utils import simple_preprocess
from gensim.models import Word2Vec

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline as imbPipeline

# 2. Load data

## 2.1 Target and Feature<a id="2.1"></a>
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

**Plan**
- Define X and y. Target: `fraudulent`. Features: `has_salary_range`, `telecommuting`, `has_company_logo`, `has_questions`, and `text`.
- Split the data into a test set and a remainder set. Because I will use cross-validation in the following part, I will not split the remainder further into train and validation sets.
- Investigate the distribution of the remainder set.

In [5]:
# load data
word_embed_df=joblib.load('data/df_model_full_df.pkl')

In [6]:
# sample of dataset
word_embed_df

Unnamed: 0,has_salary_range,telecommuting,has_company_logo,has_questions,fraudulent,text
0,0,0,1,0,0,"Marketing Intern Marketing We're Food52, and w..."
1,0,0,1,0,0,Commissioning Machinery Assistant (CMA) Blank ...
2,0,0,1,0,0,Account Executive - Washington DC Sales Our pa...
3,0,0,1,1,0,Bill Review Manager Blank SpotSource Solutions...
4,0,0,0,0,0,Accounting Clerk Blank Blank Job OverviewApex ...
...,...,...,...,...,...,...
17874,0,0,1,1,0,Account Director - Distribution Sales Vend is ...
17875,0,0,1,1,0,Payroll Accountant Accounting WebLinc is the e...
17876,0,0,0,0,0,Project Cost Control Staff Engineer - Cost Con...
17877,0,0,0,1,0,Graphic Designer Blank Blank Nemsia Studios is...


In [7]:
print(f"The shape of word_embed_df is {word_embed_df.shape}")

The shape of word_embed_df is (17879, 6)


In [8]:
word_embed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17879 entries, 0 to 17878
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   has_salary_range  17879 non-null  int32 
 1   telecommuting     17879 non-null  int64 
 2   has_company_logo  17879 non-null  int64 
 3   has_questions     17879 non-null  int64 
 4   fraudulent        17879 non-null  int64 
 5   text              17879 non-null  object
dtypes: int32(1), int64(4), object(1)
memory usage: 768.4+ KB


**Define Target and Features**

In [9]:
# define X and y
X=word_embed_df.drop('fraudulent',axis=1)
y=word_embed_df['fraudulent']

In [10]:
print(f"The shape of features(X) is {X.shape},and the shape of y (Target) is {y.shape}")

The shape of features(X) is (17879, 5),and the shape of y (Target) is (17879,)


**Remainder and test split**

In [11]:
X_rem,X_test,y_rem,y_test=train_test_split(X,y,test_size=0.3,random_state=5, stratify=y)

In [12]:
print(f"The shape of X_rem is {X_rem.shape},and the shape of y_rem is {y_rem.shape}")

The shape of X_rem is (12515, 5),and the shape of y_rem is (12515,)


In [13]:
print(f"The shape of X_test is {X_test.shape},and the shape of y_test is {y_test.shape}")

The shape of X_test is (5364, 5),and the shape of y_test is (5364,)


In [14]:
X_rem.head()

Unnamed: 0,has_salary_range,telecommuting,has_company_logo,has_questions,text
4310,0,0,1,1,Creative Director - Art Blank Kettle is an ind...
11082,0,0,1,0,"CDL Driver Blank ABC Supply Co., Inc. is the n..."
14188,0,0,1,0,Data Analyst (Marketing) Blank Blank We are lo...
8280,0,0,1,1,Channel Representative Blank Intercom (# is a ...
13924,0,0,1,0,Programmatic Media Manager Media Since 1978Our...


In [15]:
y_rem.unique()

array([0, 1], dtype=int64)

In [16]:
X_test.head()

Unnamed: 0,has_salary_range,telecommuting,has_company_logo,has_questions,text
8222,1,0,1,0,eCommerce Specialist / Database Management for...
1560,0,0,1,1,Solution Engineer Blank Declara is focused on ...
17171,1,0,0,1,Sales Representative for Home Improvement Blan...
9164,0,0,1,1,Customer Service Specialist Interpreting Servi...
9454,0,0,1,0,Customer Service Associate - Records Blank Nov...


In [17]:
y_test.unique()

array([0, 1], dtype=int64)

- X_rem and X_test have the same features, and y_rem and y_test each contain only the two classes, 0 and 1.
- Within each set, the features and the target have the same number of rows: 12515 for the remainder set and 5364 for the test set.
- The remainder and test sets are now ready.
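
The plan also calls for checking the class distribution of the remainder set. A minimal sketch of that check is shown here on synthetic data; the names `X_demo` and `y_demo` and the roughly 5% positive rate are assumptions for illustration, not values taken from the real dataset. With `stratify`, the share of fraudulent rows should be nearly identical in the full data, the remainder set, and the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (assumed ~5% positive labels)
rng = np.random.default_rng(5)
y_demo = pd.Series((rng.random(2000) < 0.05).astype(int), name="fraudulent")
X_demo = pd.DataFrame({"has_questions": rng.integers(0, 2, size=2000)})

# Same split settings as the notebook: 30% test, stratified on the target
X_rem_demo, X_test_demo, y_rem_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=5, stratify=y_demo
)

# Share of each class in the full data, the remainder set, and the test set
balance = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "remainder": y_rem_demo.value_counts(normalize=True),
    "test": y_test_demo.value_counts(normalize=True),
})
print(balance.round(3))
```

On the real data, the same check would be `y_rem.value_counts(normalize=True)` next to `y_test.value_counts(normalize=True)`.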

# 3. Word Embeddings

## 3.1 Word2Vec<a id="2.1"></a>
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

**Plan**
- Define the function text_process() and apply it to X_rem and X_test
- Define the function sentence2vec()
- Define the column transformer
- Create the feature matrix

**Function text_process()**

**What the function does**
- Lowercases all the text
- Removes punctuation
- Splits the text on spaces and drops the empty tokens left by extra blanks
- Removes stop words
- Stems each remaining word
- Joins the words back into a single sentence

In [38]:
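# init stemmer
stemmer = nltk.stem.PorterStemmer()
# build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def text_process(text):
    """Lowercase, strip punctuation, drop stop words, stem, and rejoin the text."""
    # lowercase and remove punctuation characters
    non_pun = ''.join([ch for ch in text.lower() if ch not in string.punctuation])

    # split on spaces; empty strings left by repeated spaces are skipped below
    lst_words = non_pun.split(' ')

    # remove stop words and stem
    lst_out = []
    for w in lst_words:
        if (w not in stop_words) and (w != ""):
            lst_out.append(stemmer.stem(w))

    # combine the list back into a sentence (empty string if nothing is left)
    return ' '.join(lst_out)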

In [40]:
# test result
text_process(X_rem['text'][0])

'market intern market food52 weve creat groundbreak awardwin cook site support connect celebr home cook give everyth need one placew top editori busi engin team focus use technolog find new better way connect peopl around specif food interest offer superb highli curat inform food cook attract talent home cook contributor countri also publish wellknown profession like mario batali gwyneth paltrow danni meyer partnership whole food market random housefood52 name best food websit jame beard foundat iacp featur new york time npr pando daili techcrunch today showwer locat chelsea new york citi food52 fastgrow jame beard awardwin onlin food commun crowdsourc curat recip hub current interview full parttim unpaid intern work small team editor execut develop new york citi headquartersreproduc andor repackag exist food52 content number partner site huffington post yahoo buzzfe variou content manag systemsresearch blog websit provis food52 affili programassist daytoday affili program support scre

In [41]:
# compare with the original text
X_rem['text'][0]

"Marketing Intern Marketing We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celebrate home cooks, and give them everything they need in one place.We have a top editorial, business, and engineering team. We're focused on using technology to find new and better ways to connect people around their specific food interests, and to offer them superb, highly curated information about food and cooking. We attract the most talented home cooks and contributors in the country; we also publish well-known professionals like Mario Batali, Gwyneth Paltrow, and Danny Meyer. And we have partnerships with Whole Foods Market and Random House.Food52 has been named the best food website by the James Beard Foundation and IACP, and has been featured in the New York Times, NPR, Pando Daily, TechCrunch, and on the Today Show.We're located in Chelsea, in New York City. Food52, a fast-growing, James Beard Award-winning online food community and crowd-sour

In [49]:
# X_rem pre-process 
X_rem_stem=X_rem.copy()

X_rem_stem['text']=X_rem_stem['text'].apply(text_process)

In [57]:
X_rem_stem.shape==X_rem.shape

True

In [58]:
# X_test pre-process 
X_test_stem=X_test.copy()
X_test_stem['text']=X_test_stem['text'].apply(text_process)

In [63]:
X_test.shape==X_test_stem.shape

True

- X_rem and X_test have been transformed. From here on, the NLP processing uses X_rem_stem and X_test_stem.

**Customized vectorizer**

- The vectors are stored in word2vec format, so the pre-trained file must be downloaded and unzipped before loading. Here I use the [LexVec](https://github.com/alexandres/lexvec) vectors trained on Wikipedia.
- These are 300-dimensional embeddings.

In [69]:
# load model
model = gensim.models.KeyedVectors.load_word2vec_format('word_vectors/lexvec-wikipedia-word-vectors', binary=False)

In [70]:
def sentence2vec(text):
    """
    Embed a sentence by averaging the word vectors of the tokenized text. 
    Out-of-vocabulary words are replaced by the zero-vector.
    -----
    
    Input: text (string)
    Output: embedding vector (np.array)
    """
    tokenized = simple_preprocess(text)
    
    # start with one zero vector so the list is never empty (note: it also counts toward the mean)
    word_embeddings = [np.zeros(300)]
    for word in tokenized:
        # if the word is in the model then embed
        if word in model:
            vector = model[word]
        # add zeros for out-of-vocab words
        else:
            vector = np.zeros(300)
            
        word_embeddings.append(vector)
    
    # average the word vectors
    sentence_embedding = np.stack(word_embeddings).mean(axis=0)
    
    return sentence_embedding

In [43]:
# test sentence2vec: it converts a chunk of text into a single averaged vector
sentence2vec(test)

array([-8.73303677e-03, -7.40721906e-03,  1.96714861e-02, -9.62654653e-03,
       -7.61171243e-03,  7.10465100e-02,  2.79783374e-03, -3.44043646e-02,
        2.67091537e-02,  3.47746526e-02,  3.64276721e-02,  4.86087006e-02,
       -5.14526681e-02,  5.95221453e-02,  3.72461983e-02, -7.26838076e-03,
       -2.58509070e-02,  1.05571504e-02,  1.63256722e-02,  2.45892592e-02,
       -2.23805102e-02, -1.03946277e-02,  3.08614728e-02,  2.79396841e-02,
       -3.21179997e-02,  2.59392796e-02,  6.13525187e-02,  6.66009756e-02,
        8.12814498e-02, -9.03423099e-03, -6.22478060e-02,  1.34164980e-02,
        1.39349112e-02, -1.21182504e-02,  6.12421055e-03,  4.17263240e-02,
        1.42471214e-02, -5.70988515e-04, -2.02543078e-02,  5.34754635e-03,
        4.08871904e-02,  6.40502865e-03, -4.73126478e-02, -7.77524050e-02,
       -1.58924088e-02, -3.13673809e-02,  3.28009921e-02, -1.21536515e-02,
       -3.01963932e-02, -5.86122256e-03,  2.21079427e-02,  2.29700792e-03,
       -3.03853055e-03,  

In [64]:
#  300-dimensional
sentence2vec(test).shape

(300,)
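
To make the averaging concrete, here is a small sketch with hypothetical 3-dimensional vectors standing in for the 300-dimensional LexVec embeddings. The names `toy_vectors` and `average_embedding` are illustrative and not part of the notebook. It also shows one consequence of seeding the list with a zero vector, as `sentence2vec` does: the seed row counts toward the mean, so every component is divided by one more row than there are tokens.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings (the real model uses 300 dimensions)
toy_vectors = {
    "data": np.array([1.0, 0.0, 2.0]),
    "analyst": np.array([3.0, 2.0, 0.0]),
}
DIM = 3

def average_embedding(tokens, vectors, dim, seed_with_zero=True):
    """Average token vectors; out-of-vocabulary tokens contribute zero vectors."""
    rows = [np.zeros(dim)] if seed_with_zero else []
    rows += [vectors.get(tok, np.zeros(dim)) for tok in tokens]
    if not rows:  # empty text and no seed row: fall back to the zero vector
        return np.zeros(dim)
    return np.stack(rows).mean(axis=0)

tokens = ["data", "analyst", "zzz"]  # "zzz" is out of vocabulary
with_seed = average_embedding(tokens, toy_vectors, DIM, seed_with_zero=True)
without_seed = average_embedding(tokens, toy_vectors, DIM, seed_with_zero=False)
print(with_seed)     # component sums divided by 4 rows
print(without_seed)  # component sums divided by 3 rows
```

For long job descriptions the extra row changes the result very little, but for very short texts the difference can be noticeable.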

**Column transformer**

In [80]:
numerical = list(X_rem_stem.select_dtypes('number').columns)
print(f"Numerical columns are: {numerical}")

Numerical columns are: ['has_salary_range', 'telecommuting', 'has_company_logo', 'has_questions']


In [79]:
text = list(X_rem_stem.select_dtypes('object').columns)
print(f"text columns is: {text}")

text columns is: ['text']


In [114]:
# Create the column transformations list + columns to which to apply
col_transforms_nn = [('num','passthrough', numerical),
                ('sentence2vec', sentence2vec, 'text')]

In [118]:
# Create the column transformer
col_trans_nn = ColumnTransformer(col_transforms_nn,remainder='drop')

In [119]:
# fit
#pipeline.fit_transform(X_rem_stem)
col_trans_nn.fit(X_rem_stem)

TypeError: All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers. '<function sentence2vec at 0x000001A9F90160D0>' (type <class 'function'>) doesn't.
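
The error occurs because `ColumnTransformer` expects each entry to be an estimator with `fit` and `transform` methods, while `sentence2vec` is a plain function that handles one string at a time. One possible workaround, which is not necessarily the fix used later in this notebook, is to wrap a row-wise helper in scikit-learn's `FunctionTransformer`. The sketch below is self-contained, so it uses a hypothetical `toy_sentence2vec` in place of the real embedding function and a two-row demo frame; `embed_texts` is also an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

def toy_sentence2vec(text):
    """Hypothetical stand-in for sentence2vec: one fixed-length vector per text."""
    tokens = text.split()
    return np.array(
        [len(tokens), sum(len(t) for t in tokens), text.count("a")], dtype=float
    )

def embed_texts(texts):
    """Apply the sentence embedder to every row and stack the results into a 2-D array."""
    return np.vstack([toy_sentence2vec(t) for t in texts])

# Small demo frame with the same kind of columns as X_rem_stem
X_demo = pd.DataFrame({
    "has_company_logo": [1, 0],
    "has_questions": [0, 1],
    "text": ["data analyst market", "sale repres home"],
})
numerical = ["has_company_logo", "has_questions"]

# Passing 'text' as a string (not a list) hands the transformer a 1-D Series of strings
col_trans_demo = ColumnTransformer(
    [("num", "passthrough", numerical),
     ("sentence2vec", FunctionTransformer(embed_texts), "text")],
    remainder="drop",
)
features = col_trans_demo.fit_transform(X_demo)
print(features.shape)
```

In the notebook itself, `toy_sentence2vec` would be replaced by `sentence2vec`, giving four numeric columns followed by 300 embedding columns per row.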