# Modelling

In this final notebook, I will begin modelling and decide which model is best for classifying job postings as real or fake. For each model, I will tune hyperparameters with cross-validation and record the total execution time.

*Note:* For reference regarding the execution times, I am running a Mac with a dual-core 3.1 GHz Intel i5 and 8 GB of RAM.
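As a rough illustration of how the per-model execution time could be captured, here is a minimal sketch using a small context manager. The names `timed` and `timings` are hypothetical helpers, not part of this notebook, which uses plain `time.time()` calls instead.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed minutes of the wrapped block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = (time.perf_counter() - start) / 60


timings = {}
with timed("example", timings):
    # Stand-in workload; in practice this would be tuning plus prediction
    sum(i * i for i in range(100_000))

print(f"Execution time: {timings['example']:.4f} min")
```

Keeping the timings in one dictionary would make it easy to compare models side by side at the end.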

I'll start by loading some of the tools I'll need for the job.

In [1]:
import time
import pandas as pd
import warnings
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
warnings.filterwarnings("ignore")

## Loading the Data

I'll load in the preprocessed data just so I have it for reference.

In [2]:
data = pd.read_csv('../data/preprocessed_data.csv', index_col=0)

In [3]:
data.head(3)

                                                         text
fraudulent
0           marketing intern were food weve created ground...
0           customer service cloud video production second...
0           commissioning machinery assistant cma valor se...


And most importantly, I'll load in the training and testing data.

In [4]:
X_train = pd.read_csv('../data/X_train.csv', index_col=0)
y_train = pd.read_csv('../data/y_train.csv')['fraudulent']
X_test = pd.read_csv('../data/X_test.csv', index_col=0)
y_test = pd.read_csv('../data/y_test.csv')['fraudulent']

## Model Selection

Now comes the fun part! I am going to choose three different models, each tuned using `GridSearchCV` with 5-fold cross-validation. The models that I have chosen are:
- Naive Bayes
- K-Nearest Neighbors (KNN)
- Passive Aggressive Classifier
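To make the tuning setup concrete, here is a minimal sketch of a 5-fold grid search on synthetic data. The toy arrays `X_toy` and `y_toy` and the small `alpha` grid are assumptions made up for illustration; they are not the job-posting data or the grids used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Hypothetical stand-in for vectorized postings: 200 rows of word counts,
# with roughly 10% of rows in the positive (fake) class
y_toy = np.array([1] * 20 + [0] * 180)
X_toy = rng.poisson(1.0, size=(200, 30))
X_toy[y_toy == 1, :5] += 3  # make five "words" more common in the fake class

search = GridSearchCV(
    MultinomialNB(),
    param_grid={"alpha": [0.1, 0.5, 1.0]},
    scoring="f1",  # rank candidates by F1 on the positive class
    cv=5,          # an integer with a classifier gives stratified folds
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With an integer `cv` and a classifier, scikit-learn uses stratified folds, so each fold keeps about the same share of fake postings. That matters when the positive class is this rare.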

I chose these models because of their speed. In my experience, Naive Bayes and KNN are pretty efficient in terms of speed and memory. This will be my first time modelling with a Passive Aggressive Classifier, but from what I've read, it seems pretty efficient too.

However, before I even begin thinking of getting my hands dirty with modelling, I'll first run a dummy model as a baseline. This will help me determine how good or bad my actual models are by comparison.

### Dummy Classifier

In [5]:
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
y_predict_dummy = dummy.predict(X_test)

In [6]:
dummy_report = classification_report(y_test, y_predict_dummy)
print(dummy_report)

              precision    recall  f1-score   support

           0       0.96      1.00      0.98      5143
           1       0.00      0.00      0.00       221

    accuracy                           0.96      5364
   macro avg       0.48      0.50      0.49      5364
weighted avg       0.92      0.96      0.94      5364



At first glance, the dummy model seems to perform really well if you only look at the weighted F1 and overall accuracy. The truth is that it isn't that great (after all, it is a dummy model). Because the data is imbalanced in favor of the real jobs, the `most_frequent` strategy labels every record as real. That gives a high overall accuracy, but precision, recall and F1 for the fake jobs are all 0.00, and the F1 for the fake jobs is what really matters here. This is the reality of dealing with imbalanced data.
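The gap between accuracy and fake-job F1 can be reproduced from the class counts alone. This is a small sketch that rebuilds labels with the same support as the report above (5143 real, 221 fake) and predicts the majority class everywhere; the array names are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Same class counts as the test set: 5143 real (0) and 221 fake (1)
y_true = np.array([0] * 5143 + [1] * 221)
# Always predict the majority class, like strategy='most_frequent'
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
f1_fake = f1_score(y_true, y_pred, pos_label=1, zero_division=0)

print(f"accuracy:  {acc:.2f}")
print(f"F1 (fake): {f1_fake:.2f}")
```

Accuracy is simply 5143 / 5364, the share of real postings, so any model has to beat roughly 0.96 accuracy before it says anything useful about the fake ones.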

### Naive Bayes

The first real model is good ole' Naive Bayes. Naive Bayes can work quite well for NLP tasks and I know it to be very efficient in terms of memory and speed. I won't go too crazy with the hyperparameters and I'll only tune a few.

In [7]:
start_time = time.time()
nb = MultinomialNB()

In [8]:
# Smoothing values 0.0 to 0.4; note that 0.0 means no smoothing at all
alpha = [i/10 for i in range(0, 5)]
fit_prior = [True, False]
param_dist = {"alpha": alpha, "fit_prior": fit_prior}

In [9]:
# No `scoring` is passed here, so candidates are ranked by the default accuracy score
grid_search = GridSearchCV(estimator=nb, param_grid=param_dist, cv=5)
grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=MultinomialNB(),
             param_grid={'alpha': [0.0, 0.1, 0.2, 0.3, 0.4],
                         'fit_prior': [True, False]})

In [10]:
best_nb = grid_search.best_estimator_
print(best_nb)

MultinomialNB(alpha=0.1)


In [11]:
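# best_estimator_ is already refit on the full training set (refit=True by default),
# so fitting again here is redundant but harmless
nb = best_nb
nb.fit(X_train, y_train)
y_predict_nb = nb.predict(X_test)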
end_time = time.time()

In [12]:
nb_report = classification_report(y_test, y_predict_nb)
print(nb_report)
print("Execution time: %s min" % ((end_time - start_time)/60))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      5143
           1       0.89      0.58      0.70       221

    accuracy                           0.98      5364
   macro avg       0.94      0.79      0.85      5364
weighted avg       0.98      0.98      0.98      5364

Execution time: 0.6847259322802226 min


This is a good start and about what I expected. Naive Bayes can be pretty useful for NLP tasks and runs fairly quickly. An F1 score of 0.70 for the fraudulent jobs is nothing to be ashamed of, but I know I can do better. On to the next model.
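For reference, the 0.70 can be checked by hand, since F1 is the harmonic mean of precision and recall. This is a quick sketch using the rounded class 1 values from the report above; `f1_from` is a hypothetical helper name.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall for class 1 from the Naive Bayes report
f1_fake_nb = f1_from(0.89, 0.58)
print(round(f1_fake_nb, 2))
```

The harmonic mean sits closer to the lower of the two values, so the 0.58 recall is what holds the score back here despite the 0.89 precision.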

### Passive Aggressive Classifier

The next model I've chosen to test is a Passive Aggressive Classifier. To be quite honest, I don't know much about how this model actually works. All I've heard is that it performs well for NLP tasks, so I figured I would give it a shot here.
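For a rough intuition of the name, here is a minimal sketch of one textbook passive-aggressive update (the PA-I variant with hinge loss and labels in {-1, +1}). The function `pa_update` is written for illustration only and assumes a nonzero feature vector; scikit-learn's implementation has more moving parts and may differ in detail.

```python
import numpy as np


def pa_update(w, x, y, C=1.0):
    """One PA-I step for a single example with label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss on this example
    if loss == 0.0:
        return w  # passive: the margin is already satisfied, leave w alone
    tau = min(C, loss / np.dot(x, x))  # aggressive: step size, capped by C
    return w + tau * y * x


w = np.zeros(3)
x = np.array([1.0, 0.0, 2.0])

w = pa_update(w, x, y=1)       # first pass: the example violates the margin
print(w)
w_again = pa_update(w, x, y=1)  # second pass: margin now met, so no change
print(w_again)
```

The model stays put when an example is classified with enough margin and moves just far enough to fix it when it is not, which is part of why each update is cheap.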

In [13]:
start_time = time.time()
pac = PassiveAggressiveClassifier()
loss = ['hinge', 'squared_hinge']
shuffle = [True, False]
param_dist = {"shuffle": shuffle, "loss": loss, "n_jobs": [-1]}

In [14]:
grid_search = GridSearchCV(pac, param_grid=param_dist, scoring='f1', cv=5)
grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=PassiveAggressiveClassifier(),
             param_grid={'loss': ['hinge', 'squared_hinge'], 'n_jobs': [-1],
                         'shuffle': [True, False]},
             scoring='f1')

In [15]:
best_pac = grid_search.best_estimator_

In [16]:
print(best_pac)

PassiveAggressiveClassifier(loss='squared_hinge', n_jobs=-1)


In [17]:
y_predict_pac = best_pac.predict(X_test)
end_time = time.time()

In [18]:
pac_report = classification_report(y_test, y_predict_pac)
print(pac_report)
print("Execution time: %s min" % ((end_time - start_time)/60))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      5143
           1       0.83      0.77      0.80       221

    accuracy                           0.98      5364
   macro avg       0.91      0.88      0.90      5364
weighted avg       0.98      0.98      0.98      5364

Execution time: 1.6964730302492776 min


This is pretty solid. The overall accuracy is high and the F1 score for the fraudulent postings is 0.80, up from 0.70 with Naive Bayes. The execution time is still short at about 1.7 minutes, so that's a plus.

### KNN

Last up is KNN. I'll tune the number of neighbors and the power parameter `p`, keeping the weights fixed at `'distance'`.

In [19]:
start_time = time.time()
knn = KNeighborsClassifier()
n_neighbors = [i for i in range(2, 6)]
p = [i for i in range(2, 5)]
param_dist = {"n_neighbors": n_neighbors, "weights": ['distance'], "p": p, "n_jobs": [-1]}

In [20]:
grid_search = GridSearchCV(knn, param_grid=param_dist, scoring='f1', cv=5)
grid_search.fit(X_train, y_train)

KeyboardInterrupt: 
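To put the interrupted search in perspective, here is a small sketch that counts how much work this grid asks for. The variable names are made up for illustration; the grid values mirror the ones defined above.

```python
from sklearn.model_selection import ParameterGrid

param_grid_knn = {
    "n_neighbors": list(range(2, 6)),  # 2, 3, 4, 5
    "weights": ["distance"],
    "p": list(range(2, 5)),            # 2, 3, 4
    "n_jobs": [-1],
}

n_candidates = len(ParameterGrid(param_grid_knn))
n_folds = 5
# One fit per candidate per fold, plus the final refit on the full training set
n_fits = n_candidates * n_folds + 1

print(n_candidates, n_fits)
```

Fitting KNN is cheap, but scoring is not: for every fold, each validation row has to be compared against the training rows, so the cost of a search like this is likely dominated by prediction rather than fitting.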

In [None]:
best_knn = grid_search.best_estimator_

In [None]:
print(best_knn)

In [None]:
y_predict_knn = best_knn.predict(X_test)
end_time = time.time()

In [None]:
knn_report = classification_report(y_test, y_predict_knn)
print(knn_report)
print("Execution time: %s min" % ((end_time - start_time)/60))

## Summary