# Evaluating Naive Bayes vs Logistic Regression to classify job listings

Understanding the availability of jobs is important in determining which positions to feature in ads, who to target, where to target them, and which platforms to target them on.

As discussed in the Job Listing EDA notebook, labeling jobs by role or company can be error-prone. Instead, it may be helpful to learn patterns that are common to a certain type of job and apply a label automatically, especially when dealing with over 10M jobs each day.

Here, we look at basic NLP approaches to classify jobs by title. For simplicity, we only consider whether a job is a skilled position (nurses, engineers, consultants) or a gig-role (drivers, cashiers, shoppers).

We are primarily interested in identifying and isolating gig-jobs. These roles are more suitable for advertising through social media platforms and tend to have high click rates.

Previously, we evaluated Logistic Regression. However, during that process, we noticed that certain words were almost guaranteed to push a posting toward the skilled or the gig label. This motivates Naive Bayes, which offers several distinct benefits. First, it is easy to understand: each term has a probability of being associated with a particular label. Second, Naive Bayes is easy to train and can be used to apply multiple labels, given sufficient training examples for each label. Last, the model can continue to be re-trained as more labeled samples become available.

In [21]:
import os, pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.read_csv('../data/labeled_eda_sample_data_file.csv')
data.columns

Index(['title', 'city', 'state', 'zip', 'country', 'posted_at',
       'job_reference', 'company', 'category', 'body', 'gig'],
      dtype='object')

In [4]:
cols_to_train = ['title', 'gig'] 
data = data[cols_to_train]
data.head()

Unnamed: 0,title,gig
0,Retail Store Manager - Alabaster AL,0
1,Financial Relationship Consultant - Pell City,0
2,Prod Cook 3 PM Bob's Steak & Chop,1
3,Quant Developer,0
4,Human Resource Manager,0


**Clean up text**

In [5]:
def standardize_text(df, text_field):
    '''Clean-up text column to prepare for tokenization
    
    Removes unwanted characters &
    Replaces them with spaces or blanks
    --
    Input
    + pandas dataframe
    + name of text column
    
    Returns
    + pandas dataframe with cleaned column
    '''
    # regex=True is explicit because newer pandas versions default to literal matching
    df[text_field] = df[text_field].str.replace(r"http\S+", "", regex=True)
    df[text_field] = df[text_field].str.replace(r"http", "", regex=True)
    df[text_field] = df[text_field].str.replace(r"@\S+", "", regex=True)
    df[text_field] = df[text_field].str.replace(r"[^A-Za-z0-9(),!?@\'\`\"\_\n]", " ", regex=True)
    df[text_field] = df[text_field].str.replace(r"@", "at", regex=True)
    df[text_field] = df[text_field].str.lower()
    return df

In [6]:
text_cols = ['title']

for col in text_cols:
    data = standardize_text(data, col)

col_names = {'title':'job_title',
             'gig':'class_label'}    

data = data.rename(columns=col_names)

#data.to_csv('../data/cleaned_labeled_data.csv')
data.head()

Unnamed: 0,job_title,class_label
0,retail store manager alabaster al,0
1,financial relationship consultant pell city,0
2,prod cook 3 pm bob's steak chop,1
3,quant developer,0
4,human resource manager,0


## Preprocessing

In [22]:
import keras
import nltk
import re
import codecs

In [8]:
from nltk.tokenize import RegexpTokenizer

In [9]:
tokenizer = RegexpTokenizer(r'\w+')

data['tokens'] = data['job_title'].apply(tokenizer.tokenize)
data.head()

Unnamed: 0,job_title,class_label,tokens
0,retail store manager alabaster al,0,"[retail, store, manager, alabaster, al]"
1,financial relationship consultant pell city,0,"[financial, relationship, consultant, pell, city]"
2,prod cook 3 pm bob's steak chop,1,"[prod, cook, 3, pm, bob, s, steak, chop]"
3,quant developer,0,"[quant, developer]"
4,human resource manager,0,"[human, resource, manager]"
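The `\w+` pattern explains why `bob's` is split into `bob` and `s` in the table above: the apostrophe is not a word character, so it acts as a separator. The sketch below reproduces that behavior with the standard-library `re` module. The helper name `tokenize_words` is hypothetical, and it is assumed here to behave like the tokenizer configured above for this pattern.

```python
import re

def tokenize_words(text):
    # Return every run of word characters (letters, digits, underscore)
    return re.findall(r"\w+", text)

tokens = tokenize_words("prod cook 3 pm bob's steak chop")
print(tokens)
```

If stray single-letter tokens such as `s` turn out to be noisy, a pattern that keeps apostrophes inside words could be tried instead.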


## Vectorize the tokens

We have several options when representing the tokenized words mathematically:

+ Bag of words -- count how many times each word appears in a title
+ tf-idf (term frequency-inverse document frequency) -- scale each word's count by how rare the word is across all titles, so that ubiquitous words carry less weight

### Processing tools

Convert data and target to list format for later use.

Define a function to create document-term matrix and fit a vectorizer model. Allow for multiple vectorizer options.

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# create lists of X and y for later use
list_corpus = data['job_title'].tolist()
list_labels = data['class_label'].tolist()

def fit_vectorizer(data, vec_type='count'):
    '''Create and fit a vectorizer
    
    Options:
    + count -> count_vectorizer 
    + tfidf -> tfidf_vectorizer
    
    Input:
    + data - X data to fit the model
    + vec_type - name of vectorizer to use
    
    Returns:
    + Document-term matrix or Tf-idf-weighted document-term matrix
    + vectorizer - fitted model
    '''
    if vec_type=='count':
        vectorizer = CountVectorizer()
    elif vec_type=='tfidf':
        vectorizer = TfidfVectorizer()
    else:
        # fail early; otherwise `vectorizer` would be undefined below
        raise ValueError("vec_type must be 'count' or 'tfidf'")
    
    emb = vectorizer.fit_transform(data)

    return emb, vectorizer

### Evaluation & Visualization tools

Some functions to help assess model performance.

In [11]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

def get_metrics(y_test, y_predicted):  
    # true positives / (true positives+false positives)
    precision = precision_score(y_test, y_predicted, pos_label=None,
                                    average='weighted')             
    # true positives / (true positives + false negatives)
    recall = recall_score(y_test, y_predicted, pos_label=None,
                              average='weighted')
    
    # harmonic mean of precision and recall
    f1 = f1_score(y_test, y_predicted, pos_label=None, average='weighted')
    
    # (true positives + true negatives) / total
    accuracy = accuracy_score(y_test, y_predicted)
    return accuracy, precision, recall, f1
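One property of `average='weighted'` is worth keeping in mind when reading the results: support-weighted recall is algebraically equal to accuracy, so the two columns will always match and neither isolates how well gig jobs are recovered. The sketch below, using made-up labels, shows the identity and one way to get recall for the gig class alone (`pos_label=1` with `average='binary'`).

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels; 1 = gig (minority class), 0 = skilled
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)

# Recall per class, averaged with weights equal to each class's share of y_true
weighted_recall = recall_score(y_true, y_pred, average='weighted')

# Recall for the gig class only
gig_recall = recall_score(y_true, y_pred, pos_label=1, average='binary')

print("accuracy = %.3f, weighted recall = %.3f, gig recall = %.3f"
      % (acc, weighted_recall, gig_recall))
```

Since gig jobs are the class of interest, reporting the per-class figure alongside the weighted one may give a clearer picture.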


In [20]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

def run_skf(df, vec):
    '''A training & testing pipeline to compare:
    
    + Logistic Regression
    + Naive Bayes Classification
    
    Inputs:
    
    df - pandas dataframe containing labeled data
    vec - choice of vectorizer model (count vectorizer or tf-idf)
    
    Output:
    
    10-fold stratified cross validation results
    Various performance metrics
    '''

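    # no shuffling, so random_state is omitted (recent scikit-learn
    # versions raise an error if it is set while shuffle=False)
    skf = StratifiedKFold(n_splits=10)

    # use the dataframe passed in rather than the global `data`
    X = df['job_title']
    y = df['class_label']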

    current_split = 1

    acc_list = []; prec_list = []; rec_list = []
    acc_nb = []; prec_nb = []; rec_nb = []

    for train_index, test_index in skf.split(X, y):

        print('CURRENT SPLIT:', current_split)

        # get splits & assign data (split() yields positional indices)
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # vectorize word counts
        X_train_counts, count_vectorizer = fit_vectorizer(X_train, vec_type=vec)
        X_test_counts = count_vectorizer.transform(X_test)

        # train & test logistic regression model
        clf = LogisticRegression(C=30.0, class_weight='balanced', solver='newton-cg', 
                             multi_class='ovr', n_jobs=-1, random_state=40)
        clf.fit(X_train_counts, y_train)
        y_predicted = clf.predict(X_test_counts)
        # check performance
        accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted)
        print("LR: accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))

        # add metrics to list
        acc_list.append(accuracy)
        prec_list.append(precision)
        rec_list.append(recall)

        # do the same for Naive bayes
        nb_clf = MultinomialNB()
        nb_clf.fit(X_train_counts, y_train)
        y_pred_nb = nb_clf.predict(X_test_counts)
        # check performance
        accuracy, precision, recall, f1 = get_metrics(y_test, y_pred_nb)
        print("NB: accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))
        acc_nb.append(accuracy)
        prec_nb.append(precision)
        rec_nb.append(recall)

        current_split += 1

    # Sample classification report (computed on the final split only)
    cr = classification_report(y_test, y_predicted, labels=[0,1], target_names=['Regular Job', 'Gig'])
    cm = confusion_matrix(y_test, y_predicted)

    print('\n--- LOGISTIC REGRESSION ---')
    print('\nClassification Report')
    print(cr)
    print('\nConfusion Matrix')
    print(cm)

    # Summarize
    print('\nFinal Performance')
    print('Accuracy: mean %.3f, variance %.3f' % (np.mean(acc_list), np.var(acc_list)))
    print('Precision: mean %.3f, variance %.3f' % (np.mean(prec_list), np.var(prec_list)))
    print('Recall: mean %.3f, variance %.3f'% (np.mean(rec_list), np.var(rec_list)))

    # nb
    cr = classification_report(y_test, y_pred_nb, labels=[0,1], target_names=['Regular Job', 'Gig'])
    cm = confusion_matrix(y_test, y_pred_nb)
    print('\n--- NAIVE BAYES CLASSIFIER ---')
    print('\nClassification Report')
    print(cr)
    print('\nConfusion Matrix')
    print(cm)

    # Summarize
    print('\nFinal Performance')
    print('Accuracy: mean %.3f, variance %.3f' % (np.mean(acc_nb), np.var(acc_nb)))
    print('Precision: mean %.3f, variance %.3f' % (np.mean(prec_nb), np.var(prec_nb)))
    print('Recall: mean %.3f, variance %.3f'% (np.mean(rec_nb), np.var(rec_nb)))
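The same fit-on-train, transform-on-test pattern can also be expressed more compactly with a scikit-learn pipeline, which refits the vectorizer inside every fold automatically. This is a sketch of an alternative, not the pipeline used for the results below; the toy titles and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; 1 = gig, 0 = skilled
toy_X = np.array([
    "delivery driver", "cashier", "grocery shopper", "rideshare driver",
    "store cashier", "personal shopper",
    "registered nurse", "software engineer", "management consultant",
    "staff nurse", "data engineer", "tax consultant",
])
toy_y = np.array([1] * 6 + [0] * 6)

# Vectorizer and classifier are fitted together on each training fold
nb_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

cv_results = cross_validate(
    nb_pipeline, toy_X, toy_y,
    cv=StratifiedKFold(n_splits=3),
    scoring=["accuracy", "recall"],  # "recall" here is recall of class 1 (gig)
)

print("accuracy per fold:", cv_results["test_accuracy"])
print("gig recall per fold:", cv_results["test_recall"])
```

Swapping `MultinomialNB()` for `LogisticRegression(...)` or `CountVectorizer()` for `TfidfVectorizer()` would cover the other combinations compared in this notebook.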

### Bag of Words

+ Logistic Regression
+ NaiveBayes

In [18]:
run_skf(data, 'count')

CURRENT SPLIT: 1
LR: accuracy = 0.911, precision = 0.910, recall = 0.911, f1 = 0.910
NB: accuracy = 0.931, precision = 0.936, recall = 0.931, f1 = 0.925
CURRENT SPLIT: 2
LR: accuracy = 0.931, precision = 0.932, recall = 0.931, f1 = 0.931
NB: accuracy = 0.941, precision = 0.941, recall = 0.941, f1 = 0.938
CURRENT SPLIT: 3
LR: accuracy = 0.851, precision = 0.849, recall = 0.851, f1 = 0.850
NB: accuracy = 0.871, precision = 0.864, recall = 0.871, f1 = 0.865
CURRENT SPLIT: 4
LR: accuracy = 0.910, precision = 0.916, recall = 0.910, f1 = 0.912
NB: accuracy = 0.940, precision = 0.939, recall = 0.940, f1 = 0.939
CURRENT SPLIT: 5
LR: accuracy = 0.940, precision = 0.940, recall = 0.940, f1 = 0.940
NB: accuracy = 0.930, precision = 0.929, recall = 0.930, f1 = 0.929
CURRENT SPLIT: 6
LR: accuracy = 0.900, precision = 0.897, recall = 0.900, f1 = 0.898
NB: accuracy = 0.930, precision = 0.928, recall = 0.930, f1 = 0.928
CURRENT SPLIT: 7
LR: accuracy = 0.920, precision = 0.924, recall = 0.920, f1 = 0.9

In the splits printed above, Naive Bayes scores higher than Logistic Regression on accuracy in every split except split 5. But does this hold with a different vectorizer? Let's try tf-idf.
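To put a number on the gap between the two classifiers under Bag of Words, the accuracies from the six fully printed splits can be compared directly. This is a quick tally under the assumption that those six splits are representative; the remaining splits are not shown in the output and are therefore left out.

```python
import numpy as np

# Accuracies copied from the six fully printed Bag of Words splits
lr_acc = np.array([0.911, 0.931, 0.851, 0.910, 0.940, 0.900])
nb_acc = np.array([0.931, 0.941, 0.871, 0.940, 0.930, 0.930])

# Per-split difference (positive means Naive Bayes did better)
diff = nb_acc - lr_acc
nb_wins = int((diff > 0).sum())

print("NB ahead in %d of %d splits" % (nb_wins, len(diff)))
print("mean difference = %.3f, std = %.3f" % (diff.mean(), diff.std(ddof=1)))
```

With so few splits and a small dataset, the spread of the differences matters as much as the mean when judging whether the gap is meaningful.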

### Tf-idf Model

In [19]:
run_skf(data, 'tfidf')

CURRENT SPLIT: 1
LR: accuracy = 0.921, precision = 0.919, recall = 0.921, f1 = 0.918
NB: accuracy = 0.851, precision = 0.875, recall = 0.851, f1 = 0.817
CURRENT SPLIT: 2
LR: accuracy = 0.921, precision = 0.924, recall = 0.921, f1 = 0.922
NB: accuracy = 0.861, precision = 0.882, recall = 0.861, f1 = 0.832
CURRENT SPLIT: 3
LR: accuracy = 0.832, precision = 0.835, recall = 0.832, f1 = 0.833
NB: accuracy = 0.861, precision = 0.882, recall = 0.861, f1 = 0.832
CURRENT SPLIT: 4
LR: accuracy = 0.920, precision = 0.929, recall = 0.920, f1 = 0.922
NB: accuracy = 0.860, precision = 0.881, recall = 0.860, f1 = 0.831
CURRENT SPLIT: 5
LR: accuracy = 0.940, precision = 0.940, recall = 0.940, f1 = 0.940
NB: accuracy = 0.870, precision = 0.888, recall = 0.870, f1 = 0.846
CURRENT SPLIT: 6
LR: accuracy = 0.910, precision = 0.909, recall = 0.910, f1 = 0.909
NB: accuracy = 0.880, precision = 0.883, recall = 0.880, f1 = 0.865
CURRENT SPLIT: 7
LR: accuracy = 0.930, precision = 0.931, recall = 0.930, f1 = 0.931
NB: accuracy =

While Naive Bayes outperforms Logistic Regression with a basic Bag of Words model, its performance drops off when using Tf-idf, especially with regard to the recall of the underrepresented class.

Still, the performance of Naive Bayes + Bag of Words remains better than Logistic Regression + Tf-idf (which improved on Logistic Regression + Bag of Words).

## Summary

From this trial, it appears that both Naive Bayes and Logistic Regression perform well on our data.

Naive Bayes is particularly promising, owing to its simplicity and the ease of additional training and modification. It is also attractive because it extends to labeling multiple classes and can continue incorporating new training samples into an existing model.
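One way to incorporate new samples into an existing model in scikit-learn is `MultinomialNB.partial_fit`. The sketch below is an assumption about how that could look rather than something evaluated in this notebook: it swaps the count vectorizer for a stateless `HashingVectorizer` so that the feature space does not change when new titles arrive, and the batches are made up for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer: no vocabulary to refit when new titles arrive.
# alternate_sign=False and norm=None keep the features as non-negative
# counts, which MultinomialNB expects.
hasher = HashingVectorizer(n_features=2**16, alternate_sign=False, norm=None)

incremental_nb = MultinomialNB()

# First hypothetical batch; all classes must be declared on the first call
batch_1 = ["delivery driver", "registered nurse"]
incremental_nb.partial_fit(hasher.transform(batch_1), [1, 0], classes=[0, 1])

# A later batch updates the same model without retraining from scratch
batch_2 = ["cashier part time", "software engineer"]
incremental_nb.partial_fit(hasher.transform(batch_2), [1, 0])

prediction = incremental_nb.predict(hasher.transform(["delivery driver"]))
print("samples seen per class:", incremental_nb.class_count_)
print("prediction for 'delivery driver':", prediction[0])
```

The trade-off of hashing is that individual features can no longer be mapped back to words, which gives up some of the interpretability that makes Naive Bayes appealing.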

Moving forward, as we seek to classify more than just skilled & gig positions, Naive Bayes should be considered.

It should be noted, however, that due to the small size of the training data currently available, it is premature to decide whether one classifier is clearly outperforming the other. Therefore, both Logistic Regression and Naive Bayes should be re-evaluated as more data becomes available.