# Initial Model Building

Q: Is there a detectable difference between the Indeed.com job descriptions from different cities?

This question will be addressed by building an initial model (Naive Bayes) on the text of the job descriptions and looking for differences between the cities. This first model is deliberately simple, and is as much a test of the NLP pipeline as anything else.

In [156]:
from src.utils import import_data
from src.data_processing import Processing
import pandas as pd
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

In [162]:
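# Build the Naive Bayes model and report its mean 5-fold cross-validated accuracy
def simple_model(X, y, alpha=0.01):
    # alpha is the additive smoothing parameter; it must be passed by keyword
    model = MultinomialNB(alpha=alpha)
    m = np.mean(cross_val_score(model, X, y, cv=5))
    print("Model Accuracy = {:.4}".format(m))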

In [3]:
df = import_data("job-hunter-plus-data", "indeed_data.csv")
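# Make a copy to avoid having to download from the S3 bucket on every run.
# (Assumes import_data returns a pandas DataFrame; plain assignment would only create an alias.)
df2 = df.copy()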

In [160]:
# Build a stemmed, TF-IDF vectorized feature matrix, with class labels
# Two cities (NY, SF)
p = Processing(df2, num_cities=2)  # Instantiate processing class
p.snowball_stemmatizer()
p.tfidf_vectorize(min_df = 0.03, max_df = 0.98)
X = p.vectorize.transform(p.transformed)
y = p.labels
simple_model(X,y)

Model Accuracy = 0.8169
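
In the cell above, the feature matrix `X` is built from all documents before `cross_val_score` splits them, so the TF-IDF vocabulary and weights have already seen the held-out folds. The internals of `Processing` are not shown here, so the following is only a minimal sketch of an alternative using scikit-learn directly, assuming a tiny made-up corpus (the `docs` and `labels` below are hypothetical, not the Indeed.com data). Wrapping the vectorizer and classifier in a pipeline means the vectorizer is refit on the training portion of each fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the job descriptions
docs = [
    "fintech trading analyst wall street",
    "bank trading desk analyst finance",
    "hedge fund finance analyst trading",
    "finance bank risk analyst wall street",
    "startup engineer cloud software",
    "software engineer startup machine learning",
    "cloud platform engineer startup",
    "machine learning engineer software cloud",
]
labels = ["NY", "NY", "NY", "NY", "SF", "SF", "SF", "SF"]

# The vectorizer is part of the estimator, so it is fitted per fold
pipeline = make_pipeline(
    TfidfVectorizer(),          # default min_df/max_df; the corpus is too small to prune
    MultinomialNB(alpha=0.01),  # same smoothing value as simple_model
)

# cv=2 only because the toy corpus has four documents per class
scores = cross_val_score(pipeline, docs, labels, cv=2)
mean_accuracy = scores.mean()
print("Mean CV accuracy = {:.4}".format(mean_accuracy))
```

On the real data the same pipeline could take the stemmed or lemmatized text as input, with `min_df=0.03` and `max_df=0.98` passed to `TfidfVectorizer` to match the settings used in this notebook.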


In [161]:
# Build a lemmatized, TF-IDF vectorized feature matrix, with class labels
# Two cities (NY, SF)
p = Processing(df2, num_cities=2)  # Instantiate processing class
p.wordnet_lemmatizer()
p.tfidf_vectorize(min_df = 0.03, max_df = 0.98)
X = p.vectorize.transform(p.transformed)
y = p.labels
simple_model(X, y)

Model Accuracy = 0.8071


In [176]:
# Build a stemmed, TF-IDF vectorized feature matrix, with class labels
# Three cities (NY, SF, Chicago)
p = Processing(df2, num_cities=3)  # Instantiate processing class
p.snowball_stemmatizer()
p.tfidf_vectorize(min_df = 0.03, max_df = 0.98)
X = p.vectorize.transform(p.transformed)
y = p.labels
simple_model(X,y)

Model Accuracy = 0.7293


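In conclusion, a Naive Bayes model trained only on the description text predicts the city with a cross-validated accuracy of about 0.82 for two cities (NY, SF) and about 0.73 for three (NY, SF, Chicago). This is evidence that job descriptions differ between cities, and that these differences can be used to build predictive models with reasonable out-of-the-box accuracy, considering the nature of the data being used.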
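
How strong this evidence is depends on the class balance, which is not shown in this notebook: a model that always predicts the most common city already scores that city's share of the data. The following is a minimal sketch of that majority-class baseline, using hypothetical label counts for illustration only (the function name and the example labels are not from the original notebook).

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical labels: 60 NY postings and 40 SF postings
example_labels = ["NY"] * 60 + ["SF"] * 40
baseline = majority_baseline_accuracy(example_labels)
print("Majority-class baseline = {:.4}".format(baseline))
```

Applied to the real labels (`p.labels`), this gives the figure that the cross-validated accuracies above would need to exceed.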

From these initial results, Snowball stemming (0.8169) scored slightly higher than WordNet lemmatization (0.8071) on the two-city task, so Snowball stemming followed by tokenization prior to TF-IDF might be a good methodology to use.