<h1>Pre-processing and Modeling</h1>

Topic modeling is an unsupervised machine learning technique used to detect words and phrases within documents and automatically cluster groups of words and similar expressions that best characterize a set of documents.

This NLP technique is useful for tasks such as text classification, extracting themes from documents, and building recommender systems that suggest related text, such as articles.

The topic modeling technique used here is LDA (Latent Dirichlet Allocation). The model starts by assigning each word to a random topic. The algorithm then iteratively reassigns each word to a new topic based on two quantities: the probability that the word belongs to a topic, and the probability that the document was generated by that topic.
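
The reassignment step described above can be sketched in plain Python. This is a minimal illustration that assumes a Gibbs-sampling style update with symmetric smoothing; the toy count tables, the values of alpha and beta, and the helper name topic_scores are all hypothetical and chosen only for the example. Note that scikit-learn's LatentDirichletAllocation, used later in this notebook, fits the model with variational Bayes rather than this sampling scheme.

```python
def topic_scores(word, doc_topic_counts, topic_word_counts, vocab_size,
                 alpha=0.1, beta=0.01):
    """Score each topic for one word in one document.

    score(t) is proportional to p(topic t | document) * p(word | topic t),
    both estimated from smoothed counts (alpha and beta are assumed priors).
    A full sampler would first remove the word's current assignment
    from the counts; that bookkeeping is omitted here.
    """
    n_topics = len(doc_topic_counts)
    doc_total = sum(doc_topic_counts)
    scores = []
    for t in range(n_topics):
        # How much of this document is already assigned to topic t
        p_topic_given_doc = (doc_topic_counts[t] + alpha) / (doc_total + n_topics * alpha)
        # How often topic t produces this word across the corpus
        topic_total = sum(topic_word_counts[t].values())
        p_word_given_topic = (topic_word_counts[t].get(word, 0) + beta) / (topic_total + vocab_size * beta)
        scores.append(p_topic_given_doc * p_word_given_topic)
    total = sum(scores)
    return [s / total for s in scores]  # normalize so the scores sum to 1


# Toy counts (made up): the document leans toward topic 0,
# and topic 0 uses the word "sql" while topic 1 never does.
doc_topic_counts = [8, 2]
topic_word_counts = [{"sql": 5, "data": 10}, {"sales": 7, "data": 3}]
probs = topic_scores("sql", doc_topic_counts, topic_word_counts, vocab_size=3)
print(probs)
```

In a sampler, the new topic for the word would be drawn from probs, and the count tables would be updated before moving to the next word.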

<h3>Import Packages</h3>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os
import warnings
warnings.filterwarnings('ignore')

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS

from sklearn.decomposition import LatentDirichletAllocation

from sklearn.model_selection import train_test_split

from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score

<h3>Load Data</h3>

In [2]:
from sqlalchemy import create_engine, MetaData, Table, select

engine = create_engine("sqlite:///joblist.sqlite")
metadata = MetaData()
data = Table('data', metadata, autoload=True, autoload_with=engine)
stmt = select([data.columns.jobdescription, data.columns.label])
connection = engine.connect()
results = connection.execute(stmt).fetchall()

df_data = pd.DataFrame(results)
df_data.columns = results[0].keys()
df_data['jobdescription'] = df_data['jobdescription'].astype('string')

In [3]:
df_data.head()

   jobdescription                                     label
0  Position Title:Pricing Analyst Position Type: ...  0
1  Title: Senior Data Analyst - Telephony Manager...  0
2  We are looking for a talented Fuel Cell Data E...  0
3  CAREER OPPORTUNITY SENIOR METER DATA ANALYST L...  0
4  The Data Engineer reports directly to the Dire...  0


In [4]:
df_data.describe(include='all')

        jobdescription                                     label
count   625                                                625.0
unique  553                                                NaN
top     Our student and new graduate programs offer a ...  NaN
freq    5                                                  NaN
mean    NaN                                                0.28
std     NaN                                                0.449359
min     NaN                                                0.0
25%     NaN                                                0.0
50%     NaN                                                0.0
75%     NaN                                                1.0


<h2>Count Vectorizer + Topic Modeling</h2>

LDA takes as input the text pre-processed by CountVectorizer, which encodes each document as a vector of integer counts, one per vocabulary term.

Here, we tokenize the text into unigrams and bigrams, remove English and French stop words, and keep only terms that appear in at least 4 documents (min_df=4) and in at most 99% of them (max_df=0.99).
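
To make the document-frequency filter concrete, here is a small sketch in plain Python rather than scikit-learn. The toy documents, the whitespace tokenizer, and the helper names ngrams and build_vocab are assumptions for illustration; CountVectorizer uses its own regex tokenizer and also applies the stop-word list and max_df, which are omitted here.

```python
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all n-grams up to n_max (unigrams and bigrams) as space-joined strings."""
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def build_vocab(docs, min_df=2):
    """Keep terms that occur in at least `min_df` distinct documents."""
    doc_freq = Counter()
    for doc in docs:
        # set(): a term counts once per document, however often it repeats
        doc_freq.update(set(ngrams(doc.lower().split())))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


docs = [
    "machine learning with python",
    "machine learning for business data",
    "business data analysis",
]
vocab = build_vocab(docs, min_df=2)
# One row per document, one integer count per vocabulary term
counts = [[Counter(ngrams(d.lower().split()))[t] for t in vocab] for d in docs]
print(vocab)
print(counts)
```

Terms such as "python" or "data analysis" occur in only one toy document, so they are dropped, while the bigram "machine learning" survives because it occurs in two.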

In [5]:
from nltk.corpus import stopwords

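# Combine scikit-learn's English stop words with NLTK's French list.
# Converted to a list because recent scikit-learn versions reject a frozenset for stop_words.
combined_stopwords = list(ENGLISH_STOP_WORDS.union(stopwords.words('french')))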

cv = CountVectorizer(analyzer='word',  
                     stop_words = combined_stopwords,
                     lowercase = True, 
                     min_df=4,
                     max_df = 0.99,
                     ngram_range=(1,2))
count_vector = cv.fit_transform(df_data['jobdescription'])

In [6]:
%%time
# Initialize LDA model with 10 topics
lda_model = LatentDirichletAllocation(n_components=10,
                                      random_state=42)

# Fit it to our CountVectorizer Transformation
X_topics = lda_model.fit_transform(count_vector)

# Define variables
n_top_words = 10
topic_summaries = []

# Get the topic words
topic_word = lda_model.components_

# Get the vocabulary from the text features
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
vocab = cv.get_feature_names_out()

# Display the Topic Models
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i, ' | '.join(topic_words)))

Topic 0: data | analytics | work | td | business | wattpad | experience | insights | team | marketing
Topic 1: data | business | experience | work | skills | reporting | analysis | business intelligence | intelligence | bi
Topic 2: data | business | experience | analytics | analysis | work | management | clients | solutions | skills
Topic 3: data | experience | business | requirements | quality | skills | project | work | years | knowledge
Topic 4: data | learning | experience | machine | machine learning | work | science | team | business | data science
Topic 5: data | business | team | work | experience | skills | sales | analysis | ability | strong
Topic 6: data | business | experience | work | management | team | skills | support | ability | knowledge
Topic 7: data | business | bmo | support | depth | management | experience | stakeholders | skills | group
Topic 8: experience | business | team | data | project | development | support | ability | solutions | customer
Topic 9: data |

import eli5



<h3>Create Train-Test Sets</h3>

In [7]:
train_data, test_data = train_test_split(df_data,
                               test_size=0.20, 
                               random_state=123)

In [8]:
y_train = train_data['label'].values
y_test = test_data['label'].values

In [9]:
y_train.shape

(500,)

In [10]:
y_test.shape

(125,)

<h2>Analyzing Job Descriptions: Baseline Model with Logistic Regression</h2>

Before fitting the classifier, we use TF-IDF to extract features from the job descriptions.
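
As a rough illustration of what TF-IDF weighting computes, the sketch below applies the formula that TfidfVectorizer uses with its default settings (smoothed IDF, then L2 normalization of each row) in plain Python. The toy corpus and the function name tfidf_rows are assumptions for the example; stop words, min_df, and n-grams are left out, and a simple whitespace split stands in for the real tokenizer.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency: rarer terms get larger weights
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw))
        rows.append([w / norm for w in raw])  # unit-length row
    return vocab, rows


docs = ["data analyst sql", "data engineer python", "data scientist python"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:10s} {weight:.3f}")
```

In the first toy document, "data" appears in every document and therefore receives a lower weight than "sql" or "analyst", which appear in only one.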

In [11]:
#nltk.download('stopwords')
from nltk.corpus import stopwords

# Initialize TFIDF Vectorizer
tvec = TfidfVectorizer(analyzer = 'word',  
                       stop_words = list(ENGLISH_STOP_WORDS.union(stopwords.words('french'))), 
                       lowercase= True, 
                       min_df=4, 
                       #max_df = 0.95,
                       ngram_range = (1,2))

In [12]:
# Fit the vectorizer on the training text only; the output below is the transformed training matrix
tvec.fit_transform(train_data['jobdescription'].values)

<500x12920 sparse matrix of type '<class 'numpy.float64'>'
	with 191658 stored elements in Compressed Sparse Row format>

In [13]:
# Extract features from x_train and x_test using TF-IDF
x_train_tfidf = tvec.transform(train_data['jobdescription'].values)
x_test_tfidf = tvec.transform(test_data['jobdescription'].values)

In [14]:
x_test_tfidf

<125x12920 sparse matrix of type '<class 'numpy.float64'>'
	with 45824 stored elements in Compressed Sparse Row format>

In [15]:
x_train_tfidf

<500x12920 sparse matrix of type '<class 'numpy.float64'>'
	with 191658 stored elements in Compressed Sparse Row format>

In [16]:
# Init log reg model
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(solver='liblinear')

In [17]:
model = logreg.fit(x_train_tfidf, y_train)

<h3>Cross-Validation</h3>

In [18]:
%%time
# Apply 10-fold cross-validation on the training set
clf_result = cross_val_score(model, x_train_tfidf, y_train, cv=10, scoring='accuracy')
print("The mean of cross validation is: ", clf_result.mean())

The mean of cross validation is:  0.8300000000000001

<h3>Evaluation Metrics of Test data</h3>

In [19]:
# Predict y values using x test values
y_pred = model.predict(x_test_tfidf)
precision, recall, fscore, support = score(y_test, 
                                            y_pred, 
                                            pos_label=1, 
                                            average ='binary')

accuracy = (y_pred == y_test).mean()
print("Classification Report: \nPrecision: {}, \nRecall: {}, \nF-score: {}, \nAccuracy: {}".format(
    round(precision, 3), round(recall, 3), round(fscore, 3), round(accuracy, 3)))

Classification Report: 
Precision: 0.895, 
Recall: 0.5, 
F-score: 0.642, 
Accuracy: 0.848


In [20]:
# Confusion matrix
confusion_matrix(y_test, y_pred)

array([[89,  2],
       [17, 17]])
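
The reported metrics can be recomputed by hand from the confusion matrix above, which helps explain why recall is much lower than precision. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, ordered 0 then 1); the variable names are only for illustration.

```python
# Confusion matrix from the test set: rows = true label, columns = predicted label
tn, fp = 89, 2
fn, tp = 17, 17

precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
fscore = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 3), round(recall, 3), round(fscore, 3), round(accuracy, 3))
# prints: 0.895 0.5 0.642 0.848
```

The model misses half of the positive postings (17 of 34) while rarely flagging a negative one, so accuracy alone overstates how well the positive class is detected.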

In [30]:
y_pred_proba = model.predict_proba(x_test_tfidf)
# predict_proba returns one probability per class, with columns ordered as in model.classes_.

# Rank classes by probability (ascending). With only two classes, the -5: slice keeps both columns.
top5 = np.argsort(y_pred_proba, axis=1)[:,-5:]

# Map column indices to class labels, then reverse so the most probable class comes first
top5_cat = [[model.classes_[predicted_cat] for predicted_cat in prediction] for prediction in top5]
top5_cat = [item[::-1] for item in top5_cat]
print(top5[:5])

[[1 0]
 [1 0]
 [1 0]
 [0 1]
 [1 0]]


In [29]:
type(top5)

numpy.ndarray