<h1>Part 3 : Modeling of Pre-processed Text Data</h1>

<h3>Import Packages</h3>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os
import warnings
warnings.filterwarnings('ignore')

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score

<h3>Load Data</h3>

In [2]:
from sqlalchemy import create_engine, MetaData, Table, select

engine = create_engine("sqlite:///joblist.sqlite")
metadata = MetaData()
data = Table('data', metadata, autoload=True, autoload_with=engine)
stmt = select([data.columns.jobdescription, data.columns.label])
connection = engine.connect()
results = connection.execute(stmt).fetchall()

df_data = pd.DataFrame(results)
df_data.columns = results[0].keys()
df_data['jobdescription'] = df_data['jobdescription'].astype('string')

<h3>Load Pre-Processing Steps</h3>

In [3]:
custom_stopwords = ['bachelor', 'degree', 'work', 'equal', 'opportunity', 'employer', 'objectives', 'ontario', 'canada', 'disability', 'strong', 'including', 'ensure', 'understanding', 'related']

# Initialize TFIDF Vectorizer
tvec = TfidfVectorizer(analyzer = 'word',  
                       stop_words = ENGLISH_STOP_WORDS.union(custom_stopwords), 
                       lowercase= True, 
                       min_df=2)

def tfidf_pipeline(txt):
    x = tvec.fit_transform(txt) # Apply Vectorizer, Stopword Removal, & Lowercasing
    return x 

tvec2 = TfidfVectorizer(analyzer = 'word', 
                       stop_words = ENGLISH_STOP_WORDS.union(custom_stopwords), 
                       lowercase= True, 
                       ngram_range=(2,3), 
                       min_df=4)

def tfidf2_pipeline(txt):
    x = tvec2.fit_transform(txt) # Apply Vectorizer, Stopword Removal, Lowercasing, & select Bi-/Tri-grams
    return x 

<h3>Create Train & Test Sets</h3>

In [4]:
# Create a training and testing data sets
x_train, x_test, y_train, y_test = train_test_split(df_data['jobdescription'],df_data['label'],test_size=0.20, random_state=123, stratify=df_data['label'])

In [6]:
x_train

574    Role Description:
The Messaging Specialist wil...
597    Req Id: 278054
At Bell, we do more than build ...
177    Job description: Skills: · Total of 10 years i...
228    Are you a kid at heart looking to build a care...
318    SENIOR ANALYST, DATA TEAM (Temporary Full-Time...
                             ...                        
309    BUSINESS INTELLIGENCE ANALYST  RESPONSIBILITIE...
78     Category: Business Analysis (functional and te...
207    Requisition ID: 97894  Join the Global Communi...
219    As a leading mobile games developer, Jam City ...
116    Job Title: Data Analyst Job Description:  Job ...
Name: jobdescription, Length: 500, dtype: string

In [17]:
x_train[309]

'BUSINESS INTELLIGENCE ANALYST  RESPONSIBILITIES Our analyst will be contributing to preparing budgets, forecasts, revenue and expense analysis and business trend analysis. You will provide senior management with financial and business trend reporting used for planning, strategic and tactical decision making. You will also be responsible for the development and maintenance of numerous reports that will point to either data anomalies, business trends or simply over/underspending. The areas covered may include financials, work requests, utilities, capital planning, project management, real estate transactions and others. You will need to analyze the data, make recommendations as appropriate and present the findings.  Experienced building reports using Power BI with live feed from Data Warehouse and be able to write complex queries within SQL Act as Business analyst to comprehend the new requirement for building the new reports Identify tactical and strategic opportunities, gaps, and fina

In [7]:
# Apply pre-processing
# Unigrams
x_train_TFIDF = tfidf_pipeline(x_train)
x_test_TFIDF = tfidf_pipeline(x_test)

In [9]:
# Apply pre-processing
# Bi- & tri-grams
x_train_TFIDF2 = tfidf2_pipeline(x_train)
x_test_TFIDF2 = tfidf2_pipeline(x_test)

In [10]:
print("Original training set shape: ", x_train.shape)
print("Preprocessed training set for unigrams: ", x_train_TFIDF.shape)
print("Preprocessed training set for n-grams: ", x_train_TFIDF2.shape)

Original training set shape:  (500,)
Preprocessed training set for unigrams:  (500, 6536)
Preprocessed training set for n-grams: : (500, 12727)


In [18]:
print("Shape of y_test: ", y_test.shape)

Shape of y_test:  (125,)


<h3>Modeling</h3>

In [12]:
# Initialize and fit Random Forest Classifier model
rfc = RandomForestClassifier(n_estimators=100, class_weight='balanced')
rfc_model = rfc.fit(x_train_TFIDF,y_train)

<h3>Cross-Validation</h3>

In [13]:
# Apply 10-fold cross-validation
rfc_result = cross_val_score(rfc_model, x_train_TFIDF, y_train, cv=10, scoring='accuracy')
print("The mean of cross validation is: ", rfc_result.mean())

The mean of cross validation is:  0.852


<h3>Evaluation on Test Data</h3>

In [15]:
# Predict y values using x train values
y_pred = rfc_model.predict(x_test_TFIDF)
precision, recall, fscore, support = score(y_test, 
                                            y_pred, 
                                            pos_label=1, 
                                            average ='binary')

print("Classification Report: \nPrecision: {}, \nRecall: {}, \nF-score: {}, \nAccuracy: {}".format(round(precision,3),round(recall,3),round(fscore,3),round((y_pred==y_test).sum()/len(y_test),3)))


# ValueError: Number of features of the model must match the input. 
# Model n_features is 6536 (x_train_TFIDF shape) and input n_features is 3213 

ValueError: Number of features of the model must match the input. Model n_features is 6536 and input n_features is 3213 

In [None]:
# Confusion matrix
confusion_matrix(y_test, y_pred)