# Final Project Demonstration

This notebook demonstrates the best iterations of the models we built over the course. We chose one well-performing model for each of our algorithms (logistic regression, support vector machine, stacking ensemble, bidirectional LSTM) and use them to estimate the probability that each of a variety of sample job postings is fraudulent. You can also try to classify the postings yourself; the text file for each posting is in the input folder.

Note: if this notebook cannot be run, an example of its output is saved in project_output.html.

The full output from the source code, including the model accuracy comparison used to choose the demo models and the analytical data, is saved in source_output.html.

The source code is in a notebook file. Generating the models and running predictions takes a long time (apart from logistic regression).

All generated models are saved in the generated folder, along with the dataframe, vectorizers, and original feature dataset.

In [None]:
#imports

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from tensorflow import keras
import contractions
import pickle
import numpy as np

In [None]:
#Tools to process raw text (stem, tokenize, remove stopwords)

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')  #raw string avoids an invalid escape sequence warning

In [None]:
#function to convert text into the form the models expect:
#expand contractions, tokenize, drop stopwords, lowercase and stem

def create_document(text):
    stem = [ps.stem(word.lower()) for word in tokenizer.tokenize(
        contractions.fix(text)) if word.lower() not in stop_words]
    return ' '.join(stem)
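
To make the shape of this preprocessing step concrete, here is a minimal standard-library sketch of a simplified version of it: tokenize on `\w+`, lowercase, and drop stopwords. It deliberately omits contraction expansion and Porter stemming so it runs without NLTK, so its output is not identical to what create_document returns. The names `TINY_STOP_WORDS` and `simple_document` are hypothetical, and the stopword set is a toy one, not the NLTK English list.

```python
import re

# Assumption: a toy stopword set for illustration only (not NLTK's English list).
TINY_STOP_WORDS = {"we", "are", "a", "for", "the", "and", "to"}


def simple_document(text):
    """Tokenize on runs of word characters, lowercase, and drop stopwords.

    Unlike create_document, this sketch does not expand contractions or stem.
    """
    tokens = re.findall(r"\w+", text)
    kept = [tok.lower() for tok in tokens if tok.lower() not in TINY_STOP_WORDS]
    return " ".join(kept)


print(simple_document("We are hiring a Data Analyst for the Chicago office!"))
# prints: hiring data analyst chicago office
```

Punctuation disappears because `\w+` only matches letters, digits, and underscores, which is also how the RegexpTokenizer above splits the text.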

In [None]:
#function to read a file from disk and return its text processed by create_document

def readFile(fname):
    #the with block closes the file even if reading fails
    with open(fname, 'r', encoding="utf8") as f:
        raw_text = f.read()
    return create_document(raw_text)

In [None]:
#load saved models and vectorizers from generated folder

def load_pickle(path):
    #close the file handle as soon as the object is loaded
    with open(path, 'rb') as f:
        return pickle.load(f)

logistic_model = load_pickle('generated/logistic_res_model.pkl')
svm_model = load_pickle('generated/svm_model_tuned_tfidf.pkl')
ensemble_model = load_pickle('generated/ensemble_res_model.pkl')
lstm_model = keras.models.load_model('generated/lstm_model')

Tfidf_vect = load_pickle('generated/Tfidf_vect.pkl')
vectorizer = load_pickle('generated/vectorizer.pkl')

Sample files are a mix of real and fake job postings.

Real: amazon, atlassian, northwestern, wayup

Fake: doterra, pacifictransfer, tikehau, cordova, fake_job
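
If you want to check a model's output against these known labels, one possible sketch is shown here. The labels are copied from the list above; the helper name `agrees_with_label`, the 0.5 cutoff, and the probabilities passed in are all assumptions made for illustration, not values produced or tuned in this project.

```python
# Known labels for the sample postings, copied from the list above.
SAMPLE_LABELS = {
    "amazon": "real",
    "atlassian": "real",
    "northwestern": "real",
    "wayup": "real",
    "doterra": "fake",
    "pacifictransfer": "fake",
    "tikehau": "fake",
    "cordova": "fake",
    "fake_job": "fake",
}


def agrees_with_label(fname, fraud_probability, threshold=0.5):
    """Return True if thresholding the probability matches the known label.

    Assumption: 0.5 is an illustrative cutoff, not one tuned in the project.
    """
    predicted = "fake" if fraud_probability >= threshold else "real"
    return predicted == SAMPLE_LABELS[fname]


# Hypothetical probabilities, only to show how the helper behaves.
print(agrees_with_label("doterra", 0.93))  # True: high probability, fake posting
print(agrees_with_label("amazon", 0.93))   # False: high probability, real posting
```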

In [None]:
#run predictions with each selected model on the chosen job posting

while True:
    #process user picked job posting
    fname = input('Enter a file (default=\'atlassian\', enter \'quit\' to stop): ').strip()
    if fname == 'quit':
        break
    if not fname:
        #fall back to the default so the name also shows up in the output
        fname = 'atlassian'
    try:
        job_text = readFile('input/'+fname+'.txt')
    except FileNotFoundError:
        print('\tError: File not found')
        continue

    #run predictions on each model using vectorization of user picked job posting
    logistic_predictions = logistic_model.predict_proba(
        vectorizer.transform([job_text]))
    svm_predictions = svm_model.predict_proba(
        Tfidf_vect.transform([job_text]))
    ens_predictions = ensemble_model.predict_proba(
        vectorizer.transform([job_text]))
    lstm_predictions = lstm_model.predict(np.array([job_text]))

    #output each model's predicted fraudulence probability as a percentage with 5 decimal places
    print("Fraudulence probability of", fname)
    print("\tLogistic -\t{0:.5%}".format(logistic_predictions[0, 1]))
    print("\tSVM -\t\t{0:.5%}".format(svm_predictions[0, 1]))
    print("\tEnsemble -\t{0:.5%}".format(ens_predictions[0, 1]))
    #take the first element explicitly; float() on a 1-element array is deprecated in recent NumPy
    print("\tLSTM -\t\t{0:.5%}".format(float(np.ravel(lstm_predictions)[0])))
    print()
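
As a small illustration of the `{0:.5%}` format used in the output above: the `%` presentation type multiplies the value by 100 and appends a percent sign, and `.5` asks for five digits after the decimal point. The probability below is a made-up value, not a model output.

```python
p = 0.123456789  # hypothetical probability, not produced by any model here

formatted = "{0:.5%}".format(p)
print(formatted)  # 12.34568%

# The equivalent f-string spelling gives the same text.
assert f"{p:.5%}" == formatted
```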