# Assignment 2: Milestone 1 Natural Language Processing
## Task 2 & 3
By Harold Davies

Date: 5/05/2024

Environment: Python 3 and Jupyter notebook

Libraries used:
* sklearn
* logging
* numpy
* gensim
* nltk
* scipy
* matplotlib
* joblib

## Introduction
After the job descriptions were preprocessed in Task 1, Task 2 generated feature sets from the descriptions using three different methods, and Task 3 used those feature sets, complemented with some additional details, to classify the job ads. The feature sets were a simple count vector of the words occurring in each description, and unweighted and TF-IDF weighted document vectors built from a Word2Vec model pretrained on the Google News 300 dataset. Once the weighted and unweighted vectors were generated, we took a random 50% sample of the 300-dimensional document vectors, reduced them to 2D using t-SNE, and made a scatterplot of the resulting two features to visualise how the target categories are distributed among the vectors. Each feature set was used to classify job advertisements with a Support Vector Machine (SVM) model, evaluated using stratified 5-fold cross-validation. Finally, the job titles were tokenized, vectors were built with the same pretrained model from the job titles alone and from the combination of job titles and descriptions, and model performance on these inputs was compared with that on the job descriptions alone.

## Importing libraries 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import logging
import gensim.downloader as api         #only required for initial model loading
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, StratifiedKFold
import nltk
from scipy import stats
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

## Function definitions

Code that is reused throughout this project has been wrapped in the functions below. See each function's docstring for further details.

In [None]:
def calcUnweightedVector(tokenized_text, model):
    """
    Function for calculating unweighted vectors based on inputs:
        a pretrained model, and 
        a tokenised document set
    function prints the tokens from documents where no words were present 
    in the pretraining vocabulary for that document
    """
    #generate array to store document vectors
    PT_textvecs = np.zeros((len(tokenized_text), model.vector_size))

    #for each document
    for i, doc in enumerate(tokenized_text):
        #valid keys are words in both the document and the model
        valid_keys = [term for term in doc if term in model.key_to_index]
        #if no words are in the model, report the document and leave its row as zeros
        if not valid_keys:
            print(f"None of the words from '{doc}' are in the pretrained model")
            continue
        #compile matrix of word vectors from the model for each word in the document
        docvec = np.vstack([model[term] for term in valid_keys])
        #sum columns to obtain document vector based on model
        docvec = np.sum(docvec, axis=0)     #possible alternatives to sum such as mean
        #save doc vector into row corresponding with document index
        PT_textvecs[i,:] = docvec
    return PT_textvecs
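
The comment in `calcUnweightedVector` mentions the mean as a possible alternative to the sum. A minimal sketch of the difference, assuming a hypothetical two-dimensional toy embedding (`toy_model` and `toy_doc_vector` are illustrative names, not part of the assignment): summed vectors grow with document length, while averaged vectors do not.

```python
import numpy as np

# Hypothetical 2-D toy embedding standing in for the pretrained model
toy_model = {
    "nurse": np.array([1.0, 0.0]),
    "ward": np.array([0.0, 1.0]),
}

def toy_doc_vector(tokens, model, how="sum"):
    """Combine the word vectors of known tokens by summing or averaging."""
    vecs = [model[t] for t in tokens if t in model]
    if not vecs:
        # no known words: fall back to a zero vector of the right size
        return np.zeros_like(next(iter(model.values())))
    stacked = np.vstack(vecs)
    return stacked.sum(axis=0) if how == "sum" else stacked.mean(axis=0)

toy_short = ["nurse", "ward"]
toy_long = ["nurse", "ward"] * 3   # same content, three times as long

sum_short = toy_doc_vector(toy_short, toy_model, "sum")
sum_long = toy_doc_vector(toy_long, toy_model, "sum")
mean_short = toy_doc_vector(toy_short, toy_model, "mean")
mean_long = toy_doc_vector(toy_long, toy_model, "mean")
print(sum_short, sum_long)
print(mean_short, mean_long)
```

Whether document length should influence the vector is a modelling choice; this sketch only shows that the two options behave differently.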

In [None]:
def compareFeatureSets(features_sets, features_set_names, target, random_seed):
    """
    Function for running different feature sets through the selected model
    Inputs:
        - list of feature sets
        - list of feature set names
        - target feature
        - random seed
    Outputs:
        - prints mean accuracies and corresponding feature sets
        - list of result lists for t-testing
    """
    #list to store lists of each fold's results for use in t-tests
    results = []
    #evaluation metric
    scoring_metric = 'accuracy'
    #ML model
    clf = SVC() 
    #cross validation method - stratified k fold
    cv_method = StratifiedKFold(n_splits=5,
                                shuffle=True, 
                                random_state=random_seed)
    for i, features in enumerate(features_sets):
        #cross validation
        cv_results_full = cross_val_score(  estimator=clf,
                                            X=features,
                                            y=target, 
                                            cv=cv_method, 
                                            scoring=scoring_metric)
        results.append(cv_results_full)
        mean_accuracy = cv_results_full.mean().round(3)
        print(f"Accuracy using SVM on {features_set_names[i]}: {mean_accuracy}")
    return results
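
To illustrate what "stratified" means in `StratifiedKFold`, here is a simplified pure-Python sketch that deals the shuffled indices of each class across the folds in turn. It is not scikit-learn's implementation, and the labels and the `toy_stratified_folds` helper are hypothetical.

```python
from collections import Counter, defaultdict
import random

def toy_stratified_folds(labels, n_splits, seed):
    """Assign each sample index to a fold so class proportions stay similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # deal the shuffled indices of this class round-robin across folds
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Hypothetical labels: 10 ads of one category, 5 of another
toy_labels = ["Sales"] * 10 + ["Engineering"] * 5
toy_folds = toy_stratified_folds(toy_labels, n_splits=5, seed=999)
for fold in toy_folds:
    print(Counter(toy_labels[i] for i in fold))
```

Each fold keeps the 2:1 class ratio of the full toy set, which is the property that makes per-fold accuracies comparable when categories are imbalanced.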

In [None]:
def tokenizeData(text):
    """
        This function tokenizes a raw text string.
    """
    #change text to lower case
    text_lower = text.lower()

    #regex expression provided in assignment spec
    exp = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

    #set tokenizer using regex expression
    tokenizer = nltk.RegexpTokenizer(exp)

    #apply tokenizer to text
    tokenised_job = tokenizer.tokenize(text_lower)

    #return tokenized text
    return tokenised_job
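
A quick way to see what the regular expression in `tokenizeData` keeps is to apply it with the standard library's `re.findall`, which, as far as I understand, is what `nltk.RegexpTokenizer` does with its default settings. The title below is a made-up example, not taken from the dataset.

```python
import re

# Pattern copied from tokenizeData
toy_pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

# Hypothetical job title
toy_title = "Senior Full-Stack Developer (client's site, 3 days)"
toy_tokens = re.findall(toy_pattern, toy_title.lower())
print(toy_tokens)
# → ['senior', 'full-stack', 'developer', "client's", 'site', 'days']
```

Digits and punctuation are dropped, and at most one internal hyphen or apostrophe is kept per token, so a phrase such as "state-of-the-art" is split into `state-of` and `the-art`.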

In [None]:
def plotTSNE(labels,features, random_seed): 
    """
    Function to plot 2-dimensional t-SNE representations of a 50% sample of document vectors, coloured according to their target labels
    Inputs:
        target labels as a list or numpy array
        feature vectors as a numpy array
        random seed
    """
    #convert labels to a numpy array (if it is a list)
    if isinstance(labels, list):
        labels = np.array(labels)
    targets = sorted(np.unique(labels))
    #t-SNE is computationally expensive so take just a 50% sample of features
    n_samples = int(len(features) * 0.5)
    #set random seed to ensure repeatability
    np.random.seed(random_seed)
    #select indices of features to use in t-SNE
    indices = np.random.choice(range(len(features)), size=n_samples, replace=False)
    #reduce selected features down to 2 dimensions using t-SNE
    plot_features = TSNE(n_components=2, random_state=random_seed).fit_transform(features[indices])
    #set colors for each target value
    colors = ['blue', 'green', 'yellow', 'orange']
    #for each unique target value plot observations' 2 features
    for i in range(0,len(targets)):
        points = plot_features[labels[indices] == targets[i]]
        plt.scatter(points[:, 0], points[:, 1], s=30, c=colors[i], label=targets[i])
    plt.title("2D Feature vector for each document",
              fontdict=dict(fontsize=15))
    plt.legend()
    plt.show()
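
The sampling step in `plotTSNE` selects rows (documents), not columns (embedding dimensions), and uses the same indices for the labels so that they stay aligned. A minimal sketch on hypothetical toy data; it uses `np.random.default_rng` rather than the global `np.random.seed` used in the function, which is an assumption of this example.

```python
import numpy as np

# Hypothetical stand-in for the document vectors: 6 documents, 4 features
toy_features = np.arange(24).reshape(6, 4)
toy_labels = np.array(["a", "b", "a", "b", "a", "b"])

rng = np.random.default_rng(999)
n_samples = int(len(toy_features) * 0.5)
# sample document indices without replacement
toy_indices = rng.choice(len(toy_features), size=n_samples, replace=False)

sampled_features = toy_features[toy_indices]   # half the rows, all columns
sampled_labels = toy_labels[toy_indices]       # labels stay aligned with rows
print(sampled_features.shape, sampled_labels.shape)
```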

In [None]:
#set random seed
rand = 999

## Task 2. Generating Feature Representations for Job Advertisement Descriptions

Before feature sets can be generated, the preprocessed and tokenized data from Task 1 needs to be imported into the notebook. Next, the job descriptions are re-tokenized and count and TF-IDF vectors are generated. This project uses the Word2Vec word embedding model pretrained on the Google News 300 dataset. Once the pretrained model has been downloaded, it can be saved and then loaded from file in future; for this project, the reloading code has been commented out. Once the model is loaded, we use it to generate unweighted and TF-IDF weighted document vectors for the job descriptions. We took a random 50% sample of the 300-dimensional document vectors built with the pretrained model, reduced them to 2D using t-SNE, then made scatterplots of the resulting two features to visualise how the target categories are distributed among the vectors. There was some distinct clustering of categories; however, it was also evident that sales job ads formed a particularly noisy group that would be difficult to differentiate from other ads.

### Importing data

In [None]:
#lists for all job details
job_titles = []                 #Title
job_ids = []                    #Webindex
job_companies = []              #Company
job_descriptions = []           #Description
job_categories = []             #Category

#open the cleaned text file
with open('jobs_clean.txt',"r",encoding = 'unicode_escape') as f:
    #for each line, save it to a list according to its index
    for i, line in enumerate(f):
        line = line.strip()
        if i % 6 == 0:
            job_titles.append(line)
        elif i % 6 == 1:
            job_ids.append(int(line))
        elif i % 6 == 2:
            job_companies.append(line)
        elif i % 6 == 4:
            job_descriptions.append(line)
        elif i % 6 == 5:
            job_categories.append(line)

In [None]:
#code to check file contents have been extracted correctly
ind = 400
print(f"Title: {job_titles[ind]}")              # Title
print(f"Webindex: {job_ids[ind]}")              # Webindex
print(f"Company: {job_companies[ind]}")         # Company
print(f"Description: {job_descriptions[ind]}")  # Description
print(f"Category: {job_categories[ind]}")       # Category

In [None]:
#extract vocab from file (the with block closes the file automatically)
vocab = []
with open('vocab.txt','r',encoding = 'unicode_escape') as f:
    for line in f:
        vocab.append(line.split(':')[0])

### Re-tokenize

In [None]:
#retokenize job descriptions
tokenized_jobs = [job_desc.split(' ') for job_desc in job_descriptions]

In [None]:
#check tokenization of running example
tokenized_jobs[ind]

### Count and TF-IDF Vectors

In [None]:
#generate count vector
cVectorizer = CountVectorizer(analyzer = "word",vocabulary = vocab)
count_features = cVectorizer.fit_transform(job_descriptions)
print(count_features.shape)

In [None]:
#generate TF-IDF vector
tVectorizer = TfidfVectorizer(analyzer = "word",vocabulary = vocab)
tfidf_features = tVectorizer.fit_transform(job_descriptions)
tfidf_features.shape
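
As I understand scikit-learn's defaults for `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), each weight is the raw term count multiplied by `ln((1 + n) / (1 + df)) + 1`, after which every document row is scaled to unit length. A pure-Python sketch of that calculation on a hypothetical toy corpus:

```python
import math

# Hypothetical toy corpus of pre-tokenised documents
toy_docs = [["sales", "manager", "sales"], ["nurse", "manager"]]
toy_vocab = ["manager", "nurse", "sales"]

n_docs = len(toy_docs)
# document frequency: number of documents containing each term
toy_df = {t: sum(t in d for d in toy_docs) for t in toy_vocab}
# smoothed inverse document frequency
toy_idf = {t: math.log((1 + n_docs) / (1 + toy_df[t])) + 1 for t in toy_vocab}

def toy_tfidf_row(doc):
    """Return the L2-normalised TF-IDF weights of one document."""
    raw = [doc.count(t) * toy_idf[t] for t in toy_vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

toy_rows = [toy_tfidf_row(d) for d in toy_docs]
for row in toy_rows:
    print([round(w, 3) for w in row])
```

The term "manager" appears in both toy documents, so it receives the minimum IDF of 1 and is down-weighted relative to the rarer terms.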

### Word2Vec word embedding model pretrained on GoogleNews300

In [None]:
#configure logging format
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
# load pre-trained word-vectors from gensim-data
pretrained_w2v_model = api.load('word2vec-google-news-300')

In [None]:
#save the model so we do not need to load it again
pretrained_w2v_model.save("googleNews300_w2v.model")

In [None]:
# # reload pretrained model from file (after model downloaded and saved)
# from gensim.models import KeyedVectors
# pretrained_w2v_model = KeyedVectors.load('googleNews300_w2v.model')

### Unweighted document vectors based on pretrained model

In [None]:
#use function defined at the top of this notebook to generate unweighted vectors
PT_w2v_jobvecs = calcUnweightedVector(tokenized_jobs, pretrained_w2v_model)

#sanity check the shape and verify I still understand what I am doing! xD
PT_w2v_jobvecs.shape

# -> matrix of document vectors based on pretrained model, one row per job description

In [None]:
#plot 2D representation of unweighted vectors for each document coloured according to its target label
plotTSNE(job_categories,PT_w2v_jobvecs, rand)

### TF-IDF weighted document vectors based on pretrained model

In [None]:
#create dictionary to map terms to vocab indices for tf-idf weight lookup
term_to_col = {term: col_idx for col_idx, term in enumerate(vocab)}

#generate array to store document vectors
PT_w2v_jobvecs_weighted = np.zeros((len(tokenized_jobs), pretrained_w2v_model.vector_size))

#for each document
for i, doc in enumerate(tokenized_jobs):
    #valid keys are words in the document, the model and the vocab
    valid_keys = [term for term in doc
                  if term in pretrained_w2v_model.key_to_index and term in term_to_col]
    #skip documents with no valid words, leaving their row as zeros
    if not valid_keys:
        continue
    #retrieve tf-idf weights for valid words in the document
    tf_weights = [tfidf_features[i, term_to_col[term]] for term in valid_keys]
    #sanity check to ensure data consistency
    assert len(valid_keys) == len(tf_weights)
    #compile model word vectors weighted by tfidf for each word in the document
    weighted = [pretrained_w2v_model[term] * w for term, w in zip(valid_keys, tf_weights)]
    #compile matrix of model word vectors
    docvec = np.vstack(weighted)
    #sum columns to obtain document vector based on model weighted by tfidf vector
    docvec = np.sum(docvec, axis=0)     #possible alternatives to sum such as mean 
    #save doc vector into row corresponding with document index
    PT_w2v_jobvecs_weighted[i,:] = docvec
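
Written out, the loop above appears to compute the following for document $i$, where $t_1, \dots, t_{n_i}$ are the tokens retained as valid keys (the symbols are my own notation, not the assignment's):

$$
\mathbf{d}_i = \sum_{j=1}^{n_i} w_{i,t_j}\,\mathbf{v}_{t_j}
$$

Here $\mathbf{v}_t$ is the pretrained Word2Vec vector of term $t$ and $w_{i,t}$ is the TF-IDF weight of $t$ in row $i$ of `tfidf_features`. Because the sum runs over token positions rather than unique terms, a term that occurs $k$ times contributes $k\,w_{i,t}\,\mathbf{v}_t$, so term frequency enters both through the TF-IDF weight and through the repetition.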

In [None]:
#matrix of weighted word vectors based on pretrained model for each job description
PT_w2v_jobvecs_weighted.shape

In [None]:
#plot 2D representation of weighted vectors for each document coloured according to its target label
plotTSNE(job_categories,PT_w2v_jobvecs_weighted, rand)

### Saving outputs
Save the count vector representation as per the project specification.

In [None]:
# #number of job descriptions
# num = count_features.shape[0]
# #text file to output count vector
# output_file = open('count_vectors.txt', 'w') # creates a txt file and open to save the vector representation
# #for each description
# for a_ind in range(0, num):
#     output_file.write("#{}".format(job_ids[a_ind]))
#     #for each non-zero word count
#     for f_ind in count_features[a_ind].nonzero()[1]:
#         #get the count
#         value = count_features[a_ind][0,f_ind]
#         #write the word index and count to the file
#         output_file.write(",{}:{}".format(f_ind,value))
#     #insert new line after last word index:count combination
#     output_file.write('\n')
# output_file.close()
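
The commented-out code above writes one line per job ad in the sparse form `#webindex,word_index:count,...`. A small sketch of that format with hypothetical helper functions (`toy_format_count_line` and `toy_parse_count_line` are not part of the assignment) and made-up values; the indices are sorted here only for readability.

```python
def toy_format_count_line(webindex, counts):
    """Format one document as '#webindex,index:count,...' (sparse form)."""
    parts = [f"#{webindex}"]
    for word_index in sorted(counts):
        parts.append(f"{word_index}:{counts[word_index]}")
    return ",".join(parts)

def toy_parse_count_line(line):
    """Invert toy_format_count_line, returning (webindex, {index: count})."""
    head, *pairs = line.strip().split(",")
    counts = {}
    for pair in pairs:
        word_index, count = pair.split(":")
        counts[int(word_index)] = int(count)
    return int(head.lstrip("#")), counts

# Hypothetical webindex and vocabulary indices
toy_line = toy_format_count_line(12345678, {17: 2, 3: 1, 250: 4})
print(toy_line)
# → #12345678,3:1,17:2,250:4
```

Parsing a line back with `toy_parse_count_line` is a cheap way to check that a saved file round-trips correctly.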

## Task 3. Job Advertisement Classification

Now that we have different sets of vectors built from the job descriptions, we can compare their performance. We will also build some sets of vectors for combinations of job title and job description to see how much our model is impacted by additional information. 

### Q1: Language model comparison

In [None]:
#define list of feature sets to compare
features_sets = [count_features, PT_w2v_jobvecs, PT_w2v_jobvecs_weighted]
#define list of feature set names for the print statements
features_set_names = ['count vectors', 'unweighted pretrained vectors', 'tfidf weighted pretrained vectors']
#save the results and compare the various input vector performances
results = compareFeatureSets(features_sets, features_set_names, job_categories, rand)

In [None]:
#calculate p-values for significance of difference in model results
print(stats.ttest_rel(results[1], results[0]).pvalue.round(3))
print(stats.ttest_rel(results[1], results[2]).pvalue.round(3))
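
For intuition about what `stats.ttest_rel` measures, the paired t statistic can be computed by hand from the per-fold differences. The accuracies below are hypothetical, not this notebook's results, and the sketch stops at the statistic rather than the p-value.

```python
import math
import statistics

# Hypothetical per-fold accuracies for two feature sets
toy_acc_a = [0.86, 0.88, 0.85, 0.87, 0.89]
toy_acc_b = [0.85, 0.86, 0.85, 0.84, 0.88]

# a paired test works on the per-fold differences
toy_diffs = [a - b for a, b in zip(toy_acc_a, toy_acc_b)]
toy_mean_diff = statistics.mean(toy_diffs)
toy_sd_diff = statistics.stdev(toy_diffs)      # sample standard deviation
toy_t = toy_mean_diff / (toy_sd_diff / math.sqrt(len(toy_diffs)))
toy_dof = len(toy_diffs) - 1
print(round(toy_t, 3), toy_dof)
```

With five folds there are only 4 degrees of freedom, for which the two-sided 5% critical value of Student's t is about 2.776, so even a consistent advantage of one or two percentage points per fold can fail to reach significance.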

### Q2: Does more information provide higher accuracy?

#### Vectors for only job titles

In [None]:
#use the tokenize function to tokenize the job titles and check against our running example
tokenized_titles = [tokenizeData(job_title) for job_title in job_titles]
tokenized_titles[ind]

In [None]:
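#use our function to calculate the unweighted vectors for the job titles; titles with no words in the pretrained model are printed
#note: despite its name, PT_w2v_job_descvecs holds the job title vectors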
PT_w2v_job_descvecs = calcUnweightedVector(tokenized_titles, pretrained_w2v_model)

In [None]:
#check we have the desired vector matrix shape
PT_w2v_job_descvecs.shape

#### Vectors for only job descriptions

In [None]:
#already done!
PT_w2v_jobvecs.shape

#### Vectors for job titles and descriptions

In [None]:
#initiate list for tokenized titles + descriptions
tokenized_title_desc = []

#for each combo of title and description, smack 'em together and whack 'em on the list
for title, desc in zip(tokenized_titles, tokenized_jobs):
    tokenized_title_desc.append(title + desc)

In [None]:
#let's check our favorite auto 'leccy job ad
tokenized_title_desc[ind]

In [None]:
#calculate the vectors to generate our new feature set
PT_w2v_job_title_descvecs = calcUnweightedVector(tokenized_title_desc, pretrained_w2v_model)
PT_w2v_job_title_descvecs.shape

In [None]:
#define list of feature sets to compare
features_sets = [PT_w2v_job_descvecs, PT_w2v_jobvecs, PT_w2v_job_title_descvecs]
#define list of feature set names for the print statements
features_set_names = ['pretrained vectors using titles only',
                      'pretrained vectors using descriptions only',
                      'pretrained vectors using titles and descriptions']
#save the results and compare the various input vector performances
results = compareFeatureSets(features_sets, features_set_names, job_categories, rand)

In [None]:
#calculate p-values for significance of difference in model results
print(stats.ttest_rel(results[2], results[0]).pvalue.round(3))
print(stats.ttest_rel(results[2], results[1]).pvalue.round(3))

In [None]:
#train SVM on the unweighted title and description vectors and save the model for project milestone 2
import joblib
model = SVC()
model.fit(PT_w2v_job_title_descvecs, job_categories)
joblib.dump(model, 'best_svm_model.pkl')

## Summary
After the job descriptions were preprocessed in Task 1, Task 2 generated feature sets from the descriptions using three different methods, and Task 3 used those feature sets, complemented with some additional details, to classify the job ads. The feature sets were a simple count vector of the words occurring in each description, and unweighted and TF-IDF weighted document vectors built from a Word2Vec model pretrained on the Google News 300 dataset. Once the weighted and unweighted vectors were generated, we plotted 2D representations of the feature sets to visualise how the target categories are distributed among the vectors. Although there was some distinct clustering of categories, it was evident that sales job ads formed a particularly noisy group that would be difficult to differentiate from other ads. Each feature set was used to classify job advertisements with a Support Vector Machine (SVM) model, evaluated using stratified 5-fold cross-validation. The unweighted pretrained vectors had the highest accuracy, with 86.6% of jobs correctly classified; however, the difference from the other feature sets was not statistically significant. Finally, the job titles were tokenized, vectors were built with the same pretrained model from the job titles alone and from the combination of job titles and descriptions, and model performance on these inputs was compared with that on the job descriptions alone. Titles and descriptions in combination provided the highest accuracy, followed by descriptions only; however, the difference in accuracy was again less than 2% and not statistically significant.

Recommendations for further analysis are to apply stemming and lemmatisation in the preprocessing phase to improve the quality of the input data, and to use repeated stratified k-fold cross-validation, which would give a more reliable estimate of the variance between models and input data options and support more robust significance testing.
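
The repeated stratified k-fold recommendation could be sketched with scikit-learn's `RepeatedStratifiedKFold`. The data here is synthetic (generated with `make_classification`) and the number of repeats is an arbitrary choice for illustration; in the notebook the inputs would be one of the feature sets and `job_categories`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data for a feature set and its target labels
toy_X, toy_y = make_classification(n_samples=200, n_features=10, random_state=999)

# 5 folds repeated 3 times gives 15 accuracy estimates instead of 5
toy_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=999)
toy_scores = cross_val_score(SVC(), toy_X, toy_y, cv=toy_cv, scoring="accuracy")
print(len(toy_scores), toy_scores.mean().round(3))
```

More score pairs give the paired comparison more degrees of freedom, although scores from overlapping training sets are not fully independent, so the resulting p-values should still be read with some caution.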