# Milestone I Natural Language Processing
## Task 2 & 3
#### Osama Alfawzan

Date: 02/10/2022

Environment: Python 3 and Jupyter notebook

Libraries used:
* pandas
* re
* numpy
* sklearn
* gensim.downloader
* logging

## Introduction
In this task I build several feature representations for the collection of job advertisements, using only the job description. First, a bag-of-words model produces a count vector for each job advertisement description, which I save to a file called count_vectors.txt. Second, I generate TF-IDF weighted and unweighted vector representations for each description using the GoogleNews300 pre-trained embedding model. In Task 3 I then build machine learning models that classify the category of a job advertisement text, using the logistic regression model from sklearn. Finally, I run two sets of experiments on the provided dataset: one compares the three language models, and the other investigates whether more information leads to higher accuracy.

## Importing libraries 

In [1]:
# Import the libraries needed in this assessment
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_files
import gensim.downloader as api
import pandas as pd
import numpy as np
import logging
import re

In [2]:
# to observe the progress of downloading GoogleNews300 model
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Task 2. Generating Feature Representations for Job Advertisement Descriptions

In the following code block, I read the Task 1 output files for the preprocessed job advertisements into a container that includes all the folders, subfolders, and `.txt` files. The data folder holds 4 job categories, `Accounting_Finance`, `Engineering`, `Healthcare_Nursing`, and `Sales`, which correspond to the labels [0, 1, 2, 3] in that order.

In [3]:
# Reading the job advertisements .txt files
job_data = load_files(r"preprocessed_data") # reading the files from the preprocessed data folder
print('Number of folders in the preprocessed data folder:',len(job_data.target_names)) # number of folders(job categories)
print('Job categories in the preprocessed data folder:', ', '.join(job_data.target_names)) # names of the job categories
print('Number of job advertisments:', len(job_data.data))

# Reading the vocab.txt file to use it as a base for CountVectorizer
with open('vocab.txt') as f:
    vocab_raw = [l.split(':') for l in f.read().splitlines()]
vocab = [w[0] for w in vocab_raw] # using list comprehension to remove the index of vocabs

Number of folders in the preprocessed data folder: 4
Job categories in the preprocessed data folder: Accounting_Finance, Engineering, Healthcare_Nursing, Sales
Number of job advertisments: 776


In the next cell, the code puts the job advertisement fields into separate lists, which in my opinion is the most suitable data type for this task. The same index in every list refers to the same job advertisement.

In [4]:
# Extracting Titles, Webindices, and Descriptions into separate lists
titles = []
webindices = []
descriptions = []
categories = []

for i in job_data.target:
    categories.append(i)

for j in job_data.data:
    text = j.decode() # load_files returns bytes, so decode each file to a string once
    titles.append(re.findall(r"(?<=Title: ).*(?=\r\n)", text)) # the title of each job, matched up to the line ending
    webindices.append(re.findall(r"(?<=Webindex: )\d+", text)) # the digits that follow "Webindex: "
    descriptions.append(re.findall(r"(?<=Description: ).+", text)) # the description text that follows "Description: "
print('Number of job descriptions in the list:', len(descriptions))
print('\nFirst job advertisment in job_data:\n')
print('Title:',*titles[0],'\nWebindex:',*webindices[0],'\nDescription:',*descriptions[0])

Number of job descriptions in the list: 776

First job advertisment in job_data:

Title: Finance / Accounts Asst Bromley to ****k 
Webindex: 68997528 
Description: accountant partqualified south east london manufacturing requirement accountant permanent modern offices south east london credit control purchase ledger daily collection debts phone letter email handling ledger accounts handling accounts negotiating payment terms cash reconciliation accounts adhoc administration duties person ideal previous credit control capacity possess exceptional customer communication part fully qualified accountant considered
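
The field extraction above can be tried on a single in-memory record. The sketch below is only an illustration: `sample` is a made-up byte string in the same layout, and `extract_field` is a hypothetical helper. It uses `re.search`, which returns one match rather than a list, so no flattening would be needed afterwards. Like the title pattern in the cell above, it assumes Windows line endings.

```python
import re

# Hypothetical record in the same layout as the Task 1 output files
sample = b"Title: Fund Accountant\r\nWebindex: 12345678\r\nDescription: hedge funds london"

def extract_field(text, pattern):
    """Return the first match of `pattern` in `text`, or None if absent."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

text = sample.decode()  # the files are read as bytes, so decode once per record
title = extract_field(text, r"(?<=Title: ).*(?=\r\n)")
webindex = extract_field(text, r"(?<=Webindex: )\d+")
description = extract_field(text, r"(?<=Description: ).+")
print(title, webindex, description, sep=" | ")
```

Returning `None` for a missing field makes malformed files easy to spot, whereas an empty list from `re.findall` silently disappears when the lists are flattened.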


### Bag-of-words model:

A count vector records, in `word:count` form, how many times each vocabulary word appears in a document. The following code generates the count vectors for the job descriptions using the `CountVectorizer` class from the `sklearn` library.

In [5]:
flat_descriptions = [d for desc in descriptions for d in desc] # flatten the list of lists of descriptions
cVectorizer = CountVectorizer(analyzer = "word",vocabulary = vocab) # initialising the CountVectorizer
count_features = cVectorizer.fit_transform(flat_descriptions) # generate the count vector representation for all descriptions
print(count_features.shape)

(776, 5168)
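
To make the `word:count` idea concrete, here is a minimal sketch on toy data. `toy_vocab` and `toy_docs` are hypothetical stand-ins for `vocab.txt` and the description list. With a fixed vocabulary passed as a list, column `j` of the result corresponds to the `j`-th vocabulary word, and words outside the vocabulary are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for vocab.txt and the job descriptions (hypothetical data)
toy_vocab = ["accountant", "ledger", "nurse"]
toy_docs = ["accountant ledger ledger", "nurse care nurse"]

vectorizer = CountVectorizer(analyzer="word", vocabulary=toy_vocab)
counts = vectorizer.fit_transform(toy_docs)  # sparse matrix of shape (2, 3)

for row in counts.toarray():
    # pair each vocabulary word with its count, skipping zero counts
    print({word: int(c) for word, c in zip(toy_vocab, row) if c > 0})
```

The word "care" is not in `toy_vocab`, so it contributes nothing to the second row.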


### Pre-trained word2vec model based on GoogleNews300:

To make the data easier to handle in the coming tasks, I create a pandas data frame for the job advertisements, with each required job feature as a column. The columns are `Webindex`, `Title`, `Description`, `tk_Descriptions`, and `Category`. `tk_Descriptions` holds the list of tokens for each job description.

In [6]:
# create a data frame for the job advertisements for easier handling of the data
titles = [t for title in titles for t in title] # flatten the list of titles
webindices = [w[0] for w in webindices] # flatten the list of webindices
tk_descriptions = [d.split(' ') for d in flat_descriptions] # create a list of description tokens
# dictionary of column names and their values
df_dic = {'Webindex': webindices, 'Title':titles, 'Description':flat_descriptions,
          'tk_Descriptions':tk_descriptions, 'Label':categories}
jobs_df = pd.DataFrame(df_dic) # initialise the df with our dictionary
# fill the Category column with the category name of each numeric label (labels 0-3 index into job_data.target_names)
jobs_df['Category'] = jobs_df['Label'].map(dict(enumerate(job_data.target_names)))
jobs_df = jobs_df.drop('Label', axis=1) # dropping the labels column as it is redundant
jobs_df.head(5)

2022-10-02 15:19:49,650 : INFO : Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-10-02 15:19:49,650 : INFO : NumExpr defaulting to 8 threads.


| | Webindex | Title | Description | tk_Descriptions | Category |
|---|---|---|---|---|---|
| 0 | 68997528 | Finance / Accounts Asst Bromley to ****k | accountant partqualified south east london man... | [accountant, partqualified, south, east, londo... | Accounting_Finance |
| 1 | 68063513 | Fund Accountant Hedge Fund | hedge funds london recruiting fund accountant ... | [hedge, funds, london, recruiting, fund, accou... | Accounting_Finance |
| 2 | 68700336 | Deputy Home Manager | exciting arisen establish provider elderly car... | [exciting, arisen, establish, provider, elderl... | Healthcare_Nursing |
| 3 | 67996688 | Brokers Wanted Imediate Start | expanding recruiting junior trainee brokers ci... | [expanding, recruiting, junior, trainee, broke... | Accounting_Finance |
| 4 | 71803987 | RGN Nurses (Hospitals) Penarth | rgn nurses hospitals fulltime part swiis hour ... | [rgn, nurses, hospitals, fulltime, part, swiis... | Healthcare_Nursing |


#### TF-IDF weighted vector representation

We can obtain pre-trained embeddings from the web. We begin by downloading the pre-trained word2vec model from Google News.

In [7]:
# Loading the pretrained model
preTW2v = api.load('word2vec-google-news-300')

2022-10-02 15:19:49,724 : INFO : loading projection weights from C:\Users\Osama/gensim-data\word2vec-google-news-300\word2vec-google-news-300.gz
2022-10-02 15:20:27,666 : INFO : KeyedVectors lifecycle event {'msg': 'loaded (3000000, 300) matrix of type float32 from C:\\Users\\Osama/gensim-data\\word2vec-google-news-300\\word2vec-google-news-300.gz', 'binary': True, 'encoding': 'utf8', 'datetime': '2022-10-02T15:20:27.666161', 'gensim': '4.2.0', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 'event': 'load_word2vec_format'}


In [8]:
tVectorizer = TfidfVectorizer(analyzer = "word",vocabulary = vocab) # initialising the TfidfVectorizer
tfidf_features = tVectorizer.fit_transform(flat_descriptions) # generate the tfidf vector representation for all descriptions
tfidf_features.shape

(776, 5168)

The following code stores the TF-IDF representation in `word:weight` format in a list of dictionaries called `tfidf_weights`, which is used later to generate the weighted vectors. Each list item is a dictionary for one job description, with vocabulary words as keys and their TF-IDF weights as values.

In [9]:
# Create a list of dictionaries for tfidf as word:weight
num = tfidf_features.shape[0]
tfidf_array = tfidf_features.toarray() # convert to a dense array once instead of once per description
tfidf_weights = [] # empty list of dictionaries for tfidf weights
for i in range(0, num):
    wordweight_dict = {} # initialise an empty dictionary for each description
    for word, value in zip(vocab, tfidf_array[i]):
        if value > 0:
            wordweight_dict[word] = value # store the weight as the value under the vocabulary word as key
    tfidf_weights.append(wordweight_dict) # append the dictionary to the weights list
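
The cell above goes through a dense array, which means visiting every vocabulary word for every description. As a possible alternative, the sketch below reads only the non-zero entries of each row directly from the CSR storage (`indptr`, `indices`, `data`). The toy vocabulary and documents are hypothetical, and `toy_weights` plays the role of `tfidf_weights`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for vocab.txt and the job descriptions (hypothetical data)
toy_vocab = ["accountant", "ledger", "nurse"]
toy_docs = ["accountant ledger ledger", "nurse care nurse"]

tfidf = TfidfVectorizer(analyzer="word", vocabulary=toy_vocab).fit_transform(toy_docs)
tfidf = tfidf.tocsr()  # make sure the row-oriented CSR layout is used

toy_weights = []
for i in range(tfidf.shape[0]):
    start, end = tfidf.indptr[i], tfidf.indptr[i + 1]  # slice of row i in the CSR arrays
    cols = tfidf.indices[start:end]                    # column indices of the stored entries
    vals = tfidf.data[start:end]                       # their TF-IDF weights
    toy_weights.append({toy_vocab[j]: float(v) for j, v in zip(cols, vals)})
print(toy_weights)
```

Because only stored entries are visited, the work per description scales with the number of distinct vocabulary words it contains rather than with the full vocabulary size.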

The following `weighted_vectors` function takes the word embeddings dictionary, the tokenized text of the descriptions, and the TF-IDF weights (a list of word:weight dictionaries, one per description) as arguments, and generates the job advertisement embeddings:
1. it creates an empty dataframe `docs_vectors` to store the embeddings of the descriptions
2. it loops through every tokenized text:
    - creates an empty dataframe `temp` to store all the word embeddings of the description
    - for each distinct word that exists in the word embeddings dictionary/KeyedVectors:
        - if the `tfidf` argument is empty (`[]`), it sets the weight of the word to 1
        - otherwise, it retrieves the weight of the word from the corresponding word:weight dictionary in `tfidf` (words without a weight are skipped)
        - it row binds the weighted word embedding to `temp`
    - takes the sum of each column of `temp` to create the document vector
    - appends the created document vector to `docs_vectors`

In [10]:
# generate weighted vector representation for job advertisements
def weighted_vectors(wv,tk_txts,tfidf = []):
    docs_vectors = pd.DataFrame() # creating empty final dataframe

    for i in range(0,len(tk_txts)):
        tokens = list(set(tk_txts[i])) # get the list of distinct words of the description

        temp = pd.DataFrame()  # temporary dataframe holding the word vectors of the current document only
        for w_ind in range(0, len(tokens)): # looping through each distinct word of a single document
            try:
                word = tokens[w_ind]
                word_vec = wv[word] # raises KeyError if the word is not in the embeddings

                if tfidf != []:
                    word_weight = float(tfidf[i][word]) # raises KeyError if the word has no TF-IDF weight
                else:
                    word_weight = 1
                # row bind with pd.concat (DataFrame.append no longer exists in pandas 2.0+)
                temp = pd.concat([temp, pd.DataFrame([word_vec*word_weight])], ignore_index = True)
            except KeyError:
                pass # skip words without an embedding or a weight
        doc_vector = temp.sum() # take the sum of each column (one column per embedding dimension)
        docs_vectors = pd.concat([docs_vectors, doc_vector.to_frame().T], ignore_index = True) # append each document vector to the final dataframe
    return docs_vectors

In [11]:
weighted_preTW2v = weighted_vectors(preTW2v,jobs_df['tk_Descriptions'],tfidf_weights)
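
The weighted sum can also be checked by hand on a tiny example. The sketch below is a simplified NumPy variant, not the function used in this notebook: `doc_vectors`, the 3-dimensional `toy_wv` embeddings, and the `dim` parameter are all hypothetical. It follows `weighted_vectors` in summing over distinct words and in skipping words that lack an embedding or a weight, but it returns a zero vector (rather than missing values) for a document with no usable words.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; the notebook uses 300-dimensional KeyedVectors
toy_wv = {"accountant": np.array([1.0, 0.0, 0.0]),
          "ledger": np.array([0.0, 1.0, 0.0])}
toy_tokens = [["accountant", "ledger", "ledger", "unknownword"]]
toy_tfidf = [{"accountant": 0.5, "ledger": 2.0}]

def doc_vectors(wv, tk_txts, tfidf=None, dim=3):
    """Sum the (optionally TF-IDF weighted) embeddings of each document's distinct words."""
    out = np.zeros((len(tk_txts), dim))
    for i, tokens in enumerate(tk_txts):
        for word in set(tokens):
            if word not in wv:
                continue  # skip out-of-vocabulary words
            if tfidf is None:
                weight = 1.0  # unweighted: every distinct word counts once
            elif word in tfidf[i]:
                weight = tfidf[i][word]
            else:
                continue  # no TF-IDF weight available for this word
            out[i] += weight * wv[word]
    return out

print(doc_vectors(toy_wv, toy_tokens, toy_tfidf))  # weighted sum
print(doc_vectors(toy_wv, toy_tokens))             # unweighted sum over distinct words
```

Accumulating into a preallocated array avoids building a temporary dataframe per document, which may matter when the collection is large.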

#### Unweighted vector representation

The following function takes two arguments:
* `wv`, a word:embedding dictionary or KeyedVectors; and
* `tk_txts`, a list of tokenized texts, one per job description.

It then generates one embedding vector per description as follows:
1. it creates an empty dataframe `docs_vectors` to store the embeddings of the descriptions
2. it loops through every tokenized text:
    - creates an empty dataframe `temp` to store all the word embeddings of the description
    - for each token that exists in the word embeddings dictionary/KeyedVectors (repeated words are counted each time), row binds the word embedding to `temp`
    - takes the sum of each column of `temp` to create the document vector
    - appends the created document vector to `docs_vectors`

In [12]:
def unweighted_vectors(wv,tk_txts): # generate unweighted vector representation for job advertisements
    docs_vectors = pd.DataFrame() # creating empty final dataframe
    
    for i in range(0,len(tk_txts)):
        tokens = tk_txts[i]
        temp = pd.DataFrame()  # temporary dataframe holding the word vectors of the current document only
        for w_ind in range(0, len(tokens)): # looping through each token of a single document
            try:
                word = tokens[w_ind]
                word_vec = wv[word] # raises KeyError if the word is not in the embeddings
                # row bind with pd.concat (DataFrame.append no longer exists in pandas 2.0+)
                temp = pd.concat([temp, pd.DataFrame([word_vec])], ignore_index = True)
            except KeyError:
                pass # skip words without an embedding
        doc_vector = temp.sum() # take the sum of each column
        docs_vectors = pd.concat([docs_vectors, doc_vector.to_frame().T], ignore_index = True) # append each document vector to the final dataframe
    return docs_vectors

In [13]:
# unweighted vector representation for job advertisements descriptions only
unweighted_preTW2v = unweighted_vectors(preTW2v,jobs_df['tk_Descriptions'])
unweighted_preTW2v.isna().any().sum() # check whether there is any null values in the document vectors dataframe.

0

### Saving outputs
Save the count vector representation.
- count_vectors.txt

In [14]:
# code to save output data...
def write_vectorFile(job_id,data_features,filename):
    num = data_features.shape[0] # Number of job descriptions
    with open(filename, 'w') as out_file: # creates a txt file and opens it to save the count vectors
        for a_ind in range(0, num): # loop through each job description by index
            out_file.write('#{}'.format(job_id[a_ind])) # write the job webindex at the beginning of each line
            for f_ind in data_features[a_ind].nonzero()[1]: # for each word index that has a non-zero entry in data_features
                value = data_features[a_ind][0,f_ind] # retrieve the value of the entry from data_features
                out_file.write(",{}:{}".format(f_ind,value)) # write the entry to the file in the format of word_index:value
            out_file.write('\n') # start a new line after each job description

write_vectorFile(webindices,count_features,"count_vectors.txt") # write the count vector to file
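
Each line written by `write_vectorFile` has the form `#webindex,word_index:count,...`. As a quick way to check the format, the sketch below parses one such line back into its parts. `parse_vector_line` is a hypothetical helper and `example_line` is a made-up line, not one taken from the real file.

```python
def parse_vector_line(line):
    """Split '#webindex,idx:count,...' into (webindex, {idx: count})."""
    head, *entries = line.strip().split(",")
    webindex = head.lstrip("#")  # drop the leading '#'
    counts = {}
    for entry in entries:
        idx, value = entry.split(":")
        counts[int(idx)] = int(value)  # word index -> count
    return webindex, counts

example_line = "#68997528,12:1,40:2\n"  # hypothetical line in the output format
print(parse_vector_line(example_line))
```

A description with no vocabulary words would produce a line containing only `#webindex`, which this helper returns with an empty dictionary.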

## Task 3. Job Advertisement Classification

### Language model comparisons

#### The count vectors language model

In [15]:
# Testing the count vector language model with Description of job advertisements as the feature for the model
seed = 0
cv = KFold(n_splits=5, random_state=seed, shuffle=True) # setting k-fold to 5 for cross validation
# creating training and test split (the held-out 33% is the data that gets cross-validated below)
X_train, X_test, y_train, y_test = train_test_split(count_features, jobs_df['Category'], test_size=0.33, random_state=seed)

model = LogisticRegression(max_iter = 2000, random_state=seed) # increase the max_iter to 2000 for convergence
model.fit(X_train, y_train)
# cross_val_score clones the model and refits it on the folds of X_test, so the fit above does not affect these scores
scores = cross_val_score(model, X_test, y_test, scoring='accuracy', cv=cv) # accuracy score for each fold
for i in range(len(scores)):
    print('Model accuracy for Fold',i+1, ':', scores[i])
print('The average accuracy of the model is:', np.mean(scores))

Model accuracy for Fold 1 : 0.8461538461538461
Model accuracy for Fold 2 : 0.8269230769230769
Model accuracy for Fold 3 : 0.7843137254901961
Model accuracy for Fold 4 : 0.8431372549019608
Model accuracy for Fold 5 : 0.9019607843137255
The average accuracy of the model is: 0.8404977375565611


#### The TF-IDF weighted vectors language model

In [16]:
# Testing the TF-IDF weighted language model with Description of job advertisements as the feature for the model
seed = 0
cv = KFold(n_splits=5, random_state=seed, shuffle=True) # setting k-fold to 5 for cross validation
# creating training and test split
X_train, X_test, y_train, y_test = train_test_split(weighted_preTW2v, jobs_df['Category'], test_size=0.33, random_state=seed)

model = LogisticRegression(max_iter = 2000, random_state=seed) # increase the max_iter to 2000 for convergence
model.fit(X_train, y_train)
scores = cross_val_score(model, X_test, y_test, scoring='accuracy', cv=cv) # calculated the accuracy score for each fold
for i in range(len(scores)):
    print('Model accuracy for Fold',i+1, ':', scores[i])
print('The average accuracy of the model is:', np.mean(scores))

Model accuracy for Fold 1 : 0.9038461538461539
Model accuracy for Fold 2 : 0.8461538461538461
Model accuracy for Fold 3 : 0.9215686274509803
Model accuracy for Fold 4 : 0.9215686274509803
Model accuracy for Fold 5 : 0.9019607843137255
The average accuracy of the model is: 0.8990196078431373


#### The unweighted vectors language model

In [17]:
# Testing unweighted language model with Description of job advertisements as the feature for the model
seed = 0
cv = KFold(n_splits=5, random_state=seed, shuffle=True) # setting k-fold to 5 for cross validation
# creating training and test split
X_train, X_test, y_train, y_test = train_test_split(unweighted_preTW2v, jobs_df['Category'], test_size=0.33, random_state=seed)

model = LogisticRegression(max_iter = 2000, random_state=seed) # increase the max_iter to 2000 for convergence
model.fit(X_train, y_train)
scores = cross_val_score(model, X_test, y_test, scoring='accuracy', cv=cv) # calculated the accuracy score for each fold
for i in range(len(scores)):
    print('Model accuracy for Fold',i+1, ':', scores[i])
print('The average accuracy of the model is:', np.mean(scores))

Model accuracy for Fold 1 : 0.8653846153846154
Model accuracy for Fold 2 : 0.8653846153846154
Model accuracy for Fold 3 : 0.9019607843137255
Model accuracy for Fold 4 : 0.8823529411764706
Model accuracy for Fold 5 : 0.8627450980392157
The average accuracy of the model is: 0.8755656108597286


Comparing the average accuracies above, the TF-IDF weighted vectors language model performs best (0.899), closely followed by the unweighted vectors language model (0.876), with the count vectors language model last (0.840). My understanding is that weighting each word embedding by its TF-IDF score gives more influence to the words that are distinctive for a description, and that this helps the model perform better.
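
The evaluation cells in this notebook repeat the same steps, so they could be wrapped in one helper. The following is a minimal sketch under stated assumptions: `evaluate` is a hypothetical name, and the data is synthetic, generated with `make_classification` in place of the notebook's feature matrices. Unlike the cells in this notebook, it cross-validates the full feature matrix rather than a 33% hold-out, so its scores would not match the numbers reported here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

def evaluate(features, labels, seed=0, n_splits=5):
    """Return the per-fold accuracies of logistic regression under shuffled k-fold CV."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    model = LogisticRegression(max_iter=2000, random_state=seed)
    # cross_val_score clones the model and refits it on the training part of each fold
    return cross_val_score(model, features, labels, scoring="accuracy", cv=cv)

# Synthetic 4-class stand-in data (hypothetical); in the notebook this would be a feature matrix
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=4, random_state=0)
scores = evaluate(X, y)
for fold, score in enumerate(scores, start=1):
    print("Model accuracy for Fold", fold, ":", score)
print("The average accuracy of the model is:", np.mean(scores))
```

Calling the same helper with each feature representation would keep the folds identical across models, which makes the averages directly comparable.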

### Does more information provide higher accuracy?

In [18]:
# first we generate TF-IDF feature representation for the Title
title_Vectorizer = TfidfVectorizer(analyzer = "word") # initialising the TfidfVectorizer
title_features = title_Vectorizer.fit_transform(jobs_df['Title'])

In [19]:
# Testing tf-idf language model with Title of job advertisements as the feature for the model
seed = 0
cv = KFold(n_splits=5, random_state=seed, shuffle=True) # setting k-fold to 5 for cross validation
# creating training and test split
X_train, X_test, y_train, y_test = train_test_split(title_features, jobs_df['Category'], test_size=0.33, random_state=seed)

model = LogisticRegression(max_iter = 2000, random_state=seed) # increase the max_iter to 2000 for convergence
model.fit(X_train, y_train)
scores = cross_val_score(model, X_test, y_test, scoring='accuracy', cv=cv) # calculated the accuracy score for each fold
for i in range(len(scores)):
    print('Model accuracy for Fold',i+1, ':', scores[i])
print('The average accuracy of the model is:', np.mean(scores))

Model accuracy for Fold 1 : 0.7692307692307693
Model accuracy for Fold 2 : 0.5769230769230769
Model accuracy for Fold 3 : 0.7843137254901961
Model accuracy for Fold 4 : 0.7843137254901961
Model accuracy for Fold 5 : 0.7058823529411765
The average accuracy of the model is: 0.7241327300150829


In [20]:
# first we generate TF-IDF feature representation for the Title and Description
jobs_df['concatenated'] = jobs_df['Title'] + jobs_df['Description'] # concatenate the job title with the job description (no separator, so the last title word joins the first description word)
TD_Vectorizer = TfidfVectorizer(analyzer = "word") # initialising the TfidfVectorizer
TD_features = TD_Vectorizer.fit_transform(jobs_df['concatenated'])

In [21]:
# Testing tf-idf language model with Title and Description of job advertisements as features for the model
seed = 0
cv = KFold(n_splits=5, random_state=seed, shuffle=True) # setting k-fold to 5 for cross validation
# creating training and test split
X_train, X_test, y_train, y_test = train_test_split(TD_features, jobs_df['Category'], test_size=0.33, random_state=seed)

model = LogisticRegression(max_iter = 2000, random_state=seed) # increase the max_iter to 2000 for convergence
model.fit(X_train, y_train)
scores = cross_val_score(model, X_test, y_test, scoring='accuracy', cv=cv) # calculated the accuracy score for each fold
for i in range(len(scores)):
    print('Model accuracy for Fold',i+1, ':', scores[i])
print('The average accuracy of the model is:', np.mean(scores))

Model accuracy for Fold 1 : 0.8269230769230769
Model accuracy for Fold 2 : 0.8076923076923077
Model accuracy for Fold 3 : 0.803921568627451
Model accuracy for Fold 4 : 0.8431372549019608
Model accuracy for Fold 5 : 0.9215686274509803
The average accuracy of the model is: 0.8406485671191554


Answering the question "Does more information provide higher accuracy?" takes more than a simple yes or no. In this case, more information did not clearly provide higher accuracy: the TF-IDF model on `Title` plus `Description` averaged 0.841, well above `Title` alone (0.724), but roughly level with the description-only count vectors (0.840) and below the embedding-based models (0.876 and 0.899). The quality of the information matters here. If the `Title` were cleaned in the same way as the description, turned into a TF-IDF weighted vector, and then combined with the TF-IDF weighted vector for the `Description`, I imagine the score would be higher than that of all the other models.
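
One possible way to combine separate `Title` and `Description` representations is to build a feature matrix for each and stack them side by side. The sketch below illustrates this with plain TF-IDF matrices and `scipy.sparse.hstack`, not with the word2vec-weighted vectors, and it was not run on the job data. The toy titles and descriptions are hypothetical.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for jobs_df['Title'] and jobs_df['Description'] (hypothetical data)
toy_titles = ["Fund Accountant", "Staff Nurse"]
toy_descriptions = ["hedge funds london accountant", "nurse hospital care"]

# Fit one vectorizer per field so each field keeps its own vocabulary and weights
title_tfidf = TfidfVectorizer(analyzer="word").fit_transform(toy_titles)
desc_tfidf = TfidfVectorizer(analyzer="word").fit_transform(toy_descriptions)

# Stack the two matrices column-wise: one row per advertisement, columns from both fields
combined = hstack([title_tfidf, desc_tfidf]).tocsr()
print(title_tfidf.shape, desc_tfidf.shape, combined.shape)
```

Keeping the fields in separate column blocks avoids fusing the last title word with the first description word, and lets the classifier weight a word differently depending on where it appears.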

## Summary
To summarise, in these two tasks I produced different feature representations while considering only the job description. I stored the count vector representation of each job description in a file called `count_vectors.txt`. Using the GoogleNews300 embedding language model, I then built the TF-IDF weighted and unweighted vector representations for each job advertisement description. For Task 3, I developed machine learning models for categorising the text of a job advertisement, using the logistic regression model from `sklearn`. Finally, I compared the three language models in two sets of experiments on the provided dataset and looked into whether more information leads to better model performance.