Natural Language Processing

Libraries used:
* pandas
* numpy
* itertools
* gensim
* sklearn
* nltk
* logging

## Introduction

The files created in the first task are loaded and used to build several feature representations for classifying the job advertisements by category.
The primary feature vectors are document embeddings built from the Google News 300 word2vec model, in an unweighted and a TF-IDF weighted version. Their performance is compared using 5-fold cross-validation. Model performance is then evaluated at three levels of information: title only, description only, and title plus description.

## Importing libraries 

In [1]:
import pandas as pd
import numpy as np
from itertools import chain
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import gensim.downloader as api
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.datasets import load_files 
from nltk import RegexpTokenizer
from nltk.tokenize import sent_tokenize
import logging

## Task 2. Generating Feature Representations for Job Advertisement Descriptions

## Bag of Words Model: Count Vector Representation

#### Loading the files generated from task 1.

In [2]:
webindexes_file = 'webindexes.txt'
with open(webindexes_file) as f: 
    webindexes = f.read().splitlines() 
    
job_ads_file = 'job_ads.txt'
with open(job_ads_file) as f: 
    joined_job_ad_descs = f.read().splitlines()
    
vocab_desc_file = 'vocab.txt'
with open(vocab_desc_file) as f:
    lines = f.read().splitlines()  # Read all lines from the file
vocab_desc = [line.split(":")[0] for line in lines]

In [3]:
cVectorizer = CountVectorizer(analyzer = "word",vocabulary = vocab_desc) # initialised the CountVectorizer
count_features = cVectorizer.fit_transform(joined_job_ad_descs)
count_features.shape
count_matrix = count_features.toarray() #converting to dense representation for save operation

### Saving outputs
Save the count vector representation:

In [4]:
def save_count_vectors(vocab):
    # Write one line per job ad: "#<webindex>:," followed by "<word_index>:<count>," pairs.
    # `vocab` is kept in the signature; only the word indexes are written to the file.
    with open("count_vectors.txt", 'w') as out_file:  # the file is closed automatically
        for doc_index, doc_vector in enumerate(count_matrix):
            out_file.write(f"#{webindexes[doc_index]}:,")
            for word_index, word_count in enumerate(doc_vector):
                if word_count > 0:
                    out_file.write(f"{word_index}:{word_count},")
            out_file.write("\n")
    
save_count_vectors(vocab_desc)
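
Each line written by `save_count_vectors` has the form `#<webindex>:,<word_index>:<count>,...` with a trailing comma. A minimal sketch of reading such a line back into a sparse dictionary follows; `parse_count_vector_line` is a hypothetical helper that is not part of the notebook, and the sample line is made up.

```python
def parse_count_vector_line(line):
    """Parse '#<webindex>:,<idx>:<count>,...' into (webindex, {idx: count})."""
    head, _, rest = line.strip().partition(",")
    webindex = head.lstrip("#").rstrip(":")
    counts = {}
    for pair in rest.split(","):
        if pair:  # skip the empty string left by the trailing comma
            idx, count = pair.split(":")
            counts[int(idx)] = int(count)
    return webindex, counts

sample_line = "#12345678:,3:2,17:1,"  # made-up example line
webindex, counts = parse_count_vector_line(sample_line)
print(webindex, counts)
```

Storing only the nonzero entries keeps the file small, since most vocabulary words do not occur in any given advertisement.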

## Models based on word embeddings (Google News 300)

### Word2vec embeddings: unweighted

In [5]:
txt_fname = 'job_ads.txt'
with open(txt_fname) as txtf:
    job_ads = txtf.read().splitlines() # reading a list of strings, each for a job description
tk_job_ads = [job_ad.split(' ') for job_ad in job_ads]

categories_file = 'categories.txt'
with open(categories_file) as f: 
    categories = f.read().splitlines() 
    
example = 10
df = pd.DataFrame({'webindex':webindexes, 'categories': categories,'tk_job_ads':tk_job_ads})
df.iloc[example]

webindex                                               71851935
categories                                                    0
tk_job_ads    [eastleigh, investments, treasury, controller,...
Name: 10, dtype: object

In [6]:
# logging for event tracking
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
word2vec_googlenews = api.load('word2vec-google-news-300')
print(word2vec_googlenews)

2024-07-09 11:56:34,370 : INFO : loading projection weights from /Users/nigel/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz
2024-07-09 11:56:54,900 : INFO : KeyedVectors lifecycle event {'msg': 'loaded (3000000, 300) matrix of type float32 from /Users/nigel/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz', 'binary': True, 'encoding': 'utf8', 'datetime': '2024-07-09T11:56:54.900619', 'gensim': '4.3.0', 'python': '3.11.5 (main, Sep 11 2023, 08:31:25) [Clang 14.0.6 ]', 'platform': 'macOS-14.4.1-arm64-arm-64bit', 'event': 'load_word2vec_format'}


KeyedVectors<vector_size=300, 3000000 keys>



This code generates vector representations for documents based on word embeddings from the Google News 300 model and stores these representations in a pandas DataFrame.

In [7]:
def generate_docvecs(word2vec_googlenews, tk_job_ads):
    docvecs = []  # list to accumulate document vectors

    for tokens in tk_job_ads:
        word_vectors = []  # list to accumulate word vectors for the current document

        for word in tokens:
            if word in word2vec_googlenews:
                word_vectors.append(word2vec_googlenews[word])

        if word_vectors:
            doc_vector = np.sum(word_vectors, axis=0)  # sum the word vectors
        else:
            doc_vector = np.zeros(word2vec_googlenews.vector_size)  # handle documents with no valid tokens

        docvecs.append(doc_vector)

    return pd.DataFrame(docvecs)

In [8]:
# generate the feature vectors
unweighted_desc = generate_docvecs(word2vec_googlenews,df['tk_job_ads'])
unweighted_desc.isna().any().sum() # check whether there is any null values in the document vectors dataframe.

0

In [9]:
unweighted_desc.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,0.901138,1.406296,-0.886032,-0.003487,0.248032,2.174377,-0.769379,-4.703918,3.779785,2.390564,...,-5.691772,2.042877,-3.238186,0.131531,-1.752686,0.2444,-0.61758,-1.68454,-0.59021,-2.947083
1,0.20929,1.938171,2.122955,2.34066,0.67514,-0.609497,0.570648,-5.657169,7.982697,-0.905609,...,-4.684105,6.460129,-3.356812,2.719025,0.201538,-0.678894,1.273987,0.717865,0.105591,0.246948
2,-5.041443,0.118715,-0.036713,-0.700409,-3.232666,-1.395874,3.436344,-5.409973,4.627289,-1.761925,...,-5.439972,1.802765,-5.193237,0.246841,-0.305176,1.551659,-0.289094,-3.921082,0.552155,-2.575928
3,-1.147881,2.647675,-0.749817,1.750244,-1.853699,-1.820675,0.40498,-2.240421,1.846344,1.815089,...,-2.885651,2.926819,-2.497742,0.520813,-0.972656,0.596741,3.07341,0.373614,-0.096527,0.310362
4,-3.558548,0.628075,-1.552185,4.04353,-4.518372,-3.065369,5.994431,-5.978271,5.829208,3.333984,...,-5.320404,2.99102,-7.160202,-0.161396,-0.633545,5.454086,-1.107452,-1.175232,-0.436832,-4.710098
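
To make the pooling step concrete, here is a minimal sketch that uses a tiny hand-made embedding table in place of the Google News model. `toy_embeddings` and `sum_pool` are hypothetical names introduced for illustration. Tokens missing from the table are skipped and an empty document maps to the zero vector, mirroring `generate_docvecs`.

```python
import numpy as np

# Made-up 3-dimensional embeddings standing in for the 300-dimensional model
toy_embeddings = {
    "sales": np.array([1.0, 0.0, 2.0]),
    "manager": np.array([0.5, 1.0, 0.0]),
}

def sum_pool(tokens, embeddings, size=3):
    """Sum the vectors of all tokens found in `embeddings`."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(size)  # no known token: fall back to the zero vector
    return np.sum(vectors, axis=0)

doc = ["sales", "manager", "unknownword"]
print(sum_pool(doc, toy_embeddings))
print(sum_pool(["unknownword"], toy_embeddings))
```

Because the vectors are summed rather than averaged, longer documents tend to produce vectors with larger magnitude, which is visible in the value ranges of the table above.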


### Word2vec embeddings: TFIDF weighted

A TF-IDF vectorizer is initialized with the description vocabulary and then used to generate TF-IDF vectors for the job advertisement descriptions.

In [10]:
tVectorizer = TfidfVectorizer(analyzer = "word",vocabulary = vocab_desc) # initialised the TfidfVectorizer
tfidf_features_desc = tVectorizer.fit_transform(joined_job_ad_descs) # generate the tfidf vector representation for all job ad descriptions
tfidf_features_desc.shape

(776, 5090)

In [11]:
#load vocab as a dictionary (index:word)
vocab_file_desc = 'vocab.txt'
with open(vocab_file_desc) as f:
    lines = f.read().splitlines()  # Read all lines from the file
 
word_indexes = [line.split(":") for line in lines]
vocab_dict_desc = {int(word_index[1]):word_index[0] for word_index in word_indexes}

# Print the first 10 items:
count = 0
for key, value in vocab_dict_desc.items():
    if count < 10:
        print(f"{key}: {value}")
        count += 1
    else:
        break

0: aap
1: aaron
2: aat
3: abb
4: abenefit
5: aberdeen
6: abi
7: abilities
8: abreast
9: abroad


This function calculates and stores the TF-IDF weights for words in each document. The function returns a list of dictionaries where each dictionary maps words from vocab_dict to their corresponding TF-IDF weights for a specific document.

In [12]:
def doc_wordweights(tfidf_features, vocab_dict):
    tfidf_weights = []  # a list to store the word:weight dictionaries of documents

    for doc_index in range(tfidf_features.shape[0]):
        doc_weights = tfidf_features[doc_index].toarray()[0]  # Get TF-IDF weights for the current document
        wordweight_dict = {vocab_dict[word_index]: weight for word_index, weight in enumerate(doc_weights) if weight > 0}
        tfidf_weights.append(wordweight_dict)

    return tfidf_weights

# Call the function with tfidf_features and vocab_dict
tfidf_weights_desc = doc_wordweights(tfidf_features_desc, vocab_dict_desc)

# Print the first 10 items of example:
count = 0
for key, value in tfidf_weights_desc[example].items():
    if count < 10:
        print(f"{key}: {value}")
        count += 1
    else:
        break


aca: 0.06738994884666369
acca: 0.06848284037037111
accordance: 0.06588017420783195
accounting: 0.06283287951864482
acting: 0.054662066737579826
action: 0.07159885813531025
age: 0.06965675838335401
agency: 0.04232356723055944
analysing: 0.08328290049315928
analysis: 0.055142122993720726


This function computes weighted document vectors by combining word embeddings from a given word embedding model with their corresponding TF-IDF weights for a collection of documents.
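
In symbols, using notation introduced here for illustration: let $\mathbf{e}_w$ be the embedding of term $w$, $V_E$ the vocabulary of the embedding model, and $\mathrm{tfidf}(w, d)$ the TF-IDF weight of $w$ in document $d$. With $w$ ranging over the terms that have a nonzero TF-IDF weight in $d$, the document vector is

$$
\mathbf{v}_d = \sum_{\substack{w \in d \\ w \in V_E}} \mathrm{tfidf}(w, d)\,\mathbf{e}_w
$$

The sum is not divided by the total weight, so the result is a weighted sum rather than a weighted average. Documents with no term in $V_E$ get the zero vector.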

In [13]:
def weighted_docvecs(embeddings, tfidf, docs):
    docvecs = []

    for i, doc in enumerate(docs):
        tf_weights = tfidf[i]
        valid_keys = [term for term in tf_weights if term in embeddings]
        
        if not valid_keys:
            print(f"No valid terms found in embeddings for document: {doc}")
            docvec = np.zeros(embeddings.vector_size)
        else:
            # weight each word vector by its TF-IDF score, then sum into one document vector
            weighted = [embeddings[term] * tf_weights[term] for term in valid_keys]
            docvec = np.sum(np.vstack(weighted), axis=0)

        docvecs.append(docvec)

    return pd.DataFrame(docvecs)

In [14]:
weighted_desc = weighted_docvecs(word2vec_googlenews, tfidf_weights_desc, df['tk_job_ads'])
weighted_desc

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,0.138460,0.226597,-0.081565,-0.006660,0.114711,0.211352,-0.109989,-0.529260,0.434688,0.292923,...,-0.655827,0.254526,-0.395180,0.008176,-0.151750,0.082227,-0.049517,-0.149545,-0.102712,-0.328083
1,0.072223,0.150872,0.217045,0.226054,0.107462,-0.065670,0.024513,-0.466990,0.751906,-0.156715,...,-0.380454,0.623265,-0.290799,0.262206,0.067257,-0.068819,0.117884,0.066299,-0.031979,0.006392
2,-0.339290,-0.025993,-0.024305,-0.073153,-0.284411,-0.128614,0.235592,-0.434294,0.350736,-0.100273,...,-0.454402,0.122261,-0.373630,0.024087,-0.017727,0.108590,-0.038054,-0.348651,0.051754,-0.182820
3,-0.123891,0.344586,-0.151851,0.204709,-0.288331,-0.270977,0.053122,-0.266193,0.290436,0.320958,...,-0.359883,0.367943,-0.336372,0.100289,-0.119123,0.078298,0.407616,0.076324,-0.062215,0.009121
4,-0.236785,0.018763,-0.111336,0.324455,-0.393914,-0.289901,0.446163,-0.459797,0.472943,0.293615,...,-0.416899,0.203973,-0.539203,0.032364,-0.077112,0.421060,-0.106325,-0.055864,-0.002660,-0.323667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
771,-0.124630,-0.009323,-0.374541,0.192978,0.222527,-0.166560,-0.123691,-0.534163,0.401518,0.379013,...,-0.358915,0.207393,-0.857826,-0.065568,0.295326,-0.031123,0.523671,-0.147746,0.078561,-0.363654
772,-0.600541,0.574659,0.105523,0.184331,-0.536448,-0.163162,0.650189,-0.645796,0.811121,0.700073,...,-1.125246,0.636796,-0.948744,0.525721,-0.431497,0.364681,0.129525,-0.141560,-0.002172,-0.498964
773,-0.408723,0.479502,-0.245255,0.175926,-0.524066,0.297879,0.166414,-1.013701,0.488274,-0.421673,...,-0.535201,0.444471,-1.006210,0.270296,0.040544,0.326518,0.542841,0.076095,0.449920,0.158322
774,-0.106379,0.136687,-0.308232,0.273000,-0.619065,0.538901,0.234327,-0.612244,0.355565,0.088508,...,-0.188990,0.403811,-0.599088,0.087991,0.043367,0.183113,-0.004659,0.148513,0.297856,-0.153251


## Job Advertisement Classification Comparison

In this task, the performance difference between the unweighted and TF-IDF weighted vector representations made using the Google News 300 model is evaluated.

### Unweighted Word2vec vs TF-IDF weighted Word2vec

In [15]:
seed = 0  # set a seed to make sure the experiment is reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=seed)
X_unweighted = unweighted_desc
X_weighted = weighted_desc
y = df['categories']
model = LogisticRegression(max_iter=2000, random_state=seed)

# Cross-validation for unweighted features
unweighted_scores = cross_val_score(model, X_unweighted, y, cv=kf)

# Cross-validation for weighted features
weighted_scores = cross_val_score(model, X_weighted, y, cv=kf)

# Print performance of each fold
for fold, (unweighted_score, weighted_score) in enumerate(zip(unweighted_scores, weighted_scores)):
    print(f"Fold {fold + 1} accuracy :")
    print(f"  Unweighted: {unweighted_score:.2f}")
    print(f"  Weighted: {weighted_score:.2f}")
    print()

# Calculate and print mean accuracy for each representation
mean_unweighted_accuracy = unweighted_scores.mean()
mean_weighted_accuracy = weighted_scores.mean()
print("Mean Unweighted Feature Representation Accuracy: {:.2f}".format(mean_unweighted_accuracy * 100))
print("Mean Weighted Feature Representation Accuracy: {:.2f}".format(mean_weighted_accuracy * 100))


Fold 1 accuracy :
  Unweighted: 0.84
  Weighted: 0.87

Fold 2 accuracy :
  Unweighted: 0.85
  Weighted: 0.90

Fold 3 accuracy :
  Unweighted: 0.82
  Weighted: 0.87

Fold 4 accuracy :
  Unweighted: 0.81
  Weighted: 0.83

Fold 5 accuracy :
  Unweighted: 0.83
  Weighted: 0.86

Mean Unweighted Feature Representation Accuracy: 82.99
Mean Weighted Feature Representation Accuracy: 86.60
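
Passing the same `KFold` object with a fixed integer `random_state` to both `cross_val_score` calls means both representations are scored on identical train/test splits, so the fold-by-fold differences are directly comparable. The sketch below illustrates this on synthetic data; `X_a`, `X_b`, and the other names are hypothetical and unrelated to the job advertisement features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data; X_b is a rescaled copy standing in for a second representation
X_a, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
X_b = X_a * 0.1

kf_toy = KFold(n_splits=5, shuffle=True, random_state=0)

# With an integer random_state, every call to split() yields the same folds
splits_first = [test.tolist() for _, test in kf_toy.split(X_a)]
splits_second = [test.tolist() for _, test in kf_toy.split(X_b)]
same_splits = splits_first == splits_second

clf = LogisticRegression(max_iter=2000)
scores_a = cross_val_score(clf, X_a, y_toy, cv=kf_toy)
scores_b = cross_val_score(clf, X_b, y_toy, cv=kf_toy)
fold_diffs = scores_b - scores_a  # paired per-fold differences

print(same_splits)
print(f"mean diff: {fold_diffs.mean():.3f}, std: {fold_diffs.std(ddof=1):.3f}")
```

Reporting the spread of the per-fold differences alongside the mean gives a rough sense of whether a gap between two representations is consistent across folds.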


## Comparing Different Levels of information

In this part of the activity, I compared word embeddings derived from three sources: only the title, only the description, and a combination of both title and description, considering both unweighted and TF-IDF weighted versions.

#### Unweighted Title Only Embeddings

This function tokenizes the title portion of a job advertisement, converting it to lowercase, segmenting it into sentences, and further tokenizing each sentence into individual words, returning a list of these tokens.

In [16]:
def tokenize_job_ad_title(raw_job_ad):
    ad = raw_job_ad.decode('utf-8') # convert the bytes-like object to python string, need this before we apply any pattern search on it
    ad = ad.lower() # convert all words to lowercase
    
    # Find the start of the title part using the "title:" keyword
    title_start = ad.find("title:")
    if title_start == -1: # Handle the case where "title:" is not found
        return []
    
    # Find the end of the title part (assuming it's terminated by a newline character)
    title_end = ad.find("\n", title_start)
    
    if title_end == -1:
        # Handle the case where a newline character is not found after "title:"
        return []
        
    title = ad[title_start + len("title:"):title_end].strip() # Extract the title value
    sentences = sent_tokenize(title) # segment into sentences
    
    # tokenize each sentence
    pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
    tokenizer = RegexpTokenizer(pattern) 
    token_lists = [tokenizer.tokenize(sentence) for sentence in sentences]
    
    tokenised_job_ad = list(chain.from_iterable(token_lists)) # merge them into a list of tokens
    return tokenised_job_ad

In [17]:
#generating unweighted vector representation for just the title
job_data = load_files(r"data") 
job_ads, categories = job_data.data, job_data.target 
tk_job_ad_titles = [tokenize_job_ad_title(job_ad) for job_ad in job_ads]  
print(tk_job_ad_titles)
df_titles = pd.DataFrame({'webindex':webindexes, 'categories': categories,'tk_job_ad_titles':tk_job_ad_titles})
unweighted_titles = generate_docvecs(word2vec_googlenews,df_titles['tk_job_ad_titles'])

[['finance', 'accounts', 'asst', 'bromley', 'to', 'k'], ['fund', 'accountant', 'hedge', 'fund'], ['deputy', 'home', 'manager'], ['brokers', 'wanted', 'imediate', 'start'], ['rgn', 'nurses', 'hospitals', 'penarth'], ['production', 'coordinator'], ['scrub', 'nurse'], ['sales', 'purchase', 'ledger', 'clerk', 'maternity', 'cover'], ['recruitment', 'sales', 'executive'], ['business', 'development', 'executive', 'field', 'sales', 'dartford'], ['investments', 'treasury', 'controller'], ['european', 'payroll'], ['engineering', 'assessor', 'instructor', 'south', 'yorkshire'], ['international', 'account', 'manager'], ['senior', 'production', 'technologist', 'malaysia'], ['insurance', 'sales', 'executive', 'horsham'], ['vehicle', 'purchaser', 'car', 'sales'], ['marine', 'engines', 'specialist', 'product', 'support'], ['sales', 'manager', 'medical', 'sales', 'executive'], ['optical', 'assistant', 'oxfordshire'], ['perm', 'unit', 'mgr', 'rgn', 'kid', 'minster', 'flexi', 'k', 'due'], ['perm', "rgn's
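
The effect of the title pattern can be checked on a single string. The sketch below is a simplified stand-in for `tokenize_job_ad_title`: it uses the standard library `re.findall` with the same pattern instead of NLTK's `RegexpTokenizer`, skips sentence segmentation, and takes a `str` rather than bytes. `extract_title_tokens` and the sample advertisement are made up for illustration.

```python
import re

# Same pattern as in tokenize_job_ad_title: letters, optionally joined by one hyphen or apostrophe
TOKEN_PATTERN = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

def extract_title_tokens(ad_text):
    """Return lowercase tokens from the line that starts with 'title:'."""
    text = ad_text.lower()
    start = text.find("title:")
    if start == -1:
        return []
    end = text.find("\n", start)
    if end == -1:
        return []
    title = text[start + len("title:"):end].strip()
    return re.findall(TOKEN_PATTERN, title)

sample_ad = "Title: Part-Time Nurse's Aide (RGN) 25k\nWebindex: 00000000\nDescription: made-up text"
print(extract_title_tokens(sample_ad))
```

Digits are not matched by the pattern, which is why salary figures such as `25k` leave a stray `k` token, as seen in the first title of the output above.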

In [18]:
# generate_docvecs substitutes a zero vector when no token is found in the embedding model,
# so this NaN check reports 0 and `Nulls` is empty.
print(unweighted_titles.isna().any().sum())
Nulls = unweighted_titles[unweighted_titles.isna().any(axis=1)]

# titles with no token in the embedding vocabulary (rows 360, 572 and 733)
print(df_titles.iloc[360])
print(df_titles.iloc[572])
print(df_titles.iloc[733])

# drop rows containing NaN (a no-op here, because Nulls is empty)
unweighted_titles.dropna(axis=0, inplace=True)
# drop the corresponding rows from df_titles
df_titles.drop(Nulls.index, axis=0, inplace=True)

print(df_titles.shape)

0
webindex               70232739
categories                    2
tk_job_ad_titles    [rmn, rnld]
Name: 360, dtype: object
webindex                               62017964
categories                                    0
tk_job_ad_titles    [sipp, ssas, administartor]
Name: 572, dtype: object
webindex                                  72564385
categories                                       1
tk_job_ad_titles    [peiriannydd, gweithgynhyrchu]
Name: 733, dtype: object
(776, 3)
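
Because `generate_docvecs` falls back to a zero vector, documents without any in-vocabulary token appear as all-zero rows rather than NaN rows, which is consistent with the null count of 0 printed above. A minimal sketch of locating such rows on a made-up DataFrame follows; `toy_vecs` is hypothetical data, and whether such rows should be dropped or kept is a separate modelling decision.

```python
import pandas as pd

toy_vecs = pd.DataFrame([
    [0.2, -0.1, 0.4],
    [0.0, 0.0, 0.0],   # document with no in-vocabulary tokens
    [0.3, 0.0, -0.5],
])

# A row counts as "empty" only if every component is exactly zero
zero_rows = toy_vecs.index[(toy_vecs == 0).all(axis=1)]
print(list(zero_rows))

filtered = toy_vecs.drop(zero_rows, axis=0)
print(filtered.shape)
```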


#### Weighted Title Only Embeddings

In [19]:
#combine the tokenized job titles into a list of sentences
joined_job_ad_titles = []
for token_list in tk_job_ad_titles:
    sentence = " ".join(token_list)
    joined_job_ad_titles.append(sentence)
for i in range(10):
    print(joined_job_ad_titles[i])  

finance accounts asst bromley to k
fund accountant hedge fund
deputy home manager
brokers wanted imediate start
rgn nurses hospitals penarth
production coordinator
scrub nurse
sales purchase ledger clerk maternity cover
recruitment sales executive
business development executive field sales dartford


In [20]:
#create the vocabulary and the dictionary
words = list(chain.from_iterable(tk_job_ad_titles))
vocab_title = set(words)
vocab_title = sorted(vocab_title)
vocab_dict_title = {index:word for index, word in enumerate(vocab_title)}
for index in range(10):
    print(vocab_title[index])
for index in range(10):
    print(f"{index}:{vocab_dict_title[index]}")

a
abbey
aberdeen
abi
accepted
accounant
account
accountancy
accountant
accounting
0:a
1:abbey
2:aberdeen
3:abi
4:accepted
5:accounant
6:account
7:accountancy
8:accountant
9:accounting


In [21]:
#generate the weighted vector representation
tVectorizer = TfidfVectorizer(analyzer = "word",vocabulary = vocab_title) # initialised the TfidfVectorizer
tfidf_features = tVectorizer.fit_transform(joined_job_ad_titles) # generate the tfidf vector representation for all job ad titles
print(tfidf_features.shape)
tfidf_weights = doc_wordweights(tfidf_features, vocab_dict_title)
print(tfidf_weights[example])
weighted_titles = weighted_docvecs(word2vec_googlenews, tfidf_weights, df_titles['tk_job_ad_titles'])
print(weighted_titles)

(776, 1003)
{'controller': 0.5110131436438434, 'investments': 0.6646589807883488, 'treasury': 0.5450633048377316}
No valid terms found in embeddings for document: ['rmn', 'rnld']
No valid terms found in embeddings for document: ['sipp', 'ssas', 'administartor']
No valid terms found in embeddings for document: ['peiriannydd', 'gweithgynhyrchu']
          0         1         2         3         4         5         6    \
0    0.087739  0.109611  0.243042 -0.036647 -0.090930  0.023514 -0.053294   
1    0.170572  0.008746  0.161061 -0.039754  0.089066 -0.094399  0.077775   
2   -0.090090 -0.092335 -0.000078 -0.374884 -0.039919 -0.037689  0.272559   
3    0.076404  0.218866  0.037889  0.011598 -0.140168 -0.298649  0.078889   
4   -0.073963 -0.020260  0.217439  0.237971  0.041492 -0.026040 -0.136404   
..        ...       ...       ...       ...       ...       ...       ...   
771 -0.023126 -0.077155  0.145450 -0.058493 -0.068682 -0.116588 -0.022946   
772 -0.053175 -0.015673  0.100475  0.1

#### Unweighted Title + Description Embeddings

In [22]:
#concatenate the title and description from before and then generate the unweighted vector representation
df['categories'] = df['categories'].astype('int64')
df_td = pd.merge(df_titles, df, on=['webindex', 'categories'], how='left')
df_td['tk_td'] = df_td.apply(lambda row: row['tk_job_ad_titles'] + row['tk_job_ads'], axis=1)
print(df_td.head(5))
unweighted_td = generate_docvecs(word2vec_googlenews,df_td['tk_td'])

   webindex  categories                           tk_job_ad_titles  \
0  68997528           0  [finance, accounts, asst, bromley, to, k]   
1  68063513           0            [fund, accountant, hedge, fund]   
2  68700336           2                    [deputy, home, manager]   
3  67996688           0         [brokers, wanted, imediate, start]   
4  71803987           2          [rgn, nurses, hospitals, penarth]   

                                          tk_job_ads  \
0  [accountant, partqualified, south, east, londo...   
1  [hedge, funds, london, recruiting, fund, accou...   
2  [exciting, arisen, establish, provider, elderl...   
3  [expanding, recruiting, junior, trainee, broke...   
4  [rgn, nurses, hospitals, fulltime, part, swiis...   

                                               tk_td  
0  [finance, accounts, asst, bromley, to, k, acco...  
1  [fund, accountant, hedge, fund, hedge, funds, ...  
2  [deputy, home, manager, exciting, arisen, esta...  
3  [brokers, wanted, i

#### Weighted Title + Desc Embeddings

In [23]:
#join the tokenized title+description into a list of sentences
#create the vocabulary and dictionary
joined_job_ad_td = [sent1 + " " + sent2 for sent1, sent2 in zip(joined_job_ad_titles, joined_job_ad_descs)]
words = vocab_desc+vocab_title
vocab_td = set(words)
vocab_td = sorted(vocab_td)
vocab_dict_td = {index:word for index, word in enumerate(vocab_td)}
for index in range(10):
    print(vocab_dict_td[index])
for index in range(10):
    print(f"{index}:{vocab_td[index]}")

a
aap
aaron
aat
abb
abbey
abenefit
aberdeen
abi
abilities
0:a
1:aap
2:aaron
3:aat
4:abb
5:abbey
6:abenefit
7:aberdeen
8:abi
9:abilities


In [24]:
tVectorizer = TfidfVectorizer(analyzer = "word",vocabulary = vocab_td) # initialised the TfidfVectorizer
tfidf_features_td = tVectorizer.fit_transform(joined_job_ad_td) # generate the tfidf vector representation for all title + description texts
tfidf_features_td.shape

(776, 5314)

In [25]:
#generate the weighted vector representation
tfidf_weights_td = doc_wordweights(tfidf_features_td, vocab_dict_td)
print(tfidf_weights_td[example])

{'aca': 0.0652044252880885, 'acca': 0.06626187324472237, 'accordance': 0.0637436141534258, 'accounting': 0.06079514628407042, 'acting': 0.052889321147770625, 'action': 0.06927683543162999, 'age': 0.06739771992040468, 'agency': 0.040950971541611855, 'analysing': 0.0805819525896607, 'analysis': 0.05311980543329934, 'analytical': 0.0619900562935598, 'application': 0.045858152392915075, 'architectures': 0.0898721452300595, 'articulate': 0.07582823400821496, 'assume': 0.07141895500419689, 'assumptions': 0.09608667494616789, 'awareness': 0.062409998073698146, 'back': 0.0668192770777162, 'bring': 0.07141895500419689, 'build': 0.05178618935357167, 'capital': 0.06995814386953422, 'cashflow': 0.09608667494616789, 'challenge': 0.05898989557198008, 'change': 0.05898989557198008, 'cima': 0.07220512302663372, 'clear': 0.06572403152817587, 'communication': 0.0751458364379183, 'compliance': 0.058323402907250674, 'comply': 0.07483905750591496, 'confident': 0.06118319020429629, 'consistently': 0.0730338

In [26]:
weighted_td = weighted_docvecs(word2vec_googlenews, tfidf_weights_td, df_td['tk_td'])
weighted_td

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,0.145644,0.243560,0.003314,-0.013064,0.072833,0.197028,-0.125262,-0.598082,0.476345,0.278501,...,-0.597980,0.203950,-0.361167,-0.034597,-0.194029,-0.005384,-0.063780,-0.163592,-0.125480,-0.223602
1,0.092423,0.128234,0.214745,0.179408,0.107170,-0.074878,0.036006,-0.431912,0.708693,-0.185550,...,-0.328630,0.598091,-0.288117,0.231099,0.039226,-0.063500,0.105906,0.077292,-0.048973,0.005504
2,-0.333958,-0.041919,-0.023717,-0.124600,-0.278062,-0.128001,0.263936,-0.447359,0.339667,-0.087960,...,-0.450619,0.134542,-0.345871,0.030985,-0.028801,0.096263,-0.011243,-0.363786,0.080333,-0.138853
3,-0.080518,0.360978,-0.122006,0.177395,-0.292315,-0.323947,0.071172,-0.249086,0.256888,0.316349,...,-0.325608,0.363814,-0.351380,0.145200,-0.098058,0.085278,0.371798,0.044728,-0.068194,-0.029659
4,-0.237862,0.014902,-0.071798,0.348949,-0.369499,-0.281876,0.404827,-0.446188,0.463685,0.270851,...,-0.431168,0.170005,-0.580494,0.040502,-0.080570,0.435868,-0.112610,-0.050169,0.019402,-0.292655
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
771,-0.105592,-0.043192,-0.255186,0.142276,0.160148,-0.182233,-0.108483,-0.508725,0.358053,0.420593,...,-0.257999,0.206968,-0.833112,-0.003461,0.292357,0.030230,0.515535,-0.112179,0.133660,-0.375179
772,-0.591444,0.556103,0.116888,0.205218,-0.543446,-0.169404,0.633813,-0.663347,0.780889,0.696724,...,-1.110849,0.640597,-0.924301,0.533574,-0.430908,0.348161,0.094643,-0.135836,-0.011093,-0.505172
773,-0.389227,0.448951,-0.243833,0.171819,-0.509464,0.298406,0.178607,-0.996932,0.459146,-0.414114,...,-0.515148,0.442528,-0.987143,0.280244,0.037782,0.318094,0.550783,0.074266,0.467540,0.167194
774,-0.128871,0.072328,-0.327231,0.282260,-0.624252,0.555349,0.217912,-0.617079,0.286599,0.057394,...,-0.211612,0.406742,-0.606084,0.082272,0.070883,0.182780,-0.012764,0.133380,0.309389,-0.125450


### Comparison of Model Performance:

In [27]:
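# Drop the same rows from both description feature sets so every feature matrix stays aligned with df_td
# (DataFrame.drop keeps weighted_desc a DataFrame; np.delete would return a plain array)
weighted_desc = weighted_desc.drop(Nulls.index, axis=0)
unweighted_desc.drop(Nulls.index, axis=0, inplace=True)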

In [28]:
seed = 0  # set a seed to make sure the experiment is reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=seed)

datasets = [
    ("Description - Unweighted", unweighted_desc),
    ("Description - Weighted", weighted_desc),
    ("Title - Unweighted", unweighted_titles),
    ("Title - Weighted", weighted_titles),
    ("Title + Description - Unweighted", unweighted_td),
    ("Title + Description - Weighted", weighted_td)
]

# Iterate through each dataset and feature representation, and perform cross-validation
for dataset_name, X in datasets:
    y = df_td['categories']
    model = LogisticRegression(max_iter=2000, random_state=seed)
    scores = cross_val_score(model, X, y, cv=kf)
    
    # Print performance of each fold
    print(f"{dataset_name}")
    for fold, score in enumerate(scores, start=1):
        print(f"Fold {fold} accuracy: {score:.2f}")
    print(f"Mean Accuracy: {scores.mean() * 100:.2f}%")
    print()


Description - Unweighted
Fold 1 accuracy: 0.84
Fold 2 accuracy: 0.85
Fold 3 accuracy: 0.82
Fold 4 accuracy: 0.81
Fold 5 accuracy: 0.83
Mean Accuracy: 82.99%

Description - Weighted
Fold 1 accuracy: 0.87
Fold 2 accuracy: 0.90
Fold 3 accuracy: 0.87
Fold 4 accuracy: 0.83
Fold 5 accuracy: 0.86
Mean Accuracy: 86.60%

Title - Unweighted
Fold 1 accuracy: 0.81
Fold 2 accuracy: 0.85
Fold 3 accuracy: 0.86
Fold 4 accuracy: 0.84
Fold 5 accuracy: 0.85
Mean Accuracy: 84.02%

Title - Weighted
Fold 1 accuracy: 0.79
Fold 2 accuracy: 0.88
Fold 3 accuracy: 0.84
Fold 4 accuracy: 0.85
Fold 5 accuracy: 0.85
Mean Accuracy: 84.16%

Title + Description - Unweighted
Fold 1 accuracy: 0.86
Fold 2 accuracy: 0.86
Fold 3 accuracy: 0.83
Fold 4 accuracy: 0.83
Fold 5 accuracy: 0.87
Mean Accuracy: 85.18%

Title + Description - Weighted
Fold 1 accuracy: 0.87
Fold 2 accuracy: 0.90
Fold 3 accuracy: 0.87
Fold 4 accuracy: 0.86
Fold 5 accuracy: 0.87
Mean Accuracy: 87.24%



In [40]:
# Train the model on the dataset
seed = 0  # set a seed to make sure the experiment is reproducible
y = df_td['categories']
model = LogisticRegression(max_iter=2000, random_state=seed)
model.fit(weighted_td, y)  # Fit the model to the entire dataset

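# Build a document vector for new text: whitespace tokens, unweighted mean of the word vectors.
# Note: this differs from the TF-IDF weighted sum used to build weighted_td for training.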
def preprocess_and_transform(title, description, word2vec_model):
    combined_text = title + " " + description
    words = combined_text.split()
    word_vectors = [word2vec_model[word] for word in words if word in word2vec_model]
    if word_vectors:
        doc_vector = np.mean(word_vectors, axis=0)
    else:
        doc_vector = np.zeros(word2vec_model.vector_size)
    return doc_vector.reshape(1, -1)

# Example input
job_title = "Water Hygiene Engineer"
job_description = "Water Hygiene Engineer Location Post Code: BS**** (Bristol, Avon) Salary: ****K to ****K (Depending on Skills Experience) Our client is looking for an experienced Water Hygiene Engineer to join their growing team in Bristol. The ideal candidate will have substantial field service experience preferably in the Water Hygiene Industry, an exceptional mechanical aptitude and great customer service skills. They are looking for enthusiastic people with experience in;  Cold water storage tank cleaning and disinfection, including handling and dosing chemicals  Cold water storage tank lining, including using an angle grinder and painting  Showerhead cleaning and disinfection  Cold water storage tank inspections  Water hygiene monitoring tasks including temperature testing, tank inspections, samples and calorifier inspections  Driving ****k miles per year You must have  GCSE Maths and English grade C and above  Experience in Water Hygiene / Legionella control industry  Mechanical aptitude and technical ability  Experience at working unsupervised  Capable of physical work, lifting, carrying and climbing  Full current UK driving Licence In return they are offering a fantastic opportunity to work for a great company, on a good salary, with a van, overtime and generous holiday package. The hours are Monday to Friday 8.30am to 5.30pm. The role is managed from their Avonmouth office, and the work is at their customers sites throughout the UK"
# Transform the example input with preprocess_and_transform
input_vector = preprocess_and_transform(job_title, job_description, word2vec_googlenews)

# Predict the category
category_names = ['Accounting & Finance','Engineering','Healthcare & Nursing','Sales']
predicted_category = model.predict(input_vector)
predicted_category = category_names[predicted_category[0]]
print(f"Predicted Category: {predicted_category}")

Predicted Category: Engineering


In [41]:
from joblib import dump
import numpy as np

# Save the trained logistic regression model
dump(model, 'logistic_regression_model.pkl')

# Save the category names
category_names = ['Accounting & Finance', 'Engineering', 'Healthcare & Nursing', 'Sales']
np.save('category_names.npy', category_names)
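
For completeness, a minimal sketch of the reverse step: restoring a model with `joblib.load` and the label names with `np.load`, then predicting. To stay self-contained it trains a tiny made-up model and writes to a temporary directory; the file names and data here are hypothetical and are not the notebook's artifacts.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set with two well-separated classes
X_toy = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]])
y_toy = np.array([0, 0, 1, 1])
toy_model = LogisticRegression().fit(X_toy, y_toy)
toy_names = ["class_a", "class_b"]

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "toy_model.pkl")
    names_path = os.path.join(tmp, "toy_names.npy")
    dump(toy_model, model_path)
    np.save(names_path, toy_names)  # stored as a fixed-width string array

    restored_model = load(model_path)
    restored_names = np.load(names_path)

pred = restored_model.predict(np.array([[0.95, 0.05]]))[0]
print(restored_names[pred])
```

A restored model only gives meaningful predictions if new inputs are vectorized in the same way as the training features.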