# Assignment 2: Milestone I Natural Language Processing
## Task 2&3
#### Student Name: Sukhum Boondecharak
#### Student ID: S3940976

Date: 04 Oct 2023

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* os
* numpy
* gensim
* scikit-learn
* nltk
* itertools
* collections

## Introduction
The objective for task 2 is to create feature representations for job advertisement descriptions. These representations will be used to capture the essential information within the text data, making it suitable for machine learning models. The task involves two main feature generation processes:

1. Bag-of-Words Model: This approach involves creating count vector representations for each job advertisement description based on a preprocessed vocabulary. The generated count vectors represent the frequency of each word in the descriptions.
2. Word Embeddings: Word embeddings capture semantic relationships between words and can provide rich representations for text data. For this sub-task, I chose FastText as the word embedding model and created both unweighted and TF-IDF weighted vector representations for the job advertisement descriptions.

Task 3 focuses on building machine learning models to classify job advertisements into specific categories based on their textual content. The primary goal is to investigate two key questions:

- Q1: Which language model, among those created in Task 2, performs best when combined with chosen machine learning models? Various models will be built based on different feature representations, and their performance will be evaluated.

- Q2: Does more information improve accuracy? Different combinations of features will be explored, including using only the job title, only the job description, or both. By experimenting with these combinations, we aim to understand whether incorporating additional information, such as job titles, improves the accuracy of the classification models.

## Importing libraries 

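The libraries below cover file handling, feature generation (scikit-learn vectorizers and gensim FastText), model training and evaluation, and tokenisation (nltk).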

In [4]:
import os
import numpy as np
from gensim.models.fasttext import FastText
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import load_files  
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
from nltk import RegexpTokenizer
from nltk.tokenize import sent_tokenize
from itertools import chain
from collections import Counter

## Task 2. Generating Feature Representations for Job Advertisement Descriptions

### Bag-of-Words (BoW) Model: Count Vectors

First, generate the Count vector representation for each job advertisement description using the vocabulary created in Task 1. The count vectors will be combined and saved at the end of this task.

In [2]:
# Load the cleaned data
with open("cleaned_descriptions.txt", "r", encoding="utf-8") as file:
    cleaned_descriptions = file.readlines()

# Load the vocabulary
with open("vocab.txt", "r") as file:
    vocab = [line.strip().split(":")[0] for line in file]

# Indicate original data folder
original_data_folder = "data"

# Initiate an empty list to store webindex numbers
webindex_numbers = []

# Iterate through the original data files and extract webindex
for category_folder in os.listdir(original_data_folder):
    category_path = os.path.join(original_data_folder, category_folder)
    if os.path.isdir(category_path):
        for job_file in os.listdir(category_path):
            if job_file.startswith("Job_") and job_file.endswith(".txt"):
                with open(os.path.join(category_path, job_file), "r", encoding="utf-8") as f:
                    content = f.read()
                    
                    # Extract the webindex from the original data and remove the newline character
                    webindex = content.split("Webindex: ")[1].split("\n")[0]
                    
                    webindex_numbers.append(webindex)

webindex_numbers

['72444142',
 '68687567',
 '68257980',
 '71168766',
 '72441930',
 '70205492',
 '69929266',
 '68814305',
 '71737507',
 '69540434',
 '71199751',
 '70457475',
 '72411451',
 '69001764',
 '71171544',
 '68057786',
 '69040220',
 '68784018',
 '69146761',
 '68256016',
 '70251801',
 '68180459',
 '68056671',
 '72448172',
 '69250788',
 '67639091',
 '68256188',
 '69577650',
 '66600427',
 '71339723',
 '72691163',
 '69989027',
 '69635720',
 '69993409',
 '68356152',
 '72439398',
 '69577820',
 '68708197',
 '67895483',
 '71848552',
 '68257221',
 '68686791',
 '68806418',
 '69188332',
 '68062805',
 '68678164',
 '72680220',
 '69553242',
 '71633491',
 '71171000',
 '71677705',
 '62269820',
 '71677311',
 '70597736',
 '68258658',
 '68553492',
 '71596865',
 '69694967',
 '71750603',
 '72457901',
 '70597879',
 '72673887',
 '70599432',
 '71198896',
 '72672874',
 '71678606',
 '69250648',
 '71848359',
 '72233918',
 '70190910',
 '70256074',
 '68062611',
 '66887344',
 '72240625',
 '68257449',
 '70520065',
 '71849489',

In [3]:
# Run the CountVectorizer using the vocabulary
count_vectorizer = CountVectorizer(vocabulary = vocab)

# Fit and transform the preprocessed data to get the BoW representation
count_vectors = count_vectorizer.fit_transform(cleaned_descriptions)

In [4]:
count_vectors.shape

(776, 5168)
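
To make the shape above concrete, here is a minimal sketch of what a count vector over a fixed vocabulary contains. It is a simplification that splits on whitespace rather than using `CountVectorizer`'s own tokenizer, and `count_vector`, `toy_vocab` and `toy_description` are hypothetical names and data, not part of the assignment.

```python
from collections import Counter

def count_vector(text, vocab):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(text.split())
    # Words that are not in the vocabulary are simply ignored
    return [counts[word] for word in vocab]

# Hypothetical toy vocabulary and description (not taken from the data set)
toy_vocab = ["accountant", "engineer", "nurse", "sales"]
toy_description = "sales engineer needed to support senior sales engineer"

toy_vector = count_vector(toy_description, toy_vocab)
print(toy_vector)  # [0, 2, 0, 2]
```

Each description therefore becomes a vector with one entry per vocabulary word, which is why the matrix has as many columns as the vocabulary has words.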

### Word Embeddings Models:

Choose FastText as the embedding language model.

In [5]:
# Train the model using words from cleaned descriptions

# Set vector size and corpus file
file = 'cleaned_descriptions.txt'
model = FastText(vector_size = 200)
model.build_vocab(corpus_file = file)

# Train the model
model.train(corpus_file = file, 
                     epochs = model.epochs, 
                     total_examples = model.corpus_count, 
                     total_words = model.corpus_total_words)

# See model overview
print(model)

FastText(vocab=2741, vector_size=200, alpha=0.025)


In [6]:
# Save the trained FastText model to a file
model_path = 'ft_model.bin'
model.save(model_path)

In [7]:
# Load trained FastText model
model_path = 'ft_model.bin'
ft_model = FastText.load(model_path)

In [8]:
# Define a function to generate unweighted word embeddings and also handle missing words
def gen_unweighted(data, model):
    unweighted_word_embeddings = []
    
    for text in data:
        
        # Split text into tokens
        tokens = text.split()
        
        # Initiate an empty list to store unweighted embeddings
        unweighted_embeddings = []

        for token in tokens:
            
            # If the token is in the model's vocabulary, get its embedding
            if token in model.wv.key_to_index:
                word_vec = model.wv.get_vector(token)
                unweighted_embeddings.append(word_vec)
                
            else:
                
                # Handle missing words by replacing with a zero vector
                unweighted_embeddings.append(np.zeros(model.vector_size))

        # Calculate the mean of unweighted word embeddings for this text
        if unweighted_embeddings:
            unweighted_mean_embedding = np.mean(unweighted_embeddings, axis = 0)
            
        else:
            
            # Again, if no valid tokens, use a zero vector
            unweighted_mean_embedding = np.zeros(model.vector_size)

        unweighted_word_embeddings.append(unweighted_mean_embedding)

    return unweighted_word_embeddings

In [9]:
# Generate unweighted word embeddings with handling missing words for the preprocessed data
unweighted_descriptions = gen_unweighted(cleaned_descriptions, ft_model)

# Print example
unweighted_descriptions[0]

array([-3.16085886e-01,  1.31698322e-01,  1.42177568e-01, -1.93397799e-01,
        4.06616063e-01, -3.01417670e-02, -5.77941797e-02, -2.02977796e-01,
        8.99173869e-02, -1.21408529e-01, -1.16406157e-01, -1.14120013e-02,
       -3.41974583e-01, -1.77978319e-01, -3.52121552e-01, -8.55907139e-02,
        2.80723544e-02, -1.41619433e-01, -8.32215895e-02, -1.32550846e-01,
       -1.02289083e-01,  4.88783133e-02, -2.28657142e-02, -1.41454551e-02,
       -2.07850342e-01,  1.74182147e-01,  1.21845290e-02,  2.43405515e-01,
        4.44324232e-01,  1.94856189e-01,  3.75705325e-02, -5.84976507e-02,
        2.53603580e-01,  2.95438035e-02,  1.20897621e-02,  3.09662810e-01,
       -1.33661314e-01, -1.63215300e-01, -2.11922777e-01,  3.95339357e-01,
        7.55148386e-02,  5.49730560e-02,  1.23371840e-01,  5.37294854e-01,
        4.77035165e-02, -2.94763239e-01,  5.74185361e-03, -4.45274959e-02,
        9.38234671e-02, -3.07710857e-02, -4.38175227e-01,  3.58605801e-04,
        1.90199921e-01,  
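
The averaging step can be sketched on a toy example. This is a plain-Python illustration under assumed data, not the notebook's implementation: `mean_embedding` and `toy_vectors` are hypothetical. It shows that a token without an embedding still counts towards the number of tokens, so the zero vector pulls the mean towards zero.

```python
def mean_embedding(tokens, vectors, size):
    """Average token vectors; unknown tokens count as zero vectors."""
    if not tokens:
        return [0.0] * size
    total = [0.0] * size
    for token in tokens:
        vector = vectors.get(token, [0.0] * size)
        total = [t + v for t, v in zip(total, vector)]
    return [t / len(tokens) for t in total]

# Hypothetical 2-dimensional embeddings for two known words
toy_vectors = {"finance": [1.0, 0.0], "manager": [0.0, 1.0]}

# Two known tokens and one unknown token: each component becomes 1/3
doc_embedding = mean_embedding(["finance", "manager", "unknownword"], toy_vectors, 2)
print(doc_embedding)
```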

In [10]:
# Define a function to generate TF-IDF weighted word embeddings
def gen_tfidf_weighted(data, model):
    
    # Create a TF-IDF vectorizer
    tfidf_vectorizer = TfidfVectorizer(vocabulary = model.wv.index_to_key)
    tfidf_vectors = tfidf_vectorizer.fit_transform(data)
    
    tfidf_weighted_word_embeddings = []
    
    for tfidf_vector in tfidf_vectors:
        
        # Convert the TF-IDF vector to an array
        tfidf_array = tfidf_vector.toarray()[0]
        
        # Calculate the TF-IDF weighted sum of the word embeddings
        # (a list is passed to np.sum, since summing a generator is deprecated)
        weighted_embedding = np.sum(
            [tfidf_array[i] * model.wv.get_vector(token) if token in model.wv.key_to_index else np.zeros(model.vector_size)
             for i, token in enumerate(tfidf_vectorizer.get_feature_names_out())],
            axis = 0
        )
        
        tfidf_weighted_word_embeddings.append(weighted_embedding)
    
    return tfidf_weighted_word_embeddings

In [11]:
# Generate TF-IDF weighted word embeddings for the preprocessed data
tfidf_descriptions = gen_tfidf_weighted(cleaned_descriptions, ft_model)

# Print example
tfidf_descriptions[0]


array([-1.8355283e+00,  7.6412028e-01,  8.1532222e-01, -1.1226172e+00,
        2.3467407e+00, -1.7220733e-01, -3.3090088e-01, -1.1794643e+00,
        5.1841837e-01, -7.0068049e-01, -6.7607397e-01, -6.1454147e-02,
       -1.9713411e+00, -1.0359097e+00, -2.0333958e+00, -4.9681216e-01,
        1.5987560e-01, -8.1698257e-01, -4.8479760e-01, -7.6957029e-01,
       -5.9300935e-01,  2.7846721e-01, -1.3200891e-01, -8.2167700e-02,
       -1.2008241e+00,  1.0093498e+00,  7.6582752e-02,  1.4041541e+00,
        2.5691733e+00,  1.1274645e+00,  2.1495555e-01, -3.3625132e-01,
        1.4682111e+00,  1.7501213e-01,  7.0661485e-02,  1.7895256e+00,
       -7.8046918e-01, -9.4779813e-01, -1.2235588e+00,  2.2896054e+00,
        4.3282560e-01,  3.1813496e-01,  7.1764320e-01,  3.1026547e+00,
        2.7679476e-01, -1.7087544e+00,  3.5239469e-02, -2.5133288e-01,
        5.3907430e-01, -1.7997757e-01, -2.5379241e+00,  6.9880309e-03,
        1.1036607e+00,  1.4613020e+00,  9.5127404e-01,  5.2258557e-01,
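
The TF-IDF weighting above multiplies each word vector by its TF-IDF weight and adds the results; dividing by the total weight would turn that sum into a weighted mean. The sketch below contrasts the two on toy data. `weighted_embedding`, `toy_vectors` and `toy_weights` are hypothetical and the weights are made-up numbers, not real TF-IDF scores.

```python
def weighted_embedding(weights, vectors, size, normalise=False):
    """Combine token vectors using per-token weights.

    With normalise=False this is a weighted sum; with normalise=True the sum
    is divided by the total weight, giving a weighted mean.
    """
    total = [0.0] * size
    weight_sum = 0.0
    for token, weight in weights.items():
        vector = vectors.get(token)
        if vector is None:
            continue  # tokens without an embedding contribute nothing
        total = [t + weight * v for t, v in zip(total, vector)]
        weight_sum += weight
    if normalise and weight_sum > 0:
        total = [t / weight_sum for t in total]
    return total

# Hypothetical embeddings and weights
toy_vectors = {"finance": [1.0, 0.0], "manager": [0.0, 1.0]}
toy_weights = {"finance": 0.5, "manager": 0.25}

weighted_sum = weighted_embedding(toy_weights, toy_vectors, 2)
weighted_mean = weighted_embedding(toy_weights, toy_vectors, 2, normalise=True)
print(weighted_sum)
print(weighted_mean)
```

A sum and a mean point in the same direction but differ in length, which may matter for models that are sensitive to feature scale.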
      

### Saving outputs
Save the count vector representations into a file named 
- count_vectors.txt

In [12]:
# Save the count vectors to a file in the required format
count_vectors_file = "count_vectors.txt"
with open(count_vectors_file, 'w', encoding='utf-8') as f:
    for webindex, count_vector in zip(webindex_numbers, count_vectors):
        # Convert the count vector to a comma-separated string
        count_vector_str = ','.join([f"{i}:{count}" for i, count in enumerate(count_vector.toarray()[0]) if count > 0])
        f.write(f"#{webindex},{count_vector_str}\n")
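
As a quick check on the saved format, one line of the file can be parsed back into a webindex and a sparse dictionary of counts. This is a minimal sketch; `parse_count_vector_line` is a hypothetical helper and the sample line uses a made-up webindex.

```python
def parse_count_vector_line(line):
    """Parse '#webindex,index:count,...' into (webindex, {index: count})."""
    head, *pairs = line.strip().split(",")
    webindex = head.lstrip("#")
    counts = {}
    for pair in pairs:
        if not pair:
            continue  # a description with no vocabulary words leaves an empty field
        index, count = pair.split(":")
        counts[int(index)] = int(count)
    return webindex, counts

# Hypothetical line in the same format as the file written above
webindex, counts = parse_count_vector_line("#12345678,0:1,7:2,42:1\n")
print(webindex, counts)
```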

## Task 3. Job Advertisement Classification

### Language Models Comparison: Unweighted & TF-IDF Weighted Word Embedding

Job categories are derived from the folder names: Accounting_Finance, Engineering, Healthcare_Nursing, and Sales, with index numbers 0, 1, 2, and 3 respectively. These categories are used as labels when training and testing the models.
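
The index numbers follow the alphabetical order of the folder names, which is how `load_files` assigns its integer targets (the names themselves are available in `job_data.target_names`). A small sketch of that mapping, using the four folder names listed above:

```python
# Folder names in arbitrary order, as a file system might list them
folder_names = ["Sales", "Engineering", "Accounting_Finance", "Healthcare_Nursing"]

# Sorting the names reproduces the index assigned to each category
category_to_index = {name: index for index, name in enumerate(sorted(folder_names))}
print(category_to_index)
```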

In [5]:
# Load raw files to retrieve job categories from folder names using integers

job_data = load_files(r"data")  

job_categories = job_data.target
job_categories = [int(c) for c in job_categories]
job_categories

[0,
 0,
 2,
 0,
 2,
 1,
 2,
 0,
 3,
 3,
 0,
 0,
 1,
 3,
 1,
 3,
 3,
 1,
 3,
 2,
 2,
 2,
 3,
 3,
 0,
 2,
 2,
 2,
 0,
 2,
 3,
 1,
 2,
 0,
 1,
 3,
 3,
 1,
 1,
 0,
 2,
 2,
 2,
 2,
 0,
 0,
 2,
 1,
 3,
 1,
 1,
 2,
 2,
 3,
 0,
 0,
 1,
 0,
 2,
 2,
 3,
 3,
 3,
 0,
 3,
 0,
 1,
 2,
 3,
 1,
 3,
 2,
 3,
 1,
 3,
 2,
 1,
 3,
 2,
 1,
 3,
 2,
 2,
 1,
 0,
 1,
 1,
 1,
 3,
 0,
 3,
 1,
 3,
 2,
 2,
 0,
 2,
 3,
 2,
 1,
 0,
 1,
 1,
 2,
 0,
 3,
 0,
 1,
 3,
 2,
 1,
 2,
 0,
 3,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 3,
 1,
 1,
 1,
 1,
 1,
 3,
 1,
 1,
 3,
 2,
 0,
 0,
 1,
 3,
 2,
 0,
 1,
 0,
 3,
 1,
 2,
 1,
 0,
 0,
 0,
 3,
 0,
 1,
 2,
 3,
 1,
 1,
 1,
 2,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 2,
 0,
 2,
 2,
 0,
 2,
 3,
 2,
 2,
 0,
 2,
 1,
 0,
 1,
 1,
 1,
 3,
 1,
 3,
 1,
 0,
 3,
 1,
 0,
 2,
 0,
 0,
 2,
 1,
 1,
 0,
 1,
 3,
 0,
 1,
 1,
 3,
 0,
 1,
 0,
 2,
 3,
 0,
 2,
 0,
 1,
 0,
 1,
 3,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 2,
 1,
 3,
 1,
 2,
 3,
 1,
 1,
 2,
 0,
 0,
 1,
 2,
 0,
 3,
 2,
 3,
 2,
 2,
 3,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 0,


For Q1, we use three different machine learning models to evaluate the accuracy of each feature representation:

- Logistic Regression Model (easy to interpret and serves as a baseline for comparison)
- Random Forest Model (less likely to overfit and can handle noisy features)
- Support Vector Machine Model (can handle complicated relationships between features and categories)

We use unweighted word embeddings and TF-IDF weighted embeddings as feature representations. Both representations are based on the cleaned descriptions from Task 1.
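
The same evaluation can be written as a loop over the three models. The sketch below is only an illustration of that pattern: it uses synthetic data from `make_classification` as a stand-in for the embeddings, and `X_demo`, `y_demo`, `models` and `mean_scores` are hypothetical names, so its scores say nothing about the job advertisement data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

seed = 15

# Synthetic stand-in for document embeddings: 200 samples, 20 features, 4 classes
X_demo, y_demo = make_classification(n_samples=200, n_features=20, n_informative=6,
                                     n_classes=4, random_state=seed)

models = {
    "Logistic Regression": LogisticRegression(random_state=seed, max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    "Support Vector Machine": SVC(kernel="linear", C=1.0),
}

# Run 5-fold cross-validation for every model and keep the mean accuracy
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```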

In [14]:
# Prepare Unweighted and TF-IDF Weighted Descriptions for training and testing

seed = 15
max_iter = 1000

# Split the Unweighted embeddings for descriptions into training and testing sets
X_train, X_test, y_train, y_test, train_indices, test_indices = train_test_split(unweighted_descriptions, 
                                                                                 job_categories, 
                                                                                 list(range(0,len(job_categories))),
                                                                                 test_size = 0.2, 
                                                                                 random_state = seed)

# Split the TF-IDF weighted embeddings for descriptions into training and testing sets
X_train, X_test, y_train, y_test, train_indices, test_indices = train_test_split(tfidf_descriptions, 
                                                                                 job_categories, 
                                                                                 list(range(0,len(job_categories))),
                                                                                 test_size = 0.2, 
                                                                                 random_state = seed)

In [15]:
# First, choose Logistic Regression Model
lr_model = LogisticRegression(random_state = seed, max_iter = max_iter)

In [16]:
# Calculate the accuracy score using unweighted word embeddings

# Perform 5-fold cross-validation and specify the scoring metric
unweighted_description_scores_lr = cross_val_score(lr_model, unweighted_descriptions, 
                                                   job_categories, cv = 5, scoring = 'accuracy')

# Print the cross-validation scores
print("Results of Unweighted Word Embeddings for Descriptions:\n")
print("Cross-validation scores:", unweighted_description_scores_lr)

# Calculate and print the mean and standard deviation of the scores
print("Mean accuracy:", unweighted_description_scores_lr.mean())
print("Standard deviation:", unweighted_description_scores_lr.std())

Results of Unweighted Word Embeddings for Descriptions:

Cross-validation scores: [0.42948718 0.41935484 0.43870968 0.44516129 0.49677419]
Mean accuracy: 0.4458974358974359
Standard deviation: 0.026886636949392504
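
The mean and standard deviation reported above can be reproduced from the five fold scores with the standard library. NumPy's `.std()` is the population standard deviation by default, so `statistics.pstdev` is the matching function; the scores below are copied from the output above.

```python
from statistics import mean, pstdev

# Fold scores copied from the cross-validation output above
fold_scores = [0.42948718, 0.41935484, 0.43870968, 0.44516129, 0.49677419]

mean_accuracy = mean(fold_scores)
std_accuracy = pstdev(fold_scores)  # population standard deviation, like numpy's default
print(round(mean_accuracy, 3), round(std_accuracy, 3))  # 0.446 0.027
```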


In [17]:
# Calculate the accuracy score using TF-IDF weighted word embeddings

# Perform 5-fold cross-validation and specify the scoring metric
tfidf_description_scores_lr = cross_val_score(lr_model, tfidf_descriptions, 
                                              job_categories, cv = 5, scoring = 'accuracy')

# Print the cross-validation scores
print("Results of TF-IDF Weighted Word Embeddings for Descriptions:\n")
print("Cross-validation scores:", tfidf_description_scores_lr)

# Calculate and print the mean and standard deviation of the scores
print("Mean accuracy:", tfidf_description_scores_lr.mean())
print("Standard deviation:", tfidf_description_scores_lr.std())

Results of TF-IDF Weighted Word Embeddings for Descriptions:

Cross-validation scores: [0.62820513 0.59354839 0.62580645 0.58709677 0.64516129]
Mean accuracy: 0.6159636062861868
Standard deviation: 0.02206797345693831


In [18]:
# Then, try Random Forest Model
rf_model = RandomForestClassifier(n_estimators = 100, random_state = 15)

In [19]:
# Calculate the accuracy score using unweighted word embeddings

# Perform 5-fold cross-validation and specify the scoring metric
unweighted_description_scores_rf = cross_val_score(rf_model, unweighted_descriptions,
                                                   job_categories, cv = 5, scoring = 'accuracy')

# Print the cross-validation scores
print("Results of Unweighted Word Embeddings for Descriptions:\n")
print("Cross-validation scores:", unweighted_description_scores_rf)

# Calculate and print the mean and standard deviation of the scores
print("Mean accuracy:", unweighted_description_scores_rf.mean())
print("Standard deviation:", unweighted_description_scores_rf.std())

Results of Unweighted Word Embeddings for Descriptions:

Cross-validation scores: [0.51923077 0.52903226 0.50322581 0.56129032 0.54193548]
Mean accuracy: 0.5309429280397022
Standard deviation: 0.019752796217122354


In [20]:
# Calculate the accuracy score using TF-IDF weighted word embeddings

# Perform 5-fold cross-validation and specify the scoring metric
tfidf_description_scores_rf = cross_val_score(rf_model, tfidf_descriptions, 
                                              job_categories, cv = 5, scoring = 'accuracy')

# Print the cross-validation scores
print("Results of TF-IDF Weighted Word Embeddings for Descriptions:\n")
print("Cross-validation scores:", tfidf_description_scores_rf)

# Calculate and print the mean and standard deviation of the scores
print("Mean accuracy:", tfidf_description_scores_rf.mean())
print("Standard deviation:", tfidf_description_scores_rf.std())

Results of TF-IDF Weighted Word Embeddings for Descriptions:

Cross-validation scores: [0.53846154 0.56774194 0.46451613 0.55483871 0.53548387]
Mean accuracy: 0.5322084367245659
Standard deviation: 0.03579619270124285


In [21]:
# Then, try Support Vector Machine Model
svm_model = SVC(kernel = 'linear', C = 1.0)

In [22]:
# Calculate the accuracy score using unweighted word embeddings

# Perform 5-fold cross-validation and specify the scoring metric
unweighted_description_scores_svm = cross_val_score(svm_model, unweighted_descriptions, 
                                                    job_categories, cv = 5, scoring = 'accuracy')

# Print the cross-validation scores
print("Results of Unweighted Word Embeddings for Descriptions:\n")
print("Cross-validation scores:", unweighted_description_scores_svm)

# Calculate and print the mean and standard deviation of the scores
print("Mean accuracy:", unweighted_description_scores_svm.mean())
print("Standard deviation:", unweighted_description_scores_svm.std())

Results of Unweighted Word Embeddings for Descriptions:

Cross-validation scores: [0.43589744 0.45806452 0.41935484 0.43870968 0.46451613]
Mean accuracy: 0.4433085194375517
Standard deviation: 0.016231786535315887


In [23]:
# Calculate the accuracy score using TF-IDF weighted word embeddings

# Perform 5-fold cross-validation and specify the scoring metric
tfidf_description_scores_svm = cross_val_score(svm_model, tfidf_descriptions, 
                                               job_categories, cv = 5, scoring = 'accuracy')

# Print the cross-validation scores
print("Results of TF-IDF Weighted Word Embeddings for Descriptions:\n")
print("Cross-validation scores:", tfidf_description_scores_svm)

# Calculate and print the mean and standard deviation of the scores
print("Mean accuracy:", tfidf_description_scores_svm.mean())
print("Standard deviation:", tfidf_description_scores_svm.std())

Results of TF-IDF Weighted Word Embeddings for Descriptions:

Cross-validation scores: [0.6025641  0.60645161 0.61290323 0.56129032 0.61935484]
Mean accuracy: 0.6005128205128205
Standard deviation: 0.020427555655148158


In [24]:
# Set the rounded decimal points
dec = 3

# Print comparison
print("Mean Accuracy\n")
print("Logistic Regression Model")
print("Unweighted:\t\t", round(unweighted_description_scores_lr.mean(), dec))
print("TF-IDF Weighted:\t", round(tfidf_description_scores_lr.mean(), dec))
print("\nRandom Forest Model")
print("Unweighted:\t\t", round(unweighted_description_scores_rf.mean(), dec))
print("TF-IDF Weighted:\t", round(tfidf_description_scores_rf.mean(), dec))
print("\nSupport Vector Machine Model")
print("Unweighted:\t\t", round(unweighted_description_scores_svm.mean(), dec))
print("TF-IDF Weighted:\t", round(tfidf_description_scores_svm.mean(), dec))

Mean Accuracy

Logistic Regression Model
Unweighted:		 0.446
TF-IDF Weighted:	 0.616

Random Forest Model
Unweighted:		 0.531
TF-IDF Weighted:	 0.532

Support Vector Machine Model
Unweighted:		 0.443
TF-IDF Weighted:	 0.601


### Q1 Analysis:

#### Logistic Regression Model:

- The TF-IDF weighted feature representation clearly outperforms the unweighted representation (mean accuracy 0.616 vs 0.446).
- This indicates that TF-IDF weighting helps the logistic regression model better capture the distinguishing features among job advertisements.

#### Random Forest Model:

- The unweighted and TF-IDF weighted representations have almost identical mean accuracy scores (0.531 vs 0.532), a difference well within the spread across folds.
- Random Forest is an ensemble of decision trees and may not benefit from TF-IDF weighting as much as logistic regression, which relies on linear relationships.

#### Support Vector Machine Model

- As with logistic regression, the TF-IDF weighted feature representation performs better (0.601 vs 0.443).
- The SVM here uses a linear kernel, so, like logistic regression, it appears to benefit from TF-IDF weighting, which helps to create better decision boundaries.

#### Summary

In summary, the benefit of TF-IDF weighting depends on the machine learning model being used. Logistic regression and the support vector machine benefit clearly from TF-IDF weighting, as it helps them capture the importance of words. Random Forest shows practically no change (0.531 vs 0.532), possibly because it handles non-linear relationships differently. Overall, the TF-IDF weighted representation performs at least as well as the unweighted one for all three models in this analysis, and clearly better for two of them.
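
One rough way to read these numbers is to compare each gain in mean accuracy with the spread across folds. The sketch below uses the rounded means and standard deviations reported in the outputs above; treating the larger of the two standard deviations as a noise level is an informal heuristic assumed here, not a statistical test.

```python
# (mean accuracy, fold standard deviation), rounded from the outputs above
results = {
    "Logistic Regression": {"unweighted": (0.446, 0.027), "tfidf": (0.616, 0.022)},
    "Random Forest": {"unweighted": (0.531, 0.020), "tfidf": (0.532, 0.036)},
    "Support Vector Machine": {"unweighted": (0.443, 0.016), "tfidf": (0.601, 0.020)},
}

gains = {}
for model_name, scores in results.items():
    gain = scores["tfidf"][0] - scores["unweighted"][0]
    noise = max(scores["tfidf"][1], scores["unweighted"][1])
    gains[model_name] = gain
    verdict = "larger than" if gain > noise else "within"
    print(f"{model_name}: gain {gain:+.3f}, {verdict} the fold spread ({noise:.3f})")
```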

### Accuracy Improvement: Descriptions, Titles, and the Combination of Both

To investigate whether additional information, such as the title of a job advertisement, improves the accuracy of our classification models, we begin by extracting the titles from the raw data. Because titles are much shorter than descriptions, we filter them less aggressively than we did the descriptions. Otherwise the steps closely follow those in Task 1 and Task 2.

In [10]:
# Extract titles from job data

titles = []

# Define a function to extract the title part from a text
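def extract_title(text):
    start_text = text.find("Title: ")
    if start_text != -1:
        # Look for the line break that follows the "Title: " label
        end_text = text.find("\n", start_text)
        if end_text == -1:
            end_text = len(text)
        title = text[start_text + len("Title: "):end_text]
        return title
    else:
        return ""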

# Iterate through the loaded data and extract titles
for text in job_data.data:
    title = extract_title(text.decode("utf-8"))
    titles.append(title)

# See example of the first title    
emp = 0
titles[emp]

'Finance / Accounts Asst Bromley to ****k'
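
Fields such as the title and the webindex can also be pulled out in one pass by splitting each line on its label. This is a minimal sketch, assuming every field sits on its own line in `Label: value` form; `parse_fields` and `sample_text` are hypothetical, and a multi-line description containing `": "` would add extra keys.

```python
def parse_fields(text):
    """Map each 'Label: value' line to a dictionary entry."""
    fields = {}
    for line in text.splitlines():
        label, separator, value = line.partition(": ")
        if separator:
            fields[label] = value.strip()
    return fields

# Hypothetical advertisement text, not taken from the data set
sample_text = "Title: Finance Assistant\nWebindex: 12345678\nDescription: Assist the finance team"

fields = parse_fields(sample_text)
print(fields["Title"], fields["Webindex"])
```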

In [11]:
# Extract webindex numbers from job data

webbbb = []

# Define a function to extract the webindex part from a text
def extract_webbb(text):
    start_text = text.find("Webindex: ")
    if start_text != -1:
        # Search for the line break after the "Webindex: " label, not from the start of the text
        end_text = text.find("\n", start_text)
        if end_text == -1:
            end_text = len(text)
        webbb = text[start_text + len("Webindex: "):end_text]
        return webbb
    else:
        return ""

# Iterate through the loaded data and extract webindex numbers
for text in job_data.data:
    webbb = extract_webbb(text.decode("utf-8"))
    webbbb.append(webbb)

# See example of the second webindex
emp = 1
webbbb[emp]

In [26]:
# Define a function to tokenise data

def tokenize_data(data_raw):
    """
        This function first converts all words to lowercase,
        then segments the raw text into sentences, tokenizes each sentence,
        and returns the text as a single list of tokens.
    """        
    # Convert to lower case
    data_lc = data_raw.lower()
    
    # segment into sentences
    sentences = sent_tokenize(data_lc)
    
    # tokenize each sentence
    pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
    tokenizer = RegexpTokenizer(pattern) 
    token_lists = [tokenizer.tokenize(sen) for sen in sentences]
    
    # merge them into a list of tokens
    data_tokenised = list(chain.from_iterable(token_lists))
    return data_tokenised

In [27]:
# Define a function to print the current stats

def stats_print(data_tk):
    words = list(chain.from_iterable(data_tk))
    vocab = set(words)
    lexical_diversity = len(vocab)/len(words)
    print("Vocabulary size: ",len(vocab))
    print("Total number of tokens: ", len(words))
    print("Lexical diversity: ", lexical_diversity)
    print("Total number of reviews:", len(data_tk))
    lens = [len(article) for article in data_tk]
    print("Average review length:", np.mean(lens))
    print("Maximun review length:", np.max(lens))
    print("Minimun review length:", np.min(lens))
    print("Standard deviation of review length:", np.std(lens))

In [28]:
# Tokenise the data and compare the result with the original data

title_tk = [tokenize_data(d) for d in titles]

print("Original Data:\n", titles[emp],'\n')
print("Tokenized Data:\n", title_tk[emp])

Original Data:
 Finance / Accounts Asst Bromley to ****k 

Tokenized Data:
 ['finance', 'accounts', 'asst', 'bromley', 'to', 'k']
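
The effect of the token pattern can be checked with the standard `re` module alone. This sketch skips the sentence segmentation step and applies the same regular expression directly; `simple_tokenize` is a hypothetical helper, and the second input is a made-up title that shows hyphens and apostrophes being kept inside a token.

```python
import re

# Same pattern as in tokenize_data: letters, optionally joined once by - or '
TOKEN_PATTERN = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

def simple_tokenize(text):
    """Lowercase the text and return every match of the token pattern."""
    return re.findall(TOKEN_PATTERN, text.lower())

title_tokens = simple_tokenize("Finance / Accounts Asst Bromley to ****k")
hyphen_tokens = simple_tokenize("Part-time Nurse's Aide")
print(title_tokens)
print(hyphen_tokens)
```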


In [29]:
# Check overall stats

stats_print(title_tk)

Vocabulary size:  1003
Total number of tokens:  3157
Lexical diversity:  0.3177066835603421
Total number of reviews: 776
Average review length: 4.068298969072165
Maximun review length: 13
Minimun review length: 1
Standard deviation of review length: 1.8386529115562282


In [30]:
# Filter out single character tokens
title_tk = [[w for w in words if len(w) >=2] for words in title_tk]

print("Tokenized Data with at least 2 characters:\n", title_tk[emp])

Tokenized Data with at least 2 characters:
 ['finance', 'accounts', 'asst', 'bromley', 'to']


In [31]:
# Import stop words from the required file

stopwords_file = "stopwords_en.txt"
with open(stopwords_file, 'r') as f:
    stop_words = set(f.read().split())
    
# Filter out stop words

title_tk = [[w for w in words if w not in stop_words] for words in title_tk]

print("Tokenized Data excluding stop words:\n", title_tk[emp])

Tokenized Data excluding stop words:
 ['finance', 'accounts', 'asst', 'bromley']


In [32]:
# Check stats after cleansing

stats_print(title_tk)

Vocabulary size:  954
Total number of tokens:  2963
Lexical diversity:  0.32197097536280794
Total number of reviews: 776
Average review length: 3.818298969072165
Maximun review length: 10
Minimun review length: 1
Standard deviation of review length: 1.5653217426587334
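
The two filtering steps and the lexical diversity figure can be reproduced on a toy example. `clean_tokens`, `lexical_diversity`, `toy_stop_words` and `toy_titles` are hypothetical; the stop word set is a tiny stand-in for `stopwords_en.txt`.

```python
def clean_tokens(tokens, stop_words, min_len=2):
    """Drop tokens shorter than min_len characters and stop words."""
    return [t for t in tokens if len(t) >= min_len and t not in stop_words]

def lexical_diversity(token_lists):
    """Ratio of unique tokens to total tokens across all documents."""
    words = [t for tokens in token_lists for t in tokens]
    return len(set(words)) / len(words) if words else 0.0

# Hypothetical stop words and tokenised titles
toy_stop_words = {"to", "the", "of"}
toy_titles = [
    ["finance", "accounts", "asst", "bromley", "to", "k"],
    ["head", "of", "finance"],
]

cleaned = [clean_tokens(tokens, toy_stop_words) for tokens in toy_titles]
diversity = lexical_diversity(cleaned)
print(cleaned)
print(diversity)  # 5 unique tokens out of 6
```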


In [33]:
# Combine tokens and save output as a text file

combined_data = [" ".join(tokens) for tokens in title_tk]
output_file = "cleaned_titles.txt"

with open(output_file, 'w', encoding='utf-8') as f:
    for title in combined_data:
        f.write(title + '\n')

In [34]:
# Load the cleaned titles
with open("cleaned_titles.txt", "r", encoding="utf-8") as file:
    cleaned_titles = file.readlines()

We then create feature representations just as we did for descriptions and test them with the model. The difference now is that we are not comparing the performance of different models. Our main focus is to assess whether the inclusion of additional information, such as the job title, enhances the accuracy of our classification model. We use only the Logistic Regression Model for this comparison, with the following combinations of feature representations:

- Using Only Descriptions
- Using Only Titles
- Using Both Descriptions and Titles

We have already evaluated the descriptions in Q1, so we now test the titles and the combination of both. The feature representations for titles are created first. We then combine them with the description feature representations created earlier to obtain the feature representations for the combination of both.
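
One common way to combine two feature representations is to concatenate them, so that each advertisement gets a single vector holding the description features followed by the title features. The sketch below illustrates that option on made-up vectors; it is an assumption for illustration and not necessarily the combination used in this notebook.

```python
import numpy as np

# Hypothetical embeddings: 3 advertisements, 4-dimensional vectors each
description_vectors = np.array([[0.1, 0.2, 0.3, 0.4],
                                [0.5, 0.6, 0.7, 0.8],
                                [0.9, 1.0, 1.1, 1.2]])
title_vectors = np.array([[1.0, 0.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0, 0.0],
                          [0.0, 0.0, 1.0, 0.0]])

# Stack the two matrices side by side: one row per advertisement
combined_vectors = np.hstack([description_vectors, title_vectors])
print(combined_vectors.shape)  # (3, 8)
```

Concatenation keeps the two sources separate, so a linear model can learn different weights for description and title features.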

In [35]:
# Generate unweighted word embeddings with handling missing words for the preprocessed data
unweighted_titles = gen_unweighted(cleaned_titles, ft_model)

# Print example
unweighted_titles[0]

array([-0.21515532,  0.05511866,  0.03388772, -0.14548835,  0.128492  ,
       -0.02563727, -0.00583292, -0.1600173 ,  0.05758339, -0.04659123,
       -0.06965681,  0.00988455, -0.14751576, -0.15219685, -0.15865776,
       -0.08833994, -0.05861265, -0.05649972, -0.07056785, -0.09055908,
       -0.07902582, -0.0107549 , -0.01048948, -0.04635653, -0.05785845,
        0.08659559,  0.01476522,  0.16696759,  0.24409226,  0.05835579,
        0.01647919, -0.02232521,  0.15415183,  0.03250293, -0.00145524,
        0.18080746, -0.12884754, -0.05294625, -0.08375809,  0.29147463,
        0.03595039,  0.0235871 ,  0.12984306,  0.25126754,  0.05035678,
       -0.25617716,  0.00262706,  0.01418623,  0.0443195 , -0.0508069 ,
       -0.20478545,  0.0030441 ,  0.11874967,  0.18297692,  0.13667066,
        0.06711508, -0.02365233,  0.04280203, -0.07624581,  0.06980649,
       -0.07172519,  0.04018674, -0.14695258, -0.05339774,  0.05073516,
       -0.0473206 ,  0.00140917, -0.10365963,  0.20051552, -0.24

In [36]:
# Generate TF-IDF weighted word embeddings for the preprocessed data
tfidf_titles = gen_tfidf_weighted(cleaned_titles, ft_model)

# Print example
tfidf_titles[0]


array([-0.6117287 ,  0.15545551,  0.0920928 , -0.41432494,  0.35682794,
       -0.07238081, -0.01507537, -0.4555808 ,  0.16283438, -0.13068989,
       -0.1968213 ,  0.02939172, -0.41377914, -0.4333533 , -0.44627827,
       -0.25225547, -0.170574  , -0.15794107, -0.2012933 , -0.2579033 ,
       -0.2246464 , -0.03302472, -0.02899878, -0.13258171, -0.15978324,
        0.24461642,  0.04358194,  0.47356588,  0.69021434,  0.16246039,
        0.04604499, -0.06245265,  0.436869  ,  0.09366398, -0.00391887,
        0.51171   , -0.36790434, -0.14879237, -0.23486975,  0.8283162 ,
        0.10063478,  0.06558502,  0.37092698,  0.70719737,  0.14355606,
       -0.7295722 ,  0.0076462 ,  0.04319708,  0.1244489 , -0.14593127,
       -0.5773063 ,  0.00892595,  0.33709264,  0.5203637 ,  0.3889023 ,
        0.19064774, -0.06569603,  0.12407997, -0.21603751,  0.19808936,
       -0.20198476,  0.10916024, -0.41299608, -0.14580412,  0.14440708,
       -0.13885088,  0.00254008, -0.29248828,  0.5763472 , -0.68

In [37]:
# Prepare Unweighted and TF-IDF Weighted Titles for training and testing

seed = 15
max_iter = 1000

# Split the unweighted embeddings for titles into training and testing sets
X_train, X_test, y_train, y_test, train_indices, test_indices = train_test_split(unweighted_titles, 
                                                                                 job_categories, 
                                                                                 list(range(0,len(job_categories))),
                                                                                 test_size = 0.2, 
                                                                                 random_state = seed)

# Split the TF-IDF weighted embeddings for titles into training and testing sets
# Note: this reuses the variable names above, so it overwrites the unweighted split
X_train, X_test, y_train, y_test, train_indices, test_indices = train_test_split(tfidf_titles, 
                                                                                 job_categories, 
                                                                                 list(range(0,len(job_categories))),
                                                                                 test_size = 0.2, 
                                                                                 random_state = seed)

In [38]:
# Calculate the accuracy score using unweighted word embeddings

# Perform 5-fold cross-validation and specify the scoring metric
unweighted_title_scores_lr = cross_val_score(lr_model, unweighted_titles, 
                                                   job_categories, cv = 5, scoring = 'accuracy')

# Print the cross-validation scores
print("Results of Unweighted Word Embeddings for Titles:\n")
print("Cross-validation scores:", unweighted_title_scores_lr)

# Calculate and print the mean and standard deviation of the scores
print("Mean accuracy:", unweighted_title_scores_lr.mean())
print("Standard deviation:", unweighted_title_scores_lr.std())

Results of Unweighted Word Embeddings for Titles:

Cross-validation scores: [0.43589744 0.43870968 0.46451613 0.53548387 0.53548387]
Mean accuracy: 0.48201819685690656
Standard deviation: 0.04477997642193246


In [39]:
# Calculate the accuracy score using TF-IDF weighted word embeddings

# Perform 5-fold cross-validation and specify the scoring metric
tfidf_title_scores_lr = cross_val_score(lr_model, tfidf_titles, 
                                              job_categories, cv = 5, scoring = 'accuracy')

# Print the cross-validation scores
print("Results of TF-IDF Weighted Word Embeddings for Titles:\n")
print("Cross-validation scores:", tfidf_title_scores_lr)

# Calculate and print the mean and standard deviation of the scores
print("Mean accuracy:", tfidf_title_scores_lr.mean())
print("Standard deviation:", tfidf_title_scores_lr.std())

Results of TF-IDF Weighted Word Embeddings for Titles:

Cross-validation scores: [0.51923077 0.47741935 0.52903226 0.58064516 0.50967742]
Mean accuracy: 0.5232009925558313
Standard deviation: 0.033551286820224145
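
The two sets of fold scores above can also be compared fold by fold rather than only through their means. The sketch below copies the printed scores into plain lists (the variable names are my own, not from the notebook) and assumes both runs used the same five folds, which should hold because `cv = 5` with the same labels produces the same unshuffled stratified split each time.

```python
from statistics import mean

# Fold accuracies copied from the two cross-validation outputs above
unweighted_title_folds = [0.43589744, 0.43870968, 0.46451613, 0.53548387, 0.53548387]
tfidf_title_folds = [0.51923077, 0.47741935, 0.52903226, 0.58064516, 0.50967742]

# Per-fold gain from TF-IDF weighting (positive means TF-IDF did better)
fold_gains = [t - u for t, u in zip(tfidf_title_folds, unweighted_title_folds)]

# Count the folds in which the TF-IDF weighted representation scored higher
folds_won = sum(gain > 0 for gain in fold_gains)

print("Per-fold gain:", [round(gain, 3) for gain in fold_gains])
print("Mean gain:", round(mean(fold_gains), 3))
print("Folds where TF-IDF wins:", folds_won, "of", len(fold_gains))
```

With only five folds this is a rough indication rather than a significance test.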


Finally, we combine the description and title representations and test them for the last set of comparisons.

In [40]:
# Concatenate the description and title embeddings horizontally,
# once for the unweighted and once for the TF-IDF weighted representations
unweighted_combined = np.hstack((unweighted_descriptions, unweighted_titles))
tfidf_combined = np.hstack((tfidf_descriptions, tfidf_titles))
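
As a small illustration of what the horizontal concatenation does, the sketch below uses toy arrays (hypothetical stand-ins, not the real embeddings). It assumes, as in the cell above, that both inputs have one row per advertisement in the same order, so that each combined row describes a single advertisement.

```python
import numpy as np

# Toy stand-ins: 3 advertisements, 4-dimensional description and title vectors
toy_descriptions = np.arange(12, dtype=float).reshape(3, 4)
toy_titles = np.ones((3, 4))

# hstack keeps the number of rows and adds up the number of columns
toy_combined = np.hstack((toy_descriptions, toy_titles))

print("Descriptions:", toy_descriptions.shape)
print("Titles:", toy_titles.shape)
print("Combined:", toy_combined.shape)
```

The first columns of each combined row come from the description vector and the remaining columns from the title vector, so the classifier sees twice as many features per advertisement.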

In [41]:
# Prepare Unweighted and TF-IDF Weighted Descriptions and Titles for training and testing

seed = 15
max_iter = 1000

# Split the unweighted embeddings for both descriptions and titles into training and testing sets
X_train, X_test, y_train, y_test, train_indices, test_indices = train_test_split(unweighted_combined, 
                                                                                 job_categories, 
                                                                                 list(range(0,len(job_categories))),
                                                                                 test_size = 0.2, 
                                                                                 random_state = seed)

# Split the TF-IDF weighted embeddings for both descriptions and titles into training and testing sets
# Note: this reuses the variable names above, so it overwrites the unweighted split
X_train, X_test, y_train, y_test, train_indices, test_indices = train_test_split(tfidf_combined, 
                                                                                 job_categories, 
                                                                                 list(range(0,len(job_categories))),
                                                                                 test_size = 0.2, 
                                                                                 random_state = seed)

In [42]:
# Calculate the accuracy score using unweighted word embeddings

# Perform 5-fold cross-validation and specify the scoring metric
unweighted_combined_scores_lr = cross_val_score(lr_model, unweighted_combined, 
                                                job_categories, cv = 5, scoring = 'accuracy')

# Print the cross-validation scores
print("Results of Unweighted Word Embeddings for Descriptions and Titles:\n")
print("Cross-validation scores:", unweighted_combined_scores_lr)

# Calculate and print the mean and standard deviation of the scores
print("Mean accuracy:", unweighted_combined_scores_lr.mean())
print("Standard deviation:", unweighted_combined_scores_lr.std())

Results of Unweighted Word Embeddings for Descriptions and Titles:

Cross-validation scores: [0.48717949 0.50967742 0.48387097 0.5483871  0.52903226]
Mean accuracy: 0.5116294458229943
Standard deviation: 0.024601328288293988


In [43]:
# Calculate the accuracy score using TF-IDF weighted word embeddings

# Perform 5-fold cross-validation and specify the scoring metric
tfidf_combined_scores_lr = cross_val_score(lr_model, tfidf_combined, 
                                           job_categories, cv = 5, scoring = 'accuracy')

# Print the cross-validation scores
print("Results of TF-IDF Weighted Word Embeddings for Descriptions and Titles:\n")
print("Cross-validation scores:", tfidf_combined_scores_lr)

# Calculate and print the mean and standard deviation of the scores
print("Mean accuracy:", tfidf_combined_scores_lr.mean())
print("Standard deviation:", tfidf_combined_scores_lr.std())

Results of TF-IDF Weighted Word Embeddings for Descriptions and Titles:

Cross-validation scores: [0.6474359  0.58064516 0.62580645 0.63870968 0.65806452]
Mean accuracy: 0.6301323407775021
Standard deviation: 0.026910534916375437


In [44]:
# Set the rounded decimal points
dec = 3

# Print comparison
print("Mean Accuracy\n")
print("Unweighted")
print("Only Descriptions:\t", round(unweighted_description_scores_lr.mean(), dec))
print("Only Titles:\t\t", round(unweighted_title_scores_lr.mean(), dec))
print("Combination:\t\t", round(unweighted_combined_scores_lr.mean(), dec))
print("\nTF-IDF Weighted")
print("Only Descriptions:\t", round(tfidf_description_scores_lr.mean(), dec))
print("Only Titles:\t\t", round(tfidf_title_scores_lr.mean(), dec))
print("Combination:\t\t", round(tfidf_combined_scores_lr.mean(), dec))

Mean Accuracy

Unweighted
Only Descriptions:	 0.446
Only Titles:		 0.482
Combination:		 0.512

TF-IDF Weighted
Only Descriptions:	 0.616
Only Titles:		 0.523
Combination:		 0.63


### Q2 Analysis:

#### Unweighted Representations:

- Using only job descriptions gives the lowest mean accuracy of the three settings (0.446).
- Using only job titles raises the mean accuracy to 0.482. This suggests that, when every word is weighted equally, titles on their own are at least as informative as descriptions, although the gap is smaller than the standard deviation across folds for titles (0.045).
- Combining job descriptions and titles improves the mean accuracy further, to 0.512. This suggests that using both sources of information gives better classification accuracy than using either one individually.

#### TF-IDF Weighted Representations:

- With only descriptions, the mean accuracy improves substantially compared to the unweighted representation (from 0.446 to 0.616). TF-IDF weighting seems to enhance the model's ability to classify job advertisements based on descriptions.
- With only job titles, the mean accuracy also improves compared to the unweighted representation (from 0.482 to 0.523). Job titles still provide valuable information, but TF-IDF weighting has a much smaller impact here (+0.041) than it does with descriptions (+0.170).
- Combining job descriptions and titles with TF-IDF weighting gives the best mean accuracy (0.630). The margin over TF-IDF weighted descriptions alone is small (+0.014, less than the standard deviation across folds of about 0.027), so in this setting the titles add only a modest amount on top of the descriptions.

#### Summary

TF-IDF weighting improves classification accuracy compared to unweighted representations for descriptions, titles, and their combination, with by far the largest gain for descriptions. Combining job descriptions and titles also gives the highest mean accuracy in both the unweighted and the TF-IDF weighted scenarios, although the gain over the best single source is small (0.030 unweighted, 0.014 TF-IDF weighted). TF-IDF weighted descriptions alone already achieve a relatively high accuracy score (0.616), which emphasises their significance in the classification process. In summary, the results indicate that including both job descriptions and titles, especially when TF-IDF weighting is applied, leads to the best classification performance.
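
To make the size of these effects explicit, the rounded mean accuracies from the comparison output can be put into a small dictionary and subtracted. This is only a sketch over the printed numbers; the dictionary and variable names are my own and are not used elsewhere in the notebook.

```python
# Mean accuracies as printed in the comparison output (rounded to 3 decimals)
mean_accuracy = {
    "Only Descriptions": {"unweighted": 0.446, "tfidf": 0.616},
    "Only Titles": {"unweighted": 0.482, "tfidf": 0.523},
    "Combination": {"unweighted": 0.512, "tfidf": 0.630},
}

# Gain from TF-IDF weighting for each feature set
tfidf_gain = {
    name: round(scores["tfidf"] - scores["unweighted"], 3)
    for name, scores in mean_accuracy.items()
}

# Gain from combining both sources over the better single source
combination_gain = {}
for weighting in ("unweighted", "tfidf"):
    best_single = max(mean_accuracy["Only Descriptions"][weighting],
                      mean_accuracy["Only Titles"][weighting])
    combination_gain[weighting] = round(
        mean_accuracy["Combination"][weighting] - best_single, 3)

print("Gain from TF-IDF weighting:", tfidf_gain)
print("Gain from combining sources:", combination_gain)
```

Because the inputs are already rounded, the differences are accurate to about three decimal places only.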

## Summary

These tasks covered preprocessing and representing job advertisement data and evaluating different models for categorising job advertisements. The choice of model, feature representation, and data combination strategy all affected classification accuracy. To improve job advertisement classification further, we could explore and analyse the data in more depth, fine-tune model settings, incorporate additional features, experiment with other machine learning models, and use more advanced natural language processing tools. Though I'm not an expert, this assignment has given me a better picture of Data Science.
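
As one possible starting point for the fine-tuning mentioned above, the sketch below runs a small grid search over the regularisation strength `C` of a logistic regression. It is not part of the submitted experiments: the data is synthetic (generated with `make_classification` as a stand-in for the document embeddings and job categories), and the grid values and the names `X_demo`, `y_demo`, and `search` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

seed = 15

# Synthetic stand-in for the embeddings and the four job categories (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=20,
                                     n_informative=5, n_classes=4,
                                     random_state=seed)

# Shuffled, stratified 5-fold split so each fold keeps the class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

# Candidate values for the inverse regularisation strength (illustrative only)
param_grid = {"C": [0.1, 1.0, 10.0]}

search = GridSearchCV(LogisticRegression(max_iter=1000, random_state=seed),
                      param_grid, cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best C:", search.best_params_["C"])
print("Best mean accuracy:", round(search.best_score_, 3))
```

On the real data, `X_demo` and `y_demo` would be replaced by one of the feature matrices (for example the TF-IDF weighted combination) and the job categories.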
