# Assignment 2: Milestone I Natural Language Processing
## Task 2&3
#### Student Name: Padinjarekolath Jithin Nair
#### Student ID: s3952043

Date: 25/09/2023

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* pandas
* sklearn
* numpy
* os
* re
* pickle
* nltk
* gensim
* warnings

## Introduction

#### Task 2 Introduction:

Task 2 builds feature representations for job advertisement descriptions. We generate count vectors based on the vocabulary from Task 1, and document embeddings by averaging pre-trained word embeddings, both TF-IDF weighted and unweighted. This gives several complementary representations of each job ad.

#### Task 3 Introduction:

Task 3 shifts to job ad classification. We aim to classify ads into categories. Two key questions guide our exploration: 1) which language model from Task 2 performs best with our ML model, and 2) does including job titles improve accuracy? These questions drive our model-building and evaluation process.

## Importing libraries 

In [1]:
# Import the libraries used in this notebook
import warnings
import os
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.tokenize import RegexpTokenizer
from gensim.models import Word2Vec

## Task 2. Generating Feature Representations for Job Advertisement Descriptions

### Bag-of-words model

In [2]:
warnings.filterwarnings("ignore")

# Load the preprocessed job advertisements
preprocessed_ads = pd.read_csv("preprocessed_ads.csv", header=None, names=["Webindex", "Description"])

# Load the vocabulary created in Task 1
vocabulary = {}
with open("vocab.txt", "r", encoding="utf-8") as vocab_file:
    for line in vocab_file:
        word, index = line.strip().split(":")
        vocabulary[word] = int(index)

# Initialize a CountVectorizer with the custom vocabulary
count_vectorizer = CountVectorizer(tokenizer=lambda x: x.split(), vocabulary=vocabulary)

# Transform the descriptions into count vectors
count_vectors = count_vectorizer.transform(preprocessed_ads["Description"])
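
To make the fixed-vocabulary behaviour concrete, here is a minimal sketch on toy data. The three-word vocabulary and the sample sentence are hypothetical and only mimic the "word:index" format read from vocab.txt; they are not taken from the real files.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-word vocabulary in the "word:index" format of vocab.txt
toy_vocab_lines = ["analyst:0", "finance:1", "nurse:2"]
toy_vocabulary = {}
for line in toy_vocab_lines:
    word, index = line.strip().split(":")
    toy_vocabulary[word] = int(index)

# With a fixed vocabulary, transform() needs no prior fit(), and words
# outside the vocabulary ("manager" here) are simply ignored.
# token_pattern=None only silences the "unused parameter" warning.
toy_vectorizer = CountVectorizer(tokenizer=str.split, token_pattern=None,
                                 vocabulary=toy_vocabulary)
toy_counts = toy_vectorizer.transform(["finance analyst finance manager"])

# Columns follow the vocabulary indexes: analyst, finance, nurse
print(toy_counts.toarray())  # [[1 2 0]]
```

The column order is determined by the indexes in the vocabulary file, which is what lets the saved count vectors be read back against vocab.txt later.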


### Models based on word embeddings

In [3]:
import gensim.downloader as api
word_embedding_model = api.load("word2vec-google-news-300")


In [4]:
import pickle
# Define a regular expression pattern for tokenization (should match the pattern used in Task 1)
pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
tokenizer = re.compile(pattern)

custom_tokenizer = tokenizer.findall

# Initialize a TF-IDF Vectorizer with your custom tokenizer
tfidf_vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer)

# Fit the TF-IDF Vectorizer on your preprocessed descriptions
tfidf_vectorizer.fit(preprocessed_ads["Description"])

# Initialize lists to store TF-IDF weighted and unweighted vectors
tfidf_weighted_vectors = []
unweighted_vectors = []

# Loop through each description
for _, description in preprocessed_ads.itertuples(index=False):
    # Compute TF-IDF vector
    tfidf_vector = tfidf_vectorizer.transform([description])
    
    # Compute TF-IDF weighted vector representation
    weighted_vector = np.mean([tfidf_vector[0, tfidf_vectorizer.vocabulary_[token]] * word_embedding_model[token] 
                               for token in tokenizer.findall(description) if token in word_embedding_model], axis=0)
    
    # Compute unweighted vector representation
    unweighted_vector = np.mean([word_embedding_model[token] 
                                 for token in tokenizer.findall(description) if token in word_embedding_model], axis=0)
    
    # Append the vectors to the respective lists
    tfidf_weighted_vectors.append(weighted_vector)
    unweighted_vectors.append(unweighted_vector)

# Save the TF-IDF weighted vectors to a file (weighted_vectors.txt)
np.savetxt("weighted_vectors.txt", tfidf_weighted_vectors)

# Save the unweighted vectors to a file (unweighted_vectors.txt)
np.savetxt("unweighted_vectors.txt", unweighted_vectors)
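
The cell above averages the weighted word vectors over the number of tokens, and it assumes every description has at least one token in the embedding model. The sketch below shows one possible variation, not the method used in this notebook: it divides by the sum of the weights and falls back to a zero vector when no token has an embedding. The embeddings, weights, and the helper name `weighted_doc_vector` are hypothetical.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings and TF-IDF weights, for illustration only
toy_embeddings = {
    "nurse": np.array([1.0, 0.0, 0.0]),
    "ward": np.array([0.0, 1.0, 0.0]),
}
toy_weights = {"nurse": 0.75, "ward": 0.25, "rgn": 0.5}


def weighted_doc_vector(tokens, embeddings, weights, dim=3):
    """Weight-normalised average of the embeddings of in-vocabulary tokens."""
    pairs = [(weights[t], embeddings[t]) for t in tokens
             if t in embeddings and t in weights]
    if not pairs:
        # No token has an embedding: return zeros rather than NaN
        return np.zeros(dim)
    total = sum(w for w, _ in pairs)
    return sum(w * v for w, v in pairs) / total


# "rgn" has a weight but no embedding, so it is skipped
print(weighted_doc_vector(["nurse", "ward", "rgn"], toy_embeddings, toy_weights))
print(weighted_doc_vector(["rgn"], toy_embeddings, toy_weights))
```

Guarding against the empty case matters because `np.mean` over an empty list returns NaN, which would end up as a malformed row in the saved file.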

### Saving outputs
Save the count vector representation as per the specification.
- count_vectors.txt

In [5]:
# Save the count vectors to a file (count_vectors.txt)
with open("count_vectors.txt", "w", encoding="utf-8") as file:
    for webindex, vector in zip(preprocessed_ads["Webindex"], count_vectors):
        non_zero_indices = vector.nonzero()[1]
        line = f"#{webindex},{','.join([f'{index}:{vector[0, index]}' for index in non_zero_indices])}\n"
        file.write(line)
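
As a quick check of the sparse format written above, a line can be parsed back into a webindex and a dictionary of counts. This is a sketch only: `parse_count_line` is a hypothetical helper and the sample line is made up, not taken from the real output file.

```python
def parse_count_line(line):
    """Split one '#webindex,index:count,...' line into (webindex, {index: count})."""
    head, *pairs = line.strip().split(",")
    webindex = head.lstrip("#")
    counts = {}
    for pair in pairs:
        if not pair:
            continue  # an ad with no in-vocabulary words leaves an empty field
        index, count = pair.split(":")
        counts[int(index)] = int(count)
    return webindex, counts


# Hypothetical line, for illustration only
print(parse_count_line("#12345678,3:1,17:2,42:1\n"))
# ('12345678', {3: 1, 17: 2, 42: 1})
```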

## Task 3. Job Advertisement Classification

#### Q1

In [6]:
data_folder = "data"

# Initialize a list to store extracted information from all files
job_info_list = []

# Loop through each job category folder
job_folders = ["Accounting_Finance", "Engineering", "Healthcare_Nursing", "Sales"]
for category_name in job_folders:
    category_path = os.path.join(data_folder, category_name)

    # Loop through each job advertisement file in the category folder
    for file_name in os.listdir(category_path):
        if file_name.startswith("Job_") and file_name.endswith(".txt"):
            file_path = os.path.join(category_path, file_name)

            # Initialize variables to store extracted values
            title = None
            webindex = None
            company = None
            description = None

            # Read the job advertisement text from the file
            with open(file_path, 'r', encoding='utf-8') as file:
                file_content = file.read()

            # Split the file content into lines
            lines = file_content.strip().split('\n')

            # Initialize the category with the current folder name
            category = category_name

            # Iterate through lines to extract information
            for line in lines:
                key, value = line.strip().split(': ', 1)
                if key == 'Title':
                    title = value
                elif key == 'Webindex':
                    webindex = value
                elif key == 'Company':
                    company = value
                elif key == 'Description':
                    description = value

            # Store extracted information in a dictionary
            job_info = {
                'Title': title,
                'Webindex': webindex,
                'Category': category, 
                'Company': company,
                'Description': description,
            }

            # Append the dictionary to the list
            job_info_list.append(job_info)

# Create a DataFrame from the list of dictionaries
job_info_df = pd.DataFrame(job_info_list)
job_info_df



| | Title | Webindex | Category | Company | Description |
|---|---|---|---|---|---|
| 0 | Commercial Insurance Underwriter | 69092773 | Accounting_Finance | Bond Search Selection Ltd | Our client is a leading Insurer based in Belfa... |
| 1 | Regulatory Policy Manager | 71674555 | Accounting_Finance | Michael Page Financial Services | As a Regulatory Policy Manager at a leading Fi... |
| 2 | PA to Managing Partner | 70758175 | Accounting_Finance | Wells Tobias | A brand new position has arisen within a highl... |
| 3 | Fixed Income Portfolio Analyst | 69564390 | Accounting_Finance | Mason Blake Ltd | This is a newly created position will involve ... |
| 4 | Commercial Finance Analyst | 70255258 | Accounting_Finance | Axon Resourcing Ltd | A newly created role for a commercial finance ... |
| ... | ... | ... | ... | ... | ... |
| 771 | European Account Manager (German and English) | 66399629 | Sales | Positive Selection | Our client is in urgent need of a candidate wi... |
| 772 | Graduate Sales Executive, Construction Product... | 68702817 | Sales | Aaron Wallis Sales Recruitment | Graduate Sales Executive, Construction Product... |
| 773 | International Sales Account Manager German Ma... | 66935937 | Sales | INTERACTION RECRUITMENT | INTERNATIONAL SALES ACCOUNT MANAGER managing ... |
| 774 | Rail Signal Design Team Leader | 71812011 | Sales | UKStaffsearch | We are currently looking for Rail Signal Desig... |


In [7]:
job_info_df.isna().sum()

Title           0
Webindex        0
Category        0
Company        89
Description     0
dtype: int64

In [8]:
job_info_df.fillna('Others', inplace=True)

In [9]:
job_info_df.isna().sum()

Title          0
Webindex       0
Category       0
Company        0
Description    0
dtype: int64

In [10]:
job_info_df.to_csv("JobInfo.csv", index=False, encoding='utf-8')

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load the TF-IDF weighted vectors from Task 2
tfidf_weighted_vectors = pd.read_csv("weighted_vectors.txt", delimiter=" ", header=None)

# Split data into features (X) and target (y)
X = tfidf_weighted_vectors.values
y = job_info_df['Category']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the logistic regression model
logistic_regression_model = LogisticRegression()
logistic_regression_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = logistic_regression_model.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)


Accuracy: 0.6987179487179487
Classification Report:
                     precision    recall  f1-score   support

Accounting_Finance       0.68      0.74      0.71        35
       Engineering       0.56      0.96      0.71        46
Healthcare_Nursing       0.97      0.88      0.92        40
             Sales       1.00      0.11      0.21        35

          accuracy                           0.70       156
         macro avg       0.81      0.67      0.64       156
      weighted avg       0.79      0.70      0.65       156



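The classification report shows that the model performs well on some categories and poorly on others. Healthcare_Nursing reaches an F1-score of 0.92, whereas Sales has a recall of only 0.11: the model rarely predicts Sales, and many Sales ads appear to be assigned to Engineering, whose precision is just 0.56 despite a recall of 0.96.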
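
One thing worth checking before reading too much into these numbers is row order. The feature matrix comes from preprocessed_ads.csv, while the labels come from walking the data folders, so the two are only paired correctly if both list the ads in the same order. The sketch below shows one way to align labels to features by Webindex. The toy frames are hypothetical, and it assumes Webindex is parsed as an integer from the CSV while it is read as text from the job files.

```python
import pandas as pd

# Hypothetical stand-ins: the same three ads in two different orders
toy_ads = pd.DataFrame({"Webindex": [30, 10, 20],
                        "Description": ["sales exec", "staff nurse", "design engineer"]})
toy_info = pd.DataFrame({"Webindex": ["10", "20", "30"],
                         "Category": ["Healthcare_Nursing", "Engineering", "Sales"]})

# Cast both sides to the same type before joining
toy_info["Webindex"] = toy_info["Webindex"].astype(int)

# A left join keeps the row order of toy_ads, which is the order of the saved vectors;
# validate raises an error if a Webindex is duplicated on either side
aligned = toy_ads.merge(toy_info, on="Webindex", how="left", validate="one_to_one")
y_aligned = aligned["Category"]
print(y_aligned.tolist())  # ['Sales', 'Healthcare_Nursing', 'Engineering']
```

If the two sources already share the same order, the join changes nothing, so it is a cheap safeguard.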

## Title-Only Model (Q2):

In [12]:
# Load the preprocessed job advertisements from Task 1
preprocessed_ads = pd.read_csv("preprocessed_ads.csv", header=None, names=["Webindex", "Description"])

# Load the job titles from your dataset (e.g., "Title" column)
job_titles = job_info_df["Title"]
labels = job_info_df["Category"]

# Initialize a TF-IDF Vectorizer for titles
tfidf_vectorizer = TfidfVectorizer()

# Generate TF-IDF weighted vectors for titles
title_tfidf_vectors = tfidf_vectorizer.fit_transform(job_titles)

# Split your data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(title_tfidf_vectors, labels, test_size=0.2, random_state=42)

# Initialize a logistic regression model
title_model = LogisticRegression()

# Train the model on the title vectors
title_model.fit(X_train, y_train)

# Make predictions on the test set
title_predictions = title_model.predict(X_test)

# Evaluate the model
title_accuracy = accuracy_score(y_test, title_predictions)
print(f"Title-Only Model Accuracy: {title_accuracy}")
print(classification_report(y_test, title_predictions))

Title-Only Model Accuracy: 0.8653846153846154
                    precision    recall  f1-score   support

Accounting_Finance       0.72      0.89      0.79        35
       Engineering       0.85      0.85      0.85        46
Healthcare_Nursing       0.97      0.88      0.92        40
             Sales       0.97      0.86      0.91        35

          accuracy                           0.87       156
         macro avg       0.88      0.87      0.87       156
      weighted avg       0.88      0.87      0.87       156



## Title and Description Model (Q2):

In [13]:
# Concatenate job titles and descriptions
combined_text = job_titles + " " + preprocessed_ads["Description"]

# Initialize a TF-IDF Vectorizer for the combined text
tfidf_vectorizer = TfidfVectorizer()

# Generate TF-IDF weighted vectors for the combined text
combined_tfidf_vectors = tfidf_vectorizer.fit_transform(combined_text)

# Split your data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(combined_tfidf_vectors, labels, test_size=0.2, random_state=42)

# Initialize a logistic regression model
combined_model = LogisticRegression()

# Train the model on the combined vectors
combined_model.fit(X_train, y_train)

# Make predictions on the test set
combined_predictions = combined_model.predict(X_test)

# Evaluate the model
combined_accuracy = accuracy_score(y_test, combined_predictions)
print(f"Title and Description Model Accuracy: {combined_accuracy}")
print(classification_report(y_test, combined_predictions))


Title and Description Model Accuracy: 0.9166666666666666
                    precision    recall  f1-score   support

Accounting_Finance       0.82      0.94      0.88        35
       Engineering       0.92      0.96      0.94        46
Healthcare_Nursing       0.97      0.95      0.96        40
             Sales       0.97      0.80      0.88        35

          accuracy                           0.92       156
         macro avg       0.92      0.91      0.91       156
      weighted avg       0.92      0.92      0.92       156



In [14]:
import pickle

# Save the trained model to a file using pickle
with open('combined_model.pkl', 'wb') as model_file:
    pickle.dump(combined_model, model_file)

# Save the vectorizer during training
with open('vectorizer.pkl', 'wb') as vectorizer_file:
    pickle.dump(tfidf_vectorizer, vectorizer_file)
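
For completeness, here is a minimal sketch of how a saved model and vectorizer pair could be reused on a new advertisement. It trains a tiny stand-in model on hypothetical text and round-trips it through pickle in memory; with the real files, the two objects would instead be loaded from combined_model.pkl and vectorizer.pkl.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data standing in for the combined title + description text
toy_texts = ["staff nurse care for patients", "sales executive sell to clients",
             "ward nurse nursing care", "sales manager hit sales targets"]
toy_labels = ["Healthcare_Nursing", "Sales", "Healthcare_Nursing", "Sales"]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# Round-trip both objects through pickle (in memory here; files work the same way)
restored_vectorizer = pickle.loads(pickle.dumps(toy_vectorizer))
restored_model = pickle.loads(pickle.dumps(toy_model))

# Reuse the fitted vectorizer with transform(), not fit_transform(), on new text
new_ad = ["night nurse for patients on a care ward"]
prediction = restored_model.predict(restored_vectorizer.transform(new_ad))
print(prediction)
```

The vectorizer has to be saved alongside the model because the model's coefficients only make sense for the exact column layout learned during training.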

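* Accuracy rose from 69.9% for the description-only model (TF-IDF weighted embeddings) to 86.5% for the title-only model and 91.7% for the combined title and description model.
* Compared with the description-only model, the combined model improves the F1-score in every category (for example, Sales goes from 0.21 to 0.88). Recall improves or holds in every category; precision improves for Accounting_Finance and Engineering, is unchanged for Healthcare_Nursing, and drops slightly for Sales (1.00 to 0.97).
* The comparison is not strictly like-for-like: the description-only model uses averaged word embeddings, while the title-only and combined models use sparse TF-IDF vectors, and all figures come from a single 80/20 split.

Based on these results, adding the job title appears to improve classification accuracy: the combined model beats the title-only model by about five percentage points on this split.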
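
A more controlled way to answer Q2 would be to hold the vectorizer, classifier, and folds fixed and vary only the input text. The sketch below illustrates that idea with cross-validation on a hypothetical miniature dataset; the texts, labels, and fold count are assumptions for illustration and the scores it prints say nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical miniature dataset, for illustration only
titles = ["staff nurse", "ward nurse", "care nurse", "night nurse",
          "sales executive", "sales manager", "sales advisor", "sales agent"]
descriptions = ["care for patients on the ward", "patients need nursing care",
                "support patients in a care home", "nursing care on night shifts",
                "sell products to new clients", "lead a team to hit sales targets",
                "advise clients and sell services", "win new clients and sell"]
labels = ["Healthcare_Nursing"] * 4 + ["Sales"] * 4

feature_sets = {
    "title": titles,
    "description": descriptions,
    "title + description": [t + " " + d for t, d in zip(titles, descriptions)],
}

# Same vectorizer, classifier and folds for every feature set
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
results = {}
for name, texts in feature_sets.items():
    # The pipeline refits TF-IDF inside each training fold, so no test text leaks in
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    results[name] = cross_val_score(model, texts, labels, cv=cv).mean()

for name, score in results.items():
    print(f"{name}: {score:.2f}")
```

Wrapping the vectorizer in the pipeline differs from the cells above, which fit TF-IDF on all ads before splitting; fitting inside each fold gives a slightly stricter estimate.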

## Summary

#### Task 2 Generating Feature Representations for Job Advertisement Descriptions:

* Feature Representations:

    * Bag-of-Words (Count Vectors): Count vector representations were generated for each job advertisement based on the vocabulary.
    * Word Embeddings: TF-IDF weighted and unweighted document vectors were generated by averaging word vectors from the pre-trained word2vec-google-news-300 model.

* Evaluation: The TF-IDF weighted vectors were then used in Task 3 to train and evaluate a model for classifying job advertisements into categories.

#### Task 3: Classification of Job Advertisements:

Task 3 aimed to build machine learning models for classifying job advertisements into categories. This task involved:

* Training Data: Using the feature representations generated in Task 2 and splitting the data into training and testing sets.
* Model Building: Logistic regression models were trained on the training data using different sets of features:
    * Description-Only Model: TF-IDF weighted embedding vectors of the job descriptions (Q1).
    * Title-Only Model: TF-IDF vectors of the job titles.
    * Title and Description Model: TF-IDF vectors of the title and description concatenated into a single text.
* Model Evaluation: The models were evaluated using accuracy and classification metrics (precision, recall, F1-score) to determine their performance in classifying job advertisements.

### Summary:

* Task 2 focused on feature representation, including count vectors and word embeddings, to represent job advertisements.
* Task 3 built machine learning models to classify job advertisements based on these features, comparing different approaches (title-only, description-only, and combined title-description) to find the best-performing model.
* Combining title and description gave the highest accuracy (91.7%) and the highest macro-averaged precision, recall, and F1-score of the three models on this test split, which suggests that the job title adds useful information.