# Assignment 2: Milestone I Natural Language Processing
## Task 1. Basic Text Pre-processing
#### Student Name: Sukhum Boondecharak
#### Student ID: S3940976

Date: 04 Oct 2023

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* scikit-learn (sklearn.datasets.load_files)
* nltk
* numpy
* itertools, collections, os (Python standard library)

## Introduction

The data are organised into 4 sub-folders, one per job category, containing a total of 776 job advertisement files. The primary goal of this task is to prepare the raw data for subsequent analysis and model building. This foundational step consists mainly of data cleaning and text pre-processing. We will focus on understanding the data's characteristics, handling missing values, and transforming the text into a format suitable for natural language processing (NLP) tasks.

## Importing libraries 

In [1]:
from sklearn.datasets import load_files  
from nltk import RegexpTokenizer
from nltk.tokenize import sent_tokenize
from itertools import chain
from collections import Counter
import numpy as np
import os

### 1.1 Examining and loading data
- Examine the data folder, including the categories and job advertisement txt documents, etc. Explain your findings here, e.g., number of folders and format of txt files, etc.
- Load the data into proper data structures and get it ready for processing.
- Extract webIndex and description into proper data structures.


In [2]:
# Load job data
job_data = load_files(r"data")
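
The folder findings can also be checked programmatically once the data is loaded. The sketch below assumes the object returned by load_files exposes `target` (one integer label per document) and `target_names` (one name per sub-folder); the helper name `category_counts` and the example category names are hypothetical and used only for illustration.

```python
from collections import Counter

def category_counts(targets, target_names):
    """Map each category name to the number of documents labelled with it."""
    counts = Counter(int(t) for t in targets)
    return {name: counts.get(i, 0) for i, name in enumerate(target_names)}

# Hypothetical stand-ins for job_data.target and job_data.target_names
example_targets = [0, 1, 1, 2, 0, 1]
example_names = ["Accounting_Finance", "Engineering", "Sales"]

print(category_counts(example_targets, example_names))
# → {'Accounting_Finance': 2, 'Engineering': 3, 'Sales': 1}
```

In the notebook this would be called as `category_counts(job_data.target, job_data.target_names)`, and the counts should sum to the total number of job files.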

In [3]:
# Extract descriptions from job data
descriptions = []

# Define a function to extract the description part from a text
def extract_description(text):
    start_text = text.find("Description: ")
    if start_text != -1:
        description = text[start_text + len("Description: "):]
        return description
    else:
        return ""

# Iterate through the loaded data and extract descriptions
for text in job_data.data:
    description = extract_description(text.decode("utf-8"))  # Decode bytes to string
    descriptions.append(description)

# See example of the first description    
emp = 0
descriptions[emp]

'Accountant (partqualified) to **** p.a. South East London Our client, a successful manufacturing company has an immediate requirement for an Accountant for permanent role in their modern offices in South East London. The Role: Credit Control Purchase / Sales Ledger Daily collection of debts by phone, letter and email. Handling of ledger accounts Handling disputed accounts and negotiating payment terms Allocating of cash and reconciliation of accounts Adhoc administration duties within the business The Person The ideal candidate will have previous experience in a Credit Control capacity, you will possess exceptional customer service and communication skills together with IT proficiency. You will need to be a part or fully qualified Accountant to be considered for this role'

In [4]:
# Extract webindex from job data

# Indicate original data folder
original_data_folder = "data"

# Initiate an empty list to store webindex numbers
webindex_numbers = []

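# Iterate through the original data files and extract webindex
# Note: load_files shuffles documents by default (shuffle=True), so the order of
# this list is not guaranteed to match the order of job_data.data / descriptions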
for category_folder in os.listdir(original_data_folder):
    category_path = os.path.join(original_data_folder, category_folder)
    if os.path.isdir(category_path):
        for job_file in os.listdir(category_path):
            if job_file.startswith("Job_") and job_file.endswith(".txt"):
                with open(os.path.join(category_path, job_file), "r", encoding="utf-8") as f:
                    content = f.read()
                    
                    # Extract the webindex from the original data and remove the newline character
                    webindex = content.split("Webindex: ")[1].split("\n")[0]
                    
                    webindex_numbers.append(webindex)

webindex_numbers

['72444142',
 '68687567',
 '68257980',
 '71168766',
 '72441930',
 '70205492',
 '69929266',
 '68814305',
 '71737507',
 '69540434',
 '71199751',
 '70457475',
 '72411451',
 '69001764',
 '71171544',
 '68057786',
 '69040220',
 '68784018',
 '69146761',
 '68256016',
 '70251801',
 '68180459',
 '68056671',
 '72448172',
 '69250788',
 '67639091',
 '68256188',
 '69577650',
 '66600427',
 '71339723',
 '72691163',
 '69989027',
 '69635720',
 '69993409',
 '68356152',
 '72439398',
 '69577820',
 '68708197',
 '67895483',
 '71848552',
 '68257221',
 '68686791',
 '68806418',
 '69188332',
 '68062805',
 '68678164',
 '72680220',
 '69553242',
 '71633491',
 '71171000',
 '71677705',
 '62269820',
 '71677311',
 '70597736',
 '68258658',
 '68553492',
 '71596865',
 '69694967',
 '71750603',
 '72457901',
 '70597879',
 '72673887',
 '70599432',
 '71198896',
 '72672874',
 '71678606',
 '69250648',
 '71848359',
 '72233918',
 '70190910',
 '70256074',
 '68062611',
 '66887344',
 '72240625',
 '68257449',
 '70520065',
 '71849489',
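
Because the descriptions come from job_data.data while the webindex values come from a separate directory walk, the two lists are not guaranteed to share the same order. One way to keep them aligned, sketched here under assumptions, is to parse both fields from the same decoded text. The helper name `parse_job_ad` and the sample field layout are hypothetical; only the "Webindex: " and "Description: " labels are taken from the code in this notebook.

```python
import re

WEBINDEX_RE = re.compile(r"Webindex:\s*(\d+)")
DESCRIPTION_MARKER = "Description: "

def parse_job_ad(text):
    """Return (webindex, description) parsed from one job advertisement.

    webindex is None and description is "" when the field is missing.
    """
    match = WEBINDEX_RE.search(text)
    webindex = match.group(1) if match else None
    start = text.find(DESCRIPTION_MARKER)
    description = text[start + len(DESCRIPTION_MARKER):] if start != -1 else ""
    return webindex, description

# Hypothetical file content; the real files may contain other fields
sample = (
    "Title: Finance Analyst\n"
    "Webindex: 12345678\n"
    "Company: Example Ltd\n"
    "Description: Prepare monthly reports."
)
print(parse_job_ad(sample))
# → ('12345678', 'Prepare monthly reports.')
```

Applied to the loaded data, this would look like `pairs = [parse_job_ad(raw.decode("utf-8")) for raw in job_data.data]`, so that position i in `pairs` always refers to the same file for both fields.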

### 1.2 Pre-processing data
Perform the required text pre-processing steps.

We begin with defining functions for tokenising data and printing stats. Within the tokenising function, we use <span style="color: red"> r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?" </span> as a regular expression. We also transform every word into lower-case.
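
To see what this regular expression keeps and drops, the sketch below applies the same pattern with Python's built-in re.findall. This is an assumption-level stand-in for RegexpTokenizer (it skips the sentence segmentation step), and the example strings are invented for illustration.

```python
import re

# Same pattern as in the tokenising function: letters, optionally followed by
# ONE hyphen or apostrophe plus more letters
PATTERN = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

def regex_tokens(text):
    """Lower-case the text and return every substring matching PATTERN."""
    return re.findall(PATTERN, text.lower())

print(regex_tokens("Part-qualified accountant's role, 30k p.a."))
# → ['part-qualified', "accountant's", 'role', 'k', 'p', 'a']

print(regex_tokens("state-of-the-art"))
# → ['state-of', 'the-art']
```

Digits and punctuation are discarded, which is why fragments such as 'k', 'p' and 'a' can appear as single-character tokens. Because the optional group matches at most once, a word with several hyphens is split into more than one token.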

In [5]:
# Define a function to tokenise data

def tokenize_data(data_raw):
    """
        This function first converts the raw description to lower case,
        then segments it into sentences, tokenizes each sentence,
        and returns the description as a flat list of tokens.
    """        
    # Convert to lower case
    data_lc = data_raw.lower()
    
    # segment into sentences
    sentences = sent_tokenize(data_lc)
    
    # tokenize each sentence
    pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
    tokenizer = RegexpTokenizer(pattern) 
    token_lists = [tokenizer.tokenize(sen) for sen in sentences]
    
    # merge them into a list of tokens
    data_tokenised = list(chain.from_iterable(token_lists))
    return data_tokenised

In [6]:
# Define a function to print the current stats

def stats_print(data_tk):
    words = list(chain.from_iterable(data_tk))
    vocab = set(words)
    lexical_diversity = len(vocab)/len(words)
    print("Vocabulary size: ",len(vocab))
    print("Total number of tokens: ", len(words))
    print("Lexical diversity: ", lexical_diversity)
    print("Total number of reviews:", len(data_tk))
    lens = [len(article) for article in data_tk]
    print("Average review length:", np.mean(lens))
    print("Maximun review length:", np.max(lens))
    print("Minimun review length:", np.min(lens))
    print("Standard deviation of review length:", np.std(lens))

In [7]:
# Tokenise the data and compare the result with the original data

data_tk = [tokenize_data(d) for d in descriptions]

print("Original Data:\n",descriptions[emp],'\n')
print("Tokenized Data:\n",data_tk[emp])

Original Data:
 Accountant (partqualified) to **** p.a. South East London Our client, a successful manufacturing company has an immediate requirement for an Accountant for permanent role in their modern offices in South East London. The Role: Credit Control Purchase / Sales Ledger Daily collection of debts by phone, letter and email. Handling of ledger accounts Handling disputed accounts and negotiating payment terms Allocating of cash and reconciliation of accounts Adhoc administration duties within the business The Person The ideal candidate will have previous experience in a Credit Control capacity, you will possess exceptional customer service and communication skills together with IT proficiency. You will need to be a part or fully qualified Accountant to be considered for this role 

Tokenized Data:
 ['accountant', 'partqualified', 'to', 'p', 'a', 'south', 'east', 'london', 'our', 'client', 'a', 'successful', 'manufacturing', 'company', 'has', 'an', 'immediate', 'requirement', 'f

In [8]:
# First check point for overall stats

stats_print(data_tk)

Vocabulary size:  9834
Total number of tokens:  186952
Lexical diversity:  0.052601737344345076
Total number of reviews: 776
Average review length: 240.91752577319588
Maximun review length: 815
Minimun review length: 13
Standard deviation of review length: 124.97750685071483


After tokenising the data, we can now apply the required pre-processing steps:

- Remove words with length less than 2
- Remove stopwords using the provided stop words list (i.e, stopwords_en.txt)
- Remove the words that appear only once in the document collection, based on term frequency
- Remove the top 50 most frequent words based on document frequency
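
The last two steps rely on different statistics: term frequency counts every occurrence of a word across the collection, while document frequency counts the number of documents that contain the word at least once. The following is a minimal sketch of the distinction on a toy collection; the helper names `document_frequency` and `remove_top_df` are hypothetical and not part of the assignment code.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each word, how many documents contain it at least once."""
    return Counter(w for doc in docs for w in set(doc))

def remove_top_df(docs, n):
    """Drop the n words with the highest document frequency from every document."""
    top = {w for w, _ in document_frequency(docs).most_common(n)}
    return [[w for w in doc if w not in top] for doc in docs]

# Toy tokenised collection (invented for illustration)
docs = [
    ["sales", "sales", "sales", "sales", "team"],
    ["team", "care"],
    ["team", "nurse"],
]

term_freq = Counter(w for doc in docs for w in doc)
doc_freq = document_frequency(docs)

print(term_freq.most_common(1))  # → [('sales', 4)]
print(doc_freq.most_common(1))   # → [('team', 3)]
print(remove_top_df(docs, 1))
# → [['sales', 'sales', 'sales', 'sales'], ['care'], ['nurse']]
```

Here 'sales' has the highest term frequency because it is repeated within one document, but 'team' has the highest document frequency because it appears in all three. The two rankings can therefore select different words for removal.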

In [9]:
# Check all the single character tokens

# Create a list of single-character tokens for each description
st_list = [[w for w in words if len(w) < 2 ] for words in data_tk] 

# Merge them together in one list
list(chain.from_iterable(st_list))

['p',
 'a',
 'a',
 'a',
 'a',
 'a',
 'c',
 'a',
 'a',
 'a',
 's',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'k',
 'k',
 's',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'k',
 'a',
 'b',
 'b',
 'a',
 'b',
 'b',
 'a',
 'a',
 'b',
 'b',
 's',
 's',
 'a',
 'a',
 'a',
 'a',
 'a',
 'k',
 'a',
 'a',
 's',
 'a',
 'd',
 'd',
 'a',
 's',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'k',
 'a',
 's',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'c',
 'k',
 'a',
 'a',
 'k',
 'a',
 'a',
 'k',
 'a',
 'a',
 'a',
 'm',
 'm',
 'm',
 'm',
 'a',
 'a',
 'a',
 'a',
 's',
 's',
 'a',
 's',
 'a',
 's',
 'a',
 's',
 'a',
 's',
 'a',
 'a',
 'a',
 's',
 's',
 's',
 'a',
 'a',
 's',
 'a',
 'a',
 'a',
 'a',
 'a',
 'b',
 'c',
 'd',
 'e',
 'k',
 'a',
 'p',
 'l',
 'a'

In [10]:
# Filter out single character tokens
data_tk = [[w for w in words if len(w) >=2] for words in data_tk]

print("Tokenized Data with at least 2 characters:\n", data_tk[emp])

Tokenized Data with at least 2 characters:
 ['accountant', 'partqualified', 'to', 'south', 'east', 'london', 'our', 'client', 'successful', 'manufacturing', 'company', 'has', 'an', 'immediate', 'requirement', 'for', 'an', 'accountant', 'for', 'permanent', 'role', 'in', 'their', 'modern', 'offices', 'in', 'south', 'east', 'london', 'the', 'role', 'credit', 'control', 'purchase', 'sales', 'ledger', 'daily', 'collection', 'of', 'debts', 'by', 'phone', 'letter', 'and', 'email', 'handling', 'of', 'ledger', 'accounts', 'handling', 'disputed', 'accounts', 'and', 'negotiating', 'payment', 'terms', 'allocating', 'of', 'cash', 'and', 'reconciliation', 'of', 'accounts', 'adhoc', 'administration', 'duties', 'within', 'the', 'business', 'the', 'person', 'the', 'ideal', 'candidate', 'will', 'have', 'previous', 'experience', 'in', 'credit', 'control', 'capacity', 'you', 'will', 'possess', 'exceptional', 'customer', 'service', 'and', 'communication', 'skills', 'together', 'with', 'it', 'proficiency', 

In [11]:
# Check stats after eliminating single character tokens

stats_print(data_tk)

Vocabulary size:  9808
Total number of tokens:  180913
Lexical diversity:  0.05421390392066905
Total number of reviews: 776
Average review length: 233.13530927835052
Maximun review length: 795
Minimun review length: 13
Standard deviation of review length: 121.6048654015839


In [12]:
# Import stop words from the required file

stopwords_file = "stopwords_en.txt"
with open(stopwords_file, 'r') as f:
    stop_words = set(f.read().split())
stop_words

{'a',
 "a's",
 'able',
 'about',
 'above',
 'according',
 'accordingly',
 'across',
 'actually',
 'after',
 'afterwards',
 'again',
 'against',
 "ain't",
 'all',
 'allow',
 'allows',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'an',
 'and',
 'another',
 'any',
 'anybody',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'apart',
 'appear',
 'appreciate',
 'appropriate',
 'are',
 "aren't",
 'around',
 'as',
 'aside',
 'ask',
 'asking',
 'associated',
 'at',
 'available',
 'away',
 'awfully',
 'b',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'believe',
 'below',
 'beside',
 'besides',
 'best',
 'better',
 'between',
 'beyond',
 'both',
 'brief',
 'but',
 'by',
 'c',
 "c'mon",
 "c's",
 'came',
 'can',
 "can't",
 'cannot',
 'cant',
 'cause',
 'causes',
 'certain',
 'certainly',
 'changes',
 'clearly',
 'co',
 'com',
 'come',
 'c

In [13]:
# Filter out stop words

data_tk = [[w for w in words if w not in stop_words] for words in data_tk]

print("Tokenized Data excluding stop words:\n", data_tk[emp])

Tokenized Data excluding stop words:
 ['accountant', 'partqualified', 'south', 'east', 'london', 'client', 'successful', 'manufacturing', 'company', 'requirement', 'accountant', 'permanent', 'role', 'modern', 'offices', 'south', 'east', 'london', 'role', 'credit', 'control', 'purchase', 'sales', 'ledger', 'daily', 'collection', 'debts', 'phone', 'letter', 'email', 'handling', 'ledger', 'accounts', 'handling', 'disputed', 'accounts', 'negotiating', 'payment', 'terms', 'allocating', 'cash', 'reconciliation', 'accounts', 'adhoc', 'administration', 'duties', 'business', 'person', 'ideal', 'candidate', 'previous', 'experience', 'credit', 'control', 'capacity', 'possess', 'exceptional', 'customer', 'service', 'communication', 'skills', 'proficiency', 'part', 'fully', 'qualified', 'accountant', 'considered', 'role']


In [14]:
# Check stats after eliminating stop words

stats_print(data_tk)

Vocabulary size:  9404
Total number of tokens:  107161
Lexical diversity:  0.0877558066834016
Total number of reviews: 776
Average review length: 138.09407216494844
Maximun review length: 487
Minimun review length: 12
Standard deviation of review length: 73.07847897002313


In [15]:
# Count the occurrences of each remaining word across the whole collection (term frequency)

word_counts = Counter(w for words in data_tk for w in words)
word_counts

Counter({'accountant': 53,
         'partqualified': 5,
         'south': 72,
         'east': 78,
         'london': 175,
         'client': 594,
         'successful': 340,
         'manufacturing': 159,
         'company': 614,
         'requirement': 29,
         'permanent': 147,
         'role': 946,
         'modern': 21,
         'offices': 41,
         'credit': 37,
         'control': 179,
         'purchase': 29,
         'sales': 1030,
         'ledger': 37,
         'daily': 62,
         'collection': 13,
         'debts': 2,
         'phone': 33,
         'letter': 25,
         'email': 144,
         'handling': 33,
         'accounts': 153,
         'disputed': 1,
         'negotiating': 7,
         'payment': 22,
         'terms': 26,
         'allocating': 1,
         'cash': 47,
         'reconciliation': 15,
         'adhoc': 18,
         'administration': 72,
         'duties': 147,
         'business': 832,
         'person': 99,
         'ideal': 102,
         'ca

In [16]:
# Filter out words that appear only once

data_tk = [[w for w in words if word_counts[w] > 1] for words in data_tk]

print("Tokenized Data with more than one occurrence:\n", data_tk[emp])

Tokenized Data with more than one occurrence:
 ['accountant', 'partqualified', 'south', 'east', 'london', 'client', 'successful', 'manufacturing', 'company', 'requirement', 'accountant', 'permanent', 'role', 'modern', 'offices', 'south', 'east', 'london', 'role', 'credit', 'control', 'purchase', 'sales', 'ledger', 'daily', 'collection', 'debts', 'phone', 'letter', 'email', 'handling', 'ledger', 'accounts', 'handling', 'accounts', 'negotiating', 'payment', 'terms', 'cash', 'reconciliation', 'accounts', 'adhoc', 'administration', 'duties', 'business', 'person', 'ideal', 'candidate', 'previous', 'experience', 'credit', 'control', 'capacity', 'possess', 'exceptional', 'customer', 'service', 'communication', 'skills', 'part', 'fully', 'qualified', 'accountant', 'considered', 'role']


In [17]:
# Check stats after eliminating words that appear only once

stats_print(data_tk)

Vocabulary size:  5218
Total number of tokens:  102975
Lexical diversity:  0.05067249332362224
Total number of reviews: 776
Average review length: 132.69974226804123
Maximun review length: 471
Minimun review length: 12
Standard deviation of review length: 70.3782402519735


In [18]:
# Check 50 most common words

# Unpack each (word, count) pair and keep only the word
most_common_words = [w for w, _ in word_counts.most_common(50)]

# Display the words together with their counts
most_common_words_count = word_counts.most_common(50)
most_common_words_count

[('experience', 1276),
 ('sales', 1030),
 ('role', 946),
 ('work', 861),
 ('business', 832),
 ('team', 789),
 ('working', 719),
 ('job', 688),
 ('care', 675),
 ('skills', 669),
 ('company', 614),
 ('client', 594),
 ('management', 572),
 ('manager', 519),
 ('support', 501),
 ('uk', 496),
 ('service', 481),
 ('excellent', 455),
 ('development', 431),
 ('required', 399),
 ('based', 376),
 ('opportunity', 372),
 ('services', 369),
 ('knowledge', 349),
 ('apply', 349),
 ('successful', 340),
 ('training', 338),
 ('design', 337),
 ('engineering', 336),
 ('customer', 335),
 ('recruitment', 335),
 ('salary', 322),
 ('candidate', 319),
 ('clients', 310),
 ('high', 309),
 ('join', 302),
 ('ability', 301),
 ('strong', 299),
 ('provide', 298),
 ('home', 291),
 ('ensure', 290),
 ('leading', 289),
 ('including', 287),
 ('engineer', 285),
 ('financial', 279),
 ('good', 274),
 ('staff', 271),
 ('position', 268),
 ('systems', 267),
 ('full', 263)]

In [19]:
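# Filter out the top 50 most common words
# Note: word_counts holds collection-wide term frequencies, so this ranking is by
# term frequency rather than by document frequency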

data_tk = [[w for w in words if w not in most_common_words] for words in data_tk]

print("Tokenized Data without 50 most frequent words:\n", data_tk[emp])

Tokenized Data without 50 most frequent words:
 ['accountant', 'partqualified', 'south', 'east', 'london', 'manufacturing', 'requirement', 'accountant', 'permanent', 'modern', 'offices', 'south', 'east', 'london', 'credit', 'control', 'purchase', 'ledger', 'daily', 'collection', 'debts', 'phone', 'letter', 'email', 'handling', 'ledger', 'accounts', 'handling', 'accounts', 'negotiating', 'payment', 'terms', 'cash', 'reconciliation', 'accounts', 'adhoc', 'administration', 'duties', 'person', 'ideal', 'previous', 'credit', 'control', 'capacity', 'possess', 'exceptional', 'communication', 'part', 'fully', 'qualified', 'accountant', 'considered']


In [20]:
# Check stats after eliminating 50 most common words 

stats_print(data_tk)

Vocabulary size:  5168
Total number of tokens:  80068
Lexical diversity:  0.06454513663386122
Total number of reviews: 776
Average review length: 103.18041237113403
Maximun review length: 390
Minimun review length: 7
Standard deviation of review length: 56.69634188671351


## Saving required outputs
Save the vocabulary, bigrams and job advertisement txt as per specification.
- vocab.txt

In [21]:
# Combine tokens and save output as a text file

combined_data = [" ".join(tokens) for tokens in data_tk]
output_file = "cleaned_descriptions.txt"

with open(output_file, 'w', encoding='utf-8') as f:
    for description in combined_data:
        f.write(description + '\n')

In [22]:
# Save the sorted vocabulary as a text file, one "word:index" entry per line

unique_words = set(w for words in data_tk for w in words)
sorted_unique_words = sorted(unique_words)

vocab_file = "vocab.txt"

with open(vocab_file, 'w', encoding='utf-8') as f:
    for index, word in enumerate(sorted_unique_words):
        f.write(f"{word}:{index}\n")
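
Later tasks will need to read this file back. The sketch below shows one possible way to load the word:index format into a dictionary; the helper name `load_vocab` is hypothetical, and the demo writes a small temporary file rather than touching the real vocab.txt.

```python
import os
import tempfile

def load_vocab(path):
    """Read a file with one "word:index" entry per line into a dict."""
    vocab = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the LAST colon so the index is always the final field
            word, index = line.rsplit(":", 1)
            vocab[word] = int(index)
    return vocab

# Round-trip demo using the same writing logic as the cell above
demo_words = sorted({"ledger", "accountant", "credit"})
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "vocab_demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        for index, word in enumerate(demo_words):
            f.write(f"{word}:{index}\n")
    demo_vocab = load_vocab(demo_path)

print(demo_vocab)
# → {'accountant': 0, 'credit': 1, 'ledger': 2}
```

Because the words are sorted before indexing, the index of each word is its alphabetical rank in the vocabulary.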

## Summary

In this task we loaded 776 job advertisements, extracted the description and webindex of each, and applied tokenisation, single-character token removal, stop word removal, and frequency-based filtering. These steps reduced the vocabulary from 9,834 to 5,168 words and the collection from 186,952 to 80,068 tokens. The cleaned descriptions (cleaned_descriptions.txt) and the vocabulary (vocab.txt) set the stage for feature generation and model building, and will be used in the following tasks.
