# Assignment 2: Milestone 1 Natural Language Processing
## Task 1. Basic Text Pre-processing
By Harold Davies

Date: 5/05/2024

Environment: Python 3 and Jupyter notebook

Libraries used:
* nltk
* itertools
* os
* numpy
* nltk.probability

## Introduction
Raw data has been procured to develop a model that makes predictions on similar data. The data consists of a folder structure containing numerous text files, each holding the details of one job advertisement: title, id (Webindex), company and description, with the job category given by the parent folder. In this task, the files are parsed into data structures so that the descriptions can be pre-processed, ready for constructing vectors for model development. Pre-processing involves tokenising the descriptions, removing case variation, and removing single-letter words, stop words, words appearing only once in the document set, and the 50 words appearing in the most documents. The outputs are a text file containing the cleaned job information and a text file containing the indexed vocabulary after pre-processing.

## Importing libraries 

In [14]:
import nltk
from itertools import chain
import os
import numpy as np
from nltk.probability import FreqDist   #only FreqDist is used below
from nltk.corpus import stopwords

### 1.1 Examining and loading data

Within the working directory we have a folder called data, which contains 4 folders: Accounting_Finance, Engineering, Healthcare_Nursing and Sales. Each folder contains a number of text files with the naming convention Job_#####; all text file names are unique. This is a sample of one of the text files from the Accounting_Finance folder: 
"Title: FP&A  Blue Chip
Webindex: 68802053
Company: Hays Senior Finance
Description: A market leading retail business is going through rapid growth and, due to this expansion, is looking to add a Financial Planning Analyst to its central team based in central London. This is a fantastic opportunity to join a newly created team..."

Here we assume the Job_##### file names are unimportant, since each advertisement carries its own Webindex; we extract the Webindex and discard the file name. The remaining information, including the job category taken from the parent folder name, is saved in lists. 


In [2]:
#code to extract categories (folder name) and text file contents (info)
folder_1 = "./data/Engineering"
folder_2 = "./data/Accounting_Finance"
folder_3 = "./data/Healthcare_Nursing"
folder_4 = "./data/Sales"
folders = [folder_1, folder_2, folder_3, folder_4]

#lists for all job details
job_info = []                   #raw text
job_titles = []                 #Title
job_ids = []                    #Webindex
job_companies = []              #Company
job_descriptions = []           #Description
job_categories = []             #Category

#extract category and info
for folder in folders:
    for filename in sorted(os.listdir(folder)):
        if filename.endswith(".txt"):
            job_categories.append(os.path.basename(folder))   #category is the folder name
            path = os.path.join(folder, filename)
            #the with block closes the file automatically, so no explicit close is needed
            with open(path, "r", encoding='unicode_escape') as f:
                job_info.append(f.read())   #read the file into a string and append it to job_info

#code to split info into job title, index, company (if present) and description
for job in job_info:
    job = job.split('\n')
    job_titles.append(job[0][7:])
    job_ids.append(job[1][10:])
    if job[2].startswith('Company'):
        job_companies.append(job[2][9:])
        job_descriptions.append(job[3][13:])
    else:
        job_companies.append(None)
        job_descriptions.append(job[2][13:])
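
The splitting above relies on each field sitting at a fixed line position. As a minimal sketch of an alternative, the fields could be looked up by their labels instead, which would tolerate a missing Company line without a special case. The helper name `parse_job` is hypothetical and not part of the original notebook, and the sketch assumes every field occupies a single line of the form `Label: value`.

```python
def parse_job(raw_text):
    """Split one raw advertisement into its labelled fields.

    Hypothetical helper: fields are matched by label rather than by line position.
    """
    fields = {"Title": None, "Webindex": None, "Company": None, "Description": None}
    for line in raw_text.split("\n"):
        # partition splits at the first ": " only, so colons inside the value are kept
        label, separator, value = line.partition(": ")
        if separator and label in fields and fields[label] is None:
            fields[label] = value
    return fields


# Sample with no Company line, built from the example job shown in this notebook
sample = (
    "Title: Auto Electrician\n"
    "Webindex: 69551242\n"
    "Description: All aspects of electrical installation work"
)
parsed = parse_job(sample)
print(parsed["Webindex"])   # 69551242
print(parsed["Company"])    # None, because the sample has no Company line
```

Fields that are absent simply stay `None`, which matches how missing companies are recorded in `job_companies`.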

In [3]:
#code to check file contents have been extracted correctly
ind = 130
#print(f"Info: {job_info[ind]}")                # Info
print(f"Title: {job_titles[ind]}")              # Title
print(f"Webindex: {job_ids[ind]}")              # Webindex
print(f"Company: {job_companies[ind]}")         # Company
print(f"Description: {job_descriptions[ind]}")  # Description
print(f"Category: {job_categories[ind]}")       # Category

Title: Auto Electrician
Webindex: 69551242
Company: BEVAN GROUP
Description: All aspects of electrical installation work relating to Commercial vehicle bodybuilding. Work includes fitting of electrical ancillary systems such as vehicle lighting , reverse cameras and weighing equipment etc
Category: Engineering


In [4]:
#check how many job listings we have, and that the lists all have the same length.
print(f"Number of Job Titles: {len(job_titles)}")
print(f"Number of Webindices: {len(job_ids)}")
print(f"Number of Companies: {len(job_companies)}")
print(f"Number of Descriptions: {len(job_descriptions)}")
print(f"Number of Categories: {len(job_categories)}")

Number of Job Titles: 776
Number of Webindices: 776
Number of Companies: 776
Number of Descriptions: 776
Number of Categories: 776


### 1.2 Pre-processing data
In this section I process the descriptions as follows: convert to lower case, tokenize, remove words with fewer than two characters, remove stop words, remove words that appear only once across all descriptions, and remove the 50 words which appear in the most descriptions. The processed descriptions are then saved to file, along with the post-processing vocabulary. 

#### Tokenize job descriptions

In [5]:
def tokenizeData(text):
    """
        This function tokenizes a raw text string.
    """
    #change text to lower case
    text_lower = text.lower()

    #regex expression provided in assignment spec
    exp = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

    #set tokenizer using regex expression
    tokenizer = nltk.RegexpTokenizer(exp)

    #apply tokenizer to text
    tokenised_job = tokenizer.tokenize(text_lower)

    #return tokenized text
    return tokenised_job

In [6]:
#use the function to tokenize job descriptions
tokenized_jobs = [tokenizeData(job_desc) for job_desc in job_descriptions]

In [7]:
#check tokenization against example from above
tokenized_jobs[ind]

['all',
 'aspects',
 'of',
 'electrical',
 'installation',
 'work',
 'relating',
 'to',
 'commercial',
 'vehicle',
 'bodybuilding',
 'work',
 'includes',
 'fitting',
 'of',
 'electrical',
 'ancillary',
 'systems',
 'such',
 'as',
 'vehicle',
 'lighting',
 'reverse',
 'cameras',
 'and',
 'weighing',
 'equipment',
 'etc']
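
To illustrate what the tokenizer pattern keeps and drops, here is a small sketch that applies the same regular expression with Python's built-in `re` module instead of NLTK, so it runs without extra dependencies. The function name `tokenize_with_re` and the sample sentence are made up for illustration; the assumption is that `re.findall` behaves like the NLTK tokenizer for this pattern, since the pattern contains no capturing groups.

```python
import re

# Same pattern as in tokenizeData: letters, optionally followed by ONE
# hyphen or apostrophe and more letters
PATTERN = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"


def tokenize_with_re(text):
    """Lower-case the text and return all non-overlapping pattern matches."""
    return re.findall(PATTERN, text.lower())


tokens = tokenize_with_re("State-of-the-art vehicles; the client's 3 depots")
print(tokens)
# ['state-of', 'the-art', 'vehicles', 'the', "client's", 'depots']
```

Because the optional group can match only once, a word with several hyphens is split into two-part pieces, while digits and punctuation are dropped entirely.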

In [8]:
def stats_print(tokenised_text):
    """
    Function to print stats of a list of tokenized text
    """
    words = list(chain.from_iterable(tokenised_text))             #put tokens in single list
    vocab = set(words)                                            #convert list to a set
    lexical_diversity = len(vocab)/len(words)                     #calc lexical diversity
    print(f"Vocabulary size: {len(vocab)}")
    print(f"Total number of tokens: {len(words)}")
    print(f"Lexical diversity: {lexical_diversity}")
    print(f"Total number of articles: {len(tokenised_text)}")
    
    lens = [len(text) for text in tokenised_text]                 #compile list of text lengths
    print(f"Average document length: {np.mean(lens)}")
    print(f"Maximum document length: {np.max(lens)}")
    print(f"Minimum document length: {np.min(lens)}")
    print(f"Standard deviation of document length: {np.std(lens)}")

In [9]:
#let's see the stats
stats_print(tokenized_jobs)

Vocabulary size: 9834
Total number of tokens: 186952
Lexical diversity: 0.052601737344345076
Total number of articles: 776
Average document length: 240.91752577319588
Maximum document length: 815
Minimum document length: 13
Standard deviation of document length: 124.97750685071483
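
Lexical diversity here is the vocabulary size divided by the total number of tokens, so the value above is 9834 / 186952 ≈ 0.0526. The same calculation on a toy corpus (made-up tokens, for illustration only) looks like this:

```python
from itertools import chain

# Two tiny "documents" with made-up tokens
toy_docs = [["sales", "manager", "sales"], ["manager", "nurse"]]

toy_tokens = list(chain.from_iterable(toy_docs))   # flatten into one token list
toy_vocab = set(toy_tokens)                        # unique words only
toy_diversity = len(toy_vocab) / len(toy_tokens)   # 3 unique words / 5 tokens

print(len(toy_vocab), len(toy_tokens), toy_diversity)   # 3 5 0.6
```

A higher value means fewer repeated words relative to the length of the text, which is why the measure tends to rise as frequent words are removed.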


#### Remove words with length of 1

Here we remove the 26 single-letter words from the vocabulary using the removeShortWords function. 

In [10]:
#define function for removing short words from a list
def removeShortWords(text, max_length_to_remove):
    return [w for w in text if len(w) > max_length_to_remove]

In [11]:
#remove words with length < 2
threshold = 1
tokenized_jobs = [removeShortWords(text, threshold) for text in tokenized_jobs]

In [12]:
#let's see the stats again
stats_print(tokenized_jobs)

Vocabulary size: 9808
Total number of tokens: 180913
Lexical diversity: 0.05421390392066905
Total number of articles: 776
Average document length: 233.13530927835052
Maximum document length: 795
Minimum document length: 13
Standard deviation of document length: 121.6048654015839


#### Remove Stopwords

Here we remove an additional 404 stop words from the vocabulary using removeWords. 

In [None]:
# #extract stop words from nltk into list variable
# nltk.download('stopwords')
# stopwords = stopwords.words('english')

In [31]:
#extract stop words from file into a set, which gives fast membership tests in removeWords
#note: this name shadows the nltk.corpus.stopwords import above
with open('./stopwords_en.txt') as f:
    stopwords = set(f.read().splitlines())

In [32]:
def removeWords(text, list_words_to_remove):
    """
    Function to remove a list of words from a list of words
    """
    return [w for w in text if w not in list_words_to_remove]

In [33]:
#remove stop words
tokenized_jobs = [removeWords(text, stopwords) for text in tokenized_jobs]
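
Membership tests against a Python list scan the whole list, so when the words to remove are held in a list it can help to convert them to a set once before filtering. The sketch below shows the idea on made-up data; the names `stop_set`, `toy_docs` and `cleaned_docs` are illustrative and not taken from the notebook.

```python
# Made-up stop words and documents, for illustration only
stop_list = ["the", "of", "and", "to"]
toy_docs = [
    ["fitting", "of", "vehicle", "lighting"],
    ["the", "team", "and", "the", "client"],
]

# Build the set once, outside the loop, so each lookup is a hash lookup
stop_set = frozenset(stop_list)
cleaned_docs = [[w for w in doc if w not in stop_set] for doc in toy_docs]

print(cleaned_docs)
# [['fitting', 'vehicle', 'lighting'], ['team', 'client']]
```

The result is identical to filtering against the list; only the lookup cost changes, which matters more as the list of words to remove grows.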

In [34]:
#let's see the stats again
stats_print(tokenized_jobs)

Vocabulary size: 5169
Total number of tokens: 82786
Lexical diversity: 0.06243809339743435
Total number of articles: 776
Average document length: 106.68298969072166
Maximum document length: 402
Minimum document length: 7
Standard deviation of document length: 59.07955949247282


#### Remove words appearing only once

First we compile the term frequency distribution to find the words which appear only once across all descriptions. Then we use the removeWords function defined above to remove them. This removed an additional 4186 words from the vocabulary. 

In [35]:
#save all words in one list and generate the term frequency distribution
words = list(chain.from_iterable(tokenized_jobs))
term_fd = FreqDist(words)

In [36]:
#extract list of words appearing only once
words_appearing_once = [word for word, freq in term_fd.items() if freq == 1]
words_appearing_once[0:8]

[]

In [37]:
#how many words appear only once?
len(words_appearing_once)

0

In [38]:
#remove words appearing once
tokenized_jobs = [removeWords(text, words_appearing_once) for text in tokenized_jobs]

In [39]:
#let's see the stats again
stats_print(tokenized_jobs)

Vocabulary size: 5169
Total number of tokens: 82786
Lexical diversity: 0.06243809339743435
Total number of articles: 776
Average document length: 106.68298969072166
Maximum document length: 402
Minimum document length: 7
Standard deviation of document length: 59.07955949247282
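
As a small illustration of this step, the sketch below finds and removes words that occur exactly once in a toy corpus. It uses `collections.Counter` from the standard library in place of NLTK's `FreqDist` so that it runs without extra dependencies; the documents are made up.

```python
from collections import Counter
from itertools import chain

# Made-up tokenized documents
toy_docs = [["audit", "tax", "audit"], ["tax", "payroll"], ["nurse", "audit"]]

# Term frequency: total occurrences of each word across all documents
term_counts = Counter(chain.from_iterable(toy_docs))

# Words occurring exactly once in the whole corpus
once = {word for word, count in term_counts.items() if count == 1}

cleaned = [[w for w in doc if w not in once] for doc in toy_docs]
print(sorted(once))   # ['nurse', 'payroll']
print(cleaned)        # [['audit', 'tax', 'audit'], ['tax'], ['audit']]
```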


#### Remove top 50 words in most descriptions

First we will find the 50 words which appear in the most documents, then we will remove them from our descriptions using removeWords. 

In [40]:
#generate document frequency distribution
word_document_freq = list(chain.from_iterable([set(text) for text in tokenized_jobs]))
docfd = FreqDist(word_document_freq)
docfd.most_common(8)

[('originally', 191),
 ('jobseeking', 191),
 ('include', 187),
 ('clients', 187),
 ('good', 187),
 ('essential', 186),
 ('information', 184),
 ('customer', 182)]
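
Document frequency differs from term frequency because each document is reduced to a set first, so a word counts at most once per document. A minimal sketch on made-up data, again using `collections.Counter` rather than `FreqDist`:

```python
from collections import Counter
from itertools import chain

# Made-up tokenized documents; "care" is repeated within the first one
toy_docs = [["care", "care", "care", "ward"], ["care", "sales"], ["sales", "target"]]

# Term frequency counts every occurrence
term_freq = Counter(chain.from_iterable(toy_docs))

# Document frequency counts each word once per document it appears in
doc_freq = Counter(chain.from_iterable(set(doc) for doc in toy_docs))

print(term_freq["care"], doc_freq["care"])     # 4 2
print(term_freq["sales"], doc_freq["sales"])   # 2 2
```

Here "care" has the highest term frequency but ties with "sales" on document frequency, which is the measure used for the top-50 removal.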

In [41]:
#extract a list of the 50 words appearing in the most documents
most_common_words = [word for word, doc_count in docfd.most_common(50)]

In [42]:
#remove the most common words
tokenized_jobs = [removeWords(text, most_common_words) for text in tokenized_jobs]

In [43]:
#let's see the stats one last time
stats_print(tokenized_jobs)

Vocabulary size: 5119
Total number of tokens: 71421
Lexical diversity: 0.07167359740132453
Total number of articles: 776
Average document length: 92.03737113402062
Maximum document length: 351
Minimum document length: 6
Standard deviation of document length: 52.3153837469842


## Saving required outputs
We save the data, including the cleaned job descriptions and job categories, into a text file, jobs_clean.txt. The information for all jobs is stored in the same file, separated by new lines: each job occupies six lines (title, Webindex, company, raw description, cleaned description tokens joined by spaces, and category). This consistent formatting, with missing company names written as "None", will make the file easy to parse in the next task. We also save the vocabulary in a text file, vocab.txt, in the format "\<word\>:\<index\>\n". 

In [44]:
#text file for use in milestone 2
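with open("./jobs_clean.txt", "w") as output_file:
    for title, webindex, company, description, tokens, category in zip(job_titles, job_ids, job_companies, job_descriptions, tokenized_jobs, job_categories):
        #one field per line; a missing company (None) is written as the string "None"
        output_file.write(title + '\n' + webindex + '\n' + str(company) + '\n' + description + '\n' + ' '.join(tokens) + '\n' + category + '\n')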
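
As a rough sketch of how the next task might read this file back, the reader below groups the lines into records of six fields. The function name `read_jobs`, the field names and the sample text are all hypothetical; the sketch assumes no field contains a newline, and it cannot distinguish a missing company from a company literally named "None".

```python
import io

# Field order matches the order in which the fields are written to the file
FIELDS = ("title", "webindex", "company", "description", "tokens", "category")


def read_jobs(handle):
    """Parse a file-like object holding six lines per job into a list of dicts."""
    # Split on "\n" only, so other Unicode line separators inside a field are kept
    lines = handle.read().split("\n")
    if lines and lines[-1] == "":
        lines.pop()   # drop the empty string left after the final newline
    records = []
    for start in range(0, len(lines), len(FIELDS)):
        chunk = lines[start:start + len(FIELDS)]
        if len(chunk) < len(FIELDS):
            break   # ignore an incomplete trailing record
        record = dict(zip(FIELDS, chunk))
        record["tokens"] = record["tokens"].split()
        if record["company"] == "None":
            record["company"] = None
        records.append(record)
    return records


# In-memory stand-in for jobs_clean.txt with two made-up records
sample = io.StringIO(
    "Auto Electrician\n69551242\nBEVAN GROUP\nAll aspects of electrical work\n"
    "electrical installation\nEngineering\n"
    "Staff Nurse\n12345678\nNone\nWard based role\nward role\nHealthcare_Nursing\n"
)
jobs = read_jobs(sample)
print(len(jobs), jobs[1]["company"], jobs[0]["tokens"])
```

With a real file, the same function could be called as `read_jobs(open_file)` inside a `with open(...)` block.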

In [45]:
#get set of words from clean descriptions
vocab = set(chain.from_iterable(tokenized_jobs))

#convert to sorted list
vocab = sorted(vocab)

#save clean vocab to output file, one "<word>:<index>" pair per line
with open("./vocab.txt", "w") as output_file:
    for i, word in enumerate(vocab):
        output_file.write("{}:{}\n".format(word, i))
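
For completeness, here is a minimal sketch of loading the vocabulary back into a word-to-index dictionary. The helper name `load_vocab` and the sample entries are hypothetical. The sketch splits each line at the last colon, which is safe under the assumption that tokens contain only letters, hyphens and apostrophes, as the tokenizer pattern implies.

```python
import io


def load_vocab(handle):
    """Read '<word>:<index>' lines from a file-like object into a dict."""
    vocab_index = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue   # skip blank lines
        word, _, index = line.rpartition(":")   # split at the last colon
        vocab_index[word] = int(index)
    return vocab_index


# In-memory stand-in for vocab.txt with made-up entries
sample_vocab = io.StringIO("ability:0\nclient's:1\nco-ordinator:2\n")
vocab_index = load_vocab(sample_vocab)
print(vocab_index)
# {'ability': 0, "client's": 1, 'co-ordinator': 2}
```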

## Summary
Pre-processing removed case variation, then removed 26 single-letter words, 404 stop words, 4186 words appearing only once in the document set and the 50 words appearing in the most documents. In total, 105,747 occurrences of 4,666 words were removed. The lexical diversity of the document set changed from 0.0526 to 0.064. The process involved tokenising the job descriptions, and resulted in one exported text file containing the cleaned job information and another containing the indexed vocabulary after pre-processing. 