# Assignment 2: Milestone I Natural Language Processing
## Task 1. Basic Text Pre-processing
#### Student Name:
#### Student ID:

Date: 02/10/2022

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* nltk
* itertools
* sklearn
* numpy
* re
* os

## Introduction
In this task, I perform basic text pre-processing on the given dataset of job advertisements: tokenization, removal of stop words, and removal of the most and least frequent words. Only the job description field is pre-processed. After pre-processing, I build a vocabulary from the cleaned descriptions, save it in the text file `vocab.txt`, and then save the text and information for all job advertisements in txt files.

## Importing libraries 

In [1]:
# Code to import libraries
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import sent_tokenize
from nltk.probability import FreqDist # only FreqDist is needed from this module
from itertools import chain
from sklearn.datasets import load_files
import numpy as np
import re
import os

### 1.1 Examining and loading data
- Examine the data folder, including the categories and job advertisement txt documents, etc. Explain your findings here, e.g., number of folders and format of txt files, etc.
- Load the data into proper data structures and get it ready for processing.
- Extract webIndex and description into proper data structures.


In [2]:
# Code to inspect the provided data file...
job_data = load_files(r"data") # reading the files from the data folder
print('Number of folders in the data folder:', len(job_data.target_names)) # number of folders (job categories)
print('Job categories in the data folder:', ', '.join(job_data.target_names)) # names of the job categories
print('Number of job advertisments (documents):', len(job_data.data))
print('\nFormat of the txt files:\n') # example of the structure of a job advertisement
print(job_data.data[0].decode())

Number of folders in the data folder: 4
Job categories in the data folder: Accounting_Finance, Engineering, Healthcare_Nursing, Sales
Number of job advertisments (documents): 776

Format of the txt files:

Title: Finance / Accounts Asst Bromley to ****k
Webindex: 68997528
Company: First Recruitment Services
Description: Accountant (partqualified) to **** p.a. South East London Our client, a successful manufacturing company has an immediate requirement for an Accountant for permanent role in their modern offices in South East London. The Role: Credit Control Purchase / Sales Ledger Daily collection of debts by phone, letter and email. Handling of ledger accounts Handling disputed accounts and negotiating payment terms Allocating of cash and reconciliation of accounts Adhoc administration duties within the business The Person The ideal candidate will have previous experience in a Credit Control capacity, you will possess exceptional customer service and communication skills together with

As the output above shows, the data folder contains four subfolders, namely Accounting_Finance, Engineering, Healthcare_Nursing, and Sales, and each folder name represents a job category. The job advertisements for a category are stored as text documents in the corresponding subfolder. Each job advertisement is a `.txt` file named `Job_<ID>.txt`. As the sample above shows, a file contains labelled fields for the title, the Webindex, the company, and the job description.
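
The folder layout described above can also be checked without `load_files`. The following is a minimal sketch using only the standard library; the helper name `count_txt_files` and the throwaway demo folder are assumptions made for illustration, not part of the provided dataset.

```python
import os
import tempfile

def count_txt_files(root):
    """Return {category_folder: number_of_txt_files} for each subfolder of root."""
    counts = {}
    for entry in sorted(os.listdir(root)):
        folder = os.path.join(root, entry)
        if os.path.isdir(folder):
            counts[entry] = sum(1 for f in os.listdir(folder) if f.endswith(".txt"))
    return counts

# Demo on a temporary folder that mimics the layout (folder and file names are made up)
with tempfile.TemporaryDirectory() as root:
    for category, n_files in [("Engineering", 2), ("Sales", 1)]:
        os.makedirs(os.path.join(root, category))
        for k in range(n_files):
            with open(os.path.join(root, category, f"Job_{k:05d}.txt"), "w") as fh:
                fh.write("Title: example\n")
    demo_counts = count_txt_files(root)

print(demo_counts)
```

Calling `count_txt_files("data")` on the real folder should report one entry per job category, assuming the folder sits next to the notebook.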

In [3]:
# Extracting Titles, Webindices, and Descriptions into separate lists
file_names = []
titles = []
webindices = []
descriptions = []

for fname in job_data.filenames:
    file_names.append(fname)

for j in job_data.data:
    text = j.decode() # decode() converts the raw bytes into a string
    titles.append(re.findall(r"(?<=Title: ).*\n", text)) # title line (the trailing newline is kept)
    webindices.append(re.findall(r"(?<=Webindex: )\d+\n", text)) # Webindex digits (the trailing newline is kept)
    descriptions.append(re.findall(r"(?<=Description: ).+", text)) # description text that follows the label
print('Number of job descriptions in the list:', len(descriptions))
print('\nFirst job advertisment in job_data:\n')
print('Title:',*titles[0],'Webindex:',*webindices[0],'Description:',*descriptions[0])

Number of job descriptions in the list: 776

First job advertisment in job_data:

Title: Finance / Accounts Asst Bromley to ****k
 Webindex: 68997528
 Description: Accountant (partqualified) to **** p.a. South East London Our client, a successful manufacturing company has an immediate requirement for an Accountant for permanent role in their modern offices in South East London. The Role: Credit Control Purchase / Sales Ledger Daily collection of debts by phone, letter and email. Handling of ledger accounts Handling disputed accounts and negotiating payment terms Allocating of cash and reconciliation of accounts Adhoc administration duties within the business The Person The ideal candidate will have previous experience in a Credit Control capacity, you will possess exceptional customer service and communication skills together with IT proficiency. You will need to be a part or fully qualified Accountant to be considered for this role
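
The leading spaces before `Webindex:` and `Description:` in the output above come from the newline that the title and Webindex patterns capture. A small sketch on a made-up record (the sample text is hypothetical) shows what each lookbehind pattern returns:

```python
import re

# A made-up record in the same field layout as the job advertisement files
sample = (
    "Title: Example Analyst\n"
    "Webindex: 12345678\n"
    "Company: Example Ltd\n"
    "Description: Short example text."
)

# (?<=...) is a lookbehind: the label must precede the match but is not part of it
title = re.findall(r"(?<=Title: ).*\n", sample)
webindex = re.findall(r"(?<=Webindex: )\d+\n", sample)
description = re.findall(r"(?<=Description: ).+", sample)

print(title)        # the title, with its trailing newline
print(webindex)     # the digits, with their trailing newline
print(description)  # the description text only

# strip() removes the captured newline when a clean value is needed
clean_title = title[0].strip()
```

Each call returns a list, which is why the notebook unpacks the results with `*` or indexes them with `[0]`.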


### 1.2 Pre-processing data
Perform the required text pre-processing steps.

In the next code block, I define a function that tokenizes the text, converts it to lowercase, removes words of `length < 2`, and removes stop words. The function takes a raw string as an argument and returns a list of tokens.
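
To illustrate the effect of these steps on a single sentence, here is a small sketch that applies the same token pattern with plain `re.findall` as a stand-in for `RegexpTokenizer`. The sentence, the mini stop word set, and the name `tokenize_demo` are made up for the example; the real function reads `stopwords_en.txt` and also splits the text into sentences first.

```python
import re

TOKEN_PATTERN = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"  # same pattern as in the notebook
demo_stopwords = {"the", "a", "is", "for", "and", "in"}  # hypothetical mini stop word list

def tokenize_demo(text, stopwords):
    tokens = re.findall(TOKEN_PATTERN, text)          # extract word-like tokens
    tokens = [t.lower() for t in tokens]              # convert to lower case
    tokens = [t for t in tokens if len(t) >= 2]       # drop one-letter tokens
    return [t for t in tokens if t not in stopwords]  # drop stop words

demo_tokens = tokenize_demo(
    "The part-time role is for a C programmer, and it's full-time in 2022.",
    demo_stopwords,
)
print(demo_tokens)
```

Hyphenated and apostrophe forms such as `part-time` and `it's` stay as single tokens, while the number `2022` and the one-letter token `c` are dropped.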

In [4]:
# function for tokenization, lowercase conversion, removing words shorter than 2 characters, and removing stop words
def tokenizeRawData(desc):
    
    with open('stopwords_en.txt') as f: # Read the stop words list
        stopwords = set(f.read().splitlines()) # a set gives fast membership tests
    
    tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?") # Word tokenizer pattern: letters with an optional hyphen or apostrophe part
    sentences = sent_tokenize(desc) # Tokenize job descriptions into sentences
    token_lists = [tokenizer.tokenize(sen) for sen in sentences] # Tokenize each sentence in the job description
    tokens = list(chain.from_iterable(token_lists)) # Flatten the lists of words (from sentences) into a single list of words
    tokenized = [token.lower() for token in tokens] # Convert into lower case
    len2words = [token for token in tokenized if len(token) >= 2] # Removing words with length less than 2
    stopped = [token for token in len2words if token not in stopwords] # Removing stop words
    
    return stopped

In [5]:
tokenised_desc = [tokenizeRawData(desc[0]) for desc in descriptions] # Tokenize each job description

In the following cell, I reuse the statistics function from the pre-class activities. The function prints several statistics, including:
* The total number of tokens across the corpus
* The size of the vocabulary (types)
* The lexical diversity, which is the ratio of the number of unique words (types) to the total number of words (tokens)
* The average, minimum, and maximum number of tokens per job description, along with the standard deviation
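
As a small worked example of these quantities, the sketch below computes them for two made-up tokenised documents (the toy data is an assumption for illustration only):

```python
from itertools import chain

# Two made-up tokenised documents
toy_docs = [
    ["credit", "control", "ledger", "credit"],
    ["sales", "ledger"],
]

all_tokens = list(chain.from_iterable(toy_docs))  # flatten into one token list
n_tokens = len(all_tokens)                        # total number of tokens
n_types = len(set(all_tokens))                    # number of unique words (types)
toy_diversity = n_types / n_tokens                # lexical diversity = types / tokens

doc_lengths = [len(doc) for doc in toy_docs]      # tokens per document
mean_length = sum(doc_lengths) / len(doc_lengths)

print(n_tokens, n_types, toy_diversity)
print(mean_length, max(doc_lengths), min(doc_lengths))
```

Here there are 6 tokens and 4 types, so the lexical diversity is 4 / 6; a lower value means that words are repeated more often.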

In [6]:
def stats_print(tokenised_desc):
    words = list(chain.from_iterable(tokenised_desc)) # we put all the tokens in the corpus in a single list
    vocab = set(words) # compute the vocabulary by converting the list of words/tokens to a set, i.e., giving a set of unique words
    lexical_diversity = len(vocab)/len(words)
    print("Vocabulary size: ",len(vocab))
    print("Total number of tokens: ", len(words))
    print("Lexical diversity: ", lexical_diversity)
    print("Total number of job decriptions:", len(tokenised_desc))
    lens = [len(desc) for desc in tokenised_desc]
    print("Average job decription length:", np.mean(lens))
    print("Maximum job decription length:", np.max(lens))
    print("Minimum job decription length:", np.min(lens))
    print("Standard deviation of job decription length:", np.std(lens))

In [7]:
stats_print(tokenised_desc)

Vocabulary size:  9404
Total number of tokens:  107161
Lexical diversity:  0.0877558066834016
Total number of job decriptions: 776
Average job decription length: 138.09407216494844
Maximum job decription length: 487
Minimum job decription length: 12
Standard deviation of job decription length: 73.07847897002313


Term frequency counts how many times a word appears across the entire corpus, regardless of which documents it appears in. The resulting frequency distribution shows how the total number of word tokens is distributed among the types. I compute term frequencies with the `FreqDist` class from the `nltk` package and then call its `hapaxes()` method to find the words that appear only once in the whole corpus.
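
The idea can be sketched on toy data with `collections.Counter` from the standard library, used here as a stand-in for `FreqDist` (the documents are made up for illustration):

```python
from collections import Counter
from itertools import chain

# Three made-up tokenised documents
toy_docs = [["nurse", "ward", "nurse"], ["ward", "shift"], ["nurse", "rota"]]

# Term frequency: count every occurrence across the whole corpus
term_freq_demo = Counter(chain.from_iterable(toy_docs))

# Words with a term frequency of exactly 1 (the hapaxes)
hapaxes_demo = {word for word, count in term_freq_demo.items() if count == 1}

# Remove those words from every document
filtered_docs = [[w for w in doc if w not in hapaxes_demo] for doc in toy_docs]

print(term_freq_demo)
print(hapaxes_demo)
print(filtered_docs)
```

In this toy corpus `shift` and `rota` each occur once, so they are the only words removed.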

In [8]:
# Removing words that appear once based on term frequency
words = list(chain.from_iterable(tokenised_desc))
term_freq = FreqDist(words) # compute term frequency for each unique word
lessFreqWords = set(term_freq.hapaxes())
print('Number of words that appear only once in the document collection based on term frequency is:', len(lessFreqWords))
for i in range(len(tokenised_desc)): # iterate over the tokenised job descriptions by index
    removed = [w for w in tokenised_desc[i] if w not in lessFreqWords]
    tokenised_desc[i] = removed

Number of words that appear only once in the document collection based on term frequency is: 4186


In [9]:
stats_print(tokenised_desc)

Vocabulary size:  5218
Total number of tokens:  102975
Lexical diversity:  0.05067249332362224
Total number of job decriptions: 776
Average job decription length: 132.69974226804123
Maximum job decription length: 471
Minimum job decription length: 12
Standard deviation of job decription length: 70.3782402519735


Document frequency is the number of documents in which a word appears. For instance, if a word appears three times in one document, it counts only once towards the document frequency, and the count increases by `1` for each additional document that contains the word. To count each word at most once per document, each tokenised description is first converted to a set. The `most_common(50)` method of `FreqDist` then returns the 50 words with the highest document frequency together with their counts.
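
The difference between term frequency and document frequency can be seen in a small sketch, again using `collections.Counter` as a stand-in for `FreqDist` and made-up documents:

```python
from collections import Counter
from itertools import chain

# Four made-up tokenised documents
toy_docs = [
    ["sales", "sales", "target"],
    ["sales", "client"],
    ["client", "target", "target"],
    ["sales"],
]

# Term frequency counts every occurrence
toy_term_freq = Counter(chain.from_iterable(toy_docs))

# Document frequency counts each word at most once per document (hence set(doc))
doc_freq_demo = Counter(chain.from_iterable(set(doc) for doc in toy_docs))

# The single most frequent word by document frequency, then remove it everywhere
top_words = {word for word, _ in doc_freq_demo.most_common(1)}
reduced_docs = [[w for w in doc if w not in top_words] for doc in toy_docs]

print(toy_term_freq["sales"], doc_freq_demo["sales"])
print(reduced_docs)
```

The word `sales` occurs four times in total but in only three documents, so its term frequency is 4 and its document frequency is 3.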

In [10]:
# Removing top 50 most frequent words based on document frequency
words_50 = list(chain.from_iterable([set(tok_desc) for tok_desc in tokenised_desc]))
doc_freq = FreqDist(words_50)
top_50 = doc_freq.most_common(50)
top_50 = [w[0] for w in top_50]
print('The top 50 most frequent words based on document frequency are:', ', '.join(top_50))
for i in range(len(tokenised_desc)): # iterate over the tokenised job descriptions by index
    remove_top50 = [w for w in tokenised_desc[i] if w not in top_50]
    tokenised_desc[i] = remove_top50

The top 50 most frequent words based on document frequency are: experience, role, work, team, working, skills, client, job, company, business, uk, excellent, management, based, apply, opportunity, salary, required, successful, support, join, candidate, service, knowledge, development, leading, high, cv, www, manager, training, sales, strong, provide, including, services, ability, contact, position, recruitment, full, benefits, posted, jobseeking, originally, clients, include, good, essential, information


In [11]:
stats_print(tokenised_desc)

Vocabulary size:  5168
Total number of tokens:  81205
Lexical diversity:  0.06364140139153993
Total number of job decriptions: 776
Average job decription length: 104.64561855670104
Maximum job decription length: 401
Minimum job decription length: 7
Standard deviation of job decription length: 58.44628718710534


## Saving required outputs
Save the vocabulary, bigrams and job advertisement txt as per specification.
- vocab.txt

The output of this task includes the file `vocab.txt`. This file contains the unigram vocabulary, one word per line, in the format `word_string:word_integer_index`, sorted in alphabetical order. In addition, the for loop in the last code block creates a new data folder named `preprocessed_data` with the same job category subfolders and saves each cleaned job advertisement to its category.
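
As a sketch of the `word_string:word_integer_index` format, the following example writes a toy vocabulary to an in-memory buffer and reads it back into a dictionary. The toy tokens and the use of `io.StringIO` instead of a real file are assumptions made for illustration.

```python
import io

# Made-up tokenised documents
toy_tokens = [["ledger", "audit"], ["nurse", "audit"]]
toy_vocab = sorted({word for doc in toy_tokens for word in doc})  # unique words, alphabetical

# Write one "word:index" entry per line (an in-memory buffer stands in for vocab.txt)
buffer = io.StringIO()
for index, word in enumerate(toy_vocab):
    buffer.write(f"{word}:{index}\n")
vocab_text = buffer.getvalue()
print(vocab_text)

# Read the entries back; rsplit on the last ":" keeps the word intact
word_to_index = {}
for line in vocab_text.splitlines():
    word, index = line.rsplit(":", 1)
    word_to_index[word] = int(index)
```

Because the vocabulary is sorted before indexing, the indices start at 0 and follow alphabetical order.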

In [12]:
# code to save output data...
words = list(chain.from_iterable(tokenised_desc)) # Combine all the tokens into one list
vocab = sorted(set(words)) # Get the unique vocab sorted alphabetically

with open('vocab.txt', 'w', encoding="utf-8") as file:
    for i, word in enumerate(vocab): # loop through the vocab list together with each word's index
        file.write(word + ":" + str(i) + '\n') # write each word and its corresponding index

In [13]:
# saving the preprocessed job ads
for i in range(len(file_names)):
    filename = ("preprocessed_" + file_names[i]) #setting the directory and the file name for each job ad
    os.makedirs(os.path.dirname(filename), exist_ok=True) # create a directory if it doesn't exist
    with open(filename, 'w', encoding="utf-8") as file:
        file.write("Title: " + str(*titles[i])) # write the job title (the extracted title already ends with a newline)
        file.write("Webindex: " + str(*webindices[i])) # write the webindex for the job (also ends with a newline)
        desc = ' '.join(tokenised_desc[i]) # concatenate each token of the job description into one string
        file.write("Description: " + desc) # write the preprocessed job description

## Summary
To summarise, I first performed basic text pre-processing on the provided dataset, namely tokenization, stop word removal, and removal of the most and least frequent words, focusing only on the job description. From the cleaned job advertisement descriptions, I built a vocabulary and saved it in a txt file named `vocab.txt`, after which I saved the text and details of all job advertisements in txt files.