# Assignment 2: Milestone I Natural Language Processing
## Task 1. Basic Text Pre-processing
#### Student Name: Padinjarekolath Jithin Nair
#### Student ID: s3952043

Date: 20/09/2023

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* pandas
* re
* os
* nltk

## Introduction

Task 1 involves preparing job advertisement descriptions for further analysis. We perform the essential text pre-processing steps: tokenization, lowercasing, removal of short words and stopwords, lemmatization, and frequency-based filtering. The result is a cleaned dataset and a vocabulary that form the foundation for subsequent tasks.

## Importing libraries 

In [1]:
# Code to import libraries as you need in this assessment
import pandas as pd
import re
import os
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.probability import FreqDist

### 1.1 Examining and loading data
- Examine the data folder, including the categories and job advertisement txt documents, etc. Explain your findings here, e.g., number of folders and format of txt files, etc.
- Load the data into proper data structures and get it ready for processing.
- Extract webIndex and description into proper data structures.


In [2]:
def load_stopwords(stopwords_file):
    with open(stopwords_file, 'r') as f:
        stopwords_list = f.read().splitlines()
    return set(stopwords_list)

# Load stopwords
stopwords_set = load_stopwords("stopwords_en.txt")

# Define the path to the "data" folder where job category folders are located
data_folder = "data"

num_categories = 0
category_file_counts = {}

# Loop through each job category folder
for category_folder in os.listdir(data_folder):
    category_path = os.path.join(data_folder, category_folder)

    # Check if it's a directory (job category)
    if os.path.isdir(category_path):
        num_categories += 1

        # Count the number of files in the category folder
        file_count = len([file for file in os.listdir(category_path) if file.endswith(".txt")])
        category_file_counts[category_folder] = file_count

# Print the number of job categories
print(f"Number of Job Categories: {num_categories}")

# Print the number of files in each category
for category, file_count in category_file_counts.items():
    print(f"Category: {category}, Number of Files: {file_count}")

Number of Job Categories: 4
Category: Sales, Number of Files: 156
Category: Accounting_Finance, Number of Files: 191
Category: Healthcare_Nursing, Number of Files: 198
Category: Engineering, Number of Files: 231


There are 4 folders, namely Sales, Accounting_Finance, Healthcare_Nursing, and Engineering, containing 156, 191, 198, and 231 files respectively (776 in total).

In [3]:
# Initialize a list to store extracted information from all files
job_info_list = []

# Loop through each job category folder
job_folders = ["Accounting_Finance", "Engineering", "Healthcare_Nursing", "Sales"]
for category_name in job_folders:
    category_path = os.path.join(data_folder, category_name)

    # Loop through each job advertisement file in the category folder
    for file_name in os.listdir(category_path):
        if file_name.startswith("Job_") and file_name.endswith(".txt"):
            file_path = os.path.join(category_path, file_name)

            # Initialize variables to store extracted values
            title = None
            webindex = None
            company = None
            description = None

            # Read the job advertisement text from the file
            with open(file_path, 'r', encoding='utf-8') as file:
                file_content = file.read()

            # Split the file content into lines
            lines = file_content.strip().split('\n')

            # Iterate through lines to extract information
            for line in lines:
                # partition() never raises; sep is empty when ': ' is missing
                key, sep, value = line.strip().partition(': ')
                if not sep:
                    continue
                if key == 'Title':
                    title = value
                elif key == 'Webindex':
                    webindex = value
                elif key == 'Company':
                    company = value
                elif key == 'Description':
                    description = value

            # Store extracted information in a dictionary
            job_info = {
                'Title': title,
                'Webindex': webindex,
                'Company': company,
                'Description': description
            }

            # Append the dictionary to the list
            job_info_list.append(job_info)

# Print the extracted information for the first job advertisement
print(job_info_list[0])


{'Title': 'Commercial Insurance Underwriter', 'Webindex': '69092773', 'Company': 'Bond Search Selection Ltd', 'Description': 'Our client is a leading Insurer based in Belfast. As part of their continued expansion an opportunity has arisen for an enthusiastic, experienced individual to join the Commercial Underwriting Team in Belfast. Job Specification:  Liaise with the broker network in relation to quotations for Liability, Property Commercial combined risks  Complying with all regulatory and internal systems controls.  Proactively resolving enquiries from the Agency network and Operations Team.  Providing expert technical advice and guidance to customers.  Analysing, underwriting and processing individual risks, including more complex Risks.  Providing coaching and guidance to other members of the Team.  Supporting achievement of Regional growth and loss ratio targets, including liaison with the Operations Team and field visits when required. Personal Specification The successful cand

Each file contains the fields Title, Webindex, Company, and Description, each on its own line in the form Key: value. Only the first entry of the list is displayed here, as there are 776 entries in total.
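
The parsing step described above can be sketched in isolation. This is a minimal illustration, not the notebook's exact code: the helper name `parse_job_ad` and the sample text are made up for the example, and it assumes every field sits on a single line.

```python
def parse_job_ad(text):
    """Parse 'Key: value' lines from one advertisement into a dict."""
    fields = {"Title": None, "Webindex": None, "Company": None, "Description": None}
    for line in text.strip().split("\n"):
        # partition() splits at the first ': ' only, so colons inside the
        # value are kept; sep is empty when the separator is absent
        key, sep, value = line.strip().partition(": ")
        if sep and key in fields:
            fields[key] = value
    return fields


# Hypothetical file content, made up for illustration only
sample = (
    "Title: Example Analyst\n"
    "Webindex: 12345678\n"
    "Company: Example Ltd\n"
    "Description: Analyse data: build reports and dashboards."
)
record = parse_job_ad(sample)
print(record["Webindex"])
print(record["Description"])
```

Fields that are missing from a file simply stay as None, which makes incomplete advertisements easy to spot later in the dataframe.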

In [4]:
# Convert the list to a dataframe for ease of access
job_info = pd.DataFrame(job_info_list)
job_info

| | Title | Webindex | Company | Description |
|---|---|---|---|---|
| 0 | Commercial Insurance Underwriter | 69092773 | Bond Search Selection Ltd | Our client is a leading Insurer based in Belfa... |
| 1 | Regulatory Policy Manager | 71674555 | Michael Page Financial Services | As a Regulatory Policy Manager at a leading Fi... |
| 2 | PA to Managing Partner | 70758175 | Wells Tobias | A brand new position has arisen within a highl... |
| 3 | Fixed Income Portfolio Analyst | 69564390 | Mason Blake Ltd | This is a newly created position will involve ... |
| 4 | Commercial Finance Analyst | 70255258 | Axon Resourcing Ltd | A newly created role for a commercial finance ... |
| ... | ... | ... | ... | ... |
| 771 | European Account Manager (German and English) | 66399629 | Positive Selection | Our client is in urgent need of a candidate wi... |
| 772 | Graduate Sales Executive, Construction Product... | 68702817 | Aaron Wallis Sales Recruitment | Graduate Sales Executive, Construction Product... |
| 773 | International Sales Account Manager German Ma... | 66935937 | INTERACTION RECRUITMENT | INTERNATIONAL SALES ACCOUNT MANAGER managing ... |
| 774 | Rail Signal Design Team Leader | 71812011 | UKStaffsearch | We are currently looking for Rail Signal Desig... |


In [5]:
job_info.dtypes

Title          object
Webindex       object
Company        object
Description    object
dtype: object

In [6]:
# Convert Webindex to an integer datatype
job_info['Webindex'] = job_info['Webindex'].astype(int)

In [7]:
job_info.dtypes

Title          object
Webindex        int64
Company        object
Description    object
dtype: object

In [8]:
df = job_info[['Webindex', 'Description']]
df

| | Webindex | Description |
|---|---|---|
| 0 | 69092773 | Our client is a leading Insurer based in Belfa... |
| 1 | 71674555 | As a Regulatory Policy Manager at a leading Fi... |
| 2 | 70758175 | A brand new position has arisen within a highl... |
| 3 | 69564390 | This is a newly created position will involve ... |
| 4 | 70255258 | A newly created role for a commercial finance ... |
| ... | ... | ... |
| 771 | 66399629 | Our client is in urgent need of a candidate wi... |
| 772 | 68702817 | Graduate Sales Executive, Construction Product... |
| 773 | 66935937 | INTERNATIONAL SALES ACCOUNT MANAGER managing ... |
| 774 | 71812011 | We are currently looking for Rail Signal Desig... |


### 1.2 Pre-processing data
Perform the required text pre-processing steps.

In [9]:
# Define a regular expression pattern for tokenization
pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
tokenizer = RegexpTokenizer(pattern)

# Initialize lists for preprocessed descriptions and tokenized descriptions
preprocessed_descriptions = []
tokenized_descriptions = []

# Initialize a list to store all words in the document collection
all_words = []

# Initialize a dictionary to keep track of word frequencies
word_freq = FreqDist()

# Initialize a WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Loop through each description
for webindex, description in zip(df['Webindex'], df['Description']):
    # Tokenize using the provided regular expression
    tokens = tokenizer.tokenize(description)
    
    # Convert tokens to lowercase, filter out short words, and remove stopwords
    filtered_tokens = [token.lower() for token in tokens if len(token) > 2 and token.lower() not in stopwords_set]
    
    # Perform lemmatization
    lem_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

    # Update the list of all words and word frequencies with the lemmatized
    # tokens, so the counts match the tokens stored in the descriptions
    all_words.extend(lem_tokens)
    word_freq.update(lem_tokens)

    for original_word, lemmatized_word in zip(filtered_tokens, lem_tokens):
        if lemmatized_word != original_word and lem_tokens.count(lemmatized_word) == 1:
            print(f"Original: {original_word}, Lemmatized: {lemmatized_word}")
    
    # Append the preprocessed description and tokenized description to their respective lists
    preprocessed_description = " ".join(lem_tokens)
    preprocessed_descriptions.append((webindex, preprocessed_description))
    tokenized_descriptions.append(lem_tokens)

# Create a set of words that appear more than once
words_appearing_more_than_once = set(word for word, freq in word_freq.items() if freq > 1)

# Remove words that appear only once in the document collection
preprocessed_descriptions = [(webindex, " ".join([token for token in description.split() 
                                                  if token in words_appearing_more_than_once])) 
                             for webindex, description in preprocessed_descriptions]

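# Remove the top 50 most frequent words based on document frequency,
# i.e. the number of descriptions in which each word appears at least once
doc_freq = FreqDist(word for _, description in preprocessed_descriptions
                    for word in set(description.split()))
most_common_words = set(word for word, _ in doc_freq.most_common(50))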

preprocessed_descriptions = [(webindex, " ".join([token for token in description.split() 
                                                  if token not in most_common_words])) 
                             for webindex, description in preprocessed_descriptions]

# Save preprocessed job advertisements to a CSV file
preprocessed_df = pd.DataFrame(preprocessed_descriptions)
preprocessed_df.to_csv("preprocessed_ads.csv", index=False, header=False, encoding='utf-8')


Original: quotations, Lemmatized: quotation
Original: systems, Lemmatized: system
Original: controls, Lemmatized: control
Original: enquiries, Lemmatized: enquiry
Original: targets, Lemmatized: target
Original: improvements, Lemmatized: improvement
Original: deadlines, Lemmatized: deadline
Original: standards, Lemmatized: standard
Original: criteria, Lemmatized: criterion
Original: years, Lemmatized: year
Original: benefits, Lemmatized: benefit
Original: details, Lemmatized: detail
Original: vacancies, Lemmatized: vacancy
Original: alerts, Lemmatized: alert
Original: papers, Lemmatized: paper
Original: impacts, Lemmatized: impact
Original: decisions, Lemmatized: decision
Original: requirements, Lemmatized: requirement
Original: regions, Lemmatized: region
Original: lines, Lemmatized: line
Original: findings, Lemmatized: finding
Original: conclusions, Lemmatized: conclusion
Original: timelines, Lemmatized: timeline
Original: individuals, Lemmatized: individual
Original: pressures, Lemma

Original: instructions, Lemmatized: instruction
Original: programmes, Lemmatized: programme
Original: responsibilities, Lemmatized: responsibility
Original: problems, Lemmatized: problem
Original: offices, Lemmatized: office
Original: records, Lemmatized: record
Original: plans, Lemmatized: plan
Original: hobbies, Lemmatized: hobby
Original: intervals, Lemmatized: interval
Original: members, Lemmatized: member
Original: visitors, Lemmatized: visitor
Original: refreshments, Lemmatized: refreshment
Original: beds, Lemmatized: bed
Original: commodes, Lemmatized: commode
Original: resources, Lemmatized: resource
Original: relatives, Lemmatized: relative
Original: wheelchairs, Lemmatized: wheelchair
Original: spectacles, Lemmatized: spectacle
Original: times, Lemmatized: time
Original: requirements, Lemmatized: requirement
Original: drinks, Lemmatized: drink
Original: outings, Lemmatized: outing
Original: visits, Lemmatized: visit
Original: dealings, Lemmatized: dealing
Original: affairs, L
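
To make the tokenization and filtering steps in the cell above easier to follow, here is a small sketch that applies the same pattern with the standard `re` module. It assumes that `RegexpTokenizer` returns the same matches as `re.findall` for this pattern; the sentence and the two-word stopword set are made up for illustration and stand in for the real descriptions and `stopwords_en.txt`.

```python
import re

# Same pattern as the tokenizer cell: letters, optionally followed by one
# hyphen- or apostrophe-joined group of letters
pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

# Made-up sentence for illustration only
text = "Full-time role; the client's 3 well-known sites need up-to-date records."
tokens = re.findall(pattern, text)
print(tokens)

# Tiny stand-in for the real stopword list (an assumption for this example)
stopwords = {"the", "need"}

# Lowercase, drop tokens shorter than 3 characters, and drop stopwords
filtered = [t.lower() for t in tokens if len(t) > 2 and t.lower() not in stopwords]
print(filtered)
```

Note that digits and punctuation never become tokens, and that a word with two hyphens such as "up-to-date" is split into "up-to" and "date" because the pattern allows only one joined group.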

### Saving outputs
Save the vocabulary of the cleaned job advertisement descriptions as per the specification.
- vocab.txt

In [10]:
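# Build the vocabulary from the final cleaned descriptions, so it contains
# exactly the words that remain after all filtering steps
final_words = {token for _, description in preprocessed_descriptions
               for token in description.split()}
vocabulary = {word: index for index, word in enumerate(sorted(final_words))}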

# Save the vocabulary to a text file
with open("vocab.txt", 'w', encoding='utf-8') as file:
    for word, index in sorted(vocabulary.items(), key=lambda x: x[1]):
        file.write(f"{word}:{index}\n")
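
As a quick sanity check on the word:index format written above, the file can be read back into a dictionary and compared with what was written. The sketch below is self-contained and uses a temporary file with made-up words; `save_vocab` and `load_vocab` are hypothetical helper names, not part of the notebook.

```python
import os
import tempfile


def save_vocab(words, path):
    """Write sorted unique words as 'word:index' lines, indices from 0."""
    vocab = {word: index for index, word in enumerate(sorted(set(words)))}
    with open(path, "w", encoding="utf-8") as f:
        for word, index in vocab.items():
            f.write(f"{word}:{index}\n")
    return vocab


def load_vocab(path):
    """Read 'word:index' lines back into a dict."""
    vocab = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            # Split on the last colon so the index is always the final part
            word, index = line.rstrip("\n").rsplit(":", 1)
            vocab[word] = int(index)
    return vocab


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab_demo.txt")
    written = save_vocab(["sales", "nurse", "audit", "nurse"], path)
    loaded = load_vocab(path)

print(loaded)
```

If the dictionary read back differs from the one written, the file format or the sorting step is the first place to look.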

## Summary

1. Data Loading: The task starts by loading job advertisement data from 4 job categories, namely Accounting_Finance, Engineering, Healthcare_Nursing, and Sales. Each advertisement file holds the job's Title, Webindex, Company, and Description.

2. Text Preprocessing: The descriptions of job advertisements are preprocessed to make them suitable for text analysis. The steps applied are tokenization, lowercase conversion, and the removal of stopwords and short words (fewer than 3 characters). Punctuation and digits are dropped by the tokenizer pattern, which matches letters only.

3. Lemmatization: Lemmatization is applied to the tokenized words to reduce them to their base or dictionary form. This helps in standardizing words (e.g., "carries" to "carry") and can improve text analysis accuracy.

4. Word Frequency Analysis: The code computes word frequencies to understand the distribution of words in the job advertisements. These counts are then used to remove words that appear only once in the collection and the 50 most frequent words.

5. Saving the files: After the words that appear only once and the 50 most frequent words are removed, the cleaned descriptions are saved to preprocessed_ads.csv together with the Webindex of each job. The cleaned vocabulary, sorted alphabetically, is then saved to vocab.txt.
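
The frequency-based filtering summarised above can rest on two different counts: term frequency (total occurrences across the collection) and document frequency (the number of descriptions containing the word). The sketch below contrasts them on three made-up token lists; it is an illustration under those assumptions and uses `collections.Counter` rather than the notebook's `FreqDist`.

```python
from collections import Counter

# Made-up tokenized descriptions for illustration only
docs = [
    ["sales", "team", "sales", "target"],
    ["nurse", "team", "care"],
    ["engineer", "team", "design", "design", "design"],
]

# Term frequency: every occurrence counts
term_freq = Counter(token for doc in docs for token in doc)

# Document frequency: each word counts once per description
doc_freq = Counter(token for doc in docs for token in set(doc))

print(term_freq["design"], doc_freq["design"])
print(term_freq["team"], doc_freq["team"])

# Words occurring once in the whole collection, and the single most
# widespread word (the notebook removes the top 50 instead of the top 1)
rare = {word for word, count in term_freq.items() if count == 1}
top = {word for word, _ in doc_freq.most_common(1)}

cleaned = [[t for t in doc if t not in rare and t not in top] for doc in docs]
print(cleaned)
```

Here "design" has a high term frequency but appears in only one description, while "team" appears in all three, so the two counts rank words differently.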