# Assignment 2: Milestone I Natural Language Processing
## Task 1. Basic Text Pre-processing
#### Student Name: Kishore Sudhir
#### Student ID: s3971501

Date: 02/10/2023

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* nltk
* numpy
* sklearn
* itertools

## Introduction
In this task, thorough text preprocessing was performed on a dataset comprising job advertisements, with a primary focus on the description field. A series of critical steps were followed, encompassing tokenization, converting to lowercase, eliminating short words, removing stopwords, and filtering out both infrequent and very frequent terms. The preprocessed job descriptions were then saved, and a vocabulary was created. These preprocessing actions are crucial for improving text data quality, making it suitable for diverse natural language processing tasks. The resultant refined data will prove invaluable for applications like text classification and information retrieval.

## Importing libraries 

In [1]:
# Import the libraries needed for this task
import nltk
import os
import numpy as np 
from sklearn.datasets import load_files 

from nltk import RegexpTokenizer
from nltk.probability import FreqDist
from nltk.tokenize import sent_tokenize
from itertools import chain


### 1.1 Examining and loading data
- Examine the data folder, including the categories and job advertisement txt documents, etc. Explain your findings here, e.g., number of folders and format of txt files, etc.
- Load the data into proper data structures and get it ready for processing.
- Extract webIndex and description into proper data structures.


Reading data using load_files function from sklearn.datasets

In [2]:
# Reading Data
job_data = load_files(r"data")


The loaded object behaves like a dictionary, so we can read its features (keys) as shown below

In [3]:
# Code to inspect the provided data file...
job_data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

The keys are self-explanatory. Let's look at how the files are stored

In [4]:
job_data['filenames'][:5]

array(['data\\Accounting_Finance\\Job_00382.txt',
       'data\\Accounting_Finance\\Job_00354.txt',
       'data\\Healthcare_Nursing\\Job_00547.txt',
       'data\\Accounting_Finance\\Job_00246.txt',
       'data\\Healthcare_Nursing\\Job_00543.txt'], dtype='<U37')

From the above, it looks like the data is categorised into folders based on industry or field of occupation

In [5]:
job_data['target_names'] # shows the categories

['Accounting_Finance', 'Engineering', 'Healthcare_Nursing', 'Sales']

In [6]:
set(job_data['target']) # 4 types of numbers representing the above categories

{0, 1, 2, 3}

So this data has four categories or classes

Let's see which category name maps to which target value

In [7]:
targets = set()
for target in job_data['target']:
    # Get the index of the target value
    target_index = target

    # Get the corresponding target name
    target_name = job_data['target_names'][target_index]

    # Create a tuple representing the pair
    values = (target_name, target)

    # Add the pair to the set of unique pairs
    targets.add(values)

sorted_targets = sorted(targets, key=lambda x: x[1])

# sorted_targets now holds the unique (target_name, target) pairs, ordered by target
for value in sorted_targets:
    target_name, target = value
    print(f"{target_name} represents {target}")

Accounting_Finance represents 0
Engineering represents 1
Healthcare_Nursing represents 2
Sales represents 3


This is the expected result: the same order as in the previous cells.
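
Because each `target` value is simply an index into `target_names`, the same mapping can also be read off directly with `enumerate`. The following is a minimal sketch; it uses a hard-coded copy of the category list shown above (named `toy_target_names` here as an assumption) rather than the loaded `job_data`.

```python
# Hard-coded copy of the job_data['target_names'] output shown above.
# Assumption: in the notebook, job_data['target_names'] would be used instead.
toy_target_names = ['Accounting_Finance', 'Engineering', 'Healthcare_Nursing', 'Sales']

# Each integer label is just the position of its name in the list.
label_to_name = dict(enumerate(toy_target_names))

for label, name in label_to_name.items():
    print(f"{name} represents {label}")
```

This avoids looping over all 776 target values just to recover four pairs.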

Let's validate that the extraction was done correctly

In [8]:
# test

ind = 7
job_data['filenames'][ind], job_data['target'][ind]

('data\\Accounting_Finance\\Job_00419.txt', 0)

In [9]:
jdata, jindustry = job_data.data, job_data.target

In [10]:
jdata[ind],jindustry[ind]

(b'Title: Sales & Purchase Ledger Clerk  Maternity Cover\nWebindex: 68684698\nCompany: JK Personnel\nDescription: Our client is looking to recruit an experienced Sales Purchase ledger clerk. You will be covering maternity over from February 2013  mid next year. The ideal candidate would be available immediately. DUTIES AND RESPONSIBILITIES (NOT LIMITED) Sales Ledger To input cheques/ bacs received onto Sage 200 accurately To reconcile remittances with actual receipts. To prepare the banking book To be responsible and ensure that all credit control has been done in a professional and timely manner. To send customer statements on a monthly basis Managing Aged Debtors Report Purchase Ledger To prepare the payment run, ensuring all invoices which are due will be processed and that there are no duplicates. To log all payments onto Sage 200 (cheques, bacs and chaps) To deal with account payable queries Managing Aged Creditors Report SKILLS REQUIREMENTS: 2 years experience Experience of syste

Perfect!

### 1.2 Pre-processing data
Perform the required text pre-processing steps.

From the previous examination, we can see that the data features (e.g. Title, Description, ...) are stored as newline-separated lines within each document. 
We only need Description for this task and Title for future tasks, so the `getFeature` function extracts only the required data from those lines

In [11]:
def getFeature(data,featureName):
    """
    Function to extract a specific feature from a list of byte strings.

    """
    features = []
    for text in data:
        feature = ""
        # Convert the byte string to a regular string
        text_decode = text.decode('utf-8')

        # Split the text into lines
        lines = text_decode.split('\n')
        for line in lines:
            if line.startswith(featureName):
                feature = line.split(featureName)[1].strip()
                break  # Stop searching after finding the first matching line

        features.append(feature)

    return features


In [12]:
descp = getFeature(jdata,"Description:") # Extract description
windex = getFeature(jdata,"Webindex") # Extract web index
title = getFeature(jdata,"Title:") # Extract title, needed only for next task.
windex[ind],title[ind],descp[ind], jindustry[ind] # visually validate

(': 68684698',
 'Sales & Purchase Ledger Clerk  Maternity Cover',
 'Our client is looking to recruit an experienced Sales Purchase ledger clerk. You will be covering maternity over from February 2013  mid next year. The ideal candidate would be available immediately. DUTIES AND RESPONSIBILITIES (NOT LIMITED) Sales Ledger To input cheques/ bacs received onto Sage 200 accurately To reconcile remittances with actual receipts. To prepare the banking book To be responsible and ensure that all credit control has been done in a professional and timely manner. To send customer statements on a monthly basis Managing Aged Debtors Report Purchase Ledger To prepare the payment run, ensuring all invoices which are due will be processed and that there are no duplicates. To log all payments onto Sage 200 (cheques, bacs and chaps) To deal with account payable queries Managing Aged Creditors Report SKILLS REQUIREMENTS: 2 years experience Experience of systems and invoicing A good understanding of curre

A Sales Purchase ledger clerk job belongs to Accounting_Finance, so the target shows as 0, as expected. So far so good :)
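
The web index above comes back as `': 68684698'` because the prefix passed to `getFeature` was `"Webindex"` without the trailing colon, so the colon and space stay in the value. As an alternative illustration, the sketch below parses every `Key: value` line at once with `str.partition`. The helper name `parse_fields` and the shortened sample record are assumptions made for this example; they are not part of the notebook.

```python
def parse_fields(raw: bytes) -> dict:
    """Parse the 'Key: value' lines of one advertisement (hypothetical helper)."""
    fields = {}
    for line in raw.decode("utf-8").split("\n"):
        # Split on the FIRST colon only, so colons inside the value are kept.
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that contain no colon at all
            fields[key.strip()] = value.strip()
    return fields

# Toy record shaped like the file shown above (description shortened by hand).
sample = (b"Title: Sales & Purchase Ledger Clerk  Maternity Cover\n"
          b"Webindex: 68684698\n"
          b"Company: JK Personnel\n"
          b"Description: Our client is looking to recruit an experienced clerk.")

record = parse_fields(sample)
print(record["Webindex"])
print(record["Title"])
```

This sketch assumes each field sits on a single line, as in the files inspected above.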

Tokenization is a crucial task. `tokenizeFeature` tokenizes a given feature based on the regex pattern `r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"`
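
To see what this pattern accepts, here is a small standalone sketch. It uses `re.findall` from the standard library as a stand-in for NLTK's `RegexpTokenizer` and skips sentence segmentation, so it is an approximation of the notebook's function rather than a copy of it; `simple_tokenize` is a name made up for this example.

```python
import re

# Same pattern as in the notebook: a run of letters, optionally followed by
# ONE hyphen or apostrophe plus another run of letters.
PATTERN = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")

def simple_tokenize(text: str) -> list:
    """Lowercase the text and return all non-overlapping pattern matches."""
    return PATTERN.findall(text.lower())

title_tokens = simple_tokenize("Sales & Purchase Ledger Clerk  Maternity Cover")
mixed_tokens = simple_tokenize("Client's well-known state-of-the-art system, est. 2013")

print(title_tokens)
# ['sales', 'purchase', 'ledger', 'clerk', 'maternity', 'cover']
print(mixed_tokens)
# ["client's", 'well-known', 'state-of', 'the-art', 'system', 'est']
```

Note that digits and punctuation are dropped, and a word with several hyphens is split after the first hyphenated pair because the optional group can match only once.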

In [13]:
# Tokenization
def tokenizeFeature(raw_feature):

    """
    Function to tokenize a raw feature based on the assignment requirements
    """
    nl_feature = raw_feature.lower() # convert all words to lowercase
    
    # segment into sentences
    features = sent_tokenize(nl_feature)
    
    # tokenize each sentence
    pattern =  r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
    tokenizer = RegexpTokenizer(pattern) 
    token_lists = [tokenizer.tokenize(feat) for feat in features]
    
    # merge them into a list of tokens
    tokenised_features = list(chain.from_iterable(token_lists))
    return tokenised_features

In [14]:
# list of tokenized descriptions
tk_descriptions = [tokenizeFeature(d) for d in descp]

In [15]:
tk_descriptions[ind]

['our',
 'client',
 'is',
 'looking',
 'to',
 'recruit',
 'an',
 'experienced',
 'sales',
 'purchase',
 'ledger',
 'clerk',
 'you',
 'will',
 'be',
 'covering',
 'maternity',
 'over',
 'from',
 'february',
 'mid',
 'next',
 'year',
 'the',
 'ideal',
 'candidate',
 'would',
 'be',
 'available',
 'immediately',
 'duties',
 'and',
 'responsibilities',
 'not',
 'limited',
 'sales',
 'ledger',
 'to',
 'input',
 'cheques',
 'bacs',
 'received',
 'onto',
 'sage',
 'accurately',
 'to',
 'reconcile',
 'remittances',
 'with',
 'actual',
 'receipts',
 'to',
 'prepare',
 'the',
 'banking',
 'book',
 'to',
 'be',
 'responsible',
 'and',
 'ensure',
 'that',
 'all',
 'credit',
 'control',
 'has',
 'been',
 'done',
 'in',
 'a',
 'professional',
 'and',
 'timely',
 'manner',
 'to',
 'send',
 'customer',
 'statements',
 'on',
 'a',
 'monthly',
 'basis',
 'managing',
 'aged',
 'debtors',
 'report',
 'purchase',
 'ledger',
 'to',
 'prepare',
 'the',
 'payment',
 'run',
 'ensuring',
 'all',
 'invoices',
 '

The description was tokenized as expected. Below, the title is tokenized for future use

In [16]:
# tokenize text
tk_title = [tokenizeFeature(t) for t in title]
tk_title[ind]

['sales', 'purchase', 'ledger', 'clerk', 'maternity', 'cover']

In [17]:
# raw view vs tokenized view
print("Raw description:\n", descp[ind], "\n\nTokenized description:\n", tk_descriptions[ind])
print("\nRaw Title:\n", title[ind], "\n\nTokenized Title:\n", tk_title[ind])

Raw description:
 Our client is looking to recruit an experienced Sales Purchase ledger clerk. You will be covering maternity over from February 2013  mid next year. The ideal candidate would be available immediately. DUTIES AND RESPONSIBILITIES (NOT LIMITED) Sales Ledger To input cheques/ bacs received onto Sage 200 accurately To reconcile remittances with actual receipts. To prepare the banking book To be responsible and ensure that all credit control has been done in a professional and timely manner. To send customer statements on a monthly basis Managing Aged Debtors Report Purchase Ledger To prepare the payment run, ensuring all invoices which are due will be processed and that there are no duplicates. To log all payments onto Sage 200 (cheques, bacs and chaps) To deal with account payable queries Managing Aged Creditors Report SKILLS REQUIREMENTS: 2 years experience Experience of systems and invoicing A good understanding of currencies Computer Literate in office, outlook, word a

This confirms that the tokenized output lines up with the raw text, with no data mismatches.

In [18]:
# checking stats, function is taken from week 7 lab exercise, with very little changes
def stats_print(tk_descriptions):
    """
    Function to print few stats of the given tokenized feature.

    NOTE:
    This function is taken from week 7 lab exercise, with very little (or no) changes
    """
    words = list(chain.from_iterable(tk_descriptions))
    vocab = set(words)
    lexical_diversity = len(vocab)/len(words)
    print("Vocabulary size: ",len(vocab))
    print("Total number of tokens: ", len(words))
    print("Lexical diversity: ", lexical_diversity)
    print("Total number of description:", len(tk_descriptions))
    lens = [len(article) for article in tk_descriptions]
    print("Average description length:", np.mean(lens))
    print("Maximun description length:", np.max(lens))
    print("Minimun description length:", np.min(lens))
    print("Standard deviation of description length:", np.std(lens))


In [19]:
stats_print(tk_descriptions)

Vocabulary size:  9779
Total number of tokens:  184038
Lexical diversity:  0.05313576543974614
Total number of description: 776
Average description length: 237.16237113402062
Maximun description length: 815
Minimun description length: 1
Standard deviation of description length: 128.22977695321202


Starting from this point, tokenized text preprocessing adheres to the specifications outlined in Task 1: Basic Text Pre-processing (as per the assignment requirements).

In [20]:
# Count the tokens of length 2 or less (these are filtered out in the next cell)

dc_list = [[w for w in descp if len(w) <= 2 ] \
                      for descp in tk_descriptions]
len(list(chain.from_iterable(dc_list)))

33210

In [21]:
# keep only description tokens longer than 2 characters
tk_descriptions = [[w for w in descp if len(w) >2] \
                      for descp in tk_descriptions]


In [22]:
# apply the same length filter to the title tokens
tk_title = [[w for w in descp if len(w) >2] \
                      for descp in tk_title]

In [23]:
# stop words

stopwords_en = []
with open('utils/stopwords_en.txt') as f:
    stopwords_en = f.read().splitlines()
    # stopwords_en = f.readlines()

In [24]:
len(stopwords_en)

571

In [25]:
stopwordSet = set(stopwords_en)
len(stopwordSet)

570

In [26]:
words = list(chain.from_iterable(tk_descriptions))
vocab = set(words) 

Let's deviate slightly from the specified requirements and check if there are any duplicated words in the `stopwords_en`.

In [27]:
duplicates = []
seen = set()

for word in stopwords_en:
    if word in seen:
        duplicates.append(word)
    else:
        seen.add(word)

print("Duplicate stopwords:", duplicates)

Duplicate stopwords: ['would']
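
This explains why the list has 571 entries but the set has only 570. The same check can be written more compactly with `collections.Counter`. The sketch below runs on a small made-up list (`toy_stopwords`), not on the real stop word file.

```python
from collections import Counter

# Toy stand-in for the stop word file (assumption: not the real list).
toy_stopwords = ["a", "would", "the", "would", "and"]

# Count each entry, then keep the ones that appear more than once.
counts = Counter(toy_stopwords)
toy_duplicates = [word for word, n in counts.items() if n > 1]

print("Duplicate stopwords:", toy_duplicates)
print(len(toy_stopwords), len(set(toy_stopwords)))
```

The gap between the list length and the set length equals the number of repeated entries.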


The code below removes stop words from `tk_descriptions`

In [28]:
tk_descriptions = [[w for w in descp if w not in stopwordSet] for descp in tk_descriptions]
stats_print(tk_descriptions)

Vocabulary size:  9192
Total number of tokens:  104146
Lexical diversity:  0.08826071092504753
Total number of description: 776
Average description length: 134.20876288659792
Maximun description length: 482
Minimun description length: 1
Standard deviation of description length: 73.96986866017521


In [29]:
words = list(chain.from_iterable(tk_descriptions))
vocab_after = set(words) 

Let's examine the removed words

In [30]:
removed = list(vocab - vocab_after)  # words before - words after
print(f"Removed {len(removed)} number of stop words.")
removed

Removed 368 number of stop words.


['course',
 'followed',
 'comes',
 'goes',
 'alone',
 'only',
 'exactly',
 'second',
 'thus',
 'else',
 'seen',
 'still',
 'known',
 'away',
 'three',
 'reasonably',
 'much',
 'nothing',
 'becoming',
 'her',
 'throughout',
 'why',
 "what's",
 'ourselves',
 'nine',
 'itself',
 'name',
 'kept',
 'without',
 'want',
 'say',
 'available',
 'know',
 'look',
 'likely',
 "you'll",
 'believe',
 'regarding',
 "you're",
 'different',
 'until',
 'willing',
 "you've",
 'uses',
 'into',
 'please',
 'nobody',
 'anyone',
 'against',
 'down',
 'despite',
 'entirely',
 'indicated',
 'non',
 'might',
 'and',
 'least',
 'while',
 'outside',
 'neither',
 'seven',
 'clearly',
 'able',
 'come',
 'however',
 'when',
 'around',
 'necessary',
 'hopefully',
 'upon',
 'definitely',
 'given',
 'probably',
 'those',
 'later',
 'self',
 'beyond',
 'unless',
 'none',
 'anything',
 'some',
 'here',
 'already',
 'across',
 'since',
 'gone',
 'how',
 'also',
 'gives',
 'needs',
 'had',
 'each',
 'not',
 'hence',
 'thou

In [31]:
len(words),len(vocab_after)

(104146, 9192)

In [32]:
# checking stop words for title
titlewords  = list(chain.from_iterable(tk_title))
title_vocab_before = set(titlewords) 

In [33]:
tk_title = [[w for w in title if w not in stopwordSet] for title in tk_title]

In [34]:
titlewords  = list(chain.from_iterable(tk_title))
title_vocab_after = set(titlewords) 
removed_title = list(title_vocab_before - title_vocab_after)  # words before - words after
print(f"Removed {len(removed_title)} number of stop words.")
removed_title

The pre-processing steps below are applied only to `Description`, not to `Title`, as removing these words from titles might not be appropriate based on the stats 

Luke Gallagher also mentioned this in the discussion forum

In [35]:
# term frequency
term_freq = FreqDist(words) # description

In [36]:
term_freq.most_common(5)

[('experience', 1260),
 ('sales', 1023),
 ('role', 941),
 ('work', 842),
 ('business', 823)]

In [37]:
# Less frequent words based on term frequency

rare_words = set(term_freq.hapaxes())
rare_words

{'kilmarnock',
 "ireland's",
 'flm',
 'imaginative',
 'conversational',
 'cascaded',
 'stone',
 'nmcresponsible',
 'payingin',
 'eng',
 'surpassing',
 'biopharma',
 'financeanalyst',
 'homeholly',
 'applythis',
 'discretionary',
 'hcv',
 'equate',
 'remittances',
 'apqp',
 'calendar',
 'couriers',
 'gyfrinachol',
 'decline',
 'discloser',
 'sutherland',
 'entities',
 'mct',
 'puts',
 'genetics',
 'determines',
 'testandvalidationengineer',
 'salesexecutivemeetingseventsaberdeen',
 'urgency',
 'underwrite',
 'rfj',
 'overcoming',
 'dysphagia',
 'xsd',
 'illnesses',
 'shortlist',
 'passing',
 'terry',
 'multinationals',
 'implicitly',
 'reacting',
 'preventable',
 'doubled',
 'netbios',
 'langley',
 'mrcpsych',
 'ruane',
 'absences',
 'mencap',
 'leinster',
 'liamzest',
 'ranalyst',
 'legally',
 'pbo',
 'gilchrist',
 'eating',
 'governments',
 'commencement',
 'staines',
 'painting',
 'wwf',
 'responsibliities',
 'referencesthis',
 'morris',
 'leatherhead',
 'chwilio',
 'celebrated',
 'c

In [38]:
# Remove words

def removeWords(descp,exclude):
    """
    Function to remove words from a description, works for both rare and most common words
    """
    return [w for w in descp if w not in exclude]

tk_descriptions = [removeWords(descp,rare_words) for descp in tk_descriptions]
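
The rare-word step relies on hapaxes, i.e. terms whose total count across the whole corpus is exactly one. A minimal sketch of the same idea using only the standard library is shown here; `toy_docs` is a made-up corpus standing in for `tk_descriptions`, and `Counter` stands in for NLTK's `FreqDist`.

```python
from collections import Counter
from itertools import chain

# Toy tokenized corpus (assumption: stands in for tk_descriptions).
toy_docs = [["sales", "ledger", "clerk"],
            ["sales", "engineer", "clerk"],
            ["nurse", "sales"]]

# Term frequency: total occurrences across the whole corpus.
toy_term_freq = Counter(chain.from_iterable(toy_docs))

# Hapaxes are the terms seen exactly once.
toy_hapaxes = {word for word, count in toy_term_freq.items() if count == 1}

# Drop the hapaxes from every document, keeping document order intact.
toy_filtered = [[w for w in doc if w not in toy_hapaxes] for doc in toy_docs]

print(sorted(toy_hapaxes))
print(toy_filtered)
```

Keeping the excluded words in a set makes each membership test cheap, which matters once the corpus has thousands of rare terms.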

In [39]:
stats_print(tk_descriptions)

Vocabulary size:  5108
Total number of tokens:  100062
Lexical diversity:  0.05104835002298575
Total number of description: 776
Average description length: 128.9458762886598
Maximun description length: 466
Minimun description length: 1
Standard deviation of description length: 71.24901633254898


In [40]:
# most frequent words based on document frequency

common_words = list(chain.from_iterable([set(descp) for descp in tk_descriptions]))

doc_freq = FreqDist(common_words)
doc_freq.most_common(5)

[('experience', 579),
 ('role', 495),
 ('work', 445),
 ('team', 427),
 ('working', 402)]
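
Note that this ranking differs from the term-frequency ranking earlier: `sales` was second by term frequency but is not in the top five here. Document frequency counts a word at most once per description, so a word repeated many times in a few descriptions ranks lower. The sketch below shows the difference on a made-up corpus, with `Counter` standing in for `FreqDist`.

```python
from collections import Counter

# Toy corpus (assumption): 'sales' is repeated heavily inside one document.
toy_docs = [["sales", "sales", "sales", "team"],
            ["team", "role"],
            ["team", "role", "sales"]]

# Term frequency counts every occurrence.
tf = Counter(w for doc in toy_docs for w in doc)

# Document frequency counts each word at most once per document,
# which is what wrapping each document in set() achieves.
df = Counter(w for doc in toy_docs for w in set(doc))

print(tf["sales"], df["sales"])   # 4 2
print(df.most_common(1))          # [('team', 3)]
```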

In [41]:
top50_common_words = [word for word, freq in doc_freq.most_common(50)]
top50_common_words

['experience',
 'role',
 'work',
 'team',
 'working',
 'skills',
 'client',
 'job',
 'business',
 'company',
 'excellent',
 'management',
 'based',
 'apply',
 'opportunity',
 'salary',
 'required',
 'successful',
 'support',
 'join',
 'candidate',
 'knowledge',
 'service',
 'development',
 'leading',
 'high',
 'manager',
 'www',
 'training',
 'strong',
 'sales',
 'including',
 'provide',
 'services',
 'ability',
 'contact',
 'position',
 'full',
 'posted',
 'recruitment',
 'jobseeking',
 'originally',
 'benefits',
 'include',
 'essential',
 'good',
 'clients',
 'communication',
 'information',
 'customer']

In [42]:
words_before = list(chain.from_iterable(tk_descriptions))
vocab_before = set(words_before) 
tk_descriptions_before = tk_descriptions

In [43]:
tk_descriptions = [removeWords(descp,top50_common_words) for descp in tk_descriptions] # remove top 50 most common words

In [44]:
removed = []
for x in tk_descriptions_before:
    if x not in tk_descriptions:
        removed.append(x)
removed = list(chain.from_iterable(removed))

In [45]:
words_after = list(chain.from_iterable(tk_descriptions))
vocab_after = set(words_after) 

In [46]:
removed = list(vocab_before - vocab_after)  # words before - words after
print(f"Removed {len(removed)} number of common words.")

Removed 50 number of common words.


In [47]:
diff1 = set(top50_common_words) - set(removed)
diff2 = set(removed) - set(top50_common_words)
diff1, diff2

(set(), set())

Looks like all top 50 words are removed :)

In [48]:
stats_print(tk_descriptions)

Vocabulary size:  5058
Total number of tokens:  78819
Lexical diversity:  0.06417234423172077
Total number of description: 776
Average description length: 101.57087628865979
Maximun description length: 392
Minimun description length: 0
Standard deviation of description length: 58.893164037654984


_____________________________________________

For comparison, the stats before pre-processing (output of cell 19) were:

Vocabulary size:  9779  
Total number of tokens:  184038  
Lexical diversity:  0.05313576543974614  
Total number of description: 776  
Average description length: 237.16237113402062  
Maximun description length: 815  
Minimun description length: 1  
Standard deviation of description length: 128.22977695321202
_____________________________________________

The vocabulary size has decreased by approximately 48.3% (from 9,779 to 5,058), and the number of tokens has been reduced by about 57% (from 184,038 to 78,819). Additionally, lexical diversity has improved by approximately 21% (from about 0.053 to about 0.064).
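
These percentages can be reproduced from the two `stats_print` outputs. The sketch below copies the vocabulary and token counts by hand from those outputs; `pct_change` is a helper name introduced only for this example.

```python
# Figures copied by hand from the stats_print outputs above.
before = {"vocab": 9779, "tokens": 184038}
after = {"vocab": 5058, "tokens": 78819}

def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

vocab_change = pct_change(before["vocab"], after["vocab"])
token_change = pct_change(before["tokens"], after["tokens"])

# Lexical diversity is vocabulary size divided by total token count.
diversity_change = pct_change(before["vocab"] / before["tokens"],
                              after["vocab"] / after["tokens"])

print(f"Vocabulary: {vocab_change:.1f}%")
print(f"Tokens: {token_change:.1f}%")
print(f"Lexical diversity: {diversity_change:+.1f}%")
```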

Finally, the pre-processing is done. The final words and vocabulary are built below, ready to be saved

In [49]:
# final words and vocab
words = list(chain.from_iterable(tk_descriptions))
vocab = set(words)

In [50]:
sorted_vocab = sorted(vocab)

In [51]:
vocab_indexed = {word: index for index, word in enumerate(sorted_vocab)} # stores in word_string:word_integer_index format
vocab_indexed 

{'aap': 0,
 'aaron': 1,
 'aat': 2,
 'abb': 3,
 'abenefit': 4,
 'aberdeen': 5,
 'abi': 6,
 'abilities': 7,
 'abreast': 8,
 'abroad': 9,
 'absence': 10,
 'absolute': 11,
 'aca': 12,
 'academic': 13,
 'academy': 14,
 'acca': 15,
 'accept': 16,
 'acceptable': 17,
 'acceptance': 18,
 'accepted': 19,
 'access': 20,
 'accessible': 21,
 'accident': 22,
 'accommodates': 23,
 'accommodation': 24,
 'accomplished': 25,
 'accordance': 26,
 'account': 27,
 'accountabilities': 28,
 'accountability': 29,
 'accountable': 30,
 'accountancy': 31,
 'accountant': 32,
 'accountants': 33,
 'accounting': 34,
 'accounts': 35,
 'accreditation': 36,
 'accredited': 37,
 'accruals': 38,
 'accuracy': 39,
 'accurate': 40,
 'accurately': 41,
 'achievable': 42,
 'achieve': 43,
 'achieved': 44,
 'achievement': 45,
 'achievements': 46,
 'achiever': 47,
 'achieving': 48,
 'acii': 49,
 'acquired': 50,
 'acquisition': 51,
 'acquisitions': 52,
 'act': 53,
 'acting': 54,
 'action': 55,
 'actions': 56,
 'actionscript': 57,
 '

## Saving required outputs
Save the vocabulary, bigrams and job advertisement txt as per specification.
- vocab.txt

In [52]:
# code to save output data...

def save_feature(featFilename,tk_feature):
    """
    Function to save the processed feature data (description/title)
    """
    output_file = open(featFilename, 'w') # creates a txt file and opens it to save the tokenized documents
    string = "\n".join([" ".join(feat) for feat in tk_feature]) # each line each doc
    string = string.strip()
    output_file.write(string.rstrip())
    output_file.close() # close the file
    
def save_industry(industryFilename,industry):
    """
    Function to save the target data, i.e, industry
    """
    output_file = open(industryFilename, 'w') # creates a txt file and opens it to save the industry labels
    string = "\n".join([str(s) for s in industry])
    output_file.write(string)
    output_file.close() # close the file    


def save_vocabulary(vocabFilename,vocab):
    """
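    Function to save the generated vocabulary.
    NOTE: the `vocab` argument is not used; the notebook-level `vocab_indexed` mapping is what gets written.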
    """
    with open(vocabFilename, 'w') as file:
        for word, index in vocab_indexed.items():
            file.write(f"{word}:{index}\n")


def save_webindex(windexFileName, windex):
    """
    Function to save the web index, so later we can use it in process of count vector generation
    """
    with open(windexFileName, 'w') as file:
        for index in windex:
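            # getFeature was called with "Webindex" (no colon), so each value
            # starts with ': '; strip that prefix before writing
            file.write(f"{index.lstrip(': ')}\n")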



In [53]:
save_feature('generated_features/descriptions.txt',tk_descriptions) # save descriptions
save_feature('generated_features/titles.txt',tk_title) # save titles

In [54]:
save_industry('generated_features/industry.txt',jindustry) # save industry

In [55]:
save_webindex('generated_features/web_index.txt', windex) # save webindex

Let's check the values

In [56]:
print(descp[ind]) # an example of a description txt
print("----")
print(tk_descriptions[ind]) 
print("----")
print(windex[ind]) # ': 68684698' (the ': ' prefix is removed when saving)

Our client is looking to recruit an experienced Sales Purchase ledger clerk. You will be covering maternity over from February 2013  mid next year. The ideal candidate would be available immediately. DUTIES AND RESPONSIBILITIES (NOT LIMITED) Sales Ledger To input cheques/ bacs received onto Sage 200 accurately To reconcile remittances with actual receipts. To prepare the banking book To be responsible and ensure that all credit control has been done in a professional and timely manner. To send customer statements on a monthly basis Managing Aged Debtors Report Purchase Ledger To prepare the payment run, ensuring all invoices which are due will be processed and that there are no duplicates. To log all payments onto Sage 200 (cheques, bacs and chaps) To deal with account payable queries Managing Aged Creditors Report SKILLS REQUIREMENTS: 2 years experience Experience of systems and invoicing A good understanding of currencies Computer Literate in office, outlook, word and excel Sage 200 

Great! They are stored without mismatches

In [57]:
all(job_data.target==jindustry) 

True

In [58]:
len(tk_descriptions),len(jindustry),len(windex)

(776, 776, 776)

In [59]:
# saving vocabulary
save_vocabulary('generated_features/vocab.txt',vocab_indexed)
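
Since the vocabulary is written as one `word:index` pair per line, it can be read back into a dictionary for later tasks. The following is a sketch only: `load_vocabulary` is a hypothetical helper that does not appear in the notebook, and the round trip uses a three-word toy vocabulary in a temporary directory instead of the real `vocab.txt`.

```python
import os
import tempfile

def load_vocabulary(path):
    """Read 'word:index' lines back into a dict (hypothetical helper)."""
    vocab = {}
    with open(path) as f:
        for line in f:
            # Split on the LAST colon, so the index is always the final part.
            word, _, index = line.rstrip("\n").rpartition(":")
            vocab[word] = int(index)
    return vocab

# Toy vocabulary in the same word -> index shape as vocab_indexed.
toy_vocab = {"aap": 0, "aaron": 1, "aat": 2}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.txt")
    # Write in the same format that save_vocabulary uses.
    with open(path, "w") as f:
        for word, index in toy_vocab.items():
            f.write(f"{word}:{index}\n")
    restored = load_vocabulary(path)

print(restored == toy_vocab)
```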

## Summary
This text preprocessing task has been a valuable learning experience. It emphasized the importance of meticulous data cleaning in natural language processing. Key takeaways include tokenization, lowercasing, stop word removal, and filtering rare and common terms based on term frequency and document frequency respectively. These steps are critical for enhancing data quality and preparing it for downstream NLP tasks. Additionally, creating a vocabulary aids in understanding the dataset's unique terms. Overall, this task underscores that proper preprocessing is a fundamental aspect of text analysis, enabling more accurate and insightful results in machine learning and NLP applications.

## Acknowledgements

Some of the code (functions) is adapted, with modifications, from the weekly (week 7 to week 9) activity notebooks and lab notebooks
