# Milestone I Natural Language Processing
## Task 1. Basic Text Pre-processing
#### Linda Nguyen



Version: 1.0

Environment: Python 3 and Jupyter notebook


## Introduction

We'll perform text pre-processing on a job advertisement dataset, focusing on the Description of each advertisement only. We'll tokenize the descriptions, then remove single-character tokens, stopwords, and the most and least frequent words.

We'll perform the following steps:

1. Extract information from each job advertisement, then apply the following pre-processing steps to its description.
2. Tokenize each job advertisement description using the regular expression `r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"`.
3. Convert all words to lower case.
4. Remove words with length less than 2.
5. Remove stopwords using the provided stop word list, `stopwords_en.txt`.
6. Remove words that appear only once in the document collection, based on term frequency.
7. Remove the top 50 most frequent words, based on document frequency.
8. Save all job advertisement text and information in a txt file.
9. Build a vocabulary of the cleaned job advertisement descriptions and save it in a txt file.
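
To make the tokenizer pattern in step 2 concrete, here is a minimal sketch that applies it with `re.findall` from the standard library. The sample sentence is hypothetical and not taken from the dataset. The pattern keeps a run of letters, optionally followed by a single hyphen or apostrophe and another run of letters, so digits and punctuation are dropped and a second hyphen starts a new token.

```python
import re

# The tokenizer pattern from step 2: letters, optionally followed by ONE
# hyphen or apostrophe and more letters.
PATTERN = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

# Hypothetical sample text (an assumption for illustration only).
sample = "Part-time RGN's needed: 2 full-time-equivalent posts, c. 30k p.a."

# Find every match, then lower-case each token.
tokens = [t.lower() for t in re.findall(PATTERN, sample)]
print(tokens)
```

Note that single-letter leftovers such as `c`, `k`, `p` and `a` survive tokenization, which is why step 4 removes tokens shorter than 2 characters.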

## Importing libraries 

In [1]:
# importing libraries
import pandas as pd # data manipulation and analysis
import re  # regular expressions for matching patterns in strings
import numpy as np # numerical and statistical operations

from sklearn.datasets import load_files # load text files from a folder of category sub-folders

import nltk  # Natural Language Toolkit: a set of text processing libraries
from nltk.tokenize import sent_tokenize  # split text into sentences
from nltk.tokenize import RegexpTokenizer # split a string into substrings using a regular expression

# itertools provides functions that create iterators for efficient looping
from itertools import chain
# chain returns elements from the first iterable until it is exhausted,
# then proceeds to the next iterable, until all of the iterables are exhausted.
# chain.from_iterable is an alternate constructor that takes the chained inputs
# from a single iterable argument, which is evaluated lazily.

from nltk.probability import FreqDist # frequency distributions over tokens
from nltk.util import ngrams

### 1.1 Examining and loading data

Before pre-processing, we need to load the data into a proper format, which means first exploring the data folder. Inside `data`, there are 4 sub-folders, one per job category: 'Accounting_Finance', 'Engineering', 'Healthcare_Nursing' and 'Sales'. Each sub-folder contains a number of text files, and each text file contains a Title, Webindex, Company (some files have no Company information) and Description. Now, let's check whether the imported dataset matches the data description in this assignment.

In [2]:
# load the data files
data = load_files(r'data')

In [3]:
# check number of attributes of data 
print(len(data))

5


In [4]:
# check attributes of data: 
print(data.keys())

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])


The loaded data has 5 attributes as below:
    
- data - a list of the job advertisement texts (as bytes)
- target - the corresponding label of each job advertisement (integer index)
- target_names - the names of the target classes
- filenames - the names of the files holding the dataset
- DESCR - a description of the data

In [5]:
# display data
data.data

[b'Title: Finance / Accounts Asst Bromley to ****k\nWebindex: 68997528\nCompany: First Recruitment Services\nDescription: Accountant (partqualified) to **** p.a. South East London Our client, a successful manufacturing company has an immediate requirement for an Accountant for permanent role in their modern offices in South East London. The Role: Credit Control Purchase / Sales Ledger Daily collection of debts by phone, letter and email. Handling of ledger accounts Handling disputed accounts and negotiating payment terms Allocating of cash and reconciliation of accounts Adhoc administration duties within the business The Person The ideal candidate will have previous experience in a Credit Control capacity, you will possess exceptional customer service and communication skills together with IT proficiency. You will need to be a part or fully qualified Accountant to be considered for this role',
 b'Title: Fund Accountant  Hedge Fund\nWebindex: 68063513\nCompany: Austin Andrew Ltd\nDescri

In [6]:
# display data filenames
data.filenames

array(['data\\Accounting_Finance\\Job_00382.txt',
       'data\\Accounting_Finance\\Job_00354.txt',
       'data\\Healthcare_Nursing\\Job_00547.txt',
       'data\\Accounting_Finance\\Job_00246.txt',
       'data\\Healthcare_Nursing\\Job_00543.txt',
       'data\\Engineering\\Job_00089.txt',
       'data\\Healthcare_Nursing\\Job_00580.txt',
       'data\\Accounting_Finance\\Job_00419.txt',
       'data\\Sales\\Job_00767.txt', 'data\\Sales\\Job_00670.txt',
       'data\\Accounting_Finance\\Job_00263.txt',
       'data\\Accounting_Finance\\Job_00374.txt',
       'data\\Engineering\\Job_00111.txt', 'data\\Sales\\Job_00775.txt',
       'data\\Engineering\\Job_00057.txt', 'data\\Sales\\Job_00642.txt',
       'data\\Sales\\Job_00657.txt', 'data\\Engineering\\Job_00209.txt',
       'data\\Sales\\Job_00746.txt',
       'data\\Healthcare_Nursing\\Job_00479.txt',
       'data\\Healthcare_Nursing\\Job_00491.txt',
       'data\\Healthcare_Nursing\\Job_00454.txt',
       'data\\Sales\\Job_00745.txt

In [7]:
print( 'We have total', len(data['data']), 'job ads.')

We have total 776 job ads.


In [8]:
# check the number of job categories in the data set
len(data.target_names)

4

In [9]:
# check the names of the 4 job categories
data.target_names

['Accounting_Finance', 'Engineering', 'Healthcare_Nursing', 'Sales']

In [10]:
# again, check the number of jobs across all 4 job categories
len(data.target)

776

In [11]:
# check the job category corresponding to each target label
print(data.target_names[0])
print(data.target_names[1])
print(data.target_names[2])
print(data.target_names[3])

Accounting_Finance
Engineering
Healthcare_Nursing
Sales


We can see that the outputs of the exploration above match the data description in the assignment. Thus, we can move on to the next task and start pre-processing the data.

In [12]:
# store all job ads in `job` and their target labels in `lables`
job , lables = data.data, data.target

In [13]:
len(job)

776

In [14]:
len(lables)

776

### 1.2 Pre-processing data

In the following tasks, we'll apply these basic text pre-processing steps to the extracted descriptions:

- Word Tokenization
- Case Normalisation
- Removing Single Character Tokens
- Removing stopwords
- Removing words that appear only once
- Removing the top 50 most common words

Then, we'll save the vocabulary of the cleaned descriptions in `vocab.txt` and all job ad information in `job_ads.txt`.

###  Extract Description, Word Tokenization, Case Normalization for each job advertisement

In this task, we'll tokenize each description. In particular, we'll extract the description from each entry in `job`, convert it to lower case, segment it into sentences, and tokenize each sentence with the provided tokenizer pattern. We then put all tokens of the description into a single list.

Be careful: if you apply `sent_tokenize` directly to the job text files, you may get a TypeError. This is because each job text is read as a bytes object, and the tokenizer cannot apply a string pattern to a bytes-like object. To resolve this, decode each job text using UTF-8, e.g. `raw_job = raw_job.decode('utf-8')`.
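
The bytes-versus-string issue can be reproduced without NLTK. The following is a minimal sketch using only the standard `re` module; `sample_bytes` is a hypothetical record written for illustration, not a file from the dataset.

```python
import re

# Hypothetical job record in the same bytes form that load_files returns.
sample_bytes = b"Title: Scrub Nurse\nDescription: Theatre experience essential."

raised = False
try:
    # A str pattern cannot be applied to a bytes object.
    re.findall(r"[a-zA-Z]+", sample_bytes)
except TypeError as err:
    raised = True
    print("TypeError:", err)

# Decoding first turns the bytes into a str, so the same pattern now works.
decoded = sample_bytes.decode("utf-8")
decoded_words = re.findall(r"[a-zA-Z]+", decoded)
print(decoded_words)
```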

In [15]:
def tokenize_job (raw_job): # tokenize each job ad
    
    raw_job = raw_job.decode('utf-8') # decode job ads 
    raw_job = re.split(r'\n', raw_job) # split job ads by line
    des = [i for i in raw_job if i.startswith('Description:')] # extract description 
    description = re.sub(r'^Description:\s*', '', des[0]).lower() # normalize lower case for description
    sentences = sent_tokenize(description ) # tokenize by sentence
    
    pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?" # create tokenizer pattern
    tokenizer = RegexpTokenizer(pattern)
    
    # tokenize each sentence of description into tokens
    tokenize_each_sentence = [tokenizer.tokenize(sent) for sent in sentences]
    # put all tokens of description into a list
    tokens_job = list(chain.from_iterable(tokenize_each_sentence))  
    return tokens_job

In [16]:
# tokenize each job ad, each job ad saved as a list, and 776 job lists saved in a list tk_job
tk_job = [tokenize_job (i) for i in job]

In [17]:
tk_job

[['accountant',
  'partqualified',
  'to',
  'p',
  'a',
  'south',
  'east',
  'london',
  'our',
  'client',
  'a',
  'successful',
  'manufacturing',
  'company',
  'has',
  'an',
  'immediate',
  'requirement',
  'for',
  'an',
  'accountant',
  'for',
  'permanent',
  'role',
  'in',
  'their',
  'modern',
  'offices',
  'in',
  'south',
  'east',
  'london',
  'the',
  'role',
  'credit',
  'control',
  'purchase',
  'sales',
  'ledger',
  'daily',
  'collection',
  'of',
  'debts',
  'by',
  'phone',
  'letter',
  'and',
  'email',
  'handling',
  'of',
  'ledger',
  'accounts',
  'handling',
  'disputed',
  'accounts',
  'and',
  'negotiating',
  'payment',
  'terms',
  'allocating',
  'of',
  'cash',
  'and',
  'reconciliation',
  'of',
  'accounts',
  'adhoc',
  'administration',
  'duties',
  'within',
  'the',
  'business',
  'the',
  'person',
  'the',
  'ideal',
  'candidate',
  'will',
  'have',
  'previous',
  'experience',
  'in',
  'a',
  'credit',
  'control',
  'cap

Here, we need to check the size of the vocabulary at this stage, as well as the total number of tokens, etc. in this job dataset. 

In [18]:
# takes a tokenized job list and displays the vocabulary size, number of tokens, lexical diversity, and max/min/average job length
def stats_print(tk_job):
    words = list(chain.from_iterable(tk_job)) # we put all the tokens in the corpus in a single list 
    vocab = set(words) # compute the vocabulary by converting the list of words/tokens to a set
    lexical_diversity = len(vocab)/len(words) # compute lexical_diversity
    
    print("Vocabulary size: ",len(vocab))    # count number of unique words
    print("Total number of tokens: ", len(words))  # count number  of tokens
    print("Lexical diversity: ", lexical_diversity)
    print("Total number of job:", len(tk_job))
    
    lens = [len(j) for j in tk_job]  # compute number of tokens 
    print("Average job length:", np.mean(lens))
    print("Maximun job length:", np.max(lens))
    print("Minimun job length:", np.min(lens))
    print("Standard deviation of job length:", np.std(lens))

In [19]:
# print the original token stats
stats_print(tk_job)

Vocabulary size:  9834
Total number of tokens:  186952
Lexical diversity:  0.052601737344345076
Total number of job: 776
Average job length: 240.91752577319588
Maximun job length: 815
Minimun job length: 13
Standard deviation of job length: 124.97750685071483


We'll compare the stats at each stage with these original stats to see how many words have been removed.

### Removing Single Character Token

In this sub-task, we'll remove any token that contains only a single character (a token of length 1). Again, we'll use the stats to check the number of removed words.

In [20]:
st_list= [[w for w in t if len(w) <=1] for t in tk_job ] # collect the single-character tokens of each job
list(chain.from_iterable(st_list)) # merge them together in one list

['p',
 'a',
 'a',
 'a',
 'a',
 'a',
 'c',
 'a',
 'a',
 'a',
 's',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'k',
 'k',
 's',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'k',
 'a',
 'b',
 'b',
 'a',
 'b',
 'b',
 'a',
 'a',
 'b',
 'b',
 's',
 's',
 'a',
 'a',
 'a',
 'a',
 'a',
 'k',
 'a',
 'a',
 's',
 'a',
 'd',
 'd',
 'a',
 's',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'k',
 'a',
 's',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'c',
 'k',
 'a',
 'a',
 'k',
 'a',
 'a',
 'k',
 'a',
 'a',
 'a',
 'm',
 'm',
 'm',
 'm',
 'a',
 'a',
 'a',
 'a',
 's',
 's',
 'a',
 's',
 'a',
 's',
 'a',
 's',
 'a',
 's',
 'a',
 'a',
 'a',
 's',
 's',
 's',
 'a',
 'a',
 's',
 'a',
 'a',
 'a',
 'a',
 'a',
 'b',
 'c',
 'd',
 'e',
 'k',
 'a',
 'p',
 'l',
 'a'

In [21]:
tk_job = [[ w for w in t if len(w) > 1] for t in tk_job] # filter out the single-character tokens

In [22]:
# check stats
stats_print(tk_job)

Vocabulary size:  9808
Total number of tokens:  180913
Lexical diversity:  0.05421390392066905
Total number of job: 776
Average job length: 233.13530927835052
Maximun job length: 795
Minimun job length: 13
Standard deviation of job length: 121.6048654015839


### Removing stop words

In this sub-task, we'll remove the stop words from the tokenized text, using the provided stopword list.
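
As a minimal sketch of the filtering step, the example below removes stopwords from two toy token lists. The stopword excerpt and the token lists are assumptions made up for illustration; the real list comes from `stopwords_en.txt`. Holding the stopwords in a `set` keeps each membership test fast, which matters when the corpus has many tokens.

```python
# Hypothetical excerpt of a stopword file: one word per line.
stopword_lines = "a\nan\nand\nthe\nof\n"
stopword_set = set(stopword_lines.splitlines())

# Toy tokenized job descriptions (not from the dataset).
toy_jobs = [
    ["the", "role", "of", "an", "accountant"],
    ["credit", "and", "control"],
]

# Keep only the tokens that are not stopwords, document by document.
filtered_jobs = [[w for w in doc if w not in stopword_set] for doc in toy_jobs]
print(filtered_jobs)
```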

In [23]:
# load the provided stopword file
stopwords_en = []
with open('./stopwords_en.txt') as f:
    stopwords_en = f.read().splitlines()

In [24]:
print( "There are", len(stopwords_en), "stopwords in the provided list.")

There are 571 stopwords in the provided list.


In [25]:
# filter out the tokens that appear in stopwords_en (a set gives fast membership tests)
stopword_set = set(stopwords_en)
tk_job = [[w for w in t if w not in stopword_set] for t in tk_job]

In [26]:
# check stat
stats_print(tk_job)

Vocabulary size:  9404
Total number of tokens:  107161
Lexical diversity:  0.0877558066834016
Total number of job: 776
Average job length: 138.09407216494844
Maximun job length: 487
Minimun job length: 12
Standard deviation of job length: 73.07847897002313


### Removing the word that appears only once in the document collection, based on term frequency.

Term frequency counts the number of times a word occurs in the whole corpus, regardless of which document it is in.

A frequency distribution based on term frequency tells us how the total number of word tokens is distributed across all the types.

We use NLTK's built-in `FreqDist` class to compute this distribution from a list of word tokens.

Now, let's move on to the less frequent words:

- find the words that appear only once in the entire corpus
- remove these less frequent words from each tokenized job description

We first find the set of less frequent words by calling the `hapaxes` method on the term frequency distribution.
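
The idea behind hapaxes can be sketched with the standard library alone. This example swaps NLTK's `FreqDist` for `collections.Counter`, purely for illustration, and runs on a toy corpus that is not taken from the dataset.

```python
from collections import Counter
from itertools import chain

# Toy tokenized corpus: three short "job descriptions".
toy_jobs = [
    ["nurse", "ward", "nurse"],
    ["ward", "theatre"],
    ["nurse", "scrub"],
]

# Term frequency: count every occurrence across the whole corpus.
term_counts = Counter(chain.from_iterable(toy_jobs))

# Hapaxes are the words whose corpus-wide count is exactly 1.
toy_hapaxes = {word for word, count in term_counts.items() if count == 1}

# Drop the hapaxes from every document.
toy_cleaned = [[w for w in doc if w not in toy_hapaxes] for doc in toy_jobs]
print(sorted(toy_hapaxes), toy_cleaned)
```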

In [27]:
words = list(chain.from_iterable(tk_job)) # flatten tk_job into a single list of all tokens
term_fd = FreqDist(words)  # compute term frequency for each word
lessFreqWords = set(term_fd.hapaxes()) # extract the words that appear only once

In [28]:
# remove word in lessFreqWords
tk_job = [[ word for word in t if word not in lessFreqWords] for t in tk_job] 

In [29]:
# check stats
stats_print(tk_job)

Vocabulary size:  5218
Total number of tokens:  102975
Lexical diversity:  0.05067249332362224
Total number of job: 776
Average job length: 132.69974226804123
Maximun job length: 471
Minimun job length: 12
Standard deviation of job length: 70.3782402519735


###  Removing the top 50 most frequent words based on document frequency.

Document frequency is slightly different from term frequency, as it counts the number of documents in which a word occurs.<br>
For instance, if a word appears 4 times in a document, it adds 4 to the word's term frequency; for document frequency, however, it still counts as only 1.

We use NLTK's built-in `FreqDist` class to compute this distribution from the unique word tokens of each document.
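
The difference between the two counts can be illustrated on a toy corpus. This sketch again uses `collections.Counter` in place of `FreqDist` as an assumption for illustration; converting each document to a `set` before counting is what turns a term frequency into a document frequency.

```python
from collections import Counter
from itertools import chain

# Toy corpus: "sales" appears 4 times in the first document and once in the second.
toy_jobs = [
    ["sales", "sales", "sales", "sales", "target"],
    ["sales", "nurse"],
    ["nurse", "ward"],
]

# Term frequency counts every occurrence.
toy_term_freq = Counter(chain.from_iterable(toy_jobs))

# Document frequency counts each word at most once per document.
toy_doc_freq = Counter(chain.from_iterable(set(doc) for doc in toy_jobs))

print("term frequency of 'sales':", toy_term_freq["sales"])
print("document frequency of 'sales':", toy_doc_freq["sales"])
```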

In [30]:
set_words = list(chain.from_iterable([set(i) for i in tk_job])) # collect the unique tokens of each job ad
doc_fd = FreqDist(set_words) # compute document frequency for each unique word/type
mostFreq = set(w for w, _ in doc_fd.most_common(50)) # the 50 words with the highest document frequency
mostFreq

{'ability',
 'apply',
 'based',
 'benefits',
 'business',
 'candidate',
 'client',
 'clients',
 'company',
 'contact',
 'cv',
 'development',
 'essential',
 'excellent',
 'experience',
 'full',
 'good',
 'high',
 'include',
 'including',
 'information',
 'job',
 'jobseeking',
 'join',
 'knowledge',
 'leading',
 'management',
 'manager',
 'opportunity',
 'originally',
 'position',
 'posted',
 'provide',
 'recruitment',
 'required',
 'role',
 'salary',
 'sales',
 'service',
 'services',
 'skills',
 'strong',
 'successful',
 'support',
 'team',
 'training',
 'uk',
 'work',
 'working',
 'www'}

In [31]:
# remove 50 most frequent words
tk_job = [[word for word in t if word not in mostFreq] for t in tk_job]

In [32]:
# check stats
stats_print(tk_job)

Vocabulary size:  5168
Total number of tokens:  81205
Lexical diversity:  0.06364140139153993
Total number of job: 776
Average job length: 104.64561855670104
Maximun job length: 401
Minimun job length: 7
Standard deviation of job length: 58.44628718710534


Recall that at the beginning we had:

`Vocabulary size:  9834
Total number of tokens:  186952
Lexical diversity:  0.052601737344345076
Total number of job: 776
Average job length: 240.91752577319588
Maximun job length: 815
Minimun job length: 13
Standard deviation of job length: 124.97750685071483`

We've reduced the vocabulary size by about 47%.
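
The figure can be checked directly from the printed stats. This is a small worked calculation using the vocabulary sizes and token counts reported before and after pre-processing; `percent_reduction` is a helper name introduced here for illustration.

```python
def percent_reduction(before, after):
    """Return the percentage by which `after` is smaller than `before`."""
    return 100 * (before - after) / before

# Values copied from the first and last stats_print outputs above.
vocab_reduction = percent_reduction(9834, 5168)
token_reduction = percent_reduction(186952, 81205)

print(f"Vocabulary reduced by {vocab_reduction:.1f}%")
print(f"Tokens reduced by {token_reduction:.1f}%")
```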

## Saving required outputs

### Saving all job advertisement text and information in `job_ads.txt`

In this sub-task, we'll save all the job advertisement information, which lets us continue with task 2 and task 3. <br>
First, we'll extract the job ID from each text file name, keeping only the 5 digits in the file name.
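
A minimal sketch of the ID extraction, assuming file names of the form `Job_NNNNN.txt` as seen in `data.filenames`. The function name `extract_job_id` and the second and third paths are hypothetical. Anchoring the pattern on the file name is a deliberate choice: simply stripping every non-digit character from the path would also pick up digits from directory names.

```python
import re

def extract_job_id(path):
    """Return the 5-digit ID from a path ending in Job_NNNNN.txt, else None."""
    match = re.search(r"Job_(\d{5})\.txt$", path)
    return match.group(1) if match else None

paths = [
    "data\\Accounting_Finance\\Job_00382.txt",  # Windows-style path from the dataset
    "data/2023_batch/Job_00012.txt",            # hypothetical folder containing digits
    "data/Sales/notes.txt",                     # hypothetical non-job file
]
ids = [extract_job_id(p) for p in paths]
print(ids)

# Stripping all non-digits mixes the folder's digits into the ID.
naive_id = re.sub(r"\D", "", paths[1])
print(naive_id)
```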

In [33]:
file_name = data.filenames.tolist() # convert filenames to list and save in file_name

# extract the 5-digit job ID from each text file name, e.g. 'Job_00382.txt' -> '00382'
job_id = [re.search(r'Job_(\d{5})', i).group(1) for i in file_name]

In [34]:
job_id

['00382',
 '00354',
 '00547',
 '00246',
 '00543',
 '00089',
 '00580',
 '00419',
 '00767',
 '00670',
 '00263',
 '00374',
 '00111',
 '00775',
 '00057',
 '00642',
 '00657',
 '00209',
 '00746',
 '00479',
 '00491',
 '00454',
 '00745',
 '00649',
 '00259',
 '00603',
 '00583',
 '00610',
 '00394',
 '00507',
 '00696',
 '00168',
 '00615',
 '00317',
 '00200',
 '00774',
 '00716',
 '00128',
 '00032',
 '00272',
 '00494',
 '00490',
 '00594',
 '00426',
 '00240',
 '00345',
 '00559',
 '00211',
 '00734',
 '00014',
 '00012',
 '00586',
 '00429',
 '00667',
 '00287',
 '00307',
 '00073',
 '00233',
 '00444',
 '00459',
 '00707',
 '00671',
 '00668',
 '00311',
 '00763',
 '00335',
 '00161',
 '00501',
 '00770',
 '00076',
 '00686',
 '00452',
 '00632',
 '00214',
 '00729',
 '00536',
 '00126',
 '00681',
 '00593',
 '00199',
 '00692',
 '00564',
 '00591',
 '00112',
 '00377',
 '00166',
 '00165',
 '00219',
 '00684',
 '00250',
 '00757',
 '00050',
 '00761',
 '00554',
 '00442',
 '00282',
 '00568',
 '00627',
 '00562',
 '00052',


Then, we'll find the corresponding category name for each target label. `data.target_names` is ordered so that the name at index `k` belongs to label `k`, so we can look up each label directly and collect the names in a list.

In [35]:
# convert each integer label to its corresponding category name
category = [data.target_names[label] for label in lables]

In [36]:
category

['Accounting_Finance',
 'Accounting_Finance',
 'Healthcare_Nursing',
 'Accounting_Finance',
 'Healthcare_Nursing',
 'Engineering',
 'Healthcare_Nursing',
 'Accounting_Finance',
 'Sales',
 'Sales',
 'Accounting_Finance',
 'Accounting_Finance',
 'Engineering',
 'Sales',
 'Engineering',
 'Sales',
 'Sales',
 'Engineering',
 'Sales',
 'Healthcare_Nursing',
 'Healthcare_Nursing',
 'Healthcare_Nursing',
 'Sales',
 'Sales',
 'Accounting_Finance',
 'Healthcare_Nursing',
 'Healthcare_Nursing',
 'Healthcare_Nursing',
 'Accounting_Finance',
 'Healthcare_Nursing',
 'Sales',
 'Engineering',
 'Healthcare_Nursing',
 'Accounting_Finance',
 'Engineering',
 'Sales',
 'Sales',
 'Engineering',
 'Engineering',
 'Accounting_Finance',
 'Healthcare_Nursing',
 'Healthcare_Nursing',
 'Healthcare_Nursing',
 'Healthcare_Nursing',
 'Accounting_Finance',
 'Accounting_Finance',
 'Healthcare_Nursing',
 'Engineering',
 'Sales',
 'Engineering',
 'Engineering',
 'Healthcare_Nursing',
 'Healthcare_Nursing',
 'Sales',
 'Ac

To extract the Title and Webindex, we'll define a corresponding function for each. Each function takes a raw job ad as bytes, decodes it, splits it into lines, and finds the line that starts with `Title:` or `Webindex:`. It then uses a regex pattern to strip that prefix and keep the information after it. Finally, we'll collect the extracted values into a list.
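
As an alternative sketch, all header fields of one record can be collected in a single pass. The function name `parse_job_fields` is hypothetical, and the sample record reuses the Title, Webindex and Company of the first job ad shown earlier with a shortened description. It assumes, as the functions in this notebook do, that each field sits on its own line.

```python
# Field names as they appear at the start of each line in the job files.
FIELD_NAMES = {"Title", "Webindex", "Company", "Description"}

def parse_job_fields(raw_job):
    """Decode one raw job ad and return a dict of its 'Key: value' fields."""
    fields = {}
    for line in raw_job.decode("utf-8").split("\n"):
        # Split on the FIRST colon only, so colons inside the value are kept.
        key, sep, value = line.partition(":")
        if sep and key in FIELD_NAMES:
            fields[key] = value.strip()
    return fields

sample_job = (
    b"Title: Finance / Accounts Asst Bromley to ****k\n"
    b"Webindex: 68997528\n"
    b"Company: First Recruitment Services\n"
    b"Description: The Role: Credit Control Purchase / Sales Ledger"
)
fields = parse_job_fields(sample_job)
print(fields["Title"], fields["Webindex"])
```

Records without a `Company` line simply produce a dict without that key, so `fields.get("Company")` can be used to handle them.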

In [37]:
# find the Title of each job advertisement
def title (raw_job):   
    raw_job = raw_job.decode('utf-8') # decode the ad
    raw_job = re.split(r'\n', raw_job)   # split the ad by new line
    all_titles = [re.sub(r'^Title:\s*', '', i) \
           for i in raw_job if i.startswith('Title:')]  # keep the content after 'Title:'
    return all_titles   
all_titles =  list(chain.from_iterable([title (i) for i in job ])) # put them into the list all_titles

In [38]:
# print the list contain all Title
all_titles

['Finance / Accounts Asst Bromley to ****k',
 'Fund Accountant  Hedge Fund',
 'Deputy Home Manager',
 'Brokers Wanted Imediate Start',
 'RGN Nurses (Hospitals)  Penarth',
 'Production Coordinator',
 'Scrub Nurse',
 'Sales & Purchase Ledger Clerk  Maternity Cover',
 'Recruitment Sales Executive',
 'Business Development Executive  Field Sales  Dartford',
 'Investments & Treasury Controller',
 'European Payroll',
 'Engineering Assessor / Instructor  South Yorkshire',
 'International Account Manager',
 'Senior Production Technologist (Malaysia)',
 'Insurance Sales Executive  Horsham',
 'Vehicle Purchaser / Car Sales',
 'Marine Engines Specialist – Product Support',
 'Sales Manager/Medical Sales Executive',
 'Optical Assistant  Oxfordshire',
 'PERM Unit Mgr RGN Kid minster Flexi ****K due',
 "PERM RGN's in Bangor CoDown  F/T Flexi  ****ph ExOpp  Bangor",
 'Ecommerce Country Manager (Netherlands)',
 'Business Development Manager  Leading Financial Lending PLC',
 'Dynamics AX Finance Consulta

In [39]:
# double check number of Title is 776
len(all_titles)

776

In [40]:
# find the Webindex of each job advertisement
def extract_webindex(raw_job):
    raw_job = raw_job.decode('utf-8') # decode the raw info
    raw_job = re.split(r'\n', raw_job)  # split raw info into a list of lines
    found = [re.sub(r'^Webindex:\s*', '', i) \
        for i in raw_job if i.startswith('Webindex:')] # keep the 8-digit number after 'Webindex:'
    return found
# the function has its own name, so the list webindex no longer overwrites it
webindex = list(chain.from_iterable([extract_webindex(i) for i in job])) # put them into the list webindex

In [41]:
# have a look at the list webindex
webindex

['68997528',
 '68063513',
 '68700336',
 '67996688',
 '71803987',
 '70322392',
 '70086531',
 '68684698',
 '70251801',
 '72457901',
 '71851935',
 '70757932',
 '71215909',
 '70205492',
 '70207759',
 '69770990',
 '72232029',
 '71213522',
 '68258357',
 '71841735',
 '71692209',
 '71805092',
 '65101527',
 '68256188',
 '72198878',
 '68573837',
 '67749541',
 '71691899',
 '71139623',
 '72443411',
 '69799351',
 '69078766',
 '68508976',
 '68564061',
 '70762357',
 '71737507',
 '69577820',
 '67304988',
 '72452403',
 '70163439',
 '66544069',
 '71903513',
 '69568022',
 '69191349',
 '71142126',
 '72481557',
 '69539327',
 '72236089',
 '72233918',
 '68546047',
 '68217600',
 '72478300',
 '70598762',
 '71171000',
 '62016897',
 '68177629',
 '71196021',
 '70757636',
 '71793578',
 '70763481',
 '68678164',
 '70599432',
 '66887344',
 '68714905',
 '68257980',
 '62004211',
 '71367580',
 '71556854',
 '72411451',
 '69966126',
 '72438284',
 '72692186',
 '66399629',
 '72444694',
 '72448172',
 '69996401',
 '71443055',

Before we store all the job information in a txt file, let's check that each feature list contains 776 entries.

In [42]:
len(job_id)

776

In [43]:
# check if they all equal to 776 
len(job_id) == len(all_titles) == len(category) == len(webindex) == len(tk_job)

True

Next, we'll create `job_ads.txt` to save the ID, Category, Title, Webindex, and Description of each job advertisement. The information comes from the lists built above:
- ID: the 5 digits from the `job_id` list
- Category: from the `category` list
- Title: from the `all_titles` list
- Webindex: the 8 digits from the `webindex` list
- Description: from the processed `tk_job` list

The records for all job advertisements are joined into a single string, `show_string`. We should have a total of 776 job ads stored in `job_ads.txt`, which we can check after the text file is generated.
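
The record layout and the count check can be sketched in memory, without touching the file system. The helper `format_job` and the description tokens in the toy records are hypothetical; the IDs, categories, titles and Webindex values are taken from the outputs above. Since every record starts with an `ID: ` line, counting those lines gives the number of stored job ads.

```python
def format_job(job_id, category, title, webindex, tokens):
    """Format one job ad as five 'Key: value' lines."""
    return "\n".join([
        "ID: " + job_id,
        "Category: " + category,
        "Title: " + title,
        "Webindex: " + webindex,
        "Description: " + " ".join(tokens),
    ])

# Two toy records; the description tokens are made up for illustration.
toy_records = [
    ("00382", "Accounting_Finance", "Finance / Accounts Asst Bromley to ****k",
     "68997528", ["accountant", "ledger"]),
    ("00354", "Accounting_Finance", "Fund Accountant  Hedge Fund",
     "68063513", ["fund", "accountant"]),
]
toy_text = "\n".join(format_job(*record) for record in toy_records)

# Count the records by counting the lines that start a new job ad.
n_records = sum(line.startswith("ID: ") for line in toy_text.splitlines())
print(n_records)
```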

In [44]:
# join all job info into one string, five lines per job ad
show_string = '\n'.join(['\n'.join(('ID: ' + job_id[i],                 
                               'Category: ' + category[i] ,
                               'Title: ' + all_titles[i],
                               'Webindex: ' + webindex[i],            
                               'Description: ' + 
                               ' '.join(tk_job[i]))) for i in range(len(job))])

# write the string to job_ads.txt (no trailing space in the file name);
# the with-block closes the file automatically
with open('job_ads.txt', 'w') as job_ads:
    job_ads.write(show_string)

Here, we skipped the `Company` information because tasks 2 and 3 don't need it.

### Build a vocabulary of the cleaned job advertisement descriptions and save in `vocab.txt`

We have now completed all the basic pre-processing steps and are ready to move on to feature generation. Before that, in this task, we'll construct the final vocabulary.

In [45]:
# rebuild the word list from the fully cleaned tk_job (the earlier `words` list
# predates the last two removal steps), then sort the unique words alphabetically
vocab = sorted(set(chain.from_iterable(tk_job)))

# format each line as word:index and write the lines to vocab.txt
show = '\n'.join( ':'.join([vocab[i], str(i)]) for i in range(len(vocab)))   
with open('vocab.txt', 'w') as voc:  # the with-block closes the file automatically
    voc.write(show)

## Summary

We've pre-processed the job descriptions by:
* word tokenization
* case normalisation
* removing single-character words
* removing stopwords
* removing words that appear only once, by term frequency
* removing the top 50 most common words, by document frequency

After the text pre-processing, the vocabulary size has been reduced by about 47%. We can check the details at each stage using the printed stats. <br>
We then saved the vocabulary of the cleaned descriptions in `vocab.txt` and all job ad information in `job_ads.txt`. <br>

The `vocab.txt` file contains the unigram vocabulary, one word per line, in the format word_string:word_integer_index. <br>
Words in the vocabulary are sorted in alphabetical order, and the index value starts from 0.
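
For later tasks, a file in this format can be read back into a word-to-index dictionary. The sketch below uses an in-memory `io.StringIO` with three hypothetical entries in place of the real `vocab.txt`; replacing it with `open('vocab.txt')` is the assumed real-world usage.

```python
import io

# Stand-in for the real file: hypothetical entries in word:index format.
vocab_file = io.StringIO("accountant:0\nledger:1\nnurse:2")

word_to_index = {}
for line in vocab_file:
    # Split on the LAST colon so the index is always the final field.
    word, _, index = line.rstrip("\n").rpartition(":")
    word_to_index[word] = int(index)

print(word_to_index)
```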

The `job_ads.txt` file contains 776 job advertisements with the job ID, Category, Title, Webindex and Description. We'll need this file to carry out the next tasks.