# FIT5196 Task 2 in Assessment 1
#### Student Name: Tan Kah Wang
#### Student ID: 29442826

Date: 01/09/2018

Version: 1.0

Environment: Python 3.6.4 and Jupyter notebook

Libraries used: 
* re (for regular expression, included in Anaconda Python 3.6.4) 
* nltk.data (for sentence segmentation using Punkt Sentence Tokenizer, included in Anaconda Python 3.6.4)
* nltk.tokenize (for tokenizing using RegexpTokenizer and MWETokenizer, included in Anaconda Python 3.6.4)
* nltk.collocations (for generating bigrams and using PMI, included in Anaconda Python 3.6.4)
* nltk.probability (for getting Frequency Distribution, included in Anaconda Python 3.6.4)
* nltk.stem (for stemming using PorterStemmer, included in Anaconda Python 3.6.4)
* itertools (for efficient looping, included in Anaconda Python 3.6.4)

## 1. Introduction
This task builds sparse representations for 250 resumes, which involves word tokenization, vocabulary generation, and the generation of sparse representations.

The IDs of the 250 resumes are listed in `resume_dataset.txt` and were extracted using a regular expression. Because the 250 assigned IDs contained duplicates, only 218 unique resumes were loaded into Python.

Text pre-processing was performed with the objective of producing a vocabulary for the tokens found in the resumes and the associated sparse count vector for each resume. The pre-processing includes sentence segmentation, case normalization, tokenisation, collocations, removal of stopwords, stemming, and removal of rare tokens and tokens of length less than 3. The initial tokenised vocabulary of the corpus was 17078 unique words, which was reduced to 2261 words following pre-processing.

## 2. Import libraries

In [1]:
import re
import nltk.data
from nltk.tokenize import RegexpTokenizer
from nltk.collocations import *
from nltk.tokenize import MWETokenizer
from nltk.probability import *
from nltk.stem import PorterStemmer
from itertools import chain

## 3. Extracting assigned resumes

We are provided with a file `resume_dataset.txt` that contains every student's ID and the IDs of their 250 assigned resumes. We first load the file and inspect the format of its lines to work out how to extract the 250 resume IDs belonging to each student.

In [2]:
# print first 20 lines of resume_dataset.txt
with open("./resume_dataset.txt", 'r') as infile:
    print('\n'.join([infile.readline().strip() for i in range(0, 20)]))

29262909:[834  90 217 765 701 148 246 243 440 624 589 566 588 840 736 218 777 251
788 266 774 789 737 348  29 645 628 752 728 493 308 353 117 211 738 438
692 548 778  88 534 725 138 651 451 101 148 797 527 364 613 109 658 683
840 542 266 779 249 860 594 258 566 467 320 289 754 578 465 526 766 429
798 562 466 601 220 798 181  47 547 178 761  84 507 377 471 621 215 560
288  55 119 356 855 590 700 755 242 383 795 197 312 726 735 415 316 654
402 705 762  62 444 431  19 511 381 202 378 719 696 684 543  16 476 725
273 565 686 447 179  91 836 328 561 285 735 359 739 840 445 192 345 650
271 710  75 581 186 295 415 353 117 816 276 724 196 669 284  54 374 593
335  15 657 186 554  14  32 425 188 261 774 258 763 836 128 466 109 319
434 780 775 749 861 792 830 133 412 849 504 410 135 686 857 663 568 652
411 279 385 120 782 163 140   3 220 160 250 553   9 795 863 709 452 385
573 327 832 542 481 733 315 118  75  43 790 660 539 686 420 404 606 361
373 258 233 474 592  45 496 601  62  60 479 590 780 86

We observed that, for each student ID, the 250 resume IDs are stored between `[` and `]` and separated by whitespace, and that each student ID is followed directly by a `:`. We also noted that resume IDs have 1, 2 or 3 digits. Thus, the regular expression `r"29442826:\[([\d\s]*)\]"` was used to extract the 250 resume IDs assigned to my student ID (i.e. 29442826).

We used the capturing group `([\d\s]*)` to capture all digits and whitespace. Because `[\d\s]` cannot match `]`, the `*` keeps matching until it reaches the first `]`, which marks the end of each student's assigned resume IDs. Since the student ID is immediately followed by `:[`, we used that to mark the start of the assigned resume IDs. A `\` is placed in front of `[` and `]` to escape them, as they are special characters in regular expressions. The results of the regular expression are as follows:

In [3]:
with open("./resume_dataset.txt", 'r') as infile:
    all_resumes=infile.read()

temp_data=re.search(r"29442826:\[([\d\s]*)\]",all_resumes)
temp_data.group(1)

'186 139 502 759 634  34 255  30 719 313 745 283 646 524 255 415 492 508\n 142 594 768 418 409 478 484 278 533 141 384 422 707 199 214 784  77 264\n 538 272 593 138  14 679 254 340 169 438 330 634 584 742  34 131 339 633\n 473 280 470 194 479 614 481 336 338  89 800 472 516 476 726 282 531 204\n 641 455 751  54 317 160 335 206 345 181 503 276 206 171 593 844 790 836\n 745 147  70 778  69 230 435 726  42  59 849 373 771 320 194 608 323 721\n 449 172 803 219 550 319 583 387 733 143 704 323 141 827 590 638 625 704\n 189 628 772 756 784 607 852 604 346 684 658 157 547 170 120 777 365 122\n 572 684 177 656 125 663 145 833 778 689 231 661 113 564 826 200 472 340\n 675 461 672 250 265 467  15 212 556 389  90 210 626  58  86 741 845 567\n 365 487 269 662 417 730  22 357 770 449 158 838 564 547 559 839 553 285\n 814 778 369   1 453 191 706  47 232 650 255 397 476 736 223  57 570   4\n 314 422 533 515 548 829 627 240  48 302 571 225 633 443 726 117 768 420\n 614 855 657 326 463 231 788 160 500 6

Once the regular expression has extracted all the resume IDs from the file, we append each individual resume ID to a list using a for loop and if-else statements. As each resume ID is separated by either spaces or `\n`, we loop through every character in the result above and check whether it is a digit. If it is, we add it to `temp_str` and keep adding until we reach a non-digit character. Once we reach a non-digit character, and if the length of `temp_str` is not 0, we append `temp_str`, which now holds a single resume ID, to `temp_list`. We then reset `temp_str` to an empty string and the loop continues.

Once we have finished extracting the resume IDs into `temp_list`, we remove the duplicated IDs by converting it to a set and then back to a list, `filenames`.

In [4]:
temp_list=[]
temp_str=""
for i in range(0,len(temp_data.group(1))):
    if temp_data.group(1)[i].isdigit():
        temp_str+=temp_data.group(1)[i]
        # to ensure we add the last resume ID into the list as well as the last resume ID does not end with a space or \n after
        if i==len(temp_data.group(1))-1:
            temp_list.append(temp_str)
    elif not temp_data.group(1)[i].isdigit() and (len(temp_str))!=0:
        temp_list.append(temp_str)
        temp_str=""

# convert to set first then to list to remove duplicates
filenames=list(set(temp_list))
        
print("The initial number of resumes: " + str(len(temp_list)))
print("The number of unique resumes: " + str(len(filenames)))

The initial number of resumes: 250
The number of unique resumes: 218


Note that after removing duplicated resumes, we are left with 218 unique resumes out of the assigned 250.
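The extraction and de-duplication steps above can also be sketched more compactly. The snippet below is only an illustration on an invented miniature of `resume_dataset.txt` (the student IDs and resume IDs in `sample` are made up); it assumes the IDs are separated purely by whitespace, in which case `str.split()` with no argument does the same job as the character loop.

```python
import re

# Hypothetical miniature of resume_dataset.txt; all IDs here are invented
sample = "11111111:[ 12   7 305\n 7  88]\n22222222:[ 1 2 3]"

# Same pattern shape as above, with a made-up student ID
match = re.search(r"11111111:\[([\d\s]*)\]", sample)

# split() with no argument splits on any run of spaces or newlines
ids = match.group(1).split()

# set() removes duplicates; sorting by integer value gives a stable order
unique_ids = sorted(set(ids), key=int)

print(ids)         # all assigned IDs, duplicates included
print(unique_ids)  # unique IDs only
```

Unlike `list(set(...))`, whose order is arbitrary, sorting the unique IDs makes the resulting list of filenames reproducible between runs.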

After that, we convert each resume ID to its filename, including the file path, so that the resumes can be read in more easily. This is done by adding `./resume_(` before and `).txt` after each resume ID, since a resume has a path such as `./resume_(23).txt`, where `23` is the resume ID.

In [5]:
# convert the list of resumes id to filenames for extraction
for i in range(0,len(filenames)):
    filenames[i]='./resume_(' + filenames[i] + ').txt'

# print the first 10 filenames
filenames[0:10]

['./resume_(790).txt',
 './resume_(42).txt',
 './resume_(200).txt',
 './resume_(199).txt',
 './resume_(230).txt',
 './resume_(210).txt',
 './resume_(646).txt',
 './resume_(547).txt',
 './resume_(689).txt',
 './resume_(158).txt']

## 4. Text Preprocessing

We first open one resume to see what its text generally looks like before we proceed with preprocessing the text into tokens. Note that we used `encoding="utf-8"` here, as Python had trouble reading the file when the encoding was not specified.

In [6]:
with open("./resume_(186).txt", 'r',encoding="utf-8") as infile:
    rawtext=infile.read()

rawtext

'\ufeff \n\nSimon Sun \n\nsimon.y.sun@hotmail.com ;      +852 55169217 \n\nSummary \n\n\uf06c  6 plus years equity research analyst with expertise in bottom-up investing. \n\uf06c  Strong Asian network with the research focus on China A/H markets. \n\uf06c  Expertise in fundamental analysis&modelling, due diligence, and channel check. \n\uf06c  Fluency in Mandarin and English. Proficiency in Office/Bloomberg/Capital IQ. \n \n\nWork Experience \n\nCredit Suisse (HK) Co., Ltd                                                          Hong Kong, China \n\nAssociate –Senior Equity Analyst, Great China Consumer            Jan 2015– Present \n\n\uf06c  Lead analyst of the China/HK consumer goods and retail research team.   \n\uf06c  Performing due diligence and channel check of clients/suppliers/competitors. \n\uf06c  Presenting bottom-up investment ideas to global institutional investors effectively. \n\uf06c  Reports were published as an "Ideal Engine" product of Credit Suisse globally. \n\u

Note that the text contains repeated escape sequences such as `\ufeff` (a byte order mark) and `\uf06c` (a private-use character, here used as a bullet), which represent non-ASCII Unicode characters. One way to remove them is to encode the text into ASCII while ignoring every character that cannot be encoded, and then decode it back into a string. At the same time, we can remove `\n`, replace `\t` with a space, and strip the leading and trailing whitespace.

We try reading in the data again:

In [7]:
with open("./resume_(186).txt", 'r',encoding="utf-8") as infile:
    rawtext=infile.read().encode('ascii','ignore').decode().replace("\n","").replace("\t"," ").strip()

rawtext

'Simon Sun simon.y.sun@hotmail.com ;      +852 55169217 Summary   6 plus years equity research analyst with expertise in bottom-up investing.   Strong Asian network with the research focus on China A/H markets.   Expertise in fundamental analysis&modelling, due diligence, and channel check.   Fluency in Mandarin and English. Proficiency in Office/Bloomberg/Capital IQ.  Work Experience Credit Suisse (HK) Co., Ltd                                                          Hong Kong, China Associate Senior Equity Analyst, Great China Consumer            Jan 2015 Present   Lead analyst of the China/HK consumer goods and retail research team.     Performing due diligence and channel check of clients/suppliers/competitors.   Presenting bottom-up investment ideas to global institutional investors effectively.   Reports were published as an "Ideal Engine" product of Credit Suisse globally.   Ranked Top 3 Asian ex-Japan Consumer Analyst in Institutional Investor ranking.  CITIC Securities Co., Lt
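To make the effect of the encode/decode round trip concrete, here is a minimal sketch on a short invented string (`raw` is not taken from any resume). It shows that every non-ASCII character is dropped, including legitimate ones such as the en dash in a date range.

```python
# Invented sample containing a BOM, a private-use bullet, a tab and an en dash
raw = "\ufeff \n\nSummary \n\n\uf06c  6 plus years\tanalyst \nJan 2015\u2013 Present \n"

clean = (
    raw.encode("ascii", "ignore")  # drop every character that is not ASCII
       .decode()                   # bytes back to str
       .replace("\n", "")          # remove newlines
       .replace("\t", " ")         # tabs become single spaces
       .strip()                    # trim leading/trailing whitespace
)

print(repr(clean))
```

Because newlines are replaced with an empty string rather than a space, two words separated only by a newline would be joined together. In the sample resume shown above, each line ends with a space before the newline, so the words stay separated.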

### 4.1 Sentence Segmentation

The output now looks cleaner with the non-ASCII and newline characters removed. As one of the listed tasks was to normalize tokens to lowercase except for capital tokens appearing in the middle of a sentence/line, we need to know where each sentence/line starts and ends.

We can make use of the <b>Punkt Sentence Tokenizer</b>, loaded via the `nltk.data` library, which splits a given string of text into sentences.

In [8]:
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = sent_detector.tokenize(rawtext)
sentences

['Simon Sun simon.y.sun@hotmail.com ;      +852 55169217 Summary   6 plus years equity research analyst with expertise in bottom-up investing.',
 'Strong Asian network with the research focus on China A/H markets.',
 'Expertise in fundamental analysis&modelling, due diligence, and channel check.',
 'Fluency in Mandarin and English.',
 'Proficiency in Office/Bloomberg/Capital IQ.',
 'Work Experience Credit Suisse (HK) Co., Ltd                                                          Hong Kong, China Associate Senior Equity Analyst, Great China Consumer            Jan 2015 Present   Lead analyst of the China/HK consumer goods and retail research team.',
 'Performing due diligence and channel check of clients/suppliers/competitors.',
 'Presenting bottom-up investment ideas to global institutional investors effectively.',
 'Reports were published as an "Ideal Engine" product of Credit Suisse globally.',
 'Ranked Top 3 Asian ex-Japan Consumer Analyst in Institutional Investor ranking.',
 'C

### 4.2 Case Normalization

Once the sentence tokenizer has segmented each sentence and returned the result as a list with each element containing a sentence as seen above, we can proceed with case normalization by changing the first token/word of each sentence into lowercase. 

We can use Python's built-in `.lower()` string method to convert a string into lowercase. To change only the first word of each sentence into lowercase, we need to extract the first word, apply `.lower()` to it, and put it back into its original sentence. A function `search_word` was created to find the index of the first character of the first word and the index of the first space after that word in each sentence.

The function loops through all characters of a sentence and, using an if-elif statement, appends the index of the first alphanumeric character to a list `result`. As we only want the first alphanumeric character, the condition `len(result)==0` stops any further alphanumeric indexes from being appended. The next index we want is that of the first space after the first word, so the elif branch appends the index of a space only when `len(result)==1`, i.e. once the first alphanumeric character has been found. After both indexes have been collected, no further indexes are appended and the function returns `result`.

In [9]:
# function to search indexes of first character of word and first space after the first word in each sentence
def search_word(sentence):
    result=[]
    for i in range(0,len(sentence)):
        # append index of first alphanumeric character when we have yet to find the first alphanumeric character
        if (sentence[i].isalpha() or sentence[i].isdigit()) and len(result)==0:
            result.append(i)
        # append index of first space after the first word after we have found the index of the first alphanumeric character
        elif sentence[i]==' ' and len(result)==1:
            result.append(i)
    return result

print("The index of first alphanumeric character and first space after each word in the sentence: " + "\n\n" + str(sentences[0])
      + "\n\n" + str(search_word(sentences[0])))

The index of first alphanumeric character and first space after each word in the sentence: 

Simon Sun simon.y.sun@hotmail.com ;      +852 55169217 Summary   6 plus years equity research analyst with expertise in bottom-up investing.

[0, 5]


Once we have the index of the first alphanumeric character and first space after the first word, we can just convert the first word of each sentence using the two indexes.

To do this, we check the length of the list returned by `search_word`. If it holds more than one index, the sentence contains a first word followed by a space, so we lowercase the characters from the first index (the first alphanumeric character) up to the second index (the first space after the first word) and append the rest of the sentence unchanged from the second index onwards. Note that any characters before the first index are dropped.

Otherwise, if `search_word` returned no index or only one index (the sentence has no alphanumeric character, or no space after its first word), we simply apply `.lower()` to the whole sentence.

In [10]:
# assign indexes as the result of search_word function with the above sentence
indexes=search_word(sentences[0])
norm_rawtext=[]
if len(indexes)>1:
    norm_rawtext.append(sentences[0][indexes[0]:indexes[1]].lower()+sentences[0][indexes[1]:])
else:
    norm_rawtext.append(sentences[0].lower())
            
norm_rawtext

['simon Sun simon.y.sun@hotmail.com ;      +852 55169217 Summary   6 plus years equity research analyst with expertise in bottom-up investing.']
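For comparison, the same idea can be sketched with a single regular expression. `lower_first_word` below is a hypothetical helper, not the function used in this notebook, and it is not an exact replacement: it keeps any characters that precede the first word, only recognises ASCII letters and digits, and treats any whitespace (not just a space) as the end of the first word.

```python
import re

def lower_first_word(sentence):
    """Lowercase the first word of a sentence (hypothetical helper)."""
    # First ASCII alphanumeric character, followed by everything up to
    # the next whitespace character
    match = re.search(r"[A-Za-z0-9]\S*", sentence)
    if match is None:
        # No alphanumeric character at all: lowercase the whole string
        return sentence.lower()
    return sentence[:match.start()] + match.group().lower() + sentence[match.end():]

print(lower_first_word("Strong Asian network with the research focus."))
print(lower_first_word("  (HK) Co., Ltd"))
```

Capital tokens after the first word, such as `Asian`, are left untouched, which matches the normalization rule described above.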

### 4.3 Word Tokenization

The output of Section 4.2 shows that the first word of the sentence has been normalized while the capitalised words in the rest of the sentence remain as they were. Once case normalization is done, we can proceed with word tokenization.

We are told to use the regular expression `r"\w+(?:[-']\w+)?"` for word tokenization. It matches one or more word characters (i.e. `[A-Za-z0-9_]`), optionally followed by a single `-` or `'` and one or more further word characters. The `?:` at the start of the group makes it a non-capturing group.

We can use `RegexpTokenizer`, imported from the `nltk.tokenize` library.

In [11]:
tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?")
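To see what this pattern does and does not keep together, the sketch below applies it with `re.findall` to an invented phrase. This assumes that, for a pattern without capturing groups, `RegexpTokenizer` returns the same matches as `re.findall`; the sample text is made up for illustration.

```python
import re

pattern = r"\w+(?:[-']\w+)?"

# Invented text exercising hyphens, apostrophes, '&' and '/'
text = "bottom-up investing, analysis&modelling; client's A/H state-of-the-art"

tokens = re.findall(pattern, text)
print(tokens)
```

The optional group can match only once per token, so `state-of-the-art` is split into `state-of` and `the-art`, while `&` and `/` always separate tokens.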

### 4.4 Generating unigram tokens for all resumes
Now, we need to perform data extraction, case normalization and tokenization on each of the resumes stored in `filenames` in Section 3 above.

A function `tokenize_resume` is written with each filename as an argument and the following is done on each filename:
1. Data reading, encoding and decoding as per Section 4
2. Sentence Segmentation as per Section 4.1 using <b> Punkt Sentence Tokenizer </b>
3. Case Normalization as per Section 4.2 using `search_word` function
4. Word Tokenization as per Section 4.3 using <b> RegexpTokenizer </b>

The result of this function will be the `filename` along with all the unigram tokens after performing tokenization using the given regular expression.

In [12]:
# function to extract data from file, convert to lowercase, and then tokenize
def tokenize_resume(filename):
    with open(filename, 'r',encoding="utf-8") as infile:
        rawtext=infile.read().encode('ascii','ignore').decode().replace("\n","").replace("\t"," ").strip()
    
    sentences = sent_detector.tokenize(rawtext)
    norm_rawtext=[]
    for sentence in sentences:
        indexes=search_word(sentence)
        if len(indexes)>1:
            norm_rawtext.append(sentence[indexes[0]:indexes[1]].lower()+sentence[indexes[1]:])
        else:
            norm_rawtext.append(sentence.lower())
            
    unigram_tokens = tokenizer.tokenize(str(norm_rawtext))
    return (filename, unigram_tokens)

# returns the result after running function tokenize_resume on resume_(186)
tokenize_resume("./resume_(186).txt")

('./resume_(186).txt',
 ['simon',
  'Sun',
  'simon',
  'y',
  'sun',
  'hotmail',
  'com',
  '852',
  '55169217',
  'Summary',
  '6',
  'plus',
  'years',
  'equity',
  'research',
  'analyst',
  'with',
  'expertise',
  'in',
  'bottom-up',
  'investing',
  'strong',
  'Asian',
  'network',
  'with',
  'the',
  'research',
  'focus',
  'on',
  'China',
  'A',
  'H',
  'markets',
  'expertise',
  'in',
  'fundamental',
  'analysis',
  'modelling',
  'due',
  'diligence',
  'and',
  'channel',
  'check',
  'fluency',
  'in',
  'Mandarin',
  'and',
  'English',
  'proficiency',
  'in',
  'Office',
  'Bloomberg',
  'Capital',
  'IQ',
  'work',
  'Experience',
  'Credit',
  'Suisse',
  'HK',
  'Co',
  'Ltd',
  'Hong',
  'Kong',
  'China',
  'Associate',
  'Senior',
  'Equity',
  'Analyst',
  'Great',
  'China',
  'Consumer',
  'Jan',
  '2015',
  'Present',
  'Lead',
  'analyst',
  'of',
  'the',
  'China',
  'HK',
  'consumer',
  'goods',
  'and',
  'retail',
  'research',
  'team',
  'pe

We then run `tokenize_resume` function on each file in `filenames` and store the result `all_unigram_tokens` as a dictionary with the keys being the filenames and the values of each key being the unigram tokens generated by the tokenizer.

In [13]:
# get a dictionary storing all resumes with their tokens after running tokenize_resume for each file
all_unigram_tokens=dict(tokenize_resume(file) for file in filenames)

all_unigram_tokens

{'./resume_(1).txt': ['curriculum',
  'Vitae',
  'V',
  'Gowribalan',
  'MCSI',
  'FCMA',
  'CPA',
  'Aust',
  'cgma',
  'BSc',
  'Hons',
  'investment',
  'Manager',
  'with',
  'an',
  'established',
  'investment',
  'track-record',
  'across',
  'the',
  'GCC',
  'region',
  'spanning',
  'listed',
  'equities',
  'sukuks',
  'and',
  'debt',
  'securities',
  'honed',
  'expertise',
  'of',
  '14',
  'years',
  'in',
  'portfolio',
  'management',
  'and',
  'investment',
  'analysis',
  'experience',
  'includes',
  'establishing',
  'and',
  'leading',
  'the',
  'Asset',
  'Management',
  'Division',
  'AMD',
  'of',
  'Ahli',
  'Bank',
  'SAOG',
  'launching',
  'of',
  'mutual',
  'fund',
  'structuring',
  'of',
  'wealth',
  'management',
  'products',
  'strategizing',
  'acquisitions',
  'handling',
  'initial',
  'public',
  'offerings',
  'IPOs',
  'and',
  'raising',
  'investment',
  'funds',
  'across',
  'asset-classes',
  'and',
  'risk-thresholds',
  'credentials'

If we look at the number of tokens for each resume (i.e. the length of the value for each key), we notice that some resumes contain 0 tokens (e.g. resume_508).

In [14]:
# print length of number of tokens in each resume
for each in all_unigram_tokens.keys():
    print(each,len(all_unigram_tokens[each]))

./resume_(790).txt 906
./resume_(42).txt 383
./resume_(200).txt 404
./resume_(199).txt 826
./resume_(230).txt 585
./resume_(210).txt 570
./resume_(646).txt 518
./resume_(547).txt 405
./resume_(689).txt 1785
./resume_(158).txt 898
./resume_(225).txt 504
./resume_(570).txt 936
./resume_(171).txt 348
./resume_(397).txt 602
./resume_(594).txt 410
./resume_(515).txt 263
./resume_(145).txt 615
./resume_(777).txt 1191
./resume_(590).txt 661
./resume_(232).txt 538
./resume_(336).txt 451
./resume_(707).txt 946
./resume_(572).txt 1057
./resume_(282).txt 669
./resume_(684).txt 760
./resume_(553).txt 703
./resume_(604).txt 581
./resume_(641).txt 851
./resume_(719).txt 302
./resume_(120).txt 473
./resume_(58).txt 700
./resume_(770).txt 1003
./resume_(583).txt 719
./resume_(189).txt 1368
./resume_(759).txt 697
./resume_(59).txt 1017
./resume_(22).txt 442
./resume_(14).txt 545
./resume_(276).txt 634
./resume_(516).txt 602
./resume_(606).txt 290
./resume_(730).txt 353
./resume_(214).txt 892
./resume_(

It is therefore better to remove empty resumes from further analysis, as they serve little purpose and would affect the results when we calculate the document frequency of each token. Thus, we removed them from `all_unigram_tokens` as well as from `filenames`.

In [15]:
# create a list of empty resumes
empty_resume=[]
for each in all_unigram_tokens.keys():
    if len(all_unigram_tokens[each])==0:
        empty_resume.append(each)

# remove all empty resumes from all_unigram_tokens and filenames
for each in empty_resume:
    del all_unigram_tokens[each]
    filenames.remove(each)

Initially, the corpus contains 17078 unique words and 136443 tokens, giving a lexical diversity (the average number of tokens per unique word) of 7.989. Note that the total number of unique resumes is now 217 after removal of empty resumes.

In [16]:
# initial status of whole corpus
initial_words = list(chain.from_iterable(all_unigram_tokens.values()))
initial_vocab = set(initial_words)
initial_lexical_diversity=len(initial_words)/len(initial_vocab)
print ("Vocabulary size: ",len(initial_vocab))
print ("Total number of tokens: ", len(initial_words))
print ("Lexical diversity: ", initial_lexical_diversity)
print ("Total number of unique resumes:", len(filenames))

Vocabulary size:  17078
Total number of tokens:  136443
Lexical diversity:  7.989401569270407
Total number of unique resumes: 217
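The corpus statistics above are simple ratios, which the following toy example makes explicit. The dictionary `toy_tokens` and its contents are invented; only the use of `chain.from_iterable` mirrors the cell above.

```python
from itertools import chain

# Invented two-document corpus in the same shape as all_unigram_tokens
toy_tokens = {
    "./a.txt": ["equity", "research", "equity"],
    "./b.txt": ["research", "analyst"],
}

# Flatten the per-document token lists into one list for the whole corpus
toy_words = list(chain.from_iterable(toy_tokens.values()))
toy_vocab = set(toy_words)

# Lexical diversity here means tokens per unique word
toy_diversity = len(toy_words) / len(toy_vocab)
print(len(toy_vocab), len(toy_words), toy_diversity)
```

With the figures reported above, the same ratio is 136443 / 17078, which is approximately 7.989.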


### 4.5 Collocations - Generating Bigrams

Once we have generated the unigram tokens for the resumes, we can generate the bigrams. We generate bigrams before any further preprocessing because later steps remove tokens, which could cause us to lose some meaningful bigrams.

After generating bigrams from the list of tokens, we keep the commonly co-occurring ones by ranking the bigrams using Pointwise Mutual Information (PMI), which scores a bigram from the counts of its two individual words and the count of the two words occurring together.

We first get a list of all the tokens from all resumes in `all_unigram_tokens` using `chain.from_iterable`. Then, we generate the top 260 bigrams using `BigramAssocMeasures()` and `BigramCollocationFinder.from_words()` from the `nltk.collocations` library. Since we are told to remove rare tokens with a 2% threshold, we use `apply_freq_filter` to discard bigrams whose frequency in the corpus is below 2% of the number of unique resumes (`217 * 0.02 = 4.34`, so a bigram must occur at least 5 times to be kept). The reasoning is that a token's document frequency can never exceed its total count, so a bigram that occurs fewer times than 2% of the number of resumes cannot appear in 2% of the resumes and would be removed as a rare token anyway. We are also told to remove tokens of length < 3, so we use `apply_word_filter` to discard bigrams containing a word of length < 3. Finally, we get the top 260 bigrams by PMI via `nbest(bigram_measures.pmi, 260)`.

Note that we set it to be top 260 instead of the top 200 as listed in the assignment requirement as some bigrams generated might not be found in the corpus at all.
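As a rough illustration of the PMI ranking described above, the sketch below computes the score by hand on an invented token list. It assumes the common definition PMI(w1, w2) = log2(c(w1, w2) * N / (c(w1) * c(w2))), where N is the total number of tokens; the helper `pmi` and the data are hypothetical and are not NLTK code.

```python
import math
from collections import Counter

# Invented token stream
toy = ["hong", "kong", "bank", "hong", "kong", "equity", "bank", "equity"]

unigram_counts = Counter(toy)
bigram_counts = Counter(zip(toy, toy[1:]))  # adjacent pairs
n_tokens = len(toy)

def pmi(w1, w2):
    """PMI of an observed bigram, using raw counts (hypothetical helper)."""
    joint = bigram_counts[(w1, w2)]
    return math.log2(joint * n_tokens / (unigram_counts[w1] * unigram_counts[w2]))

print(pmi("hong", "kong"))  # pair that always occurs together
print(pmi("kong", "bank"))  # pair that occurs together once
```

Because the score divides by the individual word counts, a pair of very rare words that happen to co-occur once can score highly. This is one reason to apply a frequency filter before ranking by PMI.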

In [17]:
# Get the list of all tokens from all the resumes
all_words = list(chain.from_iterable(all_unigram_tokens.values()))

# Generate bigrams
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(all_words)

# Since rare tokens with 2% threshold will be removed from vocab
finder.apply_freq_filter(len(filenames)*0.02)
# Since tokens of len<3 will be removed from vocab
finder.apply_word_filter(lambda w: len(w) < 3)
# Find the top 260 bigrams based on PMI
top_260_bigrams=finder.nbest(bigram_measures.pmi, 260)
top_260_bigrams

[('Charges', 'Striking'),
 ('ISAE', '3402'),
 ('Inst', 'itute'),
 ('Meg', 'Gie'),
 ('San', 'Francisco'),
 ('Supply', 'Chain'),
 ('Toa', 'Payoh'),
 ('sid', 'dadlani'),
 ('Sri', 'Lanka'),
 ('Rochester', 'Minnesota'),
 ('Appointment', 'Resignation'),
 ('Sdn', 'Bhd'),
 ('Wong', 'Meg'),
 ('EXPECTED', 'SALARY'),
 ('Hwa', 'Chong'),
 ('Leave', 'reason'),
 ('Abdul', 'Rahman'),
 ('PTE', 'LTD'),
 ('problem', 'solving'),
 ('Tunku', 'Abdul'),
 ('SGV', 'Manila'),
 ('Inland', 'Revenue'),
 ('AGM', 'EGM'),
 ('HONG', 'KONG'),
 ('Merrill', 'Lynch'),
 ('Marital', 'Status'),
 ('EGM', 'Shares'),
 ('Interest', 'Rate'),
 ('Adobe', 'Photoshop'),
 ('invoice', 'debit'),
 ('Touche', 'Tohmatsu'),
 ('Due', 'Diligence'),
 ('Issuance', 'Shares'),
 ('Shares', 'Issuance'),
 ('Transfer', 'Change'),
 ('Kuala', 'Lumpur'),
 ('third', 'party'),
 ('Chong', 'Institution'),
 ('Goldman', 'Sachs'),
 ('Royal', 'Melbourne'),
 ('lie', 'nts'),
 ('name', 'Appointment'),
 ('Human', 'Resource'),
 ('Central', 'Provident'),
 ('America', 

### 4.6 Re-tokenizing the data again

After we have generated the top 260 bigrams, we need to re-tokenize the data to ensure that the individual unigram tokens that make up a bigram are not split and captured as separate unigrams, but are captured as a single bigram token instead. We can do this by building an `MWETokenizer` from the `nltk.tokenize` library with `top_260_bigrams` and then tokenizing the data again.

In [18]:
# adding the top_260_bigrams
mwetokenizer = MWETokenizer(top_260_bigrams)
# tokenizing based on the top_260_bigrams to ensure individual unigrams that make up bigrams are not split
colloc_tokens =  dict((filename, mwetokenizer.tokenize(resume)) for filename,resume in all_unigram_tokens.items())
colloc_tokens

{'./resume_(1).txt': ['curriculum',
  'Vitae',
  'V',
  'Gowribalan',
  'MCSI',
  'FCMA',
  'CPA',
  'Aust',
  'cgma',
  'BSc_Hons',
  'investment',
  'Manager',
  'with',
  'an',
  'established',
  'investment',
  'track-record',
  'across',
  'the',
  'GCC',
  'region',
  'spanning',
  'listed',
  'equities',
  'sukuks',
  'and',
  'debt',
  'securities',
  'honed',
  'expertise',
  'of',
  '14',
  'years',
  'in',
  'portfolio',
  'management',
  'and',
  'investment',
  'analysis',
  'experience',
  'includes',
  'establishing',
  'and',
  'leading',
  'the',
  'Asset',
  'Management',
  'Division',
  'AMD',
  'of',
  'Ahli',
  'Bank',
  'SAOG',
  'launching',
  'of',
  'mutual',
  'fund',
  'structuring',
  'of',
  'wealth',
  'management',
  'products',
  'strategizing',
  'acquisitions',
  'handling',
  'initial',
  'public',
  'offerings',
  'IPOs',
  'and',
  'raising',
  'investment',
  'funds',
  'across',
  'asset-classes',
  'and',
  'risk-thresholds',
  'credentials',
  '

Note that after adding in the top 260 bigrams, the vocabulary size has increased from 17078 to 17278 and the total number of tokens has reduced from 136443 to 134014.

In [19]:
# status of whole corpus after adding in bigrams
words = list(chain.from_iterable(colloc_tokens.values()))
vocab = set(words)
lexical_diversity=len(words)/len(vocab)
print ("Vocabulary size: ",len(vocab))
print ("Total number of tokens: ", len(words))
print ("Lexical diversity: ", lexical_diversity)
print ("Total number of unique resumes:", len(filenames))

Vocabulary size:  17278
Total number of tokens:  134014
Lexical diversity:  7.7563375390670215
Total number of unique resumes: 217
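The merging performed in Section 4.6 can be pictured with a small pure-Python sketch. `merge_bigrams` is a hypothetical stand-in written for illustration, not the `MWETokenizer` implementation; it assumes a greedy left-to-right scan and `_` as the joining character, as seen in tokens such as `BSc_Hons` above.

```python
def merge_bigrams(tokens, bigrams, sep="_"):
    """Join adjacent tokens that form a known bigram (hypothetical helper)."""
    bigram_set = set(bigrams)
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in bigram_set:
            merged.append(sep.join(pair))  # two unigrams become one token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

toy_bigrams = [("Hong", "Kong"), ("Wong", "Meg"), ("Meg", "Gie")]
print(merge_bigrams(["Credit", "Suisse", "Hong", "Kong", "China"], toy_bigrams))
print(merge_bigrams(["Wong", "Meg", "Gie"], toy_bigrams))
```

Each merge replaces two tokens with one, which is why the total token count drops while the vocabulary gains new bigram entries. In the second call the overlapping pair `("Meg", "Gie")` is not applied because `Meg` has already been consumed.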


### 4.7 Removing Context Independent Stopwords

The next step is to remove stopwords and in this assignment, we are told to remove context independent stopwords based on `stopwords_en.txt` file provided as well as context dependent stopwords based on document frequency of threshold at 98%. 

To remove the context independent stopwords, we first load in the stopwords list provided and for each resume in filenames, we remove tokens that can be found in the stopwords list. The result of the tokens after removing stopwords is then stored in `filtered_tokens1`. 

The list of context independent stopwords that were removed from the corpus is as follows:

In [20]:
# Reading in the stopwords file
with open("./stopwords_en.txt", 'r') as infile:
    stopwords=infile.read().split("\n")

filtered_tokens1={}
removed_tokens1=[]
for file in filenames:
    temp_list=[]
    for each in colloc_tokens[file]:
        # add to result if the token of the file is not found in the stopwords list
        if each not in stopwords:
            temp_list.append(each)
        # add to removed_tokens if token is a stopword found in the stopwords list
        else:
            removed_tokens1.append(each)
    # store the filtered tokens once all tokens of the resume have been checked
    filtered_tokens1[file]=temp_list
        
set(removed_tokens1)

{'a',
 'able',
 'about',
 'above',
 'according',
 'accordingly',
 'across',
 'after',
 'against',
 'all',
 'allow',
 'allows',
 'almost',
 'along',
 'also',
 'always',
 'am',
 'among',
 'amongst',
 'an',
 'and',
 'another',
 'any',
 'appropriate',
 'are',
 'around',
 'as',
 'aside',
 'associated',
 'at',
 'available',
 'b',
 'be',
 'became',
 'become',
 'been',
 'before',
 'behind',
 'being',
 'believe',
 'below',
 'besides',
 'best',
 'better',
 'between',
 'beyond',
 'both',
 'brief',
 'but',
 'by',
 'c',
 'came',
 'can',
 'cause',
 'certain',
 'changes',
 'clearly',
 'co',
 'com',
 'come',
 'consider',
 'containing',
 'contains',
 'corresponding',
 'could',
 'course',
 'currently',
 'd',
 'described',
 'despite',
 'did',
 'different',
 'do',
 'doing',
 'done',
 'down',
 'during',
 'e',
 'each',
 'edu',
 'eg',
 'eight',
 'enough',
 'entirely',
 'especially',
 'et',
 'etc',
 'even',
 'ever',
 'every',
 'ex',
 'example',
 'f',
 'few',
 'first',
 'five',
 'followed',
 'following',
 'fol

The total number of tokens after removing context independent stopwords is now 101661, compared to 134014 before removal, a reduction of 32353 tokens (about 24%).

In [21]:
# status of whole corpus after removing context independent stopwords
words = list(chain.from_iterable(filtered_tokens1.values()))
vocab = set(words)
lexical_diversity=len(words)/len(vocab)
print ("Vocabulary size: ",len(vocab))
print ("Total number of tokens: ", len(words))
print ("Lexical diversity: ", lexical_diversity)
print ("Total number of unique resumes:", len(filenames))

Vocabulary size:  16933
Total number of tokens:  101661
Lexical diversity:  6.003720545680033
Total number of unique resumes: 217
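
As a rough illustration, the context independent filter above can be sketched on toy data. The names `toy_tokens` and `toy_stopwords` are made up for this example and are not part of the notebook; a `set` is assumed for the stopword list so that each membership test is fast.

```python
# Toy stand-ins for colloc_tokens and the stopword list (hypothetical data)
toy_tokens = {
    "resume_1": ["I", "am", "a", "financial", "analyst"],
    "resume_2": ["the", "team", "and", "the", "clients"],
}
toy_stopwords = {"a", "am", "and", "the"}

kept = {}
removed = set()
for name, tokens in toy_tokens.items():
    # keep tokens that are not stopwords, preserving their order
    kept[name] = [t for t in tokens if t not in toy_stopwords]
    # record which stopwords were actually seen in this resume
    removed.update(t for t in tokens if t in toy_stopwords)

print(kept)
print(sorted(removed))
```

The comparison is case-sensitive, so a capitalized token such as "I" is only removed if that exact form is in the stopword list.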


### 4.8 Removing Context Dependent Stopwords

To remove the context dependent stopwords, we need to know the document frequency of each token. To calculate it, we first apply the `set` function to the tokens of each individual resume so that each token is counted only once per resume. We then pass these sets to `chain.from_iterable` to generate a single list of words, and finally use the `FreqDist` function to get the document frequency of each token.

After that, we are told to remove tokens which appear in at least 98% of the resumes. This ratio is the document frequency of a token divided by the total number of resumes. If the value is >=0.98, we remove the token from the corpus. A list of all such context dependent stopwords is stored in `cd_stopwords`.

Finally, the same method used to remove context independent stopwords is used here: for each resume, tokens not found in `cd_stopwords` are stored in `filtered_tokens2`, and tokens found in `cd_stopwords` are removed from the corpus and stored in `removed_tokens2`.

The list of context dependent stopwords that were removed from the corpus is as follows:

In [22]:
# Generating document frequency by first applying set to all tokens in each individual resume
words = list(chain.from_iterable([set(value) for value in filtered_tokens1.values()]))
fd_words = FreqDist(words)

cd_stopwords=[]
for each in fd_words.keys():
    if fd_words[each]/len(filenames)>=0.98:
        cd_stopwords.append(each)
        
filtered_tokens2={}
removed_tokens2=[]
for file in filenames:
    temp_list=[]
    for each in filtered_tokens1[file]:
        # keep the token if it is not a context dependent stopword
        if each not in cd_stopwords:
            temp_list.append(each)
        # otherwise record it as removed
        else:
            removed_tokens2.append(each)
    filtered_tokens2[file]=temp_list
        
set(removed_tokens2)

set()

We got an empty set, which means that no token in the corpus appears in at least 98% of the resumes. The top 20 tokens by document frequency, listed with `most_common`, confirm this: the most widespread token, 'financial', appears in 167 of the 217 resumes (about 77%), well below the threshold. Thus, the status of the corpus remains the same as after the removal of context independent stopwords.

In [23]:
fd_words.most_common(20)

[('financial', 167),
 ('University', 162),
 ('management', 160),
 ('English', 156),
 ('Management', 151),
 ('team', 146),
 ('clients', 146),
 ('Business', 145),
 ('Finance', 140),
 ('Bachelor', 138),
 ('2014', 131),
 ('Singapore', 130),
 ('including', 130),
 ('business', 128),
 ('gmail', 127),
 ('investment', 127),
 ('2011', 125),
 ('3', 124),
 ('client', 121),
 ('reports', 120)]
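
The document-frequency calculation described in this section can be sketched on a toy corpus. This is a minimal sketch under stated assumptions: `toy_corpus` is invented data, and `collections.Counter` stands in for `FreqDist` (which is a `Counter` subclass) so that the example runs without nltk.

```python
from collections import Counter
from itertools import chain

# Hypothetical toy corpus: three resumes, already tokenized
toy_corpus = {
    "r1": ["finance", "team", "team", "audit"],
    "r2": ["finance", "team", "risk"],
    "r3": ["finance", "audit", "audit"],
}

# set() collapses repeats inside one resume, so each resume counts a token once
doc_freq = Counter(chain.from_iterable(set(tokens) for tokens in toy_corpus.values()))

n_docs = len(toy_corpus)
threshold = 0.98
# tokens whose share of resumes meets the threshold
cd_stop = sorted(t for t, df in doc_freq.items() if df / n_docs >= threshold)

print(doc_freq["team"], doc_freq["audit"])
print(cd_stop)
```

Here "team" occurs three times in total but only in two resumes, so its document frequency is 2; only "finance", present in every resume, meets the 98% threshold.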

### 4.9 Stemming

The next task is to stem the tokens. We are told to use the Porter stemmer, available as `PorterStemmer` in the `nltk.stem` library. We again loop through each resume and stem every token with the `stemmer.stem` function. However, as the stemmer lowercases its output, we use an if-else statement to capitalize the stemmed result whenever the original token began with an uppercase letter. Note that `capitalize` restores only the first letter, so a fully uppercase token keeps just its initial capital. The results are then stored in `stemmed_tokens`.

In [24]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_tokens={}
for file in filenames:
    temp_list=[]
    for each in filtered_tokens2[file]:
        # if first character of a token is upper case, we capitalize the result after stemming
        if each[0].isupper():
            each=stemmer.stem(each).capitalize()
        # if not upper case, just store result after stemming
        else:
            each=stemmer.stem(each)
        temp_list.append(each)
    stemmed_tokens[file]=temp_list

Note that after stemming, the vocabulary size (i.e. number of unique tokens) has been reduced from 16933 to 12664. The total number of tokens is unchanged at 101661 because stemming rewrites tokens rather than removing them, so the lexical diversity (computed here as total tokens divided by vocabulary size) rises from 6.004 to 8.028.

In [25]:
# status of whole corpus after stemming
words = list(chain.from_iterable(stemmed_tokens.values()))
vocab = set(words)
lexical_diversity=len(words)/len(vocab)
print ("Vocabulary size: ",len(vocab))
print ("Total number of tokens: ", len(words))
print ("Lexical diversity: ", lexical_diversity)
print ("Total number of unique resumes:", len(filenames))

Vocabulary size:  12664
Total number of tokens:  101661
Lexical diversity:  8.02755843335439
Total number of unique resumes: 217
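
The case handling around the stemmer can be illustrated in isolation. In the sketch below, `toy_stem` is a hypothetical stand-in for `PorterStemmer().stem` (it only lowercases and drops a final "s") so that the example runs without nltk; `stem_keep_initial_case` is likewise an illustrative name, not a function from the notebook.

```python
def stem_keep_initial_case(token, stem):
    """Stem `token` with `stem`, re-capitalizing if the token began uppercase."""
    stemmed = stem(token)
    # str.capitalize() uppercases the first character and lowercases the rest
    return stemmed.capitalize() if token[0].isupper() else stemmed


def toy_stem(token):
    # Hypothetical stand-in for PorterStemmer().stem: lowercase, drop a final "s"
    lowered = token.lower()
    return lowered[:-1] if lowered.endswith("s") else lowered


results = {tok: stem_keep_initial_case(tok, toy_stem) for tok in ["Reports", "reports", "CFA"]}
print(results)
```

The acronym shows the limit of this approach: "CFA" comes back as "Cfa", because only the initial capital is restored.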


### 4.10 Removing Rare Tokens

We are told to remove rare tokens, which are tokens that appear in less than 2% of the total number of resumes (i.e. by document frequency). We use the same method as for the context dependent stopwords: we take the set of tokens of each resume and use `FreqDist` to calculate the document frequency of each token.

We then store tokens whose document frequency ratio is less than or equal to 0.02 in `rare_tokens`, remove those tokens from `stemmed_tokens`, and store the result in `final_tokens`. With 217 resumes, 2% corresponds to 4.34 resumes, so no token sits exactly on the threshold and the `<=` test selects the same tokens as a strict `<`: those appearing in at most 4 resumes.

In [26]:
# Generating document frequency by first applying set to all tokens in each individual resume
final_words = list(chain.from_iterable([set(value) for value in stemmed_tokens.values()]))
fd_final_words = FreqDist(final_words)

# a set keeps the membership test in the loop below fast
rare_tokens=set()
for each in fd_final_words.keys():
    if fd_final_words[each]/len(filenames)<=0.02:   
        rare_tokens.add(each)
        
final_tokens={}
for file in filenames:
    temp_list=[]
    for each in stemmed_tokens[file]:
        if each not in rare_tokens:
            temp_list.append(each)
    # store once per resume, so a resume with no tokens still gets an entry
    final_tokens[file]=temp_list

Note that after removal of rare tokens, the vocabulary size is significantly reduced from 12664 to 2405, while the total number of tokens decreased from 101661 to 84264. As a result, the lexical diversity rises from 8.028 to 35.037.

In [27]:
# status of whole corpus after removing rare tokens
words = list(chain.from_iterable(final_tokens.values()))
vocab = set(words)
lexical_diversity=len(words)/len(vocab)
print ("Vocabulary size: ",len(vocab))
print ("Total number of tokens: ", len(words))
print ("Lexical diversity: ", lexical_diversity)
print ("Total number of unique resumes:", len(filenames))

Vocabulary size:  2405
Total number of tokens:  84264
Lexical diversity:  35.03700623700624
Total number of unique resumes: 217
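
The 2% cut-off can be worked through numerically. The sketch below uses the 217 resumes reported above; the document frequencies in `toy_doc_freq` are invented for illustration.

```python
import math

n_docs = 217          # number of unique resumes reported above
threshold = 0.02      # 2% document-frequency cut-off

cutoff = threshold * n_docs           # 2% of 217 resumes
max_rare_df = math.floor(cutoff)      # largest document frequency treated as rare

# Hypothetical document frequencies for a few stemmed tokens
toy_doc_freq = {"audit": 40, "zumba": 1, "kpi": 4, "ledger": 5}
rare = {t for t, df in toy_doc_freq.items() if df / n_docs <= threshold}
rare_strict = {t for t, df in toy_doc_freq.items() if df / n_docs < threshold}

print(max_rare_df)
print(sorted(rare), rare == rare_strict)
```

Because 2% of 217 is not a whole number, the `<=` and `<` comparisons pick out the same tokens for this corpus size.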


### 4.11 Removing Tokens of Length < 3

The final removal step drops tokens with fewer than 3 characters. We loop through each resume and keep a token only if its length is greater than or equal to 3; the filtered list then replaces the resume's entry in `final_tokens`.

In [28]:
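for file in filenames:
    temp_list=[]
    for each in final_tokens[file]:
        # keep only tokens with at least 3 characters
        if len(each)>=3:
            temp_list.append(each)
    # replace the entry once per resume, after all its tokens have been checked
    final_tokens[file]=temp_list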

Comparing the status of the corpus after removal of tokens of length <3, we see that the number of tokens is further reduced from 84264 to 78784 and the vocabulary size is reduced from 2405 to 2261.

In [29]:
# status of corpus after removal of tokens of length < 3
final_words = list(chain.from_iterable(final_tokens.values()))
final_vocab = set(final_words)
final_lexical_diversity=len(final_words)/len(final_vocab)
print ("Vocabulary size: ",len(final_vocab))
print ("Total number of tokens: ", len(final_words))
print ("Lexical diversity: ", final_lexical_diversity)
print ("Total number of unique resumes:", len(filenames))

Vocabulary size:  2261
Total number of tokens:  78784
Lexical diversity:  34.844758956214065
Total number of unique resumes: 217


Comparing the initial and final corpus status, the vocabulary size has been reduced by about 87% from 17078 to 2261, while the total number of tokens has been reduced by about 42% from 136443 to 78784. The lexical diversity has increased from 7.989 to 34.845.

In [30]:
print("Initial corpus status are as follows: " + "\n")
print ("Vocabulary size: ",len(initial_vocab))
print ("Total number of tokens: ", len(initial_words))
print ("Lexical diversity: ", initial_lexical_diversity)
print ("Total number of unique resumes:", len(filenames))
print("\n")
print("Final corpus status are as follows: " + "\n")
print ("Vocabulary size: ",len(final_vocab))
print ("Total number of tokens: ", len(final_words))
print ("Lexical diversity: ", final_lexical_diversity)
print ("Total number of unique resumes:", len(filenames))

Initial corpus status are as follows: 

Vocabulary size:  17078
Total number of tokens:  136443
Lexical diversity:  7.989401569270407
Total number of unique resumes: 217


Final corpus status are as follows: 

Vocabulary size:  2261
Total number of tokens:  78784
Lexical diversity:  34.844758956214065
Total number of unique resumes: 217
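
The status cells in Section 4 repeat the same few lines, so they could be wrapped in a small helper. The functions `corpus_status` and `percent_reduction` below are hypothetical and not part of the notebook; this is a minimal sketch that also recomputes the percentage reductions from the figures printed above.

```python
from itertools import chain


def corpus_status(tokens_by_file):
    """Return (vocabulary size, token count, tokens per vocabulary entry)."""
    words = list(chain.from_iterable(tokens_by_file.values()))
    vocab = set(words)
    # guard against an empty corpus to avoid dividing by zero
    ratio = len(words) / len(vocab) if vocab else 0.0
    return len(vocab), len(words), ratio


def percent_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return 100 * (before - after) / before


toy = {"r1": ["a", "b", "a"], "r2": ["b", "c"]}   # hypothetical data
vocab_size, n_tokens, diversity = corpus_status(toy)
print(vocab_size, n_tokens, diversity)

# figures taken from the initial and final status outputs above
vocab_drop = percent_reduction(17078, 2261)
token_drop = percent_reduction(136443, 78784)
print(round(vocab_drop, 1), round(token_drop, 1))
```

A call such as `corpus_status(final_tokens)` would then replace each repeated status cell.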


## 5. Output of vocab.txt

We are required to store the tokens (bigrams and unigrams) in the following format, <b>token_string:integer_index</b>, and the tokens must be sorted alphabetically. We use the `FreqDist` function on all the tokens from all the resumes to collect the unique tokens, sort them, and write each token to the output file together with its position in the sorted order (starting from 0) as its index.

Note: Refer to `29442826_vocab.txt` for output

In [31]:
# Generate counts of each token; the keys give the unique tokens
vocab_count=FreqDist(final_words)

# Sort the tokens and write each one with its position in the sorted order as the index
with open("./29442826_vocab.txt", 'w') as save_file1:
    for index, each in enumerate(sorted(vocab_count.keys())):
        save_file1.write(each + ':' + str(index) + '\n')
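
The required `token_string:integer_index` layout can be sketched on toy data. The token list `toy_final_words` is invented, including the underscore-joined bigram, and the lines are built in memory rather than written to a file.

```python
# Hypothetical final tokens pooled over all resumes
toy_final_words = ["risk", "audit", "team", "audit", "risk_management"]

# Sorting the unique tokens fixes the alphabetical order
sorted_vocab = sorted(set(toy_final_words))

# enumerate supplies the index of each token in that order, starting from 0
vocab_lines = [f"{tok}:{idx}" for idx, tok in enumerate(sorted_vocab)]

print("\n".join(vocab_lines))
```

Note that Python's default string sort places uppercase letters before lowercase ones, so a capitalized token sorts ahead of all lowercase tokens.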

## 6. Output of countVec.txt

We are also required to store the sparse representations of the resumes in the following format, <b>file_name,token_index:count,token_index:count,...</b> with each line representing a resume. In order to do this, we first need to assign a token index to each unique token found in the corpus. We can make use of the sorted unique tokens in `vocab_count` from Section 5 and assign each token an index starting from 0. 

After that, we loop through all the resumes in `filenames` and first write the filename to the output. Then, we calculate the frequency distribution `fd_resume` for each resume using `FreqDist`. Once we have the frequency distribution, we write to the output file, for each token found in `fd_resume`, its token index from `vocab_dict` followed by the count of the token in `fd_resume`.

Note: Refer to `29442826_countVec.txt` for output

In [32]:
# Assign a token index starting from 0 to each token in the sorted vocab list in previous output 
vocab_dict={}
for i, each in enumerate(sorted(vocab_count.keys())):
    vocab_dict[each] = i

# Open file to write output into
with open("./29442826_countVec.txt", 'w') as save_file2:
    for file in filenames:
        # first write the name of the file; the token entries follow on the same line
        save_file2.write(str(file))
        # create fd_resume for each file to store the frequency distribution of each resume
        fd_resume=FreqDist(final_tokens[file])
        # for each token in fd_resume, we write the token index from vocab_dict and its count from fd_resume
        for each in sorted(fd_resume.keys()):
            save_file2.write("," + str(vocab_dict[each]) + ":" + str(fd_resume[each]))
        # end the line for this resume
        save_file2.write("\n")
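
One line of the sparse representation can be assembled as follows. This is a sketch on invented data: `toy_vocab_index`, `toy_resume_name` and `toy_resume_tokens` are hypothetical, and `collections.Counter` stands in for `FreqDist` so that the example runs without nltk.

```python
from collections import Counter

# Hypothetical vocabulary index and one resume's final tokens
toy_vocab_index = {"audit": 0, "risk": 1, "team": 2}
toy_resume_name = "resume_(1)"
toy_resume_tokens = ["team", "audit", "team", "risk", "team"]

# count how often each token occurs in this resume
counts = Counter(toy_resume_tokens)

# one "index:count" pair per distinct token, ordered by token index
ordered = sorted(counts.items(), key=lambda kv: toy_vocab_index[kv[0]])
pairs = [f"{toy_vocab_index[tok]}:{n}" for tok, n in ordered]

# file name first, then the pairs, all on a single comma-separated line
line = ",".join([toy_resume_name] + pairs)
print(line)
```

Only tokens that occur in the resume appear in the line, which is what makes the representation sparse.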

## 7. Summary

- Section 3 shows how regular expressions can be used to extract the required data from a text file
- Section 4 shows the steps of text pre-processing, which include sentence segmentation, case normalization, tokenisation, collocations, removal of stopwords, stemming, and removal of rare tokens and tokens of length less than 3, using different packages in the nltk library
- Sections 5 and 6 show how textual data can be wrangled into the required output formats (vocabulary and sparse count representations) for further data analysis

################### End of report ###################