#### Author: Trung Kien Nguyen

Created_Date : 14/08/2018

Modified_Date: 02/09/2018


Environment: Python 3.6.0 and Anaconda 5.2.0 (64-bit)

Version: 2.0

Libraries used:
* nltk 3.2.3
* pdfminer 20160614
* pandas (for data frame, included in Anaconda Python 3.6) 
* re (for regular expression, included in Anaconda Python 3.6) 
* numpy (for numpy array, included in Anconda 


## 1. Introduction
This project comprises 250 CVs with the aim is to build sparse representations for the resumes, which included word tokenization, vocabulary generation and the generation of sparse representations. There a file named of each CV following 'resume_(*number*).txt' with the given number from given file "resume_dataset.txt"

1. File output named resume_vocab.txt which contains bigrams and unigrams tokens in the following format: token_string:integer_index. Words in the vocabulary must be sorted in order

2. File output named resume_countVec.txt which contains all the "selected" resumes in the data-set. Each line in the text file contains the sparse representations of one of the resumes in the data-set in the following format: 

    file_name,token_index:count,token_index:count


More details for each task will be given in the following sections.

## 2. Import Libraries

In [126]:
import io
from io import StringIO
import pandas as pd
import numpy as np
import nltk
import re
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.collocations import *
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer

# For loading pdf file if needed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


## 3. Pre-processing Data

*cv_list* is a list contain all the number which relevant to the CVs number

In [127]:
cv_list = [i for i in range(1000)]

In [129]:
# a function to get the path name for each particular cv 
def get_path_resume(number):
    """
    This function to get path name for a cv
    :param 
        number: the CV's number
    :result: path name of cv
    """
    return "./resumeTxt/resume_("+str(number)+ ").txt"


def read_cv_txt(path):
    """
    This function to get content of specific resume from given path
    :param 
        path: the path to the specific resume
    :result: the list of each line from the content of resume
    """
    with open(path,'rb') as f:
        data = f.readlines()
    return data



This section to deal with the first requirement of task 2, which construct vocabulary from all the given CVs.

All the given CVs are in "*.txt" format. In order to get information from txt file, we will use a function read_cv_txt(), which is defined above.

If you want to read ".pdf" format, we will use the library, called *pdfminer*. Reading a pdf file is not necessary in this task, however, I will leave a function, called "convert_pdf_to_txt" for the purpose of references.

Note: there are two version of *pdfminer*. Please select the right package *pdfminer* for your version of Python

For python 2:
```shell
        pip install pdfminer
```
For python 3:
```shell
        pip install pdfminer.six
```


In the following script, we defined a function called *convert_pdf_to_txt*. From the name of function tell us pretty much everything what that function doing. It will convert a pdf format file to string


In [130]:
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text



Now, let's read a specific resume to see the contents.

In [131]:
sample = read_cv_txt(get_path_resume(1))
sample

[b'Curriculum Vitae \r\n',
 b'\r\n',
 b'V. Gowribalan  MCSI \r\n',
 b'  FCMA, CPA (Aust.), CGMA, BSc  (Hons.)  \r\n',
 b'\r\n',
 b' \r\n',
 b'Investment Manager with an  established  investment  track-record  across  the GCC  region  spanning  \r\n',
 b'listed equities, sukuks and debt securities.  Honed expertise of 14 years in portfolio management and \r\n',
 b'investment analysis. Experience  includes  establishing and  leading  the Asset Management Division \r\n',
 b'(AMD) of Ahli Bank SAOG; launching of mutual fund, structuring of wealth management products, \r\n',
 b'strategizing acquisitions, handling initial public offerings (IPOs) and raising investment funds across \r\n',
 b'asset-classes and risk-thresholds.  \r\n',
 b' \r\n',
 b'Credentials  include a First Class Honours Degree  in Applied Accounting   (BSc. Hons.), Member of \r\n',
 b'the Chartered Institute of Securities and Investment \xe2\x80\x93 U.K. (MCSI), Fellow of the Chartered Institute \r\n',
 b'of  Management  A

----------


Now, when we have each particular line for entire text format of CV, we are going to normalized each line by following rule: Each word must be normalized to lowercase except the capital words appeared in the middle of a sentence/line.

For example:

"Manager" appears in the middle  will be retain.

"MANAGER"  appears in the middle will be retain.

but if "Manager" or "MANAGER" appears at the beginning of the sentence will be transform to "manager"

The function *data_byte_to_str* use to fix data-byte issue.

The function *normalized_line* use to normalized each line of content of resume by following the rules above.

The function *normalized_lower_case* use to normalized all the contents of resume by calling the function *normalized_line* and then concatenate all the content together.

In [132]:
# Fix the issue of read data-byte
def data_byte_to_str(cv_lines_arr):
    return [str(line.strip(),'utf-8') for line in cv_lines_arr if line.strip() != b'']


# Normalize each line of content of resume
def normalized_line(line,is_dot_before=False):
    """
    This function use to normalize each line of content of resume
    :param
        line: each single line of content of resume
        is_dot_before: a variable to check whether the previous line end with dot or not. If previous line end with dot,
                       we will normalized the first word of sentence, otherwise, retain it.
    :result
        return a string of line
    """
    
    # if the previous line end with dot
    if is_dot_before == True:
        # seperate the line by dot function if the line invole more than two sentence
        lines = line.split(". ")
        for i in range(0,len(lines)):
            if len(lines[i])>2:
                txt = lines[i].split(" ")
                for j in range(0,len(txt)):
                    if txt[j].isalpha():
                        txt[j] = txt[j].lower()
                        break
                lines[i] = " ".join(txt)
                
    # if the previous line is not end with dot            
    else:
        lines = line.split(". ")
        for i in range(1,len(lines)):
            if len(lines[i])>2:
                txt = lines[i].split(" ")
                for j in range(0,len(txt)):
                    if txt[j].isalpha():
                        txt[j] = txt[j].lower()
                        break
                lines[i] = " ".join(txt)
    return ". ".join(lines)


# Normalize content of resume
def normalized_lower_case(cv_lines_arr):
    """
    This function use to normalize the content of resume
    :param 
        cv_lines: the list of line of the resume content
    :result
        return a string of resume content
    """
    cv_lines_arr = data_byte_to_str(cv_lines_arr)
    cv_lines_arr[0] = normalized_line(cv_lines_arr[0],is_dot_before=True)  
    
    for i in range(1,len(cv_lines_arr)):
        if cv_lines_arr[i-1].endswith(".") or cv_lines_arr[i-1].endswith(":"):
            cv_lines_arr[i] = normalized_line(cv_lines_arr[i],is_dot_before=True)
        else:
            cv_lines_arr[i] = normalized_line(cv_lines_arr[i])
    return " ".join(cv_lines_arr)
    
sample_normalize = normalized_lower_case(sample)
sample_normalize

'curriculum Vitae V. gowribalan  MCSI FCMA, CPA (Aust.), CGMA, BSc  (Hons.) Investment Manager with an  established  investment  track-record  across  the GCC  region  spanning listed equities, sukuks and debt securities.  honed expertise of 14 years in portfolio management and investment analysis. experience  includes  establishing and  leading  the Asset Management Division (AMD) of Ahli Bank SAOG; launching of mutual fund, structuring of wealth management products, strategizing acquisitions, handling initial public offerings (IPOs) and raising investment funds across asset-classes and risk-thresholds. credentials  include a First Class Honours Degree  in Applied Accounting   (BSc. Hons.), member of the Chartered Institute of Securities and Investment – U.K. (MCSI), fellow of the Chartered Institute of  Management  Accountants  –  U.K.  (FCMA),  certified  Practising  Accountant  (CPA  Aust.)  and Chartered Global Management Accountant – U.S.A. (CGMA). average returns generated amoun

------------
In the next part, we are going to extract tokens from the text by using regular expression:
```python
    from nltk.tokenize import RegexpTokenizer
    pattern = "\w+(?:[-']\w+)?"
    tokenizer = RegexpTokenizer(pattern)
```

Before doing that, we need to transfrom the list of line from to string. Again, we put the script of extracting tokens to a function, called *get_tokens*

In [133]:
def get_tokens(resume_content):
    """
    This function use to get token from normalized text 
    :param 
        cv_lines: a list of line that have normalized
    :result
        the string of all tokens
    """
    pattern = r"\w+(?:[-']\w+)?"
    tokenizer = RegexpTokenizer(pattern)
    cv_tokens = tokenizer.tokenize(resume_content)
    cv_text = " ".join(cv_tokens)
    return cv_text

cv_text = get_tokens(sample_normalize)
cv_text

'curriculum Vitae V gowribalan MCSI FCMA CPA Aust CGMA BSc Hons Investment Manager with an established investment track-record across the GCC region spanning listed equities sukuks and debt securities honed expertise of 14 years in portfolio management and investment analysis experience includes establishing and leading the Asset Management Division AMD of Ahli Bank SAOG launching of mutual fund structuring of wealth management products strategizing acquisitions handling initial public offerings IPOs and raising investment funds across asset-classes and risk-thresholds credentials include a First Class Honours Degree in Applied Accounting BSc Hons member of the Chartered Institute of Securities and Investment U K MCSI fellow of the Chartered Institute of Management Accountants U K FCMA certified Practising Accountant CPA Aust and Chartered Global Management Accountant U S A CGMA average returns generated amounts to 12 5 annual average over the past 9 years across the GCC listed equitie

-------

Next, we are going to remove all the token has length less than 3. 

Following is regular expression:

``` python
    import re
    line = re.sub(r'\b\w{1,3}\b','', line)
```

In [134]:
def removal_token_length(cv_text):
    """
    This function use to remove all the token has length less than 3
    :param
        cv_text: a string of cv
    :result
        a string of token
    """
    cv_text = re.sub(r'\b\[a-zA-Z0-9-_]\b','',cv_text)
    return cv_text

cv_text = removal_token_length(cv_text)
cv_text

'curriculum Vitae V gowribalan MCSI FCMA CPA Aust CGMA BSc Hons Investment Manager with an established investment track-record across the GCC region spanning listed equities sukuks and debt securities honed expertise of 14 years in portfolio management and investment analysis experience includes establishing and leading the Asset Management Division AMD of Ahli Bank SAOG launching of mutual fund structuring of wealth management products strategizing acquisitions handling initial public offerings IPOs and raising investment funds across asset-classes and risk-thresholds credentials include a First Class Honours Degree in Applied Accounting BSc Hons member of the Chartered Institute of Securities and Investment U K MCSI fellow of the Chartered Institute of Management Accountants U K FCMA certified Practising Accountant CPA Aust and Chartered Global Management Accountant U S A CGMA average returns generated amounts to 12 5 annual average over the past 9 years across the GCC listed equitie


All the token which has length less than 3 have been removed, however we see that the space between the word are not normal. 

We will remove more than one white-space between words, following is regular expression
```python
    import re
    line = re.sub(r'\ {2,}'," ",line)
```

We update removing multiple space into the *removal_token_length* function

In [135]:
def removal_token_length(cv_text):
    """
    This function use to remove all the token has length less than 3
    :param
        cv_text: a string of cv
    :result
        a string of token
    """
    cv_text = re.sub(r'\b[a-zA-Z0-9-_]{1,3}\b','',cv_text)
    
    #remove multiple space between words
    cv_text = re.sub(r'\ {2,}'," ",cv_text)
    return cv_text

cv_text = removal_token_length(cv_text)
cv_text

'curriculum Vitae gowribalan MCSI FCMA Aust CGMA Hons Investment Manager with established investment trackrecord across region spanning listed equities sukuks debt securities honed expertise years portfolio management investment analysis experience includes establishing leading Asset Management Division Ahli Bank SAOG launching mutual fund structuring wealth management products strategizing acquisitions handling initial public offerings IPOs raising investment funds across assetclasses riskthresholds credentials include First Class Honours Degree Applied Accounting Hons member Chartered Institute Securities Investment MCSI fellow Chartered Institute Management Accountants FCMA certified Practising Accountant Aust Chartered Global Management Accountant CGMA average returns generated amounts annual average over past years across listed equities exceeding within fixed income asset class professional EXPERIENCE YEARS 3⅓ Ahli Bank SAOG ahlibank Head Asset Management Reported founding employ

----


The context-dependent and context-independent stop words must be removed from vocab

The following script is a function that removing stopword. The list of stopword is given in the file, named *"stopwords_en.txt"*

In [136]:
def removal_stop_word(cv_text):
    """
    This function use to remove all the stopwords
    :param 
        cv_text: a string of cv
    :result
        a string of token that has been removed all the stopwords
    """
    
    stopwords = None
    with open('stopwords_en.txt','r') as f:
        stopwords = f.readlines()
    stopwords = [word.strip() for word in stopwords]
    
    tokens = cv_text.split(" ")
    cv_removal_stopword = [word for word in tokens if word not in stopwords ]
    
    return " ".join(cv_removal_stopword)
cv_text = removal_stop_word(cv_text)
cv_text

'curriculum Vitae gowribalan MCSI FCMA Aust CGMA Hons Investment Manager established investment trackrecord region spanning listed equities sukuks debt securities honed expertise years portfolio management investment analysis experience includes establishing leading Asset Management Division Ahli Bank SAOG launching mutual fund structuring wealth management products strategizing acquisitions handling initial public offerings IPOs raising investment funds assetclasses riskthresholds credentials include First Class Honours Degree Applied Accounting Hons member Chartered Institute Securities Investment MCSI fellow Chartered Institute Management Accountants FCMA certified Practising Accountant Aust Chartered Global Management Accountant CGMA average returns generated amounts annual average past years listed equities exceeding fixed income asset class professional EXPERIENCE YEARS 3⅓ Ahli Bank SAOG ahlibank Head Asset Management Reported founding employee division tasked establishing busine

Rare token (with the threshold set to %2) and context-dependent (with the threshold set to %98) must be removed.

In order to do that, we are going to construct a dictionary of from given cv with the key are unique tokens and the value is the number of time it happens in all documents, but we count 1 for each document or in other word, we use document frequency

E.g:

the word "skill" may occur several times in resume_1 but we count only  one time.

The following is sample of dictionary from unique tokens.

{

` 'Principled': 1,
 'calendars': 1,
 'walkthroughs': 3,
 'VMware': 2,
 'submitting': 2,
 'review': 102,
 'solution': 5,
 'xcel': 1,
 'BLING': 1,
 .......
`

}

    

Following are two funtions to construct dictionary and remove rare and most frequency

In [137]:
def construct_dictionary(cvs):
    """
    This function use to construct dictionary of tokens
    :param
          cvs: a list of resume content
    :result
          return a dictionary
    """
    unique_token = list(set(" ".join(cvs).split(" ")))
    dic = dict([(key,0) for key in unique_token])
    for cv in cvs:
        tokens = cv.split(" ")
        tokens = list(set(tokens))
        for token in tokens:
            if token in dic:
                if dic[token] == 0:
                    dic[token] = 1
                else:
                    dic[token] = dic[token] + 1
    dic.pop('')
    return dic
                
    
def removal_rare_and_most_frequency(cvs):
    """
    This function use to remove rare the most frequency token (with threshold set to 98%), 
    :param
        cv_text: a string of cv
    :result
        a string of token that has been removed rare token
    """
    
    
    # construct dictionary of stopword with threshold 2% and 98%
    contents = [cv[1] for cv in cvs]
    dict_cv = construct_dictionary(contents)
    number_of_cv = len(cvs)
    for key in dict_cv:
        dict_cv[key] = dict_cv[key] / number_of_cv
        
    #  create a new stopwords which include the rare tokens and most frequency tokens
    for key in list(dict_cv):
        if dict_cv[key] < 0.98 and dict_cv[key] > 0.02:
            dict_cv.pop(key)
    
    # remove the new stopwords
    new_cvs = []
    for cv in cvs:
        _cv = [c for c in cv[1].split(" ") if c not in dict_cv and c.strip() != '']
        _cv = " ".join(_cv)
        new_cvs.append((cv[0],_cv))
    return new_cvs


Now we are going to stemm all the token by using PorterStemmer.


In [138]:
def porter_stemming(cv_text):
    porter = PorterStemmer()
    cv_stem = [porter.stem(word) for word in cv_text.split(" ")]
    return " ".join(cv_stem)


We have been done almost everything regarding of the pre-processing data. 

By now, we can incorporate all the function above to get requirements done.

First of all, we will read txt format cv file, then, we are going to process data by normalizing, extracting tokens, removing stop words,removing token has length less than 3 and stemming by using Porter stemmer.

After pre-processing for each cv, we append data to a variable name cv_text. Finally, we will remove most frequency tokens (with threshold set to 98%), remove rare token (with threshold set to 2%) and finally extract meaningful biagram


In [139]:
# list of cv tokens without stemming
cvs_collocations = []
# list of cv tokens with stemming
cvs_porter = []

for resume_number in cv_list:
    path_resume = get_path_resume(resume_number)
    resume_content = read_cv_txt(path_resume)
    resume_content = normalized_lower_case(resume_content)
    resume_content = get_tokens(resume_content)
    resume_content = removal_token_length(resume_content)
    resume_content = removal_stop_word(resume_content)
    if len(resume_content) > 0:
        cvs_collocations.append((resume_number,resume_content))
        porter_content = porter_stemming(resume_content)
        cvs_porter.append((resume_number,porter_content))

cvs_collocations = removal_rare_and_most_frequency(cvs_collocations)
cvs_porter = removal_rare_and_most_frequency(cvs_porter)


At this time, we can go to extract biagram by using BiagramAssocMeasures from *nltk* library.

In this following script above, we extract biagram with PMI measure

In [140]:
resume_contents = [c[1] for c in cvs_collocations if c[1].strip() != ""]
bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(" ".join(resume_contents).split(" "))
bigram_finder.apply_freq_filter(10)
bigram_finder.apply_word_filter(lambda w: len(w) < 3)
bigram_finder.apply_word_filter(lambda w: w[0].isalpha()!=True and w[1].isalpha() !=True)
top_200_bigrams = bigram_finder.nbest(bigram_measures.pmi, 200) # Top-100 bigrams
top_200_bigrams

[('Goldman', 'Sachs'),
 ('HONG', 'KONG'),
 ('Curriculum', 'Vitae'),
 ('Morgan', 'Stanley'),
 ('problem', 'solving'),
 ('Monetary', 'Authority'),
 ('journal', 'entries'),
 ('Dean', 'List'),
 ('Thomson', 'Reuters'),
 ('Cayman', 'Islands'),
 ('Date', 'Birth'),
 ('Ernst', 'Young'),
 ('sovereign', 'wealth'),
 ('ADDITIONAL', 'INFORMATION'),
 ('balance', 'sheet'),
 ('Available', 'request'),
 ('Fixed', 'Income'),
 ('Vice', 'President'),
 ('CURRICULAR', 'ACTIVITIES'),
 ('tight', 'deadlines'),
 ('United', 'Kingdom'),
 ('United', 'States'),
 ('phone', 'calls'),
 ('unit', 'trusts'),
 ('Case', 'Competition'),
 ('Expected', 'Salary'),
 ('Nanyang', 'Technological'),
 ('pitch', 'books'),
 ('Credit', 'Suisse'),
 ('Real', 'Estate'),
 ('wide', 'range'),
 ('fixed', 'income'),
 ('prime', 'brokers'),
 ('Temasek', 'Polytechnic'),
 ('real', 'estate'),
 ('Reporting', 'Standards'),
 ('written', 'spoken'),
 ('spoken', 'written'),
 ('Certified', 'Public'),
 ('account', 'opening'),
 ('full', 'spectrum'),
 ('Class'

We also can extract biagram by using the *raw frequency* measure and get the number of highest score

In [141]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_finder = BigramCollocationFinder.from_words(" ".join(resume_contents).split(" "))
bigram_finder.apply_freq_filter(20) 
bigram_finder.score_ngrams(bigram_measures.raw_freq) 

[(('Hong', 'Kong'), 0.0051536830948595656),
 (('financial', 'statements'), 0.001483836777954425),
 (('real', 'estate'), 0.0010996290408055115),
 (('private', 'equity'), 0.0010201377848436672),
 (('Real', 'Estate'), 0.0008611552729199788),
 (('Private', 'Equity'), 0.0008479067302596715),
 (('Microsoft', 'Office'), 0.000834658187599364),
 (('Financial', 'Services'), 0.0008081611022787493),
 (('Bachelor', 'Business'), 0.000794912559618442),
 (('Email', 'gmail'), 0.000794912559618442),
 (('English', 'Mandarin'), 0.000794912559618442),
 (('Business', 'Administration'), 0.0007419183889772125),
 (('hedge', 'funds'), 0.0007154213036565977),
 (('Asset', 'Management'), 0.0007021727609962904),
 (('WORK', 'EXPERIENCE'), 0.000649178590355061),
 (('Investment', 'Banking'), 0.0006359300476947536),
 (('financial', 'models'), 0.0006226815050344462),
 (('asset', 'management'), 0.0006094329623741389),
 (('Fund', 'Services'), 0.0005696873343932167),
 (('Senior', 'Associate'), 0.0005696873343932167),
 (('t

We take the biagrams by using PMI measure. Now, let put the biagram and unigram to the file

In [142]:
bigram_vocab = [bi[0]+" "+bi[1] for bi in top_200_bigrams]
bigram_vocab = sorted(bigram_vocab)
bigram_vocab

['ADDITIONAL INFORMATION',
 'Accountants ACCA',
 'Accounts Assistant',
 'Asia Pacific',
 'Asset Management',
 'Asset Value',
 'Assistant Manager',
 'Assistant Vice',
 'Association Chartered',
 'Audit Assurance',
 'Available request',
 'Bachelor Accountancy',
 'Bachelor Arts',
 'Bachelor Commerce',
 'Bachelor Science',
 'Business Administration',
 'CURRICULAR ACTIVITIES',
 'Cantonese Fluent',
 'Cantonese Native',
 'Capital Markets',
 'Case Competition',
 'Cayman Islands',
 'Certified Accountants',
 'Certified Public',
 'Chartered Accountant',
 'Chartered Accountants',
 'Chartered Certified',
 'Citco Fund',
 'City University',
 'Class Honors',
 'Class Honours',
 'Company Secretarial',
 'Computer Skills',
 'Contact Email',
 'Credit Suisse',
 'Curriculum Vitae',
 'Date Birth',
 'Dean List',
 'Deutsche Bank',
 'Email gmail',
 'Email hotmail',
 'Email yahoo',
 'English Mandarin',
 'Equity Research',
 'Ernst Young',
 'Excel PowerPoint',
 'Excel Word',
 'Exchange Program',
 'Exchange Programme

Now, we going to get unigrams, however, instead of using resume contents from cvs_collocations, we are going to get unigrams by using resume contents from cvs_porter

In [143]:
count_vectorizer = CountVectorizer()
resume_content_porter = " ".join([cv[1] for cv in cvs_porter])
contents_resumes = count_vectorizer.fit_transform([_cvs])
# print(resume_contents.toarray())
unigrams = count_vectorizer.get_feature_names()
for i in range(0,len(unigrams)):
    print(features[i],":",i)

100m : 0
1986 : 1
1988 : 2
1989 : 3
1990 : 4
1991 : 5
1993 : 6
1994 : 7
1995 : 8
1996 : 9
1997 : 10
1998 : 11
1999 : 12
2000 : 13
2001 : 14
2002 : 15
2003 : 16
2004 : 17
2005 : 18
2006 : 19
2007 : 20
2008 : 21
2009 : 22
2010 : 23
2011 : 24
2012 : 25
2013 : 26
2014 : 27
2015 : 28
2016 : 29
2017 : 30
2018 : 31
300m : 32
abil : 33
abl : 34
absolut : 35
academ : 36
academi : 37
acca : 38
accept : 39
access : 40
accomplish : 41
accord : 42
account : 43
accrual : 44
accur : 45
accuraci : 46
achiev : 47
acquir : 48
acquisit : 49
act : 50
action : 51
activ : 52
actual : 53
adapt : 54
addit : 55
address : 56
adept : 57
adequ : 58
adequaci : 59
adher : 60
adjust : 61
admin : 62
administ : 63
administr : 64
admiss : 65
adob : 66
adopt : 67
advanc : 68
advantag : 69
advent : 70
advertis : 71
advic : 72
advis : 73
advisor : 74
advisori : 75
aerospac : 76
affair : 77
affect : 78
affili : 79
africa : 80
age : 81
agenc : 82
agenda : 83
agent : 84
aggreg : 85
agre : 86
agreement : 87
agricultur : 88
ai

payabl : 1084
payment : 1085
payrol : 1086
peer : 1087
pension : 1088
peopl : 1089
perform : 1090
period : 1091
perman : 1092
permit : 1093
person : 1094
personnel : 1095
perspect : 1096
petroleum : 1097
pharmaceut : 1098
phase : 1099
philippin : 1100
phone : 1101
photographi : 1102
photoshop : 1103
physic : 1104
piano : 1105
pick : 1106
pioneer : 1107
pipelin : 1108
pitch : 1109
pivot : 1110
place : 1111
placement : 1112
plan : 1113
plant : 1114
platform : 1115
play : 1116
player : 1117
point : 1118
polici : 1119
polit : 1120
polytechn : 1121
pool : 1122
portal : 1123
portfolio : 1124
posit : 1125
possess : 1126
post : 1127
potenti : 1128
power : 1129
powerpoint : 1130
practic : 1131
practis : 1132
preced : 1133
prefer : 1134
preliminari : 1135
premium : 1136
prepar : 1137
present : 1138
presid : 1139
press : 1140
pressur : 1141
prestigi : 1142
previou : 1143
previous : 1144
price : 1145
pricewaterhousecoop : 1146
primari : 1147
primarili : 1148
prime : 1149
princip : 1150
principl : 

At this time, we are going to write biagrams and unigrams tokens following format, token_string:integer_index

In [144]:
with open("resume_vocab.txt",'w') as f:
    for i in range(0,len(bigram_vocab)):
        f.write(bigram_vocab[i]+":"+str(i))
        f.write("\n")
    for j in range(0,len(unigrams)):
        f.write(unigrams[j]+":"+str(j))
        f.write("\n")

Writing the term frequency of each particular resume from given resume file.

In [145]:
array_dict_resume = []
for cv in cvs_porter:
    dic = {}
    tokens = cv[1].split(" ")
    for token in tokens:
        if token in dic:
            dic[token] = dic[token] + 1
        else:
            dic[token] = 1
    array_dict_resume.append((cv[0],dic))
    

In [147]:
file_vocab_vec = open("resume_countVec.txt",'w')
for d in array_dict_resume:
    file = "resume_("+d[0]+"), "
    token_dic = d[1]
    for key in token_dic:
        file += key+":" + str(token_dic[key]) +", "
    file +="\n"
    file_vocab_vec.write(file)
file_vocab_vec.close()