## 1. Introduction
This analysis extracts data from a PDF file containing information on 200 units crawled from the Monash University website. The input file is read and the document is split into three sections for each unit: unit code, synopsis and outcomes. The extracted data is then transformed into a vector space model.

All text is extracted from the PDF using the `tika` library.

Text pre-processing was performed with the objective of producing a lexical vocabulary for the 200 units and the associated sparse count vector for each unit. The pre-processing included tokenisation, stemming and removal of stopwords. Additionally, the most frequent and least frequent words were removed, and meaningful bigrams were identified.



## 2. Import library
The tika library can be installed by running the following command:

<font color = 'red'>conda install -c conda-forge tika </font>


In [1]:
from tika import parser
import re

## 3. Parsing pdf file

In [2]:
# read the pdf file
raw = parser.from_file('29481929.pdf')
# store the extracted string into a variable
text = raw['content']

Let's briefly explore the file.

In [3]:
text

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n(anonymous)\n\n\nTitle Synopsis Outcomes\n\nATS3070 This unit involves students in teaching and learning\nactivities equally developing language skills and\ncultural competence. It extends skills developed in in\nthe areas of exposition and argument, with a focus on\nspecific expository techniques: document synthesis\nand oral presentation of a sustained argument\ninvolving critical awareness of issues in contemporary\nFrance. Students develop advanced language skills\nand competence in the theory, research methodology\nand practices, and discourses involved in\nsophisticated critical enquiry, understanding and\nanalysis in an area of French studies, working under\nguidance to define and carry out a project.\n\n[\'advanced analytical, expository and argumentative\nskills in the context of writing a synthesis of several\ndocuments and making a presentation on a given\ngeneral topic;\', \

The PDF is well structured and the following patterns can be identified:
* Every unit code is preceded by <font color = 'red'>\n\n</font>
* All outcomes are inside a pair of square brackets
* All the characters before \n\nATS3070 are useless
* Every synopsis ends with <font color = 'red'>\n\n</font>
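Before extracting anything, it can help to confirm that these patterns pair up, i.e. that the text holds as many unit codes as bracketed outcome blocks. The following is a minimal sketch run on a made-up two-unit string rather than the real file; `count_records` and `sample` are hypothetical names introduced only for illustration.

```python
import re

def count_records(text):
    """Count unit codes and bracketed outcome blocks in the extracted text."""
    codes = re.findall(r'\n\n[A-Z]{3,4}\d{4}', text)
    outcome_blocks = re.findall(r'\[[\s\S]*?\]', text)
    return len(codes), len(outcome_blocks)

# toy input that mimics the layout described above (not the real data)
sample = (
    "\n\nTitle Synopsis Outcomes"
    "\n\nABC1234 First synopsis.\n\n['outcome one;', 'outcome two.']"
    "\n\nXYZ5678 Second synopsis.\n\n['outcome three.']"
)

n_codes, n_blocks = count_records(sample)
print(n_codes, n_blocks)
```

If the two counts differ on the real text, one of the patterns is probably matching something unintended, for example a unit code mentioned at the start of a paragraph.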

## 4. Data extraction

### 4.1 Unit code

In [7]:
# extract unit code using regular expression
unit = re.findall(r'\n\n[A-Z]{3,4}\d{4}', text)
unit

['\n\nATS3070',
 '\n\nATS2185',
 '\n\nAZA2461',
 '\n\nTAD2214',
 '\n\nNUR5704',
 '\n\nIAR4501',
 '\n\nSWM5109',
 '\n\nIDE2114',
 '\n\nMKC3110',
 '\n\nMKF5912',
 '\n\nPAR5480',
 '\n\nACC2200',
 '\n\nFIT5159',
 '\n\nVCM4602',
 '\n\nCDS1511',
 '\n\nIAR4500',
 '\n\nATS3789',
 '\n\nACX5951',
 '\n\nMPH5277',
 '\n\nATS2283',
 '\n\nMGF5600',
 '\n\nATS3882',
 '\n\nBFW3652',
 '\n\nMAT2731',
 '\n\nPSY7141',
 '\n\nSCU2022',
 '\n\nMGW2991',
 '\n\nATS3933',
 '\n\nAHT4112',
 '\n\nBTW3201',
 '\n\nNUR5925',
 '\n\nMGX4400',
 '\n\nMBA5722',
 '\n\nCDS4001',
 '\n\nRAD5112',
 '\n\nDPH6005',
 '\n\nATS2837',
 '\n\nATS1279',
 '\n\nBES4020',
 '\n\nOCC2020',
 '\n\nBFC2140',
 '\n\nMGM5698',
 '\n\nAMU3806',
 '\n\nATS3819',
 '\n\nATS3405',
 '\n\nIAR3118',
 '\n\nATS3972',
 '\n\nMTE6881',
 '\n\nTDN4101',
 '\n\nMGS5900',
 '\n\nEAE4000',
 '\n\nATS3462',
 '\n\nTDN3002',
 '\n\nATS3022',
 '\n\nAZA1001',
 '\n\nMGX5901',
 '\n\nSCI4502',
 '\n\nATS4345',
 '\n\nMID2110',
 '\n\nAPG5135',
 '\n\nAMU2315',
 '\n\nBFF2401',
 '\n\nAT

In [8]:
# remove the 2 leading newline characters from each unit code
unit = [code[2:] for code in unit]
unit

['ATS3070',
 'ATS2185',
 'AZA2461',
 'TAD2214',
 'NUR5704',
 'IAR4501',
 'SWM5109',
 'IDE2114',
 'MKC3110',
 'MKF5912',
 'PAR5480',
 'ACC2200',
 'FIT5159',
 'VCM4602',
 'CDS1511',
 'IAR4500',
 'ATS3789',
 'ACX5951',
 'MPH5277',
 'ATS2283',
 'MGF5600',
 'ATS3882',
 'BFW3652',
 'MAT2731',
 'PSY7141',
 'SCU2022',
 'MGW2991',
 'ATS3933',
 'AHT4112',
 'BTW3201',
 'NUR5925',
 'MGX4400',
 'MBA5722',
 'CDS4001',
 'RAD5112',
 'DPH6005',
 'ATS2837',
 'ATS1279',
 'BES4020',
 'OCC2020',
 'BFC2140',
 'MGM5698',
 'AMU3806',
 'ATS3819',
 'ATS3405',
 'IAR3118',
 'ATS3972',
 'MTE6881',
 'TDN4101',
 'MGS5900',
 'EAE4000',
 'ATS3462',
 'TDN3002',
 'ATS3022',
 'AZA1001',
 'MGX5901',
 'SCI4502',
 'ATS4345',
 'MID2110',
 'APG5135',
 'AMU2315',
 'BFF2401',
 'ATS2961',
 'VCO3403',
 'PGC5103',
 'MKB2703',
 'MGB3120',
 'BTC3150',
 'MEC5415',
 'ATS1421',
 'BIO2181',
 'EAE4062',
 'CPS5006',
 'CHM3960',
 'MTE5884',
 'CDS2512',
 'APG5049',
 'MTE3542',
 'CHE5882',
 'RTP5103',
 'GLS1231',
 'FIT2098',
 'MPH5314',
 'AT
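As an alternative to matching the newlines and then slicing them off, a capture group can return the unit code on its own. This is only a sketch on a toy string; `sample` and `unit_codes` are illustrative names, and the real `text` comes from the tika parser.

```python
import re

# toy input; the real text is produced by the PDF parser
sample = "\n\nTitle Synopsis Outcomes\n\nABC1234 Some synopsis.\n\nFIT5159 Another synopsis."

# the parentheses capture only the code, so the leading '\n\n' is not returned
unit_codes = re.findall(r'\n\n([A-Z]{3,4}\d{4})', sample)
print(unit_codes)
```

When a pattern contains exactly one group, `re.findall` returns the text of that group instead of the whole match, which is why no second clean-up step is needed here.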

### 4.2 Outcome

In [9]:
# Extract outcome using regular expression
o = re.findall(r'\[[\s\S]*?\]',text)
o

["['advanced analytical, expository and argumentative\nskills in the context of writing a synthesis of several\ndocuments and making a presentation on a given\ngeneral topic;', 'advanced knowledge and\nunderstanding of modern and contemporary France\nand its culture;', 'more powerful explicit understanding\nof and competence in the theory, research\nmethodology, practices and discourses of an area of\nFrench studies;', 'the advanced language skills\ninvolved in developing critical enquiry and analysis\nand expressing outcomes and understandings in the\nframework of a research essay; individual and\ncooperative research skills, including;', 'individual\ntransferable research skills in accordance with the\nResearch Skill Development Framework.']",
 "['understand the foundational beliefs of the Bible.',\n'understand the Hebrew Scriptures in their ancient\nNear Eastern context, and the Christian Scriptures in\ntheir Jewish, Greek and intertestamental contexts.',\n'appreciate the diversity 

The format of the outcomes is quite messy here: the text includes many newline characters and punctuation marks, which require some cleaning.

In [10]:
# Remove all square brackets and replace all newline character with a space
for i in range(len(o)):
    o[i] = o[i].replace('[','')
    o[i] = o[i].replace(']','')
    o[i] = o[i].replace('\n',' ')

o

["'advanced analytical, expository and argumentative skills in the context of writing a synthesis of several documents and making a presentation on a given general topic;', 'advanced knowledge and understanding of modern and contemporary France and its culture;', 'more powerful explicit understanding of and competence in the theory, research methodology, practices and discourses of an area of French studies;', 'the advanced language skills involved in developing critical enquiry and analysis and expressing outcomes and understandings in the framework of a research essay; individual and cooperative research skills, including;', 'individual transferable research skills in accordance with the Research Skill Development Framework.'",
 "'understand the foundational beliefs of the Bible.', 'understand the Hebrew Scriptures in their ancient Near Eastern context, and the Christian Scriptures in their Jewish, Greek and intertestamental contexts.', 'appreciate the diversity of Biblical literatur

Every individual outcome is delimited by <font color = 'red'>', '</font> or <font color = 'red'>", "</font> or <font color = 'red'>', "</font> or <font color = 'red'>", '</font>. Therefore, we will split the outcomes using these 4 delimiters.

In [11]:
# split on a quote, a comma and a space, then another quote (all 4 quote combinations)
for i in range(len(o)):
    o[i] = re.split(r"['\"], ['\"]", o[i])
o

[["'advanced analytical, expository and argumentative skills in the context of writing a synthesis of several documents and making a presentation on a given general topic;",
  'advanced knowledge and understanding of modern and contemporary France and its culture;',
  'more powerful explicit understanding of and competence in the theory, research methodology, practices and discourses of an area of French studies;',
  'the advanced language skills involved in developing critical enquiry and analysis and expressing outcomes and understandings in the framework of a research essay; individual and cooperative research skills, including;',
  "individual transferable research skills in accordance with the Research Skill Development Framework.'"],
 ["'understand the foundational beliefs of the Bible.",
  'understand the Hebrew Scriptures in their ancient Near Eastern context, and the Christian Scriptures in their Jewish, Greek and intertestamental contexts.',
  'appreciate the diversity of Bib

Remove the quotation marks at the beginning and end of each block.

In [12]:
for i in range(len(o)):
    o[i][0] = o[i][0][1:]
    o[i][-1] = o[i][-1][:-1]

o

[['advanced analytical, expository and argumentative skills in the context of writing a synthesis of several documents and making a presentation on a given general topic;',
  'advanced knowledge and understanding of modern and contemporary France and its culture;',
  'more powerful explicit understanding of and competence in the theory, research methodology, practices and discourses of an area of French studies;',
  'the advanced language skills involved in developing critical enquiry and analysis and expressing outcomes and understandings in the framework of a research essay; individual and cooperative research skills, including;',
  'individual transferable research skills in accordance with the Research Skill Development Framework.'],
 ['understand the foundational beliefs of the Bible.',
  'understand the Hebrew Scriptures in their ancient Near Eastern context, and the Christian Scriptures in their Jewish, Greek and intertestamental contexts.',
  'appreciate the diversity of Biblic

Some outcomes contain weird character sequences like <font color = 'red'>\\\n</font> (a literal backslash followed by the letter n); they will be replaced by a single space.

In [13]:
## remove \\n
for i in range(len(o)):
    for j in range(len(o[i])):
        o[i][j] = o[i][j].replace('\\n', ' ')
o

[['advanced analytical, expository and argumentative skills in the context of writing a synthesis of several documents and making a presentation on a given general topic;',
  'advanced knowledge and understanding of modern and contemporary France and its culture;',
  'more powerful explicit understanding of and competence in the theory, research methodology, practices and discourses of an area of French studies;',
  'the advanced language skills involved in developing critical enquiry and analysis and expressing outcomes and understandings in the framework of a research essay; individual and cooperative research skills, including;',
  'individual transferable research skills in accordance with the Research Skill Development Framework.'],
 ['understand the foundational beliefs of the Bible.',
  'understand the Hebrew Scriptures in their ancient Near Eastern context, and the Christian Scriptures in their Jewish, Greek and intertestamental contexts.',
  'appreciate the diversity of Biblic
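Each bracketed block looks like a Python list literal, so the bracket stripping, splitting and quote trimming above could possibly be replaced by a single parsing step. The sketch below assumes every block is a well-formed list of quoted strings, which has not been verified for the real file; `parse_outcomes` and `sample` are hypothetical names.

```python
import ast
import re

def parse_outcomes(text):
    """Parse each bracketed block as a Python list literal."""
    parsed = []
    for block in re.findall(r'\[[\s\S]*?\]', text):
        # a raw newline inside a quoted string is not valid in a literal
        block = block.replace('\n', ' ')
        parsed.append(ast.literal_eval(block))
    return parsed

# toy input with mixed quote styles and a line break inside an outcome
sample = "['first outcome\nover two lines;', \"second outcome's text.\"]\n\n['third outcome.']"
outcomes = parse_outcomes(sample)
print(outcomes)
```

`ast.literal_eval` raises an exception on a malformed block, so unexpected input is reported rather than silently mis-split. It also interprets escape sequences inside the strings, so the result may differ slightly from the manual approach.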

### 4.3 Synopsis

The synopsis of each unit does not have a clear start boundary; we only know that it ends with two newline characters. Therefore, the unit codes, the outcomes and the table heading are removed from the text, and what remains is the synopses.

In [14]:
# remove unit code
s = re.sub(r'\n\n[A-Z]{3,4}\d{4}','',text)
# remove outcome
s = re.sub(r'\[[\s\S]*?\]','',s)
s

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n(anonymous)\n\n\nTitle Synopsis Outcomes This unit involves students in teaching and learning\nactivities equally developing language skills and\ncultural competence. It extends skills developed in in\nthe areas of exposition and argument, with a focus on\nspecific expository techniques: document synthesis\nand oral presentation of a sustained argument\ninvolving critical awareness of issues in contemporary\nFrance. Students develop advanced language skills\nand competence in the theory, research methodology\nand practices, and discourses involved in\nsophisticated critical enquiry, understanding and\nanalysis in an area of French studies, working under\nguidance to define and carry out a project.\n\n The unit begins with a survey of the Hebrew\nScriptures as viewed in their ancient Near Eastern\nhistorical and cultural setting, and proceeds to\nexamine the Greek Scriptures or New Testa

In [15]:
# remove heading (the first 88 characters)
s = s[88:]
s

'This unit involves students in teaching and learning\nactivities equally developing language skills and\ncultural competence. It extends skills developed in in\nthe areas of exposition and argument, with a focus on\nspecific expository techniques: document synthesis\nand oral presentation of a sustained argument\ninvolving critical awareness of issues in contemporary\nFrance. Students develop advanced language skills\nand competence in the theory, research methodology\nand practices, and discourses involved in\nsophisticated critical enquiry, understanding and\nanalysis in an area of French studies, working under\nguidance to define and carry out a project.\n\n The unit begins with a survey of the Hebrew\nScriptures as viewed in their ancient Near Eastern\nhistorical and cultural setting, and proceeds to\nexamine the Greek Scriptures or New Testament,\nwhich are situated in their Jewish, Greek and\napocalyptic contexts. Particular attention will be\ndevoted to the Bible as an expressi
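The hard-coded offset of 88 only works for this exact file. A slightly more robust variant locates the heading text and cuts after it. This is a sketch on a toy string, assuming the heading reads `Title Synopsis Outcomes` as in the output above; `drop_heading` is a hypothetical helper.

```python
def drop_heading(s, marker='Title Synopsis Outcomes'):
    """Return the text that follows the first occurrence of the table heading."""
    # str.index raises ValueError if the marker is missing
    start = s.index(marker) + len(marker)
    return s[start:].lstrip()

# toy input shaped like the start of the extracted text
sample = '\n\n\n(anonymous)\n\n\nTitle Synopsis Outcomes This unit involves students.'
print(drop_heading(sample))
```

Only the first occurrence of the heading is handled here; if the heading were repeated on later pages, it would need to be removed separately.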

Most synopses are delimited by <font color = 'red'>\n\n</font>, and 2 synopses are delimited by a double space, so we split this chunk of text using <font color ='red'>\n\n</font> and <font color ='red'>'  '</font> as delimiters.

In [17]:
s1 = re.split('  |\n\n', s)
s1

['This unit involves students in teaching and learning\nactivities equally developing language skills and\ncultural competence. It extends skills developed in in\nthe areas of exposition and argument, with a focus on\nspecific expository techniques: document synthesis\nand oral presentation of a sustained argument\ninvolving critical awareness of issues in contemporary\nFrance. Students develop advanced language skills\nand competence in the theory, research methodology\nand practices, and discourses involved in\nsophisticated critical enquiry, understanding and\nanalysis in an area of French studies, working under\nguidance to define and carry out a project.',
 ' The unit begins with a survey of the Hebrew\nScriptures as viewed in their ancient Near Eastern\nhistorical and cultural setting, and proceeds to\nexamine the Greek Scriptures or New Testament,\nwhich are situated in their Jewish, Greek and\napocalyptic contexts. Particular attention will be\ndevoted to the Bible as an expres

The result contains some empty strings, because the synopsis at the end of each page ends with 4 newline characters. These empty strings need to be removed. The last item in the above list is a single newline character, which needs to be removed too.

In [18]:
# remove empty
s2 = [i for i in s1 if len(i) !=0]
# remove last item
s2 = s2[:-1]
s2

['This unit involves students in teaching and learning\nactivities equally developing language skills and\ncultural competence. It extends skills developed in in\nthe areas of exposition and argument, with a focus on\nspecific expository techniques: document synthesis\nand oral presentation of a sustained argument\ninvolving critical awareness of issues in contemporary\nFrance. Students develop advanced language skills\nand competence in the theory, research methodology\nand practices, and discourses involved in\nsophisticated critical enquiry, understanding and\nanalysis in an area of French studies, working under\nguidance to define and carry out a project.',
 ' The unit begins with a survey of the Hebrew\nScriptures as viewed in their ancient Near Eastern\nhistorical and cultural setting, and proceeds to\nexamine the Greek Scriptures or New Testament,\nwhich are situated in their Jewish, Greek and\napocalyptic contexts. Particular attention will be\ndevoted to the Bible as an expres

The remaining newline characters carry no meaning, so each one is replaced by a single space.

In [19]:
for i in range(len(s2)):
    s2[i] = s2[i].replace('\n',' ')
s2

['This unit involves students in teaching and learning activities equally developing language skills and cultural competence. It extends skills developed in in the areas of exposition and argument, with a focus on specific expository techniques: document synthesis and oral presentation of a sustained argument involving critical awareness of issues in contemporary France. Students develop advanced language skills and competence in the theory, research methodology and practices, and discourses involved in sophisticated critical enquiry, understanding and analysis in an area of French studies, working under guidance to define and carry out a project.',
 ' The unit begins with a survey of the Hebrew Scriptures as viewed in their ancient Near Eastern historical and cultural setting, and proceeds to examine the Greek Scriptures or New Testament, which are situated in their Jewish, Greek and apocalyptic contexts. Particular attention will be devoted to the Bible as an expression of the religi
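The split, the removal of empty items and the newline replacement can also be folded into one helper. The sketch below is a variant rather than a copy of the cells above: it filters out whitespace-only pieces instead of dropping the last item by position. `split_synopses` and `sample` are illustrative names.

```python
import re

def split_synopses(s):
    """Split on a blank line or a double space, then drop empty pieces."""
    pieces = re.split(r' {2}|\n\n', s)
    # turn line breaks into spaces and trim the leading space some pieces carry
    cleaned = [piece.replace('\n', ' ').strip() for piece in pieces]
    return [piece for piece in cleaned if piece]

# toy input: a page break (4 newlines), a double-space delimiter, a trailing newline
sample = ('First synopsis\nover two lines.\n\n Second synopsis.'
          '\n\n\n\nThird synopsis.  Fourth synopsis.\n\n\n')
print(split_synopses(sample))
```

Filtering on content rather than position means the helper does not depend on the stray newline being the last item of the list.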

## 5. Sentence segmentation
Both the synopses and the outcomes require sentence segmentation.

In [20]:
## import library for sentence segmentation
import nltk.data
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

### 5.1 Synopsis

In [21]:
## sentence segmentation for synopsis
for i in range(len(s2)):
    s2[i] = sent_detector.tokenize(s2[i].strip())

s2

[['This unit involves students in teaching and learning activities equally developing language skills and cultural competence.',
  'It extends skills developed in in the areas of exposition and argument, with a focus on specific expository techniques: document synthesis and oral presentation of a sustained argument involving critical awareness of issues in contemporary France.',
  'Students develop advanced language skills and competence in the theory, research methodology and practices, and discourses involved in sophisticated critical enquiry, understanding and analysis in an area of French studies, working under guidance to define and carry out a project.'],
 ['The unit begins with a survey of the Hebrew Scriptures as viewed in their ancient Near Eastern historical and cultural setting, and proceeds to examine the Greek Scriptures or New Testament, which are situated in their Jewish, Greek and apocalyptic contexts.',
  'Particular attention will be devoted to the Bible as an express

In [22]:
## only normalize the first character in each sentence to lowercase
for i in range(len(s2)):
    for j in range(len(s2[i])):
        s2[i][j] = s2[i][j][0].lower() + s2[i][j][1:]
s2

[['this unit involves students in teaching and learning activities equally developing language skills and cultural competence.',
  'it extends skills developed in in the areas of exposition and argument, with a focus on specific expository techniques: document synthesis and oral presentation of a sustained argument involving critical awareness of issues in contemporary France.',
  'students develop advanced language skills and competence in the theory, research methodology and practices, and discourses involved in sophisticated critical enquiry, understanding and analysis in an area of French studies, working under guidance to define and carry out a project.'],
 ['the unit begins with a survey of the Hebrew Scriptures as viewed in their ancient Near Eastern historical and cultural setting, and proceeds to examine the Greek Scriptures or New Testament, which are situated in their Jewish, Greek and apocalyptic contexts.',
  'particular attention will be devoted to the Bible as an express
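Indexing a sentence with `[0]` raises an `IndexError` if the sentence is an empty string. A slice-based helper avoids that while still lowercasing only the first character. This is a small sketch; `lower_first` is a hypothetical name.

```python
def lower_first(sentence):
    """Lowercase only the first character; leave the rest untouched."""
    # slicing returns '' for an empty string instead of raising IndexError
    return sentence[:1].lower() + sentence[1:]

print(lower_first('Students study French in France.'))
print(repr(lower_first('')))
```

Proper nouns inside the sentence, such as "France", keep their capital letter, which is the point of normalising only the first character.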

### 5.2 Outcome

In [24]:
## sentence segmentation for outcome
for i in range(len(o)):
    for j in range(len(o[i])):
        o[i][j] = sent_detector.tokenize(o[i][j].strip())

o

[[['advanced analytical, expository and argumentative skills in the context of writing a synthesis of several documents and making a presentation on a given general topic;'],
  ['advanced knowledge and understanding of modern and contemporary France and its culture;'],
  ['more powerful explicit understanding of and competence in the theory, research methodology, practices and discourses of an area of French studies;'],
  ['the advanced language skills involved in developing critical enquiry and analysis and expressing outcomes and understandings in the framework of a research essay; individual and cooperative research skills, including;'],
  ['individual transferable research skills in accordance with the Research Skill Development Framework.']],
 [['understand the foundational beliefs of the Bible.'],
  ['understand the Hebrew Scriptures in their ancient Near Eastern context, and the Christian Scriptures in their Jewish, Greek and intertestamental contexts.'],
  ['appreciate the dive

The outcomes have now become a list of lists of lists, which needs to be flattened by one level to match the dimension of the synopses.

In [25]:
## flatten the list
for i in range(len(o)):
    new = []
    for j in o[i]:
        for k in j:
            new.append(k)
    o[i] = new

o

[['advanced analytical, expository and argumentative skills in the context of writing a synthesis of several documents and making a presentation on a given general topic;',
  'advanced knowledge and understanding of modern and contemporary France and its culture;',
  'more powerful explicit understanding of and competence in the theory, research methodology, practices and discourses of an area of French studies;',
  'the advanced language skills involved in developing critical enquiry and analysis and expressing outcomes and understandings in the framework of a research essay; individual and cooperative research skills, including;',
  'individual transferable research skills in accordance with the Research Skill Development Framework.'],
 ['understand the foundational beliefs of the Bible.',
  'understand the Hebrew Scriptures in their ancient Near Eastern context, and the Christian Scriptures in their Jewish, Greek and intertestamental contexts.',
  'appreciate the diversity of Biblic
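The same one-level flattening can be written with `itertools.chain.from_iterable` from the standard library. The sketch uses toy data; `nested` and `flattened` are illustrative names.

```python
from itertools import chain

# toy data: per unit, a list of outcomes, each already split into sentences
nested = [
    [['outcome a;'], ['outcome b.', 'outcome c.']],
    [['outcome d.']],
]

# flatten the sentence level only, keeping one list per unit
flattened = [list(chain.from_iterable(unit_outcomes)) for unit_outcomes in nested]
print(flattened)
```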

In [26]:
# case normalization for first letter in the sentence
for i in range(len(o)):
    for j in range(len(o[i])):
        o[i][j] = o[i][j][0].lower() + o[i][j][1:]

o

[['advanced analytical, expository and argumentative skills in the context of writing a synthesis of several documents and making a presentation on a given general topic;',
  'advanced knowledge and understanding of modern and contemporary France and its culture;',
  'more powerful explicit understanding of and competence in the theory, research methodology, practices and discourses of an area of French studies;',
  'the advanced language skills involved in developing critical enquiry and analysis and expressing outcomes and understandings in the framework of a research essay; individual and cooperative research skills, including;',
  'individual transferable research skills in accordance with the Research Skill Development Framework.'],
 ['understand the foundational beliefs of the Bible.',
  'understand the Hebrew Scriptures in their ancient Near Eastern context, and the Christian Scriptures in their Jewish, Greek and intertestamental contexts.',
  'appreciate the diversity of Biblic

### 5.3 Combining synopsis and outcome for each unit

In [27]:
combine = []

for i in range(len(s2)):
    temp = []
    
    for j in range(len(s2[i])):
        temp.append(s2[i][j])
        
    for k in range(len(o[i])):
        temp.append(o[i][k])
        
    combine.append(temp)

combine

[['this unit involves students in teaching and learning activities equally developing language skills and cultural competence.',
  'it extends skills developed in in the areas of exposition and argument, with a focus on specific expository techniques: document synthesis and oral presentation of a sustained argument involving critical awareness of issues in contemporary France.',
  'students develop advanced language skills and competence in the theory, research methodology and practices, and discourses involved in sophisticated critical enquiry, understanding and analysis in an area of French studies, working under guidance to define and carry out a project.',
  'advanced analytical, expository and argumentative skills in the context of writing a synthesis of several documents and making a presentation on a given general topic;',
  'advanced knowledge and understanding of modern and contemporary France and its culture;',
  'more powerful explicit understanding of and competence in the 

In [28]:
# join the sentences of each unit into a single string to reduce dimension

for i in range(len(combine)):
    combine[i] = ' '.join(combine[i])

combine

['this unit involves students in teaching and learning activities equally developing language skills and cultural competence. it extends skills developed in in the areas of exposition and argument, with a focus on specific expository techniques: document synthesis and oral presentation of a sustained argument involving critical awareness of issues in contemporary France. students develop advanced language skills and competence in the theory, research methodology and practices, and discourses involved in sophisticated critical enquiry, understanding and analysis in an area of French studies, working under guidance to define and carry out a project. advanced analytical, expository and argumentative skills in the context of writing a synthesis of several documents and making a presentation on a given general topic; advanced knowledge and understanding of modern and contemporary France and its culture; more powerful explicit understanding of and competence in the theory, research methodolo
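The combining step relies on the synopsis list and the outcome list having the same length and order. A compact variant with an explicit length check could look like the sketch below; `combine_units` is a hypothetical helper and the inputs are toy data.

```python
def combine_units(synopses, outcomes):
    """Join the synopsis and outcome sentences of each unit into one string."""
    if len(synopses) != len(outcomes):
        raise ValueError('synopses and outcomes must describe the same units')
    return [' '.join(syn + out) for syn, out in zip(synopses, outcomes)]

# toy data: two units, each with sentence lists for synopsis and outcomes
toy_synopses = [['this unit covers x.', 'it extends y.'], ['the unit begins with z.']]
toy_outcomes = [['understand x;', 'apply y.'], ['explain z.']]

combined = combine_units(toy_synopses, toy_outcomes)
print(combined)
```

The check matters because a mis-split synopsis would otherwise shift every later unit and pair it with the wrong outcomes.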

## 6. Tokenization

### 6.1 unigram tokens

In [29]:
# import library
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?")

# initialize empty list to store tokens for each unit
token_list = []

# tokenize string
for i in range(len(combine)):
    tokens = tokenizer.tokenize(combine[i])
    token_list.append(tokens)

token_list

[['this',
  'unit',
  'involves',
  'students',
  'in',
  'teaching',
  'and',
  'learning',
  'activities',
  'equally',
  'developing',
  'language',
  'skills',
  'and',
  'cultural',
  'competence',
  'it',
  'extends',
  'skills',
  'developed',
  'in',
  'in',
  'the',
  'areas',
  'of',
  'exposition',
  'and',
  'argument',
  'with',
  'a',
  'focus',
  'on',
  'specific',
  'expository',
  'techniques',
  'document',
  'synthesis',
  'and',
  'oral',
  'presentation',
  'of',
  'a',
  'sustained',
  'argument',
  'involving',
  'critical',
  'awareness',
  'of',
  'issues',
  'in',
  'contemporary',
  'France',
  'students',
  'develop',
  'advanced',
  'language',
  'skills',
  'and',
  'competence',
  'in',
  'the',
  'theory',
  'research',
  'methodology',
  'and',
  'practices',
  'and',
  'discourses',
  'involved',
  'in',
  'sophisticated',
  'critical',
  'enquiry',
  'understanding',
  'and',
  'analysis',
  'in',
  'an',
  'area',
  'of',
  'French',
  'studies',
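To see what the token pattern accepts, it can be tried on a short string with the standard `re` module. This sketch assumes that `RegexpTokenizer` with default settings returns the non-overlapping matches of the pattern, much as `re.findall` does; `sample` and `tokens` are illustrative names.

```python
import re

# same pattern as the tokenizer above: a word, optionally followed by
# one hyphen or apostrophe and another word
pattern = r"\w+(?:[-']\w+)?"

sample = "state-of-the-art Earth's CO2"
tokens = re.findall(pattern, sample)
print(tokens)
```

The pattern allows only one internal hyphen or apostrophe per token, so a longer compound such as "state-of-the-art" is split into two tokens.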
  

### 6.2 bigram tokens

In [30]:
## flatten the token list

word_list = []
for i in range(len(token_list)):
    for j in token_list[i]:
        word_list.append(j)

word_list

['this',
 'unit',
 'involves',
 'students',
 'in',
 'teaching',
 'and',
 'learning',
 'activities',
 'equally',
 'developing',
 'language',
 'skills',
 'and',
 'cultural',
 'competence',
 'it',
 'extends',
 'skills',
 'developed',
 'in',
 'in',
 'the',
 'areas',
 'of',
 'exposition',
 'and',
 'argument',
 'with',
 'a',
 'focus',
 'on',
 'specific',
 'expository',
 'techniques',
 'document',
 'synthesis',
 'and',
 'oral',
 'presentation',
 'of',
 'a',
 'sustained',
 'argument',
 'involving',
 'critical',
 'awareness',
 'of',
 'issues',
 'in',
 'contemporary',
 'France',
 'students',
 'develop',
 'advanced',
 'language',
 'skills',
 'and',
 'competence',
 'in',
 'the',
 'theory',
 'research',
 'methodology',
 'and',
 'practices',
 'and',
 'discourses',
 'involved',
 'in',
 'sophisticated',
 'critical',
 'enquiry',
 'understanding',
 'and',
 'analysis',
 'in',
 'an',
 'area',
 'of',
 'French',
 'studies',
 'working',
 'under',
 'guidance',
 'to',
 'define',
 'and',
 'carry',
 'out',
 'a',

#### 6.2.1 find bigram tokens

In [31]:
## find the top 200 bigrams ranked by PMI

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(word_list)
bi_token_200 = finder.nbest(bigram_measures.pmi, 200)

bi_token_200

[('5', 'differentiate'),
 ('A', 'N'),
 ('AFW1002', 'ACW1002'),
 ('AIR', 'DATA'),
 ('AN', 'MC'),
 ('ATS2296', 'ATS3296'),
 ('ATS2297', 'ATS3297'),
 ('Accountants', 'SAICA'),
 ('Added', 'Tax'),
 ('Adorno', 'Walter'),
 ('Alain', 'Badiou'),
 ('Assyria', 'Urartu'),
 ('Australasian', 'Triage'),
 ('Badiou', 'Michel'),
 ('Big', 'Data'),
 ('Bode', 'plot'),
 ('CO2', 'emissions'),
 ('Capabilities', 'Sensors'),
 ('Causation', 'Dispositions'),
 ('Centres', 'hospitals'),
 ('Clinical', 'Objectives'),
 ('Concentrator', 'PV'),
 ('Contemporary', 'Practices'),
 ('Council', 'ANMAC'),
 ('Cowen', 'School'),
 ('DATA', 'sensor'),
 ('DNA', 'In'),
 ('Darwinism', 'quantum'),
 ('Derrida', 'Gilles'),
 ('Dialectical', 'Behavior'),
 ('Division', '1RN'),
 ('Dumb', 'Ways'),
 ("Earth's", 'biosphere'),
 ('Engagement', 'CMOP-E'),
 ('Euler', 'angles'),
 ('Financial', 'Statements'),
 ('Fuels', 'Cells'),
 ('GPS', 'INS'),
 ('Gauss', 'elimination'),
 ("Gauss's", 'divergence'),
 ('Gilles', 'Deleuze'),
 ('Gram', 'staining'),
 (
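PMI (pointwise mutual information) compares how often two words occur together with how often they would co-occur by chance: PMI(w1, w2) = log2( P(w1, w2) / (P(w1) P(w2)) ). The sketch below is a simplified pure-Python version for illustration, not NLTK's implementation. It estimates every probability by dividing a count by the number of words; `pmi_scores` and `toy_words` are hypothetical names.

```python
import math
from collections import Counter

def pmi_scores(words):
    """Return {bigram: PMI} for adjacent word pairs, using base-2 logarithms."""
    unigram_counts = Counter(words)
    bigram_counts = Counter(zip(words, words[1:]))
    total = len(words)
    scores = {}
    for (w1, w2), pair_count in bigram_counts.items():
        # log2( (pair_count/total) / ((count1/total) * (count2/total)) )
        ratio = pair_count * total / (unigram_counts[w1] * unigram_counts[w2])
        scores[(w1, w2)] = math.log2(ratio)
    return scores

toy_words = ['the', 'big', 'data', 'the', 'cat', 'the', 'dog', 'the']
scores = pmi_scores(toy_words)
best = max(scores, key=scores.get)
print(best, scores[best])
```

A pair of words that each occur only once gets the highest possible score, so PMI tends to rank rare pairs first. This may explain entries such as ('5', 'differentiate') in the list above. If that is not desired, NLTK's finder has an `apply_freq_filter` method that drops bigrams below a minimum count before ranking.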

#### 6.2.2 stemming bigram tokens

In [32]:
## import library
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

There are 3 cases that we need to be aware of when stemming:
* the string consists entirely of uppercase letters, e.g. 'AIR'
* only the first character of the string is uppercase, e.g. 'Adorno'
* all letters are lowercase

The Porter stemmer normalizes all characters to lowercase, so some if-else statements are used to restore the original case.

In [33]:
def stem_keep_case(token):
    """Stem a token, then restore its original capitalisation pattern."""
    stemmed = stemmer.stem(token)
    # tokens of length 1 are left as stemmed, since they are removed at a later stage
    if len(token) > 1:
        # second character uppercase: treat the token as all uppercase, e.g. 'AIR'
        if token[1].isupper():
            return stemmed.upper()
        # only the first character uppercase, e.g. 'Adorno'
        if token[0].isupper():
            return stemmed[0].upper() + stemmed[1:]
    # first and second characters lowercase: nothing to capitalize
    return stemmed

# both parts of each bigram follow the same procedure
bi_token_200_stem = [(stem_keep_case(first), stem_keep_case(second))
                     for first, second in bi_token_200]

bi_token_200_stem

[('5', 'differenti'),
 ('A', 'N'),
 ('AFW1002', 'ACW1002'),
 ('AIR', 'DATA'),
 ('AN', 'MC'),
 ('ATS2296', 'ATS3296'),
 ('ATS2297', 'ATS3297'),
 ('Account', 'SAICA'),
 ('Ad', 'Tax'),
 ('Adorno', 'Walter'),
 ('Alain', 'Badiou'),
 ('Assyria', 'Urartu'),
 ('Australasian', 'Triag'),
 ('Badiou', 'Michel'),
 ('Big', 'Data'),
 ('Bode', 'plot'),
 ('CO2', 'emiss'),
 ('Capabl', 'Sensor'),
 ('Causat', 'Disposit'),
 ('Centr', 'hospit'),
 ('Clinic', 'Object'),
 ('Concentr', 'PV'),
 ('Contemporari', 'Practic'),
 ('Council', 'ANMAC'),
 ('Cowen', 'School'),
 ('DATA', 'sensor'),
 ('DNA', 'In'),
 ('Darwin', 'quantum'),
 ('Derrida', 'Gill'),
 ('Dialect', 'Behavior'),
 ('Divis', '1RN'),
 ('Dumb', 'Way'),
 ("Earth'", 'biospher'),
 ('Engag', 'CMOP-'),
 ('Euler', 'angl'),
 ('Financi', 'Statement'),
 ('Fuel', 'Cell'),
 ('GP', 'IN'),
 ('Gauss', 'elimin'),
 ("Gauss'", 'diverg'),
 ('Gill', 'Deleuz'),
 ('Gram', 'stain'),
 ("Great'", 'death'),
 ('Hannah', 'Arendt'),
 ('Happi', 'Littl'),
 ('Hardwar', 'Capabl'),
 ('H

## 7. Remove stop words 

In [38]:
# open the stop word file; the with block closes it automatically
with open('stopwords_en.txt', 'r') as file:
    # split the words into a list using '\n' as delimiter
    stop_word_list = file.read().split('\n')

# initialize empty list to store result
token_list1 = []

## store the result in the new list
for i in range(len(token_list)):
    temp = [item for item in token_list[i] if item not in stop_word_list]
    token_list1.append(temp)

token_list1

[['unit',
  'involves',
  'students',
  'teaching',
  'learning',
  'activities',
  'equally',
  'developing',
  'language',
  'skills',
  'cultural',
  'competence',
  'extends',
  'skills',
  'developed',
  'areas',
  'exposition',
  'argument',
  'focus',
  'specific',
  'expository',
  'techniques',
  'document',
  'synthesis',
  'oral',
  'presentation',
  'sustained',
  'argument',
  'involving',
  'critical',
  'awareness',
  'issues',
  'contemporary',
  'France',
  'students',
  'develop',
  'advanced',
  'language',
  'skills',
  'competence',
  'theory',
  'research',
  'methodology',
  'practices',
  'discourses',
  'involved',
  'sophisticated',
  'critical',
  'enquiry',
  'understanding',
  'analysis',
  'area',
  'French',
  'studies',
  'working',
  'guidance',
  'define',
  'carry',
  'project',
  'advanced',
  'analytical',
  'expository',
  'argumentative',
  'skills',
  'context',
  'writing',
  'synthesis',
  'documents',
  'making',
  'presentation',
  'general',

A quick note: each step stores its result in a new variable, so if anything goes wrong I can return to the previous checkpoint and inspect it.
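
The stop-word filter in this section tests each token against a list, which scans the whole list for every token. As a hedged aside, converting the stop words to a `set` gives the same result with faster lookups. The sketch below uses a small made-up stop-word list rather than `stopwords_en.txt`, and `remove_stop_words` is a hypothetical helper name.

```python
def remove_stop_words(docs, stop_words):
    """Return a copy of `docs` (a list of token lists) without stop words."""
    stop_set = set(stop_words)  # set membership is O(1) on average
    return [[tok for tok in doc if tok not in stop_set] for doc in docs]


# toy data, not taken from the 200 units
docs = [['this', 'unit', 'involves', 'students'], ['the', 'unit', 'is', 'French']]
stop_words = ['this', 'the', 'is']

print(remove_stop_words(docs, stop_words))
```

In the notebook this would correspond to `remove_stop_words(token_list, stop_word_list)`.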

## 8. Stemming
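The stemming procedure is the same as in the bigram step above, so the explanation is not repeated here. The only difference is that single-character tokens are not appended to the result.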
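
Since the same capitalisation rules are applied to both the bigram tokens and the unigram tokens, they could be collected into one helper. The following is a minimal sketch, not the code used in this notebook: `stem_keep_case` and `toy_stem` are hypothetical names, and `toy_stem` is only a stand-in for `stemmer.stem` so that the example runs without NLTK.

```python
def stem_keep_case(token, stem):
    """Stem `token` with `stem` and restore its original capitalisation pattern.

    `stem` is any callable that returns a lowercase stem (e.g. `stemmer.stem`).
    """
    stemmed = stem(token)
    if len(token) > 1 and stemmed:
        if token[1].isupper():
            # acronym-like token such as 'DATA': restore full uppercase
            return stemmed.upper()
        if token[0].isupper():
            # capitalised token such as 'France': restore the leading capital
            return stemmed[0].upper() + stemmed[1:]
    return stemmed


def toy_stem(word):
    # hypothetical stand-in for a real stemmer: lowercase and strip a final 's'
    word = word.lower()
    return word[:-1] if word.endswith('s') else word


print(stem_keep_case('DATA', toy_stem))
print(stem_keep_case('Sensors', toy_stem))
print(stem_keep_case('plots', toy_stem))
```

With the Porter stemmer from the notebook, the call would be `stem_keep_case(tok, stemmer.stem)`.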

In [39]:
token_list2 = []

for i in range(len(token_list1)):
    temp = []
    
    for j in range(len(token_list1[i])):
        # if length of string is > 1
        if len(token_list1[i][j]) >1:
            # check 2nd character
            if token_list1[i][j][1].isupper():
                temp_tok = stemmer.stem(token_list1[i][j])
                temp_tok = temp_tok.upper()
            # check first character
            elif token_list1[i][j][0].isupper():
                temp_tok = stemmer.stem(token_list1[i][j])
                temp_tok = temp_tok[0].upper() + temp_tok[1:]
            else:
                temp_tok = stemmer.stem(token_list1[i][j])
                
            temp.append(temp_tok)
        else:
            # single-character tokens are stemmed but not appended, so they are dropped here
            temp_tok = stemmer.stem(token_list1[i][j])
            
    token_list2.append(temp)

token_list2

[['unit',
  'involv',
  'student',
  'teach',
  'learn',
  'activ',
  'equal',
  'develop',
  'languag',
  'skill',
  'cultur',
  'compet',
  'extend',
  'skill',
  'develop',
  'area',
  'exposit',
  'argument',
  'focu',
  'specif',
  'expositori',
  'techniqu',
  'document',
  'synthesi',
  'oral',
  'present',
  'sustain',
  'argument',
  'involv',
  'critic',
  'awar',
  'issu',
  'contemporari',
  'Franc',
  'student',
  'develop',
  'advanc',
  'languag',
  'skill',
  'compet',
  'theori',
  'research',
  'methodolog',
  'practic',
  'discours',
  'involv',
  'sophist',
  'critic',
  'enquiri',
  'understand',
  'analysi',
  'area',
  'French',
  'studi',
  'work',
  'guidanc',
  'defin',
  'carri',
  'project',
  'advanc',
  'analyt',
  'expositori',
  'argument',
  'skill',
  'context',
  'write',
  'synthesi',
  'document',
  'make',
  'present',
  'gener',
  'topic',
  'advanc',
  'knowledg',
  'understand',
  'modern',
  'contemporari',
  'Franc',
  'cultur',
  'power',
  '

## 9. Removing tokens

### 9.1 Tokens with length less than 3

In [40]:
token_list3 =[]

for i in range(len(token_list2)):
    temp = []
    for j in range(len(token_list2[i])):
        if len(token_list2[i][j]) >= 3:
            temp.append(token_list2[i][j])
    token_list3.append(temp)

token_list3

[['unit',
  'involv',
  'student',
  'teach',
  'learn',
  'activ',
  'equal',
  'develop',
  'languag',
  'skill',
  'cultur',
  'compet',
  'extend',
  'skill',
  'develop',
  'area',
  'exposit',
  'argument',
  'focu',
  'specif',
  'expositori',
  'techniqu',
  'document',
  'synthesi',
  'oral',
  'present',
  'sustain',
  'argument',
  'involv',
  'critic',
  'awar',
  'issu',
  'contemporari',
  'Franc',
  'student',
  'develop',
  'advanc',
  'languag',
  'skill',
  'compet',
  'theori',
  'research',
  'methodolog',
  'practic',
  'discours',
  'involv',
  'sophist',
  'critic',
  'enquiri',
  'understand',
  'analysi',
  'area',
  'French',
  'studi',
  'work',
  'guidanc',
  'defin',
  'carri',
  'project',
  'advanc',
  'analyt',
  'expositori',
  'argument',
  'skill',
  'context',
  'write',
  'synthesi',
  'document',
  'make',
  'present',
  'gener',
  'topic',
  'advanc',
  'knowledg',
  'understand',
  'modern',
  'contemporari',
  'Franc',
  'cultur',
  'power',
  '

### 9.2 Most frequent (in more than 95% of units) and least frequent (in fewer than 5% of units) tokens
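
The cells in this subsection count, for each token, the number of units it appears in (its document frequency) and then drop tokens that appear in more than 95% or fewer than 5% of the units. A compact sketch of the same idea on toy data follows; `document_frequency` and `filter_by_df` are hypothetical helper names, and the thresholds are passed as parameters.

```python
from collections import Counter


def document_frequency(docs):
    """Count in how many documents each token occurs (once per document)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def filter_by_df(docs, low=0.05, high=0.95):
    """Drop tokens whose document frequency is below `low` or above `high`."""
    df = document_frequency(docs)
    n = len(docs)
    drop = {tok for tok, count in df.items() if count > n * high or count < n * low}
    return [[tok for tok in doc if tok not in drop] for doc in docs]


# toy corpus of 4 documents; thresholds widened so the effect is visible
docs = [['unit', 'skill', 'skill'], ['unit', 'theori'], ['unit', 'skill'], ['unit', 'Franc']]
print(document_frequency(docs)['skill'])
print(filter_by_df(docs, low=0.3, high=0.9))
```

Here `unit` occurs in every toy document and is dropped as too frequent, while `theori` and `Franc` occur in only one document each and are dropped as too rare.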

In [41]:
## Count the document frequency for each token

# initialize dictionary to store the result
df_dict = {}

for i in range(len(token_list3)):
    # use set function to get unique tokens
    unique = list(set(token_list3[i]))
    # count frequency using for loop
    for w in unique:
        if w not in df_dict:
            df_dict[w] = 1
        else:
            df_dict[w] += 1

df_dict

{'techniqu': 36,
 'expositori': 1,
 'guidanc': 6,
 'student': 128,
 'outcom': 20,
 'sophist': 7,
 'critic': 91,
 'contemporari': 47,
 'activ': 23,
 'focu': 19,
 'modern': 15,
 'framework': 22,
 'Research': 2,
 'exposit': 1,
 'advanc': 26,
 'synthesi': 9,
 'Develop': 1,
 'languag': 17,
 'knowledg': 78,
 'explicit': 2,
 'Framework': 3,
 'document': 15,
 'discours': 12,
 'compet': 21,
 'includ': 76,
 'write': 26,
 'individu': 32,
 'transfer': 3,
 'analysi': 49,
 'analyt': 14,
 'develop': 129,
 'specif': 30,
 'Franc': 1,
 'extend': 17,
 'enquiri': 2,
 'topic': 39,
 'skill': 84,
 'Skill': 3,
 'studi': 50,
 'power': 19,
 'sustain': 16,
 'awar': 20,
 'cooper': 1,
 'unit': 166,
 'essay': 7,
 'gener': 22,
 'teach': 6,
 'area': 25,
 'understand': 113,
 'involv': 23,
 'argument': 23,
 'French': 1,
 'work': 61,
 'defin': 12,
 'express': 15,
 'theori': 45,
 'methodolog': 24,
 'present': 50,
 'context': 59,
 'learn': 43,
 'equal': 2,
 'carri': 3,
 'issu': 67,
 'make': 20,
 'research': 68,
 'project'

In [42]:
# initialize empty list to store the tokens that we want to remove
most_least_freq_list = []

# loop over dictionary to check frequency
for k,v in df_dict.items():
    if v > len(unit)*0.95 or v < len(unit)*0.05 :
        most_least_freq_list.append(k)

most_least_freq_list

['expositori',
 'guidanc',
 'sophist',
 'Research',
 'exposit',
 'synthesi',
 'Develop',
 'explicit',
 'Framework',
 'transfer',
 'Franc',
 'enquiri',
 'Skill',
 'cooper',
 'essay',
 'teach',
 'French',
 'equal',
 'carri',
 'accord',
 'Scriptur',
 'evil',
 'view',
 'Israelit',
 'cult',
 'New',
 'Israel',
 'Christian',
 'genr',
 'proce',
 'biblic',
 'survey',
 'Hebrew',
 'devot',
 'ancient',
 'divin',
 'Bibl',
 'literari',
 'canonis',
 'Biblic',
 'Ancient',
 'scholarship',
 'intertestament',
 'attent',
 'institut',
 'redempt',
 'feminist',
 'law',
 'revel',
 'apocalypt',
 'religi',
 'Eastern',
 'Near',
 'narr',
 'Testament',
 'belief',
 'composit',
 'suffer',
 'Greek',
 'propheci',
 'authorship',
 'creation',
 'thought',
 'poetri',
 'Jewish',
 'sentenc',
 'sanction',
 'African',
 'punish',
 'correct',
 'multidisciplinari',
 'crime',
 'South',
 'justic',
 'harm',
 'right',
 'Africa',
 'reduct',
 'specialis',
 'altern',
 'formal',
 'crimin',
 'polic',
 'restor',
 'scienc',
 'evolut',
 'br

In [43]:
## Remove the most frequent (above 95%) and least frequent (below 5%) tokens from our token list

# initialize empty list to store result
token_list_final = []

# loop over the token lists and keep only the tokens that are not in most_least_freq_list
for i in range(len(token_list3)):
    temp = [item for item in token_list3[i] if item not in most_least_freq_list]
    token_list_final.append(temp)

token_list_final

[['unit',
  'involv',
  'student',
  'learn',
  'activ',
  'develop',
  'languag',
  'skill',
  'cultur',
  'compet',
  'extend',
  'skill',
  'develop',
  'area',
  'argument',
  'focu',
  'specif',
  'techniqu',
  'document',
  'oral',
  'present',
  'sustain',
  'argument',
  'involv',
  'critic',
  'awar',
  'issu',
  'contemporari',
  'student',
  'develop',
  'advanc',
  'languag',
  'skill',
  'compet',
  'theori',
  'research',
  'methodolog',
  'practic',
  'discours',
  'involv',
  'critic',
  'understand',
  'analysi',
  'area',
  'studi',
  'work',
  'defin',
  'project',
  'advanc',
  'analyt',
  'argument',
  'skill',
  'context',
  'write',
  'document',
  'make',
  'present',
  'gener',
  'topic',
  'advanc',
  'knowledg',
  'understand',
  'modern',
  'contemporari',
  'cultur',
  'power',
  'understand',
  'compet',
  'theori',
  'research',
  'methodolog',
  'practic',
  'discours',
  'area',
  'studi',
  'advanc',
  'languag',
  'skill',
  'involv',
  'develop',
  '

## 10. Bigram tokens

Now we can check whether any of the bigram tokens found with the PMI measure at the beginning can be kept after removing the stop words and the most/least frequent tokens. First, we need to transform our unigram token list into a bigram token list.
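
Pairing each token with its successor can also be written with `zip`. This is only a sketch on toy data, and `to_bigrams` is a hypothetical name.

```python
def to_bigrams(tokens):
    """Return the list of adjacent token pairs in `tokens`."""
    return list(zip(tokens, tokens[1:]))


print(to_bigrams(['unit', 'involv', 'student']))
print(to_bigrams(['unit']))
```

Note that the pairs are adjacent in the filtered token list, not necessarily in the original text, because stop words and the most/least frequent tokens have already been removed at this point.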

In [44]:
## transform unigram token list to bigram token list
# initialize empty list
bi_tok_list =[]

for i in range(len(token_list_final)):
    temp = []
    for j in range(len(token_list_final[i])-1):
        bi = (token_list_final[i][j], token_list_final[i][j+1])
        temp.append(bi)
        
    bi_tok_list.append(temp)

bi_tok_list

[[('unit', 'involv'),
  ('involv', 'student'),
  ('student', 'learn'),
  ('learn', 'activ'),
  ('activ', 'develop'),
  ('develop', 'languag'),
  ('languag', 'skill'),
  ('skill', 'cultur'),
  ('cultur', 'compet'),
  ('compet', 'extend'),
  ('extend', 'skill'),
  ('skill', 'develop'),
  ('develop', 'area'),
  ('area', 'argument'),
  ('argument', 'focu'),
  ('focu', 'specif'),
  ('specif', 'techniqu'),
  ('techniqu', 'document'),
  ('document', 'oral'),
  ('oral', 'present'),
  ('present', 'sustain'),
  ('sustain', 'argument'),
  ('argument', 'involv'),
  ('involv', 'critic'),
  ('critic', 'awar'),
  ('awar', 'issu'),
  ('issu', 'contemporari'),
  ('contemporari', 'student'),
  ('student', 'develop'),
  ('develop', 'advanc'),
  ('advanc', 'languag'),
  ('languag', 'skill'),
  ('skill', 'compet'),
  ('compet', 'theori'),
  ('theori', 'research'),
  ('research', 'methodolog'),
  ('methodolog', 'practic'),
  ('practic', 'discours'),
  ('discours', 'involv'),
  ('involv', 'critic'),
  ('crit

Next, we will remove all the bigram tokens that are not in our list of 200 bigrams.

In [45]:
## remove all the bigram tokens that are not in the 200 list

for i in range(len(bi_tok_list)):
    bi_tok_list[i] = [ item for item in bi_tok_list[i] if item in bi_token_200_stem]

bi_tok_list

[[],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [('nation', 'posit')],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 

Only one bigram token, ('nation', 'posit'), is left after removing stop words and the most/least frequent words from our token list. Since its frequency is only 1, it is discarded as well, so the final dictionary contains no bigram tokens.
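
The frequency check described above can be made explicit by counting every surviving bigram across all units. The following is a minimal sketch with `collections.Counter` on toy data; the variable names and the `min_count` threshold are illustrative assumptions, not values taken from the notebook.

```python
from collections import Counter
from itertools import chain

# toy stand-in for bi_tok_list after filtering: most units have no surviving bigram
bigrams_per_unit = [[], [('nation', 'posit')], [], []]

# flatten the per-unit lists and count each bigram
counts = Counter(chain.from_iterable(bigrams_per_unit))

min_count = 2  # assumed threshold: keep bigrams seen at least twice
kept = [bigram for bigram, n in counts.items() if n >= min_count]

print(counts[('nation', 'posit')])
print(kept)
```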

## 11. Build vocab index dictionary

In [46]:
# import library
from __future__ import division
from itertools import chain

## Build token dictionary
word_dict = {}

for i in range(len(unit)):
    word_dict[unit[i]] = token_list_final[i]

## extract vocab(unique words) from the above dictionary
vocab = list(set(chain.from_iterable(word_dict.values())))

## assemble vocab_index dictionary
vocab_dict = {}
idx = 0
# sort the vocab in alphabetical order and assign index
for w in sorted(vocab):
    vocab_dict[w] = idx
    idx += 1

vocab_dict

{'Australian': 0,
 'abil': 1,
 'academ': 2,
 'access': 3,
 'account': 4,
 'acquir': 5,
 'acquisit': 6,
 'activ': 7,
 'address': 8,
 'advanc': 9,
 'aim': 10,
 'analys': 11,
 'analysi': 12,
 'analyt': 13,
 'appli': 14,
 'applic': 15,
 'apprais': 16,
 'approach': 17,
 'area': 18,
 'argument': 19,
 'art': 20,
 'articul': 21,
 'aspect': 22,
 'assess': 23,
 'awar': 24,
 'base': 25,
 'basi': 26,
 'basic': 27,
 'begin': 28,
 'behaviour': 29,
 'broad': 30,
 'build': 31,
 'busi': 32,
 'capac': 33,
 'care': 34,
 'case': 35,
 'challeng': 36,
 'chang': 37,
 'characterist': 38,
 'clinic': 39,
 'collabor': 40,
 'commun': 41,
 'compar': 42,
 'compet': 43,
 'complet': 44,
 'complex': 45,
 'comprehens': 46,
 'concept': 47,
 'conceptu': 48,
 'conduct': 49,
 'consid': 50,
 'construct': 51,
 'contemporari': 52,
 'context': 53,
 'continu': 54,
 'contribut': 55,
 'control': 56,
 'convent': 57,
 'core': 58,
 'cover': 59,
 'creat': 60,
 'creativ': 61,
 'critic': 62,
 'critiqu': 63,
 'cultur': 64,
 'current': 6

In [47]:
## Write to output file
## open file handle
out = open('29481929_vocab.txt','w')

## write to file
for k,v in vocab_dict.items():
    out.write(k + ':' + str(v) + '\n')

# close file after finished
out.close()
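
The index assignment and the file output can be written more compactly with `enumerate` and a `with` block, which closes the file even if a write fails. The sketch below uses toy tokens and writes to an in-memory buffer instead of `29481929_vocab.txt`; `build_vocab_index` is a hypothetical name.

```python
import io
from itertools import chain


def build_vocab_index(docs):
    """Map each unique token to its position in the sorted vocabulary."""
    vocab = sorted(set(chain.from_iterable(docs)))
    return {tok: idx for idx, tok in enumerate(vocab)}


docs = [['unit', 'skill', 'area'], ['unit', 'Australian']]
vocab_index = build_vocab_index(docs)

# io.StringIO stands in for open('..._vocab.txt', 'w')
with io.StringIO() as out:
    for tok, idx in vocab_index.items():
        out.write(f'{tok}:{idx}\n')
    written = out.getvalue()

print(written)
```

As in the notebook output, capitalised tokens such as `Australian` sort before lowercase ones because `sorted` compares strings by code point.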

## 12. Build sparse vector

First, we transform all our tokens into indices using the dictionary built in the previous section.

In [48]:
## transform all the tokens in to index using vocab_dict
# initialize empty list to store result
tok_idx_list = []

# loop over individual tokens and look up their corresponding index
for i in range(len(token_list_final)):
    temp = []
    for j in token_list_final[i]:
        temp.append(str(vocab_dict[j]))
    tok_idx_list.append(temp)

tok_idx_list

[['292',
  '159',
  '263',
  '166',
  '7',
  '76',
  '163',
  '249',
  '64',
  '43',
  '109',
  '249',
  '76',
  '18',
  '19',
  '117',
  '257',
  '275',
  '84',
  '194',
  '215',
  '269',
  '19',
  '159',
  '62',
  '24',
  '160',
  '52',
  '263',
  '76',
  '9',
  '163',
  '249',
  '43',
  '281',
  '237',
  '183',
  '213',
  '80',
  '159',
  '62',
  '290',
  '12',
  '18',
  '264',
  '297',
  '70',
  '223',
  '9',
  '13',
  '19',
  '249',
  '53',
  '299',
  '84',
  '174',
  '215',
  '127',
  '285',
  '9',
  '162',
  '290',
  '186',
  '52',
  '64',
  '212',
  '290',
  '43',
  '281',
  '237',
  '183',
  '213',
  '80',
  '18',
  '264',
  '9',
  '163',
  '249',
  '159',
  '76',
  '62',
  '12',
  '108',
  '198',
  '290',
  '122',
  '237',
  '146',
  '237',
  '249',
  '143',
  '146',
  '237',
  '249'],
 ['292',
  '28',
  '132',
  '64',
  '246',
  '101',
  '248',
  '53',
  '108',
  '169',
  '285',
  '101',
  '143',
  '107',
  '58',
  '217',
  '47',
  '20',
  '107',
  '160',
  '290',
  '121',
 

Build a dictionary using the unit code as key and the list of token indices as value.

In [50]:
unit_idx = {}
for i in range(len(unit)):
    unit_idx[unit[i]] = tok_idx_list[i]

unit_idx

{'ATS3070': ['292',
  '159',
  '263',
  '166',
  '7',
  '76',
  '163',
  '249',
  '64',
  '43',
  '109',
  '249',
  '76',
  '18',
  '19',
  '117',
  '257',
  '275',
  '84',
  '194',
  '215',
  '269',
  '19',
  '159',
  '62',
  '24',
  '160',
  '52',
  '263',
  '76',
  '9',
  '163',
  '249',
  '43',
  '281',
  '237',
  '183',
  '213',
  '80',
  '159',
  '62',
  '290',
  '12',
  '18',
  '264',
  '297',
  '70',
  '223',
  '9',
  '13',
  '19',
  '249',
  '53',
  '299',
  '84',
  '174',
  '215',
  '127',
  '285',
  '9',
  '162',
  '290',
  '186',
  '52',
  '64',
  '212',
  '290',
  '43',
  '281',
  '237',
  '183',
  '213',
  '80',
  '18',
  '264',
  '9',
  '163',
  '249',
  '159',
  '76',
  '62',
  '12',
  '108',
  '198',
  '290',
  '122',
  '237',
  '146',
  '237',
  '249',
  '143',
  '146',
  '237',
  '249'],
 'ATS2185': ['292',
  '28',
  '132',
  '64',
  '246',
  '101',
  '248',
  '53',
  '108',
  '169',
  '285',
  '101',
  '143',
  '107',
  '58',
  '217',
  '47',
  '20',
  '107',
  '160

Now we can count the frequency of each index for each unit using the `FreqDist` class in the nltk.probability module.
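
`FreqDist` is a subclass of `collections.Counter`, so the counting step can be sketched with the standard library alone. The example below uses toy index lists and a hypothetical helper `count_vector_line`; it joins the pairs with `', '`, so each line has no trailing separator, which differs slightly from the output cell in this section.

```python
from collections import Counter


def count_vector_line(unit_code, idx_list):
    """Format one unit as 'CODE, idx:count, idx:count, ...'."""
    counts = Counter(idx_list)
    pairs = [f'{idx}:{n}' for idx, n in counts.items()]
    return ', '.join([unit_code] + pairs)


# toy data: unit code -> list of vocabulary indices (as strings)
unit_idx = {'ATS3070': ['292', '159', '249', '159'], 'ATS2185': ['292', '28']}

for code, idx_list in unit_idx.items():
    print(count_vector_line(code, idx_list))
```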

In [51]:
# import library
from nltk.probability import FreqDist

# open file handle
out = open('29481929_countVec.txt', 'w')

for i in range(len(unit)):
    out.write(unit[i] + ', ')
    # count frequency
    d = FreqDist(unit_idx[unit[i]])
    # since the FreqDist object behaves like a dictionary, we can access its items using the items() method
    # write to output file
    for k,v in d.items():
        out.write(k + ':' + str(v) + ', ')
    out.write('\n')

# close file after finish
out.close()