### Text Processing


Date: 14th April 2019

Environment: Python 2.7.11 and Jupyter notebook

## 1. Introduction
For this task, I've provided a step by step process on analyzing textual data by pre processing text and producing a vocab file.

## 2. Import libraries
I've used the libraries below to obtain the output for this assignment

In [None]:
import re
import PyPDF2
import nltk
from nltk.tokenize import RegexpTokenizer
from itertools import chain
from nltk.tokenize import MWETokenizer
from nltk.probability import *
import multiprocessing as mp
from nltk.stem import PorterStemmer

## 3. Reading and extracting from PDF
Using a function from PyPDF2, all the raw text from pdf files can be extracted and stored into a python object for usage throughout the assignment. The expected output of this step is one long string object of the whole document.

In [38]:
#making page content a string
page_content = ""
import PyPDF2
with open('Unit Guide.pdf','rb') as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    #iterate through page numbers in range of pages
    for page_number in range(number_of_pages): 
        page = read_pdf.getPage(page_number)
        page_content+= page.extractText()
page_content

## 4. Creating separate list for unit codes and combined synopsis and outcome

I've created two lists of unitcode and unit information. The first process is to separate each chunk of unit documents into list. Then, I've had created the two lists of unitcodes and summary information. 

In [39]:
# used it to extract units
p = re.findall('([^ ][A-Z]{3}[0-9]{4})\n',page_content)
units= list(p)
 
# used it to extract summary
summary = re.split('[^ ][A-Z]{3}[0-9]{4}\n',page_content)

summary.pop(0) #dont need first line hence i've popped it

'Title\nSynopsis\nOutcomes'

## 5. Normalize the first letters to lowercase
I've used the list of summary to normalize the first letters to lowercase. Text cleaning was done in this process. I've removed unwanted symbols and new line space characters. This needs to be done for tokenization to be correct given that "n" in "\n" can be counted in tokenization.


In [4]:
#lists below to append
sum1 = []
sum2 = []
sum3 = [] 
sum4 = [] 
summary_final = []

# remove new line space with symbol \\n and replace it with space to make it tidier and avoid 'n' being attached to a word
i = 0
for each in summary:
    sub = re.sub(r'\\n', " ", summary[i])
    sum1.append(sub)
    i += 1
    
# remove new line space with symbol \n and replace it with space to make it tidier and avoid 'n' being attached to a word
i = 0
for each in sum1:
    sub = re.sub(r'\n', " ", sum1[i])
    sum2.append(sub)
    i += 1
    
# remove the bracket symbols "[" that exists in outcome to make it tidier.
for each in sum2:
    sub = re.sub(r'\[\'', "", each)
    sum3.append(sub)

# remove the bracket symbols "]" that exists in outcome to make it tidier.
for each in sum3:
    sub = re.sub(r'\'\]', "", each)
    sum4.append(sub)
    
# remove comma and whitespace in outcomes that comes after fullstop so that lowercase function can be applied 
for each in sum4:
    sub = re.sub(r'\',\s\'', " ", each)
    summary_final.append(sub)

In [36]:
# empty list to append on
normalize = []
summary_list = []
# using regex to get the first capital letter after full stop and replacing it with the same match regex alphabet
# but with lowercase
for each in summary_final:
    lower = re.sub(r'((?<=\.\s)[A-Z])', 
                   lambda match: r'{}'.format(match.group(1).lower()),each) 
    
    normalize.append(lower) 

# applying lowercase function for the first sentence only because it doesnt start with a full 
#stop.
for each in normalize:
    # using index 0 to get the first alphabet index
    summary_list.append(each.replace(each[0],each[0].lower()))

In [33]:
# renaming units to units_total for easier reference
p = re.findall('[^ ]([A-Z]{3}[0-9]{4})\n',page_content)
units_total= list(p)

In [34]:
# creating dictionary for units_total and summary_list
sum1 = dict(zip(units_total,summary_list))

## 5. Tokenize summary for every units using regular expression,  "\w+(?:[-']\w+)?
After normalizing the texts, I have tokenize the units with regular expression tokenizer implemented in NLTK. The code below is in reference to the tutorial

In [8]:
#tokenizing based on regex word below
tokenizer = RegexpTokenizer("\w+(?:[-']\w+)?")

def tokenizeunits(units):

    token =  sum1[units] 
    tokenized = tokenizer.tokenize(token)
    tokenized_unit = list(set(tokenized))
    return(units, tokenized_unit) # return a tuple of units and a list of tokens

# created a dictionary 
units_tokenized = dict(tokenizeunits(units) for units in sum1.keys())

## 6. Finding first 200 meaningful bigrams using PMI  measure.

In [40]:
#creating bigrams
all_words = list(chain.from_iterable(units_tokenized.values()))

## 7. Removing tokens with the length less than 3 from the vocab.

In [10]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_measures
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(all_words)
bigram_finder
#more than 20 times frequency
bigram_finder.apply_freq_filter(20)
bigram_finder.apply_word_filter(lambda w: len(w) < 3)# 
top_200_bigrams = bigram_finder.nbest(bigram_measures.pmi, 200) # Top-200 bigrams
top_200_bigrams

[('skills', 'analysis'),
 ('the', 'able'),
 ('research', 'skills'),
 ('with', 'and'),
 ('this', 'research'),
 ('skills', 'for')]

In [41]:
# using MWE tokenizer to obtain top 200 bigrams
mwetokenizer = MWETokenizer(top_200_bigrams)
colloc_patents =  dict((units_total, mwetokenizer.tokenize(summary_list)) for units_total,summary_list in units_tokenized.items())
all_words_colloc = list(chain.from_iterable(colloc_patents.values()))
colloc_voc = list(set(all_words_colloc))

## 8. Remove Stopwords

In [12]:
#open stopwords folder
with open('stopwords_en.txt') as f:
    stopwords = f.read()


In [13]:
#stopwords removed
tokenized_units_1 = {}
for units in sum1.keys():
    tokenized_units_1[units] = [w for w in units_tokenized[units] if w not in stopwords]

## 9. Removing context - independent and context dependent using frequency distribution

In [14]:
words = list(chain.from_iterable([set(value) for value in tokenized_units_1.values()]))
# use freqdist to get frequency of words
fd = FreqDist(words)
r = fd.most_common()

In [15]:
r = fd.most_common()
r

[('unit', 146),
 ('students', 110),
 ('skills', 97),
 ('research', 80),
 ('understanding', 76),
 ('develop', 69),
 ('knowledge', 62),
 ('apply', 61),
 ('analyse', 61),
 ('issues', 60),
 ('including', 59),
 ('critically', 57),
 ('practice', 51),
 ('concepts', 51),
 ('evaluate', 50),
 ('principles', 49),
 ('identify', 49),
 ('design', 49),
 ('analysis', 48),
 ('range', 48),
 ('demonstrate', 46),
 ('work', 46),
 ('critical', 46),
 ('techniques', 45),
 ('context', 44),
 ('health', 43),
 ('development', 42),
 ('key', 41),
 ('contemporary', 39),
 ('learning', 39),
 ('social', 39),
 ('role', 38),
 ('management', 38),
 ('explain', 38),
 ('process', 37),
 ('topics', 37),
 ('relevant', 37),
 ('communication', 36),
 ('professional', 35),
 ('project', 35),
 ('methods', 34),
 ('understand', 33),
 ('ability', 32),
 ('advanced', 31),
 ('writing', 31),
 ('cultural', 31),
 ('communicate', 30),
 ('strategies', 29),
 ('reflect', 28),
 ('discuss', 28),
 ('practices', 28),
 ('based', 28),
 ('interpret', 28

In [16]:
#getting list of more than 95% frequency
list_of_mostcommon = []
for i in r:
    word,freq = i
    if freq > 0.95 * len(sum1):
        le = (word, freq)
        list_of_mostcommon.append(word)
list_of_mostcommon

[]

In [17]:
# checking for least common words
p = fd.most_common()[::-1]
p

[('paddock', 1),
 ('household', 1),
 ('agricultural', 1),
 ('impacting', 1),
 ('salinity', 1),
 ('views', 1),
 ('economists', 1),
 ('plate', 1),
 ('beneficial', 1),
 ('messages', 1),
 ('initiatives', 1),
 ('nutritional', 1),
 ('drive', 1),
 ('drought', 1),
 ('intake', 1),
 ('Industrial', 1),
 ('IDN4001', 1),
 ('perfect', 1),
 ('serve', 1),
 ('initiated', 1),
 ('One', 1),
 ('Major', 1),
 ('Cultivate', 1),
 ('ambitions', 1),
 ('species', 1),
 ('genetics', 1),
 ('practicals', 1),
 ('drift', 1),
 ('discoveries', 1),
 ('mutation', 1),
 ('high-level', 1),
 ('invasive', 1),
 ('mating', 1),
 ('biodiversity', 1),
 ('verse', 1),
 ('Symbolist', 1),
 ('Modernist', 1),
 ('ideogramic', 1),
 ('automatic', 1),
 ('parataxis', 1),
 ('free', 1),
 ('agreement', 1),
 ('requests', 1),
 ('Introductory', 1),
 ('routine', 1),
 ('lifetime', 1),
 ('refusal', 1),
 ('Classification', 1),
 ('accommodating', 1),
 ('improving', 1),
 ('classroom', 1),
 ('childhood', 1),
 ('neurodevelopmental', 1),
 ('learner', 1),
 ('

## 10. Remove rare tokens

In [18]:
#getting list of less than 5%
list_of_leastcommon = []
for i in p:
    word,freq = i
    if freq < 0.05 * len(sum1):
        pe = (word, freq)
        list_of_leastcommon.append(word)
print(len(list_of_leastcommon))

3685


In [19]:
both_removed = []
for k, v in tokenized_units_1.items():
    # using list comprehension for filtering out tokens
    tokenized_units_1[k]=[i for i in v if i not in list_of_leastcommon]

countwords=list(chain.from_iterable(tokenized_units_1.values()))
a=set(countwords)
b=list(a)
print(len(b))

259


## 11. Stemming using Porter Stemmer
The stemming process is done using Porter Stemmer.I've performed the stemming on updated
vocabulary. The code below is in reference to the lecture file.

In [20]:
stemmer = PorterStemmer()

new=['{1}'.format(w, stemmer.stem(w)) for w in b]
print((new))

['cover', 'conceptu', 'base', 'relat', 'health', 'servic', 'abil', 'approach', 'question', 'select', 'solv', 'appreci', 'intern', 'engag', 'studi', 'strategi', 'understand', 'practic', 'commun', 'complex', 'synthesis', 'demonstr', 'financi', 'limit', 'group', 'project', 'structur', 'natur', 'includ', 'program', 'introduc', 'framework', 'methodolog', 'interpret', 'practic', 'work', 'develop', 'fundament', 'polit', 'divers', 'assess', 'student', 'theoret', 'evalu', 'analys', 'laboratori', 'introduct', 'examin', 'learn', 'concept', 'perspect', 'text', 'present', 'system', 'ethic', 'demonstr', 'high', 'commun', 'construct', 'element', 'independ', 'compon', 'role', 'social', 'methodolog', 'oral', 'unit', 'ethic', 'analyt', 'problem', 'conduct', 'report', 'cultur', 'aim', 'data', 'implic', 'examin', 'complet', 'work', 'identifi', 'decis', 'critic', 'evid', 'disciplin', 'influenc', 'address', 'studio', 'inform', 'global', 'australia', 'relat', 'common', 'product', 'approach', 'public', 'chall

# 12. Writing the vocab.txt file
After the pre processing is done, with the cleaned output I am creating a vocabulary by extracting all tokens from all units combined and set it to make it unique. A vocab file is created in this step that should have vocab and index which will be used for creating count vector. The output of this step should be a file text containing all indexed vocabs. 

In [26]:
vocab = list(set(new))
vocab.sort()# sort vocabs alphabetically

199


In [31]:
integer_index = list(range(0,len(vocab))) 

In [42]:
#open vocab file
f = open("vocab.txt", "w+")

i = 0
while i < len(vocab)-1: # create loop only until second to last index because I will write comma after each vocab
    f.write(vocab[i] + ":" + str(integer_index[i]) + "," + "\r\n")
    i += 1
