# FIT5196 Assessment 1: Task 2
#### Student Name: Isobel Rowe
#### Student ID: 30042585

Date: 10/04/2019

Version: 2.0

Environment: Python 3.6.0 and Anaconda 4.3.0 (64-bit)

Libraries used:
* PyPDF2
* re
* nltk
* nltk.collocations
* RegexpTokenizer and MWETokenizer from nltk.tokenize
* pandas
* PorterStemmer from nltk.stem
* itertools
* CountVectorizer from sklearn

## 1. Introduction
This task of assignment 1 focuses on extracting data from PDF files and transforming it into a vector space model. There are a number of steps involved in this process, which can broadly be outlined by the following:

1. Extract the unit information from the PDF file.
2. Process the unit information and transform into the format outlined by the specifications.
3. Generate the output - the vocabulary file, and the count vector file.

More details for each task will be given in the following sections.

## 2. Import libraries

In [1]:
import PyPDF2
import re
import nltk
from nltk.collocations import *
from nltk.tokenize import RegexpTokenizer
import pandas as pd
from nltk.tokenize import MWETokenizer
from nltk.stem import PorterStemmer
from itertools import *
from sklearn.feature_extraction.text import CountVectorizer
import warnings
warnings.filterwarnings('ignore')

## 3. Opening, reading and extracting data

Firstly, the content of the PDF needs to be extracted and stored in a format that can be used for processing. The PyPDF2 library is used for extraction, and the content is written to a text file.

In [2]:
# Opening the PDF file and the output text file; both are closed automatically on exit
with open('30042585.pdf', 'rb') as pdf_file, open('extractedpdf.txt', 'w') as extract:
    pdf_read = PyPDF2.PdfFileReader(pdf_file)

    # Looping over the pages of the PDF and writing each page's text to the extract file
    for i in range(pdf_read.getNumPages()):
        page = pdf_read.getPage(i)
        page_content = page.extractText()
        extract.write(page_content)

### Unit ID retrieval
Next, the unit IDs are extracted from the extracted PDF content file.

In [3]:
#Reading textfile in as string 
with open('extractedpdf.txt', 'r') as file:
    unitid1 = re.findall(r'^\w{3,4}\d{4}\n', file.read(), flags=re.MULTILINE|re.DOTALL)
    
# Removing the trailing \n character from each match
unitid = [x[:-1] for x in unitid1]

#Checking the length of the list to ensure that the regex was a success
print("Length of unit ID: ", len(unitid))
print(unitid[1:10])


Length of unit ID:  200
['ATS1092', 'VCO1303', 'ATS2791', 'DWG3516', 'BFC3440', 'ATS2791', 'CHE3171', 'IMM2011', 'BFX5018']
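
As a minimal sketch of how this pattern behaves, the snippet below runs it on a made-up string. The sample text and the names `sample`, `matches` and `ids` are illustrative assumptions, not part of the assignment data. With `re.MULTILINE`, `^` anchors at the start of every line, so a unit code mentioned in the middle of a line is not picked up.

```python
import re

# Hypothetical sample mimicking the layout of the extracted text
sample = "Title\nATS1092\nSynopsis mentioning ATS2791 inline\nVCO1303\nmore text\n"

# With re.MULTILINE, ^ matches at the start of each line
matches = re.findall(r'^\w{3,4}\d{4}\n', sample, flags=re.MULTILINE)

# Drop the trailing newline from each match
ids = [m[:-1] for m in matches]
print(ids)
```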


### Case normalisation
Next, case normalisation. Normally, this would be completed after tokenisation; however, as we want to keep the capital letters that appear in the middle of sentences, it's best to do it now. Another possibility is to use the NLTK sentence tokeniser but, in doing so, sentences starting with "\['" in the 'Outcomes' section would be left capitalised. 

In [4]:
# Opening and reading the file
textfile = open('extractedpdf.txt', 'r+')
text = textfile.read()

# Defining the regex patterns
pattern_a = r'(?<=[.?!]\s)(\w+)' # Matches the first word after a sentence stopper ('.', '?' or '!') and one whitespace character
pattern_b = r'\[\'\w' # Matches the opening [' of the outcomes list plus the first character after it
pattern_c = r'(?<=^\w{3}\d{4}\n)\w' # Matches the first character after the unit ID in the synopsis section

# Replacing the capitals for lowercase
for f in re.findall(pattern_a, text):
    text = text.replace(f, f.lower())
for f in re.findall(pattern_b, text):
    text = text.replace(f, f.lower())   
for f in re.findall(pattern_c, text):
    text = text.replace(f, f.lower())  
    
# Validation
print(text[31:424])



this unit introduces students to the technological,
social, economic and political forces driving the
development, and adoption of new media and
communications technologies. it examines case
studies of when 'old technologies were new' such as
the telegraph and radio as well as the social shaping
of very recent examples of new media, such as
Facebook, Sina weibo, Qzone, Renren and Twitter.
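
One point to keep in mind with the cell above: `str.replace` changes every occurrence of the matched string in the text, not only the one at the start of a sentence. A possible alternative, sketched here on a made-up string, is `re.sub` with a callback so that only the matched positions are lowercased. The sample text and the names `after_stop`, `after_code` and `lower_match` are assumptions for illustration, and `re.MULTILINE` is added so that `^` applies to every line.

```python
import re

# Hypothetical sample text; not taken from the assignment PDF
sample = "ATS1092\nThis unit covers German. Students visit Germany. German is used."

# First word after a sentence stopper followed by one whitespace character
after_stop = re.compile(r'(?<=[.?!]\s)(\w+)')
# First character after a unit code that sits on its own line
after_code = re.compile(r'(?<=^\w{3}\d{4}\n)\w', flags=re.MULTILINE)

def lower_match(match):
    """Lowercase only the text at this match position."""
    return match.group(0).lower()

normalised = after_stop.sub(lower_match, sample)
normalised = after_code.sub(lower_match, normalised)
print(normalised)
```

In this sample the mid-sentence "German" and "Germany" keep their capitals, while the sentence-initial words are lowercased.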



### Content retrieval
The information for each unit is extracted from the text file and stored in a list called 'content'. A regular expression is used here to gather everything between the unit code and the closing square bracket of the 'Outcomes' section.

In [5]:
content = re.findall(r'(?<=\w{3}\d{4}\n)(.*?)(?=\])', text, flags = re.MULTILINE|re.DOTALL)
print("Length: ", len(content))

Length:  200
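
To illustrate how the lookbehind and lookahead delimit each unit, here is a small sketch on an invented two-unit string (the unit codes and text are hypothetical). Because `.*?` is lazy, each match stops at the first `]` after the unit code, so the approach assumes that no other `]` appears inside a unit's text.

```python
import re

# Hypothetical two-unit sample in the layout the regex expects
sample = (
    "ABC1234\nsynopsis one\n['outcome a', 'outcome b']\n"
    "XYZ5678\nsynopsis two\n['outcome c']\n"
)

# Lazy match from just after a unit code up to (not including) the next ']'
blocks = re.findall(r'(?<=\w{3}\d{4}\n)(.*?)(?=\])', sample,
                    flags=re.MULTILINE | re.DOTALL)
print(len(blocks))
print(blocks[0])
```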


## 4. Processing the data

### Tokenisation
Tokenisation is the process of splitting text up into 'tokens'. It is done here with RegexpTokenizer and the supplied regular expression, which matches runs of word characters, optionally joined by a single internal hyphen or apostrophe.

In [6]:
#Creating empty list
tokenised = []

#Defining the tokeniser
tokeniser = RegexpTokenizer(r"\w+(?:[-']\w+)?")

#Looping over the list and tokenising
for element in content:
    tokenise = tokeniser.tokenize(str(element))
    tokenised.append(tokenise)

#Checking the length
print("Length: ", len(tokenised))
print(tokenised[1])

Length:  200
['this', 'unit', 'is', 'for', 'students', 'with', 'little', 'or', 'no', 'knowledge', 'of', 'the', 'language', 'this', 'unit', 'consists', 'of', 'two', 'components', 'component', '1', 'Language', 'a', 'communicatively', 'oriented', 'German', 'language', 'course', 'designed', 'for', 'all-round', 'development', 'in', 'the', 'language', 'component', '2', 'this', 'component', 'will', 'familiarise', 'students', 'with', 'the', 'history', 'culture', 'and', 'the', 'socio-economic', 'conditions', 'of', 'the', 'German-speaking', 'countries', 'na']
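
The effect of the supplied pattern can be seen on a short made-up sentence. This sketch uses `re.findall` from the standard library as a stand-in for `RegexpTokenizer`, on the assumption that both return the non-overlapping matches of the pattern; the sample sentence is invented.

```python
import re

# Same pattern as supplied to RegexpTokenizer
pattern = r"\w+(?:[-']\w+)?"

# Hypothetical sentence with a hyphenated word, apostrophes and punctuation
sample = "An all-round course; students' work isn't graded (2 components)."
tokens = re.findall(pattern, sample)
print(tokens)
```

Hyphenated words and contractions stay whole, while a trailing apostrophe and other punctuation are dropped.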


### Removing words < three characters
Now that everything has been extracted from the PDF and stored in lists, processing starts with the removal of words that are fewer than three characters long. 

In [7]:
#Creating a new nested list
removed3 = []

# Looping over and appending words which are more than 2 characters
for each in tokenised:
    temp = []
    for each2 in each:
        if(len(each2)>2):
            temp.append(each2)
    removed3.append(temp)  

#Validation
print(len(removed3))
print(removed3[1])

200
['this', 'unit', 'for', 'students', 'with', 'little', 'knowledge', 'the', 'language', 'this', 'unit', 'consists', 'two', 'components', 'component', 'Language', 'communicatively', 'oriented', 'German', 'language', 'course', 'designed', 'for', 'all-round', 'development', 'the', 'language', 'component', 'this', 'component', 'will', 'familiarise', 'students', 'with', 'the', 'history', 'culture', 'and', 'the', 'socio-economic', 'conditions', 'the', 'German-speaking', 'countries']
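
The same filter can also be written as a nested list comprehension. This is only an equivalent sketch on invented data; `tokenised_sample` and `filtered` are hypothetical names.

```python
# Hypothetical tokenised units
tokenised_sample = [['this', 'unit', 'is', 'for', 'a', 'course'], ['no', 'German', '1']]

# Keep only tokens with three or more characters, preserving one list per unit
filtered = [[tok for tok in unit if len(tok) > 2] for unit in tokenised_sample]
print(filtered)
```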


### Dataframe formatting
From here, I decided that putting the unit ID and information content into a Pandas dataframe would be the best fit for the following sections of the task.

In [8]:
# Creating the dataframe
tokens_df = pd.DataFrame(list(zip(unitid, removed3)))
tokens_df.columns = ['unitid','tokens']

# Verification
print("Length: ", len(tokens_df))
tokens_df.head()

Length:  200


    unitid                                             tokens
0  ATS2436  [this, unit, introduces, students, the, techno...
1  ATS1092  [this, unit, for, students, with, little, know...
2  VCO1303  [this, subject, students, will, study, the, wo...
3  ATS2791  [this, unit, provides, detailed, exploration, ...
4  DWG3516  [this, unit, provides, students, art, and, des...


### Bigram
For the bigram generation, I used NLTK's collocation finder, ranking the candidate bigrams by pointwise mutual information (PMI). I applied a frequency filter (a minimum of five occurrences) so that only the 'meaningful' bigrams would be captured. Without this, a lot of unwanted bigrams were captured, such as dates, for example ('July', '1993').

In [9]:
# Initialising the bigram
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(list(chain.from_iterable(tokens_df.tokens)))

# Applying the frequency filter
finder.apply_freq_filter(5)

# Finding the bigrams
bigrams = finder.nbest(bigram_measures.pmi, 200)

print(len(bigrams))
print(bigrams[1:10])

200
[('concrete', 'slabs'), ('concise', 'accurate'), ('reinforced', 'concrete'), ('criminal', 'justice'), ('clear', 'concise'), ('interior', 'architecture'), ('drug', 'action'), ('activities', 'dealing'), ('under', 'pressure')]
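
For intuition, PMI compares how often two words occur together with how often they would co-occur if they were independent. The sketch below estimates it from raw counts on an invented token stream and applies a scaled-down frequency filter. It is a simplified illustration of the idea, not NLTK's implementation, and all names and data in it are assumptions.

```python
import math
from collections import Counter

# Hypothetical token stream
words = ['criminal', 'justice', 'system', 'and', 'criminal', 'justice', 'policy',
         'and', 'the', 'system']

word_counts = Counter(words)
bigram_counts = Counter(zip(words, words[1:]))
total = len(words)

def pmi(w1, w2):
    """log2 of P(w1, w2) / (P(w1) * P(w2)), estimated from raw counts."""
    joint = bigram_counts[(w1, w2)] / total
    return math.log2(joint / ((word_counts[w1] / total) * (word_counts[w2] / total)))

# Frequency filter: ignore bigrams seen fewer than min_freq times
min_freq = 2
candidates = [bg for bg, n in bigram_counts.items() if n >= min_freq]

# Rank the remaining bigrams from highest to lowest PMI
ranked = sorted(candidates, key=lambda bg: pmi(*bg), reverse=True)
print(ranked)
```

Without the frequency filter, pairs that occur only once but consist of rare words would score highly, which is consistent with the date bigrams mentioned above.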


Next, the bigrams need to be re-tokenised into the dataframe.

In [10]:
# Creating empty lists
token_bigram = []
unit_id_bigram = []
# Defining tokeniser with bigrams
tokenizer = MWETokenizer(bigrams)

# Grouping by unit ID, and re-tokenising
for name, group in tokens_df.groupby('unitid'):
    token_list = list(chain.from_iterable(group.tokens))
    tokens_bigrams = tokenizer.tokenize(token_list)
    token_bigram += tokens_bigrams # get new list of tokens
    unit_id_bigram += ([name] * len(tokens_bigrams)) # with their corresponding unit ID
    
# Creating a dictionary of tokens and unit id
bigram_token_dict = {}
bigram_token_dict['token'] = token_bigram
bigram_token_dict['unitid'] = unit_id_bigram

# Creating new data frame with the dictionary
tokens_bigram_df = pd.DataFrame(bigram_token_dict)

# Validation
print(tokens_bigram_df.shape)
print(tokens_bigram_df.head())

(25007, 2)
          token   unitid
0          this  ACC2200
1  introductory  ACC2200
2    management  ACC2200
3    accounting  ACC2200
4  unit_focuses  ACC2200
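
Conceptually, the re-tokenisation joins adjacent tokens that form a known bigram into a single token with an underscore, as seen in 'unit_focuses' above. A rough pure-Python sketch of that idea follows; `merge_bigrams` is a hypothetical helper written for illustration, not the `MWETokenizer` implementation.

```python
# Hypothetical bigrams and token list
bigram_set = {('criminal', 'justice'), ('unit', 'focuses')}
tokens = ['this', 'unit', 'focuses', 'criminal', 'justice', 'system']

def merge_bigrams(tokens, bigram_set, separator='_'):
    """Greedily join adjacent tokens that form a known bigram."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in bigram_set:
            merged.append(separator.join(pair))
            i += 2  # skip both members of the bigram
        else:
            merged.append(tokens[i])
            i += 1
    return merged

merged_tokens = merge_bigrams(tokens, bigram_set)
print(merged_tokens)
```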


### Stopword Removal

Next, stopword removal, which essentially removes all of the 'filler' words - like 'and', 'the' etc. This process involves using the provided stopword file and filtering out every word in the 'token' column of the dataframe which occurs in the file.

In [11]:
# Opening stopword file
stopword_file=open('stopwords_en.txt',"r",encoding="utf8")
# Reading and splitting into list
stop_list=stopword_file.read().split('\n')

# Filtering out the tokens that appear in the stopword list
tokens_stops_df = tokens_bigram_df[~tokens_bigram_df.token.isin(stop_list)]

# Checking shape
print(tokens_stops_df.shape)
print(tokens_stops_df.head())

(18167, 2)
          token   unitid
1  introductory  ACC2200
2    management  ACC2200
3    accounting  ACC2200
4  unit_focuses  ACC2200
6         types  ACC2200


### Rare Token Removal

This step involves removing tokens that appear in 5% or fewer of the 200 units (i.e. in 10 units or fewer). 

In [12]:
rare_words = []

# Grouping by token
for name, group in tokens_stops_df.groupby('token'): 
    # If the token appears in 10 or fewer distinct units
    if len(set(group.unitid)) <= 10: 
        rare_words.append(name)
        
# Filter the rare words
tokens_norare_df = tokens_stops_df[~tokens_stops_df.token.isin(rare_words)]

print(tokens_norare_df.shape)
print(tokens_norare_df.head())

(6680, 2)
          token   unitid
2    management  ACC2200
8   information  ACC2200
18     planning  ACC2200
20      control  ACC2200
32   techniques  ACC2200
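
The quantity being thresholded here is the document frequency: the number of distinct units in which a token occurs, regardless of how many times it occurs within each unit. A small sketch with invented pairs and a scaled-down threshold (all names and data are assumptions):

```python
from collections import defaultdict

# Hypothetical (unit ID, token) pairs; the threshold is scaled down for the sample
pairs = [('U1', 'data'), ('U1', 'data'), ('U1', 'law'),
         ('U2', 'data'), ('U2', 'ethics'), ('U3', 'data'), ('U3', 'law')]
threshold = 1  # tokens found in this many units or fewer count as rare

# Document frequency: the set of distinct units in which each token occurs
units_per_token = defaultdict(set)
for unit, token in pairs:
    units_per_token[token].add(unit)

rare = {tok for tok, units in units_per_token.items() if len(units) <= threshold}
kept = [(unit, tok) for unit, tok in pairs if tok not in rare]
print(sorted(rare))
```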


### Stemming
Stemming involves truncating words to their 'base' form. For instance, 'universal' is reduced to 'univers'. Here, we're using the Porter stemmer, as provided by the NLTK module.

In [13]:
# Defining the Porter stemmer
stemmer = PorterStemmer()

# Copying the dataframe so that the stemmed tokens are stored in a new dataframe
tokens_stemmed_df = tokens_norare_df.copy()
# Applying stemmer on the token column of dataframe
tokens_stemmed_df['token'] = tokens_stemmed_df['token'].apply(stemmer.stem)

# Validating
print(tokens_stemmed_df.shape)
print(tokens_stemmed_df.head())

(6680, 2)
       token   unitid
2      manag  ACC2200
8     inform  ACC2200
18      plan  ACC2200
20   control  ACC2200
32  techniqu  ACC2200


## 5. Generate Output

### Vocabulary
The final step is to generate the output files for the vocabulary and the vector space model. First is the vocabulary, for which we use CountVectorizer from scikit-learn to convert the processed text into a matrix of token counts.

In [14]:
#Use count vectorizer to count the words and remove context dependent stopwords with 95% frequency   
vectorizer = CountVectorizer(analyzer = 'word', max_df=0.95)
vectorizerobject = vectorizer.fit_transform(tokens_stemmed_df.token)

# Retrieving the feature names and the token-to-index vocabulary
vocab=vectorizer.get_feature_names()
output=vectorizer.vocabulary_

# Defining keylist and sorting in alphabetical order
keys = list(output.keys()) 
keys.sort()

# Creating the string to write to file
vocabstring = '' 
for key in keys:
    vocabstring = vocabstring + str(key) + ':' + str(output[key]) + '\n' 
    
# Writing the string to the file
with open('30042585_vocab.txt', 'w') as vocaboutput:
    vocaboutput.write(vocabstring)
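
The vocabulary file format (one `token:index` entry per line, in alphabetical order) can be sketched without scikit-learn as follows. This assumes that the indices simply follow the alphabetical order of the unique tokens; the token list and names are invented for illustration.

```python
# Hypothetical stemmed tokens
stemmed = ['manag', 'inform', 'plan', 'manag', 'control', 'inform']

# Alphabetically sorted unique tokens, indexed from 0
vocab_index = {token: idx for idx, token in enumerate(sorted(set(stemmed)))}

# One "token:index" entry per line
vocab_lines = [f'{token}:{idx}' for token, idx in vocab_index.items()]
print('\n'.join(vocab_lines))
```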

### Vector space model

Finally, the vector space model. The first step in producing this is to create a new dataframe, with the frequency count of each token per unit ID.

In [15]:
# Creating the dataframe
tokens_freq_df = pd.DataFrame(
                        {'frequency': tokens_stemmed_df.groupby(['unitid', 'token']).size()}
                    ).reset_index()
# Validate
print(tokens_freq_df.head())
print(tokens_freq_df.shape)

    unitid    token  frequency
0  ACC2200   analys          2
1  ACC2200  analysi          1
2  ACC2200  context          1
3  ACC2200  control          1
4  ACC2200    decis          2
(4409, 3)


The next step is to map the token ID (from the vocabulary dictionary) to the new dataframe.

In [16]:
# Creating dictionary with word:word_id
vocabulary_dict = {token:index for index, token in enumerate(keys)}

# Mapping the token ID 
tokens_freq_df['tokenid'] = tokens_freq_df.token.map(vocabulary_dict)

# Validating
print(tokens_freq_df.head())
print(tokens_freq_df.shape)

    unitid    token  frequency  tokenid
0  ACC2200   analys          2        3
1  ACC2200  analysi          1        4
2  ACC2200  context          1       33
3  ACC2200  control          1       34
4  ACC2200    decis          2       41
(4409, 4)


Finally, writing the vector space model to a file.

In [17]:
a_writer = open('30042585_countVec.txt', 'w')

# Grouping by unit ID
for unitid, group in tokens_freq_df.groupby('unitid'):
    # Writing in the specified format to the file
    a_writer.write(unitid + ','+ ','.join(group.tokenid.map(str) + ':' + group.frequency.map(str)) + '\n')
    
a_writer.close()
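
Each line of the count vector file has the form `unitid,tokenid:frequency,tokenid:frequency,...`. The following sketch builds such lines from invented (unit ID, token) pairs using only the standard library; the vocabulary, pairs and variable names are assumptions for illustration.

```python
from collections import Counter

# Hypothetical vocabulary and (unit ID, token) pairs
vocab_index = {'control': 0, 'inform': 1, 'manag': 2}
pairs = [('ACC2200', 'manag'), ('ACC2200', 'inform'), ('ACC2200', 'manag'),
         ('ATS1092', 'control')]

# Count each (unit, token) pair, then group the "tokenid:frequency" entries by unit
counts = Counter(pairs)
by_unit = {}
for (unit, token), freq in sorted(counts.items()):
    by_unit.setdefault(unit, []).append(f'{vocab_index[token]}:{freq}')

# One line per unit: the unit ID followed by its sparse token counts
lines = [unit + ',' + ','.join(entries) for unit, entries in by_unit.items()]
print('\n'.join(lines))
```

Only tokens that actually occur in a unit are listed, which is what makes the representation sparse.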

## 6. Summary

This task has demonstrated the basics of text pre-processing in Python. The main outcomes achieved while executing this task were: 

* extracting data from PDF files, 
* natural language processing with:
    * word tokenisation, 
    * normalisation,
    * the removal of stopwords, rare tokens, and tokens less than three characters long,
    * bigram generation, and
    * stemming.
* creation of vocabulary and sparse representations.



## 7. References

The Python Software Foundation. (2019). *'9.7. itertools — Functions creating iterators for efficient looping'*. Retrieved from https://docs.python.org/2/library/itertools.html#module-itertools

NLTK Project. (2017). NLTK 3.0 documentation: *nltk.tokenize.mwe module*. Retrieved from http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.mwe