# FIT5196 Assessment 1 Task 2
#### Student Name: Pichaphop Sunthornjittanon
#### Student ID: 31258301

Date: 31/8/21
Version: 2.0

Environment: Python 3.6.0 and Anaconda 4.3.0 (64-bit)

## 1. Introduction
This task touches on the next step of analyzing textual data, i.e., converting the extracted data
into a numeric representation. In this task, you are required to write Python code to preprocess a
set of articles about cryptocurrency and convert them into numerical representations (which are
suitable for input into recommender-systems/ information-retrieval algorithms).

Your task is to extract and transform the information in the PDF file by performing the following tasks:
1. Generate the corpus vocabulary with the same structure as sample_vocab.txt. Please
note that the vocabulary must be sorted alphabetically.
2. For each day (articles come with a date at their title), generate the sparse representation
(i.e., doc-term matrix) of the PDF file according to the structure of the
sample_countVec.txt. The articles of the same date must be concatenated before
converting to the vector representation. The order of concatenation is not important
for us (e.g., assuming “article1” and “article2” are both written on the same day, then you
can either do article1+article2 or article2+article1).

The following steps must be performed (not necessarily in the same order) to complete the
assessment. Please note that the order of preprocessing matters and will result in different
vocabulary and hence different count vectors. It is part of the assessment to figure out the
correct order of preprocessing which makes the most sense as we learned in the tutorials. If in
doubt, you are encouraged to ask questions and discuss with the teaching team.
1. The word tokenization must use the following regular expression,
"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
2. The context-independent and context-dependent stopwords must be removed from
the vocabulary.
○ For context-independent stopwords, the provided stop words list (i.e.,
stopwords_en.txt) must be used.
○ For context-dependent stopwords, you must set the threshold to more than
ceil(Number_of_days / 2).
3. Tokens should be stemmed using the Porter stemmer.
4. Rare tokens (with the threshold set to less than 10 days) must be removed from the
vocab.
5. Creating the sparse matrix using countvectorizer.
6. Tokens with a length less than 3 should be removed from the vocab.
7. First 200 meaningful bigrams (i.e., collocations) must be included in the vocab using
PMI measure.

## 2. Importing libraries

In [1]:
# Download pdfminer
!pip install pdfminer.six==20181108


# Use for regular expression
import re

# Use for import pdf file
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

# Use for text preprocessing
import nltk 

# Use for tokenisation using regular expression 
from nltk.tokenize import RegexpTokenizer

# Use for porter stemming 
from nltk.stem import PorterStemmer

# Use to calculate term and document frequency
from nltk.probability import FreqDist

# Use for retokenization after considering bigrams
from nltk.tokenize import MWETokenizer

# Use for math operation such as ceil
import math
from __future__ import division

# Use the iterator to join all the words together
from itertools import chain
import itertools

# Use for create sparse matrix
from sklearn.feature_extraction.text import CountVectorizer




## 3. Examining and loading data

In this section, we loaded and briefly explored the file 31258301_task2_pdf.pdf from Google Drive, which is processed further in the following sections. We used pdfminer to convert the PDF to raw text and stored the text in the raw_text variable.

In [2]:
# Mount Google Drive in the colab environment
from google.colab import drive

drive.mount('/content/drive') 

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Data from Task2 datasets/31258301_task2_pdf
# Import PDF File
# Reference : https://pdfminersix.readthedocs.io/en/latest/tutorial/composable.html

output_string = StringIO()
with open('/content/drive/Shareddrives/FIT5196-s2-2021-tutorials/Assessment 1/Task2 datasets/31258301_task2_pdf.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)



In [4]:
# Store the raw data from PDF in raw_text
raw_text = output_string.getvalue()


## 4. Creating a dictionary with dates as keys and each day's articles as values

After obtaining the raw text, we broke it down into a dictionary whose keys are dates and whose values are the articles written on those dates (if there were multiple articles on one day, they were concatenated).

To create the dictionary, we used RegexpTokenizer with one regular expression to extract the dates for the keys and another that treats the dates as separators to extract the articles for the values.

The regular expression used to extract the dates is \\[\d{4}-\d{1,2}-\d{1,2}. Most of the dates are in the format [2018-06-30], but some are missing the closing square bracket, such as [2016-01-17, so the pattern does not require it.

For the articles, since we wanted to exclude the titles, the regular expression for the separator is slightly different: \\[\d{4}-\d{1,2}-\d{1,2}(?:.*?\n\n). It matches the date and everything up to the first blank line after it.

After that, we cleaned the dates by stripping the leading [ and zero-padding the month and day to two digits. Finally, we concatenated the articles that were written on the same day.
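
The two patterns can be tried on a small example. The sketch below is only an illustration: the sample text is made up, and it uses the standard `re` module in place of RegexpTokenizer so that it runs without NLTK (`re.split` plus dropping empty pieces is assumed to behave like `gaps=True` here).

```python
import re

# Toy text (made up): one well-formed date and one missing its closing bracket
raw = (
    "[2018-06-30] Title one\n\nBody of article one.\n"
    "[2016-1-17 Title two\n\nBody of article two.\n"
)

DATE_RE = r"\[\d{4}-\d{1,2}-\d{1,2}"
SEPARATOR_RE = r"\[\d{4}-\d{1,2}-\d{1,2}(?:.*?\n\n)"

# Keys: every date match, with the leading '[' stripped and month/day zero-padded
found = re.findall(DATE_RE, raw)
parts = [d.strip("[").split("-") for d in found]
dates = ["-".join([y, m.zfill(2), d.zfill(2)]) for y, m, d in parts]

# Values: the text between separators (date + title + blank line); drop empty pieces
articles = [a for a in re.split(SEPARATOR_RE, raw, flags=re.DOTALL) if a]

print(dates)     # ['2018-06-30', '2016-01-17']
print(articles)  # ['Body of article one.\n', 'Body of article two.\n']
```

Because the separator consumes the date and the title line, only the article bodies remain, and they line up one-to-one with the extracted dates.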

In [5]:
# Create list of Article, which will use for values in dictionary
article_tokenizer = RegexpTokenizer(r"\[\d{4}-\d{1,2}-\d{1,2}(?:.*?\n\n)", gaps=True,flags= re.DOTALL )
article = article_tokenizer.tokenize(raw_text)

# Create list of Date, which will use for key in dictionary
date_tokenizer = RegexpTokenizer(r"\[\d{4}-\d{1,2}-\d{1,2}", gaps=False)
dates = date_tokenizer.tokenize(raw_text)

In [6]:
# Clean the date data: strip the leading '[' and zero-pad month and day
dates = [w.strip('[') for w in dates] 
dates_split = [date.split('-') for date in dates] 
dates = [date[0]+'-'+date[1].zfill(2)+'-'+date[2].zfill(2) for date in dates_split]

In [7]:
# Create the dictionary and concatenate articles if there is more than one article on a day
daily_article = {}
for index in range(0,len(dates)):

  if dates[index] not in daily_article:
    daily_article[dates[index]] = article[index]

  else:
    daily_article[dates[index]] = daily_article[dates[index]] + article[index]

  

## 5. Tokenizing the Text

In this section, we define a function that converts each day's articles to lower case and tokenizes them into a list of words with RegexpTokenizer, using the regular expression given in the specification. We then apply the function to the preprocessed data from the previous section and store the result in a dictionary named tokenised_article.
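
To see what the specified pattern keeps and drops, here is a small sketch on a made-up sentence. It uses `re.findall` rather than RegexpTokenizer so it runs with the standard library only; for a pattern without capturing groups the two are assumed to give the same tokens.

```python
import re

# Tokenization pattern from the specification
TOKEN_RE = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

# Made-up sample sentence
sample = "Bitcoin's price hit $20,000; long-term holders weren't selling state-of-the-art rigs."

# Lower-case first, then extract tokens
tokens = re.findall(TOKEN_RE, sample.lower())
print(tokens)
# ["bitcoin's", 'price', 'hit', 'long-term', 'holders', "weren't",
#  'selling', 'state-of', 'the-art', 'rigs']
```

Numbers and punctuation are dropped, and the optional group allows at most one hyphen or apostrophe per token, which is why "state-of-the-art" splits into "state-of" and "the-art".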

In [8]:
# Define the function used for tokenisation
def tokenizeRawData(date):
    """
        This function tokenizes a raw text document.
    """
    # Convert text to lower case
    raw_article = daily_article[date].lower()

    # Define tokenisation rules
    tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?", gaps=False)

    # Tokenise the text
    tokenised_article = tokenizer.tokenize(raw_article)

    # Return tuple
    return (date,tokenised_article) 

# Create dict of daily article using date as keys
tokenised_article = dict(tokenizeRawData(date) for date in daily_article.keys())

In [9]:
# Check the size of vocab and token size
words_check = list(chain.from_iterable(tokenised_article.values()))
vocab_check = set(words_check)
lexical_diversity = len(words_check)/len(vocab_check)
print ("Vocabulary size: ",len(vocab_check),"\nTotal number of tokens: ", len(words_check), \
"\nLexical diversity: ", lexical_diversity)

Vocabulary size:  10890 
Total number of tokens:  89302 
Lexical diversity:  8.200367309458219


## 6. Generate 200 meaningful Bigrams (Collocations)

In this part, we generate the 200 bigram collocations from the tokenized articles above.

To do this, we first concatenate all tokens with chain.from_iterable. Then BigramAssocMeasures(), BigramCollocationFinder.from_words(all_words) and bigram_finder.nbest(bigram_measures.pmi, 200) are applied to find the 200 highest-scoring bigrams under the PMI measure.
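
For intuition about what the PMI score rewards, the sketch below computes it by hand on a made-up token list, using the usual definition PMI(x, y) = log2(count(x, y) * N / (count(x) * count(y))). The helper `pmi_scores` is hypothetical and only approximates what NLTK computes; NLTK's exact counting may differ in details.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Return a dict mapping each adjacent word pair to its PMI score."""
    n = len(tokens)
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    return {
        (x, y): math.log2(count * n / (unigram[x] * unigram[y]))
        for (x, y), count in bigram.items()
    }

# Made-up token stream
tokens = "smart contract on chain smart contract on hold the chain".split()
scores = pmi_scores(tokens)

# Highest-scoring pair
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))                 # ('hold', 'the') 3.32
print(round(scores[("smart", "contract")], 2))      # 2.32
```

Note that the pair seen only once, built from two words that each occur once, outranks the pair seen twice: PMI tends to favour rare word pairs, which is worth keeping in mind when reading the top 200 list.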

In [10]:
# Create list of all words
all_words = list(chain.from_iterable(tokenised_article.values()))

# Find top 200 bigram using PMI
bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(all_words)
top_200_bigrams = bigram_finder.nbest(bigram_measures.pmi, 200) 


Next, we use MWETokenizer to retokenize the text so that word pairs belonging to the 200 bigrams are kept together as single tokens.
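
The effect of this retokenization can be sketched without NLTK. The function `merge_bigrams` below is a hypothetical stand-in that scans left to right and joins matching pairs with an underscore, which is assumed to match MWETokenizer's default separator; it is not the library's implementation.

```python
def merge_bigrams(tokens, bigrams, separator="_"):
    """Join adjacent tokens that form a known bigram into a single token."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in bigrams:
            # Emit the pair as one token and skip both words
            merged.append(separator.join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Made-up example
bigrams = {("smart", "contract")}
result = merge_bigrams(["the", "smart", "contract", "runs"], bigrams)
print(result)  # ['the', 'smart_contract', 'runs']
```

Each merge replaces two tokens with one, which is consistent with the total token count dropping slightly after this step.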

In [11]:
# Combine bigram to the tokens
mwetokenizer = MWETokenizer(top_200_bigrams)
tokenised_colloc_articles =  dict((date, mwetokenizer.tokenize(article)) for date,article in tokenised_article.items())

In [12]:
# Check the size of vocab and token size

words_check = list(chain.from_iterable(tokenised_colloc_articles.values()))
vocab_check = set(words_check)
lexical_diversity = len(words_check)/len(vocab_check)
print ("Vocabulary size: ",len(vocab_check),"\nTotal number of tokens: ", len(words_check), \
"\nLexical diversity: ", lexical_diversity)

Vocabulary size:  10706 
Total number of tokens:  89118 
Lexical diversity:  8.324117317392117


## 7. Removing Context-Independent Stopwords & Tokens with Length Less than 3

In this part, we remove context-independent stopwords from our vocabulary, since they are words that are commonly found in most documents. To do this, we load the given list of stopwords from stopwords_en.txt and filter those words out, along with tokens that have a length of less than 3 (as given in the specification).
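
A minimal sketch of the two filters on made-up data is shown below. The stopword set here is a tiny hypothetical sample, not the contents of stopwords_en.txt.

```python
# Hypothetical stopword sample (the real list comes from stopwords_en.txt)
stopwords = {"the", "of", "is"}

# Made-up tokens for one day, including a merged bigram
tokens = ["the", "price", "of", "btc", "is", "up", "smart_contract"]

# Drop stopwords first, then drop tokens shorter than 3 characters
no_stop = [w for w in tokens if w not in stopwords]
kept = [w for w in no_stop if len(w) >= 3]

print(no_stop)  # ['price', 'btc', 'up', 'smart_contract']
print(kept)     # ['price', 'btc', 'smart_contract']
```

Using a set for the stopwords keeps each membership test constant-time, which matters when filtering tens of thousands of tokens.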

In [13]:
# Load the given context-independent stopwords file
with open('/content/drive/Shareddrives/FIT5196-s2-2021-tutorials/Assessment 1/Task2 datasets/stopwords_en.txt','r') as infile:
    context_ind_stopwords = infile.read().splitlines()

# Convert into set of context_ind_stopwords
context_ind_stopwords = set(context_ind_stopwords)

# Initialise new dictionary
tokenised_colloc_nonstop_articles ={}

# Remove context independent stop word and tokens that have length less than 3
for key in tokenised_colloc_articles.keys() :
  tokenised_colloc_nonstop_articles[key] = [w for w in tokenised_colloc_articles[key] if w not in context_ind_stopwords]
  tokenised_colloc_nonstop_articles[key] = [w for w in tokenised_colloc_nonstop_articles[key] if len(w)>=3]


In [14]:
# Check the size of vocab and token size
words_check = list(chain.from_iterable(tokenised_colloc_nonstop_articles.values()))
vocab_check = set(words_check)
lexical_diversity = len(words_check)/len(vocab_check)
print ("Vocabulary size: ",len(vocab_check),"\nTotal number of tokens: ", len(words_check), \
"\nLexical diversity: ", lexical_diversity)

Vocabulary size:  10140 
Total number of tokens:  46313 
Lexical diversity:  4.567357001972387


## 8. Removing Context-Dependent Stopwords and Rare Tokens (Less than 10 days) in Unigram


The requirements ask us to exclude context-dependent stopwords and rare tokens (appearing in fewer than 10 days) but to include the 200 meaningful bigrams. We therefore assume that stopword and rare-token removal applies to unigrams only.

First, we build a list containing each day's unique tokens, as shown in the code, and apply FreqDist to obtain the document frequency of each word.

Then we use the document frequencies to remove words that are context-dependent stopwords (document frequency greater than ceil(number of days / 2)) or rare tokens (document frequency less than 10), while keeping the bigrams in the vocabulary.
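
The thresholds can be illustrated on a toy corpus. The sketch below is an assumption-laden illustration: `document_frequency_filter` is a hypothetical helper, the four "days" are made up, and the rare-token threshold is lowered from 10 to 2 so that the tiny example has something to show. `collections.Counter` stands in for FreqDist.

```python
import math
from collections import Counter

def document_frequency_filter(docs, min_days, protected=frozenset()):
    """Drop tokens that appear in too many or too few days, except protected ones."""
    upper = math.ceil(len(docs) / 2)
    # Count each token once per day by converting each day's tokens to a set
    df = Counter(w for tokens in docs.values() for w in set(tokens))
    too_common = {w for w, f in df.items() if f > upper} - protected
    too_rare = {w for w, f in df.items() if f < min_days} - protected
    drop = too_common | too_rare
    filtered = {day: [w for w in tokens if w not in drop] for day, tokens in docs.items()}
    return filtered, too_common, too_rare

# Made-up corpus: 4 days, so the upper threshold is ceil(4 / 2) = 2
docs = {
    "d1": ["bitcoin", "price", "rise", "hard_fork"],
    "d2": ["bitcoin", "price", "fall"],
    "d3": ["bitcoin", "wallet", "rise"],
    "d4": ["bitcoin", "wallet", "hack"],
}
filtered, too_common, too_rare = document_frequency_filter(
    docs, min_days=2, protected={"hard_fork"}
)
print(too_common)      # {'bitcoin'}
print(sorted(too_rare))  # ['fall', 'hack']
print(filtered["d1"])  # ['price', 'rise', 'hard_fork']
```

The protected bigram survives even though it appears on a single day, mirroring the assumption above that the frequency filters apply to unigrams only.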

In [15]:
# List of each document unique word
words = list(chain.from_iterable([set(value) for value in tokenised_colloc_nonstop_articles.values()]))

# Find the document frequency in each word
doc_freq = FreqDist(words)

# Create the list of context dependent stopwords that we want to remove
context_dependent = [word for word,freq in doc_freq.items() if freq > math.ceil(len(tokenised_colloc_nonstop_articles)/2)]
context_dependent = set(context_dependent)

# Create the list of rare word that we want to remove
rare_word = [word for word,freq in doc_freq.items() if freq < 10]
rare_word = set(rare_word)

# Create the list of 200 bigrams obtained above
top_200_bigrams_list = ['_'.join(word) for word in top_200_bigrams]
top_200_bigrams_set = set(top_200_bigrams_list)

In [16]:
# Create sets of context-dependent stopwords and rare words to remove from the vocab (excluding bigrams)
remove_context_dependent = context_dependent-top_200_bigrams_set
remove_rare_word = rare_word-top_200_bigrams_set

In [17]:
# Initialise the new dict
tokenised_colloc_nonstop_removedocfreq_articles = {}

# Remove Context-Dependent Stopwords and Rare Tokens
for key in tokenised_colloc_nonstop_articles.keys() :
  tokenised_colloc_nonstop_removedocfreq_articles[key] = [w for w in tokenised_colloc_nonstop_articles[key] if w not in remove_context_dependent]
  tokenised_colloc_nonstop_removedocfreq_articles[key] = [w for w in tokenised_colloc_nonstop_removedocfreq_articles[key] 
                                                                       if w not in remove_rare_word]

In [18]:
# Check the size of vocab and token size
words_check = list(chain.from_iterable(tokenised_colloc_nonstop_removedocfreq_articles.values()))
vocab_check = set(words_check)
lexical_diversity = len(words_check)/len(vocab_check)
print ("Vocabulary size: ",len(vocab_check),"\nTotal number of tokens: ", len(words_check), \
"\nLexical diversity: ", lexical_diversity)

Vocabulary size:  904 
Total number of tokens:  23481 
Lexical diversity:  25.974557522123895


## 9. Stemming Unigram Using Porter Stemmer

In this part, we use the Porter stemming algorithm to stem the unigrams, so that words with similar meanings but different forms are grouped together. The bigrams are left unstemmed.
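
The pattern of stemming unigrams while leaving bigrams untouched can be sketched as follows. To keep the example free of dependencies, `toy_stem` is a crude, hypothetical suffix stripper and not the Porter algorithm; in the notebook its role is played by PorterStemmer().stem.

```python
def toy_stem(word):
    """Very rough stand-in for a stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def stem_unigrams(tokens, protected, stem=toy_stem):
    """Stem every token except those in the protected set (the bigrams)."""
    return [w if w in protected else stem(w) for w in tokens]

# Made-up tokens; 'hard_forks' plays the role of a protected bigram
tokens = ["hacking", "hacked", "hacks", "hard_forks"]
stemmed = stem_unigrams(tokens, protected={"hard_forks"})

print(stemmed)                            # ['hack', 'hack', 'hack', 'hard_forks']
print(len(set(tokens)), len(set(stemmed)))  # 4 2
```

Stemming maps tokens one-to-one, so the number of tokens stays the same while the vocabulary shrinks, which is the behaviour to expect in the size check after this step.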

In [19]:
# Use Porter Stemming algorithm
stemmer = PorterStemmer()

# Initialise the new dict
tokenised_colloc_nonstop_removedocfreq_stem_articles = {}

# Do stemming in unigram
for date,words in tokenised_colloc_nonstop_removedocfreq_articles.items() :
  tokenised_colloc_nonstop_removedocfreq_stem_articles[date] = [stemmer.stem(w) if w not in top_200_bigrams_set else w for w in words  ]

In [20]:
# Check the size of vocab and token size
words_check = list(chain.from_iterable(tokenised_colloc_nonstop_removedocfreq_stem_articles.values()))
vocab_check = set(words_check)
lexical_diversity = len(words_check)/len(vocab_check)
print ("Vocabulary size: ",len(vocab_check),"\nTotal number of tokens: ", len(words_check), \
"\nLexical diversity: ", lexical_diversity)

Vocabulary size:  751 
Total number of tokens:  23481 
Lexical diversity:  31.26631158455393


## 10. Creating a sparse matrix using CountVectorizer

Finally, we create the sparse matrix using CountVectorizer and generate the final outputs: the count vector for each date in the 31258301_countVec.txt file and the list of vocabulary in the 31258301_vocab.txt file.
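
The shape of the two output files can be sketched without scikit-learn. In the illustration below, a sorted vocabulary and `collections.Counter` stand in for CountVectorizer, the two days of tokens are made up, and the line layouts (`date,index:count,...` and `word:index`) are taken from the writing code in this section rather than from the sample files.

```python
from collections import Counter

# Made-up preprocessed tokens per day
docs = {
    "2018-06-30": ["price", "rise", "price"],
    "2018-07-01": ["wallet", "price"],
}

# Alphabetically sorted vocabulary and its word -> index mapping
vocab = sorted({w for tokens in docs.values() for w in tokens})
index = {w: i for i, w in enumerate(vocab)}
vocab_lines = [f"{w}:{i}" for w, i in index.items()]

# One line per day, listing only the non-zero counts
count_lines = []
for day, tokens in docs.items():
    counts = Counter(index[w] for w in tokens)
    cells = ",".join(f"{j}:{c}" for j, c in sorted(counts.items()))
    count_lines.append(f"{day},{cells}")

print(vocab_lines)  # ['price:0', 'rise:1', 'wallet:2']
print(count_lines)  # ['2018-06-30,0:2,1:1', '2018-07-01,0:1,2:1']
```

Only non-zero entries are written, which is what makes the representation sparse: a day's line grows with the number of distinct words it contains, not with the size of the vocabulary.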

In [21]:
# Define the splitting function (tokens are already separated by whitespace)
# Source : https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
def spliting(string):
    return string.split()

# Use CountVectorizer function 
vectorizer = CountVectorizer(tokenizer = spliting) 

# Create data_features using the fit_transform function 
data_features = vectorizer.fit_transform([' '.join(article) for date,article in tokenised_colloc_nonstop_removedocfreq_stem_articles.items()])
print (data_features.shape)

(394, 751)


In [22]:
# Write the 31258301_countVec.txt 

with open('31258301_countVec.txt','w') as file:

    # Convert to coordinate (COO) format
    cx = data_features.tocoo() 

    # Dates in insertion order, matching the row order of the sparse matrix
    date_list = list(tokenised_colloc_nonstop_removedocfreq_stem_articles.keys())

    # Initialise this variable to determine whether to start a new line or not
    var = -1

    # Iterate through all coordinates in the sparse matrix
    for i,j,v in zip(cx.row, cx.col, cx.data):

        # If the row changes, write a new line and the new date
        if var != i:
            if (var != -1):
                file.write("\n") 
            file.write(date_list[i])

            # Assign i to var
            var = i
        # Write the count of the indexed word
        file.write(","+str(j)+":"+str(v))

In [23]:
# Write the 31258301_vocab.txt 

# Obtain vocabs (in scikit-learn 1.2 and later, use get_feature_names_out() instead)
vocab = vectorizer.get_feature_names()

# Write vocab to output file
with open('31258301_vocab.txt','w') as file:
  for index, word in enumerate(vocab):
    file.write(word+":"+str(index)+"\n")

## Summary

In this assignment, we performed several steps to convert the text from the given PDF file into a vocabulary list and count vectors, which are written to text files. The steps are:

1. **Introduction** - Understand the requirements of this assignment
2. **Importing libraries** - Import the libraries used for this task
3. **Examining and loading data** - Load the data from the given PDF file
4. **Creating a dictionary with dates as keys and each day's articles as values** - Convert the loaded text into a dictionary whose keys represent dates and whose values represent each day's articles
5. **Tokenizing the Text** - Convert all words to lower case and tokenize the text into lists in the dictionary created in the previous step
6. **Generate 200 meaningful Bigrams (Collocations)** - Find the first 200 meaningful bigrams using the PMI measure and retokenize using MWETokenizer
7. **Removing Context-Independent Stopwords & Tokens with Length Less than 3** - Remove the given context-independent stopwords and the words with a length of less than 3 from our vocabulary
8. **Removing Context-Dependent Stopwords and Rare Tokens (Less than 10 days) in Unigram** - Remove the unigrams that are context-dependent stopwords (occurring in more than ceil(number of days / 2) days) or rare tokens (occurring in fewer than 10 days)
9. **Stemming Unigram Using Porter Stemmer** - Stem the unigrams using the Porter stemmer
10. **Creating a sparse matrix using CountVectorizer** - Create the sparse matrix using CountVectorizer and generate the final outputs, which are 31258301_countVec.txt and 31258301_vocab.txt