# FIT5196 Assessment 2
## Text Pre-Processing & Feature Generation

#### Student Name: Akshatha Shivashankar Chindalur
#### Student ID: 29996503
#### Student Name: Pradnya Alchetti
#### Student ID: 29595916

Date: 15/09/2019

Version: 1.0

Environment: Python 3.7.3 and Jupyter notebook

Libraries used: 

* pdfminer (for extracting information in PDFs, included in Anaconda Python 3.7) 
* nltk (for natural language processing, included in Anaconda Python 3.7) 
* re (for regular expressions, included in Anaconda Python 3.7) 
* requests (for downloading the papers from their URLs)
* multiprocessing (for parsing the PDF files in parallel)
* pandas (for writing the statistics to a CSV file)
* scikit-learn (for generating the sparse count vectors with CountVectorizer)

## Task 1: Generating a sparse matrix for Paper Bodies

### Importing the libraries necessary for generation of the sparse matrix.

In [1]:
# The required libraries are imported below.

import os
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import multiprocessing as mp
import glob
import re
import requests
import collections
import pandas as pd
import nltk
import nltk.data
from nltk.util import ngrams
from nltk.probability import *
from nltk.tokenize import RegexpTokenizer 
from nltk.tokenize import MWETokenizer
from nltk.stem import PorterStemmer
nltk.download('punkt')
from itertools import chain
from functools import partial
from sklearn.feature_extraction.text import CountVectorizer

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/akshathacs/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Generating the corpus of papers as PDF files.

The given dataset containing 200 URLs of published papers is read from a PDF file. From these URLs, the corresponding papers are downloaded into the **papers** directory (folder).

In [2]:
#   This function downloads the paper from its respective URL as a PDF file. The downloaded file
#   is stored in the directory (folder) papers, under the file name given in the
#   'content-disposition' response header.
#
#   :param url:    the URL (link) of the website from which the published paper needs to be downloaded.
#
#   :return:    None (the PDF file is written to disk).
 
def write_to_pdf(url):
    request_pdf = requests.get(url)
    name = request_pdf.headers['content-disposition']
    f_name = re.search("filename=\"(.*)\"", name)

    with open('papers/'+ f_name.group(1),'wb') as f_pdf:
        f_pdf.write(request_pdf.content)   
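
The file name lookup above assumes that every response carries a `content-disposition` header with a quoted `filename`. The following is a minimal sketch of that extraction step on its own, with a fallback when the pattern is absent. The helper name `filename_from_header` and the sample header values are hypothetical and used only for illustration.

```python
import re

def filename_from_header(header_value):
    """Return the quoted filename in a Content-Disposition value, or None."""
    match = re.search(r'filename="(.*)"', header_value)
    return match.group(1) if match else None

# Hypothetical header values, for illustration only.
print(filename_from_header('attachment; filename="PP1234.pdf"'))
print(filename_from_header('inline'))
```

Returning `None` instead of raising lets the caller decide whether to skip the URL or fall back to another file name.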

In [3]:
#   This function extracts the contents from the given PDF file
#
#   :param pdf_path:    the path of the PDF file (paper) from which the data needs to be extracted.
#
#   :return contents:   a string (contents of the paper) obtained after processing the PDF file.

def retrieve_from_pdf(pdf_path):
    
    resource_manager = PDFResourceManager()
    file_fail_handle = StringIO()
    codec = 'utf-8'
    params = LAParams()
    converter = TextConverter(resource_manager, file_fail_handle, codec=codec, laparams=params)
    fp = open(pdf_path, 'rb')
    pdf_interpreter = PDFPageInterpreter(resource_manager, converter)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        pdf_interpreter.process_page(page)

    contents = file_fail_handle.getvalue()

    fp.close()
    converter.close()
    file_fail_handle.close()
    
    return contents

#### Step 1: Obtain 200 URLs from the given dataset.

In [4]:
# 200 URLs from the given PDF file are obtained
dataset = retrieve_from_pdf('./Group113.pdf')

#### Step 2: Download papers from their respective URLs as PDF files.  

In [5]:
# a directory called 'papers' to store the downloaded PDFs is created (if not present already)

if not os.path.exists('./papers'):
    os.mkdir('./papers')
    
urls = re.findall('https:.*', dataset)

for each_url in urls:
    write_to_pdf(each_url)

### Sparse Feature Generation

#### 1. Obtain the content of the 200 papers in the corpus.

In [6]:
def get_raw_text(file_path):
    
    # obtain the name of the document being parsed
    name_of_file = os.path.splitext(os.path.basename(file_path))[0]
    
    # the contents from the PDF file is obtained.
    pdf_text = retrieve_from_pdf(file_path)
    
    return name_of_file, pdf_text

We first retrieve the content of each PDF file in the corpus as text. The work is spread across a pool of 3 worker processes running in parallel, which speeds up the extraction.

In [7]:
# building a pool of 3 processes
pool = mp.Pool(processes = 3) 

# obtaining the list of file names from the 'papers' directory.
filenames = glob.glob('./papers/*.pdf')

# dictionary of raw contents obtained from parsing the PDF files of papers.
raw_text = dict(pool.map(get_raw_text, filenames))

In [8]:
#   This function retrieves only the content of the paper that matches the given regular expression.
#   It also performs certain pre-processing steps that clean up the text obtained from the PDF extractor.
#
#   :param pdf_text:    the PDF text (or paper) that needs to be parsed
#   :param regex:       the regular expression whose first group captures the required content
#
#   :return paper_content:   a string of the paper content obtained after applying the regular expression and 
#   after the initial cleaning process.

def get_paper_content(pdf_text, regex):
    
    # from the extracted PDF text, only the content captured by the first group of the
    # regular expression is retrieved
    paper_content = re.search(regex, pdf_text).group(1)
    
    # some of the words continue onto the next line. These words contain a '-' between them, 
    # which can be identified with the regular expression '-\\n'. Thus, this is replaced with
    # an empty string such that the word is whole again. For instance, the word 'man-\nually' 
    # becomes 'manually' and the word 'compar-\nisons' becomes 'comparisons'.
    paper_content = re.sub('-\\n','', paper_content)
    
    # every new page begins with the special character '\x0c' (form feed). This is removed by
    # replacing it with an empty string.
    paper_content = re.sub('\\x0c', '', paper_content)
    
    # the page numbers can be identified with the regular expression '\\n(\d+)\\n\\n' and
    # are removed by replacing each match with an empty string.
    paper_content = re.sub('\\n(\d+)\\n\\n','',paper_content)
    
    # all the single new line characters are replaced with a space.
    paper_content = re.sub('\n',' ', paper_content)
    
    # lastly, single or multiple inline references such as [1] or [22, 23] can be identified 
    # with the regular expression '(\s\[(\d+)(,\s*\d+)*\])' and are replaced by an empty 
    # string.
    paper_content = re.sub('(\s\[(\d+)(,\s*\d+)*\])','', paper_content)


    return paper_content
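
To make the effect of the cleaning steps concrete, here is a small sketch that applies the same substitutions, in the same order, to a made-up string. The function name `clean_extracted_text` and the sample text are assumptions for illustration; the patterns are the ones used in `get_paper_content`, written as raw strings.

```python
import re

def clean_extracted_text(text):
    text = re.sub(r'-\n', '', text)            # re-join words hyphenated across lines
    text = re.sub(r'\x0c', '', text)           # drop form-feed page breaks
    text = re.sub(r'\n(\d+)\n\n', '', text)    # drop stand-alone page numbers
    text = re.sub(r'\n', ' ', text)            # remaining newlines become spaces
    text = re.sub(r'(\s\[(\d+)(,\s*\d+)*\])', '', text)  # drop inline references
    return text

# Made-up sample containing a hyphenated line break and two inline references.
sample = 'Labels were assigned man-\nually [1] and checked\nagain [22, 23].'
print(clean_extracted_text(sample))
# Labels were assigned manually and checked again.
```

The order matters: the hyphen and page-number patterns rely on newline characters, so they have to run before the newlines are replaced with spaces.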

#### 2. Sentence Segmentation: tokens are normalised to lower case except the ones appearing in the middle of a sentence.

In [9]:
#   This function segments the given string of text into sentences.
#
#   :param text:    the pre-processed body of the paper from the downloaded PDF files.
#
#   :return sentences:   an array of strings - each a sentence from the paper body.

def get_sentences(text):
    
    detector = nltk.data.load('tokenizers/punkt/english.pickle')
    sentences = detector.tokenize(text.strip())
    
    return sentences

#   This function lower-cases the first character of every sentence, so that capitalised
#   tokens appearing in the middle of a sentence keep their case.
#
#   :param raw_text:    the cleansed body of the paper after being extracted from 
#                       its respective PDF file.
#
#   :return normalise_sentence:   an array of strings - each a sentence whose first character
#                                 has been normalised to lower case.

def case_normalisation(raw_text):
    
    sentences = get_sentences(raw_text)
    
    normalise_sentence = []
    
    for sentence in sentences:
        normalise_sentence.append(sentence.replace(sentence[0],sentence[0].lower(),1))

    return normalise_sentence
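
The normalisation rule can be illustrated without the sentence detector. The sketch below assumes the sentences have already been split (the notebook obtains them from punkt) and uses a hypothetical helper name, `lowercase_sentence_start`; the slicing form also tolerates an empty string.

```python
def lowercase_sentence_start(sentences):
    """Lower-case only the first character of each sentence."""
    return [s[:1].lower() + s[1:] for s in sentences]

# Sentences are given pre-split here; the sample text is made up.
sample = ['The model uses an HMM.', 'Results from Barcelona are reported.']
print(lowercase_sentence_start(sample))
# ['the model uses an HMM.', 'results from Barcelona are reported.']
```

Tokens such as `HMM` and `Barcelona` keep their capitalisation because only the sentence-initial character is changed.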

#### 3. Word Tokenization: using the regular expression "[A-Za-z]\w+(?:[-'?]\w+)?"

In [10]:
#   This function tokenises the text based on the regular expression "[A-Za-z]\w+(?:[-'?]\w+)?".
#
#   :param text:    the text which needs to be tokenized.
#
#   :return tokens:   an array of unigram tokens (strings).

def get_tokens(text):
    
    tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?")
    tokens = tokenizer.tokenize(text)
    
    return tokens

#   This function generates a list of unigram tokens that contain only alphabetic characters for 
#   a document (or paper) in the corpus. Tokens containing digits, hyphens or apostrophes are dropped.
#
#   :param normalised_text:    the list of normalised sentences which need to be tokenized.
#
#   :return doc_tokens:   a list of alphabetic tokens for the respective document.

def tokenize(normalised_text):
    
    doc_tokens = []
    
    for sentence in normalised_text:
        tokens = get_tokens(sentence)
        doc_tokens.extend(tokens)
        
    doc_tokens = [token for token in doc_tokens if token.isalpha()]
    
    return doc_tokens
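
The behaviour of the tokenisation pattern is easiest to see on a short example. The sketch below uses `re.findall` from the standard library in place of `RegexpTokenizer`, on the assumption that both return the non-overlapping matches of the pattern; the sample sentence is made up.

```python
import re

PATTERN = r"[A-Za-z]\w+(?:[-'?]\w+)?"

sample = "State-of-the-art models don't use a CNN."

tokens = re.findall(PATTERN, sample)
print(tokens)
# ['State-of', 'the-art', 'models', "don't", 'use', 'CNN']

# keep only purely alphabetic tokens, as tokenize() does
alphabetic = [t for t in tokens if t.isalpha()]
print(alphabetic)
# ['models', 'use', 'CNN']
```

Two consequences are worth noting: the pattern needs at least two word characters, so the single-letter token `a` never matches, and the `isalpha()` filter discards the hyphenated and apostrophised matches entirely rather than splitting them.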

In order to generate the sparse matrix, firstly each document in the corpus has to be tokenised individually.

In [11]:
tokenized_data = {}

for paper in raw_text:
    paper_body = get_paper_content(raw_text[paper], 'Paper Body([\s\S]*)\d+\sReferences')
    normalised_text = case_normalisation(paper_body)
    tokenized_data[paper] = tokenize(normalised_text)

In [12]:
# generating a corpus of tokens from all the 200 documents in the 'papers' directory

combine_tokens = []

for each_doc in tokenized_data.values():
    combine_tokens.extend(each_doc)

#### 4. Bigrams: extracting 200 meaningful bigrams from the above generated token corpus.

In [13]:
# obtain all the possible bigrams from the token corpus
all_bigrams = ngrams(combine_tokens, 2)

# calculating the frequency of each bigram within the corpus
bigram_freq = collections.Counter(all_bigrams)

# obtaining the 3000 most frequent bigrams (stored in 'bigram_1000', despite its name). 
# From these 3000 bigrams only the top 200 bigrams (the ones that do not contain the context 
# independent stopwords and tokens of length less than 3) are retained.
bigram_1000 = bigram_freq.most_common(3000)

In [14]:
# reading the given file, 'stopwords_en.txt', which contains the context independent 
# stop words. The 'with' block closes the file once it has been read.
with open('stopwords_en.txt', 'r') as stopwords_file:

    # storing the context independent stop words in a set for fast membership tests
    context_independent = set(stopwords_file.read().split('\n'))

In [15]:
#   This function removes the context independent stop words from the list of tokens
#
#   :param tokenized_data:    dictionary of tokens with context independent stop words
#
#   :return tokenized_data:    dictionary of tokens without context independent stop words

def filter_stopwords(tokenized_data):
    
    for each in tokenized_data:
        tokenized_data[each] = [w for w in tokenized_data[each] if w.lower() not in context_independent]
    
    return tokenized_data

In [16]:
# extracting the 200 most meaningful bigrams

# if any one of the words in the bigram is a context independent stop word, it 
# is removed from the final list
bigram_200 = [(bigram[0],bigram[1]) for bigram, freq in bigram_1000 
              if (bigram[0].lower() not in context_independent and bigram[1].lower() not in context_independent)] 

# if either of the words in a bigram is of length less than 3, then that bigram
# is removed from the list
bigram_200 = [(bigram[0],bigram[1]) for bigram in bigram_200 if (len(bigram[0]) >= 3) 
              and (len(bigram[1]) >= 3)]

# finally only the most frequent 200 bigrams from the final list are retained 
# for further processing.
bigram_200 = bigram_200[0:200]
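
The bigram selection can be sketched on a toy token list using only the standard library. `zip(tokens, tokens[1:])` is used here as a stand-in for `nltk.util.ngrams(tokens, 2)`; the stop-word set and the tokens are hypothetical.

```python
from collections import Counter

tokens = ['neural', 'network', 'of', 'the', 'neural', 'network', 'is', 'deep']
stopwords = {'of', 'the', 'is'}   # hypothetical stop-word list

# frequency of every adjacent token pair
bigram_freq = Counter(zip(tokens, tokens[1:]))

def is_meaningful(bigram):
    """Both words must be non-stop-words of length 3 or more."""
    return all(w.lower() not in stopwords and len(w) >= 3 for w in bigram)

# most frequent first, then filtered, then cut to the required size
meaningful = [bg for bg, _ in bigram_freq.most_common() if is_meaningful(bg)]
print(meaningful[:200])
# [('neural', 'network')]
```

Because the notebook builds its bigrams from the concatenated token list of all documents, pairs can also span sentence and document boundaries; this sketch behaves the same way.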

#### 5. Context Independent and Dependent Stop Words 

In [17]:
# the meaningful 200 bigrams are included in the corpus of tokens for each document. The bigrams are 
# joined with a '__' (double underscore) between them.

mwe_tokenizer = MWETokenizer(bigram_200, separator='__')

uni_bigram = {}

for doc,tokens in tokenized_data.items():
    uni_bigram[doc] = mwe_tokenizer.tokenize(tokens)
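
For readers unfamiliar with multi-word expression tokenisation, the following dependency-free sketch shows the kind of merge that is expected for bigrams. It is a simplified stand-in, not NLTK's implementation; the function name `merge_bigrams` is hypothetical.

```python
def merge_bigrams(tokens, bigrams, separator='__'):
    """Greedy left-to-right merge of adjacent tokens that form a known bigram."""
    known = set(bigrams)
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in known:
            merged.append(tokens[i] + separator + tokens[i + 1])
            i += 2          # both words are consumed by the merged token
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_bigrams(['deep', 'neural', 'network', 'model'], [('neural', 'network')]))
# ['deep', 'neural__network', 'model']
```

The double underscore is convenient later on: it cannot occur in the alphabetic unigrams, so a simple `'__' in token` test is enough to tell bigrams from unigrams.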

In [18]:
# from each of the tokenised documents the context independent stop words are removed.

tokenized_data_no_stop = filter_stopwords(uni_bigram)

In [19]:
# the document frequency (the number of documents a word appears in) is computed

# a list of unique tokens from each document. This will help in identifying the number of
# documents a word appears in.
unique_tokens = list(chain.from_iterable([set(token) for token in tokenized_data_no_stop.values()]))

# the frequency of each word is determined
freq_utokens = FreqDist(unique_tokens)

# since the threshold is 95% of 200 (documents) = 190, the words that appear in 190 
# or more documents are treated as context dependent stop words.
most_freq_tokens = [key for key, value in freq_utokens.items() if value >= 190]

In [20]:
# both the context dependent and independent stop words are removed from the initial corpus
# of tokens.
no_stop_tokens = {}

for key in tokenized_data_no_stop:
    no_stop_tokens[key] = [token for token in tokenized_data_no_stop[key] if token.lower() not in most_freq_tokens]

#### 6. Tokens with length less than 3 are removed

In [21]:
# the tokens which are of length less than 3 are removed from the token corpus.

length_three = {}

for key in no_stop_tokens:
    length_three[key] = [token for token in no_stop_tokens[key] if (len(token) >= 3)]

#### 7. Rare Tokens

In [22]:
# since the threshold for rare tokens to be removed is 3% of 200 documents = 6,
# the tokens that appear in 6 or fewer number of documents are removed

rare_tokens = [key for key, value in freq_utokens.items() if value <= 6]

del_rare_tokens = {}
for key in length_three:
    del_rare_tokens[key] = [token for token in length_three[key] if token not in rare_tokens]
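
Both the context-dependent stop words (step 5) and the rare tokens (step 7) are selected by document frequency. The sketch below shows the computation on a toy corpus of four documents. The function name and the 25% lower ratio are assumptions chosen so that the toy data produces a visible result; the notebook uses 95% and 3% of 200 documents, i.e. 190 and 6.

```python
from collections import Counter
from itertools import chain

def split_by_document_frequency(docs, max_ratio, min_ratio):
    """Return document frequencies plus the too-common and too-rare token sets."""
    n_docs = len(docs)
    # each document contributes a token at most once
    doc_freq = Counter(chain.from_iterable(set(t) for t in docs.values()))
    too_common = {w for w, df in doc_freq.items() if df >= max_ratio * n_docs}
    too_rare = {w for w, df in doc_freq.items() if df <= min_ratio * n_docs}
    return doc_freq, too_common, too_rare

docs = {
    'doc1': ['model', 'data', 'kernel'],
    'doc2': ['model', 'data', 'data'],
    'doc3': ['model', 'graph'],
    'doc4': ['model', 'data'],
}

doc_freq, too_common, too_rare = split_by_document_frequency(docs, 0.95, 0.25)
print(doc_freq['data'], too_common, too_rare)
```

Converting each document to a set before counting is what turns a term frequency into a document frequency: `data` occurs four times in total but in only three documents.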

#### 8. Stemming Unigram Tokens

In [23]:
#   This function separates unigrams and bigrams from the combined corpus
#   of each paper (document).
#
#   :param token_list:    combined list of unigrams and bigrams
#
#   :return unigram:   a list of unigram tokens.
#           bigram:    a list of bigram tokens.

def sep_uni(token_list):
    
    unigram = []
    bigram = []
    for token in token_list:
        if '__' in token:
            bigram.append(token)
        else:
            unigram.append(token)
            
    return unigram, bigram

#   This function performs stemming on the list of unigram tokens. Tokens like Barcelona, USA, HMM
#   were being translated to barcelona, usa and hmm respectively by the stemmer. The case of such
#   tokens is therefore restored by checking whether the original token is title-cased or upper-cased.
#
#   :param unigram_list:    list of unigram tokens to be stemmed
#
#   :return stem_uni:    list of stemmed unigram tokens

def stemming(unigram_list):
    
    ps = PorterStemmer()
    stem_uni = []
    
    for word in unigram_list:
        
        # check if the token is title-cased (first letter upper case, the rest lower case)
        if word.istitle():
            stem_uni.append(ps.stem(word).capitalize())
        # check if all the letters are in upper case
        elif word.isupper():
            stem_uni.append(ps.stem(word).upper())
        else:
            stem_uni.append(ps.stem(word))
    
    return stem_uni

stem_tokens = {}

# stemming is done for tokens from each document
for key in del_rare_tokens:
    
    # separate unigrams and bigrams
    uni, bi = sep_uni(del_rare_tokens[key])
    
    # obtain stemmed unigrams
    stemmed_uni = stemming(uni)
    
    # merge the stemmed unigrams and bigrams
    stem_tokens[key] = stemmed_uni + bi
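
The case-restoring logic is independent of the stemmer itself, so it can be illustrated with any function that maps a word to a lower-case stem. In the sketch below, `toy_stem` is a hypothetical stand-in that only lower-cases and drops a final `s`; it is not the Porter algorithm. Passing `PorterStemmer().stem` instead should give the notebook's behaviour.

```python
def stem_preserving_case(words, stem):
    """Stem each word, then restore title case or upper case from the original."""
    result = []
    for word in words:
        stemmed = stem(word)
        if word.istitle():
            result.append(stemmed.capitalize())
        elif word.isupper():
            result.append(stemmed.upper())
        else:
            result.append(stemmed)
    return result

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: lower-case and drop a final 's'."""
    word = word.lower()
    return word[:-1] if word.endswith('s') else word

print(stem_preserving_case(['Models', 'HMMS', 'HMMs', 'kernels'], toy_stem))
# ['Model', 'HMM', 'hmm', 'kernel']
```

The third token shows a limitation of this approach: a mixed-case word such as `HMMs` is neither title-cased nor fully upper-cased, so its stem stays in lower case.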

### Output 1: Vocabulary Index File

In [24]:
# defining a count vector for converting tokens into a sparse matrix representation
vectorizer = CountVectorizer(analyzer = "word", lowercase = False) 

# generating the sparse matrix for the pre-processed text
data_features = vectorizer.fit_transform([' '.join(value) for value in stem_tokens.values()])

In [25]:
# obtaining the features
features = vectorizer.get_feature_names()

# writing the features and its list index to the file 'Group113_vocab.txt'
with open('Group113_vocab.txt', "w+") as vocab:
    for i in range(len(features)):
        vocab.write(features[i] + ':')
        vocab.write(str(i)+'\n')

### Output 2: Sparse Count Vectors

In [26]:
# transferring the contents of the sparse matrix to the text file 'Group113_count_vectors.txt'
with open('Group113_count_vectors.txt', 'w+') as count_vector:
    
    # getting the list of paper names
    file_names = list(stem_tokens.keys())
    
    for i in range(data_features.shape[0]):
        
        # from each row in the sparse matrix only the non zero elements are obtained
        row, col = data_features[i].nonzero()
        
        # the contents of the sparse matrix are written to the file
        count_vector.write(file_names[i] + ',')
        for j in range(len(col) - 1):
            count_vector.write(str(col[j]) + ':')
            count_vector.write(str(data_features[i,col[j]])+',')
        count_vector.write(str(col[-1]) + ':')
        count_vector.write(str(data_features[i,col[-1]])+'\n')
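
The two output formats (`term:index` for the vocabulary and `name,index:count,...` for the count vectors) can be sketched without scikit-learn. The example below builds the lines in memory from a made-up two-document corpus; it assumes an alphabetically sorted vocabulary, which is also how `CountVectorizer` is expected to order its features.

```python
from collections import Counter

docs = {'PP1': ['model', 'data', 'model'], 'PP2': ['kernel', 'data']}  # hypothetical

# vocabulary: every distinct term gets the index of its sorted position
vocab = {term: idx for idx, term in enumerate(sorted(set().union(*docs.values())))}
vocab_lines = [f'{term}:{idx}' for term, idx in vocab.items()]

# one line per document, listing only the non-zero counts
count_lines = []
for name, tokens in docs.items():
    counts = Counter(vocab[t] for t in tokens)
    pairs = ','.join(f'{idx}:{counts[idx]}' for idx in sorted(counts))
    count_lines.append(f'{name},{pairs}')

print(vocab_lines)   # ['data:0', 'kernel:1', 'model:2']
print(count_lines)   # ['PP1,0:1,2:2', 'PP2,0:1,1:1']
```

Joining the pairs with `','.join(...)` avoids the special handling of the last element and also copes with a document that has no remaining tokens.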

## Task 2: Statistics Generation

In [27]:
# The abstract of each paper is required for feature generation. This is extracted (with the help
# of a regular expression) from the raw text obtained from parsing the PDF files.

# dictionary of tokenised documents with key as the file name and value as the list of tokens.
tokenized_data_abstract = {}

for paper in raw_text:
    
    # get the abstract of the paper for feature extraction
    # regular expression identifies the text between the key words Abstract and Paper Body
    paper_abstract = get_paper_content(raw_text[paper], 'Abstract([\s\S]*)\d+\sPaper Body')
    
    # tokens normalised to lower case excluding the ones in the middle of a 
    # sentence.
    normalised_text = case_normalisation(paper_abstract)
    
    # list of tokens obtained after processing the document are assigned to the respective document
    tokenized_data_abstract[paper] = tokenize(normalised_text)

In [28]:
# The title of each paper is required for feature generation. This is extracted (with the help
# of a regular expression) from the raw text obtained from parsing the PDF files.

# dictionary of tokenised documents with key as the file name and value as the list of tokens.
tokenized_data_title = {}

for paper in raw_text:
    
    # get the title of the paper for feature extraction
    # the regular expression identifies the group of text between A-Z or a-z or '(' and Authored by.
    paper_title = get_paper_content(raw_text[paper], '(^[A-Za-z(][\s\S]*)\\n\\nAuthored by')
    
    # tokens are all normalised to lowercase.
    normalised_text = paper_title.lower()
    
    # list of tokens obtained after processing the document are assigned to the respective document
    tokenized_data_title[paper] = get_tokens(normalised_text)

In [29]:
# The authors of each paper are required for feature generation. They are extracted (with the help
# of a regular expression) from the raw text obtained from parsing the PDF files.

# dictionary of tokenised documents with key as the file name and value as the list of authors 
# is generated.
authors_data = {}

for paper in raw_text:
    
    # from the processed PDF files, extract the authors on the basis of the regular expression.
    # regular expression: identifies the group that starts between Authored by and Abstract
    paper_author = re.search('Authored by:([\s\S]*)\\n\\nAbstract', raw_text[paper]).group(1)
    
    # split the retrieved names using '\n' to create a list of authors 
    authors = paper_author.split('\n')
    
    # remove empty strings from the list that were created due to splitting
    authors = list(filter(None, authors))
    
    authors_data[paper] = authors
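
The author extraction depends on the layout of the extracted text: an `Authored by:` line, one author per line, then a blank line followed by `Abstract`. The sketch below runs the same pattern on a made-up string in that assumed layout; the title and names are placeholders.

```python
import re

# Made-up text following the assumed layout of the extracted papers.
sample = 'A Toy Title\n\nAuthored by:\nAda Lovelace\nAlan Turing\n\nAbstract\nSome text.'

# everything between 'Authored by:' and the blank line before 'Abstract'
block = re.search(r'Authored by:([\s\S]*)\n\nAbstract', sample).group(1)

# one author per line; empty strings produced by the split are dropped
authors = [name for name in block.split('\n') if name]
print(authors)
# ['Ada Lovelace', 'Alan Turing']
```

Author names are kept as whole strings rather than tokenised, so that the later frequency count ranks authors and not individual first or last names.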

In [30]:
#   This function computes the frequency of each item across all documents and returns the
#   10 most frequent ones.
#
#   :param tokenized_data:    dictionary mapping each document to its list of tokens (or authors)
#          content_type:      type to which the tokens belong: 'abstract' or 'title' (stop words
#                             are filtered), anything else (e.g. authors) is left unfiltered
#
#   :return top_ten:    list of top 10 most frequent terms 

def get_most_common_words(tokenized_data, content_type):
    
    # Check content type and filter stopwords for abstract and title
    if content_type == 'abstract' or content_type == 'title':
        tokenized_data = filter_stopwords(tokenized_data)
    
    # create a list of words from all the documents
    words = list(chain.from_iterable(tokenized_data.values()))
    
    # retrieve the 10 most common words
    freq_dist = FreqDist(words)
    most_common = freq_dist.most_common(10)
    
    # Create a list of top 10 common words
    top_ten = []
    for word in most_common:
        top_ten.append(word[0])
    
    return top_ten

In [31]:
# get top 10 most occurring terms in abstract
top_abstract = get_most_common_words(tokenized_data_abstract,'abstract')

# get top 10 most occurring terms in title
top_title = get_most_common_words(tokenized_data_title,'title')

# get top 10 authors
top_author = get_most_common_words(authors_data,'')
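
The counting step can be reproduced with `collections.Counter`, which offers the same `most_common` method as the `FreqDist` used above. The sketch below is a minimal version without the stop-word filtering; `top_terms` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import chain

def top_terms(per_document_items, n=10):
    """Return the n most frequent items across all documents."""
    counts = Counter(chain.from_iterable(per_document_items.values()))
    return [item for item, _ in counts.most_common(n)]

# Hypothetical author lists for two papers.
authors = {'PP1': ['Ada Lovelace', 'Alan Turing'], 'PP2': ['Alan Turing']}
print(top_terms(authors, n=1))
# ['Alan Turing']
```

Note that the result can be shorter than `n` when fewer distinct items exist, in which case the three lists would no longer have equal length when combined into a single data frame.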

In [32]:
# create a dataframe for the statistics
stats_data = {'top10_terms_in_abstracts':top_abstract,'top10_terms_in_titles':top_title,'top10_authors':top_author}

data_frame = pd.DataFrame(stats_data)

# write to csv
data_frame.to_csv("Group113_stats.csv", encoding='utf-8', index=False)

## 3. Summary

This code performs two tasks of text-processing:
1. Sparse Feature Generation
2. Statistics Generation

The following sequence of steps has been implemented for the initial text pre-processing:
1. Sentence Segmentation
2. Tokens are normalized to lowercase except the capital tokens appearing in the middle of a sentence/line. 
3. The words are tokenized using the following regular expression, r"[A-Za-z]\w+(?:[-'?]\w+)?"
4. 200 meaningful bigrams are extracted.
5. Context-independent and context-dependent (with the threshold set to 95%) stop words are removed from the vocabulary.
6. Tokens with length less than 3 are removed from the vocabulary.
7. Rare tokens (with the threshold set to 3%) are further removed from the vocabulary.
8. Stemming of unigram tokens using Porter Stemmer.

There were **2250** features obtained after processing the paper bodies.

## 4. References

1. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
2. https://www.nltk.org/