# Table of contents

* [Introduction](#introduction)
* [PART A: Generate a sparse representation for Paper Bodies](#part_a)
    * [Step 01: Import libraries](#step01)
    * [Step 02: Convert pdf file to txt file](#step02)
    * [Step 03: Download the PDF files of all the papers and translate to txt file](#step03)
    * [Step 04: Create context-independent stop word list](#step04)
    * [Step 05: Sparse feature generation](#step05)
        * [5.1 First time text preprocessing on Paper Bodies](#step05-1)
        * [5.2 Find the top 200 meaningful bigrams](#step05-2)
        * [5.3 Find the context-dependent (with the threshold set to 95%) stop words](#step05-3)
        * [5.4 Find the Rare tokens (with the threshold set to 3%)](#step05-4)
        * [5.5 Create all remove list](#step05-5)
        * [5.6 Final text preprocessing on Paper Bodies](#step05-6)
    * [Step 06: Create vocab.txt file](#step06)
    * [Step 07: Create count_vectors.txt](#step07)
* [PART B: Generate a CSV file (stats.csv) containing three columns](#part_b)
    * [Step 01: Define functions to extract information from files.](#step1)
    * [Step 02: Parse one pdf file in pdf_list](#step2)
    * [Step 03: Word Tokenization](#step3)
    * [Step 04: Top 10 Terms](#step4)
    * [Step 05: Create a DataFrame and Generate a CSV file](#step5)
* [Summary](#summary)

# Introduction
<a id="introduction"></a>
In this assignment, we extract data from an unstructured format (PDF files), convert it to txt files, and reshape the extracted text into a format suitable for a downstream modelling task. Finally, we convert the text into numerical representations. 
To do this, we use libraries such as pdfminer and nltk, which were covered in the tutorials.

# PART A:  Generate a sparse representation for Paper Bodies
<a id="part_a"></a>

## Step 01.  Import libraries
<a id="step01"></a>

Import some libraries for this assignment:
* pandas (for dataframe, included in Anaconda Python 3.7) 
* re (for regular expression, included in Anaconda Python 3.7)
* os (for running the pdf2txt.py command that converts a pdf file to a txt file)
* pdfminer.six (provides the pdfminer package and the pdf2txt.py command)
* nltk
* requests

In [1]:
# Code to import libraries as you need in this assessment, e.g.,
#! pip install pandas
#! pip install nltk
#! pip install pdfminer.six
#! pip install requests
#nltk.download('stopwords')
import os
import re
import nltk
import pandas as pd
import requests
from io import StringIO, open
from nltk.collocations import *
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk import sent_tokenize 
from nltk.stem.porter import PorterStemmer
from nltk.probability import *
porter_stem=PorterStemmer()
from nltk.util import ngrams

## Step 02. Convert pdf file to txt file
<a id="step02"></a>
To convert the downloaded PDF file to a text file, we are going to use *pdf2txt.py*, a command that comes with pdfminer.

Now we have pdfminer installed and are ready to convert our PDF to text by running the following command:
```shell
    pdf2txt.py -o Group081.txt Group081.pdf
```


In [2]:
output='Group081'
filename='Group081'
os.system("pdf2txt.py -o {}.txt {}.pdf".format(output, filename))

0
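
The `0` printed above is the exit status returned by `os.system`. As a hedged alternative, the same command can be run through `subprocess.run` with `check=True`, so that a failed conversion raises an exception instead of returning a status code that is easy to ignore. The helper names `pdf2txt_command` and `convert_pdf` are hypothetical and not part of the original notebook, and the sketch assumes `pdf2txt.py` is on the PATH.

```python
import subprocess

def pdf2txt_command(stem):
    """Build the pdf2txt.py argument list for <stem>.pdf -> <stem>.txt."""
    return ["pdf2txt.py", "-o", f"{stem}.txt", f"{stem}.pdf"]

def convert_pdf(stem):
    """Run the conversion; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(pdf2txt_command(stem), check=True)

# Only build and show the command here; call convert_pdf("Group081") to run it.
print(pdf2txt_command("Group081"))
```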

## Step 03. Download the PDF files of all the papers and translate to txt file
<a id="step03"></a>

In this step, we do the following things:
* Use regular expression to extract the paper id and download url.
```shell
    Paper ID regular expression: r"(.*?)\.pdf"
    Paper download url regular expression: r"\.pdf (http.*)"
```
* Use requests library to programmatically download papers from url.

* Download the PDF files and use pdf2txt.py to convert each PDF to a txt file. As a result, we have 200 PDF files and 200 txt files. 

* Create a paper_id list that stores the IDs of these 200 papers.
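
As a small illustration of the two regular expressions listed above, the sketch below applies them to a single line. The function name `parse_paper_line` and the sample line are hypothetical; the real IDs and URLs come from Group081.txt.

```python
import re

# The two patterns quoted in the list above.
PATTERN_ID = re.compile(r"(.*?)\.pdf")
PATTERN_URL = re.compile(r"\.pdf (http.*)")

def parse_paper_line(line):
    """Return (paper_id, url) for a listing line, or None if it does not match."""
    match_id = PATTERN_ID.search(line)
    match_url = PATTERN_URL.search(line)
    if match_id is None or match_url is None:
        return None
    return match_id.group(1), match_url.group(1).strip()

# Hypothetical listing line, made up for this example.
sample = "PP0001.pdf https://example.org/papers/PP0001.pdf\n"
print(parse_paper_line(sample))
```

Lines that contain no `.pdf` entry return `None`, so they can be skipped before any download is attempted.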

In [3]:
# input the file
pdfTxtFile = 'Group081.txt'
pdf_txt = open(pdfTxtFile, 'r')

pattern_url = re.compile(r'\.pdf (http.*)')
pattern_id = re.compile(r'(.*?)\.pdf')
paper_id=[]

#read each line in file, extract the paper id and url, and append id to paper_id list
#use requests to download the url file, and convert pdf file to txt file by using pdf2txt.py
for line in pdf_txt:
    match_url=pattern_url.search(line)
    match_id=pattern_id.search(line)
    if match_url is not None and match_id is not None:
        url=match_url.group(1)
        r = requests.get(url)
        url_id=match_id.group(1)
        url_id_pdf=url_id+'.pdf'
        paper_id.append(url_id)
        with open(url_id_pdf, 'wb') as f:
            f.write(r.content)
        # convert only after the with block has closed (and flushed) the PDF file
        os.system("pdf2txt.py -o {}.txt {}.pdf".format(url_id, url_id))

#close the file            
pdf_txt.close()


## Step 04. Create context-independent stop word list
<a id="step04"></a>

In this step, we build the context-independent stop word list from stopwords_en.txt, which is provided in the zip file.

In [4]:
# input the file
stop_word_txt=open('stopwords_en.txt','r')
context_independent_stopwords=[]

#read each line in file, strip each line and append stop word to stop_word list
for word in stop_word_txt:
    a=word.strip()
    context_independent_stopwords.append(a)

#close the file
stop_word_txt.close()

## Step 05. Sparse feature generation 
<a id="step05"></a>
In this assignment, we perform text preprocessing on the Paper Bodies twice to generate the sparse features.
* The first pass over the paper bodies finds:
    * The top 200 meaningful bigrams
    * Context-dependent stop words (with the threshold set to 95%)
    * Rare tokens (with the threshold set to 3%)


* The second pass over the paper bodies generates the sparse features:
    * Convert two adjacent tokens to a bigram if the pair is in the top 200 meaningful bigrams.
    * Remove all the stop words and rare tokens.

### Step 5-1. First time text preprocessing on Paper Bodies
<a id="step05-1"></a>
In this first text preprocessing pass over the paper bodies, we do the following to find the top 200 meaningful bigrams, the context-dependent stop words, and the rare tokens:
* 1. Convert the txt file to one row.
* 2. Extract the paper body of each paper.
* 3. Lowercase the first word of each sentence in the paper body. (E)
* 4. Tokenize words using the regular expression r"[A-Za-z]\w+(?:[-'?]\w+)?"      (A)
* 5. Remove context-independent stop words.      (B)
* 6. Remove tokens with a length of less than 3.      (F)
* 7. Extract 2-grams into the bigrams_raw list.
* 8. Store the tokens in the first_step dictionary, where the key is the paper id and the value is the list of unique tokens.
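
Steps 4 to 6 above can be sketched with the standard `re` module, which should behave like `RegexpTokenizer` for this pattern because the optional group is non-capturing. The function name `tokenize` and the tiny stop word set are assumptions made for the example only.

```python
import re

# Same pattern as the tokenizer used in this notebook.
TOKEN_PATTERN = re.compile(r"[A-Za-z]\w+(?:[-'?]\w+)?")

def tokenize(sentence, stopwords, min_len=3):
    """Tokenize, then drop stop words (case-insensitively) and short tokens."""
    tokens = TOKEN_PATTERN.findall(sentence)
    return [t for t in tokens if t.lower() not in stopwords and len(t) >= min_len]

# Hypothetical stop word set; the notebook reads its list from stopwords_en.txt.
demo_stopwords = {"the", "is", "on"}
demo_tokens = tokenize("the state-of-the-art model is trained on MNIST", demo_stopwords)
print(demo_tokens)
```

Note that the pattern allows only one hyphenated extension per token, so "state-of-the-art" is split into "state-of" and "the-art".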

<div class="alert alert-block alert-warning">
   Create the to_one_row function that converts the txt file to one row.
</div>

In [5]:
def to_one_row(txt_file):   
    txt=[]
    #read each line in txt_file, and replace '\n' to ''
    for rows in txt_file:
        row_replaced=rows.replace('\n','')
        # if the length of row_replaced not equal to 0, append to txt
        if len(row_replaced) != 0:
            txt.append(row_replaced)
    
    # join to one row of string.
    txt_string=' '.join(txt)
    return txt_string

<div class="alert alert-block alert-warning">
   Create the first_word_lowercase function that lowercases the first word in each sentence.
</div>

In [6]:
def first_word_lowercase(sentence):
    low_sentence=[]
    #read each sentence, and change its first character to lower case
    for rows in sentence:
        row=rows.strip()
        # skip empty strings so that row[0] cannot raise an IndexError
        if not row:
            continue
        low_sent=row[0].lower() + row[1:]
        low_sentence.append(low_sent)
    return low_sentence

In [7]:
porter_stem=PorterStemmer()
tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?")

first_step={}
bigrams_raw=[]

# read each line in paper_id
for ID in paper_id:
    pdfTxtFile = ID+'.txt'
    pdf_txt = open(pdfTxtFile, 'r')
    
    # use to_one_row function to convert sentence to one row
    txt_string=to_one_row(pdf_txt)
    
    # extract the paper body
    pattern_context=re.compile(r'1 Paper Body(.*)2 References')
    context=pattern_context.search(txt_string).group(1)
    
    #use sent_tokenize to sentence segmentation.
    sent_token=sent_tokenize(context)
        
    #use first_word_lowercase to lowercase the first word of each segmented sentence.
    low_sentence=first_word_lowercase(sent_token)
    
    sentence_token = []

    #read each line in low_sentence
    for sent in low_sentence:
        #tokenize the word
        unigram_tokens = tokenizer.tokenize(sent)
        
        #remove context-independent stop word and token less than 3
        remove_stopword_uni=[word for word in unigram_tokens if word.lower() not in context_independent_stopwords and len(word)>2]

        #append token to bigrams_list
        bigrams_list=[word for word in remove_stopword_uni]
        
        # append each token to sentence_token list
        for uni in remove_stopword_uni:
            sentence_token.append(uni)
        
        # use ngrams function to make 2-grams in each sentence
        bigrams_f = ngrams(bigrams_list, n = 2)
        
        # append bigram to bigrams_raw list
        for bigram in bigrams_f:
            bigrams_raw.append(bigram)

    #append token to first_step that the key is paper id, and the value is sentence_token.
    first_step[ID]=list(set(sentence_token))
    
    pdf_txt.close()

### Step 5-2. Find the top 200 meaningful bigrams
<a id="step05-2"></a>
In this step, we use FreqDist to count the bigrams and keep the 200 most frequent ones as the meaningful bigrams. The two tokens of each bigram are joined with a double underscore, i.e. “__”.
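
The counting idea can be sketched with `collections.Counter`, which plays the role of FreqDist here. This is a toy illustration on made-up sentences, and `top_bigrams` is a hypothetical helper name rather than part of the notebook.

```python
from collections import Counter

def top_bigrams(sentences, k):
    """Count adjacent token pairs within each sentence and return the k most frequent."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return [pair for pair, _ in counts.most_common(k)]

# Made-up tokenized sentences for the example.
sentences = [
    ["neural", "network", "training"],
    ["deep", "neural", "network"],
    ["neural", "network", "model"],
]
best = top_bigrams(sentences, k=1)
joined = ["__".join(pair) for pair in best]
print(joined)
```

Pairs are formed inside each sentence only, so a bigram never spans a sentence boundary.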

In [8]:
#use FreqDist function to count the bigrams
fd_bigram = FreqDist(bigrams_raw)
#find the top 200 meaningful bigrams
bigram_200=fd_bigram.most_common(200)

bigram_list=[]
bigram_find=[]

#read each line in bigram_200, and use double underscore to join
for i in bigram_200:
    a='__'.join(i[0])
    bigram_list.append(a)
    bigram_find.append(i[0])

### Step 5-3. Find the context-dependent (with the threshold set to 95%) stop words
<a id="step05-3"></a>

In this step, we count, for each word, the number of papers it appears in (its document frequency). Each paper that contains the word adds one to count. If count is greater than 190 (95% of the 200 papers), the word is added to the context-dependent stop words.

In [9]:
context_dependent_stopwords=[]
# read each word_list in first_step.values
for word_list in first_step.values():    
    #read each word in word_list
    for word in word_list:
        count=0
        #read each paper id in paper_id
        for Id in paper_id:
            # if the word appears in this paper's token list, count plus one
            if word in first_step[Id]:
                count +=1
        #if count is greater than 190, and the word is not yet in context_dependent_stopwords, append it to the list
        if count>190 and word not in context_dependent_stopwords:
            context_dependent_stopwords.append(word)

### Step 5-4. Find the Rare tokens (with the threshold set to 3%)
<a id="step05-4"></a>

In this step, we again count the number of papers each word appears in. Each paper that contains the word adds one to count. If count is less than 7 (that is, the word appears in at most 6 of the 200 papers, or 3%), the word is added to the rare token list.

In [10]:
rare_tokens=[]
# read each word_list in first_step.values
for word_list in first_step.values():
    #read each word in word_list
    for word in word_list:
        count=0
        #read each paper id in paper_id
        for Id in paper_id:
            # if the word appears in this paper's token list, count plus one
            if word in first_step[Id]:
                count +=1
        #if count is less than 7, and the word is not yet in rare_tokens, append it to the list
        if count<7 and word not in rare_tokens:   
            rare_tokens.append(word)
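
The two loops above recompute the same per-word paper count. As a hedged sketch, the document frequency can be computed once with `collections.Counter` and then thresholded twice. The names `document_frequency` and `split_by_frequency` and the toy data are assumptions; with 200 papers the notebook's thresholds correspond to `high=190` and `low=7`.

```python
from collections import Counter

def document_frequency(tokens_by_paper):
    """Map each token to the number of papers that contain it at least once."""
    df = Counter()
    for tokens in tokens_by_paper.values():
        df.update(set(tokens))  # set() so a paper counts a token only once
    return df

def split_by_frequency(tokens_by_paper, high, low):
    """Return (tokens with df > high, tokens with df < low), each sorted."""
    df = document_frequency(tokens_by_paper)
    too_common = sorted(w for w, c in df.items() if c > high)
    too_rare = sorted(w for w, c in df.items() if c < low)
    return too_common, too_rare

# Toy corpus of three papers, made up for the example.
toy = {
    "p1": ["model", "data", "kernel"],
    "p2": ["model", "data"],
    "p3": ["model", "graph"],
}
common, rare = split_by_frequency(toy, high=2, low=2)
print(common, rare)
```

This variant visits every token once per paper instead of once per paper pair, which matters when the vocabulary is large.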

### Step 5-5. Create all remove list
<a id="step05-5"></a>

In this step, we build the complete removal list by combining the context-independent stop words, the context-dependent stop words, and the rare tokens.

In [11]:
all_remove_list=context_independent_stopwords+context_dependent_stopwords + rare_tokens 

### Step 5-6. Final text preprocessing on Paper Bodies
<a id="step05-6"></a>

In this final text preprocessing pass over the paper bodies, we do the following:
(E->A->G->B->D->F->C)
* 1. Convert the txt file to one row.
* 2. Extract the paper body of each paper.
* 3. Lowercase the first word of each sentence in the paper body. (E)
* 4. Tokenize words using the regular expression r"[A-Za-z]\w+(?:[-'?]\w+)?" (A)
* 5. Replace two adjacent tokens with a bigram if the pair is in the top 200 meaningful bigrams list (G)
* 6. Remove context-independent and context-dependent stop words   (B)
* 7. Remove rare tokens that are in the 3% rare tokens list (D) 
* 8. Remove tokens with a length of less than 3 (F)
* 9. Stem tokens using the Porter stemmer; tokens that start with an uppercase letter are kept unstemmed. (C)
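
Step 5 (G) is the least obvious of these operations, so here is a minimal sketch of a greedy left-to-right merge. The function name `merge_bigrams` is hypothetical; the merge rule (join with "__" and skip the second token) mirrors what the code cell below does with an iterator.

```python
def merge_bigrams(tokens, bigram_set, sep="__"):
    """Greedily replace adjacent token pairs found in bigram_set with one joined token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigram_set:
            merged.append(tokens[i] + sep + tokens[i + 1])
            i += 2  # the second token is consumed by the bigram
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Made-up tokens and bigram set for the example.
demo = merge_bigrams(["deep", "neural", "network", "model"], {("neural", "network")})
print(demo)
```

Because the scan is greedy, overlapping candidates such as ("a", "b") and ("b", "c") resolve in favour of the leftmost pair.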

In [12]:
from nltk.stem.porter import PorterStemmer
from nltk.probability import *
porter_stem=PorterStemmer()

final={}

tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?")
lookup=set(bigram_find)

# read each id in paper_id
for ID in paper_id:
    pdfTxtFile = ID+'.txt'
    pdf_txt = open(pdfTxtFile, 'r')

    # use to_one_row function to convert sentence to one row
    txt_string=to_one_row(pdf_txt)    
    
    # extract the paper body
    pattern_context=re.compile(r'1 Paper Body(.*)2 References')
    context=pattern_context.search(txt_string).group(1)
    
    #use sent_tokenize to sentence segmentation.
    sent_token=sent_tokenize(context)

    #use first_word_lowercase to lowercase the first word of each segmented sentence.
    low_sentence=first_word_lowercase(sent_token)
    
    uni_token = []
    
    #read each line in low_sentence
    for sent in low_sentence:
        #tokenize the word
        unigram_tokens = tokenizer.tokenize(sent)
        
        #remove context-independent stop words (short tokens are filtered later, before stemming)
        remove_stopword_sent=[word for word in unigram_tokens if word.lower() not in context_independent_stopwords]
        
        for uni in remove_stopword_sent:
            uni_token.append(uni)
    
    bigramed_uni = []
        
    #Replace token to bigram if this token is in the top 200 meaningful bigrams list
    word_iter = iter(range(len(uni_token)))
    for index in word_iter:
        bigramed_uni.append(uni_token[index])
        if index < (len(uni_token) - 1) and (uni_token[index], uni_token[index+1]) in lookup:
            bigramed_uni[-1] =bigramed_uni[-1] +'__'+uni_token[index+1]
            next(word_iter)
    
    
    stemmed_uni=[]
    
    # read each token in bigramed_uni
    for uni in bigramed_uni:
        #if token starts with an upper-case letter, its length is greater than 2, and it is not in all_remove_list, append it unstemmed to stemmed_uni list
        if uni[0].isupper() and len(uni)>2 and uni not in all_remove_list:
            stemmed_uni.append(uni)
        # if the length of token is greater than 2, and token is not in all_remove_list, use porter stemming to stem the token, and append to stemmed_uni list
        elif len(uni)>2 and uni not in all_remove_list:
            stem_uni=porter_stem.stem(uni)
            stemmed_uni.append(stem_uni)
            
    #append token to final dictionary that the key is paper id, and the value is stemmed_uni.
    final[ID]=stemmed_uni
    
    pdf_txt.close()

## Step 06. Create vocab.txt file
<a id="step06"></a>
In this step, we create the vocab.txt file, which contains the unigram and bigram tokens in the format token_string:token_index. The words in the vocabulary are sorted in ascending alphabetical order before being written to the txt file.
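
A compact sketch of this mapping is shown below, using a set to collect unique tokens before sorting. The helper name `build_vocab` and the toy data are assumptions made for illustration.

```python
def build_vocab(tokens_by_paper):
    """Map every unique token to its index in ascending sorted order."""
    unique = set()
    for tokens in tokens_by_paper.values():
        unique.update(tokens)
    return {token: index for index, token in enumerate(sorted(unique))}

# Toy corpus, made up for the example.
toy_tokens = {"p1": ["model", "kernel"], "p2": ["graph", "model"]}
toy_vocab = build_vocab(toy_tokens)
vocab_lines = [f"{token}:{index}" for token, index in toy_vocab.items()]
print(vocab_lines)
```

Python's default string ordering places uppercase letters before lowercase ones, so unstemmed capitalized tokens appear first in the sorted vocabulary.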

In [24]:
#open the output file
f=open('Group081_vocab.txt','w')

#append each unique word to vocab
vocab=[]
for word_list in final.values():
    for word in word_list:
        if word not in vocab :
            vocab.append(word)           

# sort the vocab in ascending alphabetical order
sorted_list = sorted(vocab)

final_output={}

#use enumerate to read index and word in sorted_list, and write into Group081_vocab.txt file
for index, word in enumerate(sorted_list):
    final_output[word]=index
    f.write(word+':'+str(index)+'\n')

#close the file    
f.close()

## Step 07. Create count_vectors.txt
<a id="step07"></a>
In this step, each line of the txt file contains the sparse representation of one paper in the following format: (paper_id, token1_index:token1_wordcount, token2_index:token2_wordcount ...). We therefore do the following:
* Translate each token_string to its token_index in each paper
* Use FreqDist to count each token_index in each paper
* Write the result to the count_vectors.txt file
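
The three steps above can be sketched for a single paper as follows. This is an illustrative variant, not the notebook's exact output: it orders the pairs by token index and uses `",".join`, so the line has no trailing comma. The name `count_vector_line` and the toy vocabulary are assumptions.

```python
from collections import Counter

def count_vector_line(paper_id, tokens, vocab):
    """Format one paper as 'paper_id,index:count,index:count,...'."""
    counts = Counter(vocab[token] for token in tokens)
    pairs = [f"{index}:{count}" for index, count in sorted(counts.items())]
    return ",".join([paper_id] + pairs)

# Toy vocabulary and token list, made up for the example.
toy_index = {"graph": 0, "kernel": 1, "model": 2}
line = count_vector_line("p1", ["model", "kernel", "model"], toy_index)
print(line)
```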

In [25]:
final_value={}
# read each id in paper_id
for ID in paper_id:
    temp=[]
    # read each word in final[ID], and convert the token_string to token_index
    for word in final[ID]:
        temp.append(final_output[word])
    
    #append token_index to final_value dictionary that the key is paper id, and the value is token_index list.
    final_value[ID]=temp

In [26]:
from nltk.probability import *

#open the file
f1=open('Group081_count_vectors.txt','w')

# read key and value in final_value dictionary
for key,value in final_value.items():
    #use FreqDist function to count the frequency of token_index, and write to Group081_count_vectors.txt file
    fd_word_index = FreqDist(value)
    word_index_count=fd_word_index.most_common()
    f1.write(key+',')
    for i in word_index_count:   
        f1.write(str(i[0])+':'+str(i[1])+',')
    f1.write('\n')    

#close the file    
f1.close()

# PART B: Generate a CSV file (stats.csv) containing three columns
<a id="part_b"></a>

## Step 01. Define functions to extract information from files and sort lists.
<a id="step1"></a>
Because we need to extract information from 200 files in a for loop, using functions is an easier and more efficient method.

For the abstract, the information is between "Abstract" and "1 Paper Body", so its pattern is **"Abstract(.+?)1 Paper Body"**. The abstract must also be normalized to lowercase, except for capitalized tokens that appear in the middle of a sentence/line. We therefore use sentence segmentation to split the text into sentences and lowercase the first letter of each sentence. Finally, we join the sentences into a single string again.

For the title, the information is in the lines before "Authored by:", so its pattern is **"(.+?)Authored by:"**. After extraction, we lowercase the title.

For author, information is between "Authored by:" and "Abstract", so its pattern is **"Authored by:\s\*(.+?)\s\*Abstract"**. The result returned is a list.
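
To make the three patterns concrete, the sketch below applies them to a made-up text that follows the layout described above (title, "Authored by:", authors, "Abstract", "1 Paper Body"). The sample content is hypothetical; only the patterns come from this section.

```python
import re

# Hypothetical converted text with the section markers described above.
sample_text = (
    "A Toy Title\n"
    "Authored by:\n"
    "Ada Example\n"
    "Bob Sample\n"
    "Abstract\n"
    "We study a toy problem. Results are illustrative.\n"
    "1 Paper Body\n"
    "Body text.\n"
    "2 References\n"
)

# Title and abstract are searched in a single-line version of the text.
flat = " ".join(line.strip() for line in sample_text.splitlines())
title = re.search(r"(.+?)Authored by:", flat).group(1).strip().lower()
abstract = re.search(r"Abstract(.+?)1 Paper Body", flat).group(1).strip()

# Authors are searched in the original text so that '\n' separates the names.
authors_raw = re.search(r"Authored by:\s*(.+?)\s*Abstract", sample_text, re.DOTALL).group(1)
authors = [name for name in authors_raw.split("\n") if name]

print(title, authors, abstract, sep="\n")
```

The `re.DOTALL` flag is what lets `(.+?)` run across line breaks in the author pattern.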

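To order terms with equal counts, we use a bubble sort that swaps adjacent items only when their counts are equal, so ties are arranged in ascending alphabetical order while the frequency ranking is preserved.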
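
For comparison, the same ordering (count descending, then term ascending) can be expressed with the built-in `sorted` and a compound key. This is an alternative sketch, not the method used in the notebook; `rank_terms` and the sample pairs are made up for the example.

```python
def rank_terms(freq_pairs):
    """Sort (term, count) pairs by count descending, breaking ties alphabetically."""
    return sorted(freq_pairs, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical (term, count) pairs, as returned by a most_common call.
pairs = [("model", 5), ("learning", 7), ("data", 5), ("kernel", 5)]
ranked = rank_terms(pairs)
print(ranked)
```

Unlike the in-place bubble sort, this returns a new list and leaves the input unchanged.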

In [16]:
# define a function to extract abstract
def extract_abstract(text):
    abstract_pattern = re.compile(r"Abstract(.+?)1 Paper Body") # pattern for abstract
    abstract = abstract_pattern.search(text).group(1).strip() # search the argument (not the global raw_text) and strip whitespace
    
    # sentence segmentation
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle') 
    sentences = sent_detector.tokenize(abstract)
    abstract_lower_list = []
    for sent in sentences:
        sent = sent[0].lower() + sent[1:] # lowercase the first letter of a sentence
        abstract_lower_list.append(sent) # append to lower list
    abstract_lower = ' '.join(abstract_lower_list) # combine to a string
    return abstract_lower

# define a function to extract title
def extract_title(text):
    title_pattern = re.compile(r"(.+?)Authored by:") # pattern for title
    title = title_pattern.search(text).group(1).strip().lower() # strip whitespace and lowercase
    return title

# define a function to extract author
def extract_author(text):
    author_pattern = re.compile(r"Authored by:\s*(.+?)\s*Abstract", re.DOTALL) # pattern for authors, matched across lines
    author_group = author_pattern.search(text).group(1).strip() # strip whitespace at both ends
    author = author_group.split('\n') # split by \n into a list
    return author

# define bubble_sort to sort freqdist
def bubble_sort(tuple_list):
    n = len(tuple_list)
    for i in range(n-1, 0, -1):
        for j in range(i):
            if tuple_list[j][1] == tuple_list[j+1][1]:
                if tuple_list[j][0] > tuple_list[j+1][0]:
                    tuple_list[j],tuple_list[j+1] = tuple_list[j+1],tuple_list[j]

In [17]:
# initialize lists containing items required
abstract_list = []
title_list = []
author_list = []

In [18]:
# initialize stopwords list and turn it into a set
stopwords_list = stopwords.words('english')
stopwords_set = set(stopwords_list)

# add words in stopwords_en.txt into the set
with open('stopwords_en.txt', 'r') as stopwords_en:
    for word in stopwords_en:
        stopwords_set.add(word.strip())

## Step 02. Parse one pdf file in pdf_list
<a id="step2"></a>
First, we concatenate the lines of the file into one string without '\n'. Then we use the functions to extract the abstract and the title and append them to their respective lists. After that, we concatenate the lines into one string again, this time keeping '\n', and extract the list of authors. Finally, we append every item that is not "" to the author list.

In [19]:
# open files in the pdf_list and process
for ID in paper_id:
    pdfTxtFile = ID+'.txt'
    # concatenate lines into one string without '\n'
    with open(pdfTxtFile) as pdf_text:
        raw_list = []
        for lines in pdf_text:
            raw_list.append(lines.strip()) # delete \n in the end of the line
    raw_text = " ".join(raw_list) # join once, after all lines have been read

    # call functions and append into lists
    abstract_list.append(extract_abstract(raw_text))
    title_list.append(extract_title(raw_text))
    
    # open file and concatenate lines into one string with '\n'
    with open(pdfTxtFile) as pdf_text:
        raw_list = []
        for lines in pdf_text:
            raw_list.append(lines)
    raw_text = "".join(raw_list) # join once, after all lines have been read
    
    # append author into the author_list
    for author in extract_author(raw_text):
        if author != '':
            author_list.append(author)

## Step 03. Word Tokenization
<a id="step3"></a>
First, we define the regular expression for the tokenizer and apply it to the two lists that require tokenization (abstracts and titles). After that, we remove stop words from them according to the stopwords set. Finally, we use FreqDist from the nltk.probability package to get the frequency distribution of all three lists.

In [20]:
# word tokenization
tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?")
abstract_tokens = tokenizer.tokenize(' '.join(abstract_list))
title_tokens = tokenizer.tokenize(' '.join(title_list))

# remove stopwords
stopped_abstract = [w for w in abstract_tokens if w not in stopwords_set]
stopped_title = [w for w in title_tokens if w not in stopwords_set]

# find freqdist for each list
fd_abstract = FreqDist(stopped_abstract)
fd_title = FreqDist(stopped_title)
fd_author = FreqDist(author_list)

## Step 04. Top 10 Terms
<a id="step4"></a>
We use the most_common function of FreqDist to get the top 10 terms in each list, and use bubble sort to arrange terms with equal counts in ascending alphabetical order.

In [21]:
# find freqdist for each list and find most_common 10
fd_abstract = FreqDist(stopped_abstract).most_common(10)
fd_title = FreqDist(stopped_title).most_common(10)
fd_author = FreqDist(author_list).most_common(10)

# sort lists by bubble sort
bubble_sort(fd_abstract)
bubble_sort(fd_title)
bubble_sort(fd_author)

# extract terms from lists
top10_terms_in_abstracts = [item[0] for item in fd_abstract]
top10_terms_in_titles = [item[0] for item in fd_title]
top10_authors = [item[0] for item in fd_author]

## Step 05. Create a DataFrame and Generate a CSV file
<a id="step5"></a>

In [22]:
# write lists into a dataframe
df = pd.DataFrame({'top10_terms_in_abstracts': top10_terms_in_abstracts,\
                   'top10_terms_in_titles':top10_terms_in_titles,\
                   'top10_authors':top10_authors})

In [23]:
# write dataframe into a csv file (to_csv opens and closes the file itself)
df.to_csv("Group081_stats.csv",index=False,sep=',')

# Summary
<a id="summary"></a>
In this assignment, we first learned that PDF is not a preferred storage or presentation format for text analysis, because converting a pdf file to a txt file introduces conversion mistakes. However, sometimes we do not have any other choice. Second, we needed to figure out the order of operations that produces the correct vocabulary, since a different order leads to a different final result. The most suitable order of preprocessing steps therefore has to be chosen according to the specific task.