# FIT5196 Assessment 2
#### Student Name: Meng Li, Xuejin Huang
#### Student ID: 29040256, 28651073

Date: 15/09/2019

Version: 1.0

Environment: Python 3.6.5 and Jupyter notebook

Libraries used: 
* pandas (for dataframe, included in Anaconda Python 3.6) 
* re (for regular expression, included in Anaconda Python 3.6) 
* requests (for download pdf files)
* nltk (for text pre-processing)
* pdfminer (for parsing pdf files)

## 1.  Introduction

This assignment is about text pre-processing and feature generation. The dataset includes 200 papers in pdf format, which are downloaded at the first step. Then the papers in pdf format are convert to txt format, where will carry out text pre-processing including tokenization, stemming and some other processes. The text pre-processing is based on two main parts of the paper, one is the paper bodies and the other part includes title, authors and abstracts.

After text-preprocessing, the tokens from paper bodies will be used to generate a vocabulary index and a sparse count vectors file. As for tokens from the other part, the top 10 most frequent terms in title and abstract and the the top 10 most frequent authors will be found.

The output of this assignment should including the following:
1. vocabulary index file
2. sparse count vectors file
3. csv file containing the top 10 most frequent terms in title, abstract and authors

## 2. Import libraries

In this assignment, the most important library is the `nltk` library, which is responsible for the text pre-processing such as tokenization and stemming. There are also some other libraries, for example, `re` library is responsible for regular expression and the `requests` library is responsible for downloading. The comments in the coding part describe each import's functions.

In [2]:
# these pdfminer packages are used in the convert_pdf_to_txt function, to convert the pdf to txt
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

#re packages are used for every time we need to parsing the word.
import re

#request package used in download part, it used to send a HTTP request and get the response,
# download the pdf files using the url in Group006.df
import requests

# sys package is used for the file write and read 
import sys

# These ntlk packages are used for tokenization
import nltk
import itertools
from nltk.util import ngrams # This package is used for bigram at first.
from itertools import chain # when find the 200 meaningful bigrams, we used this package, we used list.chain in our code
from nltk.collocations import *
from nltk.probability import *
import nltk.data #This package is imported because we need to do the sentence segmentation
from nltk.tokenize import RegexpTokenizer # We use this package because we have to tokenization using the given regex
from nltk.tokenize import MWETokenizer #MWETokenizer package is for the subtitu the words after bigram.
from nltk.stem.porter import PorterStemmer #This package is used to do the stemmer work
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer # This package is for count the numbers of words.

# These packages are used for the output
import pandas as pd # pandas imported to make the CSV format
from io import StringIO # This packages used for output our result as files.




## 3. Downloading the pdf files and converting to txt

In this part, a pdf file containing the paper number (name, such as PP3295) and the corresponding urls is parsed at first. Then the paper name and its urls are extracted. After that, the 200 papers pdf files (the dataset) are downloaded and stored in a dictionary, where the key is the paper name and the value is the paper content in txt format. 

### 3.1 Parsing the 'Group006.pdf' 

In this section, the `Group006.pdf` is parsed. The important information including the paper name and the urls are extracted and then stored in the lists.

In [5]:
# This method is used to convert pdf to txt
# Reference: https://stackoverflow.com/questions/26494211/extracting-text-from-a-pdf-file-using-pdfminer-in-python

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

In [6]:
# Convert the Group006.pdf to txt format
Group006_txt = convert_pdf_to_txt('Group006.pdf')

In [7]:
# Extract the pdf names such as 'PP3295'
regex = r'\n(.*)?\.pdf'
name = re.findall(regex, Group006_txt)
# print(name)

In [8]:
# Extract the URLs for downloading
regex = r' (.*)?\n'
url = re.findall(regex, Group006_txt)
# print(url)

### 3.2 Downloading the dataset (200 papers in pdf) and convert to txt format

In this section, the 200 papers in pdf format are downloaded to the local site using the urls extracted in the previous step. Then the pdf is converted to the txt format, which is prepared for the text pre-processing, and then each paper and its content are stored in a dictionary.

In [15]:
# Download the 200 pdfs to local
for i in range(0,200):    
    filename = './' + name[i] + '.pdf'
    r = requests.get(url[i])
    with open(filename, 'wb') as f:
        f.write(r.content)    

In [9]:
# Convert the pdf to the text, 
# then create a dictionary to store the names and corresponding content.
txtdic = {}
for aname in name:
    pdfname = aname + '.pdf'
    value = convert_pdf_to_txt(pdfname)
    txtdic[aname] = value

In [10]:
# txtdic['PP3295']

'Discriminative Batch Mode Active Learning\n\nAuthored by:\n\nDale Schuurmans\n\nYuhong Guo\n\nAbstract\n\nActive learning sequentially selects unlabeled instances to label with\nthe goal of reducing the eﬀort needed to learn a good classiﬁer. Most\nprevious studies in active learning have focused on selecting one unla-\nbeled instance at one time while retraining in each iteration. However,\nsingle instance selection systems are unable to exploit a parallelized la-\nbeler when one is available. Recently a few batch mode active learning\napproaches have been proposed that select a set of most informative unla-\nbeled instances in each iteration, guided by some heuristic scores. In this\npaper, we propose a discriminative batch mode active learning approach\nthat formulates the instance selection task as a continuous optimization\nproblem over auxiliary instance selection variables. The optimization is\nformuated to maximize the discriminative classiﬁcation performance of\nthe target cl

## 4. Text pre-processing of the paper bodies

In this part, the paper bodies are extracted at the first and then the text pre-processing is carried out. Based on the assignment specification, there are several steps in the pre-processing, which will be discussed in the following sections.

### 4.1 Tokenization and unigram 

In this section, the paper body is extracted from the paper and then the tokenization is carried out on each papar body. Based on specification, the tokens from the first word of a sentence are converted to the lower case.

In [12]:
# This is to create the object for sentence segmentation.
# nltk.download('punkt')
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

In [82]:
# Extract the paper body and change the start of sentence to lower case
tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?")

# This regex is used for extracting paper body
regex = r'Paper Body(.*)Reference' 

unigramdic = {}
for aname in name:
    paperbody = re.sub(r'(-\n)', '', txtdic[aname])
    paperbody = re.sub(r'(\n)', ' ', paperbody)
    paperBody = re.findall(regex, paperbody)
    sentences = sent_detector.tokenize(paperBody[0].strip())
    string = ''
    
    # Convert the first word in a sentence to lower case.
    for i in range(0,len(sentences)):
        string += sentences[i][0].lower() + sentences[i][1:]
    unigram_tokens = tokenizer.tokenize(string)
    
    # Use a dictionary to store tokens, where the key is the paper name 
    # and value is the tokens
    unigramdic[aname] = unigram_tokens

### 4.2 First 200 meaningful bigrams

In this section, the first 200 meaningful bigrams without stop words are extracted. Then the bigrams are replaced with the unigrams. For example, if a bigram is `artificial__intelligence`, then it will replace the unigram `artificial` and `intelligence` that appears together.

In [14]:
# Read the stop words
stop_list=[]
with open('./stopwords_en.txt') as f:
    stopwords = f.read().splitlines()
stopwords_set = set(stopwords)

In [84]:
# Find the first 200 meaningful bigrams
all_words = list(chain.from_iterable(unigramdic.values()))
bigrams = ngrams(all_words, n = 2)

# Get the total frequency of the bigrams
fdbigram = FreqDist(bigrams)
result_dic = {}
for k,v in fdbigram.items():
    if k[0] not in stopwords_set and k[1] not in stopwords_set: # Take the bigram without stopwords
        result_dic[k] = v   
sortedresult_dic = sorted(result_dic.items(), key=lambda x: x[1], reverse=True) # Sort and find the first 200
result_bigram = sortedresult_dic[0:200]

bigram_list = []
for i in result_bigram:
    bigram_list.append(i[0])


In [85]:
# Replace the unigrams with the corresponding bigrams
mwetokenizer = MWETokenizer(bigram_list, separator='__')
paperbody_dic =  dict((name, mwetokenizer.tokenize(content)) for name,content in unigramdic.items())

### 4.3 Removing the stop words

In this section, the stop words (context-independent) are removed from the tokens dictionary.

In [87]:
# Remove the stop words
for k,v in paperbody_dic.items():
    paperbody_dic[k] = [w for w in v if w.lower() not in stopwords_set]

### 4.4 Unigram stemming

In this section, the unigrams are stemmed. In particular, because the `PorterStemmer` will perform lower casing by default, we take out the tokens containing more than two capital letters with special meaning, such as 'USA'. 

In [89]:
# Only stem the unigrams
# The regex is finding the bigrams and the token with more than two capital letters such as USA
# We don't stem these tokens
stemmer = PorterStemmer()
regex = r'\w*[A-Z_]\w*[A-Z_]\w*'
for k,v in paperbody_dic.items():
    astring = ' '
    astring = astring.join(v)
    no_stem = re.findall(regex, astring)
    no_stem_set = set(no_stem)
    stemmed_tokens = []
    stemmed_tokens = stemmed_tokens + [stemmer.stem(w) for w in v if w not in no_stem_set]
    
    # Add the non-stemmed tokens back
    stemmed_tokens = stemmed_tokens + no_stem
    
    paperbody_dic[k] = stemmed_tokens

### 4.5 Removing token with the length less than 3

In [90]:
# Remove the words with length less than 3
for k, v in paperbody_dic.items():
    paperbody_dic[k] = [word for word in v if len(word)>=3]

### 4.6 Remove the context-dependent tokens and rare tokens

In this section, the token appearing more then 95% and less than 3% (based on the document frequency) are removed. First, we create a dictionary to store the document frequency of a token. Then, the token with document frequency more than 95% or less than 3% are stored in two lists.

In [91]:
# Create the two lists that store the document frequency morethan 0.95 or lessthan 0.03
more95 = []
less3 = []
word = {}
for k in paperbody_dic.keys():
    newlist = list(set(paperbody_dic[k]))
    for item in newlist:
        if item in word.keys():
            word[item] += 1
        else:
            word[item] = 1
for k,v in word.items():
    if v >= (200 * 0.95):
        more95.append(k)
    if v < (200 * 0.03):
        less3.append(k)

len(less3)

15818

In [92]:
# Remove these context-dependent tokens and rare tokens
for k,v in paperbody_dic.items():
    paperbody_dic[k] = [w for w in v if w not in more95 and w not in less3]

### 4.7 Create the vocab list and output the vocab index file

The text pre-processing is finished for the paper body. In this section, we create the vocab index file and output as txt file.

In [94]:
# Create the vocabulary list
vocab = []
for v in paperbody_dic.values():
    vocab += v
vocab = list(set(vocab))
vocab_sorted = sorted(vocab)
print(vocab_sorted)

['AUC', 'BFGS', 'Bayesian__approach', 'Beach__CA', 'CAREER', 'CCF', 'CDF', 'CIFAR', 'CNN', 'CPU', 'DARPA', 'DKL', 'ERC', 'Experiments__We', 'FP7', 'Figure__The', 'Fisher__information', 'GHz', 'GPU', 'GPUs', 'GPs', 'Gaussian__process', 'Gibbs__sampler', 'Gibbs__sampling', 'HMM', 'HMMs', 'III', 'IIS', 'ImageNet', 'In__order', 'KL__divergence', 'LASSO', 'LDA', 'LSTM', 'LeCun', 'MAP', 'MATLAB', 'MCMC', 'MDP', 'MDPs', 'MLE', 'MLP', 'MNIST', 'MSE', 'Markov__chain', 'Monte__Carlo', 'NIPS', 'NIPS__Barcelona', 'NIPS__Long', 'NSF', 'NVIDIA', 'Neural__Information', 'ONR', 'PAC', 'PCA', 'Processing__Systems', 'RAM', 'RBF', 'RGB', 'RHS', 'RKHS', 'RMSE', 'RNN', 'ROC', 'ReLU', 'SDP', 'SGD', 'SVD', 'SVM', 'SVMs', 'This__work', 'UCB', 'UCI', 'USA', 'W911NF', 'abbrevi', 'abil', 'absenc', 'absent', 'absolut', 'abstract', 'abund', 'abus', 'acc', 'acceler', 'accept', 'access', 'accomplish', 'accord', 'account', 'accumul', 'accur', 'accuraci', 'achiev', 'acknowledg', 'acquir', 'acquisit', 'act', 'action', '

In [95]:
# Create the index for each vocab.
vocab_serial = {}
index = 0
for i in vocab_sorted:
    vocab_serial[i] = index
    index += 1
len(vocab_serial)

2546

In [136]:
# Set the output format
vocab_output = ''
for k,v in vocab_serial.items():
    vocab_output = vocab_output + k + ':' + str(v)
    vocab_output += '\n'

print(vocab_output)

AUC:0
BFGS:1
Bayesian__approach:2
Beach__CA:3
CAREER:4
CCF:5
CDF:6
CIFAR:7
CNN:8
CPU:9
DARPA:10
DKL:11
ERC:12
Experiments__We:13
FP7:14
Figure__The:15
Fisher__information:16
GHz:17
GPU:18
GPUs:19
GPs:20
Gaussian__process:21
Gibbs__sampler:22
Gibbs__sampling:23
HMM:24
HMMs:25
III:26
IIS:27
ImageNet:28
In__order:29
KL__divergence:30
LASSO:31
LDA:32
LSTM:33
LeCun:34
MAP:35
MATLAB:36
MCMC:37
MDP:38
MDPs:39
MLE:40
MLP:41
MNIST:42
MSE:43
Markov__chain:44
Monte__Carlo:45
NIPS:46
NIPS__Barcelona:47
NIPS__Long:48
NSF:49
NVIDIA:50
Neural__Information:51
ONR:52
PAC:53
PCA:54
Processing__Systems:55
RAM:56
RBF:57
RGB:58
RHS:59
RKHS:60
RMSE:61
RNN:62
ROC:63
ReLU:64
SDP:65
SGD:66
SVD:67
SVM:68
SVMs:69
This__work:70
UCB:71
UCI:72
USA:73
W911NF:74
abbrevi:75
abil:76
absenc:77
absent:78
absolut:79
abstract:80
abund:81
abus:82
acc:83
acceler:84
accept:85
access:86
accomplish:87
accord:88
account:89
accumul:90
accur:91
accuraci:92
achiev:93
acknowledg:94
acquir:95
acquisit:96
act:97
action:98
activ:99
act

In [143]:
# Output the vocabulary index file
def output_txt(astring):
    text_file = open('Group006_vocab.txt', 'w', encoding='utf-8')
    text_file.write(astring)
    text_file.close()

output_txt(vocab_output)

### 4.8 Create the Sparse count vectors and output

In this section, the sparse count vectors is created and then output as the txt file.

In [139]:
# Create the Sparse count vectors
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = "word") 
data_features = vectorizer.fit_transform([' '.join(value) for value in paperbody_dic.values()])
# print (data_features.shape)

In [126]:
# Set the output format
vocab2 = vectorizer.get_feature_names()
output = ''
for index in range(0,200):
    output += name[index] + ','
    for word, count in zip(vocab, data_features.toarray()[index]):
        if count > 0:
            if word in vocab_serial.keys():
                output += str(vocab_serial[word]) + ":" + str(count) + ','
    output = output[:-1]
    output += '\n'
print(output)

PP3295,243:2,1491:1,2019:9,2136:9,502:3,1034:34,1186:2,1488:2,1943:8,2181:1,551:1,2526:2,1153:9,824:1,639:1,1320:2,1813:3,1797:20,965:4,1369:2,1341:1,319:1,1524:2,2501:1,982:1,1600:1,838:1,1406:2,2131:1,1709:2,800:3,36:2,1830:27,1668:1,67:1,1487:1,66:2,401:1,1768:1,2520:1,1262:2,1390:1,1184:1,2452:4,2472:1,467:5,1946:13,2187:22,211:2,1368:4,1911:3,1154:3,328:6,1838:1,780:1,671:1,1440:3,1345:1,1040:1,2084:1,1212:1,2192:4,2396:2,51:3,1026:1,775:1,150:5,771:1,2110:2,2292:2,327:5,1784:2,761:1,1642:2,722:2,1020:1,58:1,2021:5,1580:1,2436:6,48:1,2036:1,1170:1,1431:2,958:1,294:2,2014:1,172:5,2385:1,303:5,1009:5,1205:5,1553:1,2093:26,987:1,1796:4,1322:1,928:4,2064:4,1486:1,1588:1,1317:1,2070:2,2208:2,2155:2,2178:1,1241:2,20:3,1762:3,613:1,480:1,413:4,2430:1,1899:4,1077:2,1587:2,1454:1,1641:9,433:1,2481:3,1989:4,2479:2,1519:3,1164:1,1477:1,1579:1,74:1,2024:1,2099:2,559:3,397:7,1456:2,2269:9,1732:1,2169:3,1364:4,1050:5,2495:1,56:2,1426:2,972:7,2393:3,1144:1,134:5,1928:4,2011:4,180:7,1853:1,1087:1

In [127]:
# Output the sparse count vectors file
def output_txt(astring):
    text_file = open('Group006_count_vectors.txt', 'w')
    text_file.write(astring)
    text_file.close()

output_txt(output)

### 4.9 End of the paper body pre-processing

The pre-processing for paper body and the required txt output are finished sucessfully. The next part we will carry out the pre-processing for the paper title, abstract and authors.

## 5. Text pre-processing of the titles, abstract and authors

In this part, we extracted the different part of each article, and then process for each of them to get the output based on the assignment specification.

### 5.1 Find top 10 most frequent Authors

In this section, we used the regex `r'Authored by:(.*?)Abstract'` first to get the Author part, then use another regex `r'\n\n(.*)\n\n'` to get authors in one list, then using `split('\n')` to seprate all the authors. After we get all the authors, we output the top50 rather than top10 because there are many authors with the same times, for these authors, we need to sort them then output, so we first see the `most_common(50)` and get top10 from it after the list get sorted.

In [147]:
regex_author = r'Authored by:(.*?)Abstract'
author_result = []
for v in txtdic.values():
    author = re.findall(regex_author, v, re.DOTALL)
    regex_author2 = r'\n\n(.*)\n\n'
    author_list = re.findall(regex_author2, author[0], re.DOTALL)
    if len(author_list) > 0:
        temp = author_list[0].split('\n') # To extract the full name 
        temp1 = [x for x in temp if x != ''] # There is some papers without authors
        author_result += temp1
author_ten = FreqDist(author_result).most_common(50)
author_ten = sorted(author_ten, key=lambda x: (-x[1], x[0]))

author_ten_list = []
for i in author_ten[0:10]:
    author_ten_list.append(i[0])
author_ten_list


['Dale Schuurmans',
 'Hongyuan Zha',
 'Lihong Li',
 'Michael I. Jordan',
 'Alex J. Smola',
 'Alexander T. Ihler',
 'Anima Anandkumar',
 'Benjamin Recht',
 'Constantine Caramanis',
 'Dan Garber']

### 5.2 Find Top 10 most frequent terms appearing in all Titles

In this section, we first used the regex `r'(.*)\n?'` to get the title part from article, then create a new string names `title_content`, then for each article, we do the re.match function, we get the result and set it as lowercase to this string, then we used the same regex `r"[A-Za-z]\w+(?:[-'?]\w+)?"` to tokenize, after we get the tokenize result, we will get the top15 use the `mosr_common(15)` also to make sure the word with same appearing time can get sorted, after sorted, we output the top10 as our result.

In [20]:
# extract title and change to lower case

regex = r'(.*)\n?'
title_content = ''
for v in txtdic.values():
    title = re.match(regex, v)
    title_content += title.group(1).lower() + ' '
tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?")
title_tokens = tokenizer.tokenize(title_content)
title_tokens2 = [w for w in title_tokens if w not in stopwords_set]
title_ten = FreqDist(title_tokens2).most_common(15)
title_ten = sorted(title_ten, key=lambda x: (-x[1], x[0]))
title_ten_list = []
for i in title_ten[0:10]:
    title_ten_list.append(i[0])
abstract_ten_list
title_ten_list

['learning',
 'bayesian',
 'networks',
 'online',
 'optimal',
 'models',
 'stochastic',
 'deep',
 'model',
 'sparse']

### 5.3 Find Top 10 most frequent terms appearing in all Abstracts

This part is semilar to the 5.2 part get the top 10 frequent terms in Titles. In this section, we first used the regex `r'Abstract(.*)Paper Body'` to get the abstract part from article, then create a new string names `string`, then for each article, we do the re.match function, we get the result and set it as lowercase to this string, then we used the same regex `r"[A-Za-z]\w+(?:[-'?]\w+)?"` to tokenize, after we get the tokenize result, we will get the top11 use the `mosr_common(11)` also to make sure the word with same appearing time can get sorted, after sorted, we output the top10 as our result3

In [15]:
# extract abstract and change the start of sentence to lower case

tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?")
regex = r'Abstract(.*)Paper Body'
abstract_tokens = []
for v in txtdic.values():
    abstract = re.sub(r'(-\n)', '', v)
    abstract1 = re.sub(r'(\n)', ' ', abstract)
    abstract2 = re.findall(regex, abstract1)
    sentences = sent_detector.tokenize(abstract2[0].strip())
    string = ''
    for i in range(0,len(sentences)):
        string += sentences[i][0].lower() + sentences[i][1:]
    abstract_tokens += tokenizer.tokenize(string)
abstract_tokens2 = [w for w in abstract_tokens if w not in stopwords_set]
abstract_ten = FreqDist(abstract_tokens2).most_common(11)
abstract_ten_list = []
for i in abstract_ten[0:10]:
    abstract_ten_list.append(i[0])
abstract_ten_list

['learning',
 'algorithm',
 'model',
 'data',
 'problem',
 'show',
 'method',
 'methods',
 'models',
 'approach']

### 5.4 Output our result as a file

In this part, we used pandas package we imported to create the top 10 outputs as the dataframe, then use `df.to_csv` to output our dataframe result as the csv file.

In [151]:
csv_data = {
    'top10_terms_in_abstracts': abstract_ten_list,
    'top10_terms_in_titles': title_ten_list,
    'top10_authors': author_ten_list
}

df = pd.DataFrame(csv_data, columns = ['top10_terms_in_abstracts', 'top10_terms_in_titles', 'top10_authors'])
df.to_csv('Group006_stats.csv', index=False)

### 5.5 End of the pre-processing of title, abstract and authors

The text pre-processing of the title, abstract and authors are completed and the required csv file is output. 

## 6. Conclusion

Through this assignment, we first carried out the text pre-processing tasks, including tokenization, stemming and several required tasks based on the assignment specification. Then, with the processed text, we output the required files, including the vocabulary index file, count vectors file and the csv file containing most top10 frequent terms. We learnt lots of knowledge and skills that used in text pre-processing and data wrangling.