## Motivation ##

As a Master's student, I don't always have time to complete all my readings, especially the lengthy research papers that supplement the course material. To keep up when I'm in a crunch, I used Python libraries to parse research paper PDFs, summarize them, and output the summary as an audio file, so I can listen to research papers while doing my other chores.

Below is the code I used to accomplish this task.

#### Importing required packages ####

I used fitz from the PyMuPDF package to read the PDF as XML, and parsed the XML with the ElementTree package. The text was summarized using the Natural Language Toolkit (NLTK), and the audio was written out as an MP3 file using gTTS, which wraps Google's text-to-speech API.

In [234]:
import fitz
import xml.etree.ElementTree as ET
from gtts import gTTS 
import os
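# One-time setup if the NLTK data is not installed yet:
# nltk.download('stopwords') and nltk.download('punkt')
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
import nltk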
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from math import floor

In [235]:
# Read the PDF
file_name = "research_paper.pdf"
doc = fitz.open(file_name, filetype="pdf")

## PDF Parsing ##

#### Helper functions ####

In [236]:
def xml_parser(xml):
    
    ''' This function takes a pdf page read as xml and extracts text from it.
    It stores the text in the form of nested dictionaries where each key:value pair 
    in the outer dictionary is font name:dictionary of font sizes. 
    The inner dictionary contains key:value pairs of font_size:text '''
    
    font_blocks = {}
    for block in xml.findall('block'):
        for line in block.findall('line'):
            for font in line.findall('font'):
                name, size = font.get('name'), font.get('size')
                # Join the characters of this font run; a char without a 'c' attribute is skipped
                chars = "".join(char.get('c') or "" for char in font.findall('char'))
                sizes = font_blocks.setdefault(name, {})
                # Prefix each run with a space so consecutive runs do not fuse together
                sizes[size] = sizes.get(size, '') + " " + chars
    return font_blocks

def get_paper_text(paper_dictionary):
    ''' This function takes a list of nested dictionaries from xml_parser,
    and compiles them into one dictionary so that all pages of the PDF are compiled
    into one nested dictionary. '''
    
    fonts = {}
    for page in paper_dictionary:
        for font, sizes in page.items():
            merged = fonts.setdefault(font, {})
            for size, text in sizes.items():
                # Append this page's text to whatever earlier pages contributed
                merged[size] = merged.get(size, '') + text
    return fonts

def get_main_body(dict_):
    
    ''' This function takes the output from get_paper_text and returns the longest
    text in it. This is assumed to be the actual content of the research paper;
    footnotes, references, page numbers, titles, etc. are left out when they are
    set in a different font or size. '''
    
    # Flatten every (font, size) text and keep the longest one.
    # max() returns the first of any ties, and None if the dictionary is empty.
    texts = [text for sizes in dict_.values() for text in sizes.values()]
    return max(texts, key=len, default=None)
                

In [237]:
# Call xml_parser on each page and store the content of each page in the form 
# of nested dictionaries in a list
entire_doc = []
for page in doc:
    xml = page.get_text("xml")
    text_ = ET.fromstring(xml)
    entire_doc.append(xml_parser(text_))


In [238]:
# paper2 now holds the main body (content) of the research paper.
paper2 = get_main_body(get_paper_text(entire_doc))
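
To make the data flow concrete, here is a minimal sketch that runs the same three steps on a hand-written XML string instead of a real PDF. The XML only imitates the block/line/font/char nesting that the parser walks; the font names, sizes, and text are made up, and `parse_page`, `merge_pages`, and `longest_text` are hypothetical compact stand-ins for the helper functions, not the originals.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for one page of XML (assumed shape:
# page > block > line > font > char, with the character in the 'c' attribute).
SAMPLE_PAGE = """
<page>
  <block>
    <line>
      <font name="Times-Bold" size="14">
        <char c="H"/><char c="i"/>
      </font>
    </line>
  </block>
  <block>
    <line>
      <font name="Times-Roman" size="10">
        <char c="B"/><char c="o"/><char c="d"/><char c="y"/>
      </font>
    </line>
    <line>
      <font name="Times-Roman" size="10">
        <char c="t"/><char c="e"/><char c="x"/><char c="t"/>
      </font>
    </line>
  </block>
</page>
"""

def parse_page(root):
    """Group the characters of one page by font name, then by font size."""
    fonts = {}
    for font in root.iterfind("block/line/font"):
        run = "".join(ch.get("c") or "" for ch in font.findall("char"))
        sizes = fonts.setdefault(font.get("name"), {})
        size = font.get("size")
        # Each run is prefixed with a space so that lines do not fuse together
        sizes[size] = sizes.get(size, "") + " " + run
    return fonts

def merge_pages(pages):
    """Concatenate the per-page dictionaries into one dictionary."""
    merged = {}
    for page in pages:
        for name, sizes in page.items():
            target = merged.setdefault(name, {})
            for size, text in sizes.items():
                target[size] = target.get(size, "") + text
    return merged

def longest_text(fonts):
    """Return the longest text, which is taken to be the main body."""
    texts = [t for sizes in fonts.values() for t in sizes.values()]
    return max(texts, key=len, default=None)

page = parse_page(ET.fromstring(SAMPLE_PAGE))
merged = merge_pages([page, page])  # pretend the PDF has two identical pages
body = longest_text(merged)
print(merged)
print(repr(body))
```

In this toy input, the heading lands under its own font and size key, so the longest-text heuristic picks the body text and ignores the heading.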

## Text Summarizing ##

In [239]:
stopWords = set(stopwords.words("english"))

In [240]:
''' Tokenize words, remove stopwords, and store them 
in a dictionary along with their frequency '''
words = word_tokenize(paper2)
freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1
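
The same kind of frequency table can be built more compactly with `collections.Counter`. The sketch below is an alternative, not the code used for the results in this post: to stay dependency-free it tokenizes with a regular expression rather than `word_tokenize`, and `TINY_STOPWORDS` and `word_frequencies` are made-up names, with the stop word set standing in for NLTK's English list.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop word set, not NLTK's full English list.
TINY_STOPWORDS = {"the", "a", "is", "of", "and", "to"}

def word_frequencies(text, stop_words=TINY_STOPWORDS):
    """Count lowercase word tokens, skipping stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

freq = word_frequencies("The test is a test of the machine and the mind.")
print(freq.most_common(2))
```

One difference worth noting: the regular expression drops punctuation, whereas `word_tokenize` returns punctuation marks as tokens, so they end up in the frequency table unless they are filtered out.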

In [241]:
''' Tokenize the sentences, then store each one in a dictionary against its value.
The value is greater if the sentence includes more important (more frequent) words. '''

sentences = sent_tokenize(paper2)
sentenceValue = dict()

for sentence in sentences:
    sentence_lower = sentence.lower()
    for word, freq in freqTable.items():
        # Substring test: "art" also matches inside "part"
        if word in sentence_lower:
            sentenceValue[sentence] = sentenceValue.get(sentence, 0) + freq


In [242]:
''' The average sentence value is calculated '''
sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

average = int(sumValues / len(sentenceValue))

In [243]:
''' Each sentence's value is compared to the average sentence value.
If its value is greater than 1.2 * average, it is considered important enough 
to be in the summary '''
summary = ''

for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence
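
Putting the scoring and thresholding steps together, a minimal self-contained sketch could look like the following. It makes a few assumptions that differ from the cells above: sentences are split with a simple regular expression instead of `sent_tokenize`, words are matched as whole tokens rather than substrings, the average is not truncated to an integer, and `summarize`, `tokens`, `stop_words`, and `factor` are hypothetical names introduced for illustration.

```python
import re
from collections import Counter

def tokens(sentence, stop_words):
    """Lowercase word tokens of a sentence, without stop words."""
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower())
            if w not in stop_words]

def summarize(text, stop_words=frozenset(), factor=1.2):
    """Keep the sentences whose score exceeds factor * average score."""
    # Naive sentence split after ., ! or ? followed by whitespace (assumption).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in tokens(s, stop_words))
    # Score = sum of the document-wide frequencies of the distinct words in the sentence.
    scores = {s: sum(freq[w] for w in set(tokens(s, stop_words))) for s in sentences}
    if not scores:
        return ""
    average = sum(scores.values()) / len(scores)
    return " ".join(s for s in sentences if scores[s] > factor * average)

text = ("Robots read papers. Robots read papers and summarize papers for students. "
        "It rained.")
result = summarize(text, stop_words={"and", "for", "it"})
print(result)
```

Because each sentence is scored by a sum, longer sentences tend to score higher. Dividing the score by the number of words in the sentence would be one way to reduce that bias, at the cost of changing which sentences are selected.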



In [244]:
def floored_percentage(val, digits):
    val *= 10 ** (digits + 2)
    return '{1:.{0}f}%'.format(digits, floor(val) / 10 ** digits)

print("Length of the paper has been reduced by " \
      + floored_percentage(((len(paper2)-len(summary))/len(paper2)),2))

Length of the paper has been reduced by 43.37%
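
As a small worked example of what the helper does: it scales the ratio up, floors it, and scales it back down, so the percentage is truncated rather than rounded. The ratio below is a made-up value chosen to show the difference, not a number from the paper.

```python
from math import floor

def floored_percentage(val, digits):
    """Format a fraction as a percentage, truncating instead of rounding."""
    val *= 10 ** (digits + 2)
    return '{1:.{0}f}%'.format(digits, floor(val) / 10 ** digits)

ratio = 0.43379  # made-up reduction ratio, for illustration only
floored = floored_percentage(ratio, 2)
rounded = '{:.2f}%'.format(ratio * 100)  # plain rounding, for comparison
print(floored, rounded)
```

With this input, the floored version drops the third decimal place, while ordinary formatting rounds it up.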


In [245]:
print(summary)

 There is—or was, mostly, in the 1980s—the whole mass of research and trade literature on the much  misrepresented Turing test that would ostensibly show whether my unknown interlocutor is human or a machine,  and it was all about intelligence. Since then, our notion of intelligence has changed radically with regard to artificial  intelligence while our understanding of our own minds, unadvanced significantly either by the revolutionary  progress with mapping the human genome or by mapping out the human brain, has not progressed that much. In fact, if asked to think of a human mental functionality that a robot or any computer is not capable of, an  educated mature thinker will mention language, culture, humor, and on all of those counts, the situation is not clear. The computer can barely do anything with understanding, even though it can output tons of text, for  instance, answer my command to print out any text, including creating new ones, e.g., the list of all human diseases. This 

## Text to Speech ##

In [223]:
language = 'en'
myobj = gTTS(text=summary, lang=language, slow=False)

In [168]:
myobj.save("paper.mp3")

In [169]:
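# Open the MP3 with the default player. Passing a bare file name to the shell
# works on Windows; on macOS or Linux, use "open paper.mp3" or "xdg-open paper.mp3".
os.system("paper.mp3") 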

0