<a href="https://colab.research.google.com/github/plchuaa/gordonchu/blob/master/budget_speech_text_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**Spacy tokenizer + get sentence with largest text frequency**

1. Document preprocessing - retrive all paragraph in the document

2. Text Preprocessing(remove stopwords,punctuations). 

3. Frequency table of words/Word Frequency Distribution - how many times each word appears in the document

4. Score each sentence depending on the words it contains and the frequency table

5. Build summary by joining every sentence above a certain score limit



In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
# !pip install wget

In [0]:
import wget
url = 'https://www.budget.gov.hk/2020/eng/pdf/e_budget_speech_2020-21.pdf'

def download_pdf(url):
  print('Beginning file download with wget module')
  pdf = wget.download(url)
  return pdf


In [4]:
# !pip install PyPDF2
import PyPDF2
pdf = download_pdf(url)
pdfFileObj = open(pdf, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# PdfFileReader(pdfFileObj)
pdfReader.numPages

Beginning file download with wget module


114

In [0]:
import re

class page:
  
      def __init__ (self,sentence):
          sentence = re.sub(r'\n', "", sentence)
          searchObj = re.match( r'^(.+)\s{5,}([0-9]+)\s',sentence)
          self.title = searchObj.group(1)
          self.pagenumber = searchObj.group(2)
          self.content = re.sub(r'\s{2,}', " ", sentence[searchObj.end():])

In [6]:
topics = {}
start_page = 4
end_page = 64
for i in range(start_page,end_page): # page 5 - 64 need to build def to find last paragraph
    pageObj = pdfReader.getPage(i)
    pagetext = page(pageObj.extractText())

    if pagetext.title not in topics.keys():
      topics[pagetext.title] = re.sub(r'^(.*?)([0-9]+\.)', r"\2",pagetext.content) # r"\2" => keep group 2 
    else:
      topics[pagetext.title] += pagetext.content

for topic, sent in topics.items():
    topics[topic] = re.sub(r'[0-9]+\.\s', "", sent) # remove 1. 2. 3. 
    topics[topic] = re.sub(r'\([a-z]\)', "", sent) # remove (a) and (b)
    
topics

{'Building A Liveable City': '113. Hong Kong is our home. We will devote resources to different areas to develop Hong Kong into a more liveable city. Land Resources 114. The Government is committed to resolving the land and housing problems. We accepted the report of the Task Force on Land Supply early last year and are now taking forward its recommendations in full steam. Housing Land 115. In the medium to long term, various new development area projects will bring about over 210 000 housing units. In particular, the first parcel of housing land under the Tung Chung East reclamation works is ready for handover next month, and is expected to provide 10 000 public housing units in the first quarter of 2024. The land resumption exercise for Kwu Tung North/Fanling North new development areas commenced last year, with a view to enabling the first population intake of the public housing development in 2027. Besides, upon funding approval by the LegCo, we will commence in the second half of 

In [0]:
import spacy
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

In [0]:
def text_summarizer(sent, n = 2):
    nlp = spacy.load('en')
    texts = nlp(sent)
    mytokens = [token.text for token in texts]

    stopwords = list(STOP_WORDS)

    # Build Word Frequency
    word_frequencies = {}
    for word in texts:
        if word.text not in stopwords:
                if word.text not in word_frequencies.keys():
                    word_frequencies[word.text] = 1
                else:
                    word_frequencies[word.text] += 1
    
    # Maximum Word Frequency
    maximum_frequency = max(word_frequencies.values())
    for word in word_frequencies.keys():  
            word_frequencies[word] = (word_frequencies[word]/maximum_frequency)

    # Sentence Tokens
    sentence_list = [ sentence for sentence in texts.sents ]
    [w.text.lower() for t in sentence_list for w in t ]

    # Sentence Score via comparrng each word with sentence
    sentence_scores = {}  
    for sent in sentence_list:  
            for word in sent:
                if word.text.lower() in word_frequencies.keys():
                    if len(sent.text.split(' ')) < 30:
                        if sent not in sentence_scores.keys():
                            sentence_scores[sent] = word_frequencies[word.text.lower()]
                        else:
                            sentence_scores[sent] += word_frequencies[word.text.lower()]
    
    # retrive and combine sentence with top N highest similarity score
    from heapq import nlargest
    summarized_sentences = nlargest(n, sentence_scores, key=sentence_scores.get)
    summarized_sentences = [ w.text for w in summarized_sentences ]

    return ' '.join(summarized_sentences)


In [9]:
print("Summary:")
for topic, sent in topics.items():
    print (topic, " : ", text_summarizer(sent))

Summary:
Introduction  :  enterprises, safeguarding jobs, stimulating the economy and relieving I am well aware that financial resources alone are not enough to tackle the challenges we are facing. However, no one could have predicted that social incidents would break out in the middle of the year, which not only hit our economy but also broke our hearts.
Economic Situation in 2019  :  However, consumer price inflation went up in the second half of 2019, mainly reflecting the sharp increase in prices of basic foodstuffs amid reduced supply of fresh pork. Netting out the effects of the -off relief measures, the underlying inflation rate was 3 per cent for 2019, up 0.4 percentage point from 2018.
Economic Outlook for 2020 and Medium-term Outlook  :  Yet, on the other hand, we have to strive to overcome the constraints stemming from ageing population, a dwindling labour force and shortage of land. On the Mainland, while its economy showed some deceleration, it still recorded a solid growt