## Text Summarization
Task: Find a document (a blog post, news article, or research paper) and summarize it automatically using a machine-based approach.

Two versions of the summarizer are built:
* Version 1: Count the words in the article. Each word's score is its number of occurrences.
* Version 2: Score words with TF-IDF.

Due to time constraints, I have not completed the machine-learning approach. Instead, I examined whether the news headline and extracts have high TF-IDF scores. If they do, the next step would be to build a model that predicts which sentences in a document resemble a headline or extract, since those sentences should make a good summary of the document.

Skills: BeautifulSoup, nltk, TF-IDF (without using tfidf vectorizer)

### Scrape SCMP article

In [None]:
# Scrape an SCMP article using BeautifulSoup

In [3]:
from bs4 import BeautifulSoup
import requests

In [4]:
url = 'https://www.scmp.com/news/hong-kong/politics/article/2176058/hong-kong-lawmaker-eddie-chus-ban-village-election-based'
scmp = requests.get(url)
print(scmp.text)

<!DOCTYPE html>
<html lang="en" dir="ltr" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:sioct="http://rdfs.org/sioc/types#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" xmlns:schema="https://schema.org/">
<head profile="http://www.w3.org/1999/xhtml/vocab">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="x-dns-prefetch-control" content="on" />
<link rel="dns-prefetch" href="//cdn1.i-scmp.com" />
<link rel="dns-prefetch" href="//cdn2.i-scmp.com" />
<link rel="dns-prefetch" href="//cdn3.i-scmp.com" />
<link rel="dns-prefetch" href="//cdn4.i-scmp.com" />
<!--[if I

In [5]:
soup = BeautifulSoup(scmp.text, "html.parser")

In [5]:
content = soup.find_all(class_="panel-pane pane-entity-field pane-node-body pane-first pos-0")[0].get_text(strip=True)
print(content)

The decision to ban lawmaker Eddie Chu Hoi-dick from running in a rural representative election was based on a shaky argument that could be struck down in court, according to leading legal scholars, who also called on Hong Kong’s courts to clarify the vagueness in election laws.Johannes Chan Man-mun, the former law dean of the University of Hong Kong, was speaking on Sunday after Chu was told he would not be allowed to run for a post as a local village’s representative.Returning officer Enoch Yuen Ka-lok pointed to Chu’s stance on Hong Kong independence and said the lawmaker had dodged his questions on his political beliefs. Yuen took this to imply that Chu supported the possibility of Hong Kong breaking with Beijing in the future.Chan, however, said Chu’s responses to the returning officer were open to interpretation. The legal scholar did not believe they met the standard of giving the election officer “cogent, clear and compelling” evidence as required by the precedent set in the ca
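
The extracted text above runs paragraphs together (for example `laws.Johannes`), because `get_text(strip=True)` joins text nodes with an empty separator. With BeautifulSoup, passing a separator such as `get_text(" ", strip=True)` avoids this. The same idea can be sketched with only the standard library; the class name `ParagraphText` and the sample HTML below are illustrative assumptions, not part of the notebook.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of every <p> element, one string per paragraph."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._depth = 0      # > 0 while inside a <p> element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buffer).strip()
                if text:
                    self.paragraphs.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def paragraphs_to_text(html):
    """Join paragraph texts with a space so sentence boundaries survive."""
    parser = ParagraphText()
    parser.feed(html)
    return " ".join(parser.paragraphs)


sample_html = "<div><p>First sentence.</p><p>Second <b>bold</b> sentence.</p></div>"
joined = paragraphs_to_text(sample_html)
print(joined)
```

Keeping a space between paragraphs means the later sentence tokenizer does not need the text to be patched first.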

## Version 1
Count the words in the article. Each word's score is its number of occurrences, and each sentence's score is the sum of its words' scores. The highest-scoring sentences form the summary.
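
Before the step-by-step cells, the whole idea can be sketched on toy data with only the standard library. The stop list, the regex tokenizer, and the function names here are simplifying assumptions for illustration; the notebook itself uses NLTK tokenizers, NLTK stopwords, and stemming.

```python
import re
from collections import Counter

# Hypothetical stop list for the sketch; the notebook uses NLTK's English list.
STOPWORDS = {"the", "a", "of", "to", "in", "is", "and", "was"}


def tokenize(text):
    """Lowercase word tokens; a crude stand-in for nltk.word_tokenize."""
    return re.findall(r"[a-z]+", text.lower())


def top_sentences(sentences, k=1):
    # Document-wide frequency of every non-stopword token.
    counts = Counter(
        w for s in sentences for w in tokenize(s) if w not in STOPWORDS
    )
    # A sentence scores the sum of the frequencies of its tokens
    # (stopwords are absent from counts, so they contribute 0).
    scores = [sum(counts[w] for w in tokenize(s)) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


toy = [
    "The court banned the lawmaker.",
    "The lawmaker will appeal to the court.",
    "It rained.",
]
best, scores = top_sentences(toy, k=1)
print(best, scores)
```

Because scores are plain sums, longer sentences tend to win, which is one motivation for the normalized scoring in Version 2.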

In [7]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [8]:
para = sent_tokenize(content)  # full stops are not followed by a space, so this first pass only splits the content into paragraph-like chunks
print(len(para))

5


In [9]:
working = content.replace('.','. ')  # add a space after every full stop so the content can be split into sentences

In [10]:
sent = sent_tokenize(working)
print(len(sent))

40
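
Replacing every `.` with `. ` also inserts a space inside decimals (for example `2.9` becomes `2. 9`). A narrower repair, sketched below under the assumption that a new sentence starts with an uppercase letter or an opening quote, only touches full stops that are glued to the next sentence. The helper name `unglue_sentences` is hypothetical, and initials such as `H.W.` would still be spaced apart.

```python
import re

# Insert a space only where sentence-ending punctuation (optionally followed by
# a closing quote) is glued directly to an uppercase letter or an opening quote.
_GLUED = re.compile(r"([.!?][”’]?)(?=[A-Z“‘])")


def unglue_sentences(text):
    return _GLUED.sub(r"\1 ", text)


sample = "in election laws.Johannes Chan spoke.Ho offered US$2.9 million."
fixed = unglue_sentences(sample)
print(fixed)
```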


In [11]:
w_tokens = []
for s in sent:
    w_tokens += word_tokenize(s.lower())

**Remove stopwords and punctuation**

In [12]:
from string import punctuation
from nltk.corpus import stopwords
swp_filter = set(stopwords.words('english')+list(punctuation))

In [13]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [14]:
clean_tokens = [w for w in w_tokens if w not in swp_filter]
print(clean_tokens)

['decision', 'ban', 'lawmaker', 'eddie', 'chu', 'hoi-dick', 'running', 'rural', 'representative', 'election', 'based', 'shaky', 'argument', 'could', 'struck', 'court', 'according', 'leading', 'legal', 'scholars', 'also', 'called', 'hong', 'kong', '’', 'courts', 'clarify', 'vagueness', 'election', 'laws', 'johannes', 'chan', 'man-mun', 'former', 'law', 'dean', 'university', 'hong', 'kong', 'speaking', 'sunday', 'chu', 'told', 'would', 'allowed', 'run', 'post', 'local', 'village', '’', 'representative', 'returning', 'officer', 'enoch', 'yuen', 'ka-lok', 'pointed', 'chu', '’', 'stance', 'hong', 'kong', 'independence', 'said', 'lawmaker', 'dodged', 'questions', 'political', 'beliefs', 'yuen', 'took', 'imply', 'chu', 'supported', 'possibility', 'hong', 'kong', 'breaking', 'beijing', 'future', 'chan', 'however', 'said', 'chu', '’', 'responses', 'returning', 'officer', 'open', 'interpretation', 'legal', 'scholar', 'believe', 'met', 'standard', 'giving', 'election', 'officer', '“', 'cogent', '
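
The token list above still contains typographic quotes such as `’` and `“`, because `string.punctuation` only covers ASCII characters. Filtering by token length removes them but also drops genuine one-letter tokens. One possible complement, sketched here with the standard `unicodedata` module, is to test each token's Unicode category; the helper name `is_punctuation` is an assumption for this example.

```python
import unicodedata


def is_punctuation(token):
    """True when every character belongs to a Unicode punctuation category."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)


tokens = ["hong", "kong", "’", "courts", "“", "cogent", "hoi-dick", "--", "a"]
kept = [t for t in tokens if not is_punctuation(t)]
print(kept)
```

Symbols such as `$` are in a symbol category rather than a punctuation one, so this check works alongside `string.punctuation` rather than replacing it.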

In [15]:
clean_tokens = []
for w in w_tokens:
    if w not in swp_filter:
        if len(w) != 1:
            clean_tokens.append(w)
clean_tokens

['decision',
 'ban',
 'lawmaker',
 'eddie',
 'chu',
 'hoi-dick',
 'running',
 'rural',
 'representative',
 'election',
 'based',
 'shaky',
 'argument',
 'could',
 'struck',
 'court',
 'according',
 'leading',
 'legal',
 'scholars',
 'also',
 'called',
 'hong',
 'kong',
 'courts',
 'clarify',
 'vagueness',
 'election',
 'laws',
 'johannes',
 'chan',
 'man-mun',
 'former',
 'law',
 'dean',
 'university',
 'hong',
 'kong',
 'speaking',
 'sunday',
 'chu',
 'told',
 'would',
 'allowed',
 'run',
 'post',
 'local',
 'village',
 'representative',
 'returning',
 'officer',
 'enoch',
 'yuen',
 'ka-lok',
 'pointed',
 'chu',
 'stance',
 'hong',
 'kong',
 'independence',
 'said',
 'lawmaker',
 'dodged',
 'questions',
 'political',
 'beliefs',
 'yuen',
 'took',
 'imply',
 'chu',
 'supported',
 'possibility',
 'hong',
 'kong',
 'breaking',
 'beijing',
 'future',
 'chan',
 'however',
 'said',
 'chu',
 'responses',
 'returning',
 'officer',
 'open',
 'interpretation',
 'legal',
 'scholar',
 'believe',


In [16]:
import nltk

In [17]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(clean_tokens)
sorted(finder.ngram_fd.items())

[(("'s", 'political'), 1),
 (('2003', 'mirroring'), 1),
 (('2016', 'political'), 1),
 (('2016', 'pressed'), 1),
 (('24', 'rural'), 1),
 (('accepts', 'primacy'), 1),
 (('according', 'leading'), 1),
 (('according', 'section'), 1),
 (('added', 'since'), 1),
 (('added', 'would'), 1),
 (('administrative', 'region'), 1),
 (('advocate', 'johannes'), 1),
 (('affairs', 'secretary'), 1),
 (('ago', 'chan'), 1),
 (('agreed', 'room'), 1),
 (('ahead', 'january'), 1),
 (('aid', 'bear'), 1),
 (('alive', 'best'), 1),
 (('allegiance', 'chan'), 1),
 (('allegiance', 'chinastephen'), 1),
 (('allegiance', 'hong'), 2),
 (('allegiance', 'requirement'), 1),
 (('allegiance', 'sar'), 1),
 (('allowed', 'run'), 1),
 (('also', 'called'), 1),
 (('also', 'create'), 1),
 (('also', 'elected'), 1),
 (('also', 'said'), 2),
 (('andy', 'chan'), 3),
 (('another', 'legal'), 1),
 (('another', 'right'), 1),
 (('answered', 'questions'), 1),
 (('anyone', 'back'), 1),
 (('appeal', 'february'), 1),
 (('appeal', 'legal'), 1),
 (('a
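
`sorted(finder.ngram_fd.items())` orders the bigrams alphabetically, so frequent pairs such as `('andy', 'chan')` are scattered through the list. Since `ngram_fd` is a frequency distribution, sorting by count is usually more informative. The sketch below reproduces the counting with a plain `Counter` on made-up tokens; `bigram_counts` is a hypothetical helper, not an NLTK function.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs, similar in spirit to finder.ngram_fd."""
    return Counter(zip(tokens, tokens[1:]))


tokens = ["andy", "chan", "said", "andy", "chan", "also", "said"]
counts = bigram_counts(tokens)
print(counts.most_common(2))
```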

In [18]:
from nltk.stem.porter import *
ps = PorterStemmer()

In [19]:
stem_token = [ps.stem(clean_token) for clean_token in clean_tokens]

In [20]:
stem_token

['decis',
 'ban',
 'lawmak',
 'eddi',
 'chu',
 'hoi-dick',
 'run',
 'rural',
 'repres',
 'elect',
 'base',
 'shaki',
 'argument',
 'could',
 'struck',
 'court',
 'accord',
 'lead',
 'legal',
 'scholar',
 'also',
 'call',
 'hong',
 'kong',
 'court',
 'clarifi',
 'vagu',
 'elect',
 'law',
 'johann',
 'chan',
 'man-mun',
 'former',
 'law',
 'dean',
 'univers',
 'hong',
 'kong',
 'speak',
 'sunday',
 'chu',
 'told',
 'would',
 'allow',
 'run',
 'post',
 'local',
 'villag',
 'repres',
 'return',
 'offic',
 'enoch',
 'yuen',
 'ka-lok',
 'point',
 'chu',
 'stanc',
 'hong',
 'kong',
 'independ',
 'said',
 'lawmak',
 'dodg',
 'question',
 'polit',
 'belief',
 'yuen',
 'took',
 'impli',
 'chu',
 'support',
 'possibl',
 'hong',
 'kong',
 'break',
 'beij',
 'futur',
 'chan',
 'howev',
 'said',
 'chu',
 'respons',
 'return',
 'offic',
 'open',
 'interpret',
 'legal',
 'scholar',
 'believ',
 'met',
 'standard',
 'give',
 'elect',
 'offic',
 'cogent',
 'clear',
 'compel',
 'evid',
 'requir',
 'preced

In [21]:
from collections import Counter
para_count = Counter(stem_token)
para_count

Counter({'decis': 5,
         'ban': 3,
         'lawmak': 4,
         'eddi': 3,
         'chu': 27,
         'hoi-dick': 1,
         'run': 4,
         'rural': 4,
         'repres': 7,
         'elect': 20,
         'base': 1,
         'shaki': 1,
         'argument': 2,
         'could': 7,
         'struck': 1,
         'court': 7,
         'accord': 2,
         'lead': 1,
         'legal': 6,
         'scholar': 3,
         'also': 5,
         'call': 1,
         'hong': 10,
         'kong': 10,
         'clarifi': 4,
         'vagu': 3,
         'law': 12,
         'johann': 3,
         'chan': 11,
         'man-mun': 1,
         'former': 2,
         'dean': 1,
         'univers': 1,
         'speak': 2,
         'sunday': 2,
         'told': 1,
         'would': 8,
         'allow': 1,
         'post': 1,
         'local': 1,
         'villag': 5,
         'return': 12,
         'offic': 17,
         'enoch': 1,
         'yuen': 4,
         'ka-lok': 1,
         'point': 1,
  

In [22]:
pos_list = nltk.pos_tag(stem_token)  # tags are approximate: the tagger sees stems, not full words in context
pos_counts = Counter(tag for _, tag in pos_list)
print("the five most common tags are", pos_counts.most_common(5))

the five most common tags are [('NN', 303), ('JJ', 137), ('VBD', 34), ('VBP', 29), ('RB', 29)]


In [23]:
word_tokenize(sent[0])

['The',
 'decision',
 'to',
 'ban',
 'lawmaker',
 'Eddie',
 'Chu',
 'Hoi-dick',
 'from',
 'running',
 'in',
 'a',
 'rural',
 'representative',
 'election',
 'was',
 'based',
 'on',
 'a',
 'shaky',
 'argument',
 'that',
 'could',
 'be',
 'struck',
 'down',
 'in',
 'court',
 ',',
 'according',
 'to',
 'leading',
 'legal',
 'scholars',
 ',',
 'who',
 'also',
 'called',
 'on',
 'Hong',
 'Kong',
 '’',
 's',
 'courts',
 'to',
 'clarify',
 'the',
 'vagueness',
 'in',
 'election',
 'laws',
 '.']

In [24]:
sent_stem_token = [ps.stem(clean_token) for clean_token in word_tokenize(sent[0])]
print(sent_stem_token)

['the', 'decis', 'to', 'ban', 'lawmak', 'eddi', 'chu', 'hoi-dick', 'from', 'run', 'in', 'a', 'rural', 'repres', 'elect', 'wa', 'base', 'on', 'a', 'shaki', 'argument', 'that', 'could', 'be', 'struck', 'down', 'in', 'court', ',', 'accord', 'to', 'lead', 'legal', 'scholar', ',', 'who', 'also', 'call', 'on', 'hong', 'kong', '’', 's', 'court', 'to', 'clarifi', 'the', 'vagu', 'in', 'elect', 'law', '.']


In [25]:
score = 0
for i in sent_stem_token:
    if i in para_count:
        score += para_count[i]

In [26]:
para_score = {}
for i, s in enumerate(sent):
    score = 0
    sent_stem_token = [ps.stem(clean_token) for clean_token in word_tokenize(s)]
    for j in sent_stem_token:
        if j in para_count:
            score += para_count[j]
    para_score[i] = score

In [27]:
para_score

{0: 181,
 1: 110,
 2: 131,
 3: 62,
 4: 97,
 5: 84,
 6: 108,
 7: 129,
 8: 104,
 9: 132,
 10: 201,
 11: 108,
 12: 80,
 13: 39,
 14: 82,
 15: 103,
 16: 160,
 17: 102,
 18: 62,
 19: 100,
 20: 37,
 21: 52,
 22: 120,
 23: 141,
 24: 190,
 25: 10,
 26: 14,
 27: 127,
 28: 88,
 29: 37,
 30: 105,
 31: 95,
 32: 66,
 33: 142,
 34: 45,
 35: 67,
 36: 113,
 37: 46,
 38: 134,
 39: 103}

In [28]:
high_score = sorted(para_score, key=para_score.get, reverse=True)[:3]
print(high_score)

[10, 24, 0]


In [29]:
print("Summary: ")
for i in high_score:
    print(sent[i])

Summary: 
While the landmark ruling was concerned only with Legco elections, Johannes Chan said, after Chu’s case, returning officers for other elections could have similar powers to ban candidates from running, including in the district council elections next year.
Lawmaker accepts primacy of Beijing in bid to keep village election hopes alive“At best, we could argue Chu’s reply to the officer was vague about self-determination – even the returning officer himself confessed Chu was only ‘implicitly’ confirming independence as an option,” he said.
The decision to ban lawmaker Eddie Chu Hoi-dick from running in a rural representative election was based on a shaky argument that could be struck down in court, according to leading legal scholars, who also called on Hong Kong’s courts to clarify the vagueness in election laws.
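
The summary above lists sentences by descending score, so the article's opening sentence appears last. A summary often reads more naturally in document order. A minimal sketch, assuming a score dictionary shaped like `para_score` (the function name `summary_indices` is hypothetical):

```python
import heapq


def summary_indices(scores, k=3):
    """Indices of the k best-scoring sentences, returned in document order."""
    best = heapq.nlargest(k, scores, key=scores.get)
    return sorted(best)


# Scores copied from a few entries of para_score above.
para_score = {0: 181, 1: 110, 10: 201, 16: 160, 24: 190}
order = summary_indices(para_score, k=3)
print(order)
```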


---

## Putting into Functions

In [30]:
from nltk.tokenize import sent_tokenize, word_tokenize
def para_split(content):

    para = sent_tokenize(content)  # full stops are not followed by a space, so this first pass only splits the content into paragraph-like chunks

    working = content.replace('.','. ')  # add a space after every full stop so the content can be split into sentences

    sent = sent_tokenize(working)

    w_tokens = []
    for s in sent:
        w_tokens += word_tokenize(s.lower())
    
    return (para, sent, w_tokens)

In [31]:
from string import punctuation
from nltk.corpus import stopwords

def replacestopword(w_tokens):
    swp_filter = set(stopwords.words('english')+list(punctuation))

    clean_tokens = []
    for w in w_tokens:
        if w not in swp_filter:
            if len(w) != 1:
                clean_tokens.append(w)
    return clean_tokens

In [32]:
from nltk.stem.porter import *

def stemming(clean_tokens):
    ps = PorterStemmer()

    stem_token = [ps.stem(clean_token) for clean_token in clean_tokens]
    
    return stem_token

In [33]:
from collections import Counter

def scorecard(stem_token):
    para_count = Counter(stem_token)
    return para_count

In [34]:
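def scoring(sent, para_count):
    ps = PorterStemmer()  # local stemmer, so the function does not rely on a global variable
    para_score = {}
    for i, s in enumerate(sent):
        score = 0
        sent_stem_token = [ps.stem(sent_token) for sent_token in word_tokenize(s)]
        # a sentence's score is the sum of the document-wide counts of its stems
        for j in sent_stem_token:
            if j in para_count:
                score += para_count[j]
        para_score[i] = score
    # print the five highest-scoring sentences with their scores
    high_score = sorted(para_score, key=para_score.get, reverse=True)[:5]
    print("Summary: ")
    for j in high_score:
        print(sent[j])
        print("Score: ",para_score[j])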


## Version 1

In [35]:
# Run all the functions
para, sent, w_tokens = para_split(content)
s = scorecard(stemming(replacestopword(w_tokens)))
scoring(sent, s)

Summary: 
While the landmark ruling was concerned only with Legco elections, Johannes Chan said, after Chu’s case, returning officers for other elections could have similar powers to ban candidates from running, including in the district council elections next year.
Score:  201
Lawmaker accepts primacy of Beijing in bid to keep village election hopes alive“At best, we could argue Chu’s reply to the officer was vague about self-determination – even the returning officer himself confessed Chu was only ‘implicitly’ confirming independence as an option,” he said.
Score:  190
The decision to ban lawmaker Eddie Chu Hoi-dick from running in a rural representative election was based on a shaky argument that could be struck down in court, according to leading legal scholars, who also called on Hong Kong’s courts to clarify the vagueness in election laws.
Score:  181
More questions for Eddie Chu over pledge of allegiance to ChinaStephen Fisher, the former deputy home affairs secretary who led th

## Version 2
* Remove modal verbs such as "could" and "would"
* Use term frequency (TF)
* Use TF-IDF
* Use SnowballStemmer

In [36]:
from collections import Counter
para_count = Counter(stem_token)
para_count
# 'could' scores 7 although it is only a modal verb

Counter({'decis': 5,
         'ban': 3,
         'lawmak': 4,
         'eddi': 3,
         'chu': 27,
         'hoi-dick': 1,
         'run': 4,
         'rural': 4,
         'repres': 7,
         'elect': 20,
         'base': 1,
         'shaki': 1,
         'argument': 2,
         'could': 7,
         'struck': 1,
         'court': 7,
         'accord': 2,
         'lead': 1,
         'legal': 6,
         'scholar': 3,
         'also': 5,
         'call': 1,
         'hong': 10,
         'kong': 10,
         'clarifi': 4,
         'vagu': 3,
         'law': 12,
         'johann': 3,
         'chan': 11,
         'man-mun': 1,
         'former': 2,
         'dean': 1,
         'univers': 1,
         'speak': 2,
         'sunday': 2,
         'told': 1,
         'would': 8,
         'allow': 1,
         'post': 1,
         'local': 1,
         'villag': 5,
         'return': 12,
         'offic': 17,
         'enoch': 1,
         'yuen': 4,
         'ka-lok': 1,
         'point': 1,
  

In [37]:
pos_list = nltk.pos_tag(stem_token)  # tags are approximate: the tagger sees stems, not full words in context
pos_counts = Counter(tag for _, tag in pos_list)
print("the five most common tags are", pos_counts.most_common(5))

the five most common tags are [('NN', 303), ('JJ', 137), ('VBD', 34), ('VBP', 29), ('RB', 29)]


In [38]:
noMD_tokens = []
for item in nltk.pos_tag(clean_tokens):
    if item[1] != "MD":
        noMD_tokens.append(item[0])
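
The tagger here sees tokens after stopword removal, so it has less context than in running text and its `MD` tags may be less reliable. A deterministic alternative, sketched below, is to filter against an explicit list. The `MODALS` set is a hand-written assumption for illustration, not something taken from NLTK.

```python
# Assumption: a hand-written list of English modal verbs for illustration.
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}


def drop_modals(tokens):
    """Remove modal verbs regardless of letter case."""
    return [t for t in tokens if t.lower() not in MODALS]


tokens = ["argument", "could", "struck", "court", "would", "allowed", "run"]
filtered = drop_modals(tokens)
print(filtered)
```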

In [39]:
from string import punctuation
from nltk.corpus import stopwords

def replacestopword_noMD(w_tokens):
    swp_filter = set(stopwords.words('english')+list(punctuation))

    clean_tokens = []
    for w in w_tokens:
        if w not in swp_filter:
            if len(w) != 1:
                clean_tokens.append(w)
                
    noMD_tokens = []
    for item in nltk.pos_tag(clean_tokens):
        if item[1] != "MD":
            noMD_tokens.append(item[0])
            
    return noMD_tokens

In [40]:
from nltk.stem.snowball import SnowballStemmer

def snowball_stemming(clean_tokens):
    stemmer = SnowballStemmer("english")

    stem_token = [stemmer.stem(clean_token) for clean_token in clean_tokens]
    
    return stem_token

In [41]:
def tf(para_count, clean_tokens):
    tfDict = {}
    clean_tokens_count = len(clean_tokens)
    for word, count in para_count.items():
        tfDict[word] = count/float(clean_tokens_count)
    return tfDict

In [42]:
# tfDict = tf(para_count, stem_tokens)
# print(tfDict)

In [43]:
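def tf_scoring(sent, para_count):
    # NOTE: sentences are stemmed with PorterStemmer here, while the keys of
    # para_count come from SnowballStemmer in the calls that follow, so a stem
    # that differs between the two stemmers is not matched.
    ps = PorterStemmer()
    para_score = {}
    for i, s in enumerate(sent):
        score = 0
        tf_w_count = 0
        sent_stem_token = [ps.stem(sent_token) for sent_token in word_tokenize(s)]
        for j in sent_stem_token:
            if j in para_count:
                score += para_count[j]
                tf_w_count += 1
        # average weight of the matched words; 0.0 when no word matched
        para_score[i] = float(score) / tf_w_count if tf_w_count else 0.0
    high_score = sorted(para_score, key=para_score.get, reverse=True)[:5]
    print("Summary: ")
    for j in high_score:
        print(sent[j])
        print("Score: ",para_score[j])
    return para_score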
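
Dividing by the number of matched words turns the sentence score into an average, which removes the advantage that long sentences have under a plain sum. The toy comparison below illustrates the effect; the weights and the helper names `sum_score` and `mean_score` are made up for this example.

```python
def sum_score(tokens, weights):
    """Plain sum of weights, as in Version 1."""
    return sum(weights.get(t, 0.0) for t in tokens)


def mean_score(tokens, weights):
    """Average weight over the tokens that have a weight; 0.0 if none do."""
    matched = [weights[t] for t in tokens if t in weights]
    return sum(matched) / len(matched) if matched else 0.0


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
weights = {"chu": 0.04, "election": 0.03, "officer": 0.02, "said": 0.01}

short = ["chu", "election"]
long_ = ["chu", "election", "officer", "said", "said", "said"]

print(sum_score(short, weights), sum_score(long_, weights))
print(mean_score(short, weights), mean_score(long_, weights))
```

With the sum, the long sentence wins; with the average, the short sentence made of high-weight words wins.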

**TF Method**

In [44]:
para, sent, w_tokens = para_split(content)
noMD_tokens = replacestopword_noMD(w_tokens)
stem_tokens = snowball_stemming(noMD_tokens)
para_count = scorecard(stem_tokens)
tf_dict = tf(para_count, stem_tokens)

tf_final = tf_scoring(sent, tf_dict)

Summary: 
Chan, however, said Chu’s responses to the returning officer were open to interpretation.
Score:  0.018175004684279558
“Chu had failed to convince the returning officer that he has true intentions of upholding the Basic Law,” Tong said.
Score:  0.015879707700955592
He also said Hong Kong courts must clarify the vagueness in election laws and process such appeals more quickly.
Score:  0.01548367315652307
Gladys Li, the lawyer who represented Andy Chan, said the ruling would be binding on returning officers for other elections.
Score:  0.01533036946190403
Tong also said it might not have made a difference, had Chu answered all the questions raised by the returning officer.
Score:  0.01533036946190403


**TF-IDF Method**

In [45]:
import math
def computeIDF(para_count, sent):
    working_dict = {}
    idf_dict = {}
    N = len(sent)

    working_dict = dict.fromkeys(para_count.keys(), 0)

    for s in sent:
        s_tokens = word_tokenize(s.lower())
        noMD_tokens = replacestopword_noMD(s_tokens)
        stem_tokens = snowball_stemming(noMD_tokens)
        s_freq = scorecard(stem_tokens)

        for word, val in s_freq.items():
            if val > 0:
                working_dict[word] += 1

    for word, val in working_dict.items():
        idf_dict[word] = math.log10(N / float(val))

    return idf_dict
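
Here each sentence plays the role of a "document", so a word that appears in many sentences gets a low IDF. A compact sketch of the same computation on pre-tokenized toy sentences follows. It builds the document frequencies from the sentences themselves, so no term can have a frequency of zero; the function name `sentence_idf` is an assumption for this example.

```python
import math
from collections import Counter


def sentence_idf(tokenized_sentences):
    """idf(t) = log10(N / df(t)), where df(t) counts sentences containing t."""
    n = len(tokenized_sentences)
    df = Counter()
    for tokens in tokenized_sentences:
        df.update(set(tokens))  # each sentence counts a term at most once
    return {term: math.log10(n / count) for term, count in df.items()}


sentences = [
    ["chu", "ban", "election"],
    ["chu", "court", "election"],
    ["chu", "appeal"],
    ["officer", "said"],
]
idf = sentence_idf(sentences)
print(idf["chu"], idf["election"], idf["appeal"])
```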

In [46]:
def computeTFIDF(tf_dict, idfs):
    tfidf = {}
    for word, val in tf_dict.items():
        tfidf[word] = val*idfs[word]
    return tfidf
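
Written out, the functions `tf`, `computeIDF`, `computeTFIDF`, and `tf_scoring` correspond to the following quantities. The symbols are introduced here for illustration: $c(t)$ is the count of stem $t$ in the whole article, $N$ is the number of sentences, $\mathrm{df}(t)$ is the number of sentences containing $t$, and $M_s$ is the list of tokens in sentence $s$ whose stem has a weight.

$$
\mathrm{tf}(t) = \frac{c(t)}{\sum_{t'} c(t')}, \qquad
\mathrm{idf}(t) = \log_{10}\frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t) = \mathrm{tf}(t)\cdot\mathrm{idf}(t)
$$

$$
\mathrm{score}(s) = \frac{1}{|M_s|}\sum_{t \in M_s} \mathrm{tfidf}(t)
$$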

In [47]:
para, sent, w_tokens = para_split(content)
noMD_tokens = replacestopword_noMD(w_tokens)
stem_tokens = snowball_stemming(noMD_tokens)
para_count = scorecard(stem_tokens)
tf_dict = tf(para_count, stem_tokens)

idf_dict = computeIDF(para_count, sent)
tfidf = computeTFIDF(tf_dict, idf_dict)

tfidf_final = tf_scoring(sent, tfidf)

Summary: 
He also said Hong Kong courts must clarify the vagueness in election laws and process such appeals more quickly.
Score:  0.008795636490231115
Gladys Li, the lawyer who represented Andy Chan, said the ruling would be binding on returning officers for other elections.
Score:  0.007763553974345251
“Chu had failed to convince the returning officer that he has true intentions of upholding the Basic Law,” Tong said.
Score:  0.007485977961502632
Chan, however, said Chu’s responses to the returning officer were open to interpretation.
Score:  0.007387435028351706
Both Chan and Li said how the returning officer had come to the disqualification might require clarification in any future court ruling.
Score:  0.007270512972202376


In [48]:
# https://hackernoon.com/finding-the-most-important-sentences-using-nlp-tf-idf-3065028897a3
# https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk
# https://nlpforhackers.io/tf-idf/

---

# Machine Learning Method

Check whether the headline and extracts have high TF-IDF scores.

In [49]:
import pandas as pd

In [50]:
tfidf_df = pd.DataFrame(list(tfidf_final.items()), columns = ['sentence', 'tfidf'])

In [51]:
sent_dict = {k:v for k, v in enumerate(sent)}

In [52]:
tfidf_df['sentence'] = tfidf_df['sentence'].map(sent_dict)

In [53]:
tfidf_df.sort_values(by = 'tfidf', ascending = False)

| | sentence | tfidf |
|---|---|---|
| 15 | He also said Hong Kong courts must clarify the... | 0.008796 |
| 11 | Gladys Li, the lawyer who represented Andy Cha... | 0.007764 |
| 36 | “Chu had failed to convince the returning offi... | 0.007486 |
| 4 | Chan, however, said Chu’s responses to the ret... | 0.007387 |
| 28 | Both Chan and Li said how the returning office... | 0.007271 |
| 10 | While the landmark ruling was concerned only w... | 0.007221 |
| 39 | Tong also said it might not have made a differ... | 0.007113 |
| 8 | The allegiance requirement was written into la... | 0.007099 |
| 0 | The decision to ban lawmaker Eddie Chu Hoi-dic... | 0.007059 |
| 2 | Returning officer Enoch Yuen Ka-lok pointed to... | 0.007031 |


In [54]:
# Add the headline and extract to the article text to test whether they receive high TF-IDF scores
with open('scmp_content_with_header.txt', 'r') as f:
    content2 = f.read()

In [55]:
content2

"Hong Kong lawmaker Eddie Chu’s ban from village election based on shaky argument, legal scholars say. Former HKU law dean Johannes Chan does not believe election officer can provide ‘cogent, clear and compelling’ evidence for disqualification. Eric Cheung, also of HKU, said the reasons given for Chu’s disqualification were weak and Hong Kong courts must clarify the vagueness in election laws. The decision to ban lawmaker Eddie Chu Hoi-dick from running in a rural representative election was based on a shaky argument that could be struck down in court, according to leading legal scholars, who also called on Hong Kong’s courts to clarify the vagueness in election laws.Johannes Chan Man-mun, the former law dean of the University of Hong Kong, was speaking on Sunday after Chu was told he would not be allowed to run for a post as a local village’s representative.Returning officer Enoch Yuen Ka-lok pointed to Chu’s stance on Hong Kong independence and said the lawmaker had dodged his questi

In [56]:
para, sent, w_tokens = para_split(content2)
noMD_tokens = replacestopword_noMD(w_tokens)
stem_tokens = snowball_stemming(noMD_tokens)
para_count = scorecard(stem_tokens)
tf_dict = tf(para_count, stem_tokens)

idf_dict = computeIDF(para_count, sent)
tfidf = computeTFIDF(tf_dict, idf_dict)

tfidf_final = tf_scoring(sent, tfidf)

Summary: 
He also said Hong Kong courts must clarify the vagueness in election laws and process such appeals more quickly.
Score:  0.008923338893278704
Hong Kong lawmaker Eddie Chu’s ban from village election based on shaky argument, legal scholars say.
Score:  0.0076762922743352875
Eric Cheung, also of HKU, said the reasons given for Chu’s disqualification were weak and Hong Kong courts must clarify the vagueness in election laws.
Score:  0.0075954736005281595
Gladys Li, the lawyer who represented Andy Chan, said the ruling would be binding on returning officers for other elections.
Score:  0.007579479399108593
Both Chan and Li said how the returning officer had come to the disqualification might require clarification in any future court ruling.
Score:  0.007369178389951073


In [57]:
tfidf_df = pd.DataFrame(list(tfidf_final.items()), columns = ['sentence', 'tfidf'])

sent_dict = {k:v for k, v in enumerate(sent)}

tfidf_df['sentence'] = tfidf_df['sentence'].map(sent_dict)

tfidf_df.sort_values(by = 'tfidf', ascending = False)

| | sentence | tfidf |
|---|---|---|
| 18 | He also said Hong Kong courts must clarify the... | 0.008923 |
| 0 | Hong Kong lawmaker Eddie Chu’s ban from villag... | 0.007676 |
| 2 | Eric Cheung, also of HKU, said the reasons giv... | 0.007595 |
| 14 | Gladys Li, the lawyer who represented Andy Cha... | 0.007579 |
| 31 | Both Chan and Li said how the returning office... | 0.007369 |
| 39 | “Chu had failed to convince the returning offi... | 0.007339 |
| 7 | Chan, however, said Chu’s responses to the ret... | 0.007285 |
| 3 | The decision to ban lawmaker Eddie Chu Hoi-dic... | 0.007276 |
| 13 | While the landmark ruling was concerned only w... | 0.007063 |
| 42 | Tong also said it might not have made a differ... | 0.007012 |
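
To make the check less visual, the rank of the headline and extract sentences can be computed directly. The sketch below uses only the ten rows displayed above, so a sentence outside that list is reported as `None`; the helper name `ranks` is hypothetical. Sentences 0 to 2 are the headline and the two extract sentences that were prepended in `content2`.

```python
def ranks(scores, indices):
    """1-based rank of each requested sentence index (None if not listed)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    position = {idx: r for r, idx in enumerate(ordered, start=1)}
    return {idx: position.get(idx) for idx in indices}


# The ten rows displayed above (sentence index -> tfidf score).
top10 = {
    18: 0.008923, 0: 0.007676, 2: 0.007595, 14: 0.007579, 31: 0.007369,
    39: 0.007339, 7: 0.007285, 3: 0.007276, 13: 0.007063, 42: 0.007012,
}
headline_ranks = ranks(top10, [0, 1, 2])
print(headline_ranks)
```

Run on the full `tfidf_final` dictionary instead of the ten displayed rows, the same function would give an exact rank for every sentence.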


Test another article to see whether the pattern holds.

In [58]:
url = "https://www.scmp.com/news/hong-kong/law-and-crime/article/2176238/ex-hong-kong-minister-patrick-ho-could-know-early"
scmp2 = requests.get(url)

In [59]:
soup = BeautifulSoup(scmp2.text, "html.parser")

para = soup.find_all(class_="panel-pane pane-entity-field pane-node-body pane-first pos-0")[0].get_text(strip=True)
headline = soup.find(class_ = "title").get_text()
extract = soup.find(class_ = "field-item even").get_text(". ", strip=True)

In [60]:
full_article = headline+". "+extract+para
full_article

"Ex-Hong Kong minister Patrick Ho could know as early as Wednesday whether he will be jailed on corruption charges. Former home affairs secretary exercises right to remain silent, declining to testify on the eight counts of bribery and money laundering he faces. Veteran criminal lawyer says odds are against Ho given the large amount of evidence presented by prosecutionFormer Hong Kong minister Patrick Ho Chi-ping could know as early as Wednesday in the US whether he will be spending time in an American jail on corruption charges.At his hearing on Monday, Ho, who served as the city’s home affairs minister from 2002 to 2007, exercised his legal right to remain silent and did not testify on the eight counts of bribery and money laundering he faces, sending the trial into its final phase.Ho is accused of offering US$2.9 million worth of bribes to officials in Chad and Uganda while seeking to secure oil rights, among other benefits, for Shanghai-based energy conglomerate CEFC China Energy. 

In [61]:
para, sent, w_tokens = para_split(full_article)
noMD_tokens = replacestopword_noMD(w_tokens)
stem_tokens = snowball_stemming(noMD_tokens)
para_count = scorecard(stem_tokens)
tf_dict = tf(para_count, stem_tokens)

idf_dict = computeIDF(para_count, sent)
tfidf = computeTFIDF(tf_dict, idf_dict)

tfidf_newarticle = tf_scoring(sent, tfidf)

Summary: 
“There is not only Gadio's evidence but also intercepted emails,” he said.
Score:  0.007120924286247209
Ex-Hong Kong minister Patrick Ho could know as early as Wednesday whether he will be jailed on corruption charges.
Score:  0.006902724032978779
The prosecution has to secure a unanimous verdict from the jury for each charge.
Score:  0.0065833037202286775
The Ugandan official was allegedly the middleman between Ho and his own president, but has not been charged in this case.
Score:  0.006388783769609249
Carol Calabrese, senior vice-president for payment operation of HSBC’s US branch, said Ho’s transfer to Gadio was sent from HSBC Hong Kong to HSBC’s intermediary in New York before reaching Gadio’s account in Dubai.
Score:  0.006342545900001456


In [63]:
tfidf_df = pd.DataFrame(list(tfidf_newarticle.items()), columns = ['sentence', 'tfidf'])

sent_dict = {k:v for k, v in enumerate(sent)}

tfidf_df['sentence'] = tfidf_df['sentence'].map(sent_dict)

tfidf_df.sort_values(by='tfidf', ascending = False)

| | sentence | tfidf |
|---|---|---|
| 16 | “There is not only Gadio's evidence but also i... | 0.007121 |
| 0 | Ex-Hong Kong minister Patrick Ho could know as... | 0.006903 |
| 18 | The prosecution has to secure a unanimous verd... | 0.006583 |
| 25 | The Ugandan official was allegedly the middlem... | 0.006389 |
| 23 | Carol Calabrese, senior vice-president for pay... | 0.006343 |
| 14 | His former co-defendant Cheikh Gadio – turned ... | 0.006299 |
| 1 | Former home affairs secretary exercises right ... | 0.006137 |
| 22 | ”Witness in Patrick Ho trial ‘shocked’ by US$2... | 0.006033 |
| 2 | Veteran criminal lawyer says odds are against ... | 0.005999 |
| 20 | “The jury may also feel that he is an unreliab... | 0.005987 |


In [None]:
# Let's scrape another article to test our hypothesis

In [65]:
url = 'https://www.scmp.com/rss/318206/feed'
politics = requests.get(url)

In [84]:
soup = BeautifulSoup(politics.text, "lxml-xml")

In [85]:
soup

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="https://www.scmp.com/rss/318206/feed" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:media="http://www.rssboard.org/media-rss" xmlns:og="http://ogp.me/ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:schema="https://schema.org/" xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:sioct="http://rdfs.org/sioc/types#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
<channel>
<title>South China Morning Post - Politics feed</title>
<link>https://www.scmp.com/rss/318206/feed</link>
<description>Hong Kong political news including political reform and Hong Kong/China relations.
</description>
<language>en</language>
<image>
<url>https://www.scmp.com/sites/all/themes/boom/images/scmp_logo_goo

In [103]:
data = []
for item in soup.find_all('item'):  # search the feed once instead of once per field
    row = {
        'title': item.title.get_text(),
        'description': item.description.get_text(),
        'pub_date': item.pubDate.get_text(),
        'author': item.author.get_text(),
        'link': item.link.get_text(),
    }
    data.append(row)
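
If an `<item>` lacks one of these tags, attribute access returns `None` in BeautifulSoup and the `.get_text()` call fails. A tolerant variant can be sketched with the standard library's `xml.etree.ElementTree`. The two-item feed, the `FIELDS` mapping, and the function name `parse_items` below are made up for illustration and only mimic the shape of the RSS shown above.

```python
import xml.etree.ElementTree as ET

# A made-up two-item feed, shaped like the RSS shown above.
FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/1</link>
      <description>First description.</description>
      <pubDate>Tue, 04 Dec 2018 19:25:52 +0800</pubDate>
      <author>Reporter A</author>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/2</link>
      <description>Second description.</description>
      <pubDate>Tue, 04 Dec 2018 19:15:15 +0800</pubDate>
    </item>
  </channel>
</rss>"""

# Output column name -> RSS tag name.
FIELDS = {"title": "title", "description": "description",
          "pub_date": "pubDate", "author": "author", "link": "link"}


def parse_items(xml_text):
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        # findtext returns the default when a tag is absent instead of raising.
        rows.append({key: item.findtext(tag, default="") for key, tag in FIELDS.items()})
    return rows


rows = parse_items(FEED)
print(rows[1])
```

The resulting list of dictionaries has the same shape as `data`, so it can be passed to `pd.DataFrame.from_dict` in the same way.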

In [190]:
df_politics = pd.DataFrame.from_dict(data)
df_politics = df_politics[['title','description','pub_date', 'author', 'link']]

In [191]:
df_politics

| | title | description | pub_date | author | link |
|---|---|---|---|---|---|
| 0 | Occupy founder tearfully describes from the do... | Emotional testimony from one of the founders o... | Tue, 04 Dec 2018 19:25:52 +0800 | Chris Lau | https://www.scmp.com/news/hong-kong/politics/a... |
| 1 | Safeguard China’s constitution and sovereignty... | A top legal official from Beijing urged Hongko... | Tue, 04 Dec 2018 19:15:15 +0800 | Kimmy Chung | https://www.scmp.com/news/hong-kong/politics/a... |
| 2 | No plan to unseat Eddie Chu from Legco, Hong K... | Hong Kong&rsquo;s leader has said her administ... | Tue, 04 Dec 2018 11:51:32 +0800 | Sum Lok-kei, Shirley Zhao | https://www.scmp.com/news/hong-kong/politics/a... |
| 3 | Former Hong Kong undersecretary for home affai... | Former Hong Kong undersecretary for home affai... | Mon, 03 Dec 2018 20:15:00 +0800 | Gary Cheung, Kanis Leung | https://www.scmp.com/news/hong-kong/politics/a... |
| 4 | Hong Kong’s pan-democrats ready to defend Eddi... | Pan-democrats said on Monday they were ready t... | Mon, 03 Dec 2018 16:38:15 +0800 | Sum Lok-kei | https://www.scmp.com/news/hong-kong/politics/a... |
| 5 | Hong Kong lawmaker Eddie Chu’s ban from villag... | The decision to ban lawmaker Eddie Chu Hoi-dic... | Mon, 03 Dec 2018 09:00:30 +0800 | Alvin Lum | https://www.scmp.com/news/hong-kong/politics/a... |
| 6 | Hong Kong lawmaker Eddie Chu disqualified from... | Lawmaker Eddie Chu Hoi-dick has been disqualif... | Sun, 02 Dec 2018 19:45:15 +0800 | Alvin Lum, Sum Lok-kei | https://www.scmp.com/news/hong-kong/politics/a... |
| 7 | George H. W. Bush: the US president who always... | Before George H. W. Bush became president of t... | Sun, 02 Dec 2018 00:16:12 +0800 | Naomi Ng | https://www.scmp.com/news/hong-kong/politics/a... |
| 8 | Jack Ma is a Communist Party member – so what? | Shock and horror! Jack Ma is a member of the C... | Sat, 01 Dec 2018 14:31:58 +0800 | Yonden Lhatoo | https://www.scmp.com/news/hong-kong/politics/a... |
| 9 | Leader of Occupy movement Chan Kin-man says he... | A founding member of 2014&rsquo;s Occupy prote... | Fri, 30 Nov 2018 17:35:10 +0800 | Chris Lau | https://www.scmp.com/news/hong-kong/politics/a... |


In [148]:
print(df_politics['title'].iloc[1]+'. ' +df_politics['description'].iloc[1])

Shen Chunyao, chairman of the Basic Law Committee and the Legislative Affairs Commission of China&rsquo;s top legislative body, the National People&rsquo;s Congress Standing Committee (NPCSC), described the Chinese constitution as having a &ldquo;mother-son&rdquo; and...
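
The description printed above still carries HTML entities such as `&rsquo;` and `&ldquo;`. A minimal sketch of decoding them with the standard library's `html.unescape` follows; the string `raw` is a fragment copied from that output, and applying the same call to the whole column (for example with `df_politics['description'].map(html.unescape)`) is a suggestion, not something the notebook does.

```python
import html

# Fragment copied from the description printed above
raw = ("Legislative Affairs Commission of China&rsquo;s top legislative body, "
       "the National People&rsquo;s Congress Standing Committee (NPCSC), "
       "described the Chinese constitution as having a &ldquo;mother-son&rdquo;")

# Convert named entities into the characters they stand for
clean = html.unescape(raw)
print(clean)
```

Decoding before tokenizing keeps fragments like `rsquo` from being counted as words.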


In [187]:
def scrapelink(url):
    """Scrape one SCMP article: headline, up to two extracts, body text and the combined article."""
    scmp = requests.get(url)
    soup = BeautifulSoup(scmp.text, "html.parser")
    headline = soup.find(class_="title").get_text()
    # Summary bullets ("extracts") under the headline; some articles have only one
    bullets = soup.find(class_="field-item even").find_all('li')
    extract1 = bullets[0].get_text()
    extract2 = bullets[1].get_text() if len(bullets) > 1 else ''

    para = soup.find_all(class_="panel-pane pane-entity-field pane-node-body pane-first pos-0")[0].get_text(strip=True)
    full_article = headline + ". " + extract1 + extract2 + para

    return headline, extract1, extract2, para, full_article

In [284]:
headline = []
extract1 = [] 
extract2 = []
para = [] 
full_article = []
headline_para = []
for url in df_politics['link'].tolist():
    h, e1, e2, p, f = scrapelink(url)
    
    headline.append(h)
    extract1.append(e1)
    extract2.append(e2)
    para.append(p)
    full_article.append(f)
    headline_para.append(h+p)

In [285]:
df_politics['headline']= headline
df_politics['extract1'] = extract1
df_politics['extract2'] = extract2
df_politics['para'] = para
df_politics['full_article'] = full_article
df_politics['headline_para'] = headline_para

In [286]:
df_politics

Unnamed: 0,title,description,pub_date,author,link,headline,extract1,extract2,para,full_article,headline_para
0,Occupy founder tearfully describes from the do...,Emotional testimony from one of the founders o...,"Tue, 04 Dec 2018 19:25:52 +0800",Chris Lau,https://www.scmp.com/news/hong-kong/politics/a...,Occupy founder tearfully describes from the do...,Sociologist Chan Kin-man told court students w...,Escalation of 79-day demonstration led Chan an...,Emotional testimony from one of the founders o...,Occupy founder tearfully describes from the do...,Occupy founder tearfully describes from the do...
1,Safeguard China’s constitution and sovereignty...,A top legal official from Beijing urged Hongko...,"Tue, 04 Dec 2018 19:15:15 +0800",Kimmy Chung,https://www.scmp.com/news/hong-kong/politics/a...,Safeguard China’s constitution and sovereignty...,Shen Chunyao tells forum that constitution is ...,,A top legal official from Beijing urged Hongko...,Safeguard China’s constitution and sovereignty...,Safeguard China’s constitution and sovereignty...
2,"No plan to unseat Eddie Chu from Legco, Hong K...",Hong Kong&rsquo;s leader has said her administ...,"Tue, 04 Dec 2018 11:51:32 +0800","Sum Lok-kei, Shirley Zhao",https://www.scmp.com/news/hong-kong/politics/a...,"No plan to unseat Eddie Chu from Legco, Hong K...",Chief executive says officials will review the...,Pro-Beijing politicians have called for Chu to...,Hong Kong’s leader has said her administration...,"No plan to unseat Eddie Chu from Legco, Hong K...","No plan to unseat Eddie Chu from Legco, Hong K..."
3,Former Hong Kong undersecretary for home affai...,Former Hong Kong undersecretary for home affai...,"Mon, 03 Dec 2018 20:15:00 +0800","Gary Cheung, Kanis Leung",https://www.scmp.com/news/hong-kong/politics/a...,Former Hong Kong undersecretary for home affai...,"Hui, who was tipped in 2012 to become the city...","She was described as dedicated, passionate and...",Former Hong Kong undersecretary for home affai...,Former Hong Kong undersecretary for home affai...,Former Hong Kong undersecretary for home affai...
4,Hong Kong’s pan-democrats ready to defend Eddi...,Pan-democrats said on Monday they were ready t...,"Mon, 03 Dec 2018 16:38:15 +0800",Sum Lok-kei,https://www.scmp.com/news/hong-kong/politics/a...,Hong Kong’s pan-democrats ready to defend Eddi...,Chu was disqualified from a rural committee el...,A pro-Beijing politician has called for him to...,Pan-democrats said on Monday they were ready t...,Hong Kong’s pan-democrats ready to defend Eddi...,Hong Kong’s pan-democrats ready to defend Eddi...
5,Hong Kong lawmaker Eddie Chu’s ban from villag...,The decision to ban lawmaker Eddie Chu Hoi-dic...,"Mon, 03 Dec 2018 09:00:30 +0800",Alvin Lum,https://www.scmp.com/news/hong-kong/politics/a...,Hong Kong lawmaker Eddie Chu’s ban from villag...,Former HKU law dean Johannes Chan does not bel...,"Eric Cheung, also of HKU, said the reasons giv...",The decision to ban lawmaker Eddie Chu Hoi-dic...,Hong Kong lawmaker Eddie Chu’s ban from villag...,Hong Kong lawmaker Eddie Chu’s ban from villag...
6,Hong Kong lawmaker Eddie Chu disqualified from...,Lawmaker Eddie Chu Hoi-dick has been disqualif...,"Sun, 02 Dec 2018 19:45:15 +0800","Alvin Lum, Sum Lok-kei",https://www.scmp.com/news/hong-kong/politics/a...,Hong Kong lawmaker Eddie Chu disqualified from...,"Chu hits back, says he should not have been ba...",,Lawmaker Eddie Chu Hoi-dick has been disqualif...,Hong Kong lawmaker Eddie Chu disqualified from...,Hong Kong lawmaker Eddie Chu disqualified from...
7,George H. W. Bush: the US president who always...,Before George H. W. Bush became president of t...,"Sun, 02 Dec 2018 00:16:12 +0800",Naomi Ng,https://www.scmp.com/news/hong-kong/politics/a...,George H. W. Bush: the US president who always...,Tsim Sha Tsui suit maker Manu Melwani recalls ...,,Before George H. W. Bush became president of t...,George H. W. Bush: the US president who always...,George H. W. Bush: the US president who always...
8,Jack Ma is a Communist Party member – so what?,Shock and horror! Jack Ma is a member of the C...,"Sat, 01 Dec 2018 14:31:58 +0800",Yonden Lhatoo,https://www.scmp.com/news/hong-kong/politics/a...,Jack Ma is a Communist Party member – so what?,Yonden Lhatoo explains some basics on China an...,,Shock and horror! Jack Ma is a member of the C...,Jack Ma is a Communist Party member – so what?...,Jack Ma is a Communist Party member – so what?...
9,Leader of Occupy movement Chan Kin-man says he...,A founding member of 2014&rsquo;s Occupy prote...,"Fri, 30 Nov 2018 17:35:10 +0800",Chris Lau,https://www.scmp.com/news/hong-kong/politics/a...,Leader of Occupy movement Chan Kin-man says he...,Split between founders and student leaders cam...,Chan says he and Benny Tai were back in univer...,A founding member of 2014’s Occupy protests to...,Leader of Occupy movement Chan Kin-man says he...,Leader of Occupy movement Chan Kin-man says he...


In [369]:
# Column names match the ones filled in below ('tfidf', not 'tfidf_score')
df_final = pd.DataFrame(columns = ['tfidf','sentence', 'corelines', 'tfidf_v'])

In [213]:
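def get_tfidf_score(sent, word_weights):
    """Score each sentence as the mean weight of its stemmed tokens found in word_weights.

    Assumes `ps` (a stemmer) and `word_tokenize` are defined earlier in the notebook.
    """
    para_score = {}
    for i, s in enumerate(sent):
        score = 0
        tf_w_count = 0
        sent_stem_token = [ps.stem(sent_token) for sent_token in word_tokenize(s)]
        for j in sent_stem_token:
            if j in word_weights:
                score += word_weights[j]
                tf_w_count += 1
        # Avoid ZeroDivisionError when no token of the sentence has a weight
        para_score[i] = float(score) / tf_w_count if tf_w_count else 0.0
    return para_score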
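
To make the scoring idea concrete, here is a self-contained sketch on a toy text. It does not reproduce the notebook's own helpers. The definitions are assumptions: TF is a word's share of all tokens, IDF is the log of the number of sentences divided by the number of sentences containing the word, and a sentence's score is the mean TF-IDF of its tokens. The function name `sentence_scores` and the whitespace tokenizer are hypothetical simplifications.

```python
import math
from collections import Counter

def sentence_scores(sentences):
    """Average TF-IDF weight per sentence, treating each sentence as a document."""
    tokenized = [s.lower().split() for s in sentences]
    all_tokens = [t for toks in tokenized for t in toks]
    counts = Counter(all_tokens)
    n_sent = len(tokenized)

    # TF: share of all tokens in the text (assumed definition)
    tf = {w: c / len(all_tokens) for w, c in counts.items()}
    # IDF: log(N / number of sentences containing the word) (assumed definition)
    idf = {w: math.log(n_sent / sum(w in toks for toks in tokenized))
           for w in counts}
    tfidf = {w: tf[w] * idf[w] for w in counts}

    scores = {}
    for i, toks in enumerate(tokenized):
        weights = [tfidf[t] for t in toks if t in tfidf]
        # Empty sentences get 0.0 instead of a division by zero
        scores[i] = sum(weights) / len(weights) if weights else 0.0
    return scores

toy = ["the cat sat", "the dog sat", "the bird flew away"]
scores = sentence_scores(toy)
best = max(scores, key=scores.get)
print(best)
```

In this toy text "the" appears in every sentence, so its IDF is zero and it adds nothing to any score. The third sentence scores highest because three of its four words occur nowhere else.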

In [229]:
import numpy as np
para, sent, w_tokens = para_split(df_politics['headline_para'].iloc[0])
noMD_tokens = replacestopword_noMD(w_tokens)
stem_tokens = snowball_stemming(noMD_tokens)
para_count = scorecard(stem_tokens)
tf_dict = tf(para_count, stem_tokens)

idf_dict = computeIDF(para_count, sent)
tfidf = computeTFIDF(tf_dict, idf_dict)

tfidf_score = get_tfidf_score(sent, tfidf)

In [312]:
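from nltk import sent_tokenize
df_final['tfidf'] = list(tfidf_score.values())
# get_text(strip=True) joins paragraphs without spaces, so add a space after each full stop before splitting into sentences
df_final.loc[:,'sentence'] = df_politics[['headline']].iloc[0].tolist()+sent_tokenize(df_politics['para'].iloc[0].replace('.','. '))
# Label the headline (first row) as a core line (1) and every other sentence as 0
df_final.loc[:,'corelines'] = [1] + [0] * (len(sent) - 1)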

In [315]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
import nltk
import re

def tokenize_and_stem(text, do_stem=True):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
            
    # stem filtered tokens only when requested
    if do_stem:
        return [stemmer.stem(t) for t in filtered_tokens]
    return filtered_tokens

In [325]:
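from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from string import punctuation

# define vectorizer parameters; stop_words is passed as a list, the type scikit-learn documents for it
tfidf_vectorizer = TfidfVectorizer(max_features=200000,
                                 stop_words=list(set(stopwords.words('english')+list(punctuation))),
                                 tokenizer=tokenize_and_stem, ngram_range=(1,3))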

tfidf_matrix = tfidf_vectorizer.fit_transform(df_politics[['headline']].iloc[0].tolist()+sent_tokenize(df_politics['para'].iloc[0].replace('.','. ')))

In [326]:
tfidf_matrix

<27x909 sparse matrix of type '<class 'numpy.float64'>'
	with 1082 stored elements in Compressed Sparse Row format>

In [328]:
df_final['tfidf_v'] = list(tfidf_matrix)

In [329]:
df_final

Unnamed: 0,tfidf,sentence,corelines,tfidf_v
0,0.007022,Occupy founder tearfully describes from the do...,1,"(0, 528)\t0.1343791198032483\n (0, 314)\t0...."
1,0.00657,Emotional testimony from one of the founders o...,0,"(0, 528)\t0.12071887912229844\n (0, 314)\t0..."
2,0.007442,Sociologist Dr Chan Kin-man tearfully talked a...,0,"(0, 767)\t0.10806267222879776\n (0, 494)\t0..."
3,0.005645,Chan’s emotional outburst prompted Judge Johnn...,0,"(0, 266)\t0.16767943534226165\n (0, 83)\t0...."
4,0.00549,Chan said he and his two co-founders – Benny T...,0,"(0, 723)\t0.11232087585764589\n (0, 83)\t0...."
5,0.006487,"The turning point, he said, was on November 30...",0,"(0, 494)\t0.12135058991312353\n (0, 440)\t0..."
6,0.005303,A failed dialogue between the government and t...,0,"(0, 759)\t0.16887990353955545\n (0, 732)\t0..."
7,0.006072,"Chan, Tai and Chu met the press and announced ...",0,"(0, 83)\t0.10573135298274072\n (0, 751)\t0...."
8,0.007236,Occupy leaders on trial: who they are and what...,0,"(0, 528)\t0.11978940033228024\n (0, 801)\t0..."
9,0.005117,He said the escalation meant protesters resort...,0,"(0, 608)\t0.1018589594175286\n (0, 615)\t0...."
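
One way to read the `tfidf` column is to rank sentences by score and see where the headline lands. The sketch below does that for the ten rows shown above only, not for the full table; the helper `top_k_indices` and the choice `k=3` are hypothetical.

```python
# tfidf column from the ten rows of df_final shown above
scores = [0.007022, 0.00657, 0.007442, 0.005645, 0.00549,
          0.006487, 0.005303, 0.006072, 0.007236, 0.005117]

def top_k_indices(scores, k=3):
    """Indices of the k highest-scoring sentences, returned in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

picked = top_k_indices(scores, k=3)
print(picked)  # [0, 2, 8]
```

Among these ten rows the headline (row 0, `corelines` = 1) has the third-highest score, behind rows 2 and 8. The same ranking applied to all sentences of an article would give a simple extractive summary.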
