In this notebook I build and test the pipeline needed for a machine-learning news summariser. I will improve it iteratively and factor out ad-hoc software modules to make the codebase easier to use.
## Import libraries

In [1]:
import nltk
import numpy as np
import pandas as pd
import feedparser
import scipy

## Load dataset

Feed

In [2]:
the_guardian = "https://www.theguardian.com/international/rss"
feed = feedparser.parse(the_guardian)

In [3]:
feed.entries[0].title

'Coronavirus: Europe on alert as four more deaths reported in Italy – updates'

In [4]:
feed.entries[0].link

'https://www.theguardian.com/world/live/2020/feb/25/coronavirus-live-updates-outbreak-latest-news-italy-italia-deaths-symptoms-china-stocks-wall-street-dow-jones-economy-falls'

In [5]:
len(feed.entries)

84

In [6]:
articles_to_download = {}
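# Map each article title to its link.
# Note: the range starts at 1, so the first feed entry (index 0) is skipped.
for i in range(1, len(feed.entries)):
    key = feed.entries[i].title
    value = feed.entries[i].link
    articles_to_download[key] = value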

In [7]:
articles_to_download.keys()

dict_keys(['Trump lashes out at liberal supreme court justices and demands recusals', "Barnier pours scorn on Johnson's spokesman ahead of trade talks", 'Pop star Duffy says she was raped, drugged and held captive', 'Greta Thunberg and Malala Yousafzai meet at Oxford University', 'Harvey Weinstein to face charges in Los Angeles after guilty verdict in New York', 'Hosni Mubarak, Egyptian president ousted during Arab spring, dies at 91', 'Ulcerative colitis: bacteria findings raise hopes for new treatment', 'Netanyahu announces new settlements days before Israeli election', "Spanish carnival's Holocaust-themed parade of dancing 'Nazis' sparks outrage", "Ex-White House doctor: to help Trump's diet I hid cauliflower in his mash", 'Do you take out the bins – or give great hugs? Why it pays to know your love language', "Bloody eye sockets, defaced statues: the visual legacy of Chile's unrest", 'What does Hair Love’s Oscar success say about diversity in Hollywood?', 'How to cook the perfect p

In [8]:
articles_to_download.values()

dict_values(['https://www.theguardian.com/us-news/2020/feb/25/trump-supreme-court-sonia-sotomayor-ruth-bader-ginsburg', 'https://www.theguardian.com/politics/2020/feb/25/keep-chlorinated-chicken-ban-to-win-trade-deal-eu-tells-uk', 'https://www.theguardian.com/music/2020/feb/25/pop-star-duffy-says-she-was-raped-drugged-and-held-captive', 'https://www.theguardian.com/uk-news/2020/feb/25/greta-thunberg-and-malala-yousafzai-meet-at-oxford-university', 'https://www.theguardian.com/world/2020/feb/24/harvey-weinstein-los-angeles-trial', 'https://www.theguardian.com/world/2020/feb/25/hosni-mubarak-egyptian-president-ousted-during-arab-spring-dies-at-91', 'https://www.theguardian.com/society/2020/feb/25/ulcerative-colitis-bacteria-findings-raise-hopes-for-new-treatment', 'https://www.theguardian.com/world/2020/feb/25/netanyahu-announces-new-settlements-days-before-israeli-election', 'https://www.theguardian.com/world/2020/feb/25/parade-of-nazis-in-spanish-carnival-sparks-furious-criticism', 'ht

To do:
- Store already parsed articles in a db (Mongo?)
- Schedule feed update
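
A minimal sketch of the first to-do item, under stated assumptions: it swaps the suggested Mongo database for the standard-library `sqlite3` module so that it runs without extra services, and the names `open_store` and `filter_new_articles` are hypothetical helpers, not part of this notebook. It keeps only the links that have not been stored before.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small SQLite store of already parsed article links."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles (link TEXT PRIMARY KEY, title TEXT)"
    )
    return conn


def filter_new_articles(conn, articles):
    """Return the {title: link} pairs not stored yet, and store them."""
    new_articles = {}
    for title, link in articles.items():
        already_seen = conn.execute(
            "SELECT 1 FROM articles WHERE link = ?", (link,)
        ).fetchone()
        if already_seen is None:
            conn.execute(
                "INSERT INTO articles (link, title) VALUES (?, ?)", (link, title)
            )
            new_articles[title] = link
    conn.commit()
    return new_articles
```

Calling `filter_new_articles(conn, articles_to_download)` after each feed refresh would then return only the articles that still need downloading.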

#### The Guardian

In [9]:
import requests
from bs4 import BeautifulSoup

In [10]:
page = requests.get('https://www.theguardian.com/politics/2020/feb/24/no-10-uk-aim-is-to-restore-independence-from-eu-by-end-of-year') 
soup = BeautifulSoup(page.content, 'html.parser')

In [11]:
article_content = soup.body.find(class_="content__article-body from-content-api js-article__body")

In [12]:
for paragraph in article_content.find_all('p'):
    print(paragraph.get_text())
    print("----")

Britain’s main goal in trade talks with the EU will be to “restore economic and political independence from 1 January”, No 10 has said, as the government prepares to publish its negotiating aims on Thursday.
----
Boris Johnson’s official spokesman said the “primary objective” was ending the transition period by the end of the year, regardless of whether a deal had been struck.
----
His comments suggest the UK will be prepared to walk away from talks rather than submit to the EU’s requests for some oversight by the European court of justice (ECJ) and future alignment on regulation.
----
The UK has said it will push for a Canada-style trade deal but appears to be prioritising the freedom to set its own rules rather than achieving such an agreement, if the EU insists on more alignment than it has with Canada.
----
The negotiating aims are expected to be signed off by the “XS” committee – including Johnson, Michael Gove, the Cabinet Office minister, and Dominic Raab, the foreign secretary 

In [13]:
article = []
for paragraph in article_content.find_all('p'):
    sentence = paragraph.get_text()
    # Skip empty paragraphs.
    if sentence != "":
        article.append(sentence)

In [14]:
article

['Britain’s main goal in trade talks with the EU will be to “restore economic and political independence from 1 January”, No 10 has said, as the government prepares to publish its negotiating aims on Thursday.',
 'Boris Johnson’s official spokesman said the “primary objective” was ending the transition period by the end of the year, regardless of whether a deal had been struck.',
 'His comments suggest the UK will be prepared to walk away from talks rather than submit to the EU’s requests for some oversight by the European court of justice (ECJ) and future alignment on regulation.',
 'The UK has said it will push for a Canada-style trade deal but appears to be prioritising the freedom to set its own rules rather than achieving such an agreement, if the EU insists on more alignment than it has with Canada.',
 'The negotiating aims are expected to be signed off by the “XS” committee – including Johnson, Michael Gove, the Cabinet Office minister, and Dominic Raab, the foreign secretary – 

To do:
- Read article main class from config in order to make it more general
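
One possible shape for this to-do item, as a sketch only: a mapping from hostname to the CSS class of the article body, looked up from the article URL. The `ARTICLE_BODY_CLASSES` dictionary and the `article_body_class` function are hypothetical names; in practice the mapping could be loaded from a JSON or YAML file instead of being defined inline.

```python
from urllib.parse import urlparse

# Hypothetical config: hostname -> CSS class of the element holding the article body.
# The Guardian entry is the class string used earlier in this notebook.
ARTICLE_BODY_CLASSES = {
    "www.theguardian.com": "content__article-body from-content-api js-article__body",
}


def article_body_class(url, config=ARTICLE_BODY_CLASSES):
    """Look up the article-body CSS class for the site that serves `url`."""
    host = urlparse(url).netloc
    try:
        return config[host]
    except KeyError:
        raise ValueError(f"No article body class configured for {host!r}") from None
```

The returned string could then be passed as `class_=` to `soup.body.find(...)`, so that supporting a new source only requires a new config entry.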

## Summarise the given text:

GloVe model (Wikipedia 2014 + Gigaword 5), downloaded from: https://nlp.stanford.edu/projects/glove/

Sentence similarity: inspiration taken from here: https://medium.com/@adriensieg/text-similarities-da019229c894
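
To make the idea concrete before loading the real embeddings, here is a small sketch with toy three-dimensional vectors (made up for illustration, not GloVe values). A sentence is represented by the mean of its word vectors, and two sentences are compared with cosine similarity. The names `toy_model`, `mean_vector` and `cosine_similarity` are assumptions of this sketch; unlike the notebook code, it skips unknown words instead of using zero vectors.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
toy_model = {
    "trade": [1.0, 0.0, 0.0],
    "talks": [0.0, 1.0, 0.0],
    "deal": [1.0, 1.0, 0.0],
    "weather": [0.0, 0.0, 1.0],
}


def mean_vector(words, model):
    """Average the vectors of the words that are present in the model."""
    vectors = [model[w] for w in words if w in model]
    if not vectors:
        raise ValueError("none of the words are in the model")
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction, 0 = orthogonal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


related = cosine_similarity(
    mean_vector(["trade", "talks"], toy_model), mean_vector(["deal"], toy_model)
)
unrelated = cosine_similarity(
    mean_vector(["trade", "talks"], toy_model), mean_vector(["weather"], toy_model)
)
```

Here `related` comes out close to 1 and `unrelated` is 0, which is the behaviour the averaged-embedding approach relies on.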

In [15]:
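# Load the 50-dimensional GloVe vectors into a {word: vector} dictionary.
model = {}
with open("../glove.6B/glove.6B.50d.txt", 'r', encoding="utf-8") as f:
    for line in f:
        # Each line holds a word followed by its vector components.
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        model[word] = vector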

In [16]:
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/kappa/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/kappa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [17]:
from nltk.tokenize import word_tokenize
def tokenize_sentence(s):
    # Split a sentence into word tokens.
    return word_tokenize(s)

In [18]:
stopws = stopwords.words("english")
# Drop the last 36 entries of the list so that "negation" words are not treated as stop words.
stopws = np.array(stopws[:-36])

In [19]:
import re

In [20]:
def preprocess_text(s):
    s = s.lower()
    #s = re.sub('[^A-Za-z0-9-\- ]+', '', s)
    s = re.sub('[^A-Za-z0-9 ]+', ' ', s)
    s = tokenize_sentence(s)
    s = [word for word in s if word not in stopws]
    return s

In [21]:
preprocess_text(article[0])

['britain',
 'main',
 'goal',
 'trade',
 'talks',
 'eu',
 'restore',
 'economic',
 'political',
 'independence',
 '1',
 'january',
 '10',
 'said',
 'government',
 'prepares',
 'publish',
 'negotiating',
 'aims',
 'thursday']

In [22]:
from scipy.spatial.distance import cosine

In [23]:
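def vectorize_sentence(s, empty_vector_size=50):
    # Represent a sentence as the mean of the GloVe vectors of its preprocessed words.
    v = []
    for word in preprocess_text(s):
        try:
            v.append(model[word])
        except KeyError:
            # Out-of-vocabulary word: fall back to a zero vector.
            v.append(np.zeros(empty_vector_size))
    return np.mean(v, axis=0)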

In [24]:
def compute_sentence_similarity(s1, s2):
    # Note: scipy's cosine() returns the cosine *distance* (1 - cosine similarity),
    # so smaller return values mean more similar sentences.
    #vector_1 = np.mean([model[word] for word in preprocess_text(s1)],axis=0)
    #vector_2 = np.mean([model[word] for word in preprocess_text(s2)],axis=0)
    vector_1 = vectorize_sentence(s1)
    vector_2 = vectorize_sentence(s2)
    distance = cosine(vector_1, vector_2)
    #print('Word Embedding method with a cosine distance assesses that our two sentences are similar to',round((1-distance)*100,2),'%')
    return distance

In [25]:
compute_sentence_similarity(article[0], article[1])

0.08225750923156738

In [26]:
compute_sentence_similarity(article[0], article[5])

0.07303470373153687

In [27]:
def build_similarity_matrix(sentences):
    # Pairwise matrix of compute_sentence_similarity values; the diagonal stays 0.
    n_sent = len(sentences)
    matrix = np.zeros((n_sent, n_sent))
    for i in range(n_sent):
        for j in range(i + 1, n_sent):
            # The measure is symmetric, so compute each pair once and mirror it.
            matrix[i, j] = compute_sentence_similarity(sentences[i], sentences[j])
            matrix[j, i] = matrix[i, j]
    return matrix

In [28]:
similarity_matrix = build_similarity_matrix(article)

To Do:
- Order sentences according to similarity
- Choose top n
- For the future: Try other approaches
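
A rough sketch of the first two to-do items, under these assumptions: the matrix holds cosine distances (which is what scipy's `cosine` returns), each sentence is scored by its mean similarity (1 - distance) to all other sentences, and the selected sentences are returned in their original order. The function name `top_sentences` and this scoring rule are illustrative choices, not a settled design; a graph-based ranking would be another option to try.

```python
def top_sentences(sentences, distance_matrix, top_n=3):
    """Pick the top_n sentences that are, on average, closest to all the others."""
    n = len(sentences)
    scores = []
    for i in range(n):
        # Convert distances to similarities and ignore the diagonal.
        others = [1.0 - distance_matrix[i][j] for j in range(n) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    # Indices of the best-scoring sentences, highest score first.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    # Restore the original article order so the summary reads naturally.
    return [sentences[i] for i in sorted(ranked)]
```

With the notebook's data this would be called as `top_sentences(article, matrix, top_n=3)`, where `matrix` is the pairwise distance matrix for `article`.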