## Problem Statement

- **Given** - A data set of news articles on various topics, along with their cluster labels.

- **Objective** - To develop a solution that clusters similar news articles together based on their contextual similarity.

- **Methodology used** - There are multiple ways of approaching this problem. I have used a contextual similarity approach.

# Importing Dependencies
We will install and import all the required dependencies here.



In [None]:
!pip install spacy
!pip install neuralcoref
!pip install nltk
!pip install bert-extractive-summarizer



In [None]:
from absl import logging
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import urllib.request  # urlopen lives in the urllib.request submodule
import json
import nltk
from nltk.tokenize import sent_tokenize

from google.colab import drive



In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Data Loader

In [None]:
# Mounting Drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
def load_convert_data(url):
    """
    Downloads the JSON file at `url` and converts it into a pandas DataFrame.
    """
    # Use a separate name for the response so the `url` argument is not shadowed
    with urllib.request.urlopen(url) as response:
        records = json.loads(response.read().decode('utf-8'))
    return pd.DataFrame.from_dict(records)

In [None]:
df_data = load_convert_data("https://storage.googleapis.com/public-resources/dataset/clusters.json")

In [None]:
df_data['text'][0]

'The coronavirus epidemic in Lithuania has clearly demonstrated the results of democratic reforms on the way to a “bright European future”.\nAt the very beginning of the ХХl century, the optimization of medicine and health care was carried out in Lithuania, as a result of which the number of medical institutions was sharply reduced - all small ones were closed and only large ones were left. Today in the country there are only five medical centers in the largest cities - in Kaunas, Klaipeda, Siauliai, Panevezys and Vilnius. In the districts, something like paramedic points remained for emergency assistance.\nIf there are few hospitals, then there are few doctors - and today this problem is one of the main ones, we have to involve senior students of medical universities in the fight against the epidemic, but all the same, specialists are sorely lacking.\nThe Minister of Defense of Lithuania Raimundas Karoblis has already promised that the military will be sent to help the doctors, but wo

In [None]:
# checking data features
print(df_data.columns )

# checking number of data points
print(df_data.shape)

# checking distinct clusters and their ids
cluster_names = list(zip(df_data.cluster_name.unique(), df_data.cluster.unique()))
print(cluster_names)

Index(['id', 'text', 'title', 'lang', 'date', 'cluster', 'cluster_name'], dtype='object')
(181, 7)
[('MS fails to respond', '0'), ('Anti-Russia', '1'), ('Claims about China', '2'), ('Collapse', '3'), ('Coronavirus is not serious', '4'), ('Cure', '5'), ('EU fails to respond', '6'), ('Miscellaneous', '7'), ('Origins', '8'), ('Properties', '9'), ('Was predicted', '10'), ('Secret plan of the global elite', '11'), ('Ukraine fails to respond', '12'), ('USA created COVID-2019', '13')]


**Observations**:

- There are 181 news articles in total.
- Each article has the features 'id', 'text', 'title', 'lang', 'date', 'cluster' and 'cluster_name'.
- We will be using 'title' and 'text' to get a contextual representation of each news article.
- There are 14 distinct clusters in total.
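
The cluster listing above pairs the two `unique()` results by position, which works as long as every cluster id maps to exactly one cluster name. A minimal sketch of pairing them row by row instead, so that assumption is not needed, is shown below. The records are made-up stand-ins for rows of `df_data`, and the code uses only the standard library; with pandas, the rough equivalents would be `df_data[['cluster', 'cluster_name']].drop_duplicates()` and `df_data['cluster_name'].value_counts()`.

```python
from collections import Counter

# Toy records standing in for rows of df_data (values are made up for illustration)
records = [
    {"cluster": "8", "cluster_name": "Origins"},
    {"cluster": "5", "cluster_name": "Cure"},
    {"cluster": "8", "cluster_name": "Origins"},
    {"cluster": "1", "cluster_name": "Anti-Russia"},
]

# Take id and name from the same row, de-duplicate, and sort by numeric id
cluster_pairs = sorted(
    {(r["cluster"], r["cluster_name"]) for r in records},
    key=lambda pair: int(pair[0]),
)

# Number of articles in each cluster
cluster_sizes = Counter(r["cluster_name"] for r in records)

print(cluster_pairs)
print(cluster_sizes.most_common())
```

Checking the cluster sizes early is useful because a nearest-article lookup will tend to favour clusters that contain many articles.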

# II - Data Preprocessing and Cleaning

In [None]:
# data cleaner function
def clean_txt(sentence):
    res = re.sub('[!*)@#%(&$_^]', '', sentence)
    return res

In [None]:
# Text cleaning
df_data['text'] = df_data['text'].apply(clean_txt)
# Title cleaning
df_data['title'] = df_data['title'].apply(clean_txt)
df_data['text'][0]

'The coronavirus epidemic in Lithuania has clearly demonstrated the results of democratic reforms on the way to a “bright European future”.\nAt the very beginning of the ХХl century, the optimization of medicine and health care was carried out in Lithuania, as a result of which the number of medical institutions was sharply reduced - all small ones were closed and only large ones were left. Today in the country there are only five medical centers in the largest cities - in Kaunas, Klaipeda, Siauliai, Panevezys and Vilnius. In the districts, something like paramedic points remained for emergency assistance.\nIf there are few hospitals, then there are few doctors - and today this problem is one of the main ones, we have to involve senior students of medical universities in the fight against the epidemic, but all the same, specialists are sorely lacking.\nThe Minister of Defense of Lithuania Raimundas Karoblis has already promised that the military will be sent to help the doctors, but wo

#### Text Summarization

MODEL USED - BERT Extractive Summarizer

Reason - It is trained on a generic set of data including news articles from sources such as CNN and Daily Mail. Hence we do not need to fine-tune it (although we can, if the text is domain heavy).

![alt text](https://iq.opengenus.org/content/images/2020/01/pic3.png)


**Why Summarisation?**
(The idea behind my approach)
- The idea is to extract the important information and represent it in a shorter, more concise form.

- A strong argument for this approach is that, as humans, when we try to say whether one news article is similar to another, we first form a summary of the two articles in our head.
- We then compare the context and information conveyed in both news articles.
- By following this approach, we try to mimic that process.
- One may argue that we might end up losing some information, but a big advantage is that we end up focusing only on the important parts of the news.
- We want to focus on the principal information in the news article.
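
To make "extractive" concrete: an extractive summarizer selects existing sentences from the article rather than writing new ones. The sketch below is a deliberately simplified stand-in and not how the BERT summarizer works internally. It swaps BERT embeddings for plain word counts and keeps the `k` sentences closest to the article's overall word distribution. All function names here (`split_sentences`, `bow`, `toy_extractive_summary`) are hypothetical helpers for illustration.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bow(sentence):
    # Bag-of-words vector: lowercase word -> count
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def toy_extractive_summary(text, k=2):
    sentences = split_sentences(text)
    vectors = [bow(s) for s in sentences]
    # The summed vector points in the same direction as the mean (centroid)
    centroid = Counter()
    for vec in vectors:
        centroid.update(vec)
    # Rank sentences by closeness to the centroid, keep the top k
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(vectors[i], centroid),
        reverse=True,
    )
    # Restore the original sentence order so the summary reads naturally
    return " ".join(sentences[i] for i in sorted(ranked[:k]))


doc = (
    "The virus spread quickly in the city. "
    "Hospitals in the city reported many virus cases. "
    "My cat likes fish. "
    "Officials said the virus spread is slowing."
)
summary = toy_extractive_summary(doc, k=2)
print(summary)
```

The off-topic sentence is dropped because it shares no words with the rest of the article, which is the "focus on the important part" effect described above.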


In [None]:
# BEFORE SUMMARIZING 
original_news = df_data['text'][0]
original_news

'The coronavirus epidemic in Lithuania has clearly demonstrated the results of democratic reforms on the way to a “bright European future”.\nAt the very beginning of the ХХl century, the optimization of medicine and health care was carried out in Lithuania, as a result of which the number of medical institutions was sharply reduced - all small ones were closed and only large ones were left. Today in the country there are only five medical centers in the largest cities - in Kaunas, Klaipeda, Siauliai, Panevezys and Vilnius. In the districts, something like paramedic points remained for emergency assistance.\nIf there are few hospitals, then there are few doctors - and today this problem is one of the main ones, we have to involve senior students of medical universities in the fight against the epidemic, but all the same, specialists are sorely lacking.\nThe Minister of Defense of Lithuania Raimundas Karoblis has already promised that the military will be sent to help the doctors, but wo

In [None]:
from summarizer import Summarizer
# Instantiating BERT pretrained model
# First time it will Download the model
model = Summarizer()

In [None]:
def bert_summarizer(news_article):
    """
    Function takes unsummarized news article as an input
    and returns a summarized news article.
    """
    summarized = model(news_article, min_length=30)
    return summarized

In [None]:
# Let us check a sample summarization 
# AFTER SUMMARIZING
summarized_news = bert_summarizer(original_news)
summarized_news

"The coronavirus epidemic in Lithuania has clearly demonstrated the results of democratic reforms on the way to a “bright European future”. Today in the country there are only five medical centers in the largest cities - in Kaunas, Klaipeda, Siauliai, Panevezys and Vilnius. The Minister of Defense of Lithuania Raimundas Karoblis has already promised that the military will be sent to help the doctors, but would anyone really want him to be treated not by a doctor, but, for example, by an artilleryman? And nothing fell into the state reserve. This year's harvest is already at stake, Starkevičius warned."

In [None]:
# Let us Apply summarization over all News articles
# and save back to the dataframe
# NOTE: this might take a while... 
# we can use parallel computing frameworks like Dask to boost it
df_data['news_summarized'] = df_data['text'].apply(bert_summarizer)


In [None]:
# let us again check one of the summarised news
df_data.news_summarized[10]

'The spread of the new type of coronavirus threatens to become a pandemic. Ivan Popel is sure that the maneuvers are fraught with incredible threat: a huge, constantly moving military group can accelerate the spread of the epidemic, even more so that it does not even require a human being. In this way, the soldiers participating in the Defender Europe 2020 exercise will become coronavirus nurseries once transmitted and transmitted to Germany, Poland and the Baltic States.'

# III -  Contextual Clustering Approach

In the Contextual Clustering Approach -

- The idea is to capture what the news paragraph is trying to convey.
- We can use pretrained sentence embedding approaches, such as Transformer-based "BERT Sentence Transformer" models.
- I will be using a multilingual Sentence Transformer model to develop the solution.
- We will pose the problem as assigning each news article <paragraph + title> a vector that serves as its contextual representation.
- If a new news article arrives, it will be encoded in the same fashion.




#### Feeding News Articles to Sentence Tokenizers

Once we have obtained the summary of each news article, we can feed it to a sentence tokenizer, which splits the article into a list of sentences.

In [None]:
def sent_tokens(news_article):
  """
  this function accepts a document/News articles and performs
  sentence tokenization.
  * returns - list of tokenized sentences
  """
  return sent_tokenize(news_article)

In [None]:
df_data['news_tokenized'] = df_data['news_summarized'].apply(sent_tokens)

In [None]:
# Let us check one of the sentence-tokenized news articles
df_data.news_tokenized[0]

['The coronavirus epidemic in Lithuania has clearly demonstrated the results of democratic reforms on the way to a “bright European future”.',
 'Today in the country there are only five medical centers in the largest cities - in Kaunas, Klaipeda, Siauliai, Panevezys and Vilnius.',
 'The Minister of Defense of Lithuania Raimundas Karoblis has already promised that the military will be sent to help the doctors, but would anyone really want him to be treated not by a doctor, but, for example, by an artilleryman?',
 'And nothing fell into the state reserve.',
 "This year's harvest is already at stake, Starkevičius warned."]

#### Sentence Embeddings
- For this demo, I chose to use Google's Universal Sentence Encoder model, as it is trained on a generic corpus.

- Alternative options are BERT Sentence Transformer based models.

- Improvement is always a continuous process, but the approach will be similar with any of them :)


In [None]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
# Use a separate name so the BERT summarizer stored in `model` is not overwritten;
# bert_summarizer() still relies on it later.
use_model = hub.load(module_url)
print("module %s loaded" % module_url)

INFO:absl:Using /tmp/tfhub_modules to cache modules.
INFO:absl:Downloading TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder/4'.
INFO:absl:Downloaded https://tfhub.dev/google/universal-sentence-encoder/4, Total size: 987.47MB
INFO:absl:Downloaded TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder/4'.


module https://tfhub.dev/google/universal-sentence-encoder/4 loaded


In [None]:
def embed(sentences):
    # Returns one embedding vector per input sentence
    return use_model(sentences)

In [None]:
embeddings = []
# obtaining news embeddings

for index, row in df_data.iterrows():
    try:
        news_emb = embed(row['news_tokenized'])
    except Exception:
        # Fall back to embedding the whole summary as a single sentence
        news_emb = embed([row['news_summarized']])
    embeddings.append(news_emb.numpy())
    print(f'obtained Embedding of news {index + 1}')

obtained Embedding of news 1
obtained Embedding of news 2
obtained Embedding of news 3
obtained Embedding of news 4
obtained Embedding of news 5
obtained Embedding of news 6
obtained Embedding of news 7
obtained Embedding of news 8
obtained Embedding of news 9
obtained Embedding of news 10
obtained Embedding of news 11
obtained Embedding of news 12
obtained Embedding of news 13
obtained Embedding of news 14
obtained Embedding of news 15
obtained Embedding of news 16
obtained Embedding of news 17
obtained Embedding of news 18
obtained Embedding of news 19
obtained Embedding of news 20
obtained Embedding of news 21
obtained Embedding of news 22
obtained Embedding of news 23
obtained Embedding of news 24
obtained Embedding of news 25
obtained Embedding of news 26
obtained Embedding of news 27
obtained Embedding of news 28
obtained Embedding of news 29
obtained Embedding of news 30
obtained Embedding of news 31
obtained Embedding of news 32
obtained Embedding of news 33
obtained Embedding 

In [None]:
# Creating Embedding Column in Dataframe
# It will hold sentence embedding of each news article
df_data['news_embeddings'] = embeddings
df_data.news_embeddings

0      [[-0.05760742, 0.008265725, -0.032001264, 0.02...
1      [[0.019963518, 0.032315526, -0.03987241, -0.00...
2      [[-0.05454539, -0.008581202, -0.06515903, 0.02...
3      [[0.020184275, 0.023639578, -0.03999493, 0.064...
4      [[0.035867557, -0.008092756, -0.059313223, -0....
                             ...                        
176    [[0.043615624, -0.08399421, -0.04286886, -0.00...
177    [[-0.017788284, 0.044827122, -0.038701214, -0....
178    [[-0.075044796, 0.033335067, -0.024392763, -0....
179    [[0.005409523, -0.050416365, -0.009831261, 0.0...
180    [[0.046023387, -0.040333807, -0.04452484, -0.0...
Name: news_embeddings, Length: 181, dtype: object

# Finding Cluster of a new News Article
 - Now that we have our contextual embeddings ready, we can pose the clustering question as a similarity search.
 - Whenever a new article arrives, we compute its contextual representation.
 - We then compute the cosine similarity between the new article and each existing article.
 - We see which existing article has the most similar narrative and information, and choose the cluster to which that top-ranked article belongs.

 - Thus, we can obtain the cluster of a new news article.
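
The steps above can be sketched without any heavy dependencies. The example below uses tiny, made-up 2-dimensional vectors in place of real sentence embeddings, and the helper names (`top_k_mean_similarity`, `nearest_cluster`) are hypothetical. The article-level score is the mean of the three highest sentence-pair cosine similarities, mirroring the scoring used by the `GetCluster` class in this notebook.

```python
import math


def cosine(u, v):
    # Cosine similarity between two dense vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_mean_similarity(new_sents, old_sents, k=3):
    # Compare every sentence of the new article with every sentence of the
    # existing article and average the k best-matching pairs
    scores = sorted(
        (cosine(u, v) for u in new_sents for v in old_sents), reverse=True
    )
    top = scores[:k]
    return sum(top) / len(top) if top else 0.0


def nearest_cluster(new_sents, corpus):
    # The new article inherits the cluster of its best-matching article
    best = max(
        corpus,
        key=lambda art: top_k_mean_similarity(new_sents, art["embeddings"]),
    )
    return best["cluster_name"]


# Made-up embeddings: one list of sentence vectors per article
corpus = [
    {"cluster_name": "Origins", "embeddings": [[1.0, 0.0], [0.9, 0.1]]},
    {"cluster_name": "Cure", "embeddings": [[0.0, 1.0], [0.1, 0.9]]},
]
new_article = [[0.8, 0.2], [1.0, 0.1]]

print(nearest_cluster(new_article, corpus))
```

Averaging a few top pairs instead of taking the single maximum makes the score less sensitive to one coincidentally similar sentence.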


In [None]:
class GetCluster():
    def __init__(self, news):
      self.news = news
      self.clean_news = None
      self.news_summary = None
      self.sent_tokens = None
      self.news_embedddings = None
      

    def _cleannews(self):
      self.clean_news = clean_txt(self.news)

    def _get_tokens(self):
        self.sent_tokens = sent_tokens(self.clean_news)
        if len(self.sent_tokens) > 10:
          # we summarize it and then obtain tokens
          self.news_summary = bert_summarizer(self.clean_news)
          self.sent_tokens = sent_tokens(self.news_summary)

    def _get_st_embeddings(self):
        try:
            self.news_embedddings = embed(self.sent_tokens).numpy()
        except Exception:
            # Fall back to embedding the whole cleaned article as a single sentence
            self.news_embedddings = embed([self.clean_news]).numpy()

    def _cosine_max_compute(self, vect1):
        # Similarities between every sentence pair of the two articles
        flatten_similarities = cosine_similarity(vect1, self.news_embedddings).flatten()
        # Use the mean of the three most similar sentence pairs
        # (instead of the single maximum) as the article-level score
        flatten_sorted = -np.sort(-flatten_similarities)
        max_mean_score = flatten_sorted[:3].mean()
        return max_mean_score
        


    def _cosine_similarity(self):
        df_data['similarities'] = df_data['news_embeddings'].apply(self._cosine_max_compute)


    def _get_sort_best(self, N=1):
        df = df_data.sort_values('similarities', ascending=False)
        print()
        cluster_nom = df['cluster_name'].head(1).to_string(index=False)
        print(f"RESULT: The NEW NEWS article belongs to cluster :- {cluster_nom}")
        print()
        df = df[['text','title']].head(N)
        return df

    def get_N_similar_news(self, n=1):
        # Step 1 - news will be cleaned
        self._cleannews()
        # Step 2 - get news sentence tokens
        self._get_tokens()
        # Step 3 - Get setence token Embeddings
        self._get_st_embeddings()
        # Step 4 - Get cosine Sims
        self._cosine_similarity()
        # Step 5 - Sort the data-frame by similarity to the news
        n_similar_news = self._get_sort_best(N=n)
        return n_similar_news



# RESULTS -

- Let us check how this approach performs on some real-world news.
- I picked some news articles from the internet.
- Using this solution, we can assign a **cluster** to each new article.
- We can also search for the 'n' best-matching news articles with similar narratives in our corpus (where **'n'** is the user's choice; by default we see the single best-matching article).
- The matches are based on contextual narrative comparisons.


##### **Testing: Real World Example 1**
I picked this news snippet from the following link:

https://www.who.int/news-room/detail/13-06-2020-a-cluster-of-covid-19-in-beijing-people-s-republic-of-china


In [None]:
new_news = """
WHO is following up with Chinese authorities about a cluster of COVID-19 cases in Beijing, People’s Republic of China.
Today, officials from the National Health Commission and Beijing Health Commission briefed WHO’s China country office, to share details of preliminary investigations ongoing in Beijing.  
As of 13 June, 41 symptomatic laboratory confirmed cases and 46 laboratory confirmed cases without symptoms of COVID-19 have been identified in Beijing.
The first identified case had symptom onset on 9 June, and was confirmed on 11 June.  Several of the initial cases were identified through six fever clinics in Beijing.  Preliminary investigations revealed that some of the initial symptomatic cases had a link to the Xinfadi Market in Beijing.  Preliminary laboratory investigations of throat swabs from humans and environmental samples from Xinfadi Market identified 45 positive human samples (all without symptoms at the time of reporting) and 40 positive environmental samples.  One additional case without symptoms was identified as a close contact of a confirmed case.
"""

Let us try to find two articles on a similar topic in our data set.

In [None]:
n = 2
get_cluster = GetCluster(new_news)
results = get_cluster.get_N_similar_news(n)
# We could check the title of the similar text, through 'results.title'
# and full news article through 'results.text'
# The article number is displayed on the left.

print(results.title)


RESULT: The NEW NEWS article belongs to cluster :-  Origins

106    Shock: Americans "Chinese coronavirus" was sick in September and hid
101           Coronavirus in China: 2.8 million infected, 112 thousand dead
Name: title, dtype: object


##### Voila! 

We found two very similar news articles in our corpus in terms of topic and narrative.
- **The first one** is titled 'Shock: Americans "Chinese coronavirus" was sick in September and hid' (article number 106 in our data set).
- **The second one** is titled 'Coronavirus in China: 2.8 million infected, 112 thousand dead' (article number 101 in our data set).


##### **Testing: Real World Example 2**
I picked this news snippet from the following link:

https://economictimes.indiatimes.com/news/international/world-news/us-grapples-with-pandemic-as-its-origins-are-traced-in-china/articleshow/76943301.cms

I took the first two paragraphs of the news article.

In [None]:
new_news2 = """

The United States was grappling with the worst coronavirus outbreak in the world on Monday, as Florida shattered the national record for a state's largest single-day increase in new confirmed cases.
Meanwhile, two World Health Organization experts went in China for a mission to trace the origin of the pandemic. The virus was first detected in central China's city of Wuhan late last year. Beijing had been reluctant to allow a probe but relented after scores of countries called on the WHO to conduct a thorough investigation.
"""
get_cluster = GetCluster(new_news2)
results = get_cluster.get_N_similar_news()

print('Title is ...')
print(results.title)
print()
print('Text is ...')
print(results.text)




RESULT: The NEW NEWS article belongs to cluster :-  USA created COVID-2019

Title is ...
173    People learned about the coronavirus created in the US laboratory in 2015. They decided: a pandemic is not an accident.
Name: title, dtype: object

Text is ...
173    Social network users found on the Internet a study of bat coronavirus, in which scientists were able to modify the strain so that it infects humans. Scientists from the USA were engaged in the work back in 2015, and people saw this as confirmation of one of their favorite conspiracy theories related to COVID-2019.\nThe COVID-2019 epidemic, which began in the Chinese city of Wuhan, managed to get out of China and spread throughout the world in two months. In mid-March, an article became popular in social networks according to which a dangerous strain was artificially created as part of a study of bat coronavirus by American scientists.\nThe scientific work was written by a team of biologists back in 2015, on November 9 of the s

- In this example as well, we can see that this approach retrieves the article in our data set that is most similar to the new real-world article we found on the internet.

- And since we have a labelled data set with defined clusters, we can put the new, unseen article in the same cluster as its best-matching article.