# NLP 2. Summarising an article of text

**James Morgan (jhmmorgan)**

_2022-05-04_

# 📖 Background

We want a proof of concept, where an end user can easily be provided with a summary of a news article, along with a warning on whether the text is likely to contain hate speech or fake news.

This proof of concept would take the form of a standalone application that, when given the URL of a news article, returns a summary of the article along with a flag if it may contain hate speech or fake news.

### The Task
This notebook is **Part 2** of my NLP project. The task of this notebook is to create the model that will help us summarise an article.


# 🔬 Approach
There are several techniques that can be used to summarise text.  Two of them are:
1. **Extractive Text Summarisation Technique**
2. **Abstractive Summarisation Technique**

If you were to ask a human to summarise an article of text, they would likely do so through an abstractive summarisation technique.  This is where we write entirely new sentences, using different vocabulary to paraphrase the original text.

The other technique, **extractive text summarisation**, is very different.  Rather than paraphrasing the article, it extracts the most important, or most unique, sentences from the article and returns these.  It's a task that computers are great at and it doesn't require a model to be fitted with existing data.

My approach uses the extractive technique, as an abstractive approach is much more complicated and out of scope for this proof of concept.

<div class="alert alert-block alert-info">
<b>So how does this technique work?</b>
</div>


The model is fairly simple.
1. First, we preprocess the provided text:
    - Before anything else, we need to tweak the text.
        - As we want to return the most important sentences, we'll eventually separate the article into a list of sentences.  Before we do this, however, we need to make some amendments.
        - As we use periods (.) to separate sentences, I've found that articles containing a ranking or seeding, such as "Python is No. 1.", would incorrectly produce two sentences: "Python is No" and "1".
        - I therefore look for rankings and remove the period. In our example, this amends the sentence to "Python is No1".
    - Once I've made these amendments, I split the article into a list of individual sentences.
    - I then reformat the data to remove any special characters or special formatting.
    - The final preprocessing step is to break each sentence in our list into a sublist of individual words.
2. We then compare each sentence to every other sentence.
    - If our article contains three sentences, we perform six comparisons.  If it contains ten sentences, we perform 90.  In general, n sentences need n × (n − 1) comparisons.
    - When comparing two sentences, we measure their cosine distance:
        - We get all the unique words from the two sentences.
        - We create a vector for each sentence by counting how often each of these words appears in it.
        - We compute the cosine distance between the two vectors and store it in a matrix holding one value per pair of sentences.
3. We then rank each sentence on how similar it is to the others, with the most unique sentences getting a higher score than sentences that are similar to others.
4. Finally, we extract the top sentences and output them in their original order.
    - What this means is that if we wanted the top 2 sentences of an article and found that sentence 5 scored the highest, followed by sentence 3, we wouldn't want to output sentence 5 and then sentence 3, as this may not make sense to the end user.  We'd instead output sentence 3 and then sentence 5.
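
The preprocessing in step 1 can be sketched roughly as follows. This is a minimal illustration and not the code inside `nlp_summary`: the function name `preprocess`, the `RANKING` pattern and the choice to keep only letters, digits and spaces are all assumptions made for the example.

```python
import re

# Assumption: rankings look like "No. 1"; the pattern used in nlp_summary is not shown.
RANKING = re.compile(r"\bNo\.\s*(\d+)")


def preprocess(article):
    """Return the original sentences and, for each one, a list of its words."""
    # "No. 1" -> "No1", so the period is not mistaken for the end of a sentence.
    text = RANKING.sub(r"No\1", article)
    # Split on periods and drop empty fragments.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # Replace special characters with spaces, lower-case, then split into words.
    tokens = [re.sub(r"[^A-Za-z0-9 ]", " ", s).lower().split() for s in sentences]
    return sentences, tokens


sentences, tokens = preprocess("Python is No. 1. It is widely used!")
print(sentences)  # ['Python is No1', 'It is widely used!']
print(tokens)     # [['python', 'is', 'no1'], ['it', 'is', 'widely', 'used']]
```

Keeping the untouched `sentences` alongside the cleaned `tokens` means the summary can later be printed with its original punctuation.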

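Steps 2 to 4 can be illustrated with a small, self-contained sketch. The scoring rule here is an assumption, since the exact ranking used by the class is not shown: each sentence is scored by its mean cosine distance to every other sentence, so the most unique sentences score highest. The names `cosine_distance` and `top_sentences` are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(words_a, words_b):
    """1 - cosine similarity of the two sentences' word-count vectors."""
    count_a, count_b = Counter(words_a), Counter(words_b)
    vocab = set(count_a) | set(count_b)  # unique words from both sentences
    dot = sum(count_a[w] * count_b[w] for w in vocab)
    norm_a = math.sqrt(sum(c * c for c in count_a.values()))
    norm_b = math.sqrt(sum(c * c for c in count_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty sentence shares nothing with any other
    return 1.0 - dot / (norm_a * norm_b)


def top_sentences(sentences, tokens, top_n=2):
    """Pick the top_n highest-scoring sentences, returned in original order."""
    n = len(sentences)
    # n * (n - 1) ordered comparisons; the diagonal is left at 0.
    matrix = [[cosine_distance(tokens[i], tokens[j]) if i != j else 0.0
               for j in range(n)] for i in range(n)]
    # Assumed scoring rule: mean distance to all other sentences.
    scores = [sum(row) / max(n - 1, 1) for row in matrix]
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]  # restore original order


sentences = [
    "Federer won the match",
    "Federer won the final",
    "Federer won the match in Basel",
    "Nadal has a knee injury",
]
tokens = [s.lower().split() for s in sentences]
result = top_sentences(sentences, tokens, top_n=2)
print(result)  # ['Federer won the final', 'Nadal has a knee injury']
```

In this example the last sentence scores highest and the second sentence comes next, yet the output lists them in the order they appear in the article.
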
<div class="alert alert-block alert-info">
<b>So how does this look in practice?</b>
</div>


# 📚 Libraries and functions
We'll start by loading the libraries and then loading in the example data containing various articles of text.

In [1]:
import pandas as pd
from utils import *
from nlp_summary import *

In [2]:
df = pd.read_csv("./data/summary.csv")
print2.heading("Heading of our example data")
df.head()

Heading of our example data


| | article_id | article_text | source |
|---|---|---|---|
| 0 | 1 | Maria Sharapova has basically no friends as te... | https://www.tennisworldusa.org/tennis/news/Mar... |
| 1 | 2 | BASEL, Switzerland (AP), Roger Federer advance... | http://www.tennis.com/pro-game/2018/10/copil-s... |
| 2 | 3 | Roger Federer has revealed that organisers of ... | https://scroll.in/field/899938/tennis-roger-fe... |
| 3 | 4 | Kei Nishikori will try to end his long losing ... | http://www.tennis.com/pro-game/2018/10/nishiko... |
| 4 | 5 | Federer, 37, first broke through on tour over ... | https://www.express.co.uk/sport/tennis/1036101... |


---


# ⚙️ Summarise model in action

In [3]:
sw      = summarise()
summary = sw.generate_summary(df.article_text, split = "  ")

In [4]:
print2.heading("Article 1")
print(df['article_text'][0])
print()
print2.heading("Summary of Article 1")
print(summary[0])

Article 1
Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all. I think just because 

In [5]:
# Index 5 is the sixth row of df, so it is labelled Article 6.
print2.heading("Article 6")
print(df['article_text'][5])
print()
print2.heading("Summary of Article 6")
print(summary[5])

Article 6
Nadal has not played tennis since he was forced to retire from the US Open semi-finals against Juan Martin Del Porto with a knee injury. The world No 1 has been forced to miss Spain's Davis Cup clash with France and the Asian hard court season. But with the ATP World Tour Finals due to begin next month, Nadal is ready to prove his fitness before the season-ending event at the 02 Arena. Nadal flew to Paris on Friday and footage from the Paris Masters official Twitter account shows the Spaniard smiling as he strides onto court for practice. The Paris Masters draw has been made and Nadal will start his campaign on Tuesday or Wednesday against either Fernando Verdasco or Jeremy Chardy. Nadal could then play defending champion Jack Sock in the third round before a potential quarter-final with either Borna Coric or Dominic Thiem. Nadal's appearance in Paris is a big boost to the tournament organisers who could see Roger Federer withdraw. Federer is in action at the Swiss 

---


# 🎓 Summary
This concludes the second part of our NLP project.  Using a simple custom class that we import, we can easily create a summary of any article or set of articles, whilst deciding how many sentences to return.

Whilst the summary may not always make perfect sense, this is a very efficient and quick way to summarise any article of text.