## Text Summarisation

##### Use case: When you go through a article, you want to summarise it. Typically this would be done manually and identify the key information.. Python can help automate this

## 1. Text Summarisation using genism TextRank

How does TextRank work?

The text corpus is pre-processed - all stop words, punctuations removed, words are lemmatized (what is the root word?).

Then each sentence goes through a vectorisation process - i.e. create words in a short way to represent the sentence

The algorithm looks and calculates the similarities between these sentence vectors.

It will next build a graph representation, each sentence representing a node, and the similiarities between sentences are the edge values. These edge values are stored into a matrix.

The sentences are ranked according to the significance of weights.

### Step 1: Import genism library

In [4]:
import gensim
from gensim.summarization import summarize

### Step 2: Initialise article (sample)

In [6]:
article_text='Junk foods taste good thatâ€™s why it is mostly liked by everyone of any age group especially kids and school going children. They generally ask for the junk food daily because they have been trend so by their parents from the childhood. They never have been discussed by their parents about the harmful effects of junk foods over health. According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways. They are generally fried food found in the market in the packets. They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers. Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life. It makes able a person to gain excessive weight which is called as obesity. Junk foods tastes good and looks good however do not fulfil the healthy calorie requirement of the body. Some of the foods like french fries, fried foods, pizza, burgers, candy, soft drinks, baked goods, ice cream, cookies, etc are the example of high-sugar and high-fat containing foods. It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes. In type-2 diabetes our body become unable to regulate blood sugar level. Risk of getting this disease is increasing as one become more obese or overweight. It increases the risk of kidney failure. Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers. It increases risk of cardiovascular diseases because it is rich in saturated fat, sodium and bad cholesterol. High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning. One who like junk food develop more risk to put on extra weight and become fatter and unhealthier. Junk foods contain high level carbohydrate which spike blood sugar level and make person more lethargic, sleepy and less active and alert. Reflexes and senses of the people eating this food become dull day by day thus they live more sedentary life. Junk foods are the source of constipation and other disease like diabetes, heart ailments, clogged arteries, heart attack, strokes, etc because of being poor in nutrition. Junk food is the easiest way to gain unhealthy weight. The amount of fats and sugar in the food makes you gain weight rapidly. However, this is not a healthy weight. It is more of fats and cholesterol which will have a harmful impact on your health. Junk food is also one of the main reasons for the increase in obesity nowadays.This food only looks and tastes good, other than that, it has no positive points. The amount of calorie your body requires to stay fit is not fulfilled by this food. For instance, foods like French fries, burgers, candy, and cookies, all have high amounts of sugar and fats. Therefore, this can result in long-term illnesses like diabetes and high blood pressure. This may also result in kidney failure. Above all, you can get various nutritional deficiencies when you donâ€™t consume the essential nutrients, vitamins, minerals and more. You become prone to cardiovascular diseases due to the consumption of bad cholesterol and fat plus sodium. In other words, all this interferes with the functioning of your heart. Furthermore, junk food contains a higher level of carbohydrates. It will instantly spike your blood sugar levels. This will result in lethargy, inactiveness, and sleepiness. A person reflex becomes dull overtime and they lead an inactive life. To make things worse, junk food also clogs your arteries and increases the risk of a heart attack. Therefore, it must be avoided at the first instance to save your life from becoming ruined.The main problem with junk food is that people donâ€™t realize its ill effects now. When the time comes, it is too late. Most importantly, the issue is that it does not impact you instantly. It works on your overtime; you will face the consequences sooner or later. Thus, it is better to stop now.You can avoid junk food by encouraging your children from an early age to eat green vegetables. Their taste buds must be developed as such that they find healthy food tasty. Moreover, try to mix things up. Do not serve the same green vegetable daily in the same style. Incorporate different types of healthy food in their diet following different recipes. This will help them to try foods at home rather than being attracted to junk food.In short, do not deprive them completely of it as that will not help. Children will find one way or the other to have it. Make sure you give them junk food in limited quantities and at healthy periods of time. Hope it will help you'

### Step 3: Pass the article_text corpus into summarize

In [7]:
short_summary=summarize(article_text)
print(short_summary)

They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.
Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life.
Junk foods tastes good and looks good however do not fulfil the healthy calorie requirement of the body.
It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes.
Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers.
It increases risk of cardiovascular diseases because it is rich in saturated fat, sodium and bad cholesterol.
High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning.
One who like junk food develop more risk to put on extra 

The issue with the above summarised text is that it is way too long. The summarize function can take in two variables:

ratio: value can be between 0 to 1. It is the ratio of the summary to the original text in terms of the length.
word_count: this is number of words in the summary

Test 1: Make the summary only 10% of the original text corpus

In [8]:
summary_by_ratio=summarize(article_text,ratio=0.1)
print(summary_by_ratio)

They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.
Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life.
Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers.
High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning.
Junk foods contain high level carbohydrate which spike blood sugar level and make person more lethargic, sleepy and less active and alert.


Test 2: Summarise using word_count limit being 20

In [10]:
summary=summarize(article_text,ratio=0.1,word_count=20)
print(summary)

They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.


## 2. Text Summarisation using spacy

In [11]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc=nlp(article_text)

### 2.1 Only filter for proper nouns, adj, nouns and verbs

In [12]:
keywords_list = []

desired_pos = ['PROPN', 'ADJ', 'NOUN', 'VERB']

from string import punctuation #import this to identify stop words or punctuations.


# Iterating through tokens 
for token in doc: 
  # checking if a token is stopword or punctuation
  if(token.text in nlp.Defaults.stop_words or token.text in punctuation):
    # If true, they are just ignored and loop goes to the next token
    continue
  #  checking if the POS tag of the token is in our desired list
  if(token.pos_ in desired_pos):
    # If true, append the token to our keywords list
    keywords_list.append(token.text)

### 2.2 Counter

In [14]:
from collections import Counter

dictionary = Counter(keywords_list) 
print(dictionary)

Counter({'food': 19, 'junk': 13, 'foods': 12, 'high': 9, 'sugar': 7, 'Junk': 6, 'healthy': 6, 'weight': 6, 'cholesterol': 5, 'body': 5, 'blood': 5, 'heart': 5, 'good': 4, 'sodium': 4, 'fat': 4, 'gain': 4, 'life': 4, 'diabetes': 4, 'level': 4, 'increases': 4, 'risk': 4, 'children': 3, 'effects': 3, 'health': 3, 'found': 3, 'nutrients': 3, 'unhealthy': 3, 'lack': 3, 'impact': 3, 'person': 3, 'bad': 3, 'fats': 3, 'result': 3, 'help': 3, 'taste': 2, 'age': 2, 'parents': 2, 'harmful': 2, 'fried': 2, 'dietary': 2, 'fibers': 2, 'makes': 2, 'obesity': 2, 'tastes': 2, 'looks': 2, 'calorie': 2, 'fries': 2, 'burgers': 2, 'candy': 2, 'cookies': 2, 'eating': 2, 'prone': 2, 'type-2': 2, 'disease': 2, 'kidney': 2, 'failure': 2, 'lead': 2, 'nutritional': 2, 'deficiencies': 2, 'essential': 2, 'vitamins': 2, 'minerals': 2, 'cardiovascular': 2, 'diseases': 2, 'diet': 2, 'pressure': 2, 'functioning': 2, 'spike': 2, 'people': 2, 'dull': 2, 'day': 2, 'arteries': 2, 'attack': 2, 'way': 2, 'main': 2, 'instanc

### 2.3 Normalisation

In [17]:
# find the highest frequency 
highest_frequency = Counter(keywords_list).most_common(1)[0][1] 

print(highest_frequency)

19


In [18]:
for word in dictionary:
    dictionary[word] = (dictionary[word]/highest_frequency) 

print(dictionary)

Counter({'food': 1.0, 'junk': 0.6842105263157895, 'foods': 0.631578947368421, 'high': 0.47368421052631576, 'sugar': 0.3684210526315789, 'Junk': 0.3157894736842105, 'healthy': 0.3157894736842105, 'weight': 0.3157894736842105, 'cholesterol': 0.2631578947368421, 'body': 0.2631578947368421, 'blood': 0.2631578947368421, 'heart': 0.2631578947368421, 'good': 0.21052631578947367, 'sodium': 0.21052631578947367, 'fat': 0.21052631578947367, 'gain': 0.21052631578947367, 'life': 0.21052631578947367, 'diabetes': 0.21052631578947367, 'level': 0.21052631578947367, 'increases': 0.21052631578947367, 'risk': 0.21052631578947367, 'children': 0.15789473684210525, 'effects': 0.15789473684210525, 'health': 0.15789473684210525, 'found': 0.15789473684210525, 'nutrients': 0.15789473684210525, 'unhealthy': 0.15789473684210525, 'lack': 0.15789473684210525, 'impact': 0.15789473684210525, 'person': 0.15789473684210525, 'bad': 0.15789473684210525, 'fats': 0.15789473684210525, 'result': 0.15789473684210525, 'help': 0

### 2.4 Assign scores to each sentence

In [21]:
score={}

# Iterating through each sentence
for sentence in doc.sents: 
    # Iterating through token of each sentence
    for token in sentence:
        # checking if the token is a keyword 
        if token.text in dictionary.keys():
            # If true , add the frequency of keyword to the score dictionary 
            if sentence in score.keys():
                score[sentence]+=dictionary[token.text]
            else:
                score[sentence]=dictionary[token.text]

print(score)

{Junk foods taste good thatâ€: 1.3157894736842104, ™s why it is mostly liked by everyone of any age group especially kids and school going children.: 0.5789473684210527, They generally ask for the junk food daily because they have been trend so by their parents from the childhood.: 1.9473684210526316, They never have been discussed by their parents about the harmful effects of junk foods over health.: 1.894736842105263, According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways.: 2.0526315789473686, They are generally fried food found in the market in the packets.: 1.3684210526315788, They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.: 4.36842105263158, Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life.: 2.789473

In [22]:
#Sort by descending

sorted_score = sorted(score.items(), key=lambda kv: kv[1], reverse=True)

### Summary

In [24]:
# list to store sentences of summary
text_summary=[]

# Deciding the  total no of sentences in summary
no_of_sentences=4

# to count the no of sentence we already added to summary
total = 0

for i in range(len(sorted_score)):
    # appending to the summary
    text_summary.append(str(sorted_score[i][0]).capitalize()) 
    total += 1
    # checking if limit exceeded
    if(total >= no_of_sentences):
        break 

print(text_summary)

['Some of the foods like french fries, fried foods, pizza, burgers, candy, soft drinks, baked goods, ice cream, cookies, etc are the example of high-sugar and high-fat containing foods.', 'They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.', 'Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers.', 'Junk foods contain high level carbohydrate which spike blood sugar level and make person more lethargic, sleepy and less active and alert.']
