# Extractive Text Summarization

## Using Sumy

Ne pensez-vous pas qu'il serait très simple et bénéfique d'avoir une bibliothèque, qui va nous permettre d'effectuer des résumés via plusieurs algorithmes ?
Heureusement, nous avons déjà la bibliothèque sumy pour cela !

**Sumy** libraray nous fournit plusieurs algorithmes pour implémenter la synthèse de texte existant sous format de document ou bien en pages HTML. Il nous faut juste d’importer simplement un algorithme de notre choix plutôt que de le coder nous-même.


In [2]:
''' Extractive Text Summarization with the Sumy(Sumy - module for automatic summarization of text documents and HTML pages.) '''
!pip install sumy



In [3]:
import sumy

In [5]:
original_text = 'Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children. They generally ask for the junk food daily because they have been trend so by their parents from the childhood. They never have been discussed by their parents about the harmful effects of junk foods over health. According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways. They are generally fried food found in the market in the packets. They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers. Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life. It makes able a person to gain excessive weight which is called as obesity. Junk foods tastes good and looks good however do not fulfil the healthy calorie requirement of the body. Some of the foods like french fries, fried foods, pizza, burgers, candy, soft drinks, baked goods, ice cream, cookies, etc are the example of high-sugar and high-fat containing foods. It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes. In type-2 diabetes our body become unable to regulate blood sugar level. Risk of getting this disease is increasing as one become more obese or overweight. It increases the risk of kidney failure. Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers. It increases risk of cardiovascular diseases because it is rich in saturated fat, sodium and bad cholesterol. High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning. One who like junk food develop more risk to put on extra weight and become fatter and unhealthier. Junk foods contain high level carbohydrate which spike blood sugar level and make person more lethargic, sleepy and less active and alert. Reflexes and senses of the people eating this food become dull day by day thus they live more sedentary life. Junk foods are the source of constipation and other disease like diabetes, heart ailments, clogged arteries, heart attack, strokes, etc because of being poor in nutrition. Junk food is the easiest way to gain unhealthy weight. The amount of fats and sugar in the food makes you gain weight rapidly. However, this is not a healthy weight. It is more of fats and cholesterol which will have a harmful impact on your health. Junk food is also one of the main reasons for the increase in obesity nowadays.This food only looks and tastes good, other than that, it has no positive points. The amount of calorie your body requires to stay fit is not fulfilled by this food. For instance, foods like French fries, burgers, candy, and cookies, all have high amounts of sugar and fats. Therefore, this can result in long-term illnesses like diabetes and high blood pressure. This may also result in kidney failure. Above all, you can get various nutritional deficiencies when you don’t consume the essential nutrients, vitamins, minerals and more. You become prone to cardiovascular diseases due to the consumption of bad cholesterol and fat plus sodium. In other words, all this interferes with the functioning of your heart. Furthermore, junk food contains a higher level of carbohydrates. It will instantly spike your blood sugar levels. This will result in lethargy, inactiveness, and sleepiness. A person reflex becomes dull overtime and they lead an inactive life. To make things worse, junk food also clogs your arteries and increases the risk of a heart attack. Therefore, it must be avoided at the first instance to save your life from becoming ruined.The main problem with junk food is that people don’t realize its ill effects now. When the time comes, it is too late. Most importantly, the issue is that it does not impact you instantly. It works on your overtime; you will face the consequences sooner or later. Thus, it is better to stop now.You can avoid junk food by encouraging your children from an early age to eat green vegetables. Their taste buds must be developed as such that they find healthy food tasty. Moreover, try to mix things up. Do not serve the same green vegetable daily in the same style. Incorporate different types of healthy food in their diet following different recipes. This will help them to try foods at home rather than being attracted to junk food.In short, do not deprive them completely of it as that will not help. Children will find one way or the other to have it. Make sure you give them junk food in limited quantities and at healthy periods of time.'
original_text

'Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children. They generally ask for the junk food daily because they have been trend so by their parents from the childhood. They never have been discussed by their parents about the harmful effects of junk foods over health. According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways. They are generally fried food found in the market in the packets. They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers. Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life. It makes able a person to gain excessive weight which is called as obesity. Junk foods tastes good and looks good however do not fulfil the healthy calorie 

### Lex Rank 

LexRank est une approche non supervisés basée sur des graphes pour effectuer une synthèse automatique de texte. La notation des phrases se fait à l'aide de la méthode des graphes. <code>LexRankSummarizer</code> est utilisé pour calculer l'importance des phrases sur la base du concept de centralité des vecteurs propres dans une représentation graphique des phrases.

D’une façon plus simple, si une phrase est similaire à de nombreuses autres phrases du texte alors elle a une forte probabilité d'être importante. Et lorsqu’une phrase particulière est recommandée par d'autres phrases similaires elle est donc classée plus haut.
Et plus le rang est élevé, plus la priorité d'être inclus dans le texte résumé s’augmente.


In [8]:
# Importing the parser and tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

In [9]:
# Import the LexRank summarizer
from sumy.summarizers.lex_rank import LexRankSummarizer

L'analyse syntaxique(**Parsing**) est le processus d'analyse d'un texte composé d'une séquence de jetons(**Tokens**) pour déterminer sa structure grammaticale par rapport à une grammaire formelle donnée (plus ou moins). L'analyseur(**The Parser**) construit ensuite une structure de données basée sur les jetons.

In [10]:
# Initializing the parser
my_parser = PlaintextParser.from_string(original_text,Tokenizer('english'))

In [11]:
# Creating a summary of 3 sentences.
lex_rank_summarizer = LexRankSummarizer()
lexrank_summary = lex_rank_summarizer(my_parser.document,sentences_count=3)

# Printing the summary
for sentence in lexrank_summary:
    print(sentence)

It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes.
It is more of fats and cholesterol which will have a harmful impact on your health.
Children will find one way or the other to have it.


### LSA ( Latent Semantic Analysis )

**LSA** (**Latent Semantic Analysis**) également connu sous le nom de **LSI** (**Latent Semantic Index**) LSA utilise un modèle de sac de mots (Bag of Words BoW), qui se traduit par une matrice terme-document (occurrence de termes dans un document). Les lignes représentent les termes et les colonnes représentent les documents. **LSA** apprend les sujets latents en effectuant une décomposition matricielle sur la matrice document-terme à l'aide de la décomposition en valeurs singulières. 

In [12]:
# Import the summarizer
from sumy.summarizers.lsa import LsaSummarizer

In [13]:
# Parsing the text string using PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
parser=PlaintextParser.from_string(original_text,Tokenizer('english'))

In [14]:
# creating the summarizer
lsa_summarizer=LsaSummarizer()
lsa_summary= lsa_summarizer(parser.document,3)

# Printing the summary
for sentence in lsa_summary:
    print(sentence)

Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children.
To make things worse, junk food also clogs your arteries and increases the risk of a heart attack.
Therefore, it must be avoided at the first instance to save your life from becoming ruined.The main problem with junk food is that people don’t realize its ill effects now.


### Luhn

L'approche de l'algorithme de résumé <code>Luhn</code> est basée sur **TF-IDF** (Term Frequency-Inverse Document Frequency). Il est utile lorsque les mots très peu fréquents ainsi que les mots très fréquents (stopwords) ne sont pas tous les deux significatifs.
On se basons sur ça, le scoring des phrases est effectuée et les phrases les mieux classées parviennent au résumé.


In [15]:
# Import the summarizer
from sumy.summarizers.luhn import LuhnSummarizer

In [16]:
# Creating the parser
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
parser=PlaintextParser.from_string(original_text,Tokenizer('english'))

In [17]:
# Creating the summarizer
luhn_summarizer=LuhnSummarizer()
luhn_summary=luhn_summarizer(parser.document,sentences_count=3)

# Printing the summary
for sentence in luhn_summary:
    print(sentence)

They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.
It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes.
Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers.


Et comme vous pouvez remarquez si vous essayez de lire les precedents resumes que ce dernier obtenu par <code>LuhnSummarizer()</code> est bien formee aussi quelle a plus de sens.

### KL-Sum 

Algorithme de somme KL (**Kullback-Lieber (KL) Sum**) pour la synthèse de texte qui se concentre sur la minimisation du vocabulaire résumé en vérifiant la divergence par rapport au vocabulaire d'entrée.

Il s'agit d'un algorithme de sélection de phrases où une longueur cible pour le résumé est fixée appelée (L words). Il sélectionne des phrases en fonction de la similitude de la distribution des mots avec le texte original. Il vise à abaisser les critères de divergence KL. Il utilise une approche d'optimisation gourmande et continue à ajouter des phrases jusqu'à ce que la divergence KL diminue.


In [18]:
#importing the KL Summarizer
from sumy.summarizers.kl import KLSummarizer

In [19]:
# Creating the parser
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
parser=PlaintextParser.from_string(original_text,Tokenizer('english'))

In [20]:
# Instantiating the KLSummarizer
kl_summarizer=KLSummarizer()
kl_summary=kl_summarizer(parser.document,sentences_count=3)

# Printing the summary
for sentence in kl_summary:
    print(sentence)

It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes.
High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning.
Junk food is the easiest way to gain unhealthy weight.


Nous pouvons remarquer que ce résumé est le plus court par rapport aux autres précédents résumés, et aussi il est mieux structure et les phrase sont bien choisie. 

# Abstractive Text Summarization

Un moyen simple et efficace consiste à utiliser le **Haggingface’s Transformers Library**.
Huggingface prend en charge des modèles de pointe pour mettre en œuvre des tâches telles que la synthèse, la classification, etc. Et voila certains modèles courants : GPT-2, GPT-3, BERT, OpenAI, GPT, T5.
Une des fonctionnalités intéressantes de Transformers est qu'ils fournissent aux modèles PreTrained des poids (weights) qui peuvent être facilement instanciés via la méthode <code>from_pretrained()</code>.


In [21]:
''' Abstractive Text Summarization with the Huggingface’s transformers library. '''
!pip install transformers



## Summarization with T5 Transformers

**T5** est un modèle d'encodeur-décodeur pré-entraîné sur un mélange multi-tâches conçus aux tâches non supervisées et supervisées, et pour lequel chaque tâche est convertie en un format texte-à-texte. 
**T5** fonctionne bien sur une variété de tâches prêtes à l'emploi en ajoutant un préfixe différent à l'entrée correspondant à chaque tâche, par exemple, pour la traduction : traduire l'anglais vers l'allemand…, pour le résumé : summarize…, etc.
Il est formé en utilisant le **teacher forcing** ou (forçage des enseignants). Cela signifie que pour l'entraînement, nous avons toujours besoin d'une séquence d'entrée et d'une séquence cible correspondante. La séquence d'entrée est transmise au modèle à l'aide de input_ids. La séquence cible est décalée vers la droite, c'est-à-dire précédée d'un (token) jeton de séquence de démarrage et transmise au décodeur à l'aide de decoder_input_ids. Dans le style de teacher forcing, la séquence cible est ensuite ajoutée par le token (jeton) EOS et correspond aux étiquettes. Le jeton PAD est ici utilisé comme jeton de séquence de démarrage. T5 peut être entraîné / affiné à la fois de manière supervisée et non supervisée.
On peut utiliser <code>T5ForConditionalGeneration</code> (ou la variante Tensorflow/Flax), qui inclut la tête de modélisation du langage au-dessus du décodeur.


In [22]:
# Importing requirements
from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration

In [23]:
''' In This part I was in need to install the tensorflow and pytorch Library  '''
# Instantiating the model and tokenizer
my_model = T5ForConditionalGeneration.from_pretrained('t5-small')

In [24]:
!pip install sentencepiece



In [25]:
tokenizer = T5Tokenizer.from_pretrained('t5-small')

In [26]:
# Concatenating the word "summarize:" to raw text
text = "summarize:" + original_text
text

'summarize:Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children. They generally ask for the junk food daily because they have been trend so by their parents from the childhood. They never have been discussed by their parents about the harmful effects of junk foods over health. According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways. They are generally fried food found in the market in the packets. They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers. Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life. It makes able a person to gain excessive weight which is called as obesity. Junk foods tastes good and looks good however do not fulfil the health

In [27]:
# encoding the input text
input_ids = tokenizer.encode(text, return_tensors='pt', max_length=512)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [28]:
# Generating summary ids
summary_ids = my_model.generate(input_ids)
summary_ids

tensor([[    0, 11797,  4371,    33,     8,   598,    13,  3607,    11, 23363,
          1293,  2485,     3,     5,    79,   582,   306,    16, 10004,     6]])

In [29]:
# Decoding the tensor and printing the summary.
t5_summary = tokenizer.decode(summary_ids[0])
print(t5_summary)

<pad> junk foods are the means of rapid and unhealthy weight gain. they become high in calories,


## Text Summarization with BART Transformers

**BART** est un auto-encodeur de dé bruitage pour le pré-entraînement des modèles séquence à séquence. Il est entraîné en (1) corrompant le texte avec une fonction de bruit arbitraire et (2) en apprenant un modèle pour reconstruire le texte original. Il utilise une architecture standard de traduction automatique neuronale basée sur Transformer. Il utilise une architecture standard seq2seq/NMT avec un encodeur bidirectionnel (comme **BERT**) et un décodeur de gauche à droite (comme **GPT**). Cela signifie que le masque d'attention du codeur est entièrement visible, comme **BERT**, et que le masque d'attention du décodeur est causal, comme **GPT2**.

In [30]:
# Importing the model
from transformers import BartForConditionalGeneration, BartTokenizer, BartConfig

In [31]:
# Loading the model and tokenizer for bart-large-cnn
tokenizer=BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model=BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

In [32]:
# Encoding the inputs and passing them to model.generate()
inputs = tokenizer.batch_encode_plus([original_text],return_tensors='pt')
summary_ids = model.generate(inputs['input_ids'], early_stopping=True)

In [33]:
# Decoding and printing the summary
bart_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(bart_summary)

Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children. They generally ask for the junk food daily because they have been trend so by their parents from the childhood. According to the research by scientists, it has been found that junk foods have negative effects on the health.
