### Install Sumy

In [1]:
#! pip install sumy

In [2]:
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from nltk.tokenize import word_tokenize, sent_tokenize
import pandas as pd

## PART1: TEXT SUMMARIZATION FROM WIKIPEDIA TEXT

### Scrape the text

In [3]:
url = "https://en.wikipedia.org/wiki/Automatic_summarization"
parser = HtmlParser.from_url(url, Tokenizer("english"))

In [4]:
doc = parser.document
doc

<DOM with 63 paragraphs>

### Summarize - TextRankSummarizer

In [5]:
summarizer = TextRankSummarizer()

In [6]:
summary_text = summarizer(doc, 5)
summary_text

(<Sentence: Text summarization finds the most informative sentences in a document;[1] various methods of image summarization are the subject of ongoing research, with some looking to display the most representative images from a given collection or generating a video;[2][3][4] video summarization extracts the most important frames from the video content.>,
 <Sentence: For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.>,
 <Sentence: Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).>,
 <Sente
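
One of the sentences selected above describes the ranking step itself: build a sentence graph, apply a damping factor, and iterate toward a stationary distribution. The following is a minimal sketch of that idea, not sumy's implementation. The word-overlap similarity, the function names, and the damping value of 0.85 are all assumptions chosen for illustration.

```python
import re

def overlap(a, b):
    """Hypothetical similarity: number of distinct lowercase words shared."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    return len(words_a & words_b)

def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences by repeated propagation over a similarity graph."""
    n = len(sentences)
    # Edge weight between every pair of distinct sentences.
    weights = [[overlap(a, b) if i != j else 0
                for j, b in enumerate(sentences)]
               for i, a in enumerate(sentences)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Score flowing in from every sentence that links to sentence i.
            incoming = sum(scores[j] * weights[j][i] / out_sums[j]
                           for j in range(n) if out_sums[j] > 0)
            updated.append((1 - damping) / n + damping * incoming)
        scores = updated
    return scores

sentences = [
    "Summarization shortens a document.",
    "Extractive summarization selects sentences from a document.",
    "The weather was pleasant yesterday.",
]
scores = rank_sentences(sentences)
for sentence, score in zip(sentences, scores):
    print(f"{score:.3f}  {sentence}")
```

The unrelated third sentence shares no words with the others, so it receives only the baseline score and ranks last.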

### Try different Summarizers
- LexRankSummarizer
- LuhnSummarizer
- LsaSummarizer

### Import the summarizers

In [7]:
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer

### Create Summarizers

In [8]:
lexSummarizer =  LexRankSummarizer()
luhnSummarizer = LuhnSummarizer()
lsaSummarizer = LsaSummarizer()

### LexRankSummarizer

In [9]:
lex_summary_text = lexSummarizer(doc, 5)
lex_summary_text

(<Sentence: An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.>,
 <Sentence: Image collection summarization is another application example of automatic summarization.>,
 <Sentence: The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary".>,
 <Sentence: Automatic Text Summarization.>,
 <Sentence: Automatic Keyphrases Extraction.>)

### LuhnSummarizer

In [10]:
luhn_summary_text = luhnSummarizer(doc, 5)
luhn_summary_text

(<Sentence: Text summarization finds the most informative sentences in a document;[1] various methods of image summarization are the subject of ongoing research, with some looking to display the most representative images from a given collection or generating a video;[2][3][4] video summarization extracts the most important frames from the video content.>,
 <Sentence: Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).>,
 <Sentence: For example, if we rank unigrams and find that "advanced", "natural", "language", and "processing" all get high ranks, then we would look at the original text and see that these words appear consecutively and create a final keyphrase using all four together.>,
 <Sentence: It is worth noting that TextRank was 

### LsaSummarizer

In [11]:
lsa_summary_text = lsaSummarizer(doc, 5)
lsa_summary_text

(<Sentence: For instance, in the above text, we might learn a rule that says phrases with initial capital letters are likely to be keyphrases.>,
 <Sentence: Hulth uses a reduced set of features, which were found most successful in the KEA (Keyphrase Extraction Algorithm) work derived from Turney's seminal paper.>,
 <Sentence: It is worth noting that TextRank was applied to summarization exactly as described here, while LexRank was used as part of a larger summarization system ( MEAD) that combines the LexRank score (stationary probability) with other features like sentence position and length using a linear combination with either user-specified or automatically tuned weights.>,
 <Sentence: Automatic summaries present information extracted from multiple sources algorithmically, without any editorial touch or subjective human intervention, thus making it completely unbiased.>,
 <Sentence: Although they did not replace other approaches and are often combined with them, by 2019 machine le

## 2. Gensim

## Task: Take a piece of text from a Wikipedia page and summarize it using Gensim
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [12]:
#!pip install "gensim<4"  # gensim.summarization was removed in Gensim 4.0

### Import the library

In [13]:
import gensim

In [14]:
from gensim.summarization import summarize

### Scrape the text
- Use BeautifulSoup to extract text (from Task1 of ML-1)

In [15]:
import requests
import re
from bs4 import BeautifulSoup

In [16]:
def get_page(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    return soup

In [17]:
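def collect_text(soup):
    # Note: `url` is read from the notebook's global scope, not passed in.
    text = f'url: {url}\n\n'
    para_text = soup.find_all('p')  # every <p> element on the page
    print(f"paragraphs text = \n {para_text}")
    for para in para_text:
        text += f"{para.text}\n\n"
    return text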

In [18]:
url = "https://en.wikipedia.org/wiki/Automatic_summarization"

In [19]:
text = collect_text(get_page(url))

paragraphs text = 
 [<p><b>Automatic summarization</b> is the process of shortening a set of data computationally, to create a subset (a <a href="/wiki/Abstract_(summary)" title="Abstract (summary)">summary</a>) that represents the most important or relevant information within the original content.
</p>, <p>In addition to text, images and videos can also be summarized. Text summarization finds the most informative sentences in a document;<sup class="reference" id="cite_ref-Torres2014_1-0"><a href="#cite_note-Torres2014-1">[1]</a></sup> various methods of image summarization are the subject of ongoing research, with some looking to display the most representative images from a given collection or generating a video;<sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup><sup class="reference" id="cite_ref-3"><a href="#cite_note-3">[3]</a></sup><sup class="reference" id="cite_ref-4"><a href="#cite_note-4">[4]</a></sup> video summarization extracts the most important fr

In [20]:
text

'url: https://en.wikipedia.org/wiki/Automatic_summarization\n\nAutomatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.\n\n\nIn addition to text, images and videos can also be summarized. Text summarization finds the most informative sentences in a document;[1] various methods of image summarization are the subject of ongoing research, with some looking to display the most representative images from a given collection or generating a video;[2][3][4] video summarization extracts the most important frames from the video content.[5]\n\n\nIn 2022 Google Docs released an automatic summarization feature.[6]\n\n\nThere are two general approaches to automatic summarization: extraction and abstraction.\n\n\nHere, content is extracted from the original data, but the extracted content is not modified in any way. Examples of extracted content include key
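
The scraped text still carries Wikipedia citation markers such as `[1]` and runs of blank lines, and both end up in the summaries. A minimal cleanup sketch using the `re` module is shown below; the helper name `clean_text` and the sample string are hypothetical and not part of the original notebook.

```python
import re

def clean_text(raw):
    """Remove numeric citation markers and collapse runs of blank lines."""
    no_citations = re.sub(r"\[\d+\]", "", raw)
    return re.sub(r"\n{3,}", "\n\n", no_citations).strip()

sample = (
    "Text summarization finds sentences;[1] video summarization "
    "extracts frames.[5]\n\n\nIn 2022 Google Docs released a feature.[6]\n\n"
)
cleaned = clean_text(sample)
print(cleaned)
```

Calling `clean_text(text)` before summarizing would be one way to keep the markers out of the output.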

### Summarize
- **word_count**: maximum number of words wanted in the summary; when it is given, it takes precedence and `ratio` is ignored
- **ratio**: fraction of the sentences in the original text to return as output
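
A simplified illustration of how the two parameters could interact is sketched below. The helper `select_sentences` is hypothetical and is not Gensim's implementation; it only mimics the documented behavior that `word_count` wins when both are supplied, and it assumes the sentences are already sorted from most to least important.

```python
def select_sentences(ranked, ratio=0.2, word_count=None):
    """Pick sentences from a list already sorted by importance (assumed)."""
    if word_count is None:
        # Keep a fixed fraction of the sentences, but at least one.
        keep = max(1, int(len(ranked) * ratio))
        return ranked[:keep]
    selected, total = [], 0
    for sentence in ranked:
        length = len(sentence.split())
        if total + length > word_count:
            break  # adding this sentence would exceed the word budget
        selected.append(sentence)
        total += length
    return selected

ranked = ["one two three", "four five", "six seven eight nine", "ten"]
by_ratio = select_sentences(ranked, ratio=0.5)
by_words = select_sentences(ranked, ratio=0.5, word_count=4)
print(by_ratio)
print(by_words)
```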

In [21]:
gensim_summary_text = summarize(text, word_count=200, ratio = 0.1)
gensim_summary_text

'Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.\nText summarization finds the most informative sentences in a document;[1] various methods of image summarization are the subject of ongoing research, with some looking to display the most representative images from a given collection or generating a video;[2][3][4] video summarization extracts the most important frames from the video content.[5]\nExamples of extracted content include key-phrases that can be used to "tag" or index a text document, or key sentences (including headings) that collectively comprise an abstract, and representative images or video segments, as stated above.\nSome techniques and algorithms which naturally model summarization problems are TextRank and PageRank, Submodular set function, Determinantal point process, maximal marginal relevance (MMR) etc.\nFor exa

## 3. Summa

### Install the library

In [22]:
#!pip install summa

### Import the library

In [23]:
from summa import summarizer
from summa import keywords

### Scrape the text
- Use BeautifulSoup to extract text (from Task1 of ML-1)

### Summarize

In [24]:
summa_summary_text = summarizer.summarize(text, ratio=0.1)
summa_summary_text

'Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.\nText summarization finds the most informative sentences in a document;[1] various methods of image summarization are the subject of ongoing research, with some looking to display the most representative images from a given collection or generating a video;[2][3][4] video summarization extracts the most important frames from the video content.[5]\nExamples of extracted content include key-phrases that can be used to "tag" or index a text document, or key sentences (including headings) that collectively comprise an abstract, and representative images or video segments, as stated above.\nFor text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and

## PART2: 
Take a Medium article, extract its text, summarize it with all of the methods above, and provide the best summary.
url = https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7

### Submit 2 files
- (notebook) .ipynb
- (summary) .txt

In [25]:
url='https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7'

In [26]:
parser = HtmlParser.from_url(url, Tokenizer("english"))

In [27]:
doc=parser.document
doc

<DOM with 46 paragraphs>

### TextRank

In [28]:
summarizer = TextRankSummarizer()
summary_text = summarizer(doc, 5)
summary_text

(<Sentence: Papa, What is a Neural Network?At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through.>,
 <Sentence: “Neural Network is a collection (a network) of neurons whose job is to learn a new thing or a new place or a new process or a new concept.”>,
 <Sentence: After telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.>,
 <Sentence: When you see a new object, your brain will ask the neurons, ‘Hey, anybody experienced this before?’ The neurons will say, ‘Yes, I have seen this.’ Certain other neurons will say, ‘No, I have not seen this.’ The neurons that have seen this before, will group together and form logical connections from the past and gives us an object from our memory.>,
 <Sentence: The same principle is applied for a so

In [29]:
tr_summary=""
for t in summary_text:
    tr_summary+=str(t)
#tr_summary=word_tokenize(tr_summary)

### LEX

In [30]:
lexSummarizer =  LexRankSummarizer()
lex_summary_text = lexSummarizer(doc, 5)
lex_summary_text

(<Sentence: After telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.>,
 <Sentence: Was it a dog or a lion?>,
 <Sentence: Do you know what is the difference between a lion and a dog?” She said, “Yes.” I said, “This is called Learning.>,
 <Sentence: Picture of my version of Neural Network with their Neuron friends“Your brain is here inside our head.>,
 <Sentence: Ultimately, the neurons in your brain tell that it is a lion and not a dog.>)

In [31]:
lex_summary=""
for t in lex_summary_text:
    lex_summary+=str(t)

### LUHN

In [32]:
luhnSummarizer = LuhnSummarizer()
luhn_summary_text = luhnSummarizer(doc, 5)
luhn_summary_text

(<Sentence: Papa, What is a Neural Network?At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through.>,
 <Sentence: How you learnt it is because of Neural Network inside your brain.” Now, a neural network is a collection of neurons that keeps switching on and off based on things you see, feel, hear and think just like switching on light bulb at our home.>,
 <Sentence: Every neuron is waiting for your eyes to see something new, for your nose to smell something new, for your ears to hear something new, for your tongue to taste something new.>,
 <Sentence: When something new is heard, or smelled, or seen, or tasted, the neurons will group together to send signals and forms connections with already seen, heard, tasted or smelled neurons.>,
 <Sentence: When you see a new object, your brain will ask the neurons, ‘Hey, anybody experience

In [33]:
luhn_summary=""
for t in luhn_summary_text:
    luhn_summary+=str(t)

### LSA

In [34]:
lsaSummarizer = LsaSummarizer()
lsa_summary_text = lsaSummarizer(doc, 5)
lsa_summary_text

(<Sentence: If you’ve noticed, this is how ML people make their machines learn through Reinforcement Learning.>,
 <Sentence: For example, when I showed you a lion picture, your brain asked the neurons who had seen it before.>,
 <Sentence: Every neuron will tune itself to pick up certain features like legs, tail, face, beard, and so on.>,
 <Sentence: And I hope she will not come to me running asking “Papa, what is Meural Metark?” again.>,
 <Sentence: And I have a strong feeling; she would ask me another stunning question sooner or later.>)

In [35]:
lsa_summary=""
for t in lsa_summary_text:
    lsa_summary+=str(t)
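
The concatenation loops above append each sentence directly to the previous one, so the joined text has no space at sentence boundaries. A small sketch of an alternative is shown here; the helper name `join_sentences` is hypothetical.

```python
def join_sentences(sentences):
    """Join sentence-like objects into one string separated by single spaces."""
    return " ".join(str(s).strip() for s in sentences)

joined = join_sentences(["First sentence.", "Second sentence."])
print(joined)
```

Because `str()` is applied to each item, the same helper should work on the tuples of `Sentence` objects that the sumy summarizers return.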

### Gensim

In [36]:
def get_page(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    return soup

In [37]:
def collect_text(soup):
    text = f'url: {url}\n\n'
    para_text = soup.find_all('p', class_='pw-post-body-paragraph jd je ig jf b jg jh ji jj jk jl jm jn jo jp jq jr js jt ju jv jw jx jy jz ka hz gh')
    #print(f"paragraphs text = \n {para_text}")
    for para in para_text:
        text += f"{para.text}\n\n"
    return text

In [38]:
text = collect_text(get_page(url))

In [39]:
gensim_summary_text = summarize(text, word_count=200, ratio = 0.1)
gensim_summary_text

'After all, neural network inside our brain helps us to learn new things in our life.\nWhat I was actually doing here was teaching her neural network (brain) the features of a lion like exactly how Machine Learning Engineers would train the machine to learn new features.\nAfter telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.\nA dog will have features like face, body, legs, and tail.\nA lion will have features like face, body, legs, tail and a beard.\nThe neurons grouped together with features like face, body, legs, tail and a beard forms a lion.\nOnce all the features are there, the neurons will send a signal that the picture you are looking at is a lion and not a dog.\nEvery neuron will tune itself to pick up certain features like legs, tail, face, beard, and so on.\nUltimately, the neurons in your brain tell that it is a lion and not a dog.\nAll neurons work together like your friend

### Summa

In [40]:
from summa import summarizer
from summa import keywords

In [41]:
summa_summary_text = summarizer.summarize(text, ratio=0.1)
summa_summary_text

'What I was actually doing here was teaching her neural network (brain) the features of a lion like exactly how Machine Learning Engineers would train the machine to learn new features.\nAfter telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.\nA dog will have features like face, body, legs, and tail.\nA lion will have features like face, body, legs, tail and a beard.\nHer neural network got aligned with classifying Dogs and Lions after some training.\nDo you know what is the difference between a lion and a dog?” She said, “Yes.” I said, “This is called Learning.\nHow you learnt it is because of Neural Network inside your brain.” Now, a neural network is a collection of neurons that keeps switching on and off based on things you see, feel, hear and think just like switching on light bulb at our home.\nFor example, when I showed you a lion picture, your brain asked the neurons who had seen

### Benchmarking Summarization Metrics

Ground-truth reference (human-made) summary:<br><br>
    A neural network is a collection of neurons that keeps switching on and off based on things you see, feel, hear and think just like switching on light bulb at our home. Our brain consists of billions of neurons. Every neuron will tune itself to pick up certain features like legs, tail, face, beard, and so on. You just learnt a new thing today. How you learnt it is because of Neural Network inside your brain. When a picture is shown to you, your neurons will group together and tries to signal what that object is by forming logical connections between the past and the present. Once the features are seen and a logical connection is established, neurons signal your brain that it is a lion. When something new is heard, or smelled, or seen, or tasted, the neurons will group together to send signals and forms connections with already seen, heard, tasted or smelled neurons. This is what is called as ‘Forming Logical Connections’ with the past. In a few milliseconds, your brain identifies whether the picture is a lion or a dog. This is what a neural network is and this is how it works in identifying things.




In [42]:
reference_summary="A neural network is a collection of neurons that keeps switching on and off based on things you see, feel, hear and think just like switching on light bulb at our home. Our brain consists of billions of neurons. Every neuron will tune itself to pick up certain features like legs, tail, face, beard, and so on. You just learnt a new thing today. How you learnt it is because of Neural Network inside your brain. When a picture is shown to you, your neurons will group together and tries to signal what that object is by forming logical connections between the past and the present. Once the features are seen and a logical connection is established, neurons signal your brain that it is a lion. When something new is heard, or smelled, or seen, or tasted, the neurons will group together to send signals and forms connections with already seen, heard, tasted or smelled neurons. This is what is called as ‘Forming Logical Connections’ with the past. In a few milliseconds, your brain identifies whether the picture is a lion or a dog. This is what a neural network is and this is how it works in identifying things."

In [43]:
reference_summary

'A neural network is a collection of neurons that keeps switching on and off based on things you see, feel, hear and think just like switching on light bulb at our home. Our brain consists of billions of neurons. Every neuron will tune itself to pick up certain features like legs, tail, face, beard, and so on. You just learnt a new thing today. How you learnt it is because of Neural Network inside your brain. When a picture is shown to you, your neurons will group together and tries to signal what that object is by forming logical connections between the past and the present. Once the features are seen and a logical connection is established, neurons signal your brain that it is a lion. When something new is heard, or smelled, or seen, or tasted, the neurons will group together to send signals and forms connections with already seen, heard, tasted or smelled neurons. This is what is called as ‘Forming Logical Connections’ with the past. In a few milliseconds, your brain identifies whet

In [44]:
#pip install rouge-score

In [45]:
import rouge_score
from rouge_score import rouge_scorer
import nltk

In [46]:
# In every function below, `text` is the reference summary and `summary`
# is the candidate summary being scored.

def rouge1_eval(text, summary):
    # Unigram overlap.
    scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
    score = scorer.score(text, str(summary))['rouge1'].fmeasure  # F1 score
    return score

def rouge2_eval(text, summary):
    # Bigram overlap.
    scorer = rouge_scorer.RougeScorer(['rouge2'], use_stemmer=True)
    score = scorer.score(text, str(summary))['rouge2'].fmeasure  # F1 score
    return score

def rouge3_eval(text, summary):
    # Trigram overlap.
    scorer = rouge_scorer.RougeScorer(['rouge3'], use_stemmer=True)
    score = scorer.score(text, str(summary))['rouge3'].fmeasure  # F1 score
    return score

def rougeL_eval(text, summary):
    # Longest common subsequence over the whole text.
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    score = scorer.score(text, str(summary))['rougeL'].fmeasure  # F1 score
    return score

def rougeLsum_eval(text, summary):
    # Longest common subsequence computed per newline-separated sentence.
    scorer = rouge_scorer.RougeScorer(['rougeLsum'], use_stemmer=True)
    score = scorer.score(text, str(summary))['rougeLsum'].fmeasure  # F1 score
    return score

# Note: sentence_bleu expects lists of tokens. The two functions below pass
# raw strings, so NLTK iterates over characters and the resulting scores are
# character n-gram scores, not word n-gram scores.

def bleubigram_eval(text, summary):
    text_tokens = sent_tokenize(text)  # list of reference sentences (strings)
    score = nltk.translate.bleu_score.sentence_bleu(text_tokens, summary, weights=(1/2, 1/2))  # up to bigrams
    return score

def bleutrigram_eval(text, summary):
    text_tokens = sent_tokenize(text)  # list of reference sentences (strings)
    score = nltk.translate.bleu_score.sentence_bleu(text_tokens, summary, weights=(1/3, 1/3, 1/3))  # up to trigrams
    return score
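
BLEU is normally computed over word tokens. The sketch below shows a simplified word-level BLEU with a single reference, written in plain Python so that each step is visible. It is an illustration under stated assumptions, not NLTK's implementation: tokenization is plain whitespace splitting, there is no smoothing, and the names `ngrams`, `modified_precision`, and `simple_bleu` are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Fraction of candidate n-grams found in the reference, with clipping."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def simple_bleu(reference_text, candidate_text, max_n=2):
    """Geometric mean of 1..max_n precisions times a brevity penalty."""
    reference = reference_text.lower().split()
    candidate = candidate_text.lower().split()
    precisions = [modified_precision(reference, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

score_same = simple_bleu("the neurons group together",
                         "the neurons group together")
score_diff = simple_bleu("the neurons group together",
                         "the neurons work together")
print(score_same, score_diff)
```

In the second call, three of four unigrams and one of three bigrams match, so the score is the geometric mean of 0.75 and 1/3.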

In [47]:
import warnings
warnings.filterwarnings("ignore")

summaries={'Textrank':tr_summary, 
           'LEX':lex_summary, 
           'LUHN':luhn_summary, 
           'LSA': lsa_summary, 
           'Gensim':gensim_summary_text, 
           'Summa': summa_summary_text}

rows = []

for name, summary in summaries.items():
    # Score this summary against the reference with every metric.
    scores = {'rouge1': rouge1_eval(reference_summary, summary),
              'rouge2': rouge2_eval(reference_summary, summary),
              'rouge3': rouge3_eval(reference_summary, summary),
              'rougeL_eval': rougeL_eval(reference_summary, summary),
              'rougeLsum_eval': rougeLsum_eval(reference_summary, summary),
              'bleubigram_eval': bleubigram_eval(reference_summary, summary),
              'bleutrigram_eval': bleutrigram_eval(reference_summary, summary),
             }
    for metric_name, score in scores.items():
        rows.append({'Summarizer': name,
                     'Metric': metric_name,
                     'Score': score})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows.
results = pd.DataFrame(rows)

results.sort_values(by='Score', ascending=False)

|  | Summarizer | Metric | Score |
|---|---|---|---|
| 14 | LUHN | rouge1 | 0.628297 |
| 35 | Summa | rouge1 | 0.553415 |
| 26 | LSA | bleubigram_eval | 0.523382 |
| 28 | Gensim | rouge1 | 0.521092 |
| 12 | LEX | bleubigram_eval | 0.503154 |
| 27 | LSA | bleutrigram_eval | 0.502671 |
| 13 | LEX | bleutrigram_eval | 0.484152 |
| 0 | Textrank | rouge1 | 0.484108 |
| 15 | LUHN | rouge2 | 0.443373 |
| 39 | Summa | rougeLsum_eval | 0.409807 |
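
Sorting by raw score mixes metrics that are not on the same scale, so it can help to look at the best summarizer per metric instead. The sketch below does this in plain Python using only the ten rows displayed above; the full results contain more rows, so the picks here are illustrative.

```python
# (summarizer, metric, score) rows copied from the output shown above.
rows = [
    ("LUHN", "rouge1", 0.628297),
    ("Summa", "rouge1", 0.553415),
    ("LSA", "bleubigram_eval", 0.523382),
    ("Gensim", "rouge1", 0.521092),
    ("LEX", "bleubigram_eval", 0.503154),
    ("LSA", "bleutrigram_eval", 0.502671),
    ("LEX", "bleutrigram_eval", 0.484152),
    ("Textrank", "rouge1", 0.484108),
    ("LUHN", "rouge2", 0.443373),
    ("Summa", "rougeLsum_eval", 0.409807),
]

best = {}
for summarizer_name, metric, score in rows:
    # Keep the highest-scoring summarizer seen so far for each metric.
    if metric not in best or score > best[metric][1]:
        best[metric] = (summarizer_name, score)

for metric, (summarizer_name, score) in best.items():
    print(f"{metric}: {summarizer_name} ({score:.3f})")
```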


The LUHN summary scored with ROUGE-1 has the highest score, 0.63, of all summarizer and metric combinations. ROUGE-1 compares unigrams: it measures the word-by-word overlap between the generated summary and the reference summary.
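
To make the unigram overlap concrete, here is a minimal sketch of a ROUGE-1 F1 computation. It is a simplification and will not reproduce the `rouge_score` numbers exactly: it assumes whitespace tokenization and no stemming, and the function name `rouge1_f1` is hypothetical.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """F1 of unigram overlap between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("neurons form logical connections",
                  "neurons form connections quickly")
print(score)
```

Three of the four words match in each direction, so precision, recall, and F1 are all 0.75.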

The LUHN summarizer ranks sentences by how important their words are in the document. It gives the most weight to words that occur most frequently in the text.
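
The frequency idea can be sketched as follows. This is a simplified, Luhn-style scorer and not sumy's `LuhnSummarizer`: it treats the whole sentence as one cluster, and the stop-word list, the `min_count` threshold, and the function name are assumptions made for illustration.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "the", "is", "of", "and", "to", "in", "it"}  # tiny illustrative list

def luhn_like_scores(sentences, min_count=2):
    """Score each sentence by (significant words)^2 / sentence length."""
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    counts = Counter(w for words in tokenized for w in words
                     if w not in STOP_WORDS)
    # A word is "significant" if it occurs at least min_count times overall.
    significant = {w for w, c in counts.items() if c >= min_count}
    scores = []
    for words in tokenized:
        hits = sum(1 for w in words if w in significant)
        scores.append(hits ** 2 / len(words) if words else 0.0)
    return scores

sentences = [
    "Neurons form connections.",
    "Neurons group together and form new connections.",
    "She drew a lion.",
]
scores = luhn_like_scores(sentences)
print(scores)
```

The first sentence consists entirely of frequent words and scores highest, while the last contains none and scores zero.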

In [50]:
luhn_summary

'Papa, What is a Neural Network?At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through.How you learnt it is because of Neural Network inside your brain.” Now, a neural network is a collection of neurons that keeps switching on and off based on things you see, feel, hear and think just like switching on light bulb at our home.Every neuron is waiting for your eyes to see something new, for your nose to smell something new, for your ears to hear something new, for your tongue to taste something new.When something new is heard, or smelled, or seen, or tasted, the neurons will group together to send signals and forms connections with already seen, heard, tasted or smelled neurons.When you see a new object, your brain will ask the neurons, ‘Hey, anybody experienced this before?’ The neurons will say, ‘Yes, I have seen this.’ Certain 

In [51]:
results.to_csv('summaryresults.csv', index=False)  # index=False avoids an extra "Unnamed: 0" column when the file is read back
