# Summarization
## This notebook outlines the concepts behind Text Summarization

## Summarization
- concept of capturing very important gist of a long piece of text

### Types of Summarization
- 1. **Extractive Summarization**
    - Select sentences from the corpus that best represent the text
    - Arrange them to form a summary
- 2. **Abstractive Summarization**
    - Captures the very important sentences from the text
    - Paraphrases them to form a summary

## Summarization Libraries
- Sumy
- Gensim
- Summa
- BERT **
    - BART **
    - PEGASUS **
    - T5 **

** Will be seen in DL-1


## 1. Sumy :
    1. Luhn – Heurestic method
    2. Latent Semantic Analysis
    4. LexRank – Unsupervised approach inspired by algorithms PageRank and HITS
    5. TextRank - Graph-based summarization technique with keyword extractions in from document

Documentation Reference [sumy](https://github.com/miso-belica/sumy)

## Task: Take a piece of text from wiki page and summarize them using Sumy
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install Sumy

### Import the libraries
- HtmlParser
- Tokenizer
- TextRankSummarizer

In [1]:
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

### Scrape the text

In [2]:
url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"
parser = HtmlParser.from_url(url, Tokenizer("english"))

In [3]:
doc = parser.document
doc

<DOM with 46 paragraphs>

### Summarize - TextRankSummarizer

In [56]:
summarizer = TextRankSummarizer()

In [57]:
summary_text = summarizer(doc,5)
summary_text

(<Sentence: Papa, What is a Neural Network?At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through.>,
 <Sentence: “Neural Network is a collection (a network) of neurons whose job is to learn a new thing or a new place or a new process or a new concept.”>,
 <Sentence: After telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.>,
 <Sentence: When you see a new object, your brain will ask the neurons, ‘Hey, anybody experienced this before?’ The neurons will say, ‘Yes, I have seen this.’ Certain other neurons will say, ‘No, I have not seen this.’ The neurons that have seen this before, will group together and form logical connections from the past and gives us an object from our memory.>,
 <Sentence: The same principle is applied for a so

In [58]:
string=''.join(map(str,summary_text))

In [59]:
reference=string
rouge.get_scores(reference,summary)

[{'rouge-1': {'r': 0.16195372750642673,
   'p': 0.9767441860465116,
   'f': 0.27783902732849614},
  'rouge-2': {'r': 0.09557235421166306,
   'p': 0.9567567567567568,
   'f': 0.17378497625725742},
  'rouge-l': {'r': 0.16195372750642673,
   'p': 0.9767441860465116,
   'f': 0.27783902732849614}}]

### Try different Summarizers
- LexRankSummarizer
- LuhnSummarizer
- LsaSummarizer

### Import the summarizers

In [23]:
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer

### Create Summarizers

In [24]:
lexSummarizer =  LexRankSummarizer()
luhnSummarizer = LuhnSummarizer()
lsaSummarizer = LsaSummarizer()

### LexRankSummarizer

In [25]:
lex_summary_text = lexSummarizer(doc, 10)
lex_summary_text

(<Sentence: I said, “Good Job!” and asked her, “Where’s the tail, baby?” She smiled and drew a tail.>,
 <Sentence: After telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.>,
 <Sentence: Now I took the pencil and drew a beard in the face.>,
 <Sentence: Was it a dog or a lion?>,
 <Sentence: Do you know what is the difference between a lion and a dog?” She said, “Yes.” I said, “This is called Learning.>,
 <Sentence: Picture of my version of Neural Network with their Neuron friends“Your brain is here inside our head.>,
 <Sentence: Ultimately, the neurons in your brain tell that it is a lion and not a dog.>,
 <Sentence: This is what a neural network is and this is how it works in identifying things.>,
 <Sentence: Is it features?>,
 <Sentence: Me: So, for a dog, the features are face, body, legs and tail.>)

In [61]:
lex_string=''.join(map(str,lex_summary_text))
reference=lex_string
rouge.get_scores(reference,summary)

[{'rouge-1': {'r': 0.11439588688946016,
   'p': 0.9468085106382979,
   'f': 0.2041284384434181},
  'rouge-2': {'r': 0.06749460043196544,
   'p': 0.8741258741258742,
   'f': 0.12531328187719423},
  'rouge-l': {'r': 0.11439588688946016,
   'p': 0.9468085106382979,
   'f': 0.2041284384434181}}]

### LuhnSummarizer

In [26]:
luhn_summary_text = luhnSummarizer(doc, 10)
luhn_summary_text

(<Sentence: “Papa, tell me what stuff means and something means.” Cannot help evade a cute curious face, I said, “I am working on Neural Network.” Before I finish the statement, “Papa, What is a Meural Metark?” I gave up my stubbornness of avoiding her.>,
 <Sentence: Papa, What is a Neural Network?At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through.>,
 <Sentence: What I was actually doing here was teaching her neural network (brain) the features of a lion like exactly how Machine Learning Engineers would train the machine to learn new features.>,
 <Sentence: After telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.>,
 <Sentence: How you learnt it is because of Neural Network inside your brain.” Now, a neural network is a collect

In [62]:
luhn_string=''.join(map(str,luhn_summary_text))
reference=luhn_string
rouge.get_scores(reference,summary)

[{'rouge-1': {'r': 0.26735218508997427,
   'p': 0.9904761904761905,
   'f': 0.4210526282314905},
  'rouge-2': {'r': 0.1771058315334773,
   'p': 0.9704142011834319,
   'f': 0.29954337638507955},
  'rouge-l': {'r': 0.26735218508997427,
   'p': 0.9904761904761905,
   'f': 0.4210526282314905}}]

### LsaSummarizer

In [27]:
lsa_summary_text = lsaSummarizer(doc, 10)
lsa_summary_text

(<Sentence: I just finished my huge customary Sunday lunch spread with family and resting along.>,
 <Sentence: As I was lazy and not in a position to move after the big spread, I evaded the chance to play with her by telling her “Papa’s got some work baby.>,
 <Sentence: What I was actually doing here was teaching her neural network (brain) the features of a lion like exactly how Machine Learning Engineers would train the machine to learn new features.>,
 <Sentence: I spend some time in collecting pictures of dogs and lions from Google images.>,
 <Sentence: If you’ve noticed, this is how ML people make their machines learn through Reinforcement Learning.>,
 <Sentence: For example, when I showed you a lion picture, your brain asked the neurons who had seen it before.>,
 <Sentence: Every neuron will tune itself to pick up certain features like legs, tail, face, beard, and so on.>,
 <Sentence: When I ask you to draw a dog, what are the features there?>,
 <Sentence: And I hope she will not 

In [63]:
lsa_string=''.join(map(str,lsa_summary_text))
reference=lsa_string
rouge.get_scores(reference,summary)

[{'rouge-1': {'r': 0.17480719794344474,
   'p': 0.9927007299270073,
   'f': 0.2972677570166682},
  'rouge-2': {'r': 0.09449244060475162,
   'p': 0.9562841530054644,
   'f': 0.17199017035338096},
  'rouge-l': {'r': 0.17480719794344474,
   'p': 0.9927007299270073,
   'f': 0.2972677570166682}}]

## 2. Gensim

## Task: Take a piece of text from wiki page and summarize them using Gensim
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [22]:
!pip install gensim==3.4.0

Collecting gensim==3.4.0
  Using cached gensim-3.4.0.tar.gz (22.2 MB)
Building wheels for collected packages: gensim
  Building wheel for gensim (setup.py) ... [?25ldone
[?25h  Created wheel for gensim: filename=gensim-3.4.0-cp38-cp38-macosx_10_9_x86_64.whl size=22589400 sha256=29b61057679d55dae325f2380988c2c802b16190891925e20ed2669712fdd75e
  Stored in directory: /Users/harshwardhanbabel/Library/Caches/pip/wheels/b4/a4/71/a301cdb2b7d5d31525936fcb8dcd9a5f144578d047407f7cf9
Successfully built gensim
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 4.0.1
    Uninstalling gensim-4.0.1:
      Successfully uninstalled gensim-4.0.1
Successfully installed gensim-3.4.0


### Import the library

In [28]:
from gensim.summarization import summarize

### Scrape the text
- Use beautifulSoup to extract text (from Task1 of ML-1)

In [29]:
import requests
import re
from bs4 import BeautifulSoup

In [30]:
def get_page(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    
    return soup

In [31]:
def collect_text(soup):
    text = f'url: {url}\n\n'
    para_text = soup.find_all('p')
    print(f"paragraphs text = \n {para_text}")
    for para in para_text:
        text += f"{para.text}\n\n"    
    return text

In [48]:
url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"

In [49]:
text = collect_text(get_page(url))
text

paragraphs text = 
 [<p class="bb b bc bd bz"><span><a class="be bh ca cb cc cd ce cf cg bm bj bk ch ci cj" href="/m/signin?operation=login&amp;redirect=https%3A%2F%2Fmedium.com%2F%40subashgandyer%2Fpapa-what-is-a-neural-network-c5e5cc427c7&amp;source=post_page-----c5e5cc427c7---------------------nav_reg--------------" rel="noopener follow">Sign in</a></span></p>, <p class="bb b bc bd be">Subash Gandyer</p>, <p class="bb b bc bd bz"><span class="hd"></span>Mar 15, 2018<span class="he">·</span>10 min read</p>, <p class="ho hp fy hq b hr hs ht hu hv hw hx hy hz ia ib ic id ie if ig ih dn gv" id="7022">It was a cozy Sunday afternoon in the month of February 2018. I just finished my huge customary Sunday lunch spread with family and resting along. Everyone in the family was taking a quick nap for a pre-planned evening outing. Well not everyone, actually.</p>, <p class="ho hp fy hq b hr iz hs ht hu ja hv hw hx jb hy hz ia jc ib ic id jd ie if ih dn gv" id="d273">My 4-year-old angel came run

'url: https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7\n\nSign in\n\nSubash Gandyer\n\nMar 15, 2018·10 min read\n\nIt was a cozy Sunday afternoon in the month of February 2018. I just finished my huge customary Sunday lunch spread with family and resting along. Everyone in the family was taking a quick nap for a pre-planned evening outing. Well not everyone, actually.\n\nMy 4-year-old angel came running to me, asked me to play with her for a while. As I was lazy and not in a position to move after the big spread, I evaded the chance to play with her by telling her “Papa’s got some work baby. Got to code some stuff.” I thought that would be the end of the conversation. No! It wasn’t. As my daughter was very inquisitive, she asked me “Papa, what stuff?” I said, “I need to code something for my work.” She didn’t leave. She again asked, “What is code something?” I wanted to end this conversation, as I was half past asleep. “Just some stuff baby. You wouldn’t unde

### Summarize
- **word_count**: maximum amount of words we want in the summary
- **ratio**: fraction of sentences in the original text should be returned as output

In [50]:
gensim_summary_text = summarize(text, word_count=200, ratio = 0.1)
gensim_summary_text

'After all, neural network inside our brain helps us to learn new things in our life.\nWhat I was actually doing here was teaching her neural network (brain) the features of a lion like exactly how Machine Learning Engineers would train the machine to learn new features.\nAfter telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.\nA dog will have features like face, body, legs, and tail.\nA lion will have features like face, body, legs, tail and a beard.\nThe neurons grouped together with features like face, body, legs, tail and a beard forms a lion.\nOnce all the features are there, the neurons will send a signal that the picture you are looking at is a lion and not a dog.\nEvery neuron will tune itself to pick up certain features like legs, tail, face, beard, and so on.\nUltimately, the neurons in your brain tell that it is a lion and not a dog.\nAll neurons work together like your friend

## 3. Summa

## Task: Take a piece of text from wiki page and summarize them using Gensim
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [31]:
!pip install summa

Collecting summa
  Using cached summa-1.2.0.tar.gz (54 kB)
Building wheels for collected packages: summa
  Building wheel for summa (setup.py) ... [?25ldone
[?25h  Created wheel for summa: filename=summa-1.2.0-py3-none-any.whl size=54411 sha256=cc4a781f8855bd8674349af9d18a45dd2959a3c02cf29820808db5304ce6ebb4
  Stored in directory: /Users/harshwardhanbabel/Library/Caches/pip/wheels/fd/6a/dd/209eb19d5f2266b9cfd06827539bf70435b0ad5fe8244e52d3
Successfully built summa
Installing collected packages: summa
Successfully installed summa-1.2.0


### Import the library

In [43]:
from summa import summarizer
from summa import keywords

### Scrape the text
- Use beautifulSoup to extract text (from Task1 of ML-1)

### Summarize

In [51]:
summa_summary_text = summarizer.summarize(text, ratio=0.1)
summa_summary_text

'What I was actually doing here was teaching her neural network (brain) the features of a lion like exactly how Machine Learning Engineers would train the machine to learn new features.\nAfter telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.\nA dog will have features like face, body, legs, and tail.\nA lion will have features like face, body, legs, tail and a beard.\nHer neural network got aligned with classifying Dogs and Lions after some training.\nDo you know what is the difference between a lion and a dog?” She said, “Yes.” I said, “This is called Learning.\nHow you learnt it is because of Neural Network inside your brain.” Now, a neural network is a collection of neurons that keeps switching on and off based on things you see, feel, hear and think just like switching on light bulb at our home.\nFor example, when I showed you a lion picture, your brain asked the neurons who had seen

In [53]:
summary=text
reference=summa_summary_text
rouge = Rouge()

In [54]:
rouge.get_scores(reference,summary)

[{'rouge-1': {'r': 0.21979434447300772, 'p': 1.0, 'f': 0.3603793437262895},
  'rouge-2': {'r': 0.15334773218142547,
   'p': 0.9435215946843853,
   'f': 0.2638179260667096},
  'rouge-l': {'r': 0.21979434447300772, 'p': 1.0, 'f': 0.3603793437262895}}]

In [55]:
reference=gensim_summary_text
rouge.get_scores(reference,summary)

[{'rouge-1': {'r': 0.13753213367609254, 'p': 1.0, 'f': 0.24180790747879602},
  'rouge-2': {'r': 0.08585313174946005,
   'p': 0.9520958083832335,
   'f': 0.15750371319280115},
  'rouge-l': {'r': 0.13753213367609254, 'p': 1.0, 'f': 0.24180790747879602}}]