# Summarization
## This notebook outlines the concepts behind Text Summarization

## Summarization
- The task of capturing the essential gist of a long piece of text in a much shorter form

### Types of Summarization
- 1. **Extractive Summarization**
    - Selects the sentences from the corpus that best represent the text
    - Arranges them, unchanged, to form a summary
- 2. **Abstractive Summarization**
    - Identifies the most important ideas in the text
    - Paraphrases them in newly generated sentences to form a summary
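
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top few in their original order. The function name `extractive_summary`, the regex-based sentence splitter, and the scoring rule are all illustrative assumptions; none of the libraries below is claimed to work exactly this way.

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Pick the n highest-scoring sentences and return them in document order."""
    # Naive split on ., ! or ? followed by whitespace (an assumption; a real
    # tokenizer such as NLTK's punkt handles abbreviations far better).
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    # Word frequencies over the whole text, lowercased. No stop-word removal here.
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        words = re.findall(r'[a-z]+', sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, keep the top n, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return [sentences[i] for i in chosen]

sample = ("Cats sleep a lot. Cats chase mice and cats climb trees. "
          "The weather was mild. Dogs bark.")
print(extractive_summary(sample, n_sentences=2))
```

An abstractive summarizer would instead generate new sentences, which is why it needs a language model rather than a scoring rule like this one.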

## Summarization Libraries
- Sumy
- Gensim
- Summa
- BERT **
    - BART **
    - PEGASUS **
    - T5 **

** Will be seen in DL-1


## 1. Sumy
    1. Luhn – Heuristic method
    2. Latent Semantic Analysis (LSA)
    3. LexRank – Unsupervised approach inspired by the PageRank and HITS algorithms
    4. TextRank – Graph-based summarization technique that also extracts keywords from the document

Documentation Reference [sumy](https://github.com/miso-belica/sumy)
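
The graph-based idea behind TextRank can be sketched in a few lines of plain Python. This is a simplified illustration, not Sumy's implementation: the function name `textrank_sentences` is hypothetical, sentences are treated as sets of lowercase words, and the similarity is word overlap divided by the sum of the log sentence lengths, followed by a PageRank-style iteration with a damping factor.

```python
import math
import re

def textrank_sentences(sentences, damping=0.85, iterations=50):
    """Return one PageRank-style score per sentence (higher = more central)."""
    # Each sentence becomes a set of lowercase words (a simplification).
    tokens = [set(re.findall(r'[a-z]+', s.lower())) for s in sentences]
    n = len(sentences)

    def similarity(a, b):
        # Word overlap normalised by the log of both sentence lengths.
        # Very short sentences are skipped to avoid a zero denominator.
        if len(a) < 2 or len(b) < 2:
            return 0.0
        return len(a & b) / (math.log(len(a)) + math.log(len(b)))

    # Weighted, undirected graph: weights[i][j] is the edge between i and j.
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]

    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence collects score from its neighbours, in proportion
        # to the edge weight relative to the neighbour's total weight.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores

docs = ["The cat sat on the mat.",
        "The cat ate the fish.",
        "Stocks fell sharply today."]
scores = textrank_sentences(docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

The third sentence shares no words with the others, so it has no edges and ends up with the lowest score. LexRank follows the same graph-ranking pattern but, as commonly described, uses cosine similarity over TF-IDF vectors.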

## Task: Take a piece of text from a Wikipedia page and summarize it using Sumy
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install Sumy

In [1]:
! pip install sumy

Collecting sumy
  Downloading sumy-0.11.0-py2.py3-none-any.whl.metadata (7.5 kB)
Collecting docopt<0.7,>=0.6.1 (from sumy)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting breadability>=0.1.20 (from sumy)
  Downloading breadability-0.1.20.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Collecting pycountry>=18.2.23 (from sumy)
  Downloading pycountry-24.6.1-py3-none-any.whl.metadata (12 kB)
Downloading sumy-0.11.0-py2.py3-none-any.whl (97 kB)
Downloading pycountry-24.6.1-py3-none-any.whl (6.3 MB)
Building wheels for collected packages: breadability, docopt
  Building wheel for breadability (setup.py) ... done
  Created wheel for breadability: filename=brea

### Import the libraries
- HtmlParser
- Tokenizer
- TextRankSummarizer

In [5]:
!pip install nltk
import nltk
nltk.download('punkt')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer


### Scrape the text

In [3]:
url = "https://en.wikipedia.org/wiki/Automatic_summarization"


In [6]:
parser = HtmlParser.from_url(url, Tokenizer("english"))


### Summarize - TextRankSummarizer

In [7]:
summarizer = TextRankSummarizer()

#Summarize the document with 2 sentences
summary = summarizer(parser.document, 2)

for sentence in summary:
  print(sentence)


For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.
A Class of Submodular Functions for Document Summarization", The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), 2011^ Sebastian Tschiatschek, Rishabh Iyer, Hoachen Wei and Jeff Bilmes, Learning Mixtures of Submodular Functions for Image Collection Summarization, In Advances of Neural Information Processing Systems (NIPS), Montreal, Canada, December - 2014.^ Ramakrishna Bairi, Rishabh Iyer, Ganesh Ramakrishnan and Jeff Bilmes, Summarizing Multi-Document Topic Hierarchies using Submodular Mixtures, To Appear In the Annual Meeting of the Association for Computational Linguistics (ACL), Beijing, China, July - 2015^ Kai Wei, Rishabh I

(<Sentence: For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.>,
 <Sentence: Instead of trying to learn explicit features that characterize keyphrases, the TextRank algorithm [8] exploits the structure of the text itself to determine keyphrases that appear "central" to the text in the same way that PageRank selects important Web pages.>,
 <Sentence: Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).>,
 <Sentence: While the goal of a brief summary is to simplify information search and cut the

### Try different Summarizers
- LexRankSummarizer
- LuhnSummarizer
- LsaSummarizer

### Import the summarizers

In [8]:
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer


### Create Summarizers

In [9]:
lex_summarizer = LexRankSummarizer()
luhn_summarizer = LuhnSummarizer()
lsa_summarizer = LsaSummarizer()


### LexRankSummarizer

In [10]:
lex_summary = lex_summarizer(parser.document, 2)

for sentence in lex_summary:
  print(sentence)


The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary".
Automatic Text Summarization.


### LuhnSummarizer

In [11]:

luhn_summary = luhn_summarizer(parser.document, 2)

for sentence in luhn_summary:
  print(sentence)


Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).
A Class of Submodular Functions for Document Summarization", The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), 2011^ Sebastian Tschiatschek, Rishabh Iyer, Hoachen Wei and Jeff Bilmes, Learning Mixtures of Submodular Functions for Image Collection Summarization, In Advances of Neural Information Processing Systems (NIPS), Montreal, Canada, December - 2014.^ Ramakrishna Bairi, Rishabh Iyer, Ganesh Ramakrishnan and Jeff Bilmes, Summarizing Multi-Document Topic Hierarchies using Submodular Mixtures, To Appear In the Annual Meeting of the Association for Computational Linguistics (ACL), Beijing, China, July - 2015^ Kai Wei, Risha

### LsaSummarizer

In [12]:
lsa_summary = lsa_summarizer(parser.document, 2)

for sentence in lsa_summary:
  print(sentence)


Hulth uses a reduced set of features, which were found most successful in the KEA (Keyphrase Extraction Algorithm) work derived from Turney's seminal paper.
Although they did not replace other approaches and are often combined with them, by 2019 machine learning methods dominated the extractive summarization of single documents, which was considered to be nearing maturity.


## 2. Gensim

## Task: Take a piece of text from a Wikipedia page and summarize it using Gensim
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [38]:
!pip install gensim==3.8.3

Collecting gensim==3.8.3
  Downloading gensim-3.8.3.tar.gz (23.4 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: gensim
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for gensim (setup.py) ... error
  ERROR: Failed building wheel for gensim
  Running setup.py clean for gensim
Failed to build gensim
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (gensim)

### Import the library

In [15]:
import os
import sys
import requests
import re
# Code here - Import BeautifulSoup library
from bs4 import BeautifulSoup
import gensim



### Scrape the text
- Use BeautifulSoup to extract text (from Task 1 of ML-1)

In [16]:
url = "https://en.wikipedia.org/wiki/Automatic_summarization"

In [17]:
def get_page():
  global url

  # URL input code
  url = "https://en.wikipedia.org/wiki/Automatic_summarization"

  # handling possible error
  if not re.match(r'https?://en\.wikipedia\.org/', url):
    print('Please enter a valid website, or make sure it is a Wikipedia article')
    sys.exit(1)

  # Call get method in requests object, pass url and collect it in res
  # Mimic the request as a browser as direct webscraping request is not allowed
  headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
  res = requests.get(url, headers=headers)
  res.raise_for_status()
  soup = BeautifulSoup(res.text, 'html.parser')
  return soup

In [18]:
def collect_text(soup):
	text = f'url: {url}\n\n'
	para_text = soup.find_all('p')
	print(f"paragraphs text = \n {para_text}")
	for para in para_text:
		text += f"{para.text}\n\n"
	return text

In [20]:
text = collect_text(get_page())
text

paragraphs text = 
 [<p><b>Automatic summarization</b> is the process of shortening a set of data computationally, to create a subset (a <a href="/wiki/Abstract_(summary)" title="Abstract (summary)">summary</a>) that represents the most important or relevant information within the original content. <a href="/wiki/Artificial_intelligence" title="Artificial intelligence">Artificial intelligence</a> <a href="/wiki/Algorithm" title="Algorithm">algorithms</a> are commonly developed and employed to achieve this, specialized for different types of data.
</p>, <p><a href="/wiki/Plain_text" title="Plain text">Text</a> summarization is usually implemented by <a href="/wiki/Natural_language_processing" title="Natural language processing">natural language processing</a> methods, designed to locate the most informative sentences in a given document.<sup class="reference" id="cite_ref-Torres2014_1-0"><a href="#cite_note-Torres2014-1">[1]</a></sup> On the other hand, visual content can be summarized 

'url: https://en.wikipedia.org/wiki/Automatic_summarization\n\nAutomatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.\n\n\nText summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms. Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the 

### Summarize
- **word_count**: maximum number of words allowed in the summary
- **ratio**: fraction of the original text's sentences to return in the summary (a number between 0 and 1)
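
The difference between the two budgets can be sketched with a small helper. The name `apply_budget` and its cut-off rule are hypothetical and only mimic the idea; this is not Gensim's own selection logic. It assumes the sentences are already sorted best-first, and that `word_count` takes precedence when both are given.

```python
def apply_budget(ranked_sentences, ratio=0.2, word_count=None):
    """Trim a best-first list of sentences to a length budget."""
    if word_count is not None:
        # Word budget: keep adding sentences until the next one would
        # push the total past the limit.
        kept, used = [], 0
        for sentence in ranked_sentences:
            n_words = len(sentence.split())
            if used + n_words > word_count:
                break
            kept.append(sentence)
            used += n_words
        return kept
    # Ratio budget: keep that fraction of the sentences, rounded down.
    keep = int(len(ranked_sentences) * ratio)
    return ranked_sentences[:keep]

ranked = ["a b c", "d e", "f g h i", "j"]
print(apply_budget(ranked, ratio=0.5))      # keeps half of the sentences
print(apply_budget(ranked, word_count=4))   # keeps sentences within 4 words
```

A ratio scales with the length of the input, while a word count gives a summary of roughly fixed size regardless of how long the source is.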

In [39]:
# Summarize
from gensim.summarization import summarize

# By word count
print(summarize(text, word_count=100))

# By ratio
print(summarize(text, ratio=0.1))


ModuleNotFoundError: No module named 'gensim.summarization'
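
The install log above shows that the `gensim==3.8.3` wheel failed to build, so the import ran against whatever Gensim version was already present. The `gensim.summarization` module was removed in Gensim 4.0, which would explain this error. A minimal sketch for checking availability before importing, using only the standard library (the helper name `has_module` is an assumption for illustration):

```python
import importlib.util

def has_module(dotted_name):
    """Return True if `dotted_name` can be located in this environment."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        # Raised when a parent package (for example `gensim` itself)
        # is missing or cannot be imported.
        return False

if has_module("gensim.summarization"):
    print("gensim.summarization is available (Gensim 3.x)")
else:
    print("gensim.summarization is missing; use Sumy or Summa instead")
```

With a check like this the notebook can skip the Gensim cell cleanly instead of stopping with a traceback.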

## 3. Summa

## Task: Take a piece of text from a web page (here, a Medium article) and summarize it using Summa
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [24]:
!pip install summa

Collecting summa
  Downloading summa-1.2.0.tar.gz (54 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: summa
  Building wheel for summa (setup.py) ... done
  Created wheel for summa: filename=summa-1.2.0-py3-none-any.whl size=54388 sha256=f4eb01089c48f06923538f99de65b1fa040a075c146cb6df01b2153ebea680be
  Stored in directory: /root/.cache/pip/wheels/4a/ca/c5/4958614cfba88ed6ceb7cb5a849f9f89f9ac49971616bc919f
Successfully built summa
Installing collected packages: summa
Successfully installed summa-1.2.0


### Import the library

In [25]:
import os
import sys
import requests
import re
# Code here - Import BeautifulSoup library
from bs4 import BeautifulSoup

In [None]:
from summa import summarizer


### Scrape the text
- Use BeautifulSoup to extract text (from Task 1 of ML-1)

In [26]:
# function to get the html source text of the medium article
def get_page():
  global url

  # URL input code
  url = input("Enter the URL of the required medium article: ")

  # handling possible error
  if not re.match(r'https?://medium\.com/', url):
    print('Please enter a valid website, or make sure it is a medium article')
    sys.exit(1)


  # Code here - Call get method in requests object, pass url and collect it in res
  # Mimic the request as a browser as direct webscraping request is not allowed
  headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
  res = requests.get(url, headers=headers)
  res.raise_for_status()
  soup = BeautifulSoup(res.text, 'html.parser')
  return soup


In [27]:
# function to remove all the html tags and replace some with specific strings
def clean(text):
    rep = {"<br>": "\n", "<br/>": "\n", "<li>":  "\n"}
    rep = dict((re.escape(k), v) for k, v in rep.items())
    pattern = re.compile("|".join(rep.keys()))
    text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text)
    # Raw string avoids the invalid escape sequences '\<' and '\>'
    text = re.sub(r'<(.*?)>', '', text)
    return text

In [28]:
def collect_text(soup):
	text = f'url: {url}\n\n'
	para_text = soup.find_all('p')
	print(f"paragraphs text = \n {para_text}")
	for para in para_text:
		text += f"{para.text}\n\n"
	return text


In [31]:
def save_file(text):
	if not os.path.exists('./scraped_articles'):
		os.mkdir('./scraped_articles')
	name = url.split("/")[-1]
	print(name)
	fname = f'scraped_articles/{name}.txt'

	# Code here - write a file using with (2 lines)
	with open(fname, 'w') as f:
		f.write(text)
	print(f'File saved in directory {fname}')


In [32]:
if __name__ == '__main__':
	text = collect_text(get_page())
	save_file(text)
#https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7

Enter the URL of the required medium article: https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7
paragraphs text = 
 [<p class="be b dw dx dy dz ea eb ec ed ee ef dt"><span><button class="be b dw dx eg dy dz eh ea eb ei ej ed ek el ef em eo ep eq er es et eu ev ew ex ey ez fa fb fc bl fd fe" data-testid="headerSignUpButton">Sign up</button></span></p>, <p class="be b dw dx dy dz ea eb ec ed ee ef dt"><span><a class="af ag ah ai aj ak al am an ao ap aq ar as at" data-testid="headerSignInButton" href="/m/signin?operation=login&amp;redirect=https%3A%2F%2Fmedium.com%2F%40subashgandyer%2Fpapa-what-is-a-neural-network-c5e5cc427c7&amp;source=post_page---two_column_layout_nav-----------------------global_nav-----------" rel="noopener follow">Sign in</a></span></p>, <p class="be b dw dx dy dz ea eb ec ed ee ef dt"><span><button class="be b dw dx eg dy dz eh ea eb ei ej ed ek el ef em eo ep eq er es et eu ev ew ex ey ez fa fb fc bl fd fe" data-testid="headerSignUpButton"

### Summarize

In [33]:
from summa import summarizer
with open('./scraped_articles/papa-what-is-a-neural-network-c5e5cc427c7.txt') as f:
  text = f.read()
  summary = summarizer.summarize(text, ratio=0.2)
  with open('./scraped_articles/papa-what-is-a-neural-network-c5e5cc427c7_summary.txt', 'w') as s:
    s.write(summary)


## ASSIGNMENT: Take the same Medium article (the one I wrote) that we used for Task 1 of ML-1, extract the text, summarize it using all the above methods, and provide the best summary with a note explaining why the chosen library is the best
url = https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7

### Submit 2 files
- (notebook) .ipynb
- (summary) .txt

In [41]:
from bs4 import BeautifulSoup
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

In [45]:


# Summarize using Sumy
parser = HtmlParser.from_url(url, Tokenizer("english"))

summarizer_text_rank = TextRankSummarizer()  # Create an instance of TextRankSummarizer
text_rank_summary = summarizer_text_rank(parser.document, 2) # call the summarizer object like a function
print("TextRank Summary:")
for sentence in text_rank_summary:
  print(sentence)


# LexRankSummarizer
lex_summary = lex_summarizer(parser.document, 2)
print("\nLexRank Summary:")
for sentence in lex_summary:
  print(sentence)

# LuhnSummarizer
luhn_summary = luhn_summarizer(parser.document, 2)
print("\nLuhn Summary:")
for sentence in luhn_summary:
  print(sentence)

# LsaSummarizer
lsa_summary = lsa_summarizer(parser.document, 2)
print("\nLSA Summary:")
for sentence in lsa_summary:
  print(sentence)


# Summarize using Summa
from summa import summarizer  # make sure the summa summarizer module is in scope
print("\nSumma Summary:")
summary = summarizer.summarize(text, ratio=0.2) # use the summa summarize function
print(summary)

# Save the best summary to a file
with open('./scraped_articles/best_summary.txt', 'w') as s:
    s.write(summary)
    s.write("\n\nReason: I chose Summa as the best summarizer for this article because it provided the most concise and informative summary, capturing the key points of the article without losing important details. While the other summarizers also produced decent summaries, Summa's output was the most balanced in terms of length and information density.")


TextRank Summary:
“Neural Network is a collection (a network) of neurons whose job is to learn a new thing or a new place or a new process or a new concept.”
The same principle is applied for a song that you hear, a cartoon that you watch, a rhyme that you sing, an animal that you draw, a food that you taste, a flower that you smell and so on.

LexRank Summary:
Was it a dog or a lion?
Picture of my version of Neural Network with their Neuron friends“Your brain is here inside our head.

Luhn Summary:
Papa, What is a Neural Network?At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through.
When you see a new object, your brain will ask the neurons, ‘Hey, anybody experienced this before?’ The neurons will say, ‘Yes, I have seen this.’ Certain other neurons will say, ‘No, I have not seen this.’ The neurons that have seen this before, 