# Summarization
## This notebook outlines the concepts behind Text Summarization

## Summarization
- concept of capturing very important gist of a long piece of text

### Types of Summarization
- 1. **Extractive Summarization**
    - Select sentences from the corpus that best represent the text
    - Arrange them to form a summary
- 2. **Abstractive Summarization**
    - Captures the very important sentences from the text
    - Paraphrases them to form a summary

## Summarization Libraries
- Sumy



## 1. Sumy :
    1. Luhn – Heurestic method
    2. Latent Semantic Analysis
    4. LexRank – Unsupervised approach inspired by algorithms PageRank and HITS
    5. TextRank - Graph-based summarization technique with keyword extractions in from document

Documentation Reference [sumy](https://github.com/miso-belica/sumy)

## Task: Take a piece of text from wiki page and summarize them using Sumy
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install Sumy

In [1]:
!pip install sumy




[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


### Importing the libraries
- HtmlParser
- Tokenizer
- TextRankSummarizer

In [2]:
!pip install beautifulsoup4




[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip





### Scrape the text

In [3]:
!pip install requests
import requests


[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip




In [4]:
url = "https://en.wikipedia.org/wiki/Automatic_summarization"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/117.0 Safari/537.36"
}

from bs4 import BeautifulSoup
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
mytext = soup.find_all('p')
document=[]
for text in mytext:
    docc=text.get_text()
    document.append(docc)
    documents='\n'.join(document)

In [5]:
documents

'Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence (AI) algorithms are commonly developed and employed to achieve this, specialized for different types of data.\n\nText summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms. Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most importa

In [6]:
type(documents)

str

In [7]:
import nltk

# Download the punkt tokenizer
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\JOY\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Summarize - TextRankSummarizer

In [8]:
import sumy
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

parser=PlaintextParser.from_string(document, Tokenizer("english"))
summarizer = LexRankSummarizer()
summary= summarizer(parser.document,5)
summary

(<Sentence: This problem is called multi-document summarization.>,
 <Sentence: They can enable document browsing by providing a short summary, improve information retrieval (if documents have keyphrases assigned, a user could search by keyphrase to produce more reliable hits than a full-text search), and be employed in generating index entries for a large text corpus.\n', 'Depending on the different literature and the definition of key terms, words or phrases, keyword extraction is a highly related theme.\n', 'Beginning with the work of Turney,[15] many researchers have approached keyphrase extraction as a supervised machine learning problem.\nGiven a document, we construct an example for each unigram, bigram, and trigram found in the text (though other text units are also possible, as discussed below).>,
 <Sentence: The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original tra

### Trying different Summarizers
- LexRankSummarizer
- LuhnSummarizer
- LsaSummarizer

### Import the summarizers

In [9]:
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer

### Create Summarizers

In [10]:
parser=PlaintextParser.from_string(document, Tokenizer("english"))
parser

<sumy.parsers.plaintext.PlaintextParser at 0x196f14135e0>

### LexRankSummarizer

In [11]:
summarizer = LexRankSummarizer()
summary= summarizer(parser.document,15)
for sentence in summary:
    print(sentence)

[9]\n', 'There are two general approaches to automatic summarization: extraction and abstraction.\n', 'Here, content is extracted from the original data, but the extracted content is not modified in any way.
Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.\n', 'An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.
This problem is called multi-document summarization.
Some techniques and algorithms which naturally model summarization problems are TextRank and PageRank, Submodular set function, Determinantal point process, maximal marginal relevance (MMR) etc.\n', 'The task is the following.
They can enable document browsing by providing a short summary, improve information retrieval (if documents have keyphrases assigned, a user could search by keyphrase to produce more reliable hits than a full-text sea

### LuhnSummarizer

In [12]:
summarizer = LuhnSummarizer()
summary1= summarizer(parser.document,10)
for sentence in summary1:
    print(sentence)

Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.\n', 'An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.
Similarly, in surveillance videos, one would want to extract important and suspicious activity, while ignoring all the boring and redundant frames captured.\n', 'At a very high level, summarization algorithms try to find subsets of objects (like set of sentences, or a set of images), which cover information of the entire set.
They can enable document browsing by providing a short summary, improve information retrieval (if documents have keyphrases assigned, a user could search by keyphrase to produce more reliable hits than a full-text search), and be employed in generating index entries for a large text corpus.\n', 'Depending on the different literature and the definition of key terms, words or 