**Recipe 5-4. Summarizing Text Data**
Articles and books are everywhere. Suppose you want to learn a concept
in NLP, you search for it on Google, and you find a good article. You like
the content, but it is too long to read a second time. You would rather
summarize the article and save the summary somewhere so that you can read
it later.
NLP has a solution for that: text summarization. With a summary at hand,
you don’t have to reread the full article or book every time.

**Problem**

Summarizing an article or document using different algorithms in
Python.

**Solution**

Text summarization is the process of condensing large documents into shorter
ones without losing the context, which saves readers time. This
can be done using different techniques, such as the following:
• TextRank: A graph-based ranking algorithm
• Feature-based text summarization
• LexRank: TF-IDF with a graph-based algorithm
• Topic-based summarization
• Using sentence embeddings
• Encoder-Decoder Model: Deep learning techniques

**Method 4-1 TextRank**

*TextRank is a graph-based ranking algorithm for NLP. It is inspired by
PageRank, which is used in the Google search engine, but it is designed
specifically for text. It extracts the topics, creates nodes out of
them, and captures the relations between nodes to summarize the text.
Let’s see how to do it using the Python package Gensim. The function
used is `summarize`.*
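
To make the graph idea concrete, here is a minimal standard-library sketch of TextRank-style sentence ranking. It is a simplified illustration, not Gensim's implementation: the names `split_sentences`, `similarity`, and `textrank_summary` are hypothetical, sentences are split naively on punctuation, and similarity is word overlap over unique words divided by the sum of the log sentence lengths.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(words_a, words_b):
    """Word overlap normalized by the log of each sentence's unique word count."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # log(1) == 0 would make the denominator vanish
    overlap = len(words_a & words_b)
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_summary(text, top_n=2, damping=0.85, iterations=50):
    """Return the top_n highest-ranked sentences in their original order."""
    sentences = split_sentences(text)
    bags = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    if n == 0:
        return []

    # Each sentence is a node; edge weights are pairwise similarities.
    weights = [
        [similarity(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]

    # PageRank-style power iteration over the weighted graph.
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores

    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]


doc = (
    "Cats chase mice in the garden. Dogs chase cats in the garden. "
    "Mice hide from cats and dogs. The stock market fell sharply today."
)
for sentence in textrank_summary(doc, top_n=2):
    print(sentence)
```

The sentence about the stock market shares almost no words with the others, so it receives the lowest rank and is left out of the summary.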

**Import the BeautifulSoup and urllib libraries to fetch data from Wikipedia.**

In [2]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [7]:
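def get_only_text(url):
    """Return the page title and the text of all <p> elements at url."""
    page = urlopen(url)
    # Name the parser explicitly to avoid BeautifulSoup's "no parser" warning
    soup = BeautifulSoup(page, "html.parser")
    # Join the text of every paragraph into one string
    text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
    print(text)
    return soup.title.text, text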

**Specify the Wikipedia URL**

In [5]:
url="https://en.wikipedia.org/wiki/Natural_language_processing"

In [8]:
# Call the function created above
text = get_only_text(url)

Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.  The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
 Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.
 Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, though at the time th

In [12]:
# Count the characters in the title and body, joined by a space
len(" ".join(text))

9773

In [14]:
# Preview the result. Note that text is a (title, body) tuple, so this
# slice returns the whole tuple; use text[1][:1000] for the body alone.
text[:1000]

('Natural language processing - Wikipedia',
 'Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.  The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.\n Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.\n Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a 
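
The output above is a tuple rather than a 1000-character string because `get_only_text` returns `(title, text)`, and slicing a two-element tuple with `[:1000]` simply returns both elements. A minimal sketch of the difference, using stand-in values (the names `result`, `whole`, `body`, and `preview` are illustrative only):

```python
# Stand-in values; the real tuple comes from get_only_text(url).
result = ("Page title", "Body text " * 500)

whole = result[:1000]   # slicing a 2-tuple keeps both elements
title, body = result    # unpack the tuple into two strings
preview = body[:1000]   # first 1000 characters of the body

print(type(whole).__name__, len(whole))  # tuple 2
print(len(preview))                      # 1000
```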

The `gensim.summarization` module was removed in Gensim 4.0. To keep using it, install gensim==3.8.3 or an older version (for example, by pinning that version in your requirements.txt file).
See [this Stack Overflow question](https://stackoverflow.com/questions/68018745/not-able-to-import-from-gensim-summarization-module-in-django) for details.

In [None]:
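# Import summarize and keywords from gensim
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords
# get_only_text returned a (title, body) tuple; keep only the body string.
# (str(text) would pass the tuple's repr, with quotes and escaped newlines.)
text = text[1] if isinstance(text, tuple) else text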

In [None]:
# Summarize the text with ratio=0.1 (keep roughly 10% of the sentences)
summarize(text, ratio=0.1)

In [None]:
# Extract keywords from the text
print(keywords(text, ratio=0.1))

**Method 4-2 Feature-based text summarization**


Feature-based text summarization methods extract features from each
sentence and use them to rank the sentence by importance. Position, length,
term frequency, named entities, and many other features are used to
calculate the score.
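
As a rough illustration of this scoring, the sketch below combines three of the features named above (position, length, and term frequency) with equal weights. It is an assumption-laden toy, not the method used by any particular library: the function names, the `ideal_length` parameter, and the equal weighting are all hypothetical, and named entities are left out.

```python
import re
from collections import Counter


def score_sentences(text, ideal_length=20):
    """Return (index, score, sentence) triples, one per sentence in text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    term_freq = Counter(word for words in tokenized for word in words)
    total_terms = sum(term_freq.values())

    scored = []
    for index, (sentence, words) in enumerate(zip(sentences, tokenized)):
        if not words:
            continue
        # Position: earlier sentences score higher.
        position = 1.0 / (index + 1)
        # Length: sentences close to ideal_length words score higher.
        length = 1.0 - min(abs(len(words) - ideal_length) / ideal_length, 1.0)
        # Term frequency: average relative frequency of the sentence's words.
        frequency = sum(term_freq[w] for w in words) / (len(words) * total_terms)
        scored.append((index, position + length + frequency, sentence))
    return scored


def feature_summary(text, top_n=2):
    """Pick the top_n highest-scoring sentences, kept in original order."""
    best = sorted(score_sentences(text), key=lambda item: item[1], reverse=True)
    return [sentence for _, _, sentence in sorted(best[:top_n])]


sample = (
    "NLP studies language. "
    "NLP models process language data with NLP tools. "
    "Bananas are yellow."
)
print(feature_summary(sample, top_n=2))
```

In practice the weights would be tuned, or learned from data, rather than fixed at one each.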

In [21]:
# Install the sumy package
!pip install sumy

In [22]:
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.luhn import LuhnSummarizer


In [24]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [29]:
# Parse the page and print a summary
LANGUAGE = "english"
SENTENCE_COUNT = 10

url = "https://en.wikipedia.org/wiki/Natural_language_processing"

parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
# LsaSummarizer is used here; LuhnSummarizer (imported above) is
# constructed and called the same way
summarizer = LsaSummarizer(Stemmer(LANGUAGE))
for sentence in summarizer(parser.document, SENTENCE_COUNT):
    print(sentence)

1970s: During the 1970s, many programmers began to write "conceptual ontologies", which structured real-world information into computer-understandable data.
Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules.
Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data.
Since 2015,[22] the field has thus largely abandoned statistical methods and shifted to neural networks for machine learning.
Coreference resolution Given a sentence or larger chunk of text, determine which words ("mentions") refer to the same objects ("entities").
Natural-language understanding(NLU) Convert chunks of text into more formal representations such as first-order logic structures that are easier for computer programs to manipulate.
PMID 33736486.^ Lee, Jennifer; Yang, Samuel; Holland-Hall, Cynthia; Sezgin, Emre; Gill, Manjot; Linwood, Simon; Huang, Yungui; Hoffman, Je