<a href="https://colab.research.google.com/github/kh-ops69/ML_NLP/blob/master/extractive_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Text Summarization: producing a short summary of a given sample document. We compare several extractive methods, using both a custom function and pre-built libraries. The outputs differ between methods because each one applies its own variation of the same basic idea (score the sentences, then keep the highest-scoring ones), so each selects a somewhat different set of sentences.

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
import textwrap
from nltk.corpus import stopwords
from nltk import tokenize
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer

In [15]:
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

In [2]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

File ‘bbc_text_cls.csv’ already there; not retrieving.



In [5]:
df = pd.read_csv("bbc_text_cls.csv")

In [6]:
df.sample(5), df.labels.unique()

(                                                   text         labels
 525   Arthur Hailey: King of the bestsellers\n\nNove...  entertainment
 1987  Who do you think you are?\n\nThe real danger i...           tech
 627   REM concerts blighted by illness\n\nUS rock ba...  entertainment
 371   Madagascar completes currency switch\n\nMadaga...       business
 1683  All Black magic: New Zealand rugby\n\nPlaying ...          sport,
 array(['business', 'entertainment', 'politics', 'sport', 'tech'],
       dtype=object))

In [7]:
def wrap(x):
  # keep the newlines already in the text and put two spaces after each sentence end
  return textwrap.fill(x, replace_whitespace=False, fix_sentence_endings=True)
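
To see what the two options passed to `textwrap.fill` do, here is a small sketch on a made-up string (the sample text is an assumption for illustration, not taken from the dataset). `replace_whitespace=False` keeps the newlines already present in the text, and `fix_sentence_endings=True` asks `textwrap` to separate sentences with two spaces.

```python
import textwrap

# Made-up sample in the same "title, blank line, body" shape as the articles.
sample = "Headline here.\n\nFirst sentence. Second sentence."

# Same options as wrap(): existing newlines survive, sentence gaps become two spaces.
kept = textwrap.fill(sample, replace_whitespace=False, fix_sentence_endings=True)

# Default options: every whitespace character, newlines included, becomes a space.
flattened = textwrap.fill(sample)

print(repr(kept))
print(repr(flattened))
```

This is also why the wrapped outputs further down can contain blank lines and double spaces after full stops.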

In [10]:
df.iloc[0]

text      Ad sales boost Time Warner profit\n\nQuarterly...
labels                                             business
Name: 0, dtype: object

In [35]:
print(wrap(df.iloc[1].text.split("\n", 1)[1]))

# split("\n", 1) splits only once, at the first newline, giving (title, text);
# index [1] then keeps the text and drops the title


The dollar has hit its highest level against the euro in almost three
months after the Federal Reserve head said the US trade deficit is set
to stabilise.

And Alan Greenspan highlighted the US government's
willingness to curb spending and rising household savings as factors
which may help to reduce it.  In late trading in New York, the dollar
reached $1.2871 against the euro, from $1.2974 on Thursday.  Market
concerns about the deficit has hit the greenback in recent months.  On
Friday, Federal Reserve chairman Mr Greenspan's speech in London ahead
of the meeting of G7 finance ministers sent the dollar higher after it
had earlier tumbled on the back of worse-than-expected US jobs data.
"I think the chairman's taking a much more sanguine view on the
current account deficit than he's taken for some time," said Robert
Sinche, head of currency strategy at Bank of America in New York.
"He's taking a longer-term view, laying out a set of conditions under
which the current account deficit c
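
The title/body split can be checked on a short made-up string (an assumption for illustration; the real articles are much longer). Because the title is followed by a blank line, the body still begins with a newline after the split, which is why the printed text starts with an empty line.

```python
# Hypothetical short article in the dataset's "title\n\nbody" layout.
article = "Dollar gains on Greenspan speech\n\nThe dollar has hit its highest level."

# maxsplit=1: split only at the first newline, so the body keeps its own newlines.
title, body = article.split("\n", 1)

print(repr(title))
print(repr(body))

# Optional: drop the leading newline left over from the blank line after the title.
clean_body = body.lstrip("\n")
print(repr(clean_body))
```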

In [8]:
def tf_summarizer(texts, arg, factor):
  # texts: one article as "title\nbody"
  # arg: "s" for cosine similarity, "e" for euclidean distances
  # factor: smoothing weight given to the uniform matrix
  sents = texts.split("\n", 1)[1]
  sents = nltk.sent_tokenize(sents)
  featurizer = TfidfVectorizer(max_features=1500, stop_words=stopwords.words("english"), norm='l1')
  x = featurizer.fit_transform(sents)
  if arg=="s":
    s = cosine_similarity(x)
    s /= s.sum(axis=1, keepdims=True)
    # uniform matrix used for smoothing: every sentence can reach every other one
    u = np.ones_like(s)/len(s)
    # factor controls the weight given to each component: (1-factor) for s, factor for u
    s = (1-factor)*s + factor*u
    eigenvals, eigenvecs = np.linalg.eig(s.T)

    # eig() does not sort its output, so pick the eigenvector whose eigenvalue is closest to 1
    v = eigenvecs[:, np.argmin(np.abs(eigenvals - 1))].real
    scores = v / v.sum()

    # for a more in-depth understanding, the same scores via power iteration:

    # limiting_dist = np.ones(len(s)) / len(s)
    # threshold = 1e-10
    # delta = float('inf')
    # iters = 0
    # while delta > threshold:
    #   iters += 1
    #   # one step of the Markov chain: current distribution times the transition matrix s
    #   p = limiting_dist.dot(s)
    #   # delta is the total change made by this step; it shrinks towards 0 as
    #   # limiting_dist comes closer and closer to the stationary distribution
    #   delta = np.abs(p - limiting_dist).sum()
    #   limiting_dist = p
    # print(iters, limiting_dist.sum(), np.abs(scores - limiting_dist).sum())

    sort_idxes = (-scores).argsort()
    for i in sort_idxes[:5]:
      print(wrap("%.2f: %s"% (scores[i], sents[i])))

  # same procedure, with euclidean distances in place of cosine similarity
  elif arg=="e":
    e = euclidean_distances(x)
    e /= e.sum(axis=1, keepdims=True)
    u = np.ones_like(e)/len(e)
    e = (1-factor)*e + factor*u
    eigenvals, eigenvecs = np.linalg.eig(e.T)
    # eig() does not sort its output, so pick the eigenvector whose eigenvalue is closest to 1
    v = eigenvecs[:, np.argmin(np.abs(eigenvals - 1))].real
    scores = v / v.sum()
    sort_idxes = (-scores).argsort()
    for i in sort_idxes[:5]:
      print(wrap("%.2f: %s"% (scores[i], sents[i])))
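
The sentence scores are the stationary distribution of a Markov chain whose transition matrix is the smoothed, row-normalized similarity matrix. As a minimal sketch, the two ways of obtaining that distribution (eigendecomposition and power iteration) can be compared on a toy matrix. The 3x3 matrix and the function names below are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def stationary_by_power_iteration(p, threshold=1e-10, max_iters=1000):
    """Repeatedly apply the transition matrix until the distribution stops changing."""
    dist = np.ones(len(p)) / len(p)          # start from the uniform distribution
    for iters in range(1, max_iters + 1):
        new_dist = dist @ p                  # one step of the chain
        delta = np.abs(new_dist - dist).sum()
        dist = new_dist
        if delta < threshold:
            break
    return dist, iters

def stationary_by_eig(p):
    """Left eigenvector of p for the eigenvalue closest to 1, scaled to sum to 1."""
    eigenvals, eigenvecs = np.linalg.eig(p.T)
    v = eigenvecs[:, np.argmin(np.abs(eigenvals - 1))].real
    return v / v.sum()

# Toy row-stochastic matrix (each row sums to 1), standing in for the smoothed matrix.
p = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

pi_power, n_iters = stationary_by_power_iteration(p)
pi_eig = stationary_by_eig(p)
print(n_iters, pi_power.round(4), pi_eig.round(4))
```

Both results should agree up to numerical tolerance. Selecting the eigenvector by its eigenvalue, rather than by position, avoids relying on the order in which `np.linalg.eig` returns its results.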

In [33]:
df.iloc[1].text.split("\n", 1)[0], df.iloc[1].labels

('Dollar gains on Greenspan speech', 'business')

First, we rank the sentences using cosine similarity.

In [9]:
tf_summarizer(df.iloc[1].text, "s", factor=0.3)

0.08: 
The dollar has hit its highest level against the euro in almost
three months after the Federal Reserve head said the US trade deficit
is set to stabilise.
0.07: "I think the chairman's taking a much more sanguine view on the
current account deficit than he's taken for some time," said Robert
Sinche, head of currency strategy at Bank of America in New York.
0.07: China's currency remains pegged to the dollar and the US
currency's sharp falls in recent months have therefore made Chinese
export prices highly competitive.
0.07: Market concerns about the deficit has hit the greenback in
recent months.
0.07: On Friday, Federal Reserve chairman Mr Greenspan's speech in
London ahead of the meeting of G7 finance ministers sent the dollar
higher after it had earlier tumbled on the back of worse-than-expected
US jobs data.


The second method uses Euclidean distances instead.

In [10]:
tf_summarizer(df.iloc[1].text, "e", factor=0.3)

0.08: Worries about the deficit concerns about China do, however,
remain.
0.08: Market concerns about the deficit has hit the greenback in
recent months.
0.07: The G7 meeting is thought unlikely to produce any meaningful
movement in Chinese policy.
0.07: In late trading in New York, the dollar reached $1.2871 against
the euro, from $1.2974 on Thursday.
0.07: The half-point window, some believe, could be enough to keep US
assets looking more attractive, and could help prop up the dollar.
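
One way to read the difference between the two rankings: a row-normalized distance matrix pushes the random walk towards sentences that are far from the rest, while a similarity matrix pushes it towards sentences that resemble many others. The sketch below illustrates this on three made-up 2-D vectors, using the same row normalization and smoothing as `tf_summarizer`. The data and the helper name `stationary_scores` are hypothetical, and the vectors are not the notebook's TF-IDF features.

```python
import numpy as np

def stationary_scores(weights, factor=0.3, n_iter=200):
    """Row-normalize, smooth with a uniform matrix, then run power iteration."""
    p = weights / weights.sum(axis=1, keepdims=True)
    p = (1 - factor) * p + factor * np.ones_like(p) / len(p)
    dist = np.ones(len(p)) / len(p)
    for _ in range(n_iter):
        dist = dist @ p
    return dist

# Two near-identical vectors (indices 0 and 1) and one outlier (index 2).
x = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])

# Cosine similarity and Euclidean distance, computed by hand with NumPy.
unit = x / np.linalg.norm(x, axis=1, keepdims=True)
cos_sim = unit @ unit.T
eucl = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)

cos_scores = stationary_scores(cos_sim)
dist_scores = stationary_scores(eucl)
print(cos_scores.round(3), dist_scores.round(3))
```

With these vectors the outlier should get the lowest score under cosine similarity and the highest under Euclidean distance, which is consistent with the two methods above picking different top sentences.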


Next, we use pre-built libraries (sumy's TextRank and LSA summarizers) to obtain summaries instead.

In [17]:
summarizer = TextRankSummarizer()
parser = PlaintextParser(df.iloc[1].text.split('\n',1)[1], Tokenizer('english'))
summary = summarizer(parser.document, sentences_count=5)

In [21]:
def get_wrap(summary):
  for sentence in summary:
    print(wrap(str(sentence)))

In [24]:
get_wrap(summary=summary)

The dollar has hit its highest level against the euro in almost three
months after the Federal Reserve head said the US trade deficit is set
to stabilise.
On Friday, Federal Reserve chairman Mr Greenspan's speech in London
ahead of the meeting of G7 finance ministers sent the dollar higher
after it had earlier tumbled on the back of worse-than-expected US
jobs data.
But calls for a shift in Beijing's policy have fallen on deaf ears,
despite recent comments in a major Chinese newspaper that the "time is
ripe" for a loosening of the peg.
In the meantime, the US Federal Reserve's decision on 2 February to
boost interest rates by a quarter of a point - the sixth such move in
as many months - has opened up a differential with European rates.
The recent falls have partly been the result of big budget deficits,
as well as the US's yawning current account gap, both of which need to
be funded by the buying of US bonds and assets by foreign firms and
governments.


In [23]:
Lsumm = LsaSummarizer()
second_summ = Lsumm(parser.document, sentences_count=5)
get_wrap(second_summ)

And Alan Greenspan highlighted the US government's willingness to curb
spending and rising household savings as factors which may help to
reduce it.
"I think the chairman's taking a much more sanguine view on the
current account deficit than he's taken for some time," said Robert
Sinche, head of currency strategy at Bank of America in New York.
China's currency remains pegged to the dollar and the US currency's
sharp falls in recent months have therefore made Chinese export prices
highly competitive.
The G7 meeting is thought unlikely to produce any meaningful movement
in Chinese policy.
The White House will announce its budget on Monday, and many
commentators believe the deficit will remain at close to half a
trillion dollars.


In [50]:
!pip install --upgrade pip



In [48]:
!pip install wheel



In [52]:
!pip install gensim==3.6.0

Collecting gensim==3.6.0
  Downloading gensim-3.6.0.tar.gz (23.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.1/23.1 MB 67.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: gensim
  Building wheel for gensim (setup.py) ... done
  Created wheel for gensim: filename=gensim-3.6.0-cp310-cp310-linux_x86_64.whl size=23916462 sha256=c116cb4e81635aecd7decf0deea4dab114f7c0b73833d85b42324d36b743702e
  Stored in directory: /root/.cache/pip/wheels/00/e8/47/96f55c3144a5ea3537f549f7a97607011f5004b9f13fa8dcc5
Successfully built gensim
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 4.3.1
    Uninstalling gensim-4.3.1:
      Successfully uninstalled gensim-4.3.1
Successfully installed gensim-3.6.0


In [13]:
!pip install sumy

Collecting sumy
  Downloading sumy-0.11.0-py2.py3-none-any.whl (97 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97.3/97.3 kB 4.1 MB/s eta 0:00:00
Collecting docopt<0.7,>=0.6.1 (from sumy)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting breadability>=0.1.20 (from sumy)
  Downloading breadability-0.1.20.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Collecting pycountry>=18.2.23 (from sumy)
  Downloading pycountry-22.3.5.tar.gz (10.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.1/10.1 MB 50.7 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels fo