In [1]:
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk import SnowballStemmer, WordNetLemmatizer 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

import nltk
import pandas as pd
import numpy as np
import numpy.linalg as LA
import textwrap

#### Wrapper object initialised for displaying text in a readable format

In [2]:
wrapper = textwrap.TextWrapper(width=80)

#### Initialised tfidf vectorizer object with removal of stopwords

In [3]:
tfidf = TfidfVectorizer(stop_words=stopwords.words('english'))

In [4]:
BBC_article = pd.read_csv('bbc_text_cls.csv')
BBC_article

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business
...,...,...
2220,BT program to beat dialler scams\n\nBT is intr...,tech
2221,Spam e-mails tempt net shoppers\n\nComputer us...,tech
2222,Be careful how you code\n\nA new European dire...,tech
2223,US cyber security chief resigns\n\nThe man mak...,tech


#### Displaying sample text on which we are working

In [5]:
sample_article = BBC_article['text'][0]
print("\n".join(wrapper.wrap(sample_article)))

Ad sales boost Time Warner profit  Quarterly profits at US media giant
TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from
$639m year-earlier.  The firm, which is now one of the biggest investors in
Google, benefited from sales of high-speed internet connections and higher
advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from
$10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at
Warner Bros, and less users for AOL.  Time Warner said on Friday that it now
owns 8% of search-engine Google. But its own internet business, AOL, had has
mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were
lower than in the preceding three quarters. However, the company said AOL's
underlying profit before exceptional items rose 8% on the back of stronger
internet advertising revenues. It hopes to increase subscribers by offering the
online service free to TimeWarner internet customers and will try to sign up
AOL

#### Tokenising the document into collection of sentences

In [6]:
sent_token = nltk.sent_tokenize(sample_article)

#### Here we are doing sentence tokenisation

* We are basically treating every sentence in the article as a document
* then we are applying tfidf vectoriser on it and then using text rank we will score them
* after scoring them we will choose the top 5 scoring sentences, these will represent the summary

In [7]:
sent_token_tfidf = tfidf.fit_transform(sent_token[1:])

#### linear_kernel calculates the dot prodect between the 2 tfidf sparse matrices

In [8]:
G = linear_kernel(sent_token_tfidf,sent_token_tfidf)

In [9]:
len(G[0])

19

In [10]:
# sanity check
linear_kernel(sent_token_tfidf[0],sent_token_tfidf[0])

array([[1.]])

In [11]:
type(G)

numpy.ndarray

#### Normalising the matrix so that it is compatible with the Perron-Frobenius conditions of convergence

In [12]:
G = G/G.sum(axis=1).reshape((19,1))

#### Smoothing the G matrix

In [13]:
damping_factor = 0.15

In [14]:
A = damping_factor*1/len(G) + (1-damping_factor)*G

In [15]:
A.sum(axis=1)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1.])

* In the text rank paper, it is stated that we can take any initial state and it will converge with the given G matrix

In [16]:
initial_state = np.random.rand(1,len(A))
initial_state = initial_state/initial_state.sum()

In [17]:
initial_state.sum()

1.0

### Method 1

#### Running the iterative process to find the initial_state of the converged point

In [18]:
epsilon = 1e-8

while(True):
    
    temp = initial_state
    initial_state = np.matmul(initial_state,A)
    
    if abs(initial_state.sum(axis=1)-temp.sum(axis=1))[0]<epsilon:
        break
    

In [19]:
initial_state

array([[0.07500099, 0.04715706, 0.05495944, 0.03733855, 0.05552923,
        0.03663554, 0.05861759, 0.06387914, 0.02802733, 0.03783344,
        0.04950762, 0.04594194, 0.07650792, 0.05113149, 0.02565022,
        0.08572842, 0.0339147 , 0.07802205, 0.05861732]])

#### Getting the index of sentences scoring the most points

In [20]:
initial_state*=-1
initial_state.argsort()

array([[15, 17, 12,  0,  7,  6, 18,  4,  2, 13, 10,  1, 11,  9,  3,  5,
        16,  8, 14]])

In [21]:
summary_index = initial_state.argsort()[0][:5]
summary_index

array([15, 17, 12,  0,  7])

#### Summary created using Iterative process

In [22]:
summary = " ".join([sent_token[index] for index in summary_index])
print("\n".join(wrapper.wrap(summary)))

TimeWarner is to restate its accounts as part of efforts to resolve an inquiry
into AOL by US market regulators. The company said it was unable to estimate the
amount it needed to set aside for legal reserves, which it previously set at
$500m. For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its
2003 performance, while revenues grew 6.4% to $42.09bn. Ad sales boost Time
Warner profit  Quarterly profits at US media giant TimeWarner jumped 76% to
$1.13bn (£600m) for the three months to December, from $639m year-earlier.
However, the company said AOL's underlying profit before exceptional items rose
8% on the back of stronger internet advertising revenues.


### Method 2
#### Direct method : find eigvector associated with eigen value = 1

In [36]:
eig_values, eig_vectors = LA.eig(G.T)

In [37]:
eig_values

array([1.        , 0.90460542, 0.35897585, 0.85245415, 0.82768147,
       0.83624879, 0.7696432 , 0.74822436, 0.71818502, 0.42596997,
       0.44619282, 0.4552007 , 0.66972404, 0.64876074, 0.60725952,
       0.52066281, 0.5471149 , 0.53985297, 0.52743861])

In [38]:
eig_vectors[:,0]

array([0.23427978, 0.28349642, 0.2522921 , 0.22792334, 0.22210008,
       0.21928636, 0.28570521, 0.23366379, 0.22516964, 0.25618999,
       0.18378281, 0.25022524, 0.19635511, 0.21631933, 0.23598968,
       0.16634031, 0.17691479, 0.24167348, 0.21062578])

#### Normalising the eigenvectors to satisfy the properties of probability distribution

In [39]:
scores = eig_vectors[:,0]
scores = scores/scores.sum()
scores

array([0.05425236, 0.0656495 , 0.05842349, 0.0527804 , 0.0514319 ,
       0.05078032, 0.06616099, 0.05410972, 0.05214272, 0.05932613,
       0.04255874, 0.05794487, 0.04547012, 0.05009324, 0.05464832,
       0.03851956, 0.0409683 , 0.05596453, 0.04877479])

In [40]:
scores.sum()

1.0

In [41]:
scores = scores*-1
summary_index = scores.argsort()[:5]
summary_index

array([ 6,  1,  9,  2, 11])

In [42]:
summary = " ".join([sent_token[index] for index in summary_index])
print("\n".join(wrapper.wrap(summary)))

It lost 464,000 subscribers in the fourth quarter profits were lower than in the
preceding three quarters. The firm, which is now one of the biggest investors in
Google, benefited from sales of high-speed internet connections and higher
advert sales. TimeWarner also has to restate 2000 and 2003 results following a
probe by the US Securities Exchange Commission (SEC), which is close to
concluding. TimeWarner said fourth quarter sales rose 2% to $11.1bn from
$10.9bn. But its film division saw profits slump 27% to $284m, helped by box-
office flops Alexander and Catwoman, a sharp contrast to year-earlier, when the
third and final film in the Lord of the Rings trilogy boosted results.


### Method 1 function (Iterative): 

In [43]:
def summarise_iterative(text : str, damping_factor : float):
    
    tfidf = TfidfVectorizer(stop_words=stopwords.words('english'))
    
    sent_token = nltk.sent_tokenize(text)
    sent_token_tfidf = tfidf.fit_transform(sent_token[1:])
    
    G = linear_kernel(sent_token_tfidf,sent_token_tfidf)
    G = G/G.sum(axis=1).reshape((19,1))
    
    A = damping_factor*1/len(G) + (1-damping_factor)*G
    
    initial_state = np.random.rand(1,len(A))
    initial_state = initial_state/initial_state.sum()
    
    while(True):
    
        temp = initial_state
        initial_state = np.matmul(initial_state,A)
        
        if abs(initial_state.sum(axis=1)-temp.sum(axis=1))[0]<0.00000001:
            break
        
    initial_state = initial_state*-1
    summary_index = initial_state.argsort()[0][:5]
    summary = " ".join([sent_token[index] for index in summary_index])
    
    return summary


In [44]:
print("\n".join(wrapper.wrap(summarise_iterative(text=sample_article,damping_factor=0.15))))

Time Warner said on Friday that it now owns 8% of search-engine Google. It
intends to adjust the way it accounts for a deal with German music publisher
Bertelsmann's purchase of a stake in AOL Europe, which it had reported as
advertising revenue. Time Warner's fourth quarter profits were slightly better
than analysts' expectations. The company said it was unable to estimate the
amount it needed to set aside for legal reserves, which it previously set at
$500m. However, the company said AOL's underlying profit before exceptional
items rose 8% on the back of stronger internet advertising revenues.


### Method 2 function (Eigen-Vector):

In [45]:
def summarise_direct(text : str, damping_factor : float):
    
    tfidf = TfidfVectorizer(stop_words=stopwords.words('english'))
    
    sent_token = nltk.sent_tokenize(text)
    sent_token_tfidf = tfidf.fit_transform(sent_token[1:])
    
    G = linear_kernel(sent_token_tfidf,sent_token_tfidf)
    G = G/G.sum(axis=1).reshape((19,1))
    
    A = damping_factor*1/len(G) + (1-damping_factor)*G
    
    eig_values, eig_vectors = LA.eig(A.T)
    scores = eig_vectors[:,list(np.floor(eig_values)).index(1.0)]/eig_vectors[:,list(np.floor(eig_values)).index(1.0)].sum()
    
    scores = scores*-1
    summary_index = scores.argsort()[:5]
    summary = " ".join([sent_token[index] for index in summary_index])
    
    return summary

In [46]:
print("\n".join(wrapper.wrap(summarise_direct(text=sample_article,damping_factor=0.15))))

It lost 464,000 subscribers in the fourth quarter profits were lower than in the
preceding three quarters. The firm, which is now one of the biggest investors in
Google, benefited from sales of high-speed internet connections and higher
advert sales. But its film division saw profits slump 27% to $284m, helped by
box-office flops Alexander and Catwoman, a sharp contrast to year-earlier, when
the third and final film in the Lord of the Rings trilogy boosted results.
TimeWarner also has to restate 2000 and 2003 results following a probe by the US
Securities Exchange Commission (SEC), which is close to concluding. TimeWarner
said fourth quarter sales rose 2% to $11.1bn from $10.9bn.
