
In [244]:
import numpy as np
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import sent_tokenize

In [245]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [246]:
text = """
Microsoft is investigating a trojan program that attempts to switch off the firm's anti-spyware software.

The spyware tool was only released by Microsoft in the last few weeks and has been downloaded by six million people. 
Stephen Toulouse, a security manager at Microsoft, said the malicious program was called Bankash-A Trojan and was being sent as an e-mail attachment. 
Microsoft said it did not believe the program was widespread and recommended users to use an anti-virus program. 
The program attempts to disable or delete Microsoft's anti-spyware tool and suppress warning messages given to users.

It may also try to steal online banking passwords or other personal information by tracking users' keystrokes.

Microsoft said in a statement it is investigating what it called a criminal attack on its software. 
Earlier this week, Microsoft said it would buy anti-virus software maker Sybari Software to improve its security in its Windows and e-mail software. 
Microsoft has said it plans to offer its own paid-for anti-virus software but it has not yet set a date for its release. 
The anti-spyware program being targeted is currently only in beta form and aims to help users find and remove spyware - programs which monitor internet use, causes advert pop-ups and slow a PC's performance.
"""

In [247]:
headline = """
Microsoft seeking spyware trojan
""".lower()

In [248]:
## lowercase the text
text = text.lower()

## Notations:

* entities (vocabulary words) = hubs
* sentences = authorities
---
* hubs vector: one score per entity, initialized to 1
* authorities vector: one score per sentence, initialized to 1
---
* adjacency matrix = (hubs x authorities) or (entities x sentences)
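
As a small illustration of this notation, the sketch below builds the adjacency matrix for a hypothetical three-word, three-sentence toy input and applies one round of the plain HITS update. The `toy_*` names and the toy data are assumptions made for illustration only; the cells further down use a modified update with term frequencies and sentence lengths.

```python
import numpy as np

# Hypothetical toy data, not taken from the article used in this notebook.
toy_entities = ["microsoft", "trojan", "spyware"]
toy_sentences = [
    ["microsoft", "trojan"],
    ["spyware"],
    ["microsoft", "trojan", "spyware"],
]

# adjacency matrix: rows = entities (hubs), columns = sentences (authorities)
toy_adj = np.array(
    [[1.0 if e in s else 0.0 for s in toy_sentences] for e in toy_entities]
)

# initial rank = 1 for every hub and every authority
toy_hubs = np.ones((len(toy_entities), 1))
toy_auth = np.ones((len(toy_sentences), 1))

# one round of the plain HITS update, computed from the previous scores
new_hubs = toy_adj @ toy_auth      # hub score = sum of linked authority scores
new_auth = toy_adj.T @ toy_hubs    # authority score = sum of linked hub scores

# normalize so each score vector sums to 1
toy_hubs = new_hubs / new_hubs.sum()
toy_auth = new_auth / new_auth.sum()

print(toy_adj)
print(toy_auth.ravel())
```

The third toy sentence contains all three entities, so it receives the largest authority score after one round.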

In [249]:
## tokenize sentences
sent_tokens = sent_tokenize(text)

## remove punctuation
## note: this also joins hyphenated words, e.g. "anti-spyware" becomes "antispyware"
import re
sentences = [re.sub(r'[^\w\s]', '', sent) for sent in sent_tokens]

print("number of sentences: ", len(sentences))

number of sentences:  10


In [250]:
for sent in sentences:
  print(sent)


microsoft is investigating a trojan program that attempts to switch off the firms antispyware software
the spyware tool was only released by microsoft in the last few weeks and has been downloaded by six million people
stephen toulouse a security manager at microsoft said the malicious program was called bankasha trojan and was being sent as an email attachment
microsoft said it did not believe the program was widespread and recommended users to use an antivirus program
it may also try to steal online banking passwords or other personal information by tracking users keystrokes
microsoft said in a statement it is investigating what it called a criminal attack on its software
earlier this week microsoft said it would buy antivirus software maker sybari software to improve its security in its windows and email software
microsoft has said it plans to offer its own paidfor antivirus software but it has not yet set a date for its release
the antispyware program being targeted is currently o

In [251]:
## build the vocabulary

## word tokenizer 
word_tokenizer = RegexpTokenizer(r'\w+')

## get all the words
entities = word_tokenizer.tokenize(text)
print("len with duplicate: ", len(entities))

## remove duplicate entities
entities = set(entities)
print("len of set: ", len(entities))

## convert set to list
entities = list(entities)
print("type: ", type(entities))
print("len of entities after removing duplication: ", len(entities))


len with duplicate:  221
len of set:  121
type:  <class 'list'>
len of entities after removing duplication:  121


In [252]:
## create adjacency matrix  
## dimension: (hubs x authorities) = (entities x sentences)

## create zero matrix of the dimension
adj_mat = np.zeros((len(entities), len(sentences)))

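## for each word in the vocabulary (entity):
##  for each sentence:
##     mark 1 if the sentence contains it, or 0 if it does not
## note: `word in sent` is a substring test on the sentence string,
## so a short word such as "it" also matches inside "security"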
in_count = 0
out_count = 0
for i, word in enumerate(entities):
  for j, sent in enumerate(sentences):
    if word in sent:
      adj_mat[i][j] = 1
      in_count += 1
    else:
      out_count += 1

print("total shape:", adj_mat.shape[0]*adj_mat.shape[1], "  total count:", in_count+out_count)
assert (adj_mat.shape[0]*adj_mat.shape[1]) == (in_count+out_count)

total shape: 1210   total count: 1210
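
The membership test in the cell above works on the raw sentence string, so it matches substrings as well as whole words. A possible alternative, sketched here under the assumption that whole-token matching is what is wanted, splits each sentence into a set of tokens first. `build_adjacency` and the `demo_*` data are hypothetical names, and switching to this variant would change the scores printed later in the notebook. The vocabulary and the sentences would also need the same tokenization, since the vocabulary splits "anti-spyware" into two words while the cleaned sentences contain "antispyware".

```python
import numpy as np

def build_adjacency(entities, sentences):
    """Return an (entities x sentences) 0/1 matrix using whole-token membership."""
    token_sets = [set(sent.split()) for sent in sentences]
    adj = np.zeros((len(entities), len(sentences)))
    for i, word in enumerate(entities):
        for j, tokens in enumerate(token_sets):
            if word in tokens:
                adj[i, j] = 1
    return adj

# Hypothetical demo data chosen to expose the difference.
demo_entities = ["it", "security", "use"]
demo_sentences = ["it improves security", "users stay safe"]

token_adj = build_adjacency(demo_entities, demo_sentences)

# substring matching, as in the cell above
substring_adj = np.array(
    [[1.0 if w in s else 0.0 for s in demo_sentences] for w in demo_entities]
)

print(token_adj)
print(substring_adj)
```

The two matrices differ only for "use", which the substring test finds inside "users".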


In [253]:
## create Authority matrix
## size = sentences, values = initial rank = 1
authorities = np.ones((len(sentences), 1))
print("authorities shape:", authorities.shape)

authorities shape: (10, 1)


In [254]:
## sentence lengths (in words), used later to normalize the authority scores
sent_len_arr = np.ones((len(sentences), 1))
for i,sent in enumerate(sentences):
  sent_len_arr[i][0] = len(sent.split())

# max_len = max(sent_len_arr)
# min_len = min(sent_len_arr)
# diff = max_len-min_len

# for i in range(sent_len_arr.shape[0]):
#   sent_len_arr[i][0] = (sent_len_arr[i][0]-min_len+0.0001)/diff

print(sent_len_arr)

[[15.]
 [21.]
 [23.]
 [18.]
 [17.]
 [17.]
 [17.]
 [23.]
 [23.]
 [33.]]


In [255]:
## create Hubs Matrix
## size = entities, values = initial rank = 1
hubs = np.ones((len(entities), 1))
print("hubs shape:", hubs.shape)

hubs shape: (121, 1)


In [256]:
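## term frequency: occurrences of each entity in the text
## note: str.count counts substring matches, not whole-word matches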
term_freq = np.ones((len(entities), 1))

for i, term in enumerate(entities):
  term_freq[i][0] = text.count(term)
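
If whole-word counts are preferred over substring counts, one option is to tokenize once and count tokens. The sketch below assumes the same `\w+` tokenization pattern used for the vocabulary; `token_term_freq` and `demo_text` are hypothetical names, and using it would change the scores shown in the following cells.

```python
import re
from collections import Counter

import numpy as np

def token_term_freq(text, entities):
    """Return a column vector with the whole-token count of each entity."""
    counts = Counter(re.findall(r'\w+', text))
    return np.array([[counts[term]] for term in entities], dtype=float)

# Hypothetical demo text: "it" also occurs inside "security".
demo_text = "it is a security issue and it is serious"
demo_freq = token_term_freq(demo_text, ["it", "security"])

print(demo_freq.ravel())       # whole-token counts
print(demo_text.count("it"))   # substring count, for comparison
```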

In [257]:
## iterate until convergence, with k as an upper bound on the number of iterations
k = 300

for i in range(k):
  ## Hubs Update
  HubScore = np.matmul(adj_mat, authorities)
  HubScore = HubScore+ term_freq

  ## Authorities Update
  AuthorityScore = np.matmul(adj_mat.transpose(), hubs)
  AuthorityScore = np.divide(AuthorityScore,sent_len_arr)

  old_hubs = hubs
  old_authorities = authorities

  hubs = HubScore/HubScore.sum()
  authorities = AuthorityScore/AuthorityScore.sum()

  print("iter:", i, authorities)
  print()

  hub_diff = abs(hubs-old_hubs)
  auth_diff = abs(authorities-old_authorities)

  if hub_diff.sum() < 1e-10*hub_diff.shape[0] and auth_diff.sum() < 1e-10*auth_diff.shape[0]:
    break

iter: 0 [[0.11236956]
 [0.1058025 ]
 [0.09327117]
 [0.0936413 ]
 [0.11267001]
 [0.13069721]
 [0.08112241]
 [0.08660895]
 [0.09327117]
 [0.09054572]]

iter: 1 [[0.13200128]
 [0.09279265]
 [0.09108937]
 [0.10845175]
 [0.11893239]
 [0.11954755]
 [0.10068242]
 [0.08457216]
 [0.08654248]
 [0.06538796]]

iter: 2 [[0.13172819]
 [0.09300223]
 [0.09018605]
 [0.10804181]
 [0.11793738]
 [0.12060748]
 [0.10373965]
 [0.08429587]
 [0.08642452]
 [0.06403681]]

iter: 3 [[0.13181092]
 [0.09294143]
 [0.09019438]
 [0.10806662]
 [0.11795365]
 [0.12052904]
 [0.10381763]
 [0.08431395]
 [0.08642291]
 [0.06394947]]

iter: 4 [[0.13181053]
 [0.09294162]
 [0.09019198]
 [0.10806326]
 [0.11794835]
 [0.1205334 ]
 [0.10382796]
 [0.08431419]
 [0.08642433]
 [0.06394439]]

iter: 5 [[0.1318109 ]
 [0.09294134]
 [0.09019205]
 [0.10806329]
 [0.11794837]
 [0.12053291]
 [0.10382832]
 [0.08431437]
 [0.0864244 ]
 [0.06394405]]

iter: 6 [[0.1318109 ]
 [0.09294133]
 [0.09019204]
 [0.10806327]
 [0.11794834]
 [0.12053293]
 [0.1038
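
The update loop above can also be wrapped in a function so that it can be rerun on other inputs without printing every iteration. This is a minimal sketch that mirrors the notebook's modified update (term frequency added to the hub scores, authority scores divided by sentence length); `weighted_hits`, its parameters, and the `demo_*` inputs are hypothetical names.

```python
import numpy as np

def weighted_hits(adj, term_freq, sent_len, tol=1e-10, max_iter=300):
    """Iterate the modified hub/authority update until the scores stop changing."""
    hubs = np.ones((adj.shape[0], 1))
    authorities = np.ones((adj.shape[1], 1))
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # both updates use the scores from the previous iteration
        new_hubs = adj @ authorities + term_freq
        new_auth = (adj.T @ hubs) / sent_len
        new_hubs = new_hubs / new_hubs.sum()
        new_auth = new_auth / new_auth.sum()
        hub_diff = np.abs(new_hubs - hubs).sum()
        auth_diff = np.abs(new_auth - authorities).sum()
        hubs, authorities = new_hubs, new_auth
        if hub_diff < tol * hubs.shape[0] and auth_diff < tol * authorities.shape[0]:
            break
    return hubs, authorities, n_iter

# Hypothetical toy inputs: 3 entities, 3 sentences.
demo_adj = np.array([[1.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
demo_tf = np.array([[3.0], [1.0], [1.0]])
demo_len = np.array([[2.0], [1.0], [3.0]])

demo_hubs, demo_auth, demo_iters = weighted_hits(demo_adj, demo_tf, demo_len)
print(demo_iters, demo_auth.ravel())
```

In the notebook itself the call would look like `weighted_hits(adj_mat, term_freq, sent_len_arr)`.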

In [258]:
idx = np.argmax(authorities)
print(sentences[idx])


microsoft is investigating a trojan program that attempts to switch off the firms antispyware software
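
To produce a summary longer than one sentence, the authority scores can be sorted instead of taking only the maximum. The helper below is a hypothetical sketch; `top_sentences` and the demo values are assumptions and not part of the original notebook.

```python
import numpy as np

def top_sentences(sentences, authorities, n=3):
    """Return the n highest-scoring sentences with their authority scores."""
    scores = np.asarray(authorities).ravel()
    order = np.argsort(scores)[::-1][:n]   # indices sorted by descending score
    return [(sentences[i], float(scores[i])) for i in order]

# Hypothetical demo values.
demo_ranked = top_sentences(
    ["first sentence", "second sentence", "third sentence"],
    np.array([[0.2], [0.5], [0.3]]),
    n=2,
)
print(demo_ranked)
```

In the notebook, `top_sentences(sentences, authorities, n=3)` would list the three sentences with the highest authority scores.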


In [259]:
print(headline)


microsoft seeking spyware trojan

