## TEXT SUMMARIZATION ASSIGNMENT

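Text summarization condenses a text while retaining its main ideas and key information. There are two main approaches: extractive summarization, which selects the most important sentences verbatim from the source, and abstractive summarization, which generates new sentences that paraphrase it. This notebook takes the extractive approach: it scores sentences with PageRank over a sentence-similarity graph and keeps the top-ranked ones.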


## IMPORTING MODULES

In [21]:
# Import the stop-word list, cosine distance, NumPy and NetworkX
from nltk.corpus import stopwords  # not used below; stop words could be filtered out for speed
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

## OPEN FILE AND SPLIT INTO SENTENCES

In [22]:
# Open Text4.txt, which contains one paragraph of multiple sentences
with open("Text4.txt", "r") as file:
    filedata = file.readlines()
article = filedata[0].split(". ")  # only the first paragraph is used

sentences = []
for sentence in article:
    print(sentence)
    # Note: str.replace matches the literal text "[^a-zA-Z]", not a regex,
    # so this call leaves the sentence unchanged and punctuation is kept.
    sentences.append(sentence.replace("[^a-zA-Z]", " ").split(" "))

Imagine there's no heaven
It's easy if you try
No hell below us
Above us, only sky
Imagine all the people livin' for today
Imagine there's no countries
It isn't hard to do
Nothing to kill or die for and no religion, too
Imagine all the people livin' life in peace
You may say I'm a dreamer but I'm not the only one
I hope someday you'll join us and the world will be as one
Imagine no possessions
I wonder if you can
No need for greed or hunger
A brotherhood of man
Imagine all the people sharing all the world




## DISPLAYING AS LIST

In [23]:
# Printing the list of sentences
print("Sentences are ", sentences)

Sentences are  [['Imagine', "there's", 'no', 'heaven'], ["It's", 'easy', 'if', 'you', 'try'], ['No', 'hell', 'below', 'us'], ['Above', 'us,', 'only', 'sky'], ['Imagine', 'all', 'the', 'people', "livin'", 'for', 'today'], ['Imagine', "there's", 'no', 'countries'], ['It', "isn't", 'hard', 'to', 'do'], ['Nothing', 'to', 'kill', 'or', 'die', 'for', 'and', 'no', 'religion,', 'too'], ['Imagine', 'all', 'the', 'people', "livin'", 'life', 'in', 'peace'], ['You', 'may', 'say', "I'm", 'a', 'dreamer', 'but', "I'm", 'not', 'the', 'only', 'one'], ['I', 'hope', 'someday', "you'll", 'join', 'us', 'and', 'the', 'world', 'will', 'be', 'as', 'one'], ['Imagine', 'no', 'possessions'], ['I', 'wonder', 'if', 'you', 'can'], ['No', 'need', 'for', 'greed', 'or', 'hunger'], ['A', 'brotherhood', 'of', 'man'], ['Imagine', 'all', 'the', 'people', 'sharing', 'all', 'the', 'world'], ['\n']]
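The list above still contains punctuation (`'us,'`, `'religion,'`) and a final `['\n']` entry, because `str.replace` looks for the literal characters `[^a-zA-Z]` and does not interpret them as a pattern. A minimal sketch of a regex-based alternative follows; the helper name `split_into_sentences` is hypothetical, and keeping apostrophes inside words is an assumption, not something the notebook specifies.

```python
import re

def split_into_sentences(paragraph):
    """Split a paragraph on '. ' and return one list of word tokens per sentence."""
    token_lists = []
    for raw in paragraph.strip().split(". "):
        # re.sub treats the pattern as a regex; str.replace would search for the
        # literal text "[^a-zA-Z']" and change nothing.
        cleaned = re.sub(r"[^a-zA-Z']", " ", raw)
        tokens = cleaned.split()  # split() with no argument drops empty strings
        if tokens:                # skip fragments without words, e.g. a lone "\n"
            token_lists.append(tokens)
    return token_lists

example = "No hell below us. Above us, only sky.\n"
print(split_into_sentences(example))
```

With this variant the comma after "us" is dropped and no empty trailing sentence reaches the ranking step. Scores computed from it would differ from the outputs recorded in this notebook.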


## FUNCTION TO CALCULATE SIMILARITY

In [24]:
# Similarity between two sentences, computed as the cosine similarity of their word-count vectors
def sentence_similarity(sent1, sent2):
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]
    all_words = list(set(sent1 + sent2))
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
    # build the count vector for the first sentence
    for w in sent1:
        vector1[all_words.index(w)] += 1
    # build the count vector for the second sentence
    for w in sent2:
        vector2[all_words.index(w)] += 1
    # cosine_distance returns 1 - cosine similarity, so invert it
    return 1 - cosine_distance(vector1, vector2)
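
To make the calculation concrete, here is a dependency-free sketch of the same idea: the cosine similarity of two word-count vectors is their dot product divided by the product of their lengths. The name `cosine_similarity` and the use of `collections.Counter` are choices made for this illustration; the notebook itself relies on NLTK's `cosine_distance`.

```python
import math
from collections import Counter

def cosine_similarity(tokens1, tokens2):
    """Cosine similarity between two token lists using lower-cased word counts."""
    counts1 = Counter(w.lower() for w in tokens1)
    counts2 = Counter(w.lower() for w in tokens2)
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(counts1[w] * counts2[w] for w in counts1)
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty sentence shares nothing with any other
    return dot / (norm1 * norm2)

a = ["Imagine", "there's", "no", "heaven"]
b = ["Imagine", "there's", "no", "countries"]
print(cosine_similarity(a, b))  # 3 shared words / (2 * 2) = 0.75
```

The two example sentences share three of their four words, and each count vector has length 2, giving 3 / 4 = 0.75.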

## CREATING SIMILARITY MATRIX


A similarity matrix quantifies the pairwise similarity between the items in a set. It is widely used in machine learning, natural language processing, and clustering analysis. Entry (i, j) holds the similarity between item i and item j. Here the items are sentences, so the matrix is square with one row and one column per sentence, it is symmetric, and its diagonal is zero because a sentence is not compared with itself.

In [25]:
# Create a matrix to store the similarity score for every pair of sentences
similarity_matrix = np.zeros((len(sentences), len(sentences)))

for idx1 in range(len(sentences)):
    for idx2 in range(len(sentences)):
        if idx1 == idx2:  # skip comparing a sentence with itself
            continue
        similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2])

print("Similarity matrix \n", similarity_matrix)

Similarity matrix 
 [[0.         0.         0.25       0.         0.18898224 0.75
  0.         0.15811388 0.1767767  0.         0.         0.57735027
  0.         0.20412415 0.         0.14433757 0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.11952286 0.         0.
  0.4        0.         0.         0.         0.        ]
 [0.25       0.         0.         0.         0.         0.25
  0.         0.15811388 0.         0.         0.13867505 0.28867513
  0.         0.20412415 0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.13363062 0.         0.
  0.         0.         0.         0.         0.        ]
 [0.18898224 0.         0.         0.         0.         0.18898224
  0.         0.11952286 0.6681531  0.10101525 0.10482848 0.21821789
  0.         0.15430335 0.         0.65465367 0.        ]
 [0.75       0.         0.25       0.         0.1889822
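
Because cosine similarity does not depend on the order of its arguments, the matrix is symmetric (row 0, column 2 and row 2, column 0 both hold 0.25 above). The sketch below fills only the pairs above the diagonal and mirrors them, which roughly halves the number of comparisons. It uses plain lists, and `word_overlap` is a hypothetical stand-in similarity (Jaccard overlap, not the cosine measure used in the notebook) so that the block runs on its own.

```python
def build_similarity_matrix(items, similarity):
    """Return a symmetric n x n list of lists with zeros on the diagonal."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):      # pairs above the diagonal only
            score = similarity(items[i], items[j])
            matrix[i][j] = score
            matrix[j][i] = score       # mirror into the lower triangle
    return matrix

def word_overlap(s1, s2):
    """Stand-in similarity: Jaccard overlap of the lower-cased word sets."""
    set1, set2 = {w.lower() for w in s1}, {w.lower() for w in s2}
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0

toy = [
    ["No", "hell", "below", "us"],
    ["Above", "us", "only", "sky"],
    ["Imagine", "no", "possessions"],
]
for row in build_similarity_matrix(toy, word_overlap):
    print(row)
```

Any two-argument similarity function can be passed in place of `word_overlap`, including the notebook's `sentence_similarity`.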

## GETTING PAGERANK SCORES

PageRank is an algorithm used by Google Search to rank web pages in its search results. It assigns a numerical weight to each element of a hyperlinked set of documents, such as web pages, to measure its relative importance within the set. The underlying idea is that important pages are likely to receive more links from other pages. Here each sentence is a node and the similarity scores are the edge weights, so sentences that are similar to many other sentences rank highest.

In [26]:
# Step 3 - Rank sentences in similarity matrix
sentence_similarity_graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(sentence_similarity_graph)
print("scores", scores)

scores {0: 0.09163334044231103, 1: 0.04408562199586315, 2: 0.05432094163511732, 3: 0.01779516045726271, 4: 0.09575693097837697, 5: 0.09163334044231103, 6: 0.014924507861294583, 7: 0.0653124759476629, 8: 0.08186947693779716, 9: 0.07522194125679456, 10: 0.05101030185038315, 11: 0.09222556696979164, 12: 0.05146059071771966, 13: 0.05778838761945695, 14: 0.01779516045726271, 15: 0.08787832873399998, 16: 0.009287925696594429}
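
To show what the ranking step does, the following sketch implements PageRank as a plain power iteration over a weight matrix. It is an illustration under stated assumptions, not NetworkX's implementation: the function name `pagerank_scores`, the tolerance, and the iteration limit are chosen for this example, and only the damping factor 0.85 matches the default of `nx.pagerank`.

```python
def pagerank_scores(matrix, damping=0.85, tol=1e-8, max_iter=200):
    """Power-iteration PageRank on a non-negative weight matrix (list of lists)."""
    n = len(matrix)
    row_sums = [sum(row) for row in matrix]
    scores = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(max_iter):
        # Rank held by nodes with no outgoing weight is spread over all nodes.
        dangling = sum(scores[i] for i in range(n) if row_sums[i] == 0)
        new_scores = []
        for j in range(n):
            # Each node passes its score to neighbours in proportion to edge weight.
            incoming = sum(
                scores[i] * matrix[i][j] / row_sums[i]
                for i in range(n) if row_sums[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * (incoming + dangling / n))
        change = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores

# Toy graph: node 0 is linked to nodes 1 and 2, which are not linked to each other.
toy_matrix = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
print(pagerank_scores(toy_matrix))
```

Node 0 receives the highest score because both other nodes point to it. A node with no edges at all, such as the `['\n']` entry at index 16 in the output above, keeps only a small baseline share, which is why it has the lowest score.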


## SORTING SENTENCES BY PAGERANK

In [27]:
# Step 4 - Sort sentences by score, highest first
ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
print("Sentences in ranked order are \n\n", ranked_sentence)

Sentences in ranked order are 

 [(0.09575693097837697, ['Imagine', 'all', 'the', 'people', "livin'", 'for', 'today']), (0.09222556696979164, ['Imagine', 'no', 'possessions']), (0.09163334044231103, ['Imagine', "there's", 'no', 'heaven']), (0.09163334044231103, ['Imagine', "there's", 'no', 'countries']), (0.08787832873399998, ['Imagine', 'all', 'the', 'people', 'sharing', 'all', 'the', 'world']), (0.08186947693779716, ['Imagine', 'all', 'the', 'people', "livin'", 'life', 'in', 'peace']), (0.07522194125679456, ['You', 'may', 'say', "I'm", 'a', 'dreamer', 'but', "I'm", 'not', 'the', 'only', 'one']), (0.0653124759476629, ['Nothing', 'to', 'kill', 'or', 'die', 'for', 'and', 'no', 'religion,', 'too']), (0.05778838761945695, ['No', 'need', 'for', 'greed', 'or', 'hunger']), (0.05432094163511732, ['No', 'hell', 'below', 'us']), (0.05146059071771966, ['I', 'wonder', 'if', 'you', 'can']), (0.05101030185038315, ['I', 'hope', 'someday', "you'll", 'join', 'us', 'and', 'the', 'world', 'wi

## PICKING TOP 'N' SENTENCES

In [28]:
# Step 5 - Choose how many sentences to pick
n = int(input("How many sentences do you want in the summary? "))
# n = 2
n = min(n, len(ranked_sentence))  # avoid an IndexError if n exceeds the sentence count
summarize_text = []
for i in range(n):
    summarize_text.append(" ".join(ranked_sentence[i][1]))

How many sentences do you want in the summary? 7
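
Picking sentences in rank order means the summary lists them from most to least important rather than in the order they appear in the text. One possible variation, sketched below, selects the top `n` by score and then restores document order. The helper name `pick_top_sentences` and the decision to restore the original order are assumptions made for illustration.

```python
def pick_top_sentences(scores, sentences, n):
    """Return the n highest-scoring sentences as strings, in document order."""
    n = max(0, min(n, len(sentences)))  # clamp so an oversized n cannot fail
    # Sort sentence indices by descending score; ties keep the earlier sentence first.
    by_score = sorted(range(len(sentences)), key=lambda i: -scores[i])
    chosen = sorted(by_score[:n])       # back to original document order
    return [" ".join(sentences[i]) for i in chosen]

toy_sentences = [
    ["No", "hell", "below", "us"],
    ["Above", "us,", "only", "sky"],
    ["Imagine", "no", "possessions"],
]
toy_scores = {0: 0.2, 1: 0.1, 2: 0.7}  # same shape as the dict returned by nx.pagerank
print(pick_top_sentences(toy_scores, toy_sentences, 2))
# ['No hell below us', 'Imagine no possessions']
```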


## PRINTING SUMMARY

In [39]:
# Step 6 - Output the summarized text
print("Summarized Text: \n", ". ".join(summarize_text))


## OPEN FILE AND SPLIT INTO SENTENCES


In [40]:
# Open Text5.txt, which contains one paragraph of multiple sentences
with open("Text5.txt", "r") as file:
    filedata = file.readlines()
article = filedata[0].split(". ")  # only the first paragraph is used

sentences = []
for sentence in article:
    print(sentence)
    # Note: str.replace matches the literal text "[^a-zA-Z]", not a regex,
    # so this call leaves the sentence unchanged and punctuation is kept.
    sentences.append(sentence.replace("[^a-zA-Z]", " ").split(" "))

In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills
Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services
As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses
The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning.According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, "With AI being the defining technology of our time, it is transIn an attempt to build an AI-re

## DISPLAYING AS LIST

In [41]:
# Printing the list of sentences
print("Sentences are ", sentences)

Sentences are  [['In', 'an', 'attempt', 'to', 'build', 'an', 'AI-ready', 'workforce,', 'Microsoft', 'announced', 'Intelligent', 'Cloud', 'Hub', 'which', 'has', 'been', 'launched', 'to', 'empower', 'the', 'next', 'generation', 'of', 'students', 'with', 'AI-ready', 'skills'], ['Envisioned', 'as', 'a', 'three-year', 'collaborative', 'program,', 'Intelligent', 'Cloud', 'Hub', 'will', 'support', 'around', '100', 'institutions', 'with', 'AI', 'infrastructure,', 'course', 'content', 'and', 'curriculum,', 'developer', 'support,', 'development', 'tools', 'and', 'give', 'students', 'access', 'to', 'cloud', 'and', 'AI', 'services'], ['As', 'part', 'of', 'the', 'program,', 'the', 'Redmond', 'giant', 'which', 'wants', 'to', 'expand', 'its', 'reach', 'and', 'is', 'planning', 'to', 'build', 'a', 'strong', 'developer', 'ecosystem', 'in', 'India', 'with', 'the', 'program', 'will', 'set', 'up', 'the', 'core', 'AI', 'infrastructure', 'and', 'IoT', 'Hub', 'for', 'the', 'selected', 'campuses'], ['The', 'co

## FUNCTION TO CALCULATE SIMILARITY

In [42]:
# Similarity between two sentences, computed as the cosine similarity of their word-count vectors
def sentence_similarity(sent1, sent2):
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]
    all_words = list(set(sent1 + sent2))
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
    # build the count vector for the first sentence
    for w in sent1:
        vector1[all_words.index(w)] += 1
    # build the count vector for the second sentence
    for w in sent2:
        vector2[all_words.index(w)] += 1
    # cosine_distance returns 1 - cosine similarity, so invert it
    return 1 - cosine_distance(vector1, vector2)

## CREATING SIMILARITY MATRIX


In [43]:
# Create a matrix to store the similarity score for every pair of sentences
similarity_matrix = np.zeros((len(sentences), len(sentences)))

for idx1 in range(len(sentences)):
    for idx2 in range(len(sentences)):
        if idx1 == idx2:  # skip comparing a sentence with itself
            continue
        similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2])

print("Similarity matrix \n", similarity_matrix)

Similarity matrix 
 [[0.         0.20994555 0.32141217 0.6415029  0.20994555 0.32141217
  0.15589237 0.04828045 0.15974461 0.40146253 0.27852425 0.33009387
  0.15569979]
 [0.20994555 0.         0.31546459 0.42735216 1.         0.31546459
  0.4500225  0.41812101 0.31127151 0.18964186 0.15075567 0.30785965
  0.20225996]
 [0.32141217 0.31546459 0.         0.45361105 0.31546459 1.
  0.45317826 0.23897606 0.16943475 0.64517472 0.44312937 0.412959
  0.22019275]
 [0.6415029  0.42735216 0.45361105 0.         0.42735216 0.45361105
  0.78978629 0.28827833 0.26013299 0.46555195 0.34016803 0.39970544
  0.25354628]
 [0.20994555 1.         0.31546459 0.42735216 0.         0.31546459
  0.4500225  0.41812101 0.31127151 0.18964186 0.15075567 0.30785965
  0.20225996]
 [0.32141217 0.31546459 1.         0.45361105 0.31546459 0.
  0.45317826 0.23897606 0.16943475 0.64517472 0.44312937 0.412959
  0.22019275]
 [0.15589237 0.4500225  0.45317826 0.78978629 0.4500225  0.45317826
  0.         0.44155786 0.2282771

## GETTING PAGERANK SCORES

In [44]:
# Step 3 - Rank sentences in similarity matrix
sentence_similarity_graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(sentence_similarity_graph)
print("scores", scores)

scores {0: 0.06505377280671527, 1: 0.08331029356108051, 2: 0.0944374740565271, 3: 0.0984332545355071, 4: 0.08331029356108051, 5: 0.0944374740565271, 6: 0.08956736183704041, 7: 0.06144739693431779, 8: 0.05275205184695904, 9: 0.08067100597330357, 10: 0.0654765113515673, 11: 0.07702790911312701, 12: 0.054075200366247446}


## SORTING SENTENCES BY PAGERANK


In [45]:
# Step 4 - Sort sentences by score, highest first
ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
print("Sentences in ranked order are \n\n", ranked_sentence)

Sentences in ranked order are 

 [(0.0984332545355071, ['The', 'company', 'will', 'provide', 'AI', 'development', 'tools', 'and', 'Azure', 'AI', 'services', 'such', 'as', 'Microsoft', 'Cognitive', 'Services,', 'Bot', 'Services', 'and', 'Azure', 'Machine', 'Learning.According', 'to', 'Manish', 'Prakash,', 'Country', 'General', 'Manager-PS,', 'Health', 'and', 'Education,', 'Microsoft', 'India,', 'said,', '"With', 'AI', 'being', 'the', 'defining', 'technology', 'of', 'our', 'time,', 'it', 'is', 'transIn', 'an', 'attempt', 'to', 'build', 'an', 'AI-ready', 'workforce,', 'Microsoft', 'announced', 'Intelligent', 'Cloud', 'Hub', 'which', 'has', 'been', 'launched', 'to', 'empower', 'the', 'next', 'generation', 'of', 'students', 'with', 'AI-ready', 'skills']), (0.0944374740565271, ['As', 'part', 'of', 'the', 'program,', 'the', 'Redmond', 'giant', 'which', 'wants', 'to', 'expand', 'its', 'reach', 'and', 'is', 'planning', 'to', 'build', 'a', 'strong', 'developer', 'ecosystem', 'in', 'In

## PICKING TOP 'N' SENTENCES


In [46]:
# Step 5 - Choose how many sentences to pick
n = int(input("How many sentences do you want in the summary? "))
# n = 2
n = min(n, len(ranked_sentence))  # avoid an IndexError if n exceeds the sentence count
summarize_text = []
for i in range(n):
    summarize_text.append(" ".join(ranked_sentence[i][1]))

How many sentences do you want in the summary? 7


## PRINTING SUMMARY

In [47]:
# Step 6 - Output the summarized text
print("Summarized Text: \n", ". ".join(summarize_text))

Summarized Text: 
 The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning.According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, "With AI being the defining technology of our time, it is transIn an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills. As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. The company will provide AI de
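
The summary above repeats the "As part of the program" sentence, and the similarity matrix for this text contains off-diagonal entries of 1, which suggests that the input paragraph includes the same sentences more than once. Repeated sentences reinforce each other in the graph and can occupy several slots in the summary. A minimal sketch of one way to filter them out before ranking follows; the helper name `drop_duplicate_sentences` and the case-insensitive comparison are assumptions for illustration.

```python
def drop_duplicate_sentences(sentences):
    """Remove repeated sentences, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for tokens in sentences:
        key = tuple(w.lower() for w in tokens)  # tuples are hashable, lists are not
        if key not in seen:
            seen.add(key)
            unique.append(tokens)
    return unique

toy = [
    ["The", "company", "will", "provide", "tools"],
    ["As", "part", "of", "the", "program"],
    ["the", "company", "will", "provide", "tools"],
]
print(drop_duplicate_sentences(toy))
```

Applied to `sentences` before the similarity matrix is built, this would leave one copy of each sentence to be scored.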