**Sentence Ranking based on Word Frequency (SR-WF)**

In [0]:
import re
import string
import nltk
import heapq
import collections
from nltk.stem import WordNetLemmatizer 
from nltk import sent_tokenize
from nltk import word_tokenize
from nltk.corpus import wordnet

#Downloading all the models and corpora required
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [0]:
from google.colab import files
uploaded = files.upload()

Saving CricketLong.txt to CricketLong (1).txt


In [0]:
dataset = "CricketLong.txt"
def load_dataset(dataset):
  # The file is read as cp1252 (Windows-1252); the context manager closes it even on errors
  with open(dataset, 'r', encoding='cp1252') as file:
    return file.read()  # the whole dataset as a single string

text = load_dataset(dataset)
print("Raw Text:")
print(text)
original_sentences = sent_tokenize(text)

# Target summary length: 25% of the number of sentences in the document
doc_length = len(original_sentences)
summ_length = int(0.25 * doc_length)
print("Ideal Summary Length - "+ str(summ_length))

Raw Text:
Cricket grew out of the many stick-and-ball games played in England 500 years ago, under a variety of different rules. The word bat is an old English word that simply means stick or club. By the seventeenth century, cricket had evolved enough to be recognisable as a distinct game and it was popular enough for its fans to be fined for playing it on Sunday instead of going to church. Till the middle of the eighteenth century, bats were roughly the same shape as hockey sticks, curving outwards at the bottom. There was a simple reason for this: the ball was bowled underarm, along the ground and the curve at the end of the bat gave the batsman the best chance of making contact. How that early version of cricket played in village England grew into the modern game played in giant stadiums in great cities is a proper subject for history because one of the uses of history is to understand how the present was made. And sport is a large part of contemporary life: it is one way in which 

In [0]:
def only_ascii(word):
  # True when every character is plain ASCII (code point <= 127)
  return all(ord(ch) <= 127 for ch in word)

def initial_preprocessing(text):
  # Lowercases each sentence and drops non-ASCII and punctuation tokens
  punctuation = string.punctuation
  cleaned_sentences = []

  sentences = sent_tokenize(text)
  for sentence in sentences:
    word_list = ""
    words = word_tokenize(sentence)
    for word in words:
      # `word not in punctuation` is a substring test against string.punctuation
      if only_ascii(word) and word not in punctuation:
        word_list = word_list + word.lower() + " "
    cleaned_sentences.append(word_list)
  return cleaned_sentences

cleaned_sentences = initial_preprocessing(text)
print(cleaned_sentences)

['cricket grew out of the many stick-and-ball games played in england 500 years ago under a variety of different rules ', 'the word bat is an old english word that simply means stick or club ', 'by the seventeenth century cricket had evolved enough to be recognisable as a distinct game and it was popular enough for its fans to be fined for playing it on sunday instead of going to church ', 'till the middle of the eighteenth century bats were roughly the same shape as hockey sticks curving outwards at the bottom ', 'there was a simple reason for this the ball was bowled underarm along the ground and the curve at the end of the bat gave the batsman the best chance of making contact ', 'how that early version of cricket played in village england grew into the modern game played in giant stadiums in great cities is a proper subject for history because one of the uses of history is to understand how the present was made ', 'and sport is a large part of contemporary life it is one way in whi
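
The punctuation filter above is a substring test against `string.punctuation`, so multi-character punctuation tokens (for example the `` and '' quote tokens that `word_tokenize` can emit, or `...`) may slip through. A minimal sketch of a stricter check follows; the helper name `is_punctuation_token` and the sample token list are assumptions made for illustration and are not part of the notebook.

```python
import string

# Set of single punctuation characters, used for per-character membership tests.
PUNCT_CHARS = set(string.punctuation)

def is_punctuation_token(token):
    """Hypothetical helper: True when the token is non-empty and made only of punctuation."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

# Illustrative tokens (assumed, not taken from the dataset).
tokens = ["``", "cricket", "''", ",", "stick-and-ball", "...", "500"]

# Keep ASCII tokens that are not pure punctuation, lowercased.
kept = [t.lower() for t in tokens if t.isascii() and not is_punctuation_token(t)]
print(kept)
```

Hyphenated words such as `stick-and-ball` are kept because they contain non-punctuation characters. `str.isascii` requires Python 3.7 or later.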

In [0]:
# Build a word-frequency table over the cleaned sentences, skipping stop words
word_count = {}
wordList = []
total_stopwords = nltk.corpus.stopwords.words('english')

for sent in cleaned_sentences:
  for word in sent.split():
    if word not in total_stopwords:  # stop-word removal
      wordList.append(word)
      word_count[word] = word_count.get(word, 0) + 1

sorted_wc = sorted(word_count.items(), key=lambda kv: kv[1], reverse=True)

# Print the frequency table, most frequent word first
for key, val in sorted_wc:
  print(key, "-", val)

# Word weight = count relative to the most frequent word, so weights lie in (0, 1]
max_count = max(word_count.values())  # computed once, outside the loop
word_weights = {}
for key in word_count:
  word_weights[key] = word_count[key] / max_count
print(word_weights)

cricket - 27
game - 13
bat - 10
history - 10
ball - 9
made - 9
modern - 8
time - 8
ground - 7
became - 7
england - 6
one - 6
match - 6
first - 6
crickets - 6
played - 5
century - 5
sport - 5
way - 5
team - 5
like - 5
laws - 5
industrial - 5
games - 4
playing - 4
eighteenth - 4
hockey - 4
gave - 4
village - 4
even - 4
length - 4
size - 4
boundaries - 4
rules - 3
club - 3
shape - 3
early - 3
present - 3
life - 3
social - 3
play - 3
test - 3
half - 3
football - 3
innings - 3
another - 3
specified - 3
22 - 3
shall - 3
two - 3
stumps - 3
inches - 3
pads - 3
also - 3
protective - 3
equipment - 3
revolution - 3
materials - 3
wood - 3
grew - 2
many - 2
stick-and-ball - 2
years - 2
word - 2
simply - 2
enough - 2
reason - 2
along - 2
end - 2
batsman - 2
version - 2
part - 2
fit - 2
today - 2
indian - 2
india - 2
ruling - 2
wider - 2
shaped - 2
look - 2
discuss - 2
country - 2
see - 2
nineteenth - 2
days - 2
takes - 2
much - 2
pitch - 2
yards - 2
lay - 2
area - 2
oval - 2
six - 2
shot - 2
codifie
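
To make the weighting step concrete, here is a small self-contained sketch of the same relative-frequency idea using `collections.Counter`. The toy sentences, the toy stop-word set, and the function name `relative_frequencies` are assumptions for illustration only; the notebook itself uses NLTK's English stop-word list.

```python
from collections import Counter

# Toy data, assumed purely for illustration.
toy_stopwords = {"the", "is", "a", "of", "in", "and"}
toy_sentences = [
    "cricket is a game of bat and ball",
    "the bat is made of wood",
    "cricket grew in england",
]

def relative_frequencies(sentences, stopwords):
    """Count non-stop words and divide each count by the largest count."""
    counts = Counter(
        word
        for sentence in sentences
        for word in sentence.split()
        if word not in stopwords
    )
    if not counts:  # nothing left after stop-word removal
        return {}
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}

toy_weights = relative_frequencies(toy_sentences, toy_stopwords)
print(toy_weights)
```

Here `cricket` and `bat` each occur twice and receive weight 1.0, while words that occur once receive 0.5.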

In [0]:
# A sentence's score is the sum of the weights of its words,
# so it rises with the number of frequent words the sentence contains
sent_score = {}
sent_index = {}
count=0

for sentence in cleaned_sentences:
  sent_index[sentence] = count
  for word in sentence.split():
    if word in word_weights.keys():
      if sentence not in sent_score.keys():
        sent_score[sentence]=word_weights[word]
      else:
        sent_score[sentence]+=word_weights[word]
  count = count+1
# Possible improvement: filter out or penalize overly long sentences,
# which accumulate high scores simply by containing more words

print(sent_score)

{'cricket grew out of the many stick-and-ball games played in england 500 years ago under a variety of different rules ': 2.1111111111111103, 'the word bat is an old english word that simply means stick or club ': 0.8518518518518519, 'by the seventeenth century cricket had evolved enough to be recognisable as a distinct game and it was popular enough for its fans to be fined for playing it on sunday instead of going to church ': 2.370370370370371, 'till the middle of the eighteenth century bats were roughly the same shape as hockey sticks curving outwards at the bottom ': 0.8888888888888886, 'there was a simple reason for this the ball was bowled underarm along the ground and the curve at the end of the bat gave the batsman the best chance of making contact ': 1.7037037037037033, 'how that early version of cricket played in village england grew into the modern game played in giant stadiums in great cities is a proper subject for history because one of the uses of history is to understa
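
Because the score is a plain sum, longer sentences tend to score higher. One possible remedy, sketched below, is to divide the sum by the number of words in the sentence. This normalization is an assumption and not what the notebook does; the function name `score_sentence`, the toy weights, and the toy sentences are hypothetical.

```python
def score_sentence(sentence, weights, normalize=False):
    """Sum the weights of the words in a sentence; optionally divide by word count."""
    words = sentence.split()
    total = sum(weights.get(word, 0.0) for word in words)
    if normalize and words:
        return total / len(words)
    return total

# Toy weights and sentences, assumed purely for illustration.
toy_weights = {"cricket": 1.0, "game": 0.5, "bat": 0.5}
short_sentence = "cricket game"
long_sentence = "cricket is a game played with a bat and a ball on a field"

for label, sentence in [("short", short_sentence), ("long", long_sentence)]:
    raw = score_sentence(sentence, toy_weights)
    norm = score_sentence(sentence, toy_weights, normalize=True)
    print(label, raw, norm)
```

With raw sums the long sentence ranks first; with length normalization the short, dense sentence does. Dividing by the count of all words is only one choice; dividing by the count of non-stop words is another.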

In [0]:
best_sentences = heapq.nlargest(summ_length,sent_score,key=sent_score.get)
summary_indices=[]
summary=''

for sentence in best_sentences:
  summary_indices.append(sent_index[sentence])
summary_indices.sort()  # restore document order so the summary reads coherently

# Emit the original (uncleaned) sentences in document order
for index in summary_indices:
  summary = summary + original_sentences[index] + '\n'
    
print("***SUMMARY***\n\n"+summary)

***SUMMARY***

By the seventeenth century, cricket had evolved enough to be recognisable as a distinct game and it was popular enough for its fans to be fined for playing it on Sunday instead of going to church.
How that early version of cricket played in village England grew into the modern game played in giant stadiums in great cities is a proper subject for history because one of the uses of history is to understand how the present was made.
If tens of millions of Indians today drop everything to watch the Indian team play a Test match or a one-day international, it is reasonable for a history of India to explore how that stick-and-ball game invented in south-eastern England became the ruling passion of the Indian sub-continent.
Our history of cricket will look first at the evolution of cricket as a game in England, and discuss the wider culture of physical training and athleticism of the time.
It will then move to India, discuss the history of the adoption of cricket in this countr
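
The selection step (take the top-scoring sentences, then put them back in document order) can be sketched on its own as follows. This variant keys scores by sentence index rather than by sentence text, so two identical sentences cannot overwrite each other, and it keeps at least one sentence for very short documents. Both choices, along with the name `select_summary` and the toy data, are assumptions rather than the notebook's behaviour.

```python
import heapq

def select_summary(sentences, scores, ratio=0.25):
    """Pick the highest-scoring sentences and return them in document order."""
    # Assumption: keep at least one sentence even when ratio * length rounds down to 0.
    k = max(1, int(ratio * len(sentences)))
    top_indices = heapq.nlargest(k, range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in sorted(top_indices)]

# Toy sentences and scores, assumed purely for illustration.
toy_sentences = ["a", "b", "c", "d", "e", "f", "g", "h"]
toy_scores = [0.2, 0.8, 0.1, 0.3, 0.3, 0.4, 0.0, 0.9]

toy_summary = select_summary(toy_sentences, toy_scores)
print(toy_summary)
```

With eight sentences and a ratio of 0.25, two sentences are kept: the best-scoring one is the last sentence, but sorting the indices places the earlier one first in the output.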

In [0]:
#writing the summary into a text file
from google.colab import files

filename = "SR_WF_Summary.txt"

# Explicit encoding so the output does not depend on the platform default
with open(filename, "w", encoding="utf-8") as f:
  f.write(summary)

files.download(filename)