**Sentence Ranking using Lemmatization - based on Word Frequency (SRL-WF)**

In [0]:
import re
import string
import nltk
import heapq
import collections
from nltk.stem import WordNetLemmatizer 
from nltk import sent_tokenize
from nltk import word_tokenize
from nltk.corpus import wordnet

#Downloading all the models and corpora required
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [0]:
from google.colab import files
uploaded = files.upload()

Saving CricketLong.txt to CricketLong.txt


In [0]:
dataset = "CricketLong.txt"
def load_dataset(dataset):
  #the context manager closes the file even if reading fails
  with open(dataset,'r',encoding='cp1252') as file:
    text = file.read()  #contains the dataset as a string
  return text

text = load_dataset(dataset)
print("Raw Text:")
print(text)
original_sentences = sent_tokenize(text)

#summary length is set to 25% of the number of sentences, with a minimum of one sentence
doc_length = len(original_sentences)
summ_length = max(1, int(0.25 * doc_length))
print("Ideal Summary Length - "+ str(summ_length))

Raw Text:
Cricket grew out of the many stick-and-ball games played in England 500 years ago, under a variety of different rules. The word bat is an old English word that simply means stick or club. By the seventeenth century, cricket had evolved enough to be recognisable as a distinct game and it was popular enough for its fans to be fined for playing it on Sunday instead of going to church. Till the middle of the eighteenth century, bats were roughly the same shape as hockey sticks, curving outwards at the bottom. There was a simple reason for this: the ball was bowled underarm, along the ground and the curve at the end of the bat gave the batsman the best chance of making contact. How that early version of cricket played in village England grew into the modern game played in giant stadiums in great cities is a proper subject for history because one of the uses of history is to understand how the present was made. And sport is a large part of contemporary life: it is one way in which 

In [0]:
def only_ascii(word):
  #True only if every character is within the 7-bit ASCII range
  return all(ord(ch) <= 127 for ch in word)

def initial_preprocessing(text):
  punctuation = string.punctuation
  cleaned_sentences = []

  sentences = sent_tokenize(text)
  for sentence in sentences:
    word_list = ""
    words = word_tokenize(sentence)
    for word in words:
      if(only_ascii(word)):
        if word not in punctuation:
          word_list = word_list + word.lower() + " "
    cleaned_sentences.append(word_list)   
  return cleaned_sentences 

cleaned_sentences = initial_preprocessing(text)
print(cleaned_sentences)

['cricket grew out of the many stick-and-ball games played in england 500 years ago under a variety of different rules ', 'the word bat is an old english word that simply means stick or club ', 'by the seventeenth century cricket had evolved enough to be recognisable as a distinct game and it was popular enough for its fans to be fined for playing it on sunday instead of going to church ', 'till the middle of the eighteenth century bats were roughly the same shape as hockey sticks curving outwards at the bottom ', 'there was a simple reason for this the ball was bowled underarm along the ground and the curve at the end of the bat gave the batsman the best chance of making contact ', 'how that early version of cricket played in village england grew into the modern game played in giant stadiums in great cities is a proper subject for history because one of the uses of history is to understand how the present was made ', 'and sport is a large part of contemporary life it is one way in whi

In [0]:
#POS Tagging
def get_mapping(first_char):
  tag2first = {
              "J": wordnet.ADJ,
              "N": wordnet.NOUN,
              "V": wordnet.VERB,
              "R": wordnet.ADV
              }
  #If the first char doesn't fall under any of the above categories, the word is treated as a noun for lemmatization.
  #Words with no matching noun lemma are returned unaltered.
  return tag2first.get(first_char, wordnet.NOUN)


#getting the wordnet based POS tag of the word to feed in the lemmatizer
def get_tagged_sentence(sentence):
  #Sentence is split up into words once and passed to the tagger
  pos_tag = nltk.pos_tag(sentence.split())
  #the first letter of each POS tag selects the WordNet category
  return [get_mapping(tag[0].upper()) for _, tag in pos_tag]


#Combining lemmatization with POS tagging for meaningful lemmatization with context
def sentence_lemmatization(cleaned_sentences):
  processed_sentences = []
  lr = WordNetLemmatizer()

  for sentence in cleaned_sentences:
    lemmatized_sentence = ""
    sent_tag = get_tagged_sentence(sentence)
    #pair every word with its own POS tag instead of tracking an index by hand
    for word, tag in zip(sentence.split(), sent_tag):
      root_word = lr.lemmatize(word, tag)
      lemmatized_sentence = lemmatized_sentence + root_word.lower() + " "
    processed_sentences.append(lemmatized_sentence)
  return processed_sentences

lemmatized_text = sentence_lemmatization(cleaned_sentences)
print("---------------------------After Lemmatization------------------------")
print(lemmatized_text)

---------------------------After Lemmatization------------------------
['cricket grow out of the many stick-and-ball game play in england 500 year ago under a variety of different rule ', 'the word bat be an old english word that simply mean stick or club ', 'by the seventeenth century cricket have evolve enough to be recognisable a a distinct game and it be popular enough for it fan to be fin for play it on sunday instead of go to church ', 'till the middle of the eighteenth century bat be roughly the same shape a hockey stick curve outwards at the bottom ', 'there be a simple reason for this the ball be bowl underarm along the ground and the curve at the end of the bat give the batsman the best chance of make contact ', 'how that early version of cricket play in village england grow into the modern game play in giant stadium in great city be a proper subject for history because one of the us of history be to understand how the present be make ', 'and sport be a large part of contempo
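The tag mapping used in the cell above can be looked at in isolation. The following is only a minimal sketch: it writes the WordNet POS codes out as plain strings (they equal `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB` and `wordnet.ADV`) so that it runs without NLTK, and `penn_to_wordnet` and the hand-written tagger output are hypothetical, for illustration only. It shows that any tag outside the four categories falls back to noun, which appears consistent with 'as' turning into 'a' in the output above.

```python
# WordNet POS codes written out as strings so the sketch runs without NLTK.
WORDNET_POS = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag such as 'VBD' to a WordNet POS code (noun by default)."""
    return WORDNET_POS.get(penn_tag[:1].upper(), "n")

# Hypothetical tagger output, written by hand for illustration.
tagged = [("bats", "NNS"), ("were", "VBD"), ("roughly", "RB"), ("as", "IN")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

One possible refinement, not part of this notebook, is to skip lemmatization for words whose tag has no WordNet counterpart instead of treating them as nouns.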

In [0]:
#creating a weighted histogram from the sentences
word_count = {}
wordList = []
total_stopwords = set(nltk.corpus.stopwords.words('english'))  #a set makes membership tests fast

#obtaining word frequency
for sent in lemmatized_text:
  for word in sent.split():
    if word not in total_stopwords:  #Stop-Word removal
      wordList.append(word)
      word_count[word] = word_count.get(word, 0) + 1
            

sorted_wc = sorted(word_count.items(), key=lambda kv: kv[1], reverse=True)

#printing word frequency table
for key, val in sorted_wc:
    print(key,"-",val)

#calculating word weights
max_count = max(word_count.values(), default=1)   #count of the most frequent word, computed once
word_weights={}
for key in word_count.keys():
    word_weights[key]=word_count[key]/max_count   #relative frequency of the word
print(word_weights)

cricket - 33
game - 17
bat - 11
play - 10
ball - 10
make - 10
history - 10
time - 9
ground - 8
modern - 8
become - 8
match - 7
change - 7
england - 6
century - 6
one - 6
sport - 6
first - 6
law - 6
rule - 5
shape - 5
way - 5
team - 5
like - 5
boundary - 5
industrial - 5
go - 4
eighteenth - 4
hockey - 4
give - 4
early - 4
village - 4
wide - 4
day - 4
even - 4
length - 4
size - 4
stump - 4
common - 4
material - 4
year - 3
club - 3
bowl - 3
batsman - 3
present - 3
life - 3
social - 3
indian - 3
test - 3
see - 3
take - 3
half - 3
football - 3
inning - 3
another - 3
specify - 3
22 - 3
shot - 3
shall - 3
two - 3
umpire - 3
inch - 3
limit - 3
pad - 3
also - 3
protective - 3
equipment - 3
revolution - 3
wood - 3
grow - 2
many - 2
stick-and-ball - 2
word - 2
simply - 2
mean - 2
stick - 2
enough - 2
curve - 2
reason - 2
along - 2
end - 2
version - 2
city - 2
part - 2
fit - 2
today - 2
india - 2
look - 2
discuss - 2
country - 2
nineteenth - 2
draw - 2
much - 2
playing - 2
characteristic - 2
pitch
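The weighting step divides each count by the count of the most frequent word, so with the table above 'cricket' gets 33/33 = 1.0 and 'game' gets 17/33, roughly 0.52. The sketch below restates that idea with `collections.Counter`; the function name `relative_frequencies` and the toy sentences and stop-word set are assumptions made for illustration, not part of the notebook.

```python
from collections import Counter

def relative_frequencies(sentences, stopwords):
    """Count non-stop-words and scale every count by the largest one."""
    counts = Counter(
        word
        for sentence in sentences
        for word in sentence.split()
        if word not in stopwords
    )
    if not counts:
        return {}
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}

# Toy input, standing in for lemmatized_text and the NLTK stop-word list.
toy_sentences = ["cricket be a game ", "the game of cricket grow ", "cricket bat "]
toy_stopwords = {"be", "a", "the", "of"}
weights = relative_frequencies(toy_sentences, toy_stopwords)
print(weights)
```

The most frequent word always receives weight 1.0, so the weights are comparable across documents of different lengths.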

In [0]:
#A sentence's score is the sum of the weights of its words, so it goes up with the number of frequent words it contains
         
sent_score = {}
sent_index = {}
count=0

for sentence in lemmatized_text:
  sent_index[sentence] = count
  for word in sentence.split():
    if word in word_weights:
      sent_score[sentence] = sent_score.get(sentence, 0) + word_weights[word]
  count = count+1
#can be improved by accounting for sentence length, since wordy sentences accumulate higher scores

print(sent_score)

{'cricket grow out of the many stick-and-ball game play in england 500 year ago under a variety of different rule ': 2.545454545454545, 'the word bat be an old english word that simply mean stick or club ': 0.7878787878787877, 'by the seventeenth century cricket have evolve enough to be recognisable a a distinct game and it be popular enough for it fan to be fin for play it on sunday instead of go to church ': 2.5454545454545454, 'till the middle of the eighteenth century bat be roughly the same shape a hockey stick curve outwards at the bottom ': 1.1818181818181817, 'there be a simple reason for this the ball be bowl underarm along the ground and the curve at the end of the bat give the batsman the best chance of make contact ': 1.8787878787878782, 'how that early version of cricket play in village england grow into the modern game play in giant stadium in great city be a proper subject for history because one of the us of history be to understand how the present be make ': 4.36363636
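Because the score is a plain sum, long sentences tend to rank higher than short ones. One possible variation, which is not what this notebook does, is to divide each score by the sentence's word count. The sketch below is a hedged illustration of that idea: `score_sentences`, its `normalize` flag, and the toy weights are hypothetical. It also keys scores by sentence position rather than by sentence text, so two identical sentences would not share an entry, and it gives sentences without any weighted word a score of 0 instead of leaving them out.

```python
def score_sentences(sentences, word_weights, normalize=False):
    """Sum word weights per sentence; optionally divide by the word count."""
    scores = {}
    for index, sentence in enumerate(sentences):
        words = sentence.split()
        total = sum(word_weights.get(word, 0.0) for word in words)
        if normalize and words:
            total /= len(words)
        scores[index] = total
    return scores

# Toy weights and sentences, invented for illustration.
toy_weights = {"cricket": 1.0, "game": 0.5, "bat": 0.25}
toy_sentences = [
    "cricket game ",
    "the cricket bat be a long game with many rule ",
]
raw = score_sentences(toy_sentences, toy_weights)
norm = score_sentences(toy_sentences, toy_weights, normalize=True)
print(raw)
print(norm)
```

In this toy input the longer sentence wins on the raw sum, while the shorter one wins after normalization. Whether that trade-off improves the summaries would need to be checked on the actual dataset.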

In [0]:
best_sentences = heapq.nlargest(summ_length,sent_score,key=sent_score.get)
summary_indices=[]
summary=''

for sentence in best_sentences:
  summary_indices.append(sent_index[sentence])
summary_indices.sort() #restore document order so the summary reads coherently

for index in summary_indices:
  summary = summary + original_sentences[index] +'\n'
    
print("***SUMMARY***\n\n"+summary)

***SUMMARY***

Cricket grew out of the many stick-and-ball games played in England 500 years ago, under a variety of different rules.
By the seventeenth century, cricket had evolved enough to be recognisable as a distinct game and it was popular enough for its fans to be fined for playing it on Sunday instead of going to church.
How that early version of cricket played in village England grew into the modern game played in giant stadiums in great cities is a proper subject for history because one of the uses of history is to understand how the present was made.
If tens of millions of Indians today drop everything to watch the Indian team play a Test match or a one-day international, it is reasonable for a history of India to explore how that stick-and-ball game invented in south-eastern England became the ruling passion of the Indian sub-continent.
Our history of cricket will look first at the evolution of cricket as a game in England, and discuss the wider culture of physical training
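The selection step picks the highest-scoring sentences with `heapq.nlargest` and then puts them back in document order. The sketch below isolates that step under simplifying assumptions: scores are keyed by sentence position, and `select_summary` and the toy data are hypothetical names used only for illustration.

```python
import heapq

def select_summary(original_sentences, scores, k):
    """Return the k best-scoring sentences in their original order."""
    top_indices = heapq.nlargest(k, scores, key=scores.get)
    return [original_sentences[i] for i in sorted(top_indices)]

# Toy sentences and scores, invented for illustration.
toy_sentences = ["First.", "Second.", "Third.", "Fourth."]
toy_scores = {0: 0.2, 1: 0.9, 2: 0.1, 3: 0.7}
picked = select_summary(toy_sentences, toy_scores, 2)
print(picked)
```

Sorting the selected indices is what keeps 'Second.' ahead of 'Fourth.' even though ranking alone would only guarantee the highest score comes first.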

In [0]:
#writing the summary into a text file
from google.colab import files

filename = "SRL_WF_Summary.txt"

#an explicit encoding keeps the output independent of the platform default
with open(filename,"w",encoding="utf-8") as f:
  f.write(summary)

files.download(filename)