## Text Summarization - (tf-idf)

#### STEP 1 : Data cleaning 
#### STEP 2 : Score of sentences (tf-idf)
#### STEP 3 : Summary Generation

## Initial Phase
### Importing Libraries and Reading Data 

In [3]:
### import the necessary libraries

import re

import nltk
import numpy as np
import pandas
from nltk.corpus import stopwords

In [4]:
df = pandas.read_csv('../Data/articles.csv')

In [5]:
print(df['text'])

for a in df['text']:
    print(len(a))

0    A small airplane crashed into a government bui...
1    Witnesses reported hearing a loud explosion wh...
2    Italian authorities confirmed that a small air...
3    Automatic text summarization aims to reduce a ...
4    Centroid-based summarization represents docume...
5    Word embeddings allow semantic comparison betw...
Name: text, dtype: object
415
310
280
305
262
242


### Tokenizing the articles into sentences, which are scored later

In [7]:
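### tokenize the concatenated news articles into sentences

from nltk.tokenize import sent_tokenize

# The articles are concatenated without a separator, so the last sentence of
# one article runs directly into the first sentence of the next.
s = "".join(df['text'])
sentences = sent_tokenize(s)
# sentences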
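
Because the articles are concatenated with no separator, the last sentence of one article can fuse with the first sentence of the next (for example `initiated.Witnesses` in the summary output further down). The sketch below illustrates the effect. It uses a naive regex splitter as a stand-in for `sent_tokenize`; the `naive_sent_split` helper and the sample strings are illustrative assumptions, not part of the notebook.

```python
import re

def naive_sent_split(text):
    """Split after ., ! or ? followed by whitespace (rough stand-in for sent_tokenize)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Two hypothetical "articles", each ending with a full stop and no trailing space.
articles = [
    "Police responded quickly. Investigations were initiated.",
    "Witnesses reported an explosion. Smoke was seen.",
]

glued = naive_sent_split("".join(articles))    # no separator between articles
spaced = naive_sent_split(" ".join(articles))  # one space between articles

print(len(glued), len(spaced))
print(glued[1])
```

Joining with a single space (or tokenizing each article separately) keeps the article boundary visible to the splitter.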

## STEP 1 : Data Cleaning
### Cleaning sentences by removing non-alphabetic characters and converting to lowercase

In [8]:
### preprocess the sentences: replace non-alphabetic characters with spaces, convert to
### lower case, and concatenate the results into the variable text

clean_to_original = {}  # maps each cleaned sentence back to its original form
text = ""
for a in sentences:
    temp = re.sub("[^a-zA-Z]", " ", a)
    temp = temp.lower()
    clean_to_original[temp] = a
    text += temp
text

'a small airplane crashed into a government building in the heart of milan  setting the top floors on fire italian police reported that there were no immediate reports of casualties rescue workers attempted to clear the area in the city s financial district the incident caused fears of a possible terrorist attack similar to september    emergency services rushed to the scene and investigations were initiated witnesses reported hearing a loud explosion when a small aircraft struck a    story office building near milan s central train station smoke was seen pouring from the upper floors of the building police and ambulances responded quickly authorities stated that the cause of the crash was under investigation italian authorities confirmed that a small airplane accidentally crashed into an office building in milan the building houses regional government offices no terrorism link was confirmed firefighters managed to control the fire while emergency responders evacuated nearby areas auto
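
The runs of spaces in the output above (after "september" and before "story") appear because every digit and punctuation mark is replaced by its own space. The sketch below reproduces that behavior on one sentence and shows an optional whitespace-collapsing step. The helper names `clean_sentence` and `clean_and_collapse` are hypothetical, and the collapsing step is an assumption about what might be wanted, not something the notebook does.

```python
import re

def clean_sentence(sentence):
    """Replace every non-letter with a space, then lowercase (same regex as the cell above)."""
    return re.sub("[^a-zA-Z]", " ", sentence).lower()

def clean_and_collapse(sentence):
    """Same cleaning, then squeeze runs of whitespace into single spaces."""
    return " ".join(clean_sentence(sentence).split())

raw = "A small aircraft struck a 30-story office building."
print(repr(clean_sentence(raw)))      # digits, hyphen and full stop each become a space
print(repr(clean_and_collapse(raw)))  # single spaces only, no trailing space
```

Extra spaces do not change the word counts computed later, since the tokenizer ignores whitespace; collapsing mainly makes the intermediate text easier to read.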

## STEP 2 : Getting tf-idf score of sentences
### Finding the term frequency (tf) of words found in text

In [9]:
### calculate the frequency of each non-stopword word found in text

stop_words = set(nltk.corpus.stopwords.words('english'))
word_frequencies = {}
for word in nltk.word_tokenize(text):
    if word not in stop_words:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1
print(word_frequencies)

{'small': 3, 'airplane': 2, 'crashed': 2, 'government': 2, 'building': 5, 'heart': 1, 'milan': 3, 'setting': 1, 'top': 1, 'floors': 2, 'fire': 2, 'italian': 2, 'police': 2, 'reported': 2, 'immediate': 1, 'reports': 1, 'casualties': 1, 'rescue': 1, 'workers': 1, 'attempted': 1, 'clear': 1, 'area': 1, 'city': 1, 'financial': 1, 'district': 1, 'incident': 1, 'caused': 1, 'fears': 1, 'possible': 1, 'terrorist': 1, 'attack': 1, 'similar': 1, 'september': 1, 'emergency': 2, 'services': 1, 'rushed': 1, 'scene': 1, 'investigations': 1, 'initiated': 1, 'witnesses': 1, 'hearing': 1, 'loud': 1, 'explosion': 1, 'aircraft': 1, 'struck': 1, 'story': 1, 'office': 2, 'near': 1, 'central': 1, 'train': 1, 'station': 1, 'smoke': 1, 'seen': 1, 'pouring': 1, 'upper': 1, 'ambulances': 1, 'responded': 1, 'quickly': 1, 'authorities': 2, 'stated': 1, 'cause': 1, 'crash': 1, 'investigation': 1, 'confirmed': 2, 'accidentally': 1, 'houses': 1, 'regional': 1, 'offices': 1, 'terrorism': 1, 'link': 1, 'firefighters'
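
The same counting step can be expressed with `collections.Counter` from the standard library. This is a minimal sketch under simplifying assumptions: tokens are split on whitespace instead of using `nltk.word_tokenize`, and `STOP_WORDS` is a tiny hypothetical list standing in for NLTK's English stopword list.

```python
from collections import Counter

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"a", "the", "into", "in", "of", "on"}

def count_terms(text, stop_words=STOP_WORDS):
    """Count whitespace-separated tokens that are not stopwords."""
    return Counter(tok for tok in text.split() if tok not in stop_words)

text = "a small airplane crashed into a building the building houses offices"
freqs = count_terms(text)
print(freqs.most_common(2))
```

`Counter` returns 0 for unseen words, which removes the need for an explicit membership check when incrementing.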

### Finding the weighted frequency of the words

In [10]:
### divide each word frequency by the highest frequency, so the weights lie in (0, 1]

max_freq = max(word_frequencies.values())

for w in word_frequencies:
    word_frequencies[w] /= max_freq
# print(word_frequencies)
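
The cells shown here compute a normalized term frequency only; no inverse document frequency factor appears in them. The sketch below shows one way an IDF factor could be added. It assumes each sentence is treated as one "document" and uses the unsmoothed form idf = log(N / df); both choices, and the helper name `tf_idf_weights`, are assumptions and not taken from the notebook.

```python
import math
from collections import Counter

def tf_idf_weights(tokenized_sentences):
    """Return {word: normalized_tf * idf}, with each sentence acting as one document."""
    n_docs = len(tokenized_sentences)
    # Term frequency over the whole text.
    tf = Counter(w for sent in tokenized_sentences for w in sent)
    # Document frequency: number of sentences containing the word at least once.
    df = Counter(w for sent in tokenized_sentences for w in set(sent))
    max_tf = max(tf.values())
    return {w: (tf[w] / max_tf) * math.log(n_docs / df[w]) for w in tf}

# Hypothetical pre-tokenized, stopword-free sentences.
sents = [
    ["small", "airplane", "crashed", "building"],
    ["building", "houses", "offices"],
    ["firefighters", "control", "fire", "building"],
]
weights = tf_idf_weights(sents)
print(weights["building"], weights["airplane"])
```

With this unsmoothed form, a word that occurs in every sentence (here `building`) gets weight 0, which is the opposite of what the pure frequency weighting does.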

### Calculating sentence scores from the word frequencies

In [11]:
### calculate sentence scores by summing the weighted frequencies of each sentence's words;
### sentences with 30 or more space-separated tokens are skipped

sentence_scores = {}
for sent in sentences:
    if len(sent.split(' ')) >= 30:
        continue
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]
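
To make the scoring rule concrete, here is a small self-contained sketch on toy data. It assumes a crude whitespace-and-strip tokenizer in place of `nltk.word_tokenize`, and the `score_sentences` helper, the weights, and the sentences are all made up for illustration.

```python
def score_sentences(sentences, weights, max_words=30):
    """Sum the weights of known words for each sentence shorter than max_words."""
    scores = {}
    for sent in sentences:
        if len(sent.split(" ")) >= max_words:
            continue  # long sentences are skipped, as in the cell above
        for word in sent.lower().split():
            word = word.strip(".,;:!?\"'")  # crude stand-in for word_tokenize
            if word in weights:
                scores[sent] = scores.get(sent, 0.0) + weights[word]
    return scores

# Hypothetical weights and sentences.
weights = {"building": 1.0, "milan": 0.6, "fire": 0.4}
sentences = [
    "The building is in Milan.",
    "Firefighters controlled the fire.",
    "No link was confirmed.",
]
scores = score_sentences(sentences, weights)
print(scores)
```

A sentence with no weighted words never enters the dictionary, so it cannot be selected for the summary. Because the score is a plain sum, sentences with more weighted words tend to score higher.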

## STEP 3 : Summary Generation
### Outputting the top 17 sentences as the summary

In [12]:
### build the summary from the 17 highest-scoring sentences

import heapq
summary_sentences = heapq.nlargest(17, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
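
`heapq.nlargest` returns the sentences sorted by score, so the summary does not follow the order of the source text. If original order is preferred, one option is to select the top sentences first and then filter the full sentence list. This is a sketch with a hypothetical helper, `top_sentences_in_order`, on toy data.

```python
import heapq

def top_sentences_in_order(sentences, scores, n):
    """Pick the n highest-scoring sentences, then restore their original order."""
    top = set(heapq.nlargest(n, scores, key=scores.get))
    return [s for s in sentences if s in top]

# Hypothetical sentences and scores.
sentences = ["First.", "Second.", "Third.", "Fourth."]
scores = {"First.": 0.2, "Second.": 0.9, "Third.": 0.1, "Fourth.": 0.7}
print(" ".join(top_sentences_in_order(sentences, scores, 2)))
```

If `n` exceeds the number of scored sentences, `heapq.nlargest` simply returns all of them.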

In [13]:
summary_sentences

['Sentences close to the semantic centroid are selected as summaries, improving coherence and relevance.Word embeddings allow semantic comparison between sentences beyond surface word matching.',
 'Emergency services rushed to the scene and investigations were initiated.Witnesses reported hearing a loud explosion when a small aircraft struck a 30-story office building near Milan’s central train station.',
 'Recent approaches utilize word embeddings to capture semantic similarity between words and sentences.Centroid-based summarization represents documents using vector representations.',
 'Authorities stated that the cause of the crash was under investigation.Italian authorities confirmed that a small airplane accidentally crashed into an office building in Milan.',
 'Firefighters managed to control the fire while emergency responders evacuated nearby areas.Automatic text summarization aims to reduce a document into a shorter version while preserving the most important information.',
 '

In [14]:
summary

"Sentences close to the semantic centroid are selected as summaries, improving coherence and relevance.Word embeddings allow semantic comparison between sentences beyond surface word matching. Emergency services rushed to the scene and investigations were initiated.Witnesses reported hearing a loud explosion when a small aircraft struck a 30-story office building near Milan’s central train station. Recent approaches utilize word embeddings to capture semantic similarity between words and sentences.Centroid-based summarization represents documents using vector representations. Authorities stated that the cause of the crash was under investigation.Italian authorities confirmed that a small airplane accidentally crashed into an office building in Milan. Firefighters managed to control the fire while emergency responders evacuated nearby areas.Automatic text summarization aims to reduce a document into a shorter version while preserving the most important information. A small airplane cras

In [15]:
len(summary)

1814