### Text Summarization

Text summarization is the process of extracting the most important content from an input text and producing a concise summary that conveys the main idea of the whole text.


### Steps used in this Algorithm

1.  Import all the necessary libraries

2.  Download the required NLTK Data

3.  Define the  Input Text

4.  Convert the Text into Sentences

5.  Perform Text-Preprocessing on the particular text

6.  Convert the Sentences to TF-IDF Matrix

7.  Calculate the Sentence Scores

8.  Select the Top Sentences

9.  Generate the Final Summary

### Step 1: Import all the necessary libraries

In [1466]:
import    numpy                as      np
import    pandas               as      pd
import    matplotlib.pyplot    as      plt
import    seaborn              as      sns

import    nltk

from      nltk.tokenize                     import  sent_tokenize, RegexpTokenizer, word_tokenize
from      nltk.corpus                       import  stopwords

from      sklearn.feature_extraction.text   import TfidfVectorizer

from      sklearn.model_selection           import train_test_split
from      sklearn.preprocessing             import StandardScaler
from      sklearn.metrics                   import accuracy_score, confusion_matrix, classification_report

### OBSERVATIONS:

1.  numpy --------------->  Numerical computation on arrays

2.  pandas -------------->  Data manipulation

3.  matplotlib ---------->  Data visualization

4.  seaborn ------------->  Statistical data visualization built on matplotlib

5.  nltk ---------------->  NLP library used for text preprocessing

6.  sent_tokenize --------> Splits a paragraph into sentences

7.  RegexpTokenizer ------> Splits text into tokens that match a regular expression (for example, the pattern \w+ keeps words and drops punctuation)

8.  corpus ---------------> NLTK package that provides access to corpora and word lists, such as the stop word lists

9.  stopwords ------------> Very common words (for example "is", "a", "of") that carry little meaning on their own

10. TfidfVectorizer -------> Converts text into a sparse matrix of TF-IDF features

11.  train_test_split -----> Splits data into training and test sets

12.  StandardScaler -------> Standardizes features to zero mean and unit variance

13.  metrics --------------> Functions for evaluating the performance of a model

### Step 2:  Download the required NLTK Data

In [1467]:
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')   # POS tagger; not used in the steps below
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### OBSERVATIONS:

1.  punkt_tab                      -------------->  Punkt tokenizer data used by sent_tokenize and word_tokenize

2.  averaged_perceptron_tagger_eng -------------->  English part-of-speech (POS) tagging model

3.  stopwords                      -------------->  Stop word lists

### Step 3:  Define the  Input Text

In [1468]:
text = """
Artificial Intelligence is transforming industries worldwide.
Machine learning is a subset of AI that enables systems to learn from data.
Deep learning is a specialized form of machine learning using neural networks.
AI is widely used in healthcare, finance, and transportation.
Companies are investing heavily in AI research and development.
"""

In [1469]:
text

'\nArtificial Intelligence is transforming industries worldwide.\nMachine learning is a subset of AI that enables systems to learn from data.\nDeep learning is a specialized form of machine learning using neural networks.\nAI is widely used in healthcare, finance, and transportation.\nCompanies are investing heavily in AI research and development.\n'

### Step 4: Convert the Text into Sentences

In [1470]:
from  nltk.tokenize  import  sent_tokenize

### perform the sentence tokenization on the texts

sentences = sent_tokenize(text)

In [1471]:
sentences

['\nArtificial Intelligence is transforming industries worldwide.',
 'Machine learning is a subset of AI that enables systems to learn from data.',
 'Deep learning is a specialized form of machine learning using neural networks.',
 'AI is widely used in healthcare, finance, and transportation.',
 'Companies are investing heavily in AI research and development.']

### OBSERVATIONS:

1. The text is split into sentences using sent_tokenize.

2. Each sentence is treated separately. The first sentence still begins with '\n' because the input text starts with a newline.
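
For comparison, sentence splitting can be roughly sketched with only the standard library. This is not how `sent_tokenize` works internally (Punkt uses a trained model that copes with abbreviations and similar cases). The helper name `split_sentences` and the rule "split after `.`, `!` or `?` followed by whitespace" are assumptions made for illustration. The sketch also strips the leading newline.

```python
import re

text = """
Artificial Intelligence is transforming industries worldwide.
Machine learning is a subset of AI that enables systems to learn from data.
Deep learning is a specialized form of machine learning using neural networks.
AI is widely used in healthcare, finance, and transportation.
Companies are investing heavily in AI research and development.
"""

def split_sentences(raw):
    """Hypothetical helper: naive split after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", raw.strip())
    return [p for p in parts if p]

simple_sentences = split_sentences(text)
print(len(simple_sentences))
print(simple_sentences[0])
```

A rule this simple breaks on text such as "Dr. Smith", which is one reason to prefer `sent_tokenize` for real input.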

### Step 5: Perform Text-Preprocessing on the particular text

In [1472]:
### Join the list of sentences into a single string

sentences = " ".join(sentences)

In [1473]:
sentences

'\nArtificial Intelligence is transforming industries worldwide. Machine learning is a subset of AI that enables systems to learn from data. Deep learning is a specialized form of machine learning using neural networks. AI is widely used in healthcare, finance, and transportation. Companies are investing heavily in AI research and development.'

In [1474]:
### Remove the newline characters from the text (punctuation is kept)

import re

cleaned_text = re.sub(r'[\n]','',sentences)

In [1475]:
cleaned_text

'Artificial Intelligence is transforming industries worldwide. Machine learning is a subset of AI that enables systems to learn from data. Deep learning is a specialized form of machine learning using neural networks. AI is widely used in healthcare, finance, and transportation. Companies are investing heavily in AI research and development.'

In [1476]:
### perform the word tokenization on the sentences

words = word_tokenize(cleaned_text)

In [1477]:
words

['Artificial',
 'Intelligence',
 'is',
 'transforming',
 'industries',
 'worldwide',
 '.',
 'Machine',
 'learning',
 'is',
 'a',
 'subset',
 'of',
 'AI',
 'that',
 'enables',
 'systems',
 'to',
 'learn',
 'from',
 'data',
 '.',
 'Deep',
 'learning',
 'is',
 'a',
 'specialized',
 'form',
 'of',
 'machine',
 'learning',
 'using',
 'neural',
 'networks',
 '.',
 'AI',
 'is',
 'widely',
 'used',
 'in',
 'healthcare',
 ',',
 'finance',
 ',',
 'and',
 'transportation',
 '.',
 'Companies',
 'are',
 'investing',
 'heavily',
 'in',
 'AI',
 'research',
 'and',
 'development',
 '.']

In [1478]:
### define all the english stop words

from nltk.corpus  import stopwords

english_stopwords = stopwords.words("english")

#print(english_stopwords)


### Remove the stop words from the list of words

res = [x for x in words if(x not in english_stopwords)]

print(res)

['Artificial', 'Intelligence', 'transforming', 'industries', 'worldwide', '.', 'Machine', 'learning', 'subset', 'AI', 'enables', 'systems', 'learn', 'data', '.', 'Deep', 'learning', 'specialized', 'form', 'machine', 'learning', 'using', 'neural', 'networks', '.', 'AI', 'widely', 'used', 'healthcare', ',', 'finance', ',', 'transportation', '.', 'Companies', 'investing', 'heavily', 'AI', 'research', 'development', '.']


In [1479]:
### convert the filtered words to string

res = " ".join(res)

In [1480]:
res

'Artificial Intelligence transforming industries worldwide . Machine learning subset AI enables systems learn data . Deep learning specialized form machine learning using neural networks . AI widely used healthcare , finance , transportation . Companies investing heavily AI research development .'

In [1481]:
### convert the filtered words in string to list of sentences

results = res.split(".")

print(results)

['Artificial Intelligence transforming industries worldwide ', ' Machine learning subset AI enables systems learn data ', ' Deep learning specialized form machine learning using neural networks ', ' AI widely used healthcare , finance , transportation ', ' Companies investing heavily AI research development ', '']


### OBSERVATIONS:

1. The list of sentences is joined into a single string.

2. The newline character '\n' is removed using a regular expression.

3. Word tokenization splits the string into words and punctuation tokens.

4. The English stop words are loaded and removed from the list of tokens. The comparison is case-sensitive, and the NLTK stop words are lowercase.

5. The filtered tokens are joined back into a single string.

6. This string is split on '.' using the split function to obtain a list of sentence strings.

7. The output is the list of preprocessed sentences. Because the text ends with '.', the list also contains a trailing empty string.
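
As an alternative sketch, each sentence can be preprocessed on its own, which keeps one cleaned string per sentence and avoids the join/split round trip (and the trailing empty string). The helper `preprocess` and the tiny `STOP_WORDS` set are hypothetical and only for illustration; the notebook itself uses `stopwords.words("english")`.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only;
# the notebook itself uses nltk's stopwords.words("english").
STOP_WORDS = {"is", "a", "of", "that", "to", "from", "in", "and", "are"}

def preprocess(sentence):
    """Lowercase, keep alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"\w+", sentence.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

raw_sentences = [
    "Artificial Intelligence is transforming industries worldwide.",
    "AI is widely used in healthcare, finance, and transportation.",
]
cleaned = [preprocess(s) for s in raw_sentences]
print(cleaned)
```

Lowercasing before the stop-word check means capitalized stop words such as "The" would also be removed, unlike the case-sensitive comparison.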

### Step 6:  Convert the Sentences to TF-IDF Matrix

In [1482]:
### Create a TfidfVectorizer object

tfidf = TfidfVectorizer()

### Fit the vectorizer on the preprocessed sentences and transform them

input_sparse_matrix = tfidf.fit_transform(results)

In [1483]:
input_sparse_matrix

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 33 stored elements and shape (6, 29)>

### OBSERVATIONS:

1.  The result is a sparse matrix, which stores only the non-zero values: 33 of the 6 x 29 = 174 entries. The sixth row comes from the trailing empty string in results.
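
The shape and stored-element count reported above show how sparse the matrix is. A small arithmetic sketch, using only those reported numbers:

```python
# Values taken from the matrix summary above.
stored_elements = 33
n_rows, n_cols = 6, 29

total_cells = n_rows * n_cols            # all cells of the dense matrix
density = stored_elements / total_cells  # fraction of non-zero cells
print(f"{total_cells} cells, density = {density:.3f}")
```

Roughly one cell in five is non-zero, so most of the dense array consists of zeros.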

In [1484]:
### Now convert the sparse matrix to numpy array for better view

X = input_sparse_matrix.toarray()

In [1485]:
X

array([[0.        , 0.4472136 , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.4472136 , 0.4472136 , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.4472136 , 0.        ,
        0.        , 0.        , 0.        , 0.4472136 ],
       [0.26501964, 0.        , 0.        , 0.38280352, 0.        ,
        0.        , 0.38280352, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.38280352,
        0.31390437, 0.31390437, 0.        , 0.        , 0.        ,
        0.        , 0.38280352, 0.38280352, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.32682326,
        0.        , 0.        , 0.        , 0.32682326, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
  

### OBSERVATIONS:

1. The sparse matrix is converted into a dense NumPy array.

2. fit_transform(results)  ---->  learns the vocabulary from the filtered sentences and builds the matrix of TF-IDF scores

3. Each row holds the TF-IDF scores of one sentence. An entry is 0 when the word does not occur in that sentence; among the words that do occur, those that appear in fewer sentences receive higher scores.
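
The scores can be reproduced by hand as a check. The sketch below assumes the default `TfidfVectorizer` settings: lowercasing, tokens of two or more word characters, raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The names `docs`, `tokenize` and `tfidf_row` are hypothetical and not part of scikit-learn.

```python
import math
import re
from collections import Counter

# The six preprocessed strings from Step 5, including the trailing empty one.
docs = [
    "Artificial Intelligence transforming industries worldwide ",
    " Machine learning subset AI enables systems learn data ",
    " Deep learning specialized form machine learning using neural networks ",
    " AI widely used healthcare , finance , transportation ",
    " Companies investing heavily AI research development ",
    "",
]

def tokenize(doc):
    # Assumed default behaviour: lowercase, tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

counts = [Counter(tokenize(d)) for d in docs]
n_docs = len(docs)
# Document frequency: in how many documents each term occurs.
df = Counter(term for c in counts for term in c)

def tfidf_row(term_counts):
    """TF-IDF weights for one document, L2-normalized (empty dict if no terms)."""
    raw = {t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1)
           for t, tf in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else {}

rows = [tfidf_row(c) for c in counts]
print(round(rows[1]["ai"], 8))
print([round(sum(r.values()), 8) for r in rows])
```

In the second sentence, "ai" occurs in three sentences and so receives a lower weight than words such as "subset" that occur in only one.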

### Step 7: Calculate the Sentence Scores

In [1486]:
sentences_scores = np.sum(X,axis=1)

In [1487]:
sentences_scores

array([2.23606798, 2.80684599, 2.76493883, 2.43179145, 2.43179145,
       0.        ])

### OBSERVATIONS:

1. Summing each row gives the total TF-IDF score of each sentence.

2. The sentence with the highest total TF-IDF score is considered the most important. The last score is 0 because the sixth entry is the empty string.
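
One property of this score is worth noting. Assuming the rows are L2-normalized (the `TfidfVectorizer` default), a sentence whose k distinct terms all carry equal weight gets 1/sqrt(k) per term and a total of sqrt(k). The first sentence, with five equally weighted terms, is such a case:

```python
import math

k = 5                      # distinct terms in the first sentence
weight = 1 / math.sqrt(k)  # each term's weight after L2 normalization
score = k * weight         # equals sqrt(k)
print(round(weight, 7), round(score, 8))
```

The summed score therefore tends to grow with the number of distinct terms, so longer sentences are favoured by this scoring.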

### Step 8: Select the Top Sentences

In [1488]:
### extract the indices of the top 2 most important sentences

topn = 2
ans = sentences_scores.argsort()[::-1]   # indices sorted by score, highest first
x = ans[:topn]

In [1489]:
print(x)

[1 2]


### OBSERVATIONS:

1. The sentences present at index 1 and index 2 are the top 2 most important sentences.
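
The same selection can be sketched without NumPy, using the scores printed in Step 7. The names `scores`, `top_n`, `best` and `in_document_order` are hypothetical. Re-sorting the chosen indices is an optional extra that keeps the summary in the original sentence order.

```python
import heapq

# Sentence scores from Step 7.
scores = [2.23606798, 2.80684599, 2.76493883, 2.43179145, 2.43179145, 0.0]
top_n = 2

# Indices of the top_n highest scores, best first.
best = heapq.nlargest(top_n, range(len(scores)), key=scores.__getitem__)
# Re-sort so the summary keeps the original sentence order.
in_document_order = sorted(best)
print(best, in_document_order)
```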

### Step 9:  Generate the Final Summary

In [1490]:
sentences

'\nArtificial Intelligence is transforming industries worldwide. Machine learning is a subset of AI that enables systems to learn from data. Deep learning is a specialized form of machine learning using neural networks. AI is widely used in healthcare, finance, and transportation. Companies are investing heavily in AI research and development.'

In [1491]:
### Split the joined string back into a list of sentences

sentences = sentences.split(".")

In [1492]:
sentences

['\nArtificial Intelligence is transforming industries worldwide',
 ' Machine learning is a subset of AI that enables systems to learn from data',
 ' Deep learning is a specialized form of machine learning using neural networks',
 ' AI is widely used in healthcare, finance, and transportation',
 ' Companies are investing heavily in AI research and development',
 '']

In [1493]:
### Extract the top 2 most important sentences and generate the final summary

res = [sentences[i] for i in x]

In [1494]:
res

[' Machine learning is a subset of AI that enables systems to learn from data',
 ' Deep learning is a specialized form of machine learning using neural networks']
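
To turn the selected fragments into a single summary string, one possible sketch trims the whitespace, restores the period removed by `split(".")`, and joins the results. The names `selected` and `summary` are hypothetical.

```python
# The two selected fragments from the cell above.
selected = [
    " Machine learning is a subset of AI that enables systems to learn from data",
    " Deep learning is a specialized form of machine learning using neural networks",
]

# Trim whitespace, restore the period removed by split("."), and join.
summary = " ".join(s.strip() + "." for s in selected if s.strip())
print(summary)
```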