<a href="https://colab.research.google.com/github/jigarsiddhpura/TextSummarizer/blob/main/TextSummarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [100]:
# !pip install -U pip setuptools wheel
# !pip install -U spacy
# !python -m spacy download en_core_web_lg

In [87]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [110]:
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from heapq import nlargest


In [130]:
document = "This paper focuses on the genre of Gothic romance, and how this genre has evolved into the 21st century. By examining Mike Flannagan’s 2020 tv-show The Haunting of Bly Manor, I will argue that this tv-show subverts the genre of Gothic romance. The Haunting of Bly Manor is a modern adaptation of Henry James’ The Turn of the Screw. It also includes some of James’ other short stories, such as ‘The Romance of Certain Old Clothes’ and ‘The Jolly Corner’. The Haunting of Bly contains all the typical ingredients for a gothic romance – a young governess, a large, haunted manor, two beautiful but slightly peculiar children, and ghosts – but its message concerning love diverges from older specimen. The Gothic helps articulate the social, cultural, and political anxieties of society, and these naturally differ between works depending on the time they were written (Schmitt, 2007). With emancipation and the emergence of women’s rights, the conventional central problem of the (Victorian) Gothic Romance – love as the main hardship in the narrative – is no longer feasible in our 21st century society. Consequently, in The Haunting of Bly, the love story is not the central problem of the narrative – as it is in many gothic romances like Jane Eyre – but rather the solution to plot’s main obstacle. Furthermore, the romance in The Haunting of Bly Manor is a queer romance. With his adaptation of The Turn of the Screw, Flannagan gives the Gothic Romance a modern spin and subverts the genre of Gothic Romance. This is more fitting to a 21st century Gothic Romance, as opposed to the (classic) Victorian Gothic romance."

### Preprocessing

In [131]:
doc = nlp(document)

In [132]:
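keywords = []
# Use a set for fast membership tests; keep "not" because negation changes meaning
stopWords = set(STOP_WORDS)
stopWords.discard("not")
# Keep only content-bearing parts of speech
pos_tag = ['NOUN', 'VERB', 'ADJ', 'PROPN']

for token in doc:
  # Skip stop words and punctuation
  if token.text in stopWords or token.text in punctuation:
    continue
  if token.pos_ in pos_tag:
    keywords.append(token.text)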

In [133]:
sentences = list(doc.sents)
processed_sentences = []

for sent in sentences:
  # Keep lemmas of content words only, dropping stop words and punctuation
  processed_sent = " ".join(
      token.lemma_ for token in sent
      if token.pos_ in pos_tag and token.text not in stopWords and not token.is_punct
  )
  processed_sentences.append(processed_sent)

### Identify important sentences

In [134]:
vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(processed_sentences)

In [135]:
tf_idf

<12x83 sparse matrix of type '<class 'numpy.float64'>'
	with 132 stored elements in Compressed Sparse Row format>

In [136]:
sentence_scores = tf_idf.sum(axis=1)
sentence_scores

matrix([[2.65111274],
        [3.15168843],
        [2.60685802],
        [1.41421356],
        [3.10972681],
        [4.58939231],
        [3.70356065],
        [4.16518503],
        [3.77580733],
        [2.1913767 ],
        [3.30684514],
        [2.81226646]])
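
The row sums above depend on how many distinct terms a sentence contains. `TfidfVectorizer` L2-normalizes each row by default, so a row with k equally weighted terms sums to sqrt(k); the score of 1.41421356 in the fourth row is consistent with a sentence that kept only two terms. The sketch below reproduces that arithmetic with the standard library only. `l2_normalize` and `row_sum_score` are illustrative helpers, not part of scikit-learn.

```python
import math

def l2_normalize(weights):
    """Scale a weight vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

def row_sum_score(weights):
    """Sum of an L2-normalized row, mirroring tf_idf.sum(axis=1) for one sentence."""
    return sum(l2_normalize(weights))

# Two equally weighted terms versus ten equally weighted terms
two_terms = row_sum_score([1.0, 1.0])
ten_terms = row_sum_score([1.0] * 10)
print(round(two_terms, 8))  # 1.41421356
print(ten_terms > two_terms)  # True
```

Because the sum grows with the number of distinct terms, this scoring tends to favor longer sentences; dividing by the term count would be one possible way to offset that.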

In [137]:
# argsort is ascending, so the index of the highest-scoring sentence comes last
weighted_column_indices = np.argsort(sentence_scores, axis=0)
print("Sorted row indices:", weighted_column_indices)

Sorted row indices: [[ 3]
 [ 9]
 [ 2]
 [ 0]
 [11]
 [ 4]
 [ 1]
 [10]
 [ 6]
 [ 8]
 [ 7]
 [ 5]]


In [138]:
weighted_indices = np.ravel(weighted_column_indices)
weighted_indices

array([ 3,  9,  2,  0, 11,  4,  1, 10,  6,  8,  7,  5])

In [139]:
TOP_SENT_COUNT = 5

In [140]:
top_sentences = []
for index in weighted_indices[-TOP_SENT_COUNT:]:
  top_sentences.append(processed_sentences[index])

### Summarize the text

In [141]:
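# Caveat: without a key, nlargest compares the strings themselves,
# so this keeps the 3 alphabetically largest sentences, not the 3 highest-scoring ones
summarized_sentences = nlargest(3,top_sentences)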

In [142]:
summarized_sentences

['haunting Bly love story central problem narrative gothic romance Jane Eyre solution plot main obstacle',
 'haunting Bly contain typical ingredient gothic romance young governess large haunted manor beautiful peculiar child ghost message concern love diverge old specimen',
 'emancipation emergence woman right conventional central problem Victorian Gothic Romance love main hardship narrative feasible 21st century society']
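
Calling `nlargest(3, top_sentences)` on a list of strings orders them alphabetically, which is why the three sentences above appear in reverse alphabetical order. A minimal sketch of selecting by TF-IDF score instead is shown below, assuming the row sums printed earlier (rounded here to four decimals). The helper `top_sentence_indices` is hypothetical and not part of the notebook.

```python
from heapq import nlargest

# Row sums copied from the sentence_scores output, rounded to 4 decimals
scores = [2.6511, 3.1517, 2.6069, 1.4142, 3.1097, 4.5894,
          3.7036, 4.1652, 3.7758, 2.1914, 3.3068, 2.8123]

def top_sentence_indices(scores, k):
    """Return the indices of the k highest scores, in original document order."""
    best = nlargest(k, range(len(scores)), key=lambda i: scores[i])
    return sorted(best)

top_indices = top_sentence_indices(scores, 3)
print(top_indices)  # [5, 7, 8]
```

Indexing `sentences` (the original spaCy spans) with these indices, rather than `processed_sentences`, would return the source sentences verbatim, so no reconstruction step would be needed.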

In [143]:
# Define a function to process each sentence
def process_sentence(sentence):
    # Run the spaCy pipeline on the sentence (tokenization, tagging, parsing)
    doc = nlp(sentence)

    # Lemmatize the words in the sentence and join them back into a string
    lemmas = [token.lemma_ for token in doc]
    sentence_text = ' '.join(lemmas)

    # Re-parse the lemmatized text and rebuild a rough sentence from it
    # (heuristic; the result is not guaranteed to be grammatical)
    sentence_doc = nlp(sentence_text)
    sentence_text = ''
    for i, token in enumerate(sentence_doc):
        # Add spaces between words
        if i > 0:
            sentence_text += ' '
        # Prefix every non-compound noun with the article 'a' (a crude heuristic)
        if token.pos_ == 'NOUN' and token.dep_ != 'compound':
            sentence_text += 'a '
        # Add the word to the sentence
        sentence_text += token.text
    # Capitalize the first letter; note that str.capitalize() also lowercases the rest
    sentence_text = sentence_text.capitalize()
    # Add a period to the end of the sentence
    sentence_text += '.'
    return sentence_text


In [144]:
# Use a new name so the lemmatized sentences from the preprocessing step are not overwritten
readable_sentences = [process_sentence(sentence) for sentence in summarized_sentences]

# Print the results
for sentence in readable_sentences:
    print(sentence)

Haunt bly a love a story central problem a narrative gothic a romance jane eyre solution a plot main a obstacle.
Haunt bly contain typical a ingredient gothic a romance young a governess large haunt a manor beautiful peculiar child ghost message concern a love diverge old a specimen.
Emancipation emergence a woman right conventional central a problem victorian gothic romance love main hardship a narrative feasible 21st century a society.
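
The proper nouns in the output above ("bly", "jane eyre") are lowercased because `str.capitalize()` uppercases the first character and lowercases everything after it. A small sketch of an alternative that leaves the rest of the string untouched follows. `capitalize_first` is an illustrative helper, not something the notebook defines.

```python
def capitalize_first(text):
    """Uppercase only the first character, leaving the rest unchanged."""
    return text[:1].upper() + text[1:]

sample = "haunting Bly love story"
print(sample.capitalize())       # Haunting bly love story
print(capitalize_first(sample))  # Haunting Bly love story
```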


### Using Hugging Face Transformers

In [150]:
# Colab workaround: force a UTF-8 locale so the shell command below can run
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1

In [151]:
from transformers import pipeline
import os

In [152]:
# Expose only the first GPU (device 0) to this process
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [154]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
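
The warning above can be avoided by naming the model and revision explicitly, which also keeps results reproducible if the default ever changes. The sketch below pins the same model and revision that the warning reports. The constant names and the `build_summarizer` wrapper are assumptions made for illustration, and the import sits inside the function so that defining it does not require `transformers` to be installed.

```python
# Model and revision taken from the pipeline warning message
SUMMARIZER_MODEL = "sshleifer/distilbart-cnn-12-6"
SUMMARIZER_REVISION = "a4f8f3e"

def build_summarizer():
    """Create a summarization pipeline pinned to a specific model and revision."""
    from transformers import pipeline  # imported lazily; downloads weights on first use
    return pipeline(
        "summarization",
        model=SUMMARIZER_MODEL,
        revision=SUMMARIZER_REVISION,
    )
```

Calling `summarizer = build_summarizer()` would then replace the bare `pipeline("summarization")` call.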


In [155]:
summary_text = summarizer(document, max_length=100, min_length=20, do_sample=False)[0]['summary_text']
print(summary_text)

 This paper focuses on the genre of Gothic romance, and how this genre has evolved into the 21st century . Mike Flannagan’s 2020 tv-show The Haunting of Bly Manor is a modern adaptation of Henry James’ The Turn of the Screw .
