In [2]:
!pip install spacy



In [3]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     12.8/12.8 MB 75.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [4]:
import spacy

# Load the small English pipeline downloaded above.
nlp = spacy.load("en_core_web_sm")
text = "apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)

In [5]:
doc

apple is looking at buying U.K. startup for $1 billion

In [6]:
print("Named Entities, Phrases, and concepts:")
for ent in doc.ents:
  print(f"{ent.text:15} {ent.label_:10} {ent.start_char:10} {ent.end_char:10}")

Named Entities, Phrases, and concepts:
apple           ORG                 0          5
U.K.            GPE                27         31
$1 billion      MONEY              44         54
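
The two numeric columns are character offsets into the original string: `start_char` is inclusive and `end_char` is exclusive. A minimal sketch, using only the offsets printed above (so it runs without spaCy; the tuple layout is just an illustration), shows that slicing the text with them recovers each entity:

```python
# Offsets copied from the output above; the (label, start, end) layout is illustrative.
text = "apple is looking at buying U.K. startup for $1 billion"
entities = [("ORG", 0, 5), ("GPE", 27, 31), ("MONEY", 44, 54)]

# start_char is inclusive and end_char is exclusive, like a Python slice.
recovered = [(label, text[start:end]) for label, start, end in entities]
for label, span_text in recovered:
    print(f"{span_text:15} {label}")
```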


In [7]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("data science and ai has greate career ahead")

In [8]:
doc

data science and ai has greate career ahead

In [9]:
for token in doc:
  print(token.text)

data
science
and
ai
has
greate
career
ahead


In [11]:
for token in doc:
  print(token.pos_)

NOUN
NOUN
CCONJ
NOUN
AUX
ADJ
NOUN
ADV


In [12]:
for token in doc:
  print(token.text,':',token.pos_)


data : NOUN
science : NOUN
and : CCONJ
ai : NOUN
has : AUX
greate : ADJ
career : NOUN
ahead : ADV


In [13]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

data data NOUN NN compound xxxx True False
science science NOUN NN nsubj xxxx True False
and and CCONJ CC cc xxx True True
ai ai NOUN NN conj xx True False
has have AUX VBZ aux xxx True True
greate greate ADJ JJ ROOT xxxx True False
career career NOUN NN dobj xxxx True False
ahead ahead ADV RB advmod xxxx True False
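
The `shape_` column abstracts each token into a pattern of letter case and digits, with long runs of the same character class cut short (which is why `science` prints as `xxxx`). As a rough sketch of the idea, and only an approximation written for illustration rather than spaCy's own code, a hypothetical `word_shape` helper might look like this:

```python
def word_shape(text, max_run=4):
    """Approximate a spaCy-style word shape (illustrative helper, not spaCy's code)."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and symbols are kept as-is
        # Track how long the current run of identical shape characters is.
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= max_run:
            shape.append(mapped)
    return "".join(shape)

shapes = {w: word_shape(w) for w in ["data", "science", "ai", "U.K.", "$1"]}
print(shapes)
```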


In [14]:
text = """There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summarizing news articles. Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.
Image collection summarization is another application example of automatic summarization. It consists in selecting a representative set of images from a larger set of images.[4] A summary in this context is useful to show the most representative images of results in an image collection exploration system. Video summarization is a related domain, where the system automatically creates a trailer of a long video. This also has applications in consumer or personal videos, where one might want to skip the boring or repetitive actions. Similarly, in surveillance videos, one would want to extract important and suspicious activity, while ignoring all the boring and redundant frames captured """

In [15]:
text

'There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.\nAn example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summa

In [16]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation


In [17]:
stopwords = list(STOP_WORDS)
stopwords

['which',
 'seeming',
 'others',
 'under',
 'due',
 'thereby',
 'of',
 'put',
 'n’t',
 'why',
 'hers',
 'anyhow',
 'back',
 'ca',
 'hundred',
 'less',
 'enough',
 'everywhere',
 'three',
 'seemed',
 'twelve',
 'eight',
 'sixty',
 'next',
 'very',
 'beside',
 'together',
 'her',
 'either',
 'become',
 'than',
 'me',
 'alone',
 'another',
 'two',
 'front',
 'it',
 'somehow',
 'towards',
 'many',
 'against',
 'also',
 'we',
 'wherein',
 'own',
 'had',
 'formerly',
 'most',
 'for',
 'they',
 'except',
 'been',
 'only',
 'just',
 'because',
 'wherever',
 'often',
 'anyway',
 'here',
 'former',
 'could',
 'that',
 'top',
 'anyone',
 'call',
 'doing',
 'them',
 'nor',
 'or',
 'became',
 'noone',
 'with',
 'various',
 'much',
 'his',
 'therefore',
 'while',
 'made',
 'on',
 '’ve',
 'indeed',
 '‘s',
 'each',
 'something',
 'however',
 'during',
 'would',
 'though',
 'were',
 'latterly',
 'this',
 'too',
 'everyone',
 'onto',
 'does',
 'in',
 'side',
 'beyond',
 'itself',
 'when',
 'amongst',
 '

In [18]:
len(stopwords)

326

In [19]:
nlp = spacy.load('en_core_web_sm')

In [21]:
doc = nlp(text)
doc

There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summari

In [22]:
# Get the tokens from the text.
tokens = [token.text for token in doc]
print(tokens)
# At this point the text is only tokenized: stopwords and punctuation are still present, and no cleaning has been done.

['There', 'are', 'broadly', 'two', 'types', 'of', 'extractive', 'summarization', 'tasks', 'depending', 'on', 'what', 'the', 'summarization', 'program', 'focuses', 'on', '.', 'The', 'first', 'is', 'generic', 'summarization', ',', 'which', 'focuses', 'on', 'obtaining', 'a', 'generic', 'summary', 'or', 'abstract', 'of', 'the', 'collection', '(', 'whether', 'documents', ',', 'or', 'sets', 'of', 'images', ',', 'or', 'videos', ',', 'news', 'stories', 'etc', '.', ')', '.', 'The', 'second', 'is', 'query', 'relevant', 'summarization', ',', 'sometimes', 'called', 'query', '-', 'based', 'summarization', ',', 'which', 'summarizes', 'objects', 'specific', 'to', 'a', 'query', '.', 'Summarization', 'systems', 'are', 'able', 'to', 'create', 'both', 'query', 'relevant', 'text', 'summaries', 'and', 'generic', 'machine', '-', 'generated', 'summaries', 'depending', 'on', 'what', 'the', 'user', 'needs', '.', '\n', 'An', 'example', 'of', 'a', 'summarization', 'problem', 'is', 'document', 'summarization', ',

In [23]:
# Count how often each word occurs in the text, skipping stopwords and punctuation.
# Note: keys keep the token's original casing, so 'Summarization' and
# 'summarization' are counted separately, and newline tokens are counted too.

word_frequencies = {}

for word in doc:
    if word.text.lower() not in stopwords:
        if word.text.lower() not in punctuation:
            if word.text not in word_frequencies:
                word_frequencies[word.text] = 1
            else:
                word_frequencies[word.text] += 1
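
One variant worth considering is to count on the lowercased token, so that differently cased forms of a word share a key, and to skip whitespace tokens such as `'\n'`. The sketch below is not what the cell above does. It uses a plain token list and a tiny hypothetical stopword set (`SAMPLE_STOPWORDS`) in place of the spaCy tokens and `STOP_WORDS`, so it runs on its own:

```python
from collections import Counter
from string import punctuation

# Hypothetical stand-ins for the spaCy tokens and STOP_WORDS used above.
tokens = ["Summarization", "systems", "are", "able", "to", "create",
          "summaries", ".", "\n", "Generic", "summarization", "is", "common", "."]
SAMPLE_STOPWORDS = {"are", "able", "to", "is"}

def count_words(tokens, stopwords):
    """Count lowercased tokens, skipping stopwords, punctuation and whitespace."""
    counts = Counter()
    for tok in tokens:
        key = tok.lower()
        if key in stopwords or key.isspace() or all(c in punctuation for c in key):
            continue
        counts[key] += 1
    return counts

word_counts = count_words(tokens, SAMPLE_STOPWORDS)
print(word_counts.most_common(3))
```

With lowercased keys, a later lookup by `word.text.lower()` will always find the entry that was counted.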

In [24]:
word_frequencies

{'broadly': 1,
 'types': 1,
 'extractive': 1,
 'summarization': 11,
 'tasks': 1,
 'depending': 2,
 'program': 1,
 'focuses': 2,
 'generic': 3,
 'obtaining': 1,
 'summary': 4,
 'abstract': 2,
 'collection': 3,
 'documents': 2,
 'sets': 1,
 'images': 3,
 'videos': 3,
 'news': 4,
 'stories': 1,
 'etc': 1,
 'second': 1,
 'query': 4,
 'relevant': 2,
 'called': 2,
 'based': 1,
 'summarizes': 1,
 'objects': 1,
 'specific': 1,
 'Summarization': 1,
 'systems': 1,
 'able': 1,
 'create': 1,
 'text': 1,
 'summaries': 2,
 'machine': 1,
 'generated': 1,
 'user': 1,
 'needs': 1,
 '\n': 2,
 'example': 3,
 'problem': 2,
 'document': 4,
 'attempts': 1,
 'automatically': 3,
 'produce': 1,
 'given': 2,
 'interested': 1,
 'generating': 1,
 'single': 1,
 'source': 2,
 'use': 1,
 'multiple': 1,
 'cluster': 1,
 'articles': 3,
 'topic': 2,
 'multi': 1,
 'related': 2,
 'application': 2,
 'summarizing': 1,
 'Imagine': 1,
 'system': 3,
 'pulls': 1,
 'web': 1,
 'concisely': 1,
 'represents': 1,
 'latest': 1,
 'Ima

In [25]:
max_frequency= max(word_frequencies.values())
max_frequency

11

In [26]:
# To get normalized (weighted) frequencies, divide every count by the maximum frequency (11 here).
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_frequency
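
As a small worked example of this normalization, the sketch below takes a few raw counts from the output above and divides each by the maximum. It builds a new dict rather than overwriting values in place, which is a stylistic choice and not something the notebook requires:

```python
# Raw counts copied from the word_frequencies output above (a small subset).
counts = {"summarization": 11, "summary": 4, "generic": 3, "broadly": 1}

max_count = max(counts.values())
# Build a new dict so the raw counts stay available for inspection.
normalized = {word: count / max_count for word, count in counts.items()}
print(normalized)
```

The most frequent word always ends up with weight 1.0, and every other word gets a weight between 0 and 1.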

In [27]:
word_frequencies

{'broadly': 0.09090909090909091,
 'types': 0.09090909090909091,
 'extractive': 0.09090909090909091,
 'summarization': 1.0,
 'tasks': 0.09090909090909091,
 'depending': 0.18181818181818182,
 'program': 0.09090909090909091,
 'focuses': 0.18181818181818182,
 'generic': 0.2727272727272727,
 'obtaining': 0.09090909090909091,
 'summary': 0.36363636363636365,
 'abstract': 0.18181818181818182,
 'collection': 0.2727272727272727,
 'documents': 0.18181818181818182,
 'sets': 0.09090909090909091,
 'images': 0.2727272727272727,
 'videos': 0.2727272727272727,
 'news': 0.36363636363636365,
 'stories': 0.09090909090909091,
 'etc': 0.09090909090909091,
 'second': 0.09090909090909091,
 'query': 0.36363636363636365,
 'relevant': 0.18181818181818182,
 'called': 0.18181818181818182,
 'based': 0.09090909090909091,
 'summarizes': 0.09090909090909091,
 'objects': 0.09090909090909091,
 'specific': 0.09090909090909091,
 'Summarization': 0.09090909090909091,
 'systems': 0.09090909090909091,
 'able': 0.09090909090

In [28]:
sentence_tokens = list(doc.sents)
sentence_tokens

[There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on.,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).,
 The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.,
 Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.,
 An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.,
 Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic).,
 This problem is called multi-document summarization.,
 A related applica

In [29]:
len(sentence_tokens)

15

In [30]:
# Score each sentence by summing the normalized frequencies of its words.
# Lookups use the lowercased token, so words stored only in capitalized form
# (for example 'Imagine') contribute nothing to the score.
sentence_scores = {}

for sent in sentence_tokens:
    for word in sent:
        if word.text.lower() in word_frequencies:
            if sent not in sentence_scores:
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]
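
The same scoring idea can be sketched without spaCy. The weights and pre-tokenized sentences below are hypothetical, and `score_sentences` is an illustrative helper. Unlike the loop above, it also records a score of 0.0 for sentences that contain no weighted words:

```python
# Hypothetical normalized frequencies and pre-tokenized sentences.
weights = {"summarization": 1.0, "query": 0.36, "generic": 0.27, "problem": 0.18}
sentences = {
    "The first is generic summarization.": ["the", "first", "is", "generic", "summarization"],
    "This problem is hard.": ["this", "problem", "is", "hard"],
    "Nothing relevant here.": ["nothing", "relevant", "here"],
}

def score_sentences(sentences, weights):
    """Sum the weight of every known word; unknown words add 0."""
    return {
        sentence: sum(weights.get(word, 0.0) for word in words)
        for sentence, words in sentences.items()
    }

scores = score_sentences(sentences, weights)
print(scores)
```

Because the score is a plain sum, longer sentences tend to score higher. Dividing by the number of words is one possible adjustment, though the notebook does not do this.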


In [31]:
sentence_scores

{There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on.: 2.818181818181818,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).: 3.9999999999999987,
 The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.: 3.909090909090909,
 Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.: 3.2727272727272716,
 An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.: 3.9999999999999996,
 Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of artic

In [32]:
len(sentence_scores)

15

In [33]:
from heapq import nlargest

In [35]:
# Keep 40% of the sentences (15 * 0.4 = 6).
select_length = int(len(sentence_tokens) * 0.4)
select_length

6

In [36]:
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)

In [37]:
summary

[An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).,
 The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.,
 Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.,
 Image collection summarization is another application example of automatic summarization.,
 Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.]
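
Note that `nlargest` returns the sentences ranked by score, not in their original order, which is why "An example of a summarization problem..." is listed first above even though it appears later in the text. A minimal sketch with hypothetical scores keyed by sentence position shows the ranking and one way to restore reading order afterwards:

```python
from heapq import nlargest

# Hypothetical scores keyed by each sentence's position in the document.
scores_by_position = {0: 2.8, 1: 4.0, 2: 3.9, 3: 3.3, 4: 4.1}

# nlargest iterates over the dict's keys and ranks them by their score.
top = nlargest(3, scores_by_position, key=scores_by_position.get)

# Sorting the selected positions restores reading order.
in_document_order = sorted(top)
print(top, in_document_order)
```

In the notebook, the spaCy sentence spans expose a `start` attribute that could serve as the sort key for the same purpose.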

In [38]:
final_summary = [sent.text for sent in summary]

In [39]:
final_summary

['An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.',
 'The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).',
 'The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.',
 'Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.\n',
 'Image collection summarization is another application example of automatic summarization.',
 'Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.\n']
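
Some entries in `final_summary` carry a trailing `\n` from the source text. A small sketch of turning the list into a single summary string, using two entries copied from the output above, could look like this:

```python
# Two entries copied from final_summary above, including the trailing newline.
final_summary = [
    "Image collection summarization is another application example of automatic summarization.",
    "Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.\n",
]

# Strip stray whitespace from each sentence, then join with single spaces.
summary_text = " ".join(sentence.strip() for sentence in final_summary)
print(summary_text)
```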