In [1]:
import bs4 as bs
import urllib.request
import re
import nltk

In [2]:
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome_coronavirus_2')
article = scraped_data.read()
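
A minimal sketch of a more defensive fetch, assuming an explicit User-Agent header and a timeout are wanted. The helper names `build_request` and `fetch_html`, the agent string, and the 10-second timeout are hypothetical and not part of the original notebook.

```python
import urllib.request

WIKI_URL = "https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome_coronavirus_2"

def build_request(url, user_agent="summarizer-notebook/0.1 (educational use)"):
    """Return a Request carrying an explicit User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_html(url, timeout=10):
    """Download the page and return its raw bytes; raises URLError on failure."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# article = fetch_html(WIKI_URL)  # uncomment to perform the network request
```

The `with` block closes the connection once the bytes are read, which the bare `urlopen(...).read()` pattern leaves to garbage collection.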

In [4]:
# Parse the downloaded HTML with BeautifulSoup (lxml parser)
parsed_article = bs.BeautifulSoup(article,'lxml')

paragraphs = parsed_article.find_all('p')

article_text = ""

In [6]:
paragraphs

[<p class="mw-empty-elt">
 </p>,
 <p><b>Severe acute respiratory syndrome coronavirus 2</b> (<b>SARS‑CoV‑2</b>)<sup class="reference" id="cite_ref-CoronavirusStudyGroup_2-0"><a href="#cite_note-CoronavirusStudyGroup-2">[2]</a></sup> is a strain of <a href="/wiki/Coronavirus" title="Coronavirus">coronavirus</a> that causes <a href="/wiki/COVID-19" title="COVID-19">COVID-19</a> (coronavirus disease 2019), the <a class="mw-redirect" href="/wiki/Respiratory_illness" title="Respiratory illness">respiratory illness</a> responsible for the ongoing <a href="/wiki/COVID-19_pandemic" title="COVID-19 pandemic">COVID-19 pandemic</a>.<sup class="reference" id="cite_ref-NYT-20210226_3-0"><a href="#cite_note-NYT-20210226-3">[3]</a></sup> The virus previously had a <a href="/wiki/Novel_coronavirus" title="Novel coronavirus">provisional name</a>, <b>2019 novel coronavirus</b> (<b>2019-nCoV</b>),<sup class="reference" id="cite_ref-WHO21Jan2020_4-0"><a href="#cite_note-WHO21Jan2020-4">[4]</a></sup><sup c

In [8]:
# Concatenate the text of every <p> element, dropping the HTML tags
article_text = "".join(p.text for p in paragraphs)

article_text

'\nSevere acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2)[2] is a strain of coronavirus that causes COVID-19 (coronavirus disease 2019), the respiratory illness responsible for the ongoing COVID-19 pandemic.[3] The virus previously had a provisional name, 2019 novel coronavirus (2019-nCoV),[4][5][6][7] and has also been called the human coronavirus 2019 (HCoV-19 or hCoV-19).[8][9][10][11]  First identified in the city of Wuhan, Hubei, China, the World Health Organization declared the outbreak a public health emergency of international concern on January 30, 2020, and a pandemic on March 11, 2020.[12][13] SARS‑CoV‑2 is a positive-sense single-stranded RNA virus[14] that is contagious in humans.[15]\nSARS‑CoV‑2 is a virus of the species severe acute respiratory syndrome–related coronavirus (SARSr-CoV), related to the SARS-CoV-1 virus that caused the 2002–2004 SARS outbreak.[2][16] Despite its close relation to SARS-CoV-1, its closest known relatives, with which it forms a sister gr

In [9]:
# Remove bracketed citation markers such as [2], then collapse runs of whitespace
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)

article_text

' Severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2) is a strain of coronavirus that causes COVID-19 (coronavirus disease 2019), the respiratory illness responsible for the ongoing COVID-19 pandemic. The virus previously had a provisional name, 2019 novel coronavirus (2019-nCoV), and has also been called the human coronavirus 2019 (HCoV-19 or hCoV-19). First identified in the city of Wuhan, Hubei, China, the World Health Organization declared the outbreak a public health emergency of international concern on January 30, 2020, and a pandemic on March 11, 2020. SARS‑CoV‑2 is a positive-sense single-stranded RNA virus that is contagious in humans. SARS‑CoV‑2 is a virus of the species severe acute respiratory syndrome–related coronavirus (SARSr-CoV), related to the SARS-CoV-1 virus that caused the 2002–2004 SARS outbreak. Despite its close relation to SARS-CoV-1, its closest known relatives, with which it forms a sister group, are the derived SARS viruses BANAL-52 and RaTG13. Ava
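
The two substitutions above can be tried on a short string to see their effect. This is a small sketch; the function name `strip_citations` and the sample text are made up for illustration.

```python
import re

def strip_citations(text):
    """Replace bracketed numeric markers like [12] with a space, then collapse whitespace."""
    without_refs = re.sub(r"\[[0-9]*\]", " ", text)
    return re.sub(r"\s+", " ", without_refs)

sample = "pandemic.[3] The virus had a provisional name,[4][5]\nand other names."
print(strip_citations(sample))
# pandemic. The virus had a provisional name, and other names.
```

Replacing each marker with a space (rather than an empty string) keeps the words on either side of it from fusing; the second pass then removes the doubled spaces and newlines.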

In [10]:
# Keep letters only: digits, punctuation and other non-letter characters become spaces
formatted_article_text = re.sub(r'[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)

formatted_article_text

' Severe acute respiratory syndrome coronavirus SARS CoV is a strain of coronavirus that causes COVID coronavirus disease the respiratory illness responsible for the ongoing COVID pandemic The virus previously had a provisional name novel coronavirus nCoV and has also been called the human coronavirus HCoV or hCoV First identified in the city of Wuhan Hubei China the World Health Organization declared the outbreak a public health emergency of international concern on January and a pandemic on March SARS CoV is a positive sense single stranded RNA virus that is contagious in humans SARS CoV is a virus of the species severe acute respiratory syndrome related coronavirus SARSr CoV related to the SARS CoV virus that caused the SARS outbreak Despite its close relation to SARS CoV its closest known relatives with which it forms a sister group are the derived SARS viruses BANAL and RaTG Available evidence indicates that it is most likely of zoonotic origin and has close genetic similarity to 

In [14]:
# Split the cleaned article into sentences with NLTK's Punkt tokenizer
nltk.download('punkt')
sentence_list = nltk.sent_tokenize(article_text)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sxp210146\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


In [15]:
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sxp210146\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [16]:
sentence_list

[' Severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2) is a strain of coronavirus that causes COVID-19 (coronavirus disease 2019), the respiratory illness responsible for the ongoing COVID-19 pandemic.',
 'The virus previously had a provisional name, 2019 novel coronavirus (2019-nCoV), and has also been called the human coronavirus 2019 (HCoV-19 or hCoV-19).',
 'First identified in the city of Wuhan, Hubei, China, the World Health Organization declared the outbreak a public health emergency of international concern on January 30, 2020, and a pandemic on March 11, 2020.',
 'SARS‑CoV‑2 is a positive-sense single-stranded RNA virus that is contagious in humans.',
 'SARS‑CoV‑2 is a virus of the species severe acute respiratory syndrome–related coronavirus (SARSr-CoV), related to the SARS-CoV-1 virus that caused the 2002–2004 SARS outbreak.',
 'Despite its close relation to SARS-CoV-1, its closest known relatives, with which it forms a sister group, are the derived SARS viruses BAN

In [20]:
# Count how often each non-stopword token occurs.
# Note: tokens keep their original case while NLTK's stopword list is lowercase,
# so capitalized stopwords such as 'The' are still counted.

word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1

maximum_frequency = max(word_frequencies.values())

maximum_frequency

88

In [21]:
# Weighted frequency: divide each count by the highest count

for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / maximum_frequency

word_frequencies

{'Severe': 0.022727272727272728,
 'acute': 0.045454545454545456,
 'respiratory': 0.09090909090909091,
 'syndrome': 0.056818181818181816,
 'coronavirus': 0.20454545454545456,
 'SARS': 1.0,
 'CoV': 0.9659090909090909,
 'strain': 0.022727272727272728,
 'causes': 0.011363636363636364,
 'COVID': 0.09090909090909091,
 'disease': 0.045454545454545456,
 'illness': 0.011363636363636364,
 'responsible': 0.022727272727272728,
 'ongoing': 0.03409090909090909,
 'pandemic': 0.06818181818181818,
 'The': 0.3181818181818182,
 'virus': 0.5113636363636364,
 'previously': 0.011363636363636364,
 'provisional': 0.022727272727272728,
 'name': 0.045454545454545456,
 'novel': 0.022727272727272728,
 'nCoV': 0.045454545454545456,
 'also': 0.056818181818181816,
 'called': 0.022727272727272728,
 'human': 0.14772727272727273,
 'HCoV': 0.022727272727272728,
 'hCoV': 0.011363636363636364,
 'First': 0.011363636363636364,
 'identified': 0.06818181818181818,
 'city': 0.011363636363636364,
 'Wuhan': 0.10227272727272728,
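
A possible case-insensitive variant of the counting step, sketched with the standard library only. It lowercases the text first, so a capitalized stopword like 'The' would be filtered and case variants such as 'HCoV' and 'hCoV' would merge. The regex tokenizer and the tiny `DEMO_STOPWORDS` set are stand-ins (assumptions) for `nltk.word_tokenize` and NLTK's English stopword list, and `weighted_frequencies` is a hypothetical name.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"the", "is", "a", "of", "that", "in"}

def weighted_frequencies(text, stopwords=DEMO_STOPWORDS):
    """Count lowercased alphabetic tokens, drop stopwords, and divide by the top count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}

freqs = weighted_frequencies("The virus is a strain of coronavirus. The virus spreads.")
print(freqs)
# {'virus': 1.0, 'strain': 0.5, 'coronavirus': 0.5, 'spreads': 0.5}
```

Running this on the real article would give different numbers from the output recorded above, since the recorded counts are case-sensitive.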


In [22]:
# Score each sentence by summing the weighted frequencies of its words.
# Sentences with 30 or more space-separated tokens are skipped.
# Note: tokens are lowercased here, so keys containing capitals (e.g. 'SARS') never match.

sentence_scores = {}
for sent in sentence_list:
    if len(sent.split(' ')) >= 30:
        continue
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]
sentence_scores

{' Severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2) is a strain of coronavirus that causes COVID-19 (coronavirus disease 2019), the respiratory illness responsible for the ongoing COVID-19 pandemic.': 1.1590909090909092,
 'The virus previously had a provisional name, 2019 novel coronavirus (2019-nCoV), and has also been called the human coronavirus 2019 (HCoV-19 or hCoV-19).': 1.25,
 'SARS‑CoV‑2 is a positive-sense single-stranded RNA virus that is contagious in humans.': 0.625,
 'SARS‑CoV‑2 is a virus of the species severe acute respiratory syndrome–related coronavirus (SARSr-CoV), related to the SARS-CoV-1 virus that caused the 2002–2004 SARS outbreak.': 1.681818181818182,
 'Despite its close relation to SARS-CoV-1, its closest known relatives, with which it forms a sister group, are the derived SARS viruses BANAL-52 and RaTG13.': 0.3977272727272727,
 'Available evidence indicates that it is most likely of zoonotic origin and has close genetic similarity to bat coronaviru
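
The scoring step can be sketched so that the lookup and the weight table use the same casing. This assumes the weights are keyed by lowercase words; `score_sentences`, the regex tokenizer, and the sample data are illustrative only and are not the notebook's code.

```python
import re

def score_sentences(sentences, word_weights, max_words=30):
    """Sum the weights of each sentence's lowercased tokens; skip long sentences."""
    scores = {}
    for sentence in sentences:
        if len(sentence.split()) >= max_words:
            continue
        tokens = re.findall(r"[a-z]+", sentence.lower())
        score = sum(word_weights.get(token, 0.0) for token in tokens)
        if score > 0:
            scores[sentence] = score
    return scores

weights = {"virus": 1.0, "strain": 0.5, "bats": 0.25}  # lowercase keys (assumed)
demo = ["The virus is a strain.", "Bats host the virus.", "Nothing relevant here."]
print(score_sentences(demo, weights))
# {'The virus is a strain.': 1.5, 'Bats host the virus.': 1.25}
```

Sentences with no weighted words are left out of the result, as in the cell above.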

In [27]:
import heapq

# Keep the 10 highest-scoring sentences, ordered from highest to lowest score
summary_sentences = heapq.nlargest(10, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)

SARS‑CoV‑2 is a virus of the species severe acute respiratory syndrome–related coronavirus (SARSr-CoV), related to the SARS-CoV-1 virus that caused the 2002–2004 SARS outbreak. Other studies have suggested that the virus may be airborne as well, with aerosols potentially being able to transmit the virus. The host protein neuropilin 1 (NRP1) may aid the virus in host cell entry using ACE2. During the initial outbreak in Wuhan, China, various names were used for the virus; some names used by different sources included "the coronavirus" or "Wuhan coronavirus". The virus previously had a provisional name, 2019 novel coronavirus (2019-nCoV), and has also been called the human coronavirus 2019 (HCoV-19 or hCoV-19).  Severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2) is a strain of coronavirus that causes COVID-19 (coronavirus disease 2019), the respiratory illness responsible for the ongoing COVID-19 pandemic. The original source of viral transmission to humans remains unclear, as 
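
In the printed summary the sentences appear in score order, so the article's opening sentence lands in the middle of the text. One possible refinement, sketched here with hypothetical names and toy data, is to select the top sentences by score and then emit them in their original document order.

```python
import heapq

def summarize(sentence_scores, ordered_sentences, n=3):
    """Pick the n highest-scoring sentences, then emit them in document order."""
    top = set(heapq.nlargest(n, sentence_scores, key=sentence_scores.get))
    return " ".join(s for s in ordered_sentences if s in top)

ordered = ["Alpha first.", "Beta second.", "Gamma third.", "Delta fourth."]
scores = {"Alpha first.": 0.4, "Beta second.": 0.9, "Gamma third.": 0.1, "Delta fourth.": 0.7}
print(summarize(scores, ordered, n=2))
# Beta second. Delta fourth.
```

With the notebook's variables this would correspond to calling `summarize(sentence_scores, sentence_list, n=10)`.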