In [21]:
import spacy
import en_core_web_sm
from goose3 import Goose
from spacy import displacy
import nltk
import string
from wordcloud import WordCloud
import matplotlib.pyplot as plt

In [22]:
nlp = spacy.load("en_core_web_sm")

In [23]:
nlp

<spacy.lang.en.English at 0x16bce7d90>

In [24]:
g = Goose()
url = 'https://en.wikipedia.org/wiki/The_Mandalorian'
article = g.extract(url)

In [25]:
dataset = nlp(article.cleaned_text)
dataset

The Mandalorian is an American space western television series created by Jon Favreau for the streaming service Disney+. It is the first live-action series in the Star Wars franchise, beginning five years after the events of Return of the Jedi (1983), and stars Pedro Pascal as the title character, a lone bounty hunter who goes on the run to protect "the Child".

Star Wars creator George Lucas had begun developing a live-action Star Wars television series by 2009, but this project was deemed too expensive to produce. He sold Lucasfilm to Disney in October 2012. Subsequently, work on a new Star Wars series began for Disney+. Favreau signed on in March 2018, serving as writer and showrunner. He executive produces alongside Dave Filoni, Kathleen Kennedy, and Colin Wilson. The series' title was announced in October 2018, with filming starting at Manhattan Beach Studios in California. Visual effects company Industrial Light & Magic developed the StageCraft technology for the series, using vi

In [26]:
# displacy.render(dataset, style = 'ent', jupyter=True)

In [27]:
def preprocessing(sentence):
	# Lowercase so that "A" and "a" are treated as the same token
	sentence = sentence.lower()
	# Remove periods; useful when spaCy replaces unrecognized symbols with . or [
	sentence = sentence.replace('.', '')
	# Drop stop words, numbers, punctuation, whitespace, and single-character tokens
	tokens = [token.text for token in nlp(sentence) if not (token.is_stop or token.like_num or token.is_punct or token.is_space or len(token) == 1)]

	return ' '.join(tokens)

In [28]:
cleaned_article = preprocessing(article.cleaned_text)

In [29]:
original_sentences = nltk.sent_tokenize(article.cleaned_text)

In [30]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.sum_basic import SumBasicSummarizer


In [31]:
parser = PlaintextParser.from_string(article.cleaned_text, Tokenizer('english'))
summarizer = SumBasicSummarizer()
summary = summarizer(parser.document, 14)
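
The cell above treats `SumBasicSummarizer` as a black box. As a rough illustration of the idea behind SumBasic, here is a minimal pure-Python sketch. It is not sumy's implementation: the names `tokenize` and `sum_basic` are hypothetical, the regex tokenizer is a stand-in for a real one, and no stop words are removed. The procedure scores each sentence by the average probability of its words, picks the best sentence that contains the currently most probable word, then squares the probabilities of the words it used so that later picks cover different content.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and return its word tokens (hypothetical helper)."""
    return re.findall(r"[a-z']+", sentence.lower())


def sum_basic(sentences, n_sentences):
    """Pick up to n_sentences sentences using a simplified SumBasic procedure."""
    tokenized = [tokenize(s) for s in sentences]
    counts = Counter(word for words in tokenized for word in words)
    total = sum(counts.values())
    if total == 0:
        return []

    # Initial unigram probabilities: p(w) = count(w) / total number of tokens.
    prob = {word: count / total for word, count in counts.items()}

    def score(i):
        # Average word probability of sentence i.
        return sum(prob[w] for w in tokenized[i]) / len(tokenized[i])

    chosen = []
    candidates = [i for i, words in enumerate(tokenized) if words]
    while candidates and len(chosen) < n_sentences:
        # Most probable word among the remaining sentences
        # (sorted first so that ties are broken deterministically).
        remaining_words = sorted({w for i in candidates for w in tokenized[i]})
        top_word = max(remaining_words, key=lambda w: prob[w])

        # Best-scoring remaining sentence that contains that word.
        pool = [i for i in candidates if top_word in tokenized[i]]
        best = max(pool, key=score)
        chosen.append(best)
        candidates.remove(best)

        # Square the probability of each word just used to discourage redundancy.
        for word in set(tokenized[best]):
            prob[word] **= 2

    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```

Because frequent function words such as "the" dominate the raw probabilities in this sketch, a practical version would likely filter stop words before counting.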

In [32]:
summary

(<Sentence: Subsequently, work on a new Star Wars series began for Disney+.>,
 <Sentence: This has since been adopted by other film and television productions.>,
 <Sentence: The Mandalorian premiered with the launch of Disney+ on November 12, 2019.>,
 <Sentence: A fourth season is in development.>,
 <Sentence: Pedro Pascal stars as Din Djarin, the series' title character.>,
 <Sentence: More than 50 scripts were written for the series by 2012, but they were eventually deemed too expensive to produce.>,
 <Sentence: [43] Iger announced in February 2020 that the second season would premiere that October.>,
 <Sentence: [46] By late May 2022, Favreau was writing a fourth season, noting it was being informed by the series Ahsoka (2023).>,
 <Sentence: The location is brought to the actors.>,
 <Sentence: He first does so in "Chapter 3: The Sin", when he first leaves the Child with the Client.>,
 <Sentence: [94] Kuiil insists to the Mandalorian: "Droids are not good or bad — they are neutral ref

In [33]:
best_sentences = []
for sentence in summary:
    best_sentences.append(str(sentence))
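
With the summary sentences available as plain strings, one possible way to show them in context is to walk over every sentence of the article and wrap only the selected ones in `<mark>`. The sketch below is an illustration under assumptions: `highlight_summary` is a hypothetical helper, and the exact string match only works if both sentence splitters produce identical sentence strings, which is not guaranteed.

```python
from html import escape


def highlight_summary(all_sentences, summary_sentences):
    """Return HTML for the full text with summary sentences wrapped in <mark>."""
    selected = set(summary_sentences)
    parts = []
    for sentence in all_sentences:
        # Escape the text so characters such as < or & cannot break the markup.
        safe = escape(sentence)
        parts.append(f"<mark>{safe}</mark>" if sentence in selected else safe)
    return " ".join(parts)
```

In a notebook, the returned string could be passed to `HTML(...)` and displayed, using the list of NLTK sentences built earlier as the first argument and `best_sentences` as the second.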

In [35]:
from IPython.display import HTML, display
text = ''
display(HTML(f'<h1>Summary - {article.title}</h1>'))
for sentence in best_sentences:
	# Every sentence in this loop belongs to the summary, so each one is highlighted
	text += f' <mark>{sentence}</mark>'
display(HTML(text))

Subsequently, work on a new Star Wars series began for Disney+.
This has since been adopted by other film and television productions.
The Mandalorian premiered with the launch of Disney+ on November 12, 2019.
A fourth season is in development.
Pedro Pascal stars as Din Djarin, the series' title character.
More than 50 scripts were written for the series by 2012, but they were eventually deemed too expensive to produce.
[43] Iger announced in February 2020 that the second season would premiere that October.
[46] By late May 2022, Favreau was writing a fourth season, noting it was being informed by the series Ahsoka (2023).
The location is brought to the actors.
He first does so in "Chapter 3: The Sin", when he first leaves the Child with the Client.
[94] Kuiil insists to the Mandalorian: "Droids are not good or bad — they are neutral reflections of those who program them.
Favreau, Filoni, and Robert Rodriguez executive produced, with Morrison and Wen reprising their respective roles as 