In [3]:
!pip install spacy transformers sentencepiece
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [4]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest
from transformers import pipeline, T5ForConditionalGeneration, T5Tokenizer

# Load the small English spaCy pipeline
nlp = spacy.load('en_core_web_sm')

# Load the T5 model and tokenizer, then wrap them in a summarization pipeline
model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5Tokenizer.from_pretrained("t5-base")
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="pt")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
def spacy_summarize(text, num_sentences=3):
    """Extractive summary: join the num_sentences highest-scoring sentences."""
    doc = nlp(text)

    # Collect content words, skipping stop words and punctuation
    allowed_pos = {'ADJ', 'PROPN', 'VERB', 'NOUN'}
    tokens = [
        token.text
        for token in doc
        if token.text not in STOP_WORDS
        and token.text not in punctuation
        and token.pos_ in allowed_pos
    ]
    if not tokens:
        # Nothing to score (e.g. empty input); avoids max() on an empty sequence
        return ""

    # Word frequencies, normalized by the count of the most frequent word
    word_freq = Counter(tokens)
    max_freq = max(word_freq.values())
    for word in word_freq:
        word_freq[word] /= max_freq

    # Score each sentence as the sum of its words' normalized frequencies.
    # Note: word_freq keys keep their original casing while this lookup is
    # lowercased, so tokens containing capitals (such as proper nouns) only
    # score when a lowercase form is also a key.
    sent_scores = {}
    for sent in doc.sents:
        for word in sent:
            key = word.text.lower()
            if key in word_freq:
                sent_scores[sent] = sent_scores.get(sent, 0) + word_freq[key]

    # Pick the top sentences (returned in score order, not document order)
    top_sentences = nlargest(num_sentences, sent_scores, key=sent_scores.get)
    return " ".join(sent.text for sent in top_sentences)

def transformer_summarize(text, num_sentences=3):
    """Abstractive summary generated by the T5 summarization pipeline."""
    # Budget roughly 50 tokens per requested sentence, capped at 512 tokens
    max_length = min(512, num_sentences * 50)
    summary = summarizer(text, max_length=max_length, min_length=25, do_sample=False)
    return summary[0]['summary_text']
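
One thing to watch in `spacy_summarize`: the frequency table is keyed by `token.text` in its original casing, while the scoring loop looks words up with `word.text.lower()`. The sketch below shows one way to build a case-normalized frequency table instead. It works on plain strings rather than spaCy tokens, and `normalized_word_freq` is a hypothetical helper name, not something defined in this notebook.

```python
from collections import Counter

def normalized_word_freq(words):
    """Count words case-insensitively and scale the counts into (0, 1]."""
    counts = Counter(w.lower() for w in words)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}

# Mixed-case variants of the same word now share one entry
freq = normalized_word_freq(["COVID-19", "cases", "covid-19", "Cases", "cases"])
print(freq)
```

Switching the notebook to a table like this would likely change the sentence scores, so the summaries printed in later cells may differ after such a change.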


In [6]:
text = """
In conclusion, the provided code clusters countries based on the ratio of affected to recovered COVID-19 cases using the K-Means algorithm.
It partitions the data into three clusters, represented by different colors. The choice of K-Means is justified by its simplicity and efficiency,
although it requires specifying the number of clusters in advance. The dataset includes attributes such as country names, total reported cases,
and total recovered cases, with unnecessary columns excluded for this analysis. This approach provides insights into how countries compare in
terms of their COVID-19 recovery rates.
"""

spacy_summary = spacy_summarize(text, num_sentences=3)
print("SpaCy Summary:\n", spacy_summary)


SpaCy Summary:
 The dataset includes attributes such as country names, total reported cases, 
and total recovered cases, with unnecessary columns excluded for this analysis. 
In conclusion, the provided code clusters countries based on the ratio of affected to recovered COVID-19 cases using the K-Means algorithm. 
 The choice of K-Means is justified by its simplicity and efficiency, 
although it requires specifying the number of clusters in advance.
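
The spaCy summary above lists sentences in score order, which is why the "In conclusion" sentence appears second even though it opens the source text. A minimal sketch of selecting the top sentences and then restoring document order follows. The scores are made-up illustrative values, and `top_sentences_in_order` is a hypothetical helper, not part of the notebook.

```python
from heapq import nlargest

def top_sentences_in_order(scores, k):
    """Return the indices of the k highest-scoring sentences, in document order.

    scores[i] is the score of the i-th sentence in the text.
    """
    top = nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(top)

scores = [2.5, 0.8, 2.1, 3.0, 1.2]  # illustrative scores for five sentences
order = top_sentences_in_order(scores, 3)
print(order)  # [0, 2, 3]
```

Inside `spacy_summarize`, a similar effect could probably be had by sorting the selected spans on their `start` attribute before joining them.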


In [7]:
transformer_summary = transformer_summarize(text, num_sentences=3)
print("Transformer Summary:\n", transformer_summary)


Your max_length is set to 150, but your input_length is only 124. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=62)


Transformer Summary:
 the code clusters countries based on the ratio of affected to recovered COVID-19 cases . the dataset includes attributes such as country names, total reported cases, and total recovered cases.
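
The warning above appears because `max_length` (150) is larger than the tokenized input (124 tokens). One possible way to avoid it is to clamp `max_length` to the input length before calling the pipeline. The sketch below is pure Python; `pick_max_length` and its parameter names are assumptions made for illustration, with defaults mirroring the values used in `transformer_summarize`.

```python
def pick_max_length(input_len, num_sentences, tokens_per_sentence=50,
                    cap=512, min_length=25):
    """Choose a max_length that stays within the input length.

    input_len is the number of tokens in the text to be summarized.
    """
    wanted = min(cap, num_sentences * tokens_per_sentence)
    # Never exceed the input length, and never drop below min_length
    return max(min_length, min(wanted, input_len))

print(pick_max_length(124, 3))   # 124
print(pick_max_length(1000, 3))  # 150
```

In the notebook, the input length could be estimated with something like `len(tokenizer(text).input_ids)`. The pipeline may add a task prefix before tokenizing, so the count it reports can differ slightly from that estimate.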


In [8]:
def summarize_text(text, method='spacy', num_sentences=3):
    """Summarize text with the extractive ('spacy') or abstractive ('transformer') method."""
    if method == 'spacy':
        return spacy_summarize(text, num_sentences)
    elif method == 'transformer':
        return transformer_summarize(text, num_sentences)
    else:
        raise ValueError("Method must be 'spacy' or 'transformer'")


In [11]:
text = """
In conclusion, the provided code clusters countries based on the ratio of affected to recovered COVID-19 cases using the K-Means algorithm.
It partitions the data into three clusters, represented by different colors. The choice of K-Means is justified by its simplicity and efficiency,
although it requires specifying the number of clusters in advance. The dataset includes attributes such as country names, total reported cases,
and total recovered cases, with unnecessary columns excluded for this analysis. This approach provides insights into how countries compare in
terms of their COVID-19 recovery rates.
"""

summary = summarize_text(text, method='transformer', num_sentences=3)
print("Summary:\n", summary)


Your max_length is set to 150, but your input_length is only 124. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=62)


Summary:
 the code clusters countries based on the ratio of affected to recovered COVID-19 cases . the dataset includes attributes such as country names, total reported cases, and total recovered cases.
