📝 Challenge Summary (Gen-AI: Summarization)

Problem:
Given a dataset of news articles (id, article), generate a short summary for each article.
Output should be a CSV with id, summary.

✅ What to Use

Allowed anywhere (safe, baseline):

pandas, nltk (sent_tokenize), sklearn (TfidfVectorizer)

Extractive methods (select important sentences)

If platform allows HuggingFace / Transformers:

transformers library → pre-trained summarization models (distilbart, t5-small, bart-large-cnn)

The challenge will likely be phrased as: “Given the articles, generate a summary of the description.”
The final output should look something like:

guid, summary
001, "Tech giants post strong earnings, boosting stock market optimism."
002, "Heavy rains flood Mumbai, halting transport and raising emergency alert."
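
The sample summaries above contain commas, so the CSV writer has to quote them. A minimal sketch of that round trip, assuming the id column is named `guid` (as in the dataset loaded later) and using made-up rows that mirror the sample:

```python
import io
import pandas as pd

# Hypothetical rows mirroring the sample output above
rows = pd.DataFrame({
    "guid": ["001", "002"],
    "summary": [
        "Tech giants post strong earnings, boosting stock market optimism.",
        "Heavy rains flood Mumbai, halting transport and raising emergency alert.",
    ],
})

# to_csv quotes any field that contains a comma (minimal quoting is the default)
csv_text = rows.to_csv(index=False)

# Reading back with dtype=str keeps ids such as "001" from being parsed as the integer 1
restored = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(csv_text)
```

Reading ids back as strings matters only when they look numeric; URL-style ids such as the BBC `guid` values are unaffected.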

## Approach 1: Extractive (TF-IDF, Safe Baseline)

In [12]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [13]:
# Load dataset
df = pd.read_csv("news_articles.csv")   # update path
print(df.head())

                                               title  \
0  Ukraine: Angry Zelensky vows to punish Russian...   
1  War in Ukraine: Taking cover in a town under a...   
2         Ukraine war 'catastrophic for global food'   
3  Manchester Arena bombing: Saffie Roussos's par...   
4  Ukraine conflict: Oil price soars to highest l...   

                         pubDate  \
0  Mon, 07 Mar 2022 08:01:56 GMT   
1  Sun, 06 Mar 2022 22:49:58 GMT   
2  Mon, 07 Mar 2022 00:14:42 GMT   
3  Mon, 07 Mar 2022 00:05:40 GMT   
4  Mon, 07 Mar 2022 08:15:53 GMT   

                                               guid  \
0  https://www.bbc.co.uk/news/world-europe-60638042   
1  https://www.bbc.co.uk/news/world-europe-60641873   
2      https://www.bbc.co.uk/news/business-60623941   
3            https://www.bbc.co.uk/news/uk-60579079   
4      https://www.bbc.co.uk/news/business-60642786   

                                                link  \
0  https://www.bbc.co.uk/news/world-europe-606380...   
1  

In [14]:
def extractive_summary(text, top_n=2):
    # Check if the input text is missing (NaN)
    if pd.isna(text):
        # Return an empty string if the text is missing
        return ""

    # Tokenize the text into a list of sentences
    sentences = sent_tokenize(text)
    # If the number of sentences is less than or equal to top_n, return the original text
    if len(sentences) <= top_n:
        return text

    # Initialize a TfidfVectorizer to calculate TF-IDF scores
    vectorizer = TfidfVectorizer()

    # Fit the vectorizer to the sentences and transform them into a TF-IDF matrix
    X = vectorizer.fit_transform(sentences)

    # Sum the TF-IDF scores across each sentence to get a single score per sentence
    scores = np.array(X.sum(axis=1)).ravel()

    # Pick the top_n highest-scoring sentences, highest score first (not document order)
    ranked = [sentences[i] for i in scores.argsort()[-top_n:][::-1]]

    # Join the selected top sentences back into a single string
    return " ".join(ranked)
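
The function above returns sentences ordered by score, which can read oddly when a later sentence is placed first. A possible variant, sketched here under assumptions, keeps the selected sentences in document order. The name `extractive_summary_ordered` is hypothetical, and a naive regex splitter stands in for `sent_tokenize` so the sketch runs without NLTK data:

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary_ordered(text, top_n=2):
    # Naive split after ., ! or ? followed by whitespace (assumption: replaces sent_tokenize)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) <= top_n:
        return text.strip()

    # Score each sentence by the sum of its TF-IDF weights, as in the function above
    X = TfidfVectorizer().fit_transform(sentences)
    scores = np.asarray(X.sum(axis=1)).ravel()

    # Take the indices of the top_n scores, then sort them to restore document order
    keep = sorted(np.argsort(scores)[-top_n:])
    return " ".join(sentences[i] for i in keep)

sample = (
    "Floods hit the coastal city on Monday after record rainfall. "
    "Officials said. "
    "Rescue teams evacuated thousands of residents from the flooded coastal districts overnight."
)
print(extractive_summary_ordered(sample, top_n=2))
```

Because each TF-IDF row is length-normalised, summing a row tends to favour sentences with many distinct terms, so very short sentences are rarely selected.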

In [17]:
# Apply extractive_summary to the 'description' column and store the results
# in a new 'summary' column.
# Each value is passed through unchanged: wrapping it in str() would turn a
# missing value (NaN) into the text "nan" and bypass the pd.isna check inside the function.
# top_n=2 keeps the two highest-scoring sentences.

df['summary'] = df['description'].apply(lambda x: extractive_summary(x, top_n=2))
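
One pitfall worth checking when applying a function like this: converting a missing value with `str()` produces the text "nan", which a `pd.isna` guard will not catch. A small illustration with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one real description and one missing value
descriptions = pd.Series(["Oil prices rose sharply.", np.nan])

# str() turns the missing value into the literal text "nan" ...
as_text = descriptions.apply(str)
# ... so a later pd.isna check no longer detects it
still_missing = as_text.apply(pd.isna)

# Filling with an empty string first keeps missing rows recognisably empty
cleaned = descriptions.fillna("").astype(str)
print(as_text.tolist(), still_missing.tolist(), cleaned.tolist())
```

Either pass the raw value to the function and let its own check handle NaN, or fill missing values before applying it.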

In [16]:
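# Note: this cell was executed before In [17]. Recent NLTK releases load the
# sentence tokenizer data from 'punkt_tab', so download it before calling sent_tokenize.
import nltk
nltk.download('punkt_tab')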

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [18]:
# Save submission
submission = df[['guid', 'summary']]
submission.to_csv("bbc_summary_tfidf.csv", index=False)
print(submission.head())

                                               guid  \
0  https://www.bbc.co.uk/news/world-europe-60638042   
1  https://www.bbc.co.uk/news/world-europe-60641873   
2      https://www.bbc.co.uk/news/business-60623941   
3            https://www.bbc.co.uk/news/uk-60579079   
4      https://www.bbc.co.uk/news/business-60642786   

                                             summary  
0  The Ukrainian president says the country will ...  
1  Jeremy Bowen was on the frontline in Irpin, as...  
2  One of the world's biggest fertiliser firms sa...  
3  The parents of the Manchester Arena bombing's ...  
4  Consumers are feeling the impact of higher ene...  
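
Before uploading, it can help to sanity-check the submission frame. The helper below is a hypothetical sketch (the name `check_submission` and its messages are not part of any library), assuming the required columns are `guid` and `summary`:

```python
import pandas as pd

def check_submission(frame, id_col="guid", text_col="summary"):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    for col in (id_col, text_col):
        if col not in frame.columns:
            problems.append(f"missing column: {col}")
    if problems:
        return problems

    # Each article should appear exactly once
    if frame[id_col].duplicated().any():
        problems.append("duplicate ids")

    # Treat NaN and whitespace-only strings as empty summaries
    empty = frame[text_col].isna() | (frame[text_col].astype(str).str.strip() == "")
    if empty.any():
        problems.append(f"empty summaries: {int(empty.sum())}")
    return problems

# Made-up frame with one duplicated id and one blank summary
demo = pd.DataFrame({"guid": ["a", "a"], "summary": ["Some text.", "  "]})
print(check_submission(demo))
```

Calling `check_submission(submission)` right before `to_csv` would surface a misspelled column name early instead of at upload time.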


## Approach 2: Abstractive (Transformer, If Allowed)

In [20]:
import pandas as pd
from transformers import pipeline

# Load dataset
df = pd.read_csv("news_articles.csv")

In [21]:
# Load small summarizer model
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cpu


In [22]:
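def abstractive_summary(text):
    # Skip missing or blank descriptions
    if pd.isna(text) or len(text.strip()) == 0:
        return ""
    # do_sample=False gives deterministic output;
    # truncation=True clips inputs longer than the model's limit instead of raising an error
    result = summarizer(text, max_length=40, min_length=10, do_sample=False, truncation=True)
    return result[0]['summary_text']

# Runs the model once per row, which can be slow on CPU
df['summary'] = df['description'].apply(abstractive_summary)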


Your max_length is set to 40, but your input_length is only 19. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=9)

KeyboardInterrupt: 
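
The warnings above appear because the descriptions are only about 20 tokens long, shorter than `max_length=40`; the suggested values in the warnings correspond to half the input length. One way to follow that hint is to derive the length bounds from the input size. This is a sketch only: `choose_lengths` is a hypothetical helper, and the caps of 40 and 10 simply reuse the values from the cell above.

```python
def choose_lengths(n_tokens, max_cap=40, min_floor=10):
    """Pick (max_length, min_length) so the summary is shorter than the input."""
    # Roughly half the input length, capped at max_cap and never below 2
    max_len = max(2, min(max_cap, n_tokens // 2))
    # min_length must stay below max_length
    min_len = min(min_floor, max(1, max_len - 1))
    return max_len, min_len

# Token counts taken from the warnings above, plus one long input
for n in (19, 23, 200):
    print(n, choose_lengths(n))
```

With the pipeline loaded earlier, the token count could be obtained from `len(summarizer.tokenizer(text)["input_ids"])` and the two values passed as `max_length` and `min_length`. For inputs this short, an abstractive model has little to compress, so returning the description unchanged may be a reasonable alternative.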

In [None]:
# Save submission
submission = df[['guid', 'summary']]
submission.to_csv("bbc_summary_abstractive.csv", index=False)
print(submission.head())