
In [1]:
# Project: Web Article Summarization
# 
# Goal:
# This code is designed to fetch the content from a web article and then use a transformer-based model
# to generate a summarized version of the article.
# 
# Short Description:
# The script uses the requests and BeautifulSoup libraries to scrape the headings and paragraphs from a
# specified URL. It includes a preprocessing step (lowercasing and punctuation removal), but the text is
# re-extracted afterwards, so the summarization model from Hugging Face's Transformers library receives the
# original casing and punctuation. The content is split into chunks of at most 515 words, each chunk is
# summarized, and the chunk summaries are concatenated into a single summary of the article.

In [2]:
# Install the libraries used for summarization and web scraping
!pip install transformers requests beautifulsoup4



In [3]:
# Import necessary libraries
import string  # Provides string.punctuation for the preprocessing step

import requests  # HTTP client used to download the article
from bs4 import BeautifulSoup  # HTML parser used to extract headings and paragraphs
from transformers import pipeline  # Factory function that builds the summarization pipeline


In [4]:
# Initialize the summarization pipeline, naming the model explicitly.
# sshleifer/distilbart-cnn-12-6 (revision a4f8f3e) is the checkpoint the pipeline
# falls back to when no model is supplied; pinning it keeps the notebook reproducible.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")



In [5]:
# URL of the article to be summarized
URL = "https://www.aidancooper.co.uk/a-non-technical-guide-to-interpreting-shap-analyses/"

In [6]:
# Fetch the content of the URL; fail early on network or HTTP errors
r = requests.get(URL, timeout=30)
r.raise_for_status()
# Parse the content using BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
# Extract text from h1 and p tags (the page title and body paragraphs)
results = soup.find_all(['h1', 'p'])
text = [result.text for result in results]
ARTICLE = ' '.join(text)

In [7]:
# Preprocessing steps
# Convert text to lowercase
ARTICLE = ARTICLE.lower()
# Remove punctuation
ARTICLE = ARTICLE.translate(str.maketrans('', '', string.punctuation))
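
To make the effect of this preprocessing concrete, here is a minimal sketch that applies the same two steps to a short string. The helper name `strip_case_and_punctuation` and the sample sentence are hypothetical and only serve as illustration. Note that the full stops disappear, so sentence boundaries can no longer be recovered from the cleaned text, and that summarization models trained on ordinary cased text may do worse on input like this.

```python
import string


def strip_case_and_punctuation(text: str) -> str:
    """Lowercase the text and delete every character in string.punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


# Hypothetical sample sentence, not taken from the article
sample = "SHAP explains predictions. It's post hoc, i.e. applied after training."
cleaned = strip_case_and_punctuation(sample)
print(cleaned)
```

Apostrophes and abbreviation dots are removed along with the sentence-ending punctuation, so "It's" and "i.e." are merged into single tokens.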

In [8]:
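# Re-extract the text from the h1 and p tags.
# Note: this overwrites the lowercased, punctuation-free ARTICLE from the previous cell,
# so the summarizer receives the original casing and punctuation.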
results = soup.find_all(['h1', 'p'])
text = [result.text for result in results]
ARTICLE = ' '.join(text)

In [9]:
# Split the article into chunks to feed to the summarizer,
# because the model can only handle a limited number of tokens in a single pass
max_chunk = 515  # Maximum number of words per chunk
words = ARTICLE.split()  # Split on whitespace, so each item is a word

# Group the words into consecutive chunks and join each chunk back into a string
chunks = [
    ' '.join(words[i:i + max_chunk])
    for i in range(0, len(words), max_chunk)
]
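
Because the cell above splits on spaces, a chunk boundary can fall in the middle of a sentence. One possible alternative, sketched here under assumptions, packs whole sentences into each chunk instead. The function name `chunk_by_sentence` and the demo text are hypothetical, and the regex-based sentence split is deliberately naive (it would also break after abbreviations such as "i.e."), so treat this as an illustration rather than the notebook's method.

```python
import re
from typing import List


def chunk_by_sentence(text: str, max_words: int = 515) -> List[str]:
    """Pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Hypothetical demo text with a tiny word budget to make the grouping visible
demo = "One two three. Four five! Six seven eight nine? Ten."
demo_chunks = chunk_by_sentence(demo, max_words=5)
print(demo_chunks)
```

The resulting list of strings could be passed to the summarizer in the same way as `chunks`.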

In [10]:
# Use the summarizer to summarize each chunk
res = summarizer(chunks, max_length=120, min_length=30, do_sample=False)

In [11]:
# Combine the summarized chunks into a single summary
combined_summary = " ".join([item['summary_text'] for item in res])
print(combined_summary)

 SHAP is a method that explains how individual predictions are made by a machine learning model . It deconstructs a prediction into a sum of contributions from each of the model's input variables . This guide is intended to serve two audiences: This guide prioritises clarity over strict technical accuracy . For those who wish to dig deeper on certain topics, links to useful resources are provided .  SHAP quantifies how important each input variable is to a model for making predictions . It is particularly efficient to compute SHAP for tree-based models, such as random forests and gradient boosted trees . SHAP can be applied to any machine learning model as a post hoc interpretation technique - i.e. it is applied after model training .  The goal of global interpretation methods is to describe the expected behaviour of a machine learning model with respect to the whole distribution of values for its input variables . With SHAP, this is achieved by aggregating the mean absolute SHAP value
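
The printed summary contains a space before each full stop and doubled spaces where chunk summaries were joined. If cleaner text is wanted, a small post-processing step along the following lines could be applied to `combined_summary`. The helper name `tidy_summary` is hypothetical, and the sample string is a shortened stand-in written in the style of the output above, not the real summary.

```python
import re


def tidy_summary(text: str) -> str:
    """Collapse repeated whitespace and remove spaces before punctuation."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


# Shortened, hypothetical sample in the style of the summarizer output
raw = " SHAP is a method . It deconstructs a prediction .  SHAP quantifies importance ."
tidied = tidy_summary(raw)
print(tidied)
```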