# 01 - Live News Classification into Consulting Themes

In this notebook, we will:
- Scrape live business news articles (titles and auto-generated summaries) using `newspaper3k`
- Clean and prepare the data
- Use zero-shot classification (BART-MNLI) to assign each article a consulting-relevant theme

**Themes**:
- Digital Transformation
- Cost Reduction
- Mergers and Acquisitions
- Sustainability / ESG
- Organizational Change
- Supply Chain Optimization
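
To make the zero-shot step less of a black box: an NLI model such as BART-MNLI scores whether a premise (the article text) entails a hypothesis built from each candidate label. The sketch below only constructs those premise/hypothesis pairs and does not run a model. The template string matches the default `hypothesis_template` of the Transformers zero-shot pipeline; `build_nli_pairs` and the sample headline are hypothetical, introduced here purely for illustration.

```python
# Illustrative only: shows how candidate labels become NLI hypotheses.
# Default template used by the Transformers zero-shot pipeline
HYPOTHESIS_TEMPLATE = "This example is {}."


def build_nli_pairs(text, labels, template=HYPOTHESIS_TEMPLATE):
    """Return one (premise, hypothesis) pair per candidate label."""
    return [(text, template.format(label)) for label in labels]


# Hypothetical headline, not scraped data
sample_text = "Retailer announces plan to automate its warehouses"
pairs = build_nli_pairs(sample_text, ["Digital Transformation", "Cost Reduction"])

for premise, hypothesis in pairs:
    print(f"{premise!r} -> {hypothesis!r}")
```

The model then assigns each pair an entailment score, and the label whose hypothesis is most strongly entailed becomes the top theme.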

In [None]:
# Install required libraries (run only once)
!pip install newspaper3k transformers torch pandas nltk --quiet

In [None]:
# Imports
from newspaper import build
from transformers import pipeline
import pandas as pd
import nltk

# Tokenizer data used by newspaper's Article.nlp() for summarization
nltk.download('punkt')

## Step 1: Scrape Articles from a Business News Site

In [None]:
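# Build a newspaper source from the Reuters business section
url = 'https://www.reuters.com/business/'
# memoize_articles=False: do not skip articles seen in a previous run
paper = build(url, memoize_articles=False)

articles = []
for content in paper.articles[:10]:
    try:
        content.download()
        content.parse()
        content.nlp()  # populates content.summary
        articles.append({
            'title': content.title,
            'summary': content.summary,
            'url': content.url
        })
    except Exception as exc:
        # Skip articles that fail to download or parse, but report why
        print(f"Skipping {content.url}: {exc}")
        continue

# Name the columns explicitly so the frame is well-formed even if nothing was scraped
df = pd.DataFrame(articles, columns=['title', 'summary', 'url'])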
df.head()
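
The introduction lists a "clean and prepare" step. A minimal sketch of what that could look like is shown here, working on the list of scraped records in plain Python. The `clean_articles` helper and the sample records (with `example.com` URLs) are hypothetical and not part of the original notebook.

```python
def clean_articles(records):
    """Strip whitespace, drop records without a title, and de-duplicate by URL."""
    seen_urls = set()
    cleaned = []
    for record in records:
        title = (record.get('title') or '').strip()
        summary = (record.get('summary') or '').strip()
        url = record.get('url')
        # Skip untitled articles and URLs that were already kept
        if not title or url in seen_urls:
            continue
        seen_urls.add(url)
        cleaned.append({
            'title': title,
            # Collapse newlines and repeated spaces inside the summary
            'summary': ' '.join(summary.split()),
            'url': url,
        })
    return cleaned


# Hypothetical records for illustration only
sample_records = [
    {'title': '  Firm A buys Firm B ', 'summary': 'Line one.\nLine two.', 'url': 'https://example.com/a'},
    {'title': 'Firm A buys Firm B', 'summary': 'Duplicate.', 'url': 'https://example.com/a'},
    {'title': '', 'summary': 'No title.', 'url': 'https://example.com/b'},
]
cleaned_records = clean_articles(sample_records)
print(cleaned_records)
```

In this notebook, the helper could be applied before building the DataFrame, for example `df = pd.DataFrame(clean_articles(articles))`.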

## Step 2: Define Consulting Themes

In [None]:
consulting_themes = [
    "Digital Transformation",
    "Cost Reduction",
    "Mergers and Acquisitions",
    "Sustainability / ESG",
    "Organizational Change",
    "Supply Chain Optimization"
]

## Step 3: Run Zero-Shot Classification

In [None]:
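# Load the zero-shot classification pipeline (the model is downloaded on first use)
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

results = []
for _, row in df.iterrows():
    # Classify the title and summary together to give the model more context
    text = row['title'] + ". " + row['summary']
    prediction = classifier(text, consulting_themes)
    # Labels and scores come back sorted by score, highest first
    results.append({
        'Title': row['title'],
        'Summary': row['summary'],
        'Top Theme': prediction['labels'][0],
        'Confidence': round(prediction['scores'][0], 2),
        'URL': row['url']
    })

# Name the columns explicitly so sorting still works if no articles were scraped
final_df = pd.DataFrame(
    results, columns=['Title', 'Summary', 'Top Theme', 'Confidence', 'URL']
)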
final_df.sort_values(by='Confidence', ascending=False).reset_index(drop=True)
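
Because the scores are normalized across the six candidate themes, every article receives a top theme even when none of them fits well. One possible refinement, sketched below, is to fall back to a catch-all label when the top score is low. The `top_theme_or_fallback` helper, the `0.5` threshold, and the `"Unclassified"` label are assumptions for illustration. The mock dictionary only mimics the shape of the pipeline output (`sequence`, `labels`, `scores`) and is not a real model result.

```python
def top_theme_or_fallback(prediction, threshold=0.5, fallback="Unclassified"):
    """Return the top label, or `fallback` if its score is below `threshold`."""
    labels = prediction['labels']
    scores = prediction['scores']
    if not labels or scores[0] < threshold:
        return fallback
    return labels[0]


# Mock prediction in the pipeline's output shape (values are made up)
mock_prediction = {
    'sequence': 'Hypothetical headline',
    'labels': ['Cost Reduction', 'Organizational Change', 'Digital Transformation'],
    'scores': [0.41, 0.33, 0.26],
}

print(top_theme_or_fallback(mock_prediction))                 # below the 0.5 threshold
print(top_theme_or_fallback(mock_prediction, threshold=0.4))  # above the 0.4 threshold
```

A suitable threshold depends on the data and would need to be chosen by inspecting real predictions. The value here is only a placeholder.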