

# Financial News Scraper


## Overview

This pipeline implements a **financial news extraction system** that collects, cleans, scores, and organizes high-signal, company-specific news from multiple publicly available news websites.
It supports **multi-company ingestion**, **true pagination**, **noise reduction**, and **financial relevance scoring**, producing outputs suitable for downstream analytics or LLM-based reasoning.

---

## Input

* **`COMPANIES`**
  A list of stock tickers:

  ```python
  ["TSLA","NVDA","AAPL","AMD","AMZN","MSFT","GOOGL","META",
   "BAC","INTC","CSCO","KO","XOM","NFLX","NKE"]
  ```

* **Derived Company Keywords**
  Each ticker is mapped to a company-specific keyword to avoid semantic ambiguity
  (e.g., *AAPL → “apple inc”*, *KO → “coca cola”*).

* **Pagination Range**
  News is fetched page-by-page until records are exhausted or a safe upper bound is reached.
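
The stopping rule described above (stop when a page returns no records, or when a safe upper bound is reached) can be sketched as a small generator. This is a minimal sketch under stated assumptions, not the pipeline's actual implementation: `iter_pages`, `fetch_page`, and `max_pages` are hypothetical names, and `fetch_page` is assumed to return a list of records for a page number (an empty list once the source is exhausted).

```python
from typing import Callable, Iterator, List


def iter_pages(fetch_page: Callable[[int], List[dict]],
               max_pages: int = 50) -> Iterator[dict]:
    """Yield records page by page until a page is empty or max_pages is reached."""
    for page in range(1, max_pages + 1):
        records = fetch_page(page)
        if not records:  # source exhausted: stop early
            break
        yield from records


# Stand-in for a real page fetcher (hypothetical): three pages, two records each.
fake_pages = {p: [{"page": p, "idx": i} for i in range(2)] for p in (1, 2, 3)}
collected = list(iter_pages(lambda p: fake_pages.get(p, []), max_pages=10))
print(len(collected))  # 6: pages 1-3 are consumed, page 4 is empty
```

Passing the fetcher in as a callable keeps the stopping logic testable without any network access.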

---

## Data Sources

* **Economic Times**
  Topic-based pages with native pagination (page number appended to the topic URL path, e.g. `/topic/tesla/2`)

* **The Guardian**
  Technology section pages with server-side pagination.
  Articles are filtered post-fetch to retain only company-relevant content.

All scraping follows **responsible access practices** (rate limiting, retries, user-agent headers).
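
One way to combine retries with backoff is sketched below. It is illustrative only: `fetch_with_retries`, `base_delay`, and the injected `fetch` and `sleep` callables are assumptions made for this example, not names from the pipeline. `fetch` is assumed to return the page body as a string, or `None` on failure.

```python
import random
import time
from typing import Callable, Optional


def fetch_with_retries(fetch: Callable[[str], Optional[str]],
                       url: str,
                       retries: int = 3,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Call fetch(url) up to `retries` times, backing off between failed attempts."""
    for attempt in range(retries):
        body = fetch(url)
        if body is not None:
            return body
        if attempt < retries - 1:
            # Exponential backoff plus a little jitter to avoid a fixed request rhythm.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    return None


# Hypothetical flaky fetcher: fails twice, then succeeds.
attempts = []

def flaky(url: str) -> Optional[str]:
    attempts.append(url)
    return "<html>ok</html>" if len(attempts) == 3 else None

result = fetch_with_retries(flaky, "https://example.com/news", sleep=lambda s: None)
print(result, len(attempts))
```

In the notebook, a function with the same contract as `safe_request` (body on success, `None` otherwise) could be passed as `fetch`.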

---

## Core Logic

### 1. Multi-Company Pagination Engine

* Iterates over each company independently.
* Scrapes paginated result pages.
* Extracts full article URLs and fetches individual articles.
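
Extracting full article URLs usually means resolving relative links against the page they were found on. The sketch below shows one possible approach with the standard library; `normalize_links` is a hypothetical helper and the `example.com` URLs are placeholders, not real endpoints used by the pipeline.

```python
from urllib.parse import urldefrag, urljoin


def normalize_links(page_url, hrefs):
    """Resolve hrefs against page_url, drop #fragments, keep first occurrences only."""
    seen, out = set(), []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute not in seen:
            seen.add(absolute)
            out.append(absolute)
    return out


links = normalize_links(
    "https://example.com/topic/tesla/2",
    ["/news/a.cms", "/news/a.cms#comments", "https://example.com/news/b.cms"],
)
print(links)
# ['https://example.com/news/a.cms', 'https://example.com/news/b.cms']
```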

---

### 2. Hard Relevance Filtering

* Articles must explicitly mention the **company keyword** in the title or body.
* Prevents false positives (e.g., fruit “apple” vs Apple Inc.).
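
A plain substring test is not enough for this kind of filter, because short keywords also match inside longer words ("apple" inside "pineapple"). A minimal sketch of a stricter check, assuming whole-word, case-insensitive matching is what is wanted, is shown below; `mentions_keyword` is a hypothetical name and not part of the pipeline.

```python
import re
from typing import Iterable


def mentions_keyword(text: str, keywords: Iterable[str]) -> bool:
    """True if any keyword occurs in text as a whole word or phrase (case-insensitive)."""
    for kw in keywords:
        # Lookarounds require that the keyword is not glued to other word characters.
        pattern = r"(?<!\w)" + re.escape(kw) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False


print(mentions_keyword("Apple Inc. beats earnings estimates", ["apple inc"]))  # True
print(mentions_keyword("Best apple pie recipes", ["apple inc"]))               # False
print(mentions_keyword("Pineapple prices rise", ["apple"]))                    # False
```

Using a multi-word keyword such as "apple inc" handles the fruit case, while the word-boundary lookarounds handle accidental matches inside longer words.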

---



## Output

### Primary Output (CSV)

A unified dataset containing all companies:

```text
ticker
company
source
title
summary
published
url
scraped_at
```

This format is suitable for:

* Time-series analysis
* Event detection
* Sentiment aggregation
* Model training
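
As a rough illustration of downstream use, the sketch below validates the header against the schema listed above and counts rows per ticker using only the standard library. `count_by_ticker` is a hypothetical helper, and the sample rows are made-up placeholders, not real scraped data.

```python
import csv
import io
from collections import Counter

EXPECTED_COLUMNS = ["ticker", "company", "source", "title",
                    "summary", "published", "url", "scraped_at"]


def count_by_ticker(csv_file):
    """Check that all schema columns are present, then count rows per ticker."""
    reader = csv.DictReader(csv_file)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return Counter(row["ticker"] for row in reader)


# In-memory stand-in for the exported CSV; every row here is a placeholder.
sample = io.StringIO(
    "ticker,company,source,title,summary,published,url,scraped_at\n"
    "TSLA,tesla,Example Source,Title A,Summary A,,https://example.com/a,2024-01-01 00:00:00\n"
    "TSLA,tesla,Example Source,Title B,Summary B,,https://example.com/b,2024-01-01 00:00:00\n"
    "KO,coca-cola,Example Source,Title C,Summary C,,https://example.com/c,2024-01-01 00:00:00\n"
)
counts = count_by_ticker(sample)
print(counts["TSLA"], counts["KO"])  # 2 1
```

For the real file, an open file handle (for example `open("multi_company_news.csv", newline="", encoding="utf-8")`) can be passed in place of `sample`.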

---




## Scraping Session Metadata
Records the timestamp, URLs used, and tickers scraped for this session. This cell must be run **first** so that `SCRAPE_TIMESTAMP` is available for the CSV export cell later.


```python
# ============================================================
# MULTI-COMPANY ECONOMIC TIMES SCRAPER (FAST, 15 COMPANIES)
# Target: 60–100 articles per company
# ============================================================

import hashlib
import random
import time
from datetime import datetime
from urllib.parse import urljoin

import pandas as pd
import requests
from bs4 import BeautifulSoup

# ============================================================
# CONFIG
# ============================================================
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0 Safari/537.36"
    )
}

TIMEOUT = 6                     # seconds per request
DELAY_RANGE = (0.6, 1.2)        # faster but safe
MAX_PAGES_ET = 140              # ET hard limit
PER_COMPANY_TARGET = 80         # 60–100 safe band

SCRAPED_AT = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

COMPANIES = [
    "TSLA","NVDA","AAPL","AMD","AMZN","MSFT","GOOGL","META",
    "BAC","INTC","CSCO","KO","XOM","NFLX","NKE"
]

# ============================================================
# ECONOMIC TIMES TOPICS (PRIMARY + FALLBACK)
# ============================================================
ET_TOPICS = {
    "TSLA": ["tesla"],
    "NVDA": ["nvidia", "ai", "semiconductor"],
    "AAPL": ["apple", "iphone"],
    "AMD": ["amd", "chipmaker", "semiconductor"],
    "AMZN": ["amazon", "aws", "e-commerce"],
    "MSFT": ["microsoft", "azure", "cloud"],
    "GOOGL": ["google", "alphabet", "search"],
    "META": ["meta", "facebook", "instagram"],
    "BAC": ["bank-of-america", "banking", "us-banks"],
    "INTC": ["intel", "chipmaker", "semiconductor"],
    "CSCO": ["cisco", "enterprise-tech", "networking"],
    "KO": ["coca-cola", "fmcg", "consumer-goods"],
    "XOM": ["exxon-mobil", "oil", "energy"],
    "NFLX": ["netflix", "streaming", "media"],
    "NKE": ["nike", "retail", "sportswear"]
}

# ============================================================
# HIGH-RECALL KEYWORDS (FILTERING ONLY)
# ============================================================
KEYWORDS = {
    "TSLA": ["tesla", "elon musk", "ev", "robotaxi"],
    "NVDA": ["nvidia", "gpu", "ai"],
    "AAPL": ["apple", "iphone", "ios"],
    "AMD": ["amd", "ryzen", "chip"],
    "AMZN": ["amazon", "aws"],
    "MSFT": ["microsoft", "azure"],
    "GOOGL": ["google", "alphabet"],
    "META": ["meta", "facebook"],
    "BAC": ["bank of america", "bofa"],
    "INTC": ["intel", "chip"],
    "CSCO": ["cisco", "network"],
    "KO": ["coca cola"],
    "XOM": ["exxon", "oil"],
    "NFLX": ["netflix"],
    "NKE": ["nike", "sneaker"]
}

# ============================================================
# UTILITIES
# ============================================================
def polite_delay():
    """Sleep for a random interval within DELAY_RANGE between requests."""
    time.sleep(random.uniform(*DELAY_RANGE))

def safe_request(url):
    """Return the response body for an HTTP 200 response, otherwise None."""
    try:
        r = requests.get(url, headers=HEADERS, timeout=TIMEOUT)
        return r.text if r.status_code == 200 else None
    except requests.RequestException:
        # Network errors and timeouts are treated as "no page".
        return None

def dedup_key(title, url):
    """Hash of the lower-cased title plus URL, used only to drop duplicate rows."""
    return hashlib.md5((title.lower() + url).encode()).hexdigest()

def match(text, kws):
    """Case-insensitive substring match against any keyword.

    Note: short keywords such as "ai" or "ev" also match inside longer words.
    """
    t = text.lower()
    return any(k in t for k in kws)

# ============================================================
# ECONOMIC TIMES SCRAPER (ROBUST)
# ============================================================
def scrape_economic_times(ticker):
    """Collect up to PER_COMPANY_TARGET matching records for one ticker."""
    records = []
    base = "https://economictimes.indiatimes.com/topic"
    keywords = KEYWORDS[ticker]

    for topic in ET_TOPICS[ticker]:
        slug = topic.replace(" ", "-")

        for page in range(1, MAX_PAGES_ET + 1):
            if len(records) >= PER_COMPANY_TARGET:
                return records

            url = f"{base}/{slug}/{page}"
            html = safe_request(url)
            polite_delay()

            # A failed request or an empty page ends this topic.
            if not html:
                break

            soup = BeautifulSoup(html, "html.parser")
            articles = soup.select("div.contentD")

            if not articles:
                break

            for block in articles:
                heading = block.find("h2")
                link = heading.find("a", href=True) if heading else None
                if not link:
                    continue

                title = link.text.strip()
                p = block.find("p")
                summary = p.text.strip() if p else ""

                # Keep only blocks whose title or summary mentions a keyword.
                combined = title + " " + summary
                if not match(combined, keywords):
                    continue

                time_tag = block.find("time")

                records.append({
                    "ticker": ticker,
                    "company": topic,  # the ET topic slug that produced this hit
                    "source": "Economic Times",
                    "title": title,
                    "summary": summary,
                    "published": time_tag.text.strip() if time_tag else "",
                    "url": urljoin(base, link["href"]),
                    "scraped_at": SCRAPED_AT
                })

                if len(records) >= PER_COMPANY_TARGET:
                    return records

    return records

# ============================================================
# RUN PIPELINE (ALL 15 COMPANIES GUARANTEED)
# ============================================================
all_records = []

for ticker in COMPANIES:
    print(f"Scraping {ticker}...")
    company_records = scrape_economic_times(ticker)
    print(f"  → {len(company_records)} records")
    all_records.extend(company_records)

# ============================================================
# CLEAN + SAVE
# ============================================================
df = pd.DataFrame(all_records)
if not df.empty:
    # Drop rows that share the same (lower-cased title, URL) pair.
    df["dedup"] = df.apply(lambda r: dedup_key(r["title"], r["url"]), axis=1)
    df = df.drop_duplicates("dedup").drop(columns="dedup")

df.to_csv("multi_company_news.csv", index=False)

print("\nSCRAPING METADATA")
print("-----------------")
print(f"Companies scraped : {len(COMPANIES)}")
print(f"Total records     : {len(df)}")
print("Saved to          : multi_company_news.csv")
```

```text
Scraping TSLA...
  → 80 records
Scraping NVDA...
  → 80 records
Scraping AAPL...
  → 80 records
Scraping AMD...
  → 80 records
Scraping AMZN...
  → 80 records
Scraping MSFT...
  → 80 records
Scraping GOOGL...
  → 80 records
Scraping META...
  → 80 records
Scraping BAC...
  → 80 records
Scraping INTC...
  → 80 records
Scraping CSCO...
  → 80 records
Scraping KO...
  → 66 records
Scraping XOM...
  → 80 records
Scraping NFLX...
  → 80 records
Scraping NKE...
  → 80 records

SCRAPING METADATA
-----------------
Companies scraped : 15
Total records     : 1050
Saved to          : multi_company_news.csv
```
