In [1]:
# pip install transformers torch requests beautifulsoup4 pandas nltk feedparser tf-keras
!pip install trafilatura readability-lxml lxml newspaper3k





# News-Sentiment DMA Screener — README

A single-file Python tool that:

* Fetches Indian equity news via **Google News RSS** using **dynamic company names** loaded from **NSE index CSVs** (Nifty 50 / Next 50 / Bank / 500 / Midcap 100 / Smallcap 100).
* Filters to an **allowlist of publishers** (Moneycontrol, Economic Times, Mint/LiveMint, Business Standard, CNBC TV18) with a safe **fallback** if none match.
* Extracts **full article text** (trafilatura → readability-lxml → newspaper3k) + AMP/canonical cleanup.
* Runs **FinBERT** sentiment on **headlines** and **articles**.
* Applies **recency weighting** (≤24h = 1.5×, 24–48h = 1.2×).
* Outputs a per-ticker **bias**: `LONG` / `SHORT` / `NEUTRAL`.

---

## 1) What you get

For each ticker, the script prints a table:

| column           | meaning                                                     |
| ---------------- | ----------------------------------------------------------- |
| `symbol`         | NSE ticker with `.NS` suffix (e.g., `LUPIN.NS`)             |
| `headline_avg`   | Recency-weighted sentiment from headlines only (−1..+1)     |
| `article_avg`    | Average sentiment from extracted article bodies / summaries |
| `combined_score` | 0.7×article + 0.3×headline; headline only if no articles    |
| `bias`           | `LONG` if > +0.05, `SHORT` if < −0.05, else `NEUTRAL`       |
| `n_headlines`    | Number of headlines considered                              |
| `n_articles`     | Number of articles whose text was extracted / used          |

> Tip: If `n_articles` is 0, the script still uses headline sentiment (and, if enabled, RSS summary fallback).

---

## 2) Requirements

* Python **3.10+**
* Packages:

  * Always: `requests`, `feedparser`, `pandas`, `transformers`, plus a model backend such as `torch`
  * Optional (recommended for better extraction):
    `trafilatura`, `readability-lxml`, `lxml`, `newspaper3k`
* Model: `ProsusAI/finbert` (downloaded automatically by 🤗 Transformers)
* macOS/Apple Silicon: MPS is fine (transformers prints `Device set to use mps:0`)

Install:

```bash
pip install requests feedparser pandas transformers torch
# optional but recommended:
pip install trafilatura readability-lxml lxml newspaper3k
```

> If you use a GPU/Metal, Transformers will auto-choose the device. No config needed.

---

## 3) How it works

1. **Dynamic company names**
   The script warms up an NSE session, downloads multiple index CSVs, and builds `NAME_MAP = {SYMBOL: "Company Name"}`.
   Example: `"LUPIN" → "Lupin"`, `"CENTRALBK" → "Central Bank of India"`.

2. **News fetching (Google News RSS)**
   For each ticker, it tries these queries in order and stops at the first one that returns results:

   * `"Company Name" stock india`
   * `Company Name shares`
   * `SYMBOL stock india`

   It keeps **allowlisted publishers** if present; otherwise it **returns all** to avoid empty results.

3. **Text extraction**
   For each link, it:

   * Canonicalizes/cleans the URL (removes AMP and tracking where safe).
   * Tries `trafilatura` → `readability` → `newspaper3k` in order.
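   * Uses article text if it meets the length gate (each extractor requires `MIN_ARTICLE_CHARS`, default 200, before the final 120-char check); else falls back to the RSS summary (if available and ≥80 chars).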

4. **Sentiment & weighting**

   * Headline sentiment: recency-weighted (≤24h: 1.5×, 24–48h: 1.2×, else 1.0×).
   * Article sentiment: average of chunked body text (512-token budget heuristic).
   * Combined: `0.7 * article_avg + 0.3 * headline_avg` *if* any articles were read; otherwise `headline_avg`.

5. **Bias rule**

   * `combined_score > +0.05` → **LONG**
   * `combined_score < −0.05` → **SHORT**
   * otherwise **NEUTRAL**
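
As a worked example of steps 4 and 5, the sketch below recomputes the scores from made-up inputs. The function names (`weighted_headline_avg`, `combine`, `bias`) and the sample numbers are illustrative assumptions, not the script's own API; only the recency weights, the 0.7/0.3 blend, and the ±0.05 band come from the description above.

```python
NEUTRAL_BAND = 0.05  # mirrors the ±0.05 band described above

def weighted_headline_avg(signed_scores, ages_hours):
    """Recency-weighted mean: ≤24h → 1.5, 24–48h → 1.2, older → 1.0."""
    weights = [1.5 if a <= 24 else 1.2 if a <= 48 else 1.0 for a in ages_hours]
    return sum(s * w for s, w in zip(signed_scores, weights)) / sum(weights)

def combine(article_avg, headline_avg, n_articles):
    """Blend 70/30 when at least one article was read, else use headlines only."""
    if n_articles > 0:
        return 0.7 * article_avg + 0.3 * headline_avg
    return headline_avg

def bias(score):
    """Map a score to LONG / SHORT / NEUTRAL using the neutral band."""
    if score > NEUTRAL_BAND:
        return "LONG"
    if score < -NEUTRAL_BAND:
        return "SHORT"
    return "NEUTRAL"

# Hypothetical signed scores (label sign × model confidence) and headline ages in hours
headline_avg = weighted_headline_avg([0.9, -0.6, 0.0], [5, 30, 100])
combined = combine(article_avg=0.2, headline_avg=headline_avg, n_articles=2)
print(round(headline_avg, 4), round(combined, 4), bias(combined))
```

With these inputs the fresh positive headline outweighs the older negative one, and the article score pulls the blend further into `LONG` territory.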

---

## 4) Running it

Edit the `tickers` list at the bottom and run:

```bash
python news_sentiment_dma.py
```

Example:

```python
if __name__ == "__main__":
    tickers = [
        "CENTRALBK.NS",
        "LUPIN.NS",
        "UCOBANK.NS",
    ]
    sentiment_df = build_sentiment_table(tickers)
    print(sentiment_df)
```

> Note: In Python, each item in the list needs a comma. A missing comma will concatenate adjacent strings.

---

## 5) Configuration knobs

* **Allowlist domains**: update `ALLOWLIST` to tighten/loosen publisher filtering.
* **Recency weighting**: tweak `recency_weight()` thresholds/weights.
* **Neutral band**: adjust `SENTIMENT_NEUTRAL_BAND` (default 0.05).
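* **Extraction threshold**: change `MIN_ARTICLE_CHARS` (default 200, enforced inside each extractor). `extract_article_text()` adds a hard-coded 120-char gate, which only matters if `MIN_ARTICLE_CHARS` is set below 120.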
* **Max items**: `MAX_HEADLINES`, `MAX_ARTICLES_PER_TICKER`.
* **Token budget**: `MAX_TOKENS_PER_ARTICLE` (rough 4 chars/token heuristic).
* **Dynamic names**: extend/override `DEFAULT_NSE_INDEX_URLS` or add entries to `EXTRA_NAME_MAP`.
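
One thing to keep in mind when tuning the allowlist: the script matches entries as substrings of the URL, so a short entry such as `"mint"` can also match unrelated URLs. The sketch below shows one possible stricter variant that compares hostnames instead. `ALLOWED_HOSTS` and `is_allowlisted` are hypothetical names introduced for illustration; they are not part of the script.

```python
from urllib.parse import urlparse

# Hypothetical stricter variant of ALLOWLIST: full domains only
ALLOWED_HOSTS = {
    "moneycontrol.com",
    "economictimes.indiatimes.com",
    "livemint.com",
    "business-standard.com",
    "cnbctv18.com",
}

def is_allowlisted(url: str) -> bool:
    """True if the URL's host equals an allowed domain or is a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == dom or host.endswith("." + dom) for dom in ALLOWED_HOSTS)

print(is_allowlisted("https://www.livemint.com/market/stock-market-news"))  # True
print(is_allowlisted("https://example.com/news/mint-condition-coins"))      # False
```

The trade-off is that mirrors hosted on other domains no longer match, so any that matter would need their own entries.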

---

## 6) Troubleshooting

* **All zeros / no headlines**

  * Your query might be too strict or network is blocked. Try printing raw rows:

    ```python
    raw_df = fetch_news_for_ticker("LUPIN.NS")
    print(raw_df[["title","link","allowlisted"]])
    ```
  * If allowlist filters out everything, the script **falls back** to returning all publishers.

* **`n_articles = 0`**

  * Many finance sites are AMP/JS/paywalled; extraction can fail.
  * Lower thresholds (`MIN_ARTICLE_CHARS`), ensure optional libs are installed, and rely on RSS **summary fallback** (already enabled).

* **NSE CSV errors**

  * NSE can be finicky without cookies. The script warms up a session; re-run if a CSV fails transiently.
  * You can limit to fewer CSV URLs if needed.

* **Model errors**

  * If Transformers downloads stall, try `pip install -U transformers` and ensure internet access.
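
For transient failures such as the NSE CSV errors above, re-running by hand can be replaced with a small retry helper. The sketch below is one way to do it, assuming exponential backoff is acceptable; `with_retries` and `flaky_fetch` are hypothetical names, and the stand-in fetcher only simulates a download so the example runs offline.

```python
import time

def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for i in range(attempts):
        try:
            return fetch()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** i)

# Stand-in fetcher that fails twice, then returns CSV-like text
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "Symbol,Company Name\nLUPIN,Lupin Ltd.\n"

csv_text = with_retries(flaky_fetch, base_delay=0.01)
print(calls["n"], csv_text.splitlines()[0])
```

In the script, `fetch` could wrap the session's CSV request together with its status check, so that a failed download is retried before the index is skipped.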

---

## 7) Extending it

* **Combine with DMA/RSI screener**
  Use `combined_score`/`bias` as a **news gate**: only consider longs where both **technicals** (DMA/RSI) and **news** are bullish.

* **Add Bing News RSS fallback**
  You can implement a second fetcher to merge Bing RSS results if Google News is sparse.

* **Recency within articles**
  Weight article paragraphs by detected timestamps or TF-IDF to emphasize fresh info.

* **Caching**
  Cache `NAME_MAP` (JSON) and news results to speed up repeated runs.
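
The news-gate idea can be sketched with a plain pandas merge. Everything in the example is made up for illustration: the `technicals_df` frame, its `close` and `dma_50` columns, and all numbers are assumptions, and `sentiment_df` merely imitates the shape of the table returned by `build_sentiment_table()`.

```python
import pandas as pd

# Hypothetical sentiment table (same columns as the script's output, fewer of them)
sentiment_df = pd.DataFrame({
    "symbol": ["LUPIN.NS", "UCOBANK.NS", "CENTRALBK.NS"],
    "combined_score": [0.21, -0.12, 0.02],
    "bias": ["LONG", "SHORT", "NEUTRAL"],
})

# Hypothetical technical screen; column names and prices are invented
technicals_df = pd.DataFrame({
    "symbol": ["LUPIN.NS", "UCOBANK.NS", "CENTRALBK.NS"],
    "close": [2050.0, 31.5, 38.2],
    "dma_50": [1980.0, 30.0, 39.0],
})

merged = technicals_df.merge(sentiment_df, on="symbol", how="inner")

# Long candidates: price above its 50-day average AND bullish news
is_uptrend = merged["close"] > merged["dma_50"]
is_bullish_news = merged["bias"] == "LONG"
long_candidates = merged[is_uptrend & is_bullish_news]
print(long_candidates["symbol"].tolist())  # ['LUPIN.NS']
```

An inner merge drops tickers that are missing from either table, which is usually the safer default for a gate.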

---

## 8) Notes & disclaimers

* This is **for research/education**. It’s not investment advice. Backtest before live trading.
* Respect publishers’ **robots/terms**; avoid aggressive scraping.
* Sentiment models can misread sarcasm, corporate wording, or headlines that invert sentiment (e.g., “loss narrows”). Use as one input among many.

---

## 9) Quick reference (key functions)

* `load_name_map_from_nse()` → builds `{SYMBOL: Company}` dynamically
* `get_company_name(ticker)` → returns company name from `NAME_MAP` for `"LUPIN.NS"`
* `fetch_news_for_ticker(ticker)` → DataFrame of news rows for that ticker
* `analyze_ticker_news(df_news, ticker)` → dict with sentiment & bias for 1 ticker
* `build_sentiment_table(tickers)` → final table across tickers

---



In [None]:
# news_sentiment_dma.py
import re, time, urllib.parse, requests, feedparser
import pandas as pd
from datetime import datetime, timedelta
from transformers import pipeline

# =========================
# CONFIG
# =========================
UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
      "AppleWebKit/537.36 (KHTML, like Gecko) "
      "Chrome/122.0.0.0 Safari/537.36")
HDRS = {
    "User-Agent": UA,
    "Accept-Language": "en-IN,en;q=0.9",
    "Accept": "application/rss+xml, application/xml;q=0.9, */*;q=0.8",
}

SENTIMENT_NEUTRAL_BAND = 0.05
MAX_HEADLINES = 10
MAX_ARTICLES_PER_TICKER = 8
MAX_TOKENS_PER_ARTICLE = 512
BATCH_SIZE = 8
MIN_ARTICLE_CHARS = 200

# Publisher allowlist (substring match on URL)
ALLOWLIST = [
    "moneycontrol.com",
    "economictimes.indiatimes.com",
    "livemint.com",
    "mint",  # include "mint" as some mirrors use it
    "business-standard.com",
    "cnbctv18.com",
]

# FinBERT model
MODEL_NAME = "ProsusAI/finbert"
sentiment_model = pipeline("sentiment-analysis", model=MODEL_NAME, tokenizer=MODEL_NAME)


# ========== DYNAMIC NAME_MAP FROM NSE INDEX CSVs ==========
from io import StringIO
from typing import Iterable

NSE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.nseindia.com/market-data/live-market-indices"
}

# Common index CSVs — add/remove as you like
DEFAULT_NSE_INDEX_URLS = [
    "https://nsearchives.nseindia.com/content/indices/ind_nifty50list.csv",
    "https://nsearchives.nseindia.com/content/indices/ind_niftynext50list.csv",
    "https://nsearchives.nseindia.com/content/indices/ind_niftybanklist.csv",
    "https://nsearchives.nseindia.com/content/indices/ind_nifty500list.csv",
    "https://nsearchives.nseindia.com/content/indices/ind_niftymidcap100list.csv",
    "https://nsearchives.nseindia.com/content/indices/ind_niftysmallcap100list.csv",
]

def load_name_map_from_nse(csv_urls: Iterable[str] | None = None, timeout: int = 10) -> dict:
    """
    Build {SYMBOL: Company Name} from one or more NSE index CSVs.
    Symbols are upper-cased WITHOUT the '.NS' suffix (e.g., 'LUPIN', 'CENTRALBK').
    """
    csv_urls = list(csv_urls) if csv_urls else list(DEFAULT_NSE_INDEX_URLS)

    sess = requests.Session()
    sess.headers.update(NSE_HEADERS)
    # Warm up cookies (important for NSE)
    try:
        sess.get("https://www.nseindia.com", timeout=timeout)
    except Exception:
        pass

    frames = []
    for url in csv_urls:
        try:
            r = sess.get(url, timeout=timeout)
            r.raise_for_status()
            df = pd.read_csv(StringIO(r.text))
            # Normalize column names (NSE uses "Symbol" and "Company Name")
            cols = {c.lower(): c for c in df.columns}
            sym_col = cols.get("symbol")
            name_col = cols.get("company name") or cols.get("company") or cols.get("companyname")
            if not sym_col or not name_col:
                continue
            slim = df[[sym_col, name_col]].rename(columns={sym_col: "Symbol", name_col: "Company"})
            frames.append(slim)
        except Exception:
            # Skip bad/temporarily unavailable indices
            continue

    if not frames:
        return {}

    merged = pd.concat(frames, ignore_index=True)
    # Drop missing values first: astype(str) would turn NaN into the string "nan"
    merged = merged.dropna(subset=["Symbol", "Company"]).copy()
    # Clean + dedupe
    merged["Symbol"] = merged["Symbol"].astype(str).str.strip().str.upper()
    merged["Company"] = merged["Company"].astype(str).str.strip()
    merged = merged.drop_duplicates(subset=["Symbol"], keep="first")

    # Build dict
    name_map = dict(zip(merged["Symbol"], merged["Company"]))
    return name_map

# Load dynamically at startup
NAME_MAP = load_name_map_from_nse()

# Optional: manual overrides if you want to fix odd cases or add missing ones
EXTRA_NAME_MAP = {
    # "TARIL": "Transformers and Rectifiers (India) Ltd",  # example if needed
}
NAME_MAP.update(EXTRA_NAME_MAP)

def get_company_name(symbol_with_ns: str) -> str:
    """
    Given 'LUPIN.NS' → returns 'Lupin' if known, else 'LUPIN'.
    """
    base = (symbol_with_ns or "").split(".")[0].upper()
    return NAME_MAP.get(base, base)


# =========================
# UTILITIES
# =========================
def _clean_title(t: str) -> str:
    """Drop trailing publisher and compress whitespace."""
    t = re.sub(r"\s+-\s+[^-]+$", "", (t or "").strip())
    return re.sub(r"\s+", " ", t)

def canonicalize_url(url: str) -> str:
    """Remove common AMP patterns & excessive tracking params to improve extraction."""
    if not url:
        return url
    parsed = urllib.parse.urlparse(url)
    # Strip a trailing /amp segment from the path only, so URLs with a query string are handled too
    path = re.sub(r"/amp/?$", "/", parsed.path)
    qs = urllib.parse.parse_qsl(parsed.query, keep_blank_values=True)
    # Keep only a tiny set of params that are sometimes essential; this also
    # drops AMP flags (amp=1, outputType=amp) and tracking params
    allow = {"id", "story_fbid"}
    qs = [(k, v) for (k, v) in qs if k in allow]
    new_qs = urllib.parse.urlencode(qs)
    return urllib.parse.urlunparse(parsed._replace(path=path, query=new_qs))

def _normalize_label(lab: str) -> str:
    lab = str(lab).lower()
    if lab.startswith("pos"): return "positive"
    if lab.startswith("neg"): return "negative"
    return "neutral"

def score_text_blocks(texts: list[str]) -> list[dict]:
    """Run FinBERT on text blocks."""
    if not texts: return []
    outs = sentiment_model(texts, truncation=True, batch_size=BATCH_SIZE)
    return [{"label": _normalize_label(o["label"]), "score": float(o["score"])} for o in outs]

def chunk_text(text: str, max_tokens: int = MAX_TOKENS_PER_ARTICLE) -> list[str]:
    """Lightweight chunk by char budget (~4 chars per token)."""
    char_budget = max_tokens * 4
    t = text[:min(len(text), char_budget)]
    if len(t) < char_budget // 2:
        return [t]
    mid = len(t) // 2
    return [t[:mid], t[mid:]]

def aggregate_sentiment(rows: list[dict], weights: list[float] | None = None) -> float:
    """Map labels to +1/0/-1 and average by model confidence (and optional weights)."""
    if not rows: return 0.0
    mapping = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}
    vals = [mapping[r["label"]] * r["score"] for r in rows]
    if weights and len(weights) == len(vals):
        # normalize by sum of weights to avoid bias
        wsum = sum(weights) or 1.0
        return float(sum(v * w for v, w in zip(vals, weights)) / wsum)
    return float(sum(vals) / len(vals))

def recency_weight(published: str) -> float:
    """≤24h:1.5, 24–48h:1.2, else:1.0; robust to missing/odd formats."""
    try:
        # Common Google News RSS format: 'Tue, 16 Sep 2025 08:30:00 GMT'
        dt = datetime.strptime(published, "%a, %d %b %Y %H:%M:%S %Z")
        delta = datetime.utcnow() - dt.replace(tzinfo=None)
        if delta <= timedelta(hours=24): return 1.5
        if delta <= timedelta(hours=48): return 1.2
    except Exception:
        pass
    return 1.0

def bias_from_score(avg: float) -> str:
    if avg > SENTIMENT_NEUTRAL_BAND: return "LONG"
    if avg < -SENTIMENT_NEUTRAL_BAND: return "SHORT"
    return "NEUTRAL"


# =========================
# FETCH GOOGLE NEWS (with allowlist & fallback)
# =========================
def fetch_google_news_df(query_text: str, max_items: int = MAX_HEADLINES) -> pd.DataFrame:
    """
    Build Google News RSS URL for last 7 days, download with headers, parse with feedparser.
    Keep only allowlisted publishers if any exist; else return all (never empty due to filter).
    """
    q = urllib.parse.quote_plus(query_text + " when:7d")
    url = f"https://news.google.com/rss/search?q={q}&hl=en-IN&gl=IN&ceid=IN:en"
    r = requests.get(url, headers=HDRS, timeout=10, allow_redirects=True)
    if r.status_code != 200 or not r.content:
        return pd.DataFrame(columns=["title","summary","link","published","fetched_at","source_title","allowlisted"])

    fp = feedparser.parse(r.content)
    entries = getattr(fp, "entries", []) or []
    rows = []
    for e in entries[:max_items]:
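        link = getattr(e, "link", "") or ""
        # Google News links typically point at news.google.com redirects, so also
        # check the publisher URL carried by the RSS <source> element, when present.
        src = e.get("source") or {}
        src_url = src.get("href", "") or ""
        allow = any(dom in f"{link} {src_url}".lower() for dom in ALLOWLIST)
        rows.append({
            "title": _clean_title(getattr(e, "title", "")),
            "summary": getattr(e, "summary", None),
            "link": link,
            "published": getattr(e, "published", None),
            "fetched_at": datetime.now(),
            "source_title": src.get("title") or fp.feed.get("title", ""),
            "allowlisted": allow,
        })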

    df = pd.DataFrame(rows)
    if df.empty:
        return df

    # Prefer allowlisted only, but if none matched, return everything (fallback)
    if df["allowlisted"].any():
        return df[df["allowlisted"]].reset_index(drop=True)
    return df.reset_index(drop=True)

def fetch_news_for_ticker(ticker: str, max_items: int = MAX_HEADLINES) -> pd.DataFrame:
    company = get_company_name(ticker)  # ← dynamic lookup
    base = ticker.split(".")[0]         # e.g., 'LUPIN'
    queries = [
        f'"{company}" stock india',
        f"{company} shares",
        f"{base} stock india",
    ]
    for q in queries:
        df = fetch_google_news_df(q, max_items=max_items)
        if not df.empty:
            df.insert(0, "symbol", ticker)
            return df
    return pd.DataFrame(columns=["symbol","title","summary","link","published","fetched_at","source_title","allowlisted"])



# =========================
# ARTICLE TEXT EXTRACTION (multi-fallback)
# =========================
def extract_text_trafilatura(url: str, timeout: int = 10) -> str | None:
    try:
        import trafilatura
    except Exception:
        return None
    try:
        downloaded = trafilatura.fetch_url(url, no_ssl=True)
        if not downloaded:
            r = requests.get(url, headers=HDRS, timeout=timeout)
            if r.status_code != 200 or not r.text:
                return None
            txt = trafilatura.extract(r.text, url=url)
        else:
            txt = trafilatura.extract(downloaded, url=url)
        if not txt: return None
        txt = " ".join(txt.split())
        return txt if len(txt) >= MIN_ARTICLE_CHARS else None
    except Exception:
        return None

def extract_text_readability(url: str, timeout: int = 10) -> str | None:
    try:
        from readability import Document
        from lxml import html
    except Exception:
        return None
    try:
        r = requests.get(url, headers=HDRS, timeout=timeout, allow_redirects=True)
        if r.status_code != 200:
            return None
        doc = Document(r.text)
        summ_html = doc.summary(html_partial=True)
        tree = html.fromstring(summ_html)
        txt = " ".join(tree.text_content().split())
        return txt if len(txt) >= MIN_ARTICLE_CHARS else None
    except Exception:
        return None

def extract_text_newspaper(url: str) -> str | None:
    try:
        from newspaper import Article
    except Exception:
        return None
    try:
        art = Article(url, language="en", fetch_images=False, browser_user_agent=UA)
        art.download()
        art.parse()
        txt = " ".join(art.text.split())
        return txt if len(txt) >= MIN_ARTICLE_CHARS else None
    except Exception:
        return None

def extract_article_text(url: str) -> str | None:
    """Try trafilatura → readability → newspaper3k and return the first usable text."""
    url = canonicalize_url(url)
    for fn in (extract_text_trafilatura, extract_text_readability, extract_text_newspaper):
        txt = fn(url)
        # Each extractor already enforces MIN_ARTICLE_CHARS, so this 120-char gate
        # only takes effect if MIN_ARTICLE_CHARS is lowered below 120.
        if txt and len(txt) >= 120:
            return re.sub(r"\s+", " ", txt).strip()
    return None


# =========================
# PIPELINE: HEADLINE + ARTICLE SENTIMENT with RECENCY WEIGHTS
# =========================
def analyze_ticker_news(df_news: pd.DataFrame, ticker: str) -> dict:
    df = df_news[df_news["symbol"] == ticker].head(MAX_ARTICLES_PER_TICKER).copy()
    if df.empty:
        return {"symbol": ticker, "headline_avg": 0.0, "article_avg": 0.0,
                "combined_score": 0.0, "bias": "NEUTRAL",
                "n_headlines": 0, "n_articles": 0}

    # Headline sentiment (recency-weighted)
    headline_rows = score_text_blocks(df["title"].tolist())
    weights = [recency_weight(p) for p in df["published"]] if "published" in df else None
    headline_avg = aggregate_sentiment(headline_rows, weights)

    # Article sentiment
    article_avgs, extracted_count = [], 0
    for _, row in df.iterrows():
        link = row.get("link", "")
        body = extract_article_text(link)
        if not body:
            # Fallback: RSS summary
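            summary = row.get("summary")
            if summary:
                # Strip HTML tags first so markup does not count toward the length check
                summary = " ".join(re.sub(r"<[^>]+>", " ", str(summary)).split())
                if len(summary) >= 80:
                    body = summary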
        if not body:
            continue

        extracted_count += 1
        chunks = chunk_text(body, MAX_TOKENS_PER_ARTICLE)
        chunk_scores = score_text_blocks(chunks)
        article_avgs.append(aggregate_sentiment(chunk_scores))
        time.sleep(0.05)

    article_avg = sum(article_avgs)/len(article_avgs) if article_avgs else 0.0
    combined = 0.7*article_avg + 0.3*headline_avg if article_avgs else headline_avg

    return {
        "symbol": ticker,
        "headline_avg": round(headline_avg, 4),
        "article_avg": round(article_avg, 4),
        "combined_score": round(combined, 4),
        "bias": bias_from_score(combined),
        "n_headlines": int(len(df)),
        "n_articles": int(extracted_count),
    }

def build_sentiment_table(tickers: list[str]) -> pd.DataFrame:
    """Fetch news for all tickers, then compute per-ticker bias table."""
    cols = ["symbol", "headline_avg", "article_avg", "combined_score",
            "bias", "n_headlines", "n_articles"]
    if not tickers:
        # Nothing to score; return an empty table instead of sorting a column-less frame
        return pd.DataFrame(columns=cols)
    frames = [fetch_news_for_ticker(t) for t in tickers]
    all_news = pd.concat(frames, ignore_index=True)
    results = [analyze_ticker_news(all_news, t) for t in tickers]
    out = pd.DataFrame(results, columns=cols)
    return out.sort_values(["bias", "combined_score"], ascending=[True, False]).reset_index(drop=True)


# =========================
# MAIN (example)
# =========================
if __name__ == "__main__":
    tickers = [
        "SBILIFE.NS",
    ]
    sentiment_df = build_sentiment_table(tickers)
    print(sentiment_df)
    # Inspect raw news rows (note: this fetches the feeds a second time)
    raw_df = pd.concat([fetch_news_for_ticker(t) for t in tickers], ignore_index=True)


    # make wide tables readable in the terminal
    pd.set_option("display.max_columns", None)
    pd.set_option("display.width", 140)
    pd.set_option("display.max_colwidth", 90)   # control title/link truncation

    print(
        raw_df[["symbol","title","link","published","allowlisted"]]
            .head(20)
            .to_string(index=False)
    )

