# 00 — Environment Setup
### Brand Sentiment Monitor

## Data Architecture
```
OFFLINE STAGE — Colab notebooks 00-09 (this series)
┌──────────────────────────────────────────────────────────┐
│  Kaggle Datasets  →  model training  →  HuggingFace Hub  │
│  Sentiment140 · GoEmotions · SemEval 2018                │
└──────────────────────────────────────────────────────────┘

ONLINE STAGE — deployment only (NOT in notebooks)
┌──────────────────────────────────────────────────────────┐
│  src/data/collector.py  →  Reddit + NewsAPI  →  DB       │
└──────────────────────────────────────────────────────────┘
```
This notebook makes **no live API calls**.  
Section 8 validates credentials exist — it does not pull data.

**Sections:** GPU · Drive · Structure · Deps · spaCy/NLTK · Datasets · Keys · Creds · DB · Smoke · Summary

## 1. GPU & Runtime Check
Runtime → Change runtime type → **T4 GPU** before running.

In [1]:
import subprocess, sys, platform, shutil

print(f"Python  : {sys.version}")
print(f"Platform: {platform.platform()}")

try:
    gpu = subprocess.check_output(["nvidia-smi"], stderr=subprocess.STDOUT).decode()
    print("GPU: ✅ AVAILABLE")
    for line in gpu.split("\n"):
        if any(x in line for x in ["T4", "A100", "V100", "L4", "Tesla"]):
            print(f"     {line.strip()}"); break
except Exception:
    print("GPU: ❌ NOT FOUND — Runtime → Change runtime type → T4 GPU")

total, _, free = shutil.disk_usage("/")
print(f"Disk: {free//(2**30)} GB free of {total//(2**30)} GB")


Python  : 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
Platform: Linux-6.6.113+-x86_64-with-glibc2.35
GPU: ✅ AVAILABLE
     |   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
Disk: 70 GB free of 112 GB


## 2. Google Drive Mount
All data, models, and outputs persist across Colab sessions.

In [2]:
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

import os, sys

DRIVE_ROOT = "/content/drive/MyDrive/brand-sentiment-monitor"
COLAB_ROOT = "/content/brand-sentiment-monitor"

os.makedirs(DRIVE_ROOT, exist_ok=True)
os.makedirs(COLAB_ROOT, exist_ok=True)  # symlink target must exist

sys.path.insert(0, DRIVE_ROOT)
sys.path.insert(0, os.path.join(DRIVE_ROOT, "src"))

print(f"Drive root: {DRIVE_ROOT} ✅")


Mounted at /content/drive
Drive root: /content/drive/MyDrive/brand-sentiment-monitor ✅


## 3. Project Structure

In [4]:
# One folder per architecture module — matches src/ layout exactly
dirs = [
    # Data — offline Kaggle (notebooks) vs online live (deployment)
    "data/kaggle/raw",
    "data/kaggle/processed",
    "data/kaggle/splits",
    "data/live/reddit",          # collector.py output — deployment only
    "data/live/news",            # collector.py output — deployment only
    "data/live/combined",
    "data/processed",
    "data/external",

    # Source modules — one per architecture module
    "src/preprocessing",         # Module 2 — cleaner.py
    "src/brand",                 # Module 3 — detector.py
    "src/models",                # Module 4 — sentiment, sarcasm, emotion, topic
    "src/attribution",           # Module 5 — engine.py (key differentiator)
    "src/analytics",             # Module 6 — emotion analytics helpers
    "src/crisis",                # Module 8 — detector.py
    "src/aggregation",           # Module 9 — aggregator.py
    "src/api",                   # Module 10 — schemas.py + predict.py (backend contract)
    "src/data",                  # Deployment — collector.py
    "src/utils",

    # Saved models — one folder per trained model, saved via save_pretrained()
    "models/sentiment",          # cardiffnlp/twitter-roberta fine-tuned
    "models/sarcasm",            # roberta-base fine-tuned on SemEval 2018
    "models/emotion",            # bert-base-cased fine-tuned on GoEmotions
    "models/topic",              # BERTopic saved model

    # Outputs
    "outputs/predictions",
    "outputs/reports",
    "outputs/visualizations",

    # Other
    "config", "dashboard", "tests", "notebooks", "logs",
]

open(os.path.join(DRIVE_ROOT, "src/__init__.py"), "w").close()
for d in dirs:
    full = os.path.join(DRIVE_ROOT, d) # Define 'full' here
    os.makedirs(full, exist_ok=True)
    if d.startswith("src"):
        open(os.path.join(full, "__init__.py"), "w").close()
    print(f"  ✅ {d}")

with open(os.path.join(DRIVE_ROOT, "data/live/README.md"), "w") as f:
    f.write("# Live Data\nPopulated by collector.py in deployment only.\nNOT used in notebooks 00-10.\n")

print("\nStructure ready ✅")


  ✅ data/kaggle/raw
  ✅ data/kaggle/processed
  ✅ data/kaggle/splits
  ✅ data/live/reddit
  ✅ data/live/news
  ✅ data/live/combined
  ✅ data/processed
  ✅ data/external
  ✅ src/preprocessing
  ✅ src/brand
  ✅ src/models
  ✅ src/attribution
  ✅ src/analytics
  ✅ src/crisis
  ✅ src/aggregation
  ✅ src/api
  ✅ src/data
  ✅ src/utils
  ✅ models/sentiment
  ✅ models/sarcasm
  ✅ models/emotion
  ✅ models/topic
  ✅ outputs/predictions
  ✅ outputs/reports
  ✅ outputs/visualizations
  ✅ config
  ✅ dashboard
  ✅ tests
  ✅ notebooks
  ✅ logs

Structure ready ✅


## 4. Dependency Installation
⏱️ ~5 min first run. Installed in **9 batches** — a failure in one batch is easy to isolate and retry.

In [5]:
print("Batch 1: Core ML & Data...")
!pip install -q numpy pandas scikit-learn scipy joblib

Batch 1: Core ML & Data...


In [6]:
print("Batch 2: Transformers & HuggingFace...")
!pip install -q transformers datasets accelerate peft evaluate huggingface-hub

Batch 2: Transformers & HuggingFace...
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m5.4 MB/s[0m eta [36m0:00:00[0m
[?25h

In [7]:
print("Batch 3: NLP...")
!pip install -q nltk spacy textblob emoji contractions langdetect

Batch 3: NLP...
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m981.5/981.5 kB[0m [31m19.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m608.4/608.4 kB[0m [31m32.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m345.1/345.1 kB[0m [31m17.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m114.9/114.9 kB[0m [31m7.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for langdetect (setup.py) ... [?25l[?25hdone


In [8]:
print("Batch 4: Topic Modeling...")
!pip install -q bertopic gensim umap-learn hdbscan

Batch 4: Topic Modeling...
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m154.7/154.7 kB[0m [31m9.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m27.9/27.9 MB[0m [31m77.7 MB/s[0m eta [36m0:00:00[0m
[?25h

In [9]:
print("Batch 5: Forecasting & Anomaly Detection...")
!pip install -q prophet pyod statsmodels

Batch 5: Forecasting & Anomaly Detection...
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m46.3/46.3 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m204.7/204.7 kB[0m [31m15.5 MB/s[0m eta [36m0:00:00[0m
[?25h

In [10]:
# Installed here so collector.py works in deployment
# NOT imported inside any notebook
print("Batch 6: Live API clients (deployment use only)...")
!pip install -q praw newsapi-python feedparser aiohttp requests
print("✅  [praw=Reddit | newsapi-python=NewsAPI | feedparser=RSS fallback]")

Batch 6: Live API clients (deployment use only)...
  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m189.3/189.3 kB[0m [31m9.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.5/81.5 kB[0m [31m10.5 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for sgmllib3k (setup.py) ... [?25l[?25hdone
✅  [praw=Reddit | newsapi-python=NewsAPI | feedparser=RSS fallback]


In [11]:
print("Batch 7: Database & API serving...")
!pip install -q psycopg2-binary sqlalchemy supabase fastapi "uvicorn[standard]" pydantic httpx


Batch 7: Database & API serving...
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.2/4.2 MB[0m [31m42.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m48.5/48.5 kB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m730.0/730.0 kB[0m [31m35.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.1/2.1 MB[0m [31m37.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m123.9/123.9 kB[0m [31m13.5 MB/s[0m eta [36m0:00:00[0m
[?25h

In [12]:
print("Batch 8: Dashboard & Visualization...")
!pip install -q streamlit plotly wordcloud matplotlib seaborn altair

Batch 8: Dashboard & Visualization...
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.1/9.1 MB[0m [31m89.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.9/6.9 MB[0m [31m116.3 MB/s[0m eta [36m0:00:00[0m
[?25h

In [13]:
print("Batch 9: MLflow, Alerting & Utilities...")
!pip install -q mlflow optuna slack-sdk python-dotenv pyyaml tqdm loguru pytest pytest-asyncio

print("✅")
print("\nAll dependencies installed ✅")

Batch 9: MLflow, Alerting & Utilities...
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m40.6/40.6 kB[0m [31m4.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.2/10.2 MB[0m [31m62.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.0/3.0 MB[0m [31m63.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.5/1.5 MB[0m [31m52.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m413.9/413.9 kB[0m [31m20.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m313.7/313.7 kB[0m [31m28.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.6/61.6 kB[0m [31m7.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m147.8/147.8 kB[0m [31m10.2 MB/s[0m eta [36m0:00:00[0m
[2K 

## 5. spaCy & NLTK Downloads

In [14]:
print("Downloading spaCy en_core_web_trf (~435 MB)...")
!python -m spacy download en_core_web_trf --quiet

import spacy
try:
    nlp = spacy.load("en_core_web_trf")
    doc = nlp("Nike reported strong earnings while Adidas struggled in Q3.")
    print(f"en_core_web_trf ✅  NER: {[(e.text, e.label_) for e in doc.ents]}")
except Exception as e:
    print(f"Falling back to en_core_web_sm: {e}")
    !python -m spacy download en_core_web_sm --quiet
    print("en_core_web_sm ✅  (less accurate NER)")


Downloading spaCy en_core_web_trf (~435 MB)...
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m457.4/457.4 MB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m237.9/237.9 kB[0m [31m6.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m734.0/734.0 kB[0m [31m41.2 MB/s[0m eta [36m0:00:00[0m
[?25h[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_trf')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
en_core_web_trf ✅  NER: [('Nike', 'ORG'), ('Adidas', 'ORG'), ('Q3', 'DATE')]


In [15]:
import nltk
for pkg in ["punkt", "punkt_tab", "stopwords", "wordnet",
            "omw-1.4", "averaged_perceptron_tagger", "vader_lexicon"]:
    nltk.download(pkg, quiet=True)
    print(f"  ✅ {pkg}")
print("NLTK ready ✅")


  ✅ punkt
  ✅ punkt_tab
  ✅ stopwords
  ✅ wordnet
  ✅ omw-1.4
  ✅ averaged_perceptron_tagger
  ✅ vader_lexicon
NLTK ready ✅


## 6. Dataset Verification

Upload these three files to Drive **before running this section**.

**Download from Kaggle → rename → upload to `MyDrive/brand-sentiment-monitor/data/kaggle/raw/`**

| Kaggle page | Rename to |
|-------------|-----------|
| kaggle.com/datasets/kazanova/sentiment140 | `sentiment140.csv` |
| kaggle.com/datasets/debarshichanda/goemotions | `goemotions.csv` |
| kaggle.com/datasets/shiroshinki/semeval2018-task3 | upload both `train.csv` and `test.csv` as-is |

In [16]:
import pandas as pd

KAGGLE_RAW = os.path.join(DRIVE_ROOT, "data/kaggle/raw")

s140_path = os.path.join(KAGGLE_RAW, "sentiment140.csv")
ge_path   = os.path.join(KAGGLE_RAW, "goemotions.csv")
sem_path  = os.path.join(KAGGLE_RAW, "semeval2018_irony.csv")


In [17]:
# Sentiment140 — no header row, need to supply column names
s140 = pd.read_csv(s140_path, encoding="latin-1", header=None,
                   names=["polarity", "id", "date", "query", "user", "text"])
print(f"Sentiment140: {s140.shape}  |  polarities: {sorted(s140['polarity'].unique())}")
s140[["polarity", "text"]].head(3)


Sentiment140: (1600000, 6)  |  polarities: [np.int64(0), np.int64(4)]


Unnamed: 0,polarity,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...


In [18]:
# GoEmotions
ge = pd.read_csv(ge_path)
print(f"GoEmotions: {ge.shape}")
ge.head(3)


GoEmotions: (54263, 3)


Unnamed: 0,text,labels,id
0,My favourite food is anything I didn't have to...,[27],eebbqej
1,"Now if he does off himself, everyone will thin...",[27],ed00q6i
2,WHY THE FUCK IS BAYLESS ISOING,[2],eezlygj


In [19]:
# SemEval 2018 — merge train + test once, save as semeval2018_irony.csv
# On second run this cell just loads the already-merged file
if not os.path.exists(sem_path):
    df_train = pd.read_csv(os.path.join(KAGGLE_RAW, "train.csv"))
    df_test  = pd.read_csv(os.path.join(KAGGLE_RAW, "test.csv"))
    sem = pd.concat([df_train, df_test], ignore_index=True)
    sem.columns = [c.lower().strip().replace(" ", "_") for c in sem.columns]
    sem.to_csv(sem_path, index=False)
    print(f"Merged train({len(df_train)}) + test({len(df_test)}) → {len(sem)} rows")
else:
    sem = pd.read_csv(sem_path)
    print(f"SemEval 2018: {sem.shape}")
sem.head(3)


SemEval 2018: (10197, 6)


Unnamed: 0,id,tweet,joy,sadness,anger,fear
0,2017-En-10000,How the fu*k! Who the heck! moved my fridge!.....,0.0,0.0,0.938,0.0
1,2017-En-10001,So my Indian Uber driver just called someone t...,0.0,0.0,0.896,0.0
2,2017-En-10002,@DPD_UK I asked for my parcel to be delivered ...,0.0,0.0,0.896,0.0


In [20]:
# All three ready?
for name, df in [("Sentiment140", s140), ("GoEmotions", ge), ("SemEval 2018", sem)]:
    print(f"  {name:<18} {str(df.shape):<15} cols: {list(df.columns)}")
print("\nAll datasets loaded ✅")


  Sentiment140       (1600000, 6)    cols: ['polarity', 'id', 'date', 'query', 'user', 'text']
  GoEmotions         (54263, 3)      cols: ['text', 'labels', 'id']
  SemEval 2018       (10197, 6)      cols: ['id', 'tweet', 'joy', 'sadness', 'anger', 'fear']

All datasets loaded ✅


## 7. API Key Configuration

| Key | Get it at | Used in |
|-----|-----------|--------|
| `REDDIT_CLIENT_ID` | reddit.com/prefs/apps | `collector.py` (deployment) |
| `NEWSAPI_KEY` | newsapi.org/register | `collector.py` (deployment) |
| `SUPABASE_URL` | supabase.com | All stages |
| `HUGGINGFACE_TOKEN` | huggingface.co/settings/tokens | Notebook 09 |
| `SLACK_WEBHOOK_URL` | api.slack.com → Incoming Webhooks | `alert_manager.py` |

> **Only `HUGGINGFACE_TOKEN` is required to run notebooks 00–09.**  
> Reddit + NewsAPI keys are only needed when deploying `collector.py`.

In [21]:
ENV_FILE = os.path.join(DRIVE_ROOT, ".env")

template = """# Brand Sentiment Monitor — .env
# NEVER commit this file to git.
#
# REDDIT_* + NEWSAPI_KEY  →  collector.py (deployment only, not notebooks)
# HUGGINGFACE_TOKEN       →  notebook 10 (push fine-tuned models)
# SUPABASE_* / DATABASE_URL → all stages

REDDIT_CLIENT_ID=your_client_id_here
REDDIT_CLIENT_SECRET=your_client_secret_here
REDDIT_USER_AGENT=BrandSentimentMonitor/1.0 by YourRedditUsername

NEWSAPI_KEY=your_newsapi_key_here

SUPABASE_URL=https://your-project-ref.supabase.co
SUPABASE_ANON_KEY=your_anon_key_here
DATABASE_URL=postgresql://postgres:password@db.your-project-ref.supabase.co:5432/postgres

HUGGINGFACE_TOKEN=hf_your_token_here

SLACK_WEBHOOK_URL=https://hooks.slack.com/services/your/webhook/url

SMTP_USER=your_email@gmail.com
SMTP_PASSWORD=your_app_password_here
ALERT_EMAIL_TO=recipient@example.com

MLFLOW_TRACKING_URI=./models/mlruns
"""

with open(os.path.join(DRIVE_ROOT, ".env.example"), "w") as f:
    f.write(template)
print(".env.example written ✅")

# Only write .env if it doesn't exist — never overwrite real keys
ENV_FILE = os.path.join(DRIVE_ROOT, ".env")
if not os.path.exists(ENV_FILE):
    with open(ENV_FILE, "w") as f:
        f.write(template)
    print(f".env created ✅  →  fill in your keys at: {ENV_FILE}")
else:
    print(".env already exists ✅  (not overwritten)")


.env.example written ✅
.env already exists ✅  (not overwritten)


In [22]:
from dotenv import load_dotenv
load_dotenv(os.path.join(DRIVE_ROOT, ".env"))

def mask(val, show=6):
    if not val or any(x in str(val) for x in ["your_", "your-project", "hf_your"]):
        return "❌ not set"
    return val[:show] + "*" * max(0, len(val) - show)

print("API Key Status:")
print("─" * 60)
rows = [
    ("REDDIT_CLIENT_ID",  "deployment → collector.py"),
    ("NEWSAPI_KEY",        "deployment → collector.py"),
    ("SUPABASE_URL",       "all stages"),
    ("HUGGINGFACE_TOKEN",  "notebook 10 — push models"),
    ("SLACK_WEBHOOK_URL",  "deployment → alert_manager.py"),
]
for key, usage in rows:
    print(f"  {key:<25}  {mask(os.getenv(key,'')):<18}  [{usage}]")

print("\n✅ Only HUGGINGFACE_TOKEN needed for notebooks 00-09.")


API Key Status:
────────────────────────────────────────────────────────────
  REDDIT_CLIENT_ID           ❌ not set           [deployment → collector.py]
  NEWSAPI_KEY                ❌ not set           [deployment → collector.py]
  SUPABASE_URL               ❌ not set           [all stages]
  HUGGINGFACE_TOKEN          ❌ not set           [notebook 10 — push models]
  SLACK_WEBHOOK_URL          https:*******************************************  [deployment → alert_manager.py]

✅ Only HUGGINGFACE_TOKEN needed for notebooks 00-09.


## 8. Credential Validation
Lightweight — **no data pulled here**.  
One minimal API call per service just to confirm keys are accepted.

In [23]:
# ── Reddit ────────────────────────────────────────────────────────────────────
import praw

reddit_id     = os.getenv("REDDIT_CLIENT_ID", "")
reddit_secret = os.getenv("REDDIT_CLIENT_SECRET", "")

if not reddit_id or "your_" in reddit_id:
    print("Reddit  : ⚠️  Keys not set — add to .env before running collector.py")
else:
    try:
        reddit = praw.Reddit(
            client_id=reddit_id,
            client_secret=reddit_secret,
            user_agent=os.getenv("REDDIT_USER_AGENT", "BrandSentimentMonitor/1.0"),
            read_only=True,
        )
        title = reddit.subreddit("python").title   # lightest possible call
        print(f"Reddit  : ✅ Connected (read-only) — test: r/python = '{title}'")
        print(f"           Ready for deployment. collector.py pulls every 30 min.")
    except Exception as e:
        print(f"Reddit  : ❌ {e}")


Reddit  : ⚠️  Keys not set — add to .env before running collector.py


In [24]:
# ── NewsAPI ───────────────────────────────────────────────────────────────────
from newsapi import NewsApiClient

newsapi_key = os.getenv("NEWSAPI_KEY", "")

if not newsapi_key or "your_" in newsapi_key:
    print("NewsAPI : ⚠️  Key not set — register free at newsapi.org/register")
else:
    try:
        newsapi = NewsApiClient(api_key=newsapi_key)
        sources = newsapi.get_sources(language="en", country="us")   # 0 of 100 daily quota
        print(f"NewsAPI : ✅ Connected — {len(sources['sources'])} English sources available")
        print(f"           Ready for deployment. collector.py pulls every 60 min.")
    except Exception as e:
        print(f"NewsAPI : ❌ {e}")


NewsAPI : ⚠️  Key not set — register free at newsapi.org/register


In [25]:
# ── HuggingFace ───────────────────────────────────────────────────────────────
hf_token = os.getenv("HUGGINGFACE_TOKEN", "")

if not hf_token or "hf_your" in hf_token:
    print("HuggingFace: ⚠️  Not set — needed for notebook 10 (push fine-tuned models)")
else:
    try:
        from huggingface_hub import HfApi
        user = HfApi(token=hf_token).whoami()
        print(f"HuggingFace: ✅ Connected as '{user['name']}'")
    except Exception as e:
        print(f"HuggingFace: ❌ {e}")

print("\nCredential check complete. No data was pulled.")


HuggingFace: ⚠️  Not set — needed for notebook 10 (push fine-tuned models)

Credential check complete. No data was pulled.


## 9. Database Schema Setup
Skipped automatically if `DATABASE_URL` is not configured.

In [26]:
from sqlalchemy import create_engine, text

DATABASE_URL = os.getenv("DATABASE_URL", "")

if not DATABASE_URL or "your-project-ref" in DATABASE_URL:
    print("⚠️  DATABASE_URL not set — skipping schema creation.")
    print("   Set up free Supabase at https://supabase.com")
else:
    try:
        engine = create_engine(DATABASE_URL, pool_pre_ping=True)
        schema = """
        CREATE TABLE IF NOT EXISTS live_posts (
            id           TEXT PRIMARY KEY,
            platform     VARCHAR(20)  NOT NULL,
            type         VARCHAR(30),
            brand        VARCHAR(100),
            full_text    TEXT,
            url          TEXT,
            author       TEXT,
            score        INTEGER,
            created_utc  TIMESTAMPTZ,
            collected_at TIMESTAMPTZ DEFAULT NOW(),
            metadata     JSONB
        );
        CREATE TABLE IF NOT EXISTS predictions (
            id              BIGSERIAL PRIMARY KEY,
            post_id         TEXT REFERENCES live_posts(id) ON DELETE CASCADE,
            brand           VARCHAR(100),
            sentiment       VARCHAR(20),
            sentiment_score FLOAT,
            is_sarcastic    BOOLEAN,
            sarcasm_score   FLOAT,
            emotions        JSONB,
            topic_id        INTEGER,
            topic_label     TEXT,
            crisis_flag     BOOLEAN DEFAULT FALSE,
            predicted_at    TIMESTAMPTZ DEFAULT NOW()
        );
        CREATE TABLE IF NOT EXISTS crisis_alerts (
            id           BIGSERIAL PRIMARY KEY,
            brand        VARCHAR(100),
            alert_type   VARCHAR(50),
            severity     VARCHAR(20),
            message      TEXT,
            z_score      FLOAT,
            triggered_at TIMESTAMPTZ DEFAULT NOW(),
            resolved     BOOLEAN DEFAULT FALSE
        );
        """
        with engine.connect() as conn:
            conn.execute(text(schema))
            conn.commit()
            ver = conn.execute(text("SELECT version();")).fetchone()[0]
        print(f"Database ✅  {ver[:55]}")
        print("Tables created: live_posts · predictions · crisis_alerts")
    except Exception as e:
        print(f"Database ❌: {e}")


⚠️  DATABASE_URL not set — skipping schema creation.
   Set up free Supabase at https://supabase.com


## 10. Smoke Test
Verifies every library imported correctly.

In [27]:
import importlib

tests = [
    ("numpy", "numpy"), ("pandas", "pandas"), ("sklearn", "scikit-learn"),
    ("scipy", "scipy"), ("torch", "torch"), ("transformers", "transformers"),
    ("datasets", "datasets"), ("accelerate", "accelerate"),
    ("nltk", "nltk"), ("spacy", "spacy"), ("emoji", "emoji"),
    ("contractions", "contractions"), ("bertopic", "bertopic"),
    ("umap", "umap-learn"), ("hdbscan", "hdbscan"), ("gensim", "gensim"),
    ("prophet", "prophet"), ("pyod", "pyod"), ("statsmodels", "statsmodels"),
    ("praw", "praw"), ("newsapi", "newsapi-python"), ("feedparser", "feedparser"),
    ("sqlalchemy", "sqlalchemy"), ("fastapi", "fastapi"), ("pydantic", "pydantic"),
    ("streamlit", "streamlit"), ("plotly", "plotly"), ("wordcloud", "wordcloud"),
    ("mlflow", "mlflow"), ("optuna", "optuna"),
    ("yaml", "pyyaml"), ("loguru", "loguru"), ("dotenv", "python-dotenv"),
]

passed, failed = 0, []
print("=" * 52)
for mod, label in tests:
    try:
        m = importlib.import_module(mod)
        ver = getattr(m, "__version__", "n/a")
        print(f"  ✅  {label:<25} v{ver}")
        passed += 1
    except ImportError as e:
        print(f"  ❌  {label:<25} {e}")
        failed.append(label)

print("=" * 52)
print(f"  {passed}/{len(tests)} passed")
if failed:
    print(f"  Failed: {failed}")
else:
    print("  All imports successful ✅")
print("=" * 52)


  ✅  numpy                     v2.0.2
  ✅  pandas                    v2.2.2
  ✅  scikit-learn              v1.6.1
  ✅  scipy                     v1.16.3
  ✅  torch                     v2.10.0+cu128
  ✅  transformers              v5.0.0
  ✅  datasets                  v4.0.0
  ✅  accelerate                v1.12.0
  ✅  nltk                      v3.9.1
  ✅  spacy                     v3.8.11
  ✅  emoji                     v2.15.0
  ✅  contractions              vn/a


  $max \{ core_k(a), core_k(b), 1/\alpha d(a,b) \}$.


  ✅  bertopic                  v0.17.4
  ✅  umap-learn                v0.5.11
  ✅  hdbscan                   vn/a
  ✅  gensim                    v4.4.0
  ✅  prophet                   v1.3.0
  ✅  pyod                      v2.0.6
  ✅  statsmodels               v0.14.6
  ✅  praw                      v7.8.1
  ✅  newsapi-python            vn/a
  ✅  feedparser                v6.0.12
  ✅  sqlalchemy                v2.0.46
  ✅  fastapi                   v0.129.0
  ✅  pydantic                  v2.12.3
  ✅  streamlit                 v1.54.0
  ✅  plotly                    v5.24.1
  ✅  wordcloud                 v1.9.6
  ✅  mlflow                    v3.10.0
  ✅  optuna                    v4.7.0
  ✅  pyyaml                    v6.0.3
  ✅  loguru                    v0.7.3
  ✅  python-dotenv             vn/a
  33/33 passed
  All imports successful ✅


In [28]:
# GPU + Transformers pipeline test on actual Kaggle data
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1
gpu_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"GPU: {gpu_name}")

# Use real Kaggle rows if available
if os.path.exists(s140_path):
    sample_df  = pd.read_csv(s140_path, nrows=5)
    text_col   = "text" if "text" in sample_df.columns else sample_df.columns[-1]
    test_texts = sample_df[text_col].dropna().astype(str).tolist()[:3]
    source     = "Sentiment140 (Kaggle)"
else:
    test_texts = [
        "Nike just dropped the best shoe of the year",
        "Waited 3 weeks, shoes arrived damaged. Terrible.",
        "Just bought some Adidas runners.",
    ]
    source = "example texts"

print(f"\nRunning RoBERTa on {source}:")
pipe = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    device=device, truncation=True, max_length=128,
)
for text, res in zip(test_texts, pipe(test_texts)):
    print(f"  {res['label']:<12} {res['score']:.2f}  |  {str(text)[:90]}")

print("\nTransformers pipeline ✅")


GPU: Tesla T4

Running RoBERTa on Sentiment140 (Kaggle):


config.json:   0%|          | 0.00/929 [00:00<?, ?B/s]



pytorch_model.bin:   0%|          | 0.00/501M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/201 [00:00<?, ?it/s]

RobertaForSequenceClassification LOAD REPORT from: cardiffnlp/twitter-roberta-base-sentiment-latest
Key                             | Status     |  | 
--------------------------------+------------+--+-
roberta.pooler.dense.bias       | UNEXPECTED |  | 
roberta.embeddings.position_ids | UNEXPECTED |  | 
roberta.pooler.dense.weight     | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


model.safetensors:   0%|          | 0.00/501M [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

  negative     0.90  |  is upset that he can't update his Facebook by texting it... and might cry as a result  Sch
  neutral      0.66  |  @Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds
  negative     0.90  |  my whole body feels itchy and like its on fire 

Transformers pipeline ✅


## 11. Environment Summary

In [30]:
from datetime import datetime

def fsize(path):
    if not os.path.exists(path): return "not found"
    # Specify 'latin-1' encoding for sentiment140.csv
    rows = sum(1 for _ in open(path, encoding="latin-1")) - 1
    mb   = os.path.getsize(path) / 1e6
    return f"{rows:,} rows ({mb:.0f} MB)"

def key_status(env_key, bad=("your_", "your-project", "hf_your")):
    val = os.getenv(env_key, "")
    return "Set" if val and not any(b in val for b in bad) else "Not set"

print("=" * 62)
print("ENVIRONMENT SUMMARY")
print(datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
print("=" * 62)

lines = [
    ("GPU",            gpu_name),
    ("Sentiment140",   fsize(s140_path)),
    ("GoEmotions",     fsize(ge_path)),
    ("SemEval 2018",   fsize(sem_path)),
    ("HuggingFace",    key_status("HUGGINGFACE_TOKEN", ("hf_your",)) + " <- needed for nb 10"),
    ("Supabase",       key_status("SUPABASE_URL")),
    ("Reddit API",     key_status("REDDIT_CLIENT_ID") + " (deployment only)"),
    ("NewsAPI",        key_status("NEWSAPI_KEY") + " (deployment only)"),
]
for label, value in lines:
    print(f"  {label:<18} {value}")

print("=" * 62)
print("Notebook plan:")
plan = [
    ("01", "EDA — 8 findings → decisions"),
    ("02", "Preprocessing pipeline          → src/preprocessing/cleaner.py"),
    ("03", "Brand detection                 → src/brand/detector.py"),
    ("04", "Sentiment model (RoBERTa)       → models/sentiment/"),
    ("05", "Sarcasm model (RoBERTa)         → models/sarcasm/"),
    ("06", "Attribution engine              → src/attribution/engine.py"),
    ("07", "Emotion model (BERT)            → models/emotion/"),
    ("08", "Topic model (BERTopic)          → models/topic/"),
    ("09", "Crisis + Aggregation            → src/crisis/ + src/aggregation/"),
    ("10", "Evaluation + deploy             → STUB_MODE = False"),
]
for nb, desc in plan:
    print(f"  {nb} → {desc}")


ENVIRONMENT SUMMARY
2026-02-24 19:47:46
  GPU                Tesla T4
  Sentiment140       1,599,999 rows (239 MB)
  GoEmotions         54,263 rows (5 MB)
  SemEval 2018       10,197 rows (1 MB)
  HuggingFace        Not set <- needed for nb 10
  Supabase           Not set
  Reddit API         Not set (deployment only)
  NewsAPI            Not set (deployment only)
Notebook plan:
  01 → EDA — 8 findings → decisions
  02 → Preprocessing pipeline          → src/preprocessing/cleaner.py
  03 → Brand detection                 → src/brand/detector.py
  04 → Sentiment model (RoBERTa)       → models/sentiment/
  05 → Sarcasm model (RoBERTa)         → models/sarcasm/
  06 → Attribution engine              → src/attribution/engine.py
  07 → Emotion model (BERT)            → models/emotion/
  08 → Topic model (BERTopic)          → models/topic/
  09 → Crisis + Aggregation            → src/crisis/ + src/aggregation/
  10 → Evaluation + deploy             → STUB_MODE = False
