Awesome — this sounds like a great move toward organizing your experimentation work into polished, reusable insights. Let's start crafting that **topic modeling-focused notebook**. Here’s a proposed outline for the cleaned-up notebook structure, with markdown and code blocks arranged clearly:

---

## 🧠 Notebook Title: *Evaluating Topic Modeling for Risk Score Prediction in Financial News*

### 🔍 Notebook Objective (markdown)
> This notebook explores whether topic modeling via BERTopic can be a valuable feature engineering step in predicting stock price movements. Specifically, we aim to test the hypothesis that topic-driven features can replace or enhance token-based risk scores used previously.

---

### 1. Load Required Libraries

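```python
# %% Load Libraries
import pandas as pd
import numpy as np
import duckdb
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
```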

---

### 2. Connect to DuckDB and Load Dataset
```python
# %% Connect to DuckDB and Load Dataset
db_path = '/Users/bradams/Documents/financial_news.db'
# The notebook only reads from the database, so open it read-only.
conn = duckdb.connect(database=db_path, read_only=True)

query = """ 
-- same query used earlier, selecting title, description, price_change_percentage, and sentiment columns
"""

news_data = conn.execute(query).fetchdf()
conn.close()
news_data.info()
```

---

### 3. Apply BERTopic to Extract Topics
```python
# %% Generate BERTopic Topics
texts = news_data["article_title"].fillna("") + " " + news_data["description"].fillna("")
embedding_model = SentenceTransformer("paraphrase-MiniLM-L3-v2")
bertopic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=UMAP(n_neighbors=5, min_dist=0.5, n_components=5, metric='cosine'),
    verbose=True
)
topics, probs = bertopic_model.fit_transform(texts.tolist())
news_data["topic"] = topics
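# probs is a NumPy array (1-D by default, 2-D with calculate_probabilities=True), never a list.
probs = np.asarray(probs)
news_data["topic_probability"] = probs if probs.ndim == 1 else probs.max(axis=1)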
```

---

### 🧠 Section: Topic Modeling Intuition (markdown)
> BERTopic groups similar articles into semantically coherent topics using UMAP and HDBSCAN over sentence embeddings. Each article is assigned a topic and a probability of belonging to that topic. These topic features may reflect market-relevant themes that impact price changes.
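One detail worth checking before using the topic column as a feature: HDBSCAN leaves articles it cannot cluster in an outlier group, which BERTopic labels `-1`. The sketch below shows one way to measure how many articles fall into that group. It assumes a `topic` column like the one created in step 3; the toy values and the `is_outlier` column name are illustrative, not from the dataset.

```python
import pandas as pd

# Toy stand-in for news_data after BERTopic has assigned topics (values are made up).
toy = pd.DataFrame({
    "topic": [0, 0, 1, -1, 1, -1, 2, 0],
    "price_change_percentage": [1.2, -0.4, 0.8, 3.0, -1.1, 0.2, 0.5, 0.9],
})

# BERTopic uses -1 for documents that HDBSCAN treated as noise.
toy["is_outlier"] = toy["topic"].eq(-1)
outlier_share = toy["is_outlier"].mean()

# Articles per real topic, outliers excluded.
topic_sizes = toy.loc[~toy["is_outlier"], "topic"].value_counts().sort_index()

print(f"Outlier share: {outlier_share:.2%}")
print(topic_sizes)
```

If the outlier share is large, the per-topic statistics for `-1` describe a mix of unrelated articles rather than a theme, so it may be worth excluding those rows or handling them separately.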

---

### 4. Engineer Topic-Based Features
```python
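# %% Topic Features
# Per-topic mean and standard deviation of the price change, broadcast back to each article.
# Caution: both are derived from the target over all rows; for evaluation, compute them on training rows only.
news_data["topic_avg_movement"] = news_data.groupby("topic")["price_change_percentage"].transform("mean")
news_data["topic_sensitivity"] = news_data.groupby("topic")["price_change_percentage"].transform("std")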
```
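Because these two features are built from `price_change_percentage`, which is also the prediction target, computing them over the full dataset lets test-set targets leak into the features. A minimal sketch of one way around that, assuming a boolean `is_train` column (a hypothetical name) that marks the training rows; the values are toy data:

```python
import numpy as np
import pandas as pd

# Toy data: the last three rows play the role of the test set.
toy = pd.DataFrame({
    "topic": [0, 0, 1, 1, 0, 1, 2],
    "price_change_percentage": [1.0, 3.0, -2.0, 0.0, 10.0, 8.0, 5.0],
    "is_train": [True, True, True, True, False, False, False],
})

train = toy[toy["is_train"]]

# Per-topic mean and standard deviation from training rows only.
topic_stats = train.groupby("topic")["price_change_percentage"].agg(
    topic_avg_movement="mean", topic_sensitivity="std"
)

# Attach the training statistics to every row; topics unseen in training get NaN.
toy = toy.join(topic_stats, on="topic")
print(toy)
```

Here the test rows for topics 0 and 1 receive the training means (2.0 and -1.0) even though their own price changes are very different, and topic 2, which never appears in training, is left as NaN for the model or an imputation step to handle.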

---

### 5. Create Risk Score from Topic and Sentiment
```python
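# %% Create Sentiment-Weighted Risk Score
# Net sentiment of the description: positive minus negative FinBERT probability.
news_data["sentiment_impact"] = news_data["finbert_description_positive"] - news_data["finbert_description_negative"]
# Sort chronologically so the rolling window covers consecutive rows per ticker.
# Note: window=30 counts rows (articles), not calendar days.
news_data = news_data.sort_values(["ticker", "publish_date"])
news_data["market_volatility"] = news_data.groupby("ticker")["price_change_percentage"].transform(lambda x: x.rolling(window=30, min_periods=1).std())
news_data["risk_score_topic"] = (
    news_data["sentiment_impact"] * news_data["topic_sensitivity"]
) * news_data["market_volatility"]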
```
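To make the formula concrete, here is a small worked example of the same product for single articles. The function name `risk_score_topic` and all input values are made up for illustration:

```python
def risk_score_topic(positive, negative, topic_sensitivity, market_volatility):
    """Signed score: net sentiment scaled by topic and ticker volatility."""
    sentiment_impact = positive - negative
    return sentiment_impact * topic_sensitivity * market_volatility


# Same topic sensitivity (2.0) and ticker volatility (1.5), different sentiment.
bullish = risk_score_topic(0.70, 0.10, 2.0, 1.5)
bearish = risk_score_topic(0.10, 0.70, 2.0, 1.5)
neutral = risk_score_topic(0.30, 0.30, 2.0, 1.5)

print(f"bullish={bullish:.2f}, bearish={bearish:.2f}, neutral={neutral:.2f}")
```

Note that the score is signed and collapses to zero whenever positive and negative sentiment balance out, regardless of how volatile the topic or ticker is. If the intent is a magnitude of risk rather than a direction, taking the absolute value of the sentiment term may be closer to what is wanted.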

---

### 6. Train Regression Model with Topic Features
```python
# %% Train Model Using Topic-Based Features
features = [
    "topic_avg_movement", "topic_sensitivity", "sentiment_impact", "market_volatility",
    "finbert_title_positive", "finbert_title_negative", "finbert_description_positive", "finbert_description_negative"
]
X = news_data[features].replace([np.inf, -np.inf], np.nan).dropna()
y = news_data.loc[X.index, "price_change_percentage"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBRegressor(objective="reg:squarederror", n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}, R²: {r2_score(y_test, y_pred):.4f}")
```
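One caveat with the random split above: news and prices are ordered in time, so a random split lets the model train on articles published after some of the test articles. A chronological split is a common alternative. This is a sketch under the assumption that a `publish_date` column is available; the `feature` column and the dates are placeholders:

```python
import pandas as pd

# Toy frame with rows deliberately out of date order.
toy = pd.DataFrame({
    "publish_date": pd.date_range("2024-01-01", periods=10, freq="D")[::-1],
    "feature": range(10),
})

# Sort by date, then hold out the most recent 20% of rows as the test set.
toy = toy.sort_values("publish_date").reset_index(drop=True)
cutoff = int(len(toy) * 0.8)
train, test = toy.iloc[:cutoff], toy.iloc[cutoff:]

print(train["publish_date"].max(), "<", test["publish_date"].min())
```

With several articles per day, splitting on a cutoff date rather than a row position would keep same-day articles on one side of the split.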

---

### 📊 Section: Preliminary Evaluation (markdown)
> The model’s performance using topic-based features is compared to prior models that used token-level risk scores. We’ll evaluate based on MAE and R² metrics. If topic features yield competitive performance, they may offer a scalable alternative to token-based approaches.
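The comparison can be laid out as a small table of MAE and R² per model, with a predict-the-mean baseline as a reference point (its R² is 0 by construction). The sketch below computes both metrics by hand with NumPy; the prediction arrays are invented purely to show the mechanics and are not results from either model:

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


y_test = np.array([1.0, -2.0, 0.5, 3.0])
candidates = {
    "mean_baseline": np.full_like(y_test, y_test.mean()),
    "topic_features": np.array([0.5, -1.0, 0.0, 2.0]),  # made-up predictions
    "token_risk": np.array([1.5, 0.0, 1.0, 1.0]),       # made-up predictions
}

for name, y_pred in candidates.items():
    print(f"{name:>15}: MAE={mae(y_test, y_pred):.3f}, R2={r2(y_test, y_pred):.3f}")
```

A model whose R² is at or below the baseline's 0 is not explaining any variance beyond the mean, which is a useful sanity check before comparing the two feature sets against each other.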

---

### 7. Visualize Topic Sensitivities and Volatilities
```python
# %% Plot Topic Sensitivity vs Average Price Movement
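# Topic-level statistics repeat on every article, so plot one point per topic.
topic_summary = news_data.drop_duplicates("topic")
plt.figure(figsize=(10, 6))
sns.scatterplot(data=topic_summary, x="topic_avg_movement", y="topic_sensitivity", hue="topic", palette="tab10", alpha=0.6)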
plt.title("Topic Sensitivity vs Average Movement")
plt.xlabel("Average Price Movement")
plt.ylabel("Standard Deviation (Sensitivity)")
plt.show()
```

---

### 📌 Final Notes (markdown)
> - Can topic-driven features replace token_score for risk scoring? Promising, but still to be confirmed by the MAE and R² comparison.
> - Tradeoff: Topic modeling requires more compute but produces interpretable clusters.
> - Next step: compare this to token-focused risk and see which yields more stable predictive performance across tickers.
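For the last point, one simple way to look at stability across tickers is to compute the error per ticker and then the spread of those per-ticker errors. A minimal sketch, assuming a results frame with `ticker`, `y_true`, and `y_pred` columns (hypothetical names and toy values):

```python
import pandas as pd

# Toy predictions for three placeholder tickers.
results = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB", "CCC", "CCC"],
    "y_true": [1.0, -1.0, 2.0, 0.0, -0.5, 0.5],
    "y_pred": [0.5, -0.5, 0.0, 1.0, -0.5, 0.0],
})

results["abs_error"] = (results["y_true"] - results["y_pred"]).abs()

# MAE per ticker, then the spread of those MAEs as a rough stability measure.
per_ticker_mae = results.groupby("ticker")["abs_error"].mean()
mae_spread = per_ticker_mae.std()

print(per_ticker_mae)
print(f"Spread of per-ticker MAE: {mae_spread:.3f}")
```

Running this once for the topic-based model and once for the token-based one would give two spreads to compare; a lower spread suggests the model's accuracy depends less on which ticker it is scoring.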

---

Would you like me to generate this clean notebook file for you now (either here or directly in `.ipynb` format), or are you planning to copy/paste and run section-by-section first before I finalize it?

In [1]:
# %% Load Libraries
import pandas as pd
import numpy as np
import duckdb
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# %% Connect to DuckDB and Load Dataset
db_path = '/Users/bradams/Documents/financial_news.db'
# The notebook only reads from the database, so open it read-only.
conn = duckdb.connect(database=db_path, read_only=True)

query = """ SELECT 
            a.ticker,
            a.mapped_trading_date AS publish_date,
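            -- article_title is an assumed column name; the BERTopic cell reads news_data["article_title"]
            a.article_title,
            a.description,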
            dpm.price_change_percentage,
            f.finbert_title_score,
            f.finbert_description_score,
            f.finbert_title_positive,
            f.finbert_title_neutral,
            f.finbert_title_negative,
            f.finbert_description_positive,
            f.finbert_description_neutral,
            f.finbert_description_negative
        FROM "Headlines"."Articles_Trading_Day" a
        INNER JOIN "Headlines"."Daily_Price_Movement" dpm
            ON a.mapped_trading_date = dpm.trading_date  
        INNER JOIN "Headlines"."finbert_analysis" f
            ON a.guid = f.guid;
"""

news_data = conn.execute(query).fetchdf()
conn.close()
news_data.info()

In [None]:
# %% Generate BERTopic Topics
texts = news_data["article_title"].fillna("") + " " + news_data["description"].fillna("")
embedding_model = SentenceTransformer("paraphrase-MiniLM-L3-v2")
bertopic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=UMAP(n_neighbors=5, min_dist=0.5, n_components=5, metric='cosine'),
    verbose=True
)
topics, probs = bertopic_model.fit_transform(texts.tolist())
news_data["topic"] = topics
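# probs is a NumPy array (1-D by default, 2-D with calculate_probabilities=True), never a list.
probs = np.asarray(probs)
news_data["topic_probability"] = probs if probs.ndim == 1 else probs.max(axis=1)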