# Download & Prepare Financial News + Market Data (Colab)
This notebook shows how to install the Kaggle CLI, download the **notlucasp/financial-news-headlines** dataset from Kaggle, fetch OHLCV data with `yfinance`, compute technical indicators, optionally embed headlines with `sentence-transformers`, merge prices and news by date and ticker, and save the merged CSV for modeling.

**Note:** In Colab, either upload `kaggle.json` and move it to `~/.kaggle/kaggle.json` (recommended), or skip the API and upload the dataset CSV manually through the Files sidebar.


## 1. Install dependencies (run this in Colab)


In [None]:
# Install libs (uncomment to run in Colab)
# !pip install -q kaggle yfinance sentence-transformers faiss-cpu lightgbm pandas numpy scikit-learn


## 2. Download Kaggle dataset (requires kaggle.json API key)


In [None]:
# Steps to use Kaggle API in Colab:
# 1) Upload kaggle.json (from your Kaggle account) to Colab
# 2) Move it to ~/.kaggle/kaggle.json
# 3) Run kaggle datasets download -d notlucasp/financial-news-headlines

# Example commands (uncomment when you have kaggle.json uploaded):
# !mkdir -p ~/.kaggle
# !cp /content/kaggle.json ~/.kaggle/
# !chmod 600 ~/.kaggle/kaggle.json
# !kaggle datasets download -d notlucasp/financial-news-headlines
# !unzip -o financial-news-headlines.zip -d ./financial_news
# After unzipping, list the folder to see the extracted CSV file names (rename as needed):
# !ls ./financial_news

## 3. Download OHLCV price data with yfinance


In [None]:
import yfinance as yf
import pandas as pd

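tickers = ['AAPL','MSFT','GOOG']  # replace with tickers you want (use NSE tickers for Indian stocks)
start = '2015-01-01'
end = '2024-12-31'  # yfinance treats the end date as exclusive
# auto_adjust=False keeps the separate 'Adj Close' column selected below;
# newer yfinance releases adjust prices by default and drop that column.
price = yf.download(tickers, start=start, end=end, auto_adjust=False, progress=False)
# Columns come back as a (field, ticker) MultiIndex; move the ticker level into the rows
price = price.stack(level=1).rename_axis(['Date','Ticker']).reset_index()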
price['date'] = pd.to_datetime(price['Date']).dt.date
price = price.rename(columns={'Ticker':'ticker','Open':'open','High':'high','Low':'low','Close':'close','Adj Close':'adj_close','Volume':'volume'})
price = price[['date','ticker','open','high','low','close','adj_close','volume']]
price.head()
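The wide-to-long reshape in the cell above can be hard to picture without running a download. The sketch below reproduces it offline on a tiny synthetic frame; the values, the `wide` and `long_df` names, and the two-field layout are made up for illustration and only mimic the (field, ticker) column layout that a multi-ticker `yf.download` call is assumed to return.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a multi-ticker download: columns are (field, ticker)
dates = pd.to_datetime(['2024-01-02', '2024-01-03'])
cols = pd.MultiIndex.from_product(
    [['Close', 'Volume'], ['AAPL', 'MSFT']], names=['Price', 'Ticker']
)
wide = pd.DataFrame(np.arange(8, dtype=float).reshape(2, 4), index=dates, columns=cols)
wide.index.name = 'Date'

# Move the ticker level from the columns into the rows: one row per (Date, Ticker)
long_df = wide.stack(level='Ticker').reset_index()
long_df.columns.name = None  # drop the leftover 'Price' label on the column axis
print(long_df)
```

Each (date, ticker) pair becomes one row, which is the shape the indicator and merge steps expect.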

## 4. Load the Kaggle news CSV (or upload it manually)


In [None]:
# Example: if you extracted the Kaggle dataset to ./financial_news/
# (the file name below is a placeholder; use the name of the CSV that was actually extracted)
# news = pd.read_csv('./financial_news/financial-news-headlines.csv')
# If column names differ, rename:
# news.rename(columns={'title':'headline','published_at':'date'}, inplace=True)
# Convert to date only:
# news['date'] = pd.to_datetime(news['date']).dt.date
print('Load your news CSV here (path may vary)')
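The rename-and-convert steps above can be wrapped in a small helper so that malformed rows do not stop the notebook. This is a minimal sketch: `normalize_news` is a hypothetical helper, and `title` and `published_at` are placeholder column names, not the dataset's confirmed schema.

```python
import pandas as pd

def normalize_news(raw, headline_col, date_col):
    """Return a copy with 'headline' and 'date' (datetime.date) columns."""
    out = raw.rename(columns={headline_col: 'headline', date_col: 'date'})
    # errors='coerce' turns unparseable timestamps into NaT instead of raising
    out['date'] = pd.to_datetime(out['date'], errors='coerce')
    # Drop rows that have no usable date or no headline text
    out = out.dropna(subset=['date', 'headline'])
    out['date'] = out['date'].dt.date
    return out.reset_index(drop=True)

# Toy input with hypothetical column names and two bad rows
raw = pd.DataFrame({
    'title': ['Fed holds rates', 'Apple beats estimates', None],
    'published_at': ['2020-07-17 19:51:00', 'not a date', '2020-07-18 08:00:00'],
})
news_demo = normalize_news(raw, 'title', 'published_at')
print(news_demo)
```

Counting how many rows were dropped is a cheap way to spot a wrong column mapping early.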

## 5. Feature engineering: technical indicators (example functions)


In [None]:
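import numpy as np

def compute_technical_indicators(df):
    """Add return, SMA, EMA, volatility and RSI columns.

    Expects the rows of a single ticker, sorted by date.
    """
    df = df.copy()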
    df['return'] = df['close'].pct_change()
    df['sma_10'] = df['close'].rolling(10).mean()
    df['ema_20'] = df['close'].ewm(span=20, adjust=False).mean()
    df['vol_20'] = df['return'].rolling(20).std()
    # RSI
    delta = df['close'].diff()
    up = delta.clip(lower=0)
    down = -1 * delta.clip(upper=0)
    ma_up = up.rolling(14).mean()
    ma_down = down.rolling(14).mean()
    rs = ma_up / ma_down
    df['rsi_14'] = 100 - (100 / (1 + rs))
    return df

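# Example usage: compute indicators separately for each ticker, in date order.
# Concatenating the groups keeps the 'ticker' column regardless of pandas version.
price_features = pd.concat(
    [compute_technical_indicators(g.sort_values('date')) for _, g in price.groupby('ticker')],
    ignore_index=True,
)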
price_features.head()
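The RSI in the cell above averages gains and losses with a simple 14-day rolling mean. Many charting tools instead use Wilder's smoothing, an exponential average with `alpha = 1 / period`. The sketch below shows that variant as an alternative, not as the method this notebook uses; `rsi_wilder` is a hypothetical helper name and the two input series are synthetic.

```python
import pandas as pd

def rsi_wilder(close, period=14):
    """RSI with Wilder-style exponential smoothing (alpha = 1 / period)."""
    delta = close.diff()
    gain = delta.clip(lower=0)      # positive moves, zero otherwise
    loss = (-delta).clip(lower=0)   # size of negative moves, zero otherwise
    avg_gain = gain.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss        # becomes inf when there are no losses
    return 100 - 100 / (1 + rs)

rising = pd.Series(range(1, 31), dtype=float)  # strictly increasing prices
zigzag = pd.Series([10.0, 11.0] * 15)          # alternating up and down moves
rsi_rising = rsi_wilder(rising)
rsi_zigzag = rsi_wilder(zigzag)
print(rsi_rising.iloc[-1], rsi_zigzag.iloc[-1])
```

A series that only rises has no losses, so its RSI saturates at 100. The early rows stay NaN until enough observations have accumulated.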

## 6. Optional: compute embeddings for news text with sentence-transformers


In [None]:
# Uncomment to use embeddings (requires sentence-transformers and possibly GPU)
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('all-MiniLM-L6-v2')
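# Use the summary column only if it exists (news.get('summary', '') would return a plain str without .fillna)
# summary = news['summary'].fillna('') if 'summary' in news.columns else ''
# news['text'] = news['headline'].fillna('') + '. ' + summary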
# embeddings = model.encode(news['text'].tolist(), show_progress_bar=True)
# Save or aggregate embeddings per date/ticker
print('Embeddings step: optional but recommended for text features')
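One way to aggregate embeddings per date, as the comment in the cell above suggests, is to average every headline vector that shares a date. The sketch below uses small made-up vectors in place of real `model.encode` output so that it runs without downloading a model; `mean_embedding_per_day` and the `emb_i` column names are hypothetical.

```python
import numpy as np
import pandas as pd

def mean_embedding_per_day(dates, embeddings):
    """Average the embedding vectors that share a date; one output row per date."""
    emb = pd.DataFrame(np.asarray(embeddings))
    emb.columns = [f'emb_{i}' for i in range(emb.shape[1])]
    emb['date'] = list(dates)
    return emb.groupby('date', as_index=False).mean()

# Made-up 2-dimensional vectors standing in for model.encode(...) output
dates = ['2020-01-02', '2020-01-02', '2020-01-03']
fake_embeddings = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]])
daily_emb = mean_embedding_per_day(dates, fake_embeddings)
print(daily_emb)
```

Mean pooling is only one choice. If the news data carries a ticker column, grouping by `['date', 'ticker']` instead would give per-ticker features.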

## 7. Merge price features & news (date + ticker alignment)


In [None]:
# Example merge when 'news' and 'price_features' exist.
# This assumes news has a 'ticker' column; for market-wide headlines, merge on 'date' only.
# news['date'] = pd.to_datetime(news['date']).dt.date
# merged = price_features.merge(news, on=['date','ticker'], how='left')
# merged.to_csv('merged_quant_dataset.csv', index=False)
print('Merge step: ensure date types and ticker naming match')
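A left merge can silently duplicate price rows when several headlines share a date, or match nothing at all when date types differ. The sketch below, on made-up frames, collapses the news to one row per (date, ticker) first and then uses the `validate` and `indicator` options of `DataFrame.merge` to check the result. The frame names and the `n_headlines` feature are illustrative assumptions.

```python
import datetime as dt
import pandas as pd

price_demo = pd.DataFrame({
    'date': [dt.date(2020, 1, 2), dt.date(2020, 1, 3), dt.date(2020, 1, 2)],
    'ticker': ['AAPL', 'AAPL', 'MSFT'],
    'close': [75.0, 74.3, 160.6],
})
news_demo = pd.DataFrame({
    'date': [dt.date(2020, 1, 2), dt.date(2020, 1, 2)],
    'ticker': ['AAPL', 'AAPL'],
    'headline': ['headline a', 'headline b'],
})

# Collapse to one row per (date, ticker) so the merge cannot duplicate price rows
news_daily = (
    news_demo.groupby(['date', 'ticker'])
    .agg(n_headlines=('headline', 'size'))
    .reset_index()
)

# validate raises if keys are not unique; indicator adds a '_merge' column
merged_demo = price_demo.merge(
    news_daily, on=['date', 'ticker'], how='left',
    validate='one_to_one', indicator=True,
)
coverage = (merged_demo['_merge'] == 'both').mean()
print(merged_demo)
print(f'share of price rows with news: {coverage:.2f}')
```

A coverage of zero usually points to mismatched date types or ticker spellings rather than a real absence of news.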

## 8. Save outputs


In [None]:
import os

os.makedirs('/content/quant_data_outputs', exist_ok=True)
# merged.to_csv('/content/quant_data_outputs/merged_quant_dataset.csv', index=False)
print('When you run merge, save merged CSV to /content/quant_data_outputs')
print('Example merged CSV included in this package: example_merged_quant_dataset.csv')


## Notes
- If you do not want to use Kaggle API, upload CSV via Colab Files UI and set `news = pd.read_csv('/content/your_news.csv')`.
- For Indian tickers, yfinance may require ticker suffixes like `.NS` (for NSE), e.g., `RELIANCE.NS`.
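- Always verify date alignment to avoid look-ahead bias: a headline published after the market close on a given day should only inform features for the next trading day.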
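A conservative way to guard against look-ahead bias is to lag every news-derived column by one trading row per ticker, so that the features for day t use only news dated before t. This is a minimal sketch on made-up data; `lag_news_features` and `n_headlines` are hypothetical names, and a one-row lag is an assumption that may be stricter than needed when headline timestamps are available.

```python
import pandas as pd

def lag_news_features(df, cols, periods=1):
    """Shift news-derived columns forward in time within each ticker."""
    out = df.sort_values(['ticker', 'date']).copy()
    # groupby + shift keeps one ticker's news from leaking into another's rows
    out[cols] = out.groupby('ticker')[cols].shift(periods)
    return out

demo = pd.DataFrame({
    'date': pd.to_datetime(
        ['2020-01-02', '2020-01-03', '2020-01-06', '2020-01-02', '2020-01-03']
    ),
    'ticker': ['AAPL', 'AAPL', 'AAPL', 'MSFT', 'MSFT'],
    'n_headlines': [2, 0, 5, 1, 4],
})
lagged = lag_news_features(demo, ['n_headlines'])
print(lagged)
```

The first row of each ticker becomes NaN because no earlier news exists for it; drop or impute those rows before modeling.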
