# Baseline Experiment - Sentiment Analysis
## Model: FinBERT PT-BR
- Hypothesis (h1a): Sentiment outputs, mapped to direction, can predict short-term exchange rate movements.
- Objective: Establish a baseline for comparison before testing the capabilities of the much-hyped LLMs
- Value Proposition: This is the first known study on applying language models to trading in emerging currency markets, especially in a multilingual context.
- Why sentiment analysis as a baseline:
    - FinBERT is widely cited in the financial NLP literature
    - It outperforms general BERT and lexicon-based models on tasks such as financial sentiment classification
    - Traditional ML methods rely on sparse inputs or static word embeddings (like Word2Vec), which don't capture context
    - Sentiment analysis is commonly used to generate trading signals. However, I believe the market does not move on
    whether a piece of text is happy or sad, so I expect the following experiments to outperform this baseline.
    I just want to rule sentiment analysis out of the picture. "Predicting directional movement" is a better approach.
    - It was used as a benchmark in a very similar paper: https://doi.org/10.1016/j.mlwa.2023.100508

- Independent Variable (Predictor):
    - Text: headline / article content
    - Category: FinBERT sentiment output
    - Binary Label: heuristic mapping (positive -> 1, negative -> -1), i.e. bullish or bearish in commercial terms; neutral outputs are dropped
    - (POSSIBLY CONSIDER as a control var/experiment?): Multi-class Label: add a neutral (0) label defined by a threshold (minimum exchange rate % change)

- Dependent Variable (Ground Truth):
    - Directional Movement: binary direction of exchange rate following news timestamp (time frame TBD)
    - (POSSIBLY CONSIDER?): percent change in exchange rate over a defined window, as a measure of profitability

- Dataset Creation Process:
    - News Data: 4630 financial news headlines in Brazilian Portuguese (some with full articles; count double-checked), with precise timestamps
        - Bom Dia Mercado (BDM) → Eli formatted news data into an Excel file → preprocess.ipynb → export to repo → final dataset
    - FX Rate Data: Minute-level time series of USD/BRL exchange rates, synchronized with news timestamps.
        - Bloomberg → retrieve USD/BRL exchange rates as an Excel file → preprocess.ipynb → export to repo → final dataset
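
The synchronization of news timestamps with minute-level FX quotes could also be expressed with `pandas.merge_asof`. The following is only a sketch on toy data: the helper name `align_news_to_fx` and the 2-minute tolerance are assumptions for illustration, while the column names `Timestamp`, `HEADING` and `USD/BRL` follow the processed files used in this notebook.

```python
import pandas as pd

def align_news_to_fx(news, fx, tolerance="2min"):
    """Attach the nearest FX quote to each news row (hypothetical helper)."""
    # merge_asof requires both frames to be sorted by the key column
    news = news.sort_values("Timestamp")
    fx = fx.sort_values("Timestamp")
    # direction="nearest" picks the closest quote on either side of the news time;
    # tolerance leaves the rate as NaN when no quote is close enough
    return pd.merge_asof(
        news, fx, on="Timestamp",
        direction="nearest", tolerance=pd.Timedelta(tolerance),
    )

# Toy data, not the real corpus
news = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2024-01-02 10:00:20", "2024-01-02 18:30:00"]),
    "HEADING": ["headline A", "headline B"],
})
fx = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2024-01-02 10:00:00", "2024-01-02 10:01:00"]),
    "USD/BRL": [4.90, 4.91],
})
aligned = align_news_to_fx(news, fx)
print(aligned)
```

With a tolerance, headlines published far from any quote (for example outside trading hours) get a missing rate instead of a stale one, so they can be excluded explicitly.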

In [None]:
# FinBERT baseline: use HEADLINES only, not ARTICLE CONTENT or COMMENTS

# Step 1: Load processed data and drop unnecessary columns
import pandas as pd
allen_df = pd.read_csv('../data/processed/allen-corpus.csv')
fx_df = pd.read_csv('../data/processed/usd-brl.csv')

# Drop 'ARTICLE CONTENT' and 'COMMENTS' columns if they exist
allen_df = allen_df.drop(columns=['ARTICLE CONTENT', 'COMMENTS'], errors='ignore')

# Preview data
display(allen_df.head())
display(fx_df.head())

In [None]:
# Step 2 & 3: Match exchange rates and label ground truth for t+1 to t+20 minutes
import numpy as np
import pandas as pd

# Ensure timestamps are datetime
allen_df['Timestamp'] = pd.to_datetime(allen_df['Timestamp'])
fx_df['Timestamp'] = pd.to_datetime(fx_df['Timestamp'])

# np.searchsorted requires sorted input, so sort the FX series by time first
fx_df = fx_df.sort_values('Timestamp').reset_index(drop=True)

# Convert fx_times to numpy datetime64 for searchsorted
fx_times = fx_df['Timestamp'].values.astype('datetime64[ns]')
fx_rates = fx_df['USD/BRL'].values

# Helper: find the fx rate whose timestamp is closest to a given time
def get_closest_rate(ts, fx_times, fx_rates):
    # Convert ts to numpy datetime64 if it's a pandas Timestamp
    ts64 = np.datetime64(ts)
    idx = np.searchsorted(fx_times, ts64)
    # Outside the FX series, fall back to the first/last available rate
    if idx == 0:
        return fx_rates[0]
    if idx == len(fx_times):
        return fx_rates[-1]
    before = fx_times[idx-1]
    after = fx_times[idx]
    # Compare the two gaps directly; ties go to the earlier quote
    if (ts64 - before) <= (after - ts64):
        return fx_rates[idx-1]
    else:
        return fx_rates[idx]

# For each news item, get fx at t and at t+1 to t+20 minutes
allen_df['fx_t'] = allen_df['Timestamp'].apply(lambda t: get_closest_rate(t, fx_times, fx_rates))
for n in range(1, 21):
    allen_df[f'fx_t+{n}'] = allen_df['Timestamp'].apply(lambda t: get_closest_rate(t + pd.Timedelta(minutes=n), fx_times, fx_rates))
    # Note: an unchanged rate (fx_t+n == fx_t) is labeled -1 here
    allen_df[f'direction_gt_{n}'] = np.where(allen_df[f'fx_t+{n}'] > allen_df['fx_t'], 1, -1)

# Show a preview of the new columns
cols = ['Timestamp','fx_t'] + [f'fx_t+{n}' for n in range(1, 6)] + [f'direction_gt_{n}' for n in range(1, 6)]
allen_df[cols].head()
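
The binary labeling above assigns -1 whenever the rate did not rise, including when it is unchanged. The optional ideas from the plan (a neutral class defined by a minimum % change, and percent change as a dependent variable) could be sketched as below. The function name `label_direction`, the 0.05% threshold and the output column names are illustrative assumptions, not values chosen in this study.

```python
import numpy as np
import pandas as pd

def label_direction(fx_now, fx_future, threshold_pct=0.0):
    """Return (pct_change, label): 1 up, -1 down, 0 if within +/- threshold_pct."""
    fx_now = np.asarray(fx_now, dtype=float)
    fx_future = np.asarray(fx_future, dtype=float)
    # Percent change of the exchange rate over the window
    pct_change = (fx_future - fx_now) / fx_now * 100.0
    # Start from neutral (0) and overwrite moves that exceed the threshold
    label = np.zeros(pct_change.shape, dtype=int)
    label[pct_change > threshold_pct] = 1
    label[pct_change < -threshold_pct] = -1
    return pct_change, label

# Toy rows using the same column naming as the cell above
toy = pd.DataFrame({"fx_t": [5.00, 5.00, 5.00], "fx_t+5": [5.01, 5.00, 4.98]})
pct, lab = label_direction(toy["fx_t"], toy["fx_t+5"], threshold_pct=0.05)
toy["pct_change_5"] = pct
toy["direction_3class_5"] = lab
print(toy)
```

With `threshold_pct=0.0` the only neutral rows are exact ties, which separates "no movement" from "decrease" without otherwise changing the binary labels.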

In [None]:
# Step 4: Prepare text data for FinBERT-PT-BR (headlines only)
# At this point, only the 'HEADING' column remains for text input
allen_df['text'] = allen_df['HEADING'].astype(str).str.strip()

allen_df[['text']].head()

In [None]:
# Step 5: Run FinBERT-PT-BR sentiment analysis (binary only)
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = 'lucas-leme/FinBERT-PT-BR'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Set up pipeline (neutral outputs are filtered out afterwards by map_sentiment)
sentiment_pipe = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, truncation=True, max_length=256)

def map_sentiment(result):
    label = result['label'].lower()
    if 'positive' in label:
        return 1
    elif 'negative' in label:
        return -1
    else:
        return None  # ignore neutral or unknown

# Run sentiment prediction
allen_df['sentiment_pred'] = allen_df['text'].apply(lambda x: map_sentiment(sentiment_pipe(x)[0]))

allen_df[['text','sentiment_pred']].head()
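
Calling the pipeline once per headline can be slow on a few thousand rows. A minimal sketch of chunked prediction follows, assuming the classifier returns one `{'label': ..., 'score': ...}` dict per input when given a list of strings. `predict_in_batches` and `fake_pipe` are hypothetical names; `fake_pipe` only stands in for `sentiment_pipe` so the example runs without downloading the model.

```python
def predict_in_batches(texts, pipe, batch_size=32):
    """Run a text-classification callable over texts in chunks (hypothetical helper)."""
    preds = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # One result dict per input text is expected back
        for result in pipe(batch):
            label = result["label"].lower()
            if "positive" in label:
                preds.append(1)
            elif "negative" in label:
                preds.append(-1)
            else:
                preds.append(None)  # neutral or unknown
    return preds

# Stand-in classifier (assumption): flags headlines containing "alta" as positive
def fake_pipe(batch):
    return [{"label": "POSITIVE" if "alta" in t else "NEUTRAL", "score": 1.0} for t in batch]

preds = predict_in_batches(
    ["dólar em alta", "mercado estável", "bolsa em alta"], fake_pipe, batch_size=2
)
print(preds)
```

In the notebook this would be called as `predict_in_batches(allen_df['text'].tolist(), sentiment_pipe)`. Transformers pipelines also accept a `batch_size` argument when called on a list, which may be simpler if the installed version supports it.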

In [None]:
# Step 6: Evaluate predictions vs ground truth for t+1 to t+20 minutes
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score
import matplotlib.pyplot as plt

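# Drop rows where sentiment_pred is None (i.e., neutral or missing)
eval_df = allen_df.dropna(subset=['sentiment_pred']).copy()
# Cast to int so predictions have the same dtype as the ground-truth labels
eval_df['sentiment_pred'] = eval_df['sentiment_pred'].astype(int)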

horizons = [(f'direction_gt_{i}', f'{i} min') for i in range(1, 21)]

fig, axes = plt.subplots(4, 5, figsize=(24, 16))
accuracies = []
for idx, (col, label) in enumerate(horizons):
    row, col_idx = divmod(idx, 5)
    ax = axes[row, col_idx]
    if col not in eval_df:
        # Hide the panel when a horizon column is missing
        ax.set_visible(False)
        continue
    y_true = eval_df[col]
    y_pred = eval_df['sentiment_pred']
    cm = confusion_matrix(y_true, y_pred, labels=[1, -1])
    acc = accuracy_score(y_true, y_pred)
    # Keep the horizon label with its accuracy so skipped horizons cannot shift the numbering
    accuracies.append((label, acc))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Increase (1)', 'Decrease (-1)'])
    disp.plot(ax=ax, cmap='Blues', colorbar=False)
    ax.set_title(f'{label}\nAccuracy: {acc:.2%}')
plt.tight_layout()
plt.show()

# Print accuracy for each horizon
for label, acc in accuracies:
    print(f'Accuracy for {label}: {acc:.2%}')
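
Directional accuracy is easier to judge next to a trivial reference, since up and down moves are rarely perfectly balanced. The sketch below compares a model's accuracy with the accuracy of always predicting the most frequent direction. The helper name `majority_baseline_accuracy` and the toy arrays are illustrative assumptions, not results from this experiment.

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class in y_true."""
    _, counts = np.unique(np.asarray(y_true), return_counts=True)
    return counts.max() / counts.sum()

# Toy labels for illustration only
y_true = np.array([1, 1, 1, -1, -1])
y_pred = np.array([1, 1, 1, -1, 1])

model_acc = float((y_true == y_pred).mean())
baseline_acc = float(majority_baseline_accuracy(y_true))
print(f"model: {model_acc:.2%}, majority baseline: {baseline_acc:.2%}")
```

In the notebook the same comparison could be run per horizon with `eval_df[f'direction_gt_{i}']` as `y_true` and `eval_df['sentiment_pred']` as `y_pred`.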