# Baseline Sentiment Analysis Experiment
## Model: FinBERT PT-BR
- Hypothesis: FinBERT sentiment outputs, mapped to a direction label, can predict short-term exchange rate movements.
- Objective: Establish a baseline for comparison before testing the capabilities of much-hyped LLMs.
- Why DL:
    - FinBERT is widely cited in the financial NLP literature.
    - It outperforms general-purpose BERT and lexicon-based models on tasks such as financial sentiment classification.
    - Traditional ML methods rely on sparse inputs or static word embeddings (such as Word2Vec), which do not capture context.
    - It was used as a benchmark in a very similar paper: https://doi.org/10.1016/j.mlwa.2023.100508

- Independent Variable (Predictor):
    - Text: headline / article content
    - Category: FinBERT sentiment output
    - Binary Label: heuristic mapping (positive -> 1, negative -> -1), i.e. bullish or bearish in trading terms
    - (POSSIBLY CONSIDER as a control variable/experiment?): Multi-class Label: adds a neutral (0) class, defined by a threshold on the minimum % change in the exchange rate

- Dependent Variable (Ground Truth):
    - Directional Movement: binary direction of the exchange rate after the news timestamp (time frame TBD)
    - (POSSIBLY CONSIDER?): percent change in the exchange rate over a defined window, which would measure profitability rather than direction alone
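
The label definitions above can be sketched as two small functions. This is a minimal illustration, not the final implementation: the label strings (`positive`, `negative`, `neutral`), the helper names, and the choice to drop neutral predictions in the binary setup are all assumptions, and the actual FinBERT PT-BR output format may differ.

```python
from typing import Optional

# Hypothetical label names; the real model output strings may differ.
SENTIMENT_TO_DIRECTION = {"positive": 1, "negative": -1, "neutral": 0}


def sentiment_to_direction(label: str, keep_neutral: bool = False) -> Optional[int]:
    """Map a sentiment label to a direction (1 = bullish, -1 = bearish)."""
    direction = SENTIMENT_TO_DIRECTION[label.strip().lower()]
    if direction == 0 and not keep_neutral:
        return None  # assumption: the binary setup leaves neutral predictions out
    return direction


def movement_label(pct_change: float, threshold: float = 0.0) -> int:
    """Label an exchange rate move; changes within +/- threshold are neutral (0)."""
    if pct_change > threshold:
        return 1
    if pct_change < -threshold:
        return -1
    return 0


print(sentiment_to_direction("Positive"))        # 1
print(movement_label(-0.05, threshold=0.1))      # 0
```

Setting `keep_neutral=True` together with a non-zero `threshold` would correspond to the multi-class variant listed as a possible control experiment.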

### Preprocessing

Bom Dia Mercado (BDM) → xlsx file with BDM articles and more → preprocessing to CSV → export to repository → final dataset

In [None]:
import pandas as pd
from dateutil import parser

# Step 0: Load the dataset
df = pd.read_excel("../data/raw/allen-corpus.xlsx")

# Steps 1–3: Normalize individual DATE and TIME cells
def parse_datetime_components(date_cell, time_cell):
    # Missing cells would otherwise be coerced to the string "nan"
    if pd.isna(date_cell) or pd.isna(time_cell):
        return pd.NaT
    try:
        # Coerce both to string and strip spaces
        date_str = str(date_cell).strip()
        time_str = str(time_cell).strip()

        # Combine and parse flexibly; dayfirst=True reads 05/01 as 5 January
        dt = parser.parse(f"{date_str} {time_str}", dayfirst=True)
        return dt.isoformat()
    except (ValueError, OverflowError):
        return pd.NaT  # mark invalid rows

# Step 4: Create ISO 8601 Timestamp column
df['Timestamp'] = df.apply(lambda row: parse_datetime_components(row['DATE'], row['TIME']), axis=1)

# Step 5: Drop old columns
df.drop(columns=['DATE', 'TIME', 'Index', 'DIRECTION', 'BRER', 'LABEL'], inplace=True)

# Step 6: Collapse newlines in HEADING, ARTICLE CONTENT and COMMENTS
for col in ['HEADING', 'ARTICLE CONTENT', 'COMMENTS']:
    if col in df.columns:
        # The "string" dtype keeps missing cells as <NA> instead of the literal text "nan"
        df[col] = df[col].astype("string").str.replace(r'[\r\n]+', ' ', regex=True).str.strip()

# Step 7: Reorder columns
df = df[['Timestamp'] + [col for col in df.columns if col != 'Timestamp']]

# Step 8: Save as CSV
df.to_csv("../data/processed/allen-corpus.csv", index=False, encoding='utf-8-sig')

# Step 9: Check for invalid rows (passed)
invalid_rows = df[df['Timestamp'].isna()]
print(f"{len(invalid_rows)} invalid rows found.")
print(invalid_rows)

0 invalid rows found.
Empty DataFrame
Columns: [Timestamp, HEADING, ARTICLE CONTENT, COMMENTS]
Index: []
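
If the DATE and TIME cells turn out to follow one known layout, an explicit format string is stricter than flexible parsing and rejects anything unexpected instead of guessing. The sketch below uses only the standard library; the layout `%d/%m/%Y %H:%M` and the helper name `to_iso_strict` are assumptions for illustration and are not taken from the corpus.

```python
from datetime import datetime
from typing import Optional

ASSUMED_FORMAT = "%d/%m/%Y %H:%M"  # assumption: day-first date, 24-hour time


def to_iso_strict(date_str: str, time_str: str, fmt: str = ASSUMED_FORMAT) -> Optional[str]:
    """Return an ISO 8601 string, or None if the cells do not match fmt."""
    try:
        combined = f"{date_str.strip()} {time_str.strip()}"
        return datetime.strptime(combined, fmt).isoformat()
    except ValueError:
        return None  # layout mismatch: flag the row rather than guess


print(to_iso_strict("05/01/2023", "09:30"))  # 2023-01-05T09:30:00
print(to_iso_strict("2023-01-05", "09:30"))  # None
```

The trade-off is that rows in any other layout are flagged as invalid, which is only desirable once the raw formats have been inspected.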


Bloomberg → download USD/BRL exchange rates as an Excel file → preprocess to CSV → export to repository → final dataset

In [None]:
import pandas as pd
from dateutil import parser

# Step 0: Load the dataset
df = pd.read_excel("../data/raw/usd-brl.xlsx")

# Step 1: Clean column names
df.columns = [col.strip() for col in df.columns]
df.rename(columns={"Date": "Raw Timestamp", "Último preço": "USD/BRL"}, inplace=True)

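# Step 2: Parse "Raw Timestamp" into ISO 8601 format
def parse_iso8601(raw):
    # Missing cells would otherwise be coerced to the string "nan"
    if pd.isna(raw):
        return pd.NaT
    try:
        return parser.parse(str(raw).strip()).isoformat()
    except (ValueError, OverflowError):
        return pd.NaT  # mark invalid rows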

df["Timestamp"] = df["Raw Timestamp"].apply(parse_iso8601)

# Step 3: Drop the original column
df.drop(columns=["Raw Timestamp"], inplace=True)

# Step 4: Reorder columns
df = df[["Timestamp", "USD/BRL"]]

# Step 5: Save to CSV
df.to_csv("../data/processed/usd-brl.csv", index=False, encoding="utf-8-sig")

# Step 6: Print invalid rows (if any)
invalid_rows = df[df["Timestamp"].isna()]
print(f"{len(invalid_rows)} invalid rows found.")
print(invalid_rows)

0 invalid rows found.
Empty DataFrame
Columns: [Timestamp, USD/BRL]
Index: []
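
With both CSVs in place, the directional ground truth could be derived by looking up the rate at the news timestamp and again at the end of a holding window. The following is a minimal, dependency-free sketch: the 60-minute window, the helper names, and the sample quotes are illustrative assumptions (the time frame is still TBD). In pandas, `pd.merge_asof` could play the same role on the two processed files.

```python
from bisect import bisect_right
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Quote = Tuple[datetime, float]  # (timestamp, USD/BRL rate), sorted by timestamp


def rate_at_or_before(quotes: List[Quote], ts: datetime) -> Optional[float]:
    """Last known rate at or before ts, or None if ts precedes all quotes."""
    times = [t for t, _ in quotes]
    i = bisect_right(times, ts)
    return quotes[i - 1][1] if i else None


def direction_after(
    quotes: List[Quote],
    news_ts: datetime,
    window: timedelta = timedelta(minutes=60),  # assumption: window is TBD
) -> Optional[int]:
    """1 if USD/BRL rose over the window, -1 if it fell, 0 if unchanged."""
    start = rate_at_or_before(quotes, news_ts)
    end = rate_at_or_before(quotes, news_ts + window)
    if start is None or end is None:
        return None  # no quote available before the news timestamp
    return (end > start) - (end < start)


# Made-up sample quotes, for illustration only
quotes = [
    (datetime(2023, 1, 5, 9, 0), 5.40),
    (datetime(2023, 1, 5, 10, 0), 5.45),
    (datetime(2023, 1, 5, 11, 0), 5.42),
]
print(direction_after(quotes, datetime(2023, 1, 5, 9, 15)))   # 1
print(direction_after(quotes, datetime(2023, 1, 5, 10, 30)))  # -1
```

Using the last quote at or before each timestamp avoids looking ahead of the news, but it also means a window with no new quote is labeled 0; whether such rows should be kept or dropped is a design choice left open here.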
