# Phase 1: Data Foundation - ModernFinBERT

This notebook implements Phase 1 of the ModernFinBERT development timeline, focusing on data audit, cleaning, and preparation.

**Timeline**: Weeks 1-3 (12 hours total)
**Dataset**: `neoyipeng/financial_reasoning_aggregated` from HuggingFace
**Goal**: Transform the raw data into a high-quality, publication-ready financial sentiment dataset

## Overview
- **Total Samples**: ~31,166 (Train: 19,940, Validation: 4,992, Test: 6,234)
- **Labels**: 3-class sentiment (NEGATIVE:0, NEUTRAL/MIXED:1, POSITIVE:2)
- **Issues to Address**: Missing imports, non-financial content, label validation

## Week 1: Data Audit & Strategy

### Monday (1hr): Dataset Audit 📊
**Goal**: Understand the dataset structure, size, and initial quality

In [None]:
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer

In [None]:
import numpy as np
from datasets import load_dataset
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
ds = load_dataset("neoyipeng/financial_reasoning_aggregated")

# Define constants and label mapping
NUM_CLASSES = 3
label_dict = {'NEUTRAL/MIXED': 1,
              'NEGATIVE': 0,
              'POSITIVE': 2}

# Filter for the sentiment task only. Keep this unmapped view (raw labels and
# source column) because the analysis and cleaning cells read from ds_sentiment.
ds_sentiment = ds.filter(lambda x: x["task"] == "sentiment")

# Map labels to one-hot encoding
ds = ds_sentiment.map(lambda ex: {
    "text": ex["text"],
    "labels": np.eye(NUM_CLASSES)[label_dict[ex["label"]]],
}, remove_columns=["label", "prompt", "aspect", "summary_detail", "title", "topic", "score_topic", "source", "task", "__index_level_0__"])

# Inspect the first example
print("First example:")
print(ds["train"][0])
print(f"\nDataset splits: {ds.keys()}")
print(f"Total samples: {sum(len(split) for split in ds.values())}")
print(f"Train: {len(ds['train'])}, Validation: {len(ds['validation'])}, Test: {len(ds['test'])}")

**Deliverables for Friday:**
- [ ] Remove exact duplicates
- [ ] Fix encoding issues  
- [ ] Generate `cleaning_stats.json`
- [ ] Prepare data for Week 2 deep cleaning
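
The "fix encoding issues" and `cleaning_stats.json` items could be approached roughly as follows. This is a minimal sketch using only the standard library. The helper names `normalize_text` and `save_cleaning_stats` are hypothetical, and the choice of NFKC normalization is an assumption, not something this notebook prescribes.

```python
import json
import re
import unicodedata
from pathlib import Path


def normalize_text(text: str) -> str:
    """Normalize Unicode, drop control/format characters, collapse whitespace.

    Assumption: NFKC is an acceptable normalization policy for this corpus.
    """
    text = unicodedata.normalize("NFKC", text)
    # Drop characters in the Unicode "C" categories (control, format, ...),
    # but keep ordinary whitespace so it can be collapsed below.
    text = "".join(
        ch for ch in text
        if ch in "\n\t " or not unicodedata.category(ch).startswith("C")
    )
    return re.sub(r"\s+", " ", text).strip()


def save_cleaning_stats(report: dict, path: str = "cleaning_stats.json") -> Path:
    """Write a cleaning report (a plain dict) to disk as UTF-8 JSON."""
    out = Path(path)
    out.write_text(json.dumps(report, indent=2, ensure_ascii=False), encoding="utf-8")
    return out
```

Applying `normalize_text` before the exact-duplicate check would also let texts that differ only in invisible characters or spacing count as duplicates.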

---

## Week 2: Deep Cleaning (Coming Next)

**Preview of Week 2 tasks:**
- **Monday**: Statistical analysis and anomaly detection
- **Tuesday**: GPT-4 assisted quality control setup  
- **Thursday**: Implement automated cleaning pipeline
- **Friday**: Create stratified train/validation/test splits
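
For the Friday task, a stratified split keeps each label's share the same in every split. The sketch below is one possible pure-Python approach, not the planned Week 2 implementation. The function name `stratified_split` is hypothetical, and the default fractions (64/16/20) are an assumption chosen to roughly match the split sizes listed in the Overview.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.64, 0.16, 0.20), seed=42):
    """Return (train, val, test) index lists that preserve per-label proportions.

    The third fraction is implicit: the test split takes whatever remains.
    """
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed for reproducible splits
    train, val, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test
```

The returned indices can be passed to a Hugging Face `Dataset.select(...)` call to materialize each split.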

**Current Status**: ✅ Week 1 implementation ready. Run all Week 1 cells in this notebook to complete the initial data audit and cleaning.

In [None]:
# Week 1 - Friday: Initial Data Cleaning
import json  # needed for the cleaning report printed at the end of this cell

print("=== INITIAL DATA CLEANING ===")

def clean_dataset(dataset_split, split_name):
    """Remove duplicates and fix basic issues"""
    print(f"\nCleaning {split_name} split...")
    
    original_count = len(dataset_split)
    seen_texts = set()
    cleaned_data = []
    duplicates_removed = 0
    
    for item in dataset_split:
        text = item['text'].strip()
        
        # Remove exact duplicates
        if text not in seen_texts:
            cleaned_data.append({
                'text': text,
                'label': item['label'],
                'source': item.get('source', 'unknown')
            })
            seen_texts.add(text)
        else:
            duplicates_removed += 1
    
    cleaned_count = len(cleaned_data)
    
    stats = {
        'split': split_name,
        'original_count': original_count,
        'cleaned_count': cleaned_count,
        'duplicates_removed': duplicates_removed,
        # Guard against an empty split to avoid ZeroDivisionError
        'duplicate_rate': duplicates_removed / original_count * 100 if original_count else 0.0
    }
    
    print(f"  Original: {original_count}")
    print(f"  Cleaned: {cleaned_count}")
    print(f"  Duplicates removed: {duplicates_removed} ({stats['duplicate_rate']:.2f}%)")
    
    return cleaned_data, stats

# Clean all splits
cleaning_stats = []
cleaned_datasets = {}

for split_name, split_data in ds_sentiment.items():
    cleaned_data, stats = clean_dataset(split_data, split_name)
    cleaned_datasets[split_name] = cleaned_data
    cleaning_stats.append(stats)

# Overall statistics
total_original = sum(s['original_count'] for s in cleaning_stats)
total_cleaned = sum(s['cleaned_count'] for s in cleaning_stats)
total_duplicates = sum(s['duplicates_removed'] for s in cleaning_stats)

print(f"\n=== OVERALL CLEANING STATISTICS ===")
print(f"Total original samples: {total_original}")
print(f"Total cleaned samples: {total_cleaned}")
print(f"Total duplicates removed: {total_duplicates}")
print(f"Overall duplicate rate: {total_duplicates/total_original*100:.2f}%")

# Save cleaned data (we'll implement actual saving in Week 2)
print(f"\nCleaned datasets ready for further processing:")
for split_name, data in cleaned_datasets.items():
    print(f"  {split_name}: {len(data)} samples")

# Create cleaning report
cleaning_report = {
    'cleaning_date': pd.Timestamp.now().isoformat(),
    'dataset_source': 'neoyipeng/financial_reasoning_aggregated',
    'splits_processed': list(cleaned_datasets.keys()),
    'total_original_samples': total_original,
    'total_cleaned_samples': total_cleaned,
    'total_duplicates_removed': total_duplicates,
    'overall_duplicate_rate': total_duplicates/total_original*100,
    'split_statistics': cleaning_stats
}

print(f"\nCleaning report created - save this as cleaning_stats.json")
print(json.dumps(cleaning_report, indent=2))

**Deliverables for Thursday:**
- [ ] Create proper directory structure
- [ ] Initialize git repository
- [ ] Create requirements.txt and .gitignore
- [ ] Write basic data loader script

### Friday (1hr): Initial Data Cleaning 🧹
**Goal**: Remove duplicates, fix encoding issues, generate cleaning stats
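
Removing duplicates within each split does not catch a text that appears in more than one split, for example in both train and test. A rough sketch of a cross-split check follows. It assumes each split is available as a list of strings, and `find_cross_split_overlap` is a hypothetical helper name.

```python
def find_cross_split_overlap(splits):
    """Map each text found in more than one split to the sorted split names.

    `splits` is a dict such as {"train": [...], "test": [...]} of text lists.
    Texts are compared after stripping surrounding whitespace.
    """
    seen = {}
    for split_name, texts in splits.items():
        for text in texts:
            seen.setdefault(text.strip(), set()).add(split_name)
    return {text: sorted(names) for text, names in seen.items() if len(names) > 1}


# Toy usage with made-up sentences
toy_splits = {
    "train": ["Shares fell 3%.", "Revenue grew."],
    "validation": ["Margins were flat."],
    "test": ["Revenue grew. ", "Guidance was cut."],
}
overlap = find_cross_split_overlap(toy_splits)
print(overlap)
```

Which copy to drop when an overlap is found is a policy decision. Keeping the evaluation copy and removing the training copy is one common choice.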

In [None]:
# Week 1 - Thursday: Setup Project Structure
# Run these commands in a bash or zsh terminal (the mkdir lines rely on brace expansion):

project_setup_commands = """
# Create directory structure
mkdir -p data/{raw,processed,cleaned}
mkdir -p models/{checkpoints,configs}  
mkdir -p scripts/{preprocessing,training,evaluation}
mkdir -p notebooks/exploratory
mkdir -p logs
mkdir -p docs

# Create requirements.txt
cat > requirements.txt << EOF
datasets>=3.4.1
huggingface_hub
hf_transfer
numpy>=1.21.0
pandas>=1.3.0
matplotlib>=3.5.0
scikit-learn>=1.0.0
torch>=1.12.0
transformers>=4.20.0
wandb
tqdm
jupyter
openai  # for GPT-4 assisted cleaning
python-dotenv  # for environment variables
EOF

# Create .gitignore
cat > .gitignore << EOF
# Data files
data/raw/*
data/processed/*
!data/raw/.gitkeep
!data/processed/.gitkeep

# Model checkpoints
models/checkpoints/*
!models/checkpoints/.gitkeep

# Logs
logs/*
!logs/.gitkeep

# Environment
.env
.venv/
__pycache__/
*.pyc

# Jupyter
.ipynb_checkpoints/

# System
.DS_Store
Thumbs.db
EOF

# Create the placeholder files that .gitignore whitelists
touch data/raw/.gitkeep data/processed/.gitkeep models/checkpoints/.gitkeep logs/.gitkeep

# Initialize git (if not already done); commit only after .gitignore exists
# so raw data, checkpoints, and logs are not committed
git init
git add .
git commit -m "Initial project setup"
"""

print("=== PROJECT SETUP COMMANDS ===")
print("Run these commands in your terminal:")
print(project_setup_commands)

# Save project structure info
project_info = {
    "project_name": "ModernFinBERT",
    "version": "0.1.0-dev",
    "dataset_source": "neoyipeng/financial_reasoning_aggregated",
    "target_accuracy": ">94%",
    "inference_target": "<50ms per sample",
    "development_schedule": "16 weeks, 4 hours/week"
}

print("\nProject Information:")
for key, value in project_info.items():
    print(f"  {key}: {value}")

**Deliverables for Tuesday:**
- [ ] Create `label_mapping.json` with standardized mapping
- [ ] Document edge cases and handling strategy
- [ ] Identify suspicious/non-financial content for review
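
One detail worth noting for `label_mapping.json`: JSON object keys are always strings, so the integer keys of `id_to_label` come back as `"0"`, `"1"`, `"2"` after a save and load. The sketch below illustrates the round trip with the notebook's label mapping. The variable names are illustrative only.

```python
import json

label_to_id = {"NEGATIVE": 0, "NEUTRAL/MIXED": 1, "POSITIVE": 2}
mapping = {
    "label_to_id": label_to_id,
    "id_to_label": {i: name for name, i in label_to_id.items()},
}

# Serializing converts the integer keys of id_to_label to strings.
restored = json.loads(json.dumps(mapping, indent=2))

# Cast the keys back to int after loading so lookups by class id work.
id_to_label = {int(i): name for i, name in restored["id_to_label"].items()}
```

To write the file itself, the same `mapping` dict can be passed to `json.dump` with an open file handle.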

### Thursday (1hr): Project Structure Setup 🏗️
**Goal**: Create organized project structure and version control
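
For environments without a bash shell, the same directory layout can be created from Python. This is a minimal sketch that mirrors the directories in the setup commands of this notebook. `create_project_structure` is a hypothetical helper, and the `.gitkeep` placeholders correspond to the entries whitelisted in the `.gitignore`.

```python
from pathlib import Path

SUBDIRS = [
    "data/raw", "data/processed", "data/cleaned",
    "models/checkpoints", "models/configs",
    "scripts/preprocessing", "scripts/training", "scripts/evaluation",
    "notebooks/exploratory", "logs", "docs",
]
# Directories whose contents are git-ignored but which should still exist in the repo
KEEP_DIRS = ["data/raw", "data/processed", "models/checkpoints", "logs"]


def create_project_structure(root="."):
    """Create the project directories and .gitkeep placeholders under `root`."""
    root = Path(root)
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    for sub in KEEP_DIRS:
        (root / sub / ".gitkeep").touch()
    return root
```

Because `exist_ok=True` is set, calling the function again on an existing project is harmless.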

In [None]:
# Week 1 - Tuesday: Text Quality and Content Analysis
print("=== TEXT QUALITY AND CONTENT ANALYSIS ===")

# Analyze text lengths
all_texts = []
for split in ds_sentiment.values():
    all_texts.extend([x['text'] for x in split])

text_lengths = [len(text.split()) for text in all_texts]
char_lengths = [len(text) for text in all_texts]

print("Text Length Statistics (in words):")
print(f"  Min: {min(text_lengths)} words")
print(f"  Max: {max(text_lengths)} words") 
print(f"  Mean: {np.mean(text_lengths):.1f} words")
print(f"  Median: {np.median(text_lengths):.1f} words")

print("\nText Length Statistics (in characters):")
print(f"  Min: {min(char_lengths)} chars")
print(f"  Max: {max(char_lengths)} chars")
print(f"  Mean: {np.mean(char_lengths):.1f} chars")

# Check for non-financial content (potential data quality issues)
suspicious_keywords = [
    'jewelry', 'heist', 'burglary', 'artifacts', 'thieves', 'robbery',
    'sports', 'weather', 'celebrity', 'entertainment', 'movie', 'music'
]

suspicious_samples = []
for text in all_texts[:1000]:  # Check first 1000 samples
    if any(keyword in text.lower() for keyword in suspicious_keywords):
        suspicious_samples.append(text)

print(f"\nPotentially non-financial samples found in the first 1000 texts: {len(suspicious_samples)}")
if suspicious_samples:
    print("\nFirst few suspicious samples:")
    for i, sample in enumerate(suspicious_samples[:5]):
        print(f"{i+1}. {sample[:150]}...")

# Check for duplicates
unique_texts = set(all_texts)
print(f"\nDuplicate Analysis:")
print(f"  Total texts: {len(all_texts)}")
print(f"  Unique texts: {len(unique_texts)}")
print(f"  Duplicates: {len(all_texts) - len(unique_texts)}")

# Save label mapping to JSON for documentation
import json

label_mapping = {
    'description': 'Label mapping for ModernFinBERT sentiment classification',
    'num_classes': NUM_CLASSES,
    'label_to_id': label_dict,
    'id_to_label': {v: k for k, v in label_dict.items()},
    'class_names': ['NEGATIVE', 'NEUTRAL/MIXED', 'POSITIVE']
}

print("\nLabel mapping created:")
print(json.dumps(label_mapping, indent=2))

**Deliverables for Monday:**
- [ ] Document all data sources and licenses → Create `data_sources.md`
- [ ] Count samples per dataset split
- [ ] Identify obvious issues (encoding, duplicates, non-financial content)
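
When flagging non-financial content by keyword, plain substring matching can misfire (for example, "sports" also matches inside "transports"). The sketch below shows one way to match whole words only, using the keyword list from this notebook. `flag_suspicious` is a hypothetical helper, and keyword matching remains a heuristic that needs manual review.

```python
import re

SUSPICIOUS_KEYWORDS = [
    "jewelry", "heist", "burglary", "artifacts", "thieves", "robbery",
    "sports", "weather", "celebrity", "entertainment", "movie", "music",
]
# \b anchors restrict matches to whole words; matching is case-insensitive
KEYWORD_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SUSPICIOUS_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)


def flag_suspicious(texts):
    """Return (index, sorted matched keywords) for each text with a whole-word hit."""
    flagged = []
    for i, text in enumerate(texts):
        hits = sorted({m.group(0).lower() for m in KEYWORD_PATTERN.finditer(text)})
        if hits:
            flagged.append((i, hits))
    return flagged
```

Returning the matched keywords alongside the index makes it easier to review why a sample was flagged.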

### Tuesday (1hr): Label Harmonization 🏷️
**Goal**: Verify label consistency and create standardized mapping
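
A small audit helper can make the consistency check explicit before the mapping is applied. The sketch below is an illustration only: `audit_labels` is a hypothetical name, and trimming whitespace and uppercasing before comparison is an assumed normalization policy.

```python
from collections import Counter

EXPECTED_LABELS = {"NEGATIVE", "NEUTRAL/MIXED", "POSITIVE"}


def audit_labels(labels):
    """Count expected labels (after strip + upper) and collect unexpected raw values."""
    counts = Counter()
    unexpected = Counter()
    for raw in labels:
        normalized = str(raw).strip().upper()
        if normalized in EXPECTED_LABELS:
            counts[normalized] += 1
        else:
            unexpected[raw] += 1  # keep the raw value for inspection
    return counts, unexpected
```

Any entries in the `unexpected` counter are candidates for the edge-case documentation rather than silent remapping.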

In [None]:
# Week 1 - Monday: Label Distribution Analysis
print("=== LABEL DISTRIBUTION ANALYSIS ===")

# Check original labels before mapping
ds_original = load_dataset("neoyipeng/financial_reasoning_aggregated")
ds_sentiment = ds_original.filter(lambda x: x["task"] == "sentiment")

print("Original label distribution per split:")
for split_name, split in ds_sentiment.items():
    labels = [x['label'] for x in split]
    label_counts = pd.Series(labels).value_counts()
    print(f"\n{split_name.upper()}:")
    for label, count in label_counts.items():
        percentage = (count / len(labels)) * 100
        print(f"  {label}: {count} ({percentage:.1f}%)")

print(f"\nUnique labels found: {set([x['label'] for x in ds_sentiment['train']])}")

# Check for any missing or unexpected labels
expected_labels = set(['NEGATIVE', 'NEUTRAL/MIXED', 'POSITIVE'])
actual_labels = set()
for split in ds_sentiment.values():
    for item in split:
        actual_labels.add(item['label'])

print(f"Expected labels: {expected_labels}")
print(f"Actual labels: {actual_labels}")
print(f"Missing labels: {expected_labels - actual_labels}")
print(f"Unexpected labels: {actual_labels - expected_labels}")

# Check other columns that might be useful
print(f"\nAvailable columns: {ds_sentiment['train'].column_names}")
# Slicing a Dataset returns a dict of columns, so read the 'source' column directly
print(f"Sample sources: {set(ds_sentiment['train'][:100].get('source', ['unknown']))}")