# Assignment 2024 S2
## Part 2: Structured Data - Direct vs Indirect Training Data

Citation: Some code has been generated with the help of Claude 3.5 Sonnet by Anthropic, and some decisions and further clarifications were made with gemini-2.0-flash-thinking-exp-01-21

drugName (categorical): name of drug

condition (categorical): name of condition

review (text): patient review

rating (numerical): 10 star patient rating

date (date): date of review entry

usefulCount (numerical): number of users who found review useful

In [None]:
# Standard library imports
import logging
import time
from pathlib import Path

# Data handling & visualization
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import pandas as pd
from ydata_profiling import ProfileReport

## Data Understanding

In [None]:
INPUT_TRAIN = "drug_review_train.csv"
INPUT_TEST = "drug_review_test.csv"

df = pd.read_csv(INPUT_TRAIN)
df_test = pd.read_csv(INPUT_TEST)

In [None]:
# # --- Data Profiling ---
# print("Profiling train data...")
# profile_train = ProfileReport(
#     df,
#     title="Drug Review Train Data Profiling Report",
#     explorative=True,
# )
# profile_train.to_file("drug_review_train_profiling.html")

# print("Profiling test data...")
# profile_test = ProfileReport(
#     df_test,
#     title="Drug Review Test Data Profiling Report",
#     explorative=True,
# )
# profile_test.to_file("drug_review_test_profiling.html")
# print("Profiling complete. Reports saved as HTML files.")

In [None]:
# Create a DataFrame with column descriptions
column_info = {
    "Column Name": ["drugName", "condition", "review", "rating", "date", "usefulCount"],
    "Data Type": [
        "categorical",
        "categorical",
        "text",
        "numerical",
        "date",
        "numerical",
    ],
    "Description": [
        "Name of the drug",
        "Name of the medical condition",
        "Patient review text",
        "10-star patient rating",
        "Date of review entry",
        "Number of users who found review useful",
    ],
}

column_df = pd.DataFrame(column_info)
column_df

In [None]:
# Analyze usefulCount zeros
df = pd.read_csv(INPUT_TRAIN)
zero_useful_count = (df["usefulCount"] == 0).sum()
total_reviews = len(df)
zero_percentage = (zero_useful_count / total_reviews) * 100

print(f"\n--- UsefulCount Analysis ---")
print(f"Total reviews: {total_reviews:,}")
print(f"Reviews with zero useful votes: {zero_useful_count:,}")
print(f"Percentage of zero useful votes: {zero_percentage:.2f}%")

### Analysis of Zero UsefulCount Impact on Sentiment:

1. Silent Majority Phenomenon:
   - The high percentage of zero useful votes suggests a classic "lurker" behavior in online communities
   - Most users read but don't interact, creating a participation inequality
   - This means our sentiment analysis might be biased towards more "engaging" content

2. Sentiment Validation Gap:
   - Reviews with zero useful votes lack community validation
   - We can't assume these reviews are less valuable - they might be newer or simply not seen by many users
   - This creates a potential temporal bias in our sentiment understanding

3. Engagement vs. Sentiment Relationship:
   - Higher useful counts might indicate more polarizing content rather than more accurate sentiment
   - Extreme opinions (very positive or very negative) tend to attract more engagement
   - This suggests we should be cautious about weighing sentiment by useful counts

4. Data Quality Implications:
   - Zero useful counts might indicate:
     a) Fresh reviews that haven't had time to accumulate votes
     b) Reviews that didn't reach many readers
     c) Reviews that readers found neither particularly helpful nor controversial
   - This impacts how we should approach sentiment weighting in our analysis

In [None]:
# Create correlation matrix
print("\n--- Correlation Matrix Analysis ---")
# Select only numerical columns
numerical_cols = df.select_dtypes(include=["int64", "float64"]).columns
correlation_matrix = df[numerical_cols].corr()

# Create correlation heatmap
plt.rcParams.update({"font.size": 14})
plt.figure(figsize=(10, 8), dpi=400)
sns.heatmap(
    correlation_matrix,
    annot=True,  # Show correlation values
    fmt=".3f",  # Format correlation values to 3 decimal places
)
plt.title("Correlation Matrix of Numerical Features")
plt.tight_layout()
plt.show()

### Key Correlation Findings:
- Correlation between rating and usefulCount: 0.243
- Most other numerical features show weak or no correlation
- This suggests that higher-rated reviews tend to receive slightly more useful votes, although a correlation of this size is weak and says nothing about the direction of the effect
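
For reference, the coefficient reported above is Pearson's r, which `DataFrame.corr()` computes by default. The sketch below reimplements it in plain Python on made-up numbers, only to show what a value such as 0.243 measures. The `pearson_r` helper and the `ratings` and `useful_counts` lists are illustrative and not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (the 1/n factors cancel in the ratio)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only, not drawn from the drug review data
ratings = [1, 3, 5, 8, 9, 10]
useful_counts = [0, 12, 3, 20, 7, 31]
print(f"r = {pearson_r(ratings, useful_counts):.3f}")
```

A value near 0 means no linear relationship, while values near +1 or -1 mean a strong one, so 0.243 sits at the weak end of the scale.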

### Should we include usefulCount?
usefulCount is a metric similar to the rating, and it is reasonable to expect that wherever ratings are unavailable, usefulCount is unavailable too.

One such situation is predicting how positive a review is. There, you would usually have only the review text and the patient ID, while review_length can be derived from the review. It may therefore be unfair to include usefulCount as a feature, especially since we are focusing on text sentiment classification.

In [None]:
# Create histogram of ratings with percentage labels
plt.figure(figsize=(12, 6), dpi=400)


# Calculate histogram data
counts, bins, _ = plt.hist(df["rating"], bins=10, edgecolor="black")
total = len(df["rating"])

for i in range(len(counts)):
    percentage = (counts[i] / total) * 100
    plt.text(bins[i], counts[i], f"{percentage:.1f}%", va="bottom")

plt.title("Distribution of Drug Ratings")
plt.xlabel("Rating (1-10)")
plt.ylabel("Number of Reviews")
plt.tight_layout()
plt.show()

The question is how to split these ratings into positive and negative sentiment.

1. **Highly Imbalanced Distribution**:
   - Ratings 9 and 10 dominate (48.4% combined)
   - Rating 10 alone is 30.9% of all reviews
   - Ratings 3, 4, and 6 are severely underrepresented (each around 3-4%)
   - This creates a significant class imbalance problem

2. **Bimodal Distribution**:
   - There are two peaks: Rating 1 (12.9%) and Ratings 9-10 (48.4%)
   - This suggests strong polarization in reviews
   - Middle ratings (3-6) are less common
   - This supports splitting the ratings into two classes (1-6 vs 7-10), the threshold examined in the sample-review analysis below

3. **Machine Learning Implications**:

   a) **Class Imbalance Solutions Needed**:
   - Consider using class weights
   - Implement oversampling (SMOTE) for minority classes
   - Use undersampling for majority classes
   - Or combine both (SMOTEENN, SMOTETomek)

   b) **Evaluation Metrics**:
   - Accuracy alone would be misleading
   - Need to focus on:
     * F1-score
     * Precision and Recall
     * ROC-AUC
     * Confusion matrix analysis

   c) **Model Selection**:
   - Choose algorithms that handle imbalanced data well
   - Consider ensemble methods
   - Use stratification in train/test splits

4. **Neural Network Considerations**:
   - The imbalanced distribution affects deep learning models differently than traditional ML:
     * Deep learning models often need MORE data per class for effective learning
     * The severe underrepresentation of ratings 3-6 (each ~3-4%) is particularly problematic
     * The dominance of rating 10 (30.9%) could cause model bias

5. **Deep Learning Specific Solutions**:

   a) **Data Augmentation**:
   - For text data, we can use:
     * Back-translation
     * Synonym replacement
     * Text generation using LLMs
     * EDA (Easy Data Augmentation) techniques
   - These help increase samples for underrepresented ratings

   b) **Loss Functions**:
   - Use specialized loss functions:
     * Weighted Cross-Entropy Loss
     * Focal Loss (reduces impact of easy, common samples)
     * Class-Balanced Loss
   - These help handle class imbalance during training

   c) **Architecture Choices**:
   - Consider:
     * Pre-trained language models (BERT, RoBERTa)
     * Multi-task learning approaches
     * Attention mechanisms to focus on important parts of reviews
   - These help leverage the bimodal nature of the distribution

6. **Training Strategies**:
   - Implement:
     * Gradient accumulation
     * Progressive resizing
     * Curriculum learning (start with balanced subsets)
   - Use dynamic batch sampling:
     * Over-sample minority classes within batches
     * Ensure each batch sees all rating classes

7. **Validation Considerations**:
   - Use:
     * Stratified k-fold cross-validation
     * Balanced validation sets
     * Multiple evaluation metrics
   - Monitor for:
     * Overfitting on majority classes
     * Underfitting on minority classes
     * Class-wise performance metrics
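
As a rough illustration of the loss functions listed above, the sketch below computes cross-entropy, a class-weighted variant, and focal loss for a single prediction in plain Python. The function names and the `gamma` and `alpha` defaults are chosen for illustration and are not settings used in this notebook.

```python
import math


def cross_entropy(p_true):
    """Cross-entropy for one sample, given the probability assigned to its true class."""
    return -math.log(p_true)


def weighted_cross_entropy(p_true, class_weight):
    """Cross-entropy scaled by a per-class weight (larger for minority classes)."""
    return class_weight * cross_entropy(p_true)


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss: (1 - p)^gamma shrinks the loss of confidently correct samples."""
    return alpha * (1.0 - p_true) ** gamma * cross_entropy(p_true)


# Compare the two losses for an easy, a moderate, and a hard sample
for p in (0.95, 0.6, 0.1):
    print(f"p_true={p:.2f}  CE={cross_entropy(p):.4f}  focal={focal_loss(p):.4f}")
```

With `gamma=0` and `alpha=1`, focal loss reduces to plain cross-entropy. Larger `gamma` values shift more of the training signal onto hard samples, which here would mostly come from the underrepresented ratings.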

In [None]:
def wrap_text(text, width=80, indent=4):
    """
    Custom function to wrap text with indentation
    Args:
        text (str): The text to wrap
        width (int): Maximum width of each line
        indent (int): Number of spaces for indentation
    Returns:
        str: Wrapped and indented text
    """
    # Split text into words
    words = text.split()
    # Initialize variables
    lines = []
    current_line = " " * indent  # Start with indentation
    current_width = indent

    for word in words:
        # Calculate width if we add this word
        if current_width + len(word) + 1 <= width:
            # Add word with a space
            if current_width > indent:  # If not the first word in line
                current_line += " "
                current_width += 1
            current_line += word
            current_width += len(word)
        else:
            # Line is full, start a new line
            lines.append(current_line)
            current_line = " " * indent + word
            current_width = indent + len(word)

    # Add the last line
    if current_line:
        lines.append(current_line)

    return "\n".join(lines)


# Sample reviews from each rating category
N_REVIEWS = 10
print("\n=== Sample Reviews by Rating ===")
for rating in sorted(df["rating"].unique()):
    print(f"\nRating {rating:.1f} - {N_REVIEWS} Sample Reviews:")
    print("-" * 80)
    sample_reviews = df[df["rating"] == rating].sample(
        n=min(N_REVIEWS, len(df[df["rating"] == rating]))
    )
    # Display in a more readable format
    for idx, row in sample_reviews.iterrows():
        print(f"Drug: {row['drugName']}")
        print(f"Condition: {row['condition']}")
        print(f"UsefulCount: {row['usefulCount']}")
        print("Review:")
        # Use our custom wrap_text function
        wrapped_review = wrap_text(row["review"], width=80, indent=4)
        print(wrapped_review)
        print("-" * 80)

Summary:

The reviews highlight a wide range of experiences, from severe negative side effects and dissatisfaction to significant relief and positive outcomes. Many reviews, especially at the lower ratings, focus on negative side effects such as nausea, weight gain, bleeding, mood changes, and digestive issues. Higher-rated reviews often acknowledge some initial side effects but emphasize the drug's effectiveness in treating the condition. Some medium-rated reviews acknowledge positive effects of the drug but do not find it effective overall.

Sentiment Threshold:

Based on the provided samples, the sentiment threshold appears to be around a rating of 7.0.

Sentiment Threshold Analysis:

Looking closely at the reviews within each rating level, and paying attention to how the language and described experiences change, here's a refined breakdown and the apparent threshold:

1.0: Almost universally extremely negative. Users describe severe, debilitating side effects, complete lack of effectiveness, and often dangerous reactions. Words like "horrible," "awful," "die," "severe pain," and descriptions of emergency room visits are common. These are clearly negative experiences.

2.0: Still overwhelmingly negative. The language is similar to the 1.0 reviews, focusing on significant side effects, lack of efficacy, and regret. There's a sense of frustration and disappointment. Some reviews mention stopping the medication due to the negative experience.

3.0: Predominantly negative, but with hints of mixed experiences. While many reviews still detail significant side effects and problems, some acknowledge potential benefits or that the drug might work for others, even if it didn't work for them. There's more ambivalence here, but the overall tone leans negative. We see phrases like "takes some getting used to," "overwhelming," "not worth it," and descriptions of weight gain, mood changes, and other undesirable effects.

4.0: A definite mix of negative and slightly more neutral experiences, but still leaning negative overall. Users often describe a trade-off: the medication might help with the condition to some extent, but the side effects are significant and disruptive. There's a sense of weighing pros and cons, and often the cons are still winning. We see mentions of both positive effects (e.g., "worked for a few years," "pain was less") and negative ones ("wasn't pleasant," "side effects such as lack of concentration," "gaining weight").

5.0: Truly mixed, and the most difficult to categorize neatly. These reviews represent a clear "tipping point." Some users report positive effects on the condition, but significant side effects often counterbalance those benefits. Other users report minimal benefits and persistent problems. There's a strong sense of individual variability and uncertainty. The language is less intensely negative than lower ratings, but still expresses concern and dissatisfaction. Key phrases: "hit or miss," "side effects were slim in the first couple of months but soon after...," "worked really good for that [one symptom]... [but had other significant negative effects]."

6.0: Similar to 5.0, a mix of positive and negative, but with a slight shift towards acknowledging benefits, even with ongoing issues. The reviews often describe a situation where the drug helps, but the side effects are still a significant factor, leading to a less-than-ideal experience. There's a sense of compromise and ongoing evaluation. We see phrases like "love/hate relationship," "better than [previous medication]," "side effects improved," and "debating whether i should stop."

7.0: This is where the sentiment generally shifts to positive, but with caveats. Users often describe a "learning curve" or initial side effects that diminished over time. The reviews tend to emphasize the drug's effectiveness in managing the condition, while still acknowledging some lingering drawbacks or individual concerns. There's more optimism and a sense of finding a workable solution. Key phrases: "helped me a lot," "pros definitely outweigh the cons," "worked great for the first 2 years, but...," "worked really well [but had side effects]."

8.0: More consistently positive. Users report good results and often express satisfaction with the medication. Side effects are either minimal, manageable, or considered worth enduring for the benefits. There's a sense of finding a good balance and a willingness to continue treatment. Phrases like "worked miracles," "feel so much better," "life saver," and "good cushion for my knees" appear.

9.0: Strongly positive, with users often describing significant improvements and a high level of satisfaction. Side effects are mentioned less frequently, and when they are, they're typically described as minor or temporary. There's a clear endorsement of the medication.

10.0: Almost universally positive, with users expressing great satisfaction and often describing the medication as life-changing or highly effective. Side effects are rarely mentioned, and if they are, they are downplayed or considered insignificant compared to the benefits.

Conclusion:

Based on this more detailed analysis, the sentiment threshold still lies in the 6.0 to 7.0 range, with sentiment becoming more clearly positive at 7.0.

Negative Sentiment: Ratings 1.0 to 6.0

Positive Sentiment: Ratings 7.0 to 10.0

The key difference is the increased nuance we see in the 4.0, 5.0, and 6.0 ratings. These are not clearly negative in the same way as the 1.0-3.0 ratings, but they represent a mixed bag of experiences where the negative aspects often outweigh or significantly detract from the positive ones. The 7.0 rating represents the point where the balance generally tips towards a positive overall experience, despite potential drawbacks.
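
The threshold above amounts to a one-line rule. A minimal sketch follows, where `rating_to_sentiment` and its `threshold` parameter are illustrative names rather than part of the notebook:

```python
def rating_to_sentiment(rating, threshold=7.0):
    """Map a 1-10 rating to a binary label: 1 (positive) if rating >= threshold, else 0."""
    return 1 if rating >= threshold else 0


# Ratings on either side of the boundary
for rating in (1.0, 6.0, 7.0, 10.0):
    label = rating_to_sentiment(rating)
    print(f"rating {rating:>4} -> {'positive' if label else 'negative'}")
```

Keeping the threshold as a parameter makes it easy to test how sensitive the class balance is to moving the boundary, for example to 6.0 or 8.0.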

In [None]:
# Create histogram with two class distributions
plt.figure(figsize=(12, 8), dpi=400)

# Define the bins for negative (1-6) and positive (7-10) classes
bins = [0, 6.5, 10]  # Using 6.5 as the boundary to properly separate 6 and 7
counts, bins, patches = plt.hist(df["rating"], bins=bins, edgecolor="black")
total = len(df["rating"])

# Color the bars differently
patches[0].set_facecolor("salmon")  # Negative class (1-6)
patches[1].set_facecolor("lightgreen")  # Positive class (7-10)

# Calculate maximum y value needed
max_count = max(counts)
y_margin = max_count * 0.15  # Add 15% margin for labels

# Set y-axis limit
plt.ylim(0, max_count + y_margin)


# Add percentage labels on top of each bar
for i in range(len(counts)):
    percentage = (counts[i] / total) * 100
    plt.text(bins[i], counts[i], f"{counts[i]:,.0f}\n({percentage:.1f}%)", va="bottom")

plt.title("Distribution of Drug Ratings by Class (Negative: 1-6, Positive: 7-10)")
plt.xlabel("Rating Classes")
plt.ylabel("Number of Reviews")
plt.grid(True, alpha=0.3)

# Customize x-axis labels
plt.xticks([3.25, 8.25], ["Negative\n(1-6)", "Positive\n(7-10)"])

plt.tight_layout()
plt.show()

# Print detailed statistics
print("\n--- Class Distribution Statistics ---")
print(f"Total reviews: {total:,}")
print(f"Negative class (1-6): {counts[0]:,.0f} reviews ({(counts[0]/total)*100:.1f}%)")
print(f"Positive class (7-10): {counts[1]:,.0f} reviews ({(counts[1]/total)*100:.1f}%)")

1. **Purpose**: This histogram shows how the drug reviews are distributed when split into two classes based on ratings:
   - Negative class: Ratings from 1-6
   - Positive class: Ratings from 7-10

2. **Visual Elements**:
   - Red (salmon) bar: Represents negative reviews (ratings 1-6)
   - Green bar: Represents positive reviews (ratings 7-10)
   - Each bar shows both the count and percentage of reviews

3. **Key Findings**:
   - Total Dataset Size: 110,811 reviews
   - Negative Reviews (1-6): 37,173 reviews (33.5%)
   - Positive Reviews (7-10): 73,638 reviews (66.5%)

4. **Interpretation**:
   - The data is imbalanced, with about twice as many positive reviews as negative ones
   - Roughly 2/3 of all reviews are positive (7-10 rating)
   - Only 1/3 of reviews are negative (1-6 rating)

5. **Implications**:
   - This imbalance suggests people are more likely to leave positive reviews
   - We might need to consider techniques to handle class imbalance (like oversampling, undersampling, or class weights)
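
One of the options mentioned above, class weights, can be derived directly from these counts. The sketch below uses the common "balanced" heuristic, where each weight is the total divided by the number of classes times the class count. The helper name `balanced_class_weights` is hypothetical, and in practice the weights would be computed on the training split only.

```python
def balanced_class_weights(counts):
    """Return weight_c = total / (n_classes * count_c) for each class."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Class counts reported in the key findings above
counts = {"negative": 37_173, "positive": 73_638}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

The minority (negative) class receives a weight above 1 and the majority class a weight below 1, so both classes contribute equally to the total loss.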

## Data pre-processing

The dataset is described as already clean, so the only pre-processing needed is to remove the row index column (Unnamed: 0). The same step has to be applied to the test data as well.

In [None]:
# Check and remove Unnamed: 0 column if it exists
if "Unnamed: 0" in df.columns:
    print("\n--- Removing 'Unnamed: 0' column ---")
    print(f"Original columns: {df.columns.tolist()}")
    df = df.drop("Unnamed: 0", axis=1)
    print(f"Updated columns: {df.columns.tolist()}\n")

## Feature Engineering
- Prepare features for a DistilBERT feature extractor

In [None]:
print("\n=== Starting Data Preprocessing ===")

# Remove Unnamed column if it exists
if "Unnamed: 0" in df.columns:
    df = df.drop("Unnamed: 0", axis=1)

# Create binary sentiment labels (0 for ratings 1-6, 1 for ratings 7-10)
print("\nCreating sentiment labels...")
df["sentiment_label"] = (df["rating"] >= 7).astype(int)

# For DistilBERT, we'll just use the raw review text
# No need for text preprocessing as the model's tokenizer will handle it

print("\n=== Basic Preprocessing Complete ===")
print(f"Total reviews: {len(df):,}")
print(f"Positive reviews (rating >= 7): {df['sentiment_label'].sum():,}")
print(f"Negative reviews (rating < 7): {len(df) - df['sentiment_label'].sum():,}")

# Convert date column to datetime
print("\n=== Preparing Data Split ===")
df["date"] = pd.to_datetime(df["date"])

# Sort by date
df = df.sort_values("date")

# Calculate split points (80% train, 10% val, 10% test)
train_end_idx = int(len(df) * 0.8)
val_end_idx = int(len(df) * 0.9)

# Split the data
train_df = df[:train_end_idx]
val_df = df[train_end_idx:val_end_idx]
test_df = df[val_end_idx:]

# Print split statistics
print("\n=== Data Split Statistics ===")
print(f"Training set: {len(train_df):,} reviews")
print(
    f"  Positive: {train_df['sentiment_label'].sum():,} ({train_df['sentiment_label'].mean()*100:.1f}%)"
)
print(
    f"  Negative: {len(train_df) - train_df['sentiment_label'].sum():,} ({(1-train_df['sentiment_label'].mean())*100:.1f}%)"
)
print(
    f"  Date range: {train_df['date'].min().strftime('%Y-%m-%d')} to {train_df['date'].max().strftime('%Y-%m-%d')}"
)

print(f"\nValidation set: {len(val_df):,} reviews")
print(
    f"  Positive: {val_df['sentiment_label'].sum():,} ({val_df['sentiment_label'].mean()*100:.1f}%)"
)
print(
    f"  Negative: {len(val_df) - val_df['sentiment_label'].sum():,} ({(1-val_df['sentiment_label'].mean())*100:.1f}%)"
)
print(
    f"  Date range: {val_df['date'].min().strftime('%Y-%m-%d')} to {val_df['date'].max().strftime('%Y-%m-%d')}"
)

print(f"\nTest set: {len(test_df):,} reviews")
print(
    f"  Positive: {test_df['sentiment_label'].sum():,} ({test_df['sentiment_label'].mean()*100:.1f}%)"
)
print(
    f"  Negative: {len(test_df) - test_df['sentiment_label'].sum():,} ({(1-test_df['sentiment_label'].mean())*100:.1f}%)"
)
print(
    f"  Date range: {test_df['date'].min().strftime('%Y-%m-%d')} to {test_df['date'].max().strftime('%Y-%m-%d')}"
)

## Data Splitting
   - Split data into training, validation, and test sets (80/10/10)
   - Use a temporal split based on the date column
   - Check the class distribution in each split, since a temporal split does not guarantee balance
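
To make the splitting idea concrete without pandas, here is a minimal sketch of a date-ordered 80/10/10 split on toy data, followed by a per-split check of the positive share. The `temporal_split` and `positive_share` helpers and the toy `rows` are assumptions made for illustration.

```python
from datetime import date


def temporal_split(rows, train_frac=0.8, val_frac=0.1):
    """Sort rows by date and slice them into train/validation/test segments."""
    ordered = sorted(rows, key=lambda row: row["date"])
    train_end = int(len(ordered) * train_frac)
    val_end = int(len(ordered) * (train_frac + val_frac))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


def positive_share(rows):
    """Fraction of rows labelled positive (label == 1)."""
    return sum(row["label"] for row in rows) / len(rows)


# Toy data: 20 reviews on consecutive days, alternating labels
rows = [{"date": date(2024, 1, day), "label": day % 2} for day in range(1, 21)]
train, val, test = temporal_split(rows)
for name, part in (("train", train), ("val", val), ("test", test)):
    print(f"{name}: {len(part)} rows, {positive_share(part):.0%} positive")
```

Because the split follows time rather than labels, the positive share can drift between segments on real data, which is why it is worth printing for each split.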

In [None]:
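# Can't fully reuse part 1 sadly.
import torch
import torch.nn as nn
from sklearn.metrics import accuracy_score, f1_score
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.utils.data import DataLoader, Dataset
from transformers import DistilBertModel, DistilBertTokenizer

# Logger used by train_model
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

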
def train_model(
    model,
    dataloaders,
    criterion,
    optimizer,
    scheduler,
    num_epochs=10,
    model_name="Model",
    class_weights=None,
    early_stopping_patience=3,
    gradient_accumulation_steps=1,
):
    """
    Train a model for sentiment classification with advanced training features.

    Args:
        model: The neural network model
        dataloaders: Dictionary containing 'train' and 'valid' dataloaders
        criterion: Loss function
        optimizer: Optimizer
        scheduler: Learning rate scheduler
        num_epochs: Number of training epochs
        model_name: Name for saving the model
        class_weights: Weights for handling class imbalance
        early_stopping_patience: Number of epochs to wait before early stopping
        gradient_accumulation_steps: Number of steps to accumulate gradients
    """
    # Initialize tracking variables
    best_val_f1 = 0.0
    patience_counter = 0
    train_metrics = {"loss": [], "acc": [], "f1": []}
    val_metrics = {"loss": [], "acc": [], "f1": []}

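    # Use the device the model already lives on (set in initialize_training)
    device = next(model.parameters()).device
    # The criterion passed in is normally an instantiated loss; only build a
    # weighted loss here if a loss class was passed instead of an instance
    if class_weights is not None and isinstance(criterion, type):
        criterion = criterion(weight=class_weights.to(device))
    criterion = criterion.to(device)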

    for epoch in range(num_epochs):
        logger.info(f"\nEpoch {epoch+1}/{num_epochs}")

        # Training phase
        model.train()
        running_loss = 0.0
        all_preds = []
        all_labels = []
        optimizer.zero_grad(set_to_none=True)

        train_pbar = tqdm(
            dataloaders["train"], desc="Training", position=1, leave=False
        )

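        for batch_idx, (inputs, labels) in enumerate(train_pbar):
            # Move data to device. Assumes each batch is (inputs, labels), where
            # inputs is a dict with "input_ids" and "attention_mask" tensors,
            # matching SentimentClassifier.forward.
            input_ids = inputs["input_ids"].to(device, non_blocking=True)
            attention_mask = inputs["attention_mask"].to(device, non_blocking=True)
            labels = labels.to(device, non_blocking=True)

            # Forward pass
            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, labels)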
            loss = loss / gradient_accumulation_steps  # Scale loss

            # Backward pass
            loss.backward()

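            # Gradient accumulation (also step on the last batch so that
            # leftover gradients are not carried into the next epoch)
            if (batch_idx + 1) % gradient_accumulation_steps == 0 or (
                batch_idx + 1
            ) == len(dataloaders["train"]):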
                # Clip gradients
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                optimizer.step()
                optimizer.zero_grad(set_to_none=True)

            # Track metrics
            running_loss += loss.item() * gradient_accumulation_steps
            preds = torch.argmax(outputs, dim=1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

            # Update progress bar
            train_pbar.set_postfix(
                {
                    "loss": f"{loss.item():.4f}",
                }
            )

        # Calculate epoch metrics
        epoch_loss = running_loss / len(dataloaders["train"])
        epoch_acc = accuracy_score(all_labels, all_preds)
        epoch_f1 = f1_score(all_labels, all_preds, average="weighted")

        train_metrics["loss"].append(epoch_loss)
        train_metrics["acc"].append(epoch_acc)
        train_metrics["f1"].append(epoch_f1)

        # Validation phase
        model.eval()
        val_loss = 0.0
        all_val_preds = []
        all_val_labels = []

        with torch.no_grad():
            val_pbar = tqdm(
                dataloaders["valid"], desc="Validation", position=1, leave=False
            )

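            for inputs, labels in val_pbar:
                # Assumes each batch is (inputs, labels), where inputs is a dict
                # with "input_ids" and "attention_mask" tensors
                input_ids = inputs["input_ids"].to(device, non_blocking=True)
                attention_mask = inputs["attention_mask"].to(device, non_blocking=True)
                labels = labels.to(device, non_blocking=True)

                outputs = model(input_ids, attention_mask)
                loss = criterion(outputs, labels)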

                val_loss += loss.item()
                preds = torch.argmax(outputs, dim=1)
                all_val_preds.extend(preds.cpu().numpy())
                all_val_labels.extend(labels.cpu().numpy())

        # Calculate validation metrics
        val_loss = val_loss / len(dataloaders["valid"])
        val_acc = accuracy_score(all_val_labels, all_val_preds)
        val_f1 = f1_score(all_val_labels, all_val_preds, average="weighted")

        val_metrics["loss"].append(val_loss)
        val_metrics["acc"].append(val_acc)
        val_metrics["f1"].append(val_f1)

        # Log metrics
        logger.info(
            f"Train Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f} F1: {epoch_f1:.4f}"
        )
        logger.info(f"Val Loss: {val_loss:.4f} Acc: {val_acc:.4f} F1: {val_f1:.4f}")

        # Learning rate scheduling
        scheduler.step(val_loss)

        # Model checkpointing based on F1 score
        if val_f1 > best_val_f1:
            best_val_f1 = val_f1
            patience_counter = 0
            logger.info("Saving best model checkpoint...")
            torch.save(
                {
                    "epoch": epoch,
                    "model_state_dict": model.state_dict(),
                    "optimizer_state_dict": optimizer.state_dict(),
                    "scheduler_state_dict": scheduler.state_dict(),
                    "best_val_f1": best_val_f1,
                },
                f"best_{model_name}.pth",
            )
        else:
            patience_counter += 1

        # Early stopping
        if patience_counter >= early_stopping_patience:
            logger.info(f"Early stopping triggered after {epoch+1} epochs")
            break

    # Plot training history
    plot_training_history(train_metrics, val_metrics, model_name)

    return train_metrics, val_metrics


def plot_training_history(train_metrics, val_metrics, model_name):
    """Plot training and validation metrics history."""
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Plot loss
    axes[0].plot(train_metrics["loss"], label="Train")
    axes[0].plot(val_metrics["loss"], label="Val")
    axes[0].set_title("Loss")
    axes[0].set_xlabel("Epoch")
    axes[0].legend()

    # Plot accuracy
    axes[1].plot(train_metrics["acc"], label="Train")
    axes[1].plot(val_metrics["acc"], label="Val")
    axes[1].set_title("Accuracy")
    axes[1].set_xlabel("Epoch")
    axes[1].legend()

    # Plot F1 score
    axes[2].plot(train_metrics["f1"], label="Train")
    axes[2].plot(val_metrics["f1"], label="Val")
    axes[2].set_title("F1 Score")
    axes[2].set_xlabel("Epoch")
    axes[2].legend()

    plt.suptitle(f"Training History - {model_name}")
    plt.tight_layout()
    plt.show()


# Stand-in dataset: DrugReviewDataset is used below but never defined in this
# notebook. This version is an assumption, not the original implementation.
import torch
from torch.utils.data import Dataset


class DrugReviewDataset(Dataset):
    """Yield (inputs, label), where inputs holds input_ids and attention_mask."""

    def __init__(self, reviews, labels, tokenizer, max_length):
        self.reviews = reviews
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, idx):
        # Tokenize one review, padded/truncated to a fixed length
        encoding = self.tokenizer(
            str(self.reviews[idx]),
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )
        inputs = {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
        }
        return inputs, torch.tensor(int(self.labels[idx]), dtype=torch.long)


# Create the sentiment classifier model
class SentimentClassifier(nn.Module):
    def __init__(self, n_classes=2, dropout_rate=0.2):
        """
        Initialize the sentiment classifier
        Args:
            n_classes (int): Number of output classes
            dropout_rate (float): Dropout rate for regularization
        """
        super(SentimentClassifier, self).__init__()

        # Load pre-trained DistilBERT
        self.distilbert = DistilBertModel.from_pretrained("distilbert-base-uncased")

        # Freeze BERT parameters (optional - comment out if you want to fine-tune everything)
        # for param in self.distilbert.parameters():
        #     param.requires_grad = False

        # Classification head
        self.classifier = nn.Sequential(
            nn.Linear(768, 384),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(384, 96),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(96, n_classes),
        )

    def forward(self, input_ids, attention_mask):
        """
        Forward pass
        Args:
            input_ids: Tokenized input sequences
            attention_mask: Attention mask for padding
        """
        # Get DistilBERT outputs
        outputs = self.distilbert(input_ids=input_ids, attention_mask=attention_mask)

        # Use the [CLS] token representation
        pooled_output = outputs.last_hidden_state[:, 0]

        # Pass through the classifier
        return self.classifier(pooled_output)


# Initialize training components
def initialize_training(train_df, val_df, batch_size=16, max_length=512):
    """
    Initialize all components needed for training
    Args:
        train_df (DataFrame): Training dataframe
        val_df (DataFrame): Validation dataframe
        batch_size (int): Batch size for training
        max_length (int): Maximum sequence length
    """
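    # Set device (prefer CUDA, then Apple MPS, then CPU)
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")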
    print(f"Using device: {device}")

    # Initialize tokenizer
    tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

    # Create datasets
    train_dataset = DrugReviewDataset(
        train_df["review"].values,
        train_df["sentiment_label"].values,
        tokenizer,
        max_length,
    )

    val_dataset = DrugReviewDataset(
        val_df["review"].values, val_df["sentiment_label"].values, tokenizer, max_length
    )

    # Create dataloaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=2,
        pin_memory=True,
    )

    val_loader = DataLoader(
        val_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=2,
        pin_memory=True,
    )

    # Calculate class weights for imbalanced dataset
    total_samples = len(train_df)
    neg_samples = (train_df["sentiment_label"] == 0).sum()
    pos_samples = (train_df["sentiment_label"] == 1).sum()

    class_weights = torch.tensor(
        [total_samples / (2 * neg_samples), total_samples / (2 * pos_samples)],
        dtype=torch.float32,  # Match the logits' dtype; MPS does not support float64
    ).to(device)

    # Initialize model
    model = SentimentClassifier().to(device)

    # Initialize optimizer with weight decay
    # (torch.optim.AdamW has no correct_bias argument, so it is omitted)
    optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

    # Initialize scheduler (verbose is deprecated in recent PyTorch releases)
    scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=2)

    # Initialize loss function with class weights
    criterion = nn.CrossEntropyLoss(weight=class_weights)

    return {
        "device": device,
        "model": model,
        "train_loader": train_loader,
        "val_loader": val_loader,
        "optimizer": optimizer,
        "scheduler": scheduler,
        "criterion": criterion,
        "class_weights": class_weights,
    }


training_components = initialize_training(train_df, val_df)

# Create dataloaders dictionary for the train_model function
dataloaders = {
    "train": training_components["train_loader"],
    "valid": training_components["val_loader"],
}

# Train the model
train_metrics, val_metrics = train_model(
    model=training_components["model"],
    dataloaders=dataloaders,
    criterion=training_components["criterion"],
    optimizer=training_components["optimizer"],
    scheduler=training_components["scheduler"],
    num_epochs=10,
    model_name="DrugReviewSentiment",
    class_weights=training_components["class_weights"],
    early_stopping_patience=3,
    gradient_accumulation_steps=4,
)

## Model Development
### Direct Training approach
- Define target variable (likely rating)
- Build and train models
- Evaluate performance

### Indirect Training approach
- Define indirect signals/proxies
- Build and train models
- Compare with direct training results

## Model Comparison
   - Compare direct vs indirect training approaches
   - Analyze pros and cons of each method
   - Discuss real-world applicability

## Results and Discussion
   - Present key findings
   - Discuss limitations
   - Suggest improvements