# Interactive Yelp Rating Prediction Pipeline

## Overview

This notebook provides an interactive, educational experience for understanding and running the complete Yelp star rating prediction pipeline. We'll walk through each stage of the machine learning process, from data loading to model inference, with hands-on visualizations and parameter tuning.

### Learning Objectives
- Understand the Yelp Academic Dataset structure
- Learn data preprocessing and feature engineering techniques
- Explore sentiment analysis using transformer models
- Perform feature selection and model training
- Evaluate model performance and make predictions

### Pipeline Stages
1. **Data Loading & Preprocessing**: Load and clean raw Yelp data
2. **Feature Engineering**: Create derived features from raw data
3. **Sentiment Analysis**: Extract sentiment scores from review text
4. **Feature Selection**: Identify optimal feature subset
5. **Model Training**: Train neural network for rating prediction
6. **Inference**: Make predictions on new data
7. **Analysis**: Deep dive into results and insights

### Dataset
We'll use the Yelp Academic Dataset, which includes:
- **Business data**: Business information (restaurants and other local businesses) and ratings
- **Review data**: User reviews with text and ratings
- **User data**: Reviewer profiles and history

The goal is to predict the star rating (1-5) a user would give to a business based on user and business characteristics, plus review sentiment.

## Section 1: Introduction and Setup

### Learning Objectives
- Set up the Python environment
- Verify GPU availability
- Understand configuration parameters
- Load required libraries and modules

### Environment Requirements
- Python 3.8+
- PyTorch with MPS/CUDA support
- Transformers library
- Jupyter widgets for interactivity

Let's start by setting up our environment and verifying everything is working correctly.
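
Reproducible runs depend on seeding every random number generator the pipeline touches. The setup cells call `utils.set_seed`, whose implementation is not shown in this notebook. The sketch below is one plausible shape for such a helper, under the hypothetical name `seed_everything`; the project's actual function may differ.

```python
import random

import numpy as np


def seed_everything(seed: int) -> None:
    """Hypothetical helper: seed Python, NumPy and (if installed) PyTorch."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch  # optional: only seeded when PyTorch is available
    except ImportError:
        return
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


# Re-seeding with the same value reproduces the same random draws.
seed_everything(42)
first = (random.random(), float(np.random.rand()))
seed_everything(42)
second = (random.random(), float(np.random.rand()))
print(first == second)
```

Seeding alone does not guarantee bit-identical results on a GPU, since some kernels are non-deterministic, but it removes the largest source of run-to-run variation.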

In [None]:
# Environment setup and imports
import sys
import os
import logging
import warnings
warnings.filterwarnings('ignore')

# Add src to path for local imports
sys.path.append('src')

# Import standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import json
import pickle

# Import machine learning libraries
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Import pipeline modules
from src.preprocessing import preprocess_pipeline
from src.features import feature_engineering_pipeline
from src.sentiment import sentiment_analysis_pipeline
from src.feature_selection import feature_selection_pipeline
from src.train import training_pipeline
from src import config
import src.utils as utils

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

print("✓ Libraries imported successfully")

In [None]:
# GPU detection and device setup
device_info = utils.verify_gpu_support()
print(f"GPU Support: {device_info}")

# Set random seed for reproducibility
utils.set_seed(config.SEED)
print(f"✓ Random seed set to: {config.SEED}")

# Display configuration
print("\n=== CONFIGURATION ===")
print(f"Data Directory: {config.DATA_DIR}")
print(f"Learning Rate: {config.LEARNING_RATE}")
print(f"Batch Size: {config.BATCH_SIZE}")
print(f"Max Epochs: {config.MAX_EPOCHS}")
print(f"Sentiment Model: {config.MODEL_NAME}")
print(f"Candidate Features: {config.CANDIDATE_FEATURES}")
print("=====================")

### Interactive: Environment Verification

Let's verify that all our data files exist and check their sizes.

In [None]:
# Verify data files exist
print("Checking data file availability:")
for name, path in config.INPUT_FILES.items():
    exists = os.path.exists(path)
    size_mb = os.path.getsize(path) / (1024**2) if exists else 0
    status = "✓" if exists else "✗"
    print(f"{status} {name.capitalize()} data: {path} ({size_mb:.1f} MB)")

# Check output directories
os.makedirs(config.OUTPUT_DIR, exist_ok=True)
os.makedirs('models', exist_ok=True)
os.makedirs('outputs', exist_ok=True)
print("\n✓ Output directories created")

## Section 2: Data Loading and Preprocessing

### Learning Objectives
- Understand the structure of the Yelp dataset
- Learn data loading and merging techniques
- Handle missing values and data cleaning
- Visualize data distributions and relationships

### What We'll Do
1. Load the three main datasets (business, review, user)
2. Rename columns to avoid conflicts
3. Convert date columns to datetime format
4. Merge datasets using inner joins
5. Clean data by removing rows with missing critical values
6. Explore the merged dataset

This preprocessing step transforms raw CSV files into a clean, merged dataset ready for feature engineering.
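
The steps above can be sketched on toy data with pandas. This is an illustration, not the body of `preprocess_pipeline`: the raw column names (`stars`, `average_stars`, `review_count`) and the choice of critical columns are assumptions, while the renamed columns match those used later in this notebook.

```python
import pandas as pd

# Tiny stand-ins for the three tables; raw column names are assumed.
business = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "stars": [4.0, 3.5],
    "review_count": [120, 45],
})
user = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "average_stars": [3.8, 4.6],
    "review_count": [10, 250],
})
review = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "user_id": ["u1", "u2", "u3"],  # u3 has no matching user record
    "business_id": ["b1", "b2", "b1"],
    "stars": [5, 3, 4],
    "date": ["2019-05-01", "2020-01-15", None],
})

# Step 2: rename columns that would otherwise clash after merging.
business = business.rename(columns={
    "stars": "business_average_stars",
    "review_count": "business_review_count",
})
user = user.rename(columns={
    "average_stars": "user_average_stars",
    "review_count": "user_review_count",
})

# Step 3: parse dates.
review["date"] = pd.to_datetime(review["date"])

# Step 4: inner joins keep only reviews with a known user and business.
merged = (
    review
    .merge(user, on="user_id", how="inner")
    .merge(business, on="business_id", how="inner")
)

# Step 5: drop rows missing critical values (assumed here: target and date).
merged = merged.dropna(subset=["stars", "date"])

print(merged.shape)
```

Review `r3` disappears at the first join because `u3` has no user record. An inner join silently drops such rows, so comparing row counts before and after each merge is a cheap sanity check.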

In [None]:
# Run the preprocessing pipeline
print("Starting data preprocessing pipeline...")
print("This may take a few minutes depending on your system.")

try:
    merged_df = preprocess_pipeline()
    print("\n✓ Preprocessing completed successfully!")
    print(f"Final dataset shape: {merged_df.shape}")
    print(f"Memory usage: {merged_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
except Exception as e:
    print(f"✗ Error during preprocessing: {e}")
    print("Please check your data files and try again.")

In [None]:
# Display dataset overview
print("Dataset Overview:")
print("=" * 50)
merged_df.info()  # info() prints its report directly and returns None
print("\nFirst 5 rows:")
display(merged_df.head())

# Basic statistics
print("\nBasic Statistics:")
print(merged_df.describe())

In [None]:
# Visualize data distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Data Distributions After Preprocessing', fontsize=16)

# Stars distribution
merged_df['stars'].value_counts().sort_index().plot(kind='bar', ax=axes[0,0])
axes[0,0].set_title('Star Ratings Distribution')
axes[0,0].set_xlabel('Stars')
axes[0,0].set_ylabel('Count')

# User average stars
merged_df['user_average_stars'].hist(bins=20, ax=axes[0,1])
axes[0,1].set_title('User Average Stars')
axes[0,1].set_xlabel('Average Stars')
axes[0,1].set_ylabel('Frequency')

# Business average stars
merged_df['business_average_stars'].hist(bins=20, ax=axes[0,2])
axes[0,2].set_title('Business Average Stars')
axes[0,2].set_xlabel('Average Stars')
axes[0,2].set_ylabel('Frequency')

# User review count
merged_df['user_review_count'].hist(bins=50, ax=axes[1,0], range=(0, merged_df['user_review_count'].quantile(0.95)))
axes[1,0].set_title('User Review Count (up to 95th percentile)')
axes[1,0].set_xlabel('Review Count')
axes[1,0].set_ylabel('Frequency')

# Business review count
merged_df['business_review_count'].hist(bins=50, ax=axes[1,1], range=(0, merged_df['business_review_count'].quantile(0.95)))
axes[1,1].set_title('Business Review Count (up to 95th percentile)')
axes[1,1].set_xlabel('Review Count')
axes[1,1].set_ylabel('Frequency')

# Review year distribution
merged_df['date'].dt.year.value_counts().sort_index().plot(kind='bar', ax=axes[1,2])
axes[1,2].set_title('Reviews by Year')
axes[1,2].set_xlabel('Year')
axes[1,2].set_ylabel('Count')

plt.tight_layout()
plt.show()

In [None]:
# Check for missing values
missing_data = merged_df.isnull().sum()
missing_percent = (missing_data / len(merged_df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percentage': missing_percent
}).sort_values('Missing Count', ascending=False)

print("Missing Values Analysis:")
print("=" * 40)
display(missing_df[missing_df['Missing Count'] > 0])

# Visualize missing values
if missing_data.sum() > 0:
    plt.figure(figsize=(12, 6))
    missing_data[missing_data > 0].sort_values(ascending=True).plot(kind='barh')
    plt.title('Missing Values by Column')
    plt.xlabel('Number of Missing Values')
    plt.show()
else:
    print("✓ No missing values found in the dataset!")

## Section 3: Feature Engineering

### Learning Objectives
- Understand feature engineering concepts
- Create time-based features
- Engineer elite status features
- Handle missing values appropriately
- Visualize feature distributions and correlations

### What We'll Do
1. **Time Features**: Calculate `time_yelping` (weeks since user joined)
2. **Elite Features**: Count total elite statuses and check current elite status
3. **Missing Value Handling**: Impute missing values with appropriate strategies
4. **Feature Analysis**: Explore correlations and distributions

Feature engineering transforms raw data into meaningful predictors that capture the underlying patterns in user behavior and business characteristics.
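
As a rough illustration of the time and elite features, the sketch below derives them on two toy rows. Several details are assumptions rather than facts about `feature_engineering_pipeline`: the `yelping_since` and `elite` column names, the comma-separated year format of `elite`, measuring `time_yelping` up to the review date, and treating "current" elite status as elite in the year of the review.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2019-05-01", "2020-01-15"]),           # review date
    "yelping_since": pd.to_datetime(["2019-04-17", "2015-01-15"]),  # assumed column
    "elite": ["", "2017,2018,2020"],                                # assumed format
})

# Time feature: weeks between joining Yelp and writing the review.
df["time_yelping"] = (df["date"] - df["yelping_since"]).dt.days / 7

# Elite features: parse the year list, count it, and flag the review year.
elite_years = df["elite"].fillna("").apply(
    lambda s: [int(year) for year in s.split(",") if year.strip()]
)
df["total_elite_statuses"] = elite_years.apply(len)
df["elite_status"] = [
    int(review_year in years)
    for review_year, years in zip(df["date"].dt.year, elite_years)
]

print(df[["time_yelping", "total_elite_statuses", "elite_status"]])
```

Filling a missing `elite` value with an empty string is itself an imputation choice: it treats "unknown" as "never elite", which is worth stating explicitly when interpreting the feature.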

In [None]:
# Run feature engineering pipeline
print("Starting feature engineering pipeline...")

try:
    featured_df = feature_engineering_pipeline(merged_df)
    print("\n✓ Feature engineering completed successfully!")
    
    # Show new features added
    new_features = [col for col in featured_df.columns if col not in merged_df.columns]
    print(f"\nAdded {len(new_features)} new features:")
    for i, feature in enumerate(new_features, 1):
        print(f"{i}. {feature}")
        
except Exception as e:
    print(f"✗ Error during feature engineering: {e}")
    print("Please check the previous steps and try again.")

In [None]:
# Display feature engineering results
print("Feature Engineering Results:")
print("=" * 50)
print(f"Original features: {len(merged_df.columns)}")
print(f"Engineered features: {len(featured_df.columns)}")
print(f"New features added: {len(featured_df.columns) - len(merged_df.columns)}")

# Show statistics for new features
print("\nNew Feature Statistics:")
display(featured_df[new_features].describe())

# Show sample of engineered data
print("\nSample of Engineered Data:")
display(featured_df[['stars', 'user_average_stars', 'business_average_stars', 'time_yelping', 'elite_status', 'total_elite_statuses']].head())

In [None]:
# Visualize engineered features
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Engineered Feature Distributions', fontsize=16)

# Time yelping distribution
featured_df['time_yelping'].hist(bins=50, ax=axes[0,0])
axes[0,0].set_title('Time Yelping (weeks)')
axes[0,0].set_xlabel('Weeks')
axes[0,0].set_ylabel('Frequency')

# Elite status distribution
featured_df['elite_status'].value_counts().sort_index().plot(kind='bar', ax=axes[0,1])
axes[0,1].set_title('Elite Status Distribution')
axes[0,1].set_xlabel('Elite Status (0=No, 1=Yes)')
axes[0,1].set_ylabel('Count')

# Total elite statuses
featured_df['total_elite_statuses'].value_counts().sort_index().plot(kind='bar', ax=axes[0,2])
axes[0,2].set_title('Total Elite Statuses')
axes[0,2].set_xlabel('Number of Elite Years')
axes[0,2].set_ylabel('Count')

# Date year distribution
featured_df['date_year'].value_counts().sort_index().plot(kind='bar', ax=axes[1,0])
axes[1,0].set_title('Reviews by Year')
axes[1,0].set_xlabel('Year')
axes[1,0].set_ylabel('Count')

# Correlation heatmap for key features
corr_features = ['stars', 'user_average_stars', 'business_average_stars', 'time_yelping', 'elite_status', 'total_elite_statuses']
corr_matrix = featured_df[corr_features].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', ax=axes[1,1])
axes[1,1].set_title('Feature Correlations')

# Scatter plot: time_yelping vs stars
axes[1,2].scatter(featured_df['time_yelping'], featured_df['stars'], alpha=0.1)
axes[1,2].set_title('Time Yelping vs Star Rating')
axes[1,2].set_xlabel('Time Yelping (weeks)')
axes[1,2].set_ylabel('Stars')

plt.tight_layout()
plt.show()

### Interactive: Feature Importance Preview

Let's explore which features might be most predictive of the star rating using a simple correlation analysis.
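
For reference, the Pearson correlation that pandas computes by default can be written out by hand. The sketch below uses made-up numbers and a hypothetical helper name, `pearson_r`, purely to show the arithmetic.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: a user's historical average vs. the rating they gave.
user_avg = [2.0, 3.0, 4.0, 5.0]
stars = [1.0, 3.0, 3.0, 5.0]
r = pearson_r(user_avg, stars)
print(round(r, 4))
```

Correlation measures only linear association with the target, one feature at a time. A feature with a low correlation can still be useful to a non-linear model, so this ranking is a preview rather than a measure of importance.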

In [None]:
# Feature correlation with target
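# numeric_only=True skips text and date columns, which raise an error in pandas >= 2.0
target_correlations = featured_df.corr(numeric_only=True)['stars'].abs().sort_values(ascending=False)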

print("Feature Correlations with Target (Stars):")
print("=" * 50)
for feature, corr in target_correlations.items():
    if feature != 'stars':
        print(f"{feature:25s}: {corr:.4f}")

# Visualize top correlations
plt.figure(figsize=(12, 8))
top_features = target_correlations.head(15).index[1:]  # Exclude 'stars' itself
top_corrs = target_correlations.head(15).values[1:]

bars = plt.barh(range(len(top_features)), top_corrs)
plt.yticks(range(len(top_features)), top_features)
plt.xlabel('Absolute Correlation with Stars')
plt.title('Top Feature Correlations with Target')
plt.grid(axis='x', alpha=0.3)

# Color bars by correlation strength
for bar, corr in zip(bars, top_corrs):
    if corr > 0.3:
        bar.set_color('darkred')
    elif corr > 0.2:
        bar.set_color('red')
    elif corr > 0.1:
        bar.set_color('orange')
    else:
        bar.set_color('lightblue')

plt.show()

## Section 4: Sentiment Analysis

### Learning Objectives
- Understand sentiment analysis with transformers
- Learn text preprocessing techniques
- Explore sentiment score distributions
- Analyze sentiment vs rating relationships

### What We'll Do
1. **Text Preprocessing**: Smart truncation to handle long reviews
2. **Model Loading**: Initialize DistilBERT sentiment classifier
3. **Batch Processing**: Process reviews in batches with progress tracking
4. **Score Normalization**: Convert to [-1, 1] scale
5. **Analysis**: Explore sentiment patterns and correlations

Sentiment analysis extracts emotional tone from review text, providing a quantitative measure of user satisfaction beyond star ratings.
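
Two of these steps, truncation and score normalization, can be sketched without loading a model. The notebook does not define "smart truncation", so keeping the head and tail of a long review is an assumption here, as is the classifier output format (a `POSITIVE`/`NEGATIVE` label with a confidence score). Both helper names are hypothetical.

```python
def truncate_head_tail(text: str, max_words: int = 200, head_fraction: float = 0.5) -> str:
    """Keep the start and end of a long review and drop the middle."""
    words = text.split()
    if len(words) <= max_words:
        return text
    head = int(max_words * head_fraction)
    tail = max_words - head
    kept = words[:head] + (words[-tail:] if tail else [])
    return " ".join(kept)


def normalize_sentiment(label: str, score: float) -> float:
    """Map (label, confidence) to a signed score in [-1, 1]."""
    return score if label.upper() == "POSITIVE" else -score


long_review = " ".join(str(i) for i in range(10))
print(truncate_head_tail(long_review, max_words=4))
print(normalize_sentiment("NEGATIVE", 0.9))
```

Counting words only approximates the model's real limit, which is measured in tokens. A stricter version would truncate with the model's own tokenizer.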

In [None]:
# Run sentiment analysis pipeline
print("Starting sentiment analysis pipeline...")
print("This will take considerable time (10-30 minutes) depending on your hardware.")
print("The progress bar will show the processing status.")

try:
    sentiment_df = sentiment_analysis_pipeline(featured_df)
    print("\n✓ Sentiment analysis completed successfully!")
    
    # Show sentiment columns added
    sentiment_cols = ['sentiment_label', 'sentiment_score_raw', 'normalized_sentiment_score']
    print(f"\nAdded sentiment columns: {sentiment_cols}")
    
except Exception as e:
    print(f"✗ Error during sentiment analysis: {e}")
    print("This step requires significant computational resources.")
    print("Consider using a machine with GPU support for faster processing.")

In [None]:
# Display sentiment analysis results
print("Sentiment Analysis Results:")
print("=" * 50)
display(sentiment_df[sentiment_cols].describe())

# Show sample with sentiment
print("\nSample Reviews with Sentiment:")
sample_cols = ['text', 'stars', 'sentiment_label', 'normalized_sentiment_score']
display(sentiment_df[sample_cols].head())

# Sentiment label distribution
print("\nSentiment Label Distribution:")
print(sentiment_df['sentiment_label'].value_counts())

In [None]:
# Visualize sentiment distributions
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Sentiment Analysis Visualizations', fontsize=16)

# Sentiment score distribution
sentiment_df['normalized_sentiment_score'].hist(bins=50, ax=axes[0,0], alpha=0.7)
axes[0,0].set_title('Normalized Sentiment Score Distribution')
axes[0,0].set_xlabel('Sentiment Score')
axes[0,0].set_ylabel('Frequency')
axes[0,0].axvline(0, color='red', linestyle='--', alpha=0.7, label='Neutral')
axes[0,0].legend()

# Sentiment by star rating
sentiment_by_stars = sentiment_df.groupby('stars')['normalized_sentiment_score'].mean()
sentiment_by_stars.plot(kind='bar', ax=axes[0,1])
axes[0,1].set_title('Average Sentiment by Star Rating')
axes[0,1].set_xlabel('Stars')
axes[0,1].set_ylabel('Average Sentiment Score')

# Sentiment vs stars scatter
axes[0,2].scatter(sentiment_df['stars'], sentiment_df['normalized_sentiment_score'], alpha=0.1)
axes[0,2].set_title('Sentiment Score vs Star Rating')
axes[0,2].set_xlabel('Stars')
axes[0,2].set_ylabel('Sentiment Score')

# Text length vs sentiment
text_lengths = sentiment_df['text'].str.len()
axes[1,0].scatter(text_lengths, sentiment_df['normalized_sentiment_score'], alpha=0.1)
axes[1,0].set_title('Text Length vs Sentiment Score')
axes[1,0].set_xlabel('Text Length')
axes[1,0].set_ylabel('Sentiment Score')

# Sentiment label distribution
sentiment_df['sentiment_label'].value_counts().plot(kind='pie', ax=axes[1,1], autopct='%1.1f%%')
axes[1,1].set_title('Sentiment Label Distribution')

# Correlation between sentiment and stars
corr_sentiment_stars = sentiment_df[['stars', 'normalized_sentiment_score']].corr()
sns.heatmap(corr_sentiment_stars, annot=True, cmap='coolwarm', ax=axes[1,2])
axes[1,2].set_title('Correlation: Stars vs Sentiment')

plt.tight_layout()
plt.show()

## Section 5: Feature Selection

### Learning Objectives
- Understand feature selection techniques
- Learn about best subset selection
- Evaluate feature importance
- Compare model performance with different feature sets

### What We'll Do
1. **Data Preparation**: Select candidate features and target
2. **Best Subset Selection**: Exhaustive search for optimal feature combinations
3. **Model Evaluation**: Compare performance across different feature sets
4. **Final Selection**: Choose the best performing feature subset

Feature selection identifies the most predictive variables, reducing dimensionality and improving model interpretability.
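
Best subset selection can be sketched as a loop over every non-empty feature combination, scoring each on held-out data. The version below uses ordinary least squares and synthetic data; the scoring model and split used by `feature_selection_pipeline` are not shown in this notebook, so treat both as assumptions. `best_subset` is a hypothetical name.

```python
from itertools import combinations

import numpy as np


def best_subset(X, y, names, train_frac=0.8):
    """Try every non-empty feature subset; return the lowest validation MSE."""
    n_train = int(len(y) * train_frac)
    X_tr, X_va = X[:n_train], X[n_train:]
    y_tr, y_va = y[:n_train], y[n_train:]
    best_names, best_mse = None, float("inf")
    for k in range(1, len(names) + 1):
        for idx in combinations(range(len(names)), k):
            cols = list(idx)
            # Fit a linear model with an intercept on the training rows.
            A_tr = np.column_stack([np.ones(n_train), X_tr[:, cols]])
            coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
            # Score it on the held-out rows.
            A_va = np.column_stack([np.ones(len(y_va)), X_va[:, cols]])
            mse = float(np.mean((A_va @ coef - y_va) ** 2))
            if mse < best_mse:
                best_names, best_mse = [names[i] for i in cols], mse
    return best_names, best_mse


# Synthetic data: the target depends on the first and third features only.
rng = np.random.default_rng(0)
names = ["user_average_stars", "time_yelping", "normalized_sentiment_score"]
X = rng.normal(size=(200, 3))
y = 1.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

selected, mse = best_subset(X, y, names)
print(selected, round(mse, 4))
```

With p candidate features there are 2^p - 1 subsets to fit, so the search is practical only for a small candidate list.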

In [None]:
# Run feature selection pipeline
print("Starting feature selection pipeline...")
print("This involves exhaustive search over feature combinations.")
print("Processing time depends on the number of candidate features.")

try:
    final_df, optimal_features = feature_selection_pipeline(sentiment_df)
    print("\n✓ Feature selection completed successfully!")
    
    print(f"\nSelected {len(optimal_features)} optimal features:")
    for i, feature in enumerate(optimal_features, 1):
        print(f"{i}. {feature}")
    
    print(f"\nFinal dataset shape: {final_df.shape}")
    
except Exception as e:
    print(f"✗ Error during feature selection: {e}")
    print("Please check the previous steps and try again.")

In [None]:
# Display final dataset
print("Final Dataset Overview:")
print("=" * 40)
print(f"Shape: {final_df.shape}")
print(f"Features: {list(final_df.columns)}")

display(final_df.head())

# Statistics for final features
print("\nFinal Feature Statistics:")
display(final_df.describe())

## Section 6: Model Training and Evaluation

### Learning Objectives
- Understand neural network training
- Learn about stratified sampling and cross-validation
- Evaluate regression model performance
- Interpret training metrics and learning curves

### What We'll Do
1. **Data Preparation**: Stratify and split data
2. **Model Architecture**: Initialize PyTorch neural network
3. **Training**: Train with early stopping and validation
4. **Evaluation**: Assess performance on test set
5. **Visualization**: Plot training progress and predictions

We'll train a neural network to predict star ratings from our engineered features.
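
Early stopping halts training once the validation loss stops improving, which limits overfitting. Below is a minimal, framework-free sketch of the bookkeeping involved; the `EarlyStopping` class, its `patience` default and the loss values are illustrative assumptions, not the logic inside `training_pipeline`.

```python
class EarlyStopping:
    """Track validation loss and signal when it has stopped improving."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch: int, val_loss: float) -> bool:
        """Record a validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improvement stalls after epoch 2.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch, stopper.best_loss)
```

In a real training loop the model weights from `best_epoch` would be saved and restored. Note that the final loss of 0.60 is never seen: a larger `patience` trades longer training for the chance of catching late improvements.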

In [None]:
# Run training pipeline
print("Starting model training pipeline...")
print("This will train a neural network for star rating prediction.")

try:
    training_results = training_pipeline()
    print("\n✓ Model training completed successfully!")
    
    # Display results
    metrics = training_results['metrics']
    print("\nTraining Results:")
    print(f"MSE: {metrics['mse']:.4f}")
    print(f"MAE: {metrics['mae']:.4f}")
    print(f"R²: {metrics['r2']:.4f}")
    
    print(f"\nModel saved to: {training_results['model_path']}")
    print(f"Scaler saved to: {training_results['scaler_path']}")
    
except Exception as e:
    print(f"✗ Error during training: {e}")
    print("Please check the previous steps and try again.")

## Section 7: Inference and Predictions

### Learning Objectives
- Learn model loading and inference
- Understand prediction preprocessing
- Create custom prediction examples
- Interpret model outputs

### What We'll Do
1. **Model Loading**: Load trained model and scaler
2. **Example Creation**: Build prediction examples
3. **Inference**: Make predictions on new data
4. **Interpretation**: Understand prediction results

Now let's use our trained model to make predictions on new examples.
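
Prediction preprocessing has to reproduce the training-time layout exactly: the same features, in the same order, scaled with the statistics learned during training. The sketch below shows this with plain Python and min-max scaling. The helper names, the feature order and the min/max values are made up for illustration; in the notebook the fitted scaler object plays this role.

```python
def to_feature_vector(example: dict, feature_order: list) -> list:
    """Arrange a feature dict into the column order used at training time."""
    missing = [name for name in feature_order if name not in example]
    if missing:
        raise KeyError(f"Missing features: {missing}")
    return [float(example[name]) for name in feature_order]


def min_max_scale(vector: list, mins: list, maxs: list) -> list:
    """Scale each value to [0, 1] using training-set minima and maxima."""
    return [(v - lo) / (hi - lo) for v, lo, hi in zip(vector, mins, maxs)]


# Assumed training-time order and ranges (illustrative values only).
feature_order = ["normalized_sentiment_score", "user_average_stars"]
train_mins = [-1.0, 1.0]
train_maxs = [1.0, 5.0]

# The dict is written in a different order and carries an unused key.
example = {"user_average_stars": 4.0, "normalized_sentiment_score": 0.5, "elite_status": 1}

vector = to_feature_vector(example, feature_order)
scaled = min_max_scale(vector, train_mins, train_maxs)
print(vector, scaled)
```

Failing loudly on a missing feature is deliberate: a silently reordered or shortened input would still produce a number, just not a meaningful one.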

In [None]:
# Load model and scaler for inference
import torch
import pickle
from src.model import YelpRatingPredictor

try:
    # Load model
    model = YelpRatingPredictor(input_size=len(optimal_features))
    model.load_state_dict(torch.load(training_results['model_path'], map_location='cpu'))
    model.eval()
    
    # Load scaler
    with open(training_results['scaler_path'], 'rb') as f:
        scaler = pickle.load(f)
    
    print("✓ Model and scaler loaded successfully")
    
except Exception as e:
    print(f"✗ Error loading model: {e}")
    print("Please ensure training completed successfully.")

In [None]:
# Create prediction examples
examples = [
    {
        'user_average_stars': 4.5,
        'business_average_stars': 4.2,
        'time_yelping': 100.0,
        'elite_status': 1,
        'normalized_sentiment_score': 0.8
    },
    {
        'user_average_stars': 3.0,
        'business_average_stars': 3.5,
        'time_yelping': 25.0,
        'elite_status': 0,
        'normalized_sentiment_score': -0.3
    },
    {
        'user_average_stars': 4.0,
        'business_average_stars': 4.8,
        'time_yelping': 200.0,
        'elite_status': 1,
        'normalized_sentiment_score': 0.9
    }
]

# Make predictions
predictions = []
for i, example in enumerate(examples, 1):
    # Select features in training order; a missing key raises KeyError instead of silently misaligning columns
    filtered_input = {name: example[name] for name in optimal_features}
    input_df = pd.DataFrame([filtered_input], columns=optimal_features)
    
    # Scale input
    scaled_input = scaler.transform(input_df.values)
    input_tensor = torch.FloatTensor(scaled_input)
    
    # Predict
    with torch.no_grad():
        prediction = model(input_tensor).item()
    
    predictions.append(prediction)
    print(f"Example {i}: Predicted rating = {prediction:.2f}")
    print(f"  Input: {filtered_input}")
    print()

## Section 8: Analysis and Insights

### Learning Objectives
- Analyze model performance in depth
- Understand feature contributions
- Identify model limitations
- Explore what-if scenarios

### What We'll Do
1. **Performance Analysis**: Deep dive into metrics
2. **Error Analysis**: Understand prediction errors
3. **Feature Importance**: Analyze which features matter most
4. **Limitations**: Discuss model assumptions and constraints
5. **Future Work**: Suggest improvements

Let's analyze our model's behavior and performance characteristics.
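
One simple form of error analysis is to break the mean absolute error down by true rating, which shows whether the model struggles more at the extremes than in the middle of the scale. The sketch below uses made-up predictions and a hypothetical helper, `mae_by_rating`; the same function could be applied to held-out predictions if those are available.

```python
from collections import defaultdict


def mae_by_rating(y_true, y_pred):
    """Mean absolute error grouped by the true star rating."""
    errors = defaultdict(list)
    for truth, pred in zip(y_true, y_pred):
        errors[truth].append(abs(truth - pred))
    return {rating: sum(errs) / len(errs) for rating, errs in sorted(errors.items())}


# Made-up values for illustration.
y_true = [1, 1, 3, 5, 5, 5]
y_pred = [2.0, 3.0, 3.5, 4.5, 4.0, 5.0]

per_rating = mae_by_rating(y_true, y_pred)
for rating, mae in per_rating.items():
    print(f"{rating} stars: MAE = {mae:.2f}")
```

A regression model trained with a squared-error loss tends to pull predictions toward the mean, so larger errors on 1-star and 5-star reviews are a common pattern to look for.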

In [None]:
# Load the processed dataset and selected features for analysis
try:
    # Load the stratified data used for training
    stratified_df = pd.read_csv('data/processed/final_model_data.csv')
    
    # Load optimal features
    with open('data/processed/optimal_features.json', 'r') as f:
        optimal_features = json.load(f)
    
    print("✓ Analysis data loaded successfully")
    
except Exception as e:
    print(f"✗ Error loading analysis data: {e}")

In [None]:
# Performance analysis
print("Model Performance Analysis:")
print("=" * 50)
print(f"Mean Squared Error (MSE): {metrics['mse']:.4f}")
print(f"Mean Absolute Error (MAE): {metrics['mae']:.4f}")
print(f"R² Score: {metrics['r2']:.4f}")
print(f"RMSE: {metrics['mse']**0.5:.4f}")

# Interpretation
print("\nInterpretation:")
if metrics['r2'] > 0.7:
    print("✓ Excellent performance (R² > 0.7)")
elif metrics['r2'] > 0.5:
    print("✓ Good performance (R² > 0.5)")
elif metrics['r2'] > 0.3:
    print("✓ Moderate performance (R² > 0.3)")
else:
    print("⚠ Limited performance - consider feature engineering or model improvements")

print(f"\nOn average, predictions are off by {metrics['mae']:.2f} stars.")
print(f"Approximate 95% error band: ±{1.96 * metrics['mse']**0.5:.2f} stars (1.96 × RMSE, assuming unbiased, roughly normal errors)")

In [None]:
# Feature importance analysis
print("\nFeature Analysis:")
print("=" * 30)
print(f"Selected optimal features ({len(optimal_features)}):")
for i, feature in enumerate(optimal_features, 1):
    print(f"{i}. {feature}")

# Correlation analysis
feature_corrs = stratified_df[optimal_features + ['stars']].corr()['stars'].abs().sort_values(ascending=False)
print("\nFeature correlations with target:")
for feature in optimal_features:
    corr = feature_corrs[feature]
    print(f"{feature:25s}: {corr:.4f}")

# Visualize feature correlations
plt.figure(figsize=(10, 6))
feature_corrs[optimal_features].sort_values().plot(kind='barh')
plt.title('Feature Correlations with Star Rating')
plt.xlabel('Absolute Correlation')
plt.grid(axis='x', alpha=0.3)
plt.show()

In [None]:
# Model limitations and insights
print("\nModel Limitations and Insights:")
print("=" * 40)
print("1. Data Scope: Model trained on Yelp Academic Dataset only")
print("2. Feature Limitations: Predictions based on available user/business features")
print("3. Sentiment Context: Text analysis may miss nuanced sentiment")
print("4. Temporal Factors: Model doesn't account for trends over time")
print("5. Geographic Bias: Results may not generalize to all locations")

print("\nKey Insights (check these against the correlations and metrics printed above):")
print("• User history (average stars, elite status) strongly predicts ratings")
print("• Business quality is a major factor")
print("• Experience level (time yelping) influences rating patterns")
print("• Review sentiment provides additional predictive power")

print("\nFuture Improvements:")
print("• Incorporate temporal trends and seasonality")
print("• Add geographic and demographic features")
print("• Use more advanced NLP models for sentiment")
print("• Implement ensemble methods for better performance")
print("• Add uncertainty quantification to predictions")

## Summary

Congratulations! You've successfully completed the interactive Yelp rating prediction pipeline. Here's what we accomplished:

### Pipeline Stages Completed:
1. ✅ **Data Loading & Preprocessing**: Loaded and cleaned Yelp dataset
2. ✅ **Feature Engineering**: Created time-based and elite status features
3. ✅ **Sentiment Analysis**: Extracted sentiment scores from review text
4. ✅ **Feature Selection**: Identified optimal feature subset
5. ✅ **Model Training**: Trained neural network for rating prediction
6. ✅ **Inference**: Demonstrated model predictions
7. ✅ **Analysis**: Explored model performance and insights

### Key Learnings:
- **Data preprocessing** is crucial for model performance
- **Feature engineering** transforms raw data into predictive features
- **Sentiment analysis** adds valuable text-derived insights
- **Feature selection** improves model efficiency and interpretability
- **Neural networks** can effectively model complex relationships

### Model Performance:
- **MSE**: `metrics['mse']`, printed in Sections 6 and 8
- **MAE**: `metrics['mae']`, printed in Sections 6 and 8
- **R²**: `metrics['r2']`, printed in Sections 6 and 8

### Files Created:
- `data/processed/merged_data.csv`: Preprocessed dataset
- `data/processed/featured_data.csv`: Engineered features
- `data/processed/sentiment_data.csv`: With sentiment scores
- `data/processed/final_model_data.csv`: Final training data
- `models/best_model.pt`: Trained PyTorch model
- `models/scaler.pkl`: Feature scaler

This notebook demonstrates a complete machine learning pipeline, from raw data to a trained and saved model. You can now apply these techniques to other prediction problems!