# Interactive Yelp Rating Prediction Pipeline

## Overview

This notebook walks through the complete Yelp star rating prediction pipeline interactively. We'll cover each stage of the machine learning process, from data loading to model inference, with hands-on visualizations and parameter tuning.

### Learning Objectives
- Understand the Yelp Academic Dataset structure
- Learn data preprocessing and feature engineering techniques
- Explore sentiment analysis using transformer models
- Perform feature selection and model training
- Evaluate model performance and make predictions

### Pipeline Stages
1. **Data Loading & Preprocessing**: Load and clean raw Yelp data
2. **Feature Engineering**: Create derived features from raw data
3. **Sentiment Analysis**: Extract sentiment scores from review text
4. **Feature Selection**: Identify optimal feature subset
5. **Model Training**: Train neural network for rating prediction
6. **Inference**: Make predictions on new data
7. **Analysis**: Deep dive into results and insights
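The stages above can be read as functions that each take a DataFrame and return an enriched one. The sketch below illustrates that chaining on a toy table. `run_stages`, `drop_missing_text`, and `add_review_year` are hypothetical names used only for illustration, not the notebook's own functions.

```python
from typing import Callable, List

import pandas as pd

# A stage takes a DataFrame and returns a new DataFrame (hypothetical convention).
Stage = Callable[[pd.DataFrame], pd.DataFrame]


def run_stages(df: pd.DataFrame, stages: List[Stage]) -> pd.DataFrame:
    """Apply each stage in order, feeding the output of one into the next."""
    for stage in stages:
        df = stage(df)
    return df


def drop_missing_text(df: pd.DataFrame) -> pd.DataFrame:
    """Toy preprocessing stage: drop reviews without text."""
    return df.dropna(subset=["text"]).reset_index(drop=True)


def add_review_year(df: pd.DataFrame) -> pd.DataFrame:
    """Toy feature engineering stage: extract the review year."""
    out = df.copy()
    out["date_year"] = pd.to_datetime(out["date"]).dt.year
    return out


# Made-up rows, only to show the data flow between stages.
reviews = pd.DataFrame({
    "text": ["Great tacos", None, "Slow service"],
    "date": ["2019-06-01", "2020-01-15", "2021-03-09"],
    "stars": [5, 3, 2],
})

result = run_stages(reviews, [drop_missing_text, add_review_year])
print(result)
```

Keeping every stage a plain function of one DataFrame makes it easy to rerun or inspect a single stage from a notebook cell.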

### Dataset
We'll use the Yelp Academic Dataset, which includes:
- **Business data**: Restaurant information and ratings
- **Review data**: User reviews with text and ratings
- **User data**: Reviewer profiles and history

The goal is to predict the star rating (1-5) a user would give to a business based on user and business characteristics, plus review sentiment.
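Because the target is a star rating between 1 and 5, a model that outputs a continuous value may produce numbers outside that range. The sketch below shows one way to map such a value back to a whole star. `to_star_rating` is a hypothetical helper, and clamping plus rounding is an illustrative assumption rather than a step this notebook is known to perform.

```python
def to_star_rating(prediction: float) -> int:
    """Clamp a continuous prediction to the 1-5 range and round to a whole star."""
    clamped = min(5.0, max(1.0, prediction))
    return int(round(clamped))


# Made-up raw model outputs, including values outside the valid range.
raw_predictions = [0.3, 2.2, 4.7, 6.2]
stars = [to_star_rating(p) for p in raw_predictions]
print(stars)  # [1, 2, 5, 5]
```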

## Section 1: Introduction and Setup

### Learning Objectives
- Set up the Python environment
- Verify GPU availability
- Understand configuration parameters
- Load required libraries and modules

### Environment Requirements
- Python 3.8+
- PyTorch with MPS/CUDA support, plus PyTorch Lightning
- Transformers library
- pandas, NumPy, scikit-learn, Matplotlib, seaborn, and tqdm
- Jupyter widgets for interactivity

Let's start by setting up our environment and verifying everything is working correctly.
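As a quick pre-flight check, the sketch below reports whether the Python version meets the requirement and which packages can be found, without importing the heavy ones. `check_environment` and the `PACKAGES` list are illustrative assumptions. The device order (MPS, then CUDA, then CPU) mirrors the `get_device` helper defined in this notebook.

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 8)
# Import names to look for; "ipywidgets" is the package behind Jupyter widgets.
PACKAGES = ["torch", "transformers", "ipywidgets", "pandas", "sklearn"]


def check_environment() -> dict:
    """Return a small report on the Python version, packages, and device."""
    report = {"python_ok": sys.version_info >= REQUIRED_PYTHON}
    for name in PACKAGES:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None

    device = "unknown"  # stays "unknown" when torch is not installed
    if report["torch"]:
        import torch

        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            device = "mps"
        elif torch.cuda.is_available():
            device = "cuda"
        else:
            device = "cpu"
    report["device"] = device
    return report


report = check_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```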

In [None]:
# Environment setup and imports
import sys
import os
import logging
import warnings
warnings.filterwarnings('ignore')

# Add src to path for local imports
sys.path.append('src')

# Import standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import json
import pickle

# Import machine learning libraries
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Import pipeline modules

# Preprocessing functions
import pandas as pd
import os
from typing import List


def load_business_data(filepath: str) -> pd.DataFrame:
    """
    Load business data from CSV file with appropriate dtypes.

    Args:
        filepath: Path to the business CSV file

    Returns:
        DataFrame containing business data

    Raises:
        FileNotFoundError: If the file does not exist
        ValueError: If required columns are missing
    """
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Business data file not found: {filepath}")

    # Define dtypes for business data
    dtypes = {
        'business_id': str,
        'name': str,
        'address': str,
        'city': str,
        'state': str,
        'postal_code': str,
        'latitude': float,
        'longitude': float,
        'stars': float,
        'review_count': int,
        'is_open': int
    }

    # Load the data
    df = pd.read_csv(filepath, dtype=dtypes, low_memory=False)

    # Check for required columns
    required_columns = ['business_id', 'name', 'stars', 'review_count']
    missing_columns = [col for col in required_columns if col not in df.columns]
    if missing_columns:
        raise ValueError(f"Missing required columns in business data: {missing_columns}")

    return df


def load_review_data(filepath: str) -> pd.DataFrame:
    """
    Load review data from CSV file with appropriate dtypes.

    Args:
        filepath: Path to the review CSV file

    Returns:
        DataFrame containing review data

    Raises:
        FileNotFoundError: If the file does not exist
        ValueError: If required columns are missing
    """
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Review data file not found: {filepath}")

    # Define dtypes for review data
    dtypes = {
        'review_id': str,
        'user_id': str,
        'business_id': str,
        'stars': int,
        'useful': int,
        'funny': int,
        'cool': int,
        'text': str,
        'date': str
    }

    # Load the data
    df = pd.read_csv(filepath, dtype=dtypes, low_memory=False)

    # Check for required columns
    required_columns = ['review_id', 'user_id', 'business_id', 'stars', 'text', 'date']
    missing_columns = [col for col in required_columns if col not in df.columns]
    if missing_columns:
        raise ValueError(f"Missing required columns in review data: {missing_columns}")

    return df


def load_user_data(filepath: str) -> pd.DataFrame:
    """
    Load user data from CSV file with appropriate dtypes.

    Args:
        filepath: Path to the user CSV file

    Returns:
        DataFrame containing user data

    Raises:
        FileNotFoundError: If the file does not exist
        ValueError: If required columns are missing
    """
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"User data file not found: {filepath}")

    # Define dtypes for user data
    dtypes = {
        'user_id': str,
        'name': str,
        'review_count': int,
        'yelping_since': str,
        'useful': int,
        'funny': int,
        'cool': int,
        'elite': str,
        'friends': str,
        'fans': int,
        'average_stars': float,
        'compliment_hot': int,
        'compliment_more': int,
        'compliment_profile': int,
        'compliment_cute': int,
        'compliment_list': int,
        'compliment_note': int,
        'compliment_plain': int,
        'compliment_cool': int,
        'compliment_funny': int,
        'compliment_writer': int,
        'compliment_photos': int
    }

    # Load the data
    df = pd.read_csv(filepath, dtype=dtypes, low_memory=False)

    # Check for required columns
    required_columns = ['user_id', 'name', 'review_count', 'yelping_since', 'average_stars']
    missing_columns = [col for col in required_columns if col not in df.columns]
    if missing_columns:
        raise ValueError(f"Missing required columns in user data: {missing_columns}")

    return df


import pandas as pd
import os
from typing import Tuple
from src.config import INPUT_FILES, OUTPUT_FILES
from src.data_loading import load_business_data, load_review_data, load_user_data


def rename_columns(user_df: pd.DataFrame, business_df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Rename columns in user and business DataFrames to avoid naming conflicts.

    Args:
        user_df: DataFrame containing user data
        business_df: DataFrame containing business data

    Returns:
        Tuple of (renamed_user_df, renamed_business_df)
    """
    # Rename user columns
    user_renames = {
        'useful': 'total_useful',
        'funny': 'total_funny',
        'cool': 'total_cool',
        'review_count': 'user_review_count',
        'name': 'user_name',
        'average_stars': 'user_average_stars'
    }

    # Rename business columns
    business_renames = {
        'stars': 'business_average_stars',
        'review_count': 'business_review_count',
        'name': 'business_name'
    }

    # Apply renames
    renamed_user_df = user_df.rename(columns=user_renames)
    renamed_business_df = business_df.rename(columns=business_renames)

    return renamed_user_df, renamed_business_df


def convert_date_columns(review_df: pd.DataFrame, user_df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Convert date columns to datetime dtype.

    Args:
        review_df: DataFrame containing review data
        user_df: DataFrame containing user data

    Returns:
        Tuple of (converted_review_df, converted_user_df)
    """
    # Convert 'date' column in review_df to datetime
    converted_review_df = review_df.copy()
    converted_review_df['date'] = pd.to_datetime(converted_review_df['date'])

    # Convert 'yelping_since' column in user_df to datetime
    converted_user_df = user_df.copy()
    converted_user_df['yelping_since'] = pd.to_datetime(converted_user_df['yelping_since'])

    return converted_review_df, converted_user_df


def merge_datasets(review_df: pd.DataFrame, user_df: pd.DataFrame, business_df: pd.DataFrame) -> pd.DataFrame:
    """
    Merge review, user, and business DataFrames using inner joins.

    Args:
        review_df: DataFrame containing review data
        user_df: DataFrame containing user data
        business_df: DataFrame containing business data

    Returns:
        Merged DataFrame with all three sources combined
    """
    # Inner join review -> user on 'user_id'
    merged = review_df.merge(user_df, on='user_id', how='inner')

    # Then result -> business on 'business_id'
    merged = merged.merge(business_df, on='business_id', how='inner')

    return merged


def clean_merged_data(merged_df: pd.DataFrame) -> pd.DataFrame:
    """
    Clean merged DataFrame by removing rows with missing values in critical columns.

    Args:
        merged_df: Merged DataFrame from merge_datasets

    Returns:
        Cleaned DataFrame with no missing values in critical columns
    """
    # Drop rows with missing values in specified columns
    cleaned = merged_df.dropna(subset=['stars', 'text', 'business_average_stars', 'user_average_stars', 'user_review_count'])

    return cleaned


def preprocess_pipeline() -> pd.DataFrame:
    """
    Complete preprocessing pipeline: load, rename, convert dates, merge, clean, and save.

    Returns:
        Final preprocessed DataFrame
    """
    # Load all three datasets
    review_df = load_review_data(INPUT_FILES["review"])
    user_df = load_user_data(INPUT_FILES["user"])
    business_df = load_business_data(INPUT_FILES["business"])

    # Rename columns
    user_df, business_df = rename_columns(user_df, business_df)

    # Convert date columns
    review_df, user_df = convert_date_columns(review_df, user_df)

    # Merge datasets
    merged_df = merge_datasets(review_df, user_df, business_df)

    # Clean merged data
    cleaned_df = clean_merged_data(merged_df)

    # Ensure output directory exists
    output_dir = os.path.dirname(OUTPUT_FILES["merged_data"])
    os.makedirs(output_dir, exist_ok=True)

    # Save to CSV
    cleaned_df.to_csv(OUTPUT_FILES["merged_data"], index=False)

    return cleaned_df


# Features functions
import pandas as pd
from src.utils import count_elite_statuses, check_elite_status

# Config
import os
import torch

# File paths for input CSV files
DATA_DIR = "data"
INPUT_FILES = {
    "business": os.path.join(DATA_DIR, "yelp_business_data.csv"),
    "review": os.path.join(DATA_DIR, "yelp_review.csv"),
    "user": os.path.join(DATA_DIR, "yelp_user.csv"),
    "checkin": os.path.join(DATA_DIR, "yelp_checkin_data.csv"),
    "tip": os.path.join(DATA_DIR, "yelp_tip_data.csv")
}

# Output paths for processed data
OUTPUT_DIR = os.path.join(DATA_DIR, "processed")
OUTPUT_FILES = {
    "merged_data": os.path.join(OUTPUT_DIR, "merged_data.csv"),
    "featured_data": os.path.join(OUTPUT_DIR, "featured_data.csv"),
    "sentiment_data": os.path.join(OUTPUT_DIR, "sentiment_data.csv"),
    "final_model_data": os.path.join(OUTPUT_DIR, "final_model_data.csv")
}
FEATURED_DATA_PATH = OUTPUT_FILES["featured_data"]

# Model hyperparameters
LEARNING_RATE = 0.0001
BATCH_SIZE = 64
MAX_EPOCHS = 40

# Feature lists
CANDIDATE_FEATURES = [
    "user_average_stars",
    "business_average_stars",
    "user_review_count",
    "business_review_count",
    "time_yelping",
    "date_year",
    "total_elite_statuses",
    "elite_status",
    "normalized_sentiment_score"
]
EXPECTED_OPTIMAL_FEATURES = [
    "user_average_stars",
    "business_average_stars",
    "time_yelping",
    "elite_status",
    "normalized_sentiment_score"
]

# Random seed
SEED = 1

# Sentiment settings
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MAX_TOKENS = 512
SENTIMENT_BATCH_SIZE = 64


# Device detection helper function
def get_device():
    """
    Detect the available device for PyTorch computations.

    Returns:
        str: 'mps' if MPS is available, 'cuda' if CUDA is available, otherwise 'cpu'
    """
    if torch.backends.mps.is_available():
        return "mps"
    elif torch.cuda.is_available():
        return "cuda"
    else:
        return "cpu"


def engineer_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Engineer time-based features from the DataFrame.

    Calculates time_yelping as the difference between date and yelping_since
    in weeks, and extracts date_year from the date column.

    Args:
        df: Input DataFrame with 'date' and 'yelping_since' columns

    Returns:
        DataFrame with added 'time_yelping' and 'date_year' columns
    """
    df = df.copy()

    # Convert date columns to datetime if they are strings
    df['date'] = pd.to_datetime(df['date'])
    df['yelping_since'] = pd.to_datetime(df['yelping_since'])

    # Calculate time_yelping in weeks
    df['time_yelping'] = (df['date'] - df['yelping_since']).dt.total_seconds() / (7 * 24 * 3600)

    # Extract year from date
    df['date_year'] = df['date'].dt.year

    return df


def engineer_elite_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Engineer elite status features from the DataFrame.

    Creates 'total_elite_statuses' by counting elite years up to the review year,
    and 'elite_status' by checking if the user was elite in the review year or previous year.

    Args:
        df: Input DataFrame with 'elite' and 'date_year' columns

    Returns:
        DataFrame with added 'total_elite_statuses' and 'elite_status' columns
    """
    df = df.copy()

    # Create total_elite_statuses using count_elite_statuses
    df['total_elite_statuses'] = df.apply(
        lambda row: count_elite_statuses(row['elite'], row['date_year']),
        axis=1
    )

    # Create elite_status using check_elite_status
    df['elite_status'] = df.apply(
        lambda row: check_elite_status(row['elite'], row['date_year']),
        axis=1
    )

    return df


def handle_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    """
    Handle missing values in the DataFrame.

    Fills 'time_yelping' with the median value, and 'total_elite_statuses'
    and 'elite_status' with 0.

    Args:
        df: Input DataFrame

    Returns:
        DataFrame with missing values handled
    """
    df = df.copy()

    df['time_yelping'] = df['time_yelping'].fillna(df['time_yelping'].median())
    df['total_elite_statuses'] = df['total_elite_statuses'].fillna(0)
    df['elite_status'] = df['elite_status'].fillna(0)

    return df


def feature_engineering_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """
    Complete feature engineering pipeline.

    Applies time feature engineering, elite feature engineering, handles missing values,
    saves the processed data to CSV, and returns the DataFrame.

    Args:
        df: Input DataFrame

    Returns:
        Processed DataFrame with engineered features
    """
    df = engineer_time_features(df)
    df = engineer_elite_features(df)
    df = handle_missing_values(df)

    # The config constants are inlined in this cell, so there is no `config`
    # module object here; use FEATURED_DATA_PATH directly.
    df.to_csv(FEATURED_DATA_PATH, index=False)

    return df


# Sentiment functions
from transformers import pipeline, AutoTokenizer
import torch
from typing import List, Dict
import pandas as pd
from tqdm import tqdm
import os
from src.utils import smart_truncate_text


def initialize_sentiment_pipeline(device: str = "mps"):
    """
    Initialize sentiment analysis pipeline with device detection.

    Args:
        device: Device to use ('mps' or 'cpu'). Defaults to 'mps'.

    Returns:
        Hugging Face pipeline for sentiment analysis
    """
    # Detect device
    if device == "mps" and not torch.backends.mps.is_available():
        device = "cpu"
    elif device not in ["cpu", "mps"]:
        device = "cpu"  # fallback

    # Initialize pipeline
    sentiment_pipeline = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
        device=device,
        truncation=False
    )

    return sentiment_pipeline


def process_sentiment_batch(texts: List[str], pipeline, batch_size: int = 64) -> List[Dict]:
    """
    Process batch of texts through sentiment analysis pipeline.

    Args:
        texts: List of text strings to analyze
        pipeline: Hugging Face sentiment analysis pipeline
        batch_size: Number of texts to process in each batch

    Returns:
        List of sentiment analysis results (dicts with 'label' and 'score')
    """
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        batch_results = pipeline(batch)
        results.extend(batch_results)
    return results


def normalize_sentiment_scores(sentiment_results: List[Dict]) -> pd.Series:
    """
    Normalize sentiment scores to range [-1, 1].

    Args:
        sentiment_results: List of sentiment analysis results

    Returns:
        Pandas Series with normalized scores (-1 for negative, +1 for positive)
    """
    scores = []
    for result in sentiment_results:
        label = result['label']
        score = result['score']
        if label == 'NEGATIVE':
            scores.append(-score)
        elif label == 'POSITIVE':
            scores.append(score)
        else:
            # Handle unexpected labels (e.g., neutral) by setting to 0
            scores.append(0.0)
    return pd.Series(scores)


def sentiment_analysis_pipeline(df: pd.DataFrame, batch_size: int = 64) -> pd.DataFrame:
    """
    Complete sentiment analysis pipeline.

    Args:
        df: DataFrame containing review texts in 'text' column
        batch_size: Number of texts to process in each batch

    Returns:
        DataFrame with added sentiment columns
    """
    # Initialize sentiment pipeline
    sentiment_pipeline = initialize_sentiment_pipeline()

    # Load tokenizer for truncation
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

    # Process in batches with tqdm progress bar
    results = []
    for i in tqdm(range(0, len(df), batch_size), desc="Processing sentiment analysis"):
        batch_texts = df['text'].iloc[i:i + batch_size].tolist()
        # Apply smart truncation to each text
        truncated_texts = [smart_truncate_text(text, tokenizer, max_tokens=500) for text in batch_texts]
        batch_results = process_sentiment_batch(truncated_texts, sentiment_pipeline, batch_size)
        results.extend(batch_results)

    # Normalize sentiment scores
    normalized_scores = normalize_sentiment_scores(results)

    # Add columns. Use .values so the scores are assigned by position; a Series
    # with a fresh 0..n-1 index would otherwise be aligned on df's index and
    # produce NaN wherever df's index is not a default RangeIndex.
    df['sentiment_label'] = [r['label'] for r in results]
    df['sentiment_score_raw'] = [r['score'] for r in results]
    df['normalized_sentiment_score'] = normalized_scores.values

    # Save to CSV. All results are already computed at this point, so a single
    # write produces the same file as rewriting a growing prefix every 1000 rows.
    output_path = 'data/processed/sentiment_data.csv'
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    df.to_csv(output_path, index=False)

    return df


# Feature selection functions
import pandas as pd
import json
import os
import logging
from typing import List, Tuple
from itertools import combinations
from tqdm import tqdm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Config: the path constants, hyperparameters, feature lists, and get_device()
# are already defined earlier in this cell, so they are not repeated here.

logger = logging.getLogger(__name__)


def prepare_feature_data(df: pd.DataFrame, candidate_features: List[str]) -> Tuple[pd.DataFrame, pd.Series]:
    """
    Prepare data for feature selection.

    Args:
        df: Input DataFrame containing all features and target
        candidate_features: List of feature column names to consider

    Returns:
        Tuple of (X, y) where X is features DataFrame and y is target Series
    """
    # Select candidate features plus target column
    selected_cols = candidate_features + ['stars']
    subset_df = df[selected_cols].copy()

    # Remove rows with missing values in selected columns
    subset_df = subset_df.dropna()

    # Separate features and target
    X = subset_df[candidate_features]
    y = subset_df['stars']

    return X, y


def run_best_subset_selection(X: pd.DataFrame, y: pd.Series) -> List[str]:
    """
    Run best subset feature selection using exhaustive search over different numbers of features.

    Args:
        X: Feature DataFrame
        y: Target Series

    Returns:
        List of selected feature names
    """
    # Split data into train and validation
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    best_overall_mse = float('inf')
    best_k = None
    best_features = None

    max_k = min(10, len(X.columns))

    for k in tqdm(range(1, max_k + 1), desc="Evaluating k"):
        # Get all combinations of k features
        feature_combos = list(combinations(X.columns, k))

        best_mse_for_k = float('inf')
        best_combo_for_k = None

        for combo in tqdm(feature_combos, desc=f"Evaluating combinations for k={k}"):
            combo = list(combo)

            # Select features
            X_train_combo = X_train[combo]
            X_val_combo = X_val[combo]

            # Fit model
            model = RandomForestRegressor(n_estimators=100, random_state=1, n_jobs=-1)
            model.fit(X_train_combo, y_train)

            # Predict and compute MSE
            y_pred = model.predict(X_val_combo)
            mse = mean_squared_error(y_val, y_pred)

            if mse < best_mse_for_k:
                best_mse_for_k = mse
                best_combo_for_k = combo

        # Compare across k
        if best_mse_for_k < best_overall_mse:
            best_overall_mse = best_mse_for_k
            best_k = k
            best_features = best_combo_for_k

    logger.info(f"Best k: {best_k}")
    logger.info(f"Best feature set: {best_features}")
    logger.info(f"Best MSE: {best_overall_mse}")

    return best_features


def feature_selection_pipeline(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """
    Complete feature selection pipeline.

    Args:
        df: Input DataFrame with all features and target

    Returns:
        Tuple of (final DataFrame with optimal features + target, list of optimal features)
    """
    # Get candidate features from the config constants inlined in this cell
    # (there is no `config` module object here)
    candidate_features = CANDIDATE_FEATURES

    # Prepare feature data
    X, y = prepare_feature_data(df, candidate_features)

    # Run best subset selection
    optimal_features = run_best_subset_selection(X, y)

    # Save optimal features to JSON (create the output directory first)
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    optimal_features_path = os.path.join(OUTPUT_DIR, "optimal_features.json")
    with open(optimal_features_path, 'w') as f:
        json.dump(optimal_features, f, indent=2)

    # Create final dataset with optimal features + target
    final_cols = optimal_features + ['stars']
    final_df = df[final_cols].copy()

    # Remove any remaining missing values
    final_df = final_df.dropna()

    # Save final dataset
    final_data_path = OUTPUT_FILES["final_model_data"]
    os.makedirs(os.path.dirname(final_data_path), exist_ok=True)
    final_df.to_csv(final_data_path, index=False)

    # Return final DataFrame and feature list
    return final_df, optimal_features


# Training functions
import torch
import torch.nn as nn
import pytorch_lightning as pl
from typing import Dict, Any


class YelpRatingPredictor(pl.LightningModule):
    """
    PyTorch Lightning module for predicting Yelp ratings using a neural network.

    This model consists of a feedforward neural network with dropout and batch normalization
    layers, designed to predict star ratings based on input features.

    Attributes:
        network: Sequential neural network layers
        criterion: Mean squared error loss function
    """

    def __init__(self, input_size: int = 5, learning_rate: float = 0.0001) -> None:
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_size, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Dropout(0.5),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, 1)
        )
        self.criterion = nn.MSELoss()
        self.save_hyperparameters()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass through the network.

        Args:
            x: Input tensor

        Returns:
            Output tensor predictions
        """
        return self.network(x)

    def training_step(self, batch: tuple, batch_idx: int) -> torch.Tensor:
        """
        Training step for one batch.

        Args:
            batch: Tuple of (features, targets)
            batch_idx: Batch index

        Returns:
            Loss tensor
        """
        x, y = batch
        # Squeeze (N, 1) -> (N,) so the loss does not broadcast against y of shape (N,)
        preds = self(x).squeeze(-1)
        loss = self.criterion(preds, y)
        self.log('train_loss', loss)
        return loss

    def validation_step(self, batch: tuple, batch_idx: int) -> torch.Tensor:
        """
        Validation step for one batch.

        Args:
            batch: Tuple of (features, targets)
            batch_idx: Batch index

        Returns:
            Loss tensor
        """
        x, y = batch
        # Squeeze (N, 1) -> (N,) to match the target shape
        preds = self(x).squeeze(-1)
        loss = self.criterion(preds, y)
        mae = torch.mean(torch.abs(preds - y))
        self.log('val_loss', loss)
        self.log('val_mae', mae)
        return loss

    def test_step(self, batch: tuple, batch_idx: int) -> Dict[str, torch.Tensor]:
        """
        Test step for one batch.

        Args:
            batch: Tuple of (features, targets)
            batch_idx: Batch index

        Returns:
            Dictionary with test loss and MAE
        """
        x, y = batch
        # Squeeze (N, 1) -> (N,) to match the target shape
        preds = self(x).squeeze(-1)
        loss = self.criterion(preds, y)
        mae = torch.mean(torch.abs(preds - y))
        return {'test_loss': loss, 'test_mae': mae}

    def configure_optimizers(self) -> Dict[str, Any]:
        """
        Configure optimizer and learning rate scheduler.

        Returns:
            Dictionary with optimizer and scheduler configuration
        """
        optimizer = torch.optim.RMSprop(self.parameters(), lr=self.hparams.learning_rate)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, factor=0.5, patience=3
        )
        return {'optimizer': optimizer, 'lr_scheduler': {'scheduler': scheduler, 'monitor': 'val_loss'}}


import pandas as pd
import json
import torch
import os
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from typing import List, Tuple, Dict, Any

# Config: the path constants, hyperparameters, feature lists, and get_device()
# are already defined earlier in this cell, so they are not repeated here.

# YelpRatingPredictor is defined above in this cell; importing it again from
# src.model would shadow that definition.
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint


def stratify_and_split(df: pd.DataFrame, target_size: int = 130000) -> pd.DataFrame:
    """
    Stratify the dataset by 'stars' and downsample to equal samples per class.

    Args:
        df: Input DataFrame containing 'stars' column
        target_size: Total target size for the stratified dataset

    Returns:
        Stratified DataFrame with equal samples per class
    """
    # Group by 'stars' and downsample each group
    samples_per_class = target_size // 5  # 5 classes (1-5 stars)
    stratified_df = df.groupby('stars', group_keys=False).apply(
        lambda x: x.sample(n=min(len(x), samples_per_class), random_state=1)
    ).reset_index(drop=True)
    return stratified_df


def prepare_train_test_data(df: pd.DataFrame, features: List[str], test_size: float = 0.2) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, MinMaxScaler]:
    """
    Prepare train/test data with stratification, normalization, and PyTorch tensor conversion.

    Args:
        df: Input DataFrame
        features: List of feature column names
        test_size: Fraction of data to use for testing

    Returns:
        Tuple of (X_train, X_test, y_train, y_test, scaler)
    """
    # Prepare features and target
    X = df[features]
    y = df['stars']

    # Split into train/test with stratification
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=1
    )

    # Normalize features using MinMaxScaler
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Convert to PyTorch tensors
    X_train_tensor = torch.FloatTensor(X_train_scaled)
    X_test_tensor = torch.FloatTensor(X_test_scaled)
    y_train_tensor = torch.FloatTensor(y_train.values)
    y_test_tensor = torch.FloatTensor(y_test.values)

    return X_train_tensor, X_test_tensor, y_train_tensor, y_test_tensor, scaler


def create_dataloaders(X_train, y_train, X_val, y_val, batch_size: int = 64) -> Tuple[DataLoader, DataLoader]:
    """
    Create DataLoaders for training and validation datasets.

Args:        X_train: Training features tensor        y_train: Training labels tensor        X_val: Validation features tensor        y_val: Validation labels tensor        batch_size: Batch size for DataLoaders    Returns:        Tuple of (train_loader, val_loader)    """    train_dataset = TensorDataset(X_train, y_train)    val_dataset = TensorDataset(X_val, y_val)    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)    return train_loader, val_loaderdef train_model(model, train_loader, val_loader, max_epochs: int = 40) -> pl.Trainer:    """    Train the model using PyTorch Lightning.    Args:        model: PyTorch Lightning model to train        train_loader: Training DataLoader        val_loader: Validation DataLoader        max_epochs: Maximum number of epochs    Returns:        Trained PyTorch Lightning Trainer    """    # Detect device    if torch.backends.mps.is_available():        accelerator = 'mps'    else:        accelerator = 'cpu'    # Configure callbacks    early_stopping = EarlyStopping(monitor='val_loss', patience=5)    # Configure trainer    trainer = pl.Trainer(        accelerator=accelerator,        max_epochs=max_epochs,        callbacks=[early_stopping]    )    # Train model    trainer.fit(model, train_loader, val_loader)    # Save model    os.makedirs('models', exist_ok=True)    torch.save(model.state_dict(), os.path.join('models', 'best_model.pt'))    return trainerdef evaluate_model(model, X_test, y_test) -> Dict[str, float]:    """    Evaluate the model on test data and return metrics.    
Args:        model: Trained PyTorch model        X_test: Test features tensor        y_test: Test labels tensor    Returns:        Dictionary with MSE, MAE, and R² metrics    """    model.eval()    with torch.no_grad():        predictions = model(X_test)        mse = mean_squared_error(y_test.cpu().numpy(), predictions.cpu().numpy())        mae = mean_absolute_error(y_test.cpu().numpy(), predictions.cpu().numpy())        r2 = r2_score(y_test.cpu().numpy(), predictions.cpu().numpy())    return {'mse': mse, 'mae': mae, 'r2': r2}def training_pipeline() -> Dict[str, Any]:    """    Complete training pipeline: load data, train model, evaluate, and save artifacts.    Returns:        Dictionary with results including metrics and file paths    """    # Load final model data and optimal features    df = pd.read_csv('data/processed/final_model_data.csv')    with open('data/processed/optimal_features.json', 'r') as f:        features = json.load(f)    # Stratify data    df_stratified = stratify_and_split(df)    # Prepare train/test splits    X_train, X_test, y_train, y_test, scaler = prepare_train_test_data(df_stratified, features)    # Split train into train/val for training    X_train_split, X_val, y_train_split, y_val = train_test_split(        X_train, y_train, test_size=0.2, random_state=1    )    # Create DataLoaders    train_loader, val_loader = create_dataloaders(        X_train_split, y_train_split, X_val, y_val, batch_size=config.BATCH_SIZE    )    # Initialize model    input_size = len(features)    model = YelpRatingPredictor(input_size=input_size, learning_rate=config.LEARNING_RATE)    # Train model    trainer = train_model(model, train_loader, val_loader, max_epochs=config.MAX_EPOCHS)    # Evaluate on test set    metrics = evaluate_model(model, X_test, y_test)    # Save scaler    os.makedirs('models', exist_ok=True)    with open('models/scaler.pkl', 'wb') as f:        pickle.dump(scaler, f)    # Save metrics    os.makedirs('outputs', exist_ok=True)    with 
open('outputs/metrics.json', 'w') as f:        json.dump(metrics, f)    # Return results    return {        'metrics': metrics,        'model_path': 'models/best_model.pt',        'scaler_path': 'models/scaler.pkl',        'metrics_path': 'outputs/metrics.json'    }# Configimport osimport torch# File paths for input CSV filesDATA_DIR = "data"INPUT_FILES = {    "business": os.path.join(DATA_DIR, "yelp_business_data.csv"),    "review": os.path.join(DATA_DIR, "yelp_review.csv"),    "user": os.path.join(DATA_DIR, "yelp_user.csv"),    "checkin": os.path.join(DATA_DIR, "yelp_checkin_data.csv"),    "tip": os.path.join(DATA_DIR, "yelp_tip_data.csv")}# Output paths for processed dataOUTPUT_DIR = os.path.join(DATA_DIR, "processed")OUTPUT_FILES = {    "merged_data": os.path.join(OUTPUT_DIR, "merged_data.csv"),    "featured_data": os.path.join(OUTPUT_DIR, "featured_data.csv"),    "sentiment_data": os.path.join(OUTPUT_DIR, "sentiment_data.csv"),    "final_model_data": os.path.join(OUTPUT_DIR, "final_model_data.csv")}FEATURED_DATA_PATH = OUTPUT_FILES["featured_data"]# Model hyperparametersLEARNING_RATE = 0.0001BATCH_SIZE = 64MAX_EPOCHS = 40# Feature listsCANDIDATE_FEATURES = [    "user_average_stars",    "business_average_stars",    "user_review_count",    "business_review_count",    "time_yelping",    "date_year",    "total_elite_statuses",    "elite_status",    "normalized_sentiment_score"]EXPECTED_OPTIMAL_FEATURES = [    "user_average_stars",    "business_average_stars",    "time_yelping",    "elite_status",    "normalized_sentiment_score"]# Random seedSEED = 1# Sentiment settingsMODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"MAX_TOKENS = 512SENTIMENT_BATCH_SIZE = 64# Device detection helper functiondef get_device():    """    Detect the available device for PyTorch computations.    
Returns:        str: 'mps' if MPS is available, 'cuda' if CUDA is available, otherwise 'cpu'    """    if torch.backends.mps.is_available():        return "mps"    elif torch.cuda.is_available():        return "cuda"    else:        return "cpu"# Utilsimport randomimport numpy as npimport pandas as pdimport torchimport loggingfrom typing import Listlogger = logging.getLogger(__name__)def parse_elite_years(elite_str: str) -> List[int]:    """    Parse elite years string into a list of integers.    Handles empty strings, NaN values, and comma/pipe-separated years.    Args:        elite_str: String containing elite years, e.g., "2018,2019,2020" or "2018|2019"    Returns:        List of integers representing elite years, or empty list for invalid input    """    if pd.isna(elite_str) or elite_str == "":        return []    # Replace pipe with comma for consistent splitting    elite_str = elite_str.replace('|', ',')    # Split by comma and convert to integers, filtering out empty strings    years = []    for year_str in elite_str.split(','):        year_str = year_str.strip()        if year_str:            try:                years.append(int(year_str))            except ValueError:                # Skip invalid year strings                continue    return yearsdef count_elite_statuses(elite_str: str, review_year: int) -> int:    """    Count the number of elite statuses up to and including the review year.    Args:        elite_str: String containing elite years        review_year: The year of the review    Returns:        Number of elite years <= review_year    """    elite_years = parse_elite_years(elite_str)    return sum(1 for year in elite_years if year <= review_year)def check_elite_status(elite_str: str, review_year: int) -> int:    """    Check if the user was elite in the review year or the previous year.    
Args:        elite_str: String containing elite years        review_year: The year of the review    Returns:        1 if elite in review_year or (review_year - 1), 0 otherwise    """    elite_years = parse_elite_years(elite_str)    return 1 if review_year in elite_years or (review_year - 1) in elite_years else 0def smart_truncate_text(text: str, tokenizer, max_tokens: int = 500) -> str:    """    Tokenize text, keep first 250 + last 250 tokens if over max_tokens, convert back to string.    """    tokens = tokenizer.encode(text, add_special_tokens=False)    if len(tokens) <= max_tokens:        return text    # Keep first 250 and last 250    first_part = tokens[:250]    last_part = tokens[-250:]    truncated_tokens = first_part + last_part    return tokenizer.decode(truncated_tokens)def set_seed(seed: int = 1) -> None:    """    Set random seed for reproducibility.    """    random.seed(seed)    np.random.seed(seed)    torch.manual_seed(seed)    if torch.backends.mps.is_available():        torch.mps.manual_seed(seed)def verify_gpu_support() -> bool:    """    Check MPS GPU support availability.    """    available = torch.backends.mps.is_available()    status = "available" if available else "not available"    logger.info(f"MPS GPU support is {status}.")    return available# Set up plotting styleplt.style.use('default')sns.set_palette("husl")# Configure logginglogging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')logger = logging.getLogger(__name__)print("✓ Libraries imported successfully")

In [None]:
# GPU detection and device setup
# The config and utils definitions were inlined in the previous cell,
# so they are referenced directly rather than through `config.` / `utils.`.
device_info = verify_gpu_support()
print(f"MPS GPU Support: {device_info}")
print(f"Selected device: {get_device()}")

# Set random seed for reproducibility
set_seed(SEED)
print(f"✓ Random seed set to: {SEED}")

# Display configuration
print("\n=== CONFIGURATION ===")
print(f"Data Directory: {DATA_DIR}")
print(f"Learning Rate: {LEARNING_RATE}")
print(f"Batch Size: {BATCH_SIZE}")
print(f"Max Epochs: {MAX_EPOCHS}")
print(f"Sentiment Model: {MODEL_NAME}")
print(f"Candidate Features: {CANDIDATE_FEATURES}")
print("=====================")

### Interactive: Environment Verification

Let's verify that all our data files exist and check their sizes.

In [None]:
# Verify data files exist
# INPUT_FILES and OUTPUT_DIR are the constants inlined in the setup cell.
print("Checking data file availability:")
for name, path in INPUT_FILES.items():
    exists = os.path.exists(path)
    size_mb = os.path.getsize(path) / (1024**2) if exists else 0
    status = "✓" if exists else "✗"
    print(f"{status} {name.capitalize()} data: {path} ({size_mb:.1f} MB)")

# Create output directories if they do not exist yet
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs('models', exist_ok=True)
os.makedirs('outputs', exist_ok=True)
print("\n✓ Output directories created")

## Section 2: Data Loading and Preprocessing

### Learning Objectives
- Understand the structure of the Yelp dataset
- Learn data loading and merging techniques
- Handle missing values and data cleaning
- Visualize data distributions and relationships

### What We'll Do
1. Load the three main datasets (business, review, user)
2. Rename columns to avoid conflicts
3. Convert date columns to datetime format
4. Merge datasets using inner joins
5. Clean data by removing rows with missing critical values
6. Explore the merged dataset

This preprocessing step transforms raw CSV files into a clean, merged dataset ready for feature engineering.
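
The steps above can be sketched on toy data. This is a minimal illustration, not the body of `preprocess_pipeline()`: the rows are made up, and the raw column names (`average_stars`, `review_count`, `yelping_since`) are assumptions about the CSV layout, chosen so that the renamed columns match the feature names used later in this notebook.

```python
import pandas as pd

# Toy stand-ins for the three CSV files (hypothetical rows and column layout).
business = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "stars": [4.0, 3.5],
    "review_count": [120, 15],
})
user = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "average_stars": [3.8, 4.6],
    "review_count": [40, 7],
    "yelping_since": ["2012-05-01", "2016-09-15"],
})
review = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "user_id": ["u1", "u2", "u3"],   # u3 has no user record
    "business_id": ["b1", "b2", "b1"],
    "stars": [5, 3, None],           # r3 is missing its rating
    "date": ["2018-01-10", "2019-03-02", "2019-07-21"],
})

# Rename overlapping columns so they stay distinguishable after the merge.
business = business.rename(columns={
    "stars": "business_average_stars",
    "review_count": "business_review_count",
})
user = user.rename(columns={
    "average_stars": "user_average_stars",
    "review_count": "user_review_count",
})

# Parse date strings into datetime columns.
review["date"] = pd.to_datetime(review["date"])
user["yelping_since"] = pd.to_datetime(user["yelping_since"])

# Inner joins keep only reviews with both a matching user and business.
merged = (
    review.merge(user, on="user_id", how="inner")
          .merge(business, on="business_id", how="inner")
)

# Drop rows missing values the model cannot do without.
merged = merged.dropna(
    subset=["stars", "user_average_stars", "business_average_stars"]
)
print(merged.shape)
```

Because the joins are inner joins, a review whose user or business is absent from the other files is dropped silently, so it is worth comparing row counts before and after the merge.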

In [None]:
# Run the preprocessing pipeline
print("Starting data preprocessing pipeline...")
print("This may take a few minutes depending on your system.")

try:
    merged_df = preprocess_pipeline()
    print("\n✓ Preprocessing completed successfully!")
    print(f"Final dataset shape: {merged_df.shape}")
    print(f"Memory usage: {merged_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
except Exception as e:
    print(f"✗ Error during preprocessing: {e}")
    print("Please check your data files and try again.")

In [None]:
# Display dataset overview
print("Dataset Overview:")
print("=" * 50)
merged_df.info()  # info() prints directly and returns None, so it is not wrapped in print()

print("\nFirst 5 rows:")
display(merged_df.head())

# Basic statistics
print("\nBasic Statistics:")
print(merged_df.describe())

In [None]:
# Visualize data distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Data Distributions After Preprocessing', fontsize=16)

# Stars distribution
merged_df['stars'].value_counts().sort_index().plot(kind='bar', ax=axes[0,0])
axes[0,0].set_title('Star Ratings Distribution')
axes[0,0].set_xlabel('Stars')
axes[0,0].set_ylabel('Count')

# User average stars
merged_df['user_average_stars'].hist(bins=20, ax=axes[0,1])
axes[0,1].set_title('User Average Stars')
axes[0,1].set_xlabel('Average Stars')
axes[0,1].set_ylabel('Frequency')

# Business average stars
merged_df['business_average_stars'].hist(bins=20, ax=axes[0,2])
axes[0,2].set_title('Business Average Stars')
axes[0,2].set_xlabel('Average Stars')
axes[0,2].set_ylabel('Frequency')

# User review count (x-axis capped at the 95th percentile)
merged_df['user_review_count'].hist(bins=50, ax=axes[1,0], range=(0, merged_df['user_review_count'].quantile(0.95)))
axes[1,0].set_title('User Review Count (95th percentile)')
axes[1,0].set_xlabel('Review Count')
axes[1,0].set_ylabel('Frequency')

# Business review count (x-axis capped at the 95th percentile)
merged_df['business_review_count'].hist(bins=50, ax=axes[1,1], range=(0, merged_df['business_review_count'].quantile(0.95)))
axes[1,1].set_title('Business Review Count (95th percentile)')
axes[1,1].set_xlabel('Review Count')
axes[1,1].set_ylabel('Frequency')

# Review year distribution
merged_df['date'].dt.year.value_counts().sort_index().plot(kind='bar', ax=axes[1,2])
axes[1,2].set_title('Reviews by Year')
axes[1,2].set_xlabel('Year')
axes[1,2].set_ylabel('Count')

plt.tight_layout()
plt.show()

In [None]:
# Check for missing values
missing_data = merged_df.isnull().sum()
missing_percent = (missing_data / len(merged_df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percentage': missing_percent
}).sort_values('Missing Count', ascending=False)

print("Missing Values Analysis:")
print("=" * 40)
display(missing_df[missing_df['Missing Count'] > 0])

# Visualize missing values
if missing_data.sum() > 0:
    plt.figure(figsize=(12, 6))
    missing_data[missing_data > 0].sort_values(ascending=True).plot(kind='barh')
    plt.title('Missing Values by Column')
    plt.xlabel('Number of Missing Values')
    plt.show()
else:
    print("✓ No missing values found in the dataset!")

## Section 3: Feature Engineering

### Learning Objectives
- Understand feature engineering concepts
- Create time-based features
- Engineer elite status features
- Handle missing values appropriately
- Visualize feature distributions and correlations

### What We'll Do
1. **Time Features**: Calculate `time_yelping` (weeks since user joined)
2. **Elite Features**: Count total elite statuses and check current elite status
3. **Missing Value Handling**: Impute missing values with appropriate strategies
4. **Feature Analysis**: Explore correlations and distributions

Feature engineering transforms raw data into meaningful predictors that capture the underlying patterns in user behavior and business characteristics.
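
The time and elite features can be sketched on two toy rows. The values are made up, and measuring `time_yelping` up to the review date (rather than up to today) is an assumption about what `feature_engineering_pipeline()` does. The elite logic mirrors `count_elite_statuses` and `check_elite_status` from the setup cell, rewritten here so the block runs on its own.

```python
import pandas as pd

# Toy rows; column names follow the features used in this notebook,
# the values are made up for illustration.
df = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-10", "2019-03-02"]),
    "yelping_since": pd.to_datetime(["2017-01-10", "2019-02-02"]),
    "elite": ["2016,2017", None],
})

# Review year, used by both elite features.
df["date_year"] = df["date"].dt.year

# Weeks between the user joining Yelp and writing the review (assumed definition).
df["time_yelping"] = (df["date"] - df["yelping_since"]).dt.days / 7


def elite_years(value):
    """Parse '2016,2017' into [2016, 2017]; missing values give []."""
    if pd.isna(value) or value == "":
        return []
    return [int(y) for y in str(value).split(",") if y.strip()]


# Number of elite years up to and including the review year.
df["total_elite_statuses"] = [
    sum(1 for y in elite_years(e) if y <= year)
    for e, year in zip(df["elite"], df["date_year"])
]

# 1 if the user was elite in the review year or the year before it.
df["elite_status"] = [
    int(year in elite_years(e) or (year - 1) in elite_years(e))
    for e, year in zip(df["elite"], df["date_year"])
]
print(df[["time_yelping", "total_elite_statuses", "elite_status"]])
```

Counting only elite years up to the review year keeps information from after the review was written out of the features.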

In [None]:
# Run feature engineering pipeline
print("Starting feature engineering pipeline...")

try:
    featured_df = feature_engineering_pipeline(merged_df)
    print("\n✓ Feature engineering completed successfully!")

    # Show new features added
    new_features = [col for col in featured_df.columns if col not in merged_df.columns]
    print(f"\nAdded {len(new_features)} new features:")
    for i, feature in enumerate(new_features, 1):
        print(f"{i}. {feature}")

except Exception as e:
    print(f"✗ Error during feature engineering: {e}")
    print("Please check the previous steps and try again.")

In [None]:
# Display feature engineering results
print("Feature Engineering Results:")
print("=" * 50)
print(f"Original features: {len(merged_df.columns)}")
print(f"Engineered features: {len(featured_df.columns)}")
print(f"New features added: {len(featured_df.columns) - len(merged_df.columns)}")

# Show statistics for new features
print("\nNew Feature Statistics:")
display(featured_df[new_features].describe())

# Show sample of engineered data
print("\nSample of Engineered Data:")
display(featured_df[['stars', 'user_average_stars', 'business_average_stars', 'time_yelping', 'elite_status', 'total_elite_statuses']].head())

In [None]:
# Visualize engineered features
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Engineered Feature Distributions', fontsize=16)

# Time yelping distribution
featured_df['time_yelping'].hist(bins=50, ax=axes[0,0])
axes[0,0].set_title('Time Yelping (weeks)')
axes[0,0].set_xlabel('Weeks')
axes[0,0].set_ylabel('Frequency')

# Elite status distribution
featured_df['elite_status'].value_counts().sort_index().plot(kind='bar', ax=axes[0,1])
axes[0,1].set_title('Elite Status Distribution')
axes[0,1].set_xlabel('Elite Status (0=No, 1=Yes)')
axes[0,1].set_ylabel('Count')

# Total elite statuses
featured_df['total_elite_statuses'].value_counts().sort_index().plot(kind='bar', ax=axes[0,2])
axes[0,2].set_title('Total Elite Statuses')
axes[0,2].set_xlabel('Number of Elite Years')
axes[0,2].set_ylabel('Count')

# Date year distribution
featured_df['date_year'].value_counts().sort_index().plot(kind='bar', ax=axes[1,0])
axes[1,0].set_title('Reviews by Year')
axes[1,0].set_xlabel('Year')
axes[1,0].set_ylabel('Count')

# Correlation heatmap for key features
corr_features = ['stars', 'user_average_stars', 'business_average_stars', 'time_yelping', 'elite_status', 'total_elite_statuses']
corr_matrix = featured_df[corr_features].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', ax=axes[1,1])
axes[1,1].set_title('Feature Correlations')

# Scatter plot: time_yelping vs stars
axes[1,2].scatter(featured_df['time_yelping'], featured_df['stars'], alpha=0.1)
axes[1,2].set_title('Time Yelping vs Star Rating')
axes[1,2].set_xlabel('Time Yelping (weeks)')
axes[1,2].set_ylabel('Stars')

plt.tight_layout()
plt.show()

### Interactive: Feature Importance Preview

Let's explore which features might be most predictive of the star rating using a simple correlation analysis.

In [None]:
# Feature correlation with target
# numeric_only=True skips text and ID columns, which would otherwise raise
# an error in recent pandas versions.
target_correlations = featured_df.corr(numeric_only=True)['stars'].abs().sort_values(ascending=False)

print("Feature Correlations with Target (Stars):")
print("=" * 50)
for feature, corr in target_correlations.items():
    if feature != 'stars':
        print(f"{feature:25s}: {corr:.4f}")

# Visualize top correlations
plt.figure(figsize=(12, 8))
top_features = target_correlations.head(15).index[1:]  # Exclude 'stars' itself
top_corrs = target_correlations.head(15).values[1:]
bars = plt.barh(range(len(top_features)), top_corrs)
plt.yticks(range(len(top_features)), top_features)
plt.xlabel('Absolute Correlation with Stars')
plt.title('Top Feature Correlations with Target')
plt.grid(axis='x', alpha=0.3)

# Color bars by correlation strength
for i, (bar, corr) in enumerate(zip(bars, top_corrs)):
    if corr > 0.3:
        bar.set_color('darkred')
    elif corr > 0.2:
        bar.set_color('red')
    elif corr > 0.1:
        bar.set_color('orange')
    else:
        bar.set_color('lightblue')

plt.show()

## Section 4: Sentiment Analysis

### Learning Objectives
- Understand sentiment analysis with transformers
- Learn text preprocessing techniques
- Explore sentiment score distributions
- Analyze sentiment vs rating relationships

### What We'll Do
1. **Text Preprocessing**: Smart truncation to handle long reviews
2. **Model Loading**: Initialize DistilBERT sentiment classifier
3. **Batch Processing**: Process reviews in batches with progress tracking
4. **Score Normalization**: Convert to [-1, 1] scale
5. **Analysis**: Explore sentiment patterns and correlations

Sentiment analysis extracts emotional tone from review text, providing a quantitative measure of user satisfaction beyond star ratings.
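
To make the score normalization step concrete: a binary sentiment classifier such as the SST-2 DistilBERT model returns a label and a confidence for that label. One plausible mapping to [-1, 1] is to sign the confidence by the label. The sketch below assumes that convention; `normalize_sentiment` is a hypothetical helper, and `sentiment_analysis_pipeline()` may normalize differently.

```python
def normalize_sentiment(label: str, score: float) -> float:
    """Map a (label, confidence) pair to a signed score in [-1, 1].

    Assumed convention: POSITIVE -> +score, NEGATIVE -> -score.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if label == "POSITIVE":
        return score
    if label == "NEGATIVE":
        return -score
    raise ValueError(f"unexpected label: {label!r}")


# Hypothetical classifier outputs, shaped like the list of
# {"label", "score"} dicts a transformers sentiment pipeline returns.
outputs = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.75},
]
normalized = [normalize_sentiment(o["label"], o["score"]) for o in outputs]
print(normalized)
```

Under this convention the score encodes both direction and confidence, so values near zero are rare: a binary classifier's confidence in its predicted label is at least 0.5.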

In [None]:
# Run sentiment analysis pipeline
print("Starting sentiment analysis pipeline...")
print("This will take considerable time (10-30 minutes) depending on your hardware.")
print("The progress bar will show the processing status.")

try:
    sentiment_df = sentiment_analysis_pipeline(featured_df)
    print("\n✓ Sentiment analysis completed successfully!")

    # Show sentiment columns added
    sentiment_cols = ['sentiment_label', 'sentiment_score_raw', 'normalized_sentiment_score']
    print(f"\nAdded sentiment columns: {sentiment_cols}")

except Exception as e:
    print(f"✗ Error during sentiment analysis: {e}")
    print("This step requires significant computational resources.")
    print("Consider using a machine with GPU support for faster processing.")

In [None]:
# Display sentiment analysis results
print("Sentiment Analysis Results:")
print("=" * 50)
display(sentiment_df[sentiment_cols].describe())

# Show sample with sentiment
print("\nSample Reviews with Sentiment:")
sample_cols = ['text', 'stars', 'sentiment_label', 'normalized_sentiment_score']
display(sentiment_df[sample_cols].head())

# Sentiment label distribution
print("\nSentiment Label Distribution:")
print(sentiment_df['sentiment_label'].value_counts())

In [None]:
# Visualize sentiment distributions
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Sentiment Analysis Visualizations', fontsize=16)

# Sentiment score distribution
sentiment_df['normalized_sentiment_score'].hist(bins=50, ax=axes[0,0], alpha=0.7)
axes[0,0].set_title('Normalized Sentiment Score Distribution')
axes[0,0].set_xlabel('Sentiment Score')
axes[0,0].set_ylabel('Frequency')
axes[0,0].axvline(0, color='red', linestyle='--', alpha=0.7, label='Neutral')
axes[0,0].legend()

# Sentiment by star rating
sentiment_by_stars = sentiment_df.groupby('stars')['normalized_sentiment_score'].mean()
sentiment_by_stars.plot(kind='bar', ax=axes[0,1])
axes[0,1].set_title('Average Sentiment by Star Rating')
axes[0,1].set_xlabel('Stars')
axes[0,1].set_ylabel('Average Sentiment Score')

# Sentiment vs stars scatter
axes[0,2].scatter(sentiment_df['stars'], sentiment_df['normalized_sentiment_score'], alpha=0.1)
axes[0,2].set_title('Sentiment Score vs Star Rating')
axes[0,2].set_xlabel('Stars')
axes[0,2].set_ylabel('Sentiment Score')

# Text length vs sentiment
text_lengths = sentiment_df['text'].str.len()
axes[1,0].scatter(text_lengths, sentiment_df['normalized_sentiment_score'], alpha=0.1)
axes[1,0].set_title('Text Length vs Sentiment Score')
axes[1,0].set_xlabel('Text Length')
axes[1,0].set_ylabel('Sentiment Score')

# Sentiment label distribution
sentiment_df['sentiment_label'].value_counts().plot(kind='pie', ax=axes[1,1], autopct='%1.1f%%')
axes[1,1].set_title('Sentiment Label Distribution')

# Correlation between sentiment and stars
corr_sentiment_stars = sentiment_df[['stars', 'normalized_sentiment_score']].corr()
sns.heatmap(corr_sentiment_stars, annot=True, cmap='coolwarm', ax=axes[1,2])
axes[1,2].set_title('Correlation: Stars vs Sentiment')

plt.tight_layout()
plt.show()

## Section 5: Feature Selection

### Learning Objectives
- Understand feature selection techniques
- Learn about best subset selection
- Evaluate feature importance
- Compare model performance with different feature sets

### What We'll Do
1. **Data Preparation**: Select candidate features and target
2. **Best Subset Selection**: Exhaustive search for optimal feature combinations
3. **Model Evaluation**: Compare performance across different feature sets
4. **Final Selection**: Choose the best performing feature subset

Feature selection identifies the most predictive variables, reducing dimensionality and improving model interpretability.
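
Best subset selection can be sketched as a loop over every non-empty combination of candidate features, scoring each on held-out data. The example below is a sketch under stated assumptions: the data is synthetic, and ordinary least squares with validation MSE is a stand-in scoring rule, since the criterion used inside `feature_selection_pipeline()` is not shown in this notebook.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: the target depends on features "a" and "c" only.
names = ["a", "b", "c"]
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.1, size=500)

# Hold out the last 100 rows so subsets are compared on unseen data.
X_train, X_val = X[:400], X[400:]
y_train, y_val = y[:400], y[400:]


def validation_mse(cols):
    """Fit least squares with an intercept on `cols`; return validation MSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train[:, cols]])
    A_val = np.column_stack([np.ones(len(X_val)), X_val[:, cols]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_val @ coef - y_val) ** 2))


# Score every non-empty subset: 2**3 - 1 = 7 candidates here.
results = {}
for k in range(1, len(names) + 1):
    for cols in combinations(range(len(names)), k):
        results[tuple(names[i] for i in cols)] = validation_mse(list(cols))

best = min(results, key=results.get)
print(len(results), best)
```

The search grows exponentially: the nine entries in `CANDIDATE_FEATURES` give 2^9 - 1 = 511 subsets, which is still cheap for a linear scorer but quickly becomes expensive if each subset requires training a neural network.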

In [None]:
# Run feature selection pipeline
print("Starting feature selection pipeline...")
print("This involves exhaustive search over feature combinations.")
print("Processing time depends on the number of candidate features.")

try:
    final_df, optimal_features = feature_selection_pipeline(sentiment_df)
    print("\n✓ Feature selection completed successfully!")

    print(f"\nSelected {len(optimal_features)} optimal features:")
    for i, feature in enumerate(optimal_features, 1):
        print(f"{i}. {feature}")

    print(f"\nFinal dataset shape: {final_df.shape}")

except Exception as e:
    print(f"✗ Error during feature selection: {e}")
    print("Please check the previous steps and try again.")

In [None]:
# Display final dataset
print("Final Dataset Overview:")
print("=" * 40)
print(f"Shape: {final_df.shape}")
print(f"Features: {list(final_df.columns)}")
display(final_df.head())

# Statistics for final features
print("\nFinal Feature Statistics:")
display(final_df.describe())

## Section 6: Model Training and Evaluation

### Learning Objectives
- Understand neural network training
- Learn about stratified sampling and train/validation/test splits
- Evaluate regression model performance
- Interpret training metrics and learning curves

### What We'll Do
1. **Data Preparation**: Stratify and split data
2. **Model Architecture**: Initialize PyTorch neural network
3. **Training**: Train with early stopping and validation
4. **Evaluation**: Assess performance on test set
5. **Visualization**: Plot training progress and predictions

We'll train a neural network to predict star ratings from our engineered features.
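
The data preparation step fixes how many rows each stage sees. The arithmetic below uses the defaults from `stratify_and_split`, `prepare_train_test_data`, and `training_pipeline` in the setup cell, and assumes every star rating has at least `samples_per_class` reviews available; otherwise the smaller classes contribute fewer rows.

```python
# Defaults taken from the training functions defined earlier in this notebook.
target_size = 130_000
n_classes = 5
test_fraction = 0.2
val_fraction = 0.2

samples_per_class = target_size // n_classes   # rows kept per star rating
total = samples_per_class * n_classes          # assumes every class is large enough

n_test = round(total * test_fraction)          # held out for final evaluation
n_train_full = total - n_test
n_val = round(n_train_full * val_fraction)     # used for early stopping
n_train = n_train_full - n_val

print(samples_per_class, n_train, n_val, n_test)
```

Note that the train/test split is stratified by `stars`, while the later train/validation split in `training_pipeline()` is not, so class counts in the validation set are only approximately equal.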

In [None]:
# Run training pipeline
print("Starting model training pipeline...")
print("This will train a neural network for star rating prediction.")

try:
    training_results = training_pipeline()
    print("\n✓ Model training completed successfully!")

    # Display results (metrics are computed on the held-out test set)
    metrics = training_results['metrics']
    print("\nTest Set Results:")
    print(f"MSE: {metrics['mse']:.4f}")
    print(f"MAE: {metrics['mae']:.4f}")
    print(f"R²: {metrics['r2']:.4f}")

    print(f"\nModel saved to: {training_results['model_path']}")
    print(f"Scaler saved to: {training_results['scaler_path']}")

except Exception as e:
    print(f"✗ Error during training: {e}")
    print("Please check the previous steps and try again.")

## Section 7: Inference and Predictions

### Learning Objectives
- Learn model loading and inference
- Understand prediction preprocessing
- Create custom prediction examples
- Interpret model outputs

### What We'll Do
1. **Model Loading**: Load trained model and scaler
2. **Example Creation**: Build prediction examples
3. **Inference**: Make predictions on new data
4. **Interpretation**: Understand prediction results

Now let's use our trained model to make predictions on new examples.
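
Prediction preprocessing has to reuse the scaling learned during training. The sketch below reproduces the min-max formula by hand on a made-up training matrix to show what the saved `MinMaxScaler` does to a new example, including one that falls outside the training range.

```python
import numpy as np

# Hypothetical training matrix: columns are
# [user_average_stars, time_yelping (weeks)].
X_train = np.array([
    [1.0, 0.0],
    [3.0, 100.0],
    [5.0, 400.0],
])

# A fitted min-max scaler stores the per-column minimum and range.
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min


def scale(x):
    """Apply the training-set min-max transform to new rows."""
    return (np.asarray(x, dtype=float) - col_min) / col_range


inside = scale([[4.0, 200.0]])    # within the training range
outside = scale([[4.0, 600.0]])   # time_yelping beyond the training maximum
print(inside, outside)
```

With scikit-learn's default `clip=False`, values beyond the training range map outside [0, 1] rather than being clipped, so the network may be asked to extrapolate for unusual inputs.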

In [None]:
# Load model and scaler for inference
import torch
import pickle
from src.model import YelpRatingPredictor

try:
    # Load model
    model = YelpRatingPredictor(input_size=len(optimal_features))
    model.load_state_dict(torch.load(training_results['model_path'], map_location='cpu'))
    model.eval()

    # Load scaler
    with open(training_results['scaler_path'], 'rb') as f:
        scaler = pickle.load(f)

    print("✓ Model and scaler loaded successfully")

except Exception as e:
    print(f"✗ Error loading model: {e}")
    print("Please ensure training completed successfully.")

In [None]:
# Create prediction examples
examples = [
    {
        'user_average_stars': 4.5,
        'business_average_stars': 4.2,
        'time_yelping': 100.0,
        'elite_status': 1,
        'normalized_sentiment_score': 0.8
    },
    {
        'user_average_stars': 3.0,
        'business_average_stars': 3.5,
        'time_yelping': 25.0,
        'elite_status': 0,
        'normalized_sentiment_score': -0.3
    },
    {
        'user_average_stars': 4.0,
        'business_average_stars': 4.8,
        'time_yelping': 200.0,
        'elite_status': 1,
        'normalized_sentiment_score': 0.9
    }
]

# Make predictions
predictions = []
for i, example in enumerate(examples, 1):
    # Keep only the optimal features, in the column order the scaler was fitted on.
    # A missing feature raises a KeyError here instead of silently misaligning columns.
    filtered_input = {k: example[k] for k in optimal_features}
    input_df = pd.DataFrame([filtered_input], columns=optimal_features)

    # Scale input
    scaled_input = scaler.transform(input_df.values)
    input_tensor = torch.FloatTensor(scaled_input)

    # Predict
    with torch.no_grad():
        prediction = model(input_tensor).item()
        predictions.append(prediction)

    print(f"Example {i}: Predicted rating = {prediction:.2f}")
    print(f"  Input: {filtered_input}")
    print()

## Section 8: Analysis and Insights

### Learning Objectives
- Analyze model performance in depth
- Understand feature contributions
- Identify model limitations
- Explore what-if scenarios

### What We'll Do
1. **Performance Analysis**: Deep dive into metrics
2. **Error Analysis**: Understand prediction errors
3. **Feature Importance**: Analyze which features matter most
4. **Limitations**: Discuss model assumptions and constraints
5. **Future Work**: Suggest improvements

Let's analyze our model's behavior and performance characteristics.
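
Error analysis usually starts by breaking the overall metrics down by true rating. The sketch below uses ten made-up predictions, not output from the trained model, to show how MSE, MAE, and R² are computed and how per-class error and bias can expose a model that pulls extreme ratings toward the middle.

```python
import numpy as np

# Hypothetical true ratings and model outputs for ten test reviews.
y_true = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)
y_pred = np.array([2.0, 1.5, 2.5, 2.0, 3.0, 3.5, 3.5, 4.0, 4.0, 4.5])

errors = y_pred - y_true

# Overall metrics, using the standard definitions.
mse = float(np.mean(errors ** 2))
mae = float(np.mean(np.abs(errors)))
r2 = 1.0 - float(np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2))

# Mean absolute error and mean signed error (bias) per true star rating.
per_class = {
    int(s): {
        "mae": float(np.mean(np.abs(errors[y_true == s]))),
        "bias": float(np.mean(errors[y_true == s])),
    }
    for s in np.unique(y_true)
}
print(mse, mae, r2)
print(per_class)
```

In this toy data the bias is positive for 1-star reviews and negative for 5-star reviews, a pattern worth checking for in any regressor trained with a squared-error loss.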

In [None]:
# Load data for analysis
try:
    # Load the final model data (the file training_pipeline() reads before
    # stratified downsampling); the variable name is kept for the cells below
    stratified_df = pd.read_csv('data/processed/final_model_data.csv')

    # Load optimal features
    with open('data/processed/optimal_features.json', 'r') as f:
        optimal_features = json.load(f)

    print("✓ Analysis data loaded successfully")

except Exception as e:
    print(f"✗ Error loading analysis data: {e}")

In [None]:
# Performance analysis
rmse = metrics['mse'] ** 0.5

print("Model Performance Analysis:")
print("=" * 50)
print(f"Mean Squared Error (MSE): {metrics['mse']:.4f}")
print(f"Mean Absolute Error (MAE): {metrics['mae']:.4f}")
print(f"R² Score: {metrics['r2']:.4f}")
print(f"RMSE: {rmse:.4f}")

# Interpretation (the R² thresholds are rough rules of thumb)
print("\nInterpretation:")
if metrics['r2'] > 0.7:
    print("✓ Excellent performance (R² > 0.7)")
elif metrics['r2'] > 0.5:
    print("✓ Good performance (R² > 0.5)")
elif metrics['r2'] > 0.3:
    print("✓ Moderate performance (R² > 0.3)")
else:
    print("⚠ Limited performance - consider feature engineering or model improvements")

print(f"\nOn average, predictions are off by {metrics['mae']:.2f} stars.")
# 1.96 × RMSE approximates a 95% error band only if errors are roughly
# normal and unbiased; 1.96 × MAE has no such interpretation.
print(f"Approximate 95% error band (assuming normal, unbiased errors): ±{rmse * 1.96:.2f} stars")

In [None]:
# Feature importance analysis
print("\nFeature Analysis:")
print("=" * 30)
print(f"Selected optimal features ({len(optimal_features)}):")
for i, feature in enumerate(optimal_features, 1):
    print(f"{i}. {feature}")

# Correlation analysis
feature_corrs = stratified_df[optimal_features + ['stars']].corr()['stars'].abs().sort_values(ascending=False)
print("\nFeature correlations with target:")
for feature in optimal_features:
    corr = feature_corrs[feature]
    print(f"{feature:25s}: {corr:.4f}")

# Visualize feature correlations
plt.figure(figsize=(10, 6))
feature_corrs[optimal_features].sort_values().plot(kind='barh')
plt.title('Feature Correlations with Star Rating')
plt.xlabel('Absolute Correlation')
plt.grid(axis='x', alpha=0.3)
plt.show()

In [None]:
# Model limitations and insights
print("\nModel Limitations and Insights:")
print("=" * 40)
print("1. **Data Scope**: Model trained on Yelp Academic Dataset only")
print("2. **Feature Limitations**: Predictions based on available user/business features")
print("3. **Sentiment Context**: Text analysis may miss nuanced sentiment")
print("4. **Temporal Factors**: Model doesn't account for trends over time")
print("5. **Geographic Bias**: Results may not generalize to all locations")

print("\nKey Insights:")
print("• User history (average stars, elite status) strongly predicts ratings")
print("• Business quality is a major factor")
print("• Experience level (time yelping) influences rating patterns")
print("• Review sentiment provides additional predictive power")

print("\nFuture Improvements:")
print("• Incorporate temporal trends and seasonality")
print("• Add geographic and demographic features")
print("• Use more advanced NLP models for sentiment")
print("• Implement ensemble methods for better performance")
print("• Add uncertainty quantification to predictions")

## Summary

Congratulations! You've successfully completed the interactive Yelp rating prediction pipeline. Here's what we accomplished:

### Pipeline Stages Completed:
1. ✅ **Data Loading & Preprocessing**: Loaded and cleaned Yelp dataset
2. ✅ **Feature Engineering**: Created time-based and elite status features
3. ✅ **Sentiment Analysis**: Extracted sentiment scores from review text
4. ✅ **Feature Selection**: Identified optimal feature subset
5. ✅ **Model Training**: Trained neural network for rating prediction
6. ✅ **Inference**: Demonstrated model predictions
7. ✅ **Analysis**: Explored model performance and insights

### Key Learnings:
- **Data preprocessing** is crucial for model performance
- **Feature engineering** transforms raw data into predictive features
- **Sentiment analysis** adds valuable text-derived insights
- **Feature selection** improves model efficiency and interpretability
- **Neural networks** can effectively model complex relationships

### Model Performance:
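- **MSE**: printed by the training cell in Section 6 as `metrics['mse']`
- **MAE**: printed by the training cell in Section 6 as `metrics['mae']`
- **R²**: printed by the training cell in Section 6 as `metrics['r2']`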

### Files Created:
- `data/processed/merged_data.csv`: Preprocessed dataset
- `data/processed/featured_data.csv`: Engineered features
- `data/processed/sentiment_data.csv`: With sentiment scores
- `data/processed/final_model_data.csv`: Final training data
- `models/best_model.pt`: Trained PyTorch model
- `models/scaler.pkl`: Feature scaler
- `outputs/metrics.json`: Test-set evaluation metrics

This notebook demonstrates a complete machine learning pipeline from raw data to a trained and evaluated model. You can now apply these techniques to other prediction problems!