# Predicting Star Ratings from Review Data
This notebook predicts star ratings from the numeric vote counts in the review data (`useful`, `funny`, `cool`) and the length of the review text. The steps are data loading and inspection, feature engineering, visualization, model training, evaluation, and saving the trained model.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import joblib

## Load and Inspect Data
First, we load the data and inspect its structure to understand what we're working with.

In [2]:
# Load the data
df = pd.read_csv('/Datasets/Yelp_Reviews/reviews.csv')

# Display the first few rows of the dataset
df.head()

## Check for Missing Values
Count the missing values in each column to see whether any need handling before feature engineering and modeling.

In [3]:
# Check for missing values
df.isnull().sum()
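
The check above only reports missing values; it does not remove or fill them. One possible way to handle them is sketched below on a small made-up frame. The column names mirror the ones used later in this notebook, but the data and the choice to drop rows or fill text are assumptions for illustration, not a step taken on the real dataset.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the review data; all values are made up.
sample = pd.DataFrame({
    'stars': [5, 3, np.nan, 1],
    'useful': [0, 2, 1, np.nan],
    'funny': [0, 0, 1, 0],
    'cool': [1, 0, 0, 0],
    'text': ['Great food', None, 'Okay', 'Bad service'],
})

# Rows without a target or a numeric feature cannot be used for training.
required = ['stars', 'useful', 'funny', 'cool']
cleaned = sample.dropna(subset=required).copy()

# Missing text is treated as an empty review rather than dropped.
cleaned['text'] = cleaned['text'].fillna('')

print(cleaned.isnull().sum())
```

Dropping rows is the simplest option when few values are missing; if many rows were affected, imputing the counts might be preferable.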

## Feature Engineering
Create a new feature for the length of the review text.

In [4]:
# Feature engineering: review length in characters.
# Missing text is filled with '' first, because len() fails on NaN values.
df['review_length'] = df['text'].fillna('').str.len()

# Display the first few rows of the updated dataset
df[['text', 'review_length']].head()

## Data Visualization
Visualize the distribution of the review length and examine its relationship with the star rating.

In [5]:
# Set the style for the plots
sns.set_theme(style='whitegrid')

# Plot the distribution of review lengths
plt.figure(figsize=(10, 6))
sns.histplot(df['review_length'], kde=True)
plt.title('Distribution of Review Lengths')
plt.xlabel('Review Length (characters)')
plt.ylabel('Frequency')
plt.show()

In [6]:
# Plot review length vs. star rating
plt.figure(figsize=(10, 6))
sns.boxplot(x='stars', y='review_length', data=df)
plt.title('Review Length vs. Star Rating')
plt.xlabel('Star Rating')
plt.ylabel('Review Length (characters)')
plt.show()
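
A numeric summary can complement the boxplot. The sketch below uses a small made-up sample (the values are illustrative, not taken from the dataset) to show how review length could be summarized per star rating and how a simple Pearson correlation might be computed.

```python
import pandas as pd

# Made-up sample with the same column names as the notebook's DataFrame.
sample = pd.DataFrame({
    'stars': [1, 1, 3, 3, 5, 5],
    'review_length': [900, 700, 500, 300, 200, 100],
})

# Median, mean, and number of reviews for each star rating.
summary = sample.groupby('stars')['review_length'].agg(['median', 'mean', 'count'])
print(summary)

# Pearson correlation between review length and rating (linear trend only).
r = sample['review_length'].corr(sample['stars'])
print(f'Correlation between length and stars: {r:.2f}')
```

On the real data, the same two calls could be run on `df` directly. A correlation close to zero would suggest that review length alone carries little linear signal about the rating.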

## Feature Selection and Transformation
Prepare the features and target variable for modeling.

In [7]:
# Select features and target variable
features = df[['useful', 'funny', 'cool', 'review_length']]
target = df['stars']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

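# Standardize the features (zero mean, unit variance).
# The scaler is fit on the training set only, so no test-set statistics leak into training.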
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
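
To make the scaling step concrete, the following sketch applies `StandardScaler` to randomly generated data (an assumption for illustration; the `_demo` names are hypothetical and separate from the notebook's variables). It checks that the training columns end up with mean 0 and standard deviation 1, and that `transform` reuses the statistics learned from the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up feature matrices: four columns on very different scales.
X_train_demo = rng.normal(loc=[1, 0, 2, 500], scale=[2, 1, 3, 300], size=(200, 4))
X_test_demo = rng.normal(loc=[1, 0, 2, 500], scale=[2, 1, 3, 300], size=(50, 4))

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(X_train_demo)
test_scaled = scaler_demo.transform(X_test_demo)

# Each training column now has mean 0 and standard deviation 1.
print(train_scaled.mean(axis=0).round(6))
print(train_scaled.std(axis=0).round(6))

# transform() applies the stored training mean and scale to new data.
manual = (X_test_demo - scaler_demo.mean_) / scaler_demo.scale_
print(np.allclose(test_scaled, manual))
```

Scaling matters little for tree-based models such as random forests, but it keeps the linear regression coefficients comparable across features.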

## Model Building and Evaluation
Train a linear regression and a random forest regressor on the standardized training features, then compare their mean squared error (MSE) and R² on the test set.

In [8]:
# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=42)
}

# Train and evaluate models
results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results[name] = {'MSE': mse, 'R2': r2}

# Display results
results_df = pd.DataFrame(results).T
results_df
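
MSE and R² are easier to interpret next to a trivial baseline. The sketch below is an illustration on made-up ratings, not a result from this dataset. It scores a `DummyRegressor` that always predicts the mean training rating and converts MSE to RMSE, which is expressed in stars. The `_demo` arrays are hypothetical stand-ins for `y_train` and `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Made-up ratings standing in for y_train and y_test.
y_train_demo = np.array([5, 4, 4, 1, 3, 5, 2, 5])
y_test_demo = np.array([4, 5, 1, 3])

# The baseline ignores the features and always predicts the training mean.
baseline = DummyRegressor(strategy='mean')
baseline.fit(np.zeros((len(y_train_demo), 1)), y_train_demo)
y_base = baseline.predict(np.zeros((len(y_test_demo), 1)))

mse = mean_squared_error(y_test_demo, y_base)
rmse = np.sqrt(mse)  # same units as the rating itself (stars)
r2 = r2_score(y_test_demo, y_base)
print(f'Baseline MSE: {mse:.3f}, RMSE: {rmse:.3f}, R2: {r2:.3f}')
```

A model is only useful to the extent that it beats this baseline: an R² near zero means it predicts about as well as the mean rating.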

## Save the Model
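Save the trained Random Forest model for future use. The model was trained on standardized features, so new inputs must be transformed with the same fitted `scaler` before calling `predict`.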

In [9]:
# Save the Random Forest model
joblib.dump(models['Random Forest'], 'random_forest_model.pkl')
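
Because the saved model expects standardized inputs, one option is to persist the scaler and the model together. The sketch below shows this with a scikit-learn pipeline trained on randomly generated data. The data, the file name `rating_pipeline.pkl`, and the reduced `n_estimators` are assumptions for illustration, not part of the workflow above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Made-up stand-ins for the four features and the star ratings.
X_demo = rng.integers(0, 1000, size=(100, 4)).astype(float)
y_demo = rng.integers(1, 6, size=100)

# Bundling the scaler with the model keeps preprocessing and prediction together.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=20, random_state=42),
)
pipeline.fit(X_demo, y_demo)

# Hypothetical file name, written to a temporary directory for this demo.
path = os.path.join(tempfile.mkdtemp(), 'rating_pipeline.pkl')
joblib.dump(pipeline, path)

# Later, or in another process: load and predict on raw, unscaled features.
loaded = joblib.load(path)
new_reviews = np.array([[3.0, 0.0, 1.0, 250.0]])
print(loaded.predict(new_reviews))
```

Files written by `joblib.dump` are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that created them.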