# Customer Behavior Predictive Analysis

## Predicting Future Customer Behavior

In this notebook, we'll focus on two key prediction problems:
1. Predicting customer purchase frequency in the next month
2. Forecasting customer lifetime value (CLV)

These predictions will help in:
- Inventory planning
- Marketing campaign optimization
- Resource allocation
- Customer retention strategies

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from datetime import datetime, timedelta

# Set random seed for reproducibility
np.random.seed(42)

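# Load and prepare the data
# Note: clean_data() is defined in the first notebook and must be defined or imported
# here before this cell will run. The cells below also expect it to provide a
# 'TotalAmount' column.
df = pd.read_excel('Online Retail.xlsx')
df_clean = clean_data(df)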
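
The cleaning function itself is not shown in this notebook. As a rough stand-in only, the sketch below assumes that cleaning means dropping rows without a `CustomerID`, dropping non-positive quantities and prices, and computing `TotalAmount` as `Quantity * UnitPrice`. The name `clean_data_sketch` and these rules are assumptions for illustration, not the first notebook's actual implementation.

```python
import pandas as pd

def clean_data_sketch(df):
    """Hypothetical stand-in for the first notebook's clean_data()."""
    # Drop rows that cannot be attributed to a customer
    df = df.dropna(subset=['CustomerID']).copy()
    # Keep only positive quantities and prices (assumed rule: removes returns and adjustments)
    df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)].copy()
    df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
    # Line-item revenue used by the feature functions later in this notebook
    df['TotalAmount'] = df['Quantity'] * df['UnitPrice']
    return df

# Toy rows: one valid, one without a customer, one return, one valid
raw = pd.DataFrame({
    'InvoiceNo': ['A1', 'A2', 'C3', 'A4'],
    'CustomerID': [1.0, None, 2.0, 2.0],
    'Quantity': [2, 1, -1, 3],
    'UnitPrice': [2.5, 4.0, 3.0, 1.0],
    'InvoiceDate': ['2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08'],
})
cleaned = clean_data_sketch(raw)
print(cleaned[['InvoiceNo', 'CustomerID', 'TotalAmount']])
```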

### 1. Feature Engineering for Predictive Models

In [None]:
def create_customer_features(df):
    # Convert InvoiceDate to datetime
    df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

    # Reference date for recency: the most recent invoice in the data
    latest_date = df['InvoiceDate'].max()

    # Recency, frequency, and monetary value (RFM), plus volume and variety.
    # Each row of the Online Retail data is one invoice line, so the 'mean'
    # aggregates below are averages per line item.
    customer_features = df.groupby('CustomerID').agg({
        'InvoiceDate': lambda x: (latest_date - x.max()).days,  # Recency (days since last purchase)
        'InvoiceNo': 'nunique',  # Frequency (distinct invoices, not line items)
        'TotalAmount': ['sum', 'mean'],  # Monetary
        'Quantity': ['sum', 'mean'],  # Purchase volume
        'Description': 'nunique'  # Product variety
    })

    # Flatten the two-level column names produced by the multi-function aggregation
    customer_features.columns = ['Recency', 'Frequency', 'TotalSpent',
                                 'AvgTransactionValue', 'TotalItems',
                                 'AvgItemsPerTransaction', 'ProductVariety']

    return customer_features

# Create features
customer_features = create_customer_features(df_clean)

print("Feature Summary:")
print("-" * 50)
print(customer_features.describe())
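
To make the frequency definition concrete, here is a small sketch on made-up data showing how counting rows differs from counting distinct invoices when one invoice spans several line items. The toy values and the `LineItems`/`Invoices` column names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data: customer 1 has two invoices (three line items), customer 2 has one
toy = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2],
    'InvoiceNo': ['A1', 'A1', 'A2', 'B1'],
    'InvoiceDate': pd.to_datetime(['2011-01-05', '2011-01-05', '2011-03-01', '2011-02-10']),
    'TotalAmount': [10.0, 5.0, 20.0, 8.0],
})

latest_date = toy['InvoiceDate'].max()

rfm = toy.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (latest_date - x.max()).days),  # days since last purchase
    LineItems=('InvoiceNo', 'count'),    # number of rows per customer
    Invoices=('InvoiceNo', 'nunique'),   # number of distinct invoices per customer
    TotalSpent=('TotalAmount', 'sum'),
)
print(rfm)
```

For customer 1, `LineItems` and `Invoices` differ, which is why the choice of aggregate matters when "frequency" is meant to count purchases.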

### 2. Predicting Purchase Frequency

In [None]:
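# Prepare data for purchase frequency modeling
# Note: the features and the target are computed over the same period, so this
# model fits historical frequency rather than next-month frequency. A true
# forecast needs features from an earlier window and a target from a later one.
X = customer_features[['Recency', 'TotalSpent', 'AvgTransactionValue',
                       'TotalItems', 'ProductVariety']]
y = customer_features['Frequency']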

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = rf_model.predict(X_test_scaled)

# Evaluate model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Purchase Frequency Prediction Results:")
print("-" * 50)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")

# Feature importance
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='Importance', y='Feature')
plt.title('Feature Importance for Purchase Frequency Prediction')
plt.show()
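
Predicting next-month frequency requires the target to come from a later time window than the features. The following is a minimal sketch of that idea on toy data; the function name `build_forecast_frame`, the cutoff date, and the 30-day horizon are assumptions chosen for illustration rather than part of the analysis above.

```python
import pandas as pd

def build_forecast_frame(df, cutoff, horizon_days=30):
    """Features from invoices before `cutoff`; target = distinct invoices in the next window."""
    cutoff = pd.Timestamp(cutoff)
    horizon_end = cutoff + pd.Timedelta(days=horizon_days)

    past = df[df['InvoiceDate'] < cutoff]
    future = df[(df['InvoiceDate'] >= cutoff) & (df['InvoiceDate'] < horizon_end)]

    features = past.groupby('CustomerID').agg(
        Recency=('InvoiceDate', lambda x: (cutoff - x.max()).days),
        PastInvoices=('InvoiceNo', 'nunique'),
        PastSpent=('TotalAmount', 'sum'),
    )
    target = future.groupby('CustomerID')['InvoiceNo'].nunique()

    # Customers with no purchase in the target window get a frequency of 0
    features['NextPeriodInvoices'] = target.reindex(features.index).fillna(0).astype(int)
    return features

# Hypothetical toy transactions
toy = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2, 2],
    'InvoiceNo': ['A1', 'A2', 'A3', 'B1', 'B2'],
    'InvoiceDate': pd.to_datetime(['2011-01-10', '2011-02-20', '2011-03-05',
                                   '2011-02-01', '2011-05-15']),
    'TotalAmount': [10.0, 20.0, 15.0, 8.0, 12.0],
})
frame = build_forecast_frame(toy, cutoff='2011-03-01')
print(frame)
```

A frame built this way could replace `X` and `y` above, with `NextPeriodInvoices` as the target, so that the reported scores describe forecasting rather than fitting the past.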

### 3. Customer Lifetime Value (CLV) Prediction

In [None]:
def prepare_clv_features(df):
    # Customer age here is the span between first and last purchase,
    # in months (30-day approximation)
    customer_history = df.groupby('CustomerID').agg({
        'InvoiceDate': lambda x: (x.max() - x.min()).days / 30,  # Customer age in months
        'TotalAmount': ['sum', 'mean'],
        'InvoiceNo': 'nunique',  # Distinct invoices, not line items
        'Quantity': ['sum', 'mean']
    })

    customer_history.columns = ['CustomerAge', 'TotalRevenue', 'AvgTransactionValue',
                                'TransactionCount', 'TotalItems', 'AvgItemsPerTransaction']

    # Customers who bought on a single day have an age of 0. Floor the age at
    # one month so the monthly rates stay finite (dividing by 0 would give inf,
    # which the scaler and model reject).
    active_months = customer_history['CustomerAge'].clip(lower=1)

    # Calculate monthly metrics
    customer_history['MonthlyPurchaseRate'] = customer_history['TransactionCount'] / active_months
    customer_history['MonthlyRevenue'] = customer_history['TotalRevenue'] / active_months

    return customer_history

# Prepare data for CLV prediction
clv_features = prepare_clv_features(df_clean)

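# Define features and target
# Note: the target is historical total revenue, not future 6-month revenue.
# MonthlyRevenue combined with CustomerAge largely determines TotalRevenue, so
# the scores below are optimistic. A forward-looking CLV model needs a target
# measured in a later time window than the features.
X_clv = clv_features[['CustomerAge', 'MonthlyPurchaseRate', 'AvgTransactionValue',
                      'MonthlyRevenue', 'AvgItemsPerTransaction']]
y_clv = clv_features['TotalRevenue']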

# Split and scale data
X_train_clv, X_test_clv, y_train_clv, y_test_clv = train_test_split(X_clv, y_clv, test_size=0.2, random_state=42)

scaler_clv = StandardScaler()
X_train_clv_scaled = scaler_clv.fit_transform(X_train_clv)
X_test_clv_scaled = scaler_clv.transform(X_test_clv)

# Train CLV prediction model
clv_model = RandomForestRegressor(n_estimators=100, random_state=42)
clv_model.fit(X_train_clv_scaled, y_train_clv)

# Make predictions
y_pred_clv = clv_model.predict(X_test_clv_scaled)

# Evaluate CLV model
mse_clv = mean_squared_error(y_test_clv, y_pred_clv)
r2_clv = r2_score(y_test_clv, y_pred_clv)

print("CLV Prediction Results:")
print("-" * 50)
print(f"Mean Squared Error: {mse_clv:.2f}")
print(f"R-squared Score: {r2_clv:.2f}")
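
Two pitfalls in the CLV features are worth checking on a small example: a zero customer age makes the monthly rates infinite, and a monthly revenue feature multiplied by the age reproduces the total revenue target. The sketch below uses made-up numbers, and the one-month floor is one possible guard among several, not a recommendation specific to this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical customers; customer 102 bought on a single day, so the age is 0
history = pd.DataFrame({
    'CustomerAge': [6.0, 0.0, 3.0],
    'TotalRevenue': [600.0, 50.0, 90.0],
}, index=pd.Index([101, 102, 103], name='CustomerID'))

# Unguarded division yields inf for the zero-age customer
raw_rate = history['TotalRevenue'] / history['CustomerAge']

# Flooring the age at one month keeps the rate finite
safe_age = history['CustomerAge'].clip(lower=1)
history['MonthlyRevenue'] = history['TotalRevenue'] / safe_age

# The rate times the age multiplies back to the target: a sign of target leakage
reconstructed = history['MonthlyRevenue'] * safe_age

print("Infinite raw rates:", int(np.isinf(raw_rate).sum()))
print("Target recoverable from features:", np.allclose(reconstructed, history['TotalRevenue']))
```

When the target can be recovered from the features like this, a high R-squared score says little about how well the model would estimate future value.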

### 4. Model Application and Business Insights

In [None]:
# Function to identify high-potential customers
def identify_high_potential_customers(features, clv_predictions, threshold_percentile=90):
    high_potential = pd.DataFrame({
        'CustomerID': features.index,
        'Predicted_CLV': clv_predictions
    })
    
    threshold = np.percentile(clv_predictions, threshold_percentile)
    high_potential['High_Potential'] = high_potential['Predicted_CLV'] >= threshold
    
    return high_potential

# Make predictions for all customers
all_features_scaled = scaler_clv.transform(X_clv)
all_predictions = clv_model.predict(all_features_scaled)

# Identify high-potential customers
high_potential_customers = identify_high_potential_customers(X_clv, all_predictions)

print("High-Potential Customer Analysis:")
print("-" * 50)
print(f"Number of high-potential customers identified: {high_potential_customers['High_Potential'].sum()}")
print("\nSample of high-potential customers:")
print(high_potential_customers[high_potential_customers['High_Potential']].head())

# Visualize CLV distribution
plt.figure(figsize=(10, 6))
sns.histplot(data=high_potential_customers, x='Predicted_CLV')
plt.axvline(x=np.percentile(all_predictions, 90), color='r', linestyle='--', 
            label='High Potential Threshold')
plt.title('Distribution of Predicted Customer Lifetime Value')
plt.xlabel('Predicted CLV')
plt.ylabel('Count')
plt.legend()
plt.show()
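
The percentile cutoff used above can be checked by hand on a short array. This sketch uses ten made-up predicted values and NumPy's default linear interpolation.

```python
import numpy as np

# Hypothetical predicted CLV values for ten customers
predicted_clv = np.array([100, 120, 150, 180, 200, 250, 300, 400, 800, 2000], dtype=float)

# 90th percentile with NumPy's default (linear) interpolation
threshold = np.percentile(predicted_clv, 90)

# Same rule as identify_high_potential_customers: at or above the threshold
is_high_potential = predicted_clv >= threshold

print(f"Threshold: {threshold:.1f}")
print(f"Customers flagged: {int(is_high_potential.sum())}")
```

With ten values the threshold is interpolated between the two largest predictions, so it need not equal any observed value, and only the single largest prediction is flagged here.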

## Predictive Analytics Summary

Our predictive analysis has focused on two key aspects:

1. **Purchase Frequency Prediction**:
   - Model Performance: the R-squared score measures the share of variance in purchase frequency that the model explains on the held-out test set
   - Key predictive features identified through feature importance analysis
   - Can be used for inventory planning and marketing campaign timing

2. **Customer Lifetime Value Prediction**:
   - Flagged the top 10% of customers by predicted CLV as high-potential
   - Created a framework for future value estimation
   - Can be used for customer segmentation and targeted marketing
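
As a reference point for reading the scores above, R-squared is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares, so a model that always predicts the mean scores 0. The sketch below illustrates this with made-up numbers, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and model predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
y_model = np.array([1.5, 2.0, 2.5, 4.5, 8.0])

# Baseline: always predict the mean of the true values
y_mean = np.full_like(y_true, y_true.mean())

r2_model = r2_score(y_true, y_model)
r2_baseline = r2_score(y_true, y_mean)

# RMSE is in the same units as the target, which makes it easier to interpret than MSE
rmse_model = np.sqrt(mean_squared_error(y_true, y_model))

print(f"Model R-squared: {r2_model:.3f}")
print(f"Mean-baseline R-squared: {r2_baseline:.3f}")
print(f"Model RMSE: {rmse_model:.3f}")
```

Reporting a simple baseline next to the model's score makes it easier to judge whether the model adds anything beyond predicting the average.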

### Business Applications

1. **Inventory Management**:
   - Use purchase frequency predictions for stock planning
   - Optimize inventory levels based on predicted demand

2. **Marketing Optimization**:
   - Target high-potential customers with specialized campaigns
   - Adjust marketing spend based on predicted customer value

3. **Customer Retention**:
   - Use recency and purchase frequency to flag at-risk customers before they churn
   - Implement targeted retention strategies

4. **Resource Allocation**:
   - Focus resources on high-potential customer segments
   - Optimize customer service allocation