# Online Shoppers Purchasing Intention - Exploratory Data Analysis

**Course**: Python for Data Science - Guided Machine Learning  
**Week**: 1 - Exploratory Data Analysis

---

## Objective

This notebook performs exploratory data analysis on the Online Shoppers Purchasing Intention dataset to:
1. Understand the dataset structure and quality
2. Identify missing values and data issues
3. Analyze the class distribution (Revenue)
4. Explore relationships between features and purchase behavior
5. Document a preprocessing plan for Week 2

## 1. Import Libraries

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 10

# Create the plot output folder if it does not exist yet (savefig fails otherwise)
os.makedirs('../data/plots', exist_ok=True)

print("Libraries imported successfully!")

## 2. Load Dataset

In [None]:
# Load the dataset
df = pd.read_csv('../data/online_shoppers_intention.csv')

print("Dataset loaded successfully!")
print(f"\nDataset shape: {df.shape}")
print(f"Number of rows: {df.shape[0]:,}")
print(f"Number of columns: {df.shape[1]}")

## 3. Dataset Overview

In [None]:
# Display first few rows
print("First 5 rows of the dataset:")
df.head()

In [None]:
# Display column names
print("Column Names:")
print("=" * 50)
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

In [None]:
# Display data types and non-null counts
print("Dataset Information:")
print("=" * 50)
df.info()

In [None]:
# Display data types summary
print("\nData Types Summary:")
print("=" * 50)
print(df.dtypes)

## 4. Missing Value Analysis

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_percentage = (df.isnull().sum() / len(df)) * 100

missing_df = pd.DataFrame({
    'Column': df.columns,
    'Missing Count': missing_values.values,
    'Missing Percentage': missing_percentage.values
})

missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

print("Missing Value Analysis:")
print("=" * 60)
if len(missing_df) == 0:
    print("✅ No missing values found in the dataset!")
else:
    print(missing_df.to_string(index=False))

print(f"\nTotal missing values: {df.isnull().sum().sum()}")

## 5. Statistical Summary

In [None]:
# Statistical summary of numerical features
print("Statistical Summary of Numerical Features:")
df.describe()

## 6. Class Distribution Analysis (Revenue)

**Key Question**: How balanced is our target variable?

In [None]:
# Analyze Revenue distribution
revenue_counts = df['Revenue'].value_counts()
revenue_percentages = df['Revenue'].value_counts(normalize=True) * 100

print("Revenue Distribution:")
print("=" * 50)
print(f"No Purchase (FALSE): {revenue_counts[False]:,} ({revenue_percentages[False]:.2f}%)")
print(f"Purchase (TRUE):     {revenue_counts[True]:,} ({revenue_percentages[True]:.2f}%)")
print(f"\n⚠️  CLASS IMBALANCE DETECTED!")
print(f"Purchase rate is only {revenue_percentages[True]:.2f}%")
print(f"This justifies the use of SMOTE for handling class imbalance.")

In [None]:
# Visualize class distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
colors = ['#e74c3c', '#2ecc71']
revenue_counts.plot(kind='bar', ax=axes[0], color=colors, edgecolor='black', alpha=0.8)
axes[0].set_title('Revenue Distribution (Count)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Revenue', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_xticklabels(['No Purchase', 'Purchase'], rotation=0)
axes[0].grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(revenue_counts.values):
    axes[0].text(i, v + 200, f'{v:,}', ha='center', va='bottom', fontweight='bold')

# Pie chart
axes[1].pie(revenue_counts.values, labels=['No Purchase', 'Purchase'], autopct='%1.1f%%',
            colors=colors, startangle=90, explode=(0.05, 0.05), shadow=True)
axes[1].set_title('Revenue Distribution (Percentage)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('../data/plots/01_class_distribution.png', dpi=300, bbox_inches='tight')
print("✅ Plot saved: data/plots/01_class_distribution.png")
plt.show()

## 7. Purchase Rate by Visitor Type

In [None]:
# Analyze purchase rate by visitor type
visitor_revenue = pd.crosstab(df['VisitorType'], df['Revenue'], normalize='index') * 100

print("Purchase Rate by Visitor Type:")
print("=" * 50)
print(visitor_revenue)

In [None]:
# Visualize purchase rate by visitor type
fig, ax = plt.subplots(figsize=(12, 6))

visitor_revenue.plot(kind='bar', ax=ax, color=['#e74c3c', '#2ecc71'], 
                      edgecolor='black', alpha=0.8)
ax.set_title('Purchase Rate by Visitor Type', fontsize=14, fontweight='bold')
ax.set_xlabel('Visitor Type', fontsize=12)
ax.set_ylabel('Percentage (%)', fontsize=12)
ax.set_xticklabels(visitor_revenue.index, rotation=45, ha='right')
ax.legend(['No Purchase', 'Purchase'], title='Revenue')
ax.grid(axis='y', alpha=0.3)

# Add percentage labels
for container in ax.containers:
    ax.bar_label(container, fmt='%.1f%%', padding=3)

plt.tight_layout()
plt.savefig('../data/plots/02_purchase_by_visitor_type.png', dpi=300, bbox_inches='tight')
print("✅ Plot saved: data/plots/02_purchase_by_visitor_type.png")
plt.show()

## 8. Purchase Rate by Month

In [None]:
# Analyze purchase rate by month
month_revenue = pd.crosstab(df['Month'], df['Revenue'], normalize='index') * 100

# Sort by month order
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_revenue = month_revenue.reindex([m for m in month_order if m in month_revenue.index])

print("Purchase Rate by Month:")
print("=" * 50)
print(month_revenue)

In [None]:
# Visualize purchase rate by month
fig, ax = plt.subplots(figsize=(14, 6))

month_revenue[True].plot(kind='bar', ax=ax, color='#3498db', 
                          edgecolor='black', alpha=0.8)
ax.set_title('Purchase Rate by Month', fontsize=14, fontweight='bold')
ax.set_xlabel('Month', fontsize=12)
ax.set_ylabel('Purchase Rate (%)', fontsize=12)
ax.set_xticklabels(month_revenue.index, rotation=45, ha='right')
ax.grid(axis='y', alpha=0.3)

# Add percentage labels
for i, v in enumerate(month_revenue[True].values):
    ax.text(i, v + 0.5, f'{v:.1f}%', ha='center', va='bottom', fontweight='bold')

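# Add horizontal line for the unweighted mean of the monthly purchase rates
# (not the overall purchase rate, since months differ in session counts)
avg_purchase_rate = month_revenue[True].mean()
ax.axhline(y=avg_purchase_rate, color='red', linestyle='--', linewidth=2, 
           label=f'Mean of monthly rates: {avg_purchase_rate:.1f}%')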
ax.legend()

plt.tight_layout()
plt.savefig('../data/plots/03_purchase_by_month.png', dpi=300, bbox_inches='tight')
print("✅ Plot saved: data/plots/03_purchase_by_month.png")
plt.show()

## 9. Distribution of Key Numerical Features

In [None]:
# Analyze distributions of key numerical features
key_features = ['PageValues', 'BounceRates', 'ExitRates']

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for i, feature in enumerate(key_features):
    # Histogram (50 bins)
    axes[i].hist(df[feature], bins=50, color='#9b59b6', alpha=0.7, edgecolor='black')
    axes[i].set_title(f'Distribution of {feature}', fontsize=12, fontweight='bold')
    axes[i].set_xlabel(feature, fontsize=10)
    axes[i].set_ylabel('Frequency', fontsize=10)
    axes[i].grid(axis='y', alpha=0.3)
    
    # Add statistics
    mean_val = df[feature].mean()
    median_val = df[feature].median()
    axes[i].axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.2f}')
    axes[i].axvline(median_val, color='green', linestyle='--', linewidth=2, label=f'Median: {median_val:.2f}')
    axes[i].legend()

plt.tight_layout()
plt.savefig('../data/plots/04_numerical_distributions.png', dpi=300, bbox_inches='tight')
print("✅ Plot saved: data/plots/04_numerical_distributions.png")
plt.show()

In [None]:
# Box plots by Revenue
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for i, feature in enumerate(key_features):
    df.boxplot(column=feature, by='Revenue', ax=axes[i], 
               patch_artist=True, 
               boxprops=dict(facecolor='lightblue', alpha=0.7),
               medianprops=dict(color='red', linewidth=2))
    axes[i].set_title(f'{feature} by Revenue', fontsize=12, fontweight='bold')
    axes[i].set_xlabel('Revenue', fontsize=10)
    axes[i].set_ylabel(feature, fontsize=10)
    axes[i].set_xticklabels(['No Purchase', 'Purchase'])
    axes[i].grid(axis='y', alpha=0.3)

plt.suptitle('')  # Remove the default title
plt.tight_layout()
plt.savefig('../data/plots/05_boxplots_by_revenue.png', dpi=300, bbox_inches='tight')
print("✅ Plot saved: data/plots/05_boxplots_by_revenue.png")
plt.show()

## 10. Correlation Analysis

In [None]:
# Select numerical columns only
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Convert Revenue to numeric for correlation
df_corr = df.copy()
df_corr['Revenue'] = df_corr['Revenue'].astype(int)

# Calculate correlation matrix
correlation_matrix = df_corr[numerical_cols + ['Revenue']].corr()

# Get correlations with Revenue
revenue_correlations = correlation_matrix['Revenue'].sort_values(ascending=False)
print("Correlation with Revenue:")
print("=" * 50)
print(revenue_correlations)

In [None]:
# Visualize correlation heatmap
plt.figure(figsize=(16, 12))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap - Focus on Revenue', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig('../data/plots/06_correlation_heatmap.png', dpi=300, bbox_inches='tight')
print("✅ Plot saved: data/plots/06_correlation_heatmap.png")
plt.show()

In [None]:
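# Top correlations with Revenue (bar plot), ranked by absolute value so that
# strong negative correlations are not cut off
top_n = 10
feature_correlations = revenue_correlations.drop('Revenue')  # Exclude Revenue itself
top_correlations = feature_correlations.loc[
    feature_correlations.abs().sort_values(ascending=False).index[:top_n]
]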

plt.figure(figsize=(12, 6))
colors = ['#2ecc71' if x > 0 else '#e74c3c' for x in top_correlations.values]
top_correlations.plot(kind='barh', color=colors, edgecolor='black', alpha=0.8)
plt.title(f'Top {top_n} Features Correlated with Revenue', fontsize=14, fontweight='bold')
plt.xlabel('Correlation Coefficient', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.8)
plt.grid(axis='x', alpha=0.3)

# Add value labels
for i, v in enumerate(top_correlations.values):
    plt.text(v + 0.01 if v > 0 else v - 0.01, i, f'{v:.3f}', 
             va='center', ha='left' if v > 0 else 'right', fontweight='bold')

plt.tight_layout()
plt.savefig('../data/plots/07_top_correlations.png', dpi=300, bbox_inches='tight')
print("✅ Plot saved: data/plots/07_top_correlations.png")
plt.show()

## 11. Key Insights Summary

In [None]:
print("="*70)
print("KEY INSIGHTS FROM EXPLORATORY DATA ANALYSIS")
print("="*70)

print("\n1. DATASET OVERVIEW:")
print(f"   - Total Records: {df.shape[0]:,}")
print(f"   - Total Features: {df.shape[1]}")
print(f"   - Missing Values: {df.isnull().sum().sum()} ({df.isnull().sum().sum() / df.size * 100:.2f}%)")

print("\n2. CLASS IMBALANCE (CRITICAL):")
purchase_rate = (df['Revenue'].sum() / len(df)) * 100
print(f"   - Purchase Rate: {purchase_rate:.2f}%")
print(f"   - No Purchase Rate: {100 - purchase_rate:.2f}%")
print(f"   - Imbalance Ratio: ~1:{(100 - purchase_rate) / purchase_rate:.1f} (purchase : no purchase)")
print(f"   ⚠️  SMOTE will be ESSENTIAL for handling this imbalance!")

print("\n3. VISITOR TYPE INSIGHTS:")
for visitor_type in df['VisitorType'].unique():
    vt_purchase_rate = (df[df['VisitorType'] == visitor_type]['Revenue'].sum() / 
                        len(df[df['VisitorType'] == visitor_type])) * 100
    print(f"   - {visitor_type}: {vt_purchase_rate:.2f}% purchase rate")

print("\n4. TOP CORRELATED FEATURES WITH REVENUE:")
for i, (feature, corr) in enumerate(revenue_correlations[1:6].items(), 1):
    print(f"   {i}. {feature}: {corr:.3f}")

print("\n5. SEASONAL PATTERNS:")
best_month = month_revenue[True].idxmax()
worst_month = month_revenue[True].idxmin()
print(f"   - Highest purchase rate: {best_month} ({month_revenue[True].max():.2f}%)")
print(f"   - Lowest purchase rate: {worst_month} ({month_revenue[True].min():.2f}%)")

print("\n" + "="*70)
print("EDA COMPLETE - Ready for Preprocessing Planning!")
print("="*70)

---

# 📝 Preprocessing Plan for Week 2

## Overview

Based on the EDA findings, the following preprocessing pipeline will be implemented using **scikit-learn Pipeline** to ensure reproducibility and prevent data leakage.

---

## Step 1: Handle Missing Values

**Finding**: No missing values detected in the current dataset.

**Strategy** (for robustness in case of missing values in future data):
- **Numerical features**: Impute with **mean** or **median** (median preferred for skewed distributions)
- **Categorical features**: Impute with **mode** (most frequent value)

**Implementation**:
```python
from sklearn.impute import SimpleImputer

# Numerical imputer
num_imputer = SimpleImputer(strategy='median')

# Categorical imputer
cat_imputer = SimpleImputer(strategy='most_frequent')
```
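
To illustrate why the median is the safer default for skewed columns, here is a minimal sketch on made-up numbers. The toy array is an assumption for illustration and is not taken from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy, made-up column with one missing value and one extreme outlier
toy = np.array([[1.0], [2.0], [3.0], [np.nan], [100.0]])

mean_filled = SimpleImputer(strategy='mean').fit_transform(toy)
median_filled = SimpleImputer(strategy='median').fit_transform(toy)

# The observed values are 1, 2, 3 and 100: their mean is 26.5, their median 2.5
print("Mean-imputed value:  ", mean_filled[3, 0])
print("Median-imputed value:", median_filled[3, 0])
```

The single outlier pulls the mean far away from the typical values, while the median stays close to them.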

---

## Step 2: Encode Categorical Variables

**Categorical Features Identified**:
- `Month` (Jan, Feb, Mar, etc.)
- `VisitorType` (New_Visitor, Returning_Visitor, Other)
- `Weekend` (Boolean - already binary)

**Strategy**:
- Use **One-Hot Encoding** for `Month` and `VisitorType`
- Convert `Weekend` and `Revenue` to binary (0/1)

**Implementation**:
```python
from sklearn.preprocessing import OneHotEncoder

categorical_features = ['Month', 'VisitorType']
encoder = OneHotEncoder(drop='first', sparse_output=False)
```

**Note**: `drop='first'` prevents multicollinearity by dropping one category.

---

## Step 3: Scale Numerical Features

**Numerical Features** (different scales observed):
- Page-related: `Administrative`, `Informational`, `ProductRelated`, etc.
- Duration-related: `Administrative_Duration`, `Informational_Duration`, etc.
- Behavior metrics: `BounceRates`, `ExitRates`, `PageValues`

**Strategy**:
- Use **StandardScaler** (z-score normalization)
- Transforms features to have mean=0 and std=1
- Important for distance-based algorithms (KNN, SVM) and for regularized or gradient-based models such as Logistic Regression

**Implementation**:
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
```
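
The transformation itself is simple enough to check by hand. The following sketch, on made-up values, compares `StandardScaler` with the z-score formula `z = (x - mean) / std`. The toy matrix is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values on very different scales (e.g. a duration and a rate)
X_toy = np.array([[120.0, 0.02],
                  [300.0, 0.05],
                  [ 60.0, 0.20],
                  [900.0, 0.01]])

scaled = StandardScaler().fit_transform(X_toy)

# Same result by hand, using the population standard deviation (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print("Matches manual z-score:", np.allclose(scaled, manual))
print("Column means:", scaled.mean(axis=0).round(6))
print("Column stds: ", scaled.std(axis=0).round(6))
```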

---

## Step 4: Train-Test Split

**Strategy**:
- **Split ratio**: 80% training, 20% testing
- **Stratification**: Preserve the same class proportions in both sets
- **Random state**: Set for reproducibility

**Implementation**:
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    stratify=y, 
    random_state=42
)
```
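
To check what `stratify` does, the sketch below splits synthetic labels with an assumed 15% positive rate and compares the class proportions of the two parts. The arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 15% positive rate (an assumed round number)
y_toy = np.array([1] * 150 + [0] * 850)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(f"Train size: {len(y_tr)}, positive rate: {y_tr.mean():.3f}")
print(f"Test size:  {len(y_te)}, positive rate: {y_te.mean():.3f}")
```

Without `stratify`, the positive rate in a small test set can drift away from the overall rate purely by chance.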

---

## Step 5: Handle Class Imbalance with SMOTE

**Critical Finding**: Purchase rate is only **~15%** → Severe class imbalance!

**Strategy**:
- Apply **SMOTE (Synthetic Minority Over-sampling Technique)**
- Generates synthetic samples for minority class (purchases)
- **IMPORTANT**: Apply SMOTE **only on training data** to prevent data leakage

**Implementation**:
```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
```

**Why SMOTE?**
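- Reduces the model's bias toward the majority class
- Typically improves recall for the minority class (purchase prediction), often at some cost in precision
- Unlike simple random oversampling, it interpolates new minority samples instead of duplicating existing rows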
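
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not imblearn's implementation: the function name `smote_sketch`, its parameters, and the toy points are all hypothetical, and the points are assumed to be distinct.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(X_min))                  # pick a minority sample
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]       # k nearest, skipping itself
        j = rng.choice(neighbours)                    # pick one neighbour at random
        gap = rng.random()                            # position along the segment
        synthetic[row] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

# Made-up minority-class points in two dimensions
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
new_points = smote_sketch(X_min, n_new=5)
print(new_points)
```

Each synthetic point lies on the line segment between two real minority samples, which is why SMOTE adds variety rather than exact copies.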

---

## Step 6: Build Scikit-Learn Pipeline

**Integration Strategy**:
- Combine all preprocessing steps into a single pipeline
- Ensures consistent transformations during training and testing
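- Prevents data leakage: the imputers, encoder, and scaler are fit on the training split only and then applied unchanged to the test split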

**Implementation**:
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

# Define numerical and categorical columns
# (Revenue and Weekend are boolean, so the dtype check below skips them)
numerical_features = [col for col in df.columns if df[col].dtype in ['int64', 'float64']]
categorical_features = ['Month', 'VisitorType']

# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), numerical_features),
        
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(drop='first', sparse_output=False))
        ]), categorical_features),

        # Keep Weekend as a 0/1 column; unlisted columns are dropped by default
        ('bin', 'passthrough', ['Weekend'])
    ]
)

# Full pipeline (preprocessing + model)
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())  # Placeholder for Week 2
])
```

---

## Preprocessing Workflow Summary

```
Raw Data
   ↓
1. Handle Missing Values (if any)
   ↓
2. Encode Categorical Variables (One-Hot)
   ↓
3. Scale Numerical Features (StandardScaler)
   ↓
4. Train-Test Split (80/20, stratified)
   ↓
5. Apply SMOTE (on training data only)
   ↓
Ready for Model Training (Week 2)
```

---

## Additional Considerations for Week 2

1. **Feature Engineering**:
   - Create interaction features (e.g., `BounceRates * ExitRates`)
   - Binning of continuous variables if needed

2. **Alternative Scaling Methods** (if needed):
   - MinMaxScaler for neural networks
   - RobustScaler for outlier-heavy features

3. **Cross-Validation**:
   - Use StratifiedKFold (5-10 folds) for robust evaluation

4. **Evaluation Metrics** (for imbalanced data):
   - **Precision, Recall, F1-Score** (more important than accuracy)
   - **ROC-AUC** and **Precision-Recall AUC**
   - **Confusion Matrix**
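
To make the point about accuracy concrete, here is a small sketch with made-up labels: a "model" that always predicts the majority class on data with an assumed 15% purchase rate. All values are illustrative and are not results from this dataset.

```python
import numpy as np

# Made-up labels: 15 purchases out of 100 sessions (assumed for illustration)
y_true = np.array([1] * 15 + [0] * 85)
# A "model" that always predicts the majority class (no purchase)
y_pred = np.zeros_like(y_true)

# Confusion-matrix counts for the positive (purchase) class
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())

accuracy = float((y_pred == y_true).mean())
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"Accuracy:  {accuracy:.2f}")   # 0.85
print(f"Precision: {precision:.2f}")  # 0.00
print(f"Recall:    {recall:.2f}")     # 0.00
print(f"F1-Score:  {f1:.2f}")         # 0.00
```

The accuracy looks respectable even though not a single purchase is detected, which is why recall, precision, and F1 carry more weight here.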

---

## ✅ Week 1 Deliverables Complete

1. ✅ Dataset loaded and explored
2. ✅ Missing value analysis completed (none found)
3. ✅ Class imbalance identified and quantified (~15% purchase rate)
4. ✅ EDA visualizations created and saved
5. ✅ Preprocessing plan documented
6. ✅ Ready for Week 2 implementation

**Next Steps**: Implement the preprocessing pipeline and begin model training in Week 2!