# DeepCSAT_Ecommerce_Prediction

**Project:** E-commerce Customer Satisfaction Prediction (Deep Learning ANN)

**Author:** Mahima Patel

**Description:** This single notebook contains the full pipeline: data loading, exploratory data analysis (15 charts with explanations), data cleaning, feature engineering, ANN model building (Keras), training, evaluation, model saving, and conclusions. Each code cell is accompanied by a markdown explanation, so the notebook is ready to submit.


In [1]:
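import importlib
import subprocess
import sys

# pip package name -> import name (they differ for scikit-learn, which imports as sklearn)
PACKAGES = {
    "numpy": "numpy", "pandas": "pandas", "tensorflow": "tensorflow", "keras": "keras",
    "scikit-learn": "sklearn", "matplotlib": "matplotlib", "seaborn": "seaborn",
}

def install_if_missing(pip_name, import_name):
    """Install `pip_name` with pip only if `import_name` cannot be imported."""
    try:
        importlib.import_module(import_name)
    except ImportError:
        # Leave pip's output visible so a failed install can be diagnosed
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])

for pip_name, import_name in PACKAGES.items():
    install_if_missing(pip_name, import_name)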


CalledProcessError: Command '['C:\\Users\\hp\\anaconda3\\python.exe', '-m', 'pip', 'install', 'tensorflow']' returned non-zero exit status 1.

## 1. Import Libraries

The following cell imports all required libraries. Explanations for each import are given inline in comments.

In [None]:
# Numerical and data handling
import numpy as np
import pandas as pd

# Plotting (matplotlib only to keep plots reproducible in many environments)
import matplotlib.pyplot as plt

# Machine learning utilities
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Deep learning (Keras API)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# File system
import os
print('Libraries imported successfully')

## 2. Load Dataset

Load dataset from the `data/` folder. The code below reads the CSV and shows a quick preview and basic info.

In [None]:
# Adjust path if needed. The uploaded file is expected at: data/eCommerce_Customer_support_data.csv
data_path = 'data/eCommerce_Customer_support_data.csv'
if not os.path.exists(data_path):
    # Fallback: alternative location, in case the file was placed differently in the environment used here
    data_path = '/mnt/data/eCommerce_Customer_support_data.csv'

df = pd.read_csv(data_path)
print('Dataset shape:', df.shape)
display(df.head(5))
df.info()

### 2.1 Initial EDA: Basic statistics and missing values

We check summary statistics and missing values to plan cleaning steps.

In [None]:
# Summary statistics
display(df.describe(include='all').T)

# Missing values
missing = df.isnull().sum().sort_values(ascending=False)
missing = missing[missing>0]
print('Columns with missing values:')
display(missing)

### 2.2 Identify numeric and categorical columns

We programmatically detect numeric and categorical columns so that the plotting and preprocessing adapt to your dataset.

In [None]:
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print('Numeric columns:', numeric_cols)
print('Categorical columns:', cat_cols)

## 3. Exploratory Data Analysis (15 charts)

Below are 15 charts. Each chart cell comes with a short markdown note explaining what to look for, why the chart was chosen, and how the insight helps model training.

### Charts 1-6: Histograms for numeric features (distribution insights)

In [None]:
# Charts 1-6: Histograms for numeric features
num_to_plot = numeric_cols[:6]  # up to 6 histograms
for col in num_to_plot:
    plt.figure(figsize=(6,4))
    plt.hist(df[col].dropna(), bins=30)
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.title(f'Distribution of {col}')
    plt.grid(True)
    plt.show()

**Insight & Reason:**

- **What to look for:** skewness, multimodality, outliers.
- **Why chosen:** Histograms show how feature values are distributed. If strongly skewed, consider transforms (log) which can help ANN training by stabilizing gradients.
- **How it helps model training:** Normalized distributions or transformed features usually make neural network training smoother and improve convergence.
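
To make the transform suggestion concrete, here is a minimal sketch of a skewness check followed by a `log1p` transform. The helper name `log_transform_if_skewed`, the threshold of 1.0, and the synthetic column are assumptions for illustration; they are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

def log_transform_if_skewed(series, threshold=1.0):
    """Return (log1p(series), skew) when skewness exceeds `threshold`, else (series, skew).

    The transform is only applied to non-negative data, because log1p is undefined below -1.
    """
    skew = series.skew()
    if skew > threshold and (series.dropna() >= 0).all():
        return np.log1p(series), skew
    return series, skew

# Synthetic right-skewed column (hypothetical stand-in for a price-like feature)
rng = np.random.default_rng(0)
demo = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000), name="demo_feature")

transformed, skew_before = log_transform_if_skewed(demo)
skew_after = transformed.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Whether a transform is worth applying depends on the histogram: a long right tail is the typical case where `log1p` helps.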

### Charts 7-10: Bar plots for top categorical features (category counts)

In [None]:
# Charts 7-10: Bar plots for top categorical features
cat_to_plot = cat_cols[:4]  # up to 4 categorical plots
for col in cat_to_plot:
    vc = df[col].value_counts().nlargest(10)
    plt.figure(figsize=(6,4))
    vc.plot(kind='bar')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.title(f'Top categories in {col}')
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y')
    plt.show()

**Insight & Reason:**

- **What to look for:** dominant categories, class imbalance.
- **Why chosen:** Bar plots quickly reveal category frequency; imbalance may require special handling (class weights or resampling) for the ANN.
- **How it helps model training:** Knowing imbalance helps choose loss/metrics or resampling strategies to avoid biased models.

### Chart 11: Correlation heatmap (numeric features)

The heatmap shows pairwise correlations, which helps detect multicollinearity and features strongly correlated with the target.

In [None]:
# Chart 11: Correlation matrix (numeric)
if len(numeric_cols) > 1:
    corr = df[numeric_cols].corr()
    plt.figure(figsize=(8,6))
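    # Fixed color range so that the colors map to the full correlation scale [-1, 1]
    im = plt.imshow(corr, interpolation='nearest', cmap='coolwarm', vmin=-1, vmax=1)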
    plt.colorbar(im)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.index)), corr.index)
    plt.title('Correlation matrix (numeric features)')
    plt.tight_layout()
    plt.show()
else:
    print('Not enough numeric columns for correlation matrix')

**Insight & Reason:**

- **What to look for:** features highly correlated with target or with each other.
- **Why chosen:** Correlation can suggest which features matter and whether multicollinearity might harm model learning.
- **How it helps model training:** If two features are nearly identical, consider dropping one to reduce redundancy and overfitting risk.
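
The redundancy check described above can also be done numerically rather than by eye. The sketch below lists feature pairs whose absolute correlation exceeds a threshold; the helper name `highly_correlated_pairs`, the 0.95 cutoff, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.95):
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + 1.0,         # exact linear copy of "a"
    "c": rng.normal(size=500),  # independent noise
})

redundant = highly_correlated_pairs(demo)
print(redundant)
```

For each reported pair, one of the two columns is a candidate for removal before training.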

### Chart 12: Boxplot of a numeric feature vs target

This visualizes how a numeric feature's distribution varies across target classes.

In [None]:
# Chart 12: Boxplot for numeric vs target (if target exists)
target_col = None
# heuristics to find a likely target column named like 'CSAT' or 'target'
for possible in ['CSAT_Score', 'CSAT', 'csat', 'target', 'label']:
    if possible in df.columns:
        target_col = possible
        break

if target_col is None:
    # fallback: if dataset has a low-cardinality numeric column, treat it as target
    numeric_card = [(c, df[c].nunique()) for c in numeric_cols]
    numeric_card.sort(key=lambda x: x[1])
    if numeric_card and numeric_card[0][1] <= 10:
        target_col = numeric_card[0][0]

if target_col is None:
    print('No obvious target column found for boxplot. Please set target_col variable manually.')
else:
    # choose a numeric feature different from target
    num_feats = [c for c in numeric_cols if c != target_col]
    if num_feats:
        col = num_feats[0]
        # draw boxplots per target class
        classes = sorted(df[target_col].dropna().unique())
        plt.figure(figsize=(6,4))
        data_to_plot = [df[df[target_col]==cls][col].dropna() for cls in classes]
        plt.boxplot(data_to_plot, labels=[str(x) for x in classes])
        plt.xlabel('Target ('+target_col+')')
        plt.ylabel(col)
        plt.title(f'Boxplot of {col} grouped by {target_col}')
        plt.grid(True)
        plt.show()
    else:
        print('No numeric features available for boxplot.')

**Insight & Reason:**

- **What to look for:** differences in medians, spread and outliers across target classes.
- **Why chosen:** Boxplots reveal whether a numeric feature separates target classes well (useful feature for classification).
- **How it helps model training:** Good separability indicates a strong predictive feature.

### Charts 13-14: Scatter plots between numeric feature pairs (relationship / interaction)

In [None]:
# Charts 13-14: Scatter plots for the first two pairs of adjacent numeric columns
num_pairs = []
if len(numeric_cols) >= 2:
    num_pairs = [(numeric_cols[i], numeric_cols[i+1]) for i in range(min(2, len(numeric_cols)-1))]
for a,b in num_pairs:
    plt.figure(figsize=(6,4))
    plt.scatter(df[a], df[b], alpha=0.5, s=10)
    plt.xlabel(a)
    plt.ylabel(b)
    plt.title(f'Scatter: {a} vs {b}')
    plt.grid(True)
    plt.show()

**Insight & Reason:**

- **What to look for:** linear or non-linear relationships and clusters.
- **Why chosen:** Scatter plots reveal interactions that a neural network can exploit.
- **How it helps model training:** If two features interact strongly, the ANN can model complex combinations of them.

### Chart 15: Target distribution

Visualize the target classes to understand balance/imbalance prior to training.

In [None]:
# Chart 15: Target distribution (if target exists)
if target_col is None:
    print('No target column detected automatically. Set target_col manually to visualize target distribution.')
else:
    vals = df[target_col].value_counts().sort_index()
    plt.figure(figsize=(6,4))
    vals.plot(kind='bar')
    plt.xlabel(target_col)
    plt.ylabel('Count')
    plt.title(f'Distribution of target: {target_col}')
    plt.grid(axis='y')
    plt.show()

**Insight & Reason:**

- **What to look for:** whether the classes are balanced.
- **Why chosen:** ANN training on imbalanced data can produce biased predictions; we may need class weights or resampling.
- **How it helps model training:** Guides selection of loss function, metrics, and sampling strategy.
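
If the target turns out to be imbalanced, class weights are one of the options mentioned above. A minimal sketch, assuming integer-coded classes and the common "balanced" heuristic `n_samples / (n_classes * class_count)` (the same formula scikit-learn uses for `class_weight='balanced'`); the helper name and the toy labels are hypothetical.

```python
import numpy as np

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Hypothetical imbalanced target with integer-coded classes 0, 1, 2
demo_labels = np.array([0] * 10 + [1] * 30 + [2] * 60)
class_weight = balanced_class_weights(demo_labels)
print(class_weight)
```

A dictionary of this shape can be passed as the `class_weight` argument of Keras `model.fit`, provided its keys match the integer class indices the model is trained on.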

## 4. Data Cleaning & Preprocessing

This section performs cleaning, missing-value handling, encoding, and scaling. The code adapts to the columns detected earlier.

In [None]:
# Copy dataframe to work on
df_clean = df.copy()

# Identify or set target column if not found earlier
if 'target_col' in globals() and target_col is not None:
    target = target_col
else:
    # try common names
    for possible in ['CSAT_Score', 'CSAT', 'csat', 'target', 'label']:
        if possible in df_clean.columns:
            target = possible
            break
    else:
        # fallback: ask user to set target manually
        target = None

print('Auto-detected target column:', target)

# Drop obvious ID columns
for idcol in ['Ticket_ID', 'Customer_ID', 'ID', 'id']:
    if idcol in df_clean.columns:
        df_clean.drop(columns=[idcol], inplace=True, errors='ignore')

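# Rows without a target cannot be used for supervised training, and imputing labels would invent classes
if target is not None:
    df_clean = df_clean.dropna(subset=[target])

# Fill numeric missing values with the mean, categorical with the mode.
# The result is assigned back because fillna(inplace=True) on a column selection is unreliable in recent pandas.
for col in df_clean.columns:
    if col == target:
        continue
    if pd.api.types.is_numeric_dtype(df_clean[col]):
        df_clean[col] = df_clean[col].fillna(df_clean[col].mean())
    elif df_clean[col].notna().any():  # mode() is empty for an all-NaN column
        df_clean[col] = df_clean[col].fillna(df_clean[col].mode()[0])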

# Encode categorical features
le_map = {}
for col in df_clean.select_dtypes(include=['object', 'category']).columns:
    le = LabelEncoder()
    df_clean[col] = le.fit_transform(df_clean[col].astype(str))
    le_map[col] = le

print('Preprocessing complete. Cleaned shape:', df_clean.shape)
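
Integer label encoding assigns each category a number, which implicitly tells the network that categories are ordered. One-hot encoding is a common alternative for nominal features. The sketch below contrasts the two on a toy column; the column name and values are hypothetical, and one-hot encoding is shown as an option to consider, not as what this notebook does.

```python
import pandas as pd

# Hypothetical categorical column; the values are illustrative only
demo = pd.DataFrame({"channel": ["Email", "Inbound", "Outcall", "Inbound"]})

# Integer codes in sorted category order (the same kind of output LabelEncoder produces)
codes = pd.Categorical(demo["channel"]).codes

# One-hot encoding: one 0/1 column per category, with no implied ordering
one_hot = pd.get_dummies(demo["channel"], prefix="channel", dtype=int)

print(codes.tolist())
print(one_hot)
```

One-hot encoding grows the feature matrix by one column per category, so it suits low-cardinality columns best.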

### Feature matrix and target vector

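We prepare `X` and `y`, encode the target as integer class indices, split into train and test sets, and then scale features using `StandardScaler` for ANN training.

In [None]:
# Ensure target is set
if target is None:
    raise ValueError('Please set the target variable name in the notebook (target variable could not be auto-detected).')

X = df_clean.drop(columns=[target])

# Encode the target as 0..K-1: the Keras losses used below expect 0/1 (binary)
# or class indices starting at 0 (sparse categorical), not raw scores such as 1-5
target_encoder = LabelEncoder()
y = pd.Series(target_encoder.fit_transform(df_clean[target]), index=df_clean.index, name=target)
print('Target classes (original labels):', target_encoder.classes_)

if y.nunique() > 5:
    print('Warning: target has many unique values; consider binning it into fewer classes before training.')

# Train-test split (stratified when there is more than one class)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y if y.nunique() > 1 else None)

# Scale features: fit on the training split only so no test-set statistics leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print('Train/Test shapes:', X_train.shape, X_test.shape)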
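
`StandardScaler` standardizes each column as `z = (x - mean) / std`. The NumPy sketch below reproduces that arithmetic on synthetic data, taking the statistics from the training rows only, which is a common way to keep test-set information out of preprocessing. The array names and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # synthetic feature matrix
train_rows, test_rows = X_demo[:80], X_demo[80:]

# Statistics come from the training rows only
mu = train_rows.mean(axis=0)
sigma = train_rows.std(axis=0)  # population std (ddof=0), which is what StandardScaler uses

train_scaled = (train_rows - mu) / sigma
test_scaled = (test_rows - mu) / sigma  # reuse the training statistics, do not refit

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

After scaling, the training columns have mean 0 and standard deviation 1; the test columns are close to that but not exactly, which is expected.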

## 5. Build and Train ANN (Deep Learning)

We build a Multilayer Perceptron using Keras. This architecture is a strong baseline for tabular data and meets the 'DeepCSAT' deep learning requirement.

In [None]:
# Determine output layer configuration
num_classes = y_train.nunique() if hasattr(y_train, 'nunique') else len(np.unique(y_train))
print('Detected number of target classes:', num_classes)

# If binary classification
if num_classes == 2:
    output_units = 1
    output_activation = 'sigmoid'
    loss_fn = 'binary_crossentropy'
else:
    # multiclass
    output_units = num_classes
    output_activation = 'softmax'
    loss_fn = 'sparse_categorical_crossentropy'

# Build model
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(X_train.shape[1],)))
model.add(Dropout(0.3))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(32, activation='relu'))
model.add(Dense(output_units, activation=output_activation))

model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
model.summary()
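
To show what the multiclass output configuration computes, here is a small NumPy sketch of softmax followed by sparse categorical cross-entropy. It is an illustration of the math under simplified assumptions (no clipping, toy logits), not the Keras implementation.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(labels, probs):
    """Mean negative log-probability assigned to the true class index."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])  # two samples, three classes
labels = np.array([0, 2])              # integer class indices in 0..2

probs = softmax(logits)
loss = sparse_categorical_crossentropy(labels, probs)
print(probs.sum(axis=1), loss)
```

Because each label is used as an index into the probability row, labels must lie in `0..K-1` for `K` output units; raw scores such as 1 to 5 would need re-indexing first.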

### Train the model with EarlyStopping

We use early stopping to avoid overfitting and restore the best weights observed on the validation set.

In [None]:
early_stop = EarlyStopping(monitor='val_loss', patience=6, restore_best_weights=True)

history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=32,
    callbacks=[early_stop],
    verbose=1
)
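
The patience rule behind early stopping can be sketched in plain Python. This is a simplified model of the idea (strict improvement, no `min_delta` or `mode` handling), and the function name and loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    Training stops once `patience` consecutive epochs fail to improve on the best loss.
    Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training used every epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation-loss curve: improves, then plateaus
demo_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63, 0.64, 0.65]
stop_epoch, best_epoch = early_stopping_epoch(demo_losses, patience=3)
print(stop_epoch, best_epoch)
```

With `restore_best_weights=True`, the weights kept after stopping are those from the best epoch, not from the epoch at which training stopped.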

### Training Curves: Accuracy and Loss

Plot training and validation accuracy and loss to inspect learning behavior.

In [None]:
# Plot accuracy and loss curves
plt.figure(figsize=(6,4))
plt.plot(history.history.get('accuracy', []), label='train_accuracy')
plt.plot(history.history.get('val_accuracy', []), label='val_accuracy')
plt.title('Accuracy over epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()

plt.figure(figsize=(6,4))
plt.plot(history.history.get('loss', []), label='train_loss')
plt.plot(history.history.get('val_loss', []), label='val_loss')
plt.title('Loss over epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()

## 6. Model Evaluation on Test Set

We compute predictions and standard classification metrics, plus a confusion matrix.

In [None]:
# Predictions
y_prob = model.predict(X_test)
if num_classes == 2:
    # Sigmoid output: threshold the single probability at 0.5
    y_pred = (y_prob > 0.5).astype(int).reshape(-1)
else:
    # Softmax output: pick the most probable class index
    y_pred = np.argmax(y_prob, axis=1)

print('Test Accuracy:', accuracy_score(y_test, y_pred))
print('\nClassification Report:\n')
print(classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6,4))
plt.imshow(cm, interpolation='nearest')
plt.colorbar()
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.xticks(range(len(np.unique(y_test))))
plt.yticks(range(len(np.unique(y_test))))
plt.show()
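
The per-class precision and recall in the classification report can be read directly off the confusion matrix. A minimal NumPy sketch, assuming rows are actual classes and columns are predicted classes (the layout `confusion_matrix` returns); the helper name and the toy matrix are hypothetical.

```python
import numpy as np

def per_class_precision_recall(cm):
    """Compute precision and recall per class from a confusion matrix.

    Rows are actual classes and columns are predicted classes.
    """
    cm = np.asarray(cm, dtype=float)
    true_pos = np.diag(cm)
    predicted_totals = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual_totals = cm.sum(axis=1)     # row sums: how often each class actually occurred
    # Guard against division by zero for classes never predicted or never present
    precision = np.divide(true_pos, predicted_totals,
                          out=np.zeros_like(true_pos), where=predicted_totals > 0)
    recall = np.divide(true_pos, actual_totals,
                       out=np.zeros_like(true_pos), where=actual_totals > 0)
    return precision, recall

demo_cm = [[50, 10],
           [5, 35]]
precision, recall = per_class_precision_recall(demo_cm)
print(precision, recall)
```

On imbalanced targets these per-class numbers are usually more informative than overall accuracy.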

## 7. Save Model and Predictions

Save the trained model and create a submission CSV containing actual vs predicted values.

In [None]:
# Ensure models and submission directories exist
os.makedirs('models', exist_ok=True)
os.makedirs('submission', exist_ok=True)

model_path = 'models/best_model.h5'
model.save(model_path)
print('Model saved to', model_path)

# Save predictions (pandas is already imported as pd at the top of the notebook)
pred_df = pd.DataFrame({'Actual': np.asarray(y_test), 'Predicted': y_pred})
pred_df.to_csv('submission/CSAT_Predictions.csv', index=False)
print('Predictions saved to submission/CSAT_Predictions.csv')