# **Project Name**    - Predict CSAT scores



##### **Project Type**    - EDA/Regression/Classification/Unsupervised


# **Project Summary -**

AIM - to forecast CSAT scores

The goal is to develop a deep learning model that predicts CSAT (Customer Satisfaction) scores based on customer interactions and feedback.
The aim is to help e-commerce businesses monitor and improve customer satisfaction in real-time, enhancing service quality and fostering loyalty.

# **GitHub Link -**

Provide your GitHub Link here.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using an Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

In [None]:

from google.colab import files
uploaded = files.upload()  # This will open a file picker to upload files

### Import Libraries

In [None]:
!pip install tensorflow

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, GRU, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

### Dataset Loading

In [None]:
# Load Dataset
data = pd.read_csv(r"eCommerce_Customer_support_data.csv")

### Dataset First View

In [None]:
# Dataset First Look
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
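# Drop columns that are not used in the analysis or the model
columns_to_drop = ['Customer Remarks', 'Order_id', 'order_date_time','Customer_City',
       'Product_category', 'Item_price','connected_handling_time','Unique id']

# `columns=` already selects the column axis, so `axis=1` is not needed
data = data.drop(columns=columns_to_drop)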
data.head()

In [None]:
data.isnull().sum()

In [None]:
# Convert dates to datetime
data['Survey_response_Date'] = pd.to_datetime(data['Survey_response_Date'], dayfirst=True)
data = data.sort_values('Survey_response_Date')

In [None]:
data.head()


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Assuming your cleaned DataFrame is named 'data'
plt.figure(figsize=(6, 4))

# Create bar chart of channel counts
data['channel_name'].value_counts().plot(kind='bar', color='skyblue', edgecolor='black')

# Add labels and title
plt.title('Number of Records by Channel', fontsize=14)
plt.xlabel('Channel Name', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.tight_layout()

plt.show()

##### 1. Why did you pick the specific chart? What are the insights? Does this have a positive business impact?

I chose a bar chart to compare the number of records across the Inbound, Outcall and Email channels. Most records are inbound calls, which suggests that most interactions are initiated by customers contacting Shopzilla about its products. The business could try increasing the number of outbound calls to expand its customer base.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Assuming your cleaned DataFrame is named 'data'
plt.figure(figsize=(10, 4))

# Plot only the top 5 categories
data['category'].value_counts().head(5).plot(kind='bar', color='lightblue', edgecolor='black')

# Add labels and title
plt.title('Top 5 Categories ', fontsize=14)
plt.xlabel('Category Name', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.tight_layout()

plt.show()

##### 1. Why did you pick the specific chart? What are the insights? Does this have a positive business impact?

I chose a bar chart here to show the most common interaction categories. Most interactions are returns-related, meaning a large share of customer service contacts on Shopzilla concern product returns. The business could investigate why so many customer service records relate to returns and how this can be reduced, which could in turn increase sales and profits.

#### Chart - 3

In [None]:
# Chart - 3 visualization code


# Convert to datetime with dayfirst=True
data['Issue_reported at'] = pd.to_datetime(data['Issue_reported at'], dayfirst=True, errors='coerce')
data['issue_responded'] = pd.to_datetime(data['issue_responded'], dayfirst=True, errors='coerce')

# Calculate time difference in minutes
data['time_to_respond'] = (data['issue_responded'] - data['Issue_reported at']).dt.total_seconds() / 60
data = data[data['time_to_respond'] >= 0]



plt.figure(figsize=(8, 5))
plt.hist(data['time_to_respond'].dropna(), bins=30, color='lightcyan', edgecolor='black')
# Limit x-axis from 0 to 1000 minutes
plt.xlim(0, 1000)
plt.title('Distribution of Time to Respond', fontsize=14)
plt.xlabel('Response Time (minutes)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart? What are the insights? Does this have a positive business impact?

I plotted the response time as a distribution to see the most common response times and how much they vary. Most response times fall between 0 and 200 minutes, meaning most customer service enquiries were answered within a few hours, which is good for the business. However, quite a few records have much longer response times; the business could analyse why this happens and whether these longer response times had an effect on the individual customer satisfaction scores.
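
Because the histogram has a long right tail, a few summary percentiles can make the "within a few hours" reading more concrete. The following is a minimal sketch on a small made-up array of response times (the values are illustrative, not taken from the dataset); in the notebook, `data['time_to_respond']` would take the place of the hypothetical `response_minutes` array.

```python
import numpy as np

# Hypothetical response times in minutes (illustrative values only).
response_minutes = np.array([2, 5, 8, 12, 30, 45, 90, 180, 600, 1440], dtype=float)

# Median and upper percentiles describe a skewed distribution better than the mean.
p50, p90 = np.percentile(response_minutes, [50, 90])

# Share of enquiries answered within 200 minutes.
share_within_200 = np.mean(response_minutes <= 200)

print(f"median={p50:.1f} min, 90th percentile={p90:.1f} min")
print(f"answered within 200 min: {share_within_200:.0%}")
```

Reporting the median alongside a high percentile keeps the few very slow responses visible without letting them distort the typical value.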

#### Chart - 4 and 5

In [None]:
# Chart - 4 visualization code
# Create figure with 2 subplots side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

#  Pie Chart 1: Agent Shift
data['Agent Shift'].value_counts().plot(
    kind='pie',
    ax=axes[0],
    autopct='%1.1f%%',
    startangle=90,
    colors=['skyblue', 'lightgreen', 'lightcoral', 'gold'],
    wedgeprops={'edgecolor': 'black'}
)
axes[0].set_title('Distribution of Agent Shifts', fontsize=14)
axes[0].set_ylabel('')  # Remove y-label for cleaner look

# Pie Chart 2: CSAT Score
data['CSAT Score'].value_counts().sort_index().plot(
    kind='pie',
    ax=axes[1],
    autopct='%1.1f%%',
    startangle=90,
    colors=['gold', 'lightcoral', 'lightskyblue', 'lightgreen', 'orchid'],
    wedgeprops={'edgecolor': 'black'}
)
axes[1].set_title('Distribution of CSAT Scores', fontsize=14)
axes[1].set_ylabel('')  # Remove y-label for cleaner look

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart? What are the insights? Does this have a positive business impact?

I chose pie charts here to visualise the proportions of each feature (Agent Shift and CSAT Score). The first pie chart shows that most records were handled by agents on morning and evening shifts, whereas very few were handled at night. The business could look into adding a chatbot or similar to the e-commerce website to improve response times for customers at night. On the right, most customer satisfaction scores are 5, and the second most frequent score is 4. This is a good indicator that most customers are satisfied with the business's customer service.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(10, 4))

# Create bar chart of record counts per manager
data['Manager'].value_counts().plot(kind='bar', color='lightgreen', edgecolor='black')

# Add labels and title
plt.title('No. Records by Manager', fontsize=14)
plt.xlabel('Manager', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.tight_layout()

plt.show()

##### 1. Why did you pick the specific chart? What are the insights? Does this have a positive business impact?

Here I picked a bar chart to visualise which manager dealt with the most records. Here we can see John Smith is the manager who dealt with the most records. The business could look into why Olivia Tan and other Managers not shown here aren't dealing with as many records.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(10, 4))

# Create bar chart of record counts for the top 5 supervisors
data['Supervisor'].value_counts().head(5).plot(kind='bar', color='skyblue', edgecolor='black')

# Add labels and title
plt.title(' Top 5 Supervisors by Records', fontsize=14)
plt.xlabel('Supervisor', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.tight_layout()

plt.show()

##### 1. Why did you pick the specific chart? What are the insights? Does this have a positive business impact?

I chose this visualisation to see the top 5 supervisors by number of records. Carter Park deals with the most customer service enquiries. The business could have these supervisors train the others to distribute the load and increase the number of enquiries dealt with.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10, 4))

# Create bar chart of record counts for the top 5 agents
data['Agent_name'].value_counts().head(5).plot(kind='bar', color='skyblue', edgecolor='black')

# Add labels and title
plt.title('Top 5 Agents by Records', fontsize=14)
plt.xlabel('Agent', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.tight_layout()

plt.show()

##### 1. Why did you pick the specific chart? What are the insights? Does this have a positive business impact?

I wanted to visualise the top 5 agents by number of records. Wendy Taylor dealt with the most records. Again, the company could look into why other agents were not performing as well and how to improve their performance.

#### Chart - 9

In [None]:
# Chart - 9 visualization code


plt.figure(figsize=(10, 4))

# Create bar chart of record counts per tenure bucket
data['Tenure Bucket'].value_counts().head(5).plot(kind='bar', color='lightsteelblue', edgecolor='black')

# Add labels and title
plt.title('Records by Agent Tenure Bucket', fontsize=14)
plt.xlabel('Tenure Bucket', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.tight_layout()

plt.show()



##### 1. Why did you pick the specific chart? What are the insights? Does this have a positive business impact?


I chose this chart to visualise how long an agent had been working. The labels correspond to the following descriptions.

    On Job Training = Newly hired, in training
    
    0-30 = 0 to 30 days employed
    
    31-60 = 31 to 60 days employed
    
    61-90 = 61 to 90 days employed
    
    >90 = More than 90 days employed

Here we can see most agents have been employed for more than 90 days, so most records are dealt with by experienced agents. However, the second most frequent category is agents who are newly hired and still in training. Due to their lack of experience, they may contribute to the records with lower customer satisfaction scores, so the business could look into whether there is a correlation between these two factors.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
cross = pd.crosstab(data['CSAT Score'], data['Tenure Bucket'])

cross.plot(kind='bar', stacked=True, figsize=(8,5), colormap='Set3', edgecolor='black')
plt.title('Stacked Bar: CSAT Score by Tenure Bucket')
plt.xlabel('CSAT Score')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart? What are the insights? Does this have a positive business impact?

I chose this chart to visualise CSAT scores split by job tenure. The higher CSAT scores come mainly from agents who have spent more than 90 days on the job and from agents who are newly hired. Because these are also the two largest tenure groups, the raw counts partly reflect group size, so the pattern is suggestive rather than conclusive. It may indicate that experience helps and that newly hired agents put in more effort at first. The company could look at training and incentives for agents with 0-90 days of tenure to raise their CSAT scores.
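
Since raw counts depend on how many records each tenure bucket has, a row-normalised crosstab and a per-bucket mean give a fairer comparison. The sketch below uses a toy DataFrame with the notebook's column names; the values are made up for illustration, and in the notebook the `data` DataFrame would replace the hypothetical `toy` frame.

```python
import pandas as pd

# Toy records using the notebook's column names (values are made up).
toy = pd.DataFrame({
    'Tenure Bucket': ['>90', '>90', '>90', '>90',
                      '0-30', '0-30',
                      'On Job Training', 'On Job Training'],
    'CSAT Score': [5, 5, 4, 1, 5, 1, 5, 4],
})

# Share of each CSAT score within a tenure bucket (each row sums to 1).
share = pd.crosstab(toy['Tenure Bucket'], toy['CSAT Score'], normalize='index')

# Mean CSAT per bucket gives a single comparable number per group.
mean_csat = toy.groupby('Tenure Bucket')['CSAT Score'].mean()

print(share.round(2))
print(mean_csat)
```

Comparing shares or means across buckets, rather than counts, shows whether a bucket actually scores higher or is simply larger.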

#### Chart - 11

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(8,5))
sns.countplot(x='Agent Shift', hue='category', data=data)
plt.title('Agent Shift vs Category')
plt.xlabel('Agent Shift')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart? What are the insights? Does this have a positive business impact?

I chose a countplot to see which categories of customer enquiry were dealt with during each shift. The split is similar across shifts: returns-related enquiries are the most common in every shift, followed by order-related enquiries. This doesn't have an immediate positive business impact.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
cross = pd.crosstab(data['Tenure Bucket'], data['channel_name'])

cross.plot(kind='bar', stacked=True, figsize=(8,5), colormap='Set2', edgecolor='black')
plt.title('Stacked Bar: How long they have worked there vs communication channel')
plt.xlabel('Job Tenure')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart? What are the insights? Does this have a positive business impact?

I chose this chart for a bivariate analysis of two categorical variables: job tenure split by channel name. It shows that newly hired agents make quite a few outbound calls, whereas agents who have worked there for less than 90 days don't make as many. The business could look at incentives or goals for making more outbound calls to expand the customer base. However, if the goal is to improve customer satisfaction, optimising for email and inbound seems the best option.


#### Chart - 13

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(8,5))
sns.countplot(x='CSAT Score', hue='category', data=data)
plt.title('CSAT Score vs Category')
plt.xlabel('CSAT Score')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart? What are the insights? Does this have a positive business impact?

I wanted to see CSAT scores split by category to provide more insight into which types of customer enquiry had the best satisfaction scores. Returns make up a large proportion of the CSAT Score 5 records, suggesting that customers who return items are very happy with the customer service and that this is one of Shopzilla's strengths. This has a positive business impact, as it indicates that service quality for return-based enquiries is high, which could be used as part of a future marketing strategy.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
corr_matrix = data.corr(numeric_only=True)


plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix', fontsize=14)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart? What are the insights? Does this have a positive business impact?

I chose a correlation matrix to visualise how correlated different features were. However as most of the features are categorical variables the correlation matrix is very simple. We know we must do lots of categorical encoding for the final ML model. This doesn't have any immediate positive impact.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Create a pairplot for all numeric columns
sns.pairplot(data, diag_kind='hist', corner=True)

plt.suptitle('Pairplot of Numeric Variables', y=1.02, fontsize=14)
plt.show()

##### 1. Why did you pick the specific chart? What are the insights? Does this have a positive business impact?

I chose a pairplot to visualise different bivariate charts and if there are any correlations between the numeric variables. However as most of the variables are categorical variables, the pair plot is quite bare. There is no immediate positive business impact.


## ***6. Feature Engineering & Data Pre-processing***

### 1. Converting dates to datetimes

In [None]:
# Convert dates to datetime
data['Survey_response_Date'] = pd.to_datetime(data['Survey_response_Date'], dayfirst=True)
data = data.sort_values('Survey_response_Date')

# Select features
features = [
    'channel_name', 'category', 'Sub-category',
    'Agent_name', 'Supervisor', 'Manager',
    'Tenure Bucket', 'Agent Shift'
]


In [None]:
target = 'CSAT Score'

### 2. Encoding categorical variables

In [None]:
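# Encode categorical variables, keeping one fitted encoder per column so the
# integer codes can be mapped back to their original labels later
encoders = {}
for col in features:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col].astype(str))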

### 3. Assigning X and y

In [None]:
# Feature matrix (X) and target (y)
X = data[features]
y = data[target]

### 4. Normalising data

In [None]:
# Normalize data
scaler_X = MinMaxScaler()
scaler_y = MinMaxScaler()

X_scaled = scaler_X.fit_transform(X)
y_scaled = scaler_y.fit_transform(y.values.reshape(-1, 1))
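
Fitting the scalers on the full dataset lets the minimum and maximum of the later (test) period influence the training inputs. One way to avoid this is to learn the scaling statistics from the training rows only and reuse them for the test rows. The sketch below shows the idea with plain NumPy on a toy array; `fit_min_max` and `apply_min_max` are hypothetical helper names, and the 80% cut-off is an assumption chosen to mirror the later train-test split.

```python
import numpy as np

def fit_min_max(train):
    """Return per-column min and range computed from the training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    return col_min, col_range

def apply_min_max(values, col_min, col_range):
    """Scale values with statistics learned from the training rows."""
    return (values - col_min) / col_range

# Toy feature matrix in time order (illustrative values).
X_toy = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0], [6.0, 40.0], [8.0, 50.0]])
split = int(len(X_toy) * 0.8)  # first 80% of rows form the training period

col_min, col_range = fit_min_max(X_toy[:split])
X_train_scaled = apply_min_max(X_toy[:split], col_min, col_range)
X_test_scaled = apply_min_max(X_toy[split:], col_min, col_range)

print(X_train_scaled)
print(X_test_scaled)  # may fall outside [0, 1] because these rows were unseen
```

With scikit-learn the same pattern is `scaler.fit(...)` on the training rows followed by `scaler.transform(...)` on both parts.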

### 5. Creating sequences for the model

In [None]:
#Create sequences
def create_sequences(X, y, time_steps=10):
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        Xs.append(X[i:(i+time_steps)])
        ys.append(y[i+time_steps])
    return np.array(Xs), np.array(ys)

time_steps = 10
X_seq, y_seq = create_sequences(X_scaled, y_scaled, time_steps)
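
To check what the windowing produces, the same function can be run on a tiny array where the result is easy to read. This is a sketch on toy data (`X_toy` and `y_toy` are made up); the function body repeats the notebook's logic so the example is self-contained.

```python
import numpy as np

def create_sequences(X, y, time_steps=10):
    # Same windowing as the notebook: each sample is `time_steps` consecutive
    # feature rows, and its label is the target of the row that follows them.
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        Xs.append(X[i:i + time_steps])
        ys.append(y[i + time_steps])
    return np.array(Xs), np.array(ys)

# 6 rows, 2 features; the target mirrors the row index for readability.
X_toy = np.arange(12).reshape(6, 2)
y_toy = np.arange(6).reshape(-1, 1)

X_seq_toy, y_seq_toy = create_sequences(X_toy, y_toy, time_steps=3)

print(X_seq_toy.shape)    # (samples, time_steps, features)
print(y_seq_toy.ravel())  # targets of the rows that follow each window
```

Each window therefore uses only rows that come before its target, and the first `time_steps` rows never appear as targets.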


### 6. Train-Test Split

In [None]:
#split into train test splits
X_train, X_test, y_train, y_test = train_test_split(
    X_seq, y_seq, test_size=0.2, shuffle=False
)


## ***7. Model Implementation***


In [None]:
model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(X_train.shape[1], X_train.shape[2])),
    Dropout(0.2),
    LSTM(32, return_sequences=False),
    Dropout(0.2),
    Dense(16, activation='relu'),
    Dense(1)
])



model.compile(optimizer='adam', loss='mse')

In [None]:
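# Hold out the most recent 20% of the training windows for validation so the
# test set is not used to choose the stopping epoch
val_size = int(len(X_train) * 0.2)
X_tr, X_val = X_train[:-val_size], X_train[-val_size:]
y_tr, y_val = y_train[:-val_size], y_train[-val_size:]

es = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

history = model.fit(
    X_tr, y_tr,
    validation_data=(X_val, y_val),
    epochs=50,
    batch_size=32,
    callbacks=[es],
    verbose=1
)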


In [None]:
last_sequence = X_scaled[-time_steps:]
future_preds = []

# Each step is one future interaction (not one day): the model maps the last
# `time_steps` feature rows to the next CSAT score
for _ in range(7):
    pred = model.predict(last_sequence.reshape(1, time_steps, X_scaled.shape[1]), verbose=0)
    future_preds.append(pred[0, 0])

    # Future feature rows are unknown, and the predicted CSAT is not an input
    # feature, so it cannot be appended to the window. As a simplifying
    # assumption, carry the last observed feature row forward.
    new_row = last_sequence[-1]
    last_sequence = np.vstack([last_sequence[1:], new_row])

# Inverse scale
future_preds = scaler_y.inverse_transform(np.array(future_preds).reshape(-1, 1))
print("Next 7-step CSAT forecast:", future_preds.flatten())


In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_pred = model.predict(X_test)
y_pred_inv = scaler_y.inverse_transform(y_pred)

y_test_inv = scaler_y.inverse_transform(y_test)

rmse = np.sqrt(mean_squared_error(y_test_inv, y_pred_inv))
mae = mean_absolute_error(y_test_inv, y_pred_inv)

print(f"Test RMSE: {rmse:.4f}")
print(f"Test MAE: {mae:.4f}")
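
RMSE and MAE are easier to judge next to a naive baseline, for example predicting the mean training CSAT for every test row. The sketch below uses made-up scores on the original 1-5 scale (not real results); in the notebook the inverse-transformed training targets and `y_test_inv` would replace the hypothetical `y_train_toy` and `y_test_toy`.

```python
import numpy as np

# Illustrative CSAT values on the original 1-5 scale (not real results).
y_train_toy = np.array([5, 5, 4, 1, 5, 3, 5, 4], dtype=float)
y_test_toy = np.array([5, 1, 4, 5], dtype=float)

# Baseline: always predict the mean CSAT seen during training.
baseline_pred = np.full_like(y_test_toy, y_train_toy.mean())

baseline_rmse = np.sqrt(np.mean((y_test_toy - baseline_pred) ** 2))
baseline_mae = np.mean(np.abs(y_test_toy - baseline_pred))

print(f"Baseline RMSE: {baseline_rmse:.4f}")
print(f"Baseline MAE: {baseline_mae:.4f}")
```

A model is only adding value if its test RMSE and MAE are clearly below these baseline figures.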


### 2nd Model with Hyperparameter Tuning

In [None]:
!pip install -q keras-tuner


In [None]:
import keras_tuner as kt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam

def build_model(hp):
    model = Sequential()

    # Tune units
    units = hp.Choice('units', [32, 64, 128])
    model.add(LSTM(units, input_shape=(X_train.shape[1], X_train.shape[2])))

    # Tune dropout
    dropout = hp.Float('dropout', 0.1, 0.3, step=0.1)
    model.add(Dropout(dropout))

    model.add(Dense(1))

    # Tune learning rate
    lr = hp.Choice('learning_rate', [0.001, 0.005, 0.01])
    model.compile(optimizer=Adam(learning_rate=lr), loss='mae')

    return model



In [None]:
tuner = kt.RandomSearch(
    build_model,
    objective='val_loss',  # validation MAE, because build_model compiles with loss='mae'
    max_trials=10,          # number of hyperparameter combinations
    executions_per_trial=1,
    directory='my_dir',
    project_name='csat_lstm'
)


In [None]:
val_split = 0.2
val_size = int(len(X_train) * val_split)

X_train_tune = X_train[:-val_size]
y_train_tune = y_train[:-val_size]

X_val_tune = X_train[-val_size:]
y_val_tune = y_train[-val_size:]


In [None]:
from tensorflow.keras.callbacks import EarlyStopping

es = EarlyStopping(monitor='val_loss', patience=3)
tuner.search(X_train_tune, y_train_tune,
             epochs=20,
             batch_size=32,
             validation_data=(X_val_tune, y_val_tune),
             callbacks=[es])


In [None]:
best_model = tuner.get_best_models(num_models=1)[0]

In [None]:
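# Report MAE on the original 1-5 CSAT scale so it is comparable with the first model
y_pred_best = scaler_y.inverse_transform(best_model.predict(X_test))
y_test_orig = scaler_y.inverse_transform(y_test)
mae = np.mean(np.abs(y_test_orig - y_pred_best))
print(f'Test MAE: {mae:.4f}')



Here the MAE is lower than for the first model, which indicates an improvement. The comparison is only meaningful when both MAE values are on the same scale, so both are reported in original CSAT points rather than on the 0-1 scaled target.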

# **Conclusion**

Write the conclusion here.