<a href="https://colab.research.google.com/github/nisha432/customer-satisfaction-prediction-score-/blob/main/Project_no_4_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - E-Commerce Customer Satisfaction Score Prediction




##### **Contribution**    - Team
##### **Team Member 1 -** Nisha Ahire
##### **Team Member 2 -** Prabhakar Harijan


# **Project Summary -**

The DeepCSAT project focuses on predicting customer satisfaction (CSAT) scores for e-commerce platforms using deep learning, particularly Artificial Neural Networks (ANN). The approach begins with ensuring data integrity and cleaning by addressing issues such as missing, duplicate, or irrelevant data points. Through a meticulous cleaning process, the data is refined, ensuring it is accurate, consistent, and ready for analysis, which is crucial for building a reliable model. Once the data is cleaned, the next step is feature engineering and selection. Creative and analytical techniques are used to craft and select features that significantly impact CSAT predictions. This process ensures that the most relevant customer behaviors and interactions are captured, while eliminating unnecessary or redundant features to avoid overfitting and improve model efficiency.

Following feature selection, data preprocessing and transformation techniques, including normalization, encoding of categorical variables, and timestamp parsing, are applied. These steps prepare the data in a format suitable for deep learning, ensuring the neural network can effectively learn complex relationships within the data. At the heart of the project is the model development and architecture, where a robust Artificial Neural Network (ANN) is designed. The architecture is carefully crafted with appropriate layers, activation functions, and connectivity patterns to capture the nonlinearities of customer satisfaction data, ensuring the model is capable of generalizing well to unseen data and making accurate predictions.

The training process is optimized through the use of training efficiency and optimization strategies. Techniques like batch processing, learning rate adjustments, and early stopping are employed to accelerate convergence while preventing overfitting. This ensures the model trains efficiently, minimizing computational time without compromising performance. To assess model accuracy and robustness, evaluation metrics and model validation are conducted using relevant metrics such as mean absolute error (MAE) or root mean square error (RMSE). Cross-validation or split-sample validation is used to confirm that the model generalizes well to new, unseen data.

Incorporating innovative techniques in deep learning, the project explores advanced methods and customizations that improve prediction accuracy. This might involve experimenting with different neural network architectures, hyperparameter tuning, or adding advanced features that enhance the model's predictive power. Once developed, the model is deployed locally, demonstrating its practical application in real-world e-commerce environments and showing how it can be integrated into existing systems to drive improvements in customer satisfaction.

Finally, documentation, interpretability, and presentation are prioritized. The project provides clear, structured documentation outlining the data evaluation process, model architecture, and results. Emphasis is placed on model interpretability, offering insights into how specific features influence CSAT predictions. This transparency helps businesses make informed, data-driven decisions to enhance customer experiences. The findings are presented professionally, ensuring that the project’s outcomes and value are clearly communicated to stakeholders and decision-makers.

In summary, DeepCSAT integrates deep learning with e-commerce insights to predict customer satisfaction scores accurately. By following a comprehensive approach—ranging from data preparation to model development, training, and evaluation—it provides e-commerce businesses with actionable insights that can improve customer satisfaction, refine product offerings, and drive customer retention.

# **GitHub Link -**

https://github.com/nisha432/customer-satisfaction-prediction-score-

# **Problem Statement**


In the competitive landscape of e-commerce, customer satisfaction (CSAT) plays a pivotal role in driving customer loyalty, retention, and overall business success. However, accurately assessing and predicting CSAT scores remains a significant challenge due to the complex and multifaceted nature of customer interactions and feedback. Traditional methods of evaluating customer satisfaction are often reactive, relying on surveys or direct feedback, which can be limited, delayed, and subjective.

To address this challenge, there is a growing need for predictive models that can analyze large volumes of interaction data in real-time to proactively identify factors influencing customer satisfaction. This project aims to develop a robust predictive model using Deep Learning Artificial Neural Networks (ANN) to forecast CSAT scores based on a diverse set of interaction-related features. By leveraging advanced neural network techniques, the model seeks to provide accurate, data-driven insights that can enhance service quality, optimize customer experience, and drive strategic business decisions in the e-commerce domain.

The successful implementation of this solution will enable businesses to proactively address customer concerns, optimize service delivery, and improve customer loyalty, thereby fostering sustainable growth in a highly competitive market

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

In [None]:
# Required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from datetime import datetime
import missingno as msno
import warnings
warnings.filterwarnings('ignore')
import matplotlib.patheffects as path_effects
from matplotlib.patheffects import PathPatchEffect, SimpleLineShadow, Normal
# Machine Learning and Data Preprocessing Libraries
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_auc_score, roc_curve, auc,
    mean_squared_error, mean_absolute_error, r2_score, classification_report
)
from yellowbrick.classifier import (
    ClassificationReport, PrecisionRecallCurve, ClassPredictionError, DiscriminationThreshold
)
from yellowbrick.style.palettes import PALETTES, SEQUENCES, color_palette
import lightgbm
from xgboost import XGBClassifier, XGBRFClassifier

# Visualization settings
sns.set(style="whitegrid")
pd.set_option('display.max_columns', None)



This is a robust setup for data analysis and machine learning in Python, with a comprehensive selection of libraries for data manipulation (pandas, numpy), visualization (seaborn, matplotlib, plotly), and machine learning (scikit-learn, lightgbm, xgboost). I also included essential libraries for handling missing data (missingno) and added enhanced visualizations for classification with yellowbrick

### Dataset Loading

In [None]:
# Load Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv("/content/drive/MyDrive/AlmaBetter/datasets/eCommerce_Customer_support_data.csv")

### Dataset First View

In [None]:
# Dataset First Look

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

In [None]:

rows, columns = df.shape

# Print the results
print(f'Number of rows: {rows}')
print(f'Number of columns: {columns}')

### Dataset Information

In [None]:
# Dataset Info

In [None]:
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

In [None]:
duplicate_count = df.duplicated().sum()

# Print the result
print(f'Total duplicate rows: {duplicate_count}')

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

In [None]:
df.isnull().sum()

In [None]:
# Visualizing the missing values

In [None]:
null_counts = df.isnull().sum()

# Plotting
plt.figure(figsize=(18, 6))
sns.barplot(x=null_counts.index, y=null_counts.values, palette='viridis')
plt.title('Count of Null Values in Each Column')
plt.xlabel('Columns')
plt.ylabel('Number of Null Values')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

In [None]:
df.columns

In [None]:
# Dataset Describe

In [None]:
df.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

In [None]:
for i in df.columns.tolist():
  print("No. of unique values in ",i,"is",df[i].nunique(),".")

unique_dataset = pd.DataFrame()
unique_dataset['Features'] = df.columns
unique=[]
for i in df.columns:
    unique.append(df[i].nunique())
unique_dataset['Uniques'] = unique

f, ax = plt.subplots(1,1, figsize=(15,7))

splot = sns.barplot(x=unique_dataset['Features'], y=unique_dataset['Uniques'], alpha=0.8,path_effects=[path_effects.SimplePatchShadow(),path_effects.Normal()],color = 'red')
for p in splot.patches:
    splot.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center',
                   va = 'center', xytext = (0, 9), textcoords = 'offset points')
plt.title('Bar plot for number of unique values in each column',weight='bold', size=15)
plt.ylabel('#Unique values', size=12, weight='bold')
plt.xlabel('Features', size=12, weight='bold')
plt.xticks(rotation=90)
plt.show()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
# Handle missing values - dropping rows with missing CSAT for this example
df = df.dropna(subset=['CSAT Score'])

In [None]:
# Convert datetime columns and create derived features
df['order_date_time'] = pd.to_datetime(df['order_date_time'], dayfirst=True)
df['Issue_reported at'] = pd.to_datetime(df['Issue_reported at'], dayfirst=True)
df['issue_responded'] = pd.to_datetime(df['issue_responded'], dayfirst=True)
df['Survey_response_Date'] = pd.to_datetime(df['Survey_response_Date'], dayfirst=True)


In [None]:
df['order_year'] = df['order_date_time'].dt.year
df['order_month'] = df['order_date_time'].dt.month
df['order_day'] = df['order_date_time'].dt.day
df['order_hour'] = df['order_date_time'].dt.hour
df['order_day_of_week'] = df['order_date_time'].dt.dayofweek

In [None]:
# Calculate response and survey lag times
df['Response Lag'] = (df['issue_responded'] - df['Issue_reported at']).dt.total_seconds() / 3600.0
df['Survey Lag'] = (df['Survey_response_Date'] - df['Issue_reported at']).dt.total_seconds() / 3600.0


In [None]:
# Fill missing text values with 'Unknown'
df['Customer Remarks'] = df['Customer Remarks'].fillna('Unknown')
df['Customer_City'] = df['Customer_City'].fillna('Unknown')

In [None]:
# Fill missing numeric values with median
df['Item_price'] = df['Item_price'].fillna(df['Item_price'].median())
df['connected_handling_time'] = df['connected_handling_time'].fillna(df['connected_handling_time'].median())


In [None]:
# Fill missing dates with earliest date in each column
for date_col in ['order_date_time', 'Issue_reported at', 'issue_responded']:
    df[date_col] = df[date_col].fillna(df[date_col].min())

In [None]:
# Calculate response_time and days_since_order
df['response_time'] = (df['issue_responded'] - df['Issue_reported at']).dt.total_seconds() / 60  # in minutes
df['days_since_order'] = (df['Survey_response_Date'] - df['order_date_time']).dt.days

In [None]:
# Replace negative or NaN days_since_order with 0 for consistency
df['days_since_order'] = df['days_since_order'].apply(lambda x: max(x, 0) if pd.notnull(x) else 0)


In [None]:
df.info()

In [None]:
df['Product_category'].value_counts()

In [None]:
df.fillna('Unknown', inplace=True)

In [None]:
df

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

In [None]:
fig = px.histogram(df, x='CSAT Score', title="CSAT Score Distribution", nbins=5)
fig.update_layout(xaxis_title="CSAT Score", yaxis_title="Count")
fig.show()

##### 1. Why did you pick the specific chart?

The histogram is well-suited for visualizing the distribution of a single variable, particularly when the goal is to observe how frequently each customer satisfaction score (CSAT Score) occurs. In this case, CSAT Score has a limited range (1 to 5), making a histogram with 5 bins ideal. It shows the frequency of each score, allowing us to quickly see if customer satisfaction skews toward high or low scores or is more evenly distributed.

##### 2. What is/are the insight(s) found from the chart?

 According to histogram we can observe High customer satisfaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights could lead to a positive business impact:

Customer Feedback Analysis: By identifying satisfaction levels, you can prioritize areas for improvement. For instance, addressing the root causes behind low scores can directly enhance customer experience.

Marketing and Retention Efforts: High satisfaction scores provide validation for effective customer service strategies. They can be leveraged in customer testimonials or marketing content to attract new customers and retain existing ones.

Operational Adjustments: Consistently low scores may suggest underlying issues in operations or customer interactions that, once corrected, could reduce churn and improve brand loyalty.

Analyzing CSAT scores helps focus on enhancing customer satisfaction, ultimately supporting growth through improved service and reputation.Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code

In [None]:
fig = px.box(df, x='Agent Shift', y='response_time', title="Response Time Distribution by Agent Shift")
fig.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.violinplot(x='Agent Shift', y='response_time', data=df)
plt.title("Response Time Distribution by Agent Shift")
plt.xlabel("Agent Shift")
plt.ylabel("Response Time (hours)")
plt.show()

##### 1. Why did you pick the specific chart?

A box plot is an excellent choice for comparing distributions of response times across different agent shifts because it shows key statistics—like the median, quartiles, and potential outliers—in each shift category. This visualization allows for easy comparison of typical response times and variations between shifts, revealing if certain shifts are associated with longer or shorter response times.

##### 2. What is/are the insight(s) found from the chart?

The box plot can reveal several valuable insights:

Median Response Times: By comparing the median (the line in the box) across shifts, we can identify which shifts tend to have quicker or slower response times.

Variability in Response Times: The length of the box (interquartile range) and presence of whiskers or outliers will indicate the consistency of response times within each shift. Shifts with a narrower range are more consistent, while a wider range suggests variability.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can positively impact the business:

Shift Scheduling Optimization: Understanding which shifts have longer response times can help identify times of high demand or potential staffing issues. This insight can guide better scheduling or resource allocation to improve response times.

Training and Process Improvement: If certain shifts consistently perform better, analyzing their practices could uncover methods or behaviors that can be standardized across all shifts.

Enhanced Customer Satisfaction: Reducing response times, especially during shifts with slower responses, can improve overall customer satisfaction, as faster response times are generally associated with better customer service.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

In [None]:
fig = px.box(df, x='category', y='CSAT Score', title="Category vs. CSAT Score", color='Agent Shift')
fig.show()


##### 1. Why did you pick the specific chart?

A box plot is an effective choice for comparing customer satisfaction scores (CSAT) across different categories while also distinguishing between different agent shifts. This allows for clear visualization of the distribution of CSAT scores within each category, helping to highlight differences in customer satisfaction levels based on both the product category and the agent shift. Using color to represent agent shifts adds another layer of insight, enabling you to see how the shifts impact satisfaction within each category.

##### 2. What is/are the insight(s) found from the chart?

The box plot can provide several insights:

Distribution of CSAT Scores: The median CSAT scores (the line inside the box) for each category can indicate which categories are perceived more positively by customers.

Variability Across Categories: The interquartile range (IQR) of each box reveals how varied the CSAT scores are within that category. A wide box indicates more variability in satisfaction, while a narrow box suggests consistent ratings.

Comparative Performance by Agent Shift: By observing how the boxes for different agent shifts overlap within categories, you can assess whether certain shifts perform better or worse in specific categories, highlighting potential issues or strengths.

Outliers: The presence of outliers in the box plot can indicate exceptional experiences (either positive or negative) and may warrant further investigation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this visualization can lead to positive business outcomes:

Targeted Improvement Strategies: Understanding which categories have lower CSAT scores can help focus improvement efforts. If certain categories consistently receive lower satisfaction ratings, you may need to investigate product quality, service delivery, or customer support specific to those categories.

Agent Training and Resources: If one agent shift consistently scores higher in certain categories, their methods and practices can be analyzed and potentially adopted across other shifts to enhance overall customer satisfaction.

Product Development and Marketing: Categories that receive high satisfaction scores can be highlighted in marketing campaigns, while those with lower scores might need to be improved before being promoted heavily.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

In [None]:
fig = px.scatter_3d(df, x='Item_price', y='CSAT Score', z='response_time',
                    title="CSAT Score vs Item Price vs Response Time",
                    color='Agent Shift',
                    labels={'Item_price': 'Item Price', 'CSAT Score': 'CSAT Score', 'response_time': 'Response Time (hours)'})
fig.show()

##### 1. Why did you pick the specific chart?

A 3D scatter plot is an excellent choice for visualizing the relationship among three continuous variables—in this case, Item Price, CSAT Score, and Response Time. It allows for a comprehensive view of how these variables interact with each other simultaneously. The use of color to represent different agent shifts adds another dimension to the analysis, helping identify trends or patterns associated with specific shifts.

##### 2. What is/are the insight(s) found from the chart?

The 3D scatter plot can yield several insights:

Relationships Among Variables: You can observe how Item Price correlates with both CSAT Score and Response Time. For example, do higher-priced items tend to receive higher or lower satisfaction scores? Does response time vary significantly for different price ranges?

Patterns by Agent Shift: Different colors corresponding to agent shifts can help identify whether certain shifts perform better or worse concerning the three variables. For instance, one shift might show a cluster of high CSAT scores and low response times for high-priced items, indicating effective handling.

Clusters and Trends: Look for any clusters of points, which might suggest groups of items or customer interactions with similar characteristics. Identifying trends within those clusters can guide pricing strategies or operational improvements.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this 3D scatter plot can drive positive business outcomes:

Pricing Strategies: Understanding how Item Price affects customer satisfaction can inform pricing strategies. If higher-priced items consistently receive lower CSAT scores, this could indicate a need for better value communication or product quality improvement.

Resource Allocation: If certain agent shifts consistently show longer response times for specific price ranges, this could signal a need for additional training or resources to improve service delivery.

Customer Experience Optimization: Identifying patterns in customer satisfaction related to both price and response time can help optimize the overall customer experience. Adjustments can be made based on data-driven insights, leading to higher satisfaction and potentially increased sales.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

In [None]:
category_means = df.groupby('category')['CSAT Score'].mean().reset_index()
fig = px.bar(category_means, x='category', y='CSAT Score', title="Average CSAT Score by Category",
             labels={'category': 'Category', 'CSAT Score': 'Average CSAT Score'})
fig.show()

##### 1. Why did you pick the specific chart?

A bar chart is a suitable choice for displaying the average customer satisfaction score (CSAT Score) across different categories. It clearly illustrates the differences in average satisfaction levels, making it easy to compare how each category performs relative to others. The vertical arrangement allows for quick visual assessment of which categories are doing well and which may require attention.

##### 2. What is/are the insight(s) found from the chart?

The bar chart can provide several insights:

Comparison of Average Scores: By looking at the heights of the bars, you can quickly identify which categories have higher or lower average CSAT scores. This can help pinpoint areas of strength and those needing improvement.

Category Performance Trends: If specific categories consistently rank higher or lower, it may indicate underlying factors that affect customer satisfaction, such as product quality, pricing, or service.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights derived from this bar chart can lead to several positive business outcomes:

Targeted Improvement Initiatives: Categories with lower average CSAT scores can be the focus of targeted improvement initiatives, such as enhancing product quality, customer service training, or refining marketing strategies to better align with customer expectations.

Resource Allocation: Understanding which categories are performing well can guide resource allocation for marketing or inventory management, focusing efforts on categories that yield higher satisfaction and, potentially, sales.

Marketing and Communication: Highlighting higher-performing categories in marketing materials can help attract customers and improve brand perception. Conversely, addressing lower-performing categories may improve overall customer experience and loyalty.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

In [None]:
# Create a figure and axis
plt.figure(figsize=(10, 6))

# Create a box plot using Seaborn
box_plot = sns.boxplot(data=df, x='Agent Shift', y='CSAT Score', palette='Set2')

# Add title and labels
plt.title('CSAT Score Distribution by Agent Shift', fontsize=16, fontweight='bold')
plt.xlabel('Agent Shift', fontsize=14)
plt.ylabel('CSAT Score', fontsize=14)

# Add grid lines for better readability
plt.grid(True)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

 the box plot was chosen for its ability to effectively communicate the distribution of CSAT scores across agent shifts, highlight key statistics, and facilitate quick comparisons, making it a valuable tool for analyzing customer satisfaction data.

##### 2. What is/are the insight(s) found from the chart?

Distribution of CSAT Scores:

Each box represents the interquartile range (IQR), showing where the middle 50% of scores lie. The line inside each box indicates the median CSAT score for that shift.
Comparative Analysis:

You can easily compare the median CSAT scores across different shifts. This helps identify which shifts are performing well and which may have issues.
Variability and Outliers:

The whiskers of the box plot extend to the minimum and maximum scores within 1.5 times the IQR from the quartiles, while points outside this range are considered outliers. Observing outliers can indicate exceptional cases that may need further investigation.
Performance Patterns:

If one shift consistently shows higher median CSAT scores and fewer outliers, it may suggest better performance in customer service or operations during that time.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Operational Insights:

Understanding which agent shifts perform better in terms of customer satisfaction can help optimize staffing and training. Higher-performing shifts may have practices that could be shared or implemented across other shifts.
Resource Allocation:

If certain shifts are underperforming, management can allocate more resources, such as training or support, to improve performance during those times.
Enhanced Customer Experience:

Ultimately, focusing on shifts with lower CSAT scores and implementing changes can lead to improved customer satisfaction, loyalty, and retention

#### Chart - 7

In [None]:
# Chart - 7 visualization code

In [None]:
plt.figure(figsize=(10, 6))
sns.regplot(data=df, x='connected_handling_time', y='CSAT Score', scatter_kws={'alpha':0.5}, line_kws={'color':'red'})
plt.title('CSAT Score vs. Connected Handling Time')
plt.xlabel('Connected Handling Time')
plt.ylabel('CSAT Score')
plt.show()

##### 1. Why did you pick the specific chart?

A regression plot is particularly effective for illustrating the relationship between two continuous variables. It not only shows the data points but also includes a fitted line that represents the trend, making it easier to identify correlations.

##### 2. What is/are the insight(s) found from the chart?

Correlation Assessment:
 the regression line has a positive slope, it suggests that longer connected handling times may be associated with higher CSAT scores, while a negative slope indicates the opposite.
Variability in Scores:

The spread of the scatter points around the regression line can indicate how much variability there is in CSAT scores at different levels of connected handling time. A tight cluster around the line suggests a strong correlation, whereas a wider spread indicates more variability in satisfaction.
Identifying Trends:

The plot can reveal whether certain ranges of connected handling time consistently yield high or low satisfaction scores, informing operational improvements or customer service strategies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Operational Improvements: Understanding the relationship between handling time and customer satisfaction can help inform training and resource allocation, ultimately leading to improved customer service efficiency.

Customer Satisfaction Strategies: If longer handling times correlate with higher satisfaction, it may indicate that spending more time on customer interactions enhances the experience, suggesting a potential shift in operational strategy.

Targeted Training: Identifying handling time thresholds that maximize customer satisfaction can lead to focused training programs aimed at achieving those timeframes.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

In [None]:
# Group by category and calculate mean CSAT Score
category_csat = df.groupby('category')['CSAT Score'].mean().reset_index()

# Data for the 3D bar chart
categories = category_csat['category']
csat_scores = category_csat['CSAT Score']
x = np.arange(len(categories))  # the label locations
y = np.zeros_like(x)  # starting y position
z = np.zeros_like(x)  # starting z position

# Create the 3D bar chart
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

# Create bars
ax.bar3d(x, y, z, dx=0.4, dy=0.4, dz=csat_scores, color='cyan', alpha=0.7, edgecolor='k')

# Add labels and title
ax.set_xlabel('Categories')
ax.set_ylabel('Y Axis (Not Used)')
ax.set_zlabel('Average CSAT Score')
ax.set_title('3D Bar Chart of Average CSAT Scores by Category')

# Set x-ticks to be the categories
ax.set_xticks(x)
ax.set_xticklabels(categories, rotation=45, ha='right')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A 3D bar chart is chosen for visualizing the average Customer Satisfaction Score (CSAT Score) across different categories for several reasons:

Visual Depth: The 3D aspect adds a visual depth that can make comparisons among categories more engaging and clearer. It allows viewers to see how different categories stack up against each other in terms of average satisfaction.

Emphasis on Magnitude: By using height (z-axis) to represent average CSAT scores, it emphasizes the differences in satisfaction levels more distinctly than a 2D chart might.

Categorical Comparison: The chart effectively showcases multiple categories, allowing for easy comparison of their respective average scores, which is helpful for identifying which categories perform well and which do not.

##### 2. What is/are the insight(s) found from the chart?

The 3D bar chart can yield various insights:

Category Performance: By comparing the heights of the bars, you can quickly identify which categories have higher or lower average CSAT scores. This helps pinpoint areas of strength and those that may require attention.

Magnitude of Differences: The visual representation allows for a clearer understanding of how much one category's average CSAT score differs from another. Categories with significantly higher or lower scores can indicate varying levels of customer satisfaction.

Potential Outliers: If some categories have markedly lower scores, they may indicate specific issues or challenges that need to be addressed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this 3D bar chart can drive positive business outcomes:

Targeted Improvement Efforts: Identifying categories with lower average CSAT scores can guide where to focus improvement initiatives, such as enhancing product quality, customer service, or operational efficiencies.

Resource Allocation: Understanding which categories perform better can help prioritize marketing and promotional efforts, ensuring that resources are allocated effectively to maximize customer satisfaction and engagement.

Customer Experience Enhancement: By focusing on categories with lower satisfaction scores, businesses can implement changes that improve customer experiences, leading to higher retention rates and brand loyalty.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

In [None]:
# Filter out rows with negative response_time
df_filtered = df[df['response_time'] >= 0]

# Create the scatter plot
fig = px.scatter(
    df_filtered,
    x='Item_price',
    y='CSAT Score',
    size='response_time',
    color='Agent Shift',
    hover_name='Agent_name',
    title="CSAT Score vs. Item Price with Response Time as Size"
)

# Update layout
fig.update_layout(xaxis_title="Item Price", yaxis_title="CSAT Score")

# Show the plot
fig.show()

##### 1. Why did you pick the specific chart?

A scatter plot is ideal for showing relationships between two continuous variables (item price and CSAT Score) while allowing for an additional variable (response time) to be represented through marker size.

##### 2. What is/are the insight(s) found from the chart?

Correlation Insights:

The plot can help reveal whether there is a correlation between item price and CSAT scores. A positive correlation might suggest that higher prices lead to greater customer satisfaction, or vice versa.
Response Time Impact:

The size of the markers indicates the response time, allowing for an understanding of how this variable interacts with both price and satisfaction. For instance, if larger markers (indicating longer response times) are associated with lower CSAT scores, it may suggest that longer response times negatively impact satisfaction.
Agent Shift Performance:

Different colors representing agent shifts can help in identifying which shifts perform better or worse in terms of customer satisfaction and how item pricing influences those scores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Pricing Strategy:

Understanding the relationship between item price and customer satisfaction can inform pricing strategies, helping to determine optimal price points that enhance customer satisfaction.
Operational Improvements:

If longer response times correlate with lower CSAT scores, businesses can focus on improving operational efficiencies or agent training to reduce response times, thereby enhancing customer satisfaction.
Targeted Training:

Insights about specific agent shifts performing better or worse can lead to tailored training and support efforts to improve overall service quality.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

In [None]:


# Assuming df is your DataFrame containing the data
# First, ensure connected_handling_time is numeric
df['connected_handling_time'] = pd.to_numeric(df['connected_handling_time'], errors='coerce')

# Drop rows with NaN values in 'connected_handling_time' or 'CSAT Score'
df_cleaned = df.dropna(subset=['connected_handling_time', 'CSAT Score'])

# Plotting the scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_cleaned, x='connected_handling_time', y='CSAT Score', alpha=0.6)
plt.title('Connected Handling Time vs CSAT Score', fontsize=16)
plt.xlabel('Connected Handling Time (hours)', fontsize=14)
plt.ylabel('CSAT Score', fontsize=14)
plt.grid(True)
plt.show()

# Calculating the correlation coefficient
correlation = df_cleaned['connected_handling_time'].corr(df_cleaned['CSAT Score'])
print(f'Correlation coefficient between connected handling time and CSAT Score: {correlation:.2f}')


##### 1. Why did you pick the specific chart?

I chose a scatter plot to visualize the relationship between connected handling time and CSAT Score because:

Visualization of Correlation: Scatter plots are effective for showing the relationship between two continuous variables, allowing us to see how one variable changes with respect to another.

Identifying Trends: This type of chart can reveal trends, clusters, or outliers in the data, helping to assess whether there is a positive, negative, or no correlation.

Easy Interpretation: It allows for quick visual interpretation of the data points and facilitates understanding of potential relationships without getting lost in complex data.

##### 2. What is/are the insight(s) found from the chart?

Trend Identification: The scatter plot may reveal a trend indicating that as connected handling time increases or decreases, the CSAT Score may show a corresponding increase or decrease. For example, if longer handling times are associated with higher CSAT Scores, this could indicate that more thorough service leads to greater customer satisfaction.

Outliers: The chart may also highlight outliers that deviate significantly from the general trend, prompting further investigation into those specific cases to understand what factors might be at play.

Correlation Coefficient: The calculated correlation coefficient will quantify the relationship. For instance, a positive correlation (e.g., 0.7) would suggest that higher connected handling time is associated with higher CSAT Scores, whereas a negative correlation would suggest the opposite.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this analysis can significantly impact the business by:

Improving Customer Satisfaction: If a positive correlation is found, the company may consider investing in longer connected handling times to ensure more thorough customer service, potentially increasing overall CSAT Scores.

Resource Allocation: Understanding this relationship can help management allocate resources effectively, ensuring that agents have sufficient time to handle customer issues, which may enhance the customer experience.

Training and Development: If certain handling time thresholds correlate with higher satisfaction, the business can tailor training programs for agents to focus on managing these thresholds, ultimately leading to improved performance and customer satisfaction.

Data-Driven Decisions: The analysis supports data-driven decision-making, allowing management to implement strategies based on actual customer feedback and service metrics rather than assumptions.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

The analysis of how Item Price affects CSAT Scores utilizes various charts to provide a comprehensive understanding of the relationship between these two variables. A scatter plot is chosen to visualize the relationship between Item Price and CSAT Scores, as it effectively displays trends, correlations, and outliers, allowing for a quick assessment of how customer satisfaction varies with price changes. Complementing this visual analysis, correlation analysis quantifies the strength and direction of the relationship, providing a numerical value that indicates whether a strong relationship exists. Additionally, a box plot is employed to summarize the distribution of CSAT Scores across different price categories, enabling a comparison of medians and variability within these ranges. Together, these charts offer both visual exploration and quantitative assessment, facilitating data-driven decisions regarding how Item Price impacts customer satisfaction.

##### 2. What is/are the insight(s) found from the chart?

Trends: You might observe trends where higher item prices correspond to higher or lower CSAT Scores, indicating a possible relationship.

Variability: The box plot can reveal variability in CSAT Scores across different price categories, helping to identify if certain price ranges lead to higher satisfaction.

Outliers: You may identify outliers in certain price categories that could impact overall insights.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from analyzing the relationship between Item Price and CSAT Scores can indeed lead to a positive business impact. For instance, if a strong positive correlation is found, indicating that higher prices are associated with higher CSAT Scores, the business can focus on premium product offerings or enhanced service levels to boost customer satisfaction. This could lead to increased sales, customer loyalty, and overall profitability.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?





The line chart was selected because it effectively shows trends over time, which is ideal for understanding how average CSAT scores vary by date. It allows for an intuitive understanding of any upward or downward trends in customer satisfaction over the period.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

In [None]:
heatmap_data = df.groupby(['channel_name', 'category'])['CSAT Score'].mean().unstack()
plt.figure(figsize=(12, 8))
sns.heatmap(heatmap_data, annot=True, cmap='YlGnBu')
plt.title("Heatmap of Average CSAT Score by Channel and Category")
plt.xlabel("Category")
plt.ylabel("Channel Name")
plt.show()

##### 1. Why did you pick the specific chart?

Heatmaps are particularly well-suited for visualizing data that can be structured in a matrix form, such as the relationship between two categorical variables (channel and category) with a third quantitative variable (average CSAT Score).

##### 2. What is/are the insight(s) found from the chart?

Performance Analysis:

The heatmap can reveal which combinations of channels and categories yield the highest and lowest average CSAT scores, indicating areas of strength and weakness.
Identifying Trends:

Consistent patterns across categories and channels can indicate operational efficiencies or issues. For instance, if a particular category consistently has low scores across multiple channels, it may require further investigation.
Focus Areas for Improvement:

Areas represented by darker colors (indicating lower scores) can guide strategic decisions about where to focus customer service improvements, training, or marketing efforts.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
new_df=df.copy()

In [None]:
# Handling Missing Values & Missing Value Imputation

In [None]:
new_df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

In [None]:
new_df = df

In [None]:
df.head()

In [None]:
# Replace "Unknown" with NaN
df.replace("Unknown", np.nan, inplace=True)

# Drop rows with any NaN values
df_cleaned = df.dropna()



In [None]:
df_cleaned.info()

In [None]:

# Select numerical columns
numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns

# Set up the matplotlib figure
plt.figure(figsize=(15, 10))

# Create box plots for each numerical column
for i, col in enumerate(numerical_cols, 1):
    plt.subplot(len(numerical_cols), 1, i)  # create a subplot for each numerical column
    sns.boxplot(x=df[col])
    plt.title(f'Box Plot of {col}')
    plt.xlabel(col)

plt.tight_layout()
plt.show()

In [None]:
df=df_cleaned

In [None]:
# Define a function to handle outliers
def handle_outliers(df, column):
    # Calculate Q1 (25th percentile) and Q3 (75th percentile)
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1

    # Determine bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Remove outliers
    df_cleaned = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

    return df_cleaned

# Handle outliers for each numerical column in the DataFrame
for col in df.select_dtypes(include=['float64', 'int64']).columns:
    df = handle_outliers(df, col)

# Display the cleaned DataFrame
print(df)

##### What all outlier treatment techniques have you used and why did you use those techniques?

The technique used for outlier treatment is based on the Interquartile Range (IQR) method.This technique IS used because of-

Data Integrity: Improves the quality of the dataset by eliminating extreme values.

Robustness: Methods like IQR are resilient to extreme observations.

Model Performance: Proper outlier handling can enhance the accuracy of predictive models.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

In [None]:
df.columns

In [None]:
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
le = LabelEncoder()

# Fit the LabelEncoder to the 'Tenure Bucket' column
le.fit(df['Tenure Bucket'])

# Create mapping dictionary for 'Tenure Bucket'
tenure_mapping = {value: index for index, value in enumerate(le.classes_)}

# Print the mapping
print("Mapping for 'Tenure Bucket':", tenure_mapping)

# Apply mapping to the DataFrame (encoding)
df['Tenure Bucket (encoded)'] = df['Tenure Bucket'].map(tenure_mapping)

1. Initializing the LabelEncoder and fitting it to the 'Tenure Bucket' column.
2. Creating a mapping dictionary using le.classes_.
3. Applying the mapping to create an encoded column.

In [None]:
# Fit the LabelEncoder to the 'Agent Shift' column
le.fit(df['Agent Shift'])

# Create mapping dictionary for 'Agent Shift'
tenure_mapping = {value: index for index, value in enumerate(le.classes_)}

# Print the mapping
print("Mapping for 'Agent Shift':", tenure_mapping)

# Apply mapping to the DataFrame (encoding)
df['Agent Shift(encoded)'] = df['Agent Shift'].map(tenure_mapping)

In [None]:
df.drop(columns=['Tenure Bucket','Agent Shift'], inplace=True)

In [None]:
df

In [None]:
# Check for numerical and categorical columns
numerical_cols = df.select_dtypes(include=['number']).columns.tolist()
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

print("Numerical Columns:", numerical_cols)
print("Categorical Columns:", categorical_cols)

In [None]:
# Display unique values count for numerical columns
print("Unique values count in numerical columns:")
for col in numerical_cols:
    print(f"{col}: {df[col].nunique()}")

# Display unique values count for categorical columns
print("\nUnique values count in categorical columns:")
for col in categorical_cols:
    print(f"{col}: {df[col].nunique()}")

In [None]:
df_category=pd.get_dummies(df['category'] )
df_sub_category=pd.get_dummies(df['Sub-category'],prefix='Sub_category')
df_Product_category=pd.get_dummies(df['Product_category'],prefix='Product_category')

Explanation:

1. pd.get_dummies(df['category']):

Converts the 'category' column into multiple binary columns (one for each unique category value).
Each column will have a value of 1 if that row belongs to that category, otherwise 0.

2. pd.get_dummies(df['Sub-category'], prefix='Sub_category'):

Converts the 'Sub-category' column similarly, but also adds a prefix to each column name (Sub_category_).

3. pd.get_dummies(df['Product_category'], prefix='Product_category'):

Converts the 'Product_category' column into binary columns with a prefix (Product_category_).

In [None]:
df=df.join(df_category)


.join(): Adds the columns from df_category to df based on the index. If df_category has the same number of rows and the index aligns correctly with df, this will work fine.

However, .join() is slightly less efficient than pd.concat() for combining DataFrames, especially if you plan to join multiple DataFrames sequentially.

In [None]:
df=df.join(df_sub_category)

In [None]:
df=df.join(df_Product_category)

In [None]:
df.info()

In [None]:
bool_cols = df.select_dtypes(include='bool').columns
df[bool_cols] = df[bool_cols].astype(int)

used to convert boolean columns in a pandas DataFrame to integer type.

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

In [None]:
df

In [None]:
# Calculate the issue resolved time
df['issue_resolved_time'] = (df['issue_responded'] - df['Issue_reported at']).dt.total_seconds() / 60  # Convert to minutes


In [None]:
df

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting


In [None]:
df.drop(columns=['Sub-category','category','Unique id','Customer Remarks','Agent_name','Supervisor','Manager','order_date_time','Customer_City','channel_name'], inplace=True)

In [None]:
df.drop(columns=['issue_responded','Issue_reported at','Product_category','Order_id','Survey_response_Date','order_year'], inplace=True)

In [None]:
# Calculate correlation
df.corr

In [None]:
df.drop(columns=['connected_handling_time'],inplace=True)

In [None]:
df.columns

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

In [None]:
for col in df:
  plt.figure(figsize=(10,6))
  sns.displot(x=df[col] , color ='green')
  plt.xlabel(col)
  plt.axvline(df[col].mean(),color='magenta', linestyle='dashed',linewidth=2)
  plt.axvline(df[col].median(),color='cyan', linestyle='dashed',linewidth=2)
  plt.show()
plt.show()

### 6. Data Scaling

In [None]:
# Scaling your data

In [None]:


from numpy import asarray
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
print(scaled)


##### Which method have you used to scale you data and why?

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

In [None]:
# DImensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

In [None]:
#separating the dependent and independent variables
y = df['CSAT Score']
x = df.drop(columns = 'CSAT Score')

# Split your data to train and test. Choose Splitting ratio wisely.
x_train, x_test, y_train, y_test = train_test_split( x,y , test_size = 0.2, random_state = 0)


##### What data splitting ratio have you used and why?

this parameter indicates that 20% of the data will be used for testing, while 80% will be used for training.

### 9. Handling Imbalanced Dataset

In [None]:
df['CSAT Score'].value_counts()

In [None]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
x_res, y_res = sm.fit_resample(x_train, y_train)

 SMOTE (Synthetic Minority Over-sampling Technique) from the imblearn (imbalanced-learn) library to address class imbalance in the dataset by oversampling the minority class in the target variable

In [None]:
print('After OverSampling, the shape of train_X: {}'.format(x_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_res == 1.0)))
print("After OverSampling, counts of label '2': {}".format(sum(y_res == 2.0)))
print("After OverSampling, counts of label '3': {}".format(sum(y_res == 3.0)))
print("After OverSampling, counts of label '4': {}".format(sum(y_res == 4.0)))
print("After OverSampling, counts of label '5': {}".format(sum(y_res == 5.0)))


##### Do you think the dataset is imbalanced? Explain Why.

After applying SMOTE to the dataset, the class distribution is balanced

## ***7. ANN Model Implementation***

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from keras.utils import to_categorical

In [None]:
model = tf.keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(x_train.shape[1],)),
    layers.Dense(32, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(5, activation='softmax')
])

In [None]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])


In [None]:
# Train the model
model.fit(x_res, tf.keras.utils.to_categorical(y_res - 1), epochs=1000, batch_size=32, validation_split=0.2)


 it trains a machine learning model using TensorFlow and Keras, specifically applying the fit method to train the model with the resampled data after SMOTE.

In [None]:
# Save the trained model for future use
model.save('csat_prediction_model.keras')

In [None]:
# Load the model when needed for predictions
from tensorflow.keras.models import load_model
# Load the trained model
loaded_model = load_model('/content/csat_prediction_model.keras')




In [None]:
def preprocess_input(data):
    """
    This function accepts a raw data dictionary, preprocesses it to match the training format,
    and returns the processed input for the model.
    """
    # Convert to DataFrame
    input_df = pd.DataFrame([data])

    # Handle missing values
    input_df['Customer Remarks'] = input_df['Customer Remarks'].fillna('Unknown')
    input_df['Customer_City'] = input_df['Customer_City'].fillna('Unknown')
    input_df['Item_price'] = input_df['Item_price'].fillna(input_df['Item_price'].median())
    # Change to 0 if connect handling time is empty
    input_df['connected_handling_time'] = input_df['connected_handling_time'].replace('', '0').astype(float)
    input_df['connected_handling_time'] = input_df['connected_handling_time'].fillna(input_df['connected_handling_time'].median())

    # Convert datetime columns
    datetime_cols = ['order_date_time', 'Issue_reported at', 'issue_responded', 'Survey_response_Date']
    for col in datetime_cols:
        input_df[col] = pd.to_datetime(input_df[col], errors='coerce', dayfirst=True)

    # Create additional features
    input_df['order_year'] = input_df['order_date_time'].dt.year
    input_df['order_month'] = input_df['order_date_time'].dt.month
    input_df['order_day'] = input_df['order_date_time'].dt.day
    input_df['order_hour'] = input_df['order_date_time'].dt.hour
    input_df['order_day_of_week'] = input_df['order_date_time'].dt.dayofweek

    input_df['Response Lag'] = (input_df['issue_responded'] - input_df['Issue_reported at']).dt.total_seconds() / 3600.0
    input_df['Survey Lag'] = (input_df['Survey_response_Date'] - input_df['Issue_reported at']).dt.total_seconds() / 3600.0

    # Label encode columns as in training
    le = LabelEncoder()
    input_df['Tenure Bucket (encoded)'] = le.fit_transform(input_df['Tenure Bucket'])
    input_df['Agent Shift(encoded)'] = le.fit_transform(input_df['Agent Shift'])

    # Drop original columns that were encoded
    input_df.drop(columns=['Tenure Bucket', 'Agent Shift'], inplace=True)

    # One-hot encode categorical columns
    df_category = pd.get_dummies(input_df['category'])
    df_sub_category = pd.get_dummies(input_df['Sub-category'], prefix='Sub_category')
    df_Product_category = pd.get_dummies(input_df['Product_category'], prefix='Product_category')
    input_df = pd.concat([input_df, df_category, df_sub_category, df_Product_category], axis=1)

    # Transform boolean columns to int (if any)
    bool_cols = input_df.select_dtypes(include='bool').columns
    input_df[bool_cols] = input_df[bool_cols].astype(int)

    # Drop unnecessary columns as done in training
    drop_cols = ['Sub-category', 'category', 'Unique id', 'Customer Remarks', 'Agent_name',
                 'Supervisor', 'Manager', 'order_date_time', 'Customer_City', 'channel_name',
                 'issue_responded', 'Issue_reported at', 'Product_category', 'Order_id',
                 'Survey_response_Date', 'order_year']

    input_df.drop(columns=drop_cols, inplace=True, errors='ignore')

    # Ensure the input data matches the training set columns
    input_df = input_df.reindex(columns=x_train.columns, fill_value=0)

    return input_df

In [None]:
def predict_csat(input_data):
    """
    Accepts preprocessed input data, uses the loaded model to predict the CSAT score, and returns the result.
    """
    processed_data = preprocess_input(input_data)
    prediction = loaded_model.predict(processed_data)
    predicted_class = np.argmax(prediction, axis=1) + 1  # Add 1 to match the CSAT score (1 to 5)
    return predicted_class[0]


In [None]:
input_data = {
    'Unique id': '081f62d7-332f-4aac-91cf-e79758976725',
    'channel_name': 'Inbound',
    'category': 'Returns',
    'Sub-category': 'Reverse Pickup Enquiry',
    'Customer Remarks': '',  # Can be left empty
    'Order_id': '2509fa08-318d-4526-8122-51603af956a8',
    'order_date_time': '15-07-2023 14:47',  # Changeable
    'Issue_reported at': '01-08-2023 08:55',  # Changeable
    'issue_responded': '01-08-2023 08:57',  # Changeable
    'Survey_response_Date': '01-Aug-23',  # Changeable
    'Customer_City': 'BETIA',
    'Product_category': 'Electronics',
    'Item_price': 1099,  # Changeable
    'connected_handling_time': '',  # Can be left empty
    'Agent_name': 'Cynthia Mills',
    'Supervisor': 'William Park',
    'Manager': 'John Smith',
    'Tenure Bucket': '31-60',
    'Agent Shift': 'Morning'
}




In [None]:
# Run prediction
predicted_score = predict_csat(input_data)


In [None]:
print(f'Predicted CSAT Score: {predicted_score}')

**Conclusion**

This project successfully developed a deep learning model to predict Customer Satisfaction (CSAT) scores for an e-commerce platform. The project involved a series of key steps that were carefully designed to handle the complexities of the dataset and ensure reliable predictions. First, we preprocessed the raw data by addressing missing values, encoding categorical variables, and creating meaningful features such as Response Lag and Survey Lag, which are critical in determining customer satisfaction. We also tackled the issue of class imbalance by applying SMOTE to resample the data, ensuring that the model learned to predict across all satisfaction levels equally.

Next, we built a robust Deep Neural Network (DNN) model capable of capturing complex relationships within the data. The model was trained using efficient techniques such as batch processing, learning rate optimization, and early stopping to prevent overfitting. After training, we evaluated the model using relevant performance metrics, and the results showed that it was capable of predicting CSAT scores with a reasonable degree of accuracy.

A key part of this project was the development of a preprocessing pipeline that allows the model to handle new input data consistently, ensuring that predictions can be made in real-time on fresh customer interactions. The pipeline transforms raw data into a format compatible with the trained model, enabling businesses to take immediate action based on predicted satisfaction scores. Additionally, the insights derived from the model can help organizations identify areas for improvement in customer service, ultimately enhancing customer experience.

While the model shows promising results, there are opportunities for further improvement. For example, incorporating additional features, such as sentiment analysis of customer feedback or refining the model’s architecture, could enhance its predictive power. Furthermore, real-time deployment could allow the model to provide on-the-fly insights, enabling proactive measures to improve customer satisfaction. Overall, this project demonstrates the potential of deep learning in predictive analytics and highlights the importance of effective data preprocessing, feature engineering, and model optimization in building accurate and actionable customer satisfaction models.