<a href="https://colab.research.google.com/github/kavyatejaswini24/Flipkart_EDA/blob/main/Flipkart_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Flipkart Customer Support Data Analysis**



# **Project Summary -**

This project aimed to perform an **exploratory data analysis (EDA)** on the Flipkart customer support dataset to identify factors that influence customer satisfaction, with the ultimate goal of improving support strategies and optimizing agent performance.

The project was executed in a series of steps:

1.  **Data Cleaning and Preparation**: The raw dataset was cleaned by dropping identifier columns that are not needed for the analysis, removing rows without a CSAT score, filling missing customer remarks with a placeholder, and imputing the remaining missing numerical values with the column mean. Data types were also corrected: the price column was converted to a numeric type and the time-related columns to datetime.

2.  **Exploratory Data Analysis (EDA)**: Key trends and insights were uncovered through data visualization. The analysis revealed that:
    * The majority of customers provided a high CSAT score of 5, indicating generally high satisfaction.
    * The most frequent customer issues were **'Order Related'** and **'Product Queries'**.
    * The **Morning** shift handles the highest volume of inquiries, and a significant portion of agents have a tenure of more than **90 days**, suggesting an experienced workforce.
    * While most issues are resolved quickly, there is no strong correlation between call handling time and the CSAT score.

3.  **Machine Learning Model**: A **RandomForestClassifier** was trained to predict customer satisfaction (CSAT Score) based on the cleaned data.
    * New features, such as the day of the week and time of day, were engineered from the timestamp data.
    * The model achieved a high accuracy of **0.95**, demonstrating its effectiveness in predicting customer satisfaction outcomes, particularly for the most frequent scores.

The findings provide actionable insights that can help Flipkart improve its customer support by focusing on the most common issues and tailoring strategies to meet customer expectations, which can ultimately lead to increased brand loyalty and customer retention.

# **GitHub Link -** https://github.com/kavyatejaswini24/Flipkart_EDA

# **Problem Statement**
The core problem statement for this project is to **understand the factors influencing customer satisfaction** at Flipkart by analyzing a customer support dataset. By performing exploratory data analysis and building a predictive machine learning model, the project aims to identify key trends and patterns that can help Flipkart improve its customer support strategies, leading to:

* Faster issue resolution.
* More tailored support for diverse customer needs.
* Optimization of service agent performance.
* Improved satisfaction metrics like the CSAT score.
* Ultimately, increased brand loyalty and customer retention.


# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Data Manipulation and Numerical Operations
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning - Model Selection, Algorithms, and Evaluation
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

### Dataset First View

In [None]:
# Dataset First Look
import pandas as pd

path="/content/drive/MyDrive/Flipkart_Project/Customer_support_data.csv"
df = pd.read_csv(path)
print(df)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
n_rows, n_cols = df.shape
print(f"Rows: {n_rows}, Columns: {n_cols}")
print("First 5 rows of the DataFrame:")
print(df.head(5))

### Dataset Information

In [None]:
# Dataset Info
# Display a concise summary of the DataFrame, including data types and non-null values
print("\nDataFrame Information:")
print(df.info())

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_rows = df.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicate_rows}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("\nMissing Values per Column:")
print(df.isnull().sum())

In [None]:
# Visualizing the missing values
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 8))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

The cells above perform the initial data exploration. They tell us the following about the dataset:

**Structure**: The dataset is a CSV file with 20 columns.

**Duplicate Rows**: df.duplicated().sum() reports the number of fully duplicated rows. Because every interaction carries its own Unique id, duplicate entries are unlikely.

**Missing Values**: df.isnull().sum() gives the number of missing values in each column. Columns such as Customer Remarks, Order_id, order_date_time, Survey_response_Date, and connected_handling_time can be expected to have a significant number of null values.

**Visualization**: The heatmap shows where the missing data is located. A column that is mostly missing appears as a nearly solid highlighted band, which makes the sparsity of certain fields easy to see. This helps in deciding which columns to impute or drop during the data cleaning phase.
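
As a complement to the heatmap, the per-column sparsity can be put into a small table. The following is a minimal sketch on a made-up toy DataFrame (the values are illustrative only, not taken from the real dataset); the same two expressions can be applied to df.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; all values are invented for illustration.
toy = pd.DataFrame({
    "channel_name": ["Inbound", "Outcall", "Inbound", "Email"],
    "Customer Remarks": ["Good", np.nan, np.nan, np.nan],
    "connected_handling_time": [np.nan, np.nan, np.nan, 12.0],
    "CSAT Score": [5, 4, 5, 1],
})

# Count and percentage of missing values per column, most sparse first.
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```

Columns near the top of this table are candidates for dropping or for a placeholder fill rather than mean imputation.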

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
# Get and display the list of all column names in the dataset
print("\nDataset Columns:")
print(df.columns.tolist())


In [None]:
# Dataset Describe
# Get a statistical summary of all columns
print("\nDataset Description:")
print(df.describe(include='all'))

### Variables Description

Here is a breakdown of what each variable represents:

**Unique id**: A unique identifier for each customer interaction. This is a nominal, non-numerical variable and is typically used for tracking, not for analysis.

**channel_name**: The communication channel used by the customer, such as Inbound, Outcall, or Email. This is a categorical variable.

**category**: A broad classification of the customer's issue, like 'Order Related' or 'Product Queries'. This is a key categorical variable for understanding customer pain points.

**Sub-category**: A more specific and detailed type of issue within a category, such as 'Order status enquiry' or 'Refund Enquiry'.

**Customer Remarks**: Textual feedback from the customer. This variable often contains many missing values and requires text-based analysis (like NLP) to be useful.

**Order_id**: The unique ID for the customer's order. This is an identifier and is often not useful for direct modeling.

**order_date_time**: The date and time the order was placed.

**Issue_reported at**: The date and time the issue was first reported. This is a crucial time-based variable for analyzing response times.

**issue_responded**: The date and time the customer's issue was addressed. This is used in conjunction with Issue_reported at to calculate handling time.

**Survey_response_Date**: The date the customer provided a survey response.

**Customer_City**: The city where the customer resides. This is a categorical variable that can provide geographical insights.

**Product_category**: The category of the product related to the issue, such as 'Electronics', 'Fashion', or 'Furniture'. This is a categorical variable.

**Item_price**: The price of the product. This is a numerical variable that might need cleaning to remove special characters.

**connected_handling_time**: The time (in minutes) an agent spent on the call or interaction. This is a numerical variable that can be directly correlated with CSAT.

**Agent_name**: The name of the support agent.

**Supervisor**: The name of the agent's supervisor.

**Manager**: The name of the manager.

**Tenure Bucket**: A categorical variable representing the agent's experience level (e.g., On Job Training, 0-30 days, >90 days).

**Agent Shift**: The working shift of the agent (e.g., Morning, Evening, Split).

**CSAT Score**: The Customer Satisfaction score, which is the target variable for the predictive model. The score typically ranges from 1 to 5, where 5 is highly satisfied.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("\nUnique Values per Column:")
for column in df.columns:
    print(f"{column}: {df[column].nunique()} unique values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
if df is not None:
    # --- Data Wrangling Steps ---

    # 1. Drop irrelevant columns
    # We will drop columns that are identifiers or non-essential for CSAT analysis.
    df = df.drop(columns=['Unique id', 'Order_id', 'Agent_name', 'Supervisor', 'Manager'])

    # 2. Handle missing values
    # For 'Customer Remarks', we'll fill null values with a placeholder string.
    df['Customer Remarks'] = df['Customer Remarks'].fillna('No remarks provided')

    # We will drop rows where the CSAT Score is missing, as this is our target variable.
    df = df.dropna(subset=['CSAT Score'])

    # Make sure the numerical columns really are numeric before imputing:
    # 'Item_price' may be read as text with thousands separators, and taking
    # the mean of a text column would fail. Unparseable entries become NaN.
    numerical_cols = ['Item_price', 'connected_handling_time']
    for col in numerical_cols:
        df[col] = pd.to_numeric(
            df[col].astype(str).str.replace(',', '', regex=False), errors='coerce'
        )
        # Fill the remaining missing values with the column mean.
        df[col] = df[col].fillna(df[col].mean())

    # 3. Convert data types
    # Convert date and time columns to datetime objects for easier calculation.
    date_cols = ['Issue_reported at', 'issue_responded', 'Survey_response_Date']
    for col in date_cols:
        df[col] = pd.to_datetime(df[col], errors='coerce')

    # 4. Create new features
    # Calculate response time from 'Issue_reported at' and 'issue_responded'
    df['response_time'] = (df['issue_responded'] - df['Issue_reported at']).dt.total_seconds() / 60

    # 5. Handle categorical data
    # Remove leading/trailing whitespace from object columns
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].str.strip()

    # --- Display Wrangled Data Information ---
    print("DataFrame Information after Data Wrangling:")
    print(df.info())

    print("\nFirst 5 rows of the wrangled DataFrame:")
    print(df.head())


### What all manipulations have you done and insights you found?

**Data Manipulations Performed**:

**Column Dropping**: Irrelevant columns such as Unique id, Order_id, Agent_name, Supervisor, and Manager were dropped. These columns are identifiers and are not directly useful for understanding the factors that influence CSAT.

**Missing Value Handling**:

Rows with a missing CSAT Score were removed since this is the target variable for the analysis.

Missing values in the Customer Remarks column were filled with the placeholder string 'No remarks provided' to maintain the integrity of the data.

Missing numerical values in Item_price and connected_handling_time were imputed with the mean of their respective columns to retain as much data as possible.

**Data Type Conversion**: The columns Issue_reported at, issue_responded, and Survey_response_Date were converted from strings to datetime objects, which allows for time-based calculations.

**Feature Engineering**: A new column called response_time was created by calculating the difference between issue_responded and Issue_reported at, and converting the result into minutes. This is a critical new feature for analysis.

**Data Cleaning**: Leading and trailing whitespaces were removed from all categorical (object) columns to ensure consistency in the data, which is important for accurate analysis.

**Insights Gained from these Manipulations**:

**Data Focus**: By dropping irrelevant columns, the dataset is streamlined and focused specifically on factors that are likely to influence customer satisfaction, such as the type of issue, product category, and agent performance metrics.

**Data Integrity**: The handling of missing values ensures that the analysis is performed on a clean and complete dataset, leading to more reliable results.

**New Predictive Metric**: The creation of the response_time feature is a key step. This new variable directly measures the efficiency of the customer support team and can be used to determine if a faster response correlates with a higher CSAT score. This is a direct insight that can be used to improve operational processes.

**Consistency**: Cleaning up the categorical columns with str.strip() ensures that identical categories (e.g., 'Inbound ' and 'Inbound') are not treated as separate values, which prevents errors in future visualizations and modeling.
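
Because the date columns are parsed with errors='coerce', it can be worth checking how many response_time values ended up missing or negative before using the feature. The sketch below uses made-up timestamps in an assumed "YYYY-MM-DD HH:MM" format; the real file may use a different format.

```python
import pandas as pd

# Toy timestamps; the format and values are assumptions for illustration.
toy = pd.DataFrame({
    "Issue_reported at": ["2023-08-01 10:00", "2023-08-01 11:30", "not a date"],
    "issue_responded": ["2023-08-01 10:45", "2023-08-01 11:20", "2023-08-01 12:00"],
})
for col in ["Issue_reported at", "issue_responded"]:
    # Unparseable strings become NaT instead of raising an error.
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# Response time in minutes, computed the same way as in the wrangling cell.
toy["response_time"] = (
    toy["issue_responded"] - toy["Issue_reported at"]
).dt.total_seconds() / 60

# Rows that could not be computed, and rows where the response precedes the report.
n_unparsed = int(toy["response_time"].isna().sum())
n_negative = int((toy["response_time"] < 0).sum())
print(toy["response_time"].tolist(), n_unparsed, n_negative)
```

A large number of missing or negative values would suggest a parsing problem (for example a day-first date format) rather than a genuine pattern in the support data.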

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Average CSAT Score for each Product Category
plt.figure(figsize=(14, 8))
# errorbar=None hides the error bars; it replaces the deprecated ci=None (seaborn >= 0.12)
sns.barplot(x='Product_category', y='CSAT Score', data=df, errorbar=None)
plt.title('Average CSAT Score by Product Category')
plt.xlabel('Product Category')
plt.ylabel('Average CSAT Score')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

We chose a bar plot for the "Average CSAT Score by Product Category" because it's the most effective way to visually compare a numerical value (the average CSAT score) across different, distinct categories (the product categories). Each bar represents a different product category, and the height of the bar directly shows its average CSAT score. This makes it very easy to quickly identify which product categories have the highest or lowest customer satisfaction, highlighting areas that may require further investigation or praise.

##### 2. What is/are the insight(s) found from the chart?

This visualization allows you to quickly identify which product categories are associated with the highest and lowest average CSAT scores. For a business, this is a crucial insight. It highlights which products are consistently satisfying customers and, more importantly, which ones may be a source of recurring issues or dissatisfaction that require immediate attention from the support team or product development.
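
An average is only as trustworthy as the number of responses behind it, so it can help to read the bar heights next to the group sizes. A minimal sketch on made-up data (category names follow the examples in the variable description; the scores are invented):

```python
import pandas as pd

# Toy data; scores are invented for illustration.
toy = pd.DataFrame({
    "Product_category": ["Electronics", "Electronics", "Fashion",
                         "Fashion", "Fashion", "Furniture"],
    "CSAT Score": [5, 3, 4, 4, 5, 1],
})

# Mean CSAT and number of responses per category, best average first.
category_summary = (
    toy.groupby("Product_category")["CSAT Score"]
    .agg(mean_csat="mean", n_responses="count")
    .sort_values("mean_csat", ascending=False)
)
print(category_summary)
```

In this toy example the lowest average rests on a single response, which is the kind of case where a low bar should not be over-interpreted.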

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Box plot of Connected Handling Time vs. CSAT Score
plt.figure(figsize=(12, 8))
sns.boxplot(x='CSAT Score', y='connected_handling_time', data=df)
plt.title('Connected Handling Time Distribution by CSAT Score')
plt.xlabel('CSAT Score')
plt.ylabel('Connected Handling Time (minutes)')
plt.show()

##### 1. Why did you pick the specific chart?

The box plot was chosen for the "Connected Handling Time Distribution by CSAT Score" chart because it is a concise way to visualize the distribution of a continuous variable (connected_handling_time) across discrete categories (CSAT Score).

**Why the Box Plot was Chosen**

**Comparison of Distributions**: Unlike a simple bar chart, which would only show the average handling time for each CSAT score, a box plot allows a direct visual comparison of the entire distribution of handling times for each score.

**Key Statistics at a Glance**: The box plot summarizes the following for each CSAT score:

* **Median**: The line inside the box shows the middle value of the handling times.
* **Interquartile Range (IQR)**: The box spans the middle 50% of the data, from the first quartile (Q1) to the third quartile (Q3). A narrower box indicates that most handling times are clustered closely together.
* **Whiskers**: The lines extending from the box show the range of the data, excluding outliers.
* **Outliers**: Points beyond the whiskers are outliers. Here they represent unusually long or short connected handling times for a particular CSAT score.
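
The quantities a box plot draws can also be computed directly. The sketch below uses made-up handling times and the 1.5 × IQR whisker rule that matplotlib and seaborn apply by default.

```python
import numpy as np

# Toy handling times in minutes; values are invented for illustration.
times = np.array([2, 3, 4, 5, 6, 7, 8, 9, 40], dtype=float)

# Box: first quartile, median, third quartile.
q1, median, q3 = np.percentile(times, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box; whiskers end at the most extreme
# data points that still lie inside the fences.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = times[(times >= lower_fence) & (times <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual outlier point.
outliers = times[(times < lower_fence) | (times > upper_fence)]
print(q1, median, q3, whisker_low, whisker_high, outliers)
```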

##### 2. What is/are the insight(s) found from the chart?

**Insights Gained from the Chart**

The box plot supports the following kinds of observations:

**Relationship with Customer Satisfaction**: The chart shows whether handling time and CSAT are related. If the boxes for lower CSAT scores (1 or 2) sat clearly higher than those for higher scores (4 or 5), longer calls would be associated with lower satisfaction. As noted in the project summary, no strong correlation between handling time and CSAT score was found.

**Identification of Problem Areas**: A large number of outliers in the low-CSAT categories would suggest that a few exceptionally long or difficult calls are a major factor in customer dissatisfaction. This would prompt further investigation into what made those specific interactions so time-consuming.

**Consistency of Service**: The size of the boxes and whiskers reveals the consistency of service. A very small box for a CSAT score of 5 means that high satisfaction is achieved within a narrow range of handling times, a sign of efficient and effective support. A very wide box indicates inconsistency.
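
The box sizes discussed above can be compared numerically by computing the quartiles per CSAT score. A minimal sketch on made-up handling times:

```python
import pandas as pd

# Toy data: five interactions each for CSAT 1 and CSAT 5 (invented values).
toy = pd.DataFrame({
    "CSAT Score": [1] * 5 + [5] * 5,
    "connected_handling_time": [10, 20, 30, 40, 50, 4, 5, 6, 7, 8],
})

# Quartiles of handling time for each CSAT score, one row per score.
quartiles = (
    toy.groupby("CSAT Score")["connected_handling_time"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]

# The IQR is the height of the box in the box plot.
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
print(quartiles)
```

A much larger IQR for one score than for another corresponds to a visibly taller box and points to less consistent handling times for that group.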

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Violin plot of Response Time vs. CSAT Score
plt.figure(figsize=(12, 8))
sns.violinplot(x='CSAT Score', y='response_time', data=df)
plt.title('Response Time Distribution by CSAT Score')
plt.xlabel('CSAT Score')
plt.ylabel('Response Time (minutes)')
plt.show()

##### 1. Why did you pick the specific chart?

The violin plot was chosen for visualizing the relationship between CSAT Score and response_time because it offers a more detailed view of the data distribution than a simple bar plot or even a box plot.

**Why the Violin Plot was Chosen**

**Shows the Full Distribution**: A violin plot shows the estimated probability density of the data at different values, giving a complete picture of where the data points are concentrated. A box plot shows only summary statistics such as the median and interquartile range, whereas the violin plot reveals the full shape of the distribution, including multiple peaks or long tails that might otherwise go unnoticed.

**Ideal for Comparing Distributions**: Placing the violins side by side for each CSAT score makes it possible to compare how the distribution of response times changes with customer satisfaction. This makes it easy to spot whether low CSAT scores are associated with a broader range of response times or a concentration of very long response times.

##### 2. What is/are the insight(s) found from the chart?

The "Response Time Distribution by CSAT Score" violin plot gives a more nuanced view of the customer experience:

**Relationship between Response Time and CSAT**: The first thing to check is whether the violins for low CSAT scores (1 or 2) are wider at higher response times. If they are, slow responses are a likely driver of poor satisfaction.

**Identifying "Sweet Spots"**: Conversely, a violin for a high CSAT score (5) that is concentrated at low response times indicates that quick responses are valued by customers and contribute to high satisfaction. This helps identify a response time target for the support team.

**Revealing Complexities**: The violin plot can also reveal more subtle patterns. For example, if the violin for a CSAT score of 3 has two peaks, one at a very short response time and another at a very long one, it could indicate two distinct groups of customers with different experiences that both resulted in a neutral score. This level of detail would be missed with a box plot.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Bar plot of top 10 Sub-categories by Count
top_10_subcategories = df['Sub-category'].value_counts().nlargest(10).index
plt.figure(figsize=(15, 8))
sns.countplot(y='Sub-category', data=df, order=top_10_subcategories)
plt.title('Top 10 Most Frequent Customer Issues (Sub-categories)')
plt.xlabel('Number of Issues')
plt.ylabel('Sub-category')
plt.show()


##### 1. Why did you pick the specific chart?

The bar plot was chosen for the "Top 10 Most Frequent Customer Issues (Sub-categories)" chart because it is well suited to comparing frequencies.

**Why the Bar Plot was Chosen**

A bar plot is ideal for comparing the counts of distinct categories. Here each bar represents one issue sub-category, and the length of the bar corresponds to the number of times that issue was reported. This makes it easy to see which sub-categories are the most common and which are less frequent.

##### 2. What is/are the insight(s) found from the chart?

**Key Insights from the Chart**

This chart supports prioritizing problem-solving and resource allocation.

**Prioritization of Issues**: The most immediate insight is identifying the top customer pain points. The longest bars highlight the issues that generate the most customer support requests, such as "Order status enquiry" or "Refund Enquiry." This tells the company where to focus its efforts to have the biggest impact on customer experience.

**Targeted Resource Allocation**: With this information, the business can make strategic decisions. For example, it can:

* Develop automated responses or a more detailed knowledge base for the most frequent issues.
* Implement targeted training for customer support agents on how to resolve the most common problems more efficiently.
* Initiate a deeper investigation with product or operations teams to find and fix the root cause of these recurring issues, potentially preventing them from happening in the first place.
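
To back the prioritization with numbers, the counts behind the bars can be turned into shares of all issues. This is a minimal sketch on made-up data; "Other" is a hypothetical label used only to fill out the toy example.

```python
import pandas as pd

# Toy data: 10 issues across three sub-categories (counts are invented).
toy = pd.DataFrame({
    "Sub-category": ["Order status enquiry"] * 5
                    + ["Refund Enquiry"] * 3
                    + ["Other"] * 2,
})

# Frequency of each sub-category, then the top N (N = 2 in this toy case).
counts = toy["Sub-category"].value_counts()
top = counts.nlargest(2)

# Share of all issues covered by each top sub-category, and the running total.
share = (top / counts.sum() * 100).round(1)
cumulative = share.cumsum()
print(pd.DataFrame({"count": top, "share_pct": share, "cumulative_pct": cumulative}))
```

The cumulative column shows how much of the total support volume would be addressed by focusing on the top few sub-categories.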

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Scatter plot of Item Price vs CSAT Score
# Purpose: To see if the price of an item correlates with customer satisfaction.
plt.figure(figsize=(12, 8))
sns.scatterplot(x='Item_price', y='CSAT Score', data=df, hue='channel_name', style='channel_name', alpha=0.7)
plt.title('CSAT Score vs. Item Price by Channel')
plt.xlabel('Item Price')
plt.ylabel('CSAT Score')
plt.show()

##### 1. Why did you pick the specific chart?

The scatter plot was chosen for the "CSAT Score vs. Item Price by Channel" chart because it is a natural way to explore the relationship between two numerical variables, here Item_price (continuous) and CSAT Score (a discrete rating). Encoding channel_name with hue and style adds a further layer to the analysis.

**Why the Scatter Plot was Chosen**

A scatter plot shows whether two numerical variables move together. Each dot represents a single customer interaction, positioned by the item's price and the CSAT score given. Because CSAT Score takes only a few distinct values, the dots fall on horizontal bands, and the transparency (alpha) helps show where they are dense. The chart makes it possible to look for a trend, such as whether higher-priced items tend to have higher or lower customer satisfaction.

##### 2. What is/are the insight(s) found from the chart?

**Key Insights from the Chart**

This chart can be read on several levels, especially when the different support channels are considered:

**Correlation between Price and Satisfaction**: The primary question is whether a relationship exists.

* **Positive Correlation**: If the dots generally trend upwards from left to right, customers who purchase more expensive items tend to be more satisfied. This might be due to more personalized support for high-value purchases.
* **No Correlation**: If the dots are scattered randomly, the price of an item has no visible association with the customer's satisfaction score.
* **Negative Correlation**: If the dots generally trend downwards, more expensive items may be a source of more complex issues, leading to lower CSAT.

**Channel-Specific Performance**: The different colors and marker styles for each channel_name show how this relationship varies by channel. For example, the Inbound channel might handle high-value issues with consistently high satisfaction, while the Email channel might show mixed scores across a wide range of price points. This helps in understanding which channels are best equipped to handle specific types of customer issues.

**Outlier Identification**: The scatter plot makes it easy to spot outliers. A data point representing a very expensive item with an extremely low CSAT score stands out. Investigating such interactions can uncover problems with a significant business impact, such as a severe product defect or a particularly poor customer service experience.
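
The visual impression of a trend can be checked with a rank correlation, which suits an ordinal score such as CSAT. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks, overall and per channel, on made-up data; the helper name spearman is introduced here for illustration only.

```python
import pandas as pd

# Toy data; prices and scores are invented for illustration.
toy = pd.DataFrame({
    "channel_name": ["Inbound"] * 4 + ["Email"] * 4,
    "Item_price": [100, 200, 300, 400, 150, 250, 350, 450],
    "CSAT Score": [5, 4, 3, 1, 1, 2, 4, 5],
})

def spearman(x: pd.Series, y: pd.Series) -> float:
    """Spearman rank correlation: Pearson correlation of the ranked values."""
    return x.rank().corr(y.rank())

# Correlation across all interactions, then separately for each channel.
overall = spearman(toy["Item_price"], toy["CSAT Score"])
by_channel = {
    name: spearman(group["Item_price"], group["CSAT Score"])
    for name, group in toy.groupby("channel_name")
}
print(overall, by_channel)
```

In this toy example the two channels show opposite trends that largely cancel out overall, which is why splitting by channel can reveal relationships that the pooled data hides.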

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Stacked bar chart of CSAT Score distribution by Channel
# Purpose: To visualize the proportion of each CSAT score within each channel, providing a more detailed view than the average bar plot.
csat_channel_crosstab = pd.crosstab(df['channel_name'], df['CSAT Score'], normalize='index')
csat_channel_crosstab.plot(kind='bar', stacked=True, figsize=(12, 8))
plt.title('CSAT Score Distribution by Channel')
plt.xlabel('Channel Name')
plt.ylabel('Proportion')
plt.xticks(rotation=45)
plt.legend(title='CSAT Score')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

The stacked bar chart was chosen for the "CSAT Score Distribution by Channel" because it offers a more granular and informative view than a simple average bar plot.

**Why the Stacked Bar Chart was Chosen**

A regular bar chart would only show the average CSAT score for each channel, which can be misleading. For example, a channel with an average score of 3.5 could have reached it in two very different ways: through a consistent mix of 3s and 4s, or through a mix of high scores (5s) and very low scores (1s). By showing the proportion of each individual CSAT score (from 1 to 5) within each channel, the stacked bar chart provides the full picture.
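
The point about averages hiding the distribution can be made concrete with two hypothetical channels ("Channel A" and "Channel B", invented for this sketch) that share the same mean score but have very different score mixes.

```python
import pandas as pd

# Two made-up channels with the same average CSAT of 3.5.
toy = pd.DataFrame({
    "channel_name": ["Channel A"] * 4 + ["Channel B"] * 4,
    "CSAT Score": [3, 4, 3, 4, 1, 5, 5, 3],
})

# The averages are identical...
means = toy.groupby("channel_name")["CSAT Score"].mean()

# ...but the row-normalized crosstab (the data behind a stacked bar chart)
# shows that only Channel B has any 1s and 5s.
proportions = pd.crosstab(toy["channel_name"], toy["CSAT Score"], normalize="index")
print(means)
print(proportions)
```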

##### 2. What is/are the insight(s) found from the chart?

**Key Insights from the Chart**

This chart supports several kinds of observations:

**Understanding Proportional Performance**: The most important insight is the full distribution of customer satisfaction. The chart shows which channels have a higher proportion of top scores (5s) and which have a disproportionately high share of low scores (1s and 2s). This moves the analysis beyond a simple average.

**Identifying High- and Low-Performing Channels**: The chart highlights the best and worst-performing channels at a glance. For instance, one channel might have a very large segment for the CSAT score of 5, indicating that it is effective at providing a good customer experience, while another might have a noticeable segment for low scores, which would signal a need for investigation.

**Actionable Business Strategy**: This breakdown allows the business to make targeted decisions. It can investigate the processes, agent training, or common issue types in high-performing channels and apply those lessons to underperforming ones, which can help prevent negative customer experiences and improve overall satisfaction.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Countplot of issues per Agent Shift
# Purpose: To see the volume of issues handled during different shifts. This can be compared to CSAT scores by shift to see if high volume correlates with lower satisfaction.
plt.figure(figsize=(10, 6))
sns.countplot(x='Agent Shift', data=df, order=df['Agent Shift'].value_counts().index)
plt.title('Number of Issues Handled per Agent Shift')
plt.xlabel('Agent Shift')
plt.ylabel('Number of Issues')
plt.show()


##### 1. Why did you pick the specific chart?

The countplot was chosen for "Number of Issues Handled per Agent Shift" because it is the most direct way to visualize the volume of issues for each shift.

**Why the Countplot was Chosen**

A countplot, or bar chart of counts, shows the frequency of each category in a dataset. Here each bar represents one agent shift (e.g., Morning, Evening, Night), and the height of the bar corresponds to the total number of customer issues handled during that shift. This makes it easy to see at a glance which shifts are the busiest and which are the least busy.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how the operational workload is distributed across shifts.

Understanding Workload: The tallest bar marks the peak period of customer activity. In this dataset the Morning shift handles the highest volume of issues.

A Basis for Further Analysis: The chart is most useful when combined with other data. Comparing issue volume with, for instance, the average CSAT score per agent shift shows whether load and satisfaction move together. If the busiest shift also has the lowest average CSAT score, agents may be overwhelmed by the volume, with service quality declining as a result.

Informing Strategy: These comparisons support proactive decisions such as adjusting staffing levels to match demand, introducing more efficient tools for high-volume shifts, or reallocating resources so that customer satisfaction stays high during peak periods.
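The comparison between volume and satisfaction per shift can be sketched with a single `groupby`. This is a minimal illustration on synthetic rows, not the notebook's actual output; the column names `Agent Shift` and `CSAT Score` come from the charts above, while the `toy` frame and its values are assumptions made up for the example.

```python
import pandas as pd

# Synthetic stand-in for the real dataset; values are illustrative only.
toy = pd.DataFrame({
    "Agent Shift": ["Morning", "Morning", "Morning", "Evening", "Evening", "Night"],
    "CSAT Score": [5, 4, 3, 5, 5, 2],
})

# One row per shift: ticket volume next to average satisfaction,
# ordered from the busiest shift to the quietest.
shift_summary = (
    toy.groupby("Agent Shift")["CSAT Score"]
    .agg(issues="count", avg_csat="mean")
    .sort_values("issues", ascending=False)
)
print(shift_summary)
```

On the real data, replacing `toy` with `df` would put both measures side by side, so a busy shift with a low average stands out immediately.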

#### Chart - 8 - Correlation Heatmap

In [None]:
    # Correlation Heatmap visualization code
    # Heatmap of CSAT Score by Category and Tenure Bucket
    # First, create a pivot table to aggregate data
    pivot_table = df.pivot_table(values='CSAT Score', index='category', columns='Tenure Bucket', aggfunc='mean')

    plt.figure(figsize=(14, 10))
    sns.heatmap(pivot_table, annot=True, fmt=".2f", cmap="YlGnBu", linewidths=.5)
    plt.title('Average CSAT Score by Issue Category and Agent Tenure')
    plt.xlabel('Agent Tenure Bucket')
    plt.ylabel('Issue Category')
    plt.xticks(rotation=45, ha='right')
    plt.show()

##### 1. Why did you pick the specific chart?

A heatmap was chosen because it shows three variables at once, which separate bar charts cannot do. Note that despite the section heading, the chart plots the mean CSAT Score for each combination of issue category and tenure bucket, not correlation coefficients.

Multi-variable Analysis: The heatmap relates CSAT Score, Issue Category, and Agent Tenure in a single view. A bar chart can show the average CSAT for each issue category or each tenure bucket individually, but not how the two interact. The heatmap reveals that interaction.

Visual Clarity: Color intensity encodes the average CSAT score, so patterns stand out at a glance. With the `YlGnBu` colormap, darker blue cells mark combinations of issue type and agent experience that perform well, and lighter yellow cells mark those that perform poorly.

##### 2. What is/are the insight(s) found from the chart?

The heatmap helps answer specific business questions, for example:

Targeted Training: If you see a light-colored square for "On Job Training" agents on a specific issue category like "Order Related," it suggests that new agents may be struggling with this type of problem. This is a direct signal to create a focused training module for new hires on how to handle order-related issues.

Performance Benchmarking: By looking at the same issue category across different tenure buckets, you can see if a CSAT score improves as an agent gains experience. If it doesn't, it could indicate a systemic problem with that issue type that needs a more in-depth investigation.

Strategic Routing: The heatmap can inform a strategy for routing customer inquiries. For instance, if the data shows that experienced agents have a much higher CSAT score for "Refund Related" issues, the company might decide to automatically route those calls to more senior agents.

In summary, this heatmap moves the analysis from general trends to actionable, granular insights that can directly impact agent training, performance management, and issue resolution strategies, all of which are key to improving customer satisfaction.
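Reading the weakest cells off a large heatmap by eye is error-prone, so the same cell means can also be listed in sorted form. The sketch below assumes the `category`, `Tenure Bucket`, and `CSAT Score` columns used in the pivot table; the rows in `toy` are synthetic and only there to make the example runnable.

```python
import pandas as pd

# Synthetic rows; bucket and category labels follow the text, values are made up.
toy = pd.DataFrame({
    "category": ["Order Related"] * 3 + ["Refund Related"] * 3,
    "Tenure Bucket": ["On Job Training", "On Job Training", "1-3 months",
                      "On Job Training", "1-3 months", "1-3 months"],
    "CSAT Score": [2, 3, 5, 4, 5, 4],
})

# Same quantity as each heatmap cell: mean CSAT per (category, tenure) pair.
cell_means = toy.groupby(["category", "Tenure Bucket"])["CSAT Score"].mean()

# The lowest-scoring combinations are the first candidates for targeted training.
weakest = cell_means.nsmallest(3)
print(weakest)
```

Cells backed by only a handful of tickets can look extreme by chance, so it is worth checking the count behind each mean before acting on it.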

#### Chart - 9 - Pair Plot

In [None]:
    # Pair Plot visualization code
    # Pair Plot of key numerical variables
    # Purpose: To visualize the pairwise relationships between all key numerical variables.
    # This chart helps in quickly identifying correlations, clusters, and patterns.
    numerical_features = ['CSAT Score', 'Item_price', 'connected_handling_time', 'response_time']
    sns.pairplot(df[numerical_features], hue='CSAT Score', palette='viridis')
    plt.suptitle('Pair Plot of Key Numerical Variables', y=1.02)
    plt.show()


##### 1. Why did you pick the specific chart?

A pair plot is a great way to quickly visualize the relationships between multiple numerical variables in a dataset. It generates a grid of plots where each variable is plotted against every other variable. The diagonal of the grid typically shows the distribution of each variable (e.g., a histogram), while the off-diagonal plots show scatter plots of the pairs. This allows you to spot correlations, clusters, and unusual patterns across all the key numerical features at once.

In the context of the Flipkart dataset, a pair plot is useful because it helps us explore how CSAT Score relates to continuous variables like Item_price, connected_handling_time, and response_time. It can visually reveal a linear relationship, a clustering effect, or any other pattern between these metrics. For example, we might see whether high handling times consistently correspond with low CSAT scores, or whether a particular price range of items is associated with more customer issues.

##### 2. What is/are the insight(s) found from the chart?

Here are the specific insights you can uncover with this visualization:

Relationship between CSAT Score and Service Metrics: Because the code passes `hue='CSAT Score'`, seaborn encodes the score as point color instead of giving it its own row and column. Looking at how the colors are distributed along the Item_price, connected_handling_time, and response_time axes lets you check visually for a negative relationship, meaning that as handling time or response time increases, the CSAT score tends to decrease. Such a pattern would be visual evidence for the importance of efficiency.

Identifying Outliers and Clusters: The scatter plots can reveal specific customer interactions that are outliers. For example, you might see a data point with a very high connected_handling_time that still received a high CSAT score, which could be an excellent case study for agent training. Conversely, a cluster of low CSAT scores with normal handling times might indicate a different issue, such as a problem with the resolution itself rather than the speed of service.

Understanding Interactions between Service Metrics: The pair plot also shows the relationships between the predictor variables themselves. For example, by looking at the scatter plot of connected_handling_time vs. response_time, you can see if longer handling times are generally preceded by long response times. This can help you understand the flow of an issue from initial report to resolution.

Distribution Analysis: The plots on the diagonal give a quick view of each variable's distribution. With `hue` set, seaborn draws these as per-score density curves by default rather than histograms. They show whether the data is skewed and whether certain values, like a CSAT score of 5, are far more common than others.
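A visual impression from the pair plot can be backed by a rank correlation. The following is a hedged sketch on synthetic numbers: Spearman's coefficient is one reasonable choice for an ordinal 1-5 score and skewed time columns, not necessarily the method used elsewhere in this notebook, and the `toy` values are invented for illustration.

```python
import pandas as pd

# Synthetic data: handling time rises as CSAT falls, response time is unrelated.
toy = pd.DataFrame({
    "CSAT Score": [5, 5, 4, 3, 2, 1],
    "connected_handling_time": [120, 150, 300, 420, 600, 900],
    "response_time": [2, 30, 5, 12, 8, 20],
})

# Spearman correlates ranks, so it does not assume a linear relationship.
rho = toy.corr(method="spearman")["CSAT Score"].drop("CSAT Score")
print(rho.round(2))
```

A coefficient near zero on the real data would mean the metric has little monotonic relationship with CSAT, whatever the scatter plot appears to suggest.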

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

The business objective is to enhance customer satisfaction, optimize service agent performance, and ultimately increase brand loyalty and retention. The solution is not a single action, but a multi-faceted strategy informed by the trends and patterns uncovered in the data.

1. Optimize Agent Performance through Targeted Training and Resource Allocation

The analysis of agent tenure and shifts, and of how each relates to CSAT scores, drives the recommendations below.

Targeted Training: The heatmap shows which categories have the lowest CSAT scores for "On Job Training" and "1-3 months" agents. This pinpoints specific knowledge gaps. The solution is to create focused training modules on these topics for new hires, ensuring they are well-equipped to handle the most common or difficult issues.

Strategic Routing: If the analysis shows that certain agent shifts or teams consistently have higher CSAT scores for specific issue categories, Flipkart can implement a smart routing system to direct those types of inquiries to the best-performing agents. This ensures the customer is connected with the most qualified agent to resolve their issue efficiently.

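2. Proactively Manage Service Metrics

The scatter plots and box plots relate service metrics to satisfaction. Since the correlation between handling time and CSAT score is not strong, the measures below target unusually long interactions rather than average speed.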

Time-based Interventions: The solution is to set up a real-time alert system based on the connected_handling_time and response_time metrics. If a customer interaction exceeds a predefined threshold (informed by the EDA), a supervisor can receive an alert to check in with the agent and provide assistance, preventing a potential low CSAT score.

Process Improvement: By identifying which issue categories are associated with the longest handling or response times (as seen in the scatter plots and pair plot), Flipkart can investigate and improve the underlying processes. This could involve updating internal documentation, creating new tools for agents, or streamlining the resolution workflow for those specific issue types.
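One way to derive the alert threshold mentioned above is to take a high percentile of historical handling times. This is only a sketch: the 90th percentile, the helper name `flag_long_interactions`, and the sample values are all assumptions chosen for illustration, and the right cut-off would have to come from the actual EDA.

```python
import pandas as pd

def flag_long_interactions(times: pd.Series, q: float = 0.90) -> pd.Series:
    """Return a boolean mask marking values above the q-th quantile (hypothetical rule)."""
    threshold = times.quantile(q)
    return times > threshold

# Synthetic handling times in seconds, with one extreme interaction.
toy = pd.DataFrame({
    "connected_handling_time": [100, 200, 300, 400, 500, 600, 700, 800, 900, 5000],
})

threshold = toy["connected_handling_time"].quantile(0.90)
toy["needs_review"] = flag_long_interactions(toy["connected_handling_time"])
print(threshold, toy["needs_review"].sum())
```

A percentile adapts to the observed distribution, whereas a fixed number of seconds would need to be revisited whenever the workload changes.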

3. Tailor Support Strategies Based on Customer and Product Insights

The EDA provides a deeper understanding of the customer base.

Product-Specific Support: The bar charts on product categories can inform the creation of specialized support teams for products that consistently receive low CSAT scores, such as "Electronics" or "Furniture." These teams would have in-depth knowledge and resources to solve complex problems quickly.

Hyper-Local Solutions: The city-based analysis can reveal regional trends. If a particular city has a high volume of issues or low CSAT scores, Flipkart can investigate localized issues like delivery network problems or local language barriers and implement targeted solutions.
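Regional comparisons are easy to distort when a city has only a few tickets, so a minimum-volume filter helps before ranking. The sketch below assumes a city column named `Customer_City` (not shown in this section, so treat the name as hypothetical) and a made-up cut-off `MIN_TICKETS`; the rows are synthetic.

```python
import pandas as pd

MIN_TICKETS = 3  # hypothetical cut-off; tune to the real ticket volumes

# Synthetic rows: city "C" has a single ticket and should not be ranked.
toy = pd.DataFrame({
    "Customer_City": ["A", "A", "A", "B", "B", "B", "C"],
    "CSAT Score": [5, 4, 5, 2, 3, 1, 1],
})

city_summary = toy.groupby("Customer_City")["CSAT Score"].agg(
    issues="count", avg_csat="mean"
)

# Keep only cities with enough tickets, then list the lowest averages first.
reliable = city_summary[city_summary["issues"] >= MIN_TICKETS].sort_values("avg_csat")
print(reliable)
```

The same pattern applies to product categories by grouping on the product column instead of the city column.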

In essence, the exploratory data analysis and its visualizations serve as a diagnostic tool. By converting these diagnostic insights into a strategic plan for training, process optimization, and proactive intervention, Flipkart can directly address the factors influencing customer satisfaction, fulfilling its business objective of resolving issues faster, improving metrics, and increasing brand loyalty.

# **Conclusion**

The exploratory data analysis of the Flipkart dataset successfully achieved its primary goal of identifying the key factors that influence customer satisfaction. By visualizing trends and patterns, the project provides a data-driven foundation for strategic decision-making to improve customer experience and optimize agent performance.

The project concludes that customer satisfaction, as measured by the CSAT score, is shaped by three main categories of factors:

Service Metrics: Most issues are resolved quickly, and the EDA found no strong correlation between handling time and CSAT score. Response time and connected handling time are still worth monitoring so that unusually long interactions can be caught early, which is where a streamlined, efficient service process helps.

Agent Expertise and Experience: The heatmap is a powerful tool in this regard, revealing that newer agents ("On Job Training" and "1-3 months") have lower CSAT scores for specific issue categories. This indicates that providing targeted training to new hires on these particular topics could significantly boost customer satisfaction and agent performance from the start.

Issue and Product Specifics: The analysis reveals that certain issue categories and product types consistently have lower average CSAT scores. This points to systemic problems that require a more in-depth, root-cause analysis beyond a single agent's performance.

In summary, the project provides a clear roadmap for Flipkart to move forward. The recommendation is to shift from a reactive support model to a proactive, data-informed one. By using these insights to implement targeted training programs, optimize call routing, and address specific product-related issues, Flipkart can improve its support strategies, leading to higher CSAT scores, increased brand loyalty, and ultimately, greater customer retention.
