In [28]:
import pandas as pd

# Enable copy-on-write so that adding columns to a derived DataFrame
# does not raise SettingWithCopyWarning.
pd.options.mode.copy_on_write = True


You are a Product Insights Analyst working with the **Ben & Jerry's** sales strategy team to investigate seasonal sales patterns through comprehensive data analysis. The team wants to understand how temperature variations and unique transaction characteristics impact ice cream sales volume. Your goal is to perform detailed data cleaning and exploratory analysis to uncover meaningful insights about seasonal sales performance.

In [29]:
# Load the CSV file into a DataFrame
ice_cream_sales_data = pd.read_csv('ice_cream_sales_data.csv')

# Display the DataFrame
print("Ice Cream Sales Data:")
print(ice_cream_sales_data)


Ice Cream Sales Data:
     sale_date  temperature             product_name  sales_volume  \
0   2024-07-05         62.0            Cherry Garcia            23   
1   2024-08-15         64.0            Chunky Monkey            26   
2   2024-09-25         66.0               Phish Food            29   
3   2024-10-05         68.0          Americone Dream            32   
4   2024-11-15         70.0  Chocolate Fudge Brownie            35   
..         ...          ...                      ...           ...   
56  2025-03-25         81.0            Cherry Garcia            61   
57  2025-04-05         83.0            Chunky Monkey            64   
58  2025-05-15         85.0               Phish Food            67   
59  2025-06-25         87.0          Americone Dream            70   
60  2024-12-25         89.0            Chunky Monkey           110   

   transaction_id  
0          TX0001  
1          TX0002  
2          TX0003  
3          TX0004  
4          TX0005  
..            ...

### Question 1 of 3

Identify and remove any duplicate sales transactions from the dataset to ensure accurate analysis of seasonal patterns.

In [30]:
# Remove duplicate transactions based on 'transaction_id'
ice_cream_sales_data_cleaned = ice_cream_sales_data.drop_duplicates(subset='transaction_id')

# Display the cleaned DataFrame
print("Cleaned Ice Cream Sales Data (duplicates removed):")
print(ice_cream_sales_data_cleaned)


Cleaned Ice Cream Sales Data (duplicates removed):
     sale_date  temperature                product_name  sales_volume  \
0   2024-07-05         62.0               Cherry Garcia            23   
1   2024-08-15         64.0               Chunky Monkey            26   
2   2024-09-25         66.0                  Phish Food            29   
3   2024-10-05         68.0             Americone Dream            32   
4   2024-11-15         70.0     Chocolate Fudge Brownie            35   
5   2024-12-25         72.0                  Half Baked            38   
6   2025-01-05         74.0  New York Super Fudge Chunk            41   
7   2025-02-15         76.0               Cherry Garcia            44   
8   2025-03-25         78.0               Chunky Monkey            47   
9   2025-04-05         80.0                  Phish Food            50   
10  2025-05-15         82.0             Americone Dream            53   
11  2025-06-25         84.0     Chocolate Fudge Brownie          1000   
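
Before dropping rows, it can help to count how many duplicates there are. The following is a minimal sketch on hypothetical toy data (the `toy` frame and its values are illustrative assumptions, not rows from the real CSV); it uses `DataFrame.duplicated` with the same `subset` as the cell above.

```python
import pandas as pd

# Hypothetical toy data: TX0002 appears twice.
toy = pd.DataFrame({
    "transaction_id": ["TX0001", "TX0002", "TX0002", "TX0003"],
    "sales_volume": [23, 26, 26, 29],
})

# Flag every row whose transaction_id has already appeared earlier.
dup_mask = toy.duplicated(subset="transaction_id", keep="first")
n_duplicates = int(dup_mask.sum())

# drop_duplicates keeps the first occurrence by default.
toy_cleaned = toy.drop_duplicates(subset="transaction_id")

print(f"Duplicate rows found: {n_duplicates}")
print(f"Rows before: {len(toy)}, rows after: {len(toy_cleaned)}")
```

Comparing the row counts before and after is a quick way to confirm that only the flagged rows were removed.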


### Question 2 of 3

Create a pivot table to summarize the total sales volume of ice cream products by month and temperature range.

Use the following temperature bins where each bin includes the upper bound but not the lower:
1. Less than 60 degrees
2. 60 to less than 70 degrees
3. 70 to less than 80 degrees
4. 80 to less than 90 degrees
5. 90 to less than 100 degrees
6. 100 degrees or more

*Note: The question contradicts itself. It asks for intervals that include the upper bound, such as (a, b], but then lists bins that include the lower bound, such as [a, b). The solution below adopts the latter by passing `right=False` to `pd.cut`.*
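
The two interval conventions differ only at the bin edges, which can be checked on a few boundary temperatures. This is a small sketch with made-up values (`boundary_temps` and `edges` are illustrative assumptions), comparing `pd.cut` with `right=True` and `right=False`.

```python
import pandas as pd

# Hypothetical temperatures sitting on or near the bin edges.
boundary_temps = pd.Series([60.0, 69.9, 70.0])
edges = [float("-inf"), 60, 70, 80]

# right=True builds (a, b] intervals; right=False builds [a, b) intervals.
upper_inclusive = pd.cut(boundary_temps, bins=edges, right=True)
lower_inclusive = pd.cut(boundary_temps, bins=edges, right=False)

comparison = pd.DataFrame({
    "temperature": boundary_temps,
    "right=True": upper_inclusive.astype(str),
    "right=False": lower_inclusive.astype(str),
})
print(comparison)
```

With `right=False`, a reading of exactly 70 lands in the 70-to-80 bin, which matches the "70 to less than 80" wording of the list; with `right=True` the same reading would stay in the 60-to-70 bin.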

In [31]:
# Define temperature bins and labels
temp_bins = [float('-inf'), 60, 70, 80, 90, 100, float('inf')]
temp_labels = [
    "Less than 60",
    "60 to <70",
    "70 to <80",
    "80 to <90",
    "90 to <100",
    "100 or more"
]

# Bin the temperature values
ice_cream_sales_data_cleaned['temperature_range'] = pd.cut(
    ice_cream_sales_data_cleaned['temperature'],
    bins=temp_bins,
    labels=temp_labels,
    right=False
)

# Extract month from sale_date
ice_cream_sales_data_cleaned['month'] = pd.to_datetime(
    ice_cream_sales_data_cleaned['sale_date']
).dt.month

# Create the pivot table
pivot_table = pd.pivot_table(
    ice_cream_sales_data_cleaned,
    values='sales_volume',
    index='month',
    columns='temperature_range',
    aggfunc='sum',
    fill_value=0,
    observed=False
)

print("Pivot Table: Total Sales Volume by Month and Temperature Range")
print(pivot_table)


Pivot Table: Total Sales Volume by Month and Temperature Range
temperature_range  Less than 60  60 to <70  70 to <80  80 to <90  90 to <100  \
month                                                                          
1                             0        190         41        149           0   
2                             0        116        102         22           0   
3                            25        119        130         61           0   
4                             0        122          0        114          28   
5                             0        156         89        120           0   
6                             0         34        220       1070           0   
7                             0         60       1295         59           0   
8                             0         66        134        160           0   
9                             0         72        137        101          65   
10                            0        100        186    
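
As a sanity check, each row of a pivot table like this should add up to that month's total sales volume. The sketch below shows the check on hypothetical toy data (the `toy` frame, its bins, and its values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical toy data covering two months and three temperature bins.
toy = pd.DataFrame({
    "month": [6, 6, 7, 7],
    "temperature": [68.0, 84.0, 72.0, 75.0],
    "sales_volume": [34, 70, 60, 35],
})
toy["temperature_range"] = pd.cut(
    toy["temperature"],
    bins=[60, 70, 80, 90],
    labels=["60 to <70", "70 to <80", "80 to <90"],
    right=False,
)

toy_pivot = pd.pivot_table(
    toy,
    values="sales_volume",
    index="month",
    columns="temperature_range",
    aggfunc="sum",
    fill_value=0,
    observed=False,
)

# Row sums of the pivot should equal a plain groupby total per month.
row_totals = toy_pivot.sum(axis=1)
monthly_totals = toy.groupby("month")["sales_volume"].sum()
print((row_totals == monthly_totals).all())
```

If the comparison is ever `False` on real data, a likely cause is temperatures that fall outside every bin, since `pd.cut` assigns them `NaN` and the pivot drops them.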

### Question 3 of 3

Can you detect any outliers in the monthly sales volume using the interquartile range (IQR) method?

A month is considered an outlier if it falls below Q1 minus 1.5 times the IQR or above Q3 plus 1.5 times the IQR.

In [32]:
# Calculate total monthly sales volume
monthly_sales = ice_cream_sales_data_cleaned.groupby('month')['sales_volume'].sum()

# Calculate Q1, Q3, and IQR
Q1 = monthly_sales.quantile(0.25)
Q3 = monthly_sales.quantile(0.75)
IQR = Q3 - Q1

# Define outlier thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detect outlier months
outliers = monthly_sales[(monthly_sales < lower_bound) | (monthly_sales > upper_bound)]

print("Monthly Sales Volume Outliers (IQR method):")
print(outliers)


Monthly Sales Volume Outliers (IQR method):
month
6    1324
7    1414
Name: sales_volume, dtype: int64
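
The IQR rule is easier to follow on a small worked example. The sketch below wraps the same steps in a hypothetical helper, `iqr_bounds` (not part of pandas), and applies it to made-up monthly totals; it relies on the default linear interpolation of `Series.quantile`.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) outlier thresholds using the IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical monthly totals; month 6 is deliberately extreme.
toy_monthly = pd.Series(
    [100, 110, 120, 130, 140, 1000],
    index=range(1, 7),
    name="sales_volume",
)

# Here Q1 = 112.5 and Q3 = 137.5, so IQR = 25 and the bounds are 75 and 175.
lower, upper = iqr_bounds(toy_monthly)
toy_outliers = toy_monthly[(toy_monthly < lower) | (toy_monthly > upper)]

print(f"Bounds: {lower} to {upper}")
print(toy_outliers)
```

Only month 6 falls outside the bounds in this toy series. Because quartiles are insensitive to a single extreme value, that value does not stretch the thresholds that are used to detect it.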
