# Data Science as a Detective Story: Unraveling the "Lost Customers" Mystery with Pandas

Welcome to the world of Data Science! Imagine you're a detective, and your main tool is **Pandas**, Python's powerhouse library for data manipulation and analysis.

We'll explore how Pandas helps us solve real-world business mysteries, like identifying why an online store might be losing customers. This journey will walk through the typical **Data Science workflow**, showing where Pandas shines at each step.

---

## The Mystery: GadgetGrove's "Lost Customers"

**Scenario:** You work for "GadgetGrove," a popular online electronics store. The marketing team is worried; they feel they're losing customers, but they don't know *why* or *who*. They need to understand customer behavior better to launch effective retention campaigns.

**Your Data Science Mission:** Use GadgetGrove's past sales data to identify potential "lost" customers (those who haven't purchased in a while) and understand their characteristics. This information will help the marketing team bring them back!

---

## Step 1: Gathering the Clues (Data Acquisition)

Before we can solve any mystery, we need the raw evidence. In data science, this means loading our data into a **Pandas DataFrame** – which is like a super-powered spreadsheet in Python.

In [1]:
import pandas as pd
import numpy as np # Used later for introducing NaNs
import matplotlib.pyplot as plt # For visualizations
import seaborn as sns # For visualizations

# For demonstration, we'll create a small dummy DataFrame that resembles sales data.
# In a real scenario, you'd use pd.read_csv('sales_data.csv') or pd.read_excel().
data = {
    'InvoiceID': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012],
    'CustomerID': ['CUST001', 'CUST002', 'CUST001', 'CUST003', 'CUST001', 'CUST002', 'CUST004', 'CUST003', 'CUST005', 'CUST001', 'CUST002', 'CUST006'],
    'ProductID': ['P001', 'P005', 'P002', 'P001', 'P003', 'P005', 'P004', 'P006', 'P001', 'P002', 'P007', 'P008'],
    'Quantity': [2, 1, 3, 1, 1, 2, 1, 1, 3, 2, 1, 1],
    'Price': [10.50, 50.00, 5.00, 10.50, 15.00, 50.00, 25.00, 40.00, 10.50, 5.00, 30.00, 60.00],
    'OrderDate': ['2023-01-05', '2023-01-06', '2023-01-10', '2023-02-01', '2023-03-01', '2023-04-15', '2023-05-01', '2024-01-20', '2024-02-10', '2024-05-25', '2024-06-01', '2023-08-15'],
    'City': ['NYC', 'LA', 'NYC', 'CHI', 'NYC', 'LA', 'SF', 'CHI', 'NYC', 'NYC', 'LA', 'HOU']
}
df = pd.DataFrame(data)

print("### Initial Data (first 5 rows): ###")
print(df.head())

print("\n### Data Info (Column types and non-null counts): ###")
df.info()


### Initial Data (first 5 rows): ###
   InvoiceID CustomerID ProductID  Quantity  Price   OrderDate City
0       1001    CUST001      P001         2   10.5  2023-01-05  NYC
1       1002    CUST002      P005         1   50.0  2023-01-06   LA
2       1003    CUST001      P002         3    5.0  2023-01-10  NYC
3       1004    CUST003      P001         1   10.5  2023-02-01  CHI
4       1005    CUST001      P003         1   15.0  2023-03-01  NYC

### Data Info (Column types and non-null counts): ###
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   InvoiceID   12 non-null     int64  
 1   CustomerID  12 non-null     object 
 2   ProductID   12 non-null     object 
 3   Quantity    12 non-null     int64  
 4   Price       12 non-null     float64
 5   OrderDate   12 non-null     object 
 6   City        12 non-null     object 
dtypes: float64(1), int64(2), object(4)
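The comment in the cell above mentions `pd.read_csv('sales_data.csv')` for real data. Here is a minimal sketch of that path, assuming a CSV with the same seven columns. The inline text is a hypothetical stand-in for the file, so the example runs without anything on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real 'sales_data.csv' file.
csv_text = """InvoiceID,CustomerID,ProductID,Quantity,Price,OrderDate,City
1001,CUST001,P001,2,10.50,2023-01-05,NYC
1002,CUST002,P005,1,50.00,2023-01-06,LA
1003,CUST001,P002,3,5.00,2023-01-10,NYC
"""

# parse_dates asks pandas to convert OrderDate to datetime while loading,
# instead of leaving it as plain strings (dtype 'object').
sales = pd.read_csv(io.StringIO(csv_text), parse_dates=["OrderDate"])

print(sales.dtypes)
print(sales.shape)
```

With a real file you would pass its path instead of the `io.StringIO` object. Parsing dates at load time is optional; converting later with `pd.to_datetime()` works just as well.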

---

## Step 2: Cleaning the Clues (Data Preprocessing)

Raw data is rarely perfect. It might have incorrect formats, missing values, or duplicates. Just as a detective separates real evidence from contamination, we clean our data so that our analysis is accurate.

**Pandas' Role:** Essential for fixing data quality issues.
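Before fixing anything, it helps to measure how bad the evidence is. The sketch below counts the three problems named above (missing values, duplicates, and badly formatted dates) on a tiny made-up table. The `raw` DataFrame and its contents are assumptions for illustration, not GadgetGrove data.

```python
import numpy as np
import pandas as pd

# Hypothetical messy extract: one duplicated row, one missing customer,
# and one date string that cannot be parsed.
raw = pd.DataFrame({
    "InvoiceID": [1, 2, 2, 3],
    "CustomerID": ["CUST001", "CUST002", "CUST002", np.nan],
    "OrderDate": ["2023-01-05", "2023-01-06", "2023-01-06", "not a date"],
})

# Missing values per column.
missing_per_column = raw.isna().sum()

# Rows that exactly repeat an earlier row.
duplicate_rows = int(raw.duplicated().sum())

# errors="coerce" turns unparseable strings into NaT instead of raising,
# so the bad entries can be counted.
parsed_dates = pd.to_datetime(raw["OrderDate"], errors="coerce")
bad_dates = int(parsed_dates.isna().sum())

# Dropping exact duplicates keeps the first occurrence of each row.
deduplicated = raw.drop_duplicates()

print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}, unparseable dates: {bad_dates}")
```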

In [2]:
# Problem 1: The 'OrderDate' column is currently an 'object' (string). 
# We can't do date calculations until it's a proper datetime object.
# Pandas Tool: pd.to_datetime() - Converts strings to datetime objects.
df['OrderDate'] = pd.to_datetime(df['OrderDate'])
print("### After converting 'OrderDate' to datetime: ###")
df.info()

# Problem 2: What if some CustomerIDs are missing? We can't track 'lost' customers if we don't know who they are.
# Let's artificially introduce a missing CustomerID for demonstration.
df.loc[df['CustomerID'] == 'CUST005', 'CustomerID'] = np.nan # Set CUST005 to NaN

print("\n### Data with artificial missing CustomerID (counts per column): ###")
print(df.isnull().sum())

# Detective Action: Since a missing CustomerID means we can't identify the customer,
# we'll drop those rows. (Other strategies include filling with a placeholder).
# Pandas Tool: df.dropna() - Removes rows/columns with missing values. 
# 'subset' ensures we only drop if 'CustomerID' is missing.
df_cleaned = df.dropna(subset=['CustomerID']).copy() # Use .copy() to avoid SettingWithCopyWarning
print("\n### After dropping rows with missing CustomerID: ###")
print(df_cleaned.isnull().sum())
print(df_cleaned)

# Detective Action: Calculate 'TotalPrice' for each order item, which is crucial for spending analysis.
# Pandas Tool: Basic arithmetic operations on columns (Series).
df_cleaned['TotalPrice'] = df_cleaned['Quantity'] * df_cleaned['Price']
print("\n### After adding 'TotalPrice' column: ###")
print(df_cleaned.head())

### After converting 'OrderDate' to datetime: ###
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   InvoiceID   12 non-null     int64         
 1   CustomerID  12 non-null     object        
 2   ProductID   12 non-null     object        
 3   Quantity    12 non-null     int64         
 4   Price       12 non-null     float64       
 5   OrderDate   12 non-null     datetime64[ns]
 6   City        12 non-null     object        
dtypes: datetime64[ns](1), float64(1), int64(2), object(3)
memory usage: 804.0+ bytes

### Data with artificial missing CustomerID (counts per column): ###
InvoiceID     0
CustomerID    1
ProductID     0
Quantity      0
Price         0
OrderDate     0
City          0
dtype: int64

### After dropping rows with missing CustomerID: ###
InvoiceID     0
CustomerID    0
ProductID     0
Quantity      0
Price         0
OrderDate     0
City          0
dtype: int64
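The cell above drops rows with a missing `CustomerID` and mentions filling with a placeholder as another option. Here is a minimal sketch of that alternative. The `"UNKNOWN"` label and the three sample rows are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    "InvoiceID": [1009, 1010, 1011],
    "CustomerID": [np.nan, "CUST001", "CUST002"],
    "Quantity": [3, 2, 1],
    "Price": [10.50, 5.00, 30.00],
})

# Keep every row, but label unidentified buyers with a placeholder
# ("UNKNOWN" is a hypothetical label, not a pandas convention).
orders_filled = orders.fillna({"CustomerID": "UNKNOWN"})

# Revenue totals still include the anonymous order...
orders_filled["TotalPrice"] = orders_filled["Quantity"] * orders_filled["Price"]
total_revenue = orders_filled["TotalPrice"].sum()

# ...while per-customer analysis can exclude the placeholder explicitly.
identified = orders_filled[orders_filled["CustomerID"] != "UNKNOWN"]

print(total_revenue)
print(identified)
```

Which strategy fits depends on the question. For revenue totals, dropping the row loses money that was really earned. For tracking individual customers, an anonymous row cannot be attributed to anyone either way.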

---

## Step 3: Exploring the Clues (Exploratory Data Analysis - EDA)

This is where the real detective work happens! We'll use Pandas to summarize, aggregate, and visualize our data to uncover hidden patterns and answer our central question: "Who are the lost customers and what are they like?"

**Pandas' Role:** The engine for data summarization, aggregation, filtering, and preparing data for visualization.
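The central calculation in this step is recency: the number of days between the analysis date and a customer's last purchase. As a small worked sketch, the snippet below performs that date arithmetic for three last-purchase dates taken from the sample data, using an analysis date of 2024-06-05.

```python
import pandas as pd

current_date = pd.Timestamp("2024-06-05")

# Last purchase dates for three customers from the sample data.
last_purchase = pd.Series(
    pd.to_datetime(["2023-05-01", "2024-01-20", "2024-06-01"]),
    index=["CUST004", "CUST003", "CUST002"],
)

# Subtracting a datetime Series from a Timestamp gives a timedelta Series;
# .dt.days extracts the whole number of days from each timedelta.
recency_days = (current_date - last_purchase).dt.days

# Boolean filtering: True where the gap exceeds the 90-day threshold.
is_lost = recency_days > 90

print(recency_days)
print(is_lost)
```

The 401-day gap for `CUST004` spans 29 February 2024. Pandas handles leap years in the subtraction, so no manual calendar logic is needed.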

In [4]:
# We'll pin the analysis date so the recency numbers below are reproducible.
# In a real application, you'd use pd.to_datetime('today').
current_date = pd.to_datetime('2024-06-05') 
print(f"### Current Analysis Date: {current_date.strftime('%Y-%m-%d')} ###")

# Step 1: Find the *last purchase date* for each customer.
# Detective Action: To know who's 'lost', we need to know when they last bought something.
# Pandas Tool: .groupby() + .max() - Groups data by CustomerID and finds the latest OrderDate for each group.
last_purchase_dates = df_cleaned.groupby('CustomerID')['OrderDate'].max().reset_index()
last_purchase_dates.columns = ['CustomerID', 'LastPurchaseDate'] # Rename column for clarity
print("\n### Last Purchase Dates per Customer: ###")
print(last_purchase_dates)

# Step 2: Calculate 'Recency' - how many days since their last purchase?
# Detective Action: The longer the time since their last purchase, the 'lost-er' they are.
# Pandas Tool: Subtracting datetime objects yields a timedelta, then use .dt.days.
last_purchase_dates['Recency'] = (current_date - last_purchase_dates['LastPurchaseDate']).dt.days
print("\n### Customer Recency (days since last purchase, sorted oldest first): ###")
print(last_purchase_dates.sort_values(by='Recency', ascending=False))

# Step 3: Identify "Lost" Customers
# Detective Action: Let's define 'lost' as no purchase in over 90 days (approx. 3 months).
# Pandas Tool: Boolean filtering - Selecting rows based on a condition.
lost_customers_df = last_purchase_dates[last_purchase_dates['Recency'] > 90]
print(f"\n### Identified {len(lost_customers_df)} 'Lost' Customers (no purchase in > 90 days): ###")
print(lost_customers_df.sort_values(by='Recency', ascending=False))

# Step 4: Characterize Lost Customers (e.g., their average spending, most common city)
# Detective Action: What do these 'lost' customers have in common? This helps us understand their profile.
# Pandas Tool: .merge() to combine data, then .groupby() and .agg() for summary statistics.

# First, merge the 'lost_customers_df' back with the cleaned sales data to get their transaction details.
lost_customers_details = pd.merge(lost_customers_df, df_cleaned, on='CustomerID', how='left')
print("\n### Transaction Details of Lost Customers (first 5 rows): ###")
print(lost_customers_details.head())

# Now, summarize characteristics of these specific lost customers.
lost_customer_summary = lost_customers_details.groupby('CustomerID').agg(
    Avg_Spent_Per_Order=('TotalPrice', 'mean'),
    Total_Orders=('InvoiceID', 'nunique'), # Count unique invoices for total orders
    Most_Frequent_City=('City', lambda x: x.mode()[0] if not x.empty else 'N/A') # Get the most frequent city
).reset_index()
print("\n### Summary of Lost Customers: ###")
print(lost_customer_summary) # 'Recency' is not in this summary yet, so we can't sort by it here

# Add Recency back to the summary for a complete view
lost_customer_summary = pd.merge(lost_customer_summary, lost_customers_df[['CustomerID', 'Recency']], on='CustomerID', how='left')
print("\n### Comprehensive Summary of Lost Customers (with Recency): ###")
print(lost_customer_summary.sort_values(by='Recency', ascending=False))

# Visualizing the Findings
print("\n### Generating Visualizations... ###")

# Visualization 1: Distribution of lost customers by city
plt.figure(figsize=(10, 6))
sns.countplot(y='Most_Frequent_City', data=lost_customer_summary, 
              order=lost_customer_summary['Most_Frequent_City'].value_counts().index, 
              palette='viridis')
plt.title('Distribution of Lost Customers by City')
plt.xlabel('Number of Lost Customers')
plt.ylabel('City')
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.show()

# Visualization 2: Average spending per order for each lost customer
# Only plot if at least one lost customer has a recorded order
if not lost_customer_summary[lost_customer_summary['Total_Orders'] > 0].empty:
    plt.figure(figsize=(12, 7))
    sns.barplot(x='CustomerID', y='Avg_Spent_Per_Order', 
                data=lost_customer_summary.sort_values(by='Avg_Spent_Per_Order', ascending=False),
                palette='magma')
    plt.title('Average Spending per Order of Identified Lost Customers')
    plt.xlabel('Customer ID')
    plt.ylabel('Average Spent per Order ($)')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.show()
else:
    print("No lost customers with recorded orders to visualize average spending.")

### Current Analysis Date: 2024-06-05 ###

### Last Purchase Dates per Customer: ###
  CustomerID LastPurchaseDate
0    CUST001       2024-05-25
1    CUST002       2024-06-01
2    CUST003       2024-01-20
3    CUST004       2023-05-01
4    CUST006       2023-08-15

### Customer Recency (days since last purchase, sorted oldest first): ###
  CustomerID LastPurchaseDate  Recency
3    CUST004       2023-05-01      401
4    CUST006       2023-08-15      295
2    CUST003       2024-01-20      137
0    CUST001       2024-05-25       11
1    CUST002       2024-06-01        4

### Identified 3 'Lost' Customers (no purchase in > 90 days): ###
  CustomerID LastPurchaseDate  Recency
3    CUST004       2023-05-01      401
4    CUST006       2023-08-15      295
2    CUST003       2024-01-20      137

### Transaction Details of Lost Customers (first 5 rows): ###
  CustomerID LastPurchaseDate  Recency  InvoiceID ProductID  Quantity  Price  \
0    CUST003       2024-01-20      137       1004      P001 
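The cell above builds the lost-customer profile in several passes: group, merge, aggregate, merge again. As an alternative sketch, the same kind of summary can be computed in one named aggregation and filtered afterwards, which keeps `Recency` in the table from the start. The five rows are a subset of the sample data. Taking the `first` city per customer is a simplifying assumption that only holds when each customer orders from a single city.

```python
import pandas as pd

orders = pd.DataFrame({
    "InvoiceID": [1004, 1007, 1008, 1010, 1012],
    "CustomerID": ["CUST003", "CUST004", "CUST003", "CUST001", "CUST006"],
    "Quantity": [1, 1, 1, 2, 1],
    "Price": [10.50, 25.00, 40.00, 5.00, 60.00],
    "OrderDate": pd.to_datetime(
        ["2023-02-01", "2023-05-01", "2024-01-20", "2024-05-25", "2023-08-15"]
    ),
    "City": ["CHI", "SF", "CHI", "NYC", "HOU"],
})
orders["TotalPrice"] = orders["Quantity"] * orders["Price"]
current_date = pd.Timestamp("2024-06-05")

# One pass over the data: each keyword names an output column and maps it
# to (input column, aggregation function).
summary = orders.groupby("CustomerID").agg(
    LastPurchaseDate=("OrderDate", "max"),
    Total_Orders=("InvoiceID", "nunique"),
    Avg_Spent_Per_Order=("TotalPrice", "mean"),
    City=("City", "first"),  # assumption: one city per customer
).reset_index()

# Recency is derived before filtering, so it is available for sorting.
summary["Recency"] = (current_date - summary["LastPurchaseDate"]).dt.days
lost = summary[summary["Recency"] > 90].sort_values("Recency", ascending=False)

print(lost)
```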

---

## Step 4: Unveiling the Truth (Insights & Communication)

The data doesn't just sit there; it tells a story! As data scientists, we synthesize our findings into actionable insights for the business.

**Detective Conclusion:**
* "We've identified **`CUST004`**, **`CUST006`**, and **`CUST003`** as our 'lost' customers, having not purchased in **`401`**, **`295`**, and **`137`** days respectively."
* "These lost customers are spread across **`SF`**, **`HOU`**, and **`CHI`**, one in each city."
* "Their average spending per order ranges from about $25 to $60, so some good spenders are among them. It's their **recency**, not their spending, that marks them as 'lost'."

**Actionable Insights for GadgetGrove's Marketing Team:**
* **Targeted Campaigns:** Launch specific re-engagement campaigns (e.g., personalized discounts, new product alerts) to `CUST004`, `CUST006`, and `CUST003`.
* **Location-Based Strategy:** With only one lost customer each in **San Francisco (`SF`)**, **Houston (`HOU`)**, and **Chicago (`CHI`)**, this sample is too small to single out a city. On the full dataset, check whether lost customers cluster in particular cities, then investigate why. Is there a new local competitor? Are delivery options worse there? Are they not seeing relevant ads?
* **Proactive Retention:** Monitor currently active customers like `CUST001` (at 11 days recency) and `CUST002` (at 4 days) and intervene as they approach the 90-day threshold, before they become fully 'lost'.
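The proactive-retention idea can be sketched by bucketing recency into segments with `pd.cut()`. The recency values below come from the Step 3 output. The 60-day "at risk" boundary and the segment names are hypothetical choices; only the 90-day "lost" threshold comes from this analysis.

```python
import pandas as pd

recency = pd.DataFrame({
    "CustomerID": ["CUST001", "CUST002", "CUST003", "CUST004", "CUST006"],
    "Recency": [11, 4, 137, 401, 295],
})

# Bin edges are right-inclusive by default: (-1, 60], (60, 90], (90, inf).
# The 60-day edge is an assumed early-warning threshold.
bins = [-1, 60, 90, float("inf")]
labels = ["active", "at_risk", "lost"]
recency["Segment"] = pd.cut(recency["Recency"], bins=bins, labels=labels)

print(recency.sort_values("Recency", ascending=False))
print(recency["Segment"].value_counts())
```

In this small sample nobody falls into the `at_risk` bucket yet. On a larger dataset that bucket would be the list to hand to marketing for early intervention.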

**Pandas' Role:** Pandas provided the structured data, performed all the necessary calculations (recency, spending summaries), filtered the relevant customers, and aggregated the data for clear insights and visualizations. It's the backbone of turning raw transaction logs into strategic business intelligence.

---

## Beyond Pandas: What's Next in the Data Science Journey?

This investigation used Pandas to **clean and explore** the data. In a full data science project, the next steps often involve:

* **Modeling:** Building machine learning models to *predict* which customers are likely to become lost *in the future*.
* **Deployment:** Integrating these insights and predictions into business operations (e.g., automated marketing systems).

Pandas is crucial for preparing the data for these advanced steps, often creating the very **features** (like Recency, Frequency, Monetary value) that machine learning models learn from.
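As a minimal sketch of that feature-building step, the snippet below derives Recency, Frequency, and Monetary value per customer from the eleven cleaned sample orders (the row with the missing `CustomerID` is excluded). Defining Monetary as total spend is one common convention and an assumption here; a model might use average spend instead.

```python
import pandas as pd

orders = pd.DataFrame({
    "InvoiceID": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1010, 1011, 1012],
    "CustomerID": ["CUST001", "CUST002", "CUST001", "CUST003", "CUST001", "CUST002",
                   "CUST004", "CUST003", "CUST001", "CUST002", "CUST006"],
    "Quantity": [2, 1, 3, 1, 1, 2, 1, 1, 2, 1, 1],
    "Price": [10.50, 50.00, 5.00, 10.50, 15.00, 50.00, 25.00, 40.00, 5.00, 30.00, 60.00],
    "OrderDate": pd.to_datetime([
        "2023-01-05", "2023-01-06", "2023-01-10", "2023-02-01", "2023-03-01",
        "2023-04-15", "2023-05-01", "2024-01-20", "2024-05-25", "2024-06-01",
        "2023-08-15",
    ]),
})
orders["TotalPrice"] = orders["Quantity"] * orders["Price"]
current_date = pd.Timestamp("2024-06-05")

# Frequency = number of distinct invoices, Monetary = total spend (assumed).
rfm = orders.groupby("CustomerID").agg(
    LastPurchase=("OrderDate", "max"),
    Frequency=("InvoiceID", "nunique"),
    Monetary=("TotalPrice", "sum"),
)

# Recency = days since the most recent order.
rfm["Recency"] = (current_date - rfm["LastPurchase"]).dt.days
rfm = rfm[["Recency", "Frequency", "Monetary"]]

print(rfm)
```

Each row of `rfm` is one customer described by three numbers, which is the tabular shape most machine learning libraries expect as input.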

---

## Your Turn to be a Data Detective!

The best way to learn Pandas and data science is by doing. Find a dataset (Kaggle is a great resource!), come up with a question, and use Pandas to uncover the answers. Every dataset is a new mystery waiting to be solved!

What other business questions do you think could be answered by analyzing this sales data?
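To get you started, here is a small sketch that answers two such questions on the first few sample orders: which city brings in the most revenue, and how revenue develops month by month. The five rows come from the sample data. The choice of questions is only a suggestion.

```python
import pandas as pd

orders = pd.DataFrame({
    "InvoiceID": [1001, 1002, 1003, 1004, 1007],
    "City": ["NYC", "LA", "NYC", "CHI", "SF"],
    "Quantity": [2, 1, 3, 1, 1],
    "Price": [10.50, 50.00, 5.00, 10.50, 25.00],
    "OrderDate": pd.to_datetime(
        ["2023-01-05", "2023-01-06", "2023-01-10", "2023-02-01", "2023-05-01"]
    ),
})
orders["TotalPrice"] = orders["Quantity"] * orders["Price"]

# Question 1: which city generates the most revenue?
revenue_by_city = (
    orders.groupby("City")["TotalPrice"].sum().sort_values(ascending=False)
)

# Question 2: how does revenue develop month by month?
# dt.to_period("M") maps each date to its calendar month.
revenue_by_month = orders.groupby(orders["OrderDate"].dt.to_period("M"))["TotalPrice"].sum()

print(revenue_by_city)
print(revenue_by_month)
```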