# Lab | Data Aggregation and Filtering

In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv

This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring.

1. Create a new DataFrame that only includes customers who:
   - have a **low total_claim_amount** (e.g., below $1,000),
   - responded "Yes" to the last marketing campaign.

2. Using the original DataFrame, analyze:
   - the average `monthly_premium_auto` and/or `customer_lifetime_value` by `policy_type` and `gender` for customers who responded "Yes", and
   - how these averages compare to `total_claim_amount` patterns, then discuss which segments appear most profitable or lowest-risk for the company.

3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.
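
The four tasks above all rely on boolean filtering and `groupby` aggregation. The block below is a minimal sketch of one possible approach, run on a small made-up DataFrame rather than the real file. The column names follow the cleaned dataset, but every value is invented, and the state threshold is lowered from 500 to 2 only because the sample is tiny.

```python
import pandas as pd

# Made-up sample data: column names follow the cleaned dataset, values are illustrative only
sample = pd.DataFrame({
    "state": ["California", "California", "Oregon", "Oregon", "Arizona", "California"],
    "gender": ["F", "M", "F", "M", "F", "F"],
    "education": ["College", "Bachelor", "College", "Bachelor", "College", "Bachelor"],
    "policy_type": ["Personal Auto", "Corporate Auto", "Personal Auto",
                    "Personal Auto", "Corporate Auto", "Personal Auto"],
    "response": ["Yes", "No", "Yes", "Yes", "No", "Yes"],
    "monthly_premium_auto": [60, 110, 75, 90, 65, 100],
    "customer_lifetime_value": [4800.0, 9000.0, 5200.0, 7100.0, 3900.0, 8300.0],
    "total_claim_amount": [300.0, 1200.0, 450.0, 1500.0, 280.0, 700.0],
})

# Task 1: combine two boolean conditions with & (each condition in its own parentheses)
low_claim_yes = sample[(sample["total_claim_amount"] < 1000) & (sample["response"] == "Yes")]

# Task 2: average premium, lifetime value and claim amount per segment, "Yes" responders only
segment_means = (
    sample[sample["response"] == "Yes"]
    .groupby(["policy_type", "gender"])[
        ["monthly_premium_auto", "customer_lifetime_value", "total_claim_amount"]
    ]
    .mean()
)

# Task 3: customers per state, keeping only states above a threshold
# (the lab asks for 500; 2 is an assumption that suits this tiny sample)
state_counts = sample.groupby("state").size()
big_states = state_counts[state_counts > 2]

# Task 4: max, min and median lifetime value by education level and gender
clv_stats = (
    sample.groupby(["education", "gender"])["customer_lifetime_value"]
    .agg(["max", "min", "median"])
)

print(low_claim_yes)
print(segment_means)
print(big_states)
print(clv_stats)
```

On the real data, the same calls would be applied to `df` with the threshold set back to 500.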

## Bonus

5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows.

6. Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.

*Hint:*
- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*
- *Next, you will select the top 3 states with the highest number of policies sold.*
- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*

7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.

*Hint: You can use `melt` to unpivot the data and create a table that shows the customer response rate (those who responded "Yes") by marketing channel.*

External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9

In [1]:
# Step 0 - Import libraries
# pandas: data manipulation
# numpy: numerical helpers (optional)

import pandas as pd
import numpy as np


In [2]:
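# Step 0.1 - Load the dataset
# read_csv loads the CSV file into a DataFrame
# Note: this URL points to the cleaned file (marketing_customer_analysis_clean.csv),
# not the raw CSV linked in the lab description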

url = "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis_clean.csv"
df = pd.read_csv(url)

# Step 0.2 - Quick view
df.head()


| | unnamed:_0 | customer | state | customer_lifetime_value | response | coverage | education | effective_to_date | employmentstatus | gender | ... | number_of_policies | policy_type | policy | renew_offer_type | sales_channel | total_claim_amount | vehicle_class | vehicle_size | vehicle_type | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | DK49336 | Arizona | 4809.21696 | No | Basic | College | 2011-02-18 | Employed | M | ... | 9 | Corporate Auto | Corporate L3 | Offer3 | Agent | 292.8 | Four-Door Car | Medsize | A | 2 |
| 1 | 1 | KX64629 | California | 2228.525238 | No | Basic | College | 2011-01-18 | Unemployed | F | ... | 1 | Personal Auto | Personal L3 | Offer4 | Call Center | 744.924331 | Four-Door Car | Medsize | A | 1 |
| 2 | 2 | LZ68649 | Washington | 14947.9173 | No | Basic | Bachelor | 2011-02-10 | Employed | M | ... | 2 | Personal Auto | Personal L3 | Offer3 | Call Center | 480.0 | SUV | Medsize | A | 2 |
| 3 | 3 | XL78013 | Oregon | 22332.43946 | Yes | Extended | College | 2011-01-11 | Employed | M | ... | 2 | Corporate Auto | Corporate L3 | Offer2 | Branch | 484.013411 | Four-Door Car | Medsize | A | 1 |
| 4 | 4 | QA50777 | Oregon | 9025.067525 | No | Premium | Bachelor | 2011-01-17 | Medical Leave | F | ... | 7 | Personal Auto | Personal L2 | Offer1 | Branch | 707.925645 | Four-Door Car | Medsize | A | 1 |


In [3]:
# Step 0.3 - Check columns to know what we can use
df.columns


Index(['unnamed:_0', 'customer', 'state', 'customer_lifetime_value',
       'response', 'coverage', 'education', 'effective_to_date',
       'employmentstatus', 'gender', 'income', 'location_code',
       'marital_status', 'monthly_premium_auto', 'months_since_last_claim',
       'months_since_policy_inception', 'number_of_open_complaints',
       'number_of_policies', 'policy_type', 'policy', 'renew_offer_type',
       'sales_channel', 'total_claim_amount', 'vehicle_class', 'vehicle_size',
       'vehicle_type', 'month'],
      dtype='object')

In [4]:
# Step 1 - Convert effective_to_date into datetime
# to_datetime changes a string column into real datetime format

df["effective_to_date"] = pd.to_datetime(df["effective_to_date"], errors="coerce")

# Step 1.1 - Create a month column (1..12)
# dt.month extracts the month number from a datetime column

df["month"] = df["effective_to_date"].dt.month
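
To see what `errors="coerce"` does before trusting the new `month` column, the following minimal sketch parses three made-up date strings, one of them deliberately invalid. The series and variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Made-up date strings; the last one is deliberately invalid
dates = pd.Series(["2011-02-18", "2011-01-11", "not a date"])

# errors="coerce" turns values that cannot be parsed into NaT instead of raising
parsed = pd.to_datetime(dates, errors="coerce")

# dt.month returns NaN for NaT, so the result is a float column when any date is missing
months = parsed.dt.month

# Count how many values failed to parse
n_bad = parsed.isna().sum()
print(months)
print(n_bad)
```

Running the same `isna().sum()` check on `df["effective_to_date"]` shows whether any rows were silently turned into missing dates.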


In [5]:
# Step 2 - Group by state and month
# size() counts how many rows (policies) are in each group

policies_state_month = (
    df.groupby(["state", "month"])
      .size()
      .reset_index(name="policies_sold")
)

policies_state_month.head()


| | state | month | policies_sold |
|---|---|---|---|
| 0 | Arizona | 1 | 1008 |
| 1 | Arizona | 2 | 929 |
| 2 | California | 1 | 2231 |
| 3 | California | 2 | 1952 |
| 4 | Nevada | 1 | 551 |


In [6]:
# Step 3 - Pivot the grouped data into a table format
# index="state" makes states the rows
# columns="month" makes months the columns
# values="policies_sold" fills table with counts

pivot_state_month = policies_state_month.pivot(
    index="state",
    columns="month",
    values="policies_sold"
).fillna(0).astype(int)

pivot_state_month


| state / month | 1 | 2 |
|---|---|---|
| Arizona | 1008 | 929 |
| California | 2231 | 1952 |
| Nevada | 551 | 442 |
| Oregon | 1565 | 1344 |
| Washington | 463 | 425 |
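
As an alternative to grouping and then pivoting, `pd.crosstab` can build the same kind of state-by-month count table in one call. This is a minimal sketch on made-up rows, not a replacement for the steps above.

```python
import pandas as pd

# Made-up rows: one row per policy, with its state and month
sample = pd.DataFrame({
    "state": ["California", "California", "Oregon", "California", "Oregon", "Arizona"],
    "month": [1, 2, 1, 1, 2, 1],
})

# crosstab counts rows for every state/month pair; pairs with no rows get 0
counts = pd.crosstab(sample["state"], sample["month"])
print(counts)
```

Because `crosstab` fills missing combinations with 0 and returns integers, no separate `fillna(0).astype(int)` step is needed in this variant.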


In [7]:
# Step 4 - Count total policies per state
# groupby("state").size() counts all policies in each state

state_totals = df.groupby("state").size().sort_values(ascending=False)

# Step 4.1 - Select top 3 states
top_3_states = state_totals.head(3).index.tolist()

top_3_states


['California', 'Oregon', 'Arizona']

In [8]:
# Step 5 - Select only the top 3 states from the pivot table

pivot_top3 = pivot_state_month.loc[top_3_states]

pivot_top3


| state / month | 1 | 2 |
|---|---|---|
| California | 2231 | 1952 |
| Oregon | 1565 | 1344 |
| Arizona | 1008 | 929 |
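
The top states can also be derived from the pivot table itself by summing each row, which avoids a second `groupby` on the full data. The sketch below uses a made-up pivot table; its numbers are illustrative only.

```python
import pandas as pd

# Made-up state-by-month counts, shaped like the pivot table above
pivot = pd.DataFrame(
    {1: [10, 25, 5, 18], 2: [9, 20, 4, 15]},
    index=pd.Index(["Arizona", "California", "Nevada", "Oregon"], name="state"),
)
pivot.columns.name = "month"

# Row totals give policies per state; nlargest keeps the top 3 in descending order
top_states = pivot.sum(axis=1).nlargest(3).index

# Selecting by label keeps the rows in the same top-3 order
pivot_top = pivot.loc[top_states]
print(pivot_top)
```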


In [9]:
# Step 6 - Check columns to find marketing channels / response columns
df.columns


Index(['unnamed:_0', 'customer', 'state', 'customer_lifetime_value',
       'response', 'coverage', 'education', 'effective_to_date',
       'employmentstatus', 'gender', 'income', 'location_code',
       'marital_status', 'monthly_premium_auto', 'months_since_last_claim',
       'months_since_policy_inception', 'number_of_open_complaints',
       'number_of_policies', 'policy_type', 'policy', 'renew_offer_type',
       'sales_channel', 'total_claim_amount', 'vehicle_class', 'vehicle_size',
       'vehicle_type', 'month'],
      dtype='object')

In [10]:
# Step 7A - Calculate response rate by sales_channel
# We create a 0/1 flag: responded_yes = 1 if response is "Yes", else 0
# Then mean() of the flag gives the response rate as a fraction between 0 and 1

df["responded_yes"] = (df["response"].str.lower() == "yes").astype(int)

response_rate_channel = (
    df.groupby("sales_channel")["responded_yes"]
      .mean()
      .sort_values(ascending=False)
      .reset_index()
)

# Step 7A.1 - Convert to percentage
response_rate_channel["response_rate_percent"] = (response_rate_channel["responded_yes"] * 100).round(2)

response_rate_channel[["sales_channel", "response_rate_percent"]]


| | sales_channel | response_rate_percent |
|---|---|---|
| 0 | Agent | 18.01 |
| 1 | Web | 10.89 |
| 2 | Branch | 10.79 |
| 3 | Call Center | 10.32 |
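
A response rate is easier to interpret next to the number of customers behind it. The following minimal sketch, on made-up rows, uses named aggregation to return both figures per channel. The output column names `customers` and `response_rate` are chosen here for illustration.

```python
import pandas as pd

# Made-up rows: one row per customer
sample = pd.DataFrame({
    "sales_channel": ["Agent", "Agent", "Web", "Web", "Web", "Branch"],
    "response": ["Yes", "No", "No", "No", "Yes", "No"],
})

# 0/1 flag so that the mean equals the share of "Yes" responses
sample["responded_yes"] = (sample["response"] == "Yes").astype(int)

# Named aggregation: group size and mean of the flag in one pass
summary = sample.groupby("sales_channel")["responded_yes"].agg(
    customers="size",
    response_rate="mean",
)
print(summary)
```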


In [11]:
# Step 7B - Alternative using melt, as suggested in the hint
# melt is most useful when channels are spread across several columns
# First, list the channel columns available in this dataset

channel_cols = [c for c in df.columns if "channel" in c.lower()]

channel_cols


['sales_channel']

In [12]:
# Step 7B.1 - Melt (unpivot) channel columns into long format
# id_vars are the columns we keep (response)
# value_vars are the channel columns we unpivot into rows
# sales_channel is categorical, so the melted values are channel names, not 0/1 flags

melted = df.melt(
    id_vars=["response"],
    value_vars=channel_cols,
    var_name="channel_column",
    value_name="marketing_channel"
)

# Step 7B.2 - Flag rows where the customer responded "Yes"
# (no "== 1" filter here: comparing channel names to 1 would drop every row)
melted["responded_yes"] = (melted["response"].str.lower() == "yes").astype(int)

# Step 7B.3 - Compute response rate by channel
response_rate_melt = (
    melted.groupby("marketing_channel")["responded_yes"]
          .mean()
          .sort_values(ascending=False)
          .reset_index()
)

response_rate_melt["response_rate_percent"] = (response_rate_melt["responded_yes"] * 100).round(2)

response_rate_melt[["marketing_channel", "response_rate_percent"]]


| | marketing_channel | response_rate_percent |
|---|---|---|
| 0 | Agent | 18.01 |
| 1 | Web | 10.89 |
| 2 | Branch | 10.79 |
| 3 | Call Center | 10.32 |
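
For comparison, `melt` fits most naturally when each channel is stored as its own 0/1 flag column. That is not how this dataset is stored, so the sketch below uses a hypothetical wide table with invented column names (`channel_agent`, `channel_web`) and made-up values.

```python
import pandas as pd

# Hypothetical wide layout: one 0/1 flag column per channel (NOT how this dataset is stored)
wide = pd.DataFrame({
    "response": ["Yes", "No", "No", "Yes"],
    "channel_agent": [1, 0, 1, 0],
    "channel_web": [0, 1, 1, 1],
})

# Unpivot the flag columns: one row per (customer, channel) pair
long = wide.melt(
    id_vars=["response"],
    var_name="marketing_channel",
    value_name="used_channel",
)

# Keep only the pairs where the channel was actually used
long = long[long["used_channel"] == 1]

# Share of "Yes" responses among customers reached through each channel
rate = (long["response"] == "Yes").groupby(long["marketing_channel"]).mean()
print(rate)
```

In this layout one customer can appear under several channels, which is the main reason to unpivot before grouping.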
