# Life Insurance Customer Churn Analysis


## Introduction

What drives a customer to remain loyal to their insurance provider, and what factors push them to seek alternatives? In the competitive world of insurance, understanding customer behavior is crucial to retaining policyholders and maintaining steady business growth.

This project delves into a dataset centered on life insurance customers, aiming to uncover the key factors influencing customer churn. By analyzing patterns in customer demographics, claims, premiums, and other attributes, we strive to answer critical questions:
- What makes a customer more likely to churn?
- Are there specific premium categories or claim behaviors that signal dissatisfaction?
- What strategies can insurers adopt to retain their customers?

The dataset, sourced from [Kaggle](https://www.kaggle.com/datasets/usmanfarid/customer-churn-dataset-for-life-insurance-industry), offers a snapshot of life insurance customers, including variables such as claim amounts, premium ratios, and churn status. Through this analysis, we hope to shed light on actionable insights that can drive improved customer retention in the insurance industry.

## Data Preparation

In [66]:
import pandas as pd

In [67]:
# Load the dataset
df = pd.read_csv('customer_churn_dataset.csv')

In [68]:
# Examining the data to make sure everything loaded in as expected
df.head()

| | Unnamed: 0 | Customer Name | Customer_Address | Company Name | Claim Reason | Data confidentiality | Claim Amount | Category Premium | Premium/Amount Ratio | Claim Request output | BMI | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Christine Payne | 7627 Anderson Rest Apt. 265,Lake Heather, DC 3... | Williams, Henderson and Perez | Travel | Low | 377 | 4794 | 0.07864 | No | 21 | Yes |
| 1 | 1 | Tony Fernandez | 3953 Cindy Brook Apt. 147,East Lindatown, TN 4... | Moore-Goodwin | Medical | High | 1440 | 14390 | 0.100069 | No | 24 | Yes |
| 2 | 2 | Christopher Kim | 8693 Walters Mountains,South Tony, TX 88407 | Smith-Holmes | Phone | Medium | 256 | 1875 | 0.136533 | No | 18 | Yes |
| 3 | 3 | Nicole Allen | 56926 Webster Coves,Shawnmouth, NV 04853 | Harrell-Perez | Phone | Medium | 233 | 1875 | 0.124267 | No | 24 | Yes |
| 4 | 4 | Linda Cruz | 489 Thomas Forges Apt. 305,Jesseton, GA 36765 | Simpson, Kramer and Hughes | Phone | Medium | 239 | 1875 | 0.127467 | No | 21 | Yes |


In [69]:
# Checking data types and # of null values in dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Unnamed: 0            200000 non-null  int64  
 1   Customer Name         200000 non-null  object 
 2   Customer_Address      200000 non-null  object 
 3   Company Name          200000 non-null  object 
 4   Claim Reason          200000 non-null  object 
 5   Data confidentiality  200000 non-null  object 
 6   Claim Amount          200000 non-null  int64  
 7   Category Premium      200000 non-null  int64  
 8   Premium/Amount Ratio  200000 non-null  float64
 9   Claim Request output  200000 non-null  object 
 10  BMI                   200000 non-null  int64  
 11  Churn                 200000 non-null  object 
dtypes: float64(1), int64(4), object(7)
memory usage: 18.3+ MB


#### Extraneous Data
The dataset contains 12 columns and 200,000 rows. One of them, "Unnamed: 0", appears to be a leftover row index; it is not needed for the analysis and can be removed. The Customer Name and Company Name columns can also be removed, since they identify individual records rather than describe behavior and are irrelevant to the problem we are trying to solve.

In [70]:
# Dropping the Unnamed, Customer Name, and Company Name columns
df.drop(columns=['Unnamed: 0', 'Customer Name', 'Company Name'], inplace=True)
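The claim that `Unnamed: 0` is only a leftover index can be checked before the column is dropped. The following is a minimal sketch on a toy frame built from the first rows shown above; the names `sample` and `is_redundant_index` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the real data (values copied from the df.head() output).
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Customer Name": ["Christine Payne", "Tony Fernandez", "Christopher Kim"],
    "Claim Amount": [377, 1440, 256],
})

# True when the column simply repeats the default 0..n-1 row index.
is_redundant_index = bool((sample["Unnamed: 0"] == sample.index).all())

if is_redundant_index:
    # Non-mutating alternative to inplace=True: drop() returns a new frame.
    sample = sample.drop(columns=["Unnamed: 0"])

print(list(sample.columns))
```

Running the same comparison on the full `df` would confirm whether the assumption holds for all 200,000 rows.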

#### Standardize and Clean Column Names
Columns are renamed for readability and consistency, using a single PascalCase style with no spaces, underscores, or slashes. For example, `Category Premium` becomes `CategoryPremium`.

In [71]:
df.rename(columns={
    'Customer_Address': 'CustomerAddress',
    'Claim Reason': 'ClaimReason',
    'Data confidentiality': 'DataConfidentiality',
    'Claim Amount': 'ClaimAmount',
    'Category Premium': 'CategoryPremium',
    'Premium/Amount Ratio': 'PremiumAmountRatio',
    'Claim Request output': 'ClaimRequestOutput'
}, inplace=True)

#### Checking for Null Values
Next, the dataset is double-checked for missing values. The info table above shows no nulls in any column, and the output below shows that no column contains a blank string or a common placeholder such as "Unknown" or "N/A", so we can proceed.

In [72]:
# Checking for blank strings and placeholder labels that stand in for missing values
print(df.isin(['', 'Unknown', 'N/A']).sum())

CustomerAddress        0
ClaimReason            0
DataConfidentiality    0
ClaimAmount            0
CategoryPremium        0
PremiumAmountRatio     0
ClaimRequestOutput     0
BMI                    0
Churn                  0
dtype: int64
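The `isin` check above matches only exact strings, so values such as `"n/a"` or a whitespace-only cell would slip through. A slightly broader check might look like the sketch below. The placeholder list and the helper name `count_missing` are assumptions for illustration, and the toy frame deliberately contains bad values that the real dataset does not appear to have.

```python
import pandas as pd

# Assumed placeholder spellings, compared after trimming and lower-casing.
PLACEHOLDERS = {"", "unknown", "n/a"}

def is_placeholder(value):
    """Return True for strings that only stand in for a missing value."""
    return isinstance(value, str) and value.strip().lower() in PLACEHOLDERS

def count_missing(frame):
    """Count nulls plus placeholder strings in every column."""
    counts = {}
    for col in frame.columns:
        placeholder_mask = frame[col].map(is_placeholder).astype(bool)
        counts[col] = int((frame[col].isna() | placeholder_mask).sum())
    return pd.Series(counts)

# Toy frame with three problem cells in ClaimReason: "  ", None, and "N/A".
sample = pd.DataFrame({
    "ClaimReason": ["Travel", "  ", "Phone", None, "N/A"],
    "ClaimAmount": [377, 1440, 256, 233, 239],
})

missing_counts = count_missing(sample)
print(missing_counts)
```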


#### Checking for Correct Data Types
The Claim Reason and Data confidentiality columns are loaded as generic object columns, although each holds a small set of repeated labels. It is more accurate to type these as category columns.
- Claim Reason: Updated data type from object to category
- Data confidentiality: Updated data type from object to category

In addition, Claim Request output and Churn are mapped from Yes/No to numerical boolean values (1 or 0) for easier analysis and visualization.

In [74]:
# Re-typing variables
df['ClaimReason'] = df['ClaimReason'].astype('category')
df['DataConfidentiality'] = df['DataConfidentiality'].astype('category')
df['ClaimRequestOutput'] = df['ClaimRequestOutput'].map({'Yes': 1, 'No': 0})
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})
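One caveat with `Series.map` and a dictionary: any label missing from the dictionary silently becomes `NaN`. The second `info()` output shows `int64` for both mapped columns, which suggests that no label was lost here. For other data, a guarded version could look like this sketch; the toy values and the names `yes_no`, `unmapped`, and `cleaned` are illustrative.

```python
import pandas as pd

# Toy column with one label ("yes ") that the plain mapping would miss.
sample = pd.DataFrame({"Churn": ["Yes", "No", "Yes", "yes "]})

yes_no = {"Yes": 1, "No": 0}
mapped = sample["Churn"].map(yes_no)

# Labels that failed to map show up as NaN; list them before trusting the result.
unmapped = sample.loc[mapped.isna(), "Churn"].unique().tolist()
print(unmapped)

# Normalizing whitespace and case first makes the mapping total for this sample.
cleaned = (
    sample["Churn"].str.strip().str.capitalize().map(yes_no).astype("int64")
)
print(cleaned.tolist())
```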

In [75]:
# Checking the updated info for the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype   
---  ------               --------------   -----   
 0   CustomerAddress      200000 non-null  object  
 1   ClaimReason          200000 non-null  category
 2   DataConfidentiality  200000 non-null  category
 3   ClaimAmount          200000 non-null  int64   
 4   CategoryPremium      200000 non-null  int64   
 5   PremiumAmountRatio   200000 non-null  float64 
 6   ClaimRequestOutput   200000 non-null  int64   
 7   BMI                  200000 non-null  int64   
 8   Churn                200000 non-null  int64   
dtypes: category(2), float64(1), int64(5), object(1)
memory usage: 11.1+ MB
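The drop in reported memory between the two `info()` outputs reflects both the three removed columns and the category conversion. To see the effect of the conversion alone, the sketch below compares one repetitive label column stored both ways. The series is synthetic and only mimics the Claim Reason labels.

```python
import pandas as pd

# Synthetic column: three labels repeated many times, as in ClaimReason.
labels = pd.Series(["Travel", "Medical", "Phone"] * 1000, dtype="object")
as_category = labels.astype("category")

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object: {object_bytes} bytes, category: {category_bytes} bytes")
```

A category column stores each distinct label once plus a small integer code per row, which is why it is smaller when the number of distinct labels is low.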


In [76]:
# Reviewing high-level overview of the quantitative data
df.describe()

| | ClaimAmount | CategoryPremium | PremiumAmountRatio | ClaimRequestOutput | BMI | Churn |
|---|---|---|---|---|---|---|
| count | 200000.0 | 200000.0 | 200000.0 | 200000.0 | 200000.0 | 200000.0 |
| mean | 1120.47884 | 8963.783895 | 0.125024 | 0.03503 | 23.007205 | 0.63636 |
| std | 796.660796 | 6114.737202 | 0.034742 | 0.183856 | 3.164976 | 0.481048 |
| min | 1.0 | 399.0 | 0.002506 | 0.0 | 18.0 | 0.0 |
| 25% | 245.0 | 1875.0 | 0.106741 | 0.0 | 20.0 | 0.0 |
| 50% | 1390.0 | 14390.0 | 0.125122 | 0.0 | 23.0 | 1.0 |
| 75% | 1844.0 | 14390.0 | 0.143155 | 0.0 | 26.0 | 1.0 |
| max | 2299.0 | 14390.0 | 0.24812 | 1.0 | 28.0 | 1.0 |
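In the five rows printed by `df.head()`, `PremiumAmountRatio` equals `ClaimAmount` divided by `CategoryPremium`, despite what the column name suggests. The sketch below verifies this on those five rows only; whether the relationship holds across the full dataset is an assumption that would need the same check on `df`. The names `sample`, `recomputed`, and `matches` are illustrative.

```python
import numpy as np
import pandas as pd

# The first five rows, copied from the df.head() output.
sample = pd.DataFrame({
    "ClaimAmount": [377, 1440, 256, 233, 239],
    "CategoryPremium": [4794, 14390, 1875, 1875, 1875],
    "PremiumAmountRatio": [0.07864, 0.100069, 0.136533, 0.124267, 0.127467],
})

# Recompute the ratio as claim amount over premium.
recomputed = sample["ClaimAmount"] / sample["CategoryPremium"]

# The printed ratios are rounded to six decimals, so compare with a tolerance.
matches = np.isclose(recomputed, sample["PremiumAmountRatio"], atol=1e-6)
print(matches.all())
```

If the relationship holds for every row, the ratio is fully determined by the other two columns, which is worth keeping in mind when selecting features for a model.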


## EDA (Exploratory Data Analysis) Findings

## Modeling Process

## Conclusions and Recommendations