# 1. Getting to Know the Dataset

First, we'll import the required Python libraries and load our dataset.

In [1]:
#importing libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

#importing the dataset
customer_data = pd.read_csv('/kaggle/input/online-retail-customer-churn-dataset/online_retail_customer_churn.csv')

Next, we'll take a look at the size of our dataset and its first few rows.

In [2]:
customer_data.shape

(1000, 15)

In [3]:
customer_data.head()

,Customer_ID,Age,Gender,Annual_Income,Total_Spend,Years_as_Customer,Num_of_Purchases,Average_Transaction_Amount,Num_of_Returns,Num_of_Support_Contacts,Satisfaction_Score,Last_Purchase_Days_Ago,Email_Opt_In,Promotion_Response,Target_Churn
0,1,62,Other,45.15,5892.58,5,22,453.8,2,0,3,129,True,Responded,True
1,2,65,Male,79.51,9025.47,13,77,22.9,2,2,3,227,False,Responded,False
2,3,18,Male,29.19,618.83,13,71,50.53,5,2,2,283,False,Responded,True
3,4,21,Other,79.63,9110.3,3,33,411.83,5,3,5,226,True,Ignored,True
4,5,21,Other,77.66,5390.88,15,43,101.19,3,0,5,242,False,Unsubscribed,False


A vital step in any data analysis project is identifying each variable's data type. We'll also take a look at the column names, just to get a better sense of them.

In [4]:
customer_data.dtypes

Customer_ID                     int64
Age                             int64
Gender                         object
Annual_Income                 float64
Total_Spend                   float64
Years_as_Customer               int64
Num_of_Purchases                int64
Average_Transaction_Amount    float64
Num_of_Returns                  int64
Num_of_Support_Contacts         int64
Satisfaction_Score              int64
Last_Purchase_Days_Ago          int64
Email_Opt_In                     bool
Promotion_Response             object
Target_Churn                     bool
dtype: object

Now, let's get a bit more information about the dataset and check whether it has any null values. If it does, we'll need a proper strategy for handling them, such as removing or substituting them.

In [5]:
customer_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 15 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Customer_ID                 1000 non-null   int64  
 1   Age                         1000 non-null   int64  
 2   Gender                      1000 non-null   object 
 3   Annual_Income               1000 non-null   float64
 4   Total_Spend                 1000 non-null   float64
 5   Years_as_Customer           1000 non-null   int64  
 6   Num_of_Purchases            1000 non-null   int64  
 7   Average_Transaction_Amount  1000 non-null   float64
 8   Num_of_Returns              1000 non-null   int64  
 9   Num_of_Support_Contacts     1000 non-null   int64  
 10  Satisfaction_Score          1000 non-null   int64  
 11  Last_Purchase_Days_Ago      1000 non-null   int64  
 12  Email_Opt_In                1000 non-null   bool   
 13  Promotion_Response          1000 n

In [6]:
customer_data.isnull().sum()

Customer_ID                   0
Age                           0
Gender                        0
Annual_Income                 0
Total_Spend                   0
Years_as_Customer             0
Num_of_Purchases              0
Average_Transaction_Amount    0
Num_of_Returns                0
Num_of_Support_Contacts       0
Satisfaction_Score            0
Last_Purchase_Days_Ago        0
Email_Opt_In                  0
Promotion_Response            0
Target_Churn                  0
dtype: int64
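
For reference, here is a minimal sketch of the two strategies mentioned earlier (removing or substituting nulls). It runs on a small made-up frame, not on `customer_data`; the name `toy` and its values are hypothetical, and using the median and the most frequent value as substitutes is just one common choice, not necessarily the right one for every column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps, for illustration only.
toy = pd.DataFrame({
    "Age": [62.0, np.nan, 18.0, 21.0],
    "Gender": ["Other", "Male", None, "Other"],
})

# Strategy 1: drop every row that contains at least one null.
dropped = toy.dropna()

# Strategy 2: substitute nulls, using the median for the numeric column
# and the most frequent value for the categorical one.
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Gender"] = filled["Gender"].fillna(filled["Gender"].mode()[0])

print(dropped.shape)
print(filled.isnull().sum())
```

Dropping rows is simpler but shrinks the dataset, while substituting keeps every row at the cost of introducing estimated values.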

Well, looks like we've got a good dataset: no null values! As the last step before analyzing the data and answering our questions, let's take a look at the unique values of our object variables.

In [7]:
print(f"The unique values of the 'Gender' column are: {customer_data['Gender'].unique()}")
print()
print(f"The unique values of the 'Promotion Response' column are: {customer_data['Promotion_Response'].unique()}")
print()

The unique values of the 'Gender' column are: ['Other' 'Male' 'Female']

The unique values of the 'Promotion Response' column are: ['Responded' 'Ignored' 'Unsubscribed']
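
Beyond listing the unique values, it can help to see how often each one occurs. The sketch below uses `value_counts()` on a small made-up frame; the name `toy` and its rows are hypothetical and only reuse the column names from our dataset.

```python
import pandas as pd

# Hypothetical toy frame reusing the two object column names.
toy = pd.DataFrame({
    "Gender": ["Other", "Male", "Male", "Other", "Female"],
    "Promotion_Response": ["Responded", "Responded", "Ignored",
                           "Ignored", "Unsubscribed"],
})

# Count how many rows fall into each category, per column.
counts = {
    column: toy[column].value_counts()
    for column in ["Gender", "Promotion_Response"]
}

for column, column_counts in counts.items():
    print(f"Value counts for '{column}':")
    print(column_counts)
    print()
```

Running the same loop on `customer_data` would show whether any category is too rare to support reliable comparisons later on.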



# 2. What Are Our Questions?

After getting familiar with our dataset, let's explore the questions we can answer by analyzing this data.

Here's a list of 11 questions:
1. What is the overall churn rate of the retail company?
2. Does customer satisfaction score correlate with churn?
3. How do age and gender influence customer churn?
4. Do customers with a longer tenure (years as a customer) exhibit lower churn rates?
5. Are customers with a higher number of returns more likely to churn?
6. Do customers who engage in more support contacts have a higher churn rate?
7. Is there any correlation between purchases and customer churn?
8. Does the response to promotions affect the likelihood of churn?
9. Is there a difference in churn rates between customers who have opted in for emails and those who haven't?
10. How is the target churn variable distributed across the dataset?
11. Can we build a predictive model to forecast customer churn based on the available features?
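
Several of these questions come down to comparing churn rates across groups. As a rough sketch of that pattern (on made-up data, with the hypothetical name `toy`; this is not necessarily how each question is answered later), a `groupby` followed by the mean of the boolean churn column gives the churn rate per group:

```python
import pandas as pd

# Hypothetical toy data reusing two column names from the dataset.
toy = pd.DataFrame({
    "Promotion_Response": ["Responded", "Responded", "Ignored",
                           "Ignored", "Unsubscribed", "Unsubscribed"],
    "Target_Churn": [False, True, True, True, False, True],
})

# The mean of a boolean column is the share of True values,
# so this is the churn rate (in percent) within each group.
churn_by_response = (
    toy.groupby("Promotion_Response")["Target_Churn"].mean() * 100
)

print(churn_by_response)
```

A difference between groups in a table like this suggests a relationship but does not by itself show that one causes the other.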



# 2.1. What is the overall churn rate of the retail company?

In [8]:
churn_rate = customer_data['Target_Churn'].mean() * 100
print(churn_rate)

52.6


So far, we know that 52.6% of the company's customers have stopped doing business with it. In other words, the overall churn rate is 52.6%.
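
Taking the mean of `Target_Churn` works because pandas treats `True` as 1 and `False` as 0, so the mean equals the number of churned customers divided by the total. A small sketch on a made-up series (the values are hypothetical) shows that both routes agree:

```python
import pandas as pd

# Hypothetical churn flags for five customers: three churned, two did not.
churn = pd.Series([True, False, True, True, False], name="Target_Churn")

# Route 1: mean of the boolean column, as used above.
rate_from_mean = churn.mean() * 100

# Route 2: explicit count of churned customers over all customers.
rate_from_counts = churn.sum() / len(churn) * 100

print(rate_from_mean, rate_from_counts)
```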

# 2.2. Does customer satisfaction score correlate with churn?

