# Dataset Cleaning
*(This notebook was inspired by Anton T. Ruberts' Dataset Cleaning notebook.)*
The data come from the [Online Retail Customer Churn Dataset](https://www.kaggle.com/datasets/hassaneskikri/online-retail-customer-churn-dataset) on Kaggle.

The main objectives of this notebook are:
- observe the contents of the dataset,
- handle missing, duplicate, incorrect, or outlier values, and
- export the cleaned data.

In [2]:
import pandas as pd

## Loading the dataset

In [5]:
data = pd.read_csv('../data/Customer Churn.csv')
data

Unnamed: 0,Call Failure,Complains,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Frequency of SMS,Distinct Called Numbers,Age Group,Tariff Plan,Status,Age,Customer Value,Churn
0,8,0,38,0,4370,71,5,17,3,1,1,30,197.640,0
1,0,0,39,0,318,5,7,4,2,1,2,25,46.035,0
2,10,0,37,0,2453,60,359,24,3,1,1,30,1536.520,0
3,10,0,38,0,4198,66,1,35,1,1,1,15,240.020,0
4,3,0,38,0,2393,58,2,33,1,1,1,15,145.805,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3145,21,0,19,2,6697,147,92,44,2,2,1,25,721.980,0
3146,17,0,17,1,9237,177,80,42,5,1,1,55,261.210,0
3147,13,0,18,4,3157,51,38,21,3,1,1,30,280.320,0
3148,7,0,11,2,4695,46,222,12,3,1,1,30,1077.640,0


In [6]:
print("Dataset shape:", data.shape)

Dataset shape: (3150, 14)


## Initial exploration of the data

Additional variable information:
- Call Failure: number of call failures
- Complains: binary (0: no complaint, 1: complaint)
- Subscription Length: total months of subscription
- Charge Amount: ordinal attribute (0: lowest amount, 9: highest amount)
- Seconds of Use: total seconds of calls
- Frequency of use: total number of calls
- Frequency of SMS: total number of text messages
- Distinct Called Numbers: total number of distinct numbers called
- Age Group: ordinal attribute (1: younger age, 5: older age)
- Tariff Plan: binary (1: pay as you go, 2: contractual)
- Status: binary (1: active, 2: non-active)
- Churn: binary (1: churn, 0: non-churn), the class label
- Customer Value: the calculated value of the customer

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Call  Failure            3150 non-null   int64  
 1   Complains                3150 non-null   int64  
 2   Subscription  Length     3150 non-null   int64  
 3   Charge  Amount           3150 non-null   int64  
 4   Seconds of Use           3150 non-null   int64  
 5   Frequency of use         3150 non-null   int64  
 6   Frequency of SMS         3150 non-null   int64  
 7   Distinct Called Numbers  3150 non-null   int64  
 8   Age Group                3150 non-null   int64  
 9   Tariff Plan              3150 non-null   int64  
 10  Status                   3150 non-null   int64  
 11  Age                      3150 non-null   int64  
 12  Customer Value           3150 non-null   float64
 13  Churn                    3150 non-null   int64  
dtypes: float64(1), int64(13)

**Observations**
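- Several columns are stored as int64 even though they encode categorical or binary values; correcting these dtypes will make the data cleaner to work with.
- Column name spacing is inconsistent (for example, `Call  Failure` contains a double space).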

**Actions**
- Convert Tariff Plan and Status from int to categorical variables,
- convert Complains and Churn from int to bool,
- drop Age Group because Age is less vague, and
- rename columns with shortened names, using underscores for spaces.
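The actions above could be sketched roughly as follows. This is only an illustration on two sample rows copied from the preview, not the notebook's actual cleaning code: the helper name `tidy`, the snake_case column names, and the category labels are assumptions derived from the variable descriptions.

```python
import pandas as pd

# Two sample rows taken from the preview above (subset of columns).
sample = pd.DataFrame({
    "Call  Failure": [8, 0],
    "Complains": [0, 0],
    "Subscription  Length": [38, 39],
    "Age Group": [3, 2],
    "Tariff Plan": [1, 1],
    "Status": [1, 2],
    "Age": [30, 25],
    "Churn": [0, 0],
})

def tidy(df):
    """Hypothetical helper applying the four actions listed above."""
    out = df.copy()
    # Collapse any run of whitespace into one underscore and lower-case the names.
    out.columns = (
        out.columns.str.strip()
        .str.replace(r"\s+", "_", regex=True)
        .str.lower()
    )
    # Age carries the same information as Age Group at finer resolution.
    out = out.drop(columns=["age_group"])
    # Labels are assumed from the variable descriptions (1/2 codes).
    out["tariff_plan"] = (
        out["tariff_plan"].map({1: "pay_as_you_go", 2: "contractual"}).astype("category")
    )
    out["status"] = out["status"].map({1: "active", 2: "non_active"}).astype("category")
    # Binary 0/1 flags become booleans.
    out = out.astype({"complains": bool, "churn": bool})
    return out

cleaned = tidy(sample)
print(cleaned.dtypes)
```

Mapping the integer codes to labels before casting to `category` keeps the meaning of each code visible in later plots and tables, at the cost of hard-coding the mapping.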

In [9]:
data.describe()

Unnamed: 0,Customer_ID,Age,Annual_Income,Total_Spend,Years_as_Customer,Num_of_Purchases,Average_Transaction_Amount,Num_of_Returns,Num_of_Support_Contacts,Satisfaction_Score,Last_Purchase_Days_Ago
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,500.5,43.267,111962.96,5080.79265,9.727,49.456,266.87653,4.612,1.934,2.974,182.89
std,288.819436,15.242311,52844.111367,2862.12335,5.536346,28.543595,145.873445,2.896869,1.402716,1.391855,104.391319
min,1.0,18.0,20010.0,108.94,1.0,1.0,10.46,0.0,0.0,1.0,1.0
25%,250.75,30.0,67800.0,2678.675,5.0,25.0,139.6825,2.0,1.0,2.0,93.0
50%,500.5,43.0,114140.0,4986.195,9.0,49.0,270.1,5.0,2.0,3.0,180.5
75%,750.25,56.0,158452.5,7606.47,14.0,74.0,401.6025,7.0,3.0,4.0,274.0
max,1000.0,69.0,199730.0,9999.64,19.0,99.0,499.57,9.0,4.0,5.0,364.0


**Observations**
- According to the dataset's source page, Annual_Income is in thousands of dollars. We'll convert it to dollars for easier comparison with the other monetary columns.
- There are no redundant or unnecessary columns, so no columns will be dropped.

In [8]:
data['Annual_Income'] = data['Annual_Income'] * 1000
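One caveat with an in-place conversion like this is that re-running the cell multiplies the column again. A minimal sketch of a re-run-safe variant, assuming the untouched values are kept in a separate frame (the names `raw` and `converted` are hypothetical, and the sample incomes are the previewed values divided back by 1000):

```python
import pandas as pd

# Hypothetical raw values in thousands of dollars.
raw = pd.DataFrame({"Annual_Income": [45.15, 79.51, 29.19]})

# assign() returns a new frame and leaves `raw` untouched, so running
# this cell repeatedly always yields the same result.
converted = raw.assign(Annual_Income=raw["Annual_Income"] * 1000)
print(converted)
```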

### Missing Data

In [10]:
data.isna().sum()

Customer_ID                   0
Age                           0
Gender                        0
Annual_Income                 0
Total_Spend                   0
Years_as_Customer             0
Num_of_Purchases              0
Average_Transaction_Amount    0
Num_of_Returns                0
Num_of_Support_Contacts       0
Satisfaction_Score            0
Last_Purchase_Days_Ago        0
Email_Opt_In                  0
Promotion_Response            0
Target_Churn                  0
dtype: int64

**Observations**
- No column has missing values, so there is no need to ask where missing data come from or whether missing entries are associated with a category or value of a column.

**Outcome(s)**
- No rows need to be dropped because of missing values.
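If a column did contain missing values, one way to check whether the missingness is tied to a category would be to compare the missing rate per group. The sketch below uses an invented toy frame (its values are hypothetical and not taken from the dataset) purely to show the pattern:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in Annual_Income.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Other", "Other", "Female"],
    "Annual_Income": [79510.0, np.nan, 45150.0, np.nan, 60000.0],
})

# Share of missing Annual_Income values within each Gender group.
missing_rate = toy["Annual_Income"].isna().groupby(toy["Gender"]).mean()
print(missing_rate)
```

A rate that differs sharply between groups would suggest the data are not missing at random, which matters when deciding whether rows can simply be dropped.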

## Data Pre-processing Pipeline

This dataset is already clean, so no pre-processing pipeline needs to be applied. Nonetheless, I'm leaving this section here to serve as a template for future projects.
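For a future dataset that does need cleaning, the template could be filled in along these lines. This is a sketch under assumptions, not the notebook's method: the step functions `drop_duplicate_rows` and `flag_iqr_outliers` are hypothetical helpers, the 1.5 × IQR rule is just one common outlier heuristic, and the toy values are invented.

```python
import pandas as pd

def drop_duplicate_rows(df):
    """Remove exact duplicate rows and renumber the index."""
    return df.drop_duplicates().reset_index(drop=True)

def flag_iqr_outliers(df, column, k=1.5):
    """Add a boolean column marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    is_outlier = ~df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df.assign(**{f"{column}_is_outlier": is_outlier})

# Hypothetical toy data: one duplicated row and one extreme value.
toy = pd.DataFrame({
    "Customer_ID": [1, 2, 2, 3, 4, 5],
    "Total_Spend": [100.0, 120.0, 120.0, 110.0, 130.0, 5000.0],
})

# DataFrame.pipe chains the steps in reading order.
cleaned = (
    toy.pipe(drop_duplicate_rows)
       .pipe(flag_iqr_outliers, column="Total_Spend")
)
print(cleaned)
```

Flagging outliers instead of deleting them keeps the decision about what to do with them separate from the detection step.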

In [12]:
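# Template placeholder, unused here because this dataset has no transaction_date column:
# data['transaction_date'] = pd.to_datetime(data['transaction_date'], format="mixed", dayfirst=True)
# Export the cleaned data without writing the index as a column.
data.to_csv('../data/online_retail_churn.csv', index=False)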

In [11]:
data.head()

Unnamed: 0,Customer_ID,Age,Gender,Annual_Income,Total_Spend,Years_as_Customer,Num_of_Purchases,Average_Transaction_Amount,Num_of_Returns,Num_of_Support_Contacts,Satisfaction_Score,Last_Purchase_Days_Ago,Email_Opt_In,Promotion_Response,Target_Churn
0,1,62,Other,45150.0,5892.58,5,22,453.8,2,0,3,129,True,Responded,True
1,2,65,Male,79510.0,9025.47,13,77,22.9,2,2,3,227,False,Responded,False
2,3,18,Male,29190.0,618.83,13,71,50.53,5,2,2,283,False,Responded,True
3,4,21,Other,79630.0,9110.3,3,33,411.83,5,3,5,226,True,Ignored,True
4,5,21,Other,77660.0,5390.88,15,43,101.19,3,0,5,242,False,Unsubscribed,False
