<a href="https://colab.research.google.com/github/jsroa15/KKBOX/blob/main/merged_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The objective is to merge all datasets into one final dataset that we will use to train and evaluate the final model. As always, we follow the structure of EDA and Feature Engineering.

**Exploratory Data Analysis**

1.  Load data
2.  Merge all datasets
3.  General statistics
4.  Data Visualization
5.  Data Cleaning
6.  Fixing formats

**Feature Engineering**

7.  Create new features
8.  Data Transformation
9.  Outlier detection
10.  Scaling features (optional)
11.  Create a dataframe grouped by user id

**Import packages**

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# **Exploratory Data Analysis**


## 1. Load Data

In [3]:
# Training labels, transactions, members and user logs
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/KKBOX/train.csv')
transactions = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/KKBOX/df_transactions.csv')
users = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/KKBOX/df_members.csv')
logs = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/KKBOX/df_logs.csv')

## 2. Merge all datasets

In [4]:
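# Left joins on the user id keep every user from the training set,
# even when a user has no row in one of the other tables
df = df.merge(transactions, on='msno', how='left')
df = df.merge(users, on='msno', how='left')
df = df.merge(logs, on='msno', how='left')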
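
A left join can silently multiply rows when the key repeats on the right side, and it leaves gaps where a user has no match. The sketch below shows one way to check both, using the `validate` and `indicator` parameters of `DataFrame.merge`. The two small frames and their values are hypothetical stand-ins for the real tables, not data from this notebook.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values)
train = pd.DataFrame({"msno": ["a", "b", "c"], "is_churn": [1, 0, 0]})
members = pd.DataFrame({"msno": ["a", "b"], "city": [18, 10]})

# validate="one_to_one" raises pandas.errors.MergeError if msno repeats on either side;
# indicator=True adds a "_merge" column showing where each row found a match
merged = train.merge(members, on="msno", how="left",
                     validate="one_to_one", indicator=True)

# Share of training users without a match in the members table
unmatched_share = (merged["_merge"] == "left_only").mean()
print(merged["_merge"].value_counts())
print(f"unmatched: {unmatched_share:.2%}")

# The helper column is no longer needed once the check is done
merged = merged.drop(columns="_merge")
```

Applied to the real tables, the same pattern would report how many users lack a row in each source before any statistics are computed.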

## 3. General Statistics

In [5]:
#First rows of the dataset

df.head()

Unnamed: 0,msno,is_churn,regist_trans,mst_frq_plan_days,mst_frq_pay_met,revenue,is_auto_renew,regist_cancels,qtr_trans,city,bd,gender,registered_via,registration_init_time,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs
0,waLDQMmcOu2jLDaV1ddDkgCrB/jl6sD66Xzs0Vqax1Y=,1,2,7,38,149,0,0,1,18.0,36.0,female,9.0,2005-04-06,0.621227,0.274653,0.44794,0.173287,2.880669,2.962292,8.440285
1,QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=,1,23,30,39,3458,1,2,3,10.0,38.0,male,9.0,2005-04-07,0.444694,0.193904,0.17525,0.179176,2.601858,2.342516,8.087119
2,fGwBva6hikQmTJzrbz/2Ezjm5Cth5jZUNvXigKK2AFA=,1,10,30,39,1492,1,1,1,11.0,27.0,female,9.0,2005-10-16,1.168699,0.46995,0.360776,0.499874,3.253308,3.35579,8.697465
3,mT5V8rEpa+8wuqi6x0DoVd3H5icMKkE9Prt49UlmK+4=,1,2,410,17,1788,0,0,1,13.0,23.0,female,9.0,2005-11-02,1.830671,1.01807,0.974649,0.85055,2.618528,2.699278,8.174752
4,XaPhtGLk/5UvvOYHcONTwsnH97P4eGECeq+BARGItRw=,1,8,30,38,3576,0,0,1,3.0,27.0,male,9.0,2005-12-28,0.757936,0.378817,0.493943,0.969785,4.400313,4.456234,9.985234


In [6]:
#Basic info of the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 992931 entries, 0 to 992930
Data columns (total 21 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   msno                    992931 non-null  object 
 1   is_churn                992931 non-null  int64  
 2   regist_trans            992931 non-null  int64  
 3   mst_frq_plan_days       992931 non-null  int64  
 4   mst_frq_pay_met         992931 non-null  int64  
 5   revenue                 992931 non-null  int64  
 6   is_auto_renew           992931 non-null  int64  
 7   regist_cancels          992931 non-null  int64  
 8   qtr_trans               992931 non-null  int64  
 9   city                    877161 non-null  float64
 10  bd                      877161 non-null  float64
 11  gender                  877161 non-null  object 
 12  registered_via          877161 non-null  float64
 13  registration_init_time  877161 non-null  object 
 14  num_25              

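Some datatypes need fixing: `registration_init_time` is stored as `object` instead of a datetime, and `city`, `bd` and `registered_via` are `float64` even though the values shown are whole numbers.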
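
As a rough sketch of what those fixes could look like, the example below converts a small hypothetical frame with the same column names. The target dtypes (nullable `Int64` for the code-like columns, `category` for `gender`) are assumptions made for illustration, not necessarily the choices made later in this notebook.

```python
import pandas as pd

# Toy frame mimicking the dtypes reported by df.info() (hypothetical values)
demo = pd.DataFrame({
    "city": [18.0, 10.0, None],
    "registered_via": [9.0, 9.0, None],
    "gender": ["female", "male", None],
    "registration_init_time": ["2005-04-06", "2005-04-07", None],
})

# Parse the date strings; missing entries become NaT
demo["registration_init_time"] = pd.to_datetime(demo["registration_init_time"])

# Whole-number columns with gaps: the nullable integer dtype keeps the missing values
for col in ["city", "registered_via"]:
    demo[col] = demo[col].astype("Int64")

# gender has few distinct values, so a categorical dtype suits it
demo["gender"] = demo["gender"].astype("category")

print(demo.dtypes)
```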

In [9]:
#Missing Values

pd.DataFrame({'%MissingValues':round(df.isna().sum()/df.shape[0]*100,2)})

Unnamed: 0,%MissingValues
msno,0.0
is_churn,0.0
regist_trans,0.0
mst_frq_plan_days,0.0
mst_frq_pay_met,0.0
revenue,0.0
is_auto_renew,0.0
regist_cancels,0.0
qtr_trans,0.0
city,11.66


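The dataset has missing values: `city` is missing in 11.66% of the rows. In the `df.info()` output, every column from `city` to `registration_init_time` has 877,161 non-null entries out of 992,931, which suggests these are users with no record in the members table. We will decide how to impute them in a later stage.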
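
To make the imputation idea concrete, here is a minimal sketch on a hypothetical frame. The strategy (median for `bd`, most frequent value for `city`, an explicit `"unknown"` label for `gender`) and the flag column `member_info_missing` are assumptions chosen for illustration; the approach actually used later may differ.

```python
import pandas as pd

# Toy frame with gaps in the member columns (hypothetical values)
demo = pd.DataFrame({
    "bd": [36.0, 38.0, None, 27.0],
    "city": [18.0, None, 11.0, 18.0],
    "gender": ["female", "male", None, "female"],
})

# Record missingness before filling, so the information is not lost
demo["member_info_missing"] = (
    demo[["bd", "city", "gender"]].isna().any(axis=1).astype(int)
)

# Numeric column: median; code-like column: most frequent value; category: explicit label
demo["bd"] = demo["bd"].fillna(demo["bd"].median())
demo["city"] = demo["city"].fillna(demo["city"].mode().iloc[0])
demo["gender"] = demo["gender"].fillna("unknown")

print(demo)
```

Keeping a missingness flag is a common choice when the gaps may themselves carry signal, as they could here if they mark users without a members record.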

In [11]:
#Checking for duplicated values

print(df.shape)
print(df.msno.nunique())

(992931, 21)
992931


The number of rows equals the number of unique `msno` values (992,931), so no user appears more than once in the dataset.
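
An equivalent check counts the repeated ids directly, which also tells us how many rows would need attention if duplicates existed. The frame below is a hypothetical example with one repeated user id.

```python
import pandas as pd

# Toy frame with one repeated user id (hypothetical values)
demo = pd.DataFrame({"msno": ["a", "b", "b"], "is_churn": [1, 0, 0]})

# Number of rows whose msno already appeared earlier in the frame
n_dup_ids = demo.duplicated(subset="msno").sum()

# Series.is_unique gives the yes/no answer in a single call
print(n_dup_ids, demo["msno"].is_unique)
```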

## 4. Data Visualization

Let's explore the data visually and discover useful insights.

