<img src="https://i1.creativecow.net/u/301161/ezgif.com-resize5.gif"/>

# About This Kernel
****

This Kernel will be updated daily. I'll be updating you guys with more statistical and graphical analysis. Feel free to leave any important findings or questions during your exploration! Let's d

This notebook will always be a work in progress. Please leave comments about further improvements to the notebook! Any feedback or constructive criticism is greatly appreciated. Thank you guys!

# Part 1: Obtaining the Data 
***

In [None]:
# Import the necessary modules for data manipulation and visual representation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
%matplotlib inline

In [None]:
train = pd.read_csv('../input/train.csv')
members = pd.read_csv('../input/members.csv')
transactions = pd.read_csv('../input/transactions.csv')
#sample_submission_zero= pd.read_csv('../input/sample_submission_zero.csv')
#user_logs = pd.read_csv('../input/user_logs.csv',nrows = 2e7)

# Part 2: Scrubbing the Data 
***

### Overview of Train DataFrame
***

**The dataset has:**
 - **Observations:** 992,931
 - **Features:** 2
 - **Churn Rate:** 6.4%

**Feature Description:**
- **msno:** This feature represents the customer's user **ID**, stored as a long character string.
- **is_churn:** This feature is our target variable. **0's** represent no churn. **1's** represent churn.

**Questions & Concerns:**
- "The provided training data set is derived from transaction log. We picked the users who have their expiration dates fall in Feb, 2017 and check whether those people renew their subscription with 30 days after expiration to generate training label. Our method is not the only way to generate the training data. The training data set can be generate using different logic. Say, you can check each user's transaction log and calculate the interval between two consecutive entries. In this case, you will generate a training data set much bigger than what we provided in the data section." - Arden Chiu 


- "One reminder, we did make a filter on the expiration date associated with each transaction. We removed the entries that have expiration date > 2017-03-31." - Arden Chiu


- **Question:** "Can we use the expiration date in the members.csv? Is it future information?" - yangyang
    - **Response**: "The expiration date in the members.csv is a snapshot of our member table. Hence it is possible to contain future information, but it may not give you much useful information. Say a user made a two-year term subscription on 2017-03-15. We will have the membership expiration in member.csv set to 2019-05-15, however, this does not mean the user will not have other transaction between those two dates (2017-03-15- 2019-03-15)." - Arden Chiu

Source: https://www.kaggle.com/c/kkbox-churn-prediction-challenge/discussion/39756
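
The alternative labeling logic described in the quote above (look at each user's transaction log and measure the interval between consecutive entries) could be sketched roughly as follows. This is only a toy illustration, not the organizers' actual procedure: the rows are made up, and the column names `transaction_date` and `membership_expire_date` are assumptions about the layout of transactions.csv.

```python
import pandas as pd

# Made-up transaction log; the column names are assumptions, not confirmed by this notebook
tx = pd.DataFrame({
    "msno": ["a", "a", "b", "b", "c"],
    "transaction_date": pd.to_datetime(
        ["2017-01-01", "2017-02-05", "2017-01-10", "2017-04-01", "2017-01-15"]),
    "membership_expire_date": pd.to_datetime(
        ["2017-02-01", "2017-03-05", "2017-02-10", "2017-05-01", "2017-02-15"]),
})

tx = tx.sort_values(["msno", "transaction_date"])
# Date of the same user's next transaction (NaT when there is none)
tx["next_transaction"] = tx.groupby("msno")["transaction_date"].shift(-1)
# Days between this entry's expiration and the user's next transaction
tx["gap_days"] = (tx["next_transaction"] - tx["membership_expire_date"]).dt.days
# Churn = no renewal within 30 days after expiration
tx["is_churn"] = (tx["gap_days"].isna() | (tx["gap_days"] > 30)).astype(int)

# Keep only the entries that expire in Feb 2017, as in the quoted description
feb_start, feb_end = pd.Timestamp("2017-02-01"), pd.Timestamp("2017-02-28")
feb = tx[tx["membership_expire_date"].between(feb_start, feb_end)]
labels = feb.set_index("msno")["is_churn"]
print(labels)
```

On real data, entries whose 30-day window runs past the end of the log would need to be dropped first, since a missing next transaction there does not necessarily mean churn.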

In [None]:
train.head()

In [None]:
# The dataset contains 2 columns and 992931 observations
train.shape

In [None]:
# Check to see if the train set has any missing values. No missing values!
train.isnull().any()

In [None]:
# About 93.6% of customers stayed and 6.4% of customers churned.
# NOTE: When performing cross-validation, it's important to maintain this churn ratio in every fold
churn_rate = train.is_churn.value_counts() / len(train)
churn_rate
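
One way to keep the churn ratio stable during cross-validation is a stratified split. Here is a minimal sketch using scikit-learn's `StratifiedKFold` on made-up labels with roughly the same imbalance; the number of folds and the random seed are arbitrary choices, and the placeholder feature matrix only exists because `split` expects one.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels with roughly the same imbalance as the train set (about 6.4% churn)
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.064).astype(int)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here

# Each validation fold receives (almost) the same share of churners
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_rates = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]

print("overall churn rate:", y.mean())
print("per-fold churn rates:", fold_rates)
```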

### Overview of Members DataFrame
***

**The dataset has:**
 - **Observations:** 5,116,194
 - **Features:** 7
 - **Missing Value(s):** gender

**Feature Description:**
- **msno:** This feature represents the customer's user **ID**, stored as a long character string.
- **city:** This feature contains **21** different unique cities, ranging from 1-22 (**excluding** the number **2**)
- **bd:** This feature represents the **age** of the user. It contains a lot of **outliers**, so it is probably not useful as-is.
- **gender:** This feature represents the customer's gender. Its distribution contains **A LOT OF MISSING VALUES**: about **17%** are males, **17%** are females, and **66%** are NaNs. Probably not a useful variable.
- **registered_via:** This feature represents the registration method of the user. There are **7** unique labels. 
- **registration_init_time:** This feature is just the date of registration of the user
- **expiration_date:** This feature represents the expiration date of the user's subscription


In [None]:
members.tail()

In [None]:
# The members dataset contains 7 columns and 5,116,194 observations
members.shape

In [None]:
# Check to see if the members set has any missing values.
members.isnull().any()

In [None]:
# Quick Overview of the members dataframe
members.describe()

In [None]:
members.city.describe()

In [None]:
# Display the unique values in the city variable
# It has 21 unique city values and the #2 is missing
members.city.unique()

In [None]:
# Display the unique values in the bd variable
# It contains many outliers and random numbers. Maybe this variable shouldn't be used
members.bd.unique()
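
If the age variable is kept after all, one option is to mask the implausible values instead of dropping the whole column. A minimal sketch on made-up ages; the bounds `MIN_AGE` and `MAX_AGE` are hypothetical and should be tuned to the data.

```python
import pandas as pd

# Made-up ages, including the kind of outliers seen in bd
bd = pd.Series([0, 25, -7000, 31, 1051, 18, 0, 47], name="bd")

# Hypothetical plausibility bounds (assumptions, not taken from the dataset)
MIN_AGE, MAX_AGE = 10, 100

# Values outside the bounds become NaN; plausible ages are kept unchanged
bd_clean = bd.where(bd.between(MIN_AGE, MAX_AGE))

print("share of plausible ages:", bd_clean.notna().mean())
print("median plausible age:", bd_clean.median())
```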

In [None]:
# Display the distribution of the gender variable (NaNs are not counted, so the shares do not sum to 1)
members.gender.value_counts() / len(members)
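
To see the missing share directly in the same table, `value_counts` can include NaN as its own row. A small sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Made-up gender column where most values are missing
gender = pd.Series(["male", "female", np.nan, np.nan, np.nan, np.nan], name="gender")

# normalize=True returns shares; dropna=False keeps NaN as its own row
shares = gender.value_counts(normalize=True, dropna=False)
print(shares)
```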

In [None]:
members.registered_via.unique()

### Overview of Transaction DataFrame
***

**The dataset has:**
 - **Observations:** 5,116,194
 - **Features:** 7
 - **Missing Value(s):** gender

In [None]:
transactions.head()

In [None]:
# Dimensions (rows, columns) of the transactions DataFrame
transactions.shape

In [None]:
# Check to see if the transaction set has any missing values.
transactions.isnull().any()

# Reformatting Features in Train/Members Dataset
***

### Convert 'is_churn' and the categorical members features to the category dtype


In [None]:
# Convert is_churn into a categorical variable
train["is_churn"] = train["is_churn"].astype('category')

# Convert these features from members dataset into categorical variables
members["city"] = members["city"].astype('category')
members["gender"] = members["gender"].astype('category')
members["registered_via"] = members["registered_via"].astype('category')
members["registration_init_time"] = members["registration_init_time"].astype('category')
members["expiration_date"] = members["expiration_date"].astype('category')
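
As an alternative for the two date columns, parsing them as datetimes keeps their ordering and allows date arithmetic, which the category dtype does not. A minimal sketch on a made-up frame, assuming the dates are stored as `YYYYMMDD` integers (worth checking against `members.head()` first); `toy_members` and `membership_days` are illustrative names only.

```python
import pandas as pd

# Made-up stand-in for members; assumes dates are stored as YYYYMMDD integers
toy_members = pd.DataFrame({
    "registration_init_time": [20110911, 20150305],
    "expiration_date": [20170930, 20170411],
})

# Parse both date columns into proper datetime64 values
for col in ["registration_init_time", "expiration_date"]:
    toy_members[col] = pd.to_datetime(toy_members[col].astype(str), format="%Y%m%d")

# Datetime columns support arithmetic, e.g. days between registration and expiration
toy_members["membership_days"] = (
    toy_members["expiration_date"] - toy_members["registration_init_time"]
).dt.days
print(toy_members)
```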

# Merge Train & Members Dataset
***

In [None]:
training = pd.merge(left = train,right = members,how = 'left',on=['msno'])
training.head()
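
A left merge keeps every row of the train set, so users without a match in the members table end up with NaNs. The `indicator` option of `pd.merge` makes it easy to count them. A small sketch on made-up frames:

```python
import pandas as pd

# Made-up stand-ins for train and members; user "b" has no members record
toy_train = pd.DataFrame({"msno": ["a", "b", "c"], "is_churn": [0, 1, 0]})
toy_members = pd.DataFrame({"msno": ["a", "c"], "city": [1, 13]})

# indicator=True adds a _merge column: "both" or "left_only" for a left merge
merged = pd.merge(toy_train, toy_members, how="left", on="msno", indicator=True)
unmatched_share = (merged["_merge"] == "left_only").mean()

print(merged)
print("share of train users without a members record:", unmatched_share)
```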

In [None]:
training.dtypes

In [None]:
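# Forward-fill missing member attributes for train users with no match in members.csv.
# Assigning the result back avoids inplace fillna on a column selection,
# which is unreliable in recent pandas versions.
for col in ['city', 'bd', 'gender', 'registered_via']:
    training[col] = training[col].ffill()

# Check which columns still contain missing values
training.isnull().any()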
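
Note that a forward fill copies the value from whichever row happens to come before, which is an unrelated user here. One possible alternative for a categorical column is an explicit placeholder level. A minimal sketch on a made-up series; the label `"unknown"` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Made-up categorical column with missing values
gender = pd.Series(["male", np.nan, "female", np.nan], dtype="category")

# The new level must be registered as a category before it can be used as a fill value
gender_filled = gender.cat.add_categories("unknown").fillna("unknown")
print(gender_filled.value_counts())
```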

# Exploring the Data
***

## Members Exploration
### City / Gender / Churn Distributions
***

In [None]:
# Set up the matplotlib figure
f, axes = plt.subplots(ncols=3, figsize=(20, 6))

# Graph User City Distribution
# sns.distplot(training.city, kde=False, color="g",  ax=axes[0]).set_title('User City Distribution')
data = training.groupby('city').aggregate({'msno':'count'}).reset_index()
sns.barplot(x='city', y='msno', data=data, ax=axes[0]).set_title('User City Distribution')

# Graph User Gender Distribution
##sns.barplot(x="gender", data=training, ax=axes[1]).set_title('User Register_Via Distribution')
sns.countplot(y="gender", data=training, color="c",  ax=axes[1]).set_title('User Gender Distribution')

# Graph User Churn Distribution
# countplot suits the categorical is_churn column; distplot is deprecated in recent seaborn releases
sns.countplot(x="is_churn", data=training, color="b", ax=axes[2]).set_title('User Churn Distribution')

## Member's Registration Type Distribution
***

In [None]:
sns.countplot(y="registered_via", data=training, color="c").set_title('Registration Type Distribution')