# Assignment 4 - Data Exploration and Pre-Processing

**Objectives:**
1. Understand common tasks and goals in data exploration.
2. Be able to use Python code to do these common tasks, including
 + read in data files
 + identify variables
 + univariate and bivariate analysis
 + use visualization
3. Perform basic data pre-processing tasks, such as
 + change data types
 + merge dataframes 
 + group data by columns
 + impute missing values
 + dummy code categorical variables

## Tasks to do in this assignment

1. Download from Canvas the data we processed in the last class, "train_mem_trans_fea.csv", which combines the features extracted from the members and transactions data files.
2. Read in the csv data file with pandas, specifying the data type of each column. Decide yourself what data type each column should have.
3. Check the memory usage of each column of the created dataframe.
4. Identify which features are categorical and which are continuous.
5. Check the missing values of each column.
6. Decide how to deal with missing values for this data example.
7. Dummy code the categorical features.


### Import libraries

In [98]:
import pandas as pd
import numpy as np
import time 
import datetime

### Tasks 1 and 2: read in csv data with dtypes specified.

These are the columns of the csv file:

['msno', 'is_churn', 'city', 'bd', 'gender', 'registered_via',
       'registration_init_time', 'expiration_date', 'registration_year',
       'registration_month', 'registration_day', 'expiration_year',
       'expiration_month', 'expiration_day', 'payment_method_id',
       'payment_plan_days', 'plan_list_price', 'actual_amount_paid',
       'is_auto_renew', 'transaction_date', 'membership_expire_date',
       'is_cancel', 'trans_year', 'trans_month', 'trans_day',
       'trans_expiration_year', 'trans_expiration_month',
       'trans_expiration_day', 'discount_value', 'discount_y',
       'amount_per_day', 'day_from_last', 'trans_count', 'cancel_count',
       'discount_freq', 'renew_freq', 'discount_std', 'cancel_freq',
       'price_increase']

#### Instruction:
    Create a dictionary that specifies the dtypes for all of the above columns, with the column names as keys and the dtypes as values, e.g.
    
>    `data_type={'msno': 'object', 'is_churn':'bool', ...}`
    
    and parse the dates of 'registration_init_time', 'expiration_date', and 'membership_expire_date'.
    When you read in with pandas, use 
    
>    `dtype=data_type` and `parse_dates`, `infer_datetime_format`
    
**Type your code below (name the created dataframe "train").**
    

In [99]:
data_type = {'msno': 'object', 'is_churn':'bool', 'city': 'float16', 'bd': 'float16', 'gender': 'category', 
         'registered_via': 'float16', 'registration_year': 'float16', 'registration_month': 'float16', 
         'registration_day': 'float16', 'expiration_year': 'float16', 'expiration_month': 'float16', 
         'expiration_day': 'float16', 'payment_method_id': 'int16', 'payment_plan_days': 'int64', 
         'plan_list_price': 'int64', 'actual_amount_paid': 'int64', 'is_auto_renew': 'bool', 
         'is_cancel': 'bool', 'trans_year': 'float16', 'trans_month': 'float16', 'trans_day': 'float16', 
         'trans_expiration_year': 'float16', 'trans_expiration_month': 'float16', 'trans_expiration_day': 'float16', 
         'discount_value': 'int64', 'discount_y': 'bool', 'amount_per_day': 'float64', 'day_from_last': 'int16', 
         'trans_count': 'int16', 'cancel_count': 'int16', 'discount_freq': 'float64', 'renew_freq': 'float64', 
         'discount_std': 'float64', 'cancel_freq': 'float64', 'price_increase': 'float64'}

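# Read the csv with explicit dtypes; the four date columns are parsed to datetime64.
# Note: infer_datetime_format is deprecated as of pandas 2.0 and can be dropped there.
train = pd.read_csv('E:\\usu_classes\\fall_2019\\mis_6110_machine_learning\\data\\train_mem_trans_fea.csv',
                    dtype=data_type,
                    parse_dates=['registration_init_time', 'expiration_date', 'transaction_date', 'membership_expire_date'],
                    infer_datetime_format=True
                    )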
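The float16 choice for columns such as `city` is a common workaround because plain NumPy integer dtypes cannot hold missing values. As a minimal sketch of the same read pattern on a tiny in-memory CSV (the sample rows and the names `csv_text`, `sample_dtypes`, and `sample` are made up for illustration, not taken from the real file), pandas' nullable `Int16` dtype is one possible alternative that keeps integer codes while still allowing missing entries:

```python
import io

import pandas as pd

# Hypothetical three-row sample; the second row has no member information.
csv_text = """msno,is_churn,city,gender,registration_init_time
a1,True,18,female,2005-04-06
b2,False,,,
c3,True,10,male,2005-04-07
"""

# "Int16" (capital I) is pandas' nullable integer dtype; missing cells become <NA>.
sample_dtypes = {"msno": "object", "is_churn": "bool",
                 "city": "Int16", "gender": "category"}

sample = pd.read_csv(io.StringIO(csv_text),
                     dtype=sample_dtypes,
                     parse_dates=["registration_init_time"])

print(sample.dtypes)
print(sample)
```

Whether the nullable dtype is preferable to float16 depends on the downstream libraries, since not all of them accept pandas extension dtypes.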

In [100]:
#Check the shape of the dataframe
train.shape

(992931, 39)

In [101]:
train.head()

Unnamed: 0,msno,is_churn,city,bd,gender,registered_via,registration_init_time,expiration_date,registration_year,registration_month,...,discount_y,amount_per_day,day_from_last,trans_count,cancel_count,discount_freq,renew_freq,discount_std,cancel_freq,price_increase
0,waLDQMmcOu2jLDaV1ddDkgCrB/jl6sD66Xzs0Vqax1Y=,True,18.0,36.0,female,9.0,2005-04-06,2017-09-07,2005.0,4.0,...,False,4.966667,52,2,0,0.0,0.0,0.0,0.0,4.966667
1,QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=,True,10.0,38.0,male,9.0,2005-04-07,2017-03-21,2005.0,4.0,...,False,4.966667,4,23,2,0.043478,0.956522,31.068648,0.086957,0.0
2,fGwBva6hikQmTJzrbz/2Ezjm5Cth5jZUNvXigKK2AFA=,True,11.0,27.0,female,9.0,2005-10-16,2017-02-03,2005.0,10.0,...,False,4.966667,47,10,1,0.0,0.8,0.0,0.1,0.0
3,mT5V8rEpa+8wuqi6x0DoVd3H5icMKkE9Prt49UlmK+4=,True,13.0,23.0,female,9.0,2005-11-02,2017-09-26,2005.0,11.0,...,False,4.360976,419,2,0,0.0,0.0,0.0,0.0,4.360976
4,XaPhtGLk/5UvvOYHcONTwsnH97P4eGECeq+BARGItRw=,True,3.0,27.0,male,9.0,2005-12-28,2017-09-27,2005.0,12.0,...,False,4.966667,31,8,0,0.0,0.0,0.0,0.0,0.382051


#### Double-check the dtypes of each column of the created dataframe

In [102]:
train.dtypes

msno                              object
is_churn                            bool
city                             float16
bd                               float16
gender                          category
registered_via                   float16
registration_init_time    datetime64[ns]
expiration_date           datetime64[ns]
registration_year                float16
registration_month               float16
registration_day                 float16
expiration_year                  float16
expiration_month                 float16
expiration_day                   float16
payment_method_id                  int16
payment_plan_days                  int64
plan_list_price                    int64
actual_amount_paid                 int64
is_auto_renew                       bool
transaction_date          datetime64[ns]
membership_expire_date    datetime64[ns]
is_cancel                           bool
trans_year                       float16
trans_month                      float16
trans_day       

### Task 3. Check the memory usage of each column of the created dataframe.

Make changes to optimize memory usage, if necessary.

In [103]:
print(f'Memory Usage per column \n {train.memory_usage()/1024**2}\n Total Memory Used:{train.memory_usage().sum()/1024**2:6.2f} MB')

Memory Usage per column 
 Index                     0.000076
msno                      7.575462
is_churn                  0.946933
city                      1.893866
bd                        1.893866
gender                    0.947032
registered_via            1.893866
registration_init_time    7.575462
expiration_date           7.575462
registration_year         1.893866
registration_month        1.893866
registration_day          1.893866
expiration_year           1.893866
expiration_month          1.893866
expiration_day            1.893866
payment_method_id         1.893866
payment_plan_days         7.575462
plan_list_price           7.575462
actual_amount_paid        7.575462
is_auto_renew             0.946933
transaction_date          7.575462
membership_expire_date    7.575462
is_cancel                 0.946933
trans_year                1.893866
trans_month               1.893866
trans_day                 1.893866
trans_expiration_year     1.893866
trans_expiration_month    1.8
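By default, `memory_usage()` counts only the 8-byte pointers of an `object` column, so the figure reported for `msno` likely understates the real footprint of the strings. The sketch below, on a small made-up frame (`demo` and its values are illustrative assumptions), compares the shallow and deep measurements and shows one possible way to shrink an int64 column with `pd.to_numeric(..., downcast="integer")`:

```python
import pandas as pd

# Hypothetical frame: one object (string) column and one int64 column.
demo = pd.DataFrame({
    "msno": pd.Series(["waLDQMmcOu2jLDaV1ddDkgCrB/jl6sD66Xzs0Vqax1Y="] * 1000,
                      dtype="object"),
    "payment_plan_days": pd.Series([30] * 1000, dtype="int64"),
})

shallow = demo.memory_usage()          # object column: pointer size only
deep = demo.memory_usage(deep=True)    # object column: includes the string contents
print(shallow["msno"], deep["msno"])

# Downcast to the smallest integer dtype that can hold the observed values.
before = demo["payment_plan_days"].memory_usage(index=False)
demo["payment_plan_days"] = pd.to_numeric(demo["payment_plan_days"], downcast="integer")
after = demo["payment_plan_days"].memory_usage(index=False)
print(demo["payment_plan_days"].dtype, before, after)
```

Downcasting picks a dtype from the values currently present, so it is worth checking that later data cannot exceed the smaller range.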

### Task 4. Identify which features are categorical and which are continuous.

Create two lists with the names of the categorical and continuous columns, as:

category_col = ['city',...     ]

continous_col = ['discount_value', ...]

In [104]:
category_col = ['city', 'gender', 'registered_via', 'payment_method_id', 'is_auto_renew', 'is_cancel', 'discount_y']

continous_col = ['discount_value', 'bd', 'registration_init_time', 'expiration_date', 'payment_plan_days', 'plan_list_price',
                'actual_amount_paid', 'amount_per_day', 'day_from_last', 'trans_count', 'cancel_count',
                'discount_freq', 'cancel_freq']

In [105]:
print(train['cancel_count'].unique())

[ 0  2  1  4  3  7  5  8  6  9 11 10 20]
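Counting distinct values per column is one rough way to shortlist categorical candidates. The sketch below uses a made-up frame and a hypothetical threshold `max_levels`; neither comes from the assignment. It is only a heuristic: `cancel_count` above has just 13 distinct values yet is still a count, so the final split needs judgment about what each column means.

```python
import pandas as pd

# Hypothetical mini-frame with two code-like columns and one measured quantity.
demo = pd.DataFrame({
    "city": [18.0, 10.0, 11.0, 10.0, 18.0, 11.0],
    "is_cancel": [False, True, False, False, True, False],
    "amount_per_day": [4.97, 4.36, 4.90, 0.38, 3.10, 2.25],
})

max_levels = 3  # assumed cut-off; tune it for the real data

n_levels = demo.nunique()  # number of distinct non-null values per column
candidate_category = n_levels[n_levels <= max_levels].index.tolist()
candidate_continuous = n_levels[n_levels > max_levels].index.tolist()

print(n_levels)
print(candidate_category, candidate_continuous)
```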


In [106]:
train.head()

Unnamed: 0,msno,is_churn,city,bd,gender,registered_via,registration_init_time,expiration_date,registration_year,registration_month,...,discount_y,amount_per_day,day_from_last,trans_count,cancel_count,discount_freq,renew_freq,discount_std,cancel_freq,price_increase
0,waLDQMmcOu2jLDaV1ddDkgCrB/jl6sD66Xzs0Vqax1Y=,True,18.0,36.0,female,9.0,2005-04-06,2017-09-07,2005.0,4.0,...,False,4.966667,52,2,0,0.0,0.0,0.0,0.0,4.966667
1,QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=,True,10.0,38.0,male,9.0,2005-04-07,2017-03-21,2005.0,4.0,...,False,4.966667,4,23,2,0.043478,0.956522,31.068648,0.086957,0.0
2,fGwBva6hikQmTJzrbz/2Ezjm5Cth5jZUNvXigKK2AFA=,True,11.0,27.0,female,9.0,2005-10-16,2017-02-03,2005.0,10.0,...,False,4.966667,47,10,1,0.0,0.8,0.0,0.1,0.0
3,mT5V8rEpa+8wuqi6x0DoVd3H5icMKkE9Prt49UlmK+4=,True,13.0,23.0,female,9.0,2005-11-02,2017-09-26,2005.0,11.0,...,False,4.360976,419,2,0,0.0,0.0,0.0,0.0,4.360976
4,XaPhtGLk/5UvvOYHcONTwsnH97P4eGECeq+BARGItRw=,True,3.0,27.0,male,9.0,2005-12-28,2017-09-27,2005.0,12.0,...,False,4.966667,31,8,0,0.0,0.0,0.0,0.0,0.382051


### Task 5. Check the missing values of each column.

In [107]:
train.isnull().sum()

msno                           0
is_churn                       0
city                      116788
bd                        116788
gender                    116788
registered_via            116788
registration_init_time    116788
expiration_date           116788
registration_year         116788
registration_month        116788
registration_day          116788
expiration_year           116788
expiration_month          116788
expiration_day            116788
payment_method_id              0
payment_plan_days              0
plan_list_price                0
actual_amount_paid             0
is_auto_renew                  0
transaction_date               0
membership_expire_date         0
is_cancel                      0
trans_year                     0
trans_month                    0
trans_day                      0
trans_expiration_year          0
trans_expiration_month         0
trans_expiration_day           0
discount_value                 0
discount_y                     0
amount_per
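The identical count of 116788 across the member-derived columns suggests that the same rows are missing in each of them, which would be consistent with transactions that had no matching member record. A minimal sketch of how that could be checked, on a made-up frame (`demo` and `member_cols` are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 1 and 3 lack all member information.
demo = pd.DataFrame({
    "city": [18.0, np.nan, 10.0, np.nan],
    "bd": [36.0, np.nan, 38.0, np.nan],
    "gender": ["female", None, "male", None],
    "trans_count": [2, 23, 10, 2],
})

# Share of missing values per column (0.0 to 1.0).
missing_share = demo.isnull().mean()
print(missing_share)

# If "missing in any member column" equals "missing in all member columns",
# the gaps fall on exactly the same rows.
member_cols = ["city", "bd", "gender"]
rows_missing_any = demo[member_cols].isnull().any(axis=1)
rows_missing_all = demo[member_cols].isnull().all(axis=1)
same_rows = bool((rows_missing_any == rows_missing_all).all())
print(same_rows)
```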

### Task 6. Decide how to deal with missing values for this data example.

Write comments explaining how you deal with the missing values, for later reference, e.g.,
> `train['gender'] = train['gender'].fillna('Unknown') ## use 'Unknown' for gender missing values.`

In [108]:
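# No way to identify gender if not provided, so use an explicit 'Unknown' level.
# gender was read as a category, so the new level is registered before filling.
train['gender'] = train['gender'].cat.add_categories('Unknown').fillna('Unknown')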

## Columns I am leaving null

+ city: There is no way to identify what city members live in if they don't provide the information.
+ bd: I could apply the average age to the unknowns, but then any analysis by age would give inaccurate results.
+ registration and expiration columns: I do not have a way to infer this data, so I will leave it blank for now.
+ discount_std and price_increase: I do not have a way to infer these either, and a null here can simply be read as unknown.
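Leaving values null is a reasonable choice for exploration, but some models cannot accept NaN. If that becomes a problem later, one common alternative (not the approach chosen above) is to keep a flag for the originally missing rows and fill a copy of the column. A minimal sketch on made-up ages, where `bd_missing` and `bd_filled` are hypothetical column names:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
demo = pd.DataFrame({"bd": [36.0, np.nan, 27.0, 23.0, np.nan]})

# Keep a record of which rows were missing, so the imputed values
# can still be separated out when analyzing by age.
demo["bd_missing"] = demo["bd"].isnull()

# Fill a copy with the median of the observed values; the original column is untouched.
demo["bd_filled"] = demo["bd"].fillna(demo["bd"].median())

print(demo)
```

The median is used here instead of the mean only because it is less sensitive to extreme values; either choice distorts the age distribution, which is why the indicator column is kept.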

In [109]:
print(train['city'].unique())

[18. 10. 11. 13.  3.  6.  4. 14. 22. 17.  5.  9.  1. 15. nan 12.  8.  7.
 21. 20. 16. 19.]


In [110]:
train.isnull().sum()

msno                           0
is_churn                       0
city                      116788
bd                        116788
gender                         0
registered_via            116788
registration_init_time    116788
expiration_date           116788
registration_year         116788
registration_month        116788
registration_day          116788
expiration_year           116788
expiration_month          116788
expiration_day            116788
payment_method_id              0
payment_plan_days              0
plan_list_price                0
actual_amount_paid             0
is_auto_renew                  0
transaction_date               0
membership_expire_date         0
is_cancel                      0
trans_year                     0
trans_month                    0
trans_day                      0
trans_expiration_year          0
trans_expiration_month         0
trans_expiration_day           0
discount_value                 0
discount_y                     0
amount_per

### Task 7. Dummy code the categorical features.

#### Instruction:
> Use 
>
> `train = pd.get_dummies(train, columns=[ 'put the columns names that you want to dummy code here' ]) # where you specify the columns list that you want to dummy code.`

In [111]:
# for x in category_col:
#    train[x] = train[x].astype('category')


In [112]:
train = pd.get_dummies(train, columns = category_col) 
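With its default settings, `pd.get_dummies` creates no column for missing values, so a row whose `city` is NaN ends up with all `city_*` dummies equal to zero. If the missing rows should stay distinguishable, `dummy_na=True` adds an explicit indicator column. A minimal sketch on a made-up frame (`demo` is illustrative, not the assignment data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the second row has no city.
demo = pd.DataFrame({"city": [18.0, np.nan, 10.0],
                     "bd": [36.0, 38.0, 27.0]})

# Default behaviour: NaN gets no dummy column of its own.
default = pd.get_dummies(demo, columns=["city"])

# dummy_na=True adds a 'city_nan' column flagging the missing rows.
with_na = pd.get_dummies(demo, columns=["city"], dummy_na=True)

print(default.columns.tolist())
print(with_na.columns.tolist())
```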

#### Check the column names

Use `train.columns`

In [113]:
train.columns

Index(['msno', 'is_churn', 'bd', 'registration_init_time', 'expiration_date',
       'registration_year', 'registration_month', 'registration_day',
       'expiration_year', 'expiration_month', 'expiration_day',
       'payment_plan_days', 'plan_list_price', 'actual_amount_paid',
       'transaction_date', 'membership_expire_date', 'trans_year',
       'trans_month', 'trans_day', 'trans_expiration_year',
       'trans_expiration_month', 'trans_expiration_day', 'discount_value',
       'amount_per_day', 'day_from_last', 'trans_count', 'cancel_count',
       'discount_freq', 'renew_freq', 'discount_std', 'cancel_freq',
       'price_increase', 'city_1.0', 'city_3.0', 'city_4.0', 'city_5.0',
       'city_6.0', 'city_7.0', 'city_8.0', 'city_9.0', 'city_10.0',
       'city_11.0', 'city_12.0', 'city_13.0', 'city_14.0', 'city_15.0',
       'city_16.0', 'city_17.0', 'city_18.0', 'city_19.0', 'city_20.0',
       'city_21.0', 'city_22.0', 'gender_Unknown', 'gender_female',
       'gender_male', 