Hello World!
This notebook describes the decision-tree-based machine learning model I created
to segment the users of the Habits app.

# Looking around the data set and getting all columns into the required format

In [42]:
# Import the required modules
import pandas as pd
import numpy as np
import scipy as sp

In [3]:
# Read in the user data file.
# The parse_dates argument takes a list of columns that are to be parsed as dates.
user_data_raw = pd.read_csv("janacare_user-engagement_Aug2014-Apr2016.csv", parse_dates = [-3,-2,-1])

In [4]:
# data metrics
user_data_raw.shape # rows, columns

(372, 19)

In [5]:
# data metrics
user_data_raw.dtypes # data type of each column

user_id                                                        float64
num_modules_consumed                                           float64
num_glucose_tracked                                            float64
num_of_days_steps_tracked                                      float64
num_of_days_food_tracked                                       float64
num_of_days_weight_tracked                                     float64
insulin_a1c_count                                              float64
cholesterol_count                                              float64
hemoglobin_count                                               float64
watching_videos (binary - 1 for yes, blank/0 for no)           float64
weight                                                         float64
height                                                           int64
bmi                                                              int64
age                                                              int64
gender

As the data type of the last column shows, pandas does not recognise it as a date type.
That would make later steps difficult, so I delete this column and add a new one,
since the data in *age_on_platform* can be recreated from the *last_activity* and *first_login* columns.

In [6]:
# Drop the last column (axis is passed by keyword; newer pandas no longer accepts it positionally)
user_data_del_last_col = user_data_raw.drop("age_on_platform", axis=1)

In [7]:
# Check that the column has been deleted. The number of columns changed from 19 to 18.
user_data_del_last_col.shape

(372, 18)

In [8]:
# Copy data frame 'user_data_del_last_col' into a new one (plain assignment would only create an alias)
user_data = user_data_del_last_col.copy()

In [9]:
# Create new column 'age_on_platform' holding last_activity - first_login as a timedelta
user_data["age_on_platform"] = user_data_del_last_col["last_activity"]-user_data_del_last_col["first_login"]

In [13]:
# Check the result in first few rows
user_data["age_on_platform"].head(5)

0   151 days
1   129 days
2   211 days
3   235 days
4     3 days
Name: age_on_platform, dtype: timedelta64[ns]

#### The column name *watching_videos (binary - 1 for yes, blank/0 for no)* is too long and has special characters, so let's change it to *watching_videos*. Usage: `df = df.rename(columns={'two': 'new_name'})`

In [52]:
user_data = user_data.rename(columns = {'watching_videos (binary - 1 for yes, blank/0 for no)':'watching_videos'})

In [53]:
# Some basic statistical information on the data
user_data.describe()

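|       | user_id | num_modules_consumed | num_glucose_tracked | num_of_days_steps_tracked | num_of_days_food_tracked | num_of_days_weight_tracked | insulin_a1c_count | cholesterol_count | hemoglobin_count | watching_videos | weight | height | bmi | age | has_diabetes | age_on_platform |
|-------|---------|----------------------|---------------------|---------------------------|--------------------------|----------------------------|-------------------|-------------------|------------------|-----------------|--------|--------|-----|-----|--------------|-----------------|
| count | 371.0 | 69.0 | 91.0 | 120.0 | 78.0 | 223.0 | 47.0 | 15.0 | 0.0 | 97.0 | 372.0 | 372.0 | 372.0 | 372.0 | 39.0 | 302 |
| mean  | 13850.74124 | 12.072464 | 17.769231 | 53.433333 | 29.576923 | 3.210762 | 5.170213 | 4.733333 | NaN | 1.0 | 72.074597 | 169.306452 | 25.325269 | 49.223118 | 0.512821 | 142 days 21:32:11.125827 |
| std   | 12773.298 | 13.693406 | 38.881894 | 80.690792 | 47.019344 | 4.490778 | 12.694263 | 1.709915 | NaN | 0.0 | 14.744092 | 16.112564 | 5.194763 | 13.487788 | 0.50637 | 168 days 20:03:46.790780 |
| min   | 4288.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | NaN | 1.0 | 40.0 | 120.0 | 5.0 | 11.0 | 0.0 | -300 days +00:00:00 |
| 25%   | 6075.5 | 3.0 | 2.0 | 6.75 | 2.0 | 1.0 | 1.0 | 4.0 | NaN | 1.0 | 62.0 | 162.0 | 22.0 | 39.0 | 0.0 | 28 days 00:00:00 |
| 50%   | 7462.0 | 8.0 | 5.0 | 19.5 | 10.5 | 2.0 | 2.0 | 4.0 | NaN | 1.0 | 70.0 | 167.0 | 25.0 | 49.5 | 1.0 | 108 days 00:00:00 |
| 75%   | 15258.0 | 15.0 | 12.5 | 65.0 | 31.5 | 3.0 | 3.0 | 5.0 | NaN | 1.0 | 80.0 | 172.0 | 27.0 | 60.0 | 1.0 | 234 days 12:00:00 |
| max   | 49766.0 | 78.0 | 260.0 | 469.0 | 229.0 | 40.0 | 78.0 | 10.0 | NaN | 1.0 | 165.0 | 349.0 | 56.0 | 77.0 | 1.0 | 667 days 00:00:00 |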


# Data Clean up

In the last section of looking around, I saw that a lot of rows do not have any values or have garbage values.
This can cause errors when computing anything using the values in these rows, hence a clean up is required.

We will clean up only those columns that are used as features.

* **num_modules_consumed**
* **num_glucose_tracked**
* **num_of_days_food_tracked**
* **watching_videos**
* **first_login**
* **last_activity**

In [54]:
# Let's check the health of the data set
user_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 372 entries, 0 to 371
Data columns (total 19 columns):
user_id                       371 non-null float64
num_modules_consumed          69 non-null float64
num_glucose_tracked           91 non-null float64
num_of_days_steps_tracked     120 non-null float64
num_of_days_food_tracked      78 non-null float64
num_of_days_weight_tracked    223 non-null float64
insulin_a1c_count             47 non-null float64
cholesterol_count             15 non-null float64
hemoglobin_count              0 non-null float64
watching_videos               97 non-null float64
weight                        372 non-null float64
height                        372 non-null int64
bmi                           372 non-null int64
age                           372 non-null int64
gender                        372 non-null object
has_diabetes                  39 non-null float64
first_login                   372 non-null datetime64[ns]
last_activity                 302 non

The second column of the above listing gives the number of non-null values in each column.
The columns of interest to us are sparsely populated,
e.g. *num_modules_consumed* has ONLY 69 values out of a possible 372.
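
The same counts can also be computed directly rather than read off `info()`. Below is a minimal sketch on a small made-up frame; the name `toy` and its values are assumptions for illustration, not the real data set.

```python
import numpy as np
import pandas as pd

# Toy stand-in for user_data; the values are made up for illustration.
toy = pd.DataFrame({
    "num_modules_consumed": [3.0, np.nan, np.nan, 8.0],
    "num_glucose_tracked": [np.nan, np.nan, 5.0, np.nan],
    "weight": [70.0, 62.0, 80.0, 72.0],
})

# Non-null count per column, and the share of rows that are missing.
missing_summary = pd.DataFrame({
    "non_null": toy.notnull().sum(),
    "missing_fraction": toy.isnull().mean(),
})
print(missing_summary)
```

Sorting `missing_summary` by `missing_fraction` is one way to see which feature columns need imputation most.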

In [75]:
# Let's remove every column that does not have to be imputed
user_data_to_impute = user_data.drop(["user_id", "watching_videos", "num_of_days_steps_tracked", "num_of_days_weight_tracked", "insulin_a1c_count", "weight", "height", "bmi", "age", "gender", "has_diabetes", "first_login", "last_activity", "age_on_platform", "hemoglobin_count", "cholesterol_count"], axis=1)

### The next 3 cells describe the steps to impute data using the KNN strategy. Sadly, this does not work well for our data set!

In [76]:
# Import Imputation method KNN
from fancyimpute import KNN

In [77]:
# First let's convert the pandas data frame into a NumPy array.
# (as_matrix() has been removed from pandas; to_numpy() is its replacement.)
user_data_to_impute_np_array = user_data_to_impute.to_numpy()
# Let's transpose it
user_data_to_impute_np_array_transposed = user_data_to_impute_np_array.T

In [78]:
# Usage: X_filled_knn = KNN(k=3).fit_transform(X_incomplete)
# (older fancyimpute releases named this method .complete)
user_data_imputed_knn_np_array = KNN(k=5).fit_transform(user_data_to_impute_np_array_transposed)

Computing pairwise distances between 3 samples
Computing distances for sample #1/3, elapsed time: 0.000
Imputing row 1/3 with 303 missing columns, elapsed time: 0.001


### The above 3 steps, KNN-based imputation, did not work well. As visible, 804 items could not be imputed and were replaced with zero.

In [79]:
# Let's use a simpler method that scikit-learn itself provides
# Import the class (called Imputer in older scikit-learn releases, SimpleImputer in sklearn.impute now)
from sklearn.impute import SimpleImputer

In [80]:
# Create an imputer object with the relevant parameters; each NaN is replaced by its column mean
imputer_object = SimpleImputer(missing_values=np.nan, strategy='mean', copy=False)

In [81]:
user_data_imputed_np_array = imputer_object.fit_transform(user_data_to_impute)

In [67]:
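from scipy import stats  # import the submodule explicitly; sp.stats is not always loaded by `import scipy`
stats.describe(user_data_imputed_np_array)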

DescribeResult(nobs=372, minmax=(array([ 1.,  1.,  1.]), array([  78.,  260.,  229.])), mean=array([ 12.07246377,  17.76923077,  29.57692308]), variance=array([  34.36829564,  366.74434999,  458.84916027]), skewness=array([ 5.86878165,  8.3052548 ,  5.28895823]), kurtosis=array([ 55.50017875,  86.17598001,  38.76391529]))

#### *user_data_imputed_np_array* is a NumPy array, so we need to convert it back to a pandas data frame
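
One way to do the conversion is to rebuild the frame with the original column names and row index. This is a minimal sketch on made-up values: `toy_to_impute` and `toy_imputed_np_array` are hypothetical stand-ins for `user_data_to_impute` and `user_data_imputed_np_array`, and the mean fill is done with pandas only to produce a filled array.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (made-up values) for user_data_to_impute and the imputed array.
toy_to_impute = pd.DataFrame({
    "num_modules_consumed": [3.0, np.nan, 9.0],
    "num_glucose_tracked": [np.nan, 4.0, 6.0],
    "num_of_days_food_tracked": [10.0, 20.0, np.nan],
})
# Column-mean fill done with pandas here, only to obtain a filled NumPy array.
toy_imputed_np_array = toy_to_impute.fillna(toy_to_impute.mean()).to_numpy()

# Rebuild a data frame, reusing the original column names and row index
# so the rows still line up with the rest of the data set.
toy_imputed = pd.DataFrame(
    toy_imputed_np_array,
    columns=toy_to_impute.columns,
    index=toy_to_impute.index,
)
print(toy_imputed)
```

Keeping the index matters because the imputer returns a bare array; without it, later joins would rely on row order alone.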

### Now let's add back the useful columns that we had removed from the data set. These are
* *last_activity*
* *first_login*
* *age_on_platform*
* *watching_videos*
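
A possible way to add these columns back is an index-aligned join. The sketch below uses made-up values; `toy_user_data`, `toy_imputed` and `columns_to_restore` are hypothetical names. Filling blank *watching_videos* entries with 0 follows the original column header ("1 for yes, blank/0 for no").

```python
import numpy as np
import pandas as pd

# Toy stand-ins (made-up values): the full data set and the imputed features.
toy_user_data = pd.DataFrame({
    "first_login": pd.to_datetime(["2015-01-01", "2015-06-01"]),
    "last_activity": pd.to_datetime(["2015-03-01", "2015-06-04"]),
    "watching_videos": [1.0, np.nan],
})
toy_user_data["age_on_platform"] = (
    toy_user_data["last_activity"] - toy_user_data["first_login"]
)
toy_imputed = pd.DataFrame(
    {"num_modules_consumed": [3.0, 3.0]}, index=toy_user_data.index
)

# Columns to bring back; join aligns the two frames on the row index.
columns_to_restore = ["last_activity", "first_login",
                      "age_on_platform", "watching_videos"]
toy_features = toy_imputed.join(toy_user_data[columns_to_restore])

# Blank means "no" for watching_videos, so fill the gaps with 0.
toy_features["watching_videos"] = toy_features["watching_videos"].fillna(0)
print(toy_features.columns.tolist())
```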

# Labelling the Raw data

Now comes the code that labels the provided data according to the rules below, so that it can be used as training data for the classifier.

This table defines the set of rules used to assign labels to the training data.

| label               | age_on_platform      | last_activity             | num_modules_consumed        | num_of_days_food_tracked | num_glucose_tracked         | watching_videos  |
|---------------------|----------------------|---------------------------|-----------------------------|--------------------------|-----------------------------|------------------|
| Generic (ignore)    | Converted to days    | to be Measured from 16Apr | Good >= 3/week Bad < 3/week | Good >= 30 Bad < 30      | Good >= 4/week Bad < 4/week | Good = 1 Bad = 0 |
| good_new_user       | >= 30 days && < 180  | <= 2 days                 | >= 12                       | >= 20                    | >= 16                       | Good = 1         |
| bad_new_user        | >= 30 days && < 180  | > 2 days                  | < 12                        | < 20                     | < 16                        | Bad = 0          |
| good_mid_term_user  | >= 180 days && < 360 | <= 7 days                 | >= 48                       | >= 30                    | >= 96                       | Good = 1         |
| bad_mid_term_user   | >= 180 days && < 360 | > 7 days                  | < 48                        | < 30                     | < 96                        | Bad = 0          |
| good_long_term_user | >= 360 days          | <= 14 days                | >= 48                       | >= 30                    | >= 192                      | Good = 1         |
| bad_long_term_user  | >= 360 days          | > 14 days                 | < 48                        | < 30                     | < 192                       | Bad = 0          |
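
The table can be turned into a small labelling function. The sketch below is one possible reading, not the final implementation: the names `RULES` and `label_user` are hypothetical, and it assumes that a user is "good" only when every criterion of their age band is met and "bad" otherwise, since the table does not say how mixed cases should be handled. Users younger than 30 days are not covered by the table and get no label.

```python
# Thresholds per age band, copied from the rules table:
# (min_age_days, max_age_days, max_days_inactive, min_modules, min_food_days, min_glucose)
RULES = {
    "new_user": (30, 180, 2, 12, 20, 16),
    "mid_term_user": (180, 360, 7, 48, 30, 96),
    "long_term_user": (360, None, 14, 48, 30, 192),
}


def label_user(age_days, days_inactive, modules, food_days, glucose, watching_videos):
    """Return a label such as 'good_new_user', or None if no age band applies."""
    for band, (min_age, max_age, max_inactive,
               min_modules, min_food, min_glucose) in RULES.items():
        in_band = age_days >= min_age and (max_age is None or age_days < max_age)
        if not in_band:
            continue
        # Assumption: "good" requires every criterion; anything else is "bad".
        is_good = (
            days_inactive <= max_inactive
            and modules >= min_modules
            and food_days >= min_food
            and glucose >= min_glucose
            and watching_videos == 1
        )
        return ("good_" if is_good else "bad_") + band
    return None  # younger than 30 days: not covered by the table


print(label_user(45, 1, 14, 25, 20, 1))      # good_new_user
print(label_user(200, 10, 50, 40, 100, 1))   # bad_mid_term_user (inactive for more than 7 days)
print(label_user(10, 0, 1, 1, 1, 0))         # None
```

Here `days_inactive` stands for the number of days between the last activity and the reference date in the table; computing it from *last_activity* is left out of the sketch.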

In [None]:
# Age thresholds in days, taken from the rules table above
one_month = 30
six_month = 180

## There are empty rows in *last_activity* and *first_login*! Correct that first.
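
The `info()` output earlier shows only 302 non-null *last_activity* values out of 372, and `describe()` reports a minimum *age_on_platform* of -300 days, so some rows need attention before labelling. A minimal sketch of one option, dropping such rows, on made-up values (`toy` and `toy_clean` are hypothetical names; whether to drop or repair these rows is a separate decision):

```python
import pandas as pd

# Toy stand-in for user_data (made-up values); None becomes NaT, a missing date.
toy = pd.DataFrame({
    "first_login": pd.to_datetime(["2015-01-01", "2015-06-01", "2015-09-01"]),
    "last_activity": pd.to_datetime(["2015-03-01", None, "2015-08-01"]),
})
toy["age_on_platform"] = toy["last_activity"] - toy["first_login"]

# Keep only rows with both dates present and a non-negative age.
has_dates = toy["first_login"].notnull() & toy["last_activity"].notnull()
non_negative = toy["age_on_platform"] >= pd.Timedelta(0)
toy_clean = toy[has_dates & non_negative]
print(toy_clean)
```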

In [None]:
# Label users by age on platform: 1 = 30 to 179 days, 3 = 180 to 359 days, 5 = 360 days or more.
# Assigning to `row` inside iterrows() does not change the data frame, so write through .loc instead.
for index, row in user_data.iterrows():
    if pd.isnull(row["age_on_platform"]):
        continue  # no age available for this user
    age_in_days = int(row["age_on_platform"] / np.timedelta64(1, 'D'))
    if one_month <= age_in_days < six_month:
        user_data.loc[index, "label"] = 1
    elif 180 <= age_in_days < 360:
        user_data.loc[index, "label"] = 3
    elif age_in_days >= 360:
        user_data.loc[index, "label"] = 5

In [None]:
# Same labelling, comparing the timedeltas directly instead of converting to days first.
for index, row in user_data.iterrows():
    age = row["age_on_platform"]
    if pd.isnull(age):
        continue  # no age available for this user
    if age >= np.timedelta64(30, 'D') and age < np.timedelta64(180, 'D'):
        user_data.loc[index, "label"] = 1
    elif age >= np.timedelta64(180, 'D') and age < np.timedelta64(360, 'D'):
        user_data.loc[index, "label"] = 3
    elif age >= np.timedelta64(360, 'D'):
        user_data.loc[index, "label"] = 5

In [None]:
(user_data["age_on_platform"].head(1) / np.timedelta64(1, 'D')).astype(int)

In [None]:
np.timedelta64(180, 'D')
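
The row-by-row loops above can also be written as a single vectorised step. The sketch below uses `pd.cut` on the age in days, with made-up values; the names `age_on_platform`, `age_in_days` and `label` are local to the example, and the codes 1, 3 and 5 mirror the ones used in the loops.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the age_on_platform column (made-up values, NaN = missing).
age_on_platform = pd.to_timedelta(pd.Series([151, 211, 3, 400, np.nan]), unit="D")

# Whole days as numbers; missing ages become NaN.
age_in_days = age_on_platform.dt.days

# Half-open bins [30, 180), [180, 360), [360, inf) mapped to labels 1, 3, 5.
# Ages below 30 days and missing ages fall outside every bin and stay unlabelled.
label = pd.cut(
    age_in_days,
    bins=[30, 180, 360, np.inf],
    labels=[1, 3, 5],
    right=False,
)
print(label)
```

With `right=False` each bin includes its lower edge, which matches the `>=` and `<` comparisons in the loops.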