##### Modeling section: fit a random forest or decision tree to show which features contribute most to the model's predictive power; the result should align with the chi-squared and ANOVA tests. The model is also needed to test whether it underfits or overfits the data.

In [1]:
# Import packages 
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt

# Machine Learning Predictive Model packages
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, QuantileTransformer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix
from sklearn.linear_model import LogisticRegression 

In [2]:
khan_data = pd.read_csv('./data/kdata_final.csv')

In [3]:
khan_data.head()

Unnamed: 0.1,Unnamed: 0,timestamp,user_id,session_id,country,language,user_registered_flag,device_type,KA_app_flag,OS,...,conversion,returned_user,returner,lang_filter,lang_encode,country_filter,country_encode,registered_user,lang_3,country_3
0,0,2016-02-18 18:05:34.408245 UTC,461023995001001,7269247775762971847,US,en,True,desktop,False,Windows,...,login,1,Yes,eng,1,USA,1,1,eng,us
1,1,2016-02-18 18:05:35.156166 UTC,461023995001001,7269247775762971847,US,en,True,desktop,False,Windows,...,homepage_view,1,Yes,eng,1,USA,1,1,eng,us
2,2,2016-02-18 18:05:44.033396 UTC,461023995001001,7269247775762971847,US,en,True,desktop,False,Windows,...,pageview,1,Yes,eng,1,USA,1,1,eng,us
3,3,2016-02-18 18:06:39.681943 UTC,461023995001001,7269247775762971847,US,en,True,desktop,False,Windows,...,pageview,1,Yes,eng,1,USA,1,1,eng,us
4,4,2016-02-18 18:06:55.040427 UTC,461023995001001,7269247775762971847,US,en,True,desktop,False,Windows,...,pageview,1,Yes,eng,1,USA,1,1,eng,us


In [4]:
khan_data = khan_data.drop(['Unnamed: 0'], axis =1)

In [5]:
# Drop Datetime features and user & session id for predictive modeling 
khan_data = khan_data.drop(['timestamp', 'user_id', 'session_id'], axis =1)

In [6]:
# Create encoded dataframe w/ only numerical values
khan_encoded = khan_data.drop(['URI','country','language','user_registered_flag','conversion','returner','lang_filter','country_filter','lang_3','country_3'], axis=1)

In [7]:
# Generate dummies for features that are not encoded
khan_encoded = pd.get_dummies(khan_encoded, columns = ['device_type', 'KA_app_flag', 'OS'])
khan_encoded.head()

Unnamed: 0,returned_user,lang_encode,country_encode,registered_user,device_type_desktop,device_type_phone,device_type_tablet,device_type_unknown/other,KA_app_flag_False,KA_app_flag_True,OS_Android,OS_BlackBerry OS,OS_Chrome OS,OS_Linux,OS_Mac OS X,OS_Other,OS_Ubuntu,OS_Windows,OS_Windows Phone,OS_iOS
0,1,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
1,1,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
2,1,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
3,1,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
4,1,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0


In [8]:
khan_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31481 entries, 0 to 31480
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   country               31481 non-null  object
 1   language              31481 non-null  object
 2   user_registered_flag  31481 non-null  bool  
 3   device_type           31481 non-null  object
 4   KA_app_flag           31481 non-null  bool  
 5   OS                    31481 non-null  object
 6   URI                   26149 non-null  object
 7   conversion            31481 non-null  object
 8   returned_user         31481 non-null  int64 
 9   returner              31481 non-null  object
 10  lang_filter           31481 non-null  object
 11  lang_encode           31481 non-null  int64 
 12  country_filter        31481 non-null  object
 13  country_encode        31481 non-null  int64 
 14  registered_user       31481 non-null  int64 
 15  lang_3                31481 non-null

In [9]:
khan_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31481 entries, 0 to 31480
Data columns (total 20 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   returned_user              31481 non-null  int64
 1   lang_encode                31481 non-null  int64
 2   country_encode             31481 non-null  int64
 3   registered_user            31481 non-null  int64
 4   device_type_desktop        31481 non-null  uint8
 5   device_type_phone          31481 non-null  uint8
 6   device_type_tablet         31481 non-null  uint8
 7   device_type_unknown/other  31481 non-null  uint8
 8   KA_app_flag_False          31481 non-null  uint8
 9   KA_app_flag_True           31481 non-null  uint8
 10  OS_Android                 31481 non-null  uint8
 11  OS_BlackBerry OS           31481 non-null  uint8
 12  OS_Chrome OS               31481 non-null  uint8
 13  OS_Linux                   31481 non-null  uint8
 14  OS_Mac OS X           

### Building a Predictive Model

In [10]:
khan_data.returned_user.value_counts(normalize=True)*100

0     30.783647
1     22.585051
2     14.881992
3      8.954608
4      8.703663
11     4.713954
9      3.049458
5      2.760395
8      1.963089
6      0.625774
13     0.619421
7      0.358947
Name: returned_user, dtype: float64

In [11]:
# Group returner encode to 0 = non-returner, 1 = returner
khan_data['return'] = np.where(khan_data['returned_user'] >= 1, 1, 0)

In [12]:
khan_data['return'].value_counts(normalize=True)*100

1    69.216353
0    30.783647
Name: return, dtype: float64

Here we can see that 69% of the users are returning users and 31% are non-returning users.
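
This class balance sets a reference point for any classifier: always predicting the majority class would already be right about 69% of the time. A minimal sketch of that baseline, using illustrative labels that mimic the reported split rather than the actual data (the function name `majority_baseline_accuracy` is hypothetical):

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Illustrative labels with roughly the class balance reported above
# (69 returners, 31 non-returners); this is not the actual dataset.
example_labels = [1] * 69 + [0] * 31
baseline = majority_baseline_accuracy(example_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

Any model accuracy reported later is easier to judge when compared against this number.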

In [13]:
# This needs to be done on the encoded data set as well, before dropping the original returned_user column
khan_encoded['return'] = np.where(khan_encoded['returned_user'] >= 1, 1, 0)

In [14]:
# Generate y (target) 
y = khan_encoded['return'].values

In [15]:
# Drop target feature from encoded dataframe
khan_encoded = khan_encoded.drop(['returned_user','return'], axis =1)

In [16]:
khan_encoded.head(5)

Unnamed: 0,lang_encode,country_encode,registered_user,device_type_desktop,device_type_phone,device_type_tablet,device_type_unknown/other,KA_app_flag_False,KA_app_flag_True,OS_Android,OS_BlackBerry OS,OS_Chrome OS,OS_Linux,OS_Mac OS X,OS_Other,OS_Ubuntu,OS_Windows,OS_Windows Phone,OS_iOS
0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
1,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
2,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
3,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
4,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0


In [17]:
# Generate X (features) from khan_encoded dataframe 
X = khan_encoded.values

In [18]:
# Verify first row of X
X[0]

array([1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0])

In [19]:
# Verify first row of y
y[0]

1

In [20]:
# Split the data set: 80% Train, 20% Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [21]:
# Inspect the first row of the training set 
X_train[0]

array([1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0])

In [22]:
# Scaling is probably unnecessary since all features are binary (0/1),
# but because the mean of some features is important, a StandardScaler is fit on the training data and used to transform both splits.
scaler = StandardScaler() 
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [23]:
# We can see that the first row has been scaled 
X_train[0] 

array([ 0.24380766,  0.65959649,  0.45331577,  0.41223056, -0.24942466,
       -0.30768646, -0.01409178,  0.13820287, -0.13820287, -0.22211108,
       -0.01091501, -0.36683475, -0.07766619,  2.16891454, -0.01409178,
       -0.11940246, -1.09249126, -0.02183392, -0.32230976])

In [24]:
# Build Random Forest 
rf = RandomForestClassifier(n_estimators=1000)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

In [25]:
rf.score(X_test, y_test)

0.7478164205177068

In [26]:
print(cross_val_score(rf,X_test,np.ravel(y_test),cv=5))
print('Mean Cross Validated Score:',np.mean(cross_val_score(rf,X_test,np.ravel(y_test),cv=5)))

[0.75555556 0.7484127  0.74900715 0.74821287 0.74662431]
Mean Cross Validated Score: 0.7497212451302998
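
The cell above cross-validates on the test split and calls `cross_val_score` twice on a forest with no fixed seed, so the printed mean need not equal the mean of the printed array. A minimal sketch of one alternative, assuming synthetic stand-in data from `make_classification` and hypothetical variable names (`X_demo`, `rf_demo`, and so on): cross-validate once on the training split, then compare training and test accuracy as a rough check for overfitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption); replace with the notebook's X and y.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# A fixed random_state makes repeated runs reproducible.
rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)

# Cross-validate on the training split only, and compute the scores once.
cv_scores = cross_val_score(rf_demo, X_tr, y_tr, cv=5)

# Fit on the full training split, then score both splits.
rf_demo.fit(X_tr, y_tr)
train_acc = rf_demo.score(X_tr, y_tr)
test_acc = rf_demo.score(X_te, y_te)

print("CV scores:", cv_scores)
print("Mean CV accuracy:", cv_scores.mean())
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

A training accuracy far above the cross-validated and test accuracies would suggest overfitting; low accuracy on all three would suggest underfitting.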


In [27]:
df_importance = pd.DataFrame(zip(list(khan_encoded.columns),rf.feature_importances_),
                             index=range(khan_encoded.columns.shape[0]),columns=['feature','importance'])
df_importance.sort_values(by='importance',ascending=False)

Unnamed: 0,feature,importance
2,registered_user,0.526168
0,lang_encode,0.094701
1,country_encode,0.090661
13,OS_Mac OS X,0.039429
9,OS_Android,0.031251
16,OS_Windows,0.029851
18,OS_iOS,0.027905
11,OS_Chrome OS,0.027064
5,device_type_tablet,0.022487
12,OS_Linux,0.01853



As expected from our feature analysis, the most important features in predicting a return user (defined as returning after 4 hours or more) are:

    registered_user, language, country, OS, and device type

The first three have the highest importance, at approximately 0.53, 0.09, and 0.09 respectively.

In conclusion, these results match our results from the Chi-Squared and ANOVA tests.
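
Because OS and device type are spread across several dummy columns, their importance is easier to compare with the single-column features once the dummy columns are summed per original feature. A minimal sketch, assuming only the ten rows shown in the table above (the remaining columns are omitted, so the totals are lower bounds) and a hypothetical helper `group_importances`:

```python
from collections import defaultdict


def group_importances(importances, prefixes):
    """Sum importances of dummy columns that share an original-feature prefix."""
    grouped = defaultdict(float)
    for name, value in importances.items():
        for prefix in prefixes:
            if name.startswith(prefix + "_"):
                grouped[prefix] += value
                break
        else:
            # Not a dummy column: keep the feature under its own name.
            grouped[name] += value
    return dict(grouped)


# Only the ten rows displayed above; columns not shown are left out.
shown = {
    "registered_user": 0.526168,
    "lang_encode": 0.094701,
    "country_encode": 0.090661,
    "OS_Mac OS X": 0.039429,
    "OS_Android": 0.031251,
    "OS_Windows": 0.029851,
    "OS_iOS": 0.027905,
    "OS_Chrome OS": 0.027064,
    "device_type_tablet": 0.022487,
    "OS_Linux": 0.01853,
}

totals = group_importances(shown, prefixes=["OS", "device_type", "KA_app_flag"])
for feature, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}: {total:.6f}")
```

Summed this way, the displayed OS columns alone outweigh either language or country, which supports listing OS among the important features.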

In [28]:
# Extra metrics for the model 
precision_recall_fscore_support(y_test,y_pred)

(array([0.70177268, 0.75608842]),
 array([0.34075949, 0.93382693]),
 array([0.45875937, 0.83561077]),
 array([1975, 4322]))

In [29]:
# Apply weighted metrics
precision_recall_fscore_support(y_test,y_pred,average='weighted')

(0.7390527561315842, 0.7478164205177068, 0.7174145612619851, None)

In [30]:
# Generate confusion matrix
confusion_matrix(y_test,y_pred)

array([[ 673, 1302],
       [ 286, 4036]])
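
The accuracy, precision, and recall reported above can be recomputed by hand from this confusion matrix. A short worked check, assuming scikit-learn's convention that rows are true classes and columns are predicted classes (the variable names are illustrative):

```python
# Confusion matrix from the cell above: rows are true classes,
# columns are predicted classes.
tn, fp = 673, 1302   # true class 0 (non-returner)
fn, tp = 286, 4036   # true class 1 (returner)

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Class 1 (returner)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)

# Class 0 (non-returner)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy        = {accuracy:.4f}")
print(f"class 1: precision = {precision_pos:.4f}, recall = {recall_pos:.4f}")
print(f"class 0: precision = {precision_neg:.4f}, recall = {recall_neg:.4f}")
```

The low recall for class 0 shows that most non-returners are predicted as returners, which the overall accuracy alone does not reveal.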

An accuracy of 75% is a good start. Next we try a logistic regression model with hyperparameter tuning, building on the Random Forest model.

A logistic regression is used because the model's features are categorical and because its output coefficients measure how each feature changes the odds of being a return user.

In [31]:
# Initialize logistic model
logit = LogisticRegression(solver = 'lbfgs')

#set parameter grid
param_grid = {'C':np.arange(0.5,5.1,0.1)}

#instantiate and fit grid search object
grid = GridSearchCV(logit,param_grid,cv=5)
grid.fit(X_train,np.ravel(y_train))


GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid={'C': array([0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7,
       1.8, 1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3. ,
       3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4. , 4.1, 4.2, 4.3,
       4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5. ])})

In [32]:
grid.best_estimator_.C

0.5

In [33]:
# Find coefficients from the logistic regression, refit with the best C from the grid search
logit_coef = LogisticRegression(C=grid.best_estimator_.C)
logit_coef.fit(X_train, y_train)
df_coef = pd.DataFrame(zip(list(khan_encoded.columns), logit_coef.coef_[0]),
                       index=range(khan_encoded.columns.shape[0]),
                       columns=['feature', 'coefficient'])
df_coef.sort_values(by='coefficient', ascending=False)

Unnamed: 0,feature,coefficient
2,registered_user,0.645401
15,OS_Ubuntu,0.225816
5,device_type_tablet,0.220803
16,OS_Windows,0.154175
0,lang_encode,0.149767
17,OS_Windows Phone,0.14847
4,device_type_phone,0.114831
13,OS_Mac OS X,0.099133
8,KA_app_flag_True,0.024563
7,KA_app_flag_False,-0.024563


In [34]:
print(np.mean(cross_val_score(grid, X_test, np.ravel(y_test), cv=5)))

0.7365392034494497


The logistic regression has a prediction accuracy of about 74%.

Because this accuracy is similar to that of the random forest model, we treat the coefficients from the logistic regression as a reasonable basis for interpretation.

Note: A Tobit regression could be used instead of a logit; this needs to be verified.

### Conclusion 

The Random Forest model predicted a returner with 75% accuracy and ranked features by importance. The most important features were registered user, language, and country. Registered user is defined as having a registered account with Khan Academy. Language in the model is defined as being English or not. Country in the model is defined as being from the US or not.

Country had 77 unique values. The USA accounts for 47% of the sample, followed by Canada at 5%. Language, similarly, had 11 unique values, with English comprising 94% of the sample.

Next, a logistic regression was used to derive interpretable predictions of a return user. The regression had an accuracy of 74% and can therefore be used hand in hand with the Random Forest model.

Logistic results on features of importance:
registered_user = 0.645401, lang_encode = 0.149767, country_encode = -0.201253
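
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. A minimal sketch using the three coefficients listed above; note that the features were standardized before fitting, so each ratio corresponds to a one-standard-deviation increase in the feature rather than a switch from 0 to 1:

```python
import math

# Coefficients reported above (log-odds per one standard deviation,
# because the features were passed through StandardScaler).
coefficients = {
    "registered_user": 0.645401,
    "lang_encode": 0.149767,
    "country_encode": -0.201253,
}

# exp(coefficient) converts a log-odds change into an odds ratio.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} the odds of returning)")
```

A ratio above 1 means the feature is associated with higher odds of returning; a ratio below 1, as for `country_encode`, means lower odds.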