# Distributed Random Forest

Distributed Random Forest (DRF) is a powerful classification and regression tool. Given a set of data, DRF generates a forest of classification or regression trees rather than a single tree. Each tree is a weak learner built on a subset of rows and columns, and adding more trees reduces the variance. For both classification and regression, the final prediction is the average of the predictions over all trees, whether the target is a class or a numeric value. (Note: For a categorical response column, DRF maps factors (e.g. 'dog', 'cat', 'mouse') in lexicographic order to a name lookup array with integer indices (e.g. 'cat' -> 0, 'dog' -> 1, 'mouse' -> 2).)
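
The label mapping and the averaging step can be illustrated without H2O. The following is a minimal sketch in plain Python, not H2O's implementation: the per-tree outputs are made-up numbers and `average_prediction` is a hypothetical helper introduced only for this example.

```python
# Illustrative only: mimics the lexicographic label mapping and the
# averaging of per-tree outputs; this is not H2O code.
labels = ["dog", "cat", "mouse"]

# Sort the factor levels lexicographically and assign integer indices.
domain = sorted(labels)
index_of = {level: i for i, level in enumerate(domain)}
print(index_of)  # {'cat': 0, 'dog': 1, 'mouse': 2}


def average_prediction(tree_outputs):
    """Average per-tree outputs (plain numbers or per-class probability lists)."""
    n = len(tree_outputs)
    if isinstance(tree_outputs[0], (int, float)):
        return sum(tree_outputs) / n
    return [sum(col) / n for col in zip(*tree_outputs)]


# Regression: average the numeric outputs of three hypothetical trees.
regression_pred = average_prediction([10.0, 12.0, 14.0])

# Classification: average per-class probabilities, then pick the largest.
class_probs = average_prediction([
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.6, 0.1, 0.3],
])
predicted_label = domain[class_probs.index(max(class_probs))]
print(regression_pred, predicted_label)
```

The class probabilities are listed in the same order as `domain`, which is why the index of the largest averaged probability can be mapped straight back to a label.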


References:

https://github.com/nikbearbrown/CSYE_7245/tree/master/H2O

https://www.statistik.uni-dortmund.de/useR-2008/slides/Strobl+Zeileis.pdf

https://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf

https://www.displayr.com/how-is-variable-importance-calculated-for-a-random-forest/

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/variable-importance.html

https://github.com/h2oai/h2o-2

https://aichamp.wordpress.com/2017/04/11/working-with-variable-importance-data-with-models-in-h2o/

#### Dataset : [Metro Bike Share Trip Data](https://www.kaggle.com/cityofLA/los-angeles-metro-bike-share-trip-data)

**Context: This is a dataset hosted by the city of Los Angeles. The organization has an open data platform and updates its information according to the amount of data that is brought in.**  

  
| Data Type     | Number of Columns|
|:------------- |:---------------- |
| Numeric       |10                |
| Date Type     |2                 |
| String        |4                 |


In [1]:
import h2o
from h2o.automl import H2OAutoML
import random, os, sys
from datetime import datetime
import pandas as pd
import logging
import csv
import optparse
import time
import json
from distutils.util import strtobool

In [2]:
reads = pd.read_csv("metro-bike-share-trip-data.csv")



In [3]:
reads.values.tolist()[2]

[1933383,
 300,
 '2016-07-07T10:32:00',
 '2016-07-07T10:37:00',
 3016.0,
 34.0528984,
 -118.24156,
 3016.0,
 34.0528984,
 -118.24156,
 5861.0,
 365.0,
 'Round Trip',
 'Flex Pass',
 "{'longitude': '-118.24156', 'latitude': '34.0528984', 'needs_recoding': False}",
 "{'longitude': '-118.24156', 'latitude': '34.0528984', 'needs_recoding': False}"]

In [4]:
reads.head()

Unnamed: 0,Trip ID,Duration,Start Time,End Time,Starting Station ID,Starting Station Latitude,Starting Station Longitude,Ending Station ID,Ending Station Latitude,Ending Station Longitude,Bike ID,Plan Duration,Trip Route Category,Passholder Type,Starting Lat-Long,Ending Lat-Long
0,1912818,180,2016-07-07T04:17:00,2016-07-07T04:20:00,3014.0,34.05661,-118.23721,3014.0,34.05661,-118.23721,6281.0,30.0,Round Trip,Monthly Pass,"{'longitude': '-118.23721', 'latitude': '34.05...","{'longitude': '-118.23721', 'latitude': '34.05..."
1,1919661,1980,2016-07-07T06:00:00,2016-07-07T06:33:00,3014.0,34.05661,-118.23721,3014.0,34.05661,-118.23721,6281.0,30.0,Round Trip,Monthly Pass,"{'longitude': '-118.23721', 'latitude': '34.05...","{'longitude': '-118.23721', 'latitude': '34.05..."
2,1933383,300,2016-07-07T10:32:00,2016-07-07T10:37:00,3016.0,34.052898,-118.24156,3016.0,34.052898,-118.24156,5861.0,365.0,Round Trip,Flex Pass,"{'longitude': '-118.24156', 'latitude': '34.05...","{'longitude': '-118.24156', 'latitude': '34.05..."
3,1944197,10860,2016-07-07T10:37:00,2016-07-07T13:38:00,3016.0,34.052898,-118.24156,3016.0,34.052898,-118.24156,5861.0,365.0,Round Trip,Flex Pass,"{'longitude': '-118.24156', 'latitude': '34.05...","{'longitude': '-118.24156', 'latitude': '34.05..."
4,1940317,420,2016-07-07T12:51:00,2016-07-07T12:58:00,3032.0,34.049889,-118.25588,3032.0,34.049889,-118.25588,6674.0,0.0,Round Trip,Walk-up,"{'longitude': '-118.25588', 'latitude': '34.04...","{'longitude': '-118.25588', 'latitude': '34.04..."


In [6]:
# Setting up the default parameters
data_path=None
all_variables=None
test_path=None
target='Passholder Type'
nthreads=1 
min_mem_size=6 
run_time=600
classification=False
scale=False
max_models=5   
model_path=None
balance_y=False 
balance_threshold=0.2
name=None 
server_path=None  
analysis=0

In [7]:
h2o.get_model

<function h2o.h2o.get_model>

In [8]:
# Functions

def alphabet(n):
  # Return a random string of length n drawn from the digits and ASCII letters in alpha.
  alpha='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'    
  s=''   # avoid shadowing the built-in str
  r=len(alpha)-1   
  while len(s)<n:
    i=random.randint(0,r)
    s+=alpha[i]   
  return s

def set_meta_data(analysis,run_id,server,data,test,model_path,target,run_time,classification,scale,model,balance,balance_threshold,name,path,nthreads,min_mem_size):
  m_data={}
  m_data['start_time'] = time.time()
  m_data['target']=target
  m_data['server_path']=server
  m_data['data_path']=data 
  m_data['test_path']=test
  m_data['max_models']=model
  m_data['run_time']=run_time
  m_data['run_id'] =run_id
  m_data['scale']=scale
  m_data['classification']=classification
  m_data['model_path']=model_path
  m_data['balance']=balance
  m_data['balance_threshold']=balance_threshold
  m_data['project'] =name
  m_data['end_time'] = time.time()
  m_data['execution_time'] = 0.0
  m_data['run_path'] =path
  m_data['nthreads'] = nthreads
  m_data['min_mem_size'] = min_mem_size
  m_data['analysis'] = analysis
  return m_data


def dict_to_json(dct,n):
  j = json.dumps(dct, indent=4)
  with open(n, 'w') as f:
    print(j, file=f)
  
  
def stackedensemble(mod):
    coef_norm=None
    try:
      metalearner = h2o.get_model(mod.metalearner()['name'])
      coef_norm=metalearner.coef_norm()
    except:
      pass        
    return coef_norm

def stackedensemble_df(df):
    bm_algo={ 'GBM': None,'GLM': None,'DRF': None,'XRT': None,'Dee': None}
    for index, row in df.iterrows():
      if len(row['model_id'])>3:
        key=row['model_id'][0:3]
        if key in bm_algo:
          if bm_algo[key] is None:
                bm_algo[key]=row['model_id']
    # keep only the families for which a model was found
    bm=[m for m in bm_algo.values() if m is not None]
    return bm

def se_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['auc']=modl.auc()   
    d['roc']=modl.roc()
    d['mse']=modl.mse()   
    d['null_degrees_of_freedom']=modl.null_degrees_of_freedom()
    d['null_deviance']=modl.null_deviance()
    d['residual_degrees_of_freedom']=modl.residual_degrees_of_freedom()   
    d['residual_deviance']=modl.residual_deviance()
    d['rmse']=modl.rmse()
    return d

def get_model_by_algo(algo,models_dict):
    mod=None
    mod_id=None    
    for m in list(models_dict.keys()):
        if m[0:3]==algo:
            mod_id=m
            mod=h2o.get_model(m)      
    return mod,mod_id     
    
    
def gbm_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['varimp']=modl.varimp()  
    return d
    
    
def dl_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['varimp']=modl.varimp()  
    return d
    
    
def drf_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['varimp']=modl.varimp()  
    d['roc']=modl.roc()      
    return d
    
def xrt_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['varimp']=modl.varimp()  
    d['roc']=modl.roc()      
    return d
    
    
def glm_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['coef']=modl.coef()  
    d['coef_norm']=modl.coef_norm()      
    return d
    
def model_performance_stats(perf):
    d={}
    try:    
      d['mse']=perf.mse()
    except:
      pass      
    try:    
      d['rmse']=perf.rmse() 
    except:
      pass      
    try:    
      d['null_degrees_of_freedom']=perf.null_degrees_of_freedom()
    except:
      pass      
    try:    
      d['residual_degrees_of_freedom']=perf.residual_degrees_of_freedom()
    except:
      pass      
    try:    
      d['residual_deviance']=perf.residual_deviance() 
    except:
      pass      
    try:    
      d['null_deviance']=perf.null_deviance() 
    except:
      pass      
    try:    
      d['aic']=perf.aic() 
    except:
      pass      
    try:
      d['logloss']=perf.logloss() 
    except:
      pass    
    try:
      d['auc']=perf.auc()
    except:
      pass  
    try:
      d['gini']=perf.gini()
    except:
      pass    
    return d
    
def impute_missing_values(df, x, scal=False):
    # determine column types
    ints, reals, enums = [], [], []
    for key, val in df.types.items():
        if key in x:
            if val == 'enum':
                enums.append(key)
            elif val == 'int':
                ints.append(key)            
            else: 
                reals.append(key)    
    _ = df[reals].impute(method='mean')
    _ = df[ints].impute(method='median')
    if scal:
        df[reals] = df[reals].scale()
        df[ints] = df[ints].scale()    
    return


def get_independent_variables(df, targ):
    C = [name for name in df.columns if name != targ]
    # determine column types
    ints, reals, enums = [], [], []
    for key, val in df.types.items():
        if key in C:
            if val == 'enum':
                enums.append(key)
            elif val == 'int':
                ints.append(key)            
            else: 
                reals.append(key)    
    x=ints+enums+reals
    return x
    
def get_all_variables_csv(path):
    # Row 0 of the CSV holds the column names, row 1 the corresponding types.
    ivd={}
    try:
      iv = pd.read_csv(path,header=None)
    except Exception:
      sys.exit(1)    
    rows=iv.values.tolist()
    for c, t in zip(rows[0], rows[1]):
      ivd[c.strip()]=t.strip()
    return ivd
    
    

def check_all_variables(df,dct,y=None):     
    targ=list(dct.keys())     
    for key, val in df.types.items():
        if key in targ:
          if dct[key] not in ['real','int','enum']:                      
            targ.remove(key)  
    for key, val in df.types.items():
        if key in targ:            
          if dct[key] != val:
            print('convert ',key,' ',dct[key],' ',val)
            if dct[key]=='enum':
                try:
                  df[key] = df[key].asfactor() 
                except:
                  targ.remove(key)                 
            if dct[key]=='int': 
                try:                
                  df[key] = df[key].asnumeric() 
                except:
                  targ.remove(key)                  
            if dct[key]=='real':
                try:                
                  df[key] = df[key].asnumeric()  
                except:
                  targ.remove(key)                  
    if y is None:
      y=df.columns[-1] 
    if y in targ:
      targ.remove(y)
    else:
      y=targ.pop()            
    return targ    
    
def predictions(mod,data,run_id):
    # Score the model passed in as `mod` (not the global mod_best) on the file at `data`.
    test = h2o.import_file(data)
    mod_perf=mod.model_performance(test)
              
    stats_test=model_performance_stats(mod_perf)

    n=run_id+'_test_stats.json'
    dict_to_json(stats_test,n) 

    try:    
      cf=mod_perf.confusion_matrix(metrics=["f1","f2","f0point5","accuracy","precision","recall","specificity","absolute_mcc","min_per_class_accuracy","mean_per_class_accuracy"])
      cf_df=cf[0].table.as_data_frame()
      cf_df.to_csv(run_id+'_test_confusion_matrix.csv')
    except:
      pass

    # use a local name that does not shadow this function
    preds = mod.predict(test)
    predictions_df=test.cbind(preds).as_data_frame() 
    predictions_df.to_csv(run_id+'_predictions.csv')
    return

def predictions_test(mod,test,run_id):
    # Same as predictions(), but `test` is already an H2OFrame; uses `mod`, not the global mod_best.
    mod_perf=mod.model_performance(test)          
    stats_test=model_performance_stats(mod_perf)
    n=run_id+'_test_stats.json'
    dict_to_json(stats_test,n) 
    try:
      cf=mod_perf.confusion_matrix(metrics=["f1","f2","f0point5","accuracy","precision","recall","specificity","absolute_mcc","min_per_class_accuracy","mean_per_class_accuracy"])
      cf_df=cf[0].table.as_data_frame()
      cf_df.to_csv(run_id+'_test_confusion_matrix.csv')
    except:
      pass
    preds = mod.predict(test)    
    predictions_df=test.cbind(preds).as_data_frame() 
    predictions_df.to_csv(run_id+'_predictions.csv')
    return preds

def check_X(x,df):
    # Build a new list; removing items from x while iterating over it can skip names.
    return [name for name in x if name in df.columns]
    
    
def get_stacked_ensemble(model_set):
    se=None
    for model in model_set:
      if 'BestOfFamily' in model:
        se=model
    if se is None:     
      for model in model_set:
        if 'AllModels'in model:
          se=model           
    return se       
#


def get_variables_types(df):
    d={}
    for key, val in df.types.items():
        d[key]=val           
    return d    
    
#  End Functions
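
A column filter such as `check_X` is safest when it builds a new list instead of removing items from the list it is iterating over. The sketch below shows the difference in plain Python; the two function names and the column names are hypothetical and chosen only for this illustration.

```python
def check_x_in_place(x, columns):
    """Remove-while-iterating pattern; it can skip the element after a removal."""
    for name in x:
        if name not in columns:
            x.remove(name)
    return x


def check_x_filtered(x, columns):
    """Builds a new list, so every name is tested exactly once."""
    return [name for name in x if name in columns]


columns = ["Duration", "Bike ID"]      # hypothetical frame columns
wanted = ["Foo", "Bar", "Duration"]    # 'Foo' and 'Bar' do not exist

in_place = check_x_in_place(list(wanted), columns)
filtered = check_x_filtered(list(wanted), columns)
print(in_place)   # ['Bar', 'Duration']  ('Bar' was never tested)
print(filtered)   # ['Duration']
```

After `'Foo'` is removed, the list shifts left and the loop's next position already points past `'Bar'`, so the invalid name survives in the in-place version.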

In [9]:
data_path='C:/Users/manan/Desktop/Engineering Management/Fall 2018/Independent Study/ANNUUR/metro-bike-share-trip-data.csv'

In [10]:
# all_variables='/Users/bear/Downloads/H2O/AML/data/logistic_regression_bin_class_ad_conversion_fields.csv'
# all_variables='/Users/bear/Downloads/H2O/AML/data/logistic_regression_bin_class_ad_conversion_fields_reg.csv'
all_variables=None

In [92]:
# classification=True

In [11]:
run_id=alphabet(9)
if server_path is None:
  server_path=os.path.abspath(os.curdir)
os.chdir(server_path) 
run_dir = os.path.join(server_path,run_id)
os.mkdir(run_dir)
os.chdir(run_dir)    

# run_id to std out
print (run_id) 

4CSKMCAaq
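
The run ID above is a 9-character string over a 62-symbol alphabet, and it doubles as the name of the run directory. A minimal sketch of the same idea follows, assuming a temporary directory in place of `server_path`; `make_run_id` is a hypothetical name, not a function from the notebook.

```python
import os
import random
import string
import tempfile

# Same 62 symbols, in the same order, as the notebook's alphabet() helper.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def make_run_id(n, rng=random):
    """Return n characters drawn uniformly from the 62-symbol alphabet."""
    return "".join(rng.choice(ALPHABET) for _ in range(n))


# A temporary directory stands in for server_path in this sketch.
with tempfile.TemporaryDirectory() as base_dir:
    run_id = make_run_id(9)
    run_dir = os.path.join(base_dir, run_id)
    os.makedirs(run_dir, exist_ok=False)  # fails loudly on an ID collision
    created = os.path.isdir(run_dir)

print(run_id, created)
```

Passing `exist_ok=False` keeps the behavior of `os.mkdir` in the cell above: a second run that happens to draw the same ID raises an error instead of silently reusing the directory.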


In [15]:
# 65535 Highest port no
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
  Starting server from C:\Users\manan\Anaconda3\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\manan\AppData\Local\Temp\tmpmi4g0fu5
  JVM stdout: C:\Users\manan\AppData\Local\Temp\tmpmi4g0fu5\h2o_Manan_Shukla_started_from_python.out
  JVM stderr: C:\Users\manan\AppData\Local\Temp\tmpmi4g0fu5\h2o_Manan_Shukla_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


0,1
H2O cluster uptime:,02 secs
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.20.0.9
H2O cluster version age:,2 months and 8 days
H2O cluster name:,H2O_from_python_Manan_Shukla_gsaci1
H2O cluster total nodes:,1
H2O cluster free memory:,3.530 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


In [12]:
# meta data
meta_data = set_meta_data(analysis, run_id,server_path,data_path,test_path,model_path,target,run_time,classification,scale,max_models,balance_y,balance_threshold,name,run_dir,nthreads,min_mem_size)
print(meta_data)  

{'start_time': 1544423992.7211893, 'target': 'Passholder Type', 'server_path': 'C:\\Users\\manan\\Desktop\\Engineering Management\\Fall 2018\\Independent Study\\ANNUUR', 'data_path': 'C:/Users/manan/Desktop/Engineering Management/Fall 2018/Independent Study/ANNUUR/metro-bike-share-trip-data.csv', 'test_path': None, 'max_models': 5, 'run_time': 600, 'run_id': '4CSKMCAaq', 'scale': False, 'classification': False, 'model_path': None, 'balance': False, 'balance_threshold': 0.2, 'project': None, 'end_time': 1544423992.7211893, 'execution_time': 0.0, 'run_path': 'C:\\Users\\manan\\Desktop\\Engineering Management\\Fall 2018\\Independent Study\\ANNUUR\\4CSKMCAaq', 'nthreads': 1, 'min_mem_size': 6, 'analysis': 0}
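
In the printed dictionary, `start_time` and `end_time` are identical and `execution_time` is `0.0`, because `set_meta_data` stamps both times when the dictionary is created. One way the timing fields could be filled in at the end of a run and then saved is sketched below; the dictionary contents and the file name are hypothetical.

```python
import json
import os
import tempfile
import time

# Hypothetical, trimmed-down metadata record.
meta = {"run_id": "example", "start_time": time.time(), "execution_time": 0.0}

# ... the work being timed would happen here ...

# Update the timing fields once the work is done.
meta["end_time"] = time.time()
meta["execution_time"] = meta["end_time"] - meta["start_time"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_meta_data.json")  # hypothetical file name
    with open(path, "w") as f:
        json.dump(meta, f, indent=4)
    with open(path) as f:
        restored = json.load(f)

print(restored["execution_time"])
```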


In [13]:
print(data_path)

C:/Users/manan/Desktop/Engineering Management/Fall 2018/Independent Study/ANNUUR/metro-bike-share-trip-data.csv


In [16]:
df = h2o.import_file(data_path)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [17]:
df.head()

Trip ID,Duration,Start Time,End Time,Starting Station ID,Starting Station Latitude,Starting Station Longitude,Ending Station ID,Ending Station Latitude,Ending Station Longitude,Bike ID,Plan Duration,Trip Route Category,Passholder Type,Starting Lat-Long,Ending Lat-Long
1912820.0,180,2016-07-07T04:17:00,2016-07-07T04:20:00,3014,34.0566,-118.237,3014,34.0566,-118.237,6281,30,Round Trip,Monthly Pass,"{'longitude': '-118.23721', 'latitude': '34.0566101', 'needs_recoding': False}","{'longitude': '-118.23721', 'latitude': '34.0566101', 'needs_recoding': False}"
1919660.0,1980,2016-07-07T06:00:00,2016-07-07T06:33:00,3014,34.0566,-118.237,3014,34.0566,-118.237,6281,30,Round Trip,Monthly Pass,"{'longitude': '-118.23721', 'latitude': '34.0566101', 'needs_recoding': False}","{'longitude': '-118.23721', 'latitude': '34.0566101', 'needs_recoding': False}"
1933380.0,300,2016-07-07T10:32:00,2016-07-07T10:37:00,3016,34.0529,-118.242,3016,34.0529,-118.242,5861,365,Round Trip,Flex Pass,"{'longitude': '-118.24156', 'latitude': '34.0528984', 'needs_recoding': False}","{'longitude': '-118.24156', 'latitude': '34.0528984', 'needs_recoding': False}"
1944200.0,10860,2016-07-07T10:37:00,2016-07-07T13:38:00,3016,34.0529,-118.242,3016,34.0529,-118.242,5861,365,Round Trip,Flex Pass,"{'longitude': '-118.24156', 'latitude': '34.0528984', 'needs_recoding': False}","{'longitude': '-118.24156', 'latitude': '34.0528984', 'needs_recoding': False}"
1940320.0,420,2016-07-07T12:51:00,2016-07-07T12:58:00,3032,34.0499,-118.256,3032,34.0499,-118.256,6674,0,Round Trip,Walk-up,"{'longitude': '-118.25588', 'latitude': '34.0498886', 'needs_recoding': False}","{'longitude': '-118.25588', 'latitude': '34.0498886', 'needs_recoding': False}"
1944080.0,780,2016-07-07T12:51:00,2016-07-07T13:04:00,3021,34.0456,-118.237,3054,34.0392,-118.236,6717,30,One Way,Monthly Pass,"{'longitude': '-118.23703', 'latitude': '34.0456085', 'needs_recoding': False}","{'longitude': '-118.23649', 'latitude': '34.0392189', 'needs_recoding': False}"
1944070.0,600,2016-07-07T12:54:00,2016-07-07T13:04:00,3022,34.0461,-118.233,3014,34.0566,-118.237,5721,30,One Way,Monthly Pass,"{'longitude': '-118.23309', 'latitude': '34.0460701', 'needs_recoding': False}","{'longitude': '-118.23721', 'latitude': '34.0566101', 'needs_recoding': False}"
1944070.0,600,2016-07-07T12:59:00,2016-07-07T13:09:00,3076,34.0406,-118.254,3005,34.0485,-118.259,5957,365,One Way,Flex Pass,"{'longitude': '-118.25384', 'latitude': '34.0405998', 'needs_recoding': False}","{'longitude': '-118.25905', 'latitude': '34.0485497', 'needs_recoding': False}"
1944060.0,2880,2016-07-07T13:01:00,2016-07-07T13:49:00,3031,34.0447,-118.252,3031,34.0447,-118.252,6137,365,Round Trip,Flex Pass,"{'longitude': '-118.25244', 'latitude': '34.0447006', 'needs_recoding': False}","{'longitude': '-118.25244', 'latitude': '34.0447006', 'needs_recoding': False}"
1944060.0,960,2016-07-07T13:01:00,2016-07-07T13:17:00,3031,34.0447,-118.252,3078,34.0643,-118.239,6351,30,One Way,Monthly Pass,"{'longitude': '-118.25244', 'latitude': '34.0447006', 'needs_recoding': False}","{'longitude': '-118.23894', 'latitude': '34.0642815', 'needs_recoding': False}"




In [18]:
df.describe()

Rows:132427
Cols:16




Unnamed: 0,Trip ID,Duration,Start Time,End Time,Starting Station ID,Starting Station Latitude,Starting Station Longitude,Ending Station ID,Ending Station Latitude,Ending Station Longitude,Bike ID,Plan Duration,Trip Route Category,Passholder Type,Starting Lat-Long,Ending Lat-Long
type,int,int,enum,enum,int,real,real,int,real,real,int,int,enum,enum,enum,enum
mins,1912818.0,60.0,,,3000.0,0.0,-118.472832,3000.0,0.0,-118.472832,1349.0,0.0,,,,
mean,11530012.457806952,1555.3015623702115,,,3043.0207540329893,34.0393087583038,-118.22153406446637,3042.3867196650804,34.03461425703101,-118.2066415739633,6193.618878240707,44.82196702136549,,,,
maxs,23794218.0,86400.0,,,4108.0,34.0642815,0.0,4108.0,34.0642815,0.0,6728.0,365.0,,,,
sigma,6369461.533031412,5814.241812583598,,,37.74202119985789,0.5293364348227377,1.838335432859621,43.08439999253781,0.6507059152410839,2.259912704392353,293.60675542048,90.41157881756699,,,,
zeros,0,0,,,0,32,32,0,48,48,0,41224,,,,
missing,0,0,0,0,19,48,48,96,1051,1051,10,766,0,0,33805,1051
0,1912818.0,180.0,2016-07-07T04:17:00,2016-07-07T04:20:00,3014.0,34.0566101,-118.23721,3014.0,34.0566101,-118.23721,6281.0,30.0,Round Trip,Monthly Pass,"{'longitude': '-118.23721', 'latitude': '34.0566101', 'needs_recoding': False}","{'longitude': '-118.23721', 'latitude': '34.0566101', 'needs_recoding': False}"
1,1919661.0,1980.0,2016-07-07T06:00:00,2016-07-07T06:33:00,3014.0,34.0566101,-118.23721,3014.0,34.0566101,-118.23721,6281.0,30.0,Round Trip,Monthly Pass,"{'longitude': '-118.23721', 'latitude': '34.0566101', 'needs_recoding': False}","{'longitude': '-118.23721', 'latitude': '34.0566101', 'needs_recoding': False}"
2,1933383.0,300.0,2016-07-07T10:32:00,2016-07-07T10:37:00,3016.0,34.0528984,-118.24156,3016.0,34.0528984,-118.24156,5861.0,365.0,Round Trip,Flex Pass,"{'longitude': '-118.24156', 'latitude': '34.0528984', 'needs_recoding': False}","{'longitude': '-118.24156', 'latitude': '34.0528984', 'needs_recoding': False}"


In [19]:
'''
import sys
sys.stdout = open('describe.txt', 'w')
print ('test')
'''

"\nimport sys\nsys.stdout = open('describe.txt', 'w')\nprint ('test')\n"


In [20]:
# dependent variable
# assign target and inputs for classification or regression
if target is None:
  target=df.columns[-1]   
y = target

In [21]:
print(y)

Passholder Type


In [22]:
print(all_variables)

None


In [23]:
if all_variables is not None:
  ivd=get_all_variables_csv(all_variables)
  print(ivd)    
  X=check_all_variables(df,ivd,y)
  print(X)


### Selecting the independent variables
* Imputing the missing values

In [25]:
# independent variables

X = []  
if all_variables is None:
  X=get_independent_variables(df, target)  
else: 
  ivd=get_all_variables_csv(all_variables)    
  X=check_all_variables(df, ivd, y)  # pass the target so it is excluded from X


X=check_X(X,df)


# Add independent variables

meta_data['X']=X  


# impute missing values

_=impute_missing_values(df,X, scale)
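
`impute_missing_values` fills real-valued columns with the column mean and integer columns with the column median. The rule can be sketched in plain Python as follows; the toy columns are made up, `None` stands in for a missing value, and `impute_column` is a hypothetical helper rather than part of the H2O API.

```python
from statistics import mean, median


def impute_column(values, method):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if method == "mean" else median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical toy columns; None marks a missing value.
latitude = [34.0, None, 34.5, 35.0]       # 'real' column -> mean
plan_duration = [0, 30, None, 365, 30]    # 'int' column  -> median

lat_filled = impute_column(latitude, "mean")
plan_filled = impute_column(plan_duration, "median")
print(lat_filled)
print(plan_filled)
```

The median is less sensitive than the mean to extreme values such as the 365-day plan, which is one plausible reason to prefer it for the integer columns.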

In [26]:
if analysis == 3:
  classification=False
elif analysis == 2:
  classification=True
elif analysis == 1:
  classification=True

In [27]:
print(classification)

False


In [28]:
if classification:
    df[y] = df[y].asfactor()

In [34]:
def check_y(y,df):
    # Return (True, type) when y is a real, int or enum column; otherwise (False, None),
    # so that the caller can always unpack two values.
    for key, val in df.types.items():
        if key == y and val in ['real','int','enum']:
            return True, val
    return False, None

In [35]:
ok,val=check_y(y,df)

In [36]:
if val=='enum':
    print(df[y].levels())

[['Flex Pass', 'Monthly Pass', 'Staff Annual', 'Walk-up']]


In [37]:
df.describe()

Rows:132427
Cols:16




Unnamed: 0,Trip ID,Duration,Start Time,End Time,Starting Station ID,Starting Station Latitude,Starting Station Longitude,Ending Station ID,Ending Station Latitude,Ending Station Longitude,Bike ID,Plan Duration,Trip Route Category,Passholder Type,Starting Lat-Long,Ending Lat-Long
type,int,int,enum,enum,real,real,real,real,real,real,real,real,enum,enum,enum,enum
mins,1912818.0,60.0,,,3000.0,0.0,-118.472832,3000.0,0.0,-118.472832,1349.0,0.0,,,,
mean,11530012.457806952,1555.3015623702115,,,3043.020754032988,34.039308758303825,-118.22153406446643,3042.386719665082,34.034614257031,-118.20664157396324,6193.618878240708,44.821967021365474,,,,
maxs,23794218.0,86400.0,,,4108.0,34.0642815,0.0,4108.0,34.0642815,0.0,6728.0,365.0,,,,
sigma,6369461.533031412,5814.241812583598,,,37.73931355888747,0.5292404927430479,1.8380022350803011,43.06878050591668,0.6481186051327203,2.2509269324066663,293.59566951973255,90.14971290953852,,,,
zeros,0,0,,,0,32,32,0,48,48,0,41224,,,,
missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33805,1051
0,1912818.0,180.0,2016-07-07T04:17:00,2016-07-07T04:20:00,3014.0,34.0566101,-118.23721,3014.0,34.0566101,-118.23721,6281.0,30.0,Round Trip,Monthly Pass,"{'longitude': '-118.23721', 'latitude': '34.0566101', 'needs_recoding': False}","{'longitude': '-118.23721', 'latitude': '34.0566101', 'needs_recoding': False}"
1,1919661.0,1980.0,2016-07-07T06:00:00,2016-07-07T06:33:00,3014.0,34.0566101,-118.23721,3014.0,34.0566101,-118.23721,6281.0,30.0,Round Trip,Monthly Pass,"{'longitude': '-118.23721', 'latitude': '34.0566101', 'needs_recoding': False}","{'longitude': '-118.23721', 'latitude': '34.0566101', 'needs_recoding': False}"
2,1933383.0,300.0,2016-07-07T10:32:00,2016-07-07T10:37:00,3016.0,34.0528984,-118.24156,3016.0,34.0528984,-118.24156,5861.0,365.0,Round Trip,Flex Pass,"{'longitude': '-118.24156', 'latitude': '34.0528984', 'needs_recoding': False}","{'longitude': '-118.24156', 'latitude': '34.0528984', 'needs_recoding': False}"


In [38]:
allV=get_variables_types(df)
allV

{'Bike ID': 'int',
 'Duration': 'int',
 'End Time': 'enum',
 'Ending Lat-Long': 'enum',
 'Ending Station ID': 'int',
 'Ending Station Latitude': 'real',
 'Ending Station Longitude': 'real',
 'Passholder Type': 'enum',
 'Plan Duration': 'int',
 'Start Time': 'enum',
 'Starting Lat-Long': 'enum',
 'Starting Station ID': 'int',
 'Starting Station Latitude': 'real',
 'Starting Station Longitude': 'real',
 'Trip ID': 'int',
 'Trip Route Category': 'enum'}

In [39]:
meta_data['variables']=allV

In [40]:
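train, test = df.split_frame([0.9])
# Note: this second call overwrites `train` with a roughly 30% split of df,
# so the 90% `train` from the line above is discarded.
train,x_test = df.split_frame([0.3])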

In [41]:
train.shape

(39652, 16)
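
The reported shape of `(39652, 16)` is close to, but not exactly, 30% of the 132,427 rows (which would be about 39,728). That is consistent with a split that assigns each row at random according to the ratio rather than cutting at an exact count. The sketch below illustrates such a split in plain Python; it is an illustration of the idea under that assumption, not H2O's internal algorithm, and `split_rows` and the seed are hypothetical.

```python
import random


def split_rows(n_rows, ratio, seed=None):
    """Assign each row index to the first part with probability `ratio`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in range(n_rows):
        (first if rng.random() < ratio else second).append(row)
    return first, second


n_rows = 132427                                     # row count from df.describe()
part_a, part_b = split_rows(n_rows, 0.3, seed=42)   # the seed is arbitrary

print(len(part_a), len(part_b), round(0.3 * n_rows))
```

With a random assignment the size of the first part varies from run to run around the expected count, so fixing a seed is the usual way to make a split reproducible.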

In [42]:
# Set up AutoML
aml = H2OAutoML(max_runtime_secs=run_time,project_name = name)

In [43]:
model_start_time = time.time()

In [44]:
aml.train(x=X,y=y,training_frame=df)

AutoML progress: |████████████████████████████████████████████████████████ (cancelled) 100%


H2OJobCancelled: Job<$03017f00000132d4ffffffff$_8b95919a60c2b36453f1d21f5f19497b> was cancelled by the user.

In [122]:
meta_data['model_execution_time'] = time.time() - model_start_time

In [123]:
# get leaderboard
aml_leaderboard_df=aml.leaderboard.as_data_frame()

In [124]:
aml_leaderboard_df

Unnamed: 0,model_id,mean_residual_deviance,rmse,mse,mae,rmsle
0,StackedEnsemble_AllModels_0_AutoML_20180911_165725,0.699153,0.836154,0.699153,0.584355,0.103589
1,GBM_grid_0_AutoML_20180911_165725_model_4,0.772541,0.878943,0.772541,0.613218,0.109507
2,GBM_grid_0_AutoML_20180911_165725_model_13,0.796443,0.892436,0.796443,0.656173,0.091876
3,StackedEnsemble_BestOfFamily_0_AutoML_20180911_165725,0.800052,0.894456,0.800052,0.618665,0.113842
4,GBM_grid_0_AutoML_20180911_165725_model_19,0.805763,0.897643,0.805763,0.619135,0.104288
5,GBM_grid_0_AutoML_20180911_165725_model_1,1.056889,1.028051,1.056889,0.746915,0.113832
6,GBM_grid_0_AutoML_20180911_165725_model_6,1.100144,1.048878,1.100144,0.725743,0.127928
7,GBM_grid_0_AutoML_20180911_165725_model_7,1.236186,1.111839,1.236186,0.757219,0.134491
8,GBM_grid_0_AutoML_20180911_165725_model_26,1.261679,1.123245,1.261679,0.776453,0.121459
9,GBM_grid_0_AutoML_20180911_165725_model_2,1.274768,1.129056,1.274768,0.776065,0.125758
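
In this leaderboard the `mean_residual_deviance` and `mse` columns are identical, and `rmse` is the square root of `mse`. A quick check of the second relationship on the first three rows, using shortened model labels and the printed six-decimal values:

```python
import math

# (shortened label, mse, rmse) copied from the first three leaderboard rows.
rows = [
    ("StackedEnsemble_AllModels_0", 0.699153, 0.836154),
    ("GBM_grid_0_model_4", 0.772541, 0.878943),
    ("GBM_grid_0_model_13", 0.796443, 0.892436),
]

checks = []
for name, mse, rmse in rows:
    recomputed = math.sqrt(mse)
    # Values are printed to six decimals, so allow a small tolerance.
    ok = math.isclose(recomputed, rmse, abs_tol=1e-5)
    checks.append(ok)
    print(name, round(recomputed, 6), ok)
```

Because one metric is a monotonic function of the other, sorting the leaderboard by `rmse` or by `mse` gives the same order.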


In [125]:
# Start with the best model (the first row of the leaderboard)

model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[0])


In [126]:
mod_best._id

'StackedEnsemble_AllModels_0_AutoML_20180911_165725'

In [None]:
model_set1

In [127]:
# Get stacked ensemble  
se=get_stacked_ensemble(model_set)

In [128]:
print(se)

StackedEnsemble_BestOfFamily_0_AutoML_20180911_165725


In [129]:
if se is not None:
  mod_best=h2o.get_model(se)
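
`get_stacked_ensemble` prefers a `BestOfFamily` ensemble and falls back to `AllModels` only when no `BestOfFamily` model exists, which is why `mod_best` changes here even though the `AllModels` ensemble ranks first on the leaderboard. The selection rule can be sketched on plain strings; `pick_stacked_ensemble` is a hypothetical name used only for this example.

```python
def pick_stacked_ensemble(model_ids):
    """Prefer a BestOfFamily ensemble; fall back to AllModels; else None."""
    for wanted in ("BestOfFamily", "AllModels"):
        matches = [m for m in model_ids if wanted in m]
        if matches:
            return matches[-1]  # like the original loop, keep the last match
    return None


# Model IDs taken from the leaderboard shown earlier.
model_ids = [
    "StackedEnsemble_AllModels_0_AutoML_20180911_165725",
    "GBM_grid_0_AutoML_20180911_165725_model_4",
    "StackedEnsemble_BestOfFamily_0_AutoML_20180911_165725",
]

chosen = pick_stacked_ensemble(model_ids)
none_found = pick_stacked_ensemble(["GBM_grid_0_AutoML_20180911_165725_model_4"])
print(chosen)
print(none_found)
```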

In [None]:
mod_best.confusion_matrix(data=train)

In [131]:
mod_best._id

'StackedEnsemble_BestOfFamily_0_AutoML_20180911_165725'

In [132]:
mod_best._get_metrics

<function h2o.model.model_base.ModelBase._get_metrics>

In [133]:
type(mod_best)

h2o.estimators.stackedensemble.H2OStackedEnsembleEstimator

In [134]:
mods=mod_best.coef_norm
print(mods)

Model Details
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_BestOfFamily_0_AutoML_20180911_165725
No model summary for this model


ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **

MSE: 0.2656240839673641
RMSE: 0.5153873145192498
MAE: 0.37835965719714737
RMSLE: 0.07721728464557322
R^2: 0.9903186553874367
Mean Residual Deviance: 0.2656240839673641
Null degrees of freedom: 157
Residual degrees of freedom: 152
Null deviance: 4334.997559366069
Residual deviance: 41.968605266843525
AIC: 252.9282128040255

ModelMetricsRegressionGLM: stackedensemble
** Reported on validation data. **

MSE: 0.4091217177636795
RMSE: 0.6396262328607852
MAE: 0.5050625669918949
RMSLE: 0.04683398783969261
R^2: 0.9840218140842899
Mean Residual Deviance: 0.4091217177636795
Null degrees of freedom: 41
Residual degrees of freedom: 36
Null deviance: 1083.943065834998
Residual deviance: 17.18311214607454
AIC: 95.6536489009044

ModelMetricsRegressionGLM: stackedensemb
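
The printed metrics are internally consistent in a way that is easy to verify. Assuming the null model uses one parameter, the number of observations is the null degrees of freedom plus one, and the MSE equals the residual deviance divided by that count. On the training data, R^2 also matches one minus the ratio of residual to null deviance. The numbers below are copied from the output above; the variable names are hypothetical.

```python
import math

# Numbers copied from the training and validation metrics printed above.
train_m = {"mse": 0.2656240839673641, "null_dof": 157,
           "null_dev": 4334.997559366069, "resid_dev": 41.968605266843525,
           "r2": 0.9903186553874367}
valid_m = {"mse": 0.4091217177636795, "null_dof": 41,
           "resid_dev": 17.18311214607454}

mse_ok = {}
for name, m in (("train", train_m), ("valid", valid_m)):
    n_obs = m["null_dof"] + 1                 # assumes a one-parameter null model
    mse_from_dev = m["resid_dev"] / n_obs
    mse_ok[name] = math.isclose(mse_from_dev, m["mse"], rel_tol=1e-9)
    print(name, n_obs, mse_from_dev, mse_ok[name])

# Training data only: R^2 = 1 - residual deviance / null deviance.
r2_from_dev = 1.0 - train_m["resid_dev"] / train_m["null_dev"]
r2_ok = math.isclose(r2_from_dev, train_m["r2"], rel_tol=1e-6)
print(r2_from_dev, r2_ok)
```

The same R^2 identity does not reproduce the printed validation value, so it should be read as a property of these training numbers rather than a general rule for every frame.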

In [136]:
bm=stackedensemble_df(aml_leaderboard_df)

In [137]:
bm

['GBM_grid_0_AutoML_20180911_165725_model_4',
 'GLM_grid_0_AutoML_20180911_165725_model_0',
 'DRF_0_AutoML_20180911_165725',
 'XRT_0_AutoML_20180911_165725',
 'DeepLearning_0_AutoML_20180911_165725']
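
`stackedensemble_df` builds this list by walking the leaderboard from best to worst and keeping the first model ID seen for each three-letter prefix (`'Dee'` stands for DeepLearning). Stacked ensembles start with `'Sta'`, which is not among the prefixes, so they are skipped. A minimal sketch of that rule on plain strings, with `best_per_family` as a hypothetical name and a shortened list of IDs:

```python
def best_per_family(model_ids, prefixes=("GBM", "GLM", "DRF", "XRT", "Dee")):
    """Return the first model ID seen for each three-letter prefix."""
    first_seen = {p: None for p in prefixes}
    for model_id in model_ids:            # assumed to be sorted best-first
        key = model_id[:3]
        if key in first_seen and first_seen[key] is None:
            first_seen[key] = model_id
    return [m for m in first_seen.values() if m is not None]


leaderboard_ids = [
    "StackedEnsemble_AllModels_0_AutoML_20180911_165725",
    "GBM_grid_0_AutoML_20180911_165725_model_4",
    "GBM_grid_0_AutoML_20180911_165725_model_13",
    "GLM_grid_0_AutoML_20180911_165725_model_0",
    "DeepLearning_0_AutoML_20180911_165725",
    "DRF_0_AutoML_20180911_165725",
]
family_best = best_per_family(leaderboard_ids)
print(family_best)
```

The result follows the order of the prefixes, not the leaderboard order, which matches the ordering of the list printed above.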


In [139]:
#  Get best_models and coef_norm()
best_models={}
best_models=helperfunc.stackedensemble(mod_best)
bm=[]
if best_models is not None: 
    if 'Intercept' in best_models.keys():
        del best_models['Intercept']
    bm=list(best_models.keys())
else:
    best_models={}
    bm=stackedensemble_df(aml_leaderboard_df)   
    for b in bm:   
        best_models[b]=None

if mod_best.model_id not in bm:
    bm.append(mod_best.model_id)

In [140]:
bm

['GBM_grid_0_AutoML_20180911_165725_model_4',
 'GLM_grid_0_AutoML_20180911_165725_model_0',
 'DeepLearning_0_AutoML_20180911_165725',
 'DRF_0_AutoML_20180911_165725',
 'XRT_0_AutoML_20180911_165725',
 'StackedEnsemble_BestOfFamily_0_AutoML_20180911_165725']

In [141]:
# Best of Family leaderboard

aml_leaderboard_df=aml_leaderboard_df.loc[aml_leaderboard_df['model_id'].isin(bm)]


In [142]:
aml_leaderboard_df

Unnamed: 0,model_id,mean_residual_deviance,rmse,mse,mae,rmsle
1,GBM_grid_0_AutoML_20180911_165725_model_4,0.772541,0.878943,0.772541,0.613218,0.109507
3,StackedEnsemble_BestOfFamily_0_AutoML_20180911_165725,0.800052,0.894456,0.800052,0.618665,0.113842
20,GLM_grid_0_AutoML_20180911_165725_model_0,3.105844,1.762341,3.105844,1.29466,0.184919
22,DeepLearning_0_AutoML_20180911_165725,4.014114,2.003525,4.014114,1.514668,0.190918
27,DRF_0_AutoML_20180911_165725,5.929395,2.435035,5.929395,1.796851,0.194273
28,XRT_0_AutoML_20180911_165725,6.010363,2.451604,6.010363,1.788272,0.199768


In [143]:
# save leaderboard
leaderboard_stats=run_id+'_leaderboard.csv'
aml_leaderboard_df.to_csv(leaderboard_stats)

In [144]:
top=aml_leaderboard_df.iloc[0]['model_id']
print(top)

GBM_grid_0_AutoML_20180911_165725_model_4


In [145]:
mod_best=h2o.get_model(top)
print(mod_best._id)
print(mod_best.algo)

GBM_grid_0_AutoML_20180911_165725_model_4
gbm


In [146]:
meta_data['mod_best']=mod_best._id
meta_data['mod_best_algo']=mod_best.algo

In [147]:
meta_data['models']=bm

In [148]:
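models_path=os.path.join(run_dir,'models')
for mod in bm:
    try:
        m=h2o.get_model(mod)
        h2o.save_model(m, path=models_path)
    except Exception as e:
        # Report the failure instead of silently skipping the model
        print('Could not save', mod, ':', e)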

In [1]:
print(models_path)

## Individual Model

In [150]:
# GBM

gbm_mod,mod_id=get_model_by_algo("GBM",best_models)
if gbm_mod is not None:
    # Save the scoring history
    try:
        sh_df=gbm_mod.scoring_history()
        sh_df.to_csv(run_id+'_gbm_scoring_history.csv')
    except Exception:
        pass
    # Save the model statistics, including variable importance
    try:
        stats_gbm=gbm_stats(gbm_mod)
        n=run_id+'_gbm_stats.json'
        dict_to_json(stats_gbm,n)
        print(stats_gbm)
    except Exception:
        pass

{'algo': 'gbm', 'model_id': 'GBM_grid_0_AutoML_20180911_165725_model_4', 'varimp': [('TV', 17376.125, 1.0, 0.685929063968572), ('radio', 7794.36328125, 0.44856740390909944, 0.30768541949018097), ('newspaper', 131.0323944091797, 0.007540944509157231, 0.0051725530086051625), ('ID', 30.727092742919922, 0.0017683512718123242, 0.0012129635326418615)]}
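
Each `varimp` entry above is a tuple of (variable, relative importance, scaled importance, percentage). As a small sketch of how the last two columns follow from the first, the code below recomputes them from the relative importances printed above. The variable names are illustrative only.

```python
# Variable importance tuples copied from the output above:
# (variable, relative_importance, scaled_importance, percentage)
varimp = [
    ("TV", 17376.125, 1.0, 0.685929063968572),
    ("radio", 7794.36328125, 0.44856740390909944, 0.30768541949018097),
    ("newspaper", 131.0323944091797, 0.007540944509157231, 0.0051725530086051625),
    ("ID", 30.727092742919922, 0.0017683512718123242, 0.0012129635326418615),
]

relative = [row[1] for row in varimp]
top = max(relative)    # the largest relative importance
total = sum(relative)  # the sum over all variables

# scaled = relative / max, percentage = relative / sum
recomputed = [(name, rel, rel / top, rel / total) for name, rel, _, _ in varimp]

for name, rel, scaled, pct in recomputed:
    print(name, round(scaled, 6), round(pct, 6))
```

The percentages sum to one, so `TV` and `radio` together account for about 99% of the total importance in this model.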


In [None]:
print(gbm_mod)

In [None]:
gbm_mod.varimp_plot()

## Distributed Random Forest

In [None]:
drf_mod,mod_id=helperfunc.get_model_by_algo("DRF",best_models)
if drf_mod is not None:
    # Save the scoring history
    try:
        sh_df=drf_mod.scoring_history()
        sh_df.to_csv(run_id+'_drf_scoring_history.csv')
    except Exception:
        pass
    # Save the model statistics
    try:
        stats_drf=helperfunc.drf_stats(drf_mod)
        n=run_id+'_drf_stats.json'
        dict_to_json(stats_drf,n)
        print(stats_drf)
    except Exception:
        pass

Gain/lift plots and partial dependence plots cannot be plotted for multiclass classification models.

**Requirements for plotting gain/lift and partial dependence plots:**
* The model must include only numerical values
* The response column must contain actual binary class labels

## Generalized Linear Models

In [None]:
glm_mod,mod_id=helperfunc.get_model_by_algo("GLM",best_models)
if glm_mod is not None:
    # Save the model statistics
    try:
        stats_glm=glm_stats(glm_mod)
        n=run_id+'_glm_stats.json'
        dict_to_json(stats_glm,n)
        print(stats_glm)
    except Exception:
        pass

### Statistics for Gradient Boosting Machine

In [None]:
gbm_mod.confusion_matrix(data=test)

In [None]:
gbm_mod.varimp_plot()

### Statistics for Distributed Random Forest

In [None]:
drf_mod.confusion_matrix(data=test)

In [None]:
drf_mod.varimp_plot()

### Statistics for Generalized Linear Models

In [None]:
glm_mod.confusion_matrix(data=test)

In [None]:
glm_mod.coef_norm()

## Predictions

In [None]:
predictions_df=predictions_test(mod_best,test,run_id)

In [None]:
predictions_df.head()

In [None]:
modbest_pred = mod_best.predict(test_data=test)

In [None]:
y_true = test['Passholder Type'].as_data_frame()
y_pred = modbest_pred['predict'].as_data_frame()
confusion_matrix(y_true['Passholder Type'], y_pred['predict'])

In [None]:
result = confusion_matrix(y_true['Passholder Type'], y_pred['predict'])
target_col = len(df['Passholder Type'].unique())
prediction = 0
for i in range(target_col):  # sum every diagonal entry (correct predictions)
    prediction =  prediction + result[i,i]
print('Accuracy for best model is: ',prediction/len(test)*100)
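
Accuracy is the sum of the diagonal of the confusion matrix divided by the number of test rows. The sketch below uses a made-up 3-class matrix (not results from this notebook) and a hypothetical helper, `accuracy_from_confusion`, to show the calculation. It also illustrates that iterating over the tuple `(0, k - 1)` visits only the first and last class, whereas `range(k)` visits every class.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified rows (diagonal) / all rows."""
    k = len(matrix)
    correct = sum(matrix[i][i] for i in range(k))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up confusion matrix: rows are true classes, columns are predictions.
example = [
    [50, 2, 1],
    [3, 40, 4],
    [0, 5, 45],
]
total_rows = sum(sum(row) for row in example)

full = accuracy_from_confusion(example)

# Looping over the tuple (0, k - 1) skips the middle class entirely.
k = len(example)
endpoints_only = sum(example[i][i] for i in (0, k - 1)) / total_rows

print(full, endpoints_only)
```

With more than two classes, the tuple form understates the accuracy because the diagonal entries of the middle classes are never added.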

In [None]:
gbm_pred = gbm_mod.predict(test_data=test)

In [None]:
y_true = test['Passholder Type'].as_data_frame()
y_pred = gbm_pred['predict'].as_data_frame()
confusion_matrix(y_true['Passholder Type'], y_pred['predict'])

In [None]:
result = confusion_matrix(y_true['Passholder Type'], y_pred['predict'])
target_col = len(df['Passholder Type'].unique())
prediction = 0
for i in range(target_col):  # sum every diagonal entry (correct predictions)
    prediction =  prediction + result[i,i]
print('Accuracy for GBM model is: ',prediction/len(test)*100)

In [None]:
drf_pred = drf_mod.predict(test_data=test)

In [None]:
y_true = test['Passholder Type'].as_data_frame()
y_pred = drf_pred['predict'].as_data_frame()
confusion_matrix(y_true['Passholder Type'], y_pred['predict'])

In [None]:
result = confusion_matrix(y_true['Passholder Type'], y_pred['predict'])
target_col = len(df['Passholder Type'].unique())
prediction = 0
for i in range(target_col):  # sum every diagonal entry (correct predictions)
    prediction =  prediction + result[i,i]
print('Accuracy for DRF model is: ',prediction/len(test)*100)

In [None]:
glm_pred = glm_mod.predict(test_data=test)

In [None]:
y_true = test['Passholder Type'].as_data_frame()
y_pred = glm_pred['predict'].as_data_frame()
confusion_matrix(y_true['Passholder Type'], y_pred['predict'])

In [None]:
result = confusion_matrix(y_true['Passholder Type'], y_pred['predict'])
target_col = len(df['Passholder Type'].unique())
prediction = 0
for i in range(target_col):  # sum every diagonal entry (correct predictions)
    prediction =  prediction + result[i,i]
print('Accuracy for XRT model is: ',prediction/len(test)*100)

## Prediction Analysis:


| Algorithms                         | Accuracy     |
|----------------------------------- | ------------ |
| Best Of Family model               |**99.92%**    |
| GBM Model                          |**99.93%**    |
| DRF Model                          |**98.83%**    |
| XRT Model                          |**97.12%**    |

* Based on the table above, the GBM model gives the best results.
* Although the Best of Family model had the better AUC, the GBM model is slightly more accurate at prediction (99.93% versus 99.92%).
* H2O models predict class labels from the predicted class probabilities.
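
As a minimal sketch of how a label can be derived from class probabilities, the code below picks the class with the highest probability. This assumes the multiclass case; binary models typically apply a probability threshold instead of a plain argmax. The function `predict_label` and the probability values are illustrative and do not come from this notebook.

```python
def predict_label(class_probabilities):
    """Return the class whose predicted probability is largest."""
    return max(class_probabilities, key=class_probabilities.get)


# Illustrative probabilities for one row (values are made up).
row = {"cat": 0.10, "dog": 0.65, "mouse": 0.25}

label = predict_label(row)
print(label)
```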

In [None]:
# Update and save meta data

meta_data['end_time'] = time.time()
meta_data['execution_time'] = meta_data['end_time'] - meta_data['start_time']
  
n=run_id+'_meta_data.json'
dict_to_json(meta_data,n)

In [None]:
meta_data
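
The helper `dict_to_json` is not shown in this section. A minimal sketch of what such a helper might look like, assuming it simply writes the dictionary to a JSON file, is given below. The name `dict_to_json_sketch` and the file name are hypothetical.

```python
import json
import os
import tempfile
import time


def dict_to_json_sketch(data, path):
    """Write a dictionary to a JSON file (hypothetical stand-in for dict_to_json)."""
    with open(path, "w") as fh:
        json.dump(data, fh, indent=2, sort_keys=True)


# Record timing information the same way the cell above does.
meta = {"start_time": time.time()}
meta["end_time"] = time.time()
meta["execution_time"] = meta["end_time"] - meta["start_time"]

# Write to a temporary directory so the example leaves the run directory untouched.
out_path = os.path.join(tempfile.mkdtemp(), "example_meta_data.json")
dict_to_json_sketch(meta, out_path)

# Read the file back to confirm the round trip.
with open(out_path) as fh:
    restored = json.load(fh)
```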

In [162]:
# Clean up

os.chdir(server_path)


# h2o.cluster().shutdown()



In [163]:
h2o.cluster().shutdown()

H2O session _sid_a21f closed.
