# Assignment 2

## Abstract


For this assignment we selected a Kaggle competition and used H2O AutoML to find the best model and its metrics. The goals were to see where our kernel would land on the competition leaderboard if we continued with this approach, and to understand how H2O works.

##### Kaggle competition - Kobe Bryant Shot Selection

The Kaggle competition we chose is Kobe Bryant Shot Selection. Kobe Bryant marked his retirement from the NBA by scoring 60 points in his final game as a Los Angeles Laker on Wednesday, April 13, 2016. Drafted into the NBA at the age of 17, Kobe earned the sport’s highest accolades throughout his long career.

Using 20 years of data on Kobe's swishes and misses, can you predict which shots will find the bottom of the net? This competition is well suited for practicing classification basics, feature engineering, and time series analysis. Practice got Kobe an eight-figure contract and 5 championship rings. What will it get you?

Link to the competition - https://www.kaggle.com/c/kobe-bryant-shot-selection

##### H2O - AutoML

H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time-limit. Stacked Ensembles – one based on all previously trained models, another one on the best model of each family – will be automatically trained on collections of individual models to produce highly predictive ensemble models which, in most cases, will be the top performing models in the AutoML Leaderboard.


We applied AutoML to the Kobe Bryant dataset to see what models H2O generates once we give it the target variable and the predictors. That is the main aim of this assignment. On the backend (server), H2O runs the algorithms it supports over a range of hyperparameter values and produces a model for each combination.

The number of models H2O generates depends on how long you allow it to run on the loaded dataset.


<span class="girk">`Data & Prediction we are going to make`</span> - Here we predict whether Kobe Bryant made each shot or not.

`Conclusion` - Along the way we give H2O certain parameters to get a good model. Its Log Loss (the deciding metric of our Kaggle competition) turns out to be the best on the leaderboard: `0.344336`. That score would have won the competition had we participated. We therefore submit this assignment to show that even we students can win a Kaggle competition.
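
Since Log Loss is the metric the competition scores on, it helps to see how it is computed. The sketch below is a plain-Python version of the standard binary log loss. The function name `binary_log_loss` and the clipping constant `eps` are our own choices for illustration and are not taken from H2O or Kaggle.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip so that log(0) never occurs for over-confident predictions.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Made-up labels and probabilities, for illustration only.
confident = binary_log_loss([1, 0], [0.9, 0.1])
coin_flip = binary_log_loss([1, 0, 1], [0.5, 0.5, 0.5])
print(confident, coin_flip)
```

Lower is better. Predicting 0.5 for every shot gives ln 2 ≈ 0.693, which is a handy reference point when reading Log Loss values.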


<img src="Images/xrOu1.png" width="300">

## Part 1 - Data cleaning

In [36]:
import numpy as np
import pandas as pd
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import train_test_split

In [37]:
# Load the data, do a rough clean, then sort by game date
df = pd.read_csv("./data.csv")
df.drop(['game_event_id', 'game_id', 'lat', 'lon', 'team_id', 'team_name'], axis=1, inplace=True)
df.sort_values('game_date',  inplace=True)
mask = df['shot_made_flag'].isnull()

### Data Cleaning (Important)

##### This is the most important step: it decides how well we score and whether we get the best model out of H2O

First, we describe the data and its columns. The data contains the location and circumstances of every field goal Kobe Bryant attempted during his 20-year career. The task is to predict whether the basket went in (shot_made_flag).

The competition organizers removed 5000 of the shot_made_flag values (represented as missing values in the csv file). These are the test set shots for which you must submit a prediction. You are provided a sample submission file with the correct shot_ids needed for a valid prediction.

To avoid leakage, your method should only train on events that occurred prior to the shot for which you are predicting! Since this is a playground competition with public answers, it's up to you to abide by this rule.
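
One simple way to respect this rule when validating locally is to hold out the most recent shots instead of a random sample. The sketch below is a simplification of the rule (a single date cutoff, not a per-shot history), and `chronological_split`, `train_fraction` and the toy rows are hypothetical names and data introduced for illustration. It assumes `game_date` is stored as ISO-style `YYYY-MM-DD` strings or parsed dates, so that sorting the column sorts the shots in time.

```python
import pandas as pd


def chronological_split(df, date_col="game_date", train_fraction=0.8):
    """Split rows so that every training row is dated no later than any holdout row."""
    ordered = df.sort_values(date_col, kind="stable").reset_index(drop=True)
    cutoff = int(len(ordered) * train_fraction)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Made-up rows for illustration only.
shots = pd.DataFrame({
    "game_date": ["2001-03-02", "1999-11-05", "2005-01-10", "2003-12-25", "2010-06-17"],
    "shot_made_flag": [1.0, 0.0, 1.0, 0.0, 1.0],
})
train_part, holdout_part = chronological_split(shots)
print(train_part["game_date"].max(), "<=", holdout_part["game_date"].min())
```

A random split would mix later shots into the training rows, which is exactly what the leakage rule asks us to avoid.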

The field names are self-explanatory:

   * action_type
   * combined_shot_type
   * game_event_id
   * game_id
   * lat
   * loc_x
   * loc_y
   * lon
   * minutes_remaining
   * period
   * playoffs
   * season 
   * seconds_remaining
   * shot_distance
   * <span class="girk">`shot_made_flag`</span> (this is what we are predicting)
   * shot_type
   * shot_zone_area
   * shot_zone_basic
   * shot_zone_range
   * team_id
   * team_name
   * game_date
   * matchup
   * opponent
   * shot_id


In [9]:
# Clean data
actiontypes = dict(df.action_type.value_counts())
df['type'] = df.apply(lambda row: row['action_type'] if actiontypes[row['action_type']] > 20\
                          else row['combined_shot_type'], axis=1)
df.drop(['action_type', 'combined_shot_type'], axis=1, inplace=True)

df['away'] = df.matchup.str.contains('@')
df.drop('matchup', axis=1, inplace=True)

df['distance'] = df.apply(lambda row: row['shot_distance'] if row['shot_distance'] <45 else 45, axis=1)

df['time_remaining'] = df.apply(lambda row: row['minutes_remaining'] * 60 + row['seconds_remaining'], axis=1)
df['last_moments'] = df.apply(lambda row: 1 if row['time_remaining'] < 3 else 0, axis=1)

data = pd.get_dummies(df['type'],prefix="action_type")

features=["away", "period", "playoffs", "shot_type", "shot_zone_area", "shot_zone_basic", "season",
           "shot_zone_range", "opponent", "distance", "minutes_remaining", "last_moments"]
for f in features:
    data = pd.concat([data, pd.get_dummies(df[f], prefix=f),], axis=1)
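
The row-wise `apply` calls in the cell above can also be written as column operations, which pandas usually runs much faster on a frame of this size. The sketch below is meant to compute the same derived columns. `add_shot_features` and its keyword arguments are hypothetical names introduced here, and the toy rows are made up.

```python
import pandas as pd


def add_shot_features(df, min_count=20, max_distance=45, last_seconds=3):
    """Vectorized version of the derived columns built in the cleaning cell."""
    out = df.copy()
    # Keep an action_type only if it occurs more than min_count times;
    # otherwise fall back to the coarser combined_shot_type.
    counts = out["action_type"].map(out["action_type"].value_counts())
    out["type"] = out["action_type"].where(counts > min_count, out["combined_shot_type"])
    # Away games are marked with '@' in the matchup string.
    out["away"] = out["matchup"].str.contains("@", regex=False)
    # Cap very long shots at max_distance.
    out["distance"] = out["shot_distance"].clip(upper=max_distance)
    out["time_remaining"] = out["minutes_remaining"] * 60 + out["seconds_remaining"]
    out["last_moments"] = (out["time_remaining"] < last_seconds).astype(int)
    return out


# Made-up rows for illustration only.
demo = pd.DataFrame({
    "action_type": ["Jump Shot", "Jump Shot", "Running Tip Shot"],
    "combined_shot_type": ["Jump Shot", "Jump Shot", "Tip Shot"],
    "matchup": ["LAL @ POR", "LAL vs. UTA", "LAL @ POR"],
    "shot_distance": [18, 60, 2],
    "minutes_remaining": [10, 0, 0],
    "seconds_remaining": [27, 2, 45],
})
features = add_shot_features(demo, min_count=1)
print(features[["type", "away", "distance", "time_remaining", "last_moments"]])
```

The demo lowers `min_count` to 1 only so that the three-row example exercises the fallback to `combined_shot_type`.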

In [38]:
# game_date still needs work: adding it as a feature and increasing n_estimators can improve results, at the cost of time and memory
X = data[~mask]
y = df.shot_made_flag[~mask]

In [3]:
data
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25697 entries, 19214 to 19185
Data columns (total 27 columns):
Unnamed: 0              25697 non-null int64
Unnamed: 0.1            25697 non-null int64
Unnamed: 0.1.1          25697 non-null int64
Unnamed: 0.1.1.1        25697 non-null int64
Unnamed: 0.1.1.1.1      25697 non-null int64
Unnamed: 0.1.1.1.1.1    25697 non-null int64
loc_x                   25697 non-null int64
loc_y                   25697 non-null int64
minutes_remaining       25697 non-null int64
period                  25697 non-null int64
playoffs                25697 non-null int64
season                  25697 non-null object
seconds_remaining       25697 non-null int64
shot_distance           25697 non-null int64
shot_made_flag          25697 non-null float64
shot_type               25697 non-null object
shot_zone_area          25697 non-null object
shot_zone_basic         25697 non-null object
shot_zone_range         25697 non-null object
game_date               2
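
The `Unnamed: 0...` columns in the `df.info()` listing above are what pandas typically produces when a frame is written with `to_csv` (which includes the index by default) and then read back, once per round trip. That this is how `data.csv` acquired them is an assumption on our part. A minimal sketch of removing such columns and avoiding new ones, with `drop_unnamed_columns` as a hypothetical helper name and made-up rows:

```python
import io

import pandas as pd


def drop_unnamed_columns(df):
    """Return a copy of df without columns whose name starts with 'Unnamed:'."""
    keep = [col for col in df.columns if not str(col).startswith("Unnamed:")]
    return df[keep].copy()


# Made-up frame that mimics the leftover index columns.
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Unnamed: 0.1": [0, 1],
    "loc_x": [10, -5],
    "shot_made_flag": [1.0, 0.0],
})
cleaned = drop_unnamed_columns(demo)

# Writing with index=False keeps a save/load round trip from adding an index column.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
```

Columns like these carry only row order, so it is worth dropping them before handing the frame to a model.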

## Initiating H2O server

In [3]:
import h2o
from h2o.automl import H2OAutoML
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  OpenJDK 64-Bit Server VM (build 25.152-b12, mixed mode)
  Starting server from C:\Users\prabh\Anaconda3\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\prabh\AppData\Local\Temp\tmphypnslfl
  JVM stdout: C:\Users\prabh\AppData\Local\Temp\tmphypnslfl\h2o_prabh_started_from_python.out
  JVM stderr: C:\Users\prabh\AppData\Local\Temp\tmphypnslfl\h2o_prabh_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


| | |
|---|---|
| H2O cluster uptime: | 04 secs |
| H2O cluster version: | 3.12.0.1 |
| H2O cluster version age: | 1 year, 8 months and 20 days !!! |
| H2O cluster name: | H2O_from_python_prabh_8sk2ta |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 3.512 Gb |
| H2O cluster total cores: | 12 |
| H2O cluster allowed cores: | 12 |
| H2O cluster status: | accepting new members, healthy |
| H2O connection url: | http://127.0.0.1:54321 |


#### Loading the cleaned data onto the server

In [14]:
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("data.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


##### Checking the correlation of each column with the others to understand the data

In [17]:
data.corr()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,Unnamed: 0.1.1.1.1.1,Unnamed: 0.1.1.1.1.1.1,game_event_id,game_id,lat,...,loc_y,lon,minutes_remaining,period,playoffs,seconds_remaining,shot_distance,shot_made_flag,team_id,shot_id
Unnamed: 0,1.0,1.0,1.0,1.0,1.0,1.0,0.999997,0.025521,0.760329,-0.033551,...,0.033551,-0.012451,-0.008314,-0.003366,0.612019,-0.004856,0.020567,-0.01307,,0.999997
Unnamed: 0.1,1.0,1.0,1.0,1.0,1.0,1.0,0.999997,0.025521,0.760329,-0.033551,...,0.033551,-0.012451,-0.008314,-0.003366,0.612019,-0.004856,0.020567,-0.01307,,0.999997
Unnamed: 0.1.1,1.0,1.0,1.0,1.0,1.0,1.0,0.999997,0.025521,0.760329,-0.033551,...,0.033551,-0.012451,-0.008314,-0.003366,0.612019,-0.004856,0.020567,-0.01307,,0.999997
Unnamed: 0.1.1.1,1.0,1.0,1.0,1.0,1.0,1.0,0.999997,0.025521,0.760329,-0.033551,...,0.033551,-0.012451,-0.008314,-0.003366,0.612019,-0.004856,0.020567,-0.01307,,0.999997
Unnamed: 0.1.1.1.1,1.0,1.0,1.0,1.0,1.0,1.0,0.999997,0.025521,0.760329,-0.033551,...,0.033551,-0.012451,-0.008314,-0.003366,0.612019,-0.004856,0.020567,-0.01307,,0.999997
Unnamed: 0.1.1.1.1.1,1.0,1.0,1.0,1.0,1.0,1.0,0.999997,0.025521,0.760329,-0.033551,...,0.033551,-0.012451,-0.008314,-0.003366,0.612019,-0.004856,0.020567,-0.01307,,0.999997
Unnamed: 0.1.1.1.1.1.1,0.999997,0.999997,0.999997,0.999997,0.999997,0.999997,1.0,0.025433,0.761232,-0.033476,...,0.033476,-0.012453,-0.008251,-0.003357,0.612991,-0.004833,0.020464,-0.013016,,1.0
game_event_id,0.025521,0.025521,0.025521,0.025521,0.025521,0.025521,0.025433,1.0,-0.005982,-0.059602,...,0.059602,-0.029954,-0.274276,0.955914,-0.007963,-0.01867,0.063295,-0.037232,,0.025433
game_id,0.760329,0.760329,0.760329,0.760329,0.760329,0.760329,0.761232,-0.005982,1.0,0.011361,...,-0.011361,-0.012944,0.009581,0.005061,0.917898,-0.009029,-0.027247,-0.001612,,0.761232
lat,-0.033551,-0.033551,-0.033551,-0.033551,-0.033551,-0.033551,-0.033476,-0.059602,0.011361,1.0,...,-1.0,0.017578,0.077399,-0.039737,-0.000857,0.057766,-0.818124,0.14807,,-0.033476


##### Specify the target variable for H2O

In [17]:
# Identify predictors and response
x = train.columns
y = "shot_made_flag"
x.remove(y)

##### Then convert the response column to a factor. H2O infers the data types of the response and the predictors on its own, and an inferred type can be unsuitable for model generation. Converting the column manually makes sure H2O treats the response as categorical, which binary classification requires.

In [18]:
# For binary classification, response should be a factor
train[y] = train[y].asfactor()

Before settling on the final run, we first tried a few simpler runs that left some parameters unspecified.
* First, we gave it no run time, so by `DEFAULT` it uses a run time of 1 hour. We do not want it to run that long, even though it gave a high score (Log Loss) at the end. In the next step we specified a few parameters to shorten the run.
* Second, we ran H2O for around 1000 seconds and came up with a better score than the first: `0.599577`

<img src="Images/Capture.PNG"/>

* After fine-tuning the parameters, we specified `seed=1` and `max_runtime_secs=1000` and ran H2O as shown below.

In [22]:
# Run AutoML for up to 100 models or 1000 seconds, whichever limit is reached first
# aml = H2OAutoML(max_models=20, seed=1, max_runtime_secs=500)
aml = H2OAutoML(max_runtime_secs=1000, seed=1, max_models=100)
aml.train(x=x, y=y, training_frame=train)

AutoML progress: |████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [23]:
# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)

| model_id | auc | logloss |
|---|---|---|
| DRF_0_AutoML_20190227_183234 | 0.948715 | 0.344336 |
| XRT_0_AutoML_20190227_183234 | 0.923095 | 0.484171 |
| StackedEnsemble_0_AutoML_20190227_183234 | 0.795427 | 0.562137 |
| GBM_grid_0_AutoML_20190227_183234_model_0 | 0.785895 | 0.621769 |
| GLM_grid_0_AutoML_20190227_183234_model_0 | 0.711312 | 0.600094 |
| GLM_grid_0_AutoML_20190227_185017_model_1 | 0.702603 | 0.603857 |
| GBM_grid_0_AutoML_20190227_185017_model_3 | 0.701112 | 0.680623 |
| GLM_grid_0_AutoML_20190227_183234_model_1 | 0.699046 | 0.608849 |
| GBM_grid_0_AutoML_20190227_185017_model_2 | 0.697811 | 0.619767 |
| GLM_grid_0_AutoML_20190227_185017_model_0 | 0.69258 | 0.612955 |




##### In the end we got a `Log Loss = 0.344336`, which is a very good score

Here is how our kernel would rank in the Kaggle competition:


<img src="Images/image1.jpg" width="700" >

##### Now, for further analysis of the generated models, we consider the best one

H2O automatically reports which features matter most for the best model, and we can check them.


<img src="Images/image2.PNG" width="500" />

#### From this we see that the important variables turned out to be:

* matchup
* action_type
* opponent
* seconds_remaining
* game_event_id
* Unnamed: 0.1.1.1.1
* shot_zone_range
* game_id
* shot_type
* playoffs


In [26]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[0])

In [34]:
print(mod_best.logloss())
print(mod_best.algo)
print(mod_best.model_id)
print(mod_best.gini)  # no parentheses: this prints the bound method, whose repr includes the full model summary below

0.6922279011323961
drf
DRF_0_AutoML_20190227_183234
Model Details
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  DRF_0_AutoML_20190227_183234


ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.22555366606423846
RMSE: 0.474924905710617
LogLoss: 0.6922279011323961
Mean Per-Class Error: 0.3614601725940524
AUC: 0.6760012086173288
Gini: 0.35200241723465764
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.2237670104211395: 


,0.0,1.0,Error,Rate
0,2042.0,7840.0,0.7934,(7840.0/9882.0)
1,852.0,7156.0,0.1064,(852.0/8008.0)
Total,2894.0,14996.0,0.4859,(8692.0/17890.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.2237670,0.6221527,323.0
max f2,0.0084907,0.8026722,398.0
max f0point5,0.5622336,0.6120439,145.0
max accuracy,0.5786885,0.6561766,138.0
max precision,1.0,0.9446064,0.0
max recall,0.0000217,1.0,399.0
max specificity,1.0,0.9980773,0.0
max absolute_mcc,0.6636141,0.3033630,106.0
max min_per_class_accuracy,0.4212417,0.6215341,214.0
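
The `max f1` row of the training metrics can be cross-checked against the confusion matrix reported at the same threshold. The sketch below uses the standard F-beta formula on the counts copied from that matrix. `f_beta` is a helper name we introduce here and is not an H2O function.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from confusion-matrix counts (beta=1 gives F1)."""
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Counts read from the training confusion matrix above (positive class = 1).
tn, fp = 2042, 7840
fn, tp = 852, 7156

f1 = f_beta(tp, fp, fn)
error_rate = (fp + fn) / (tn + fp + fn + tp)
print(round(f1, 7), round(error_rate, 4))
```

Both values agree with the table: F1 matches the `max f1` entry and the error rate matches the `Total` row of the confusion matrix.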


Gains/Lift Table: Avg response rate: 44.76 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0140861,1.0,2.2340160,2.2340160,1.0,1.0,0.0314685,0.0314685,123.4015984,123.4015984
,2,0.0200112,0.9866741,2.2340160,2.2340160,1.0,1.0,0.0132368,0.0447053,123.4015984,123.4015984
,3,0.0325880,0.98,2.2340160,2.2340160,1.0,1.0,0.0280969,0.0728022,123.4015984,123.4015984
,4,0.0400224,0.9683344,2.2340160,2.2340160,1.0,1.0,0.0166084,0.0894106,123.4015984,123.4015984
,5,0.0528787,0.96,2.2340160,2.2340160,1.0,1.0,0.0287213,0.1181319,123.4015984,123.4015984
,6,0.1,0.9137067,2.2340160,2.2340160,1.0,1.0,0.1052697,0.2234016,123.4015984,123.4015984
,7,0.1500279,0.8616468,2.2340160,2.2340160,1.0,1.0,0.1117632,0.3351648,123.4015984,123.4015984
,8,0.2,0.8161639,2.2340160,2.2340160,1.0,1.0,0.1116384,0.4468032,123.4015984,123.4015984
,9,0.3,0.7307431,2.2340160,2.2340160,1.0,1.0,0.2234016,0.6702048,123.4015984,123.4015984
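
To read the Gains/Lift table, it helps to recompute one row by hand. Lift is a group's response rate divided by the average response rate, and the gain column is consistent with `(lift - 1) * 100`. The sketch below uses the class counts from the training confusion matrix. The variable and function names are ours, chosen for illustration.

```python
# Class counts from the training confusion matrix: 8008 made shots out of 17890.
positives = 8008
total = 17890
avg_response_rate = positives / total


def lift(group_response_rate, baseline_rate):
    """How many times more often the group responds than the average row."""
    return group_response_rate / baseline_rate


# The top training groups all have response_rate = 1.0 in the table above.
top_group_lift = lift(1.0, avg_response_rate)
top_group_gain = (top_group_lift - 1.0) * 100.0
print(round(avg_response_rate * 100, 2), round(top_group_lift, 6), round(top_group_gain, 4))
```

The results line up with the table header (average response rate of 44.76 %) and with the `lift` and `gain` columns of the first rows.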




ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.22106351697169108
RMSE: 0.4701739220455459
LogLoss: 0.6504722093837271
Mean Per-Class Error: 0.3593506824850108
AUC: 0.6764376703609113
Gini: 0.3528753407218226
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.253813820765132: 


,0.0,1.0,Error,Rate
0,447.0,1758.0,0.7973,(1758.0/2205.0)
1,184.0,1558.0,0.1056,(184.0/1742.0)
Total,631.0,3316.0,0.492,(1942.0/3947.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.2538138,0.6160538,316.0
max f2,0.0,0.7979844,399.0
max f0point5,0.5535611,0.6178954,150.0
max accuracy,0.5535611,0.6645554,150.0
max precision,0.9725332,0.9672131,6.0
max recall,0.0,1.0,399.0
max specificity,0.9999266,0.9995465,0.0
max absolute_mcc,0.6799914,0.3212875,99.0
max min_per_class_accuracy,0.4196293,0.6222732,218.0


Gains/Lift Table: Avg response rate: 44.13 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0134279,0.98,2.1802851,2.1802851,0.9622642,0.9622642,0.0292767,0.0292767,118.0285077,118.0285077
,2,0.0215353,0.96,2.1949806,2.1858175,0.96875,0.9647059,0.0177956,0.0470723,119.4980626,118.5817519
,3,0.0301495,0.9222313,1.9992233,2.1325049,0.8823529,0.9411765,0.0172216,0.0642939,99.9223340,113.2504896
,4,0.0400304,0.9006982,1.8591068,2.0650206,0.8205128,0.9113924,0.0183697,0.0826636,85.9106833,106.5020564
,5,0.0506714,0.88,1.9960500,2.0505367,0.8809524,0.905,0.0212400,0.1039036,99.6049970,105.0536739
,6,0.1000760,0.7827241,1.7429127,1.8986717,0.7692308,0.8379747,0.0861079,0.1900115,74.2912656,89.8671685
,7,0.1499873,0.7054746,1.7022152,1.8332968,0.7512690,0.8091216,0.0849598,0.2749713,70.2215203,83.3296809
,8,0.2006587,0.64,1.3141561,1.7022007,0.58,0.7512626,0.0665901,0.3415614,31.4156142,70.2200681
,9,0.2999747,0.5317773,1.1502334,1.5194548,0.5076531,0.6706081,0.1142365,0.4557979,15.0233429,51.9454766




ModelMetricsBinomial: drf
** Reported on cross-validation data. **

MSE: 0.218546364440412
RMSE: 0.46748942708943914
LogLoss: 0.631739907321219
Mean Per-Class Error: 0.35333141104999033
AUC: 0.6858717519578176
Gini: 0.37174350391563515
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.24677690550684933: 


,0.0,1.0,Error,Rate
0,2090.0,7792.0,0.7885,(7792.0/9882.0)
1,794.0,7214.0,0.0992,(794.0/8008.0)
Total,2884.0,15006.0,0.4799,(8586.0/17890.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.2467769,0.6269227,324.0
max f2,0.0479236,0.8024308,395.0
max f0point5,0.5817181,0.6336701,139.0
max accuracy,0.5817181,0.6694243,139.0
max precision,0.9999794,0.9753086,0.0
max recall,0.0199886,1.0,398.0
max specificity,0.9999794,0.9997976,0.0
max absolute_mcc,0.6302461,0.3338268,118.0
max min_per_class_accuracy,0.4186780,0.6269980,223.0


Gains/Lift Table: Avg response rate: 44.76 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0124651,0.98,2.1338359,2.1338359,0.9551570,0.9551570,0.0265984,0.0265984,113.3835895,113.3835895
,2,0.0218558,0.96,2.1409320,2.1368849,0.9583333,0.9565217,0.0201049,0.0467033,114.0931985,113.6884854
,3,0.0304080,0.94,1.9857920,2.0943900,0.8888889,0.9375,0.0169830,0.0636863,98.5791986,109.4389985
,4,0.0405813,0.91,1.9762449,2.0647723,0.8846154,0.9242424,0.0201049,0.0837912,97.6244909,106.4772349
,5,0.0500279,0.8925367,1.9431973,2.0418157,0.8698225,0.9139665,0.0183566,0.1021479,94.3197335,104.1815726
,6,0.1,0.8040549,1.7767174,1.9093407,0.7953020,0.8546674,0.0887862,0.1909341,77.6717410,90.9340659
,7,0.1500279,0.7278681,1.6499269,1.8228372,0.7385475,0.8159463,0.0825425,0.2734765,64.9926889,82.2837185
,8,0.2,0.6544646,1.4718517,1.7351399,0.6588367,0.7766909,0.0735514,0.3470280,47.1851694,73.5139860
,9,0.3007267,0.54,1.1294054,1.5322526,0.5055494,0.6858736,0.1137612,0.4607892,12.9405417,53.2252599



Cross-Validation Metrics Summary: 


,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
accuracy,0.5103969,0.0170002,0.4675797,0.5187256,0.529346,0.5340972,0.5022359
auc,0.6860095,0.0048444,0.6834494,0.6826773,0.6764585,0.6919201,0.6955421
err,0.4896031,0.0170002,0.5324203,0.4812745,0.470654,0.4659027,0.4977641
err_count,1751.8,60.826637,1905.0,1722.0,1684.0,1667.0,1781.0
f0point5,0.5276347,0.0087455,0.5043513,0.5349369,0.5363154,0.5370902,0.5254799
f1,0.6285752,0.0045262,0.6163142,0.6351695,0.6300528,0.6304589,0.6308808
f2,0.7779176,0.0087778,0.7921715,0.7816248,0.763497,0.7631212,0.7891735
lift_top_group,2.1341498,0.0233789,2.1733377,2.108709,2.1678126,2.0877979,2.1330914
logloss,0.6317399,0.0069462,0.6470163,0.6278963,0.6386968,0.6254495,0.6196405


Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_auc,validation_lift,validation_classification_error
,2019-02-27 18:33:17,42.067 sec,0.0,,,,,,,,,,
,2019-02-27 18:33:17,42.201 sec,1.0,0.6417147,13.3249071,0.5684000,1.8563113,0.5444752,0.6257825,12.6196298,0.5865238,1.2540661,0.5586521
,2019-02-27 18:33:17,42.375 sec,2.0,0.6210705,11.6372305,0.5767059,2.1514338,0.5499440,0.5561231,6.2937922,0.6037496,1.4513433,0.5586521
,2019-02-27 18:33:17,42.537 sec,3.0,0.6051961,10.3191719,0.5841894,2.2114993,0.5527932,0.5274344,3.7649288,0.6214963,1.6196496,0.5586521
,2019-02-27 18:33:18,42.711 sec,4.0,0.5887855,8.8804535,0.5916112,2.2295480,0.5524298,0.5132453,2.4721088,0.6310471,1.7470406,0.5586521
,2019-02-27 18:33:18,42.884 sec,5.0,0.5770250,7.8624513,0.5975191,2.2331333,0.5517220,0.5045948,1.8703764,0.6399109,1.8342081,0.5586521
,2019-02-27 18:33:18,43.069 sec,6.0,0.5670412,6.9897580,0.6020044,2.2340160,0.5514291,0.5005857,1.5548858,0.6412568,1.8819135,0.5254624
,2019-02-27 18:33:18,43.321 sec,7.0,0.5543975,6.0015319,0.6107800,2.2340160,0.5521686,0.4964831,1.3044694,0.6455224,1.9568156,0.5328097
,2019-02-27 18:33:18,43.557 sec,8.0,0.5451264,5.1261757,0.6138782,2.2340160,0.5525744,0.4935844,1.1362039,0.6464534,1.9679974,0.5391437


Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| matchup | 18858.0214844 | 1.0 | 0.1455128 |
| action_type | 17095.3730469 | 0.9065306 | 0.1319118 |
| opponent | 10931.6328125 | 0.5796808 | 0.0843510 |
| seconds_remaining | 7407.8325195 | 0.3928213 | 0.0571605 |
| game_event_id | 6554.1416016 | 0.3475519 | 0.0505733 |
| --- | --- | --- | --- |
| Unnamed: 0.1.1.1.1 | 2066.5766602 | 0.1095861 | 0.0159462 |
| shot_zone_range | 1420.8073730 | 0.0753423 | 0.0109633 |
| game_id | 725.1743774 | 0.0384544 | 0.0055956 |



See the whole table with table.as_data_frame()
<bound method ModelBase.gini of >
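
Several numbers in the model summary above are tied together by simple identities, which makes them easy to sanity-check: RMSE is the square root of MSE, and Gini equals `2 * AUC - 1`. The sketch below checks both on the values copied from the three metric blocks. `gini_from_auc` is a helper name of our own.

```python
import math

# (MSE, RMSE, AUC, Gini) copied from the three metric blocks above.
reported = {
    "train": (0.22555366606423846, 0.474924905710617,
              0.6760012086173288, 0.35200241723465764),
    "validation": (0.22106351697169108, 0.4701739220455459,
                   0.6764376703609113, 0.3528753407218226),
    "cross-validation": (0.218546364440412, 0.46748942708943914,
                         0.6858717519578176, 0.37174350391563515),
}


def gini_from_auc(auc):
    """Gini coefficient expressed in terms of the area under the ROC curve."""
    return 2.0 * auc - 1.0


for name, (mse, rmse, auc, gini) in reported.items():
    rmse_gap = abs(math.sqrt(mse) - rmse)
    gini_gap = abs(gini_from_auc(auc) - gini)
    print(f"{name}: RMSE gap {rmse_gap:.2e}, Gini gap {gini_gap:.2e}")
```

To get the Gini value alone instead of the full summary, the method has to be called, as in `mod_best.gini()`.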


## Conclusion

H2O lets us train models from different algorithms over different hyperparameter values and pick the best one from the leaderboard it generates. We use this leaderboard to select the model for further analysis. Given appropriate parameters, H2O can produce models with the best metric value, good enough to win the Kaggle competition. This shows that even we students can win a Kaggle competition efficiently.
The leaderboard gave us the algorithms below, with the metrics shown, for predicting Kobe Bryant's shots:

<img src='Images/image3.PNG' width="500" />
    
##### DRF turned out to be the algorithm that gave us the best model

## Contribution

We dug into how H2O works and checked which parameters can be tuned to produce better models with the best scores.
* Own - 80%
* Other sources - 20%

## Citations

Prof. GitHub H2O - https://github.com/nikbearbrown/CSYE_7245/tree/master/H2O

H2O - https://h2o.ai

Kaggle competition - https://www.kaggle.com/c/kobe-bryant-shot-selection

Kaggle kernels - 
* https://www.kaggle.com/selfishgene/psychology-of-a-professional-athlete/data
* https://www.kaggle.com/apapiu/exploring-kobe-s-shots


## License

<font size="4">MIT License</font>
    
<img src="Images/OSI_Approved_License.png" width="100" align="right"/>

<font size="4">
    
<b>Copyright (c) 2019 PRABHU SUBRAMANIAN, PREETAM JAIN</b>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
