# Modelling football stadium attendance using a GBM algorithm
The aim of this script is to generate an artificial data set, fit a GBM to it, and deal with overfitting.
The script is fairly long, especially the data generation part, so feel free to skim through it. Thanks for your attention.

In [1]:
import h2o
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O_cluster_uptime:,1 day 3 hours 56 mins
H2O_cluster_timezone:,Asia/Kolkata
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.30.0.7
H2O_cluster_version_age:,4 days
H2O_cluster_name:,H2O_from_python_Gurdeep_Singh_ugwgkm
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,1.466 Gb
H2O_cluster_total_cores:,8
H2O_cluster_allowed_cores:,8


## Generating a random data set

In [2]:
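random.seed(123)
np.random.seed(123)  # np.random.normal below uses NumPy's own generator, which random.seed does not affect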
sampleSize = 1000

In [3]:
# Generating stadium sizes (rounded to the nearest hundred, with a floor of 10,000)
stadiumSize = []
for i in range (sampleSize):
    size = round(np.random.normal(loc = 50000, scale = 12000), -2)
    if(size < 10000):
        size = 10000
    stadiumSize.append(size)
#print(stadiumSize)
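
The same draw can be made reproducible with a dedicated, seeded generator. The sketch below is only an illustration: it swaps NumPy for the standard library's `random.Random`, and the helper name `draw_stadium_sizes` and its parameters are hypothetical, not part of the original script.

```python
import random

def draw_stadium_sizes(n, mean=50000, sd=12000, floor=10000, seed=123):
    """Draw n stadium sizes, rounded to the nearest hundred and floored."""
    rng = random.Random(seed)  # private generator, independent of other seeds
    sizes = []
    for _ in range(n):
        size = round(rng.gauss(mean, sd), -2)  # nearest hundred
        sizes.append(max(size, floor))         # same floor as in the loop above
    return sizes

sizes_a = draw_stadium_sizes(1000)
sizes_b = draw_stadium_sizes(1000)
print(sizes_a[:3], sizes_a == sizes_b)  # identical lists on every call
```

Because the generator is created inside the function, two calls with the same seed return the same list regardless of what other code does with the global random state.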

In [4]:
# Generating players' total value (rounded to the nearest thousand)
playersValue = []
for i in range (sampleSize):
    # Hypothesis: players' value proportional to the stadium size + some randomness
    playersValue.append(round(stadiumSize[i] * 20 + random.randrange(200000, 500000, 1), -3))
#print(playersValue)

In [5]:
# Generating players' average age
playersAge = []
for i in range (sampleSize):
    # Hypothesis: age is completely random
    playersAge.append(round(np.random.normal(27, 2), 2))
#print(playersAge)

In [6]:
# Generating teams' winning average
victory = []
for i in range (sampleSize):
    # Hypothesis: winning 80% influenced by team's value + 20% of randomness
    percentage = round(playersValue[i]/np.max(playersValue)*80 + np.random.normal(loc=0.5, scale=0.1)*20)
    if(percentage < 0):
        percentage = 0
    elif(percentage > 100):
        percentage = 100
    victory.append(percentage)
#print(victory)

In [7]:
# Generating the output variable: average number of fans going to the stadium during the season
attendance = []
for i in range (sampleSize):
    # Hypothesis 1: attendance = stadium size * 0.6 +- some randomness
    v = round(np.random.normal(loc=stadiumSize[i]*0.6, scale = stadiumSize[i]*0.1), -2)
    # Hypothesis 2: the higher the players' value, the higher the attendance
    v = v + v*playersValue[i]/np.max(playersValue)*0.4
    # Hypothesis 3: the higher the victory rate, the higher the attendance
    v = v + v*victory[i]**0.5/100
    # Correcting for extreme values
    if(v < stadiumSize[i]*0.3):
        v = stadiumSize[i]*0.3
    elif(v > stadiumSize[i]*0.9):
        v = stadiumSize[i]*0.9
    attendance.append(round(v, -2))
#print(attendance)
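
To see how the three hypotheses combine, the sketch below applies the same formula without the random draw, starting from the mean of 60% of the stadium size. The function name `expected_attendance` and the `value_ratio` argument (players' value divided by the sample maximum) are hypothetical, and a single lower bound is assumed for the clamp.

```python
def expected_attendance(stadium_size, value_ratio, victory_pct,
                        floor=0.3, cap=0.9):
    """Noise-free version of the attendance formula (illustrative only)."""
    v = stadium_size * 0.6                 # Hypothesis 1, without the noise
    v = v + v * value_ratio * 0.4          # Hypothesis 2: up to +40%
    v = v + v * victory_pct ** 0.5 / 100   # Hypothesis 3: up to +10%
    # Assumed clamp: one lower bound and the 90% upper bound
    v = min(max(v, stadium_size * floor), stadium_size * cap)
    return round(v, -2)

print(expected_attendance(50000, 0.5, 64))   # 38900.0
print(expected_attendance(50000, 1.0, 100))  # 45000.0 (capped at 90%)
```

The second call shows that a top-valued team with a perfect record would exceed 90% of capacity before the cap is applied.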

In [8]:
# Creating the dataframe
teamsDF = pd.DataFrame(list(zip(stadiumSize, playersValue, playersAge, victory, attendance)))
teamsDF.columns =['StadiumSize','PlayersValue','PlayersAge','VictoryPercentage', "StadiumAttendance"]
print(teamsDF.head())  

hf = h2o.H2OFrame(teamsDF, destination_frame = "teams")

   StadiumSize  PlayersValue  PlayersAge  VictoryPercentage  StadiumAttendance
0      60100.0     1429000.0       27.78               65.0            54000.0
1      40600.0     1152000.0       27.45               52.0            36500.0
2      39600.0     1038000.0       27.40               54.0            34600.0
3      69800.0     1810000.0       26.21               81.0            53500.0
4      59300.0     1526000.0       26.88               66.0            53400.0
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [9]:
# Retrieve the frame from the H2O cluster by its key
teams = h2o.get_frame("teams")

## Modelling a GBM algorithm

In [10]:
# Split the data into train and test set
train, test = teams.split_frame(
    ratios = [0.8],
    destination_frames = ["teams_train", "teams_test"],
    seed = 123)
train = h2o.get_frame("teams_train")
test = h2o.get_frame("teams_test")
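
Note that `split_frame` splits probabilistically, so the 80/20 ratio is only approximate. The sketch below illustrates the idea in plain Python, assuming each row is assigned independently with a uniform draw. It is not H2O's actual implementation, and `probabilistic_split` is a hypothetical helper.

```python
import random

def probabilistic_split(rows, ratio=0.8, seed=123):
    """Send each row to train with probability `ratio`, otherwise to test."""
    rng = random.Random(seed)
    train_rows, test_rows = [], []
    for row in rows:
        (train_rows if rng.random() < ratio else test_rows).append(row)
    return train_rows, test_rows

train_rows, test_rows = probabilistic_split(range(1000))
# The sizes are close to 800/200 but usually not exactly that.
print(len(train_rows), len(test_rows))
```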

In [11]:
# Defining X and Y variables
y = 'StadiumAttendance'
ignoreFields = [y]  # a list, so that `in` tests membership rather than substrings
x = [i for i in train.names if i not in ignoreFields]
print(x)

['StadiumSize', 'PlayersValue', 'PlayersAge', 'VictoryPercentage']
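
Note that `ignoreFields` should be a list rather than a bare string: with a string, `in` performs a substring test, which only gives the right answer as long as no predictor name happens to be contained in the target name. A small illustration, using a hypothetical column called `Stadium`:

```python
y = "StadiumAttendance"
columns = ["StadiumSize", "Stadium", "StadiumAttendance"]  # "Stadium" is hypothetical

# Substring test: "Stadium" is wrongly dropped because it occurs inside y.
x_from_string = [c for c in columns if c not in y]
# Membership test: only the target column itself is dropped.
x_from_list = [c for c in columns if c not in [y]]

print(x_from_string)  # ['StadiumSize']
print(x_from_list)    # ['StadiumSize', 'Stadium']
```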


In [12]:
# Constructing a GBM algorithm (with default hyperparameters and no cross validation)
from h2o.estimators.gbm import H2OGradientBoostingEstimator
myGBM = H2OGradientBoostingEstimator(model_id = "baseline_GBM")
myGBM.train(x, y, train)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [13]:
# Model performance
print("The GBM baseline model has a MAE of %d on the Train set and %d on the Test set"
      % (myGBM.mae(train=True), myGBM.model_performance(test).mae()))

The GBM baseline model has a MAE of 3178 on the Train set and 4618 on the Test set


The error is lower on the train set than on the test set, which suggests that the model overfits to some extent.
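
For reference, the MAE is the mean of the absolute differences between observed and predicted values. The sketch below computes it by hand on three attendance values from the `head()` output, paired with made-up predictions (they are illustrative, not model output), and then expresses the baseline train/test gap as a ratio.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [54000, 36500, 34600]     # attendance values from the head() output
predicted = [51000, 38500, 35600]  # hypothetical predictions, for illustration
print(mean_absolute_error(actual, predicted))  # 2000.0

# Ratio of test MAE to train MAE for the baseline model
gap_ratio = 4618 / 3178
print(round(gap_ratio, 2))  # 1.45
```

A ratio well above 1 means the model does noticeably worse on data it has not seen.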

In [14]:
# Completely overfitted model
myGBM_overfitted = H2OGradientBoostingEstimator(model_id = "overfitted_GBM", ntrees = 1000, max_depth = 10)
myGBM_overfitted.train(x, y, train)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [15]:
print("The GBM overfitted model has a MAE of %d on the Train set and %d on the Test set"
      % (myGBM_overfitted.mae(train=True), myGBM_overfitted.model_performance(test).mae()))

The GBM overfitted model has a MAE of 285 on the Train set and 5324 on the Test set


We can see that the train error has fallen sharply (from 3178 to 285) while the test error has increased (from 4618 to 5324). This specification therefore makes the overfitting worse.

In [16]:
# Constructing a final GBM: using cross validation to reduce overfitting
myGBM_cv = H2OGradientBoostingEstimator(model_id = "crossval_GBM", nfolds = 20)
myGBM_cv.train(x, y, train)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [17]:
print("The GBM with cross validation model has a MAE of %d on the Train set and %d on the Test set"
      % (myGBM_cv.mae(train=True), myGBM_cv.model_performance(test).mae()))

The GBM with cross validation model has a MAE of 3178 on the Train set and 4618 on the Test set


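The results of our CV model are exactly the same as those of the baseline GBM. Since cross-validation trains on different subsamples, I expected the results to differ from a default GBM, and first suspected a mistake on my side. The more likely explanation is that, in H2O, `nfolds` only adds cross-validated metrics: the fold models are used for scoring, and the final model is still trained on the whole training frame with the same hyperparameters. Its train and test MAE are therefore identical to the baseline's, and the cross-validated error is read separately with `myGBM_cv.mae(xval=True)`.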
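
A plain-Python sketch of the usual k-fold scheme may help here: one model per fold is fit only to score its held-out rows, and the final model is fit on all rows, so it coincides with a model trained without cross-validation. The mean predictor and the helper names below are hypothetical stand-ins, and this is not a description of H2O's internals.

```python
def fit_mean(values):
    """Toy 'model': always predict the mean of the training targets."""
    return sum(values) / len(values)

def mae(actual, prediction):
    """MAE of a constant prediction against the observed values."""
    return sum(abs(a - prediction) for a in actual) / len(actual)

def fit_with_cv(values, nfolds):
    """Return (final_model, cross-validated MAE)."""
    fold_errors = []
    for k in range(nfolds):
        held_out = values[k::nfolds]  # every nfolds-th row, starting at k
        kept = [v for i, v in enumerate(values) if i % nfolds != k]
        fold_model = fit_mean(kept)   # used for scoring only
        fold_errors.append(mae(held_out, fold_model))
    final_model = fit_mean(values)    # fit on ALL rows
    return final_model, sum(fold_errors) / nfolds

y_train = [54000, 36500, 34600, 53500, 53400, 41000]  # illustrative targets
baseline = fit_mean(y_train)
cv_model, cv_mae = fit_with_cv(y_train, nfolds=3)

print(cv_model == baseline)             # True: same final model
print(mae(y_train, cv_model) < cv_mae)  # True: training MAE is below the CV estimate
```

In this scheme cross-validation changes what is known about the model's out-of-sample error, not the model itself. Reducing overfitting would need a change in the hyperparameters, guided by the cross-validated metric.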