In this project, we focus on sports analytics. The data set comes from http://www.baseball-reference.com and covers professional baseball (MLB) games played in the 2016 season. It contains 2,427 games, one per row. The goal is to predict the attendance at a home team's game. This is an important task because most franchises want to forecast the number of attendees for a variety of reasons, including profits.

## Goal

Use the **baseball.csv** data set and build a model to predict **attendance_binary**.

In [1]:
# Common imports
import numpy as np
import pandas as pd

np.random.seed(36926175)

In [2]:
# We will predict the "attendance_binary" value in the data set:

baseball = pd.read_csv("baseball.csv")
baseball.head()

Unnamed: 0,attendance_binary,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,game_type,previous_game_type,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,game_day,previous_game_day,temperature,wind_speed,sky,previous_game_duration,previous_homewin
0,0,43683,2,6,2,Night Game,Day Game,0,6,6,Wednesday,Monday,55,24,Overcast,2.933333,1
1,0,45785,0,7,2,Night Game,Day Game,0,10,3,Wednesday,Monday,48,7,Unknown,2.8,1
2,0,48282,0,8,4,Night Game,Day Game,2,4,3,Wednesday,Monday,65,10,Cloudy,3.383333,0
3,0,21830,0,9,6,Day Game,Night Game,0,15,11,Wednesday,Tuesday,77,0,In Dome,3.233333,1
4,0,49289,2,4,2,Night Game,Day Game,1,1,3,Tuesday,Monday,81,12,Cloudy,2.633333,1


In [3]:
# find the datatypes of the columns
baseball.dtypes

attendance_binary              int64
previous_attendance            int64
previous_away_team_errors      int64
previous_away_team_hits        int64
previous_away_team_runs        int64
game_type                     object
previous_game_type            object
previous_home_team_errors      int64
previous_home_team_hits        int64
previous_home_team_runs        int64
game_day                      object
previous_game_day             object
temperature                    int64
wind_speed                     int64
sky                           object
previous_game_duration       float64
previous_homewin               int64
dtype: object

In [4]:
#Find the number of rows and columns

baseball.shape

(2427, 17)

In [5]:
#Split the data into train and test

from sklearn.model_selection import train_test_split

train, test = train_test_split(baseball, test_size=0.3)
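
The split above relies on the global NumPy seed for reproducibility. A minimal sketch of an alternative, assuming a small synthetic frame in place of `baseball.csv` (the `demo` frame and its values are hypothetical), passes `random_state` directly and uses `stratify` so both parts keep roughly the same class balance:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the baseball frame: one feature and a binary target.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "previous_attendance": rng.integers(10_000, 50_000, size=200),
    "attendance_binary": np.repeat([0, 1], [96, 104]),
})

# random_state pins the shuffle; stratify preserves the class proportions.
demo_train, demo_test = train_test_split(
    demo,
    test_size=0.3,
    random_state=42,
    stratify=demo["attendance_binary"],
)

print(demo_train.shape, demo_test.shape)
print(demo_train["attendance_binary"].mean(), demo_test["attendance_binary"].mean())
```

Whether stratification matters here is a judgment call; with classes this close to balanced, a plain random split is usually fine too.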

In [6]:
#Check the missing value in train dataset

train.isna().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

In [7]:
# check the missing values in test dataset

test.isna().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

In [8]:
#Data preparation

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder



In [9]:
#Separate target variable

train_target = train[['attendance_binary']]
test_target = test[['attendance_binary']]

train_inputs = train.drop(['attendance_binary'], axis=1)
test_inputs = test.drop(['attendance_binary'], axis=1)
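
Selecting the target with double brackets yields a one-column DataFrame of shape `(n, 1)`. scikit-learn estimators expect a 1-D target and typically emit a `DataConversionWarning` when given a column vector. A minimal sketch, using synthetic data (the `X_demo` and `y_frame` names are hypothetical), shows one way to flatten the target before fitting:

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.exceptions import DataConversionWarning
from sklearn.svm import SVC

# Hypothetical toy data: 40 rows, 3 features, alternating binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_frame = pd.DataFrame({"attendance_binary": np.tile([0, 1], 20)})

# Single brackets (or .to_numpy().ravel()) give a 1-D target of shape (40,).
y_flat = y_frame["attendance_binary"].to_numpy()

with warnings.catch_warnings():
    # Turn the column-vector warning into an error to show it is not raised.
    warnings.simplefilter("error", category=DataConversionWarning)
    SVC(kernel="linear").fit(X_demo, y_flat)

print(y_frame.shape, y_flat.shape)
```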

In [10]:
# Convert the binary column to pandas' nullable boolean dtype (applied to the training inputs only)

train_inputs['previous_homewin']=train_inputs['previous_homewin'].astype('boolean')

In [11]:
train_inputs['previous_homewin'].value_counts()

True     894
False    804
Name: previous_homewin, dtype: Int64

In [12]:
train_inputs.dtypes

previous_attendance            int64
previous_away_team_errors      int64
previous_away_team_hits        int64
previous_away_team_runs        int64
game_type                     object
previous_game_type            object
previous_home_team_errors      int64
previous_home_team_hits        int64
previous_home_team_runs        int64
game_day                      object
previous_game_day             object
temperature                    int64
wind_speed                     int64
sky                           object
previous_game_duration       float64
previous_homewin             boolean
dtype: object

In [13]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=['int64','float']).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

# Identify the binary column
binary_columns= train_inputs.select_dtypes('boolean').columns.to_list()

In [14]:
binary_columns

['previous_homewin']

In [15]:
numeric_columns

['previous_attendance',
 'previous_away_team_errors',
 'previous_away_team_hits',
 'previous_away_team_runs',
 'previous_home_team_errors',
 'previous_home_team_hits',
 'previous_home_team_runs',
 'temperature',
 'wind_speed',
 'previous_game_duration']

In [16]:
categorical_columns

['game_type', 'previous_game_type', 'game_day', 'previous_game_day', 'sky']

In [17]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [18]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [19]:
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [20]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('binary', binary_transformer, binary_columns)],
        remainder='passthrough')

# remainder='passthrough' is optional here: every input column is already assigned to one of the three transformers, so nothing is left to pass through.

In [21]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)
 
train_x

array([[-0.47732149311450606, 3.053145493585396, 0.3278832902726411, ...,
        0.0, 0.0, True],
       [1.3144949023511214, -0.7245380488785358, -0.23363544586749493,
        ..., 0.0, 1.0, True],
       [-0.2915042515800707, 0.5346897986094414, 0.6086426583427091, ...,
        0.0, 0.0, True],
       ...,
       [0.20540142723527344, 0.5346897986094414, -0.23363544586749493,
        ..., 0.0, 0.0, True],
       [0.858360971541139, 1.7939176460974184, -0.795154182007631, ...,
        0.0, 0.0, True],
       [-1.63330247843652, 0.5346897986094414, -0.795154182007631, ...,
        0.0, 0.0, False]], dtype=object)

In [22]:
train_x.shape

(1698, 37)

In [23]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

array([[-0.82143505, -0.72453805, -0.79515418, ...,  0.        ,
         0.        ,  1.        ],
       [ 1.36423643,  0.5346898 ,  0.60864266, ...,  1.        ,
         0.        ,  1.        ],
       [ 0.76805155, -0.72453805,  0.04712392, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 0.03762568, -0.72453805,  0.32788329, ...,  0.        ,
         1.        ,  0.        ],
       [ 1.63771293, -0.72453805, -1.35667292, ...,  0.        ,
         0.        ,  1.        ],
       [-0.57476597, -0.72453805,  0.04712392, ...,  0.        ,
         1.        ,  0.        ]])

In [24]:
test_x.shape

(729, 37)

## Find the Baseline (0.5 point)

In [25]:
# Find majority class
train_target.value_counts()

attendance_binary
1                    880
0                    818
dtype: int64

In [26]:
# Find percentage
train_target.value_counts()/len(train_target)

attendance_binary
1                    0.518257
0                    0.481743
dtype: float64

## Baseline accuracy for the model is 51.83% (share of the majority class, 1, in the training set)
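
The majority-class baseline can also be computed with `DummyClassifier`. This is a minimal sketch that rebuilds a target vector from the class counts printed above (880 ones, 818 zeros) rather than loading the data; `X_placeholder` is a hypothetical stand-in because the dummy model ignores the features:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Target rebuilt from the training class counts shown in the output above.
y_train_counts = np.repeat([1, 0], [880, 818])

# The features are irrelevant to a most_frequent baseline.
X_placeholder = np.zeros((len(y_train_counts), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_train_counts)

baseline_acc = accuracy_score(y_train_counts, baseline.predict(X_placeholder))
print(round(baseline_acc, 6))  # 0.518257
```

Scoring the same fitted dummy on the test target would give the baseline that the test accuracies should be compared against.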

## SVM Model 1:

## SVC(kernel= 'linear')

In [27]:
from sklearn.metrics import accuracy_score

In [28]:
from sklearn.svm import SVC
 
lin_svm1 = SVC(kernel="linear")

lin_svm1.fit(train_x, train_target)


SVC(kernel='linear')

In [29]:
#Predict the train values
train_target_pred = lin_svm1.predict(train_x)

#Train accuracy
accuracy_score(train_target, train_target_pred)

0.8445229681978799

In [30]:
#Predict the test values
test_target_pred = lin_svm1.predict(test_x)

#Test accuracy
accuracy_score(test_target, test_target_pred)

0.8326474622770919

## SVM Model 2:

## SVC(kernel='poly')

In [31]:
from sklearn.svm import SVC

# gamma scales the kernel: for 'poly' it multiplies the dot product, for 'rbf' it controls the width of the bell curve
# gamma='scale' is the default since scikit-learn 0.22, so it could also be omitted

pol_svm2 = SVC(kernel="poly", degree=2, coef0=1, C=100, gamma='scale')

pol_svm2.fit(train_x, train_target)


SVC(C=100, coef0=1, degree=2, kernel='poly')

In [32]:
#Predict the train values
train_target_pred = pol_svm2.predict(train_x)

#Train accuracy
accuracy_score(train_target, train_target_pred)

0.9287396937573617

In [33]:
#Predict the test values
test_target_pred = pol_svm2.predict(test_x)

#Test accuracy
accuracy_score(test_target, test_target_pred)

0.7956104252400549

In [34]:
## Adjusting C value from C=100 to C=1

In [35]:
pol_svm2 = SVC(kernel="poly", degree=2, coef0=1, C=1, gamma='scale')

pol_svm2.fit(train_x, train_target)


SVC(C=1, coef0=1, degree=2, kernel='poly')

In [36]:
#Predict the train values
train_target_pred = pol_svm2.predict(train_x)

#Train accuracy
accuracy_score(train_target, train_target_pred)

0.8722025912838633

In [37]:
#Predict the test values
test_target_pred = pol_svm2.predict(test_x)

#Test accuracy
accuracy_score(test_target, test_target_pred)

0.8203017832647462

In [38]:
## Adjusting C to C=0.1 and coef0=1 to coef0=0.001

In [39]:
pol_svm2 = SVC(kernel="poly", degree=2, coef0=0.001, C=0.1, gamma='scale')

pol_svm2.fit(train_x, train_target)


SVC(C=0.1, coef0=0.001, degree=2, kernel='poly')

In [40]:
#Predict the train values
train_target_pred = pol_svm2.predict(train_x)

#Train accuracy
accuracy_score(train_target, train_target_pred)

0.8451118963486455

In [41]:
#Predict the test values
test_target_pred = pol_svm2.predict(test_x)

#Test accuracy
accuracy_score(test_target, test_target_pred)

0.821673525377229
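
The cells above adjust `C` and `coef0` by hand and compare test accuracy after each change. A common alternative is to tune on cross-validated folds of the training data and touch the test set only once. The sketch below illustrates that with `GridSearchCV` on synthetic data from `make_classification`; the grid values mirror the ones tried above, but the data and the resulting scores are not the notebook's:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed baseball features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate values taken from the manual experiments above.
param_grid = {"C": [0.1, 1, 10, 100], "coef0": [0.001, 1]}

search = GridSearchCV(
    SVC(kernel="poly", degree=2, gamma="scale"),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print(search.best_params_, search.best_score_)

# The held-out test set is scored once, after the parameters are chosen.
test_score = search.score(X_te, y_te)
print(test_score)
```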

## SVM Model 3:

## SVC(kernel = 'rbf')

In [42]:
rbf_svm3 = SVC(kernel="rbf", C=10, gamma='scale')

rbf_svm3.fit(train_x, train_target)


SVC(C=10)

In [43]:
#Predict the train values
train_target_pred = rbf_svm3.predict(train_x)

#Train accuracy
accuracy_score(train_target, train_target_pred)

0.9723203769140165

In [44]:
#Predict the test values
test_target_pred = rbf_svm3.predict(test_x)

#Test accuracy
accuracy_score(test_target, test_target_pred)

0.8065843621399177

In [45]:
## To control overfitting, reduce C from 10 to 0.06

In [65]:
rbf_svm3 = SVC(kernel="rbf", C=0.06, gamma='scale')

rbf_svm3.fit(train_x, train_target)


SVC(C=0.06)

In [66]:
#Predict the train values
train_target_pred = rbf_svm3.predict(train_x)

#Train accuracy
accuracy_score(train_target, train_target_pred)

0.8380447585394581

In [67]:
#Predict the test values
test_target_pred = rbf_svm3.predict(test_x)

#Test accuracy
accuracy_score(test_target, test_target_pred)

0.8148148148148148

## SGD Model 1:

## SGD Model with No Penalty

In [49]:
from sklearn.linear_model import SGDClassifier 

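# tol = stopping criterion
# eta0 = initial learning rate (only used by the 'constant', 'invscaling' and 'adaptive' schedules, not by the default 'optimal')
# penalty = regularization term
# max_iter = maximum number of passes over the training data (i.e., epochs)
# Note: the default loss is 'hinge', so this is a linear SVM rather than a logistic regression, despite the variable name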
np.random.seed(42)
sgd_logreg = SGDClassifier(max_iter=100, penalty=None, eta0=0.1, tol=0.0001) 

sgd_logreg.fit(train_x, train_target)


SGDClassifier(eta0=0.1, max_iter=100, penalty=None, tol=0.0001)

In [50]:
#Predict the train values
train_target_pred = sgd_logreg.predict(train_x)

#Train accuracy
accuracy_score(train_target, train_target_pred)

0.8268551236749117

In [51]:
#Predict the test values
test_target_pred = sgd_logreg.predict(test_x)

#Test accuracy
accuracy_score(test_target, test_target_pred)

0.8257887517146777
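
Two defaults of `SGDClassifier` are easy to overlook: the loss and the learning-rate schedule. A short check with `get_params()`, using the same constructor arguments as the cell above, makes them visible:

```python
from sklearn.linear_model import SGDClassifier

# Same arguments as the notebook's model; nothing is fitted here.
params = SGDClassifier(max_iter=100, penalty=None, eta0=0.1, tol=0.0001).get_params()

# The default loss is the hinge loss (a linear SVM), and the default
# schedule is 'optimal', which does not use eta0.
print(params["loss"], params["learning_rate"])

# For eta0 to take effect, a schedule that reads it has to be selected.
constant_rate = SGDClassifier(learning_rate="constant", eta0=0.1)
print(constant_rate.get_params()["learning_rate"])
```

If a logistic model trained by SGD is what is intended, the `loss` argument would need to be set accordingly; the exact spelling of that option depends on the scikit-learn version.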

## SGD Model 2:

## SGD Model with l2 penalty

In [52]:
from sklearn.linear_model import SGDClassifier 

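# tol = stopping criterion
# eta0 = initial learning rate (only used by the 'constant', 'invscaling' and 'adaptive' schedules, not by the default 'optimal')
# penalty = regularization term
# max_iter = maximum number of passes over the training data (i.e., epochs)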

sgd_logreg = SGDClassifier(max_iter=100, penalty='l2', eta0=0.1, tol=0.0001) 

sgd_logreg.fit(train_x, train_target)


SGDClassifier(eta0=0.1, max_iter=100, tol=0.0001)

In [53]:
#Predict the train values
train_target_pred = sgd_logreg.predict(train_x)

#Train accuracy
accuracy_score(train_target, train_target_pred)

0.8256772673733804

In [54]:
#Predict the test values
test_target_pred = sgd_logreg.predict(test_x)

#Test accuracy
accuracy_score(test_target, test_target_pred)

0.7887517146776406

## LogisticRegression Model:

In [55]:
from sklearn.linear_model import LogisticRegression

# penalty='none' turns regularization off (newer scikit-learn releases spell this penalty=None)
log_reg = LogisticRegression(penalty='none')

log_reg.fit(train_x, train_target)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(penalty='none')

In [56]:
#Predict the train values
train_target_pred = log_reg.predict(train_x)

#Train accuracy
accuracy_score(train_target, train_target_pred)

0.8433451118963486

In [57]:
#Predict the test values
test_target_pred = log_reg.predict(test_x)

#Test accuracy
accuracy_score(test_target, test_target_pred)

0.8271604938271605
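
The convergence message printed when fitting suggests raising `max_iter`. A minimal sketch of that remedy follows, on synthetic data and with the default L2 penalty (both are assumptions, chosen to keep the example independent of the scikit-learn version); `n_iter_` reports how many iterations the solver actually used:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Allow more solver iterations than the default of 100.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# If this is well below max_iter, the solver stopped on its tolerance.
print(clf.n_iter_)
```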

## L2 Regularization 

In [59]:
log_reg = LogisticRegression(solver='liblinear', penalty='l2')

log_reg.fit(train_x, train_target)


LogisticRegression(solver='liblinear')

In [60]:
#Predict the train values
train_target_pred = log_reg.predict(train_x)

#Train accuracy
accuracy_score(train_target, train_target_pred)

0.839811542991755

In [61]:
#Predict the test values
test_target_pred = log_reg.predict(test_x)

#Test accuracy
accuracy_score(test_target, test_target_pred)

0.831275720164609

## List the train and test values of each model:

## Which model performs the best and why?  How does it compare to baseline? 

Hint: The best model is the one that has the highest TEST score (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.

## Is there any evidence of overfitting in the best model, why or why not? If there is, what steps were taken? 

## Is there any evidence of overfitting in the other models (besides the best model), why or why not? If there is, what steps were taken about it? 
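
One way to approach the questions above is to collect the accuracies printed in the earlier cells into a single table. The sketch below copies those recorded values by hand (the model labels are informal names chosen here), ranks the models by test accuracy, and adds the train-minus-test gap as a rough overfitting indicator:

```python
import pandas as pd

# (train accuracy, test accuracy) as printed in the cells above.
recorded = {
    "SVC linear": (0.8445229681978799, 0.8326474622770919),
    "SVC poly, C=100, coef0=1": (0.9287396937573617, 0.7956104252400549),
    "SVC poly, C=1, coef0=1": (0.8722025912838633, 0.8203017832647462),
    "SVC poly, C=0.1, coef0=0.001": (0.8451118963486455, 0.821673525377229),
    "SVC rbf, C=10": (0.9723203769140165, 0.8065843621399177),
    "SVC rbf, C=0.06": (0.8380447585394581, 0.8148148148148148),
    "SGD, no penalty": (0.8268551236749117, 0.8257887517146777),
    "SGD, l2 penalty": (0.8256772673733804, 0.7887517146776406),
    "LogisticRegression, no penalty": (0.8433451118963486, 0.8271604938271605),
    "LogisticRegression, l2 (liblinear)": (0.839811542991755, 0.831275720164609),
}

summary = pd.DataFrame.from_dict(
    recorded, orient="index", columns=["train_acc", "test_acc"]
)
summary["gap"] = summary["train_acc"] - summary["test_acc"]
summary = summary.sort_values("test_acc", ascending=False)

best_model = summary.index[0]
largest_gap_model = summary["gap"].idxmax()
baseline_acc = 880 / 1698  # majority-class share in the training set

print(summary.round(4))
print("Best by test accuracy:", best_model)
print("Largest train/test gap:", largest_gap_model)
print("Baseline:", round(baseline_acc, 4))
```

On these recorded numbers the linear SVC has the highest test accuracy, and the largest gaps belong to the RBF model with `C=10` and the polynomial model with `C=100`, the two configurations whose `C` was lowered afterwards.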