# **NFL Spread Analysis**
This notebook will attempt to classify spread winners using a Linear SVC classification model (1 = favorite win, 0 = underdog win). It will also attempt to predict the score of each game using a ridge regression model. Lastly, the score and spread predictions will be compared, since a predicted score also implies a spread prediction.

## 1. Problem
Predict the score of NFL games and classify the winner of the spread for each game
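
As a rough illustration of how a predicted score implies a spread outcome, the sketch below turns a pair of scores and the favorite's line into a label. The function name `implied_cover_label` is hypothetical, and the `2` label for a push follows the convention used in the classification cell later in this notebook.

```python
def implied_cover_label(fav_score, dog_score, spread_favorite):
    """Return 1 if the favorite covers, 0 if it does not, 2 on a push.

    spread_favorite is the favorite's (negative) line, e.g. -3.0.
    """
    margin = fav_score - dog_score   # points the favorite won by
    line = abs(spread_favorite)      # points the favorite must win by
    if margin > line:
        return 1
    if margin < line:
        return 0
    return 2


# Favorite wins 33-14 against a -13.5 line, so it covers
print(implied_cover_label(33, 14, -13.5))
```

The same function could be applied to scores predicted by the regression model to compare against the classifier's output.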

## 2. Data
Spread data from 1967-2020
https://www.kaggle.com/tobycrabtree/nfl-scores-and-betting-data?select=spreadspoke_scores.csv

## 3. Features
* Week #
* Playoff or regular season game
* Home team
* Away team
* Favorite team to win
* Spread of the favorite
* Over under line
* Stadium
* Neutrality of stadium
* Temperature
* Wind speed (mph)
* Humidity
* Stadium details

## 4. Evaluation
Evaluate on a held-out test set from a train/test split of the data.
Predictions will also be made on the 2021 season games in real time.

## 5. Procedure
1. Read in NFL spread data <br>
2. Get data ready
  - Remove games that have no spread values
  - Remove teams that are not in the current 32 team league 
  - Convert team names to team IDs to match the spread favorites 
  - Convert text to numeric values using one hot encoder
  - Convert spread to a positive value 
  - Calculate and add a favorite spread win column <br>
3. Feature scale necessary items <br>
4. Split data into training and testing data <br>
5. Fit the Linear SVC Model <br>
6. Fit the ridge regression model <br>
7. Evaluate findings using prediction scores and confusion matrices <br>
8. Improve the model by tuning hyperparameters
9. Save the model
10. Make visuals :)
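
Step 6 is not reached in the cells below, so here is a minimal sketch of what fitting a ridge regression for two score targets might look like. The data is synthetic (the arrays `X_demo` and `y_demo` and the coefficients are made up for illustration), so the numbers say nothing about real NFL games.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic features and two targets standing in for home and away scores
X_demo = rng.normal(size=(300, 4))
true_coef = np.array([[3.0, -1.0], [0.5, 2.0], [0.0, 0.0], [-2.0, 1.0]])
y_demo = 21 + X_demo @ true_coef + rng.normal(scale=0.5, size=(300, 2))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Ridge handles multi-output targets, so both scores are fit in one model
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)
score_preds = ridge.predict(X_te)   # shape: (n_games, 2)
r2 = ridge.score(X_te, y_te)        # R^2 averaged over both targets
print(score_preds.shape, round(r2, 3))
```

The `alpha` value here is an arbitrary starting point and would be a natural candidate for the hyperparameter tuning in step 8.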

### Import libraries

In [408]:
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sb
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.svm import LinearSVC

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

### Read in Data

In [356]:
spread_data = pd.read_csv('data/spreadspoke_scores.csv')
spread_data.head()

Unnamed: 0,schedule_date,schedule_season,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_humidity,weather_detail,Unnamed: 17
0,9/2/1966,1966,1,False,Miami Dolphins,14.0,23.0,Oakland Raiders,,,,Orange Bowl,False,83.0,6.0,71,,
1,9/3/1966,1966,1,False,Houston Oilers,45.0,7.0,Denver Broncos,,,,Rice Stadium,False,81.0,7.0,70,,
2,9/4/1966,1966,1,False,San Diego Chargers,27.0,7.0,Buffalo Bills,,,,Balboa Stadium,False,70.0,7.0,82,,
3,9/9/1966,1966,2,False,Miami Dolphins,14.0,19.0,New York Jets,,,,Orange Bowl,False,82.0,11.0,78,,
4,9/10/1966,1966,1,False,Green Bay Packers,24.0,3.0,Baltimore Colts,,,,Lambeau Field,False,64.0,8.0,62,,


### Get Data Ready for Processing

In [357]:
# Drop columns that aren't needed for evaluation
spread_data.drop(columns=['schedule_date', 'schedule_season', 'weather_humidity', 'Unnamed: 17'], inplace=True)
spread_data.dtypes

schedule_week           object
schedule_playoff          bool
team_home               object
score_home             float64
score_away             float64
team_away               object
team_favorite_id        object
spread_favorite        float64
over_under_line         object
stadium                 object
stadium_neutral         object
weather_temperature    float64
weather_wind_mph       float64
weather_detail          object
dtype: object
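
The `over_under_line` column is read as `object` (strings) rather than a number. One possible way to handle this, sketched on a made-up sample rather than the real file, is `pd.to_numeric` with `errors='coerce'`, which turns blank or unparseable entries into `NaN` so they can be dropped with the other missing values.

```python
import pandas as pd

# Hypothetical sample mimicking the raw column: numeric strings,
# a blank entry, and a missing value
raw = pd.Series(['43', '36.5', ' ', None], name='over_under_line')

# Unparseable entries become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric)
```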

In [358]:
spread_data.isna().sum() # Check columns with missing data

schedule_week              0
schedule_playoff           0
team_home                  0
score_home               208
score_away               208
team_away                  0
team_favorite_id        2719
spread_favorite         2687
over_under_line         2697
stadium                    0
stadium_neutral            1
weather_temperature     1097
weather_wind_mph        1114
weather_detail         10412
dtype: int64

In [359]:
# Fill missing temperature and winds with averages
# (assign back instead of using inplace on a column, which newer pandas versions may not apply to the DataFrame)
spread_data['weather_temperature'] = spread_data['weather_temperature'].fillna(spread_data['weather_temperature'].mean())
spread_data['weather_wind_mph'] = spread_data['weather_wind_mph'].fillna(spread_data['weather_wind_mph'].mean())

# Fill blank weather detail spots with 'stadium'
spread_data['weather_detail'] = spread_data['weather_detail'].fillna('stadium')

spread_data.head(10)

Unnamed: 0,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_detail
0,1,False,Miami Dolphins,14.0,23.0,Oakland Raiders,,,,Orange Bowl,False,83.0,6.0,stadium
1,1,False,Houston Oilers,45.0,7.0,Denver Broncos,,,,Rice Stadium,False,81.0,7.0,stadium
2,1,False,San Diego Chargers,27.0,7.0,Buffalo Bills,,,,Balboa Stadium,False,70.0,7.0,stadium
3,2,False,Miami Dolphins,14.0,19.0,New York Jets,,,,Orange Bowl,False,82.0,11.0,stadium
4,1,False,Green Bay Packers,24.0,3.0,Baltimore Colts,,,,Lambeau Field,False,64.0,8.0,stadium
5,2,False,Houston Oilers,31.0,0.0,Oakland Raiders,,,,Rice Stadium,False,77.0,6.0,stadium
6,2,False,San Diego Chargers,24.0,0.0,New England Patriots,,,,Balboa Stadium,False,69.0,9.0,stadium
7,1,False,Atlanta Falcons,14.0,19.0,Los Angeles Rams,,,,Atlanta-Fulton County Stadium,False,71.0,7.0,stadium
8,2,False,Buffalo Bills,20.0,42.0,Kansas City Chiefs,,,,War Memorial Stadium,False,63.0,11.0,stadium
9,1,False,Detroit Lions,14.0,3.0,Chicago Bears,,,,Tiger Stadium,False,67.0,7.0,stadium


In [360]:
spread_data.isna().sum() # Check columns with missing data

schedule_week             0
schedule_playoff          0
team_home                 0
score_home              208
score_away              208
team_away                 0
team_favorite_id       2719
spread_favorite        2687
over_under_line        2697
stadium                   0
stadium_neutral           1
weather_temperature       0
weather_wind_mph          0
weather_detail            0
dtype: int64

In [361]:
# Remove rows with any remaining missing value (home score, away score, favorite id, spread favorite, over under line or stadium neutrality)
spread_data.dropna(inplace=True)

# Remove rows with favorite_ids as 'PICK' and rows with a blank over under line
spread_data = spread_data[spread_data['team_favorite_id'] != 'PICK']
spread_data = spread_data[spread_data['over_under_line'] != ' ']

spread_data.head(10)

Unnamed: 0,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_detail
350,Superbowl,True,Green Bay Packers,33.0,14.0,Oakland Raiders,GB,-13.5,43,Orange Bowl,True,60.0,12.0,stadium
538,Superbowl,True,Baltimore Colts,7.0,16.0,New York Jets,IND,-18.0,40,Orange Bowl,True,66.0,12.0,Rain
727,Superbowl,True,Kansas City Chiefs,23.0,7.0,Minnesota Vikings,MIN,-12.0,39,Tulane Stadium,True,55.0,14.0,Rain
916,Superbowl,True,Baltimore Colts,16.0,13.0,Dallas Cowboys,IND,-2.5,36,Orange Bowl,True,59.0,11.0,stadium
1105,Superbowl,True,Dallas Cowboys,24.0,3.0,Miami Dolphins,DAL,-6.0,34,Tulane Stadium,True,34.0,18.0,stadium
1294,Superbowl,True,Miami Dolphins,14.0,7.0,Washington Redskins,MIA,-1.0,33,Los Angeles Memorial Coliseum,True,64.0,7.0,stadium
1483,Superbowl,True,Miami Dolphins,24.0,7.0,Minnesota Vikings,MIA,-6.5,33,Rice Stadium,True,47.0,7.0,stadium
1672,Superbowl,True,Minnesota Vikings,6.0,16.0,Pittsburgh Steelers,PIT,-3.0,33,Tulane Stadium,True,51.0,17.0,stadium
1861,Superbowl,True,Dallas Cowboys,17.0,21.0,Pittsburgh Steelers,PIT,-7.0,36,Orange Bowl,True,49.0,18.0,stadium
2064,Superbowl,True,Minnesota Vikings,14.0,32.0,Oakland Raiders,LVR,-4.0,38,Rose Bowl,True,52.0,6.0,stadium


In [362]:
len(spread_data) # Check total length of the new data

10253

In [363]:
spread_data['over_under_line'].unique()

array(['43', '40', '39', '36', '34', '33', '38', '37', '30', '31', '31.5',
       '36.5', '32', '41', '42.5', '35', '40.5', '42', '35.5', '44', '45',
       '44.5', '39.5', '49', '47', '50', '48', '46', '37.5', '51', '54',
       '38.5', '34.5', '49.5', '43.5', '45.5', '52', '41.5', '47.5',
       '46.5', '52.5', '53', '48.5', '54.5', '55', '53.5', '51.5', '50.5',
       '56', '33.5', '36.6', '32.5', '28', '29.5', '55.5', '58.5', '63',
       '59.5', '58', '56.5', '57', '60', '30.5', '59', '57.5', '63.5',
       '61.5'], dtype=object)

In [364]:
spread_data['team_favorite_id'].unique()

array(['GB', 'IND', 'MIN', 'DAL', 'MIA', 'PIT', 'LVR', 'TB', 'CHI', 'DEN',
       'KC', 'LAR', 'NO', 'NYJ', 'PHI', 'SEA', 'TEN', 'CIN', 'NE', 'ARI',
       'LAC', 'CLE', 'WAS', 'BUF', 'ATL', 'DET', 'NYG', 'SF', 'BAL',
       'JAX', 'CAR', 'HOU'], dtype=object)

In [365]:
# List home team names, including old names that need mapping to current teams
spread_data['team_home'].unique()

array(['Green Bay Packers', 'Baltimore Colts', 'Kansas City Chiefs',
       'Dallas Cowboys', 'Miami Dolphins', 'Minnesota Vikings',
       'Tampa Bay Buccaneers', 'Buffalo Bills', 'Chicago Bears',
       'Denver Broncos', 'Los Angeles Rams', 'New Orleans Saints',
       'New York Jets', 'Philadelphia Eagles', 'Seattle Seahawks',
       'St. Louis Cardinals', 'Washington Redskins',
       'New England Patriots', 'New York Giants', 'Pittsburgh Steelers',
       'San Diego Chargers', 'San Francisco 49ers', 'Atlanta Falcons',
       'Cincinnati Bengals', 'Cleveland Browns', 'Houston Oilers',
       'Detroit Lions', 'Oakland Raiders', 'Los Angeles Raiders',
       'Indianapolis Colts', 'Phoenix Cardinals', 'Arizona Cardinals',
       'Jacksonville Jaguars', 'St. Louis Rams', 'Carolina Panthers',
       'Baltimore Ravens', 'Tennessee Oilers', 'Tennessee Titans',
       'Houston Texans', 'Los Angeles Chargers',
       'Washington Football Team', 'Las Vegas Raiders'], dtype=object)

In [366]:
# Convert old team names to new 32 team roster names
team_name_map = {
    'Baltimore Colts': 'Indianapolis Colts',
    'St. Louis Cardinals': 'Arizona Cardinals',
    'Phoenix Cardinals': 'Arizona Cardinals',
    'Washington Redskins': 'Washington Football Team',
    'San Diego Chargers': 'Los Angeles Chargers',
    'Houston Oilers': 'Tennessee Titans',
    'Tennessee Oilers': 'Tennessee Titans',
    'St. Louis Rams': 'Los Angeles Rams',
    # Intentionally misspelled so the sorted names line up with the sorted ids (LVR, SF)
    'Oakland Raiders': 'Los Vegas Raiders',
    'Los Angeles Raiders': 'Los Vegas Raiders',
    'Las Vegas Raiders': 'Los Vegas Raiders',
    'San Francisco 49ers': 'Sfrancisco 49ers',
}

# Apply the same mapping to both team columns
for col in ['team_home', 'team_away']:
    spread_data[col] = spread_data[col].replace(team_name_map)

In [367]:
spread_data['team_home'].nunique()

32

In [368]:
spread_data['team_away'].nunique()

32

In [369]:
# Create numpy arrays to order ids and team names
team_names = np.array(spread_data['team_home'].unique())
team_ids = np.array(spread_data['team_favorite_id'].unique())

In [370]:
# Merge and transpose the two arrays to create pairs for ids and names
team_names = np.sort(team_names)
team_ids = np.sort(team_ids)
team_array = np.array((team_ids, team_names)).T
print(team_array)

[['ARI' 'Arizona Cardinals']
 ['ATL' 'Atlanta Falcons']
 ['BAL' 'Baltimore Ravens']
 ['BUF' 'Buffalo Bills']
 ['CAR' 'Carolina Panthers']
 ['CHI' 'Chicago Bears']
 ['CIN' 'Cincinnati Bengals']
 ['CLE' 'Cleveland Browns']
 ['DAL' 'Dallas Cowboys']
 ['DEN' 'Denver Broncos']
 ['DET' 'Detroit Lions']
 ['GB' 'Green Bay Packers']
 ['HOU' 'Houston Texans']
 ['IND' 'Indianapolis Colts']
 ['JAX' 'Jacksonville Jaguars']
 ['KC' 'Kansas City Chiefs']
 ['LAC' 'Los Angeles Chargers']
 ['LAR' 'Los Angeles Rams']
 ['LVR' 'Los Vegas Raiders']
 ['MIA' 'Miami Dolphins']
 ['MIN' 'Minnesota Vikings']
 ['NE' 'New England Patriots']
 ['NO' 'New Orleans Saints']
 ['NYG' 'New York Giants']
 ['NYJ' 'New York Jets']
 ['PHI' 'Philadelphia Eagles']
 ['PIT' 'Pittsburgh Steelers']
 ['SEA' 'Seattle Seahawks']
 ['SF' 'Sfrancisco 49ers']
 ['TB' 'Tampa Bay Buccaneers']
 ['TEN' 'Tennessee Titans']
 ['WAS' 'Washington Football Team']]
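
Pairing names and ids by sorting both arrays only works when the two alphabetical orders happen to agree, which is why two names are misspelled above. A possible alternative, sketched here on a three-team subset, is an explicit dictionary, which needs no sort-order trick. The name `TEAM_NAME_TO_ID` and the sample `games` frame are illustrative only.

```python
import pandas as pd

# Illustrative subset only; a full mapping would list all 32 teams
TEAM_NAME_TO_ID = {
    'Green Bay Packers': 'GB',
    'Las Vegas Raiders': 'LVR',
    'San Francisco 49ers': 'SF',
}

games = pd.DataFrame({
    'team_home': ['Green Bay Packers', 'San Francisco 49ers'],
    'team_away': ['Las Vegas Raiders', 'Green Bay Packers'],
})

# map() looks each name up directly, independent of any ordering
for col in ['team_home', 'team_away']:
    games[col] = games[col].map(TEAM_NAME_TO_ID)

print(games)
```

With `map`, any name missing from the dictionary becomes `NaN`, which makes unmapped teams easy to spot.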


In [371]:
# Replace home and away team names with their IDs
name_to_id = {t_name: t_id for t_id, t_name in team_array}
for col in ['team_home', 'team_away']:
    spread_data[col] = spread_data[col].replace(name_to_id)

In [372]:
spread_data.head()

Unnamed: 0,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_detail
350,Superbowl,True,GB,33.0,14.0,LVR,GB,-13.5,43,Orange Bowl,True,60.0,12.0,stadium
538,Superbowl,True,IND,7.0,16.0,NYJ,IND,-18.0,40,Orange Bowl,True,66.0,12.0,Rain
727,Superbowl,True,KC,23.0,7.0,MIN,MIN,-12.0,39,Tulane Stadium,True,55.0,14.0,Rain
916,Superbowl,True,IND,16.0,13.0,DAL,IND,-2.5,36,Orange Bowl,True,59.0,11.0,stadium
1105,Superbowl,True,DAL,24.0,3.0,MIA,DAL,-6.0,34,Tulane Stadium,True,34.0,18.0,stadium


In [373]:
# Replace wildcard, division, conference and superbowl with 19, 20, 21, 22
playoff_weeks = {'WildCard': '19', 'Wildcard': '19', 'Division': '20',
                 'Conference': '21', 'SuperBowl': '22', 'Superbowl': '22'}
spread_data['schedule_week'] = spread_data['schedule_week'].replace(playoff_weeks)

In [374]:
spread_data['schedule_week'].unique()

array(['22', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11',
       '12', '13', '14', '15', '16', '17', '19', '20', '21', '18'],
      dtype=object)
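
The week values above are still strings. If integer weeks are wanted, one option is a small helper that handles the playoff labels and the numeric strings together. This is a sketch on sample values, and `week_to_int` and `PLAYOFF_WEEKS` are hypothetical names.

```python
import pandas as pd

# Playoff rounds mapped to the week numbers used in this notebook
PLAYOFF_WEEKS = {'wildcard': 19, 'division': 20, 'conference': 21, 'superbowl': 22}


def week_to_int(week):
    """Convert a week label such as '3' or 'WildCard' to an integer."""
    key = str(week).strip().lower()   # case-insensitive match on round names
    return PLAYOFF_WEEKS[key] if key in PLAYOFF_WEEKS else int(key)


weeks = pd.Series(['1', 'WildCard', 'Superbowl', '18'])
week_numbers = weeks.map(week_to_int)
print(week_numbers.tolist())
```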

In [375]:
# Add an 'actual spread' column (could be positive or negative)

# Subtract the favored team's score from the underdog's score (negative when the favorite wins)

home_teams = np.array(spread_data['team_home'])
away_teams = np.array(spread_data['team_away'])
home_scores = np.array(spread_data['score_home'])
away_scores = np.array(spread_data['score_away'])
favorites = np.array(spread_data['team_favorite_id'])

favorites_array = np.array((home_teams, away_teams, home_scores, away_scores, favorites)).T

actual_spread_array = []
actual_spread = 0
error_counter = 0

for ht_, at_, hs_, as_, f_ in favorites_array:
    if f_ == ht_:
        actual_spread = as_ - hs_
    elif f_ == at_:
        actual_spread = hs_ - as_
    else:
        print('Error: Favorite does not match either team')
        print(ht_, at_, f_)
        error_counter += 1
        actual_spread = np.nan  # don't reuse the previous game's value
    actual_spread_array.append(actual_spread)

print(error_counter)
actual_spread_array = np.array((actual_spread_array))
actual_spread_array

0


array([-19.,   9.,  16., ...,   5., -14.,  22.])
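
The loop above can also be written without iterating row by row. The sketch below uses `np.select` on a three-game sample copied from the first rows of the table; games where the favorite matches neither team get `NaN`.

```python
import numpy as np
import pandas as pd

games = pd.DataFrame({
    'team_home': ['GB', 'IND', 'KC'],
    'team_away': ['LVR', 'NYJ', 'MIN'],
    'score_home': [33.0, 7.0, 23.0],
    'score_away': [14.0, 16.0, 7.0],
    'team_favorite_id': ['GB', 'IND', 'MIN'],
})

home_margin = games['score_home'] - games['score_away']
conditions = [
    games['team_favorite_id'] == games['team_home'],   # favorite is at home
    games['team_favorite_id'] == games['team_away'],   # favorite is away
]
# Underdog score minus favorite score in both cases
games['actual_spread'] = np.select(
    conditions, [-home_margin, home_margin], default=np.nan
)
print(games['actual_spread'].tolist())
```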

In [376]:
# Add actual spread to data frames
spread_data['actual_spread'] = actual_spread_array
spread_data.head()

Unnamed: 0,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_detail,actual_spread
350,22,True,GB,33.0,14.0,LVR,GB,-13.5,43,Orange Bowl,True,60.0,12.0,stadium,-19.0
538,22,True,IND,7.0,16.0,NYJ,IND,-18.0,40,Orange Bowl,True,66.0,12.0,Rain,9.0
727,22,True,KC,23.0,7.0,MIN,MIN,-12.0,39,Tulane Stadium,True,55.0,14.0,Rain,16.0
916,22,True,IND,16.0,13.0,DAL,IND,-2.5,36,Orange Bowl,True,59.0,11.0,stadium,-3.0
1105,22,True,DAL,24.0,3.0,MIA,DAL,-6.0,34,Tulane Stadium,True,34.0,18.0,stadium,-21.0


In [377]:
# Add classification column (Did Favorite Cover? 1 = yes, 0 = no, 2 = push)
favorite_spread_array = np.array(spread_data['spread_favorite'])
cover_array = np.array((favorite_spread_array, actual_spread_array)).T

did_favorite_cover = []
error_counter = 0

for fav_spr, act_spr in cover_array:
    if act_spr < fav_spr:
        did_favorite_cover.append(1) # Did Cover
    elif act_spr > fav_spr:
        did_favorite_cover.append(0) # Did not cover
    else:
        did_favorite_cover.append(2) # Push

did_favorite_cover = np.array((did_favorite_cover))
did_favorite_cover

array([1, 0, 0, ..., 0, 1, 0])
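
The same three-way labeling can be sketched in vectorized form. The arrays below are small made-up samples; the third game is a push because the actual spread equals the line.

```python
import numpy as np

spread_favorite = np.array([-13.5, -18.0, -3.0])   # favorite's line
actual_spread = np.array([-19.0, 9.0, -3.0])       # underdog minus favorite

labels = np.select(
    [actual_spread < spread_favorite,    # favorite won by more than the line
     actual_spread > spread_favorite],   # favorite fell short of the line
    [1, 0],
    default=2,                           # equal to the line: push
)
print(labels.tolist())
```

Because pushes are rare, an alternative worth considering is dropping the label-2 rows and training a purely binary classifier.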

In [378]:
spread_data['did_favorite_cover'] = did_favorite_cover
spread_data.head(20)

Unnamed: 0,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_detail,actual_spread,did_favorite_cover
350,22,True,GB,33.0,14.0,LVR,GB,-13.5,43.0,Orange Bowl,True,60.0,12.0,stadium,-19.0,1
538,22,True,IND,7.0,16.0,NYJ,IND,-18.0,40.0,Orange Bowl,True,66.0,12.0,Rain,9.0,0
727,22,True,KC,23.0,7.0,MIN,MIN,-12.0,39.0,Tulane Stadium,True,55.0,14.0,Rain,16.0,0
916,22,True,IND,16.0,13.0,DAL,IND,-2.5,36.0,Orange Bowl,True,59.0,11.0,stadium,-3.0,1
1105,22,True,DAL,24.0,3.0,MIA,DAL,-6.0,34.0,Tulane Stadium,True,34.0,18.0,stadium,-21.0,1
1294,22,True,MIA,14.0,7.0,WAS,MIA,-1.0,33.0,Los Angeles Memorial Coliseum,True,64.0,7.0,stadium,-7.0,1
1483,22,True,MIA,24.0,7.0,MIN,MIA,-6.5,33.0,Rice Stadium,True,47.0,7.0,stadium,-17.0,1
1672,22,True,MIN,6.0,16.0,PIT,PIT,-3.0,33.0,Tulane Stadium,True,51.0,17.0,stadium,-10.0,1
1861,22,True,DAL,17.0,21.0,PIT,PIT,-7.0,36.0,Orange Bowl,True,49.0,18.0,stadium,-4.0,0
2064,22,True,MIN,14.0,32.0,LVR,LVR,-4.0,38.0,Rose Bowl,True,52.0,6.0,stadium,-18.0,1


### Split data into features and labels

In [390]:
X = spread_data.drop(['score_home', 'score_away','actual_spread','did_favorite_cover', 'stadium'], axis=1)
y = spread_data['did_favorite_cover']

In [391]:
team_categories = ['team_home', 'team_away', 'team_favorite_id']
categorical_features = ['schedule_playoff', 'stadium_neutral', 'weather_detail'] + team_categories

In [392]:
X['over_under_line'].unique()

array(['43', '40', '39', '36', '34', '33', '38', '37', '30', '31', '31.5',
       '36.5', '32', '41', '42.5', '35', '40.5', '42', '35.5', '44', '45',
       '44.5', '39.5', '49', '47', '50', '48', '46', '37.5', '51', '54',
       '38.5', '34.5', '49.5', '43.5', '45.5', '52', '41.5', '47.5',
       '46.5', '52.5', '53', '48.5', '54.5', '55', '53.5', '51.5', '50.5',
       '56', '33.5', '36.6', '32.5', '28', '29.5', '55.5', '58.5', '63',
       '59.5', '58', '56.5', '57', '60', '30.5', '59', '57.5', '63.5',
       '61.5'], dtype=object)

### Label Encode the team names

In [382]:
#le = LabelEncoder()
#X[team_categories] = X[team_categories].apply(lambda col:le.fit_transform(col))
#X.head()

Unnamed: 0,schedule_week,schedule_playoff,team_home,team_away,team_favorite_id,spread_favorite,over_under_line,stadium_neutral,weather_temperature,weather_wind_mph,weather_detail
350,22,True,11,18,11,-13.5,43,True,60.0,12.0,stadium
538,22,True,13,24,13,-18.0,40,True,66.0,12.0,Rain
727,22,True,15,20,20,-12.0,39,True,55.0,14.0,Rain
916,22,True,13,8,13,-2.5,36,True,59.0,11.0,stadium
1105,22,True,8,19,8,-6.0,34,True,34.0,18.0,stadium


### Encode categories in X data

In [413]:
pd.set_option('display.max_columns', None)
enc = OneHotEncoder()
transformer = ColumnTransformer([('enc', enc, categorical_features)], remainder='passthrough')
transformed_X = transformer.fit_transform(X)
pd.DataFrame(transformed_X)

Unnamed: 0,0
0,"(0, 1)\t1.0\n (0, 3)\t1.0\n (0, 12)\t1.0\n..."
1,"(0, 1)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ..."
2,"(0, 1)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ..."
3,"(0, 1)\t1.0\n (0, 3)\t1.0\n (0, 12)\t1.0\n..."
4,"(0, 1)\t1.0\n (0, 3)\t1.0\n (0, 12)\t1.0\n..."
...,...
10248,"(0, 1)\t1.0\n (0, 2)\t1.0\n (0, 12)\t1.0\n..."
10249,"(0, 1)\t1.0\n (0, 2)\t1.0\n (0, 4)\t1.0\n ..."
10250,"(0, 1)\t1.0\n (0, 2)\t1.0\n (0, 12)\t1.0\n..."
10251,"(0, 1)\t1.0\n (0, 2)\t1.0\n (0, 12)\t1.0\n..."


In [396]:
# Maybe feature scale later if it takes too long

### Train Test Split

In [406]:
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
# Unscaled for now; a scaled alternative would be make_pipeline(StandardScaler(), LinearSVC(random_state=42, tol=1e-05))
lin_svc_clf = LinearSVC(random_state=42, tol=1e-05, max_iter=2000)
lin_svc_clf.fit(X_train, y_train)

LinearSVC(max_iter=2000, random_state=42, tol=1e-05)
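
If feature scaling is added later, note that `StandardScaler` cannot center sparse input, so `with_mean=False` is needed when the encoded matrix is sparse. The sketch below shows that pipeline on synthetic sparse data (`X_demo` and `y_demo` are made up), not on the spread data.

```python
import numpy as np
from scipy import sparse
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic sparse features; the label depends only on the first column
X_demo = sparse.csr_matrix(rng.normal(size=(200, 5)))
y_demo = (X_demo.toarray()[:, 0] > 0).astype(int)

# with_mean=False scales by standard deviation without centering,
# which keeps the matrix sparse
scaled_svc = make_pipeline(
    StandardScaler(with_mean=False),
    LinearSVC(random_state=42, max_iter=2000),
)
scaled_svc.fit(X_demo, y_demo)
train_accuracy = scaled_svc.score(X_demo, y_demo)
print(round(train_accuracy, 3))
```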

In [407]:
y_preds = lin_svc_clf.predict(X_test)

In [409]:
confusion_matrix(y_test, y_preds)

array([[604, 314, 110],
       [571, 302,  97],
       [ 26,  22,   5]], dtype=int64)

In [410]:
pd.crosstab(y_test, y_preds, rownames=['Actual Labels'], colnames=['Predicted Labels'])

Predicted Labels,0,1,2
Actual Labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,604,314,110
1,571,302,97
2,26,22,5


In [411]:
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.50      0.59      0.54      1028
           1       0.47      0.31      0.38       970
           2       0.02      0.09      0.04        53

    accuracy                           0.44      2051
   macro avg       0.33      0.33      0.32      2051
weighted avg       0.48      0.44      0.45      2051
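
To put the 0.44 accuracy in context, it helps to compare it with always predicting the most common class. The sketch below recomputes both numbers from the confusion matrix printed above.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = actual, cols = predicted)
cm = np.array([[604, 314, 110],
               [571, 302,  97],
               [ 26,  22,   5]])

total = cm.sum()
accuracy = np.trace(cm) / total                   # correct predictions on the diagonal
majority_baseline = cm.sum(axis=1).max() / total  # always predict the largest class

print(f"accuracy: {accuracy:.3f}")
print(f"majority-class baseline: {majority_baseline:.3f}")
```

On this split the model scores below the majority-class baseline, so the current features and untuned settings do not yet show predictive value for the spread.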

