# Innovation 3: Classification Models of Playoff Berths
Our primary goal is to leverage the derived team chemistry metric within models that classify which teams will reach the playoffs.
Using KNN, SVM, and naive Bayes algorithms, we will first create baseline classification models without team chemistry, then include the derived metric to determine its added predictive power.


Datasets used:

**connections_final_roster** (produced by the data cleaning done for the visualization) contains connections only for the players on each team's roster at the end of the season.
This dataset has the following team metric attributes:

* "games_together" = games played together that season,

* "minutes_combined" = minutes of player 1 + minutes of player 2 in the games they both played that season,

* "total_seasons" = number of seasons played together before that season, on any team (the current season is not counted),

* "total_games" = number of games played together before that season,

* "total_minutes" = sum of combined minutes before that season.
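
The three `total_*` attributes accumulate over earlier seasons only. The sketch below illustrates that definition on made-up numbers, assuming one row per player pair per season. The column names match the dataset, but the values and the derivation are illustrative and are not the original cleaning code.

```python
import pandas as pd

# Hypothetical pair history: one row per (pair, season); values are made up.
pairs = pd.DataFrame({
    "player_1_id": [1, 1, 1],
    "player_2_id": [2, 2, 2],
    "SEASON": [2015, 2016, 2017],
    "games_together": [32, 92, 50],
    "minutes_combined": [958, 3556, 1800],
}).sort_values(["player_1_id", "player_2_id", "SEASON"])

by_pair = pairs.groupby(["player_1_id", "player_2_id"])

# Running total minus the current season gives the "prior to that season" value.
pairs["total_games"] = by_pair["games_together"].cumsum() - pairs["games_together"]
pairs["total_minutes"] = by_pair["minutes_combined"].cumsum() - pairs["minutes_combined"]
# cumcount() numbers each pair's seasons 0, 1, 2, ..., i.e. seasons already played together.
pairs["total_seasons"] = by_pair.cumcount()

print(pairs)
```

A pair in its first season together therefore has zeros in all three columns, which matches rows such as Kris Humphries and Malcolm Delaney in the sample output further down.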

**season_outcomes** (also produced by the visualization data cleaning) records each team's playoff outcome for a given season.

**NBA Team Stats** https://www.kaggle.com/datasets/supremeleaf/nba-regular-season-team-stats-2001-2023
Team statistics such as PPG and FGM. This dataset supplies the features for the baseline model dataset.



# Aggregating chemistry metrics by team and season

In [None]:
import numpy as np
import pandas as pd


In [None]:
# connections_final_roster.csv
!gdown --id 10smJ3VaJ55UycUewJRk__EsP9R3ms9_d
# season_outcomes.csv
!gdown --id 1GK4X5tMIGxzlqv4fr4DoZsOW8zbmoAy5
# NBA Team Stats.csv
!gdown --id 15sdNQA_QPDQp92QrV-AD-b_0MHatCCeW


Downloading...
From: https://drive.google.com/uc?id=10smJ3VaJ55UycUewJRk__EsP9R3ms9_d
To: /content/connections_final_roster.csv
100% 5.51M/5.51M [00:00<00:00, 62.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1GK4X5tMIGxzlqv4fr4DoZsOW8zbmoAy5
To: /content/season_outcomes.csv
100% 125k/125k [00:00<00:00, 68.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=15sdNQA_QPDQp92QrV-AD-b_0MHatCCeW
To: /content/NBA Team Stats.csv
100% 75.4k/75.4k [00:00<00:00, 10.0MB/s]


In [None]:
!ls

 connections_final_roster.csv  'NBA Team Stats.csv'   sample_data   season_outcomes.csv


In [None]:
final_roster_df = pd.read_csv("connections_final_roster.csv")

In [None]:
final_roster_df.head()

Unnamed: 0,player_1_id,player_1_name,player_2_id,player_2_name,SEASON,TEAM_ID,TEAM_ABBREVIATION,games_together,minutes_combined,total_seasons,total_games,total_minutes
0,2743,Kris Humphries,1627098,Malcolm Delaney,2016,1610612737,ATL,93,2151,0,0,0
1,2743,Kris Humphries,203471,Dennis Schroder,2016,1610612737,ATL,92,3556,1,32,958
2,2743,Kris Humphries,203501,Tim Hardaway Jr.,2016,1610612737,ATL,92,3197,1,32,851
3,203471,Dennis Schroder,1627098,Malcolm Delaney,2016,1610612737,ATL,92,4109,0,0,0
4,203501,Tim Hardaway Jr.,1627098,Malcolm Delaney,2016,1610612737,ATL,92,3740,0,0,0


In [None]:
# Group by team and season, then average each pairwise chemistry metric
grouped_df = final_roster_df.groupby(['TEAM_ID', 'TEAM_ABBREVIATION', 'SEASON']).agg({
    'games_together': 'mean',
    'minutes_combined': 'mean',
    'total_seasons': 'mean',
    'total_games': 'mean',
    'total_minutes': 'mean'
}).reset_index()

In [None]:
grouped_df.head(500)

Unnamed: 0,TEAM_ID,TEAM_ABBREVIATION,SEASON,games_together,minutes_combined,total_seasons,total_games,total_minutes
0,1610612737,ATL,2003,25.614679,1011.247706,0.000000,0.000000,0.000000
1,1610612737,ATL,2004,34.966942,1338.727273,0.016529,0.173554,8.917355
2,1610612737,ATL,2005,64.384615,2591.448718,0.205128,11.564103,544.461538
3,1610612737,ATL,2006,39.482993,1474.727891,0.292517,22.503401,996.408163
4,1610612737,ATL,2007,68.621212,2736.287879,0.666667,40.757576,2085.151515
...,...,...,...,...,...,...,...,...
495,1610612762,UTA,2022,21.897436,669.487179,0.089744,3.782051,164.333333
496,1610612763,MEM,2003,52.923810,2113.714286,0.000000,0.000000,0.000000
497,1610612763,MEM,2004,55.028571,2157.333333,0.628571,37.828571,1608.142857
498,1610612763,MEM,2005,57.480769,2190.259615,0.413462,25.528846,1027.384615


In [None]:
grouped_df['TEAM_ABBREVIATION'].unique()


array(['ATL', 'BOS', 'CLE', 'NOP', 'CHI', 'DAL', 'DEN', 'GSW', 'HOU',
       'LAC', 'LAL', 'MIA', 'MIL', 'MIN', 'BKN', 'NYK', 'ORL', 'IND',
       'PHI', 'PHX', 'POR', 'SAC', 'SAS', 'OKC', 'TOR', 'UTA', 'MEM',
       'WAS', 'DET', 'CHA'], dtype=object)
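
Because the aggregation averages over player pairs, each team-season row summarizes however many connections that roster produced. As a sketch on made-up rows, named aggregation can report the pair count next to the means. The `n_pairs` column is an addition for illustration and is not part of the pipeline above.

```python
import pandas as pd

# Made-up pairwise rows: three ATL pairs and one BOS pair in the same season.
toy = pd.DataFrame({
    "TEAM_ABBREVIATION": ["ATL", "ATL", "ATL", "BOS"],
    "SEASON": [2016, 2016, 2016, 2016],
    "games_together": [90, 60, 30, 40],
    "total_games": [0, 30, 60, 10],
})

summary = (
    toy.groupby(["TEAM_ABBREVIATION", "SEASON"])
       .agg(
           games_together=("games_together", "mean"),
           total_games=("total_games", "mean"),
           n_pairs=("games_together", "size"),  # number of pairs behind each mean
       )
       .reset_index()
)
print(summary)
```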

In [None]:
season_outcomes_df = pd.read_csv("season_outcomes.csv")

# Merging the playoff outcome dataset with the aggregated team metrics (the playoff outcome is the classification label)

In [None]:
season_outcomes_df.head()

Unnamed: 0,Season,Team,RS_Win_Loss,RS_Standing_Clean,PO_Outcome_Clean,season_short,team_abbreviation,team_fullname
0,2023-24,Hawks,4-3,5,TBD,2023,ATL,Atlanta Hawks
1,2022-23,Hawks,41-41,7,Lost East Conf 1st Rd,2022,ATL,Atlanta Hawks
2,2021-22,Hawks,43-39,8,Lost East Conf 1st Rd,2021,ATL,Atlanta Hawks
3,2020-21,Hawks,41-31,5,Lost East Conf Finals,2020,ATL,Atlanta Hawks
4,2019-20,Hawks,20-47,14,DNQ,2019,ATL,Atlanta Hawks


In [None]:
season_outcomes_df.columns

Index(['Season', 'Team', 'RS_Win_Loss', 'RS_Standing_Clean',
       'PO_Outcome_Clean', 'season_short', 'team_abbreviation',
       'team_fullname'],
      dtype='object')

In [None]:
merged_df = pd.merge(season_outcomes_df, grouped_df, left_on=['team_abbreviation', 'season_short'],
                     right_on=['TEAM_ABBREVIATION', 'SEASON'])

In [None]:
# Select only the relevant columns
final_df = merged_df[['TEAM_ID', 'TEAM_ABBREVIATION', 'SEASON', 'team_fullname', 'games_together',
                      'minutes_combined', 'total_seasons', 'total_games',
                      'total_minutes', 'PO_Outcome_Clean']]

In [None]:
final_df.head()

Unnamed: 0,TEAM_ID,TEAM_ABBREVIATION,SEASON,team_fullname,games_together,minutes_combined,total_seasons,total_games,total_minutes,PO_Outcome_Clean
0,1610612737,ATL,2022,Atlanta Hawks,25.516484,927.318681,0.43956,24.197802,1240.120879,Lost East Conf 1st Rd
1,1610612737,ATL,2021,Atlanta Hawks,39.13587,1372.451087,0.369565,22.146739,1040.809783,Lost East Conf 1st Rd
2,1610612737,ATL,2020,Atlanta Hawks,55.588235,1985.463235,0.198529,10.573529,535.147059,Lost East Conf Finals
3,1610612737,ATL,2019,Atlanta Hawks,33.555556,1371.838384,0.20202,13.232323,621.848485,DNQ
4,1610612737,ATL,2018,Atlanta Hawks,36.657143,1442.75,0.171429,9.685714,367.342857,DNQ
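
`pd.merge` defaults to an inner join, so any team-season present in only one of the two tables is dropped without a message. A minimal sketch of how to list such rows, using made-up frames with the same key columns as above: an outer join with `indicator=True` labels each row as `both`, `left_only`, or `right_only`.

```python
import pandas as pd

# Made-up stand-ins for season_outcomes_df and grouped_df.
outcomes = pd.DataFrame({
    "team_abbreviation": ["ATL", "ATL", "BOS"],
    "season_short": [2022, 2023, 2022],
    "PO_Outcome_Clean": ["Lost East Conf 1st Rd", "TBD", "Lost East Conf Finals"],
})
chemistry = pd.DataFrame({
    "TEAM_ABBREVIATION": ["ATL", "BOS", "CHI"],
    "SEASON": [2022, 2022, 2022],
    "games_together": [25.5, 40.1, 33.3],
})

check = pd.merge(
    outcomes, chemistry,
    left_on=["team_abbreviation", "season_short"],
    right_on=["TEAM_ABBREVIATION", "SEASON"],
    how="outer", indicator=True,
)

# Rows that an inner join would have discarded.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["team_abbreviation", "season_short", "TEAM_ABBREVIATION", "SEASON", "_merge"]])
```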


In [None]:
final_df['PO_Outcome_Clean'].unique()

array(['Lost East Conf 1st Rd', 'Lost East Conf Finals', 'DNQ',
       'Lost East Conf Semis', 'Lost NBA Finals', 'NBA Champions',
       'DNQ (lost Play-in)', 'Lost West Conf Finals',
       'Lost West Conf 1st Rd', 'Lost West Conf Semis'], dtype=object)

In [None]:
# Binary label: 0 if the team did not qualify (including play-in losses), 1 otherwise
not_qualified = final_df['PO_Outcome_Clean'].isin(['DNQ', 'DNQ (lost Play-in)'])

# assign() and drop() return new DataFrames, which avoids the SettingWithCopyWarning
# that in-place edits raise on a column slice of merged_df
final_df = final_df.assign(MADE_PLAYOFFS=(~not_qualified).astype(int)).drop(columns='PO_Outcome_Clean')


In [None]:
final_df.head(500)

Unnamed: 0,TEAM_ID,TEAM_ABBREVIATION,SEASON,team_fullname,games_together,minutes_combined,total_seasons,total_games,total_minutes,MADE_PLAYOFFS
0,1610612737,ATL,2022,Atlanta Hawks,25.516484,927.318681,0.439560,24.197802,1240.120879,1
1,1610612737,ATL,2021,Atlanta Hawks,39.135870,1372.451087,0.369565,22.146739,1040.809783,1
2,1610612737,ATL,2020,Atlanta Hawks,55.588235,1985.463235,0.198529,10.573529,535.147059,1
3,1610612737,ATL,2019,Atlanta Hawks,33.555556,1371.838384,0.202020,13.232323,621.848485,0
4,1610612737,ATL,2018,Atlanta Hawks,36.657143,1442.750000,0.171429,9.685714,367.342857,0
...,...,...,...,...,...,...,...,...,...,...
495,1610612759,SAS,2022,San Antonio Spurs,14.838462,479.584615,0.323077,13.215385,464.661538,0
496,1610612759,SAS,2021,San Antonio Spurs,30.760870,1199.014493,0.217391,12.137681,461.478261,0
497,1610612759,SAS,2020,San Antonio Spurs,48.866667,1764.733333,0.891667,42.891667,1680.766667,0
498,1610612759,SAS,2019,San Antonio Spurs,43.595588,1617.970588,0.720588,42.772059,1847.698529,0
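
Accuracy is easier to interpret next to the class balance of `MADE_PLAYOFFS`, since a model that always predicts the majority class already scores that share. The sketch below computes the label and the majority share on a handful of made-up outcomes. The strings come from the outcome list above, but their proportions here are not those of the real data.

```python
import pandas as pd

# Made-up sample of outcome strings (not the real distribution).
outcomes = pd.Series([
    "DNQ", "Lost East Conf 1st Rd", "DNQ (lost Play-in)",
    "NBA Champions", "Lost West Conf Semis",
])

# 0 for teams that did not qualify, 1 for every playoff outcome.
made_playoffs = (~outcomes.isin(["DNQ", "DNQ (lost Play-in)"])).astype(int)

share = made_playoffs.value_counts(normalize=True)
majority_baseline = share.max()  # accuracy of always predicting the larger class
print(share)
print("majority baseline:", majority_baseline)
```

On the real frame the same two lines would be applied to `final_df['MADE_PLAYOFFS']`.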


In [None]:
# Normalize the one inconsistent team name (a no-op if 'Washington' does not occur)
final_df = final_df.assign(team_fullname=final_df['team_fullname'].replace({'Washington': 'Washington Wizards'}))
final_df['team_fullname'].unique()


array(['Atlanta Hawks', 'Boston Celtics', 'Brooklyn Nets',
       'Charlotte Hornets', 'Chicago Bulls', 'Cleveland Cavaliers',
       'Dallas Mavericks', 'Denver Nuggets', 'Detroit Pistons',
       'Golden State Warriors', 'Houston Rockets', 'Indiana Pacers',
       'Los Angeles Clippers', 'Los Angeles Lakers', 'Memphis Grizzlies',
       'Miami Heat', 'Milwaukee Bucks', 'Minnesota Timberwolves',
       'New Orleans Pelicans', 'New York Knicks', 'Oklahoma City Thunder',
       'Orlando Magic', 'Philadelphia 76ers', 'Phoenix Suns',
       'Portland Trail Blazers', 'Sacramento Kings', 'San Antonio Spurs',
       'Toronto Raptors', 'Utah Jazz', 'Washington Wizards'], dtype=object)

In [None]:
final_df.shape

(575, 10)

In [None]:
final_df.to_csv("team_metrics_with_playoff_outcome.csv")

In [None]:
final_df.columns

Index(['TEAM_ID', 'TEAM_ABBREVIATION', 'SEASON', 'team_fullname',
       'games_together', 'minutes_combined', 'total_seasons', 'total_games',
       'total_minutes', 'MADE_PLAYOFFS'],
      dtype='object')

# Getting Baseline Model Dataset


In [None]:
base_df = pd.read_csv("NBA Team Stats.csv")

In [None]:
base_df.head(500)

Unnamed: 0,Season,TEAM,GP,PTS,FGM,FGA,FG%,3PM,3PA,3P%,...,FTA,FT%,OR,DR,REB,AST,STL,BLK,TO,PF
0,2022-2023,Sacramento Kings,82,120.7,43.6,88.2,49.4,13.8,37.3,36.9,...,25.1,79.0,9.5,32.9,42.5,27.3,7.0,3.4,13.1,19.7
1,2022-2023,Golden State Warriors,82,118.9,43.1,90.2,47.9,16.6,43.2,38.5,...,20.2,79.4,10.5,34.1,44.6,29.8,7.2,4.0,15.7,21.4
2,2022-2023,Atlanta Hawks,82,118.4,44.6,92.4,48.3,10.8,30.5,35.2,...,22.6,81.8,11.2,33.2,44.4,25.0,7.1,4.9,12.4,18.8
3,2022-2023,Boston Celtics,82,117.9,42.2,88.8,47.5,16.0,42.6,37.7,...,21.6,81.2,9.7,35.6,45.3,26.7,6.4,5.2,12.7,18.8
4,2022-2023,Oklahoma City Thunder,82,117.5,43.1,92.6,46.5,12.1,34.1,35.6,...,23.7,80.9,11.4,32.3,43.6,24.4,8.2,4.2,12.5,21.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,2006-2007,New York Knicks,82,97.5,35.4,77.5,45.7,5.8,16.7,34.6,...,29.2,71.5,12.6,30.7,43.3,18.7,6.6,3.2,17.1,23.6
496,2006-2007,Houston Rockets,82,97.0,35.4,79.6,44.5,8.6,23.1,37.2,...,23.2,75.3,10.7,32.6,43.3,20.8,7.1,4.1,14.2,20.9
497,2006-2007,Charlotte Bobcats,82,96.9,36.1,81.0,44.6,5.6,15.6,35.7,...,26.0,73.4,11.2,28.6,39.8,22.4,7.8,4.5,14.9,24.2
498,2006-2007,Cleveland Cavaliers,82,96.8,36.3,81.2,44.7,6.0,17.1,35.2,...,26.0,69.6,12.7,30.8,43.5,20.8,7.6,4.3,14.4,21.7


In [None]:
base_df['Season'].unique()

array(['2022-2023', '2021-2022', '2020-2021', '2019-2020', '2018-2019',
       '2017-2018', '2016-2017', '2015-2016', '2014-2015', '2013-2014',
       '2012-2013', '2011-2012', '2010-2011', '2009-2010', '2008-2009',
       '2007-2008', '2006-2007', '2005-2006', '2004-2005', '2003-2004',
       '2002-2003', '2001-2002'], dtype=object)

In [None]:
base_df['NEW_SEASON'] = base_df['Season'].apply(lambda x: x.split("-")[0])


In [None]:
base_teams = base_df['TEAM'].unique()
base_teams

array(['Sacramento Kings', 'Golden State Warriors', 'Atlanta Hawks',
       'Boston Celtics', 'Oklahoma City Thunder', 'Los Angeles Lakers',
       'Utah Jazz', 'Milwaukee Bucks', 'Memphis Grizzlies',
       'Indiana Pacers', 'New York Knicks', 'Denver Nuggets',
       'Minnesota Timberwolves', 'Philadelphia 76ers',
       'New Orleans Pelicans', 'Dallas Mavericks', 'Phoenix Suns',
       'LA Clippers', 'Portland Trail Blazers', 'Brooklyn Nets',
       'Washington Wizards', 'Chicago Bulls', 'San Antonio Spurs',
       'Toronto Raptors', 'Cleveland Cavaliers', 'Orlando Magic',
       'Charlotte Hornets', 'Houston Rockets', 'Detroit Pistons',
       'Miami Heat', 'Charlotte Bobcats', 'New Orleans Hornets',
       'New Jersey Nets', 'Seattle SuperSonics',
       'NO/Oklahoma City Hornets'], dtype=object)

In [None]:
len(base_teams)

35
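
The stats file lists 35 team names against 30 in the chemistry dataset, so some names must be renamed before the two can be joined on team name. A small sketch of how to find the mismatches with set differences. The name sets are shortened, illustrative subsets of the two arrays.

```python
# Illustrative subsets of the two name lists.
chem_names = {"Los Angeles Clippers", "Charlotte Hornets", "Brooklyn Nets", "Atlanta Hawks"}
stats_names = {
    "LA Clippers", "Charlotte Bobcats", "Charlotte Hornets",
    "New Jersey Nets", "Brooklyn Nets", "Atlanta Hawks",
}

only_in_stats = sorted(stats_names - chem_names)  # names that need a mapping
only_in_chem = sorted(chem_names - stats_names)   # names the stats file spells differently

print(only_in_stats)
print(only_in_chem)
```

In the notebook the two sets would be `set(base_df['TEAM'])` and `set(final_df['team_fullname'])`.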

In [None]:
# Align team names in base_df with team_fullname in final_df:
# 'LA Clippers' is spelled out, and renamed or relocated franchises map to their current names.
team_name_map = {
    'LA Clippers': 'Los Angeles Clippers',
    'Charlotte Bobcats': 'Charlotte Hornets',
    'New Orleans Hornets': 'New Orleans Pelicans',
    'NO/Oklahoma City Hornets': 'New Orleans Pelicans',
    'New Jersey Nets': 'Brooklyn Nets',
    'Washington': 'Washington Wizards',
    'Seattle SuperSonics': 'Oklahoma City Thunder',
}
base_df['TEAM'] = base_df['TEAM'].replace(team_name_map)



In [None]:
base_teams = base_df['TEAM'].unique()
base_teams

array(['Sacramento Kings', 'Golden State Warriors', 'Atlanta Hawks',
       'Boston Celtics', 'Oklahoma City Thunder', 'Los Angeles Lakers',
       'Utah Jazz', 'Milwaukee Bucks', 'Memphis Grizzlies',
       'Indiana Pacers', 'New York Knicks', 'Denver Nuggets',
       'Minnesota Timberwolves', 'Philadelphia 76ers',
       'New Orleans Pelicans', 'Dallas Mavericks', 'Phoenix Suns',
       'Los Angeles Clippers', 'Portland Trail Blazers', 'Brooklyn Nets',
       'Washington Wizards', 'Chicago Bulls', 'San Antonio Spurs',
       'Toronto Raptors', 'Cleveland Cavaliers', 'Orlando Magic',
       'Charlotte Hornets', 'Houston Rockets', 'Detroit Pistons',
       'Miami Heat'], dtype=object)

In [None]:
len(base_teams)

30

In [None]:
base_df.head(500)

Unnamed: 0,Season,TEAM,GP,PTS,FGM,FGA,FG%,3PM,3PA,3P%,...,FT%,OR,DR,REB,AST,STL,BLK,TO,PF,NEW_SEASON
0,2022-2023,Sacramento Kings,82,120.7,43.6,88.2,49.4,13.8,37.3,36.9,...,79.0,9.5,32.9,42.5,27.3,7.0,3.4,13.1,19.7,2022
1,2022-2023,Golden State Warriors,82,118.9,43.1,90.2,47.9,16.6,43.2,38.5,...,79.4,10.5,34.1,44.6,29.8,7.2,4.0,15.7,21.4,2022
2,2022-2023,Atlanta Hawks,82,118.4,44.6,92.4,48.3,10.8,30.5,35.2,...,81.8,11.2,33.2,44.4,25.0,7.1,4.9,12.4,18.8,2022
3,2022-2023,Boston Celtics,82,117.9,42.2,88.8,47.5,16.0,42.6,37.7,...,81.2,9.7,35.6,45.3,26.7,6.4,5.2,12.7,18.8,2022
4,2022-2023,Oklahoma City Thunder,82,117.5,43.1,92.6,46.5,12.1,34.1,35.6,...,80.9,11.4,32.3,43.6,24.4,8.2,4.2,12.5,21.0,2022
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,2006-2007,New York Knicks,82,97.5,35.4,77.5,45.7,5.8,16.7,34.6,...,71.5,12.6,30.7,43.3,18.7,6.6,3.2,17.1,23.6,2006
496,2006-2007,Houston Rockets,82,97.0,35.4,79.6,44.5,8.6,23.1,37.2,...,75.3,10.7,32.6,43.3,20.8,7.1,4.1,14.2,20.9,2006
497,2006-2007,Charlotte Hornets,82,96.9,36.1,81.0,44.6,5.6,15.6,35.7,...,73.4,11.2,28.6,39.8,22.4,7.8,4.5,14.9,24.2,2006
498,2006-2007,Cleveland Cavaliers,82,96.8,36.3,81.2,44.7,6.0,17.1,35.2,...,69.6,12.7,30.8,43.5,20.8,7.6,4.3,14.4,21.7,2006


In [None]:
base_df.shape

(657, 22)

In [None]:
base_df = base_df.drop(columns=['GP', 'Season'])

In [None]:
base_df.head(500)

Unnamed: 0,TEAM,PTS,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OR,DR,REB,AST,STL,BLK,TO,PF,NEW_SEASON
0,Sacramento Kings,120.7,43.6,88.2,49.4,13.8,37.3,36.9,19.8,25.1,79.0,9.5,32.9,42.5,27.3,7.0,3.4,13.1,19.7,2022
1,Golden State Warriors,118.9,43.1,90.2,47.9,16.6,43.2,38.5,16.0,20.2,79.4,10.5,34.1,44.6,29.8,7.2,4.0,15.7,21.4,2022
2,Atlanta Hawks,118.4,44.6,92.4,48.3,10.8,30.5,35.2,18.5,22.6,81.8,11.2,33.2,44.4,25.0,7.1,4.9,12.4,18.8,2022
3,Boston Celtics,117.9,42.2,88.8,47.5,16.0,42.6,37.7,17.5,21.6,81.2,9.7,35.6,45.3,26.7,6.4,5.2,12.7,18.8,2022
4,Oklahoma City Thunder,117.5,43.1,92.6,46.5,12.1,34.1,35.6,19.2,23.7,80.9,11.4,32.3,43.6,24.4,8.2,4.2,12.5,21.0,2022
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,New York Knicks,97.5,35.4,77.5,45.7,5.8,16.7,34.6,20.9,29.2,71.5,12.6,30.7,43.3,18.7,6.6,3.2,17.1,23.6,2006
496,Houston Rockets,97.0,35.4,79.6,44.5,8.6,23.1,37.2,17.5,23.2,75.3,10.7,32.6,43.3,20.8,7.1,4.1,14.2,20.9,2006
497,Charlotte Hornets,96.9,36.1,81.0,44.6,5.6,15.6,35.7,19.1,26.0,73.4,11.2,28.6,39.8,22.4,7.8,4.5,14.9,24.2,2006
498,Cleveland Cavaliers,96.8,36.3,81.2,44.7,6.0,17.1,35.2,18.1,26.0,69.6,12.7,30.8,43.5,20.8,7.6,4.3,14.4,21.7,2006


In [None]:
base_df.columns

Index(['TEAM', 'PTS', 'FGM', 'FGA', 'FG%', '3PM', '3PA', '3P%', 'FTM', 'FTA',
       'FT%', 'OR', 'DR', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PF',
       'NEW_SEASON'],
      dtype='object')

In [None]:
# Cast both season keys to int so the merge keys have matching dtypes
final_df = final_df.astype({'SEASON': int})
base_df = base_df.astype({'NEW_SEASON': int})


In [None]:
final_joined_df = pd.merge(final_df, base_df, left_on=['SEASON', 'team_fullname'], right_on=['NEW_SEASON', 'TEAM'])


In [None]:
final_joined_df.head(500)

Unnamed: 0,TEAM_ID,TEAM_ABBREVIATION,SEASON,team_fullname,games_together,minutes_combined,total_seasons,total_games,total_minutes,MADE_PLAYOFFS,...,FT%,OR,DR,REB,AST,STL,BLK,TO,PF,NEW_SEASON
0,1610612737,ATL,2022,Atlanta Hawks,25.516484,927.318681,0.439560,24.197802,1240.120879,1,...,81.8,11.2,33.2,44.4,25.0,7.1,4.9,12.4,18.8,2022
1,1610612737,ATL,2021,Atlanta Hawks,39.135870,1372.451087,0.369565,22.146739,1040.809783,1,...,81.2,10.0,33.9,44.0,24.6,7.2,4.2,11.3,18.7,2021
2,1610612737,ATL,2020,Atlanta Hawks,55.588235,1985.463235,0.198529,10.573529,535.147059,1,...,81.2,10.6,35.1,45.6,24.1,7.0,4.8,12.7,19.3,2020
3,1610612737,ATL,2019,Atlanta Hawks,33.555556,1371.838384,0.202020,13.232323,621.848485,0,...,79.0,9.9,33.4,43.3,24.0,7.8,5.1,15.7,23.1,2019
4,1610612737,ATL,2018,Atlanta Hawks,36.657143,1442.750000,0.171429,9.685714,367.342857,0,...,75.2,11.6,34.5,46.1,25.8,8.2,5.1,16.6,23.6,2018
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,1610612759,SAS,2022,San Antonio Spurs,14.838462,479.584615,0.323077,13.215385,464.661538,0,...,74.3,11.8,31.9,43.7,27.2,7.0,3.9,14.7,19.9,2022
496,1610612759,SAS,2021,San Antonio Spurs,30.760870,1199.014493,0.217391,12.137681,461.478261,0,...,75.4,11.0,34.3,45.3,27.9,7.6,4.9,12.3,18.1,2021
497,1610612759,SAS,2020,San Antonio Spurs,48.866667,1764.733333,0.891667,42.891667,1680.766667,0,...,79.2,9.3,34.6,43.9,24.4,7.0,5.1,11.0,18.0,2020
498,1610612759,SAS,2019,San Antonio Spurs,43.595588,1617.970588,0.720588,42.772059,1847.698529,0,...,81.0,9.0,35.6,44.6,24.7,7.3,5.5,12.3,19.4,2019


In [None]:
final_joined_df.shape

(575, 30)
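
The joined frame has the same 575 rows as `final_df`, which suggests every team-season matched exactly one row of team stats. As a sketch of how that assumption could be enforced rather than inferred from the shape, `pd.merge` accepts `validate="one_to_one"` and raises `pandas.errors.MergeError` when a key repeats. The frames below are made up and include a deliberate duplicate.

```python
import pandas as pd

left = pd.DataFrame({
    "SEASON": [2021, 2022],
    "team_fullname": ["Atlanta Hawks", "Atlanta Hawks"],
    "MADE_PLAYOFFS": [1, 1],
})
# Made-up stats table with a duplicated 2022 row.
right = pd.DataFrame({
    "NEW_SEASON": [2021, 2022, 2022],
    "TEAM": ["Atlanta Hawks", "Atlanta Hawks", "Atlanta Hawks"],
    "PTS": [113.9, 118.4, 118.4],
})
keys = dict(left_on=["SEASON", "team_fullname"], right_on=["NEW_SEASON", "TEAM"])

try:
    pd.merge(left, right, validate="one_to_one", **keys)
    duplicate_detected = False
except pd.errors.MergeError as exc:
    duplicate_detected = True
    print("merge check failed:", exc)

# After removing the duplicate key, the validated merge succeeds.
deduped = right.drop_duplicates(subset=["NEW_SEASON", "TEAM"])
joined = pd.merge(left, deduped, validate="one_to_one", **keys)
print(joined.shape)
```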

In [None]:
final_joined_df.columns

Index(['TEAM_ID', 'TEAM_ABBREVIATION', 'SEASON', 'team_fullname',
       'games_together', 'minutes_combined', 'total_seasons', 'total_games',
       'total_minutes', 'MADE_PLAYOFFS', 'TEAM', 'PTS', 'FGM', 'FGA', 'FG%',
       '3PM', '3PA', '3P%', 'FTM', 'FTA', 'FT%', 'OR', 'DR', 'REB', 'AST',
       'STL', 'BLK', 'TO', 'PF', 'NEW_SEASON'],
      dtype='object')

In [None]:
team_chemistry_dataset = final_joined_df[['team_fullname', 'SEASON','games_together', 'minutes_combined', 'total_seasons', 'total_games',
       'total_minutes','PTS', 'FGM', 'FGA', 'FG%',
       '3PM', '3PA', '3P%', 'FTM', 'FTA', 'FT%', 'OR', 'DR', 'REB', 'AST',
       'STL', 'BLK', 'TO', 'PF','MADE_PLAYOFFS']]

In [None]:
team_chemistry_dataset.head(500)

Unnamed: 0,team_fullname,SEASON,games_together,minutes_combined,total_seasons,total_games,total_minutes,PTS,FGM,FGA,...,FT%,OR,DR,REB,AST,STL,BLK,TO,PF,MADE_PLAYOFFS
0,Atlanta Hawks,2022,25.516484,927.318681,0.439560,24.197802,1240.120879,118.4,44.6,92.4,...,81.8,11.2,33.2,44.4,25.0,7.1,4.9,12.4,18.8,1
1,Atlanta Hawks,2021,39.135870,1372.451087,0.369565,22.146739,1040.809783,113.9,41.5,88.3,...,81.2,10.0,33.9,44.0,24.6,7.2,4.2,11.3,18.7,1
2,Atlanta Hawks,2020,55.588235,1985.463235,0.198529,10.573529,535.147059,113.7,40.8,87.2,...,81.2,10.6,35.1,45.6,24.1,7.0,4.8,12.7,19.3,1
3,Atlanta Hawks,2019,33.555556,1371.838384,0.202020,13.232323,621.848485,111.8,40.6,90.6,...,79.0,9.9,33.4,43.3,24.0,7.8,5.1,15.7,23.1,0
4,Atlanta Hawks,2018,36.657143,1442.750000,0.171429,9.685714,367.342857,113.3,41.4,91.8,...,75.2,11.6,34.5,46.1,25.8,8.2,5.1,16.6,23.6,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,San Antonio Spurs,2022,14.838462,479.584615,0.323077,13.215385,464.661538,113.0,43.1,92.6,...,74.3,11.8,31.9,43.7,27.2,7.0,3.9,14.7,19.9,0
496,San Antonio Spurs,2021,30.760870,1199.014493,0.217391,12.137681,461.478261,113.2,43.2,92.7,...,75.4,11.0,34.3,45.3,27.9,7.6,4.9,12.3,18.1,0
497,San Antonio Spurs,2020,48.866667,1764.733333,0.891667,42.891667,1680.766667,111.1,41.9,90.5,...,79.2,9.3,34.6,43.9,24.4,7.0,5.1,11.0,18.0,0
498,San Antonio Spurs,2019,43.595588,1617.970588,0.720588,42.772059,1847.698529,114.1,42.2,89.4,...,81.0,9.0,35.6,44.6,24.7,7.3,5.5,12.3,19.4,0


In [None]:
base_model = final_joined_df[['team_fullname', 'SEASON','PTS', 'FGM', 'FGA', 'FG%',
       '3PM', '3PA', '3P%', 'FTM', 'FTA', 'FT%', 'OR', 'DR', 'REB', 'AST',
       'STL', 'BLK', 'TO', 'PF','MADE_PLAYOFFS']]

In [None]:
base_model.head(500)

Unnamed: 0,team_fullname,SEASON,PTS,FGM,FGA,FG%,3PM,3PA,3P%,FTM,...,FT%,OR,DR,REB,AST,STL,BLK,TO,PF,MADE_PLAYOFFS
0,Atlanta Hawks,2022,118.4,44.6,92.4,48.3,10.8,30.5,35.2,18.5,...,81.8,11.2,33.2,44.4,25.0,7.1,4.9,12.4,18.8,1
1,Atlanta Hawks,2021,113.9,41.5,88.3,47.0,12.9,34.4,37.4,18.1,...,81.2,10.0,33.9,44.0,24.6,7.2,4.2,11.3,18.7,1
2,Atlanta Hawks,2020,113.7,40.8,87.2,46.8,12.4,33.4,37.3,19.7,...,81.2,10.6,35.1,45.6,24.1,7.0,4.8,12.7,19.3,1
3,Atlanta Hawks,2019,111.8,40.6,90.6,44.9,12.0,36.1,33.3,18.5,...,79.0,9.9,33.4,43.3,24.0,7.8,5.1,15.7,23.1,0
4,Atlanta Hawks,2018,113.3,41.4,91.8,45.1,13.0,37.0,35.2,17.6,...,75.2,11.6,34.5,46.1,25.8,8.2,5.1,16.6,23.6,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,San Antonio Spurs,2022,113.0,43.1,92.6,46.5,11.1,32.2,34.5,15.8,...,74.3,11.8,31.9,43.7,27.2,7.0,3.9,14.7,19.9,0
496,San Antonio Spurs,2021,113.2,43.2,92.7,46.7,11.3,32.0,35.2,15.4,...,75.4,11.0,34.3,45.3,27.9,7.6,4.9,12.3,18.1,0
497,San Antonio Spurs,2020,111.1,41.9,90.5,46.2,9.9,28.4,35.0,17.4,...,79.2,9.3,34.6,43.9,24.4,7.0,5.1,11.0,18.0,0
498,San Antonio Spurs,2019,114.1,42.2,89.4,47.2,10.7,28.5,37.6,19.0,...,81.0,9.0,35.6,44.6,24.7,7.3,5.5,12.3,19.4,0


In [None]:
# base_model.csv
!gdown --id 1o4j5ElESpuIm58IlhQTEjMiMmSV7x_mm

Downloading...
From: https://drive.google.com/uc?id=1o4j5ElESpuIm58IlhQTEjMiMmSV7x_mm
To: /content/base_model.csv
100% 66.5k/66.5k [00:00<00:00, 66.1MB/s]


In [None]:
base_model.to_csv("base_model.csv")
team_chemistry_dataset.to_csv("team_chemistry_dataset.csv")

# Prepare Dataset

In [None]:
base_model = pd.read_csv("base_model.csv", index_col=0)
team_chemistry_dataset = pd.read_csv("team_chemistry_dataset.csv", index_col=0)

# base_df = pd.get_dummies(base_model)
# team_chemistry_df = pd.get_dummies(team_chemistry_dataset)

In [None]:
base_df = base_model

In [None]:
team_chemistry_df = team_chemistry_dataset

In [None]:
base_df

Unnamed: 0,team_fullname,SEASON,PTS,FGM,FGA,FG%,3PM,3PA,3P%,FTM,...,FT%,OR,DR,REB,AST,STL,BLK,TO,PF,MADE_PLAYOFFS
0,Atlanta Hawks,2022,118.4,44.6,92.4,48.3,10.8,30.5,35.2,18.5,...,81.8,11.2,33.2,44.4,25.0,7.1,4.9,12.4,18.8,1
1,Atlanta Hawks,2021,113.9,41.5,88.3,47.0,12.9,34.4,37.4,18.1,...,81.2,10.0,33.9,44.0,24.6,7.2,4.2,11.3,18.7,1
2,Atlanta Hawks,2020,113.7,40.8,87.2,46.8,12.4,33.4,37.3,19.7,...,81.2,10.6,35.1,45.6,24.1,7.0,4.8,12.7,19.3,1
3,Atlanta Hawks,2019,111.8,40.6,90.6,44.9,12.0,36.1,33.3,18.5,...,79.0,9.9,33.4,43.3,24.0,7.8,5.1,15.7,23.1,0
4,Atlanta Hawks,2018,113.3,41.4,91.8,45.1,13.0,37.0,35.2,17.6,...,75.2,11.6,34.5,46.1,25.8,8.2,5.1,16.6,23.6,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
570,Washington Wizards,2007,98.8,36.4,81.6,44.6,7.0,19.7,35.6,19.0,...,78.2,12.3,29.3,41.6,19.6,7.7,4.8,13.2,19.6,1
571,Washington Wizards,2006,104.3,37.4,83.2,45.0,6.8,19.7,34.8,22.6,...,76.5,12.2,29.0,41.2,20.2,7.7,4.6,13.8,22.2,1
572,Washington Wizards,2005,101.7,36.3,81.2,44.7,6.1,17.0,35.7,23.0,...,75.7,12.6,28.6,41.2,18.6,8.0,4.1,13.9,22.6,1
573,Washington Wizards,2004,100.5,36.2,82.9,43.7,6.3,18.3,34.3,21.9,...,72.5,13.8,29.0,42.8,19.1,8.7,4.2,14.3,22.0,1


In [None]:
# Drop the identifier columns so only numeric features and the label remain
team_chemistry_df = team_chemistry_df.drop(columns=["team_fullname", "SEASON"])
base_df = base_df.drop(columns=["team_fullname", "SEASON"])

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
base_X = base_df.iloc[:, :-1]
base_y = base_df.iloc[:, -1]
chem_X = team_chemistry_df.iloc[:, :-1]
chem_y = team_chemistry_df.iloc[:, -1]

In [None]:
base_train_X, base_test_X, base_train_y, base_test_y = train_test_split(base_X, base_y, test_size=0.2, random_state=12)
chem_train_X, chem_test_X, chem_train_y, chem_test_y = train_test_split(chem_X, chem_y, test_size=0.2, random_state=12)
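
Both calls use `random_state=12`, and the two frames come from the same joined table, so they should hold the same rows in the same order. Under that assumption the two splits select the same team-seasons for testing, which makes the base and chemistry results directly comparable. A sketch on random data that checks this property:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, size=50))
X_small = pd.DataFrame(rng.normal(size=(50, 3)))  # stands in for the base features
X_large = pd.DataFrame(rng.normal(size=(50, 8)))  # stands in for base + chemistry features

_, small_test, _, _ = train_test_split(X_small, y, test_size=0.2, random_state=12)
_, large_test, _, _ = train_test_split(X_large, y, test_size=0.2, random_state=12)

# Same seed and same number of rows give the same row indices in each test set.
same_rows = small_test.index.equals(large_test.index)
print(same_rows)
```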

In [None]:
scaler = StandardScaler()
scaler.fit(base_train_X)
base_train_X = scaler.transform(base_train_X)
base_test_X = scaler.transform(base_test_X)

In [None]:
scaler = StandardScaler()
scaler.fit(chem_train_X)
chem_train_X = scaler.transform(chem_train_X)
chem_test_X = scaler.transform(chem_test_X)
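
One caveat for the grid searches that follow: the scaler here is fit on the whole training split, so during cross-validation each validation fold has already contributed to the scaling statistics. The effect is usually small. A possible alternative, sketched on synthetic data, is to put the scaler and the classifier in a `Pipeline` so scaling is refit inside every fold. The step names `scale` and `knn` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled training features and labels.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {"knn__n_neighbors": [3, 7, 11], "knn__weights": ["uniform", "distance"]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy").fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```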

KNN

In [None]:
# BASE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, validation_curve, learning_curve, GridSearchCV, cross_validate, cross_val_score
from sklearn.metrics import accuracy_score

param_grid = {'n_neighbors': list(range(1, 32)), 'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy', return_train_score=False).fit(base_train_X, base_train_y)
print(grid.best_params_)
print(grid.cv_results_['mean_test_score'])
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

{'n_neighbors': 22, 'weights': 'distance'}
[0.71956522 0.71956522 0.6826087  0.71956522 0.72608696 0.72608696
 0.7173913  0.73695652 0.73913043 0.73913043 0.74130435 0.73478261
 0.73695652 0.73695652 0.73478261 0.73695652 0.72608696 0.72826087
 0.71956522 0.73478261 0.73913043 0.74130435 0.72826087 0.74347826
 0.72391304 0.72391304 0.7326087  0.72826087 0.72173913 0.72608696
 0.73043478 0.73913043 0.72173913 0.72391304 0.7326087  0.73478261
 0.7326087  0.73913043 0.74565217 0.74347826 0.74130435 0.74782609
 0.74565217 0.75434783 0.74130435 0.74347826 0.74347826 0.74347826
 0.73478261 0.73695652 0.73695652 0.74130435 0.72391304 0.7326087
 0.73478261 0.73913043 0.72826087 0.73478261 0.73043478 0.74130435
 0.72826087 0.73043478]
0.7543478260869565
{'n_neighbors': 22, 'weights': 'distance'}
KNeighborsClassifier(n_neighbors=22, weights='distance')


In [None]:
# CHEM

param_grid = {'n_neighbors': list(range(1, 32)), 'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy', return_train_score=False).fit(chem_train_X, chem_train_y)
print(grid.best_params_)
print(grid.cv_results_['mean_test_score'])
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

{'n_neighbors': 31, 'weights': 'uniform'}
[0.70434783 0.70434783 0.71086957 0.70434783 0.72391304 0.72391304
 0.69782609 0.71956522 0.73043478 0.73043478 0.71521739 0.72391304
 0.72173913 0.72173913 0.7173913  0.74347826 0.74130435 0.74130435
 0.74130435 0.74347826 0.75       0.75217391 0.74782609 0.74782609
 0.75869565 0.76086957 0.75869565 0.75434783 0.74782609 0.75
 0.73695652 0.74347826 0.74130435 0.74565217 0.73695652 0.74347826
 0.74565217 0.74565217 0.75       0.74565217 0.75434783 0.75652174
 0.74782609 0.76086957 0.75       0.75       0.75434783 0.74565217
 0.75       0.75       0.75       0.74782609 0.75       0.74782609
 0.74565217 0.74565217 0.75434783 0.75434783 0.75652174 0.76086957
 0.76521739 0.76304348]
0.7652173913043477
{'n_neighbors': 31, 'weights': 'uniform'}
KNeighborsClassifier(n_neighbors=31)
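
The printed `mean_test_score` arrays each hold 62 values (31 settings of `n_neighbors` times 2 weightings), which is hard to read as a flat list. A sketch of a more readable view on synthetic data with a smaller grid: `cv_results_` converts directly to a DataFrame that can be sorted by rank.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = {"n_neighbors": [1, 5, 9], "weights": ["uniform", "distance"]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy").fit(X, y)

columns = ["param_n_neighbors", "param_weights", "mean_test_score", "std_test_score", "rank_test_score"]
results = (
    pd.DataFrame(grid.cv_results_)[columns]
    .sort_values("rank_test_score")  # best configuration first
    .reset_index(drop=True)
)
print(results)
```

The `std_test_score` column also shows how much the fold scores vary, which helps judge whether neighbouring configurations really differ.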


In [None]:
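# Note: the BASE grid search above selected n_neighbors=22, weights='distance';
# this cell fits with n_neighbors=31, weights='uniform' (the CHEM search result).
base_KNN = KNeighborsClassifier(n_neighbors=31, weights='uniform')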
base_KNN.fit(base_train_X, base_train_y)
base_pred_KNN = base_KNN.predict(base_test_X)
print("train: " + str(np.mean(base_KNN.predict(base_train_X) == base_train_y)))
print("test: " + str(np.mean(base_pred_KNN == base_test_y)))

train: 0.7456521739130435
test: 0.7652173913043478


In [None]:
chem_KNN = KNeighborsClassifier(n_neighbors=31, weights='uniform')
chem_KNN.fit(chem_train_X, chem_train_y)
chem_pred_KNN = chem_KNN.predict(chem_test_X)
print("train: " + str(np.mean(chem_KNN.predict(chem_train_X) == chem_train_y)))
print("test: " + str(np.mean(chem_pred_KNN == chem_test_y)))

train: 0.782608695652174
test: 0.8
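
To put the KNN test accuracies in perspective, they can be converted back to counts. The sketch assumes the test set has 115 rows (20% of the 575 joined rows) and uses a rough normal approximation for the uncertainty of a single accuracy. It ignores that both models are scored on the same rows, so it is only an order-of-magnitude guide.

```python
import math

n_test = 115  # assumption: 20% of the 575 joined rows
base_acc, chem_acc = 0.7652173913043478, 0.8  # test accuracies printed above

base_correct = round(base_acc * n_test)
chem_correct = round(chem_acc * n_test)
extra_correct = chem_correct - base_correct

def half_width(p, n, z=1.96):
    """Approximate 95% half-width of a binomial proportion."""
    return z * math.sqrt(p * (1 - p) / n)

print(base_correct, chem_correct, extra_correct)
print(round(half_width(chem_acc, n_test), 3))
```

Under these assumptions the gap between the two models corresponds to a handful of team-seasons, so a single split should be read with caution.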


SVM

In [None]:
# BASE
from sklearn.svm import SVC

param_grid = {'C': list(range(1, 10)), 'kernel': ['linear', 'rbf'], 'gamma': ['scale', 'auto']}
grid = GridSearchCV(SVC(), param_grid, cv=10, scoring='accuracy', return_train_score=False).fit(base_train_X, base_train_y)
print(grid.best_params_)
print(grid.cv_results_['mean_test_score'])
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

{'C': 1, 'gamma': 'scale', 'kernel': 'linear'}
[0.83043478 0.81304348 0.83043478 0.81304348 0.83043478 0.81086957
 0.83043478 0.80869565 0.83043478 0.81956522 0.83043478 0.81956522
 0.83043478 0.81521739 0.83043478 0.81304348 0.83043478 0.81521739
 0.83043478 0.81521739 0.82826087 0.81086957 0.82826087 0.81086957
 0.82826087 0.80652174 0.82826087 0.80652174 0.82826087 0.80434783
 0.82826087 0.80434783 0.82826087 0.79565217 0.82826087 0.79565217]
0.8304347826086957
{'C': 1, 'gamma': 'scale', 'kernel': 'linear'}
SVC(C=1, kernel='linear')


In [None]:
# CHEM

param_grid = {'C': list(range(1, 10)), 'kernel': ['linear', 'rbf'], 'gamma': ['scale', 'auto']}
grid = GridSearchCV(SVC(), param_grid, cv=10, scoring='accuracy', return_train_score=False).fit(chem_train_X, chem_train_y)
print(grid.best_params_)
print(grid.cv_results_['mean_test_score'])
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

{'C': 5, 'gamma': 'scale', 'kernel': 'linear'}
[0.83913043 0.80652174 0.83913043 0.80652174 0.8326087  0.79782609
 0.8326087  0.79782609 0.83478261 0.79347826 0.83478261 0.79347826
 0.83478261 0.79130435 0.83478261 0.79130435 0.83913043 0.8
 0.83913043 0.8        0.83478261 0.79782609 0.83478261 0.79782609
 0.83913043 0.78913043 0.83913043 0.78913043 0.83913043 0.79347826
 0.83913043 0.79347826 0.83913043 0.79130435 0.83913043 0.78913043]
0.8391304347826087
{'C': 5, 'gamma': 'scale', 'kernel': 'linear'}
SVC(C=5, kernel='linear')
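
In both score arrays above, each linear-kernel score appears twice, once per `gamma` setting, because scikit-learn's `SVC` ignores `gamma` when the kernel is linear. One way to avoid those redundant fits is to pass `GridSearchCV` a list of sub-grids, one per kernel. The sketch below is a minimal illustration under stated assumptions: it runs on synthetic data from `make_classification` rather than the NBA features, and `demo_X`, `demo_y`, `split_grid`, and `split_search` are hypothetical names that do not appear elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for base_train_X / base_train_y (assumption: not the NBA data).
demo_X, demo_y = make_classification(n_samples=200, n_features=10, random_state=0)

# One sub-grid per kernel, so gamma is searched only where it has an effect.
split_grid = [
    {'kernel': ['linear'], 'C': list(range(1, 10))},
    {'kernel': ['rbf'], 'C': list(range(1, 10)), 'gamma': ['scale', 'auto']},
]
split_search = GridSearchCV(SVC(), split_grid, cv=10, scoring='accuracy').fit(demo_X, demo_y)

# 9 linear candidates + 18 rbf candidates, versus 36 in the combined grid.
n_candidates = len(split_search.cv_results_['params'])
print(n_candidates)
print(split_search.best_params_)
```

Swapping `demo_X` and `demo_y` for `base_train_X` and `base_train_y` (or the `chem_` equivalents) should cover the same candidate models as the cells above with a quarter fewer fits.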


In [None]:
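# NOTE: the baseline grid search above selected C=1; C=5 here matches the chemistry model's setting.
base_SVM = SVC(C=5, kernel='linear')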
base_SVM.fit(base_train_X, base_train_y)
base_pred_SVM = base_SVM.predict(base_test_X)
print("train: " + str(np.mean(base_SVM.predict(base_train_X) == base_train_y)))
print("test: " + str(np.mean(base_pred_SVM == base_test_y)))

train: 0.8478260869565217
test: 0.8608695652173913


In [None]:
chem_SVM = SVC(C=5, kernel='linear')
chem_SVM.fit(chem_train_X, chem_train_y)
chem_pred_SVM = chem_SVM.predict(chem_test_X)
print("train: " + str(np.mean(chem_SVM.predict(chem_train_X) == chem_train_y)))
print("test: " + str(np.mean(chem_pred_SVM == chem_test_y)))

train: 0.8717391304347826
test: 0.8869565217391304


Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
base_NB = GaussianNB().fit(base_train_X, base_train_y)
base_pred_NB = base_NB.predict(base_test_X)
print("train: " + str(np.mean(base_NB.predict(base_train_X) == base_train_y)))
print("test: " + str(np.mean(base_pred_NB == base_test_y)))

train: 0.6695652173913044
test: 0.6695652173913044


In [None]:
chem_NB = GaussianNB().fit(chem_train_X, chem_train_y)
chem_pred_NB = chem_NB.predict(chem_test_X)
print("train: " + str(np.mean(chem_NB.predict(chem_train_X) == chem_train_y)))
print("test: " + str(np.mean(chem_pred_NB == chem_test_y)))

train: 0.758695652173913
test: 0.7652173913043478


Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Exhaustive search: 9 x 16 x 16 x 16 = 36,864 candidates with the 18 baseline features,
# each fit 10 times for cross-validation.
param_grid = {'min_samples_leaf': np.arange(2, 20, 2).tolist(),
              'max_depth': np.arange(2, 48, 3).tolist(),
              'max_features': np.arange(2, base_train_X.shape[1]).tolist(),
              'min_samples_split': np.arange(2, 48, 3).tolist()}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=10, scoring='accuracy', return_train_score=False).fit(base_train_X, base_train_y)
print(grid.best_params_)
print(grid.cv_results_['mean_test_score'])
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

{'max_depth': 8, 'max_features': 11, 'min_samples_leaf': 16, 'min_samples_split': 35}
[0.58913043 0.58043478 0.62826087 ... 0.67826087 0.66086957 0.67391304]
0.7282608695652175
{'max_depth': 8, 'max_features': 11, 'min_samples_leaf': 16, 'min_samples_split': 35}
DecisionTreeClassifier(max_depth=8, max_features=11, min_samples_leaf=16,
                       min_samples_split=35)


In [None]:
# Same search on the chemistry feature set; the wider max_features range makes the grid larger still.
param_grid = {'min_samples_leaf': np.arange(2, 20, 2).tolist(),
              'max_depth': np.arange(2, 48, 3).tolist(),
              'max_features': np.arange(2, chem_train_X.shape[1]).tolist(),
              'min_samples_split': np.arange(2, 48, 3).tolist()}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=10, scoring='accuracy', return_train_score=False).fit(chem_train_X, chem_train_y)
print(grid.best_params_)
print(grid.cv_results_['mean_test_score'])
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

KeyboardInterrupt: ignored
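
The search above was interrupted before it finished. With the 23 features in the chemistry set, the grid holds 9 x 16 x 21 x 16 = 48,384 candidates, each fit 10 times. A randomized search that samples a fixed number of candidates from the same ranges is one possible way to keep the run time bounded. The sketch below is an illustration under assumptions, not the method used in this notebook: it runs on synthetic data from `make_classification`, the budget `n_iter=50` is arbitrary, and `demo_X`, `demo_y`, `param_distributions`, and `random_search` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 23 features, like the chemistry set (assumption: not the NBA data).
demo_X, demo_y = make_classification(n_samples=300, n_features=23, random_state=0)

# Same value ranges as the exhaustive grid above.
param_distributions = {
    'min_samples_leaf': np.arange(2, 20, 2).tolist(),
    'max_depth': np.arange(2, 48, 3).tolist(),
    'max_features': np.arange(2, demo_X.shape[1]).tolist(),
    'min_samples_split': np.arange(2, 48, 3).tolist(),
}

# Sample 50 candidates instead of fitting all of them; seeds fixed for repeatability.
random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=50,
    cv=10,
    scoring='accuracy',
    random_state=0,
).fit(demo_X, demo_y)

print(random_search.best_params_)
print(random_search.best_score_)
```

A sampled search does not guarantee the best combination in the grid, so the budget trades search quality against run time.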

In [None]:
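from sklearn.tree import DecisionTreeClassifier
# Refit the baseline tree with the hyperparameters selected by the grid search.
# No random_state is set, so with max_features < n_features the scores can vary between runs.
base_tree = DecisionTreeClassifier(max_depth=8, max_features=11, min_samples_leaf=16,
                       min_samples_split=35).fit(base_train_X, base_train_y)
# Rank features by impurity-based importance, highest first.
base_sorted = sorted(zip(base_X.columns.values, base_tree.feature_importances_), key=lambda x: x[1], reverse=True)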
base_pred_tree = base_tree.predict(base_test_X)
print("train: " + str(np.mean(base_tree.predict(base_train_X) == base_train_y)))
print("test: " + str(np.mean(base_pred_tree == base_test_y)))

train: 0.7782608695652173
test: 0.6782608695652174


In [None]:
base_sorted

[('FG%', 0.41524658265028114),
 ('TO', 0.18408883731771847),
 ('3P%', 0.12969594283249428),
 ('FGA', 0.09141414517624213),
 ('FTM', 0.06968401612253847),
 ('PTS', 0.03996359751609722),
 ('STL', 0.03431023638143401),
 ('3PA', 0.017943656287000252),
 ('OR', 0.017652985716194034),
 ('FGM', 0.0),
 ('3PM', 0.0),
 ('FTA', 0.0),
 ('FT%', 0.0),
 ('DR', 0.0),
 ('REB', 0.0),
 ('AST', 0.0),
 ('BLK', 0.0),
 ('PF', 0.0)]

In [None]:
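# The chemistry grid search above was interrupted; the hyperparameters below are
# the ones selected for the baseline tree.
chem_tree = DecisionTreeClassifier(max_depth=8, max_features=11, min_samples_leaf=16,
                       min_samples_split=35).fit(chem_train_X, chem_train_y)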
chem_sorted = sorted(zip(chem_X.columns.values, chem_tree.feature_importances_), key=lambda x: x[1], reverse=True)
chem_pred_tree = chem_tree.predict(chem_test_X)
print("train: " + str(np.mean(chem_tree.predict(chem_train_X) == chem_train_y)))
print("test: " + str(np.mean(chem_pred_tree == chem_test_y)))

train: 0.8065217391304348
test: 0.7043478260869566


In [None]:
chem_sorted

[('games_together', 0.36247240602560316),
 ('FG%', 0.31106391987124615),
 ('DR', 0.05768647636197136),
 ('total_minutes', 0.057380049866331426),
 ('3PM', 0.0347066188488798),
 ('FTA', 0.03373965044603679),
 ('3P%', 0.030777820792852243),
 ('total_games', 0.028304428888929513),
 ('OR', 0.02205019760346825),
 ('minutes_combined', 0.019943755197113493),
 ('FGA', 0.01962680257525846),
 ('TO', 0.018959777600889863),
 ('AST', 0.0032880959214194845),
 ('total_seasons', 0.0),
 ('PTS', 0.0),
 ('FGM', 0.0),
 ('3PA', 0.0),
 ('FTM', 0.0),
 ('FT%', 0.0),
 ('REB', 0.0),
 ('STL', 0.0),
 ('BLK', 0.0),
 ('PF', 0.0)]
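
To compare the four model families side by side, the test accuracies printed in the cells above can be collected in one place. The sketch below only restates those printed numbers and computes the difference between each chemistry model and its baseline. The names `reported_test_accuracy` and `gains` are hypothetical and introduced here for illustration.

```python
# Test accuracies copied from the cell outputs above: (baseline, with chemistry).
reported_test_accuracy = {
    'KNN':           (0.7652173913043478, 0.8),
    'SVM':           (0.8608695652173913, 0.8869565217391304),
    'Naive Bayes':   (0.6695652173913044, 0.7652173913043478),
    'Decision Tree': (0.6782608695652174, 0.7043478260869566),
}

# Gain in test accuracy from adding the chemistry features, per model.
gains = {name: chem_acc - base_acc
         for name, (base_acc, chem_acc) in reported_test_accuracy.items()}

for name, (base_acc, chem_acc) in reported_test_accuracy.items():
    print(f"{name:<14} base={base_acc:.4f}  chem={chem_acc:.4f}  gain={gains[name]:+.4f}")
```

These figures come from a single train/test split, and the decision trees were fit without a fixed `random_state`, so small differences between models should be read with caution.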