In [1]:
import pandas as pd

In [2]:
#read in the data from csv
players = pd.read_csv('all_seasons.csv', index_col=0)
allstars = pd.read_csv('allstars.csv', index_col=0)

# players data is from kaggle: https://www.kaggle.com/justinas/nba-players-data
# it was collected from the NBA stats api and basketball reference
# allstars data was scraped from basketball reference

In [3]:
display(players[players['player_name'] == "Michael Jordan"])
print(players.season.min())

Unnamed: 0,player_name,team_abbreviation,age,player_height,player_weight,college,country,draft_year,draft_round,draft_number,...,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season
392,Michael Jordan,CHI,34.0,198.12,97.975872,North Carolina,USA,1984,1,3,...,29.6,5.9,4.3,13.4,0.042,0.132,0.331,0.567,0.208,1996-97
467,Michael Jordan,CHI,35.0,198.12,97.975872,North Carolina,USA,1984,1,3,...,28.7,5.8,3.5,8.5,0.047,0.125,0.336,0.533,0.174,1997-98
2208,Michael Jordan,WAS,39.0,198.12,97.975872,North Carolina,USA,1984,1,3,...,22.9,5.7,5.2,-0.3,0.027,0.163,0.361,0.468,0.295,2001-02
3055,Michael Jordan,WAS,40.0,198.12,97.975872,North Carolina,USA,1984,1,3,...,20.0,6.1,3.8,0.0,0.028,0.167,0.288,0.491,0.207,2002-03


1996-97


The players dataframe has stats for every player in every season from 1996-97 to now. The 2019-20 season is incomplete because of the coronavirus. To predict the current season's allstars mid-season, we should make sure the data contains no full-season totals, since a mid-season total cannot be compared with a full-season one. The only stat like this in the dataframe is gp (games played); pts, reb and ast are per-game averages. It can be converted to a fraction of the games in that season to solve the problem.

The NBA is always changing, so training a model for the current season on old data might not be smart. Limiting the data will also make everything run faster. Initially I will only use the data from 2010-11 onwards; maybe I will experiment with the older data as well later.

Players with very low counting stats like pts/reb/ast, or who appeared in very few games, are unlikely to be allstars. We can check this later, after labelling the dataframe.

In [4]:
print(players.columns)

Index(['player_name', 'team_abbreviation', 'age', 'player_height',
       'player_weight', 'college', 'country', 'draft_year', 'draft_round',
       'draft_number', 'gp', 'pts', 'reb', 'ast', 'net_rating', 'oreb_pct',
       'dreb_pct', 'usg_pct', 'ts_pct', 'ast_pct', 'season'],
      dtype='object')


In [5]:
col_to_keep = ['player_name', 'team_abbreviation', 'age', 'player_height', 'player_weight', 'gp', 'pts', 'reb', 'ast', 'net_rating', 'oreb_pct',
       'dreb_pct', 'usg_pct', 'ts_pct', 'ast_pct', 'season']
players = players[col_to_keep]

In [6]:
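# The season strings all share the 'YYYY-YY' format, so a plain string comparison works here.
# .copy() avoids a SettingWithCopyWarning when new columns are assigned later.
players = players[players.season >= '2010-11'].copy()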
print(players.season.min())
print(len(players))

2010-11
4919


To actually label the data we need the allstars dataframe, which contains all the allstars from every season, dating back to 1990.
It is also possible to create features out of this data. For example, to predict whether a player is an allstar in 2019-20, it would be useful to know whether they were an allstar in 2018-19. This is likely to be a very powerful feature, and it will be interesting to see how a classifier performs both with and without it. A similar feature would be the total number of allstar selections the player had prior to the current season.

In [7]:
allstars.head()

Unnamed: 0_level_0,Player,Age,Tm,Lg,Season
Rk,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Bam Adebayo\adebaba01,22,MIA,NBA,2019-20
2,Giannis Antetokounmpo\antetgi01,25,MIL,NBA,2019-20
3,Devin Booker\bookede01,23,PHO,NBA,2019-20
4,Jimmy Butler\butleji01,30,MIA,NBA,2019-20
5,Anthony Davis\davisan02,26,LAL,NBA,2019-20


In [8]:
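# Function to fix the name strings (strip the trailing '\id' part); this is necessary to combine data from both tables.
def fix_name(name):
    # Keep everything before the first backslash; a name without one is returned unchanged
    # (find() would return -1 there and cut off the last character).
    return name.split('\\')[0]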

test_name = 'Govert\\123'
print(test_name)
test = fix_name(test_name)
print(test)

# Actually this can also be done inline with a lambda function
allstars['Player'] = allstars.Player.apply(lambda x: x.split('\\')[0])
    

Govert\123
Govert


In [9]:
display(allstars.head())

Unnamed: 0_level_0,Player,Age,Tm,Lg,Season
Rk,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Bam Adebayo,22,MIA,NBA,2019-20
2,Giannis Antetokounmpo,25,MIL,NBA,2019-20
3,Devin Booker,23,PHO,NBA,2019-20
4,Jimmy Butler,30,MIA,NBA,2019-20
5,Anthony Davis,26,LAL,NBA,2019-20


In [10]:
def is_allstar(name, season):
    df = allstars[(allstars.Player == name) & (allstars.Season == season)]
    return len(df)


players['allstar'] = players[['player_name','season']].apply(lambda x: is_allstar(x.iloc[0], x.iloc[1]), axis=1)
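
The row-wise apply above filters the whole allstars dataframe once per player row. A merge-based alternative could look like the sketch below. It is only a sketch: `players_demo` and `allstars_demo` are toy stand-ins for the real dataframes, with just the columns that matter here.

```python
import pandas as pd

# Toy stand-ins for the notebook's dataframes (illustrative rows only).
players_demo = pd.DataFrame({
    'player_name': ['Kobe Bryant', 'Kirk Hinrich', 'Kobe Bryant'],
    'season': ['2010-11', '2010-11', '2011-12'],
})
allstars_demo = pd.DataFrame({
    'Player': ['Kobe Bryant', 'Kobe Bryant'],
    'Season': ['2010-11', '2011-12'],
})

# One row per (player, season) selection, renamed to match the players columns.
selections = (
    allstars_demo[['Player', 'Season']]
    .drop_duplicates()
    .rename(columns={'Player': 'player_name', 'Season': 'season'})
)

# A left merge keeps every players row in its original order;
# the indicator column records whether a matching selection was found.
merged = players_demo.merge(
    selections, on=['player_name', 'season'], how='left', indicator=True
)

# to_numpy() sidesteps index alignment, since merge returns a fresh 0..n-1 index.
players_demo['allstar'] = (merged['_merge'] == 'both').astype(int).to_numpy()
print(players_demo)
```

Unlike `is_allstar`, which returns the number of matching rows, this version always yields 0 or 1 because the selections are deduplicated first.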

In [11]:
players.head()

Unnamed: 0,player_name,team_abbreviation,age,player_height,player_weight,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season,allstar
6226,Kirk Hinrich,ATL,30.0,193.04,86.18248,72,10.2,2.5,4.0,-9.5,0.011,0.09,0.171,0.543,0.221,2010-11,0
6227,Kwame Brown,CHA,29.0,210.82,122.46984,66,7.9,6.8,0.7,-7.3,0.102,0.228,0.151,0.55,0.041,2010-11,0
6228,Kobe Bryant,LAL,32.0,198.12,92.98636,82,25.3,5.1,4.7,7.7,0.035,0.135,0.35,0.548,0.258,2010-11,1
6229,Kosta Koufos,DEN,22.0,213.36,120.20188,50,3.2,2.6,0.1,-6.1,0.134,0.185,0.189,0.474,0.025,2010-11,0
6230,Kris Humphries,NJN,26.0,205.74,106.59412,74,10.0,10.4,1.1,-5.3,0.125,0.322,0.173,0.555,0.069,2010-11,0


In [12]:
print(players.allstar.value_counts())
print(len(allstars[allstars.Season >= '2010-11']))

0    4667
1     252
Name: allstar, dtype: int64
259


Oops, looks like we've missed some players: only 252 rows are labelled, but there are 259 selections. Upon inspecting the dataframes, the problem seems to be that the players dataframe stores names in plain ASCII while the allstars dataframe uses Unicode, so one of them contains accented characters and the other doesn't. Let's try to fix this...
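
One way to find out which selections failed to match, rather than eyeballing the dataframes, is an anti-join: merge the allstars onto the players and keep the rows that found no partner. A minimal sketch, again on toy stand-ins (`players_demo`, `allstars_demo`) rather than the real data:

```python
import pandas as pd

# Toy stand-ins: the players table has the ASCII spelling, the allstars table the accented one.
players_demo = pd.DataFrame({
    'player_name': ['Luka Doncic', 'Devin Booker'],
    'season': ['2019-20', '2019-20'],
})
allstars_demo = pd.DataFrame({
    'Player': ['Luka Dončić', 'Devin Booker'],
    'Season': ['2019-20', '2019-20'],
})

# Left merge from the allstars side; '_merge' is 'left_only' where no players row matched.
check = allstars_demo.merge(
    players_demo,
    left_on=['Player', 'Season'],
    right_on=['player_name', 'season'],
    how='left',
    indicator=True,
)
unmatched = check.loc[check['_merge'] == 'left_only', ['Player', 'Season']]
print(unmatched)
```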

In [13]:
import unidecode

def extract_word(text):
    # Transliterate the name to plain ASCII and log every name that changes.
    newstr = unidecode.unidecode(text)
    if newstr != text:
        print("Input Text::{}".format(text))
        print("Output Text::{}".format(newstr))
    return newstr

allstars['Player'] = allstars.Player.apply(extract_word)

Input Text::Luka Dončić
Output Text::Luka Doncic
Input Text::Nikola Jokić
Output Text::Nikola Jokic
Input Text::Nikola Jokić
Output Text::Nikola Jokic
Input Text::Nikola Vučević
Output Text::Nikola Vucevic
Input Text::Goran Dragić
Output Text::Goran Dragic
Input Text::Kristaps Porziņģis
Output Text::Kristaps Porzingis
Input Text::Manu Ginóbili
Output Text::Manu Ginobili
Input Text::Manu Ginóbili
Output Text::Manu Ginobili
Input Text::Peja Stojaković
Output Text::Peja Stojakovic
Input Text::Peja Stojaković
Output Text::Peja Stojakovic
Input Text::Peja Stojaković
Output Text::Peja Stojakovic
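
If installing unidecode is not an option, the standard library can handle the names printed above. This is a sketch of a different technique (Unicode decomposition, not transliteration), and `strip_accents` is a name made up for this example:

```python
import unicodedata

def strip_accents(text):
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop the combining marks and keep every other character.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('Luka Dončić'))         # Luka Doncic
print(strip_accents('Kristaps Porziņģis'))  # Kristaps Porzingis
```

Letters that have no decomposition (for example 'ø' or 'đ') pass through unchanged, so this is less thorough than unidecode; it happens to be enough for the names in this output.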


In [14]:
players['allstar'] = players[['player_name','season']].apply(lambda x: is_allstar(x.iloc[0], x.iloc[1]), axis=1)
print(players.allstar.value_counts())
print(len(allstars[allstars.Season >= '2010-11']))


0    4660
1     259
Name: allstar, dtype: int64
259


Success! Now let's see whether it is possible to filter out some players based on low counting stats or low games played.

In [15]:
statfilter = (players.pts + players.reb + players.ast) > 10
players = players[statfilter].copy()  # copy so later column assignments don't warn
print(players.allstar.value_counts())

0    2616
1     259
Name: allstar, dtype: int64


In [16]:
test = players[players['gp'] > 20]
print(test.allstar.value_counts())

0    2495
1     257
Name: allstar, dtype: int64


Looks like filtering on games played is not very useful: it removes only 121 of the 2616 non-allstar rows, and we also lose 2 actual allstar seasons. Now let's add more features based on previous allstar seasons. The season data is not that nice to work with in its current format, so we should change that first. I will change 2019-20 to 2019, so the season listed in the dataframes is the year in which that season started. This is a bit confusing, since the 2019-20 allstar game was played in 2020, but as long as all the data follows the same convention it should be fine.
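
The conversion described here can also be written with pandas string methods instead of a lambda. A small sketch on a toy Series (`seasons` is just example input, not a column from the real data):

```python
import pandas as pd

# Example season labels in the same 'YYYY-YY' format as the dataframes.
seasons = pd.Series(['2010-11', '2019-20', '1999-00'])

# Take the part before the dash and convert it to an integer start year.
start_year = seasons.str.split('-').str[0].astype(int)
print(start_year.tolist())  # [2010, 2019, 1999]
```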

In [17]:
allstars['Season'] = allstars.Season.apply(lambda x: x[:x.find('-')])
allstars['Season'] = pd.to_numeric(allstars.Season)
allstars.head()

Unnamed: 0_level_0,Player,Age,Tm,Lg,Season
Rk,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Bam Adebayo,22,MIA,NBA,2019
2,Giannis Antetokounmpo,25,MIL,NBA,2019
3,Devin Booker,23,PHO,NBA,2019
4,Jimmy Butler,30,MIA,NBA,2019
5,Anthony Davis,26,LAL,NBA,2019


In [19]:
players['season'] = players.season.apply(lambda x: x[:x.find('-')])
players['season'] = pd.to_numeric(players.season)
players.head()

Unnamed: 0,player_name,team_abbreviation,age,player_height,player_weight,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season,allstar
6226,Kirk Hinrich,ATL,30.0,193.04,86.18248,72,10.2,2.5,4.0,-9.5,0.011,0.09,0.171,0.543,0.221,2010,0
6227,Kwame Brown,CHA,29.0,210.82,122.46984,66,7.9,6.8,0.7,-7.3,0.102,0.228,0.151,0.55,0.041,2010,0
6228,Kobe Bryant,LAL,32.0,198.12,92.98636,82,25.3,5.1,4.7,7.7,0.035,0.135,0.35,0.548,0.258,2010,1
6230,Kris Humphries,NJN,26.0,205.74,106.59412,74,10.0,10.4,1.1,-5.3,0.125,0.322,0.173,0.555,0.069,2010,0
6231,Kurt Thomas,CHI,38.0,205.74,104.32616,52,4.1,5.8,1.2,6.8,0.077,0.227,0.096,0.527,0.074,2010,0


In [20]:
def was_allstar(name, season):
    df = allstars[(allstars.Player == name) & (allstars.Season == (season -1))]
    return len(df)


players['was_allstar'] = players[['player_name','season']].apply(lambda x: was_allstar(x.iloc[0], x.iloc[1]), axis=1)

In [22]:
def total_previous_allstar(name,season):
    df = allstars[(allstars.Player == name) & (allstars.Season < season)]
    return len(df)

players['previous_allstars'] = players[['player_name','season']].apply(lambda x: total_previous_allstar(x.iloc[0], x.iloc[1]), axis=1)
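
Both features above scan the full allstars dataframe for every player row. A possible speed-up is to group the selection years per player once and then look them up. The sketch below uses made-up players 'A', 'B' and 'C' in toy dataframes; the helper names are hypothetical as well.

```python
from bisect import bisect_left

import pandas as pd

# Toy data: start years of allstar seasons per (made-up) player.
allstars_demo = pd.DataFrame({
    'Player': ['A', 'A', 'A', 'B'],
    'Season': [2016, 2017, 2019, 2018],
})
players_demo = pd.DataFrame({
    'player_name': ['A', 'A', 'B', 'B', 'C'],
    'season': [2018, 2019, 2018, 2019, 2019],
})

# Build the lookup once: player name -> sorted list of selection years.
seasons_by_player = allstars_demo.groupby('Player')['Season'].apply(sorted).to_dict()

def selected_last_season(name, season):
    # 1 if the player was selected in the season that started one year earlier.
    return int((season - 1) in seasons_by_player.get(name, []))

def prior_selections(name, season):
    # bisect_left on a sorted list counts the entries strictly below `season`.
    return bisect_left(seasons_by_player.get(name, []), season)

pairs = list(zip(players_demo['player_name'], players_demo['season']))
players_demo['was_allstar'] = [selected_last_season(n, s) for n, s in pairs]
players_demo['previous_allstars'] = [prior_selections(n, s) for n, s in pairs]
print(players_demo)
```

Players who never appear in the allstars table (like 'C' here) simply get 0 for both features.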

In [24]:
display(players[players['player_name'] == "LeBron James"])

Unnamed: 0,player_name,team_abbreviation,age,player_height,player_weight,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season,allstar,was_allstar,previous_allstars
6266,LeBron James,MIA,26.0,203.2,113.398,79,26.7,7.5,7.0,10.5,0.033,0.184,0.312,0.594,0.343,2010,1,1,6
7004,LeBron James,MIA,27.0,203.2,113.398,62,27.1,7.9,6.2,10.7,0.05,0.196,0.317,0.605,0.318,2011,1,1,7
7428,LeBron James,MIA,28.0,203.2,113.398,76,26.8,8.0,7.3,14.1,0.044,0.208,0.298,0.64,0.344,2012,1,1,8
7954,LeBron James,MIA,29.0,203.2,113.398,77,27.1,6.9,6.3,7.9,0.037,0.188,0.309,0.649,0.311,2013,1,1,9
8322,LeBron James,CLE,30.0,203.2,113.398,69,25.3,6.0,7.4,9.8,0.025,0.166,0.324,0.577,0.366,2014,1,1,10
8640,LeBron James,CLE,31.0,203.2,113.398,76,25.3,7.4,6.8,11.0,0.047,0.187,0.311,0.588,0.339,2015,1,1,11
9220,LeBron James,CLE,32.0,203.2,113.398,74,26.4,8.6,8.7,7.7,0.04,0.209,0.297,0.619,0.388,2016,1,1,12
10028,LeBron James,CLE,33.0,203.2,113.398,82,27.5,8.6,9.1,1.6,0.033,0.201,0.31,0.621,0.432,2017,1,1,13
10520,LeBron James,LAL,34.0,203.2,113.398,55,27.4,8.5,8.3,2.0,0.029,0.193,0.311,0.588,0.376,2018,1,1,14
11042,LeBron James,LAL,35.0,205.74,113.398,58,25.6,7.8,10.7,10.4,0.028,0.186,0.308,0.581,0.48,2019,1,1,15


In [29]:
# The only thing left to do is scale games played.
# Use the highest gp value in each season as a proxy for the season length.
max_games_per_season = players.groupby('season').gp.max()
display(max_games_per_season)

season
2010    83
2011    66
2012    82
2013    83
2014    83
2015    82
2016    82
2017    82
2018    82
2019    64
Name: gp, dtype: int64

What's going on here? A normal NBA regular season is 82 games. The 83-game seasons are probably due to players who were traded mid-season and ended up playing more games. The 2011 season was shortened by a lockout, the 2019 season by the coronavirus.

In [32]:
players.gp.value_counts()

82    187
81    141
80    136
79    101
76     96
     ... 
16      3
10      3
4       3
83      3
8       2
Name: gp, Length: 83, dtype: int64

In [31]:
# Only 3 instances of 83 games played, so cap the season length at 82.
max_games_per_season = max_games_per_season.apply(lambda x: min(x, 82))
display(max_games_per_season)

season
2010    82
2011    66
2012    82
2013    82
2014    82
2015    82
2016    82
2017    82
2018    82
2019    64
Name: gp, dtype: int64

In [33]:
def scale_gp(gp, season):
    return gp / max_games_per_season.loc[season]

players['scaled_gp'] = players[['gp','season']].apply(lambda x: scale_gp(x.iloc[0], x.iloc[1]), axis=1)
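
The same scaling can be done without a row-wise apply by mapping each season to its length. In the sketch below, `players_demo` is a toy dataframe and `season_length` copies three values from the capped table above. The final `clip` is an optional extra that is not part of the cell above: it keeps the few 83-game rows from ending up slightly above 1.

```python
import pandas as pd

# Toy rows; the season lengths are taken from the capped table above.
players_demo = pd.DataFrame({
    'season': [2010, 2010, 2011, 2019],
    'gp': [83, 79, 62, 58],
})
season_length = pd.Series({2010: 82, 2011: 66, 2019: 64})

# Look up each row's season length, divide, and cap the result at 1.0.
players_demo['scaled_gp'] = (
    players_demo['gp'] / players_demo['season'].map(season_length)
).clip(upper=1.0)
print(players_demo)
```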

In [34]:
display(players[players['player_name'] == "LeBron James"])

Unnamed: 0,player_name,team_abbreviation,age,player_height,player_weight,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season,allstar,was_allstar,previous_allstars,scaled_gp
6266,LeBron James,MIA,26.0,203.2,113.398,79,26.7,7.5,7.0,10.5,0.033,0.184,0.312,0.594,0.343,2010,1,1,6,0.963415
7004,LeBron James,MIA,27.0,203.2,113.398,62,27.1,7.9,6.2,10.7,0.05,0.196,0.317,0.605,0.318,2011,1,1,7,0.939394
7428,LeBron James,MIA,28.0,203.2,113.398,76,26.8,8.0,7.3,14.1,0.044,0.208,0.298,0.64,0.344,2012,1,1,8,0.926829
7954,LeBron James,MIA,29.0,203.2,113.398,77,27.1,6.9,6.3,7.9,0.037,0.188,0.309,0.649,0.311,2013,1,1,9,0.939024
8322,LeBron James,CLE,30.0,203.2,113.398,69,25.3,6.0,7.4,9.8,0.025,0.166,0.324,0.577,0.366,2014,1,1,10,0.841463
8640,LeBron James,CLE,31.0,203.2,113.398,76,25.3,7.4,6.8,11.0,0.047,0.187,0.311,0.588,0.339,2015,1,1,11,0.926829
9220,LeBron James,CLE,32.0,203.2,113.398,74,26.4,8.6,8.7,7.7,0.04,0.209,0.297,0.619,0.388,2016,1,1,12,0.902439
10028,LeBron James,CLE,33.0,203.2,113.398,82,27.5,8.6,9.1,1.6,0.033,0.201,0.31,0.621,0.432,2017,1,1,13,1.0
10520,LeBron James,LAL,34.0,203.2,113.398,55,27.4,8.5,8.3,2.0,0.029,0.193,0.311,0.588,0.376,2018,1,1,14,0.670732
11042,LeBron James,LAL,35.0,205.74,113.398,58,25.6,7.8,10.7,10.4,0.028,0.186,0.308,0.581,0.48,2019,1,1,15,0.90625


In [35]:
players.drop(columns=['gp'], inplace=True)

In [36]:
display(players[players['player_name'] == "LeBron James"])

Unnamed: 0,player_name,team_abbreviation,age,player_height,player_weight,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season,allstar,was_allstar,previous_allstars,scaled_gp
6266,LeBron James,MIA,26.0,203.2,113.398,26.7,7.5,7.0,10.5,0.033,0.184,0.312,0.594,0.343,2010,1,1,6,0.963415
7004,LeBron James,MIA,27.0,203.2,113.398,27.1,7.9,6.2,10.7,0.05,0.196,0.317,0.605,0.318,2011,1,1,7,0.939394
7428,LeBron James,MIA,28.0,203.2,113.398,26.8,8.0,7.3,14.1,0.044,0.208,0.298,0.64,0.344,2012,1,1,8,0.926829
7954,LeBron James,MIA,29.0,203.2,113.398,27.1,6.9,6.3,7.9,0.037,0.188,0.309,0.649,0.311,2013,1,1,9,0.939024
8322,LeBron James,CLE,30.0,203.2,113.398,25.3,6.0,7.4,9.8,0.025,0.166,0.324,0.577,0.366,2014,1,1,10,0.841463
8640,LeBron James,CLE,31.0,203.2,113.398,25.3,7.4,6.8,11.0,0.047,0.187,0.311,0.588,0.339,2015,1,1,11,0.926829
9220,LeBron James,CLE,32.0,203.2,113.398,26.4,8.6,8.7,7.7,0.04,0.209,0.297,0.619,0.388,2016,1,1,12,0.902439
10028,LeBron James,CLE,33.0,203.2,113.398,27.5,8.6,9.1,1.6,0.033,0.201,0.31,0.621,0.432,2017,1,1,13,1.0
10520,LeBron James,LAL,34.0,203.2,113.398,27.4,8.5,8.3,2.0,0.029,0.193,0.311,0.588,0.376,2018,1,1,14,0.670732
11042,LeBron James,LAL,35.0,205.74,113.398,25.6,7.8,10.7,10.4,0.028,0.186,0.308,0.581,0.48,2019,1,1,15,0.90625


In [37]:
players.to_csv('data.csv')

Data is ready to use for classification. I still have one idea for a feature that could be helpful: team win percentage. Team record certainly plays a role in allstar selections. Adding this feature would require more data scraping though, so maybe I will do this later.
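
Once team records are available, attaching them would be a single merge on team and season. The sketch below is purely hypothetical: the `team_records` table, its `wins`/`losses` columns, the team codes and all the numbers are invented for illustration.

```python
import pandas as pd

# Invented example rows; no real team records are used here.
players_demo = pd.DataFrame({
    'player_name': ['Player X', 'Player Y'],
    'team_abbreviation': ['AAA', 'BBB'],
    'season': [2019, 2019],
})
team_records = pd.DataFrame({
    'team_abbreviation': ['AAA', 'BBB'],
    'season': [2019, 2019],
    'wins': [40, 20],
    'losses': [20, 40],
})

# Win percentage per team and season.
team_records['team_win_pct'] = team_records['wins'] / (
    team_records['wins'] + team_records['losses']
)

# validate='many_to_one' raises if a (team, season) pair appears twice in team_records.
players_demo = players_demo.merge(
    team_records[['team_abbreviation', 'season', 'team_win_pct']],
    on=['team_abbreviation', 'season'],
    how='left',
    validate='many_to_one',
)
print(players_demo)
```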