Lab 4: Extending Logistic Regression

Cameron Matson

Zihao Mao

In [147]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os

In [148]:
# First, let's load the datasets.

data_path = '../../data/nba-players-stats-since-1950'
players = pd.read_csv(os.path.join(data_path, 'players.csv'))
players.head()

Unnamed: 0.1,Unnamed: 0,Player,height,weight,collage,born,birth_city,birth_state
0,0,Curly Armstrong,180.0,77.0,Indiana University,1918.0,,
1,1,Cliff Barker,188.0,83.0,University of Kentucky,1921.0,Yorktown,Indiana
2,2,Leo Barnhorst,193.0,86.0,University of Notre Dame,1924.0,,
3,3,Ed Bartels,196.0,88.0,North Carolina State University,1925.0,,
4,4,Ralph Beard,178.0,79.0,University of Kentucky,1927.0,Hardinsburg,Kentucky


We probably don't need the "collage" [sic] column or the birth location columns.  We probably don't need birth year either, but it might be interesting to look at generational splits.  The unnamed column also just duplicates the index, so we can drop that too.

In [149]:
players.drop(['Unnamed: 0', 'collage', 'birth_city', 'birth_state'], axis=1, inplace=True)
players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3922 entries, 0 to 3921
Data columns (total 4 columns):
Player    3921 non-null object
height    3921 non-null float64
weight    3921 non-null float64
born      3921 non-null float64
dtypes: float64(3), object(1)
memory usage: 122.6+ KB


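Good.  The columns have the expected datatypes.  Each column has 3921 non-null values out of 3922 entries, so one value is missing per column (most likely a single empty row).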
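To confirm where the missing values sit, one option is to count nulls per column and look for rows that are entirely empty. The sketch below uses a small made-up frame (`demo_players` is hypothetical, not the real `players.csv`), so the counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the players table; values are illustrative only.
demo_players = pd.DataFrame({
    'Player': ['A', 'B', None],
    'height': [180.0, 188.0, np.nan],
    'weight': [77.0, 83.0, np.nan],
    'born': [1918.0, 1921.0, np.nan],
})

# Count missing values in each column.
null_counts = demo_players.isna().sum()

# Select rows in which every column is missing.
empty_rows = demo_players[demo_players.isna().all(axis=1)]

# Drop only the fully empty rows, keeping any partially filled ones.
cleaned = demo_players.dropna(how='all')

print(null_counts)
print(len(empty_rows), len(cleaned))
```

Using `how='all'` is a deliberate choice here: it removes a blank row without discarding players who are only missing, say, a weight.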

Now let's load the players stats.

In [150]:
stats = pd.read_csv(os.path.join(data_path, 'seasons_stats.csv'))
stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24691 entries, 0 to 24690
Data columns (total 53 columns):
Unnamed: 0    24691 non-null int64
Year          24624 non-null float64
Player        24624 non-null object
Pos           24624 non-null object
Age           24616 non-null float64
Tm            24624 non-null object
G             24624 non-null float64
GS            18233 non-null float64
MP            24138 non-null float64
PER           24101 non-null float64
TS%           24538 non-null float64
3PAr          18839 non-null float64
FTr           24525 non-null float64
ORB%          20792 non-null float64
DRB%          20792 non-null float64
TRB%          21571 non-null float64
AST%          22555 non-null float64
STL%          20792 non-null float64
BLK%          20792 non-null float64
TOV%          19582 non-null float64
USG%          19640 non-null float64
blanl         0 non-null float64
OWS           24585 non-null float64
DWS           24585 non-null float64
WS          

There are a lot of fields here, and they're filled pretty inconsistently.  Some of this arises from the fact that it's such a long timeline.  For example, in 1950 there was no such thing as a 3-pointer, so it wouldn't make sense for those players to have 3P% stats.

Inspecting the dataset a little further, we notice that there is no stat for points per game (PPG).  The total number of points scored is listed, but totals are hard to compare across seasons with different numbers of games.  To make the points column a valid comparison measure, we'll only consider seasons with the current full 82-game schedule.  This doesn't reduce the power of the dataset by much: the league moved to an 82-game season in 1967, and the lockout-shortened 1998-99 season is the exception we exclude.
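An alternative to restricting the seasons would be to derive a per-game figure from the columns that are already present (`PTS` and `G`). This is not what the lab does; it is only a sketch of the idea on made-up numbers, and `demo_stats` and the `PPG` column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical season totals; not taken from the real dataset.
demo_stats = pd.DataFrame({
    'Player': ['A', 'B'],
    'G': [82.0, 41.0],
    'PTS': [1640.0, 410.0],
})

# Points per game normalizes the season total by games played.
demo_stats['PPG'] = demo_stats['PTS'] / demo_stats['G']

print(demo_stats)
```

Per-game rates make players with different numbers of games comparable, but they do not account for minutes played, so they are not a complete substitute for the schedule filter.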

Actually, we might want to limit it to seasons from 1980 onward, when the 3-pointer was introduced.  That should make the prediction task easier, although we lose even more of the dataset.  But if we consider the business case to be deciding players' positions *TODAY*, it makes sense.

In [151]:
# Keep only seasons from the 3-point era.
stats = stats[stats.Year >= 1980]
# Drop the lockout-shortened season.
stats = stats[stats.Year != 1998]
stats.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18380 entries, 5727 to 24690
Data columns (total 53 columns):
Unnamed: 0    18380 non-null int64
Year          18380 non-null float64
Player        18380 non-null object
Pos           18380 non-null object
Age           18380 non-null float64
Tm            18380 non-null object
G             18380 non-null float64
GS            17686 non-null float64
MP            18380 non-null float64
PER           18375 non-null float64
TS%           18307 non-null float64
3PAr          18295 non-null float64
FTr           18295 non-null float64
ORB%          18375 non-null float64
DRB%          18375 non-null float64
TRB%          18375 non-null float64
AST%          18375 non-null float64
STL%          18375 non-null float64
BLK%          18375 non-null float64
TOV%          18321 non-null float64
USG%          18375 non-null float64
blanl         0 non-null float64
OWS           18380 non-null float64
DWS           18380 non-null float64
WS       

Now let's just focus on a few categories:
- Player
- Year
- Age
- games played (G)
- minutes played (MP)
- field goals, field goal attempts, and percentage (FG, FGA, FG%)
- free throws (FT, FTA, FT%), two-pointers (2P, 2PA, 2P%), and three-pointers (3P, 3PA, 3P%)
- offensive, defensive, and total rebounds (ORB, DRB, TRB)
- assists (AST)
- steals (STL)
- blocks (BLK)
- turnovers (TOV)
- personal fouls (PF)
- points (PTS)

And of course our label: position.  We could probably use any of the features as a label, actually, and see whether performance in one aspect of the game can be predicted from information about another.  But for now we'll stick with predicting position.
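Most classifiers expect numeric targets, so the position strings will need to be mapped to integer class labels at some point. One possible way to do that, shown on a made-up series (`demo_pos` is hypothetical), is `pd.factorize`; the lab may well use a different encoding.

```python
import pandas as pd

# Hypothetical position labels; the real column is stats['Pos'].
demo_pos = pd.Series(['C', 'PF', 'C', 'PG', 'SG'], name='Pos')

# factorize returns integer codes plus the unique classes;
# sort=True makes the code assignment follow alphabetical order.
codes, classes = pd.factorize(demo_pos, sort=True)

print(list(codes))
print(list(classes))
```

Keeping `classes` around lets predictions be translated back to position names with `classes[predicted_codes]`.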


In [152]:
stats_to_keep = {'Player', 'Year','Pos', 'Age', 'G', 'MP', 'FG', 'FGA', 'FG%', 'FT', 'FTA', 'FT%',
                '2P', '2PA', '2P%', '3P', '3PA', '3P%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK',
                'TOV', 'PF', 'PTS'}

stats_to_drop = set(stats.columns)-stats_to_keep
stats.drop(stats_to_drop, axis=1, inplace=True)
stats.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18380 entries, 5727 to 24690
Data columns (total 27 columns):
Year      18380 non-null float64
Player    18380 non-null object
Pos       18380 non-null object
Age       18380 non-null float64
G         18380 non-null float64
MP        18380 non-null float64
FG        18380 non-null float64
FGA       18380 non-null float64
FG%       18295 non-null float64
3P        18380 non-null float64
3PA       18380 non-null float64
3P%       14969 non-null float64
2P        18380 non-null float64
2PA       18380 non-null float64
2P%       18266 non-null float64
FT        18380 non-null float64
FTA       18380 non-null float64
FT%       17657 non-null float64
ORB       18380 non-null float64
DRB       18380 non-null float64
TRB       18380 non-null float64
AST       18380 non-null float64
STL       18380 non-null float64
BLK       18380 non-null float64
TOV       18380 non-null float64
PF        18380 non-null float64
PTS       18380 non-null float64

To take care of some of the null values: when a player had 0 attempts in a shooting category (FG, 3P, 2P, FT), the percentage field was left blank (can't divide by 0).  For our purposes it's probably okay to just call it 0%.

In [153]:
stats['3P%'] = stats['3P%'].fillna(0)
stats['2P%'] = stats['2P%'].fillna(0)
stats['FT%'] = stats['FT%'].fillna(0)
stats['FG%'] = stats['FG%'].fillna(0)
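Before filling, it can be worth checking that every blank percentage really does correspond to zero attempts, since a blank with nonzero attempts would be a different kind of data problem. The sketch below does that on a made-up frame (`demo` is hypothetical) and then fills all listed percentage columns in one step.

```python
import numpy as np
import pandas as pd

# Hypothetical shooting rows; values are illustrative only.
demo = pd.DataFrame({
    '3P': [4.0, 0.0, 0.0],
    '3PA': [10.0, 0.0, 5.0],
    '3P%': [0.4, np.nan, 0.0],
})

# Check that each missing percentage comes from a row with zero attempts.
missing_pct = demo['3P%'].isna()
all_explained = bool(demo.loc[missing_pct, '3PA'].eq(0).all())

# Fill every listed percentage column at once.
pct_cols = ['3P%']
demo[pct_cols] = demo[pct_cols].fillna(0)

print(all_explained)
print(demo)
```

On the real data, `pct_cols` would presumably list all four percentage columns used above.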


Okay.  Finally, let's add the player description data to the stats dataframe.

In [205]:
stats['height'] = np.nan
stats['weight'] = np.nan
stats['born'] = np.nan

In [343]:
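# Index the player table by name for direct lookups.
iplayer = players.set_index(keys='Player')
istats = stats.reset_index(drop=True)
for i, row in istats.iterrows():
    name = row['Player']  # look up by label rather than by position
    # Copy the player's physical attributes into the matching stats row.
    istats.at[i, 'height'] = iplayer.at[name, 'height']
    istats.at[i, 'weight'] = iplayer.at[name, 'weight']
    istats.at[i, 'born'] = iplayer.at[name, 'born']
istats.head()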

Unnamed: 0,Year,Player,Pos,Age,G,MP,FG,FGA,FG%,3P,...,TRB,AST,STL,BLK,TOV,PF,PTS,height,weight,born
0,1980.0,Kareem Abdul-Jabbar*,C,32.0,82.0,3143.0,835.0,1383.0,0.604,0.0,...,886.0,371.0,81.0,280.0,297.0,216.0,2034.0,218.0,102.0,1947.0
1,1980.0,Tom Abernethy,PF,25.0,67.0,1222.0,153.0,318.0,0.481,0.0,...,191.0,87.0,35.0,12.0,39.0,118.0,362.0,201.0,99.0,1954.0
2,1980.0,Alvan Adams,C,25.0,75.0,2168.0,465.0,875.0,0.531,0.0,...,609.0,322.0,108.0,55.0,218.0,237.0,1118.0,206.0,95.0,1954.0
3,1980.0,Tiny Archibald*,PG,31.0,80.0,2864.0,383.0,794.0,0.482,4.0,...,197.0,671.0,106.0,10.0,242.0,218.0,1131.0,185.0,68.0,1948.0
4,1980.0,Dennis Awtrey,C,31.0,26.0,560.0,27.0,60.0,0.45,0.0,...,115.0,40.0,12.0,15.0,27.0,66.0,86.0,208.0,106.0,1948.0
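The row-by-row lookup works but is slow on around 18,000 rows. A vectorized alternative would be a left merge on the player name. The sketch below uses made-up frames (`demo_players` and `demo_stats` are hypothetical) and assumes the placeholder `height`, `weight`, and `born` columns have not already been added to the stats table, since they would otherwise collide with the merged columns.

```python
import pandas as pd

# Hypothetical player descriptions, one row per player.
demo_players = pd.DataFrame({
    'Player': ['A', 'B'],
    'height': [218.0, 201.0],
    'weight': [102.0, 99.0],
    'born': [1947.0, 1954.0],
})

# Hypothetical season rows; a player can appear in several seasons.
demo_stats = pd.DataFrame({
    'Year': [1980.0, 1980.0, 1981.0],
    'Player': ['A', 'B', 'A'],
    'PTS': [2034.0, 362.0, 1500.0],
})

# A left merge keeps every season row; validate raises an error
# if a player name is duplicated in the player table.
merged = demo_stats.merge(
    demo_players, on='Player', how='left', validate='many_to_one'
)

print(merged)
```

The `validate='many_to_one'` check guards against two different players sharing a name, which would otherwise silently duplicate season rows.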


In [347]:
stats = istats
stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18380 entries, 0 to 18379
Data columns (total 30 columns):
Year      18380 non-null float64
Player    18380 non-null object
Pos       18380 non-null object
Age       18380 non-null float64
G         18380 non-null float64
MP        18380 non-null float64
FG        18380 non-null float64
FGA       18380 non-null float64
FG%       18380 non-null float64
3P        18380 non-null float64
3PA       18380 non-null float64
3P%       18380 non-null float64
2P        18380 non-null float64
2PA       18380 non-null float64
2P%       18380 non-null float64
FT        18380 non-null float64
FTA       18380 non-null float64
FT%       18380 non-null float64
ORB       18380 non-null float64
DRB       18380 non-null float64
TRB       18380 non-null float64
AST       18380 non-null float64
STL       18380 non-null float64
BLK       18380 non-null float64
TOV       18380 non-null float64
PF        18380 non-null float64
PTS       18380 non-null float64
he