# Chapter 1

## The NBA Draft

In [3]:
import pandas as pd

## Reading data

Let's read in the first-round picks from the 2000 to 2009 drafts.

In [101]:
draft = pd.read_csv('../data/draft.csv', encoding='latin-1')
draft

Unnamed: 0,Rk,Year,Lg,Rd,Pk,Tm,Player,Age,Pos,Born,...,TRB,AST,STL,BLK,FG%,2P%,3P%,FT%,WS,WS/48
0,1,2009,NBA,1,1,LAC,Blake Griffin,20.106,F,us,...,8.8,4.4,0.9,0.5,0.498,0.521,0.333,0.694,75.2,0.167
1,2,2009,NBA,1,2,MEM,Hasheem Thabeet,22.135,C,tz,...,2.7,0.1,0.3,0.8,0.567,0.567,,0.578,4.8,0.099
2,3,2009,NBA,1,3,OKC,James Harden,19.308,G,us,...,5.3,6.3,1.6,0.5,0.442,0.509,0.363,0.858,133.3,0.226
3,4,2009,NBA,1,4,SAC,Tyreke Evans,19.284,G-F,us,...,4.6,4.8,1.2,0.4,0.440,0.468,0.323,0.757,28.4,0.075
4,5,2009,NBA,1,5,MIN,Ricky Rubio,18.252,G,es,...,4.2,7.8,1.9,0.1,0.391,0.416,0.326,0.840,36.4,0.102
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
288,289,2000,NBA,1,25,PHO,Jake Tsakalidis,21.024,C,ge,...,3.9,0.3,0.2,0.7,0.490,0.490,0.000,0.657,9.9,0.095
289,290,2000,NBA,1,26,DEN,Mamadou N'Diaye,25.019,C,sn,...,3.3,0.1,0.2,0.9,0.427,0.429,0.000,0.736,1.8,0.101
290,291,2000,NBA,1,27,IND,Primo_ Brezec,20.275,C,si,...,3.9,0.5,0.2,0.4,0.498,0.499,0.167,0.701,10.8,0.084
291,292,2000,NBA,1,28,POR,Erick Barkley,22.133,G,us,...,0.8,1.5,0.7,0.0,0.356,0.373,0.267,0.900,0.2,0.027


## Wrangling data

First, let's remove superfluous columns from the data.

In [102]:
draft = draft.drop(draft.columns[[2, 3]], axis=1)
# Note: these positions are 2 lower than in the book because two columns were already dropped in the line above.
draft = draft.drop(draft.columns[13:22], axis=1) 
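Dropping by position is compact, but it depends on the column order. As a minimal sketch, assuming a toy frame that reuses a few of the column names shown in the output above (the frame `toy` is hypothetical, not the real data), the same result can be reached by naming the columns:

```python
import pandas as pd

# Toy frame (hypothetical) with a few column names borrowed from the draft data.
toy = pd.DataFrame({
    'Rk': [1, 2],
    'Year': [2009, 2009],
    'Lg': ['NBA', 'NBA'],
    'Rd': [1, 1],
    'Pk': [1, 2],
})

# Positional drop: in this toy frame, positions 2 and 3 are Lg and Rd.
by_position = toy.drop(toy.columns[[2, 3]], axis=1)

# Name-based drop: same result, but unaffected if the column order changes.
by_name = toy.drop(columns=['Lg', 'Rd'])

print(by_name.columns.tolist())
```

Naming the columns also removes the need to adjust positions when the drop is done in several stages.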

Now let's remove unwanted rows; these are mostly blank.

In [103]:
draft = draft.drop([90, 131])
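The row labels 90 and 131 are hard-coded above. One way such rows could be located is by counting missing values per row. The sketch below uses a made-up toy frame and an assumed threshold of one half; neither comes from the original analysis.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) where rows 1 and 3 are largely empty.
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D'],
    'G': [224.0, np.nan, 826.0, np.nan],
    'MP': [10.5, np.nan, 34.3, np.nan],
    'WS': [4.8, np.nan, 133.3, 1.0],
})

# Share of missing cells in each row.
missing_share = toy.isna().mean(axis=1)

# Assumed rule: a row is "mostly blank" when at least half of its cells are missing.
mostly_blank = missing_share[missing_share >= 0.5].index.tolist()

# Drop the flagged rows by label, as in the cell above.
cleaned = toy.drop(mostly_blank)
print(mostly_blank)
```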

### Viewing data

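Let's get an impression of the data by looking at three rows near the top and the last three rows. Note that the slice draft[1:4] returns the rows at positions 1 to 3, so row 0 is not shown: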

In [104]:
draft[1:4]

Unnamed: 0,Rk,Year,Pk,Tm,Player,Age,Pos,Born,College,From,To,G,MP,WS,WS/48
1,2,2009,2,MEM,Hasheem Thabeet,22.135,C,tz,UConn,2010.0,2014.0,224.0,10.5,4.8,0.099
2,3,2009,3,OKC,James Harden,19.308,G,us,Arizona State,2010.0,2020.0,826.0,34.3,133.3,0.226
3,4,2009,4,SAC,Tyreke Evans,19.284,G-F,us,Memphis,2010.0,2019.0,594.0,30.7,28.4,0.075


In [105]:
draft[-3:]

Unnamed: 0,Rk,Year,Pk,Tm,Player,Age,Pos,Born,College,From,To,G,MP,WS,WS/48
290,291,2000,27,IND,Primo_ Brezec,20.275,C,si,,2002.0,2010.0,342.0,18.1,10.8,0.084
291,292,2000,28,POR,Erick Barkley,22.133,G,us,St. John's,2001.0,2002.0,27.0,9.9,0.2,0.027
292,293,2000,29,LAL,Mark Madsen,24.158,F,us,Stanford,2001.0,2009.0,453.0,11.8,8.2,0.074
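For comparison, here is a small sketch of how positional slices relate to the head() and tail() methods. The frame `toy` is hypothetical and only serves to make the row positions visible.

```python
import pandas as pd

# Toy frame (hypothetical) with six rows so the selected positions are easy to see.
toy = pd.DataFrame({'Pk': [1, 2, 3, 4, 5, 6]})

first_three = toy.head(3)   # positions 0, 1, 2
offset_three = toy[1:4]     # positions 1, 2, 3 (row 0 is skipped)
last_three = toy.tail(3)    # same rows as toy[-3:]

print(first_three['Pk'].tolist(), offset_three['Pk'].tolist(), last_three['Pk'].tolist())
```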


### Converting data to categories

If you plan to model or visualize data, converting variables that are truly categorical to factors is almost mandatory.

pandas has a data type called "category" (backed by the Categorical class), which is the closest equivalent of R's factor type.

In [106]:
draft['Year'] = draft['Year'].astype('category')
draft['Tm'] = draft['Tm'].astype('category')
draft['Born'] = draft['Born'].astype('category')
draft['From'] = draft['From'].astype('category')
draft['To'] = draft['To'].astype('category')
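As a minimal sketch of what the conversion does, the toy frame below (hypothetical, with values borrowed from the output above) converts two columns in a single astype call and then inspects the resulting levels and integer codes through the .cat accessor:

```python
import pandas as pd

# Toy frame (hypothetical) with two columns that are categorical in nature.
toy = pd.DataFrame({
    'Year': [2009, 2009, 2000],
    'Tm': ['LAC', 'MEM', 'POR'],
})

# Convert several columns at once by passing a column-to-dtype mapping.
toy = toy.astype({'Year': 'category', 'Tm': 'category'})

print(toy.dtypes)
print(toy['Tm'].cat.categories.tolist())  # the distinct levels, similar to levels() in R
print(toy['Year'].cat.codes.tolist())     # integer code of each row's level
```

Passing a mapping to astype is equivalent to the column-by-column assignments above; which form reads better is a matter of taste.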

### Creating derived variables

A field named Born2 should show whether a player was born in or outside the US.

In [107]:
draft['Born2'] = draft['Born'].apply(lambda place: 'USA' if place == 'us' else 'World').astype('category')
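The same rule can also be written without a lambda. The sketch below is an alternative, not the method used in this notebook; it assumes a toy Series of country codes and uses numpy.where to pick between the two labels.

```python
import numpy as np
import pandas as pd

# Toy Series (hypothetical) of country codes like those in the Born column.
born = pd.Series(['us', 'tz', 'es', 'us'])

# Vectorized version of the rule: 'USA' where the code is 'us', otherwise 'World'.
born2 = pd.Series(
    np.where(born == 'us', 'USA', 'World'),
    index=born.index,
).astype('category')

print(born2.value_counts())
```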

We want a second field to indicate whether or not the player came from college.

In [108]:
draft['College2'] = draft['College'].isna().map({True: 0, False: 1}).astype('category')
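An equivalent and slightly shorter formulation, shown here as a sketch on a toy Series (hypothetical values), tests for a recorded college directly and casts the boolean result to 0/1:

```python
import numpy as np
import pandas as pd

# Toy Series (hypothetical): a missing value means no college is recorded.
college = pd.Series(['UConn', 'Arizona State', np.nan, 'Stanford'])

# notna() is True where a college is recorded; casting to int gives 1 or 0.
college2 = college.notna().astype(int).astype('category')

print(college2.tolist())
```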

Finally, we want to clean up the player's position, replacing:

* C by Center
* C-F and F-C by Big
* F by Forward
* G by Guard
* F-G and G-F by Swingman

In [109]:
draft['Pos2'] = draft['Pos'].map({
    'C': 'Center',
    'C-F': 'Big',
    'F': 'Forward',
    'F-C': 'Big',
    'F-G': 'Swingman',
    'G': 'Guard',
    'G-F': 'Swingman'
}).astype('category')
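One thing to keep in mind with map is that any value missing from the dictionary becomes NaN. The sketch below uses a toy Series with one made-up code ('PG', which does not appear in the data shown here) to illustrate a quick check for unmapped values:

```python
import pandas as pd

# Toy Series; 'PG' is a made-up code that is deliberately absent from the mapping.
pos = pd.Series(['C', 'G-F', 'F', 'PG'])

mapping = {
    'C': 'Center', 'C-F': 'Big', 'F-C': 'Big',
    'F': 'Forward', 'G': 'Guard',
    'F-G': 'Swingman', 'G-F': 'Swingman',
}

pos2 = pos.map(mapping)

# Codes that were present in the input but produced NaN after mapping.
unmapped = pos[pos2.isna() & pos.notna()].unique().tolist()

print(unmapped)
```

Running such a check before the astype('category') step makes it easier to spot position codes the dictionary does not cover.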

## Exploratory data analysis

EDA is most often a mix of computing basic statistics and creating visual content. We will concentrate here on a single variable: *win shares*.

Let's start with the describe() method, which summarizes a DataFrame or Series. When applied to a DataFrame, it gives a quick overview of the central tendency, dispersion, and shape of the distribution of each numeric column. By default, non-numeric columns, such as those converted to categories above, are left out.

In [115]:
draft.describe()

Unnamed: 0,Rk,Pk,Age,G,MP,WS,WS/48
count,291.0,291.0,291.0,289.0,289.0,289.0,289.0
mean,147.243986,15.219931,20.718677,528.048443,21.55917,29.486159,0.076332
std,84.949147,8.471845,1.408725,319.388779,7.780285,33.709831,0.061509
min,1.0,1.0,17.249,6.0,4.3,-1.6,-0.326
25%,73.5,8.0,19.327,249.0,15.7,4.1,0.051
50%,148.0,15.0,21.016,549.0,21.6,19.7,0.079
75%,220.5,22.5,22.0525,790.0,27.7,43.9,0.106
max,293.0,30.0,25.019,1326.0,38.4,236.1,0.244


Note that the median is reported (correctly) as the 50th percentile, in the row labeled 50%.
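To focus on win shares alone, describe() can also be called on a single column. The sketch below uses only the five WS values visible in the first rows of the output above, so its numbers are not those of the full data set; it confirms that the 50% entry matches median().

```python
import pandas as pd

# The five WS values shown in the first rows of the draft output above.
ws = pd.Series([75.2, 4.8, 133.3, 28.4, 36.4], name='WS')

summary = ws.describe()

# The '50%' entry of describe() is the median.
print(summary['50%'], ws.median())
```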