In [2]:
import pandas as pd
import numpy as np
import altair as alt
from statsmodels.stats.weightstats import ttest_ind

## Context

In the National Basketball Association (NBA), games are played between Fall and Spring each year, ending with a set of playoff games and a championship.  One "season" of games thus spans two calendar years. For example, the 2019-2020 season began on October 22, 2019 and will end on April 15, 2020. The playoffs will begin on April 18, 2020, and will end with the NBA Finals in June 2020.

Teams in the NBA are organized into two Conferences:  East and West.  From time to time, as teams change locations or new teams are added, the conferences change.

Players join the NBA by being *drafted*, typically when they are 18 years old.

## This dataset

The dataset in this exam contains information about the "Player of the Week", awarded each week to the player(s) who performed the best during that week's games. 

Each observation in this dataset represents an instance of a certain player being awarded Player of the Week.

The variables in this dataset are:

![](./documentation.png)

In [3]:
nba = pd.read_csv('./NBA_player_of_the_week.csv')

# Data Cleaning and Adjusting

## Heights and Weights

Notice that the `Height` variable sometimes has numbers, and sometimes has the letters "cm" after it.  Similarly, the `Weight` variable sometimes has the letters "kg" after it. We need to fix this to make theses variables numeric.

In [6]:
# Version A
nba = nba.assign(
    Height = pd.to_numeric(nba['Height'].str.replace('cm','').str.replace('-','')),
    Weight = pd.to_numeric(nba['Weight'].str.replace('kg',''))
)

## Heights and Weights

Notice that the `Height` variable is sometimes recorded in inches ("5-9"), and sometimes recorded in centimeters ("221cm").  Similarly, the `Weight` variable is in pounds by default, but sometimes appears in kilograms. 

We need to fix these variables to be consistently in inches and in pounds.

In [7]:
# Version B & C
def fix_height(height):
    
    if 'cm' in height: #swap cm to inches
        return int(height.replace('cm',''))*0.394
    else: #swap feet-inches to inches
        values = height.split('-')
        return int(values[0])*12+int(values[1])
    
def fix_weight(weight):
    if 'kg' in weight: #swap to pounds
        return int(weight.replace('kg',''))*2.2
    else:
        return int(weight)*2.2

In [8]:
# Version B & C
nba = nba.assign(
    Height = nba['Height'].apply(fix_height),
    Weight = nba['Weight'].apply(fix_weight)
)

TypeError: argument of type 'int' is not iterable

# Player summaries

Number of unique players who have won "Player of the Week" in the timespan covered by this dataset:

In [9]:
nba[['Player']].drop_duplicates().count()

Player    274
dtype: int64

The teams that have had a player win the award the most times are:

In [10]:
nba['Team'].value_counts().nlargest(3)

Los Angeles Lakers     71
San Antonio Spurs      61
Cleveland Cavaliers    59
Name: Team, dtype: int64

In [11]:
nba['Player'].value_counts().nlargest(3)

LeBron James    61
Kobe Bryant     33
Kevin Durant    26
Name: Player, dtype: int64

# Positions and size

Basketball players play in "guard" positions (PG, SG, G) or "forward" positions (F, C, F-C, FC, G-F, GF, PF, SF).  It is generally expected that forwards are much taller than guards.  We will use the players in this dataset to analyze size differences between these positions.

## Unique players

We need to narrow down the dataset so that each unique player only appears once.  Since heights and weights are sometimes changing over the years, we will use the median measurement for each unique player.

In [9]:
# Version C
nba = nba.groupby('Player').agg(
    Height=('Height', 'median'),
    Weight=('Weight', 'median'),
    Position=('Position', 'first')
)

In [10]:
# Version B
nba.groupby('Player').agg(
    Height=('Height', 'median'),
    Weight=('Weight', 'median'),
    Position=('Position', 'first')
)

Unnamed: 0_level_0,Height,Weight,Position
Player,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Aaron McKie,77.0,459.8,G
Adrian Dantley,77.0,457.6,SF
Al Harrington,81.0,539.0,F
Al Horford,82.0,539.0,FC
Al Jefferson,82.0,635.8,FC
...,...,...,...
World B. Free,74.0,407.0,G
Xavier McDaniel,79.0,451.0,PF
Yao Ming,90.0,682.0,C
Zach Randolph,81.0,572.0,PF


## Refactoring position variable

We will create a new variable that identifies each player as simply a Guard or a Forward.

In [11]:
# Version A
conditions = [
    nba['Position'].str.contains('G'),
    nba['Position'].str.contains('F')
]

choices = [
    'Guard',
    'Forward'
]

nba = nba.assign(
    Position_GF = np.select(conditions, choices, None)
)
nba

Unnamed: 0_level_0,Height,Weight,Position,Position_GF
Player,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Aaron McKie,77.0,459.8,G,Guard
Adrian Dantley,77.0,457.6,SF,Forward
Al Harrington,81.0,539.0,F,Forward
Al Horford,82.0,539.0,FC,Forward
Al Jefferson,82.0,635.8,FC,Forward
...,...,...,...,...
World B. Free,74.0,407.0,G,Guard
Xavier McDaniel,79.0,451.0,PF,Forward
Yao Ming,90.0,682.0,C,
Zach Randolph,81.0,572.0,PF,Forward


In [12]:
# Version B & C
conditions = [
    nba['Position'].str.contains('PG|SG|G'),
    nba['Position'].str.contains('F|C|F-C|FC|G-F|GF|PF|SF'),
]

choices = [
    'Guard',
    'Forward'
]

nba = nba.assign(
    Position_GF = np.select(conditions, choices, None)
)
nba

Unnamed: 0_level_0,Height,Weight,Position,Position_GF
Player,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Aaron McKie,77.0,459.8,G,Guard
Adrian Dantley,77.0,457.6,SF,Forward
Al Harrington,81.0,539.0,F,Forward
Al Horford,82.0,539.0,FC,Forward
Al Jefferson,82.0,635.8,FC,Forward
...,...,...,...,...
World B. Free,74.0,407.0,G,Guard
Xavier McDaniel,79.0,451.0,PF,Forward
Yao Ming,90.0,682.0,C,Forward
Zach Randolph,81.0,572.0,PF,Forward


## Comparing Heights

The following plots show the heights and weights of players, separated by position.

In [13]:
# Version B & C
chart = alt.Chart(nba).mark_boxplot().encode(
    x=alt.X('Position_GF', axis=alt.Axis(title='Position')),
    y=alt.Y('Height')
)
chart

In [14]:
# Version B & C
chart = alt.Chart(nba).mark_boxplot().encode(
    x=alt.X('Position_GF', axis=alt.Axis(title='Position')),
    y=alt.Y('Weight')
)
chart

#### Version B
It is clear from these boxplots that Forwards tend to be taller and weight more than Guards.  

It appears from the plots that the Forwards in the dataset have a mean height of about 81 inches, while Guards have a mean height of 75.3 inches. Forwards have a mean weight of about 245 pounds, while Guards have a mean weight of 200 pounds.

#### Version C

It is clear from these boxplots that Forwards tend to be taller and weight more than Guards.  

It appears from the plots that the Forwards in the dataset have a median height of about 81 inches, while Guards have a median height of 75.3 inches. Forwards have a median weight of about 245 pounds, while Guards have a median weight of 200 pounds.

In [15]:
# Version A
chart = alt.Chart(nba).mark_bar().encode(
    x=alt.X('Position_GF', axis=alt.Axis(title='Position')),
    y=alt.Y('Height')
)
chart

In [16]:
# Version A
chart = alt.Chart(nba).mark_bar().encode(
    x=alt.X('Position_GF', axis=alt.Axis(title='Position')),
    y=alt.Y('Weight')
)
chart

#### Version A
We find that while Forwards weigh more than Guards, Guards are taller than Forwards.

# Different Guard positions

Among the Guard positions, there are two specialties:  Point Guard (PG) and Shooting Guard (SG).  We are interested in studying whether these positions also have a height difference.

The sample mean for each position is given below.

In [17]:
nba[(nba['Position'] == 'PG') | (nba['Position'] == 'SG')].groupby('Position').agg(
    Height=('Height', 'mean')
)

Unnamed: 0_level_0,Height
Position,Unnamed: 1_level_1
PG,73.848485
SG,77.158176


We will conduct a t-test at the 0.05 level.

#### Version: A

Since it appears shooting guards are taller, our null hypothesis is that the true mean height of shooting guards is greater than the true mean height of point guards.

In [35]:
# Version A
my_test = ttest_ind(
    x1 = nba[(nba['Position'] == 'PG')]['Height'],
    x2 = nba[(nba['Position'] == 'SG')]['Height'],
    alternative = 'smaller'
)
t_stat = my_test[0]
p_value = my_test[1]
deg_free = my_test[2]
print("We obtain a t statistic of {t_stat}. This yields a p-value of {p_value}.".format(t_stat = deg_free, p_value = p_value))

We obtain a t statistic of 65.0. This yields a p-value of 2.478455178844141e-09.


In [37]:
# Version B & C
my_test = ttest_ind(
    x1 = nba[(nba['Position'] == 'PG')]['Height'],
    x2 = nba[(nba['Position'] == 'SG')]['Height'],
    alternative = 'two-sided'
)
t_stat = my_test[0]
p_value = my_test[1]
deg_free = my_test[2]
print("We obtain a t statistic of {t_stat}. This yields a p-value of {p_value}.".format(t_stat = deg_free, p_value = p_value))

We obtain a t statistic of 65.0. This yields a p-value of 4.956910357688282e-09.


#### Version A & B
We fail to reject the null hypothesis, and find that there is no detectable height difference between Point Guards and Shooting Guards.

#### Version C
We reject the null hypothesis, and conclude that there is strong evidence that Shooting Guards tend to be taller than Point Guards.