In [None]:
import pandas as pd
import numpy as np
from plotnine import *
from statsmodels.stats.weightstats import ttest_ind

## Context

In the National Basketball Association (NBA), games are played between Fall and Spring each year, ending with a set of playoff games and a championship.  One "season" of games thus spans two calendar years. For example, the 2019-2020 season began on October 22, 2019 and will end on April 15, 2020. The playoffs will begin on April 18, 2020, and will end with the NBA Finals in June 2020.

Teams in the NBA are organized into two Conferences:  East and West.  From time to time, as teams change locations or new teams are added, the conferences change.

Players join the NBA by being *drafted*, typically when they are 18 years old.

## This dataset

The dataset in this exam contains information about the "Player of the Week", awarded each week to the player(s) who performed the best during that week's games. 

Each observation in this dataset represents an instance of a certain player being awarded Player of the Week.

The variables in this dataset are:

![](../documentation.png)

In [None]:
nba = pd.read_csv('../NBA_player_of_the_week.csv')

# Data Cleaning and Adjusting

## Heights and Weights

Notice that the `Height` variable sometimes contains only numbers and sometimes has the letters "cm" after it.  Similarly, the `Weight` variable sometimes has the letters "kg" after it. We need to fix this to make these variables numeric.

In [None]:
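nba = nba.assign(
    # regex=True is needed so '-[0-9]*' is treated as a pattern (pandas >= 2.0 treats it as a literal by default)
    Height = pd.to_numeric(nba['Height'].str.replace('cm', '').str.replace('-[0-9]*', '', regex=True)),
    Weight = pd.to_numeric(nba['Weight'].str.replace('kg', ''))
)
nba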
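
The cleaning step above only strips the unit suffixes; it does not put the values on a common scale. A minimal sketch of a unit-aware conversion follows. It assumes, without confirmation from the documentation, that hyphenated heights such as `'6-10'` are feet-inches and that weights without a "kg" suffix are pounds. The helper names `height_to_cm` and `weight_to_kg` are hypothetical.

```python
def height_to_cm(value):
    """Convert '203cm' or a feet-inches string like '6-10' to centimeters.

    Assumption: hyphenated values are feet-inches (not confirmed by the source).
    """
    text = str(value).strip()
    if text.endswith('cm'):
        return float(text[:-2])
    feet, inches = text.split('-')
    return (int(feet) * 12 + int(inches)) * 2.54


def weight_to_kg(value):
    """Convert '109kg' or a bare number to kilograms.

    Assumption: values without a 'kg' suffix are pounds (not confirmed by the source).
    """
    text = str(value).strip()
    if text.endswith('kg'):
        return float(text[:-2])
    return float(text) * 0.45359237


# Toy values for illustration only; these are not rows from the dataset.
example_heights = [height_to_cm(h) for h in ['6-10', '203cm']]
example_weights = [weight_to_kg(w) for w in ['240', '109kg']]
```

If those assumptions hold for the real file, the helpers could be applied to the raw columns with `nba['Height'].map(height_to_cm)` and `nba['Weight'].map(weight_to_kg)` in place of the string replacements.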

# Player summaries

Number of unique players who have won "Player of the Week" in the timespan covered by this dataset:

In [None]:
len(nba['Player'].unique())

The teams that have had a player win the award the most times are:

In [None]:
nba['Team'].value_counts().nlargest(3)

The players that have won the Player of the Week Award the most times are:

In [None]:
nba['Player'].value_counts().nlargest(3)

# Positions and size

Basketball players play in "guard" positions (PG, SG, G) or "forward" positions (F, C, F-C, FC, G-F, GF, PF, SF).  It is generally expected that forwards are much taller than guards.  We will use the players in this dataset to analyze size differences between these positions.

## Unique players

We need to narrow down the dataset so that each unique player appears only once.  Since a player's recorded height and weight can change over the years, we will use the median measurement for each unique player.
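
One way to sketch this step is a `groupby` on the player name followed by a median aggregation. The toy data frame below is made up for illustration, and taking each player's first listed position is an assumption, since the text does not say how to handle a player whose position changes.

```python
import pandas as pd

# Toy data for illustration only; these are not rows from the dataset.
toy = pd.DataFrame({
    'Player':   ['A', 'A', 'A', 'B'],
    'Position': ['PG', 'PG', 'PG', 'C'],
    'Height':   [190.0, 191.0, 193.0, 211.0],
    'Weight':   [85.0, 88.0, 90.0, 115.0],
})

# One row per player: median height and weight, first listed position (assumption).
players = (
    toy.groupby('Player', as_index=False)
       .agg(
           Position=('Position', 'first'),
           Height=('Height', 'median'),
           Weight=('Weight', 'median'),
       )
)
```

The same pattern applied to `nba` would give a one-row-per-player table that later comparisons could use, so that frequent award winners are not counted many times.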

## Refactoring position variable

We will create a new variable that identifies each player as simply a Guard or a Forward.

In [None]:
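# Guard positions as listed above; every other position (including G-F and GF) is a forward.
# Matching on the letter 'G' alone would wrongly label G-F and GF as guards.
guard_positions = ['PG', 'SG', 'G']

nba = nba.assign(
    Position_GF = np.where(nba['Position'].isin(guard_positions), 'Guard', 'Forward')
)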

## Comparing Heights

The following plots show the heights and weights of players, separated by position.

In [None]:
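# Boxplots compare the distribution of each measurement by position;
# geom_col would instead stack (sum) the values within each group.
heights = (ggplot(nba, aes(x = 'Position_GF', y = 'Height'))
          + geom_boxplot()
          + xlab('Position'))

weights = (ggplot(nba, aes(x = 'Position_GF', y = 'Weight'))
          + geom_boxplot()
          + xlab('Position'))

# Display each plot separately; displaying a list only shows its repr.
display(heights)
display(weights)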

We find that Forwards tend to be both taller and heavier than Guards.

# Different Guard positions

Among the Guard positions, there are two specialties:  Point Guard (PG) and Shooting Guard (SG).  We are interested in studying whether these positions also have a height difference.

The sample mean for each position is given below.

In [None]:
nba[(nba['Position'] == 'PG') | (nba['Position'] == 'SG')].groupby('Position').agg(
    Height=('Height', 'mean')
)

We will conduct a t-test at the 0.05 level.


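Our null hypothesis is that the true mean height of shooting guards is equal to the true mean height of point guards.  Since it appears shooting guards are taller, our alternative hypothesis is that the true mean height of shooting guards is greater than the true mean height of point guards.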

In [None]:
my_test = ttest_ind(
    x1 = nba[(nba['Position'] == 'PG')]['Height'],
    x2 = nba[(nba['Position'] == 'SG')]['Height'],
    alternative = 'smaller'  # alternative: mean of x1 (PG) is smaller than mean of x2 (SG)
)
# ttest_ind returns (t statistic, p-value, degrees of freedom)
t_stat = my_test[0]
p_value = my_test[1]
deg_free = my_test[2]
print("We obtain a t statistic of {t_stat}. This yields a p-value of {p_value}.".format(t_stat = t_stat, p_value = p_value))
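
For reference, with its default `usevar='pooled'` setting, `ttest_ind` computes the pooled two-sample t statistic. The notation here is introduced for illustration and does not appear in the original: $\bar{x}_1, s_1^2, n_1$ are the sample mean, variance, and size for `x1` (point guards), and $\bar{x}_2, s_2^2, n_2$ are the same quantities for `x2` (shooting guards).

$$
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
\mathrm{df} = n_1 + n_2 - 2
$$

With `alternative = 'smaller'`, the p-value is the lower-tail probability $P(T_{\mathrm{df}} \le t)$, so a strongly negative t statistic is evidence that point guards are shorter on average than shooting guards.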

We fail to reject the null hypothesis: the data do not provide evidence that Shooting Guards are taller, on average, than Point Guards.