# Inferential statistics
## Part I - Data Cleaning

Your family is very passionate about basketball. You always have discussions over players, games, statistics and whatnot. As you can imagine those discussions never reach a conclusion since everyone is simply sharing their opinion with no statistics to back them up!

![](../images/basket.jpg)

Since you are attending a data analysis bootcamp you'd like to take advantage of your newfound knowledge to finally put an end to your family's discussions. 

Luckily we have found a dataset containing data related to the players of the WNBA for the 2016-2017 season that we can use. 

Let's start with cleaning the data and then we'll continue with a general exploratory analysis and some inferential statistics.

### Dataset

The dataset we will be using contains the statistics from the WNBA players for the 2016-2017 season. You will be able to find more information on the dataset in the [codebook](../data/codebook.md) uploaded to the repository.

### Libraries

First, we'll import the necessary libraries and increase the maximum number of displayed columns so that you can see the whole dataset in the same window.

In [1]:
import pandas as pd
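# use the full option name; the short form 'max_columns' is ambiguous in recent pandas versions
pd.set_option('display.max_columns', 100)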

### Load the dataset

Load the dataset into a df called `wnba` and take an initial look at it using the `head()` method.

In [2]:
#your code here
wnba = pd.read_csv("../data/wnba.csv")
wnba

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
0,Aerial Powers,DAL,F,183,71.0,21.200991,US,"January 17, 1994",23,Michigan State,2,8,173,30,85,35.3,12,32,37.5,21,26,80.8,6,22,28,12,3,6,12,93,0,0
1,Alana Beard,LA,G/F,185,73.0,21.329438,US,"May 14, 1982",35,Duke,12,30,947,90,177,50.8,5,18,27.8,32,41,78.0,19,82,101,72,63,13,40,217,0,0
2,Alex Bentley,CON,G,170,69.0,23.875433,US,"October 27, 1990",26,Penn State,4,26,617,82,218,37.6,19,64,29.7,35,42,83.3,4,36,40,78,22,3,24,218,0,0
3,Alex Montgomery,SAN,G/F,185,84.0,24.543462,US,"December 11, 1988",28,Georgia Tech,6,31,721,75,195,38.5,21,68,30.9,17,21,81.0,35,134,169,65,20,10,38,188,2,0
4,Alexis Jones,MIN,G,175,78.0,25.469388,US,"August 5, 1994",23,Baylor,R,24,137,16,50,32.0,7,20,35.0,11,12,91.7,3,9,12,12,7,0,14,50,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138,Tiffany Hayes,ATL,G,178,70.0,22.093170,US,"September 20, 1989",27,Connecticut,6,29,861,144,331,43.5,43,112,38.4,136,161,84.5,28,89,117,69,37,8,50,467,0,0
139,Tiffany Jackson,LA,F,191,84.0,23.025685,US,"April 26, 1985",32,Texas,9,22,127,12,25,48.0,0,1,0.0,4,6,66.7,5,18,23,3,1,3,8,28,0,0
140,Tiffany Mitchell,IND,G,175,69.0,22.530612,US,"September 23, 1984",32,South Carolina,2,27,671,83,238,34.9,17,69,24.6,94,102,92.2,16,70,86,39,31,5,40,277,0,0
141,Tina Charles,NY,F/C,193,84.0,22.550941,US,"May 12, 1988",29,Connecticut,8,29,952,227,509,44.6,18,56,32.1,110,135,81.5,56,212,268,75,21,22,71,582,11,0


### Check NaN values
As you know, one of our first steps is to check whether there are any NaN values in the dataset. Look for the columns that contain NaN values and count how many rows have them.

In [4]:
#your code here
null_cols = wnba.isnull().sum()
null_cols[null_cols > 0]

Weight    1
BMI       1
dtype: int64

We can see that there are only two NaNs in the whole dataset, one in the Weight column and one in the BMI column. Let's look at the actual rows that contain the NaN values.

In [6]:
#your code here
wnba[pd.isnull(wnba).any(axis=1)]

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
91,Makayla Epps,CHI,G,178,,,US,"June 6, 1995",22,Kentucky,R,14,52,2,14,14.3,0,5,0.0,2,5,40.0,2,0,2,4,1,0,4,6,0,0


It looks like there is only a single row that has NaN values in it, which is good! Just in case, let's check how much removing a single row may influence our dataset by calculating the percentage of values we will be removing.

In [9]:
#your code here
percent_missing = wnba.isnull().sum() * 100 / len(wnba)
percent_missing

Name            0.000000
Team            0.000000
Pos             0.000000
Height          0.000000
Weight          0.699301
BMI             0.699301
Birth_Place     0.000000
Birthdate       0.000000
Age             0.000000
College         0.000000
Experience      0.000000
Games Played    0.000000
MIN             0.000000
FGM             0.000000
FGA             0.000000
FG%             0.000000
3PM             0.000000
3PA             0.000000
3P%             0.000000
FTM             0.000000
FTA             0.000000
FT%             0.000000
OREB            0.000000
DREB            0.000000
REB             0.000000
AST             0.000000
STL             0.000000
BLK             0.000000
TO              0.000000
PTS             0.000000
DD2             0.000000
TD3             0.000000
dtype: float64
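The same percentages can also be obtained from the mean of the boolean mask, since the mean of `True`/`False` values is the fraction of `True`. The following is a minimal sketch on a toy DataFrame with made-up values (the name `toy` and its contents are assumptions, not part of the WNBA data); it also computes the share of rows that contain at least one NaN.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) standing in for the real `wnba` frame
toy = pd.DataFrame({
    "Height": [183, 185, 170, 178],
    "Weight": [71.0, 73.0, 69.0, np.nan],
})

# The mean of a boolean mask is the fraction of True values,
# so this is the percentage of missing values per column
percent_missing = toy.isnull().mean() * 100

# Percentage of rows that contain at least one NaN
percent_rows_affected = toy.isnull().any(axis=1).mean() * 100

print(percent_missing)
print(percent_rows_affected)
```

The per-row figure is the one that matters when deciding whether to drop rows, because a single row can carry NaNs in several columns.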

It is very important to be as careful as possible when dealing with NaN values and to drop data only when it is strictly necessary. This decision can also be influenced by the nature of our analysis. If, for example, our analysis does not require the Weight and BMI of the players at all, we can simply keep the row, given that the NaN values are only present in the Weight and BMI columns.

In this specific example, let's say our decision is to drop it. Write some code to drop the NaN values. 

In [10]:
#your code here
wnba.dropna()

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
0,Aerial Powers,DAL,F,183,71.0,21.200991,US,"January 17, 1994",23,Michigan State,2,8,173,30,85,35.3,12,32,37.5,21,26,80.8,6,22,28,12,3,6,12,93,0,0
1,Alana Beard,LA,G/F,185,73.0,21.329438,US,"May 14, 1982",35,Duke,12,30,947,90,177,50.8,5,18,27.8,32,41,78.0,19,82,101,72,63,13,40,217,0,0
2,Alex Bentley,CON,G,170,69.0,23.875433,US,"October 27, 1990",26,Penn State,4,26,617,82,218,37.6,19,64,29.7,35,42,83.3,4,36,40,78,22,3,24,218,0,0
3,Alex Montgomery,SAN,G/F,185,84.0,24.543462,US,"December 11, 1988",28,Georgia Tech,6,31,721,75,195,38.5,21,68,30.9,17,21,81.0,35,134,169,65,20,10,38,188,2,0
4,Alexis Jones,MIN,G,175,78.0,25.469388,US,"August 5, 1994",23,Baylor,R,24,137,16,50,32.0,7,20,35.0,11,12,91.7,3,9,12,12,7,0,14,50,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138,Tiffany Hayes,ATL,G,178,70.0,22.093170,US,"September 20, 1989",27,Connecticut,6,29,861,144,331,43.5,43,112,38.4,136,161,84.5,28,89,117,69,37,8,50,467,0,0
139,Tiffany Jackson,LA,F,191,84.0,23.025685,US,"April 26, 1985",32,Texas,9,22,127,12,25,48.0,0,1,0.0,4,6,66.7,5,18,23,3,1,3,8,28,0,0
140,Tiffany Mitchell,IND,G,175,69.0,22.530612,US,"September 23, 1984",32,South Carolina,2,27,671,83,238,34.9,17,69,24.6,94,102,92.2,16,70,86,39,31,5,40,277,0,0
141,Tina Charles,NY,F/C,193,84.0,22.550941,US,"May 12, 1988",29,Connecticut,8,29,952,227,509,44.6,18,56,32.1,110,135,81.5,56,212,268,75,21,22,71,582,11,0
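Note that `dropna()` returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a variable to be kept. A minimal sketch on a toy DataFrame (the names `toy`, `toy_clean` and `toy_pts` and all values are made up for illustration) shows this, together with the `subset` parameter for dropping rows only when specific columns are missing:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), not the real WNBA dataset
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Weight": [71.0, np.nan, 69.0],
    "PTS": [93, 6, 218],
})

# Calling dropna() without assignment does not change `toy`
toy.dropna()
rows_before = len(toy)

# Assign the result to keep the cleaned frame
toy_clean = toy.dropna()

# Drop a row only if the columns needed for the analysis are missing;
# here PTS is complete, so no row is removed
toy_pts = toy.dropna(subset=["PTS"])

print(rows_before, len(toy_clean), len(toy_pts))
```

With the real data, the equivalent would presumably be `wnba = wnba.dropna()`.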


**Do you think it is a good decision? Think about a case in which you wouldn't want to drop the value.**

In [13]:
#your answer here

# It is only one row, representing just 0.699 percent of the dataset, and we may need those variables later.

### Let's make an overview of the dataset
First, check the data types of our data:

In [13]:
#your code here
wnba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143 entries, 0 to 142
Data columns (total 32 columns):
Name            143 non-null object
Team            143 non-null object
Pos             143 non-null object
Height          143 non-null int64
Weight          142 non-null float64
BMI             142 non-null float64
Birth_Place     143 non-null object
Birthdate       143 non-null object
Age             143 non-null int64
College         143 non-null object
Experience      143 non-null object
Games Played    143 non-null int64
MIN             143 non-null int64
FGM             143 non-null int64
FGA             143 non-null int64
FG%             143 non-null float64
3PM             143 non-null int64
3PA             143 non-null int64
3P%             143 non-null float64
FTM             143 non-null int64
FTA             143 non-null int64
FT%             143 non-null float64
OREB            143 non-null int64
DREB            143 non-null int64
REB             143 non-null int64
AST

It looks like most of the data types are correct. The Birthdate column could be cast to a `datetime` type; however, we won't use it in our analysis, so for simplicity let's leave it as an `object`. The Weight column could also be cast to an `int64` type, as all of its values are whole numbers.
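For reference, the Birthdate conversion could look roughly like the sketch below. It assumes the strings follow the "Month day, year" pattern seen in the rows displayed earlier and that month names are in English; the Series `birthdates` is a stand-in built from three of those values.

```python
import pandas as pd

# Three birthdates copied from the rows displayed above
birthdates = pd.Series(["January 17, 1994", "May 14, 1982", "August 5, 1994"])

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(birthdates, format="%B %d, %Y")

print(parsed.dt.year.tolist())
```

Passing an explicit `format` makes the parsing stricter than letting pandas guess, so unexpected strings raise an error instead of being silently misread.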

**Let's change the type of the Weight column for practice.**

In [24]:
#your code here

#wnba['Weight'].round()

# 'Int64' (capital I) is pandas' nullable integer dtype
wnba['Weight'] = wnba['Weight'].astype('Int64')
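
The difference between `'int64'` and `'Int64'` matters whenever a NaN is still present in the column. A minimal sketch, using a made-up Series `weights` rather than the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical weights with one missing value
weights = pd.Series([71.0, 73.0, np.nan])

# NumPy's int64 cannot represent a missing value, so this cast fails
try:
    weights.astype("int64")
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# The nullable 'Int64' dtype keeps the missing value as <NA>
nullable = weights.astype("Int64")

print(cast_failed)
print(nullable)
```

If every NaN has been removed beforehand, either dtype works.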


**After checking the data types, let's check for outliers using the describe() method.**

In [28]:
#your code here

stats = wnba.describe().transpose()
stats['IQR'] = stats['75%'] - stats['25%']

# For each numeric column, collect the rows lying more than
# 5 * IQR below the first quartile or above the third quartile
flagged = []

for col in stats.index:
    iqr = stats.at[col, 'IQR']
    cutoff = iqr * 5
    lower = stats.at[col, '25%'] - cutoff
    upper = stats.at[col, '75%'] + cutoff
    results = wnba[(wnba[col] < lower) |
                   (wnba[col] > upper)].copy()
    results['Outlier'] = col
    if not results.empty:
        flagged.append(results)

# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
outliers = pd.concat(flagged)

# unique names of the columns in which outliers were found
sorted(set(outliers['Outlier']))

['BLK', 'DD2', 'TD3']

In [30]:
outliers

Unnamed: 0,3P%,3PA,3PM,AST,Age,BLK,BMI,Birth_Place,Birthdate,College,DD2,DREB,Experience,FG%,FGA,FGM,FT%,FTA,FTM,Games Played,Height,MIN,Name,OREB,Outlier,PTS,Pos,REB,STL,TD3,TO,Team,Weight
46,0.0,1,0,43,28,64,22.126197,US,"May 9, 1989",Delaware,4,116,5,48.9,272,133,65.4,78,51,30,196,939,Elena Delle Donne,99,BLK,317,G/F,215,32,0,36,WAS,85
19,37.4,123,46,78,22,47,20.671696,US,"August 27, 1994",Connecticut,8,206,2,48.2,417,201,79.5,171,136,29,193,952,Breanna Stewart,43,DD2,584,F/C,249,29,0,68,SEA,77
28,35.1,114,40,127,31,53,21.208623,US,"April 19, 1986",Tennessee,10,205,10,47.8,383,183,76.5,115,88,29,193,889,Candace Parker,37,DD2,494,F/C,242,43,1,80,LA,79
37,38.3,60,23,175,28,5,22.05219,US,"August 2, 1989",Gonzaga,10,75,6,52.3,199,104,82.8,29,24,22,173,673,Courtney Vandersloot,13,DD2,255,G,88,22,0,64,CHI,66
55,33.3,60,20,50,22,13,22.477454,NG,"March 2, 1995",Kentucky,13,199,R,45.2,365,165,78.6,117,92,30,191,926,Evelyn Akhator,73,DD2,442,F,272,37,0,67,DAL,82
69,44.9,49,22,40,25,46,23.76641,US,"February 20, 1992",Nebraska,17,226,3,54.8,299,164,82.4,142,117,29,188,833,Jordan Hooper,108,DD2,467,F,334,29,0,46,CHI,84
103,36.7,49,18,63,27,14,22.351743,US,"February 7, 1990",Stanford,9,179,6,55.7,386,215,87.2,148,129,30,188,948,Nneka Ogwumike,57,DD2,577,F,236,53,0,47,LA,79
131,0.0,0,0,39,32,61,24.487297,US,"June 10, 1985",LSU,16,184,10,66.1,336,222,79.0,162,128,29,198,895,Sylvia Fowles,113,DD2,572,C,297,39,0,71,MIN,96
141,32.1,56,18,75,29,22,22.550941,US,"May 12, 1988",Connecticut,11,212,8,44.6,509,227,81.5,135,110,29,193,952,Tina Charles,56,DD2,582,F/C,268,21,0,71,NY,84
28,35.1,114,40,127,31,53,21.208623,US,"April 19, 1986",Tennessee,10,205,10,47.8,383,183,76.5,115,88,29,193,889,Candace Parker,37,TD3,494,F/C,242,43,1,80,LA,79
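
The rule applied in the cell above can be isolated into a small helper. The sketch below is an illustration under stated assumptions: the function name `iqr_outliers` and the sample values are made up, and the default multiplier of 5 simply mirrors the cell above (1.5 is the more common convention and flags many more points).

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 5.0) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical block counts with one extreme value
blocks = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 64])
flagged_blocks = iqr_outliers(blocks)

print(flagged_blocks.tolist())
```

Keep in mind that when the first and third quartiles coincide (as can happen for a column that is almost always zero) the IQR is 0, and every value different from the quartile is flagged regardless of the multiplier.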



**Comment on your result. What do you see?**

In [20]:
#your answer here

# There are some zero values in those outliers.

**Now we can save the cleaned data to a new .csv file called `wnba_clean.csv` in the data folder.**

In [21]:
#your code here
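
One way to do this is with `DataFrame.to_csv`. The sketch below writes a toy DataFrame to a temporary directory and reads it back, so it runs without the real data; the frame `toy` and its values are made up. In the notebook itself the call would presumably be `wnba.to_csv('../data/wnba_clean.csv', index=False)`, assuming the same `../data` folder used when loading the file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data (hypothetical values) standing in for the cleaned `wnba` frame
toy = pd.DataFrame({"Name": ["A", "B"], "PTS": [93, 217]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "wnba_clean.csv"

    # index=False keeps the row index out of the file; otherwise it
    # comes back as an extra "Unnamed: 0" column when the file is reloaded
    toy.to_csv(out_path, index=False)

    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```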