# Inferential statistics
## Part II - Exploratory Data Analysis

Before starting the actual analysis, it's a good idea to explore the data we will be using. This gives us a first idea of the questions we will be able to answer with the data, the biases it may contain, other data we might need, and so on.

### Libraries
In addition to pandas, we will also import matplotlib and seaborn so that we are able to plot our data and understand it better.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', 100)

### Explore the dataset

Let's load the cleaned dataset first. Import it with the name `wnba` and show the head.

In [2]:
#your code here

wnba = pd.read_csv('../data/wnba_clean.csv')

wnba.head()

Unnamed: 0.1,Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
0,0,Aerial Powers,DAL,F,183,71.0,21.200991,US,"January 17, 1994",23,Michigan State,2,8,173,30,85,35.3,12,32,37.5,21,26,80.8,6,22,28,12,3,6,12,93,0,0
1,1,Alana Beard,LA,G/F,185,73.0,21.329438,US,"May 14, 1982",35,Duke,12,30,947,90,177,50.8,5,18,27.8,32,41,78.0,19,82,101,72,63,13,40,217,0,0
2,2,Alex Bentley,CON,G,170,69.0,23.875433,US,"October 27, 1990",26,Penn State,4,26,617,82,218,37.6,19,64,29.7,35,42,83.3,4,36,40,78,22,3,24,218,0,0
3,3,Alex Montgomery,SAN,G/F,185,84.0,24.543462,US,"December 11, 1988",28,Georgia Tech,6,31,721,75,195,38.5,21,68,30.9,17,21,81.0,35,134,169,65,20,10,38,188,2,0
4,4,Alexis Jones,MIN,G,175,78.0,25.469388,US,"August 5, 1994",23,Baylor,R,24,137,16,50,32.0,7,20,35.0,11,12,91.7,3,9,12,12,7,0,14,50,0,0
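
The `Unnamed: 0` column in the output above usually appears when a DataFrame's index was written to the CSV file and then read back as a regular column. A minimal sketch of dropping such columns, using a tiny made-up CSV (the `csv_text`, `sample` and `cleaned` names are illustrative and not part of the lab):

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the real file; the empty first
# header cell mimics an index that was saved along with the data.
csv_text = """,Name,Height
0,Player A,183
1,Player B,185
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Keep only the columns whose name does not start with "Unnamed".
cleaned = sample.loc[:, ~sample.columns.str.startswith("Unnamed")]
print(cleaned.columns.tolist())
```

The same filter could be applied to `wnba` if those index columns get in the way of later steps.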


**Use describe() to take an initial look at the data.**

In [3]:
#your code here

wnba.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,142.0,70.859155,41.536891,0.0,35.25,70.5,106.75,142.0
Height,142.0,184.612676,8.698128,165.0,175.75,185.0,191.0,206.0
Weight,142.0,78.978873,10.99611,55.0,71.5,79.0,86.0,113.0
BMI,142.0,23.091214,2.073691,18.390675,21.785876,22.873314,24.180715,31.55588
Age,142.0,27.112676,3.66718,21.0,24.0,27.0,30.0,36.0
Games Played,142.0,24.429577,7.075477,2.0,22.0,27.5,29.0,32.0
MIN,142.0,500.105634,289.373393,12.0,242.25,506.0,752.5,1018.0
FGM,142.0,74.401408,55.980754,1.0,27.0,69.0,105.0,227.0
FGA,142.0,168.704225,117.165809,3.0,69.0,152.5,244.75,509.0
FG%,142.0,43.102817,9.855199,16.7,37.125,42.05,48.625,100.0


Most of the game-related stats have a very wide range of values. This can be explained by the fact that the dataset contains both players who play in the majority of games and players who may spend almost the entire season on the bench.

There are also some extremes in the weight and age columns. Feel free to check which players have a very high (or low) age or weight and do some research on them. This is useful to confirm that they are genuine outliers and not errors in the data.

In [4]:
#your code here

# wnba.nlargest(3, 'BMI')

# wnba.nlargest(3, '3PM')

# wnba.nlargest(3, '3PA')

# wnba.nlargest(3, '3P%')
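
As a small illustration of the check suggested above, `nlargest` and `nsmallest` can list the most extreme rows for a column. This is only a sketch on made-up values; the `players` DataFrame below is hypothetical and does not come from the WNBA dataset:

```python
import pandas as pd

# Made-up values for illustration only; not taken from the WNBA dataset.
players = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Age": [21, 27, 36, 24, 30],
    "Weight": [55.0, 79.0, 113.0, 71.0, 86.0],
})

# The two rows with the highest and lowest values of each column.
heaviest = players.nlargest(2, "Weight")
lightest = players.nsmallest(2, "Weight")
oldest = players.nlargest(2, "Age")

print(heaviest[["Name", "Weight"]])
print(lightest[["Name", "Weight"]])
print(oldest[["Name", "Age"]])
```

On the real data, the same calls with `wnba` in place of `players` would give the names to look up.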

### Looking at the distributions of the data
Let's take a look at the distribution of the 4 stats that describe the physical characteristics of the players.

**Plot the four distributions about `height`, `weight`, `age` and `BMI`.**

In [5]:
#your code here

import plotly.express as px

fig1 = px.histogram(wnba, x='Height', title='Height', width=1200, height=600, opacity=0.75)
fig2 = px.histogram(wnba, x='Weight', title='Weight', width=1200, height=600, opacity=0.75)
fig3 = px.histogram(wnba, x='Age', title='Age', width=1200, height=600, opacity=0.75)
fig4 = px.histogram(wnba, x='BMI', title='Body Mass Index', width=1200, height=600, opacity=0.75)
fig1.show()
fig2.show()
fig3.show()
fig4.show()

**What conclusions do you think we can draw from these plots?**

In [6]:
#your conclusions here

# Most values seem to be in the middle. For me it was surprising that the height was lower than I expected.

In addition to what the describe() method already told us, we can see that the physical characteristics of the players (apart from age) roughly follow a normal distribution. This is expected for quantities that arise from many factors that are independent of each other, as is the case for many physical measurements.

The height distribution looks bimodal. This may be because basketball players fall into two main categories (please note that this is a very gross generalization): shorter and more agile, and taller and less agile. Therefore there are fewer players of "average" height, since they will neither be as agile as the shorter players nor have the same impact in the paint (that is, under the basket) as the taller players.

The age distribution is a bit skewed to the right, which is expected since most professional players stop playing once their prime physical years come to an end.
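
Skewness can also be checked numerically instead of only by eye. The sketch below uses the pandas `Series.skew()` method on two made-up series (not the WNBA data): a positive value indicates a longer right tail, and a value near zero indicates a roughly symmetric distribution.

```python
import pandas as pd

# Made-up samples for illustration only.
right_skewed = pd.Series([1, 1, 2, 2, 3, 3, 4, 20])  # one large value in the right tail
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])         # evenly spread around the mean

print(right_skewed.skew())  # positive: the tail extends to the right
print(symmetric.skew())     # approximately zero
```

Applied to the lab data, `wnba['Age'].skew()` would give a number to compare against the visual impression from the histogram.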

We can plot the distributions of the main game stats in the same way. These are points, assists, blocks, rebounds and steals.

**Now plot the distribution of the columns `REB`, `AST`, `STL`, `PTS` and `BLK` the same way you did in the last cell.**

In [7]:
#your code here

fig1 = px.histogram(wnba, x='REB', title='Total Rebounds', width=1200, height=600, opacity=0.75)
fig2 = px.histogram(wnba, x='AST', title='Assists', width=1200, height=600, opacity=0.75)
fig3 = px.histogram(wnba, x='STL', title='Steals', width=1200, height=600, opacity=0.75)
fig4 = px.histogram(wnba, x='PTS', title='Total points', width=1200, height=600, opacity=0.75)
fig5 = px.histogram(wnba, x='BLK', title='Blocks', width=1200, height=600, opacity=0.75)
fig1.show()
fig2.show()
fig3.show()
fig4.show()
fig5.show()

**What conclusions do you think we can draw from these plots?**

In [8]:
#your conclusions here

# It seems strange that all data starts from zero.

As expected, all of the above distributions are heavily skewed to the right, since most players have very low to average stats while a handful of star players have stats far above everyone else. It is also important to remember that we are taking the stats as they are, without considering the minutes played by each player. Even though skill is a very important factor in determining these stats, players who play more minutes will, on average, accumulate more points (or blocks, assists, etc.).

**For the sake of it let's look at the same distributions by dividing those stats by the minutes played for each player in the dataset.** 

In [9]:
#your code here

# Each stat is divided by minutes played, so titles and axis labels say "per minute".
fig1 = px.histogram(wnba, x=wnba['REB'] / wnba['MIN'], title='Rebounds per minute', labels={'x':'Rebounds per minute', 'y':'count'}, width=1200, height=600, opacity=0.75)
fig2 = px.histogram(wnba, x=wnba['AST'] / wnba['MIN'], title='Assists per minute', labels={'x':'Assists per minute', 'y':'count'}, width=1200, height=600, opacity=0.75)
fig3 = px.histogram(wnba, x=wnba['STL'] / wnba['MIN'], title='Steals per minute', labels={'x':'Steals per minute', 'y':'count'}, width=1200, height=600, opacity=0.75)
fig4 = px.histogram(wnba, x=wnba['PTS'] / wnba['MIN'], title='Points per minute', labels={'x':'Points per minute', 'y':'count'}, width=1200, height=600, opacity=0.75)
fig5 = px.histogram(wnba, x=wnba['BLK'] / wnba['MIN'], title='Blocks per minute', labels={'x':'Blocks per minute', 'y':'count'}, width=1200, height=600, opacity=0.75)

fig1.show()
fig2.show()
fig3.show()
fig4.show()
fig5.show()
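
Per-minute rates can be noisy for players with very little playing time (the `describe()` output shows a minimum of 12 minutes). One possible refinement, sketched here on made-up numbers, is to compute the rate only for players above a minimum number of minutes. The `MIN_MINUTES` cutoff and the `stats` DataFrame are assumptions chosen for illustration, not part of the lab:

```python
import pandas as pd

# Made-up numbers for illustration; the column names follow the dataset.
stats = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "MIN": [12, 500, 1000],
    "PTS": [9, 200, 450],
})

# Hypothetical cutoff: ignore players with very little playing time.
MIN_MINUTES = 100

# Work on a copy so the original frame is left unchanged.
regulars = stats[stats["MIN"] >= MIN_MINUTES].copy()
regulars["PTS_per_min"] = regulars["PTS"] / regulars["MIN"]

print(regulars)
```

In this toy example player A scores 9 points in 12 minutes, which would be the highest rate of the three if it were not filtered out, even though it rests on very little data.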

**What conclusions do you think we can draw from these plots?**

In [10]:
#your conclusions here

# The data seems to be better divided now.

### Summary

The main insights we obtained from this exploratory analysis are:
- Game-related stats have a very wide range of values.
- There are some extremes in the weight and age columns.
- The physical characteristics of the players roughly follow a normal distribution.
- We need to take into account that our dataset contains both players who play in the majority of games and players who may spend almost the entire season on the bench.

Now it's time to try to put an end to your family's discussions. As seen in the README, the main discussions are:
- Your grandmother says that your sister couldn't play in a professional basketball league (not only the WNBA, but ANY professional basketball league) because she's too skinny and lacks muscle.
- Your sister says that most female professional players fail their free throws.
- Your brother-in-law heard on TV that the average number of assists among NBA (male) and WNBA (female) players is 52 for the 2016-2017 season. He is convinced this average would be higher if we only considered the players from the WNBA.

**Do you think you have all the necessary data to answer these questions?**

In [11]:
#your comments here

# I think I have all the data to answer these questions.
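
A quick way to start on this question is to list which columns each discussion seems to need and check whether they exist. The mapping below is an assumption made for illustration, and `available` is just a subset of the column names shown in the `head()` output above. Having a column is necessary but not sufficient: it says nothing about whether the sample (WNBA players only) is the right population for each question.

```python
# Subset of column names copied from the head() output of the dataset.
available = ["Name", "Team", "Pos", "Height", "Weight", "BMI", "Age",
             "MIN", "FT%", "AST", "PTS"]

# Hypothetical mapping from each discussion to the columns it appears to need.
needed = {
    "too skinny": ["Weight", "BMI"],
    "free throws": ["FT%"],
    "assists": ["AST"],
}

# For each discussion, collect the required columns that are not available.
missing = {
    question: [col for col in cols if col not in available]
    for question, cols in needed.items()
}

print(missing)
```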