### Key concepts
1. Distributions are where data comes from

2. Gaussian distribution

3. Mean & standard deviation

4. Percentiles & outliers

5. Power law distribution

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import random
from IPython.display import clear_output  # public import path for clear_output

def coin_flip(p=0.5):
    val = random.random()
    if val < p:
        return 1 # Heads
    else:
        return 0 # Tails

def roll_dice():
    val1 = int(random.random() * 6) + 1
    val2 = int(random.random() * 6) + 1
    return val1 + val2
    
def coin_flips(n=10):
    res = 0
    for i in range(n):
        res += coin_flip()
    return res

def random_avg(n=10):
    res = 0
    for i in range(n):
        res += 10 * random.random()
    return res / n

def simulate(f, runs=100):
    return pd.DataFrame({'Val': [f() for i in range(runs)]})

def show_counts(df):
    df = df.assign(Count=1)
    df = df.groupby('Val').count()
    return df

def histogram(df, bins=10, range=None):
    if isinstance(df, pd.Series):
        df = pd.DataFrame({df.name: df})
    counts = df.assign(Count=1)
    col_name = df.columns[0]
    counts = counts.groupby(col_name).count()
    df.plot.hist(bins=bins, range=range)
    plt.show()
    if counts.shape[0] < 30:
        return counts

### 1. Distributions
We think of data as coming from a distribution. A distribution is a random process that produces individual data points.  
A distribution is like a person, and data is like a photograph of that person.
The more data points you have, the more you know about the true distribution, like a high-resolution photograph.
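
As a rough illustration of the photograph analogy, the sketch below flips a simulated fair coin and prints the fraction of heads for increasing sample sizes. The helper name `fraction_heads` and the seed value are assumptions made for this example; they are not part of the notebook's own helpers.

```python
import random

def fraction_heads(n, rng):
    """Flip a fair coin n times and return the fraction of heads."""
    heads = sum(1 for _ in range(n) if rng.random() < 0.5)
    return heads / n

rng = random.Random(0)  # fixed seed (an arbitrary choice) so the run is repeatable
for n in (10, 100, 10_000, 100_000):
    print(n, fraction_heads(n, rng))
```

With only a few flips the fraction can land far from 0.5. With many flips it should settle close to the true value.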

#### Dice Rolls

We can analyze a distribution by making repeated random simulations and plotting a histogram of the results. Here, we simulate rolling a pair of dice 100 times and adding up the two values each time.

In [None]:
histogram(simulate(f=roll_dice, runs=100), bins=range(2, 14))

To understand why the distribution has this shape, consider that every square in this diagram represents an equally likely outcome.

<img src="../images/dicediagram.jpg" style="width: 300px;">
[(source)](https://casmusings.wordpress.com/about/)
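
The same counting argument can be sketched in code: enumerate all 36 equally likely pairs and count how many produce each sum. This is a minimal sketch that uses only the standard library; the variable name `ways` is chosen for this example.

```python
from collections import Counter
from itertools import product

# All 36 equally likely (die1, die2) pairs, grouped by their sum.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

for total in sorted(ways):
    # Columns: sum, number of pairs that give it, probability.
    print(total, ways[total], f"{ways[total] / 36:.3f}")
```

A sum of 7 can be made in six ways, while 2 and 12 can each be made in only one, which is why the histogram peaks in the middle.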

#### Coin Flips
Here is a simulation of flipping a coin 10 times and counting the heads. As the number of flips grows, the distribution of the head count approaches a bell curve.

In [None]:
histogram(simulate(f=coin_flips, runs=100), bins=range(0, 11))
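
For comparison with the simulated histogram, the exact probabilities can be computed directly. The sketch below assumes a fair coin and 10 flips, matching the defaults of `coin_flips`. The name `probs` is introduced only for this example.

```python
from math import comb

n = 10  # number of flips, matching coin_flips(n=10)

# Probability of exactly k heads in n fair flips: C(n, k) / 2**n
probs = {k: comb(n, k) / 2**n for k in range(n + 1)}

for k, p in probs.items():
    print(k, f"{p:.4f}")
```

The simulated counts from 100 runs will only roughly follow these values. More runs should bring them closer.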

## TRY THIS

In [None]:
# Implement a distribution.
# This function should flip a coin repeatedly until it comes up heads, then return the number of flips required.
def coin_flips_to_head():
    return ...

In [None]:
histogram(simulate(f=coin_flips_to_head))

### 2. The Gaussian (Normal) distribution
A Gaussian distribution is a bell curve, where data clusters around a central mean.
It is one of the most common distributions in real data. Height, test scores, and commute times all look roughly Gaussian.
It's so common because the sum (or average) of many independent random numbers tends toward a Gaussian distribution, a result known as the central limit theorem.

In [None]:
# The more data points we have, the more closely our data resembles the true distribution.
flips = []
for _ in range(100):
    flips.append(simulate(f=random_avg, runs=100))
    clear_output(wait=True)
    histogram(pd.concat(flips), bins=50, range=(0, 10))

## Real-world Data: NBA Player Statistics 🏀

Let's load a dataset of NBA player statistics to see how these distributions come up in the real world.

![](../images/nickyoung.gif)

In [None]:
nba = pd.read_csv('../datasets/nba.csv')
nba = nba.drop('Unnamed: 0', axis=1)
for col in nba.columns:
    if col not in ['Player', 'Pos', 'Tm']:
        nba[col] = pd.to_numeric(nba[col])

The dataset has some basic statistics for each player:

- **Player**: The player's name
- **Tm**: The player's team. *Players who were traded within the season are listed multiple times.*
- **Pos**: The player's position ([reference](https://en.wikipedia.org/wiki/Basketball_positions))
- **Age**: The player's age
- **2P, 2PA, 2P%**: The player's 2-pointer stats (goals made, attempts, percentage)
- **3P, 3PA, 3P%**: The player's 3-pointer stats
- **FT, FTA, FT%**: The player's free throw stats
- Plenty of others, see [this reference](https://en.wikipedia.org/wiki/Basketball_statistics)

Use `.head()` to have a look at the first few rows to familiarize yourself with the data.

In [None]:
nba.head()

### 3-Pointer histogram

Here is a histogram of 3-point shooting percentage (for players with at least 20 attempts):

In [None]:
histogram(nba[nba['3PA'] >= 20]['3P%'])

## TRY THIS: Make a histogram of each player's free throw shooting %


In [None]:
# Your code here

Let's see who is in the bottom bucket.

In [None]:
nba[nba.FTA > 20].sort_values('FT%')[['FT', 'FTA', 'Player']].head(8)

In [None]:
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/L_Knvctdm-8?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

## TRY THIS: Plot a histogram of NBA players' ages. Is it a Gaussian distribution?

In [None]:
histogram(...)

### 3. Mean and Standard Deviation

1. The mean is the average value of a distribution. For the Gaussian distribution, it is also the middle value (the median).
2. The standard deviation measures how far values typically fall from the mean. It is the square root of the average squared distance from the mean.
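
To make the two definitions concrete, here is a minimal sketch that computes both by hand on a small made-up series (the numbers are toy values, not taken from the NBA dataset) and compares the result with pandas. Note that `Series.std()` divides by `n - 1` by default (the sample standard deviation); passing `ddof=0` divides by `n` instead.

```python
import math
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # toy data

mean = values.sum() / len(values)
squared_dev = (values - mean) ** 2  # squared distance of each value from the mean

population_std = math.sqrt(squared_dev.sum() / len(values))    # divide by n
sample_std = math.sqrt(squared_dev.sum() / (len(values) - 1))  # divide by n - 1

print(mean, population_std, sample_std)
print(values.std(), values.std(ddof=0))  # pandas: sample std, then population std
```

For large datasets the two versions are nearly identical. The difference only matters when there are few data points.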

In [None]:
nba[nba['3PA'] >= 20]['3P%'].mean()

In [None]:
nba[nba['3PA'] >= 20]['3P%'].std()

## TRY THIS

What are the mean and standard deviation of age?

In [None]:
# Your code here

The standard deviation is a way to quickly understand how rare a particular data point is. For a Gaussian distribution:

1. About 68% of points fall within 1 standard deviation of the mean
2. About 95% fall within 2 standard deviations
3. About 99.7% fall within 3 standard deviations
4. About 99.99% fall within 4 standard deviations
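
These percentages can be checked with the Gaussian error function: the fraction of a Gaussian within k standard deviations of the mean is `erf(k / sqrt(2))`. The helper name `within_k_std` is an assumption made for this sketch.

```python
from math import erf, sqrt

def within_k_std(k):
    """Fraction of a Gaussian that lies within k standard deviations of its mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3, 4):
    print(k, f"{within_k_std(k):.4%}")
```

These figures hold only for a true Gaussian. Real data with fatter tails will have more points far from the mean.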

### 4. Outliers

Real data is rarely exactly normally distributed. In particular, extreme values tend to occur more often than a Gaussian would predict. Generally, we consider points more than 3 standard deviations from the mean to be outliers.
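
One way to apply the 3-standard-deviation rule is to convert each value to a z-score (its distance from the mean in units of standard deviation) and keep the values whose absolute z-score exceeds the threshold. The sketch below uses made-up numbers; the function name `flag_outliers` and its `threshold` parameter are hypothetical and introduced only for illustration.

```python
import pandas as pd

def flag_outliers(series, threshold=3.0):
    """Return the entries more than `threshold` standard deviations from the mean."""
    z_scores = (series - series.mean()) / series.std()
    return series[z_scores.abs() > threshold]

# Toy data: twenty values near 10 plus one extreme value.
toy = pd.Series([9, 10, 11] * 6 + [10, 10, 60])
print(flag_outliers(toy))
```

Keep in mind that an extreme value also inflates the standard deviation it is measured against, so in very small samples even a wild point may not cross the threshold.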

## TRY THIS

Which players are outliers in:
1. 3-point %
2. Age
3. Free-throw % (consider players with at least 100 attempts)

In [None]:
# Your code here

### 5. Power law distribution

There are lots of distributions that are not Gaussian. In particular, many distributions are fat-tailed, with a small number of samples having an outsized effect on the overall sum and the mean:

1. Spotify plays per artist
2. Purchases on Amazon
3. Income
4. Wealth
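
To get a feel for a fat tail, the sketch below draws samples from a Pareto distribution (one common power-law model) using the standard library and compares the mean with the median. The shape parameter `1.5`, the sample size, and the seed are all assumptions picked for illustration, not values fitted to any of the examples above.

```python
import random
import statistics

rng = random.Random(42)  # arbitrary seed so the run is repeatable

# paretovariate(alpha) draws from a Pareto distribution; alpha=1.5 is an assumed shape.
samples = [rng.paretovariate(1.5) for _ in range(10_000)]

mean = statistics.mean(samples)
median = statistics.median(samples)

# Share of the overall total held by the largest 1% of samples.
top = sorted(samples, reverse=True)[: len(samples) // 100]
top_share = sum(top) / sum(samples)

print(mean, median, top_share)
```

The mean sits well above the median because a handful of very large samples pull it up, which is the "outsized effect" described above.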

In [None]:
# total points scored looks like this
histogram(nba['PTS'])

In [None]:
nba['PTS'].mean()

In [None]:
# The median player scores far fewer points than the mean number of points scored
nba['PTS'].median()

## TRY THIS

How many players account for 50% of the points scored?

**Hint**:
1. Sort the `nba` dataset by `PTS`.
2. Add a new column `PointsSoFar`, using the `cumsum` function to calculate the cumulative sum of points.
3. Add a new column `PointsRatio`, which is `PointsSoFar` divided by the total points scored (`nba['PTS'].sum()`).
4. Find where `PointsRatio` goes over 50% (you can count the number of players where `PointsRatio < 0.5`).
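
The same four steps can be sketched on a tiny made-up table before trying them on the real data. The column names `Name`, `Score`, `ScoreSoFar`, and `ScoreRatio` and all the values are hypothetical, chosen only to show the pattern.

```python
import pandas as pd

# Toy data with hypothetical names and scores (not the NBA dataset).
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Score': [10, 40, 4, 25, 6, 15],
})

# Step 1: sort from highest to lowest score.
ranked = toy.sort_values('Score', ascending=False).reset_index(drop=True)
# Step 2: running total of scores.
ranked['ScoreSoFar'] = ranked['Score'].cumsum()
# Step 3: running total as a fraction of the overall total.
ranked['ScoreRatio'] = ranked['ScoreSoFar'] / ranked['Score'].sum()

# Step 4: rows still below half of the total; the next row is the one that crosses 50%.
below_half = (ranked['ScoreRatio'] < 0.5).sum()
print(ranked)
print(below_half)
```

Here only the top row stays below half of the total, so the second row is the one that pushes the running share past 50%.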

In [None]:
# Sort the dataset by points. ascending=False means reverse the sort.
# Fill in the column we want to sort by.
sorted_data = nba.sort_values(..., ascending=False)

In [None]:
# Change this code to calculate PointsSoFar.
sorted_data['PointsSoFar'] = sorted_data[...].cumsum()

In [None]:
# Change this code to calculate the PointsRatio.
sorted_data['PointsRatio'] = sorted_data[...] / sorted_data[...].sum()

In [None]:
# Change this code to filter out rows with PointsRatio > 0.5 and count how many rows are left.
sorted_data[...]

The `PointsRatio` column answers the question "what fraction of all points is scored by the top n players?" for every value of n. Using the plot below, how many players account for 20% of all points scored?

In [None]:
sorted_data.PointsRatio.reset_index(drop=True).plot()

Why do some players score so much more than others? Hint: have a look at what the distribution looks like when we divide by minutes played:

In [None]:
nba['PointsPerMinute'] = nba['PTS'] / nba['MP']
histogram(nba[nba['MP'] >= 100]['PointsPerMinute'])

In [None]:
nba[nba['MP'] >= 100].sort_values('PointsPerMinute', ascending=False)