# <center> Continuous Distributions </center>
## <center> Normal and Standard Normal</center>

In [1]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import pandas as pd
from itertools import *


<tr>
    <center>
    <td><img src='pmf_one_die.png' width="400" height="400"> </td>
    <td><img src='pmf_two_dice.png' width="400" height="400"></td>
    <td><img src='pmf_three_dice.png' width="400" height="400"></td>
    </center>
<tr>
    <center>
    <td><img src='cdf_one_die.png' width="400" height="400"> </td>
    <td><img src='cdf_two_dice.png' width="400" height="400"></td>
    <td><img src='cdf_three_dice.png' width="400" height="400"></td>
    </center>

What if our die had an infinite number of sides?

<center><img src='infinite_dice.gif' width='500' height='500'></center>

Simulating an infinite die that can land on any real value between 1 and 6...

In [None]:
def infinite_dice_roll():
    # Draw one continuous value uniformly between 1 and 6.
    return np.random.uniform(1, 6)

In [None]:
infinite_dice_roll()

In [None]:
one_die_infinite_rolls=[infinite_dice_roll() for x in range(100)]

In [None]:
one_die_infinite_outcomes = dict(Counter(one_die_infinite_rolls))
one_die_infinite_outcomes
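
A quick check of what the counting step produces, as a sketch: with a continuous die, essentially every roll is a distinct value, so each outcome is counted exactly once. The names `rng`, `sample_rolls`, and `roll_counts` are introduced here for illustration and are not part of the original notebook.

```python
import numpy as np
from collections import Counter

# Seeded generator so the illustration is repeatable (the seed is arbitrary).
rng = np.random.default_rng(0)
sample_rolls = rng.uniform(1, 6, size=100)

# Count how often each distinct value appears.
roll_counts = Counter(sample_rolls.tolist())
print(len(roll_counts), "distinct values out of", len(sample_rolls), "rolls")
print("largest count for any single value:", max(roll_counts.values()))
```

Counting works for a six-sided die because outcomes repeat; here they do not, which is why a different way of describing probability is needed.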

How could we calculate the probabilities for each possible value?

In [None]:
probs = np.zeros(100)

### <center>The probability that a continuous random variable takes any specific value is zero.

In [None]:
plt.scatter(one_die_infinite_rolls, probs)
plt.title('One Infinite Die Roll')
plt.xlabel('Roll')
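
Since single points carry zero probability, probabilities for a continuous variable are assigned to intervals instead. The sketch below assumes the infinite die is modelled by `scipy.stats.uniform` on [1, 6]; the helper `interval_probability` is a hypothetical name used only for this illustration.

```python
from scipy import stats

# Continuous uniform on [1, 6]: scipy parameterises it as loc=1, scale=6-1.
infinite_die = stats.uniform(loc=1, scale=5)

def interval_probability(dist, low, high):
    """P(low <= X <= high) for a continuous distribution, via its CDF."""
    return dist.cdf(high) - dist.cdf(low)

# Probability of landing between 2 and 3.
p_2_to_3 = interval_probability(infinite_die, 2, 3)
print("P(2 <= X <= 3) =", p_2_to_3)

# Shrinking the interval around 2.5 drives the probability toward zero.
for width in [1.0, 0.1, 0.001, 0.0]:
    p = interval_probability(infinite_die, 2.5 - width / 2, 2.5 + width / 2)
    print(f"width {width}: probability {p}")
```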

Using a distribution plot.

In [None]:
sns.distplot(one_die_infinite_rolls)
plt.title('One Infinite Die Roll')
plt.xlabel('Roll')

Now for two and three infinite dice.

In [None]:
two_dice_infinite_rolls=[infinite_dice_roll()+infinite_dice_roll() for x in range(100)]
three_dice_infinite_rolls=[infinite_dice_roll()+infinite_dice_roll()+infinite_dice_roll() for x in range(100)]

Scatter plots.

In [None]:
plt.scatter(two_dice_infinite_rolls, probs)
plt.title('Two Infinite Dice Roll')
plt.xlabel('Roll sum')
plt.show()
plt.scatter(three_dice_infinite_rolls, probs)
plt.title('Three Infinite Dice Roll')
plt.xlabel('Roll sum')

And distribution plots. 

In [None]:
sns.distplot(two_dice_infinite_rolls)
plt.show()
sns.distplot(three_dice_infinite_rolls)

Now let's increase the number of rolls.

In [None]:
one_die_infinite_rolls=[infinite_dice_roll() for x in range(10000)]
two_dice_infinite_rolls=[infinite_dice_roll()+infinite_dice_roll() for x in range(10000)]
three_dice_infinite_rolls=[infinite_dice_roll()+infinite_dice_roll()+infinite_dice_roll() for x in range(10000)]

Distribution plots.

In [None]:
sns.distplot(one_die_infinite_rolls)
plt.title('One Infinite Die Roll')
plt.xlabel('Roll')
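
For a single continuous die, the density should be flat at a height of 1/(6-1) = 0.2, because the total area over [1, 6] has to equal 1. The following is a minimal numeric check of that claim, assuming a uniform die on [1, 6]; the sample size and bin count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed, for repeatability
many_rolls = rng.uniform(1, 6, size=100_000)

# density=True rescales bar heights so the total area of the bars is 1.
heights, edges = np.histogram(many_rolls, bins=10, range=(1, 6), density=True)

expected_height = 1 / (6 - 1)           # uniform density on [1, 6]
print("bar heights:", np.round(heights, 3))
print("expected height:", expected_height)
```

The bar heights are densities, not probabilities: the probability of a bin is its height multiplied by its width.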

In [None]:
sns.distplot(two_dice_infinite_rolls)
plt.title('Two Infinite Dice Rolls')
plt.xlabel('Roll sum')

In [None]:
sns.distplot(three_dice_infinite_rolls)
plt.title('Three Infinite Dice Rolls')
plt.xlabel('Roll sum')

<tr>
    <center>
    <td><img src='pmf_one_die.png' width="400" height="400"> </td>
    <td><img src='pmf_two_dice.png' width="400" height="400"></td>
    <td><img src='pmf_three_dice.png' width="400" height="400"></td>
    </center>
<tr>
    <center>
    <td><img src='pmf_one_die_inf.png' width="400" height="400"> </td>
    <td><img src='pmf_two_dice_inf.png' width="400" height="400"></td>
    <td><img src='pmf_three_dice_inf.png' width="400" height="400"></td>
    </center>

### Plotting CDF for a continuous random variable

Using histograms to bin the continuous values.

In [None]:
counts, bin_edges = np.histogram(one_die_infinite_rolls, bins=100)
cumulative_probs = np.cumsum(np.divide(counts,10000))
plt.plot(bin_edges[1:], cumulative_probs)
plt.title('One Infinite Die Roll CDF')
plt.xlabel('Roll')

In [None]:
counts, bin_edges = np.histogram(two_dice_infinite_rolls, bins=100)
cumulative_probs = np.cumsum(np.divide(counts,10000))
plt.plot(bin_edges[1:], cumulative_probs)
plt.title('Two Infinite Dice Rolls CDF')
plt.xlabel('Roll sum')

In [None]:
counts, bin_edges = np.histogram(three_dice_infinite_rolls, bins=100)
cumulative_probs = np.cumsum(np.divide(counts,10000))
plt.plot(bin_edges[1:], cumulative_probs)
plt.title('Three Infinite Dice Rolls CDF')
plt.xlabel('Roll sum')
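
The binned CDFs above depend on the number of bins chosen. One alternative, sketched here, is the empirical CDF, which sorts the sample and needs no bins. The comparison against `scipy.stats.uniform` assumes a single die that is uniform on [1, 6]; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)          # arbitrary seed
sample = rng.uniform(1, 6, size=10_000)

# Empirical CDF: after sorting, the i-th smallest value has i/n of the data
# at or below it.
sorted_sample = np.sort(sample)
ecdf = np.arange(1, len(sorted_sample) + 1) / len(sorted_sample)

# Compare with the exact CDF of a uniform distribution on [1, 6].
exact_cdf = stats.uniform(loc=1, scale=5).cdf(sorted_sample)
max_gap = np.max(np.abs(ecdf - exact_cdf))
print("largest gap between empirical and exact CDF:", max_gap)
```

Passing `sorted_sample` and `ecdf` to `plt.plot` would draw the same kind of curve as the histogram-based cells.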

<tr>
    <center>
    <td><img src='cdf_one_die.png' width="400" height="400"> </td>
    <td><img src='cdf_two_dice.png' width="400" height="400"></td>
    <td><img src='cdf_three_dice.png' width="400" height="400"></td>
    </center>
<tr>
    <center>
    <td><img src='cdf_one_die_inf.png' width="400" height="400"> </td>
    <td><img src='cdf_two_dice_inf.png' width="400" height="400"></td>
    <td><img src='cdf_three_dice_inf.png' width="400" height="400"></td>
    </center>

Consider rolling 100 infinite dice 10,000 times.

In [None]:
hundred_dice_rolls = []
for _ in range(10000):
    # Each trial is the sum of 100 independent continuous rolls.
    roll_sum = 0
    for _ in range(100):
        roll_sum += infinite_dice_roll()
    hundred_dice_rolls.append(roll_sum)
sns.distplot(hundred_dice_rolls)
plt.title('One Hundred Infinite Dice Rolls')
plt.xlabel('Roll sum')

In [None]:
counts, bin_edges = np.histogram(hundred_dice_rolls, bins=100)
cumulative_probs = np.cumsum(np.divide(counts,10000))
plt.plot(bin_edges[1:], cumulative_probs)
plt.title('One Hundred Infinite Dice Rolls CDF')
plt.xlabel('Roll sum')

### <center> Central Limit Theorem </center> <br>
<center>The sum of a large number of independent, identically distributed random variables with finite variance will be approximately normally distributed, regardless of the shape of the underlying distribution.
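
To make the theorem concrete for the 100-dice experiment: means add and variances add for independent rolls, so the sum should be centred near 100 times the single-roll mean, with a standard deviation of the square root of 100 times the single-roll variance. The sketch below assumes each roll is uniform on [1, 6] and uses a vectorised simulation rather than the notebook's loop.

```python
import numpy as np

n_dice, n_trials = 100, 10_000
low, high = 1, 6

# Mean and variance of a single continuous uniform roll on [low, high].
single_mean = (low + high) / 2
single_var = (high - low) ** 2 / 12

# For a sum of independent rolls, means add and variances add.
predicted_mean = n_dice * single_mean
predicted_sd = np.sqrt(n_dice * single_var)

rng = np.random.default_rng(3)          # arbitrary seed
sums = rng.uniform(low, high, size=(n_trials, n_dice)).sum(axis=1)

print("predicted mean, sd:", predicted_mean, predicted_sd)
print("simulated mean, sd:", sums.mean(), sums.std())
```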

Consider a loaded die...

In [None]:
loaded_die = [1,2,3,6,6,6]

One roll of the loaded die, repeated 1,000 times.

In [None]:
rolls = [np.random.choice(loaded_die) for x in range(1000)]
sns.distplot(rolls)
plt.title('Loaded Die Roll')
plt.xlabel('Roll')

The sum of ten rolls of the loaded die, repeated 1,000 times.

In [None]:
# Draw ten independent rolls at once and sum them, repeated 1,000 times.
rolls = [np.random.choice(loaded_die, size=10).sum() for x in range(1000)]
sns.distplot(rolls)
plt.title('Ten Loaded Die Rolls')
plt.xlabel('Roll sum')
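
Even though a single loaded roll is far from bell-shaped, the sum of ten rolls can be compared against a normal distribution with matching mean and variance. This is a rough sketch: the threshold of 45 and the sample size are arbitrary, and with only ten rolls the approximation is expected to be close but not exact.

```python
import numpy as np
from scipy import stats

loaded_die = np.array([1, 2, 3, 6, 6, 6])
n_rolls = 10

# Each face is equally likely, so population moments are plain averages
# over the faces.
die_mean = loaded_die.mean()
die_var = loaded_die.var()

# Normal approximation suggested by the Central Limit Theorem.
approx = stats.norm(n_rolls * die_mean, np.sqrt(n_rolls * die_var))

rng = np.random.default_rng(4)          # arbitrary seed
sums = rng.choice(loaded_die, size=(20_000, n_rolls)).sum(axis=1)

# Compare a simulated probability with the normal approximation.
# The +0.5 is a continuity correction, since the sums are integers.
simulated = np.mean(sums <= 45)
approximated = approx.cdf(45.5)
print("P(sum <= 45): simulated", simulated, "normal approximation", approximated)
```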

## <center> The Normal Distribution

<center><img src='normal_dist.png' height='500' width='500'>

### <center> Properties

Mean = Median = Mode

Area under the curve is 1.

About 68% of values are within 1 standard deviation of the mean, 𝜇±𝜎 <br> <br>
About 95% of values are within 2 standard deviations of the mean, 𝜇±2𝜎 <br> <br>
About 99.7% of values are within 3 standard deviations of the mean, 𝜇±3𝜎

Completely determined by two parameters: the mean 𝜇 and the standard deviation 𝜎 (or, equivalently, the variance 𝜎²).
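
The three percentages in the rule above are rounded values. They can be checked against the standard normal CDF, as in this short sketch (the dictionary name `within` is illustrative):

```python
from scipy import stats

standard_normal = stats.norm(0, 1)

# Probability of falling within k standard deviations of the mean.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}
for k, p in within.items():
    print(f"within {k} standard deviation(s): {p:.4f}")
```

Because any normal distribution can be rescaled to the standard normal, the same proportions hold for every choice of 𝜇 and 𝜎.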

<center><img src='norm_dist2.png' height='500' width='500'>

<center><img src='norm_dist3.jpg' height='500' width='500'>

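A radar is used to measure the speeds of cars on a highway. The speeds are normally distributed with a mean of 60 mph and a standard deviation of 10 mph. What is the probability density at 45 mph? Note that this is a density, not a probability: because speed is continuous, the probability that a car is travelling at exactly 45 mph is zero.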

In [None]:
stats.norm(60,10).pdf(45)
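
The value returned by `pdf` is a density, so it helps to compare it with actual probabilities. The sketch below contrasts the density at 45 mph with the probability of a speed within half a mph of 45, and with the probability of exactly 45 mph; the half-mph window is an arbitrary choice for illustration.

```python
from scipy import stats

speeds = stats.norm(60, 10)

density_at_45 = speeds.pdf(45)

# Probability of a speed within half a mph of 45 (an interval of width 1).
prob_near_45 = speeds.cdf(45.5) - speeds.cdf(44.5)

# Probability of exactly 45 mph: an interval of width zero.
prob_exactly_45 = speeds.cdf(45) - speeds.cdf(45)

print("density at 45:", density_at_45)
print("P(44.5 <= speed <= 45.5):", prob_near_45)
print("P(speed == 45):", prob_exactly_45)
```

The density and the width-1 interval probability nearly agree only because the interval has width 1 and the curve is almost flat across it.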

What is the probability that a car picked at random is travelling 45 mph or less?

In [None]:
stats.norm(60,10).cdf(45)

Assuming a normal distribution, if 95% of data scientists drink between 7.5 and 15.5 ounces of coffee a day, what is the mean amount of coffee drunk by data scientists, and what is the standard deviation?

In [None]:
mean = (7.5 + 15.5)/2
print('Mean:',mean, 'ounces of coffee')

In [None]:
# About 95% of values lie within 2 standard deviations of the mean.
standard_deviation = (15.5 - mean)/2
print('Standard deviation:', standard_deviation)

Using the mean and standard deviation, check that 95% of data scientists do indeed drink 7.5 - 15.5 ounces of coffee per day.

In [None]:
stats.norm(11.5,2.0).cdf(15.5)-stats.norm(11.5,2.0).cdf(7.5)
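
The check above comes out slightly above 95% because "2 standard deviations" is a rounded figure. A possible refinement, sketched here, uses the inverse CDF (`ppf`) to find the exact multiplier; the names `coffee_mean`, `z_95`, and `refined_sd` are illustrative.

```python
from scipy import stats

low, high = 7.5, 15.5
coffee_mean = (low + high) / 2

# Exact number of standard deviations that leaves 2.5% in each tail.
z_95 = stats.norm.ppf(0.975)

refined_sd = (high - coffee_mean) / z_95
coffee = stats.norm(coffee_mean, refined_sd)
coverage = coffee.cdf(high) - coffee.cdf(low)

print("z for 95%:", z_95)
print("refined standard deviation:", refined_sd)
print("coverage of [7.5, 15.5]:", coverage)
```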

## <center>Standard Normal Distribution

<br>
<br><center> A normal distribution with <br><br>
𝜇 = 0<br><br>
𝜎 = 1

<center>We can convert any normal distribution to a standard normal distribution with the following:  <br><br>
Z = (𝑥-𝜇)/𝜎 <br> <br>
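
As a small illustration of the conversion, the sketch below standardises the 45 mph value from the speed example and confirms that the probability is the same on either scale. The helper `z_score` is a hypothetical name, not something defined elsewhere in the notebook.

```python
from scipy import stats

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) mu."""
    return (x - mu) / sigma

# Reusing the speed example: 45 mph under a Normal(60, 10) distribution.
z_45 = z_score(45, 60, 10)

p_original = stats.norm(60, 10).cdf(45)   # probability on the original scale
p_standard = stats.norm(0, 1).cdf(z_45)   # same probability on the Z scale

print("z =", z_45)
print(p_original, p_standard)
```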

Consider two stocks, Stock A and Stock B. In terms of closing prices over the last couple of years, Stock A has had a mean price of 12.75 with a standard deviation of 1.5, and Stock B has had a mean price of 5.30 with a standard deviation of 1.15. <br><br>
If the current stock prices are 10.50 for Stock A and 4.50 for Stock B, which stock is currently the better deal?

In [None]:
print('Stock A Z-score')
(10.50-12.75)/1.5

In [None]:
print('Stock B Z-score')
(4.5-5.30)/1.15
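
The two Z-scores can be compared directly, since both are measured in standard deviations. The sketch below uses the figures from the problem statement and reads "better deal" as "further below its own typical price", which is an interpretation rather than something the problem states.

```python
def z_score(price, mean_price, sd_price):
    """Standardised distance of the current price from its historical mean."""
    return (price - mean_price) / sd_price

# Figures taken from the problem statement.
stock_a_z = z_score(10.50, 12.75, 1.5)
stock_b_z = z_score(4.50, 5.30, 1.15)

# The more negative Z-score is the price that is further below its mean.
better_deal = "Stock A" if stock_a_z < stock_b_z else "Stock B"

print("Stock A z:", stock_a_z)
print("Stock B z:", round(stock_b_z, 3))
print("Better deal:", better_deal)
```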

Z-score when dealing with samples.

The average number of scoops of ice cream sold by employees at Scoops Ahoy is 15 per hour with a standard deviation of 3. <br> <br>
Steve worked for 6 hours and averaged 18 scoops sold per hour. <br> <br>
What percentage of employees did he outperform in scoops sold?


<center><img src='z_test.jpg'>

In [2]:
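x = 18                    # Steve's average scoops sold per hour
mean = 15                 # average scoops sold per hour across employees
standard_deviation =  3
n = 6                     # hours Steve worked (sample size)

In [3]:
z = (x-mean)/(standard_deviation/np.sqrt(n))
round(z, 3)

2.449

In [4]:
round(stats.norm(0,1).cdf(z), 3)

0.993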
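
The division by the square root of n in the sample Z-score reflects that averages vary less than individual observations. A simulation sketch of this, using the figures from the problem statement and assuming (for illustration only) that hourly scoop counts are roughly normal:

```python
import numpy as np

population_mean, population_sd, n_hours = 15, 3, 6

# Assumption for illustration: hourly scoop counts are roughly normal.
rng = np.random.default_rng(5)          # arbitrary seed
shifts = rng.normal(population_mean, population_sd, size=(50_000, n_hours))

# Average of each simulated 6-hour shift.
shift_averages = shifts.mean(axis=1)

# Standard error of the mean: sd of an average of n independent values.
standard_error = population_sd / np.sqrt(n_hours)

print("predicted sd of shift averages:", standard_error)
print("simulated sd of shift averages:", shift_averages.std())
```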

## <center> Activity

1) The average amount spent per order at Scoops Ahoy is 6.45 with a standard deviation of 2.35. What is the probability that the next customer will spend 7.50 or more?

2) The newest scoop flavor is a little inconsistent in terms of sweetness. About 99.7% of the time, one cup of the new flavor will have 32 - 48 grams of sugar. What is the average amount of sugar per cup and what is the standard deviation? What is the probability that you order a scoop and it has between 50 and 60 grams of sugar?

3) The average viewer count for a season of a Netflix show is 12 million with a standard deviation of 5 million. Stranger Things had an average view count of 20 million across three seasons. What percentage of shows does Stranger Things beat in terms of viewer count?