# <center> Central Limit Theorem and Confidence Intervals

## <center> Central Limit Theorem </center> <br><br>
<center> The distribution of sample means is approximately normal when the sample size is large enough, regardless of the underlying distribution.

### <center> Features

<center> The mean of the sample means will be approximately the population mean.

<center> The variance of the sample means will be approximately equal to the population variance divided by the sample size.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

In [None]:
normal = np.random.normal(50,15,1000)

In [None]:
uniform = np.random.uniform(1,25,1000)

In [None]:
exponential = np.random.exponential(5,1000)

In [None]:
beta = np.random.beta(0.5,0.5,1000)

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(10,10))
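# histplot replaces the deprecated distplot; stat='density' keeps the same y-axis scale
sns.histplot(normal, kde=True, stat='density', ax=axes[0,0]).set(xlabel='Normal')
sns.histplot(uniform, kde=True, stat='density', color='green', ax=axes[0,1]).set(xlabel='Uniform')
sns.histplot(exponential, kde=True, stat='density', color='red', ax=axes[1,0]).set(xlabel='Exponential')
sns.histplot(beta, kde=True, stat='density', color='purple', ax=axes[1,1]).set(xlabel='Beta')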

In [None]:
df = pd.DataFrame()
df['Dist'] = ['Normal', 'Uniform', 'Exponential', 'Beta']
df['Mean'] = [normal.mean(), uniform.mean(), exponential.mean(), beta.mean()]
df['Variance'] = [normal.var(), uniform.var(), exponential.var(), beta.var()]
df.index=df['Dist']
df.drop('Dist',axis=1,inplace=True)
df

In [None]:
sample_size = 10
sample_count = 50

normal_samples = [np.random.choice(normal, sample_size, replace=True) for x in range(sample_count)]
normal_sample_means = [np.mean(x) for x in normal_samples]

uniform_samples = [np.random.choice(uniform, sample_size, replace=True) for x in range(sample_count)]
uniform_sample_means = [np.mean(x) for x in uniform_samples]

exponential_samples = [np.random.choice(exponential, sample_size, replace=True) for x in range(sample_count)]
exponential_sample_means = [np.mean(x) for x in exponential_samples]

beta_samples = [np.random.choice(beta, sample_size, replace=True) for x in range(sample_count)]
beta_sample_means = [np.mean(x) for x in beta_samples]

sample_means = [normal_sample_means, uniform_sample_means, exponential_sample_means, beta_sample_means]

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(10,10))
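# histplot replaces the deprecated distplot; stat='density' keeps the same y-axis scale
sns.histplot(normal_sample_means, kde=True, stat='density', ax=axes[0,0]).set(xlabel='Normal')
sns.histplot(uniform_sample_means, kde=True, stat='density', color='green', ax=axes[0,1]).set(xlabel='Uniform')
sns.histplot(exponential_sample_means, kde=True, stat='density', color='red', ax=axes[1,0]).set(xlabel='Exponential')
sns.histplot(beta_sample_means, kde=True, stat='density', color='purple', ax=axes[1,1]).set(xlabel='Beta')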

In [None]:
df.insert(1,'Sample Mean',[np.mean(x) for x in sample_means])
df.insert(3,'Sample Variance',[np.var(x) for x in sample_means])
df

In [None]:
df['Expected Variance'] = df['Variance']/sample_size
df

## <center> The larger each sample is, the closer the distribution of sample means is to normal. Taking more samples makes that shape easier to see.
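One way to check this claim is to track how the skewness of the sample means shrinks toward zero (the skewness of a normal distribution) as the sample size grows. The sketch below is illustrative only: the fixed seed, the exponential population, the chosen sample sizes, and the helper name `sample_mean_skewness` are all assumptions made for the demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)            # fixed seed, assumed for reproducibility
population = rng.exponential(5, 100_000)  # strongly right-skewed population

def sample_mean_skewness(sample_size, sample_count=2000):
    """Skewness of the means of `sample_count` samples of size `sample_size`."""
    samples = rng.choice(population, size=(sample_count, sample_size), replace=True)
    return stats.skew(samples.mean(axis=1))

# Larger samples should give sample means that are less skewed
skew_by_size = {size: sample_mean_skewness(size) for size in (2, 10, 50, 250)}
for size, skew in skew_by_size.items():
    print(f"sample size {size:>3}: skewness of sample means = {skew:.3f}")
```

The number of samples (`sample_count`) is held fixed here so that only the sample size changes between runs.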

Scoops Ahoy Corporate wants you to estimate the average annual revenue for its shops. They tell you the standard deviation of revenue is 50,000. Also, they recently sampled 30 stores and calculated an average annual revenue of 450,000. If they want to be 92% certain of the true average annual revenue across all stores, what interval of revenue should you give them?

# <center>Confidence Intervals 

<center> <img src='norm_dist2.png'>

Finding the Z-critical value for a given percentage.

In [None]:
stats.norm.ppf(0.95)

<center><img src='z_crit1.png'>

In [None]:
stats.norm.ppf(0.95+(1-.95)/2)

<center><img src='z_crit2.png'>

68% confidence

In [None]:
stats.norm.ppf(.68+(1-.68)/2)
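The pattern used in the cells above, adding half of the leftover tail area to the confidence level before calling `ppf`, can be wrapped in a small helper. This is a minimal sketch; the function name `z_critical` is hypothetical and not part of SciPy.

```python
from scipy import stats

def z_critical(confidence):
    """Two-sided z-critical value for a confidence level in (0, 1)."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    tail = (1 - confidence) / 2      # area left in each tail
    return stats.norm.ppf(1 - tail)  # same quantile as confidence + tail

for level in (0.68, 0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```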

Scoops Ahoy Corporate wants you to estimate the average annual revenue for its shops. They tell you the standard deviation of revenue is 50,000. Also, they recently sampled 30 stores and calculated an average annual revenue of 450,000. If they want to be 92% certain of the true average annual revenue across all stores, what interval of revenue should you give them?

In [None]:
z = stats.norm.ppf(0.92+(1-0.92)/2)
z

<center>Margin of Error = z ∗ σ / √n

In [None]:
sigma = 50000  # population standard deviation given in the problem statement
n = 30
moe = z*sigma/np.sqrt(n)
moe

In [None]:
lower_interval, upper_interval = 450000-moe, 450000+moe
print('With a 92% confidence level, the average revenue is between', int(lower_interval), 'and', int(upper_interval), 'dollars per year.')
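The three steps above (critical value, margin of error, interval) can be collected into one function and cross-checked against `stats.norm.interval`. This is a sketch under the figures in the problem statement (standard deviation 50,000, 30 stores, mean 450,000); the helper name `z_interval` is hypothetical.

```python
import numpy as np
from scipy import stats

def z_interval(sample_mean, sigma, n, confidence):
    """Confidence interval for a mean when the population sigma is known."""
    z = stats.norm.ppf(confidence + (1 - confidence) / 2)
    moe = z * sigma / np.sqrt(n)  # margin of error
    return sample_mean - moe, sample_mean + moe

# Figures from the problem statement: sigma = 50,000, n = 30, mean = 450,000
lower, upper = z_interval(450_000, 50_000, 30, 0.92)

# Cross-check: the same interval taken directly from the sampling distribution
check_lower, check_upper = stats.norm.interval(
    0.92, loc=450_000, scale=50_000 / np.sqrt(30)
)
print(lower, upper)
print(check_lower, check_upper)
```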

# <center> Activity

1) Professor Juniper has taken a random sample of Pokemon in the wild. She tells you that the standard deviation of speed for all Pokemon is 15. She has tasked you with estimating the average speed of all Pokemon with a confidence level of 85%. Using the sample data found in pokemon_sample.csv, what interval should you report?

2) Professor Juniper has previously studied the height of all Pokemon. From a sample of 30 Pokemon, she calculated the average height to be between 25 and 40 inches, with a confidence of 90%. What was the average height of those 30 Pokemon? What is the standard deviation of Pokemon height for the population? 

What if we don't know the population standard deviation?

## <center> T-Distribution

<img src='t_dist.png'>
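To see how the t-distribution compares with the normal distribution, the sketch below prints two-sided critical values at one confidence level for several degrees of freedom. The 90% level and the particular degrees of freedom are arbitrary choices for illustration.

```python
from scipy import stats

confidence = 0.90
upper_quantile = confidence + (1 - confidence) / 2  # 0.95 for a two-sided 90% interval

z_crit = stats.norm.ppf(upper_quantile)
# `dof` is used instead of `df` to avoid clashing with the DataFrame above
t_crit_by_dof = {dof: stats.t.ppf(upper_quantile, dof) for dof in (4, 9, 29, 99, 999)}

print(f"z-critical: {z_crit:.3f}")
for dof, t_crit in t_crit_by_dof.items():
    print(f"degrees of freedom = {dof:>3}: t-critical = {t_crit:.3f}")
```

The t-critical value is always larger than the z-critical value, so t-based intervals are wider, and the gap closes as the degrees of freedom grow.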

### <center>Standard Deviation of a Sample

<center><img src='standard_dev.png'>

In [None]:
sample = pd.read_csv('pokemon_sample.csv')['Speed']
s = np.std(sample, ddof = 1)
s
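The `ddof=1` argument tells NumPy to divide by n - 1 instead of n. The sketch below computes both versions by hand on a small array and compares them with `np.std`. The speed values are made up for illustration and do not come from pokemon_sample.csv.

```python
import numpy as np

# Hypothetical speed values, made up for illustration
speeds = np.array([45, 60, 80, 95, 110], dtype=float)

count = len(speeds)
squared_deviations = (speeds - speeds.mean()) ** 2

divide_by_n = np.sqrt(squared_deviations.sum() / count)                # ddof=0 (NumPy default)
divide_by_n_minus_1 = np.sqrt(squared_deviations.sum() / (count - 1))  # ddof=1 (sample std)

print(divide_by_n, np.std(speeds))                  # the two values match
print(divide_by_n_minus_1, np.std(speeds, ddof=1))  # the two values match
```

Dividing by n - 1 always gives the larger of the two values, and the difference matters most for small samples.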

Confidence intervals for a T-distribution.

<center>Margin of Error = t ∗ s / √n, where s is the sample standard deviation

<center>Suppose we only have the sample of 30 Pokemon. What is the 90% confidence interval for the average speed of all Pokemon?

In [None]:
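standard_error = np.std(sample, ddof=1)/np.sqrt(len(sample))   # s / √n, the standard error of the mean

In [None]:
# The confidence level is passed positionally because its keyword name differs across SciPy versions
stats.t.interval(0.90,                                      # Confidence level
                 df = len(sample)-1,                        # Degrees of freedom
                 loc = sample.mean(),                       # Sample mean
                 scale = standard_error)                    # Standard error of the mean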

In [None]:
moe = stats.t.ppf(0.9+(1-.9)/2, len(sample)-1)*np.std(sample, ddof=1)/np.sqrt(len(sample))
sample.mean()-moe, sample.mean()+moe

# <center> Activity

1)   Using a sample of data from home sales in King County (king_county_sample.csv), calculate a 95% confidence interval for the average home sale price.

2)   Is there a tighter 90% confidence interval for number of bedrooms or number of bathrooms?

3) The 85% confidence interval for the average salary of a data scientist was found to be 90,000 to 120,000. What was the sample size?