# Question 1: Did the mean prison population change between 2012 and 2016?

## Step 1: Set up null and alternative hypotheses


Ho: The mean prison population did not change from 2012 to 2016 (the mean paired difference is zero)

Ha: The mean prison population changed from 2012 to 2016 (the mean paired difference is not zero)


## Step 2: Choose a significance level

In [None]:
alpha = 0.05

## Step 3: Calculate the test statistic

### Step 3.1: Import Necessary Libraries

In [None]:
# Import Pandas Library
import pandas as pd

# Import Seaborn
import seaborn as sns
sns.set_style('whitegrid')

# Import Stats
from scipy import stats

# Import Math
import math

# Import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# Import Numpy
import numpy as np

### Step 3.2: Load, clean and explore dataset

In [None]:
# Load in and preview first dataset, 'prison_custody_by_state.csv'
# This data includes year-end prison custody totals by state as reported by the Bureau of
# Justice Statistics' National Prisoner Statistics Program. The Federal row reflects Bureau of Prisons custody totals.

df1 = pd.read_csv('prison_custody_by_state.csv')
df1.head()

In [None]:
# Check out datatypes

df1.dtypes

In [None]:
# Remove commas from data and convert strings into integers

cols_to_remove_commas = df1.columns[2:]
df1[cols_to_remove_commas] = df1[cols_to_remove_commas].replace({',':''}, regex=True).astype(int)
df1.head()
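
The comma-stripping step above can be checked in isolation on a tiny frame. This is only a sketch: the `toy` DataFrame, its column names, and its values are made up for illustration and are not taken from 'prison_custody_by_state.csv'.

```python
import pandas as pd

# Hypothetical toy data; the real file's columns and values may differ
toy = pd.DataFrame({
    "state": ["A", "B"],
    "2012": ["1,234", "56,789"],
    "2016": ["1,200", "55,000"],
})

year_cols = ["2012", "2016"]

# Strip the thousands separators column by column, then cast to integers
toy[year_cols] = toy[year_cols].apply(
    lambda s: s.str.replace(",", "", regex=False).astype(int)
)

print(toy.dtypes)
print(toy["2012"].tolist())
```

As an alternative, `pd.read_csv` accepts a `thousands=','` argument, which may make the separate conversion step unnecessary.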

In [None]:
type(df1['2001'][3])

In [None]:
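# Column means for the numeric (year) columns only; newer pandas versions
# raise an error if non-numeric columns are included
df1.mean(numeric_only=True)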

In [None]:
list_2012 = df1["2012"].values
list_2013 = df1["2013"].values
list_2014 = df1["2014"].values
list_2015 = df1["2015"].values
list_2016 = df1["2016"].values

In [None]:
list_2012

In [None]:
# Plot the 2012 and 2016 distributions on the same axes for inspection
# Note: sns.distplot is deprecated in seaborn 0.11+; sns.histplot(..., kde=True) is the suggested replacement

sns.set(color_codes=True)
sns.set(rc={'figure.figsize':(12,10)})
sns.distplot(list_2012, label='2012')
#sns.distplot(list_2013, label='2013')
#sns.distplot(list_2014, label='2014')
#sns.distplot(list_2015, label='2015')
sns.distplot(list_2016, label='2016')
plt.legend() # Label each distribution by year instead of relying on default colors

#### The 2012 and 2016 distributions look very similar.

In [None]:
print('The mean prison population in the U.S. in 2012 was: {}. The standard deviation of the prison population in the U.S. in 2012 was: {}.'.format(list_2012.mean(),list_2012.std()))


In [None]:
print('The mean prison population in the U.S. in 2016 was: {}. The standard deviation of the prison population in the U.S. in 2016 was: {}.'.format(list_2016.mean(),list_2016.std()))
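
One detail about the standard deviations printed above: on a NumPy array, `.std()` defaults to the population formula (`ddof=0`), while the paired t-test uses the sample formula (`ddof=1`). The sketch below shows the difference on made-up numbers; `values` is hypothetical and is not drawn from the dataset.

```python
import numpy as np

# Hypothetical values standing in for one year's column
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_sd = values.std()       # ddof=0: divides by n
sample_sd = values.std(ddof=1)     # ddof=1: divides by n - 1

print(population_sd, sample_sd)
```

With 51 rows the two versions differ only slightly, so the choice matters mainly for consistency with the test statistic.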


### Step 3.3: Calculating & Visualizing the t-value and p-value

#### We will use a paired t-test because we are measuring the same units (each state's prison population, plus the federal prison population) at two points in time

In [None]:
# Paired T-Test

t_stat, p_value = stats.ttest_rel(list_2012, list_2016) 
print('The t-value is: {}.  The p-value is: {}'.format(t_stat, p_value))
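
To make the paired test less of a black box, the sketch below recomputes it by hand as a one-sample t-test on the pairwise differences and compares the result with `stats.ttest_rel`. The `before` and `after` arrays are made-up numbers for illustration, not values from the dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements (same units observed at two times)
before = np.array([120.0, 98.0, 143.0, 110.0, 87.0, 132.0])
after = np.array([115.0, 99.0, 135.0, 104.0, 88.0, 125.0])

diffs = before - after
n = len(diffs)

# t = mean(d) / (sd(d) / sqrt(n)), using the sample standard deviation (ddof=1)
t_manual = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from a t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_rel(before, after)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The manual and SciPy results should agree, which also shows that the reference distribution for a paired test has n - 1 degrees of freedom, where n is the number of pairs.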

In [None]:
# Visualize the t-statistic against the t distribution

fig = plt.figure(figsize=(10,10))
ax = fig.gca()

xs = np.linspace(-4, 4, 500)
# A paired t-test has (number of pairs - 1) degrees of freedom
ys = stats.t.pdf(xs, len(list_2012) - 1, 0, 1)
    
ax.plot(xs, ys, linewidth=4, color='blue')
    
ax.axvline(t_stat, color = 'red', linestyle='--', lw=5)
ax.axvline(-t_stat, color = 'red', linestyle='--', lw=5)
    
plt.show()

## Step 4: Determine the critical value or p-value

### Step 4.1: Determine t-critical

Because this is a paired test, our degrees of freedom are based on the number of pairs:
    df = number of pairs in 'list_2012' and 'list_2016' - 1

In [None]:
# Calculate degrees of freedom for the paired t-test (number of pairs - 1)
df = len(list_2012) - 1
df

In [None]:
# Calculate critical t value for the upper bound (two-tailed test)

upper_t_critical = np.round(stats.t.ppf(1 - alpha/2, df=df),3)
t_critical = upper_t_critical # Positive critical value used for the comparisons in Step 5
upper_t_critical

In [None]:
# Calculate critical t value for the lower bound (two-tailed test)

lower_t_critical = np.round(stats.t.ppf(alpha/2, df=df),3)
lower_t_critical
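
As a cross-check on the critical values, here is a standalone sketch for a two-tailed paired test. It assumes 51 pairs (50 states plus the federal row), which is an assumption about the dataset's row count; `n_pairs` and `dof` are illustrative names.

```python
from scipy import stats

alpha = 0.05
n_pairs = 51           # assumption: 50 states plus the federal row
dof = n_pairs - 1      # a paired t-test uses n - 1 degrees of freedom

# Split alpha evenly between the two tails
upper = stats.t.ppf(1 - alpha / 2, df=dof)
lower = stats.t.ppf(alpha / 2, df=dof)

print(round(lower, 3), round(upper, 3))
```

Because the t distribution is symmetric around zero, the lower bound is simply the negative of the upper bound, so only one of the two needs to be carried into the comparison step.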

### Step 4.2: Determine p-value

When we ran 'stats.ttest_rel(list_2012, list_2016)' above, our t-value was 2.1946365203688156 and our p-value was 0.032862108959372936.

In [None]:
p_value
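
The p-value can also be recovered from the t-value directly, which shows how the two numbers are linked. This sketch assumes 50 degrees of freedom (51 pairs minus 1); that count is an assumption about the dataset, so the result should be treated as approximate.

```python
from scipy import stats

t_reported = 2.1946365203688156   # t-value reported above
dof = 50                          # assumption: 51 paired rows, so n - 1 = 50

# Two-sided p-value: probability of a t at least this extreme in either tail
p_two_sided = 2 * stats.t.sf(abs(t_reported), df=dof)
print(p_two_sided)
```

If the assumed row count is right, the printed value should be close to the p-value returned by `stats.ttest_rel`.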

## Step 5: Compare t-value & critical t-value to Reject or Fail to Reject the Null Hypothesis

### Step 5.1 Visually Compare the calculated t-statistic and the t-critical value

In [None]:
xs = np.linspace(-5, 5, 200)
ys = stats.t.pdf(xs, df, 0, 1)

fig = plt.figure(figsize=(8,5))

ax = fig.gca()

ax.plot(xs, ys, linewidth=3, color='darkblue')

ax.axvline(t_stat, color='red', linestyle='--', lw=5,label='t-statistic')
ax.axvline(-t_stat, color='red', linestyle='--', lw=5)

ax.axvline(t_critical,color='green',linestyle='--',lw=4,label='critical t-value')
ax.axvline(-t_critical,color='green',linestyle='--',lw=4)

# Shade the two rejection regions under the curve
ax.fill_between(xs, ys, where = xs > t_critical, color = 'purple')
ax.fill_between(xs, ys, where = xs < -t_critical, color = 'purple')

ax.legend()
plt.show()

Visually, we can see that our calculated t-statistic (red) lies beyond the t-critical value (green), inside the shaded rejection region. This tells us that we should reject our null hypothesis.

### Step 5.2 Numerically Compare the calculated t-statistic and the t-critical value

In [None]:
# Two-tailed decision rule: compare the magnitude of the t-statistic with the critical value
if abs(t_stat) > t_critical:
    print('We reject our null hypothesis since our calculated t-statistic is {}, which is more extreme than the t-critical value, {}.'.format(t_stat, t_critical))
else:
    print('We fail to reject our null hypothesis since our calculated t-statistic is {}, which is not more extreme than the t-critical value, {}.'.format(t_stat, t_critical))
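
For a two-tailed test, the decision depends only on the magnitude of the t-statistic, so the sign (which depends on the order of the two years in the test) should not change the outcome. A minimal sketch with a hypothetical helper, `two_sided_decision`, and illustrative numbers:

```python
def two_sided_decision(t_stat, t_critical):
    """Return 'reject' when |t| exceeds the critical value, else 'fail to reject'."""
    return "reject" if abs(t_stat) > t_critical else "fail to reject"

# The same magnitude gives the same decision regardless of sign
print(two_sided_decision(2.19, 2.009))
print(two_sided_decision(-2.19, 2.009))
print(two_sided_decision(1.50, 2.009))
```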

### Step 5.3: Confirm our p-value is less than our alpha value:

In [None]:
if p_value < alpha:
    print('Since our p-value is less than our alpha value we will reject the null hypothesis')
else:
    print('Since our p-value is not less than our alpha value we fail to reject the null hypothesis')

When we compare our calculated t-value with our t-critical value, both visually and numerically, we see that the calculated t-value is more extreme than the t-critical value. Our p-value is also less than our alpha value, so the result is statistically significant at the 0.05 level. We therefore reject the null hypothesis that the mean prison population did not change from 2012 to 2016. Because the t-value from 'stats.ttest_rel(list_2012, list_2016)' is positive, the 2012 mean is the higher of the two, which indicates that the mean prison population decreased over this period.