# Milestone 2

## Author - Matthew Denko



## Instructions

    Milestone 2 focuses on Unit 2 of the course. You will apply what you have learned about statistical analysis and hypothesis testing to the data and the problem you have selected.

    For Milestone 2 you should:

    (1) explore the dataset supported by charts and summary statistics;
    (2) identify a likely distribution for several of the features;
    (3) compute basic summary statistics by both classical, bootstrap, and Bayesian methods;
    (4) compute confidence intervals for the above summary statistics by classical, bootstrap, and Bayesian methods; and
    (5) leverage confidence intervals in performing hypothesis tests to determine if the differences in pairs and multiple populations are significant.

# (1) Explore the dataset supported by charts and summary statistics

## Importing/Cleaning Data/Summary Statistics

In [None]:
# Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as st
import seaborn as sns
from matplotlib import pyplot

%matplotlib inline

In [None]:
# Defining Functions

def plot_hist(x, p=5):
    # Plot the distribution and mark the mean
    pyplot.hist(x, alpha=.5)
    pyplot.axvline(x.mean())
    # Mark the p/2 and 100 - p/2 percentiles (the central 95% of values by default)
    pyplot.axvline(np.percentile(x, p/2.), color='red', linewidth=3)
    pyplot.axvline(np.percentile(x, 100-p/2.), color='red', linewidth=3)
    
def bern_pmf(x, p):
    if (x == 1):
        return p
    elif (x == 0):
        return 1 - p
    else:
        return "Value Not in Support of Distribution"
    
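def mean_confidence_interval(data, confidence=0.95):
    """Print and return (mean, lower bound, upper bound) of a t-based confidence interval."""
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), st.sem(a)
    # Half-width of the interval from the t distribution with n-1 degrees of freedom
    h = se * st.t.ppf((1 + confidence) / 2., n-1)
    print("The mean is:", m, "The lower bound is:", m-h, "The upper bound is:", m+h)
    return m, m-h, m+h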

def t_test(a, b, alpha, alternative='two-sided'):
    import statsmodels.stats.weightstats as ws
    
    diff = a.mean() - b.mean()

    # Welch's t-test (unequal variances), consistent with the interval and degrees of freedom below
    res = st.ttest_ind(a, b, equal_var=False)
      
    means = ws.CompareMeans(ws.DescrStatsW(a), ws.DescrStatsW(b))
    confint = means.tconfint_diff(alpha=alpha, alternative=alternative, usevar='unequal') 
    degfree = means.dof_satt()

    index = ['DegFreedom', 'Difference', 'Statistic', 'PValue', 'Low95CI', 'High95CI']
    return pd.Series([degfree, diff, res[0], res[1], confint[0], confint[1]], index = index)   

In [None]:
# Reading Data

url = "https://library.startlearninglabs.uw.edu/DATASCI410/Datasets/Automobile%20price%20data%20_Raw_.csv"
Auto = pd.read_csv(url, header=None)

#Assigning Column Names

Auto.columns = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors", "body-style", "drive-wheels",
               "engine-location", "wheel-base","length", "width", "height", "curb-weight", "engine-type", "num-of-cylinders",
               "engine-size", "fuel-system", "bore", "stroke", "compression-ratio","horsepower","peak-rpm","city-mpg",
               "highway-mpg","price"]
print(Auto.columns)
print(Auto.head(10))

In [None]:
# Summary Statistics

print(Auto.describe())

In [None]:
#Removing cases with missing data

# Missing values are encoded as "?" in the raw file, so convert them to NaN first
Auto = Auto.replace(to_replace= "?", value= float('NaN'))

# Dropping rows with nulls

Auto = Auto.dropna(axis = 0)
Auto_null = Auto.isnull().sum()
print("""Null Counts by Column
""",Auto_null)

In [None]:
# Converting Data to numeric

for col in ['price', 'horsepower', 'height']:
    # Entries that cannot be parsed as numbers become NaN and are then set to 0
    Auto[col] = pd.to_numeric(Auto[col], errors='coerce').fillna(0).astype('float')

### Comments:
    
    All null values have now been removed from the dataset, and the features that I want to explore further have been converted to numeric values. I will be focusing my analysis on three features: price, horsepower, and height. In the next section I will be taking a graphical look at their distributions.

## Graphical Analysis

In [None]:
# Price

plot_hist(Auto.loc[:,'price'])
plt.show()

### Comments:
    Price appears to be roughly normally distributed with a right skew. The mean is around 12,000; however, the upper percentile line is near 28,000, showing the strength of the tail.

In [None]:
# horsepower

plot_hist(Auto.loc[:,'horsepower'])
plt.show()

### Comments:
    Horsepower also appears to be roughly normally distributed. It has a high concentration between 50 and 125, with a mean of around 98.

In [None]:
# height 

plot_hist(Auto.loc[:,'height'])
plt.show()

### Comments:
    Height is strongly left skewed: the majority of values are between 45 and 60, with some outliers below 10.

# (2) Identify a likely distribution for several of the features

### Comments:
    To identify a likely distribution for the three features I chose (price, horsepower, and height), I want to convert each of them to a binary indicator and create a new subsetted dataframe. 

## Subsetting Data Frame for necessary columns

In [None]:
# Determining Mean Values

#price
price_mean = Auto['price'].mean()
print('Price Mean is: $', price_mean)

#horsepower
hp_mean = Auto['horsepower'].mean()
print('Horsepower Mean is:', hp_mean)

#height
height_mean = Auto['height'].mean()
print('Height Mean is:', height_mean)

### Comments:
    I will now create a new dataframe of 1 or 0 indicators of whether each column's value is greater than the integer value of its mean. For price this will be 11,374. For horsepower this will be 171. For height this will be 54.

In [None]:
#Creating New Data Frame for probabilities

auto_new = pd.DataFrame()

#price: 1 if above the threshold, 0 otherwise
auto_new['price'] = (Auto['price'] > 11374).astype(int)

#horsepower
auto_new['horsepower'] = (Auto['horsepower'] > 171).astype(int)

#height
auto_new['height'] = (Auto['height'] > 54).astype(int)

### Comments:
    Each of the variables has now been converted to a binary (0 or 1) variable. So for price, horsepower, and height I choose a Bernoulli distribution with parameter p, the probability that a value is above its mean.
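
The conversion and the Bernoulli model described above can be sketched on a small made-up column. The values below are hypothetical and are not taken from the automobile dataset; the names `values`, `threshold`, `indicator`, and `p_hat` are chosen only for this illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy values standing in for one numeric column
values = pd.Series([5.0, 7.0, 9.0, 20.0, 30.0])

# Integer part of the mean, used as the cutoff (the mean here is 14.2)
threshold = int(values.mean())

# 1 if the value is above the threshold, 0 otherwise
indicator = (values > threshold).astype(int)

# The sample proportion of ones estimates the Bernoulli parameter p
p_hat = indicator.mean()

# Bernoulli PMF: P(X = 1) = p and P(X = 0) = 1 - p
pmf_one = stats.bernoulli.pmf(1, p_hat)
pmf_zero = stats.bernoulli.pmf(0, p_hat)
print(threshold, p_hat, pmf_one, pmf_zero)
```

Two of the five toy values exceed the cutoff, so the estimated p is 0.4 and the probability of a 0 is 0.6.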

# (3) Compute basic summary statistics by classical, bootstrap, and Bayesian methods

# (4) Compute confidence intervals for the above summary statistics by classical, bootstrap, and Bayesian methods

# (5) Leverage confidence intervals in performing hypothesis tests to determine if the differences in pairs and multiple populations are significant.

### Comments
    
    For the final three sections I will split the work into two areas: Bayesian, and Bootstrap vs Classical. 

# Bayesian

## Compute Basic Summary Statistics

In [None]:
#Summary Statistics

print(auto_new.describe())

### Comments:

    The likelihood is the probability of seeing our data x given the parameter θ; this is written as p(X|θ). The likelihood distribution allows us to specify how we think the data was generated. In this case we know how the data was generated, and we can think of it as being generated from a Bernoulli distribution.
    
    In this distribution there is a parameter p, which is the probability of getting a 1; the probability of getting a 0 is 1-p. For price, p is the probability that the price of a car is >11,374; based on the mean of the new dataset, that probability is .34, and the probability of a car not having a price above the mean is .66. For horsepower, p is the probability that the horsepower of a car is >171; based on the mean of the new dataset, that probability is .01, and the probability of a car not having horsepower >171 is .99. For height, p is the probability that the height of a car is >54; based on the mean of the new dataset, that probability is .53, and the probability of a car not having height >54 is .47.

## Computing Likelihood

In [None]:
# Price

#Computing Likelihood
price = auto_new.loc[:,'price']
# np.prod multiplies the per-observation probabilities (np.product was removed in NumPy 2.0)
likelihood = np.prod(st.bernoulli.pmf(price,.3375))
print("This is the likelihood of price:", likelihood)

#Graphing Likelihood
sns.set(style='ticks', palette='Set2')
params = np.linspace(0, 1, 100)
p_x = [np.prod(st.bernoulli.pmf(price, p)) for p in params]
plt.plot(params, p_x)
sns.despine()

### Comments:

    This graph shows the likelihood for price evaluated at 100 values of p between 0 and 1. As you can see the peak is around .34.
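
As a cross-check on where such a likelihood curve peaks, here is a minimal sketch that works with the log-likelihood instead of the raw product, which avoids multiplying many small probabilities together. The 0/1 data are hypothetical (34 ones out of 100), and the grid deliberately excludes p = 0 and p = 1, where the log is undefined.

```python
import numpy as np
from scipy import stats

# Hypothetical 0/1 data: 34 ones out of 100 observations
data = np.array([1] * 34 + [0] * 66)

# Grid of candidate p values from 0.01 to 0.99 in steps of 0.01
params = np.linspace(0.01, 0.99, 99)

# The sum of log-PMFs is the log of the product of PMFs
log_lik = np.array([stats.bernoulli.logpmf(data, p).sum() for p in params])

# Grid point with the highest log-likelihood
p_best = params[np.argmax(log_lik)]
print(p_best, data.mean())
```

On this toy data the grid maximum coincides with the sample proportion of ones, which is the maximum likelihood estimate for a Bernoulli parameter.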

In [None]:
# Horsepower

#Computing Likelihood
hp = auto_new.loc[:,'horsepower']
# Evaluate at the sample proportion of ones for horsepower
likelihood = np.prod(st.bernoulli.pmf(hp, hp.mean()))
print("This is the likelihood of horsepower:", likelihood)

#Graphing Likelihood
sns.set(style='ticks', palette='Set2')
params = np.linspace(0, 1, 100)
p_x = [np.prod(st.bernoulli.pmf(hp, p)) for p in params]
plt.plot(params, p_x)
sns.despine()

### Comments:
    
    This graph shows the likelihood for horsepower evaluated at 100 values of p between 0 and 1. As you can see the peak is around 0.01.

In [None]:
# Height

#Computing Likelihood
height = auto_new.loc[:,'height']
# Evaluate at the sample proportion of ones for height
likelihood = np.prod(st.bernoulli.pmf(height, height.mean()))
print("This is the likelihood of height:", likelihood)

#Graphing Likelihood
sns.set(style='ticks', palette='Set2')
params = np.linspace(0, 1, 100)
p_x = [np.prod(st.bernoulli.pmf(height, p)) for p in params]
plt.plot(params, p_x)
sns.despine()

### Comments:

    This graph shows the likelihood for height evaluated at 100 values of p between 0 and 1. As you can see the peak is around .53.

## Prior Distribution

In [None]:
# Price

p_fair = np.array([np.prod(st.bernoulli.pmf(price, p)) for p in params])
p_fair = p_fair / np.sum(p_fair)
plt.plot(params, p_fair)
sns.despine()

### Comments:
    In this case the prior distribution for price has the same shape as the likelihood distribution for price, because it is computed by normalizing that likelihood over the grid of p values.

In [None]:
# Horsepower

p_fair = np.array([np.prod(st.bernoulli.pmf(hp, p)) for p in params])
p_fair = p_fair / np.sum(p_fair)
plt.plot(params, p_fair)
sns.despine()

### Comments:
    In this case the prior distribution for horsepower has the same shape as the likelihood distribution for horsepower, because it is computed by normalizing that likelihood over the grid of p values.

In [None]:
# Height

p_fair = np.array([np.prod(st.bernoulli.pmf(height, p)) for p in params])
p_fair = p_fair / np.sum(p_fair)
plt.plot(params, p_fair)
sns.despine()

### Comments:
    In this case the prior distribution for height has the same shape as the likelihood distribution for height, because it is computed by normalizing that likelihood over the grid of p values.

## Posterior Distribution

In [None]:
#Price

sns.set(style='ticks', palette='Set2')
params = np.linspace(0, 1, 100)
likelihood = np.array([np.prod(st.bernoulli.pmf(price, p)) for p in params])
p_fair = np.array([np.prod(st.bernoulli.pmf(price, p)) for p in params])
prior = p_fair / np.sum(p_fair)
# Posterior is proportional to prior times likelihood, normalized over the grid
posterior = prior * likelihood
posterior = posterior / np.sum(posterior)
fig, axes = plt.subplots(3, 1, sharex=True, figsize=(8,8))
axes[0].plot(params, likelihood)
axes[0].set_title("Price Sampling Distribution")
axes[1].plot(params, prior)
axes[1].set_title("Price Prior Distribution")
axes[2].plot(params, posterior)
axes[2].set_title("Price Posterior Distribution")
sns.despine()
plt.tight_layout()

In [None]:
#Horsepower

sns.set(style='ticks', palette='Set2')
params = np.linspace(0, 1, 100)
likelihood = np.array([np.prod(st.bernoulli.pmf(hp, p)) for p in params])
p_fair = np.array([np.prod(st.bernoulli.pmf(hp, p)) for p in params])
prior = p_fair / np.sum(p_fair)
# Posterior is proportional to prior times likelihood, normalized over the grid
posterior = prior * likelihood
posterior = posterior / np.sum(posterior)
fig, axes = plt.subplots(3, 1, sharex=True, figsize=(8,8))
axes[0].plot(params, likelihood)
axes[0].set_title("Horsepower Sampling Distribution")
axes[1].plot(params, prior)
axes[1].set_title("Horsepower Prior Distribution")
axes[2].plot(params, posterior)
axes[2].set_title("Horsepower Posterior Distribution")
sns.despine()
plt.tight_layout()

In [None]:
#Height

sns.set(style='ticks', palette='Set2')
params = np.linspace(0, 1, 100)
likelihood = np.array([np.prod(st.bernoulli.pmf(height, p)) for p in params])
p_fair = np.array([np.prod(st.bernoulli.pmf(height, p)) for p in params])
prior = p_fair / np.sum(p_fair)
# Posterior is proportional to prior times likelihood, normalized over the grid
posterior = prior * likelihood
posterior = posterior / np.sum(posterior)
fig, axes = plt.subplots(3, 1, sharex=True, figsize=(8,8))
axes[0].plot(params, likelihood)
axes[0].set_title("Height Sampling Distribution")
axes[1].plot(params, prior)
axes[1].set_title("Height Prior Distribution")
axes[2].plot(params, posterior)
axes[2].set_title("Height Posterior Distribution")
sns.despine()
plt.tight_layout()

### Comments:
    For each of the features (price, height, and horsepower), the likelihood, prior, and posterior distributions were very similar. In this case the Bayesian model does not give us many advantages, as our own specification of the sampling distribution turned out to be very close to the actual distribution.
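
For comparison, a Bayesian interval for p can also be obtained in closed form when the prior is chosen independently of the data. The sketch below is not the approach used in the cells above: it assumes a uniform Beta(1, 1) prior and hypothetical 0/1 data (34 ones out of 100), and relies on the standard result that a Beta prior combined with Bernoulli data gives a Beta posterior.

```python
import numpy as np
from scipy import stats

# Hypothetical 0/1 data: 34 ones out of 100 observations
data = np.array([1] * 34 + [0] * 66)
k, n = int(data.sum()), len(data)

# Assumption: a uniform Beta(1, 1) prior on p, chosen before seeing the data
prior_a, prior_b = 1, 1

# Conjugate update: add the ones to a and the zeros to b
posterior = stats.beta(prior_a + k, prior_b + n - k)

post_mean = posterior.mean()
# 95% equal-tailed credible interval for p
lower, upper = posterior.ppf([0.025, 0.975])
print(post_mean, lower, upper)
```

With a flat prior the posterior mean is (k + 1) / (n + 2), which stays close to the sample proportion when n is large.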

# Bootstrap vs Classical

## Generating the Bootstrap Sample

In [None]:
#Bootstrap

Auto_bootstrap = Auto.sample(frac=1, replace=True)

### Comments:
    
     Bootstrapping is repeated resampling, with replacement, of samples the same size as the original dataset. It allows us to approximate the sampling distribution of a statistic without collecting more data. In the next section I will be comparing a bootstrap dataset with the original sample.
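
A minimal sketch of repeated resampling, assuming a hypothetical right-skewed sample in place of a real column such as price. It builds a percentile bootstrap confidence interval for the mean; the seed, sample size, and number of resamples are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical right-skewed sample standing in for a column such as price
sample = rng.lognormal(mean=9.0, sigma=0.5, size=150)

n_boot = 1000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Resample with replacement, same size as the original sample
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[i] = resample.mean()

# Percentile bootstrap 95% confidence interval for the mean
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(sample.mean(), lower, upper)
```

Unlike a single resample, the collection of resampled means describes how much the mean would vary from sample to sample.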
    

## Summary Statistics

In [None]:
#Dataset

print('Dataset Summary Statistics:',Auto.describe())

#Bootstrap Sample

print('Bootstrap Summary Statistics:',Auto_bootstrap.describe())

In [None]:
# Histograms

# Price - dataset
print('Dataset Price Histogram')
plot_hist(Auto.loc[:,'price'])
plt.show()

# Price - bootstrap
print('Bootstrap Price Histogram')
plot_hist(Auto_bootstrap.loc[:,'price'])
plt.show()

# Horsepower - dataset
print('Dataset Horsepower Histogram')
plot_hist(Auto.loc[:,'horsepower'])
plt.show()

# Horsepower - bootstrap
print('Bootstrap Horsepower Histogram')
plot_hist(Auto_bootstrap.loc[:,'horsepower'])
plt.show()

# Height - dataset
print('Dataset Height Histogram')
plot_hist(Auto.loc[:,'height'])
plt.show()

# Height - bootstrap
print('Bootstrap Height Histogram')
plot_hist(Auto_bootstrap.loc[:,'height'])
plt.show()

### Comments: 

    The bootstrap sample appears to be very similar to the dataset; however, they are still slightly different. The distribution across all three variables is very similar, as are the mean values. However, for price the Q3 value is noticeably different in the bootstrap sample than in the dataset.

## Confidence Interval

### Comments: 

    I will create confidence intervals for both the bootstrap sample and the dataset using a 95% confidence level.

In [None]:
#Price - Dataset

price_ci = mean_confidence_interval(Auto.loc[:,"price"], confidence=0.95)
print(price_ci)

In [None]:
#Price - Bootstrap

price_ci_bs = mean_confidence_interval(Auto_bootstrap.loc[:,"price"], confidence=0.95)
print(price_ci_bs)

In [None]:
#Horsepower - Dataset

hp_ci = mean_confidence_interval(Auto.loc[:,"horsepower"], confidence=0.95)
print(hp_ci)

In [None]:
#Horsepower - Bootstrap

hp_ci_bs = mean_confidence_interval(Auto_bootstrap.loc[:,"horsepower"], confidence=0.95)
print(hp_ci_bs)

In [None]:
#Height - Dataset

height_ci = mean_confidence_interval(Auto.loc[:,"height"], confidence=0.95)
print(height_ci)

In [None]:
#Height - Bootstrap

height_ci_bs = mean_confidence_interval(Auto_bootstrap.loc[:,"height"], confidence=0.95)
print(height_ci_bs)

### Comments:

    A confidence interval can be interpreted as follows: if we repeated this process many times, about 95% of the intervals constructed would capture the true mean. In this case, most of the variables have very similar confidence intervals, except for price, where the dataset and the bootstrap sample have noticeably different bounds. 
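
The t-based interval used here is the sample mean plus or minus a t critical value times the standard error. A short sketch on hypothetical numbers, cross-checked against SciPy's `t.interval` helper:

```python
import numpy as np
from scipy import stats

# Hypothetical measurements, not taken from the automobile dataset
data = np.array([10.0, 12.0, 9.0, 11.0, 13.0, 8.0, 12.0, 10.0])
n = data.size
m = data.mean()
se = stats.sem(data)  # sample standard deviation divided by sqrt(n)

# Half-width: t critical value with n - 1 degrees of freedom times the standard error
h = se * stats.t.ppf(0.975, n - 1)
manual = (m - h, m + h)

# The same 95% interval from SciPy's built-in helper
builtin = stats.t.interval(0.95, n - 1, loc=m, scale=se)
print(manual, builtin)
```

Both routes give the same bounds, so either can be used to verify a hand-written interval function.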


## Hypothesis Testing

### Comments:
    
    Because of the differences between the bootstrap of price and the dataset of price, I want to examine this further and see if there is a significant difference between the two means. To do this I will plot the differences between the dataset mean and the means of repeated bootstrap samples.

In [None]:
# Bootstrap Difference in Means

# price
diffs = []
price = Auto.loc[:,"price"]
price_bs = Auto_bootstrap.loc[:,"price"]
for i in range(1000):
    sample = price.sample(frac = 1.0, replace = True)
    price_mean = price.mean()
    bs_mean = sample.mean()
    diffs.append(price_mean - bs_mean)
diffs = pd.Series(diffs)

plot_hist(diffs)

### Comments:
    
    The graph above plots the difference in means between the dataset and repeated bootstrap samples of price. Based on this graph, it appears that there is no significant difference in means between the dataset and the repeated bootstrap samples.

In [None]:
# Bootstrap Difference in Means

# horsepower
diffs = []
horsepower = Auto.loc[:,"horsepower"]
horsepower_bs = Auto_bootstrap.loc[:,"horsepower"]
for i in range(1000):
    sample = horsepower.sample(frac = 1.0, replace = True)
    hp_mean = horsepower.mean()
    bs_mean = sample.mean()
    diffs.append(hp_mean - bs_mean)
diffs = pd.Series(diffs)

plot_hist(diffs)

### Comments:

    Based on this graph, it appears that there is no significant difference in means between the dataset and the repeated bootstrap samples.

In [None]:
# Bootstrap Difference in Means

# height
diffs = []
height = Auto.loc[:,"height"]
height_bs = Auto_bootstrap.loc[:,"height"]
for i in range(1000):
    sample = height.sample(frac = 1.0, replace = True)
    height_mean = height.mean()
    bs_mean = sample.mean()
    diffs.append(height_mean - bs_mean)
diffs = pd.Series(diffs)

plot_hist(diffs)

### Comments:

    Based on this graph, it appears that there is no significant difference in means between the dataset and the repeated bootstrap samples. 
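
The same resampling idea extends to comparing two separate populations. The sketch below is an illustration under assumptions, not part of the analysis above: it uses two hypothetical groups with different true means, resamples each group independently, and checks whether the 95% percentile interval for the difference in means contains zero.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed so the sketch is reproducible

# Two hypothetical groups whose true means differ by 4
group_a = rng.normal(loc=10.0, scale=2.0, size=120)
group_b = rng.normal(loc=14.0, scale=2.0, size=120)

diffs = np.empty(2000)
for i in range(diffs.size):
    # Resample each group independently, with replacement
    res_a = rng.choice(group_a, size=group_a.size, replace=True)
    res_b = rng.choice(group_b, size=group_b.size, replace=True)
    diffs[i] = res_a.mean() - res_b.mean()

# 95% percentile interval for the difference in means
lower, upper = np.percentile(diffs, [2.5, 97.5])

# The difference is called significant when the interval excludes zero
significant = not (lower <= 0.0 <= upper)
print(lower, upper, significant)
```

An interval that lies entirely on one side of zero indicates a difference in means at the 5% level.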

## T-Test

### Comments:
    
    I will now run a t-test of each dataset mean against its bootstrap mean to see if there is any significant difference between the two means. If the pvalue is less than .05, we can reject the null hypothesis that the two means are equal. If the pvalue is greater than .05, we fail to reject the null hypothesis.
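
The decision rule above can be sketched with SciPy's `ttest_ind` on hypothetical samples. Here `equal_var=False` requests Welch's version of the test, which does not assume equal variances; the groups, seed, and sizes are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# Hypothetical samples: a and b share a true mean, c has a shifted mean
a = rng.normal(loc=100.0, scale=15.0, size=200)
b = rng.normal(loc=100.0, scale=15.0, size=200)
c = rng.normal(loc=130.0, scale=15.0, size=200)

alpha = 0.05
same = stats.ttest_ind(a, b, equal_var=False)
diff = stats.ttest_ind(a, c, equal_var=False)

# Reject the null hypothesis of equal means when the p-value is below alpha
for label, res in [("a vs b", same), ("a vs c", diff)]:
    decision = "reject" if res.pvalue < alpha else "fail to reject"
    print(label, res.statistic, res.pvalue, decision)
```

The shifted pair produces a very small p-value, while a pair drawn from the same distribution will only fall below alpha about 5% of the time.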

In [None]:
# T-test

#price
price = Auto.loc[:,"price"]
price_bs = Auto_bootstrap.loc[:,"price"]
test = t_test(price, price_bs, 0.05)
print(test)

### Comments:

    With a pvalue of 0.7 we fail to reject the null hypothesis that the mean for price from the dataset and the mean for price from the bootstrap sample are equivalent.

In [None]:
# T-test

#horsepower
hp = Auto.loc[:,"horsepower"]
hp_bs = Auto_bootstrap.loc[:,"horsepower"]
test = t_test(hp, hp_bs, 0.05)
print(test)

### Comments:

    With a pvalue of 0.92 we fail to reject the null hypothesis that the mean for horsepower from the dataset and mean for horsepower from the bootstrap sample are equivalent.

In [None]:
# T-test

#height
height = Auto.loc[:,"height"]
height_bs = Auto_bootstrap.loc[:,"height"]
test = t_test(height, height_bs, 0.05)
print(test)

### Comments:

    With a pvalue of 0.77 we fail to reject the null hypothesis that the mean for height from the dataset and mean for height from the bootstrap sample are equivalent.