# Lesson 4 Assignment - Automobile Price Hypothesis Testing
## Author - Matthew Denko

### Instructions
Test hypotheses for the price of automobiles:

1. Compare and test the normality of the distributions of **price** and **log price**. <br/>
Use both a graphical method and a formal test. Hint: remember these relationships will always be approximate.

2. Test significance of **price (or log price)** stratified by 
a) **fuel type**, b) **aspiration**, and c) **rear vs. front wheel drive**. <br />Use both graphical methods and the formal test.

3. Apply ANOVA and Tukey's HSD test to the auto price data to compare the **price** (or **log price**, if closer to a normal distribution) of autos stratified by **body style**.

4. Graphically explore the differences between the price conditioned by the categories of each variable. <br/>
Hint: Make sure you have enough data for each category and drop categories with insufficient data.

5. Use standard ANOVA and Tukey HSD to test the differences of these groups.


#### Note: 
Please clearly outline the results of these tests with markdown text. Be sure your discussion includes narrative for your figures and tables so it is clear to the reader the importance of each bit of evidence.

# Importing/Cleaning Data

In [None]:
# Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

In [None]:
# Defining Functions

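def stratified_sample_df(df, col, n_samples):
    """Draw an equal-sized random sample from every category of `col`."""
    # Cap the per-category sample size at the smallest category's row count
    n = min(n_samples, df[col].value_counts().min())
    df_ = df.groupby(col).apply(lambda x: x.sample(n))
    # Drop the group-key level that groupby().apply() adds to the index
    df_.index = df_.index.droplevel(0)
    return df_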

In [None]:
# Reading Data

url = "https://library.startlearninglabs.uw.edu/DATASCI410/Datasets/Automobile%20price%20data%20_Raw_.csv"
Auto = pd.read_csv(url, header=None)

#Assigning Column Names

Auto.columns = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors", "body-style", "drive-wheels",
               "engine-location", "wheel-base","length", "width", "height", "curb-weight", "engine-type", "num-of-cylinders",
               "engine-size", "fuel-system", "bore", "stroke", "compression-ratio","horsepower","peak-rpm","city-mpg",
               "highway-mpg","price"]
print(Auto.columns)
print(Auto.describe())
print(Auto.head(10))

In [None]:
#Removing cases with missing data

# Missing values are coded as "?" in the raw file; convert them to NaN so dropna can remove them
Auto = Auto.replace(to_replace="?", value=np.nan)

#dropping rows with nulls

Auto = Auto.dropna(axis = 0)
Auto_null = Auto.isnull().sum()
print("""Null Counts by Column
""",Auto_null)

# Testing Normality

## Price vs Log(price)

In [None]:
# Converting price to float

Auto['price'] = pd.to_numeric(Auto['price'], errors='coerce')

# Creating log price

Auto['log_price'] = np.log(Auto['price'])
print(Auto.dtypes)

# Dropping rows where log price is not finite (missing or non-positive price)

Auto = Auto[np.isfinite(Auto['log_price'])]

In [None]:
#Price Histogram

plt.hist(Auto['price'])
plt.xlabel('price')
plt.show()

In [None]:
#Log Price Histogram

plt.hist(Auto['log_price'])
plt.xlabel('log(price)')
plt.show()

### Comments:
Based on the histograms of price and log(price), log price is much closer to a normal distribution than price. Neither is strongly normal: price is strongly right-skewed, with most cars at the low end and a long tail of expensive ones.
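
The instructions also ask for a formal test alongside the graphical comparison. A minimal sketch of one option follows, using the Shapiro-Wilk test and the correlation of the normal Q-Q fit. The helper `normality_summary` is hypothetical, and the data are a synthetic right-skewed stand-in, not the automobile prices, so the printed numbers say nothing about the real data set.

```python
import numpy as np
from scipy import stats

def normality_summary(values):
    """Return Shapiro-Wilk and Q-Q fit diagnostics for a 1-D sample."""
    values = np.asarray(values, dtype=float)
    w_stat, p_value = stats.shapiro(values)
    # Without a plot argument, probplot only computes the Q-Q points and a
    # least-squares line; r close to 1 means the points are nearly straight.
    (_, _), (_, _, r) = stats.probplot(values, dist="norm")
    return {"shapiro_w": w_stat, "shapiro_p": p_value, "qq_r": r}

# Synthetic, right-skewed stand-in for the price column (assumption, not real data)
rng = np.random.default_rng(0)
fake_price = rng.lognormal(mean=9.3, sigma=0.5, size=200)

raw = normality_summary(fake_price)
logged = normality_summary(np.log(fake_price))
print("price     :", raw)
print("log(price):", logged)
```

With the real data, the same call would be `normality_summary(Auto['price'])` and `normality_summary(Auto['log_price'])`. A small Shapiro-Wilk p-value is evidence against normality, so the variable with the larger p-value and the Q-Q correlation closer to 1 is the better candidate for the tests that follow.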

# Hypothesis Testing

## Creating Stratified Samples

In [None]:
# Creating Stratified Samples

# fuel-type
fuel_type = stratified_sample_df(Auto, 'fuel-type', 100)
print(fuel_type.describe())

# aspiration
aspiration = stratified_sample_df(Auto, 'aspiration', 100)
print(aspiration.describe())

# drive-wheels
drive_wheels = stratified_sample_df(Auto, 'drive-wheels', 100)
print(drive_wheels.describe())

# body-style
body_style = stratified_sample_df(Auto, 'body-style', 100)
print(body_style.describe())
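
Because `stratified_sample_df` caps the per-category size at the smallest category's row count, a single rare category shrinks every stratum. The sketch below shows one way to drop sparse categories before sampling, as the instructions hint. The helper `drop_small_categories`, the threshold, and the toy values are all assumptions made for illustration.

```python
import pandas as pd

def drop_small_categories(df, col, min_count):
    """Keep only rows whose category in `col` occurs at least `min_count` times."""
    counts = df[col].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[col].isin(keep)]

# Toy frame with made-up values and one very rare body style
toy = pd.DataFrame({
    "body-style": ["sedan"] * 5 + ["hatchback"] * 4 + ["convertible"],
    "price": [9000, 9500, 10000, 11000, 12000, 7000, 7500, 8000, 8500, 30000],
})
print(toy["body-style"].value_counts())

filtered = drop_small_categories(toy, "body-style", min_count=3)
print(filtered["body-style"].value_counts())
```

On the real data this would be applied before sampling, for example `stratified_sample_df(drop_small_categories(Auto, 'body-style', 10), 'body-style', 100)`, where the threshold of 10 is an arbitrary illustrative choice.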


## Graphing Distributions of Stratified Samples vs Population

In [None]:
##Price Histogram - Original Sample

plt.hist(Auto['price'])
plt.xlabel('price')
plt.show()

In [None]:
##Price Histogram - Stratified by fuel-type

plt.hist(fuel_type['price'])
plt.xlabel('price')
plt.show()

In [None]:
##Price Histogram - Stratified by aspiration

# Plot the aspiration sample (the original cell plotted fuel_type here by mistake)
plt.hist(aspiration['price'])
plt.xlabel('price')
plt.show()

In [None]:
##Price Histogram - Stratified by drive-wheel

plt.hist(drive_wheels['price'])
plt.xlabel('price')
plt.show()

In [None]:
##Price Histogram - Stratified by body-style

plt.hist(body_style['price'])
plt.xlabel('price')
plt.show()

### Comments:

Based on the histograms of the samples stratified by fuel-type, aspiration, drive wheels, and body style, there appear to be three main groups of prices. The first group is around 0-2500, the second is around 7500-11000, and the third is around 16000-18000. Each stratified sample shows this clustering, which is not visible in the original sample.

## ANOVA TESTING

In [None]:
# original sample vs fuel-type sample

ft_anova = stats.f_oneway(Auto['price'],fuel_type['price'])
print('fuel-type ANOVA test', ft_anova)

# original sample vs aspiration sample

as_anova = stats.f_oneway(Auto['price'],aspiration['price'])
print('aspiration ANOVA test', as_anova)

# original sample vs drive-wheel sample

dw_anova = stats.f_oneway(Auto['price'],drive_wheels['price'])
print('drive-wheel ANOVA test', dw_anova)

# original sample vs body-style sample

bs_anova = stats.f_oneway(Auto['price'],body_style['price'])
print('body-style ANOVA test', bs_anova)

### Comments:

Based on the ANOVA tests between the original sample and each stratified sample, and using a significance level of 0.05, we cannot reject the null hypothesis that the two sample means of price are equal in any of the four comparisons.
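
The instructions ask for comparisons between the categories of each variable (for example, gas vs. diesel, or one body style vs. another), whereas the cells above compare the full sample with each stratified sample. A minimal sketch of the between-category version follows. It assumes SciPy 1.8 or newer for `scipy.stats.tukey_hsd`. The helper `compare_groups` is hypothetical, and the data are synthetic, not results from the automobile data.

```python
import numpy as np
import pandas as pd
from scipy import stats

def compare_groups(df, value_col, group_col):
    """One-way ANOVA across the categories of `group_col`, plus Tukey's HSD."""
    labels = sorted(df[group_col].unique())
    samples = [df.loc[df[group_col] == g, value_col].to_numpy() for g in labels]
    anova = stats.f_oneway(*samples)   # H0: all category means are equal
    tukey = stats.tukey_hsd(*samples)  # pairwise comparisons (SciPy >= 1.8)
    return labels, anova, tukey

# Synthetic log prices for three made-up body-style groups (assumption, not real data)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "body-style": np.repeat(["hatchback", "sedan", "wagon"], 30),
    "log_price": np.concatenate([
        rng.normal(9.0, 0.3, 30),
        rng.normal(9.3, 0.3, 30),
        rng.normal(9.3, 0.3, 30),
    ]),
})

labels, anova, tukey = compare_groups(toy, "log_price", "body-style")
print("ANOVA:", anova)
# tukey.pvalue[i, j] is the adjusted p-value for category i vs. category j
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(labels[i], "vs", labels[j], "p =", tukey.pvalue[i, j])
```

For the real data, the equivalent call would be `compare_groups(Auto, 'log_price', 'body-style')`, ideally after dropping categories with too few rows. For a two-level variable such as fuel-type, the ANOVA p-value alone answers the question, since there is only one pair to compare.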