# Boston Housing Dataset
Using the canonical Boston Housing dataset, we'll consider these research questions:
- How much of the variation in **median house price** across census tracts can be explained by the characteristics present in the dataset? This is a question of model fit, and is assessed using the **R Squared** value.
- What are the strength and magnitude of the association between each characteristic (independent variable) and the median house price (dependent variable)? This question is assessed by the magnitude of the beta values (coefficients) output by the model and the uncertainty around each estimate (captured by its standard error and the corresponding p value).
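
To make these quantities concrete, here is a minimal sketch on synthetic data (not the Boston data) that fits a one-variable least-squares line with NumPy and computes R Squared, the slope, and its standard error by hand. The simulated data and all variable names are assumptions made for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Boston data)
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 3.0 - 2.0 * x + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# R Squared: share of the variance in y explained by the fitted line
residuals = y - X @ beta
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Standard errors: square root of the diagonal of sigma^2 * (X'X)^-1
sigma2 = ss_res / (n - X.shape[1])
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

print(f"R Squared: {r_squared:.3f}")
print(f"slope: {beta[1]:.3f} (SE {std_err[1]:.3f})")
```

On a fitted statsmodels result, the same quantities are available as `.rsquared`, `.params`, and `.bse`, with the p values in `.pvalues`.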

In [58]:
# Import packages
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston # removed in scikit-learn 1.2; requires an older version
import statsmodels.formula.api as smf # linear modeling
import matplotlib.pyplot as plt

In [2]:
# Load and format data, then print the description
boston = load_boston()
bos_df = pd.DataFrame(boston.data, columns=boston.feature_names)
bos_df["med_price"] = pd.Series(boston.target * 1000)
print(boston.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

Before you get started, you should make sure you understand the dataset by answering the following questions:
- What is your **unit of analysis**? In other words, what does each row represent?
- What is your outcome of interest?
- What will performing a linear regression allow you to do?

## Preliminary analysis (before modeling)

In [1]:
# Draw a histogram of your outcome of interest
# Make sure to include a descriptive chart title (i.e., what is it actually a distribution of?)

In [2]:
# Compute the correlation between your outcome of interest and each other variable
# Hint: you can use the `.corr()` method of your dataframe to compute all correlations, 
# then select and sort the values you are interested in
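
The hint above can be sketched on a small synthetic DataFrame. This is an illustration only: `toy_df` and its column names are hypothetical stand-ins, not the Boston data, and sorting by absolute value is one reasonable choice rather than the required approach.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real DataFrame (column names are hypothetical)
rng = np.random.default_rng(1)
n = 300
a = rng.normal(size=n)
toy_df = pd.DataFrame({
    "feat_a": a,
    "feat_b": rng.normal(size=n),
    "outcome": 2.0 * a + rng.normal(scale=0.5, size=n),
})

# Correlation of every column with the outcome, excluding the outcome itself
corrs = toy_df.corr()["outcome"].drop("outcome")

# Sort by absolute value so strong negative correlations rank highly too
order = corrs.abs().sort_values(ascending=False).index
ranked = corrs.loc[order]
print(ranked)
```

Ranking by absolute value matters because a feature with a correlation near -1 is just as informative as one near +1.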

In [3]:
# Draw a scatter plot between your most correlated feature and your outcome of interest
# Make sure to include a descriptive plot title

## Univariate Regression
Perform a univariate regression of your outcome of interest on your most correlated variable. Note that this allows you to assess a hypothesis regarding the relationship between these variables.

> **Answer**: Type out your null hypothesis here 

In [4]:
# Fit your model to your data

In [5]:
# How much variation in your outcome is explained by variation in your independent variable?
# Hint: the value is stored as part of your model fit

In [6]:
# Create a scatter plot of your independent variable versus your dependent variable
# Overlay on the scatter plot the predictions from your model
# Make sure to include descriptive labels/titles

Based on your model, can you reject your null hypothesis? If so, what do you believe the relationship between your independent and dependent variables to be? Make sure to include specific values in your response.

> **Answer**: Your answer here.

## Multivariate regression
Perform a multivariate regression of your outcome of interest on all of your features. Note that this allows you to assess a hypothesis regarding the relationship between each feature and your outcome.

In [7]:
# Fit a model with all features
# Hint: you can "join" together boston.feature_names
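
As a sketch of the "join" hint, the snippet below builds a formula string from a list of names. The three feature names are a hypothetical subset chosen for illustration; in the notebook the full list would come from `boston.feature_names`.

```python
# Hypothetical subset of feature names, for illustration only
feature_names = ["CRIM", "RM", "PTRATIO"]
outcome = "med_price"

# Formula string: outcome on the left, features joined by " + " on the right
formula = outcome + " ~ " + " + ".join(feature_names)
print(formula)  # med_price ~ CRIM + RM + PTRATIO
```

A string like this can be passed as the `formula` argument of `smf.ols`, together with the DataFrame as `data`.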

In [8]:
# Create a plot that shows:
# - X axis: the coefficient for each feature in your model, 
#   AND the confidence intervals around it
# - Y axis: the name of each feature
# Hint: you can extract this information using MODEL_NAME.conf_int() and MODEL_NAME.params
# Hint: matplotlib has an `.errorbar()` chart type for this
# Include a descriptive chart title
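
One way to lay out such a plot is sketched below with made-up numbers. The coefficient values, interval bounds, and feature names are all invented for illustration and are not results from any model. The main detail to notice is that `errorbar` expects distances from each point, not absolute lower and upper bounds.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up coefficients and interval bounds (not results from any model)
names = ["feat_a", "feat_b", "feat_c"]
coefs = np.array([1.5, -0.8, 0.1])
lower = np.array([1.0, -1.4, -0.3])
upper = np.array([2.0, -0.2, 0.5])

# errorbar expects distances from the point, not absolute bounds
xerr = np.vstack([coefs - lower, upper - coefs])

fig, ax = plt.subplots()
positions = np.arange(len(names))
ax.errorbar(coefs, positions, xerr=xerr, fmt="o", capsize=3)
ax.axvline(0, color="gray", linestyle="--")  # reference line at zero
ax.set_yticks(positions)
ax.set_yticklabels(names)
ax.set_xlabel("Coefficient (with confidence interval)")
ax.set_title("Illustrative coefficients and confidence intervals")
```

With a statsmodels fit, `conf_int()` returns the lower and upper bounds in columns `0` and `1`, so the two rows of `xerr` would be built from those columns and `params`. An interval that crosses the dashed line at zero indicates a coefficient that is not distinguishable from zero at that confidence level.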

How much variation in your outcome can you explain using all of these variables?

> **Answer**: Your answer here

Note the coefficient (beta value) on LSTAT -- how (and why) does it differ from the value above?

> **Answer**: Your answer here


In [9]:
# Create a scatter plot of your actual data (x-axis) versus your predictions 
# Add a 45 degree line to the plot
# Bonus (not for credit): Shade the areas of under/over prediction
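
A minimal sketch of the actual-versus-predicted layout, using synthetic numbers rather than model output. The arrays `actual` and `predicted` are assumptions for illustration; with a real model they would be the outcome column and the fitted values.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic "actual" values and noisy "predictions" (illustration only)
rng = np.random.default_rng(2)
actual = rng.uniform(10, 50, size=100)
predicted = actual + rng.normal(scale=4.0, size=100)

# Shared limits so the 45 degree line spans the full range of both axes
low = min(actual.min(), predicted.min())
high = max(actual.max(), predicted.max())

fig, ax = plt.subplots()
ax.scatter(actual, predicted, alpha=0.6)
ax.plot([low, high], [low, high], color="black", label="Perfect prediction")
ax.set_xlabel("Actual value")
ax.set_ylabel("Predicted value")
ax.set_title("Actual versus predicted values (synthetic data)")
ax.legend()
```

Points above the line are over-predictions and points below it are under-predictions, which is the distinction the bonus shading would highlight.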