# California Housing Prices

## Table of Contents

* [Introducing the Data Set](#introducing-the-data-set)
* [Automated Approach to Feature Selection](#automated-approach-to-feature-selection)
* [Exploring the Models We Built](#exploring-the-models-we-built)

### Introducing the Data Set

The *California Housing Prices Data Set* is a famous example data set. It records data about **housing districts** (not individual houses) in California from the 1990 Census. As a machine learning problem, the goal is to predict the median house value of each district.

For this part, we're going to fit a multivariable linear regression model on a subset of the features in a more automated way. Later, we will look at a more "human-in-the-loop", exploratory data analysis style.

Still, we need to import the data and set the predictors and target. 

In [13]:
import pandas as pd

I found the data at [this link](https://github.com/ageron/handson-ml/blob/master/datasets/housing/housing.csv), and I'm going to use the fact that pandas can read a CSV file straight from a URL to load it directly.

In [14]:
housing_df = pd.read_csv("https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv")

In [15]:
housing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


Only `total_bedrooms` has missing values (20433 non-null entries out of 20640). Let's drop the observations with missing values.

In [18]:
housing_df = housing_df.dropna(axis = "index")

In [19]:
housing_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20433 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20433 non-null float64
latitude              20433 non-null float64
housing_median_age    20433 non-null float64
total_rooms           20433 non-null float64
total_bedrooms        20433 non-null float64
population            20433 non-null float64
households            20433 non-null float64
median_income         20433 non-null float64
median_house_value    20433 non-null float64
ocean_proximity       20433 non-null object
dtypes: float64(9), object(1)
memory usage: 1.7+ MB
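Before dropping rows, it can help to count the missing values per column. The sketch below uses a small made-up frame (`toy_df` is hypothetical, not the housing data) to show the count and the effect of `dropna`. Note that the surviving rows keep their original index labels, which is why the real data still runs "0 to 20639" after the drop.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for housing_df
toy_df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Count missing values in each column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
toy_clean = toy_df.dropna(axis="index")
print(len(toy_df), "rows before,", len(toy_clean), "rows after")

# The remaining rows keep their original index labels
print(list(toy_clean.index))
```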


In [27]:
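# ocean_proximity is the only non-numeric (object) column; drop it so every remaining feature is a float
housing_df = housing_df.drop(columns = ['ocean_proximity'])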

In [28]:
housing_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20433 entries, 0 to 20639
Data columns (total 9 columns):
longitude             20433 non-null float64
latitude              20433 non-null float64
housing_median_age    20433 non-null float64
total_rooms           20433 non-null float64
total_bedrooms        20433 non-null float64
population            20433 non-null float64
households            20433 non-null float64
median_income         20433 non-null float64
median_house_value    20433 non-null float64
dtypes: float64(9)
memory usage: 2.2 MB


### Automated Approach to Feature Selection

For the moment, we have one objective: use the tools available to us to build the "best"-fitting model with a given number of variables, as judged by the $R^2$ score on the data set. We are *not* going to worry too much about what the functions are doing or even what the real-world context for the data is.
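For reference, the score used here is the coefficient of determination. Assuming the standard definition (which is what scikit-learn's `score` method reports for regressors), with $y_i$ the observed values, $\hat{y}_i$ the model's predictions, and $\bar{y}$ the mean of the observed values:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting $\bar{y}$.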

#### Using SelectKBest with f_regression

In [44]:
#import the tools we'll be using
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
import numpy as np

Let's build a pipeline that fits a linear regression model to the best 3 features from the data set. 

In [9]:
three_feature_selector = SelectKBest(score_func = f_regression, k = 3)
lm3 = LinearRegression()
lm3_pipeline = make_pipeline(three_feature_selector, lm3)

That's it. Now we have an object **lm3_pipeline** that reduces its input to the "best" three features (as ranked by `f_regression`) and fits a linear regression model on those features.

Notice that this pipeline doesn't yet know what our data is.

#### Splitting the Data into Predictors (Features) and Response

In [30]:
y_var = housing_df["median_house_value"] #create a response
X_var = housing_df.drop(columns = ["median_house_value"]) #create predictors

In [32]:
from sklearn.model_selection import train_test_split

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X_var, y_var,
                                                    train_size = 0.7,
                                                    random_state = 23)
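As a small illustration of what `train_size` and `random_state` do, here is a sketch on made-up arrays (`X_toy` and `y_toy` are hypothetical). It checks the 70/30 split sizes and shows that reusing the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 observations, 2 features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          train_size=0.7,
                                          random_state=23)
print(X_tr.shape, X_te.shape)

# Splitting again with the same random_state gives the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy,
                                              train_size=0.7,
                                              random_state=23)
print(np.array_equal(X_tr, X_tr2))
```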

In [54]:
lm3_pipeline.fit(X_train, y_train) # fit the pipeline: select the 3 best features, then fit the regression on them

Pipeline(memory=None,
         steps=[('selectkbest',
                 SelectKBest(k=3,
                             score_func=<function f_regression at 0x118f38d40>)),
                ('linearregression',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

In [36]:
lm3.coef_

array([-4.93474411e+03, -6.97287617e-02,  4.14095523e+04])

In [37]:
lm3.intercept_

222498.8043618232
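The coefficients above are printed without names, so it isn't obvious which three columns were kept. One way to find out is to ask the selector step for its mask with `get_support()`. The sketch below does this on made-up data (`X_toy`, `y_toy`, and the column names are hypothetical); the step names `selectkbest` and `linearregression` are the ones `make_pipeline` generates, as seen in the pipeline printout.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: only columns "a" and "c" drive the response
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 5)),
                     columns=["a", "b", "c", "d", "e"])
y_toy = 3.0 * X_toy["a"] - 2.0 * X_toy["c"] + rng.normal(scale=0.1, size=200)

toy_pipeline = make_pipeline(SelectKBest(score_func=f_regression, k=2),
                             LinearRegression())
toy_pipeline.fit(X_toy, y_toy)

# Boolean mask over the input columns: True where a column was kept
selector = toy_pipeline.named_steps["selectkbest"]
selected_columns = X_toy.columns[selector.get_support()]

# Pair each kept column with its regression coefficient
coefs = toy_pipeline.named_steps["linearregression"].coef_
print(dict(zip(selected_columns, coefs)))
```

On the housing data, the analogous call would be `X_train.columns[three_feature_selector.get_support()]`.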

In [39]:
X_train.loc[12] # .loc selects a row by its index label

longitude             -122.260
latitude                37.850
housing_median_age      52.000
total_rooms           2491.000
total_bedrooms         474.000
population            1098.000
households             468.000
median_income            3.075
Name: 12, dtype: float64

In [40]:
y_train[12]

213500.0

In [50]:
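# NOTE: latitude is typed as 37.95 here, but row 12 has 37.85, so this
# prediction is slightly off; the next cells redo it from X_train.loc[12]
house_12 = np.array([-122.260, 37.95, 52, 2491, 474, 1098, 468, 3.075])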

In [56]:
lm3_pipeline.predict([house_12])

array([162385.944369])

In [59]:
house_12 = X_train.loc[12]

In [60]:
lm3_pipeline.predict([house_12])

array([162879.41878035])

In [61]:
residual_12 = y_train[12] - lm3_pipeline.predict([house_12])

In [62]:
residual_12

array([50620.58121965])
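To make the link between the coefficients, the intercept, and the prediction explicit, here is a sketch on made-up data (`X_toy` and `y_toy` are hypothetical). It reproduces the pipeline's prediction by hand as intercept plus the dot product of the coefficients with the selected features, then computes the residual as observed minus predicted.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 100 observations, 4 features
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 4))
y_toy = 5.0 + 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 3] + rng.normal(scale=0.5, size=100)

selector = SelectKBest(score_func=f_regression, k=2)
model = LinearRegression()
toy_pipeline = make_pipeline(selector, model)
toy_pipeline.fit(X_toy, y_toy)  # fits selector and model in place

row = X_toy[[0]]                     # one observation, shape (1, 4)
selected = selector.transform(row)   # keep only the 2 selected features

# Prediction by hand: intercept + coefficients . selected features
by_hand = model.intercept_ + selected @ model.coef_
by_pipeline = toy_pipeline.predict(row)
print(by_hand, by_pipeline)

# Residual: observed value minus predicted value
residual = y_toy[0] - by_pipeline[0]
print(residual)
```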

Our goal was to do this in an automated way. 

Situation: You have ten minutes to come up with a model that scores above 0.9 on the test data.

In [63]:
lm3_pipeline.score(X_train, y_train)

0.47568013973680634

In [64]:
lm3_pipeline.score(X_test, y_test)

0.4961274011007707
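One automated way to explore further is to loop over `k` and record the training and test scores for each pipeline. The sketch below does this on synthetic data from `make_regression` (an assumption for illustration, not the housing data), so the numbers it prints say nothing about what is achievable on the real data set.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: 8 features, 5 of them informative
X_toy, y_toy = make_regression(n_samples=500, n_features=8,
                               n_informative=5, noise=10.0,
                               random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          train_size=0.7,
                                          random_state=23)

# Fit one pipeline per value of k and record (train score, test score)
scores = {}
for k in range(1, X_toy.shape[1] + 1):
    pipe = make_pipeline(SelectKBest(score_func=f_regression, k=k),
                         LinearRegression())
    pipe.fit(X_tr, y_tr)
    scores[k] = (pipe.score(X_tr, y_tr), pipe.score(X_te, y_te))

for k, (train_score, test_score) in scores.items():
    print(k, round(train_score, 3), round(test_score, 3))
```

Because each larger `k` keeps a superset of the previously selected features, the training score cannot go down as `k` grows; the test score is the one to watch.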

In [65]:
X_train.loc[204] 

longitude             -122.2300
latitude                37.7800
housing_median_age      44.0000
total_rooms           2340.0000
total_bedrooms         825.0000
population            2813.0000
households             751.0000
median_income            1.6009
Name: 204, dtype: float64