# Regression Exercise 

California Housing Data

This data set contains information about all the block groups in California from the 1990 Census. In this sample a block group on average includes 1425.5 individuals living in a geographically compact area. 

The task is to approximate the median house value of each block group from the values of the other variables.

The data set was obtained from the LIACC repository. The original page where it can be found is: http://www.liaad.up.pt/~ltorgo/Regression/DataSets.html.
 

The Features:
 
* housingMedianAge: continuous. 
* totalRooms: continuous. 
* totalBedrooms: continuous. 
* population: continuous. 
* households: continuous. 
* medianIncome: continuous. 
* medianHouseValue: continuous (the target to predict).

## The Data

**Import the cal_housing_clean.csv file with pandas. Separate it into a training set (70%) and a testing set (30%).**

In [1]:
import pandas as pd
import seaborn as sns

In [2]:
housing = pd.read_csv("../sample_data/cal_housing_clean.csv")
housing.head()

| | housingMedianAge | totalRooms | totalBedrooms | population | households | medianIncome | medianHouseValue |
|---|---|---|---|---|---|---|---|
| 0 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 |
| 1 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 |
| 2 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 |
| 3 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 |
| 4 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 |


In [3]:
housing.describe()

| | housingMedianAge | totalRooms | totalBedrooms | population | households | medianIncome | medianHouseValue |
|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 28.639486 | 2635.763081 | 537.898014 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 12.585558 | 2181.615252 | 421.247906 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | 18.0 | 1447.75 | 295.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


In [None]:
sns.pairplot(housing)

In [4]:
x_data = housing.drop("medianHouseValue", axis=1)
y_labels = housing['medianHouseValue']

In [5]:
x_data.columns

Index(['housingMedianAge', 'totalRooms', 'totalBedrooms', 'population',
       'households', 'medianIncome'],
      dtype='object')

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
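# No test_size is given, so the default of 0.25 applies (a 75/25 split).
# Pass test_size=0.3 to get the 70/30 split requested above.
X_train, X_test, y_train, y_test = train_test_split(
    x_data, y_labels,
)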

In [8]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(15480, 6) (5160, 6) (15480,) (5160,)
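The shapes above correspond to a 25% test share (5160 of 20640 rows), not the 30% named in the task. The sketch below uses only the standard library to show how a shuffled index split produces those sizes, and what a 30% share would give. `split_indices` is a hypothetical helper written for illustration; it is not scikit-learn's implementation.

```python
import math
import random


def split_indices(n_rows, test_fraction, seed=0):
    """Shuffle row indices and split them into (train, test) lists.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_test = math.ceil(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


n_rows = 20640  # row count reported by housing.describe()

train_idx, test_idx = split_indices(n_rows, test_fraction=0.25)
print(len(train_idx), len(test_idx))        # sizes for a 25% test share

train_idx_30, test_idx_30 = split_indices(n_rows, test_fraction=0.3)
print(len(train_idx_30), len(test_idx_30))  # sizes for a 30% test share
```

Fixing the seed makes the split repeatable, which helps when comparing RMSE values between runs.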


### Scale the Feature Data

**Use sklearn preprocessing to create a MinMaxScaler for the feature data. Fit this scaler only to the training data. Then use it to transform X_test and X_train. Finally, use the scaled X_test and X_train with pd.DataFrame to re-create two dataframes of scaled data.**

In [9]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [10]:
scaler.fit(X_train)

MinMaxScaler(copy=True, feature_range=(0, 1))

In [11]:
X_train = pd.DataFrame(
    scaler.transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)

X_test = pd.DataFrame(
    scaler.transform(X_test),
    columns=X_test.columns,
    index=X_test.index
)

In [12]:
X_test.head()

| | housingMedianAge | totalRooms | totalBedrooms | population | households | medianIncome |
|---|---|---|---|---|---|---|
| 8879 | 1.0 | 0.058811 | 0.049038 | 0.021469 | 0.045387 | 0.543717 |
| 12718 | 0.686275 | 0.032872 | 0.024519 | 0.010062 | 0.023351 | 0.437718 |
| 3455 | 0.529412 | 0.083959 | 0.088144 | 0.061717 | 0.091761 | 0.321851 |
| 18668 | 0.431373 | 0.149545 | 0.149441 | 0.061493 | 0.144549 | 0.301334 |
| 5308 | 0.607843 | 0.049506 | 0.067349 | 0.022282 | 0.06841 | 0.29041 |
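To make the "fit only on the training data" instruction concrete, here is a minimal standard-library sketch of min-max scaling. The helper names and the toy rows are hypothetical, and this is a simplified stand-in for `MinMaxScaler`, not its actual implementation.

```python
def fit_min_max(rows):
    """Return per-column (min, max) pairs learned from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def transform_min_max(rows, ranges):
    """Scale each value to (x - min) / (max - min) using the fitted ranges."""
    return [
        [(x - lo) / (hi - lo) for x, (lo, hi) in zip(row, ranges)]
        for row in rows
    ]


# Hypothetical toy rows: (housingMedianAge, medianIncome)
train_rows = [[1.0, 0.5], [21.0, 2.5], [41.0, 4.5]]
test_rows = [[52.0, 3.5]]   # an age above anything seen in training

ranges = fit_min_max(train_rows)            # fit on training rows only
train_scaled = transform_min_max(train_rows, ranges)
test_scaled = transform_min_max(test_rows, ranges)

print(train_scaled)
print(test_scaled)  # first value exceeds 1 because 52 > the training max of 41
```

Because the ranges come from the training rows alone, test values can land slightly outside [0, 1]. That is expected and avoids leaking information from the test set into the scaler.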


### Create Feature Columns

**Create the necessary tf.feature_column objects for the estimator. They should all be treated as continuous numeric_columns.**

In [13]:
import tensorflow as tf

In [14]:
median_age = tf.feature_column.numeric_column('housingMedianAge')
rooms = tf.feature_column.numeric_column('totalRooms')
bedrooms = tf.feature_column.numeric_column('totalBedrooms')
population = tf.feature_column.numeric_column('population')
households = tf.feature_column.numeric_column('households')
income = tf.feature_column.numeric_column('medianIncome')

In [15]:
feat_cols = [
    median_age,
    rooms,
    bedrooms,
    population,
    households,
    income
]

**Create the input function for the estimator object. (Play around with batch_size and num_epochs.)**

In [16]:
input_func = tf.estimator.inputs.pandas_input_fn(
    X_train,
    y_train,
    batch_size=10,
    num_epochs=1000,
    shuffle=True
)
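When experimenting with `batch_size` and `num_epochs`, it helps to see how they relate to the number of training steps. The arithmetic below is a rough sketch that assumes one step consumes one full batch; the row count comes from the split shown earlier.

```python
n_train_rows = 15480   # training rows from the split above
batch_size = 10
num_epochs = 1000

steps_per_epoch = n_train_rows // batch_size      # full batches per pass
max_steps = steps_per_epoch * num_epochs          # steps before input runs out

requested_steps = 30000                           # value passed to train() later
epochs_covered = requested_steps / steps_per_epoch

print(steps_per_epoch, max_steps, round(epochs_covered, 2))
```

Under these assumptions, the 30,000 steps used in the training cell cover fewer than 20 passes over the data, far below the 1,000 epochs the input function allows.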

**Create the estimator model. Use a DNNRegressor. Play around with the hidden units!**

In [17]:
dnn_model = tf.estimator.DNNRegressor(
    hidden_units=[6,6,6,6,6],
    feature_columns=feat_cols
)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/rg/qlb5htps7fg440bjyl_5vztc0000gn/T/tmp7v3dzhra', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x12c645550>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
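When trying different `hidden_units`, a quick parameter count gives a feel for model size. The sketch below assumes plain fully connected layers with a bias term and a single linear output, which is how a DNN regressor is usually structured. `count_dense_parameters` is a hypothetical helper, not part of TensorFlow.

```python
def count_dense_parameters(n_inputs, hidden_units, n_outputs=1):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    previous = n_inputs
    for width in list(hidden_units) + [n_outputs]:
        total += previous * width + width   # weight matrix plus bias vector
        previous = width
    return total


n_features = 6                      # the six numeric feature columns
hidden_units = [6, 6, 6, 6, 6]      # as passed to DNNRegressor above

print(count_dense_parameters(n_features, hidden_units))  # 217
```

A network this small is quick to train, so widening the layers (for example `[10, 10, 10]`) is a cheap experiment.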


##### **Train the model for ~1,000 steps. (Later, come back to this and train it for more steps to check for improvement. The run below uses 30,000 steps.)**

In [18]:
dnn_model.train(
    input_fn=input_func,
    steps=30000
)

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Calling model_fn.
Instructions for updating:
Use tf.cast instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Saving checkpoints for 0 into /var/folders/rg/qlb5htps7fg440bjyl_5vztc0000gn/T/tmp7v3dzhra/model.ckpt.
INFO:tensorflow:loss = 744655900000.0, step = 1
INFO:tensorflow:global_step/sec: 410.804
INFO:tensorflow:loss = 284381050000.0, step = 101 (0.245 sec)
INFO:tensorflow:global_step/sec: 563.99
INFO:tensorflow:loss = 79839930000.0, step = 201 (0.177 sec)
...

INFO:tensorflow:global_step/sec: 562.214
INFO:tensorflow:loss = 144218720000.0, step = 5901 (0.177 sec)
INFO:tensorflow:global_step/sec: 592.392
INFO:tensorflow:loss = 102302065000.0, step = 6001 (0.169 sec)
INFO:tensorflow:global_step/sec: 603.523
INFO:tensorflow:loss = 135110150000.0, step = 6101 (0.166 sec)
INFO:tensorflow:global_step/sec: 592.979
INFO:tensorflow:loss = 89410020000.0, step = 6201 (0.168 sec)
INFO:tensorflow:global_step/sec: 639.534
INFO:tensorflow:loss = 82942980000.0, step = 6301 (0.157 sec)
INFO:tensorflow:global_step/sec: 660.219
INFO:tensorflow:loss = 71655890000.0, step = 6401 (0.152 sec)
INFO:tensorflow:global_step/sec: 579.096
INFO:tensorflow:loss = 47325786000.0, step = 6501 (0.177 sec)
INFO:tensorflow:global_step/sec: 604.445
INFO:tensorflow:loss = 63072220000.0, step = 6601 (0.164 sec)
INFO:tensorflow:global_step/sec: 618.215
INFO:tensorflow:loss = 76991550000.0, step = 6701 (0.160 sec)
INFO:tensorflow:global_step/sec: 630.629
...

INFO:tensorflow:loss = 135110870000.0, step = 13801 (0.163 sec)
INFO:tensorflow:global_step/sec: 590.81
INFO:tensorflow:loss = 105737810000.0, step = 13901 (0.168 sec)
INFO:tensorflow:global_step/sec: 622.751
INFO:tensorflow:loss = 119484790000.0, step = 14001 (0.160 sec)
INFO:tensorflow:global_step/sec: 555.398
INFO:tensorflow:loss = 25754135000.0, step = 14101 (0.180 sec)
INFO:tensorflow:global_step/sec: 691.405
INFO:tensorflow:loss = 117499270000.0, step = 14201 (0.144 sec)
INFO:tensorflow:global_step/sec: 586.301
INFO:tensorflow:loss = 74070310000.0, step = 14301 (0.172 sec)
INFO:tensorflow:global_step/sec: 636.423
INFO:tensorflow:loss = 46207110000.0, step = 14401 (0.156 sec)
INFO:tensorflow:global_step/sec: 625.226
INFO:tensorflow:loss = 91425390000.0, step = 14501 (0.161 sec)
INFO:tensorflow:global_step/sec: 561.114
INFO:tensorflow:loss = 158933990000.0, step = 14601 (0.177 sec)
INFO:tensorflow:global_step/sec: 606.354
INFO:tensorflow:loss = 194026730000.0, step = 14701 (0.165 sec)
...

INFO:tensorflow:loss = 114213175000.0, step = 21701 (0.168 sec)
INFO:tensorflow:global_step/sec: 637.605
INFO:tensorflow:loss = 119100680000.0, step = 21801 (0.155 sec)
INFO:tensorflow:global_step/sec: 607.46
INFO:tensorflow:loss = 24401620000.0, step = 21901 (0.166 sec)
INFO:tensorflow:global_step/sec: 606.266
INFO:tensorflow:loss = 95306015000.0, step = 22001 (0.163 sec)
INFO:tensorflow:global_step/sec: 637.588
INFO:tensorflow:loss = 57372010000.0, step = 22101 (0.157 sec)
INFO:tensorflow:global_step/sec: 631.213
INFO:tensorflow:loss = 189214200000.0, step = 22201 (0.158 sec)
INFO:tensorflow:global_step/sec: 641.693
INFO:tensorflow:loss = 132695590000.0, step = 22301 (0.156 sec)
INFO:tensorflow:global_step/sec: 538.364
INFO:tensorflow:loss = 71860150000.0, step = 22401 (0.188 sec)
INFO:tensorflow:global_step/sec: 577.914
INFO:tensorflow:loss = 90832490000.0, step = 22501 (0.170 sec)
INFO:tensorflow:global_step/sec: 647.979
INFO:tensorflow:loss = 45021405000.0, step = 22601 (0.155 sec)
...

INFO:tensorflow:loss = 83341115000.0, step = 29601 (0.150 sec)
INFO:tensorflow:global_step/sec: 620.876
INFO:tensorflow:loss = 51352252000.0, step = 29701 (0.164 sec)
INFO:tensorflow:global_step/sec: 605.462
INFO:tensorflow:loss = 49504526000.0, step = 29801 (0.162 sec)
INFO:tensorflow:global_step/sec: 669.28
INFO:tensorflow:loss = 38280360000.0, step = 29901 (0.150 sec)
INFO:tensorflow:Saving checkpoints for 30000 into /var/folders/rg/qlb5htps7fg440bjyl_5vztc0000gn/T/tmp7v3dzhra/model.ckpt.
INFO:tensorflow:Loss for final step: 76348860000.0.


<tensorflow_estimator.python.estimator.canned.dnn.DNNRegressor at 0x12c645400>
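The logged loss values look enormous, but they can be put on the same scale as the labels. The sketch below assumes the logged loss is the sum of squared errors over one batch, which was the default reduction for canned estimators in TensorFlow 1.x as far as I know; if your version averages over the batch instead, skip the division.

```python
import math

final_batch_loss = 76348860000.0   # "Loss for final step" from the log above
batch_size = 10                    # batch_size used in input_func

# Assumption: the logged loss is the SUM of squared errors over the batch.
mean_squared_error_batch = final_batch_loss / batch_size
rmse_batch = math.sqrt(mean_squared_error_batch)

print(round(rmse_batch))
```

This is the error on a single batch of ten rows, so it fluctuates a lot from step to step. It is only a rough sanity check against the test-set RMSE computed later.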

**Create a prediction input function and then use the .predict method of your estimator model to create a list of predictions on your test data.**

In [19]:
pred_input_func = tf.estimator.inputs.pandas_input_fn(
    X_test,
    batch_size=10,
    num_epochs=1,
    shuffle=False
)

In [20]:
predictions = [prediction['predictions'] for prediction in dnn_model.predict(pred_input_func)]

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from /var/folders/rg/qlb5htps7fg440bjyl_5vztc0000gn/T/tmp7v3dzhra/model.ckpt-30000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
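Each item collected above is a small container holding one value rather than a bare number. The sketch below shows one way to flatten such items into plain floats. `fake_predict` is a hypothetical stand-in for `dnn_model.predict(...)`, and it assumes each yielded dict has a `'predictions'` entry of length one.

```python
def fake_predict():
    """Hypothetical stand-in for dnn_model.predict(...): yields one dict per row."""
    for value in (250000.0, 180000.0, 320000.0):
        yield {"predictions": [value]}   # length-1 container, like the real output


# Take element 0 of each entry so the result is a flat list of floats.
flat_predictions = [item["predictions"][0] for item in fake_predict()]
print(flat_predictions)
```

A flat list is easier to compare against `y_test` by hand or to place in a DataFrame column.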


**Calculate the RMSE. You should be able to get around 100,000 RMSE (remember that this is in the same units as the label). Do this manually or use [sklearn.metrics](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html).**

In [21]:
from sklearn.metrics import mean_squared_error

In [22]:
mean_squared_error(y_test, predictions) ** 0.5

79129.46209866836

Should be < 100,000
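For the "do this manually" route, a minimal standard-library sketch of RMSE follows. The function name `rmse` and the demo values are hypothetical and chosen only for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values in dollars, for illustration only
y_true_demo = [200000.0, 300000.0, 100000.0]
y_pred_demo = [210000.0, 270000.0, 100000.0]

print(rmse(y_true_demo, y_pred_demo))
```

Because the errors are squared before averaging, the single 30,000 miss dominates the result, which is why RMSE is sensitive to a few badly predicted blocks.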

# Great Job!