<a href="https://colab.research.google.com/github/nrhodes/cs152fa2019/blob/master/Assignment_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 3

This assignment is to predict housing prices. We're using a dataset from Ames, Iowa ([paper](http://jse.amstat.org/v19n3/decock.pdf), [dataset](http://jse.amstat.org/v19n3/decock/AmesHousing.txt)).

You may find it useful to start with the [Chollet notebook](https://colab.research.google.com/github/nrhodes/cs152fa2019/blob/master/Lecture_12_3_7_predicting_house_prices_(tf_2_0).ipynb) that does regression on Boston housing prices.

This notebook loads the Ames housing dataset into a [Pandas](https://pandas.pydata.org/) dataframe ([Pandas cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)). Pandas is a Python library suited for working with large amounts of tabular data.

Your goal is to train a neural network that achieves a low Mean Absolute Error (MAE) on your validation data (that is, you want to minimize the average absolute error in predicted house price). Since the number of samples is fairly small, I recommend using some form of K-fold cross-validation.
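
For reference, the Mean Absolute Error is the average of the absolute differences between predicted and actual prices, so it is measured in dollars. A minimal NumPy sketch (the helper name `mean_absolute_error` and the sample prices are made up for illustration):

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype="float64")
    y_pred = np.asarray(y_pred, dtype="float64")
    return np.mean(np.abs(y_true - y_pred))

# Hypothetical prices, not taken from the dataset.
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 330000]
print(mean_absolute_error(actual, predicted))
```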

To ensure that the K-fold cross-validation error is accurate, don't make the mistake the Chollet notebook makes: normalizing the data once and then splitting it into training and validation sets. Instead, base the normalization on the mean and standard deviation of *only* the data the model is being trained on (otherwise, information from the validation data has *leaked* into the training data).
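
The per-fold normalization described above can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not a required implementation: `normalize_fold`, `k_fold_splits`, and the random stand-in data are hypothetical, and the folds are taken in order without shuffling.

```python
import numpy as np

def normalize_fold(train_x, val_x):
    """Scale both arrays using statistics computed from train_x only."""
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (train_x - mean) / std, (val_x - mean) / std

def k_fold_splits(num_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds (k >= 2)."""
    folds = np.array_split(np.arange(num_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, val_idx

# Stand-in data; in the notebook this would be train_data.
rng = np.random.RandomState(0)
data = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

for train_idx, val_idx in k_fold_splits(len(data), k=4):
    fold_train, fold_val = normalize_fold(data[train_idx], data[val_idx])
    # Train and evaluate a fresh model on fold_train / fold_val here.
```

Because the mean and standard deviation are recomputed inside the loop, each validation fold is scaled with statistics it did not contribute to.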

You'll need to decide:
* The structure of the neural network (number of layers and their sizes)
* The number of epochs
* The choice of _k_ for k-fold validation
* The optimizer to use
* The activation functions to use
* The values of the hyperparameters
* Whether you want to remove uniquely-identifying information from the training instances (e.g., each training instance has a unique _Order_ and _PID_ field)
* How you want to deal with categorical information ([this article](https://pbpython.com/categorical-encoding.html) gives some possibilities)
* Whether you want to create any new columns (based on existing columns)
* Whether you want to remove any columns
* Whether you want to add any regularization
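
Several of these decisions are about columns rather than the network. As one possible illustration (not a recommended recipe), the sketch below drops the identifier columns, adds a derived column, and one-hot encodes a string column with `pd.get_dummies`. The helper `prepare_features`, the derived column `Total Flr SF`, and the sample values are assumptions made up for this example.

```python
import pandas as pd

# Tiny stand-in frame using a few of the real column names; values are illustrative.
sample = pd.DataFrame({
    "Order": [1, 2, 3],
    "PID": [101, 102, 103],
    "1st Flr SF": [1656, 896, 1329],
    "2nd Flr SF": [0, 0, 0],
    "House Style": ["1Story", "1Story", "2Story"],
    "SalePrice": [215000, 105000, 172000],
})

def prepare_features(df):
    """Drop identifiers, add a derived column, one-hot encode string columns."""
    out = df.drop(columns=["Order", "PID"])
    # Hypothetical derived column, not part of the original dataset.
    out["Total Flr SF"] = out["1st Flr SF"] + out["2nd Flr SF"]
    # get_dummies replaces each string column with one 0/1 column per category.
    return pd.get_dummies(out, dtype="float64")

prepared = prepare_features(sample)
print(prepared.columns.tolist())
```

If you encode the training and test sets separately, check that both end up with the same columns, since a category that appears in only one of them would otherwise produce mismatched inputs.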


Please keep a log in your notebook, keeping track of what you've tried and what its results are.

I'll be grading this assignment on:
*  Whether you've fixed the validation data leak (w.r.t. normalization).
*  The effort/creativity you've shown in attempting to create a good model (as shown in the log).
*  The MAE you've achieved on the validation data.
* How clear your notebook is. Don't just have code cells; include text cells that describe what's going on. However, don't just copy the text cells from the Chollet notebook.

I expect that you'll spend between 10 and 20 hours on this assignment.


## Initialization
Ensure we're using TensorFlow 2.0.

In [26]:
# Install TensorFlow
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
!pip install gast==0.2.2  # downgrade to resolve a problem with tf and gast 0.3
import tensorflow as tf 
print(tf.__version__)
print(tf.keras.__version__)

2.0.0-rc2
2.2.4-tf


Set up NumPy and pandas for use. The standard convention is to import NumPy as `np` and pandas as `pd`.

In [0]:
import pandas as pd
import numpy as np

np.random.seed(829)

## Loading the dataset
This reads from the Ames housing tab-separated file into a pandas dataframe.

In [28]:
df = pd.read_csv("http://jse.amstat.org/v19n3/decock/AmesHousing.txt", sep="\t")

We'll make one change to the data. We'll remove a handful of outliers with total indoor square footage > 4000. I don't care about predicting those mansions in Ames, Iowa.

In [0]:
df = df[df['1st Flr SF'] + df['2nd Flr SF'] <= 4000]

Let's look at the data:

In [30]:
df

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,...,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1960,1960,Hip,CompShg,BrkFace,Plywood,Stone,112.0,TA,TA,CBlock,TA,Gd,Gd,BLQ,639.0,Unf,0.0,441.0,1080.0,...,Y,SBrkr,1656,0,0,1656,1.0,0.0,1,0,3,1,TA,7,Typ,2,Gd,Attchd,1960.0,Fin,2.0,528.0,TA,TA,P,210,62,0,0,0,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,1Fam,1Story,5,6,1961,1961,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,CBlock,TA,TA,No,Rec,468.0,LwQ,144.0,270.0,882.0,...,Y,SBrkr,896,0,0,896,0.0,0.0,1,0,2,1,TA,5,Typ,0,,Attchd,1961.0,Unf,1.0,730.0,TA,TA,Y,140,0,0,0,120,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,6,1958,1958,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,108.0,TA,TA,CBlock,TA,TA,No,ALQ,923.0,Unf,0.0,406.0,1329.0,...,Y,SBrkr,1329,0,0,1329,0.0,0.0,1,1,3,1,Gd,6,Typ,0,,Attchd,1958.0,Unf,1.0,312.0,TA,TA,Y,393,36,0,0,0,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,7,5,1968,1968,Hip,CompShg,BrkFace,BrkFace,,0.0,Gd,TA,CBlock,TA,TA,No,ALQ,1065.0,Unf,0.0,1045.0,2110.0,...,Y,SBrkr,2110,0,0,2110,1.0,0.0,2,1,3,1,Ex,8,Typ,2,TA,Attchd,1968.0,Fin,2.0,522.0,TA,TA,Y,0,0,0,0,0,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,5,5,1997,1998,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,GLQ,791.0,Unf,0.0,137.0,928.0,...,Y,SBrkr,928,701,0,1629,0.0,0.0,2,1,3,1,TA,6,Typ,1,TA,Attchd,1997.0,Fin,2.0,482.0,TA,TA,Y,212,34,0,0,0,0,,MnPrv,,0,3,2010,WD,Normal,189900
5,6,527105030,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,6,6,1998,1998,Gable,CompShg,VinylSd,VinylSd,BrkFace,20.0,TA,TA,PConc,TA,TA,No,GLQ,602.0,Unf,0.0,324.0,926.0,...,Y,SBrkr,926,678,0,1604,0.0,0.0,2,1,3,1,Gd,7,Typ,1,Gd,Attchd,1998.0,Fin,2.0,470.0,TA,TA,Y,360,36,0,0,0,0,,,,0,6,2010,WD,Normal,195500
6,7,527127150,120,RL,41.0,4920,Pave,,Reg,Lvl,AllPub,Inside,Gtl,StoneBr,Norm,Norm,TwnhsE,1Story,8,5,2001,2001,Gable,CompShg,CemntBd,CmentBd,,0.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,616.0,Unf,0.0,722.0,1338.0,...,Y,SBrkr,1338,0,0,1338,1.0,0.0,2,0,2,1,Gd,6,Typ,0,,Attchd,2001.0,Fin,2.0,582.0,TA,TA,Y,0,0,170,0,0,0,,,,0,4,2010,WD,Normal,213500
7,8,527145080,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,Inside,Gtl,StoneBr,Norm,Norm,TwnhsE,1Story,8,5,1992,1992,Gable,CompShg,HdBoard,HdBoard,,0.0,Gd,TA,PConc,Gd,TA,No,ALQ,263.0,Unf,0.0,1017.0,1280.0,...,Y,SBrkr,1280,0,0,1280,0.0,0.0,2,0,2,1,Gd,5,Typ,0,,Attchd,1992.0,RFn,2.0,506.0,TA,TA,Y,0,82,0,0,144,0,,,,0,1,2010,WD,Normal,191500
8,9,527146030,120,RL,39.0,5389,Pave,,IR1,Lvl,AllPub,Inside,Gtl,StoneBr,Norm,Norm,TwnhsE,1Story,8,5,1995,1996,Gable,CompShg,CemntBd,CmentBd,,0.0,Gd,TA,PConc,Gd,TA,No,GLQ,1180.0,Unf,0.0,415.0,1595.0,...,Y,SBrkr,1616,0,0,1616,1.0,0.0,2,0,2,1,Gd,5,Typ,1,TA,Attchd,1995.0,RFn,2.0,608.0,TA,TA,Y,237,152,0,0,0,0,,,,0,3,2010,WD,Normal,236500
9,10,527162130,60,RL,60.0,7500,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,7,5,1999,1999,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,TA,TA,No,Unf,0.0,Unf,0.0,994.0,994.0,...,Y,SBrkr,1028,776,0,1804,0.0,0.0,2,1,3,1,Gd,7,Typ,1,TA,Attchd,1999.0,Fin,2.0,442.0,TA,TA,Y,140,60,0,0,0,0,,,,0,6,2010,WD,Normal,189000


The dataframe has 2925 samples and 82 columns (including the sale price, which is what we will be trying to predict). The 82 columns are described at [the Ames Housing Data Documentation](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt).

As you can see, some of the columns (like `Year Built`) are numeric, whereas others (like `House Style`) are strings. You can't feed strings directly into a neural network. The `extract_data_labels` function below drops all string columns (see [the docs for DataFrame.drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)). However, at some point, you'll want to add code to encode these string values as numbers so that the neural network can use that data.
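
Some of the string columns are ordered ratings: in the sample rows above, `Kitchen Qual` takes values such as `TA`, `Gd`, and `Ex`. One option for such columns is to map each rating to a number. The sketch below is only an illustration: the 1 to 5 scale in `QUALITY_SCALE`, the choice to map missing values to 0, and the helper name `encode_quality` are all assumptions, not something the dataset prescribes.

```python
import pandas as pd

# Assumed numeric scale for the quality ratings (poor up to excellent).
QUALITY_SCALE = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

def encode_quality(df, columns):
    """Return a copy of df with the given rating columns mapped to numbers."""
    out = df.copy()
    for col in columns:
        # Values missing from the mapping (including NaN) become 0.
        out[col] = out[col].map(QUALITY_SCALE).fillna(0).astype("float64")
    return out

# Illustrative values only.
sample = pd.DataFrame({"Kitchen Qual": ["TA", "Gd", "Ex", None]})
encoded = encode_quality(sample, ["Kitchen Qual"])
print(encoded["Kitchen Qual"].tolist())
```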

## Preparing training and test datasets
Break the data apart into a training set and a test set. Remember that you should only ever use the test set **one time**, at the very end, to determine your final error. Don't change your model and then re-evaluate it on the test set!

In [0]:
mask = np.random.rand(len(df)) < 0.85

train_df = df[mask]
test_df = df[~mask]

In [0]:
def extract_data_labels(df):
  """Return (data, labels): the numeric feature columns as float64, and SalePrice."""
  labels = df['SalePrice'].values
  # Drop the target column, then keep only the non-string (numeric) columns.
  numeric_df = df.drop(columns=['SalePrice']).select_dtypes(exclude=['object'])
  data = numeric_df.values.astype('float64')
  return (data, labels)

(train_data, train_targets) = extract_data_labels(train_df)
(test_data, test_targets) = extract_data_labels(test_df)

In [33]:
train_data.shape

(2472, 38)

In [34]:
test_data.shape

(453, 38)

As you can see, we have 2472 training samples and 453 test samples. The data comprises 38 features (the original 82 columns minus the target column and minus the string columns).

In [35]:
train_targets

array([215000, 105000, 172000, ..., 132000, 170000, 188000])

## Hints
### Specifying non-default hyper-parameters for optimizers
When we've used optimizers so far, we've specified the optimizer as a string (like `'adam'`), which instantiates the optimizer with all-default values:
```
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
```


To specify the hyper-parameters of an optimizer explicitly, instantiate it with the desired hyper-parameters and then pass that instance to the compile method:
```
opt = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
model.compile(optimizer=opt, loss='mse', metrics=['mae'])
```



### Normalization
If it's good to normalize the inputs, does it also make sense to normalize the output?
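
One way to experiment with this question is to train on scaled targets and convert the predictions back to dollars before computing the MAE. A minimal sketch, assuming the targets are standardized with the mean and standard deviation of the training targets only (the helper names and sample prices are hypothetical):

```python
import numpy as np

def fit_target_scaler(train_targets):
    """Return (mean, std) computed from the training targets only."""
    targets = np.asarray(train_targets, dtype="float64")
    return targets.mean(), targets.std()

def scale_targets(targets, mean, std):
    """Standardize targets with the given statistics."""
    return (np.asarray(targets, dtype="float64") - mean) / std

def unscale_predictions(scaled_predictions, mean, std):
    """Convert model outputs back to dollars."""
    return np.asarray(scaled_predictions, dtype="float64") * std + mean

# Illustrative values only.
prices = np.array([215000, 105000, 172000, 244000])
mean, std = fit_target_scaler(prices)
scaled = scale_targets(prices, mean, std)
restored = unscale_predictions(scaled, mean, std)
```

Note that an MAE measured on scaled targets is in units of standard deviations; multiplying it by `std` gives the error in dollars.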

# Submission instructions

Due date: October 28, 2019, 1 PM.

1. Before submitting, calculate (and print) the Mean Absolute Error on the test set. This will give the best estimate of how well the model will generalize:
   `test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)`.  
   Don't do this calculation until you're ready to submit! That is, you should only ever calculate this *one time*.
2. Make sure that the output of all cells is up to date (do a _Runtime/Run all_).
3. Duplicate your notebook:
    1. Choose _File/Save a Copy in Drive…_
    2. Rename the new copy:
        1. Click on the notebook name at the top of the window.
        2. Rename to "CS152Sp19Assign3 _FirstName_ _LastInitial_" (using the correct assignment number, along with your first name, and your last initial). I need this naming so I can easily navigate through the large number of shared docs I'll have by the end of the semester.
4. Share your document with me:
    1. Click on the _Share_ button at the top-right of your notebook.
    2. Enter `rhodes@g.hmc.edu` as the email address.
    3. Click the pencil icon and select _Can edit_.
    4. Click on _Done_.
5. Don't edit the file after the submission deadline! (Feel free to make as many changes as you'd like before then, though.)
6. I'll provide inline comments when I grade the submission.
