# LAB 01:  Basic Feature Engineering in Keras 

**Learning Objectives**

* Set up the environment
* Create the project datasets
* Create an input pipeline using tf.data
* Build, train, and evaluate a model using Keras
* Feature engineer categorical and numeric features 
* Load and preprocess test data
* Create and test a prediction model


## Introduction 
In this lab, we use feature engineering to improve the prediction of housing prices. We will build a housing price prediction model in Keras, then add engineered features to improve its predictions.

Each learning objective corresponds to a __#TODO__ in this student lab notebook. Try to complete this notebook first, and then review the [solution notebook](../solution/feateng-solution_bqml.ipynb).

### Set up environment variables and load necessary libraries

In [0]:
%%bash
export PROJECT=$(gcloud config list project --format "value(core.project)")
echo "Your current GCP Project Name is: "$PROJECT

In [0]:
import os

PROJECT = "cloud-training-demos" # REPLACE WITH YOUR PROJECT NAME
REGION = "us-west1" # REPLACE WITH YOUR BUCKET REGION e.g. us-central1 (a region, not a zone such as us-west1-b)

# Do not change these
os.environ["PROJECT"] = PROJECT
os.environ["REGION"] = REGION
os.environ["BUCKET"] = PROJECT # DEFAULT BUCKET WILL BE PROJECT ID

if PROJECT == "your-gcp-project-here":
  print("Don't forget to update your PROJECT name! Currently:", PROJECT)

## Import TensorFlow and other libraries

In [0]:
!pip install scikit-learn

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import pandas as pd

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

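# Import from tensorflow.keras so these match the tf.keras model built below.
from tensorflow.keras.utils import plot_model
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping, ModelCheckpoint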

# PART 1:  Set-Up!



#### **Exercise**:   **REVIEW and THEN RUN** all cells in Part 1. This helps us get the data, validate data pre-processing and ensure that the data is ready for the neural network.

## The Source Dataset

The California housing dataset contains 20,640 observations on 10 variables.  The data has been pre-processed so that there are no missing values.



#### **Exercise**:   **RUN** the cell to create a Pandas dataframe

In [0]:
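# NOTE to lab reviewers: the lab requires students to clone the training-data-analyst repo in the Qwiklab portion.

# on_bad_lines='skip' (pandas 1.3+) replaces error_bad_lines=False, which was removed in pandas 2.0.
dataframe = pd.read_csv('housing_pre-proc.csv', on_bad_lines='skip')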
dataframe.head()

In [0]:
#See datatype for each feature

dataframe.info()

In [0]:
#Check for null values

dataframe.isnull().sum()

####  Split the dataset for ML

The dataset we loaded was a single CSV file. We will split this into train, validation, and test sets.
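As a rough guide to the sizes to expect, the sketch below works through two successive 80/20 splits. It assumes the 20,640-row count quoted earlier and that each hold-out size is rounded up; `split_sizes` is a hypothetical helper, and the counts printed by the lab's own cell are the ones to trust.

```python
import math


def split_sizes(n_rows, test_fraction=0.2, val_fraction=0.2):
    """Approximate row counts after two successive hold-out splits."""
    n_test = math.ceil(n_rows * test_fraction)     # held out first
    n_remaining = n_rows - n_test
    n_val = math.ceil(n_remaining * val_fraction)  # held out from the remainder
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


# 20,640 rows is an assumption taken from the dataset description above.
n_train, n_val, n_test = split_sizes(20640)
print(n_train, 'train examples')
print(n_val, 'validation examples')
print(n_test, 'test examples')
```

Under these assumptions, roughly 64% of the rows end up in training, 16% in validation, and 20% in test.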


In [0]:
train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

In [0]:
#Print out the output.  

print("\nTrain:\n")
print(train.head())
print(train.shape)

print("\nValidation:\n")
print(val.head())
print(val.shape)

print("\nTest:\n")
print(test.head())
print(test.shape)

Now, we need to output the split files.  We will specifically need the test.csv later for testing.  You should see the files appear in the home directory.


In [0]:
train.to_csv('train.csv', encoding='utf-8', index=False)
train.head()

In [0]:
val.to_csv('val.csv', encoding='utf-8', index=False)
val.head()

In [0]:
test.to_csv('test.csv', encoding='utf-8', index=False)
test.head()

# Part 2:  Your feature engineering lab starts here!

## Objective:   Build an input pipeline

#### **Exercise**:   Create an input pipeline using tf.data

Next, we will wrap the dataframes with [tf.data](https://www.tensorflow.org/guide/datasets). This will enable us  to use feature columns as a bridge to map from the columns in the Pandas dataframe to features used to train the model.  

In [0]:
#TODO: This function is missing two lines.  Correct and run the cell.

def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)


In [0]:
# SOLUTION
#A utility method to create a tf.data dataset from a Pandas Dataframe

def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('median_house_value')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

#### **Exercise**:   **RUN** the cell to initialize the training datasets.

In [0]:
batch_size = 32 
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

## Understand the input pipeline

Now that we have created the input pipeline, let's call it to see the format of the data it returns. We have used a small batch size to keep the output readable.

#### **Exercise**:   **RUN** the cell to see a sample of the features from the batch.

In [0]:
for feature_batch, label_batch in train_ds.take(1):
  print('Every feature:', list(feature_batch.keys()))
  print('A batch of households:', feature_batch['households'])
  print('A batch of ocean_proximity:', feature_batch['ocean_proximity'])
  print('A batch of targets:', label_batch )

We can see that the dataset returns a dictionary of column names (from the dataframe) that map to column values from rows in the dataframe.
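To see where that dictionary comes from without running TensorFlow, here is a minimal pandas-only sketch of the first two steps of `df_to_dataset`. The two-row frame `toy` and its values are made up for illustration; only the column names come from the lab.

```python
import pandas as pd

# Hypothetical two-row frame with a subset of the housing columns.
toy = pd.DataFrame({
    "households": [126.0, 1138.0],
    "ocean_proximity": ["NEAR BAY", "INLAND"],
    "median_house_value": [452600.0, 358500.0],
})

features = toy.copy()
labels = features.pop("median_house_value")  # pop removes the label column in place

# dict() on a DataFrame maps each column name to its pandas Series of values.
feature_dict = dict(features)
print(sorted(feature_dict.keys()))
print(labels.tolist())
```

Because the label column is popped before the dictionary is built, it never appears among the feature keys.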

### Numeric columns
The output of a feature column becomes the input to the model. A numeric column is the simplest type of column. It is used to represent real-valued features. When using this column, your model will receive the column value from the dataframe unchanged, unless you supply a normalizer function.

In the California housing prices dataset, most columns from the dataframe are numeric.

#### **Exercise**:   Create a variable called **num_c** to hold only the numerical feature columns.

In [0]:
#TODO YOUR CODE HERE

In [0]:
#SOLUTION

num_c = ['longitude',  'latitude',
                'housing_median_age', 'total_rooms', 'total_bedrooms',
                 'population', 'households', 'median_income']


### Scaler function
Numeric variables should be scaled before they are fed into the neural network. Here we use min-max scaling. The function `get_scal` takes the name of one numeric feature and returns a function, `minmax`, which is passed to `tf.feature_column.numeric_column()` as the `normalizer_fn` argument. `minmax` takes a value of that feature and returns it rescaled using the feature's minimum and maximum in the training set.
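The arithmetic behind min-max scaling can be checked on a few numbers before wiring it into a feature column. This is a NumPy-only sketch, not the lab's code: `make_minmax` and the sample ages are hypothetical.

```python
import numpy as np


def make_minmax(train_values):
    """Return a scaler that maps the training range onto [0, 1]."""
    lo = float(np.min(train_values))
    hi = float(np.max(train_values))

    def minmax(x):
        return (np.asarray(x, dtype=float) - lo) / (hi - lo)

    return minmax


# Hypothetical training values for one numeric feature.
train_ages = np.array([2.0, 12.0, 27.0, 52.0])
scale_age = make_minmax(train_ages)

scaled = scale_age([2.0, 27.0, 52.0])
print(scaled)            # training minimum maps to 0, maximum to 1
print(scale_age(77.0))   # values outside the training range fall outside [0, 1]
```

Note that the minimum and maximum come from the training data only, so validation or test values beyond that range are not clipped.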

#### **Exercise**:   **RUN** the next two cells to scale the numeric features.

In [0]:
# Scaler: build a min-max normalizer for one numeric feature, using training-set statistics.
def get_scal(feature):
  def minmax(x):
    mini = train[feature].min()
    maxi = train[feature].max()
    return (x - mini)/(maxi-mini)
  return minmax

In [0]:
feature_columns = []
for header in num_c:
  scal_input_fn = get_scal(header)
  feature_columns.append(feature_column.numeric_column(header, normalizer_fn=scal_input_fn))


#### **Exercise**:   **RUN** the cell to see the total number of feature columns.  Compare this number to the number of numeric features you input earlier.

In [0]:
print('Total number of feature columns: ',len(feature_columns))

## Objective:  Build, train, and evaluate a model using Keras

#### **Exercise**:   Correct the cell below that creates, compiles, and fits a Keras model.

In [0]:
#TODO - CODE IS INCORRECT
#Model create
tf.keras.layers.DenseFeatures(feature_columns) = feature_layer 

tf.keras.Sequential  = model([
  feature_layer,
  layers.Dense(12,  input_dim=8, activation='relu'),
  layers.Dense(8, activation='relu'),
  layers.Dense(1, activation='linear')
])

### Model compile
model.fit(optimizer='adam',
              loss='mse',
              metrics=['mse']) 

### Model Fit
history = model.compile(train_ds,
          validation_data=val_ds,
          epochs=32) 

In [0]:
#SOLUTION
#Model create
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(12,  input_dim=8, activation='relu'),
  layers.Dense(8, activation='relu'),
  layers.Dense(1, activation='linear',  name='median_house_value')
])

### Model compile
model.compile(optimizer='adam',
              loss='mse',
              metrics=['mse']) 

### Model Fit
history = model.fit(train_ds,
          validation_data=val_ds,
          epochs=32) 

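#### **Exercise**:   **RUN** the cell to show the loss and mean squared error on the test set.

In [0]:
# The model is compiled with loss='mse' and metrics=['mse'], so both values are mean squared error, not accuracy.
loss, mse = model.evaluate(test_ds)
print("Mean squared error", mse)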


### Visualize the model loss curve

Next, we will use matplotlib to draw the model's loss curves for training and validation.  The line plots show the mean squared error over the training epochs for both the training (blue) and validation (orange) sets.

#### **Exercise**:   **RUN** the cell to show the model's loss curves.

In [0]:
# plot
import matplotlib.pyplot as plt
nrows = 1
ncols = 2
fig = plt.figure(figsize=(10, 5))

for idx, key in enumerate(['loss', 'mse']):  
    ax = fig.add_subplot(nrows, ncols, idx+1)
    plt.plot(history.history[key])
    plt.plot(history.history['val_{}'.format(key)])
    plt.title('model {}'.format(key))
    plt.ylabel(key)
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left');

## Objective:  Load and preprocess test data

#### **Exercise**:   In the next two cells, read in the test.csv file and validate that there are no null values.

In [0]:
#TODO  YOUR CODE HERE

In [0]:
#SOLUTION

test_data = pd.read_csv('test.csv')
test_data.head()

In [0]:
#TODO YOUR CODE HERE

In [0]:
#SOLUTION
#No null values.
test_data.isnull().sum()

## Input function for test data

#### **Exercise**:   **RUN** the cells to create the input function for the test data and to initialize the test_predict variable.

In [0]:
def test_input_fn(features, batch_size=256):
    """An input function for prediction."""
    # Convert the inputs to a Dataset without labels.
    return tf.data.Dataset.from_tensor_slices(dict(features)).batch(batch_size)

In [0]:
test_predict = test_input_fn(dict(test_data))

## Prediction:  Linear Regression

To predict with Keras, you simply call [model.predict()](https://keras.io/models/model/#predict) and pass in the housing features you want to predict the median_house_value for. Note:  We are running the predictions locally.

#### **Exercise**:   **RUN** the cell to create the median house value prediction on the test data.


In [0]:
predicted_median_house_value=model.predict(test_predict)

#### **Exercise**:  Predict the median house value for a single row of the test data

In [0]:
#TODO YOUR CODE HERE.  HINT:  Copy the first line from the test.csv you read in earlier.

In [0]:
#Prediction model:  Pass in the features from one row of the test data.

#This example shows median house value of $117,800 for INLAND property.

#Copy of first line from my test.csv:  Note, do not include the median house value ($117,800), 
#that is what we are trying to predict.

#-121.86	39.78	12.0	7653.0	1578.0	3628.0	1494.0	3.0905	117800.0	INLAND

model.predict({
    'longitude': tf.convert_to_tensor([-121.86]),
    'latitude': tf.convert_to_tensor([39.78]),
    'housing_median_age': tf.convert_to_tensor([12.0]), 		
    'total_rooms': tf.convert_to_tensor([7653.0]),
    'total_bedrooms': tf.convert_to_tensor([1578.0]),  
    'population': tf.convert_to_tensor([3628.0]),
    'households': tf.convert_to_tensor([1494.0]),	
    'median_income': tf.convert_to_tensor([3.0905]),
    'ocean_proximity': tf.convert_to_tensor(['INLAND'])
    
}, steps=1)

In [0]:
#SOLUTION
#Prediction model:  Pass in the features from one row of the test data.

model.predict({
    'longitude': tf.convert_to_tensor([-122.43]),
    'latitude': tf.convert_to_tensor([37.63]),
    'housing_median_age': tf.convert_to_tensor([34.0]), 		
    'total_rooms': tf.convert_to_tensor([4135.0]),
    'total_bedrooms': tf.convert_to_tensor([687.0]),  
    'population': tf.convert_to_tensor([2154.0]),
    'households': tf.convert_to_tensor([742.0]),
    'median_income': tf.convert_to_tensor([4.9732]),
    'ocean_proximity': tf.convert_to_tensor(['NEAR OCEAN'])
    
}, steps=1)

#### **Exercise**:  Analysis

The array returns a predicted value.  What does this number mean?  Let's compare this value to the test set.

Go to the test.csv you read in a few cells up.  Locate the first line and find the median_house_value, which in the solution's run is 249,000 dollars for a property near the ocean (the split is random, so your first row may differ). What value did your model predict for the median_house_value? Was it a solid model performance? Let's see if we can improve this a bit with feature engineering!


## Feature Engineering



#### **Exercise**:   Create a cell that indicates which features will be used in the model.

Note:  Be sure to bucketize 'housing_median_age' and ensure that 'ocean_proximity' is one-hot encoded.  And, don't forget your numeric values!


In [0]:
#TODO - YOUR CODE HERE

In [0]:
#SOLUTION
num_c = ['longitude',  'latitude',
                'housing_median_age', 'total_rooms', 'total_bedrooms', 
                 'households','population', 'median_income']

bucket_c = ['housing_median_age']

#categorical features
cat_i_c = ['ocean_proximity'] #indicator columns


#### **Exercise**:   **RUN** the next two cells to scale the features.



In [0]:
# Scaler: build a min-max normalizer for one numeric feature, using training-set statistics.

def get_scal(feature):
  def minmax(x):
    mini = train[feature].min()
    maxi = train[feature].max()
    return (x - mini)/(maxi-mini)
  return minmax

In [0]:
#All numeric features -scaling

feature_columns = []
for header in num_c:
  scal_input_fn = get_scal(header)
  feature_columns.append(feature_column.numeric_column(header, normalizer_fn=scal_input_fn))


### Categorical Feature
In this dataset, 'ocean_proximity' is represented as a string.  We cannot feed strings directly to a model. Instead, we must first map them to numeric values. The categorical vocabulary columns provide a way to represent strings as a one-hot vector.
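What a one-hot vector looks like can be shown in plain Python. The sketch below is an illustration of the idea rather than TensorFlow's implementation; `one_hot` is a hypothetical helper, and the vocabulary is written out by hand here, whereas in the lab it should be read from the dataframe.

```python
def one_hot(value, vocabulary):
    """Return a list with 1.0 at the position of `value` in `vocabulary`."""
    return [1.0 if value == term else 0.0 for term in vocabulary]


# Illustrative vocabulary; in the lab it is derived from the data.
vocabulary = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]

encoded = one_hot("INLAND", vocabulary)
unknown = one_hot("LAKESIDE", vocabulary)  # not in the vocabulary
print(encoded)
print(unknown)
```

In this sketch, a string that is missing from the vocabulary produces a vector of all zeros, so the choice of vocabulary determines which categories the model can tell apart.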

#### **Exercise**:   Create a categorical feature using 'ocean_proximity'.



In [0]:
#TODO - YOUR CODE HERE

In [0]:
#SOLUTION

for feature_name in cat_i_c:
  vocabulary = dataframe[feature_name].unique()
  cat_c = tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary)
  one_hot = feature_column.indicator_column(cat_c)
  feature_columns.append(one_hot)

### Bucketized Feature

Often, you don't want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. Consider our raw data that represents a home's age. Instead of representing the house age as a numeric column, we could split the home age into several buckets using a [bucketized column](https://www.tensorflow.org/api_docs/python/tf/feature_column/bucketized_column). The one-hot values produced by the column describe which age range each row matches.
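The mapping from an age to its bucket can be sketched with the standard library alone. This assumes half-open ranges that include their lower boundary, which is how the bucketized column documentation describes them; `bucket_index` and `bucket_one_hot` are hypothetical helpers, and the boundaries are one possible choice.

```python
from bisect import bisect_right

# One possible set of age boundaries (an assumption for this sketch).
boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]


def bucket_index(age, boundaries):
    """Index of the half-open range [lower, upper) that contains `age`."""
    return bisect_right(boundaries, age)


def bucket_one_hot(age, boundaries):
    """One-hot vector over the buckets; n boundaries give n + 1 buckets."""
    vector = [0.0] * (len(boundaries) + 1)
    vector[bucket_index(age, boundaries)] = 1.0
    return vector


print(bucket_index(12.0, boundaries))    # below the first boundary
print(bucket_index(34.0, boundaries))    # falls in [30, 35)
print(bucket_index(35.0, boundaries))    # a boundary value starts the next bucket
print(bucket_one_hot(34.0, boundaries))
```

Ten boundaries therefore yield eleven one-hot positions, which is worth keeping in mind when counting the model's input features.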

#### **Exercise**:   Create a Bucketized column using 'housing_median_age'

In [0]:
#TODO - YOUR CODE HERE

In [0]:
#SOLUTION

Age = feature_column.numeric_column("housing_median_age")

# bucketized cols
age_buckets = feature_column.bucketized_column(Age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
feature_columns.append(age_buckets)



### Feature Cross

Combining features into a single feature, better known as [feature crosses](https://developers.google.com/machine-learning/glossary/#feature_cross), enables a model to learn separate weights for each combination of features.
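The idea of a hashed cross can be sketched without TensorFlow: each (age bucket, ocean proximity) pair is joined into one key and hashed into a fixed number of slots. MD5 is used here only as a stand-in, since TensorFlow uses its own hashing, and `crossed_bucket` is a hypothetical helper.

```python
import hashlib


def crossed_bucket(age_bucket, ocean_proximity, hash_bucket_size=1000):
    """Map one (age bucket, category) pair to a single hashed index."""
    key = "{}_X_{}".format(age_bucket, ocean_proximity)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


a = crossed_bucket(3, "NEAR OCEAN")
b = crossed_bucket(3, "INLAND")
c = crossed_bucket(0, "INLAND")
print(a, b, c)
```

Each distinct pair gets its own slot unless two pairs happen to collide, so a larger `hash_bucket_size` lowers the chance of collisions at the cost of a wider one-hot vector.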

#### **Exercise**:   Create a Feature Cross of  'housing_median_age' and 'ocean_proximity'.

In [0]:
#TODO - YOUR CODE HERE

In [0]:
#SOLUTION

vocabulary = dataframe['ocean_proximity'].unique()
ocean_proximity = tf.feature_column.categorical_column_with_vocabulary_list('ocean_proximity', vocabulary)

crossed_feature = feature_column.crossed_column([age_buckets, ocean_proximity], hash_bucket_size=1000)
crossed_feature = feature_column.indicator_column(crossed_feature)
feature_columns.append(crossed_feature)

#### **Exercise**:   **RUN** the cell to determine the number of feature columns you now have.  Compare this number to the previous number of features.

In [0]:
print('Total number of feature columns: ',len(feature_columns))

#### **Exercise**:   **RUN** the cell to compile, create, and train the Keras model.


In [0]:
#TODO YOUR CODE HERE

In [0]:
#SOLUTION

#Model create
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(12, activation='relu'),  # input size is inferred from feature_layer, which now yields more than 8 values
  layers.Dense(8, activation='relu'),
  layers.Dense(1, activation='linear',  name='median_house_value')
])

#Model compile
model.compile(optimizer='adam',
              loss='mse',
              metrics=['mse'])  

### Model Fit
history = model.fit(train_ds,
          validation_data=val_ds,
          epochs=32)


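#### **Exercise**:   **RUN** the next two cells to show the loss and mean squared error and to plot the model's loss curves.


In [0]:
# Both values are mean squared error (loss='mse', metrics=['mse']), not accuracy.
loss, mse = model.evaluate(test_ds)
print("Mean squared error", mse)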

In [0]:
# plot
import matplotlib.pyplot as plt
nrows = 1
ncols = 2
fig = plt.figure(figsize=(10, 5))

for idx, key in enumerate(['loss', 'mse']):  
    ax = fig.add_subplot(nrows, ncols, idx+1)
    plt.plot(history.history[key])
    plt.plot(history.history['val_{}'.format(key)])
    plt.title('model {}'.format(key))
    plt.ylabel(key)
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left');

#### **Exercise**:  Create a prediction model.  Note:  You may use the same values from the previous prediction.



In [0]:
#TODO YOUR CODE HERE

In [0]:
#Prediction model:  Pass in the features from one row of the test data.

#Copy of first line from my test.csv: 
#-121.86	39.78	12.0	7653.0	1578.0	3628.0	1494.0	3.0905	117800.0	INLAND

#This example shows median house value of $117,800 for INLAND property, prediction is $183,504

model.predict({
    'longitude': tf.convert_to_tensor([-121.86]),
    'latitude': tf.convert_to_tensor([39.78]),
    'housing_median_age': tf.convert_to_tensor([12.0]), 		
    'total_rooms': tf.convert_to_tensor([7653.0]),
    'total_bedrooms': tf.convert_to_tensor([1578.0]),  
    'population': tf.convert_to_tensor([3628.0]),
    'households': tf.convert_to_tensor([1494.0]),	
    'median_income': tf.convert_to_tensor([3.0905]),
    'ocean_proximity': tf.convert_to_tensor(['INLAND'])
    
}, steps=1)

In [0]:
#SOLUTION - NEAR OCEAN:  Median_house_value is $249,000, prediction is $234,000

model.predict({
    'longitude': tf.convert_to_tensor([-122.43]),
    'latitude': tf.convert_to_tensor([37.63]),
    'housing_median_age': tf.convert_to_tensor([34.0]), 		
    'total_rooms': tf.convert_to_tensor([4135.0]),
    'total_bedrooms': tf.convert_to_tensor([687.0]),  
    'population': tf.convert_to_tensor([2154.0]),
    'households': tf.convert_to_tensor([742.0]),
    'median_income': tf.convert_to_tensor([4.9732]),
    'ocean_proximity': tf.convert_to_tensor(['NEAR OCEAN'])
    
}, steps=1)

### Analysis 

The array returns a predicted value.  Compare this value to the test set you ran earlier. Your predicted value may be a bit better.

Now that you have your "feature engineering template" set up, you can experiment by creating additional features.  For example, you can create derived features, such as households per population, and see how they impact the model.  You can also experiment with replacing the features you used to create the feature cross.
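As one starting point for that experiment, derived ratio columns can be added with pandas before the dataframe is split. This is a sketch under assumptions: `add_ratio_features` and the new column names are hypothetical, and the two sample rows reuse values from the prediction examples above.

```python
import pandas as pd


def add_ratio_features(df):
    """Return a copy of `df` with per-household ratio columns added."""
    out = df.copy()
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["population_per_household"] = out["population"] / out["households"]
    return out


# Two rows using the lab's column names.
sample = pd.DataFrame({
    "total_rooms": [7653.0, 4135.0],
    "population": [3628.0, 2154.0],
    "households": [1494.0, 742.0],
})

enriched = add_ratio_features(sample)
print(enriched.round(3))
```

Any new column would also need to be added to the list of numeric features so that it is scaled and included in the feature layer.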
 