<a href="https://colab.research.google.com/github/praveenprabharavindran/MachineLearning/blob/main/Build_Train_Deploy_Neural_Network_TensorFlow/House_Price_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import libraries
* NumPy is a powerful n-dimensional array library that makes it easy to create and manipulate arrays of data
* NumPy can convert TensorFlow's native data structures to native Python data types
* Matplotlib is a plotting library
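
As a rough illustration of the second point: calling `.numpy()` on an eager TensorFlow tensor returns a NumPy array, and NumPy can then hand the values over as plain Python objects. The sketch below shows only the NumPy half, using a hand-made array as a stand-in for a tensor's values, so the numbers are assumptions and not data from this notebook.

```python
import numpy as np

# Hand-made array standing in for the result of calling .numpy() on a tensor
arr = np.array([[1.5, 2.5], [3.5, 4.5]], dtype=np.float32)

as_list = arr.tolist()      # nested Python lists containing Python floats
first = arr[0, 0].item()    # a single Python float

print(type(as_list), type(as_list[0][0]))
print(type(first), first)
```

`tolist()` is convenient for whole arrays, while `item()` is meant for pulling out a single scalar.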

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
print(tf.__version__)

# Problem statement
 Using example data, develop a model that predicts house prices based on the size of a house

# Get Data
 In this example we will be using a truncated version of the Ames dataset that only contains information on homes sold in May 2010.

## About the Ames Dataset
 The Ames dataset is a widely available dataset that has become one of the standard datasets used when predicting home prices based on features of the home. It is based on the great work of Dean De Cock. His rationale and insight into this dataset can be found at https://jse.amstat.org/v19n3/decock.pdf.

 ## Getting the truncated dataset we use
 This dataset can be found with the exercise files for this course. The filename is AmesHousing-05-2010.csv

In [None]:
from google.colab import files

# Prompt the user to select a file to upload, and store it in a dictionary named 'uploaded'
# 'uploaded' is a dictionary with key = filename and value = contents of the file
uploaded = files.upload()

# 'iter()' creates an iterator over the keys of the dictionary 'uploaded'
# 'next()' returns the first key from that iterator, i.e. the name of the first uploaded file
# 'csv_housefile' is assigned the name of the file that was uploaded
csv_housefile = next(iter(uploaded))


print('User uploaded file: {name}, with length: {length}'.format(name=csv_housefile, length=len(uploaded[csv_housefile])))

# Load data into a pandas dataframe
* Pandas makes it easy to review and manipulate the data
* Check out the pandas website: https://pandas.pydata.org/
* Here is a [Pandas 10 minute intro](http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) that shows how pandas and its dataframes make working with tabular data in Python easy.

In [None]:
import pandas as pd
df_housing = pd.read_csv(csv_housefile)


# Review House Price Data

In [None]:
pd.set_option('display.max_columns',None)
df_housing.head(5)

## Review Data using data_table formatter
* The data_table formatter displays the table in an interactive mode, making it easier to explore and analyze dataframes
* However, this approach supports only 20 columns; beyond that it falls back to the pandas display.

In [None]:
# from google.colab import data_table
# data_table.enable_dataframe_formatter()
# 
# columns = df_housing.columns
# display(df_housing[columns[:20]])


# Preprocess data
* Among the given parameters (columns), the size of the house has the greatest impact on the cost. 
* The total square footage of the house is not immediately accessible and requires aggregating several fields together.  

 **Note**: There are some aggregate fields in the data that could be used to simplify the calculation:
  >Total Bsmt SF = BsmtFin SF 1 + BsmtFin SF 2 + Bsmt Unf SF  
  >Gr Liv Area = 1st Flr SF + 2nd Flr SF

* In summary: 
  - total space in the house  = sum of Basement and Upper floors  
   = Total Bsmt SF + Gr Liv Area
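
One way to sanity-check the relationships in the note above before relying on the aggregate columns is sketched here. The column names come from the dataset, but the two rows are made-up illustrative values, not records from the Ames file, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the Ames dataset
sample = pd.DataFrame({
    "BsmtFin SF 1": [500, 300],
    "BsmtFin SF 2": [100, 0],
    "Bsmt Unf SF": [400, 200],
    "Total Bsmt SF": [1000, 500],
    "1st Flr SF": [1200, 900],
    "2nd Flr SF": [800, 0],
    "Gr Liv Area": [2000, 900],
})

# Rebuild each aggregate from its components
bsmt_parts = sample["BsmtFin SF 1"] + sample["BsmtFin SF 2"] + sample["Bsmt Unf SF"]
floor_parts = sample["1st Flr SF"] + sample["2nd Flr SF"]

# True only if every row of the aggregate equals the sum of its components
bsmt_matches = bool((bsmt_parts == sample["Total Bsmt SF"]).all())
floors_match = bool((floor_parts == sample["Gr Liv Area"]).all())

# Total living space as defined in the summary above
total_sf = sample["Total Bsmt SF"] + sample["Gr Liv Area"]
print(bsmt_matches, floors_match, total_sf.tolist())
```

Running the same comparison on `df_housing` would show whether the aggregates hold for every row of the real file; rows where they do not are worth inspecting before training.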

In [None]:
# check if any of the field values are null
df_housing[['Total Bsmt SF', 'Gr Liv Area']].isnull().values.any()

In [None]:
# Add new feature column Total SF = Total Bsmt SF + Gr Liv Area
df_housing['Total SF'] = df_housing['Total Bsmt SF'] + df_housing['Gr Liv Area']

In [None]:
# Review the result of the above operation
print(df_housing[['Total SF', 'Total Bsmt SF', 'Gr Liv Area']].head(5))

# Visualize the processed data

In [None]:
# This function visualizes our data and optionally a learned line
def visualize_data(x_vals, y_vals,
                   addn_x_vals=None, addn_y_vals=None, add_addn_reg_line=False):
  
  f, ax = plt.subplots(figsize=(8,8))
  plt.plot(x_vals, y_vals, 'ro')   # red dot for each data point
  # Optionally plot another set of data points in a different color and symbol
  if (addn_x_vals is not None):
    plt.plot(addn_x_vals, addn_y_vals, 'g^') # green triangles for additional data points
    # Optionally, plot a regression line.
    if (add_addn_reg_line):
      x_min_index = addn_x_vals.argmin()
      x_max_index = addn_x_vals.argmax()
      print(x_min_index,[addn_x_vals[x_min_index],addn_y_vals[x_min_index]] ) 
      print(x_max_index,[addn_x_vals[x_max_index],addn_y_vals[x_max_index]] ) 
      # plt.plot expects the x coordinates first, then the y coordinates
      plt.plot([addn_x_vals[x_min_index],addn_x_vals[x_max_index]], 
               [addn_y_vals[x_min_index],addn_y_vals[x_max_index]], 
               'b-')  # draw a blue regression line
    
  plt.tick_params(axis='both', which='major', labelsize=14)
  
  plt.show()  # display the figure with the data points and the optional line

## Visualizing Total SF and Price
* Using the visualize_data function we can see the relationship between Total Square Feet (Total SF) and Price.
* As can be seen from the plot, there is a roughly linear relationship between the price and the size of the house
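
A plot gives a visual impression; a correlation coefficient and a least-squares line put a number on it. The sketch below uses synthetic data (an assumed slope of 100 per square foot plus noise), not the Ames file, purely to show the calls involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: price grows linearly with size, plus random noise
size_sf = rng.uniform(800, 4200, size=200)
price = 100.0 * size_sf + rng.normal(0, 20000, size=200)

# Pearson correlation: values near +1 indicate a strong positive linear trend
r = np.corrcoef(size_sf, price)[0, 1]

# Least-squares straight line; for degree 1, polyfit returns [slope, intercept]
slope, intercept = np.polyfit(size_sf, price, 1)
print(f"r = {r:.3f}, slope = {slope:.1f}, intercept = {intercept:.0f}")
```

The same two calls could be applied to `df_housing['Total SF']` and `df_housing['SalePrice']` to check how strong the linear trend is in the real data.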

In [None]:

# Plot Total SF vs. Price
visualize_data(df_housing['Total SF'], df_housing['SalePrice'],add_addn_reg_line=True)

## Prepare data
* If values are on very different scales it will be difficult for the model to determine the relationships between features. 
* With the data we have, Square Footage (SF) ranges from 800 to 4,200, and Prices range from 80,000 to 400,000. 
* This means there is a nearly **`100 times`** difference in scale between the two quantities. 
* Normalization is a process that brings both quantities onto the same scale while preserving the relative differences between prices and sizes of homes. 
* This will help our model learn the relationship between price and size.
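
As a minimal sketch of what min-max normalization does, the formula is `(x - min) / (max - min)`, which is also what scikit-learn's `MinMaxScaler` applies with its default 0 to 1 range. The helper names `min_max_scale` and `min_max_unscale` are hypothetical, and the three sample values are simply the approximate range endpoints quoted above plus their midpoints.

```python
import numpy as np

def min_max_scale(values):
    """Scale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    values = np.asarray(values, dtype=np.float64)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo), lo, hi

def min_max_unscale(scaled, lo, hi):
    """Invert min_max_scale to get back to the original units."""
    return np.asarray(scaled) * (hi - lo) + lo

sizes = np.array([800.0, 2500.0, 4200.0])
prices = np.array([80_000.0, 240_000.0, 400_000.0])

sizes_scaled, s_lo, s_hi = min_max_scale(sizes)
prices_scaled, p_lo, p_hi = min_max_scale(prices)

print(sizes_scaled)    # both arrays now run from 0.0 to 1.0
print(prices_scaled)
print(min_max_unscale(prices_scaled, p_lo, p_hi))  # back in the original price units
```

Keeping the minimum and maximum around matters: predictions made on the scaled data have to be mapped back with the same values to be read as prices.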

In [None]:
# Scale data so SF and Sale Price are on similar scales with values 
#  from 0.0 to 1.0

from sklearn.preprocessing import MinMaxScaler

sf_scaler = MinMaxScaler()
sf_scaled = sf_scaler.fit_transform(df_housing['Total SF'].values.reshape(-1,1).astype(np.float64))
    
price_scaler = MinMaxScaler()
price_scaled = price_scaler.fit_transform(df_housing['SalePrice'].values.reshape(-1,1).astype(np.float64))

## Create Model

In [None]:
# Create model using the TensorFlow Keras library
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(units=1, activation='linear', input_shape=(1,),
                                kernel_initializer='random_uniform',
                                bias_initializer='zeros'))


## Compile the model
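
The compile step pairs a mean squared error loss with stochastic gradient descent. To make that combination concrete, here is a NumPy-only sketch of mini-batch SGD on a single weight and bias, which is the shape of the one-unit linear model created above. The data is synthetic, and the learning rate and epoch count are assumed values chosen for the sketch, not the settings Keras uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data already on a 0-1 scale: y = 0.8 * x + 0.1 plus small noise
x = rng.uniform(0.0, 1.0, size=200)
y = 0.8 * x + 0.1 + rng.normal(0.0, 0.02, size=200)

w, b = 0.0, 0.0          # one weight and one bias, like a single linear unit
learning_rate = 0.1      # assumed value for this sketch
batch_size = 10

def mse(w, b):
    """Mean squared error of the line w * x + b over the whole dataset."""
    return float(np.mean((w * x + b - y) ** 2))

initial_loss = mse(w, b)
for epoch in range(200):
    order = rng.permutation(len(x))              # reshuffle every epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        error = w * x[idx] + b - y[idx]
        # Gradients of the batch mean squared error with respect to w and b
        grad_w = 2.0 * np.mean(error * x[idx])
        grad_b = 2.0 * np.mean(error)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

final_loss = mse(w, b)
print(f"w = {w:.3f}, b = {b:.3f}, loss {initial_loss:.4f} -> {final_loss:.6f}")
```

Each update nudges the weight and bias against the gradient of the batch error, which is conceptually what the optimizer does during `model.fit`.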

In [None]:
# Compile model
optimizer = "sgd"
model.compile(loss='mean_squared_error', optimizer=optimizer )

## Train the Model
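
Before training, the data is split so that part of it is held back for testing. The following is a hand-rolled sketch of a 70/30 shuffle split in NumPy, meant only to show the idea; `shuffle_split` is a hypothetical helper and will not select the same rows as scikit-learn's `train_test_split`, even with the same seed.

```python
import numpy as np

def shuffle_split(features, targets, test_size=0.3, seed=42):
    """Shuffle row indices once, then cut them into test and train parts."""
    features = np.asarray(features)
    targets = np.asarray(targets)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(round(len(features) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (features[train_idx], features[test_idx],
            targets[train_idx], targets[test_idx])

# Toy data: ten rows with a known relationship y = 2 * x
x = np.arange(10, dtype=np.float64).reshape(-1, 1)
y = 2.0 * x

x_train, x_test, y_train, y_test = shuffle_split(x, y)
print(len(x_train), len(x_test))
```

Using the same index arrays for features and targets keeps each house's size paired with its own price after shuffling.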

In [None]:
# Split the dataset into training dataset - 70%, Testing dataset - 30%
# we do this using the sklearn train_test_split method
from sklearn.model_selection import train_test_split

sf_train_scaled, sf_test_scaled, price_train_scaled, price_test_scaled = train_test_split(sf_scaled, 
                                                    price_scaled, 
                                                    test_size=0.3, random_state=42)
     

In [None]:


# Train model using data
initial_epochs = 8
batch_size = 10
train_hist = model.fit(sf_train_scaled, price_train_scaled, 
                       epochs=initial_epochs, batch_size=batch_size, verbose=1)

# Is 8 epochs enough? Maybe, maybe not: check whether the values in train_hist.history['loss'] are still falling
     