## Student Name: [Enter your name here]

In [None]:
#import any required libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
import keras
from keras.models import Sequential
from keras.layers import Dense

### Step 1 – Data Acquisition
Load the training data 'house_prices_train.csv' into a dataframe. Explore the data to get a better understanding of its structure and any data preparation steps that you need to perform.

In [None]:
#Load the data and view the dimensions

url      = '' #TODO: provide the url for the training data
data     = pd.read_csv(url)
data_dim = data.shape

print('There are {} rows and {} columns.'.format(data_dim[0], data_dim[1]))

Let's view samples of the data.

In [None]:
#view a few observations
data.head()

#### Use your intuition!
At first glance, is there any field that, without a doubt, will not contribute to the predictions?
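
One hint that a field carries no predictive information is that its value is distinct for every row, as a record identifier would be. The sketch below uses a made-up dataframe; the column names (`RecordId`, `LotArea`, `SalePrice`) are assumptions for illustration and may not match the assignment data. Uniqueness alone is only a hint, since a continuous field can also happen to be unique per row.

```python
import pandas as pd

# Toy dataframe; the column names are hypothetical stand-ins for the real data.
toy = pd.DataFrame({
    'RecordId': [1, 2, 3, 4],
    'LotArea': [8450, 9600, 9600, 11250],
    'SalePrice': [208500, 181500, 181500, 250000],
})

# Columns whose values are all distinct are candidates for row identifiers.
unique_per_row = [col for col in toy.columns if toy[col].nunique() == len(toy)]
print(unique_per_row)

# Drop the identifier-like column; the remaining columns are kept unchanged.
toy_cleaned = toy.drop(columns=['RecordId'])
print(toy_cleaned.columns.tolist())
```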

In [None]:
#TODO: remove/exclude the unnecessary field(s) that will not contribute towards the prediction


### Step 2 – Data Exploration
- Gather summary/descriptive statistics and inspect **all the fields**. This can help you to identify outliers and detect any inconsistencies.
- View the frequency of missing values.
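
As a rough illustration of both bullets, the sketch below runs `describe` and a missing-value count on a small made-up dataframe. The column names and values are assumptions chosen for the example, not taken from the assignment data.

```python
import numpy as np
import pandas as pd

# Toy dataframe with one numeric and one categorical field containing gaps.
toy = pd.DataFrame({
    'LotArea': [8450.0, 9600.0, np.nan, 11250.0],
    'Alley': ['Grvl', None, None, 'Pave'],
    'SalePrice': [208500, 181500, 223500, 250000],
})

# include='all' reports categorical fields as well as numeric ones.
summary = toy.describe(include='all')
print(summary)

# Number of missing values per column, largest first.
missing_counts = toy.isna().sum().sort_values(ascending=False)
print(missing_counts)
```

The `count` row of the summary excludes missing values, so comparing it with the number of rows is another way to spot incomplete fields.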

In [None]:
#TODO: gather descriptive statistics to view the range of values in each field. 



In [None]:
#TODO: show the frequency of missing values


State your observations about the summary statistics and missing values **(in this cell)**:
- 
- 
- 
- 
- 

Note: recall that not all missing values need to be deleted; some of them can be imputed.
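
To make the trade-off concrete, the sketch below compares deleting incomplete rows with imputing them on a made-up dataframe. The column names and the `'NotApp'` placeholder are assumptions for illustration; the right strategy for each real field depends on what its missing values mean.

```python
import numpy as np
import pandas as pd

# Toy dataframe: one numeric and one categorical field, both with gaps.
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 80.0, 70.0],
    'Alley': ['Grvl', np.nan, 'Pave', np.nan],
})

# Option 1: delete every row that has any missing value.
dropped = toy.dropna()

# Option 2: keep all rows and fill the gaps instead.
imputed = toy.copy()
imputed['LotFrontage'] = imputed['LotFrontage'].fillna(imputed['LotFrontage'].median())
imputed['Alley'] = imputed['Alley'].fillna('NotApp')

print(len(dropped), len(imputed))
```

In this toy case deletion discards half of the rows, while imputation keeps all of them.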

#### The continuous and categorical independent variables
List the continuous and categorical data and state any discrepancy between the number of expected records in the dataset and the `count` that is reported above. 

For the fields that are discussed, view `data_description.txt`, which explains the range of values for each field. What does this tell you about these 'missing' values? How do you recommend addressing them? **(You do not need to demonstrate your recommendations)**


#### The dependent variable
Are there any discrepancies with the dependent variable? Plot a histogram showing its distribution. Is the distribution skewed?
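
Alongside the histogram, a skewness coefficient gives a quick numeric check. The sketch below uses a handful of made-up prices (not the assignment data) and compares the skewness before and after a `log1p` transform, which is one common response to a right-skewed target and is shown here only as an assumption, not as a required step.

```python
import numpy as np
import pandas as pd

# Made-up prices with one expensive outlier to produce a long right tail.
prices = pd.Series([100_000, 120_000, 130_000, 150_000, 400_000], name='SalePrice')

# Positive skewness indicates a longer tail on the right-hand side.
raw_skew = prices.skew()

# log1p compresses large values, which usually reduces right skew.
log_skew = np.log1p(prices).skew()

print(raw_skew, log_skew)
```

In the notebook itself, `plt.hist(data['SalePrice'])` would draw the corresponding histogram.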

In [None]:
#TODO: Plot the histogram


## Building the Pipeline
Based on your recommendations above, let's build a pipeline that does the following:
- prepare the data and perform data imputation
- transform the continuous and categorical data (scaling and encoding respectively)
- select the useful features (feature selection); *you can optionally include this in the pipeline or perform this step prior to building the pipeline*
- build, train and evaluate the neural network using Keras.
- perform hyper-parameter tuning using RandomizedSearchCV **(optional)**
- make predictions with new data

### Step 3 – Data Preparation
Here is some helpful information on [preprocessing and feature extraction pipelines in scikit-learn](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html)
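
As a small, self-contained sketch of the idea, the example below applies a numeric branch and a categorical branch to a made-up dataframe and checks the resulting shape. The column names, values and the `'NotApp'` fill value are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: one numeric field and one categorical field, both with gaps.
toy = pd.DataFrame({
    'LotArea': [8450.0, 9600.0, np.nan, 11250.0],
    'Alley': ['Grvl', np.nan, 'Pave', np.nan],
})

# Numeric branch: fill gaps with the median, then standardize.
numeric_steps = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Categorical branch: fill gaps with a placeholder label, then one-hot encode.
categorical_steps = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='NotApp')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_steps, ['LotArea']),
    ('cat', categorical_steps, ['Alley']),
])

encoded = toy_preprocessor.fit_transform(toy)
# The result may be a sparse matrix depending on its density; densify if so.
if hasattr(encoded, 'toarray'):
    encoded = encoded.toarray()
print(encoded.shape)
```

The output has one scaled numeric column plus one column per category (`Grvl`, `NotApp`, `Pave`), so one-hot encoding can widen the data considerably when a field has many categories.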

<span style="color:red">NOTE: You can modify the cell below to suit your needs. However, ensure that the preprocessing steps that you perform are applied to the dataframe, e.g. `data`.</span>

In [None]:
#impute missing continuous values with the median and scale the data

continuous_features  = [] #TODO: provide a list of continuous fields that will be used in the model(except the dependent variable)
continuous_transformer = Pipeline(
    steps = [
    ('imputer', SimpleImputer(strategy = 'median')),
    ('scaler', StandardScaler())
    ])

#impute the NA categorical values and encode the data

categorical_features = [] #TODO: provide a list of categorical fields that will be used in the model
categorical_transformer = Pipeline(
    steps = [
    ('imputer', SimpleImputer(strategy = 'constant', fill_value = 'NotApp')), #Use an alternative value to indicate NA in the dataset
    ('onehot', OneHotEncoder(handle_unknown = 'ignore'))
    ])

data_preprocessor   = ColumnTransformer(
    transformers = [
        ('continuous', continuous_transformer, continuous_features),
        ('categorical', categorical_transformer, categorical_features)
    ])

#NOTE: the steps above will not be performed until we call `fit_transform` (in the next cell).


### Step 4 – Data Transformation & Feature Selection
Here is some helpful information on [feature selection as part of a pipeline](https://scikit-learn.org/stable/modules/feature_selection.html#feature-selection-as-part-of-a-pipeline). If you add a feature selection algorithm to the pipeline, ensure that it supports regression.
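
For illustration, one selector that supports regression is `SelectKBest` with the `f_regression` score function. The sketch below applies it to synthetic data; the choice of selector, the value of `k` and the data are all assumptions for the example, and other selectors may suit the assignment better.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only the first of three features drives the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = 3.0 * X_toy[:, 0] + 0.1 * rng.normal(size=100)

# Keep the single feature with the highest univariate F-score.
selector = SelectKBest(score_func=f_regression, k=1)
X_selected = selector.fit_transform(X_toy, y_toy)

print(selector.get_support())
print(X_selected.shape)
```

Because the selector needs the target to score features, the pipeline's `fit_transform` call must receive `y` as well as `X`.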

In [None]:
data_prep_pipeline  = Pipeline(steps=[('preprocessor', data_preprocessor), #This performs the data preparation steps in the cell above
                     ('feature_selection',  #TODO: identify a feature selection algorithm or exclude this line if you have previously performed feature selection on the data.
                                                          ), 
                    ])

transformed_data    = data_prep_pipeline.fit_transform(data.iloc[:, :-1], data['SalePrice']) #transform the data


### Step 5 – Building the Model
#### Build the neural network using Keras
Build a feed forward neural network with: an input layer, hidden layers and one output layer. 

Note: you are required to provide a suitable [optimizer](https://keras.io/api/optimizers/) and [loss function](https://keras.io/api/losses/) for the regression task. Optimizers include 'Adam', 'SGD' and 'RMSprop'. Loss functions include 'mean_squared_error', 'mean_squared_logarithmic_error' and 'mean_absolute_error'.
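
To give a feel for how the three loss functions differ, the sketch below computes them by hand with NumPy on made-up values. These are the textbook formulas; the Keras implementations may differ in small details such as clipping, so treat the numbers as illustrative.

```python
import numpy as np

# Made-up targets and predictions.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Mean squared error: penalizes large errors heavily.
mse = np.mean((y_true - y_pred) ** 2)

# Mean absolute error: grows linearly with the size of the error.
mae = np.mean(np.abs(y_true - y_pred))

# Mean squared logarithmic error: compares values on a log scale,
# so it reflects relative rather than absolute differences.
msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

print(mse, mae, msle)
```

With a target measured in large currency amounts, the magnitude of the loss (and therefore how easy the training curve is to read) varies a lot between these choices.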

In [None]:
X = transformed_data #this is the transformed data from the pipeline
y = data['SalePrice'] #this is the output

#Build a sequential model with at least three dense layers (you can add more layers as needed)
#Note: you can also add this keras model to the data preprocessing pipeline but we can skip that step for now.
ffnn_model = Sequential()
ffnn_model.add(Dense(40, activation='relu', input_shape=(X.shape[1],))) #X.shape[1] is the number of selected features 

#TODO: Add the first hidden layer with a suitable number of units/neurons and the 'relu' activation function
#TODO: Add the second hidden layer with a suitable number of units/neurons and the 'relu' activation function

#TODO: Add the output layer

ffnn_model.compile(optimizer= , #TODO: state the optimizer
                   loss= ,      #TODO: state the loss function
                   metrics=     #TODO: state the metric
                  )

ffnn_history = ffnn_model.fit(X, y, 
                              validation_split= , #TODO: state the validation split
                              epochs= , #TODO: state the number of epochs (you may need to run the model a few times to find a suitable value)
                              batch_size= , #TODO: state the number of observations to use in each batch
                              verbose=1)


In [None]:
# Visualize the training and validation loss
plt.plot(ffnn_history.history['loss'], color='b')
plt.plot(ffnn_history.history['val_loss'], color='orange')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend(['Train', 'Val'], loc='upper right')
plt.show()

### Let's use the neural network to make predictions!

#### Load the test data

In [None]:
#TODO: Load the data from `house_prices_test.csv`
test_data_url = ''
test_data = pd.read_csv(test_data_url)


#### Prepare the test data using the pipeline
This will impute any missing values and scale/encode the fields.
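
Note that the test data goes through `transform`, not `fit_transform`, so that it is encoded with what was learned from the training data. The sketch below shows the effect on a made-up categorical field (the name `Street` and its values are assumptions): a category that never appeared in training is encoded as all zeros because of `handle_unknown='ignore'`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training and test data; 'Dirt' never appears in the training set.
train_toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
test_toy = pd.DataFrame({'Street': ['Grvl', 'Dirt']})

# Learn the categories from the training data only.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_toy)

# Reuse the fitted encoder on the test data; no refitting happens here.
encoded_test = encoder.transform(test_toy).toarray()

print(encoder.categories_)
print(encoded_test)
```

The number of output columns stays the same as in training, which is what the neural network's input layer expects.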

In [None]:
prep_test_data = data_prep_pipeline.transform(test_data)

#### Use the neural network to make predictions

In [None]:
result = ffnn_model.predict( #TODO: provide the preprocessed test data (above)
        )

## Summary
Display samples of the predictions from your model and summarize your thoughts on the model's performance, the training process and its ability to generalize with new data. What are your recommendations to improve the model in the future?