## Assignment-1

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>
      
1. <a href="#instructions">Instructions</a>   
2. <a href="#about_dataset">About the dataset</a>  
3. <a href="#pre_processing">Pre processing data</a>  
4. <a href="#train_test_dataset">Train/Test dataset</a>   
5. <a href="#modeling">Build regression model</a>  
6. <a href="#train_model">Train the model</a>  
7. <a href="report">Report Model performance</a>  
</font>
</div>

<h2 id="instructions">Instructions</h2>
This assignment is to build a baseline model in Keras with followingz:

- One hidden layer of 10 nodes, and a ReLU activation function

- Use the adam optimizer and the mean squared error as the loss function.

1. Randomly split the data into a training and test sets by holding 30% of the data for testing.

2. Train the model on the training data using 50 epochs.

3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength. You can use the mean_squared_error function from Scikit-learn.

4. Repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors.

5. Report the mean and the standard deviation of the mean squared errors.

import required libraries:

In [1]:
import pandas as pd
import pylab as pl

<h2 id="about_dataset">About the dataset</h2>
<strong>The dataset is about the compressive strength of different samples of concrete based on the volumes of the different predictors that were used to make them. The predictors include:</strong>

<strong>1. Cement</strong><br>
<strong>2. Blast Furnace Slag</strong><br>
<strong>3. Fly Ash</strong><br>
<strong>4. Water</strong><br>
<strong>5. Superplasticizer</strong><br>
<strong>6. Coarse Aggregate</strong><br>
<strong>7. Fine Aggregate</strong><br>
the data can be found here again: https://cocl.us/concrete_data.

Let's download the data and read it into a pandas dataframe.

<h2 id="pre_processing">Processing Data</h2>

In [2]:
concrete_data = pd.read_csv('https://cocl.us/concrete_data/concrete_data.csv')
concrete_data.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


#### Let's check how many data points we have.

In [3]:
concrete_data.shape

(1030, 9)

There are 1030 samples of dataset, it quite a few samples, we have to be careful not to overfit the training data.


Let's check the dataset for any missing values.

In [4]:
concrete_data.describe()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [5]:
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

Great!, our data looks very clean, next start to build our model.

<h2 id="train_test_dataset">Train/Test dataset</h2>
#### We split our dataset into train and test set
#### Randomly split the data into a training and test sets by holding 30% of the data for testing.

#### Split data into predictors and target

The target variable in this problem is the concrete sample strength. Therefore, our predictors will be all the other columns.

In [6]:
# get the column as feature
concrete_data_columns = concrete_data.columns

# make the features and label
X = concrete_data[concrete_data_columns[concrete_data_columns != 'Strength']] # all columns except Strength
y = concrete_data['Strength'] # Strength column

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
def get_datatset(random):
    return train_test_split(X, y, test_size=param_test_size, random_state=random)

In [9]:
def verify_dataset(X_trainset, X_testset, y_trainset, y_testset):
    print("++++++++++++++++++++++++++++++++++++++++++++++++")
    print("X_trainset shape : {}".format(X_trainset.shape))
    print("y_trainset shape : {}".format(y_trainset.shape))
    print("X_testset shape : {}".format(X_testset.shape))
    print("y_testset shape : {}".format(y_testset.shape))

<h2 id="modeling">Build regression model</h2>
We will build a baseline Regression Model with following:
#### We will use the adam optimizer and the mean squared error as the loss function
#### One hidden layer of 10 nodes, and a ReLU activation function

Let's import the rest of the packages from the Keras library that we will need to build our regressoin model.

In [10]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

Using TensorFlow backend.


#### define regression model

In [11]:
param_test_size = 0.3 # 30%
param_train_epochs = 50
param_optimizer = 'adam'
param_loss_func = 'mean_squared_error'
param_nn_node = 10
param_activation_func = 'relu'
param_input_shape = 8 # number of predictors

In [12]:

def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation = param_activation_func, input_shape=(param_input_shape,)))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer = param_optimizer, loss=param_loss_func, metrics=['mse'])
    return model

6. <a href="#train_model">Train the model</a>  
We will train the model and report of mean and the standard deviation of the mean squared errors.
The instruction as following:

1. Randomly split the data into a training and test sets by holding 30% of the data for testing. 
2. Train the model on the training data using 50 epochs.
3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength. 
4. Repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors.

In [13]:
from sklearn.metrics import mean_squared_error

In [14]:
def regression_model_evaluatioin(batch):
    
    metrics = []
    # test for 50 times
    for i in range(1, batch + 1):
        print("++++++++++++++++++++++++++++++++++++++++++++++++")
        print("Step : {}".format(i))
        
        # get random training and testing data set
        X_trainset, X_testset, y_trainset, y_testset = get_datatset(i)
    
        # validate dataset
        verify_dataset(X_trainset, X_testset, y_trainset, y_testset)
    
        # build the model
        model = regression_model()
    
        # train model with trainginset
        model.fit(X_trainset, y_trainset, validation_split=0.3, epochs=param_train_epochs, verbose=0)
    
        # prediction with testing set
        y_predict = model.predict_classes(X_testset)
    
        
        # get mean squared error
        mse = mean_squared_error(y_testset, y_predict)
        print('Mean squared error: %.2f\n' %mse);
        metrics.append(mse)
    
    return metrics

Let's call the function now to create our model.

<h2 id="report">Report Model performance</h2>
Report the mean and the standard deviation of the mean squared errors.

In [15]:
import numpy as np
%matplotlib inline 
import matplotlib.pyplot as plt


In [16]:
# experiment 50 times 
metrics = regression_model_evaluatioin(50)

++++++++++++++++++++++++++++++++++++++++++++++++
Step : 1
++++++++++++++++++++++++++++++++++++++++++++++++
X_trainset shape : (721, 8)
y_trainset shape : (721,)
X_testset shape : (309, 8)
y_testset shape : (309,)
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Use tf.cast instead.
Mean squared error: 1515.93

++++++++++++++++++++++++++++++++++++++++++++++++
Step : 2
++++++++++++++++++++++++++++++++++++++++++++++++
X_trainset shape : (721, 8)
y_trainset shape : (721,)
X_testset shape : (309, 8)
y_testset shape : (309,)
Mean squared error: 1454.50

++++++++++++++++++++++++++++++++++++++++++++++++
Step : 3
++++++++++++++++++++++++++++++++++++++++++++++++
X_trainset shape : (721, 8)
y_trainset shape : (721,)
X_testset shape : (309, 8)
y_testset shape : (309,)
Mean squared error: 1500.85

++++++++++++++++++++++++++++++++++++++++++++++++
Step : 4
++++++++++++++++++++++++++++++++++++++++++++++++
X_trainset shape : (721, 8)
y_trainset shape : 

#### the report the mean and the standard deviation of the mean squared errors.

In [17]:
mean = np.mean(metrics)
std = np.std(metrics)

In [18]:
print("++++++++++++++++++++++++++++++++++++++++++++++++")
print("mean of squared errors : {}".format(mean))
print("standard deviation of squared errors : {}".format(std))

++++++++++++++++++++++++++++++++++++++++++++++++
mean of squared errors : 1496.0535446148863
standard deviation of squared errors : 53.9245560117099
