### The case study is from an open source dataset from Kaggle.

Link to the Kaggle project site: 
https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling


Given a bank customer, can we build a neural network classifier that predicts whether they will leave
or not?


Case File: bank.csv

### Load tensorflow

In [0]:
import tensorflow as tf
import numpy as np

tf.set_random_seed(42)

In [2]:
tf.__version__

'1.13.0-rc1'

### 1. Read the dataset

In [0]:
import pandas as pd

In [0]:
data = pd.read_csv('/content/gdrive/My Drive/AIML/Projects/Residency 6/bank.csv')

In [5]:
data.head()

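| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |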


In [6]:
print ("Shape of the data: ", data.shape)
data.info()

Shape of the data:  (10000, 14)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             10000 non-null int64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


### Observations:

#### We have 10000 rows and 14 columns (13 features and the target).

#### Some of the features, like RowNumber, CustomerId and Surname, will not be useful for evaluation, as they identify individual users and do not describe any characteristic relevant to our predictions. These features can be dropped.

#### There are some features with object type. These features should be converted to category type, or label encoding should be done, before evaluation.

#### The target feature Exited is of binary type (0 or 1).

#### The features have varied scales of measurement. Normalization should be done before evaluation.

---









### 2. Drop the columns which are unique for all users like IDs (2.5 points)

In [0]:
data.drop(['RowNumber', 'CustomerId', 'Surname'], axis = 1, inplace = True)

In [8]:
print ("Shape of the data: ", data.shape)
data.head()

Shape of the data:  (10000, 11)


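| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |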


In [9]:
# Let's convert Geography and Gender to categorical.
data["Geography"] = data["Geography"].astype('category')
data["Gender"] = data["Gender"].astype('category')
#data["Exited"] = data["Exited"].astype('category')

data["Geography"] = data["Geography"].cat.codes
data["Gender"] = data["Gender"].cat.codes
#data["Exited"] = data["Exited"].cat.codes

data.head()

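| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 0 | 0 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | 2 | 0 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | 0 | 0 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | 0 | 0 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | 2 | 0 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |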


In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
CreditScore        10000 non-null int64
Geography          10000 non-null int8
Gender             10000 non-null int8
Age                10000 non-null int64
Tenure             10000 non-null int64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(2), int64(7), int8(2)
memory usage: 722.7 KB


### Observations:

#### After dropping the irrelevant features, we are left with 10 features and a target.

#### Also we have converted the Gender and Geography features to category codes. 
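
Category codes turn each level into an integer (France is 0 and Spain is 2 in the output above), which a dense layer reads as an ordered quantity. One common alternative is one-hot encoding. The sketch below is only an illustration on a toy frame: the rows are made up, and `toy` and `encoded` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Toy frame with made-up rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "France", "Germany"],
    "Gender": ["Female", "Female", "Male", "Male"],
    "Age": [42, 41, 39, 50],
})

# One indicator column per category level.
# drop_first=True drops the first level of each feature to avoid a redundant column.
encoded = pd.get_dummies(toy, columns=["Geography", "Gender"], drop_first=True)

print(encoded.columns.tolist())
print(encoded.shape)
```

If this encoding were used, the `input_dim` of the first Dense layer would have to match the new number of columns.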



---



### 3. Distinguish the feature and the target set (2.5 points)

In [0]:
X = data.iloc[:,:-1]
y = data.iloc[:,-1]

In [12]:
X.head()

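| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 0 | 0 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 |
| 1 | 608 | 2 | 0 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 |
| 2 | 502 | 0 | 0 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 |
| 3 | 699 | 0 | 0 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 |
| 4 | 850 | 2 | 0 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 |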


In [13]:
y.head()

0    1
1    0
2    1
3    0
4    0
Name: Exited, dtype: int64

### Observations:

#### We have split the data into 10 features and a target.



---



### 4. Divide the data set into Train and test sets. (2.5 points)

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 100)

In [15]:
print ("Shape: ", X_train.shape)
X_train.head()

Shape:  (8000, 10)


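| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary |
|---|---|---|---|---|---|---|---|---|---|---|
| 8369 | 684 | 1 | 1 | 37 | 1 | 126817.13 | 2 | 1 | 1 | 29995.83 |
| 9722 | 679 | 0 | 0 | 36 | 3 | 0.0 | 2 | 1 | 1 | 2243.41 |
| 6950 | 652 | 2 | 0 | 38 | 6 | 123081.84 | 2 | 1 | 1 | 188657.97 |
| 1919 | 618 | 0 | 1 | 56 | 7 | 0.0 | 1 | 1 | 1 | 142400.27 |
| 5713 | 537 | 0 | 1 | 47 | 10 | 0.0 | 2 | 0 | 1 | 25482.62 |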


In [16]:
print ("Shape: ", X_test.shape)
X_test.head()

Shape:  (2000, 10)


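| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary |
|---|---|---|---|---|---|---|---|---|---|---|
| 8018 | 632 | 1 | 1 | 23 | 3 | 122478.51 | 1 | 1 | 0 | 147230.77 |
| 9225 | 594 | 1 | 0 | 32 | 4 | 120074.97 | 2 | 1 | 1 | 162961.79 |
| 3854 | 687 | 1 | 1 | 33 | 9 | 135962.4 | 2 | 1 | 0 | 121747.96 |
| 2029 | 520 | 0 | 1 | 33 | 4 | 156297.58 | 2 | 1 | 1 | 166102.61 |
| 3539 | 667 | 0 | 1 | 42 | 6 | 0.0 | 1 | 1 | 0 | 88890.05 |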


In [17]:
print ("Shape: ", y_train.shape)
y_train.head()

Shape:  (8000,)


8369    1
9722    0
6950    0
1919    1
5713    0
Name: Exited, dtype: int64

In [18]:
print ("Shape: ", y_test.shape)
y_test.head()

Shape:  (2000,)


8018    1
9225    0
3854    0
2029    0
3539    0
Name: Exited, dtype: int64

In [19]:
print ("Unique train labels: ", np.unique(y_train))

Unique train labels:  [0 1]


In [20]:
print ("Unique test labels: ", np.unique(y_test))

Unique test labels:  [0 1]


### Observations:

#### We have split the data into train (80%) and test (20%) sets.
#### We have 8000 records and 10 features in the train dataset.
#### We have 2000 records and 10 features in the test dataset.
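
The target is imbalanced (the classification report later in this notebook shows 412 exited customers among the 2000 test rows), so a stratified split may be worth considering. The sketch below is a minimal illustration on synthetic data, not the split used in this notebook; `X_demo` and `y_demo` are made-up stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y: 1000 rows, 10 features, roughly 20% positives.
rng = np.random.RandomState(100)
X_demo = rng.normal(size=(1000, 10))
y_demo = (rng.rand(1000) < 0.2).astype(int)

# stratify=y_demo keeps the class proportions nearly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100, stratify=y_demo
)

print("Positive rate overall / train / test:",
      y_demo.mean(), y_tr.mean(), y_te.mean())
```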



---



### 5. Normalize the train and test data (2.5 points)

In [0]:
from scipy import stats

X_train_std = stats.zscore(X_train) 
X_test_std = stats.zscore(X_test)

In [0]:
y_train_cat = tf.keras.utils.to_categorical(y_train)
y_test_cat = tf.keras.utils.to_categorical(y_test)

In [23]:
y_train[:5]

8369    1
9722    0
6950    0
1919    1
5713    0
Name: Exited, dtype: int64

In [24]:
y_train_cat[:5]

array([[0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.]], dtype=float32)

### Observations:

#### As the features have varied scales, normalizing the data should yield better results.

#### We have used zscore to normalize the features.

#### We have converted both train and test labels into one-hot vectors.
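
Calling `stats.zscore` separately on the train and test sets scales each set with its own mean and standard deviation. A common alternative, sketched below on synthetic data, is to learn the statistics from the train set only and reuse them for the test set. This is an illustration rather than the notebook's method; `X_tr` and `X_te` are made-up arrays. Like `stats.zscore` with its default `ddof=0`, `StandardScaler` uses the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X_train and X_test.
rng = np.random.RandomState(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(800, 3))
X_te = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# Learn mean and standard deviation from the train set only.
scaler = StandardScaler().fit(X_tr)

# Apply the same train statistics to both sets.
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)

print("Train mean per feature:", X_tr_std.mean(axis=0).round(6))
print("Train std per feature: ", X_tr_std.std(axis=0).round(6))
print("Test mean per feature: ", X_te_std.mean(axis=0).round(3))
```

The test means are close to zero but not exactly zero, which is expected because the test set is scaled with the train statistics.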



---



### 6. Initialize & build the model (10 points)

In [0]:
# Build a neural network with a binary crossentropy loss function and sgd optimizer in Keras. The output layer has 2 neurons (one per one-hot class).

#Initialize Sequential model
model1 = tf.keras.models.Sequential()

#Input Layer
model1.add(tf.keras.layers.Dense(10, input_dim = 10, activation='relu'))

#Add Dense Layer which provides 2 Outputs after applying sigmoid (Output Layer)
model1.add(tf.keras.layers.Dense(2, activation='sigmoid'))

#Compile the model
model1.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])

### Execute the model using model.fit()

In [0]:
model1.fit(X_train_std, y_train_cat, 
          validation_data=(X_test_std, y_test_cat), 
          epochs=10,
          batch_size=10)

Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f6fab9aaa90>

In [0]:
model1.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 10)                110       
_________________________________________________________________
dense_5 (Dense)              (None, 2)                 22        
Total params: 132
Trainable params: 132
Non-trainable params: 0
_________________________________________________________________


### Observations:

#### As this is a binary classification problem, we have used binary crossentropy for the loss and sigmoid for the activation in the output layer.

#### We have only tried the relu activation in the first layer so far. We will find the best activation function using grid search.

#### In the same way, we have only tried the sgd optimizer. We will find the best optimizer using grid search.

#### The accuracy we have got is 83.85%
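
The output layer above applies sigmoid to two units, so each unit is scored independently and the two outputs of a row need not sum to 1. The NumPy sketch below illustrates the difference from softmax, which normalizes across the units. The `logits` values are hypothetical and are not taken from the trained model.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = z - z.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical pre-activation values for two samples and two output units.
logits = np.array([[1.5, -1.0],
                   [0.2, -0.3]])

sig = sigmoid(logits)
soft = softmax(logits)

print("Sigmoid row sums:", sig.sum(axis=1))   # not constrained to 1
print("Softmax row sums:", soft.sum(axis=1))  # 1 for every row
```

Two common alternatives for a binary target are a single sigmoid unit with 0/1 labels, or two softmax units with one-hot labels and a categorical crossentropy loss.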




---



### 7. Optimize the model (5 points)

In [25]:
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import Nadam
from keras.optimizers import sgd
from keras.layers import Dropout
from keras.constraints import maxnorm

Using TensorFlow backend.


#### Let's first find out the best optimizer among 'SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam'



---



In [0]:
# Function to create model, required for KerasClassifier
def create_model(optimizer='adam'):
  #Initialize Sequential model
  model2 = Sequential()
  
  #Input Layer
  model2.add(Dense(10, input_dim = 10, activation='relu'))
  
  #Add Dense Layer which provides 1 Output after applying sigmoid (Output Layer)
  model2.add(Dense(1, activation='sigmoid'))
  
  #Compile the model
  model2.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
  
  return model2

model2 = KerasClassifier(build_fn=create_model, epochs=10, batch_size=10, verbose=0)


# define the grid search parameters
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
param_grid = dict(optimizer=optimizer)

grid = GridSearchCV(estimator=model2, param_grid=param_grid, n_jobs=-1, scoring="accuracy", cv=2)
grid_result = grid.fit(X_train_std, y_train)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']

for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.854750 using {'optimizer': 'Nadam'}
0.817125 (0.014375) with: {'optimizer': 'SGD'}
0.837500 (0.001250) with: {'optimizer': 'RMSprop'}
0.817000 (0.001000) with: {'optimizer': 'Adagrad'}
0.842375 (0.004875) with: {'optimizer': 'Adadelta'}
0.844125 (0.004625) with: {'optimizer': 'Adam'}
0.839750 (0.002250) with: {'optimizer': 'Adamax'}
0.854750 (0.002250) with: {'optimizer': 'Nadam'}


### Observations:

#### The best optimizer is Nadam, with a cross-validated accuracy of 85.47%.
#### That is about 1.6 percentage points higher than the 83.85% obtained with sgd above.
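
The summary loop above prints one line per candidate. An alternative view, sketched below, loads `cv_results_` into a DataFrame and sorts it by rank. To keep the example quick and self-contained it uses `LogisticRegression` on synthetic data as a stand-in for the Keras classifier, which is an assumption made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary problem standing in for the bank data.
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Stand-in estimator and grid; the notebook searches Keras optimizers instead.
grid = GridSearchCV(
    LogisticRegression(),
    param_grid={"C": [0.01, 0.1, 1.0]},
    scoring="accuracy",
    cv=2,
)
grid.fit(X_demo, y_demo)

# One row per candidate, best candidate first.
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
results = (
    pd.DataFrame(grid.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)
```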



#### Note: scikit-learn and Keras represent multiclass targets differently, so we do not apply the categorical (one-hot) transformation to the target variable in the grid searches that pass scoring="accuracy". With a one-hot target, those searches end with the error *"ValueError: Classification metrics can't handle a mix of multilabel-indicator and binary targets"*. So in those searches we use the target variable without the categorical transformation.
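
The error in the note can be traced to how scikit-learn classifies target arrays. The sketch below uses `type_of_target` from `sklearn.utils.multiclass` on the first five train labels shown earlier, and recovers the 0/1 labels from the one-hot form with `argmax`. The names `y_labels`, `y_onehot` and `recovered` are illustrative.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# First five train labels, as printed by y_train[:5] above.
y_labels = np.array([1, 0, 0, 1, 0])

# One-hot form, equivalent to what to_categorical produces for two classes.
y_onehot = np.eye(2)[y_labels]

print(type_of_target(y_labels))  # 1-D 0/1 vector
print(type_of_target(y_onehot))  # 2-D indicator matrix

# argmax along the class axis turns the one-hot rows back into labels.
recovered = y_onehot.argmax(axis=1)
print(recovered)
```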

#### Let's find out the best learning rate.

In [0]:
# Tune Learning Rate
from keras.optimizers import Nadam

# Function to create model, required for KerasClassifier
def create_model(learn_rate=0.01):
  #Initialize Sequential model
  model4 = Sequential()
  #Input Layer
  model4.add(Dense(10, input_dim = 10, activation='relu'))
  #Add Dense Layer which provides 2 Outputs after applying sigmoid (Output Layer)
  model4.add(Dense(2, activation='sigmoid'))
  #Compile the model
  optimizer = Nadam(lr=learn_rate)
  model4.compile(optimizer = optimizer, loss = 'binary_crossentropy', metrics = ['accuracy'])
  return model4

# create model
model4 = KerasClassifier(build_fn=create_model, epochs=10, batch_size=20, verbose=0)

# define the grid search parameters
learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
param_grid = dict(learn_rate=learn_rate)

grid = GridSearchCV(estimator=model4, param_grid=param_grid, n_jobs=1, cv=2)
grid_result = grid.fit(X_train_std, y_train_cat)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.860062 using {'learn_rate': 0.01}
0.824000 (0.002375) with: {'learn_rate': 0.001}
0.860062 (0.001313) with: {'learn_rate': 0.01}
0.845062 (0.004188) with: {'learn_rate': 0.1}
0.834750 (0.018250) with: {'learn_rate': 0.2}
0.806750 (0.011500) with: {'learn_rate': 0.3}


### Observations:

#### The best learning rate is 0.01, with an accuracy of 86%.
#### There is a slight increase in accuracy.



#### Now let's find out the best network weight initialization.

In [0]:
# Tune Network Weight Initialization

def create_model(init_mode='uniform'):
  #Initialize Sequential model
  model5 = Sequential()
  #Input Layer
  model5.add(Dense(10, input_dim = 10, kernel_initializer = init_mode, activation='relu'))
  #Add Dense Layer which provides 2 Outputs after applying sigmoid (Output Layer)
  model5.add(Dense(2, kernel_initializer = init_mode, activation='sigmoid'))
  #Compile the model
  optimizer = Nadam(lr=0.01)
  model5.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
  return model5

# create model
model5 = KerasClassifier(build_fn=create_model, epochs=10, batch_size=20, verbose=0)

# define the grid search parameters
init_mode = ['uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']
param_grid = dict(init_mode=init_mode)

grid = GridSearchCV(estimator=model5, param_grid=param_grid, n_jobs=1, cv=2)
grid_result = grid.fit(X_train_std, y_train_cat)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.856063 using {'init_mode': 'he_normal'}
0.839625 (0.005500) with: {'init_mode': 'uniform'}
0.851500 (0.001375) with: {'init_mode': 'lecun_uniform'}
0.843750 (0.009500) with: {'init_mode': 'normal'}
0.796875 (0.003125) with: {'init_mode': 'zero'}
0.854625 (0.002375) with: {'init_mode': 'glorot_normal'}
0.852687 (0.002812) with: {'init_mode': 'glorot_uniform'}
0.856063 (0.002563) with: {'init_mode': 'he_normal'}
0.854875 (0.000625) with: {'init_mode': 'he_uniform'}


### Observations:

#### The best network weight initialization is he_normal, with an accuracy of 85.6%.
#### With an explicit weight initialization, the accuracy dropped from 86% to 85.6%.
#### So we are not going to specify the weight initialization in our model.



#### Now let's find out the best activation function.

In [0]:
# Tune the Neuron Activation Function

# Function to create model, required for KerasClassifier
def create_model(activation='relu'):
  #Initialize Sequential model
  model6 = Sequential()
  #Input Layer
  model6.add(Dense(10, input_dim = 10, activation=activation))
  #Add Dense Layer which provides 2 Outputs after applying sigmoid (Output Layer)
  model6.add(Dense(2, activation='sigmoid'))
  #Compile the model
  optimizer = Nadam(lr=0.01)
  model6.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
  return model6

# create model
model6 = KerasClassifier(build_fn=create_model, epochs=10, batch_size=20, verbose=0)

# define the grid search parameters
activation = ['softmax', 'softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear']
param_grid = dict(activation=activation)

grid = GridSearchCV(estimator=model6, param_grid=param_grid, n_jobs=1, cv=2)
grid_result = grid.fit(X_train_std, y_train_cat)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.859812 using {'activation': 'softmax'}
0.859812 (0.000187) with: {'activation': 'softmax'}
0.855562 (0.002687) with: {'activation': 'softplus'}
0.854250 (0.001500) with: {'activation': 'softsign'}
0.850938 (0.004062) with: {'activation': 'relu'}
0.852687 (0.002063) with: {'activation': 'tanh'}
0.850625 (0.000250) with: {'activation': 'sigmoid'}
0.840375 (0.008000) with: {'activation': 'hard_sigmoid'}
0.807438 (0.000312) with: {'activation': 'linear'}


### Observations:

#### The best neuron activation function is softmax, with an accuracy of 85.98%.
#### This is a very slight decrease in accuracy, from 86% to 85.98%.


#### Now let's find out the best dropout rate.

In [0]:
# Tune Dropout Regularization
# Tuning the dropout rate for regularization in an effort to limit overfitting and improve the model’s ability to generalize.

from keras.layers import Dropout
from keras.constraints import maxnorm

# Function to create model, required for KerasClassifier
def create_model(dropout_rate=0.0):
  #Initialize Sequential model
  model7 = Sequential()
  #Input Layer
  model7.add(Dense(10, input_dim = 10, activation='softmax'))
  model7.add(Dropout(dropout_rate))
  #Add Dense Layer which provides 2 Outputs after applying sigmoid (Output Layer)
  model7.add(Dense(2, activation='sigmoid'))
  #Compile the model
  optimizer = Nadam(lr=0.01)
  model7.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
  return model7

# create model
model7 = KerasClassifier(build_fn=create_model, epochs=10, batch_size=20, verbose=0)

# define the grid search parameters
dropout_rate = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
param_grid = dict(dropout_rate=dropout_rate)

grid = GridSearchCV(estimator=model7, param_grid=param_grid, n_jobs=1, cv=2)
grid_result = grid.fit(X_train_std, y_train_cat)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.858188 using {'dropout_rate': 0.2}
0.856562 (0.000687) with: {'dropout_rate': 0.0}
0.856125 (0.000000) with: {'dropout_rate': 0.1}
0.858188 (0.001062) with: {'dropout_rate': 0.2}
0.857375 (0.000375) with: {'dropout_rate': 0.3}
0.854812 (0.002187) with: {'dropout_rate': 0.4}
0.855875 (0.004375) with: {'dropout_rate': 0.5}
0.824375 (0.006625) with: {'dropout_rate': 0.6}
0.796875 (0.003125) with: {'dropout_rate': 0.7}
0.796875 (0.003125) with: {'dropout_rate': 0.8}
0.796875 (0.003125) with: {'dropout_rate': 0.9}


### Observations:

#### The best dropout rate is 0.2, with an accuracy of 85.81%.
#### This is a very slight decrease in accuracy, from 86% to 85.81%.


#### Now let's find out the best number of neurons in the dense layer.

In [0]:
# Tune the Number of Neurons in the Hidden Layer

# The number of neurons in a layer is an important parameter to tune. 
# Generally the number of neurons in a layer controls the representational capacity of the network, at least at that point in the topology.

# Function to create model, required for KerasClassifier
def create_model(neurons=1):
  #Initialize Sequential model
  model8 = Sequential()
  #Input Layer
  model8.add(Dense(neurons, input_dim = 10, activation='softmax'))
  model8.add(Dropout(0.2))
  #Add Dense Layer which provides 2 Outputs after applying sigmoid (Output Layer)
  model8.add(Dense(2, activation='sigmoid'))
  #Compile the model
  optimizer = Nadam(lr=0.01)
  model8.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
  return model8

# create model
model8 = KerasClassifier(build_fn=create_model, epochs=10, batch_size=20, verbose=0)

# define the grid search parameters
neurons = [1, 5, 10, 15, 20, 25, 30]
param_grid = dict(neurons=neurons)

grid = GridSearchCV(estimator=model8, param_grid=param_grid, n_jobs=1, cv=2)
grid_result = grid.fit(X_train_std, y_train_cat)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.857750 using {'neurons': 30}
0.796875 (0.003125) with: {'neurons': 1}
0.856812 (0.000312) with: {'neurons': 5}
0.854687 (0.003937) with: {'neurons': 10}
0.855625 (0.000625) with: {'neurons': 15}
0.856625 (0.001375) with: {'neurons': 20}
0.857437 (0.002563) with: {'neurons': 25}
0.857750 (0.000875) with: {'neurons': 30}


### Observations:

#### The best number of neurons in the dense layer is 30, with an accuracy of 85.77%.
#### This is a very slight decrease in accuracy, from 85.81% to 85.77%.


#### Now let's find out the best batch size and number of epochs.

In [0]:
# Tune Batch Size and Number of Epochs

# Function to create model, required for KerasClassifier
def create_model():
  #Initialize Sequential model
  model3 = Sequential()
  
  #Input Layer
  model3.add(Dense(30, input_dim = 10, activation='softmax'))
  
  #Dropout
  model3.add(Dropout(0.2))
  
  #Add Dense Layer which provides 1 Output after applying sigmoid (Output Layer)
  model3.add(Dense(1, activation='sigmoid'))
  
  #Compile the model
  optimizer = Nadam(lr=0.01)
  model3.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
  
  return model3

# create model
model3 = KerasClassifier(build_fn=create_model, verbose=0)

# define the grid search parameters
batch_size = [10, 20, 40, 60, 80, 100]
epochs = [10, 50, 100]
param_grid = dict(batch_size=batch_size, epochs=epochs)

grid = GridSearchCV(estimator=model3, param_grid=param_grid, n_jobs=1, scoring="accuracy", cv=2)
grid_result = grid.fit(X_train_std, y_train)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.859500 using {'batch_size': 60, 'epochs': 10}
0.854750 (0.000750) with: {'batch_size': 10, 'epochs': 10}
0.846625 (0.006375) with: {'batch_size': 10, 'epochs': 50}
0.851750 (0.001500) with: {'batch_size': 10, 'epochs': 100}
0.857375 (0.001125) with: {'batch_size': 20, 'epochs': 10}
0.854875 (0.000375) with: {'batch_size': 20, 'epochs': 50}
0.847875 (0.001875) with: {'batch_size': 20, 'epochs': 100}
0.858375 (0.001125) with: {'batch_size': 40, 'epochs': 10}
0.854500 (0.002250) with: {'batch_size': 40, 'epochs': 50}
0.854375 (0.000625) with: {'batch_size': 40, 'epochs': 100}
0.859500 (0.001000) with: {'batch_size': 60, 'epochs': 10}
0.853875 (0.000625) with: {'batch_size': 60, 'epochs': 50}
0.853375 (0.000375) with: {'batch_size': 60, 'epochs': 100}
0.858750 (0.002000) with: {'batch_size': 80, 'epochs': 10}
0.854625 (0.001125) with: {'batch_size': 80, 'epochs': 50}
0.848875 (0.003125) with: {'batch_size': 80, 'epochs': 100}
0.858000 (0.001250) with: {'batch_size': 100, 'epochs': 

### Observations:

#### The best batch size is 60 and the best number of epochs is 10, with an accuracy of 85.95%.
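
Searching two parameters jointly multiplies the number of model fits. As a rough sketch of the cost of the batch size and epochs search above, `ParameterGrid` can enumerate the combinations. The grid values and `cv=2` are taken from that cell; `cv_folds`, `n_fits` and `total_epochs` are illustrative names, and the count ignores the final refit on the full train set.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the batch size / epochs search above.
param_grid = dict(batch_size=[10, 20, 40, 60, 80, 100], epochs=[10, 50, 100])
cv_folds = 2  # matches cv=2 in GridSearchCV

combos = list(ParameterGrid(param_grid))

# Every combination is trained once per fold.
n_fits = len(combos) * cv_folds

# Total training epochs across all fits gives a rough feel for the run time.
total_epochs = sum(c["epochs"] for c in combos) * cv_folds

print("Combinations:", len(combos))
print("Model fits:", n_fits)
print("Total epochs trained:", total_epochs)
```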

#### Now let's build our final model with all the best parameters we have identified.

### Final Model after tuning...

In [26]:
#Initialize Sequential model
modelF = Sequential()
  
#Input Layer
modelF.add(Dense(30, input_dim = 10, activation='softmax'))
  
#Dropout
modelF.add(Dropout(0.2))

#Add Dense Layer which provides 30 Outputs
modelF.add(Dense(30, activation='softmax'))

#Dropout
modelF.add(Dropout(0.2))
  
#Add Dense Layer which provides 2 Outputs after applying sigmoid (Output Layer)
modelF.add(Dense(2, activation='sigmoid'))
 
#Compile the model
optimizer = Nadam(lr=0.01)
modelF.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
 
modelF.fit(X_train_std, y_train_cat, 
        validation_data=(X_test_std, y_test_cat), 
        epochs=10,
        batch_size=60)

Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fcd3f418390>

## Review model

In [27]:
modelF.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 30)                330       
_________________________________________________________________
dropout_1 (Dropout)          (None, 30)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 30)                930       
_________________________________________________________________
dropout_2 (Dropout)          (None, 30)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 62        
Total params: 1,322
Trainable params: 1,322
Non-trainable params: 0
_________________________________________________________________


### 9. Predict the results using 0.5 as a threshold (5 points)

In [0]:
# predict class probabilities for the testing set (no explicit threshold; rounding them later is equivalent to a 0.5 threshold)
y_pred = modelF.predict(X_test_std)

In [44]:
print ("Prediction: ", y_pred[:10])

Prediction:  [[0.80762196 0.19381288]
 [0.9382484  0.06173   ]
 [0.9277985  0.07209456]
 [0.96430403 0.03567612]
 [0.62053293 0.3812387 ]
 [0.5530484  0.44671857]
 [0.85903764 0.14185444]
 [0.9644735  0.03570348]
 [0.67253566 0.32916683]
 [0.9108384  0.08899054]]


In [0]:
# make predictions for the testing set with threshold 0.5
y_pred_threshold = (modelF.predict_proba(X_test_std) >= 0.5)

In [46]:
print ("Prediction: ", y_pred_threshold[:10])

Prediction:  [[ True False]
 [ True False]
 [ True False]
 [ True False]
 [ True False]
 [ True False]
 [ True False]
 [ True False]
 [ True False]
 [ True False]]


### Observations:

#### We have predicted the results with and without specifying the threshold 0.5.
#### Let's check the accuracy score and confusion matrix for the same.
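
Because the labels were one-hot encoded, column 1 of the prediction matrix corresponds to class 1 (Exited). A threshold can therefore be applied to that column alone to get 0/1 labels. The sketch below uses the ten prediction rows printed above; `predict_exited` is a hypothetical helper, and the 0.4 threshold is shown only for contrast with 0.5.

```python
import numpy as np

# First ten rows of y_pred as printed above.
# Column 0 is class 0 (stayed), column 1 is class 1 (exited).
y_pred_sample = np.array([
    [0.80762196, 0.19381288],
    [0.9382484, 0.06173],
    [0.9277985, 0.07209456],
    [0.96430403, 0.03567612],
    [0.62053293, 0.3812387],
    [0.5530484, 0.44671857],
    [0.85903764, 0.14185444],
    [0.9644735, 0.03570348],
    [0.67253566, 0.32916683],
    [0.9108384, 0.08899054],
])

def predict_exited(probs, threshold=0.5):
    """Return 0/1 labels: 1 where the class-1 probability reaches the threshold."""
    return (probs[:, 1] >= threshold).astype(int)

labels_05 = predict_exited(y_pred_sample, 0.5)
labels_04 = predict_exited(y_pred_sample, 0.4)

print("Threshold 0.5:", labels_05)
print("Threshold 0.4:", labels_04)
```

With these ten rows, lowering the threshold to 0.4 flips only the sixth customer (class-1 probability 0.4467) to the positive class.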

### 10. Print the Accuracy score and confusion matrix (2.5 points)

In [47]:
# Accuracy score for predictions with and without the specified threshold

from sklearn import metrics
print("Accuracy score for predictions with no specified threshold: ", metrics.accuracy_score(y_test_cat, y_pred.round()))
print("Accuracy score for predictions with specified threshold 0.5: ", metrics.accuracy_score(y_test_cat, y_pred_threshold.round()))

Accuracy score for predictions with no specified threshold:  0.851
Accuracy score for predictions with specified threshold 0.5:  0.851


In [48]:
print ("Confusion Matrix for predictions with no specified threshold")
pd.DataFrame(metrics.confusion_matrix(y_test_cat.argmax(axis=1), y_pred.argmax(axis=1)),
                 columns=['pred_neg', 'pred_pos'], index=['neg', 'pos'])

Confusion Matrix for predictions with no specified threshold


| | pred_neg | pred_pos |
|---|---|---|
| neg | 1533 | 55 |
| pos | 243 | 169 |


In [49]:
print ("Confusion Matrix for predictions with specified threshold 0.5")
pd.DataFrame(metrics.confusion_matrix(y_test_cat.argmax(axis=1), y_pred_threshold.argmax(axis=1)),
                 columns=['pred_neg', 'pred_pos'], index=['neg', 'pos'])

Confusion Matrix for predictions with specified threshold 0.5


| | pred_neg | pred_pos |
|---|---|---|
| neg | 1533 | 55 |
| pos | 243 | 169 |


In [50]:
from sklearn.metrics import classification_report
print ("Classification Report for predictions with no specified threshold")
print(classification_report(y_test_cat, y_pred.round()))

Classification Report for predictions with no specified threshold
              precision    recall  f1-score   support

           0       0.86      0.97      0.91      1588
           1       0.75      0.41      0.53       412

   micro avg       0.85      0.85      0.85      2000
   macro avg       0.81      0.69      0.72      2000
weighted avg       0.84      0.85      0.83      2000
 samples avg       0.85      0.85      0.85      2000



In [51]:
from sklearn.metrics import classification_report
print ("Classification Report for predictions with specified threshold 0.5")
print(classification_report(y_test_cat, y_pred_threshold))

Classification Report for predictions with specified threshold 0.5
              precision    recall  f1-score   support

           0       0.86      0.97      0.91      1588
           1       0.75      0.41      0.53       412

   micro avg       0.85      0.85      0.85      2000
   macro avg       0.81      0.69      0.72      2000
weighted avg       0.84      0.85      0.83      2000
 samples avg       0.85      0.85      0.85      2000



### Observations:

#### Rounding the predicted probabilities is effectively the same as applying a 0.5 threshold, which is the usual default for binary classification. So there is no difference in the accuracy score or classification report with and without specifying the 0.5 threshold.
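
The figures in the classification report can be re-derived from the confusion matrix as a quick check. The sketch below uses the four counts printed above and also computes the accuracy of always predicting the majority class (not exited), which gives a simple baseline for the 0.851 score. The variable names are illustrative.

```python
# Counts from the confusion matrix above (rows: actual, columns: predicted).
tn, fp = 1533, 55   # actual negatives: customers who stayed
fn, tp = 243, 169   # actual positives: customers who exited

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of predicted exits, how many really exited
recall = tp / (tp + fn)      # of real exits, how many were caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy obtained by always predicting the majority class (0).
majority_baseline = (tn + fp) / total

print("Accuracy:  %.3f" % accuracy)
print("Precision: %.2f" % precision)
print("Recall:    %.2f" % recall)
print("F1-score:  %.2f" % f1)
print("Majority-class baseline accuracy: %.3f" % majority_baseline)
```

The recall of 0.41 for class 1 shows that more than half of the customers who exited are missed at the 0.5 threshold, even though overall accuracy is above the baseline.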