## Neural Network

Here's my implementation of a regular feed-forward network.  I've implemented it as a class called ```neural_net```.  When it's instantiated it takes as arguments a sequence of integers, which specify the number of nodes at each layer.  For example, ```neural_net(5,7,3)``` creates a neural network with 5 input neurons, a single hidden layer with 7 neurons, and 3 output neurons.  ```neural_net(5,5,5,5,5)``` is a neural net with an input layer, an output layer, and 3 hidden layers, with 5 neurons at each layer.
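
To make the layer-size arguments concrete, here is a small sketch of the weight-matrix shapes they imply.  The helper ```weight_shapes``` is a hypothetical name used only for illustration (it is not part of the class); it mirrors the shape rule used in ```__init__```, where each matrix gets one extra column for the biases, as explained in the setup discussion.

```python
def weight_shapes(*layer_sizes):
    """Shapes of the weight matrices implied by the given layer sizes.

    Matrix i maps layer i to layer i+1 and carries one extra column for the biases.
    (Hypothetical helper, for illustration only.)
    """
    return [(layer_sizes[i + 1], layer_sizes[i] + 1)
            for i in range(len(layer_sizes) - 1)]

print(weight_shapes(5, 7, 3))        # [(7, 6), (3, 8)]
print(weight_shapes(5, 5, 5, 5, 5))  # four matrices, each of shape (5, 6)
```

So a network with $L$ layers always has $L-1$ weight matrices.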

Since there are a number of ways to set things up (and mine might be slightly different from the way you did it), I'll explain here the details of my setup.  Each data point (training or testing) will be in the form of a column vector.  For example, let $X$ be a single input vector with $n$ features.  To find the resulting activations at the second layer of neurons (the first hidden layer) we multiply $X$ by the weight matrix $W$.  If there are $n$ input neurons and $m$ neurons in the hidden layer, then $W$ will be $m$ by $n$.  Thus, the values of the activations at the first hidden layer (before applying the activation function) are just given by the matrix product $WX$.  So to find the value of the activation at the first neuron in the hidden layer, we take the dot product of the first row of $W$ with the vector $X$; the activation at the second neuron is obtained by dotting the second row of $W$ with $X$, and so on.

Doing it like this, we could incorporate bias terms explicitly, by adding a separate vector of biases, say $B$, to $WX$, but in the code below I've implemented it slightly differently.  Instead of adding the bias vector $B$ onto $WX$, I insert $B$ into the front of the matrix $W$, so that it is now an $m$ by $n+1$ matrix with first column $B$, and I turn $X$ into an $(n+1)$-dimensional vector by adding a 1 in the first spot (notice that the dimensions still work out for the matrix multiplication).  It shouldn't be hard to convince yourself that modifying $X$ and $W$ in this way is the same thing as adding on the bias terms separately.
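
If you'd rather check this numerically than convince yourself on paper, the following sketch compares the two approaches on random values.  The sizes and the names ```W_aug``` and ```X_aug``` are just assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4                      # n input neurons, m hidden neurons
W = rng.normal(size=(m, n))      # weights without the biases
B = rng.normal(size=(m, 1))      # bias column vector
X = rng.normal(size=(n, 1))      # a single input as a column vector

explicit = W @ X + B             # biases added on separately

W_aug = np.hstack((B, W))        # m by n+1, with B as the first column
X_aug = np.vstack(([[1.0]], X))  # n+1 by 1, with a 1 in the first spot
folded = W_aug @ X_aug           # biases folded into the matrix product

print(np.allclose(explicit, folded))  # True
```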

We can do the same after applying the activation function when going from the first hidden layer to the next layer, by using a matrix $W$ with one extra column (corresponding to the biases).  Notice though that this requires the extra step of putting an extra 1 in the first spot of the neuron vector at each layer before multiplying by $W$.

One final thing, note that in the discussion above $X$ could have consisted of $k$ observations, in which case it would have been an $n \times k$ matrix.  In this case, the formulas all hold the same, but at each layer you'll have a list of $k$ vectors which contain the values of the corresponding neurons for each of the $k$ input vectors.  Instead of adding a single 1 to the first position to account for the biases, we'll have to add a row of 1s at each step to the matrix $X$.
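
Here is a rough sketch of the batched case with made-up sizes: stacking a row of 1s on top of an $n \times k$ matrix and multiplying once gives the same column for each observation as feeding that observation through on its own.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 3, 4, 5
W_aug = rng.normal(size=(m, n + 1))   # weights with the bias column in front
X = rng.normal(size=(n, k))           # k observations, one per column

X_aug = np.vstack((np.ones(k), X))    # row of 1s stacked on top: (n+1) by k
A = W_aug @ X_aug                     # m by k: one column of pre-activations per observation

# Column j of A matches what we get by passing observation j through alone.
j = 2
single = W_aug @ np.concatenate(([1.0], X[:, j]))

print(A.shape)                        # (4, 5)
print(np.allclose(A[:, j], single))   # True
```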

In [1]:
import numpy as np  
import pandas as pd
from sklearn import cross_validation  # Needs scikit-learn < 0.20; later versions moved KFold to sklearn.model_selection.
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
class neural_net:
    
    # The __init__ function is called whenever an instance of the class is created.
    # The argument *argv represents the list of integers that are passed to the class when the instance is created.
    
    def __init__(self,*argv):
        self.layer_sizes=argv  # List of layer sizes, starting with the input layer and ending with output layer.
        self.w={}              # Empty dictionary which we will fill up with the weight matrices.
        # Next we define the weight matrices, so that the ith weight matrix is called self.w[i].
        # I read somewhere that it's a good idea to initialize your weights so that they are uniformly 
        # distributed in the interval [-b,b], where b is some value depending on the number of neurons 
        # in the incoming and outgoing layers (see the definition below).  Notice in the size of the 
        # matrix I'm adding an extra column for the biases (it's self.layer_sizes[i+1] by self.layer_sizes[i]+1).
        for i in range(len(self.layer_sizes)-1):
            b=np.sqrt(6/(self.layer_sizes[i]+self.layer_sizes[i+1]))
            self.w[i]=np.random.uniform(-b,b,(self.layer_sizes[i+1],self.layer_sizes[i]+1))
    
    
    # The reset function below is used by the perf_test functions below, and re-initializes the weights.  Note that,
    # unlike __init__, it draws them from a standard normal distribution rather than uniformly from [-b,b].
        
    def reset(self,*argv):
        self.layer_sizes=argv  
        self.w={}              
        for i in range(len(self.layer_sizes)-1):
            self.w[i]=np.random.normal(0,1,(self.layer_sizes[i+1],self.layer_sizes[i]+1))
            
            
    # Sigmoid function.
        
    def sigmoid(self,t):
        return 1.0/(1+np.exp(-t))
    
    
    # Derivative of the sigmoid, written in terms of the sigmoid's output: if t=sigmoid(a), the derivative at a is t(1-t).
    
    def d_sigmoid(self,t):
        return t*(1.0-t)
    
    
    # The predict function takes a column vector input X (or matrix of column vectors) and feeds it forward through
    # the network, computing the activations at each neuron.  It returns the values of the output neurons as a column 
    # vector (if there is only one input), or as a matrix of k column vectors if there were k columns in X.
        
    def predict(self,X):
        self.number_training=X.shape[1]              # Number of data points to feed into the network.
        bias_row=np.ones(self.number_training)       # A row of ones which will be stacked on top of X.
        # Define two empty dictionaries, which will hold the values of the activations at each neuron. 
        # self.a holds the values of the activations before applying the activation function, while self.z holds
        # the values that are returned from the activation function. 
        self.a={}                                                                                        
        self.z={}                                    
        self.z[0]=np.vstack((bias_row,X))
        # The self.a values are obtained by multiplying the self.z of the previous layer by the corresponding self.w
        # matrices.  The self.z values are obtained from these by applying the activation function, and then stacking
        # a row of ones on top.
        for i in range(len(self.layer_sizes)-2):
            self.a[i+1]=self.w[i]@self.z[i]
            self.z[i+1]=np.vstack((bias_row,self.sigmoid(self.a[i+1])))
        # The last (output) layer doesn't need a row of ones stacked on top (since we won't be multiplying by weights
        # and adding bias terms), so we handle this layer separately, and return the self.z values from the last layer.
        self.a[len(self.layer_sizes)-1]=self.w[len(self.layer_sizes)-2]@self.z[len(self.layer_sizes)-2]    
        self.z[len(self.layer_sizes)-1]=self.sigmoid(self.a[len(self.layer_sizes)-1]) 
        return self.z[len(self.z)-1]
    

    # The compute_delta function takes as input the training data points X and corresponding target variables T, and 
    # computes the corresponding delta values needed for back-propagation.  As set up here, it computes these values 
    # for the cross-entropy error function, but could be changed easily to use other error functions.
    
    def compute_delta(self,X,T):
        self.delta={}
        # The output layer deltas are just the output values self.z[len(self.z)-1] minus the target variables T.
        self.delta[len(self.layer_sizes)-1]=self.z[len(self.z)-1]-T
        derivatives={}
        # Delta values at earlier layers are computed recursively from the formulas.  Because we've added a row
        # to each self.z to accommodate the bias terms, going backwards we must drop a row at each stage.
        for i in range(len(self.layer_sizes)-2):
            derivatives[len(self.layer_sizes)-2-i]=self.d_sigmoid(self.z[len(self.layer_sizes)-2-i])
            self.delta[len(self.layer_sizes)-2-i]=np.delete(((self.w[len(self.layer_sizes)-2-i].T)@self.delta[len(self.layer_sizes)-1-i])*derivatives[len(self.layer_sizes)-2-i],0,0)
        # Return the dictionary of delta values which will be used by backprop function to compute the gradients.    
        return self.delta  
    
    
    # The backprop function uses the predict and compute_delta functions to compute the gradient of the cross-entropy
    # error function with respect to the weights self.w at each training point, and then averages them to get the 
    # value of the gradient over all training points.
    
    def backprop(self,X,T):           
        self.predict(X)
        deltas=self.compute_delta(X,T)
        grad={}
        # grad_avg will be a dictionary which contains the running average of the gradients as we run over the list of training
        # points.  Since we are taking the gradient with respect to the weights self.w, which form a list of matrices, 
        # we write the resulting gradients as matrices of the same sizes as the self.w matrices.
        grad_avg={}
        # The counter i runs over the number of weight matrices, and initializes each average gradient matrix as zero matrices.
        for i in range(len(self.layer_sizes)-1):
            grad_avg[i]=np.zeros((self.layer_sizes[i+1],self.layer_sizes[i]+1))
        # The counter m ranges over the set of all training points.  For each point it uses the deltas from the 
        # compute_delta function and the self.z values to compute the required derivatives and fill up the gradient
        # matrices.  It then divides these matrices by the number of training data points, before adding them to the 
        # running averages.
        for m in range(X.shape[1]):
            grad[m]={}
            for i in range(len(self.layer_sizes)-1):
                grad[m][i]=np.copy(self.w[i])
                for k in range(self.w[i].shape[0]):
                    for j in range(self.w[i].shape[1]):
                        grad[m][i][k,j]=deltas[i+1][k,m]*self.z[i][j,m]
                grad_avg[i]+=(1/X.shape[1])*grad[m][i]
        return grad_avg
    
    
    # The fit function takes the training data X and target values T, along with a learning_rate and number of epochs, 
    # and fits the network to the data.  
    
    def fit(self,X,T,learning_rate,epochs):
        # As the counter i runs over the number of epochs, it updates the weight matrices self.w by recomputing the 
        # gradient, and subtracting it from the weights (times the learning rate).
        for i in range(epochs):
            self.grad_avg=self.backprop(X,T)
            for j in range(len(self.layer_sizes)-1):
                self.w[j]-=learning_rate*self.grad_avg[j]
    
    
    # The following three functions are for use when the network is being used to predict binary outcome variables (such
    # as the titanic problem, where we are trying to classify passengers as having survived (1) or perished (0)).  They won't
    # work for regression problems (when we are trying to predict the values of some function), though they could be easily 
    # modified to do so.
    
                
    # The score function takes data X and target values T, and determines how accurately the network predicts the
    # values of T based on the features in X.  It rounds the predicted values to 0 or 1, and returns the fraction of 
    # correct predictions (a number between 0 and 1).
                
    def score(self,X,T):
        return 1-(np.abs((np.round(self.predict(X))-T)).sum())/len(X.T)
    
    
    # The functions perf_test and perf_test2 are similar.  Both of them take as input some training data X along with 
    # target variables T, a value specifying the number of epochs, a list of values which contains the different learning
    # rates to test, a list (or two, in the case of perf_test2) of values containing the number of neurons to try for the 
    # hidden layer(s), and a fold number.  
    
    # For each learning rate and each combination of hidden neuron counts from the lists supplied, perf_test 
    # and perf_test2 divide the training data into fold_number different sets, train the network on all but one of these
    # sets, and test the network's performance on the remaining "hold-out" data set.  The performance of the network is
    # recorded at each epoch.  The process is repeated for each different hold-out chunk of the data, retraining the 
    # network and recording its performance at every epoch.
    
    # In this way we can test a whole range of different network configurations (neuron numbers, learning rates, epochs), 
    # to see which combinations give the best performance.  The only difference between perf_test and perf_test2 is that
    # perf_test2 trains networks with 2 hidden layers, while perf_test trains networks with only one hidden layer.
    
    def perf_test(self,X,T,epoch_number,learning_rate_range,neuron_range,fold_number=5):
        self.row_position=0
        number_of_rows=len(learning_rate_range)*len(neuron_range)*fold_number
        self.perf_data=pd.DataFrame(index=list(range(number_of_rows)),columns=['hidden neurons','learning parameter']+list(range(epoch_number+1)))
        kfolds=cross_validation.KFold(len(X.T),fold_number)
        for jjj in neuron_range:
            for iii in learning_rate_range:
                for train,test in kfolds:
                    self.perf_data.loc[self.row_position,'hidden neurons']=jjj
                    self.perf_data.loc[self.row_position,'learning parameter']=iii
                    self.reset(len(X.T[0]),jjj,1)
                    self.predict(X[:,train])
                    self.perf_data.loc[self.row_position,0]=self.score(X[:,test],T[:,test])
                    for run in range(epoch_number):
                        self.grad_avg=self.backprop(X[:,train],T[:,train])
                        for j in range(len(self.layer_sizes)-1):
                            self.w[j]-=iii*self.grad_avg[j]
                        self.perf_data.loc[self.row_position,run+1]=self.score(X[:,test],T[:,test])
                    self.row_position+=1
                print('hidden neurons=',jjj,"\t learning parameter=",iii)
        return self.perf_data
    
    
    def perf_test2(self,X,T,epoch_number,learning_rate_range,neuron_range1,neuron_range2,fold_number=5):
        self.row_position=0
        number_of_rows=len(learning_rate_range)*len(neuron_range1)*len(neuron_range2)*fold_number
        self.perf_data=pd.DataFrame(index=list(range(number_of_rows)),columns=['hidden neurons 1','hidden neurons 2','learning parameter']+list(range(epoch_number+1)))
        kfolds=cross_validation.KFold(len(X.T),fold_number)
        for jjj in neuron_range1:
            for kkk in neuron_range2:
                for iii in learning_rate_range:
                    for train,test in kfolds:
                        self.perf_data.loc[self.row_position,'hidden neurons 1']=jjj
                        self.perf_data.loc[self.row_position,'hidden neurons 2']=kkk
                        self.perf_data.loc[self.row_position,'learning parameter']=iii
                        self.reset(len(X.T[0]),jjj,kkk,1)
                        self.predict(X[:,train])
                        self.perf_data.loc[self.row_position,0]=self.score(X[:,test],T[:,test])
                        for run in range(epoch_number):
                            self.grad_avg=self.backprop(X[:,train],T[:,train])
                            for j in range(len(self.layer_sizes)-1):
                                self.w[j]-=iii*self.grad_avg[j]
                            self.perf_data.loc[self.row_position,run+1]=self.score(X[:,test],T[:,test])
                        self.row_position+=1
                    print('hidden neurons 1 =',jjj,'\t hidden neurons 2 =',kkk,"\t learning parameter=",iii)
        return self.perf_data
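
A side note on speed: the triple loop in ```backprop``` fills each gradient entry with ```deltas[i+1][k,m]*self.z[i][j,m]``` and averages over the training points ```m```, which is the same thing as a single matrix product divided by the number of points.  The sketch below checks this on random stand-in arrays (```delta``` and ```z``` here are made-up arrays, not values taken from the class), so treat it as an illustration of the identity rather than a drop-in replacement.

```python
import numpy as np

rng = np.random.default_rng(2)
rows, cols, N = 4, 6, 10             # shape of one weight matrix, number of training points
delta = rng.normal(size=(rows, N))   # stand-in for deltas[i+1]
z = rng.normal(size=(cols, N))       # stand-in for self.z[i] (bias row included)

# Loop version, mirroring the structure of backprop.
grad_loop = np.zeros((rows, cols))
for m in range(N):
    for k in range(rows):
        for j in range(cols):
            grad_loop[k, j] += delta[k, m] * z[j, m] / N

# Vectorized version: the sum over training points is just a matrix product.
grad_vec = delta @ z.T / N

print(np.allclose(grad_loop, grad_vec))  # True
```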
                        
                    

As a really dumb first check to see that the neural network does what it's supposed to, let's construct a simple one and train it on a single input vector $X$ with target vector $T$, and then pass $X$ back through to see that we get something close to $T$:

In [6]:
X_pract=np.array([[0.1,0.8,0.5]]).T     # Remember to take the transpose, since we want them to be column vectors.
T_pract=np.array([[0.6,0.9,0.15,0.18]]).T

In [7]:
# Initializing the network, with one hidden layer containing 5 neurons.

net_pract=neural_net(3,5,4)                        


# Training the network on our single training point, using learning rate 0.1 and 200 epochs.
      
net_pract.fit(X_pract,T_pract,0.1,200)

We now pass the training vector back through to see if we get something similar to ```T_pract```:

In [8]:
net_pract.predict(X_pract)

array([[ 0.60000052],
       [ 0.8990522 ],
       [ 0.14982862],
       [ 0.18009814]])

which is pretty close to ```T_pract```, and hence we know with 100% certainty that there are absolutely no bugs in the above code.
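
A slightly less dumb check is to compare an analytic gradient against a finite-difference estimate.  The sketch below does this for a single sigmoid layer with the cross-entropy error only, using the "output minus target" delta formula; it is a standalone illustration with made-up data and helper names (```cross_entropy```, ```analytic```, ```numeric```), and it does not exercise the ```neural_net``` class itself.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(W, Z, T):
    """Cross-entropy error of one sigmoid layer, summed over outputs and averaged over points."""
    Y = sigmoid(W @ Z)
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y)) / Z.shape[1]

rng = np.random.default_rng(3)
Z = np.vstack((np.ones(6), rng.normal(size=(3, 6))))  # 6 inputs with the bias row on top
T = rng.uniform(0.1, 0.9, size=(2, 6))                # targets strictly between 0 and 1
W = rng.normal(size=(2, 4))                           # 2 outputs, 3 inputs plus bias column

# Analytic gradient: output deltas (Y - T) times the inputs, averaged over points.
analytic = (sigmoid(W @ Z) - T) @ Z.T / Z.shape[1]

# Numerical gradient by central differences, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for idx in np.ndindex(*W.shape):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[idx] += eps
    W_minus[idx] -= eps
    numeric[idx] = (cross_entropy(W_plus, Z, T) - cross_entropy(W_minus, Z, T)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be very small
```

The same idea extends to a full network by perturbing one entry of one weight matrix at a time and comparing against the corresponding entry of the back-propagated gradient.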

Now let's look at making predictions with the titanic data:

In [3]:
train_df=pd.read_csv('titanic_train.csv')
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# Function for formatting data.

def format_data(datafile,predictors):
    titanic=pd.read_csv(datafile)
    # Change 'male' and 'female' in the 'Sex' column to 0 and 1 respectively.
    titanic.loc[titanic['Sex']=='male','Sex']=0
    titanic.loc[titanic['Sex']=='female','Sex']=1
    # Creates columns 'Emb1' through 'Emb3' which will contain the point the passenger embarked at.  I.e. a passenger 
    # who embarks at 'C' will have a 1 in 'Emb1' column and 0 in 'Emb2' and 'Emb3', passengers who embarked at 'S' will
    # have a 1 in 'Emb2', and so on.
    titanic['Emb1']=0
    titanic['Emb2']=0
    titanic['Emb3']=0
    titanic.loc[titanic['Embarked']=='C','Emb1']=1
    titanic.loc[titanic['Embarked']=='S','Emb2']=1
    titanic.loc[titanic['Embarked']=='Q','Emb3']=1
    # Dropping the original 'Embarked' column.
    titanic=titanic.drop('Embarked',axis=1)
    # Drop any columns whose name is not in the list 'predictors'.
    for val in titanic.columns.values:
        if not(val in predictors+['Survived']):
            titanic=titanic.drop(val,axis=1)
    # In the columns from the list 'predictors', replace any NaN values with the median of the values in that column,
    # and normalize the data so that each feature has zero mean and unit variance.  This is not technically necessary
    # for the neural network to function, but if there are large values it will take much longer to train, so it's
    # better to normalize.
    for val in predictors:
        titanic[val] = titanic[val].fillna(titanic[val].median())
        titanic[val] = (titanic[val]-titanic[val].mean())/np.sqrt(titanic[val].var())
    return titanic
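
To see what the fill-and-normalize step at the end of ```format_data``` does to one column, here is a small sketch on a toy series.  The numbers are made up for the illustration and are not Titanic data.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (made-up numbers).
age = pd.Series([22.0, 38.0, np.nan, 35.0, 26.0])

age = age.fillna(age.median())                      # the gap is filled with the median of the observed values
age_std = (age - age.mean()) / np.sqrt(age.var())   # same formula as in format_data

print(age_std.mean())   # approximately 0
print(age_std.var())    # approximately 1 (pandas uses the sample variance, ddof=1)
```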

In [5]:
predictors=['Sex', 'Age', 'Fare', 'Pclass', 'SibSp', 'Parch','Emb1','Emb2','Emb3']

In [6]:
train_df=format_data('titanic_train.csv',predictors)
train_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Emb1,Emb2,Emb3
0,0,0.826913,-0.737281,-0.565419,0.43255,-0.473408,-0.502163,-0.481772,0.618959,-0.30739
1,1,-1.565228,1.354813,0.663488,0.43255,-0.473408,0.786404,2.073341,-1.613803,-0.30739
2,1,0.826913,1.354813,-0.258192,-0.474279,-0.473408,-0.48858,-0.481772,0.618959,-0.30739
3,1,-1.565228,1.354813,0.433068,0.43255,-0.473408,0.420494,-0.481772,0.618959,-0.30739
4,0,0.826913,-0.737281,0.433068,-0.474279,-0.473408,-0.486064,-0.481772,0.618959,-0.30739


In [7]:
# Convert the pandas data frame train_df into a set of feature vectors X and target variables T.  In the data frame 
# each passenger is a row, while we want each passenger to be a column vector of X, so we need to take the transpose
# of X (and we need to reshape T so that it's a 2-dimensional array with a single row).

X=train_df[predictors].values.T
T=np.reshape(train_df['Survived'].values.T,(1,-1))

We now want to use the ```perf_test``` functions to try to determine which parameters (number of hidden neurons, learning rate, number of epochs) to use for our model.  Obviously, the more parameters you test the longer it will take.  I've run this a couple of times with wide ranges of parameters, and it seems that a good range of neurons to try would be 16-20, and some good candidates for the learning parameter might be 1.2, 1.4, and 1.6.

To test these, we need to initialize a neural network which we'll call ```net```.  Because of the way ```perf_test``` is written, the initial configuration of neurons doesn't matter: it resets the network each time with the right number of input neurons, a single output neuron, and a number of hidden neurons taken from the supplied range:

In [16]:
net=neural_net(1,1,1)

Remember that the 50 below is the number of epochs used in each test, and the 5 is the number of folds to use for each combination of learning rate and neuron count.  This means that for each combination of neuron number and learning rate, it will train and test 5 different neural networks using those parameters, on different subsets of the training data, and will record how well each of these networks performs before training and after each of the 50 training epochs.  When I did this before, I used 200 epochs and 10 folds, but I didn't want this to take forever:

In [18]:
performance_data=net.perf_test(X,T,50,[1.2,1.4,1.6],[16,18,20],5)

hidden neurons= 16 	 learning parameter= 1.2
hidden neurons= 16 	 learning parameter= 1.4
hidden neurons= 16 	 learning parameter= 1.6
hidden neurons= 18 	 learning parameter= 1.2
hidden neurons= 18 	 learning parameter= 1.4
hidden neurons= 18 	 learning parameter= 1.6
hidden neurons= 20 	 learning parameter= 1.2
hidden neurons= 20 	 learning parameter= 1.4
hidden neurons= 20 	 learning parameter= 1.6
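
To make the fold bookkeeping concrete, here is a rough sketch of how the column indices can be split into training and hold-out sets.  The function ```kfold_indices``` is a hypothetical helper written with plain numpy for illustration; it is not the scikit-learn implementation, although it is similar in spirit (consecutive, unshuffled chunks).

```python
import numpy as np

def kfold_indices(n_points, n_folds):
    """Yield (train, test) index arrays, using consecutive chunks as hold-out sets.

    Hypothetical helper for illustration only.
    """
    folds = np.array_split(np.arange(n_points), n_folds)
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train, test

X_toy = np.arange(20).reshape(2, 10)   # 2 features, 10 data points stored as columns
for train, test in kfold_indices(X_toy.shape[1], 5):
    # X_toy[:, train] and X_toy[:, test] select the matching column vectors.
    print("hold-out columns:", test, " training columns:", train)
```

Every data point ends up in exactly one hold-out set, so each parameter combination is scored on all of the data, one chunk at a time.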


In [19]:
# Save the performance data as a csv in case the jupyter session ends unexpectedly.

performance_data.to_csv('cross validation data (practice).csv',index=False)

Let's look at how ```perf_test``` returns the data.  For each learning rate and number of hidden neurons, it returns a list of 51 scores, which reveal how well the network performed on the hold-out data after each training epoch.  Notice that each combination of parameters was tested 5 times (fold_number=5).

In [12]:
performance_data=pd.read_csv('cross validation data (practice).csv')
performance_data[:10]

Unnamed: 0,hidden neurons,learning parameter,0,1,2,3,4,5,6,7,...,41,42,43,44,45,46,47,48,49,50
0,16,1.2,0.670391,0.681564,0.72067,0.765363,0.731844,0.72067,0.73743,0.75419,...,0.793296,0.793296,0.793296,0.793296,0.793296,0.793296,0.793296,0.793296,0.793296,0.793296
1,16,1.2,0.539326,0.58427,0.640449,0.674157,0.685393,0.719101,0.724719,0.758427,...,0.831461,0.831461,0.831461,0.831461,0.831461,0.831461,0.831461,0.831461,0.831461,0.831461
2,16,1.2,0.61236,0.61236,0.617978,0.707865,0.730337,0.735955,0.758427,0.764045,...,0.780899,0.780899,0.780899,0.780899,0.780899,0.780899,0.780899,0.780899,0.780899,0.780899
3,16,1.2,0.595506,0.674157,0.730337,0.730337,0.741573,0.741573,0.758427,0.758427,...,0.775281,0.775281,0.775281,0.775281,0.775281,0.775281,0.775281,0.775281,0.775281,0.775281
4,16,1.2,0.735955,0.741573,0.747191,0.752809,0.769663,0.769663,0.769663,0.769663,...,0.820225,0.820225,0.820225,0.820225,0.820225,0.820225,0.820225,0.814607,0.814607,0.814607
5,16,1.4,0.335196,0.558659,0.586592,0.642458,0.681564,0.709497,0.703911,0.709497,...,0.787709,0.787709,0.787709,0.793296,0.787709,0.787709,0.787709,0.787709,0.787709,0.787709
6,16,1.4,0.516854,0.634831,0.747191,0.747191,0.780899,0.780899,0.780899,0.780899,...,0.764045,0.764045,0.764045,0.764045,0.764045,0.764045,0.764045,0.764045,0.764045,0.764045
7,16,1.4,0.595506,0.668539,0.719101,0.730337,0.758427,0.758427,0.769663,0.780899,...,0.808989,0.808989,0.803371,0.803371,0.803371,0.803371,0.803371,0.803371,0.803371,0.803371
8,16,1.4,0.382022,0.58427,0.589888,0.629213,0.640449,0.668539,0.713483,0.719101,...,0.764045,0.764045,0.764045,0.764045,0.764045,0.764045,0.764045,0.764045,0.764045,0.758427
9,16,1.4,0.646067,0.657303,0.61236,0.679775,0.719101,0.724719,0.792135,0.775281,...,0.831461,0.831461,0.831461,0.831461,0.831461,0.831461,0.831461,0.831461,0.825843,0.825843


Let's see what the maximum score is in the table:

In [17]:
performance_data.columns=['hidden neurons','learning parameter']+list(range(51)) # I reimported the data from the csv, and pandas treated the column names as strings instead of ints, so I'm renaming the columns as ints.
performance_data.loc[:,list(range(51))].max().max()

0.85955056179800005

This is a pretty high score.  Unfortunately, we sometimes get high scores by accident, just by choosing a subset of the data to test our network on that is particularly predictable (I've seen some as high as 90 percent).  It's more revealing to look at the averages of each of the tests performed over each neuron number/learning parameter pair.

To do this, we create a new data frame which contains the average scores for each epoch and each neuron number/learning rate pair:

In [22]:
averages=performance_data.groupby(['hidden neurons','learning parameter']).mean().reset_index()
averages

Unnamed: 0,hidden neurons,learning parameter,0,1,2,3,4,5,6,7,...,41,42,43,44,45,46,47,48,49,50
0,16,1.2,0.630707,0.658785,0.691325,0.726106,0.731762,0.737393,0.749733,0.76095,...,0.800232,0.800232,0.800232,0.800232,0.800232,0.800232,0.800232,0.799109,0.799109,0.799109
1,16,1.4,0.495129,0.620721,0.651026,0.685795,0.716088,0.728416,0.752018,0.753135,...,0.79125,0.79125,0.790126,0.791243,0.790126,0.790126,0.790126,0.790126,0.789003,0.787879
2,16,1.6,0.566669,0.623928,0.649802,0.704758,0.713778,0.744071,0.75419,0.754177,...,0.788996,0.79012,0.788996,0.788996,0.788996,0.788996,0.788996,0.791243,0.793491,0.794614
3,18,1.2,0.451315,0.628579,0.696918,0.701406,0.72836,0.744065,0.751924,0.765407,...,0.79125,0.792373,0.79125,0.790126,0.790126,0.790126,0.790126,0.79125,0.790132,0.790132
4,18,1.4,0.491777,0.610502,0.615096,0.673398,0.709334,0.730657,0.746369,0.764327,...,0.796855,0.797979,0.797979,0.797979,0.797979,0.796855,0.796855,0.795732,0.795732,0.795732
5,18,1.6,0.408618,0.667742,0.682336,0.71155,0.713797,0.723903,0.727274,0.747505,...,0.785632,0.784508,0.785626,0.784502,0.785626,0.786749,0.786749,0.786749,0.787873,0.787873
6,20,1.2,0.501494,0.543149,0.607194,0.677905,0.70489,0.718348,0.728441,0.736269,...,0.786768,0.787892,0.787892,0.789015,0.789015,0.789015,0.790139,0.790139,0.790139,0.790139
7,20,1.4,0.558778,0.613828,0.669996,0.710426,0.733984,0.755339,0.758684,0.771038,...,0.80472,0.803597,0.803597,0.80472,0.80472,0.80472,0.80472,0.80472,0.80472,0.80472
8,20,1.6,0.459174,0.603754,0.624079,0.679047,0.704921,0.7318,0.727368,0.753135,...,0.801375,0.800251,0.800251,0.798004,0.79801,0.796887,0.799127,0.800251,0.800251,0.800251


The maximum value in the table of averages is:

In [24]:
averages.loc[:,list(range(51))].max().max()

0.8080848659844001

Now to figure out which column the maximum is located in:

In [28]:
averages.loc[:,list(range(51))].max().idxmax()

29

and the row it's in:

In [26]:
averages.loc[:,29].idxmax()

0

So the maximum score was obtained in row 0 at column 29.  Double checking:

In [29]:
averages.loc[0,29]

0.8080848659844001

In [30]:
averages.loc[0,['hidden neurons','learning parameter']]

hidden neurons        16.0
learning parameter     1.2
Name: 0, dtype: float64
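
As an aside, the row and column of the maximum can also be found in one step by stacking the table into a Series indexed by (row, column) pairs.  The sketch below uses a toy table with made-up numbers; for the real table one would presumably first restrict to the epoch columns, e.g. with ```averages[list(range(51))]```.

```python
import pandas as pd

# Toy table of average scores (made-up numbers): two parameter pairs, three epochs.
toy = pd.DataFrame({0: [0.50, 0.55], 1: [0.70, 0.81], 2: [0.78, 0.79]})

row, col = toy.stack().idxmax()     # stack() gives a Series indexed by (row, column)
print(row, col, toy.loc[row, col])  # 1 1 0.81
```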

Hence, this test would seem to suggest that our network behaved the best on the data when there were 16 hidden neurons and a learning rate of 1.2, at epoch 29.  We could then train a network with this configuration on the entire titanic training data set, then compute the predictions of the test data and submit to Kaggle to see what we get.

In [33]:
titanic_net=neural_net(9,16,1)

In [34]:
titanic_net.fit(X,T,1.2,29)

In [35]:
test_df=format_data('titanic_test.csv',predictors)
test_full_data=pd.read_csv('titanic_test.csv')
test_df.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Emb1,Emb2,Emb3
0,0.872436,-0.755024,0.385769,-0.498872,-0.399769,-0.496818,-0.567462,-1.349059,2.840354
1,0.872436,1.321292,1.369729,0.616254,-0.399769,-0.511665,-0.567462,0.739484,-0.351227
2,-0.315441,-0.755024,2.550481,-0.498872,-0.399769,-0.463545,-0.567462,-1.349059,2.840354
3,0.872436,-0.755024,-0.204607,-0.498872,-0.399769,-0.481898,-0.567462,0.739484,-0.351227
4,0.872436,1.321292,-0.598191,0.616254,0.619154,-0.416992,-0.567462,0.739484,-0.351227


In [36]:
X_test=test_df[predictors].values.T

In [37]:
predictions=np.round(titanic_net.predict(X_test))

In [38]:
predictions

array([[ 0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,
         0.,  1.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  1.,  1.,  0.,
         1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,  1.,  0.,
         0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,
         1.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,
         1.,  1.,  0.,  0.,  1.,  1.,  0.,  1.,  0.,  1.,  1.,  0.,  1.,
         0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  1.,
         0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,
         1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  0.,  0.,
         1.,  1.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,
         0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  1.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,
         1.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  1.,
         1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.

In [39]:
prediction_data=pd.DataFrame({'PassengerId':test_full_data['PassengerId'],'Survived':predictions[0]})

In [43]:
prediction_data.Survived=prediction_data.Survived.astype(int)  # Important: the predictions we made above are floats,
                                                               # while Kaggle wants them as integers, so we must convert
                                                               # them to integers unless we want to get a score of 0.00.

In [44]:
prediction_data.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1


In [45]:
prediction_data.to_csv('titanic_predictions_practice.csv',index=False)

The above predictions only scored 0.75598 on the Kaggle data, so definitely not as high as the cross validation data we collected above might have indicated.  We could repeat the procedure above with a wider range of hidden neurons, learning rates, and epochs tested, or we could repeat it with the ```perf_test2``` function to test out networks with 2 hidden layers.  The best I've scored using these methods is 0.78469.