# **Part A**

## **Load and preprocess the data**

In [1]:
import pandas as pd
import numpy as np

# Load the dataset

data = pd.read_csv('Titanic Dataset.csv')

# Preprocessing

# NOTE: 'fare', 'embarked', and 'parch' may also be important variables (suggested by Jonas in class); consider trying them. See last year's Algorithms exam.

data['sex'] = data['sex'].astype('category').cat.codes
features = ['pclass', 'sex', 'age', 'sibsp']
X = data[features].fillna(data[features].mean())
y = data['survived']

# Split the dataset

np.random.seed(42)
train_indices = np.random.rand(len(X)) < 0.8
X_train = X[train_indices]
X_test = X[~train_indices]
y_train = y[train_indices]
y_test = y[~train_indices]


# Standardize the features

X_train_mean = X_train.mean(axis=0)
X_train_std = X_train.std(axis=0)
X_train = (X_train - X_train_mean) / X_train_std
X_test = (X_test - X_train_mean) / X_train_std

In [2]:
data.sample(5)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1139,3,0,"Rekic, Mr. Tido",1,38.0,0,0,349249,7.8958,,S,,,
1307,3,0,"Zakarian, Mr. Ortin",1,27.0,0,0,2670,7.225,,C,,,
53,1,0,"Carrau, Mr. Jose Pedro",1,17.0,0,0,113059,47.1,,S,,,"Montevideo, Uruguay"
356,2,0,"Butler, Mr. Reginald Fenton",1,25.0,0,0,234686,13.0,,S,,97.0,"Southsea, Hants"
17,1,1,"Baxter, Mrs. James (Helene DeLaudeniere Chaput)",0,50.0,0,1,PC 17558,247.5208,B58 B60,C,6.0,,"Montreal, PQ"


In [19]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   int8   
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       1309 non-null   int32  
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int32(1), int64(4), int8(1), object(5)
memory usage: 129.2+ KB


In [3]:
from sklearn.preprocessing import LabelEncoder

data['boat'] = LabelEncoder().fit_transform(data['boat'])

### What We Did Here:

**Loading the Dataset:**
- We loaded the Titanic dataset into a pandas DataFrame because it allows for easy manipulation and analysis.

**Preprocessing:**
- We converted the 'sex' column to numeric codes because models require numeric input.
- We selected relevant features ('pclass', 'sex', 'age', 'sibsp') for prediction.
- We filled missing values with the mean of the respective columns to handle NaNs because many models can't process missing data.
- We extracted the 'survived' column as our target variable for training the model.

**Splitting the Dataset:**
- We set a random seed for reproducibility, ensuring consistent results.
- We split the data into roughly 80% training and 20% test sets using a random boolean mask, so that the model is trained on one part of the data and evaluated on another.
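
Because the split draws an independent random number per row, the 80/20 proportion holds only approximately. A minimal sketch on synthetic row indices (the array `rows` and the size `n` are illustrative stand-ins, not the Titanic data):

```python
import numpy as np

n = 1000                         # illustrative row count, not the dataset size
rows = np.arange(n)              # stand-in for the rows of X

np.random.seed(42)               # same seeding pattern as the notebook
mask = np.random.rand(n) < 0.8   # each row is True with probability 0.8

train_rows = rows[mask]
test_rows = rows[~mask]

# Every row lands in exactly one of the two sets.
print(len(train_rows), len(test_rows))
print(f"train fraction: {len(train_rows) / n:.3f}")  # close to 0.8, not guaranteed to equal it
```

If an exact split size matters, shuffling the indices and slicing at a fixed position is a common alternative.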

**Standardizing the Features:**
- We calculated the mean and standard deviation of the training features.
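- We standardized the features using the training mean and standard deviation, so the training features have a mean of 0 and a standard deviation of 1 and the test features are on the same scale. This helps gradient descent converge and avoids leaking test-set statistics into training.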
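
The standardization step can be sketched on synthetic data as follows. The arrays `X_tr` and `X_te` are hypothetical stand-ins for the training and test features; the point is only that both sets are scaled with statistics from the training rows.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=30.0, scale=10.0, size=(80, 2))  # hypothetical training features
X_te = rng.normal(loc=30.0, scale=10.0, size=(20, 2))  # hypothetical test features

mu = X_tr.mean(axis=0)            # statistics come from the training rows only
sigma = X_tr.std(axis=0, ddof=1)  # ddof=1 matches the pandas default used in the notebook

X_tr_std = (X_tr - mu) / sigma
X_te_std = (X_te - mu) / sigma    # the test set reuses the training statistics

print(X_tr_std.mean(axis=0))  # approximately 0 for every column
print(X_te_std.mean(axis=0))  # near 0, but generally not exactly 0
```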

### What Would Happen if the Variables Changed?

**Loading a Different Dataset:**
- If a different dataset were loaded, the structure, column names, and the type of preprocessing needed might vary. Feature selection and handling of missing values might need adjustments based on the new dataset's characteristics.

**Changing the Features:**
- Modifying the features in the `features` list (e.g., adding 'fare', 'embarked', or 'parch') would impact the model input. Including more features could potentially improve model performance if those features are relevant, but it could also lead to overfitting if the features are not significant or introduce noise.

**Different Seed for Splitting:**
- Changing the seed value in `np.random.seed(42)` would result in a different train-test split. This could affect the model's performance slightly as the training and test sets would contain different samples.

**Different Split Ratio:**
- Altering the condition `np.random.rand(len(X)) < 0.8` to a different ratio (e.g., 0.7 for a 70-30 split) would change the size of the training and test sets. A smaller training set might reduce the model's ability to learn, while a smaller test set might not be representative enough to evaluate model performance accurately.

**Standardization Parameters:**
- If the means and standard deviations for standardization were calculated on the entire dataset rather than just the training set, it could lead to data leakage. This would mean that information from the test set would be used in training, leading to overly optimistic performance estimates.

**Handling Missing Values:**
- If missing values were filled with different statistics (e.g., median instead of mean), it could affect the distribution of the features and potentially the model's performance. Some models might be more sensitive to these changes than others.
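
To illustrate how the choice of statistic matters, here is a small sketch with a made-up `age` column containing one outlier. The values are hypothetical and chosen only to make the difference visible.

```python
import numpy as np

# Hypothetical age column with one large outlier and two missing entries.
age = np.array([22.0, 25.0, np.nan, 30.0, 80.0, np.nan])

mean_fill = np.nanmean(age)      # (22 + 25 + 30 + 80) / 4 = 39.25
median_fill = np.nanmedian(age)  # median of [22, 25, 30, 80] = 27.5

age_mean = np.where(np.isnan(age), mean_fill, age)
age_median = np.where(np.isnan(age), median_fill, age)

print(mean_fill, median_fill)
```

The mean is pulled toward the outlier, while the median stays close to the bulk of the values.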

## **Logistic regression model**

In [4]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Define the logistic regression function

def logistic_regression(X, y, num_steps, learning_rate):
    X = np.insert(X.values, 0, 1, axis=1)  # Add intercept
    weights = np.zeros(X.shape[1])
    
    for step in range(num_steps):
        z = np.dot(X, weights)
        predictions = sigmoid(z)
        
        gradient = np.dot(X.T, predictions - y) / y.size
        weights -= learning_rate * gradient

    return weights


# Training the logistic regression model

num_steps = 10000
learning_rate = 0.2
weights = logistic_regression(X_train, y_train, num_steps, learning_rate)

# Predictions

X_test_intercept = np.insert(X_test.values, 0, 1, axis=1)  
y_pred_prob = sigmoid(np.dot(X_test_intercept, weights))
y_pred_log_reg = (y_pred_prob >= 0.5).astype(int)

# Accuracy

accuracy_log_reg = np.mean(y_pred_log_reg == y_test)
print(f'Logistic Regression Accuracy: {accuracy_log_reg}')

Logistic Regression Accuracy: 0.7806691449814126


### What We Did Here:

**Sigmoid Function:**
- We defined the sigmoid function to map real values to probabilities between 0 and 1, essential for logistic regression.

**Logistic Regression Function:**
- We added an intercept term to the feature matrix to include a bias in our model.
- We initialized the weights to zero because it’s a standard starting point for optimization.
- We used gradient descent to iteratively update the weights by calculating predictions and gradients, minimizing the cost function over `num_steps` iterations.
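
One way to sanity-check the update rule is to compare the gradient expression against finite differences. The sketch below assumes the cost being minimized is the mean binary cross-entropy (the notebook does not state the cost explicitly) and uses synthetic data; `log_loss` and `analytic_gradient` are illustrative helper names.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_loss(w, X, y):
    # Mean binary cross-entropy; assumed to be the cost the update rule minimizes.
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_gradient(w, X, y):
    # Same expression as the gradient line in logistic_regression.
    return X.T @ (sigmoid(X @ w) - y) / y.size

rng = np.random.default_rng(0)
X_demo = np.insert(rng.normal(size=(50, 3)), 0, 1, axis=1)  # synthetic data plus intercept column
y_demo = (rng.random(50) < 0.5).astype(float)
w = rng.normal(size=4) * 0.1

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (log_loss(w + eps * e, X_demo, y_demo) - log_loss(w - eps * e, X_demo, y_demo)) / (2 * eps)
    for e in np.eye(4)
])

print(np.max(np.abs(numeric - analytic_gradient(w, X_demo, y_demo))))  # should be tiny
```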

**Training the Model:**
- We trained the logistic regression model with 10,000 iterations and a learning rate of 0.2 to optimize the weights.

**Making Predictions:**
- We added an intercept to the test data to align with our trained model.
- We predicted probabilities using the sigmoid function and converted them to binary outcomes (1 if probability >= 0.5, otherwise 0).

**Calculating Accuracy:**
- We computed accuracy by comparing our predictions to the actual test labels, giving us a measure of how well our model performed.

This process prepares and trains the logistic regression model, makes predictions, and evaluates its accuracy.

### What Would Happen if the Variables Changed?

**Different Number of Steps:**
- Changing `num_steps` would affect the training process. Fewer steps might lead to underfitting, while more steps could improve convergence but increase computation time.

**Different Learning Rate:**
- Modifying `learning_rate` impacts how quickly the model learns. A smaller learning rate could slow down convergence, while a larger learning rate might speed it up but risk overshooting the optimal weights.
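
The effect of the learning rate is easiest to see on a toy objective. The sketch below minimizes f(w) = w² rather than the logistic loss, so the specific rates are illustrative only and do not transfer to the model above.

```python
def gradient_descent(learning_rate, steps=20, w0=1.0):
    # Minimize f(w) = w**2, whose gradient is 2 * w.
    w = w0
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

small = gradient_descent(0.01)  # shrinks slowly: w is multiplied by 0.98 each step
good = gradient_descent(0.4)    # shrinks quickly: w is multiplied by 0.2 each step
large = gradient_descent(1.1)   # overshoots: w is multiplied by -1.2 each step

print(abs(small), abs(good), abs(large))
```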

**Different Initialization of Weights:**
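- Initializing weights differently could affect convergence speed. Because the logistic regression cost function is convex, gradient descent with a suitable learning rate heads toward the same minimum from any starting point, so the final weights should be essentially the same.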

**Different Threshold for Predictions:**
- Changing the threshold for converting probabilities into binary predictions would alter the predicted labels, impacting accuracy and other evaluation metrics.
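
A small sketch of how the threshold changes the predicted labels and the resulting accuracy. The probabilities and labels below are made up for illustration and are not model output.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels (not model output).
probs = np.array([0.10, 0.35, 0.45, 0.55, 0.70, 0.90])
labels = np.array([0, 0, 1, 0, 1, 1])

def accuracy_at(threshold):
    # Predict class 1 when the probability reaches the threshold.
    preds = (probs >= threshold).astype(int)
    return np.mean(preds == labels)

for threshold in (0.3, 0.5, 0.7):
    print(threshold, accuracy_at(threshold))
```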

**Different Preprocessing:**
- If the features were standardized differently or if different features were selected, the model performance might vary. Proper standardization ensures equal contribution from features, and relevant feature selection can improve accuracy.

**Handling Missing Values Differently:**
- Filling missing values with different methods could affect data distribution and model performance. Different imputation methods might be more suitable depending on feature distributions and data characteristics.

## **[4,5,2] Neural network**

In [5]:
import numpy as np
import pandas as pd

# Convert labels to categorical one-hot encoding

def to_categorical(labels, num_classes):
    return np.eye(num_classes)[labels]

y_train_cat = to_categorical(y_train.values, num_classes=2)
y_test_cat = to_categorical(y_test.values, num_classes=2)

# Define the neural network structure

class SimpleNN:
    def __init__(self, input_dim, hidden_units, output_units):
        self.weights1 = np.random.randn(input_dim, hidden_units) * 0.01   # Weights and bias from input layer to hidden layer.
        self.bias1 = np.zeros(hidden_units)
        self.weights2 = np.random.randn(hidden_units, output_units) * 0.01 # Weights and bias from hidden layer to output layer.
        self.bias2 = np.zeros(output_units)

    def relu(self, x):
        return np.maximum(0, x)
        # This function will return 0 if the input is less than 0, and will return the same number if the input is positive

    def softmax(self, x):
        x = np.clip(x, -500, 500)
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)
        # Normalizes the output into a probability distribution.

    def forward(self, x):
        self.z1 = np.dot(x, self.weights1) + self.bias1  # Input to hidden layer
        self.a1 = self.relu(self.z1)  # Activation of hidden layer
        self.z2 = np.dot(self.a1, self.weights2) + self.bias2  # Hidden to output layer
        self.a2 = self.softmax(self.z2)  # Output layer activation
        return self.a2
        # With one hidden layer

    def backward(self, x, y, output, learning_rate):
        m = y.shape[0]
        dz2 = output - y
        dw2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0) / m

        dz1 = np.dot(dz2, self.weights2.T) * (self.a1 > 0)
        dw1 = np.dot(x.T, dz1) / m
        db1 = np.sum(dz1, axis=0) / m

        self.weights2 -= learning_rate * dw2
        self.bias2 -= learning_rate * db2
        self.weights1 -= learning_rate * dw1
        self.bias1 -= learning_rate * db1
        # One hidden layer

    def train(self, x, y, epochs, learning_rate):
        for epoch in range(epochs):
            for i in range(x.shape[0]):
                x_sample = x[i:i+1]
                y_sample = y[i:i+1]
                output = self.forward(x_sample)
                self.backward(x_sample, y_sample, output, learning_rate)
        # Stochastic gradient descent - iterating through each sample

    def evaluate(self, x, y):
        output = self.forward(x)
        predictions = np.argmax(output, axis=1)
        targets = np.argmax(y, axis=1)
        accuracy = np.mean(predictions == targets)
        return accuracy

# Normalize input data

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize and train the neural network

input_dim = X_train.shape[1]
hidden_units = 5
output_units = 2
learning_rate = 0.11
epochs = 100

nn = SimpleNN(input_dim, hidden_units, output_units)
nn.train(X_train, y_train_cat, epochs, learning_rate)

# Evaluate the neural network

accuracy_nn = nn.evaluate(X_test, y_test_cat)
print(f'Neural Network Accuracy: {accuracy_nn}')

Neural Network Accuracy: 0.7732342007434945


### What We Did Here:

**Convert Labels to Categorical One-Hot Encoding:**
- We converted the target labels into categorical one-hot encoding to use with the neural network. This transforms each label into a binary vector representing the classes.
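
The one-hot conversion works by indexing rows of an identity matrix. A short sketch with a made-up label vector:

```python
import numpy as np

def to_categorical(labels, num_classes):
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes)[labels]

labels = np.array([0, 1, 1, 0])
one_hot = to_categorical(labels, num_classes=2)
print(one_hot)
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]
#  [1. 0.]]

# argmax recovers the original integer labels.
print(np.argmax(one_hot, axis=1))  # [0 1 1 0]
```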

**Define the Neural Network Structure:**
- We defined a simple neural network class `SimpleNN` with one hidden layer.
  - **Initialization:** Initialized weights and biases for the layers with small random values to start the training process.
  - **Activation Functions:** Defined ReLU for the hidden layer and softmax for the output layer to produce probabilities.
  - **Forward Propagation:** Implemented forward propagation to compute the activations through the network layers.
  - **Backward Propagation:** Implemented backward propagation to update weights and biases using gradient descent.
  - **Training:** Defined a training method to iterate through samples and update weights for each epoch.
  - **Evaluation:** Added a method to evaluate the network's accuracy on test data by comparing predictions to actual labels.
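
The backward pass starts from `dz2 = output - y`. Assuming the training objective is cross-entropy on the softmax output (the notebook does not name the loss), this expression is the gradient of the loss with respect to the output-layer logits, which the following sketch checks numerically for one arbitrary sample.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

def cross_entropy(z, y):
    # Loss for one-hot rows y given logits z; assumed to be the training objective.
    return -np.sum(y * np.log(softmax(z)))

z = np.array([[0.3, -1.2]])  # arbitrary logits for one sample
y = np.array([[0.0, 1.0]])   # one-hot target

analytic = softmax(z) - y    # same form as dz2 in backward

eps = 1e-6
numeric = np.zeros_like(z)
for j in range(z.shape[1]):
    step = np.zeros_like(z)
    step[0, j] = eps
    numeric[0, j] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * eps)

print(analytic, numeric)  # the two should agree closely
```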

**Normalize Input Data:**
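- We passed the input data through `StandardScaler`, which standardizes features by removing the mean and scaling to unit variance. Because the features were already standardized manually in the first cell, this step changes them only slightly (`StandardScaler` uses the population standard deviation, whereas pandas `.std()` defaults to the sample standard deviation); its main practical effect is converting the DataFrames to NumPy arrays. Standardized inputs help speed up convergence.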
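
Since the features were already standardized with pandas in the first cell, applying `StandardScaler` on top only rescales them by a constant factor: pandas `.std()` defaults to `ddof=1`, while `StandardScaler` uses the population standard deviation (`ddof=0`). A NumPy-only sketch of that effect on a hypothetical feature column:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)                 # hypothetical feature column

z1 = (x - x.mean()) / x.std(ddof=1)      # pandas-style standardization (sample std)
z2 = (z1 - z1.mean()) / z1.std(ddof=0)   # StandardScaler-style standardization on top

n = x.size
print(z2.std(ddof=0))                               # 1.0 up to rounding
print(np.allclose(z2, z1 * np.sqrt(n / (n - 1))))   # True
```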

**Initialize and Train the Neural Network:**
- Initialized the neural network with the specified input dimensions, number of hidden units, and output units.
- Trained the network with the training data for 100 epochs and a learning rate of 0.11.

**Evaluate the Neural Network:**
- Evaluated the trained neural network on the test data and calculated the accuracy, which measures the model's performance.

### What Would Happen if the Variables Changed?

**Different Number of Hidden Units:**
- Changing `hidden_units` (e.g., increasing to 10 or decreasing to 3) would alter the network's capacity to learn patterns. More hidden units can capture more complex relationships but may lead to overfitting; fewer units might not capture the data's complexity well.
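
Capacity can be made concrete by counting trainable parameters. The helper `count_parameters` below is a hypothetical illustration for fully connected layers with biases, as in `SimpleNN`.

```python
def count_parameters(layer_sizes):
    # Each pair of consecutive layers contributes a weight matrix and a bias vector.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(count_parameters([4, 5, 2]))   # 4*5 + 5 + 5*2 + 2 = 37
print(count_parameters([4, 10, 2]))  # 4*10 + 10 + 10*2 + 2 = 72
print(count_parameters([4, 3, 2]))   # 4*3 + 3 + 3*2 + 2 = 23
```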

**Different Learning Rate:**
- Modifying `learning_rate` (e.g., to 0.05 or 0.2) impacts how quickly the model learns. A smaller learning rate could slow down training, while a larger learning rate might cause the training to become unstable and possibly diverge.

**Different Number of Epochs:**
- Changing `epochs` (e.g., to 50 or 200) would affect how long the network trains. Fewer epochs might result in underfitting, while more epochs can improve performance but might lead to overfitting if the model learns the noise in the training data.

**Different Initialization of Weights:**
- Initializing weights differently (e.g., with different scales or methods) could affect convergence speed and final performance. Different initializations might lead to different local minima.

**Different Normalization:**
- Using different normalization techniques (e.g., min-max scaling instead of standard scaling) could affect training. Proper normalization ensures features contribute equally, and different techniques might suit different data distributions better.

**Handling Missing Values Differently:**
- If missing values in the input data were handled differently (e.g., using median instead of mean for imputation), it could impact the distribution and consequently the model's performance. Different imputation methods might be more suitable based on the feature distributions and data characteristics.

These changes would impact the training process, convergence, and final performance of the neural network model.

## **Comparing performances**

In [6]:
print(f'Logistic Regression Accuracy: {accuracy_log_reg}')
print(f'Neural Network Accuracy: {accuracy_nn}')

Logistic Regression Accuracy: 0.7806691449814126
Neural Network Accuracy: 0.7732342007434945


### **Conclusions:** 

Logistic regression (78.07% accuracy) slightly outperformed the neural network (77.32% accuracy). The gap is small (two test samples out of 269, judging by the reported accuracies), so it is weak evidence at best that the relationship between these four features and survival is close to linear. The neural network might benefit from further tuning and feature engineering. Both models reach similar accuracy on this split; logistic regression does so with a simpler model.
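
To put the size of the gap in context, the sketch below computes a rough binomial standard error for the accuracy. The test-set size of 269 is inferred rather than printed in the notebook: 210/269 and 208/269 reproduce the two reported accuracies.

```python
import math

n_test = 269                 # inferred: 210/269 and 208/269 match the reported accuracies
acc_log_reg = 210 / n_test
acc_nn = 208 / n_test

# Binomial standard error of an accuracy estimate on n_test samples.
se = math.sqrt(acc_log_reg * (1 - acc_log_reg) / n_test)

print(f"difference: {acc_log_reg - acc_nn:.4f}")
print(f"standard error: {se:.4f}")
```

The difference between the two models is smaller than one standard error, so a different random split could plausibly reverse the ranking.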

# **Part B**

## **Adding an Additional Variable and Optimizing the Network Layout**

## **[5,5,2] Neural network**

In [7]:
# The same features as before plus 'boat'

data['sex'] = data['sex'].astype('category').cat.codes
features = ['pclass', 'sex', 'age', 'sibsp', 'boat']
X = data[features].fillna(data[features].mean())
y = data['survived']

# Split the dataset

np.random.seed(42)
train_indices = np.random.rand(len(X)) < 0.8
X_train = X[train_indices]
X_test = X[~train_indices]
y_train = y[train_indices]
y_test = y[~train_indices]

# Standardize the features

X_train_mean = X_train.mean(axis=0)
X_train_std = X_train.std(axis=0)
X_train = (X_train - X_train_mean) / X_train_std
X_test = (X_test - X_train_mean) / X_train_std

# Convert labels to categorical one-hot encoding

def to_categorical(labels, num_classes):
    return np.eye(num_classes)[labels]

y_train_cat = to_categorical(y_train.values, num_classes=2)
y_test_cat = to_categorical(y_test.values, num_classes=2)

# Define the neural network structure

class SimpleNN:
    def __init__(self, input_dim, hidden_units, output_units):
        self.weights1 = np.random.rand(input_dim, hidden_units)   # Weights and bias from input layer to hidden layer.
        self.bias1 = np.zeros(hidden_units)
        self.weights2 = np.random.rand(hidden_units, output_units) # Weights and bias from hidden layer to output layer.
        self.bias2 = np.zeros(output_units)

    def relu(self, x):
        return np.maximum(0, x)
    
        # This function will return 0 if the input is less than 0, and will return the same number if the input is positive

    def softmax(self, x):
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)
    
        # Normalizes the output into a probability distribution.

    def forward(self, x):
        self.z1 = np.dot(x, self.weights1) + self.bias1  # Input to hidden layer
        self.a1 = self.relu(self.z1)  # Activation of hidden layer
        self.z2 = np.dot(self.a1, self.weights2) + self.bias2  # Hidden to output layer
        self.a2 = self.softmax(self.z2)  # Output layer activation
        return self.a2

    def backward(self, x, y, output, learning_rate):
        m = y.shape[0]
        dz2 = output - y
        dw2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0) / m

        dz1 = np.dot(dz2, self.weights2.T) * (self.a1 > 0)
        dw1 = np.dot(x.T, dz1) / m
        db1 = np.sum(dz1, axis=0) / m

        self.weights2 -= learning_rate * dw2
        self.bias2 -= learning_rate * db2
        self.weights1 -= learning_rate * dw1
        self.bias1 -= learning_rate * db1

    def train(self, x, y, epochs):
        for epoch in range(epochs):
            output = self.forward(x)
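            self.backward(x, y, output, learning_rate)  # NOTE: learning_rate is the global set below, not a parameter
        # Full-batch gradient descent: one weight update per epoch using all samples.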
                
    def evaluate(self, x, y):
        output = self.forward(x)
        predictions = np.argmax(output, axis=1)
        targets = np.argmax(y, axis=1)
        accuracy = np.mean(predictions == targets)
        return accuracy

# Initialize and train the neural network

input_dim = X_train.shape[1]
hidden_units = 5
output_units = 2
learning_rate = .3
epochs = 100

nn = SimpleNN(input_dim, hidden_units, output_units)
nn.train(X_train.values, y_train_cat, epochs)

# The network above has two softmax output nodes and is trained with full-batch gradient descent.

# Evaluate the neural network

accuracy_nn = nn.evaluate(X_test.values, y_test_cat)
print(f'Neural Network Accuracy: {accuracy_nn}')

Neural Network Accuracy: 0.9553903345724907


In [18]:
class SimpleNN:
    def __init__(self, input_dim, hidden_units, output_units):
        self.weights1 = np.random.rand(input_dim, hidden_units)   # Weights and bias from input layer to hidden layer.
        self.bias1 = np.zeros(hidden_units)
        self.weights2 = np.random.rand(hidden_units, output_units) # Weights and bias from hidden layer to output layer.
        self.bias2 = np.zeros(output_units)

    def relu(self, x):
        return np.maximum(0, x)
    
    def softmax(self, x):
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)

    def forward(self, x):
        self.z1 = np.dot(x, self.weights1) + self.bias1  # Input to hidden layer
        self.a1 = self.relu(self.z1)  # Activation of hidden layer
        self.z2 = np.dot(self.a1, self.weights2) + self.bias2  # Hidden to output layer
        self.a2 = self.softmax(self.z2)  # Output layer activation
        return self.a2

    def backward(self, x, y, output, learning_rate):
        m = y.shape[0]
        dz2 = output - y
        dw2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0) / m

        dz1 = np.dot(dz2, self.weights2.T) * (self.a1 > 0)
        dw1 = np.dot(x.T, dz1) / m
        db1 = np.sum(dz1, axis=0) / m

        self.weights2 -= learning_rate * dw2
        self.bias2 -= learning_rate * db2
        self.weights1 -= learning_rate * dw1
        self.bias1 -= learning_rate * db1

    def train(self, x, y, epochs, learning_rate):
        for epoch in range(epochs):
            for i in range(x.shape[0]):
                x_sample = x[i:i+1]
                y_sample = y[i:i+1]
                output = self.forward(x_sample)
                self.backward(x_sample, y_sample, output, learning_rate)
                #Using stochastic gradient descent.
                
    def evaluate(self, x, y):
        output = self.forward(x)
        predictions = np.argmax(output, axis=1)
        targets = np.argmax(y, axis=1)
        accuracy = np.mean(predictions == targets)
        return accuracy

# Initialize and train the neural network

input_dim = X_train.shape[1]
hidden_units = 5
output_units = 2
learning_rate = .2
epochs = 100

nn = SimpleNN(input_dim, hidden_units, output_units)
nn.train(X_train.values, y_train_cat, epochs, learning_rate)

# Evaluate the neural network

accuracy_nn = nn.evaluate(X_test.values, y_test_cat)
print(f'Neural Network Accuracy: {accuracy_nn}') 

Neural Network Accuracy: 0.9442379182156134


### What We Did Here:

**Preprocessing:**
- We converted the 'sex' column to numeric codes because models require numeric input.
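- We selected the Part A features plus 'boat' ('pclass', 'sex', 'age', 'sibsp', 'boat') for prediction. 'boat' was label-encoded earlier with `LabelEncoder`; note that the variables flagged by Jonas in class were 'fare', 'embarked', and 'parch'.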
- We filled missing values with the mean to handle NaNs because many models can't process missing data.
- We extracted the 'survived' column as our target variable for training the model.

**Splitting the Dataset:**
- We set a random seed for reproducibility, ensuring consistent results.
- We split the data into 80% training and 20% test sets to train the model on one part of the data and evaluate it on another.

**Standardizing the Features:**
- We calculated the mean and standard deviation of the training features.
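- We standardized the features using the training mean and standard deviation, so the training features have a mean of 0 and a standard deviation of 1 and the test features are on the same scale. This helps gradient descent converge and avoids leaking test-set statistics into training.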

**Convert Labels to Categorical One-Hot Encoding:**
- We converted the target labels into categorical one-hot encoding to use with the neural network. This transforms each label into a binary vector representing the classes.

**Define the Neural Network Structure:**
- We defined a simple neural network class `SimpleNN` with one hidden layer.
  - **Initialization:** Initialized the weights with uniform random values in [0, 1) (`np.random.rand`, without the 0.01 scaling used in Part A) and the biases with zeros.
  - **Activation Functions:** Defined ReLU for the hidden layer and softmax for the output layer to produce probabilities.
  - **Forward Propagation:** Implemented forward propagation to compute the activations through the network layers.
  - **Backward Propagation:** Implemented backward propagation to update weights and biases using gradient descent.
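  - **Training:** Defined two training variants that update the weights each epoch: the first cell uses full-batch gradient descent (one update per epoch over all samples) and the second uses stochastic gradient descent (one update per sample).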
  - **Evaluation:** Added a method to evaluate the network's accuracy on test data by comparing predictions to actual labels.
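
Full-batch and stochastic gradient descent can both be seen as special cases of a mini-batch loop. The sketch below counts weight updates per epoch for each case; `count_updates` and the sample count are illustrative, not taken from the notebook.

```python
def count_updates(n_samples, batch_size):
    # Mirrors a `for i in range(0, n_samples, batch_size)` loop: one update per slice.
    return len(range(0, n_samples, batch_size))

n_samples = 1000  # illustrative size, not the actual training set

print(count_updates(n_samples, batch_size=n_samples))  # full batch: 1 update per epoch
print(count_updates(n_samples, batch_size=1))          # stochastic: 1000 updates per epoch
print(count_updates(n_samples, batch_size=32))         # mini-batch: 32 updates, the last on a partial batch
```

With the same number of epochs, the stochastic variant therefore performs far more weight updates than the full-batch variant, which is one reason the two runs are not directly comparable.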

**Initialize and Train the Neural Network:**
- Initialized the neural network with the specified input dimensions, number of hidden units, and output units.
- Trained the network on the training data for 100 epochs, with a learning rate of 0.3 for the full-batch run and 0.2 for the stochastic run.

**Evaluate the Neural Network:**
- Evaluated the trained neural network on the test data and calculated the accuracy, which measures the model's performance.

### What Would Happen if the Variables Changed?

**Different Number of Hidden Units:**
- Changing `hidden_units` (e.g., increasing to 10 or decreasing to 3) would alter the network's capacity to learn patterns. More hidden units can capture more complex relationships but may lead to overfitting; fewer units might not capture the data's complexity well.

**Different Learning Rate:**
- Modifying `learning_rate` (e.g., to 0.1 or 0.5) impacts how quickly the model learns. A smaller learning rate could slow down training, while a larger learning rate might cause the training to become unstable and possibly diverge.

**Different Number of Epochs:**
- Changing `epochs` (e.g., to 50 or 200) would affect how long the network trains. Fewer epochs might result in underfitting, while more epochs can improve performance but might lead to overfitting if the model learns the noise in the training data.

**Different Initialization of Weights:**
- Initializing weights differently could affect convergence speed and final performance. Different initializations might lead to different local minima.

**Different Normalization:**
- Using different normalization techniques (e.g., min-max scaling instead of standard scaling) could affect training. Proper normalization ensures features contribute equally, and different techniques might suit different data distributions better.

**Handling Missing Values Differently:**
- If missing values in the input data were handled differently (e.g., using median instead of mean for imputation), it could impact the distribution and consequently the model's performance. Different imputation methods might be more suitable based on the feature distributions and data characteristics.

These changes would impact the training process, convergence, and final performance of the neural network model.

### **Conclusions:** 

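The neural network achieved 95.54% accuracy with full-batch gradient descent and 94.42% with stochastic gradient descent, well above the 78.07% of the Part A logistic regression. The comparison is not like-for-like, however: the Part A models did not use 'boat', and lifeboat assignment is closely tied to survival, so most of the gain likely comes from the added feature rather than from the network capturing complex, non-linear relationships. Refitting logistic regression on the same five features would be needed to separate the two effects.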

# **Part C**

### **Why do we have two output nodes? What happens if you just use one? Can you adapt a network with just one output node to this problem?**

In a binary classification problem, you can use either one or two output nodes. Two output nodes with a softmax activation give a probability for each class, and the two probabilities sum to 1, so the second node carries no extra information. A single output node with a sigmoid activation outputs the probability of the positive class, which is thresholded (e.g., at 0.5) to determine the class; this is mathematically equivalent to the two-node softmax, where the probability of class 1 equals the sigmoid of the difference between the two logits. Both approaches work, but a single sigmoid output has fewer parameters and is usually sufficient for binary classification. Adapting the network requires replacing softmax with sigmoid, using the 0/1 labels directly instead of one-hot vectors, and predicting class 1 when the output is at least 0.5, as implemented below.
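
The relationship between the two designs can be checked numerically: for two classes, the softmax probability of class 1 equals the sigmoid of the difference between the two logits. The logits below are arbitrary illustrative values.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([0.4, 1.7])  # arbitrary logits for class 0 and class 1

p_two_nodes = softmax(z)[1]        # P(class 1) from two softmax outputs
p_one_node = sigmoid(z[1] - z[0])  # P(class 1) from one sigmoid output

print(p_two_nodes, p_one_node)  # equal up to floating-point rounding
```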

## **Understanding the Output Nodes**

In [9]:
# Split the dataset

np.random.seed(42)
train_indices = np.random.rand(len(X)) < 0.8
X_train = X[train_indices]
X_test = X[~train_indices]
y_train = y[train_indices]
y_test = y[~train_indices]

# Standardize the features

X_train_mean = X_train.mean(axis=0)
X_train_std = X_train.std(axis=0)
X_train = (X_train - X_train_mean) / X_train_std
X_test = (X_test - X_train_mean) / X_train_std

# Define the neural network with one output node

class SimpleNNOneOutput:
    def __init__(self, input_dim, hidden_units):
        self.weights1 = np.random.rand(input_dim, hidden_units)
        self.bias1 = np.zeros(hidden_units)
        self.weights2 = np.random.rand(hidden_units, 1)
        self.bias2 = np.zeros(1)

    def relu(self, x):
        return np.maximum(0, x)
            # This function will return 0 if the input is less than 0, and will return the same number if the input is positive

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    

    def forward(self, x):
        self.z1 = np.dot(x, self.weights1) + self.bias1
        self.a1 = self.relu(self.z1)
        self.z2 = np.dot(self.a1, self.weights2) + self.bias2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, x, y, output, learning_rate):
        m = y.shape[0]
        dz2 = output - y.reshape(-1, 1)
        dw2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0) / m

        dz1 = np.dot(dz2, self.weights2.T) * (self.a1 > 0)
        dw1 = np.dot(x.T, dz1) / m
        db1 = np.sum(dz1, axis=0) / m

        self.weights2 -= learning_rate * dw2
        self.bias2 -= learning_rate * db2
        self.weights1 -= learning_rate * dw1
        self.bias1 -= learning_rate * db1

    def train(self, x, y, epochs, batch_size, learning_rate):
        for epoch in range(epochs):
            for i in range(0, x.shape[0], batch_size):
                x_batch = x[i:i + batch_size]
                y_batch = y[i:i + batch_size]
                output = self.forward(x_batch)
                self.backward(x_batch, y_batch, output, learning_rate)

    def evaluate(self, x, y):
        output = self.forward(x)
        predictions = (output >= 0.5).astype(int)
        accuracy = np.mean(predictions == y.reshape(-1, 1))
        return accuracy


# Initialize and train the neural network

input_dim = X_train.shape[1]
hidden_units = 6
learning_rate = 0.2
epochs = 100
batch_size = 10

nn = SimpleNNOneOutput(input_dim, hidden_units)
nn.train(X_train.values, y_train.values, epochs, batch_size, learning_rate)

# Evaluate the neural network

accuracy_nn_one_output = nn.evaluate(X_test.values, y_test.values)
print(f'Neural Network with One Output Node Accuracy: {accuracy_nn_one_output}')

Neural Network with One Output Node Accuracy: 0.966542750929368


### What We Did Here:

**Splitting the Dataset:**
- We set a random seed for reproducibility, ensuring consistent results.
- We split the data into 80% training and 20% test sets to train the model on one part of the data and evaluate it on another.

**Standardizing the Features:**
- We calculated the mean and standard deviation of the training features.
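- We standardized the features using the training mean and standard deviation, so the training features have a mean of 0 and a standard deviation of 1 and the test features are on the same scale. This helps gradient descent converge and avoids leaking test-set statistics into training.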

**Define the Neural Network with One Output Node:**
- We defined a simple neural network class `SimpleNNOneOutput` with one hidden layer and a single output node.
  - **Initialization:** Initialized the weights with uniform random values in [0, 1) (`np.random.rand`) and the biases with zeros.
  - **Activation Functions:** Defined ReLU for the hidden layer and sigmoid for the output layer to produce probabilities.
  - **Forward Propagation:** Implemented forward propagation to compute the activations through the network layers.
  - **Backward Propagation:** Implemented backward propagation to update weights and biases using gradient descent.
  - **Training:** Defined a training method to iterate through mini-batches of data and update weights for each epoch using mini-batch gradient descent.
  - **Evaluation:** Added a method to evaluate the network's accuracy on test data by comparing predictions to actual labels.

**Initialize and Train the Neural Network:**
- Initialized the neural network with the specified input dimensions, number of hidden units, and a single output unit.
- Trained the network with the training data for 100 epochs, using a batch size of 10 and a learning rate of 0.2.

**Evaluate the Neural Network:**
- Evaluated the trained neural network on the test data and calculated the accuracy, which measures the model's performance.

### What Would Happen if the Variables Changed?

**Different Number of Hidden Units:**
- Changing `hidden_units` (e.g., increasing to 10 or decreasing to 3) would alter the network's capacity to learn patterns. More hidden units can capture more complex relationships but may lead to overfitting; fewer units might not capture the data's complexity well.

**Different Learning Rate:**
- Modifying `learning_rate` (e.g., to 0.1 or 0.3) impacts how quickly the model learns. A smaller learning rate could slow down training, while a larger learning rate might cause the training to become unstable and possibly diverge.
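
To make the stability point concrete, here is a minimal sketch on a toy one-dimensional loss, f(w) = w², rather than on the network itself. The function name `descend`, the starting point, and the step count are illustrative assumptions, not part of the notebook.

```python
def descend(learning_rate, steps=20, w0=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2 * w                # derivative of w**2
        w -= learning_rate * grad
    return w

# Each step multiplies w by (1 - 2 * learning_rate):
# a factor below 1 in magnitude shrinks w toward the minimum at 0,
# a factor above 1 in magnitude makes w grow without bound.
small = descend(0.1)   # factor 0.8 per step: converges
large = descend(1.1)   # factor -1.2 per step: oscillates and diverges
```

The threshold at which training becomes unstable depends on the loss surface, so the values 0.1 and 1.1 say nothing about which learning rates are safe for the Titanic network.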

**Different Number of Epochs:**
- Changing `epochs` (e.g., to 50 or 200) would affect how long the network trains. Fewer epochs might result in underfitting, while more epochs can improve performance but might lead to overfitting if the model learns the noise in the training data.

**Different Batch Size:**
- Adjusting `batch_size` (e.g., to 5 or 20) impacts the number of samples used to update the model weights at each step. Smaller batch sizes can provide more frequent updates but may be noisier, while larger batch sizes might provide more stable updates but can require more memory.
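
The link between batch size and update frequency can be sketched with a small helper. The name `updates_per_epoch` and the sample count of 1000 are assumptions for illustration; the notebook's actual training-set size depends on the random split.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the training data."""
    # The last batch may be smaller than batch_size, hence the ceiling.
    return math.ceil(n_samples / batch_size)

# n_samples = 1000 is a made-up round number, not the notebook's training-set size.
for bs in (5, 10, 20):
    print(bs, updates_per_epoch(1000, bs))
```

Halving the batch size doubles the number of updates per epoch, which is why smaller batches train with more, but noisier, steps.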

**Different Initialization of Weights:**
- Initializing weights differently could affect convergence speed and final performance. Different initializations might lead to different local minima.

**Different Normalization:**
- Using different normalization techniques (e.g., min-max scaling instead of standard scaling) could affect training. Proper normalization ensures features contribute equally, and different techniques might suit different data distributions better.
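
As a point of comparison with the standardization used in this notebook, here is a minimal sketch of min-max scaling. The helper `min_max_scale` and the toy frames are hypothetical; like the notebook's standardization, the sketch takes its statistics from the training set only.

```python
import pandas as pd

def min_max_scale(train, test):
    """Scale both frames to the training set's [0, 1] range, column by column."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range = col_range.replace(0, 1)   # avoid division by zero for constant columns
    return (train - col_min) / col_range, (test - col_min) / col_range

# Toy frames standing in for X_train / X_test (values are made up).
toy_train = pd.DataFrame({"age": [20.0, 30.0, 40.0], "pclass": [1, 2, 3]})
toy_test = pd.DataFrame({"age": [50.0], "pclass": [2]})
scaled_train, scaled_test = min_max_scale(toy_train, toy_test)
```

Test values can land outside [0, 1] when they exceed the training range, as the age of 50 does here. That is one practical difference from standardization, which has no fixed bounds to begin with.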

**Handling Missing Values Differently:**
- If missing values in the input data were handled differently (e.g., using median instead of mean for imputation), it could impact the distribution and consequently the model's performance. Different imputation methods might be more suitable based on the feature distributions and data characteristics.
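
The difference between mean and median imputation shows up most clearly on a skewed column. The sketch below uses a made-up `age` column with one outlier; it mirrors the `fillna` pattern from the preprocessing cell but is not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy column with a skewed distribution and one missing value (values are made up).
toy = pd.DataFrame({"age": [22.0, 24.0, 26.0, 80.0, np.nan]})

mean_filled = toy.fillna(toy.mean())       # missing value becomes 38.0
median_filled = toy.fillna(toy.median())   # missing value becomes 25.0

# The outlier (80) pulls the mean well above the typical ages,
# while the median stays close to the bulk of the data.
```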

These changes would impact the training process, convergence, and final performance of the neural network model.

### **Conclusions:** 

The neural network with one output node achieved 96.65% accuracy on the test set, outperforming the previous models. This suggests the model captures patterns the earlier models missed, including non-linear relationships. A gain of this size should still be confirmed, for example with cross-validation and a check that every model used the same input features, before it is credited to the architecture alone.

# **Part D**

## **Implementing Batches and Minibatches**

In [10]:
# Split the dataset

np.random.seed(42)
train_indices = np.random.rand(len(X)) < 0.8
X_train = X[train_indices]
X_test = X[~train_indices]
y_train = y[train_indices]
y_test = y[~train_indices]

# Standardize the features

X_train_mean = X_train.mean(axis=0)
X_train_std = X_train.std(axis=0)
X_train = (X_train - X_train_mean) / X_train_std
X_test = (X_test - X_train_mean) / X_train_std

# Define the neural network with two hidden layers and one output node

class SimpleNNMiniBatch:
    def __init__(self, input_dim, hidden_units):
        self.weights1 = np.random.rand(input_dim, hidden_units)
        self.bias1 = np.zeros(hidden_units)
        self.weights2 = np.random.rand(hidden_units, hidden_units)
        self.bias2 = np.zeros(hidden_units)
        self.weights3 = np.random.rand(hidden_units, 1)
        self.bias3 = np.zeros(1)

    def relu(self, x):
        return np.maximum(0, x)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def forward(self, x):
        self.z1 = np.dot(x, self.weights1) + self.bias1
        self.a1 = self.relu(self.z1)
        self.z2 = np.dot(self.a1, self.weights2) + self.bias2
        self.a2 = self.relu(self.z2)
        self.z3 = np.dot(self.a2, self.weights3) + self.bias3
        self.a3 = self.sigmoid(self.z3)
        return self.a3

    def backward(self, x, y, output, learning_rate):
        m = y.shape[0]  # number of samples in the batch

        # Output layer: difference between predicted probability and label
        dz3 = output - y.reshape(-1, 1)
        dw3 = np.dot(self.a2.T, dz3) / m
        db3 = np.sum(dz3, axis=0) / m

        # Second hidden layer: propagate the error back, masked by the ReLU derivative
        dz2 = np.dot(dz3, self.weights3.T) * (self.a2 > 0)
        dw2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0) / m

        # First hidden layer
        dz1 = np.dot(dz2, self.weights2.T) * (self.a1 > 0)
        dw1 = np.dot(x.T, dz1) / m
        db1 = np.sum(dz1, axis=0) / m

        # Gradient descent update (learning_rate is passed in rather than read from a global)
        self.weights3 -= learning_rate * dw3
        self.bias3 -= learning_rate * db3
        self.weights2 -= learning_rate * dw2
        self.bias2 -= learning_rate * db2
        self.weights1 -= learning_rate * dw1
        self.bias1 -= learning_rate * db1

    def train(self, x, y, epochs, batch_size, learning_rate):
        for epoch in range(epochs):
            # Reshuffle the samples at the start of every epoch
            indices = np.arange(x.shape[0])
            np.random.shuffle(indices)
            x = x[indices]
            y = y[indices]

            # One forward and backward pass per mini-batch
            for i in range(0, x.shape[0], batch_size):
                x_batch = x[i:i + batch_size]
                y_batch = y[i:i + batch_size]
                output = self.forward(x_batch)
                self.backward(x_batch, y_batch, output, learning_rate)

    def evaluate(self, x, y):
        output = self.forward(x)
        predictions = (output >= 0.5).astype(int)
        accuracy = np.mean(predictions == y.reshape(-1, 1))
        return accuracy

# Initialize and train the neural network

input_dim = X_train.shape[1]
hidden_units = 5
learning_rate = 0.2
epochs = 100
batch_size = 32

nn = SimpleNNMiniBatch(input_dim, hidden_units)
nn.train(X_train.values, y_train.values, epochs, batch_size, learning_rate)

# Evaluate the neural network

accuracy_nn_mini_batch = nn.evaluate(X_test.values, y_test.values)
print(f'Neural Network with Mini-Batches Accuracy: {accuracy_nn_mini_batch}')

Neural Network with Mini-Batches Accuracy: 0.966542750929368


### What We Did Here:

**Splitting the Dataset:**
- We set a random seed for reproducibility, ensuring consistent results.
- We split the data into roughly 80% training and 20% test sets (each row is assigned to training with probability 0.8), so the model is trained on one part of the data and evaluated on another.

**Standardizing the Features:**
- We calculated the mean and standard deviation of the training features.
- We standardized both sets with the training statistics, so the training features have a mean of 0 and a standard deviation of 1 and the test features are scaled consistently without using test-set information.

**Define the Neural Network with One Output Node and Mini-Batch Gradient Descent:**
- We defined a simple neural network class `SimpleNNMiniBatch` with two hidden layers and a single output node.
  - **Initialization:** Initialized the weights with uniform random values in [0, 1) using `np.random.rand` and the biases with zeros to start the training process.
  - **Activation Functions:** Defined ReLU for the hidden layers and sigmoid for the output layer to produce probabilities.
  - **Forward Propagation:** Implemented forward propagation to compute the activations through the network layers.
  - **Backward Propagation:** Implemented backward propagation to update weights and biases using gradient descent.
  - **Training:** Defined a training method to iterate through mini-batches of data and update weights for each epoch using mini-batch gradient descent.
  - **Evaluation:** Added a method to evaluate the network's accuracy on test data by comparing predictions to actual labels.
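
The first line of `backward`, `dz3 = output - y.reshape(-1, 1)`, is consistent with a binary cross-entropy loss, although the notebook never names the loss explicitly. Under that assumption, with $a = \sigma(z)$ the sigmoid output and $y \in \{0, 1\}$ the label, the derivation for a single example is:

$$
\begin{aligned}
L(a, y) &= -\bigl[\, y \log a + (1 - y) \log (1 - a) \,\bigr], \qquad a = \sigma(z) = \frac{1}{1 + e^{-z}} \\
\frac{\partial L}{\partial a} &= -\frac{y}{a} + \frac{1 - y}{1 - a}, \qquad \frac{\partial a}{\partial z} = a\,(1 - a) \\
\frac{\partial L}{\partial z} &= \left( -\frac{y}{a} + \frac{1 - y}{1 - a} \right) a\,(1 - a) = -y\,(1 - a) + (1 - y)\,a = a - y
\end{aligned}
$$

The sigmoid derivative cancels against the denominators of the loss gradient, which is why the output-layer error reduces to the plain difference between prediction and label.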

**Initialize and Train the Neural Network:**
- Initialized the neural network with the specified input dimensions, number of hidden units, and a single output unit.
- Trained the network with the training data for 100 epochs, using a batch size of 32 and a learning rate of 0.2.

**Evaluate the Neural Network:**
- Evaluated the trained neural network on the test data and calculated the accuracy, which measures the model's performance.

### What Would Happen if the Variables Changed?

**Different Number of Hidden Units:**
- Changing `hidden_units` (e.g., increasing to 10 or decreasing to 3) would alter the network's capacity to learn patterns. More hidden units can capture more complex relationships but may lead to overfitting; fewer units might not capture the data's complexity well.
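
One way to put a number on "capacity" is to count trainable parameters. The sketch below follows the layer shapes in `SimpleNNMiniBatch`; the helper name `count_parameters` and the input dimension of 4 are assumptions for illustration, since the notebook reads the input dimension from `X_train.shape[1]`.

```python
def count_parameters(input_dim, hidden_units):
    """Trainable parameters in a network shaped like SimpleNNMiniBatch."""
    layer1 = input_dim * hidden_units + hidden_units      # weights1 + bias1
    layer2 = hidden_units * hidden_units + hidden_units   # weights2 + bias2
    layer3 = hidden_units * 1 + 1                         # weights3 + bias3
    return layer1 + layer2 + layer3

# input_dim = 4 is an assumed value, not read from the notebook's data.
for h in (3, 5, 10):
    print(h, count_parameters(4, h))
```

Because the second layer is `hidden_units` by `hidden_units`, the parameter count grows roughly quadratically with the number of hidden units, so going from 5 to 10 units nearly triples the model size under this assumption.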

**Different Learning Rate:**
- Modifying `learning_rate` (e.g., to 0.1 or 0.3) impacts how quickly the model learns. A smaller learning rate could slow down training, while a larger learning rate might cause the training to become unstable and possibly diverge.

**Different Number of Epochs:**
- Changing `epochs` (e.g., to 50 or 200) would affect how long the network trains. Fewer epochs might result in underfitting, while more epochs can improve performance but might lead to overfitting if the model learns the noise in the training data.

**Different Batch Size:**
- Adjusting `batch_size` (e.g., to 16 or 64) impacts the number of samples used to update the model weights at each step. Smaller batch sizes can provide more frequent updates but may be noisier, while larger batch sizes might provide more stable updates but can require more memory.

**Different Initialization of Weights:**
- Initializing weights differently could affect convergence speed and final performance. Different initializations might lead to different local minima.
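
For illustration, here is a sketch of one common alternative, He-style initialization, which is often paired with ReLU layers. It is not what the notebook does: `SimpleNNMiniBatch` draws its weights with `np.random.rand`, so they are all non-negative. The helper name `he_init` and the layer shape are hypothetical.

```python
import numpy as np

def he_init(fan_in, fan_out, rng):
    """He-style initialization: zero-mean Gaussian scaled by sqrt(2 / fan_in)."""
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

rng = np.random.default_rng(0)
uniform_w = rng.random((4, 5))   # same distribution as np.random.rand: values in [0, 1)
he_w = he_init(4, 5, rng)        # mixed signs, centered on zero
```

Whether this changes the final accuracy on the Titanic data would have to be tested; the sketch only shows how the two starting points differ.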

**Different Normalization:**
- Using different normalization techniques (e.g., min-max scaling instead of standard scaling) could affect training. Proper normalization ensures features contribute equally, and different techniques might suit different data distributions better.

**Handling Missing Values Differently:**
- If missing values in the input data were handled differently (e.g., using median instead of mean for imputation), it could impact the distribution and consequently the model's performance. Different imputation methods might be more suitable based on the feature distributions and data characteristics.

These changes would impact the training process, convergence, and final performance of the neural network model.

### **Conclusions:** 

The neural network with mini-batches achieved 96.65% accuracy, identical to the one-output network's result. Mini-batch gradient descent therefore trained the deeper, two-hidden-layer model effectively, but the extra layer and the larger batch size did not change test accuracy, so this run gives no evidence of improved generalization over the one-output network.

# In-Depth Analysis: Final Conclusions

To evaluate models on the Titanic dataset, we preprocessed the data, implemented logistic regression, built a neural network, and explored variations in network architecture and training method. A more in-depth analysis of our findings follows:

#### 1. Logistic Regression

**Performance:**

- **Accuracy:** 78.07%
- **Advantages:** Simple, interpretable, and effective for linear relationships in the data. The model performed well, indicating that the dataset has a significant linear component.

**Key Steps:**

- **Data Preprocessing:** Converted categorical variables, filled missing values, standardized features.
- **Model Training:** Used gradient descent to optimize the weights.
- **Prediction & Evaluation:** Calculated probabilities using the sigmoid function and converted them to binary outcomes for accuracy calculation.
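
The logistic regression code itself sits earlier in the notebook. As a reminder of what the steps above amount to, here is a minimal sketch under stated assumptions: the function names, the zero initialization, and the toy data are illustrative, and only `num_steps` and `learning_rate` are names taken from this analysis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, num_steps=1000, learning_rate=0.1):
    """Batch gradient descent on the mean binary cross-entropy loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(num_steps):
        p = sigmoid(X @ w + b)
        error = p - y                                 # gradient of the loss w.r.t. the logits
        w -= learning_rate * (X.T @ error) / len(y)
        b -= learning_rate * error.mean()
    return w, b

def predict(X, w, b):
    """Turn probabilities into 0/1 labels with a 0.5 threshold."""
    return (sigmoid(X @ w + b) >= 0.5).astype(int)

# Toy, linearly separable data (made up for illustration).
X_toy = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_toy = np.array([0, 0, 1, 1])
w, b = fit_logistic(X_toy, y_toy)
accuracy = np.mean(predict(X_toy, w, b) == y_toy)
```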

**Variable Impact:**

- **num_steps:** Increasing steps can improve accuracy but requires more training time.
- **learning_rate:** A higher rate speeds up training but risks overshooting optimal weights, while a lower rate may improve accuracy but needs more iterations.

#### 2. Neural Network (Basic)

**Performance:**

- **Accuracy:** 77.32%
- **Advantages:** Can capture complex, non-linear relationships in the data.
- **Disadvantages:** Slightly underperformed compared to logistic regression, possibly due to overfitting or suboptimal tuning.

**Key Steps:**

- **One-Hot Encoding:** Converted labels to one-hot encoding for compatibility with the neural network's output.
- **Neural Network Structure:** Input layer, hidden layer with ReLU activation, and an output layer with sigmoid activation.
- **Training & Evaluation:** Used gradient descent for iterative parameter updates.
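
The one-hot step can be sketched in a few lines of NumPy. The basic network's own encoding code is not shown in this section, so the helper `one_hot` and the toy labels below are assumptions about the general technique, not a copy of the notebook's implementation.

```python
import numpy as np

def one_hot(labels, num_classes=2):
    """Map integer labels to rows of the identity matrix."""
    return np.eye(num_classes)[labels]

y_toy = np.array([0, 1, 1, 0])      # made-up survival labels
encoded = one_hot(y_toy)            # shape (4, 2): column 0 = died, column 1 = survived
decoded = encoded.argmax(axis=1)    # recovers the original integer labels
```

With two classes the second column carries the same information as the original 0/1 label, which is why a single sigmoid output node is sufficient for this problem.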

**Variable Impact:**

- **hidden_units:** Increasing units captures more complex patterns but risks overfitting.
- **learning_rate, epochs, batch_size:** Adjusting these impacts training speed, convergence, and the balance between training efficiency and accuracy.

#### 3. Neural Network with One Output Node

**Performance:**

- **Accuracy:** 96.65%
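- **Advantages:** Accuracy rose sharply after switching to a single output node. The output layer alone is unlikely to explain a gain of this size, so the result should be confirmed before concluding that the simpler architecture generalizes better.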
- **Disadvantages:** Still requires careful tuning to avoid overfitting.

**Key Steps:**

- **Neural Network Structure:** Input layer, one hidden layer with ReLU activation, and a single output node with sigmoid activation.
- **Training & Evaluation:** Similar training process as before, using mini-batch gradient descent.

**Variable Impact:**

- **hidden_units, learning_rate, epochs, batch_size:** Same considerations as the previous network.

#### 4. Neural Network with Mini-Batch Training

**Performance:**

- **Accuracy:** 96.65%
- **Advantages:** Mini-batch training stabilizes and speeds up learning, balancing between stochastic gradient descent and full-batch gradient descent.
- **Disadvantages:** Requires further tuning to fully leverage mini-batch benefits.

**Key Steps:**

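- **Neural Network Structure:** Two hidden layers with ReLU activation and a single sigmoid output node. The one-output network already trained on mini-batches (batch size 10), so the main changes here are the extra hidden layer and the larger batch size of 32.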
- **Training Process:** Shuffled data before each epoch, processed data in small batches for parameter updates.
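
The shuffle-then-slice pattern used in `train` can be isolated into a small generator. This is a sketch rather than the notebook's code: the name `iterate_minibatches` is hypothetical, and it uses NumPy's `Generator` API instead of the global `np.random.shuffle`.

```python
import numpy as np

def iterate_minibatches(x, y, batch_size, rng):
    """Yield shuffled (x_batch, y_batch) pairs that cover every sample exactly once."""
    indices = rng.permutation(x.shape[0])
    for start in range(0, x.shape[0], batch_size):
        batch = indices[start:start + batch_size]
        yield x[batch], y[batch]

rng = np.random.default_rng(42)
x_toy = np.arange(20).reshape(10, 2)   # 10 made-up samples with 2 features
y_toy = np.arange(10)
batch_sizes = [len(yb) for _, yb in iterate_minibatches(x_toy, y_toy, 4, rng)]
# batch_sizes is [4, 4, 2]: the last batch holds the leftover samples
```

Calling the generator once per epoch gives a fresh shuffle each time while leaving the original arrays untouched.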

**Variable Impact:**

- **hidden_units, learning_rate, epochs, batch_size:** Adjusting these affects the balance between training efficiency and accuracy.

### Summary of Key Findings

- **Logistic Regression:** Strong performance with a simpler, linear model indicating significant linear relationships in the data.
- **Neural Network (Basic):** Captures non-linear relationships but needs careful tuning to avoid overfitting.
- **Neural Network with One Output Node:** Improved performance with a simplified architecture, demonstrating better generalization.
- **Mini-Batch Training:** Balances training efficiency and stability, and matched the one-output network's 96.65% accuracy.

### How can we make this better in the future?

1. **Model Selection:** For datasets with significant linear relationships, logistic regression can be highly effective. For more complex patterns, neural networks are suitable but require careful architecture and parameter tuning.

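2. **Feature Engineering:** Adding relevant features can improve performance. A column such as 'boat' (the lifeboat a passenger boarded) is largely a proxy for survival itself, so any gain it brings should be treated as possible label leakage rather than genuine predictive power. Further exploration of additional features and their interactions can provide better insights and improvements.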

3. **Parameter Tuning:** Fine-tuning hyperparameters (learning rate, epochs, batch size, hidden units) is crucial for optimizing model performance. Techniques like grid search or random search can be employed for systematic tuning.

4. **Regularization:** Implementing regularization techniques (e.g., dropout, L2 regularization) can help mitigate overfitting in neural networks.

5. **Cross-Validation:** Using cross-validation techniques can provide a more robust evaluation of model performance, ensuring that the results are not dependent on a specific train-test split.

6. **Ensemble Methods:** Exploring ensemble methods (e.g., combining logistic regression with neural networks) might improve performance by leveraging the strengths of different models.
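
As an illustration of point 5, here is a minimal sketch of how k-fold splits could be generated with NumPy alone. The helper `k_fold_indices`, the seed, and the choice of k = 5 are assumptions; the sample count of 1309 matches the row count reported by `data.info()`.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=42):
    """Return a list of (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        splits.append((train_idx, test_idx))
    return splits

# k = 5 is an arbitrary choice for illustration.
splits = k_fold_indices(1309, 5)
```

Each model would then be trained and evaluated once per split, with standardization statistics recomputed from that split's training rows, and the accuracies averaged across the k runs.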

By combining robust preprocessing, effective model selection, and advanced training techniques, we can achieve high-performing predictive models for the Titanic dataset.