#### Goal
- Predict whether a passenger survived the sinking of the Titanic (a binary classification task). Each prediction is either 0 or 1.
#### Metric
Accuracy: The ratio of correct predictions out of all predictions.
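As a minimal sketch of the metric, assuming two small made-up label arrays (`y_true` and `y_hat` are illustrative names, not taken from the dataset), accuracy is the fraction of positions where prediction and label agree:

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only
y_true = np.array([1, 0, 1, 1, 0])
y_hat = np.array([1, 0, 0, 1, 1])

# Accuracy = number of matching entries / total number of entries
accuracy = np.mean(y_true == y_hat)
print(f"Accuracy: {accuracy:.2f}")  # 3 of the 5 predictions match
```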
#### Data Summary
- train.csv contains details of passengers on board and whether they survived (use this to train a model that predicts survival for the passengers in test.csv).
- test.csv contains details of passengers on board without info on whether they survived.
#### Submit
- A file with 2 columns:
  - PassengerId (sorted in any order)
  - Survived (binary predictions 0/1)

In [53]:
import copy
import math

import numpy as np
import pandas as pd

titanic_data = pd.read_csv('train.csv')

In [54]:
titanic_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [55]:
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [56]:
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

In [57]:
X = titanic_data[features]
y = titanic_data['Survived']

In [58]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    891 non-null    int64  
 1   Sex       891 non-null    object 
 2   Age       714 non-null    float64
 3   SibSp     891 non-null    int64  
 4   Parch     891 non-null    int64  
 5   Fare      891 non-null    float64
 6   Embarked  889 non-null    object 
dtypes: float64(2), int64(3), object(2)
memory usage: 48.9+ KB


In [59]:
# Cleaning Data
# Encode the categorical columns as integer codes
X.loc[:, 'Sex'] = X['Sex'].map({'male': 0, 'female': 1})
X.loc[:, 'Embarked'] = X['Embarked'].map({'C': 0, 'S': 1, 'Q': 2})
X.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,0,22.0,1,0,7.25,1.0
1,1,1,38.0,1,0,71.2833,0.0
2,3,1,26.0,0,0,7.925,1.0
3,1,1,35.0,1,0,53.1,1.0
4,3,0,35.0,0,0,8.05,1.0


In [60]:
# Fill NaN values in the 'Age' column with the rounded mean age, using .loc[]
age_mean = round(X['Age'].mean())
X.loc[:, 'Age'] = X['Age'].fillna(age_mean)

# Fill NaN values in the 'Embarked' column with the rounded mean of its integer codes, using .loc[]
embarked_mean = round(X['Embarked'].mean())
X.loc[:, 'Embarked'] = X['Embarked'].fillna(embarked_mean)



In [61]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    891 non-null    int64  
 1   Sex       891 non-null    object 
 2   Age       891 non-null    float64
 3   SibSp     891 non-null    int64  
 4   Parch     891 non-null    int64  
 5   Fare      891 non-null    float64
 6   Embarked  891 non-null    object 
dtypes: float64(2), int64(3), object(2)
memory usage: 48.9+ KB
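Because `Embarked` is categorical, the rounded mean of its integer codes depends on how the codes happen to be ordered. A hedged alternative, shown here on a small made-up frame rather than the notebook's data, is to fill missing entries with the most frequent value (the mode). The names `toy` and `embarked_mode` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the encoded Embarked column (values are made up)
toy = pd.DataFrame({'Embarked': [1.0, 0.0, 1.0, np.nan, 2.0, 1.0]})

# mode() returns a Series of the most frequent value(s); take the first one
embarked_mode = toy['Embarked'].mode().iloc[0]

# Fill the gap with the mode, then cast the column to integers
toy['Embarked'] = toy['Embarked'].fillna(embarked_mode).astype(int)
print(toy['Embarked'].tolist())
```

Unlike the rounded mean, the mode does not change if the ports are assigned different integer codes.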


In [62]:
# Converting to number types
X['Sex'] = X['Sex'].astype(int)
X['Embarked'] = X['Embarked'].astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X['Sex'] = X['Sex'].astype(int)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X['Embarked'] = X['Embarked'].astype(int)
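The warning above is pandas flagging that `X` was created by selecting columns from `titanic_data`, so it cannot tell whether the assignment is meant to modify the original frame. One common way to avoid it, sketched here on a toy frame with made-up values (`raw` and `subset` are illustrative names), is to take an explicit `.copy()` when selecting the columns:

```python
import pandas as pd

# Toy frame with made-up values, standing in for titanic_data
raw = pd.DataFrame({
    'Sex': ['male', 'female', 'male'],
    'Fare': [7.25, 71.28, 8.05],
    'Name': ['a', 'b', 'c'],
})

# .copy() makes subset an independent DataFrame, so later column
# assignments are unambiguous and do not touch raw
subset = raw[['Sex', 'Fare']].copy()
subset['Sex'] = subset['Sex'].map({'male': 0, 'female': 1}).astype(int)

print(subset.dtypes)
```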


In [63]:
X_train = X.values
y = y.values

In [70]:
X_train

array([[ 3.    ,  0.    , 22.    , ...,  0.    ,  7.25  ,  1.    ],
       [ 1.    ,  1.    , 38.    , ...,  0.    , 71.2833,  0.    ],
       [ 3.    ,  1.    , 26.    , ...,  0.    ,  7.925 ,  1.    ],
       ...,
       [ 3.    ,  1.    , 30.    , ...,  2.    , 23.45  ,  1.    ],
       [ 1.    ,  0.    , 26.    , ...,  0.    , 30.    ,  0.    ],
       [ 3.    ,  0.    , 32.    , ...,  0.    ,  7.75  ,  2.    ]])

In [85]:
# Splitting Data
train_ratio = 0.8

# Shuffle the row positions, then cut the shuffled list at the 80% mark
indices = np.random.permutation(X_train.shape[0])
n_train = int(train_ratio * len(indices))
train_indices, test_indices = indices[:n_train], indices[n_train:]

# Split dataset by indexing the NumPy arrays (X_train, y) rather than the DataFrame
train_X, train_y = X_train[train_indices], y[train_indices]
test_X, test_y = X_train[test_indices], y[test_indices]

In [65]:
X.shape
y.shape

(891,)

#### Hypothesis

$$\hat{y} = \sigma(\vec{w}^{T}\vec{x} + b)$$

Repeat $(*)$ until convergence 
$$
\begin{align*}
&\begin{cases}
w_{j} := w_{j} - \alpha\frac{\partial J(\vec{w},b)}{\partial w_{j}}, \quad 0\leq j\leq n-1 \\
\ \ \vdots \\
b := b - \alpha\frac{\partial J(\vec{w},b)}{\partial b} \\
\end{cases}\tag{*} \\ \\
&\frac{\partial J(\vec{w},b)}{\partial w_{j}} = \frac{1}{m}\sum^{m-1}_{i=0}\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)x_{j}^{(i)}\tag{Gradient w.r.t. $w_{j}$}\\
&\frac{\partial J(\vec{w},b)}{\partial b}=\frac{1}{m}\sum_{i=0}^{m-1}\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)\tag{Gradient w.r.t. $b$}\\
&J(\vec{w},b) = \frac{1}{m}\sum_{i=0}^{m-1}L\left(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}\right) \tag{Logistic Cost Function}  \\
&L\left(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}\right) = -y^{(i)}\log\left({f_{\vec{w},b}(\vec{x}^{(i)})}\right) - (1-y^{(i)})\log\left(1 - f_{\vec{w},b}(\vec{x}^{(i)})\right)\tag{Logistic Loss Function}
\end{align*}
$$

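- $f_{\vec{w},b}(\vec{x}^{(i)}) = \sigma(\vec{w}\cdot\vec{x}^{(i)} + b)$ is the logistic regression model's prediction for example $i$, with $\sigma(z) = \frac{1}{1+e^{-z}}$.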
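The gradient expressions above can be sanity-checked numerically. The sketch below uses random made-up data and compares the analytic gradients with a central finite-difference approximation of the cost. The helper names (`cost`, `grad`) and the `_chk` variables are illustrative and not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(X, y, w, b):
    # Mean cross-entropy over all examples
    f = sigmoid(X @ w + b)
    return -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))

def grad(X, y, w, b):
    # Analytic gradients: (1/m) * sum((f - y) * x_j) and (1/m) * sum(f - y)
    err = sigmoid(X @ w + b) - y
    return X.T @ err / len(y), np.mean(err)

rng = np.random.default_rng(0)
X_chk = rng.normal(size=(20, 3))
y_chk = rng.integers(0, 2, size=20).astype(float)
w_chk, b_chk = rng.normal(size=3), 0.1

dj_dw, dj_db = grad(X_chk, y_chk, w_chk, b_chk)

# Central differences: (J(w + eps*e_j) - J(w - eps*e_j)) / (2*eps)
eps = 1e-6
num_dw = np.array([
    (cost(X_chk, y_chk, w_chk + eps * e, b_chk)
     - cost(X_chk, y_chk, w_chk - eps * e, b_chk)) / (2 * eps)
    for e in np.eye(3)
])
num_db = (cost(X_chk, y_chk, w_chk, b_chk + eps)
          - cost(X_chk, y_chk, w_chk, b_chk - eps)) / (2 * eps)

# Both differences should be tiny if the formulas are right
print(np.max(np.abs(dj_dw - num_dw)), abs(dj_db - num_db))
```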

In [13]:
def sigmoid(z):
    """
    Compute the sigmoid (logistic function) of z

    Args:
        z (ndarray): A scalar or numpy array of any size

    Returns:
        g (ndarray): sigmoid(z), with the same shape as z
    """
    return 1/(1+np.exp(-z))
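For large negative `z`, `np.exp(-z)` overflows and NumPy emits a runtime warning, even though the result still rounds to 0. If that matters (for example with unscaled features), one possible variant, sketched here under the hypothetical name `stable_sigmoid`, only ever exponentiates non-positive numbers:

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid computed without overflowing np.exp for large |z|."""
    z = np.asarray(z, dtype=float)
    # e = exp(-|z|) always lies in (0, 1], so it cannot overflow
    e = np.exp(-np.abs(z))
    # z >= 0: 1 / (1 + exp(-z));  z < 0: exp(z) / (1 + exp(z))
    return np.where(z >= 0, 1 / (1 + e), e / (1 + e))

# Extreme inputs map to 0, 0.5 and 1 without overflow warnings
print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```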

In [14]:
def logistic_cost(X, y, w, b):
    """
    Computes cost with cross-entropy loss function

    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
      
    Returns:
      cost (scalar): cost
    """
    m = X.shape[0]
    cost = 0
    for i in range(m):
        z_i = np.dot(X[i], w) + b
        sig_loss_i = sigmoid(z_i)
        cost += -y[i]*np.log(sig_loss_i) - (1-y[i])*np.log(1-sig_loss_i)
    # Divide by m to get average loss
    return cost/m
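The per-example loop in `logistic_cost` can also be written as array operations, which is usually much faster on NumPy arrays. A minimal sketch follows; the name `logistic_cost_vectorized` and the toy arrays are illustrative and not part of the notebook:

```python
import numpy as np

def logistic_cost_vectorized(X, y, w, b):
    """Mean cross-entropy cost, computed for all m examples at once."""
    f_wb = 1 / (1 + np.exp(-(X @ w + b)))  # predictions, shape (m,)
    return -np.mean(y * np.log(f_wb) + (1 - y) * np.log(1 - f_wb))

# Toy data with made-up values
X_toy = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0]])
y_toy = np.array([1, 0, 1])

# With all-zero parameters every prediction is 0.5, so the cost is log(2)
print(logistic_cost_vectorized(X_toy, y_toy, np.zeros(2), 0.0))
```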

In [15]:
def logistic_gradient(X, y, w, b): 
    """
    Computes the gradient for logistic regression 
 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
    Returns:
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w. 
      dj_db (scalar)      : The gradient of the cost w.r.t. the parameter b. 
    """
    m, n = X.shape
    dj_dw = np.zeros((n,))
    dj_db = 0

    for i in range(m):
        # Make a prediction for example i
        f_wb_i = sigmoid(np.dot(X[i], w) + b) # (n,)(n,)=scalar
        # Compute error
        err_i = f_wb_i - y[i]
        # Compute gradient w.r.t. weights
        for j in range(n):
            dj_dw[j] = dj_dw[j] + (err_i * X[i, j])
        # Compute gradient w.r.t. bias
        dj_db = dj_db + err_i
    dj_dw = dj_dw/m
    dj_db = dj_db/m

    return dj_dw, dj_db
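The double loop above computes the same quantities as a single matrix product. A minimal vectorized sketch is shown below; the name `logistic_gradient_vectorized` and the toy arrays are illustrative and not part of the notebook:

```python
import numpy as np

def logistic_gradient_vectorized(X, y, w, b):
    """Gradients of the logistic cost w.r.t. w and b, without Python loops."""
    err = 1 / (1 + np.exp(-(X @ w + b))) - y  # prediction errors, shape (m,)
    dj_dw = X.T @ err / X.shape[0]            # (n,m) @ (m,) -> (n,)
    dj_db = err.mean()
    return dj_dw, dj_db

# Toy data with made-up values
X_toy = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0]])
y_toy = np.array([1, 0, 1])

grad_w, grad_b = logistic_gradient_vectorized(X_toy, y_toy, np.zeros(2), 0.0)
print(grad_w, grad_b)
```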

In [16]:
def gradient_descent(X, y, w_in, b_in, alpha, num_iters): 
    """
    Performs batch gradient descent
    
    Args:
      X (ndarray (m,n))  : Data, m examples with n features
      y (ndarray (m,))   : target values
      w_in (ndarray (n,)): Initial values of model parameters  
      b_in (scalar)      : Initial values of model parameter
      alpha (float)      : Learning rate
      num_iters (scalar) : number of iterations to run gradient descent
      
    Returns:
      w (ndarray (n,))   : Updated values of parameters
      b (scalar)         : Updated value of parameter 
    """
    # List to store the cost J at each iteration
    J_history = []
    w = copy.deepcopy(w_in)
    b = b_in

    for i in range(num_iters):
        # Calculate gradient and update parameters
        dj_dw, dj_db = logistic_gradient(X, y, w, b)

        # Update weights and bias
        w = w - (alpha * dj_dw)
        b = b - (alpha * dj_db)

        # Save cost J at each iteration
        if i < 10_000: # prevents exhausting resources
            J_history.append(logistic_cost(X, y, w, b))

        # Print the cost about 10 times over the run (every iteration if num_iters < 10)
        if i % math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4d}: Cost {logistic_cost(X, y, w, b)}   ")

    return w, b

In [21]:
w, b = gradient_descent(X_train, y, np.random.randn(X_train.shape[1])*0.05, 0.0, 0.001, 10000)

Iteration    0: Cost 1.5146300964563857   
Iteration 1000: Cost 0.5999811840226499   
Iteration 2000: Cost 0.583921429438737   
Iteration 3000: Cost 0.5717835827789809   
Iteration 4000: Cost 0.5614756361173404   
Iteration 5000: Cost 0.5523239272339309   
Iteration 6000: Cost 0.5440680476621638   
Iteration 7000: Cost 0.5365753745168446   
Iteration 8000: Cost 0.5297572565659183   
Iteration 9000: Cost 0.5235434242528281   


In [18]:
z = np.dot(X_train, w) + b
y_pred = sigmoid(z)
class_pred = np.where(y_pred<0.5, 0, 1)
class_pred

array([0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0,

In [19]:
# Calculate accuracy on the training data
accuracy = np.mean(class_pred == y)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 74.30%
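The figure above is measured on the same rows the model was fitted on, so it may overstate how well the model generalizes. A hedged sketch of a held-out estimate follows. It uses synthetic, linearly separable data rather than the Titanic features, and all names (`X_syn`, `y_syn`, `predict`, and so on) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, linearly separable data (not the Titanic features)
X_syn = rng.normal(size=(200, 2))
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 0).astype(float)

# Shuffle the row positions, then hold out the last 20% of them
order = rng.permutation(len(y_syn))
cut = int(0.8 * len(order))
tr, te = order[:cut], order[cut:]

# Plain batch gradient descent, fitted on the training rows only
w_syn, b_syn = np.zeros(2), 0.0
for _ in range(500):
    err = 1 / (1 + np.exp(-(X_syn[tr] @ w_syn + b_syn))) - y_syn[tr]
    w_syn -= 0.5 * X_syn[tr].T @ err / len(tr)
    b_syn -= 0.5 * err.mean()

def predict(X):
    # Threshold the predicted probability at 0.5
    return (1 / (1 + np.exp(-(X @ w_syn + b_syn))) >= 0.5).astype(float)

train_acc = np.mean(predict(X_syn[tr]) == y_syn[tr])
test_acc = np.mean(predict(X_syn[te]) == y_syn[te])
print(f"train {train_acc:.2f}, held-out {test_acc:.2f}")
```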


In [20]:
titanic_test = pd.read_csv('test.csv')
X_test = titanic_test[features]

In [276]:
# Clean test data
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    418 non-null    int64  
 1   Sex       418 non-null    object 
 2   Age       332 non-null    float64
 3   SibSp     418 non-null    int64  
 4   Parch     418 non-null    int64  
 5   Fare      417 non-null    float64
 6   Embarked  418 non-null    object 
dtypes: float64(2), int64(3), object(2)
memory usage: 23.0+ KB


In [277]:
# Encode string categories as integer codes
X_test.loc[:, 'Sex'] = X_test['Sex'].map({'male': 0, 'female': 1})
X_test.loc[:, 'Embarked'] = X_test['Embarked'].map({'C': 0, 'S': 1, 'Q': 2})

In [278]:
# Replace missing fare with average
fare_mean = X_test['Fare'].mean()
X_test.loc[:, 'Fare'] = X_test['Fare'].fillna(fare_mean)

In [279]:
# Fill missing ages with the rounded mean age
age_mean = round(X_test['Age'].mean())
X_test.loc[:, 'Age'] = X_test['Age'].fillna(age_mean)

In [280]:
# Convert to number types
X_test['Sex'] = X_test['Sex'].astype(int)
X_test['Embarked'] = X_test['Embarked'].astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test['Sex'] = X_test['Sex'].astype(int)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test['Embarked'] = X_test['Embarked'].astype(int)


In [281]:
z_test = np.dot(X_test, w) + b
test_pred = sigmoid(z_test)
class_test_pred = np.where(test_pred<0.5, 0, 1)

In [282]:
passenger_ids = titanic_test['PassengerId']

In [283]:
# Create a DataFrame
submission_df = pd.DataFrame({
    'PassengerId': passenger_ids,  # Your PassengerId list
    'Survived': class_test_pred    # Your predictions (0 or 1)
}) 

In [284]:
submission_df.to_csv('submission.csv', index=False)
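Before uploading, it can help to confirm that the written file has the expected shape. The sketch below round-trips a tiny made-up frame through an in-memory CSV instead of reading `submission.csv` from disk. The IDs, predictions, and the names `demo` and `check` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical predictions for three made-up passenger IDs
demo = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})

# Write to an in-memory buffer the same way to_csv writes to a file
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Exactly two columns, binary predictions, and no repeated passenger
assert list(check.columns) == ['PassengerId', 'Survived']
assert check['Survived'].isin([0, 1]).all()
assert check['PassengerId'].is_unique
print(check.shape)
```

The same three assertions could be applied to the real file by reading it back with `pd.read_csv('submission.csv')`.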