# What is the problem?
Predict the survival of passengers on the Titanic, a classic binary classification problem. The objective is to model the probability that a passenger survived based on various features.

# What is the type of machine learning?
The approach used is supervised learning because the model is trained using a dataset that includes both the input features (predictors) and the output label (the target variable 'Survived'). Specifically, the technique applied is logistic regression, which is used for binary classification tasks.
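As a minimal sketch of what logistic regression computes, the snippet below turns a linear score (the log-odds) into a probability with the logistic function. The helper names `sigmoid` and `predict_probability` and the coefficient values are hypothetical and chosen only for illustration; they are not the fitted values from this notebook.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(intercept, coefficients, features):
    """Probability of the positive class for a single observation."""
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, features))
    return sigmoid(log_odds)

# Hypothetical model with two features (illustrative values only)
p = predict_probability(intercept=1.0, coefficients=[-0.5, 2.0], features=[2.0, 0.0])
print(round(p, 3))  # log-odds = 1.0 - 1.0 + 0.0 = 0.0, so p = 0.5
```

A probability above a chosen cut-off (0.5 later in this notebook) is then read as a predicted survival.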

# What are the feature variables and target variables?
## Target Variable: 'Survived'
This is what the model is trying to predict: whether a passenger survived (1) or did not survive (0).
## Feature Variables:
'Travel_Class', 'Sex', 'Age', 'NumSiblings_Spouses', 'NumParents_Children', 'Embarked'

In [9]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split

In [10]:
df = pd.read_csv('titanic_cleaned.csv')
df.head()

   Survived  Travel_Class     Sex   Age  NumSiblings_Spouses  NumParents_Children  Embarked
0         0             3    male  22.0                    1                    0         S
1         1             1  female  38.0                    1                    0         C
2         1             3  female  26.0                    0                    0         S
3         1             1  female  35.0                    1                    0         S
4         0             3    male  35.0                    0                    0         S


In [4]:
df.shape

(891, 7)

In [5]:
df.columns

Index(['Survived', 'Travel_Class', 'Sex', 'Age', 'NumSiblings_Spouses',
       'NumParents_Children', 'Embarked'],
      dtype='object')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Survived             891 non-null    int64  
 1   Travel_Class         891 non-null    int64  
 2   Sex                  891 non-null    object 
 3   Age                  891 non-null    float64
 4   NumSiblings_Spouses  891 non-null    int64  
 5   NumParents_Children  891 non-null    int64  
 6   Embarked             891 non-null    object 
dtypes: float64(1), int64(4), object(2)
memory usage: 48.9+ KB


In [7]:
df.Survived.value_counts()

Survived
0    549
1    342
Name: count, dtype: int64

In [11]:
# Converting categorical variables into dummy variables
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
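To make the effect of `drop_first=True` concrete, here is a small pure-Python sketch that mimics the encoding on a few made-up values. The helper `dummy_encode` is hypothetical and is not how pandas implements `get_dummies`; it only illustrates why columns such as `Sex_female` and `Embarked_C` do not appear after the conversion.

```python
def dummy_encode(values, prefix):
    """Mimic one-hot encoding with the first (sorted) category dropped."""
    categories = sorted(set(values))
    kept = categories[1:]  # the first category becomes the implicit baseline
    return {f"{prefix}_{cat}": [v == cat for v in values] for cat in kept}

# Made-up sample values, not rows from the dataset
embarked = ["S", "C", "S", "Q"]
sex = ["male", "female", "female", "male"]

print(dummy_encode(embarked, "Embarked"))
# {'Embarked_Q': [False, False, False, True], 'Embarked_S': [True, False, True, False]}
print(dummy_encode(sex, "Sex"))
# {'Sex_male': [True, False, False, True]}
```

Dropping one level per variable avoids perfectly collinear columns when the model also includes an intercept; the dropped level ('female', 'C') becomes the reference category.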

In [12]:
# Splitting the dataset
X = df.drop('Survived', axis=1)
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)
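As a quick arithmetic check of the split, the sketch below reproduces the row counts for an 80/20 split of 891 rows. It assumes scikit-learn's rule of rounding the test count up when `test_size` is a fraction and giving the remainder to the training set.

```python
import math

n_rows = 891      # rows reported by df.shape
test_size = 0.20  # fraction passed to train_test_split

# Assumption: the test count is rounded up, the training set gets the rest
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)  # 712 179
```

The 712 training rows match the "No. Observations" value reported in the model summary.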

In [22]:
# the training dataset setup for Statsmodels
train_data = df.loc[X_train.index]  # Select the training rows by index label
train_data.head()

     Survived  Travel_Class   Age  NumSiblings_Spouses  NumParents_Children  Sex_male  Embarked_Q  Embarked_S
329         1             1  16.0                    0                    1     False       False       False
749         0             3  31.0                    0                    0      True        True       False
203         0             3  45.5                    0                    0      True       False       False
421         0             3  21.0                    0                    0      True        True       False
 97         1             1  23.0                    0                    1      True       False       False


In [24]:
train_data.columns

Index(['Survived', 'Travel_Class', 'Age', 'NumSiblings_Spouses',
       'NumParents_Children', 'Sex_male', 'Embarked_Q', 'Embarked_S'],
      dtype='object')

# What machine learning algorithms were used? 
The machine learning algorithm used was logistic regression, a popular method for binary classification tasks.

# Which is better?

Given that our target variable is binary (either 0 or 1), logistic regression is a suitable and often highly effective choice for such classification problems.

In [25]:
# Creating multiple model formulas for `statsmodels`
f1 = 'Survived ~ Travel_Class + Sex_male + Age + NumSiblings_Spouses + NumParents_Children + Embarked_Q + Embarked_S'
f2 = 'Survived ~ Travel_Class + Embarked_Q + Embarked_S'
f3 = 'Survived ~ Travel_Class + Sex_male + Age + Embarked_Q + Embarked_S'
f4 = 'Survived ~ NumSiblings_Spouses + NumParents_Children'
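Each formula has the form `target ~ predictor_1 + predictor_2 + ...`. As a small sketch, a formula string like these can also be assembled from a list of column names; the helper `build_formula` is a hypothetical convenience and is not part of statsmodels.

```python
def build_formula(target, predictors):
    """Join column names into a formula string of the form 'y ~ x1 + x2'."""
    return f"{target} ~ " + " + ".join(predictors)

f2_rebuilt = build_formula("Survived", ["Travel_Class", "Embarked_Q", "Embarked_S"])
print(f2_rebuilt)  # Survived ~ Travel_Class + Embarked_Q + Embarked_S
```

Building formulas this way makes it easier to try different feature subsets without retyping column names.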


In [76]:
# Creating and fitting the logistic regression models using the specified formulas
model_1 = smf.logit(f1, data=train_data).fit()
model_2 = smf.logit(f2, data=train_data).fit()
model_3 = smf.logit(f3, data=train_data).fit()
model_4 = smf.logit(f4, data=train_data).fit()

Optimization terminated successfully.
         Current function value: 0.448756
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.608686
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.455715
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.659220
         Iterations 5


In [77]:
# Display the summaries of the models
model_1.summary(),model_2.summary(), model_3.summary(), model_4.summary()

(<class 'statsmodels.iolib.summary.Summary'>
 """
                            Logit Regression Results                           
 Dep. Variable:               Survived   No. Observations:                  712
 Model:                          Logit   Df Residuals:                      704
 Method:                           MLE   Df Model:                            7
 Date:                Sun, 12 May 2024   Pseudo R-squ.:                  0.3285
 Time:                        15:54:11   Log-Likelihood:                -319.51
 converged:                       True   LL-Null:                       -475.84
 Covariance Type:            nonrobust   LLR p-value:                 1.196e-63
                           coef    std err          z      P>|z|      [0.025      0.975]
 ---------------------------------------------------------------------------------------
 Intercept               5.3596      0.559      9.586      0.000       4.264       6.456
 Sex_male[T.True]       -2.7187      0.223 
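
The numbers shown in the fitting log and in the visible part of the model 1 summary can be cross-checked by hand. The sketch below assumes that "Current function value" is the average negative log-likelihood per observation and that "Pseudo R-squ." is McFadden's pseudo R-squared; all inputs are copied from the output above.

```python
import math

n_obs = 712                 # No. Observations
mean_neg_loglik = 0.448756  # "Current function value" for model_1
ll_model = -319.51          # Log-Likelihood
ll_null = -475.84           # LL-Null
coef_sex_male = -2.7187     # coefficient of Sex_male[T.True]

# Average negative log-likelihood times the number of rows gives the total
total_loglik = -mean_neg_loglik * n_obs

# McFadden's pseudo R-squared compares the fitted model with the intercept-only model
pseudo_r2 = 1 - ll_model / ll_null

# Exponentiating a logit coefficient gives an odds ratio
odds_ratio_male = math.exp(coef_sex_male)

print(round(total_loglik, 2))     # -319.51
print(round(pseudo_r2, 4))        # 0.3285
print(round(odds_ratio_male, 3))  # 0.066
```

Read this way, the coefficient for `Sex_male` implies that, with the other variables held fixed, the odds of survival for men are roughly 0.066 times the odds for women.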

In [29]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# What evaluation metric do you prefer?
## Accuracy
Measures the overall correctness of the model (i.e., the ratio of correct predictions to the total number of samples).
## Precision (Positive Predictive Value)
Measures the ratio of correct positive predictions to the total predicted positives. It answers the question: Of all samples labeled as positive, how many actually belong to the positive class?
## Recall (Sensitivity, True Positive Rate)
Measures the ratio of correct positive predictions to all actual positives. It answers the question: Of all samples that are actually positive, how many did the model find?
## F1-Score
The harmonic mean of precision and recall. It is used to balance the trade-off between precision and recall, particularly when you have class imbalance.
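
To show how the four definitions relate to the confusion-matrix counts, here is a minimal pure-Python sketch. The function `classification_metrics` and the toy labels are made up for illustration; in the notebook itself the scikit-learn functions are used instead.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall, "F1-Score": f1}

# Toy labels (not from the Titanic data)
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 1]

toy_metrics = classification_metrics(y_true_toy, y_pred_toy)
print({name: round(value, 3) for name, value in toy_metrics.items()})
# {'Accuracy': 0.625, 'Precision': 0.6, 'Recall': 0.75, 'F1-Score': 0.667}
```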

In [None]:
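# NOTE: y_pred_train and y_pred_test are not defined earlier in the notebook.
# Assumption: they are model_1's predicted probabilities thresholded at 0.5.
y_pred_train = (model_1.predict(X_train) > 0.5).astype(int)
y_pred_test = (model_1.predict(X_test) > 0.5).astype(int)

# Calculate metrics for the training set
accuracy_train = accuracy_score(y_train, y_pred_train)
precision_train = precision_score(y_train, y_pred_train)
recall_train = recall_score(y_train, y_pred_train)
f1_train = f1_score(y_train, y_pred_train)

# Calculate metrics for the testing set
accuracy_test = accuracy_score(y_test, y_pred_test)
precision_test = precision_score(y_test, y_pred_test)
recall_test = recall_score(y_test, y_pred_test)
f1_test = f1_score(y_test, y_pred_test)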

In [68]:
# Evaluation helper for models fitted with the statsmodels formula API

def evaluate_model_performance(model, X, y_true):
    # Formula-based models rebuild the design matrix (including the intercept)
    # from the named columns of X, so no constant column has to be added here
    predictions_prob = model.predict(X)
    predictions = (predictions_prob > 0.5).astype(int)  # Binarize predictions based on threshold of 0.5
    
    # Calculate metrics
    accuracy = accuracy_score(y_true, predictions)
    precision = precision_score(y_true, predictions)
    recall = recall_score(y_true, predictions)
    f1 = f1_score(y_true, predictions)
    
    return {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1
    }
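
The 0.5 cut-off used in the function above is a choice, not a fixed rule. The sketch below uses made-up probabilities and labels (not the Titanic data) to show how moving the threshold trades precision against recall; `binarize` and `precision_recall` are hypothetical helpers.

```python
def binarize(probabilities, threshold=0.5):
    """Turn predicted probabilities into 0/1 class labels."""
    return [1 if p > threshold else 0 for p in probabilities]

def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical predicted probabilities and true labels
probs = [0.95, 0.80, 0.65, 0.55, 0.45, 0.30, 0.20, 0.10]
truth = [1, 1, 0, 1, 1, 0, 0, 0]

for threshold in (0.3, 0.5, 0.7):
    preds = binarize(probs, threshold)
    precision, recall = precision_recall(truth, preds)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
# threshold=0.3: precision=0.80, recall=1.00
# threshold=0.5: precision=0.75, recall=0.75
# threshold=0.7: precision=1.00, recall=0.50
```

A lower threshold labels more passengers as survivors, which raises recall at the cost of precision; a higher threshold does the opposite.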

In [69]:
# Evaluate each model
metrics_model_1 = evaluate_model_performance(model_1, X_test, y_test)
metrics_model_2 = evaluate_model_performance(model_2, X_test, y_test)
metrics_model_3 = evaluate_model_performance(model_3, X_test, y_test)
metrics_model_4 = evaluate_model_performance(model_4, X_test, y_test)

In [38]:
metrics_model_1

{'Accuracy': 0.7932960893854749,
 'Precision': 0.7,
 'Recall': 0.7538461538461538,
 'F1-Score': 0.725925925925926}

In [34]:
metrics_model_2

{'Accuracy': 0.7206703910614525,
 'Precision': 0.6415094339622641,
 'Recall': 0.5230769230769231,
 'F1-Score': 0.5762711864406779}

In [35]:
metrics_model_3

{'Accuracy': 0.7821229050279329,
 'Precision': 0.6911764705882353,
 'Recall': 0.7230769230769231,
 'F1-Score': 0.7067669172932332}

In [36]:
metrics_model_4

{'Accuracy': 0.6424581005586593,
 'Precision': 0.5333333333333333,
 'Recall': 0.12307692307692308,
 'F1-Score': 0.2}

# How did you evaluate model performance?
Model 1 outperforms the other models on every test-set metric.

**Highest Accuracy:** Model 1 has the highest accuracy (0.793), meaning it correctly predicts the largest share of all outcomes.

**Highest Precision:** Model 1 also leads in precision (0.700), so a larger share of the passengers it predicts as survivors actually survived.

**Highest Recall:** Model 1 shows the highest recall (0.754), which means it identifies the largest share of actual survivors.

**Highest F1-Score:** Model 1 again has the highest F1-score (0.726), balancing precision and recall effectively.
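
The comparison can be checked mechanically. The sketch below copies the test-set metrics reported above (rounded to four decimals) into a dictionary and looks up the best model for each metric; the variable names are illustrative.

```python
# Test-set metrics as reported in the cells above, rounded to four decimals
test_metrics = {
    "model_1": {"Accuracy": 0.7933, "Precision": 0.7000, "Recall": 0.7538, "F1-Score": 0.7259},
    "model_2": {"Accuracy": 0.7207, "Precision": 0.6415, "Recall": 0.5231, "F1-Score": 0.5763},
    "model_3": {"Accuracy": 0.7821, "Precision": 0.6912, "Recall": 0.7231, "F1-Score": 0.7068},
    "model_4": {"Accuracy": 0.6425, "Precision": 0.5333, "Recall": 0.1231, "F1-Score": 0.2000},
}

metric_names = ["Accuracy", "Precision", "Recall", "F1-Score"]

# For each metric, find the model with the highest value
best_per_metric = {
    metric: max(test_metrics, key=lambda name: test_metrics[name][metric])
    for metric in metric_names
}

print(best_per_metric)
# {'Accuracy': 'model_1', 'Precision': 'model_1', 'Recall': 'model_1', 'F1-Score': 'model_1'}
```

Model 3, which drops the two family-size variables, comes a close second on every metric, while model 4, which uses only those two variables, has very low recall.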

# How did you diagnose the model? 

In [70]:
# The formula API adds the intercept itself, so X_train can be passed to predict directly
train_predictions_prob = model_1.predict(X_train)
train_predictions = (train_predictions_prob > 0.5).astype(int)


In [71]:
# Same for the test set: the intercept is handled by the formula
test_predictions_prob = model_1.predict(X_test)
test_predictions = (test_predictions_prob > 0.5).astype(int)


In [73]:
# Metrics for training data
train_accuracy = accuracy_score(y_train, train_predictions)
train_precision = precision_score(y_train, train_predictions)
train_recall = recall_score(y_train, train_predictions)
train_f1 = f1_score(y_train, train_predictions)

# Metrics for testing data
test_accuracy = accuracy_score(y_test, test_predictions)
test_precision = precision_score(y_test, test_predictions)
test_recall = recall_score(y_test, test_predictions)
test_f1 = f1_score(y_test, test_predictions)


# Is it overfitting, underfitting, or a good fit?

In [75]:
# Print metrics
print("Training Metrics:")
print(f"Accuracy: {train_accuracy}, Precision: {train_precision}, Recall: {train_recall}, F1-Score: {train_f1}")

print("\nTesting Metrics:")
print(f"Accuracy: {test_accuracy}, Precision: {test_precision}, Recall: {test_recall}, F1-Score: {test_f1}")

print('The model appears to be well-fitted. It shows no signs of overfitting or underfitting, as evidenced by the very similar performance metrics across training and testing datasets.')

Training Metrics:
Accuracy: 0.7935393258426966, Precision: 0.75, Recall: 0.703971119133574, F1-Score: 0.7262569832402234

Testing Metrics:
Accuracy: 0.7932960893854749, Precision: 0.7, Recall: 0.7538461538461538, F1-Score: 0.725925925925926
The model appears to be well-fitted. It shows no signs of overfitting or underfitting, as evidenced by the very similar performance metrics across training and testing datasets.
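
One simple way to back up this diagnosis is to look at the gap between training and testing values for each metric. The sketch below uses the figures printed above, rounded to four decimals; a large positive gap would point towards overfitting.

```python
# Metrics printed above, rounded to four decimals
train = {"Accuracy": 0.7935, "Precision": 0.7500, "Recall": 0.7040, "F1-Score": 0.7263}
test = {"Accuracy": 0.7933, "Precision": 0.7000, "Recall": 0.7538, "F1-Score": 0.7259}

# Positive gap: better on training data; negative gap: better on test data
gaps = {metric: round(train[metric] - test[metric], 4) for metric in train}
largest_gap = max(gaps, key=lambda metric: abs(gaps[metric]))

print(gaps)
# {'Accuracy': 0.0002, 'Precision': 0.05, 'Recall': -0.0498, 'F1-Score': 0.0004}
print(largest_gap)  # Precision
```

Accuracy and F1-score are almost identical across the two sets, while precision and recall move by about five points in opposite directions.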


# What are your model's results? Is it good? Do you have any concerns?
The model appears to be performing well, with no major signs of overfitting or underfitting.

The model's performance is evaluated using several metrics that show how accurately it predicts outcomes on both the training and testing sets. In training, it achieved an accuracy of 79.35%, a precision of 75.00%, a recall of 70.40%, and an F1-score of 72.63%. In testing, it maintained consistent performance with an accuracy of 79.33%, a precision of 70.00%, a recall of 75.38%, and an F1-score of 72.59%. These metrics suggest that the model generalizes well to unseen data and strikes a balance between how often its positive predictions are correct (precision) and how many actual survivors it finds (recall).

# Do you have any concerns?
One concern is the slight difference in precision between the training (0.75) and testing (0.70) sets, which is mirrored by a rise in recall from 0.70 to 0.75.
While the difference is not large, it is good practice to monitor such changes, as they might indicate how the model will perform as more data is introduced or in different operational settings. It might also be beneficial to look into model tuning or regularization techniques to limit overfitting, so that the model remains generalizable and robust.
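
As a rough illustration of what regularization adds, the sketch below computes the average log-loss with an L2 penalty on the coefficients. This is one common form of penalty, not what was fitted in this notebook; the function name `penalized_log_loss`, the toy inputs, and the penalty strength `alpha` are all hypothetical.

```python
import math

def penalized_log_loss(y_true, probs, coefficients, alpha):
    """Average negative log-likelihood plus an L2 penalty on the coefficients."""
    nll = -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, probs)
    ) / len(y_true)
    penalty = alpha * sum(c ** 2 for c in coefficients)
    return nll + penalty

# Toy inputs: two observations predicted at 0.5, two hypothetical coefficients
loss_plain = penalized_log_loss([1, 0], [0.5, 0.5], [1.0, -2.0], alpha=0.0)
loss_penalized = penalized_log_loss([1, 0], [0.5, 0.5], [1.0, -2.0], alpha=0.1)

print(round(loss_plain, 4))      # 0.6931
print(round(loss_penalized, 4))  # 1.1931
```

Because large coefficients increase the penalized loss, the optimizer is pushed towards smaller coefficients, which tends to reduce the gap between training and testing performance.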