# Puneeth Nunna
## Project Title: Titanic Classification Project  

**Class:** Summer Machine Learning Internship 2025 (STEMPEERS)  
**Instructor:** Bhishan Poudel  
**Programmer:** Puneeth Nunna  
**Date:** August 2025  

### Introduction

This final project for the Horizon Quest Internship Program explores a real-world classification problem using machine learning in Python. The goal is to predict whether a passenger survived the Titanic disaster based on features such as age, sex, class, fare, and more.

To solve this problem, I implemented and compared three different machine learning models:  
1. Logistic Regression  
2. Random Forest Classifier
3. Decision Tree Classifier   

Each model was trained on a preprocessed dataset and evaluated for accuracy, precision, recall, and other performance metrics. This project not only tested my understanding of data preprocessing and model training but also gave me practical experience in comparing different algorithms for a classic binary classification task.


# Loading the Titanic Dataset from Seaborn

In [26]:
import seaborn as sns

df = sns.load_dataset('titanic') # Loading the dataset from Seaborn Library
df.head(10) # Display the first 10 rows of the dataset for review

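|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |
| 5 | 0 | 3 | male | NaN | 0 | 0 | 8.4583 | Q | Third | man | True | NaN | Queenstown | no | True |
| 6 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | S | First | man | True | E | Southampton | no | True |
| 7 | 0 | 3 | male | 2.0 | 3 | 1 | 21.075 | S | Third | child | False | NaN | Southampton | no | False |
| 8 | 1 | 3 | female | 27.0 | 0 | 2 | 11.1333 | S | Third | woman | False | NaN | Southampton | yes | False |
| 9 | 1 | 2 | female | 14.0 | 1 | 0 | 30.0708 | C | Second | child | False | NaN | Cherbourg | yes | False |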
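
The preview already hints at the missing values handled later in the notebook: row 5 has no age, and most deck entries are missing. A minimal sketch, using only the ten `age` and `deck` values visible in the preview (not the full dataset), of how the gaps can be counted and how median imputation fills them:

```python
import numpy as np
import pandas as pd

# Ages and decks copied from the ten preview rows; NaN marks the missing cells.
preview = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0],
    "deck": [np.nan, "C", np.nan, "C", np.nan, np.nan, "E", np.nan, np.nan, np.nan],
})

# Count missing values per column.
missing_counts = preview.isna().sum()
print(missing_counts)

# Median imputation: median() skips NaN, so it is computed from the nine known ages.
age_median = preview["age"].median()
preview["age"] = preview["age"].fillna(age_median)
print(age_median, preview.loc[5, "age"])
```

On the full dataset the counts and the median will differ; the point is only that a column with a few gaps (`age`) is a candidate for imputation, while a mostly empty one (`deck`) is a candidate for dropping.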


# 1. Logistic Regression Model

In [27]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler

# --- 1. Preprocessing the Data ---

# Handle missing 'age' values by filling with the median
# (assigning the result back avoids the chained inplace call that pandas warns about)
df['age'] = df['age'].fillna(df['age'].median())

# Handle missing 'embarked' and 'embark_town' values by filling with the mode
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])

# Drop the 'deck' column because it has too many missing values
df.drop('deck', axis=1, inplace=True)

# Convert categorical variables into dummy/indicator variables
df = pd.get_dummies(df, columns=['sex', 'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alive', 'alone'], drop_first=True)

# --- 2. Splitting Data into Training and Testing Sets ---

# Define features (X) and target (y)
X = df.drop(['survived', 'alive_yes'], axis=1) # alive_yes is a proxy for survived
y = df['survived']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 3. Feature Scaling ---

# Standardize the features; the scaler is fit on the training set only so that
# no information from the test set leaks into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# --- 4. Training the Logistic Regression Model ---

# Create and train the model
log_reg = LogisticRegression(max_iter=1000) # Increased max_iter for convergence
log_reg.fit(X_train, y_train)

# --- 5. Making Predictions and Evaluating the Model ---

# Make predictions on the test set
y_pred = log_reg.predict(X_test)


# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Display the confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Display the classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


# Get user input for a new passenger
pclass = int(input("Enter passenger class (1, 2, or 3): "))
sex = input("Enter sex (male or female): ")
age = float(input("Enter age: "))
sibsp = int(input("Enter number of siblings/spouses aboard: "))
parch = int(input("Enter number of parents/children aboard: "))
fare = float(input("Enter fare: "))
embarked = input("Enter port of embarkation (C, Q, or S): ")
class_input = input("Enter class (First, Second, or Third): ")
who = input("Enter who (man, woman, or child): ")
adult_male = input("Is the person an adult male (True or False): ").lower() == 'true'
embark_town = input("Enter embark town: ")
alone = input("Is the person alone (True or False): ").lower() == 'true'


# Create a new DataFrame from the user input
new_passenger = pd.DataFrame({
    'pclass': [pclass],
    'sex': [sex],
    'age': [age],
    'sibsp': [sibsp],
    'parch': [parch],
    'fare': [fare],
    'embarked': [embarked],
    'class': [class_input],
    'who': [who],
    'adult_male': [adult_male],
    'embark_town': [embark_town],
    'alone': [alone]
})

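# Preprocess the new passenger data in the same way as the training data.
# drop_first is omitted here on purpose: a single row has only one category per
# column, so drop_first=True would drop every dummy column and the reindex below
# would fill all of them with 0.
new_passenger = pd.get_dummies(new_passenger, columns=['sex', 'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alone'])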

# Align the columns of the new passenger data with the training data
# This ensures that the new data has the same dummy variables as the training data
new_passenger = new_passenger.reindex(columns=X.columns, fill_value=0)

# Scale the new passenger data using the same scaler
new_passenger_scaled = scaler.transform(new_passenger)

# Make a prediction
prediction = log_reg.predict(new_passenger_scaled)

# Print the prediction
if prediction[0] == 1:
    print("\nPrediction: The passenger would have survived.")
else:
    print("\nPrediction: The passenger would not have survived.")

Accuracy: 0.81

Confusion Matrix:
[[91 14]
 [20 54]]

Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.87      0.84       105
           1       0.79      0.73      0.76        74

    accuracy                           0.81       179
   macro avg       0.81      0.80      0.80       179
weighted avg       0.81      0.81      0.81       179

Enter passenger class (1, 2, or 3): 3
Enter sex (male or female): male
Enter age: 2.0
Enter number of siblings/spouses aboard: 3
Enter number of parents/children aboard: 1
Enter fare: 21.0750
Enter port of embarkation (C, Q, or S): S
Enter class (First, Second, or Third): Third
Enter who (man, woman, or child): child
Is the person an adult male (True or False): False
Enter embark town: Southampton
Is the person alone (True or False): False

Prediction: The passenger would have survived.
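
One detail worth checking in the single-passenger encoding is how `drop_first=True` behaves when a frame has only one row. Each categorical column then has exactly one category, and dropping the first category leaves no dummy column at all. A minimal sketch of the effect; `train_columns` is a hypothetical, shortened stand-in for `X.columns`:

```python
import pandas as pd

# Hypothetical subset of the training columns, for illustration only.
train_columns = ["pclass", "sex_male", "embarked_Q", "embarked_S"]

row = pd.DataFrame({"pclass": [3], "sex": ["male"], "embarked": ["S"]})

# With drop_first=True the only category of each column is dropped,
# so reindex has nothing to align and fills the dummies with 0.
dropped = pd.get_dummies(row, columns=["sex", "embarked"], drop_first=True)
dropped = dropped.reindex(columns=train_columns, fill_value=0)

# Without drop_first the dummies exist, and reindex keeps the ones
# the model was trained on while discarding the rest.
kept = pd.get_dummies(row, columns=["sex", "embarked"])
kept = kept.reindex(columns=train_columns, fill_value=0)

print(dropped.iloc[0].to_dict())
print(kept.iloc[0].to_dict())
```

In the first result the male passenger is encoded with `sex_male` equal to 0, which the model would read as female; the second keeps `sex_male` and `embarked_S` set.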


# 2. Random Forest Classifier

In [28]:
from sklearn.ensemble import RandomForestClassifier

# --- 1. Training the Random Forest Model ---
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# --- 2. Making Predictions and Evaluating the Model ---
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model's accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf:.2f}")

# Display the confusion matrix
print("\nRandom Forest Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))

# Display the classification report
print("\nRandom Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))

# --- 3. Get user input for a new passenger ---
pclass = int(input("Enter passenger class (1, 2, or 3): "))
sex = input("Enter sex (male or female): ")
age = float(input("Enter age: "))
sibsp = int(input("Enter number of siblings/spouses aboard: "))
parch = int(input("Enter number of parents/children aboard: "))
fare = float(input("Enter fare: "))
embarked = input("Enter port of embarkation (C, Q, or S): ")
class_input = input("Enter class (First, Second, or Third): ")
who = input("Enter who (man, woman, or child): ")
adult_male = input("Is the person an adult male (True or False): ").lower() == 'true'
embark_town = input("Enter embark town: ")
alone = input("Is the person alone (True or False): ").lower() == 'true'

# Create a new DataFrame from the user input
new_passenger_rf = pd.DataFrame({
    'pclass': [pclass],
    'sex': [sex],
    'age': [age],
    'sibsp': [sibsp],
    'parch': [parch],
    'fare': [fare],
    'embarked': [embarked],
    'class': [class_input],
    'who': [who],
    'adult_male': [adult_male],
    'embark_town': [embark_town],
    'alone': [alone]
})

# Preprocess the new passenger data in the same way as the training data
# (drop_first is omitted: on a single row it would drop every dummy column)
new_passenger_rf = pd.get_dummies(new_passenger_rf, columns=['sex', 'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alone'])

# Align the columns of the new passenger data with the training data
new_passenger_rf = new_passenger_rf.reindex(columns=X.columns, fill_value=0)

# Scale the new passenger data using the same scaler
new_passenger_rf_scaled = scaler.transform(new_passenger_rf)

# Make a prediction
prediction_rf = rf_model.predict(new_passenger_rf_scaled)

# Print the prediction
if prediction_rf[0] == 1:
    print("\nPrediction: The passenger would have survived.")
else:
    print("\nPrediction: The passenger would not have survived.")

Random Forest Accuracy: 0.82

Random Forest Confusion Matrix:
[[90 15]
 [17 57]]

Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.86      0.85       105
           1       0.79      0.77      0.78        74

    accuracy                           0.82       179
   macro avg       0.82      0.81      0.81       179
weighted avg       0.82      0.82      0.82       179

Enter passenger class (1, 2, or 3): 3
Enter sex (male or female): male
Enter age: 2.0
Enter number of siblings/spouses aboard: 3
Enter number of parents/children aboard: 1
Enter fare: 21.0750
Enter port of embarkation (C, Q, or S): S
Enter class (First, Second, or Third): Third
Enter who (man, woman, or child): child
Is the person an adult male (True or False): False
Enter embark town: Southampton
Is the person alone (True or False): False

Prediction: The passenger would have survived.
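
The figures in the classification report can be traced back to the confusion matrix. A short worked sketch for the survivor class (label 1), using the Random Forest matrix printed above:

```python
import numpy as np

# Random Forest confusion matrix from the output above:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = np.array([[90, 15],
               [17, 57]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted survivors, the share who survived
recall = tp / (tp + fn)      # of the actual survivors, the share the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The rounded values agree with the class 1 row and the accuracy row of the report.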


# 3. Decision Tree Classifier

In [29]:
from sklearn.tree import DecisionTreeClassifier

# --- 1. Training the Decision Tree Model ---
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# --- 2. Making Predictions and Evaluating the Model ---
y_pred_dt = dt_model.predict(X_test)

# Evaluate the model's accuracy
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f"Decision Tree Accuracy: {accuracy_dt:.2f}")

# Display the confusion matrix
print("\nDecision Tree Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_dt))

# Display the classification report
print("\nDecision Tree Classification Report:")
print(classification_report(y_test, y_pred_dt))

# --- 3. Get user input for a new passenger ---
pclass = int(input("Enter passenger class (1, 2, or 3): "))
sex = input("Enter sex (male or female): ")
age = float(input("Enter age: "))
sibsp = int(input("Enter number of siblings/spouses aboard: "))
parch = int(input("Enter number of parents/children aboard: "))
fare = float(input("Enter fare: "))
embarked = input("Enter port of embarkation (C, Q, or S): ")
class_input = input("Enter class (First, Second, or Third): ")
who = input("Enter who (man, woman, or child): ")
adult_male = input("Is the person an adult male (True or False): ").lower() == 'true'
embark_town = input("Enter embark town: ")
alone = input("Is the person alone (True or False): ").lower() == 'true'

# Create a new DataFrame from the user input
new_passenger_dt = pd.DataFrame({
    'pclass': [pclass],
    'sex': [sex],
    'age': [age],
    'sibsp': [sibsp],
    'parch': [parch],
    'fare': [fare],
    'embarked': [embarked],
    'class': [class_input],
    'who': [who],
    'adult_male': [adult_male],
    'embark_town': [embark_town],
    'alone': [alone]
})

# Preprocess the new passenger data in the same way as the training data
# (drop_first is omitted: on a single row it would drop every dummy column)
new_passenger_dt = pd.get_dummies(new_passenger_dt, columns=['sex', 'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alone'])

# Align the columns of the new passenger data with the training data
new_passenger_dt = new_passenger_dt.reindex(columns=X.columns, fill_value=0)

# Scale the new passenger data using the same scaler
new_passenger_dt_scaled = scaler.transform(new_passenger_dt)

# Make a prediction
prediction_dt = dt_model.predict(new_passenger_dt_scaled)

# Print the prediction
if prediction_dt[0] == 1:
    print("\nPrediction: The passenger would have survived.")
else:
    print("\nPrediction: The passenger would not have survived.")

Decision Tree Accuracy: 0.75

Decision Tree Confusion Matrix:
[[79 26]
 [18 56]]

Decision Tree Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.75      0.78       105
           1       0.68      0.76      0.72        74

    accuracy                           0.75       179
   macro avg       0.75      0.75      0.75       179
weighted avg       0.76      0.75      0.76       179

Enter passenger class (1, 2, or 3): 3
Enter sex (male or female): male
Enter age: 2.0
Enter number of siblings/spouses aboard: 3
Enter number of parents/children aboard: 1
Enter fare: 21.0750
Enter port of embarkation (C, Q, or S): S
Enter class (First, Second, or Third): Third
Enter who (man, woman, or child): child
Is the person an adult male (True or False): False
Enter embark town: Southampton
Is the person alone (True or False): False

Prediction: The passenger would not have survived.
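
All three cells repeat the same input and encoding steps, so they could be factored into one helper. The sketch below is one possible shape, not the code that produced the outputs above. `encode_passenger` and `FEATURE_COLUMNS` are hypothetical names, and the column list is my assumption of what `X.columns` contains after the `get_dummies` call; in the notebook, passing `X.columns` directly would avoid that guess.

```python
import pandas as pd

CATEGORICAL = ["sex", "embarked", "class", "who", "adult_male", "embark_town", "alone"]

# Assumed to mirror X.columns in the notebook; pass X.columns there instead.
FEATURE_COLUMNS = [
    "pclass", "age", "sibsp", "parch", "fare",
    "sex_male", "embarked_Q", "embarked_S", "class_Second", "class_Third",
    "who_man", "who_woman", "adult_male_True",
    "embark_town_Queenstown", "embark_town_Southampton", "alone_True",
]


def encode_passenger(raw, feature_columns=FEATURE_COLUMNS):
    """One-hot encode a single passenger dict and align it to the training columns."""
    row = pd.DataFrame([raw])
    # No drop_first here: a single row has one category per column.
    encoded = pd.get_dummies(row, columns=CATEGORICAL)
    return encoded.reindex(columns=feature_columns, fill_value=0).astype(float)


# The example passenger entered in the cells above.
passenger = {
    "pclass": 3, "sex": "male", "age": 2.0, "sibsp": 3, "parch": 1, "fare": 21.075,
    "embarked": "S", "class": "Third", "who": "child", "adult_male": False,
    "embark_town": "Southampton", "alone": False,
}
encoded = encode_passenger(passenger)
print(encoded.iloc[0].to_dict())
```

Each model cell could then reduce to `model.predict(scaler.transform(encode_passenger(passenger, X.columns)))`, with the prompts collected once instead of three times.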


# Project Findings

This project compared three different machine learning models to predict passenger survival on the Titanic. Each model was trained on the same preprocessed dataset and evaluated for accuracy. Here are the key findings:

### 1. Logistic Regression
- **Accuracy:** 0.81
- The logistic regression model provided a solid baseline performance, correctly predicting the outcome for 81% of the passengers in the test set.
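
To put the 81% in context, it helps to compare it with a classifier that always predicts the majority class. A small worked sketch using only the support counts and confusion matrix reported for the test set:

```python
# Class supports from the classification report: 105 non-survivors, 74 survivors.
support = {0: 105, 1: 74}
total = sum(support.values())

# A classifier that always predicts the majority class (0) reaches this accuracy.
majority_baseline = max(support.values()) / total

# Logistic regression: correct predictions are the diagonal of its confusion matrix.
log_reg_accuracy = (91 + 54) / total

print(f"majority baseline:   {majority_baseline:.3f}")
print(f"logistic regression: {log_reg_accuracy:.3f}")
```

The model therefore improves on the trivial baseline by roughly 22 percentage points on this split.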

### 2. Random Forest Classifier
- **Accuracy:** 0.82
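- The Random Forest model slightly outperformed the logistic regression model, with an accuracy of 82% (147 of 179 test passengers correct, versus 145 for logistic regression). Its ability to capture non-linear relationships and feature interactions may explain the edge, although a gap this small could also come from the particular train/test split.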
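
A rough way to judge how much weight a one-point difference can bear is to put an interval around each accuracy. The sketch below uses the normal-approximation (Wald) interval, which is my choice for illustration; `accuracy_interval` is a hypothetical helper, and because both models were scored on the same 179 passengers a paired test would be the more rigorous comparison.

```python
import math


def accuracy_interval(correct, total, z=1.96):
    """Normal-approximation (Wald) interval for an accuracy estimate."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width


# Correct predictions are the diagonals of the two confusion matrices.
rf_low, rf_high = accuracy_interval(90 + 57, 179)
lr_low, lr_high = accuracy_interval(91 + 54, 179)

print(f"Random Forest:       {rf_low:.3f} to {rf_high:.3f}")
print(f"Logistic Regression: {lr_low:.3f} to {lr_high:.3f}")
```

Each interval is about plus or minus six percentage points wide and the two overlap almost entirely, so the test set alone does not separate the models.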

### 3. Decision Tree Classifier
- **Accuracy:** 0.75
- The Decision Tree model had the lowest accuracy of the three models, at 75%. It was also the only model to predict that the example passenger would not have survived. A single tree grown without a depth limit, as here, tends to overfit the training data, which is one plausible reason for its weaker test performance.
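
One way to check whether a tree is overfitting is to compare its training and test accuracy with and without a depth limit. The sketch below does this on synthetic data from `make_classification`, which is a stand-in I chose so the block runs on its own; the numbers it prints say nothing about the Titanic results, and `max_depth=3` is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (an assumption, not the Titanic set).
X_demo, y_demo = make_classification(
    n_samples=900, n_features=10, n_informative=5, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

results = {}
for depth in (None, 3):  # None grows the tree until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.2f} test={results[depth][1]:.2f}")
```

A large gap between training and test accuracy for the unlimited tree is the usual sign of overfitting; in the notebook the same comparison could be run with `X_train`, `X_test`, `y_train`, and `y_test`.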

### Conclusion

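Based on accuracy on the held-out test set, the **Random Forest Classifier** was the most accurate model for this prediction task (0.82), closely followed by logistic regression (0.81), with the decision tree behind both (0.75). Because the margin over logistic regression is small and comes from a single train/test split, the result suggests, rather than establishes, that Random Forest is the more reliable choice for this dataset.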
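
A natural follow-up is k-fold cross-validation, which averages accuracy over several splits instead of relying on one. The sketch below is a minimal version under stated assumptions: it uses synthetic data from `make_classification` so that it runs without downloading anything, five folds are an arbitrary choice, and its scores are not Titanic results. In the notebook, `X` and `y` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

models = {
    # The pipeline refits the scaler inside each fold, so no fold sees test statistics.
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the ranking of the three models holds across folds and the gaps exceed the fold-to-fold standard deviation, the conclusion rests on firmer ground than a single split can provide.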