# Project Overview

a.	Objective: Predict passenger survival on the Titanic using machine learning models, and identify the key factors 
    influencing survival.
b.	Data: Titanic passenger dataset, which originally includes features such as PassengerID, Name, Ticket, and Cabin.

# Workflow Steps

1.	Data Import & Setup
2.	Exploratory Data Analysis & Visualization
    -	Missing Values Check: Used a heatmap to visualize null entries and plan imputation.
    -	Pairwise Relationships: Created pair plots (sns.pairplot) colored by survival status for quick insight into 
        distributions and correlations.
    -	Categorical Analysis: Explored survival by passenger class and passenger age group (Child, Teen, Adult, 
        Middle-Aged, Senior).
    -	Violin & Pie Charts: Showed the distribution of Fare within each age group and the proportion of 
        passengers in each age category.
3.	Statistical Tests
    -	Chi-Square Tests: Investigated relationships between survival and categorical features (Sex, Embarked, Pclass).
    -	T-Tests: Compared means of numeric variables (Age, Fare) between survivors and non-survivors.
    -	Correlation Analysis: Calculated Pearson correlations among numeric columns (Age, Fare, Survived) to highlight 
        significant relationships.
4.	Data Preprocessing & Feature Engineering
    -	Imputation, Scaling & Encoding: Imputed missing values, then applied StandardScaler to numeric 
        features and OneHotEncoder to categorical features in a combined ColumnTransformer pipeline.
    -	Train–Test Split: Divided data into training and testing sets (80/20 split) with a fixed random seed for 
        reproducibility.
5.	Model Development: Logistic Regression, Decision Tree with Grid Search, and Random Forest
6.	Model Evaluation
    -	Accuracy & Classification Report: Generated accuracy, precision, recall, and F1-scores to gauge 
        performance; the per-class metrics matter most when the classes are imbalanced.
    -	Confusion Matrices, Cross-Validation, ROC & Precision–Recall Curves: Demonstrated performance at various thresholds 
        using AUC (area under the ROC curve) and average precision (useful for imbalanced classes).
    -	Best Model Findings: Compared performance metrics across Logistic Regression, Decision Tree, and Random Forest to 
        identify which model or set of parameters performed best.
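The EDA steps in the workflow above (missing-value check, age-group analysis) can be sketched on a small made-up table. This is only an illustration: the `toy` frame, its values, and the age-bin edges are assumptions, since the notebook does not state where the Child/Teen/Adult/Middle-Aged/Senior boundaries fall.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "age": [4, 15, 22, 38, 47, 63, np.nan, 29],
    "fare": [21.0, 7.9, 7.25, 71.3, 26.0, 30.0, 8.05, 13.0],
    "survived": [1, 0, 0, 1, 0, 1, 0, 1],
})

# Count missing values per column; a null heatmap visualizes the same boolean mask.
missing_counts = toy.isnull().sum()

# Assumed bin edges (not from the notebook): Child <13, Teen 13-19, Adult 20-39,
# Middle-Aged 40-59, Senior 60+. right=False makes each bin include its left edge.
age_bins = [0, 13, 20, 40, 60, np.inf]
age_labels = ["Child", "Teen", "Adult", "Middle-Aged", "Senior"]
toy["age_group"] = pd.cut(toy["age"], bins=age_bins, labels=age_labels, right=False)

# Survival rate per age group; rows with a missing age fall into no group.
survival_by_group = toy.groupby("age_group", observed=False)["survived"].mean()

print(missing_counts)
print(survival_by_group)
```

On the real data, the same `age_group` column could feed the violin and pie charts, for example as the `x` variable of a Fare violin plot.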

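The statistical tests listed in step 3 are not shown in the code cell, so here is a minimal sketch of how they might be run with `scipy.stats` and pandas. The data is synthetic, the column names are placeholders, and the choice of Welch's t-test (`equal_var=False`) is an assumption; the notebook only says "T-Tests".

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data with a built-in signal: survival depends on sex (assumed for the demo).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "sex": rng.choice(["male", "female"], size=n),
    "age": rng.normal(30, 12, size=n).clip(1, 80),
    "fare": rng.gamma(2.0, 15.0, size=n),
})
p_survive = np.where(toy["sex"] == "female", 0.75, 0.2)
toy["survived"] = rng.binomial(1, p_survive)

# Chi-square test of independence between a categorical feature and survival.
contingency = pd.crosstab(toy["sex"], toy["survived"])
chi2, chi2_p, dof, _ = stats.chi2_contingency(contingency)

# Welch's t-test comparing mean fare of survivors and non-survivors.
survivors = toy.loc[toy["survived"] == 1, "fare"]
non_survivors = toy.loc[toy["survived"] == 0, "fare"]
t_stat, t_p = stats.ttest_ind(survivors, non_survivors, equal_var=False)

# Pearson correlation matrix among the numeric columns.
corr_matrix = toy[["age", "fare", "survived"]].corr(method="pearson")

print(f"chi-square: chi2={chi2:.2f}, p={chi2_p:.4g}, dof={dof}")
print(f"t-test (fare): t={t_stat:.2f}, p={t_p:.4g}")
print(corr_matrix.round(3))
```

A small p-value in the chi-square test suggests survival is not independent of the categorical feature; the same pattern applies to Embarked and passenger class.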

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset
file_path = "titanic_survival.csv"  # Ensure the correct file path is used
df = pd.read_csv(file_path)

# Drop identifier-like columns if present (errors='ignore' skips any that are missing)
df = df.drop(columns=['passengerId', 'name', 'ticket', 'cabin'], errors='ignore')

# Handling missing values
num_imputer = SimpleImputer(strategy='mean')
cat_imputer = SimpleImputer(strategy='most_frequent')

# Encoding categorical variables
encoder = OneHotEncoder(handle_unknown='ignore')

# Standardization
scaler = StandardScaler()

# Define preprocessing for numerical and categorical features
# (column names are lowercase in the loaded DataFrame; 'class' holds the passenger class)
num_features = ['age', 'fare']
cat_features = ['sex', 'embarked', 'class']

preprocessor = ColumnTransformer([
    ('num', Pipeline([('imputer', num_imputer), ('scaler', scaler)]), num_features),
    ('cat', Pipeline([('imputer', cat_imputer), ('encoder', encoder)]), cat_features)
])

# Split data (columns not listed above are dropped by the ColumnTransformer)
X = df.drop(columns='survived')
y = df['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression Model
log_reg = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])
log_reg.fit(X_train, y_train)
y_pred_log = log_reg.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))
print(classification_report(y_test, y_pred_log))

# Decision Tree Classifier with Hyperparameter Tuning
param_grid = {'classifier__max_depth': [3, 5, 10], 'classifier__min_samples_split': [2, 5, 10]}
dt_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42))
])
grid_search = GridSearchCV(dt_pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

dt_best_model = grid_search.best_estimator_
y_pred_dt = dt_best_model.predict(X_test)

print("Decision Tree Best Parameters:", grid_search.best_params_)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print(classification_report(y_test, y_pred_dt))

# Cross-validation for Logistic Regression
cv_scores = cross_val_score(log_reg, X, y, cv=5, scoring='accuracy')
print("Logistic Regression Cross-validation Accuracy:", np.mean(cv_scores))

# Confusion Matrices (rows: actual class, columns: predicted class)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.heatmap(confusion_matrix(y_test, y_pred_log), annot=True, fmt='d', cmap='Blues')
plt.title('Logistic Regression Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.subplot(1, 2, 2)
sns.heatmap(confusion_matrix(y_test, y_pred_dt), annot=True, fmt='d', cmap='Blues')
plt.title('Decision Tree Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()


FileNotFoundError: [Errno 2] No such file or directory: 'Titanic.csv'
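
The overview lists a Random Forest under Model Development, but the cell above fits only Logistic Regression and a Decision Tree. The following is a hedged sketch of how a Random Forest might slot into the same kind of preprocessing pipeline. It runs on synthetic data; the column names, the `toy` frame, and `n_estimators=200` are assumptions, not settings taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with a few missing ages (all values are made up).
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({
    "age": rng.normal(30, 12, size=n).clip(1, 80),
    "fare": rng.gamma(2.0, 15.0, size=n),
    "sex": rng.choice(["male", "female"], size=n),
})
toy.loc[rng.choice(n, size=20, replace=False), "age"] = np.nan
toy["survived"] = rng.binomial(1, np.where(toy["sex"] == "female", 0.75, 0.2))

# Same preprocessing pattern as the notebook: impute + scale numerics, impute + one-hot categoricals.
preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scaler", StandardScaler())]), ["age", "fare"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("encoder", OneHotEncoder(handle_unknown="ignore"))]), ["sex"]),
])

# Random Forest as the final pipeline step (hyperparameters are illustrative).
rf_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=200, random_state=42)),
])

X = toy.drop(columns="survived")
y = toy["survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf_pipeline.fit(X_train, y_train)
rf_predictions = rf_pipeline.predict(X_test)
rf_accuracy = rf_pipeline.score(X_test, y_test)  # mean accuracy on the held-out set
print("Random Forest Accuracy:", rf_accuracy)
```

Because the forest sits inside a `Pipeline`, it can be tuned with `GridSearchCV` in the same way as the Decision Tree, using parameter names prefixed with `classifier__`.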

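The ROC and precision-recall curves named in the evaluation step do not appear in the code cell. A minimal sketch of how the underlying numbers could be computed with scikit-learn follows. It uses `make_classification` as a placeholder dataset; the class weights and model settings are assumptions chosen only to give a mildly imbalanced problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

# Placeholder binary dataset with mild class imbalance (not the Titanic data).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_clusters_per_class=1,
    weights=[0.62, 0.38], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Threshold-based metrics need scores, not hard labels: use the positive-class probability.
y_score = model.predict_proba(X_test)[:, 1]

# Points on the ROC and precision-recall curves, one per candidate threshold.
fpr, tpr, roc_thresholds = roc_curve(y_test, y_score)
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_score)

# Single-number summaries of each curve.
roc_auc = roc_auc_score(y_test, y_score)
avg_precision = average_precision_score(y_test, y_score)

print(f"ROC AUC: {roc_auc:.3f}")
print(f"Average precision: {avg_precision:.3f}")
```

Plotting `fpr` against `tpr`, and `recall` against `precision`, with matplotlib would give the two curves; with the notebook's pipelines, `predict_proba` can be called on the fitted pipeline directly.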
In [None]:
df

Unnamed: 0,sex,age,sibsp,parch,fare,embarked,class,who,alone,survived
0,male,22.0,1,0,7.2500,S,Third,man,False,0
1,female,38.0,1,0,71.2833,C,First,woman,False,1
2,female,26.0,0,0,7.9250,S,Third,woman,True,1
3,female,35.0,1,0,53.1000,S,First,woman,False,1
4,male,35.0,0,0,8.0500,S,Third,man,True,0
...,...,...,...,...,...,...,...,...,...,...
886,male,27.0,0,0,13.0000,S,Second,man,True,0
887,female,19.0,0,0,30.0000,S,First,woman,True,1
888,female,,1,2,23.4500,S,Third,woman,False,0
889,male,26.0,0,0,30.0000,C,First,man,True,1
