# Titanic Survival Prediction

### Introduction
This project uses the Titanic dataset to predict passenger survival with two classification models, Logistic Regression and Random Forest. The goal is to identify which passenger attributes are associated with survival and to predict survival outcomes on held-out data.

### Dataset Overview
- Source: [Kaggle Titanic Dataset](https://www.kaggle.com/datasets/yasserh/titanic-dataset?select=Titanic-Dataset.csv)
- Key Features: `Pclass`, `Sex`, `Age`, `SibSp`, `Parch`, `Fare`, `Embarked`
- Target: `Survived` (1 = Survived, 0 = Did not survive)

In [None]:
# Load necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

In [None]:
# Load the Titanic dataset
data_path = '/mnt/data/Titanic-Dataset.csv'
data = pd.read_csv(data_path)
data.head()

### Data Preprocessing

In [None]:
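# Drop columns not used for modeling
data = data.drop(columns=['Cabin', 'Name', 'Ticket'])
# Fill missing values by assigning the result back; calling
# fillna(inplace=True) on a column selection is deprecated in recent pandas
data['Age'] = data['Age'].fillna(data['Age'].median())
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
# One-hot encode the categorical columns; drop_first=True drops one redundant level per column
data = pd.get_dummies(data, columns=['Sex', 'Embarked'], drop_first=True)
data.head()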
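
To make the preprocessing step easier to inspect, here is a minimal sketch on a hypothetical four-row frame (the values are made up for illustration and are not taken from the dataset). It shows median and mode imputation followed by one-hot encoding with `drop_first=True`.

```python
import pandas as pd

# Hypothetical toy frame mimicking a few Titanic columns (not real passenger data)
toy = pd.DataFrame({
    "Age": [22.0, None, 30.0, 40.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})

# Fill the missing age with the median and the missing port with the most frequent value
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# drop_first=True drops the first category of each encoded column
# ("female" for Sex, "C" for Embarked in this toy frame)
toy_encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)
print(toy_encoded.columns.tolist())
```

In this toy frame only two ports appear, so a single `Embarked_S` column remains. On the full dataset, which has three ports (C, Q, S), the same call would be expected to leave `Embarked_Q` and `Embarked_S`.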

### Exploratory Data Analysis

In [None]:
# Visualize survival distribution
sns.countplot(x='Survived', data=data)
plt.title('Survival Distribution')
plt.show()

In [None]:
# Survival by gender
# Sex_male is the one-hot column created during preprocessing (1/True = male, 0/False = female)
sns.countplot(x='Survived', hue='Sex_male', data=data)
plt.title('Survival by Gender')
plt.show()

In [None]:
# Survival by passenger class
sns.countplot(x='Survived', hue='Pclass', data=data)
plt.title('Survival by Passenger Class')
plt.show()
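
Count plots show raw counts, which can be hard to compare when groups differ in size. As a numeric complement, survival rates per group can be computed with `groupby`. The sketch below uses hypothetical rows (not real data) with the notebook's column names. In the notebook itself, the same two `groupby` lines could be run on `data`.

```python
import pandas as pd

# Hypothetical toy rows (not real passenger data) using the notebook's column names
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1, 1, 0],
    "Sex_male": [0, 1, 0, 1, 1, 0, 1, 0],
    "Pclass":   [1, 3, 2, 3, 1, 1, 2, 3],
})

# The mean of a 0/1 column is the proportion of ones, i.e. the survival rate
rate_by_sex = toy.groupby("Sex_male")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```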

### Modeling and Evaluation

In [None]:
# Define features and target
X = data.drop(columns=['PassengerId', 'Survived'])
y = data['Survived']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
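
The split above is a plain random split. One optional variation, which is not what the notebook does, is to pass `stratify=y` so that the training and test sets keep the same class proportions. A minimal sketch on hypothetical labels (`X_toy` and `y_toy` are illustrative names):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 zeros and 40 ones
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 60/40 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Stratifying tends to matter most for small or imbalanced datasets, where a random split can leave the test set with a noticeably different class mix.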

In [None]:
# Train Logistic Regression
log_model = LogisticRegression(max_iter=1000, random_state=42)
log_model.fit(X_train, y_train)
y_pred_log = log_model.predict(X_test)

In [None]:
# Train Random Forest Classifier
rf_model = RandomForestClassifier(random_state=42, n_estimators=100)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

In [None]:
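# Evaluate both models on the held-out test set
print('Logistic Regression Report')
print(f'Accuracy: {accuracy_score(y_test, y_pred_log):.3f}')
print(classification_report(y_test, y_pred_log))
print('\nRandom Forest Report')
print(f'Accuracy: {accuracy_score(y_test, y_pred_rf):.3f}')
print(classification_report(y_test, y_pred_rf))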
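
A single train/test split gives one estimate per model, and small differences between models can come down to which rows landed in the test set. A hedged sketch of comparing the two models with 5-fold cross-validation follows. It uses synthetic data from `make_classification` as a stand-in, so the numbers it prints say nothing about the Titanic results. In the notebook, `X` and `y` could be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the Titanic dataset)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Score each model on 5 folds and report the mean and spread of accuracy
cv_scores = {}
for name, model in models.items():
    cv_scores[name] = cross_val_score(model, X_syn, y_syn, cv=5, scoring="accuracy")
    print(f"{name}: mean={cv_scores[name].mean():.3f}, std={cv_scores[name].std():.3f}")
```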

### Insights and Impact
1. **Key Factors Influencing Survival**:
   - Gender: Female passengers had a higher survival rate than male passengers.
   - Passenger Class: Survival rates were highest in first class (`Pclass` = 1) and lowest in third class (`Pclass` = 3).
2. **Model Performance**:
   - Random Forest slightly outperformed Logistic Regression on the held-out test set (20% of the data).
3. **Impact**:
   - Social: Highlights historical inequalities in survival.
   - Ethical: Use data responsibly to avoid discrimination.
   - Business: Inform modern safety protocols.
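
One way to check the key factors listed in item 1 against a fitted model is to inspect the Random Forest's `feature_importances_`. The sketch below is a minimal illustration on synthetic data with hypothetical column names (`feature_0` to `feature_5`). In the notebook, `rf_model.feature_importances_` paired with `X.columns` could be used the same way.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names (not the Titanic dataset)
feature_names = [f"feature_{i}" for i in range(6)]
X_arr, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
X_syn = pd.DataFrame(X_arr, columns=feature_names)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_syn, y_syn)

# Impurity-based importances are normalized to sum to 1; larger means more influential
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking rather than a precise measure of effect.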