# Titanic Survival Prediction - Team Challenge
# 🚢 Predict Who Survived the Titanic Disaster

## Challenge Overview

**Objective**: Predict which passengers survived the Titanic shipwreck based on passenger data like age, sex, ticket class, etc.

**Historical Context**: 
The RMS Titanic sank on April 15, 1912 during her maiden voyage after colliding with an iceberg. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making it one of the deadliest peacetime maritime disasters in history. While there was some element of luck involved in surviving, some groups of people were more likely to survive than others.

**Your Mission**: 
Build a predictive model that answers the question: "What sorts of people were more likely to survive?" using passenger data (name, age, gender, socio-economic class, etc.).

**Dataset Features**: 
- **PassengerId** - Unique ID for each passenger
- **Pclass** - Ticket class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
- **Name** - Passenger name
- **Sex** - Gender
- **Age** - Age in years
- **SibSp** - Number of siblings/spouses aboard
- **Parch** - Number of parents/children aboard
- **Ticket** - Ticket number
- **Fare** - Passenger fare
- **Cabin** - Cabin number
- **Embarked** - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
- **Survived** - TARGET - Survival (0 = No, 1 = Yes)

## Challenge Rules & Guidelines

### Time Limit: 20 Minutes ⏰
Your goal is to improve upon this baseline implementation in 20 minutes.

### Evaluation Metric: Classification Accuracy
- Simple accuracy score (% correctly predicted)
- Higher is better (max = 1.0 or 100%)
- Binary classification: 0 = Died, 1 = Survived
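
As a quick illustration of the metric, accuracy is just the fraction of predictions that match the true labels. The sketch below uses made-up labels (not taken from the Titanic data) and hypothetical variable names; on the same inputs, `sklearn.metrics.accuracy_score` should return the same number.

```python
import numpy as np

# Made-up labels for illustration only (not from the Titanic data)
demo_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
demo_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Accuracy = number of matching predictions / total predictions
manual_accuracy = (demo_true == demo_pred).mean()
print(f"Accuracy: {manual_accuracy:.2f}")  # 6 of 8 match -> 0.75
```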

### Submission Requirements:
1. Generate predictions for the test set (0 or 1 for each passenger)
2. Create a CSV file with columns: 'PassengerId', 'Survived'  
3. Submit to Kaggle for official scoring!

https://www.kaggle.com/competitions/titanic/submissions 
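
Before uploading, it can help to sanity-check the file against the requirements above. The helper below is a hypothetical sketch (`check_submission` is not part of any library), and the three-row frame is made up for the demo; in the notebook you would pass your own submission DataFrame and `len(test)`.

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Hypothetical helper: return a list of problems (empty list if none)."""
    problems = []
    if list(df.columns) != ['PassengerId', 'Survived']:
        # Stop early: the remaining checks need both columns
        return ["columns must be exactly ['PassengerId', 'Survived']"]
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if df['PassengerId'].duplicated().any():
        problems.append("duplicate PassengerId values")
    if not df['Survived'].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems

# Made-up three-row example
demo_submission = pd.DataFrame({'PassengerId': [892, 893, 894],
                                'Survived': [0, 1, 0]})
print(check_submission(demo_submission, expected_rows=3))  # [] means no problems found
```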

### Improvement Ideas (Pick 1-2 to focus on):
1. **Add More Features**: Include Sex, Age, Fare, family size (SibSp + Parch), Embarked port
2. **Feature Engineering**: Extract titles from names, create age groups, family features
3. **Model Selection**: Try Random Forest instead of Logistic Regression
4. **Better Data Handling**: Improve missing value handling, categorical encoding


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)


## 1. Load and Explore Data

In [None]:
print("\n📊 Loading Titanic data...")

# Load the Titanic dataset files
# Download from: https://www.kaggle.com/c/titanic/data
# Or use: kaggle competitions download -c titanic
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

print(f"✅ Training set shape: {train.shape}")
print(f"✅ Test set shape: {test.shape}")

print(f"\n📋 Training data columns:")
print(train.columns.tolist())

print(f"\n📋 Sample of training data:")
print(train.head())

## 2. Data Exploration

In [None]:
print("\n🔍 Quick data overview...")

# Basic survival statistics only
print(f"\n⚰️  Survival Statistics:")
survived_counts = train['Survived'].value_counts()
print(f"Died: {survived_counts[0]}, Survived: {survived_counts[1]}")
print(f"Overall survival rate: {train['Survived'].mean():.1%}")

# Basic info about the data
print(f"\nDataset info:")
print(f"Training samples: {len(train)}")
print(f"Features: {train.columns.tolist()}")
print(f"Missing values: {train.isnull().sum().sum()} total")
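
The single total above hides which columns are actually affected. A per-column breakdown is usually more useful when deciding how to handle missing values. This is a minimal sketch on a tiny made-up frame standing in for `train`; the column values are invented for the demo.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for `train`
demo = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Pclass': [3, 1, 3, 2],
})

# Count and share of missing values per column, worst first
missing = demo.isnull().sum()
missing_report = pd.DataFrame({
    'missing': missing,
    'share': missing / len(demo),
}).sort_values('missing', ascending=False)

# Show only the columns that have gaps
print(missing_report[missing_report['missing'] > 0])
```

Replacing `demo` with `train` gives the same report for the real data.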

In [None]:
# Basic visualizations to understand survival patterns
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Overall survival distribution
train['Survived'].value_counts().plot(kind='bar', ax=axes[0,0], color=['red', 'green'])
axes[0,0].set_title('Overall Survival')
axes[0,0].set_xlabel('Survived (0=Died, 1=Survived)')
axes[0,0].set_ylabel('Count')
axes[0,0].tick_params(axis='x', rotation=0)

# 2. Survival by Passenger Class - the one pattern the baseline model uses
pd.crosstab(train['Pclass'], train['Survived'], normalize='index').plot(kind='bar', ax=axes[0,1])
axes[0,1].set_title('Survival Rate by Passenger Class')
axes[0,1].set_xlabel('Passenger Class (1=First, 3=Third)')
axes[0,1].set_ylabel('Survival Rate')
axes[0,1].legend(['Died', 'Survived'])
axes[0,1].tick_params(axis='x', rotation=0)

# 3. Survival by Gender
pd.crosstab(train['Sex'], train['Survived'], normalize='index').plot(kind='bar', ax=axes[1,0])
axes[1,0].set_title('Survival Rate by Gender')
axes[1,0].set_xlabel('Gender')
axes[1,0].set_ylabel('Survival Rate')
axes[1,0].legend(['Died', 'Survived'])
axes[1,0].tick_params(axis='x', rotation=45)

# 4. Age distribution by survival
train[train['Survived']==0]['Age'].hist(alpha=0.5, bins=30, label='Died', color='red', ax=axes[1,1])
train[train['Survived']==1]['Age'].hist(alpha=0.5, bins=30, label='Survived', color='green', ax=axes[1,1])
axes[1,1].set_title('Age Distribution by Survival')
axes[1,1].set_xlabel('Age')
axes[1,1].set_ylabel('Count')
axes[1,1].legend()

plt.tight_layout()
plt.show()

print("\n📊 Key Insights from Visualizations:")
print("   🎫 First-class passengers survived at a much higher rate than third-class")
print("   👩 Women had higher survival rates than men")
print("   👶 Young children appear to have had better survival chances")

In [None]:
# Key survival patterns - the "First Class First" rule
print("\n🚢 Understanding Survival Patterns:")

# Survival by class  
print("\nSurvival by Passenger Class:")
class_stats = train.groupby('Pclass')['Survived'].agg(['count', 'sum', 'mean'])
class_stats.columns = ['Total', 'Survived', 'Survival_Rate']
for pclass, row in class_stats.iterrows():
    # iterrows() upcasts the counts to float, so cast back to int for display
    print(f"   Class {pclass}: {int(row['Survived'])}/{int(row['Total'])} ({row['Survival_Rate']:.1%})")

# Survival by gender
print("\nSurvival by Gender:")
gender_stats = train.groupby('Sex')['Survived'].agg(['count', 'sum', 'mean'])
gender_stats.columns = ['Total', 'Survived', 'Survival_Rate']
for sex, row in gender_stats.iterrows():
    # iterrows() upcasts the counts to float, so cast back to int for display
    print(f"   {sex.capitalize()}: {int(row['Survived'])}/{int(row['Total'])} ({row['Survival_Rate']:.1%})")

# Age insights
print(f"\nAge Insights:")
print(f"   Average age of survivors: {train[train['Survived']==1]['Age'].mean():.1f} years")
print(f"   Average age of non-survivors: {train[train['Survived']==0]['Age'].mean():.1f} years")

print(f"\n💡 These patterns show why we need more features than just Pclass!")
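
One way to see how the patterns combine is a survival-rate table indexed by class and split by gender. The sketch below uses a made-up eight-row frame (the values are invented, not real Titanic counts); swapping `demo` for `train` would produce the real table.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Sex':      ['female', 'female', 'male', 'male',
                 'female', 'female', 'male', 'male'],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 target within each group is that group's survival rate
rate_table = demo.pivot_table(index='Pclass', columns='Sex',
                              values='Survived', aggfunc='mean')
print(rate_table)
```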

## 3. Basic Feature Engineering (Intentionally Simple)

In [None]:
print("\n⚙️  Very simple feature preparation...")

def simple_preprocessing(df, is_train=True):
    """Very basic preprocessing - only one feature"""
    df_processed = df.copy()
    
    # Store PassengerId for test set submission
    if not is_train:
        passenger_ids = df_processed['PassengerId'].copy()
    
    # Keep a single feature (plus the target column when processing training data)
    features_to_keep = ['Pclass', 'Survived']
    if not is_train:
        features_to_keep = ['Pclass']
    
    # Select only these features
    df_processed = df_processed[features_to_keep]
        
    if not is_train:
        return df_processed, passenger_ids
    else:
        return df_processed

# Process training data
train_processed = simple_preprocessing(train, is_train=True)

# Separate features and target
X = train_processed.drop('Survived', axis=1)
y = train_processed['Survived']

# Show processed data sample
print(f"\n📋 Processed data sample:")
print(train_processed.head())

## 4. Naive Model Implementation

In [None]:
print("\n🤖 Training simple model...")

# Very simple train/validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Simple Logistic Regression
model = LogisticRegression(random_state=42, max_iter=1000)

print("Training model...")
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_val)

# Calculate accuracy
accuracy = accuracy_score(y_val, y_pred)

print(f"\n📈 Simple Model Results:")
print(f"   Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")

baseline_accuracy = accuracy
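
A single 80/20 split can give a noisy accuracy estimate on a dataset this small. Cross-validation (already imported above as `cross_val_score`) averages over several splits. The sketch below runs on synthetic data generated for the demo, with an assumed relationship between `Pclass` and survival; in the notebook you would pass `X` and `y` instead of `demo_X` and `demo_y`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (invented for the demo, not the real Titanic set)
rng = np.random.default_rng(42)
demo_X = pd.DataFrame({'Pclass': rng.integers(1, 4, size=200)})
# Assumed relationship: survival probability 0.6 / 0.4 / 0.2 for class 1 / 2 / 3
survival_prob = 0.8 - 0.2 * demo_X['Pclass'].to_numpy()
demo_y = (rng.random(200) < survival_prob).astype(int)

# 5-fold cross-validated accuracy
cv_scores = cross_val_score(LogisticRegression(max_iter=1000),
                            demo_X, demo_y, cv=5, scoring='accuracy')
print(f"Fold accuracies: {np.round(cv_scores, 3)}")
print(f"Mean accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

Comparing the mean cross-validated accuracy before and after a change is a steadier guide than a single validation score.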

## 5. Simple Model Analysis

In [None]:
print("\n🎯 What the model learned...")

# Simple feature importance
feature_names = X.columns
coefficients = model.coef_[0]

print("Feature importance (model coefficients):")
for feature, coef in zip(feature_names, coefficients):
    print(f"   {feature}: {coef:.3f}")

print("\n💡 Interpretation:")
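print("   Positive = survival odds rise as the feature value increases")
print("   Negative = survival odds fall as the feature value increases")
print("   For Pclass, a negative value means higher class numbers (2nd, 3rd) fared worse")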
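
Logistic regression coefficients are on the log-odds scale, so exponentiating one gives an odds ratio that is easier to read. The coefficient below is a hypothetical value chosen for the example; the real one comes from `model.coef_[0]`.

```python
import numpy as np

# Hypothetical Pclass coefficient; read the real value from model.coef_[0]
demo_coef = -0.8

# exp(coefficient) = multiplicative change in survival odds per +1 in the feature
odds_ratio = np.exp(demo_coef)
print(f"Odds ratio per one-step increase in Pclass: {odds_ratio:.3f}")
```

With this assumed coefficient, each step down in class (1st to 2nd, 2nd to 3rd) would multiply the odds of survival by roughly 0.45.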

## 6. Generate Test Predictions and Submission

In [None]:
print("\n📝 Making predictions on test data...")

# Process test data using the same simple preprocessing
test_processed, test_passenger_ids = simple_preprocessing(test, is_train=False)

print(f"Test set shape: {test_processed.shape}")

# Generate predictions
test_predictions = model.predict(test_processed)

print(f"✅ Generated {len(test_predictions)} predictions")
print(f"   Predicted Survived: {test_predictions.sum()}")
print(f"   Predicted Died: {len(test_predictions) - test_predictions.sum()}")

In [None]:
# Create submission file
submission = pd.DataFrame({
    'PassengerId': test_passenger_ids,
    'Survived': test_predictions
})

# Save submission
submission.to_csv('titanic_submission.csv', index=False)

print("\n📋 Submission file created: 'titanic_submission.csv'")
print(f"Submission sample:")
print(submission.head())

print(f"\n🚀 Ready to submit!")
print(f"   Upload 'titanic_submission.csv' to Kaggle")

## 7. Ideas for Improvement

This simple model only uses 1 basic feature: **Pclass** (passenger class). There's huge room for improvement!

### 💡 IMPROVEMENTS TO TRY:

**🚀 Add More Features:**
- **Sex** - gender (we saw this matters a lot!)
- **Age** - children had better survival chances
- **Fare** - ticket price might indicate wealth/class
- **SibSp + Parch** - family size affects survival
- **Embarked** - port of departure
- **FamilySize** = SibSp + Parch + 1 (create new feature)
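
A minimal sketch of what adding these features could look like. `add_basic_features` is a hypothetical helper, the 0/1 encoding for `Sex` and the median fill are assumptions, and the three demo rows are made up. `Embarked` is categorical and is left out of this sketch.

```python
import numpy as np
import pandas as pd

def add_basic_features(df):
    """Hypothetical helper: build a numeric feature frame from raw columns."""
    out = pd.DataFrame(index=df.index)
    out['Pclass'] = df['Pclass']
    out['Sex'] = df['Sex'].map({'male': 0, 'female': 1})   # assumed encoding
    out['Age'] = df['Age'].fillna(df['Age'].median())      # simple median fill
    out['Fare'] = df['Fare'].fillna(df['Fare'].median())
    out['FamilySize'] = df['SibSp'] + df['Parch'] + 1      # +1 counts the passenger
    return out

# Made-up rows in the shape of the Kaggle columns
demo = pd.DataFrame({
    'Pclass': [3, 1, 2],
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, np.nan, 30.0],
    'Fare': [7.25, 71.28, np.nan],
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 2],
})
demo_features = add_basic_features(demo)
print(demo_features)
```

For the real test set, it is usually safer to fill missing values with medians computed on the training data rather than on the test data itself.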

**⚙️ Better Feature Engineering:**
- Extract **titles** from names (Mr., Mrs., Miss., Master.)
- Create **age groups** (Child, Adult, Elderly)
- **IsAlone** feature (FamilySize == 1)
- Better handling of **missing Age values**
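
The engineered features above can be sketched with a few pandas calls. The regular expression assumes names follow the "Surname, Title. Given names" pattern, the age-group boundaries (12 and 60) are arbitrary choices for the demo, and the four rows are illustrative.

```python
import pandas as pd

# Illustrative rows in the Kaggle name format
demo = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
             'Heikkinen, Miss. Laina',
             'Palsson, Master. Gosta Leonard'],
    'Age': [22.0, 38.0, 26.0, 2.0],
    'SibSp': [1, 1, 0, 3],
    'Parch': [0, 0, 0, 1],
})

# Title = text between the comma and the first period
demo['Title'] = demo['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

# Age groups with assumed boundaries: (0, 12], (12, 60], (60, 120]
demo['AgeGroup'] = pd.cut(demo['Age'], bins=[0, 12, 60, 120],
                          labels=['Child', 'Adult', 'Elderly'])

# Family size counts the passenger; IsAlone flags passengers with no relatives aboard
demo['FamilySize'] = demo['SibSp'] + demo['Parch'] + 1
demo['IsAlone'] = (demo['FamilySize'] == 1).astype(int)

print(demo[['Title', 'AgeGroup', 'FamilySize', 'IsAlone']])
```

Rows with a missing `Age` would get a missing `AgeGroup`, so this step fits best after the age values have been filled.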

**🤖 Try Different Models:**
- **Random Forest** (can capture feature interactions that a plain Logistic Regression misses)
- **XGBoost** or other boosting methods
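
Swapping in a Random Forest needs only a few changed lines. The sketch below trains one on synthetic data invented for the demo; the hyperparameters (`n_estimators=200`, `max_depth=4`) are untuned starting points, and the `Xd_`/`yd_` names are chosen so they do not overwrite the notebook's own variables.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (invented for the demo, not the real Titanic set)
rng = np.random.default_rng(0)
n = 300
demo_X = pd.DataFrame({
    'Pclass': rng.integers(1, 4, size=n),
    'Sex': rng.integers(0, 2, size=n),  # assumed encoding: 0 = male, 1 = female
})
prob = 0.15 + 0.5 * demo_X['Sex'].to_numpy() + 0.1 * (3 - demo_X['Pclass'].to_numpy())
demo_y = (rng.random(n) < prob).astype(int)

Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42)

# A shallow forest limits overfitting on small tabular data
rf = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=42)
rf.fit(Xd_train, yd_train)

rf_pred = rf.predict(Xd_val)
rf_accuracy = accuracy_score(yd_val, rf_pred)
print(f"Random Forest validation accuracy: {rf_accuracy:.3f}")
print(dict(zip(demo_X.columns, rf.feature_importances_.round(3))))
```

Unlike Logistic Regression, the forest exposes `feature_importances_` rather than signed coefficients, so it shows which features matter but not the direction of the effect.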

**📊 Better Data Handling:**
- One-hot encoding for categorical variables
- Feature scaling/normalization
- Handle **Cabin** information (extract deck letters)
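
A short sketch of the encoding ideas above, on made-up rows. Using `'U'` as the placeholder for an unknown deck is an assumption of this example, not something the dataset defines.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration
demo = pd.DataFrame({
    'Embarked': ['S', 'C', 'Q', 'S'],
    'Cabin': ['C85', np.nan, 'E46', 'B96 B98'],
})

# Deck letter = first character of Cabin; 'U' marks unknown (assumed placeholder)
demo['Deck'] = demo['Cabin'].str[0].fillna('U')

# One-hot encode both categorical columns into 0/1 indicator columns
encoded = pd.get_dummies(demo[['Embarked', 'Deck']],
                         columns=['Embarked', 'Deck'], dtype=int)
print(encoded)
```

When encoding train and test separately, the two frames can end up with different dummy columns; calling `reindex(columns=..., fill_value=0)` on the test frame with the training columns is one way to align them.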

### 🎯 Current Score vs. Potential:
- **Current minimal model**: ~65% accuracy (just passenger class!)  
- **Easy improvement**: >78% (add gender)
- **Good improvements**: 80-83% accuracy
- **Advanced techniques**: 83%+ accuracy

## 🎯 Current Model Summary

**CURRENT MINIMAL MODEL:** Uses only Pclass (passenger class) feature
- **Accuracy:** ~65% 
- **Approach:** Simple Logistic Regression with just 1 feature

🚀 **Huge room for improvement!**

### 💡 Easy wins to try first:
1. **Add Sex feature** (we saw gender matters a lot!)
2. **Add Age feature** (children had better survival chances)  
3. **Add Fare feature** (wealth/class indicator)
4. **Add family features** (SibSp, Parch, or FamilySize)
5. **Try Random Forest** instead of Logistic Regression