# Titanic Decision Tree (Gini Index)
This notebook builds a simple decision tree to predict Titanic survival.

**Goal:** Understand how decision trees split data using **Gini impurity**.

## 1. Import Libraries

In [None]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt

## 2. Load the Titanic Dataset
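Seaborn provides a Titanic dataset through `sns.load_dataset`, which downloads the file on first use (an internet connection is required) and caches it locally.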

In [None]:
df = sns.load_dataset('titanic')
df.head()

## 3. Select Simple Features
We keep a few easy-to-understand variables.
- **survived** → target variable
- **pclass** → passenger class
- **sex** → gender
- **age** → age
- **sibsp** → siblings/spouses aboard
- **parch** → parents/children aboard
- **fare** → ticket price
- **embarked** → port of embarkation

In [None]:
cols = ['survived','pclass','sex','age','sibsp','parch','fare','embarked']
data = df[cols].copy()
data.head()

## 4. Handle Missing Values
- Fill missing age with the median
- Fill missing embarked with the most common value

In [None]:
data['age'] = data['age'].fillna(data['age'].median())
data['embarked'] = data['embarked'].fillna(data['embarked'].mode()[0])
data.isna().sum()

## 5. Convert Categorical Variables
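scikit-learn's decision trees require numeric inputs.
We one-hot encode the text columns (`sex`, `embarked`) into numeric indicator columns; `drop_first=True` drops one level per category, since it is implied by the others.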
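As a small illustration of what the encoding step produces, the sketch below runs `pd.get_dummies` on three made-up rows (the toy values are assumptions for illustration, not rows from the Titanic data). Numeric columns pass through unchanged, and each text column becomes one indicator column per level except the first.

```python
import pandas as pd

# Toy rows, made up for illustration (not taken from the Titanic data)
toy = pd.DataFrame({
    "pclass": [1, 3, 2],
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
})

# drop_first=True removes the alphabetically first level of each text column
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
# ['pclass', 'sex_male', 'embarked_Q', 'embarked_S']
```

Here `sex_female` and `embarked_C` are the dropped levels: a row with `sex_male` false is female, and a row with both `embarked_Q` and `embarked_S` false embarked at C.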

In [None]:
X = pd.get_dummies(data.drop(columns=['survived']), drop_first=True)
y = data['survived']
X.head()

## 6. Split into Training and Testing Sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

## 7. Train a Decision Tree
We use:
- **criterion='gini'** → split using Gini impurity
- **max_depth=3** → keep the tree simple
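To make the `criterion='gini'` setting concrete, here is a minimal sketch of how Gini impurity can be computed for a single node. The helper name `gini_impurity` is illustrative and not part of scikit-learn; it follows the standard definition, one minus the sum of squared class proportions.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum(p_k ** 2) over class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: treat an empty node as pure
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure_node = [1, 1, 1, 1]    # only survivors
mixed_node = [0, 0, 1, 1]   # half died, half survived

print(gini_impurity(pure_node))   # 0.0
print(gini_impurity(mixed_node))  # 0.5
```

For two classes the value ranges from 0 (all passengers in the node share one outcome) to 0.5 (an even mix).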

In [None]:
tree = DecisionTreeClassifier(
    criterion='gini',
    max_depth=3,
    random_state=42
)

tree.fit(X_train, y_train)

## 8. Evaluate the Model

In [None]:
y_pred = tree.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('\nConfusion Matrix:\n', confusion_matrix(y_test, y_pred))
print('\nClassification Report:\n', classification_report(y_test, y_pred))

## 9. Visualize the Decision Tree
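At each node, the tree picks the split that most reduces **Gini impurity**. Each box in the plot shows the split rule, the node's Gini value, the number of samples, the class counts, and the majority class.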

In [None]:
plt.figure(figsize=(14,8))
plot_tree(
    tree,
    feature_names=list(X.columns),  # a plain list works across scikit-learn versions
    class_names=['Died (0)','Survived (1)'],
    filled=True,
    rounded=True
)
plt.show()

## 10. Feature Importance
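`feature_importances_` gives each feature's share of the total Gini impurity reduction across all splits in the tree, normalized to sum to 1. Features the tree never splits on get an importance of 0.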
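For intuition on where impurity-based importances come from, the sketch below computes them by hand for a hypothetical two-split tree. The class counts are made up for illustration and are not taken from the fitted Titanic tree; the helper names are also assumptions. Each split contributes its weighted impurity decrease to the feature it uses, and the totals are then normalized.

```python
def gini(counts):
    """Gini impurity from per-class sample counts, e.g. (died, survived)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def impurity_decrease(n_total, parent, left, right):
    """Impurity decrease of one split, weighted by the share of samples reaching it."""
    n, n_l, n_r = sum(parent), sum(left), sum(right)
    children = (n_l / n) * gini(left) + (n_r / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)

# Hypothetical tree with (died, survived) counts per node:
# the root splits on 'sex_male', then its left child splits on 'pclass'.
n_total = 100
splits = [
    ("sex_male", (60, 40), (10, 30), (50, 10)),
    ("pclass",   (10, 30), (2, 23),  (8, 7)),
]

raw = {}
for feature, parent, left, right in splits:
    raw[feature] = raw.get(feature, 0.0) + impurity_decrease(n_total, parent, left, right)

# Normalize so the importances sum to 1
total = sum(raw.values())
importances = {feature: value / total for feature, value in raw.items()}
print(importances)
```

In this toy tree the root split accounts for most of the impurity reduction, so `sex_male` ends up with the larger importance.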

In [None]:
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances.sort_values(ascending=False)

### Key Takeaways
- Decision trees split data to create **pure groups**.
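- Gini impurity measures how mixed a node is: 0 means every sample belongs to one class, and 0.5 is the maximum for two classes.
- A better split is one whose child nodes have a lower weighted average Gini impurity.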
- Shallow trees are easier to interpret.
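The takeaways about pure groups and split quality can be checked with a few lines of Python. This is a minimal sketch on made-up binary labels (1 = survived); the helper names `gini` and `weighted_gini` are illustrative, not scikit-learn functions. A candidate split is scored by the average impurity of its two child nodes, weighted by their sizes.

```python
def gini(labels):
    """Gini impurity for binary 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: treat an empty node as pure
    p1 = sum(labels) / n  # share of class 1
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def weighted_gini(left, right):
    """Size-weighted average impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy labels, made up for illustration
women = [1, 1, 1, 0]
men = [0, 0, 0, 1]
parent = women + men

split_by_sex = weighted_gini(women, men)
uninformative_split = weighted_gini([1, 0, 1, 0], [1, 0, 0, 1])

print(gini(parent))          # 0.5
print(split_by_sex)          # 0.375
print(uninformative_split)   # 0.5
```

Splitting by sex lowers the impurity from 0.5 to 0.375, while the uninformative split leaves it unchanged, so the tree would prefer the first.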