# Titanic - Machine Learning from Disaster

Competition: https://www.kaggle.com/c/titanic

**This notebook is designed to run on:**
- Local (VS Code with the conda env `kaggle-competitions`)
- Google Colab
- Kaggle Kernels

## 1. Bootstrap - Environment Setup

This cell automatically detects and configures the environment (local/colab/kaggle).

In [None]:
# === BOOTSTRAP CELL - UNIVERSAL SETUP ===
import sys
import os
from pathlib import Path

# GitHub configuration
GITHUB_USER = "n24q02m"
REPO_NAME = "n24q02m-kaggle-competitions"
BRANCH = "main"

# Detect environment
def detect_env():
    if 'google.colab' in sys.modules:
        return 'colab'
    elif 'kaggle_web_client' in sys.modules or os.path.exists('/kaggle'):
        return 'kaggle'
    else:
        return 'local'

ENV = detect_env()
print(f"🔍 Detected: {ENV.upper()}")

# Environment-specific setup
if ENV == 'local':
    # Local: import directly from the repo
    # Assumes the working directory is competitions/titanic/notebooks/
    repo_root = Path.cwd().parent.parent.parent
    if str(repo_root) not in sys.path:
        sys.path.insert(0, str(repo_root))
    
    from core import setup_env
    env = setup_env.setup()
    
else:
    # Cloud: download setup_env.py from GitHub
    import requests
    import subprocess

    CORE_URL = f"https://raw.githubusercontent.com/{GITHUB_USER}/{REPO_NAME}/{BRANCH}/core"

    # Download setup_env.py
    print("📥 Downloading setup_env.py...")
    response = requests.get(f"{CORE_URL}/setup_env.py")
    response.raise_for_status()  # fail early instead of saving an HTTP error page as a module
    with open("setup_env.py", "w") as f:
        f.write(response.text)

    # Import and set up
    import setup_env
    env = setup_env.setup(GITHUB_USER, REPO_NAME)

# Display environment info
env.info()

## 2. Configuration

Shared configuration for the notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')

# Configuration class
class CFG:
    # Random seed for reproducibility
    seed = 42
    
    # Cross-validation
    n_folds = 5
    
    # Target column
    target_col = 'Survived'
    
    # Data paths (set automatically based on the environment)
    if ENV == 'kaggle':
        data_dir = Path('/kaggle/input/titanic')
    else:
        data_dir = Path.cwd().parent / 'data'
    
    train_path = data_dir / 'train.csv'
    test_path = data_dir / 'test.csv'
    submission_path = Path.cwd().parent / 'submissions' / 'submission.csv'

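# Set random seeds
def seed_everything(seed):
    import random  # stdlib RNG, imported here so the helper is self-contained
    random.seed(seed)
    np.random.seed(seed)
    # Note: PYTHONHASHSEED only affects Python processes started after this point
    os.environ['PYTHONHASHSEED'] = str(seed)

seed_everything(CFG.seed)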

# Display config
print("⚙️  Configuration:")
print(f"  - Seed: {CFG.seed}")
print(f"  - N Folds: {CFG.n_folds}")
print(f"  - Data Dir: {CFG.data_dir}")
print(f"  - Train: {CFG.train_path.exists()}")
print(f"  - Test: {CFG.test_path.exists()}")

## 3. Load Data

In [None]:
# Load datasets
train = pd.read_csv(CFG.train_path)
test = pd.read_csv(CFG.test_path)

print(f"📊 Train shape: {train.shape}")
print(f"📊 Test shape: {test.shape}")

# Display first rows
train.head()

## 4. Exploratory Data Analysis (EDA)

In [None]:
# Basic info
print("=" * 50)
print("TRAIN DATA INFO")
print("=" * 50)
train.info()  # info() prints directly and returns None, so it is not wrapped in print()
print("\n" + "=" * 50)
print("STATISTICAL SUMMARY")
print("=" * 50)
print(train.describe())

In [None]:
# Missing values
missing = train.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

if len(missing) > 0:
    plt.figure(figsize=(10, 5))
    missing.plot(kind='barh')
    plt.title('Missing Values in Train Data')
    plt.xlabel('Count')
    plt.tight_layout()
    plt.show()
    
    print("\n📊 Missing Values:")
    print(missing)
else:
    print("✅ No missing values!")

In [None]:
# Target distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Count plot
train[CFG.target_col].value_counts().plot(kind='bar', ax=axes[0])
axes[0].set_title('Survived Count')
axes[0].set_xlabel('Survived (0=No, 1=Yes)')
axes[0].set_ylabel('Count')

# Pie chart
train[CFG.target_col].value_counts().plot(kind='pie', autopct='%1.1f%%', ax=axes[1])
axes[1].set_title('Survived Distribution')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

print(f"\n📊 Survival Rate: {train[CFG.target_col].mean():.2%}")

## 5. Feature Engineering

TODO: Add feature engineering here

In [None]:
# Placeholder for feature engineering
# For example:
# - Create FamilySize from SibSp + Parch
# - Extract Title from Name
# - Group Age into bins
# - etc.

print("⚠️  TODO: Implement feature engineering")
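
The ideas listed in the placeholder above could be sketched roughly as follows. This is only an illustration, not the notebook's final feature set: the helper name `add_features`, the `+ 1` in `FamilySize` (counting the passenger too), the regex for `Title`, and the age bin edges and labels are all assumptions.

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with FamilySize, Title and AgeBin columns added."""
    out = df.copy()
    # FamilySize: siblings/spouses + parents/children + the passenger (assumed convention)
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    # Title: the text between the comma and the first period in Name
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    # AgeBin: illustrative bin edges; rows with a missing Age stay NaN
    out["AgeBin"] = pd.cut(
        out["Age"],
        bins=[0, 12, 18, 35, 60, 120],
        labels=["child", "teen", "young_adult", "adult", "senior"],
    )
    return out

# Tiny hand-made frame standing in for train.csv
demo = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"],
    "SibSp": [1, 1],
    "Parch": [0, 0],
    "Age": [22.0, None],
})
demo_fe = add_features(demo)
print(demo_fe[["FamilySize", "Title", "AgeBin"]])
```

Applying the same function to both `train` and `test` keeps the two frames consistent.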

## 6. Preprocessing

TODO: Handle missing values, encoding, scaling

In [None]:
# Placeholder for preprocessing
print("⚠️  TODO: Implement preprocessing")
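
One possible shape for this step is a scikit-learn `ColumnTransformer` that imputes, scales, and encodes in one pass. The sketch below is an assumption about how the TODO might be filled in: the column choice (`Age`, `Fare`, `Sex`, `Embarked`), the median and most-frequent imputation strategies, and the demo frame are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["Age", "Fare"]      # assumed numeric features
cat_cols = ["Sex", "Embarked"]  # assumed categorical features

preprocessor = ColumnTransformer(
    [
        # Numeric: fill gaps with the median, then standardize
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), num_cols),
        # Categorical: fill gaps with the most frequent value, then one-hot encode
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Tiny hand-made frame standing in for train.csv
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0],
    "Fare": [7.25, 71.28, np.nan, 7.92],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})
X_demo = preprocessor.fit_transform(demo)
print(X_demo.shape)
```

Fitting on the training frame and calling `preprocessor.transform(test[...])` afterwards avoids leaking test statistics into the imputers and scaler.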

## 7. Modeling

TODO: Train models with cross-validation

In [None]:
# Placeholder for modeling
print("⚠️  TODO: Implement modeling")
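
A cross-validated training loop could look roughly like the sketch below, which uses the `StratifiedKFold` and `accuracy_score` imports from the configuration cell. The choice of `LogisticRegression`, the helper name `cross_validate_model`, and the synthetic data are assumptions for illustration; the defaults mirror `CFG.n_folds` and `CFG.seed`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def cross_validate_model(X, y, n_folds=5, seed=42):
    """Train one model per fold; return out-of-fold predictions and fold accuracies."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    oof = np.zeros(len(y), dtype=int)
    scores = []
    for fold, (tr_idx, va_idx) in enumerate(skf.split(X, y)):
        model = LogisticRegression(max_iter=1000)  # placeholder model
        model.fit(X[tr_idx], y[tr_idx])
        oof[va_idx] = model.predict(X[va_idx])
        scores.append(accuracy_score(y[va_idx], oof[va_idx]))
        print(f"Fold {fold}: accuracy = {scores[-1]:.4f}")
    return oof, scores

# Synthetic data standing in for the preprocessed Titanic features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

oof_pred, fold_scores = cross_validate_model(X_demo, y_demo)
print(f"Mean CV accuracy: {np.mean(fold_scores):.4f}")
```

Stratified folds keep the survival rate roughly equal across splits, which matters here because the target is imbalanced.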

## 8. Evaluation

In [None]:
# Placeholder for evaluation
print("⚠️  TODO: Implement evaluation")
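
As a rough idea of what evaluation might report, the sketch below computes accuracy (the competition metric) and a confusion matrix. The label arrays are hand-made stand-ins; in the notebook they would presumably be the true `Survived` values and out-of-fold predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels standing in for true targets and out-of-fold predictions
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

acc = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

print(f"Accuracy: {acc:.2%}")
print(cm)
```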

## 9. Submission

In [None]:
# Placeholder for submission
# submission = pd.DataFrame({
#     'PassengerId': test['PassengerId'],
#     'Survived': predictions
# })
# submission.to_csv(CFG.submission_path, index=False)
# print(f"✅ Submission saved to: {CFG.submission_path}")

print("⚠️  TODO: Generate submission")
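
The commented-out snippet above can be wrapped in a small helper that also checks the prediction length and creates the output folder. This is a sketch under assumptions: `make_submission` is a hypothetical name, and the demo writes to a temporary directory rather than `CFG.submission_path`.

```python
import tempfile
from pathlib import Path

import pandas as pd

def make_submission(test_df, predictions, path):
    """Write a PassengerId/Survived CSV and return the frame that was saved."""
    if len(predictions) != len(test_df):
        raise ValueError("predictions and test_df must have the same length")
    submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"].to_numpy(),
        "Survived": pd.Series(predictions).astype(int).to_numpy(),
    })
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create submissions/ if missing
    submission.to_csv(path, index=False)
    return submission

# Tiny stand-in for test.csv and model predictions
demo_test = pd.DataFrame({"PassengerId": [892, 893, 894]})
out_path = Path(tempfile.mkdtemp()) / "submissions" / "submission.csv"
demo_sub = make_submission(demo_test, [0, 1, 0], out_path)
print(f"Saved {len(demo_sub)} rows to {out_path}")
```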