In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv


This notebook is written for the Kaggle Titanic competition, where the goal is to predict which passengers survived.

In [2]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.impute import SimpleImputer

# Load datasets
def load_data(file_paths):
    data = {name: pd.read_csv(path) for name, path in file_paths.items()}
    return data['train'], data['test']

# Print and return the fraction of passengers who survived, by gender
def survival_rate_by_gender(data):
    rates = {}
    for gender in ['female', 'male']:
        survived = data[data['Sex'] == gender]['Survived']
        # Mean of a 0/1 column is a fraction in [0, 1], not a percentage
        rates[gender] = survived.mean()
        print(f"% of {gender}s who survived: {rates[gender]:.4f}")
    return rates

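# Feature engineering
def feature_engineering(df):
    # Family size counts siblings/spouses, parents/children, and the passenger
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
    # Fill missing ages with the median of this frame
    # (called separately on train and test, so each frame uses its own median)
    imputer = SimpleImputer(strategy='median')
    # fit_transform returns a 2D array; flatten it before assigning to one column
    df['Age'] = imputer.fit_transform(df[['Age']]).ravel()
    return df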

# Prepare data for modeling
def prepare_data(train, test, features):
    train_fe = feature_engineering(train.copy())
    test_fe = feature_engineering(test.copy())
    
    # Additional features to include after engineering
    engineered_features = ['FamilySize', 'IsAlone', 'Age']
    
    X = pd.get_dummies(train_fe[features + engineered_features], drop_first=True)
    y = train_fe['Survived']
    X_test = pd.get_dummies(test_fe[features + engineered_features], drop_first=True)
    
    # Align X_test columns to match X
    X_test = X_test.reindex(columns=X.columns, fill_value=0)
    
    return X, y, X_test

# Train and evaluate the model using cross-validation
def train_and_evaluate(X, y):
    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"Cross-validated accuracy: {scores.mean():.4f}")
    # Train the model on the entire dataset for prediction
    model.fit(X, y)
    return model

# Predict on test data
def predict_test_data(model, X_test, test_data):
    predictions = model.predict(X_test)
    return pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': predictions})

# Main function orchestrates the workflow
def main():
    train_data, test_data = load_data({
        'train': "/kaggle/input/titanic/train.csv", 
        'test': "/kaggle/input/titanic/test.csv"
    })
    
    survival_rate_by_gender(train_data)
    
    features = ["Pclass", "Sex", "SibSp", "Parch"]
    X, y, X_test = prepare_data(train_data, test_data, features)
    
    model = train_and_evaluate(X, y)
    
    output = predict_test_data(model, X_test, test_data)
    output.to_csv('submission.csv', index=False)
    print("Finished!")

if __name__ == "__main__":
    main()


% of females who survived: 0.7420
% of males who survived: 0.1889
Cross-validated accuracy: 0.8238
Finished!
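
The `prepare_data` step above aligns the test columns with `reindex(..., fill_value=0)`, which works here because both files contain both values of `Sex`. As a sketch of a more defensive variant, assuming a hypothetical categorical column whose values do not all appear in the test set, the category list can be fixed from the training data before encoding. The frames and the use of `Embarked` below are invented for illustration and are not part of the script.

```python
import pandas as pd

# Invented rows for illustration; "Embarked" is used here only as an example column.
train_demo = pd.DataFrame({"Embarked": ["C", "Q", "S", "S"]})
test_demo = pd.DataFrame({"Embarked": ["Q", "S"]})  # category "C" is absent

# Fix the category list from the training data so both frames encode identically.
categories = sorted(train_demo["Embarked"].unique())
for frame in (train_demo, test_demo):
    frame["Embarked"] = pd.Categorical(frame["Embarked"], categories=categories)

# With a shared category list, drop_first removes the same baseline ("C") in both frames.
X_train_demo = pd.get_dummies(train_demo, drop_first=True)
X_test_demo = pd.get_dummies(test_demo, drop_first=True)

print(list(X_train_demo.columns))
print(list(X_test_demo.columns))
```

Without the shared category list, `drop_first=True` would drop a different baseline in each frame, and filling the missing column with zeros could silently mislabel rows.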


- **Survival Rates by Gender**: The script first prints the survival rates by gender, showing that 74.20% of females survived while only 18.89% of males did. This significant difference underscores the importance of gender as a feature in predicting survival, reflecting historical accounts where women and children were given priority for lifeboat spots.
- **Cross-validated Accuracy**: The model achieves a mean 5-fold cross-validated accuracy of 82.38%, an estimate of how well it is expected to perform on unseen data. This suggests that the RandomForestClassifier, combined with the chosen features, captures much of the survival signal in the training data.
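
The per-gender rates discussed above can also be computed in one step with a pandas `groupby`. The sketch below uses a tiny made-up frame (not the Titanic data) that only borrows the `Sex` and `Survived` column names, and formats the result as percentages.

```python
import pandas as pd

# Toy data for illustration only; these rows are invented, not taken from train.csv.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
rates = toy.groupby("Sex")["Survived"].mean()

for sex, rate in rates.items():
    print(f"{sex}: {rate:.2%} survived")
```

The `:.2%` format multiplies by 100 and appends the percent sign, which avoids labeling a fraction as a percentage.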

# Conclusion

1. **Predictive Power of Features**: The analysis of survival rates by gender demonstrates the significant impact of certain features on survival outcomes. This highlights the importance of feature selection and engineering in building effective predictive models. In this case, gender plays a critical role, with females having a substantially higher survival rate compared to males, reflecting historical accounts of the "women and children first" protocol during the disaster.

2. **Efficiency of Machine Learning Models**: The RandomForestClassifier shows how machine learning can pick up patterns in data that are not immediately apparent through simple statistical analysis. A random forest is an ensemble of decision trees; it suits this task because it captures non-linear interactions between features and, by averaging many trees, is less prone to overfitting than a single decision tree. scikit-learn's implementation expects numeric input, so the script one-hot encodes `Sex` with `pd.get_dummies` before fitting.

3. **Validation and Model Evaluation**: By scoring the model with 5-fold cross-validation, the script evaluates each fold on data the model did not see during fitting. The mean accuracy across folds serves as a proxy for the model's expected performance on new data, although it does not guarantee generalization. The final model is then refit on the full training set before predicting on the test file.
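
To make the evaluation step concrete, the sketch below runs 5-fold cross-validation and reports the per-fold scores along with their spread. It uses synthetic data from `make_classification` as a stand-in (an assumption for illustration; it is not the Titanic data), while the model settings mirror the ones in the script.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (illustrative assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)

# For a classifier, an integer cv uses stratified folds; each fold is held out once.
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print("Per-fold accuracy:", np.round(scores, 4))
print(f"Mean: {scores.mean():.4f}, std: {scores.std():.4f}")
```

A large spread across folds would suggest that the mean alone overstates how stable the estimate is.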

In conclusion, the notebook follows a structured approach to a predictive modeling problem: understanding the data, engineering and selecting features, evaluating the model with cross-validation, and writing the predictions to a submission file.
