# Titanic

https://www.kaggle.com/c/titanic/data

## Overview
The competition provides three files:

1. **Training Set (`train.csv`)**  
   - Use this data to build your machine learning models.
   - Contains features about each passenger along with the outcome (or “ground truth”) indicating survival.

2. **Test Set (`test.csv`)**  
   - Use this data to evaluate model performance on unseen data.
   - The ground truth (survival outcome) is not provided. Predict survival for each passenger using the trained model.

3. **Example Submission (`gender_submission.csv`)**  
   - A sample submission that predicts survival for every female passenger and death for every male passenger.
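
The rule behind `gender_submission.csv` can be sketched in a few lines. The toy frame below is invented for illustration; only the `PassengerId`, `Sex` and `Survived` column names come from the competition files.

```python
import pandas as pd

# Toy stand-in for test.csv (values are made up for illustration)
toy_test = pd.DataFrame({
    'PassengerId': [892, 893, 894],
    'Sex': ['male', 'female', 'male'],
})

# Baseline rule: predict survival (1) for women and death (0) for men
baseline = pd.DataFrame({
    'PassengerId': toy_test['PassengerId'],
    'Survived': (toy_test['Sex'] == 'female').astype(int),
})
print(baseline)
```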

## Data Dictionary

| Variable    | Definition                        | Key                                              |
|-------------|-----------------------------------|--------------------------------------------------|
| `survival`  | Survival                          | 0 = No, 1 = Yes                                  |
| `pclass`    | Ticket class                      | 1 = 1st, 2 = 2nd, 3 = 3rd                        |
| `sex`       | Sex                               |                                                  |
| `age`       | Age in years                      |                                                  |
| `sibsp`     | # of siblings / spouses aboard    |                                                  |
| `parch`     | # of parents / children aboard    |                                                  |
| `ticket`    | Ticket number                     |                                                  |
| `fare`      | Passenger fare                    |                                                  |
| `cabin`     | Cabin number                      |                                                  |
| `embarked`  | Port of Embarkation               | C = Cherbourg, Q = Queenstown, S = Southampton   |

## Variable Notes

- **`pclass`**: Socio-economic status (SES) proxy  
  - 1st = Upper  
  - 2nd = Middle  
  - 3rd = Lower

- **`age`**: Age is fractional if it is below 1. Estimated ages are in the form `xx.5`.

- **`sibsp`**:  
  - Defines family relations on board.  
  - *Sibling*: Brother, sister, stepbrother, stepsister.  
  - *Spouse*: Husband, wife (excluding mistresses and fiancés).

- **`parch`**:  
  - Defines family relations on board.  
  - *Parent*: Mother, father.  
  - *Child*: Daughter, son, stepdaughter, stepson.  
  - Note: Some children traveled with a nanny, so `parch = 0` for them.
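
The `age` note above implies that a fractional part alone does not mark an estimate, because infants also have fractional ages. A minimal sketch, on made-up values, of separating the two cases (the column names `is_infant` and `is_estimated` are hypothetical):

```python
import pandas as pd

# Made-up ages for illustration
ages = pd.DataFrame({'Age': [0.42, 22.0, 28.5, 0.75, 40.5]})

# Ages below 1 are recorded as fractions of a year
ages['is_infant'] = ages['Age'] < 1

# Estimated ages have the form xx.5; exclude infants so the two flags do not overlap
ages['is_estimated'] = (ages['Age'] % 1 == 0.5) & ~ages['is_infant']
print(ages)
```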

## 1. Define the Problem and Project Objectives

Use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.


## 2. Data Collection and Understanding
Examine the dataset provided to understand its structure and contents.

In [None]:
import pandas as pd
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt 

# Load the dataset
train = pd.read_csv('020 Titanic/data/train.csv')
test = pd.read_csv('020 Titanic/data/test.csv')

print('Length of the training set: {}'.format(len(train)))
print('Length of the test set: {}'.format(len(test)))

# Display the first few rows
train.head()

In [None]:
test.head()

In [None]:
# The test set has no labels; add an empty column so both frames share the same columns
test['Survived'] = np.nan

In [None]:
# Let's join train and test so we process all rows at once.
df = pd.concat([train, test], ignore_index=True)
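
Because the test rows carry `NaN` in `Survived`, the combined frame can later be split back into its two parts by checking that column. A small sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for train.csv and test.csv
toy_train = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1]})
toy_test = pd.DataFrame({'PassengerId': [3]})
toy_test['Survived'] = np.nan  # unlabeled rows

combined = pd.concat([toy_train, toy_test], ignore_index=True)

# Labeled rows came from the training file, unlabeled rows from the test file
labeled = combined[combined['Survived'].notna()]
unlabeled = combined[combined['Survived'].isna()]
print(len(labeled), len(unlabeled))
```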

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.Survived.value_counts()

In [None]:
df.Pclass.value_counts()

In [None]:
df.Sex.value_counts()

In [None]:
df.Age.value_counts()

In [None]:
df.SibSp.value_counts()

In [None]:
df.Parch.value_counts()

In [None]:
df.Ticket.value_counts()

In [None]:
df.Fare.value_counts()

In [None]:
df.Cabin.value_counts()

In [None]:
df.Embarked.value_counts()

In [None]:
df.isna().sum()

In [None]:
df.PassengerId.duplicated().sum()

After a first inspection: 
- There are 263 missing ages
- There are 1014 missing cabins
- There is 1 missing Fare
- There are 2 missing Embarked
- Survived, Pclass, Sex and Embarked can be converted into categories
- There are no duplicates
- The Name could be used to extract more information
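
Counts like these are easier to compare as a table of counts and percentages. A sketch on a toy frame (the helper name `missing_summary` is hypothetical):

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (counts / len(frame) * 100).round(1),
    })
    # Keep only columns that actually have gaps, worst first
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)

# Made-up data for illustration
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0, np.nan],
    'Cabin': [np.nan, np.nan, np.nan, 'C85'],
    'Sex': ['male', 'female', 'male', 'female'],
})
summary = missing_summary(toy)
print(summary)
```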

In [None]:
df.nunique()

In [None]:
cat_cols = ['Survived', 'Pclass', 'Sex', 'Embarked']

for col in cat_cols:
    df[col] = df[col].astype('category')    

## 3. Data Cleaning
Clean the data by handling missing values, removing duplicates, and correcting errors.

### Missing values

In [None]:
df.Age.isna().sum()

In [None]:
sns.histplot(data=df, x='Age')

In [None]:
# Per the variable notes, estimated ages have the form xx.5
df['estimated_age'] = (df['Age'] % 1) == 0.5

In [None]:
df[df['Fare'].isna()]

In [None]:
df['Fare'].hist()

In [None]:
# Fill the single missing fare with the mean fare
df['Fare'] = df['Fare'].fillna(df['Fare'].mean())
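
For reference, `Series.fillna` takes a scalar when filling one column, while a dict keyed by column name is the form `DataFrame.fillna` expects. A sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration
toy = pd.DataFrame({'Fare': [7.25, np.nan, 71.0], 'Age': [22.0, 38.0, np.nan]})

# Single column: pass the fill value directly
fare_filled = toy['Fare'].fillna(toy['Fare'].mean())

# Several columns at once: a dict mapping column name -> fill value
all_filled = toy.fillna({'Fare': toy['Fare'].mean(), 'Age': toy['Age'].median()})
print(fare_filled.tolist())
print(all_filled)
```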

In [None]:
df.Embarked.value_counts()

In [None]:
# Assign the result back instead of using inplace=True on a column selection
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

In [None]:
df['Cabin'] = df['Cabin'].fillna('None')

In [None]:
df.info()

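Apart from `Survived`, which is deliberately empty for the test rows, the only remaining missing values are in the `Age` column. We will impute them after some EDA, so that the imputation can use what we learn about the other features.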

## 4. Exploratory Data Analysis (EDA)
Analyze the data visually and statistically to uncover patterns and insights.

In [None]:
df.Pclass.value_counts()

In [None]:
sns.countplot(data=df, x='Pclass', hue='Survived')

In [None]:
sns.pairplot(data=df, hue='Survived')
plt.show()

In [None]:
df[df.Name.str.contains('(', regex=False)]['Name']

In [None]:
df['Name_no_parenthesis'] = df['Name'].str.replace(r"\(.*?\)", "", regex=True)

In [None]:
df['Title'] = df['Name_no_parenthesis'].str.extract(r'\b(Mr|Mrs|Miss|Master|Don|Major|Col|Dr|Rev|Sir|Lady|Mme|Mlle|Ms|Dona|Capt|Countess|Jonkheer)\b', expand=False)
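
To see what these two steps do, here is the same idea on three sample names, with a shortened title list for brevity:

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
])

# Remove parenthesised text (often a maiden name) before searching for a title
no_paren = names.str.replace(r"\(.*?\)", "", regex=True)

# Shortened title list, for illustration only
titles = no_paren.str.extract(r'\b(Mr|Mrs|Miss|Master)\b', expand=False)
print(titles.tolist())
```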

In [None]:
df.head()

In [None]:
df['Title'].isna().sum()

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(data=df, y='Age', x='Title', hue='Sex')

In [None]:
df['Cabin'].value_counts()

In [None]:
# First character of the cabin (the deck letter); 'N' marks passengers without a recorded cabin
df['Cabin_Category'] = df['Cabin'].str[0].fillna('N')

In [None]:
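# Impute missing ages with the mean age of passengers sharing the same sex and title
df['Age'] = df['Age'].fillna(df.groupby(['Sex', 'Title'], observed=True)['Age'].transform('mean'))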
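
The group-wise fill works because `transform('mean')` returns a value for every row, aligned with the original index. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss'],
    'Age': [30.0, 40.0, np.nan, 10.0, np.nan],
})

# Mean age of each title, broadcast back to every row of that title
group_mean = toy.groupby('Title')['Age'].transform('mean')

# Only the missing entries are replaced; known ages stay untouched
toy['Age'] = toy['Age'].fillna(group_mean)
print(toy)
```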

In [None]:
df.columns

In [None]:
sns.histplot(data=df.Fare)

In [None]:
df['fare_log'] = np.log1p(df['Fare'])
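
`np.log1p(x)` computes `log(1 + x)`, so it stays finite for a fare of 0, and `np.expm1` inverts it. A quick check on made-up fares:

```python
import numpy as np

# Made-up fares, including a zero
fares = np.array([0.0, 7.25, 512.0])

fare_log = np.log1p(fares)      # log(1 + fare); a zero fare maps to 0
restored = np.expm1(fare_log)   # inverse transform: exp(x) - 1

print(fare_log)
print(np.allclose(restored, fares))
```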

In [None]:
sns.histplot(data=df.fare_log)

In [None]:
df.columns

In [None]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler


df_encoded = pd.get_dummies(df, columns=['Sex', 'Embarked', 'Title', 'Cabin_Category', 'Pclass'], drop_first=True)
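
With `drop_first=True`, `get_dummies` drops the first level of each encoded column, which is why names such as `Sex_female` do not appear among the encoded columns. A toy illustration:

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# One indicator column per level, minus the first (alphabetical) level of each column
encoded = pd.get_dummies(toy, columns=['Sex', 'Embarked'], drop_first=True)
print(encoded.columns.tolist())
```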

In [None]:
columns_to_scale = ['Age', 'Parch', 'SibSp', 'fare_log']

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the specified columns and transform them
df_encoded[columns_to_scale] = scaler.fit_transform(df_encoded[columns_to_scale])

df_encoded.columns
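
Fitting the scaler on all rows lets hold-out rows influence the scaling statistics. One common alternative, sketched here on synthetic data rather than the Titanic frame, is to put the scaler inside a scikit-learn `Pipeline` so it is fitted only on the rows passed to `fit`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(200, 3))
y = (X[:, 0] > 50).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# The scaler learns mean and variance from X_tr only; X_te is merely transformed
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=200)),
])
pipe.fit(X_tr, y_tr)
print(f"Hold-out accuracy: {pipe.score(X_te, y_te):.2f}")
```

The same pipeline object can be passed to `GridSearchCV`, in which case the scaler is refitted inside every cross-validation fold.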

In [None]:
feature_columns = ['Age', 'SibSp', 'Parch',
       'estimated_age', 'fare_log',
       'Sex_male', 'Embarked_Q', 'Embarked_S', 'Title_Col', 'Title_Countess',
       'Title_Don', 'Title_Dona', 'Title_Dr', 'Title_Jonkheer', 'Title_Lady',
       'Title_Major', 'Title_Master', 'Title_Miss', 'Title_Mlle', 'Title_Mme',
       'Title_Mr', 'Title_Mrs', 'Title_Ms', 'Title_Rev', 'Title_Sir',
       'Cabin_Category_B', 'Cabin_Category_C', 'Cabin_Category_D',
       'Cabin_Category_E', 'Cabin_Category_F', 'Cabin_Category_G',
       'Cabin_Category_N', 'Cabin_Category_T', 'Pclass_2', 'Pclass_3']

X=df_encoded[~df_encoded['Survived'].isna()][feature_columns]
y=df_encoded[~df_encoded['Survived'].isna()]['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Assuming X_train, X_test, y_train, y_test are already defined

# Define the models and their parameter grids, including class weights
models = {
    'RandomForest': (RandomForestClassifier(random_state=42), {
        'n_estimators': [50, 100, 150],
        'max_depth': [5, 7, 10, 13],
        'min_samples_split': [2, 5],
        'class_weight': ['balanced', None]  # Adding class weights
    }),
    # 'lbfgs' does not support the L1 penalty, so each solver gets its own grid
    'LogisticRegression': (LogisticRegression(max_iter=200), [
        {
            'solver': ['liblinear'],
            'penalty': ['l1', 'l2'],
            'C': [0.01, 0.1, 1, 5, 10],
            'class_weight': ['balanced', None]  # Adding class weights
        },
        {
            'solver': ['lbfgs'],
            'penalty': ['l2'],
            'C': [0.01, 0.1, 1, 5, 10],
            'class_weight': ['balanced', None]  # Adding class weights
        }
    ])
}

# Iterate over the models
for model_name, (model, param_grid) in models.items():
    print(f"Training {model_name}...")

    # Set up GridSearchCV
    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_train, y_train)

    # Best model after grid search
    best_model = grid_search.best_estimator_

    # Predictions
    y_train_pred = best_model.predict(X_train)
    y_test_pred = best_model.predict(X_test)

    # Performance metrics
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    class_report = classification_report(y_test, y_test_pred)
    conf_matrix = confusion_matrix(y_test, y_test_pred)

    # Print results
    print(f"Best Parameters for {model_name}: {grid_search.best_params_}")
    print(f"Training Accuracy for {model_name}: {train_accuracy:.2f}")
    print(f"Test Accuracy for {model_name}: {test_accuracy:.2f}")
    print("Classification Report:")
    print(class_report)
    print("Confusion Matrix:")
    print(conf_matrix)
    print("-" * 50)

In [None]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assuming X_train and y_train are already defined.
# Best parameters reported by the previous grid search, kept for reference only;
# the new search below explores values around them
best_params = {
    'n_estimators': 100,     # Previous best
    'max_depth': 7,          # Previous best
    'min_samples_split': 5,  # Previous best
    'class_weight': None      # Previous best
}

# Define the new parameter grid with values close to the best parameters
param_grid = {
    'n_estimators': [90, 100, 110],            # Slightly varied n_estimators
    'max_depth': [5, 6, 7, 8],                 # Exploring depth around the best
    'min_samples_split': [4, 5, 6],             # Exploring min_samples_split
    'class_weight': ['balanced', None]          # Including balanced class weight option
}

# Initialize the RandomForestClassifier
rf = RandomForestClassifier(random_state=42)

# Set up GridSearchCV with the parameter grid
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best model after grid search
best_rf_model = grid_search.best_estimator_

# Print best parameters
print(f"Best Parameters after fine-tuning: {grid_search.best_params_}")

# Evaluate the tuned model on the hold-out split (not the Kaggle test set)
y_test_pred = best_rf_model.predict(X_test)

# Import metrics for evaluation
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Performance metrics
test_accuracy = accuracy_score(y_test, y_test_pred)
class_report = classification_report(y_test, y_test_pred)
conf_matrix = confusion_matrix(y_test, y_test_pred)

# Print results
print(f"Test Accuracy: {test_accuracy:.2f}")
print("Classification Report:")
print(class_report)
print("Confusion Matrix:")
print(conf_matrix)

In [None]:
best_rf_model.feature_importances_

In [None]:
importances = best_rf_model.feature_importances_

# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    'Feature': X_train.columns,  # Assuming X_train is a DataFrame
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Print feature importances
print(feature_importance_df)

# Plotting feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title('Feature Importances from Best Random Forest Classifier')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.show()
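
The notebook stops before producing a Kaggle submission. A possible final step, sketched with a made-up prediction array and an in-memory buffer instead of a real file, writes the two columns used by `gender_submission.csv`:

```python
import io

import numpy as np
import pandas as pd

# Made-up IDs and predictions standing in for the real test rows and model output
passenger_ids = pd.Series([892, 893, 894], name='PassengerId')
predictions = np.array([0.0, 1.0, 0.0])

submission = pd.DataFrame({
    'PassengerId': passenger_ids,
    'Survived': predictions.astype(int),  # integer 0/1 labels, as in gender_submission.csv
})

# Write to a buffer here; a real run would pass a file path instead
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

In the notebook itself, the predictions would come from calling `best_rf_model.predict` on the rows of `df_encoded` whose `Survived` value is missing, restricted to `feature_columns`.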