In [None]:
import pandas as pd

train_data_path = '/kaggle/input/titanic/train.csv'
train_data = pd.read_csv(train_data_path)

train_data.head()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the aesthetics for the plots
sns.set(style="whitegrid")

# Exploratory Data Analysis: Understanding the relationship of various features with Survival

# Survival rate by Gender
plt.figure(figsize=(10, 6))
sns.barplot(x="Sex", y="Survived", data=train_data)
plt.title("Survival Rate by Gender")
plt.show()

# Survival rate by Passenger Class
plt.figure(figsize=(10, 6))
sns.barplot(x="Pclass", y="Survived", data=train_data)
plt.title("Survival Rate by Passenger Class")
plt.show()

# Survival rate by Embarkation Port
plt.figure(figsize=(10, 6))
sns.barplot(x="Embarked", y="Survived", data=train_data)
plt.title("Survival Rate by Embarkation Port")
plt.show()

# Distribution of Age and its impact on Survival
plt.figure(figsize=(10, 6))
sns.histplot(data=train_data, x="Age", hue="Survived", kde=True, element="step", stat="density", common_norm=False)
plt.title("Age Distribution and Survival")
plt.show()


Exploratory data analysis (EDA) of the Titanic dataset:


1. **Survival Rate by Gender**: The first plot shows a higher survival rate for females compared to males. This suggests that gender could be a significant predictor of survival.

2. **Survival Rate by Passenger Class (Pclass)**: The second plot indicates that passengers in the first class had a higher survival rate compared to those in the second and third classes. This suggests that socio-economic status, as indicated by the ticket class, may have played a role in survival chances.

3. **Survival Rate by Embarkation Port**: The third plot illustrates variations in survival rates based on the port of embarkation. Passengers who embarked from Cherbourg (C) seem to have a higher survival rate compared to those from Queenstown (Q) and Southampton (S).

4. **Age Distribution and Survival**: The last plot shows the age distribution of passengers and how it relates to survival. The plot indicates that younger passengers had a higher chance of survival compared to older passengers, with a notable peak in survival for children.


Based on these insights, we can infer that gender, socio-economic status (Pclass), age, and possibly the port of embarkation could be important features for predicting survival. 
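
The bar plots above show the mean of `Survived` per category, so the same rates can also be read off numerically with a `groupby`. A minimal sketch, assuming a small made-up DataFrame (`toy`, not the real Titanic data) and a hypothetical helper named `survival_rate_by`:

```python
import pandas as pd

# Hypothetical toy rows (NOT the real Titanic data), reusing the dataset's column names
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 3, 1, 2, 2],
    "Survived": [1, 1, 0, 1, 0, 0],
})


def survival_rate_by(df, column):
    """Return the mean of Survived (the survival rate) for each value of `column`."""
    return df.groupby(column)["Survived"].mean()


rate_by_sex = survival_rate_by(toy, "Sex")
rate_by_class = survival_rate_by(toy, "Pclass")

print(rate_by_sex)
print(rate_by_class)
```

On the real data, calling `survival_rate_by(train_data, "Sex")` should give the bar heights of the first plot as numbers, which makes the differences easier to compare than reading them off the chart.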

In [None]:
# Checking for missing values in the training data
missing_values = train_data.isnull().sum()
missing_values_percentage = (missing_values / len(train_data)) * 100

# Displaying the count and percentage of missing values for each column
missing_data = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_values_percentage})
missing_data[missing_data['Missing Values'] > 0]


The analysis of missing values in the training data reveals the following:

- **Age**: 177 missing values, constituting about 19.87% of the dataset.
- **Cabin**: 687 missing values, which is a significant 77.10% of the dataset.
- **Embarked**: 2 missing values, making up about 0.22% of the dataset.

Given this information, I handle missing values in the following way:

1. **Age**: About 20% of the values are missing, which is substantial but still low enough to impute. I use the median age, which is less sensitive to outliers than the mean.

2. **Cabin**: With 77.10% of the data missing, it might be challenging to impute this accurately, so I will drop it.

3. **Embarked**: Only 2 values are missing. I impute these missing values with the mode of the 'Embarked' column.


In [None]:
# Handling missing values

# Imputing missing values for Age with the median age
# (assigning the result back avoids the chained inplace fillna, which newer pandas versions deprecate)
median_age = train_data['Age'].median()
train_data['Age'] = train_data['Age'].fillna(median_age)

# Dropping the Cabin column as it has a lot of missing values
train_data.drop('Cabin', axis=1, inplace=True)

# Imputing missing values for Embarked with the mode
mode_embarked = train_data['Embarked'].mode()[0]
train_data['Embarked'] = train_data['Embarked'].fillna(mode_embarked)

# Checking if all missing values have been addressed
train_data.isnull().sum()


All missing values in the dataset have now been addressed:

- The missing values in the **Age** column have been filled with the median age.
- The **Cabin** column, which had a large proportion of missing values, has been dropped.
- The 2 missing values in the **Embarked** column have been filled with its mode (most common value).

In [None]:
import pandas as pd

# Re-importing the training data as the code execution state was reset
train_data_path = '/kaggle/input/titanic/train.csv'
train_data = pd.read_csv(train_data_path)

# Imputing missing values for Age with the median age
median_age = train_data['Age'].median()
train_data['Age'] = train_data['Age'].fillna(median_age)

# Dropping the Cabin column as it has a lot of missing values
train_data.drop('Cabin', axis=1, inplace=True)

# Imputing missing values for Embarked with the mode
mode_embarked = train_data['Embarked'].mode()[0]
train_data['Embarked'] = train_data['Embarked'].fillna(mode_embarked)

# Encoding the 'Sex' column (male, female)
train_data = pd.get_dummies(train_data, columns=['Sex'], drop_first=True)  # drop_first to avoid multicollinearity

# Encoding the 'Embarked' column (C, Q, S)
train_data = pd.get_dummies(train_data, columns=['Embarked'], drop_first=True)

# Displaying the first few rows of the updated dataset to verify the changes
train_data.head()


The categorical variables in the dataset have now been encoded:

- The **Sex** column has been transformed into **Sex_male**, where 1 indicates male and 0 indicates female.
- The **Embarked** column has been converted into two columns, **Embarked_Q** and **Embarked_S**, representing Queenstown and Southampton, respectively. The reference category (dropped to avoid multicollinearity) is Cherbourg.
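
To make the reference-category idea concrete, here is a minimal sketch of `pd.get_dummies` with `drop_first=True` on three made-up rows (the `toy` frame is an assumption for illustration, not part of the dataset). The dropped level is the first one in sorted order, so a female passenger from Cherbourg is represented by zeros in every dummy column:

```python
import pandas as pd

# Hypothetical toy rows (NOT the real Titanic data)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# drop_first=True drops the first level of each column: 'female' and 'C'
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)

# Cast to int so the columns read as 0/1 regardless of the pandas version
encoded = encoded.astype(int)

print(encoded)
```

The row for the female passenger who embarked at `C` has 0 in `Sex_male`, `Embarked_Q`, and `Embarked_S`: the reference categories are implied by the absence of any flag.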


In [None]:
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

# Instantiating the base models
base_rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
base_gb_model = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42)
base_xgb_model = XGBClassifier(n_estimators=50, max_depth=3, use_label_encoder=False, eval_metric='logloss', random_state=42)

# Creation of the stacking ensemble model
stacked_model = StackingClassifier(
    estimators=[
        ('random_forest', base_rf_model),
        ('gradient_boosting', base_gb_model),
        ('xgboost', base_xgb_model)
    ],
    final_estimator=LogisticRegression(),
    cv=5
)

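# Model evaluation with cross-validation
# Note: X_train and y_train are created in the next cell (preprocessing and split); run that cell first
stacked_cv_scores = cross_val_score(stacked_model, X_train, y_train, cv=5)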

# Calculation of the mean and standard deviation of the scores
stacked_cv_mean = stacked_cv_scores.mean()
stacked_cv_std = stacked_cv_scores.std()

print(f"Average CV scores: {stacked_cv_mean}, Standard deviation: {stacked_cv_std}")
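
The cell above uses two levels of cross-validation: the `cv=5` inside `StackingClassifier` produces out-of-fold predictions from the base models to train the logistic-regression meta-learner, while the `cv=5` in `cross_val_score` estimates the accuracy of the whole stack. The following is a minimal, self-contained sketch of the same pattern, assuming synthetic data from `make_classification` and only scikit-learn base models (no XGBoost) so that it runs without the Titanic files. The names `X_demo`, `y_demo`, `demo_stack`, and `demo_scores` are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 8 Titanic features (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

demo_stack = StackingClassifier(
    estimators=[
        ("random_forest", RandomForestClassifier(n_estimators=20, max_depth=5, random_state=42)),
        ("gradient_boosting", GradientBoostingClassifier(n_estimators=20, max_depth=3, random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # inner folds: out-of-fold base predictions feed the meta-learner
)

# Outer folds: accuracy estimate for the stacked model as a whole
demo_scores = cross_val_score(demo_stack, X_demo, y_demo, cv=5)

print(f"Mean accuracy: {demo_scores.mean():.3f} (std {demo_scores.std():.3f})")
```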


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Data
train_data = pd.read_csv('/kaggle/input/titanic/train.csv')

# Handling missing values
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())
train_data.drop('Cabin', axis=1, inplace=True)
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])

# Encoding categorical variables
train_data = pd.get_dummies(train_data, columns=['Sex', 'Embarked'], drop_first=True)

# Defining X and y
features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_male', 'Embarked_Q', 'Embarked_S']
X = train_data[features]
y = train_data['Survived']

# Splitting into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
stacked_model.fit(X_train, y_train)

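# Using the trained model to make predictions on the test dataset
# Note: test_data and X_test are created in the test preprocessing cell further down; run that cell first
test_data['Survived'] = stacked_model.predict(X_test)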

submission = test_data[['PassengerId', 'Survived']]

submission_file_path = '/kaggle/working/submission.csv'
submission.to_csv(submission_file_path, index=False)

submission.head()


In [None]:
submission.info()

In [None]:
# Loading the test dataset
test_data_path = '/kaggle/input/titanic/test.csv'
test_data = pd.read_csv(test_data_path)

# Preprocessing the test dataset (similar to the training dataset)
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].median())
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())  # Imputation for 'Fare'
test_data = pd.get_dummies(test_data, columns=['Sex', 'Embarked'], drop_first=True)

# Select the same features used to train the model
X_test = test_data[features]

# Use the trained model to make predictions on the test dataset
test_data['Survived'] = stacked_model.predict(X_test)

# Create the submission DataFrame
submission = test_data[['PassengerId', 'Survived']]

# Save the DataFrame as a CSV file
submission_file_path = '/kaggle/working/submission.csv'
submission.to_csv(submission_file_path, index=False)

# Check the first rows of the submission file
submission.head()
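
The test set is preprocessed here with its own medians and its own `get_dummies` call. A common alternative, sketched below under stated assumptions, is to learn the fill values and category levels from the training data and reuse them on the test data, so both frames are transformed identically and always end up with the same columns. The frames `toy_train` and `toy_test` and the helper `preprocess` are hypothetical and only illustrate the idea; this is not what the notebook does:

```python
import pandas as pd

# Hypothetical toy frames (NOT the real Titanic data)
toy_train = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Embarked": ["S", "C", "Q", "S"],
})
toy_test = pd.DataFrame({
    "Age": [None, 47.0],
    "Fare": [None, 7.0],
    "Embarked": ["S", "S"],
})

# Learn fill values and category levels from the training frame only
fill_values = {"Age": toy_train["Age"].median(), "Fare": toy_train["Fare"].median()}
embarked_levels = sorted(toy_train["Embarked"].dropna().unique())


def preprocess(df, fill_values, embarked_levels):
    """Fill missing values and one-hot encode Embarked with fixed category levels."""
    out = df.fillna(fill_values)
    # Fixed categories keep the dummy columns identical even if a level is absent
    out["Embarked"] = pd.Categorical(out["Embarked"], categories=embarked_levels)
    return pd.get_dummies(out, columns=["Embarked"], drop_first=True)


train_ready = preprocess(toy_train, fill_values, embarked_levels)
test_ready = preprocess(toy_test, fill_values, embarked_levels)

print(test_ready)
```

Fixing the category levels matters because the toy test frame contains only `S`: a plain `get_dummies(..., drop_first=True)` on it would drop that single level and lose the `Embarked_S` column entirely.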


In [None]:
# Adding engineered features for the next optimization attempts

import pandas as pd
import numpy as np

# Re-importing the training data for feature engineering
train_data = pd.read_csv('/kaggle/input/titanic/train.csv')

# Feature Engineering
# 1. Creating new features

# Family Size - combination of SibSp and Parch
train_data['FamilySize'] = train_data['SibSp'] + train_data['Parch'] + 1  # Adding 1 for the passenger themselves

# Cabin Availability - whether the cabin information is available
train_data['HasCabin'] = train_data['Cabin'].apply(lambda x: 0 if pd.isna(x) else 1)

# 2. Preprocessing - handling missing values and encoding categorical variables
# Age is imputed before binning: otherwise passengers with a missing age get no category
# and are encoded as all zeros, which is indistinguishable from the dropped 'Child' level
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])

# Age Categories - Binning Age into categories (left-closed bins, e.g. 12 counts as Teenager)
bins = [0, 12, 18, 60, np.inf]
labels = ['Child', 'Teenager', 'Adult', 'Senior']
train_data['AgeCategory'] = pd.cut(train_data['Age'], bins, labels=labels, right=False)

train_data = pd.get_dummies(train_data, columns=['Sex', 'Embarked', 'AgeCategory'], drop_first=True)

# Re-defining features and target variable with new features
new_features = features + ['FamilySize', 'HasCabin'] + [col for col in train_data.columns if 'AgeCategory' in col]
X_new = train_data[new_features]
y_new = train_data['Survived']

# Splitting the data into training and validation sets for model training
X_train_new, X_val_new, y_train_new, y_val_new = train_test_split(X_new, y_new, test_size=0.2, random_state=42)

# Displaying the first few rows of the updated dataset with new features
X_train_new.head()



The dataset now includes the following features:

- `Pclass`: Ticket class.
- `Age`: Age of the passenger.
- `SibSp`: Number of brothers/sisters or spouses on board.
- `Parch`: Number of parents or children on board.
- `Fare`: Ticket price.
- `Sex_male`: Binary variable indicating male sex.
- `Embarked_Q` and `Embarked_S`: Binary variables for ports of embarkation.
- `FamilySize`: Family size (calculated as `SibSp` + `Parch` + 1).
- `HasCabin`: Binary variable indicating whether cabin information is recorded for the passenger.
- `AgeCategory_Teenager`, `AgeCategory_Adult`, `AgeCategory_Senior`: Binary variables for age categories.
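
The engineered features are easiest to check on a handful of rows. A minimal sketch, assuming a made-up `toy` frame (not the real data) with ages chosen to sit on the bin edges. Because `pd.cut` is called with `right=False`, each bin includes its left edge and excludes its right edge, and a missing age yields a missing category, which is why the order of imputation and binning matters:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (NOT the real Titanic data), with ages on the bin edges
toy = pd.DataFrame({
    "Age": [5.0, 12.0, 18.0, 59.9, 60.0, np.nan],
    "SibSp": [1, 0, 0, 1, 0, 0],
    "Parch": [2, 0, 0, 0, 0, 0],
    "Cabin": ["C85", None, None, "E46", None, None],
})

# Family size: siblings/spouses + parents/children + the passenger
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# 1 when a cabin is recorded, 0 otherwise
toy["HasCabin"] = toy["Cabin"].notna().astype(int)

# Left-closed bins: [0, 12), [12, 18), [18, 60), [60, inf)
bins = [0, 12, 18, 60, np.inf]
labels = ["Child", "Teenager", "Adult", "Senior"]
toy["AgeCategory"] = pd.cut(toy["Age"], bins, labels=labels, right=False)

print(toy[["Age", "FamilySize", "HasCabin", "AgeCategory"]])
```

Ages 12, 18, and 60 land in `Teenager`, `Adult`, and `Senior` respectively, and the row with a missing age has no category at all.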
