## Import Libraries

**Common Libraries**

1. **`pandas` (`pd`)**: Used for data manipulation and analysis, mainly in a DataFrame format.
2. **`numpy` (`np`)**: Used for numerical computing with multi-dimensional arrays and mathematical operations.
3. **`matplotlib.pyplot` (`plt`) & `seaborn` (`sns`)**: Plotting libraries; `matplotlib` provides the base plotting API and `seaborn` builds statistical visualizations on top of it.
4. **`scikit-learn` (`sklearn`)**: Machine learning library for data preprocessing, model training, and evaluation.

In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import warnings

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

## EDA

In [None]:
# Load the dataset
df = pd.read_csv('/titanic/train.csv')

# Display the first 5 rows of the dataset
df.head()

In [None]:
# Get basic information about the dataset
df.info()

In [None]:
# Summary statistics for numerical columns
df.describe().T

In [None]:
# Check for missing values
df.isna().sum()
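
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch, assuming a small made-up frame (`toy`) standing in for the Titanic data; the helper name `missing_summary` is hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, largest first."""
    counts = frame.isna().sum()
    percent = (frame.isna().mean() * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Made-up rows for illustration only (not the real dataset)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
print(missing_summary(toy))
```

Calling `missing_summary(df)` on the loaded dataset should work the same way, since the helper only relies on `isna()`.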

### Data Visualization

**Summary of Plots**

| **Plot Type**   | **Use Case**                                                                 |
|------------------|------------------------------------------------------------------------------|
| Histogram        | Distribution of a numerical variable.                                       |
| Bar Plot         | Compare values across categories.                                           |
| Count Plot       | Count occurrences of categorical data.                                      |
| Box Plot         | Distribution and outliers in numerical data.                                |
| Violin Plot      | Distribution of numerical data (combination of box plot and density plot).   |
| Scatter Plot     | Relationship between two numerical variables.                               |
| Pair Plot        | Pairwise relationships between numerical variables.                         |
| Heatmap          | Visualize correlations between numerical features.                          |
| Pie Chart        | Show proportions of categorical data.                                       |
| Line Plot        | Show trends over a continuous interval (e.g., time-series data).             |

More plot types are shown in the seaborn example gallery: https://seaborn.pydata.org/examples/index.html

In [None]:
plt.figure(figsize=(8, 6))
sns.histplot(df['Age'], bins=20, kde=True, color='blue')
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.barplot(x='Pclass', y='Survived', data=df, ci=None, palette='viridis')
plt.title('Average Survival Rate by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')
plt.show()

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(x='Survived', data=df, palette='Set2')
plt.title('Count of Survivors')
plt.xlabel('Survived (1 = Yes, 0 = No)')
plt.ylabel('Count')
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.boxplot(x='Survived', y='Age', data=df, palette='coolwarm')
plt.title('Age Distribution by Survival')
plt.xlabel('Survived (1 = Yes, 0 = No)')
plt.ylabel('Age')
plt.show()
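
The summary table lists the violin plot, which combines a box plot with a density estimate. A minimal sketch of one is shown here, assuming a made-up frame (`toy`) with the same `Survived` and `Age` column names as the dataset; the values are illustrative only.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows for illustration only; swap in the Titanic DataFrame in practice
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 1, 1, 1, 1],
    "Age": [22, 35, 54, 40, 4, 27, 38, 19],
})

plt.figure(figsize=(8, 6))
# Each violin shows the estimated age density for one survival group
ax = sns.violinplot(x="Survived", y="Age", data=toy)
ax.set_title("Age Distribution by Survival (Violin Plot)")
ax.set_xlabel("Survived (1 = Yes, 0 = No)")
ax.set_ylabel("Age")
plt.show()
```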

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Age', y='Fare', hue='Survived', data=df, palette='Set1')
plt.title('Age vs Fare (Colored by Survival)')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.show()

In [None]:
sns.pairplot(df[['Age', 'Fare', 'Survived', 'Pclass']], hue='Survived', palette='husl')
plt.suptitle('Pair Plot of Numerical Features', y=1.02)
plt.show()

In [None]:
plt.figure(figsize=(6, 6))
df['Survived'].value_counts().plot.pie(autopct='%1.1f%%', colors=['lightcoral', 'lightgreen'])
plt.title('Survival Proportion')
plt.ylabel('')
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.lineplot(x='Age', y='Fare', data=df, ci=None, color='purple')
plt.title('Line Plot of Age vs Fare')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.show()

## Data Preprocessing

In [None]:
def clean_data(df):
    df = df.copy()
    
    # Fill missing 'Age' values with the median
    df['Age'] = df['Age'].fillna(df['Age'].median())
    
    # Fill missing 'Embarked' values with the mode
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    
    # Drop the 'Cabin' column as it has too many missing values
    df = df.drop('Cabin', axis=1)

    return df

def feature_engineer(df):
    df = df.copy()

    # Create a new feature 'FamilySize' by combining 'SibSp' and 'Parch'
    df['FamilySize'] = df['SibSp'] + df['Parch']

    # Create a new feature 'IsAlone' to indicate if a passenger is alone
    df['IsAlone'] = (df['FamilySize'] == 0).astype(int)

    return df

# Apply both steps and reassign to df so the cells below use the processed data
df = clean_data(df)
df = feature_engineer(df)

# Verify that missing values are handled
df.isnull().sum()
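
The imputation and feature rules used in this cell can be checked by hand on a tiny frame. This is a sketch under stated assumptions: `toy` is made-up data that reuses only the relevant column names, and the steps are inlined rather than calling the functions above so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "S", None, "C"],
    "SibSp": [1, 0, 0, 2],
    "Parch": [0, 0, 0, 1],
})

# Same imputation rules: median for Age, mode for Embarked
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Same derived features: relatives aboard, and a flag for travelling alone
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]
toy["IsAlone"] = (toy["FamilySize"] == 0).astype(int)

print(toy)
```

The missing age is filled with 30.0 (the median of 20, 30, and 40), the missing port with `S` (the most frequent value), and only the two passengers with no relatives aboard are flagged as alone.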

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# Keep only the numeric columns for the correlation matrix
num_cols = df.select_dtypes(include=['number'])

# Compute correlation matrix
corr_matrix = num_cols.corr()

# Plot the correlation matrix as a heatmap
plt.figure(figsize=(10,6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", linewidths=0.5) 
plt.title("Feature Correlation Heatmap")
plt.show()
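
A heatmap shows every pairwise correlation, but for modelling it is often the column for the target that matters most. A minimal sketch, assuming a made-up frame (`toy`) with `Survived` as the target; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1, 0],
    "Pclass": [3, 3, 1, 1, 2, 3],
    "Fare": [7.3, 8.1, 71.3, 53.1, 13.0, 8.5],
})

# Correlation of each numeric feature with the target, strongest first
target_corr = (
    toy.select_dtypes(include=["number"])
    .corr()["Survived"]
    .drop("Survived")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strong negative relationships near the top instead of pushing them to the bottom of the list.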

In [None]:
# Define features (X) and target (y)
X = df.drop(['Survived'], axis=1)
y = df["Survived"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
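
When the target classes are imbalanced, a plain random split can leave the train and test sets with different class ratios. One option, shown here as a sketch rather than as the notebook's method, is the `stratify` argument of `train_test_split`. The frame `toy` and the names `X_toy`, `y_toy`, `X_tr`, `X_te`, `y_tr`, and `y_te` are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced target for illustration only: 20 survivors out of 50
toy = pd.DataFrame({
    "Fare": range(50),
    "Survived": [1] * 20 + [0] * 30,
})
X_toy = toy.drop("Survived", axis=1)
y_toy = toy["Survived"]

# stratify=y_toy keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of survivors in each split
print(y_tr.mean(), y_te.mean())
```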

In [None]:
# Save the DataFrame to a CSV file
df.to_csv('train_data.csv', index=False)