<a href="https://www.aus.edu/"><img src="https://i.imgur.com/pdZvnSD.png" width=200> </a>

<h1 align=center><font size = 5>Data Exploration, Cleaning, and Preparation - Titanic Case Study</font></h1>
<h1 align=center><font size = 5>Prepared by Alex Aklson, Ph.D.</font></h1>
<h1 align=center><font size = 5>September 19, 2024</font></h1>

## Import Libraries <a id="import-libraries"></a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, Binarizer, OneHotEncoder, OrdinalEncoder
from scipy.stats import pearsonr, spearmanr, chi2_contingency, pointbiserialr, f_oneway

from sklearn.impute import SimpleImputer

pd.set_option("display.max_columns", None) # to display all columns in a dataframe

Load Titanic dataset

In [None]:
titanic = sns.load_dataset('titanic')

View the first five objects / instances / samples

In [None]:
titanic.head()

Get the number of objects and attributes

In [None]:
titanic.shape

## Data Exploration <a id="data-exploration"></a>

Identifying Data Types (categorical, nominal, ordinal, numerical)

In [None]:
print("Data Types in Titanic Dataset:\n", titanic.dtypes)

In [None]:
titanic.describe()

Check for missing values in the dataset

In [None]:
missing_data = titanic.isnull().sum()

print("Missing Data in the Titanic Dataset:\n", missing_data)

We see that we have 177 missing values in **age**, 688 missing values in **deck**, and 2 missing values each in **embarked** and **embark_town**.
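Raw counts are easier to compare once they are expressed as a share of the rows. The sketch below is not part of the original notebook: `missing_summary` is a hypothetical helper, and the small `toy` frame (with made-up values) stands in for `titanic` so that the example runs on its own.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only the columns that actually have gaps, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy stand-in for the Titanic frame (values are made up for illustration).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(titanic)` on the real data should list the same columns as above, together with their percentages.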

Let's start visualizing the dataset

### 1. Visualizing Attributes - Numerical Attributes

In [None]:
sns.histplot(titanic['age'], kde=True)
plt.title("Age Distribution")
plt.show()

We observe a slight skewness to the right. We can confirm this by comparing the mean and the median: in a right-skewed distribution, the mean is typically pulled above the median.

In [None]:
print(titanic['age'].mean(), titanic['age'].median())

However, the difference is *NOT* that significant.

Let's visualize the distribution for the fare attribute.

In [None]:
sns.histplot(titanic['fare'], kde=True)
plt.title("Fare Distribution")
plt.show()

Wow! This is a distribution with a strong skewness to the right.

In [None]:
print(titanic['fare'].mean(), titanic['fare'].median())

The difference between the mean and the median is substantial here: the mean is more than twice the median. We will therefore most likely need to transform the fare attribute to reduce its skewness (for example, with a log transformation) before we feed it into a machine learning algorithm.
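As a rough illustration of why a log transformation is a common choice for right-skewed data, the sketch below compares the sample skewness before and after `np.log1p`. The data are synthetic (a log-normal sample with an arbitrary seed and arbitrary parameters), not the actual fares, so only the direction of the change is meaningful.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed sample; the distribution parameters are assumptions.
values = pd.Series(rng.lognormal(mean=2.5, sigma=1.0, size=1000))

skew_before = values.skew()
skew_after = np.log1p(values).skew()

print("skewness before log1p:", round(skew_before, 2))
print("skewness after log1p: ", round(skew_after, 2))
```

The same two calls to `.skew()` can be applied to `titanic['fare']` to check how much the transformation helps on the real column.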

Let's proceed with the rest of the numerical attributes, but instead of repeating the same code again and again, let's define a function that takes in the numerical attribute or feature as input and generates the distribution plot along with printing the mean, the median, and the range.

In [None]:
def visualize_numerical_feature(feature):
    sns.histplot(titanic[feature], kde=True)
    plt.title("{} Distribution".format(feature))
    plt.show()

    print(
        'mean: ', titanic[feature].mean(), 
        ', median: ', titanic[feature].median(), 
        ', range: ', titanic[feature].max() - titanic[feature].min()
    )

In [None]:
visualize_numerical_feature('age')

In [None]:
visualize_numerical_feature('fare')

In [None]:
visualize_numerical_feature('sibsp')

In [None]:
visualize_numerical_feature('parch')

We also note right skewness in the distributions of the **sibsp** and **parch** attributes.

### 2. Visualizing Attributes - Categorical Attributes

We will explore five categorical attributes. Instead of repeating the same code again and again, let's define a function that takes the feature name as input, displays a frequency count for each value of the feature, and creates a bar chart.

In [None]:
def visualize_categorical_feature(feature):
    print(titanic[feature].value_counts())

    plt.figure(figsize=(6, 4))
    sns.countplot(data=titanic, x=feature)
    plt.title(f"Frequency of {feature}")
    plt.show()

Let's start with sex.

In [None]:
visualize_categorical_feature('sex')

There were many more male passengers than female passengers.

In [None]:
visualize_categorical_feature('pclass')

The majority of the passengers were 3rd class passengers.

In [None]:
visualize_categorical_feature('embarked')

In [None]:
visualize_categorical_feature('survived')

More people died than survived: 549 versus 342, or roughly 1.6 times as many.

In [None]:
visualize_categorical_feature('who')

## Data Preparation

### 1. Imputing Missing Values

Imputing missing values for 'Age' and 'Embarked'

We will use the median to impute the missing **age** values, and the mode to impute the missing **embarked** and **embark_town** values.

In [None]:
imputer_median = SimpleImputer(strategy='median')
titanic['age'] = imputer_median.fit_transform(titanic[['age']]).ravel()  # flatten the (n, 1) result to 1-D

In [None]:
imputer_mode = SimpleImputer(strategy='most_frequent')
titanic['embarked'] = imputer_mode.fit_transform(titanic[['embarked']]).ravel()
titanic['embark_town'] = imputer_mode.fit_transform(titanic[['embark_town']]).ravel()
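To make the two strategies concrete, here is a minimal sketch on a made-up five-row frame (the `toy` frame and its values are assumptions for illustration only). The median fills the numeric gap and the most frequent value fills the categorical gap.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 26.0, 35.0],
    "embarked": pd.Series(["S", "C", np.nan, "S", "Q"], dtype=object),
})

# Median of the observed ages (22, 26, 35, 38) is 30.5.
toy["age"] = SimpleImputer(strategy="median").fit_transform(toy[["age"]]).ravel()

# "S" appears most often, so it replaces the missing port.
toy["embarked"] = (
    SimpleImputer(strategy="most_frequent").fit_transform(toy[["embarked"]]).ravel()
)

print(toy)
```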

### 2. Calculating Correlations Between Attributes 

#### Create the pairplot

In [None]:
sns.pairplot(titanic)

#### Correlation Between Fare and Age

Pearson correlation between Age and Fare

In [None]:
pearson_corr, pearson_p = pearsonr(titanic['age'], titanic['fare'])
print(
    "Pearson correlation between Age and Fare: {}, p-value: {}".format(pearson_corr, pearson_p)
)

Spearman's Rank correlation between Age and Fare

In [None]:
spearman_corr, spearman_p = spearmanr(titanic['age'], titanic['fare'])
print(
    "Spearman's rank correlation between Age and Fare: {}, p-value: {}".format(
        spearman_corr, spearman_p
    )
)
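The two coefficients answer slightly different questions: Pearson measures linear association, while Spearman measures monotonic association based on ranks. The sketch below uses made-up data with a strictly increasing but strongly non-linear relationship to show how they can diverge.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11)
y = np.exp(x)  # strictly increasing, but far from linear

pearson_r, _ = pearsonr(x, y)
spearman_r, _ = spearmanr(x, y)

print("Pearson: ", round(pearson_r, 3))
print("Spearman:", round(spearman_r, 3))
```

Because the ranks of `x` and `y` agree perfectly, Spearman's coefficient reaches its maximum, whereas Pearson's is noticeably lower. For skewed attributes such as fare, this is one reason to look at both.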

Spearman's Rank correlation between Age and sibsp

In [None]:
spearman_corr, spearman_p = spearmanr(titanic['age'], titanic['sibsp'])
print(
    "Spearman's rank correlation between Age and Number of Siblings / Spouse: {}, p-value: {}".format(
        spearman_corr, spearman_p
    )
)

#### Correlations Between Survived and Attributes

**Point Biserial correlation between survived and fare**

In [None]:
point_biserial_corr, point_biserial_p = pointbiserialr(titanic['survived'], titanic['fare'])
print(
    "Point Biserial correlation between survived and fare: {}, p-value: {}".format(
        point_biserial_corr, point_biserial_p
    )
)

There is a weak positive correlation between the fare and survival. This suggests that as the fare increases, the likelihood of survival tends to increase.
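The point-biserial coefficient is the Pearson correlation computed with one binary variable, so the two functions should agree. The sketch below checks this on made-up values (the `outcome` and `amount` arrays are illustrative, not taken from the dataset).

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

# Made-up values: a binary outcome and a continuous measurement.
outcome = np.array([0, 0, 0, 1, 1, 0, 1, 1])
amount = np.array([7.3, 8.1, 13.0, 71.3, 53.1, 8.5, 26.6, 30.1])

r_pb, p_pb = pointbiserialr(outcome, amount)
r_pearson, p_pearson = pearsonr(outcome, amount)

print("point-biserial:", r_pb)
print("pearson:       ", r_pearson)
```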

**Point Biserial correlation between survived and age**

In [None]:
point_biserial_corr, point_biserial_p = pointbiserialr(titanic['survived'], titanic['age'])
print(
    "Point Biserial correlation between survived and age: {}, p-value: {}".format(
        point_biserial_corr, point_biserial_p
    )
)

There is a very weak negative correlation between age and survival, meaning that as age increases, the likelihood of survival slightly decreases. 

**Point Biserial correlation between survived and sibsp**

In [None]:
point_biserial_corr, point_biserial_p = pointbiserialr(titanic['survived'], titanic['sibsp'])
print(
    "Point Biserial correlation between survived and sibsp: {}, p-value: {}".format(
        point_biserial_corr, point_biserial_p
    )
)

There is a very weak negative correlation between the number of siblings/spouses aboard and survival. This suggests that having more siblings or spouses aboard slightly decreases the likelihood of survival.

**Point Biserial correlation between survived and parch**

In [None]:
point_biserial_corr, point_biserial_p = pointbiserialr(titanic['survived'], titanic['parch'])
print(
    "Point Biserial correlation between survived and parch: {}, p-value: {}".format(
        point_biserial_corr, point_biserial_p
    )
)

There is a weak positive correlation between the number of parents/children aboard and survival. This suggests that having more parents or children aboard is slightly associated with a higher likelihood of survival. 

**Chi2 test between pclass and survived**

In [None]:
contingency_table = pd.crosstab(titanic['pclass'], titanic['survived'])
chi2, p, dof, expected = chi2_contingency(contingency_table)

print("Chi-squared Test Statistic: {}".format(chi2))
print("p-value: {}".format(p))
print("Degrees of Freedom: {}".format(dof))

The Chi-squared test statistic is large and the p-value is very small, which indicates a statistically significant association between the variables pclass and survived.
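The chi-squared statistic grows with the sample size, so an effect-size measure is a useful companion to it. One common choice is Cramér's V, $V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$, where $n$ is the total count and $r$ and $c$ are the numbers of rows and columns. It is not used in the original notebook; `cramers_v` below is a hypothetical helper, shown on made-up tables.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(table):
    """Cramér's V for a contingency table (0 = no association, 1 = perfect)."""
    # correction=False so that 2x2 tables are not Yates-corrected.
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    counts = np.asarray(table)
    n = counts.sum()
    min_dim = min(counts.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# Made-up counts: rows are groups, columns are outcomes.
perfect = pd.DataFrame([[10, 0], [0, 10]])
independent = pd.DataFrame([[10, 10], [10, 10]])

print(cramers_v(perfect), cramers_v(independent))
```

Passing any `pd.crosstab(...)` result from this section to `cramers_v` gives a value between 0 and 1 that can be compared across attribute pairs.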

**Chi2 test between embarked and survived**

In [None]:
contingency_table = pd.crosstab(titanic['embarked'], titanic['survived'])
chi2, p, dof, expected = chi2_contingency(contingency_table)

print("Chi-squared Test Statistic: {}".format(chi2))
print("p-value: {}".format(p))
print("Degrees of Freedom: {}".format(dof))

The high chi-squared statistic, together with a small p-value, suggests that passengers from different embarkation ports had different survival rates.

**Chi2 test between sex and survived**

In [None]:
contingency_table = pd.crosstab(titanic['sex'], titanic['survived'])
chi2, p, dof, expected = chi2_contingency(contingency_table)

print("Chi-squared Test Statistic: {}".format(chi2))
print("p-value: {}".format(p))
print("Degrees of Freedom: {}".format(dof))

The high chi-squared statistic suggests that a passenger's sex was very strongly associated with their likelihood of survival.

**Chi2 test between who and survived**

In [None]:
contingency_table = pd.crosstab(titanic['who'], titanic['survived'])
chi2, p, dof, expected = chi2_contingency(contingency_table)

print("Chi-squared Test Statistic: {}".format(chi2))
print("p-value: {}".format(p))
print("Degrees of Freedom: {}".format(dof))

There is a highly significant relationship between the who variable (man, woman, child) and survival. This aligns with historical knowledge that women and children were given priority in lifeboats, which greatly increased their chances of survival compared to men.
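The chi-squared test says that an association exists but not which groups fared better. A row-normalized crosstab makes the direction visible. The sketch below uses a made-up ten-row frame, so the rates it prints are illustrative only.

```python
import pandas as pd

# Made-up records; the real notebook would use titanic['who'] and titanic['survived'].
toy = pd.DataFrame({
    "who": ["man"] * 4 + ["woman"] * 4 + ["child"] * 2,
    "survived": [0, 0, 0, 1, 1, 1, 1, 0, 1, 0],
})

# normalize="index" turns each row of counts into proportions that sum to 1.
rates = pd.crosstab(toy["who"], toy["survived"], normalize="index")
print(rates)
```

The column labelled `1` is the survival rate within each group.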

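After imputing the missing values and exploring the correlations between the different attributes, let's confirm that **age**, **embarked**, and **embark_town** no longer have missing values. Only **deck** should still show missing values; we drop it in the next step.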

In [None]:
titanic.head()

In [None]:
missing_data = titanic.isnull().sum()

print("Missing Data in the Titanic Dataset:\n", missing_data)

### 3. Dropping Attributes

We will drop: <br>
    - **deck**: because of the high number of missing values <br>
    - **class**: because it is a duplicate of pclass <br>
    - **embark_town**: because it is a duplicate of embarked <br>
    - **alive**: because it is a duplicate of survived <br>
    - **adult_male**: because it is redundant given who (it is true exactly when who is man) <br>

In [None]:
titanic_cleaned = titanic.drop(['deck', 'class', 'embark_town', 'alive', 'adult_male'], axis=1)

In [None]:
titanic_cleaned.head()

### 4. Handling Skewness in Numerical Features

Apply the log(1 + x) transformation to the skewed numerical features: fare, sibsp, and parch.

In [None]:
titanic_cleaned['fare_log'] = np.log1p(titanic_cleaned['fare'])
titanic_cleaned['sibsp_log'] = np.log1p(titanic_cleaned['sibsp'])
titanic_cleaned['parch_log'] = np.log1p(titanic_cleaned['parch'])

In [None]:
titanic_cleaned.head()

### 5. Scaling Numerical Features

In [None]:
numerical_features = ['age', 'fare_log', 'sibsp_log', 'parch_log']

In [None]:
scaler = StandardScaler()
titanic_cleaned[numerical_features] = scaler.fit_transform(titanic_cleaned[numerical_features])

In [None]:
titanic_cleaned.head()

### 6. Encoding Categorical Features

In [None]:
nominal_vars = ['sex', 'who', 'embarked', 'alone']  # nominal variables to be one-hot encoded
ordinal_vars = ['pclass']  # ordinal variable to be ordinal encoded

One-hot encoding nominal variables

In [None]:
onehot_encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_nominal = onehot_encoder.fit_transform(titanic_cleaned[nominal_vars])
encoded_nominal_df = pd.DataFrame(
    encoded_nominal,
    columns=onehot_encoder.get_feature_names_out(nominal_vars),
    index=titanic_cleaned.index,  # keep row labels aligned with titanic_cleaned
)

In [None]:
encoded_nominal_df

Ordinal encoding ordinal variables

In [None]:
ordinal_encoder = OrdinalEncoder()
encoded_ordinal = ordinal_encoder.fit_transform(titanic_cleaned[ordinal_vars])
encoded_ordinal_df = pd.DataFrame(
    encoded_ordinal, columns=ordinal_vars, index=titanic_cleaned.index
)
encoded_ordinal_df

Drop the original categorical columns and the untransformed fare, sibsp, and parch columns

In [None]:
titanic_cleaned = titanic_cleaned.drop(nominal_vars + ordinal_vars + ['fare', 'sibsp', 'parch'], axis=1)

In [None]:
titanic_cleaned.head()

## Prepared Data

Combine the encoded nominal, ordinal, and scaled numerical features into one dataframe

In [None]:
titanic_final = pd.concat(
    [titanic_cleaned, encoded_nominal_df, encoded_ordinal_df], axis=1
)

In [None]:
print(titanic_final.head())
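`pd.concat(..., axis=1)` aligns rows by index label rather than by position. In this notebook all three frames share the same default index, so the rows line up. The sketch below, on made-up data, shows what would happen otherwise, which is worth keeping in mind if rows are ever filtered before encoding.

```python
import numpy as np
import pandas as pd

# Made-up scaled feature with a non-default index, as after filtering rows.
left = pd.DataFrame({"age": [0.1, -0.4, 1.2]}, index=[10, 11, 12])
encoded = np.array([[1.0], [0.0], [1.0]])

# A frame built from a bare array gets a fresh 0..n-1 index, so no labels match.
misaligned = pd.concat(
    [left, pd.DataFrame(encoded, columns=["sex_male"])], axis=1
)

# Reusing the original index keeps each encoded row next to its source row.
aligned = pd.concat(
    [left, pd.DataFrame(encoded, columns=["sex_male"], index=left.index)], axis=1
)

print(misaligned.shape, aligned.shape)
```

A quick `titanic_final.isnull().sum()` after the concatenation is a cheap way to catch this kind of misalignment.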