# Importing Libraries and Data

This section focuses on setting up our environment. We'll import the required Python packages for our analysis and load the datasets we'll be working with.

### Importing Data

In [1]:
import pandas as pd

# Using pandas, load the two CSV files into DataFrames named 'train_df' and 'test_df'


'train_df': This dataset will be used for Exploratory Data Analysis (EDA), data wrangling, and building our model.

'test_df': This is a separate dataset containing the same predictor variables as the training data but excluding the target variable, 'Survived'. After training our model, we'll use this dataset to make predictions on passenger survival.

| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| Pclass | Ticket Class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| Sex | Sex | - |
| Age | Age in years | - |
| Sibsp | # of siblings / spouses aboard the Titanic | - |
| Parch | # of parents / children aboard the Titanic | - |
| Ticket | Ticket Number | - |
| Fare | Passenger Fare | - |
| Cabin | Cabin Number | - |
| Embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

# Session 1 - Exploratory Data Analysis (EDA)

In this section, we'll dive into Exploratory Data Analysis (EDA).

### 1.1) Initial Data Inspection

In [2]:
# Display top 5 rows of train_df


In [3]:
# Display some general information about our data


In [4]:
# Display summary statistics on variables with type 'object'


In [5]:
# Display summary statistics on variables with type 'number'


In [6]:
# Display number of null values in each column


### 1.2) Univariate Analysis

Univariate analysis examines one variable at a time to uncover its individual characteristics. This involves analysing its distribution, including central tendency, spread, and shape, and is often visualised through histograms and box plots. Summary statistics like mean and standard deviation provide a numerical overview of the variable's distribution, laying the groundwork for understanding individual data components before exploring relationships between multiple variables.
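
As a minimal sketch of these summaries, the snippet below describes one numeric and one categorical column. The `toy` DataFrame and its values are invented for illustration and are not the Titanic data.

```python
import pandas as pd

# Invented values for illustration only (not the Titanic data)
toy = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
    "Sex": ["male", "female", "female", "female", "male", "male", "female", "female"],
})

# Central tendency and spread of a single numeric variable
age_summary = toy["Age"].agg(["mean", "median", "std", "min", "max"])
print(age_summary)

# Frequency of each category in a single categorical variable
sex_counts = toy["Sex"].value_counts()
print(sex_counts)
```

The same two calls should carry over to any single column of `train_df`, with histograms and box plots giving the visual counterpart.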

Analysis to understand individual characteristics of attributes:

Gender

In [7]:
# Use the groupby() function to calculate the survival counts based on sex and load into a new dataframe

# Use pandas' loc[] functionality on this dataframe to extract the number of males who survived

# Use pandas' loc[] functionality on this dataframe to extract the number of females who survived

# Print the results
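
One possible shape for this cell, sketched on an invented `toy` DataFrame rather than `train_df`: `groupby()` followed by `size()` and `unstack()` gives a small table with one row per sex and one column per survival outcome, which `loc[]` can then index by row and column label.

```python
import pandas as pd

# Invented values for illustration only (not the Titanic data)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Rows are sexes, columns are survival outcomes (0 or 1), cells are counts
survival_counts = toy.groupby(["Sex", "Survived"]).size().unstack(fill_value=0)

# loc[row_label, column_label] picks out a single count
males_survived = survival_counts.loc["male", 1]
females_survived = survival_counts.loc["female", 1]

print(f"Males who survived: {males_survived}")
print(f"Females who survived: {females_survived}")
```

The `fill_value=0` argument is a defensive choice in this sketch: it keeps the table free of missing cells if some combination never occurs.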


Use a seaborn countplot to display survival counts by sex.

In [8]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Use sns.set_style() to set a whitegrid style

# Create a countplot using sns.countplot

# Label the x and y axes, and add a title

# Add a legend

# Display the plot



P Class

In [9]:
# Use the groupby function to calculate the survival counts based on passenger class and load into a new dataframe

# Use pandas' loc[] functionality to extract the number of passengers who survived for each class

# Print the results


In [10]:
# Create a seaborn plot showing survival counts by class

# Create a countplot using sns.countplot()

# label the x and y axes, add a title

# Add legend for clarity

# Display the plot


Age

In [11]:
# Create a seaborn histogram to display the distribution of age amongst passengers

# Create a histogram with Kernel Density Estimation (kde) of passenger ages using sns.histplot()

# label x and y axes, add a title

# Display the plot


In [12]:
# Create subplots using plt.subplots(), in a 1 by 2 configuration

# In the first subplot, create a histogram for those who did not survive

# Label the x and y axes, add a title and limit the y axis to be between 0 and 120

# In the second subplot, create a histogram for those who survived

# Label the x and y axes, add a title and limit the y axis to be between 0 and 120

# Improve layout using tight_layout()


### 1.3) Multivariate Analysis (additional task)

Multivariate analysis explores relationships between multiple variables simultaneously to uncover complex patterns and interactions. Techniques like correlation analysis help understand variable dependencies and predict outcomes based on multiple factors. By going beyond single-variable analysis, it provides a more comprehensive understanding of datasets and supports better decision-making.
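
As a rough sketch of correlation analysis, the snippet below computes a correlation matrix and hides its redundant upper triangle. The `toy` DataFrame is invented for illustration; only the column names echo the Titanic data.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only (not the Titanic data)
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2],
    "Fare": [80.0, 7.5, 20.0, 8.0, 95.0, 15.0],
    "Survived": [1, 0, 1, 0, 1, 0],
})

# Pairwise Pearson correlation between the numeric columns
corr = toy.corr(numeric_only=True)

# True on and above the diagonal; those cells repeat the lower triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Blank out the masked cells so each pair of variables appears once
lower = corr.mask(mask)
print(lower.round(2))
```

A boolean array of this kind can also be passed to the `mask` parameter of `sns.heatmap()` so that the hidden cells are not drawn.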

Correlation Matrix

In [13]:
import numpy as np

# Set the seaborn style using sns.set_style(); read the documentation to pick a style that you like

# Set figure size for better readability - try 10 by 8

# Calculate correlation matrix (using numeric_only for clarity) using .corr()

# Create a mask for the upper triangle using np.triu()

# Choose a diverging color palette

# Plot the correlation matrix using sns.heatmap()

# Add title for context

# Improve layout using plt.tight_layout() and then show the plot


Comments on how to interpret results

# Session 2 - Data Wrangling

Data wrangling encompasses the crucial process of cleaning, transforming, and preparing raw data to make it suitable for effective data analysis and modeling. It's the essential bridge between messy, real-world data and meaningful insights.

To preserve the original data, create a clone of the dataset before performing any data manipulation.

In [14]:
# Create a clone of 'train_df' called 'train_df_preprocessed'


### 2.1) Handling Missing Values

Deletion: You can delete rows with missing values. This is a simple approach, but it can reduce the size of your dataset and introduce bias if the missing values are not randomly distributed.

Imputation: You can replace missing values with estimated values. There are many different imputation methods, including:

Mean/median/mode imputation: Replacing missing values with the mean, median, or mode of the non-missing values in the same column.

Regression imputation: Using a regression model to predict the missing values based on the other variables in the dataset.

K-nearest neighbors (KNN) imputation: Finding the k most similar rows to the row with the missing value and using the average of their values to impute the missing value.

Using algorithms that support missing values: Some machine learning algorithms can handle missing values without any preprocessing. For example, decision trees can be used to handle missing values by creating separate branches for rows with and without the missing value.

The best approach for handling missing values will depend on the specific dataset and the goals of the analysis.
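
The deletion and simple imputation options above can be sketched as follows. The `toy` DataFrame is invented for illustration, and the choice of median and mode here is only an example, not a recommendation for the Titanic data.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only (not the Titanic data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Deletion: drop every row whose 'Embarked' value is missing
dropped = toy.dropna(subset=["Embarked"])

# Imputation: work on a copy so the original frame stays untouched
imputed = toy.copy()
# Numeric column: fill gaps with the median of the observed values
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())
# Categorical column: fill gaps with the most frequent value (the mode)
imputed["Embarked"] = imputed["Embarked"].fillna(imputed["Embarked"].mode()[0])

print(dropped)
print(imputed)
```

Deletion shrinks the frame from five rows to four, while imputation keeps all five rows.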

In [15]:
# Use the .isnull() and .sum() methods to display null values 


Age

Skewness is a measure of the asymmetry of a distribution. A symmetrical distribution has a skewness of 0. A distribution that is skewed to the right (i.e., has a longer tail on the right side) has a positive skewness, and a distribution that is skewed to the left (i.e., has a longer tail on the left side) has a negative skewness.

Mean imputation: Suitable for datasets with a relatively symmetrical distribution, typically when the skewness falls within the range of -0.5 to +0.5. Mean imputation is sensitive to outliers, meaning that extreme values can significantly influence the mean.

Median imputation: More appropriate for datasets with a skewed distribution (either positive or negative), generally when the skewness is outside the range of -0.5 to +0.5. The median is less affected by outliers compared to the mean.
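
The rule of thumb above can be written as a small helper. This is a sketch only: `choose_fill_value` is a hypothetical function name, the two example series are invented, and the 0.5 threshold is the one quoted in the text.

```python
import pandas as pd

def choose_fill_value(values, threshold=0.5):
    """Pick the mean for roughly symmetric data, otherwise the median."""
    skewness = values.skew()  # sample skewness; missing values are skipped
    if abs(skewness) <= threshold:
        return "mean", values.mean()
    return "median", values.median()

# Invented examples: one symmetric series and one with a long right tail
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 40.0])

print(choose_fill_value(symmetric))
print(choose_fill_value(right_skewed))
```

The single extreme value in `right_skewed` pulls the mean well above most of the data, which is why the median is the safer fill value there.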

For Age, we will plot the distribution and calculate its skewness. This will help us decide whether to use the mean or the median for imputation.

In [16]:
# Set a visually appealing style using sns.set_style

# Set the matplotlib figure using plt.figure()

# Create the histogram with KDE using seaborn

# Set plot title and labels

# Calculate skewness

# Display skewness value on the plot using plt.text()

# Display the plot


Since there is only a moderate skew, we will use mean imputation for the age column

In [17]:
# Calculate the mean of the age column

# Using the mean age, impute missing age values


Cabin

Since the majority of the cabin data is missing, we will drop this column for training

In [18]:
# Drop cabin column


Embarked

Since only 2 embarked values are missing, we will drop these rows

In [19]:
# Drop rows with null values in embarked column


### 2.2) Encoding Categorical Variables

Encoding categorical variables is a crucial step in preparing data for machine learning algorithms, as these algorithms typically work with numerical data. Here's a breakdown of common encoding methods:

1. Nominal Data (Unordered Categories):

One-Hot Encoding:

Creates binary (0 or 1) columns for each unique category.

Suitable for variables where there's no inherent order.

Example: For "Color" (Red, Blue, Green), you'd have three columns: "Color_Red," "Color_Blue," "Color_Green."

Hashing Encoding:

Uses a hash function to convert categories into numerical values.

Can be more memory-efficient for high-cardinality variables (many unique categories).

2. Ordinal Data (Ordered Categories):

Ordinal Encoding (Label Encoding):

Assigns a unique integer to each category based on its order.

Preserves the order information, which can be beneficial for some algorithms.

Example: For "Size" (Small, Medium, Large), you might assign 1, 2, and 3 respectively.

Target Encoding (Mean Encoding):

Replaces each category with the mean target value (e.g., average sales) for that category.

Can be powerful but prone to overfitting, especially with small datasets.
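
Three of the encodings described above can be sketched with pandas alone. The `toy` DataFrame reuses the "Color" and "Size" examples from the text; the `Sold` target column and all values are invented for illustration.

```python
import pandas as pd

# Invented values for illustration only
toy = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Red"],
    "Size": ["Small", "Large", "Medium", "Small"],
    "Sold": [1, 0, 1, 0],  # hypothetical binary target
})

# One-hot encoding: one indicator column per unique colour
one_hot = pd.get_dummies(toy["Color"], prefix="Color")

# Ordinal encoding: integers that follow the natural order of the sizes
size_order = {"Small": 1, "Medium": 2, "Large": 3}
toy["Size_encoded"] = toy["Size"].map(size_order)

# Target encoding: replace each colour with the mean target for that colour
color_means = toy.groupby("Color")["Sold"].mean()
toy["Color_target"] = toy["Color"].map(color_means)

print(one_hot)
print(toy)
```

In this sketch the target means are computed on the same rows they are applied to, which is exactly the situation where target encoding tends to overfit.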

In [20]:
# Print column names with categorical values 


In [21]:
# Print number of unique values in each of these columns


For the purpose of this training, we will drop the ticket and name column

In [22]:
# Drop Ticket and Name columns


In [23]:
# Print all unique values in the Sex and Embarked columns


For the purpose of this training, we will use label encoding 

In [24]:
# Using the .map() function, map the values in Sex and Embarked columns to integer values


In [25]:
# View values in 'Sex' and 'Embarked' columns to ensure changes have been made


### Additional Task - Feature Engineering 

Age Buckets 

The task below focuses on feature engineering for exploratory data analysis (EDA). We will transform the numerical 'Age' feature into categorical age groups to potentially uncover hidden patterns and gain a deeper understanding of the data.
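
A minimal sketch of binning with `pd.cut()` follows. The ages, bin edges, and labels are all invented for illustration; pick edges that suit your own analysis.

```python
import pandas as pd

# Invented ages for illustration only
ages = pd.Series([4, 15, 30, 52, 71])

# Bin edges and one label per interval (there is one fewer label than edges)
bins = [0, 12, 18, 60, 120]
labels = ["Child", "Teen", "Adult", "Senior"]

# By default each interval includes its right edge: (0, 12], (12, 18], ...
age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group)
```

With these default settings a value equal to the lowest edge falls outside every bin and becomes missing; `include_lowest=True` is the `pd.cut()` option that changes this.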

In [26]:
# Define age bins and corresponding labels

# Create the 'Age_Group' column using pd.cut()

# Create a seaborn count plot to visualise this new feature

# Create the count plot

# Add labels and title

# Rotate x-axis labels for better readability (if needed)

# Show the plot


# Session 3 - Logistic Regression: Model Training and Testing

In [27]:
# Load the first five rows of train_df_preprocessed


### 3.1) Train Test Split

A train-test split is a common technique used in machine learning to evaluate the performance of a model. It involves dividing the dataset into two separate subsets (generally a 70/30 split):   

Training set: This subset is used to train the model. The algorithm learns patterns and relationships from this data.

Testing set: This subset is held back and used to evaluate the model's performance after training. It helps assess how well the model generalizes to unseen data.
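
As a sketch of the split described above, the snippet below separates predictors from the target and holds back 30% of the rows. The `toy` DataFrame is invented for illustration; only the column names echo the Titanic data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented values for illustration only (not the Titanic data)
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2, 3, 1, 2, 3],
    "Sex": [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
})

# X holds the predictor variables, y holds the target variable
X = toy.drop(columns="Survived")
y = toy["Survived"]

# Hold back 30% of the rows; random_state makes the shuffle repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)
```

With ten rows this gives seven training rows and three test rows, and no row appears in both subsets.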

In [28]:
# View data to ensure all necessary wrangling is complete

# Gender should be assigned a value of 1 or 0, and there should be no null values


In [29]:
from sklearn.model_selection import train_test_split

# Split data into X (Predictor variables) and y (Target Variable) - this is the 'Survived' column.


In [30]:
# Split data into train and test data with a 70/30 split, set the random_state to be 42


### 3.2) Training Model

In [31]:
from sklearn.linear_model import LogisticRegression

# Initialize the logistic regression model

# Train the model on the training data using .fit()

# Make predictions on the test data using .predict()

# Calculate the accuracy of the model on the test data using .score()

# Display the accuracy score


In [32]:
from sklearn.metrics import confusion_matrix

# Initialise the confusion matrix

# Using seaborn, create a visually appealing heatmap to display this information; try using sns.heatmap()
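
The training and evaluation steps in the two cells above can be sketched end to end on synthetic data. Here `make_classification` generates an invented binary problem standing in for the preprocessed Titanic features, and `max_iter=1000` is an assumption chosen only to make convergence warnings unlikely.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data (not the Titanic data)
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the model on the training rows only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict labels for the held-back rows
y_pred = model.predict(X_test)

# .score() returns mean accuracy on whichever data it is given
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Train accuracy: {train_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

Comparing the two accuracy figures gives a quick check for overfitting, and the diagonal of `cm` holds the correctly classified test rows.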


### 3.3) Testing Model on test_df

Apply the same data preparation steps used on train_df to test_df.

Kaggle submission requirements: CSV file with 2 columns: PassengerId and Survived with 418 rows of data

Since test_df contains exactly 418 rows, we cannot remove any rows during preprocessing.

Create a clone of the test data

In [33]:
# Create a clone of 'test_df' called 'test_df_preprocessed'


1) Null Values

In [34]:
# Check for null values


Note: We'll apply the same preprocessing steps outlined in Section 2 to the test data, ensuring consistency in data preparation between the training and test sets.
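
One way to keep the two sets consistent without dropping rows is to fill gaps rather than delete them. In the sketch below the fill values are computed from the training frame and reused on the test frame; that choice is an assumption of this example, since the notebook does not say which mean to use. Both small frames are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only (not the Titanic data)
train = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0],
    "Fare": [7.0, 8.0, 9.0, 10.0],
})
test = pd.DataFrame({
    "Age": [np.nan, 50.0],
    "Fare": [np.nan, 12.0],
})

# Assumption: statistics come from the training data and are reused as-is
fill_values = {"Age": train["Age"].mean(), "Fare": train["Fare"].mean()}

# fillna with a dict fills each column with its own value; no rows are removed
test_filled = test.fillna(fill_values)
print(test_filled)
```

The row count of `test_filled` matches that of `test`, which is what the submission format requires.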

In [35]:
# Using the mean age, impute missing age values

# Drop cabin column

# Replace null value in fare column with mean


2) Encoding Categorical Variables

In [36]:
# Drop Ticket and Name columns

# Map values in Sex and Embarked columns to integer values


### 3.4) Make predictions on test_df_preprocessed and export output.

In [37]:
# Make predictions on test_df_preprocessed using our trained logistic regression model

# Create DataFrame 'output' with the Passenger ID and the corresponding 'Survived' value

# Export this DataFrame as a csv
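
A sketch of the export step follows. The IDs and predictions are invented placeholders, and the CSV is written to an in-memory buffer here so the example leaves no file behind; passing a file path to `to_csv()` instead would write to disk.

```python
import io

import pandas as pd

# Invented placeholders standing in for real IDs and model predictions
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

# Two columns, in the order the submission format expects
output = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# index=False keeps the row index out of the file, leaving exactly two columns
buffer = io.StringIO()
output.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Leaving out `index=False` would add a third, unnamed column to the CSV.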


Once the CSV file is created, please flag a facilitator to submit it for you.