Chapter 3 - Handling Missing Values

In [1]:
# Import Dataset

import pandas as pd
df = pd.read_csv("Titanic-Dataset.csv")

In [2]:
# Checking Missing Values
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [3]:
# We have missing values in Age, Cabin and Embarked columns
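
Raw counts are easier to judge as a share of all rows. The sketch below uses a small made-up DataFrame (`toy` is a hypothetical stand-in, not the Titanic file) to show how `isnull().mean()` gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, not the real Titanic file
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives True/False per cell; the mean of booleans is the share of True
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

The same expression can be applied to `df` to see which columns are mostly empty and which have only a few gaps.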

In [None]:
# Handling Missing Values

# Missing values are very common in real-world datasets
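# Missing values may appear as NaN, as a placeholder such as ?, or as a blank cell
# Note: isnull() only counts NaN/None, so a placeholder such as ? is treated as ordinary text until it is converted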
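
A minimal sketch of reading a file that marks missing entries with `?`. The CSV text is invented for illustration; `na_values` is the `read_csv` parameter that tells pandas which extra strings to treat as missing (blank cells are already read as NaN by default).

```python
import io
import pandas as pd

# Hypothetical CSV content: one "?" and one blank in Age, one "?" in Embarked
raw = io.StringIO("Name,Age,Embarked\nA,22,S\nB,?,C\nC,,?\n")

# Treat "?" as missing in addition to the default markers
placeholder_df = pd.read_csv(raw, na_values=["?"])
print(placeholder_df.isnull().sum())
```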

# Key Notes on Handling Missing Values
# 1. Dropping rows with missing values can lead to loss of valuable data
# 2. Dropping columns with missing values can lead to loss of valuable features
# 3. Imputed values are estimates, so imputation can bias analysis results (for example, by shrinking a column's variance)
# 4. Always understand the data before handling missing values
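
To make notes 1 and 2 concrete, here is a rough sketch of how much data each kind of drop removes. The DataFrame is made up for illustration; `dropna` and its `axis` and `subset` parameters are standard pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Cabin is mostly missing, Age has one gap
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 41.0],
    "Cabin": [np.nan, np.nan, "C85", np.nan],
    "Fare": [7.25, 8.05, 53.1, 13.0],
})

rows_kept = toy.dropna()                  # drops every row with any NaN
cols_kept = toy.dropna(axis=1)            # drops every column with any NaN
subset_kept = toy.dropna(subset=["Age"])  # drops rows only where Age is NaN

print(len(toy), len(rows_kept), list(cols_kept.columns), len(subset_kept))
```

Dropping all incomplete rows keeps just one of four rows here, while restricting the check to `Age` keeps three, which is why it helps to look at the data before choosing.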

# Handling Numerical Data
# Numerical data is data that represents numbers
# Numerical data can be continuous or discrete

# Example of Continuous Data: Age, Height, Weight, Temperature
# Continuous data can take any value within a range
# Age: 25.3 years, 45.75 years
# Height: 167.5 cm, 172.3 cm
# Weight: 68.4 kg, 75.9 kg
# Temperature: 36.6°C, 98.2°F

# Examples of Discrete Data: Number of Siblings, Number of Children, Days Absent from Work
# Discrete data consists of counts, so it takes only whole-number values (no fractions or decimals).

# Number of Siblings: 0, 1, 2, 3, 4
# Number of Children: 0, 1, 2, 3
# Days Absent from Work: 0, 2, 4, 10

# Fill missing numerical values using the mean or the median
# If the data contains outliers (or is strongly skewed), use the median; otherwise use the mean
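
A small sketch of why the median is preferred when outliers are present. The numbers are invented; a single extreme value pulls the mean far from the typical value, while the median stays put.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one extreme outlier and one missing entry
fares = pd.Series([7.0, 8.0, 9.0, 10.0, 500.0, np.nan])

mean_value = fares.mean()      # 106.8, pulled up by the 500.0 outlier
median_value = fares.median()  # 9.0, unaffected by the outlier

# Fill the gap with the median, since the data contains an outlier
filled_with_median = fares.fillna(median_value)
print(mean_value, median_value)
```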

# Handling Categorical Data
# Categorical data is non-numeric data that represents categories
# Categorical data can be ordinal or nominal
# Ordinal data has a natural order
# Nominal data does not have a natural order
# Fill missing categorical values using the mode (the most frequent category)
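
One detail worth knowing about the mode: `Series.mode()` returns a Series, not a single value, because several categories can tie. The sketch below uses invented port codes to show that `[0]` then picks the first of the tied values in sorted order.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column where "C" and "S" both appear twice
ports = pd.Series(["S", "C", "S", "Q", "C", np.nan])

modes = ports.mode()   # NaN is ignored; tied modes are returned sorted
fill_value = modes[0]  # first of the tied values
filled = ports.fillna(fill_value)
print(list(modes), fill_value)
```

When there is a tie, taking `[0]` is an arbitrary choice, so it is worth checking how many modes the column has.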

In [4]:

# Impute Missing Values using Mean, Median, Mode
# Here we impute missing values in the Age column using the median value.
# The result is assigned back to the column: calling
# df['Age'].fillna(..., inplace=True) acts on an intermediate copy, raises a
# FutureWarning in recent pandas 2.x, and will stop updating df in pandas 3.0.
df['Age'] = df['Age'].fillna(df['Age'].median())


In [5]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [6]:
# As we can see, there are now no missing values in the Age column

In [7]:
# Impute Missing Values of the Embarked Column using the Mode
# mode() returns a Series, so [0] selects the most frequent value.
# The result is assigned back to the column rather than using inplace=True
# on df['Embarked'], which will stop updating df in pandas 3.0.
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])


In [8]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64

In [9]:
# As we can see, there are now no missing values in the Embarked column
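
The two imputations can also be written as a single call by passing a dictionary of column names to fill values to `DataFrame.fillna`. This is a sketch on a made-up frame (`toy` and `fill_values` are illustrative names), not a replacement for the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0],
    "Embarked": ["S", None, "S"],
})

# One fill value per column: median for the numeric one, mode for the categorical one
fill_values = {
    "Age": toy["Age"].median(),
    "Embarked": toy["Embarked"].mode()[0],
}
toy = toy.fillna(fill_values)
print(toy)
```

Columns not listed in the dictionary are left untouched, so their missing values remain.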

In [10]:
# Handle Missing Values by Deleting a Column
# A column can be dropped when it is not important for the analysis or is mostly missing
# Cabin has 687 missing values, far more than any other column, so we drop it

df.drop('Cabin', axis=1, inplace=True)

In [11]:
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

In [12]:
# As we can see, the Cabin column has been deleted
# This is how missing values in a dataset can be handled