<h1>Lab 21: Handling Missing Data with Imputation Techniques</h1>
<h2>Objective</h2>
<p>The objective of this lab is to teach students various techniques for imputing missing data. Students will learn how to apply statistical and model-based imputation techniques and understand their impact on datasets.</p>
<h2>Expected Outcomes</h2>
<p>By the end of this lab, students will be able to:</p>
<ul>
<li>Identify missing data in a dataset.</li>
<li>Apply different imputation techniques, such as mean, median, mode, and KNN imputation.</li>
<li>Understand when to use each imputation method.</li>
</ul>

<h3>Step 1: Import Required Libraries</h3>
<h4>Concept</h4>
<p>We need to import the necessary libraries for data manipulation, visualization, and imputation.</p>

In [None]:
# Import the libraries for data manipulation, imputation, and visualization
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
import seaborn as sns
import matplotlib.pyplot as plt

# Enable inline plotting for Jupyter notebooks
%matplotlib inline


<h3>Step 2: Load Dataset with Missing Data</h3>
<h4>Concept</h4>
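<p>We will use the <strong>Titanic</strong> dataset, which contains missing values in both numerical columns (such as <code>'age'</code>) and categorical columns (such as <code>'embarked'</code> and <code>'deck'</code>). The dataset is one of seaborn's example datasets and can be loaded by name with <code>sns.load_dataset</code>, which needs an internet connection the first time the data is fetched.</p>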

![image.png](attachment:image.png)

In [None]:
# Load the Titanic dataset from seaborn's built-in datasets


# Display the first few rows of the dataset



<h3>Step 3: Identifying Missing Data</h3>
<h4>Concept</h4>
<p>Before imputing missing data, it&rsquo;s important to identify which columns contain missing values and how much data is missing. This helps in choosing the appropriate imputation method.</p>
<p><strong>Step 3.1: Check for Missing Values in Each Column</strong></p>
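<p>A minimal sketch of this check, using a tiny made-up frame with Titanic-style column names rather than the real dataset. The values and the names <code>missing_counts</code> and <code>missing_pct</code> are illustrative assumptions; in the lab, <code>df</code> would be the DataFrame loaded in Step 2.</p>

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in with Titanic-style column names (not the real data).
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
    "embarked": ["S", "C", "S", np.nan, "Q", "S"],
})

# Number of missing values in each column
missing_counts = df.isnull().sum()

# Share of missing values in each column, as a percentage
missing_pct = df.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_counts, "percent": missing_pct.round(1)})
print(summary)
```

<p>Looking at the percentage as well as the raw count helps when deciding whether a column is worth imputing at all.</p>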

![image.png](attachment:image.png)

In [None]:
# Check for missing values in each column



<p><strong>Step 3.2: Visualize Missing Data</strong></p>
<p>A heatmap can be useful to visualize where missing data is present in the dataset.</p>
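<p>One common way to draw such a heatmap is to pass the Boolean frame from <code>isnull()</code> to <code>sns.heatmap</code>. The sketch below assumes a small made-up frame in place of the Titanic data; the colormap and figure size are arbitrary choices.</p>

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Tiny made-up stand-in with Titanic-style column names (not the real data).
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
    "embarked": ["S", "C", "S", np.nan, "Q", "S"],
})

# df.isnull() is True where a value is missing; each True cell is drawn
# in the contrasting color, so gaps show up as marks within a column.
fig, ax = plt.subplots(figsize=(6, 4))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap="viridis", ax=ax)
ax.set_title("Missing values by column")
```

<p>Columns with many marks have a high share of missing values; marks that line up across columns suggest rows that are missing several fields at once.</p>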

![image.png](attachment:image.png)

In [None]:
# Visualize missing data with a heatmap



<h3>Step 4: Imputing Missing Data Using Simple Techniques</h3>
<h4>Concept</h4>
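<p>Simple imputation methods replace missing values with the <strong>mean</strong>, <strong>median</strong>, or <strong>mode</strong> of a column. They are straightforward to apply. The mean suits numerical data that is roughly symmetric, the median is more robust when the data is skewed or contains outliers, and the mode is the usual choice for categorical data. All three fill every gap in a column with the same value, which reduces the column's variance.</p>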
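<p>As a rough illustration of how the mean and median can differ, the sketch below applies <code>SimpleImputer</code> to a handful of made-up ages that include one large value. The column names <code>age_mean</code> and <code>age_median</code> are assumptions chosen for this example.</p>

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up ages with one large value (80) that pulls the mean upward
df = pd.DataFrame({"age": [20.0, 22.0, 24.0, 26.0, 80.0, np.nan, np.nan]})

# SimpleImputer expects 2-D input, hence df[["age"]] rather than df["age"]
mean_imputer = SimpleImputer(strategy="mean")
df["age_mean"] = mean_imputer.fit_transform(df[["age"]]).ravel()

median_imputer = SimpleImputer(strategy="median")
df["age_median"] = median_imputer.fit_transform(df[["age"]]).ravel()

# The fill values learned from the observed ages
print("mean fill:", mean_imputer.statistics_[0])
print("median fill:", median_imputer.statistics_[0])
```

<p>Keeping the imputed values in new columns leaves the original <code>'age'</code> column untouched, which makes later comparisons easier.</p>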
<p><strong>Step 4.1: Mean Imputation</strong></p>

![image.png](attachment:image.png)

In [None]:
# Impute missing 'age' values with the mean of the column



<p><strong>Step 4.2: Median Imputation</strong></p>

![image.png](attachment:image.png)

In [None]:
# Impute missing 'age' values with the median of the column



<p><strong>Step 4.3: Mode Imputation</strong></p>
<p>Mode imputation replaces missing values with the most frequently occurring value in the column. It is often used for categorical data but can also be applied to numerical data.</p>
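<p>A small sketch of mode imputation with pandas, on made-up ages. One detail worth knowing is that <code>Series.mode()</code> returns a Series, because several values can tie for most frequent; taking the first entry is one possible convention, assumed here.</p>

```python
import numpy as np
import pandas as pd

# Made-up ages in which 24 and 30 each appear twice
ages = pd.Series([22.0, 24.0, 24.0, 30.0, 30.0, np.nan], name="age")

# mode() ignores NaN and returns all tied values in sorted order
modes = ages.mode()

# Fill with the first (smallest) of the tied modes
age_mode_imputed = ages.fillna(modes.iloc[0])

print(modes.tolist())
print(age_mode_imputed.tolist())
```

<p>For a continuous variable such as age, ties and rarely repeated values are common, so the mode can be a fairly arbitrary fill value.</p>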

![image.png](attachment:image.png)

In [None]:
# Impute missing 'age' values with the mode (most frequent value) of the column



<p><strong>Display the Results</strong></p>

![image.png](attachment:image.png)

In [None]:
# Display the original and imputed 'age' columns



<h3>Step 5: Imputing Categorical Data</h3>
<h4>Concept</h4>
<p>For categorical columns, missing values can often be imputed with the <strong>mode</strong> (most frequent value), as it represents the most common category.</p>
<p><strong>Step 5.1: Mode Imputation for Categorical Data</strong></p>
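<p>A minimal sketch of this step using <code>SimpleImputer</code> with <code>strategy="most_frequent"</code>, on a made-up column of embarkation codes. The column name <code>embarked_imputed</code> is an assumption for illustration.</p>

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up embarkation codes with one missing entry, stored as plain objects
df = pd.DataFrame({"embarked": ["S", "C", "S", np.nan, "Q", "S"]}, dtype=object)

# "most_frequent" works for strings as well as numbers
mode_imputer = SimpleImputer(strategy="most_frequent")
df["embarked_imputed"] = mode_imputer.fit_transform(df[["embarked"]]).ravel()

print("fill value:", mode_imputer.statistics_[0])
print(df)
```

<p>With only a few missing entries, filling with the dominant category changes the column very little; with many gaps it would inflate that category's share.</p>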

![image.png](attachment:image.png)

In [None]:
# Impute missing 'embarked' values with the mode of the column



<p><strong>Display the Result</strong></p>

![image.png](attachment:image.png)

In [None]:
# Display the original and imputed 'embarked' columns



<h3>Step 6: K-Nearest Neighbors (KNN) Imputation</h3>
<h4>Concept</h4>
<p><strong>KNN Imputation</strong> replaces each missing value with the average of that feature across the <strong>k-nearest neighbors</strong>, that is, the k rows that are most similar on the features both rows have observed. This method can capture more complex patterns in data, especially when features are correlated.</p>
<p><strong>Note:</strong> KNN Imputer requires numerical data. If your dataset contains categorical variables, you need to encode them or select only numerical columns.</p>
<p><strong>Step 6.1: Applying KNN Imputer</strong></p>
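<p>The sketch below shows one way to apply <code>KNNImputer</code> to selected numerical columns. It uses a made-up frame with two clear groups of passengers so the effect is easy to follow; the values, the choice of <code>n_neighbors=2</code>, and the column name <code>age_knn</code> are assumptions for illustration.</p>

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up data: three low-fare passengers and two high-fare passengers
df = pd.DataFrame({
    "age": [22.0, 24.0, np.nan, 60.0, 62.0],
    "fare": [7.0, 8.0, 7.5, 80.0, 82.0],
})

# Select the numerical columns to use for the imputation
num_cols = ["age", "fare"]

# Each missing value becomes the mean of that column over the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(df[num_cols]), columns=num_cols, index=df.index
)

# Store the result in a new column so the original stays available
df["age_knn"] = imputed["age"]
print(df)
```

<p>The row with the missing age has a fare close to the two low-fare passengers, so its age is filled from their ages rather than from the overall column average.</p>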

![image.png](attachment:image.png)

In [None]:
# Select numerical columns for KNN Imputation


# Apply KNN Imputation to the selected columns


# Add KNN-imputed 'age' and 'fare' columns back to the DataFrame



<p><strong>Display the Results</strong></p>

![image.png](attachment:image.png)

In [None]:
# Display the original and KNN-imputed 'age' and 'fare' columns



<h3>Step 7: Visualizing the Effect of Imputation</h3>
<h4>Concept</h4>
<p>Visualizing the distribution of original and imputed values can help assess the impact of imputation techniques on the data's distribution.</p>
<p><strong>Step 7.1: Plot Distributions of Original and Imputed 'Age' Columns</strong></p>

![image.png](attachment:image.png)

In [None]:
# Plot the distributions of the original and imputed 'age' columns



<h3>Step 8: Discussion Questions</h3>
<ol>
<li>
<p><strong>What are the potential risks of using mean or median imputation for numerical data?</strong></p>
</li>
<li>
<p><strong>How does KNN imputation differ from simple imputation techniques, and when would it be more appropriate?</strong></p>
</li>
<li>
<p><strong>Why might it be necessary to consider the distribution of the data when choosing an imputation method?</strong></p>
</li>
</ol>

<h3>Step 9: Practice Task</h3>
<h4>Practice</h4>
<ul>
<li><strong>Apply KNN Imputation to Additional Columns:</strong>
<ul>
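<li>Try including additional numerical columns from the Titanic dataset, such as <code>'sibsp'</code> and <code>'parch'</code>, in the KNN imputation. These columns have no missing values themselves, but they give the imputer more features for finding similar passengers.</li>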
</ul>
</li>
<li><strong>Adjust the Number of Neighbors (k):</strong>
<ul>
<li>Experiment with different values of <code>n_neighbors</code> in the <code>KNNImputer</code> and observe the effect on the imputed values.</li>
</ul>
</li>
<li><strong>Reflect on Imputation Methods:</strong>
<ul>
<li>Consider which imputation method might work best for each type of column in the dataset and why.</li>
</ul>
</li>
</ul>

<h2>Lab Explanation</h2>
<p><strong>Introduction:</strong></p>
<p>This lab introduces various imputation techniques for handling missing data and explains why proper imputation is essential for maintaining data quality.</p>
<p><strong>Dataset:</strong></p>
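<p>The Titanic dataset is used, as it contains missing values in both numerical (<code>'age'</code>) and categorical (<code>'embarked'</code>, <code>'deck'</code>) columns. In seaborn's copy of the dataset, <code>'fare'</code> has no missing values; it appears in the KNN step as a feature that helps locate similar passengers.</p>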
<p><strong>Simple Imputation Techniques:</strong></p>
<p>Students learn how to replace missing values with statistical measures: the <strong>mean</strong> for roughly symmetric numerical data, the <strong>median</strong> for skewed numerical data, and the <strong>mode</strong> for categorical data.</p>
<p><strong>Categorical Data Imputation:</strong></p>
<p>Mode imputation is applied to a categorical column (<code>'embarked'</code>) to demonstrate handling missing values in categorical data.</p>
<p><strong>KNN Imputation:</strong></p>
<p>KNN-based imputation is introduced as a more advanced technique, suited to numerical columns whose missing values are likely related to other features.</p>
<p><strong>Visual Comparison:</strong></p>
<p>Kernel Density Estimate (KDE) plots visually show the effects of each imputation method, allowing students to observe the impact on data distribution.</p>
<p><strong>Discussion Questions:</strong></p>
<p>The questions guide students to reflect on the pros and cons of each imputation method and consider the importance of data distribution.</p>
<p><strong>Practice Task:</strong></p>
<p>An optional task allows students to try different KNN parameters and consider which imputation techniques best suit various columns.</p>
<hr />
<h3>Additional Notes</h3>
<ul>
<li>
<p><strong>Considerations When Choosing Imputation Methods:</strong></p>
<ul>
<li><strong>Mean/Median Imputation:</strong>
<ul>
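<li>The mean works best for roughly symmetric numerical data without outliers; the median is more robust when the data is skewed or contains outliers.</li>
<li>Both shrink the column's variance and can distort its distribution when many values are missing.</li>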
</ul>
</li>
<li><strong>Mode Imputation:</strong>
<ul>
<li>Suitable for categorical data.</li>
<li>May not be appropriate when many values are missing or no single category clearly dominates, since it inflates the share of the most common category.</li>
</ul>
</li>
<li><strong>KNN Imputation:</strong>
<ul>
<li>Accounts for the similarity between instances.</li>
<li>Computationally more intensive.</li>
<li>Can handle complex patterns.</li>
</ul>
</li>
</ul>
</li>
<li>
<p><strong>Data Preprocessing for KNN Imputation:</strong></p>
<ul>
<li>Ensure all features used are numerical.</li>
<li>Consider scaling the data, as KNN uses distance metrics that are sensitive to the scale of the data.</li>
</ul>
</li>
</ul>
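<p>The scaling advice above can be sketched as follows: standardize the features, impute in the scaled space, then map the result back to the original units. This is one possible arrangement, not the only one; the data is made up and <code>StandardScaler</code> is an assumed choice of scaler.</p>

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Made-up numerical features on very different scales
df = pd.DataFrame({
    "age": [22.0, 24.0, np.nan, 60.0, 62.0],
    "fare": [7.0, 8.0, 7.5, 80.0, 82.0],
    "sibsp": [1.0, 0.0, 1.0, 0.0, 1.0],
})

# StandardScaler ignores NaN when fitting and leaves it as NaN when transforming
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# Distances are now computed on comparable scales
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)

# Convert back to the original units
imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled), columns=df.columns, index=df.index
)
print(imputed)
```

<p>Without scaling, a wide-ranging feature such as fare would dominate the distance calculation and the other features would have little influence on which neighbors are chosen.</p>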

---

# Submission
Submit all files to myConnexion.