<a href="https://colab.research.google.com/github/Srikara2005/Data-Analytics-Lab/blob/main/Lab_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Lab 2**
Imputation

Imputation is the process of replacing missing data with substituted values.

In data science and statistics, datasets often have gaps: cells where information was not recorded or was lost. Because most machine learning algorithms cannot process blank values (NaN/Null), you must handle them. Instead of deleting the entire row or column (which loses valuable information), imputation fills these gaps with estimated values.

The goal of imputation is not just to "fill the hole," but to do so in a way that preserves the overall statistical relationships (like the mean, variance, and correlations) of the dataset.
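As a rough illustration of why this matters, the sketch below fills gaps with the column mean and compares summary statistics before and after. The salary values are hypothetical and are not taken from the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical salary values; np.nan marks the missing entries
salary = pd.Series([48000.0, 55000.0, np.nan, 62000.0, 75000.0, np.nan, 90000.0])

# Mean imputation: replace every gap with the mean of the observed values
salary_filled = salary.fillna(salary.mean())

mean_before, mean_after = salary.mean(), salary_filled.mean()
std_before, std_after = salary.std(), salary_filled.std()

print(f"mean: {mean_before:.1f} -> {mean_after:.1f}")
print(f"std:  {std_before:.1f} -> {std_after:.1f}")
```

The mean is unchanged, but the standard deviation shrinks because every filled value sits exactly at the center of the distribution. This is one reason a simple fill can distort variance and correlations.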

In [None]:
import numpy as np
import pandas as pd

df = pd.read_excel("Employee_data.xlsx")

In [None]:
df.head()

Unnamed: 0,Employee_ID,Name,Department,Age,Salary,Join_Date
0,101,John Doe,Engineering,28.0,75000.0,2021-01-15
1,102,Jane Smith,Marketing,34.0,82000.0,2019-03-12
2,103,Mike Johnson,Engineering,,90000.0,2020-06-01
3,104,Sarah Williams,HR,29.0,62000.0,2022-02-20
4,105,Robert Brown,Sales,45.0,55000.0,2018-11-10


In [None]:
print(df)

    Employee_ID              Name   Department    Age    Salary  Join_Date
0           101          John Doe  Engineering   28.0   75000.0 2021-01-15
1           102        Jane Smith    Marketing   34.0   82000.0 2019-03-12
2           103      Mike Johnson  Engineering    NaN   90000.0 2020-06-01
3           104    Sarah Williams           HR   29.0   62000.0 2022-02-20
4           105      Robert Brown        Sales   45.0   55000.0 2018-11-10
5           101          John Doe  Engineering   28.0   75000.0 2021-01-15
6           106       Emily Davis    Marketing   31.0       NaN 2021-07-22
7           107      Chris Miller        Sales   22.0   48000.0 2023-01-05
8           108       Anna Taylor           HR  205.0   72000.0 2020-10-15
9           109      David Wilson  Engineering   40.0  120000.0 2015-05-19
10          110       Linda Moore        Sales   37.0   64000.0 2017-08-30
11          111    James Anderson          ???   29.0   59000.0 2022-04-12
12          112    Barbar

1. Simple Imputation (Univariate)

This is the most basic approach. You replace missing values with a summary statistic of that column (mean, median, mode) or a constant value. It treats every feature independently.

Mean: Good for normally distributed numerical data.

Median: Better when the data is skewed or contains outliers, such as the Age value of 205 in this dataset.

Most Frequent (Mode): Used for categorical data.
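The choice between these statistics can be sketched with plain pandas. The small table below is hypothetical (it only mimics the shape of the employee data), and the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one age is missing and one (205) is a data-entry outlier
toy = pd.DataFrame({
    "Age": [28.0, 34.0, np.nan, 29.0, 205.0],
    "Department": ["Engineering", "Sales", "Engineering", None, "HR"],
})

age_mean = toy["Age"].mean()             # pulled upward by the outlier
age_median = toy["Age"].median()         # robust to the outlier
dept_mode = toy["Department"].mode()[0]  # most frequent category

toy_filled = toy.copy()
toy_filled["Age"] = toy_filled["Age"].fillna(age_median)
toy_filled["Department"] = toy_filled["Department"].fillna(dept_mode)

print(age_mean, age_median, dept_mode)
print(toy_filled)
```

Here the single outlier drags the mean far above any typical age, while the median stays close to the bulk of the values, so the median is the safer fill for this column.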

In [None]:
from sklearn.impute import SimpleImputer
import pandas as pd

# Create a copy of the dataframe to keep the original safe
df_simple = df.copy()

# Identify numerical columns for imputation
numerical_cols = df_simple.select_dtypes(include=['number']).columns

# Initialize the imputer (strategy can be 'mean', 'median', 'most_frequent', 'constant')
imputer = SimpleImputer(strategy='mean')

# Fit and transform only the numerical data
df_simple[numerical_cols] = imputer.fit_transform(df_simple[numerical_cols])

print("\n--- Simple Imputation (Mean) ---")
print(df_simple)



--- Simple Imputation (Mean) ---
    Employee_ID              Name   Department       Age         Salary  \
0         101.0          John Doe  Engineering   28.0000   75000.000000   
1         102.0        Jane Smith    Marketing   34.0000   82000.000000   
2         103.0      Mike Johnson  Engineering   35.9375   90000.000000   
3         104.0    Sarah Williams           HR   29.0000   62000.000000   
4         105.0      Robert Brown        Sales   45.0000   55000.000000   
5         101.0          John Doe  Engineering   28.0000   75000.000000   
6         106.0       Emily Davis    Marketing   31.0000  108624.979167   
7         107.0      Chris Miller        Sales   22.0000   48000.000000   
8         108.0       Anna Taylor           HR  205.0000   72000.000000   
9         109.0      David Wilson  Engineering   40.0000  120000.000000   
10        110.0       Linda Moore        Sales   37.0000   64000.000000   
11        111.0    James Anderson          ???   29.0000   59000.0

2. K-Nearest Neighbors (KNN) Imputation

Concept: This method finds the k rows (neighbors) that are most similar to the row with the missing value. It then averages the values of those neighbors to fill the gap.

This is often more accurate than simple imputation because it uses the relationships between features: rows that are similar in their observed values are assumed to be similar in the missing one.

Key Parameter: n_neighbors (the number of neighbors to use).
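The mechanics can be sketched by hand with NumPy. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the helper `knn_impute_value` and the toy rows are hypothetical, and the distance is computed only over features observed in both rows.

```python
import numpy as np

def knn_impute_value(X, row, col, k=2):
    """Estimate X[row, col] from the k most similar rows that have `col` observed."""
    target = X[row]
    candidates = []
    for i, other in enumerate(X):
        if i == row or np.isnan(other[col]):
            continue  # a neighbor must have a value to donate
        # Compare only the features that both rows have observed
        shared = ~np.isnan(target) & ~np.isnan(other)
        if not shared.any():
            continue
        dist = np.sqrt(np.mean((target[shared] - other[shared]) ** 2))
        candidates.append((dist, other[col]))
    candidates.sort(key=lambda pair: pair[0])
    nearest = [value for _, value in candidates[:k]]
    return float(np.mean(nearest))

# Hypothetical (Age, Salary in thousands) rows; the first Age is missing
X = np.array([
    [np.nan, 80.0],
    [30.0, 78.0],
    [34.0, 82.0],
    [55.0, 150.0],
])

estimate = knn_impute_value(X, row=0, col=0, k=2)
print(estimate)
```

The two rows with the closest salaries donate their ages, and the distant high-salary row is ignored. Because the distance mixes all shared features, columns with large scales (such as raw salaries) tend to dominate it, so scaling the features before KNN imputation is usually worth considering.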

In [None]:
from sklearn.impute import KNNImputer
import pandas as pd

df_knn = df.copy()

# Separate numerical and non-numerical columns
numerical_cols = df_knn.select_dtypes(include=['number']).columns
non_numerical_cols = df_knn.select_dtypes(exclude=['number']).columns

# Initialize KNN Imputer
# n_neighbors=3 means it looks at the 3 most similar rows
knn_imputer = KNNImputer(n_neighbors=3)

# Apply KNN Imputer only to numerical data
df_knn_imputed_numerical = pd.DataFrame(
    knn_imputer.fit_transform(df_knn[numerical_cols]),
    columns=numerical_cols,
    index=df_knn.index,  # keep the original index so concat aligns rows correctly
)

# Combine the imputed numerical columns with the original non-numerical columns
df_knn_imputed = pd.concat([df_knn_imputed_numerical, df_knn[non_numerical_cols]], axis=1)

print("\n--- KNN Imputation ---")
print(df_knn_imputed)



--- KNN Imputation ---
    Employee_ID         Age    Salary              Name   Department  \
0         101.0   28.000000   75000.0          John Doe  Engineering   
1         102.0   34.000000   82000.0        Jane Smith    Marketing   
2         103.0   33.333333   90000.0      Mike Johnson  Engineering   
3         104.0   29.000000   62000.0    Sarah Williams           HR   
4         105.0   45.000000   55000.0      Robert Brown        Sales   
5         101.0   28.000000   75000.0          John Doe  Engineering   
6         106.0   31.000000   78000.0       Emily Davis    Marketing   
7         107.0   22.000000   48000.0      Chris Miller        Sales   
8         108.0  205.000000   72000.0       Anna Taylor           HR   
9         109.0   40.000000  120000.0      David Wilson  Engineering   
10        110.0   37.000000   64000.0       Linda Moore        Sales   
11        111.0   29.000000   59000.0    James Anderson          ???   
12        112.0   33.000000   78000.0   

3. Multivariate Imputation by Chained Equations (MICE)

Concept: Also known as Iterative Imputation. This is a sophisticated method that models each feature with missing values as a function of other features.

It first fills the missing values with a placeholder (e.g., the column mean).

It then treats each column with missing values, in turn, as the "target" and fits a regression model (BayesianRidge by default in scikit-learn) on the other columns to predict the missing entries.

It repeats this cycle until the imputed values converge or the maximum number of iterations (max_iter) is reached.
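The steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's algorithm: it swaps BayesianRidge for ordinary least squares, runs a fixed number of sweeps instead of checking convergence, and the function name `iterative_impute` and the toy data are hypothetical.

```python
import numpy as np

def iterative_impute(X, n_iter=10):
    """Simplified chained-equations sketch using ordinary least squares."""
    X = X.astype(float).copy()
    missing = np.isnan(X)
    # Step 1: placeholder fill with the column means
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue  # nothing to impute in this column
            observed = ~missing[:, j]
            # Step 2: regress column j on the other columns (plus an intercept)
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            coef, *_ = np.linalg.lstsq(A[observed], X[observed, j], rcond=None)
            # Overwrite only the originally missing cells with predictions
            X[missing[:, j], j] = A[missing[:, j]] @ coef
    return X

# Hypothetical data where the second column is exactly 2 * first + 1
data = np.array([
    [1.0, 3.0],
    [2.0, 5.0],
    [3.0, 7.0],
    [4.0, 9.0],
    [5.0, np.nan],
])

filled = iterative_impute(data)
print(filled[4, 1])
```

The placeholder for the missing cell is the column mean (6.0), but the regression step replaces it with a value that follows the linear relationship between the two columns, which is the behavior that distinguishes this approach from a simple mean fill.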

In [None]:
from sklearn.experimental import enable_iterative_imputer  # Explicitly enable
from sklearn.impute import IterativeImputer
import pandas as pd

df_mice = df.copy()

# Separate numerical and non-numerical columns
numerical_cols = df_mice.select_dtypes(include=['number']).columns
non_numerical_cols = df_mice.select_dtypes(exclude=['number']).columns

# Initialize MICE Imputer
# random_state ensures reproducibility
mice_imputer = IterativeImputer(max_iter=10, random_state=0)

# Apply MICE Imputer only to numerical data
df_mice_imputed_numerical = pd.DataFrame(
    mice_imputer.fit_transform(df_mice[numerical_cols]),
    columns=numerical_cols,
    index=df_mice.index,  # keep the original index so concat aligns rows correctly
)

# Combine the imputed numerical columns with the original non-numerical columns
df_mice_imputed = pd.concat([df_mice_imputed_numerical, df_mice[non_numerical_cols]], axis=1)

print("\n--- MICE (Iterative) Imputation ---")
print(df_mice_imputed)



--- MICE (Iterative) Imputation ---
    Employee_ID         Age         Salary              Name   Department  \
0         101.0   28.000000   75000.000000          John Doe  Engineering   
1         102.0   34.000000   82000.000000        Jane Smith    Marketing   
2         103.0   35.927008   90000.000000      Mike Johnson  Engineering   
3         104.0   29.000000   62000.000000    Sarah Williams           HR   
4         105.0   45.000000   55000.000000      Robert Brown        Sales   
5         101.0   28.000000   75000.000000          John Doe  Engineering   
6         106.0   31.000000  108624.976187       Emily Davis    Marketing   
7         107.0   22.000000   48000.000000      Chris Miller        Sales   
8         108.0  205.000000   72000.000000       Anna Taylor           HR   
9         109.0   40.000000  120000.000000      David Wilson  Engineering   
10        110.0   37.000000   64000.000000       Linda Moore        Sales   
11        111.0   29.000000   59000.000

4. Time-Series Imputation (Forward/Backward Fill)

Concept: If your data is time-series data (ordered by time), filling with the mean can be misleading because it ignores trends. Instead, we propagate the last known value forward or the next valid value backward.

FFill (Forward Fill): Takes the previous valid value and fills it forward.

BFill (Backward Fill): Takes the next valid value and fills it backward.
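A small time-indexed series makes the difference between these fills easier to see. The dates and readings below are hypothetical; linear interpolation is included for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with two consecutive gaps
dates = pd.date_range("2024-01-01", periods=5, freq="D")
readings = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0], index=dates)

forward = readings.ffill()   # gaps take the last known value (10.0)
backward = readings.bfill()  # gaps take the next known value (16.0)
linear = readings.interpolate(method="linear")  # evenly spaced between 10.0 and 16.0

print(pd.DataFrame({"raw": readings, "ffill": forward, "bfill": backward, "linear": linear}))
```

Forward fill is the usual choice when future values must not leak into the past (for example, when building features for forecasting), while backward fill and interpolation both use information from later time steps.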

In [None]:
df_time = df.copy()

# Forward Fill
df_ffill = df_time.ffill()

# Backward Fill
df_bfill = df_time.bfill()

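# Linear Interpolation (connecting the dots)
# This assumes values change linearly between consecutive rows.
# Only numeric columns are interpolated; text columns cannot be.
num_cols = df_time.select_dtypes(include=['number']).columns
df_interp = df_time.copy()
df_interp[num_cols] = df_time[num_cols].interpolate(method='linear')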

print("\n--- Time Series: Linear Interpolation ---")
print(df_interp)



--- Time Series: Linear Interpolation ---
    Employee_ID              Name   Department    Age    Salary  Join_Date
0           101          John Doe  Engineering   28.0   75000.0 2021-01-15
1           102        Jane Smith    Marketing   34.0   82000.0 2019-03-12
2           103      Mike Johnson  Engineering   31.5   90000.0 2020-06-01
3           104    Sarah Williams           HR   29.0   62000.0 2022-02-20
4           105      Robert Brown        Sales   45.0   55000.0 2018-11-10
5           101          John Doe  Engineering   28.0   75000.0 2021-01-15
6           106       Emily Davis    Marketing   31.0   61500.0 2021-07-22
7           107      Chris Miller        Sales   22.0   48000.0 2023-01-05
8           108       Anna Taylor           HR  205.0   72000.0 2020-10-15
9           109      David Wilson  Engineering   40.0  120000.0 2015-05-19
10          110       Linda Moore        Sales   37.0   64000.0 2017-08-30
11          111    James Anderson          ???   29.0   5

