# Data Imputation and Outlier Removal Tutorial

In this notebook, we will learn how to handle missing data through imputation and how to remove outliers from our dataset.

## Libraries

- **Pandas**: A software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
- **NumPy**: A library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- **Scikit-learn (sklearn)**: A machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
- **SciPy**: A library for scientific computing in Python. This tutorial uses its `scipy.stats` module to compute Z-scores.


In [None]:
!pip install pandas numpy scipy scikit-learn

In [5]:
# Importing required libraries
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from scipy import stats

## Data Imputation

Data imputation is the process of replacing missing data with substituted values. Common strategies include replacing the missing values in a column with that column's mean, median, or mode (the most frequent value), or with a fixed constant.
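
As a rough illustration of these strategies, the sketch below applies each one with plain pandas `fillna`. The Series values and the constant `0.0` are made up for this example and are not part of the tutorial's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value (illustrative data only)
s = pd.Series([1.0, 2.0, 2.0, np.nan, 10.0])

filled_mean = s.fillna(s.mean())           # mean of the observed values
filled_median = s.fillna(s.median())       # median of the observed values
filled_mode = s.fillna(s.mode().iloc[0])   # most frequent observed value
filled_const = s.fillna(0.0)               # arbitrary fixed constant

# Side-by-side comparison of the four strategies
print(pd.DataFrame({
    "mean": filled_mean,
    "median": filled_median,
    "mode": filled_mode,
    "constant": filled_const,
}))
```

Note how the single large value (10.0) pulls the mean well above the median and mode, which is one reason the median is often preferred for skewed columns.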

In this tutorial, we will use the `SimpleImputer` class from `sklearn.impute` to replace missing values with the mean of the column. Let's first create a dummy DataFrame with some missing values.


In [7]:
# Creating a dataframe with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, 4, np.nan]
})
df


|   | A   | B   | C   |
|---|-----|-----|-----|
| 0 | 1.0 | NaN | 1.0 |
| 1 | 2.0 | 2.0 | 2.0 |
| 2 | NaN | 3.0 | 3.0 |
| 3 | 4.0 | 4.0 | 4.0 |
| 4 | 5.0 | 5.0 | NaN |
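
Before imputing, it can be useful to check how much data is actually missing. The following is a minimal sketch that rebuilds the same DataFrame and counts missing values per column; the variable names `missing_per_column` and `rows_with_missing` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same DataFrame as in the cell above
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, 4, np.nan]
})

# Number of missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```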


Now, let's use the `SimpleImputer` class to replace the missing values with the mean of the respective column.


In [8]:
# Creating the SimpleImputer object
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

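# Performing the imputation (fit_transform returns a NumPy array,
# so we rebuild the DataFrame with the original column labels and index)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)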
df_imputed


|   | A   | B   | C   |
|---|-----|-----|-----|
| 0 | 1.0 | 3.5 | 1.0 |
| 1 | 2.0 | 2.0 | 2.0 |
| 2 | 3.0 | 3.0 | 3.0 |
| 3 | 4.0 | 4.0 | 4.0 |
| 4 | 5.0 | 5.0 | 2.5 |
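
The fill values come from the column means that the imputer learned during fitting. As a sketch of how this can be inspected and reused, the example below reads them from the fitted imputer's `statistics_` attribute and applies the same imputer to new data with `transform`. The `new_rows` DataFrame is hypothetical and only serves to illustrate the idea.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same DataFrame as in the imputation example above
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, 4, np.nan]
})

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(df)

# Per-column means learned from the observed (non-missing) values
learned_means = dict(zip(df.columns, imputer.statistics_))
print(learned_means)

# Hypothetical new rows, filled with the means learned from df
new_rows = pd.DataFrame({
    'A': [np.nan, 7],
    'B': [8, np.nan],
    'C': [np.nan, 9]
})
new_imputed = pd.DataFrame(imputer.transform(new_rows), columns=new_rows.columns)
print(new_imputed)
```

Fitting once and then calling `transform` keeps the fill values fixed, which matters when the same imputation must be applied consistently to training and test data. Passing `strategy='median'`, `'most_frequent'`, or `'constant'` (together with `fill_value`) selects the other strategies mentioned earlier.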


## Outlier Removal

Outliers are data points that differ significantly from other observations. They can arise from genuine variability in the data or from measurement and data-entry errors. Because statistics such as the mean and standard deviation are sensitive to extreme values, outliers can distort the results of your data analysis and statistical modeling.

There are many ways to identify and remove outliers, such as the Z-score method and the interquartile range (IQR) method. In this tutorial, we will use the Z-score method. The Z-score measures how many standard deviations a value lies from the mean of its column: `z = (x - mean) / std`. A value whose Z-score is greater than 3 or less than -3 is commonly treated as an outlier.
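
To make the definition concrete, here is a small sketch that computes Z-scores by hand with NumPy and compares them with `scipy.stats.zscore`. The array `x` is made-up illustrative data. Note that `stats.zscore` uses the population standard deviation (`ddof=0`) by default, which matches NumPy's default for `std`.

```python
import numpy as np
from scipy import stats

# Hypothetical values chosen so that the mean is 5 and the population std is 2
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.mean()
std = x.std()  # population standard deviation (ddof=0)

# Z-score: distance from the mean in units of standard deviation
z_manual = (x - mean) / std
z_scipy = stats.zscore(x)

print(z_manual)
print(np.allclose(z_manual, z_scipy))
```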

Let's first create a dummy DataFrame with some outliers.


In [9]:
# Creating a dataframe with some outliers
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 100],
    'B': [1, 2, 3, 4, 5, -100],
    'C': [1, 2, 3, 4, 5, 1000]
})
df


|   | A   | B    | C    |
|---|-----|------|------|
| 0 | 1   | 1    | 1    |
| 1 | 2   | 2    | 2    |
| 2 | 3   | 3    | 3    |
| 3 | 4   | 4    | 4    |
| 4 | 5   | 5    | 5    |
| 5 | 100 | -100 | 1000 |


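Now, let's apply the Z-score method. Note that this example has only six rows, and `stats.zscore` uses the population standard deviation by default, so the largest possible |z| among n values is sqrt(n - 1), about 2.24 for n = 6. No row can therefore reach the threshold of 3, and the output below still contains row 5. Flagging it would require a larger sample or a lower threshold (for example, 2).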


In [10]:
# Calculating Z-scores
z_scores = np.abs(stats.zscore(df))

# Defining a threshold to identify an outlier
threshold = 3

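# Identifying outliers: (row, column) positions where |z| exceeds the threshold
outliers = np.where(z_scores > threshold)

# Removing outliers: keep only rows where every column is within the threshold
df_no_outliers = df[(z_scores < threshold).all(axis=1)]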
df_no_outliers


|   | A   | B    | C    |
|---|-----|------|------|
| 0 | 1   | 1    | 1    |
| 1 | 2   | 2    | 2    |
| 2 | 3   | 3    | 3    |
| 3 | 4   | 4    | 4    |
| 4 | 5   | 5    | 5    |
| 5 | 100 | -100 | 1000 |
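
Because the Z-score filter above keeps row 5, the IQR method mentioned earlier is worth trying on the same data. The following is a minimal sketch, not part of the original notebook: it assumes the common convention of treating values more than 1.5 × IQR below the first quartile or above the third quartile as outliers, and the names `lower`, `upper`, and `df_iqr` are introduced here for illustration.

```python
import pandas as pd

# Same DataFrame as in the outlier example above
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 100],
    'B': [1, 2, 3, 4, 5, -100],
    'C': [1, 2, 3, 4, 5, 1000]
})

# Per-column quartiles and interquartile range
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

# Assumed convention: fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only rows where every column lies within its fences
within_fences = ((df >= lower) & (df <= upper)).all(axis=1)
df_iqr = df[within_fences]
print(df_iqr)
```

Quartiles are far less affected by a single extreme value than the mean and standard deviation, so this filter drops row 5 even on a six-row sample.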
