# Data Preprocessing Pipeline

A data preprocessing pipeline chains multiple preprocessing steps into a single automated workflow. It specifies the transformations and calculations needed to clean and prepare data for analysis. Data engineers, data analysts, data scientists, and machine learning engineers use such pipelines to automate repetitive preprocessing tasks, which frees time for higher-value work.

The pipeline is composed of interconnected steps, each responsible for a specific task, such as:

- **Imputing Missing Values**: Filling in gaps in the data to maintain consistency.
- **Scaling Numeric Features**: Standardizing numerical variables to ensure uniformity across features.
- **Encoding Categorical Variables**: Transforming categorical data into a format suitable for analysis or machine learning models.
- **Detecting and Handling Outliers**: Identifying and addressing anomalies that could skew results.

Because the steps run in a predefined order, the pipeline gives consistent and reproducible results every time it is applied. The steps listed above are the core operations most pipelines need, although the right choice for each step depends on the dataset and the downstream model.

Here's how to create a basic data preprocessing pipeline in Python. The function below imputes missing values, handles outliers, and scales the numeric features.

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

In [8]:
def data_preprocessing_pipeline(data):
    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()
    numeric_features = data.select_dtypes(include=['float', 'int']).columns
    categorical_features = data.select_dtypes(include=['object']).columns

    # Impute missing values in numeric features with the column mean
    data[numeric_features] = data[numeric_features].fillna(data[numeric_features].mean())

    # Detect and handle outliers in numeric features using IQR
    for feature in numeric_features:
        Q1 = data[feature].quantile(0.25)
        Q3 = data[feature].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - (1.5 * IQR)
        upper_bound = Q3 + (1.5 * IQR)
        data[feature] = np.where((data[feature] < lower_bound) | (data[feature] > upper_bound),
                                 data[feature].mean(), data[feature])
        
    # Standardize numeric features (zero mean, unit variance)
    scaler = StandardScaler()
    data[numeric_features] = scaler.fit_transform(data[numeric_features])

    # Impute missing values in categorical features with the most frequent value (mode)
    data[categorical_features] = data[categorical_features].fillna(data[categorical_features].mode().iloc[0])
    
    return data
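
Note that the function imputes categorical columns but does not encode them, even though encoding is one of the steps listed earlier. The following is a minimal sketch of how that step might look, assuming one-hot encoding with `pd.get_dummies` is the desired scheme. The frame `df_example` and its values are hypothetical and only serve as illustration.

```python
import pandas as pd

# Hypothetical, already-imputed frame used only for illustration
df_example = pd.DataFrame({
    "NumericFeature1": [0.5, -1.2, 0.7],
    "CategoricalFeature": ["A", "B", "A"],
})

# Assumption: one-hot encoding is wanted; the pipeline function above does not do this
encoded = pd.get_dummies(df_example, columns=["CategoricalFeature"], dtype=int)

print(encoded.columns.tolist())
# ['NumericFeature1', 'CategoricalFeature_A', 'CategoricalFeature_B']
```

Other schemes, such as ordinal encoding, may suit some models better; one-hot encoding is used here only because it makes no assumption about category order.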

Now that our basic data preprocessing pipeline is ready, we can test it on a sample dataset and see how it works.

In [9]:
# Load sample data
df = pd.read_csv('data.csv')
print('Original Data')
print(df)

Original Data
   NumericFeature1  NumericFeature2 CategoricalFeature
0              1.0                7                  A
1              2.0                8                  B
2              NaN                9                NaN
3              4.0               10                  A
4              5.0               11                  B
5              6.0               50                  C


In [10]:
# Perform data preprocessing with pipeline
clean_data = data_preprocessing_pipeline(df)

print("Processed Data:")
print(clean_data)

Processed Data:
   NumericFeature1  NumericFeature2 CategoricalFeature
0        -1.535624        -1.099370                  A
1        -0.944999        -0.749128                  B
2         0.000000        -0.398886                  A
3         0.236250        -0.048645                  A
4         0.826874         0.301597                  B
5         1.417499         1.994431                  C
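
To see why the value 50 in NumericFeature2 was treated as an outlier, the IQR bounds can be recomputed by hand. This sketch repeats the outlier step of the function on that single column, using the values from the sample data; the variable names are illustrative.

```python
import pandas as pd

# NumericFeature2 from the sample data (it has no missing values)
values = pd.Series([7, 8, 9, 10, 11, 50], dtype=float)

q1 = values.quantile(0.25)      # 8.25
q3 = values.quantile(0.75)      # 10.75
iqr = q3 - q1                   # 2.5
lower_bound = q1 - 1.5 * iqr    # 4.5
upper_bound = q3 + 1.5 * iqr    # 14.5

# Only 50 falls outside [4.5, 14.5]; it is replaced with the column mean
is_outlier = (values < lower_bound) | (values > upper_bound)
replaced = values.where(~is_outlier, values.mean())

print(replaced.tolist())
```

The replacement value is the mean of the original column (about 15.83), which still includes the outlier itself, so it lands slightly above the upper bound of 14.5. Replacing with the median or clipping to the bounds are common alternatives if that matters for the analysis.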


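The processed output differs from the original in three ways. The missing value in NumericFeature1 was filled with the column mean, which becomes 0 after standardization. The outlier 50 in NumericFeature2 was replaced with the column mean before scaling. The missing CategoricalFeature value was filled with the mode (A and B are tied, and the first mode, A, is used). Wrapping these steps in a single function removes repetition and keeps the cleaning logic in one place.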
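
The scaled values in NumericFeature1 can also be reproduced without scikit-learn, which helps confirm what the scaling step does. The sketch below assumes the column after mean imputation (the `NaN` replaced by 3.6) and applies the z-score formula directly.

```python
import numpy as np

# NumericFeature1 after mean imputation: the NaN becomes the column mean, 3.6
x = np.array([1.0, 2.0, 3.6, 4.0, 5.0, 6.0])

# StandardScaler divides by the population standard deviation (ddof=0),
# whereas pandas' Series.std() defaults to the sample version (ddof=1)
z = (x - x.mean()) / x.std(ddof=0)

print(np.round(z, 6))
```

The result matches the NumericFeature1 column in the processed output up to rounding. Using `ddof=1` instead would give slightly smaller magnitudes, which is a common source of mismatches when comparing pandas and scikit-learn results.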