<a href="https://colab.research.google.com/github/krutikaParab/Data-Science-Projects/blob/main/Data_Preprocessing_Pipeline/Dpp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A data preprocessing pipeline should handle missing values, standardize numerical features, treat outliers, and make it easy to repeat the same preprocessing steps on new datasets.
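One common way to make preprocessing steps easy to repeat on new datasets is to fit them once and then reuse the fitted object. The sketch below uses scikit-learn's `Pipeline`, `ColumnTransformer`, and `SimpleImputer`, which is a different technique from the hand-written function in this notebook. The column names `num` and `cat` and both small DataFrames are hypothetical, and the sketch covers only imputation and scaling, not outlier handling.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: one numeric and one categorical column.
train = pd.DataFrame({
    "num": [1.0, 2.0, np.nan, 4.0, 5.0],
    "cat": pd.Series(["A", "B", np.nan, "A", "A"], dtype=object),
})

# Hypothetical new data that should receive the same preprocessing.
new = pd.DataFrame({
    "num": [np.nan, 3.0],
    "cat": pd.Series([np.nan, "B"], dtype=object),
})

# Numeric columns: fill missing values with the mean, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill missing values with the most frequent value.
preprocessor = ColumnTransformer([
    ("num", numeric_steps, ["num"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["cat"]),
])

# Learn the means, standard deviations, and modes from the training data only.
preprocessor.fit(train)

# Apply the already-fitted statistics to the new data.
new_clean = preprocessor.transform(new)
print(new_clean)
```

Because the statistics are learned in `fit` and only applied in `transform`, new data is filled and scaled with the training set's values rather than its own.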

In [15]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

def data_preprocessing_pipeline(data):
  # Work on a copy so the caller's DataFrame is left unchanged
  data = data.copy()

  # Identify numeric and categorical features
  numeric_features = data.select_dtypes(include='number').columns
  categorical_features = data.select_dtypes(include=['object']).columns

  # Handle missing values in numeric features (fill with the column mean)
  data[numeric_features] = data[numeric_features].fillna(data[numeric_features].mean())

  # Detect outliers in numeric features using the IQR rule and replace them with the column mean
  for feature in numeric_features:
    Q1 = data[feature].quantile(0.25)
    Q3 = data[feature].quantile(0.75)

    IQR = Q3 - Q1
    lower_bound = Q1 - (1.5 * IQR)
    upper_bound = Q3 + (1.5 * IQR)

    data[feature] = np.where((data[feature] < lower_bound) | (data[feature] > upper_bound),
                             data[feature].mean(), data[feature])

  # Standardize numeric features (runs once, after the outlier loop has finished)
  scaler = StandardScaler()
  data[numeric_features] = scaler.fit_transform(data[numeric_features])

  # Handle missing values in categorical features (fill with the most frequent value)
  data[categorical_features] = data[categorical_features].fillna(data[categorical_features].mode().iloc[0])

  return data


In [16]:
data = pd.read_csv("data.csv")

print("Original_data: ")
print(data)

Original_data: 
   NumericFeature1  NumericFeature2 CategoricalFeature
0              1.0                7                  A
1              2.0                8                  B
2              NaN                9                NaN
3              4.0               10                  A
4              5.0               11                  B
5              6.0               50                  C


In [17]:
# Perform data preprocessing
cleaned_data = data_preprocessing_pipeline(data)

print("Preprocessed Data:")
print(cleaned_data)

Preprocessed Data:
   NumericFeature1  NumericFeature2 CategoricalFeature
0        -1.535624        -1.099370                  A
1        -0.944999        -0.749128                  B
2         0.000000        -0.398886                  A
3         0.236250        -0.048645                  A
4         0.826874         0.301597                  B
5         1.417499         1.994431                  C
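
The IQR rule inside `data_preprocessing_pipeline` can be checked by hand on a single column. The following is a minimal sketch, assuming the six `NumericFeature2` values printed in the original data above; the variable names `values` and `outliers` are introduced here for illustration only.

```python
import pandas as pd

# NumericFeature2 values copied from the original data printed above.
values = pd.Series([7, 8, 9, 10, 11, 50], name="NumericFeature2")

# Quartiles and interquartile range, computed the same way as in the pipeline.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] counts as an outlier.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = values[(values < lower_bound) | (values > upper_bound)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"bounds=({lower_bound}, {upper_bound})")
print("flagged:", outliers.tolist())
```

For this column the quartiles are 8.25 and 10.75, so the bounds are 4.5 and 14.5 and only the value 50 is flagged.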
