## Automating Data Cleaning in Python

### Task: Basic Pipeline with Scaling
1. Objective: Create a pipeline that scales numerical features in a dataset.
2. Steps:
    - Load a sample dataset with Pandas.
    - Define a pipeline using Pipeline from sklearn.pipeline.
    - Use StandardScaler to scale features.

In [5]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Step 1: Load sample dataset with missing values
data = {
    'Age': [25, 30, np.nan, 40, 22],
    'Salary': [50000, np.nan, 60000, 65000, np.nan]
}
df = pd.DataFrame(data)
print("🔹 Original DataFrame with Missing Values:")
print(df)

# Step 2: Define pipeline with Imputation + Scaling
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),       # Handle missing values
    ('scaler', StandardScaler())                       # Scale the data
])

# Step 3: Fit and transform the data using the pipeline
processed_data = pipeline.fit_transform(df)

# Step 4: Convert result back to DataFrame
processed_df = pd.DataFrame(processed_data, columns=df.columns)
print("\n✅ Processed DataFrame (Imputed and Scaled):")
print(processed_df)

🔹 Original DataFrame with Missing Values:
    Age   Salary
0  25.0  50000.0
1  30.0      NaN
2   NaN  60000.0
3  40.0  65000.0
4  22.0      NaN

✅ Processed DataFrame (Imputed and Scaled):
        Age    Salary
0 -0.695414 -1.725164
1  0.122720  0.000000
2  0.000000  0.345033
3  1.758989  1.380131
4 -1.186295  0.000000
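
To see what the two pipeline steps compute, the sketch below redoes them by hand with pandas and compares the result with the pipeline output. It assumes the same sample data as the cell above; the names `filled`, `manual`, `from_pipeline`, and `matches` are made up for this illustration. StandardScaler divides by the population standard deviation, which is why `ddof=0` is passed to `std`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Age': [25, 30, np.nan, 40, 22],
    'Salary': [50000, np.nan, 60000, 65000, np.nan],
})

# Manual version: fill each NaN with its column mean, then standardize.
filled = df.fillna(df.mean())
manual = (filled - filled.mean()) / filled.std(ddof=0)  # ddof=0: population std

# Pipeline version: the same two steps chained together.
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
from_pipeline = pipeline.fit_transform(df)

# Compare the two results element by element.
matches = np.allclose(manual.to_numpy(), from_pipeline)
print(matches)  # True
```

An imputed cell ends up at exactly 0 after scaling because filling with the column mean leaves that mean unchanged, which explains the 0.000000 entries in the processed output.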


In [None]:
# Write your code from here
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


# Step 1: Load a sample dataset
# load_boston was never imported here and was removed in scikit-learn 1.2,
# so this uses the bundled wine dataset instead
# (you can replace this with your own DataFrame)
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)

# We'll just use the first 4 columns as numerical features for demonstration
X = df.iloc[:, :4]  # Select numerical features

# Step 2: Define a pipeline with scaling
pipeline = Pipeline([
    ('scaler', StandardScaler())
])

# Step 3: Fit and transform the data using the pipeline
scaled_data = pipeline.fit_transform(X)

# Convert the scaled data back to a DataFrame for easier inspection
scaled_df = pd.DataFrame(scaled_data, columns=X.columns)

# Display the first few rows
print("🔍 Scaled Features:")
print(scaled_df.head())

### Task: Pipeline with Imputation
1. Objective: Automate data cleaning by handling missing values.
2. Steps:
    - Load a dataset with missing values.
    - Define a pipeline to use SimpleImputer for filling missing values.

In [4]:
# Write your code from here

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Step 1: Load a dataset with missing values
data = {
    'Age': [25, 30, np.nan, 40, 22],
    'Salary': [50000, np.nan, 60000, 65000, np.nan]
}
df = pd.DataFrame(data)
print("🔹 Original DataFrame with Missing Values:")
print(df)

# Step 2: Define a pipeline with SimpleImputer to fill missing values
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))  # You can also try 'median', 'most_frequent', etc.
])

# Step 3: Fit and transform the data using the pipeline
cleaned_data = pipeline.fit_transform(df)

# Step 4: Convert back to DataFrame for easier inspection
cleaned_df = pd.DataFrame(cleaned_data, columns=df.columns)

print("\n✅ Cleaned DataFrame after Imputation:")
print(cleaned_df)

🔹 Original DataFrame with Missing Values:
    Age   Salary
0  25.0  50000.0
1  30.0      NaN
2   NaN  60000.0
3  40.0  65000.0
4  22.0      NaN

✅ Cleaned DataFrame after Imputation:
     Age        Salary
0  25.00  50000.000000
1  30.00  58333.333333
2  29.25  60000.000000
3  40.00  65000.000000
4  22.00  58333.333333
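
Once fitted, the pipeline can fill gaps in other data without recomputing the means. The sketch below shows one way to inspect the learned fill values and reuse them. It assumes the same sample data as above; `train`, `new_rows`, `learned_means`, and `filled_new` are names invented for this illustration, and the values in `new_rows` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

train = pd.DataFrame({
    'Age': [25, 30, np.nan, 40, 22],
    'Salary': [50000, np.nan, 60000, 65000, np.nan],
})

pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean'))])
pipeline.fit(train)

# The fitted imputer stores one fill value per column (here, the column means).
learned_means = pipeline.named_steps['imputer'].statistics_
print(learned_means)

# Hypothetical new rows, invented for illustration only.
new_rows = pd.DataFrame({'Age': [np.nan, 35], 'Salary': [52000, np.nan]})

# transform() reuses the means learned from `train`; it does not refit.
filled_new = pd.DataFrame(pipeline.transform(new_rows), columns=new_rows.columns)
print(filled_new)
```

Calling `transform` rather than `fit_transform` on new data keeps the fill values tied to the original dataset, which is the usual choice when the same cleaning has to be applied consistently to later batches.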
