## Automating Data Cleaning in Python

### Task: Basic Pipeline with Scaling
1. Objective: Create a pipeline that scales numerical features in a dataset.
2. Steps:
    - Load a sample dataset with Pandas.
    - Define a pipeline using `Pipeline` from `sklearn.pipeline`.
    - Use `StandardScaler` to scale features.

In [None]:
# Write your code from here
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

# Step 1: Load a sample dataset (Iris dataset used here)
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Step 2: Define a pipeline with StandardScaler
pipeline = Pipeline([
    ('scaler', StandardScaler())
])

# Step 3: Apply the pipeline to scale the features
scaled_data = pipeline.fit_transform(df)

# Optional: Convert scaled data back to a DataFrame for inspection
scaled_df = pd.DataFrame(scaled_data, columns=iris.feature_names)
print(scaled_df.head())
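
A quick sanity check can confirm that the scaling worked. The sketch below rebuilds the same Iris pipeline and verifies that each scaled column has a mean close to 0 and a standard deviation close to 1. The names `col_means` and `col_stds` are illustrative and not part of the original task. The check relies on `StandardScaler` using the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the same dataset and pipeline as in the task above
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
pipeline = Pipeline([("scaler", StandardScaler())])
scaled = pipeline.fit_transform(df)

# Per-column statistics of the scaled array (illustrative names)
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)  # NumPy's default ddof=0 matches StandardScaler

print("Means close to 0:", np.allclose(col_means, 0.0))
print("Stds close to 1:", np.allclose(col_stds, 1.0))
```

If either check printed `False`, a likely cause would be a column with zero variance or a scaler fitted on different data than it transformed.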


### Task: Pipeline with Imputation
1. Objective: Automate data cleaning by handling missing values.
2. Steps:
    - Load a dataset with missing values.
    - Define a pipeline that uses `SimpleImputer` to fill missing values.

In [None]:
# Write your code from here
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Step 1: Create or load a dataset with missing values
data = {
    'Age': [25, 30, np.nan, 45, 50],
    'Income': [50000, np.nan, 60000, 65000, np.nan]
}
df = pd.DataFrame(data)

# Step 2: Define a pipeline with SimpleImputer (using mean strategy)
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

# Step 3: Apply the pipeline to fill missing values
imputed_data = pipeline.fit_transform(df)

# Optional: Convert back to DataFrame for inspection
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
print(imputed_df)
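
The two tasks can also be chained, since a `Pipeline` runs its steps in order. The sketch below is one possible combination, not a required part of either task: it imputes missing values with the column mean and then scales the result. The names `cleaning_pipeline`, `cleaned_df`, and `fill_values` are illustrative. The fitted imputer exposes the values it learned through its `statistics_` attribute, which is handy for checking what was filled in.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same toy dataset as in the imputation task
df = pd.DataFrame({
    "Age": [25, 30, np.nan, 45, 50],
    "Income": [50000, np.nan, 60000, 65000, np.nan],
})

# Steps run in order: fill missing values first, then scale
cleaning_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

cleaned = cleaning_pipeline.fit_transform(df)
cleaned_df = pd.DataFrame(cleaned, columns=df.columns)

# Column means the imputer learned during fit
fill_values = cleaning_pipeline.named_steps["imputer"].statistics_
print(dict(zip(df.columns, fill_values)))
print(cleaned_df.round(3))
```

Imputation comes first here because the scaled values should be computed from a complete column. With mean imputation, the filled-in cells end up at approximately 0 after scaling.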
