## **TASK-1: DATA PIPELINE DEVELOPMENT**

This notebook shows how to develop a data pipeline in Python. The dataset used here is the Iris dataset.

Reference materials:
https://www.askpython.com/python/examples/pipelining-in-python
https://www.analyticsvidhya.com/blog/2020/01/build-your-first-machine-learning-pipeline-using-scikit-learn/


In [1]:
# Step 1: Import necessary libraries
# Import pandas for data manipulation and scikit-learn for the dataset, preprocessing, and model.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

In [4]:
# Step 2: Load the Iris dataset
# Load the dataset into a pandas DataFrame (features) and Series (target) for easier access and manipulation
iris = load_iris()
X = pd.DataFrame(data=iris.data, columns=iris.feature_names)
y = pd.Series(data=iris.target)

In [3]:
# Step 3: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
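The split above is purely random, so the three species may not be equally represented in the 30 test rows. As a hedged variation (not what the notebook does), `train_test_split` accepts a `stratify` argument that keeps class proportions the same in both parts. The names with the `_s` suffix are introduced here only for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = pd.DataFrame(data=iris.data, columns=iris.feature_names)
y = pd.Series(data=iris.target)

# Same 80/20 split as in Step 3, but stratified on the target (an optional variation)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Inspect the sizes and the per-class counts in the test set
print(X_train_s.shape, X_test_s.shape)
print(y_test_s.value_counts().sort_index())
```

With only 150 rows, stratifying is a cheap way to make the test score less dependent on which rows happen to be drawn.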

In [4]:
# Step 4: Define preprocessing steps
# The Iris dataset has no missing values, so the SimpleImputer is only a safeguard; the StandardScaler handles feature scaling
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with the mean
            ('scaler', StandardScaler())  # Scale features
        ]), X.columns)
    ]
)
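To see what the imputer and scaler do before a model is attached, the two steps can be run on their own. The sketch below is an illustration only. It blanks out one value to simulate missing data (a hypothetical step, since Iris has none), then checks that the output has no gaps and that each column is standardized. `X_missing`, `numeric_steps`, and `X_prepared` are names introduced for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Hypothetical: remove one value to simulate a missing measurement
X_missing = X.copy()
X_missing.iloc[0, 0] = np.nan

# The same two steps used inside the ColumnTransformer above
numeric_steps = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
X_prepared = numeric_steps.fit_transform(X_missing)

# The gap is filled, and each column has mean ~0 and standard deviation ~1
print(np.isnan(X_prepared).any())
print(X_prepared.mean(axis=0).round(6))
print(X_prepared.std(axis=0).round(6))
```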

In [5]:
# Step 5: Create a pipeline that combines preprocessing with a Random Forest Classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Preprocessing step
    ('classifier', RandomForestClassifier(random_state=42))  # Model training step
])

In [6]:
# Step 6: Fit the model on the training data
pipeline.fit(X_train, y_train)

In [7]:
# Step 7: Evaluate the model on the test data
accuracy = pipeline.score(X_test, y_test)
print(f'Model accuracy: {accuracy:.2f}')

Model accuracy: 1.00
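A score of 1.00 on a 30-row test set is a single, fairly noisy estimate. As a hedged complement (not part of the original notebook), k-fold cross-validation with `cross_val_score` gives an average over several splits. The sketch uses a simplified pipeline (scaler plus classifier, without the imputer) and the name `cv_pipeline`, both of which are assumptions made for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Simplified stand-in for the pipeline above: scaling followed by the classifier
cv_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# 5-fold cross-validation; the scaler is refit on each training fold
scores = cross_val_score(cv_pipeline, X, y, cv=5)
print(f'Fold accuracies: {scores.round(2)}')
print(f'Mean accuracy: {scores.mean():.2f}')
```

Passing the whole pipeline to `cross_val_score` keeps preprocessing inside each fold, so no statistics from the held-out rows leak into the scaler.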


ETL using Functions


In [2]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, LabelEncoder

def extract_data():
    """
    Load the Iris dataset and convert it into a pandas DataFrame.
    """
    iris = load_iris(as_frame=True)
    data = iris.frame  # includes features and target
    print("Data extracted successfully.")
    return data

def transform_data(data):
    """
    Preprocess the data:
    - Encode target labels
    - Standardize feature columns
    """
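    # Work on a copy so the caller's DataFrame is left unchanged
    data = data.copy()

    # Label encoding (the Iris target is already 0, 1, 2, so the values stay the same)
    label_encoder = LabelEncoder()
    data['target'] = label_encoder.fit_transform(data['target'])

    # Feature scaling: standardize every column except the target
    feature_cols = data.columns[:-1]
    scaler = StandardScaler()
    data[feature_cols] = scaler.fit_transform(data[feature_cols])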

    print("Data transformed successfully.")
    return data

def load_data(data, output_path='processed_iris.csv'):
    """
    Save the transformed data to a CSV file.
    """
    data.to_csv(output_path, index=False)
    print(f"Data loaded to {output_path}")

def run_pipeline():
    """
    Full ETL pipeline execution.
    """
    print("Starting ETL pipeline...")
    data = extract_data()
    data_transformed = transform_data(data)
    load_data(data_transformed)
    print("ETL pipeline completed successfully.")

if __name__ == "__main__":
    run_pipeline()


Starting ETL pipeline...
Data extracted successfully.
Data transformed successfully.
Data loaded to processed_iris.csv
ETL pipeline completed successfully.
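The log messages confirm that each stage ran, but not that the saved file is correct. One possible check, sketched below under stated assumptions, is to read the CSV back and inspect it. So that the sketch runs on its own, it rebuilds a processed file in a temporary directory instead of reading the `processed_iris.csv` produced above. The names `frame`, `path`, and `reloaded` are introduced for this example.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Rebuild a processed frame the same way transform_data does (scaling only)
frame = load_iris(as_frame=True).frame
feature_cols = frame.columns[:-1]
frame[feature_cols] = StandardScaler().fit_transform(frame[feature_cols])

# Write to a temporary location and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'processed_iris.csv')
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Basic sanity checks on the reloaded data
print(reloaded.shape)
print(reloaded[feature_cols].mean().round(6))
print(sorted(reloaded['target'].unique()))
```

To check the real output, `pd.read_csv('processed_iris.csv')` could replace the temporary-file part.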
