# Iris Dataset

The Iris dataset is a classic dataset in the field of machine learning and statistics, often used as a benchmark for classification algorithms. It consists of 150 iris flower samples from three different species: setosa, versicolor, and virginica.

The task is to classify iris flowers into their correct species based on four measured features:

- Sepal length: The length of the flower's sepal in centimeters.
- Sepal width: The width of the flower's sepal in centimeters.
- Petal length: The length of the flower's petal in centimeters.
- Petal width: The width of the flower's petal in centimeters.

The goal is to build a model that predicts the species of an iris flower from its four measured features. A good model has high accuracy on held-out data, meaning it correctly classifies a large percentage of the flowers it was not trained on.
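
Accuracy here is the fraction of samples whose predicted species matches the true one. The following is a minimal sketch, assuming integer-encoded labels; the `accuracy` helper and the example labels are illustrative and not part of the notebook.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(y_true == y_pred))


# Hypothetical labels: 0 = setosa, 1 = versicolor, 2 = virginica
example_true = [0, 1, 2, 2]
example_pred = [0, 1, 1, 2]
print(accuracy(example_true, example_pred))  # 3 of 4 correct
```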

### Imports

In [40]:
import numpy as np
import pandas as pd
# Treat +/-inf as missing values (note: this option is deprecated in pandas >= 2.1)
pd.options.mode.use_inf_as_na = True

import tensorflow as tf
from logzero import logger
# import tensorflow_datasets as tfds


### Data Loading

In [43]:
# config
data_config = {
    "data_path": r"data\iris\iris.data",
    "preprocess": False,
    "validate": True,
    "validation_split": 0.2,  # not used by DataPipeline.run
    "test_split": 0.3,  # fraction of rows held out for testing
}
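
The split fractions translate into row counts as sketched below. This rests on an assumption: the notebook's `run` method only applies `test_split`, so carving `validation_split` out of the remaining training rows is just one possible convention, and `split_sizes` is a hypothetical helper.

```python
def split_sizes(n_rows, test_split, validation_split):
    """Return (train, validation, test) row counts for a dataset of n_rows."""
    # Same arithmetic as a split index of int((1 - test_split) * n_rows).
    n_train_total = int((1 - test_split) * n_rows)
    n_test = n_rows - n_train_total
    # Assumption: the validation rows are taken from the training portion.
    n_val = int(validation_split * n_train_total)
    n_train = n_train_total - n_val
    return n_train, n_val, n_test


print(split_sizes(150, test_split=0.3, validation_split=0.2))
```

With the configuration values above and 150 rows, this gives 84 training, 21 validation, and 45 test rows.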

In [58]:
class DataPipeline:
    def __init__(self):
        logger.info('Data Pipeline set up')
        self.class_mapping = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

    def load(self, data_path: str):
        """Load the Iris CSV, shuffle the rows, and map species names to integer labels."""

        # dataset = tfds.load('iris', split='train', shuffle_files=True)
        dataset_df = pd.read_csv(data_path, names=[ "sepal_length", "sepal_width", "petal_length", "petal_width", "species"]).sample(frac=1)
        dataset_df["species"] = dataset_df["species"].map(self.class_mapping)
        logger.debug( f"Dataset length: {len(dataset_df)}" )
        logger.info( f"Loaded {len(dataset_df)} records!" )

        return dataset_df

    def preprocess(self, dataset):
        """Pre-process dataset (placeholder, currently a no-op)

        Args:
            dataset (pd.DataFrame): Dataset to pre-process.
        """
        pass

    def validate(self, dataset):
        """Validates dataset

        Args:
            dataset (pd.DataFrame): Dataset as returned by load().

        Returns:
            isvalid (bool): Is loaded dataset valid
            error (str): Error message in case dataset is not valid
        """
        if len(dataset.columns) != 5:
            return False, "Dataset should have the following columns - [ sepal_length, sepal_width, petal_length, petal_width, species]. One or more columns missing."
        
        null_mask = dataset.isnull()
        if null_mask.values.any():
            return False, f"Null values found at [row, column] positions - {np.argwhere(null_mask.values).tolist()}"
        
        return True, ""

    def run(self, config: dict):
        """Runner for the complete pipeline

        Args:
            config (dict): Pipeline settings with the keys shown in data_config.

        Raises:
            Exception: If validation is enabled and the dataset is not valid.
        """
        dataset = self.load(config["data_path"])
        
        if config.get("validate", False):
            isValid, error = self.validate(dataset)
            if not isValid:
                raise Exception(error)
            else:
                logger.info("Loaded dataset has a valid schema")
        
        split_index = int((1-config["test_split"])*len(dataset))
        train_dataset = dataset.iloc[:split_index]
        test_dataset = dataset.iloc[split_index:]
        logger.debug( f"Records in: Train split - {len(train_dataset)}, Test split - {len(test_dataset)}" )

        if config.get("preprocess", False):
            self.preprocess(train_dataset)
        
        return train_dataset, test_dataset
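
The null check in `validate` is meant to report where the missing values sit. The following is a minimal sketch of one way to turn the boolean mask into positions, assuming a pandas DataFrame as input; `null_positions` and the toy frame are illustrative only. Note that `np.argwhere` returns positional offsets rather than index labels, which matters because `load` shuffles the rows.

```python
import numpy as np
import pandas as pd


def null_positions(df):
    """Return [row, column] positional offsets of every missing value."""
    mask = df.isnull().to_numpy()
    return np.argwhere(mask).tolist()


# Toy frame with two missing entries (not real Iris rows)
toy = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, np.nan],
        "species": [0.0, np.nan, 2.0],
    }
)
print(null_positions(toy))
```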


In [60]:
data_pipeline = DataPipeline()
train_dataset, test_dataset = data_pipeline.run(data_config)

[I 241017 22:26:45 2945258857:3] Data Pipeline set up
[D 241017 22:26:45 2945258857:12] Dataset length: 150
[I 241017 22:26:45 2945258857:13] Loaded 150 records!
[I 241017 22:26:45 2945258857:60] Loaded dataset has a valid schema
[D 241017 22:26:45 2945258857:65] Records in: Train split - 105, Test split - 45


### Training

In [None]:
class TrainingPipeline:
    def __init__(self):
        pass
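
The `TrainingPipeline` class is still a stub, so the notebook does not yet say which model it will train. As a placeholder baseline only, the sketch below fits a softmax (multinomial logistic) regression with plain NumPy gradient descent. It is a swapped-in technique, not the notebook's TensorFlow model, and the function names, hyperparameters, and synthetic data are all assumptions made for illustration.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def train_softmax_regression(X, y, n_classes, lr=0.1, n_steps=200):
    """Fit weights and biases by full-batch gradient descent on cross-entropy."""
    n_samples, n_features = X.shape
    W = np.zeros((n_features, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(n_steps):
        probs = softmax(X @ W + b)
        grad_logits = (probs - Y) / n_samples
        W -= lr * (X.T @ grad_logits)
        b -= lr * grad_logits.sum(axis=0)
    return W, b


def predict(X, W, b):
    """Return the most probable class index for each row of X."""
    return np.argmax(X @ W + b, axis=1)


# Synthetic stand-in data (NOT the Iris measurements): three well-separated
# clusters in four dimensions, 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]])
y_demo = np.repeat(np.arange(3), 50)
X_demo = centers[y_demo] + rng.normal(scale=0.5, size=(150, 4))

W, b = train_softmax_regression(X_demo, y_demo, n_classes=3)
train_accuracy = np.mean(predict(X_demo, W, b) == y_demo)
print(f"Training accuracy on synthetic data: {train_accuracy:.2f}")
```

On the real data, the feature columns of `train_dataset` would take the place of `X_demo` and the encoded `species` column the place of `y_demo`.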

### Evaluation

In [None]:
class InferPipeline:
    def __init__(self):
        pass
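
The `InferPipeline` class is also a stub. One common way to evaluate a three-class classifier, beyond a single accuracy number, is a confusion matrix that shows which species get mistaken for which. The sketch below assumes integer labels in the notebook's encoding; `confusion_matrix` here is a hand-written illustrative helper, not an existing notebook function.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Count samples per (true class, predicted class) pair."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when an index pair repeats.
    np.add.at(matrix, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return matrix


# Hypothetical labels using the notebook's encoding:
# 0 = Iris-setosa, 1 = Iris-versicolor, 2 = Iris-virginica
cm = confusion_matrix([0, 1, 2, 2], [0, 1, 1, 2], n_classes=3)
print(cm)
print("accuracy:", np.trace(cm) / cm.sum())
```

Rows index the true class and columns the predicted class, so off-diagonal entries are misclassifications.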