# Iris Flower Classification

## Project Overview 
This project treats Iris species identification as a supervised classification problem. The aim is to build a machine learning model that assigns each flower to one of three species (setosa, versicolor, or virginica) from four numerical features: sepal length, sepal width, petal length, and petal width. The target variable, `species`, is categorical with three mutually exclusive classes.

## Business Understanding

Problem Statement: The goal of this project is to develop a machine learning model that accurately classifies Iris flowers into their respective species (setosa, versicolor, and virginica) based on their sepal and petal measurements.

Relevance: Accurate classification of Iris flowers is valuable for botanists, horticulturists, and researchers studying plant species. The dataset is also well suited to teaching introductory machine learning and classification.

### Project Objectives:   

Primary Objective:

* Train a machine learning model to classify Iris flowers into three species based on sepal and petal measurements.

Secondary Objectives:

* Explore and analyze the Iris dataset to gain insights into the characteristics of the data.
* Preprocess the data to make it suitable for training a machine learning model.
* Select an appropriate machine learning algorithm for classification.
* Train and evaluate the model's performance using relevant metrics.
* Create a user-friendly interface for users to input sepal and petal measurements and receive species predictions.


## Data Understanding 

Data Source: The dataset used for this project is the Iris dataset from Kaggle: [Iris Flower Dataset](https://www.kaggle.com/datasets/arshid/iris-flower-dataset)

Data Description: The dataset consists of the following columns:  

``sepal_length``: Sepal length in centimeters (numerical)  
``sepal_width``: Sepal width in centimeters (numerical)  
``petal_length``: Petal length in centimeters (numerical)    
``petal_width``: Petal width in centimeters (numerical)  
``species``: The target variable, indicating the Iris species (categorical: 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica')  
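
The column description above can be turned into a quick sanity check before any analysis. The following is a minimal sketch, not part of the original notebook: the helper name `validate_iris_schema` and the in-memory `sample` frame are illustrative assumptions, and the expected labels are taken from the value counts reported in this notebook.

```python
import pandas as pd

# Expected schema, taken from the column description above.
EXPECTED_NUMERIC = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
EXPECTED_SPECIES = {"Iris-setosa", "Iris-versicolor", "Iris-virginica"}


def validate_iris_schema(df):
    """Return a list of schema problems (hypothetical helper; empty list means OK)."""
    problems = []
    for col in EXPECTED_NUMERIC:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"non-numeric column: {col}")
    if "species" not in df.columns:
        problems.append("missing column: species")
    else:
        unexpected = set(df["species"].unique()) - EXPECTED_SPECIES
        if unexpected:
            problems.append(f"unexpected species labels: {sorted(unexpected)}")
    return problems


# Small in-memory frame with the same columns as the CSV (first two rows of the data).
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9],
        "sepal_width": [3.5, 3.0],
        "petal_length": [1.4, 1.4],
        "petal_width": [0.2, 0.2],
        "species": ["Iris-setosa", "Iris-setosa"],
    }
)
problems = validate_iris_schema(sample)
missing_problems = validate_iris_schema(sample.drop(columns=["petal_width"]))
```

Running the same helper on the loaded DataFrame would flag renamed columns or unexpected label spellings early, before they surface as harder-to-read errors during modeling.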

In [15]:
import pandas as pd

In [16]:
class DataProcessor:
    def __init__(self, df):
        # Initialize with a DataFrame.
        self.df = df

    def get_info(self):
        # Get basic DataFrame info 
        return self.df.info()
    
    def get_summary_statistics(self):
        # Get summary statistics for numerical columns.
        return self.df.describe()
    
    def get_dtypes(self):
        # Get data types of columns.
        return self.df.dtypes
    
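    def get_missing_values(self):
        # Get the count of missing (null) values in each column.
        return self.df.isnull().sum()

    # Keep the original misspelled name as an alias so existing calls still work.
    get_mising_values = get_missing_values
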
    def get_value_counts(self):
        # Get value counts for all categorical columns.
        categorical_columns = self.df.select_dtypes(include=['object']).columns
        value_counts = {}
        for col in categorical_columns:
            value_counts[col] = self.df[col].value_counts()
        return value_counts
    
    def check_duplicates_in_column(self, column_name):
        # Check for duplicate values in a specific column.
        if column_name in self.df.columns:
            duplicates = self.df[column_name][self.df[column_name].duplicated(keep=False)]
            if not duplicates.empty:
                return duplicates
            else:
                return "No duplicates found in the specified column."
        else:
            return "Column not found in the DataFrame."
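
The duplicate check in the class relies on `Series.duplicated(keep=False)`, which marks every occurrence of a repeated value rather than only the later ones. The sketch below illustrates that behavior on a toy frame (the values are illustrative, not a claim about the full dataset) and contrasts it with a full-row check via `DataFrame.duplicated`, which is often more informative for measurement columns where repeated values are expected.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 5.1, 4.7],
        "species": ["Iris-setosa"] * 4,
    }
)

# keep=False flags all members of a duplicated group, not just the repeats.
mask = toy["sepal_length"].duplicated(keep=False)
repeated = toy.loc[mask, "sepal_length"]

# Full-row duplicates: rows that match on every column.
full_row_dupes = toy[toy.duplicated(keep=False)]

print(repeated)
print(full_row_dupes)
```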
        

In [17]:
# Load the dataset
df = pd.read_csv('IRIS.csv')
df.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [18]:
# Initialize the DataProcessor class
dp = DataProcessor(df)

In [19]:
# get summary of the dataframe
dp.get_info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [20]:
# get summary statistics
dp.get_summary_statistics()

|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


In [21]:
# check the datatypes
dp.get_dtypes()

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

In [22]:
# value count
dp.get_value_counts()

{'species': Iris-setosa        50
 Iris-versicolor    50
 Iris-virginica     50
 Name: species, dtype: int64}
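
The counts above show 50 samples per species. One way to express that balance as proportions is `value_counts(normalize=True)`. The sketch below rebuilds the label column in memory from those counts; the 0.05 tolerance used for `is_balanced` is an arbitrary illustrative threshold, not something defined in this project.

```python
import pandas as pd

# Rebuild the label column from the counts reported above (50 per species).
species = pd.Series(
    ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50,
    name="species",
)

# Share of each class instead of raw counts.
proportions = species.value_counts(normalize=True)

# Illustrative balance check: largest and smallest class shares differ by < 0.05.
is_balanced = (proportions.max() - proportions.min()) < 0.05

print(proportions)
print("balanced:", is_balanced)
```

With equal class sizes, plain accuracy is a reasonable first metric, and a stratified train/test split should preserve these proportions in both subsets.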

In [23]:
df.shape

(150, 5)

## Data Cleaning