***
# <font>Machine Learning Demo: Classifying Iris Species</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> OU OCI Delivery </font></p>

***

### This notebook demonstrates a machine learning lifecycle using the Iris dataset. The stages are as follows:
* Loading data
* Preprocessing
* Training a model
* Evaluating the model
* Making predictions

The Iris dataset includes three iris species with 50 samples each, along with four measurements for each flower (sepal and petal length and width, in cm). One species (Iris setosa) is linearly separable from the other two, but the other two are not linearly separable from each other.

The columns in this dataset are:
* Id
* SepalLengthCm
* SepalWidthCm
* PetalLengthCm
* PetalWidthCm
* Species

Source:
* https://archive.ics.uci.edu/dataset/53/iris
* https://www.kaggle.com/datasets/uciml/iris

We import NumPy, pandas, and several scikit-learn components:
* numpy and pandas are used for data manipulation and analysis.
* train_test_split splits our dataset into training and testing sets.
* StandardScaler standardizes the features so that each has zero mean and unit variance.
* LogisticRegression is the machine learning model we'll be using.
* accuracy_score evaluates the accuracy of our model.

## Import Libraries and Load Data

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

We load the Iris dataset from a CSV file named 'Iris.csv' using the pd.read_csv() function, then preview the first five rows with df.head(). Each row describes one flower and its species.

In [2]:
# Load the Iris dataset
df = pd.read_csv('Iris.csv')
df.head()

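|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |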


We now split the data into features (X) and target (y), and then split both into training and test sets.
* X contains the four measurement columns. We remove the 'Id' and 'Species' columns using the drop() method, since 'Id' is only a row identifier and 'Species' is the target.
* y contains the target (output) variable, which is the 'Species' column.
* We split the data into training and testing sets using train_test_split().
* test_size=0.2 means 20% of the data (30 of the 150 rows) is held out for testing, and random_state=42 fixes the shuffle so the same split is produced on every run.
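
To make the roles of test_size and random_state concrete, here is a minimal plain-Python sketch of a shuffled hold-out split. The helper name `split_indices` is hypothetical, and it uses Python's own random generator, so it illustrates the idea only and will not select the same rows as train_test_split().

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices with a seeded RNG and cut off a test portion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # same seed -> same shuffle
    n_test = math.ceil(n_samples * test_size)   # 150 rows -> 30 test rows
    return indices[n_test:], indices[:n_test]   # (train indices, test indices)

train_idx, test_idx = split_indices(150)
print(len(train_idx), len(test_idx))

# Calling it again with the same seed reproduces the identical split
print(split_indices(150) == (train_idx, test_idx))
```

Changing the seed gives a different but equally valid split, which is why a fixed random_state matters when results need to be compared between runs.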

## Preprocessing the Data

In [3]:
# Split the data into features (X) and target (y)
X = df.drop(['Id', 'Species'], axis=1)
y = df['Species']

In [4]:
X.head()

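|   | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---------------|--------------|---------------|--------------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |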


In [5]:
y.head()

0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: Species, dtype: object

In [6]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

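We now create a StandardScaler instance to standardize the features, rescaling each one to zero mean and unit variance.
* fit_transform() computes the mean and standard deviation of each feature from the training data and scales the training data with them.
* transform() applies the same training-set statistics to the testing data. The scaler is never fitted on the test set, so no information from it leaks into preprocessing.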
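
The arithmetic behind these two calls can be sketched in plain Python for a single feature. The helper names `fit_scaler` and `transform` and the sample values are illustrative assumptions, not part of the notebook. StandardScaler uses the population standard deviation, which corresponds to `pstdev` here.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(x - mean) / std for x in column]

# Illustrative petal lengths in cm, not the notebook's actual split
train_petal_length = [1.4, 1.3, 4.7, 4.5, 6.0, 5.1]
test_petal_length = [1.5, 5.9]

# The statistics come from the training values only
mean, std = fit_scaler(train_petal_length)
train_scaled = transform(train_petal_length, mean, std)

# The test values reuse the training mean and standard deviation
test_scaled = transform(test_petal_length, mean, std)

print("train mean and std after scaling:",
      round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
print("scaled test values:", [round(z, 3) for z in test_scaled])
```

After scaling, the training values have mean 0 and standard deviation 1 up to rounding, while the test values generally do not, because they were scaled with statistics they did not contribute to.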

In [7]:
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

We create a LogisticRegression model instance.
* fit() trains the model on the scaled training data (X_train_scaled) and target (y_train).

## Training the Model

In [8]:
# Train a machine learning model (Logistic Regression)
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
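
For intuition about what the fitted model does at prediction time, the sketch below scores each class with a linear function and converts the scores to probabilities with a softmax. It assumes the multinomial formulation, which recent scikit-learn releases use by default for a multiclass target; check this against your installed version. The functions `softmax` and `predict` and all coefficients are hypothetical and are not values learned by the notebook's model.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predict(x, weights, biases, classes):
    """Score each class as w . x + b, then pick the most probable class."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    probs = softmax(scores)
    best = max(range(len(classes)), key=probs.__getitem__)
    return classes[best], probs

# Hypothetical coefficients for two standardized features (not fitted values)
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
weights = [[-2.0, -2.0], [0.5, -0.5], [1.5, 2.5]]
biases = [0.0, 1.0, -1.0]

# A flower whose standardized petal measurements are well below average
label, probs = predict([-1.3, -1.2], weights, biases, classes)
print(label, [round(p, 3) for p in probs])
```

Training with fit() amounts to choosing the weights and biases so that the probabilities assigned to the true species of the training flowers are as high as possible, subject to regularization.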

## Evaluating the Model

We use the trained model to predict the species for the scaled testing data.
* accuracy_score() compares the predicted values (y_pred) with the actual values (y_test) and returns the fraction that match.
* We then print the accuracy of the model's predictions.

In [9]:
# Evaluate the model on the testing set
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 1.0
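
Accuracy is simply the share of test samples predicted correctly. The sketch below computes it by hand and also counts each (true, predicted) pair, which shows where any mistakes fall. The labels are made up for illustration and are not the notebook's test set; scikit-learn offers the same breakdown through confusion_matrix in sklearn.metrics.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

# Hypothetical labels for illustration only
y_true = ["setosa", "versicolor", "virginica", "virginica", "versicolor"]
y_pred = ["setosa", "versicolor", "versicolor", "virginica", "versicolor"]

print("Accuracy:", accuracy(y_true, y_pred))
for (true_label, pred_label), n in sorted(confusion_counts(y_true, y_pred).items()):
    print(f"true={true_label:<10} predicted={pred_label:<10} count={n}")
```

With a 20% hold-out of 150 rows, the test set has only 30 flowers, so a score of 1.0 is encouraging but is a rough estimate of how the model would do on new data.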


We create a new_data array with new samples for prediction. Each row holds the four measurements in the same column order as X.
* We scale the new data with transform(), using the same scaler that was fitted on the training data.
* The model predicts the species for the new data, and we print the predictions.

## Making Predictions

In [10]:
# Sample new data for prediction
new_data = np.array([[5.1, 3.5, 1.4, 0.2],
                     [6.3, 2.9, 5.6, 1.8],
                     [4.9, 3.0, 1.4, 0.2]])

In [11]:
# Standardize the new data
new_data_scaled = scaler.transform(new_data)



In [12]:
# Make predictions
predictions = model.predict(new_data_scaled)

In [13]:
# Display the predicted classes
print("Predictions:", predictions)

Predictions: ['Iris-setosa' 'Iris-virginica' 'Iris-setosa']
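
Because new samples must always go through the same scaler before prediction, one option is to bundle the scaler and the model into a single scikit-learn pipeline. The sketch below is one possible arrangement, not the notebook's method. As an assumption, it uses scikit-learn's bundled copy of Iris (load_iris) in place of Iris.csv so that it runs on its own, which means the class labels are short names such as 'setosa' rather than 'Iris-setosa'.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Iris copy stands in for Iris.csv
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# fit() fits the scaler on the training data, then trains the model on the scaled result
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

new_data = np.array([[5.1, 3.5, 1.4, 0.2],
                     [6.3, 2.9, 5.6, 1.8],
                     [4.9, 3.0, 1.4, 0.2]])

# Raw measurements go in; the pipeline applies the fitted scaler before predicting
predicted_names = iris.target_names[pipeline.predict(new_data)]
probabilities = pipeline.predict_proba(new_data)  # one row per sample, one column per class

print("Predictions:", predicted_names)
print(probabilities.round(3))
```

predict_proba() shows how confident the model is in each prediction, which a bare class label does not reveal.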


## References:

* https://www.kdnuggets.com/2020/04/data-transformation-standardization-normalization.html
* https://scikit-learn.org/stable/
* https://www.oracle.com/artificial-intelligence/
* https://numpy.org
* https://pandas.pydata.org