# Project: Machine Learning Classification on the Iris Data Set
This project classifies Iris flowers into three species (Setosa, Versicolor, and Virginica) using supervised machine learning. The features are sepal length, sepal width, petal length, and petal width.

Key objectives are exploratory data analysis, building classification models (Logistic Regression, Decision Tree, Random Forest, SVM, and KNN), and evaluating them with metrics such as accuracy, precision, recall, and F1-score. The project walks through an end-to-end machine learning pipeline: feature selection, model training, and validation.

**1. Import Required Libraries:**
Identify and import libraries such as pandas, numpy, matplotlib, seaborn, and scikit-learn for data manipulation, visualization, and modeling.

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style
plt.style.use('classic')
import os

import seaborn as sns
from sklearn.linear_model import LogisticRegression  # classifier used in step 6
from sklearn import metrics

**2. Load the Dataset:**

Load the Iris dataset, either from a built-in library (e.g., scikit-learn) or an external file (e.g., CSV). This notebook reads it from a local iris.csv file.

In [29]:
df_0 = pd.read_csv('iris.csv')  # initial import
df_1 = df_0.drop('Id', axis=1).copy()  # working copy; 'Id' is a row identifier, not a feature, so drop it before modeling
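
The step description also mentions loading the data from a built-in library. A minimal sketch of that route, assuming scikit-learn's bundled copy of the dataset is acceptable: the column names are renamed by hand to match the CSV-style names used in this notebook, and `df_builtin` is a hypothetical name. Note that the bundled species labels are `setosa`, `versicolor`, and `virginica`, which may differ from the label strings in a given CSV file.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled dataset as a pandas DataFrame (features plus a 'target' column)
iris = load_iris(as_frame=True)
df_builtin = iris.frame.copy()

# Rename columns to the CSV-style names used in this notebook (assumed mapping)
df_builtin.columns = [
    'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'target'
]

# Replace the integer target (0, 1, 2) with the species name
code_to_name = dict(enumerate(iris.target_names))
df_builtin['Species'] = df_builtin['target'].map(code_to_name)
df_builtin = df_builtin.drop(columns='target')

print(df_builtin.shape)
print(df_builtin.head())
```

This variant has no `Id` column, so the drop performed on the CSV version is not needed.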

**3. Understand the Dataset:**

Display the first few rows of the dataset.
Check for null values and basic statistics (mean, median, standard deviation, etc.).

In [None]:
df_1.head()

In [None]:
df_1.isnull().sum()  # count missing values per column

In [None]:
df_1.describe().T

In [None]:
df_1.info()

In [None]:
species = df_1.Species.unique()
species

**4. Exploratory Data Analysis (EDA)**

Analyze the distribution of each feature (sepal length, sepal width, petal length, petal width).
Visualize relationships between features using scatter plots, pair plots, and box plots.
Check the class distribution for balance among the species.
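
The class-balance check listed above can be sketched as follows. To stay self-contained, this example builds the species column from scikit-learn's bundled copy of the data, which is an assumption; `species_demo`, `class_counts`, and `class_share` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Build a species column from the bundled dataset (stand-in for df_1['Species'])
iris = load_iris()
species_demo = pd.Series(iris.target_names[iris.target], name='Species')

# Absolute and relative frequency of each class
class_counts = species_demo.value_counts()
class_share = species_demo.value_counts(normalize=True)

print(class_counts)
print(class_share.round(3))
```

With the notebook's own DataFrame, the equivalent call is `df_1['Species'].value_counts()`. Equal counts per species mean plain accuracy is a reasonable headline metric.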

In [None]:
# Encode the 'Species' column into numeric values
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df_1['SpeciesEncoded'] = label_encoder.fit_transform(df_1['Species'])

# Create the scatter plot
plt.figure(figsize=(10, 6))
scatter = plt.scatter(
    df_1['SepalLengthCm'], 
    df_1['SepalWidthCm'], 
    c=df_1['SpeciesEncoded'],  # Use encoded values for coloring
    cmap='viridis',           # Color map for differentiation
    s=100,                    # Size of the points
    alpha=0.8                 # Transparency
)
plt.colorbar(scatter, label='Species (Encoded)')
plt.title('Scatter Plot of Sepal Dimensions by Species', fontsize=14)
plt.xlabel('Sepal Length (cm)', fontsize=12)
plt.ylabel('Sepal Width (cm)', fontsize=12)
plt.grid(alpha=0.5)
plt.show()

In [None]:
# Create the scatter plot
plt.figure(figsize=(10, 6))
scatter = plt.scatter(
    df_1['PetalLengthCm'], 
    df_1['PetalWidthCm'], 
    c=df_1['SpeciesEncoded'],  # Use encoded values for coloring
    cmap='viridis',           # Color map for differentiation
    s=100,                    # Size of the points
    alpha=0.8                 # Transparency
)
plt.colorbar(scatter, label='Species (Encoded)')
plt.title('Scatter Plot of Petal Dimensions by Species', fontsize=14)
plt.xlabel('Petal Length (cm)', fontsize=12)
plt.ylabel('Petal Width (cm)', fontsize=12)
plt.grid(alpha=0.5)
plt.show()

**Pair Plot**

In [None]:
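feature_cols = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
# Restrict the plot to the four measurements so the encoded label is not drawn as a feature
sns.pairplot(df_1, vars=feature_cols, hue='Species', palette='viridis', diag_kind='kde', markers=["o", "s", "D"])
plt.suptitle("Pair Plot of Iris Data", y=1.02)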
plt.show()

In [None]:
# Look for outliers in the four measurement columns
plt.figure(figsize=(10, 8))
df_1[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].boxplot()
plt.title("Boxplot of Features", fontsize=14)
plt.show()

In [None]:
plt.figure(figsize=(10, 10))
# Compute correlation matrix for the selected columns
correlation_matrix = df_1[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'SpeciesEncoded']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', square=True)
plt.title("Correlation Heatmap", fontsize=14)
plt.show()

This heatmap shows the pairwise correlation coefficients between the numerical features and the encoded target. The observations below guide feature selection for the models.

**Insights from the Heatmap:**

Strong Correlations:

- PetalLengthCm and PetalWidthCm have a very high positive correlation (~0.96), so the two features are largely redundant and including both adds little extra information.
- SpeciesEncoded correlates strongly with PetalLengthCm (~0.95) and PetalWidthCm (~0.96), which indicates that these features are highly relevant for predicting the species. Because SpeciesEncoded is an integer coding of a categorical label, treat these values as a rough guide rather than a formal measure of association.

Weak Correlations:

- SepalWidthCm has weak or negative correlations with other features (e.g., ~-0.42 with PetalLengthCm). This suggests it may be less informative for predicting the species than the other features.

Feature Selection:

- For linear models such as logistic regression, removing one of a pair of highly correlated features can reduce multicollinearity. For instance, keep either PetalLengthCm or PetalWidthCm, but not both.
- Features like SepalWidthCm may require further evaluation to determine their usefulness in the model.
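
One way to run the "further evaluation" mentioned above is to compare cross-validated accuracy with and without the dropped features. This is a sketch under stated assumptions, not the notebook's method: it uses scikit-learn's bundled copy of the data (columns in the order sepal length, sepal width, petal length, petal width), a logistic regression with `max_iter=1000`, and 5-fold cross-validation. The names `feature_sets` and `cv_means` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Column order: sepal length, sepal width, petal length, petal width
X_all, y_all = load_iris(return_X_y=True)

# Candidate feature subsets, given as column indices
feature_sets = {
    "all four features": [0, 1, 2, 3],
    "sepal length + petal width": [0, 3],
}

cv_means = {}
for label, cols in feature_sets.items():
    # 5-fold cross-validated accuracy for this subset
    scores = cross_val_score(
        LogisticRegression(max_iter=1000), X_all[:, cols], y_all, cv=5
    )
    cv_means[label] = scores.mean()
    print(f"{label}: mean CV accuracy = {scores.mean():.3f}")
```

If the reduced subset scores close to the full set, dropping the correlated columns costs little; a clear gap would argue for keeping them.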

In [40]:
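# Create a new DataFrame without the redundant features ('SepalWidthCm', 'PetalLengthCm')
# and without the text label 'Species' (the numeric 'SpeciesEncoded' column is the target)
df_2 = df_1.drop(['SepalWidthCm', 'PetalLengthCm', 'Species'], axis=1).copy()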

In [None]:
df_2.info()

**5. Split the Dataset**

Divide the dataset into training and testing sets, typically in an 80:20 or 70:30 ratio. This notebook uses a 70:30 split (test_size=0.3).

In [42]:
X = df_2.drop('SpeciesEncoded', axis=1)  # feature matrix: every column except the target

# Target vector: the encoded species label
y = df_2['SpeciesEncoded']

In [43]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=1)
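
The split above samples rows at random, so the three species may not be equally represented in the test set. A hedged alternative, assuming equal class shares are wanted, is to pass `stratify` to `train_test_split`. The sketch uses scikit-learn's bundled copy of the data, and the `_demo`, `_tr`, and `_te` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the class proportions the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

# Count samples per class in each partition
print("train:", np.bincount(y_tr))
print("test: ", np.bincount(y_te))
```

In the notebook itself, the same effect would come from adding `stratify=y` to the existing call.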

**6. Select Machine Learning Algorithms**

Choose classification algorithms such as Logistic Regression, Decision Trees, Random Forests, Support Vector Machines, or K-Nearest Neighbors.

**Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression
model_LR = LogisticRegression()
model_LR.fit(X_train, y_train)
model_LR.score(X_train, y_train)  # accuracy on the training set, not the test set

**Decision Tree**

In [None]:
from sklearn.tree import DecisionTreeClassifier
model_DT = DecisionTreeClassifier()
model_DT.fit(X_train, y_train)
model_DT.score(X_train, y_train)  # accuracy on the training set, not the test set


**Random Forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier
model_RF = RandomForestClassifier()
model_RF.fit(X_train, y_train)
model_RF.score(X_train, y_train)  # accuracy on the training set, not the test set

**Support Vector Machine (SVM)**

In [None]:
from sklearn.svm import SVC
model_SVM = SVC()
model_SVM.fit(X_train, y_train)
model_SVM.score(X_train, y_train)  # accuracy on the training set, not the test set

**K-Nearest Neighbors (KNN)**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model_KNN = KNeighborsClassifier(n_neighbors=5)  # You can tune `n_neighbors`
model_KNN.fit(X_train, y_train)
model_KNN.score(X_train, y_train)  # accuracy on the training set, not the test set

**7. Evaluate the Models**

Test the trained models using the testing dataset.
Use metrics such as accuracy, precision, recall, F1-score, and confusion matrix to evaluate performance.

In [None]:
from sklearn.metrics import classification_report, accuracy_score

# Fitted models from step 6, keyed by display name
trained_models = {
    "LogisticRegression": model_LR,
    "Decision Tree": model_DT,
    "Random Forest": model_RF,
    "Support Vector Machine (SVM)": model_SVM,
    "K-Nearest Neighbors (KNN)": model_KNN,
}

for name, fitted in trained_models.items():
    # Predict on the held-out test set
    y_pred = fitted.predict(X_test)

    # Report per-class precision, recall, F1-score, and overall accuracy
    print(name)
    print(classification_report(y_test, y_pred))
    print(f"{name} Accuracy:", accuracy_score(y_test, y_pred))




**10. Compare Model Performance**

Compare the performance of different models to identify the best one for the task.

In [None]:
# Example: Trying multiple models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(),
    "SVM": SVC()
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name} Accuracy: {accuracy_score(y_test, y_pred)}")


**11. Visualize Results**

Create plots to show model performance (e.g., accuracy scores, ROC curves).
Visualize the confusion matrix to understand classification performance.

**1. Accuracy Scores Visualization**

In [None]:
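from sklearn.metrics import accuracy_score

# Compute test accuracy from the fitted models instead of hard-coding the values
fitted_models = {'Logistic Regression': model_LR, 'Random Forest': model_RF, 'SVM': model_SVM}
model_names = list(fitted_models)
accuracy_scores = [accuracy_score(y_test, m.predict(X_test)) for m in fitted_models.values()]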

# Plot the accuracy scores
plt.figure(figsize=(8, 5))
plt.bar(model_names, accuracy_scores, color=['blue', 'green', 'orange'])
plt.title('Model Accuracy Comparison', fontsize=14)
plt.xlabel('Models', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.ylim(0.8, 1.0)
plt.show()

**2. Confusion Matrix**

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Generate predictions
y_pred = model_RF.predict(X_test)

# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix, labelling the axes with the classes of the model that was evaluated
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model_RF.classes_)
disp.plot(cmap='Blues', values_format='d', xticks_rotation=45)
plt.title('Confusion Matrix', fontsize=14)
plt.show()
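
The ROC curves mentioned in this step are not computed in the notebook. A minimal sketch of a one-vs-rest ROC analysis for the three classes follows, under several assumptions: scikit-learn's bundled copy of the data, the same two features (sepal length and petal width), the same 70:30 split, and a logistic regression with `max_iter=1000` as the probabilistic model. All variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X_roc, y_roc = load_iris(return_X_y=True)
X_roc = X_roc[:, [0, 3]]  # sepal length, petal width

X_tr_roc, X_te_roc, y_tr_roc, y_te_roc = train_test_split(
    X_roc, y_roc, test_size=0.3, random_state=1
)

clf_roc = LogisticRegression(max_iter=1000).fit(X_tr_roc, y_tr_roc)
proba = clf_roc.predict_proba(X_te_roc)  # shape: (n_samples, 3)

# Macro-averaged one-vs-rest AUC across the three classes
macro_auc = roc_auc_score(y_te_roc, proba, multi_class='ovr')
print(f"Macro one-vs-rest AUC: {macro_auc:.3f}")

# Per-class ROC points: treat each class in turn as the positive class
y_bin = label_binarize(y_te_roc, classes=[0, 1, 2])
roc_points = {}
for k in range(3):
    fpr, tpr, _ = roc_curve(y_bin[:, k], proba[:, k])
    roc_points[k] = (fpr, tpr)
```

Each `(fpr, tpr)` pair in `roc_points` can be passed to `plt.plot` to draw one curve per species.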



**12. Hyperparameter Tuning**

Optimize the best-performing model using techniques like Grid Search or Random Search for improved accuracy.
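
A minimal grid-search sketch for this step, assuming KNN is the model being tuned (it is the one used for the final predictions). The parameter grid, the 5-fold setting, and the `_gs` variable names are illustrative assumptions, and the data comes from scikit-learn's bundled copy restricted to sepal length and petal width.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_gs, y_gs = load_iris(return_X_y=True)
X_gs = X_gs[:, [0, 3]]  # sepal length, petal width

X_tr_gs, X_te_gs, y_tr_gs, y_te_gs = train_test_split(
    X_gs, y_gs, test_size=0.3, random_state=1
)

# Candidate hyperparameter values (illustrative, not tuned for this dataset)
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"],
}

# Exhaustive search with 5-fold cross-validation on the training set only
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_tr_gs, y_tr_gs)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy of refit model: {search.score(X_te_gs, y_te_gs):.3f}")
```

The search only sees the training partition, so the final test score remains an unbiased check. `RandomizedSearchCV` follows the same pattern when the grid is too large to enumerate.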

**13. Make Predictions**

Use the final model to make predictions on new, unseen data.

In [None]:
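# Illustrative measurements; column names and order must match the training features
new_data = pd.DataFrame({
    'SepalLengthCm': [5.9, 6.0],
    'PetalWidthCm': [1.5, 2.3]  # petal widths in this dataset range from about 0.1 to 2.5 cm
})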

# Make predictions using the final KNN model
predictions = model_KNN.predict(new_data)

# If you want to see the probabilities for each class (if applicable)
probabilities = model_KNN.predict_proba(new_data)

# Display the results
print("Predicted Classes:", predictions)
print("Prediction Probabilities:", probabilities)
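
The predicted classes are the integer codes produced by the label encoder, not species names. A short sketch of mapping codes back to names with `LabelEncoder.inverse_transform` follows; the label strings and the example codes are assumptions for illustration, since the actual strings depend on the CSV file.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit an encoder on assumed species labels; LabelEncoder sorts them alphabetically
encoder_demo = LabelEncoder()
encoder_demo.fit(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])

# Hypothetical model output: integer class codes
predicted_codes = np.array([1, 2])

# Map the codes back to the original label strings
predicted_names = encoder_demo.inverse_transform(predicted_codes)
print(predicted_names)
```

In the notebook, the fitted `label_encoder` from the EDA step plays the role of `encoder_demo`, so `label_encoder.inverse_transform(predictions)` returns the species names.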

**14. Save the Model**

Save the trained model for future use with joblib.

In [None]:
import joblib

# Save the trained model to a file
joblib.dump(model_KNN, 'iris_knn_model.pkl')
print("Model saved as 'iris_knn_model.pkl'")
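
A saved model is only useful if it can be restored. The sketch below round-trips a KNN model through `joblib.dump` and `joblib.load` and checks that the restored copy predicts identically. It trains a stand-in model on scikit-learn's bundled data and writes to a temporary directory; the file name `iris_knn_demo.pkl` and all variable names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_io, y_io = load_iris(return_X_y=True)
X_io = X_io[:, [0, 3]]  # sepal length, petal width

# Stand-in for the notebook's trained KNN model
knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_io, y_io)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "iris_knn_demo.pkl")  # hypothetical file name
    joblib.dump(knn_demo, model_path)    # serialize the fitted estimator
    restored = joblib.load(model_path)   # read it back into memory

# The restored model should give exactly the same predictions
same_predictions = np.array_equal(knn_demo.predict(X_io), restored.predict(X_io))
print("Predictions identical after reload:", same_predictions)
```

For the file written by the notebook, the equivalent call is `joblib.load('iris_knn_model.pkl')`. Pickled models should be loaded with the same scikit-learn version that saved them, and only from trusted sources.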