<p align="center">
  <img src="iris_kubeflow.png" />
</p>


# Iris Dataset Example
The [Iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html) contains measurements for three species of iris (Setosa, Versicolour, and Virginica), with four features per flower: sepal length, sepal width, petal length, and petal width, all in centimeters. The goal of our model is to use these features to predict the iris species. We will train an [XGBoost](https://xgboost.readthedocs.io/en/stable/) model. Run the cells below in order to train your XGBoost model and view its predictions.

In [2]:
# Install XGBoost
!pip install xgboost

Collecting xgboost
  Downloading xgboost-2.0.3-py3-none-manylinux2014_x86_64.whl.metadata (2.0 kB)
Downloading xgboost-2.0.3-py3-none-manylinux2014_x86_64.whl (297.1 MB)
Installing collected packages: xgboost
Successfully installed xgboost-2.0.3


In [3]:
# Import our packages
import xgboost as xgb
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import random 


In [4]:
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target


In [5]:
# Convert to DataFrame
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add the Target Variable
df_iris['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Print the DataFrame
print(df_iris.head())



   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

  species  
0  setosa  
1  setosa  
2  setosa  
3  setosa  
4  setosa  


In [6]:
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 80% training and 20% testing


In [110]:
# Train XGBoost model
# use_label_encoder is omitted: it is deprecated and no longer used in XGBoost 2.x
model = xgb.XGBClassifier(eval_metric='mlogloss')
model.fit(X_train, y_train)


In [111]:
# Predict the response for test dataset
y_pred = model.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 1.0


## Unpacking our Predictions 
Let's go ahead and take a look at a sample request and response to and from our model.

In [112]:
# Example: new flowers with these measurements
test_samples = [
    [5.1, 3.5, 1.4, 0.2],  # Setosa
    [6.0, 2.2, 4.0, 1.0],  # Versicolor
    [6.3, 3.3, 6.0, 2.5],  # Virginica
    [1.0, 1.2, 5.0, 3.1],  # Made-up flower with measurements outside the dataset's range
]

for sample in test_samples:
    sample_prediction = model.predict([sample])
    predicted_species = iris.target_names[sample_prediction[0]]
    print(f"Test sample: {sample} - Predicted species: {predicted_species}")

Test sample: [5.1, 3.5, 1.4, 0.2] - Predicted species: setosa
Test sample: [6.0, 2.2, 4.0, 1.0] - Predicted species: versicolor
Test sample: [6.3, 3.3, 6.0, 2.5] - Predicted species: virginica
Test sample: [1.0, 1.2, 5.0, 3.1] - Predicted species: virginica


We can also generate a random flower and see which species the model predicts!

In [116]:
# Draw each feature uniformly between its minimum and maximum value in the Iris dataset
random_flower = [
    random.uniform(4.3, 7.9),  # Sepal length
    random.uniform(2.0, 4.4),  # Sepal width
    random.uniform(1.0, 6.9),  # Petal length
    random.uniform(0.1, 2.5)   # Petal width
]

# Predict the species of the random flower
sample_prediction = model.predict([random_flower])
predicted_species = iris.target_names[sample_prediction[0]]
print(f"Random flower features: {random_flower} - Predicted species: {predicted_species}")

Random flower features: [4.4132081184715, 3.260956323282673, 6.739219351767208, 0.16823570237241894] - Predicted species: virginica


## A Few Notes on Accuracy
### Accuracy isn't Everything
While accuracy is a useful metric for many classification problems, relying solely on it might not be sufficient for all scenarios, especially for imbalanced datasets where the number of instances across classes is not evenly distributed. In such cases, other metrics like precision, recall, F1-score, or the confusion matrix provide a more nuanced view of the model's performance. These are all metrics we can visualize in our Experiments tab! 
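
To make the point about imbalanced data concrete, here is a minimal sketch using `sklearn.metrics`. The labels below are made up for illustration (95 negatives, 5 positives) and are not taken from the Iris data; the "classifier" is assumed to always predict the majority class.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical imbalanced problem: 95 negatives (0) and 5 positives (1)
y_true = [0] * 95 + [1] * 5
# A "classifier" that always predicts the majority class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 reports 0 instead of warning when a score is undefined
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-score:  {f1:.2f}")
print("Confusion matrix:")
print(cm)
```

Accuracy looks strong here even though the classifier never finds a single positive example, which is exactly what recall and the confusion matrix reveal.
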
### Dataset Characteristics
The Iris dataset is relatively small and simple, with clear boundaries between classes for the most part. Such datasets can often lead to high-performing models, which might not be the case with more complex or noisy datasets.
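
A quick way to check these characteristics yourself is sketched below. It uses petal length as one illustrative feature for looking at class separation; that choice is ours, and other features can be inspected the same way.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Overall size of the dataset: rows are flowers, columns are features
n_samples, n_features = iris.data.shape
print(f"Samples: {n_samples}, features: {n_features}")

# Number of flowers per species
class_counts = np.bincount(iris.target)

# Petal length is the third column (index 2) in iris.feature_names
petal_length = iris.data[:, 2]

for label, name in enumerate(iris.target_names):
    values = petal_length[iris.target == label]
    print(
        f"{name:<12} count={class_counts[label]}  "
        f"petal length {values.min():.1f} to {values.max():.1f} cm"
    )
```

The classes are evenly sized, and the setosa petal lengths do not overlap with the other two species, while versicolor and virginica overlap somewhat.
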
### Overfitting
Overfitting is kind of like when you cram all night for a test, memorizing every single answer by heart. Sure, you ace the test the next day because you've got all those specific answers down pat. But then, when you're thrown into the real world, trying to apply what you "learned"? Suddenly, you find yourself a bit lost. That's because you were super focused on those exact questions and answers, not really grasping the broader concepts or thinking about how to tackle problems you haven't seen before.

In machine learning, overfitting happens when your model becomes fantastic at predicting the data it was trained on, but stumbles when it encounters new, unseen data. The real aim isn't just to get your model to nail the test data (though it feels great when it does). Instead, it's about prepping your model to perform well out in the wild, on real-world data it hasn't seen before.
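
One common way to look for overfitting is to compare accuracy on the training data with accuracy on held-out data and with cross-validation scores. The sketch below is only an illustration: it swaps in a plain scikit-learn `DecisionTreeClassifier` as a stand-in for the XGBoost model above (our assumption, chosen to keep the example dependency-light), and reuses the same 80/20 split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stand-in model (assumption): an unpruned decision tree, which can
# memorize its training data
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Accuracy on data the model has seen vs. data it has not seen
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# 5-fold cross-validation gives several held-out estimates instead of one
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)

print(f"Training accuracy: {train_accuracy:.3f}")
print(f"Test accuracy:     {test_accuracy:.3f}")
print(f"Cross-validation:  {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```

A large gap between training accuracy and the held-out numbers is a warning sign. On a dataset as small and clean as Iris the gap is usually small, so treat this as a pattern to reuse on harder data.
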
