
# **Instructions**
---
This notebook is submitted in completion of the Final Project for CSCI 111 - ST2. It contains the source code for a machine learning project in which two models, Logistic Regression and Random Forest Classification, attempt to predict **Banana Quality**.

To predict a banana sample's quality, the attribute **"quality_category"** is used as the target attribute, or dependent variable.

This notebook also includes a Random Forest Regressor model further below, which attempts to predict "quality_score" rather than "quality_category."
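Because "quality_category" holds text labels, it has to be converted to numbers before a model can train on it. The following is a minimal sketch of that conversion, assuming a small hypothetical DataFrame in place of the real dataset; the label "Rejected" is made up purely to show what happens to a value that is missing from the mapping.

```python
import pandas as pd

# Hypothetical sample; the notebook itself reads banana_quality_dataset.csv.
# "Rejected" is an invented label used only to illustrate an unmapped value.
sample = pd.DataFrame(
    {"quality_category": ["Unripe", "Good", "Premium", "Processing", "Rejected"]}
)

# Same ordinal mapping the notebook applies to the target column.
order = {"Unripe": 0, "Processing": 1, "Good": 2, "Premium": 3}
sample["target"] = sample["quality_category"].map(order)

# Labels missing from the mapping become NaN, so list them before training.
unmapped = sample.loc[sample["target"].isna(), "quality_category"].tolist()
print(unmapped)
```

If the list is not empty, the affected rows would need to be dropped or the mapping extended before fitting a model.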


```
To adjust which columns are included in the dataframe, look to the "Data Preparation"
section and utilize the data.drop() function.

To change the test sizes to be used in model testing, see the train_test_split()
function at line 30, and change the argument "test_size" from 0.2 to whichever test
size you prefer.
```
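The two adjustments described above can be sketched as follows. This is an illustration only: the DataFrame is a hypothetical stand-in with made-up values, and the column names "sugar" and "firmness" are assumptions, not columns confirmed to exist in the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the banana dataset: 20 rows of made-up values.
data = pd.DataFrame({
    "region": ["A", "B"] * 10,
    "sugar": [float(i) for i in range(20)],
    "firmness": [float(20 - i) for i in range(20)],
    "quality_category": [0, 1] * 10,
})

# Add or remove names in this list to change which columns are excluded.
features = data.drop(["region", "quality_category"], axis=1)
target = data["quality_category"]

# test_size=0.25 holds out 25% of the rows instead of the notebook's 20%.
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=42
)
print(list(features.columns), len(x_train), len(x_test))
```

Keeping `random_state` fixed makes the split reproducible, so score changes between runs can be attributed to the new test size or column selection.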

The sources used in the fulfillment of this requirement can be found at the bottom of this notebook.

# **ML Models: Logistic Regression & Random Forest Classification**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

data = pd.read_csv("banana_quality_dataset.csv")

# ------------------------ Data Preparation ------------------------
# Upon including "soil_nitrogen_ppm" in the model training, there remains no
# difference in the classification scores during model evaluation.
data.drop(["region", "harvest_date", "quality_score"], axis=1, inplace=True)
data["altitude_m"] = pd.to_numeric(data["altitude_m"], errors='coerce')
data["rainfall_mm"] = pd.to_numeric(data["rainfall_mm"], errors='coerce')
data.dropna(inplace=True)

data["quality_category"] = data["quality_category"].map({"Unripe": 0, "Processing": 1, "Good": 2, "Premium": 3})
data["ripeness_category"] = data["ripeness_category"].map({"Green": 0, "Turning": 1, "Ripe": 2, "Overripe": 3})

# Label-encode variety
le = LabelEncoder()
data["variety"] = le.fit_transform(data["variety"])

# ------------------------ Model Training ------------------------
x = data.drop("quality_category", axis=1)
qc = data["quality_category"]

# Split and Standardize
x_train, x_test, qc_train, qc_test = train_test_split(x, qc, test_size=0.2, random_state=42)

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# Logistic Regression Model
lr_model = LogisticRegression()
lr_model.fit(x_train, qc_train)

# LR Model Evaluation
qc_prediction_lr = lr_model.predict(x_test)
print("[1] Logistic Regression")
print("confusion matrix:\n", confusion_matrix(qc_test, qc_prediction_lr),"\n")
print("classification report\n", classification_report(qc_test, qc_prediction_lr))

# Random Forest Model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(x_train, qc_train)

# RF Model Evaluation
qc_prediction_rf = rf_model.predict(x_test)
print("\n[2] Random Forest")
print("confusion Matrix:\n", confusion_matrix(qc_test, qc_prediction_rf))
print("classification Report:\n", classification_report(qc_test, qc_prediction_rf))

[1] Logistic Regression
confusion matrix:
 [[  4   4   0   0]
 [  0 100   0   0]
 [  0   8  81   0]
 [  0   0   1   2]] 

classification report
               precision    recall  f1-score   support

           0       1.00      0.50      0.67         8
           1       0.89      1.00      0.94       100
           2       0.99      0.91      0.95        89
           3       1.00      0.67      0.80         3

    accuracy                           0.94       200
   macro avg       0.97      0.77      0.84       200
weighted avg       0.94      0.94      0.93       200


[2] Random Forest
confusion Matrix:
 [[ 2  6  0  0]
 [ 0 99  1  0]
 [ 0 16 73  0]
 [ 0  0  1  2]]
classification Report:
               precision    recall  f1-score   support

           0       1.00      0.25      0.40         8
           1       0.82      0.99      0.90       100
           2       0.97      0.82      0.89        89
           3       1.00      0.67      0.80         3

    accuracy             
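The figures in a classification report follow directly from the confusion matrix. As a rough sketch, the Logistic Regression matrix printed above can be turned into accuracy, recall, and precision by hand, assuming scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Logistic Regression confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm = [
    [4, 4, 0, 0],
    [0, 100, 0, 0],
    [0, 8, 81, 0],
    [0, 0, 1, 2],
]
classes = range(len(cm))

# Accuracy: correct predictions (the diagonal) over all samples.
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in classes)
accuracy = correct / total

# Recall: correct predictions for a class divided by its true count (row sum).
recall = [cm[i][i] / sum(cm[i]) for i in classes]

# Precision: correct predictions divided by all predictions of that class (column sum).
precision = [cm[i][i] / sum(row[i] for row in cm) for i in classes]

print(round(accuracy, 3))
print([round(r, 2) for r in recall])
print([round(p, 2) for p in precision])
```

Applying the same arithmetic to the Random Forest matrix gives 176 correct predictions out of 200, an accuracy of 0.88.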

# **Random Forest Regression**

This section uses Random Forest Regression to predict **"quality_score"** rather than **"quality_category."**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

data = pd.read_csv("banana_quality_dataset.csv")

# Data Preparation
data.drop(["region", "harvest_date", "quality_category"], axis=1, inplace=True)
data["altitude_m"] = pd.to_numeric(data["altitude_m"], errors='coerce')
data["rainfall_mm"] = pd.to_numeric(data["rainfall_mm"], errors='coerce')
data.dropna(inplace=True)

data["ripeness_category"] = data["ripeness_category"].map({"Green": 0, "Turning": 1, "Ripe": 2, "Overripe": 3})

# Encode variety using LabelEncoder
le = LabelEncoder()
data["variety"] = le.fit_transform(data["variety"])

# Features (X) and target attribute (y)
X = data.drop(columns=["quality_score"])
y = data["quality_score"]

# Split and Standardize
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test)
print("[3] Random Forest Regressor")
print("MSE :", mean_squared_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R² Score:", r2_score(y_test, y_pred))

[3] Random Forest Regressor
MSE : 0.013102816650000015
RMSE: 0.11446753535391603
R² Score: 0.9564760693424342
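To make the three reported metrics concrete, here is a small sketch that computes MSE, RMSE, and R² by hand. The true and predicted values are hypothetical and are not taken from the dataset; the formulas are the standard definitions that `mean_squared_error` and `r2_score` implement.

```python
import math

# Hypothetical true and predicted scores; not taken from the dataset.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
n = len(y_true)

# MSE: mean of the squared prediction errors.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
mse = ss_res / n

# RMSE: square root of MSE, expressed in the same units as the target.
rmse = math.sqrt(mse)

# R²: 1 minus the ratio of residual variation to total variation around the mean.
mean_true = sum(y_true) / n
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
```

An R² close to 1, as in the output above, means the model explains most of the variation in "quality_score" on the test split.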


# **References**

Machine Learning Logistic Regression Model. Retrieved from https://github.com/FullStackWithLawrence/006-scikit-learn-logistic-regression/blob/main/jupyter-notebook/Logistic_Regression_Notebook.ipynb

Banana Quality Dataset. Retrieved from https://www.kaggle.com/datasets/mrmars1010/banana-quality-dataset

Scikit Learn: train_test_split. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

