In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report, f1_score

In [2]:
data = pd.read_csv("Raisin_Dataset.csv")
y = data.pop("Class")

# Split the data into train and validation, stratifying on the target feature.
X_train, X_val, y_train, y_val = train_test_split(data, y, stratify=y, random_state=42)

In [3]:
# Get a high level overview of the data. This will be useful for slicing.
X_train.describe()

| | Area | MajorAxisLength | MinorAxisLength | Eccentricity | ConvexArea | Extent | Perimeter |
|---|---|---|---|---|---|---|---|
| count | 675.0 | 675.0 | 675.0 | 675.0 | 675.0 | 675.0 | 675.0 |
| mean | 88107.967407 | 433.058918 | 254.496053 | 0.78305 | 91554.672593 | 0.698341 | 1169.725927 |
| std | 38979.928482 | 117.734938 | 48.918447 | 0.090386 | 40916.098518 | 0.053885 | 275.94255 |
| min | 25387.0 | 225.629541 | 143.710872 | 0.34873 | 26139.0 | 0.379856 | 619.074 |
| 25% | 59691.5 | 346.19502 | 220.469287 | 0.743461 | 61982.0 | 0.671966 | 971.3505 |
| 50% | 78883.0 | 407.940329 | 248.500271 | 0.798953 | 81613.0 | 0.706142 | 1120.963 |
| 75% | 105055.5 | 494.211068 | 279.586959 | 0.845791 | 108543.0 | 0.733834 | 1305.7995 |
| max | 235047.0 | 997.291941 | 492.275279 | 0.962124 | 278217.0 | 0.824319 | 2697.753 |


In [4]:
lr = LogisticRegression(max_iter=1000, random_state=23)
lb = LabelBinarizer()

# Binarize the target feature.
y_train = lb.fit_transform(y_train)
y_val = lb.transform(y_val)

# Train Logistic Regression.
lr.fit(X_train, y_train.ravel())

LogisticRegression(max_iter=1000, random_state=23)

# Solution Code

In [5]:
# Use sklearn's classification report to get an overall view of our classifier.
print(classification_report(y_val, lr.predict(X_val)))

              precision    recall  f1-score   support

           0       0.90      0.86      0.88       112
           1       0.86      0.90      0.88       113

    accuracy                           0.88       225
   macro avg       0.88      0.88      0.88       225
weighted avg       0.88      0.88      0.88       225



In [6]:
# For each feature, compare F1 above and below a threshold near the feature's average.
print("F1 score on MajorAxisLength slices:")
row_slice = X_val["MajorAxisLength"] >= 427.7
print(f1_score(y_val[row_slice], lr.predict(X_val[row_slice])))

row_slice = X_val["MajorAxisLength"] < 427.7
print(f1_score(y_val[row_slice], lr.predict(X_val[row_slice])))

print("\nF1 score on MinorAxisLength slices:")
row_slice = X_val["MinorAxisLength"] >= 254.4
print(f1_score(y_val[row_slice], lr.predict(X_val[row_slice])))

row_slice = X_val["MinorAxisLength"] < 254.4
print(f1_score(y_val[row_slice], lr.predict(X_val[row_slice])))

print("\nF1 score on ConvexArea slices:")
row_slice = X_val["ConvexArea"] >= 90407.3
print(f1_score(y_val[row_slice], lr.predict(X_val[row_slice])))

row_slice = X_val["ConvexArea"] < 90407.3
print(f1_score(y_val[row_slice], lr.predict(X_val[row_slice])))

F1 score on MajorAxisLength slices:
0.0
0.9230769230769231

F1 score on MinorAxisLength slices:
0.65
0.9319371727748692

F1 score on ConvexArea slices:
0.30769230769230765
0.9174311926605505


Across all three features, the model scores consistently higher on raisins below the threshold (smaller than average) than on raisins above it (larger than average).
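The slice comparison can be wrapped in a small helper that also reports how many rows and how many positive labels each slice contains. This matters because `f1_score` with its default settings scores only the positive class (label 1), so a slice with very few positives can show a low or zero F1 even when most of its predictions are correct. The sketch below is only an illustration: `slice_report` and the toy data are hypothetical names and values, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score


def slice_report(X, y_true, y_pred, feature, threshold):
    """Return F1, row count, and positive count above and below a threshold."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    rows = []
    for op, mask in [(">=", X[feature] >= threshold), ("<", X[feature] < threshold)]:
        mask = mask.to_numpy()
        rows.append({
            "slice": f"{feature} {op} {threshold}",
            "n": int(mask.sum()),
            "positives": int(y_true[mask].sum()),
            # zero_division=0 avoids a warning when nothing is predicted positive.
            "f1": f1_score(y_true[mask], y_pred[mask], zero_division=0),
        })
    return pd.DataFrame(rows)


# Toy data (hypothetical values, not the raisin dataset).
X_demo = pd.DataFrame({"size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
y_true_demo = [1, 1, 1, 0, 0, 1]
y_pred_demo = [1, 1, 0, 0, 0, 0]

report = slice_report(X_demo, y_true_demo, y_pred_demo, "size", 3.5)
print(report)
```

In the notebook, the same helper could be called as `slice_report(X_val, y_val, lr.predict(X_val), "MajorAxisLength", 427.7)`, assuming the variables from the earlier cells are still in scope.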

Looking at the summary statistics, the median is smaller than the mean for every size-related measure (Area, MajorAxisLength, MinorAxisLength, ConvexArea, and Perimeter), so more than half of the raisins fall below the average on those measures. That is also where the model is strongest, which suggests we might want more data on larger raisins.
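The mean-versus-median comparison can also be computed directly rather than read off the `describe()` table. The following is a minimal sketch on toy data; `df_demo` and its values are hypothetical, and in the notebook `X_train` would take its place.

```python
import pandas as pd

# Toy data (hypothetical values, not the raisin dataset).
# "Area" has a long right tail; "Extent" has a long left tail.
df_demo = pd.DataFrame({
    "Area": [10, 12, 13, 15, 50],
    "Extent": [0.5, 0.7, 0.72, 0.74, 0.75],
})

summary = pd.DataFrame({
    "mean": df_demo.mean(),
    "median": df_demo.median(),
    # Positive skew means a long right tail (median below mean).
    "skew": df_demo.skew(),
    # Share of rows strictly below the column mean.
    "frac_below_mean": (df_demo < df_demo.mean()).mean(),
})
print(summary)
```

A column whose `frac_below_mean` is well above 0.5 has most of its rows below the average, which is the pattern described above for the size-related features.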

## Model Card

### Model Details

Logistic Regression model using scikit-learn's default hyperparameters, except for `max_iter=1000` and `random_state=23`. Trained with sklearn version 0.24.1.
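One way to keep this section accurate is to generate the details from the model object itself. The sketch below, which assumes the same constructor arguments as the training cell, lists the library version and any hyperparameters that differ from the defaults; the variable names are illustrative.

```python
import sklearn
from sklearn.linear_model import LogisticRegression

# Same constructor arguments as the training cell.
model = LogisticRegression(max_iter=1000, random_state=23)

# Compare against a default-constructed estimator to find overridden values.
defaults = LogisticRegression().get_params()
non_default = {
    name: value
    for name, value in model.get_params().items()
    if defaults[name] != value
}

print("sklearn version:", sklearn.__version__)
print("non-default hyperparameters:", non_default)
```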

### Intended Use

For classifying two types of raisins from Turkey.

### Metrics

F1 score on the validation set (225 samples): macro average of 0.88, with 0.88 for class 0 and 0.88 for class 1. The classes are nearly balanced (112 and 113 validation samples).

When analyzed across data slices on MajorAxisLength, MinorAxisLength, and ConvexArea, F1 is between 0.92 and 0.93 for raisins below the average size and much lower, between 0.0 and 0.65, for raisins above it.
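The per-class and macro-averaged F1 values reported in a model card can be computed with `f1_score` and its `average` parameter. The sketch below uses toy labels (hypothetical values, not the validation set) to show the two calls.

```python
from sklearn.metrics import f1_score

# Toy labels (hypothetical values, not the raisin validation set).
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0, 0]

# average=None returns one F1 value per class, ordered by label.
per_class = f1_score(y_true_demo, y_pred_demo, average=None)

# average="macro" is the unweighted mean of the per-class values.
macro = f1_score(y_true_demo, y_pred_demo, average="macro")

print("per-class F1:", per_class)
print("macro F1:", macro)
```

In the notebook, `y_val.ravel()` and `lr.predict(X_val)` would replace the toy lists, assuming the earlier cells have been run.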

### Data

Raisin dataset acquired from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Raisin+Dataset

Originally from: Cinar I., Koklu M. and Tasdemir S., Classification of Raisin Grains Using Machine Vision and Artificial Intelligence Methods. Gazi Journal of Engineering Sciences, vol. 6, no. 3, pp. 200-209, December, 2020.

### Bias

Most raisins in the dataset are below the average size, and the model performs worse on the larger ones. This is a potential source of bias, but assessing it properly may require subject matter expertise. Note to students: this is a useful callout, and in a real-world scenario it should prompt you to collaborate with subject matter experts to flesh it out.