In [1]:
import pandas as pd
from warnings import filterwarnings

filterwarnings(action='ignore', category=FutureWarning)

RAISINS = '/kaggle/input/raisin-binary-classification/Raisin_Dataset.csv'
df = pd.read_csv(filepath_or_buffer=RAISINS)
df['class'] = df['Class'] == 'Kecimen'

df.head()

    Area  MajorAxisLength  MinorAxisLength  Eccentricity  ConvexArea    Extent  Perimeter    Class  class
0  87524       442.246011       253.291155      0.819738       90546  0.758651    1184.04  Kecimen   True
1  75166       406.690687       243.032436      0.801805       78789   0.68413   1121.786  Kecimen   True
2  90856       442.267048       266.328318      0.798354       93717  0.637613   1208.575  Kecimen   True
3  45928       286.540559       208.760042      0.684989       47336  0.699599    844.162  Kecimen   True
4  79408        352.19077       290.827533      0.564011       81463  0.792772   1073.251  Kecimen   True


In [2]:
from plotly.express import histogram
for column in df.columns[:7]:
    histogram(data_frame=df, x=column, color='Class', height=500).show()

A first look at the histograms suggests that some raisins are clearly Besni and others clearly Kecimen, but the two classes overlap in a middle range where raisins may be difficult to identify. Let's take another look using dimensionality reduction.
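
To put a rough number on the overlap seen in the histograms, one option is a standardized mean difference per feature: the gap between the class means divided by a pooled standard deviation. The sketch below assumes a frame with a two-valued label column; `class_separation` and the toy data are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def class_separation(frame, feature_columns, label_column):
    """Absolute gap between class means in units of pooled standard deviation."""
    labels = frame[label_column].unique()
    if len(labels) != 2:
        raise ValueError('expected exactly two classes')
    first = frame.loc[frame[label_column] == labels[0], feature_columns]
    second = frame.loc[frame[label_column] == labels[1], feature_columns]
    # Pooled standard deviation: root of the average of the two class variances.
    pooled_std = np.sqrt((first.var() + second.var()) / 2)
    gap = (first.mean() - second.mean()).abs() / pooled_std
    return gap.sort_values(ascending=False)


# Toy data for illustration only (hypothetical values, not the raisin measurements).
rng = np.random.default_rng(2024)
toy = pd.DataFrame({
    'well_separated': np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)]),
    'overlapping': np.concatenate([rng.normal(0, 1, 100), rng.normal(0.2, 1, 100)]),
    'Class': ['Kecimen'] * 100 + ['Besni'] * 100,
})
separation = class_separation(toy, ['well_separated', 'overlapping'], 'Class')
print(separation)
```

In this notebook the call would presumably be `class_separation(df, list(df.columns[:7]), 'Class')`; features with small values are the ones whose histograms overlap the most.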

In [3]:
from sklearn.manifold import TSNE
from plotly.express import scatter
tsne = TSNE(verbose=1, random_state=2024, n_iter=1000)
tsne_df = pd.DataFrame(data=tsne.fit_transform(X=df[df.columns[:7]]), columns=['tx', 'ty'])
tsne_df['Class'] = df['Class'].values
scatter(data_frame=tsne_df, x='tx', y='ty', color='Class')

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 900 samples in 0.001s...
[t-SNE] Computed neighbors for 900 samples in 0.010s...
[t-SNE] Computed conditional probabilities for sample 900 / 900
[t-SNE] Mean sigma: 1739.858557
[t-SNE] KL divergence after 250 iterations with early exaggeration: 47.252121
[t-SNE] KL divergence after 1000 iterations: 0.201181


t-SNE tells us much the same thing: many raisins are going to be hard to classify from the available features.
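
One caveat worth checking: t-SNE works from Euclidean distances by default, and the features here span very different scales, so the large-valued columns such as Area and ConvexArea likely dominate the embedding (the mean sigma of about 1740 in the log points the same way). A minimal sketch of the same embedding on standardized features follows; `scaled_tsne` and the toy data are assumptions for illustration, not the notebook's own code.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler


def scaled_tsne(frame, feature_columns, random_state=2024):
    """Standardize features, then embed them in two dimensions with t-SNE."""
    # Zero mean and unit variance per column, so no single feature dominates distances.
    scaled = StandardScaler().fit_transform(frame[feature_columns])
    embedding = TSNE(random_state=random_state).fit_transform(scaled)
    return pd.DataFrame(data=embedding, columns=['tx', 'ty'], index=frame.index)


# Toy data for illustration only: two features on very different scales.
rng = np.random.default_rng(2024)
toy = pd.DataFrame({
    'large_scale': rng.normal(80000, 20000, 60),
    'small_scale': rng.normal(0.7, 0.1, 60),
})
toy_embedding = scaled_tsne(toy, ['large_scale', 'small_scale'])
print(toy_embedding.shape)
```

Applied to the raisin data, this would be `scaled_tsne(df, list(df.columns[:7]))`, with the `Class` column attached afterwards for coloring as in the cell above.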

In [4]:
df.columns

Index(['Area', 'MajorAxisLength', 'MinorAxisLength', 'Eccentricity',
       'ConvexArea', 'Extent', 'Perimeter', 'Class', 'class'],
      dtype='object')

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

columns = df.columns[:7]
target = 'class'
X_train, X_test, y_train, y_test = train_test_split(df[columns], df[target], test_size=0.25, random_state=2024)
model = LogisticRegression(max_iter=100000)
model.fit(X_train, y_train)

print('accuracy: {:5.2f} pct'.format(100 * accuracy_score(y_test, model.predict(X_test))))

accuracy: 84.44 pct


In [6]:
from sklearn.metrics import classification_report
print(classification_report(y_true = y_test, y_pred = model.predict(X_test)))

              precision    recall  f1-score   support

       False       0.87      0.81      0.84       111
        True       0.83      0.88      0.85       114

    accuracy                           0.84       225
   macro avg       0.85      0.84      0.84       225
weighted avg       0.85      0.84      0.84       225



Model performance is roughly balanced across the two classes. Which features contribute the most?

In [7]:
from plotly.express import bar
bar(x=columns, y=model.coef_[0])

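The axis-length and perimeter measurements have the largest coefficients; the other features make negligible contributions in this fit. Note that the features were not standardized, so coefficient size also reflects each feature's units. Let's drop the other columns and train a second model.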
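
A scale-independent version of this comparison can be sketched by standardizing the features inside a pipeline before fitting, so that each coefficient refers to a one-standard-deviation change. The helper `standardized_coefficients` and the toy data below are hypothetical; the results on the raisin data may or may not match the ranking above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def standardized_coefficients(X, y):
    """Fit logistic regression on standardized features; return coefficients by size."""
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    pipeline.fit(X, y)
    coefficients = pd.Series(pipeline[-1].coef_[0], index=X.columns)
    # Order by absolute value so the most influential features come first.
    order = coefficients.abs().sort_values(ascending=False).index
    return coefficients.reindex(order)


# Toy data for illustration only (hypothetical, not the raisin measurements):
# the informative feature has tiny units, the pure-noise feature has huge units.
rng = np.random.default_rng(2024)
signal = rng.normal(size=300)
toy_X = pd.DataFrame({
    'informative_tiny_units': signal * 0.001 + rng.normal(0, 0.0002, 300),
    'noise_huge_units': rng.normal(0, 10000, 300),
})
toy_y = signal > 0
toy_coefficients = standardized_coefficients(toy_X, toy_y)
print(toy_coefficients)
```

In the notebook, `standardized_coefficients(X_train, y_train)` would give the comparable ranking for the seven raisin features.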

In [8]:
selected = ['MajorAxisLength', 'MinorAxisLength', 'Perimeter']
model_1 = LogisticRegression(max_iter=100000)
model_1.fit(X_train[selected], y_train)
y_pred_1 = model_1.predict(X_test[selected])  # predict once, reuse for both reports
print('accuracy: {:5.2f} pct'.format(100 * accuracy_score(y_test, y_pred_1)))
print(classification_report(y_true=y_test, y_pred=y_pred_1))

accuracy: 85.78 pct
              precision    recall  f1-score   support

       False       0.87      0.84      0.85       111
        True       0.85      0.88      0.86       114

    accuracy                           0.86       225
   macro avg       0.86      0.86      0.86       225
weighted avg       0.86      0.86      0.86       225



In [9]:
from plotly.express import bar
bar(x=['MajorAxisLength', 'MinorAxisLength', 'Perimeter'], y=model_1.coef_[0])

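Accuracy rose from 84.44 to 85.78 percent after dropping the features that added no helpful information. On a test set of 225 samples that is three more correct predictions, so the gain is small, but the simpler model does at least as well as the full one.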
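
As a rough gauge of how much weight the accuracy difference can bear, the sketch below computes the binomial standard error of an accuracy measured on 225 test samples, using the two accuracies printed above. It treats the test predictions as independent trials, and both models were scored on the same test set, so this is only an approximate check rather than a formal comparison; `accuracy_standard_error` is a hypothetical helper.

```python
from math import sqrt

TEST_SIZE = 225  # support reported by classification_report above


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimated from n_samples predictions."""
    return sqrt(accuracy * (1 - accuracy) / n_samples)


full_accuracy = 0.8444     # model trained on all seven features
reduced_accuracy = 0.8578  # model trained on the three selected features

gap = reduced_accuracy - full_accuracy
standard_error = accuracy_standard_error(reduced_accuracy, TEST_SIZE)

print('gap: {:.2f} points'.format(100 * gap))
print('standard error: {:.2f} points'.format(100 * standard_error))
print('gap within one standard error:', gap < standard_error)
```

The gap of about 1.3 points is smaller than the standard error of roughly 2.3 points, so cross-validation or a different `random_state` for the split would be a reasonable next step before concluding that the reduced model is better.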