<a href="https://www.kaggle.com/code/mikedelong/gaussian-process-acc-0-9150?scriptVersionId=165427693" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
from warnings import filterwarnings
filterwarnings(action='ignore', category=FutureWarning)

In [2]:
import pandas as pd

APPLES = '/kaggle/input/apple-quality-analysis-dataset/apple_quality.csv'

df = pd.read_csv(filepath_or_buffer=APPLES, index_col=['A_id']).dropna(subset=['Quality'])
# for some reason our acidity data doesn't load as floats, so we need to fix that
df['Acidity'] = df['Acidity'].astype(float)
df.head()

A_id,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness,Acidity,Quality
0.0,-3.970049,-2.512336,5.34633,-1.012009,1.8449,0.32984,-0.49159,good
1.0,-1.195217,-2.839257,3.664059,1.588232,0.853286,0.86753,-0.722809,good
2.0,-0.292024,-1.351282,-1.738429,-0.342616,2.838636,-0.038033,2.621636,bad
3.0,-0.657196,-2.271627,1.324874,-0.097875,3.63797,-3.413761,0.790723,good
4.0,1.364217,-1.296612,-0.384658,-0.553006,3.030874,-1.303849,0.501984,good


All of our independent variables are floats; let's plot each one split by the target variable to see whether its distribution differs between good and bad apples.

In [3]:
from plotly import express
xs = ['Size', 'Weight', 'Sweetness', 'Crunchiness', 'Juiciness', 'Ripeness', 'Acidity',]
for x in xs:
    express.histogram(data_frame=df, x=x, facet_col='Quality', marginal='box').show()
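
Small differences between histograms can be hard to judge by eye, so a numeric summary per class may help. The sketch below uses a tiny made-up frame (the `toy` values are hypothetical and not taken from the dataset) to show one way of comparing per-class means with a pandas groupby; in the notebook, `df` and `xs` would take the place of `toy` and `features`.

```python
import pandas as pd

# Hypothetical toy data standing in for the apple frame.
toy = pd.DataFrame({
    'Size': [1.0, 2.0, -1.0, -2.0],
    'Acidity': [-0.5, -1.5, 1.0, 2.0],
    'Quality': ['good', 'good', 'bad', 'bad'],
})
features = ['Size', 'Acidity']

# Mean of each feature within each class.
means = toy.groupby('Quality')[features].mean()

# Difference in class means: positive values are larger for 'good' apples.
gap = means.loc['good'] - means.loc['bad']
print(means)
print(gap)
```

A gap that is large relative to a feature's spread suggests that feature separates the classes; a gap near zero suggests it does not, at least not on its own.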

We see some small differences, but nothing really jumps out; let's use dimension reduction to see whether there is any signal in this data at all. We will use UMAP to project our data into two dimensions and visualize it as a scatter plot, colored by the target variable. We want to see whether the projected data clusters and separates according to the target variable.

In [4]:
from arrow import now
from umap import UMAP

time_start = now()
umap = UMAP(random_state=2024, verbose=True, n_jobs=1, low_memory=False, n_epochs=100,)
columns = ['Size', 'Weight', 'Sweetness', 'Crunchiness', 'Juiciness', 'Ripeness', 'Acidity',]

df[['x', 'y']] = umap.fit_transform(X=df[columns])
express.scatter(data_frame=df, x='x', y='y', color='Quality', marginal_y='box').show()
print('done with UMAP in {}'.format(now() - time_start))



UMAP(low_memory=False, n_epochs=100, n_jobs=1, random_state=2024, verbose=True)
Mon Mar  4 16:22:50 2024 Construct fuzzy simplicial set
Mon Mar  4 16:23:06 2024 Finding Nearest Neighbors
Mon Mar  4 16:23:12 2024 Finished Nearest Neighbor Search
Mon Mar  4 16:23:16 2024 Construct embedding



	completed  0  /  100 epochs
	completed  10  /  100 epochs
	completed  20  /  100 epochs
	completed  30  /  100 epochs
	completed  40  /  100 epochs
	completed  50  /  100 epochs
	completed  60  /  100 epochs
	completed  70  /  100 epochs
	completed  80  /  100 epochs
	completed  90  /  100 epochs
Mon Mar  4 16:23:19 2024 Finished embedding


done with UMAP in 0:00:28.831338


What do we see here? We see some clustering and some mixing; there's definitely a signal in our data, so let's build a model.
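
The impression of "some clustering and some mixing" can be given a rough number. One possible check, sketched here with NumPy on synthetic points rather than the real embedding, is the fraction of points whose nearest neighbor in the 2-D projection carries the same label. The helper name `neighbor_label_agreement` is hypothetical, and the brute-force distance matrix is only practical for a few thousand rows.

```python
import numpy as np

def neighbor_label_agreement(embedding, labels):
    """Fraction of points whose nearest neighbor shares their label."""
    embedding = np.asarray(embedding, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances (brute force).
    diffs = embedding[:, None, :] - embedding[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    # A point must not count as its own neighbor.
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)
    return float((labels[nearest] == labels).mean())

# Synthetic stand-in for an embedding: two well-separated blobs.
rng = np.random.default_rng(2024)
points = np.vstack([rng.normal(-5, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels = np.array(['bad'] * 50 + ['good'] * 50)

separated_score = neighbor_label_agreement(points, labels)
# Shuffling the labels destroys the structure and gives a chance-level reference.
shuffled_score = neighbor_label_agreement(points, rng.permutation(labels))
print(separated_score, shuffled_score)
```

Applied to the notebook's data, the call would be `neighbor_label_agreement(df[['x', 'y']], df['Quality'])`; a value well above the shuffled reference would support the claim that the projection carries signal.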

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(df[columns], df['Quality'], test_size=0.2, random_state=2024)

regression = LogisticRegression(max_iter=100000)
regression.fit(X=X_train, y=y_train)

express.bar(x=columns, y=regression.coef_[0]).show()
print('accuracy: {:5.4f} '.format(regression.score(X=X_test, y=y_test)))

accuracy: 0.7425 


Our regression coefficients tell us that overripe apples and acidic apples are bad, while big, heavy, sweet, juicy apples are good. With the coefficients in hand, it is easy to go back to the histograms above and see how each variable's distribution differs by fruit quality.
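
Reading the coefficients this way depends on knowing which class a positive coefficient favors. For a binary scikit-learn `LogisticRegression`, the coefficients describe the log-odds of the second entry in `classes_`, and string labels are sorted, so here that is `'good'`. A minimal sketch on hypothetical toy data (the `X_toy` and `y_toy` values are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): one feature, larger values are labelled 'good'.
X_toy = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y_toy = np.array(['bad', 'bad', 'bad', 'good', 'good', 'good'])

toy_model = LogisticRegression().fit(X_toy, y_toy)

# classes_ is sorted, so 'bad' comes first and 'good' is the class
# that positive coefficients push towards.
positive_class = toy_model.classes_[1]
coefficient = toy_model.coef_[0][0]
print(positive_class, coefficient > 0)
```

In the chart above, a negative bar therefore means that higher values of that feature push the prediction towards `'bad'`.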

In [6]:
from sklearn.metrics import classification_report
print(classification_report(y_true=y_test, y_pred=regression.predict(X=X_test)))

              precision    recall  f1-score   support

         bad       0.76      0.72      0.74       404
        good       0.73      0.77      0.75       396

    accuracy                           0.74       800
   macro avg       0.74      0.74      0.74       800
weighted avg       0.74      0.74      0.74       800
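
The support column above gives the test-set class counts, which is enough to work out what a trivial classifier would score. The arithmetic below is a small sketch using only numbers printed in this notebook (the counts 404 and 396, and the logistic regression accuracy of 0.7425).

```python
# Test-set class counts, copied from the "support" column of the report above.
support = {'bad': 404, 'good': 396}
total = sum(support.values())

# Accuracy of always predicting the most common class.
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

# Accuracy reported for the logistic regression above.
logistic_accuracy = 0.7425

print('majority class: {}'.format(majority_class))
print('baseline accuracy: {:5.4f}'.format(baseline_accuracy))
print('lift over baseline: {:5.4f}'.format(logistic_accuracy - baseline_accuracy))
```

With a split this even, the baseline sits close to one half, so accuracy is a reasonable headline metric here.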



Our classes are balanced and our simple regression model captures both classes reasonably well. Let's try a more complicated model, a Gaussian process classifier, and see if we can do any better.

In [7]:
from arrow import now
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# training this model takes a few minutes (about three in this run), so sit tight

time_start = now()
classifier = GaussianProcessClassifier(kernel=1.0 * RBF(1.0), random_state=2024)
classifier.fit(X_train, y_train)
print('score: {:5.4f}'.format(classifier.score(X_test, y_test)))
print(classification_report(y_true=y_test, y_pred=classifier.predict(X=X_test)))
print('done in {}'.format(now() - time_start))

score: 0.9150
              precision    recall  f1-score   support

         bad       0.92      0.91      0.92       404
        good       0.91      0.92      0.91       396

    accuracy                           0.92       800
   macro avg       0.92      0.92      0.91       800
weighted avg       0.92      0.92      0.92       800

done in 0:03:06.197292
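
The kernel `1.0 * RBF(1.0)` used above is a constant amplitude multiplied by a squared-exponential (RBF) kernel. As a rough illustration of what that function computes, here is a NumPy sketch assuming the documented form k(x, x') = amplitude * exp(-||x - x'||^2 / (2 * length_scale^2)); the helper name `rbf_kernel_sketch` is hypothetical and is not part of scikit-learn.

```python
import numpy as np

def rbf_kernel_sketch(A, B, amplitude=1.0, length_scale=1.0):
    """Amplitude-scaled RBF kernel matrix between the rows of A and B."""
    A = np.atleast_2d(A).astype(float)
    B = np.atleast_2d(B).astype(float)
    # Squared Euclidean distance between every row of A and every row of B.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return amplitude * np.exp(-0.5 * sq_dists / length_scale ** 2)

# Three illustrative points: two close together, one far away.
points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
K = rbf_kernel_sketch(points, points)
print(K.round(4))
```

Nearby points get a similarity close to 1 and distant points a similarity close to 0, which is what lets the classifier draw a nonlinear boundary that logistic regression cannot. The values 1.0 and 1.0 are only starting points: by default the classifier optimizes the kernel hyperparameters during `fit`, and the fitted kernel can be inspected afterwards through `classifier.kernel_`.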
