In [1]:
import pandas as pd

PLANTS = '/kaggle/input/plant-growth-data-classification/plant_growth_data.csv'
TARGET = 'Growth_Milestone'

df = pd.read_csv(filepath_or_buffer=PLANTS)
df[TARGET] = df[TARGET] == 1
df.head()

   Soil_Type  Sunlight_Hours  Water_Frequency  Fertilizer_Type  Temperature   Humidity  Growth_Milestone
0       loam        5.192294        bi-weekly         chemical    31.719602  61.591861             False
1      sandy        4.033133           weekly          organic    28.919484  52.422276              True
2       loam        8.892769        bi-weekly             none    23.179059  44.660539             False
3       loam        8.241144        bi-weekly             none    18.465886  46.433227             False
4      sandy        8.374043        bi-weekly          organic    18.128741  63.625923             False


How much data do we have?

In [2]:
df.shape

(193, 7)

Not a lot: 193 rows and 7 columns. Is our target class balanced?

In [3]:
df[TARGET].value_counts(normalize=True).to_dict()

{False: 0.5025906735751295, True: 0.49740932642487046}

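The two classes are almost perfectly balanced (about 50.3% False, 49.7% True), which is good news: accuracy is a meaningful metric here, and guessing would score about 50%.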
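
To make the chance level concrete, here is a minimal sketch of the majority-class baseline. The counts 97 and 96 are not printed in the notebook; they are an assumption derived from the 193 rows and the proportions shown above.

```python
# Class counts derived (assumption) from 193 rows and the value_counts proportions.
counts = {False: 97, True: 96}

total = sum(counts.values())
# Always predicting the most frequent class gives this accuracy.
baseline = max(counts.values()) / total
print('majority-class baseline accuracy: {:5.4f}'.format(baseline))
```

Any classifier worth keeping should clear this baseline on held-out data.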

In [4]:
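# Guard on the column count so that re-running this cell does not try to
# encode columns that no longer exist.
if len(df.columns) == 7:
    # One-hot encode the three categorical features.
    df = pd.get_dummies(data=df, columns=['Soil_Type', 'Water_Frequency', 'Fertilizer_Type'])
# Every column except the target is a model feature.
COLUMNS = df.drop(columns=[TARGET]).columns.tolist()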
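
For readers unfamiliar with one-hot encoding, the sketch below shows what `pd.get_dummies` does to a single categorical column. The `toy` frame and its values are made up for illustration and are not rows from the plant data.

```python
import pandas as pd

# Hypothetical three-row frame, for illustration only.
toy = pd.DataFrame({
    'Soil_Type': ['loam', 'sandy', 'loam'],
    'Sunlight_Hours': [5.2, 4.0, 8.9],
})

# The categorical column is replaced by one indicator column per category;
# columns not listed in `columns` are left untouched.
encoded = pd.get_dummies(data=toy, columns=['Soil_Type'])
print(encoded.columns.tolist())
```

The same thing happens above for each of the three categorical features, so the feature count grows with the number of distinct categories.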

In [5]:
import arrow
from umap import UMAP

time_start = arrow.now()
umap = UMAP(random_state=2024, verbose=True, n_jobs=1, low_memory=False, n_epochs=500)
df[['x', 'y']] = umap.fit_transform(X=df[COLUMNS])
print('done with UMAP in {}'.format(arrow.now() - time_start))

UMAP(low_memory=False, n_epochs=500, n_jobs=1, random_state=2024, verbose=True)
Tue Jul 30 16:16:31 2024 Construct fuzzy simplicial set
Tue Jul 30 16:16:31 2024 Finding Nearest Neighbors
Tue Jul 30 16:16:35 2024 Finished Nearest Neighbor Search
Tue Jul 30 16:16:39 2024 Construct embedding


	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Tue Jul 30 16:16:41 2024 Finished embedding
done with UMAP in 0:00:09.794921


In [6]:
import warnings
from plotly import express

warnings.filterwarnings(action='ignore', category=FutureWarning)
express.scatter(data_frame=df, x='x', y='y', color=TARGET)

UMAP's two-dimensional embedding does cluster our data, but the clusters do not line up with the target variable.
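
One way to put a number on that visual impression is a silhouette score computed against the labels of interest. This is a swapped-in check, not something the notebook runs. The sketch uses synthetic points as a stand-in for the embedding, and `cluster_ids` and `target` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2024)

# Synthetic stand-in for a 2-D embedding: two well-separated blobs.
cluster_ids = np.repeat([0, 1], 100)
points = rng.normal(size=(200, 2)) + 10.0 * cluster_ids[:, None]

# Hypothetical binary target assigned at random, so it ignores the blobs.
target = rng.integers(0, 2, size=200)

# High when the labels match the geometry, near zero when they do not.
by_cluster = silhouette_score(points, cluster_ids)
by_target = silhouette_score(points, target)
print('silhouette by cluster: {:5.4f}'.format(by_cluster))
print('silhouette by target:  {:5.4f}'.format(by_target))
```

On the notebook's data the analogous call would presumably be `silhouette_score(df[['x', 'y']], df[TARGET])`; a value near zero would be consistent with what the plot suggests.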

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

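# Hold out 20% of the rows (39 of 193), stratified so both classes stay balanced.
X_train, X_test, y_train, y_test = train_test_split(
    df[COLUMNS], df[TARGET], test_size=0.2, random_state=2024, stratify=df[TARGET]
)
# A very tight tolerance makes the solver run until it has essentially converged.
model = LogisticRegression(max_iter=200, tol=1e-12).fit(X_train, y_train)
print('model fit in {} iterations'.format(model.n_iter_[0]))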

print('accuracy: {:5.4f}'.format(accuracy_score(y_true=y_test, y_pred=model.predict(X=X_test))))

model fit in 156 iterations
accuracy: 0.4359
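
With only 39 test rows, a single accuracy figure carries a lot of uncertainty. As a rough sketch, the Wilson score interval below gives a 95% range for the true accuracy. The count of 17 correct predictions is an assumption derived from 0.4359 × 39, and `wilson_interval` is a hypothetical helper, not a library function.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# 17 correct out of 39 is derived (assumption) from the reported accuracy.
low, high = wilson_interval(successes=17, n=39)
print('95% interval for accuracy: [{:5.4f}, {:5.4f}]'.format(low, high))
```

For 17 of 39 the interval runs from roughly 0.29 to 0.59, which includes 0.50, so this result on its own does not show the model is worse than guessing, only that it is no better.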


In [8]:
from sklearn.metrics import classification_report

print(classification_report(y_true=y_test, y_pred=model.predict(X=X_test), zero_division=0))

              precision    recall  f1-score   support

       False       0.45      0.45      0.45        20
        True       0.42      0.42      0.42        19

    accuracy                           0.44        39
   macro avg       0.44      0.44      0.44        39
weighted avg       0.44      0.44      0.44        39
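
Because the report above rests on just 39 rows, a cross-validated estimate may be steadier than a single split. The sketch below is one possible setup, not the notebook's method. It runs on synthetic stand-in data (`X` and `y` are random, so accuracy should hover around chance) and adds a `StandardScaler` step as an assumption, since the notebook fits on unscaled features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2024)
# Synthetic stand-ins with the same row count as the plant data.
X = rng.normal(size=(193, 5))
y = rng.integers(0, 2, size=193).astype(bool)

# Scaling happens inside the pipeline, so each fold is scaled on its own training part.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2024)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')

print('fold accuracies:', np.round(scores, 3))
print('mean: {:5.4f}, std: {:5.4f}'.format(scores.mean(), scores.std()))
```

Swapping in `df[COLUMNS]` and `df[TARGET]` for `X` and `y` would apply the same procedure to the real data; the spread across folds gives a feel for how much a single split can mislead.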



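Our model does poorly on both classes: at 0.44 accuracy it falls below the roughly 0.50 we would expect from guessing on a balanced target, although the test set has only 39 rows. Can we do markedly better with a different model?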