# Plant Recommender Project

## Cluster Modeling

In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans, DBSCAN

#### DBSCAN

Clustering plays an important part in this project: it will provide the basis for the suggestion engine later. To build a good clustering model, I first need a rough estimate of how many clusters the data contains. Because the data is high-dimensional, I can't judge that number intuitively, so I will estimate it with `DBSCAN` clustering, which does not require the number of clusters to be specified up front:

In [10]:
# Load the data and drop rows with missing values (scaling happens in the next cell)
df = pd.read_csv('../datasets/cleaned-data.csv')
df.dropna(inplace=True)

In [11]:
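# Separate the identifier columns from the features used for clustering
X = df.drop(columns=['id', 'Scientific_Name_x'])
species = df[['id', 'Scientific_Name_x']]

# Standardize the features so that no single column dominates the distance calculations
sc = StandardScaler()

X_sc = sc.fit_transform(X)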

In [12]:
X.isnull().sum().any()

False

In [21]:
# Fit a DBSCAN model
db = DBSCAN(eps=10, min_samples=2)
db.fit(X_sc)

DBSCAN(eps=10, min_samples=2)

In [22]:
# Find the number of clusters (excluding the noise label -1) and look at the silhouette score
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_clusters_

68

In [23]:
silhouette_score(X_sc, db.labels_)

0.1460248480136029

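This score is not great, but that doesn't particularly matter: this model was only meant to give a rough estimate of the number of clusters needed to represent the data well. The `n_clusters_` value of 68 found here, rounded to about 70, gives a useful starting point for testing other clustering models, such as KMeans. Keep in mind that this count depends on the chosen `eps` and `min_samples`, so it is an estimate rather than a fixed property of the data.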
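Since the cluster count and silhouette score from DBSCAN both depend on `eps`, and `silhouette_score` treats the noise label `-1` as if it were a cluster of its own, it can help to look at a few `eps` values and score only the points that were actually assigned to a cluster. The following is a minimal sketch of that idea, not the analysis used in this notebook. It assumes synthetic data from `make_blobs` in place of `cleaned-data.csv`, and the helper name `summarize_dbscan`, the `X_demo` variables, and the `eps` values are hypothetical choices made for illustration.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def summarize_dbscan(X_scaled, eps, min_samples=2):
    """Fit DBSCAN and report cluster count, noise count, and the
    silhouette score computed on non-noise points only.
    (Hypothetical helper, not part of the original notebook.)"""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X_scaled).labels_

    # DBSCAN marks noise points with the label -1
    clustered = labels != -1
    n_clusters = len(set(labels[clustered]))
    n_noise = int((~clustered).sum())
    n_kept = int(clustered.sum())

    # silhouette_score needs at least 2 clusters and fewer clusters than samples
    score = None
    if 2 <= n_clusters < n_kept:
        score = silhouette_score(X_scaled[clustered], labels[clustered])

    return {"eps": eps, "n_clusters": n_clusters,
            "n_noise": n_noise, "silhouette": score}


# Synthetic stand-in for the scaled plant data (assumption for illustration)
X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=5,
                       cluster_std=0.5, random_state=42)
X_demo_sc = StandardScaler().fit_transform(X_demo)

# Illustrative eps values; suitable values depend on the data's scale
results = [summarize_dbscan(X_demo_sc, eps) for eps in (0.3, 0.5, 1.0)]
for row in results:
    print(row)
```

On the real data, the same helper could be called with `X_sc` and a range of `eps` values around 10 to see how stable the estimate of roughly 70 clusters is.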

#### KMeans

To see if this silhouette score can be improved upon, let's try a KMeans clustering model with 70 clusters:

In [24]:
km = KMeans(n_clusters=70, random_state=42)
km.fit(X_sc)

KMeans(n_clusters=70, random_state=42)

In [25]:
silhouette_score(X_sc, km.labels_)

0.049270928938044745
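The KMeans score with 70 clusters is lower than the DBSCAN score above. One possible next step is to compare silhouette scores across several values of `n_clusters` rather than relying on a single value. The sketch below shows the general pattern under stated assumptions: it uses synthetic `make_blobs` data instead of the plant dataset, and the helper name `silhouette_by_k`, the `X_demo` variables, and the range of `k` values are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def silhouette_by_k(X_scaled, k_values, random_state=42):
    """Fit one KMeans model per k and return {k: silhouette score}.
    (Hypothetical helper, not part of the original notebook.)"""
    scores = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = km.fit_predict(X_scaled)
        scores[k] = silhouette_score(X_scaled, labels)
    return scores


# Synthetic stand-in for the scaled plant data (assumption for illustration)
X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=5,
                       cluster_std=0.5, random_state=42)
X_demo_sc = StandardScaler().fit_transform(X_demo)

scores = silhouette_by_k(X_demo_sc, range(2, 9))
best_k = max(scores, key=scores.get)  # k with the highest silhouette score

for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k:", best_k)
```

For the plant data, the equivalent call would pass `X_sc` and a range of values around 70, for example `range(40, 101, 10)`, keeping in mind that each fit adds run time.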