# Plant Recommender Project

## Cluster Modeling

In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans, DBSCAN, SpectralClustering

#### DBSCAN

Clustering plays an important part in this project: it will provide the basis for the suggestion engine later. To build the best possible clustering model, I first need a rough estimate of the number of clusters in the data. Because the data is high-dimensional, I can't judge the necessary number of clusters by eye, so I will estimate it with `DBSCAN`, which does not require the number of clusters to be specified in advance:

In [23]:
# Load the data and drop rows with missing values
df = pd.read_csv('../datasets/cleaned-data.csv')
df.dropna(inplace=True)

In [24]:
X = df.drop(columns=['id', 'Scientific_Name_x'])
species = df[['id', 'Scientific_Name_x']]

sc = StandardScaler()

X_sc = sc.fit_transform(X)

In [25]:
X.isnull().sum().any()

False

In [21]:
# Fit a DBSCAN model
db = DBSCAN(eps=10, min_samples=2)
db.fit(X_sc)

DBSCAN(eps=10, min_samples=2)

In [22]:
# Find the number of clusters and look at the silhouette score
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_clusters_

68
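
As a small illustration of the counting step above: `DBSCAN` labels noise points as `-1`, which is why that label is subtracted from the cluster count. The sketch below uses a hypothetical helper, `summarize_labels`, and toy labels rather than the real `db.labels_`. It also reports how many points were treated as noise and how large each cluster is, which may help judge whether an estimate like this is driven by many very small clusters.

```python
from collections import Counter

def summarize_labels(labels):
    """Return (number of clusters, number of noise points, cluster sizes)."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # DBSCAN marks noise with the label -1
    return len(counts), n_noise, dict(counts)

# Toy labels standing in for db.labels_ (assumption: not the real output)
n_clusters, n_noise, sizes = summarize_labels([0, 0, 1, -1, 1, 1, -1, 2, 2])
print(n_clusters, n_noise, sizes)
```

In the notebook, the same helper could presumably be called as `summarize_labels(db.labels_)`.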

In [23]:
silhouette_score(X_sc, db.labels_)

0.1460248480136029

This score is not great, but that matters little here: the model was only meant to give a rough estimate of the number of clusters needed to represent the data well. The estimate of about 70 clusters (`n_clusters_` is 68) will be the starting point for testing other clustering models, such as KMeans.
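
For context on how to read these numbers: the silhouette coefficient of a single point is `(b - a) / max(a, b)`, where `a` is its mean distance to the other points in its own cluster and `b` is the smallest mean distance to the points of any other cluster. The reported score is the mean over all points, so it ranges from -1 to 1, and values near 0 suggest overlapping clusters. The following is a minimal pure-Python sketch of that definition on toy data; `silhouette_by_hand` is a hypothetical helper written for illustration, not the `scikit-learn` implementation.

```python
from math import dist
from statistics import mean

def silhouette_by_hand(points, labels):
    """Mean silhouette coefficient for points (tuples of floats) and their labels."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster
        own = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        if not own:
            scores.append(0.0)  # a point alone in its cluster scores 0
            continue
        a = mean(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(dist(p, q) for q, l in zip(points, labels) if l == other)
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)

# Two tight, well-separated toy clusters on a line
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
toy_labels = [0, 0, 1, 1]
score = silhouette_by_hand(toy_points, toy_labels)
print(round(score, 3))
```

On this toy data the score comes out at about 0.90, which is what well-separated clusters look like; the scores in this notebook are much lower.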

#### KMeans

To see whether this silhouette score can be improved upon, let's try a KMeans clustering model:

In [24]:
km = KMeans(n_clusters=70, random_state=42)
km.fit(X_sc)

KMeans(n_clusters=70, random_state=42)

In [25]:
silhouette_score(X_sc, km.labels_)

0.049270928938044745

This actually performed worse than the `DBSCAN` model, so I will set it aside in favor of a better one.

#### Spectral Clustering

In pursuit of a better silhouette score, I will now try `SpectralClustering`, another clustering model from `scikit-learn`:

In [27]:
spc = SpectralClustering(n_clusters=70)
spc.fit(X_sc)

SpectralClustering(n_clusters=70)

In [30]:
silhouette_score(X_sc, spc.labels_)

0.31690229242790374

This is by far the best performance I've gotten out of a clustering model thus far. Perhaps this can even be improved upon by tuning the model's hyperparameters:

In [44]:
# Iterate over different combinations of hyperparameters
def spec_clustering_tuner(data, n_clusters=(6, 10), gammas=(1,), n_inits=(10,),
                          cores=-1, plot=True):
    """Fit a SpectralClustering model for every combination of the given hyperparameters.

    Takes scaled data plus iterables of candidate values, optionally plots the
    silhouette score of each fit against its number of clusters, and returns
    the best (score, parameters) pair found.
    """
    # Map each score to the parameters that produced it; track x/y values for plotting
    params = {}
    scores = []
    cluster_counts = []

    # Fit and score a new model for each combination
    for gamma in gammas:
        for cluster in n_clusters:
            for num in n_inits:
                model = SpectralClustering(n_clusters=cluster, gamma=gamma, n_init=num,
                                           n_jobs=cores)
                model.fit(data)

                score = silhouette_score(data, model.labels_)
                scores.append(score)
                cluster_counts.append(cluster)

                # If two fits tie exactly, keep the parameters of the first one.
                # The bound get_params method is stored; its repr shows the model's settings.
                params.setdefault(score, model.get_params)

    # Find the best score and its parameters
    best_model_params = max(params.items(), key=lambda item: item[0])

    # Plot one point per fitted model
    if plot:
        plt.figure(figsize=(12, 10))
        plt.title('Silhouette Score Against Number of Clusters', fontsize=18)
        sns.scatterplot(x=cluster_counts, y=scores, alpha=0.75)
        plt.xlabel('n_clusters')
        plt.ylabel('Silhouette score')
        plt.show()

    return best_model_params
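
The tuner above fits one model per combination of hyperparameters, so four cluster counts, two gammas and two `n_init` values give 4 × 2 × 2 = 16 fits. As a sketch of the same idea in a more general form, the nested loops can be flattened with `itertools.product`. Here `grid_search` and `toy_score` are hypothetical names; `toy_score` is a made-up function standing in for fitting a model and computing its silhouette score, so the numbers it produces mean nothing.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Evaluate score_fn(**params) for every combination in grid.

    grid maps parameter names to lists of candidate values.
    Returns (best_score, best_params, all_results).
    """
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((score_fn(**params), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_score, best_params, results

# Toy scoring function (assumption) in place of fit + silhouette_score
def toy_score(n_clusters, gamma, n_init):
    return 1.0 / n_clusters + gamma - 0.001 * n_init

best_score, best_params, results = grid_search(
    toy_score,
    {"n_clusters": [5, 20, 70, 90], "gamma": [0.5, 1], "n_init": [5, 20]},
)
print(len(results), best_params)
```

Keeping every `(score, params)` pair, rather than only the best one, makes it easier to plot or inspect the whole grid afterwards.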

In [47]:
best = spec_clustering_tuner(data=X_sc, n_clusters=[5, 20, 70, 90], 
                      gammas=[0.5, 1], n_inits=[5, 20], plot=False)

In [48]:
best

(0.5184634551759015,
 <bound method BaseEstimator.get_params of SpectralClustering(gamma=0.5, n_clusters=5, n_init=5, n_jobs=-1)>)

Tuning this clustering model has led to an even greater improvement in silhouette score. This will be the model that I'll use for the recommendation engine.

## Building the Recommendation System

To build this engine, I will follow [this blog post](https://towardsdatascience.com/build-your-own-clustering-based-recommendation-engine-in-15-minutes-bdddd591d394) on building recommender systems from clustering models.

In [4]:
# Build out the final model based on the best parameters found
spec = SpectralClustering(gamma=0.5, n_clusters=5, n_init=5, n_jobs=-1)
spec.fit(X_sc)

SpectralClustering(gamma=0.5, n_clusters=5, n_init=5, n_jobs=-1)

In [5]:
silhouette_score(X_sc, spec.labels_)

0.5184634551759015

In the interest of a more granular clustering model, I will also try more clusters than the tuning run above selected for this data, so that recommendations will be more precise:

In [6]:
# Trying out more clusters for a more granular model
spec = SpectralClustering(gamma=0.5, n_clusters=30, n_init=5, n_jobs=-1)
spec.fit(X_sc)

SpectralClustering(gamma=0.5, n_clusters=30, n_init=5, n_jobs=-1)

In [7]:
silhouette_score(X_sc, spec.labels_)

0.505810616398999

In [8]:
spec = SpectralClustering(gamma=0.5, n_clusters=70, n_init=5, n_jobs=-1)
spec.fit(X_sc)
silhouette_score(X_sc, spec.labels_)

0.505810616398999

The number of clusters doesn't seem to affect the silhouette score as much as I expected. This is great news, as it allows for a more granular model with more precise and specific recommendations.
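
A score that matches to every printed digit for 30 and 70 clusters is unusual, so one thing that may be worth ruling out is that both runs ended up grouping the points in the same way. The sketch below shows one way to check this, assuming the labels of each run were saved in separate variables before `spec` was refit (hypothetical names such as `labels_30` and `labels_70`; the notebook as written overwrites them). `same_partition` is an illustrative helper, not part of `scikit-learn`.

```python
def same_partition(labels_a, labels_b):
    """True if two labelings group the points identically, ignoring label names."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same points")
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label in one run must map to exactly one label in the other
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True

# Toy labelings: the first pair differs only in label names
print(same_partition([0, 0, 1, 2], [5, 5, 3, 1]))
print(same_partition([0, 0, 1, 1], [0, 1, 1, 1]))
```

Counting the distinct values in each label array would show in the same spirit whether the model actually produced as many non-empty clusters as requested.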

To create these recommendations, I will need cluster assignments for both the `X` data and the user-supplied entry. For `X`:

In [26]:
# Create predicted cluster column and recombine the data with the id column 
X = pd.concat([X, species], axis=1)
X['cluster'] = spec.labels_

In [27]:
X.head()

| Unnamed: 0 | Growth_Rate | Lifespan | Toxicity | Drought_Tolerance | Hedge_Tolerance | Moisture_Use | pH_Minimum | pH_Maximum | Salinity_Tolerance | Shade_Tolerance | ... | Bloom_Period_Mid Summer | Bloom_Period_Spring | Bloom_Period_Summer | Bloom_Period_Winter | Bloom_Period_nan | Fire_Resistance_Yes | Fire_Resistance_nan | id | Scientific_Name_x | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 1 | 2 | 4.0 | 6.0 | 0 | 2 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 21 | Abies balsamea (L.) Mill. | 5 |
| 1 | 1 | 3 | 0 | 2 | 1 | 2 | 5.5 | 7.8 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 40 | Abies concolor (Gord. & Glend.) Lindl. ex Hild... | 5 |
| 2 | 2 | 2 | 0 | 1 | 1 | 2 | 3.5 | 5.5 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 55 | Abies fraseri (Pursh) Poir. | 5 |
| 3 | 2 | 3 | 0 | 2 | 1 | 2 | 4.5 | 7.5 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 62 | Abies grandis (Douglas ex D. Don) Lindl. | 5 |
| 4 | 3 | 2 | 0 | 2 | 3 | 2 | 5.7 | 7.0 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 65 | Abelia ×grandiflora (Rovelli ex André) Rehder | 5 |


In [None]:
# Now for a sample user input
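
One practical wrinkle for this step is that `SpectralClustering` in `scikit-learn` has no `predict` method, so a fitted model cannot label a new row directly. The sketch below shows one possible workaround under stated assumptions: it is not taken from the linked blog post, the data is synthetic (`make_blobs`), and the names `demo_scaler`, `demo_spec`, `nearest` and `recommend` are hypothetical. The user entry is scaled with the already fitted scaler, assigned the cluster of its nearest neighbour among the clustered rows, and the closest rows from that cluster are returned as recommendations.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the plant features (assumption: not the real dataset)
X_demo, _ = make_blobs(n_samples=60, centers=3, n_features=4, random_state=0)

# Scale and cluster, mirroring the steps used for X above
demo_scaler = StandardScaler()
X_demo_sc = demo_scaler.fit_transform(X_demo)
demo_spec = SpectralClustering(n_clusters=3, gamma=0.5, random_state=0)
demo_labels = demo_spec.fit_predict(X_demo_sc)

# SpectralClustering cannot label unseen rows, so a 1-nearest-neighbour
# classifier trained on the cluster labels acts as a stand-in "predict"
nearest = KNeighborsClassifier(n_neighbors=1).fit(X_demo_sc, demo_labels)

def recommend(user_row, n_recs=5):
    """Return (cluster, indices of up to n_recs closest rows in that cluster)."""
    # Reuse the fitted scaler; do not refit it on the user entry
    user_sc = demo_scaler.transform(np.asarray(user_row, dtype=float).reshape(1, -1))
    cluster = int(nearest.predict(user_sc)[0])
    members = np.flatnonzero(demo_labels == cluster)
    dists = np.linalg.norm(X_demo_sc[members] - user_sc, axis=1)
    return cluster, members[np.argsort(dists)[:n_recs]]

# Use an existing row as the "user entry" so the result is easy to sanity-check
cluster, recs = recommend(X_demo[0])
print(cluster, recs)
```

With the real data, the returned indices could be used to look up `Scientific_Name_x` in `X`. A user entry would also need the same columns, in the same order, as the data the scaler was fitted on.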
