# Plant Recommender Project

## Cluster Modeling

In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans, DBSCAN, SpectralClustering

#### DBSCAN

Clustering plays an important part in this project: it provides the basis for the suggestion engine later. To build the best possible clustering model, I need a rough estimate of the number of clusters in the data. Because the data is high-dimensional, I can't judge that number by eye, so I will estimate it with `DBSCAN` clustering, which does not need the number of clusters specified up front:

In [33]:
# Load in the data and scale it
df = pd.read_csv('../datasets/cleaned-data.csv')
df.dropna(inplace=True)

In [34]:
X = df.drop(columns=['id', 'Scientific_Name_x'])
species = df[['id', 'Scientific_Name_x']]

sc = StandardScaler()

X_sc = sc.fit_transform(X)

In [25]:
X.isnull().sum().any()

False

In [21]:
# Fit a DBSCAN model
db = DBSCAN(eps=10, min_samples=2)
db.fit(X_sc)

DBSCAN(eps=10, min_samples=2)

In [22]:
# Find the number of clusters and look at the silhouette score
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_clusters_

68

In [23]:
silhouette_score(X_sc, db.labels_)

0.1460248480136029

This score is not great, but that matters little here: the model was only meant to give a best estimate of the number of clusters needed to represent the data well. The 68 clusters found here, rounded to about 70, give a starting value of `n_clusters` for testing other clustering models, such as KMeans.
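The cluster count above relies on `DBSCAN`'s convention of labelling noise points `-1`. As a minimal sketch of that bookkeeping, the helper below (the name `summarize_labels` and the toy label array are assumptions for illustration, not part of the notebook) also reports how many points were treated as noise:

```python
import numpy as np

def summarize_labels(labels):
    """Return (n_clusters, n_noise) for DBSCAN-style labels, where -1 marks noise."""
    labels = np.asarray(labels)
    n_noise = int(np.sum(labels == -1))
    # Noise is not a cluster, so exclude -1 before counting distinct labels
    n_clusters = int(np.unique(labels[labels != -1]).size)
    return n_clusters, n_noise

# Toy labels (an assumption for illustration, not output from the model above)
toy_labels = [0, 0, 1, -1, 1, 2, -1]
n_clusters, n_noise = summarize_labels(toy_labels)
print(n_clusters, n_noise)
```

With `min_samples=2`, any two points within `eps` of each other can form a cluster, so checking the noise count alongside the cluster count may help judge how much weight to give the estimate.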

#### KMeans

To see if this silhouette score can be improved upon, let's try to use a KMeans clustering model:

In [24]:
km = KMeans(n_clusters=70, random_state=42)
km.fit(X_sc)

KMeans(n_clusters=70, random_state=42)

In [25]:
silhouette_score(X_sc, km.labels_)

0.049270928938044745

This actually performed worse than the `DBSCAN` model (about 0.05 versus 0.15), so I will set it aside for now in favor of a better one.
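For reference when comparing these scores: the silhouette coefficient of a single sample $i$ compares $a(i)$, its mean distance to the other members of its own cluster, with $b(i)$, its mean distance to the members of the nearest other cluster. `silhouette_score` reports the mean of this value over all samples:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
$$

Values near 1 suggest compact, well-separated clusters, while values near 0 suggest overlapping clusters, which is roughly where both scores so far sit.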

#### Spectral Clustering

In pursuit of a better silhouette score, I will now try `SpectralClustering`, another clustering model from `scikit-learn`:

In [27]:
spc = SpectralClustering(n_clusters=70)
spc.fit(X_sc)

SpectralClustering(n_clusters=70)

In [30]:
silhouette_score(X_sc, spc.labels_)

0.31690229242790374

This is by far the best performance I've gotten out of a clustering model thus far. Perhaps this can even be improved upon by tuning the model's hyperparameters:

In [97]:
# Iterate over different combinations of hyperparameters
def spec_clustering_tuner(data, n_clusters=(6, 10), eigen_solvers=None,
                          gammas=(1,),
                          assign_labels=None, n_inits=(10,), cores=-1):
    """Fit a SpectralClustering model on scaled data for each combination of hyperparameters.

    Returns a (silhouette score, get_params method) tuple for the best-scoring model.
    eigen_solvers and assign_labels are accepted but not tuned over yet.
    """
    # Map each silhouette score to the parameters of the model that produced it
    params = {}

    # Fit and score a new model for each combination
    for gamma in gammas:
        for cluster in n_clusters:
            for num in n_inits:
                model = SpectralClustering(n_clusters=cluster, gamma=gamma, n_init=num,
                                           n_jobs=cores)
                model.fit(data)
                score = silhouette_score(data, model.labels_)

                # If two models tie, keep the parameters of the first one
                if score in params:
                    continue

                # Stored as the bound method; its repr shows the non-default parameters
                params[score] = model.get_params

    # Return the (score, parameters) pair with the highest score
    return max(params.items(), key=lambda item: item[0])

In [47]:
best = spec_clustering_tuner(data=X_sc, n_clusters=[5, 20, 70, 90],
                             gammas=[0.5, 1], n_inits=[5, 20])


In [48]:
best

(0.5184634551759015,
 <bound method BaseEstimator.get_params of SpectralClustering(gamma=0.5, n_clusters=5, n_init=5, n_jobs=-1)>)

Tuning this clustering model raised the silhouette score further, from about 0.32 to about 0.52, with `gamma=0.5`, `n_clusters=5`, and `n_init=5`. This is the model I'll use for the recommendation engine.
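The nested loops in the tuner can also be written as one pass over a parameter grid. The sketch below is one possible shape, not the notebook's implementation: `grid_search` and `toy_score` are hypothetical names, and the scoring function is a stand-in so the example runs without fitting any models. It returns the best parameters as a plain dict, which is easier to reuse than a bound `get_params` method.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return (best_score, best_params).

    param_grid maps parameter names to lists of candidate values.
    score_fn takes keyword arguments and returns a number (higher is better).
    """
    names = list(param_grid)
    best_score, best_params = None, None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        # Keep the first combination that reaches the highest score
        if best_score is None or score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Stand-in scoring function (hypothetical). In the notebook this would fit
# SpectralClustering(**params) and return silhouette_score(data, model.labels_).
def toy_score(n_clusters, gamma):
    return -abs(n_clusters - 5) - abs(gamma - 0.5)

best_score, best_params = grid_search(
    {"n_clusters": [5, 20, 70, 90], "gamma": [0.5, 1]}, toy_score
)
print(best_score, best_params)
```

Because the best parameters come back as a dict, they could be unpacked directly into a new estimator, for example `SpectralClustering(**best_params)`.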

## Building the Recommendation System

To build this engine, I will follow [this blog post](https://towardsdatascience.com/build-your-own-clustering-based-recommendation-engine-in-15-minutes-bdddd591d394) on building recommender systems from clustering models.

In [4]:
# Build out the final model based on the best parameters found
spec = SpectralClustering(gamma=0.5, n_clusters=5, n_init=5, n_jobs=-1)
spec.fit(X_sc)

SpectralClustering(gamma=0.5, n_clusters=5, n_init=5, n_jobs=-1)

In [5]:
silhouette_score(X_sc, spec.labels_)

0.5184634551759015

In [32]:
# Bring in the lists of features from the previous notebook to help sort through the X dataframe
categorical_features = ['Category', 'Family', 'Growth_Habit', 'Native_Status',
           'Active_Growth_Period', 'Fall_Conspicuous', 'Flower_Color',
           'Flower_Conspicuous', 'Fruit_Conspicuous', 'Bloom_Period', 'Fire_Resistance']

ordinal_features = ['Toxicity', 'Drought_Tolerance', 'Hedge_Tolerance',
                   'Moisture_Use', 'Salinity_Tolerance', 'Shade_Tolerance', 'Growth_Rate', 'Lifespan']

other_features = ['id', 'Scientific_Name_x', 'pH_Minimum', 'pH_Maximum',
                      'Temperature_Minimum_F']

In [95]:
# Now for a small sample user input - create a function to replicate the streamlit app
def plant_input(df, neighbors):
    # Create dummy entry to feed into the clustering model with the same columns as the cleaned dataset
    dummy = {}
    dummy['id'] = 42
    dummy['Scientific_Name_x'] = 'sample'
    
    # Inputs for simple user-chosen features
    # input() returns strings, so cast each answer to a number before scaling
    dummy['Lifespan'] = float(input('Enter a lifespan'))
    dummy['Toxicity'] = float(input('Enter toxicity value'))
    dummy['Drought_Tolerance'] = float(input('Input drought tolerance'))
    dummy['Hedge_Tolerance'] = float(input('Enter desired hedge tolerance'))
    dummy['Moisture_Use'] = float(input('Enter desired moisture use'))
    
    # Fill in the other columns with dummy values if they are not specified
    for col in df.columns:
        if col not in dummy.keys():
            dummy[col] = np.nan

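    # Fill unspecified columns with 0, concat the dummy entry to the whole dataset, then scale
    df_d = pd.DataFrame(dummy, index=[0])
    df_d.fillna(0, inplace=True)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    data = pd.concat([df, df_d])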
    labels = data[['id', 'Scientific_Name_x']]
    data.drop(columns=['id', 'Scientific_Name_x'], inplace=True)
    data_sc = sc.transform(data)
    
    # SpectralClustering has no predict method, so refit on all of the data, including the dummy entry
    spec.fit_predict(data_sc)
    data['cluster'] = spec.labels_
    out_cluster = spec.labels_[-1]
    
    # Recombine the data with the label features
    data = pd.concat([data, labels], axis=1)
    
    # Filter down to the rows that share a cluster with the dummy entry
    output = data.loc[data['cluster'] == out_cluster]
    
    # Sample from the filtered dataset
    return output[['id', 'Scientific_Name_x']].sample(neighbors)

In [96]:
plant_input(df, 10)

Enter a lifespan 3
Enter toxicity value 0
Input drought tolerance 1
Enter desired hedge tolerance 1
Enter desired moisture use 3


| | id | Scientific_Name_x |
|---|---|---|
| 673 | 54390 | Melica californica Scribn. |
| 745 | 62519 | Panicum virgatum L. |
| 35 | 1641 | Agrostis capillaris L. |
| 44 | 1948 | Agastache parvifolia Eastw. |
| 969 | 79947 | Setaria italica (L.) P. Beauv. |
| 575 | 45788 | Juniperus osteosperma (Torr.) Little |
| 623 | 50121 | Ligustrum ovalifolium Hassk. |
| 596 | 46452 | Lavatera assurgentiflora Kellogg |
| 357 | 22299 | Corylus cornuta Marshall |
| 868 | 72365 | Quercus laevis Walter |


The function above is a simple but effective proof of concept. Because the query specified only five features, the suggestions the model made here don't make a lot of sense. In the final version of this model, however, users will be able to make much more precise queries to narrow down their searches. To see this in action, look at [`plant_recommender.py`](./plant_recommender.py), a script that runs on [`streamlit`](https://streamlit.io/).
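Because `SpectralClustering` has no `predict` method, `plant_input` refits the model on every query. One possible alternative, sketched here as an assumption rather than as the method used in this project, is to keep the fitted labels and assign the query to the cluster with the nearest centroid in scaled feature space. This is a simplification that may not always agree with the spectral cluster boundaries. The function name `recommend_from_nearest_cluster` and the toy arrays are hypothetical.

```python
import numpy as np

def recommend_from_nearest_cluster(X_scaled, labels, query_scaled, n, seed=0):
    """Return row positions of up to n items from the cluster whose centroid is closest to the query."""
    X_scaled = np.asarray(X_scaled, dtype=float)
    labels = np.asarray(labels)
    query_scaled = np.asarray(query_scaled, dtype=float)

    # Mean of each cluster's members in the scaled feature space
    cluster_ids = np.unique(labels)
    centroids = np.array([X_scaled[labels == c].mean(axis=0) for c in cluster_ids])

    # Euclidean distance from the query to every centroid
    distances = np.linalg.norm(centroids - query_scaled, axis=1)
    nearest = cluster_ids[np.argmin(distances)]

    # Sample without replacement, capped at the cluster size
    members = np.flatnonzero(labels == nearest)
    rng = np.random.default_rng(seed)
    return rng.choice(members, size=min(n, members.size), replace=False)

# Toy data (an assumption for illustration): two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
labels_toy = np.array([0, 0, 1, 1, 1])
picks = recommend_from_nearest_cluster(X_toy, labels_toy, [5.0, 5.1], n=2)
print(sorted(picks.tolist()))
```

In the notebook's terms, `X_scaled` would correspond to `X_sc`, `labels` to `spec.labels_`, and the returned positions could be passed to `species.iloc[...]` to look up the plant names.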