# Capstone Check-In #2 

## Capstone Topic

I have decided to go with the plant recommender system as my capstone project. This recommender system will utilize clustering of plant species from North America to make recommendations based on user inputs. These inputs will include:
- Drought tolerance
- Mature height
- Minimum and maximum soil pH
- Commercial availability

The goal is for the system to return plants whose characteristics closely match the user's inputs. Users will interact with the recommender through a web application, most likely a Flask app deployed on Heroku so that the system can be hosted in the cloud.
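
As a rough sketch of how the inputs listed above might be represented and validated before they reach the recommender, the snippet below uses only the standard library. The class name, field names, height unit, and drought tolerance labels are all assumptions for illustration; the real categories should be taken from the PLANTS data.

```python
from dataclasses import dataclass

# Assumed category labels; the real values should come from the PLANTS data.
DROUGHT_LEVELS = ("None", "Low", "Medium", "High")


@dataclass(frozen=True)
class PlantQuery:
    """User inputs for one recommendation request (field names are illustrative)."""
    drought_tolerance: str
    mature_height_ft: float
    ph_min: float
    ph_max: float
    commercially_available: bool

    def __post_init__(self):
        # Reject inputs that cannot describe a real plant.
        if self.drought_tolerance not in DROUGHT_LEVELS:
            raise ValueError(f"drought_tolerance must be one of {DROUGHT_LEVELS}")
        if self.mature_height_ft <= 0:
            raise ValueError("mature_height_ft must be positive")
        if not 0 <= self.ph_min <= self.ph_max <= 14:
            raise ValueError("need 0 <= ph_min <= ph_max <= 14")


def parse_query(form: dict) -> PlantQuery:
    """Convert raw form strings (e.g. from a web request) into a validated query."""
    available = form.get("commercially_available", "").lower()
    return PlantQuery(
        drought_tolerance=form["drought_tolerance"].strip().title(),
        mature_height_ft=float(form["mature_height_ft"]),
        ph_min=float(form["ph_min"]),
        ph_max=float(form["ph_max"]),
        commercially_available=available in ("yes", "true", "on"),
    )
```

Keeping validation separate from the web framework means the same function could be reused whether the front end ends up being Flask or something else.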

## Data Collection

The main source for my data will be the [USDA PLANTS Database](https://plants.sc.egov.usda.gov/home). This website has a wealth of information about many different plant species and is open to public use, unlike many other plant databases. The data is available as a CSV file that can be easily imported using pandas or a similar library:

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('../datasets/sample_small.csv')
df.head()

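| | Accepted Symbol | Synonym Symbol | Scientific Name | PLANTS Floristic Area | Active Growth Period | After Harvest Regrowth Rate | Bloat | C:N Ratio | Coppice Potential | Fall Conspicuous | ... | Lumber Product | Naval Store Product | Nursery Stock Product | Palatable Browse Animal | Palatable Graze Animal | Palatable Human | Post Product | Protein Potential | Pulpwood Product | Veneer Product |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABELI | NaN | Abelia | NA (L48) | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | ABGR4 | NaN | Abelia ×grandiflora | NA (L48) | Spring and Summer | NaN | NaN | High | No | Yes | ... | No | No | Yes | Low | NaN | No | No | Low | No | No |
| 2 | ABELM | NaN | Abelmoschus | NA (L48), HI, PR, VI | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | ABES | NaN | Abelmoschus esculentus | NA (L48), PR, VI | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | ABMA9 | NaN | Abelmoschus manihot | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |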


In [3]:
df.isnull().sum()

Accepted Symbol             0
Synonym Symbol           9238
Scientific Name             1
PLANTS Floristic Area    2500
Active Growth Period     8778
                         ... 
Palatable Human          8765
Post Product             8765
Protein Potential        8915
Pulpwood Product         8765
Veneer Product           8765
Length: 87, dtype: int64

As you can see, there are several issues with this data:
1. Most columns contain thousands of null values (for example, 8,778 in Active Growth Period). These occur when a plant species does not have additional information, which is usually because the plant is a niche cultivar or a rare species. Entries with no characteristic data will have to be dropped.
2. This dataset is only a small subset of the overall data. It is a sample query I found on the [USDA's API GitHub page](https://github.com/USDA/USDA-APIs).
3. The data is largely categorical, with a few exceptions, so most columns will need to be one-hot encoded.
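
The cleaning and encoding steps implied by the issues above could look roughly like the sketch below. It assumes that the first four columns shown in the output identify a plant rather than describe it, and that every other column counts as characteristic data. The function name `clean_and_encode` is hypothetical.

```python
import pandas as pd

# Assumption: these columns identify a plant rather than describe it.
ID_COLUMNS = [
    "Accepted Symbol",
    "Synonym Symbol",
    "Scientific Name",
    "PLANTS Floristic Area",
]


def clean_and_encode(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with no characteristic data, then one-hot encode categoricals."""
    trait_columns = [c for c in df.columns if c not in ID_COLUMNS]

    # Keep a row only if at least one characteristic column is filled in.
    kept = df.dropna(subset=trait_columns, how="all")

    # Index by symbol so each encoded row can be mapped back to a species.
    traits = kept.set_index("Accepted Symbol")[trait_columns]

    # get_dummies encodes string columns and leaves numeric columns unchanged.
    # A remaining null in a categorical column becomes a row of all zeros
    # across that column's indicator columns.
    return pd.get_dummies(traits, dtype=float)
```

Numeric columns that still contain nulls after this step would need to be imputed or dropped before clustering, since most clustering implementations do not accept missing values.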


Another issue is the USDA API itself. The API is not well documented, so it will take me a while to learn how to use it. This also assumes that the API is still functioning, which it may not be since the USDA updated the PLANTS database in May 2021. If it is not, I will need to scrape the PLANTS database instead, which would be a significant undertaking.

## Goals and Timeline

For this project to be successful, I will need to accomplish the following tasks:

1. Fully understand the PLANTS API, or scrape the database if the API is not available. This needs to be done ASAP.
2. Clean the data by removing the entries with no characteristic data, which will leave **2163** entries in the overall dataset. This should be done by the end of next week, 7/31/2021.
3. Cluster the data and create a recommender system based on the clustering. I will be attempting several different clustering methods, but will probably start with **DBSCAN** because it does not require the number of clusters to be specified in advance, and that number is not easy to determine here. Most of the time in this part of the project will be spent researching how to create a recommender system from a clustering model. This should be done by 8/6/2021, the end of the week two weeks from now.
4. The rest of the available time for this project will be spent researching and testing an input system for the recommender. This is the most open-ended part of the project and will take the most research, so I will leave the most time for it.
5. The final days before the project is due will be reserved for writing my reports and preparing my presentation.
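
One possible way to turn a DBSCAN clustering into recommendations is sketched below using scikit-learn. This is only an illustration under stated assumptions, not a settled design: the function names, the `eps` and `min_samples` defaults, and the nearest-neighbor lookup are all placeholders. Because scikit-learn's `DBSCAN` has no `predict` method, the sketch assigns a query to the cluster of its nearest clustered point and ranks that cluster's members by distance.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler


def fit_clusters(features, eps=0.5, min_samples=5):
    """Scale the features and cluster them; label -1 marks DBSCAN noise points."""
    scaler = StandardScaler().fit(features)
    scaled = scaler.transform(features)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scaled)
    return scaler, scaled, labels


def recommend(query, scaler, scaled, labels, names, top_n=5):
    """Return up to top_n names from the cluster of the point nearest the query."""
    q = scaler.transform(np.asarray(query, dtype=float).reshape(1, -1))[0]
    dist = np.linalg.norm(scaled - q, axis=1)

    clustered = np.flatnonzero(labels != -1)
    if clustered.size == 0:
        return []  # everything was labelled noise; eps/min_samples need tuning

    # Borrow the cluster label of the nearest non-noise point.
    nearest = clustered[np.argmin(dist[clustered])]
    members = np.flatnonzero(labels == labels[nearest])

    # Rank the members of that cluster by distance to the query.
    ranked = members[np.argsort(dist[members])]
    return [names[i] for i in ranked[:top_n]]
```

Here `query` would be the user's inputs encoded the same way as the rows of the feature matrix, and `names` would hold the plant symbols or scientific names in the same row order.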