# Create Datasets
## Plant Lists
- I scraped a list of the best-performing perennials in the Midwest from midwestgardentips.com
- I scraped a list of common weeds in Illinois from preen.com, which lists common weeds state by state

## Photo Collection
- I scraped the weed photos available on preen.com
- I attempted to scrape photos of both the perennials and the weeds from garden.org, but garden.org blocked me for excessive scraping before I could collect all of the photos
    - As a backup, I scraped photos of the perennials from the Missouri Botanical Garden and photos of the weeds from UMass (the code for both is provided in separate notebooks)


In [68]:
# standard imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

# webscrape
import requests
from requests import get
from bs4 import BeautifulSoup
import urllib.request  # urlretrieve lives in the urllib.request submodule
from time import sleep

In [2]:
# Paths to store data
perennial_path = os.path.join(os.pardir, os.pardir, 'data', 'perennials')
weed_path = os.path.join(os.pardir, os.pardir, 'data', 'weeds')
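
`urllib.request.urlretrieve` does not create missing directories, so both folders need to exist before any photos are saved. A minimal sketch, assuming it is acceptable to create them on the fly; the temporary base directory and the `ensure_dir` helper are stand-ins for illustration, not part of the original notebook:

```python
import os
import tempfile

def ensure_dir(path):
    """Create `path` (and any missing parents) if needed; return it unchanged."""
    os.makedirs(path, exist_ok=True)
    return path

# Hypothetical base directory standing in for the project's data folder
base = tempfile.mkdtemp()
demo_perennial_path = ensure_dir(os.path.join(base, 'data', 'perennials'))
demo_weed_path = ensure_dir(os.path.join(base, 'data', 'weeds'))

# Calling it again is harmless because of exist_ok=True
ensure_dir(demo_weed_path)
```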

## midwestgardentips.com

In [3]:
# Scrape names of best performing perennials from Midwest Gardening site
perennial_url = 'https://www.midwestgardentips.com/best-performing-perennials-1'
response = get(perennial_url)
html = response.text
soup = BeautifulSoup(html, 'lxml')

# Names of plants appear to be bolded (i.e., 'strong') and italicized (i.e., 'em')
p_list = [a.text for a in (strong.find('em') for strong in soup.find_all('strong')) if a]

# Keep only the text before the first colon (the plant name)
perennials = [name.split(':')[0] for name in p_list]

# Remove mislabeled text from list of perennials
perennials.remove('y.')
perennials.remove('Full to part sun\xa0 Hardy in zones ')
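
`list.remove` raises a `ValueError` when the value is absent, so the two hard-coded removals above will fail if the page text changes. One hedged alternative is to filter against a set of known junk fragments. `JUNK` and `clean_names` are hypothetical names, and the sample list is made up for illustration:

```python
# Fragments picked up by the bold+italic selector that are not plant names
JUNK = {'y.', 'Full to part sun\xa0 Hardy in zones '}

def clean_names(raw_names):
    """Keep the text before the first colon and drop known junk fragments."""
    names = [text.split(':')[0] for text in raw_names]
    return [name for name in names if name not in JUNK]

# Made-up sample mimicking the scraped strings
sample = ['Coneflower: long-blooming', 'y.',
          'Full to part sun\xa0 Hardy in zones ', 'Hosta']
print(clean_names(sample))
```

Unlike `remove`, the filter is a no-op when a junk fragment is not present.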

## preen.com

In [5]:
# Build the list of weed names and
# scrape the weed photos from the preen site

weed_url = 'https://www.preen.com/weeds/il'
response = get(weed_url)
html = response.text
soup = BeautifulSoup(html, 'lxml')
div = soup.find(id = 'WeedList')
w_list = div.find_all('a')
weeds = []
for i in range(len(w_list)):
    # Create list of weed names
    text = w_list[i].find('img').attrs['alt']  
    weeds.append(text)
    
    # Scrape photos from site
    photo_url = 'https://www.preen.com' + w_list[i].attrs['href']
    photo_response = get(photo_url)
    photo_html = photo_response.text
    photo_soup = BeautifulSoup(photo_html, 'lxml')
    photo_div = photo_soup.find(id = 'imagePicker')
    photo_list = photo_div.find_all('a')
    for j in range(len(photo_list)):
        photo_url = 'https:' + photo_list[j].attrs['href'].replace(' ', '%20')
        
        # To account for photos that were removed from the site
        if get(photo_url).status_code != 404:
            
            # "pr" suffix to indicate photos were scraped from preen site
            path = os.path.join(weed_path, text.lower().replace(' ', '_') + '_pr')
            urllib.request.urlretrieve(photo_url, path + '_' + str(j) + '.jpg')
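
The URL-escaping and file-naming steps in the loop above can be pulled into small helpers, which makes them easy to check without hitting the site. This is a sketch only: `preen_photo_url` and `preen_photo_path` are hypothetical names, and the example href and folder are invented.

```python
import os

def preen_photo_url(href):
    """Build a full photo URL from a protocol-relative href, escaping spaces."""
    return 'https:' + href.replace(' ', '%20')

def preen_photo_path(folder, weed_name, index):
    """Build '<folder>/<weed_name>_pr_<index>.jpg'; 'pr' marks the preen site."""
    stem = weed_name.lower().replace(' ', '_') + '_pr'
    return os.path.join(folder, stem + '_' + str(index) + '.jpg')

# Invented inputs for illustration
print(preen_photo_url('//example.com/images/Canada Thistle 1.jpg'))
print(preen_photo_path('weeds', 'Canada Thistle', 0))
```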

## garden.org
I created a function (`enter_url`) that prompts with the name of each perennial/weed, one at a time. The first prompt lets the user skip the plant if garden.org has no photos for it. If photos exist, the user enters the garden.org search-results URL for the plant. The function then scrapes every photo from the result pages where the plant name appears in the page header or among the plant's common names. Each photo is saved with the plant name as part of the filename.

In [64]:
# Function to get soup from garden.org site
def get_soup(url):
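    # Need to "fake a browser visit" by providing a user-agent header for garden.org
    # NOTE: `proxy` is not defined in this notebook; it must be set to a proxy URL string before calling
    response = requests.get(url, headers = {'User-Agent' : 'test'}, proxies = {'http' : proxy, 'https' : proxy})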
    html = response.text
    soup = BeautifulSoup(html, 'lxml')
    return soup

# garden.org provides a "results page" when searching for a plant
# Each "result" includes a link that provides photos for the plant
# Function goes to the URL for each result on page, and calls the "add_plant" function
def get_results(result_soup, plant_name, count, weed):
    find_plants_results = result_soup.find('table')
    plants_results = find_plants_results.find_all('tr')
    # Create list of URLs for each result
    for k in range(len(plants_results)): # For each result
        plant_url = 'https://garden.org' + plants_results[k].find('a').attrs['href']
        # Count keeps track of the number of photos for each plant
        count = add_plant(plant_url, plant_name, count, weed)
    sleep(1)
    return (count)

# Function adds all photos from each "result"
# "Results" include plants that contain the search term
# Only plants that match the name of the search term 
# as a "common name" for the plant or in the header of the page are included
def add_plant(plant_url, plant_name, count, weed):
    soup = get_soup(plant_url)
    if weed:
        path = os.path.join(weed_path, plant_name)
    else:
        path = os.path.join(perennial_path, plant_name)
    
    # Create list of common names
    tables = soup.find_all('table')
    common_names_table = None
    for j in range(len(tables)):
        if tables[j].find('caption'):
            if 'common' in tables[j].find('caption').text.lower():
                common_names_table = tables[j]
    common_names_list = []
    if common_names_table:
        common_names = common_names_table.find_all('tr')
        for k in range(len(common_names)):
            common_names_list.append(common_names[k].find('td').find_next_sibling().text.strip().lower())
                
    # Add names in header to list of common names
    header_names = soup.find('h1', {'class' : 'page-header'}).text.lower()
    header_names = header_names.replace('(', '→').replace(')', '').split('→')
    # Strip whitespace so 'name (botanical)' yields 'name', not 'name '
    common_names_list += [name.strip() for name in header_names]
    
    # If search term is in header or list of common names, add photos
    if plant_name.replace('_', ' ') in common_names_list:
        photo_gallery = soup.find_all('div', {'class' : 'plant_thumbbox'})
        for i in range(len(photo_gallery)):
            photo_url = 'https://garden.org' + photo_gallery[i].find('a').find('img').attrs['src']
            if get(photo_url).status_code != 404:
                urllib.request.urlretrieve(photo_url, path + '_' + str(count) + '.jpg')
                count += 1
    return (count)
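
The matching rule used by `add_plant` can be illustrated on plain strings. A minimal sketch, assuming page headers look like `Common Name (Botanical name)`; the helper names and the sample header are hypothetical:

```python
import re

def split_header(header_text):
    """Split 'Common Name (Botanical name)' into lowercase, stripped names."""
    parts = re.split(r'[()]', header_text.lower())
    return [part.strip() for part in parts if part.strip()]

def is_match(plant_name, common_names, header_text):
    """True if the search term (underscores for spaces) is a common or header name."""
    candidates = [name.strip().lower() for name in common_names]
    candidates += split_header(header_text)
    return plant_name.replace('_', ' ') in candidates

header = 'Black-eyed Susan (Rudbeckia hirta)'
print(split_header(header))
print(is_match('black-eyed_susan', [], header))
print(is_match('susan', ['Gloriosa Daisy'], header))
```

The comparison is an exact match against each name, not a substring test, so a search for `susan` does not pull in photos from every page that merely mentions it.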


In [None]:
# Find URLs for results page and pull all plants for each result
# "data" is the list of perennials or weeds
# "weed" indicates whether the plant is a weed (weed=True) or not
def enter_url(data, weed):
    for l in range(len(data)):
        print('Plant:  ', data[l])
        add_photos = input('Add Photos? (Y/N):  ')
        if (add_photos == 'Y') or (add_photos == 'y'):
            plants_url = input('Enter garden.org url:  ')
            plant_name = plants_url.split('=')[-1].replace('+', '_')
            count = 0 # Running photo count, used to number the saved files
            
            # Go to URL for each result on page, and add plants from each
            plant_soup = get_soup(plants_url)
            count = get_results(plant_soup, plant_name, count, weed)
            
            # Check if there are additional results pages
            # The element after the active-page marker links to the next page, if one exists
            query = plant_soup.find('span', {'class' : 'PageActive'})
            next_page = query.find_next_sibling() if query else None
            # Stop when there is no next element or it is not a link
            while next_page is not None and next_page.get('href'):
                next_url = 'https://garden.org' + next_page.attrs['href'] # Go to next page of results
                plant_soup = get_soup(next_url)
                count = get_results(plant_soup, plant_name, count, weed)
                query = plant_soup.find('span', {'class' : 'PageActive'})
                next_page = query.find_next_sibling() if query else None
    return
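
`enter_url` derives `plant_name` from the text after the last `=` in the pasted URL. A sketch of the same idea using `urllib.parse`, which also decodes percent-escapes; the URLs shown are hypothetical examples, and `name_from_search_url` is not part of the notebook:

```python
from urllib.parse import urlsplit, parse_qsl

def name_from_search_url(url):
    """Return the last query-string value with spaces turned into underscores."""
    pairs = parse_qsl(urlsplit(url).query)
    if not pairs:
        raise ValueError('URL has no query string: ' + url)
    value = pairs[-1][1]  # parse_qsl decodes '+' and %XX escapes to spaces
    return value.strip().replace(' ', '_')

# Hypothetical search URL; the real path and parameter name may differ
print(name_from_search_url('https://example.org/search?q=black+eyed+susan'))
```

Raising on a URL without a query string surfaces a mistyped paste immediately, rather than saving files under an unintended name.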

In [None]:
enter_url(perennials, False)

In [24]:
enter_url(weeds, True)