<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-libraries" data-toc-modified-id="Load-libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load libraries</a></span></li><li><span><a href="#Load-webpage" data-toc-modified-id="Load-webpage-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load webpage</a></span></li><li><span><a href="#Scraping" data-toc-modified-id="Scraping-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Scraping</a></span><ul class="toc-item"><li><span><a href="#How-many-hybrid-oaks-registered-in-their-system?" data-toc-modified-id="How-many-hybrid-oaks-registered-in-their-system?-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>How many hybrid oaks registered in their system?</a></span></li><li><span><a href="#Extract-species-name-(hybrid-and-parents)-into-a-dataframe" data-toc-modified-id="Extract-species-name-(hybrid-and-parents)-into-a-dataframe-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Extract species name (hybrid and parents) into a dataframe</a></span></li></ul></li></ul></div>

# Load libraries

In [1]:
import pandas as pd
import numpy as np
import re
from bs4 import BeautifulSoup as bs
import networkx as nx
from pyvis.network import Network
import matplotlib.pyplot as plt

# Load webpage

The webpage is the [USDA Plants Database](https://plants.sc.egov.usda.gov/java/). This website is quite useful because it contains standardized information about many plant species in the US.

There are more than 500 species of oaks in the [world](http://plantsoftheworldonline.org/taxon/urn:lsid:ipni.org:names:325819-2), and the US has about 90 oak [species](https://en.wikipedia.org/wiki/Oak).

I searched for *Quercus*, which is the genus name of oak, and it yielded 377 records. My **goal** is to extract the hybrid oak species to obtain their parents.

[How to import an html file in bs](http://zetcode.com/python/beautifulsoup/)

In [2]:
## htm file path
webpage_file = "usda_plants_quercus_search.htm"

## open
with open(webpage_file, 'r') as f:
    contents = f.read()
    soup = bs(contents, 'lxml')
    
## print
# print(soup.prettify())

# Scraping

In [3]:
## inspect the page, find the main table
## find good tags to extract information
result_table = soup.find_all(['tr','td','em'], attrs={'class':'rowon'})

# result_table

There is a lot of information in `result_table`.

The hybrid species have `hybrid oak` as their common name,   
and the parent species are written like so: *Q. xhybrid \[parent_species1 x parent_species2\]*
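
To make that naming pattern concrete, here is a minimal sketch that parses a string of the form `Q. ×hybrid [parent1 × parent2]` with `re`. The helper name `parse_hybrid`, the regular expression, and the sample strings are assumptions for illustration only; on the actual page these parts sit in separate tags, so this is not the extraction used further down.

```python
import re

# Assumed pattern: "<genus> ×<hybrid> [<parent1> × <parent2>]".
# Either the multiplication sign or a plain "x" is accepted as the cross marker.
HYBRID_RE = re.compile(
    r"""
    ^\s*(?:Quercus|Q\.)\s*      # genus, full or abbreviated
    [×x]\s*(?P<hybrid>[\w-]+)   # hybrid epithet after the cross marker
    \s*\[\s*
    (?P<parent1>[\w-]+)         # first parent epithet
    \s+[×x]\s+
    (?P<parent2>[\w-]+)         # second parent epithet
    \s*\]\s*$
    """,
    re.VERBOSE,
)


def parse_hybrid(name):
    """Return (hybrid, parent1, parent2) with a 'Q. ' prefix, or None if no match."""
    m = HYBRID_RE.match(name)
    if m is None:
        return None
    return tuple("Q. " + m.group(k) for k in ("hybrid", "parent1", "parent2"))


print(parse_hybrid("Q. ×bebbiana [alba × macrocarpa]"))
print(parse_hybrid("Quercus alba"))  # not a hybrid, so no match
```

A non-hybrid name simply returns `None`, which makes it easy to filter rows before building a table.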

## How many hybrid oaks registered in their system?

In [4]:
## the output of find_all is a list, so we can use len()
## using re.IGNORECASE to account for potential capitalization

result_hybrids = soup.find_all('td', string = re.compile("hybrid oak", re.IGNORECASE))

print(f'There are {len(result_hybrids)} hybrid oak species in the USDA Plants Database.')

There are 81 hybrid oak species in the USDA Plants Database.
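
One caveat when counting this way: `find_all('td', string=...)` only matches a cell whose `.string` is set, meaning the cell holds a single text node. A cell such as `<td><b>Hybrid</b> oak</td>` would be missed. The sketch below shows the difference on made-up HTML (not the USDA markup, which may differ); the names `soup_demo`, `by_string`, and `by_text` are illustrative, and it uses the built-in `html.parser` so that `lxml` is not required.

```python
import re
from bs4 import BeautifulSoup

# Made-up rows for illustration; the real USDA markup may differ.
html = """
<table>
  <tr class="rowon"><td><em>Quercus</em> ×<em>bebbiana</em></td><td>hybrid oak</td></tr>
  <tr class="rowon"><td><em>Quercus alba</em></td><td>white oak</td></tr>
  <tr class="rowon"><td><em>Quercus</em> ×<em>ashei</em></td><td><b>Hybrid</b> oak</td></tr>
</table>
"""
soup_demo = BeautifulSoup(html, "html.parser")
pattern = re.compile("hybrid oak", re.IGNORECASE)

# Matches only cells whose sole child is a text node
by_string = soup_demo.find_all("td", string=pattern)

# Matching on the flattened text also catches cells with nested tags
by_text = [td for td in soup_demo.find_all("td") if pattern.search(td.get_text())]

print(len(by_string), len(by_text))
```

If both counts agree on the real page, the simpler `string=` search is sufficient.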


## Extract species name (hybrid and parents) into a dataframe

Use `re` to search each row and get the parents of each hybrid species.

There are `em` and `i` tags. The `em` tags have the hybrid species name, and the `i` tags have the parent species. There are **two** exceptions to this rule, though:

1. There is one hybrid species that doesn't have a name, so the species name is just the names of both parents `== alba × virginiana`.  
2. There is another hybrid species that has a name, but there are two potential crossings that can result in that hybrid `== bicolor x (muehlenbergii, prinoides)`. At least, I think that's what this means...
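
As a rough sketch of how these two exceptions can be normalized, the function below turns one scraped row into one or more `[hybrid, parent1, parent2]` rows. The function name `expand_row`, the placeholder `Q. hybridname`, and the assumption that the unnamed hybrid arrives as a single-element row are all illustrative and not taken from the page.

```python
def expand_row(row):
    """Turn one scraped row into a list of [hybrid, parent1, parent2] rows."""
    if len(row) == 3:
        # regular case: already hybrid + two parents
        return [list(row)]

    if len(row) == 1:
        # exception 1: unnamed hybrid, e.g. "Q. alba × virginiana"
        genus, first, _cross, second = row[0].split()
        return [["Q. unknown", f"{genus} {first}", f"{genus} {second}"]]

    if len(row) == 4:
        # exception 2: one fixed parent and two alternative second parents
        cleaned = [s.replace("(", "").replace(")", "") for s in row]
        hybrid, parent1, *alternatives = cleaned
        return [[hybrid, parent1, alt] for alt in alternatives]

    raise ValueError(f"unexpected row shape: {row!r}")


rows = [
    ["Q. bebbiana", "Q. alba", "Q. macrocarpa"],
    ["Q. alba × virginiana"],
    ["Q. hybridname", "Q. bicolor", "Q. (muehlenbergii", "Q. prinoides)"],  # placeholder name
]
normalized = [out for row in rows for out in expand_row(row)]
for r in normalized:
    print(r)
```

Returning a list from every branch keeps the caller simple: each input row contributes zero or more output rows, and nothing is removed from a list while it is being iterated.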

In [5]:
list_hybrids_parents = []

for res in result_table:

    temp = res.find(string=re.compile("hybrid oak", re.IGNORECASE))

    if temp is not None:
        # print(res)

        # collect the <em>/<i> tags in this row (a ResultSet)
        hybrid = res.find_all(['em', 'i'])
        # prefix each name with the genus abbreviation and drop the first match
        list_hybrids_parents.append(["Q. " + x.string.strip() for x in hybrid][1:])


# Handle the two exceptions. New lists are built instead of popping from the
# list being iterated, because popping can skip the row right after the pop.
regular_rows = []
exception_rows = []

for row in list_hybrids_parents:

    if len(row) > 3:  # bicolor x (muehlenbergii, prinoides)
        cleaned = [e.replace('(', '').replace(')', '') for e in row]
        # one row per candidate cross: leave out each alternative parent in turn
        for el in [2, 3]:
            exception_rows.append([x for i, x in enumerate(cleaned) if i != el])

    elif len(row) < 3:  # alba × virginiana
        exception_rows.append(['Q. unknown', 'Q. alba', 'Q. virginiana'])

    else:
        regular_rows.append(row)

# the exception rows go at the end of the list
list_hybrids_parents = regular_rows + exception_rows

# list_hybrids_parents
df = pd.DataFrame(list_hybrids_parents, columns=['hybrid', 'parent1', 'parent2'])
df.head()

|   | hybrid | parent1 | parent2 |
|---|--------|---------|---------|
| 0 | Q. acutidens | Q. cornelius-mulleri | Q. engelmannii |
| 1 | Q. ashei | Q. incana | Q. laevis |
| 2 | Q. atlantica | Q. incana | Q. laurifolia |
| 3 | Q. bebbiana | Q. alba | Q. macrocarpa |
| 4 | Q. beckyae | Q. macrocarpa | Q. prinoides |


In [6]:
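# undirected graph: nodes are parent species, each edge carries the hybrid name
G = nx.from_pandas_edgelist(df, 'parent1', "parent2", ["hybrid"])
# nx.draw(G)
# save the graph in GraphML format
nx.write_graphml(G, "oak_hybridization.graphml")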

nt = Network("1000px", "1000px")
nt.from_nx(G)
nt.show("nx.html")