This notebook describes how I scraped [ocpdb.pythonanywhere.com](http://ocpdb.pythonanywhere.com) with `urllib.request` and `requests`, extracted the data with BeautifulSoup, and saved it to a CSV file.

The Open Cannabis Project (OCP) closed as of May 31, 2019, and its data is stored at ocpdb.pythonanywhere.com. The data is meant to serve as evidence of prior art and as defensive documentation, which was one of OCP's main goals.

From [opencannabisproject.org](https://web.archive.org/web/20190529203529/https://opencannabisproject.org/):

> One very important reason to create an open data set is to create evidence of prior art, which helps to ensure that patents are not issued on plants that already exist. This need became apparent in 2015, when [the first utility patent on a whole category of cannabis plants](http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=3&f=G&l=50&co1=AND&d=PTXT&s1=%22Biotech+Institute,+LLC%22&OS=%22Biotech+Institute,+LLC%22&RS=%22Biotech+Institute,+LLC%22) was issued by the USPTO. The plant described in the claims resembles plants that people had grown before; evidence that could counter the patent has reportedly been rejected because it had not been published publicly.
>
> **We never want to see this happen again.**
>
> This is why creating prior art for cannabis is so important.
>
> ![path of a patent](path-of-a-patent-opencannabisproject-insert-prior-art-2.png)
>
> The USPTO and other patenting bodies internationally can’t legally grant patents that would cover existing varieties – but there has to be proof that these varieties DO exist. We can block this process by providing this proof: documenting genetic and chemical data for all of the cannabis varieties in existence today.

The first step is to understand the structure of the web page(s) to be scraped. There are several "views" of the data presented at ocpdb.pythonanywhere.com. The "bigtable" view presents the "chemical" data, generated by [Cascadia Labs](https://www.cascadia-labs.com/), in tabular format. The "ocpdb/filter" view presents the genetic data, generated by [Phylos Bioscience](https://phylos.bio/), in a list format. Digging deeper, I found the "ocpdb/" view to be the best one for scraping, as it holds the most detail per record. The URL format is `http://ocpdb.pythonanywhere.com/ocpdb/<OCPID>/`, where OCPID is the key for each record.

There are 1,099 data records. I categorized the records by OCPID as:

**Chemical Only**
- 420 - 426
- 428 - 429
- 431 - 443
- 445 - 447
- 449 - 455
- 457 - 476
- 478 - 493
- 495 - 674

**Genetic Only**
- 675 - 1518

**Combined**
- 427
- 430
- 444
- 448
- 456
- 477
- 494
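
The ID ranges above can be cross-checked against the 1,099 total. This is a minimal sketch; the variable names are mine and do not appear elsewhere in the notebook.

```python
# Cross-check the OCPID categories listed above against the record count.
# All names here are illustrative.
combined = {427, 430, 444, 448, 456, 477, 494}

# Chemical-only records: every ID from 420 through 674 that is not combined.
chemical_only = set(range(420, 675)) - combined

# Genetic-only records: every ID from 675 through 1518.
genetic_only = set(range(675, 1519))

total = len(chemical_only) + len(genetic_only) + len(combined)
print(len(chemical_only), len(genetic_only), len(combined), total)
```

The three counts are 248, 844 and 7, which add up to 1,099.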

In [1]:
import urllib.request
from bs4 import BeautifulSoup
import time
import os

# Captain Obvious reminds us to enable Internet (on Kaggle)
# let's download the Combined pages to see how we can isolate the data with BeautifulSoup
combined_ids = ['427','430','444','448','456','477','494']
base_url = 'http://ocpdb.pythonanywhere.com/ocpdb/'

# making headers to identify myself to the sysadmins
headers = {
    'User-Agent': 'Bill Ostaski, https://www.kaggle.com/ostaski/scraping-the-ocp-data',
    'From': 'ostaski@gmail.com'
}

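# putting these in a "Pages" directory; the download only runs if it does not exist yet
pages_dir = "Pages"
if not os.path.exists(pages_dir):
    os.mkdir(pages_dir)
    for ocp_id in combined_ids:
        url = base_url + ocp_id
        req = urllib.request.Request(url, headers=headers)
        resp = urllib.request.urlopen(req)
        # NOTE: str() on bytes stores the repr (b'...') with \n and \t as literal text,
        # which is what the output below shows; resp.read().decode('utf-8') would
        # store the real HTML instead
        with open(os.path.join(pages_dir, ocp_id), "w") as p:
            p.write(str(resp.read()))
        time.sleep(10) # showing some respect to the server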

In [2]:
# confirm pages and content are present
#!ls Pages # yes, they are there
with open("Pages/427", "r") as f: # the with block closes the file for us
    contents = f.read()

soup = BeautifulSoup(contents, 'html.parser')
print(soup.prettify()) # wish I could truncate this output (for Kaggle)

b'\n
<!DOCTYPE doctype html>
\n
<html lang="en">
 \n
 <head>
  \n
  <meta charset="utf-8"/>
  \n
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  \n
  <meta content="" name="description"/>
  \n
  <meta content="" name="author"/>
  \n
  <link href="/static/ocpdb/img/favicon.ico" rel="icon"/>
  \n\n
  <title>
   OCP Database
  </title>
  \n\n
  <!-- Bootstrap core CSS -->
  \n
  <!--<link href="../../dist/css/bootstrap.min.css" rel="stylesheet">-->
  \n\t
  <link crossorigin="anonymous" href="https://stackpath.bootstrapcdn.com/bootstrap/4.1.3/css/bootstrap.min.css" integrity="sha384-MCw98/SFnGE8fJT3GXwEOngsV7Zt27NXFoaoApmYm81iuXoPkFOJwJ8ERdknLPMO" rel="stylesheet"/>
  \n\n
  <!-- Custom styles for this template -->
  \n
  <link href="/static/ocpdb/css/starter-template.css" rel="stylesheet"/>
  \n
 </head>
 \n\n
 <body>
  \n\n
  <nav class="navbar navbar-expand-md navbar-dark bg-dark fixed-top">
   \n\t
   <img src="/static/ocpdb/img/ocp-circle-2

Now we have our "Combined" category pages to parse for the data we want. Let's fiddle with the soup object to focus on the data in the HTML structure. I've identified 58 data points per record to store in a CSV file. Only the Combined records contain all of them; the "Chemical" and "Genetic" records leave the other category's columns empty. Use the output above to figure out how to manipulate the soup object: it often contains characters (e.g., newline \n, tab \t) that you will need to handle and that are not visible in the page's source.
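
Many of those stray characters come from how the pages were saved: the download cell writes `str(resp.read())`, and `str()` on a bytes object produces its repr rather than decoded text. The sketch below uses a made-up byte string as a stand-in for a real response to show the difference.

```python
# A made-up stand-in for resp.read(); urlopen(...).read() returns bytes.
raw = b'<td>\n\t1.2\xc2\xa0Gb</td>'

# str() on bytes yields the repr: a leading b' plus escapes written out as literal text.
as_repr = str(raw)

# decode() yields the real text: an actual newline, tab and non-breaking space.
as_text = raw.decode('utf-8')

print(as_repr)
print('\\n' in as_repr, '\n' in as_repr)   # literal backslash-n vs. real newline
print('\\n' in as_text, '\n' in as_text)
```

This is why some of the parsing functions replace literal sequences such as `'\\xc2\\xa0'` instead of the real non-breaking space.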

In [3]:
# we have some links in our data points, let's have a look
for link in soup.find_all('a'):
    print(link.get('href'))

/ocpdb/
/ocpdb/filter/?f=chemical
/ocpdb/filter/?f=genetic
https://example.com
https://opencannabisproject.org/about/
https://opencannabisproject.org/get-involved/
https://opencannabisproject.org/contact/
https://store.maps.org/np/clients/maps/donation.jsp?campaign=99
https://www.ncbi.nlm.nih.gov/sra?term=SRS3289200
https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=3483
https://www.ncbi.nlm.nih.gov/bioproject/PRJNA470994
https://trace.ncbi.nlm.nih.gov/Traces/sra?study=SRP145424
https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR7151893


The last five links on the page (indices 8 through 12 in the list above) are our targets.
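
As a small illustration of pulling out just those links, here is a sketch on a trimmed-down, hypothetical page. The HTML layout and the link texts are invented; only the five URLs are copied from the output above. Slicing from the end is my choice here, whereas the functions later in the notebook index the links by position.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two navigation links followed by the five data links.
html = """
<nav>
  <a href="/ocpdb/">OCP Database</a>
  <a href="/ocpdb/filter/?f=chemical">Chemical</a>
</nav>
<table>
  <tr><td><a href="https://www.ncbi.nlm.nih.gov/sra?term=SRS3289200">sample</a></td></tr>
  <tr><td><a href="https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&amp;id=3483">organism</a></td></tr>
  <tr><td><a href="https://www.ncbi.nlm.nih.gov/bioproject/PRJNA470994">project</a></td></tr>
  <tr><td><a href="https://trace.ncbi.nlm.nih.gov/Traces/sra?study=SRP145424">study</a></td></tr>
  <tr><td><a href="https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR7151893">run</a></td></tr>
</table>
"""

soup_demo = BeautifulSoup(html, 'html.parser')

# Keep only the last five anchors and pair each link text with its href.
targets = soup_demo.find_all('a')[-5:]
rows = [(a.get_text(strip=True), a.get('href')) for a in targets]

for text, href in rows:
    print(text, href)
```

Slicing from the end only works if no other links follow the data links on the page.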

In [4]:
soup.body.b.text

'Sour Tangie'

This is our strain.

In [5]:
soup.find_all("td", class_="numcell")

[<td class="numcell">254.00</td>,
 <td class="numcell">2.26</td>,
 <td class="numcell">1.02</td>,
 <td class="numcell"></td>,
 <td class="numcell">0.17</td>,
 <td class="numcell">0.11</td>,
 <td class="numcell">0.00</td>,
 <td class="numcell">5.68</td>,
 <td class="numcell">0.16</td>,
 <td class="numcell">263.00</td>,
 <td class="numcell">0.62</td>,
 <td class="numcell">0.63</td>,
 <td class="numcell"></td>,
 <td class="numcell">0.03</td>,
 <td class="numcell">0.29</td>,
 <td class="numcell">0.03</td>,
 <td class="numcell">0.14</td>,
 <td class="numcell">0.00</td>,
 <td class="numcell">1.30</td>,
 <td class="numcell">0.10</td>,
 <td class="numcell"></td>,
 <td class="numcell">0.00</td>,
 <td class="numcell">0.08</td>,
 <td class="numcell"></td>,
 <td class="numcell">0.07</td>,
 <td class="numcell">0.07</td>,
 <td class="numcell"></td>,
 <td class="numcell">0.00</td>,
 <td class="numcell">0.12</td>,
 <td class="numcell"></td>,
 <td class="numcell">0.02</td>,
 <td class="numcell"></td>,


LOTS of data here!
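
Since every value sits in a `td` with class `numcell`, one `find_all` call can be reused for all lookups. The sketch below runs on a tiny invented table. Converting the text to `float` and mapping empty cells to `None` are my assumptions; the notebook itself stores the raw strings.

```python
from bs4 import BeautifulSoup

# Invented table with the same cell class as the real pages.
html = """
<table>
  <tr><td class="numcell">254.00</td><td class="numcell">2.26</td></tr>
  <tr><td class="numcell"></td><td class="numcell">0.17</td></tr>
</table>
"""

soup_demo = BeautifulSoup(html, 'html.parser')

# Query once, then index the resulting list as often as needed.
cells = soup_demo.find_all('td', class_='numcell')

def cell_value(cell):
    """Return the cell as a float, or None when the cell is empty."""
    text = cell.get_text(strip=True)
    return float(text) if text else None

values = [cell_value(c) for c in cells]
print(values)
```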

I'll spare you all the details of how I extracted the data using BeautifulSoup. Look at the soup statements in the functions below for those calls.

This was my first time using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). It is very intuitive and has a tiny learning curve if you are familiar with HTML/CSS structures. Working in a local notebook, I printed out the entire page for each report type and experimented with the soup object until I could extract the data.

The "Combined" and "Chemical" pages are similar in structure, so the "Chemical" reports needed only minor modifications once the "Combined" reports were done. The "Genetic" reports had a slightly different structure and required different soup calls in some cases.

OK, here's the code I used to get the data.

In [6]:
# defining some functions below
def getReportType(tocpid):
    if tocpid in combined_ids:
        return 'B' # B for Both
    elif int(tocpid) < 675:
        return 'C' # C for Chemical
    else:
        return 'G' # G for Genetic
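
A quick usage sketch for the function above, repeated here so the block runs on its own. The zero-padded IDs are hypothetical; I am only assuming that the page shows the OCPID with leading zeros, since the parsing functions strip them with `lstrip("0")`.

```python
combined_ids = ['427', '430', '444', '448', '456', '477', '494']

def getReportType(tocpid):
    if tocpid in combined_ids:
        return 'B' # B for Both
    elif int(tocpid) < 675:
        return 'C' # C for Chemical
    else:
        return 'G' # G for Genetic

# Hypothetical zero-padded IDs; the real padding width is not shown in the notebook.
for padded in ['000427', '000420', '001518']:
    tocpid = padded.lstrip('0')
    print(padded, tocpid, getReportType(tocpid))
```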

In [7]:
# function for "Both" report type files
def populateBoth(soup):
    ocpid = soup.find(class_="text-right").find('b').find(text=True, recursive=False).string[7:]
    strain = soup.body.b.text
    sampleID = soup.find(class_="text-muted").find(text=True, recursive=False)
    dateRecorded = soup.find(class_="text-right").find(text=True, recursive=False).string[1:]
    # get truncated ocpid
    tocpid = ocpid.lstrip("0")
    reportType = getReportType(tocpid)
    chemicalLab = soup.find('td', colspan="8").find(text=True, recursive=False)[2:]
    h20 = soup.find('td', colspan="4").text[6:]
    totalTHC = soup.find('td', colspan="3").text[11:]
    thc = soup.find_all("td", class_="numcell")[0].text
    # can't use hyphens in identifiers
    Δ8_thc = soup.find_all("td", class_="numcell")[3].text
    Δ9_thc = soup.find_all("td", class_="numcell")[6].text
    thca = soup.find_all("td", class_="numcell")[9].text
    thcv = soup.find_all("td", class_="numcell")[12].text
    totalCBD = soup.find("td", colspan="2").text[11:]
    cbda = soup.find_all("td", class_="numcell")[17].text
    cbdv = soup.find_all("td", class_="numcell")[20].text
    cbdva = soup.find_all("td", class_="numcell")[23].text
    cbc = soup.find_all("td", class_="numcell")[26].text
    cbg = soup.find_all("td", class_="numcell")[29].text
    cbn = soup.find_all("td", class_="numcell")[31].text
    α_pinene = soup.find_all("td", class_="numcell")[1].text
    camphene = soup.find_all("td", class_="numcell")[4].text
    myrcene = soup.find_all("td", class_="numcell")[7].text
    β_pinene = soup.find_all("td", class_="numcell")[10].text
    three_carene = soup.find_all("td", class_="numcell")[13].text
    α_terpinene = soup.find_all("td", class_="numcell")[15].text
    d_limonene = soup.find_all("td", class_="numcell")[18].text
    p_cymene = soup.find_all("td", class_="numcell")[21].text
    ocimene = soup.find_all("td", class_="numcell")[24].text
    eucalyptol = soup.find_all("td", class_="numcell")[27].text
    y_terpinene = soup.find_all("td", class_="numcell")[30].text
    terpinolene = soup.find_all("td", class_="numcell")[32].text
    linalool = soup.find_all("td", class_="numcell")[2].text
    isopulegol = soup.find_all("td", class_="numcell")[5].text
    geraniol = soup.find_all("td", class_="numcell")[8].text
    β_caryophyllene = soup.find_all("td", class_="numcell")[11].text
    α_humelene = soup.find_all("td", class_="numcell")[14].text
    nerolidol_1 = soup.find_all("td", class_="numcell")[16].text
    nerolidol_2 = soup.find_all("td", class_="numcell")[19].text
    guaiol = soup.find_all("td", class_="numcell")[22].text
    caryophylleneOxide = soup.find_all("td", class_="numcell")[25].text
    α_bisabolol = soup.find_all("td", class_="numcell")[28].text
    geneticLab = soup.find_all('td', colspan="5")[1].text[6:23]
    links = soup.find_all('a')
    sample = links[8].text
    sampleURL = links[8].get('href') # the link target, not the whole <a> tag
    organism = links[9].text
    organismURL = links[9].get('href')
    project = links[10].text
    projectURL = links[10].get('href')
    study = links[11].text
    studyURL = links[11].get('href')
    run = links[12].text
    runURL = links[12].get('href')
    datePublished = soup.find_all("td", class_="numcell")[33].text
    spots = soup.find_all("td", class_="numcell")[34].text
    bases = soup.find_all("td", class_="numcell")[35].text
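    # handle both the escaped form found in the saved Pages files and a real non-breaking space
    size = soup.find_all("td", class_="numcell")[36].text.replace('\\xc2\\xa0', ' ').replace('\xa0', ' ')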
    notes = soup.select('div.col')[3].text[23:27]
   
    return [ocpid,strain,sampleID,dateRecorded,reportType,chemicalLab,h20,totalTHC,thc,Δ8_thc,
            Δ9_thc,thca,thcv,totalCBD,cbda,cbdv,cbdva,cbc,cbg,cbn,α_pinene,camphene,myrcene,
            β_pinene,three_carene,α_terpinene,d_limonene,p_cymene,ocimene,eucalyptol,y_terpinene,
            terpinolene,linalool,isopulegol,geraniol,β_caryophyllene,α_humelene,nerolidol_1,
            nerolidol_2,guaiol,caryophylleneOxide,α_bisabolol,geneticLab,sample,sampleURL,organism,
            organismURL,project,projectURL,study,studyURL,run,runURL,datePublished,spots,bases,
            size,notes]

In [8]:
# function for "Chemical" report type files
def populateChemical(soup):
    ocpid = soup.find(class_="text-right").find('b').find(text=True, recursive=False).string[7:]
    strain = soup.body.b.text
    sampleID = soup.find(class_="text-muted").find(text=True, recursive=False)
    dateRecorded = soup.find(class_="text-right").find(text=True, recursive=False).string[1:]
    # get truncated ocpid
    tocpid = ocpid.lstrip("0")
    reportType = getReportType(tocpid)
    chemicalLab = soup.find('td', colspan="8").find(text=True, recursive=False)[2:]
    h20 = soup.find('td', colspan="4").text[6:]
    totalTHC = soup.find('td', colspan="3").text[11:]
    thc = soup.find_all("td", class_="numcell")[0].text
    # can't use hyphens in identifiers
    Δ8_thc = soup.find_all("td", class_="numcell")[3].text
    Δ9_thc = soup.find_all("td", class_="numcell")[6].text
    thca = soup.find_all("td", class_="numcell")[9].text
    thcv = soup.find_all("td", class_="numcell")[12].text
    totalCBD = soup.find("td", colspan="2").text[11:]
    cbda = soup.find_all("td", class_="numcell")[17].text
    cbdv = soup.find_all("td", class_="numcell")[20].text
    cbdva = soup.find_all("td", class_="numcell")[23].text
    cbc = soup.find_all("td", class_="numcell")[26].text
    cbg = soup.find_all("td", class_="numcell")[29].text
    cbn = soup.find_all("td", class_="numcell")[31].text
    α_pinene = soup.find_all("td", class_="numcell")[1].text
    camphene = soup.find_all("td", class_="numcell")[4].text
    myrcene = soup.find_all("td", class_="numcell")[7].text
    β_pinene = soup.find_all("td", class_="numcell")[10].text
    three_carene = soup.find_all("td", class_="numcell")[13].text
    α_terpinene = soup.find_all("td", class_="numcell")[15].text
    d_limonene = soup.find_all("td", class_="numcell")[18].text
    p_cymene = soup.find_all("td", class_="numcell")[21].text
    ocimene = soup.find_all("td", class_="numcell")[24].text
    eucalyptol = soup.find_all("td", class_="numcell")[27].text
    y_terpinene = soup.find_all("td", class_="numcell")[30].text
    terpinolene = soup.find_all("td", class_="numcell")[32].text
    linalool = soup.find_all("td", class_="numcell")[2].text
    isopulegol = soup.find_all("td", class_="numcell")[5].text
    geraniol = soup.find_all("td", class_="numcell")[8].text
    β_caryophyllene = soup.find_all("td", class_="numcell")[11].text
    α_humelene = soup.find_all("td", class_="numcell")[14].text
    nerolidol_1 = soup.find_all("td", class_="numcell")[16].text
    nerolidol_2 = soup.find_all("td", class_="numcell")[19].text
    guaiol = soup.find_all("td", class_="numcell")[22].text
    caryophylleneOxide = soup.find_all("td", class_="numcell")[25].text
    α_bisabolol = soup.find_all("td", class_="numcell")[28].text
    geneticLab = ''
    sample = ''
    sampleURL = ''
    organism = ''
    organismURL = ''
    project = ''
    projectURL = ''
    study = ''
    studyURL = ''
    run = ''
    runURL = ''
    datePublished = ''
    spots = ''
    bases = ''
    size = ''
    notes = soup.select('div.col')[3].text[23:27]
   
    return [ocpid,strain,sampleID,dateRecorded,reportType,chemicalLab,h20,totalTHC,thc,Δ8_thc,
            Δ9_thc,thca,thcv,totalCBD,cbda,cbdv,cbdva,cbc,cbg,cbn,α_pinene,camphene,myrcene,
            β_pinene,three_carene,α_terpinene,d_limonene,p_cymene,ocimene,eucalyptol,y_terpinene,
            terpinolene,linalool,isopulegol,geraniol,β_caryophyllene,α_humelene,nerolidol_1,
            nerolidol_2,guaiol,caryophylleneOxide,α_bisabolol,geneticLab,sample,sampleURL,organism,
            organismURL,project,projectURL,study,studyURL,run,runURL,datePublished,spots,bases,
            size,notes]

In [9]:
# function for "Genetic" report type files
def populateGenetic(soup):
    ocpid = soup.find(class_="text-right").find('b').find(text=True, recursive=False).string[7:]
    strain = soup.body.b.text.replace('\\n\\t\\t\\t ', '').replace('\\n\\t\\t\\t  ', '').strip()
    sampleID = soup.find(class_="text-muted").find(text=True, recursive=False)
    dateRecorded = soup.find(class_="text-right").find(text=True, recursive=False).string[1:]
    # get truncated ocpid
    tocpid = ocpid.lstrip("0")
    reportType = getReportType(tocpid)
    chemicalLab = ''
    h20 = ''
    totalTHC = ''
    thc = ''
    Δ8_thc = ''
    Δ9_thc = ''
    thca = ''
    thcv = ''
    totalCBD = ''
    cbda = ''
    cbdv = ''
    cbdva = ''
    cbc = ''
    cbg = ''
    cbn = ''
    α_pinene = ''
    camphene = ''
    myrcene = ''
    β_pinene = ''
    three_carene = ''
    α_terpinene = ''
    d_limonene = ''
    p_cymene = ''
    ocimene = ''
    eucalyptol = ''
    y_terpinene = ''
    terpinolene = ''
    linalool = ''
    isopulegol = ''
    geraniol = ''
    β_caryophyllene = ''
    α_humelene = ''
    nerolidol_1 = ''
    nerolidol_2 = ''
    guaiol = ''
    caryophylleneOxide = ''
    α_bisabolol = ''
    geneticLab = soup.find('td', colspan="8").find(text=True, recursive=False)[2:]
    links = soup.find_all('a')
    sample = links[8].text
    sampleURL = links[8].get('href') # the link target, not the whole <a> tag
    organism = links[9].text
    organismURL = links[9].get('href')
    project = links[10].text
    projectURL = links[10].get('href')
    study = links[11].text
    studyURL = links[11].get('href')
    run = links[12].text
    runURL = links[12].get('href')
    datePublished = soup.find_all('td', colspan="5")[2].text[17:]
    spots = soup.find_all('td', colspan="5")[3].text[8:]
    bases = soup.find_all('td', colspan="5")[4].text[8:]
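    # handle both the escaped form found in the saved Pages files and a real non-breaking space
    size = soup.find_all('td', colspan="5")[5].text[7:].replace('\\xc2\\xa0', ' ').replace('\xa0', ' ')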
    notes = ''
   
    return [ocpid,strain,sampleID,dateRecorded,reportType,chemicalLab,h20,totalTHC,thc,Δ8_thc,
            Δ9_thc,thca,thcv,totalCBD,cbda,cbdv,cbdva,cbc,cbg,cbn,α_pinene,camphene,myrcene,
            β_pinene,three_carene,α_terpinene,d_limonene,p_cymene,ocimene,eucalyptol,y_terpinene,
            terpinolene,linalool,isopulegol,geraniol,β_caryophyllene,α_humelene,nerolidol_1,
            nerolidol_2,guaiol,caryophylleneOxide,α_bisabolol,geneticLab,sample,sampleURL,organism,
            organismURL,project,projectURL,study,studyURL,run,runURL,datePublished,spots,bases,
            size,notes]

In [10]:
import requests
from bs4 import BeautifulSoup
import time
import datetime
import csv

combined_ids = ['427','430','444','448','456','477','494']

base_url = 'http://ocpdb.pythonanywhere.com/ocpdb/'

# making headers to identify myself to the sysadmin(s)
req_headers = {
    'User-Agent': 'Bill Ostaski, https://www.kaggle.com/ostaski/scraping-the-ocp-data',
    'From': 'ostaski@gmail.com'
}

# generally a good idea to note the date of this snapshot
filename = "OCPDB-" + datetime.datetime.now().strftime("%m_%d_%Y") + ".csv"

col_headers = ["OCPID","Strain","SampleID","DateRecorded","ReportType","ChemicalLab","H2O",
               "TotalTHC","THC","Δ8-THC","Δ9-THC","THCA","THCV","TotalCBD","CBDA","CBDV",
               "CBDVA","CBC","CBG","CBN","α-Pinene","Camphene","Myrcene","β-Pinene","3-Carene",
               "α-Terpinene","D-Limonene","p-Cymene","Ocimene","Eucalyptol","γ-Terpinene",
               "Terpinolene","Linalool","Isopulegol","Geraniol","β-Caryophyllene","α-Humelene",
               "Nerolidol-1","Nerolidol-2","Guaiol","CaryophylleneOxide","α-Bisabolol",
               "GeneticLab","Sample","SampleURL","Organism","OrganismURL","Project","ProjectURL",
               "Study","StudyURL","Run","RunURL","DatePublished","Spots","Bases","Size","Notes"]

# had to comment out the lines below in order to Commit (on Kaggle)
# newline="" is what the csv module expects; utf-8 covers the Greek letters in the column names
with open(filename, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(col_headers)

#for ocp_id in range(420, 1519): # the full run over every record
for ocp_id in combined_ids: # grabs combined reports
    ocp_id = str(ocp_id) # range() yields ints, but combined_ids holds strings
    url = base_url + ocp_id
    resp = requests.get(url, headers=req_headers)

    soup = BeautifulSoup(resp.text, 'html.parser')

    if ocp_id in combined_ids:
        parsed_data = populateBoth(soup)
    elif int(ocp_id) < 675:
        parsed_data = populateChemical(soup)
    else:
        parsed_data = populateGenetic(soup)

    with open(filename, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(parsed_data)

    time.sleep(10) # showing some respect to the server

When I looped over `range(420, 1519)`, the "Combined" records came back without their genetic data, although the same pages parsed correctly when I looped over `combined_ids`. I worked around it by using the rows from the `combined_ids` run to overwrite the matching rows from the full run. There are only 7 records, so no biggie.
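
One likely explanation is a type mismatch: `range` yields integers while `combined_ids` holds strings, so a membership test between the two never succeeds and the combined pages would be handled as chemical-only. This can be reproduced without touching the network. A minimal sketch, with `missed` and `found` as illustrative names:

```python
combined_ids = ['427', '430', '444', '448', '456', '477', '494']

# Integers from range() never compare equal to the strings in combined_ids.
missed = [i for i in range(420, 1519) if i in combined_ids]

# Converting each ID to a string first finds all seven combined records.
found = [i for i in range(420, 1519) if str(i) in combined_ids]

print(len(missed), len(found))
```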

It takes around an hour and 40 minutes to extract the dataset when "sleeping" for 5 seconds between calls, and a bit over 3 hours when "sleeping" for 10 seconds. There were occasional DNS issues with the 5-second sleep, but none with the 10-second sleep.
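
These timings are in line with a rough lower bound computed from the sleep alone. The sketch below ignores request and parsing time, and the function name is mine.

```python
# Lower bound on the run time: number of pages times the pause between requests.
records = 1518 - 420 + 1  # one page per OCPID from 420 through 1518

def sleep_floor_seconds(delay):
    """Total seconds spent sleeping for a given per-request delay."""
    return records * delay

for delay in (5, 10):
    hours, rem = divmod(sleep_floor_seconds(delay), 3600)
    print(f"{delay}s sleep: at least {hours}h {rem // 60}m spent sleeping")
```

The remaining minutes in the observed run times would be the requests themselves.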

I ran this on my local notebook, but it seems to hang in this kernel. No worries, I published the dataset at [https://www.kaggle.com/ostaski/ocp-dataset](https://www.kaggle.com/ostaski/ocp-dataset) for folks to play with.

If you want to see an exceptional visualization of cannabis genetics, take a look at the [Phylos Galaxy](https://phylos.bio/galaxy).