# NIST Bioeconomy Lexicon Scraper

This notebook scrapes the terms and definitions from the [NIST Bioeconomy Lexicon](https://www.nist.gov/bioscience/nist-bioeconomy-lexicon), displays them, and saves them to a TSV file.

In [10]:
import csv

import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

In [11]:
# URL of the NIST Bioeconomy Lexicon web page:
URL = "https://www.nist.gov/bioscience/nist-bioeconomy-lexicon"

table_headers = ("nist_bioeconomy_term", "definition")

output_tsv_file = "../local/nist_bioeconomy_terms.tsv"

Get the HTML code of the web page at that URL.

In [12]:
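# Set a timeout so the request cannot hang indefinitely
response: requests.Response = requests.get(URL, timeout=30)
# Fail early on an HTTP error instead of parsing an error page
response.raise_for_status()
html: str = response.text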

Parse the HTML code into a [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#bs4.BeautifulSoup) object so we can navigate and search it.

In [13]:
soup = BeautifulSoup(html, "html.parser")

Use Beautiful Soup to extract the terms and their definitions from the object.

In [14]:
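definitions = {}

# The lexicon entries live inside the element whose id is "glossary"
glossary_el = soup.find(id="glossary")
if glossary_el is None:
    raise ValueError('No element with id="glossary" found; the page layout may have changed.')

# Each entry is an accordion: the heading holds the term, the content holds the definition
for entry_el in glossary_el.find_all(class_="usa-accordion"):
    term: str = entry_el.find(class_="usa-accordion__heading").text.strip()
    definition: str = entry_el.find(class_="usa-accordion__content").text.strip()
    definitions[term] = definition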
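
In the sample output in this notebook, the scraped definition text ends with a trailer such as "Version date: December 8, 2022  Direct link". If that trailer is unwanted in the saved file, it could be split off with a regular expression. The following is a minimal sketch, assuming every entry follows the trailer format seen in that sample; the names `TRAILER_PATTERN` and `split_definition` are hypothetical and not part of the original notebook.

```python
import re
from typing import Optional, Tuple

# Assumption: the trailer looks like "Version date: <date>  Direct link"
# and sits at the very end of the scraped text, as in the sample output.
TRAILER_PATTERN = re.compile(
    r"\s*Version date:\s*(?P<date>.+?)\s*Direct link\s*$",
    re.DOTALL,
)


def split_definition(raw: str) -> Tuple[str, Optional[str]]:
    """Return (definition text, version date), or (text, None) if no trailer is found."""
    match = TRAILER_PATTERN.search(raw)
    if match is None:
        return raw.strip(), None
    return raw[: match.start()].strip(), match.group("date")


# Example input taken from the first row of the sample output
raw_definition = (
    "food technology-based alternatives (including field-grown or facility-produced) "
    "to protein harvested or obtained directly from animals\n"
    "Version date: December 8, 2022  Direct link"
)
clean_text, version_date = split_definition(raw_definition)
```

Text that has no trailer passes through unchanged, so the helper is safe to apply to every entry even if the page layout differs for some of them.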

(Optional) Display the terms and definitions as a text-based grid.

In [15]:
table_body = definitions.items()
table_str = tabulate(table_body, table_headers, tablefmt="grid", maxcolwidths=[30, 70])
print(table_str)

+--------------------------------+------------------------------------------------------------------------+
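| nist_bioeconomy_term           | definition                                                             |
+================================+========================================================================+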
| alternative protein source     | food technology-based alternatives (including field-grown or facility- |
|                                | produced) to protein harvested or obtained directly from animals       |
|                                | Version date: December 8, 2022  Direct link                            |
+--------------------------------+------------------------------------------------------------------------+
| bio-related software           | all or part of software used to design, control, manufacture,          |
|                                | simulate, model, manage, access, analyze, store, or process biological |
|                                | materials and associated data, especially as it pertains to laboratory |
|                           

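(Optional) Display the terms and their definitions as a CSV string. The code below is commented out; uncomment it to print the CSV to standard output.

In [17]:
# import sys
#
# writer = csv.writer(sys.stdout)
# writer.writerow(("Term", "Definition"))
# writer.writerows(definitions.items())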
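
To obtain the CSV as an actual string rather than printing it, one option is to write into an in-memory buffer. This is a minimal sketch using only the standard library; the function name `definitions_to_csv` and the sample dictionary are hypothetical and stand in for the scraped `definitions`.

```python
import csv
import io
from typing import Dict


def definitions_to_csv(definitions: Dict[str, str]) -> str:
    """Serialize a term -> definition mapping to a CSV string with a header row."""
    buffer = io.StringIO()
    # Use "\n" line endings so the string prints cleanly in a notebook
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(("Term", "Definition"))
    writer.writerows(definitions.items())
    return buffer.getvalue()


# Made-up sample data, for illustration only
sample_definitions = {"example term": "a definition, with a comma"}
csv_string = definitions_to_csv(sample_definitions)
print(csv_string)
```

Fields that contain commas, quotes, or line breaks are quoted by the `csv` module automatically, which matters here because definitions are free text.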

Save the terms and their definitions to a TSV file.

In [18]:
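nist_bioeconomy_terms = [
    {table_headers[0]: term, table_headers[1]: definition}
    for term, definition in definitions.items()
]

# Specify the encoding explicitly so the file is written the same way on every platform
with open(output_tsv_file, "w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=table_headers, delimiter="\t")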

    # Write the header row
    writer.writeheader()

    # Write the data rows
    writer.writerows(nist_bioeconomy_terms)
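
The cell above fails if the output directory (here `../local`) does not exist yet. The sketch below shows one way to create the directory first and then read the file back to check that it round-trips. It writes to a temporary directory with a made-up row, and the helper names `write_tsv` and `read_tsv` are hypothetical, not part of the original notebook.

```python
import csv
import tempfile
from pathlib import Path


def write_tsv(rows, headers, path):
    """Write a list of dicts to a TSV file, creating parent directories as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=headers, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


def read_tsv(path):
    """Read a TSV file back into a list of dicts keyed by the header row."""
    with Path(path).open(newline="", encoding="utf-8") as file:
        return list(csv.DictReader(file, delimiter="\t"))


# Made-up row containing a line break and a tab, to exercise the quoting
headers = ("nist_bioeconomy_term", "definition")
sample_rows = [
    {"nist_bioeconomy_term": "example term", "definition": "line one\nline two\twith a tab"}
]

with tempfile.TemporaryDirectory() as tmp_dir:
    tsv_path = Path(tmp_dir) / "local" / "nist_bioeconomy_terms.tsv"
    write_tsv(sample_rows, headers, tsv_path)
    round_tripped = read_tsv(tsv_path)
```

Because the `csv` module quotes fields that contain tabs or line breaks, such definitions survive the round trip, although tools that split naively on tab characters may misread those rows.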

