# Scraping the NIST database

This notebook has the following functionality:
 - A download of the NIST isotherm database in JSON form
 - Preliminary homogenisation of isotherm parameters
 - Pickling of this data to disk

To run this notebook, the working directory must be the project's main folder, because all files are read from and written to its `data` subfolder.

In [1]:
import os
import json
import requests
import pathlib
import pickle

import concurrent.futures
from tqdm import tqdm

basedir = pathlib.Path.cwd() / "data"
isodb_base = "https://adsorption.nist.gov/isodb/api/"
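Since every later cell assumes that the `data` folder exists under the working directory, a small guard can fail early with a clear message. The following is a minimal sketch only; the helper name `resolve_data_dir` and its `create` flag are hypothetical and not part of the original notebook.

```python
import pathlib


def resolve_data_dir(root=None, create=False):
    """Return the `data` folder under `root` (default: current working directory).

    Hypothetical helper: raises FileNotFoundError if the folder is missing,
    unless `create=True`, in which case it is created first.
    """
    root = pathlib.Path(root) if root is not None else pathlib.Path.cwd()
    data_dir = root / "data"
    if create:
        data_dir.mkdir(parents=True, exist_ok=True)
    if not data_dir.is_dir():
        raise FileNotFoundError(
            f"No 'data' folder in {root}; start the notebook from the main folder."
        )
    return data_dir


# Possible usage in place of the plain assignment:
# basedir = resolve_data_dir()
```

Raising an error rather than silently creating the folder keeps a wrong working directory from going unnoticed; pass `create=True` only on a first run.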

## List of isotherms

We first download the lists of all isotherms, materials and adsorbates (the `gases.json` endpoint, stored in the `adsorbents_parsed` variable) in the NIST database.

In [2]:
iso_list = requests.get(isodb_base + "isotherms.json")
with open(basedir / "isotherm_list.json", "w", encoding='utf-8') as f:
    f.write(iso_list.text)
isotherms_parsed = iso_list.json()
print(f"Downloaded {len(isotherms_parsed)} isotherms.")

Downloaded 32552 isotherms.


In [3]:
mat_list = requests.get(isodb_base + "materials.json")
with open(basedir / "material_list.json", "w", encoding='utf-8') as f:
    f.write(mat_list.text)
materials_parsed = mat_list.json()
print(f"Downloaded {len(materials_parsed)} materials.")

Downloaded 7008 materials.


In [5]:
ads_list = requests.get(isodb_base + "gases.json")
with open(basedir / "adsorbent_list.json", "w", encoding='utf-8') as f:
    f.write(ads_list.text)
adsorbents_parsed = ads_list.json()
print(f"Downloaded {len(adsorbents_parsed)} probes.")

Downloaded 356 probes.


If needed, these initial lists can be reloaded from disk.

In [4]:
with open(basedir / "isotherm_list.json", "r", encoding='utf-8') as f:
    isotherms_parsed = json.load(f)
print(f"Loaded {len(isotherms_parsed)} isotherms.")

with open(basedir / "material_list.json", "r", encoding='utf-8') as f:
    materials_parsed = json.load(f)
print(f"Loaded {len(materials_parsed)} materials.")

with open(basedir / "adsorbent_list.json", "r", encoding='utf-8') as f:
    adsorbents_parsed = json.load(f)
print(f"Loaded {len(adsorbents_parsed)} adsorbents.")

Loaded 32552 isotherms.
Loaded 7008 materials.
Loaded 356 adsorbents.


## Concurrent download of ISODB

The following cell uses a thread pool to download multiple isotherms in parallel from the list obtained previously. It is reasonably fast (a few minutes), but may encounter problems if the number of simultaneous connections (`CONNECTIONS`) is too high. URLs that fail are printed and collected in a separate list (`errs`). This should be the first choice; a sequential download is available as a fallback.

In [6]:
isos = []
errs = []
CONNECTIONS = 50
TIMEOUT = 10
isodb_session = requests.Session()

urls = [isodb_base + "isotherm/" + iso_raw['filename'] + '.json' for iso_raw in isotherms_parsed]

def load_url(url, timeout):
    # Fetch one isotherm; raise on HTTP errors so the URL ends up in `errs`
    ans = isodb_session.get(url, timeout=timeout)
    ans.raise_for_status()
    return ans.json()

print(f"Total number of iterations will be: {len(isotherms_parsed)}")
with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
    future_to_url = {executor.submit(load_url, url, TIMEOUT): url for url in urls}
    for future in tqdm(concurrent.futures.as_completed(future_to_url)):
        url = future_to_url[future]
        try:
            iso = future.result()
        except Exception as exc:
            print(exc)
            errs.append(url)
        else:
            isos.append(iso)

if errs:
    print(f"Some errors occurred! {len(errs)} downloads failed.")

Total number of iterations will be: 32552
32552it [04:24, 123.25it/s]
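If some downloads fail, the URLs collected in `errs` can be retried instead of restarting the whole download. The sketch below is one possible approach, not part of the original notebook: the function `retry_failed` and its `attempts` and `wait` parameters are hypothetical names, and the fetching function is passed in as an argument so that the retry logic does not depend on the network.

```python
import time


def retry_failed(urls, fetch, attempts=3, wait=1.0):
    """Retry `fetch(url)` for every URL, up to `attempts` passes.

    Returns a tuple (results, still_failing). `fetch` is any callable that
    returns the parsed isotherm or raises an exception on failure.
    """
    results = []
    pending = list(urls)
    for attempt in range(attempts):
        if not pending:
            break
        failed = []
        for url in pending:
            try:
                results.append(fetch(url))
            except Exception as exc:
                print(f"Attempt {attempt + 1} failed for {url}: {exc}")
                failed.append(url)
        pending = failed
        # Pause between passes to give the server time to recover
        if pending and attempt < attempts - 1:
            time.sleep(wait)
    return results, pending


# Possible usage with the names defined in the cell above:
# recovered, errs = retry_failed(errs, lambda url: load_url(url, TIMEOUT))
# isos.extend(recovered)
```

The retries run sequentially on purpose: failures in the concurrent pass are most likely caused by too many simultaneous connections, so a slower second pass is less likely to hit the same problem.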


## Sequential download of ISODB

If the concurrent code does not work, a sequential version is available. *Warning: this can take several hours, depending on network latency and throughput. Try the concurrent version first.*

In [35]:
isos = []
errs = []
TIMEOUT = 5
isodb_session = requests.Session()

for iso_raw in tqdm(isotherms_parsed):
    try:
        iso = isodb_session.get(isodb_base + "isotherm/" + iso_raw['filename'] + '.json', timeout=TIMEOUT)
        iso.raise_for_status()  # treat HTTP errors as failures
        isos.append(iso.json())
    except Exception as e:
        errs.append(iso_raw)
        print(e)

100%|██████████| 32552/32552 [2:07:34<00:00,  4.25it/s]


This step is required because of inconsistencies in the NIST ISODB. Since the switch to a multicomponent JSON format, the *total_adsorption* field, which should hold the total amount adsorbed across all species, is sometimes left blank. Here we iterate over the isotherms and compute this field from the per-species data when it is absent.

In [9]:
import numpy as np

fixes = 0

for iso in tqdm(isos):
    if iso['isotherm_data'][0]['total_adsorption'] is None:
        fixes += 1
        for point in iso['isotherm_data']:
            point['total_adsorption'] = float(np.sum([dp['adsorption'] for dp in point['species_data']]))

print(f"Corrections performed in {fixes} isotherms.")

100%|██████████| 32552/32552 [00:02<00:00, 14112.17it/s]
Corrections performed in 4839 isotherms.
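To make the correction concrete, the same logic can be applied to a small hand-made isotherm. This is an illustrative sketch: the function name `fill_total_adsorption` and the toy data (including its `pressure` values) are assumptions, and only the keys `isotherm_data`, `total_adsorption`, `species_data` and `adsorption` are taken from the cell above.

```python
def fill_total_adsorption(iso):
    """Fill missing `total_adsorption` values in place.

    Mirrors the cell above: only the first data point is checked, and if its
    total is missing, every point gets the sum of its per-species adsorption.
    Returns True if the isotherm was modified.
    """
    points = iso["isotherm_data"]
    if not points or points[0]["total_adsorption"] is not None:
        return False
    for point in points:
        point["total_adsorption"] = float(
            sum(dp["adsorption"] for dp in point["species_data"])
        )
    return True


# Toy two-component isotherm with the total left blank (made-up numbers)
toy_iso = {
    "isotherm_data": [
        {"pressure": 1.0, "total_adsorption": None,
         "species_data": [{"adsorption": 0.5}, {"adsorption": 0.25}]},
        {"pressure": 2.0, "total_adsorption": None,
         "species_data": [{"adsorption": 1.0}, {"adsorption": 0.5}]},
    ]
}

changed = fill_total_adsorption(toy_iso)
print(changed, [p["total_adsorption"] for p in toy_iso["isotherm_data"]])
# True [0.75, 1.5]
```

As in the original cell, an isotherm whose first point already has a total is left untouched, even if later points are blank.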


### Save (or load) pickled isotherms if needed

In [10]:
with open(basedir / "isotherms.pickle", 'wb') as f:
    pickle.dump(isos, f)

In [74]:
with open(basedir / "isotherms.pickle", 'rb') as f:
    isos = pickle.load(f)
print(f"Loaded {len(isos)} full isotherms.")

Loaded 32552 full isotherms.
