# NIAID DIR Laboratory Descriptions Data Set

This notebook builds a data set describing the organizational structure of the National Institute of Allergy and Infectious Diseases (NIAID) Division of Intramural Research (DIR). The data set is intended for loading into a Neo4j graph representation and subsequent integration into the PubMed Knowledge Graph (PKG) for further analyses, so the notebook outputs a CSV. The data are obtained by web scraping the NIAID DIR Laboratory Descriptions pages. The resulting data set includes the features of substantive interest listed below.

_**Date:** Jan 2022_

_**Contact:** Nick Kunz, Deloitte Consulting LLP (nkunz@deloitte.com)_


### Data Features
1. **Name** (object): Name of the principal investigators, staff scientists, and staff clinicians. There are 180 unique names.
2. **Education** (object): Research and training credentials for each name. There are 15 unique credentials.
3. **Branch** (object): The branch name nested within the DIR. There are 20 unique branches.
4. **Section** (object): The section/unit name nested within its respective branch. There are 162 unique sections.
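
As a quick sanity check on the schema above, the exported CSV can be inspected with the standard library. This is only a minimal sketch: the two sample rows are made-up placeholders rather than real scraped records, and `summarize` is a hypothetical helper that is not part of this notebook.

```python
import csv
import io

# Column names as listed under Data Features.
EXPECTED = ['Name', 'Education', 'Branch', 'Section']

# Placeholder rows for illustration only (not real scraped records).
SAMPLE_CSV = (
    'Name,Education,Branch,Section\n'
    'Jane Q Doe,PhD,Example Laboratory,Example Section\n'
    'Jane Q Doe,MD,Example Laboratory,Example Section\n'
)

def summarize(handle):
    """Return the number of unique values per expected column."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED:
        raise ValueError('Unexpected columns: {x}'.format(x = reader.fieldnames))
    unique = {col: set() for col in EXPECTED}
    for row in reader:
        for col in EXPECTED:
            unique[col].add(row[col])
    return {col: len(vals) for col, vals in unique.items()}

counts = summarize(io.StringIO(SAMPLE_CSV))
print(counts)
```

To check a real export, the same function could be given an open file handle, for example `open(csv_path, newline = '')`. A name appears once per credential, which is why the placeholder name above occupies two rows.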

### Dependencies
1. **Python** (3.8.2): Language
2. **Pandas** (1.2.4): DataFrames for data manipulation
3. **Requests** (2.26.0): Web requests for navigation
4. **Selenium** (3.141.0): Framework for web automation
5. **TQDM** (4.61.2): Progress bar for job status and completion
6. **GeckoDriver** (0.30.0): WebDriver used by Selenium

    _Note: This notebook only supports automatic GeckoDriver download on Windows machines. If you are running Linux or Mac, you will need to download GeckoDriver (https://github.com/mozilla/geckodriver/releases) and specify its executable path manually. The Python kernel must also be allowed to make web requests._
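
For the manual case described in the note, one way to resolve the executable path is sketched below. The helper `find_geckodriver` is hypothetical (it is not defined elsewhere in this notebook) and relies only on the standard library: it prefers an explicitly supplied path and otherwise looks for `geckodriver` on the system `PATH`.

```python
import os
import shutil

def find_geckodriver(explicit_path = None):
    """Return a usable GeckoDriver path, or raise if none is found."""
    # 1) an explicitly supplied path wins if the file exists
    if explicit_path and os.path.isfile(explicit_path):
        return os.path.abspath(explicit_path)

    # 2) otherwise fall back to whatever is on the system PATH
    found = shutil.which('geckodriver')
    if found:
        return found

    raise FileNotFoundError(
        'GeckoDriver not found; download it and pass its path explicitly.'
    )
```

The returned value could then be used in place of a hard-coded executable path, e.g. `exe_path = find_geckodriver('./geckodriver')` on Linux or Mac.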

In [25]:
## libraries
import io
import zipfile as zp
import requests as rq
import pandas as pd
from tqdm import tqdm
from selenium import webdriver
from selenium.webdriver import FirefoxProfile
from selenium.webdriver import FirefoxOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

### Dependency Downloader

In [26]:
## get geckodriver
def gecko_downloader(os = 'win64', path = './'):

    """
    Desc:
        Downloads and extracts GeckoDriver required for Selenium WebDriver.
    
    Args:
        os (str): Specify OS on local machine (takes "win64" or "win32").
        path (str): File path for saving GeckoDriver on local machine.

    Returns:
        None (saves file 'geckodriver.exe' to specified path in 'path').
    
    Raises:
        TypeError: 'os' is not a string.
        ValueError: 'os' is not a supported value, or GeckoDriver could not
            be downloaded.
    """

    ## arg quality
    if type(os) is not str:
        raise TypeError('os arg requires str.')

    if os != 'win64' and os != 'win32':
        raise ValueError('os arg requires valid input: "win64", "win32".')

    ## windows support (to do: linux and mac support)
    if os == 'win64':
        url = 'https://github.com/mozilla/geckodriver/releases/download/v0.30.0/geckodriver-v0.30.0-win64.zip'

    if os == 'win32':
        url = 'https://github.com/mozilla/geckodriver/releases/download/v0.30.0/geckodriver-v0.30.0-win32.zip'

    ## download geckodriver
    request = rq.get(
        url = url
    )

    if request.status_code == 200:
        gecko = zp.ZipFile(
            file = io.BytesIO(request.content)
        )
    else:
        raise ValueError('Could not download GeckoDriver.')

    try:
        ## extract geckodriver
        gecko.extractall(
            path = path
        )
        
        ## complete
        gecko.close()

        print('Successfully downloaded and extracted GeckoDriver.')

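    ## assume an extraction failure means geckodriver already exists
    ## (e.g. the existing file is locked by a running process)
    except OSError:
        pass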

    finally:
        print('GeckoDriver detected. Proceeding to web requests.')
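
The download-then-extract step in `gecko_downloader` can be tried without any network access. The sketch below builds a tiny zip archive in memory to stand in for `request.content`; the file contents are placeholder bytes and `extract_archive` is a hypothetical helper, not part of this notebook.

```python
import io
import os
import tempfile
import zipfile as zp

# Build a small zip archive in memory to stand in for the downloaded bytes.
buffer = io.BytesIO()
with zp.ZipFile(buffer, mode = 'w') as archive:
    archive.writestr('geckodriver.exe', b'placeholder bytes')

payload = buffer.getvalue()  # plays the role of request.content

def extract_archive(content, path):
    """Extract zip bytes to 'path' and return the extracted member names."""
    with zp.ZipFile(io.BytesIO(content)) as archive:
        archive.extractall(path = path)
        return archive.namelist()

# Extract into a throwaway directory and confirm the file landed on disk.
with tempfile.TemporaryDirectory() as tmp:
    names = extract_archive(payload, tmp)
    extracted = os.path.isfile(os.path.join(tmp, 'geckodriver.exe'))

print(names, extracted)
```

Using `ZipFile` as a context manager closes the archive even if extraction fails, which is one possible alternative to the explicit `gecko.close()` call above.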


### Driver

In [27]:
## make webdriver options
def driver_options(opt):

    """
    Desc:
        Specifies Selenium WebDriver options for Firefox.

    Args:
        opt (list): Option string value flags (e.g. "--headless").

    Returns:
        FirefoxOptions: Selenium WebDriver options object for Firefox.

    Raises:
        TypeError: Incorrect data type in argument.
    """

    ## arg quality
    if type(opt) is not list:
        raise TypeError('opt arg requires list of str.')
    
    else:
        pass

    ## firefox options
    options = FirefoxOptions()

    for i in opt:
        options.add_argument(
            argument = i
        )

    return options


## make webdriver profile
def driver_profile():

    """
    Desc:
        Creates Selenium WebDriver profile for Firefox.

    Args:
        None.

    Returns:
        FirefoxProfile: Selenium WebDriver profile object for Firefox.
    
    Raises:
        None.
    """

    ## firefox profile
    profile = FirefoxProfile()

    profile.set_preference(
        key = 'browser.startup.homepage',
        value = 'about:blank'
    )

    return profile

### Extractor

In [28]:
## navigation by link anchors
def link_clicker(driver, url, anch_x, n, t):

    """ 
    Desc:
        Navigates to link in website specified by anchor ID in 'anch_x'. Makes 
        'n' number of web request attempts before time out failure with 't' load 
        latency time. 

    Args:
        driver (obj): Selenium WebDriver object.
        url (str): URL of website.
        anch_x (str): Anchor ID of link (e.g. "anch_354").
        n (int): Number of web request attempts after first failure.
        t (int): Load latency of website (seconds).
    
    Returns:
        None
    
    Raises:
        RuntimeError: Max 'n' number of attempts reached.
    """

    ## get request url
    driver.get(
        url = url
    )

    ## load latency
    wait = WebDriverWait(
        driver = driver,
        timeout = t
    )

    ## try n times
    i = 1

    ## nav to link
    while i < n:
        try:
            branch = wait.until(
                method = EC.element_to_be_clickable(
                    locator = (By.ID, anch_x)
                )
            )
            branch.click()
            break

        ## try again on failure
        except:
            i += 1
            print('Unsuccessful request, trying again. Attempt: {x}'.format(
                    x = i
                )
            )

            ## get request url again
            driver.get(
                url = url
            )
            pass

        ## time out failure on too many attempts
        if n == i:
            raise RuntimeError(
                'Unsuccessful request, now stopping. Max number of attempts.'
            )


## name and section data
def name_section_extractor(driver, anch_a, anch_b, anch_c, anch_d, n, t):

    """
    Desc:
        Primary function for retrieving raw data for 'Name', 'Education', and 
        'Section' features. Makes 'n' number of web request attempts before 
        time out failure with 't' load latency time. Anchor ID's 'anch_a', 
        'anch_b', 'anch_c', 'anch_d' pre-specified from known information 
        based on prior website inspection.

    Args:
        driver (obj): Selenium WebDriver object.
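        anch_a (str): XPath of the global web element (multi-profile pages).
        anch_b (str): XPath of the subset web element (multi-profile pages).
        anch_c (str): XPath of the name text (single-profile pages).
        anch_d (str): XPath of the section/unit text (single-profile pages).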
        n (int): Number of web request attempts after first failure.
        t (int): Load latency of website (seconds).

    Returns:
        list: List of lists containing strings of names and lab desc.

    Raises:
        ValueError: Cannot split strings.
        RuntimeError: Max 'n' number of attempts reached. 
    """

    ## load latency
    wait = WebDriverWait(
        driver = driver,
        timeout = t
    )

    ## try n times
    i = 1

    ## nav to link
    while i < n:
        try:

            ## -- multiple researcher profiles -- ##
            ## contains branch and section/unit columns
            try:

                ## global web elements
                element_all = wait.until(
                    method = EC.presence_of_element_located(
                        locator = (By.XPATH, anch_a)
                    )
                )

                ## global list
                element_all_lst = element_all.find_elements_by_tag_name(
                    name = "li"
                )

                ## collect text without reusing the retry counter 'i'
                people_all = [item.text for item in element_all_lst]

                ## global list to contain only names, edu, section/unit
                people_all = [i for i in people_all if "\n" in i]

                ## subset web elements
                element_sub = wait.until(
                    method = EC.presence_of_element_located(
                        locator = (By.XPATH, anch_b)
                    )
                )

                ## subset web element list
                element_sub_lst = element_sub.find_elements_by_tag_name(
                    name = "li"
                )

                ## collect text without reusing the retry counter 'i'
                people_sub = [item.text for item in element_sub_lst]

                ## subset web element list to contain only names, edu, section/unit
                people_sub = [i for i in people_sub if "\n" in i]

                ## section/unit list
                ## (local index 'k' keeps the retry counters 'i' and 'n' intact)
                people_sec = people_all[len(people_sub):]

                people_sec_spt = list()

                for k in range(0, len(people_sec)):
                    try:
                        people_sec_spt.append(
                            [people_sec[k].split('\n')[1], people_sec[k].split('\n')[0]]
                        )

                    except ValueError as err:
                        print(err.args, 'Cannot split strings in Sections and Units element.')

                ## branch list
                ## (local index 'k' keeps the retry counters 'i' and 'n' intact)
                people_sub_spt = list()

                for k in range(0, len(people_sub)):
                    try:
                        people_sub_spt.append(
                            people_sub[k].split('\n')
                        )
                    except ValueError as err:
                        print(err.args, 'Cannot split strings in People element.')

                ## combined branch and section/unit list
                people_all_spt = people_sec_spt + people_sub_spt

                ## remove duplicates (entries with more than two fields)
                ## rebuild the list rather than removing items while iterating
                people_all_spt = [p for p in people_all_spt if len(p) <= 2]

            ## -- single researcher profile -- ##
            ## does not contain branch and section/unit columns
            except:

                ## name, edu, section/unit
                people_all_spt = list()
                string_all_spt = list()

                ## name, section/unit text
                anchors = [
                    anch_c,
                    anch_d
                ]

                ## local index 'k' keeps the retry counters 'i' and 'n' intact
                for k in range(0, len(anchors)):
                    element_all = wait.until(
                        method = EC.presence_of_element_located(
                            locator = (By.XPATH, anchors[k])
                        )
                    )
                    string_all_spt.insert(k, element_all.find_element_by_xpath(
                            xpath = anchors[k]
                        ).text
                    )

                ## store list in list
                people_all_spt.insert(0, string_all_spt)

            finally:
                return people_all_spt

        ## try again on failure
        except:
            i += 1
            print('Unsuccessful request to link, trying again. Attempt: {x}'.format(
                    x = i
                )
            )
            pass

        ## time out failure on too many attempts
        if n == i:
            raise RuntimeError(
                'Unsuccessful request, now stopping. Max number of attempts.'
            )


## name and section pre-processing
def name_section_processor(data, feat_a, feat_b):

    """
    Desc:
        Addresses 'Section' feature. Subordinate function for determining
        duplicate observations by values in 'feat_b' encoded with a special
        character (comma). Removes redundant observations.

    Args:
        data (df): A valid DataFrame.
        feat_a (str): Reference column, typically 'Name'.
        feat_b (str): Modification column, typically 'Section'.
    
    Returns:
        DataFrame: Contains modified 'feat_b' feature.
    
    Raises:
        None.
    """
    
    ## make researcher name and section feats
    feats = [
        feat_a,
        feat_b,
    ]

    data = pd.DataFrame(
        data = data,
        columns = feats
    )

    ## remove multiple section names
    return data[~data[feat_b].str.contains(',')]


## name and education pre-processing
def name_educat_processor(data, feat_a, feat_b):

    """
    Desc:
        Creates new 'Education' feature. Removes job title and family suffix 
        from values in 'feat_a'. Utilizes remaining substring values to move 
        specified education credentials to 'feat_b'. Duplicate values in 
        'feat_a' will occur where there are multiple education credentials.

    Args:
        data (df): A valid DataFrame.
        feat_a (str): Reference column, typically 'Name'.
        feat_b (str): Creation column, typically 'Education'.
        
    Returns:
        DataFrame: Contains new 'feat_b' feature.
    
    Raises:
        None.
    """
    
    ## job title suffix
    titles = [
        'Chief',
        'Director',
        'Diplomate',
        'Senior Investigator',
        'Facility Veterinarian',
        'FRCPA Staff Clinician',
        'FRCPA',
        'Diplomate ACLAM',
        'ACLAM',
        'FAAAAI',
        'Acting',
        'Associate',
        'Staff Clinician'
    ]

    ## punctuation
    puncs = [
        ';',
        ',',
        '.'
    ]

    ## remove job title suffix and punctuation
    remove = titles + puncs

    for i in remove:
        data[feat_a] = data[feat_a].str.replace(
            pat = i,
            repl = '',
            case = True,
            regex = False
        )

    ## remove leading and trailing whitespace
    for i in data.columns:
        data[i] = data[i].str.strip()

    ## edu suffix
    edu = [
        'MA',
        'MSc',
        'MS',
        'MHSc',
        'MHS',
        'MPVM',
        'MPH',
        'MD',
        'ScD',
        'DSc',
        'DVM',
        'DPhil',
        'PhD',
        'Dr rer nat'
    ]

    ## make edu feature
    data.insert(
        loc = 1,
        column = feat_b,
        value = None
    )

    ## move edu from suffix to edu feature
    n = len(data)

    for i in range(0, n):
        for j in edu:
            if j in data[feat_a].iloc[i]:
                data[feat_a].iloc[i] = data[feat_a].iloc[i].replace(j, '')
                data[feat_a].iloc[i] = data[feat_a].iloc[i].replace('  ', ' ')
                if data[feat_b].iloc[i] is None:
                    data[feat_b].iloc[i] = j
                else:
                    data = data.append(data.iloc[i])
                    data[feat_b].iloc[i] = None
                    data[feat_b].iloc[i] = j

    ## assume credentials not listed
    data[feat_b].fillna(
        value = 'Other',
        inplace = True
    )

    ## remove duplicate edu
    for i in edu:
        data[feat_a] = data[feat_a].str.replace(
            pat = i,
            repl = '',
            case = True,
            regex = False
        )

    ## remove parentheses and their contents
    data[feat_a] = data[feat_a].str.replace(
            pat = r"\(.*\)",
            repl = '',
            regex = True
        )

    ## remove leading and trailing whitespace
    for i in data.columns:
        data[i] = data[i].str.strip()

    ## remove duplicate observations
    data.drop_duplicates(
        inplace = True
    )

    return data


## branch data and pre-processing
def branch_extractor(driver, data, feat_a, feat_b, anch_c, t):

    """
    Desc:
        Creates new 'Branch' feature. Uses web requests and DataFrame 
        referencing. For multiple researchers per 'feat_b', 'feat_a' will 
        contain different values than 'feat_b'. For one researcher per 'feat_b',
        'feat_a' and 'feat_b' contain the same values.

    Args:
        driver (obj): Selenium WebDriver object.
        data (df): A valid DataFrame.
        feat_a (str): Reference column, typically 'Section'.
        feat_b (str): Target column, typically 'Branch'.
        anch_c (str): XPath of the page heading.
        t (int): Load latency of website (seconds).

    Returns:
        Dataframe: Contains new 'feat_b' feature.
    
    Raises:
        ValueError: Could not locate web resource.
        ValueError: 'Branch' feature could not be created.
    """

    ## load latency
    wait = WebDriverWait(
        driver = driver,
        timeout = t
    )

    ## add feat
    n = len(data)

    ## multiple researchers
    if n > 1:
        try:
            element = wait.until(
                method = EC.presence_of_element_located(
                    locator = (By.XPATH, anch_c)
                )
            )

            feat_branch = element.find_element_by_xpath(
                xpath = anch_c
            ).text

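        ## a failed wait raises a Selenium exception, not ValueError;
        ## re-raise so 'feat_branch' is never left undefined
        except Exception as err:
            raise ValueError('Could not locate web resource (heading).') from err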

    ## single researcher
    elif n == 1:
        feat_branch = data[feat_a].iloc[0]

    else:
        raise ValueError('Could not create branch name feature.')

    data.insert(
        loc = 2,
        column = feat_b,
        value = feat_branch
    )

    return data


## modify section names
def section_processor(data, feat_a, feat_b):

    """
    Desc:
        Addresses 'Section' feature. Modifies string values to observations in 
        'feat_a' by including the corresponding string values found in 'feat_b' 
        if there are duplicate values in 'feat_a', where the values found in 
        'feat_b' are different.

    Args:
        data (df): A valid DataFrame.
        feat_a (str): Target column, typically 'Section'.
        feat_b (str): Reference column, typically 'Branch'.

    Returns:
        Dataframe: Annotated observations in 'feat_a' column.
    
    Raises:
        None.
    """

    ## add parenth for matching sections with different branch
    mix_sec = list()

    for i in data[feat_a].unique():
        if len(data[data[feat_a] == i][feat_b].unique()) > 1:
            mix_sec.append(i)

    for i in mix_sec:
        data_sec = data[data[feat_a] == i].copy()
        n = len(data_sec)
        for j in range(0, n):
            name_sec = (
                data_sec[feat_a].iloc[j] + ' ' + '(' + data_sec[feat_b].iloc[j] + ')'
            )
            data_sec[feat_a].iloc[j] = name_sec

        data.update(
            other = data_sec
        )

    ## reindex data
    data.reset_index(
        drop = True,
        inplace = True
    )

    return data
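
The credential handling in `name_educat_processor` can be illustrated without pandas. The sketch below is a simplified, token-based variant rather than the notebook's substring approach: `split_credentials` is a hypothetical helper, the name is a placeholder, and the multi-word credential 'Dr rer nat' is left out for brevity.

```python
# Single-token credentials taken from the 'edu' list above.
CREDENTIALS = {
    'MA', 'MSc', 'MS', 'MHSc', 'MHS', 'MPVM', 'MPH',
    'MD', 'ScD', 'DSc', 'DVM', 'DPhil', 'PhD'
}

def split_credentials(raw):
    """Split 'Name, Cred., Cred.' into (name, [credentials])."""
    # handle the same punctuation the notebook strips: ';', ',', '.'
    cleaned = raw.replace(';', ' ').replace(',', ' ').replace('.', '')

    name_tokens = list()
    creds = list()

    # whole-token matching avoids 'MS' matching inside 'MSc'
    for token in cleaned.split():
        if token in CREDENTIALS:
            creds.append(token)
        else:
            name_tokens.append(token)

    # same fallback label the notebook uses for unlisted credentials
    return ' '.join(name_tokens), creds or ['Other']

name, creds = split_credentials('Jane Q. Doe, M.D., Ph.D.')

# one row per credential, mirroring the duplicated 'Name' values
rows = [(name, c) for c in creds]
print(rows)
```

Matching whole tokens sidesteps the ordering concern in the `edu` list, where longer credentials such as 'MSc' need to be tested before 'MS'.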

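The retry logic used by `link_clicker` and `name_section_extractor` can be separated from Selenium entirely. The following is a minimal sketch under stated assumptions: `retry` and `flaky` are hypothetical names, and `attempts` here counts total tries, which may differ slightly from how `n` is counted above.

```python
def retry(action, attempts = 3, on_failure = None):
    """Call 'action' up to 'attempts' times and return its first result."""
    if attempts < 1:
        raise ValueError('attempts must be >= 1')

    for attempt in range(1, attempts + 1):
        try:
            return action()

        except Exception as err:
            print('Unsuccessful request, attempt {x} of {y}: {e}'.format(
                    x = attempt, y = attempts, e = err
                )
            )

            ## give up once the last attempt has failed
            if attempt == attempts:
                raise RuntimeError('Max number of attempts reached.') from err

            ## optional recovery step, e.g. reloading the page
            if on_failure is not None:
                on_failure()

# Stand-in for a flaky page interaction: fails twice, then succeeds.
calls = {'count': 0}

def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise TimeoutError('element not clickable yet')
    return 'clicked'

result = retry(flaky, attempts = 3)
print(result)
```

Keeping the counter inside a `for` loop means no code in the retried action can overwrite it, which is the main design difference from a manually incremented `while` loop.
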
### Namer

In [29]:
## modify first names
def first_namer(data, feat):

    """
    Desc:
        Addresses 'Name' feature. Performs ad hoc changes to abbreviated or 
        otherwise misspelled first names. This function should evolve as 
        errors in first names are recognized or better info becomes available.
     
    Args:
        data (df): A valid DataFrame.
        feat (str): Target column, typically 'Name'.

    Returns:
        DataFrame: Cleaned first name substring values in 'feat' column.
    
    Raises:
        None.
    """

    ## individual name corrections
    feat_repl = {

        ## slight abuse of original purpose
        ## info: https://ned.nih.gov/search/
        feat: {
            'Beth Fischer': 'Elizabeth Fischer',
            'David Hackstadt': 'Ted Hackstadt'
        }
    }

    data.replace(
        to_replace = feat_repl,
        inplace = True
    )

    return data


## modify middle names
def middle_namer(data, feat):

    """
    Desc:
        Addresses 'Name' feature. Inserts middle initial or name into string
        values that have matching first and last names when compared to another
        observation, but where the middle initial or name is absent.
     
    Args:
        data (df): A valid DataFrame.
        feat (str): Target column, typically 'Name'.

    Returns:
        DataFrame: Cleaned middle name substring values in 'feat' column.
    
    Raises:
        None.
    """

    ## clean names
    n = len(data) - 1

    for i in range(0, n):

        ## use full name containing middle initial
        name_one = str(
            data[feat].iloc[i].split()[0] + ' ' + 
            data[feat].iloc[i].split()[-1]
        )

        name_two = str(
            data[feat].iloc[i + 1].split()[0] + ' ' + 
            data[feat].iloc[i + 1].split()[-1]
        )

        if name_one == name_two:
            n_name_one = len(data[feat].iloc[i].split())
            n_name_two = len(data[feat].iloc[i + 1].split())

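            if n_name_one != n_name_two:
                ## use the longer form, which carries the middle initial or
                ## name (also guarantees 'name_use' is always defined)
                if n_name_one > n_name_two:
                    name_use = data[feat].iloc[i]
                else:
                    name_use = data[feat].iloc[i + 1]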

                data[feat].iloc[i] = name_use
                data[feat].iloc[i + 1] = name_use

            else:
                pass

        else:
            pass
    
    ## individual name corrections
    feat_repl = {

        ## slight abuse of original purpose
        ## info: https://ned.nih.gov/search/
        feat: {
            'Elizabeth Fischer': 'Elizabeth R Fischer', 
            'David Sacks': 'David L Sacks',
            'Daniella Schwartz': 'Daniella M Schwartz',
            'Richard Davey': 'Richard T Davey',
            'Louis Miller': 'Louis H Miller',
            'Catharine Bosio': 'Catharine M Bosio'
        }
    }

    data.replace(
        to_replace = feat_repl,
        inplace = True
    )

    ## remove duplicates
    data.drop_duplicates(
        inplace = True
    )

    return data


## modify last names
def last_namer(data, feat, reap = 3):

    """
    Desc:
        Addresses 'Name' feature. Modifies last name substring by correcting 
        errors when misspellings are assumed to be missing letters. Replaces 
        assumed misspelling with string value of greater length. Also removes 
        family name suffix.
     
    Args:
        data (df): A valid DataFrame.
        feat (str): Target column, typically 'Name'.
        reap (int): Number of characters to compare in last name substring.

    Returns:
        DataFrame: Cleaned last name substring values in 'feat' column.
    
    Raises:
        None.
    """

    ## clean names
    n = len(data) - 1

    for i in range(0, n):

        ## remove suffix from last names
        name_sir = data[feat].iloc[i].split()[-1]
        n_name_sir = len(name_sir)

        ## remove suffix errors
        if n_name_sir == 1:
            data[feat].iloc[i] = data[feat].iloc[i][:-2]

        ## remove family suffix
        if n_name_sir <= 3:
            fam_suf = [
                'III',
                'II',
                'Jr',
                'Sr'
            ]

            for j in fam_suf:
                if name_sir == j:
                    data[feat].iloc[i] = ' '.join(data[feat].iloc[i].split()[:-1])

        else:
            pass

        ## replace missing letters in last names 
        ## assumes same last name for first 'reap' repeated letters
        name_giv_one = data[feat].iloc[i].split()[0]
        name_giv_two = data[feat].iloc[i + 1].split()[0]

        name_sir_one = data[feat].iloc[i].split()[-1]
        name_sir_two = data[feat].iloc[i + 1].split()[-1]

        if name_giv_one == name_giv_two and name_sir_one[0:reap] == name_sir_two[0:reap]:
            n_name_sir_one = len(name_sir_one)
            n_name_sir_two = len(name_sir_two)

            if n_name_sir_one != n_name_sir_two:
                if n_name_sir_one > n_name_sir_two:
                    data[feat].iloc[i + 1] = data[feat].iloc[i]

                if n_name_sir_one < n_name_sir_two:
                    data[feat].iloc[i] = data[feat].iloc[i + 1]

            else:
                pass

        else:
            pass


    ## individual name corrections
    feat_repl = {

        ## slight abuse of original purpose
        ## info: https://ned.nih.gov/search/
        feat: {
            'Jennifer M Cuellar-Rodriguez': 'Jennifer M Cuellar-Rodríguez',
            'Sumati Ragagopalan': 'Sumati Rajagopalan'
        }
    }

    data.replace(
        to_replace = feat_repl,
        inplace = True
    )

    return data


## naming wrapper
def name_processor(data, feat, reap):

    """
    Desc:
        Addresses 'Name' feature. Unifies subordinate naming functions for 
        string and substring values. Utilized for later implementation in
        web scraping procedure.
     
    Args:
        data (df): A valid DataFrame.
        feat (str): Target column, typically 'Name'.
        reap (int): Number of characters to compare in last name substring.

    Returns:
        DataFrame: Cleaned name string values in 'feat' column.
    
    Raises:
        None.
    """

    ## remove whitespace
    data[feat] = data[feat].str.strip()

    data.replace(
        to_replace = {' +':' '},
        regex = True,
        inplace = True
    )

    ## sort data by names
    data.sort_values(
        by = feat,
        ascending = False,
        inplace = True
    )
    
    ## first
    data = first_namer(
        data = data,
        feat = feat
    )

    ## middle
    data = middle_namer(
        data = data,
        feat = feat
    )

    ## last
    data = last_namer(
        data = data,
        feat = feat,
        reap = reap  ## num char to compare
    )

    return data
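
The adjacent-row comparison in `middle_namer` depends on the names being sorted first, as `name_processor` does. A plain-Python sketch of the same idea follows; `reconcile_middle` is a hypothetical helper and the names are placeholders for illustration only.

```python
def reconcile_middle(names):
    """Give adjacent names with the same first/last name the fuller form."""
    ## same descending ordering as name_processor
    names = sorted(names, reverse = True)

    for k in range(len(names) - 1):
        one = names[k].split()
        two = names[k + 1].split()

        ## same person if first and last tokens match
        same_person = (one[0], one[-1]) == (two[0], two[-1])

        if same_person and len(one) != len(two):
            ## keep the variant that carries the middle initial or name
            fuller = names[k] if len(one) > len(two) else names[k + 1]
            names[k] = fuller
            names[k + 1] = fuller

    return names

result = reconcile_middle(['Jane Doe', 'Jane Q Doe', 'John Roe'])
print(result)
```

Because only neighbouring rows are compared, two variants of a name are reconciled only when sorting places them next to each other.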

### Scraper

In [30]:
## web scraper
def data_scraper(feats, driver, url, anch_a = 354, anch_b = 374, n = 3, t = 30):

    """
    Desc:
        Primary web scraping function. Unifies all subordinate functions to 
        traverse the website through anchor IDs and pre-process the data 
        returned in a DataFrame.

    Args:
        feats (tuple): Columns as string values.
        driver (obj): Selenium WebDriver object.
        url (str): URL of website.
        anch_a (int): Anchor ID at start of website traversal.
        anch_b (int): Anchor ID at end of website traversal.
        n (int): Number of web request attempts after first failure.
        t (int): Load latency of website (seconds).

    Returns:
        DataFrame: 'Name', 'Education', 'Branch', 'Section' features.
    
    Raises:
        TypeError: Incorrect data type in an argument.
    """

    ## arg quality
    if type(feats) is not tuple:
        raise TypeError('feats arg requires tuple of str.')
    
    if driver is None:
        raise TypeError('driver arg requires Selenium WebDriver object.')
    
    if type(url) is not str:
        raise TypeError('url arg requires a valid str.')
    
    if type(anch_a) is not int:
        raise TypeError('anch_a arg requires a pos int within anchor ID range.')

    if type(anch_b) is not int:
        raise TypeError('anch_b arg requires a pos int within anchor ID range.')
    
    if type(n) is not int or n < 1:
        raise TypeError('n arg requires a pos int.')
    
    if type(t) is not int or t < 1:
        raise TypeError('t arg requires a pos int.')

    ## make dataframe
    feats = list(feats)
    data = pd.DataFrame(
        columns = feats
    )

    ## traverse lab desc
    for i in tqdm(range(anch_a, anch_b), 
        ascii = True, 
        desc = "Scraping Data from NIAID DIR Laboratory Descriptions"
        ):

        ## link nav
        link_clicker(
            url = url,
            driver = driver,
            anch_x =  'anch_{x}'.format(x = i),
            n = n,
            t = t
        )

        ## -- name, edu, section -- ##
        ## name and section raw extraction
        data_list = name_section_extractor(
            driver = driver,
            anch_a = '//*[@class="block block-layout-builder block-field-blocknodedivisionfield-subtopic-division"]',
            anch_b = '//*[@class="clearfix text-formatted field field--name-field-body field--type-text-long field--label-hidden field__item"]',
            anch_c = '//h1',
            anch_d = '//*[@id="anch_346"]',
            n = n,
            t = t
        )

        ## name and section feat processing
        data_loop = name_section_processor(
            data = data_list,
            feat_a = feats[0],
            feat_b = feats[3]
        )

        ## education feat processing
        data_loop = name_educat_processor(
            data = data_loop,
            feat_a = feats[0],
            feat_b = feats[1]
        )

        ## -- branch and section -- ##
        # branch raw extraction
        data_loop = branch_extractor(
            driver = driver,
            data = data_loop,
            feat_a = feats[3],
            feat_b = feats[2],
            anch_c = '//h1',
            t = t
        )

        ## -- global -- ##
        ## create data
        data = data.append(
            other = data_loop,
            ignore_index = True
        )

    ## name feat processing
    data = name_processor(
        data = data,
        feat = feats[0],
        reap = 2
    )

    ## branch and section processing
    data = section_processor(
        data = data,
        feat_a = feats[3],
        feat_b = feats[2]
    )

    return data


## data wrangling
def data_cleaner(data, feats):

    """
    Desc:
        Utilized for global data cleaning results from web scraping and data
        manipulation. Removes whitespace, sorts, and resets index in DataFrame.
    
    Args:
        data (df): A valid DataFrame.
        feats (list): List of column names as string values.

    Returns:
        DataFrame: Cleaned and sorted assumed for export to disk.
    
    Raises:
        None.
    """

    ## remove whitespace
    for i in data.columns:
        data[i] = data[i].str.strip()

    data.replace(
        to_replace = {' +':' '},
        regex = True,
        inplace = True
    )

    ## arbitrary sorting
    data.sort_values(
        by = [feats[2], feats[0]],
        inplace = True
    )

    ## reset index
    data.reset_index(
        drop = True,
        inplace = True
    )

    return data
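
`data_scraper` visits the anchor IDs in the half-open range from `anch_a` up to, but not including, `anch_b`. The short sketch below reproduces that ID generation with the default values; `anchor_ids` is a hypothetical helper used only for illustration.

```python
def anchor_ids(start = 354, stop = 374):
    """Return the anchor IDs visited for a half-open [start, stop) range."""
    return ['anch_{x}'.format(x = i) for i in range(start, stop)]

ids = anchor_ids()
print(len(ids), ids[0], ids[-1])
```

With the defaults this yields 20 IDs, from `anch_354` through `anch_373`, so `anch_b` itself is never requested.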

### Execution

In [31]:
## -- download geckodriver -- ##
gecko_downloader(os = 'win64')

## -- local machine input -- ##
exe_path = './geckodriver.exe'
csv_path = '../data/niaid-dir-org.csv'

## webdriver settings
option_flags = [
    '--headless',
    '--no-sandbox',
    '--start-maximized',
    '--ignore-certificate-errors',
    '--disable-extensions'
]

options = driver_options(
    opt = option_flags
)

profile = driver_profile()

## webdriver initialization
driver = webdriver.Firefox(
    executable_path = exe_path,
    firefox_profile = profile,
    options = options
)

## data features *strictly* named and ordered
feats = tuple((
    'Name',
    'Education',
    'Branch',
    'Section'
    )
)

## web scraping and processing
data = data_scraper(
    feats = feats,
    driver = driver,
    url = 'https://www.niaid.nih.gov/research/division-intramural-research-labs',
    t = 12
)

## data processing
data = data_cleaner(
    data = data,
    feats = feats
)

## data preview
data.info()

GeckoDriver detected. Proceeding to web requests.


Scraping Data from NIAID DIR Laboratory Descriptions:  35%|###5      | 7/20 [00:42<01:00,  4.69s/it]

### Exporter

In [None]:
## data to disk
data.to_csv(
    path_or_buf = csv_path,
    index = False
)

### End