| <div> <img src="https://storage.googleapis.com/open-ff-common/openFF_logo.png" width="100"/></div>|      |<h1>Build and curate external data sets</h1>|
|---|---|---|

The primary mission of the Open-FF project is to make the chemical data in the FracFocus.org disclosure instrument easy to access and to provide big-picture overviews of it. That mission relies on both the FracFocus data and external data sets that clarify the FracFocus data and put it in perspective.

This notebook outlines these external data sets and how they are acquired, reformatted (if necessary), and made available in Open-FF.

In [None]:
!git clone https://github.com/gwallison/intg_support.git &>/dev/null;
!pip install itables  &>/dev/null;
!pip install geopandas  &>/dev/null;

In [None]:
# %run intg_support/local_frack_steps.py

## This notebook is used to curate the data sets<br> that are used in conjunction with Open-FF

**Data sets**
- Census state and county shapefiles
- EPA lists harvested from CompTox, meta
- TSCA UVCB list
- California's Prop 65 list
- USGS PADUS 3.0
- State provided location data
- TEDX EDC data
- EPA's diesel list
- Elsner et al chemical summary
- Reportable quantities list
- TSCA list
- NPDWR list
- DHS list of schools, day cares and nursing homes
- SciFinder and CompTox name and synonym data
- SkyTruth scrape data
- FFV1 scrape data
- NM state-held disclosures
- OH drilling data
- Summary of FF archived metadata: for silent changes and publication delay

**Possibles**
- EPA's EJscreen
- Well-scale production numbers
- EDF's database
- Shapefile for geologic "plays"
- PA violations
- PA waste data
- [Historical production](https://www.sciencebase.gov/catalog/item/632b67a5d34e900e86c509ce)





# PADUS - by USGS
This comprehensive data set allows us to find wells that are on federal, state, or Native lands, and can give details about those lands. Because the data set is particularly large, we create a compiled version and 'pickle' it here for use in Open-FF generation tasks. This requires processing 11 separate regional files, which can take a long time.


In [8]:
def process_PADUS(sources=r"C:\MyDocs\OpenFF\data\external_refs",
                  outdir='./tmp/'):
    """Combine the 11 regional PADUS 3.0 shapefiles into one pickled GeoDataFrame."""
    import os
    import pandas as pd
    import geopandas
    final_crs = 4326 # EPSG code for bgLat/bgLon: 4326 = WGS84 (as in Google Maps)
    os.makedirs(outdir, exist_ok=True) # make sure the output directory exists
    pkl_name = os.path.join(outdir,'padus.pkl')
    print(pkl_name)
    print('  -- fetch PADUS from zip files')
    allshp = []
    for i in range(1,12):
        # each regional shapefile is read from inside its zip archive
        shp_fn = os.path.join(sources,'shape_files',
                              f'PADUS3_0_Region_{i}_SHP.zip!PADUS3_0Combined_Region{i}.shp')
        shpdf = geopandas.read_file(shp_fn).to_crs(final_crs)
        allshp.append(shpdf)
        print(f'     PADUS {i} file processed') # report only after the file is read

    # stack all regions into a single GeoDataFrame in the common CRS
    shdf = geopandas.GeoDataFrame(pd.concat(allshp,
                                            ignore_index=True),
                                  crs=allshp[0].crs)
    shdf.to_pickle(pkl_name)

process_PADUS()

./tmp/padus.pkl
  -- fetch PADUS from zip files
     PADUS 1 file processed
     PADUS 2 file processed
     PADUS 3 file processed
     PADUS 4 file processed
     PADUS 5 file processed
     PADUS 6 file processed
     PADUS 7 file processed
     PADUS 8 file processed
     PADUS 9 file processed
     PADUS 10 file processed
     PADUS 11 file processed
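
Because the PADUS run is slow, it may be worth confirming that all 11 regional zip files are in place before starting. The following is a minimal sketch that assumes the same `shape_files` folder layout and file names used by `process_PADUS`. The helper `missing_padus_zips` is hypothetical and is not part of Open-FF.

```python
import os

def missing_padus_zips(sources, n_regions=11):
    """Return the expected PADUS region zip paths that are not found on disk.

    Assumes the naming pattern used by process_PADUS:
    <sources>/shape_files/PADUS3_0_Region_<i>_SHP.zip
    """
    missing = []
    for i in range(1, n_regions + 1):
        zip_fn = os.path.join(sources, 'shape_files',
                              f'PADUS3_0_Region_{i}_SHP.zip')
        if not os.path.isfile(zip_fn):
            missing.append(zip_fn)
    return missing
```

An empty list means every expected file was found. Otherwise, the returned paths show which downloads are still needed before calling `process_PADUS`.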


# Location data

## State and county boundaries
We use these boundaries to check that the lat/lon recorded for a well is consistent with the state and county identification embedded in its API number.
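
As an illustration of the API side of that check, the sketch below pulls the state and county codes out of an API number. In the standard API well number layout, the first two digits are the state code and the next three are the county code. The function names here are hypothetical, and the sketch assumes the API number is available as a string with its leading zeros intact. API state codes are not the same as Census FIPS state codes, so the codes derived from the lat/lon are assumed to have already been translated into API numbering (that lookup is not shown).

```python
def split_api_number(api):
    """Return (state_code, county_code) from an API well number string.

    Hyphens and spaces are ignored. Assumes a 10-, 12-, or 14-digit
    API number with leading zeros preserved.
    """
    digits = ''.join(ch for ch in str(api) if ch.isdigit())
    if len(digits) not in (10, 12, 14):
        raise ValueError(f'unexpected API number length: {api!r}')
    return digits[:2], digits[2:5]

def api_matches_location(api, located_state, located_county):
    """True if the API-embedded codes agree with codes found from lat/lon.

    located_state and located_county are assumed to be expressed in
    API numbering already (hypothetical inputs for this sketch).
    """
    state, county = split_api_number(api)
    return state == located_state and county == located_county
```

For example, `split_api_number('42-501-20130')` returns `('42', '501')`. A well whose lat/lon falls in a different county than the one its API number encodes would be flagged by `api_matches_location` returning `False`.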

Source

# Chemistry data

## California's Proposition 65 list of chemicals

Source: https://oehha.ca.gov/proposition-65/proposition-65-list
Most recent version downloaded: 