| <div> <img src="https://storage.googleapis.com/open-ff-common/openFF_logo.png" width="100"/></div>|      |<h1>Adding New Disclosures to the Open-FF Data:<br><br>Download, Curate, Assemble, Test, and Archive<br></h1>|
|---|---|---|

In [None]:
# !git clone https://github.com/gwallison/intg_support.git &>/dev/null;
# !pip install itables  &>/dev/null;
# !pip install geopandas  &>/dev/null;

In [1]:
%run intg_support/create_new_set_steps.py


# Set up
Construct a workspace and collect the resources needed.

## Create directories and fetch previous repository as a reference
Before we start downloading new FracFocus data, we set up a working directory structure and collect the resources we need.

**Directories constructed**
| directory name | description |
| ---: | :--- |
|**orig_dir**| expanded zip files, downloaded external files, etc.; these files serve as a model for the next round but are not saved directly|
|**work_dir**| This is the working directory where new curation files created by these routines are kept. These 'generated' files are saved at the end of the process into either the repository or other archives.|
|**ext**| non-FracFocus data files used in constructing the Open-FF data set |
|**final**| the place for final files, archives and repositories. |
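
As a rough illustration of the layout in the table above, the sketch below builds the same folder tree with `pathlib`. It is not the code behind `create_and_fill_folders`; the helper name `make_workspace` is hypothetical, and the sub-folder names are taken from the log that `create_and_fill_folders` prints.

```python
from pathlib import Path

# Top-level names come from the table above. The helper is an illustrative
# assumption, not the real implementation in intg_support.
TOP_LEVEL = ["orig_dir", "work_dir", "ext", "final"]
REPO_SUBDIRS = ["pickles", "curation_files", "CAS_ref_files", "CompTox_ref_files"]


def make_workspace(root):
    """Create the workspace folders under `root`; return the paths newly created."""
    root = Path(root)
    created = []
    for name in TOP_LEVEL:
        path = root / name
        if not path.exists():
            path.mkdir(parents=True)
            created.append(path)
    # orig_dir and final both mirror the repository layout
    for parent in ("orig_dir", "final"):
        for sub in REPO_SUBDIRS:
            path = root / parent / sub
            if not path.exists():
                path.mkdir(parents=True)
                created.append(path)
    return created
```

Calling `make_workspace(".")` a second time returns an empty list, which mirrors the "Directory exists" behavior of the real step.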

In [2]:
# Control download: typically set to True
#    set to False if you can skip the downloading part of the repo and the external data, for example, during testing.

download_repo_flag = True
download_ext_flag = False
download_FF_flag = True
# unpack_to_orig_flag = False
create_raw_flag = True


In [3]:
create_and_fill_folders(download_repo_flag)

Directory exists: orig_dir
Creating directory: orig_dir\pickles
Creating directory: orig_dir\curation_files
Creating directory: orig_dir\CAS_ref_files
Creating directory: orig_dir\CompTox_ref_files
Directory exists: work_dir
Directory exists: work_dir\new_CAS_REF
Directory exists: work_dir\new_COMPTOX_REF
Directory exists: final
Creating directory: final\pickles
Creating directory: final\curation_files
Creating directory: final\CAS_ref_files
Creating directory: final\CompTox_ref_files
Creating directory: final\intg_support
Directory exists: ext

Fetching repository files:
  -- CAS_ref_files
  -- CompTox_ref_files
  -- curation_files
  -- pickles

Date of downloaded (last) repo: 2023-06-24


<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/26/2023 07:43:15</div>

## Download external files used to assemble final data set
To do:
- make list of external databases and when they were compiled
- reorg so that you don't have to download everything.

In [4]:

get_external_files(download_ext_flag)

<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> Completed without new external download </h3> 07/26/2023 07:43:20</div>

## Download raw files from FracFocus
To do:
- refactor comparison with previous download


In [5]:

download_raw_FF(download_FF_flag)

Downloading FracFocus data from http://fracfocusdata.org/digitaldownload/fracfocuscsv.zip
No archived data to compare against; skipping comparison.
in add_to_repo_info: bulk_download_date,2023-07-26
in add_to_repo_info: FF_archive_filename,ff_archive_2023-07-26.zip


<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/26/2023 07:44:28</div>

## Create master raw FracFocus set as file

In [6]:

create_master_raw_df(create_raw_flag)

 -- processing FracFocusRegistry_1.csv
 -- processing FracFocusRegistry_2.csv
 -- processing FracFocusRegistry_3.csv
 -- processing FracFocusRegistry_4.csv
 -- processing FracFocusRegistry_5.csv
 -- processing FracFocusRegistry_6.csv
 -- processing FracFocusRegistry_7.csv
 -- processing FracFocusRegistry_8.csv
 -- processing FracFocusRegistry_9.csv
 -- processing FracFocusRegistry_10.csv
 -- processing FracFocusRegistry_11.csv
 -- processing FracFocusRegistry_12.csv
 -- processing FracFocusRegistry_13.csv
 -- processing FracFocusRegistry_14.csv
 -- processing FracFocusRegistry_15.csv
 -- processing FracFocusRegistry_16.csv
 -- processing FracFocusRegistry_17.csv
 -- processing FracFocusRegistry_18.csv
 -- processing FracFocusRegistry_19.csv
 -- processing FracFocusRegistry_20.csv
 -- processing FracFocusRegistry_21.csv
 -- processing FracFocusRegistry_22.csv
 -- processing FracFocusRegistry_23.csv
 -- processing FracFocusRegistry_24.csv
 -- processing FracFocusRegistry_25.csv


<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/26/2023 07:47:12</div>

## Add new disclosures to UploadKey file

In [7]:

update_upload_date_file()

Number of new disclosures added to list: 1175


<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/26/2023 07:47:25</div>

# Curation steps
These steps are a mix of automated and hand-performed curation tasks. The hand-performed tasks require the user to examine database values in spreadsheets and then to make and record decisions about those values for individual records.



## `CASNumber` and `IngredientName` curation tasks

Open-FF uses both raw input fields `CASNumber` and `IngredientName` to clarify chemical identity in each record.  These two fields **should** agree on the identity, but often only one field provides unambiguous identification (usually `CASNumber`) and sometimes the two conflict.  Our target is an accurate `bgCAS`, which is our "best guess" at a CAS Registry Number for the material reported in the disclosure.
That is,
> Unique `CASNumber` | `IngredientName` pair  $\rightarrow$ `bgCAS`

There are currently over 28,000 unique pairs.

The curation process outlined below gets our identification as close as possible to our target.  It requires using several sources of information and part of the process includes collecting that information.  Some steps are partially automated whereas other steps require our judgement and are therefore manual.  

This process is also incremental - we only need to curate the *new* chemical identifiers in the most recent download.  However, this process can also be used to examine the whole curated set to refine identification performed earlier.  
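
The incremental idea can be sketched as a lookup keyed on the (`CASNumber`, `IngredientName`) pair: any pair from the new download that has no `bgCAS` yet is what needs curation. The data and the helper name `find_new_pairs` below are hypothetical, and the real routines work on the full curation files rather than small in-memory structures.

```python
# Hypothetical stand-ins for the curated table and a fresh download.
# Keys are (CASNumber, IngredientName) pairs; values are the assigned bgCAS.
curated = {
    ("50-00-0", "formaldehyde"): "50-00-0",
    ("7732-18-5", "water"): "7732-18-5",
}

raw_pairs = [
    ("50-00-0", "formaldehyde"),
    ("7732-18-5", "water"),
    ("007647-14-5", "kcl"),
    ("7732-18-5", "water"),
]


def find_new_pairs(raw_pairs, curated):
    """Return unique pairs that have no curated bgCAS yet, in first-seen order."""
    seen = set()
    new_pairs = []
    for pair in raw_pairs:
        if pair not in curated and pair not in seen:
            seen.add(pair)
            new_pairs.append(pair)
    return new_pairs


new_pairs = find_new_pairs(raw_pairs, curated)
print(new_pairs)
```

Only the uncurated pair is returned, so the manual work scales with what is new in the download, not with the size of the whole set.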

Resources needed to create the CAS-Ing list:
- CAS_curated: a list of `CASNumber` values and the tentative `bgCAS` number they imply.
- `IngredientName` synonym list: list of synonyms (and associated CAS number) to weigh against `IngredientName`. This is created from a collection of CAS and CompTox references.
- `TradeName` values associated with CAS-Ing pairs -- this aspect is still in development, though curators may manually examine TradeNames to make decisions.
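
To show how a synonym list can be weighed against `IngredientName`, here is a minimal sketch using Python's standard `difflib`. This is an assumption for illustration only: the notebook does not say which matching method Open-FF uses, and the synonym list and the helper name `closest_synonyms` are hypothetical.

```python
import difflib

# Hypothetical synonym list: name -> CAS number. The real list is built from
# the collection of CAS and CompTox references.
synonyms = {
    "methyl oleate": "112-62-9",
    "potassium chloride": "7447-40-7",
    "ethylene glycol": "107-21-1",
}


def closest_synonyms(ingredient_name, synonyms, cutoff=0.8):
    """Return (synonym, CAS, similarity) tuples for names close to the input."""
    name = ingredient_name.strip().lower()
    matches = difflib.get_close_matches(name, list(synonyms), n=3, cutoff=cutoff)
    return [
        (m, synonyms[m], round(difflib.SequenceMatcher(None, name, m).ratio(), 3))
        for m in matches
    ]


# A misspelled name still finds its likely synonym; a functional
# description such as "friction reducer" finds nothing.
print(closest_synonyms("methyl oelate", synonyms))
print(closest_synonyms("friction reducer", synonyms))
```

A similarity score like this is only a clue for the curator. When the name-based CAS number disagrees with the reported `CASNumber`, a person still has to decide which one to trust.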

### Step A - Use the previous list to find any new `CASNumber` values
1. Compare the list of `CASNumber` values in rawdf to the list of `CASNumber` values in *olddir/curation_files/cas_curated.csv*.
1. Make and display a list of the new ones.

#### Next steps for YOU:
>Next steps for **new** `CASNumber` values (see note below):
>- If the implied chemical is not in the CAS references, go to SciFinder and make a new entry (manual!).
>- Otherwise, you can skip the SciFinder steps, but go to the CAS_curate step.
>
>If there are no **new** `CASNumber` values,
>- skip ahead to moving the current CAS_curate.csv to *newdir/curation_files/CAS_curate.csv*.

Note: these "new" `CASNumber` values can be completely new chemicals or just a new version of an already used material (for example, we might find '00000050-00-0' for the authoritative CASRN '50-00-0' that is already documented in Open-FF).  They may also be something that will not resolve into a valid CASRN, for example, 'proprietary by operator'. You will assign appropriate `bgCAS` values in the curation step.
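
The note above can be made concrete with a small sketch that strips padding from a raw `CASNumber`, checks the standard CAS check digit, and lists raw values not yet in the curated list. The function names are hypothetical and the cleaning rule (dropping leading zeros) is an assumption; the actual cleaning in `cas_curate_step1` may differ.

```python
import re


def clean_cas(raw):
    """Strip whitespace and leading zeros; return None if not CAS-shaped."""
    match = re.fullmatch(r"\s*0*(\d{2,7})-(\d{2})-(\d)\s*", raw)
    if match is None:
        return None
    return "-".join(match.groups())


def cas_checksum_ok(cas):
    """Verify the CAS check digit: weighted digit sum (from the right) mod 10."""
    body, check = cas.replace("-", "")[:-1], int(cas[-1])
    total = sum(w * int(d) for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check


def new_cas_numbers(raw_values, curated_values):
    """Return raw CASNumber strings that are not yet in the curated list."""
    return sorted(set(raw_values) - set(curated_values))


print(clean_cas("00000050-00-0"))            # padded form of 50-00-0
print(clean_cas("proprietary by operator"))  # not CAS-shaped
print(cas_checksum_ok("50-00-0"))
```

A value that cleans to a CAS-shaped string and passes the check digit is only a tentative identification; whether it is the right chemical for the record is still decided in the curation step.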

In [8]:

cas_curate_step1()

Unnamed: 0,CASNumber,clean_wo_work,tent_CAS,valid_after_cleaning,auto_status,tent_is_in_ref,deprecated_replacement


## Go to STEP B: Use SciFinder for `CASNumber`s not in reference already

<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/26/2023 07:47:38</div>

### Step B - Add chem info from SciFinder of new tentative CAS numbers into CAS_ref files
If a chemical on this list hasn't been seen before in FF, we need to add some information to the CAS_ref files.
To do that, we currently use SciFinder, a product of the Chemical Abstracts Service, the naming authority for materials.

Save the file in the `new_CAS_REF` directory of **work_dir**.

Once you have saved the file, run the following cell to **verify** that the new `CASNumber` values that you found to be valid, and created a reference for, actually made it into the SciFinder reference.



In [9]:
cas_curate_step2()

     -- including added files: ['CAS_ref_by_hand_2023_07-25.txt']
 - New CompTox current search file not available, using repo version
 - Using backup version of comptox_batch_results.xlsx for synonyms
 - Using repo version of comptox_batch_results_broad.xlsx for synonyms

Number of new CAS lines to curate: 2



### Step C - Curate the CAS_to_curate file
In this step you will manually edit the *work_dir/CAS_curated_TO_EDIT.csv* file to curate the new `CASNumber` values. There are typically only a handful of lines in this file that you need to curate, just those newly discovered in the latest FF download. 

Your task is to assign a `curatedCAS` value for each new line, using the clues there.

Once you have completed the editing, save the file back to *work_dir/CAS_curated_modified.csv*.
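
Before moving on, it can help to confirm that no line was left without a `curatedCAS` value. The sketch below shows one way to do that check with the standard `csv` module. The sample rows and the helper name `uncurated_rows` are hypothetical, and the real verification is done by `cas_curate_step3()`.

```python
import csv
import io


def uncurated_rows(csv_text):
    """Return the CASNumber of every row whose curatedCAS cell is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["CASNumber"]
        for row in reader
        if not (row.get("curatedCAS") or "").strip()
    ]


# Hypothetical excerpt of an edited curation file; the last row was missed.
sample = """CASNumber,curatedCAS
00000050-00-0,50-00-0
proprietary by operator,proprietary
112-62-9,
"""

print(uncurated_rows(sample))
```

To check a saved file instead of a string, read it with `open(path, newline="")` and pass its contents to the same function.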

#### Step C.1 - Make sure all `CASNumber` values have been curated

In [10]:

cas_curate_step3()

<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/26/2023 07:53:46</div>

---
### Step D - Update CompTox data
The metadata from EPA's CompTox system informs Open-FF in a number of ways. Now that we have a full CAS list, we need to update the CompTox data. 
**If no new CAS numbers were added to CAS_curated.csv, you can skip this step and jump to the CAS | Ing processing.**

1. Open [CompTox batch search](https://comptox.epa.gov/dashboard/batch-search)
1. Under "Select Input Type(s)", check "CASRN"
1. Open the *work_dir/comptox_search_list.csv* file in a spreadsheet program such as Excel or OpenOffice. (If you can't find this file, there may be no new CAS numbers, and you may not need to run this step.)
1. Copy all CAS numbers in the `curatedCAS` column and paste them into the "Enter Identifiers to Search" box on the CompTox page.  You can skip non-CAS values such as 'proprietary'; they mean nothing to CompTox.
1. On the CompTox page, click the "Choose Export Options" button.
1. Under "Choose Export Format," select "Excel."
1. In the "Chemical Identifiers" section, make sure that the following are checked (and no others):
    - Chemical Name
    - DTXSID
    - IUPAC Name
1. Under the "Presence in Lists" table, click the check box in the header that selects ALL lists.  This will be used to map which lists each chemical belongs to.
1. Under "Enhanced Data Sheets," select "Synonyms and Identifiers" (**This is currently broken on the EPA site, see below**)
1. Finally, click "Download Export File". This can take several minutes, or even stall if EPA's servers are under heavy load.
    - **NOTE: recently this "Download Export File" process has not completed but hangs indefinitely.**  The current work-around is to deselect the "Synonyms and Identifiers" checkbox and click "Download Export File" again. You won't be able to get the synonym data, but you will still be able to update the name data.
1. Once the file has been downloaded to your machine, RENAME it "comptox_batch_results.xlsx" (don't open it, just rename it!)
1. Move that file to *work_dir*
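
If copying from a spreadsheet is awkward, the text to paste into the batch-search box can also be prepared with a few lines of Python. This is only a sketch: the helper name `comptox_search_text` is hypothetical, and it assumes that anything not shaped like a CAS number (such as 'proprietary') should be dropped.

```python
import re

# Two to seven digits, two digits, one check digit
CAS_PATTERN = re.compile(r"\d{2,7}-\d{2}-\d")


def comptox_search_text(curated_values):
    """Return CAS-shaped values, de-duplicated, one per line."""
    keep = []
    for value in curated_values:
        value = value.strip()
        if CAS_PATTERN.fullmatch(value) and value not in keep:
            keep.append(value)
    return "\n".join(keep)


# Hypothetical contents of the curatedCAS column
print(comptox_search_text(["50-00-0", "proprietary", "112-62-9", "50-00-0 "]))
```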

### Next, we fetch a fresh version of the CompTox Lists metadata:

- Go to the [EPA CompTox **list** page](https://comptox.epa.gov/dashboard/chemical-lists)
- Click on the "Export" button in the upper right, and select the "Excel" option.  This will download a file to your computer.
- Rename that file (without opening it) as "comptox-chemical-lists-meta.xlsx" and move it to work_dir.

Then run the following cell.


In [11]:
update_CompTox_lists()

     -- including added files: ['CAS_ref_by_hand_2023_07-25.txt']
 - Using work_dir version of comptox_batch_results.xlsx for name list.
 - Using fresh version of comptox_batch_results.xlsx for synonyms
 - Using repo version of comptox_batch_results_broad.xlsx for synonyms
 - Using repo version for comptox-chemical-lists-meta.xlsx
 - Using work_dir version of comptox_batch_results.xlsx for lists table.


<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/26/2023 07:55:27</div>

---
## Start CAS|Ing processing

In [12]:


casing_step1()

Number of new CAS|ING pairs: 45
0: processing:< 00-00-0 > < no ingredients as hazardous according to osha 29 19:10 12:00 >
1: processing:< 000107-21-1 > < friction reducer >
2: processing:< 007647-14-5 > < kcl >
3: processing:< 100-41-4 > < silica substrate >
4: processing:< 112-62-9 > < methyl oelate >
5: processing:< 154518-36-2 > < alcohols, c9-11-iso-, c10-rich, ethoxylated propoxylated >
6: processing:< 22042-96-2 > < phosphonic acid >
7: processing:< 227310-69-2 > < poly(oxy-1,2-ethanediyl), .alpha.-(carboxymethyl)-.omega.-hydroxy-,c16-18 and c18-unsatd. alkyl ethers >
8: processing:< 26062-79-3 > < copolymer >
9: processing:< 26172-55-4 > < 5-chloro-2-methyl-3(2h)- isothaiazolone >
10: processing:< 3844-45-9 > < acid blue 9, disodium salt >
11: processing:< 39202-17-0 > < methyl 9-dedecenoate >
12: processing:< 57635-48-0 > < poly(oxy-1 ,2-wthanediyl), alpha-(carboxymethyl)-omega-[(9z)-9-octadecen-1-yloxy]- >
13: processing:< 57635-48-0 > < poly(oxy-1,2-ethanediyl), a- (carboxym

CASNumber,curatedCAS,IngredientName,recog_syn,synCAS,match_ratio,n_close_match,source,bgCAS,rrank,picked


<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/26/2023 07:56:11</div>

### Step E - Curate the new CAS|Ing pairs 

- Make any desired changes to casing_TO_CURATE.csv.
- Save the file as *casing_modified.csv* in **work_dir**.
- Then run the following code.  This step keeps only those lines where 'picked'=='xxx' and adds today's date as the first-seen date.

It will then add these lines to the master casing_curated file that will be used in subsequent steps.
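
The filtering described above can be sketched with the standard library as follows. The sample rows, the helper name `keep_picked`, and the ISO date format are assumptions for illustration; `casing_step2()` is the routine that actually does this work.

```python
import csv
import io
from datetime import date


def keep_picked(csv_text, today=None):
    """Keep rows marked 'xxx' in `picked` and stamp them with a first_date."""
    today = today or date.today()
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["picked"].strip() == "xxx":
            row["first_date"] = today.isoformat()
            rows.append(row)
    return rows


# Hypothetical excerpt: two candidate lines for one pair, one of them picked.
sample = """CASNumber,IngredientName,bgCAS,picked
112-62-9,methyl oelate,112-62-9,xxx
112-62-9,methyl oelate,proprietary,
"""

for row in keep_picked(sample):
    print(row["CASNumber"], row["IngredientName"], row["bgCAS"], row["first_date"])
```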


In [13]:

casing_step2()

<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/26/2023 08:00:30</div>

Unnamed: 0,CASNumber,IngredientName,curatedCAS,recog_syn,synCAS,n_close_match,bgCAS,source,first_date,change_date,change_comment


### Step F - Verify that all CAS/Ing pairs are curated

In [14]:

casing_step3()

Number of new CAS|ING pairs: 0


<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/26/2023 08:00:45</div>

## Company Name curation tasks

The company names used in FracFocus are not standardized; searching for all records of a company using the raw FracFocus data can be a tedious and frustrating task.   Open-FF uses a translation table to take raw company names (`OperatorName` and `Supplier`) and cluster them into categories that refer to the same company.  

The cells below first find new company names that need curation attention and store them in a file called *company_xlateNEW.csv*.  Typically, for about 1000 new disclosures, there are about 50 new names to curate, many of them slight variations on already curated names or brand new companies.  The user's job is to do that curation (it usually takes just a few minutes).  The user then saves the curated file, which is used to build the new data set.
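
A translation table of this kind can be pictured as a dictionary from a cleaned name to the cluster name. The sketch below is a simplified illustration: the company names are made up, the cleaning rules (lower-casing, dropping punctuation, collapsing whitespace) are assumptions, and `clean_name` and `names_to_curate` are hypothetical helpers, not the functions used by `companies_step1()`.

```python
import re


def clean_name(raw):
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", raw.lower())
    return " ".join(text.split())


# cleanName -> xlateName (hypothetical entry)
xlate = {"acme energy inc": "acme energy"}


def names_to_curate(raw_names, xlate):
    """Return {cleanName: first rawName seen} for names missing from xlate."""
    new = {}
    for raw in raw_names:
        cleaned = clean_name(raw)
        if cleaned not in xlate:
            new.setdefault(cleaned, raw)
    return new


raw_names = ["ACME Energy, Inc.", "Acme  Energy Inc", "Acme Energy LLC"]
print(names_to_curate(raw_names, xlate))
```

The first two raw names collapse onto an existing entry, so only the third is handed to the curator.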

In [15]:


companies_step1()


Number of new company lines to curate: 35


<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/26/2023 08:00:58</div>

rawName,cleanName,xlateName,is_new,OpCount,OperatorYears,SupCount,SupplierYears,status,first_date,change_date,change_comment


### Now curate the new company names
Edit the *company_xlateNEW.csv* file so that `xlateName` is acceptable, the `first_date` is filled out, and the `status` is set to **curated**. 

Save those changes as *work_dir/company_xlate_modified.csv*. 

Run the following cell and verify that you have no company names to curate.

In [16]:

companies_step2()

<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/26/2023 08:07:48</div>

## Location curation tasks

Like the other text fields in FracFocus, state and county names are not required to be standardized.  Where we can, we create curated, standardized versions to help detect location errors.  Typically, very few new locations are added, so curation is often not required for this data set.
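
As a small illustration of standardizing these fields, the sketch below builds a lower-case "state: county" key and reports keys that are absent from a reference list. The key format, the use of 'missing' for blank values, and both helper names are assumptions made for this example; the real location step also checks coordinates against shapefiles, which is not shown here.

```python
def location_key(state, county):
    """Build a lower-case 'state: county' key; blank values become 'missing'."""
    def norm(text):
        text = " ".join((text or "").lower().split())
        return text or "missing"
    return f"{norm(state)}: {norm(county)}"


def unmatched_locations(records, reference):
    """Return sorted keys from `records` that are not in `reference`."""
    keys = {location_key(state, county) for state, county in records}
    return sorted(keys - set(reference))


# Hypothetical reference list and raw records
reference = ["texas: reeves"]
records = [(" Texas ", "REEVES"), ("Alaska", "Kenai Offshore"), ("Arkansas", None)]
print(unmatched_locations(records, reference))
```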



In [17]:

location_step1()

Starting Location cleanup
  -- importing state-derived location data
Make list of disclosures whose lat/lons are not specific enough
Number of new locations: 1
  -- re-projecting location points
  -- fetching shapefiles
  -- checking against shapefiles
   No match: alaska: beechey point
   No match: alaska: flaxman island
   No match: alaska: harrison bay
   No match: alaska: harrison bay offshore
   No match: alaska: kenai
   No match: alaska: kenai offshore
   No match: alaska: sagavanirktok
   No match: alaska: seldovia
   No match: alaska: tyonek
   No match: arkansas: missing
   No match: california: anacapa island offshore
   No match: california: contra costa offshore
   No match: california: los angeles offshore
   No match: california: marin offshore
   No match: california: san clemente island offshore
   No match: california: san nicolas island offshore
   No match: california: santa catalina island offshore
   No match: california: santa cruz island offshore
   No match: ca

<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/27/2023 07:49:51</div>

### Curate results
**If there are new locations**, curate the *work_dir/location_curatedNEW.csv* file and save to *work_dir/location_curated_modified.csv*

Then run the location check again to make sure you curated all the new locations: 

In [18]:

location_step2()

<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/27/2023 07:52:25</div>

## Water carrier detection
To calculate mass accurately, it is critical that the water carrier records in disclosures are identified.

In this current version of Open-FF, all water carrier determinations are performed with code.  No hand-curation is used. We came to the conclusion that, in the irregular disclosures that would be a target for hand curation, there are too many moving parts to make consistent decisions over the whole set, especially with new disclosures being added all the time.  By using only coded algorithms to detect the carriers, we can apply consistent rules over the entire set.   

The current set of algorithms rejects about 55,000 disclosures as clearly ineligible for carrier detection (about 44,000 simply because they lack ingredient data).  Of the remaining 155,000 or so, about 1% are not caught by the detection algorithms.  Data on those are available in a saved file here for user examination. While masses will not be calculated for that small set, `MassIngredient` may still be explored. 
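
To give a feel for what "coded rules applied uniformly" means, here is a much-simplified sketch of an eligibility screen. The reason labels echo the summary printed by `carrier_step()`, but the dictionary layout, the helper name `carrier_problems`, and the upper tolerance of 105% are assumptions; the real algorithms apply more checks than are shown here.

```python
def carrier_problems(disclosure, upper_tol=105.0, lower_tol=90.0):
    """Return the reasons (if any) a disclosure is skipped for carrier detection."""
    percents = disclosure.get("percents") or []
    if not percents:
        return ["No ingredients"]
    problems = []
    if not disclosure.get("total_water_volume"):
        problems.append("Total water volume missing or 0")
    valid = [p for p in percents if p]  # drop 0 and None
    if not valid:
        problems.append("Ingredients have only 0 or missing PercentHFJob")
        return problems
    total = sum(valid)
    if total > upper_tol:
        problems.append("Sum of all valid ingredients is greater than upper tolerance")
    elif total < lower_tol:
        problems.append(f"Sum of all ingredients is less than {lower_tol:g}%")
    return problems


# Hypothetical disclosures: one clean, one with two problems
print(carrier_problems({"percents": [88.0, 11.5], "total_water_volume": 500000}))
print(carrier_problems({"percents": [60.0, 10.0], "total_water_volume": 0}))
```

Because every disclosure passes through the same function, a change to a threshold is applied to the whole set on the next build, which is the consistency argument made above.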

In [19]:

carrier_step()

Summary of disclosures with water carrier identification problems:
  43,978 : No ingredients
  30,970 : Total water volume missing or 0
  44,189 : Ingredients have only 0 or missing PercentHFJob
   3,319 : Sum of all valid ingredients is greater than upper tolerance
   4,599 : Sum of ingredients with sys app meta removed sum is greater than upper tolerance
   2,703 : Proppant percentage sum is greater than water percentage
   2,929 : Sum of all ingredients is less than 90%
     150 : Not used: Gasses are dominant (>50%)
     149 : Chlorine dioxide percentage is 100
Total problem disclosures: 54,652; Number remaining: 155,144

Auto set 1:    130,801, with     24,343 remaining
Auto set 2:      7,370, with     16,973 remaining
Auto set 3:      4,765, with     12,208 remaining
Auto set 4:      4,688, with      7,520 remaining
Auto set 5:      4,515, with      3,005 remaining
Auto set 6:      1,052, with      1,953 remaining
Auto set 7:        421, with      1,532 remaining
Auto set 8:     

<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/27/2023 07:53:04</div>

# Build and save Open-FF data set

**Start these steps only after all curation steps have been completed successfully!**

This step takes all of the files created in the curation steps and applies them together to the raw data.  Additionally, hooks to external data sources are used to create fields that better identify chemicals, locations, etc.  The result of this step is a set of tables that can be used to build a flat data set (such as a CSV file) or even a relational database.



In [20]:

builder_step1()

In [21]:

builder_step2()


**************************************************
                   Table_manager                   
**************************************************
 -- assembling CAS/IngredientName table
 -- assembling companies table
 -- assembling bgCAS table
    -- add external references such as TEDX and PFAS
     -- processing CWA
     -- processing DWSHA
     -- processing AQ_CWA
     -- processing HH_CWA
     -- processing IRIS
     -- processing PFAS_list
     -- processing volatile_list
     -- processing NPDWR list
     -- processing epa diesel list
     -- processing TSCA UVCB list
     -- processing Reportable Quantity list
 -- assembling disclosure table
    -- create bgOperatorName
       -- Number uncurated Operators: 0 
    -- constructing dates
 -- searching for wells on Fed and Native lands
 -- assembling chemical records table
    -- adding bgCAS
       -- Number uncurated CAS/Ingred pairs: 0, n: 0
    -- create bgSupplier
       -- Number uncurated Suppliers: 0 
    -- flagg

<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/27/2023 07:57:35</div>

## Create flat data set and test it
This step uses the set of tables created earlier to build a single 'flat' data file and to run some basic tests on the new data set.  Note that because the full data set is very large (too big for Excel) and CSV files are cumbersome at this size, we use the **parquet** format, which is much faster and takes up far less space.  To create an equivalent CSV file, see this XXXXXXX

**Note that running this notebook in the free version of Colab can, at this step, cause the memory to overflow and reset the kernel.** This seems to depend on a number of factors and while unfortunate, is not fatal.  If you've made it this far, all the materials needed to complete the process are stored on disk.  So to finish up:
- rerun the `%run intg_support/create_new_set_steps.py` line at the top of the notebook
- run the following steps.  They should complete normally.

In [22]:


builder_step3()

Performing tests
  --  Test <reckey> consistency
  --  Test <bgCAS> consistency
  --  Test <bgOperatorName> consistency
  --  Test <bgSupplier> consistency
  --  Confirm removal of duplicate records from filtered data
  --  Confirm APINumber integrity
    -- Disclosures with malformed APINumber: ['5ce7e732-e28d-4451-a679-2e9485f9539a']
  --  Confirm calcMass consistency


<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/27/2023 07:59:12</div>

## Make repository
Once the data set has been created, saved, and tested, we construct a "repository."  Once created, this repository is intended to be **read only**; that is, no changes should be made to it.  The idea is that when using a given repository, analysts can depend on it being frozen in time.

Still to be done:
- include README that describes history of repo.
- include code used to make this repo 
    - BUT exclude git and other hidden folders. Not useful!
- make html file of the output of this notebook and include in repo
- include list of the versions of the external files used
- include archive of zip of bulk download


In [23]:

make_repository(create_zip=False)

Skipping zip zrchive creation


<div style="background-color: #669999; padding: 10px; border: 1px solid green;"><h3> This step completed normally. </h3> 07/27/2023 08:03:50</div>