| <div> <img src="https://storage.googleapis.com/open-ff-common/openFF_logo.png" width="100"/></div>|<h1>Adding New Disclosures to the Open-FF Data:<br><br>Download, Curate, Assemble, Test, and Archive<br></h1>|
|---|---|

<!--
<center><a href="https://www.fractracker.org/" title="FracTracker Alliance"><img src="https://storage.googleapis.com/open-ff-common/2021_FT_logo_icon.png" alt="FracTracker logo" width="100" height="100"><br>Sponsored by<br> FracTracker Alliance</a></center>| -->


In [None]:
# fetch the Open-FF code repository and master data file from remote storage
# For use in COLAB, the following lines should be uncommented; 
#   comment all lines if running locally

# !git clone https://github.com/gwallison/openFF.git &>/dev/null;
# !pip install itables  &>/dev/null;
# !pip install geopandas  &>/dev/null;

In [None]:
# %run openFF/build/builder_nb_support.py

In [None]:
# Local: run this cell only when working locally (comment it out in Colab); it replaces the cell above.
# Assumes that the code is already available in the local environment.
import sys
sys.path.insert(0,'c:/MyDocs/integrated/') # adjust to your setup
%run builder_nb_support.py

# Set up
First we start by constructing a workspace and collecting the resources needed.

**Directories constructed:**  These are all within a directory called **sandbox** in the openFF/build directory.  As you go through this process, you will interact with files in most of these directories.
| directory name | description |
| ---: | :--- |
|**orig_dir**| expanded zip files, downloaded external files, etc: files used as a model for the next round, but not to be directly saved|
|**work_dir**| This is the working directory where new curation files created by these routines are kept. These 'generated' files are saved at the end of the process into either the repository or other archives.|
|**ext**| non-FracFocus data files used in constructing the Open-FF data set |
|**final**| the place for final files, archives and repositories. |
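
As a rough illustration of this layout (not the actual `create_and_fill_folders` implementation), the sketch below builds the four directories under a sandbox root. The function name `make_sandbox` is hypothetical, and using the table's names as literal folder names is an assumption.

```python
from pathlib import Path
import tempfile

# Folder names taken from the table above; treating them as literal
# directory names is an assumption made for illustration.
SANDBOX_DIRS = ("orig_dir", "work_dir", "ext", "final")

def make_sandbox(root):
    """Create the sandbox folder and its subdirectories; return their paths."""
    sandbox = Path(root) / "sandbox"
    paths = {}
    for name in SANDBOX_DIRS:
        path = sandbox / name
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        paths[name] = path
    return paths

# Demonstrate in a throwaway location rather than a real checkout.
demo_root = tempfile.mkdtemp()
sandbox_paths = make_sandbox(demo_root)
```

Because `exist_ok=True` is used, calling the function a second time leaves existing files in place.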

In [None]:
# Control download: typically set to True
#    set to False if you can skip the downloading part of the repo and the external data, for example, during testing.

download_repo_flag = True
download_ext_flag = True
download_FF_flag = True
create_raw_flag = True


In [None]:
create_and_fill_folders(download_repo_flag)

## Download external files used to assemble final data set
To do:
- make list of external databases and when they were compiled
- reorg so that you don't have to download everything.

In [None]:
get_external_files(download_ext_flag)

## Download raw files from FracFocus
To do:
- refactor comparison with previous download
- make trap to detect addition of new fields

In [None]:

download_raw_FF(download_FF_flag)

## Create master raw FracFocus set as file

In [None]:

create_master_raw_df(create_raw_flag)

#### Add new disclosures to Disclosure ID file

## One Time Change:
Due to the transition to FracFocus version 4 on Dec 4, 2023, a transitional "upload_dates.parquet" file was used as a replacement for the original.  The primary difference is that it includes the `DisclosureId` values that replace the `UploadKey` values.  That file can now serve as a translation file between the old and new disclosure keys.

In [None]:

update_upload_date_file()

# Curation steps
These steps are a mix of automated and hand-performed curation tasks. The hand-performed tasks require the user to examine database values in spreadsheets, decide how individual records should be treated, and record those decisions.



## `CASNumber` and `IngredientName` curation tasks

Open-FF uses both raw input fields `CASNumber` and `IngredientName` to clarify chemical identity in each record.  These two fields **should** agree on the identity, but often only one field provides unambiguous identification (usually `CASNumber`) and sometimes the two are conflicting.  Our target is an accurate `bgCAS`, which is our "best guess" at a CAS Registration Number for the material reported in the disclosure.
That is,
> Unique `CASNumber` | `IngredientName` pair  $\rightarrow$ `bgCAS`

There are currently over 28,000 unique pairs.

The curation process outlined below gets our identification as close as possible to our target.  It requires using several sources of information and part of the process includes collecting that information.  Some steps are partially automated whereas other steps require our judgement and are therefore manual.  

This process is also incremental - we only need to curate the *new* chemical identifiers in the most recent download.  However, this process can also be used to examine the whole curated set to refine identification performed earlier.  

Resources needed to create the CAS-Ing list:
- CAS_curated: a list of `CASNumber` values and the tentative `bgCAS` number they imply.
- `IngredientName` synonym list: list of synonyms (and associated CAS number) to weigh against `IngredientName`. This is created from a collection of CAS and CompTox references.
- `TradeName` values associated with CAS-Ing pairs -- this aspect is still in development, though curators may manually examine TradeNames to make decisions.

### Step A - Use previous list to find any new `CASNumber` values
1. compare list of `CASNumber` values in rawdf to list of `CASNumber` in *olddir/curation_files/cas_curated.csv*
1. make and display list of those new ones.
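
The comparison in these two steps amounts to a set difference. The sketch below is only an illustration of that idea, not the code behind `cas_curate_step1()`; the function name `find_new_cas`, the CSV column layout, and the sample values are assumptions.

```python
import csv
import io

def find_new_cas(raw_values, curated_csv, column="CASNumber"):
    """Return the sorted raw values that are absent from the curated CSV column."""
    known = {row[column] for row in csv.DictReader(curated_csv)}
    return sorted(set(raw_values) - known)

# Tiny in-memory stand-ins for rawdf and cas_curated.csv (made-up layout and data).
curated = io.StringIO("CASNumber,curatedCAS\n50-00-0,50-00-0\n7732-18-5,7732-18-5\n")
raw = ["50-00-0", "00000050-00-0", "7732-18-5", "proprietary by operator"]

new_values = find_new_cas(raw, curated)
print(new_values)
```

Note that the comparison is on the raw strings, so a zero-padded variant of a known CASRN still counts as new and needs curation.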

#### Next steps for YOU:
>Next steps for **new** `CASNumber` values (see note below): 
>- if the implied chemical is not in the CAS references, go to SciFinder and make a new entry (manual!)
>- otherwise, you can skip the SciFinder steps and go to the CAS_curate step. 
>
>If there are no **new** `CASNumber` values, 
>- skip all the way to moving the current CAS_curate.csv to *newdir/curation_files/CAS_curate.csv*

Note: these "new" `CASNumber` values can be completely new chemicals or just a new version of an already used material (for example, we might find '00000050-00-0' for the authoritative CASRN '50-00-0' that is already documented in Open-FF).   They may also be something that will not resolve into a valid CASRN, for example: 'proprietary by operator'. You will assign appropriate `bgCAS` values in the curation step.
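
To make the note concrete, here is a minimal sketch of how a curator might pre-screen such values: strip leading zeros, then test the standard CAS check digit (each digit before the check digit is weighted 1, 2, 3, ... from right to left, and the sum modulo 10 must equal the check digit). The helper names are hypothetical, and Open-FF's own handling may differ.

```python
import re

def normalize_cas(raw):
    """Strip whitespace and leading zeros; return None if the value is not CAS-shaped."""
    match = re.fullmatch(r"(\d+)-(\d{2})-(\d)", raw.strip())
    if match is None:
        return None
    first = match.group(1).lstrip("0")
    if not 2 <= len(first) <= 7:  # first segment of a CASRN has 2 to 7 digits
        return None
    return f"{first}-{match.group(2)}-{match.group(3)}"

def has_valid_check_digit(cas):
    """Apply the CAS checksum rule to a normalized CASRN string."""
    digits = cas.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(int(d) * weight for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check

for value in ["00000050-00-0", "7732-18-5", "proprietary by operator"]:
    cas = normalize_cas(value)
    print(value, "->", cas, cas is not None and has_valid_check_digit(cas))
```

A value that passes both checks is only well-formed; whether it is the right chemical for the record is still a curation decision.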

In [None]:
cas_curate_step1()

### Step B - Add chem info from SciFinder of new tentative CAS numbers into CAS_ref files
If a chemical on this list hasn't been seen before in FF, we need to add some information into the CAS_ref files.
To do that, we currently use SciFinder, which is a product of the Chemical Abstract Service, the naming authority for materials.

Save the file in `new_CAS_ref` directory of **work_dir** 

Once you have saved the file, run the following cell to **verify** that every new `CASNumber` value you found to be valid, and created a reference for, actually made it into the SciFinder reference.



In [None]:
cas_curate_step2()

### Step C - Curate the CAS_to_curate file
In this step you will manually edit the *work_dir/CAS_curated_TO_EDIT.csv* file to curate the new `CASNumber` values. There are typically only a handful of lines in this file that you need to curate, just those newly discovered in the latest FF download. 

Your task is to assign a `curatedCAS` value for each new line, using the clues there.

Once you have completed the editing, save the file back to *work_dir/CAS_curated_modified.csv*.

#### Step C.1 - Make sure all `CASNumber` values have been curated

In [None]:

cas_curate_step3()

---
### Step D - Update CompTox data
The metadata from EPA's CompTox system informs Open-FF in a number of ways. Now that we have a full CAS list, we need to update the CompTox data. 
**If no new CAS numbers were added to CAS_curated.csv, you can skip this step and jump to the CAS | Ing processing**

1. Open [CompTox batch search](https://comptox.epa.gov/dashboard/batch-search)
1. Under "Select Input Type(s)", check "CASRN"
1. Open the *work_dir/comptox_search_list.csv* file in something like Excel or OpenOffice. (If you can't find this file, there may be no new CAS numbers, in which case you may not need to run this step.)
1. Copy and paste all CAS numbers in the `curatedCAS` column into the "Enter Identifiers to Search" box on the CompTox page.  You can skip non-CAS values like 'proprietary'; they mean nothing to CompTox.
1. On the CompTox page, click the "Choose Export Options" button.
1. Under "Choose Export Format," select "Excel."
1. In the "Chemical Identifiers" section, make sure that the following are checked (but no more than these):
    - Chemical Name
    - DTXSID
    - IUPAC Name
1. Under the "Presence in Lists" table, click the check box in the header that selects ALL lists.  This will be used to map what lists each chemical is part of.
1. Under "Enhanced Data Sheets," select "Synonyms and Identifiers". (This has been broken on the EPA site in the past; see the note below.)
1. Finally, click "Download Export File". This can take several minutes, or even stall if EPA's servers have heavy use.
1. Once the file has been downloaded to your machine, RENAME it "comptox_batch_results.xlsx" (don't open it, just rename it!) 
    - **NOTE: the "Download Export File" process has occasionally not worked; it hangs indefinitely instead of completing.**  The current work-around is to deselect the "Synonyms and Identifiers" checkbox and click "Download Export File" again. You won't be able to get the synonym data, but you will still be able to update the name data.
1. Move that file to *work_dir*
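
If copying from a spreadsheet is awkward, the paste-ready list for the batch search can also be produced with a few lines of code. This is a minimal sketch, assuming the search list is a CSV with a `curatedCAS` column as described above; the function name and the sample data are made up for illustration.

```python
import csv
import io
import re

# Matches values shaped like a CASRN, e.g. 50-00-0 or 7732-18-5.
CAS_SHAPE = re.compile(r"\d{2,7}-\d{2}-\d")

def casrn_for_batch_search(csv_file, column="curatedCAS"):
    """Return the unique CAS-shaped values from the column, in file order."""
    found = []
    for row in csv.DictReader(csv_file):
        value = row[column].strip()
        if CAS_SHAPE.fullmatch(value) and value not in found:
            found.append(value)
    return found

# In-memory stand-in for work_dir/comptox_search_list.csv (made-up contents).
sample = io.StringIO("curatedCAS\n50-00-0\nproprietary\n7732-18-5\n50-00-0\n")
search_text = "\n".join(casrn_for_batch_search(sample))
print(search_text)  # one identifier per line, ready to paste
```

Non-CAS entries such as 'proprietary' are dropped because they do not match the pattern.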

### Next, we fetch a fresh version of the CompTox Lists metadata:

- Go to the [EPA CompTox **list** page](https://comptox.epa.gov/dashboard/chemical-lists)
- Click on the "Export" button in the upper right, and select the "Excel" option.  This will download a file to your computer.
- Rename that file (without opening it) as "comptox-chemical-lists-meta.xlsx" and move it to work_dir.

1. Run the following cell.


In [None]:
update_CompTox_lists()

### Finally, capture the ChemInformatics

**Note that this is currently working slowly and often hanging. If this doesn't work, this step can be skipped, and the previous repo's information will be used instead.**


- open the [ChemInformatics Modules](https://www.epa.gov/comptox-tools/cheminformatics) that is part of the CompTox system. 
- click on the "Hazard" module, then click on the magnifying glass.
- using the *work_dir/comptox_search_list.csv* list again, paste all CASRN into the search box ("search by identifiers" tab), and click "Search".
- When the module returns with a list of found compounds, save them to the cart by clicking the "Cart +" icon.
- Back at the Hazard Module screen, press the "Cart" icon to generate a report.  For this long list, this process can take a few minutes.  
- Now, save the results into TWO files to use in Open-FF:
    - click the "export to XLSX" icon in the upper right, and wait again.  Move that file directly into the *work_dir/new_CHEMINFO_ref* folder
    - click the "export to SDF" icon in the upper right, and wait again.  Move that file directly into the *work_dir/new_CHEMINFO_ref* folder.
- Next, we will grab the safety data for these chemicals.  Click on the "Safety" module (the safety glasses icon).
- Click on the "Cart" icon to generate a report for the 1400+ chemicals still in the cart.  This will take a while.
- Now, save the results into one file to use in Open-FF:
    - click the "export to XLSX" icon in the upper right, and wait again.  Move that file directly into the *work_dir/new_CHEMINFO_ref* folder


- Finish by running the following code:

In [None]:
update_ChemInformatics()

---
## Start CAS|Ing processing

In [None]:
casing_step1()

### Step E - Curate the new CAS|Ing pairs 

- make any desired changes to casing_TO_CURATE.csv.  For any given pair, **there should be only ONE line with 'xxx' in 'picked'**.  
- save file as *casing_modified.csv* in **work_dir**
- then run the following code.  This step will keep only those lines where 'picked'=='xxx' and it will add today's date as the first seen date.

It will then add these lines to the master casing_curated file that will be used in subsequent steps.
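
The filtering rule described here can be sketched as follows. This is an illustration under stated assumptions, not the code in `casing_step2()`: the function name `keep_picked`, the output column name `first_date`, and the sample rows are hypothetical, while `picked` and 'xxx' come from the instructions above.

```python
import csv
import io
from collections import Counter
from datetime import date

def keep_picked(csv_file, first_seen=None):
    """Keep rows with picked == 'xxx', requiring exactly one such row per pair."""
    first_seen = first_seen or date.today().isoformat()
    rows = list(csv.DictReader(csv_file))
    picked = [r for r in rows if r["picked"] == "xxx"]
    counts = Counter((r["CASNumber"], r["IngredientName"]) for r in picked)
    all_pairs = {(r["CASNumber"], r["IngredientName"]) for r in rows}
    problems = sorted(pair for pair in all_pairs if counts[pair] != 1)
    if problems:
        raise ValueError(f"pairs without exactly one picked line: {problems}")
    # 'first_date' is a hypothetical column name for the first-seen date.
    return [dict(r, first_date=first_seen) for r in picked]

# In-memory stand-in for casing_modified.csv (made-up rows).
sample = io.StringIO(
    "CASNumber,IngredientName,bgCAS,picked\n"
    "50-00-0,formaldehyde,50-00-0,xxx\n"
    "50-00-0,formaldehyde,proprietary,\n"
    "7732-18-5,water,7732-18-5,xxx\n"
)
curated_rows = keep_picked(sample, first_seen="2024-01-15")
```

Raising an error when a pair has zero or several picked lines catches the most common editing mistake before anything is appended to the master file.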


In [None]:

casing_step2()

### Step F - Verify that all CAS|Ing pairs are curated

In [None]:

casing_step3()

## Company Name curation tasks

The company names used in FracFocus are not standardized; searching for all records of a company using the raw FracFocus data can be a tedious and frustrating task.   Open-FF uses a translation table to take raw company names (`OperatorName` and `Supplier`) and cluster them into categories that refer to the same company.  

The cells below first find new company names that need curation attention and store them in a file called *company_xlateNEW.csv*.  Typically, for about 1000 new disclosures, there are about 50 new names to curate, with many being slight variations on already curated names or brand new companies.  The user's job is to do that curation (it usually takes just a few minutes).  The user then saves the curated file, which will be used to build a new data set.
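
Conceptually, the translation table is a lookup from raw name to curated name, and anything that misses the lookup is a candidate for curation. The sketch below illustrates that idea only; the function name, the lowercase-and-strip normalization, and the company names are all invented for the example and are not taken from the Open-FF code.

```python
def split_company_names(raw_names, xlate):
    """Map known raw names to their curated name; collect the rest for curation."""
    translated, needs_curation = {}, []
    for name in raw_names:
        key = name.strip().lower()  # assumed normalization before lookup
        if key in xlate:
            translated[name] = xlate[key]
        elif name not in needs_curation:
            needs_curation.append(name)
    return translated, needs_curation

# Made-up translation table and raw names.
xlate = {
    "acme energy llc": "acme energy",
    "acme energy, l.l.c.": "acme energy",
}
raw = ["Acme Energy LLC", "ACME ENERGY, L.L.C.", "Acme Enrgy LLC"]

translated, needs_curation = split_company_names(raw, xlate)
print(needs_curation)
```

Here the misspelled third name misses the table, which is the kind of slight variation a curator would then map to the existing company.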

In [None]:
companies_step1()

### Now curate the new company names
Edit the *company_xlateNEW.csv* file so that `xlateName` is acceptable, the `first_date` is filled out, and the `status` is set to **curated**. 

Save those changes as *work_dir/company_xlate_modified.csv*. 

Run the following cell and verify that you have no company names to curate.

In [None]:
companies_step2()

## Location curation tasks

Like the other text fields in FracFocus, state and county names are not required to be standardized.  We try to create curated, standardized versions where we can to help with detecting location errors.  Typically, very few new locations are added, so curation is often not even required for this data set.



In [None]:
location_step1()

### Curate results
**If there are new locations**, curate the *work_dir/location_curatedNEW.csv* file and save to *work_dir/location_curated_modified.csv*

Then run the location check again to make sure you curated all the new locations (or to move the last repo's curation into *work_dir*): 

In [None]:

location_step2()

## Water carrier detection
To perform accurate calculation of mass, it is critical that the water carrier records in disclosures are identified.

In the current version of Open-FF, all water carrier determinations are performed with code.  No hand-curation is used. We came to the conclusion that, in the irregular disclosures that would be a target for hand curation, there are too many moving parts to make consistent decisions over the whole set, especially with new disclosures being added all the time.  By using only coded algorithms to detect the carriers, we can apply consistent rules over the entire set.   

The current set of algorithms rejects about 54,000 disclosures as being clearly ineligible for carrier detection (43,000 simply because they lack ingredient data).  Of the remaining 150,000, about 1% are not caught by the detection algorithms.  Data on those are available in a saved file here for user examination. Although masses will not be calculated for that small set, `MassIngredient` may still be explored. 

In [None]:
carrier_step()

# Build and save Open-FF data set

**Start these steps only after all curation steps have been completed successfully!**

This step takes all of the files created in the curation steps and applies them together to the raw data.  Additionally, hooks to external data sources are used to create fields that better identify chemicals, locations etc.  The result of this step is a set of tables that can be used to further build a flat data set (such as a CSV file) or even a relational database.

**Also, before you take these next steps, look through 

In [None]:
builder_step1()

In [None]:
builder_step2()

## Create flat data set and test it
This step uses the set of tables created earlier to build a single 'flat' data file as well as to run some basic tests on the new data set.  Note that because the full data set is very large (too big for Excel) and CSV files are cumbersome at this size, we are using the **parquet** format, which is much faster and takes up far less space.  To create an equivalent CSV file, see this XXXXXXX

**Note that running this notebook in the free version of Colab can, at this step, cause the memory to overflow and reset the kernel.** This seems to depend on a number of factors and while unfortunate, is not fatal.  If you've made it this far, all the materials needed to complete the process are stored on disk.  So to finish up:
- rerun the `%run [...]builder_nb_support.py` cell at the top of the notebook
- run the following steps.  They should complete normally.

In [None]:
builder_step3()

## Make repository
Once the data set has been created, saved and tested, we construct a "repository."  Once created, this repository is intended to be **read only**; that is, no changes should be made to it.  The idea is that when using a given repository, analysts can depend on it being frozen in time.

Still to be done:
- include README that describes history of repo.
- include code used to make this repo 
    - BUT exclude git and other hidden folders. Not useful!
- make html file of the output of this notebook and include in repo
- include list of the versions of the external files used
- include archive of zip of bulk download


In [None]:

make_repository(create_zip=False)

## Copy this notebook page into the new repository
As a record of the process that created this new repository, use jupyter -> `Save and Export Notebook as...` to create a .HTML file of this notebook's output.  Then move that file to the new repository folder.