In [None]:
import sys
sys.path.insert(0,'c:/MyDocs/integrated/') # adjust to your setup

%run "catalog_support.py" 
showHeader('Open-FF Data Dictionary')

#### Description of the contents of the final data files generated by Open-FF from the FracFocus data.

In [None]:
def addfield(field_tables, fn, table):
    # record that field `fn` appears in `table` (avoids shadowing the builtin `dict`)
    field_tables.setdefault(fn, []).append(table)

tables = fh.get_repo_tables()
all_fn = {}
for t in tables.keys():
    for fn in tables[t]:
        addfield(all_fn,fn,t)

In [None]:
kys = list(all_fn.keys())
tbls = []
for k in kys:
    tbls.append(all_fn[k])

In [None]:
    
full = fh.get_df(os.path.join(hndl.curr_repo_dir,'full_df.parquet'))
#print(f'Number of fields in Full: {len(full.columns)}')

In [None]:
# find fields created outside of the tables
for col in full.columns:
    if col not in kys:
        #print(col)
        kys.append(col)
        tbls.append('filter flag')

In [None]:
# calculate a few stats for each field
uniq = []
num_val = []
dt = []
for k in kys:
    # -1 (or '?') marks a stat that could not be computed for this field
    try:
        uniq.append(full[k].nunique())
    except Exception:
        uniq.append(-1)
    try:
        num_val.append(full[k].notna().sum())
    except Exception:
        num_val.append(-1)
    try:
        dt.append(full[k].dtype)
    except Exception:
        dt.append('?')

In [None]:
all_fn_df = pd.DataFrame({'fieldName':kys,'Database_Tables':tbls,'Data_type':dt,'Num':num_val,
                          'Unique':uniq})
#all_fn_df.Num = all_fn_df.Num.map(lambda x: round_sig(x))
#all_fn_df.Unique = all_fn_df.Unique.map(lambda x: round_sig(x))

### Acceptable use of FracFocus data
One requirement 
for using the FracFocus data is stipulated on the FracFocus website:

**"Downloaded data may be aggregated or combined with other datasets,** 
**but the FracFocus data may not be altered in any way."**

Please read the entire "Terms of use" at http://fracfocus.org/data-download.

The work in this project maintains the original FracFocus data as it is reported
in the bulk download.  The field names used in the original are kept: all of 
these original names begin with an upper-case letter and can be identified
in that way.  Fields generated by this project or taken from external data sources begin with a lower-case
letter (for example, `CASNumber` is the original field and `bgCAS` is the 
generated field). Note that there are two exceptions: `DTXSID` and `MI_inconsistent` are NOT 
original with FracFocus.
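
As a rough illustration of the naming convention above, the sketch below labels a field name as original or not from its first character. The helper `field_origin` and the constant `GENERATED_EXCEPTIONS` are hypothetical names introduced for this example; they are not part of the Open-FF code.

```python
# Hypothetical helper, not part of Open-FF: applies the naming convention
# described above, including the two documented exceptions.
GENERATED_EXCEPTIONS = {"DTXSID", "MI_inconsistent"}

def field_origin(field_name):
    """Return 'original FF' or 'generated/external' for a field name."""
    if field_name in GENERATED_EXCEPTIONS:
        return "generated/external"
    # original FracFocus names begin with an upper-case letter
    if field_name[:1].isupper():
        return "original FF"
    return "generated/external"

for name in ["CASNumber", "bgCAS", "DTXSID", "MI_inconsistent"]:
    print(name, "->", field_origin(name))
```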

In the zipped bulk download from FracFocus, a data dictionary is provided in 
the **'readme.txt'** file. (This zipped download is in the /sources or /data directory,
and we rename it 'currentData.zip'.)  The readme gives some information about many of 
the fields; however, it is written for the SQL database version of the bulk 
download, not the CSV version that we use in this project.  Further, some important fields are not mentioned in that readme.txt file; they are
described below.  In the descriptions of all fields below, we cite the FracFocus text 
from a June 2021 bulk download.

## Descriptions of fields in the output data sets

|Explanation of columns in the table below|
| --- |

|column| what it is|
| --- | :--- |
|**fieldName:**|The name of the field or column in the data set. All field names that are capitalized are from the original FracFocus downloaded data.  Lower-case names are generated by Open-FF.  <br>**tables**: The Open-FF internal tables (used to construct the output data sets) that contain this field.|
|**FracFocus description:**| Description of the (original) field given by FracFocus in the bulk download file, *readme.txt*.|
|**Open-FF description:**|Our description of the field|
|**source:**|is this field a direct copy of the original FracFocus data or is it generated by Open-FF, or pulled from an external data set?|
|**Num:**| the number of non-empty values in the field|
|**Unique:**| the number of unique values in the field (NaN is not counted, because pandas `nunique()` drops it by default)|
|**Data_type:**| the python/pandas data type for the field|


In [None]:
desc = {'bgCAS':"""this is our best guess for the CAS registry number based on: 1) CASNumber and IngredientName, 
2) an automated cleaning and validation process, 3) a manual curation process and 
4) a check against an authoritative catalog of CAS numbers.""",
'bgIngredientName':"""the primary ingredient name used for bgCAS in the online 
reference, SciFinder.  While other synonyms may be valid, a single name used
across the whole data set allows for easier searches.""",
'calcMass':"""The calculated mass (in pounds) of the chemical in the record. 
This mass is calculated using the total mass of the fracking job and the 
percentage of the chemical in the whole job (PercentHFJob).  <br>(This field was
previously called 'bgMass')""",
'bgOperatorName':"""our curated version of the raw field 'OperatorName.'  This is 
a simple translation of the OperatorName field using the file: xlate_companies.csv.
See that file in the /data/transformed directory.""",
'bgStateName':"""Our corrected and normalized version of the raw field 'StateName.' """,
'bgSupplier':"""our curated version of the raw field 'Supplier.'  This is 
a simple translation of the Supplier field using the file: xlate_companies.csv.
See that file in the /data/transformed directory.  This is an attempt to clean
up the field to make it more searchable. While the raw field has over 80
versions of the company name Halliburton, bgSupplier has only one.""",
'data_source':"""indicates whether record was sourced from 'bulk' or some other source. Open-FF focuses primarily on the bulk download that is published
regularly; however some archived or special purpose data sets may be used in this data framework.""",
'date':"""a cleaned version of JobEndDate.  This version has removed time stamps and 
has corrected a few erroneous years.  Note that early disclosures only specified a "fracture date" which seems to be a starting date. When there is
only a single date given, it is used for this field.""",
'primarySupplier' :"""(added in version 9.0) This field is simply the most frequent bgSupplier in a disclosure (ignoring all non-company values such as
'n/a', 'Listed Above', etc.)  This field is an attempt to overcome the disconnect
between the chemical records and the descriptor fields in the System Approach.
It should be noted that any specific chemical cannot be *directly* connected
to the company in this field; use bgSupplier or Supplier for that.""",
'reckey':"""a simple incrementing index of records across all raw input files.""",
# 'alt_CAS':"""indicates CAS numbers implied when comparing IngredientName to a long list of chemical synonyms. This is used to determine if the IngredientName is specific
# enough to authenticate the final CAS number.""",
'bgSource':"""in the evaluation of a CASNumber|IngredientName pair, this indicates the source of the final bgCAS.""",
'auto_carrier_type':"""for disclosures where carrier record(s) are auto-detected, this field indicates which
auto-detection method was triggered.""",
'cleanMI' :"""version of the MassIngredient value for this record that is filtered for inconsistency within the
disclosure.  If cleanMI has a value, the disclosure passed the consistency test.""",
'MI_inconsistent':"""True/False where True indicates that this disclosure did NOT pass the consistency test.""",
'IndianWell':"""(text field) Presumably, True when the well is within Native American controlled land. Note that
early disclosures were not categorized for this characteristic.  Furthermore, in other disclosures that are in the bulk download, if this
value is not included in the disclosure, FracFocus gives a value of "**False**" when it should be unknown or unreported.
 See **bgNativeAmericanLand**.""",
'FederalWell':""" (text field) Presumably, True when the well is on Federal lands. Note that
early disclosures were not categorized for this characteristic.  Furthermore, in other disclosures that are in the bulk download, if this
value is not included in the disclosure, FracFocus gives a value of "**False**" when it should be unknown or unreported.
 See **bgFederalLand**.""",
'bgCountyName':""" Our corrected and normalized version of the raw field 'CountyName.'""",
'bgLatitude':""" Our best guess of the latitude of the well. In a small proportion of cases, FracFocus Latitude is wrong
but can be corrected. Although FracFocus lat/lon values may be reported in any of a number of 'Projections', all Open-FF latitudes and
longitudes are converted to a single projection: WGS84.  See bgLocationSource.""",
'bgLongitude':""" Our best guess of the longitude of the well. In a small proportion of cases, FracFocus Longitude is wrong
but can be corrected. Although FracFocus lat/lon values may be reported in any of a number of 'Projections', all Open-FF latitudes and
longitudes are converted to a single projection: WGS84. See bgLocationSource.""",
'latlon_too_coarse':"""This flag is set to True when the decimal digits of Latitude and Longitude (together) are
fewer than 5 digits. (For example 31.1, -89.34 is only 3 decimal digits.) This is an indication that the characterization
of the location is <a href="https://gisjames.wordpress.com/2016/04/27/deciding-how-many-decimal-places-to-include-when-reporting-latitude-and-longitude/">coarse enough to obscure </a>
the well's actual place on the map.""",
'no_chem_recs':"""This flag indicates that the disclosure has no chemical records """,
'is_duplicate':"""This flag indicates that this disclosure has a duplicate in FracFocus; that is, has the same APINumber and
JobEndDate.  Currently, there is no way to detect which is the correct disclosure, so all duplicates are removed in
the standard filter.""",
'has_TBWV':"""This flag indicates that a disclosure has a non-zero TotalBaseWaterVolume, and is therefore a
candidate for mass calculations""",
'within_total_tolerance':"""This flag indicates whether the totalPercent is within 5% tolerance of 100%. """,
'carrier_mass_MI':""" This is the mass of the carrier calculated from the MassIngredient field.  Used only for 
comparative purposes.""",
'carrier_density_from_comment':"""The density value reported in the IngredientComment field for those water records
with a PercentHFJob value greater than 50%""",
'carrier_percent':"""The PercentHFJob value for the carrier record(s) in the disclosure""",
'carrier_density_MI':"""This is the density of the carrier calculated from the MassIngredient field. Used only for
comparative purposes""",
'bgDensity':"""The density of the carrier used in calculating mass of the carrier.""",
'bgDensity_source':"""Currently, either the carrier density as reported in the IngredientComment for the carrier or
the default.""",
'carrier_mass':"""The calculated mass of the carrier for this disclosure.""",
'job_mass':"""The mass of the entire hydraulic fracking fluid as calculated from the carrier mass and carrier volume.""",
'MassIngredient':"""A FracFocus supplied value of the mass of the chemical in the record (in pounds).  This field
was only documented in the FracFocus README.txt file starting in 2023. These mass values are absent for many records
and we have found that some values are very inconsistent with other information in a disclosure.  Where we can verify its internal consistency,
we use this field and calcMass to produce the composite field 'mass'.""",
'ingKeyPresent':"""does the record have an IngredientsId value?""",
'density_from_comment':""" the density of the chemical in the record as reported in the IngredientComment """,
'DTXSID':"""EPA's code for this CAS number""",
'is_on_CWA':"""indicates if the chemical is on the Clean Water Act list as compiled in EPA's CompTox""",
'is_on_DWSHA':"""indicates if the chemical is on the EPA's Drinking Water Safety and Health Advisory list""",        
'dup_rec':"""This flag indicates that a record is a redundant duplicate of another in the same disclosure. These
are removed in the standard filter.""",
'massComp':"""indicates the degree of agreement between calcMass and MassIngredient for a given record.  The smaller the number, the better the agreement. 
""",
'massCompFlag':"""indicates if calcMass and MassIngredient are out of tolerance.""", # changed 'in' to 'out of' Dec. 20,2021
'syn_code': """indicates the match of the raw IngredientName with the synonym reference.""",
'is_water_carrier':"""This flag indicates the carrier record of the disclosure. """,
'job_mass_MI':"""The mass of the entire job (in pounds) as calculated using MassIngredient.""",
'is_on_prop65':"""The chemical in this record is on the California Prop 65  list.""",
'is_on_TEDX':""" The chemical in this record is on the Endocrine Disruptor Exchange list. """,
'rawName':""" (internal use) """,
'cleanName':""" (internal use) """,
'xlateName':""" (internal use)  """,
'status':""" (internal use)  """,
'FFVersion': """Most disclosures of FFVersion 1 (2011 - May 2013) do not have chemical records, only metadata.  In earlier versions,
Open-FF used the SkyTruth archive to include the chemical records from these fracking events but
that has been discontinued.""",
'APINumber':"""This field is treated as a TEXT field, not a number, to maintain the integrity of any leading zeros.  Note that there
are no dashes in the APINumbers reported in the bulk download, despite what FracFocus indicates.""",
'UploadKey':"""The values in this field are unique identifiers of individual disclosures.""",
'WellName':"""While the WellName should be a unique identifier for a given well, it is often not used consistently
when a well has multiple disclosures.""",
'TVD':"""Measurement in feet.""",
'pub_delay_days':"""Number of days between JobEndDate and day of first detection in Open-FF. Note that these values
are only valid for JobEndDates after Sept. 2018 (when Open-FF started keeping track).""",
'in_std_filtered':"""This True/False flag indicates the records that are included in the standard filtered set. 
This is generated in the Analysis_set.py module""",
'date_added':"""Indicates the date that the given disclosure is first detected in Open-FF. Note that these values
are only valid for JobEndDates after Sept. 2018 (when Open-FF started keeping track).""",
'carrier_problem_flags': """For those disclosures that detect problems when Open-FF is determining the carrier record, this 
field indicates the type(s) of issues found.  See table below for codes.""",
'carrier_status': """indicates the method by which the carrier record was identified (if possible). See table below for description of detection criteria""",
'IngredientKey':"""FracFocus-created key that is a unique number identifying a specific record in the whole data base. """,
'is_on_PFAS_list':"""True/False flag indicating if the bgCAS of the given record is on EPA's master list of PFAS chemicals
and precursors""",
'is_on_diesel':"""True/False flag indicating if the bgCAS of the given record is on EPA's list that 'represent the most appropriate interpretation of the statutory term 'diesel fuels' 
to use for permitting diesel fuels hydraulic fracturing under the UIC Program nationwide.'""",
'is_on_AQ_CWA':"""True/False flag indicating if the bgCAS of the given record is on EPA CompTox WATERQUALCRIT list""",
'is_on_HH_CWA':"""True/False flag indicating if the bgCAS of the given record is on EPA CompTox NWATRQHHC list""",
'is_on_NPDWR':"""True/False flag indicating if the bgCAS of the given record is on National Primary Drinking Water Regulation list""",
'is_on_IRIS':"""True/False flag indicating if the bgCAS of the given record is on EPA CompTox IRIS list""",
'is_on_UVCB':"""True/False flag indicating if the bgCAS of the given record is on the TSCA 'Unknown, Variable composition, Complex
reaction products and Biological' materials list""",
'epa_pref_name':"""EPA's "preferred name" for this chemical.""",     
'has_water_carrier':"""True/False flag indicating if this disclosure has an identified water carrier record that is used
to calculate mass of the chemicals in the disclosure.""",
'is_valid_cas':"""True/False flag to distinguish bgCAS values that are valid from others 'ambiguousID' and 'sysAppMeta'. Note that 'conflictingID' and 'proprietary' are valid.""",
'total_percent_all_records':"""a disclosure-level sum of PercentHFJob for all records in a disclosure.""",
'total_percent_of_valid':"""a disclosure-level sum of PercentHFJob for just the valid records in a disclosure. For this measure, 'proprietary' and 
'conflicting' records are considered valid.""",
'IngredientComment':"""Lots of different types of info show up in this field: density of the chemical in the record, contact info for a proprietary
claim, etc.""",
'loc_name_mismatch':"""Indicates that the StateName or CountyName provided in the original disclosure do not match the expected names
given the codes embedded in the APINumber. Sometimes these are simply alternative spellings of the expected name, other times they are 
completely wrong.""",
'loc_within_county':"""True/False flag value of whether the lat/lon values reported in the original disclosure are actually in the reported county.""",
'loc_within_state':"""True/False flag value of whether the lat/lon values reported in the original disclosure are actually in the reported state.""",
'Projection': """While the reported 'Latitude' and 'Longitude' are in this projection, all 'bgLatitude' / 'bgLongitude' pairs are converted to 
'WGS84'.""",
'rq_lbs':"""The 'reportable quantity' for this chemical (in pounds), see 
<a href="https://www.law.cornell.edu/cfr/text/40/302.4">this website </a>""",
'api10':"""A 10 character version of APINumber. For almost all uses, this shortened number is sufficient to identify any particular well.
In many cases, states do not even use the extra 4 digits of a 'full' API number.""",
'bgLocationSource': """Indicates the source of the bgLatitude and bgLongitude values. Typically, they are taken
from the FracFocus disclosures, but in cases where those values appear error-ridden, Open-FF uses values from
a state-derived source, such as Texas RR Commission or Ohio Department of Natural Resources.""",
'iupac_name':"""The official IUPAC name for the given substance.""",
'bgFederalLand':"""A True/False indication of whether the given well is on Federal lands.  Unlike the FracFocus field, FederalWell,
this one is generated in Open-FF by overlaying the lat/lon of the well with an authoritative USGS geographic resource,
PADUS-3, that characterizes all federal lands (the field `Own_Name` == 'FED') (currently in beta)""",
'bgStateLand':"""A True/False indication of whether the given well is on State lands.  This is generated in Open-FF by overlaying the lat/lon of the well with an authoritative USGS geographic resource,
PADUS-3, that characterizes all state lands (the field `Own_Name` == 'STAT') (currently in beta)""",
'bgNativeAmericanLand':"""A True/False indication of whether the given well is on land designated as tribal.  Unlike the FracFocus field, IndianWell,
this one is generated in Open-FF by overlaying the lat/lon of the well with an authoritative USGS geographic resource,
PADUS-3, that identifies native lands by using the Census shape files for American Indian, Alaska Native Areas 
and Hawaiian Home Lands. (currently in beta)""",
'stLatitude':"""a latitude value that comes from a state regulatory agency for the given well.  When FracFocus location values are
flagged as erroneous, this state-derived value is used instead for bgLatitude.""",
'stLongitude':"""a longitude value that comes from a state regulatory agency for the given well. When FracFocus location values are
flagged as erroneous, this state-derived value is used instead for bgLongitude.""",
'mass':"""mass (in pounds) of the material. This version of the chemical's quantity is a composite of both MassIngredient (when it is internally 
consistent) and calcMass and provides the value most reflective of the data reported in the disclosures.""",
'massSource':"""which of the two mass sources (MassIngredient or calcMass) is used in the field 'mass'""",
'calcMass_unfiltered':"""this is the value of calcMass before being compared to MassIngredient. It is useful for investigating discrepancies 
between calcMass and MassIngredient.""",
'ingredCommonName':"""Open-FF's version of the most common name used in FracFocus for the specific bgCAS value.""",
'perc_pw':"""Reported percentage of the water sources that is produced water. Optional report starting in Dec 2023.""",
'perc_sw_low_TDS':"""Reported percentage of the water sources that is surface water and low TDS. Optional report starting in Dec 2023.""",
'perc_sw_high_TDS':"""Reported percentage of the water sources that is surface water and high TDS. Optional report starting in Dec 2023.""",
'perc_gw_low_TDS':"""Reported percentage of the water sources that is ground water and low TDS. Optional report starting in Dec 2023.""",
'perc_gw_high_TDS':"""Reported percentage of the water sources that is ground water and high TDS. Optional report starting in Dec 2023.""",
'perc_other_low_TDS':"""Reported percentage of the water sources that is "other" and low TDS. Optional report starting in Dec 2023.""",
'perc_other_high_TDS':"""Reported percentage of the water sources that is "other" and high TDS. Optional report starting in Dec 2023.""",
'ws_perc_total':"""Sum of all percent water sources. Should total 100%. Optional report starting in Dec 2023.""",
        

        
       }

ffdesc = {'JobStartDate':"""The date on which the hydraulic fracturing job was initiated.  Does not include site preparation or setup.""",
'JobEndDate':"""The date on which the hydraulic fracturing job was completed.  Does not include site teardown.""",
'APINumber':"""The American Petroleum Institute well identification number formatted as follows xx-xxx-xxxxx0000 Where: First two digits 
represent the state, second three digits represent the county, third 5 digits represent the well.""",
#'StateNumber':"""The first two digits of the API number.  Range is from 01-50.""",
#'CountyNumber':"""The 3 digit county code.""",
'OperatorName':"""The name of the operator.""",
'IngredientsId':"""Key index for the Ingredients data""",
'IngredientCommonName': """Common Name of the chemical based on submission frequency in FracFocus""",
'WellName':"""The name of the well.""",
'MassIngredient':"""Mass in pounds of the ingredient used on Job""",
'WaterSourceId':"""Key index for the WaterSource data.""",
'Description':"""Definition of Water Source used on Job.""",
'Percent':"""Percent of water defined in Description used on Job.""",
'Latitude':"""The lines that circle the earth horizontally, running side to side at equal distances apart on the earth.   Latitude is typically 
expressed in degrees North/ South.  In the FracFocus system these lines are shown in decimal degrees and must be between 15 and 75.""",
'Longitude':"""The lines that circle the earth vertically, running top to bottom that are equal distances apart at the equator 
and merge at the geographic top and bottom of the earth.  Longitude is typically expressed in degrees East/ West.  In the FracFocus 
system the number representing these  lines are shown in decimal degrees and must be between -180 and -163 Note: Longitude number must 
be preceded by a negative sign.""",
'Projection':"""The geographic coordinate system to which the latitude and longitude are related. In the FracFocus 
system the projection systems allowed are NAD (North American Datum) 27 or 83 and UTM (Universal Transverse Mercator).""",
'TVD':"""The vertical distance from a point in the well (usually the current or final depth) to a point at the surface, usually the 
elevation of the rotary kelly bushing.""",
'TotalBaseWaterVolume':"""The total volume of water used as a carrier fluid for the hydraulic fracturing job (in gallons).""",
'TotalBaseNonWaterVolume':"""The total volume of non water components used as a carrier fluid for the hydraulic fracturing job (in gallons)""",
'StateName':"""The name of the state where the surface location of the well resides.  Calculated from the API number.""",
'CountyName':"""The name of the county were the surface location of the well resides.  Calculated from the API number.""",
'FFVersion':"""A key which designates which version of FracFocus was used when the disclosure was submitted.""",
'FederalWell':"""True = Yes<br>False = No.""",
'DisclosureId':"""Key linking to the RegistryUpload table.""",
'TradeName':"""The name of the product as defined by the supplier.""",
'Supplier':"""The name of the company that supplied the product for the hydraulic fracturing job (Usually the service company).""",
'Purpose':"""The reason the product was used (e.g. Surfactant, Biocide, Proppant).""",
'IngredientName':"""Name of the chemical or for Trade Secret chemicals the chemical family name.""",
'CASNumber':"""The Chemical Abstract Service identification number.""",
'PercentHighAdditive':"""The percent of the ingredient in the Trade Name product in % (Top of the range from MSDS).""",
'PercentHFJob':"""The amount of the ingredient in the total hydraulic fracturing volume in % by Mass.""",
'IngredientComment':"""Any comments related to the specific ingredient.""",
'IndianWell': """(definition not provided)""",
#'IngredientKey':"""(definition not provided)""",
'IngredientMSDS':"""True = Yes, False = No. - Is ingredient listed on MSDS sheet.""",
'ClaimantCompany':"""Name of company claiming trade secret on ingredient.""",
       }
descs = []
kys = list(desc.keys())
for k in kys:
    descs.append(desc[k].replace('\n',' '))
desc_df = pd.DataFrame({'fieldName':kys,'Open-FF description':descs})
desc_df.head()

ffdescs = []
kys = list(ffdesc.keys())
for k in kys:
    ffdescs.append(ffdesc[k].replace('\n',' '))
ffdesc_df = pd.DataFrame({'fieldName':kys,'FracFocus description':ffdescs})
ffdesc_df['source'] = 'original FF'

table = pd.merge(all_fn_df,desc_df,on='fieldName',how='outer')

# drop rows that should have been cut elsewhere!
table = table[~table.fieldName.isin(['OpCount','OperatorYears','SupCount','SupplierYears','chemInfo_available','cleanName','is_new','rawName',
                    'synCAS','source','syn_code','status','xlateName',])]

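# assign the result back: in-place fillna on a selected column may not update `table` in newer pandas
table['Database_Tables'] = table['Database_Tables'].fillna('none')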
table = pd.merge(table,ffdesc_df,on='fieldName',how='left').reset_index(drop=True)
# no_desc = table[(table['Open-FF description'].isna())&(table['FracFocus description'].isna())].fieldName.tolist()

# # make a csv file:
# table.to_csv('./out/data_dictionary.csv')

table.fieldName = '<b><h3>'+table.fieldName+'</h3></b>'
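# assign the results back: in-place fillna on a selected column may not update `table` in newer pandas
table['FracFocus description'] = table['FracFocus description'].fillna(' ')
table['Open-FF description'] = table['Open-FF description'].fillna(' ')
table['source'] = table['source'].fillna('generated')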

table['field Name, [tables]'] = table.fieldName + ' ' + table.Database_Tables.astype('str') 
iShow(table[['field Name, [tables]','FracFocus description','Open-FF description','source',
            'Num','Unique','Data_type']].sort_values('field Name, [tables]'),
      classes="display compact cell-border", 
      columnDefs=[{"className":"dt-left",  "targets": "_all"}],
     paging=False,footer=True)

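The mass-related fields described in the dictionary above (`carrier_mass`, `job_mass`, `calcMass`) can be pictured with the small sketch below. It is only an illustration under stated assumptions: the formulas are inferred from the field descriptions, the default water density of 8.34 lb/gal is an assumed value, and the function names are hypothetical. It is not a copy of the Open-FF code.

```python
# Illustrative only: formulas and the default density are assumptions
# inferred from the field descriptions, not the Open-FF implementation.
WATER_DENSITY_LB_PER_GAL = 8.34  # assumed default carrier density

def carrier_mass(total_base_water_volume_gal, density=WATER_DENSITY_LB_PER_GAL):
    """Mass of the water carrier in pounds (volume times density)."""
    return total_base_water_volume_gal * density

def job_mass(carrier_mass_lbs, carrier_percent):
    """Scale the carrier mass up to the mass of the whole job,
    assuming carrier_percent is the carrier's PercentHFJob."""
    return carrier_mass_lbs / (carrier_percent / 100.0)

def calc_mass(job_mass_lbs, percent_hf_job):
    """Mass in pounds of one chemical record."""
    return job_mass_lbs * percent_hf_job / 100.0

cm = carrier_mass(1_000_000)             # one million gallons of water
jm = job_mass(cm, carrier_percent=90.0)  # carrier is 90% of the job by mass
print(round(calc_mass(jm, percent_hf_job=0.01), 1))
```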
## Carrier detection sets:
Among the filters below, **s1** finds the majority of water carriers.  However, there is no single set of criteria that can be used to identify the water carrier record(s) for all FracFocus disclosures.  Therefore, the other filters are employed to catch many other disclosure patterns without needing to curate each by hand.

| Set name | description | Criteria to be detected|
| :--: | --- | :-- |
| **s1** | Primary filter; most recent disclosures are detected with this|- Only one record whose `Purpose` is "carrier" (or related)<br>- `bgCAS` is '7732-18-5'<br>- at least 50% `PercentHFJob`<br>- total % of disclosure is 95% < x < 105%|
| **s2** |More than one record as the carrier;<br>covers situations, for example, where there are two water records<br> (fresh and produced) and where other chemicals are also labeled as part of the carrier. <br>It is important to include all water carrier records <br>to avoid underestimating carrier mass  |- More than one record whose `Purpose` is "carrier" (or related)<br>- at least one `bgCAS` is '7732-18-5'<br>- total of water records is at least 50% `PercentHFJob`<br>- total % of disclosure is 95% < x < 105%|
| **s3** |No carrier records labeled; but clear water record with typical percentage |- `bgCAS` is '7732-18-5'<br>- at least 40% `PercentHFJob`<br>- `IngredientName` contains phrase "including mix water"<br>- total % of disclosure is 95% < x < 105%|
| **s4** | Like s3, but CAS number missing; still obvious water record |- `CASNumber` is empty <br>- at least 60% `PercentHFJob`<br>- `IngredientName` contains phrase "including mix water"<br>- total % of disclosure is 95% < x < 105%|
| **s5** | Like s1 but no carrier records are labeled |- `bgCAS` is '7732-18-5'<br>- at least 50% `PercentHFJob`<br>- total % of disclosure is 95% < x < 105%|
| **s6** |`CASNumber` missing but clear carrier label |- `bgCAS` is ambiguousID<br>- single record with a carrier `Purpose`<br>- `IngredientName` is either "carrier" (or related) or has "water" in it<br>- `TradeName` has "water" in it<br>- 50% < `PercentHFJob` < 100%<br>- total % of disclosure is 95% < x < 105%|
| **s7** |Like s1, but for "salted" water<br>Note that even though the record is labeled with the salt CAS number,<br> the predominant mass is water |- Only one record whose `Purpose` is "carrier" (or related)<br>- `bgCAS` is either '7747-40-7' (kcl) or '7647-14-5' (nacl)<br>- at least 50% `PercentHFJob`<br>- total % of disclosure is 95% < x < 105%|
| **s8** |Common pattern in the older disclosures (incl. SkyTruth archive)|- `bgCAS` is ambiguousID or 7732-18-5<br>- `IngredientName` is MISSING<br>- `Purpose` is "unrecorded purpose"<br>- `TradeName` has either "water" or "brine"<br>- can be one or two records in each disclosure<br>- 50% < sum of `PercentHFJob` of these records < 100%<br>- total % of disclosure is 95% < x < 105%|
| **s9** |Common pattern in the older disclosures (incl. SkyTruth archive)|- `bgCAS` is ambiguousID or 7732-18-5<br>- `IngredientName` is MISSING<br>- `Purpose` is one of the standard carrier words or phrases<br>- `TradeName` has either "water" or "brine"<br>- can be one or two records in each disclosure<br>- 50% < sum of `PercentHFJob` of these records < 100%<br>- total % of disclosure is 95% < x < 105%|
| **s10** |A pattern seen in later disclosures: <br>the carrier is only reported in the top part of the <br>systems approach section under the "Listed Below"  `CASNumber`.<br>The actual `PercentHFJob` value isn't even reported in the PDF <br>version, but is in the bulk download.|- `CASNumber` is "Listed Below"<br>- record has a carrier `Purpose`<br>- `PercentHFJob` > 50%<br>- `TradeName` has "water" in it<br>- total % of disclosure is 95% < x < 105%|

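To make the **s1** criteria in the table above concrete, here is a minimal sketch that tests a single disclosure held as a list of plain dictionaries. It is not the Open-FF implementation: the function name `find_s1_carrier` is hypothetical, `CARRIER_WORDS` is a made-up stand-in for the "carrier (or related)" `Purpose` values, and the disclosure total is assumed to be the sum of `PercentHFJob` over all records.

```python
# Sketch of the s1 criteria only. CARRIER_WORDS is a hypothetical stand-in
# for the 'carrier (or related)' Purpose values; the real list is not shown here.
CARRIER_WORDS = {"carrier", "base fluid", "carrier/base fluid"}
WATER_CAS = "7732-18-5"

def find_s1_carrier(records):
    """Return the single s1 carrier record of a disclosure, or None.

    `records` is a list of dicts with the keys Purpose, bgCAS and PercentHFJob.
    """
    # assumption: the disclosure total is the sum over all records
    total = sum(r["PercentHFJob"] for r in records)
    if not 95.0 < total < 105.0:
        return None
    labeled = [r for r in records
               if r["Purpose"].strip().lower() in CARRIER_WORDS]
    if len(labeled) != 1:          # s1 requires exactly one carrier record
        return None
    rec = labeled[0]
    if rec["bgCAS"] == WATER_CAS and rec["PercentHFJob"] >= 50.0:
        return rec
    return None

disclosure = [
    {"Purpose": "Carrier", "bgCAS": "7732-18-5", "PercentHFJob": 88.0},
    {"Purpose": "Proppant", "bgCAS": "14808-60-7", "PercentHFJob": 11.5},
    {"Purpose": "Biocide", "bgCAS": "111-30-8", "PercentHFJob": 0.02},
]
print(find_s1_carrier(disclosure))
```

A disclosure with two labeled carrier records, or one whose percentages sum outside the tolerance, returns `None` here and would be left for the other sets to examine.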

## Disclosures with detected problems for determination of water carrier ID

| code | description |
| :---: | --- |
| **0** | Disclosure has no valid chemical records. |
| **1** | `TotalBaseWaterVolume` is empty or 0 gallons.|
| **2** | None of the chemical records have non-zero `PercentHFJob`.|
| **3** | The sum of `PercentHFJob` values for valid CAS records is larger than limit (105%)|
| **4** | The sum of `PercentHFJob` values for all records excluding SystemApproach is larger than limit (105%)|
| **5** | `PercentHFJob` of all "proppant" records is greater than 50% (not used after v16) |
| **6** | The sum of `PercentHFJob` values for all records is less than 90% - a partial disclosure|
| **7** | `PercentHFJob` of Nitrogen or Carbon Dioxide records is greater than 50% (so carrier will be smaller) (not used after v16) |
| **8** | `PercentHFJob` of Chlorine dioxide records is 100% (it is typically an additive to the water; not a replacement) (added 3/2023, after v16). |
| **9** | `PercentHFJob` of Nonwater carrier record too large (>50%) (added 3/2023, after v16). |

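As an illustration of how a few of the codes above might be assigned, the sketch below checks codes 1, 2 and 6 for one disclosure. It is a simplified guess at the logic based only on the table descriptions; the function name and its inputs are hypothetical, and whether Open-FF reports several codes at once in exactly this way is an assumption.

```python
def carrier_problem_flags(total_base_water_volume, percents):
    """Return the codes among 1, 2 and 6 that apply to a disclosure.

    `percents` holds the PercentHFJob values of the disclosure's records,
    with None standing for a missing value.
    """
    flags = []
    if not total_base_water_volume:       # empty (None) or 0 gallons
        flags.append(1)
    values = [p for p in percents if p]   # drop missing and zero values
    if not values:                        # no record has a non-zero percent
        flags.append(2)
    if sum(values) < 90.0:                # partial disclosure
        flags.append(6)
    return flags

print(carrier_problem_flags(500_000, [88.0, 11.5]))
print(carrier_problem_flags(0, [None, 0]))
```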