In [None]:
import sys
sys.path.insert(0,'c:/MyDocs/integrated/') # adjust to your setup

%run "catalog_support.py" 
showHeader('Ohio Drilling Chemicals')

## Description
Chemicals are used in many phases of an oil or gas well's life span.  FracFocus reports on just one phase: chemicals used during the hydraulic fracturing phase. 

Another phase, however, is the drilling phase, during which the hole is initially created.  This phase might be of interest because any chemicals added are not separated from the surrounding rocks and, potentially, aquifers.  This is unlike the fracking phase, because a cement casing is added before fracking commences.  

However, also unlike the situation for fracking chemicals, drilling chemicals are not often reported.  Ohio is (currently) the only state to require disclosure of drilling chemicals. 

This page reports the beginning of a data set of disclosures of Ohio Drilling Chemicals.  

## Current status of this data set
We have found direct web links to over 3000 disclosure forms (OH Form 8A).  We have not yet scraped the full data set, but doing so is not a giant task, at least for those disclosures that can be scraped automatically.  There is a lot of interesting info (and missing data!) here.  If you are interested in diving deeper into these data, contact us; we may have tools to help.

### Data source
The data are maintained on the ODNR database system.  Typically, the way a user would locate the drilling chemical report would be to:
- Go to the ODNR's online database search and find the ["Completions" search page](https://apps.ohiodnr.gov/oilgas/rbdmsreports/Reports_Completions.aspx).
- Enter the APINumber of the well of interest into the appropriate cell.
- When the result is returned, look for the link towards the bottom labeled "CHEMICAL" or something similar.
- Click that link to open or download the PDF file.


#### Easier access
The table below connects you directly to the chemical disclosure for each well that has a disclosure.
This set of PDF files covers a variety of formats, from computer-generated and scrape-able to handwritten forms.  It is based on a webcrawl in **June 2022**.  Newer disclosures won't be here.

In [None]:
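def make_clickable(val,text='ODNR link'):
    """Return an HTML anchor for URL strings; pass any other value through unchanged."""
    # non-string values (e.g. NaN for a missing link) are returned as-is
    if isinstance(val,str) and val.startswith('http'):
        return '<a href="{}" target="_blank">{}</a>'.format(val,text)
    return val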


In [None]:
pdf_lst = pd.read_csv(r"C:\MyDocs\OpenFF\src\testing\oh_scrape\chemical_pdf_list_with_ana.csv",
                     dtype={'APINumber':'str'})
pdf_lst['pdflink'] = pdf_lst.link.map(lambda x: make_clickable(x))
apis = pdf_lst.APINumber.unique().tolist()
# iShow(pdf_lst[['APINumber','ntables','pdflink']],maxBytes=0)

|Explanation of columns below|
| :---: |

| Column      | Description |
| :----: | :-------- |
|APINumber | The standard well identification number|
|map | Link to a Google satellite view of the geocoordinates given in FracFocus for this well|
|date| FracFocus date for End of Job; 'NaT' means disclosure is not in FracFocus|
|bgCountyName | County name for this well as reported in Open-FF; 'nan' means disclosure is not in FracFocus |
|bgOperatorName |Operator company for this well as reported in Open-FF; 'nan' means disclosure is not in FracFocus |
|ntables | Number of tables found by scraping software (camelot.py) in the PDF; 0 tables usually means the PDF is not scrape-able.|
|pdflink| Direct link to the ODNR copy of this Form.|
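
To illustrate what the `map` column holds, here is a minimal sketch of building a satellite-view link from a latitude and longitude. It is not the code behind `th.getMapLink`, whose implementation is not shown on this page; the helper name `satellite_map_link`, the zoom level, and the sample coordinates are all assumptions for illustration. The URL follows the publicly documented Google Maps URL pattern (`map_action=map` with `basemap=satellite`).

```python
from html import escape
from urllib.parse import urlencode

def satellite_map_link(lat, lon, text='map', zoom=16):
    """Hypothetical helper: HTML link to a Google Maps satellite view of (lat, lon)."""
    query = urlencode({
        'api': 1,
        'map_action': 'map',
        'center': f'{lat},{lon}',   # the comma is percent-encoded by urlencode
        'zoom': zoom,               # assumed default; adjust to taste
        'basemap': 'satellite',
    })
    url = f'https://www.google.com/maps/@?{query}'
    # escape '&' so the URL is valid inside an HTML attribute
    return f'<a href="{escape(url, quote=True)}" target="_blank">{text}</a>'

# made-up coordinates, roughly in eastern Ohio
link = satellite_map_link(40.0, -81.0)
print(link)
```

Wells that are not in FracFocus have no coordinates, so a real helper would also need to handle missing values.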

In [None]:
# fetch data set
df = fh.get_df(os.path.join(hndl.sandbox_dir,'workdf.parquet'))

In [None]:
gb = df[df.APINumber.isin(apis)].groupby('APINumber',as_index=False)[['bgCountyName','date',
                                                                      'bgOperatorName','bgLatitude',
                                                                      'bgLongitude']].first()
gb['map'] = gb.apply(lambda x: th.getMapLink(x,'map'),axis=1)

mg = pd.merge(pdf_lst,gb,on='APINumber',how='left')

iShow(mg[['APINumber','map','date','bgCountyName','bgOperatorName','ntables','pdflink']],maxBytes=0)

## Interesting and/or typical disclosures found in this collection

|APINumber with link | Description |
| :--- |  :--- |
| [34013206610000](https://gis.ohiodnr.gov/MapViewer/download.ashx?AB68B541-C473-4D51-9C7A-3E2B5B7B53A1WSC) | handwritten; detailed, but no CAS Numbers |
| [34111243050100](https://gis.ohiodnr.gov/MapViewer/download.ashx?3D93459B-FA2E-46B9-9C2A-34BECA7CF560WSC) | lots of chemicals including BTEX; disclosure not scrape-able. |
|[34111245720000](https://gis.ohiodnr.gov/MapViewer/download.ashx?F2329050-9966-41E7-A04E-4F84B0723A02WSC) | long list of chemicals, used in relatively large masses.|
|[34029219110000](https://gis.ohiodnr.gov/MapViewer/download.ashx?39945EF7-F351-4E31-B867-95E3C94585C7WSC) | simple list; PDF not scrape-able.|
|[34081207820000](https://gis.ohiodnr.gov/MapViewer/download.ashx?F40687D1-9BD5-42BD-A609-ED1DFE6FEED8WSC) | simple list; common |
|[34169256880000](https://gis.ohiodnr.gov/MapViewer/download.ashx?497FEC2E-E986-4568-9E6E-846A621225BEWSC) | no chemicals reported; common |


---
# CAS Numbers reported
Preliminary scraping of these PDFs gives us an idea of the set of chemicals used in drilling.  
(This was not an exhaustive search of all PDFs; many PDFs require more work to scrape than I had time for!  Nevertheless, there are more than 1200 distinct wells with identifiable chemicals.)

In [None]:
import re
import string

def is_valid_CAS_code(cas):
    """Returns boolean.
    
    Checks if number follows strictest format of CAS registry numbers:
        
    - three sections separated by '-', 
    - section 1 is 2-7 digits with no leading zeros, 
    - section 2 is two digits (no dropping leading zero),
    - section 3 (check digit) is just one digit that satisfies validation algorithm.
    - No extraneous characters."""
    try:
        # reject any character other than digits and '-'
        if any(c not in '0123456789-' for c in cas):
            return False
        lst = cas.split('-')
        if len(lst)!=3 : return False
        if len(lst[2])!=1 : return False # check digit must be a single digit
        if lst[0][0] == '0': return False # leading zeros not allowed
        s1int = int(lst[0])
        if s1int > 9999999: return False
        if s1int < 10: return False
        s2int = int(lst[1])
        if s2int > 99: return False
        if len(lst[1])!=2: return False # must be two digits, even if <10

        # validate test digit
        teststr = lst[0]+lst[1]
        teststr = teststr[::-1] # reverse for easy calculation
        accum = 0
        for i,digit in enumerate(teststr):
            accum += (i+1)*int(digit)
        if accum%10 != int(lst[2]):
            return False
        return True
    except Exception:
        # some other problem (e.g. a non-string value such as NaN)
        return False


def cleanup_cas(cas):
    """Returns string.
    
    Removes extraneous characters and adjusts zeros where needed:
        
    - need two digits in middle segment and no leading zeros in first.
    Note that we DON'T check CAS validity, here. Just cleanup. 
    """
    #print(cas)
    cas = re.sub(r'[^0-9-]','',cas)
    lst = cas.split('-') # try to break into three segments
    if len(lst) != 3: return cas # wrong number of pieces - return filtered cas
    if len(lst[2])!= 1: return cas # can't do anything here with malformed checkdigit
    if len(lst[1])!=2:
        if len(lst[1])==1:
            lst[1] = '0'+lst[1]
        else:
            return cas # wrong number of digits in chunk2 to fix here
    lst[0] = lst[0].lstrip('0')
    if (len(lst[0])<2 or len(lst[0])>7): return cas # too many or too few digits in first segment
    
    return f'{lst[0]}-{lst[1]}-{lst[2]}'
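
The check-digit rule applied by `is_valid_CAS_code` can be worked by hand. Take water, 7732-18-5: join the first two segments (773218), multiply each digit by its position counted from the right, and sum: 8·1 + 1·2 + 2·3 + 3·4 + 7·5 + 7·6 = 105. The last digit of that sum, 105 mod 10 = 5, must equal the third segment. The sketch below isolates just that step; the function names are illustrative and it assumes a well-formed, three-segment string.

```python
def cas_check_digit(body_digits):
    """Expected check digit for the digits of segments 1 and 2 joined together."""
    # weight each digit by its 1-based position counted from the right
    total = sum(pos * int(d)
                for pos, d in enumerate(reversed(body_digits), start=1))
    return total % 10

def check_digit_matches(cas):
    """True if the third segment equals the computed check digit."""
    first, middle, check = cas.split('-')
    return cas_check_digit(first + middle) == int(check)

print(cas_check_digit('773218'))         # 5  (water, 7732-18-5)
print(check_digit_matches('64-17-5'))    # True  (ethanol)
print(check_digit_matches('71-43-2'))    # True  (benzene)
print(check_digit_matches('7732-18-6'))  # False (wrong check digit)
```

A single mistyped digit always changes the sum's last digit, which is why this test is useful for catching scraping errors.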


In [None]:
casdf = pd.read_csv(r"C:\MyDocs\OpenFF\src\testing\oh_scrape\prospective_cas_list.csv")
casdf = casdf[casdf.cas_num.notna()]
gb = casdf.groupby('cas_num',as_index=False)['APINumber'].count()
#print(gb.head())


In [None]:
gb['cleancas'] = gb.cas_num.map(lambda x: cleanup_cas(x))
gb['is_valid'] = gb.cleancas.map(lambda x: is_valid_CAS_code(x))

#print(gb.head(50))
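
To make the cleanup step concrete, here is a compact regex-based sketch of the same kind of normalization applied to a few strings of the sort a PDF scrape might yield. `normalize_cas` is a hypothetical stand-in written for this example, not the `cleanup_cas` function above, and the sample strings are invented rather than taken from the data set.

```python
import re

def normalize_cas(raw):
    """Hypothetical compact variant: strip junk characters, fix zero padding."""
    stripped = re.sub(r'[^0-9-]', '', raw)   # keep only digits and '-'
    # optional leading zeros, 2-7 digit first segment, 1-2 digit middle, 1 check digit
    m = re.fullmatch(r'0*([1-9]\d{1,6})-(\d{1,2})-(\d)', stripped)
    if m is None:
        return stripped                      # cannot repair; return filtered text
    first, middle, check = m.groups()
    return f'{first}-{middle.zfill(2)}-{check}'

samples = [' 7732-18-5 ', '0064-17-5', 'CAS# 7647-1-0', '71-43-2*', 'proprietary']
for s in samples:
    print(repr(s), '->', repr(normalize_cas(s)))
# ' 7732-18-5 '   -> '7732-18-5'
# '0064-17-5'     -> '64-17-5'
# 'CAS# 7647-1-0' -> '7647-01-0'
# '71-43-2*'      -> '71-43-2'
# 'proprietary'   -> ''
```

As in the notebook, normalization and validation are kept separate: a cleaned string can still fail the check-digit test.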

In [None]:
# Now compare to FF chemicals
df = ana_set.Full_set(repo = repo_name, outdir='../common/').get_set(verbose=False)
ffcas = df.bgCAS.unique().tolist()

gb['is_in_FF'] = gb.cleancas.isin(ffcas)
gb.to_csv('drilling_cas_list.csv')
gb
