In [None]:
%run "catalog_common.py"
ID_header('Open-FF:  CASNumber and IngredientName pairs', incl_links=True)
set_page_param()

Starting with Open-FF version 10, we have used a curation method to better clean the FracFocus data set.  In this method, every unique combination of CASNumber and IngredientName is evaluated manually to determine the best chemical label to assign to records.  FracFocus records about 1,300 unique chemical materials, but because companies record these chemicals in so many different ways, the list of CASNumber/IngredientName pairs is over 25,000 entries long.

This is essentially a translation table.  The input is the CASNumber and IngredientName pair and the output is bgCAS, our best guess for the proper identity of the chemical in the record.
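
As a minimal sketch of how such a translation table can be applied, the example below joins a few made-up raw records to a made-up curated table on the CASNumber/IngredientName pair. The data frames `records` and `xlate` and their contents are hypothetical illustrations, not the Open-FF files or the project's own code.

```python
import pandas as pd

# Hypothetical raw records: the same chemical entered in several ways.
records = pd.DataFrame({
    "CASNumber": ["7732-18-5", "7732-18-5", "07732-18-5", "7732-18-5"],
    "IngredientName": ["Water", "water", "Water", "Water"],
})

# Hypothetical curated translation table: one row per unique raw pair,
# each mapped to a single best-guess label (bgCAS).
xlate = pd.DataFrame({
    "CASNumber": ["7732-18-5", "7732-18-5", "07732-18-5"],
    "IngredientName": ["Water", "water", "Water"],
    "bgCAS": ["7732-18-5", "7732-18-5", "7732-18-5"],
})

# Left-join on the pair; validate guards against duplicate pairs in the table.
labeled = records.merge(
    xlate,
    on=["CASNumber", "IngredientName"],
    how="left",
    validate="many_to_one",
)
print(labeled)
```

Three distinct raw pairs collapse to one bgCAS here, which is the same effect that turns roughly 1,300 materials into more than 25,000 pairs in the full data set.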

See the bottom of this page for notes about issues with these pairs.

In [None]:
import pandas as pd
import numpy as np
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)
from itables import show as iShow
import itables.options as opt

In [None]:
#import core.Analysis_set as ana_set
#import core.Construct_set as const_set
import matplotlib.pyplot as plt
%matplotlib inline

df_cas = ana_set.Catalog_set(repo = repo_name, outdir='../common/').get_set(verbose=False)

In [None]:
gb = df_cas.groupby(['CASNumber','IngredientName'],as_index=True)['UploadKey'].count().reset_index()
gb = gb.rename({'UploadKey':'record_count'},axis=1)

In [None]:
casing = grd.get_curated_df(repo_name,'casing_curated.parquet')
print(casing.columns)
CAScurated = grd.get_curated_df(repo_name,'CAS_curated.parquet').rename({'comment':'CAS_comment'},axis=1)
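# Assign the result back: an in-place fillna on a selected column may not update the frame in recent pandas.
CAScurated['CAS_comment'] = CAScurated['CAS_comment'].fillna(' ')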
casing = pd.merge(casing,CAScurated[['CASNumber','CAS_comment']],on='CASNumber',how='left')
#casing.columns

In [None]:
# One representative bgIngredientName per CASNumber/IngredientName pair.
gb1 = df_cas.groupby(['CASNumber','IngredientName'],as_index=True)['bgIngredientName'].first().reset_index()

# Unique APINumbers per pair, sorted, then condensed to a short display string.
gb2 = df_cas.groupby(['CASNumber','IngredientName'])['APINumber'].apply(set).reset_index()
gb2.APINumber = gb2.APINumber.map(sorted)
# xlate_to_str is not defined in this notebook; it is presumably provided by catalog_common.py.
gb2.APINumber = gb2.APINumber.map(lambda x: xlate_to_str(x,' ',totallen=30))
casing = pd.merge(casing,gb,on=['CASNumber','IngredientName'],how='left')
casing = pd.merge(casing,gb1,on=['CASNumber','IngredientName'],how='left')
casing = pd.merge(casing,gb2,on=['CASNumber','IngredientName'],how='left')
casing['<CAS'] = '<h1>||</h1>'
casing['<Ing'] = '<h1>||</h1>'
casing.CASNumber = '<b>'+casing.CASNumber+'</b>'
casing.IngredientName = '<b>'+casing.IngredientName+'</b>'
casing['curCAS'] = casing.curatedCAS #+'<br>'+casing.categoryCAS
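# The curated file apparently uses different column names in different versions;
# fall back to the alternate names only when the first ones are missing.
try:
    casing['curING'] = casing.synCAS #+'<br>'+casing.recog_syn
except AttributeError:
    casing['curING'] = casing.prospect_CAS_fromIng+'<br>'+casing.syn_code
try:
    casing['curfinal'] = '<h3>'+casing.bgCAS+'</h3><br>'+casing.source
except AttributeError:
    casing['curfinal'] = '<h3>'+casing.bgCAS+'</h3><br>'+casing.bgSource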
#print(casing.columns)

|Explanation of columns in the index|
| :---: |

| Column      | Description |
| :----: | :-------- |
| | **Curation of CASNumber**|
|*raw CASNumber*| is the text in the CASNumber field of the original FracFocus data set, as found.|
|*CAS comment:*| any comments made by the Open-FF person evaluating the raw CASNumber.|
|*curated CASNumber*| is the CAS number after curation steps|
| | **Curation of IngredientName**|
|*IngredientName*| is the raw text in the IngredientName field of the original FracFocus data set, as found.|
|*implied CAS from IngName (code)*| is the implied CAS number from the IngredientName as given; below is the curation code|
| | **Curation of the CASNumber/IngredientName Pair**|
|*final bgCAS and source*| shows the 'best guess' CAS Number when considering the curated versions of CASNumber and IngredientName. The line below it shows which member of the pair was used for this conclusion.|
| | **Pair characteristics** |
|*record_count*| is the number of times this CASNumber/IngredientName pair occurs in the original FracFocus data set.|
|*APINumber*| example well APINumbers that have reported this CASNumber/IngredientName pair |


In [None]:
casing = casing[casing.record_count.notna()][['CASNumber','CAS_comment','curCAS','<CAS',
                                             'IngredientName','curING','<Ing',
                                             'curfinal','record_count','APINumber']]
casing = casing.rename({'curCAS':'curated CASNumber','CASNumber':'raw CASNumber',
                 'curING':'implied CAS from IngName (code)','curfinal':'final bgCAS and source',
                 'bgCAS':'output: bgCAS','record_count':'record count'},axis=1)
iShow(casing.sort_values('record count',ascending=False).reset_index(drop=True),
      maxBytes=0,classes="display compact cell-border")
# iShow(casing,maxBytes=0)
