# Analysis on CWA chemicals
Just a framework to get things started.
Version 1 - July 14, 2022

In [None]:
# This cell downloads some support code that is used to pull together the data set.  
!git clone https://github.com/gwallison/colab-support.git &>/dev/null;

# now run the code that defines the routine
%run colab-support/get_dataframe.py

In [None]:
# get_dataset pulls together a set of CSV files from a google storage site, then merges them
#  result: df is a dataframe with all records (though not ALL fields)
df = get_dataset()

# if you want to see what fields are in df, uncomment the following line
# df.columns

To get an understanding of what the various fields in the data set are, go to the [Data Dictionary](https://storage.googleapis.com/open-ff-browser/Open-FF_Data_Dictionary.html).

In [None]:
# just for reference, let's see how many unique bgCAS values there are:
print(f'Number of unique bgCAS values in df: {len(df.bgCAS.unique())}')
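
A small aside on counting: `len(s.unique())` counts a missing value (NaN) as one of the unique values, while `s.nunique()` skips missing values by default. The sketch below uses a made-up Series (the values are illustrative, not drawn from the data set) to show the difference; whether bgCAS actually contains missing values is not something this notebook establishes.

```python
import numpy as np
import pandas as pd

# Hypothetical CAS-like values, including one missing entry
toy = pd.Series(['50-00-0', '71-43-2', '50-00-0', np.nan])

count_with_nan = len(toy.unique())   # NaN is counted as its own value
count_without_nan = toy.nunique()    # NaN is dropped by default

print(count_with_nan, count_without_nan)
```

If the two numbers differ on a real column, some records have no value in that field.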

Since we are going to focus on CWA chemicals, we have THREE built-in lists to work with, each marked by its own flag:
> is_on_CWA

> is_on_AQ_CWA

> is_on_HH_CWA

We'll start with all three as a filter and call the resulting dataframe "cwadf".
The '|' character is the OR operator ('&' is the AND operator).
So, in the following code, a record ends up in our output dataframe if its bgCAS is on any one of those lists (the OR).

In [None]:
cwadf = df[df.is_on_CWA | df.is_on_AQ_CWA | df.is_on_HH_CWA ].copy() 
print(f'Number of unique bgCAS values in cwadf: {len(cwadf.bgCAS.unique())}')
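
To see how '|' and '&' behave on flag columns, here is a minimal sketch on a made-up dataframe. The column names mirror the three flags above, but the rows and bgCAS values are invented for illustration.

```python
import pandas as pd

# Made-up records; the flag columns mimic the three CWA flags
toy = pd.DataFrame({
    'bgCAS': ['A', 'B', 'C', 'D'],
    'is_on_CWA': [True, False, False, False],
    'is_on_AQ_CWA': [True, True, False, False],
    'is_on_HH_CWA': [False, False, False, True],
})

# OR: keep a record if ANY of the flags is True
any_list = toy[toy.is_on_CWA | toy.is_on_AQ_CWA | toy.is_on_HH_CWA]

# AND: keep a record only if BOTH flags are True
both_lists = toy[toy.is_on_CWA & toy.is_on_AQ_CWA]

print(any_list.bgCAS.tolist())
print(both_lists.bgCAS.tolist())
```

Swapping '|' for '&' in the cwadf filter would narrow the result to chemicals that appear on all three lists.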

Let's summarize this new dataframe, one row per bgCAS; we can do this with 'groupby'.

In [None]:
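# names: take the first ingredient name and EPA preferred name found for each bgCAS
gb1 = cwadf.groupby('bgCAS',as_index=False)[['bgIngredientName','epa_pref_name']].first()

# total number of records for each bgCAS
gb2 = cwadf.groupby('bgCAS',as_index=False)['UploadKey'].count().rename({'UploadKey':'tot_record_cnt'},axis=1)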

# how many records have a calculable mass
gb3 = cwadf[cwadf.calcMass>0].groupby('bgCAS',as_index=False)['UploadKey'].count().rename({'UploadKey':'calc_mass_cnt'},axis=1)

# total mass
gb4 = cwadf[cwadf.calcMass>0].groupby('bgCAS',as_index=False)['calcMass'].sum().rename({'calcMass':'total_mass'},axis=1)

# biggest recorded mass
gb5 = cwadf[cwadf.calcMass>0].groupby('bgCAS',as_index=False)['calcMass'].max().rename({'calcMass':'max_mass'},axis=1)
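
As an alternative sketch (not the approach used in this notebook), the three mass-related summaries can be computed in one pass with pandas named aggregation. The toy dataframe below is invented; only the column names are borrowed from the cell above.

```python
import pandas as pd

# Made-up records: two chemicals, one zero mass and one missing mass
toy = pd.DataFrame({
    'bgCAS': ['A', 'A', 'B', 'B', 'B'],
    'UploadKey': ['u1', 'u2', 'u3', 'u4', 'u5'],
    'calcMass': [10.0, 0.0, 5.0, 20.0, None],
})

# keep only records with a positive mass, as in the cell above;
# zero and missing masses both fail the comparison and are dropped
has_mass = toy[toy.calcMass > 0]

# one groupby, three named output columns
summary = has_mass.groupby('bgCAS', as_index=False).agg(
    calc_mass_cnt=('UploadKey', 'count'),
    total_mass=('calcMass', 'sum'),
    max_mass=('calcMass', 'max'),
)
print(summary)
```

Keeping the summaries as separate dataframes, as the notebook does, makes each step easier to inspect; the single `agg` call is just more compact.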


In [None]:
# now we will pull them all together with merge
import pandas as pd  # needed for pd.merge; harmless if pandas is already imported as pd

gb = pd.merge(gb1,gb2,on='bgCAS',how ='left',validate='1:1')
gb = pd.merge(gb,gb3,on='bgCAS',how ='left',validate='1:1')
gb = pd.merge(gb,gb4,on='bgCAS',how ='left',validate='1:1')
gb = pd.merge(gb,gb5,on='bgCAS',how ='left',validate='1:1')

# That should do it; click on the magic wand icon to make the table sortable and searchable
gb
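
For a feel of what `how='left'` and `validate='1:1'` do in the merges above, here is a minimal sketch on invented data. With a left merge, every row of the left table is kept, and a chemical with no match on the right gets NaN. With `validate='1:1'`, pandas raises a `MergeError` if a key is duplicated on either side.

```python
import pandas as pd

# Made-up tables: 'C' has a name but no mass record
names = pd.DataFrame({'bgCAS': ['A', 'B', 'C'],
                      'name': ['alpha', 'beta', 'gamma']})
masses = pd.DataFrame({'bgCAS': ['A', 'B'],
                       'total_mass': [10.0, 25.0]})

# left merge: all three rows of 'names' survive; C gets NaN for total_mass
merged = pd.merge(names, masses, on='bgCAS', how='left', validate='1:1')
print(merged)

# a duplicated key on the right side violates the 1:1 check
dupes = pd.DataFrame({'bgCAS': ['A', 'A'], 'total_mass': [1.0, 2.0]})
try:
    pd.merge(names, dupes, on='bgCAS', how='left', validate='1:1')
    raised = False
except pd.errors.MergeError:
    raised = True
print(f'1:1 validation raised an error: {raised}')
```

This is why chemicals that never have a positive calcMass will show NaN in the count and mass columns of the summary table.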