# Exploring data

## Step 1: Upload data to Colab
Click the folder icon in the menu bar on the left, then click the Upload button and select the file to upload. Once the upload finishes, Colab can read the data.

In [None]:
# preamble to load necessary modules
import pandas as pd
import numpy as np
import altair as alt

In [None]:
file_name = 'full_proprietary.csv'  ###  <<  change this name to your file name
df = pd.read_csv(file_name,
                 dtype={'APINumber': 'str'},  # we want pandas to treat APINumber as text, not a number
                 low_memory=False)  # read the file in one pass so each column gets a single dtype (uses more memory)
df['date'] = pd.to_datetime(df['date'])
print(f'Number of records: {len(df)}')
df.columns  # see what column names you have to work with

## Step 2: Filter to a smaller set
Use the cell below to filter the data, for example:
- by PercentHFJob or calcMass size
- by company
- by state or year

In [None]:
# first filter
data = df[df.PercentHFJob >= 1].copy()
# filter more
#data = data[data.bgStateName=='oklahoma']
# filter even more
#data = data[data.date.dt.year==2019]
print(f'Number of records: {len(data)}')
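The separate filters above can also be combined into one boolean mask. Here is a minimal sketch on a made-up three-row frame; the values are invented for illustration, and only the column names come from this notebook.

```python
import pandas as pd

# Toy stand-in for df: invented values, column names as used in this notebook
toy = pd.DataFrame({
    "bgStateName": ["oklahoma", "texas", "oklahoma"],
    "PercentHFJob": [2.5, 0.4, 0.1],
    "date": pd.to_datetime(["2019-03-01", "2019-07-15", "2020-01-10"]),
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
mask = (
    (toy.PercentHFJob >= 1)
    & (toy.bgStateName == "oklahoma")
    & (toy.date.dt.year == 2019)
)
toy_filtered = toy[mask].copy()
print(f"Number of records: {len(toy_filtered)}")
```

Building the mask in one step keeps all the conditions visible together, while the step-by-step version above makes it easier to switch individual filters on and off.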


## Other Steps

## Summarizing data
It is often helpful to summarize data. For example, the proprietary data set has one line for every record in which a proprietary claim is made, but we may want to know the total mass of proprietary chemicals per state.

To do this we use the pandas "groupby" method.

The cell below shows several **optional** summaries. Uncomment whichever line you want to execute.

In [None]:
# make a set with just states and total calculated mass of proprietary chemicals
out_df = data.groupby('bgStateName',as_index=False)['calcMass'].sum()

# "as_index=False" keeps the groupby variable(s) as ordinary columns in the result,
#   which is usually what we want.
#   The square brackets hold the variable to be summarized.  The resulting data frame
#   will include the summary column under that same name.

# make a set with states and companies and total calculated mass of proprietary chemicals
#out_df = data.groupby(['bgStateName','bgOperatorName'],as_index=False)['calcMass'].sum()

# make a set with mass for each disclosure - 'UploadKey' is the field that's unique for each disclosure
# ! UploadKey is not currently in the input data set.  Let me know if you want it...
#out_df = data.groupby('UploadKey',as_index=False)['calcMass'].sum() 

# make a set with the count of proprietary records by bgSupplier.  With 'count', the variable
#   in the square brackets matters less: the result is the number of non-missing values of that
#   variable in each group, named after the variable you choose.
#out_df = data.groupby('bgSupplier',as_index=False)['bgCAS'].count() 

# you can also do some basic stats with groupby
#out_df = data.groupby('bgOperatorName',as_index=False)['calcMass'].median() 
# or
#out_df = data.groupby('bgOperatorName',as_index=False)['calcMass'].max() 


out_df.head()
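Several summaries can also be computed in a single pass with named aggregation. The sketch below uses a made-up frame; the output column names total_mass, median_mass, and n_records are hypothetical labels chosen for illustration.

```python
import pandas as pd

# Toy stand-in for data: invented values, column names as used in this notebook
toy = pd.DataFrame({
    "bgStateName": ["oklahoma", "oklahoma", "texas"],
    "calcMass": [10.0, 30.0, 5.0],
})

# Named aggregation: each keyword becomes an output column,
# given as (column to summarize, summary function)
summary = toy.groupby("bgStateName", as_index=False).agg(
    total_mass=("calcMass", "sum"),
    median_mass=("calcMass", "median"),
    n_records=("calcMass", "count"),
)
print(summary)
```

This avoids running groupby once per statistic and gives each summary column its own descriptive name.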

## Save your data to an external file
Save the data to use in, for example, Excel, a stats package, or a graphing program, whatever is useful. Below, the data are saved in CSV format, but [other formats](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html?highlight=save%20formats) are possible.

After you execute this cell, your file should show up in the Files panel of Colab. From there you can download it, send it to your Google Drive, etc.

In [None]:
data.to_csv('my_export.csv')  # saves the filtered records; use out_df.to_csv(...) to save the summary instead
data.head() # just show the first few records of the data as well
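By default, to_csv also writes the pandas row index as an extra unnamed column, which is often unwanted in Excel. The sketch below shows one way to leave it out and to confirm that the file reads back as expected. The frame, its values, and the file name toy_export.csv are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for data: invented values, column names as used in this notebook
toy = pd.DataFrame({"APINumber": ["05123456780000"], "calcMass": [12.5]})

# index=False leaves out the pandas row-number column
path = os.path.join(tempfile.gettempdir(), "toy_export.csv")
toy.to_csv(path, index=False)

# Read the file back; the dtype argument keeps the leading zero in APINumber
check = pd.read_csv(path, dtype={"APINumber": "str"})
print(check)
```

In Colab, a plain file name such as 'my_export.csv' puts the file in the Files panel; the temporary directory is used here only so the sketch runs anywhere.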

## Show major categories

In [None]:
# now let's look at what TradeNames are used most (top 20)
data.TradeName.value_counts()[:20]

In [None]:
# now let's look at what Ingredients are used most (top 20)
data.IngredientName.value_counts()[:20]

## Graph with interactive plots
Hover over a point to see more information about that record.

Plotting the log-transformed columns, logmass and logperc, can help spread out points that span several orders of magnitude.

In [None]:

#data = data[data.date.dt.year==2021]
data['logmass'] = np.log10(data.calcMass)
data['logperc'] = np.log10(data.PercentHFJob)
print(len(data))
alt.Chart(data=data).mark_point().encode(
    x="date",
    y="logperc",
    color='bgStateName',
    #size='calcMass',
    tooltip = ['APINumber','bgStateName','IngredientName','calcMass','PercentHFJob',
              'TradeName','Purpose','bgOperatorName','bgSupplier'],
).properties(
    width=800,
    height=300
)
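One caution about the log transform in the cell above: np.log10 returns -inf for zero and NaN for negative or missing values, and such rows cannot be placed on the chart. Here is a minimal sketch, on invented values, of keeping only strictly positive masses before transforming. The name plot_data is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data: invented values, including a zero and a missing mass
toy = pd.DataFrame({"calcMass": [1000.0, 0.0, np.nan, 10.0]})

# Keep only strictly positive values; NaN > 0 is False, so missing rows drop out too
plot_data = toy[toy.calcMass > 0].copy()
plot_data["logmass"] = np.log10(plot_data.calcMass)
print(plot_data)
```

The same pattern applies to PercentHFJob before computing logperc.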
