# Midterm Project 
*Matthew Ueckermann*

## Problem to explore: Toxic Chemical Releases

Many manufacturing plants and resource extraction sites release toxic chemicals to the air, water, or land around them. These chemicals affect the lives of the people, plants, and animals surrounding these locations. In some cases, communities have been slowly and inadvertently poisoned by nearby industries. Recently, DuPont's use of PFOA in West Virginia and the resulting court cases inspired documentaries like [The Devil We Know](https://en.wikipedia.org/wiki/The_Devil_We_Know), which details the impact of this chemical on a town in West Virginia. This restarted discussion about the need for transparency about the chemicals released in the United States, since transparency gives communities the right to know what they are exposed to and provides a way to hold companies liable.

A dataset for this already exists, although it did not include PFOA until recently: the [Toxics Release Inventory (TRI)](https://en.wikipedia.org/wiki/Toxics_Release_Inventory), which contains the releases of listed chemicals that manufacturing plants in the U.S. must report to the EPA. It was established in 1986 to bring transparency to the chemical industry, partially as a reaction to the [1984 Bhopal disaster](https://en.wikipedia.org/wiki/Bhopal_disaster), which started one of the first movements toward community right to know. This dataset was introduced to me in one of my previous courses (CHEG613: Energy and the Environment) as one of the most efficient pieces of environmental legislation; however, I have not had the opportunity to explore it.

In chemical manufacturing it is impossible not to release some amount of toxic chemicals (pressure equipment, incinerators, and wastewater treatment can never be perfect), and accidents will also cause unintended releases. However, it is still generally possible to mitigate releases, at least by making incremental improvements over time. I am interested in what trends exist in environmental control, specifically in the chemical industry. These include how the mode of chemical emissions (to air or to water) has changed over time, as well as general chemical emissions by state over time.

I am also interested in whether there are any trends in the types of chemicals emitted by state over time, as not all toxic chemicals are equal. Specifically, dioxins are considered the most toxic man-made compounds [(a daily intake of 2*10^-12 grams/kg body weight is considered safe for humans)](https://en.wikipedia.org/wiki/Dioxins_and_dioxin-like_compounds#Human_toxicity), while other compounds, like persistent bioaccumulative toxic compounds (PBTs, which include PFOA), are considered extremely toxic. Finally, we must also consider conventional carcinogens as chemicals of special concern.

Note that the TRI reporting requirements have changed over time. A breakdown of the major changes and the program's history is given [here](https://www.epa.gov/toxics-release-inventory-tri-program/timeline-toxics-release-inventory-milestones).

## Specific Research Questions

- What are the general trends in toxic chemical emissions over time? 
    - Have chemical emissions decreased since the implementation of the TRI?
    - How does this depend on industry?
    - How does this depend on the type of chemical emitted?
- Where are these chemicals emitted?
    - How has this changed over time?
- What trends exist in toxic chemical emissions by state?
    - Is there any "red state"-"blue state" effect?

How do releases in the TRI correlate with year, state, and county location? Specifically:
- Do chemical plants get better at controlling emissions over time?
- Does the composition of chemicals released by a facility change over time? Or in other words will a facility get better at controlling the emissions of one chemical versus another?
- Have plants shifted from releasing PBTs and dioxins over time?
- Is there any "red state"-"blue state" effect?
- Can you see the impact of environmental regulation, or the loosening of it by administration?


## Justification (Expand on)

Chemical releases, especially of toxic chemicals, are important to manage and minimize. Knowledge of how chemical plants reduce releases from year to year, and of whether there is any geographic variation, could be important in assessing environmental regulations and industrial best practices. Similarly, looking at overall trends is important to ensure that the industry is getting cleaner and trending in a more sustainable direction.

As far as evidence goes, I think it is safe to say that toxic chemical emissions and their prevention are an important problem. One way to demonstrate this is the EPA's continuing effort to reduce these emissions through measures like the [Clean Air Act of 1970](https://www.epa.gov/clean-air-act-overview/progress-cleaning-air-and-improving-peoples-health). This specific research using the TRI is also motivated by interesting research that has already been done with the dataset, including work on correlations between chemical releases and community composition [1], on how involving employees in pollution abatement programs affects emissions [2], and on the impact the TRI had on stock prices after it was first reported [3]. However, all of these studies are relatively old (published in the 1990s), meaning that they miss two decades' worth of trends, which motivates a new look at the TRI.

Works Cited
1. Arora, S. & Cason, T. N. Do Community Characteristics Influence Environmental Outcomes? Evidence from the Toxics Release Inventory. Southern Economic Journal 65, 691–716 (1999).
2. Bunge, J., Cohen-Rosenthal, E. & Ruiz-Quintanilla. Employee participation in pollution reduction: preliminary analysis of the Toxics Release Inventory. Journal of Cleaner Production 4, 9–16 (1996).
3. Konar, S. & Cohen, M. Information As Regulation: The Effect of Community Right to Know Laws on Toxic Emissions. Journal of Environmental Economics and Management 32, 109–124 (1997).


## Data set to be used

As stated I am going to use the TRI which has data from 1987-2019 accessible in csv files [here](https://www.epa.gov/toxics-release-inventory-tri-program/tri-basic-data-files-calendar-years-1987-2019?).

Documentation about the dataset is given in this [pdf](https://www.epa.gov/sites/production/files/2019-08/documents/basic_data_files_documentation_aug_2019_v2.pdf).

## Ethical concerns and other considerations

Some ethical concerns I have about my analysis include:
- As a chemical engineering major entering industry (although at a more sustainable chemical company), I will have my own biases about the industry and will probably not be as critical as someone outside it.
- Chemical companies could use this analysis as evidence that they do better than their peers or others in a geographic area, which may disincentivize improvement.
- I am not an expert on the toxicity of different chemicals. I know that PBTs and dioxins are generally worse than others on the list, but there is not necessarily a consensus on all chemicals on the list. Treating them all the same would be disingenuous, but may be necessary at this level.
- Mentioning a "red state" - "blue state" effect implicitly assumes that the "red states" will allow for more emissions than blue states.

Other considerations/confounding factors:
- The geographical distribution of chemical plants is skewed, i.e. there are a lot of petroleum refineries in Texas, but none in Massachusetts.
- Emissions control technology eventually reaches a level where further improvement is hard (and costly), so expecting plants to improve every year is not realistic.
- Controlling for facility size is required when looking at fugitive air emissions, as they are a function of the amount of pressurized equipment.
- Changing the composition of chemicals released may be more indicative of a change in product, not of a change in the process.
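
One partial way to account for the skewed number of plants per state is to normalize total releases by the number of reporting facilities. The sketch below assumes a small made-up table (the values are illustrative, not TRI data) that reuses the TRI column names; it does not account for facility size, which would need a separate measure.

```python
import pandas as pd

# Made-up records for illustration only; column names follow the TRI basic data files.
toy = pd.DataFrame({
    "2. TRIFD": ["A", "A", "B", "C", "C"],
    "8. ST": ["TX", "TX", "TX", "MA", "MA"],
    "101. TOTAL RELEASES": [100.0, 50.0, 250.0, 30.0, 10.0],
})

by_state = toy.groupby("8. ST").agg(
    total_releases=("101. TOTAL RELEASES", "sum"),
    n_facilities=("2. TRIFD", "nunique"),
)
# Releases per reporting facility: a rough control for how many plants a state has
by_state["releases_per_facility"] = by_state["total_releases"] / by_state["n_facilities"]
print(by_state)
```

A per-facility figure still mixes large and small plants, so it is only a first-order correction.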

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for the plt.* calls in the plotting cells
import time

## Scraping the data

I scraped the data from the CSV files that the EPA publishes for TRI data from 1987-2019 [here](https://www.epa.gov/toxics-release-inventory-tri-program/tri-basic-data-files-calendar-years-1987-2019?). Documentation about the dataset is given in this [pdf](https://www.epa.gov/sites/production/files/2019-08/documents/basic_data_files_documentation_aug_2019_v2.pdf).

In [None]:
# Columns I am interested in (the same list is used for every year, so that
# 2019 also keeps the stack air and water columns used later on)
cols = ["1. YEAR","2. TRIFD","8. ST","15. PARENT CO NAME","20. INDUSTRY SECTOR","34. CHEMICAL","39. CLASSIFICATION","42. CARCINOGEN","45. 5.1 - FUGITIVE AIR","46. 5.2 - STACK AIR","47. 5.3 - WATER","59. ON-SITE RELEASE TOTAL","82. OFF-SITE RELEASE TOTAL","101. TOTAL RELEASES","116. 8.9 - PRODUCTION RATIO"]

# Scrape one file per reporting year, 1987 through 2019
frames = []
for x in range(1987,2020):
    address = 'https://enviro.epa.gov/enviro/efservice/MV_TRI_BASIC_DOWNLOAD/year/=/'+str(x)+'/fname/TRI_'+str(x)+'_US.csv/CSV'
    df_new = pd.read_csv(address,low_memory=False)

    frames.append(df_new[cols])
    del df_new # save space
    time.sleep(2) # pause between requests to the EPA server

# DataFrame.append was removed in pandas 2.0, so combine with pd.concat instead
df = pd.concat(frames, ignore_index=True)
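
The column names in these files carry a numeric prefix (for example `101. TOTAL RELEASES`). As an optional step, the prefixes could be stripped to make later code easier to read. The sketch below is only an illustration on an empty frame; `strip_prefix` is a hypothetical helper, and the rest of this notebook keeps the original names.

```python
import re
import pandas as pd

def strip_prefix(name):
    # Drop a leading "<number>. " such as "101. " from a column name
    return re.sub(r"^\d+\.\s+", "", name)

# Empty frame that only carries a few of the TRI column names
toy = pd.DataFrame(columns=["1. YEAR", "8. ST", "45. 5.1 - FUGITIVE AIR", "101. TOTAL RELEASES"])
renamed = toy.rename(columns=strip_prefix)
print(list(renamed.columns))
```

Only the first prefix is removed, so a form section number such as `5.1` stays in place.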

## Getting Familiar with the TRI: Looking at Industry Dependence 

It is interesting to see the impact of the specific industry on the overall chemicals emitted in the TRI, especially as in 1997 seven new industries were added to the TRI, which added a significant amount of emissions. Without controlling for this, any overall trend would show a massive bump in emissions around this time.

One interesting way to look at this is to see how each industry contributes to the overall emission levels as reported in the TRI:

In [None]:
sumChem_byIndustry_year = df.groupby(["1. YEAR","20. INDUSTRY SECTOR"])["101. TOTAL RELEASES"].sum()*10**-6
sumChem_byIndustry_year.unstack(level=1).plot.bar(stacked=True).legend(loc='center left',bbox_to_anchor=(1.0, 0.5));
plt.xlabel("Year")
plt.ylabel("Total chemicals emitted (MMlb)")
plt.show()

This figure is extremely overwhelming, but it demonstrates the spike in 1998, which you can see is mostly dominated by the addition of coal mining and electric production to the dataset.

Note that MMlb stands for million pounds.
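
One way to control for the added industries is to restrict the trend to sectors that already reported in the first year of the data. The sketch below assumes made-up totals, and the sector labels other than "Chemicals" are placeholders rather than the exact TRI sector names.

```python
import pandas as pd

# Made-up totals for illustration only
toy = pd.DataFrame({
    "1. YEAR": [1987, 1987, 1998, 1998, 1998],
    "20. INDUSTRY SECTOR": ["Chemicals", "Paper", "Chemicals", "Paper", "Electric Utilities"],
    "101. TOTAL RELEASES": [500.0, 200.0, 300.0, 150.0, 900.0],
})

first_year = toy["1. YEAR"].min()
# Sectors that already reported in the first year of the data
original_sectors = set(toy.loc[toy["1. YEAR"] == first_year, "20. INDUSTRY SECTOR"])

# Keep only those sectors, then total the releases per year
consistent = toy[toy["20. INDUSTRY SECTOR"].isin(original_sectors)]
trend = consistent.groupby("1. YEAR")["101. TOTAL RELEASES"].sum()
print(trend)
```

In this toy example the sector added in 1998 is excluded, so the restricted trend falls instead of spiking.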

## Focusing on the Chemical Industry

From the figure above, we can make out that the chemical industry has traditionally been a large contributor to the amount of toxic materials released. As this is the industry that I am the most interested in, and the one which motivated the creation of the TRI, I want to look at it in more depth.

Starting with the same plot of total emissions:

In [None]:
df_chemical = df[df["20. INDUSTRY SECTOR"]=="Chemicals"]

In [None]:
allChem = df_chemical.groupby(["1. YEAR"])["101. TOTAL RELEASES"].sum()*10**-6
allChem.plot(kind='bar')
plt.xlabel("Year")
plt.ylabel("Total chemicals emitted (MMlb)")
plt.show()

Here you can see that the total amount of chemicals emitted has decreased dramatically since the introduction of the TRI.
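
The size of such a decrease can be put into a single number by comparing the first and last years. The sketch below assumes a made-up series in place of the real yearly totals, so the printed percentage is illustrative only.

```python
import pandas as pd

# Made-up yearly totals (MMlb) for illustration only, indexed by year
all_chem_toy = pd.Series({1987: 1000.0, 2000: 600.0, 2019: 250.0})

first, last = all_chem_toy.iloc[0], all_chem_toy.iloc[-1]
# Relative change between the first and last reporting years, in percent
pct_change = (last - first) / first * 100
print(f"Change from {all_chem_toy.index[0]} to {all_chem_toy.index[-1]}: {pct_change:.1f}%")
```

The same three lines could be applied to the real per-year series, as long as it is sorted by year.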

**Add more analysis**

What about the more nasty chemicals?

In [None]:
sumCarcinogensChem_year = df_chemical[df_chemical["42. CARCINOGEN"]=="YES"].groupby(["1. YEAR"])["101. TOTAL RELEASES"].sum()*10**-6
sumCarcinogensChem_year.plot(kind='bar') 
plt.xlabel("Year")
plt.ylabel("Total carcinogenic chemicals emitted (MMlb)")
plt.show()

In [None]:
sumDioxChem_year = df_chemical[df_chemical["39. CLASSIFICATION"]=="Dioxin"].groupby(["1. YEAR"])["101. TOTAL RELEASES"].sum()*10**-3
sumDioxChem_year.plot(kind='bar')
plt.xlabel("Year")
plt.ylabel("Total dioxins emitted (kg)")
plt.show()

In [None]:
sumPBTChem_year = df_chemical[df_chemical["39. CLASSIFICATION"]=="PBT"].groupby(["1. YEAR"])["101. TOTAL RELEASES"].sum()*10**-6
sumPBTChem_year.plot(kind='bar') # the plot call was missing, so nothing was drawn
plt.xlabel("Year")
plt.ylabel("Total PBTs emitted (MMlb)")
plt.show()

Here the story is less straightforward. We can see that the total amount of carcinogenic compounds emitted has decreased over time, but the trends for dioxins and PBTs are harder to read. On the dioxin side, a methodology change and extra restrictions in 2008 explain the dip in that year. For PBTs, the increase in 1999 is attributed to the addition of 7 new PBTs, and in 2001 lead was added as a PBT. These changes to the TRI artificially change the trends.
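
One way to separate real changes from list changes is to compare the raw trend with a trend over a fixed basket of chemicals that appear in every year. The sketch below assumes made-up chemical names and values. It also treats "not reported in a year" as "not listed", which is an assumption, since a listed chemical could simply have no releases that year.

```python
import pandas as pd

# Made-up records for illustration only; "Chem C" is first reported in 1999
toy = pd.DataFrame({
    "1. YEAR": [1998, 1998, 1999, 1999, 1999],
    "34. CHEMICAL": ["Chem A", "Chem B", "Chem A", "Chem B", "Chem C"],
    "101. TOTAL RELEASES": [10.0, 5.0, 8.0, 4.0, 20.0],
})

n_years = toy["1. YEAR"].nunique()
# Number of distinct years in which each chemical appears
years_per_chem = toy.groupby("34. CHEMICAL")["1. YEAR"].nunique()
always_listed = years_per_chem[years_per_chem == n_years].index

fixed_basket = toy[toy["34. CHEMICAL"].isin(always_listed)]
raw_trend = toy.groupby("1. YEAR")["101. TOTAL RELEASES"].sum()
fixed_trend = fixed_basket.groupby("1. YEAR")["101. TOTAL RELEASES"].sum()
print(pd.concat({"raw": raw_trend, "fixed basket": fixed_trend}, axis=1))
```

In the toy data the raw total rises in 1999 only because a new chemical appears, while the fixed basket shows a decrease.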

## Where do these emissions go?

We can see that the majority of a plant's emissions are released onsite, not transferred offsite:

In [None]:
onsite = df_chemical.groupby(["1. YEAR"])["59. ON-SITE RELEASE TOTAL"].sum()*10**-6
offsite = df_chemical.groupby(["1. YEAR"])["82. OFF-SITE RELEASE TOTAL"].sum()*10**-6
df_emissionLocation = pd.concat({"Onsite":onsite,"Offsite":offsite},axis=1)
df_emissionLocation.plot.bar(y=['Onsite', "Offsite"], stacked=True)
plt.xlabel("Year")
plt.ylabel("Total amount of chemicals emitted (MMlb)")
plt.title("Onsite to Offsite Emissions in the Chemical Industry")
plt.show()

This indicates that the majority of the chemicals are emitted into the air, water, or land surrounding a facility rather than sent to external facilities to be disposed of or recycled.

We can then see the way in which chemicals are typically emitted:
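
The onsite/offsite split can also be expressed as a yearly share, which is easier to compare across years than stacked totals. The sketch below assumes a made-up two-column table with the same `Onsite` and `Offsite` labels used in the plot.

```python
import pandas as pd

# Made-up yearly totals (MMlb) for illustration only
toy = pd.DataFrame(
    {"Onsite": [800.0, 300.0], "Offsite": [200.0, 100.0]},
    index=pd.Index([1987, 2019], name="Year"),
)

# Fraction of each year's total releases that stay onsite
onsite_share = toy["Onsite"] / toy.sum(axis=1)
print(onsite_share)
```

A share near 1 means almost everything is released at the facility itself.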

In [None]:
fugAir = df_chemical.groupby(["1. YEAR"])["45. 5.1 - FUGITIVE AIR"].sum()*10**-6
stackAir = df_chemical.groupby(["1. YEAR"])["46. 5.2 - STACK AIR"].sum()*10**-6
water = df_chemical.groupby(["1. YEAR"])["47. 5.3 - WATER"].sum()*10**-6
df_emissionType = pd.concat({"Fugitive Air":fugAir,"Stack Air":stackAir,"Water":water},axis=1)
# One stacked bar per year showing all three release routes
df_emissionType.plot.bar(y=['Fugitive Air', "Stack Air","Water"], stacked=True)
plt.xlabel("Year")
plt.ylabel("Total amount of chemicals emitted (MMlb)")
plt.title("Avenue of Emission")
plt.show()

## Which states have the most/least chemical emissions?
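
A possible starting point for this question is to total releases by state and sort the result. The sketch below assumes made-up values and is not a result from the TRI data.

```python
import pandas as pd

# Made-up records for illustration only
toy = pd.DataFrame({
    "8. ST": ["TX", "LA", "TX", "MA", "LA"],
    "101. TOTAL RELEASES": [300.0, 200.0, 100.0, 5.0, 50.0],
})

# Total releases per state, largest first
state_totals = toy.groupby("8. ST")["101. TOTAL RELEASES"].sum().sort_values(ascending=False)
print(state_totals.head(2))  # states with the most releases
print(state_totals.tail(1))  # state with the least releases
```

Because the number of plants differs so much between states, raw totals like these would still need some normalization before states are compared.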

## How do trends in emissions differ between states?
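
One simple way to compare trends is to fit a straight line to each state's yearly totals and compare the slopes. The sketch below assumes made-up values and a linear trend, which is a simplification, and it assumes every state has a value for every year.

```python
import numpy as np
import pandas as pd

# Made-up records for illustration only
toy = pd.DataFrame({
    "1. YEAR": [2000, 2001, 2002, 2000, 2001, 2002],
    "8. ST": ["TX", "TX", "TX", "MA", "MA", "MA"],
    "101. TOTAL RELEASES": [100.0, 90.0, 80.0, 10.0, 10.0, 10.0],
})

# Rows are years, columns are states
yearly = toy.groupby(["8. ST", "1. YEAR"])["101. TOTAL RELEASES"].sum().unstack(level=0)

# Center the years so the fit is numerically well behaved
years = yearly.index.to_numpy(dtype=float)
years_centered = years - years.mean()

# Slope of a least-squares line per state, in release units per year
slopes = {
    state: np.polyfit(years_centered, yearly[state].to_numpy(), 1)[0]
    for state in yearly.columns
}
print(slopes)
```

A negative slope indicates falling releases; states with missing years would need to be filled or dropped before fitting.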