# Exploration: Unregulated Contaminant Monitoring Rule (UCMR)

This data comes from the EPA:
<br>Link: https://www.epa.gov/dwucmr/occurrence-data-unregulated-contaminant-monitoring-rule#12

<br>Information on Data: The EPA uses the Unregulated Contaminant Monitoring Rule to collect data on contaminants that are suspected to be present in drinking water and do not have health-based standards set under the Safe Drinking Water Act (SDWA). 

<br>For each monitoring cycle, the EPA reviews contaminants that have been evaluated through existing prioritization processes, including previous UCMR contaminants and the Contaminant Candidate List (CCL). Additional contaminants may be identified based on current research. 

<br>Chemicals that are not registered for use in the US, do not have an analytical reference standard, or do not have an analytical method ready for use are generally not considered. 

| Known Carcinogens | Probable Carcinogens | Possible Carcinogens |
|------|------|------|
|1,3-butadiene (13-15)|Diazinon (01-05)|2,4,6-trichlorophenol (01-05)|
|Strontium-90 (13-15)|N-nitroso-diethylamine (NDEA) (08-10)|N-nitroso-di-n-butylamine (NDBA) (08-10)|
|Chromium (VI) (13-15)|N-nitroso-dimethylamine (NDMA) (08-10)|N-nitroso-di-n-propylamine (NDPA) (08-10)|
|alpha-hexachlorocyclohexane (18-20)|1,2,3-trichloropropane (13-15)|N-nitroso-methylethylamine (NMEA) (08-10)|
|O-toluidine (18-20)|Methyl Chloride (Chloromethane) (18-20)|N-nitroso-pyrrolidine (NPYR) (08-10)|
|||1,1-Dichloroethane (13-15)|
|||1,4-dioxane (13-15)|
|||Vanadium (13-15)|
|||Molybdenum (13-15)|
|||Butylated hydroxyanisole (18-20)|
|||Quinoline (18-20)|
|||Microcystin-LR (18-20)|

Data are downloaded as .xlsx sheets for each UCMR monitoring cycle. Known carcinogens are selected for exploration first. 

## Let's explore the known carcinogens: 

In [80]:
#Library Imports: 

#Basic py: 
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime
import glob
import random
import base64
from PIL import Image
from io import BytesIO
from IPython.display import HTML
import io
import pdfkit

#Geo
import geopandas as gpd
import fiona
from shapely.geometry import Point
import descartes

### UCMR3 2013-2015

In [81]:
path = 'C:\\Users\\u0890227\\Desktop\\UCMR\\'

#Data From MA - WY
UCMR3_MA_WY_df = pd.read_csv(path + "ucmr-3-occurrence-data-by-state\\UCMR3_All_MA_WY.txt", 
                       sep="\t", header=0,encoding = "ISO-8859-1")
UCMR3_AK_LA_df = pd.read_csv(path + "ucmr-3-occurrence-data-by-state\\UCMR3_All_Tribes_AK_LA.txt", 
                       sep="\t", header=0,encoding = "ISO-8859-1")
UCMR3_DT_df = pd.read_csv(path + "ucmr-3-occurrence-data-by-state\\UCMR3_DT.txt", 
                       sep="\t", header=0,encoding = "ISO-8859-1")
UCMR3_ZIPS_df = pd.read_csv(path + "ucmr-3-occurrence-data-by-state\\UCMR3_ZipCodes.txt", 
                       sep="\t", header=0,encoding = "ISO-8859-1")

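The four reads above share the same arguments, so they could be folded into one small helper. The following is a minimal sketch, not the notebook's own code: the helper name `load_ucmr_table` is hypothetical, and `low_memory=False` is an optional setting that is often used when pandas reports mixed column types on large files (it trades memory for consistent type inference). The demo reads a tiny made-up file rather than the EPA data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_ucmr_table(file_path):
    """Read one tab-delimited UCMR text file (hypothetical helper)."""
    return pd.read_csv(
        file_path,
        sep="\t",
        header=0,
        encoding="ISO-8859-1",
        low_memory=False,  # infer each column's dtype from the whole file
    )


# Demo on a tiny made-up file, not the EPA data.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "demo.txt"
    demo_file.write_text(
        "PWSID\tState\tContaminant\nUTAH00001\tUT\tchromium-6\n",
        encoding="ISO-8859-1",
    )
    demo_df = load_ucmr_table(demo_file)

print(demo_df.shape)
```

With a helper like this, each of the reads above would reduce to a single call on the corresponding file path.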


In [82]:
UCMR3_df = pd.concat([UCMR3_MA_WY_df,UCMR3_AK_LA_df],axis=0)
UCMR3_df = UCMR3_df.sort_values(by = ["State"])  #sort_values returns a new frame, so assign it back
UT_UCMR3_df = UCMR3_df[UCMR3_df.loc[:,"State"] == 'UT']
print("There are a total of %d values in the Utah portion of the dataset" %UT_UCMR3_df.shape[0])

There are a total of 15553 values in the Utah portion of the dataset


In [170]:
UCMR3_ZIPS_df = UCMR3_ZIPS_df.dropna()
UCMR3_ZIPS_df.loc[:,'ZIPCODE'] = UCMR3_ZIPS_df.loc[:,'ZIPCODE'].str.strip('-')

(11700, 2)
(15553, 22)
(38039, 23)


In [231]:
#Repeats in the data for certain Zipcodes. Need just the unique values
UCMR3_ZIPS_df = UCMR3_ZIPS_df.sort_values(by = ["PWSID"])
UT_ZIPS = UCMR3_ZIPS_df.iloc[10842:10950]
UT_ZIPS = UT_ZIPS.drop_duplicates(subset = ["PWSID"])
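The positional slice `iloc[10842:10950]` depends on the exact row order of this particular file, so it would silently select the wrong rows if the file changed. One possible alternative, sketched here on made-up data, is to keep only the ZIP code rows whose `PWSID` appears in the Utah measurements. The frame names `zips_df` and `ut_measurements_df` are stand-ins for `UCMR3_ZIPS_df` and `UT_UCMR3_df`, and the result is not guaranteed to match the slice above row for row.

```python
import pandas as pd

# Made-up stand-ins for UCMR3_ZIPS_df and UT_UCMR3_df.
zips_df = pd.DataFrame({
    "PWSID": ["UTAH00001", "UTAH00001", "UTAH00002", "WY0000001"],
    "ZIPCODE": ["84043", "84003", "84107", "82001"],
})
ut_measurements_df = pd.DataFrame({
    "PWSID": ["UTAH00001", "UTAH00002", "UTAH00002"],
    "Contaminant": ["chromium-6", "chromium-6", "strontium"],
})

# Keep only ZIP rows whose PWSID occurs in the Utah measurements.
ut_zips = zips_df[zips_df["PWSID"].isin(ut_measurements_df["PWSID"])]

# Keep one ZIP code per water system (the first one listed).
ut_zips = ut_zips.drop_duplicates(subset=["PWSID"])
print(ut_zips)
```

Note that `drop_duplicates` keeps only the first ZIP code listed for each water system, so a system serving several ZIP codes is attributed to just one of them.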

In [232]:
#Join the Zipcodes for each measurement: 
UT_Zips_UCMR3_df = pd.merge(UT_UCMR3_df,UT_ZIPS, on = "PWSID",how = 'inner')
UT_Zips_UCMR3_df["Contaminant"].value_counts()

chlorate                     826
chromium-6                   824
cobalt                       819
molybdenum                   819
strontium                    819
vanadium                     819
chromium                     815
1,1-dichloroethane           629
PFHxS                        629
PFOS                         629
PFBS                         629
1,2,3-trichloropropane       629
bromomethane                 629
PFHpA                        629
Halon 1011                   629
PFOA                         629
HCFC-22                      629
1,3-butadiene                629
chloromethane                629
PFNA                         629
1,4-dioxane                  627
17-alpha-ethynylestradiol    107
estriol                      107
testosterone                 107
17-beta-estradiol            107
equilin                      107
estrone                      107
4-androstene-3,17-dione      107
germanium                     61
manganese                     61
tellurium 

For the chemicals of interest, the data counts are as follows:

|Chemical|Total Dataset Counts|
|------|------|
|chromium-6|824|
|strontium|819|
|1,3-butadiene|629|
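Rather than copying counts out of the full listing by hand, the same table could be pulled directly from the `value_counts()` result. This is a minimal sketch on made-up data; the list name `chemicals_of_interest` is an assumption, and in the notebook the Series would be `UT_Zips_UCMR3_df["Contaminant"]`.

```python
import pandas as pd

# Made-up stand-in for UT_Zips_UCMR3_df["Contaminant"].
contaminants = pd.Series(
    ["chromium-6", "chromium-6", "strontium", "cobalt", "1,3-butadiene"],
    name="Contaminant",
)

chemicals_of_interest = ["chromium-6", "strontium", "1,3-butadiene"]

# reindex keeps only the requested chemicals, in the requested order;
# fill_value=0 makes a chemical with no rows show up as 0 instead of NaN.
counts = contaminants.value_counts().reindex(chemicals_of_interest, fill_value=0)
print(counts)
```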

In [241]:
#What is the diversity of locations? 
#Chromium-6
chrom_df = UT_Zips_UCMR3_df[UT_Zips_UCMR3_df.loc[:,"Contaminant"] == 'chromium-6']
chrom_locs = chrom_df.loc[:,'ZIPCODE'].value_counts()
print(chrom_locs.iloc[:10])
print("Decent spread over the SLC basin for measurements. Not many are in rural areas though")

84043    38
84107    37
84084    37
84070    34
84065    32
84014    29
84015    28
84003    26
84404    26
84663    22
Name: ZIPCODE, dtype: int64
Decent spread over the SLC basin for measurements. Not many are in rural areas though


In [247]:
#Strontium
strom_df = UT_Zips_UCMR3_df[UT_Zips_UCMR3_df.loc[:,"Contaminant"] == 'strontium']
strom_locs = strom_df.loc[:,'ZIPCODE'].value_counts()
print(strom_locs.iloc[:10])
print("Decent spread over the SLC basin for measurements. Not many are in rural areas though. Identical spread to chromium")

84043    38
84107    37
84084    37
84070    34
84065    32
84014    29
84015    28
84003    26
84404    26
84663    22
Name: ZIPCODE, dtype: int64
Decent spread over the SLC basin for measurements. Not many are in rural areas though. Identical spread to chromium


In [249]:
#1,3-Butadiene
buta_df = UT_Zips_UCMR3_df[UT_Zips_UCMR3_df.loc[:,"Contaminant"] == '1,3-butadiene']
buta_locs= buta_df.loc[:,'ZIPCODE'].value_counts()
print(buta_locs.iloc[:10])
print("Highest concentration of measurements is in SLC proper, although spread is over the entire basin.")

84107    35
84084    33
84070    30
84043    29
84065    24
84003    22
84014    21
84663    20
84015    18
84404    17
Name: ZIPCODE, dtype: int64
Highest concentration of measurements is in SLC proper, although spread is over the entire basin.
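The three cells above repeat the same filter-then-count pattern, which is how the butadiene cell ended up carrying a copied comment. A small helper is one way to avoid that. This is a sketch on made-up data, and the function name `top_zipcodes` is hypothetical.

```python
import pandas as pd


def top_zipcodes(df, contaminant, n=10):
    """Return the n ZIP codes with the most rows for one contaminant."""
    subset = df[df["Contaminant"] == contaminant]
    return subset["ZIPCODE"].value_counts().head(n)


# Made-up stand-in for UT_Zips_UCMR3_df.
demo_df = pd.DataFrame({
    "Contaminant": ["chromium-6", "chromium-6", "chromium-6", "strontium"],
    "ZIPCODE": ["84043", "84043", "84107", "84043"],
})

chrom_top = top_zipcodes(demo_df, "chromium-6", n=2)
print(chrom_top)
```

In the notebook this would be called once per chemical, for example `top_zipcodes(UT_Zips_UCMR3_df, 'strontium')`.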


### UCMR 4 (2018-2020)

In [88]:
#Import the data: 
UCMR4_MA_WY_df = pd.read_csv(path + "ucmr_4_occurrence_data_by_state\\UCMR4_All_MA_WY.txt", 
                       sep="\t", header=0,encoding = "ISO-8859-1")
UCMR4_AK_LA_df = pd.read_csv(path + "ucmr_4_occurrence_data_by_state\\UCMR4_All_Tribes_AK_LA.txt", 
                       sep="\t", header=0,encoding = "ISO-8859-1")
UCMR4_Cyanotoxin_AddtlDataElem_df = pd.read_csv(path + "ucmr_4_occurrence_data_by_state\\UCMR4_Cyanotoxin_AddtlDataElem.txt", 
                       sep="\t", header=0,encoding = "ISO-8859-1")
UCMR4_HAA_AddtlDataElem_df = pd.read_csv(path + "ucmr_4_occurrence_data_by_state\\UCMR4_HAA_AddtlDataElem.txt", 
                       sep="\t", header=0,encoding = "ISO-8859-1")
UCMR4_ZIPS_df = pd.read_csv(path + "ucmr_4_occurrence_data_by_state\\UCMR4_ZipCodes.txt", 
                       sep="\t", header=0,encoding = "ISO-8859-1")

In [90]:
UCMR4_df = pd.concat([UCMR4_MA_WY_df,UCMR4_AK_LA_df],axis=0)
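UCMR4_df = UCMR4_df.sort_values(by = ["State"])  #sort_values returns a new frame, so assign it back
UT_UCMR4_df = UCMR4_df[UCMR4_df.loc[:,"State"] == 'UT']
#Note: UCMR4_df.shape[0] counts all states; UT_UCMR4_df.shape[0] gives the Utah-only count
print("There are a total of %d values in the full UCMR 4 dataset" %UCMR4_df.shape[0])

There are a total of 452650 values in the full UCMR 4 dataset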


In [97]:
#Join the Zipcodes for each measurement: 
UT_Zips_UCMR4_df = pd.merge(UCMR4_ZIPS_df,UT_UCMR4_df, on = "PWSID",how = 'inner')
UT_Zips_UCMR4_df.head()

Unnamed: 0,PWSID,ZIPCODE,PWSName,Size,FacilityID,FacilityName,FacilityWaterType,SamplePointID,SamplePointName,SamplePointType,...,SampleID,Contaminant,MRL,MethodID,AnalyticalResultsSign,AnalyticalResultValue(µg/L),SampleEventCode,MonitoringRequirement,Region,State
0,UTAH02018,84324,Mantua Town Water,S,20001,Spring Chlorinator,GW,EP1,Sample Tap - Spring Chlorinator,EP,...,100370P,1-butanol,2.0,EPA 541,<,,SEA1,AM,8,UT
1,UTAH02018,84324,Mantua Town Water,S,20001,Spring Chlorinator,GW,EP1,Sample Tap - Spring Chlorinator,EP,...,100370P,2-methoxyethanol,0.4,EPA 541,<,,SEA1,AM,8,UT
2,UTAH02018,84324,Mantua Town Water,S,20001,Spring Chlorinator,GW,EP1,Sample Tap - Spring Chlorinator,EP,...,100370P,2-propen-1-ol,0.5,EPA 541,<,,SEA1,AM,8,UT
3,UTAH02018,84324,Mantua Town Water,S,20001,Spring Chlorinator,GW,EP1,Sample Tap - Spring Chlorinator,EP,...,100370P,alpha-hexachlorocyclohexane,0.01,EPA 525.3,<,,SEA1,AM,8,UT
4,UTAH02018,84324,Mantua Town Water,S,20001,Spring Chlorinator,GW,EP1,Sample Tap - Spring Chlorinator,EP,...,100370P,butylated hydroxyanisole,0.03,EPA 530,<,,SEA1,AM,8,UT
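Unlike the UCMR 3 join, this merge uses `UCMR4_ZIPS_df` without first dropping repeated `PWSID` values. If a water system is listed under several ZIP codes (an assumption here, based on the repeats noted for UCMR 3), each of its measurements is repeated once per ZIP code and the counts are inflated. The sketch below, on made-up data, shows how the `validate` argument of `pd.merge` can flag this and how deduplicating first avoids it.

```python
import pandas as pd
from pandas.errors import MergeError

# Made-up stand-ins for UCMR4_ZIPS_df and UT_UCMR4_df.
zips_df = pd.DataFrame({
    "PWSID": ["UTAH00001", "UTAH00001", "UTAH00002"],
    "ZIPCODE": ["84121", "84123", "84107"],
})
measurements_df = pd.DataFrame({
    "PWSID": ["UTAH00001", "UTAH00002"],
    "Contaminant": ["o-toluidine", "o-toluidine"],
})

# Unchecked merge: the UTAH00001 measurement appears once per ZIP code.
unchecked = pd.merge(zips_df, measurements_df, on="PWSID", how="inner")

# validate="many_to_one" raises MergeError if PWSID repeats on the right side.
try:
    pd.merge(measurements_df, zips_df, on="PWSID", how="inner",
             validate="many_to_one")
    duplicates_detected = False
except MergeError:
    duplicates_detected = True

# Deduplicating the ZIP table first keeps one row per measurement.
unique_zips = zips_df.drop_duplicates(subset=["PWSID"])
checked = pd.merge(measurements_df, unique_zips, on="PWSID", how="inner",
                   validate="many_to_one")

print(len(unchecked), duplicates_detected, len(checked))
```

Comparing the row count before and after the join is a quick way to tell whether the per-chemical counts that follow reflect measurements or measurement-ZIP pairs.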


In [98]:
#Counts per each chemical: 
UT_Zips_UCMR4_df.loc[:,"Contaminant"].value_counts()

germanium                      572
manganese                      572
2-methoxyethanol               569
1-butanol                      569
2-propen-1-ol                  569
butylated hydroxyanisole       546
o-toluidine                    546
quinoline                      546
tribufos                       535
chlorpyrifos                   535
oxyfluorfen                    535
profenofos                     535
tebuconazole                   535
ethoprop                       535
alpha-hexachlorocyclohexane    535
total permethrin               535
dimethipin                     535
anatoxin-a                     487
cylindrospermopsin             487
total microcystin              486
HAA9                           447
HAA6Br                         447
HAA5                           447
Name: Contaminant, dtype: int64

**Known** Carcinogen data counts from UCMR 4:

|Chemical|Total Dataset Counts|
|------|------|
|o-toluidine|546|
|alpha-hexachlorocyclohexane |535|

In [122]:
#What is the diversity of locations? 
#O-Toluidine
o_tol_df = UT_Zips_UCMR4_df[UT_Zips_UCMR4_df.loc[:,"Contaminant"] == 'o-toluidine']
o_tol_locs = o_tol_df.loc[:,'ZIPCODE'].value_counts()
print(o_tol_locs.iloc[:5])
print("The great majority of these measurements are coming from the east bench of SLC")

84121    35
84123    35
84107    35
84106    21
84117    21
Name: ZIPCODE, dtype: int64
The great majority of these measurements are coming from the east bench of SLC


In [125]:
a_hex_df = UT_Zips_UCMR4_df[UT_Zips_UCMR4_df.loc[:,"Contaminant"] == 'alpha-hexachlorocyclohexane']
a_hex_df_locs = a_hex_df.loc[:,'ZIPCODE'].value_counts()
print(a_hex_df_locs.iloc[:5])
print("The great majority of these measurements are coming from the east bench of SLC (Cottonwood and Murray)")

84123    31
84121    31
84107    31
84043    20
84060    20
Name: ZIPCODE, dtype: int64
The great majority of these measurements are coming from the east bench of SLC (Cottonwood and Murray)


## Summary of Work: 

Herein, I perform an exploratory analysis of the carcinogen measurements collected under the Unregulated Contaminant Monitoring Rule. Every ~5 years, a list of chemicals that are not regulated under the SDWA is monitored in public water supplies for roughly two years. This list changes with each UCMR cycle, so measurements for any one chemical cover only a short time span. 

<br>Only 5 known carcinogens have measurements: 1,3-butadiene, Strontium-90, Chromium (VI), alpha-hexachlorocyclohexane, and o-toluidine. Each carcinogen contains ~700-800 measu