# Riverkeeper Data: Evaluating Enterococcus in NY Rivers

For this homework we used both Pandas and Bokeh to analyse the data. The data itself was relatively clean, but some non-numeric entries had to be converted. Four distinct values in the EnteroCount column carried a > or < prefix: [">2420", ">24196", "<1", "<10"]. These were replaced with arbitrarily chosen stand-in numbers, [2500, 25000, 0, 5] respectively, so that the column could be handled as integers.
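The substitution described above can be sketched on a small toy series. The sample readings and the names `raw`, `stand_ins`, `censored`, and `counts` are illustrative assumptions, not part of the Riverkeeper file; the sketch also keeps a flag marking which readings were censored, which the notebook itself does not do.

```python
import pandas as pd

# Toy readings (made up), including the four censored forms listed above.
raw = pd.Series(["12", ">2420", "<1", ">24196", "<10", "340"])

# Stand-in values chosen in this notebook for each censored string.
stand_ins = {">2420": 2500, ">24196": 25000, "<1": 0, "<10": 5}

# Remember which rows were censored before overwriting them.
censored = raw.str[0].isin([">", "<"])

# Swap in the stand-ins, then convert the whole column to numbers.
counts = pd.to_numeric(raw.map(lambda v: stand_ins.get(v, v)))

print(pd.DataFrame({"raw": raw, "count": counts, "censored": censored}))
```

Keeping the `censored` flag makes it possible to exclude or highlight the stand-in values later without re-reading the raw file.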

## Question 1 

In [149]:
import pandas as pd
import numpy as np 

#Data Upload
url = "https://raw.githubusercontent.com/jlaurito/CUNY_IS608/master/lecture4/data/riverkeeper_data_2013.csv"
dat_raw = pd.read_csv(url, index_col = 0)

# Data cleaning (replacing the greater-than and less-than values)
rep_val = [">2420", ">24196","<1","<10"]
new_val = [2500,25000,0,5]
dat_raw["EnteroCount"] = dat_raw["EnteroCount"].replace(rep_val,new_val)
dat_raw["EnteroCount"] = dat_raw["EnteroCount"].astype(int)
dat_raw['Date'] = pd.to_datetime(dat_raw['Date'])

# Create some datasets for displaying values
dat_ent = dat_raw.drop(['FourDayRainTotal','SampleCount'], axis=1)
# numeric_only=True keeps the Date column out of the per-site average
dat_avg = dat_ent.groupby(dat_ent.index).mean(numeric_only=True)


For Question 1, we identified the best and worst places to swim by averaging the Entero count over all samples at each site, since there were many data points per location, and then taking the 10 lowest and 10 highest averages. This was used as a starting point for our analysis. The results are displayed below:

In [150]:
bot_10 = dat_avg.sort_values('EnteroCount', ascending=False).head(10)
bot_10

Site,EnteroCount
Gowanus Canal,4314.540541
Newtown Creek- Metropolitan Ave. Bridge,3037.614035
Tarrytown Marina,2264.481481
Saw Mill River,1471.54
Upper Sparkill Creek,1315.515152
Newtown Creek- Dutch Kills,1231.807018
Kingsland Pt. Park- Pocantico River,924.97619
Orangetown STP Outfall,867.596491
Mohawk River at Waterford,625.628571
Kingston STP Outfall,487.125


In [151]:
top_10 = dat_avg.sort_values('EnteroCount', ascending=True).head(10)
top_10

Site,EnteroCount
Poughkeepsie Drinking Water Intake,8.210526
Croton Point Beach,13.0625
Stony Point mid-channel,15.068182
Haverstraw Bay mid-channel,16.104167
Little Stony Point,17.394737
Poughkeepsie Launch Ramp,17.675676
TZ Bridge mid-channel,18.45614
Yonkers mid-channel,22.326923
Cold Spring Harbor,22.514286
Irvington Beach,26.027778


In [152]:
# Taking the index (site names) for the top and bottom 10 sites
top_10_ind = top_10.index.tolist()
bot_10_ind = bot_10.index.tolist()

# Add a Year column, then select the rows for the bottom and top 10 sites
dat_ent['Year'] = dat_ent['Date'].dt.year
# reset_index() turns the Site index into a regular column for plotting
dat_yearly_bot = dat_ent.loc[bot_10_ind].reset_index()

dat_yearly_top = dat_ent.loc[top_10_ind].reset_index()
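A per-year average for each site is a natural next step from the Year column created above. The following is a minimal sketch on toy data, assuming the same column names (`Site`, `Date`, `EnteroCount`); the values and the names `toy` and `yearly` are hypothetical and not taken from the notebook.

```python
import pandas as pd

# Toy samples (made up) with the same column names as the notebook's data.
toy = pd.DataFrame(
    {
        "Site": ["A", "A", "A", "B", "B"],
        "Date": pd.to_datetime(
            ["2012-06-01", "2012-07-01", "2013-06-01", "2012-06-01", "2013-06-01"]
        ),
        "EnteroCount": [10, 30, 50, 4, 8],
    }
)
toy["Year"] = toy["Date"].dt.year

# One row per site, one column per year, each cell the mean count for that year.
yearly = toy.pivot_table(
    index="Site", columns="Year", values="EnteroCount", aggfunc="mean"
)
print(yearly)
```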


In this first graph we wanted to show the outliers, because the sites with an Entero count greater than 24196 need some explanation. Because we set these values to 25000 at the beginning, we can see how significant these outliers are. First, they greatly increase the average for these specific sites, and almost all of the bottom 10 had this astronomically high reading at some point. Arguably, any place with such a high reading should automatically be a no-swim zone, so it would be accurate to say that these graphs  

In [153]:
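# bokeh.charts was removed from later Bokeh releases (it was split out as bkcharts),
# so this cell needs an older Bokeh installation.
from bokeh.charts import BoxPlot, Scatter, output_notebook, show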

p = BoxPlot(dat_yearly_bot, values='EnteroCount', label="Site", color="Site",
            title="Worst Places to Swim Based on Entero Count")

output_notebook()

show(p)
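To illustrate how much a single capped reading can move a site's average, here is a small sketch with made-up readings. The numbers and variable names are assumptions for illustration only; the value 25000 is the stand-in this notebook uses for ">24196".

```python
import pandas as pd

# Hypothetical readings for one site: four ordinary samples and one capped value.
readings = pd.Series([20, 35, 50, 80, 25000])

mean_with_cap = readings.mean()                       # pulled up by the capped value
median_with_cap = readings.median()                   # barely affected by it
mean_without_cap = readings[readings < 25000].mean()  # average of the ordinary samples

print(mean_with_cap, median_with_cap, mean_without_cap)
```

The median stays close to the typical reading, which is one reason the box plots are easier to compare once the outliers are hidden.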

After reviewing the outlier data, we decided to hide the outliers so that the bottom 10 and top 10 can be compared, as shown below:

In [154]:
p = BoxPlot(dat_yearly_bot, values='EnteroCount', label="Site", color="Site",
            outliers=False, title="Worst Places to Swim Based on Entero Count")

output_notebook()

show(p)

In [155]:
p = BoxPlot(dat_yearly_top, values='EnteroCount', label="Site", color="Site",
            outliers=False, title="Best Places to Swim Based on Entero Count")

output_notebook()

show(p)

## Question 2 

The next area we wish to examine is the time frame between tests. First, we want to understand which sites were tested most, using the sample count:

In [156]:
# SampleCount appears to be the same on every row of a site, so its mean returns the per-site count
dat_raw.groupby(dat_raw.index).mean(numeric_only=True).sort_values('SampleCount', ascending=False).head(5)

Site,EnteroCount,FourDayRainTotal,SampleCount
Piermont Pier,481.57754,0.53369,187
Upper Sparkill Creek,1315.515152,0.526061,165
125th St. Pier,178.787879,0.771212,66
Nyack Launch Ramp,103.803279,0.506557,61
Newtown Creek- Dutch Kills,1231.807018,0.852632,57


In [157]:
dat_raw.groupby(dat_raw.index).mean(numeric_only=True).sort_values('SampleCount', ascending=True).head(5)

Site,EnteroCount,FourDayRainTotal,SampleCount
Tarrytown Marina,2264.481481,0.814815,27
Cold Spring Harbor,22.514286,0.305714,35
Hudson above Mohawk River,228.2,0.631429,35
Island Creek/Normans Kill,438.685714,0.642857,35
Marlboro Landing,87.428571,0.291429,35


As the tables show, every site except one was sampled at least 35 times (Tarrytown Marina, with 27 samples, is the only exception). Piermont Pier and Upper Sparkill Creek were sampled far more often than any other site. It is also interesting to note that Upper Sparkill Creek was listed as one of the worst places to swim in the previous section.
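The sample count says how often a site was tested but not how far apart the tests were. A minimal sketch of measuring the gap between consecutive samples at each site, assuming `Site` and `Date` columns like those in the data; the toy rows and the names `toy`, `DaysSincePrev`, and `median_gap` are hypothetical.

```python
import pandas as pd

# Toy sampling dates (made up) for two sites.
toy = pd.DataFrame(
    {
        "Site": ["A", "A", "A", "B", "B"],
        "Date": pd.to_datetime(
            ["2013-05-01", "2013-05-15", "2013-06-14", "2013-05-01", "2013-07-30"]
        ),
    }
)

# Sort so that diff() compares each sample with the previous one at the same site.
toy = toy.sort_values(["Site", "Date"])
toy["DaysSincePrev"] = toy.groupby("Site")["Date"].diff().dt.days

# Typical gap between tests per site (the first sample of each site has no gap).
median_gap = toy.groupby("Site")["DaysSincePrev"].median()
print(median_gap)
```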

## Question 3

For Question 3, we plotted the Entero count against the four-day rain total:

In [158]:
p = Scatter(dat_raw, x='EnteroCount', y='FourDayRainTotal', title="Rain vs. Entero Count", color="navy",
            xlabel="Entero Count", ylabel="4 Day Rain Total")
output_notebook()
show(p)
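A scatter plot can be complemented with a single number summarising the relationship. The sketch below computes a rank (Spearman-style) correlation by taking the Pearson correlation of the ranks, which is less sensitive to the very large capped readings than a plain Pearson correlation. The toy values and the names `toy`, `rho`, and `pearson` are assumptions for illustration, not results from the Riverkeeper data.

```python
import pandas as pd

# Toy data (made up) using the notebook's column names.
toy = pd.DataFrame(
    {
        "FourDayRainTotal": [0.0, 0.1, 0.5, 1.2, 2.0],
        "EnteroCount": [5, 12, 40, 300, 25000],
    }
)

# Rank correlation: Pearson correlation computed on the ranks of each column.
rho = toy["FourDayRainTotal"].rank().corr(toy["EnteroCount"].rank())

# Plain Pearson correlation on the raw values, for comparison.
pearson = toy["FourDayRainTotal"].corr(toy["EnteroCount"])

print(rho, pearson)
```

In this toy example the counts rise with every increase in rain, so the rank correlation is at its maximum even though the relationship is far from linear.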