# Exploring Water Quality Complaints

### Author: Katie Shakman 

Several times during the winter, I noticed dark water flowing from our sink's faucet.  It never lasted long, just a few seconds, and it usually happened when the water was set to hot, but I hadn't noticed it at all in the summer.  I wondered whether water complaints like this might be more common in the colder months, when freezing weather presumably does more damage to pipes.  I decided to find some water quality data for the city and explore it for seasonal fluctuations or geographic patterns.

I started by downloading the Water Quality complaints dataset from NYC Open Data.  To work with the data, I decided to use Python 3 in a Jupyter notebook.  To begin, I wanted to visualize the data and answer a few basic questions about it.

----------------------------------------------------------------------------------------------


### What kinds of complaints are people making? 

<img src="what_complaints.png" width="100%">

### Water Complaints By Geographical Area

Question: Where are water quality complaints coming from in the city?

One way to visualize the complaints geographically is to plot a scatter of the complaints by location (latitude and longitude).  Let's see if the incidents appear to cluster by latitude & longitude: 


#### Water complaints plotted by latitude and longitude.

<img src="by_location.png" width="100%">

The scattered points clearly cover all 5 boroughs: we can make out the shapes of the Bronx and Manhattan (NW blob), Queens and Brooklyn (central/E blob, and SE strip corresponding to The Rockaways barrier islands), and Staten Island (SW).

We can color-code the map by borough to make it clearer which complaints come from which regions:

<img src="by_location_colorByBorough.png" width="90%">
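For readers who want to try this, here is a minimal sketch of one way to draw such a color-coded scatter with matplotlib.  The DataFrame is a made-up stand-in with rough placeholder coordinates, and the `Borough`, `Latitude`, and `Longitude` column names are assumptions about how the downloaded table is labeled.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the complaints table (placeholder coordinates).
# Column names are assumptions, not taken from the original notebook.
complaints = pd.DataFrame({
    "Borough": ["MANHATTAN", "MANHATTAN", "BROOKLYN", "BROOKLYN", "STATEN ISLAND"],
    "Latitude": [40.78, 40.75, 40.65, 40.68, 40.58],
    "Longitude": [-73.97, -73.99, -73.95, -73.94, -74.15],
})

fig, ax = plt.subplots(figsize=(6, 6))

# One scatter call per borough, so each group gets its own color
# from matplotlib's default color cycle and its own legend entry.
for borough, group in complaints.groupby("Borough"):
    ax.scatter(group["Longitude"], group["Latitude"], s=4, label=borough)

ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.legend(title="Borough")
```

Plotting each group separately keeps the legend in sync with the colors without building a color map by hand.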

We might want to see if any of these boroughs has dramatically more complaints than another.  
To address that, we can start by getting the counts of complaints by borough, and visualizing them as a bar plot: 

<img src="complaint_count_byBorough.png" width="120%">
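The counting step can be sketched in pandas roughly as follows.  The tiny DataFrame is a made-up stand-in for the downloaded table, and the `Borough` column name is an assumption.

```python
import pandas as pd

# Made-up stand-in for the complaints table; the real rows would come from
# the downloaded CSV. The "Borough" column name is an assumption.
complaints = pd.DataFrame({
    "Borough": ["MANHATTAN", "BROOKLYN", "QUEENS", "BROOKLYN",
                "BRONX", "MANHATTAN", "STATEN ISLAND", "BROOKLYN"],
})

# Count rows per borough, most frequent first.
counts_by_borough = complaints["Borough"].value_counts()
print(counts_by_borough)

# In a notebook with matplotlib installed, the bar plot is one more line:
# counts_by_borough.plot(kind="bar")
```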

We might wonder if certain areas are more prone to certain types of complaints.  We can use the "Descriptor" column to color-code the datapoints and we get a map like this:

<img src="byLocation_ColorByDescriptorAll.png" width="120%">

From this color-coded plot, it appears that Manhattan and nearby areas of Brooklyn and Queens have a high density of taste/odor complaints (bluer colors). However, it's hard to really see the breakdown just from this plot, given the high density of complaints in certain areas.  We can break it down into subplots by borough: 

<img src="barByBorough_ColorByDescriptorAll.png" width="120%">
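A breakdown like this can be tabulated with a cross-tabulation of borough against complaint type.  The sketch below uses made-up rows and placeholder `Descriptor` values; only the `Descriptor` column name comes from the dataset as described above, and `Borough` is an assumed name.

```python
import pandas as pd

# Made-up rows; the Descriptor values are placeholders for illustration.
complaints = pd.DataFrame({
    "Borough": ["MANHATTAN", "MANHATTAN", "BROOKLYN",
                "BROOKLYN", "QUEENS", "QUEENS"],
    "Descriptor": ["Taste/Odor", "Taste/Odor", "Cloudy Water",
                   "Taste/Odor", "Cloudy Water", "Cloudy Water"],
})

# Rows are boroughs, columns are complaint types, cells are counts.
by_type = pd.crosstab(complaints["Borough"], complaints["Descriptor"])

# Divide each row by its total to get the share of each type per borough,
# which makes boroughs with very different totals comparable.
share = by_type.div(by_type.sum(axis=1), axis=0)
print(share)
```

Looking at shares rather than raw counts helps when one borough's total dwarfs the others.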

I wanted to normalize the number of complaints to the population per borough, so I also downloaded NYC OpenData's census table. 

After dividing each borough's complaint count by its 2010 census population (one could argue that a more recent population estimate would have been better, but the 2010 census should be a reasonable approximation), we get an updated bar plot showing which areas have the most complaints, normalized by population:

<img src="NumComplaintsByBoroughNormBy2010Pop.png" width="100%">

Now we can see that Staten Island has the most water quality complaints per capita, followed by Manhattan. 
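To make the normalization concrete, here is a small sketch of the per-capita calculation.  The population figures are the 2010 census counts as I understand them (worth checking against the census table listed in Sources), and the complaint counts are deliberately fake: every borough gets the same placeholder number, so any difference in the result comes purely from population.

```python
import pandas as pd

# 2010 census populations by borough (verify against the census table).
population_2010 = pd.Series({
    "Bronx": 1_385_108,
    "Brooklyn": 2_504_700,
    "Manhattan": 1_585_873,
    "Queens": 2_230_722,
    "Staten Island": 468_730,
})

# Placeholder counts, NOT the real totals: identical for every borough.
complaint_counts = pd.Series({
    "Bronx": 1000,
    "Brooklyn": 1000,
    "Manhattan": 1000,
    "Queens": 1000,
    "Staten Island": 1000,
})

# pandas aligns the two Series on their index labels before dividing.
per_100k = complaint_counts / population_2010 * 100_000
print(per_100k.sort_values(ascending=False).round(1))
```

With equal raw counts, the least populous borough comes out on top, which is why a borough with a modest number of complaints can still lead the per-capita ranking.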

### Water Complaints Over Time

When did the complaints come in?  Do they correspond to the major hurricanes? 
<img src="NumberOfComplaintsByDayAnnotated.png" width="80%">

The times of highest complaint frequency seem not to be directly after major hurricanes.  Instead, it looks like there is a slight increase in complaints during the spring and summer. 
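One way to get from individual complaints to a daily series, and then to a crude seasonal check, is sketched below.  The timestamps are made up, and the `Created Date` column name is an assumption about the downloaded table.

```python
import pandas as pd

# Made-up timestamps; "Created Date" is an assumed column name.
complaints = pd.DataFrame({
    "Created Date": pd.to_datetime([
        "2016-01-04 09:15", "2016-01-04 17:40", "2016-01-06 08:05",
        "2016-07-11 12:30", "2016-07-11 13:00", "2016-07-11 19:45",
    ]),
})

# Drop the time of day, then count how many complaints fall on each date.
daily_counts = (
    complaints["Created Date"].dt.floor("D").value_counts().sort_index()
)

# Average the daily counts within each calendar month (1 = January).
# Only days that have at least one complaint contribute to the average.
monthly_mean = daily_counts.groupby(daily_counts.index.month).mean()
print(monthly_mean)
```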

Let's see what we get when we optimize a cosine function to fit the latter half of the data (which seemed to have fewer outliers): 
<img src="FitCosine.png" width="80%">

It looks like there is some periodicity (our optimized function is not too flat), but it is not very pronounced (the amplitude of our optimized curve is lower than our initial guess).  On average there are about 4 complaints per day, and that doesn't seem to depend strongly on the time of year.  Perhaps it takes a really serious water problem to get New Yorkers to call in, and those don't happen seasonally.  That is probably indicative of a well-maintained water supply system!
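For anyone curious about the mechanics, a cosine fit of this kind can be sketched with `scipy.optimize.curve_fit`.  This is not necessarily the exact model used for the figure: the functional form, the fixed one-year period, the parameter names, and the initial guess are all assumptions, and the data here are synthetic.

```python
import numpy as np
from scipy.optimize import curve_fit


def seasonal_cosine(day, amplitude, phase, offset):
    """Cosine with a fixed one-year period (365.25 days)."""
    return amplitude * np.cos(2 * np.pi * (day - phase) / 365.25) + offset


# Synthetic daily counts: a known seasonal signal plus Gaussian noise.
rng = np.random.default_rng(0)
days = np.arange(3 * 365)
true_params = (1.0, 180.0, 4.0)  # amplitude, phase (days), offset
counts = seasonal_cosine(days, *true_params) + rng.normal(0, 0.5, size=days.size)

# Start from a deliberately imperfect guess and let the optimizer refine it.
initial_guess = (2.0, 150.0, 3.0)
fitted, _ = curve_fit(seasonal_cosine, days, counts, p0=initial_guess)
amplitude, phase, offset = fitted
print(f"amplitude={amplitude:.2f}, phase={phase:.1f}, offset={offset:.2f}")
```

Comparing the fitted amplitude with the offset gives a quick sense of how large the seasonal swing is relative to the typical daily count.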

Note that the fit used only dates with at least one complaint; no dates had a value of zero.  We might get a higher-amplitude fit if we filled in the missing dates with zeros.
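Filling in the missing dates is straightforward in pandas.  The sketch below assumes the daily counts are already in a Series indexed by date (the values are made up) and reindexes it onto a complete daily calendar.

```python
import pandas as pd

# Made-up daily counts with gaps: no rows for Jan 5, 7, or 8.
daily_counts = pd.Series(
    [2, 1, 3],
    index=pd.to_datetime(["2016-01-04", "2016-01-06", "2016-01-09"]),
)

# Build every calendar day between the first and last observed date.
all_days = pd.date_range(
    daily_counts.index.min(), daily_counts.index.max(), freq="D"
)

# Reindex onto the full calendar, treating absent days as zero complaints.
filled = daily_counts.reindex(all_days, fill_value=0)

print("mean over observed days:", daily_counts.mean())
print("mean over all days:     ", filled.mean())
```

The average drops once the empty days count as zeros, so any fit to the filled series would settle on a lower baseline.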

Also, some of the calls were requests for information ("No sampling required..."); these entries could be removed and the data reanalyzed.  It might also be worth coloring the points on the lat/long plot by complaint type to see whether certain types of complaints come mainly from certain areas (certain times are also of interest).  A dummy variable built from the call notes could probably be used for this.  Or, without a dummy variable, a nice way to do this with seaborn is shown here: 

As a future question, it would be fun to explore how this periodicity (or lack thereof) compares to water quality trends at the beaches, which we might expect to be much more seasonal (due to e.g. increased growth of bacteria during warmer months, without the tight control of water filtration systems used in the municipal water supply). 

## Sources

Data: <br>
https://opendata.cityofnewyork.us <br>
OpenData's Water Quality complaints (Originally retrieved 3/12/2017): <br>
https://data.cityofnewyork.us/Environment/Water-Quality-complaints/qfe3-6dkn <br>
New York City Population By Boroughs (Retrieved 5/13/2017): <br>
https://data.cityofnewyork.us/City-Government/New-York-City-Population-By-Boroughs/9mhd-na2n <br>

Tips and Other Resources: <br>
http://www.stackoverflow.com/ <br>
https://python-graph-gallery.com/ <br>