# Explainer Notebook

To keep this notebook short, readable and informative, we decided not to include the analysis code in it. You are encouraged to visit the 3 notebooks in which the analysis was performed, each accompanied by commented code.

1. [Preliminary Analysis](http://nbviewer.jupyter.org/github/oikonang/social_data_visualization/blob/master/Final_Project/Preliminary%20Analysis.ipynb)
2. [Analysis of House Requests and 311 Efficiency](http://nbviewer.jupyter.org/github/oikonang/social_data_visualization/blob/master/Final_Project/Analysis%20of%20House%20Complaints%20and%20311%20Efficiency.ipynb)
3. [Analysis of Noise Complaints](http://nbviewer.jupyter.org/github/oikonang/social_data_visualization/blob/master/Final_Project/Analysis%20of%20Noise%20Complaints.ipynb)

## Motivation
#### What is your dataset?
Our dataset gathers 311 Service Requests in the city of New York from 2010 to the present. 311 is an American phone number designed to access non-emergency government services (as opposed to 911). Some examples of its usage are noise complaints, problems with heating/hot water, and reports of illegal parking.

#### Why did you choose this/these particular dataset(s)?
We chose this dataset because of its interesting nature. This service allows us to get a direct point of view on New Yorkers' lifestyle: we can get information on their noisy behaviour or on their houses' unsanitary conditions, for instance. We can then look at it from a geographical point of view and gain insights about New York City and its different areas.

This curated dataset did not need an elaborate cleaning process and it had a satisfactory number of entries. After cleaning the initial file, we had more than 7 million entries to work with.

#### What was your goal for the end user's experience?
Our goal was to let users explore this interesting dataset and get to know New York City better in the process, as we did.

We focused on two main categories, Noise and House complaints, as they were the most fascinating complaint types and also the ones with the highest number of observations.

## Basic stats. Let's understand the dataset better
Please refer to the notebook [Preliminary Analysis](http://nbviewer.jupyter.org/github/oikonang/social_data_visualization/blob/master/Final_Project/Preliminary%20Analysis.ipynb).

#### Write about your choices in data cleaning and preprocessing
The original file was ~7GB in size when downloaded. 

The first step was to remove null values. Then, out of the 53 original columns, we kept only the 12 that were useful for our analysis, including:
* Created Date
* Closed Date
* Complaint Type
* Borough
* District
* Longitude
* Latitude

Finally, we removed entries that met any of these criteria, considering them outliers:
* Outside the 2010 - 2017 timeframe
* Negative call duration
* 'Unspecified' Borough/Neighborhood

After these operations, we obtained a cleaned file of size ~1GB.
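
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up sample, not the code from our notebooks. The `clean` helper, the `KEEP` list (a subset of the 12 columns) and the sample rows are assumptions introduced here, and we assume that call duration is the difference between Closed Date and Created Date.

```python
import pandas as pd

# Tiny made-up sample standing in for the real ~7GB export (hypothetical rows).
raw = pd.DataFrame({
    "Created Date": ["2012-01-05 10:00", "2009-12-31 23:00", "2015-06-01 08:00",
                     "2016-03-03 12:00", None],
    "Closed Date": ["2012-01-06 10:00", "2010-01-02 09:00", "2015-05-31 08:00",
                    "2016-03-04 12:00", "2016-01-01 00:00"],
    "Complaint Type": ["HEATING", "PLUMBING", "Noise - Park",
                       "Blocked Driveway", "ELECTRIC"],
    "Borough": ["BRONX", "QUEENS", "BROOKLYN", "Unspecified", "MANHATTAN"],
    "Agency": ["HPD", "HPD", "NYPD", "NYPD", "HPD"],  # example of a dropped column
})

# Subset of the kept columns, enough for this illustration.
KEEP = ["Created Date", "Closed Date", "Complaint Type", "Borough"]


def clean(df):
    """Drop nulls, keep useful columns, and filter out the outliers."""
    # Assumption: nulls are dropped on the kept columns only, so that sparsely
    # filled unused columns do not discard otherwise valid rows.
    df = df[KEEP].dropna().copy()
    for col in ("Created Date", "Closed Date"):
        df[col] = pd.to_datetime(df[col])

    in_timeframe = df["Created Date"].dt.year.between(2010, 2017)
    non_negative = df["Closed Date"] >= df["Created Date"]
    specified = df["Borough"] != "Unspecified"
    return df[in_timeframe & non_negative & specified].reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Each boolean mask corresponds to one of the three outlier criteria, so a criterion can be relaxed or tightened independently of the others.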

<!---

For the two main branches of our analysis we performed filtering on the Complaint Type.

In the analysis of Noise Complaints we only focused on the 5 most frequent noise types, i.e:
* 'Noise - Park'
* 'Noise - Vehicle'
* 'Noise - Commercial'
* 'Noise - Street/Sidewalk'
* 'Noise - Residential'

In the analysis of House Requests we focused only on these complaints:
* 'HEAT/HOT WATER'
* 'PLUMBING'
* 'GENERAL CONSTRUCTION'
* 'UNSANITARY CONDITION'
* 'PAINT/PLASTER'
* 'ELECTRIC'
* 'DOOR/WINDOW'
* 'WATER LEAK'
* 'FLOORING/STAIRS'
* 'APPLIANCE'
-->

#### Write a short section that discusses the dataset stats
Let's start by giving general statistics about the dataset.

In the cleaned file, we can count 7 343 313 entries. The complaints are labeled with 559 unique categories.

The most frequent complaints are:

1. 'Noise - Residential' (921 438)
2. 'HEAT/HOT WATER' (666 588)
3. 'HEATING' (518 605)
4. 'Blocked Driveway' (425 798)
5. 'PLUMBING' (364 728)

This is the distribution of 311 requests over the years:
![311 calls over the years](pics/callsoveryears.png)
This histogram shows an ever-increasing number of 311 requests over the years. This is probably because people are becoming more and more familiar with the service. The ability to quickly submit a complaint through the website/mobile app must also have contributed to the increase in requests.

This is the average time to resolution of 311 requests over the years:
![Time to resolution over the years](pics/ttroveryears.png)
This plot shows that 311's efficiency in dealing with requests/complaints has been improving consistently over the years. This can again be attributed to changes in the service over time. We can infer that, while in early years requests were fewer and of greater "importance", in more recent years they have taken the form of quick reports submitted through an app, for example to signal a blocked driveway or request dead animal removal.
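
As a rough illustration of how the two yearly series plotted above could be derived, here is a minimal pandas sketch. The sample rows and the names `ttr_days` and `per_year` are hypothetical, and we assume that time to resolution is the difference between Closed Date and Created Date, measured in days.

```python
import pandas as pd

# Hypothetical sample; the real data has millions of rows.
df = pd.DataFrame({
    "Created Date": pd.to_datetime(
        ["2010-01-01", "2010-06-01", "2011-01-01", "2011-06-01"]),
    "Closed Date": pd.to_datetime(
        ["2010-01-05", "2010-06-03", "2011-01-02", "2011-06-03"]),
})

# Time to resolution in days (assumed definition: closed minus created).
df["ttr_days"] = (df["Closed Date"] - df["Created Date"]).dt.total_seconds() / 86400

# Number of requests and mean time to resolution for each year of creation.
per_year = df.groupby(df["Created Date"].dt.year).agg(
    requests=("ttr_days", "count"),
    mean_ttr_days=("ttr_days", "mean"),
)
print(per_year)
```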

## Theory. Which theoretical tools did you use?
Please refer to the notebooks [Analysis of House Requests and 311 Efficiency](http://nbviewer.jupyter.org/github/oikonang/social_data_visualization/blob/master/Final_Project/Analysis%20of%20House%20Complaints%20and%20311%20Efficiency.ipynb) and [Analysis of Noise Complaints](http://nbviewer.jupyter.org/github/oikonang/social_data_visualization/blob/master/Final_Project/Analysis%20of%20Noise%20Complaints.ipynb).

#### Talk about your model selection.
We built two models, one related to House Complaints and the other to Noise Complaints.

To select our model we first explored the dataset looking for apparent correlations, in order to select our features. These are the steps we followed in the analysis of the House Complaints:

1\. We plotted the frequency of the different complaint types over the years.
![House complaints over the years](pics/housecompoveryears.png)


2\. We plotted the frequency of the different complaint types across the hours of the day.
![House complaints over the day](pics/housecompoverday.png)


3\. We plotted the frequency of the different complaint types across the months of the year.
![House complaints over the months](pics/housecompovermonths.png)


4\. We plotted the average time to resolution per complaint type over the years.
![Time to resolution per house complaint](pics/ttrperhousecomp.png)


5\. We plotted the ratio between P(complaint type | district) and P(complaint type) per district.
![Ratio per house complaint per district](pics/houseratioperdistrict.png)
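
The ratio used in step 5 can be sketched as follows. A value above 1 means that a complaint type is over-represented in a district compared with the city as a whole. This is a minimal standard-library illustration: the `complaint_ratios` helper, the district names and the sample records are hypothetical and not taken from our notebooks.

```python
from collections import Counter

# Hypothetical (district, complaint type) records.
records = [
    ("District A", "HEAT/HOT WATER"),
    ("District A", "HEAT/HOT WATER"),
    ("District B", "HEAT/HOT WATER"),
    ("District B", "PLUMBING"),
]


def complaint_ratios(records):
    """Return P(type | district) / P(type) for every observed pair."""
    total = len(records)
    type_counts = Counter(ctype for _, ctype in records)
    district_counts = Counter(district for district, _ in records)
    pair_counts = Counter(records)

    ratios = {}
    for (district, ctype), n in pair_counts.items():
        p_type_given_district = n / district_counts[district]
        p_type = type_counts[ctype] / total
        ratios[(district, ctype)] = p_type_given_district / p_type
    return ratios


ratios = complaint_ratios(records)
for pair, ratio in sorted(ratios.items()):
    print(pair, round(ratio, 2))
```

In this toy sample 'PLUMBING' makes up a quarter of all complaints but half of those in District B, so its ratio there is 2.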


After this exploration, we chose the **month** and the **part of the day** as features of our model. The third chosen feature is the **neighborhood** of New York City, which we considered more relevant than the borough alone. What we try to predict is the **_complaint type_**, given these three variables.


We followed a similar procedure to build the model for the Noise Complaints, which is described in the respective notebook. At the end of the exploration, we identified the **latitude** and **longitude** of the complaints as features and we tried to predict the **_complaint type_**.

#### Describe which machine learning tools you use and why the tools you've chosen are right for the problem you're solving.
Among the different tools at our disposal, we chose to use Decision Tree Classifiers and KNN Classifiers. We also made use of K-Means clustering.


Since we are working with dates and geolocations, Decision Trees are a good fit for the job. The features are categorical or have integer boundaries, which makes a tree well suited to finding the best splits.


The KNN classifier also fits our analysis well. The features are geolocations, so there are only 2 dimensions, and Euclidean distance is a natural measure of closeness between complaints.

#### How did you split the data into test/training. Did you use cross validation?
Decision Trees: we used 90% of the data for training and 10% for testing.


We did not use cross validation.


KNN: we used 100% of the data for training.
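
The split described above could be reproduced roughly as follows with scikit-learn. This is a sketch on synthetic data, not the code from our notebooks. The integer encoding of the features, the number of classes, `n_neighbors=5` and the random seeds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic house complaints: neighborhood id, month, part of the day
# (all integer-coded, an assumed encoding) and a complaint type label.
X_house = np.column_stack([
    rng.integers(0, 20, size=1000),  # neighborhood id
    rng.integers(1, 13, size=1000),  # month, 1 to 12
    rng.integers(0, 4, size=1000),   # part of the day
])
y_house = rng.integers(0, 3, size=1000)

# Decision Tree: 90% of the data for training, 10% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_house, y_house, test_size=0.1, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_score = tree.score(X_test, y_test)  # mean accuracy on the held-out 10%
print("Decision Tree test score:", tree_score)

# Synthetic noise complaints: latitude and longitude within a rough NYC box.
X_noise = np.column_stack([
    rng.uniform(40.5, 40.9, size=500),     # latitude
    rng.uniform(-74.25, -73.7, size=500),  # longitude
])
y_noise = rng.integers(0, 5, size=500)

# KNN: fitted on 100% of the data, with the default Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_noise, y_noise)
print("Predicted noise type:", knn.predict(np.array([[40.7, -74.0]])))
```

Because the labels here are random, the printed score only shows the mechanics of the evaluation and says nothing about the performance of the real models.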

#### Explain the model performance. How did you measure it? Are your results what you expected?
Classifiers: we measured performance with the classifiers' scores.


Clustering: we plotted the total square error against the number of clusters.
![Total square error vs number of clusters](pics/error.png)
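
A curve like the one above can be produced by fitting K-Means for several values of K and recording the total square error, which scikit-learn exposes as `inertia_`. The sketch below uses synthetic points around three made-up centers. The `total_square_error` helper, the coordinates, `n_init=10` and the range of K are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic (latitude, longitude) points around three hypothetical centers.
centers = np.array([[40.6, -74.0], [40.7, -73.9], [40.8, -73.8]])
points = np.vstack(
    [center + rng.normal(scale=0.01, size=(100, 2)) for center in centers])


def total_square_error(points, ks):
    """Fit K-Means for each K and return {K: total square error}."""
    errors = {}
    for k in ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
        errors[k] = model.inertia_  # sum of squared distances to centroids
    return errors


errors = total_square_error(points, range(1, 7))
for k, error in errors.items():
    print(k, round(error, 4))
```

With three well separated groups the error drops sharply up to K = 3 and flattens afterwards, which is the "elbow" one looks for when choosing K.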

## Visualizations
#### Explain the visualizations you've chosen.
We have implemented four D3 visualizations.

1. **Exploration of the dataset**. In this visualization, clicking on a location in NYC makes the map zoom in on the corresponding neighborhood. A tooltip provides information about its name, the total number of complaints and the most frequent complaint type.

2. **Prediction of House complaint type**. The user can choose the preferred combination of District/Month/Part of the day from 3 dropdown menus, to see the 3 most likely house complaints. (Results from the Decision Tree)

3. **K-Means clustering of Unsanitary Conditions complaints**. The user can choose with a slider the value of K to see the clustering.

4. **Prediction of Noise complaint type**. The user can hover anywhere on NYC map and a tooltip shows the most likely noise complaint in that location. The point targeted by the mouse will be colored according to the complaint type. (Result of KNN classification)

#### Why are they right for the story you want to tell?


## Discussion. Think critically about your creation
#### What went well?


#### What is still missing? What could be improved? Why?
We wanted to investigate a third branch in our analysis, related to Street Complaints, like 'Blocked Driveway' or 'Illegal Parking'. We could have combined our dataset with the one related to New York's traffic. We did not do so because of time constraints.


Our visualizations work as expected, but they could be tweaked further, for example by making transitions smoother. This was also not done because of time constraints.