# 1. Motivation
### What is your dataset?

In this project, we are looking at **New York City** and its **gentrification**. We used 2 datasets: the *Property Valuation and Assessment Data* and the *NYPD Arrests Data (Historic)*. 

In the *Property* dataset we are mostly focused on the *FULLVAL* column since it is the one with the information about the prices of properties. We also grouped the values per ZIP code, in order to later use them in a map of New York.
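
The grouping described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the project's actual code: the column names `ZIP` and `YEAR` are assumptions (only `FULLVAL` is named in this report), and the choice of the mean as the aggregate is also an assumption.

```python
import pandas as pd

# Toy stand-in for the property data. "ZIP" and "YEAR" are assumed
# column names; only "FULLVAL" is taken from the report.
props = pd.DataFrame({
    "ZIP": [10001, 10001, 10001, 11201],
    "YEAR": [2010, 2010, 2011, 2010],
    "FULLVAL": [400000.0, 600000.0, 550000.0, 900000.0],
})

# Average property value per ZIP code and year (the mean is an assumed
# aggregate; a median or sum would be written the same way).
per_zip = props.groupby(["ZIP", "YEAR"], as_index=False)["FULLVAL"].mean()
print(per_zip)
```

A table in this shape has one row per ZIP code and year, which is a convenient input for a map colored by ZIP code.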

In the *Crime* dataset, we focused on 4 crimes from the *OFNS_DESC* column: *Rape*, *Murder & Non-negligent Manslaughter*, *Robbery* and *Dangerous Weapons*. We wanted to gather data about the most violent crimes.

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn import model_selection
from IPython.display import Image
from sklearn.metrics import accuracy_score

from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh import plotting
from bokeh.models import FactorRange
from bokeh.transform import dodge
from bokeh.models import Legend
from bokeh.palettes import Category20c

In [None]:
# Import property data
#data_property = pd.read_csv("Property_Valuation_and_Assessment_Data.csv") 
#df_property = pd.DataFrame(data_property)

# Import crime data
#data_crime = pd.read_csv("NYPD_Arrests_Data__Historic_.csv") 
#df_crime = pd.DataFrame(data_crime)

#Show overview
#print(df_property.head())
#print(df_crime.head())



### Why did you choose this/these particular datasets?

We started with a question: *What factors, visible in open datasets, could be correlated with gentrification?* Some of our hypotheses were "the evolution of property prices in a zone" and "the overall safety of a zone". In this project, we want to confirm or reject these hypotheses and show what they represent through visualization.

The *Property* and *Crime* datasets give useful information to answer these questions, and they range from 2010 to 2019, which is an acceptable time span for observing evolution and changes.

### What was your goal for the end user's experience?

We want the user to learn about gentrification and some of its manifestations in New York. We want the user to be able to interact with our visualizations in order to have a more engaging experience. Our goal is to communicate our findings clearly while leaving room for the user to explore our visual story.

# 2. Basic statistics
### Write about your choices in data cleaning and preprocessing

We started with a 2 GB property valuation dataset containing many variables that were not relevant to our study, so we first removed the unneeded columns from the dataframe (such as "owner", "block" or "taxclass"). Next, we needed every row to be complete, with no cell holding `NaN`; to ensure this, we applied the `dropna()` method to our dataframe. Finally, we made sure that every variable had the desired type: the zip codes and the year are integers, and the property values are floats.
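
The three cleaning steps above could look roughly like the sketch below. It runs on a small toy dataframe rather than the real 2 GB file, and the exact column names (`OWNER`, `BLOCK`, `TAXCLASS`, `ZIP`, `YEAR`) and the helper `clean_property` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the raw property data. All column names except
# FULLVAL are assumptions for illustration.
raw = pd.DataFrame({
    "OWNER": ["A", "B", "C", "D"],
    "BLOCK": [1, 2, 3, 4],
    "TAXCLASS": ["1", "2", "1", "4"],
    "ZIP": ["10001", "10001", None, "11201"],
    "YEAR": ["2010", "2011", "2011", "2012"],
    "FULLVAL": ["500000", "650000", "700000", None],
})

def clean_property(df):
    """Hypothetical helper: drop columns, drop incomplete rows, fix types."""
    # 1. Remove columns that are not relevant to the study.
    df = df.drop(columns=["OWNER", "BLOCK", "TAXCLASS"])
    # 2. Remove every row that has at least one missing value.
    df = df.dropna()
    # 3. Cast ZIP and YEAR to integers and FULLVAL to float.
    df = df.astype({"ZIP": int, "YEAR": int, "FULLVAL": float})
    return df.reset_index(drop=True)

clean = clean_property(raw)
print(clean.dtypes)
```

Dropping the unused columns before calling `dropna()` matters: a missing value in a column we do not need would otherwise discard a row that is perfectly usable.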

(Add code)

We then cleaned and processed the crime dataset in a similar way. We first identified the crimes relevant to our study and filtered out all the others; in our case, we were interested in Robbery, Dangerous Weapons, Rape, and Murder & Non-negligent Manslaughter. The main purpose of this dataset was to produce a heat map movie, so we created a new column holding the year in which each arrest was made. We then filtered the data further, so that the final dataframe contains only the year (as an integer) and the longitude and latitude (as floats).
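
A minimal sketch of this preparation, on toy data, might look as follows. Apart from `OFNS_DESC`, the column names (`ARREST_DATE`, `Latitude`, `Longitude`), the date format, the offense labels and the helper `prepare_heatmap_data` are assumptions; the exact strings in the real dataset may differ.

```python
import pandas as pd

# Toy stand-in for the arrests data. Column names other than OFNS_DESC,
# the date format and the offense labels are assumptions.
arrests = pd.DataFrame({
    "ARREST_DATE": ["01/15/2010", "03/02/2012", "07/30/2015", "11/11/2018"],
    "OFNS_DESC": ["ROBBERY", "PETIT LARCENY", "RAPE", "DANGEROUS WEAPONS"],
    "Latitude": [40.75, 40.70, 40.68, 40.81],
    "Longitude": [-73.99, -73.95, -73.98, -73.94],
})

# Illustrative labels for the four crimes studied in this report.
VIOLENT = [
    "RAPE",
    "MURDER & NON-NEGL. MANSLAUGHTER",
    "ROBBERY",
    "DANGEROUS WEAPONS",
]

def prepare_heatmap_data(df):
    """Hypothetical helper: keep the four crimes, add the year, keep coordinates."""
    # Keep only the offenses relevant to the study.
    df = df[df["OFNS_DESC"].isin(VIOLENT)].copy()
    # Derive the year of the arrest as an integer column.
    dates = pd.to_datetime(df["ARREST_DATE"], format="%m/%d/%Y")
    df["YEAR"] = dates.dt.year.astype(int)
    # Keep only what the heat map needs, with coordinates as floats.
    out = df[["YEAR", "Latitude", "Longitude"]]
    out = out.astype({"Latitude": float, "Longitude": float})
    return out.reset_index(drop=True)

heat = prepare_heatmap_data(arrests)
print(heat)
```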


### Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis. 

The crime data is not balanced (see the pie chart), but we kept it that way because we want to see the total evolution.
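
One simple way to check this kind of imbalance is to compute the share of rows per crime type. The sketch below uses made-up counts purely for illustration; it does not reproduce the actual proportions of the dataset.

```python
import pandas as pd

# Made-up offense labels and counts, for illustration only.
offenses = pd.Series(
    ["ROBBERY"] * 3 + ["DANGEROUS WEAPONS"] * 2 + ["RAPE"],
    name="OFNS_DESC",
)

# Fraction of rows belonging to each crime type.
shares = offenses.value_counts(normalize=True)
print(shares)
```

These fractions are the same numbers a pie chart of the crime types would display.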



# 4. Genre. 

### Which genre of data story did you use?

Our website is organized as a **slideshow**. This lets us use different types of visualization embedded in the website, and also allows the user to navigate the visualizations in the order of their choice.

Inside the slides we use the **annotated graph** genre and **animations**. The first allows us to communicate both visual and textual information very easily, and to explain the visualization directly. The latter is a more engaging visualization, from which the user can spot changes in the evolution straight away.

### Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel and Heer). Why?

We used several tools presented in the [Segel and Heer paper][1]. We will first explain the tools used for Visual narrative:

In **Highlighting**, we used *motion* and *zooming* in our maps. This lets users explore those visualizations as they wish.

As for **Visual structuring**, we use a *progress bar* in our heat map movie of the crime data, so that the user can select which year to view and see how far along in the evolution the current frame is.

For **Transition guidance**, we made use of *Animated transitions* in our website slideshow and *Object continuity* between our slides. This gives a smoother and more enjoyable transition between the elements of our visualizations.

[1]: http://vis.stanford.edu/files/2010-Narrative-InfoVis.pdf


### Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel and Heer). Why?

Now let's look at the tools used for Narrative structure:

We **ordered** our slideshow as a *user directed path*: the user chooses which slide to see next. In our case, moving right goes to a slide that presents a new characteristic, and moving down shows more details about the current characteristic.

For **interactivity**, we use *navigation buttons* and *filter/selection* features. Both can be found in the Bokeh plot we created. Interactivity keeps users engaged and lets them shape their own experience, which is always worth taking into account when creating a story.

Finally, we used structure tools for **messaging**. In our visualizations, we implemented *Captions*, *Annotations* and *Introductory text*. With those tools, we can communicate our findings and observations to the user at the same time as they see the image. They also give structure and link our creations to our initial question.

# 6. Discussion. Think critically about your creation

### What went well?

With this project, we were able to work on something that all three of us were interested in. We enjoyed doing research and analyzing data on a topic that we didn't know much about beforehand, and we definitely learned a lot about it.

We were able to make different visualizations that we hadn't tried before and to find different ways of presenting data.

However, we are aware that this project is far from a big success; its main value lies in the lessons we took away from our struggles and failures.

### What is still missing? What could be improved? Why?

Our first problem lies in the analysis of the data. We had high expectations about the kinds of correlations and links we would find between our criteria and gentrification. In reality, the data was harder to exploit than expected: grouping price evolution over areas as large as zip codes produced a "flattened-out" result in which it is very hard to observe any substantial increase or decrease, because of the number of properties involved (some might increase a lot, but that change gets lost in the average).
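
The flattening effect can be illustrated with a small worked example. All numbers below are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from our data.

```python
from statistics import mean

# Hypothetical ZIP code: 100 properties, each worth 500,000 in year one.
before = [500_000.0] * 100

# In year two, 5 properties gain 60% in value and the other 95 stay flat.
after = [v * 1.6 if i < 5 else v for i, v in enumerate(before)]

# Relative change of the ZIP-level average versus the largest
# relative change of any single property.
zip_change = mean(after) / mean(before) - 1
max_change = max(a / b - 1 for a, b in zip(after, before))

print(f"ZIP average change: {zip_change:.1%}")
print(f"Largest single-property change: {max_change:.1%}")
```

In this toy case a 60% jump on a handful of properties moves the ZIP-level average by only about 3%, which is the kind of signal loss we ran into.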

We also struggled finding additional datasets from NYC that were relevant to gentrification.

Secondly, we were unable to find crime data with zip codes. It was therefore very hard to combine the price and crime datasets without losing precision (we could have merged them by borough, but that would have been too general, insignificant, and hard to use). Our intention to analyze those two criteria together was therefore harder to achieve.
In addition, we wanted to make a Bokeh plot showing the evolution of violence and prices in the 7 zip codes, but this was also impossible for the same reasons.

Moreover, we realize that our machine learning part did not achieve what we wanted it to do at all. The initial goal was to predict which zones would be affected by gentrification in the future, but we realized that our criteria could not serve as qualifiers as well as we thought they would.

As a final word, we would say that our initial idea was ambitious, but we did not have enough initial skills and did not put enough time and resources into completing our task.