# Week 5

This week, we switch gears a bit. You're getting better at JavaScript, and D3 is no longer the foreign concept it once was. That means we're ready to get started on a **Bigger Project** where we spend a few weeks digging deep, using the skills you've learned so far to build something coherent and comprehensive, and getting closer to exhaustively mapping out a large and complex dataset (while also learning a few new skills along the way).

## Crime in NYC
I thought a lot about what that new project should be. In the past I've had classes look at Twitter, Marvel/DC Comic Book Characters, Predictive Policing in San Francisco, and Philosophers. Then I realized that I'll be going to New York for a conference in the spring. I'm staying [here](http://www.sheratonbrooklyn.com) -- and since I'm always scared of accidentally ending up in bad neighborhoods, I thought we should focus on mapping crime in New York City.

OK, maybe it's not just because I'm going to NYC. It's also because there's fantastic data available, for example through NYC's amazing [open data portal](https://opendata.cityofnewyork.us).

We'll be looking at incident-level NYPD Complaint Data (since there are more reports than actual police actions). You can find it here:

* [NYPD Historic Complaint Data](https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i). This dataset includes all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) from 2006 to the end of 2016. That's 24 columns and 5.58 million rows of pure fun. There are timestamps with minute resolution and GPS coordinates, so there should be enough data for exciting visualizations.

You won't be the only ones to visualize this and related datasets. I did a quick search and found these:

* https://www.nytimes.com/interactive/projects/crime/homicides/map
* https://nycdatascience.com/blog/student-works/crime-and-demographics-in-new-york-city/
* https://www.datasciencecentral.com/profiles/blogs/7-sins-in-nyc
* http://iquantny.tumblr.com/post/144197004989/the-nypd-was-systematically-ticketing-legally

## The plan for today

But enough about this dataset. We still have some lecturing to get through as well as a bit of practical information. Here's the plan.

* First, a small video to explain the reason for using peergrade.
* Then a little lecture on BAD visualization (I hope it will be entertaining) + a bonus video.
* And while we're getting better at D3, we're not done with the book, so we'll also read a bit.
* Finally, we'll visualize some crime statistics.

## Part 1: What's the deal with peergrade.io

So we're using peergrade in the course. I made a little video to explain why we do this.

In [12]:
from IPython.display import YouTubeVideo
YouTubeVideo("-TC18KgpiIQ",width=800, height=450)

I realized, however, that one thing wasn't completely clear in the video. To make things crystal clear I'll address this below. There are 2 important points.

**1) The peer-grading scores your work receives do not influence your grade**

I strongly believe that you should be solving complex problems. Problems that have *many* correct solutions, just like in the real world. This means problems that you can't pose as multiple-choice assignments. You can't get a computer to grade this kind of work (yet; maybe with deep learning, but that's another story).

This class, however, has around 100 students. So how do we give you good feedback? The answer is peer evaluations. By crowd-sourcing evaluations, you guys can get feedback on your work quickly, and when you average over many other students, the quality is [as high as you'd get from TAs](http://journals.sagepub.com/doi/pdf/10.3102/00346543070003287). There are lots of pros and cons related to peer evaluation, and I discuss those elsewhere on these pages.

But an important thing to realize is that **how your peers evaluated you does not determine your grade**! (We do, however, look at the quality of *your* evaluations of other people's assignments, and that quality is reflected in the grade; see below.)

**2) So how does the grade come about?**

When it's time to do the grading, the TAs and I get together. And we set it up so that at least two of us look at each assignment (including the final project assignments). Then we discuss each assignment as a group and write down a numerical assessment. We also take into account the quality of your peer evaluations.

Each grade is then based on those numerical assessments. The grade is a holistic evaluation of your work in the entire course, but as a rule of thumb 50% of the grade is due to Assignments 1 and 2, while the other 50% is due to your final project assignments and peer assessments.





## Part 2: Video Lectures

In [11]:
# Sune talks about what makes good/bad visualizations
from IPython.display import YouTubeVideo
YouTubeVideo("TVdfoSxg3V4",width=600, height=338)

> *Exercise*: Some questions for the video 
>
> * Who is Edward Tufte? (Go online and find his info), summarize in max 3 lines.
> * What is the "Lie Factor"?
> * What's the idea behind the "Data-Ink ratio"? Should it be maximized or minimized?

**Bonus video**: Just to provide another perspective on data visualization (which covers many of the topics I've discussed over the past few weeks but from new angles), below is a great talk about data visualization by David McCandless. This one is optional, but recommended.

In [7]:
# David McCandless TED talk (this one is optional)
YouTubeVideo("5Zg-C8AAIGg",width=800, height=450)

## Part 3: Reading the book

This time, we'll read Chapters 12-13 about *Selections* and *Layouts*. You know how I think you should work with the book, so I won't repeat that.

> *Reading*: Read Chapter 12-13 of IDV.

And since you now know I won't put the questions for the book on the Assignments, I'm not going to list any here. I'll just assume that you'll read the chapters without me testing you on the knowledge.

And if we're being honest, I'm not planning on using Chapter 12 today, so focus on Chapter 13 and just skim Chapter 12, so you know where the info is (for later).

## Part 4: Visualizing some data

Ok. So in the beginning of this notebook I boldly proclaimed that we were starting a new and exciting project today. And that is true. But we're going to start slowly, since I also want to make sure you practice stuff in the book. Thus we'll do three D3 visualizations based on Chapter 13.

**What about the data pre-processing?** The basic data we'll be working with has 24 columns and 5.58 million rows. That's a lot to work with and too much to load in your browser. Thus, you'll have to preprocess the data before generating the visualizations below. There are many ways to do that.

If you know Python, a good solution is to just download everything and then use [`pandas`](https://pandas.pydata.org), which is a great way to work with large tables of data. If you don't know Python, you could process the data in your favorite programming language (R, MATLAB, JavaScript). If you hate programming altogether, you can use the filters on the dataset's web page to extract subsets of the data (press "View Data" and then use the "Filter" option).
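
To make the `pandas` route a bit more concrete, here is a minimal sketch of one way to cut the full export down to a single year without holding all 5.58 million rows in memory at once. It is only an illustration under a few assumptions: the date column is taken to be called `CMPLNT_FR_DT` and formatted as `MM/DD/YYYY`, the helper `load_year` is a made-up name, and a tiny in-memory sample stands in for the downloaded file.

```python
import io
import pandas as pd

# Tiny stand-in for the full CSV export so the sketch runs anywhere.
# In practice, pass the path of the downloaded file instead of `sample_csv`.
# ASSUMPTION: the date column is named CMPLNT_FR_DT and uses MM/DD/YYYY.
sample_csv = io.StringIO(
    "CMPLNT_FR_DT,BORO_NM,PD_DESC\n"
    "01/15/2016,MANHATTAN,ASSAULT 3\n"
    "03/02/2015,BROOKLYN,\"LARCENY,PETIT FROM STORE\"\n"
    "07/30/2016,QUEENS,HARASSMENT 2\n"
    "12/31/2016,BRONX,ASSAULT 2\n"
)

def load_year(source, year, chunksize=100_000):
    """Read the CSV in chunks and keep only rows whose date falls in `year`.

    `load_year` is a hypothetical helper written for this sketch.
    """
    columns = ["CMPLNT_FR_DT", "BORO_NM", "PD_DESC"]
    kept = []
    for chunk in pd.read_csv(source, usecols=columns, chunksize=chunksize):
        # Unparseable or out-of-range dates become NaT and are filtered out.
        dates = pd.to_datetime(chunk["CMPLNT_FR_DT"],
                               format="%m/%d/%Y", errors="coerce")
        kept.append(chunk[dates.dt.year == year])
    return pd.concat(kept, ignore_index=True)

# A chunk size of 2 just demonstrates the chunking on the small sample.
crimes_2016 = load_year(sample_csv, 2016, chunksize=2)
print(crimes_2016)
```

Reading only the columns you need (`usecols`) and filtering chunk by chunk keeps memory use modest; the resulting subset can then be saved as a much smaller CSV or JSON file for the browser.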

> *Exercise*: **A doughnut chart** of total crime in the [5 boroughs](https://en.wikipedia.org/wiki/Boroughs_of_New_York_City). You may ask, why a doughnut chart? My answer is: because it's better than a pie chart. It just looks cool.
> 
> * Count the total number of crimes (any crime-type) in each of the five boroughs in 2016 (you may want to take a look at the `BORO_NM` column to find the borough names).
> * Normalize so you get the fraction of all crimes in each borough. 
> * Visualize the fractions in a doughnut chart.
> * Make sure each segment of the doughnut also shows the fraction it represents.
> * Make sure you have labels that show which segment belongs to each borough. 
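
If you preprocess in Python, the counting and normalizing steps above might look roughly like the sketch below. The rows are a made-up mini-sample standing in for the 2016 subset, and the `borough`/`fraction` record keys are just a naming choice for this illustration, not something the dataset or D3 prescribes.

```python
import json
import pandas as pd

# Hypothetical mini-sample standing in for the 2016 subset of the data.
crimes_2016 = pd.DataFrame({
    "BORO_NM": ["BROOKLYN", "BROOKLYN", "MANHATTAN", "QUEENS", "BRONX",
                "BROOKLYN", "MANHATTAN", "STATEN ISLAND", None, "QUEENS"],
})

# value_counts drops missing borough names by default;
# normalize=True divides each count by the total, so fractions sum to 1.
fractions = crimes_2016["BORO_NM"].value_counts(normalize=True)

# One record per borough; the key names are an arbitrary choice.
records = [{"borough": name, "fraction": round(float(frac), 4)}
           for name, frac in fractions.items()]
print(json.dumps(records, indent=2))
```

A small list of records like this is easy to paste into (or load from) your JavaScript and hand to the pie layout from Chapter 13.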

> *Exercise*: **A stacked area chart** of crime over time. To make things simpler, let's look at Manhattan, and let's look at 2016 only.
> 
> * Take a look at the `PD_DESC` column. Create an overview (just for yourself) of the possible crime types. Try to simplify, by only taking the first word if there's a comma-separated list, and removing numbers. So that `ASSAULT 1` and `ASSAULT 2` both simply become `ASSAULT`, etc.
> * What are the 5 most common crime types in Manhattan in 2016?
> * Calculate the total number of occurrences for each of these 5 types in each month of 2016.
> * Create a stacked area chart that displays the data you calculated above.
> * Include tooltips, similar to fig 13-8 in the book.
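
For the preprocessing part of this exercise, here is one possible sketch in `pandas`. It reads "first word" as "everything before the first comma", strips digits, and then builds a month-by-crime-type table. The sample rows are invented, the date column name `CMPLNT_FR_DT` and its `MM/DD/YYYY` format are assumptions, and `simplify` is a hypothetical helper.

```python
import re
import pandas as pd

def simplify(desc):
    """Keep the text before the first comma, drop digits, tidy whitespace."""
    first = str(desc).split(",")[0]
    no_digits = re.sub(r"\d+", "", first)
    return " ".join(no_digits.split())

# Hypothetical rows standing in for Manhattan complaints from 2016.
# ASSUMPTION: dates live in CMPLNT_FR_DT and are formatted MM/DD/YYYY.
df = pd.DataFrame({
    "CMPLNT_FR_DT": ["01/03/2016", "01/20/2016", "02/11/2016",
                     "02/14/2016", "02/28/2016"],
    "PD_DESC": ["ASSAULT 3", "ASSAULT 2,1,UNCLASSIFIED",
                "LARCENY,PETIT FROM STORE", "ASSAULT 3",
                "HARASSMENT,SUBD 3,4,5"],
})

# Rows without a description cannot be classified, so drop them first.
df = df.dropna(subset=["PD_DESC"])
df["crime"] = df["PD_DESC"].map(simplify)
df["month"] = pd.to_datetime(df["CMPLNT_FR_DT"], format="%m/%d/%Y").dt.month

# The (up to) five most common simplified crime types.
top5 = df["crime"].value_counts().head(5).index

# Rows = months, columns = crime types, values = number of occurrences.
monthly = (df[df["crime"].isin(top5)]
           .groupby(["month", "crime"]).size()
           .unstack(fill_value=0))
print(monthly)
```

A wide table like `monthly` (one row per month, one column per crime type) is a convenient shape to export, since a stacked layout typically wants one object per time step with a key per series.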

> *Exercise*: **A network** of fake NYC data. 
> 
> Let's also create a little network, just because networks are cool. It's not easy to generate a meaningful network based on the NYC crime data, so I've put together a little fake network dataset. Create the network with a force-directed layout, and make sure that the nodes are draggable. Play around with values for the various parameters (`d3.forceManyBody()`, etc.).


