# Week 4

Yay! It's week 4. Today we'll keep things light. 

I've noticed that many of you are struggling a bit to keep up and still working on exercises from the previous weeks. Thus, this week we only have two components with no lectures and very little reading. 


## Overview

* An exercise on visualizing geodata using a different set of tools from the ones we played with during Lecture 2.
* Thinking about visualization, data quality, and binning. Why ***looking at the details of the data before applying fancy methods*** is often important.

## Part 1: Visualizing geo-data

It turns out that `plotly` (which we used during Week 2) is not the only way of working with geo-data. There are many different ways to go about it. (The hard-core PhD and PostDoc researchers in my group simply use matplotlib, since that provides more control. For an example of that kind of thing, check out [this one](https://towardsdatascience.com/visualizing-geospatial-data-in-python-e070374fe621).)

Today, we'll try another library for geodata called "[Folium](https://github.com/python-visualization/folium)". It's good for you all to try out a few different libraries - remember that data visualization and analysis in Python is all about the ability to use many different tools. 

The exercise below is based on the code illustrated in this nice [tutorial](https://www.kaggle.com/daveianhickey/how-to-folium-for-maps-heatmaps-time-data), so let us start by taking a look at that one.

*Reading*. Read through the following tutorial
 * "How to: Folium for maps, heatmaps & time data". Get it here: https://www.kaggle.com/daveianhickey/how-to-folium-for-maps-heatmaps-time-data
 * (Optional) There are also some nice tricks in "Spatial Visualizations and Analysis in Python with Folium". Read it here: https://towardsdatascience.com/data-101s-spatial-visualizations-and-analysis-in-python-with-folium-39730da2adf

In [1]:
import numpy as np
import pandas as pd
import folium
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO
from scipy import stats
import requests

> *Exercise 1.1*: A new take on geospatial data. 
>
>A couple of weeks ago (Part 3 of Week 2), we worked with spatial data by using the color intensity of shapefiles to show the counts of certain crimes within individual areas. Today, we study geospatial data by plotting raw data points as well as heatmaps on top of actual maps.
> 


In [2]:
df_crime = pd.read_csv('Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv')

> * First start by plotting a map of San Francisco with a nice tight zoom. Simply use the command `folium.Map([lat, lon], zoom_start=13)`, where you'll have to look up San Francisco's longitude and latitude.


In [15]:
df_crime.Y

0          90.000000
1          37.781896
2          37.799789
3          37.741362
4          37.708200
             ...    
2129520    37.798880
2129521    37.770470
2129522    37.755446
2129523    37.780478
2129524    37.712061
Name: Y, Length: 2129525, dtype: float64
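
The first value in the output above (`Y = 90.0`) is not a plausible latitude for San Francisco, so it is worth filtering such rows out before plotting. Below is a minimal sketch of one way to do that. The bounding box values and the helper name `drop_invalid_coordinates` are assumptions made for illustration, and the toy longitudes are made up.

```python
import pandas as pd

# Hypothetical bounding box around San Francisco (an assumption, adjust as needed).
LAT_MIN, LAT_MAX = 37.6, 37.9
LON_MIN, LON_MAX = -122.6, -122.3

def drop_invalid_coordinates(df, lat_col="Y", lon_col="X"):
    """Return a copy of df keeping only rows whose coordinates fall inside the box."""
    inside = df[lat_col].between(LAT_MIN, LAT_MAX) & df[lon_col].between(LON_MIN, LON_MAX)
    return df[inside].copy()

# Toy data: the first row mimics the suspicious Y = 90.0 value (longitudes are made up).
toy = pd.DataFrame({"Y": [90.0, 37.781896, 37.799789],
                    "X": [-120.5, -122.41, -122.43]})
clean = drop_invalid_coordinates(toy)
print(len(toy), "->", len(clean))
```

On the real data, the same helper could be applied as `df_crime = drop_invalid_coordinates(df_crime)`. Rows with missing coordinates are dropped too, since `between` evaluates to `False` for `NaN`.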

In [17]:
map_SF = folium.Map(location=[37.7713,-122.4227],
                        zoom_start = 13)
map_SF

> * Next, use the coordinates for SF City Hall `37.77919, -122.41914` to indicate its location on the map with a nice, pop-up enabled marker. (In the screenshot below, I used the black & white Stamen tiles because they look cool.)
> <img src="https://raw.githubusercontent.com/suneman/socialdata2022/main/files/city_hall_2022.png" alt="drawing" width="600"/>


In [22]:
map_SF = folium.Map(location=[37.77919,-122.41914],
                        tiles = "Stamen Toner",
                        zoom_start = 13)
folium.Marker([37.77919,-122.41914], popup='City Hall').add_to(map_SF)
map_SF

> * Now, let's plot some more data (no need for pop-ups this time). Select a couple of months of data for `'DRUG/NARCOTIC'` and draw a little dot for each arrest for those two months. You could, for example, choose June-July 2016, but you can choose anything you like - the main concern is to not have too many points as this uses a lot of memory and makes Folium behave non-optimally. 
> We can call this kind of visualization a *point scatter plot*.


In [23]:
# Parse the dates and extract hour-of-day and day-of-month for later use
df_crime['Date'] = pd.to_datetime(df_crime['Date'])
df_crime['Hour'] = pd.to_datetime(df_crime['Time'], format='%H:%M').dt.hour
df_crime['Day'] = df_crime['Date'].dt.day
crime = 'DRUG/NARCOTIC'
df_nar = df_crime[df_crime['Category']==crime].sort_values(by=['Date'])
# June-July 2016: include June 1, exclude August 1
mask = (df_nar['Date'] >= '2016-06-01') & (df_nar['Date'] < '2016-08-01')
df_nar_2 = df_nar[mask]

In [62]:
lat = df_nar_2.Y
lon = df_nar_2.X

map_SF = folium.Map(location=[37.77919,-122.41914],
                        tiles = "Stamen Toner",
                        zoom_start = 13)

# Draw a small dot (no pop-up) for each incident
for cor in zip(lat,lon):
    folium.CircleMarker(list(cor), radius=2, color='red', fill=True).add_to(map_SF)

map_SF
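
The exercise warns that too many points make Folium slow. If the chosen period still contains more points than the map handles comfortably, one option is to plot a random subset. The sketch below assumes a hypothetical helper `thin_points` and a made-up cap of 500 points; neither comes from the tutorial.

```python
import random

def thin_points(points, max_points=500, seed=0):
    """Return at most max_points coordinates, sampled uniformly without replacement."""
    points = list(points)
    if len(points) <= max_points:
        return points
    rng = random.Random(seed)  # fixed seed so the same subset is drawn every run
    return rng.sample(points, max_points)

# Toy coordinates for illustration only.
toy_points = [(37.70 + 0.0001 * i, -122.45 + 0.0001 * i) for i in range(2000)]
subset = thin_points(toy_points, max_points=500)
print(len(subset))
```

With the real data, `thin_points(zip(lat, lon))` would give the list to loop over. Keep in mind that sampling thins dense areas and sparse areas alike, so rare locations may disappear from the map.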


Ok. Time for a little break. Note that a nice thing about Folium is that you can zoom in and out of the maps.



> *Exercise 1.2*: Heatmaps.
> * Now, let's play with **heatmaps**. You can figure out the appropriate commands by grabbing code from the main [tutorial](https://www.kaggle.com/daveianhickey/how-to-folium-for-maps-heatmaps-time-data) and modifying it to suit your needs.


In [30]:
from folium import plugins
from folium.plugins import HeatMap

>    * To create your first heatmap, grab all arrests for the category `'SEX OFFENSES, NON FORCIBLE'` across all time. Play with parameters to get plots you like.


In [77]:
crime = 'SEX OFFENSES, NON FORCIBLE'
# Take an explicit copy so the column assignments below don't trigger SettingWithCopyWarning
df_sex = df_crime[df_crime['Category']==crime].copy()

map_heat = folium.Map(location=[37.77919,-122.41914],
                      tiles = "Stamen Toner",
                    zoom_start = 13) 

# Ensure you're handing it floats
df_sex['Y'] = df_sex['Y'].astype(float)
df_sex['X'] = df_sex['X'].astype(float)

# List comprehension to make our list of [lat, lon] pairs
heat_data = [[row['Y'],row['X']] for index, row in df_sex.iterrows()]

# Plot it on the map (auto_play and max_opacity are HeatMapWithTime options, so they are dropped here)
hm = plugins.HeatMap(heat_data,radius = 12,blur = 5)
hm.add_to(map_heat)

# Display the map
map_heat


>    * Now, comment on the differences between scatter plots and heatmaps. 
>      - What can you see using the scatter-plots that you can't see using the heatmaps? 
>      - And *vice versa*: what do the heatmaps help you see that's difficult to distinguish in the scatter-plots?


Scatter plots show every incident at its exact location, so individual points, isolated incidents, and precise positions remain visible. Heatmaps instead show the density of incidents, which solves the overplotting problem that scatter plots run into when there are many data points. In this case, for example, it is difficult to see subtle changes in the concentration of crimes using markers, whereas the smooth color gradation of the heatmap makes them easy to spot.

>    * Play around with the various parameters for heatmaps. You can find a list here: https://python-visualization.github.io/folium/plugins.html


In [59]:
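# Start from a fresh map so the new layer isn't stacked on top of the previous heatmap
map_heat = folium.Map(location=[37.77919,-122.41914], tiles = "Stamen Toner", zoom_start = 13)
hm = plugins.HeatMap(heat_data,radius = 4,blur = 15)
hm.add_to(map_heat)
map_heat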

>    * Comment on the effect on the various parameters for the heatmaps. How do they change the picture? (at least talk about the `radius` and `blur`).


The smaller the radius, the smaller the area each point contributes to, so the map shows smaller and more separated clusters. The smaller the blur value, the sharper the edges and the clearer the cluster centers.

> For one combination of settings, my heatmap plot looks like this.
> <img src="https://raw.githubusercontent.com/suneman/socialdata2022/main/files/crime_hot_spot.png" alt="drawing" width="600"/>
>    * In that screenshot, I've (manually) highlighted a specific hotspot for this type of crime. Use your detective skills to find out what's going on in that building on the 800 block of Bryant street ... and explain in your own words. 

(*Fun fact*: I remembered the concentration of crime-counts discussed at the end of this exercise from when I did the course back in 2016. It popped up when I used a completely different framework for visualizing geodata called [`geoplotlib`](https://github.com/andrea-cuttone/geoplotlib). You can spot it if you go to that year's [lecture 2](https://nbviewer.jupyter.org/github/suneman/socialdataanalysis2016/blob/master/lectures/Week3.ipynb), exercise 4.)

In [90]:
df_sex[df_sex['Address'].str.contains('BRYANT')].T

Unnamed: 0,2096344
PdId,5093696614030
IncidntNum,50936966
Incident Code,14030
Category,"SEX OFFENSES, NON FORCIBLE"
Descript,INCEST
DayOfWeek,Saturday
Date,2004-08-21 00:00:00
Time,09:00
PdDistrict,SOUTHERN
Resolution,NONE
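
A hotspot like this one often comes from many incidents sharing exactly the same coordinates, which a heatmap renders as a single bright spot. One way to check is to count how many rows share each exact location. The sketch below uses made-up toy rows (the coordinates and the two non-Bryant addresses are invented for illustration), and `top_locations` is a hypothetical helper name.

```python
from collections import Counter

# Made-up rows for illustration only: (latitude, longitude, address).
toy_rows = [
    (37.7751, -122.4037, "800 Block of BRYANT ST"),
    (37.7751, -122.4037, "800 Block of BRYANT ST"),
    (37.7751, -122.4037, "800 Block of BRYANT ST"),
    (37.7600, -122.4200, "MISSION ST / 20TH ST"),
    (37.7850, -122.4100, "ELLIS ST / JONES ST"),
]

def top_locations(rows, n=3):
    """Count how many incidents share exactly the same (lat, lon, address)."""
    return Counter(rows).most_common(n)

for (lat, lon, address), count in top_locations(toy_rows):
    print(count, address, lat, lon)
```

On the real data, the rows could be built with `zip(df_sex['Y'], df_sex['X'], df_sex['Address'])`. A location whose count is far above all others is a candidate for a default or administrative address rather than the place where the incidents actually happened.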


For the final element of working with heatmaps, let's now use the cool Folium functionality `HeatMapWithTime` to create a visualization of how the patterns of your favorite crime-type change over time.

> *Exercise 1.3*: Heatmap movies. This exercise is a bit more independent than above - you get to make all the choices.


> * Start by choosing your favorite crimetype. Preferably one with spatial patterns that change over time (use your data-exploration from the previous lectures to choose a good one).


In [96]:
df_theft['Date'].dt.day

91447      1
18410      1
1925205    1
92708      1
91402      1
          ..
2084145    1
2087933    1
597927     1
158237     1
2090366    1
Name: Date, Length: 10891, dtype: int64

In [100]:
crime = 'LARCENY/THEFT'
df_theft = df_crime[df_crime['Category']==crime].sort_values(by=['Date'])
mask = (df_theft['Date'] >= '12/01/2017') & (df_theft['Date'] <= '03/01/2018')
df_theft = df_theft[mask]

> * Now, choose a time-resolution. You could use daily, weekly, or monthly slices of the data as the frames in your movie. Again the goal is to find interesting temporal patterns to display. We want at least 20 frames though.


> * Create the movie using `HeatMapWithTime`.


In [104]:
df_theft

Unnamed: 0,PdId,IncidntNum,Incident Code,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,...,Fix It Zones as of 2018-02-07 2 2,"CBD, BID and GBD Boundaries as of 2017 2 2","Areas of Vulnerability, 2016 2 2",Central Market/Tenderloin Boundary 2 2,Central Market/Tenderloin Boundary Polygon - Updated 2 2,HSOC Zones as of 2018-06-05 2 2,OWED Public Spaces 2 2,Neighborhoods 2,Hour,Day
91447,17631467806244,176314678,6244,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Friday,2017-12-01,08:45,CENTRAL,NONE,...,,5.0,2.0,,,,,19.0,8,1
18410,17631630106244,176316301,6244,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Friday,2017-12-01,17:00,SOUTHERN,NONE,...,,,2.0,,,3.0,,53.0,17,1
1925205,17631363606224,176313636,6224,LARCENY/THEFT,GRAND THEFT FROM UNLOCKED AUTO,Friday,2017-12-01,08:00,NORTHERN,NONE,...,,8.0,2.0,,,1.0,,100.0,8,1
92708,17631337906244,176313379,6244,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Friday,2017-12-01,13:00,NORTHERN,NONE,...,,,1.0,,,,,22.0,13,1
91402,17097398706304,170973987,6304,LARCENY/THEFT,GRAND THEFT FROM A BUILDING,Friday,2017-12-01,10:22,CENTRAL,NONE,...,,,2.0,,,,,50.0,10,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2084145,18605155306244,186051553,6244,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Thursday,2018-03-01,10:30,PARK,NONE,...,,,1.0,,,,,13.0,10,1
2087933,18605039506372,186050395,6372,LARCENY/THEFT,PETTY THEFT OF PROPERTY,Thursday,2018-03-01,08:50,NORTHERN,NONE,...,21.0,,1.0,,,,,15.0,8,1
597927,18605152506244,186051525,6244,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Thursday,2018-03-01,22:00,TARAVAL,NONE,...,,,2.0,,,,,43.0,22,1
158237,18016166106240,180161661,6240,LARCENY/THEFT,ATTEMPTED THEFT FROM LOCKED VEHICLE,Thursday,2018-03-01,14:10,TARAVAL,NONE,...,,,2.0,,,,,42.0,14,1


In [106]:
map_heat_t = folium.Map(location=[37.77919,-122.41914],
                      tiles = "Stamen Toner",
                    zoom_start = 13) 

# Work on a copy so the column assignments below don't trigger SettingWithCopyWarning
heat_df = df_theft.copy()

# Ensure you're handing it floats
heat_df['Y'] = heat_df['Y'].astype(float)
heat_df['X'] = heat_df['X'].astype(float)

# One frame per hour of the day: use the hour of each incident as the frame key
heat_df['Weight'] = heat_df['Hour'].astype(float)
heat_df = heat_df.dropna(axis=0, subset=['Y','X', 'Weight'])

# List of 24 frames, each a list of [lat, lon] pairs for that hour
heat_data = [heat_df.loc[heat_df['Weight'] == i, ['Y', 'X']].values.tolist() for i in range(0,24)]

# Plot it on the map, labelling each frame with its hour
hm = plugins.HeatMapWithTime(heat_data, index=[f'{i:02d}:00' for i in range(24)], auto_play=True, max_opacity=0.8)
hm.add_to(map_heat_t)
# Display the map
map_heat_t
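
The cell above uses hour-of-day as the time resolution. If calendar time is more interesting for your crime type, the frames can instead be built per week. The following is a minimal sketch under that assumption. `weekly_frames` is a hypothetical helper, and the toy rows are made up.

```python
import pandas as pd

def weekly_frames(df, date_col="Date", lat_col="Y", lon_col="X"):
    """Return (labels, frames): one list of [lat, lon] pairs per calendar week."""
    df = df.dropna(subset=[date_col, lat_col, lon_col])
    # Map every date to the Monday that starts its week.
    week_start = df[date_col].dt.to_period("W").dt.start_time
    labels, frames = [], []
    for start, group in df.groupby(week_start):
        labels.append(start.strftime("%Y-%m-%d"))
        frames.append(group[[lat_col, lon_col]].values.tolist())
    return labels, frames

# Toy data for illustration only.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-12-04", "2017-12-05", "2017-12-12"]),
    "Y": [37.78, 37.79, 37.77],
    "X": [-122.41, -122.42, -122.40],
})
labels, frames = weekly_frames(toy)
print(labels, [len(f) for f in frames])
```

The result can be passed along as `plugins.HeatMapWithTime(frames, index=labels)`. Note that three months of data give only about 13 weekly frames, so a longer date range would be needed to reach the 20 frames the exercise asks for.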


> * Comment on your results: 
>   - What patterns does your movie reveal?
>   - Motivate/explain the reasoning behind your choice of crimetype and time-resolution. 

It is clear that the number of thefts decreases as we progress into the daytime hours and peaks around midnight. This finding is consistent with the common belief that the dark night hours are a strong incentive for theft, as the environment is ideal for hiding and people are less alert while they sleep. 

## Part 2: Errors in the data. The importance of looking at raw (or close to raw) data.

We started the course by plotting simple histograms and bar plots that showed a lot of cool patterns. But sometimes the binning can hide imprecision, irregularity, and simple errors in the data that could be misleading. In the work we've done so far, we've already come across at least three examples of this in the SF data. 

1. In the temporal activity for `PROSTITUTION` something surprising is going on on Thursday. Remind yourself [**here**](https://raw.githubusercontent.com/suneman/socialdata2022/main/files/prostitution.png), where I've highlighted the phenomenon I'm talking about.
2. When we investigated the details of how the timestamps are recorded using jitter-plots, we saw that many more crimes were recorded e.g. on the hour, 15 minutes past the hour, and to a lesser extent in whole increments of 10 minutes. Crimes didn't appear to be recorded as frequently in between those round numbers. Remind yourself [**here**](https://raw.githubusercontent.com/suneman/socialdata2022/main/files/jitter.png), where I've highlighted the phenomenon I'm talking about.
3. And, today we saw that the Hall of Justice seemed to be an unlikely hotspot for sex offences. Remind yourself [**here**](https://raw.githubusercontent.com/suneman/socialdata2022/main/files/crime_hot_spot.png).
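
The rounding of timestamps in item 2 can be checked without any plotting by counting the minute-of-hour values directly. The sketch below is one way to do it. `minute_counts` is a hypothetical helper, and the nine sample times are the `Time` values visible in the `LARCENY/THEFT` table output earlier in this notebook.

```python
from collections import Counter

def minute_counts(times):
    """Count how often each minute-of-hour (0-59) appears in 'HH:MM' strings."""
    return Counter(int(t.split(":")[1]) for t in times)

# Time values copied from the df_theft rows shown earlier.
toy_times = ["08:45", "17:00", "08:00", "13:00", "10:22",
             "10:30", "08:50", "22:00", "14:10"]
counts = minute_counts(toy_times)

# Share of timestamps that fall on a multiple of five minutes.
round_share = sum(c for m, c in counts.items() if m % 5 == 0) / len(toy_times)
print(counts.most_common(3), round(round_share, 2))
```

If minutes were recorded without rounding, only about 20% of the timestamps (12 out of 60 possible minutes) would land on a multiple of five. A much larger share, as in this small sample, points to human rounding.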

> *Exercise 2*: Data errors. The data errors we discovered above become difficult to notice when we aggregate data (and when we calculate mean values, as well as statistics more generally). Thus, when we visualize, errors become difficult to notice when binning the data. We explore this process in the exercise below.
>
>This last exercise for today has two parts:
> * In each of the examples above, describe in your own words how the data-errors I call attention to above can bias the binned versions of the data. Also, briefly mention how not noticing these errors can result in misconceptions about the underlying patterns of what's going on in San Francisco (and our modeling).
> * Find your own example of human noise in the data and visualize it.
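
As a small illustration of how aggregation can hide a data error, consider the five latitude values printed at the top of the `df_crime.Y` output earlier in this notebook, one of which is the implausible `90.0`. The sketch below only compares two summary statistics on those values; it is not a full answer to the exercise.

```python
from statistics import mean, median

# The first five Y values from the df_crime.Y output shown earlier.
latitudes = [90.0, 37.781896, 37.799789, 37.741362, 37.708200]

lat_mean = mean(latitudes)      # pulled far north by the single bad value
lat_median = median(latitudes)  # stays inside San Francisco

print(round(lat_mean, 2), lat_median)
```

A single erroneous row moves the mean by more than ten degrees of latitude while leaving the median untouched. Looking only at the aggregated number gives no hint of which row caused the shift, which is why inspecting the raw values first is worthwhile.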