## Motivation

Crime is a major concern in many cities around the world, and societies are always looking for ways to reduce crime rates and make cities safer. If we don't feel safe walking around our city, we can't enjoy its amenities or be as happy and productive as we could be. Our first goal with this project is therefore to identify which crimes matter most to travelers and residents in a city. We chose to analyze New York City because it is one of the largest and most visited cities in the world, has a large number of reported crimes, and publishes abundant, free data. Our second goal is to identify when and where these crimes occur most frequently, so that we can provide a tool that helps people make informed decisions about their safety while travelling, working, or living in New York City. Our final goal is to examine how socio-economic factors and other variables can influence or help explain crime rates in the city. Together, this should help people stay safe and make better decisions about where to live, work, and travel in New York City.


- What is your dataset?

Our chosen dataset is the New York City Police Department (NYPD) Complaint Data Historic [dataset](https://data.cityofnewyork.us/d/qgea-i56i?category=Public-Safety&view_name=NYPD-Complaint-Data-Historic), which can be found on the [NYC Open Data](https://opendata.cityofnewyork.us/) website. The dataset includes all valid felony, misdemeanor, and violation crimes reported to the NYPD from 2006 to 23 April 2024, the date on which we downloaded it.

- Why did you choose this/these particular dataset(s)?

We chose this dataset because it is the largest and most comprehensive dataset available for crimes in New York City. We therefore felt it would be the most representative of true crime rates, and that it would leave us room to narrow our data story into something more specific while still having enough data to convey our message accurately.

- What was your goal for the end user's experience?

We want the user to get a glimpse of which crimes are important to look out for, and to give them a tool with which they can explore the data further using socio-economic overlays, as well as plot a route that avoids crime hotspots.

## Basic Stats

- Write about your choices in data cleaning and preprocessing


**Data Cleaning**

In [4]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import json
import os

# Load the NYPD complaint data
# Note: this file is too large to be included in the repository, so it must be downloaded from:
# https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i?category=Public-Safety&view_name=NYPD-Complaint-Data-Historic
filename = 'NYPD_Complaint_Data_Historic_20240423.csv'
df = pd.read_csv(filename)
df_original = df.copy()  # keep an untouched copy for reference



In [5]:
# The dataset size
print('Total number of crimes', len(df)) # 8.5 million rows
# Number of variables
print("Variables: ", df.shape[1]) # 35 columns
# Show first 5 rows with all columns
df.head()

Total number of crimes 8496991
Variables:  35


|  | CMPLNT_NUM | CMPLNT_FR_DT | CMPLNT_FR_TM | CMPLNT_TO_DT | CMPLNT_TO_TM | ADDR_PCT_CD | RPT_DT | KY_CD | OFNS_DESC | PD_CD | ... | SUSP_SEX | TRANSIT_DISTRICT | Latitude | Longitude | Lat_Lon | PATROL_BORO | STATION_NAME | VIC_AGE_GROUP | VIC_RACE | VIC_SEX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16784525 | 06/17/2006 | 00:30:00 |  | (null) | 6.0 | 06/18/2006 | 578 | HARRASSMENT 2 | 638.0 | ... | F |  | 40.734091 | -74.006238 | (40.734091, -74.006238) | PATROL BORO MAN SOUTH | (null) | 45-64 | WHITE | F |
| 1 | 10973318 | 04/06/2006 | 09:30:00 |  | (null) | 6.0 | 04/12/2006 | 578 | HARRASSMENT 2 | 638.0 | ... | F |  | 40.741288 | -74.006167 | (40.741288, -74.006167) | PATROL BORO MAN SOUTH | (null) | 45-64 | WHITE | F |
| 2 | 23859785 | 08/12/2006 | 11:29:00 |  | (null) | 20.0 | 08/12/2006 | 105 | ROBBERY | 361.0 | ... | M |  | 40.775083 | -73.982182 | (40.775083, -73.982182) | PATROL BORO MAN NORTH | (null) | (null) | UNKNOWN | D |
| 3 | 16544671 | 05/22/2006 | 16:30:00 |  | (null) | 47.0 | 05/22/2006 | 105 | ROBBERY | 380.0 | ... | M |  | 40.903862 | -73.846994 | (40.903862, -73.846994) | PATROL BORO BRONX | (null) | 25-44 | WHITE | M |
| 4 | 16856905 | 06/01/2005 | 00:01:00 | 06/18/2006 | 11:00:00 |  | 06/21/2006 | 104 | RAPE | 153.0 | ... | M |  |  |  |  | (null) | (null) | <18 | BLACK | F |


In [6]:
# Renaming columns for better readability
df = df.rename(columns={
    'CMPLNT_NUM': 'Complaint_ID',
    'ADDR_PCT_CD': 'Precinct',
    'BORO': 'Borough',
    'BORO_NM': 'Borough_Name',
    'CMPLNT_FR_DT': 'Complaint_From_Date',
    'CMPLNT_FR_TM': 'Complaint_From_Time',
    'CMPLNT_TO_DT': 'Complaint_To_Date',
    'CMPLNT_TO_TM': 'Complaint_To_Time',
    'CRM_ATPT_CPTD_CD': 'Crime_Completed',
    'HADEVELOPT': 'NYCHA_Housing',
    'HOUSING_PSA': 'Housing_PSA',
    'JURISDICTION_CODE': 'Jurisdiction_Code',
    'JURIS_DESC': 'Jurisdiction_Description',
    'KY_CD': 'Offense_Code',
    'LAW_CAT_CD': 'Offense_Level',
    'LOC_OF_OCCUR_DESC': 'Location_Type',
    'OFNS_DESC': 'Offense_Description',
    'PARKS_NM': 'Park_Name',
    'PATROL_BORO': 'Patrol_Borough',
    'PD_CD': 'Internal_Classification_Code',
    'PD_DESC': 'Internal_Classification_Description',
    'PREM_TYP_DESC': 'Premises_Type',
    'RPT_DT': 'Report_Date',
    'STATION_NAME': 'Transit_Station_Name',
    'SUSP_AGE_GROUP': 'Suspect_Age_Group',
    'SUSP_RACE': 'Suspect_Race',
    'SUSP_SEX': 'Suspect_Sex',
    'TRANSIT_DISTRICT': 'Transit_District',
    'VIC_AGE_GROUP': 'Victim_Age_Group',
    'VIC_RACE': 'Victim_Race',
    'VIC_SEX': 'Victim_Sex',
    'X_COORD_CD': 'X_Coordinate',
    'Y_COORD_CD': 'Y_Coordinate',
    'Latitude': 'Latitude',
    'Longitude': 'Longitude'
})

In [7]:
# Clean data

#Number of initial NaN values: 10 110 435
#Number of "(null)" string values: 50 220 416
#Number of "UNKNOWN" string values: 6 204 102
#True number of NaN values: 66 534 953

# Replace all "(null)" and "UNKNOWN" values with NaN in a single call so .isna() can be used
values_to_replace = ['(null)', 'UNKNOWN']
df.replace(values_to_replace, np.nan, inplace=True)
# count true NaN values
nan_count = df.isna().sum().sum()
print(f'True number of NaN values: {nan_count}') # 66.5 mil

True number of NaN values: 66534953


In [8]:
# Percentage of missing values in each column
(df.isna().sum()/df.shape[0]*100).sort_values(ascending=False)

NYCHA_Housing                          99.643474
Park_Name                              99.591761
Transit_District                       97.773565
Transit_Station_Name                   97.773565
Housing_PSA                            92.432015
Suspect_Age_Group                      66.353654
Suspect_Race                           58.202157
Suspect_Sex                            44.298340
Victim_Race                            32.586936
Victim_Age_Group                       30.994878
Complaint_To_Date                      21.083464
Complaint_To_Time                      21.022995
Location_Type                          20.467622
Premises_Type                           0.286396
Offense_Description                     0.222102
Internal_Classification_Code            0.085842
Internal_Classification_Description     0.085842
Borough_Name                            0.080876
Precinct                                0.008250
Complaint_From_Date                     0.007709
Patrol_Borough      

**Which columns can we remove?**

Looking at the columns with more than 1% missing values, we make a qualitative assessment, as seen below:

Yes:

* NYCHA_Housing: Yes, as it only names the NYCHA (New York City Housing Authority) housing development where the crime occurred, if any, which is not relevant to our analysis.
* Housing_PSA: Yes, as it is the Development Level Code and we already have the geographical location.
* Park_Name, Transit_District, Transit_Station_Name: Yes, as the geographical location is already provided in the Latitude and Longitude columns.

No:


* Suspect and Victim columns: No, as they remain relevant when looking for safer travel routes in NYC
* Complaint_To_Date and Complaint_To_Time columns: No, because they are important for crimes that span a period of time
  * E.g. Complaint ID 16856905 was a RAPE (offense code 104) lasting from 06/01/2005 00:01:00 to 06/18/2006 11:00:00; its precinct is not recorded
* Location_Type: No, as it describes the location of the crime [inside, outside, front of, opposite of, rear of] relative to the Premises_Type [street, residence-house, commercial building, park/playground, etc.]
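The 1% screening step described above can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the notebook's actual code; the helper name `missing_share` and the toy values are assumptions introduced for the example.

```python
import numpy as np
import pandas as pd

def missing_share(frame, sentinels=("(null)", "UNKNOWN")):
    """Percentage of missing values per column, highest first.

    Sentinel strings are treated as missing, mirroring the cleaning step.
    """
    cleaned = frame.replace(list(sentinels), np.nan)
    return (cleaned.isna().mean() * 100).sort_values(ascending=False)

# Toy data (made up for illustration only)
toy = pd.DataFrame({
    "Park_Name": ["(null)", "(null)", "(null)", "CENTRAL PARK"],
    "Victim_Race": ["WHITE", "UNKNOWN", "BLACK", "WHITE"],
    "Offense_Code": [578, 105, 105, 104],
})

share = missing_share(toy)
# Columns above the 1% threshold are the ones assessed qualitatively
needs_review = share[share > 1].index.tolist()
print(share)
print(needs_review)
```

Columns at or below the threshold are kept, and their few incomplete rows are dropped instead.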


**Which rows should we remove?**

All columns with less than 1% missing values hold information that is central to a police complaint entry, so we remove the rows with NaN values in these columns.
This is consistent with what we need from the dataset: geographical data (latitude, longitude, borough), time (from date), and crime type.


In [9]:
# We need geographical data (lat, long, borough), time (from date) and crime type,
# so we remove rows with NaN values in the columns below.
important_columns = [
    'Jurisdiction_Code', 
    'Report_Date', 
    'Offense_Code', 
    'Offense_Level', 
    'Jurisdiction_Description', 
    'Complaint_ID',
    'Complaint_From_Date', 
    'Complaint_From_Time', 
    'Crime_Completed',
    'Victim_Sex',
    'Latitude', 
    'Longitude', 
    'Y_Coordinate', 
    'X_Coordinate', 
    'Lat_Lon',
    'Patrol_Borough',
    'Precinct',
    'Borough_Name',
    'Internal_Classification_Code',
    'Internal_Classification_Description',
    'Offense_Description',
    'Premises_Type']

# Drop rows with NaN values from important columns
df = df.dropna(subset=important_columns) # 8.44 mil rows left

# Drop the columns that were judged unnecessary:
# * NYCHA_Housing: names the NYCHA housing development, which is not relevant.
# * Housing_PSA: the Development Level Code; we already have the geographical location.
# * Park_Name, Transit_District, Transit_Station_Name: the location is already given by Latitude and Longitude.
df = df.drop(columns=['NYCHA_Housing', 'Housing_PSA', 'Park_Name', 'Transit_District', 'Transit_Station_Name'])

**Year exploration**

The dataset is organised by the date a complaint was reported (Report_Date), whereas Complaint_From_Date holds the date the incident occurred or started.
* Therefore some crimes may date from before 2006
* If an incident results in a victim's death, the incident is upgraded to murder and the date is changed to the date of the victim's death!
* If there is a Complaint_To_Date, the crime occurred over a time range.
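The gap between an incident's start date and its report date can be checked on the sample rows from the `df.head()` output, where row 4 starts on 06/01/2005 but is reported on 06/21/2006. The sketch below uses only the standard library; the helper names `parse_date`, `report_lag_days` and `in_study_period` are hypothetical and not part of the notebook.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # format of the date columns, e.g. '06/17/2006'

def parse_date(text):
    """Parse a 'MM/DD/YYYY' string into a date."""
    return datetime.strptime(text, DATE_FORMAT).date()

def report_lag_days(from_date, report_date):
    """Days between the start of an incident and the day it was reported."""
    return (parse_date(report_date) - parse_date(from_date)).days

def in_study_period(from_date, first_year=2006, last_year=2022):
    """True if the incident's start year lies within the years we keep."""
    return first_year <= parse_date(from_date).year <= last_year

# (Complaint_From_Date, Report_Date) pairs copied from rows 0 and 4 of df.head()
lag_row0 = report_lag_days("06/17/2006", "06/18/2006")
lag_row4 = report_lag_days("06/01/2005", "06/21/2006")
print(lag_row0, lag_row4)
print(in_study_period("06/01/2005"), in_study_period("06/17/2006"))
```

Row 4 illustrates why a dataset of complaints reported from 2006 onwards can still contain start dates before 2006.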

In [10]:
# Check years for anomalies
years = df['Complaint_From_Date'].str.split('/').str[-1] # e.g. split '06/17/2006' into '2006'
# convert to int and sort 
unique_years = years.astype(int).sort_values().unique()
crime_counts_per_year = years.value_counts().sort_index()
# Show all rows when printing
pd.set_option('display.max_rows', None)
print('Unique years:', unique_years, 'Counts:', crime_counts_per_year)

# Print how many crimes before 2006 and after 2019
print('Crimes before 2006:', len(df[years.astype(int) < 2006])) # 15k
print('Crimes after 2019:', len(df[years.astype(int) > 2019])) # 1.5mil
print('Crimes after 2022:', len(df[years.astype(int) > 2022])) # 129k

# Remove crimes before 2006, as these records are inconsistent and biased,
# and crimes after 2022, as the data appears to be incomplete
df = df[(years.astype(int) >= 2006) & (years.astype(int) <= 2022)]
print("Length of df after filtering years:", len(df)) # 8.3 mil

Unique years: [1010 1011 1014 1015 1016 1017 1018 1019 1020 1021 1022 1024 1025 1026
 1027 1028 1029 1900 1905 1906 1908 1909 1910 1911 1912 1913 1914 1915
 1916 1917 1918 1919 1920 1921 1922 1924 1928 1929 1930 1938 1940 1941
 1942 1945 1946 1947 1948 1949 1950 1951 1954 1955 1956 1957 1958 1959
 1960 1961 1962 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974
 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988
 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002
 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
 2017 2018 2019 2020 2021 2022 2023 2024] Counts: 1010         6
1011         3
1014         1
1015         9
1016        18
1017        24
1018        26
1019        13
1020         2
1021         7
1022        10
1024         1
1025         1
1026         5
1027         3
1028        10
1029         4
1900         6
1905         2
1906         1
1908         3
1909         3
1910         9
1911         8
1912    

Crime counts per year:
* From 2006 to 2022 we have ~500k reports per year
* 2023 has a uniquely low count at ~6k
* 2024 is at ~123k because it only covers the period up to 23 April

We choose to keep only data from 2006 to 2022.

**Crime types**

A crime complaint only shows the most serious offense, even if the incident involved multiple offenses.
* Attempted crimes are also included, even if unsuccessful

In [11]:
# Looking at the 25 most common crimes
focus_crimes = df['Offense_Description'].value_counts().head(25)

# Drop all other crimes except the focus_crimes
df = df[df['Offense_Description'].isin(focus_crimes.index)]

In [12]:
# Export the cleaned data
df.to_csv('NYPD_Complaint_Data_Cleaned.csv', index=False) # index False as row number not needed

- Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis.

The basic dataset stats are given in the code above, and are summarised here:
* Roughly 8.5 million entries with 35 columns, which we cleaned to about 8.3 million entries with 30 columns.
* The cleaned dataset contains complaints whose Complaint_From_Date falls between 2006 and 2022.
* NaN values and some columns were removed where there was reason to, based on a qualitative assessment.
* Crime types were filtered to the 25 most common crimes, to make sure there was enough data for a good analysis.

Note that additional information on the dataset is given in the NYPD_Complaint_Incident_Level_Data_Footnotes document, which can be found under [attachments](https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i/about_data?category=Public-Safety&view_name=NYPD-Complaint-Data-Historic). For example, it states that certain locations are anonymised, e.g. when they relate to rape or sex crimes, to protect the victim's identity.

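The most important columns for our analysis are:
* Complaint_From_Date (time)
* Latitude, Longitude and Borough_Name (geographical location)
* Offense_Description (crime type)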






## Data Analysis

- Describe your data analysis and explain what you've learned about the dataset.
- If relevant, talk about your machine-learning.

In [None]:
# Bar chart of the number of crimes per year in the cleaned data
years_updated = df['Complaint_From_Date'].str.split('/').str[-1]
crime_counts_per_year_updated = years_updated.value_counts().sort_index()
plt.figure(figsize=(10, 6))
plt.bar(crime_counts_per_year_updated.index, crime_counts_per_year_updated.values)
plt.title('Number of Crimes per Year')
plt.xlabel('Year')
plt.ylabel('Number of Crimes')
plt.show()


In [None]:
# Looking at the 25 most common crimes
focus_crimes = df['Offense_Description'].value_counts().head(25)
plt.figure(figsize=(10, 6))
plt.barh(focus_crimes.index, focus_crimes.values)
plt.title('Top 25 Most Common Crimes')
plt.xlabel('Number of Crimes')
plt.ylabel('Crime Description')
plt.gca().invert_yaxis()
plt.show()


**Socioeconomic Data Analysis**

The analysis aims to highlight potential societal problems that may contribute to increased crime in NYC. We focus on poverty levels and total enrollment in lower educational institutions between 2013 and 2018, as previous studies have suggested that these factors are causally linked to rises in criminal activity. We explore the relationship between these variables through linear fitting, correlation coefficients, and visual inspection of several bar plots depicting crime rates across the NYC boroughs.
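The linear fit and correlation coefficient mentioned above can be sketched with plain Python. This is an illustration under stated assumptions: the function names are hypothetical, and the input values are made up rather than taken from the project's poverty and enrollment data.

```python
import math

def linear_fit(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# Made-up values for illustration only (here y = 2x + 1 exactly)
poverty_rate = [10.0, 15.0, 20.0, 25.0, 30.0]
enrollment = [21.0, 31.0, 41.0, 51.0, 61.0]

slope, intercept = linear_fit(poverty_rate, enrollment)
r = pearson_r(poverty_rate, enrollment)
print(slope, intercept, r)
```

A coefficient close to 1 or -1 indicates a strong linear association between the two variables; it does not by itself show that one causes the other.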

This section is meant to deepen users' understanding of crime and, ideally, to reduce prejudice around it. It should not be read as placing the blame solely on the individuals who commit such crimes.

## Genre

Which genre of data story did you use?

- Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel and Heer). Why?
- Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel and Heer). Why?

## Visualizations
- Explain the visualizations you've chosen.
- Why are they right for the story you want to tell?

##### **Bar Charts**
Oliver, write stuff here

##### **Line & Polar Plot**
Oliver, write stuff here (also, I honestly can't read the figure text; we need to increase the font size)


##### **Bokeh Plots**
Bokeh plots provide a quantitative view of how often specific focus crimes occur in each hour of the day, filtered by particular days and months. This helps users understand when the focus crimes are concentrated. The interactive functionality offers a user-friendly and practical way to get an overview of the crimes and to compare them visually.

These plots are intended to support planning by people who live in, work in, or visit NYC, as illustrated by Use Cases 1 and 2.
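The hourly aggregation behind such plots can be sketched as below. This is a minimal standard-library illustration, not the project's Bokeh code; the function name `hourly_counts` and its parameters are assumptions, and the sample records are the date and time values from the first four rows of the `df.head()` output.

```python
from collections import Counter
from datetime import datetime

def hourly_counts(records, month=None, weekday=None):
    """Count complaints per hour of day, optionally for one month and/or weekday.

    records: iterable of ('MM/DD/YYYY', 'HH:MM:SS') string pairs.
    month: 1-12; weekday: 0 (Monday) to 6 (Sunday); None disables a filter.
    """
    counts = Counter()
    for date_text, time_text in records:
        moment = datetime.strptime(f"{date_text} {time_text}", "%m/%d/%Y %H:%M:%S")
        if month is not None and moment.month != month:
            continue
        if weekday is not None and moment.weekday() != weekday:
            continue
        counts[moment.hour] += 1
    # Return all 24 hours so that empty hours appear as zero-height bars
    return [counts.get(hour, 0) for hour in range(24)]

sample = [
    ("06/17/2006", "00:30:00"),
    ("04/06/2006", "09:30:00"),
    ("08/12/2006", "11:29:00"),
    ("05/22/2006", "16:30:00"),
]

all_hours = hourly_counts(sample)
june_only = hourly_counts(sample, month=6)
print(all_hours)
print(june_only)
```

The resulting 24-element list can be passed as bar heights to any plotting library, with one list per focus crime for comparison.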

##### **Use Case 1**

The user-customizable heatmap is central to Use Case 1, providing a dynamic visualization that represents crime data spatially and temporally across New York City. This visualization is chosen because it allows users to interact with the data directly, adjusting parameters to see how crime hotspots change based on various factors such as time of day, type of crime, and date. Given our proposed user segment, we believe this tool really complements that narrative and could be useful for both educational and practical purposes. We spent a lot of time on the visualization of Use Case 1 to achieve the following features, which we deem important:
##### User-Customizable Heatmap
- **Interactive Experience**: Allows users to interact with and manipulate the data to see how crime hotspots change based on time, type, and location. This empowers users, making the data relevant and personalized to their needs.
- **Practical Use**: Users can plan their movements around the city more safely by visualizing crime patterns related to their personal schedules or routes.
##### Navigation Features
- **Actionable Insights**: Integrates practical navigation tools with crime data, providing users with safe travel routes. This direct application helps in transforming abstract crime data into concrete safety measures.
- **Enhanced Engagement**: Users are more likely to use and benefit from a tool that actively helps them navigate through, and avoid, high-risk areas.
##### NYPD Precinct Locations
- **Resource Visibility**: Shows where NYPD precincts are located, providing users with knowledge of nearby law enforcement resources, which can be reassuring.
- **Contextual Understanding**: Enhances the reliability of the tool by contextualizing crime data with police presence, offering insights into safety resource distribution across the city.

##### Behind the visualization
- The entire code for the visualization can be seen in the file "html_templates/usecase1.html" and the data prep in "data analysis & preprocessing/heatmap-prep.ipynb"
- The heatmap was constructed using "Leaflet", which is an interactive JavaScript library for maps
- The input controls were built in JavaScript
- The route planner was built using the Mapbox API
- The data for the NYPD precincts was extracted from [Precincts](https://www.nyc.gov/site/nypd/bureaus/patrol/precincts-landing.page)
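One common way to prepare point data for a web heatmap is to aggregate coordinates into small grid cells and export `[lat, lon, weight]` triples. The sketch below illustrates that idea only; it is not the code in "heatmap-prep.ipynb", the function name `heatmap_cells` is hypothetical, and the second point is made up to show two complaints falling into the same cell.

```python
from collections import Counter

def heatmap_cells(points, decimals=3):
    """Aggregate (lat, lon) points into grid cells and count points per cell.

    Rounding to 3 decimals gives cells of roughly 110 m in latitude.
    Returns a sorted list of [lat, lon, count] triples.
    """
    counts = Counter(
        (round(lat, decimals), round(lon, decimals)) for lat, lon in points
    )
    return [[lat, lon, n] for (lat, lon), n in sorted(counts.items())]

points = [
    (40.734091, -74.006238),  # row 0 of df.head()
    (40.734120, -74.006300),  # made-up point close to row 0
    (40.775083, -73.982182),  # row 2 of df.head()
]

cells = heatmap_cells(points)
print(cells)
```

Aggregating first keeps the file sent to the browser small, which matters for a static page that has to load millions of complaints.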

##### **Use Case 2**

The visualizations for Use Case 2 are designed to provide a comprehensive tool for business owners to identify the safest and most strategic locations for their businesses in New York City. By combining crime data with socioeconomic factors and interactive markers, the tool not only informs about potential risks but also facilitates a holistic approach to business location planning. This ensures that decisions are well-rounded, taking into account both security and community characteristics, thereby aligning with the overarching goal of enhancing the safety of New Yorkers through data-driven insights.

##### User-Customizable Crime Heatmap
- **Interactive Analysis**: Allows users to explore crime data specifically targeting businesses across New York City. This interactivity helps in understanding local crime dynamics affecting business locations.
- **Direct Application**: Business owners can visualize and assess potential risks in different areas, helping them to make informed decisions about where to set up or relocate their operations.

##### Movable Markers for Potential Business Locations
- **Practical Planning Tool**: Users can place up to five markers on the map to consider potential business locations. This feature enables interactive exploration of safety considerations for multiple locations simultaneously.
- **Spatial Decision Support**: By moving markers and observing changes in the heatmap, users can dynamically assess the impact of crime on potential business sites, leading to optimized location strategies.
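Comparing candidate locations can be made concrete by counting complaints within a fixed radius of each marker. The sketch below uses the haversine great-circle distance; it is an illustration rather than the tool's implementation, the names `haversine_km` and `crimes_within` are hypothetical, and the marker position is made up. The crime coordinates are the four located rows of the `df.head()` output.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def crimes_within(marker, crimes, radius_km=0.5):
    """Number of crime locations no further than radius_km from the marker."""
    lat, lon = marker
    return sum(
        1 for c_lat, c_lon in crimes
        if haversine_km(lat, lon, c_lat, c_lon) <= radius_km
    )

crimes = [
    (40.734091, -74.006238),
    (40.741288, -74.006167),
    (40.775083, -73.982182),
    (40.903862, -73.846994),
]
marker = (40.7345, -74.0060)  # hypothetical candidate business location

nearby_1km = crimes_within(marker, crimes, radius_km=1.0)
nearby_500m = crimes_within(marker, crimes, radius_km=0.5)
print(nearby_1km, nearby_500m)
```

Running the same count for each of the up to five markers gives a simple number to compare alongside the visual heatmap.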

##### Toggleable Socioeconomic Features
- **Contextual Insights**: The ability to toggle various borough-level socioeconomic features (like population density, economic status, and demographic profiles) provides deeper insights into the broader context that might influence crime rates and business success.
- **Enhanced Decision Making**: These features allow users to consider not just crime but also economic and social factors that are crucial for strategic business planning and location selection.

##### Behind the visualization
- The entire code for the visualization can be seen in the file "html_templates/usecase2.html" and the data prep in "data analysis & preprocessing/heatmap-prep.ipynb"
- The heatmap was constructed using "Leaflet", which is an interactive JavaScript library for maps
- The input controls were built in JavaScript
- The socioeconomic data was taken from the dataset "2020 Census Data-census tracts & higher", which was downloaded from [2020 Census](https://www.nyc.gov/site/planning/planning-level/nyc-population/2020-census.page)

##### **Socioeconomic Plots**
The following plots were created for this section: 
- **Linear fit**: A linear fit was applied between poverty and total enrollment to highlight a potential relationship between variables, which, as indicated in referenced papers, may contribute to an increased occurrence of crime. The correlation coefficient between the variables was also calculated.
- **Bar plots**: Across the boroughs of NYC, the reported number of individuals relative to the population is visualized for comparative purposes, both unnormalized and normalized. Additionally, the occurrence of crime incidents, relative to the proportion of the population in each borough, is depicted through a separate bar plot. Here, a comparative analysis is utilized to illustrate the possible connection between boroughs with high poverty rates and a high occurrence of crimes.
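The difference between the unnormalized and normalized bar plots comes down to dividing each borough's count by its population. A minimal sketch, with a hypothetical helper name and made-up numbers that are neither census nor NYPD figures:

```python
def rate_per_1000(counts, population):
    """Convert raw counts per borough into counts per 1,000 residents."""
    return {name: 1000 * counts[name] / population[name] for name in counts}

# Made-up numbers for illustration only
crime_counts = {"Borough A": 50_000, "Borough B": 20_000}
residents = {"Borough A": 2_000_000, "Borough B": 500_000}

rates = rate_per_1000(crime_counts, residents)
print(rates)
```

In this toy example Borough A has more crimes in absolute terms, but Borough B has the higher rate per resident, which is why both views are shown side by side.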

## Discussion

- What went well?
- What is still missing? What could be improved? Why?

#### Johan's Notes: What went well?
- In the final project we believe we have achieved a more compelling and interactive data story than the previous one. This was in part due to the lack of restrictions on the number of figures. The added real estate enabled us to make several high-quality visualizations, such as the Bokeh plots and Use Cases, which enable the user to truly dive into the data and explore it themselves.
- We also believe we were able to achieve a new level of depth in our story, integrating multiple sources of data that complement each other
- We placed great emphasis on user interactivity, and we believe we achieved an impressive standard given the restrictions of a static webpage such as GitHub Pages

#### Johan's Notes: What is still missing? What could be improved? Why?
- With more time we would have loved to dive even deeper into the nitty-gritty details of crime prevention, and perhaps investigate crime from a more practical perspective, such as tips and tricks that the everyday citizen could employ to minimize the risk of becoming a victim of a crime
- Given our backgrounds studying AI & Data, we would also have loved to conduct actual machine learning research, perhaps to create models for preventative measures or to quantify risk levels for given circumstances. One could even envision the data the user enters on the interactive maps being passed as input to an ML model.
- It would have been very interesting to spend more time on the interplay between socioeconomics and crime, perhaps including those factors in the aforementioned ML models


## Contributions
Who did what?

- You should write (just briefly) which group member was the main responsible for which elements of the assignment. (I want you guys to understand every part of the assignment, but usually there is someone who took lead role on certain portions of the work. That's what you should explain).
- It is not OK simply to write "All group members contributed equally".


#### Johan Böcher Hanehøj - s194495 (primary responsibilities)
- Website creation and maintenance
- Creating/managing the project backlog (GitHub Projects)
- Use Case 1 (creation of the visualization and the related data processing / text writing / research)
- Use Case 2 (creation of the visualization and the related data processing / text writing / research)
- Explainer Notebook


#### Oliver Rosbæk Elmgreen - s204070 (primary responsibilities)
- Initial data cleaning and preprocessing
- Introduction, focus crimes, summary sections on website (visualizations / text / research)
- Motivation, Basic Stats & Data cleaning, Data Analysis in explainer notebook
- Formalia choices (colour theme, font, fig sizes, inserting citations, narrative structure) 
- Proofreading (keeping a common thread through the story) and incorporating feedback from assignment 2


#### Benjamin Kock Fazal - s200431 (primary responsibilities)
- Bokeh plots (creation of the visualization and the related data processing / text writing / research)
- Section regarding Socioeconomic angle (visualizations / text / research)
- Linear fit
- Correlation calculations 