# STAT 201 Group Project: Group 12

## Statistical Inference: Time Of Occurrence Of Comparable Violent vs. Less Violent Crimes
**Group Members:** Rashi Selarka, Alice Zhang, Medha Singh

In [None]:
library(tidyverse)
library(repr)
library(datateachr)
library(digest)
library(infer)
library(gridExtra)
library(cowplot)
library(caret)

### Introduction

Large-scale studies of FBI crime data demonstrate that violent crimes often happen at night (Bannister et al., 2019). This seems reasonable, since at night victims are easier targets and witnesses are less likely. We wanted to test this statement and take it a step further by examining how the type of crime relates to the time at which it takes place: we look at (seemingly) similar crimes and test whether they occur at similar times. The question we aim to answer is:

_**Is the mean time of occurrence of violent / more serious crimes different from the mean time of occurrence of less violent / less serious crimes?**_

We'll focus our report on two hypothesis tests to answer this question: one comparing the mean times of the violent crime of vehicle collision with _fatality_ vs. the relatively less violent crime of vehicle collision with _injury_, and another comparing the mean times of the violent crime of theft _of_ vehicle vs. the relatively less violent crime of theft _from_ vehicle.
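
Stated formally, the two tests described above can be written as follows. The symbols are introduced here for illustration only: each $\mu$ denotes the population mean hour of occurrence for the crime type in its subscript, and the alternatives are two-sided because the question asks whether the means are _different_.

$$H_0: \mu_{\text{fatality}} - \mu_{\text{injury}} = 0 \quad \text{vs.} \quad H_A: \mu_{\text{fatality}} - \mu_{\text{injury}} \neq 0$$

$$H_0: \mu_{\text{theft of}} - \mu_{\text{theft from}} = 0 \quad \text{vs.} \quad H_A: \mu_{\text{theft of}} - \mu_{\text{theft from}} \neq 0$$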

We use the Vancouver crime data published on the Vancouver Police Department's website. The dataset records crime on a year-by-year basis beginning in 2003 and is designed to give individuals a general overview of criminal activity across several categories. Through this investigation, we hope to gain a better understanding not just of hypothesis testing, but also of the times of day at which we are more vulnerable to certain crimes (Brands et al., 2013).

### Methods: Preliminary Results

Note: The data had to be stored locally and loaded into the notebook, since the website doesn't offer an option to scrape it. The file must be downloaded to the server's home directory before the cell below can read it.

In [None]:
crime_data <- read_csv("/home/jupyter/crimedata_csv_all_years.csv")

In [None]:
head(crime_data)

Of the various types of crime listed in the dataset, we have picked these 4 to analyse:
* **Vehicle Collision or Pedestrian Struck (with Fatality)**: Includes primarily pedestrian or cyclist struck and killed by a vehicle. It also includes vehicle to vehicle fatal accidents, however these incidents are fewer in number when compared to the overall data set.
* **Vehicle Collision or Pedestrian Struck (with Injury)**: Includes all categories of vehicle involved accidents with injuries. This includes pedestrian and cyclist involved incidents with injuries.
* **Theft from Vehicle**: Theft of property from a vehicle.
* **Theft of Vehicle**: Theft of a vehicle, motorcycle, or any motor vehicle.

Since we are focusing on the time of occurrence, we selected the TYPE and HOUR columns. We then filter for the two pairs of crimes that we have chosen for our inference.
We also look at the frequency of each crime type to see whether the classes are balanced; if they are not, we plan to balance them further on in the project. Basic summary statistics about the dataset are generated as well: we chose the mean and standard deviation as our descriptive parameters.
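
The frequency check and summary statistics described above could be computed along these lines. This is a minimal sketch on a small made-up tibble: `toy_crimes`, its values, and the output column names `mean_hour` and `sd_hour` are hypothetical, and only the `TYPE` and `HOUR` column names come from the dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for the filtered crime data;
# only the column names TYPE and HOUR match the real dataset.
toy_crimes <- tibble(
  TYPE = c("Theft of Vehicle", "Theft of Vehicle", "Theft of Vehicle",
           "Theft from Vehicle", "Theft from Vehicle", "Theft from Vehicle"),
  HOUR = c(2, 22, 23, 9, 14, 19)
)

# Count, mean, and standard deviation of the hour of occurrence per crime type
toy_summary <- toy_crimes %>%
  group_by(TYPE) %>%
  summarize(count = n(),
            mean_hour = mean(HOUR, na.rm = TRUE),
            sd_hour = sd(HOUR, na.rm = TRUE))
toy_summary
```

Applied to the real data, the same `group_by()` and `summarize()` calls would be run on the filtered crime data instead of `toy_crimes`.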

In [None]:
# Keep only the crime type and the hour of occurrence
crime_data_grouped <- crime_data %>%
            select(TYPE, HOUR)

# Sort by hour and keep the four crime types chosen for our inference
crime_data_sort <- crime_data_grouped %>%
                arrange(HOUR) %>%
                filter(TYPE %in% c("Vehicle Collision or Pedestrian Struck (with Injury)",
                                   "Vehicle Collision or Pedestrian Struck (with Fatality)",
                                   "Theft of Vehicle",
                                   "Theft from Vehicle"))
head(crime_data_sort)
tail(crime_data_sort)

# Number of recorded incidents per crime type
cd_count <- crime_data_sort %>%
            group_by(TYPE) %>%
            summarize(count=n())
cd_count


Here, we plot a boxplot of the data to better visualise and compare the distributions of the occurrence times of the crimes.

In [None]:
#scale the plot 
options(repr.plot.width = 12, repr.plot.height = 12)
cd_plot <- crime_data_sort %>%  
    ggplot(aes(x = TYPE, y = HOUR)) + 
    geom_boxplot() + 
    ylab("hour of the day for the crime") +
    ggtitle("Boxplots of hours of day for different crimes") +
    theme(axis.text.x = element_text(size=15, angle = 55), 
          axis.text.y = element_text( size = 12, angle = 45), 
          text = element_text(face = "bold", size = 15)) 
cd_plot

Now we look at bootstrapped sampling distributions of the mean hour, each built from 10,000 resamples of size 30, for use in our testing.

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)

theft_of_vehicle_bootstrap <- crime_data_sort %>%
                    filter(TYPE=="Theft of Vehicle" ) %>%
                    rep_sample_n(size = 30, reps=10000, replace = TRUE)

sampling_dist_theftof <- theft_of_vehicle_bootstrap %>%
                    group_by(replicate)%>%
                    summarize(mean = mean(HOUR))
sampling_dist_plot <- sampling_dist_theftof %>%
                ggplot() + 
                geom_histogram(aes(mean), bins = 15, color="white") +
                xlab("mean time of theft of vehicle") + 
                theme(text = element_text(size=15)) + 
                ggtitle("Bootstrapped sampling dist.", subtitle = "mean time of theft of vehicle") 


head(theft_of_vehicle_bootstrap)
head(sampling_dist_theftof)
sampling_dist_plot

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)

vehical_col_inj_bootstrap <- crime_data_sort %>%
                    filter(TYPE=="Vehicle Collision or Pedestrian Struck (with Injury)" ) %>%
                    rep_sample_n(size = 30, reps=10000, replace = TRUE)

sampling_dist_vc_inj <- vehical_col_inj_bootstrap %>%
                    group_by(replicate)%>%
                    summarize(mean = mean(HOUR))
sampling_dist_plot_vc <- sampling_dist_vc_inj %>%
                ggplot() + 
                geom_histogram(aes(mean), bins = 15, color="white") +
                xlab("mean time at which vehicle collision with injury occurred") + 
                theme(text = element_text(size=15)) + 
                ggtitle("Bootstrapped sampling dist.", subtitle = "mean time at which vehicle collision with injury occurred") 


head(vehical_col_inj_bootstrap)
head(sampling_dist_vc_inj)
sampling_dist_plot_vc

These bootstrapped samples help us estimate whether the average time of occurrence of Theft of Vehicle is comparable with that of Theft from Vehicle, and whether that of Vehicle Collision with Fatality is comparable with that of Vehicle Collision with Injury.

### Methods: Hypothesis Test I: Vehicle Collision with Fatality vs. Vehicle Collision with Injury

MEDHA PLEASE TAKE THIS! Start adding cells and working below here:
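
One way such a test could be run with the `infer` package already loaded in this notebook is a permutation test for the difference in mean hour. The sketch below is a hedged illustration only: the data frame `toy_collisions`, its simulated hours, the short group labels, the group sizes, the seed, and the number of permutations are all assumptions, and none of it uses the Vancouver data.

```r
library(dplyr)
library(infer)

set.seed(2021)  # hypothetical seed, only so the sketch is reproducible

# Hypothetical toy data: simulated hours, NOT the Vancouver data
toy_collisions <- tibble(
  TYPE = rep(c("Fatality", "Injury"), times = c(40, 60)),
  HOUR = c(sample(0:23, 40, replace = TRUE),
           sample(0:23, 60, replace = TRUE))
)

# Observed difference in mean hour (Fatality minus Injury)
obs_diff <- toy_collisions %>%
  specify(HOUR ~ TYPE) %>%
  calculate(stat = "diff in means", order = c("Fatality", "Injury"))

# Null distribution: shuffle TYPE so that type and hour are independent
null_dist <- toy_collisions %>%
  specify(HOUR ~ TYPE) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("Fatality", "Injury"))

# Two-sided p-value, matching the "different from" wording of the question
p_value <- null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two-sided")
p_value
```

With the real data, `toy_collisions` would be replaced by the two collision types filtered from the crime data, and the `order` argument would use the full `TYPE` labels.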

### Methods: Hypothesis Test II: Theft of Vehicle vs. Theft from Vehicle

ALICE PLEASE TAKE THIS! Start adding cells and working below here:
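
One way this comparison could be approached is a bootstrap confidence interval for the difference in mean hour, again with `infer`. The sketch below is an illustration under stated assumptions: `toy_thefts`, its simulated hours, the group sizes, the seed, the number of resamples, and the 95% level are all hypothetical choices, and the data are not the Vancouver data.

```r
library(dplyr)
library(infer)

set.seed(2021)  # hypothetical seed, only so the sketch is reproducible

# Hypothetical toy data: simulated hours, NOT the Vancouver data
toy_thefts <- tibble(
  TYPE = rep(c("Theft of Vehicle", "Theft from Vehicle"), times = c(50, 50)),
  HOUR = c(sample(0:23, 50, replace = TRUE),
           sample(0:23, 50, replace = TRUE))
)

# Bootstrap distribution of the difference in mean hour
# (Theft of Vehicle minus Theft from Vehicle)
boot_dist <- toy_thefts %>%
  specify(HOUR ~ TYPE) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means",
            order = c("Theft of Vehicle", "Theft from Vehicle"))

# Percentile confidence interval for the difference in means
ci <- boot_dist %>%
  get_confidence_interval(level = 0.95, type = "percentile")
ci
```

If the resulting interval excludes 0, that is consistent with rejecting the null hypothesis of equal mean hours at the 5% significance level.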

### Discussion

We initially expected our results to corroborate the alternative hypotheses. We expected car accidents involving death to happen later (i.e. at a different time) than those involving only injury, owing to drivers going at higher speeds at night, lower visibility, and more drunk and reckless drivers on the road. We also expected that theft of vehicle and theft from vehicle would not happen at around the same time of day on average: theft from vehicle can take place throughout the day depending on the neighbourhood _(an interesting uncertainty for future research to examine)_, whereas theft of vehicle is likely to happen when the vehicle is more isolated (perhaps later at night).

~ a paragraph about the significance of the results + conclusion once we get them (mention that crimes are similar) ~

Most crucially, our inference provides a starting point for law enforcement to predict the times at which certain types of crimes are likely to occur. It is also useful for the general adult population, who would know at what time of day they are more vulnerable and should take more precautions.

Many questions arise for the future. Other than time of day, what other predictors could we look at; perhaps certain months when crime spikes? Are the month and time of day related? For example, crime might occur at earlier hours in the winter months because it gets dark sooner. It is also worth looking into a more diverse dataset that records what the crime was vs. what the criminal was actually convicted of or what punishment they served. Potential miscarriage of justice is an extremely relevant issue in the world currently, and statistical insights into it would help build a case when necessary.

**References**

Crimes that Happen While You Sleep. (2020, November 04). Retrieved from https://www.thesleepjudge.com/crimes-that-happen-while-you-sleep/

Bannister, J., O’Sullivan, A., & Bates, E. (2019). Place and time in the Criminology of Place. Theoretical Criminology, 23(3), 315-332. doi:10.1177/1362480617733726

Brands, J., Schwanen, T., & Aalst, I. V. (2013). Fear of crime and affective ambiguities in the night-time economy. Urban Studies, 52(3), 439-455. doi:10.1177/0042098013505652

Vancouver Police Department: Crime Data. Retrieved from https://geodash.vpd.ca/opendata/

Violent Crimes Most Likely to Occur At Night. (2019, June 14). Retrieved from https://www.securitymagazine.com/articles/90384-murder-robbery-and-driving-while-impaired-happen-at-night

Violent Crime. (2018, September 10). Retrieved from https://ucr.fbi.gov/crime-in-the-u.s/2017/crime-in-the-u.s.-2017/topic-pages/violent-crime
