This section sets up a one-tailed hypothesis test of whether Vehicle Collision or Pedestrian Struck events with a fatality occur later in the day than those with only an injury.

* **Import packages and Data**

In [24]:
library(tidyverse)
library(repr)
library(datateachr)
library(digest)
library(infer)
library(gridExtra)
library(cowplot)
library(caret)
library(stringr)
library(broom)

“package ‘broom’ was built under R version 4.0.2”


In [2]:
crime_data <- read.csv("/home/jupyter/crimedata_csv_all_years (1).csv")

In [3]:
head(crime_data)

,TYPE,YEAR,MONTH,DAY,HOUR,MINUTE,HUNDRED_BLOCK,NEIGHBOURHOOD,X,Y
,<chr>,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<dbl>,<dbl>
1,Break and Enter Commercial,2012,12,14,8,52,,Oakridge,491285.0,5453433
2,Break and Enter Commercial,2019,3,7,2,6,10XX SITKA SQ,Fairview,490613.0,5457110
3,Break and Enter Commercial,2019,8,27,4,12,10XX ALBERNI ST,West End,491007.8,5459174
4,Break and Enter Commercial,2014,8,8,5,13,10XX ALBERNI ST,West End,491015.9,5459166
5,Break and Enter Commercial,2020,7,28,19,12,10XX ALBERNI ST,West End,491015.9,5459166
6,Break and Enter Commercial,2005,11,14,3,9,10XX ALBERNI ST,West End,491021.4,5459161


* **Clean/Wrangle data**

In [40]:
# Keep only the columns needed for the analysis
crime_data_grouped <- crime_data %>%
            select(TYPE, HOUR)

# Sort by hour of day and keep the four event types of interest
crime_data_sort <- crime_data_grouped[ order( crime_data_grouped$HOUR), ] %>%
                as_tibble() %>%
                filter(TYPE %in% c("Vehicle Collision or Pedestrian Struck (with Injury)",
                                   "Vehicle Collision or Pedestrian Struck (with Fatality)",
                                   "Homicide",
                                   "Offence Against a Person"))
head(crime_data_sort)
tail(crime_data_sort)
cd_count <- crime_data_sort %>%
            group_by(TYPE) %>%
            summarize(count=n())
cd_count

TYPE,HOUR
<chr>,<int>
Homicide,0
Homicide,0
Homicide,0
Homicide,0
Homicide,0
Homicide,0


TYPE,HOUR
<chr>,<int>
Vehicle Collision or Pedestrian Struck (with Injury),23
Vehicle Collision or Pedestrian Struck (with Injury),23
Vehicle Collision or Pedestrian Struck (with Injury),23
Vehicle Collision or Pedestrian Struck (with Injury),23
Vehicle Collision or Pedestrian Struck (with Injury),23
Vehicle Collision or Pedestrian Struck (with Injury),23


`summarise()` ungrouping output (override with `.groups` argument)



TYPE,count
<chr>,<int>
Homicide,274
Offence Against a Person,66682
Vehicle Collision or Pedestrian Struck (with Fatality),301
Vehicle Collision or Pedestrian Struck (with Injury),26457


* **Conduct Hypothesis test** 

 * Null Hypothesis: vehicle collisions or pedestrian strikes with a fatality occur at the same time of day as those with only an injury.
 * Alternative Hypothesis: vehicle collisions or pedestrian strikes with a fatality occur later than those with only an injury.

In [41]:
data_fatality_injury <- crime_data_sort %>%
                filter(TYPE == "Vehicle Collision or Pedestrian Struck (with Fatality)"|
                      TYPE == "Vehicle Collision or Pedestrian Struck (with Injury)")
head(data_fatality_injury)

TYPE,HOUR
<chr>,<int>
Vehicle Collision or Pedestrian Struck (with Fatality),0
Vehicle Collision or Pedestrian Struck (with Fatality),0
Vehicle Collision or Pedestrian Struck (with Fatality),0
Vehicle Collision or Pedestrian Struck (with Fatality),0
Vehicle Collision or Pedestrian Struck (with Fatality),0
Vehicle Collision or Pedestrian Struck (with Fatality),0


We cannot simply compare the mean or median of the clock hour to judge whether one kind of event happens later (e.g., an event at 1am is "later" than an event at 11pm even though 1 < 23, whereas an event at 3pm is later than an event at 2pm in the usual sense). Therefore, we compare the proportion of events that happen at night (between 6pm and 6am the next morning; in the code, `HOUR` from 18 to 23 or from 0 to 6).
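To illustrate why a plain average of clock hours is misleading, here is a small sketch. The `hours` vector is made up for illustration and is not taken from the data; the night-time rule mirrors the one used in the next cell.

```r
# Hypothetical example: two events, one at 11pm and one at 1am
hours <- c(23, 1)

# The arithmetic mean lands at noon, although both events happened around midnight
mean_hour <- mean(hours)

# Night-time indicator: hour in 18..23 or in 0..6
is_night <- (hours >= 0 & hours <= 6) | (hours >= 18 & hours <= 23)
prop_night <- mean(is_night)

mean_hour   # 12
prop_night  # 1
```

The proportion of night-time events does not suffer from the wrap-around at midnight, which is why it is used as the test quantity.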

In [42]:
re_type <- function(a) {
    if (a == "Vehicle Collision or Pedestrian Struck (with Fatality)"){
         a <- "Fatality"
    }else{
          a <- "Injury"
    } 
}
re_night <- function(a) {
    if (a){
         a <- "night"
    }else{
          a <- "not_night"
    } 
}

# Flag night-time events: hour in 0..6 or in 18..23
data_fatality_injury <- data_fatality_injury %>%
                        mutate(nighttime = ((0 <= HOUR & HOUR <= 6) | (18 <= HOUR & HOUR <= 23)))
data_fatality_injury$TYPE <- lapply(data_fatality_injury$TYPE, re_type)
data_fatality_injury$nighttime <- lapply(data_fatality_injury$nighttime, re_night)
head(data_fatality_injury)
tail(data_fatality_injury)


TYPE,HOUR,nighttime
<list>,<int>,<list>
Fatality,0,night
Fatality,0,night
Fatality,0,night
Fatality,0,night
Fatality,0,night
Fatality,0,night


TYPE,HOUR,nighttime
<list>,<int>,<list>
Injury,23,night
Injury,23,night
Injury,23,night
Injury,23,night
Injury,23,night
Injury,23,night
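
The `<list>` column types in the output above arise because `lapply()` returns a list. One possible alternative, sketched here on a small made-up tibble (`toy` and `toy_labelled` are hypothetical names), uses the vectorised `if_else()` from dplyr so that both columns stay plain character vectors.

```r
library(dplyr)

# Hypothetical toy data with the same columns as data_fatality_injury
toy <- tibble(
    TYPE = c("Vehicle Collision or Pedestrian Struck (with Fatality)",
             "Vehicle Collision or Pedestrian Struck (with Injury)",
             "Vehicle Collision or Pedestrian Struck (with Injury)"),
    HOUR = c(2L, 14L, 21L)
)

toy_labelled <- toy %>%
    mutate(
        # Shorten the group labels
        TYPE = if_else(TYPE == "Vehicle Collision or Pedestrian Struck (with Fatality)",
                       "Fatality", "Injury"),
        # Night: hour in 0..6 or in 18..23
        nighttime = if_else(HOUR <= 6 | HOUR >= 18, "night", "not_night")
    )

toy_labelled
```

Character columns behave more predictably than list columns in later `group_by()` and comparison steps.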


In [43]:
data_fatality_injury %>%
    group_by(TYPE, nighttime) %>%
    tally() %>%
    spread(TYPE, n)

nighttime,Fatality,Injury
<list>,<int>,<int>
night,143,10212
not_night,158,16245


Let $p_1$ be the proportion of Vehicle Collision or Pedestrian Struck (with Fatality) events that happen at night, and let $p_2$ be the proportion of Vehicle Collision or Pedestrian Struck (with Injury) events that happen at night.
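
In this notation the test can be written out as follows. This is a sketch of the standard pooled two-proportion z-test; the symbols $n_1$, $n_2$, $\hat{p}_1$, $\hat{p}_2$, $\hat{p}$ and $Z$ are introduced here for illustration, with group 1 being Fatality and group 2 being Injury.

$$H_0: p_1 - p_2 = 0 \qquad \text{vs.} \qquad H_A: p_1 - p_2 > 0$$

$$\hat{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}, \qquad Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

With the counts from the table above, $\hat{p}_1 = 143/301 \approx 0.4751$, $\hat{p}_2 = 10212/26457 \approx 0.3860$ and $\hat{p} = 10355/26758 \approx 0.3870$. Under $H_0$, $Z$ is approximately standard normal, and the one-sided p-value is $P(Z \ge z_{\text{obs}})$.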

In [45]:
crime_summary <-
    data_fatality_injury %>% 
    group_by(TYPE) %>% 
    summarise(n = n(), p_hat = mean(nighttime=="night"), `.groups` = "drop") %>% 
    pivot_wider(names_from = TYPE, values_from = c(n, p_hat)) %>% 
    mutate(prop_diff = p_hat_Fatality-p_hat_Injury)
crime_summary

n_Fatality,n_Injury,p_hat_Fatality,p_hat_Injury,prop_diff
<int>,<int>,<dbl>,<dbl>,<dbl>
301,26457,0.4750831,0.3859848,0.08909825


In [46]:
crime_summary <-
    crime_summary %>% 
    mutate(p = (p_hat_Fatality * n_Fatality + p_hat_Injury * n_Injury)/(n_Injury+n_Fatality),
           null_std_error = sqrt(p*(1-p)*(1/n_Fatality+1/n_Injury))) %>% 
    mutate(p_value = pnorm(prop_diff, 0, null_std_error, lower.tail=FALSE))%>%
    select(-p)
crime_summary

n_Fatality,n_Injury,p_hat_Fatality,p_hat_Injury,prop_diff,null_std_error,p_value
<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
301,26457,0.4750831,0.3859848,0.08909825,0.02823295,0.0008002254


We now have a data summary prepared for the test, and two options for carrying it out: 1. use permutation and the CLT to estimate the difference between two independent proportions; 2. use a two-sample z-test for proportions (`prop.test`) to estimate the difference.

We proceed with option 2. We favor it because we do not have to assume any symmetry in the distribution of our sample, and because the two options lead to the same conclusion.
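
For comparison, the permutation part of option 1 could be sketched in base R as below. This is only an illustration under stated assumptions: the vectors are rebuilt from the counts in the night/not_night table, and the names `type`, `night`, `perm_diffs`, the seed and the number of shuffles (2000) are arbitrary choices, not part of the original analysis.

```r
set.seed(2020)  # arbitrary seed, for reproducibility only

# Rebuild group labels and night indicators from the reported counts
type  <- rep(c("Fatality", "Injury"), times = c(301, 26457))
night <- c(rep(c(TRUE, FALSE), times = c(143, 158)),      # Fatality
           rep(c(TRUE, FALSE), times = c(10212, 16245)))  # Injury

# Difference in night-time proportions: Fatality minus Injury
diff_in_props <- function(labels) {
    mean(night[labels == "Fatality"]) - mean(night[labels == "Injury"])
}

obs_diff <- diff_in_props(type)

# Under H0 the labels are exchangeable, so shuffle them and recompute
perm_diffs <- replicate(2000, diff_in_props(sample(type)))

# One-sided p-value: share of shuffles at least as extreme as observed
p_perm <- mean(perm_diffs >= obs_diff)

obs_diff
p_perm
```

Because the result depends on the random shuffles, the permutation p-value will vary slightly from run to run.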

In [27]:
## option2
crime_prop_test <- 
    tidy(prop.test(
        # successes = number of night-time events in each group
        x = c(crime_summary$n_Fatality * crime_summary$p_hat_Fatality,
              crime_summary$n_Injury * crime_summary$p_hat_Injury),
        n = c(crime_summary$n_Fatality, crime_summary$n_Injury),
        alternative = "greater",  # one-tailed: Fatality proportion is larger
        correct = FALSE))         # no continuity correction

crime_prop_test


estimate1,estimate2,statistic,p.value,parameter,conf.low,conf.high,method,alternative
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
0.4750831,0.3859848,9.959229,0.0008002254,1,0.04149799,1,2-sample test for equality of proportions without continuity correction,greater
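
As a cross-check, the same test can be run directly from the counts in the night/not_night table. The sketch below uses hypothetical names (`night_counts`, `group_sizes`, `res`) and shows how the reported `statistic` relates to a z statistic: for two groups, the X-squared value is the square of z.

```r
# Counts taken from the night/not_night table above
night_counts <- c(Fatality = 143, Injury = 10212)
group_sizes  <- c(Fatality = 301, Injury = 26457)

res <- prop.test(x = night_counts, n = group_sizes,
                 alternative = "greater", correct = FALSE)

# The observed difference is positive, so z is the positive square root
z <- sqrt(unname(res$statistic))

# Upper-tail normal probability, matching the one-sided p-value
p_from_z <- pnorm(z, lower.tail = FALSE)

z
res$p.value
p_from_z
```

Here `z` is about 3.16, and both p-values agree with the 0.0008 computed by hand from `null_std_error`.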


* **Conclusion**

For this one-tailed test we use a significance level of 0.1. Because the p-value is 0.0008 < 0.1 (it is also below 0.05 and 0.01), we have sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis: a larger proportion of fatal collisions than of injury collisions happen at night. This is what we expected in our proposal.