## Title: 

**Introduction:**

 * Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal
 * Clearly state the question you will try to answer with your project
 * Identify and describe the dataset that will be used to answer the question

**Preliminary exploratory data analysis:**
 
 * Demonstrate that the dataset can be read from the web into R
 * Clean and wrangle your data into a tidy format
 * Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis, and how many rows have missing data.
 * Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.
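
The two "using only training data" items above assume the data has already been split into training and testing sets. One way to do that is `initial_split()` from rsample, which is loaded with tidymodels. The sketch below is a minimal illustration on made-up data: the tibble `toy_crime`, the 75/25 proportion, and the seed are assumptions for the example, not choices taken from this project.

```r
library(tidymodels)

set.seed(2023)  # arbitrary seed, only so the toy split is reproducible

# Hypothetical toy data standing in for the cleaned crime table
toy_crime <- tibble(
  type_of_crime = rep(c("Mischief", "Theft of Vehicle"), each = 40),
  percent_day   = runif(80, min = 0, max = 100)
)

# Hold out about 25% of the rows; strata keeps the class proportions
# similar in the training and testing sets
crime_split <- initial_split(toy_crime, prop = 0.75, strata = type_of_crime)
crime_train <- training(crime_split)
crime_test  <- testing(crime_split)

# Exploratory table built from the training rows only
crime_train |>
  group_by(type_of_crime) |>
  summarise(n = n(), mean_percent_day = mean(percent_day))
```

Every later summary table or plot would then start from `crime_train` rather than from the full data frame.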

**Methods:**

 * Explain how you will conduct your data analysis and which variables/columns you will use. Note: you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable, ask: is this a useful variable for prediction?
   
 * Describe at least one way that you will visualize the results

**Expected outcomes and significance:**

 1. What do you expect to find?
 2. What impact could such findings have?
 3. What future questions could this lead to?

#### LOADING LIBRARIES:

In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2

#### READING DATA FROM GITHUB:

In [2]:
download.file("https://raw.githubusercontent.com/ireneberezin/DSCI-PROJECT-38/main/data/crime.csv", "crime_data.csv")

In [3]:
crime_data <- read_csv("crime_data.csv")
crime_data

[1mRows: [22m[34m530652[39m [1mColumns: [22m[34m12[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (3): TYPE, HUNDRED_BLOCK, NEIGHBOURHOOD
[32mdbl[39m (9): YEAR, MONTH, DAY, HOUR, MINUTE, X, Y, Latitude, Longitude

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


| TYPE | YEAR | MONTH | DAY | HOUR | MINUTE | HUNDRED_BLOCK | NEIGHBOURHOOD | X | Y | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Other Theft | 2003 | 5 | 12 | 16 | 15 | 9XX TERMINAL AVE | Strathcona | 493906.5 | 5457452 | 49.2698 | -123.0838 |
| Other Theft | 2003 | 5 | 7 | 15 | 20 | 9XX TERMINAL AVE | Strathcona | 493906.5 | 5457452 | 49.2698 | -123.0838 |
| Other Theft | 2003 | 4 | 23 | 16 | 40 | 9XX TERMINAL AVE | Strathcona | 493906.5 | 5457452 | 49.2698 | -123.0838 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| Offence Against a Person | 2017 | 4 | 13 | NA | NA | OFFSET TO PROTECT PRIVACY | NA | 0.0 | 0 | 0.00000 | 0.0000 |
| Theft from Vehicle | 2017 | 6 | 5 | 17 | 0 | 8XX HAMILTON ST | Central Business District | 491487.8 | 5458386 | 49.27817 | -123.1170 |
| Vehicle Collision or Pedestrian Struck (with Injury) | 2017 | 6 | 6 | 17 | 38 | 13XX BLOCK PARK DR | Marpole | 490204.0 | 5451444 | 49.21571 | -123.1345 |


In [35]:
crime_data_filtered <- crime_data |>
  # Rename columns using the new_name = old_name form
  rename(type_of_crime = TYPE, Year = YEAR, Month = MONTH, Day = DAY, Hour = HOUR,
         Minute = MINUTE, block_location = HUNDRED_BLOCK, Neighborhood = NEIGHBOURHOOD) |>
  filter(Year == 2017) |>
  # Time of day as a percentage of the 1440 minutes in a day; the arithmetic is
  # vectorized, so rowwise() is not needed
  mutate(percent_day = (Minute + Hour * 60) / 1440 * 100) |>
  select(-Hour, -Minute)
crime_data_filtered

| type_of_crime | Year | Month | Day | block_location | Neighborhood | X | Y | Latitude | Longitude | percent_day |
|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Theft from Vehicle | 2017 | 5 | 3 | 13XX ALBERNI ST | West End | 490724.2 | 5459449 | 49.28772 | -123.1276 | 39.79167 |
| Offence Against a Person | 2017 | 4 | 4 | OFFSET TO PROTECT PRIVACY | NA | 0.0 | 0 | 0.00000 | 0.0000 | NA |
| Offence Against a Person | 2017 | 3 | 27 | OFFSET TO PROTECT PRIVACY | NA | 0.0 | 0 | 0.00000 | 0.0000 | NA |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| Offence Against a Person | 2017 | 4 | 13 | OFFSET TO PROTECT PRIVACY | NA | 0.0 | 0 | 0.00000 | 0.0000 | NA |
| Theft from Vehicle | 2017 | 6 | 5 | 8XX HAMILTON ST | Central Business District | 491487.8 | 5458386 | 49.27817 | -123.1170 | 70.83333 |
| Vehicle Collision or Pedestrian Struck (with Injury) | 2017 | 6 | 6 | 13XX BLOCK PARK DR | Marpole | 490204.0 | 5451444 | 49.21571 | -123.1345 | 73.47222 |
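
As a sanity check on `percent_day`, the conversion can be reproduced by hand for the last two rows of the output above, which were recorded at 17:00 and 17:38 in the raw data. The helper name `to_percent_day` is hypothetical and exists only for this sketch; it repeats the same arithmetic as the `mutate()` call.

```r
# Hypothetical helper: time of day as a percentage of the 1440 minutes in a day
to_percent_day <- function(hour, minute) {
  (hour * 60 + minute) / 1440 * 100
}

pct_1700 <- to_percent_day(17, 0)    # 1020 minutes after midnight
pct_1738 <- to_percent_day(17, 38)   # 1058 minutes after midnight

round(c(pct_1700, pct_1738), 5)
# 70.83333 73.47222
```

Rows where `Hour` or `Minute` is missing, such as the privacy-offset records, yield `NA` for `percent_day`, because arithmetic on `NA` returns `NA`.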


In [18]:
crime_summary <- crime_data_filtered |>
  count(type_of_crime, name = "Count")   # number of observations per crime type
crime_summary

| type_of_crime | Count |
|---|---|
| `<chr>` | `<int>` |
| Break and Enter Commercial | 1111 |
| Break and Enter Residential/Other | 1304 |
| Homicide | 11 |
| ⋮ | ⋮ |
| Theft of Vehicle | 755 |
| Vehicle Collision or Pedestrian Struck (with Fatality) | 5 |
| Vehicle Collision or Pedestrian Struck (with Injury) | 671 |


In [37]:

predictor_info <- crime_data_filtered |>
  # Keep only the numeric candidate predictors
  select(-type_of_crime, -Year, -Neighborhood, -block_location) |>
  mutate(across(Month:Day, as.integer)) |>
  drop_na() |>        # means are computed over complete rows only
  map_df(mean) |>     # one mean per remaining column
  pivot_longer(cols = Month:percent_day, names_to = "predictor_variable", values_to = "mean")
predictor_info

| predictor_variable | mean |
|---|---|
| `<chr>` | `<dbl>` |
| Month | 3.865526e+00 |
| Day | 1.475476e+01 |
| X | 4.922081e+05 |
| ⋮ | ⋮ |
| Latitude | 49.26534 |
| Longitude | -123.10710 |
| percent_day | 58.43333 |
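
A single mean says little about how a predictor is distributed, and the proposal outline also asks for at least one plot. The sketch below shows one possible shape for such a plot: a histogram of `percent_day` with one panel per crime type. It runs on made-up data; the tibble `toy_train`, its two crime types, and the hour-wide bins are assumptions for illustration, not results from the crime data.

```r
library(ggplot2)
library(tibble)

set.seed(1)  # arbitrary seed, only so the toy data is reproducible

# Hypothetical stand-in for the training rows of the cleaned crime table
toy_train <- tibble(
  type_of_crime = rep(c("Mischief", "Theft of Vehicle"), each = 100),
  percent_day   = runif(200, min = 0, max = 100)
)

# One histogram panel per crime type; a bin width of 100 / 24 with a boundary
# at 0 makes each bar cover one hour of the day
percent_day_plot <- ggplot(toy_train, aes(x = percent_day)) +
  geom_histogram(binwidth = 100 / 24, boundary = 0) +
  facet_wrap(vars(type_of_crime), ncol = 1) +
  labs(x = "Time of day (% of 24 hours elapsed)", y = "Number of incidents")

percent_day_plot
```

Stacking the panels in one column keeps the x-axes aligned, which makes it easier to compare when during the day each crime type tends to occur.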


In [83]:
na_count <- crime_data_filtered |>
  ungroup() |>   # guard against leftover row-wise grouping so summarise() returns one row
  summarise(across(everything(), ~ sum(is.na(.x))))   # number of missing values per column
na_count


In [84]:
?rowwise

rowwise {dplyr}, R Documentation

| Argument | Description |
|---|---|
| `data` | Input data frame. |
| `...` | `<tidy-select>` Variables to be preserved when calling summarise(). This is typically a set of variables whose combination uniquely identify each row. NB: unlike group_by() you can not create new variables here but instead you can select multiple variables with (e.g.) everything(). |
