---
title: "Understanding and Forecasting the Community Impacts of Structure Fire"
author: "Kendra Hills, Myron Bañez, & Ben Keel"
date: "MUSA Practicum Feb 2023"
output:
  html_document:
    toc: yes
    toc_float: yes
    toc_depth: 3
    fig_caption: yes
    theme: lumen
---
## **Introduction**

![](/Users/kendrae.hills/Desktop/Spring 2023/MUSA_Practicum/Picture1.png)
 
### Abstract 
This project explores what happens to properties and neighborhoods after fires occur, with the purpose of developing a predictive model that can inform the Philadelphia Fire Department (PFD) of the likelihood of redevelopment or vacancy for fire-impacted structures. This intelligence will allow the PFD and partner agencies to understand when proactive expertise and services might have their highest impact. We want to thank Commissioner Adam Theil, Kathy Matheson, Andrew Newell, the PFD, and our instructors Matt Harris and Michael Fichman for their guidance and support for this project.

### Motivation & Use Case
In 2022, there were 1.2 million structure fires in the country that led to 2,500 deaths, 276 of them children. Last year, major cities like Philadelphia and New York grappled with severe and deadly structure fires. In Philadelphia specifically, 41 people died from fires, nearly 200 were injured, and thousands were displaced.

As it stands, the PFD has very limited knowledge about what happens after its job is complete, and public agencies in Philadelphia have no programmed set of economic development interventions for responding to fire. The PFD expressed a desire to better predict the consequences of a fire and understand recovery patterns.

Through a storytelling lens that contextualizes our research at the incident level, this project will provide the PFD with predictions of X for each property, depending on fire severity, and visualize them so the department can study aftermath trends. Our application will be used as an interactive tool to assist the PFD and partner agencies in understanding when proactive expertise and services might have their highest impact.


```{r Setup, include=FALSE}
knitr::opts_chunk$set(echo= TRUE, warning = FALSE, message = FALSE) 
# Set Up
library(boxr)
library(mapview)
library(sf)
library(tidyverse)
library(knitr)
library(kableExtra)
library(tigris)
library(viridis)
library(dplyr)
library(tidycensus)
library(ggplot2)
library(RSocrata)
library(lubridate)
library(janitor)
library(proxy)
library(FNN)
library(plotROC)
library(pROC)
library(ggcorrplot)


options(scipen = 999)
mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  #plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

#Color Palettes

palette2 <- c("#b9cfcf", "#e19825")

palette3_sat <- c("#e19825","#d55816","#7b230b")
palette3_desat <- c("#B19C7D","#7F5F52","#262626")

palette4 <- c("#f1c82b","#e19825","#d55816","#7b230b")
palette4_desat <- c("#B19C7D","#B27D49","#7F5F52","#262626")

palette5_sat <- c("#f1c82b","#e19825","#d55816","#7b230b","#413028")
pallette5_desat <- c("#ead5b7","#d2b190","#b18e6f","#7f5f52","#413028")


palette7_cats <- c("#b9cfcf","#20b1ae","#e19825","#7b230b","#b47c49", "#3f3128", "#8f8172")

#Sources for Graphs
creditFire <- "Source: Philadelphia Fire Department"
creditOpen <- "Source: Open Data Philly"

g<-glimpse
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
```

```{r Box Set Up, include=FALSE}
#Loading in Fire Data 
## INSERT BOX WORKFLOW HERE ##
# Credentials are read from environment variables (e.g. set in .Renviron) so secrets stay out of the report
box_auth(client_id = Sys.getenv("BOX_CLIENT_ID"), 
         client_secret = Sys.getenv("BOX_CLIENT_SECRET"))
box_setwd(186732420366)
box_getwd()
box_ls()
list <- box_ls() %>% as.data.frame()
structureFire <- box_read_excel(1093000179542) 

dat <- structureFire

```

```{r Cleaning The Dataset, include=FALSE}
# Creating geometry for the fires
dat <- dat %>% drop_na("Longitude", "Latitude") %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = "EPSG:4326")

# Cleaning the column names
dat<-clean_names(dat)

# Address string
dat <-
  dat %>% mutate(street_type = ifelse(street_type == 'AV', "AVE", street_type)) %>% 
  unite(address, c('address_number', 'street_prefix', 'street_name', 'street_type'), sep = " ", remove = FALSE, na.rm=TRUE)

# Extracting quarter
dat <- dat %>% mutate(quarter = floor_date(alarm_date, unit="quarter"))

# Reducing columns
dat <- dat %>%
  dplyr::select(address, quarter, property_use, incident_number, number_of_exposures, incident_type, building_status, fire_spread, no_spread, code_description, geometry, alarm_date, cad_nature_code_description,
                minor_damage, significant_damage, heavy_damage, extreme_damage)

# Removing duplicates
dat <- dat[!duplicated(dat$incident_number),]

```
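
The hidden cleaning chunk above does two things that the rest of the report relies on: it builds one address string per incident and it rounds each alarm date down to the start of its quarter. A minimal base-R sketch of both steps on made-up records follows. `build_address()` and `quarter_start()` are hypothetical helper names standing in for the `unite()` and `floor_date()` calls, and the sample addresses are invented.

```r
# Hypothetical helper: join the non-missing address parts with single spaces,
# after standardising the "AV" street type to "AVE"
build_address <- function(number, prefix, name, type) {
  type <- ifelse(!is.na(type) & type == "AV", "AVE", type)
  parts <- cbind(number, prefix, name, type)
  unname(apply(parts, 1, function(p) paste(p[!is.na(p)], collapse = " ")))
}

# Hypothetical helper: round a date down to the first day of its quarter
quarter_start <- function(d) {
  d <- as.Date(d)
  m <- as.integer(format(d, "%m"))
  first_month <- 3L * ((m - 1L) %/% 3L) + 1L
  as.Date(sprintf("%s-%02d-01", format(d, "%Y"), first_month))
}

# Made-up records, not PFD data
addresses <- build_address(c("1234", "56"), c("N", NA),
                           c("BROAD", "MARKET"), c("ST", "AV"))
quarters <- quarter_start(c("2015-08-17", "2015-12-31"))
```

Dropping missing parts before pasting avoids stray "NA" tokens in the address, which matters later because the address string is the join key for every panel.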

## **Exploratory Analysis** 

### The Data: Understanding Fire Incidents 

With the help of the PFD, we were given access to proprietary data: a rich and extensive record of fire incidents collected by the department dating back to 2009.

To better understand the outcomes of fire-impacted properties, we first conducted a preliminary literature review on why fires occur in the first place. The following model, originally developed by Charles Jennings (1996), is a conceptual model of the interrelationships between environmental, structural, and human factors as they relate to fire. We find this model useful as a way to develop more powerful predictors of the incidence of fire and a more nuanced model of their social and economic impacts in the future.

![Jennings, 1996](/Users/kendrae.hills/Desktop/Spring 2023/MUSA_Practicum/Presentation images /Concept_diagram.png)

### How Many Fires Occur Per Address?
```{r Counting the Number of Fires Per Address, echo=FALSE, message=FALSE, warning=FALSE}
#Count the number of Fires per address
nFires_perAddress <- dat%>%
  st_drop_geometry()%>%
  count(address, sort=TRUE)%>%
  left_join(dplyr::select(dat, address), by="address")%>% #removed na.rm=TRUE
  st_as_sf()

#remove duplicates from above
nFires_perAddress <- nFires_perAddress[!duplicated(nFires_perAddress$address),]

#Barplot of Counts of Fires for Each Address
nFires_perAddress%>%
  filter(n < 7)%>%
  ggplot()+
  geom_bar(mapping=aes(x=as.factor(n)), fill="#A5300F")+
  labs(title="Number of Fires Per Address",
       subtitle="Philadelphia County, 2009-2022")+
  xlab("Count of Fires")+
  ylab("Number of Structures")+
  theme(panel.background = element_rect(fill = "#f3efe0"))

```

More than 15,000 structures in Philadelphia have had at least one fire incident.

### Counts Over Time
```{r Fire Counts Over Time, echo=FALSE}
#Bar plot of fires per quarter, min to max
dat %>%
  ggplot(aes(x=as_date(quarter))) +
      geom_bar(fill="#A5300F")+
      labs(title = "Quarterly Count of Unique Fire Incidents",
           subtitle = "Philadelphia County, 2009 Q1 - 2022 Q4", 
           y = "Number of Fires")+   
    scale_x_date(name = "Year", date_breaks = "1 year")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = "#f3efe0"))

```
There are few distinct patterns in the data, and fire counts appear to have been relatively consistent over the 14 years covered.

- Note: fires dip from July to September

### Property Use

Focusing on residential cou
```{r Building Use, echo=FALSE}
#Building Use for All Fires
#Plotting the Frequency of Fires based on Property Use

nFires_perAddress_BuildAll <- nFires_perAddress %>%
  left_join(st_drop_geometry(dplyr::select(dat, property_use, address)), by="address")

#Bar plot of property use counts
dplyr::select(nFires_perAddress_BuildAll, -address)%>%
  st_drop_geometry()%>%
  gather(Variable, value, -n)%>%
  count(Variable, value)%>%
  group_by(Variable)%>%
  filter(n > 150)%>%
  ggplot(., aes(value, n))+
      geom_bar(position = "dodge", stat="identity", fill="#A5300F") +
      labs(x="Category", y="Frequency",
           title = "Top 10 Property Uses Among All Structure Fires",
           subtitle = "Philadelphia County, 2009-2022")+
          theme(axis.text.x = element_text(angle = 45, hjust = 1),
                
        panel.background = element_rect(fill = "#f3efe0"))

```

Non-residential buildings have very different types of fires and post-fire outcomes than residential buildings (supercategory 4). If we cut them out, we still keep the vast majority of the data.
```{r eval=FALSE, include=FALSE}

#Count Fires by PropUse and Add Super Category column
## Error with lines 203-205
#nFires_perAddress_PropUse <- nFires_perAddress %>%
 # left_join(st_drop_geometry(dplyr::select(structureFire_sf_addressU,`Property Use`, address)), by="address")%>%
  #mutate(Property_Use_SuperCat = substr(`Property Use`, 1, 1))
#Commenting this out for now, fix later!
```

```{r Share of Property Use Amongst Super Categories, echo=FALSE}
#Chart of Building Status counts
nFires_perAddress_BuildAll%>%
  mutate(Property_Use_SuperCat = substr(property_use, 1, 1))%>%
  dplyr::select(-address, -property_use)%>%
  st_drop_geometry()%>%
  gather(Variable, value, -n)%>%
  count(Variable, value, sort = TRUE)%>%
  group_by(Variable)%>%
  summarize(`Property Use Supercategory` = value, `Share (%)` = round((n/sum(n)*100), 2))%>%
  kable()%>%
  kable_styling()

```
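
The supercategory is simply the first digit of the property use code. As a rough sketch, assuming a handful of made-up codes rather than the PFD records, the share table can be reproduced in base R like this:

```r
# Hypothetical property use codes, not the PFD data
property_use <- c("419", "429", "419", "500", "215", "419", "962", "429")

# The supercategory is the leading digit of the code
supercat <- substr(property_use, 1, 1)

# Share of records in each supercategory, in percent
share <- round(prop.table(table(supercat)) * 100, 2)
```

In this toy sample, supercategory 4 (residential) holds five of the eight records, so `share[["4"]]` is 62.5.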

### Measures of Fire Severity

When measuring outcomes of buildings that experience a fire, the obvious question is how severe the fire was. The data offers multiple possible measures. We ended up using the first one, the ordinal variable labeled "fire spread". Here's how we made our decision:

- Fire spread: incidents are distributed across the spread levels rather than concentrated in one, so using fire spread is ideal

- Incident type: weighted toward one category, so not as useful

- Floor damage counts: no record for a lot of fire incidents

#### Fire Spread

In our research, burn time correlated strongly with damage. The longer a fire burns close to metal and concrete, the hotter those critical structural elements get, and the less able they are to hold weight.

Source: https://www.homego.com/blog/house-fire-damage/

Fire spread was the simplest and most normally distributed feature compared to the three others we considered.
```{r Bar Chart of Fire Spread, echo=FALSE}

dat %>%
  st_drop_geometry()%>%
  count(fire_spread)%>%
  ggplot()+
    geom_col(mapping=aes(x=as.factor(fire_spread), y=n, fill=fire_spread))+
    labs(title = "Number of Fires by Fire Spread", 
         subtitle = "Philadelphia County, 2009-2022", 
         caption = creditFire,
         x="Fire Spread Code",
         y="Number of Fires")+
    scale_fill_manual(values = palette5_sat,
                      name = "Fire Spread \nConfined To:", 
                      labels = c("Object", "Room", "Floor", "Building", "Beyond", "NA"))+
    theme(plot.title = element_text(size=18),
          
        panel.background = element_rect(fill = "#f3efe0"))

```

#### Floor Damage Counts

Originally these counts seemed useful as direct measures of damage, but the large number of "no record" instances means they cannot function as a reliable metric.
```{r Fire Damage Counts Setup, include=FALSE}
g(dat)
#How to measure severity in detail beyond 
sFire_severity <- dat%>%
  st_drop_geometry()%>%
  dplyr::select(incident_type, minor_damage, significant_damage, heavy_damage, extreme_damage, fire_spread)%>%
  mutate(Worst_Damage = ifelse(extreme_damage > 0, "Extreme",
                          ifelse(heavy_damage > 0, "Heavy",
                            ifelse(significant_damage > 0, "Significant",
                              ifelse(minor_damage > 0, "Minor", "No Record")))))%>%
  count(Worst_Damage, fire_spread)
```

```{r Fire Damage Counts, echo=FALSE}
ggplot(sFire_severity)+
  geom_col(mapping=aes(x=as.factor(Worst_Damage), y=n, fill=fire_spread))+
  labs(title = "Number of Fires by Worst Recorded Floor Damage", 
       subtitle = "Philadelphia County, 2009-2022", 
       caption = creditFire,
       x="Worst Recorded Floor Damage",
       y="Number of Fires")+      
  scale_fill_manual(values = palette5_sat,
                      name = "Fire Spread \nConfined To:", 
                      labels = c("Object", "Room", "Floor", "Building", "Beyond", "NA"))+
  theme(plot.title = element_text(size=18),
        panel.background = element_rect(fill = "#f3efe0"))

```
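
The "worst damage" label is a priority ordering: the most severe damage level with a positive floor count wins. A minimal sketch of that rule on made-up counts follows. `worst_damage()` is a hypothetical helper, and treating a missing count as "No Record" is an assumption of this sketch (in the nested `ifelse()` above, a missing value propagates as `NA` instead).

```r
# Hypothetical helper: label each incident with its most severe recorded damage level
worst_damage <- function(minor, significant, heavy, extreme) {
  # Assumption: a missing count is treated the same as zero floors damaged
  has <- function(x) !is.na(x) & x > 0
  ifelse(has(extreme), "Extreme",
    ifelse(has(heavy), "Heavy",
      ifelse(has(significant), "Significant",
        ifelse(has(minor), "Minor", "No Record"))))
}

# Made-up floor counts for four incidents
labels <- worst_damage(
  minor       = c(1, 0, NA, 2),
  significant = c(0, 0, NA, 1),
  heavy       = c(0, 1, NA, 0),
  extreme     = c(0, 0, NA, 0)
)
```

Here the four incidents come out as "Minor", "Heavy", "No Record", and "Significant".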

#### CAD Nature Code Description

We also have more specific categories, like the CAD nature code description.
```{r CAD Nature Code Counts, echo=FALSE}
#Count the unique values for CAD
nFires_CADDescr <- dat %>%
  st_drop_geometry%>%
  group_by(fire_spread)%>%
  count(cad_nature_code_description, sort=TRUE)

#Barplot of counts, by count
nFires_CADDescr%>%
  filter(n>50)%>%
  ggplot()+
  geom_col(mapping=aes(x=cad_nature_code_description, y=n, fill=fire_spread))+
  labs(title="Frequency of Fire Types",
       subtitle="Philadelphia County, 2009-2022, Above 50 Unique Incidents")+
  xlab("CAD Nature Code Description")+
  ylab("Number of Fires")+
      scale_fill_manual(values = palette5_sat,
                      name = "Fire Spread \nConfined To:", 
                      labels = c("Object", "Room", "Floor", "Building", "Beyond", "NA"))+
            theme(axis.text.x = element_text(angle = 45, hjust = 1),
              panel.background = element_rect(fill = "#f3efe0"))
```

#### Incident Type

Incident type measures whether the interior or exterior of the structure has collapsed. Considering outcomes like demolition, vacancy, and major construction for homes affected by a fire, collapse seems like a strong candidate for correlation. 

However, the vast majority of cases are type 1110: no collapse. This makes it a poor measure of severity, especially when we can see fire spread varying so heavily within these categories. 
```{r Incident Type Bar Plot All Severities, echo=FALSE}
dat %>%
  st_drop_geometry()%>%
  count(incident_type, fire_spread)%>%
  ggplot()+
    geom_col(mapping=aes(x=as.factor(incident_type), y=n, fill=fire_spread))+
    labs(title = "Number of Fires by Incident Type, All Severities", 
         subtitle = "Philadelphia County, 2009-2022", 
         caption = creditFire,
         x="Incident Type Code",
         y="Number of Fires")+
    scale_fill_manual(values = palette5_sat,
                      name = "Fire Spread \nConfined To:", 
                      labels = c("Object", "Room", "Floor", "Building", "Beyond", "NA"))+
    theme(plot.title = element_text(size=18),
        panel.background = element_rect(fill = "#f3efe0"))

```

```{r Incident Type Bar Plot Greater Severities, echo=FALSE}
dat %>%
  st_drop_geometry()%>%
  count(incident_type, fire_spread)%>%
  filter(incident_type != 111 & incident_type != 1110)%>%
  ggplot()+
    geom_col(mapping=aes(x=as.factor(incident_type), y=n, fill=fire_spread))+
    labs(title = "Number of Fires by Incident Type, Greater Severities", 
         subtitle = "Philadelphia County, 2009-2022", 
         caption = creditFire,
         x="Incident Type Code",
         y="Number of Fires")+
    scale_fill_manual(values = palette5_sat,
                      name = "Fire Spread \nConfined To:", 
                      labels = c("Object", "Room", "Floor", "Building", "Beyond", "NA"))+
    theme(plot.title = element_text(size=18),
        panel.background = element_rect(fill = "#f3efe0"))

```

As a result, we picked fire spread as our measure of severity.

### Outliers

We classify an outlier as an observation more than 3 standard deviations away from the population mean.

The mean number of fires per location is 1.115, pulled down by the large number of places where only 1 fire has occurred. The standard deviation of this fire population is 0.521. The resulting threshold, 1.115 + 3 × 0.521 = 2.678, means that each of the 293 locations with three or more fires is classified as an outlier.
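
The threshold arithmetic can be checked directly. The sketch below uses a made-up vector of per-address fire counts (the real values live in `nFires_perAddress$n`). Note that `sd()` returns the sample standard deviation, which is an assumption about how the figure above was computed.

```r
# Hypothetical per-address fire counts, not the PFD data
n_fires <- c(rep(1, 94), rep(2, 4), 3, 5)

# Outlier threshold: mean plus three standard deviations
threshold <- mean(n_fires) + 3 * sd(n_fires)
outliers <- n_fires[n_fires > threshold]

# The figures reported in the text combine the same way
reported_threshold <- 1.115 + 3 * 0.521
```

With the reported mean and standard deviation the threshold lands between two and three fires, which is why three or more fires marks an outlier.
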
```{r ACS Data Loading, include=FALSE}

acs_vars <- c("B01001_001E")

acsTractsPHL.2020 <- get_acs(geography = "tract",
                             year = 2020, 
                             variables = acs_vars, 
                             geometry = TRUE, 
                             state = "PA", 
                             county = "Philadelphia", 
                             output = "wide") 
```

```{r Outlier Map, echo=FALSE}

nFires_perAddress_Outliers <- filter(nFires_perAddress, n>2)

ggplot()+
  geom_sf(data=acsTractsPHL.2020, fill='#f0efe0', color='dark gray')+
  geom_sf(data=nFires_perAddress_Outliers, aes(color=q5(n)), alpha=0.5)+
    scale_color_manual(values=palette5_sat, labels=qBr(nFires_perAddress_Outliers, "n"))+
  labs(title = "Addresses With 3+ Fires") + mapTheme()

```

Apart from centers of population, there is not an obvious geographic distribution of the outliers. More research could be useful with ACS data to determine correlations. 
```{r echo=FALSE}
#Plotting the Frequency of Fires based on the Outliers' property use

dplyr::select(nFires_perAddress_BuildAll, -address)%>%
  filter(n>2)%>%
  st_drop_geometry()%>%
  gather(Variable, value, -n)%>%
  count(Variable, value)%>%
  group_by(Variable)%>%
  filter(n>23)%>%
  ggplot(., aes(value, n))+
      geom_bar(position = "dodge", stat="identity", fill="#A5300F") +
      labs(x="Category", y="Frequency",
           title = "Top 10 Property Uses Among Addresses with 3+ Fires",
           subtitle = "Philadelphia County, 2009-2022",
           caption = creditFire)+
          theme(axis.text.x = element_text(angle = 45, hjust = 1),
                
        panel.background = element_rect(fill = "#f3efe0"))

```

Property use differs among the outliers: code 215, the designation for "schools, high/junior/middle", is now in the top 3. Code 500, the designation for "mercantile, business, other", also has a greater share than in the complete data set. 

Regardless of the reason, these school and commercial fires differ from residential fires in both their causes and their outcomes. This outlier analysis supports our decision to remove the non-residential fires from our research in order to focus the narratives we can observe. 

## **Data Wrangling**: Panel Data Analysis
A key element of the exploratory analysis is understanding how fires relate to properties and to other relevant data that will help us better predict outcomes for fire-impacted properties. To start, we have decided to work with 311 complaints, permit data, and property assessment data to further explore and craft different post-fire scenarios. 

### Fire Panel - Initial, Count, and Final Panel

The fire panel is the base template containing the key information related to fire incidents. The dataset first undergoes several operations to refine the information.

The initial dataset contains fire incidents from January 2009 to December 2022. Each fire has a unique incident number, so an address with several fires appears in several records. To see the count and severity of fires at each property, including properties with no fire in a given period, we create a panel with one observation for every possible address and quarter combination.
```{r Initial Panel, echo=TRUE}
# Initial panel 
dat.panel <-
  expand.grid(quarter = unique(dat$quarter), 
              address = unique(dat$address))

```

```{r Count Panel, echo=TRUE}
# Count Panel
count.panel <- 
  dat %>%
  st_drop_geometry() %>%
  group_by(quarter, address, fire_spread) %>%
  count(address, sort=TRUE)

# Changing address to factor for join purposes later
count.panel$address <-
  as.factor(count.panel$address)
```

```{r Final Panel, echo=TRUE}
# Final Panel
final.panel <- left_join(dat.panel, count.panel, by=c("address", "quarter")) # Join
final.panel <- final.panel %>% dplyr::select(address, quarter, fire_spread, n) %>% rename(count = n) # Condensing & renaming
final.panel[is.na(final.panel)] <- 0 # Assigning 0 to NA for everything
final.panel$fire_spread <- as.numeric(final.panel$fire_spread) # Making fire_spread numeric

# Calculating the maximum severity score 
final.panel <-
  final.panel %>%
  group_by(address, quarter, count) %>% summarise(severity_index = max(fire_spread))

g(final.panel)
```
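
The three panel steps above (full grid, observed counts, zero-filled join) can be traced on a toy example. This is a base-R sketch with made-up incidents, not the report's dplyr pipeline. The column names follow the chunks above, and `merge()` and `aggregate()` stand in for `left_join()` and `count()`.

```r
# Made-up incidents: two addresses, three quarters
fires <- data.frame(
  address = c("100 MAIN ST", "100 MAIN ST", "200 OAK AVE"),
  quarter = as.Date(c("2020-01-01", "2020-07-01", "2020-04-01")),
  fire_spread = c(2, 4, 1),
  count = 1
)

# Step 1: every address x quarter combination
panel <- expand.grid(
  address = unique(fires$address),
  quarter = sort(unique(fires$quarter)),
  stringsAsFactors = FALSE
)

# Step 2: fires and worst spread actually observed per address-quarter
counts <- aggregate(count ~ address + quarter, data = fires, FUN = sum)
severity <- aggregate(fire_spread ~ address + quarter, data = fires, FUN = max)
names(severity)[3] <- "severity_index"
observed <- merge(counts, severity, by = c("address", "quarter"))

# Step 3: join onto the full grid and fill quarters without a fire with zero
full_panel <- merge(panel, observed, by = c("address", "quarter"), all.x = TRUE)
full_panel$count[is.na(full_panel$count)] <- 0
full_panel$severity_index[is.na(full_panel$severity_index)] <- 0
```

The toy panel has six rows (two addresses by three quarters), three of which carry a fire. The zero rows are what later allow a model to compare quarters with and without an incident.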

### OPA Panel
```{r OPA Upload and building counts, echo=TRUE}

#Data from https://www.opendataphilly.org/dataset/opa-property-assessments

#metadata at: https://metadata.phila.gov/#home/datasetdetails/5543865f20583086178c4ee5/representationdetails/55d624fdad35c7e854cb21a4/?view_287_page=3

# Reading the data
# Kendra
#opa_dat <- read_csv("/Users/kendrae.hills/Desktop/Spring 2023/MUSA_Practicum/opa_properties_public-2.csv")

# Myron
opa_dat <- read_csv("/Users/myronbanez/Desktop/Coding/PhilaFireData/OpenDataPhilly-opa_properties_public.csv")

# Creating geometry for the properties
opa_dat <- opa_dat%>%
  drop_na(lng, lat)%>%
  st_as_sf(coords = c("lng", "lat"),
           crs = "EPSG:4326")

g(opa_dat)
```

```{r OPA Setup, echo=TRUE}
# Reducing columns
opa_dat_small_sf <- opa_dat[!duplicated(opa_dat$location),] %>%
  dplyr::select(location, category_code, category_code_description, building_code, building_code_description, building_code_new, building_code_description_new, total_area, total_livable_area, owner_1, owner_2, market_value, market_value_date, mailing_street, number_of_bedrooms, number_of_bathrooms, number_stories, interior_condition, assessment_date, year_built, year_built_estimate, zoning, quality_grade, central_air, exterior_condition, fireplaces, fuel, taxable_building, topography, type_heater, sale_price, separate_utilities)%>%
  rename(address = location)

# Filtering for residential properties
opa_dat_small_sf <- opa_dat_small_sf %>% filter(category_code == 1 | category_code == 2 | category_code == 3)

# Extracting just the addresses
opa_dat_small <- opa_dat_small_sf%>%
  dplyr::select(address)%>%
  st_drop_geometry()

```

```{r Time Panel, echo=TRUE}
# Time Panel
quarter <- c("Q1", "Q2", "Q3", "Q4") # Creating quarters
year <- c(2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022) # Creating years

comb <- expand.grid(year = year, quarter = quarter)%>%
  mutate(yq = paste(year, ":", quarter))%>%
  mutate(yqDT = yq(yq))%>%
  arrange(yqDT)

time.panel <- as_tibble(comb)%>%
  dplyr::select(yqDT)
```
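
As an alternative sketch for the time panel, base R's `seq()` can generate the same quarter start dates without building and parsing strings. This assumes the panel should run from 2009 Q1 through 2022 Q4, as in the chunk above. `time_panel_alt` is a hypothetical name.

```r
# One date per quarter start, 2009 Q1 through 2022 Q4
quarters <- seq(as.Date("2009-01-01"), as.Date("2022-10-01"), by = "quarter")

# Same single-column layout as time.panel
time_panel_alt <- data.frame(yqDT = quarters)
```

Fourteen years at four quarters each gives 56 rows.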

```{r OPA Panel, include=FALSE}
opa.panel <- expand.grid(address = opa_dat_small$address, 
             quarter = time.panel$yqDT)
```

### Combined Panel
```{r OPA + Fire Panel, echo=TRUE}
# Combining OPA and Fire
opa_count.panel <- full_join(opa.panel, final.panel, by=c("address", "quarter")) # Join
opa_count.panel[is.na(opa_count.panel)] <- 0 # Assigning 0 to NA for everything 
```

## **Adding Open Data**

### 311 Call Data
```{r loading 311 data, include=FALSE}
#311 Data Upload, downloaded from https://data.phila.gov/visualizations/311-requests/

All311 <- read_csv("/Users/myronbanez/Desktop/Coding/PhilaFireData/OpenDataPhilly-311Calls.csv")

AllLI <- st_read("/Users/myronbanez/Desktop/Coding/PhilaFireData/complaints.geojson")
```

We filter the 311 data to the appropriate categories. The data is limited to 2014-2020.
```{r Filtering 311 Data, include=FALSE}
#Filtering to only the fire/building-relevant terms

#strictly filtering to vacancy complaints for initial combination
property311 <- filter(All311, 
                        #service_name == "Building Dangerous" |  
                        #service_name == "Dangerous Building Complaint " |  
                        #service_name == "Fire Safety Complaint" | 
                        #service_name == "Maintenance Complaint" |
                        #service_name == "Maintenance Residential or Commercial" |
                        service_name == "Vacant House or Commercial" ) %>%
                        #service_name == "Fire Residential or Commercial" |
                        #service_name == "Complaints against Fire or EMS"
dplyr::select(objectid, service_request_id, status, service_name, service_code, requested_datetime, agency_responsible, address, zipcode, lat, lon)%>%
  drop_na(lat, lon, address)%>%
  st_as_sf(coords = c("lon", "lat"),
           crs = "EPSG:4326")

#Reducing variables and calculating the quarter of the calls
prop311_small <- property311%>%
  dplyr::select(service_name, requested_datetime, address)%>%
  st_drop_geometry()%>%
  mutate(quarter = floor_date(requested_datetime, unit="quarter"))

#counting the calls per address per quarter
vacant311_count <- prop311_small%>%
  group_by(address, quarter)%>%
  count(address, sort=TRUE)%>%
  rename(n_311Vacant = n)

vacant311_count%>%
  group_by(quarter)%>%
  summarize(count = sum(n_311Vacant))%>%
  ggplot(aes(x=quarter, y=count))+
  geom_col(fill="#A5300F")+
    labs(title="311 Vacancy Complaints",
       subtitle="Philadelphia County, 2009-2022")+
  xlab("Date, Rounded to Beginning of Quarter")+
  ylab("Number of Complaints")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
    panel.background = element_rect(fill = "#f3efe0"))
#Data is now ready to join with panel

```

### L&I Data
Complaint designations in the L&I data change over certain periods, but the data does cover the full span of the fire data.
```{r LI filter}
#strictly filtering to vacancy complaints for initial combination
vacantLI <- filter(AllLI, 
                        complaintcodename == "VACANT HOUSE" |
                        complaintcodename == "VACANT HOUSE RESIDENTIAL" |
                        complaintcodename == "SPECIAL VACANT HOUSE" |
                        complaintcodename == "VACANT PROPERTY COMPLAINT" ) %>%
  dplyr::select(address, addressobjectid, complaintdate, complaintcodename, geometry)%>%
  mutate(quarter = as_date(floor_date(complaintdate, unit="quarter")))%>%
  st_set_crs("EPSG:4326")

vacantLI_count <- vacantLI%>%
  st_drop_geometry%>%
  drop_na(address)%>%
  group_by(address, quarter)%>%
  count(address, sort=TRUE)%>%
  rename(n_Vacant = n,
         address = address,
         quarter = quarter)
  
vacantLI_count$address <- as.factor(vacantLI_count$address)  

vacantLI_count%>%
  group_by(quarter)%>%
  summarize(count = sum(n_Vacant))%>%
  ggplot(aes(x=quarter, y=count))+
  geom_col(fill="#A5300F")+
    labs(title="L&I Vacancy Complaints",
       subtitle="Philadelphia County, 2009-2022")+
  xlab("Date, Rounded to Beginning of Quarter")+
  ylab("Number of Complaints")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
    panel.background = element_rect(fill = "#f3efe0"))
#Data is now ready to join with panel
```

We combine the two sources for better coverage in 2014-2020, but we drop the 311 calls that share an address and quarter with an L&I complaint.
```{r}
#Anti join keeps only the 311 address/quarter combos that are not already in the L&I data
vacant311_count_Clean <- vacant311_count%>%
  anti_join(vacantLI_count, by=c("address", "quarter"))%>%
  rename(n_Vacant = n_311Vacant) #match the L&I column name so the counts stack

#Row bind L&I and Clean 311 together.
vacantLI311 <- bind_rows(vacantLI_count, vacant311_count_Clean)
vacantLI311%>%
  group_by(quarter)%>%
  summarize(count = sum(n_Vacant))%>%
  ggplot(aes(x=quarter, y=count))+
  geom_col(fill="#A5300F")+
    labs(title="L&I and 311 Vacancy Complaints",
       subtitle="Philadelphia County, 2009-2022")+
  xlab("Date, Rounded to Beginning of Quarter")+
  ylab("Number of Complaints")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
    panel.background = element_rect(fill = "#f3efe0"))

vacant311_count$address <- as.factor(vacant311_count$address) 
```
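
The de-duplication logic can be sketched on toy data: keep every L&I record, then add only the 311 records whose address and quarter are not already covered. This base-R version uses a pasted key in place of `anti_join()`. The records are made up, and `key()` is a hypothetical helper.

```r
# Made-up L&I and 311 vacancy counts per address and quarter
li <- data.frame(
  address = c("100 MAIN ST", "200 OAK AVE"),
  quarter = as.Date(c("2015-01-01", "2015-04-01")),
  n_Vacant = c(1, 2)
)
c311 <- data.frame(
  address = c("100 MAIN ST", "300 ELM ST"),
  quarter = as.Date(c("2015-01-01", "2015-01-01")),
  n_Vacant = c(3, 1)
)

# Hypothetical helper: one string key per address-quarter pair
key <- function(d) paste(d$address, d$quarter)

# Drop 311 rows already represented in L&I, then stack the two sources
c311_clean <- c311[!(key(c311) %in% key(li)), ]
combined <- rbind(li, c311_clean)
```

The overlapping record for 100 MAIN ST is taken from L&I only, so the combined table has three rows and the 311 count of 3 for that address is discarded rather than added.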

### Permit Data
We import the permit data by combining its two files.
```{r Permit Data, echo=TRUE}
#Importing Permit Data
#Data from https://www.opendataphilly.org/dataset/licenses-and-inspections-building-permits

#metadata at: https://metadata.phila.gov/#home/datasetdetails/5543868920583086178c4f8f/representationdetails/5e9a01ac801624001585ca11/

permits0715 <- read_csv("/Users/myronbanez/Desktop/Coding/PhilaFireData/OpenDataPhilly-permits_0715.csv")
permits1623 <- read_csv("/Users/myronbanez/Desktop/Coding/PhilaFireData/OpenDataPhilly-permits_1623.csv")

permitsAll <- rbind(permits0715, permits1623)

permits_sf <- permitsAll%>%
  drop_na(lng, lat)%>%
  st_as_sf(coords = c("lng", "lat"),
           crs = "EPSG:4326")

``` 

Cleaning Permit Data and Creating Count Table
```{r Permit Data Clean and Count, echo=TRUE}
permits_sf_res <- permits_sf%>%
  dplyr::select(permittype, permitdescription, permitissuedate, commercialorresidential, address)%>%
#  filter(permitdescription == "DEMOLITION PERMIT" |
#         permitdescription == "GENERAL PERMIT" |
#         permitdescription == "NEW CONTRUCTION PERMIT" |
#         permitdescription == "RESIDENTIAL BUILDING PERMIT" |
#         permitdescription == "FAST FORM BUILDING PERMIT" |
#         permitdescription == "ALTERATION PERMIT")%>%
  filter(commercialorresidential != "COMMERCIAL")%>%
  mutate(quarter = as_date(floor_date(permitissuedate, unit="quarter")))%>%
  filter(year(quarter) >= 2009)

permits_count <- permits_sf_res %>%
  st_drop_geometry()%>%
  group_by(address, quarter)%>%
  count(address, sort = TRUE)%>%
  rename(n_permits = n)

permits_count%>%
  ggplot(aes(x=quarter, y=n_permits))+
  geom_col(fill="#A5300F")+
    labs(title="Permit Records",
       subtitle="Philadelphia County, 2009-2022")+
  xlab("Date, Rounded to Beginning of Quarter")+
  ylab("Number of Records")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
    panel.background = element_rect(fill = "#f3efe0"))
```

### Real Estate Transfer Data
```{r SHORTCUT: Real Estate Transfers Count Panel}
transfers_count <- read_csv("/Users/myronbanez/Desktop/Coding/PhilaFireData/transfers_count.csv")
```

```{r Real Estate Transfers Load-In}
#transfers <- read_csv("Data/OpenDataPhilly-transfers.csv")

#Select relevant fields and filter to fire data range
#transfers_data <- transfers%>%
#  dplyr:: select(objectid, recording_date, street_address, document_type)%>%
#  filter(year(recording_date) > 2008,
#         !is.na(street_address))%>%
#  rename(address = street_address)%>%
#  mutate(quarter = as_date(floor_date(recording_date, unit="quarter")))
```

```{r Real Estate Transfers Count Panel}
#transfers_count <- transfers_data %>%
#  group_by(address, quarter)%>%
#  count(address, sort = TRUE)%>%
#  rename(n_transfers = n)

#write.csv(transfers_count, "~/Desktop/Coding/MUSA Practicum/MUSA_Practicum-/Data/transfers_count.csv")
```

```{r Real Estate Transfers ggplot}
transfers_count%>%
  ggplot(aes(x=quarter, y=n_transfers))+
  geom_col(fill="#A5300F")+
    labs(title="Real Estate Transfer Records",
       subtitle="Philadelphia County, 2009-2022")+
  xlab("Date, Rounded to Beginning of Quarter")+
  ylab("Number of Records")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
    panel.background = element_rect(fill = "#f3efe0"))
```

### Joining Open Data to Existing Panel
```{r SHORTCUT: Join 311 and Permits to panel_FireOPA}
panel_OPAFireOpenData <- read_csv("/Users/myronbanez/Desktop/Coding/PhilaFireData/panel_OPAFireOpenData.csv")
```

```{r Join 311 and Permits to panel_FireOPA,include=FALSE}
#panel_OPAFire311 <- left_join(opa_count.panel, vacant311_count, by=c("address", "quarter"))
#panel_OPAFire311$n_311Vacant[is.na(panel_OPAFire311$n_311Vacant)] <- 0

#panel_OPAFire311Permit <- left_join(panel_OPAFire311, permits_count, by=c("address", "quarter"))
#panel_OPAFire311Permit$n_permits[is.na(panel_OPAFire311Permit$n_permits)] <- 0

#panel_OPAFireOpenData <- left_join(panel_OPAFire311Permit, transfers_count, by=c("quarter", "address"))
#panel_OPAFireOpenData$n_transfers[is.na(panel_OPAFireOpenData$n_transfers)] <- 0
```

# Calculating Outcomes
The objective is to identify which fires were followed by vacancy complaints, permit requests, or real estate transfers in the months or years after the fire.
```{r Calculate Average Quarters Until Result}
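#filter panel to just outcomes or fires
#NOTE: the join code names the 311 vacancy column n_311Vacant; use whichever name is in the loaded CSV
panel_Positives <- panel_OPAFireOpenData %>%
  filter(count > 0 | n_Vacant > 0 | n_permits > 0 | n_transfers > 0)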

#filter to just fires, then combine those addresses with additional outcomes and building categories
panel_FirePositives <- panel_OPAFireOpenData %>%
  filter(count > 0)%>%
  dplyr::select(address)%>%
  distinct(address, .keep_all = TRUE)%>%
  left_join(panel_Positives, by="address")%>%
  left_join(dplyr::select(opa_dat_small_sf, address, category_code_description, mailing_street, building_code_description), by="address")%>%
  mutate(condo = grepl("CONDO", building_code_description), #grepl() already returns TRUE/FALSE
          owner_occ = ifelse(!condo & category_code_description != "MULTI FAMILY",
                             address == mailing_street, #owner-occupied if the mailing address matches
                             NA))%>%
  dplyr::select(-mailing_street, -condo, -building_code_description)%>%
  st_drop_geometry()
  #mutate(diff = interval(lag(quarter, n=1),quarter) %/% years(1))  

#Eliminate the data that comes before the fires, as those are not outcomes of the fire
panel_FirePositives <- panel_FirePositives%>%
  group_by(address)%>%
  mutate(f = cumsum(count))%>% #counts the cumulative sum of the number of fires at that address so far.
  filter(f > 0)%>% #if that number is zero, then we don't want the data
  dplyr::select(-f)

# edit here if we wanna play with the lags
#Join to get fire incident number and date
#calculate the difference between the quarter of the incident date and the quarter of the outcome
panel_FirePositivesDiff <- panel_FirePositives%>%
  left_join(st_drop_geometry(dplyr::select(dat, incident_number,address, quarter)), by=c("address"))%>%
  group_by(incident_number)%>% #some addresses have multiple fires, so we use incident number instead
  mutate(mSinceFire = interval(quarter.y, quarter.x) %/% months(1),
         ySinceFire = mSinceFire / 12,
         cat_code = toupper(category_code_description))%>%
  filter(mSinceFire >= -1, #Eliminate entries before the fire (these appear because the join repeats rows for each incident at an address)
         mSinceFire < 49, #Eliminate entries more than four years after the fire (cutoff is arbitrary)
         !(count > 0 & ySinceFire > 0))%>% #For addresses with multiple incidents, take out repeated fire observations
  dplyr::select(-category_code_description)%>%
  st_as_sf()

#Chart:
#For every outcome, what is the median time since a fire occurred?
#panel_FirePositivesDiff%>%
#  st_drop_geometry()%>%
#  ungroup()%>%
#  filter(!(n_Vacant == 0 & n_permits == 0 & n_transfers == 0)) %>%
#  mutate(ySinceFire = mSinceFire / 12)%>%
#  dplyr::select(n_Vacant, n_permits, n_transfers, ySinceFire)%>%
#  gather(Variable, value, -ySinceFire)%>%
#  filter(value > 0)%>%
#  group_by(Variable)%>%
#  summarize(`Median Years Since Fire` = median(ySinceFire))%>%
#  kable()%>%
#  kable_styling()
```
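
The pre-fire filter above is easier to follow on a small example. This is a minimal sketch with a made-up one-address panel (`toy_panel` and its values are hypothetical, not project data). It also sorts by quarter first, which the `cumsum()` step relies on; we are assuming the real panel is already in time order.

```{r Sketch Pre-Fire Filter}
library(dplyr)

# Hypothetical toy panel: one address, five quarters, one fire in the third quarter
toy_panel <- tibble(
  address = "100 EXAMPLE ST",
  quarter = as.Date(c("2015-01-01", "2015-04-01", "2015-07-01", "2015-10-01", "2016-01-01")),
  count = c(0, 0, 1, 0, 0),      # fires recorded that quarter
  n_permits = c(1, 0, 0, 2, 0)   # permits recorded that quarter
)

toy_post_fire <- toy_panel %>%
  arrange(address, quarter) %>%  # cumsum() only makes sense if rows are in time order
  group_by(address) %>%
  mutate(f = cumsum(count)) %>%  # running number of fires at this address
  filter(f > 0) %>%              # drop quarters before the first fire
  ungroup() %>%
  dplyr::select(-f)

toy_post_fire
```

The permit filed two quarters before the fire is dropped, while the two permits filed the quarter after it are kept as candidate outcomes.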

## Results (Boolean)
The outcomes here can be used as the dependent variable for our model once they are combined back with all the other properties. Our 2-years-after-fire results panel has 14,545 observations, down from the original data's ~21,000, so we lost about 30% of incidents. We'll have to check why.
```{r SHORTCUT: Calculate Boolean Outcome Code}
panel_Results2Y <- read_csv("/Users/myronbanez/Desktop/Coding/PhilaFireData/panel_Results2Y.csv")
```

```{r Calculate Boolean Outcome Code}
#six months boolean outcome code
#panel_Results2Q <- panel_OPAFireOpenData %>%
#    mutate(fireVacant2Q = ifelse(address == lag(address, n=1) & 
#                               (count > 0 | lag(count, n=1) > 0) &
#                                (n_311Vacant > 0) > 0 , 1, 0),
#           firePermit2Q = ifelse(address == lag(address, n=1) & 
#                               (count > 0 | lag(count, n=1) > 0) &
#                                (n_permits > 0) > 0 , 1, 0),
#          fireTransfer2Q = ifelse(address == lag(address, n=1) & 
#                     (count > 0 | lag(count, n=1) > 0) &
#                      (n_transfers > 0) > 0 , 1, 0))

#2 Year Outcomes for Each Incident
#panel_Results2Y <- panel_FirePositivesDiff %>%
#    st_drop_geometry()%>%
#    dplyr::select(-mSinceFire, -cat_code, -quarter.y)%>%
#    filter(., ySinceFire <= 2)%>% # edit here if we wanna play with the lags
#    group_by(address, incident_number)%>%
#    summarize(count = sum(count),
#              severity_index = max(severity_index),
#              outcome_vacant = sum(n_311Vacant),
#              outcome_permit = sum(n_permits),
#              outcome_transfer = sum(n_transfers),
#              quarter = min(quarter.x))

# Later: Join back to original dataset to get the spatial features
```
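
One way to check where the missing incidents went is an anti-join between the incident list and the results panel. The sketch below uses made-up tables (`toy_incidents` and `toy_results` are hypothetical stand-ins for `dat` and `panel_Results2Y`), and a missing address is only one guess at why an incident might drop out.

```{r Sketch Dropped Incidents}
library(dplyr)

# Hypothetical stand-ins for the fire incidents and the 2-year results panel
toy_incidents <- tibble(incident_number = c("F1", "F2", "F3", "F4"),
                        address = c("1 A ST", "2 B ST", NA, "4 D ST"))
toy_results <- tibble(incident_number = c("F1", "F4"))

# Incidents that never made it into the results panel
toy_dropped <- anti_join(toy_incidents, toy_results, by = "incident_number")

# Share of incidents lost, and how many of those have no address to join on
toy_share_lost <- nrow(toy_dropped) / nrow(toy_incidents)
toy_missing_address <- sum(is.na(toy_dropped$address))
```

On the real data the same comparison would show whether the drop comes from addresses that fail to match OPA, from the time-window filters, or from something else.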

# Feature Engineering 

### OPA and Fire Dataset 
```{r Cleaning Dataframes and Slight Engineering}
# OPA - Creating an OPA dataset to get just the variables we want for feature engineering and doing slight feature engineering
# Numeric
opa_dat_small_sf_num <- opa_dat_small_sf %>% 
  dplyr::select(address, total_livable_area, market_value, sale_price, number_of_bedrooms, number_of_bathrooms, number_stories, 
  interior_condition) %>% st_drop_geometry()
opa_dat_small_sf_num[is.na(opa_dat_small_sf_num)] <- 0 # Assigning 0 to NA for everything

# Categorical
opa_dat_small_sf_cat <- opa_dat_small_sf %>% 
  dplyr::select(address, quality_grade, year_built, central_air, exterior_condition, fireplaces, fuel, taxable_building, 
  topography, type_heater)  %>% st_drop_geometry()

opa_dat_small_sf_fe <- left_join(opa_dat_small_sf_num,opa_dat_small_sf_cat, by = "address") # Joining

# OPA - Quality Grade
opa_dat_small_sf_fe <-
  opa_dat_small_sf_fe %>%
  mutate(grade = case_when(
    quality_grade %in% c("A+", "A", "A-", "B+", "B", "B-", "C+", "C") ~ "Average or Better",
    TRUE ~ "Below Average")) 

# OPA - Fuel
opa_dat_small_sf_fe <-
  opa_dat_small_sf_fe %>%
  mutate(fuel_type = case_when(
    fuel == "A" | fuel == "B" ~ "Fossil",
    fuel == "D" | fuel == "F" ~ "Solid", 
    fuel == "C" | fuel == "E" ~ "Alternative", 
    TRUE ~ "Other"))

# OPA - Topography
opa_dat_small_sf_fe <-
  opa_dat_small_sf_fe %>%
  mutate(topo = case_when(
    topography == "A" ~ "Above Street Level",
    topography == "B" ~ "Below Street Level", 
    topography == "C" ~ "Flood Plain",
    topography == "D" ~ "Rocky",
    topography == "F" ~ "Street Level",
    TRUE ~ "Other"))

# OPA - Heater
opa_dat_small_sf_fe <-
  opa_dat_small_sf_fe %>%
  mutate(electric_heater = case_when(
    type_heater == "C" ~ "Yes",
    TRUE ~ "No"))

# OPA - Central Air
opa_dat_small_sf_fe <-
  opa_dat_small_sf_fe %>%
  mutate(air_central = case_when(
    central_air == "Y" ~ "Yes",
    TRUE ~ "No"))

# OPA - Fireplace
opa_dat_small_sf_fe <-
  opa_dat_small_sf_fe %>%
  mutate(fireplace = case_when(
    fireplaces > 0 ~ "Yes",
    TRUE ~ "No"))

# OPA - Year Built
opa_dat_small_sf_fe$year_built <- as.numeric(as.character(opa_dat_small_sf_fe$year_built))

# OPA - Cleaning
opa_dat_small_sf_fe <- opa_dat_small_sf_fe %>% dplyr::select(-quality_grade, -central_air, -fuel, -topography, -type_heater)
opa_dat_small_sf_fe[is.na(opa_dat_small_sf_fe)] <- 0 # Assigning 0 to NA for everything

# Fire Data - Creating a fire dataset to get just the variables we want for feature engineering
dat_fe <- dat %>% dplyr::select(address, incident_type, number_of_exposures, minor_damage, significant_damage, heavy_damage, extreme_damage)
```

```{r Joining}
panel_Results2Y_sf <- left_join(panel_Results2Y, opa_dat_small_sf_fe, by="address") # Joining with OPA FE variables
panel_Results2Y_sf[is.na(panel_Results2Y_sf)] <- 0 # Assigning 0 to NA for everything

# Joining panel_Results2Y with Time outcome FE variables
panel_FirePositivesDiff_fe <- panel_FirePositivesDiff %>% 
  dplyr::select(address, mSinceFire, ySinceFire, cat_code, owner_occ) %>% st_drop_geometry() # Condensing dataframe
fire_panel <- merge(panel_Results2Y_sf,panel_FirePositivesDiff_fe ) # Join
fire_panel <- fire_panel[!duplicated(fire_panel$incident_number),] # Removing Duplicates
```

```{r Feature Engineering}
# Turning outcomes into binary
fire_panel <-fire_panel %>%
  mutate(vacant = case_when(
    outcome_vacant > 0 ~ 1,
    TRUE ~ 0
  ))

fire_panel <-fire_panel %>%
  mutate(permit = case_when(
    outcome_permit > 0 ~ 1,
    TRUE ~ 0
  ))

fire_panel <-fire_panel %>%
  mutate(transfer = case_when(
    outcome_transfer > 0 ~ 1,
    TRUE ~ 0
  ))

# Market Value
fire_panel <-
  fire_panel %>%
  mutate(mkt_value = case_when(
    market_value < 250000 ~ "Low", # < $250,000
    market_value >= 250000 & market_value < 500000 ~ "Medium", # $250,000-$499,999
    TRUE ~ "High")) # $500,000+

# Number of Bedrooms
fire_panel <-
  fire_panel %>%
  mutate(bedrooms = case_when(
    number_of_bedrooms < 4 ~ "Low", # 1-3 Bedrooms
    number_of_bedrooms > 3 & number_of_bedrooms < 8 ~ "Medium", # 4-7 Bedrooms
    TRUE ~ "High")) # 8+

# Number of Bathrooms
fire_panel <-
  fire_panel %>%
  mutate(bathrooms = case_when(
    number_of_bathrooms < 3 ~ "Low", # 1-2 Bathroom
    number_of_bathrooms > 2 & number_of_bathrooms < 6 ~ "Medium", # 3-5 Bathrooms
    TRUE ~ "High")) # 6+ Bathrooms

# Livable Area 
fire_panel <-
  fire_panel %>%
  mutate(area = case_when(
    total_livable_area <= 5000 ~ "Low", # 0-5,000 sqft
    total_livable_area > 5000 & total_livable_area <= 10000 ~ "Medium", # 5,001-10,000 sqft
    TRUE ~ "High")) # 10,001+ sqft

# Year Built
fire_panel <-
  fire_panel %>%
  mutate(built_year = case_when(
    year_built < 1951 ~ "Up to 1951", 
    year_built >= 1951 & year_built < 2000 ~ "1950 to 1999", # 1951 through 1999
    TRUE ~ "2000 to Present")) 

# Interior Condition - OPA reports the condition as it is today, not at the time of the fire (e.g. a 2018 fire gets today's rating). Remove?
fire_panel <-
  fire_panel %>%
  mutate(condition = case_when( 
    interior_condition > 0 & interior_condition < 5 ~ "Average", 
    interior_condition == 5 ~ "Below Average",
    interior_condition == 6 ~ "Vacant",
    interior_condition == 7 ~ "Sealed",
    interior_condition == 0 ~ "Unknown")) 

# Sale vs Market
fire_panel <-
  fire_panel %>%
  mutate(sale_value = case_when( 
    sale_price > market_value ~ "Above Market", 
    TRUE ~ "Below Market")) 

# Resident Type (owner vs renter)
fire_panel <-
  fire_panel %>%
  mutate(resident_type = case_when( 
    owner_occ == "TRUE" ~ "Owner", 
    owner_occ == "FALSE" ~ "Renter", 
    TRUE ~ "Other")) 

dat_geo <- dat %>% dplyr::select(address, geometry) # Geometry
fire_panel <- left_join(fire_panel, dat_geo, by = "address") # Joining to get geometry
fire_panel <- fire_panel[!duplicated(fire_panel$incident_number),] # Removing Duplicates
```
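
The Low / Medium / High bins above can also be written with base R's `cut()`, which makes the break points explicit in one place. This is only a sketch on made-up values (`toy_values` is hypothetical), using the market value breaks from the comments above.

```{r Sketch Binning With cut}
# Hypothetical market values, including the two break points and a missing value
toy_values <- c(100000, 250000, 499999, 500000, NA)

toy_bins <- cut(toy_values,
                breaks = c(-Inf, 250000, 500000, Inf),
                labels = c("Low", "Medium", "High"),
                right = FALSE)  # intervals are [lower, upper), so 250000 is Medium and 500000 is High

as.character(toy_bins)
```

One difference to keep in mind: `cut()` leaves a missing value as `NA`, while a `case_when()` with a `TRUE ~ "High"` fallback would put it in the High bin. In our panel the NAs are set to 0 before binning, so they land in Low either way.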

### Neighborhoods
```{r Adding Neighborhoods}
# Load Data
neighborhoods <- st_read("/Users/myronbanez/Desktop/Coding/PhilaFireData/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson") %>% st_transform(crs = 3652) %>% dplyr::select(mapname, geometry) # Remove extra columns

fire_panel_sf <- st_as_sf(fire_panel) %>% st_transform(st_crs(neighborhoods)) # Make sf

fire_panel <- st_join(fire_panel_sf, neighborhoods) # Join
fire_panel <- fire_panel[!duplicated(fire_panel$incident_number),] # Removing Duplicates

fire_panel <- fire_panel %>% rename(neighborhood = mapname) # Renaming
```

### Redevelopment Certified Areas
```{r Redevelopment Area}
#https://www.opendataphilly.org/dataset/redevelopment-certified-areas

rca <- st_read("/Users/myronbanez/Desktop/Coding/PhilaFireData/Redevelopment_Certified_Areas.geojson") %>% 
  st_transform(crs = 3652) %>% dplyr::select(NAME)

fire_panel <- st_join(fire_panel, rca) # Join
fire_panel <- fire_panel[!duplicated(fire_panel$incident_number),] # Removing Duplicates

# Flag whether each fire falls inside a redevelopment certified area
fire_panel <-fire_panel %>%
  mutate(redevelopment_area = case_when(
    is.na(NAME) ~ 0, # Is not in a redevelopment certified area
    TRUE ~ 1 # Is in a redevelopment certified area
  ))

fire_panel <- fire_panel %>% dplyr::select(-NAME)
```

### Demographics
```{r Demographics}
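# Read the Census API key from an environment variable instead of hard-coding it in the document
# (assumes CENSUS_API_KEY is set, e.g. in ~/.Renviron)
census_api_key(Sys.getenv("CENSUS_API_KEY"), overwrite = TRUE)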

## Income, Age, White, Black, AIAN, Asian, NHPI, Other, Two or More, Hispanic
# B01002_001 = median age; B03002_012 = Hispanic or Latino (any race)
acs_fe <- c("B19013_001", "B01002_001", "B02001_002", "B02001_003", "B02001_004", "B02001_005", "B02001_006", "B02001_007", "B02001_008","B03002_012")

acs_data <- load_variables(year = 2020, dataset = "acs5", cache = TRUE) # Variable lookup table, same year as the pull below
acs_features <- get_acs(geography = "tract",
                             year = 2020, 
                             variables = acs_fe, # Income, Age, Race
                             geometry = TRUE, 
                             state = "PA", 
                             county = "Philadelphia", 
                             output = "wide") %>% st_transform(st_crs(neighborhoods))
acs_features <- acs_features %>% rename(income = B19013_001E, 
                                        age = B01002_001E, 
                                        white = B02001_002E,
                                        black = B02001_003E,
                                        AIAN = B02001_004E,
                                        asian = B02001_005E,
                                        NHPI = B02001_006E,
                                        other = B02001_007E,
                                        multi = B02001_008E,
                                        hispanic = B03002_012E) %>% 
                      dplyr::select(NAME, income, age, white, black, AIAN, asian, NHPI, other, multi, hispanic)

## Percent Change in White Population
vars <- c("B03002_003") #Variable for white alone, not Hispanic or Latino
# The API key is already registered by census_api_key(), so no key argument is needed here
philly_data_2010 <- get_acs(geography = "tract",
                        variables = vars,
                        year = 2010,
                        state = "PA",
                        county = "Philadelphia County") %>% dplyr::select(NAME, estimate) %>% rename(estimate_2010 = estimate)
philly_data_2019 <- get_acs(geography = "tract",
                        variables = vars,
                        year = 2019,
                        state = "PA",
                        county = "Philadelphia County") %>% dplyr::select(NAME, estimate) %>% rename(estimate_2019 = estimate)
philly_data <- left_join(philly_data_2010, philly_data_2019, by="NAME" ) # Joining frames together
philly_data <- philly_data %>% mutate(white_change = (estimate_2019 - estimate_2010)/estimate_2010) # Calculation
philly_data$white_change[!is.finite(philly_data$white_change)] <- 0 # Getting rid of Inf
philly_data$white_change <- round(philly_data$white_change, digits = 2) # Rounding

# Joins
acs_features_full <- left_join(acs_features, philly_data, by="NAME") %>% dplyr::select(-estimate_2010, -estimate_2019)
fire_panel <- st_join(fire_panel, acs_features_full) 
fire_panel <- fire_panel[!duplicated(fire_panel$incident_number),] # Removing Duplicates

## Poverty
fire_panel <-
  fire_panel %>%
  mutate(poverty = case_when( 
    income < 25000 ~ "Yes", 
    TRUE ~ "No")) 

## Income Level - 40k is best so far
fire_panel <-
  fire_panel %>%
  mutate(income_level = case_when(
    income < 40000 ~ "Low", 
    TRUE ~ "High")) 

## Price per sqft
fire_panel <-
  fire_panel %>%
  mutate(value_sqft = market_value/total_livable_area) 

fire_panel$value_sqft[!is.finite(fire_panel$value_sqft)] <- 0
fire_panel_nogeo <- fire_panel %>% st_drop_geometry()
fire_panel <- left_join(fire_panel_nogeo, dat_geo, by="address")
fire_panel <- fire_panel[!duplicated(fire_panel$incident_number),] # Removing Duplicates
fire_panel_nogeo <- fire_panel %>% st_drop_geometry()
```
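
The percent change calculation above sets any non-finite result to 0, which makes a tract that started with zero population look the same as a tract with no change. A minimal sketch of the alternative, on made-up estimates (`toy_2010` and `toy_2019` are hypothetical), is to keep those cases as `NA` and decide later how to treat them.

```{r Sketch Percent Change}
# Hypothetical tract estimates for two years
toy_2010 <- c(100, 0, 0, 50)
toy_2019 <- c(150, 20, 0, 25)

# Same formula as white_change: (new - old) / old
toy_change <- (toy_2019 - toy_2010) / toy_2010  # second value is Inf, third is NaN

# Mark undefined changes as missing instead of forcing them to 0
toy_change[!is.finite(toy_change)] <- NA
toy_change <- round(toy_change, digits = 2)
```

Whether `NA` or 0 is the better choice depends on how the model handles missing values, so this is a design option rather than a correction.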

### Schools
```{r Schools}
schools <- st_read("/Users/myronbanez/Desktop/Coding/PhilaFireData/Schools.geojson") %>% st_transform(crs = 3652)

# Setting up nearest neighbors
st_c <- st_coordinates
nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist # FNN: distances from each "from" point to its k nearest "to" points
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}

fire_panel <- fire_panel %>% st_as_sf() %>% st_transform(crs = 3652)

fire_panel <-
  fire_panel %>% 
    mutate(
      schools_nn1 = nn_function(st_c(fire_panel), st_c(schools), 1), 
      schools_nn2 = nn_function(st_c(fire_panel), st_c(schools), 2), 
      schools_nn3 = nn_function(st_c(fire_panel), st_c(schools), 3), 
      schools_nn4 = nn_function(st_c(fire_panel), st_c(schools), 4), 
      schools_nn5 = nn_function(st_c(fire_panel), st_c(schools), 5)) 

fire_panel_nogeo <- fire_panel %>% st_drop_geometry() # select(-geometry) does not drop the sticky sf geometry column

#write.csv(fire_panel_nogeo, "~/Desktop/Coding/fire_panel_nogeo.csv")
```
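
The `schools_nn` features above are the mean distance from each fire to its k nearest schools. A small sketch with made-up coordinates (`toy_fire_xy` and `toy_school_xy` are hypothetical) shows what `FNN::get.knnx()` returns and how the mean is taken; it assumes the FNN package is installed.

```{r Sketch Nearest Neighbor Distance}
library(FNN)

# Hypothetical coordinates in a projected CRS (one row per point: x, y)
toy_fire_xy <- matrix(c(0, 0,
                        10, 0), ncol = 2, byrow = TRUE)
toy_school_xy <- matrix(c(3, 4,
                          0, 10,
                          10, 5), ncol = 2, byrow = TRUE)

# For each fire (query), find the 2 nearest schools (data)
toy_knn <- get.knnx(data = toy_school_xy, query = toy_fire_xy, k = 2)

# nn.dist has one row per fire and one column per neighbor; average across columns
toy_nn2 <- rowMeans(toy_knn$nn.dist)
toy_nn2
```

Because the distances are in the units of the coordinate system, the features in `fire_panel` are in feet when the data are projected to a State Plane (ftUS) CRS.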

# VARIABLES 
```{r}
# NON-FE
#Continuous: severity_index, total_livable_area, market_value, sale_price, taxable_building

#Categorical: number_of_bedrooms, number_of_bathrooms, number_stories, interior_condition, year_built, exterior_condition, fireplaces

# FE
#Categorical: redevelopment_area, grade, fuel_type, topo, electric_heater, air_central, cat_code, owner_occ, fireplace, mkt_value, bedrooms, bathrooms, area, built_year, condition, sale_value, resident_type, poverty, income_level

#Continuous: value_sqft, schools_nn1, schools_nn2, schools_nn3, schools_nn4, schools_nn5,income, age, white, black, AIAN, asian, NHPI, other, multi, hispanic, mSinceFire

train_vacant %>% 
  dplyr::select(vacant, severity_index, total_livable_area, market_value, sale_price, taxable_building, income, age, white, black, AIAN, asian, NHPI, other, multi, hispanic ) %>% 
  gather(Variable, Value, -vacant) %>% 
   ggplot(aes(Value, vacant)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Vacancy as a function of continuous variables", 
          caption="Figure 4.1") +
     plotTheme()
```

```{r Saving Variables}
#Variables that are in the dataset without feature engineering.
variables_raw <- c("total_livable_area", "market_value", "sale_price", "number_of_bedrooms", "number_of_bathrooms", 
                   "number_stories", "interior_condition", "quality_grade", "year_built", "central_air", "exterior_condition", 
                   "fireplaces", "fuel", "taxable_building", "topography", "type_heater", "cat_code", "owner_occ", #OPA DATA
                  "mSinceFire", "ySinceFire", #FIRE DATA
                  "neighborhood", #NEIGHBORHOOD DATA
                  "income", "age", "white", "black", "AIAN", "asian", "NHPI", "other", "multi", "hispanic", "estimate_2010", 
                  "estimate_2019" #ACS DATA
                  ) 
#address, incident_number, count, quarter, outcome_vacant, outcome_permit, outcome_transfer

#Variables that are in the dataset with feature engineering.
variables_engineered <- c("grade", "fuel_type", "topo", "electric_heater", "air_central", "fireplace", "mkt_value", "bedrooms", "bathrooms", "area", "built_year", "condition", "sale_value", "resident_type", "redevelopment_area", "poverty", "income_level", "white_change", "value_sqft", "schools_nn1", "schools_nn2", "schools_nn3", "schools_nn4", "schools_nn5") 
#vacant, permit, transfer
```