

#Making a Dumpster Fire Companion Notebook<br>
###Objective
Inspired by a local arsonist's spree of 12 dumpster fires, I'll explore the records of 911 calls to Seattle's Fire Department. I'll attempt to answer the questions:
_Just how common are dumpster fires? Are dumpster fire series, like the one above, uncommon?_


###Sources
[911 Calls](https://data.seattle.gov/Public-Safety/Seattle-Real-Time-Fire-911-Calls/kzjm-xkqj), [Code Types](https://data.seattle.gov/Public-Safety/SFD-Type-Codes-Standard-Response/mati-fqsc), 
[Inspiring Incident](https://www.capitolhillseattle.com/2019/12/police-seek-help-investigating-overnight-arson-string-after-broadway-trash-fire-arrest/)

###Outline
Explore and Clean<br>
Feature Engineering 1<br>
Visualization 1<br>
Feature Engineering 2 <br>
Visualization 2<br>
Further Analysis<br>

In [0]:
#old habits die hard, here is an import block
import pandas as pd
import requests
import datetime
import plotly.express as px
from geopy.distance import distance

##Exploration and Cleaning

In [87]:
#Uses socrata api, ref paging instructions here: 
#https://dev.socrata.com/docs/paging.html
#it appears from the documentation that there is no size or rate limit to using 
#socrata api.
#Let's get gutsy and just do it in one call lol

params = {'$limit': 5000000, '$offset': 0}
url = 'https://data.seattle.gov/resource/fire-911.json'
response = requests.get(url, params = params)
response

<Response [200]>
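
The paging scheme referenced in the comments above boils down to repeating the request with a growing `$offset` until a page comes back short. Here is a minimal sketch of that loop, assuming a hypothetical `fetch_page(limit, offset)` callable that returns one page of rows as a list. In this notebook it would wrap something like `requests.get(url, params = {'$limit': limit, '$offset': offset}).json()`. The name `fetch_all` and the page size are illustrative, not part of the Socrata API.

```python
def fetch_all(fetch_page, page_size=50000):
    """Collect every row by paging until a short (or empty) page comes back.

    fetch_page is a hypothetical callable: fetch_page(limit, offset) -> list.
    """
    rows = []
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        rows.extend(page)
        if len(page) < page_size:
            # A short page means we've reached the end of the data set.
            break
        offset += page_size
    return rows


# Stand-in for the real API so the sketch runs offline: 12 fake rows.
fake_rows = [{'incident_number': n} for n in range(12)]


def fake_fetch_page(limit, offset):
    return fake_rows[offset:offset + limit]


all_rows = fetch_all(fake_fetch_page, page_size=5)
print(len(all_rows))
```

Paging this way trades one huge response for several smaller ones, which is easier to retry if a request fails partway through.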

In [88]:
#got a 200 code, so things are looking good. Lets check that object length
len(response.json())

1434311

In [89]:
#okay, thats the 1.43 million as per the website. lets slap it in a dataframe
df = pd.DataFrame(response.json())
print(df.shape)
df.head()

(1434311, 12)


Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,:@computed_region_ru88_fbhk,:@computed_region_kuhn_3gp2,:@computed_region_q256_3sug,:@computed_region_2day_rhn5,:@computed_region_cyqu_gs94
0,Nw 59th St / 15th Av Nw,Triaged Incident,2019-10-11T23:34:00.000,47.67163,-122.376212,"{'type': 'Point', 'coordinates': [-122.376212,...",F190108891,4,1,18386,,
1,1431 Minor Av,Aid Response,2019-12-27T01:24:00.000,47.613247,-122.327187,"{'type': 'Point', 'coordinates': [-122.327187,...",F190136279,8,12,18081,,
2,1402 3rd Av,Rescue Elevator,2019-05-30T11:22:00.000,47.608766,-122.336894,"{'type': 'Point', 'coordinates': [-122.336894,...",F190056035,14,24,18081,,
3,500 17th Av,Aid Response,2019-05-30T11:23:00.000,47.606176,-122.310249,"{'type': 'Point', 'coordinates': [-122.310249,...",F190056036,9,17,19578,,
4,201 Occidental Av S,Triaged Incident,2019-11-12T12:33:00.000,47.600873,-122.332877,"{'type': 'Point', 'coordinates': [-122.332877,...",F190120585,49,20,18379,,


In [90]:
#didn't expect all the computed region columns. I'll go ahead and drop those
#shape is 12 wide, I'm keeping 7 columns.
dropped = df.columns[7:]
df.drop(dropped, axis = 1, inplace = True)
df.shape

(1434311, 7)

In [91]:
#okay some nans, good to know. I'm keeping df as close to original as possible,
#so I'm going to leave nans in here. I'll have to adjust for nans in my working 
#dataframes
df.isnull().sum()

address             10
type                 0
datetime             0
latitude           239
longitude          239
report_location    229
incident_number      0
dtype: int64

In [92]:
df['type'].value_counts()

Aid Response                  703823
Medic Response                274922
Auto Fire Alarm                82652
Trans to AMR                   63810
Aid Response Yellow            31210
                               ...  
Rescue Lock In/Out Yellow          1
Mutual Aid Strike Team Lad         1
Mutual Aid, Aircraft               1
Explosion Unk Situation            1
Tunnel North Ops Bldg              1
Name: type, Length: 222, dtype: int64

In [93]:
#how common are these?
df[df['type'] == 'Dumpster Fire'].shape[0]/df.shape[0] * 100

0.11719912905917894

In [0]:
#looks like dumpster fires are pretty rare
dumpster = df[df['type'] == 'Dumpster Fire'].copy()

In [95]:
print(dumpster.shape)
dumpster.head()

(1681, 7)


Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number
511,900 S Jackson St,Dumpster Fire,2019-07-20T19:36:00.000,47.599191,-122.321025,"{'type': 'Point', 'coordinates': [-122.321025,...",F190076831
1598,1723 Summit Av,Dumpster Fire,2019-08-10T03:08:00.000,47.616656,-122.325547,"{'type': 'Point', 'coordinates': [-122.325547,...",F190084947
2821,S Michigan St / Corson Av S,Dumpster Fire,2019-11-02T02:31:00.000,47.547503,-122.321439,"{'type': 'Point', 'coordinates': [-122.321439,...",F190116740
5832,714 7th Av,Dumpster Fire,2019-09-11T06:05:00.000,47.605435,-122.327711,"{'type': 'Point', 'coordinates': [-122.327711,...",F190097400
6138,4231 8th Av Ne,Dumpster Fire,2019-07-22T13:07:00.000,47.658112,-122.319788,"{'type': 'Point', 'coordinates': [-122.319788,...",F190077473


In [0]:
#okay lets pull out the month and year to do some analysis
dumpster['year'] = pd.DatetimeIndex(dumpster['datetime']).year
dumpster['month'] = pd.DatetimeIndex(dumpster['datetime']).month

In [0]:
#okay this is a live data set, and rather than get the paging exactly right,
#I'll just drop any calls from 2020 and beyond.
dumpster = dumpster[dumpster['year'] < 2020]

In [98]:
print(dumpster.shape)
dumpster.tail()

(1678, 9)


Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,year,month
1429898,229 Broadway E,Dumpster Fire,2019-04-12T06:02:59.000,47.620242,-122.320878,"{'type': 'Point', 'coordinates': [-122.320878,...",F190036651,2019,4
1430007,8th Av N / Harrison St,Dumpster Fire,2019-04-12T15:33:03.000,47.622051,-122.341066,"{'type': 'Point', 'coordinates': [-122.341066,...",F190036792,2019,4
1431931,2711 Franklin Av E,Dumpster Fire,2019-04-19T22:53:14.000,47.64461,-122.324624,"{'type': 'Point', 'coordinates': [-122.324624,...",F190039329,2019,4
1432615,8532 15th Av Nw,Dumpster Fire,2019-04-22T17:11:47.000,47.690901,-122.376806,"{'type': 'Point', 'coordinates': [-122.376806,...",F190040225,2019,4
1433125,Brooklyn Av Ne / Ne 47th St,Dumpster Fire,2019-04-24T19:30:00.000,47.6631,-122.314238,"{'type': 'Point', 'coordinates': [-122.314238,...",F190041023,2019,4


In [99]:
#well we lucked out anyways, no nans pulled
dumpster.isnull().sum()

address            0
type               0
datetime           0
latitude           0
longitude          0
report_location    0
incident_number    0
year               0
month              0
dtype: int64

###Exploratory Visualizations
Small side track to see which months have the most fires reported.

In [106]:
#okay lets make a crosstab.
tot_fire_p_month = pd.crosstab(dumpster['type'], dumpster['month'])
tot_fire_p_month

month,1,2,3,4,5,6,7,8,9,10,11,12
type,,,,,,,,,,,,
Dumpster Fire,96,99,141,125,159,184,218,163,149,122,91,131


In [107]:
#boom, lets make that digestible for plotly
#I'm transposing, putting the month into a column (instead of as an index), 
#and renaming the columns for better display.

columns = {'month' : 'Month', 'Dumpster Fire': 'Total Dumpster Fires'}
tot_fire_p_month = tot_fire_p_month.T.reset_index().rename(columns = columns)
tot_fire_p_month

type,Month,Total Dumpster Fires
0,1,96
1,2,99
2,3,141
3,4,125
4,5,159
5,6,184
6,7,218
7,8,163
8,9,149
9,10,122


In [109]:
#okay so this is the total number of dumpster fires reported over the years of
#2003-2019.  Kind of cool to look at, but normalized might be better for 
#understanding how much of an anomaly it was, if it was an anomaly.

fig = px.bar(tot_fire_p_month, x = 'Month', y = 'Total Dumpster Fires')
fig.show()

In [110]:
#Making a normalized crosstab
avg_fire_p_month = pd.crosstab(dumpster['type'], dumpster['month'], 
                               normalize = True)
avg_fire_p_month

month,1,2,3,4,5,6,7,8,9,10,11,12
type,,,,,,,,,,,,
Dumpster Fire,0.057211,0.058999,0.084029,0.074493,0.094756,0.109654,0.129917,0.097139,0.088796,0.072706,0.054231,0.078069


In [111]:
#for mah boy plotly
columns = {'month' : 'Month', 'Dumpster Fire': 'Average Dumpster Fires'}
avg_fire_p_month = avg_fire_p_month.T.reset_index().rename(columns = columns)
avg_fire_p_month

type,Month,Average Dumpster Fires
0,1,0.057211
1,2,0.058999
2,3,0.084029
3,4,0.074493
4,5,0.094756
5,6,0.109654
6,7,0.129917
7,8,0.097139
8,9,0.088796
9,10,0.072706


In [112]:
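#damn, december is on the low end: only about 7.8% of all dumpster fire calls.
#heads up: normalize = True gives each month's share of the total, not an
#average count, so this chart alone can't tell us how many times over a normal
#december our arson night was.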

fig = px.bar(avg_fire_p_month, x = 'Month', y = 'Average Dumpster Fires')
fig.show()
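
One caveat on reading the normalized chart: `normalize = True` divides each count by the grand total, so the bars are each month's share of all dumpster fire calls, not an average number of fires per month. A rough sketch of the difference, using the counts from the first crosstab and assuming the data covers 17 Decembers (2003 through 2019, inclusive):

```python
# Monthly totals copied from the crosstab output above (2003-2019).
monthly_totals = {1: 96, 2: 99, 3: 141, 4: 125, 5: 159, 6: 184,
                  7: 218, 8: 163, 9: 149, 10: 122, 11: 91, 12: 131}

total_calls = sum(monthly_totals.values())

# What normalize = True reports: each month's share of all calls.
december_share = monthly_totals[12] / total_calls

# A per-year average needs the number of Decembers in the data.
# Assumption: 17 of them, 2003 through 2019 inclusive.
n_decembers = 2019 - 2003 + 1
avg_per_december = monthly_totals[12] / n_decembers

print(f'share of all calls: {december_share:.6f}')
print(f'average calls per December: {avg_per_december:.1f}')
```

By that measure a typical December sees roughly eight dumpster fire calls in total (the 2019 spree is included in that count), so a single night with nine or more calls already exceeds a normal month.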

##Feature Engineering 1

In [0]:
#Okay lets get back on track.
#Just plough through these, sorted by datetime and give a simple boolean value 
#for if each call was within 24 hours.
#Then just group each set, and count how many calls in a row.

#check if two dates are within 24 hours
def within_24(earlier, later):
  return (later - earlier) <= datetime.timedelta(hours = 24)

In [114]:
#indexes are still showing from the original data set, just going to reset these
#for ease of processing and using iloc.
dumpster.sort_values('datetime', inplace = True)
dumpster.reset_index(drop = True, inplace = True)
dumpster.head()

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,year,month
0,747 Broadway,Dumpster Fire,2003-11-10T15:43:04.000,47.608621,-122.320756,"{'type': 'Point', 'coordinates': [-122.320756,...",F030082401,2003,11
1,32nd Av E / E Madison St,Dumpster Fire,2003-11-14T23:45:39.000,47.62704,-122.290929,"{'type': 'Point', 'coordinates': [-122.290929,...",F030083954,2003,11
2,6TH AV / PIKE ST,Dumpster Fire,2003-11-18T09:10:42.000,47.611142,-122.334464,"{'type': 'Point', 'coordinates': [-122.334464,...",F030085104,2003,11
3,6th Av / Pike St,Dumpster Fire,2003-11-18T09:10:57.000,47.611142,-122.334464,"{'type': 'Point', 'coordinates': [-122.334464,...",F030085103,2003,11
4,Stone Way N / N 45th St,Dumpster Fire,2003-11-18T15:38:25.000,47.661385,-122.342145,"{'type': 'Point', 'coordinates': [-122.342145,...",F030085264,2003,11


In [0]:
#need some actual datetime objects to do this so lets just convert that
#datetime column into datetime objects.
dumpster['date_obj'] = pd.DatetimeIndex(dumpster['datetime'])

In [116]:
dumpster.head()

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,year,month,date_obj
0,747 Broadway,Dumpster Fire,2003-11-10T15:43:04.000,47.608621,-122.320756,"{'type': 'Point', 'coordinates': [-122.320756,...",F030082401,2003,11,2003-11-10 15:43:04
1,32nd Av E / E Madison St,Dumpster Fire,2003-11-14T23:45:39.000,47.62704,-122.290929,"{'type': 'Point', 'coordinates': [-122.290929,...",F030083954,2003,11,2003-11-14 23:45:39
2,6TH AV / PIKE ST,Dumpster Fire,2003-11-18T09:10:42.000,47.611142,-122.334464,"{'type': 'Point', 'coordinates': [-122.334464,...",F030085104,2003,11,2003-11-18 09:10:42
3,6th Av / Pike St,Dumpster Fire,2003-11-18T09:10:57.000,47.611142,-122.334464,"{'type': 'Point', 'coordinates': [-122.334464,...",F030085103,2003,11,2003-11-18 09:10:57
4,Stone Way N / N 45th St,Dumpster Fire,2003-11-18T15:38:25.000,47.661385,-122.342145,"{'type': 'Point', 'coordinates': [-122.342145,...",F030085264,2003,11,2003-11-18 15:38:25


In [117]:
#can't really think of how to do this without good old fashioned indexing
#so I'm going to do that for now.  Would love to find a better way to do this
#the - 1 in the range is to compensate for looking one index ahead
#I'm comparing the index, to the index+1, and seeing if they are within 24
#hours, then storing that truth value in a list to be added to our dataframe

#we are assuming all calls are greater than 24 hours apart, and only changing
#when they aren't
booleans = [False] * dumpster.shape[0]

for i in range(dumpster.shape[0] - 1):
  if within_24(dumpster['date_obj'].iloc[i], dumpster['date_obj'].iloc[i + 1]):
    booleans[i] = True
    booleans[i+1] = True
  else:
    continue

print(len(booleans))
print(dumpster.shape[0])

1678
1678
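
One possible loop-free version of the check above, sketched with pandas' `Series.diff`: compare each timestamp with its predecessor, then shift the result back one row so that both members of a close pair get flagged. The sample timestamps are the first five rows of `dumpster`; the variable names are illustrative.

```python
import pandas as pd

# First five call times from the sorted dumpster frame above.
times = pd.Series(pd.to_datetime([
    '2003-11-10T15:43:04.000',
    '2003-11-14T23:45:39.000',
    '2003-11-18T09:10:42.000',
    '2003-11-18T09:10:57.000',
    '2003-11-18T15:38:25.000',
]))

# True where a call came within 24 hours of the call before it.
# The first row has no predecessor (NaT), which compares as False.
close_to_prev = times.diff() <= pd.Timedelta(hours=24)

# True where the *next* call came within 24 hours of this one.
close_to_next = close_to_prev.shift(-1, fill_value=False)

flags = (close_to_prev | close_to_next).tolist()
print(flags)
```

On these five rows the flags agree with the first five values produced by the loop.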


In [118]:
#length matches shape, so I did it right.
print(booleans)

[False, False, True, True, True, False, False, True, True, False, True, True, True, True, True, True, True, True, True, True, False, False, False, False, False, False, True, True, False, False, False, False, False, True, True, False, False, False, False, False, False, True, True, True, True, True, True, False, True, True, False, True, True, False, False, True, True, False, False, False, False, False, False, False, False, False, False, True, True, True, True, True, True, False, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, False, False, True, True, False, False, False, False, False, False, False, True, True, False, True, True, True, False, True, True, True, True, False, False, False, False, False, False, False, True, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True, False, False, True, True, False, False, False, False, False, False, False, True, True, False, F

In [119]:
#okay I don't think that was a mistake so much as a choice that isn't very
#meaningful: a boolean can't tell one group from the next.  Lets try this
#again, and I will simply label each group by its initial index value. This is
#getting very c-like, and not very pythonic
groups = [0] * dumpster.shape[0]

#I need to dynamically change i, so it will be a while loop
i = 0
while i < dumpster.shape[0]:

  #i + 1 because we can't be comparing the same index
  for j in range(i + 1, dumpster.shape[0]):
    if within_24(dumpster['date_obj'].iloc[i], dumpster['date_obj'].iloc[j]):
      groups[i] = i
      groups[j] = i
    else:
      
      #ugh god this is so jank. -1 to account for the manually incrementing
      i = j - 1
      break
  i += 1

#I think this might be what we are looking for
print(len(groups))
print(dumpster.shape[0])


1678
1678


In [120]:
print(groups)

[0, 0, 2, 2, 2, 0, 0, 7, 7, 0, 10, 10, 10, 13, 13, 15, 15, 17, 17, 17, 0, 0, 0, 0, 0, 0, 26, 26, 0, 0, 0, 0, 0, 33, 33, 0, 0, 0, 0, 0, 0, 41, 41, 43, 43, 43, 0, 0, 48, 48, 0, 51, 51, 0, 0, 55, 55, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 67, 67, 69, 69, 71, 71, 0, 74, 74, 74, 0, 0, 79, 79, 81, 81, 81, 84, 84, 86, 86, 86, 89, 89, 0, 0, 93, 93, 0, 0, 0, 0, 0, 0, 0, 102, 102, 0, 105, 105, 105, 0, 109, 109, 111, 111, 0, 0, 0, 0, 0, 0, 0, 120, 120, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 136, 136, 0, 0, 0, 141, 141, 0, 0, 0, 0, 0, 0, 0, 150, 150, 0, 0, 0, 0, 0, 0, 0, 0, 160, 160, 0, 0, 0, 0, 0, 0, 0, 0, 170, 170, 0, 0, 0, 0, 176, 176, 176, 0, 0, 181, 181, 181, 181, 185, 185, 185, 188, 188, 188, 188, 188, 188, 188, 0, 196, 196, 196, 196, 0, 201, 201, 0, 204, 204, 204, 0, 208, 208, 0, 0, 0, 0, 214, 214, 0, 0, 218, 218, 220, 220, 0, 0, 0, 0, 0, 227, 227, 227, 0, 231, 231, 231, 231, 235, 235, 0, 0, 0, 240, 240, 240, 240, 244, 244, 244, 0, 0, 0, 0, 251, 251, 0, 0, 0, 0, 0, 0, 0, 0, 261, 261, 0, 0, 0, 0, 

In [121]:
#for manual eyeball comparison, looks like I'm getting the groupings I expect.
#especially group 1666, that is our serial dumpster arson.
dumpster.tail(15)

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,year,month,date_obj
1663,520 2nd Av W,Dumpster Fire,2019-11-28T07:19:00.000,47.623539,-122.35933,"{'type': 'Point', 'coordinates': [-122.35933, ...",F190126059,2019,11,2019-11-28 07:19:00
1664,515 1st Av W,Dumpster Fire,2019-11-28T07:24:00.000,47.623458,-122.358023,"{'type': 'Point', 'coordinates': [-122.358023,...",F190126061,2019,11,2019-11-28 07:24:00
1665,504 E Denny Way,Dumpster Fire,2019-12-03T12:26:00.000,47.618488,-122.325234,"{'type': 'Point', 'coordinates': [-122.325234,...",F190127833,2019,12,2019-12-03 12:26:00
1666,1214 Boylston Av,Dumpster Fire,2019-12-17T21:17:00.000,47.612253,-122.323199,"{'type': 'Point', 'coordinates': [-122.323199,...",F190132873,2019,12,2019-12-17 21:17:00
1667,1308 SENECA ST,Dumpster Fire,2019-12-17T21:24:00.000,47.611732,-122.324132,"{'type': 'Point', 'coordinates': [-122.324132,...",F190132878,2019,12,2019-12-17 21:24:00
1668,801 Spring St,Dumpster Fire,2019-12-17T21:36:00.000,47.608663,-122.329113,"{'type': 'Point', 'coordinates': [-122.329113,...",F190132883,2019,12,2019-12-17 21:36:00
1669,1000 8th Av,Dumpster Fire,2019-12-17T21:39:00.000,47.607945,-122.328471,"{'type': 'Point', 'coordinates': [-122.328471,...",F190132885,2019,12,2019-12-17 21:39:00
1670,1000 8th Av,Dumpster Fire,2019-12-17T21:41:00.000,47.607945,-122.328471,"{'type': 'Point', 'coordinates': [-122.328471,...",F190132888,2019,12,2019-12-17 21:41:00
1671,211 1st Av S,Dumpster Fire,2019-12-17T22:28:00.000,47.600788,-122.334182,"{'type': 'Point', 'coordinates': [-122.334182,...",F190132896,2019,12,2019-12-17 22:28:00
1672,2nd Av S / S Jackson St,Dumpster Fire,2019-12-17T22:37:00.000,47.599201,-122.331578,"{'type': 'Point', 'coordinates': [-122.331578,...",F190132898,2019,12,2019-12-17 22:37:00


In [122]:
#Since they passed the eyeball check, my next step would be to throw these in a 
#column, and then do a groupby/ value_count on that column to see if there are 
#any groups that even come close to 1666

dumpster['calls_within_24h'] = groups
dumpster.tail(15)

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,year,month,date_obj,calls_within_24h
1663,520 2nd Av W,Dumpster Fire,2019-11-28T07:19:00.000,47.623539,-122.35933,"{'type': 'Point', 'coordinates': [-122.35933, ...",F190126059,2019,11,2019-11-28 07:19:00,1662
1664,515 1st Av W,Dumpster Fire,2019-11-28T07:24:00.000,47.623458,-122.358023,"{'type': 'Point', 'coordinates': [-122.358023,...",F190126061,2019,11,2019-11-28 07:24:00,1662
1665,504 E Denny Way,Dumpster Fire,2019-12-03T12:26:00.000,47.618488,-122.325234,"{'type': 'Point', 'coordinates': [-122.325234,...",F190127833,2019,12,2019-12-03 12:26:00,0
1666,1214 Boylston Av,Dumpster Fire,2019-12-17T21:17:00.000,47.612253,-122.323199,"{'type': 'Point', 'coordinates': [-122.323199,...",F190132873,2019,12,2019-12-17 21:17:00,1666
1667,1308 SENECA ST,Dumpster Fire,2019-12-17T21:24:00.000,47.611732,-122.324132,"{'type': 'Point', 'coordinates': [-122.324132,...",F190132878,2019,12,2019-12-17 21:24:00,1666
1668,801 Spring St,Dumpster Fire,2019-12-17T21:36:00.000,47.608663,-122.329113,"{'type': 'Point', 'coordinates': [-122.329113,...",F190132883,2019,12,2019-12-17 21:36:00,1666
1669,1000 8th Av,Dumpster Fire,2019-12-17T21:39:00.000,47.607945,-122.328471,"{'type': 'Point', 'coordinates': [-122.328471,...",F190132885,2019,12,2019-12-17 21:39:00,1666
1670,1000 8th Av,Dumpster Fire,2019-12-17T21:41:00.000,47.607945,-122.328471,"{'type': 'Point', 'coordinates': [-122.328471,...",F190132888,2019,12,2019-12-17 21:41:00,1666
1671,211 1st Av S,Dumpster Fire,2019-12-17T22:28:00.000,47.600788,-122.334182,"{'type': 'Point', 'coordinates': [-122.334182,...",F190132896,2019,12,2019-12-17 22:28:00,1666
1672,2nd Av S / S Jackson St,Dumpster Fire,2019-12-17T22:37:00.000,47.599201,-122.331578,"{'type': 'Point', 'coordinates': [-122.331578,...",F190132898,2019,12,2019-12-17 22:37:00,1666


In [123]:
#huh 1666 is a bit smaller than I expected from the article.  Looking at the
#tail calls above, it seems like not all of the 12 fires were called into 911.
#good to know
dumpster['calls_within_24h'].value_counts()

0       849
1666      9
579       8
188       7
1621      6
       ... 
1065      2
1067      2
1070      2
1075      2
872       2
Name: calls_within_24h, Length: 355, dtype: int64

In [124]:
dumpster['calls_within_24h'].value_counts().value_counts()

2      273
3       57
4       18
5        2
849      1
9        1
8        1
7        1
6        1
Name: calls_within_24h, dtype: int64

##Visualization 1

In [125]:
#this might make an interesting graph. 
call_groups = pd.DataFrame(dumpster['calls_within_24h'].value_counts().value_counts())

call_groups

Unnamed: 0,calls_within_24h
2,273
3,57
4,18
5,2
849,1
9,1
8,1
7,1
6,1


In [0]:
call_groups.reset_index(inplace = True)

In [0]:
call_groups.columns = ['Number of calls within 24 hours', 'Occurences']

###Manual Edits
For the sake of time, I'm going to leave my original feature code alone.  Instead, I'll describe what is happening as justification for this row edit.<br><br>

My code only gave unique labels to calls that fell within 24 hours of another call, so every single call got the same label: 0.  The value count above therefore lists 0 as though it were one group of 849 calls within 24 hours.  That is wrong; it should say there are 849 individual calls, each alone in its own 24 hour period.<br><br>
In hindsight, my original code should have issued a unique value to each of these calls.  For now, I'll take this as justification enough for the following row edit.
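
A minimal sketch of what that hindsight fix could look like, in plain Python: every call gets the index of the first call in its 24 hour window, so a lone call simply ends up in a group of size one. `label_groups` is a hypothetical helper, not the notebook's original code, and the timestamps are the first five rows of `dumpster`.

```python
from collections import Counter
from datetime import datetime, timedelta


def label_groups(times, window=timedelta(hours=24)):
    """Label each time with the index of the first call in its window.

    times must be sorted ascending. A new group starts whenever a call
    falls more than `window` after the first call of the current group.
    """
    labels = []
    start = 0
    for i, t in enumerate(times):
        if t - times[start] > window:
            start = i
        labels.append(start)
    return labels


times = [
    datetime(2003, 11, 10, 15, 43, 4),
    datetime(2003, 11, 14, 23, 45, 39),
    datetime(2003, 11, 18, 9, 10, 42),
    datetime(2003, 11, 18, 9, 10, 57),
    datetime(2003, 11, 18, 15, 38, 25),
]

labels = label_groups(times)
group_sizes = Counter(labels)                # calls per group
size_counts = Counter(group_sizes.values())  # how often each size occurs

print(labels)
print(dict(size_counts))
```

With labels like these, single calls show up directly as groups of size 1 when the group sizes are counted, so no manual row edit would be needed.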

In [128]:
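#manually fixing a wonky line due to my feature code
#row 4 is the label-0 row: its 849 is really the number of single calls, so the
#two values need to swap places. Assigning through .loc on the frame itself
#avoids chained assignment.

num = call_groups.loc[4, 'Number of calls within 24 hours']
call_groups.loc[4, 'Number of calls within 24 hours'] = 1

call_groups.loc[4, 'Occurences'] = num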
call_groups

Unnamed: 0,Number of calls within 24 hours,Occurences
0,2,273
1,3,57
2,4,18
3,5,2
4,1,849
5,9,1
6,8,1
7,7,1
8,6,1


In [129]:
fig = px.bar(call_groups, x = 'Number of calls within 24 hours', y = 'Occurences')
fig.update_xaxes(tickmode = 'linear')
fig.show()

In [130]:
#I think it makes a better graph when single calls are left in.  But good to
#explore options.

fig = px.bar(call_groups.drop(4, axis = 0), x = 'Number of calls within 24 hours', y = 'Occurences')
fig.update_xaxes(tickmode = 'linear')
fig.show()

###Unpacking Features
Because I am counting the occurrences of a "spree" or a "series" of fires, I need to multiply each occurrence count by the number of calls in that spree.  So think: a spree of 3 calls happened 5 times, so 15 calls in total came from those sprees.  The percentage for each spree size is then its total spree calls / total calls.
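
As a quick worked check of that arithmetic, here is the same calculation in plain Python, first for the hypothetical example from the paragraph above and then for the spree table (sizes and occurrences copied from the `call_groups` output):

```python
# Hypothetical example: a spree of 3 calls that happened 5 times.
example_calls = 3 * 5  # 15 calls in total came from those sprees

# spree size -> number of times a spree of that size occurred
occurrences = {1: 849, 2: 273, 3: 57, 4: 18, 5: 2, 6: 1, 7: 1, 8: 1, 9: 1}

# Calls contributed by each spree size, and the overall total.
calls_by_size = {size: size * n for size, n in occurrences.items()}
total_calls = sum(calls_by_size.values())

# Share of all calls that belong to sprees of each size.
percent_by_size = {size: 100 * calls / total_calls
                   for size, calls in calls_by_size.items()}

print(total_calls)
print(round(percent_by_size[1], 2))
```

The total coming out equal to the number of rows in `dumpster` is a handy sanity check that no call was counted twice.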

In [0]:
#okay so maybe this would be even easier for your average person to understand
#if I presented the above as a percentage, instead of as just numbers.
#step one: total calls for each spree size.
call_groups['raw_percentage'] = call_groups['Occurences'] * call_groups['Number of calls within 24 hours']

In [132]:
#double checking to make sure I'm doing it right.
call_groups['raw_percentage'].sum() == dumpster.shape[0]

True

In [0]:
#next figure out the % of each by dividing by the dumpster dataframe length.
call_groups['raw_percentage'] = call_groups['raw_percentage'].apply(lambda x: ((x * 100)/ dumpster.shape[0]))

In [0]:
#rounding
call_groups['raw_percentage'] = call_groups['raw_percentage'].apply(lambda x: round(x, 2))

In [135]:
#Okay rounding errors, I should have expected that.  There is definitely a super
#computer sciency way to deal with this. But I'm going with the classic
#pull .02 randomly from the values.
call_groups['raw_percentage'].sum()

100.02
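
One common way to make rounded percentages add up exactly is the largest remainder method: round everything down, then hand the leftover hundredths to the entries with the largest fractional parts. A minimal sketch in plain Python, working in integer hundredths of a percent to avoid float drift. `largest_remainder` is a hypothetical helper, and the call totals are the spree sizes times their occurrences from the table above.

```python
def largest_remainder(counts, units=10000):
    """Split `units` across counts proportionally, as integers.

    With units=10000 each unit is one hundredth of a percent, so the
    results always sum to exactly 100.00%.
    """
    total = sum(counts)
    floors = [c * units // total for c in counts]
    remainders = [c * units % total for c in counts]
    leftover = units - sum(floors)
    # Give one extra unit to the entries with the largest remainders.
    by_remainder = sorted(range(len(counts)),
                          key=lambda i: remainders[i], reverse=True)
    for i in by_remainder[:leftover]:
        floors[i] += 1
    return floors


# Total calls per spree size, in table order: sizes 2, 3, 4, 5, 1, 9, 8, 7, 6.
calls = [546, 171, 72, 10, 849, 9, 8, 7, 6]
hundredths = largest_remainder(calls)
labels = [f'{h // 100}.{h % 100:02d}%' for h in hundredths]

print(sum(hundredths))
print(labels)
```

This is an alternative to subtracting .01 by hand, not what the notebook does; the individual labels it produces can differ from the hand-adjusted ones by a hundredth.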

In [136]:
#To keep the notebook's output consistent, I'm always pulling from rows 1 and 3.
#Assigning through .loc on the frame itself avoids pandas' chained-assignment
#warning.

call_groups.loc[1, 'raw_percentage'] = call_groups.loc[1, 'raw_percentage'] - .01
call_groups.loc[3, 'raw_percentage'] = call_groups.loc[3, 'raw_percentage'] - .01



In [137]:
#good a clean round percentage total. Important for my visualization!
call_groups['raw_percentage'].sum()

100.0

In [0]:
#will make for a better overlay in plotly express.
call_groups['Percentage'] = call_groups['raw_percentage'].apply(lambda x: str(x) + '%')

In [170]:
call_groups['Percentage']

0    32.54%
1    10.18%
2     4.29%
3     0.59%
4     50.6%
5     0.54%
6     0.48%
7     0.42%
8     0.36%
Name: Percentage, dtype: object

In [0]:
#the color setting on plotly express takes a list of strings for color
#I only want one color, so my list is pretty short lol.
colors = ['#7a0000'] 

In [171]:
#I think this is one that will really demonstrate how much of an outlier 9
#calls in a day really is.

fig = px.bar(call_groups, x = 'Number of calls within 24 hours', 
             y = 'Occurences', text = 'Percentage', 
             title = 'Dumpster Fires Reported to 911',
             color_discrete_sequence = colors)
fig.update_xaxes(tickmode = 'linear')
fig.show()

##Feature Engineering 2

###Re-Defining
After looking at the above graph, I got the feeling that I hadn't considered enough details when deciding what features I need to really identify a series of fires.  As it stands, if there were a fire in Northgate and a fire in South Park on the same day, I'd count that as a "series."  That's just not a very realistic analysis for my questions.  So I'm going to add a distance feature to narrow this down some.

In [141]:
#so lets look at figuring out our distances of each group.
dumpster.head()

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,year,month,date_obj,calls_within_24h
0,747 Broadway,Dumpster Fire,2003-11-10T15:43:04.000,47.608621,-122.320756,"{'type': 'Point', 'coordinates': [-122.320756,...",F030082401,2003,11,2003-11-10 15:43:04,0
1,32nd Av E / E Madison St,Dumpster Fire,2003-11-14T23:45:39.000,47.62704,-122.290929,"{'type': 'Point', 'coordinates': [-122.290929,...",F030083954,2003,11,2003-11-14 23:45:39,0
2,6TH AV / PIKE ST,Dumpster Fire,2003-11-18T09:10:42.000,47.611142,-122.334464,"{'type': 'Point', 'coordinates': [-122.334464,...",F030085104,2003,11,2003-11-18 09:10:42,2
3,6th Av / Pike St,Dumpster Fire,2003-11-18T09:10:57.000,47.611142,-122.334464,"{'type': 'Point', 'coordinates': [-122.334464,...",F030085103,2003,11,2003-11-18 09:10:57,2
4,Stone Way N / N 45th St,Dumpster Fire,2003-11-18T15:38:25.000,47.661385,-122.342145,"{'type': 'Point', 'coordinates': [-122.342145,...",F030085264,2003,11,2003-11-18 15:38:25,2


In [0]:
#okay this is the furthest distance between fires on the 1666 fire spree.
#remember geopy does lat first, long second
p1 = (dumpster['latitude'].iloc[1666], dumpster['longitude'].iloc[1666])
p2 = (dumpster['latitude'].iloc[1671], dumpster['longitude'].iloc[1671])

In [0]:
d = distance(p1, p2)

In [144]:
#based off this, it is probably fair to say a spree must have all fire locations
#within one mile of each other.
d.miles

0.9437619193725458
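
For a rough cross-check that doesn't need geopy, the haversine formula gives the great-circle distance on a sphere. This is a swapped-in approximation, not what `geopy.distance.distance` computes (geopy works on an ellipsoid model of the Earth), so small differences from the figure above are expected. The helper name and the mean Earth radius of 3958.8 miles are assumptions of this sketch; the coordinates are rows 1666 and 1671 of `dumpster`.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean radius; an assumption of this sketch


def haversine_miles(p1, p2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    lat1, lon1 = (radians(float(v)) for v in p1)
    lat2, lon2 = (radians(float(v)) for v in p2)
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# Latitude first, longitude second, the same order geopy expects.
# The API returns coordinates as strings, hence the float() above.
p1 = ('47.612253', '-122.323199')  # row 1666
p2 = ('47.600788', '-122.334182')  # row 1671

d_miles = haversine_miles(p1, p2)
print(round(d_miles, 2))
```

At city scale the spherical approximation should land within a fraction of a percent of the ellipsoidal result, which is plenty for a one-mile cutoff.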

In [0]:
#okay lets clean that point column up.
def coords(row):
  return(row['latitude'], row['longitude'])

In [0]:
dumpster['report_location'] = dumpster.apply(coords, axis = 1)

In [147]:
dumpster.head()

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,year,month,date_obj,calls_within_24h
0,747 Broadway,Dumpster Fire,2003-11-10T15:43:04.000,47.608621,-122.320756,"(47.608621, -122.320756)",F030082401,2003,11,2003-11-10 15:43:04,0
1,32nd Av E / E Madison St,Dumpster Fire,2003-11-14T23:45:39.000,47.62704,-122.290929,"(47.627040, -122.290929)",F030083954,2003,11,2003-11-14 23:45:39,0
2,6TH AV / PIKE ST,Dumpster Fire,2003-11-18T09:10:42.000,47.611142,-122.334464,"(47.611142, -122.334464)",F030085104,2003,11,2003-11-18 09:10:42,2
3,6th Av / Pike St,Dumpster Fire,2003-11-18T09:10:57.000,47.611142,-122.334464,"(47.611142, -122.334464)",F030085103,2003,11,2003-11-18 09:10:57,2
4,Stone Way N / N 45th St,Dumpster Fire,2003-11-18T15:38:25.000,47.661385,-122.342145,"(47.661385, -122.342145)",F030085264,2003,11,2003-11-18 15:38:25,2


In [148]:
#I want to know of all the "spree" calls, what their distance from the original
#call was.  So I'm going to filter out to only spree calls, and then just 
#iterate and compare.

sprees = dumpster[dumpster['calls_within_24h'] != 0].copy()
print(sprees.shape)
sprees.head()

(829, 11)


Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,year,month,date_obj,calls_within_24h
2,6TH AV / PIKE ST,Dumpster Fire,2003-11-18T09:10:42.000,47.611142,-122.334464,"(47.611142, -122.334464)",F030085104,2003,11,2003-11-18 09:10:42,2
3,6th Av / Pike St,Dumpster Fire,2003-11-18T09:10:57.000,47.611142,-122.334464,"(47.611142, -122.334464)",F030085103,2003,11,2003-11-18 09:10:57,2
4,Stone Way N / N 45th St,Dumpster Fire,2003-11-18T15:38:25.000,47.661385,-122.342145,"(47.661385, -122.342145)",F030085264,2003,11,2003-11-18 15:38:25,2
7,401 2nd Av S,Dumpster Fire,2003-12-02T22:09:02.000,47.599192,-122.331578,"(47.599192, -122.331578)",F030090264,2003,12,2003-12-02 22:09:02,7
8,133 Pontius Av N,Dumpster Fire,2003-12-03T00:31:27.000,47.618887,-122.332245,"(47.618887, -122.332245)",F030090301,2003,12,2003-12-03 00:31:27,7


In [149]:
#I'm going to make an additional column, so that for each row we can see what
#the first location is.  Then later I can use that location to do an .apply()
#by row to find the distance.

first_location = []
for i in sprees.index:
  point_index = sprees['calls_within_24h'].loc[i]
  first_location.append(sprees['report_location'].loc[point_index])
print(len(first_location))
print(first_location)

829
[('47.611142', '-122.334464'), ('47.611142', '-122.334464'), ('47.611142', '-122.334464'), ('47.599192', '-122.331578'), ('47.599192', '-122.331578'), ('47.608009', '-122.302770'), ('47.608009', '-122.302770'), ('47.608009', '-122.302770'), ('47.675985', '-122.382128'), ('47.675985', '-122.382128'), ('47.597848', '-122.286930'), ('47.597848', '-122.286930'), ('47.584001', '-122.386423'), ('47.584001', '-122.386423'), ('47.584001', '-122.386423'), ('47.616761', '-122.314266'), ('47.616761', '-122.314266'), ('47.603512', '-122.315329'), ('47.603512', '-122.315329'), ('47.624593', '-122.359508'), ('47.624593', '-122.359508'), ('47.599192', '-122.325980'), ('47.599192', '-122.325980'), ('47.599192', '-122.325980'), ('47.661269', '-122.313130'), ('47.661269', '-122.313130'), ('47.574384', '-122.329059'), ('47.574384', '-122.329059'), ('47.539220', '-122.376575'), ('47.539220', '-122.376575'), ('47.612937', '-122.300058'), ('47.612937', '-122.300058'), ('47.671645', '-122.387587'), ('47.
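
The same lookup can be sketched without an explicit loop. Because each spree label is the index of the spree's first call, `Series.map` can use `report_location` itself as the lookup table. The toy frame below mirrors the first five rows of `sprees`; it is a stand-in for illustration, not the notebook's data frame.

```python
import pandas as pd

# Toy stand-in for `sprees`: the index is the original row number and
# calls_within_24h holds the row number of each spree's first call.
sprees_demo = pd.DataFrame(
    {
        'calls_within_24h': [2, 2, 2, 7, 7],
        'report_location': [
            ('47.611142', '-122.334464'),
            ('47.611142', '-122.334464'),
            ('47.661385', '-122.342145'),
            ('47.599192', '-122.331578'),
            ('47.618887', '-122.332245'),
        ],
    },
    index=[2, 3, 4, 7, 8],
)

# Mapping with a Series looks each label up in that Series' index.
sprees_demo['start_location'] = sprees_demo['calls_within_24h'].map(
    sprees_demo['report_location']
)

print(sprees_demo['start_location'].tolist())
```

This only works because the labels double as index values; with arbitrary group ids, a `groupby(...).transform('first')` would be the more general tool.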

In [150]:
#size looks good, and passes the visual inspection. Lets give these coords a
#column

sprees['start_location'] = first_location
sprees.head()

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,year,month,date_obj,calls_within_24h,start_location
2,6TH AV / PIKE ST,Dumpster Fire,2003-11-18T09:10:42.000,47.611142,-122.334464,"(47.611142, -122.334464)",F030085104,2003,11,2003-11-18 09:10:42,2,"(47.611142, -122.334464)"
3,6th Av / Pike St,Dumpster Fire,2003-11-18T09:10:57.000,47.611142,-122.334464,"(47.611142, -122.334464)",F030085103,2003,11,2003-11-18 09:10:57,2,"(47.611142, -122.334464)"
4,Stone Way N / N 45th St,Dumpster Fire,2003-11-18T15:38:25.000,47.661385,-122.342145,"(47.661385, -122.342145)",F030085264,2003,11,2003-11-18 15:38:25,2,"(47.611142, -122.334464)"
7,401 2nd Av S,Dumpster Fire,2003-12-02T22:09:02.000,47.599192,-122.331578,"(47.599192, -122.331578)",F030090264,2003,12,2003-12-02 22:09:02,7,"(47.599192, -122.331578)"
8,133 Pontius Av N,Dumpster Fire,2003-12-03T00:31:27.000,47.618887,-122.332245,"(47.618887, -122.332245)",F030090301,2003,12,2003-12-03 00:31:27,7,"(47.599192, -122.331578)"


In [0]:
def dist_from_start(row):
  return distance(row['start_location'], row['report_location']).miles

In [0]:
sprees['distance'] = sprees.apply(dist_from_start, axis = 1)

In [153]:
sprees.head()

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,year,month,date_obj,calls_within_24h,start_location,distance
2,6TH AV / PIKE ST,Dumpster Fire,2003-11-18T09:10:42.000,47.611142,-122.334464,"(47.611142, -122.334464)",F030085104,2003,11,2003-11-18 09:10:42,2,"(47.611142, -122.334464)",0.0
3,6th Av / Pike St,Dumpster Fire,2003-11-18T09:10:57.000,47.611142,-122.334464,"(47.611142, -122.334464)",F030085103,2003,11,2003-11-18 09:10:57,2,"(47.611142, -122.334464)",0.0
4,Stone Way N / N 45th St,Dumpster Fire,2003-11-18T15:38:25.000,47.661385,-122.342145,"(47.661385, -122.342145)",F030085264,2003,11,2003-11-18 15:38:25,2,"(47.611142, -122.334464)",3.489572
7,401 2nd Av S,Dumpster Fire,2003-12-02T22:09:02.000,47.599192,-122.331578,"(47.599192, -122.331578)",F030090264,2003,12,2003-12-02 22:09:02,7,"(47.599192, -122.331578)",0.0
8,133 Pontius Av N,Dumpster Fire,2003-12-03T00:31:27.000,47.618887,-122.332245,"(47.618887, -122.332245)",F030090301,2003,12,2003-12-03 00:31:27,7,"(47.599192, -122.331578)",1.361
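
As a rough cross-check on the `distance` column, here is a minimal sketch of the haversine (great-circle) formula in plain Python. This is not what `geopy.distance.distance` computes (geopy uses a geodesic on an ellipsoid by default), and the function name `haversine_miles` and the mean Earth radius of 3958.8 miles are my own assumptions, so expect small differences from the values above.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius, not taken from geopy


def haversine_miles(start, end):
    """Great-circle distance in miles between two (lat, lon) pairs.

    Accepts strings or floats, since the coordinates above are stored as strings.
    """
    lat1, lon1 = (math.radians(float(v)) for v in start)
    lat2, lon2 = (math.radians(float(v)) for v in end)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# 6th Av / Pike St to Stone Way N / N 45th St, from the rows above
pike = ('47.611142', '-122.334464')
stone_way = ('47.661385', '-122.342145')
print(round(haversine_miles(pike, stone_way), 2))
```

The result should land close to the roughly 3.49 miles geopy reported for that pair, which is a decent sign the `start_location` and `report_location` columns are being read as (latitude, longitude) in the right order.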


##Visualization 2

In [0]:
#now let's keep only the calls that are within a mile of the first call in
#their spree, as per our new definition.

less_than_mile = sprees[sprees['distance'] <= 1]

In [155]:
print(less_than_mile.shape)
less_than_mile.head()

(487, 13)


Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,year,month,date_obj,calls_within_24h,start_location,distance
2,6TH AV / PIKE ST,Dumpster Fire,2003-11-18T09:10:42.000,47.611142,-122.334464,"(47.611142, -122.334464)",F030085104,2003,11,2003-11-18 09:10:42,2,"(47.611142, -122.334464)",0.0
3,6th Av / Pike St,Dumpster Fire,2003-11-18T09:10:57.000,47.611142,-122.334464,"(47.611142, -122.334464)",F030085103,2003,11,2003-11-18 09:10:57,2,"(47.611142, -122.334464)",0.0
7,401 2nd Av S,Dumpster Fire,2003-12-02T22:09:02.000,47.599192,-122.331578,"(47.599192, -122.331578)",F030090264,2003,12,2003-12-02 22:09:02,7,"(47.599192, -122.331578)",0.0
10,23rd Av / E Cherry St,Dumpster Fire,2003-12-08T20:41:15.000,47.608009,-122.30277,"(47.608009, -122.302770)",F030092604,2003,12,2003-12-08 20:41:15,10,"(47.608009, -122.302770)",0.0
13,20th Av Nw / Nw 65th St,Dumpster Fire,2003-12-09T23:11:28.000,47.675985,-122.382128,"(47.675985, -122.382128)",F030092968,2003,12,2003-12-09 23:11:28,13,"(47.675985, -122.382128)",0.0


In [156]:
less_than_mile['calls_within_24h'].value_counts().value_counts()

1    249
2     87
3     14
4      2
9      1
5      1
Name: calls_within_24h, dtype: int64
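
The double `value_counts()` above is easy to misread, so here is a small sketch of the same idea using `collections.Counter`. The group ids are made up for illustration; only the two-step counting mirrors the cell above.

```python
from collections import Counter

# hypothetical group ids: one entry per call that survived the one-mile filter
group_ids = [2, 2, 7, 10, 10, 10, 13]

# step 1 (like the first value_counts): how many calls are left in each group
calls_per_group = Counter(group_ids)

# step 2 (like the second value_counts): how many groups have each size
groups_per_size = Counter(calls_per_group.values())

print(dict(calls_per_group))
print(dict(groups_per_size))
```

In this toy data, groups 7 and 13 each have a single call left, so they show up under size 1 in the second count. That is the row that gets dropped later, since a lone call is not a spree.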

In [157]:
#okay, I think I'm interpreting this right. Basically, if a group has only 1
#call left, that means its pair was dropped for being more than a mile off,
#which means we can ignore it when determining which sprees happened within a
#mile radius. Wow this is awesome.

final_sprees = pd.DataFrame(less_than_mile['calls_within_24h'].value_counts().value_counts())
final_sprees.head()

Unnamed: 0,calls_within_24h
1,249
2,87
3,14
4,2
9,1


In [0]:
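#drop the row with index label 1 (groups with a single call left); the extra
#axis argument is unnecessary when index is given
final_sprees.drop(index = 1, inplace = True)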

In [0]:
final_sprees.reset_index(inplace = True)

In [0]:
#lets get some nice column names for Plotly
final_sprees.rename(columns = {'index': 'Number of calls within 24 hours', 'calls_within_24h': 'Occured within a mile'}, inplace = True)

In [161]:
#lets knock out a percentage of total for this one as well.
final_sprees['raw_percentage'] = final_sprees['Occured within a mile'] * final_sprees['Number of calls within 24 hours']
final_sprees['raw_percentage'] = final_sprees['raw_percentage'].apply(lambda x: round((x * 100) / dumpster.shape[0], 2))
final_sprees['Percentage'] = final_sprees['raw_percentage'].apply(lambda x: str(x) + '%')
final_sprees['Percentage']

0    10.37%
1      2.5%
2     0.48%
3     0.54%
4      0.3%
Name: Percentage, dtype: object

In [162]:
final_sprees['raw_percentage']

0    10.37
1     2.50
2     0.48
3     0.54
4     0.30
Name: raw_percentage, dtype: float64
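
For anyone checking the math, the percentages above can be reproduced by hand. This sketch assumes the group counts from the `value_counts` output earlier and takes 1678 as the number of rows in `dumpster` (the count that `describe()` reports on the `score` column); the variable names are mine.

```python
total_dumpster_calls = 1678  # assumed to equal dumpster.shape[0]

# spree size -> number of sprees of that size within a mile
groups_by_size = {2: 87, 3: 14, 4: 2, 9: 1, 5: 1}

# calls in sprees of a given size = size * number of sprees of that size
percentages = {
    size: round(size * n_groups * 100 / total_dumpster_calls, 2)
    for size, n_groups in groups_by_size.items()
}

print(percentages)
print(round(sum(percentages.values()), 2))
```

Multiplying the spree size by the number of sprees gives a count of calls rather than a count of sprees, which is why the percentages are shares of all dumpster fire reports.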

In [172]:
#aaayyyyyy looking good. I think we will call it quits here for this project.
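fig = px.bar(final_sprees, x = 'Number of calls within 24 hours',
             y = 'Occured within a mile', text = 'Percentage',
             title = 'Dumpster Fires Reported to 911',
             #fix the spelling on the displayed axis label only
             labels = {'Occured within a mile': 'Occurred within a mile'},
             color_discrete_sequence = colors)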
fig.update_xaxes(tickmode = 'linear')
fig.show()

###Further Analysis
I planned to score each of these rows as well and run some t-tests to really drive home some of my findings, but ultimately ran out of time. I'm keeping these cells here in case I decide to come back to this project after turn-in.

In [164]:
#okay I want to score these 
dumpster['calls_within_24h'].value_counts()

0       849
1666      9
579       8
188       7
1621      6
       ... 
1065      2
1067      2
1070      2
1075      2
872       2
Name: calls_within_24h, Length: 355, dtype: int64

In [0]:
#pandas replace function can take a dictionary. So I'm going to just make a
#dictionary to turn our calls_within_24h column into a score for me to get a
#better illustration of what is typical

#value_counts() maps each group id to its number of calls; to_dict() builds
#the lookup in one pass instead of recomputing the counts inside a loop
dictionary = dumpster['calls_within_24h'].value_counts().to_dict()

In [166]:
#okay same length as our value_counts series.  I feel confident this worked
len(dictionary)

355

In [167]:
#alright, sum and value_counts look good. Other than 849, it looks like I scored
#my dataframe correctly. So I'm going to adjust the 849 value to 1, see the
#justification above.
dumpster['score'] = dumpster['calls_within_24h'].replace(dictionary)
print(dumpster['score'].value_counts().sum())
dumpster['score'].value_counts()

1678


849    849
2      546
3      171
4       72
5       10
9        9
8        8
7        7
6        6
Name: score, dtype: int64

In [168]:
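#assign the result back rather than using inplace on a column selection, which
#may not update the dataframe in newer pandas versions
dumpster['score'] = dumpster['score'].replace(849, 1)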
dumpster['score'].value_counts()

1    849
2    546
3    171
4     72
5     10
9      9
8      8
7      7
6      6
Name: score, dtype: int64

In [169]:
#going to have to stop analysis here for time.
dumpster['score'].describe()

count    1678.000000
mean        1.800954
std         1.179106
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max         9.000000
Name: score, dtype: float64

##Final Insights
Just some insights and bits to assist writing my blog post.

In [180]:
#What percent of total calls to the fire department are dumpster fires?
df[df['type'] == 'Dumpster Fire'].shape[0] / df.shape[0] * 100

0.11719912905917894

In [175]:
#what percent of all dumpster fire reports are a part of a spree
call_groups['raw_percentage'].sum() - call_groups['raw_percentage'].loc[4]

49.4

In [176]:
#what percent of all dumpster fire reports are part of a spree that stayed
#within a mile of its first call?
final_sprees['raw_percentage'].sum()

14.190000000000001

In [186]:
#what percent of all calls are a part of a dumpster fire spree?
a = df[df['type'] == 'Dumpster Fire'].shape[0] / df.shape[0] * 100
b = final_sprees['raw_percentage'].sum() 

#14.19% of .12%
(b / 100) * (a / 100) * 100

0.016630556413497494
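
To make that last number easier to quote, here is the same arithmetic as a small standalone sketch, also turned into a "one in N calls" figure. The two inputs are copied from the outputs above; the variable names are mine.

```python
dumpster_share_of_calls = 0.11719912905917894  # % of all calls that are dumpster fires
spree_share_of_dumpster = 14.19                # % of dumpster fires in a within-a-mile spree

# a percentage of a percentage: convert both to fractions, multiply, scale back to %
spree_share_of_calls = (spree_share_of_dumpster / 100) * (dumpster_share_of_calls / 100) * 100

# the same figure expressed as "one call in N"
calls_per_spree_call = 100 / spree_share_of_calls

print(f"{spree_share_of_calls:.4f}% of all calls")
print(f"about 1 in {calls_per_spree_call:,.0f} calls")
```

So, under this definition of a spree, only about one call in six thousand to the fire department belongs to a dumpster fire spree.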