## Extracts data from 2012-2023 csv

## Removes suspect data after 2021.4.30

Data pulled from the Baltimore Police Department's Open Data, located here: https://www.baltimorepolice.org/crime-stats/open-data

###### The "Part 1 Crime" dataset represents the location and characteristics of major (Part 1) crime against persons, such as homicide, shooting, robbery, aggravated assault, etc., within the City of Baltimore.

Data trimmed back to the end of April 2021, per the notice below:

###### In May, 2020, the Baltimore Police Department began a significant upgrade to its new Records Management Systems to allow the department to transition from a paper-based system into a fully digital reporting environment. As a result of this massive transformation, we have experienced some complexities in properly and accurately translating the data from the new records system into the traditional Open Data Baltimore system. Based on our review, data on Part 1 Crime Incident Reports provided by Open Data Baltimore have been impacted starting in May, 2021 when the new system went online. BPD and the City are actively working with the vendor on a daily basis in addressing this matter as quickly as possible, so that we can fully restore our public reporting of data that ensures transparency and accountability in BPD operations.

(Really, the csv through 2023 was too big to upload to GitHub, but I'm going with 'responsible data management'. Note: this means you'll have to download the original file separately, from the source linked immediately below, to use this particular notebook, and you won't be able to push it to GitHub.)

Source data located here: https://data.baltimorecity.gov/maps/part-1-crime-data

Found on this page: https://data.baltimorecity.gov/search?q=crime%20data

In [1]:
# Dependencies
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st

In [2]:
# read in open baltimore 2012-2023 dataset
# the 'Race' column was coming in with mixed dtypes:
# by default pandas parses the csv in chunks and guesses a dtype per chunk,
# so one column can end up with different types in different chunks.
# low_memory=False has pandas process the whole file at once,
# so each column gets a single inferred dtype
df_raw = pd.read_csv(
    "Resources/2012-2023_BPD_Victim_Based_Crime_Data.csv", low_memory=False
)
df_raw.head()

Unnamed: 0,OBJECTID,CCNO,CrimeDateTime,CrimeCode,Location,Description,Inside_Outside,Weapon,Post,Gender,Age,Race,Ethnicity,District,Neighborhood,Latitude,Longitude,GeoLocation,Premise,Total_Incidents
0,1,23F08231,2023/06/24 04:01:00+00,4B,600 LUCIA AVE,AGG. ASSAULT,,PERSONAL_WEAPONS,833.0,F,15.0,BLACK_OR_AFRICAN_AMERICAN,NOT_HISPANIC_OR_LATINO,SOUTHWEST,YALE HEIGHTS,39.273302,-76.692439,"(39.27330200992213,-76.69243902745305)",,1
1,2,23F08231,2023/06/24 04:01:00+00,4B,600 LUCIA AVE,AGG. ASSAULT,,PERSONAL_WEAPONS,833.0,F,15.0,BLACK_OR_AFRICAN_AMERICAN,NOT_HISPANIC_OR_LATINO,SOUTHWEST,YALE HEIGHTS,39.273302,-76.692439,"(39.27330200992213,-76.69243902745305)",,1
2,3,23F08231,2023/06/24 04:01:00+00,4B,600 LUCIA AVE,AGG. ASSAULT,,PERSONAL_WEAPONS,833.0,F,27.0,BLACK_OR_AFRICAN_AMERICAN,NOT_HISPANIC_OR_LATINO,SOUTHWEST,YALE HEIGHTS,39.273302,-76.692439,"(39.27330200992213,-76.69243902745305)",,1
3,4,23F08231,2023/06/24 04:01:00+00,3JK,600 LUCIA AVE,ROBBERY,,PERSONAL_WEAPONS,833.0,M,25.0,BLACK_OR_AFRICAN_AMERICAN,UNKNOWN,SOUTHWEST,YALE HEIGHTS,39.273302,-76.692439,"(39.27330200992213,-76.69243902745305)",,1
4,5,23F08235,2023/06/24 03:45:00+00,5A,3200 LILY AVE,BURGLARY,,,922.0,M,48.0,,HISPANIC_OR_LATINO,SOUTHERN,CHERRY HILL,39.246432,-76.636819,"(39.24643210111462,-76.63681903810716)",,1
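
An alternative to `low_memory=False` is to declare the column's dtype up front, so pandas never has to guess. The sketch below is only an illustration: the rows are invented stand-ins for the real file, and it assumes `Race` is best treated as text.

```python
import io

import pandas as pd

# Tiny stand-in for the real csv; these rows are invented for illustration.
csv_text = """Race,Age
WHITE,29
,52
UNKNOWN,26
"""

# Declaring the dtype means pandas does not infer it chunk by chunk.
sample = pd.read_csv(io.StringIO(csv_text), dtype={"Race": "string"})

print(sample.dtypes)
print(sample["Race"].isna().sum())  # counts the blank Race cell as missing
```

With the real file, the same `dtype=` argument would be passed alongside the path in the `pd.read_csv` call above.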


In [3]:
# clean 2023 data

# create copy of raw 2012-2023 data
# (.copy() so the edits below don't also change df_raw)
df = df_raw.copy()

# delete useless columns
del df["CCNO"]
del df["OBJECTID"]

# formatting
df.rename(columns={"Inside_Outside": "Inside/Outside"}, inplace=True)

# split crime date and time
# n=1 is passed by keyword; the positional form is deprecated in newer pandas
df[["CrimeDate", "CrimeTime"]] = df["CrimeDateTime"].str.split(" ", n=1, expand=True)

# pops date and time out of end of df, inserts into beginning of df
df.insert(0, "CrimeDate", df.pop("CrimeDate"))
df.insert(1, "CrimeTime", df.pop("CrimeTime"))
# pops the original CrimeDateTime out, places at the end of the df
df["CrimeDateTime"] = df.pop("CrimeDateTime")

# sorts df by CrimeDateTime
df.sort_values(by="CrimeDateTime", inplace=True, ascending=False)

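# converts str CrimeDate to datetime64, removes all dates past 2021-04-30
# errors='coerce' turns any date that fails to parse into NaT instead of raising
# (added for the odd early dates: 1922, 1969, etc.);
# NaT rows fail the <= comparison below, so they are dropped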
df.CrimeDate = pd.to_datetime(df.CrimeDate, errors="coerce")
df = df[df.CrimeDate <= "2021-04-30"]

df.head()



Unnamed: 0,CrimeDate,CrimeTime,CrimeCode,Location,Description,Inside/Outside,Weapon,Post,Gender,Age,Race,Ethnicity,District,Neighborhood,Latitude,Longitude,GeoLocation,Premise,Total_Incidents,CrimeDateTime
84670,2021-04-30,23:50:00+00,6D,200 SCOTT ST,LARCENY FROM AUTO,,,932.0,M,22.0,UNKNOWN,,SOUTHERN,WASHINGTON VILLAGE/PIGTOWN,39.285056,-76.629022,"(39.285056,-76.629022)",,1,2021/04/30 23:50:00+00
84666,2021-04-30,23:50:00+00,6G,1700 THAMES ST,LARCENY,I,,213.0,F,29.0,WHITE,,SOUTHEAST,FELLS POINT,39.281896,-76.592512,"(39.281896,-76.592512)",BAR,1,2021/04/30 23:50:00+00
84333,2021-04-30,23:38:00+00,6E,4100 EMMART AVE,LARCENY,O,,631.0,F,52.0,BLACK_OR_AFRICAN_AMERICAN,,NORTHWEST,REISTERSTOWN STATION,39.349471,-76.693679,"(39.349471,-76.693679)",PARKING LOT-OUTSIDE,1,2021/04/30 23:38:00+00
84677,2021-04-30,23:38:00+00,6E,4100 EMMART AVE,LARCENY,O,,631.0,M,26.0,UNKNOWN,,NORTHWEST,REISTERSTOWN STATION,39.349471,-76.693679,"(39.349471,-76.693679)",PARKING LOT-OUTSIDE,1,2021/04/30 23:38:00+00
84074,2021-04-30,23:38:00+00,6E,4100 EMMART AVE,LARCENY,O,,631.0,,,UNKNOWN,,NORTHWEST,REISTERSTOWN STATION,39.349471,-76.693679,"(39.349471,-76.693679)",PARKING LOT-OUTSIDE,1,2021/04/30 23:38:00+00
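
To make the effect of `errors="coerce"` on the cutoff filter concrete, here is a minimal sketch. The date strings are invented, and the explicit `format=` is an assumption based on the `YYYY/MM/DD` layout seen in `CrimeDateTime`.

```python
import pandas as pd

# Invented sample values; the last one is deliberately not a date.
raw_dates = pd.Series(["2021/04/30", "2021/05/01", "2012/01/15", "not a date"])

# Values that cannot be parsed become NaT instead of raising an error.
parsed = pd.to_datetime(raw_dates, format="%Y/%m/%d", errors="coerce")

# Comparisons against NaT evaluate to False, so the unparseable row is
# dropped together with the row that falls after the cutoff.
cutoff = pd.Timestamp("2021-04-30")
kept = parsed[parsed <= cutoff]

print(parsed)
print(kept)
```

One consequence worth keeping in mind: rows with unparseable dates are removed silently, so checking `parsed.isna().sum()` before filtering shows how many are lost that way.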


In [4]:
# write to csv
# remove old index for formatting
df.to_csv("Resources/2012-2021_BPD_Victim_Based_Crime_Data_clean.csv", index=False)
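
A quick round-trip check can confirm that the written file reads back cleanly. This is a sketch under assumptions: the frame is an invented two-row stand-in for the cleaned data, and the file name `clean.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Invented stand-in for the cleaned frame.
clean = pd.DataFrame(
    {
        "CrimeDate": pd.to_datetime(["2021-04-30", "2019-07-04"]),
        "Weapon": ["KNIFE", None],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clean.csv")  # hypothetical file name
    # index=False keeps the old index out of the file.
    clean.to_csv(path, index=False)
    # Dates are stored as text in a csv, so parse them again when reading.
    reloaded = pd.read_csv(path, parse_dates=["CrimeDate"])

# No date after the cutoff should survive the round trip.
assert reloaded["CrimeDate"].max() <= pd.Timestamp("2021-04-30")
print(reloaded.dtypes)
```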

## Weapon types

Sadly, by cleaning up the data, we lose half of the fun stuff: the full data through 2023 has 22 weapon types, while the cleaned data through 2021 has only 10 (compare the next two cells).

In [5]:
# weapon types through 2023
print(
    f"""The number of weapon types through 2023 is {df_raw.Weapon.nunique()}

{df_raw.Weapon.unique()}

{df_raw.Weapon.value_counts()}
"""
)

The number of weapon types through 2023 is 22

['PERSONAL_WEAPONS' 'KNIFE_CUTTING_INSTRUMENT' nan 'OTHER' 'UNKNOWN'
 'BLUNT_OBJECT' 'HANDGUN' 'FIREARM' 'ASPHYXIATION' 'MOTOR_VEHICLE_VESSEL'
 'AUTOMATIC_HANDGUN' 'KNIFE' 'RIFLE' 'FIRE_INCENDIARY_DEVICE'
 'OTHER_FIREARM' 'SHOTGUN' 'POISON' 'AUTOMATIC_RIFLE' 'AUTOMATIC_FIREARM'
 'FIRE' 'EXPLOSIVES' 'HANDS' 'DRUGS_NARCOTICS_SLEEPING_PILLS']

FIREARM                           49045
OTHER                             31852
KNIFE                             20249
PERSONAL_WEAPONS                  13443
HANDS                              7144
HANDGUN                            2415
FIRE                               2374
KNIFE_CUTTING_INSTRUMENT           1390
BLUNT_OBJECT                        944
UNKNOWN                             400
MOTOR_VEHICLE_VESSEL                195
AUTOMATIC_HANDGUN                    80
OTHER_FIREARM                        42
ASPHYXIATION                         41
RIFLE                                37
SHOTGUN   

In [6]:
# weapon types through 2021
print(
    f"""The number of weapon types through 2021 is {df.Weapon.nunique()}

{df.Weapon.unique()}

{df.Weapon.value_counts()}
"""
)

The number of weapon types through 2021 is 10

[nan 'FIREARM' 'OTHER' 'FIRE' 'KNIFE' 'HANDS' 'PERSONAL_WEAPONS'
 'KNIFE_CUTTING_INSTRUMENT' 'BLUNT_OBJECT' 'HANDGUN' 'UNKNOWN']

FIREARM                     42435
OTHER                       28118
KNIFE                       16995
HANDS                        6865
FIRE                         2233
PERSONAL_WEAPONS               35
BLUNT_OBJECT                    2
KNIFE_CUTTING_INSTRUMENT        1
HANDGUN                         1
UNKNOWN                         1
Name: Weapon, dtype: int64
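
Rather than comparing the two printouts by eye, the lost categories can be computed directly. The sketch below uses invented miniature Series in place of `df_raw.Weapon` and `df.Weapon`; the column labels `through_2023` and `through_2021` are made up for the illustration.

```python
import pandas as pd

# Invented miniature versions of the two Weapon columns.
weapons_2023 = pd.Series(["FIREARM", "KNIFE", "RIFLE", "POISON", None, "FIREARM"])
weapons_2021 = pd.Series(["FIREARM", "KNIFE", None, "FIREARM"])

# Categories present in the full data but absent after the cutoff.
lost = sorted(set(weapons_2023.dropna()) - set(weapons_2021.dropna()))
print(lost)

# Side-by-side counts; a category missing on one side is filled with 0.
counts = (
    pd.concat(
        [weapons_2023.value_counts(), weapons_2021.value_counts()],
        axis=1,
        keys=["through_2023", "through_2021"],
    )
    .fillna(0)
    .astype(int)
)
print(counts)
```

Swapping in `df_raw.Weapon` and `df.Weapon` for the two sample Series should give the same kind of table for the real data.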

