# Name: Margaret Nguyen

# Data Aggregation: Massachusetts Crash Data and ACS Data

**Assignment: Retrieve crash data from the present day back to 2011 for Massachusetts. Try to find a way to organize the data by municipality so that we can later merge it with county subdivision data from the ACS 5-year estimates. You can access the Massachusetts crash data [here](https://apps.impact.dot.state.ma.us/cdp/home). Click on "Data Extraction" and search for crashes; they should be available by year.**

**Here is the link to the Massachusetts Law Enforcement Crash Report Data Dictionary: [link](https://www.umasstransportationcenter.org/images/umtc/UMassSafe/Massachusetts%20Crash%20Report%20Data%20Dictionary.pdf).**

## Credit:

The following code is based on the work of my supervisor, Mitch Shiles. The original code can be found at this link: [Mitch Shiles' GitHub](https://github.com/rmshiles/Carlisle-Local-Crash-Analysis/blob/main/1.%20Municupal%20Crash%20Data%20Aggregation%20.ipynb).

In [2]:
# Import standard-library and HTTP libraries
import csv
import json
import timeit
from pathlib import Path

import requests

# Import data handling libraries
import numpy as np
import pandas as pd

#from dateutil.rrule import rrule, DAILY, MONTHLY
#from datetime import  timedelta

# Import graphing libraries
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

# Import notebook display helpers
from IPython.display import display, HTML

# Configure how pandas displays large tables and floats
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.options.display.float_format = '{:,.6f}'.format

# Widen the notebook container to the full browser width
display(HTML("<style>.container { width:100% !important; }</style>"))

## Set Global Variables

In [3]:
# Set the time period to analyze
START_YEAR = 2017
END_YEAR = 2021

## Query ACS Data for Municipalities 

In [4]:
# Query ACS data for Municipalities 

# Census API
HOST = 'https://api.census.gov/data'

# ACS vintage to query (the 5-year estimates ending in END_YEAR)
YEAR = str(END_YEAR)

# Survey to Access data from (ACS 5 year estimates)
DATA_SET = 'acs/acs5'
BASE_URL = '/'.join([HOST, YEAR, DATA_SET])

# Create an empty dictionary for predicates 
predicates = {}

# VARIABLES
# Population Estimate: B01001_001E
# Population Margin of error: B01001_001M (not queried)
# Bike to work Estimate: B08006_014E
# Bike to work Margin of error: B08006_014M
# Walk to work Estimate: B08006_015E
# Walk to work Margin of error: B08006_015M
# Drive to work alone Estimate: B08006_003E
# Drive to work alone Margin of error: B08006_003M
# Carpool to work Estimate: B08006_004E
# Carpool to work Margin of error: B08006_004M
# Public transit to work Estimate: B08006_008E
# Public transit to work Margin of error: B08006_008M
# Other to work: (not queried)
# Poverty Estimate: (not queried)
# Poverty Margin of error: (not queried)

get_vars = ['NAME',
            'B01001_001E',
            'B08006_014E',
            'B08006_014M',
            'B08006_015E',
            'B08006_015M',
            'B08006_003E',
            'B08006_003M',
            'B08006_004E',
            'B08006_004M',
            'B08006_008E',
            'B08006_008M']

predicates['get']=','.join(get_vars)

# Set sub geographies to get data for ('county','Place','county subdivision') * means get all 
predicates['for']='county subdivision:*'

# Set geography that contains sub geographies  (25 = Massachusetts)
predicates['in']='state:25'

# Send the API query
r = requests.get(BASE_URL, params = predicates)

# Print the base URL and the query parameters
print(BASE_URL,predicates)

https://api.census.gov/data/2021/acs/acs5 {'get': 'NAME,B01001_001E,B08006_014E,B08006_014M,B08006_015E,B08006_015M,B08006_003E,B08006_003M,B08006_004E,B08006_004M,B08006_008E,B08006_008M', 'for': 'county subdivision:*', 'in': 'state:25'}
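
To see the exact URL that these parameters produce, the query string can be assembled with the standard library. This is only a minimal sketch: `build_query_url` is a hypothetical helper that is not part of the notebook, and the variable list is shortened for readability.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_query_url(base_url, predicates):
    """Return base_url with the predicates appended as a URL-encoded query string."""
    return f"{base_url}?{urlencode(predicates)}"

# Same endpoint and geography predicates as the cell above, with a shortened variable list
base_url = "https://api.census.gov/data/2021/acs/acs5"
predicates = {
    "get": "NAME,B01001_001E",
    "for": "county subdivision:*",
    "in": "state:25",
}

query_url = build_query_url(base_url, predicates)
print(query_url)

# Decoding the query string recovers the original predicates
decoded = {key: values[0] for key, values in parse_qs(urlsplit(query_url).query).items()}
print(decoded == predicates)
```

After a request has been sent, `r.url` holds the URL that `requests` actually used, which is a quick way to check the query in a browser.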


In [5]:
# Print the length and first thousand characters to see what you got 
print(len(r.text))
print(r.text[0:1000])

47736
[["NAME","B01001_001E","B08006_014E","B08006_014M","B08006_015E","B08006_015M","B08006_003E","B08006_003M","B08006_004E","B08006_004M","B08006_008E","B08006_008M","state","county","county subdivision"],
["County subdivisions not defined, Barnstable County, Massachusetts","0","0","13","0","13","0","13","0","13","0","13","25","001","00000"],
["Barnstable Town city, Barnstable County, Massachusetts","48556","35","52","843","321","18901","926","2649","482","303","136","25","001","03690"],
["Bourne town, Barnstable County, Massachusetts","20364","0","25","182","64","8471","673","722","288","84","55","25","001","07175"],
["Brewster town, Barnstable County, Massachusetts","10282","66","70","1","3","3733","445","153","88","24","29","25","001","07980"],
["Chatham town, Barnstable County, Massachusetts","6554","0","19","124","74","1666","252","196","96","93","102","25","001","12995"],
["Dennis town, Barnstable County, Massachusetts","14664","8","16","147","108","4930","515","658","173","37

In [6]:
# Place the Queried ACS data into a data frame 

# Parse the JSON response once
response_rows = r.json()

# Set the column names to the first row of data from the query
column_names = response_rows[0]

# Set the data to everything after the first row and convert it to a 2-D array
ACS_DATA = response_rows[1:]
ACS_data = np.array(ACS_DATA)

# Create the pandas data frame 
ACS_MUNI_DF = pd.DataFrame(columns=column_names, data=ACS_data)

# Reset the index of the data frame in place
ACS_MUNI_DF.reset_index(drop=True, inplace=True)

# Rename the columns 
ACS_MUNI_DF.rename(columns ={"B01001_001E":"POPULATION",
                    "B08006_014E":'BIKE_TO_WORK_EST',
                    "B08006_014M":"BIKE_TO_WORK_MARG",
                    "B08006_015E":"WALK_TO_WORK_EST",
                    "B08006_015M":"WALK_TO_WORK_MARG",
                    'B08006_003E':"DRIVE_SOLO_TO_WORK_EST",
                    'B08006_003M':"DRIVE_SOLO_TO_WORK_MARG",
                    'B08006_004E':"CARPOOL_TO_WORK_EST",
                    'B08006_004M':"CARPOOL_TO_WORK_MARG",
                    'B08006_008E':"PUBTRANS_TO_WORK_EST",
                    'B08006_008M':"PUBTRANS_TO_WORK_MARG",
                    "county subdivision":"county_subdivision"}, inplace=True)

# Convert the NAME column to strings 
ACS_MUNI_DF['NAME'] = ACS_MUNI_DF["NAME"].astype(str)

# Remove ", Massachusetts" from NAME; it is redundant because all data is from Massachusetts
ACS_MUNI_DF['NAME'] = ACS_MUNI_DF['NAME'].str.replace(', Massachusetts', '', regex=False)

# Create separate name variables for municipality and county
# (split on ", " so COUNTY_NAME does not start with a space)
ACS_MUNI_DF[['MUNI_NAME','COUNTY_NAME']] = ACS_MUNI_DF['NAME'].str.split(', ', expand=True)

# Convert the estimate, margin-of-error, and FIPS code columns to integers
int_columns = ["POPULATION",
               "BIKE_TO_WORK_EST",
               "BIKE_TO_WORK_MARG",
               "WALK_TO_WORK_EST",
               "WALK_TO_WORK_MARG",
               "DRIVE_SOLO_TO_WORK_EST",
               "DRIVE_SOLO_TO_WORK_MARG",
               "CARPOOL_TO_WORK_EST",
               "CARPOOL_TO_WORK_MARG",
               "PUBTRANS_TO_WORK_EST",
               "PUBTRANS_TO_WORK_MARG",
               "state",
               "county",
               "county_subdivision"]
ACS_MUNI_DF[int_columns] = ACS_MUNI_DF[int_columns].astype(int)

# Print the number of rows in the dataframe
print(len(ACS_MUNI_DF))

# Show the dataframe
ACS_MUNI_DF.head()

357


Unnamed: 0,NAME,POPULATION,BIKE_TO_WORK_EST,BIKE_TO_WORK_MARG,WALK_TO_WORK_EST,WALK_TO_WORK_MARG,DRIVE_SOLO_TO_WORK_EST,DRIVE_SOLO_TO_WORK_MARG,CARPOOL_TO_WORK_EST,CARPOOL_TO_WORK_MARG,PUBTRANS_TO_WORK_EST,PUBTRANS_TO_WORK_MARG,state,county,county_subdivision,MUNI_NAME,COUNTY_NAME
0,"County subdivisions not defined, Barnstable Co...",0,0,13,0,13,0,13,0,13,0,13,25,1,0,County subdivisions not defined,Barnstable County
1,"Barnstable Town city, Barnstable County",48556,35,52,843,321,18901,926,2649,482,303,136,25,1,3690,Barnstable Town city,Barnstable County
2,"Bourne town, Barnstable County",20364,0,25,182,64,8471,673,722,288,84,55,25,1,7175,Bourne town,Barnstable County
3,"Brewster town, Barnstable County",10282,66,70,1,3,3733,445,153,88,24,29,25,1,7980,Brewster town,Barnstable County
4,"Chatham town, Barnstable County",6554,0,19,124,74,1666,252,196,96,93,102,25,1,12995,Chatham town,Barnstable County
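
Note that the integer conversion drops the leading zeros of the FIPS codes (county `001` becomes `1` and subdivision `03690` becomes `3690` in the output above). If the codes are later needed as join keys, one option is to leave them as strings. The sketch below uses two rows copied from the API response; the `GEOID` column name is an assumption made for this illustration and does not appear in the notebook.

```python
import pandas as pd

# Two rows copied from the API response above: a header row followed by data rows
rows = [
    ["NAME", "B01001_001E", "state", "county", "county subdivision"],
    ["Barnstable Town city, Barnstable County, Massachusetts", "48556", "25", "001", "03690"],
    ["Bourne town, Barnstable County, Massachusetts", "20364", "25", "001", "07175"],
]

# Build the frame directly from the list of lists; every value starts out as a string
acs = pd.DataFrame(rows[1:], columns=rows[0])

# Convert only the estimate column to a number
acs["B01001_001E"] = pd.to_numeric(acs["B01001_001E"])

# Keep the FIPS codes as strings so their leading zeros survive, then combine them
# into a 10-character identifier (GEOID is a column name assumed for this sketch)
acs["GEOID"] = acs["state"] + acs["county"] + acs["county subdivision"]

print(acs[["NAME", "B01001_001E", "GEOID"]])
```

Building the frame from the list of lists also makes the intermediate NumPy array unnecessary.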


## Importing Massachusetts Crash Data from 2017 to 2021

The following is my code:

In [7]:
# Import datasets for the years 2017 to 2021
for num in range(2017, 2022):
    # The 2018 and 2019 extracts use a different file name than the other years
    if num in (2018, 2019):
        filename = f'/Users/margaret06/Documents/GitHub/Carlisle_Borough_Transportation_Study/data/{num}_Crash_Level_Details.csv'
    else:
        filename = f'/Users/margaret06/Documents/GitHub/Carlisle_Borough_Transportation_Study/data/{num}_Crashes.csv'
    
    # Build the variable name for this year, e.g. crash_2017
    var_name = f'crash_{num}'
    
    # globals() is the dictionary of top-level variables, so this creates crash_2017 ... crash_2021
    globals()[var_name] = pd.read_csv(filename)
    
    # At the top level of a notebook cell, locals() is the same dictionary as globals(),
    # so the line below would also work here (but not inside a function)
    # locals()[var_name] = pd.read_csv(filename)
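
An alternative to creating variables through `globals()` is to keep the yearly frames in a dictionary keyed by year. The following is a minimal sketch under stated assumptions: `load_crash_files` and its `alt_name_years` parameter are hypothetical, and the demo writes two tiny placeholder CSV files to a temporary directory instead of reading the real extracts.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_crash_files(data_dir, years, alt_name_years=()):
    """Read one crash CSV per year into a dictionary keyed by year."""
    frames = {}
    for year in years:
        # Some years use the "_Crash_Level_Details" suffix instead of "_Crashes"
        suffix = "Crash_Level_Details" if year in alt_name_years else "Crashes"
        frames[year] = pd.read_csv(Path(data_dir) / f"{year}_{suffix}.csv")
    return frames

# Demonstrate with two tiny placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "2017_Crashes.csv").write_text("CRASH_NUMB,CITY_TOWN_NAME\n1,FREETOWN\n")
    Path(tmp, "2018_Crash_Level_Details.csv").write_text("CRASH_NUMB,CITY_TOWN_NAME\n2,HUDSON\n")
    crash_by_year = load_crash_files(tmp, [2017, 2018], alt_name_years=(2018,))

print(sorted(crash_by_year))
print(crash_by_year[2018])
```

With this layout, `crash_by_year[2017]` would play the role of `crash_2017`, and later loops could iterate over the dictionary instead of looking names up in `locals()`.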



## Data Cleaning

In [8]:
# Show the columns that each later dataset has but crash_2017 lacks
# Create a list of dataset years for 2018 to 2021
dataset_years = [2018, 2019, 2020, 2021]

# Loop through the list of dataset years
for year in dataset_years:
    # Generate the dataset variable name based on the year
    current_dataset_name = f'crash_{year}'
    
    # Find the columns that are in the current year's dataset but not in crash_2017
    columns_only_in_current = set(globals()[current_dataset_name].columns) - set(crash_2017.columns)
    
    if columns_only_in_current:
        print(f"\nColumns only in {current_dataset_name}:")
        for column in columns_only_in_current:
            print(column)


Columns only in crash_2018:
CRASH_DATE
T_EXC_TYPE
T_EXC_TIME
CRASH_TIME_2

Columns only in crash_2019:
CRASH_DATE
T_EXC_TYPE
T_EXC_TIME
CRASH_TIME_2

Columns only in crash_2020:
CRASH_DATE
T_EXC_TYPE
T_EXC_TIME
CRASH_TIME_2

Columns only in crash_2021:
SHAPE
CRASH_TIME_2
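
The check above reports only one direction (columns in a later year that 2017 lacks). A small helper can report both directions at once. This is a sketch: `compare_columns` is a hypothetical function, and the column lists are shortened, illustrative stand-ins for the full 115-column headers.

```python
def compare_columns(reference_columns, other_columns):
    """Return (columns only in other, columns only in reference), each sorted."""
    reference, other = set(reference_columns), set(other_columns)
    return sorted(other - reference), sorted(reference - other)

# Shortened, illustrative column lists (not the full headers)
columns_2017 = ["CRASH_NUMB", "CRASH_TIME", "CRASH_DATETIME"]
columns_2018 = ["CRASH_NUMB", "CRASH_TIME_2", "CRASH_DATE", "T_EXC_TYPE", "T_EXC_TIME"]

only_in_2018, only_in_2017 = compare_columns(columns_2017, columns_2018)
print("Only in 2018:", only_in_2018)
print("Only in 2017:", only_in_2017)
```

In the notebook this could be called as `compare_columns(crash_2017.columns, crash_2018.columns)`; sorting the result also makes the printed order reproducible, which iterating over a set does not guarantee.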


In [9]:
# Rename 'CRASH_DATE' and 'CRASH_TIME_2' so they match 'CRASH_DATETIME' and 'CRASH_TIME' in crash_2017,
# which contain the same type of information
crash_2018.rename(columns={'CRASH_TIME_2': 'CRASH_TIME', 'CRASH_DATE': 'CRASH_DATETIME'}, inplace=True)
crash_2019.rename(columns={'CRASH_TIME_2': 'CRASH_TIME', 'CRASH_DATE': 'CRASH_DATETIME'}, inplace=True)
crash_2020.rename(columns={'CRASH_TIME_2': 'CRASH_TIME', 'CRASH_DATE': 'CRASH_DATETIME'}, inplace=True)
crash_2021.rename(columns={'CRASH_TIME_2': 'CRASH_TIME'}, inplace=True)

# Drop columns which aren't in crash_2017
crash_2018 = crash_2018.drop(columns = ['T_EXC_TYPE', 'T_EXC_TIME'])
crash_2019 = crash_2019.drop(columns = ['T_EXC_TYPE', 'T_EXC_TIME'])
crash_2020 = crash_2020.drop(columns = ['T_EXC_TYPE', 'T_EXC_TIME'])
crash_2021 = crash_2021.drop(columns = ['SHAPE'])

# List of dataset names for the years 2018 to 2021
dataset_years = [2018, 2019, 2020, 2021]

# Loop through the list of dataset years
for year in dataset_years:
    # Generate the dataset variable name based on the year
    current_dataset_name = f'crash_{year}'
    
    # Reorder the columns of the current dataset to match crash_2017
    # (use globals() because assigning through locals() only takes effect at the top level)
    globals()[current_dataset_name] = globals()[current_dataset_name][crash_2017.columns]
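
The rename, drop, and reorder steps above can also be expressed as one function applied to each year. The sketch below is illustrative only: `align_to_reference` is a hypothetical helper, and the two frames are tiny stand-ins for the real datasets.

```python
import pandas as pd

def align_to_reference(df, reference_columns, rename_map=None):
    """Rename columns, then keep only the reference columns in the reference order."""
    renamed = df.rename(columns=rename_map or {})
    # Selecting with a list raises KeyError if a reference column is missing,
    # which surfaces schema problems early
    return renamed[list(reference_columns)]

# Tiny illustrative frames (the real datasets have 115 columns)
crash_ref = pd.DataFrame({"CRASH_NUMB": [1], "CRASH_TIME": ["12:43 PM"]})
crash_later = pd.DataFrame({"T_EXC_TYPE": [None], "CRASH_TIME_2": ["2:24 PM"], "CRASH_NUMB": [2]})

aligned = align_to_reference(crash_later, crash_ref.columns, {"CRASH_TIME_2": "CRASH_TIME"})
print(list(aligned.columns))
```

Selecting the reference columns drops the extra columns and reorders the rest in a single step, so no separate `drop` call is needed.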

In [10]:
# Check the shape and columns (variables) of the datasets
# List of dataset years for 2017 to 2021
years = [2017, 2018, 2019, 2020, 2021]

# Loop through the list of dataset years
for year in years:
    # Generate the dataset variable name based on the year
    current_dataset_name = f'crash_{year}'
    
    # Look up the current dataset by its variable name
    current_dataset = globals()[current_dataset_name]
    
    # Check the shape of the current dataset
    dataset_shape = current_dataset.shape
    
    # Get the columns of the current dataset
    dataset_columns = current_dataset.columns
    
    # Print the results
    print(f"Year {year}:")
    print(f"Shape: {dataset_shape}")
    print(f"Columns: {dataset_columns}")
    print("\n")

Year 2017:
Shape: (145068, 115)
Columns: Index(['OBJECTID', 'CRASH_NUMB', 'CITY_TOWN_NAME', 'CRASH_DATE_TEXT', 'CRASH_TIME', 'CRASH_DATETIME', 'CRASH_HOUR', 'CRASH_STATUS', 'CRASH_SEVERITY_DESCR', 'MAX_INJR_SVRTY_CL',
       ...
       'CITY', 'STRUCT_CND', 'TERRAIN', 'URBAN_LOC_TYPE', 'AADT_DERIV', 'STATN_NUM', 'OP_DIR_SL', 'SHLDR_UL_T', 'SHLDR_UL_W', 'F_F_CLASS'], dtype='object', length=115)


Year 2018:
Shape: (142272, 115)
Columns: Index(['OBJECTID', 'CRASH_NUMB', 'CITY_TOWN_NAME', 'CRASH_DATE_TEXT', 'CRASH_TIME', 'CRASH_DATETIME', 'CRASH_HOUR', 'CRASH_STATUS', 'CRASH_SEVERITY_DESCR', 'MAX_INJR_SVRTY_CL',
       ...
       'CITY', 'STRUCT_CND', 'TERRAIN', 'URBAN_LOC_TYPE', 'AADT_DERIV', 'STATN_NUM', 'OP_DIR_SL', 'SHLDR_UL_T', 'SHLDR_UL_W', 'F_F_CLASS'], dtype='object', length=115)


Year 2019:
Shape: (140939, 115)
Columns: Index(['OBJECTID', 'CRASH_NUMB', 'CITY_TOWN_NAME', 'CRASH_DATE_TEXT', 'CRASH_TIME', 'CRASH_DATETIME', 'CRASH_HOUR', 'CRASH_STATUS', 'CRASH_SEVERITY_DESCR', 'MAX_

In [11]:
# Create a list of the yearly DataFrames
dataframes = [crash_2017, crash_2018, crash_2019, crash_2020, crash_2021]

# Concatenate them into a single DataFrame; ignore_index=True assigns a fresh 0..n-1 index
mass_crash = pd.concat(dataframes, ignore_index=True)

# Print the length of mass_crash
print(len(mass_crash))

# Show mass_crash
mass_crash.head()

653507


Unnamed: 0,OBJECTID,CRASH_NUMB,CITY_TOWN_NAME,CRASH_DATE_TEXT,CRASH_TIME,CRASH_DATETIME,CRASH_HOUR,CRASH_STATUS,CRASH_SEVERITY_DESCR,MAX_INJR_SVRTY_CL,NUMB_VEHC,NUMB_NONFATAL_INJR,NUMB_FATAL_INJR,POLC_AGNCY_TYPE_DESCR,MANR_COLL_DESCR,VEHC_MNVR_ACTN_CL,VEHC_TRVL_DIRC_CL,VEHC_SEQ_EVENTS_CL,AMBNT_LIGHT_DESCR,WEATH_COND_DESCR,ROAD_SURF_COND_DESCR,FIRST_HRMF_EVENT_DESCR,MOST_HRMFL_EVT_CL,DRVR_CNTRB_CIRC_CL,VEHC_CONFIG_CL,STREET_NUMB,RDWY,DIST_DIRC_FROM_INT,NEAR_INT_RDWY,MM_RTE,DIST_DIRC_MILEMARKER,MILEMARKER,EXIT_RTE,DIST_DIRC_EXIT,EXIT_NUMB,DIST_DIRC_LANDMARK,LANDMARK,RDWY_JNCT_TYPE_DESCR,TRAF_CNTRL_DEVC_TYPE_DESCR,TRAFY_DESCR_DESCR,JURISDICTN,FIRST_HRMF_EVENT_LOC_DESCR,NON_MTRST_TYPE_CL,NON_MTRST_ACTN_CL,NON_MTRST_LOC_CL,IS_GEOCODED,GEOCODING_METHOD_NAME,X,Y,LAT,LON,RMV_DOC_IDS,CRASH_RPT_IDS,YEAR,AGE_DRVR_YNGST,AGE_DRVR_OLDEST,AGE_NONMTRST_YNGST,AGE_NONMTRST_OLDEST,DRVR_DISTRACTED_CL,DISTRICT_NUM,RPA_ABBR,VEHC_EMER_USE_CL,VEHC_TOWED_FROM_SCENE_CL,CNTY_NAME,FMCSA_RPTBL_CL,FMCSA_RPTBL,HIT_RUN_DESCR,LCLTY_NAME,ROAD_CNTRB_DESCR,SCHL_BUS_RELD_DESCR,SPEED_LIMIT,TRAF_CNTRL_DEVC_FUNC_DESCR,WORK_ZONE_RELD_DESCR,AADT,AADT_YEAR,PK_PCT_SUT,AV_PCT_SUT,PK_PCT_CT,AV_PCT_CT,CURB,TRUCK_RTE,LT_SIDEWLK,RT_SIDEWLK,SHLDR_LT_W,SHLDR_LT_T,SURFACE_WD,SURFACE_TP,SHLDR_RT_W,SHLDR_RT_T,NUM_LANES,OPP_LANES,MED_WIDTH,MED_TYPE,URBAN_TYPE,F_CLASS,URBAN_AREA,FD_AID_RTE,FACILITY,OPERATION,CONTROL,PEAK_LANE,SPEED_LIM,STREETNAME,FROMSTREETNAME,TOSTREETNAME,CITY,STRUCT_CND,TERRAIN,URBAN_LOC_TYPE,AADT_DERIV,STATN_NUM,OP_DIR_SL,SHLDR_UL_T,SHLDR_UL_W,F_F_CLASS
0,1,4304436,FREETOWN,01/01/2017,12:43 PM,2017/01/01 12:43:00+00,12:00PM to 12:59PM,Closed,Non-fatal injury,Non-fatal injury - Possible,2,1,0,Local police,Angle,V1: Travelling straight ahead / V2: Entering t...,V1: W / V2: N,V1:(Collision with motor vehicle in traffic) ...,Daylight,Clear,Dry,Collision with motor vehicle in traffic,V1:(Collision with motor vehicle in traffic) /...,D1: (No improper driving) / D2: (Failed to y...,V1:(Passenger car) / V2:(Passenger car),2.0,CHACE RD,248 feet W of,COUNTY RD,,,,,,,,,Not at junction,No controls,"Two-way, not divided",City or Town accepted road,Roadway,,,,Yes,Off Intersection,245207.443463,835717.896456,41.770454,-70.956298,PW201700300202,17-1-AC,2017,35-44,65-74,,,,5,SRPEDD,V1:(No) / V2:(No),V1:(No) / V2:(No),BRISTOL,,,No hit and run,,,"No, school bus not involved",40.0,Not reported,No,2265.0,2013.0,0.617,130.0,0.115,30.0,,Not a parkway - not on a designated truck route,,,,,40.0,Surface-treated road,2.0,Stable - Unruttable compacted subgrade,2.0,0.0,,,Small Urbanized Area,Urban minor arterial or rural major collector,New Bedford,,Mainline roadway,Two-way traffic,No control,,,CHACE ROAD,RAMP-RT 140 NB TO CHACE RD,COUNTY ROAD,Freetown,Fair,Level,"Not applicable (i.e., not a principal arterial...",,,,,,Minor Arterial
1,2,4304698,HUDSON,01/01/2017,2:24 PM,2017/01/01 14:23:59+00,02:00PM to 02:59PM,Closed,Property damage only (none injured),No injury,2,0,0,Local police,Rear-end,V1: Slowing or stopped in traffic / V2: Travel...,V1: N / V2: N,V1:(Collision with motor vehicle in traffic) ...,Daylight,Clear,Dry,Collision with motor vehicle in traffic,V1:(Collision with motor vehicle in traffic) /...,D1: (No improper driving) / D2: (Inattention),V1:(Passenger car) / V2:(Passenger car),6.0,SUMMER STREET,,,,,,,,,,SECTOR 1,Not at junction,No controls,"Two-way, not divided",City or Town accepted road,Roadway,,,,Yes,At Address,194772.438927,904301.332431,42.3892,-71.563492,PW201700300709,2017000000028,2017,18-20,65-74,,,,3,MAPC,V1:(No) / V2:(No),V1:(No) / V2:(No),MIDDLESEX,,,No hit and run,,,"No, school bus not involved",30.0,"Yes, device functioning",No,,,,,,,,Not a parkway - not on a designated truck route,4.0,4.0,,,27.0,Bituminous concrete road,2.0,Unstable shoulder,2.0,0.0,,,Large Urbanized Area,Local,Boston (MA-NH-RI),,Mainline roadway,Two-way traffic,No control,,,SUMMER STREET,GROVE STREET,BROAD STREET,Hudson,Good,Level,"Not applicable (i.e., not a principal arterial...",,,,,,Local
2,3,4304699,HUDSON,01/02/2017,4:19 PM,2017/01/02 16:18:59+00,04:00PM to 04:59PM,Closed,Property damage only (none injured),No injury,2,0,0,Local police,Angle,V1: Travelling straight ahead / V2: Turning right,V1: N / V2: N,V1:(Collision with motor vehicle in traffic) ...,Daylight,Clear/Clear,Dry,Collision with motor vehicle in traffic,V1:(Collision with motor vehicle in traffic) /...,D1: (No improper driving) / D2: (Inattention),V1:(Passenger car) / V2:(Passenger car),164.0,WASHINGTON STREET,,,,,,,,,,,Not at junction,No controls,Not reported,Massachusetts Department of Transportation,Roadway,,,,Yes,At Address,194298.945899,903664.859506,42.383471,-71.56923,PW201700300710,2017000000068,2017,21-24,65-74,,,,3,MAPC,V1:(No) / V2:(No),V1:(No) / V2:(No),MIDDLESEX,,,No hit and run,,Not reported,"No, school bus not involved",40.0,"Yes, device functioning",No,12781.0,2013.0,0.378,417.0,0.156,228.0,Both sides,Designated truck route ONLY under State Author...,4.0,4.0,0.0,No Shoulder,24.0,Bituminous concrete road,0.0,No Shoulder,2.0,0.0,0.0,,Large Urbanized Area,Rural minor arterial or urban principal arterial,Boston (MA-NH-RI),,Mainline roadway,Two-way traffic,No control,,30.0,WASHINGTON STREET,MARLBOROUGH CITY LINE,MAIN STREET,Hudson,Deficient,Level,"Other urban area, including undeveloped land",,,,,,Principal Arterial - Other
3,4,4304701,READING,01/02/2017,11:48 AM,2017/01/02 11:48:00+00,11:00AM to 11:59AM,Closed,Property damage only (none injured),No injury,2,0,0,Local police,Angle,V1: Turning left / V2: Travelling straight ahead,V1: S / V2: W,V1:(Collision with motor vehicle in traffic) ...,Daylight,Clear/Clear,Dry,Collision with motor vehicle in traffic,V1:(Collision with motor vehicle in traffic) /...,D1: (Failed to yield right of way),"V1:(Passenger car) / V2:(Light truck(van, mini...",,VILLAGE ST / JOHN ST,,,,,,,,,,,T-intersection,Stop signs,"Two-way, not divided",City or Town accepted road,Roadway,,,,Yes,At Intersection,233147.953104,919004.312321,42.520877,-71.096591,PW201700300711,631759,2017,25-34,55-64,,,,4,MAPC,V1:(No) / V2:(No),V1:(No) / V2:(No),MIDDLESEX,,,No hit and run,,,"No, school bus not involved",30.0,"Yes, device functioning",No,8032.0,2013.0,0.254,216.0,0.097,104.0,,Not a parkway - not on a designated truck route,,,,,30.0,Bituminous concrete road,2.0,Stable - Unruttable compacted subgrade,2.0,0.0,,,Large Urbanized Area,Urban minor arterial or rural major collector,Boston (MA-NH-RI),,Mainline roadway,Two-way traffic,No control,,,VILLAGE STREET,HAVEN STREET,JOHN STREET,Reading,Fair,Level,"Not applicable (i.e., not a principal arterial...",,,,,,Minor Arterial
4,5,4304702,READING,01/02/2017,4:32 PM,2017/01/02 16:31:59+00,04:00PM to 04:59PM,Closed,Not Reported,Not reported,2,0,0,Local police,Unknown,V1: Parked / V2: Unknown,V1: W / V2: U,V1:(Collision with motor vehicle in traffic) ...,Dusk,Cloudy/Clear,Dry,Collision with parked motor vehicle,V1:(Collision with motor vehicle in traffic) /...,D1: (Unknown),"V1:(Light truck(van, mini-van, pickup, sport u...",30.0,GENERAL WAY,,,,,,,,,,MARKET BASKET PARKING LOT,Driveway,No controls,"Two-way, not divided",,Roadway,,,,Yes,Operator Designated,233109.189035,918827.160867,42.519285,-71.097076,PW201700300469,631768,2017,,,,,,4,MAPC,V1:(No) / V2:(Unknown),V1:(No) / V2:(No),MIDDLESEX,,,"Yes, hit and run",,,"No, school bus not involved",15.0,Not reported,No,,,,,,,,,,,,,,,,,,,,,Large Urbanized Area,,Boston (MA-NH-RI),,,,,,,GENERAL WAY,WALKERS BROOK DRIVE,,Reading,,Level,,,,,,,


In [12]:
# Clean ACS data
# Create a copy of ACS_MUNI_DF as df_acs_muni
df_acs_muni = ACS_MUNI_DF.copy()

# Create a boolean mask to identify rows where 'MUNI_NAME' is 'County subdivisions not defined'
mask = df_acs_muni['MUNI_NAME'] == 'County subdivisions not defined'

# Use the mask to drop those rows, then reset the index and discard the old one
df_acs_muni = df_acs_muni[~mask].reset_index(drop=True)

# Print the shape of dataframe 
print(df_acs_muni.shape)

# Show dataframe
df_acs_muni.head()

(351, 17)


Unnamed: 0,NAME,POPULATION,BIKE_TO_WORK_EST,BIKE_TO_WORK_MARG,WALK_TO_WORK_EST,WALK_TO_WORK_MARG,DRIVE_SOLO_TO_WORK_EST,DRIVE_SOLO_TO_WORK_MARG,CARPOOL_TO_WORK_EST,CARPOOL_TO_WORK_MARG,PUBTRANS_TO_WORK_EST,PUBTRANS_TO_WORK_MARG,state,county,county_subdivision,MUNI_NAME,COUNTY_NAME
0,"Barnstable Town city, Barnstable County",48556,35,52,843,321,18901,926,2649,482,303,136,25,1,3690,Barnstable Town city,Barnstable County
1,"Bourne town, Barnstable County",20364,0,25,182,64,8471,673,722,288,84,55,25,1,7175,Bourne town,Barnstable County
2,"Brewster town, Barnstable County",10282,66,70,1,3,3733,445,153,88,24,29,25,1,7980,Brewster town,Barnstable County
3,"Chatham town, Barnstable County",6554,0,19,124,74,1666,252,196,96,93,102,25,1,12995,Chatham town,Barnstable County
4,"Dennis town, Barnstable County",14664,8,16,147,108,4930,515,658,173,37,47,25,1,16775,Dennis town,Barnstable County


In [13]:
# Lowercase the municipality name from the ACS dataframe
df_acs_muni["muni"] = df_acs_muni["MUNI_NAME"].str.lower()

# Split the lowercased name into words so the "town"/"city" suffix can be dropped before merging
df_acs_muni["municipality_name"] = df_acs_muni["muni"].str.split(' ').str[0]
df_acs_muni["municipality_name_second"] = df_acs_muni["muni"].str.split(' ').str[1]

# Define a function that keeps the second word only when it is part of the name
# (that is, when it is not the "town" or "city" suffix)
def concatenate_municipalities(row):
    if "town" not in row["municipality_name_second"] and "city" not in row["municipality_name_second"]:
        return row["municipality_name"] + " " + row["municipality_name_second"]
    else:
        return row["municipality_name"]

# Apply the function to create a new column
df_acs_muni["muni_name"] = df_acs_muni.apply(concatenate_municipalities, axis=1)

# Drop unnecessary columns
df_acs_muni = df_acs_muni.drop(columns=["muni","municipality_name", "municipality_name_second"])

In [14]:
df_acs_muni.head(3)

Unnamed: 0,NAME,POPULATION,BIKE_TO_WORK_EST,BIKE_TO_WORK_MARG,WALK_TO_WORK_EST,WALK_TO_WORK_MARG,DRIVE_SOLO_TO_WORK_EST,DRIVE_SOLO_TO_WORK_MARG,CARPOOL_TO_WORK_EST,CARPOOL_TO_WORK_MARG,PUBTRANS_TO_WORK_EST,PUBTRANS_TO_WORK_MARG,state,county,county_subdivision,MUNI_NAME,COUNTY_NAME,muni_name
0,"Barnstable Town city, Barnstable County",48556,35,52,843,321,18901,926,2649,482,303,136,25,1,3690,Barnstable Town city,Barnstable County,barnstable
1,"Bourne town, Barnstable County",20364,0,25,182,64,8471,673,722,288,84,55,25,1,7175,Bourne town,Barnstable County,bourne
2,"Brewster town, Barnstable County",10282,66,70,1,3,3733,445,153,88,24,29,25,1,7980,Brewster town,Barnstable County,brewster
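
The word-splitting approach above looks only at the first two words of each name. A possible alternative is to strip the trailing type suffix with a regular expression, which does not depend on how many words the name has. This is a sketch, not the notebook's method: `normalize_muni_name` is a hypothetical helper, and "Two Word town" is a made-up name used to show the multi-word case.

```python
import re

# Matches a trailing " town", " city", or " Town city" in any letter case
SUFFIX_PATTERN = re.compile(r"\s+(town city|town|city)$", flags=re.IGNORECASE)

def normalize_muni_name(name):
    """Lowercase an ACS county-subdivision name and drop its trailing type suffix."""
    return SUFFIX_PATTERN.sub("", name.strip()).lower()

# The first two names come from the ACS output above; the third is made up
examples = ["Barnstable Town city", "Bourne town", "Two Word town"]
for example in examples:
    print(example, "->", normalize_muni_name(example))
```

Applied to the dataframe, this could look like `df_acs_muni["MUNI_NAME"].map(normalize_muni_name)`.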


## Merge ACS Data and mass_crash Data (Massachusetts Crash Data from 2017 to 2021)

In [15]:
# Lowercase CITY_TOWN_NAME from the mass_crash dataframe to build the merge key
mass_crash["muni_name"] = mass_crash["CITY_TOWN_NAME"].str.lower()

# Merge dataframes (an inner merge keeps only municipalities present in both)
df_mass_acs = mass_crash.merge(df_acs_muni, how="inner", on="muni_name")

# Drop the muni_name column
df_mass_acs = df_mass_acs.drop(columns=["muni_name"])

# Reset the index
df_mass_acs = df_mass_acs.reset_index(drop=True)

# Show dataframe
df_mass_acs.head()

Unnamed: 0,OBJECTID,CRASH_NUMB,CITY_TOWN_NAME,CRASH_DATE_TEXT,CRASH_TIME,CRASH_DATETIME,CRASH_HOUR,CRASH_STATUS,CRASH_SEVERITY_DESCR,MAX_INJR_SVRTY_CL,NUMB_VEHC,NUMB_NONFATAL_INJR,NUMB_FATAL_INJR,POLC_AGNCY_TYPE_DESCR,MANR_COLL_DESCR,VEHC_MNVR_ACTN_CL,VEHC_TRVL_DIRC_CL,VEHC_SEQ_EVENTS_CL,AMBNT_LIGHT_DESCR,WEATH_COND_DESCR,ROAD_SURF_COND_DESCR,FIRST_HRMF_EVENT_DESCR,MOST_HRMFL_EVT_CL,DRVR_CNTRB_CIRC_CL,VEHC_CONFIG_CL,STREET_NUMB,RDWY,DIST_DIRC_FROM_INT,NEAR_INT_RDWY,MM_RTE,DIST_DIRC_MILEMARKER,MILEMARKER,EXIT_RTE,DIST_DIRC_EXIT,EXIT_NUMB,DIST_DIRC_LANDMARK,LANDMARK,RDWY_JNCT_TYPE_DESCR,TRAF_CNTRL_DEVC_TYPE_DESCR,TRAFY_DESCR_DESCR,JURISDICTN,FIRST_HRMF_EVENT_LOC_DESCR,NON_MTRST_TYPE_CL,NON_MTRST_ACTN_CL,NON_MTRST_LOC_CL,IS_GEOCODED,GEOCODING_METHOD_NAME,X,Y,LAT,LON,RMV_DOC_IDS,CRASH_RPT_IDS,YEAR,AGE_DRVR_YNGST,AGE_DRVR_OLDEST,AGE_NONMTRST_YNGST,AGE_NONMTRST_OLDEST,DRVR_DISTRACTED_CL,DISTRICT_NUM,RPA_ABBR,VEHC_EMER_USE_CL,VEHC_TOWED_FROM_SCENE_CL,CNTY_NAME,FMCSA_RPTBL_CL,FMCSA_RPTBL,HIT_RUN_DESCR,LCLTY_NAME,ROAD_CNTRB_DESCR,SCHL_BUS_RELD_DESCR,SPEED_LIMIT,TRAF_CNTRL_DEVC_FUNC_DESCR,WORK_ZONE_RELD_DESCR,AADT,AADT_YEAR,PK_PCT_SUT,AV_PCT_SUT,PK_PCT_CT,AV_PCT_CT,CURB,TRUCK_RTE,LT_SIDEWLK,RT_SIDEWLK,SHLDR_LT_W,SHLDR_LT_T,SURFACE_WD,SURFACE_TP,SHLDR_RT_W,SHLDR_RT_T,NUM_LANES,OPP_LANES,MED_WIDTH,MED_TYPE,URBAN_TYPE,F_CLASS,URBAN_AREA,FD_AID_RTE,FACILITY,OPERATION,CONTROL,PEAK_LANE,SPEED_LIM,STREETNAME,FROMSTREETNAME,TOSTREETNAME,CITY,STRUCT_CND,TERRAIN,URBAN_LOC_TYPE,AADT_DERIV,STATN_NUM,OP_DIR_SL,SHLDR_UL_T,SHLDR_UL_W,F_F_CLASS,NAME,POPULATION,BIKE_TO_WORK_EST,BIKE_TO_WORK_MARG,WALK_TO_WORK_EST,WALK_TO_WORK_MARG,DRIVE_SOLO_TO_WORK_EST,DRIVE_SOLO_TO_WORK_MARG,CARPOOL_TO_WORK_EST,CARPOOL_TO_WORK_MARG,PUBTRANS_TO_WORK_EST,PUBTRANS_TO_WORK_MARG,state,county,county_subdivision,MUNI_NAME,COUNTY_NAME
0,1,4304436,FREETOWN,01/01/2017,12:43 PM,2017/01/01 12:43:00+00,12:00PM to 12:59PM,Closed,Non-fatal injury,Non-fatal injury - Possible,2,1,0,Local police,Angle,V1: Travelling straight ahead / V2: Entering t...,V1: W / V2: N,V1:(Collision with motor vehicle in traffic) ...,Daylight,Clear,Dry,Collision with motor vehicle in traffic,V1:(Collision with motor vehicle in traffic) /...,D1: (No improper driving) / D2: (Failed to y...,V1:(Passenger car) / V2:(Passenger car),2.0,CHACE RD,248 feet W of,COUNTY RD,,,,,,,,,Not at junction,No controls,"Two-way, not divided",City or Town accepted road,Roadway,,,,Yes,Off Intersection,245207.443463,835717.896456,41.770454,-70.956298,PW201700300202,17-1-AC,2017,35-44,65-74,,,,5,SRPEDD,V1:(No) / V2:(No),V1:(No) / V2:(No),BRISTOL,,,No hit and run,,,"No, school bus not involved",40.0,Not reported,No,2265.0,2013.0,0.617,130.0,0.115,30.0,,Not a parkway - not on a designated truck route,,,,,40.0,Surface-treated road,2.0,Stable - Unruttable compacted subgrade,2.0,0.0,,,Small Urbanized Area,Urban minor arterial or rural major collector,New Bedford,,Mainline roadway,Two-way traffic,No control,,,CHACE ROAD,RAMP-RT 140 NB TO CHACE RD,COUNTY ROAD,Freetown,Fair,Level,"Not applicable (i.e., not a principal arterial...",,,,,,Minor Arterial,"Freetown town, Bristol County",9165,0,19,0,19,4422,333,309,147,22,21,25,5,25240,Freetown town,Bristol County
1,510,4308146,FREETOWN,01/05/2017,4:40 PM,2017/01/05 16:40:00+00,04:00PM to 04:59PM,Closed,Property damage only (none injured),No injury,1,0,0,Local police,"Sideswipe, opposite direction",V1: Travelling straight ahead,V1: N,V1:(Collision with motor vehicle in traffic),Dark - lighted roadway,Cloudy,Dry,Collision with motor vehicle in traffic,V1:(Collision with motor vehicle in traffic),D1: (Unknown),V1:(Passenger car),54.0,COUNTY RD,396 feet N of,MIDDLEBORO RD Rte 18,,,,,,,,,Not at junction,No controls,"Two-way, not divided",Massachusetts Department of Transportation,Roadway,,,,Yes,Off Intersection,246188.076667,833655.702052,41.751833,-70.944661,PW201701000211,17-3-AC,2017,25-34,25-34,,,,5,SRPEDD,V1:(No),V1:(No),BRISTOL,,,No hit and run,,,"No, school bus not involved",45.0,Not reported,No,2730.0,2013.0,0.615,156.0,0.114,36.0,,Designated truck route ONLY under State Author...,,,,,25.0,Bituminous concrete road,0.0,No Shoulder,2.0,0.0,,,Small Urbanized Area,Urban minor arterial or rural major collector,New Bedford,,Mainline roadway,Two-way traffic,No control,2.0,30.0,COUNTY ROAD,NEW BEDFORD CITY LINE,LAKEVILLE TOWN LINE,Freetown,Good,Level,"Not applicable (i.e., not a principal arterial...",,,,,,Minor Arterial,"Freetown town, Bristol County",9165,0,19,0,19,4422,333,309,147,22,21,25,5,25240,Freetown town,Bristol County
2,511,4308147,FREETOWN,01/06/2017,9:04 AM,2017/01/06 09:04:00+00,09:00AM to 09:59AM,Closed,Property damage only (none injured),No injury,1,0,0,Local police,Single vehicle crash,V1: Travelling straight ahead,V1: W,"V1:(Cross median or centerline),(Ran off road...",Daylight,"Snow/Blowing sand, snow",Snow,Collision with utility pole,V1:(Collision with utility pole),D1: (No improper driving),"V1:(Light truck(van, mini-van, pickup, sport u...",,CHACE RD / LOUISE AVE,,,,,,,,,,,T-intersection,No controls,"Two-way, not divided",City or Town accepted road,Roadside,,,,Yes,At Intersection,239939.235286,835197.711027,41.766057,-71.01969,PW201701000212,17-5-AC,2017,55-64,55-64,,,,5,SRPEDD,V1:(No),"V1:(Yes, vehicle or trailer disabled)",BRISTOL,,,No hit and run,,"Road surface condition (wet, icy, snow, slush,...","No, school bus not involved",,Not reported,No,2976.0,2013.0,0.955,305.0,0.137,39.0,,Not a parkway - not on a designated truck route,,,,,22.0,Surface-treated road,2.0,Stable - Unruttable compacted subgrade,2.0,0.0,,,Small Urbanized Area,Urban minor arterial or rural major collector,New Bedford,,Mainline roadway,Two-way traffic,No control,,,CHACE ROAD,SLAB BRIDGE ROAD,RAMP-RT 140 SB TO CHACE RD,Freetown,Fair,Level,"Not applicable (i.e., not a principal arterial...",,,,,,Minor Arterial,"Freetown town, Bristol County",9165,0,19,0,19,4422,333,309,147,22,21,25,5,25240,Freetown town,Bristol County
3,540,4308239,FREETOWN,01/06/2017,6:45 AM,2017/01/06 06:45:00+00,06:00AM to 06:59AM,Closed,Non-fatal injury,Non-fatal injury - Non-incapacitating,1,2,0,Local police,Single vehicle crash,V1: Travelling straight ahead,V1: W,"V1:(Ran off road right),(Collision with tree)",Dawn,"Snow/Blowing sand, snow",Snow,Collision with tree,V1:(Collision with tree),D1: (No improper driving),V1:(Passenger car),120.0,QUANAPOAG RD,,,,,,,,,,,Not at junction,No controls,"Two-way, not divided",City or Town accepted road,Outside roadway,,,,Yes,At Address,241995.699963,830904.119896,41.72729,-70.995258,PW201701000312,17-4-AC,2017,25-34,25-34,,,,5,SRPEDD,V1:(No),"V1:(Yes, vehicle or trailer disabled)",BRISTOL,,,No hit and run,,"Road surface condition (wet, icy, snow, slush,...","No, school bus not involved",30.0,Not reported,No,,,,,,,,Not a parkway - not on a designated truck route,,,,,22.0,Surface-treated road,2.0,Stable - Unruttable compacted subgrade,2.0,0.0,,,Small Urbanized Area,Local,New Bedford,,Mainline roadway,Two-way traffic,No control,,,QUANAPOAG ROAD,FALL RIVER CITY LINE,CHIPAWAY ROAD,Freetown,Fair,Level,"Not applicable (i.e., not a principal arterial...",,,,,,Local,"Freetown town, Bristol County",9165,0,19,0,19,4422,333,309,147,22,21,25,5,25240,Freetown town,Bristol County
4,541,4308240,FREETOWN,01/07/2017,11:09 AM,2017/01/07 11:09:00+00,11:00AM to 11:59AM,Closed,Property damage only (none injured),No injury,2,0,0,Local police,Rear-end,V1: Travelling straight ahead / V2: Travelling...,V1: W / V2: W,V1:(Collision with motor vehicle in traffic) ...,Daylight,"Snow/Blowing sand, snow",Snow,Collision with motor vehicle in traffic,V1:(Collision with motor vehicle in traffic) /...,D1: (No improper driving) / D2: (No improper...,V1:(Passenger car) / V2:(Passenger car),193.0,CHACE RD,,,,,,,,,230 feet E of,195 CHACE RD,Not at junction,No controls,"Two-way, not divided",City or Town accepted road,Roadway,,,,Yes,At Address,241102.119868,834830.499901,41.762684,-71.005729,PW201701000313,17-6-AC,2017,55-64,>84,,,,5,SRPEDD,V1:(No) / V2:(No),V1:(No) / V2:(No),BRISTOL,,,No hit and run,,"Road surface condition (wet, icy, snow, slush,...","No, school bus not involved",35.0,Not reported,No,2976.0,2013.0,0.955,305.0,0.137,39.0,,Not a parkway - not on a designated truck route,,,,,22.0,Surface-treated road,2.0,Stable - Unruttable compacted subgrade,2.0,0.0,,,Small Urbanized Area,Urban minor arterial or rural major collector,New Bedford,,Mainline roadway,Two-way traffic,No control,,,CHACE ROAD,SLAB BRIDGE ROAD,RAMP-RT 140 SB TO CHACE RD,Freetown,Fair,Level,"Not applicable (i.e., not a principal arterial...",,,,,,Minor Arterial,"Freetown town, Bristol County",9165,0,19,0,19,4422,333,309,147,22,21,25,5,25240,Freetown town,Bristol County


In [16]:
# Print the number of unique municipalities in df_mass_acs
unique_names = df_mass_acs['CITY_TOWN_NAME'].unique()
print(len(unique_names))

347


**The DataFrame `df_acs_muni` contains 351 unique Massachusetts municipalities, whereas our final DataFrame `df_mass_acs` (Massachusetts crash data from 2017 to 2021) contains only 347. Four municipalities are therefore missing from `df_mass_acs`.**
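
To see which four municipalities are missing, the two sets of names can be compared directly. The sketch below is only an illustration: `find_missing_municipalities` is a hypothetical helper that is not part of the original notebook, and the toy lists stand in for the real name columns. It assumes both inputs hold plain municipality names (for example `Freetown` rather than `Freetown town`), so any ACS suffix would have to be stripped first.

```python
def find_missing_municipalities(acs_names, crash_names):
    """Return the ACS municipality names that never appear among the crash names.

    Hypothetical helper: names are compared case-insensitively after
    trimming surrounding whitespace.
    """
    def normalize(name):
        return str(name).strip().upper()

    crash_set = {normalize(name) for name in crash_names}
    acs_set = {normalize(name) for name in acs_names}
    # Set difference: names present in ACS but absent from the crash data
    return sorted(acs_set - crash_set)


# Toy data standing in for the real columns (assumption, not real results)
acs_example = ['Freetown', 'Carlisle', 'Gosnold', 'Boston']
crash_example = ['FREETOWN', 'BOSTON', 'BOSTON']

print(find_missing_municipalities(acs_example, crash_example))
```

With the notebook's DataFrames, the second argument would be `df_mass_acs['CITY_TOWN_NAME']`, and the first would be whichever column of `df_acs_muni` holds the plain municipality name. A name that shows up here may have had no recorded crashes in the period, or may simply be spelled differently in the two sources, so the result is a starting point for inspection rather than a final answer.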

In [17]:
# Assuming 'data' is a subdirectory of the current working directory
folder_path = 'data/'
file_name = 'df_mass_acs.csv'

# Create the folder if it does not exist yet, so that to_csv does not fail
Path(folder_path).mkdir(parents=True, exist_ok=True)

# Combine the folder path and file name to create the full file path
full_file_path = folder_path + file_name

# Export the DataFrame to a CSV file
df_mass_acs.to_csv(full_file_path, index=True)

## Compress the CSV file before uploading it to GitHub

In [18]:
import gzip
import shutil

# Path to the CSV file you want to compress
csv_file_path = 'data/df_mass_acs.csv'

# Path for the compressed file
compressed_file_path = 'data/df_mass_acs.csv.gz'

# Open the CSV file for reading
with open(csv_file_path, 'rb') as f_in:
    # Open the compressed file for writing
    with gzip.open(compressed_file_path, 'wb') as f_out:
        # Copy the contents of the CSV file to the compressed file
        shutil.copyfileobj(f_in, f_out)

print(f'File compressed to: {compressed_file_path}')

File compressed to: data/df_mass_acs.csv.gz
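
Before uploading, it can be worth confirming that the compressed file decompresses back to exactly the original CSV. The following is a minimal sketch of such a check, not part of the original notebook: `sha256_of_stream` and `gzip_roundtrip_ok` are hypothetical helpers, and the demo runs on a small temporary file rather than on `data/df_mass_acs.csv`. Both files are hashed in chunks so that a large CSV does not have to fit in memory.

```python
import gzip
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of_stream(stream, chunk_size=1024 * 1024):
    """Hash a binary stream chunk by chunk and return the hex digest."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b''):
        digest.update(chunk)
    return digest.hexdigest()


def gzip_roundtrip_ok(csv_path, gz_path):
    """Return True if decompressing gz_path yields the same bytes as csv_path."""
    with open(csv_path, 'rb') as f_csv, gzip.open(gz_path, 'rb') as f_gz:
        return sha256_of_stream(f_csv) == sha256_of_stream(f_gz)


# Demo on a small temporary file (illustrative data only)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_csv = Path(tmp_dir) / 'demo.csv'
    demo_gz = Path(tmp_dir) / 'demo.csv.gz'
    demo_csv.write_bytes(b'CITY_TOWN_NAME,YEAR\nFREETOWN,2017\n')

    # Compress the same way as the cell above
    with open(demo_csv, 'rb') as f_in, gzip.open(demo_gz, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

    print(gzip_roundtrip_ok(demo_csv, demo_gz))
```

For the notebook's own files, the call would be `gzip_roundtrip_ok('data/df_mass_acs.csv', 'data/df_mass_acs.csv.gz')`.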
