Step 0: Import all relevant libraries used in this project

Step 1: Download CTA-related crime data from the City of Chicago data portal API, covering records through November 15, 2019

Step 2: Clean the crime data

Step 3: Create Grid Table based on crime data

Step 4: Assign GridID to crime data

Step 5: Load BusStop and TrainStop data

Step 6: Clean BusStop and TrainStop data

Step 7: Assign GridID to BusStop and TrainStop data

Step 8: Load holiday data

Step 9: Clean holiday data

Step 10: Load tables into CloudSQL

Step 11: Daily refresh of crime data

Step 12: Clean daily updated crime data

Step 13: Assign GridID to daily updated crime data

Step 14: Append daily updated crime data to the crime database in CloudSQL
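Steps 11 to 14 depend on fetching only the rows that are newer than the last load. A minimal sketch of how the filter for such a refresh could be built is shown here. It assumes the refresh filters on the dataset's date field and that the clause is passed to sodapy through its where keyword; the function name build_refresh_filter is hypothetical.

```python
from datetime import datetime


def build_refresh_filter(last_loaded: datetime, date_field: str = "date") -> str:
    """Return a SoQL where clause that selects rows newer than last_loaded."""
    # SoQL compares timestamp fields against ISO 8601 strings in single quotes.
    stamp = last_loaded.strftime("%Y-%m-%dT%H:%M:%S")
    return f"{date_field} > '{stamp}'"


# The initial load ran through November 15, 2019, so the first refresh starts there.
where_clause = build_refresh_filter(datetime(2019, 11, 15))
print(where_clause)

# With a sodapy client, the clause would be passed as a keyword argument, e.g.:
# new_rows = client.get("5xiy-qnsz", where=where_clause, limit=6000)
```

Filtering on updated_on instead of date would also pick up older records that were edited later; which field fits better depends on how the crime table is meant to be maintained.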

Import and clean CTA data (for the .dbf to .csv conversion, see https://pypi.org/project/simpledbf/; install with: pip install simpledbf)

In [21]:
# import relevant libraries
import pandas as pd
import dbfread
from simpledbf import Dbf5

Create city grid table to generate unique IDs for all crimes and CTA locations to match data across datasets

In [None]:
# @Peter: insert code for grid table generation here
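Until the project's grid code is added, the following is one possible sketch of how a GridID could be derived from latitude and longitude. It is not the project's actual method: the bounding box, the number of rows and columns, and the function name assign_grid_id are all assumptions chosen for illustration.

```python
# Hypothetical bounding box roughly covering Chicago; adjust to the real extent of the data.
LAT_MIN, LAT_MAX = 41.60, 42.10
LON_MIN, LON_MAX = -87.95, -87.50

# Assumed grid resolution: 50 rows by 45 columns of equally sized cells.
N_ROWS, N_COLS = 50, 45


def assign_grid_id(lat, lon):
    """Return an integer GridID for a coordinate, or None if it lies outside the box."""
    if not (LAT_MIN <= lat < LAT_MAX and LON_MIN <= lon < LON_MAX):
        return None
    # Scale the coordinate to a fractional row/column position and truncate to the cell index.
    row = int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * N_ROWS)
    col = int((lon - LON_MIN) / (LON_MAX - LON_MIN) * N_COLS)
    # Guard against floating-point rounding at the upper edges.
    row = min(row, N_ROWS - 1)
    col = min(col, N_COLS - 1)
    # Number the cells row by row so every cell gets a unique ID.
    return row * N_COLS + col


# Two stops at the same coordinates (both directions of the 18th station) share a GridID.
print(assign_grid_id(41.857908, -87.669147))
```

Because crimes, bus stops and train stops all carry latitude and longitude, the same function can be applied to each dataset so that the resulting GridID works as a join key.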

In [22]:
# import CTA_BusStops.dbf
# retrieved from GitHub
# originally downloaded from https://data.cityofchicago.org/Transportation/CTA-Bus-Stops-Shapefile/pxug-u72f
dbf1 = Dbf5('Datasets/CTA_BusStops.dbf', codec='utf-8')

In [23]:
# take a look at the file
dbf1.fields

[('DeletionFlag', 'C', 1),
 ('OBJECTID', 'N', 10),
 ('SYSTEMSTOP', 'N', 19),
 ('STREET', 'C', 75),
 ('CROSS_ST', 'C', 75),
 ('DIR', 'C', 3),
 ('POS', 'C', 4),
 ('ROUTESSTPG', 'C', 75),
 ('OWLROUTES', 'C', 20),
 ('CITY', 'C', 20),
 ('STATUS', 'N', 10),
 ('PUBLIC_NAM', 'C', 75),
 ('POINT_X', 'N', 19),
 ('POINT_Y', 'N', 19)]

In [24]:
# export .dbf file to .csv (BusStops)
dbf1.to_csv('Datasets/CTA_BusStops.csv')

In [25]:
# read .csv for BusStops
BusStops = pd.read_csv('Datasets/CTA_BusStops.csv', index_col = 'OBJECTID')

In [26]:
# rename columns POINT_X and POINT_Y to longitude and latitude
BusStops = BusStops.rename(columns={"POINT_X": "longitude", "POINT_Y":"latitude"})

In [27]:
# look at clean data frame
BusStops.head()

OBJECTID,SYSTEMSTOP,STREET,CROSS_ST,DIR,POS,ROUTESSTPG,OWLROUTES,CITY,STATUS,PUBLIC_NAM,longitude,latitude
193,6696.0,TAYLOR,THROOP,EB,NS,157,,CHICAGO,1,Taylor & Throop,-87.659294,41.869314
194,22.0,JACKSON,KARLOV,EB,FS,126,,CHICAGO,1,Jackson & Karlov,-87.727808,41.877007
195,4767.0,FOSTER,MONTICELLO,EB,NS,92,,CHICAGO,1,Foster & Monticello,-87.71978,41.975526
196,6057.0,ASHLAND,CERMAK/BLUE ISLAND,SB,NS,"9,X9",N9,CHICAGO,1,Ashland & Cermak/Blue Island,-87.666173,41.852484
197,1790.0,CLARK,ALBION,SB,NS,22,N22,CHICAGO,1,Clark & Albion,-87.671981,42.001785


In [28]:
# import CTA_TrainStops.csv
# retrieved from GitHub
# originally downloaded from https://data.cityofchicago.org/Transportation/CTA-System-Information-List-of-L-Stops/8pix-ypme
TrainStops = pd.read_csv('Datasets/CTA_TrainStops.csv', index_col = 'STOP_ID')
TrainStops.head()

STOP_ID,DIRECTION_ID,STOP_NAME,STATION_NAME,STATION_DESCRIPTIVE_NAME,MAP_ID,ADA,RED,BLUE,G,BRN,P,Pexp,Y,Pnk,O,Location
30162,W,18th (54th/Cermak-bound),18th,18th (Pink Line),40830,True,False,False,False,False,False,False,False,True,False,"(41.857908, -87.669147)"
30161,E,18th (Loop-bound),18th,18th (Pink Line),40830,True,False,False,False,False,False,False,False,True,False,"(41.857908, -87.669147)"
30022,N,35th/Archer (Loop-bound),35th/Archer,35th/Archer (Orange Line),40120,True,False,False,False,False,False,False,False,False,True,"(41.829353, -87.680622)"
30023,S,35th/Archer (Midway-bound),35th/Archer,35th/Archer (Orange Line),40120,True,False,False,False,False,False,False,False,False,True,"(41.829353, -87.680622)"
30214,S,35-Bronzeville-IIT (63rd-bound),35th-Bronzeville-IIT,35th-Bronzeville-IIT (Green Line),41120,True,False,False,True,False,False,False,False,False,False,"(41.831677, -87.625826)"


In [29]:
# split the Location column of TrainStops into latitude and longitude
# split location column
LocationNew = TrainStops["Location"].str.split(",", n = 1, expand = True)
LocationNew.head()

STOP_ID,0,1
30162,(41.857908,-87.669147)
30161,(41.857908,-87.669147)
30022,(41.829353,-87.680622)
30023,(41.829353,-87.680622)
30214,(41.831677,-87.625826)


In [30]:
# remove parentheses and surrounding whitespace (assign back instead of replacing in place on a column)
LocationNew[0] = LocationNew[0].str.replace("(", "", regex=False).str.strip()
LocationNew[1] = LocationNew[1].str.replace(")", "", regex=False).str.strip()
LocationNew.head()

STOP_ID,0,1
30162,41.857908,-87.669147
30161,41.857908,-87.669147
30022,41.829353,-87.680622
30023,41.829353,-87.680622
30214,41.831677,-87.625826


In [31]:
# add latitude and longitude as numeric columns and drop the original Location column
TrainStops["latitude"] = LocationNew[0].astype(float)
TrainStops["longitude"] = LocationNew[1].astype(float)
TrainStops = TrainStops.drop(columns="Location")

In [32]:
# look at clean data frame
TrainStops.head()

STOP_ID,DIRECTION_ID,STOP_NAME,STATION_NAME,STATION_DESCRIPTIVE_NAME,MAP_ID,ADA,RED,BLUE,G,BRN,P,Pexp,Y,Pnk,O,latitude,longitude
30162,W,18th (54th/Cermak-bound),18th,18th (Pink Line),40830,True,False,False,False,False,False,False,False,True,False,41.857908,-87.669147
30161,E,18th (Loop-bound),18th,18th (Pink Line),40830,True,False,False,False,False,False,False,False,True,False,41.857908,-87.669147
30022,N,35th/Archer (Loop-bound),35th/Archer,35th/Archer (Orange Line),40120,True,False,False,False,False,False,False,False,False,True,41.829353,-87.680622
30023,S,35th/Archer (Midway-bound),35th/Archer,35th/Archer (Orange Line),40120,True,False,False,False,False,False,False,False,False,True,41.829353,-87.680622
30214,S,35-Bronzeville-IIT (63rd-bound),35th-Bronzeville-IIT,35th-Bronzeville-IIT (Green Line),41120,True,False,False,True,False,False,False,False,False,False,41.831677,-87.625826
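The split-and-replace steps above can also be expressed as a single parsing function, which makes malformed values fail loudly instead of passing through silently. This is a sketch only; the function name parse_location is hypothetical, and it assumes every Location value has the form "(latitude, longitude)" as in the rows shown.

```python
import re

# One "(number, number)" pair with optional whitespace around each part.
_PAIR = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def parse_location(text):
    """Parse a '(lat, lon)' string into a (latitude, longitude) tuple of floats."""
    match = _PAIR.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unexpected Location format: {text!r}")
    return float(match.group(1)), float(match.group(2))


latitude, longitude = parse_location("(41.857908, -87.669147)")
print(latitude, longitude)
```

In the notebook this could be applied with TrainStops["Location"].apply(parse_location) before splitting the resulting tuples into the two columns.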


Connect to the crime data API and clean the data (install with: pip install sodapy)

In [35]:
# import the Socrata client from sodapy
from sodapy import Socrata

In [36]:
# API instructions https://dev.socrata.com/foundry/data.cityofchicago.org/ijzp-q8t2
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
# client = Socrata("data.cityofchicago.org", None)

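# Example authenticated client (needed for non-public datasets); credentials are read from
# environment variables (variable names assumed here) instead of being hardcoded in the notebook
import os
client = Socrata("data.cityofchicago.org", os.environ["SOCRATA_APP_TOKEN"],
                 username=os.environ["SOCRATA_USERNAME"], password=os.environ["SOCRATA_PASSWORD"])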

# Limit to 6000 rows to avoid timeout errors (the dataset has fewer than 6000 rows).
# The API returns JSON, which sodapy converts to a Python list of dictionaries.
results = client.get("5xiy-qnsz", limit = 6000)
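A fixed limit works while the dataset stays below 6000 rows. If it grows past that, the rows could be fetched in pages using the limit and offset parameters. The sketch below shows the paging loop in isolation; fetch_all and the stand-in fake_get are hypothetical names, and a list slice replaces the real API call so the loop can be run without network access.

```python
def fetch_all(get_page, page_size=1000):
    """Collect all rows by requesting pages until a short (or empty) page is returned."""
    rows = []
    offset = 0
    while True:
        page = get_page(limit=page_size, offset=offset)
        rows.extend(page)
        # A page shorter than page_size means the end of the dataset was reached.
        if len(page) < page_size:
            break
        offset += page_size
    return rows


# Stand-in for the API: 2500 fake records served in slices.
fake_rows = [{"id": str(i)} for i in range(2500)]


def fake_get(limit, offset):
    return fake_rows[offset:offset + limit]


all_rows = fetch_all(fake_get, page_size=1000)
print(len(all_rows))

# With sodapy, the page function could wrap client.get, ordering by the
# system field :id so that pages stay stable between requests:
# all_rows = fetch_all(lambda limit, offset: client.get(
#     "5xiy-qnsz", limit=limit, offset=offset, order=":id"))
```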

In [37]:
results[1]

{':@computed_region_d9mm_jgwp': '7',
 ':@computed_region_43wa_7qmu': '40',
 'date': '2011-03-20T13:00:00.000',
 'location': {'latitude': '41.913644076',
  'human_address': '{"address": "", "city": "", "state": "", "zip": ""}',
  'needs_recoding': False,
  'longitude': '-87.687306326'},
 'district': '014',
 'y_coordinate': '1911784',
 'block': '017XX N WESTERN AVE',
 ':@computed_region_rpca_8um6': '1',
 'latitude': '41.913644076',
 ':@computed_region_awaf_s7ux': '24',
 'description': 'CREDIT CARD FRAUD',
 'location_description': 'CTA BUS',
 'community_area': '24',
 'updated_on': '2018-02-10T15:50:01.000',
 'iucr': '1150',
 'x_coordinate': '1160086',
 'ward': '1',
 'year': '2011',
 'case_number': 'HT212010',
 'fbi_code': '11',
 'domestic': False,
 'longitude': '-87.687306326',
 ':@computed_region_bdys_3d7i': '298',
 'beat': '1434',
 ':@computed_region_6mkv_f3dw': '22535',
 'arrest': False,
 'primary_type': 'DECEPTIVE PRACTICE',
 ':@computed_region_vrxf_vc4k': '25',
 'id': '7979411',
 ':@

In [38]:
# Convert results to pandas DataFrame
crime_dirty = pd.DataFrame.from_records(results)

In [39]:
crime_dirty.head()

Unnamed: 0,:@computed_region_43wa_7qmu,:@computed_region_6mkv_f3dw,:@computed_region_awaf_s7ux,:@computed_region_bdys_3d7i,:@computed_region_d3ds_rm58,:@computed_region_d9mm_jgwp,:@computed_region_rpca_8um6,:@computed_region_vrxf_vc4k,arrest,beat,...,latitude,location,location_description,longitude,primary_type,updated_on,ward,x_coordinate,y_coordinate,year
0,36,4452,22,784,91,22,45,29,False,111,...,41.882189946,"{'latitude': '41.882189946', 'human_address': ...",CTA BUS,-87.641202862,BATTERY,2018-02-10T15:50:01.000,42,1172727,1900420,2011
1,40,22535,24,298,192,7,1,25,False,1434,...,41.913644076,"{'latitude': '41.913644076', 'human_address': ...",CTA BUS,-87.687306326,DECEPTIVE PRACTICE,2018-02-10T15:50:01.000,1,1160086,1911784,2011
2,14,21569,36,772,153,21,57,30,False,1021,...,41.859020037,"{'latitude': '41.859020037', 'human_address': ...",CTA BUS,-87.710681986,THEFT,2018-02-10T15:50:01.000,24,1153868,1891832,2011
3,13,21554,18,531,233,20,59,70,False,613,...,41.735931109,"{'latitude': '41.735931109', 'human_address': ...",CTA BUS,-87.653642482,THEFT,2018-02-10T15:50:01.000,21,1169761,1847097,2011
4,1,21569,14,261,130,21,57,32,False,1033,...,41.847153749,"{'latitude': '41.847153749', 'human_address': ...",CTA BUS,-87.70511925,ROBBERY,2018-02-10T15:50:01.000,24,1155415,1887519,2011


In [40]:
# investigate crime_dirty
crime_dirty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5808 entries, 0 to 5807
Data columns (total 30 columns):
:@computed_region_43wa_7qmu    5803 non-null object
:@computed_region_6mkv_f3dw    5806 non-null object
:@computed_region_awaf_s7ux    5803 non-null object
:@computed_region_bdys_3d7i    5790 non-null object
:@computed_region_d3ds_rm58    5803 non-null object
:@computed_region_d9mm_jgwp    5803 non-null object
:@computed_region_rpca_8um6    5803 non-null object
:@computed_region_vrxf_vc4k    5803 non-null object
arrest                         5808 non-null bool
beat                           5808 non-null object
block                          5808 non-null object
case_number                    5808 non-null object
community_area                 5808 non-null object
date                           5808 non-null object
description                    5808 non-null object
district                       5808 non-null object
domestic                       5808 non-null bool
fbi_code     

In [41]:
# validate the values, e.g. 'id'
crime_dirty.loc[2:3, 'id']

2    7981270
3    7979660
Name: id, dtype: object

In [42]:
# bring dataframe into proper format
crime = crime_dirty[['id', 
        'case_number', 
        'date', 
        'block', 
        'iucr', 
        'primary_type', 
        'description', 
        'location_description',
        'arrest',
        'domestic',
        'beat',
        'district',
        'ward',
        'community_area',
        'fbi_code',
        'x_coordinate',
        'y_coordinate',
        'year',
        'updated_on',
        'latitude',
        'longitude']]

In [43]:
# take a look at the proper dataframe
crime.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,...,district,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude
0,8256302,HT490196,2011-09-10T12:46:00.000,0000X N CLINTON ST,0460,BATTERY,SIMPLE,CTA BUS,False,False,...,1,42,28,08B,1172727,1900420,2011,2018-02-10T15:50:01.000,41.882189946,-87.641202862
1,7979411,HT212010,2011-03-20T13:00:00.000,017XX N WESTERN AVE,1150,DECEPTIVE PRACTICE,CREDIT CARD FRAUD,CTA BUS,False,False,...,14,1,24,11,1160086,1911784,2011,2018-02-10T15:50:01.000,41.913644076,-87.687306326
2,7981270,HT213521,2011-03-20T15:00:00.000,034XX W 16TH ST,0820,THEFT,$500 AND UNDER,CTA BUS,False,False,...,10,24,29,06,1153868,1891832,2011,2018-02-10T15:50:01.000,41.859020037,-87.710681986
3,7979660,HT212325,2011-03-20T20:09:00.000,012XX W 87TH ST,0820,THEFT,$500 AND UNDER,CTA BUS,False,False,...,6,21,71,06,1169761,1847097,2011,2018-02-10T15:50:01.000,41.735931109,-87.653642482
4,7989927,HT213271,2011-03-20T20:57:00.000,024XX S KEDZIE AVE,033A,ROBBERY,ATTEMPT: ARMED-HANDGUN,CTA BUS,False,False,...,10,24,30,03,1155415,1887519,2011,2018-02-10T15:50:01.000,41.847153749,-87.70511925


In [44]:
# rename column 'id' into 'crimeID'
crime = crime.rename(columns={"id": "crimeID"})

In [45]:
# define proper data types for each column; TODO: further cleaning is still needed here
crime = crime.astype({"crimeID": int})
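The API returns almost every field as a string, so the remaining cleaning is mostly type conversion. The sketch below shows the idea on a single record using values from the sample record printed earlier; the function name clean_record and the choice of fields are assumptions, and missing coordinates are mapped to None.

```python
from datetime import datetime


def clean_record(raw):
    """Convert selected string fields of one API record to typed values."""

    def to_float(value):
        # Some records have no coordinates; keep those as None.
        return float(value) if value not in (None, "") else None

    return {
        "crimeID": int(raw["id"]),
        "date": datetime.strptime(raw["date"], "%Y-%m-%dT%H:%M:%S.%f"),
        "year": int(raw["year"]),
        "arrest": bool(raw["arrest"]),
        "latitude": to_float(raw.get("latitude")),
        "longitude": to_float(raw.get("longitude")),
    }


sample = {
    "id": "7979411",
    "date": "2011-03-20T13:00:00.000",
    "year": "2011",
    "arrest": False,
    "latitude": "41.913644076",
    "longitude": "-87.687306326",
}
cleaned = clean_record(sample)
print(cleaned)
```

On the full dataframe, the same conversions could be done column by column with pd.to_datetime and pd.to_numeric.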

In [46]:
# set index of crime dataframe to 'crimeID' (assign the result; set_index returns a new dataframe)
crime = crime.set_index('crimeID')
crime.head(10)

crimeID,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,beat,district,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude
8256302,HT490196,2011-09-10T12:46:00.000,0000X N CLINTON ST,0460,BATTERY,SIMPLE,CTA BUS,False,False,0111,001,42,28,08B,1172727,1900420,2011,2018-02-10T15:50:01.000,41.882189946,-87.641202862
7979411,HT212010,2011-03-20T13:00:00.000,017XX N WESTERN AVE,1150,DECEPTIVE PRACTICE,CREDIT CARD FRAUD,CTA BUS,False,False,1434,014,1,24,11,1160086,1911784,2011,2018-02-10T15:50:01.000,41.913644076,-87.687306326
7981270,HT213521,2011-03-20T15:00:00.000,034XX W 16TH ST,0820,THEFT,$500 AND UNDER,CTA BUS,False,False,1021,010,24,29,06,1153868,1891832,2011,2018-02-10T15:50:01.000,41.859020037,-87.710681986
7979660,HT212325,2011-03-20T20:09:00.000,012XX W 87TH ST,0820,THEFT,$500 AND UNDER,CTA BUS,False,False,0613,006,21,71,06,1169761,1847097,2011,2018-02-10T15:50:01.000,41.735931109,-87.653642482
7989927,HT213271,2011-03-20T20:57:00.000,024XX S KEDZIE AVE,033A,ROBBERY,ATTEMPT: ARMED-HANDGUN,CTA BUS,False,False,1033,010,24,30,03,1155415,1887519,2011,2018-02-10T15:50:01.000,41.847153749,-87.70511925
8332454,HT566424,2011-10-30T09:00:00.000,067XX S HALSTED ST,0460,BATTERY,SIMPLE,CTA BUS,False,False,0723,007,6,68,08B,1172128,1860266,2011,2018-02-10T15:50:01.000,41.772016903,-87.64458443
8332084,HT566137,2011-10-30T02:08:00.000,0000X W 95TH ST,0560,ASSAULT,SIMPLE,CTA BUS,True,False,0634,006,21,49,08A,1177762,1841949,2011,2016-02-04T06:33:39.000,41.721627204,-87.624485177
7905059,HT134917,2011-01-25T14:30:00.000,047XX S WENTWORTH AVE,0486,BATTERY,DOMESTIC BATTERY SIMPLE,CTA BUS,False,True,0231,002,3,37,08B,1175752,1873657,2011,2016-02-04T06:33:39.000,41.808682769,-87.630898966
7906391,HT136142,2011-01-25T15:00:00.000,027XX W DIVISION ST,0560,ASSAULT,SIMPLE,CTA BUS,False,False,1423,014,26,24,08A,1157903,1907866,2011,2016-02-04T06:33:39.000,41.902937604,-87.695433299
7982963,HT214882,2011-03-21T14:30:00.000,107XX S COTTAGE GROVE AVE,0870,THEFT,POCKET-PICKING,CTA BUS,False,False,0513,005,9,50,06,1182249,1833965,2011,2018-02-10T15:50:01.000,41.699615536,-87.60829647


Import and clean holiday data

In [None]:
# @Lola: insert code for loading holiday data
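Until the holiday loading code is added, the sketch below illustrates one way the holiday table might later be used, assuming its purpose is to flag crimes that occurred on a holiday. The three dates are an illustrative subset of fixed-date US holidays in 2011, not the project's holiday data, and the name is_holiday is hypothetical.

```python
from datetime import date, datetime

# Illustrative subset only; the real set would be built from the loaded holiday table.
HOLIDAYS = {
    date(2011, 1, 1),    # New Year's Day
    date(2011, 7, 4),    # Independence Day
    date(2011, 12, 25),  # Christmas Day
}


def is_holiday(timestamp):
    """Return True if an API timestamp such as '2011-03-20T13:00:00.000' falls on a holiday."""
    # Only the date part (first ten characters) matters for the lookup.
    day = datetime.strptime(timestamp[:10], "%Y-%m-%d").date()
    return day in HOLIDAYS


print(is_holiday("2011-07-04T20:09:00.000"))
print(is_holiday("2011-03-20T13:00:00.000"))
```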

Connect to the weather data API

In [None]:
# @Lola: insert code for establishing API to weather data

Establish connection to GCP Cloud SQL

In [33]:
# import required package
import pyodbc

In [None]:
# define connection to the server
conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=server_name;'
                      'Database=db_name;'
                      'Trusted_Connection=yes;')

In [None]:
# define cursor
cursor = conn.cursor()

In [None]:
# example query
cursor.execute('SELECT * FROM db_name.Table')
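Appending the daily refresh (Step 14) needs an insert that does not duplicate crimes already in the table. The sketch below shows the pattern with Python's built-in sqlite3 module as a stand-in for Cloud SQL, so it runs without a server. The table layout and the name append_crimes are assumptions, and INSERT OR IGNORE is SQLite syntax; on SQL Server the same effect would need a different statement such as MERGE or an INSERT with a NOT EXISTS check.

```python
import sqlite3

# In-memory database as a stand-in for the Cloud SQL instance.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE crime (crimeID INTEGER PRIMARY KEY, primary_type TEXT, date TEXT)"
)


def append_crimes(connection, rows):
    """Insert rows, skipping crimeIDs that already exist; return the number of new rows."""
    before = connection.execute("SELECT COUNT(*) FROM crime").fetchone()[0]
    # Parameterized statement: values are bound by the driver, never formatted into the SQL.
    connection.executemany(
        "INSERT OR IGNORE INTO crime (crimeID, primary_type, date) VALUES (?, ?, ?)",
        rows,
    )
    connection.commit()
    after = connection.execute("SELECT COUNT(*) FROM crime").fetchone()[0]
    return after - before


initial_load = [
    (7979411, "DECEPTIVE PRACTICE", "2011-03-20T13:00:00.000"),
    (7981270, "THEFT", "2011-03-20T15:00:00.000"),
]
daily_refresh = [
    (7981270, "THEFT", "2011-03-20T15:00:00.000"),  # already loaded
    (7979660, "THEFT", "2011-03-20T20:09:00.000"),  # new
]

first_count = append_crimes(conn_demo, initial_load)
second_count = append_crimes(conn_demo, daily_refresh)
print(first_count, second_count)
```

pyodbc uses the same question-mark placeholders, so the parameter binding carries over even though the duplicate handling does not.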