## Project 5
- created 5-16-22 by GTP
- https://docs.google.com/document/d/1LIJTlCsx54zIG5sOX3heSj00YqdzhZ9c/edit
- *Description*: BSEED has entered a lot of data into free-text fields within Accela. It would be useful to find ways to scrape and organize this data so that it is usable. Unit data and Certificates of Occupancy are some of our biggest gaps. This might be a way to use administrative data to version and validate 2020 data.
- Technical Skill Level: Medium-High. Skilled at applying Regex to text strings using SQL and/or Python. Experience working with geospatial data, in ArcGIS or otherwise.
- Scope: There are 595 records in the Certificates of Occupancy dataset and 5,930 records in the Certificates of Compliance dataset. Depending on skill level, this could take 6-8 weeks.
- Inputs: Certificates of Compliance, Certificates of Occupancy, Rental Registration data
- General Process:
  - Use GIS or the Base Units Explorer tool to link Certificates of Occupancy to specific building ids, creating timestamps for when a building was ready for occupants.
  - Geocode the addresses in the Certificate of Compliance and Rental Registration datasets, and note any addresses that cannot be matched even through manual rematching and may be missing from the database altogether.


In [1]:
#import data libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import numbers
import decimal
#import data science packages
import scipy
import scipy.stats as stats

np.random.seed(222)
%matplotlib inline

In [2]:
#import geographic analysis libraries
import geopandas as gpd
from geopandas import GeoDataFrame
import shapely as shp
from shapely.geometry import Point
from shapely.geometry import shape
import os
import re
from fiona.crs import from_epsg
import pysal as ps
from googlemaps import Client as GoogleMaps
import googlemaps
import gmaps



In [3]:
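# Create the Google Maps client; the API key is read from the GOOGLE_GEOCODER_API environment variable
# Note: this rebinds the name `gmaps`, shadowing the `gmaps` module imported above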
gmaps = googlemaps.Client(key=os.environ['GOOGLE_GEOCODER_API'])

In [4]:
#set crs for entire analysis (WGS 84 lon/lat); the {'init': ...} dict form is deprecated
crs = 'EPSG:4326'

### data sources

Certificates of Occupancy: https://data.detroitmi.gov/datasets/certificates-of-occupancy-1/explore
- BSEED certifies that a new (or rehabbed / renovated) building has satisfied its requirements for habitation, so people can move in and it is ready for occupancy
- note: Alice says that this can be issued for individual floors
- _goal_: the deliverable should be a table of certificate of occupancy numbers (record_id) and building footprint ids; sometimes these are 1 to 1, and sometimes multiple occupancy numbers relate to a single id
- the census challenge is interested in having this as a record of exactly when a new building was technically 'habitable': the "birthdate" of the property in terms of occupancy

Certificates of Compliance: https://data.detroitmi.gov/datasets/certificates-of-compliance-1/explore
- this certifies properties as 'compliant' by the city
- Alice has access to a version of the compliance dataset that has a 'description' field, which should contain additional details..?
- I think this will mostly be about geocoding the records that don't have lat/lon
- _goal_: 33 records didn't geocode; the goal is to geocode these and, where that fails, add a description of why

Rental Registrations: https://data.detroitmi.gov/datasets/rental-statuses-1/explore
- (6-1-22): I'll address this next week with Alice on our next call

Base Units: https://base-units-detroitmi.hub.arcgis.com/datasets/detroitmi::units-1/about
- Jimmy McBroom put this together

https://cityofdetroit.github.io/base-unit-tools/explorer

## further notes:
- (6-28-22): 
- city geocoder: 

In [5]:
compliance_gdf = gpd.read_file('../data/Certificates_Of_Compliance/Certificates_Of_Compliance.shp')

In [6]:
len(compliance_gdf)

6071

In [7]:
len(compliance_gdf[compliance_gdf['geometry'].isna()])

33

In [8]:
# .copy() gives an independent frame, so the column assignments below do not trigger SettingWithCopyWarning
compliance_gdf_nogeocode = compliance_gdf[compliance_gdf['geometry'].isna()].copy()

In [9]:
# replace missing direction / street-type values with empty strings
for col in ['street_dir', 'street_typ']:
    compliance_gdf_nogeocode[col] = compliance_gdf_nogeocode[col].astype(str).str.replace('None', '', regex=False)


In [10]:
compliance_gdf_nogeocode['address_for_geocode'] = compliance_gdf_nogeocode['street_num'].astype(str) + ' ' +\
                                                  compliance_gdf_nogeocode['street_dir'].astype(str) + ' ' +\
                                                  compliance_gdf_nogeocode['street_nam'].astype(str) + ' ' +\
                                                  compliance_gdf_nogeocode['street_typ'].astype(str) + ' ' +\
                                                  'DETROIT MI'


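Because `street_dir` and `street_typ` are often empty, plain concatenation with `' '` leaves doubled spaces in the address string. A minimal sketch of an alternative, assuming a hypothetical helper named `build_address` (not part of the original notebook) that joins only the non-empty parts:

```python
def build_address(street_num, street_dir, street_nam, street_typ,
                  suffix="DETROIT MI"):
    """Join the non-empty address parts with single spaces."""
    parts = [street_num, street_dir, street_nam, street_typ, suffix]
    cleaned = []
    for part in parts:
        # Skip None and NaN (NaN is the only value not equal to itself).
        if part is None or part != part:
            continue
        text = str(part).strip()
        # Skip empty strings and the literal 'None' left by str(None).
        if text and text != "None":
            cleaned.append(text)
    return " ".join(cleaned)


print(build_address(16157, None, "FIELDING", None))
print(build_address(1501, "E.", "LARNED", ""))
```

It could be applied row by row with `DataFrame.apply(..., axis=1)`; whether the geocoder is actually sensitive to the extra spaces has not been checked here.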

In [11]:
def google_geocode(address_to_geocode):
    """Return (lat, lon) for an address, or (None, None) if Google returns no match."""
    geocode_result = gmaps.geocode(address_to_geocode)
    if not geocode_result:
        return None, None
    location = geocode_result[0]['geometry']['location']
    return location['lat'], location['lng']

In [12]:
def return_lat(lat_lon):
    return lat_lon[0]

def return_lon(lat_lon):
    return lat_lon[1]

In [13]:
compliance_gdf_nogeocode['lat_lon'] = compliance_gdf_nogeocode['address_for_geocode'].apply(lambda x: google_geocode(x))



In [14]:
compliance_gdf_nogeocode['new_lat'] = compliance_gdf_nogeocode['lat_lon'].apply(return_lat)
compliance_gdf_nogeocode['new_lon'] = compliance_gdf_nogeocode['lat_lon'].apply(return_lon)



In [15]:
compliance_gdf_nogeocode.sample(5)

| | record_id | street_num | street_dir | street_nam | street_typ | task | status | record_sta | parcel_id | lon | lat | ObjectId | geometry | address_for_geocode | lat_lon | new_lat | new_lon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3746 | PMB2017-04314 | 16157 | | FIELDING | | Issue CofC | Issued | 2019-05-24 | | | | 3747 | | 16157 FIELDING DETROIT MI | (42.4086623, -83.24327699999999) | 42.408662 | -83.243277 |
| 1536 | PMB2005-10386 | 5721 | | ST ANTOINE | | Issue CofC | Issued | 2021-04-28 | | | | 1537 | | 5721 ST ANTOINE DETROIT MI | (42.3652503, -83.0608451) | 42.36525 | -83.060845 |
| 1143 | PMB2005-19369 | 1501 | E. | LARNED | | Issue CofC | Issued | 2021-07-23 | | | | 1144 | | 1501 E. LARNED DETROIT MI | (42.3364205, -83.0300334) | 42.336421 | -83.030033 |
| 260 | PMB2004-14249 | 1387 | | LARNED | | Issue CofC | Issued | 2021-05-19 | | | | 261 | | 1387 LARNED DETROIT MI | (42.3355098, -83.03397679999999) | 42.33551 | -83.033977 |
| 5591 | PMB2020-01460 | 18450 | | CHICAGO | | Issue CofC | Issued | 2022-05-10 | | | | 5592 | | 18450 CHICAGO DETROIT MI | (42.3651518, -83.2212089) | 42.365152 | -83.221209 |

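A geocoder will usually return some coordinate even for an ambiguous address, so it may be worth checking that each result at least falls inside the city. The sketch below is one way to do that. The bounding box values are rough assumptions rather than an official boundary, and `in_detroit_bbox` is a hypothetical helper name.

```python
# Approximate bounding box for Detroit (assumed values, not taken from the source data).
DETROIT_BBOX = {"min_lat": 42.25, "max_lat": 42.46,
                "min_lon": -83.30, "max_lon": -82.90}


def in_detroit_bbox(lat, lon, bbox=DETROIT_BBOX):
    """Return True when the point lies inside the (approximate) city bounding box."""
    if lat is None or lon is None:
        return False
    return (bbox["min_lat"] <= lat <= bbox["max_lat"]
            and bbox["min_lon"] <= lon <= bbox["max_lon"])


# Two of the geocoded records from the sample above.
samples = {
    "PMB2017-04314": (42.4086623, -83.24327699999999),
    "PMB2005-10386": (42.3652503, -83.0608451),
}
for record_id, (lat, lon) in samples.items():
    print(record_id, in_detroit_bbox(lat, lon))
```

A point that passes this check can still be the wrong parcel, so it complements the manual review rather than replacing it.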

### Compliance df goal:
- export the list of non-geocoded compliance records to Google Sheets
- try to geocode these manually
- if a record can't be geocoded, add a description column explaining why
- google sheet: https://docs.google.com/spreadsheets/d/1agyVFNR8gtoQabLZHSyjIecW8lv3LYlD_7BvXUEZRSM/edit#gid=0

In [16]:
compliance_gdf_nogeocode.to_csv('../data/exports/compliance_gdf_nogeocode.csv')

## Occupancy DF
- geocode missing lat/lons with google
- notes: the `descriptio` column has free text that we could leverage to fill in empty geometry cells

In [17]:
occupancy_gdf = gpd.read_file('../data/Certificates_Of_Occupancy/Certificates_Of_Occupancy.shp')

In [18]:
len(occupancy_gdf[occupancy_gdf['geometry'].isna()])/len(occupancy_gdf)

0.16166666666666665

In [52]:
len(occupancy_gdf[occupancy_gdf['geometry'].isna()])

97

In [20]:
# .copy() gives an independent frame, so the column assignments below do not trigger SettingWithCopyWarning
occupancy_gdf_nogeocode = occupancy_gdf[occupancy_gdf['geometry'].isna()].copy()

In [21]:
# replace missing street names / numbers with empty strings
for col in ['street_nam', 'street_num']:
    occupancy_gdf_nogeocode[col] = occupancy_gdf_nogeocode[col].astype(str).str.replace('None', '', regex=False)


In [22]:
occupancy_gdf_nogeocode['address_for_geocode'] = occupancy_gdf_nogeocode['street_num'].astype(str) + ' ' +\
                                                  occupancy_gdf_nogeocode['street_nam'].astype(str) + ' ' +\
                                                  'DETROIT MI'



In [23]:
occupancy_gdf_nogeocode['lat_lon'] = occupancy_gdf_nogeocode['address_for_geocode'].apply(lambda x: google_geocode(x))



In [24]:
occupancy_gdf_nogeocode['new_lat'] = occupancy_gdf_nogeocode['lat_lon'].apply(return_lat)
occupancy_gdf_nogeocode['new_lon'] = occupancy_gdf_nogeocode['lat_lon'].apply(return_lon)



In [25]:
### export geocoded list
occupancy_gdf_nogeocode.to_csv('../data/exports/occupancy_gdf_nogeocode.csv')

## Occupancy DF
### Match occupancy number and building footprint ids

"AKA 2327 Trumbull Ave" is an example of the free text - as in, AKA "address" is a common pattern, but this already exists / has been extracted into the street_num / street_nam columns - maybe those just have to be geocoded

We're looking for the relationship between housing units and certificates of occupancy. That relationship is often mediated by a building id (use https://cityofdetroit.github.io/base-unit-tools/explorer?id=3263&type=buildings&streetview=true); there's a crosswalk that Alice will send over.

The deliverable should be a table of certificate of occupancy numbers (record_id) and building footprint ids; sometimes these are 1 to 1, and sometimes multiple occupancy numbers relate to a single id.

The census challenge is interested in having this as a record of exactly when a new building was technically 'habitable': the "birthdate" of the property in terms of occupancy.

Most important is matching building id to record_id, more so than geocoding.

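Until the crosswalk arrives, one possible stopgap is to attach each certificate to the nearest unit point and take that unit's `bldg_id`. The sketch below is only an illustration under assumptions: proximity is treated as a proxy for "same building", the 50 m cutoff is arbitrary, the function names are hypothetical, and `EXAMPLE-0001` is a made-up record placed next to a unit point from the Units layer.

```python
import math


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_building(lon, lat, units, max_dist_m=50.0):
    """Return the bldg_id of the closest unit within max_dist_m, else None."""
    best_id, best_dist = None, max_dist_m
    for unit in units:
        dist = haversine_m(lon, lat, unit["lon"], unit["lat"])
        if dist <= best_dist:
            best_id, best_dist = unit["bldg_id"], dist
    return best_id


# Unit points with bldg_id values as they appear in the Units layer.
units = [
    {"bldg_id": 2551, "lon": -83.05802, "lat": 42.33837},
    {"bldg_id": 96460, "lon": -83.08603, "lat": 42.37703},
]

# record_id -> (lon, lat); EXAMPLE-0001 is hypothetical.
records = {
    "BLD2021-02415": (-83.07424, 42.33956),
    "EXAMPLE-0001": (-83.05800, 42.33838),
}

crosswalk = {rid: nearest_building(lon, lat, units) for rid, (lon, lat) in records.items()}
print(crosswalk)
```

Records that come back as `None` would still need a manual lookup in the Base Units Explorer, and several record_ids mapping to one `bldg_id` is expected rather than an error.
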
### Note (7-8-22):
- Questions for Alice:
  1. What is the relationship between unit_id and occupancy_gdf? Addr_id is null for everything
  2. ... the units dataset doesn't seem useful; is there a way to do this by hand?

In [26]:
occupancy_gdf['descriptio'][occupancy_gdf['record_id']=='BLD2021-02415'].values[0]

'(AKA 3321 Cochrane) Construct (11) unit Rowhouse building and Accessory Garages per BZA (41-19) & (SLU2019-00020) per Plans.\r\n(Permit reviewed under BLD2019-03775)'
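
Descriptions like this one often state a unit count ("(11) unit"), which is one of the gaps named in the project description. A rough sketch of pulling that number out with a regular expression follows. The pattern and the helper name `extract_unit_count` are assumptions. It only catches digits written directly before the word "unit", so spelled-out counts are missed, and "UNIT 75" style unit labels are skipped on purpose.

```python
import re

# Digits (optionally in parentheses) immediately followed by "unit" or "units".
UNIT_COUNT_PATTERN = re.compile(r"\(?\b(\d+)\)?\s*units?\b", re.IGNORECASE)


def extract_unit_count(description):
    """Return the first unit count found in the description, or None."""
    match = UNIT_COUNT_PATTERN.search(description or "")
    return int(match.group(1)) if match else None


examples = [
    "(AKA 3321 Cochrane) Construct (11) unit Rowhouse building and Accessory Garages",
    "Erect  4 story , 8 unit townhomes as per eplans w/ a certificate of appropriateness",
    "AKA 2327 Trumbull Ave. Unit 20. Per BZA #4-18, Construct 34' L X 21' W X 37' H Townhouse per plans.",
]
for text in examples:
    print(extract_unit_count(text), "<-", text[:40])
```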

In [66]:
occupancy_gdf[occupancy_gdf['record_id']=='BLD2021-02415']

| | record_id | street_num | street_dir | street_nam | street_typ | descriptio | status | date_statu | parcel_id | lon | lat | ObjectId | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 467 | BLD2021-02415 | 3303 | | COCHRANE | | (AKA 3321 Cochrane) Construct (11) unit Rowhou... | CofO Issued | 2021-10-11 | 8006537.001 | -83.074239 | 42.339555 | 468 | POINT (-83.07424 42.33956) |


In [58]:
base_units_shp = gpd.read_file('../data/Units/Units.shp')

In [68]:
base_units_shp.sample(5)

| | OBJECTID | unit_id | bldg_id | parcel_id | addr_id | unit_statu | use_ | geometry |
|---|---|---|---|---|---|---|---|---|
| 118141 | 118142 | 119652 | 181562.0 | | 119652 | | | POINT (-83.27184 42.42881) |
| 52929 | 52930 | 214732 | 2551.0 | | 214732 | | | POINT (-83.05802 42.33837) |
| 16970 | 16971 | 341946 | 96460.0 | | 341946 | | | POINT (-83.08603 42.37703) |
| 74679 | 74680 | 275792 | 5114.0 | | 275792 | | | POINT (-83.11264 42.41915) |
| 68337 | 68338 | 28317 | 120550.0 | | 28317 | | | POINT (-83.12399 42.41518) |


In [60]:
base_units_shp[base_units_shp['bldg_id']==3775]

| | OBJECTID | unit_id | bldg_id | parcel_id | addr_id | unit_statu | use_ | geometry |
|---|---|---|---|---|---|---|---|---|


In [65]:
base_units_shp['use_'].value_counts()

Series([], Name: use_, dtype: int64)

In [27]:
occupancy_gdf[occupancy_gdf['record_id']=='BLD2021-02415']

| | record_id | street_num | street_dir | street_nam | street_typ | descriptio | status | date_statu | parcel_id | lon | lat | ObjectId | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 467 | BLD2021-02415 | 3303 | | COCHRANE | | (AKA 3321 Cochrane) Construct (11) unit Rowhou... | CofO Issued | 2021-10-11 | 8006537.001 | -83.074239 | 42.339555 | 468 | POINT (-83.07424 42.33956) |


In [28]:
len(occupancy_gdf)

600

In [29]:
occupancy_gdf_empty = occupancy_gdf[occupancy_gdf['geometry'].isna()]

In [53]:
len(occupancy_gdf_empty)

97

In [54]:
occupancy_gdf_empty.sample(5)

| | record_id | street_num | street_dir | street_nam | street_typ | descriptio | status | date_statu | parcel_id | lon | lat | ObjectId | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 341 | BLD2018-05358 | 2860 | | JOHN R | | (AKA UNIT 75) ERECT A 3 STORY, 11 UNIT CARRIAG... | CofO Issued | 2021-09-20 | | | | 342 | |
| 558 | BLD2018-05354 | 2852 | | JOHN R | | (AKA UNIT 79) ERECT A 3 STORY, 11 UNIT CARRIA... | CofO Issued | 2021-10-07 | | | | 559 | |
| 547 | BLD2019-04970 | 2301 | | Trumbull | | AKA 2301 Trumbull Ave. Unit 14. Per BZA #4-18,... | CofO Issued | 2020-09-24 | | | | 548 | |
| 64 | BLD2018-05776 | 2812 | | JOHN R | | (AKA UNIT 51) ERECT A 3 UNIT, 4 STORY TOWNHOUS... | CofO Issued | 2021-10-29 | | | | 65 | |
| 539 | BLD2020-05103 | 3500 | | ORLEANS | | Revision to BLD2018-10767 to reflect Electrica... | CofO Issued | 2021-07-29 | | | | 540 | |

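Some descriptions point at other permit records (for example "Revision to BLD2018-10767 ..." in the sample above, or "(Permit reviewed under BLD2019-03775)" earlier). Those references might help tie a record with no geometry to a related record that does have one. Here is a small sketch, assuming record numbers always look like three capital letters, a four-digit year, a hyphen, and five digits. Both that format and the helper name `referenced_records` are assumptions.

```python
import re

# Assumed shape of a record number, e.g. BLD2019-03775 or SLU2019-00020.
RECORD_ID_PATTERN = re.compile(r"\b[A-Z]{3}\d{4}-\d{5}\b")


def referenced_records(description, own_record_id=None):
    """List record numbers mentioned in a description, excluding the record's own id."""
    found = RECORD_ID_PATTERN.findall(description or "")
    return [rid for rid in found if rid != own_record_id]


text = ("(AKA 3321 Cochrane) Construct (11) unit Rowhouse building and Accessory "
        "Garages per BZA (41-19) & (SLU2019-00020) per Plans.\r\n"
        "(Permit reviewed under BLD2019-03775)")
print(referenced_records(text, own_record_id="BLD2021-02415"))
```

Note that these are permit record numbers, not building ids, so they would be looked up in `record_id` rather than in `bldg_id`.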

In [30]:
occupancy_gdf_empty[occupancy_gdf_empty['record_id']=='BLD2019-00680']

| | record_id | street_num | street_dir | street_nam | street_typ | descriptio | status | date_statu | parcel_id | lon | lat | ObjectId | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 84 | BLD2019-00680 | 2809 | | Brush | | Erect 4 story , 8 unit townhomes as per eplan... | CofO Issued | 2021-04-05 | | | | 85 | |


In [31]:
occupancy_gdf_empty['descriptio'][occupancy_gdf_empty['record_id']=='BLD2019-00680'].values

array(['Erect  4 story , 8 unit townhomes as per eplans w/ a certificate of appropriateness'],
      dtype=object)

In [32]:
occupancy_gdf_empty['descriptio'][occupancy_gdf_empty['record_id']=='BLD2020-01564'].values

array(['Modify previous Change of Use Permit to Provisioning Center by adding grow facility; changes to the restroom facilities.'],
      dtype=object)

In [33]:
occupancy_gdf_empty['descriptio'][occupancy_gdf_empty['record_id']=='BLD2019-00033'].values

array(['INTERIOR ALTERATIONS TO ESTABLISH USE FOR TENANT SPACE AS COSMETIC RETAIL\nPERMANENT CERTIFICATE OF OCCUPANCY ISSUED (03-20-2019)'],
      dtype=object)

In [34]:
occupancy_gdf_empty['descriptio'][occupancy_gdf_empty['record_id']=='BLD2020-04413'].values[0]

'Interior alterations per plans.(1500 E. Woodbridge Suite address per plans, Separate Tenant Build-Out Permit required to establish Occupancy). Subject to all Applicable Federal, State, and Local Executive Orders.\r\n(AKA 1583 Franklin)'

- note: this building is at the corner of E. Woodbridge and Franklin (hence the AKA 1583 Franklin)

In [35]:
occupancy_gdf_empty[occupancy_gdf_empty['record_id']=='BLD2020-04413']

| | record_id | street_num | street_dir | street_nam | street_typ | descriptio | status | date_statu | parcel_id | lon | lat | ObjectId | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 393 | BLD2020-04413 | 1522 | | WOODBRIDGE | | Interior alterations per plans.(1500 E. Woodbr... | CofO Issued | 2021-04-12 | | | | 394 | |


In [36]:
occupancy_gdf_empty['descriptio'][occupancy_gdf_empty['record_id']=='BLD2019-04976'].values

array(["AKA 2327 Trumbull Ave. Unit 20. Per BZA #4-18, Construct 34' L X 21' W X 37' H Townhouse per plans."],
      dtype=object)

In [37]:
occupancy_gdf_empty['descriptio'][occupancy_gdf_empty['record_id']=='BLD2017-06240'].values

array(['AKA 8032, 8040, 8046, 8056 MEMORIAL. ERECTION OF ONE 4 UNIT ONE STORY WOOD FRAMED TOWNHOUSE AS PER PLANS. SEE BLD2017-00831 FOR MASTER SET OF PLANS.'],
      dtype=object)
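
The "AKA" pattern recurs across these descriptions, sometimes with several house numbers sharing one street name. The sketch below extracts those alternate addresses so they could be geocoded or looked up separately. The pattern and the helper name `extract_aka_addresses` are assumptions, tuned only to the examples seen in this notebook. Street names containing digits or periods (such as "E. Woodbridge") would not parse, and "AKA UNIT 75" style labels are ignored.

```python
import re

# "AKA", one or more comma-separated house numbers, then a street name
# that runs until a period, comma, closing parenthesis, or end of text.
AKA_PATTERN = re.compile(
    r"\bAKA\s+(\d+(?:\s*,\s*\d+)*)\s+([A-Za-z][A-Za-z ]*?)\s*(?=[.),]|$)",
    re.IGNORECASE,
)


def extract_aka_addresses(description):
    """Return a list of 'number street' strings found after AKA markers."""
    addresses = []
    for match in AKA_PATTERN.finditer(description or ""):
        numbers = re.findall(r"\d+", match.group(1))
        street = match.group(2).strip()
        addresses.extend(f"{number} {street}" for number in numbers)
    return addresses


examples = [
    "(AKA 3321 Cochrane) Construct (11) unit Rowhouse building",
    "AKA 2327 Trumbull Ave. Unit 20. Per BZA #4-18, Construct Townhouse per plans.",
    "AKA 8032, 8040, 8046, 8056 MEMORIAL. ERECTION OF ONE 4 UNIT ONE STORY WOOD FRAMED TOWNHOUSE AS PER PLANS.",
]
for text in examples:
    print(extract_aka_addresses(text))
```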