# Exploring Building Permit Data
EDA for Toronto Building Permits using the CKAN API

## Create CKAN API request methods
Pass SQL queries and return each response as an appropriate object: a dataframe, a scalar (float, str, or int), or an error carrying the response code.

In [2]:
from helpers.ckan import Ckan
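
The `helpers.ckan` module is not shown in this notebook. As a rough sketch of what the request and response handling could look like, assuming the portal exposes the standard CKAN `datastore_search_sql` action: the host, the resource id, and the function names `build_sql_url` and `parse_records` are placeholders for illustration, not the actual `Ckan` class.

```python
import json
from urllib.parse import urlencode

# Placeholder host: substitute the CKAN portal you are querying (assumption).
CKAN_BASE_URL = "https://ckan.example.org"


def build_sql_url(sql: str, base_url: str = CKAN_BASE_URL) -> str:
    """Build a GET URL for CKAN's datastore_search_sql action."""
    return f"{base_url}/api/3/action/datastore_search_sql?{urlencode({'sql': sql})}"


def parse_records(payload: str) -> list:
    """Return the result records, or raise if CKAN reports a failure."""
    body = json.loads(payload)
    if not body.get("success"):
        raise RuntimeError(f"CKAN request failed: {body.get('error')}")
    return body["result"]["records"]


# Hypothetical resource id, used only to show the URL shape.
url = build_sql_url('SELECT COUNT(*) AS n FROM "my-resource-id"')

# A hand-written response in the shape CKAN returns, so the parsing step
# can be exercised without a network call.
fake_payload = json.dumps({"success": True, "result": {"records": [{"n": "3"}]}})
records = parse_records(fake_payload)
```

The records could then be passed to `pd.DataFrame(records)` or reduced to a scalar, depending on the query.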

## Export to CSV
Use the `Ckan` class to export data from a CKAN package to a CSV file. Run it for every package relevant to this project.

In [39]:
import os
import re
import time

def make_csv_source_path(package_id: str, dirpath: str) -> str:
    return os.path.join(dirpath, re.sub("\\W+", "-", package_id.lower()) + ".csv")

def export_csv_from_ckan(package_id: str, dirpath = "../data/source/csv/") -> None:
    ckan = Ckan()
    ckan.set_package_id(package_id)
    ckan.get_pkg_info()
    ckan.find_resource_endpoints()
    ckan.export_csv(make_csv_source_path(package_id=package_id, dirpath=dirpath))
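
To make the file-naming rule concrete, here is a small sketch of the substitution used in `make_csv_source_path`. The helper name `slugify_package_id` and the mixed-case input are made up for illustration.

```python
import re


def slugify_package_id(package_id: str) -> str:
    # Same substitution as make_csv_source_path: lowercase the id, then
    # collapse every run of non-word characters into a single hyphen.
    return re.sub(r"\W+", "-", package_id.lower())


slug = slugify_package_id("Building Permits: Active Permits")
unchanged = slugify_package_id("building-permits-active-permits")
```

Package ids that are already lowercase and hyphenated pass through unchanged, so the CSV filename matches the package id.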

In [40]:
pkg_ids = [
    "building-permits-active-permits",
    "building-permits-cleared-permits",
    "address-points-municipal-toronto-one-address-repository",
    "neighbourhoods",
]

for pkg_id in pkg_ids:
    export_csv_from_ckan(package_id=pkg_id)
    time.sleep(6)

## Pandas exploration
Check the data for its features (primary keys, data types, attributes)

In [2]:
import pandas as pd
df = pd.read_csv("../data/source/csv/building-permits-active-permits.csv", dtype=str)

In [18]:
df.drop_duplicates().shape

(262178, 32)

In [48]:
df.columns.to_list()

['PERMIT_NUM',
 'REVISION_NUM',
 'PERMIT_TYPE',
 'STRUCTURE_TYPE',
 'WORK',
 'STREET_NUM',
 'STREET_NAME',
 'STREET_TYPE',
 'STREET_DIRECTION',
 'POSTAL',
 'GEO_ID',
 'WARD_GRID',
 'APPLICATION_DATE',
 'ISSUED_DATE',
 'COMPLETED_DATE',
 'STATUS',
 'DESCRIPTION',
 'CURRENT_USE',
 'PROPOSED_USE',
 'DWELLING_UNITS_CREATED',
 'DWELLING_UNITS_LOST',
 'EST_CONST_COST',
 'ASSEMBLY',
 'INSTITUTIONAL',
 'RESIDENTIAL',
 'BUSINESS_AND_PERSONAL_SERVICES',
 'MERCANTILE',
 'INDUSTRIAL',
 'INTERIOR_ALTERATIONS',
 'DEMOLITION',
 'BUILDER_NAME',
 'permit_id',
 'effective_date']

Find the minimum columns required to create a primary key (preserve all rows from source).

In [26]:
df[['PERMIT_NUM', 'REVISION_NUM', 'PERMIT_TYPE', 'BUILDER_NAME']].drop_duplicates().shape

(262178, 4)

The columns we can use to create a hash for a primary key are:
- `PERMIT_NUM`
- `REVISION_NUM`
- `PERMIT_TYPE`
- `BUILDER_NAME`
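
The search for a minimal key can also be automated. The sketch below is a brute-force version of the row-count check used above, run on a toy frame; `find_minimal_keys` is a hypothetical helper, and trying every combination gets expensive quickly as the number of candidate columns grows.

```python
from itertools import combinations

import pandas as pd


def find_minimal_keys(frame: pd.DataFrame, candidates: list) -> list:
    """Return the smallest column subsets that preserve every distinct row."""
    target = len(frame.drop_duplicates())
    for size in range(1, len(candidates) + 1):
        keys = [
            list(cols)
            for cols in combinations(candidates, size)
            if len(frame[list(cols)].drop_duplicates()) == target
        ]
        if keys:
            return keys
    return []


# Toy data: no single column or pair of columns is unique per row.
toy = pd.DataFrame({
    "PERMIT_NUM": ["A", "A", "B", "B"],
    "REVISION_NUM": ["00", "01", "00", "00"],
    "PERMIT_TYPE": ["Plumbing", "Plumbing", "Plumbing", "Mechanical"],
})
keys = find_minimal_keys(toy, list(toy.columns))
```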

There are some really efficient hashing approaches in Python written up by [Maurice Borgmeier](https://mauricebrg.com/2022/12/even-more-efficient-hashing-of-columns-in-a-pandas-dataframe.html). We can implement the V3 version, which relies on pandas solely for casting the dataframe to strings before hashing.

Hashing algos:
- **xxHash** <-- fastest, use this
- crc32 (fast, old, too many collisions)
- md5 (fast, use as an alternative)
- sha-1 (only slightly less fast)
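
Whichever algorithm is chosen, the column values need a separator before they are hashed. The sketch below shows why, using `hashlib.md5` from the standard library as a stand-in for xxHash; the row values are invented for illustration.

```python
import hashlib

SEPARATOR = "|"


def hash_row(values, separator=SEPARATOR) -> str:
    """Join the row's string values and return an MD5 hex digest."""
    return hashlib.md5(separator.join(values).encode("utf-8")).hexdigest()


# Two different rows whose values concatenate to the same string.
row_a = ("12 345", "00", "Plumbing")
row_b = ("12 34", "500", "Plumbing")

with_separator = (hash_row(row_a), hash_row(row_b))
without_separator = (hash_row(row_a, ""), hash_row(row_b, ""))
```

With the separator the two rows hash differently; without it they collide. A separator only helps if it cannot appear inside the values themselves, which is worth checking for free-text columns such as `BUILDER_NAME`.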

In [4]:
# Code adapted from Maurice Borgmeier's efficient pandas hashing blog post:
# https://mauricebrg.com/2022/12/even-more-efficient-hashing-of-columns-in-a-pandas-dataframe.html


import abc
import typing
import xxhash
import pandas as pd

class AbstractHasher(abc.ABC):
    """Implement an abstract base class for the Abstract Hasher
    This class prepares code for a concrete Hasher class to carry out dataframe hashing.

    Attributes
    ----------
    dataframe : pd.DataFrame
        a copy of the pandas dataframe with columns to be hashed
    target_column_name : str
        the name of the column containing the hashes
    columns_to_hash : list[str]
        the list of columns to be used to create the hash
    num_records : int
        the number of rows in the input dataframe

    Methods
    -------
    hash :
        Abstract method; concrete subclasses implement the hashing here
    """

    dataframe: pd.DataFrame
    target_column_name: str
    columns_to_hash: typing.List[str]
    num_records: int

    def __init__(
            self,
            dataframe: pd.DataFrame,
            columns_to_hash: typing.List[str],
            target_column_name: str) -> None:
        """
        Initialize the AbstractHasher class:

        Parameters
        ----------
        dataframe : pd.DataFrame
            the pandas dataframe with columns to be hashed
        columns_to_hash : list[str]
            the list of columns to be used to create the hash
        target_column_name : str
            the name of the column containing the hashes
        """

        self.dataframe = dataframe.copy()
        self.target_column_name = target_column_name
        self.columns_to_hash = columns_to_hash
        self.num_records = len(dataframe)

    @abc.abstractmethod
    def hash(self) -> pd.DataFrame:
        """Hash the columns"""

In [5]:
HASH_FIELD_SEPARATOR = "|"
HASH_FUNCTION = xxhash.xxh32


class PyHasher(AbstractHasher):
    """
    Hasher uses itertuples instead of converting values to a list
    """

    def hash(self) -> pd.DataFrame:
        """
        Apply the hash_string_iterable function to the specified columns of the
        input dataframe. The columns are cast to str, then iterated as one tuple
        per row. This creates a series with the same number of items as the
        input dataframe, which is inserted under the specified hash column name.
        """

        def hash_string_iterable(string_iterable: typing.Iterable[str]) -> str:
            # Join the row's values with the separator, then hash the UTF-8 bytes
            input_str = HASH_FIELD_SEPARATOR.join(string_iterable)
            return HASH_FUNCTION(input_str.encode("utf-8")).hexdigest()

        hashed_series = pd.Series(
            map(
                hash_string_iterable,
                self.dataframe[self.columns_to_hash]
                .astype(str)
                .itertuples(index=False, name=None),
            ),
            index=self.dataframe.index,
        )

        self.dataframe[self.target_column_name] = hashed_series

        return self.dataframe

In [6]:
PK_COLS = ['PERMIT_NUM', 'REVISION_NUM', 'PERMIT_TYPE', 'BUILDER_NAME']

df = PyHasher(dataframe=df, columns_to_hash=PK_COLS, target_column_name="permit_id").hash()

## Explore address data
Explore joining address data to obtain neighbourhood data and geolocation points. There are two possible ways to show 'hotspots' of renovations: 
1. By neighbourhood, or geofenced by neighbourhood, and
2. By permit value. 

(1) and (2) can be combined to provide a sense of money being spent in high-activity neighbourhoods.

In [7]:
df_loc = pd.read_csv("../data/source/csv/address-points-municipal-toronto-one-address-repository.csv", dtype=str)

In [31]:
df_loc.columns

Index(['_id', 'ADDRESS_POINT_ID', 'ADDRESS_ID', 'ADDRESS_STRING_ID',
       'LINEAR_NAME_ID', 'CENTRELINE_ID', 'MAINT_STAGE', 'ADDRESS_NUMBER',
       'LINEAR_NAME_FULL', 'LO_NUM', 'LO_NUM_SUF', 'HI_NUM', 'HI_NUM_SUF',
       'LINEAR_NAME', 'LINEAR_NAME_TYPE', 'LINEAR_NAME_DIR',
       'LINEAR_NAME_DESC', 'CENTRELINE_SIDE', 'CENTRELINE_MEASURE',
       'CENTRELINE_OFFSET', 'GENERAL_USE_CODE', 'GENERAL_USE', 'CLASS_FAMILY',
       'CLASS_FAMILY_DESC', 'ADDRESS_CLASS', 'ADDRESS_CLASS_DESC',
       'ADDRESS_POINT_ID_LINK', 'ADDRESS_ID_LINK', 'PLACE_NAME',
       'PLACE_NAME_ALL', 'ADDRESS_STATUS', 'OBJECTID', 'MUNICIPALITY',
       'MUNICIPALITY_NAME', 'WARD', 'WARD_NAME', 'ADDRESS_FULL', 'geometry'],
      dtype='object')

In [34]:
df_loc["WARD_NAME"].nunique()

25

**NOTE**: The address points data has no direct link to neighbourhoods. We therefore need to geofence the address points to annotate each address with the one of the 158 neighbourhoods it belongs to. This can be rerun once a month to pick up any new addresses or plots.

In [20]:
df[['STREET_NUM', 'STREET_NAME', 'STREET_TYPE', 'GEO_ID']].head()

Unnamed: 0,STREET_NUM,STREET_NAME,STREET_TYPE,GEO_ID
0,261,FINCH,AVE,11129598
1,1229,GERRARD,ST,10575758
2,1522,QUEEN,ST,12544613
3,85,SPENCER,AVE,8190271
4,168,SHANLY,ST,8456828


In [24]:
df['GEO_ID'].head().to_list()

['11129598', '10575758', '12544613', '8190271', '8456828']

In [25]:
df_loc[df_loc['ADDRESS_POINT_ID'].isin(df['GEO_ID'].head().to_list())]

Unnamed: 0,_id,ADDRESS_POINT_ID,ADDRESS_ID,ADDRESS_STRING_ID,LINEAR_NAME_ID,CENTRELINE_ID,MAINT_STAGE,ADDRESS_NUMBER,LINEAR_NAME_FULL,LO_NUM,...,PLACE_NAME,PLACE_NAME_ALL,ADDRESS_STATUS,OBJECTID,MUNICIPALITY,MUNICIPALITY_NAME,WARD,WARD_NAME,ADDRESS_FULL,geometry
40735,40736,11129598,414147,428374,1610,11129595,REGULAR,261,Finch Ave W,261,...,,,,1695069,NY,North York,18,Willowdale,261 Finch Ave W,"{""type"": ""Point"", ""coordinates"": [-79.43846606..."
385934,385935,8190271,157992,170552,4431,30102027,REGULAR,85,Spencer Ave,85,...,,,,4468854,TO,former Toronto,4,Parkdale-High Park,85 Spencer Ave,"{""type"": ""Point"", ""coordinates"": [-79.43014770..."
427114,427115,8456828,143584,146828,4386,8456810,REGULAR,168,Shanly St,168,...,,,,4695732,TO,former Toronto,9,Davenport,168 Shanly St,"{""type"": ""Point"", ""coordinates"": [-79.43623307..."
464752,464753,12544613,161750,168313,4238,12544598,REGULAR,1522,Queen St W,1522,...,,,,4856770,TO,former Toronto,4,Parkdale-High Park,1522 Queen St W,"{""type"": ""Point"", ""coordinates"": [-79.43881790..."
507435,507436,10575758,235389,250496,3495,10575733,REGULAR,1229,Gerrard St E,1229,...,,,,5018698,TO,former Toronto,14,Toronto-Danforth,1229 Gerrard St E,"{""type"": ""Point"", ""coordinates"": [-79.33006049..."


We found that **`GEO_ID` from the *building permits* table = `ADDRESS_POINT_ID` from the *address points* table**. This simplifies enriching the building permits database with neighbourhood and geolocation points further along the analysis pipeline. Next steps are to continue EDA with the building permits data to draft the main transformations required to get from the data to an analytics dashboard, then modularize those transformations to simplify the raw data.
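
The five-row spot check can be extended to the whole table. As a sketch on toy data, a left merge with `indicator=True` reports how many permits find a matching address point; the id `"999"` is a made-up value with no match.

```python
import pandas as pd

permits = pd.DataFrame({"GEO_ID": ["11129598", "10575758", "999"]})
addresses = pd.DataFrame({
    "ADDRESS_POINT_ID": ["11129598", "10575758"],
    "WARD_NAME": ["Willowdale", "Toronto-Danforth"],
})

checked = permits.merge(
    addresses,
    left_on="GEO_ID",
    right_on="ADDRESS_POINT_ID",
    how="left",
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Share of permits with a matching address point, and the ids that missed.
match_rate = (checked["_merge"] == "both").mean()
unmatched = checked.loc[checked["_merge"] == "left_only", "GEO_ID"].tolist()
```

Running the same check on `df` and `df_loc` would show whether any permits carry a `GEO_ID` that is missing from the address repository.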

## EDA: Join all building permit data
We want to combine **active** and **cleared** permit data to create a full dataset of permits with their dates, so that we can window all permits within a given date filter. These joins and appends might later be reduced to a minimal set of transformations run on filtered queries made to the CKAN API.

In [8]:
df_cleared = pd.read_csv("../data/source/csv/building-permits-cleared-permits.csv", dtype=str)

In [21]:
df.columns

Index(['_id', 'PERMIT_NUM', 'REVISION_NUM', 'PERMIT_TYPE', 'STRUCTURE_TYPE',
       'WORK', 'STREET_NUM', 'STREET_NAME', 'STREET_TYPE', 'STREET_DIRECTION',
       'POSTAL', 'GEO_ID', 'WARD_GRID', 'APPLICATION_DATE', 'ISSUED_DATE',
       'COMPLETED_DATE', 'STATUS', 'DESCRIPTION', 'CURRENT_USE',
       'PROPOSED_USE', 'DWELLING_UNITS_CREATED', 'DWELLING_UNITS_LOST',
       'EST_CONST_COST', 'ASSEMBLY', 'INSTITUTIONAL', 'RESIDENTIAL',
       'BUSINESS_AND_PERSONAL_SERVICES', 'MERCANTILE', 'INDUSTRIAL',
       'INTERIOR_ALTERATIONS', 'DEMOLITION', 'BUILDER_NAME', 'permit_id'],
      dtype='object')

In [28]:
all([c in df.columns for c in df_cleared.columns])

True

In [9]:
df = pd.concat([
        df.drop("_id", axis=1),
        PyHasher(df_cleared.drop("_id", axis=1), PK_COLS, "permit_id").hash()
    ])

In [41]:
df.columns

Index(['PERMIT_NUM', 'REVISION_NUM', 'PERMIT_TYPE', 'STRUCTURE_TYPE', 'WORK',
       'STREET_NUM', 'STREET_NAME', 'STREET_TYPE', 'STREET_DIRECTION',
       'POSTAL', 'GEO_ID', 'WARD_GRID', 'APPLICATION_DATE', 'ISSUED_DATE',
       'COMPLETED_DATE', 'STATUS', 'DESCRIPTION', 'CURRENT_USE',
       'PROPOSED_USE', 'DWELLING_UNITS_CREATED', 'DWELLING_UNITS_LOST',
       'EST_CONST_COST', 'ASSEMBLY', 'INSTITUTIONAL', 'RESIDENTIAL',
       'BUSINESS_AND_PERSONAL_SERVICES', 'MERCANTILE', 'INDUSTRIAL',
       'INTERIOR_ALTERATIONS', 'DEMOLITION', 'BUILDER_NAME', 'permit_id',
       'effective_date'],
      dtype='object')

In [27]:
show_cols = [
    'PERMIT_NUM', 'REVISION_NUM', 'PERMIT_TYPE', 'WARD_GRID', 'APPLICATION_DATE', 'ISSUED_DATE',
       'COMPLETED_DATE', 'STATUS', 'DESCRIPTION', 'CURRENT_USE',
       'PROPOSED_USE', 
       'EST_CONST_COST', 'BUILDER_NAME', 'permit_id'
]
df[show_cols].sort_values("APPLICATION_DATE")

Unnamed: 0,PERMIT_NUM,REVISION_NUM,PERMIT_TYPE,WARD_GRID,APPLICATION_DATE,ISSUED_DATE,COMPLETED_DATE,STATUS,DESCRIPTION,CURRENT_USE,PROPOSED_USE,EST_CONST_COST,BUILDER_NAME,permit_id
254758,79 018917 BLD,00,Residential Building Permit,S1234,1979-06-21,1979-11-27,,Permit Issued,,,,600000,,becc1fbd
254759,81 018815 PLB,00,Plumbing(PS),S0428,1981-06-02,1981-06-16,,Permit Issued,,,,450,,c66faeac
278101,81 207001 CMB,00,Residential Building Permit,N1531,1981-10-08,1981-10-08,2022-08-08,Cancelled,SECOND STOREY ADDN,,,100,,477ec98c
254760,82 025573 BLD,00,Building Additions/Alterations,S1026,1982-12-30,1983-02-04,,Permit Issued,,,,45000,,372b9b07
254761,83 012784 PLB,00,Plumbing(PS),S0938,1983-02-01,,,Inspection,,,,,,263c92c9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
186203,19 228342 XXX,00,Conditional Permit,W0325,,,2019-09-27,Cancelled,Conditional permit application for model A1 le...,,,,,f17ec127
204275,20 152013 XXX,00,Conditional Permit,E2325,,,2020-06-03,Cancelled,Supply and Installation of free-standing Rack ...,,,,,dcccf276
207486,20 168143 XXX,00,Conditional Permit,N1826,,,2020-07-15,Cancelled,To construct Foundation for a 12 Story Condo a...,,,,,5e906d75
242525,21 235452 XXX,00,Non-Residential Building Permit,W0334,,,2021-11-08,Cancelled,Proposed underpinning to existing commercial b...,,,,,46b44408


In [37]:
[
    sum(df["APPLICATION_DATE"].isna()),
    sum(df["ISSUED_DATE"].isna()),
    sum(df["COMPLETED_DATE"].isna()),
]

[6, 37627, 262203]

In [15]:
df[["APPLICATION_DATE", "ISSUED_DATE", "COMPLETED_DATE"]].isnull().all(axis=1).sum()

0

**NOTE**: Every permit record has at least one date, so we can create an `effective_date` column that holds the latest date available among the Application, Issued, and Completed dates. Since Completed > Issued > Application, the effective date is the first non-missing date in that order.

Alternatively, we can take the maximum date present in the record. We will use this strategy as it is quicker to implement.

In [10]:
date_columns = ["APPLICATION_DATE", "ISSUED_DATE", "COMPLETED_DATE"]
df["effective_date"] = df[date_columns].fillna("").max(axis=1, skipna=True)
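
The string-based maximum relies on the dates being ISO-formatted (`YYYY-MM-DD`), so that text order matches date order and the empty string sorts first. A possible alternative, sketched here on toy rows, parses the columns to datetimes first and lets pandas skip the missing values.

```python
import pandas as pd

date_columns = ["APPLICATION_DATE", "ISSUED_DATE", "COMPLETED_DATE"]
toy = pd.DataFrame({
    "APPLICATION_DATE": ["1981-10-08", None, "1983-02-01"],
    "ISSUED_DATE": ["1981-10-08", None, None],
    "COMPLETED_DATE": ["2022-08-08", "2019-09-27", None],
})

# Parse each column to datetime64; missing values become NaT.
parsed = toy[date_columns].apply(pd.to_datetime)

# Row-wise maximum; NaT entries are skipped by default.
toy["effective_date"] = parsed.max(axis=1)
```

This yields a proper datetime column, which is what the period conversion in the next section needs anyway.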


## EDA: Windowing data
Experiment with windowing data to capture the wards with the highest permit activity. We want to see which wards have the highest activity in permit value and the most permits within the last month, quarter, and year. We also want to see the trend for each ward (permit value and number of permits) over time. Eventually we will create a static crosswalk between neighbourhoods and address points to map each address to a neighbourhood.

We will opt to use the `dt.to_period` conversion for the date series. This function depends on a frequency argument which is a representation of the [date windowing defined by pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-period-aliases). We'll use `M` for **monthly**, `Q` for **quarterly**, and `Y` for **yearly**.
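
As a quick illustration of the three frequencies, here is what `dt.to_period` produces for a few sample dates (the dates are arbitrary):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2024-01-15", "2024-04-30", "2023-12-31"]))

# Each date is mapped to the period that contains it.
monthly = dates.dt.to_period("M").astype(str).tolist()
quarterly = dates.dt.to_period("Q").astype(str).tolist()
yearly = dates.dt.to_period("Y").astype(str).tolist()

print(monthly)    # ['2024-01', '2024-04', '2023-12']
print(quarterly)  # ['2024Q1', '2024Q2', '2023Q4']
print(yearly)     # ['2024', '2024', '2023']
```

Grouping on the resulting period column puts every permit from the same month, quarter, or year into one bucket.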

In [46]:
def count_unique(df: pd.DataFrame) -> list:
    # Count distinct values per column of the dataframe passed in
    return [{col: df[col].nunique()} for col in df.columns]

In [23]:
ALLOWED_PERMIT_TYPES = [
    'Residential Building Permit',
    'Mechanical(MS)', 'Plumbing(PS)', 'Multiple Use Permit',
    'Change of Use Permit', 'Building Additions/Alterations',
    'Small Residential Projects', 'Conditional Permit',
    'Fire/Security Upgrade', 'Designated Structures',
    'Drain and Site Service', 'Partial Permit',
    'Demolition Folder (DM)', 'New Houses', 'New Building',
]
filtered = df.loc[
    (df["STATUS"] != "Cancelled") &
    (df["PERMIT_TYPE"].isin(ALLOWED_PERMIT_TYPES))
]
filtered = filtered[["GEO_ID", "effective_date", "EST_CONST_COST"]]
filtered = filtered.merge(df_loc[["ADDRESS_POINT_ID", "WARD_NAME"]], left_on="GEO_ID", right_on="ADDRESS_POINT_ID", how="left")


In [15]:
def convert_cost_column(df:pd.DataFrame, cost_col: str) -> pd.DataFrame:
    df[cost_col] = pd.to_numeric(df[cost_col], errors='coerce')
    return df

def group_by_time(df:pd.DataFrame, date_col: str, group_cols: list, cost_col: str, grouping: str = "month") -> pd.api.typing.DataFrameGroupBy:
    GROUPING = {
        "month": "M",
        "quarter": "Q",
        "year": "Y",
    }
    df = convert_cost_column(df, cost_col)
    df[grouping] = pd.to_datetime(df[date_col]).dt.to_period(freq=GROUPING[grouping])
    return df.groupby(group_cols + [grouping])

def get_ward_period_data(
        df: pd.DataFrame,
        date_col: str,
        group_cols: list,
        cost_col: str,
        grouping: str,
    ) -> pd.DataFrame:
    grouped = group_by_time(df, date_col, group_cols, cost_col, grouping)
    # Aggregate on the cost column passed in rather than a hardcoded name
    return grouped.agg(
            sum_est_costs = pd.NamedAgg(column=cost_col, aggfunc="sum"),
            num_permits = pd.NamedAgg(column=cost_col, aggfunc="count"),
        )

In [46]:
GROUP_COLS = ["WARD_NAME"]
get_ward_period_data(df=filtered, date_col="effective_date", group_cols=GROUP_COLS, cost_col="EST_CONST_COST", grouping="month").reset_index()

Unnamed: 0,WARD_NAME,month,sum_est_costs,num_permits
0,Beaches-East York,1985-05,3000.0,1
1,Beaches-East York,1986-08,8000.0,1
2,Beaches-East York,1987-05,30000.0,1
3,Beaches-East York,1987-09,40000.0,1
4,Beaches-East York,1988-02,4000.0,1
...,...,...,...,...
8591,York South-Weston,2024-01,49234690.0,42
8592,York South-Weston,2024-02,401082500.0,46
8593,York South-Weston,2024-03,7057383.0,38
8594,York South-Weston,2024-04,7152011.5,61


## EDA: Try to geofence with geopandas
Can we create a neighbourhood crosswalk with geopandas and intersections? Find where the address points in `address-points-municipal-toronto-one-address-repository.csv` lie within the neighbourhood shapes in `neighbourhoods.csv`.

In [11]:
import geopandas as gpd
import shapely
import json

def geojson_parse(geojson_string: str) -> object:
    return shapely.geometry.shape(json.loads(geojson_string))

HOOD_KEEP_COLS = ['AREA_ID', 'AREA_SHORT_CODE', 'AREA_NAME', 'geometry']
df_hood = pd.read_csv("../data/source/csv/neighbourhoods.csv")[HOOD_KEEP_COLS]
df_hood['geometry'] = df_hood['geometry'].apply(geojson_parse)
df_hood = gpd.GeoDataFrame(df_hood)

ADDRESS_KEEP_COLS = ['ADDRESS_POINT_ID', 'geometry']
df_loc_geo = df_loc[ADDRESS_KEEP_COLS].copy()
df_loc_geo['geometry'] = df_loc_geo['geometry'].apply(geojson_parse)
df_loc_geo = gpd.GeoDataFrame(df_loc_geo)

In [12]:
df_intersect = df_loc_geo.overlay(df_hood, how='intersection')
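
To show what the geofencing step decides for each address, here is a plain-Python sketch of a point-in-polygon test using ray casting. This is an illustration of the idea on toy squares, not how geopandas implements `overlay`; the neighbourhood names and point ids are invented, and real neighbourhood shapes (holes, multipolygons) are better left to a geometry library.

```python
def point_in_polygon(x: float, y: float, ring: list) -> bool:
    """Ray casting: toggle 'inside' for each edge crossed by a ray going right."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Toy neighbourhoods (two adjacent squares) and toy address points.
hoods = {
    "West Square": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "East Square": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
points = {"p1": (1.0, 1.0), "p2": (3.0, 0.5), "p3": (5.0, 5.0)}

# Crosswalk: each point id maps to the first polygon containing it, else None.
crosswalk = {
    point_id: next(
        (name for name, ring in hoods.items() if point_in_polygon(x, y, ring)),
        None,
    )
    for point_id, (x, y) in points.items()
}
```

Points that fall outside every polygon map to `None`, which is the case to watch for when joining the crosswalk back onto permits.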

In [101]:
df_intersect.drop(columns='geometry').to_csv("../data/analysis/csv/address-neighbourhoods.csv", index=False)

In [13]:
df_intersect.drop(columns='geometry')

Unnamed: 0,ADDRESS_POINT_ID,AREA_ID,AREA_SHORT_CODE,AREA_NAME
0,7296114,2502229,19,Long Branch
1,9109405,2502229,19,Long Branch
2,5729529,2502229,19,Long Branch
3,9950585,2502229,19,Long Branch
4,5729496,2502229,19,Long Branch
...,...,...,...,...
525278,14207305,2502302,92,Corso Italia-Davenport
525279,9313238,2502293,114,Lambton Baby Point
525280,14208577,2502293,114,Lambton Baby Point
525281,7974034,2502268,89,Runnymede-Bloor West Village


## EDA: Stats By Neighbourhood (tying it all together)
Now that we've annotated all properties by their neighbourhood, we can run our previous analyses on a per neighbourhood basis.

In [29]:
ALLOWED_PERMIT_TYPES = [
    'Residential Building Permit', 'Small Residential Projects', 
    # 'Conditional Permit', 'Drain and Site Service', 
    # 'Mechanical(MS)', 'Plumbing(PS)', 'Multiple Use Permit',
    # 'Change of Use Permit', 'Building Additions/Alterations',
    # 'Partial Permit', 'Demolition Folder (DM)', 
    'New Houses', 'New Building',
]
df_filtered = df.loc[
    (df["STATUS"] != "Cancelled") &
    (df["PERMIT_TYPE"].isin(ALLOWED_PERMIT_TYPES))
]
df_filtered = df_filtered[["GEO_ID", "effective_date", "EST_CONST_COST"]]
df_filtered = df_filtered.merge(df_intersect[["ADDRESS_POINT_ID", "AREA_NAME"]], left_on="GEO_ID", right_on="ADDRESS_POINT_ID", how="left")


In [22]:
df_filtered

Unnamed: 0,GEO_ID,effective_date,EST_CONST_COST,ADDRESS_POINT_ID,AREA_NAME
0,10575758,2000-01-04,20000,10575758,South Riverdale
1,12544613,2000-07-12,50000,12544613,Roncesvalles
2,8190271,2000-02-16,50000,8190271,South Parkdale
3,8456828,2000-01-11,0,8456828,Dovercourt Village
4,855389,2000-02-03,400000,855389,Downtown Yonge East
...,...,...,...,...,...
491222,493031,2018-09-14,,493031,Bedford Park-Nortown
491223,521259,2018-09-17,,521259,Bedford Park-Nortown
491224,546964,2018-09-17,,546964,Bedford Park-Nortown
491225,513957,2018-09-14,,513957,Bedford Park-Nortown


In [30]:
GROUP_COLS = ["AREA_NAME"]
df_processed = get_ward_period_data(
    df=df_filtered, 
    date_col="effective_date", 
    group_cols=GROUP_COLS, 
    cost_col="EST_CONST_COST", 
    grouping="month"
).reset_index()

**WE DID IT**

In [36]:
df_processed.to_csv("../data/analysis/csv/neighbourhood-permit-summary.csv", index=False)

## Analytics: Explore the new data.
An example of plotting the latest data by **month** and by **number of permits**: we see that the `Englemount-Lawrence` neighbourhood is seeing high amounts of residential permit activity.

In [55]:
df_processed.loc[(df_processed["month"] < "2024-05")].sort_values(["month", "num_permits"], ascending=False)

Unnamed: 0,AREA_NAME,month,sum_est_costs,num_permits
8437,Englemount-Lawrence,2024-04,16512572.52,58
15480,Leaside-Bennington,2024-04,6868326.00,33
25043,The Beaches,2024-04,2411000.00,32
11965,Humewood-Cedarvale,2024-04,7147065.50,29
25520,Trinity-Bellwoods,2024-04,3630502.00,27
...,...,...,...,...
24699,The Beaches,1985-05,3000.00,1
26336,West Queen West,1985-04,40000.00,1
6229,Dovercourt Village,1985-01,10000.00,1
818,Banbury-Don Mills,1984-12,0.00,0


If we filter for records only for the `Englemount-Lawrence` neighbourhood and sort by **month**: we see that this area had a huge permit influx in April of 2024, and hasn't seen much residential permit activity before or after (thus far).

In [57]:
df_processed.loc[df_processed['AREA_NAME'] == "Englemount-Lawrence"].sort_values("month", ascending=False)

Unnamed: 0,AREA_NAME,month,sum_est_costs,num_permits
8438,Englemount-Lawrence,2024-05,0.00,1
8437,Englemount-Lawrence,2024-04,16512572.52,58
8436,Englemount-Lawrence,2024-03,4000000.00,4
8435,Englemount-Lawrence,2024-02,3000.00,1
8434,Englemount-Lawrence,2024-01,2075000.00,3
...,...,...,...,...
8183,Englemount-Lawrence,1995-04,0.00,0
8182,Englemount-Lawrence,1993-10,0.00,0
8181,Englemount-Lawrence,1993-07,0.00,0
8180,Englemount-Lawrence,1990-12,0.00,0


If we filter for records only for the `Leaside-Bennington` neighbourhood and sort by **month**: we see that this area hasn't eased up in the last few months -- lots of renos going on, well above 3M dollars per month. This area is known to be a high-income neighbourhood.

In [58]:
df_processed.loc[df_processed['AREA_NAME'] == "Leaside-Bennington"].sort_values("month", ascending=False)

Unnamed: 0,AREA_NAME,month,sum_est_costs,num_permits
15481,Leaside-Bennington,2024-05,0.00,1
15480,Leaside-Bennington,2024-04,6868326.00,33
15479,Leaside-Bennington,2024-03,3895116.00,15
15478,Leaside-Bennington,2024-02,5775000.00,21
15477,Leaside-Bennington,2024-01,41494359.22,21
...,...,...,...,...
15205,Leaside-Bennington,2000-04,0.00,0
15204,Leaside-Bennington,1999-12,7500.00,1
15203,Leaside-Bennington,1999-10,20000.00,1
15202,Leaside-Bennington,1999-09,10000.00,1


In [54]:
df_processed.loc[(df_processed["month"] > "2022-12") & (df_processed["month"] < "2024-05") & (df_processed["num_permits"] > 0)].sort_values(["month", "num_permits"], ascending=[False, True])

Unnamed: 0,AREA_NAME,month,sum_est_costs,num_permits
2560,Black Creek,2024-04,150000.0,1
4509,Church-Wellesley,2024-04,50000.0,1
7482,East L'Amoreaux,2024-04,55000.0,1
9044,Fenside-Parkwoods,2024-04,0.0,1
10212,Harbourfront-CityPlace,2024-04,0.0,1
...,...,...,...,...
23658,South Riverdale,2023-01,769000.0,18
24425,Stonegate-Queensway,2023-01,3390000.0,18
25505,Trinity-Bellwoods,2023-01,1586000.0,18
22471,Rosedale-Moore Park,2023-01,2790000.0,19


In [53]:
df_processed.loc[df_processed['AREA_NAME'] == "Black Creek"].sort_values("month", ascending=False)

Unnamed: 0,AREA_NAME,month,sum_est_costs,num_permits
2561,Black Creek,2024-05,20000.0,1
2560,Black Creek,2024-04,150000.0,1
2559,Black Creek,2024-03,0.0,0
2558,Black Creek,2024-02,315000.0,3
2557,Black Creek,2024-01,22000.0,2
...,...,...,...,...
2488,Black Creek,2003-06,0.0,0
2487,Black Creek,2002-10,0.0,0
2486,Black Creek,2002-08,0.0,0
2485,Black Creek,2000-06,0.0,0


Next steps: Choropleth