<a href="https://colab.research.google.com/github/jonnross88/WhoLetThePuppiesIn/blob/main/puppies_comb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Brought Over


In [None]:
from IPython.display import clear_output



!pip install cartopy
!pip install --upgrade hvplot
!pip install --upgrade panel
!pip install --upgrade param
!pip install --upgrade holoviews
!pip install --upgrade umap-learn
!pip install --upgrade geoviews
!pip install --upgrade bokeh
!pip install --upgrade jupyter_bokeh
!pip install pysal
!pip install spatialpandas
!pip install thefuzz
!pip install pmdarima
!pip install dask[dataframe]

clear_output()

In [None]:
# from google.colab import auth, userdata

# PROJECT_ID = userdata.get('MrPrime')
# auth.authenticate_user(project_id=PROJECT_ID)


In [None]:
import subprocess
from pathlib import Path

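def download_file(file_path, file_url):
    """Download file_url to file_path with wget unless the file already exists."""
    file_path = Path(file_path)
    if not file_path.exists():
        print(f"File not found at {file_path}. Downloading now...")
        # check=True raises CalledProcessError if wget fails instead of failing silently
        subprocess.run(["wget", file_url, "-O", str(file_path)], check=True)
        print("Download complete.")
    else:
        print(f"File already exists at {file_path}")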



In [None]:
hf_url = 'https://raw.githubusercontent.com/jonnross88/WhoLetThePuppiesIn/main/notebooks/helper_functions.py'
hf_path = Path('/content/helper_functions.py')


download_file(hf_path, hf_url)

In [None]:
# Standard libraries
from functools import partial
from IPython.display import clear_output
import json
import math
from pathlib import Path
from PIL import ImageDraw, Image  # For image processing
from urllib.request import urlopen

# Related third party imports
from bokeh.models import FixedTicker, NumeralTickFormatter
import cartopy.crs as ccrs  # For cartographic projections and geographic plots
import colorcet as cc  # Additional color palettes
from esda.moran import Moran, Moran_Local  # Spatial autocorrelation statistics
from fiona.io import ZipMemoryFile
import geopandas as gpd
import geoviews as gv
import geoviews.tile_sources as gts
import spatialpandas as spd
import holoviews as hv

from holoviews import opts
import hvplot.pandas  # noqa
from matplotlib import pyplot as plt
import libpysal as lps  # Spatial analysis library
import numpy as np
import pandas as pd
import panel as pn
import panel.widgets as pnw
from pmdarima import auto_arima  # For determining ARIMA orders
import seaborn as sns
from sklearn import metrics  # For evaluating model performance
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import umap
from thefuzz import fuzz  # For string matching
from joblib import Memory
from tqdm import tqdm
import pysal as ps
from libpysal.weights import DistanceBand, KNN, Kernel, Queen, Rook
import esda
from splot.esda import (
    lisa_cluster,
    moran_scatterplot,
    plot_local_autocorrelation,
    plot_moran,
)
from splot.libpysal import plot_spatial_weights
from shapely.geometry import Point

# Local application/library specific imports
import helper_functions as hf  # Custom helper functions for this project
# from translate_app import translate_list_to_dict


# clear_output()

In [None]:
import concurrent.futures as cf
from collections import defaultdict
from IPython.display import clear_output
import re

from urllib.request import urlopen
from urllib.parse import urljoin

from bs4 import BeautifulSoup

import lxml
import numpy as np
import pandas as pd


## Dog Puppulation in Zürich: A Geospatial Neighborhood Analysis

### Introduction

#### Problem Statement:
Can we, by the end of January 2024, develop a *pawsome*, data-driven model that forecasts the dog *puppulation* density across Zürich’s 34 neighborhoods, identifies areas as high density clusters if their dog density is above the 75th percentile and as low density clusters if below the 25th percentile, and achieves this with a Mean Absolute Percentage Error of less than 10%, using time series cross validation, to support urban planning, pet-related businesses, and community welfare?
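
The percentile rule in the problem statement can be illustrated with a small sketch. It is an illustration only: the density values and the function name `label_density_clusters` are made up, and `statistics.quantiles` with `method="inclusive"` is just one of several reasonable quartile conventions.

```python
from statistics import quantiles


def label_density_clusters(density_by_area):
    """Label each area 'high', 'low' or 'mid' from the quartiles of its density."""
    # Quartile cut points (25th, 50th and 75th percentiles) of the observed densities
    q25, _, q75 = quantiles(list(density_by_area.values()), n=4, method="inclusive")
    labels = {}
    for area, density in density_by_area.items():
        if density > q75:
            labels[area] = "high"
        elif density < q25:
            labels[area] = "low"
        else:
            labels[area] = "mid"
    return labels


# Hypothetical densities (dogs per km^2) for eight made-up areas
example = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50, "F": 60, "G": 70, "H": 80}
labels = label_density_clusters(example)
```

With these made-up values, only the two lowest and the two highest areas fall outside the interquartile range, so they are the ones labelled `low` and `high`.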


#### Context:
Following the City Council Resolution to override the Law on the Keeping of Dogs, the City of Zürich has embarked on a comprehensive exploration of dog *puppulation* dynamics in its neighborhoods. This initiative, prompted by that regulatory shift, aims to sniff out patterns in dog *puppulation* density that impact urban planning, business opportunities, and the overall welfare of our furry companions and their owners. The study leverages data from **2015** to **2020** to improve urban planning, boost pet-related business ventures, and foster community welfare through a better understanding of dog *puppulation* density patterns. This study is vital in this new era for Zürich, providing practical recommendations for the near future. The aim is to develop a data-driven model that reliably predicts the dog *puppulation* density across Zürich’s 34 neighborhoods in the near future.


#### Criteria for Success:
Our goal is to *unleash* the power of predictive modeling to forecast the dog *puppulation* density patterns in Zürich, aiming to achieve a Mean Absolute Percentage Error of less than 10% with our model, which we will use to make informed predictions for 2024. Achieving this would be a *pawsitive* step towards informed future urban strategies.
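
To make the success criterion concrete, here is a minimal sketch of the error metric and of an expanding-window split, one common way to do time series cross validation. The function names and the `min_train` parameter are assumptions for illustration and are not taken from the notebook's modeling code.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent. Actual values must be non-zero."""
    errors = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100 * sum(errors) / len(errors)


def expanding_window_splits(periods, min_train=3):
    """Yield (train, test) pairs where each test period follows its training periods."""
    for end in range(min_train, len(periods)):
        yield periods[:end], periods[end]


years = [2015, 2016, 2017, 2018, 2019, 2020]
folds = list(expanding_window_splits(years))

# Made-up actual and forecast densities for two neighborhoods
example_error = mape([100, 200], [110, 180])
```

Each fold trains only on years before the one it is evaluated on, so the error estimate never uses information from the future.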


#### Constraints within Solution Space:
- **Temporal Scope**: This study utilizes data from 2015 to 2020 across all datasets. The taxable income dataset, which is only available up to year t - 3, was incorporated into our analysis. To align it with other datasets that extend to 2021 and 2022, we employed an auto-ARIMA model to predict taxable income for these years.
- **Spatial Resolution**: The study focuses on dog *puppulation* density at the neighborhood level. This may not capture variations within neighborhoods or between smaller areas.
- **Generalizability**: The findings of this study are specific to Zürich and may not be applicable to other cities or regions with different demographic, economic, and cultural contexts.


#### Stakeholders:
- **City Planners and Local Authorities:** Empower data-driven decision-making to enhance urban living conditions.
- **Business Enterprises:** Guide service offerings and marketing strategies.
- **Dog Owners:** Offer insights into community resources and pet care options.


#### Key Data Sources:
- **Geospatial Boundaries:** [Zürich Statistical Quarters](https://data.stadt-zuerich.ch/dataset/geo_statistische_quartiere)
- **Dog Ownership Records:** [Dog Owners Dataset](https://data.stadt-zuerich.ch/dataset/sid_stapo_hundebestand_od1001/download/KUL100OD1001.csv)
- **Demographic Statistics:** [Population Dataset](https://data.stadt-zuerich.ch/dataset/bev_bestand_jahr_quartier_alter_herkunft_geschlecht_od3903/download/BEV390OD3903.csv)
- **Economic Indicators:** [Income Dataset](https://data.stadt-zuerich.ch/dataset/fd_median_einkommen_quartier_od1003/download/WIR100OD1003.csv)
- **Household Dynamics:** [Household Size Dataset](https://data.stadt-zuerich.ch/dataset/bev_hh_haushaltsgroesse_quartier_seit2013_od3806/download/BEV380OD3806.csv)

#### Analytical Objectives:
- **Understand the Relationship**: Dig into the relationship between demographic factors and dog *puppulation* density across Zürich’s neighborhoods.
- **Identify Trends and Clusters**: Track and map out the spatial and temporal trends of dog *puppulation* density. Identify spatial clusters of high and low dog *puppulation* density.
- **Predict Future Trends**: Predict the near-future trends of dog *puppulation* density using historical data, aiming for a Mean Absolute Percentage Error of less than 10%. This includes forecasting where Zürich’s dog *puppulation* will be booming across its 34 neighborhoods in the immediate future.


### Imports & Configurations

This section covers the library imports and the configuration settings for dataframes and visualizations. Together they form the setup for the data analysis and exploration that follow.


In [None]:
# Additional configurations for visualization libraries
gv.extension("bokeh")
hv.extension("bokeh")
hvplot.extension("bokeh")
# pn.extension(template="fast", nthreads=4, sizing_mode="stretch_width")
pn.extension()
# memory cache
# Set the cache directory
CACHE_DIR = "./zurich_cache_directory"
memory = Memory(CACHE_DIR, verbose=0)

# clear_output()

In [None]:
# Pandas display options
# Disable warnings for chained assignments
pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 50
pd.options.display.max_rows = 100
hv.streams.PlotSize.scale = 2  # Sharper plots

# Seaborn style setting
sns.set_style("whitegrid")

# Panel configuration for improved interactivity performance
pn.config.throttled = True

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Clear any output created by the extensions and settings
# clear_output()

### Data Description

This project utilizes various datasets to reveal the relationship between dog owner geodemographic factors and dog population density in Zurich.




<table>
    <thead>
        <tr>
            <th>Dataset</th>
            <th>Source URL</th>
            <th>Original Source</th>
            <th>Description</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><a href="#Zurich-Statistical-Districts-Geospatial-Data">Zurich Districts Data</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/geo_statistische_quartiere">Link</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/geo_statistische_quartiere">Stadt Zürich</a></td>
            <td>Statistical Quarters</td>
        </tr>
        <tr>
            <td><a href="#Zurich-Dogs-Dataset">Zurich Dogs Data</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/sid_stapo_hundebestand_od1001/download/KUL100OD1001.csv">Link</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/sid_stapo_hundebestand_od1001">Stadt Zürich</a></td>
            <td>Dog populations of the City of Zurich since 2015.</td>
        </tr>
        <tr>
            <td><a href="#Zurich-Population-Dataset">Zurich Population Data</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/bev_bestand_jahr_quartier_alter_herkunft_geschlecht_od3903/download/BEV390OD3903.csv">Link</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/bev_bestand_jahr_quartier_alter_herkunft_geschlecht_od3903">Stadt Zürich</a></td>
            <td>Population by neighbourhood, origin, sex and age, since 1993.</td>
        </tr>
        <tr>
            <td><a href="#Zurich-Income-Data">Zurich Income Data</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/fd_median_einkommen_quartier_od1003/download/WIR100OD1003.csv">Link</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/fd_median_einkommen_quartier_od1003">Stadt Zürich</a></td>
            <td>Median income of taxable individuals by year, tax rate and urban district, since 1999</td>
        </tr>
        <tr>
            <td><a href="#Zurich-Household-Dataset">Zurich Household Data</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/bev_hh_haushaltsgroesse_quartier_seit2013_od3806/download/BEV380OD3806.csv">Link</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/bev_hh_haushaltsgroesse_quartier_seit2013_od3806">Stadt Zürich</a></td>
            <td>Private households by household size and urban district, since 2013.</td>
        </tr>
    </tbody>
</table>

<p>These datasets collectively enable a comprehensive analysis of dog ownership trends in Zurich.</p>


### Data Loading
First, we load in all of the datasets.

To enhance readability and ensure consistency across datasets, original column names were translated from German to English and standardized to snake case using our `sanitize_df_column_names` helper function. This transformation facilitates a cleaner, more uniform `pd.DataFrame` structure for analysis.

We then inspect the columns and select the ones we would like to keep for our analysis. We also rename the columns to make them more readable and consistent across datasets.
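
As a rough illustration of the snake case step, the sketch below converts a few of the original German column names. It mirrors the regular expressions of the commented-out `convert_to_snake_case` function further down in this notebook and covers the casing only, not the translation; the name `to_snake_case` is an assumption.

```python
import re


def to_snake_case(name):
    """Convert a CamelCase column name to snake_case."""
    name = re.sub(r"[\s,-]+", "_", name)  # whitespace, commas, hyphens -> underscore
    name = re.sub(r"([A-Z])([A-Z][a-z]+)", r"\1_\2", name)  # capital run before a word
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)  # lowercase/digit before capital
    name = re.sub(r"[()]", "", name)  # drop parentheses
    return name.lower()


columns = ["StichtagDatJahr", "HalterId", "AlterV10Cd"]
snake_columns = [to_snake_case(col) for col in columns]
```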



#### Zurich Statistical Districts Geospatial Data

This first geodataset comes as a compressed file containing 3 geojson files.

1. `z_gdf_0`: point geometry data at the ideal position for placing a number label on the polygon map.

2. `z_gdf_1`: polygon geometry data specifically for visual representation in cartography, i.e. maps.

3. `z_gdf_2`: polygon geometry data recommended for use for accurate geometry calculations, like spatial joins or area calculations.

Together, these three files provide excellent geodetic information on the geographical region of Zürich for our analysis.
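
The sketch below shows one way to read several GeoJSON files out of a single ZIP archive using only the standard library. It is not the implementation of `hf.get_gdf_from_zip_url`, which is not shown here and returns GeoDataFrames; the function name `read_geojson_members` and the file names are made up for illustration.

```python
import io
import json
import zipfile


def read_geojson_members(zip_bytes):
    """Return {member name: parsed GeoJSON dict} for each GeoJSON file in a ZIP."""
    layers = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".geojson", ".json")):
                layers[name] = json.loads(archive.read(name).decode("utf-8"))
    return layers


def make_feature_collection(n_features):
    """Build a made-up FeatureCollection with n identical point features."""
    feature = {
        "type": "Feature",
        "properties": {},
        "geometry": {"type": "Point", "coordinates": [8.54, 47.37]},
    }
    return {"type": "FeatureCollection", "features": [feature] * n_features}


# Build a tiny in-memory archive with two layers to exercise the function
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("labels.geojson", json.dumps(make_feature_collection(1)))
    archive.writestr("polygons.geojson", json.dumps(make_feature_collection(2)))

layers = read_geojson_members(buffer.getvalue())
```

Keeping the layers in a dictionary keyed by name matches how the notebook later picks individual layers such as `z_gdf["z_gdf_1"]`.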

In [None]:
# Define the URL for the Zurich Statistical Quarters geospatial data ZIP file.
zip_gdf_url = "https://storage.googleapis.com/mrprime_dataset/zurich/zurich_statistical_quarters.zip"

# Load the geospatial data into a dictionary of Zurich GeoDataFrames.
zurich_geo_dicts = hf.get_gdf_from_zip_url(zip_gdf_url)

# Rename keys in the Zurich Geo DataFrames with a prefix.
z_gdf = hf.rename_keys(zurich_geo_dicts, prefix="z_gdf_")

# Display the information and a sample of data from each GeoDataFrame in the z_gdf dictionary
for key in z_gdf.keys():
    print(f"\nInformation for {key}:")
    z_gdf[key].info()
    print(f"Sample data from {key}:")
    display(z_gdf[key].sample(3))

#### Zurich Dogs Dataset


In [None]:
# zurich_dog_data_link = "https://data.stadt-zuerich.ch/dataset/sid_stapo_hundebestand_od1001/download/KUL100OD1001.csv"
zurich_dog_data_link = (
    "https://storage.googleapis.com/mrprime_dataset/zurich/zurich_dogs.csv"
)
# InfoDataFrame is a custom class that inherits from pandas.DataFrame and our InfoMixin
zurich_dog_data = hf.InfoDataFrame(pd.read_csv(zurich_dog_data_link))
zurich_dog_data.limit_info()

# zurich_dog_data = hf.sanitize_df_column_names(zurich_dog_data)
# zurich_dog_data.limit_info()
zurich_dog_data.sample(3)

#### Zurich Population Dataset



In [None]:
# zurich_pop_link = "https://data.stadt-zuerich.ch/dataset/bev_bestand_jahr_quartier_alter_herkunft_geschlecht_od3903/download/BEV390OD3903.csv"
zurich_pop_link = "https://storage.googleapis.com/mrprime_dataset/zurich/zurich_pop.csv"
zurich_pop_data = hf.InfoDataFrame(pd.read_csv(zurich_pop_link))
zurich_pop_data.limit_info()
# zurich_pop_data = hf.sanitize_df_column_names(zurich_pop_data)
# zurich_pop_data.limit_info()
print("Showing a full row of the Zurich population DataFrame:")
zurich_pop_data.sample().T

#### Zurich Income Dataset
These data contain quantile values of the taxable income of natural persons who are primarily taxable in the city of Zurich. Taxable income is given in thousands of francs (integer).

In [None]:
# zurich_income_link = "https://data.stadt-zuerich.ch/dataset/fd_median_einkommen_quartier_od1003/download/WIR100OD1003.csv"
zurich_income_link = (
    "https://storage.googleapis.com/mrprime_dataset/zurich/zurich_income.csv"
)
zurich_income_data = hf.InfoDataFrame(pd.read_csv(zurich_income_link))
zurich_income_data.info()

# Clean column names, display info and sample
# zurich_income_data = hf.sanitize_df_column_names(zurich_income_data)
# zurich_income_data.info()

print("\nShowing a full row of the Zurich income DataFrame:")
zurich_income_data.sample().T

#### Zurich Household Dataset

In [None]:
# zurich_household_data_link = "https://data.stadt-zuerich.ch/dataset/bev_hh_haushaltsgroesse_quartier_seit2013_od3806/download/BEV380OD3806.csv"
zurich_household_data_link = (
    "https://storage.googleapis.com/mrprime_dataset/zurich/zurich_household.csv"
)
zurich_household_data = hf.InfoDataFrame(pd.read_csv(zurich_household_data_link))
zurich_household_data.limit_info()
print(zurich_household_data.columns)
# zurich_household_data = hf.sanitize_df_column_names(zurich_household_data)
# zurich_household_data.limit_info()
print("\nShowing a full row of the Zurich household DataFrame:")

zurich_household_data.sample().T

### Dataset Wrangling

Before diving into Exploratory Data Analysis (EDA), we need to prepare our datasets. This involves:
- Removing unnecessary columns
- Renaming columns for consistency
- Adding new columns
- Cleaning data (handling missing values, correcting datatypes, and standardizing data)

These steps will ensure our data is clean and well-structured, setting the stage for effective and accurate analysis in the EDA phase. We'll apply these steps to each dataset.

#### Zurich Statistical Districts Geospatial Data

Additional steps for this dataset not yet mentioned:

- area calculations
- dissolving the neighborhood polygons into districts, so that we can also work at the district level if needed
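
One caveat on the area calculations: the code in this section projects the geometries to `ccrs.GOOGLE_MERCATOR` before measuring them, and a Mercator projection stretches small areas by roughly 1 / cos²(latitude). The sketch below estimates that factor; the latitude value is an approximation used for illustration and the function name is an assumption.

```python
import math


def mercator_area_inflation(latitude_deg):
    """Factor by which a Mercator projection overstates small areas at a latitude."""
    return 1.0 / math.cos(math.radians(latitude_deg)) ** 2


ZURICH_LATITUDE_DEG = 47.37  # approximate latitude of Zürich (assumption)
factor = mercator_area_inflation(ZURICH_LATITUDE_DEG)
print(f"Areas measured in Web Mercator are about {factor:.2f}x too large near Zürich")
```

Because the factor is nearly constant across a single city, comparisons between neighborhoods are barely affected, but absolute figures in km² are overstated. A local projected CRS such as the Swiss LV95 grid (EPSG:2056) would avoid most of this distortion.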

In [None]:
zurich_map_gdf = z_gdf["z_gdf_1"]

zurich_map_gdf.rename(
    columns={"qname": "neighborhood", "qnr": "subdistrict", "knr": "district"},
    inplace=True,
)
# Format the subdistrict column to have 3 digits
zurich_map_gdf["subdistrict"] = zurich_map_gdf["subdistrict"].astype(str).str.zfill(3)

# Create the refined geodataframe
subdistrict_gdf = zurich_map_gdf[
    ["neighborhood", "subdistrict", "district", "geometry"]
].copy()

# Display geodataframe information and CRS
subdistrict_gdf.info()
display(subdistrict_gdf.crs)

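# Display a sample entry from the transformed geodataframe
display(subdistrict_gdf.sample().T)

# Load the geospatial data for calculation
zurich_calc_gdf = z_gdf["z_gdf_2"]

# Calculate area in square kilometers and add as a new column.
# Note: Web Mercator inflates areas away from the equator (roughly 2x at
# Zürich's latitude), so treat these values as relative rather than absolute.
zurich_calc_gdf["subd_area_km2"] = (
    zurich_calc_gdf.to_crs(ccrs.GOOGLE_MERCATOR).area / 1e6
)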

# Rename the column for consistency with the main geodataframe
zurich_calc_gdf = zurich_calc_gdf.rename(columns={"qname": "neighborhood"})

# Merge calculated features with the main geodataframe (subdistrict_gdf)
area_gdf = subdistrict_gdf.merge(
    zurich_calc_gdf[["neighborhood", "subd_area_km2"]], on="neighborhood"
)

# Display a snapshot of the merged geodataframe
display(area_gdf.sample().T)


# Dissolve the neighborhoods into districts and compute each district's area
districts_gdf = (
    subdistrict_gdf.drop(columns=["neighborhood", "subdistrict"])
    .dissolve(by="district")
    .reset_index()
)
districts_gdf["d_area_km2"] = districts_gdf.to_crs(ccrs.GOOGLE_MERCATOR).area / 1e6

display(districts_gdf.sample().T)
districts_gdf

In [None]:
# Save the geodataframe to disk in the data folder
# area_gdf.to_file("../data/zurich_neighborhoods.geojson")
# districts_gdf.to_file("../data/zurich_districts.geojson")

# Save the geodataframe to disk in the data folder
hf.save_to_data(area_gdf, "zurich_neighborhoods.geojson")
hf.save_to_data(districts_gdf, "zurich_districts.geojson")

#### Zurich Dogs Dataset

The original dataset had 31 columns, many redundant. We've picked 18 for our analysis:

- deadline_date_year
- holder_id
- age_v_10_cd
- sex_cd
- circle_cd
- quar_cd
- quar_lang
- race_1_text
- race_2_text
- breed_mixed__breed_cd
- breed_mongrel_long
- breed_mixed__breed_sort
- breed_type_cd
- birth_dog_year
- age_v_dog_cd
- sex_dog_cd
- dog_color_text
- number_of_dogs

From these columns we create a new dataset, `dog_data`, and transform the columns in preparation for the EDA phase. These transformations include:
- converting columns that contain only two distinct values to binary columns
- translating some values from German to English
- dealing with missing values
- standardizing some of the values for easier grouping

In [None]:
zurich_dog_data = hf.InfoDataFrame(pd.read_csv(zurich_dog_data_link))

zurich_dog_data_column_name_translations = {
    'StichtagDatJahr': 'reporting_year',
    'HalterId': 'owner_id',
    'AlterV10Cd': 'age_group_10',
    'SexCd': 'owner_gender',
    'KreisCd': 'district',
    'QuarCd': 'subdistrict',
    'Rasse1Text': 'breed_1',
    'Rasse2Text': 'breed_2',
    'RasseMischlingCd': 'mixed_breed_code',
    'RasseMischlingLang': 'mixed_type',
    'DatenstandCd': 'data_status_code',
    'AlterV10Lang': 'age_group_10_long',
    'AlterV10Sort': 'age_group_10_sort',
    'RassentypCd': 'dog_size',
    'GebDatHundJahr': 'dog_birth_year',
    'AlterVHundCd': 'dog_age',
    'SexHundCd': 'dog_gender',
    'HundefarbeText': 'dog_color',
    'AnzHunde': 'number_of_dogs',
    # 'SexLang': 'sex_long',
    # 'SexSort': 'sex_sort',
    # 'KreisLang': 'district_long',
    # 'KreisSort': 'district_sort',
    # 'QuarLang': 'subdistrict_long',
    # 'QuarSort': 'subdistrict_sort',
    # 'RasseMischlingSort': 'mixed_breed_sort',
    # 'RassentypLang': 'breed_type_long',
    # 'RassentypSort': 'breed_type_sort',
    # 'AlterVHundLang': 'dog_age_long',
    # 'AlterVHundSort': 'dog_age_sort',
    # 'SexHundLang': 'dog_sex_long',
    # 'SexHundSort': 'dog_sex_sort',
}

print(f"Dataset now has {zurich_dog_data.shape[0]} rows and {zurich_dog_data.shape[1]} columns\n")
zurich_dog_data.rename(columns=zurich_dog_data_column_name_translations).sample()

In [None]:
# Only needed by the commented-out translate_list_to_dict function below
# from google.cloud import translate

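import unicodedata


def remove_accents(input_str):
    """Return input_str with accents stripped (other non-ASCII characters are dropped)."""
    # NFKD splits an accented character into its base character plus a combining
    # mark; encoding to ASCII with "ignore" then discards the combining marks.
    return unicodedata.normalize("NFKD", input_str).encode("ASCII", "ignore").decode()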



# def convert_to_snake_case(name):
#     """Convert a camel case string to a snake case string"""
#     name = re.sub(r'[\s,-]+', '_', name)
#     name = re.sub(r"([A-Z])([A-Z][a-z]+)", r"\1_\2", name)
#     name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
#     name = re.sub(r'[()]', '', name)
#     return name.lower()

# @memory.cache
# def translate_list_to_dict(
#     list_of_strings,
#     project_id: str = "mrprime-349614",
#     source_lang: str = "de",
#     target_lang: str = "en-US",
# ) -> dict[str, str]:
#     """Translates a list, or another iterable, of strings using the Cloud Translation API.
#     Returns a dict mapping each source string to its translation."""
#     client = translate.TranslationServiceClient()
#     location = "us-central1"
#     parent = f"projects/{project_id}/locations/{location}"
#     response = client.translate_text(
#         request={
#             "parent": parent,
#             "contents": list_of_strings,
#             "mime_type": "text/plain",  # mime types: text/plain, text/html
#             "source_language_code": source_lang,
#             "target_language_code": target_lang,
#         }
#     )
#     trans_dict = {
#         text: translation.translated_text
#         for (text, translation) in zip(list_of_strings, response.translations)
#     }
#     return trans_dict


# def sanitize_df_column_names(df):
#     """Function to sanitize column names by translating and converting to snake case"""
#     column_list = df.columns.tolist()
#     # translate the column names
#     translated_dict = translate_list_to_dict(column_list)
#     # map the translated column names to the column names
#     df.rename(columns=translated_dict, inplace=True)
#     # convert the column names to snake case
#     df.columns = [convert_to_snake_case(col) for col in df.columns]
#     return df

In [None]:

zurich_dog_data = zurich_dog_data.rename(columns=zurich_dog_data_column_name_translations)

# After renaming, you may still need to adjust the data types for certain columns
zurich_dog_data["owner_id"] = zurich_dog_data["owner_id"].astype("string").str.zfill(6)
zurich_dog_data["dog_age"] = zurich_dog_data["dog_age"].astype(int)
zurich_dog_data["district"] = zurich_dog_data["district"].astype(int)
zurich_dog_data["subdistrict"] = (
    zurich_dog_data["subdistrict"].astype("string").str.zfill(3)
)
print(f"Dataset now has {zurich_dog_data.shape[0]} rows and {zurich_dog_data.shape[1]} columns\n")



The `number_of_dogs` column gives the number of dogs each row represents. These are 'brothers and sisters': dogs that share the same owner and the same recorded characteristics, e.g.:

- `standard` (the breed)
- `dog_color_en` (the dog color), etc.


We expand the dataset to have one dog for each row, by repeating the rows by the number in the `number_of_dogs` column. We reset the index after so that we have a unique index for each row.
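
The idea can be sketched in plain Python before looking at the pandas version. The rows and the function name `expand_rows` are made up for illustration; the notebook itself does this with `index.repeat`.

```python
def expand_rows(records, count_key="number_of_dogs"):
    """Return one record per dog by repeating each record `count_key` times."""
    expanded = []
    for record in records:
        # Copy everything except the count column, which is no longer needed
        single = {k: v for k, v in record.items() if k != count_key}
        expanded.extend(dict(single) for _ in range(record[count_key]))
    return expanded


# Made-up rows for illustration
rows = [
    {"owner_id": "000123", "breed_1": "Labrador Retriever", "number_of_dogs": 2},
    {"owner_id": "000456", "breed_1": "Chihuahua", "number_of_dogs": 1},
]
dogs = expand_rows(rows)
```

After expansion, the number of rows equals the sum of the original `number_of_dogs` values, which is a quick sanity check for the pandas version as well.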


In [None]:

# Repeat each row by the number of dogs it represents
zurich_dog_data = zurich_dog_data.loc[
    zurich_dog_data.index.repeat(zurich_dog_data["number_of_dogs"])
]
# drop the number of dogs column
zurich_dog_data.drop("number_of_dogs", axis=1, inplace=True)
# reset the index
zurich_dog_data.reset_index(drop=True, inplace=True)

print(
    f"Dataset now has {zurich_dog_data.shape[0]} rows and {zurich_dog_data.shape[1]} columns"
)

display(zurich_dog_data.sample(3))

dog_columns = [
    'reporting_year',
    'owner_id',
    'age_group_10',
    'owner_gender',
    'dog_size',
    'dog_age',
    'mixed_type',
    'dog_gender',
    'dog_color',
    'breed_1',
    'breed_2',
    'district',
    'subdistrict',
]
dog_data = zurich_dog_data[dog_columns].copy()

print(f"Dataset now has {dog_data.shape[0]} rows and {dog_data.shape[1]} columns")


# dog_data = zurich_dog_data[list(new_column_names.values())].copy()
display(
    dog_data.describe(include="all")
    .T.sort_values(by="unique")
    .infer_objects(copy=False)
    .fillna("")
)

First, a look at the `dog_density` data, which will be our target variable for the analysis.
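
Dog density here means the number of registered dogs per square kilometer, per subdistrict and reporting year. A minimal plain-Python sketch of that calculation, with made-up records and areas and an assumed function name `dog_density`, could look like this:

```python
from collections import Counter


def dog_density(dog_records, area_km2):
    """Return {(year, subdistrict): dogs per km^2}, skipping unknown subdistricts."""
    counts = Counter(
        (rec["reporting_year"], rec["subdistrict"])
        for rec in dog_records
        if rec["subdistrict"] in area_km2
    )
    return {
        key: round(count / area_km2[key[1]], 2) for key, count in counts.items()
    }


# Made-up records (one per dog) and areas for illustration
records = [
    {"reporting_year": 2015, "subdistrict": "011"},
    {"reporting_year": 2015, "subdistrict": "011"},
    {"reporting_year": 2015, "subdistrict": "012"},
    {"reporting_year": 2016, "subdistrict": "011"},
    {"reporting_year": 2016, "subdistrict": "999"},  # not in the area table, skipped
]
areas = {"011": 0.5, "012": 2.0}
density = dog_density(records, areas)
```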

In [None]:
# Extract the 'subdistrict' and 'subd_area_km2' columns from the area_gdf dataframe
area_df = area_gdf[['subdistrict', 'subd_area_km2']]


# create a function to create the dog density dataframe
def create_dog_density_df(dog_data, area_df):
    """Creates a dataframe with dog density per subdistrict and reporting_year."""
    # keep only dogs registered in a subdistrict present in area_df
    # (filter into a new frame so the caller's dataframe is not modified)
    subdistricts_list = area_df["subdistrict"].unique()
    dogs = dog_data[dog_data["subdistrict"].isin(subdistricts_list)]
    # get the count of dogs per reporting_year and subdistrict
    dog_data_counts_df = (
        dogs.groupby(["reporting_year", "subdistrict"])
        .size()
        .reset_index(name="count")
    )
    # merge the dog counts with the subdistrict areas
    dog_density_df = dog_data_counts_df.merge(area_df, on="subdistrict")
    # calculate the dog density (dogs per square kilometer)
    dog_density_df["dog_density"] = (
        dog_density_df["count"] / dog_density_df["subd_area_km2"]
    ).round(2)
    # pivot to one row per subdistrict and one column per reporting year
    return dog_density_df.pivot(
        index="subdistrict", columns="reporting_year", values="dog_density"
    )


create_dog_density_df(dog_data,area_df).hvplot.heatmap(
                          cmap='greens',
                          height=600,
                          width=600,
                          title='Dog Density per Sub-District').opts(
                              active_tools=['box_zoom'],
                              color_levels=7,
                              line_width=2,
                          )

In [None]:
dog_data.info()

In [None]:
# get subdistricts_list from  area_df as these subdistricts are used in predicting the target
subdistricts_list = area_df.subdistrict.unique()
dog_data.loc[~dog_data.subdistrict.isin(subdistricts_list),
               "subdistrict"] = None
# drop rows with missing subdistricts
dog_data = dog_data.dropna(subset='subdistrict')
print(f"Dataset now has {dog_data.shape[0]} rows and {dog_data.shape[1]} columns\n")

# Unique values for "mixed_type" column
breed_cat_list_de = dog_data["mixed_type"].unique().tolist()
print("Breed Categories (German):")
display(breed_cat_list_de)

# Create a dictionary for translation
# breed_cat_dict = translate_list_to_dict(breed_cat_list_de)
# print("\nBreed Category Dictionary (Translation):")
# display(breed_cat_dict)

In [None]:
# Map 'mixed_type' to categories, rename for brevity, and define 'is_pure_breed'
mixed_type_dict = {
    "Rassehund": "PB",
    "Mischling, beide Rassen bekannt": "BB",
    "Mischling, sekundäre Rasse unbekannt": "BU",
    "Mischling, beide Rassen unbekannt": "UU",
}

dog_data["mixed_type"] = dog_data["mixed_type"].map(mixed_type_dict)

# dog_data["mixed_type"] = (
#     dog_data["mixed_type"]
#     .map(breed_cat_dict)
#     .map(
#         {
#             "pedigree dog": "PB",
#             "Mixed breed, both breeds known": "BB",
#             "Mixed breed, secondary breed unknown": "BU",
#             "Mixed breed, both breeds unknown": "UU",
#         }
#     )
# )
dog_data["is_pure_breed"] = dog_data["mixed_type"].eq("PB")
dog_data['is_designer_breed'] = dog_data['mixed_type'].eq('BB')

In [None]:
# Define boolean owner and dog gender columns (a code of 1 is treated as male),
# then drop the source columns they were derived from
if "owner_gender" in dog_data.columns:
    dog_data["is_male_owner"] = dog_data["owner_gender"] == 1
    dog_data = dog_data.drop(columns=["owner_gender"])

if "dog_gender" in dog_data.columns:
    dog_data["is_male_dog"] = dog_data["dog_gender"] == 1
    dog_data = dog_data.drop(columns=["dog_gender"])

dog_data.shape

In [None]:
# Download translations from the github repo
breeds_dict_translations_path = Path('./breeds_dict_translations.json')
breeds_dict_translations_url = "https://raw.githubusercontent.com/jonnross88/WhoLetThePuppiesIn/refs/heads/main/notebooks/breeds_dict_translations.json"

color_dict_translations_path = Path('./dog_color_translations.json')
color_dict_translations_url = "https://raw.githubusercontent.com/jonnross88/WhoLetThePuppiesIn/refs/heads/main/notebooks/dog_color_translations.json"

download_file(breeds_dict_translations_path, breeds_dict_translations_url)
download_file(color_dict_translations_path, color_dict_translations_url)

In [None]:
# Unique values for dog colors
dog_colors = dog_data["dog_color"].str.lower().unique().tolist()
dog_colors.sort()
# print(dog_colors)
# Translate dog colors
# dog_color_translations = translate_list_to_dict(dog_colors)
with open(color_dict_translations_path, 'r') as fp:
    dog_color_translations = json.load(fp)

# dog_data["dog_color_en"] = dog_data["dog_color"].str.lower().map(
#     dog_color_translations)
dog_color_df = pd.DataFrame.from_dict(dog_color_translations, orient='index').reset_index().rename(columns={'index': 'dog_color', 0: 'dog_color_en'})
print("\nColors dataframe sample")
display(dog_color_df.sample(3))


# Unique values for breed_1
breeds_1 = dog_data["breed_1"].str.lower().unique().tolist()

# Unique values for breed_2
breeds_2 = dog_data["breed_2"].str.lower().unique().tolist()

breeds_list = list(set(breeds_1 + breeds_2))
breeds_list = [remove_accents(breed) for breed in breeds_list]
breeds_list.sort()

# breeds_dict = translate_list_to_dict(breeds_list)
with open(breeds_dict_translations_path, 'r') as fp:
    breeds_dict = json.load(fp)

breeds_df = pd.DataFrame.from_dict(breeds_dict, orient="index").reset_index().rename(columns={'index': 'breed_de', 0: 'breed_en'})
print("\nBreeds dataframe sample")
display(breeds_df.sample(3))



##### Breed Standardization
To ensure consistency in the analysis, we standardize the breeds in the dataset. The breed columns are free text that dog owners fill in during registration, so the same breed can appear under several spellings. To address this, we use the dataframe collected in the previous notebook, which contains the breeds recognized by the FCI (Fédération Cynologique Internationale). In it, each recognized FCI breed has a column listing its names in different languages along with alternative, unofficial names.

This approach helps capture variations in breed names and facilitates grouping similar breeds together.
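
To give a feel for fuzzy matching, the sketch below maps free-text breed names onto a reference list. It uses the standard library's `difflib` as a stand-in for `thefuzz`, so the scores and the `cutoff` value are not comparable to `fuzz.WRatio`. The reference list and function names are made up and do not reflect how `hf.apply_fuzzy_matching_to_breed_column` works internally.

```python
from difflib import get_close_matches
import unicodedata


def normalize(name):
    """Lowercase a breed name and strip accents and surrounding whitespace."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ASCII", "ignore").decode()
    return ascii_name.strip().lower()


def standardize_breed(raw_name, known_breeds, cutoff=0.8):
    """Return the closest known breed, or None if nothing is similar enough."""
    matches = get_close_matches(normalize(raw_name), known_breeds, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# A tiny, made-up reference list standing in for the FCI breed table
KNOWN_BREEDS = ["labrador retriever", "golden retriever", "german shepherd dog", "poodle"]

typo_match = standardize_breed("Labrador Retriver", KNOWN_BREEDS)  # typo in free text
accent_match = standardize_breed("Poodlé ", KNOWN_BREEDS)  # accent and trailing space
no_match = standardize_breed("Xyz", KNOWN_BREEDS)  # nothing plausible
```

Names with no sufficiently close match come back as `None`, which leaves room for a manual fallback like the one the notebook applies to the breeds that remain unmatched.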




In [None]:
# !wget https://raw.githubusercontent.com/jonnross88/WhoLetThePuppiesIn/main/notebooks/fci_breeds.json
# Saved the fci data in a bucket for easier editing vs the github method

fci_url = 'https://storage.googleapis.com/mrprime_dataset/dogs/fci_breeds.json'

fci_breeds = pd.read_json(fci_url)
fci_breeds[["alt_names", "breed_en"]].sample()


In [None]:

# nan_mask = breeds_df["standard"].isna()

# matched_value = hf.apply_fuzzy_matching_to_breed_column(
#     breeds_df, "breed_de", fci_breeds, [fuzz.WRatio]
# )

# breeds_df.loc[nan_mask, "standard"] = matched_value[nan_mask]
# nan_mask = breeds_df["standard"].isna()
# print(nan_mask.sum())
breeds_df

In [None]:
# Get the FCI dataframe with the recognized breeds
# fci_breeds = pd.read_json("../data/fci_breeds.json")
# fci_breeds = pd.read_json(fci_url)
# fci_breeds[["alt_names", "breed_en"]]

# Create a DataFrame with translated breed names
# breeds_df.columns = ["breed_de", "breed_en"]

# Initialize a "standard" column for breed standardization
breeds_df["standard"] = None
nan_mask = breeds_df["standard"].isna()

# Match on each name column in turn (skipping the empty "standard" column itself)
for col in ["breed_de", "breed_en"]:
    matched_value = hf.apply_fuzzy_matching_to_breed_column(
        breeds_df.loc[nan_mask], col, fci_breeds, [fuzz.WRatio]
    )
    breeds_df.loc[nan_mask, "standard"] = matched_value[nan_mask]
    nan_mask = breeds_df["standard"].isna()

# Update the standard column for specific cases
breeds_df.loc[nan_mask, "standard"] = breeds_df.loc[nan_mask, "breed_en"]
breeds_df.loc[breeds_df["breed_de"] == "elo", "standard"] = "elo"
breeds_df.loc[breeds_df["breed_de"] == "keine", "standard"] = "none"
breeds_df.loc[breeds_df["breed_de"] == "mischling", "standard"] = "hybrid"

# Convert breed_1 to lowercase for merging
dog_data["breed_1"] = dog_data["breed_1"].str.lower()
dog_data['breed_1'] = dog_data['breed_1'].apply(remove_accents)
dog_data["breed_2"] = dog_data["breed_2"].str.lower()
dog_data['breed_2'] = dog_data['breed_2'].apply(remove_accents)

# Merge with the breeds_df for standardized breed names
dog_data = dog_data.merge(
    breeds_df.drop(columns=["breed_en"]),
    how='left',
    left_on="breed_1",
    right_on="breed_de",
    suffixes=("", "_1"),
)

dog_data = dog_data.merge(
    breeds_df.drop(columns=["breed_en"]),
    how='left',
    left_on="breed_2",
    right_on="breed_de",
    suffixes=("", "_2"),  # Add suffix to distinguish columns
)

dog_data['dog_color'] = dog_data['dog_color'].str.lower()
dog_data = dog_data.merge(dog_color_df, how='left', left_on='dog_color', right_on='dog_color')
dog_data.shape


##### Filtering Doodle Dogs

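Dogs with 'doodle' in their breed name are poodle crosses, a designer breed that is not yet recognized. We identify these dogs, reclassify them as mixed breeds, and update their breed information accordingly: poodle becomes the second breed, and the first breed is set to golden retriever or labrador retriever depending on the name.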
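
The reclassification rule can be sketched in plain Python as below. The helper name `reclassify_doodle` and the returned dictionary layout are illustrative assumptions. The rule itself (names starting with "g" map to golden retriever, all other doodles to labrador retriever, with poodle as the second breed) mirrors the cell that follows.

```python
def reclassify_doodle(breed_name):
    """Return updated breed fields for a doodle, or None for any other breed."""
    name = breed_name.strip().lower()
    if "doodle" not in name:
        return None
    # Names starting with "g" are treated as golden retriever crosses;
    # every other doodle is treated as a labrador retriever cross.
    first_breed = "golden retriever" if name.startswith("g") else "labrador retriever"
    return {
        "standard": first_breed,
        "standard_2": "poodle",
        "mixed_type": "BB",
        "is_pure_breed": False,
    }
```

Lowercasing the name before the prefix check keeps the rule independent of how the breed was capitalized at registration.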


In [None]:
# Create mask to filter out the doodle dogs
doodle_mask = dog_data["breed_1"].str.contains(
    r".*doodle", regex=True, na=False, case=False
)
print(f"Number of doodle dogs: {doodle_mask.sum()}")
# convert them to mixed breed if they are pure breeds
dog_data.loc[doodle_mask, "is_pure_breed"] = False
dog_data.loc[doodle_mask, "standard_2"] = "poodle"
dog_data.loc[doodle_mask, "mixed_type"] = "BB"
# breed_1 was lowercased above, so compare against a lowercase prefix
dog_data.loc[doodle_mask, "standard"] = dog_data.loc[doodle_mask, "breed_1"].apply(
    lambda x: "golden retriever" if x.lower().startswith("g") else "labrador retriever"
)
dog_data[doodle_mask].sample(3)

In [None]:
# Calculate total dogs per owner and reporting_year
dog_data["pet_count"] = dog_data.groupby(["owner_id", "reporting_year"])["breed_1"].transform(
    "count"
)
print(f"Dataset now has {dog_data.shape[0]} rows and {dog_data.shape[1]} columns\n")


##### Missing Values
At first glance the dataset appears to have no missing values, but closer inspection shows that placeholder values stand in for them. We replace these with `NaN` so that they are not mistaken for real values. Only a few rows are affected in `subdistrict`, `dog_age`, and `district`, so we simply drop those rows. For the remaining column with missing values, `age_group_10` (the dog owner's age group), we:

- flag the affected rows in `age_group_missing`.
- fill the gaps from the same owner's other reporting years where possible, using that owner's most common age group.
- set any values that are still missing to `-1`.


Finally, we create `age_group_20`, grouping ages into 20-year increments, approximating a generation's length.
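
The filling logic can be sketched in plain Python, independent of the notebook's dataframes. The record layout (a list of dicts), the function name `fill_owner_age_groups`, and the `MISSING` constant are assumptions made for this illustration. The notebook itself performs the same steps with pandas in the cells below.

```python
from collections import Counter, defaultdict

MISSING = -1  # sentinel for owner age groups that cannot be recovered


def fill_owner_age_groups(records):
    """Fill missing age_group_10 values from the owner's other records.

    `records` is a list of dicts with keys "owner_id" and "age_group_10"
    (None when missing). Returns new dicts with the filled age_group_10,
    an age_group_missing flag, and a 20-year bin in age_group_20.
    """
    # Collect the known age groups for each owner across all reporting years.
    known = defaultdict(Counter)
    for rec in records:
        if rec["age_group_10"] is not None:
            known[rec["owner_id"]][rec["age_group_10"]] += 1

    filled = []
    for rec in records:
        age = rec["age_group_10"]
        was_missing = age is None
        if was_missing:
            counts = known.get(rec["owner_id"])
            # Use the owner's most common known age group, else the sentinel.
            age = counts.most_common(1)[0][0] if counts else MISSING
        # Floor to the nearest multiple of 20; keep the sentinel unchanged.
        age_20 = MISSING if age == MISSING else (age // 20) * 20
        filled.append({
            **rec,
            "age_group_10": age,
            "age_group_missing": int(was_missing),
            "age_group_20": age_20,
        })
    return filled
```

Keeping the `age_group_missing` flag alongside the filled value means later analyses can still exclude or separately examine the imputed rows.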


In [None]:
display(
    dog_data.describe(include="all")
    .T.sort_values(by="unique")
    .infer_objects(copy=False)
    .fillna("")
)

In [None]:
# Create a list of subdistricts to be used for validation
subdistricts_list = subdistrict_gdf["subdistrict"].unique().tolist()

# Define a dictionary of conditions and corresponding columns to be updated
conditions = {
    "dog_size": dog_data["dog_size"] == "UN",
    "age_group_10": dog_data["age_group_10"] > 100,
    "district": dog_data["district"] > 12,
    "dog_age": dog_data["dog_age"] > 30,
    "subdistrict": ~dog_data["subdistrict"].isin(subdistricts_list),
}

# Identify and print unique breeds with 'UN' dog size
un_breeds = dog_data.loc[conditions["dog_size"], "breed_1"].unique()
print(f"Dogs breeds of those missing dog_size data:\n{un_breeds}")

# Replace 'UN' dog size with 'K' and other invalid values with NaN
for column, condition in conditions.items():
    dog_data.loc[condition, column] = "K" if column == "dog_size" else np.nan

# Display the number of NaN values in each column
print("\nNumber of NaN values in each column:")
print(dog_data.isna().sum().sort_values(ascending=False))

In [None]:
# Drop the few rows with missing dog_age, district, or subdistrict
dog_data = dog_data.dropna(subset=["dog_age", "district", "subdistrict"])

In [None]:
dog_data.columns
dog_data.info()
dog_data.sample(3)

In [None]:
# convert the numerical columns which had NaN values to int
dog_data["dog_age"] = dog_data["dog_age"].astype(int)
dog_data["district"] = dog_data["district"].astype(int)

In [None]:
# Create an indicator variable for missing 'age_group_10' values
dog_data["age_group_missing"] = dog_data["age_group_10"].isna().astype(int)

# Fill in the missing 'age_group_10' values
dog_data["age_group_10"] = dog_data["age_group_10"].fillna(
    dog_data.groupby("owner_id")["age_group_10"].transform(
        lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan
    )
)

dog_data["age_group_10"] = dog_data["age_group_10"].fillna(-1).astype(int)
dog_data["age_group_20"] = dog_data["age_group_10"].apply(
    lambda x: -1 if x == -1 else (x // 20) * 20
)

##### Consolidated Dog Data preprocessing
All of the steps applied to the dog dataset above are combined into the `preprocess_dog_data` function.

In [None]:
breeds_df

In [None]:
# Data obtained from the first notebook
# fci_breeds = pd.read_json("../data/fci_breeds.json")
# fci_breeds = pd.read_json("/content/fci_breeds.json")



# def get_translation_dict(data, column):
#     """Returns a dataframe with the unique values in the column and their translations"""
#     df = data.copy()
#     data_to_translate = df[column].str.lower().unique()

#     return translate_list_to_dict(data_to_translate)


@memory.cache
def get_breed_standard(dict_of_breeds_translations, agency_breeds_df=fci_breeds):
    """Find the breed standard for each breed in the column"""
    breeds_data = pd.DataFrame.from_dict(dict_of_breeds_translations, orient="index").reset_index().rename(columns={'index': 'breed_de', 0: 'breed_en'})

    # apply fuzzy matching to the breed column to get the standardized breed name
    # create the 'standard' column and fill it with None
    breeds_data["standard"] = None
    nan_mask = breeds_data["standard"].isna()
    # Match each column for breed standardization

    for col in ['breed_de', 'breed_en']:
        matched_value = hf.apply_fuzzy_matching_to_breed_column(
            breeds_data.loc[nan_mask], col, agency_breeds_df, [fuzz.WRatio])

        breeds_data.loc[nan_mask, "standard"] = matched_value[nan_mask]
        nan_mask = breeds_data["standard"].isna()

    # Update the standard column for specific cases
    breeds_data.loc[nan_mask, "standard"] = breeds_data.loc[nan_mask, "breed_en"]
    # Special cases
    breeds_data.loc[breeds_data["breed_de"] == "elo", "standard"] = "elo"
    breeds_data.loc[breeds_data["breed_de"] == "keine", "standard"] = "none"
    breeds_data.loc[breeds_data["breed_de"] == "mischling", "standard"] = "hybrid"
    return breeds_data


def get_doodle_fix(data):
    """Correct doodle dogs to standard entries"""
    df = data.copy()
    # Create mask to filter out the doodle dogs
    doodle_mask = df["breed_1_de"].str.contains(r".*doodle",
                                                regex=True,
                                                na=False,
                                                case=False)

    # convert them to mixed breed if they are pure breeds
    df.loc[doodle_mask, "is_pure_breed"] = False
    df.loc[doodle_mask, "standard_2"] = "poodle"

    df.loc[doodle_mask, "mixed_type"] = "BB"
    # breed_1_de is lowercase, so compare against a lowercase prefix
    df.loc[doodle_mask, "standard"] = df.loc[doodle_mask, "breed_1_de"].apply(
        lambda x: "golden retriever"
        if x.lower().startswith("g") else "labrador retriever")
    return df


@memory.cache
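def drop_concealed_nans(data):
    """Replace placeholder values that conceal missing data.

    Unknown dog sizes ("UN") are set to "K"; out-of-range values in the
    other columns are set to NaN. The caller drops the affected rows later.
    """
    df = data.copy()
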
    nan_conditions = {
        "dog_size": df["dog_size"] == "UN",
        "age_group_10": df["age_group_10"] > 100,
        "district": df["district"] > 12,
        "dog_age": df["dog_age"] > 30,
    }
    for column, condition in nan_conditions.items():
        df.loc[condition, column] = "K" if column == "dog_size" else np.nan
    return df


# define a function which does all of the preprocessing steps
@memory.cache
def preprocess_dog_data(data, **kwargs):
    """Preprocess the Zurich dog data"""

    df = data.copy()
    df = df.rename(columns=zurich_dog_data_column_name_translations)
    df["owner_id"] = df["owner_id"].astype("string").str.zfill(6)

    df["dog_age"] = df["dog_age"].astype(int)
    df["district"] = df["district"].astype(int)
    df["subdistrict"] = df["subdistrict"].astype("string").str.zfill(3)

    # expand to 1 dog on each row
    df = df.loc[df.index.repeat(df["number_of_dogs"])]
    df = df.drop("number_of_dogs", axis=1)
    df = df.reset_index(drop=True)
    # sub in the translated values
    df["mixed_type"] = df["mixed_type"].map(mixed_type_dict)
    df['subdistrict'] = df['subdistrict'].apply(lambda x: None if x not in subdistricts_list else x)
    # Create the binary columns
    df["is_pure_breed"] = df["mixed_type"].eq("PB")
    df['is_designer_breed'] = df['mixed_type'].eq('BB')

    df["is_male_owner"] = df["owner_gender"] == 1
    df["is_male_dog"] = df["dog_gender"] == 1
    df = df.drop(columns=["owner_gender", "dog_gender"])

    df['dog_color'] = df['dog_color'].str.lower()
    df = df.merge(dog_color_df,
                  left_on="dog_color",
                  right_on="dog_color",
                  how="left")

    df["breed_1_de"] = df["breed_1"].str.lower().apply(remove_accents)
    # df["breed_1_de"] = df["breed_1_de"].apply(remove_accents)
    df["breed_2_de"] = df["breed_2"].str.lower().apply(remove_accents)
    # df["breed_2_de"] = df["breed_2_de"].apply(remove_accents)

    breeds_df = pd.DataFrame()
    breeds_df = get_breed_standard(breeds_dict)

    df = df.merge(
        breeds_df.drop(columns=["breed_en"]),
        how='left',
        left_on="breed_1_de",
        right_on="breed_de",
        suffixes=("", "_1"),
    )
    df = df.merge(
        breeds_df.drop(columns=["breed_en"]),
        how='left',
        left_on="breed_2_de",
        right_on="breed_de",
        suffixes=("", "_2"),
    )

    df = get_doodle_fix(df)

    df = drop_concealed_nans(df)
    df['is_small_dog'] = df['dog_size'].eq('K')

    df = df.dropna(subset=["dog_age", "district", "subdistrict"])

    df["dog_age"] = df["dog_age"].astype(int)
    df["district"] = df["district"].astype(int)
    # get the pet count
    df["pet_count"] = df.groupby(
        ["owner_id","reporting_year"])["breed_1"].transform("count")
    # Create an indicator variable for missing 'age_group_10' values
    df["age_group_missing"] = df["age_group_10"].isna().astype(int)
    df["age_group_10"] = df["age_group_10"].fillna(
        df.groupby("owner_id")["age_group_10"].transform(
            lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan))
    # fill in the missing 'age_group_10' values and create the age_group_20 column
    df["age_group_10"] = df["age_group_10"].fillna(-1).astype(int)
    df["age_group_20"] = df["age_group_10"].apply(lambda x: -1
                                                  if x == -1 else (x // 20) * 20)

    return df

In [None]:
get_breed_standard(breeds_dict)


In [None]:
dog_data_columns_to_keep = [
    "reporting_year",
    "owner_id",
    "dog_size",
    "dog_age",
    "age_group_10",
    "age_group_20",
    "mixed_type",
    "is_pure_breed",
    "is_designer_breed",
    "is_male_owner",
    "is_male_dog",
    "is_small_dog",
    "dog_color_en",
    "standard",
    "standard_2",
    "pet_count",
    "district",
    "subdistrict",
    "age_group_missing",
]

In [None]:
# query each year separately then combine them
# dog_reporting_year_dict = {
#     reporting_year:
#     preprocess_dog_data(
#         hf.query_for_time_period(
#             hf.sanitize_df_column_names(pd.read_csv(zurich_dog_data_link)),
#             start_year=reporting_year,
#             end_year=reporting_year + 1,
#             year_col="date_date_year",
#         ), ) for reporting_year in range(2015, 2024)
# }
dog_data = pd.DataFrame()
dog_data = preprocess_dog_data(pd.read_csv(zurich_dog_data_link))
dog_reporting_year_dict = {
    reporting_year:
    dog_data.query('reporting_year == @reporting_year') for reporting_year in range(2015, 2024)
}



In [None]:
dog_data = pd.concat(dog_reporting_year_dict.values())[dog_data_columns_to_keep]

dog_data_to_2020 = pd.concat({
    reporting_year: df
    for reporting_year, df in dog_reporting_year_dict.items() if reporting_year <= 2020
}.values())[dog_data_columns_to_keep]

In [None]:
dog_data_to_2020.is_small_dog.value_counts()

In [None]:
# dog_data_train.to_csv("../data/processed_dog_data_train.csv", index=False)
hf.save_to_data(dog_data_to_2020, "processed_dog_data_to_2020.csv")
hf.save_to_data(dog_data, "processed_dog_data.csv")

In [None]:
# get the dog density dataframe
dog_density_pivot = create_dog_density_df(dog_data, area_df)
hf.save_to_data(dog_density_pivot, "dog_density_pivot.csv")

In [None]:

# get the dog density pivot for all the years
all_dog_densities = create_dog_density_df(dog_data, area_df)

# convert to long format
dog_densities_long = all_dog_densities.reset_index().melt(
    id_vars='subdistrict', var_name='year', value_name='dog_density')

# define the player widget
player = pnw.Player(name='Year',
                    start=2015,
                    end=2023,
                    value=2015,
                    step=1,
                    width=600,
                    interval=5000)


# @pn.cache(max_items=10)
@pn.depends(player.param.value)
def dog_density_chloropeth(query_year):
    """Returns a choropleth map of the dog densities"""
    poly_opts = dict(height=400,
                    width=400,
                    cmap='greens',
                    color_levels=[0, 30, 60, 90, 120, 150],
                    line_width=2,
                    line_color='gray',
                    color='dog_density',
                    colorbar=True,
                    tools=['hover'],
                    xaxis='bare',
                    yaxis='bare',
                    active_tools=['box_zoom'],
                    colorbar_position='bottom',
                    backend_opts={'toolbar.autohide': True})
    # filter the dog densities for the year
    dog_density_year = dog_densities_long.query('year == @query_year')
    # merge in the area dataframe to get the geometry
    dog_density_year = area_gdf[['geometry','subdistrict']].merge(dog_density_year)
    # plot the choropleth map
    return gv.Polygons(dog_density_year).opts(
        **poly_opts, title=f"Dog Density per Sub-District for {query_year}")


# combine the player widget and the choropleth map
# map_panel = pn.pane.HoloViews(dog_density_chloropeth)
dog_map_panel = pn.panel(dog_density_chloropeth)
# pn.Column(player, dog_map_panel)

#### Zurich Population Dataset

In [None]:
zurich_pop_data_column_name_translations = {
    'StichtagDatJahr': 'reporting_year',
    'AlterVSort': 'age_from_sorting',
    'AlterVCd': 'age',
    'AlterVKurz': 'age_from_short_form',
    'AlterV05Sort': 'age_from_05_sorting',
    'AlterV05Cd': 'age_from_05_code',
    'AlterV05Kurz': 'age_from_05_short_form',
    'AlterV10Cd': 'age_group_10',
    'AlterV10Kurz': 'age_from_10_short_form',
    'AlterV20Cd': 'age_group_20',
    'AlterV20Kurz': 'age_from_20_short_form',
    'SexCd': 'gender_code',
    'SexLang': 'gender_long',
    'SexKurz': 'gender',
    'KreisCd': 'district',
    'KreisLang': 'district_long',
    'QuarSort': 'subdistrict_sort',
    'QuarCd': 'subdistrict',
    'QuarLang': 'neighborhood',
    'HerkunftSort': 'origin_sort',
    'HerkunftCd': 'origin',
    'HerkunftLang': 'origin_long',
    'AnzBestWir': 'pop_count'
}





In [None]:
def update_xaxis_age_group_10(plot, element):
    """Hook to update the x-axis ticker on the plot."""
    plot.state.xaxis.ticker = FixedTicker(ticks=list(range(0, 100,10)))


zurich_pop_data = pd.read_csv(zurich_pop_link)
zurich_pop_data = zurich_pop_data.rename(columns=zurich_pop_data_column_name_translations)
zurich_pop_data['is_male'] = zurich_pop_data['gender_code'] == 1
zurich_pop_data["is_swiss"] = zurich_pop_data['origin'] == 1
zurich_pop_data['district'] = zurich_pop_data['district'].astype(str).str.zfill(2)
zurich_pop_data['subdistrict'] = zurich_pop_data['subdistrict'].astype(str).str.zfill(3)

pop_columns = [
    'reporting_year',
    'age',
    'age_group_10',
    'age_group_20',
    'is_male',
    'is_swiss',
    'pop_count',
    'district',
    'subdistrict',
    'neighborhood',
]


zurich_pop_data_pivot = zurich_pop_data[pop_columns].groupby(['reporting_year','age_group_10'])['pop_count'].sum().reset_index().pivot(
    index='age_group_10',
    columns='reporting_year',
    values='pop_count'
).fillna(0).astype(int)
zurich_pop_data_pivot.T.hvplot.heatmap(height=600, width=1000).opts(active_tools=['box_zoom'], title="Population count by Age Group and Reporting Year")

In [None]:
zurich_pop_data['u20'] = zurich_pop_data['age_group_10'] < 20
zurich_pop_data['u10'] = zurich_pop_data['age_group_10'] < 10

total_pop = zurich_pop_data.groupby(['reporting_year','subdistrict'])['pop_count'].sum().reset_index(name='pop_count')
child_pop = zurich_pop_data[zurich_pop_data['u20']].groupby(['reporting_year','subdistrict'])['pop_count'].sum().reset_index(name='u20_pop_count')
infant_pop = zurich_pop_data[zurich_pop_data['u10']].groupby(['reporting_year','subdistrict'])['pop_count'].sum().reset_index(name='u10_pop_count')
child_pop = child_pop.merge(infant_pop, on=['reporting_year', 'subdistrict'], how='left')

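# merge in the totals so each share is computed on matching (reporting_year, subdistrict) rows
child_pop = child_pop.merge(total_pop, on=['reporting_year', 'subdistrict'], how='left')
child_pop['u20_pop_pct'] = child_pop['u20_pop_count'] / child_pop['pop_count']
child_pop['u10_pop_pct'] = child_pop['u10_pop_count'] / child_pop['pop_count']
child_pop = child_pop.drop(columns='pop_count')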



In [None]:
child_pop_change = child_pop.set_index(['reporting_year', 'subdistrict'])[['u20_pop_count', 'u10_pop_count']].groupby(['subdistrict']).diff()
child_pop_pct_change = child_pop.set_index(['reporting_year', 'subdistrict'])[['u20_pop_count', 'u10_pop_count']].groupby(['subdistrict']).pct_change()
child_pop = child_pop.merge(child_pop_change, on=['reporting_year', 'subdistrict'], how='left', suffixes=('', '_change'))
child_pop = child_pop.merge(child_pop_pct_change, on=['reporting_year', 'subdistrict'], how='left', suffixes=('', '_pct_change'))

# round the pct and pct_change columns to 4dp
child_pop = child_pop.round({'u20_pop_count_pct_change': 4, 'u10_pop_count_pct_change': 4, 'u20_pop_pct': 4, 'u10_pop_pct': 4})

child_pop = child_pop.query('reporting_year >= 2015')
child_pop

In [None]:
pop_layout = hv.Layout()

for year in range(2015, 2023):
    pop_layout += zurich_pop_data_pivot[year].hvplot.bar(bar_width=25).opts(active_tools=['box_zoom'], hooks=[update_xaxis_age_group_10])

pop_layout.cols(3)

In [None]:
zurich_pop_data_pivot_delta =  zurich_pop_data_pivot.T.diff().T.drop(columns=[1993])

pop_delta_layout = hv.Layout()

for year in range(2015, 2023):
    pop_delta_layout += zurich_pop_data_pivot_delta[year].hvplot.bar(
        bar_width=25, grid=True
    ).opts(active_tools=['box_zoom'], hooks=[update_xaxis_age_group_10])

pop_delta_layout.cols(3)

In [None]:
# combine all of the processing of the population data into a single function call
@memory.cache
def preprocess_pop_data(data):
    """Returns a preprocessed population dataframe"""
    # combine the above cell into a single function call
    df = data.copy()
    df = df.rename(columns=zurich_pop_data_column_name_translations)
    df['is_male'] = df['gender_code'] == 1
    df["is_swiss"] = df['origin'] == 1
    df['district'] = df['district'].astype(str).str.zfill(2)
    df['subdistrict'] = df['subdistrict'].astype(str).str.zfill(3)
    return df




In [None]:
def update_xaxis_year(plot, element):
    """Hook to update the x-axis ticker on the plot."""
    plot.state.xaxis.ticker = FixedTicker(ticks=list(range(2015, 2023)))

# pop_data = preprocess_pop_data(pd.read_csv(zurich_pop_link)).query('reporting_year >= 2015')

# get a pivot of the pop data with the subdistrict as the index and the reporting_year as the columns
pop_data_pivot = zurich_pop_data.query('reporting_year >= 2015').groupby([
    'reporting_year', 'subdistrict'
])['pop_count'].sum().reset_index().pivot(index='subdistrict',
                                          columns='reporting_year',
                                          values='pop_count')

# get the pop density pivot using the area dataframe
pop_density_pivot = pop_data_pivot.div(
    area_df.set_index('subdistrict')['subd_area_km2'], axis=0).round(2)
# pop_density_pivot

# normalize the data to plot an area chart
pop_density_pivot_norm = pop_density_pivot.div(pop_density_pivot.sum(axis=0),
                                               axis=1)

pop_data_pivot_norm = pop_data_pivot.div(pop_data_pivot.sum(axis=0), axis=1)
pop_overlay = pop_data_pivot_norm.T.hvplot(
    line_width=1, ) * pop_data_pivot_norm.T.round(4).hvplot.scatter(
        alpha=0.6,
        size=10,
        title='Normalized Population Data per Sub-District',
        legend=False)

pop_overlay.opts(
    show_legend=False,
    height=600,
    width=800,
    show_grid=True,
    xlabel = '',
    hooks=[update_xaxis_year],
)

In [None]:
pn.state.clear_caches()
pn.state.kill_all_servers()
zurich_pop_data

In [None]:
pop_data = zurich_pop_data.query('reporting_year >= 2015')
pop_data_year_subdistrict =  pop_data.groupby(['reporting_year', 'subdistrict'])['pop_count'].sum().reset_index()
pop_data_year_subdistrict.sort_values(by=['subdistrict', 'reporting_year'])


In [None]:


# @pn.cache(max_items=10)
@pn.depends(player.param.value)
def people_density_chloropeth(query_year):
    """Returns a choropleth map of the population densities"""
    poly_opts = dict(height=400,
                    width=400,
                    cmap='greens',
                    color_levels=[0, 1500, 3000, 4500, 6000, 7500],
                    line_width=2,
                    line_color='gray',
                    color='pop_density',
                    colorbar=True,
                    tools=['hover'],
                    xaxis='bare',
                    yaxis='bare',
                    active_tools=['box_zoom'],
                    colorbar_position='bottom',
                    backend_opts={'toolbar.autohide': True})
    # query for that year; years after 2022 fall back to the 2022 population figures
    query_year = min(query_year, 2022)
    pop_data_subdistrict = pop_data_year_subdistrict.query('reporting_year == @query_year')
    # merge the area dataframe and get the pop density
    pop_area_df = area_gdf.merge(pop_data_subdistrict, on='subdistrict')
    pop_area_df['pop_density'] = pop_area_df['pop_count'] / pop_area_df['subd_area_km2']
    pop_polygon = gv.Polygons(pop_area_df)

    pop_polygon.opts(
        **poly_opts, title=f"Pop Density per Sub-District for {query_year}")
    return pop_polygon


# combine the player widget and the choropleth map
pop_map_panel = pn.pane.HoloViews(people_density_chloropeth)




map_panel = pn.panel(gts.EsriImagery * gv.Polygons(zurich_map_gdf).opts(height=800, width=800, fill_alpha=0, xaxis='bare', yaxis='bare', line_width=2, line_color='white')  )
pn.Column(player,
          pn.Row(
          pn.Column(dog_map_panel, pop_map_panel),
          map_panel)
)



In [None]:
# save the processed population data to data folder
# pop_data.to_csv("../data/processed_pop_data.csv", index=False)
hf.save_to_data(pop_data, "processed_pop_data.csv")

In [None]:
def calculate_avg_dogs_owned(dataframe):
    """Calculates the average dogs owned per owner."""
    # copy so the new columns are not written to a slice of the caller's dataframe
    dataframe = dataframe.drop_duplicates(subset=['owner_id', 'reporting_year']).copy()
    dataframe['cumulative_pet_count'] = dataframe.groupby('owner_id')['pet_count'].cumsum()
    dataframe['avg_dogs_owned'] = (dataframe['cumulative_pet_count'] / dataframe['years_a_dogowner']).round(2)
    return dataframe['avg_dogs_owned']



In [None]:
# Get the dog owner percentage ratio for each of the subdistricts for each of the years
dogowner_data_year_subdistrict = dog_data.groupby(['reporting_year', 'subdistrict'])['owner_id'].nunique().rename('dogowner_count').reset_index()
dog_data_year_subdistrict = dog_data.groupby(['reporting_year', 'subdistrict'])['owner_id'].size().rename('dog_count').reset_index()
small_dog_year_subdistrict = dog_data.groupby(['reporting_year', 'subdistrict'])['is_small_dog'].sum().rename('small_dog_count').reset_index()

reporting_year_subdistrict_counts = (
    pop_data_year_subdistrict
    .merge(dogowner_data_year_subdistrict)
    .merge(dog_data_year_subdistrict)
    .merge(small_dog_year_subdistrict)
)

reporting_year_subdistrict_counts['dogowner_ratio'] = (reporting_year_subdistrict_counts['dogowner_count'] / reporting_year_subdistrict_counts['pop_count']).round(4)
reporting_year_subdistrict_counts['people_per_dog'] = (reporting_year_subdistrict_counts['pop_count'] / reporting_year_subdistrict_counts['dog_count']).astype(int)
reporting_year_subdistrict_counts['small_dog_ratio'] = (reporting_year_subdistrict_counts['small_dog_count'] / reporting_year_subdistrict_counts['dog_count']).round(4)
reporting_year_subdistrict_counts



In [None]:

dog_data['first_year_registered'] = dog_data.groupby(['owner_id'])['reporting_year'].transform('min')
dog_data['years_a_dogowner'] = dog_data.groupby('owner_id')['reporting_year'].transform(lambda x: x.rank(method='dense')).astype(int)
dog_data['is_new_owner'] = dog_data['years_a_dogowner'] == 1
dog_data['is_returning_owner'] = dog_data['years_a_dogowner'] > 1

dog_data


In [None]:

dog_data_unique = dog_data.drop_duplicates(subset=['owner_id', 'reporting_year']).copy()
dog_data_unique['cumulative_pet_count'] = dog_data_unique.groupby('owner_id')['pet_count'].cumsum()
dog_data_unique['avg_dogs_owned'] = (dog_data_unique['cumulative_pet_count'] / dog_data_unique['years_a_dogowner']).round(2)
dog_data_unique


In [None]:

if 'avg_dogs_owned' in dog_data.columns:
    dog_data = dog_data.drop(columns=['avg_dogs_owned'])
dog_data = dog_data.merge(dog_data_unique[['owner_id', 'reporting_year', 'avg_dogs_owned']], on=['owner_id', 'reporting_year'], how='left')
dog_data


In [None]:
dog_owners_df = dog_data_unique[['owner_id', 'reporting_year', 'years_a_dogowner', 'is_new_owner', 'is_returning_owner', 'avg_dogs_owned', 'subdistrict']]
dog_owners_df = dog_owners_df.groupby(['reporting_year', 'subdistrict'], as_index=False).agg({'avg_dogs_owned': 'mean', 'is_new_owner': 'sum', 'is_returning_owner': 'sum', 'years_a_dogowner': 'mean'})
dog_owners_df = dog_owners_df.rename(
    columns={
        'is_new_owner': 'new_owner_count',
        'is_returning_owner': 'returning_owner_count',
        'years_a_dogowner': 'avg_owner_experience',
        'avg_dogs_owned': 'avg_dogs_per_owner',
        }
)
dog_owners_df = dog_owners_df.round({'avg_dogs_per_owner': 2, 'avg_owner_experience': 2})
dog_owners_df


In [None]:
# reporting_year_subdistrict_counts = reporting_year_subdistrict_counts.merge(dog_owners_df, on=['reporting_year', 'subdistrict'])
# reporting_year_subdistrict_counts = reporting_year_subdistrict_counts.merge(child_pop, on=['reporting_year', 'subdistrict'])
reporting_year_subdistrict_counts

#### Zurich Income Dataset


In [None]:
zurich_income_data.sample()

zurich_income_data_column_names_translations={
    'StichtagDatJahr': 'reporting_year',
    'QuarCd': 'subdistrict',
    'QuarLang': 'neighborhood',
    'QuarSort': 'subdistrict_sort',
    'SteuerTarifSort': 'tax_rate_sort',
    'SteuerTarifCd': 'tax_rate_code',
    'SteuerTarifLang': 'tax_status',
    'SteuerEinkommen_p50': 'median_income',
    'SteuerEinkommen_p75': 'upper_q_income',
    'SteuerEinkommen_p25': 'lower_q_income',

}
zurich_tax_status_translations = {
    'Grundtarif': 'Basic tariff',
    'Verheiratetentarif': 'Married tariff',
    'Einelternfamilientarif': 'Single-parent family tariff',
    }

zurich_income_data = zurich_income_data.rename(columns=zurich_income_data_column_names_translations)

In [None]:
# Extract unique values from 'tax_tariff_long' column and convert to list
# tax_tariff_long_de = zurich_income_data.tax_tariff_long.unique().tolist()

# Translate the list to a dictionary using a helper function
# tax_tariff_long_translated = translate_list_to_dict(tax_tariff_long_de)

# Display the translated dictionary for verification
# display(tax_tariff_long_translated)

# Translate the German tax status labels in the 'tax_status' column to English
zurich_income_data["tax_status"] = zurich_income_data['tax_status'].map(zurich_tax_status_translations)



# Create a dictionary mapping old column names to new ones
# income_data_column_mapping = {
#     "quar_lang": "neighborhood",
#     "date_date_year": "reporting_year",
#     "tax_income_p_50": "median_income",
#     "tax_income_p_25": "lower_q_income",
#     "tax_income_p_75": "upper_q_income",
# }



zurich_income_data["subdistrict"] = (
    zurich_income_data["subdistrict"].astype(int).astype("string").str.zfill(3)
)
zurich_income_data["district"] = zurich_income_data["subdistrict"].str[:2].astype(int)


# Define a list of columns of interest for the final dataframe
columns_of_interest_income = [
    "neighborhood",
    "reporting_year",
    "district",
    "subdistrict",
    "tax_status",
    "median_income",
    "lower_q_income",
    "upper_q_income",
]


display(
    zurich_income_data.describe(include="all")
    .T.sort_values(by="unique")
    .infer_objects(copy=False)
    .fillna("")
)

##### Handling Missing Income Data

The income dataset only extends to 2020 because of the tax data evaluation process. To fill in the missing data for 2021 and 2022, we explored two strategies:

- Using all available data from 1999 onwards
- Applying a log transformation to the data from 1999 onwards

We assessed these strategies using various metrics, including mean absolute error, mean absolute percentage error, median absolute error, mean squared error, mean squared logarithmic error, and R2 score.

We employed an auto ARIMA model to predict the values for 2021 and 2022. We tested the model's accuracy by predicting the values for 2019 and 2020 using the income data from 1999 to 2018 and the log-transformed income data from the same period.

The log-transformed data provided more accurate predictions, as indicated by lower mean absolute error, mean absolute percentage error, and median absolute error. Therefore, we chose this approach to predict the missing income data for 2021 and 2022 for the 34 neighborhoods.
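
The back-testing idea can be sketched without the ARIMA machinery. The example below holds out the last two points of a series, fits a straight-line trend to the raw values and to their logarithms, and compares the mean absolute error of the two forecasts. The linear trend is a deliberately simple stand-in for the auto ARIMA model, the series is synthetic (3% growth per step) and not taken from the income data, and the helper names `fit_line` and `backtest_mae` are hypothetical.

```python
import math


def fit_line(ys):
    """Least-squares slope and intercept for ys observed at x = 0, 1, 2, ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def backtest_mae(series, horizon=2, use_log=False):
    """Forecast the last `horizon` points from the earlier ones; return the MAE."""
    train, test = series[:-horizon], series[-horizon:]
    values = [math.log(v) for v in train] if use_log else train
    slope, intercept = fit_line(values)
    preds = [slope * x + intercept for x in range(len(train), len(series))]
    if use_log:
        # back-transform so errors are measured on the original scale
        preds = [math.exp(p) for p in preds]
    return sum(abs(p - t) for p, t in zip(preds, test)) / horizon


# Synthetic series with constant percentage growth (illustrative only).
series = [50.0 * 1.03 ** t for t in range(22)]
mae_raw = backtest_mae(series)
mae_log = backtest_mae(series, use_log=True)
```

For a series with constant percentage growth the log-scale fit is close to exact, so `mae_log` comes out smaller than `mae_raw`. On real data neither transformation is guaranteed to win, which is why the comparison is made on held-out years.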

In [None]:
income_from_1999 = (
    zurich_income_data[columns_of_interest_income]
    .groupby(["reporting_year", "subdistrict"])[
        ["median_income", "lower_q_income", "upper_q_income"]
    ]
    .median()
    .round(3)
    .reset_index()
)
income_from_1999

In [None]:
def create_pivot(df, column):
    """Create a pivot table and a pivot table of the natural logarithm of the specified column."""
    df[f"lg_{column}"] = np.log(df[column])
    pivot = (
        df[["subdistrict", "reporting_year", column]]
        .pivot(index="reporting_year", columns="subdistrict", values=column)
        .asfreq("YS")
    )
    lg_pivot = (
        df[["subdistrict", "reporting_year", f"lg_{column}"]]
        .pivot(index="reporting_year", columns="subdistrict", values=f"lg_{column}")
        .asfreq("YS")
    )
    return pivot, lg_pivot


# Convert the 'reporting_year' column to datetime format
income_from_1999["reporting_year"] = pd.to_datetime(income_from_1999["reporting_year"], format="%Y")

# Create pivot tables
median_income_pivot_from_1999, lg_median_income_pivot_from_1999 = create_pivot(
    income_from_1999, "median_income"
)
lower_q_income_pivot_from_1999, lg_lower_q_income_pivot_from_1999 = create_pivot(
    income_from_1999, "lower_q_income"
)
upper_q_income_pivot_from_1999, lg_upper_q_income_pivot_from_1999 = create_pivot(
    income_from_1999, "upper_q_income"
)

In [None]:
print("Pivot table of the natural logarithm of the median income:")
display(lg_median_income_pivot_from_1999.tail())

print("\nPivot table of the median income:")
median_income_pivot_from_1999.tail()

In [None]:
my_metrics = [
    metrics.mean_absolute_error,
    metrics.mean_absolute_percentage_error,
    metrics.median_absolute_error,
    metrics.mean_squared_error,
    metrics.mean_squared_log_error,
    metrics.r2_score,
]


def calculate_metrics(actual, predicted):
    actual_values = actual.loc[predicted.index].values.ravel()
    predicted_values = predicted.values.ravel()
    return {
        metric.__name__: metric(actual_values, predicted_values)
        for metric in my_metrics
    }


calculate_metrics_partial = partial(calculate_metrics, median_income_pivot_from_1999)


def convert_to_long_format(df, value_name="value"):
    """Converts a DataFrame from wide to long format."""
    return (
        df.unstack()
        .reset_index()
        .rename(columns={"level_0": "subdistrict", "level_1": "reporting_year", 0: value_name})
    )


def plot_arima_forecast(long_df, value_name, vline=2020, **kwargs):
    """Plots the value_name column of a Dataframe in long format."""
    forecast_color = kwargs.get("forecast_color", "gray")
    non_forecast_color = kwargs.get("non_forecast_color", "gray")

    v_line = hv.VLine(x=pd.to_datetime(f"{vline}")).opts(
        color="red", line_dash="dotted"
    )
    forecast_df = long_df.loc[long_df["reporting_year"] >= f"{vline}"]
    forecast_line = forecast_df.hvplot(
        x="reporting_year",
        y=value_name,
        by="subdistrict",
        color=forecast_color,
        line_dash="dashed",
        legend=False,
    )
    non_forecast_df = long_df.loc[long_df["reporting_year"] <= f"{vline}"]
    non_forecast_line = non_forecast_df.hvplot(
        x="reporting_year", y=value_name, by="subdistrict", color=non_forecast_color
    )
    return non_forecast_line * forecast_line * v_line

In [None]:
# Arima models to assess the forecast of the median income using log values
lg_from_1999_pred_last_2 = hf.forecast_arima(
    lg_median_income_pivot_from_1999, 2019, n_periods=2, model_desc="Log Model 1999"
)
# Arima models to assess the forecast of the median income
from_1999_pred_last_2 = hf.forecast_arima(
    median_income_pivot_from_1999, 2019, n_periods=2, model_desc="From 1999"
)

In [None]:
metrics_df = pd.DataFrame(
    {
        "From 1999": calculate_metrics_partial(from_1999_pred_last_2),
        "lg From 1999": calculate_metrics_partial(lg_from_1999_pred_last_2.map(np.exp)),
    }
)
metrics_df

The log-transformed data from 1999-2018 yielded the more accurate predictions of 2019 and 2020 median income across the 34 sub-districts, with lower mean absolute error (MAE) and lower mean absolute percentage error (MAPE). MAPE is especially relevant because it measures each error relative to the income level being predicted.
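
As a small worked example of the difference between the two metrics, the sketch below uses two hypothetical sub-districts whose predictions both miss by the same absolute amount. The numbers are made up for illustration and are not taken from the income data.

```python
import numpy as np

# Hypothetical actual and predicted median incomes for two sub-districts.
actual = np.array([50.0, 100.0])
predicted = np.array([55.0, 105.0])

abs_err = np.abs(actual - predicted)

# MAE treats both misses identically: each is off by 5 units.
mae = float(abs_err.mean())

# MAPE scales each miss by the actual value, so the lower-income
# sub-district (10% off) weighs more than the higher-income one (5% off).
mape = float((abs_err / actual).mean())

print(f"MAE: {mae:.2f} | MAPE: {mape:.3%}")
```

Here the MAE is 5 while the MAPE is 7.5%, the average of a 10% and a 5% miss.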

We will now apply the same strategy, using the `auto_arima` algorithm again to estimate the unknown median, lower-quartile, and upper-quartile values for 2021 and 2022. This gives all our datasets the same range of `reporting_year` values, which makes them easy to align.

In [None]:
def forecast_and_convert(df, start_year, periods, model_desc, column_name):
    """Forecast using autoarima and convert to long format."""
    lg_pred = hf.forecast_arima(
        df, start_year, n_periods=periods, model_desc=model_desc
    )
    long_format = convert_to_long_format(lg_pred.map(np.exp), column_name)
    return long_format


# Forecast and convert to long format
long_format_median_income_21_22 = forecast_and_convert(
    lg_median_income_pivot_from_1999, 2021, 2, "Log Model 1999", "median_income"
)
long_format_lower_q_income_21_22 = forecast_and_convert(
    lg_lower_q_income_pivot_from_1999,
    2021,
    2,
    "Log lower_q Model 1999",
    "lower_q_income",
)
long_format_upper_q_income_21_22 = forecast_and_convert(
    lg_upper_q_income_pivot_from_1999,
    2021,
    2,
    "Log upper_q Model 1999",
    "upper_q_income",
)

In [None]:
# merge the three forecasted dataframes
income_forecast_df = long_format_median_income_21_22.merge(
    long_format_lower_q_income_21_22, on=["subdistrict", "reporting_year"]
).merge(long_format_upper_q_income_21_22, on=["subdistrict", "reporting_year"])

income_forecast_df.head()

In [None]:
# Concatenate the forecasted dataframes with the original dataframe
income_from_1999_with_forcasted = pd.concat(
    [income_from_1999, income_forecast_df], axis=0
)[["reporting_year", "subdistrict", "median_income", "lower_q_income", "upper_q_income"]]

# Align the reporting_year years with the dog_data
income_from_1999_with_forcasted = hf.query_for_time_period(
    income_from_1999_with_forcasted, year_col='reporting_year'
)

In [None]:
# save the processed income data to data folder
# income_from_1999_with_forcasted.to_csv("../data/processed_income_data.csv", index=False)
hf.save_to_data(income_from_1999_with_forcasted, "processed_income_data.csv")

In [None]:
income_whole_numbers = income_from_1999_with_forcasted.set_index(
    ['reporting_year', 'subdistrict']).astype(int).reset_index()
income_whole_numbers['reporting_year'] = income_whole_numbers['reporting_year'].dt.year
# add jitter to the reporting_year column
income_whole_numbers['reporting_year_jittered'] = income_whole_numbers[
    'reporting_year'] + np.random.normal(0, 0.1, size=len(income_whole_numbers))
income_whole_numbers.hvplot.scatter(
    x='reporting_year_jittered',
    y=['lower_q_income', 'median_income', 'upper_q_income'],
    size=10,
    by='subdistrict',
    xlabel='',
    title='Income Data per Sub-District',
    xticks=[2014, 2016, 2018, 2020, 2022],
    height=400,
    width=600,
    hover_cols=['subdistrict'],
    legend='top',
    alpha=0.8).opts(backend_opts={'toolbar.autohide': True})

In [None]:
(
    plot_arima_forecast(
        income_from_1999_with_forcasted,
        "median_income",
        vline=2020,
        non_forecast_color="blue",
    ).opts(height=800, active_tools=["box_zoom"])
    * plot_arima_forecast(
        income_from_1999_with_forcasted,
        "upper_q_income",
        vline=2020,
        non_forecast_color="green",
    )
).opts(height=800, active_tools=["box_zoom"], title="Median Income and Upper Quartile Income", show_legend=False)

#### Zurich Household Dataset

For the household datasets we first rename some of the columns so that they are more readable and consistent with the other datasets. We then process it to obtain an average household size per neighborhood, weighted by the number of households.
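
The weighted average described above reduces to total residents divided by total households. The sketch below shows this on hypothetical counts for a single sub-district and year. The column names follow the renamed household columns, but the values are invented, and `resident_count` is derived here from the household size only for illustration (in the real data it comes from the source column `AnzBestWir`).

```python
import pandas as pd

# Hypothetical counts for one sub-district in one year.
toy = pd.DataFrame(
    {
        "reporting_year": [2022, 2022, 2022],
        "subdistrict": ["011", "011", "011"],
        "household_size": [1, 2, 3],
        "household_count": [10, 20, 10],
    }
)
# Residents per size class = household size x number of such households.
toy["resident_count"] = toy["household_size"] * toy["household_count"]

# Sum residents and households per (year, sub-district), then divide.
totals = toy.groupby(["reporting_year", "subdistrict"])[
    ["resident_count", "household_count"]
].sum()
totals["avg_household_size"] = totals["resident_count"] / totals["household_count"]
print(totals)
```

With 80 residents in 40 households, the average household size is 2.0, which equals the mean of the sizes 1, 2 and 3 weighted by their household counts.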

In [None]:
# Show one random row; display() is needed because this is not the cell's last expression
display(zurich_household_data.sample())

# Define a dictionary to map old column names to new ones
zurich_household_data_columns_translations = {
    'StichTagDatJahr': 'reporting_year',
    'QuarSort': 'subdistrict',
    'QuarLang': 'neighborhood',
    'KreisSort': 'district',
    'KreisLang': 'district_long',
    'hh_groesseSort': 'household_size',
    'hh_groesseLang': 'household_size_long',
    'AnzHH': 'household_count',
    'AnzBestWir': 'resident_count',
}

# Rename the columns
zurich_household_data = zurich_household_data.rename(columns=zurich_household_data_columns_translations)

In [None]:


# Create new columns
zurich_household_data["subdistrict"] = (
    zurich_household_data["subdistrict"].astype("string").str.zfill(3)
)
zurich_household_data["district"] = (
    zurich_household_data["subdistrict"].str[:2].astype(int)
)
zurich_household_data["household_size"] = (
    zurich_household_data["household_size"].astype("string").str.zfill(2)
)

# Create a dataframe with only the columns of interest
columns_of_interest_household = [
    "neighborhood",
    "reporting_year",
    "district",
    "subdistrict",
    "household_size",
    "household_count",
    "resident_count",
]
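# Take an explicit copy so that later column assignments do not trigger SettingWithCopyWarning
hh_data = zurich_household_data[columns_of_interest_household].copy()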

# Align the reporting_year years with the dog_data
# hh_data = hf.query_for_time_period(hh_data)

# Display dataframe info and first 10 rows
hh_data.info()
hh_data.head(10)

In [None]:
hh_size_data = (
    hh_data[['reporting_year', 'subdistrict', 'household_size', 'household_count']]
    .pivot(index=['reporting_year', 'subdistrict'], columns='household_size', values='household_count').add_prefix('hh_')
    # .reset_index()
    .sort_values(by=['subdistrict', 'reporting_year'])
)
hh_size_data['hh_24'] = hh_size_data['hh_02'] + hh_size_data['hh_03'] + hh_size_data['hh_04']
hh_size_data['hh_56'] = hh_size_data['hh_05'] + hh_size_data['hh_06']


In [None]:

hh_size_data_changes = hh_size_data.groupby(['subdistrict']).diff()
hh_reporting_year_subdistrict =  hh_size_data.merge(hh_size_data_changes, on=['reporting_year', 'subdistrict'], suffixes=('', '_change')).reset_index()
hh_reporting_year_subdistrict.head()


In [None]:
comb_df = (
    reporting_year_subdistrict_counts
    .merge(dog_owners_df, on=['reporting_year', 'subdistrict'])
    .merge(child_pop, on=['reporting_year', 'subdistrict'])
    .merge(hh_reporting_year_subdistrict, on=['reporting_year', 'subdistrict'])
    .query('reporting_year > 2015')
    # .set_index(['reporting_year', 'subdistrict'])
)
comb_df.describe().T


In [None]:
cols_for_pairplot = ['hh_01_change', 'hh_24_change', 'hh_56_change',
                     'u10_pop_count_change', 'u20_pop_count_change', 'u10_pop_count_pct_change', 'u20_pop_count_pct_change',
                     'returning_owner_count', 'new_owner_count', 'avg_dogs_per_owner', 'people_per_dog',
                     'small_dog_ratio',
                     'reporting_year']

sns.pairplot(comb_df[cols_for_pairplot], kind='reg', corner=True, plot_kws={'scatter_kws': {'s': 1}} )

In [None]:

comb_df[cols_for_pairplot].corr().round(2).hvplot.heatmap(width=800, height=800, cmap='coolwarm_r', ).opts(xrotation=90, active_tools=['box_zoom'], symmetric=True )

In [None]:
comb_df.sample()

In [None]:
def highly_correlated_cols(df, threshold=0.95):
    """
    Identifies highly correlated columns in a DataFrame.
    """
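    # Restrict to numeric columns; string columns such as 'subdistrict' would otherwise raise
    corr_matrix = df.corr(numeric_only=True).abs()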
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
    return to_drop

highly_correlated_cols(comb_df)

In [None]:
# Calculate total residents per subdistrict
hh_data["total_residents"] = hh_data.groupby(["reporting_year", "subdistrict"])[
    "resident_count"
].transform("sum")

# Calculate the total households per subdistrict
hh_data["total_households"] = hh_data.groupby(["reporting_year", "subdistrict"])[
    "household_count"
].transform("sum")

# Average household size
hh_data["avg_household_size"] = hh_data["total_residents"] / hh_data["total_households"]

# Share of the subdistrict's residents living in each household-size class
hh_data["resident_portion"] = hh_data["resident_count"] / hh_data["total_residents"]
# household_data

hh_grouped_data = (
    hh_data.groupby(["reporting_year", "subdistrict", "neighborhood"])[
        ["avg_household_size", "total_households"]
    ]
    .mean()
    .reset_index()
)
hh_grouped_data.info()
hh_grouped_data.describe(include="all").T.sort_values(by="unique").infer_objects(
    copy=False
).fillna("")

In [None]:

hh_data = hh_data.sort_values(by=['subdistrict', 'household_size', 'reporting_year'])
# Year-over-year change in household count within each (subdistrict, household_size) group
hh_data['hh_count_delta'] = hh_data.groupby(['subdistrict', 'household_size'])['household_count'].diff().fillna(0).astype(int)



In [None]:
# plot a stacked bar chart for the reporting_year  2022 for each subdistrict
# the size of the bar is proportional to the resident_portion, color by household_size
hh_data_2022 = hh_data.query('reporting_year == 2022').reset_index()
hh_data_2022

In [None]:
pn.state.clear_caches()
pn.state.kill_all_servers()

In [None]:
# barh_chart_player = pnw.Player(name='Year',value=2016,start=2015,end=2022,step=1,width=500,interval=2000)
bar_chart_player = pnw.Player(value=2016,start=2015,end=2022,step=1,width=400,interval=2000)

hh_size = pnw.ToggleGroup(
    behavior='check',
    options = ['01', '02', '03', '04', '05', '06'],
    value=['01'],
    button_style = 'outline',
    width=400)

unique_household_sizes = hh_data['household_size'].unique()

hh_colors = list(cc.glasbey)
hh_color_mapping = {size: color for size, color in zip(unique_household_sizes, hh_colors)}



In [None]:
hh_color_mapping

In [None]:

hh_bar_opts = dict(
    cmap=hh_color_mapping,
    stacked=True,
    xlabel='',
    height=400,
    width=1000,
    alpha=0.6,
    legend='right',
    hover_cols=['subdistrict'],
    line_color='white',
    grid=True
)


@pn.depends(bar_chart_player.param.value, hh_size.param.value)
def household_count_bar(query_year, hh_size):
    """Returns a bar chart of the household count per sub-district for the specified year."""
    hh_data_year = hh_data.copy()
    hh_data_year = hh_data_year[hh_data_year['reporting_year'] == query_year].copy()
    hh_data_year = hh_data_year[hh_data_year['household_size'].isin(hh_size)]
    return hh_data_year.hvplot.bar(
        x='subdistrict',
        y='household_count',
        by='household_size',
        title=f'Household Count per Sub-District | {query_year}',
        **hh_bar_opts
    ).opts(backend_opts={'toolbar.autohide': True}, active_tools=['box_zoom'], )


# put the bar chart on a panel
household_count_panel = pn.panel(household_count_bar)


In [None]:

@pn.depends(bar_chart_player.param.value, hh_size.param.value)
def hh_count_delta_bar(query_year, hh_size):
    """Returns a bar chart of the change in household count per sub-district for the specified year."""
    hh_data_copy = hh_data.copy()
    hh_data_year_1 = hh_data_copy[hh_data_copy['reporting_year'] == query_year]
    hh_data_year_1 = hh_data_year_1[hh_data_year_1['household_size'].isin(hh_size)]
    hh_data_delta_bars =  hh_data_year_1.hvplot.bar(
        x='subdistrict',
        y='hh_count_delta',
        by='household_size',
        title=f'Change in Household Count per Sub-District | {query_year}',
        **hh_bar_opts
    ).redim(reporting_year='year', hh_count_delta='Change from Prior Year', subdistrict='tract', hh_size='household_size')
    hh_data_delta_bars.opts(backend_opts={'toolbar.autohide': True}, active_tools=['box_zoom'], )
    # add a hline for the xaxis
    zero_line = hv.HLine(y=0).opts(color='black', line_width=0.5)

    return hh_data_delta_bars * zero_line


# put the bar chart on a panel
household_delta_count_panel = pn.panel(hh_count_delta_bar)


In [None]:
# Use representative_point() so that each label lies inside the polygon it identifies
label_points = subdistrict_gdf['geometry'].representative_point()
subdistrict_gdf['label_pos_x'] = label_points.x
subdistrict_gdf['label_pos_y'] = label_points.y




In [None]:
# Get the GeoDataFrame for the subdistrict labels

subdistrict_labels_gdf = z_gdf['z_gdf_0']
subdistrict_labels_gdf = subdistrict_labels_gdf.rename(columns={'kuerzel': 'subdistrict'})
subdistrict_labels_gdf['subdistrict'] = subdistrict_labels_gdf['subdistrict'].str.zfill(3)
subdistrict_labels_gdf.head()

In [None]:
subdistrict_labels_gdf['x'] = subdistrict_labels_gdf.geometry.x
subdistrict_labels_gdf['y'] = subdistrict_labels_gdf.geometry.y


subdistrict_labels = hv.Labels(
    {
        ('x', 'y'): subdistrict_labels_gdf[['x', 'y']].values.tolist(),
        'text': subdistrict_labels_gdf['subdistrict'].values.tolist()
    },
    ['x', 'y'], 'text'
)
subdistrict_labels.opts(
        opts.Labels(text_font_size='10pt')
    )


In [None]:


neighborhood_map = hv.Polygons(subdistrict_gdf, vdims=['subdistrict']).opts(
    xlabel='',
    ylabel='',
    xaxis='bare',
    yaxis='bare',
    width=800,
    height=800,
    fill_alpha=0.5,
    cmap=['white'],
    tools=['hover'],
    active_tools=['box_zoom'],
    title="Location of the Sub-Districts"
)
subdistrict_panel = pn.panel(neighborhood_map * subdistrict_labels)


# hv.help(gv.Labels)
# subdistrict_labels

pn.Row(
    pn.Column(
    pn.Row(bar_chart_player, hh_size),
   household_count_panel, household_delta_count_panel),
    subdistrict_panel
)


In [None]:
# save the processed household data to data folder
# hh_data.to_csv("../data/processed_household_data.csv", index=False)
hf.save_to_data(hh_grouped_data, "processed_household_data.csv")

#### Merged Datasets
Now that all the datasets share some common columns, we can merge them to see whether anything stands out. Since every processed file is saved to the data folder, we can simply load them from there.
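
The merge relies on every table carrying the same `reporting_year` and `subdistrict` keys in the same format. A minimal sketch with two hypothetical miniature tables shows why the sub-district codes are zero-padded before merging. The table contents are invented, and only the key names match the notebook.

```python
import pandas as pd

# Hypothetical miniature versions of two processed tables.
dogs = pd.DataFrame(
    {
        "reporting_year": [2021, 2021, 2022],
        "subdistrict": [11, 12, 11],  # leading zeros lost in a CSV round-trip
        "dog_count": [30, 45, 32],
    }
)
pop = pd.DataFrame(
    {
        "reporting_year": [2021, 2022],
        "subdistrict": ["011", "011"],
        "pop_count": [1000, 1010],
    }
)

# Restore the 3-character key so that both tables use the same format.
dogs["subdistrict"] = dogs["subdistrict"].astype("string").str.zfill(3)
pop["subdistrict"] = pop["subdistrict"].astype("string").str.zfill(3)

# An inner merge keeps only (year, sub-district) pairs present in both tables.
merged = dogs.merge(pop, on=["reporting_year", "subdistrict"], how="inner")
print(merged)
```

Sub-district `012` drops out because it has no population row. An inner merge silently discards such unmatched keys, so it is worth comparing row counts before and after merging.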

In [None]:
# Load processed data from CSV files
subdistrict_gdf = gpd.read_file("../data/zurich_neighborhoods.geojson")
districts_gdf = gpd.read_file("../data/zurich_districts.geojson")
processed_dog_data = pd.read_csv("../data/processed_dog_data.csv")
processed_pop_data = pd.read_csv("../data/processed_pop_data.csv")
processed_income_data = pd.read_csv("../data/processed_income_data.csv")
processed_household_data = pd.read_csv("../data/processed_household_data.csv")

# Display the last 5 rows of the processed dog data
display(processed_dog_data.tail())

# Pad 'subdistrict' column with leading zeros to make it 3 digits
processed_dog_data["subdistrict"] = (
    processed_dog_data["subdistrict"].astype("string").str.zfill(3)
)

# Group dog data by 'reporting_year' and 'subdistrict' and calculate the size of each group
grouped_dog_data = processed_dog_data.groupby(["reporting_year", "subdistrict"])
dog_count_to_merge = grouped_dog_data.size().rename("dog_count").reset_index()

# Count unique 'owner_id' in each group
owner_count_to_merge = (
    grouped_dog_data["owner_id"].nunique().rename("owner_count").reset_index()
)

# Count the number of small dogs (dog_size='K') in each group
small_dog_count_to_merge = (
    processed_dog_data.loc[processed_dog_data["dog_size"] == "K"]
    .groupby(["reporting_year", "subdistrict"])
    .size()
    .rename("small_dog_count")
    .reset_index()
)
# Count the number of pure breed dogs in each group
pure_breed_count_to_merge = (
    processed_dog_data.loc[processed_dog_data["is_pure_breed"]]
    .groupby(["reporting_year", "subdistrict"])
    .size()
    .rename("pure_breed_count")
    .reset_index()
)
# Count the number of male owners in each group
male_owner_count_to_merge = (
    processed_dog_data.loc[processed_dog_data["is_male_owner"] == True]
    .groupby(["reporting_year", "subdistrict"])["owner_id"]
    .nunique()
    .reset_index(name="male_owner_count")
)
# Pad 'subdistrict' column in population data and group by 'reporting_year' and 'subdistrict'
processed_pop_data["subdistrict"] = (
    processed_pop_data["subdistrict"].astype("string").str.zfill(3)
)
pop_to_merge = (
    processed_pop_data.groupby(["reporting_year", "subdistrict"])
    .agg({"pop_count": "sum"})
    .reset_index()
)

# Pad 'subdistrict' column in income data, truncate 'reporting_year' to 4 digits, and group by 'reporting_year' and 'subdistrict'
processed_income_data["subdistrict"] = (
    processed_income_data["subdistrict"].astype("string").str.zfill(3)
)
processed_income_data["reporting_year"] = processed_income_data["reporting_year"].str[:4].astype(int)
income_to_merge = (
    processed_income_data.groupby(["reporting_year", "subdistrict"])
    .agg({"median_income": "mean", "lower_q_income": "mean", "upper_q_income": "mean"})
    .round(3)
    .reset_index()
)

# Pad 'subdistrict' column in household data and group by 'reporting_year' and 'subdistrict'
processed_household_data["subdistrict"] = (
    processed_household_data["subdistrict"].astype("string").str.zfill(3)
)
hh_to_merge = (
    processed_household_data.groupby(["reporting_year", "subdistrict"])
    .agg({"avg_household_size": "mean", "total_households": "mean"})
    .round(3)
    .reset_index()
)

# Merge all the grouped data into a single DataFrame
z_subd_merged = (
    dog_count_to_merge.merge(owner_count_to_merge)
    .merge(male_owner_count_to_merge)
    .merge(small_dog_count_to_merge)
    .merge(pure_breed_count_to_merge)
    .merge(pop_to_merge)
    .merge(income_to_merge)
    .merge(hh_to_merge)
    .merge(subdistrict_gdf[["subdistrict", "district", "subd_area_km2"]])
)

In [None]:
z_subd_merged["small_dog_frac"] = round(
    z_subd_merged["small_dog_count"] / z_subd_merged["dog_count"], 3
)
# Add in the owner to population ratio
z_subd_merged["owner_pop_ratio"] = (
    z_subd_merged["owner_count"] / z_subd_merged["pop_count"]
)


# Add in the geometry data and subd_area_km2 for density calculations
z_subd_merged["dog_subd_density"] = (
    z_subd_merged["dog_count"] / z_subd_merged["subd_area_km2"]
)
z_subd_merged["hh_subd_density"] = (
    z_subd_merged["total_households"] / z_subd_merged["subd_area_km2"]
)
z_subd_merged["pop_subd_density"] = (
    z_subd_merged["pop_count"] / z_subd_merged["subd_area_km2"]
)
z_subd_merged["owner_subd_density"] = (
    z_subd_merged["owner_count"] / z_subd_merged["subd_area_km2"]
)


z_subd_merged_2015 = hf.query_for_time_period(z_subd_merged, 2015, 2016, year_col="reporting_year")
z_subd_merged.head(5)

##### Dimensionality Reduction: UMAP vs PCA

When dealing with high-dimensional data, dimensionality reduction techniques like UMAP (Uniform Manifold Approximation and Projection) and PCA (Principal Component Analysis) are often used. These techniques transform the data into a lower-dimensional space, making it easier to visualize and analyze.

##### PCA (Principal Component Analysis)
**PCA** is a linear technique that focuses on preserving the global structure of the data, that is, the directions of greatest overall variance. It is computationally efficient, but a projection onto only two or three components may miss nonlinear relationships in the data.

**UMAP**, on the other hand, is a nonlinear technique that aims to preserve both the global and the local structure of the data. Local structure refers to the neighborhood relationships between nearby points, which a linear projection to 2D or 3D can distort. This makes UMAP potentially more effective than PCA for visualization, at the cost of more computation.

Here's a summary of the key differences:

| | UMAP | PCA |
|---|---|---|
| **Preserves Global Structure** | Yes | Yes |
| **Preserves Local Structure** | Yes | No |
| **Computational Intensity** | High | Low |
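
To make the "linear" part of PCA concrete, the sketch below computes it from scratch with NumPy's singular value decomposition on hypothetical two-dimensional data. It is meant only as an illustration of the technique; the notebook itself uses scikit-learn's `PCA`. No UMAP example is given here because it needs the separate `umap-learn` package.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 points that vary mostly along a single direction.
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 0.5 * latent]) + 0.05 * rng.normal(size=(200, 2))

# PCA = centre the data, then take the singular value decomposition.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Squared singular values are proportional to the variance along each axis.
explained_variance_ratio = S**2 / np.sum(S**2)

# Coordinates of every point on the principal axes (a linear projection).
scores = X_centered @ Vt.T

print(explained_variance_ratio)
```

Every principal component is a fixed linear combination of the original columns (a row of `Vt`), which is what makes PCA fast and also what limits it when the structure in the data is curved.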

In [None]:
# Declare the Panel widgets for the UMAP and clustering parameters
n_neighbors_slider = pnw.IntSlider(
    value=15, start=5, end=100, step=5, width=400, name="n_neighbors"
)
my_clusters_slider = pnw.IntSlider(
    value=5, start=2, end=35, step=1, width=400, name="n_clusters"
)
min_dist_button = pnw.RadioButtonGroup(options=[0.1, 0.2, 0.4, 0.7], value=0.2)
# List of values to try for n_neighbors and min_dist
n_neighbors_values = list(range(5, 51, 5))
min_dist_values = [0.1, 0.2, 0.4, 0.7]

In [None]:
from sklearn.preprocessing import OneHotEncoder


# X_set = z_subd_merged.copy()
X_set = comb_df.copy()

columns_to_ohe = ["reporting_year"]

ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
for column in columns_to_ohe:
    # Convert the column to string so that each value becomes its own category
    X_set[column] = X_set[column].astype(str)
    encoded_features = ohe.fit_transform(X_set[column].values.reshape(-1, 1))
    df_encoded_features = pd.DataFrame(
        encoded_features, columns=ohe.categories_[0], index=X_set.index
    )
df_encoded_features

In [None]:
# comb_df.isna().sum()
# X_set
comb_df

In [None]:
# List the columns to be used on PCA analysis
columns_to_use = [
    "dog_subd_density",
    "hh_subd_density",
    "pop_subd_density",
    "owner_subd_density",
    "dog_count",
    "pop_count",
    "owner_count",
    "male_owner_count",
    "pure_breed_count",
    "owner_pop_ratio",
    "total_households",
    "median_income",
    "avg_household_size",
    "small_dog_frac",
    "small_dog_count",
    "subd_area_km2",
    # "lower_q_income",
    # "upper_q_income",
]

# State the range for the number of clusters to be used for the KMeans algorithm
cluster_range = range(2, 20)

# scale the data
scaler = StandardScaler()
# This is for just one year, so 2015 only
# scaled_data = scaler.fit_transform(z_subd_merged_2015[columns_to_use])

# scaled_data = scaler.fit_transform(z_subd_merged[columns_to_use])
cols_for_pca = comb_df.columns.tolist()
cols_for_pca.remove("subdistrict")
cols_for_pca.remove("reporting_year")
scaled_data = scaler.fit_transform(comb_df[cols_for_pairplot])

scaled_data_df = pd.DataFrame(scaled_data, columns=cols_for_pairplot, index=comb_df.index)
X_set = pd.concat([scaled_data_df, df_encoded_features], axis=1)
pca = PCA()
x_pca = pca.fit_transform(X_set)


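# Create the DataFrame for the PCA data
pca_df = pd.DataFrame(
    x_pca,
    columns=[f"PC{i+1}" for i in range(x_pca.shape[1])],
    index=X_set.index,
)
# Attach the subdistrict label (aligned on the shared index) so it can be shown on hover
pca_df["subdistrict"] = comb_df["subdistrict"]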
# Create the plot of 1st and 2nd principal components
pca_plot = pca_df.hvplot.scatter(
    x="PC1",
    y="PC2",
    title="PCA of Zurich Sub-Districts",
    hover_cols=["subdistrict"],
    width=500,
    height=500,
    colorbar=True,
    grid=True
).opts(active_tools=['box_zoom'])

# Create the explained variance plot
explained_variance = pca.explained_variance_ratio_
cumulative_explained_variance = np.cumsum(explained_variance)
explained_variance_df = pd.DataFrame(
    {
        "Principal Component": range(1, len(explained_variance) + 1),
        "Explained Variance": explained_variance,
        "Cumulative Explained Variance": cumulative_explained_variance,
    }
)

# Explained variance plot
ev_plot = explained_variance_df.hvplot(
    x="Principal Component",
    y=["Explained Variance", "Cumulative Explained Variance"],
    # kind="bar",
    title="Explained Variance by Principal Component",
    width=500,
    height=500,
    shared_axes=False,
).opts(active_tools=['box_zoom'],legend_position="top_right")

# Show layout of the PCA plots
hv.Layout([pca_plot, ev_plot]).cols(2)


In [None]:
# Show both shapes; only the last expression in a cell is displayed
x_pca.shape, X_set.shape


In [None]:
# Fit the PCA model
pca = PCA()
pca.fit(scaled_data_df)
h_color = "green"
line_opts = dict(color=h_color, line_dash="dashed",
                 line_alpha=0.8, line_width=1)

# Plot the cumulative explained variance ratio with number of components
n = pca.n_components_
cvr = np.r_[0, pca.explained_variance_ratio_.cumsum()]
cvr_plot = hv.Curve(cvr)
cvr_scatter = hv.Scatter(cvr_plot).opts(size=5, color="darkgray", alpha=0.8)

# Add dashed lines at the 5th component and its corresponding cumulative variance ratio
v5_line = hv.Path([(5, 0), (5, cvr[5])]).opts(**line_opts)
h5_line = hv.Path([(0, cvr[5]), (5, cvr[5])]).opts(**line_opts)
v5_label = hv.Text(
    x=5,
    y=0,
    text=" 5 Components",
    valign="bottom",
    halign="left",
).opts(color=h_color)
h5_label = hv.Text(
    x=0,
    y=cvr[5],
    text=f" {cvr[5]:.2f}",
    halign="left",
    valign="top",
).opts(color=h_color)


hv.Overlay(cvr_plot * cvr_scatter * v5_line * h5_line * v5_label * h5_label).opts(
    title="Cumulative Explained Variance Ratio",
    xlabel="Components",
    ylabel="",
    ylim=(0, 1),
    xlim=(0, n + 1),
    height=400,
    width=400,
    tools=["hover"],
    show_legend=False,
    yticks=list(np.linspace(0, 1, 3)),
    xticks=list(range(0, n + 1, 2)),
)

We then kept only the principal components needed to account for 95% of the `cumulative explained variance ratio`. Using this reduced number of dimensions, we performed a `KMeans` cluster analysis to identify groups of `subdistricts` with similar features. Although we have 34 neighborhoods within 12 districts, the neighborhoods will not necessarily cluster along district lines. To assess the quality of the clustering, we use the `silhouette score`.
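
The 95% cut-off can be sketched in a few lines. The explained-variance ratios below are hypothetical and are not the values obtained from our data.

```python
import numpy as np

# Hypothetical explained-variance ratios, in the descending order PCA returns them.
explained_variance_ratio = np.array([0.60, 0.25, 0.12, 0.03])
cumulative = np.cumsum(explained_variance_ratio)

# argmax on a boolean array gives the index of the first True value;
# adding 1 converts that zero-based index into a component count.
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1

print(cumulative, n_components_95)
```

Here two components reach only 85% of the variance, so three are kept. scikit-learn can make the same selection directly when `PCA` is given a float between 0 and 1 as `n_components` with `svd_solver="full"`.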

In [None]:
# transform the data
pca_data = pca.transform(scaled_data_df)

pca_data_df = pd.DataFrame(pca_data, columns=[f"PC{i}" for i in range(1, n + 1)])


# Get the number of components that explain at least 95% of the variance
num_components = np.where(cvr >= 0.95)[0][0]

# Select the first `num_components` columns
pca_data_df_reduced = pca_data_df.iloc[:, :num_components].copy()

print(f"{pca_data_df_reduced.shape[1]} components explain 95% of the variance")
pca_data_df_reduced.head()

pca_data_widget = pnw.DataFrame(pca_data_df_reduced, name="PCA Data")


@pn.depends(
    num_clusters=my_clusters_slider.param.value, pca_dataset=pca_data_widget.param.value
)
def get_pca_plots(num_clusters, pca_dataset):
    """Create PCA plots for the given number of clusters."""

    cluster_labels = hf.compute_kmeans_labels(pca_dataset, num_clusters)

    clustered_dataset_df = hf.create_clustered_data_df(pca_dataset, cluster_labels)

    clustered_dataset_df = hf.add_columns(
        clustered_dataset_df, comb_df, ["subdistrict"]
    )

    plot = hf.create_scatterplot_with_origin_cross(
        clustered_dataset_df,
        x="PC1",
        y="PC2",
        title=f"K-means Clustering with k={num_clusters}",
    )
    return plot

In [None]:
@memory.cache
def calculate_clusters_scores(embeddings_dict, cluster_range=None):
    """Calculate K-means clustering scores for all embeddings in the dictionary and return a DataFrame."""
    # Create a list to store the dataframes
    score_dataframes = []
    if cluster_range is None:
        # Define the range of cluster numbers
        cluster_range = list(range(2, 20))

    for embedding_key, embeddings in tqdm(embeddings_dict.items(), desc="Calculating scores"):
        for num_clusters in cluster_range:
            kmeans_model = KMeans(n_init=20, n_clusters=num_clusters, random_state=628).fit(
                embeddings
            )
            cluster_labels = kmeans_model.labels_

            silhouette_score = round(
                metrics.silhouette_score(embeddings, cluster_labels), 3
            )
            calinski_harabasz_score = round(
                metrics.calinski_harabasz_score(embeddings, cluster_labels),
            )
            davies_bouldin_score = round(
                metrics.davies_bouldin_score(embeddings, cluster_labels), 3
            )

            # Create a DataFrame
            scores_df = pd.DataFrame(
                {
                    "embedding_key": [embedding_key],
                    "num_clusters": [num_clusters],
                    "silhouette_score": [silhouette_score],
                    "calinski_harabasz_score": [calinski_harabasz_score],
                    "davies_bouldin_score": [davies_bouldin_score],
                }
            )

            # Append the dataframe to the list
            score_dataframes.append(scores_df)

    # Concatenate all dataframes in the list into a single dataframe
    result_df = pd.concat(score_dataframes, ignore_index=True)

    return result_df


# get the scores for the PCA data

pca_scores = hf.calculate_clusters_scores({"PCA": pca_data_df_reduced}, cluster_range)
# pca_scores

In [None]:
# Define the color mapping functions
def color_max(s):
    is_max = s == s.max()
    return ["color: red" if v else "" for v in is_max]


def color_min(s):
    is_min = s == s.min()
    return ["color: blue" if v else "" for v in is_min]


# Apply the color mapping functions to the DataFrame
styled_df = pca_scores.style.apply(
    color_max, subset=["silhouette_score", "calinski_harabasz_score"]
).apply(color_min, subset=["davies_bouldin_score"])

# Display the styled DataFrame
styled_df

The scatterplot below shows how the neighborhoods are clustered, projected onto the plane of the first two principal components and colored by `cluster` label. We also report two other metrics which, although we did not rely on them, were still useful to look at.
- `Calinski-Harabasz`: ratio of the between-cluster dispersion to the within-cluster dispersion, taken over all clusters. The higher the value, the better the clustering.
- `Davies-Bouldin`: average similarity between each cluster and its most similar cluster, where similarity compares the size of the clusters with the distance between them. A lower score indicates a model with better separation between the clusters.
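
To make the two definitions concrete, here is a minimal NumPy sketch that computes both scores by hand on a toy dataset. The data, the labels, and the helper names `calinski_harabasz` and `davies_bouldin` are illustrative assumptions; the notebook itself relies on `sklearn.metrics` for these values.

```python
import numpy as np


def calinski_harabasz(X, labels):
    """Between-cluster over within-cluster dispersion, each scaled by its degrees of freedom."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    n, k = len(X), len(clusters)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))


def davies_bouldin(X, labels):
    """Mean over clusters of the largest (scatter_i + scatter_j) / centroid-distance ratio."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # Scatter: mean distance of each cluster's points to its own centroid
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(clusters)
    ])
    worst = []
    for i in range(len(clusters)):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(clusters)) if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))


# Toy data (assumption): two tight, well-separated groups of points
X_toy = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
)
good_labels = np.array([0, 0, 0, 1, 1, 1])  # matches the true grouping
bad_labels = np.array([0, 1, 0, 1, 0, 1])   # mixes the two groups

ch_good, ch_bad = calinski_harabasz(X_toy, good_labels), calinski_harabasz(X_toy, bad_labels)
db_good, db_bad = davies_bouldin(X_toy, good_labels), davies_bouldin(X_toy, bad_labels)
print(f"Calinski-Harabasz: good={ch_good:.1f}, bad={ch_bad:.1f}")
print(f"Davies-Bouldin:    good={db_good:.2f}, bad={db_bad:.2f}")
```

On this toy example the labelling that matches the true groups gets the higher Calinski-Harabasz score and the lower Davies-Bouldin score, which is the direction each metric is read in.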

In [None]:
# list the cluster_metrics to be considered
cluster_metrics = [
    "silhouette_score",
    "calinski_harabasz_score",
    "davies_bouldin_score",
]

pn.Column(my_clusters_slider, pn.pane.HoloViews(get_pca_plots))

In [None]:
# Get the best number of clusters for silhouette score
best_n_clusters = int(
    pca_scores.loc[pca_scores["silhouette_score"].idxmax(), "num_clusters"]
)
print(f"Best number of clusters Silhouette: {best_n_clusters}")

# Apply k-means clustering to the PCA data
kmeans = KMeans(n_init=20, n_clusters=best_n_clusters, random_state=628)
kmeans.fit(pca_data_df_reduced)
pca_data_df_reduced["cluster"] = kmeans.labels_

# Add the subdistrict and reporting_year columns
pca_data_df_reduced = hf.add_columns(
    pca_data_df_reduced, comb_df, ["subdistrict", "reporting_year"]
)


# Group the DataFrame by 'subdistrict' and get the set of 'cluster' for each 'subdistrict'
subdistrict_pca_cluster = (
    pca_data_df_reduced[["subdistrict", "reporting_year", "cluster"]]
    .groupby(["subdistrict"])["cluster"]
    .apply(set)
    .reset_index(name="pca_cluster_set")
)
# subdistrict_pca_cluster

In [None]:

# Calculate the number of unique clusters for each 'subdistrict' and store it in a new column
subdistrict_pca_cluster["subdistrict_cluster_count"] = subdistrict_pca_cluster[
    "pca_cluster_set"
].apply(len)

# Count the 'subdistrict' values assigned to more than one cluster across years
(subdistrict_pca_cluster["subdistrict_cluster_count"] > 1).sum()

# Take one cluster label from each set in the pca_cluster_set column
subdistrict_pca_cluster["pca_cluster"] = subdistrict_pca_cluster[
    "pca_cluster_set"
].apply(lambda x: list(x)[0])


# Plot the Clusters (colormap) and Districts(white line) and subdistricts (black (default) line)
subdistrict_gdf.merge(
    subdistrict_pca_cluster[["subdistrict", "pca_cluster"]]
).hvplot.polygons(
    aspect="equal",
    geo=True,
    tiles="EsriImagery",
    color="pca_cluster",
    # cmap="glasbey_dark",
    colormap=cc.glasbey_dark[: best_n_clusters + 1],
    hover_cols=["all"],
    title=f"PCA || {best_n_clusters} Clusters",
    xaxis="bare",
    yaxis="bare",
    colorbar=False,
    alpha=0.3,
) * gv.Polygons(
    districts_gdf
).opts(
    line_color="white", line_width=2, fill_alpha=0.02, height=600, width=600
)

We can also use these cluster labels to see how the neighborhoods are distributed on a map. The clusters do not necessarily follow the district lines (white outline), but they still show strong geographical contiguity. The clustering was based only on the non-geometry features, so `area` and the geographical coordinates, for example, were excluded. Such features make little sense for PCA here because, unlike the other features, they do not vary from year to year. Because our features also included `reporting_year`, some `subdistricts` were classified into different clusters for different years. This may be due to some nonlinearity in our system.
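
As a rough illustration of how to detect subdistricts that change cluster between years, the sketch below counts distinct cluster labels per subdistrict. The toy DataFrame and its values are assumptions made up for this example; only the column names follow the notebook.

```python
import pandas as pd

# Toy data (assumption): one row per subdistrict and reporting year
toy = pd.DataFrame(
    {
        "subdistrict": ["011", "011", "012", "012", "013", "013"],
        "reporting_year": [2015, 2016, 2015, 2016, 2015, 2016],
        "cluster": [0, 0, 1, 2, 1, 1],
    }
)

# Number of distinct clusters each subdistrict was assigned to across years
clusters_per_subdistrict = toy.groupby("subdistrict")["cluster"].nunique()

# Subdistricts that changed cluster at least once
switchers = clusters_per_subdistrict[clusters_per_subdistrict > 1].index.tolist()
print(f"{len(switchers)} subdistrict(s) changed cluster: {switchers}")
```

Comparing with `> 1` (or `!= 1`) avoids the operator-precedence pitfall of writing `~count == 1`, where `~` is applied to the integer counts before the comparison takes place.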

In [None]:
# Compute correlations between original variables and principal components

corr = pd.DataFrame(
    pca.components_.T[:, :num_components],
    columns=pca_data_df.columns[:num_components],
)
corr["variable"] = cols_for_pairplot
# corr_var_df = pd.DataFrame({'variable': cols_for_pca})
# corr = pd.concat([corr_var_df, corr], axis=1)

corr = corr.melt(id_vars="variable", var_name="PC", value_name="corr")
# plot using hvplot
corr_plot = corr.hvplot.bar(
    x="PC",
    y="corr",
    by="variable",
    width=2200,
    height=500,
    title="Correlation Between Original Variables and Principal Components",
    ylabel="",
    tools=["hover"],
).opts(active_tools=["box_zoom"])
corr_plot.opts(xrotation=90, xlabel="", gridstyle={"grid_line_color": "lightgray"})

The bar plot helps us understand the principal components in terms of the original variables.
- A high positive value means the original variable is strongly positively correlated with the principal component.
- A high negative value means the original variable is strongly negatively correlated with the principal component.

We begin to see a pattern emerging here: the 'density' features are highly correlated with other 'density' features, and likewise for the 'counts' and 'ratios' features. This is shown more clearly in the circle correlation plot.
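
One detail worth keeping in mind when reading these values: the entries of `pca.components_` are loadings (unit-length eigenvector entries). For standardized data, the correlation between a feature and a component is the loading scaled by the square root of that component's eigenvalue. The NumPy sketch below checks this relationship on synthetic data; the data and all variable names are assumptions for illustration and are not taken from the notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): three features, the first two strongly related
base = rng.normal(size=200)
X = np.column_stack([
    base + 0.1 * rng.normal(size=200),
    base + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

# Standardize each feature to zero mean and unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via eigendecomposition of the correlation matrix
corr_matrix = np.corrcoef(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr_matrix)
order = np.argsort(eigenvalues)[::-1]  # sort components by explained variance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

loadings = eigenvectors        # plays the role of pca.components_.T
scores = Z @ eigenvectors      # data projected onto the components

# For standardized data: corr(feature j, PC k) = loading[j, k] * sqrt(eigenvalue[k])
scaled_loadings = loadings * np.sqrt(eigenvalues)

# Cross-check against correlations measured directly from the scores
direct = np.array([
    [np.corrcoef(Z[:, j], scores[:, k])[0, 1] for k in range(3)]
    for j in range(3)
])
print(np.allclose(scaled_loadings, direct))
```

The sign and relative size of a loading can still be read as described above; the scaling only matters when the values are interpreted as literal correlation coefficients.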

In [None]:
correlations = pd.DataFrame(pca.components_, columns=scaled_data_df.columns).T
correlations.columns = [f"PC{i}" for i in range(1, n + 1)]
labels_df = pd.DataFrame(
    {"x": correlations["PC1"], "y": correlations["PC2"], "label": correlations.index}
)
circle_correlation = correlations.hvplot.scatter(
    x="PC1",
    y="PC2",
    width=800,
    height=500,
    title="Correlation Between Principal Components",
    hover_cols=["index"],
    xlim=(-1, 1),
)

(
    circle_correlation
    # plot the dog_subd_density point as a red color
    * labels_df.loc[labels_df["label"] == "dog_subd_density"]
    .hvplot.points()
    .opts(color="red")
    * hv.VLine(0).opts(color="gray", line_dash="dotted")
    * hv.HLine(0).opts(color="gray", line_dash="dotted")
    # * hv.Labels(labels_df, ["x", "y"], "label").opts(
    #     yoffset=-0.05, xoffset=-0.05, text_alpha=0.6
    # )
)

In [None]:
print(labels_df)

##### UMAP (Uniform Manifold Approximation and Projection)
We now do a similar analysis using UMAP. The `UMAP` class has a few more parameters to play with than `PCA`, so it can be slightly more confusing, but you gain more control over the process.

In [None]:
pn.state.clear_caches()
pn.state.kill_all_servers()

In [None]:
# Compute the UMAP embeddings dictionary, keyed by (n_neighbors, min_dist)
embeddings_dict = hf.compute_embeddings(
    scaled_data_df, n_neighbors_values, min_dist_values
)

In [None]:
@pn.cache(max_items=20)
@pn.depends(n_neighbors_slider.param.value, my_clusters_slider.param.value)
def get_umap_plot(neighbor, n_clusters):
    """Returns a HoloViews scatter plot of the UMAP embeddings."""
    plots = []
    embeddings_keys = [(neighbor, min_distance) for min_distance in min_dist_values]

    for embedding_key in embeddings_keys:
        # Retrieve the specific embeddings from the dictionary using the key
        embeddings = embeddings_dict[embedding_key]

        # Compute the cluster labels for the current embeddings
        cluster_labels = hf.compute_kmeans_labels(embeddings, n_clusters)

        # Create a DataFrame that combines the embeddings and their corresponding cluster labels
        embeddings_df = hf.create_clustered_data_df(embeddings, cluster_labels)
        embeddings_df = hf.add_columns(
            embeddings_df, z_subd_merged, ["subdistrict"]
        )

        # Generate a plot for the current embeddings and append it to the list of plots
        plot = hf.create_scatterplot_with_origin_cross(
            embeddings_df,
            title=f"UMAP || {n_clusters=} || {neighbor=} || min_dist={embedding_key[1]}",
        )
        plots.append(plot)

    return hv.Layout(plots).cols(2).opts(shared_axes=False)

In [None]:
clusters_scores_df = calculate_clusters_scores(
    embeddings_dict, cluster_range
).sort_values(by="silhouette_score", ascending=False)

# Get the rows with the best silhouette score and the best Calinski-Harabasz score
# (higher is better for both)
best_silhouette_embeddings = clusters_scores_df.loc[
    clusters_scores_df["silhouette_score"].idxmax()
]
best_calinski_harabasz_embeddings = clusters_scores_df.loc[
    clusters_scores_df["calinski_harabasz_score"].idxmax()
]
# Get the row with the best Davies-Bouldin score (lower is better, so take the minimum)
best_davies_bouldin_embeddings = clusters_scores_df.loc[
    clusters_scores_df["davies_bouldin_score"].idxmin()
]
print(
    f"""
Best Silhouette Score:{best_silhouette_embeddings["silhouette_score"]:.3f}\n
Best Calinski Harabasz Score:{best_calinski_harabasz_embeddings["calinski_harabasz_score"]:.0f}\n
Best Davies Bouldin Score:{best_davies_bouldin_embeddings["davies_bouldin_score"]:.3f}
"""
)
# display the 3 embeddings
display(best_silhouette_embeddings)
display(best_calinski_harabasz_embeddings)
display(best_davies_bouldin_embeddings)

We calculate all the embeddings beforehand and store them in a dictionary so that our calls using the widget do not have to recompute them each time.
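
The precompute-then-look-up pattern can be sketched as follows. The function `expensive_embedding` is a hypothetical stand-in for a slow UMAP fit, and the parameter grids are assumed values chosen only for illustration.

```python
from itertools import product


def expensive_embedding(n_neighbors, min_dist):
    """Hypothetical stand-in for a slow UMAP fit; returns a dummy result."""
    expensive_embedding.calls += 1
    return {"n_neighbors": n_neighbors, "min_dist": min_dist}


expensive_embedding.calls = 0

# Assumed parameter grids, for illustration only
n_neighbors_grid = [5, 15, 30]
min_dist_grid = [0.0, 0.1, 0.5]

# Compute every combination once, keyed by the (n_neighbors, min_dist) tuple
embedding_cache = {
    (n, d): expensive_embedding(n, d)
    for n, d in product(n_neighbors_grid, min_dist_grid)
}


def lookup(n_neighbors, min_dist):
    """What a widget callback would do: a dictionary lookup, no recomputation."""
    return embedding_cache[(n_neighbors, min_dist)]


# Repeated lookups do not trigger further fits
for _ in range(100):
    lookup(15, 0.1)

print(expensive_embedding.calls)  # 9: one fit per grid combination
```

The trade-off is a longer up-front computation and more memory in exchange for a responsive slider, which is usually worthwhile when the parameter grid is small.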

In [None]:
# Create panel for UMAP plot
umap_panel = pn.pane.HoloViews(get_umap_plot)
pn.Column(pn.Row(n_neighbors_slider, my_clusters_slider), umap_panel)

In [None]:
# get the embedding from embedding dict for the best silhouette score
best_silhouette_embeddings_key = best_silhouette_embeddings["embedding_key"]
best_silhouette_n_clusters = best_silhouette_embeddings["num_clusters"]

# get the actual embeddings
embeddings_of_best_silhouette = embeddings_dict[best_silhouette_embeddings_key]
print(f"Best silhouette embedding key: {best_silhouette_embeddings_key}")
print(f"Best silhouette n_clusters: {best_silhouette_n_clusters}")

# get the cluster labels for the best silhouette score
best_silhouette_cluster_labels = hf.compute_kmeans_labels(
    embeddings_of_best_silhouette, best_silhouette_embeddings["num_clusters"]
)

# see which subdistricts are in which cluster
best_silhouette_embeddings_df = hf.create_clustered_data_df(
    embeddings_of_best_silhouette, best_silhouette_cluster_labels
)
# Add in the subdistrict, district, and reporting_year columns
best_silhouette_embeddings_df = hf.add_columns(
    best_silhouette_embeddings_df, z_subd_merged, [
        "subdistrict", "district", "reporting_year"]
)

# group by cluster to see which subdistricts-reporting_year combinations are in which cluster
clusters_subdistricts = (
    best_silhouette_embeddings_df.groupby(
        ["cluster", "reporting_year"])["subdistrict"]
    .apply(list)
    .reset_index(name="subdistricts_cluster")
)
# Ensure that each subdistrict is only in one cluster
# clusters_subdistricts.head(100)

In [None]:
# Group the DataFrame by 'subdistrict' and get the set of 'cluster' for each 'subdistrict'
subdistrict_umap_cluster = (
    best_silhouette_embeddings_df.groupby("subdistrict")["cluster"]
    .apply(set)
    .reset_index(name="umap_cluster_set")
)
# subdistrict_umap_cluster

In [None]:

# Calculate the number of unique clusters for each 'subdistrict' and store it in a new column
subdistrict_umap_cluster["subdistrict_cluster_count"] = subdistrict_umap_cluster[
    "umap_cluster_set"
].apply(len)

# subdistrict_umap_cluster


In [None]:


# Count the 'subdistrict' values assigned to more than one cluster across years
(subdistrict_umap_cluster["subdistrict_cluster_count"] > 1).sum()
# Take one cluster label from each set in the umap_cluster_set column
subdistrict_umap_cluster["umap_cluster"] = subdistrict_umap_cluster[
    "umap_cluster_set"
].apply(lambda x: list(x)[0])

# Calculate the number of subdistricts in each cluster and store it in a new column
subdistrict_umap_cluster["cluster_count"] = subdistrict_umap_cluster.groupby(
    ["umap_cluster"]
)["subdistrict"].transform("count")

subdistrict_gdf.merge(
    subdistrict_umap_cluster[["subdistrict", "umap_cluster"]]
).hvplot.polygons(
    aspect="equal",
    geo=True,
    tiles="EsriImagery",
    color="umap_cluster",
    colormap=cc.glasbey_dark[: best_silhouette_n_clusters + 1],
    colorbar=False,
    hover_cols=["all"],
    xaxis="bare",
    yaxis="bare",
    alpha=0.6,
    title=f"UMAP Clusters || {best_silhouette_n_clusters} clusters",
) * gv.Polygons(
    districts_gdf
).opts(
    line_color="white", line_width=2, fill_alpha=0.02, height=600, width=600
)

In [None]:
subdistrict_umap_cluster

With **UMAP**, our `subdistricts` did not migrate to different clusters as `reporting_year` changed. This is a good sign: the algorithm picked up on this similarity without us stating it explicitly and without a feature like `area`, which would have made it more obvious because it does not vary from year to year for a single district.

**PCA** was not able to pick up on this without the `area` feature, but the choropleth maps for the two dimensionality reduction algorithms look very similar. This gives some evidence of predictability in our data, which we can look into in more detail in the exploratory data analysis phase.
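
One rough way to put a number on how similar two cluster maps are is a contingency table of the two label sets. The sketch below uses made-up labels (an assumption); only the column names `pca_cluster` and `umap_cluster` follow the notebook.

```python
import pandas as pd

# Toy labels (assumption): one row per subdistrict with a label from each method
toy_labels = pd.DataFrame(
    {
        "subdistrict": ["011", "012", "013", "014", "021", "023"],
        "pca_cluster": [0, 0, 1, 1, 2, 2],
        "umap_cluster": [2, 2, 0, 0, 1, 0],
    }
)

# Rows: PCA clusters, columns: UMAP clusters, cells: number of shared subdistricts
contingency = pd.crosstab(toy_labels["pca_cluster"], toy_labels["umap_cluster"])

# Cluster ids are arbitrary, so credit each PCA cluster with its best-overlapping UMAP cluster
agreement = contingency.max(axis=1).sum() / len(toy_labels)
print(contingency)
print(f"Share of subdistricts in matching clusters: {agreement:.2f}")
```

This overlap share is only a coarse indicator, since several PCA clusters can be credited to the same UMAP cluster, but it is a quick check before reaching for a formal agreement measure.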

In [None]:
subdistrict_umap_cluster.sort_values(by="cluster_count", ascending=False)
# best_silhouette_embeddings_df.cluster.value_counts()

In [None]:
cols_for_pairplot

In [None]:
# Check which features correlate with the New Owner count
corr = pd.DataFrame(scaled_data, columns=cols_for_pairplot).corr()
corr["new_owner_count"].sort_values().hvplot.barh(
    width=800, height=500, title="Correlation with New Owner count"
).opts(xlabel="", active_tools=["box_zoom"])
corr_new_owner_count = (
    corr["new_owner_count"]
    .sort_values()
    .hvplot.barh(width=500, height=500, title="Correlation with New Owner count")
    .opts(active_tools=["box_zoom"])
)


# Build the matching bar plot for the Returning Owner count, shown next to the New Owner count plot
corr_returning_owner_count = (
    corr["returning_owner_count"]
    .sort_values()
    .hvplot.barh(
        width=500, height=500, title="Correlation with Returning Owner Count", xlim=(None, 1)
    )
    .opts(active_tools=["box_zoom"])
)
corr_new_owner_count + corr_returning_owner_count

# Brought over
## EDA
### Load in Data

In [None]:
# from IPython.display import clear_output
# from panel import widgets as pnw  # For widgets and formatting
# import numpy as np  # For number computing
# import pandas as pd  # For data manipulation
# import panel as pn
# from bokeh.models import FixedTicker
# import holoviews as hv
from holoviews import opts
import geoviews as gv
from geoviews import tile_sources as gts
import geopandas as gpd
import hvplot.pandas  # noqa
import spatialpandas as spd
import seaborn as sns
import matplotlib.pyplot as plt
from PIL import Image, ImageDraw
from sklearn.preprocessing import StandardScaler

from tqdm import tqdm  # Progress bars
from wordcloud import WordCloud  # For generating word cloud visualizations

import helper_functions as hf

# clear_output()

In [None]:
subdistrict_gdf = gpd.read_file("../data/zurich_neighborhoods.geojson")
district_gdf = gpd.read_file("../data/zurich_districts.geojson")
# district_desc = pd.read_csv("../data/zurich_districts.csv")
dog_data_train = pd.read_csv("../data/processed_dog_data.csv")
# dog_data_train = pd.read_csv("../data/processed_dog_data_train.csv")
# Fix data types as they were lost when saving to csv
dog_data_train["owner_id"] = dog_data_train["owner_id"].astype("string").str.zfill(6)
dog_data_train["subdistrict"] = (
    dog_data_train["subdistrict"].astype("string").str.zfill(3)
)

In [None]:
pn.state.kill_all_servers()
pn.state.clear_caches()

In [None]:
poly_opts = dict(
    width=600,
    height=600,
    color_index=None,
    xaxis=None,
    yaxis=None,
    backend_opts={"toolbar.autohide": True},
)
# Neighborhood polygons
neighborhood_poly = gv.Polygons(
    subdistrict_gdf.to_crs(ccrs.GOOGLE_MERCATOR.proj4_init),
    crs=ccrs.GOOGLE_MERCATOR,
).opts(
    tools=["hover", "tap"],
    **poly_opts,
    line_color="skyblue",
    line_width=2,
    fill_color="lightgray",
    fill_alpha=0,
    line_alpha=0.5,
)


In [None]:

# district polygons
# district_gdf_desc = district_gdf.merge(district_desc)
district_poly = gv.Polygons(
    district_gdf.to_crs(ccrs.GOOGLE_MERCATOR.proj4_init),
    crs=ccrs.GOOGLE_MERCATOR,
).opts(
    **poly_opts,
    line_color="orange",
    fill_alpha=0.02,
    tools=["tap"],
    active_tools=['tap'],
    selection_alpha=0.6,
    selection_color='#008080',
    selection_line_color='#008080',
    line_width=3,
    line_alpha=1,
)

district_poly_pane = pn.pane.HoloViews(district_poly)



In [None]:
# district_nd_overlay


#### District Description Pane and Wordcloud

In [None]:
# tap_district = hv.streams.Tap(x=None, y=None, source=district_poly)


# def get_selected_district(x, y):
#     """Returns the selected district based on the x and y coordinates"""
#     return district_gdf_desc[
#         district_gdf_desc.geometry.to_crs(ccrs.GOOGLE_MERCATOR.proj4_init).contains(Point(x, y))
#     ]


# @pn.depends(tap_district.param.x, tap_district.param.y)
# def display_info(x, y):
#     """Displays a brief description of the selected district"""
#     if x is None or y is None:
#         return pn.pane.Markdown("No district selected")
#     else:
#         # Find the selected district based on the x and y coordinates
#         selected_district =  district_gdf_desc[
#             district_gdf_desc.geometry.to_crs(ccrs.GOOGLE_MERCATOR.proj4_init).contains(Point(x, y))
#         ]

#         if selected_district.empty:
#             return pn.pane.Markdown("No district selected")

#         dname = selected_district["district_name"].values[0]
#         dnum = selected_district["district"].values[0]
#         ddesc = selected_district["desc"].values[0]
#         link = selected_district["link"].values[0]
#         return pn.pane.Markdown(
#             f"""
#             <div style="
#             border: 2px solid #4a4a4a;
#             border-radius: 10px;
#             padding: 20px 20px 20px 20px;
#             background-color: #f9f9f9;
#             box-shadow: 0 4px 8px 0 rgba(0,0,0,0.2);
#             word-wrap: break-word;
#             ">
#             <h2 style='color: #008080;'>{dnum}</h2>
#             <h1 style='color: #000080;'>{dname}</h1>
#             <h3 style='color: #708090;'>{ddesc}</h3>
#             <a href="{link}" >Source</a>
#             </div>

#             """,
#             width=300,
#         )

# display_info_row = pn.Row(display_info, sizing_mode="stretch_width")


In [None]:
# pn.Row(gts.EsriWorldBoundariesAndPlacesAlternate *district_poly, display_info)
# pn.Row(district_poly_pane, display_info_row)

In [None]:
# tap_district2 = hv.streams.Tap(x=None, y=None, rename={'x': 'x2', 'y': 'y2'}, source=district_poly)

# @pn.depends(tap_district.param.x, tap_district.param.y)
# def display_wordcloud(x, y):
#     """Displays a wordcloud of the selected district based on the description
#     of the district in the shape of the district poly"""
#     if x is None or y is None:
#         text = "district select on map"
#         wordcloud = WordCloud(width=800, height=500, background_color="white").generate(
#             text
#         )
#         return hv.RGB(np.array(wordcloud)).opts(
#             width=800, height=500, active_tools=["box_zoom"],
#             title = f"x is {x}, y is {y}"
#         )

#     point = hv.Points([(x, y)]).redim(x='point_x', y='point_y')
#     x2 = point.data.point_x.values[0]
#     y2 = point.data.point_y.values[0]

#     selected_district2 =  district_gdf_desc[
#         district_gdf_desc.geometry.to_crs(ccrs.GOOGLE_MERCATOR.proj4_init).contains(Point(x2, y2))
#         ]
#     if selected_district2.empty:
#         text = f"selection not found {int(x2)} {int(y2)}"
#         wordcloud = WordCloud(width=800, height=500, background_color="white").generate(text)
#         return hv.RGB(np.array(wordcloud)).opts(
#             width=800, height=500, active_tools=["box_zoom"]
#         )

#     dname = selected_district2["district_name"].values[0]
#     dnum = selected_district2["district"].values[0]
#     ddesc = selected_district2["desc"].values[0]
#     text = f"{dnum} {dname} {ddesc}"

#     poly = selected_district2["geometry"].iloc[0]
#     print(poly.bounds)

#     # Get the bounding box of the poly
#     minx, miny, maxx, maxy = poly.bounds

#     # Calculate the width and height of the bounding box
#     margin = 0.1
#     width = (maxx - minx) * (1 + margin)
#     height = (maxy - miny) * (1 + margin)
#     # Calculate the new minimum x and y coordinates
#     minx -= width * margin / 2
#     miny -= height * margin / 2

#     # Create a new image with the same aspect ratio as the bounding box
#     image_width = 800
#     image_height = int(image_width * height / width)
#     test = Image.new("1", (image_width, image_height), 0)

#     # Convert the coordinates to a numpy array
#     coords = np.array(list(poly.exterior.coords))
#     coords -= [minx, miny]
#     coords *= [image_width / width, image_height / height]
#     coords[:, 1] = image_height - coords[:, 1]
#     # Convert the coordinates back to a list of tuples
#     scaled_coords = list(map(tuple, coords))
#     print(scaled_coords)

#     # Draw the scaled poly onto the image
#     ImageDraw.Draw(test).polygon(scaled_coords, outline=1, fill=1)

#     wordcloud = WordCloud(
#         mask=~np.array(test) * 255,
#         # color_func=lambda *args, **kwargs: breed_color,
#         include_numbers=True,
#         margin=20,
#         # contour_color=breed_color,
#         contour_width=5,
#         width=800,
#         height=500,
#         background_color="white",
#     ).generate(text)
#     return hv.RGB(np.array(wordcloud)).opts(
#             width=800,
#             height=500,
#             active_tools=["box_zoom"],
#             backend_opts={"toolbar.autohide": True},
#             title = f"{x2} {y2} {dname}")


# # wordcloud_bound = pn.bind(display_wordcloud, x = tap_district_2.param.x, y=tap_district_2.param.y)
# # wordcloud_dmap = hv.DynamicMap(wordcloud_bound, streams=[tap_district])
# # pn.Row(district_poly , pn.bind(display_wordcloud, x = tap_district.param.x, y=tap_district.param.y))

# wordcloud_pane = pn.panel(display_wordcloud)

In [None]:
# pn.Row(district_poly_pane, display_info_row, wordcloud_pane)


In [None]:

# district_layout = pn.Column(
#     pn.pane.HoloViews(display_wordcloud),
#     pn.Row(gts.EsriWorldBoundariesAndPlacesAlternate *district_poly, display_info, sizing_mode="stretch_width"),
#     )

# district_layout_card = pn.Card(
#     district_layout,
#     title="District Descript",
#     sizing_mode="stretch_width",
# )
# district_layout_card

#### Dog Count & Dog Owner count

In [None]:

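# Summary statistics for every column, sorted by number of unique values
# (not displayed, because only the last expression in a cell is shown)
(
    dog_data_train.describe(include="all")
    .T.infer_objects()
    .sort_values(by="unique")
    .fillna("")
)
# A single random row from the dog data
dog_data_train.sample().T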

In [None]:
def update_xaxis(plot, element):
    """Hook to update the x-axis ticker on the plot."""
    plot.state.xaxis.ticker = FixedTicker(ticks=list(range(2015, 2023)))


dogs_total_by_reporting_year = dog_data_train.groupby("reporting_year").size()
print(f"Total number of dogs per year:\n{dogs_total_by_reporting_year}")

total_dogs_line = dogs_total_by_reporting_year.hvplot.bar().opts(
    show_legend=False,
    title="Total Dogs Registered Each Year",
    active_tools=["box_zoom"],
    # height=500,
    # width=400,
)

dog_count_yoy_pct_change = dogs_total_by_reporting_year.pct_change().fillna(0) * 100
total_dogs_yoy_bar = dog_count_yoy_pct_change.hvplot(kind="line").opts(
    hooks=[update_xaxis],
    active_tools=["box_zoom"],
    title="YOY % Change in Dog Count",
    ylabel="%",
)

(total_dogs_line + total_dogs_yoy_bar).cols(1).opts(shared_axes=False)

The butterfly plot here shows the age distribution of male and female dogs for each year. Both genders tend to live around the same length of time, and there are slightly more male dogs than female dogs.

In [None]:
yearly_player = pnw.Player(
    name="Yearly Player",
    start=2015,
    end=2020,
    value=2020,
    step=1,
    loop_policy="loop",
    interval=3000,
)


@pn.depends(yearly_player.param.value)
def get_dog_age_butterfly_plot(reporting_year):
    """
    Decorated with @pn.depends, this function generates a butterfly plot of male
    and female dog age distributions for a given reporting year.

    Parameters:
    reporting_year (int): The reporting year to filter the dog data by.

    Returns:
    hv.Layout: Female and male dog age distribution bar plots, side by side, for the given reporting year.
    """
    # Define bar plot options
    bar_opts = dict(
        invert=True,
        height=500,
        width=500,
        bar_width=2,
        rot=90,
        xlim=(0, 24),
        xlabel="",
        yaxis="bare",
        ylabel="Count",
        grid=True,
    )
    # Filter the DataFrame for the roster
    filtered_dog_data = dog_data_train.copy()
    # filtered_dog_data = pd.read_csv("../data/processed_dog_data.csv")
    roster_dog_data = filtered_dog_data.query(f"reporting_year=={reporting_year}")
    # Filter for the is_male_dog
    male_roster_dog_data = roster_dog_data.loc[roster_dog_data["is_male_dog"]]
    male_roster_dog_data = (male_roster_dog_data.groupby(
        ["dog_age"]).size().reset_index(name="age_frequency"))
    male_roster_dog_data = male_roster_dog_data.set_index("dog_age")
    total_male = male_roster_dog_data["age_frequency"].sum()
    male_plot = male_roster_dog_data.hvplot.bar(
        **bar_opts,
        ylim=(0, 620),
        title=f"Male Dog Age Distribution || {reporting_year} || {total_male} Canines",
        color="skyblue",
    ).opts(active_tools=["box_zoom"])

    female_roster_dog_data = roster_dog_data[~roster_dog_data["is_male_dog"]]
    female_roster_dog_data = (female_roster_dog_data.groupby(
        ["dog_age"]).size().reset_index(name="age_frequency"))
    female_roster_dog_data = female_roster_dog_data.set_index("dog_age")
    total_female = female_roster_dog_data["age_frequency"].sum()
    female_roster_dog_data["age_frequency"] = (
        -1 * female_roster_dog_data["age_frequency"])
    female_plot = female_roster_dog_data.hvplot.bar(
        **bar_opts,
        ylim=(-620, 0),
        title=
        f"Female Dog Age Distribution || {reporting_year} || {total_female} Canines",
        color="pink",
    ).opts(active_tools=["box_zoom"])
    return (female_plot + male_plot).opts(shared_axes=False, )

In [None]:
dog_age_distribution_pane = pn.panel(get_dog_age_butterfly_plot)
pn.Column(
    yearly_player,
    dog_age_distribution_pane,
    sizing_mode="stretch_width",
)

In [None]:
hf.get_line_plots(
    dog_data_train.groupby("reporting_year", as_index=False)[
        "mixed_type"].value_counts(),
    x="reporting_year",
    group_by="mixed_type",
    highlight_list=["PB", "BB"],
).opts(hooks=[update_xaxis], title="Most Dogs are Pure Breeds", xlabel="", height=500, show_grid=True)

In [None]:
hf.get_line_plots(
    dog_data_train.groupby("reporting_year", as_index=False)[
        "dog_size"].value_counts(),
    x="reporting_year",
    group_by="dog_size",
    highlight_list=["K"],
).opts(hooks=[update_xaxis], title="More Dogs are Small Breeds", xlabel="", height=500, show_grid=True)

Here we see a strong uptick in the number of dog owners in their 30s and 40s. Up until 2019, owners in their 50s were the largest group of dog owners. This is a trend that we will explore further in this analysis.

In [None]:
dog_data_train.is_male_owner.value_counts()

The main uptick has been among dog owners in their 30s and 40s.

In [None]:
line_plot_opts = dict(
    height=500,
    width=600,
    hooks=[update_xaxis],
    xlabel="",
    ylabel="",
    show_grid=True,
)
highlighted_age_groups = [10, 30, 40]
dog_owner_grouped_count = dog_data_train.groupby(["reporting_year", "age_group_10"], as_index=False)[
    "owner_id"
].nunique()

female_dog_owner_grouped_count = dog_data_train[~dog_data_train["is_male_owner"]].groupby(
    ["reporting_year", "age_group_10"], as_index=False)["owner_id"].nunique()
male_dog_owner_grouped_count = dog_data_train[dog_data_train["is_male_owner"]].groupby(
    ["reporting_year", "age_group_10"], as_index=False)["owner_id"].nunique()

male_dog_owner_pop_plot = hf.get_line_plots(
    data=male_dog_owner_grouped_count,
    x="reporting_year",
    group_by="age_group_10",
    highlight_list = highlighted_age_groups,
).opts(**line_plot_opts, title="Male 30 & 40 year old Dog Owners count is trending upwards")

female_dog_owner_pop_plot = hf.get_line_plots(
    data=female_dog_owner_grouped_count,
    x="reporting_year",
    group_by="age_group_10",
    highlight_list = highlighted_age_groups,
).opts(**line_plot_opts, title="Female 30 & 40 year old Dog Owners count is trending upwards")

(female_dog_owner_pop_plot + male_dog_owner_pop_plot)
# male_dog_owner_pop_plot.opts(
#     title="Young Adults & Middle-Age Dog Owners count is trending upwards", **line_plot_opts
# )

In [None]:
dog_owners_df

In [None]:
dog_owners_df.query('reporting_year > 2015').hvplot(
    x='reporting_year', y='new_owner_count', by='subdistrict', height=800, line_alpha=0.5
).opts(legend_cols=2)

In [None]:
# compare with people population change

pop_data_grouped_count = pop_data.groupby(['reporting_year', 'age_group_10'], as_index=False)['pop_count'].sum()
people_pop_plot = hf.get_line_plots(
    data = pop_data_grouped_count, x='reporting_year', group_by='age_group_10', highlight_list = highlighted_age_groups
)

female_pop_grouped_count = pop_data[~pop_data['is_male']].groupby(['reporting_year', 'age_group_10'], as_index=False)['pop_count'].sum()
male_pop_grouped_count = pop_data[pop_data['is_male']].groupby(['reporting_year', 'age_group_10'], as_index=False)['pop_count'].sum()
female_pop_plot = hf.get_line_plots(
    data = female_pop_grouped_count, x='reporting_year', group_by='age_group_10', highlight_list = highlighted_age_groups
).opts(**line_plot_opts, title="Female Population Change")
male_pop_plot = hf.get_line_plots(
    data = male_pop_grouped_count, x='reporting_year', group_by='age_group_10', highlight_list = highlighted_age_groups
).opts(**line_plot_opts, title="Male Population Change")

(female_pop_plot + male_pop_plot)



**Do older dog owners tend to have older dogs?**

We will investigate this graphically, then in more depth.

From the violin plot below, it is clear that younger dog owners tend to have younger dogs and that older dog owners, to a lesser extent, tend to have older dogs. Despite the slight positive correlation, each owner age group spans a wide range of dog ages, with owners in their 50s having the widest range.
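
A numeric companion to the violin plot could look like the sketch below, which summarizes dog age per owner age group and computes a Pearson correlation. The toy values are assumptions invented for illustration; only the column names `age_group_10` and `dog_age` follow the dataset.

```python
import pandas as pd

# Toy data (assumption): owner age group (decade) and dog age in years
toy_dogs = pd.DataFrame(
    {
        "age_group_10": [20, 20, 20, 40, 40, 40, 60, 60, 60],
        "dog_age": [1, 2, 4, 3, 5, 9, 4, 8, 13],
    }
)

# Median and spread of dog age within each owner age group
summary = toy_dogs.groupby("age_group_10")["dog_age"].agg(["median", "min", "max"])
summary["range"] = summary["max"] - summary["min"]

# Pearson correlation between owner age group and dog age
pearson_r = toy_dogs["age_group_10"].corr(toy_dogs["dog_age"])
print(summary)
print(f"Pearson r = {pearson_r:.2f}")
```

The per-group medians show the direction of the relationship, while the range column shows how much the groups overlap, which is the same pair of observations the violin plot conveys visually.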

In [None]:
dog_data_train['age_range'] = dog_data_train['age_group_10'].astype('str') + 's'
dog_data_train["age_range"] = pd.Categorical(
    dog_data_train["age_range"],
    ordered=True,
    categories=["10s", "20s", "30s", "40s", "50s", "60s", "70s", "80s", "90s"],
)

plt.figure(figsize=(15, 5))

# Distribution of dog owners at each age group
sns.violinplot(
    x="age_range",
    y="dog_age",
    data=dog_data_train,
    order=["10s", "20s", "30s", "40s", "50s", "60s", "70s", "80s", "90s"],
    # palette='dark:#1f77b4',
    cut=0

)
plt.show()


In [None]:
year_toggle = pnw.ToggleGroup(
    button_style='outline',
    name='reporting_year',
    value=2015,
    options=list(range(2015, 2023)),
    behavior='radio'
)

@pn.depends(year_toggle.param.value)
def create_violins(year):
    year_data = dog_data_train[dog_data_train["reporting_year"] == year].sort_values(by="age_range")
    return hv.Violin(year_data, ["age_range", "is_male_dog"], "dog_age").opts(
        width=800, height=400, cut=0, split='is_male_dog',
        # inner='stick',
        ylabel='', xlabel='', ylim=(-1, 24), show_grid=True,
        active_tools=['box_zoom'],
        title=f"Violins | Dog Age Distribution | Owner Age Group | {year}"
    )

pn.Column(
    year_toggle,
    pn.pane.HoloViews(create_violins)
)

In [None]:

# Dog age by owner age group, split by owner sex, with one facet per dog sex
g = sns.FacetGrid(dog_data_train, col="is_male_dog", aspect=5, height=3, col_wrap=1)
g.map_dataframe(
    sns.violinplot,
    x="age_range",
    y="dog_age",
    hue="is_male_owner",
    split=True,
    palette='dark:#1f77b4',
    inner='stick',
    cut=0

)
g.add_legend()
# add title
plt.subplots_adjust(top=0.9)
g.fig.suptitle("Dog Age | Owner Age | Male Owner bool")
plt.show()


In [None]:

# compare the is_male_owner and the is_male_dog columns
# pd.crosstab(dog_data_train["is_male_owner"], dog_data_train["is_male_dog"]).reset_index(drop=True)
is_male_df = dog_data_train[['reporting_year', 'is_male_dog', 'is_male_owner']].value_counts().reset_index(name="count").sort_values('reporting_year').set_index('reporting_year')
for year in range(2015, 2024):
    display(year,
        is_male_df.query('reporting_year == @year').pivot(
    index='is_male_dog', columns='is_male_owner', values='count'
).style.background_gradient(cmap='Blues')
    )
    print()


Gender HeatMap

In [None]:
dog_data_train.standard.nunique()

In [None]:
# Colors
unique_breeds = dog_data["standard"].unique()

num_unique_breeds = len(unique_breeds)

# Repeat the colormap to cover all unique breeds
repeated_cmap = list(cc.glasbey_dark) * (num_unique_breeds // len(cc.glasbey_dark) + 1)

# Explicit mapping for the color to use for each standard breed
explicit_mapping = {breed: repeated_cmap[i] for i, breed in enumerate(unique_breeds)}


my_colors = hv.Cycle(list(explicit_mapping.values()))
# Colormaps for owner gender
boy_cmap = list(sns.color_palette("light:#00008b", n_colors=6).as_hex())
girl_cmap = list(sns.color_palette("light:#8b008b", n_colors=6).as_hex())

Popular Breeds

In [None]:
dog_data_train.sample()

In [None]:
dog_data_all_cols = pd.concat(dog_reporting_year_dict.values())#[dog_data_columns_to_keep]

dog_data_to_2020_all_cols = pd.concat({
    reporting_year: df
    for reporting_year, df in dog_reporting_year_dict.items() if reporting_year <= 2020
}.values())

# dog_data_all_cols[['reporting_year', 'owner_id',

#        'district', 'subdistrict', 'breed_1',
#        'breed_2', 'breed_mixed_cd', 'mixed_type',
#        'dog_size',
#       'dog_age',
#  'is_pure_breed',
#         'breed_1_de',
#        'breed_2_de', 'breed_1_en', 'standard', 'breed_2_en', 'standard_2',
#        'pet_count',]]

In [None]:
# Get the top 10 dog breeds (standards) for each year
dog_data_train.groupby(['reporting_year',"standard",]).size().reset_index(name="count").groupby("reporting_year").apply(lambda x: x.nlargest(10, "count")).reset_index(drop=True)

# Get the 10 most common dog breeds per year in the train dataset.

# dog_data_train['standard_lst'] = dog_data_train.apply(lambda row: list(set([row['standard'], row['standard_2']])) if row['standard_2'] != 'none' else [row['standard']], axis=1)
dog_data_train['standard_lst'] = dog_data_train.apply(lambda row: [row['standard'], row['standard']] if row['standard_2'] == 'none' else [row['standard'], row['standard_2']], axis=1)
# Get the top 3 breeds per year per district from the exploded standard_lst dataframe
exploded_df = dog_data_train.explode(['standard_lst'])
(
    exploded_df
    .groupby(['reporting_year', 'district', 'standard_lst']).size().reset_index(name="count_num")
    .groupby(["reporting_year", "district"]).apply(lambda x: x.nlargest(3, "count_num"), include_groups=False)
    .reset_index().drop(columns=['level_2'])
)


In [None]:
n=10
topn_breeds_df = pd.DataFrame(columns=['standard_lst'])

for year in dog_reporting_year_dict.keys():
    tmp_year_df = dog_reporting_year_dict[year].copy()
    tmp_year_df['standard_lst'] = tmp_year_df.apply(lambda row: [row['standard'], row['standard']] if row['standard_2'] == 'none' else [row['standard'], row['standard_2']], axis=1)

    exploded_tmp = tmp_year_df.explode(['standard_lst'])
    tmp_to_merge = exploded_tmp.groupby(['standard_lst']).size().reset_index(name=f"Y{year}")
    topn_breeds_df = topn_breeds_df.merge(tmp_to_merge, on='standard_lst', how='outer')

# Divide by 2 as we exploded the standard_lst column
topn_breeds_df = topn_breeds_df.set_index('standard_lst').fillna(0).astype(int).sort_values(by='Y2023', ascending=False) // 2
majority_breed_dict = {}
for column in topn_breeds_df.columns:
    # sort the column
    year_numbers = topn_breeds_df[column].sort_values(ascending=False)

    breeds_num = (year_numbers.cumsum()/ year_numbers.sum() < 0.95).sum()
    breeds = year_numbers.index.tolist()[:breeds_num]
    majority_breed_dict[column] = breeds

topn_breeds_df
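
The loop above keeps, for each year, the most common breeds whose cumulative share of registrations stays below 95%. The following sketch shows the same selection rule on plain Python data, under the assumption that counts arrive as a simple dictionary; `breeds_below_share` and the toy counts are hypothetical.

```python
from itertools import accumulate

def breeds_below_share(counts, share=0.95):
    """Names of the most common breeds whose running share stays below `share`."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(count for _, count in ranked)
    running = accumulate(count for _, count in ranked)
    return [name for (name, _), cum in zip(ranked, running) if cum / total < share]

# Hypothetical registrations for a single year
toy_counts = {"labrador": 50, "chihuahua": 30, "poodle": 12, "pug": 8}
print(breeds_below_share(toy_counts))
```

Because the comparison is strict, the breed that pushes the running share across the threshold is left out, so the selected breeds cover somewhat less than 95% of the registrations (92% in this toy case).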


#### Moran's I
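
Moran's I measures whether similar values cluster in neighbouring areas. With deviations from the mean $z_i$ and spatial weights $w_{ij}$, it is $I = \frac{n}{\sum_{ij} w_{ij}} \cdot \frac{\sum_{ij} w_{ij} z_i z_j}{\sum_i z_i^2}$. The block below is a small pure-Python sketch of that formula with row-standardized weights (the effect of `w.transform = 'R'`). It is not the `esda` implementation and performs no permutation inference; the four-region map is hypothetical.

```python
def morans_i(values, neighbors):
    """Global Moran's I with row-standardized binary contiguity weights.

    values:    list of floats, one per region
    neighbors: dict mapping region index -> list of adjacent region indices
    """
    n = len(values)
    mean_value = sum(values) / n
    z = [v - mean_value for v in values]

    numerator = 0.0
    total_weight = 0.0
    for i, adjacent in neighbors.items():
        if not adjacent:
            continue  # islands contribute no weight
        w = 1.0 / len(adjacent)  # row standardization: each row sums to 1
        for j in adjacent:
            numerator += w * z[i] * z[j]
            total_weight += w

    denominator = sum(zi * zi for zi in z)
    return (n / total_weight) * numerator / denominator

# Hypothetical toy map: four regions in a row, each touching its neighbours
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

print(morans_i([1, 2, 3, 4], chain))  # smooth gradient: positive autocorrelation
print(morans_i([1, 4, 1, 4], chain))  # alternating values: negative autocorrelation
```

Values near zero indicate no spatial pattern; the sign tells whether neighbours tend to be alike or unlike.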

In [None]:
breed_list = dog_data[dog_data['reporting_year']==2023].standard.value_counts().head(20).index.tolist()
# breed_list = [breed for breed in breed_list if  breed != 'unknown']
breed_list


In [None]:
# Breed density per subdistrict (dogs per km²); requires get_grouped_counts from the cell below
get_grouped_counts(dog_reporting_year_dict[2022]).div(subdistrict_gdf.set_index('subdistrict')['subd_area_km2'], axis=0)
# subdistrict_gdf
# subdistrict_gdf.set_index('subdistrict')['subd_area_km2']

In [None]:
# w = Queen.from_dataframe(subdistrict_gdf)
# w.transform = 'R'

def get_grouped_counts(dataframe):
    dataframe_c = dataframe.copy()
    dataframe_c['standard_lst'] = dataframe_c.apply(lambda row: [row['standard'], row['standard']] if row['standard_2'] == 'none' else [row['standard'], row['standard_2']], axis=1)
    dataframe_c = dataframe_c.explode(['standard_lst'])
    grouped_counts = dataframe_c.groupby(['subdistrict', 'standard_lst'])['standard_lst'].count().unstack().fillna(0)
    return grouped_counts

def calculate_moran_i(dataframe, standards, weights_matrix):
    moran_dict = {}
    for standard in standards:
        moran = Moran(dataframe[standard], weights_matrix)
        moran_dict[standard] = moran
    return moran_dict

def plot_moran_i(morans, year):

    for standard, mo in morans[year].items():
        plot_moran(mo)

    plt.tight_layout()
    plt.show()

moran_i = {}
morans = {}



for year in dog_reporting_year_dict.keys():
    year_counts = get_grouped_counts(dog_reporting_year_dict[year])
    year_density = year_counts.div(subdistrict_gdf.set_index('subdistrict')['subd_area_km2'], axis=0)
    # Create weights matrix based on the current year's data
    spatial_df = subdistrict_gdf.merge(year_density, on='subdistrict', how='left')
    w = Queen.from_dataframe(spatial_df)
    w.transform = 'R'
    morans[year] = calculate_moran_i(spatial_df, breed_list, w)
    moran_i[year] = {standard: mo.I for standard, mo in morans[year].items()}


moran_i_df = pd.DataFrame(moran_i).T
moran_i_df.style.highlight_max(axis=1, color='darkgreen')

# get_grouped_counts(dog_reporting_year_dict[2023], threshold=25)
# plot_moran_i(morans, 2023)



In [None]:
for year in range(2015, 2023):
    print(f"Year {year}")
    plot_moran(morans[year]['dachshund'])
    plt.show()

In [None]:
dog_data[(dog_data['standard'].isin(['poodle', 'dachshund', 'golden retriever'])) | (dog_data['standard_2'].isin(['poodle', 'dachshund', 'golden retriever']))]

In [None]:
moran_test = dog_reporting_year_dict[2015].copy()
moran_test['standard_lst'] = moran_test.apply(lambda row: [row['standard'], row['standard']] if row['standard_2'] == 'none' else [row['standard'], row['standard_2']], axis=1)
moran_test = moran_test.explode(['standard_lst'])
moran_test = moran_test.groupby(['subdistrict', 'standard_lst'])['standard_lst'].count().unstack().drop(columns=['unknown']).fillna(0)
# filter out the low number breeds
mask = moran_test.max() > 25

moran_test = moran_test.loc[:, mask]

spatial_df = subdistrict_gdf.merge(moran_test, on='subdistrict', how='left')

w = Queen.from_dataframe(spatial_df)
w.transform = 'R'

print("Breeds with a p-value below 0.05 show statistically significant spatial autocorrelation.")
moran_dict = {}
for standard in spatial_df.columns[5:]:
    moran = Moran(spatial_df[standard], w)
    moran_dict[standard] = {'p_value': moran.p_sim, 'moran_I': moran.I}

pd.DataFrame(moran_dict).T.sort_values(by='moran_I', ascending=False)

# moran_test.T[mask].hvplot.heatmap(
#     height=500, width=1000,
#     line_color='gray', line_width=0.5,
#     cmap='greens',
#     ).opts(
#         color_levels=5,
#     # color_levels=[0, 60, 120, 180, 240, 300],
#         active_tools=['box_zoom'])

In [None]:
# the top n small breeds in Zurich
ktopn = (
    dog_data_train.loc[dog_data_train["dog_size"] == "K"]["standard"]
    .value_counts()
    .head(15)
    .index.tolist()
)
ktopn_pure = (
    dog_data_train.loc[
        (dog_data_train["dog_size"] == "K") & (
            dog_data_train["is_pure_breed"])
    ]
    .standard.value_counts()
    .head(15)
    .index.tolist()
)

# The top n big breeds in Zurich
itopn = (
    dog_data_train[dog_data_train["dog_size"] == "I"]["standard"]
    .value_counts()
    .head(15)
    .index.tolist()
)
itopn_pure = (
    dog_data_train.loc[
        (dog_data_train["dog_size"] == "I") & (
            dog_data_train["is_pure_breed"])
    ]
    .standard.value_counts()
    .head(15)
    .index.tolist()
)

topn = ktopn + itopn
topn_pure = ktopn_pure + itopn_pure

In [None]:
dog_data_train.info()
pn.state.kill_all_servers()

In [None]:
pn.state.kill_all_servers()
pn.state.clear_caches()

In [None]:
# Create a Tap stream linked to the HeatMap
breed_tap = hv.streams.Tap(source=None)
reporting_year_slider = pnw.IntSlider(
    value=2015, start=2015, end=2023,
    name="reporting_year",
    width=200,
)
pure_breed_checkbox = pnw.Checkbox(name="Pure Breed", value=True, width=200)
is_male_owner_checkbox = pnw.Checkbox(name="Male Dog Owner", value=True, width=200)
# breed_selector = pnw.Select(name="Breed", options=topn, value="french bulldog")
breed_selector = pnw.ToggleGroup(name="Breed", options=breed_list, value="french bulldog", behavior='radio',
                                 button_style='outline', orientation='vertical', width=200, button_type='success')


top_n_slider = pnw.IntSlider(name="Top N", start=1, end=30, step=1, value=10, width=200)


In [None]:


# @pn.depends(reporting_year_slider.param.value, is_male_owner_checkbox.param.value)
def get_gender_reporting_year_df(reporting_year, gender):
    return dog_data_train.loc[
        (dog_data_train["is_male_owner"] == gender) & (dog_data_train["reporting_year"] == reporting_year)
    ].copy()


@pn.depends(
    reporting_year_slider.param.value,
    is_male_owner_checkbox.param.value,
    top_n_slider.param.value,
)
def get_top_n_gender_breeds(reporting_year, gender, top_n):
    gender_reporting_year_df = get_gender_reporting_year_df(reporting_year=reporting_year, gender=gender)
    return gender_reporting_year_df["standard"].value_counts().head(top_n).index.tolist()


@pn.depends(
    reporting_year_slider.param.value,
    is_male_owner_checkbox.param.value,
    top_n_slider.param.value,
)
def get_gender_heatmap(reporting_year, gender, top_n):
    gender_reporting_year_df = get_gender_reporting_year_df(reporting_year=reporting_year, gender=gender)
    topn_gender_breeds = get_top_n_gender_breeds(reporting_year=reporting_year, gender=gender, top_n=top_n
    )

    filtered_df = (
        gender_reporting_year_df.loc[gender_reporting_year_df["standard"].isin(
            topn_gender_breeds)]
        .groupby(["standard", "district"])
        .size()
        .fillna(0)
        .reset_index(name="count")
    )
    sex = "Male" if gender else "Female"
    top_gender_breeds_heatmap = hv.HeatMap(
        filtered_df, ["district", "standard"], "count"
    ).redim(standard="gender_standard")

    top_gender_breeds_heatmap.opts(
        height=(33 * top_n) + 50,
        width=800,
        cmap=boy_cmap if gender else girl_cmap,
        colorbar=True,
        active_tools=["box_zoom"],
        tools=['tap', 'hover', 'box_select'],
        title=f"Top {top_n} breeds | {reporting_year} | {sex} Owners",
        clim=(0, 100),
    )
    breed_tap.source = top_gender_breeds_heatmap

    return top_gender_breeds_heatmap


dynamic_gender_heatmap_panel = pn.panel(get_gender_heatmap)
# pn.Column(
#     pn.Row(
#     is_male_owner_checkbox,
#     reporting_year_slider,
#     top_n_slider,
#     ),
#     dynamic_gender_heatmap_panel,
# )

In [None]:
# Define a Tap stream linked to the owner age group
owner_tap = hv.streams.Tap(source=None, x=1, y=30)


@pn.depends(reporting_year_slider.param.value, is_male_owner_checkbox.param.value)
def get_age_heatmap(reporting_year, gender):
    gender_reporting_year_df = get_gender_reporting_year_df(reporting_year=reporting_year, gender=gender)
    gender_grouped = (
        gender_reporting_year_df.groupby(["district", "age_group_10"], as_index=False)["owner_id"]
        .nunique()
        .rename(columns={"owner_id": "count"})
    )
    sex = "Male" if gender else "Female"
    district_age_heatmap = hv.HeatMap(
        gender_grouped, ["district", "age_group_10"], "count"
        ).redim(age_group_10="age_group")

    district_age_heatmap.opts(
        opts.HeatMap(
        cmap=boy_cmap if gender else girl_cmap,
        height=500, width=800,
        ylim=(0, 100),
        xlim=(0, 13),
        colorbar=True,
        line_width=4,
        nonselection_alpha=0.9,
        selection_line_color='red',
        active_tools=["box_zoom"],
        tools=["hover", "tap", "box_select"],
        title=f"{sex} Dog Owners | {reporting_year} | by Age Group vs District",
    ))
    owner_tap.source = district_age_heatmap
    return district_age_heatmap


age_group_panel = pn.panel(get_age_heatmap)


In [None]:
bar_plots_opts = dict(
    height=500,
    width=800,
    invert_axes=True,
    cmap=explicit_mapping,
    show_legend=False,
    xlabel="",
    fontscale=1.2,
)


@pn.depends(
    owner_tap.param.x,
    owner_tap.param.y,
    reporting_year_slider.param.value,
    is_male_owner_checkbox.param.value,
)
def update_barplot(x, y, reporting_year, gender):
    # Load the data once so both branches below can use it
    data = get_gender_reporting_year_df(reporting_year=reporting_year, gender=gender)
    if x is not None and y is not None:
        district_x = math.ceil(x - 0.5)
        age_group_y = math.ceil(y / 10) * 10
        # print(f"District: {district_x}, Age Group: {age_group_y}")
        bar_data = (
            data.loc[(data["district"] == district_x) & (
                data["age_group_10"] == age_group_y)]["standard"]
            .value_counts()
            .head(10)
            .reset_index()
        )
        bar_data.columns = ["standard", "count"]
        if len(bar_data) == 0:
            return hv.Bars([], "standard", "count").opts(
                **bar_plots_opts,
                title=f"No Breeds for Age-group:{age_group_y} | Districts:{district_x}",
                active_tools=["box_zoom"],
            )

        return hv.Bars(bar_data, "standard", "count").opts(
            **bar_plots_opts,
            color="standard",
            title=f"Top {min(10,len(bar_data))} Popular Breeds | Age-group:{age_group_y} | Districts:{district_x}",
            active_tools=["box_zoom"],
        )
    if x is None or y is None:
        bar_data = (
            data["standard"]
            .value_counts()
            .head(10)
            .rename("count")
            .reset_index()
            .rename(columns={"index": "standard"})
        )
        return hv.Bars(bar_data, kdims=["standard"], vdims="count").opts(
            **bar_plots_opts,
            color="standard",
            title=f"Top 10 Breeds",
            tools=["hover"],
            active_tools=["box_zoom"],
        )


update_barplot_panel = pn.panel(update_barplot)
pn.Column(
    pn.Row(
    is_male_owner_checkbox, reporting_year_slider,
    ), pn.Row(
    age_group_panel, update_barplot_panel,
    )
)

In [None]:
districts_gdf.info()

In [None]:
poly_opts = dict(
    width=800, height=500,
    line_width=2,
    xaxis=None,
    yaxis=None,
    aspect="equal",
    tools=['hover', 'tap', 'box_select']
)

@pn.depends(
    breed_selector.param.value,
    reporting_year_slider.param.value,
    is_male_owner_checkbox.param.value,
)
def get_breed_chloropleth(breed, reporting_year, gender):

    gender_reporting_year_df = get_gender_reporting_year_df(reporting_year=reporting_year, gender=gender)
    gender_reporting_year_df['standard_lst'] = gender_reporting_year_df.apply(lambda row: [row['standard'], row['standard']] if row['standard_2'] == 'none' else [row['standard'], row['standard_2']], axis=1)
    gender_reporting_year_df = gender_reporting_year_df.explode(['standard_lst'])
    # print(breed)
    standard_data = gender_reporting_year_df.loc[gender_reporting_year_df["standard_lst"] == breed]
    standard_data = standard_data.groupby("subdistrict").size().div(2).astype(int).reset_index(name="count")


    breed_gdf = subdistrict_gdf.merge(
        standard_data, on='subdistrict', how="left"
    )
    # breed_gdf = breed_gdf.drop(columns=["desc", "km2"])
    breed_gdf = breed_gdf.fillna(0)
    breed_color = explicit_mapping[breed]
    breed_cmap = list(
        sns.color_palette("light:" + breed_color, n_colors=6).as_hex()
    )
    sex = "Male" if gender else "Female"
    breed_poly = gv.Polygons(breed_gdf)

    return breed_poly.opts(
            **poly_opts,  # shared size, axis and tool options defined above
            color="count",
            cmap=breed_cmap,
            # clim=(0, 60),
            colorbar=True,
            line_color="darkgray",
            title=f"{breed.title()} | {reporting_year} | {sex} Owners",
        )


breed_chloropleth = pn.panel(get_breed_chloropleth,)


# pn.Row(
# pn.Column(
#     is_male_owner_checkbox,
#     reporting_year_slider,
#     breed_selector,
#     ),
#     breed_chloropleth
# )


In [None]:


# Combine the heatmap and the text display into a layout
# pn.Column(
#     pn.Row(
#     is_male_owner_checkbox, reporting_year_slider, top_n_slider,
#     ),
#     pn.Row(
#     dynamic_gender_heatmap_panel,
#     breed_chloropleth,
#     )
# )

In [None]:
# !wget https://raw.githubusercontent.com/jonnross88/Springboard/main/images/dalle_dog_wordcloud.png -O dalle_dog_wordcloud.png
dalle_dog_path = Path("/content/dalle_dog_wordcloud.png")
dalle_dog_url = "https://raw.githubusercontent.com/jonnross88/Springboard/main/images/dalle_dog_wordcloud.png"

download_file(dalle_dog_path, dalle_dog_url)

In [None]:
mask_path = "/content/dalle_dog_wordcloud.png"
dog_image = hv.RGB.load_image(mask_path)
threshold = 0.7
dog_image_mask = dog_image.data
dog_image_mask[dog_image_mask > threshold] = 1

# Create a count of the various pure breeds
pure_breed_count = dog_data_train.loc[
    (dog_data_train["is_pure_breed"])
].standard.value_counts()

# Create a word cloud for the pure breeds
wc = WordCloud(
    max_font_size=66,
    contour_width=10,
    contour_color="steelblue",
    background_color="white",
    colormap="cet_glasbey_dark",
    mask=(dog_image_mask * 255).astype(np.uint8),
).generate_from_frequencies(pure_breed_count)


breed_wordcoud = hv.RGB(wc.to_array()).opts(
    width=wc.width,
    height=wc.height,
    title="Breed Popularity",
    show_frame=False,
    xaxis=None,
    yaxis=None,
    # bgcolor="gray",
    padding=0.05,
    active_tools=["box_zoom"],
)
pn.pane.HoloViews(breed_wordcoud)

In [None]:
scaler = StandardScaler()
# get the count of dogs by sub-district and reporting_year
dog_count_by_sub_d_reporting_year = (dog_data_train.groupby(
    ["reporting_year", "subdistrict"],
    as_index=False).size().pivot(index="subdistrict",
                                 columns="reporting_year",
                                 values="size"))
# standardize the yearly counts, keeping the sub-district index and year columns
dog_count_df_std = pd.DataFrame(
    scaler.fit_transform(dog_count_by_sub_d_reporting_year),
    columns=dog_count_by_sub_d_reporting_year.columns,
    index=dog_count_by_sub_d_reporting_year.index,
)
# get the percent change of the dog count by sub-district and reporting_year
dog_count_pct_change_std = pd.DataFrame(
    scaler.fit_transform(
        dog_count_by_sub_d_reporting_year.pct_change(axis=1).fillna(0) * 100),
    columns=dog_count_by_sub_d_reporting_year.columns,
    index=dog_count_by_sub_d_reporting_year.index,
)
# plot the standardized dog count and percent change
dog_count_df_std.unstack().reset_index(name="count_std").merge(
    dog_count_pct_change_std.unstack().reset_index(
        name="pct_change_std")).hvplot.scatter(
            by="reporting_year",
            y="count_std",
            x="pct_change_std",
            height=600,
            width=600,
            xlim=(-3, 3),
            ylim=(-3, 3),
        ) * hv.VLine(0).opts(color="lightgray",
                             line_dash="dashed") * hv.HLine(0).opts(
                                 color="lightgray", line_dash="dashed")

In [None]:
# same plot but without the fillna(0) in the pct_change_std
dog_count_pct_change_long = (
    dog_count_by_sub_d_reporting_year.pct_change(axis=1).unstack()
    # .dropna()
    .reset_index(name="pct_change"))
dog_count_long = dog_count_by_sub_d_reporting_year.unstack().reset_index(
    name="count").dropna()

# Standardize only the counts that survive dropna() so the lengths always match
dog_count_long["count_std"] = scaler.fit_transform(
    dog_count_long[["count"]]).ravel()

(dog_count_long.merge(dog_count_pct_change_long).hvplot.scatter(
    by="reporting_year",
    y="count_std",
    x="pct_change",
    height=600,
    width=600,
) * hv.VLine(0).opts(color="lightgray", line_dash="dashed") *
 hv.HLine(0).opts(color="lightgray", line_dash="dashed"))



In [None]:
# display a sample of the dog data train
dog_data_train.sample(3)
dog_data_train.info()

dog_data_train.describe(include="all").T.round(2).infer_objects().sort_values(by="unique").fillna("")

In [None]:
# Calculate yearly counts
yearly_counts = dog_data_train.groupby(['reporting_year', 'subdistrict', 'standard'
                                        ]).size().reset_index(name='size')

# Calculate yearly change
yearly_counts['yearly_change'] = yearly_counts.groupby(
    ['subdistrict', 'standard'])['size'].diff().fillna(0)

# Calculate change of the change (second difference)
yearly_counts['change_of_change'] = yearly_counts.groupby(
    ['subdistrict', 'standard'])['yearly_change'].diff().fillna(0)

# Normalize the changes
yearly_counts['change_of_change_normalized'] = yearly_counts.groupby(
    ['subdistrict', 'standard'])['change_of_change'].transform(
        lambda x: (x - x.mean()) / x.std()).fillna(0).round(2)

# Rank the breeds
yearly_counts['rank'] = yearly_counts.groupby(
    ['reporting_year',
     'subdistrict'])['change_of_change_normalized'].rank(ascending=False,
                                                          na_option='bottom')

# Create target variable
yearly_counts['is_emerging'] = (yearly_counts['rank'] == 1).astype(int)

# Set all 2015 and 2016 records to non-emerging: their second difference has no real baseline
yearly_counts.loc[yearly_counts['reporting_year'].isin([2015, 2016]),
                  'is_emerging'] = 0

# yearly_counts.sort_values(by=['subdistrict', 'standard', 'reporting_year']).head(50)
yearly_counts.sort_values(by=['is_emerging', 'reporting_year'],
                          ascending=False).head(50)
yearly_counts.sort_values(by=['size'], ascending=False).head(50)
yearly_counts.query('reporting_year > 2016').sort_values(by=['is_emerging', 'reporting_year'],
                                                 ascending=False).head(50)
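
The "emerging breed" flag above rests on a second difference of the yearly counts, standardized within each subdistrict and breed. The sketch below walks through that arithmetic on a single hypothetical series using only the standard library; `diff_fill_zero` and `zscores` are illustrative helpers that imitate `diff().fillna(0)` and the z-score lambda, not functions from the notebook.

```python
from statistics import mean, stdev

def diff_fill_zero(series):
    """First difference with the leading gap filled by 0, like diff().fillna(0)."""
    return [0] + [b - a for a, b in zip(series, series[1:])]

def zscores(series):
    """Standardize with the sample standard deviation; all zeros if it is 0."""
    spread = stdev(series)
    if spread == 0:
        return [0.0] * len(series)
    centre = mean(series)
    return [round((v - centre) / spread, 2) for v in series]

# Hypothetical yearly counts of one breed in one subdistrict, 2015 to 2018
counts = [3, 4, 6, 10]

yearly_change = diff_fill_zero(counts)            # [0, 1, 2, 4]
change_of_change = diff_fill_zero(yearly_change)  # [0, 1, 1, 2]
normalized = zscores(change_of_change)

print(yearly_change, change_of_change, normalized)
```

The first two entries of `change_of_change` depend on the filled-in zeros rather than on observed changes, which is one reason to exclude the first two reporting years from the target.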

In [None]:
# groupby subdistrict and reporting_year and count the number of each breed
breed_count_by_reporting_year_sub_d = dog_data_train.groupby(
    ["reporting_year", "subdistrict", "standard"]).size().unstack(level=2).fillna(0)

# breed_count_by_reporting_year_sub_d.divide(breed_count_by_reporting_year_sub_d.sum(axis=1), axis=0)
breed_count_by_reporting_year_sub_d.sort_index(level='reporting_year').groupby(level='subdistrict').diff().fillna(0)
grouped_breed_count = breed_count_by_reporting_year_sub_d.groupby(level='subdistrict')
grouped_breed_count.pct_change().fillna(0)
total_breed_count = grouped_breed_count.sum()
total_breed_count


# filter the breed count by sub-district and reporting_year to only include breeds in the breeds_more_than_threshold list
# this is a multi-index dataframe with sub-district and breeds
breed_count_by_reporting_year_sub_d

#### Greenspace

In [None]:
greenspace_zip_url = 'https://storage.googleapis.com/mrprime_dataset/zurich/greenspace.zip'
parks_zip_url = "https://storage.googleapis.com/mrprime_dataset/zurich/parks.zip"


In [None]:
greenspace_gdf_dict = hf.get_gdf_from_zip_url(greenspace_zip_url)
parks_gdf_dict = hf.get_gdf_from_zip_url(parks_zip_url)

In [None]:
greenspace_gdf = list(greenspace_gdf_dict.values())[0]
greenspace_gdf.sample()


In [None]:
parks_gdf = list(parks_gdf_dict.values())[0]
parks_gdf.sample()

In [None]:



greenspace_columns = {
    'objektidentifikator': 'object_identifier',
    'pflegeareal': 'maintenance_area',
    'produkt': 'product',
    'erfassungseinheit': 'recording_unit',
    'pflegeeinheit': 'maintenance_unit',
    'pflegestufe': 'maintenance_level'
}
greenspace_gdf = greenspace_gdf.rename(columns=greenspace_columns)
greenspace_gdf = greenspace_gdf.drop(columns='objectid')
greenspace_gdf.sample()


In [None]:
recording_unit_mapping = {
    '611 Parkanlagen': 'park_facility',
    '643 Schulgrün': 'school_greenspace',
    '632 Badeanlagen': 'bathing_facility',
    '641 Strassenbäume': 'street_tree',
    '662 Bachunterhalt ERZ': 'stream_maintenance',
    '661 Wohnliegenschaften LVZ': 'residential',
    '624 Friedhofanlagen': 'cemetery_facility',
    '663 Grünflächenpflege VBZ': 'greenspace_maintenance',
    '623 Grabdienstleistungen': 'grave_service',
    '631 Sportanlagen': 'sports_facility',
    '642 Strassenbegleitgrün': 'roadside_greenspace',
    '651 Sozialbauten IMMO': 'social_housing',
    '650 Schulgrau IMMO': 'school_grey_area',
    '660 Wohnsiedlungen LVZ': 'housing_estate',
    '652 Verwaltungsbauten IMMO': 'administrative_building',
    '654 Werkbauten IMMO': 'industrial_building',
    '653 Kulturbauten IMMO': 'cultural_building',
    '690 Verrechenbare Dienstleistungen': 'billable_service',
    '664 Grünflächenpflege EWZ': 'greenspace_maintenance',
}
greenspace_gdf['recording_unit'] = greenspace_gdf['recording_unit'].map(recording_unit_mapping, na_action='ignore')


In [None]:


# Identify invalid geometries
invalid_geometries = greenspace_gdf[~greenspace_gdf.geometry.is_valid]

# If invalid geometries are found, attempt to fix them using buffer(0)
if not invalid_geometries.empty:
    print(f"Found {len(invalid_geometries)} invalid geometries. Attempting to fix...")
    greenspace_gdf.geometry = greenspace_gdf.geometry.buffer(0)
else:
    print("No invalid geometries found.")

# Perform the dissolve operation
greenspace_gdf_dissolved = greenspace_gdf.dissolve(by='maintenance_area')


In [None]:
park_facility_gdf = greenspace_gdf_dissolved[greenspace_gdf_dissolved['recording_unit'] == 'park_facility'].drop(columns=['recording_unit', 'product', 'object_identifier', 'maintenance_unit']).reset_index()
park_facility_gdf.nunique()

In [None]:
parks_plot = gv.Points(parks_gdf, vdims=['name']).opts(color='red', tools=['hover'], size=5, marker='square')

In [None]:
park_facility_gdf['area_m2'] = park_facility_gdf.to_crs('EPSG:2056').area.astype(int)


In [None]:
def assign_subdistrict(park, subdistricts):
    """Assigns a park to the subdistrict with the largest intersection area."""
    max_area = 0
    assigned_subdistrict = None

    for index, subdistrict in subdistricts.iterrows():
        intersection = park.geometry.intersection(subdistrict.geometry)
        area = intersection.area

        if area > max_area:
            max_area = area
            assigned_subdistrict = subdistrict["subdistrict"]  # Assuming "subdistrict" column in subdistrict_gdf

    return assigned_subdistrict

# Ensure both geodataframes have the same CRS
park_facility_gdf = park_facility_gdf.to_crs(subdistrict_gdf.crs)

# Apply the assign_subdistrict function to each park
park_facility_gdf['subdistrict'] = park_facility_gdf.apply(
    lambda park: assign_subdistrict(park, subdistrict_gdf), axis=1
)
park_facility_gdf
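
The assignment rule used above (give each park to the subdistrict it overlaps most) can be illustrated without any GIS library. The sketch below swaps the real polygons for axis-aligned boxes, which is a simplifying assumption; the zone names, coordinates and helper functions are hypothetical.

```python
def overlap_area(a, b):
    """Area shared by two axis-aligned boxes given as (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)

def assign_by_largest_overlap(park_box, zones):
    """Return the name of the zone sharing the most area, or None if disjoint."""
    best_name, best_area = None, 0
    for name, zone_box in zones.items():
        area = overlap_area(park_box, zone_box)
        if area > best_area:
            best_name, best_area = name, area
    return best_name

# Hypothetical zones and parks on a plain coordinate grid
zones = {"west": (0, 0, 10, 10), "east": (10, 0, 20, 10)}
straddling_park = (8, 2, 14, 6)   # 2 units wide in west, 4 units wide in east
outside_park = (30, 30, 32, 32)   # touches neither zone

print(assign_by_largest_overlap(straddling_park, zones))
print(assign_by_largest_overlap(outside_park, zones))
```

As in `assign_subdistrict`, a park that intersects no zone stays unassigned (`None`), so it is worth checking for missing values in the resulting `subdistrict` column.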


In [None]:
# gts.ESRI * gv.Polygons(park_facility_gdf, vdims=['maintenance_level']).opts(height=800, width=1000, color='maintenance_level', alpha=0.5, tools=['hover', 'tap'], active_tools=['box_zoom'], show_legend=True)

maintenance_poly_dict = {
    level: gv.Polygons(gdf.to_crs(ccrs.GOOGLE_MERCATOR.proj4_init), vdims=['maintenance_level', 'maintenance_area'],
                          crs=ccrs.GOOGLE_MERCATOR).opts(
                              line_width=0, alpha=0.5, height=800, width=1200,
                              show_legend=True, legend_position='right', muted_alpha=0.001,
                              tools=['hover'],
                              xaxis='bare', yaxis='bare', xlabel='', ylabel='')
                          for level, gdf in park_facility_gdf.groupby('maintenance_level')
}

maintenance_plot = gv.NdOverlay(maintenance_poly_dict, kdims=['maintenance_level']).opts(title='Park Facilities by Maintenance Level')
background_map = gts.ESRI.opts(alpha=0.5)

pn.panel(background_map * maintenance_plot * parks_plot)


In [None]:

parks_gdf

In [None]:
column_meanings = {
    'adr_inter': 'Internal address',
    'adresse': 'Address',
    'adrzus_int': 'Internal address addition',
    'behindertenparkplatz': 'Disabled parking',
    'bemerkung': 'Remarks, comments',
    'ccmail': 'CC email address',
    'da': 'Data acquisition (date?)',
    'datum': 'Date',
    'datum_cms': 'Date (CMS related)',
    'dep': 'Department or area',
    'editor': 'Editor (name?)',
    'erforderlichedokumente': 'Required documents',
    'fax': 'Fax number',
    'hausnummer': 'House number',
    'hindernisfreiheit': 'Accessibility',
    'infrastruktur': 'Infrastructure',
    'isbetriebsferien_gebaeude': 'Building closed due to company holidays',
    'isbetriebsferien_schalter': 'Counter closed due to company holidays',
    'kategorie': 'Category',
    'mail': 'Email address',
    'name': 'Name',
    'namenzus': 'Name addition',
    'objectid': 'Object ID',
    'oeffnungszeiten_gebaeude_di': 'Building opening hours (Tuesday)',
    'oeffnungszeiten_schalter_mo': 'Counter opening hours (Monday)',
    'oeffnungszeiten_schalter_sa': 'Counter opening hours (Saturday)',
    'oeffnungszeiten_schalter_so': 'Counter opening hours (Sunday)',
    'ort': 'City or location',
    'plz': 'Postal code',
    'poi_id': 'Point of Interest ID',
    'postadresse': 'Postal address',
    'publish_internet': 'Publish on internet',
    'strasse': 'Street',
    'suchen': 'Search terms',
    'tel': 'Telephone number',
    'tel2': 'Alternative telephone number',
    'www': 'Website',
    'zahlungsmittel_internet': 'Payment methods (internet)',
    'zahlungsmittel_schalter': 'Payment methods (counter)',
    'zahlungsmittel_telefon': 'Payment methods (telephone)',
    'zvv_label': 'ZVV (public transport) label',
    'zvv_link': 'ZVV (public transport) link',

}

In [None]:
( gts.OSM * gv.Points(parks_gdf).opts(size=5)).opts(height=800, width=1000)