
## Dog Puppulation in Zürich: A Geospatial Neighborhood Analysis

### Introduction

#### Problem Statement:
Can we develop a data-driven model by the end of January 2024 that predicts what the dog *puppulation* density will be this year across Zürich’s 34 neighborhoods, with a Mean Absolute Error of less than 10%, using time series cross validation, to provide valuable insights for urban planning, pet-related businesses, and community welfare?


#### Context:
Following the City Council Resolution to override the Law on the Keeping of Dogs, the City of Zürich has embarked on a comprehensive exploration of dog *puppulation* dynamics in its neighborhoods. This initiative, prompted by that regulatory shift, aims to sniff out patterns in dog *puppulation* density that impact urban planning, business opportunities, and the overall welfare of our furry companions and their owners. The study leverages data from **2015** to **2020** to improve urban planning, boost pet-related business ventures, and foster community welfare through a better understanding of dog *puppulation* density patterns. This study is vital in this new era for Zürich, providing practical recommendations for the near future. The aim is to develop a data-driven model that reliably predicts the dog *puppulation* density across Zürich’s 34 neighborhoods in the near future.


#### Criteria for Success:
Our goal is to *dig up* clear patterns of dog *puppulation* density in Zürich’s neighborhoods, laying the groundwork for informed future predictions. We aim to *unleash* the potential of our predictive models, forecasting 2024 dog *puppulation* density patterns in Zürich with a Mean Absolute Error of less than 10%. Achieving this would be a *pawsitive* step towards informed future urban strategies.


#### Constraints within Solution Space:
- **Temporal Scope**: The study is confined to the years with full data availability across all datasets (2015-2020).
- **Spatial Resolution**: The study focuses on dog *puppulation* density at the neighborhood level. This may not capture variations within neighborhoods or between smaller areas.
- **Generalizability**: The findings of this study are specific to Zürich and may not be applicable to other cities or regions with different demographic, economic, and cultural contexts.


#### Stakeholders:
- **City Planners and Local Authorities:** Empower data-driven decision-making to enhance urban living conditions.
- **Business Enterprises:** Guide service offerings and marketing strategies.
- **Dog Owners:** Offer insights into community resources and pet care options.


#### Key Data Sources:
- **Geospatial Boundaries:** [Zürich Statistical Quarters](https://data.stadt-zuerich.ch/dataset/geo_statistische_quartiere)
- **Dog Ownership Records:** [Dog Owners Dataset](https://data.stadt-zuerich.ch/dataset/sid_stapo_hundebestand_od1001/download/KUL100OD1001.csv)
- **Demographic Statistics:** [Population Dataset](https://data.stadt-zuerich.ch/dataset/bev_bestand_jahr_quartier_alter_herkunft_geschlecht_od3903/download/BEV390OD3903.csv)
- **Economic Indicators:** [Income Dataset](https://data.stadt-zuerich.ch/dataset/fd_median_einkommen_quartier_od1003/download/WIR100OD1003.csv)
- **Household Dynamics:** [Household Size Dataset](https://data.stadt-zuerich.ch/dataset/bev_hh_haushaltsgroesse_quartier_seit2013_od3806/download/BEV380OD3806.csv)

#### Analytical Objectives:
- **Understand the Relationship**: Dig into the relationship between demographic factors and dog *puppulation* density across Zürich’s neighborhoods.
- **Identify Trends and Clusters**: Track and map out the spatial and temporal trends of dog *puppulation* density. Identify spatial clusters of high and low dog *puppulation* density.
- **Predict Future Trends**: Predict the near-future trends of dog *puppulation* density using historical data, aiming for a Mean Absolute Error of less than 10%. This includes forecasting where Zürich’s dog *puppulation* will be booming across its 34 neighborhoods in the immediate future.
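To make the evaluation target concrete, here is a minimal sketch of expanding-window time series cross-validation scored with a mean absolute percentage error. It assumes the "less than 10%" target is read as a percentage error; the density values, the `expanding_window_splits` helper, and the naive last-value forecaster are illustrative placeholders, not the model or data used later in this notebook.

```python
def expanding_window_splits(years, min_train=3):
    """Yield (train_years, test_year) pairs where the training window grows by one year per fold."""
    for end in range(min_train, len(years)):
        yield years[:end], years[end]


def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual| over all folds."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative (made-up) dog density values for a single neighborhood, in dogs per km2
density = {2015: 50.0, 2016: 52.0, 2017: 53.0, 2018: 55.0, 2019: 56.0, 2020: 58.0}
years = sorted(density)

actual, predicted = [], []
for train_years, test_year in expanding_window_splits(years):
    # Naive baseline: forecast next year's density as the last observed training value
    predicted.append(density[train_years[-1]])
    actual.append(density[test_year])

mape = mean_absolute_percentage_error(actual, predicted)
print(f"Folds: {len(actual)}, MAPE: {mape:.1%}, target met: {mape < 0.10}")
```

Each fold only ever trains on years that precede its test year, which is the property that distinguishes time series cross-validation from a shuffled split.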


### Imports & Configurations

This section imports the required libraries and sets the configuration for dataframes and visualizations, establishing the setup for the data analysis and exploration that follow.


In [1]:
# Standard library imports
import math
from functools import partial
from urllib.request import urlopen

# Third-party imports
import cartopy.crs as ccrs  # For cartographic projections and geographic plots
import colorcet as cc  # Additional color palettes
import geopandas as gpd
import geoviews as gv
import holoviews as hv
import hvplot.pandas  # noqa
import libpysal as lps  # Spatial analysis library
import numpy as np
import pandas as pd
import panel as pn
import panel.widgets as pnw
import seaborn as sns
from bokeh.models import FixedTicker, NumeralTickFormatter
from esda.moran import Moran, Moran_Local  # Spatial autocorrelation statistics
from fiona.io import ZipMemoryFile
from holoviews import streams
from IPython.display import clear_output, display  # display is used in later cells
from matplotlib import pyplot as plt
from PIL import Image, ImageDraw  # For image processing
from pmdarima import auto_arima  # For determining ARIMA orders
from sklearn import metrics  # For evaluating model performance
from splot.esda import plot_local_autocorrelation
from thefuzz import fuzz  # For string matching
from tqdm.notebook import tqdm  # Progress bars
from wordcloud import WordCloud  # For generating word cloud visualizations

# Local application/library specific imports
import helper_functions as hf  # Custom helper functions for this project
from translate_app import translate_list_to_dict

clear_output()

In [2]:
# Additional configurations for visualization libraries
gv.extension("bokeh")
hv.extension("bokeh")
hvplot.extension("bokeh")
pn.extension(template="fast", nthreads=4, sizing_mode="stretch_width")
clear_output()

In [3]:
# Pandas display options
# Disable warnings for chained assignments
pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 50
pd.options.display.max_rows = 100

# Seaborn style setting
sns.set_style("whitegrid")

# Panel configuration for improved interactivity performance
pn.config.throttled = True

# Clear any output created by the extensions and settings
clear_output()

### Data Description

This project utilizes various datasets to reveal the relationship between dog owner geodemographic factors and dog population density in Zurich. 




<table>
    <thead>
        <tr>
            <th>Dataset</th>
            <th>Source URL</th>
            <th>Original Source</th>
            <th>Description</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><a href="#Zurich-Statistical-Districts-Geospatial-Data">Zurich Districts Data</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/geo_statistische_quartiere">Link</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/geo_statistische_quartiere">Stadt Zürich</a></td>
            <td>Statistical Quarters</td>
        </tr>
        <tr>
            <td><a href="#Zurich-Dogs-Dataset">Zurich Dogs Data</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/sid_stapo_hundebestand_od1001/download/KUL100OD1001.csv">Link</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/sid_stapo_hundebestand_od1001">Stadt Zürich</a></td>
            <td>Dog populations of the City of Zurich since 2015.</td>
        </tr>
        <tr>
            <td><a href="#Zurich-Population-Dataset">Zurich Population Data</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/bev_bestand_jahr_quartier_alter_herkunft_geschlecht_od3903/download/BEV390OD3903.csv">Link</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/bev_bestand_jahr_quartier_alter_herkunft_geschlecht_od3903">Stadt Zürich</a></td>
            <td>Population by neighbourhood, origin, sex and age, since 1993.</td>
        </tr>
        <tr>
            <td><a href="#Zurich-Income-Data">Zurich Income Data</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/fd_median_einkommen_quartier_od1003/download/WIR100OD1003.csv">Link</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/fd_median_einkommen_quartier_od1003">Stadt Zürich</a></td>
            <td>Median income of taxable individuals by year, tax rate and urban district, since 1999</td>
        </tr>
        <tr>
            <td><a href="#Zurich-Household-Dataset">Zurich Household Data</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/bev_hh_haushaltsgroesse_quartier_seit2013_od3806/download/BEV380OD3806.csv">Link</a></td>
            <td><a href="https://data.stadt-zuerich.ch/dataset/bev_hh_haushaltsgroesse_quartier_seit2013_od3806">Stadt Zürich</a></td>
            <td>Private households by household size and urban district, since 2013.</td>
        </tr>
    </tbody>
</table>

<p>These datasets collectively enable a comprehensive analysis of dog ownership trends in Zurich.</p>


### Data Loading
First, we load in all of the datasets. 

To enhance readability and ensure consistency across datasets, original column names were translated from German to English and standardized to snake case using our `sanitize_df_column_names` helper function. This transformation facilitates a cleaner, more uniform `pd.DataFrame` structure for analysis.
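As an illustration of the snake case step only, here is a minimal sketch. The function `to_snake_case` is a hypothetical stand-in: the project's `sanitize_df_column_names` helper also translates the German names to English, which this sketch does not attempt.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a CamelCase or mixed-style column name to snake_case (no translation)."""
    # Insert an underscore where a lowercase letter or digit is followed by an uppercase letter
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Split a run of capitals from a following capitalized word (e.g. "HHSize" -> "HH_Size")
    s = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", s)
    # Replace any other separator characters with underscores and collapse repeats
    s = re.sub(r"[^0-9a-zA-Z]+", "_", s)
    s = re.sub(r"_+", "_", s)
    return s.strip("_").lower()


original_columns = ["StichtagDatJahr", "SteuerEinkommen_p50", "hh_groesseSort", "AnzHH"]
snake_columns = [to_snake_case(col) for col in original_columns]
print(dict(zip(original_columns, snake_columns)))
```

With pandas, the same function could be applied through `df.rename(columns=to_snake_case)`.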

We then inspect the columns and select the ones we would like to keep for our analysis. We also rename the columns to make them more readable and consistent across datasets. 



#### Zurich Statistical Districts Geospatial Data

This first geodataset comes as a compressed file containing three GeoJSON files.

1. `z_gdf_0`: point geometry data at the ideal position for placing a number label on the polygon map.

2. `z_gdf_1`: polygon geometry data specifically for visual representation in cartography, i.e., maps.

3. `z_gdf_2`: polygon geometry data recommended for accurate geometry calculations, like spatial joins or area calculations.

Together, these three files provide excellent geodetic information on the geographical region of Zürich for our analysis.
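The idea of reading several GeoJSON layers out of one ZIP archive can be sketched with the standard library alone. This is not the implementation of `hf.get_gdf_from_zip_url`; the function `read_geojson_members`, the file names, and the tiny demo archive below are assumptions made for illustration, and the polygon coordinates are invented.

```python
import io
import json
import zipfile


def read_geojson_members(zip_bytes: bytes) -> dict:
    """Return {file name: parsed GeoJSON dict} for every GeoJSON/JSON member of a ZIP archive."""
    layers = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith((".geojson", ".json")):
                with archive.open(name) as member:
                    layers[name] = json.load(member)
    return layers


def feature_collection(geometry: dict) -> dict:
    """Wrap a single geometry in a minimal GeoJSON FeatureCollection."""
    feature = {"type": "Feature", "properties": {}, "geometry": geometry}
    return {"type": "FeatureCollection", "features": [feature]}


# Build a small in-memory archive so the sketch runs without network access
label_point = {"type": "Point", "coordinates": [8.53495, 47.37139]}
quarter_polygon = {
    "type": "Polygon",
    "coordinates": [[[8.53, 47.37], [8.54, 47.37], [8.54, 47.38], [8.53, 47.37]]],
}
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("labels.geojson", json.dumps(feature_collection(label_point)))
    archive.writestr("quarters.geojson", json.dumps(feature_collection(quarter_polygon)))
    archive.writestr("readme.txt", "not a layer")

layers = read_geojson_members(buffer.getvalue())
summary = {name: fc["features"][0]["geometry"]["type"] for name, fc in layers.items()}
print(summary)
```

For a remote archive, the bytes could come from `urlopen(url).read()`, and each parsed layer could then be handed to `gpd.GeoDataFrame.from_features`.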

In [4]:
# # save the url of the website
# # zurich_districts_url = "https://www.zuerich.com/en/visit/about-zurich/zurichs-districts"

# # zurich_desc = hf.get_zurich_description(zurich_districts_url)
# # Create a 'link' column in the description DataFrame with links to each district's details.
# # zurich_desc["link"] = zurich_desc["district"].apply(
# #     lambda x: f"{zurich_districts_url}#s-{x}"
# # )

# # Load in the Zurich districts description DataFrame we created in the previous notebook.
# zurich_desc = pd.read_csv("../data/zurich_districts.csv")
# # Display a sample of the Zurich districts description DataFrame.
# display(zurich_desc.sample(3))

# # Display the Zurich districts description DataFrame.
# zurich_desc.info()

In [5]:
# Define the URL for the Zurich Statistical Quarters geospatial data ZIP file.
zip_gdf_url = "https://storage.googleapis.com/mrprime_dataset/zurich/zurich_statistical_quarters.zip"

# Load the geospatial data into a dictionary of Zurich GeoDataFrames.
zurich_geo_dicts = hf.get_gdf_from_zip_url(zip_gdf_url)

# Rename keys in the Zurich Geo DataFrames with a prefix.
z_gdf = hf.rename_keys(zurich_geo_dicts, prefix="z_gdf_")

# Display the information and a sample of data from each GeoDataFrame in the z_gdf dictionary
for key in z_gdf.keys():
    print(f"\nInformation for {key}:")
    z_gdf[key].info()
    print(f"Sample data from {key}:")
    display(z_gdf[key].sample(3))


Information for z_gdf_0:
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   geometry  34 non-null     geometry
 1   objid     34 non-null     object  
 2   name      34 non-null     object  
 3   kuerzel   34 non-null     object  
 4   ori       34 non-null     int64   
 5   hali      34 non-null     object  
 6   vali      34 non-null     object  
dtypes: geometry(1), int64(1), object(5)
memory usage: 2.0+ KB
Sample data from z_gdf_0:


Unnamed: 0,geometry,objid,name,kuerzel,ori,hali,vali
20,POINT (8.53495 47.37139),21,City,14,0,1,2
13,POINT (8.52813 47.37764),14,Langstrasse,42,0,1,2
12,POINT (8.53061 47.38389),13,Gewerbeschule,51,0,1,2



Information for z_gdf_1:
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   geometry  34 non-null     geometry
 1   objectid  34 non-null     int64   
 2   objid     34 non-null     object  
 3   qnr       34 non-null     int64   
 4   qname     34 non-null     object  
 5   knr       34 non-null     int64   
 6   kname     34 non-null     object  
dtypes: geometry(1), int64(3), object(3)
memory usage: 2.0+ KB
Sample data from z_gdf_1:


Unnamed: 0,geometry,objectid,objid,qnr,qname,knr,kname
16,"POLYGON ((8.51753 47.38535, 8.51745 47.38537, ...",12,26,42,Langstrasse,4,Kreis 4
4,"POLYGON ((8.53301 47.37394, 8.53310 47.37405, ...",7,16,41,Werd,4,Kreis 4
31,"POLYGON ((8.58807 47.40796, 8.58773 47.40804, ...",31,4,123,Hirzenbach,12,Kreis 12



Information for z_gdf_2:
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   geometry  34 non-null     geometry
 1   objid     34 non-null     object  
 2   qnr       34 non-null     int64   
 3   qname     34 non-null     object  
 4   knr       34 non-null     int64   
 5   kname     34 non-null     object  
dtypes: geometry(1), int64(2), object(3)
memory usage: 1.7+ KB
Sample data from z_gdf_2:


Unnamed: 0,geometry,objid,qnr,qname,knr,kname
7,"POLYGON ((8.51615 47.34897, 8.51619 47.34875, ...",16,21,Wollishofen,2,Kreis 2
21,"POLYGON ((8.51403 47.39914, 8.51408 47.39907, ...",29,102,Wipkingen,10,Kreis 10
1,"POLYGON ((8.57525 47.36377, 8.57526 47.36377, ...",10,74,Witikon,7,Kreis 7


#### Zurich Dogs Dataset

In [6]:
# zurich_dog_data_link = "https://data.stadt-zuerich.ch/dataset/sid_stapo_hundebestand_od1001/download/KUL100OD1001.csv"
zurich_dog_data_link = (
    "https://storage.googleapis.com/mrprime_dataset/zurich/zurich_dogs.csv"
)
zurich_dog_data = hf.InfoDataFrame(pd.read_csv(zurich_dog_data_link))
zurich_dog_data.limit_info()

zurich_dog_data = hf.sanitize_df_column_names(zurich_dog_data)
zurich_dog_data.limit_info()
zurich_dog_data.sample(3)


Total number of columns: 32
<class 'helper_functions.InfoDataFrame'>
RangeIndex: 70967 entries, 0 to 70966
Columns: 32 entries, StichtagDatJahr to AnzHunde
dtypes: int64(19), object(13)
memory usage: 17.3+ MB

Only showing info for 8 columns, chosen at random.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70967 entries, 0 to 70966
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   RassentypCd     70967 non-null  object
 1   HundefarbeText  70967 non-null  object
 2   KreisSort       70967 non-null  int64 
 3   RassentypSort   70967 non-null  int64 
 4   QuarLang        70967 non-null  object
 5   AlterVHundSort  70967 non-null  int64 
 6   QuarCd          70967 non-null  int64 
 7   KreisLang       70967 non-null  object
dtypes: int64(4), object(4)
memory usage: 4.3+ MB

Total number of columns: 32
<class 'helper_functions.InfoDataFrame'>
RangeIndex: 70967 entries, 0 to 70966
Columns: 32 entries, deadline_da

Unnamed: 0,deadline_date_year,data_status_cd,holder_id,age_v_10_cd,age_v_10_long,age_v_10_sort,sex_cd,sex_long,sex_sort,circle_cd,circle_lang,circle_sort,quar_cd,quar_lang,quar_sort,race_1_text,race_2_text,breed_mixed__breed_cd,breed_mongrel_long,breed_mixed__breed_sort,breed_type_cd,breed_type_long,breed__type_sort,birth_dog_year,age_v_dog_cd,age_v_dog_long,age_v_dog_sort,sex_dog_cd,sex_dog_long,sex_dog_sort,dog_color_text,number_of_dogs
35053,2019,D,141754,80,80- bis 89-Jährige,9,1,männlich,1,6,Kreis 6,6,61,Unterstrass,61,Dachshund,Keine,1,Rassehund,1,K,Kleinwüchsig,1,2004,14,14-Jährige,14,2,weiblich,2,braun,1
3220,2015,D,98866,40,40- bis 49-Jährige,5,2,weiblich,2,4,Kreis 4,4,42,Langstrasse,42,Yorkshire Terrier,Keine,1,Rassehund,1,K,Kleinwüchsig,1,2005,9,9-Jährige,9,1,männlich,1,black/tan,1
16502,2017,D,96262,30,30- bis 39-Jährige,4,2,weiblich,2,12,Kreis 12,12,122,Schwamendingen-Mitte,122,Bichon frisé,Keine,1,Rassehund,1,K,Kleinwüchsig,1,2008,8,8-Jährige,8,1,männlich,1,weiss,1


#### Zurich Population Dataset



In [7]:
# zurich_pop_link = "https://data.stadt-zuerich.ch/dataset/bev_bestand_jahr_quartier_alter_herkunft_geschlecht_od3903/download/BEV390OD3903.csv"
zurich_pop_link = "https://storage.googleapis.com/mrprime_dataset/zurich/zurich_pop.csv"
zurich_pop_data = hf.InfoDataFrame(pd.read_csv(zurich_pop_link))
zurich_pop_data.limit_info()
zurich_pop_data = hf.sanitize_df_column_names(zurich_pop_data)
zurich_pop_data.limit_info()
print("Showing a full row of the Zurich population DataFrame:")
zurich_pop_data.sample().T


Total number of columns: 23
<class 'helper_functions.InfoDataFrame'>
RangeIndex: 370658 entries, 0 to 370657
Columns: 23 entries, StichtagDatJahr to AnzBestWir
dtypes: int64(15), object(8)
memory usage: 65.0+ MB

Only showing info for 8 columns, chosen at random.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 370658 entries, 0 to 370657
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   AlterV10Kurz  370658 non-null  object
 1   AlterV20Kurz  370658 non-null  object
 2   SexCd         370658 non-null  int64 
 3   AlterVCd      370658 non-null  int64 
 4   SexLang       370658 non-null  object
 5   AlterV10Cd    370658 non-null  int64 
 6   AlterV20Cd    370658 non-null  int64 
 7   AnzBestWir    370658 non-null  int64 
dtypes: int64(5), object(3)
memory usage: 22.6+ MB

Total number of columns: 23
<class 'helper_functions.InfoDataFrame'>
RangeIndex: 370658 entries, 0 to 370657
Columns: 23 entries, deadline_date

Unnamed: 0,356131
deadline_date_year,2021
age_v_sort,81
age_v_cd,81
age_v_short,81
age_v_05_sort,17
age_v_05_cd,80
age_v_05_short,80-84
age_v_10_cd,80
age_v_10_short,80-89
age_v_20_cd,80


#### Zurich Income Dataset
These data contain quantile values (25th, 50th, and 75th percentiles) of the taxable income of natural persons who are primarily taxable in the city of Zurich. Taxable income is reported in thousands of francs.
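For readers unfamiliar with quantile columns, here is a small sketch of how the 25th, 50th, and 75th percentiles relate to a set of incomes. The income values are invented for illustration, and the `None` entry mimics the missing values visible in the quantile columns of this dataset.

```python
import statistics

# Illustrative taxable incomes in thousand CHF (made-up values, one missing entry)
raw_incomes = [55.5, None, 24.0, 100.5, 38.5, 72.0]

# Drop missing entries before computing quantiles
incomes = [value for value in raw_incomes if value is not None]

# n=4 splits the data into quartiles; the three cut points are p25, p50 and p75
p25, p50, p75 = statistics.quantiles(incomes, n=4, method="inclusive")
print(f"p25={p25}, p50={p50}, p75={p75}")
```

The median (`p50`) is the value the dataset title refers to; `p25` and `p75` bracket the middle half of taxpayers.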

In [8]:
# zurich_income_link = "https://data.stadt-zuerich.ch/dataset/fd_median_einkommen_quartier_od1003/download/WIR100OD1003.csv"
zurich_income_link = (
    "https://storage.googleapis.com/mrprime_dataset/zurich/zurich_income.csv"
)
zurich_income_data = hf.InfoDataFrame(pd.read_csv(zurich_income_link))
zurich_income_data.info()

# Clean column names, display info and sample
zurich_income_data = hf.sanitize_df_column_names(zurich_income_data)


zurich_income_data.info()
print("\nShowing a full row of the Zurich income DataFrame:")
zurich_income_data.sample().T

<class 'helper_functions.InfoDataFrame'>
RangeIndex: 2244 entries, 0 to 2243
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   StichtagDatJahr      2244 non-null   int64  
 1   QuarSort             2244 non-null   int64  
 2   QuarCd               2244 non-null   int64  
 3   QuarLang             2244 non-null   object 
 4   SteuerTarifSort      2244 non-null   int64  
 5   SteuerTarifCd        2244 non-null   int64  
 6   SteuerTarifLang      2244 non-null   object 
 7   SteuerEinkommen_p50  2181 non-null   float64
 8   SteuerEinkommen_p25  2181 non-null   float64
 9   SteuerEinkommen_p75  2181 non-null   float64
dtypes: float64(3), int64(5), object(2)
memory usage: 175.4+ KB
<class 'helper_functions.InfoDataFrame'>
RangeIndex: 2244 entries, 0 to 2243
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   deadline_date

Unnamed: 0,1536
deadline_date_year,2014
quar_sort,13
quar_cd,13
quar_lang,Lindenhof
tax_tariff_sort,0
tax_tariff_cd,0
tax_tariff_long,Grundtarif
tax_income_p_50,55.55
tax_income_p_25,24.1
tax_income_p_75,100.4


#### Zurich Household Dataset

In [9]:
# zurich_household_data_link = "https://data.stadt-zuerich.ch/dataset/bev_hh_haushaltsgroesse_quartier_seit2013_od3806/download/BEV380OD3806.csv"
zurich_household_data_link = (
    "https://storage.googleapis.com/mrprime_dataset/zurich/zurich_household.csv"
)
zurich_household_data = hf.InfoDataFrame(
    pd.read_csv(zurich_household_data_link))
zurich_household_data.limit_info()

zurich_household_data = hf.sanitize_df_column_names(zurich_household_data)
zurich_household_data.limit_info()
print("\nShowing a full row of the Zurich household DataFrame:")

zurich_household_data.sample().T


Total number of columns: 9
<class 'helper_functions.InfoDataFrame'>
RangeIndex: 2040 entries, 0 to 2039
Columns: 9 entries, StichTagDatJahr to AnzBestWir
dtypes: int64(6), object(3)
memory usage: 143.6+ KB

Only showing info for 8 columns, chosen at random.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2040 entries, 0 to 2039
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   KreisLang        2040 non-null   object
 1   AnzBestWir       2040 non-null   int64 
 2   QuarLang         2040 non-null   object
 3   QuarSort         2040 non-null   int64 
 4   hh_groesseSort   2040 non-null   int64 
 5   StichTagDatJahr  2040 non-null   int64 
 6   AnzHH            2040 non-null   int64 
 7   hh_groesseLang   2040 non-null   object
dtypes: int64(5), object(3)
memory usage: 127.6+ KB

Total number of columns: 9
<class 'helper_functions.InfoDataFrame'>
RangeIndex: 2040 entries, 0 to 2039
Columns: 9 entries, key_day_

Unnamed: 0,726
key_day_dat_year,2016
quar_sort,73
quar_lang,Hirslanden
circle_sort,7
circle_lang,Kreis 7
hh_size_sort,1
hh_size_lang,1 Person
number_hh,1727
number_we,1727


### Dataset Wrangling

Before diving into Exploratory Data Analysis (EDA), we need to prepare our datasets. This involves:
- Removing unnecessary columns
- Renaming columns for consistency
- Adding new columns
- Cleaning data (handling missing values, correcting datatypes, and standardizing data)

These steps will ensure our data is clean and well-structured, setting the stage for effective and accurate analysis in the EDA phase. We'll apply these steps to each dataset.
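The four steps can be sketched on a toy frame as follows. The raw column names follow the German originals seen in the loading outputs, while the English target names (`dog_count` in particular) and the counts are illustrative assumptions rather than the exact choices made in the cells that follow.

```python
import pandas as pd

# Toy frame; quarter codes and names are taken from the samples above, counts are made up
raw = pd.DataFrame(
    {
        "QuarCd": [13, 42],
        "QuarLang": ["Lindenhof", "Langstrasse"],
        "QuarSort": [13, 42],
        "AnzHunde": ["3", "12"],
    }
)

tidy = (
    raw.drop(columns=["QuarSort"])  # remove a redundant column
    .rename(
        columns={
            "QuarCd": "sub_district",
            "QuarLang": "neighborhood",
            "AnzHunde": "dog_count",
        }
    )  # rename for consistency across datasets
    .assign(
        # standardize the key as a zero-padded string so joins line up across datasets
        sub_district=lambda d: d["sub_district"].astype("string").str.zfill(3),
        # correct the datatype of the count column
        dog_count=lambda d: pd.to_numeric(d["dog_count"], errors="coerce"),
    )
)
print(tidy)
```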

#### Zurich Statistical Districts Geospatial Data

Additional steps for this dataset, beyond those listed above:

- calculating the area of each neighborhood
- dissolving the neighborhoods into districts, so that we can also analyze at the district level if needed
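Area values depend on the projection used. The cell below projects to Web Mercator (`ccrs.GOOGLE_MERCATOR`), which stretches areas increasingly with distance from the equator. The sketch here estimates that stretch at Zürich's latitude using the spherical Mercator approximation (area scale of about 1 / cos²(latitude)); the latitude is an approximate city-center value, and the Werd area is the figure shown in the notebook output. It is a rough sanity check, not a replacement for reprojecting to an equal-area or local CRS such as EPSG:2056 (LV95).

```python
import math


def web_mercator_area_scale(latitude_deg: float) -> float:
    """Approximate factor by which Web Mercator inflates areas at a given latitude."""
    return 1.0 / math.cos(math.radians(latitude_deg)) ** 2


zurich_latitude = 47.37  # approximate latitude of central Zürich, in degrees
scale = web_mercator_area_scale(zurich_latitude)

reported_area_km2 = 0.65893  # area of Werd as computed in Web Mercator
corrected_area_km2 = reported_area_km2 / scale

print(f"Area inflation factor at {zurich_latitude} deg N: {scale:.2f}")
print(f"Werd: {reported_area_km2:.3f} km2 in Web Mercator, about {corrected_area_km2:.2f} km2 after correction")
```

Because the inflation factor is nearly constant across a city of this size, relative comparisons between neighborhoods are barely affected, but absolute densities (dogs per km²) would be understated by roughly this factor.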

In [10]:
zurich_map_gdf = z_gdf["z_gdf_1"]

zurich_map_gdf.rename(
    columns={"qname": "neighborhood",
             "qnr": "sub_district", "knr": "district"},
    inplace=True,
)
# Format the sub_district column to have 3 digits
zurich_map_gdf["sub_district"] = zurich_map_gdf["sub_district"].astype(
    str).str.zfill(3)

# Create the refined geodataframe
neighborhood_gdf = zurich_map_gdf[
    ["neighborhood", "sub_district", "district", "geometry"]
].copy()

# Display geodataframe information and CRS
neighborhood_gdf.info()
display(neighborhood_gdf.crs)

# Display a sample entry from the transformed geodataframe
neighborhood_gdf.sample().T
# Load the geospatial data for calculation
zurich_calc_gdf = z_gdf["z_gdf_2"]

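# Calculate area in square kilometers (projected area in m² divided by 1e6) and add as a new column
# Note: Web Mercator inflates areas away from the equator (roughly 2x at Zürich's latitude);
# an equal-area or local CRS such as EPSG:2056 (LV95) would give truer absolute values.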
zurich_calc_gdf["area_km2"] = zurich_calc_gdf.to_crs(
    ccrs.GOOGLE_MERCATOR).area / 1e6

# Rename the column for consistency with the main geodataframe
zurich_calc_gdf = zurich_calc_gdf.rename(columns={"qname": "neighborhood"})

# Merge calculated features with the main geodataframe (neighborhood_gdf)
area_gdf = neighborhood_gdf.merge(
    zurich_calc_gdf[["neighborhood", "area_km2"]], on="neighborhood"
)

# Display a snapshot of the merged geodataframe
display(area_gdf.sample().T)


# Dissolve the neighborhoods into districts and compute each district's area
districts_gdf = (
    neighborhood_gdf.drop(columns=["neighborhood", "sub_district"])
    .dissolve(by="district")
    .reset_index()
)
districts_gdf["area_km2"] = districts_gdf.to_crs(
    ccrs.GOOGLE_MERCATOR).area / 1e6

display(districts_gdf.sample().T)
districts_gdf

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   neighborhood  34 non-null     object  
 1   sub_district  34 non-null     object  
 2   district      34 non-null     int64   
 3   geometry      34 non-null     geometry
dtypes: geometry(1), int64(1), object(2)
memory usage: 1.2+ KB


<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- undefined
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

Unnamed: 0,4
neighborhood,Werd
sub_district,041
district,4
geometry,"POLYGON ((8.5330051343 47.373942585, 8.5330964..."
area_km2,0.65893


Unnamed: 0,1
district,2
geometry,"POLYGON ((8.5198218089 47.3240112125, 8.519737..."
area_km2,24.085868


Unnamed: 0,district,geometry,area_km2
0,1,"POLYGON ((8.54195 47.37971, 8.54196 47.37972, ...",3.922455
1,2,"POLYGON ((8.51982 47.32401, 8.51974 47.32401, ...",24.085868
2,3,"POLYGON ((8.51943 47.35125, 8.51889 47.35111, ...",18.841344
3,4,"POLYGON ((8.53301 47.37394, 8.53299 47.37392, ...",6.333365
4,5,"POLYGON ((8.52834 47.38939, 8.52862 47.38919, ...",4.353097
5,6,"POLYGON ((8.54797 47.39915, 8.54801 47.39918, ...",11.124162
6,7,"POLYGON ((8.60185 47.37186, 8.60188 47.37178, ...",32.719294
7,8,"POLYGON ((8.56493 47.34636, 8.56458 47.34619, ...",10.466348
8,9,"POLYGON ((8.50127 47.37961, 8.50121 47.37957, ...",26.281411
9,10,"POLYGON ((8.52545 47.40667, 8.52574 47.40675, ...",19.822487


In [11]:
# Save the geodataframe to disk in the data folder
area_gdf.to_file("../data/zurich_neighborhoods.geojson")
districts_gdf.to_file("../data/zurich_districts.geojson")

#### Zurich Dogs Dataset

The original dataset had 32 columns, many redundant. We've picked 18 for our analysis:

- deadline_date_year
- holder_id
- age_v_10_cd
- sex_cd
- circle_cd
- quar_cd
- quar_lang
- race_1_text
- race_2_text
- breed_mixed__breed_cd
- breed_mongrel_long
- breed_mixed__breed_sort
- breed_type_cd
- birth_dog_year
- age_v_dog_cd
- sex_dog_cd
- dog_color_text
- number_of_dogs

From these columns we create a new dataset, `dog_data`, and we transform the columns in preparation for the EDA phase. These transformations include:
- converting the columns which only contain two different values to binary columns
- translating some values from German to English
- dealing with missing values
- standardizing some of the values for easier grouping.
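One preparatory step in the next cell expands each record into one row per dog, using the `number_of_dogs` column. A minimal sketch of that pattern on a toy frame (the owner IDs and breeds here are invented):

```python
import pandas as pd

# Toy records: the second owner has three identical dogs registered in one row
records = pd.DataFrame(
    {
        "owner_id": ["000001", "000002"],
        "breed_1": ["dachshund", "yorkshire terrier"],
        "number_of_dogs": [1, 3],
    }
)

# Repeat each index label as many times as the row's dog count, then select those labels
expanded = (
    records.loc[records.index.repeat(records["number_of_dogs"])]
    .drop(columns="number_of_dogs")  # the count is now encoded in the number of rows
    .reset_index(drop=True)
)
print(expanded)
```

After this step every row stands for exactly one dog, so counting rows per neighborhood gives the dog count directly.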

In [12]:
new_column_names = {
    "deadline_date_year": "roster",
    "holder_id": "owner_id",
    "age_v_10_cd": "age_group_10",
    "sex_cd": "owner_gender",
    "breed_type_cd": "dog_size",
    "age_v_dog_cd": "dog_age",
    "breed_mongrel_long": "mixed_type",
    "sex_dog_cd": "dog_gender",
    "dog_color_text": "dog_color",
    "race_1_text": "breed_1",
    "race_2_text": "breed_2",
    "circle_cd": "district",
    "quar_cd": "sub_district",
}

zurich_dog_data = zurich_dog_data.rename(columns=new_column_names)

# Adjust the data types of selected columns after renaming
zurich_dog_data["owner_id"] = zurich_dog_data["owner_id"].astype(
    "string").str.zfill(6)
zurich_dog_data["dog_age"] = zurich_dog_data["dog_age"].astype(int)
zurich_dog_data["district"] = zurich_dog_data["district"].astype(int)
zurich_dog_data["sub_district"] = (
    zurich_dog_data["sub_district"].astype("string").str.zfill(3)
)
# Repeat each row based on the number of dogs the row represents
zurich_dog_data = zurich_dog_data.loc[
    zurich_dog_data.index.repeat(zurich_dog_data["number_of_dogs"])
]
# drop the number of dogs column
zurich_dog_data.drop("number_of_dogs", axis=1, inplace=True)
# reset the index
zurich_dog_data.reset_index(drop=True, inplace=True)

print(
    f"Dataset now has {zurich_dog_data.shape[0]} rows and {zurich_dog_data.shape[1]} columns"
)

dog_data = zurich_dog_data[list(new_column_names.values())].copy()
display(
    dog_data.describe(include="all")
    .T.sort_values(by="unique")
    .infer_objects(copy=False)
    .fillna("")
)

Dataset now has 71212 rows and 31 columns


Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
dog_size,71212.0,4.0,K,43841.0,,,,,,,
mixed_type,71212.0,4.0,Rassehund,50926.0,,,,,,,
sub_district,71212.0,40.0,092,5556.0,,,,,,,
breed_2,71212.0,176.0,Keine,50926.0,,,,,,,
dog_color,71212.0,214.0,schwarz,7547.0,,,,,,,
breed_1,71212.0,394.0,Unbekannt,9109.0,,,,,,,
owner_id,71212.0,15504.0,105585,109.0,,,,,,,
roster,71212.0,,,,2019.282761,2.599665,2015.0,2017.0,2019.0,2022.0,2023.0
age_group_10,71212.0,,,,47.817545,56.048544,10.0,30.0,40.0,60.0,999.0
owner_gender,71212.0,,,,1.690108,0.462452,1.0,1.0,2.0,2.0,2.0


In [13]:
# Unique values for "mixed_type" column
breed_cat_list_de = zurich_dog_data["mixed_type"].unique().tolist()
print("Breed Categories (German):")
display(breed_cat_list_de)

# Create a dictionary for translation
breed_cat_dict = translate_list_to_dict(breed_cat_list_de)
print("\nBreed Category Dictionary (Translation):")
display(breed_cat_dict)

Breed Categories (German):


['Rassehund',
 'Mischling, beide Rassen bekannt',
 'Mischling, sekundäre Rasse unbekannt',
 'Mischling, beide Rassen unbekannt']


Breed Category Dictionary (Translation):


{'Rassehund': 'Pedigree dog',
 'Mischling, beide Rassen bekannt': 'Mixed breed, both breeds known',
 'Mischling, sekundäre Rasse unbekannt': 'Mixed breed, secondary breed unknown',
 'Mischling, beide Rassen unbekannt': 'Mixed breed, both breeds unknown'}

In [14]:
# Map 'mixed_type' to categories, rename for brevity, and define 'is_pure_breed'
dog_data["mixed_type"] = (
    dog_data["mixed_type"]
    .map(breed_cat_dict)
    .map(
        {
            "Pedigree dog": "PB",
            "Mixed breed, both breeds known": "BB",
            "Mixed breed, secondary breed unknown": "BU",
            "Mixed breed, both breeds unknown": "UU",
        }
    )
)
dog_data["is_pure_breed"] = dog_data["mixed_type"].eq("PB")

In [15]:
# Define owner and dog gender
dog_data["is_male_owner"] = dog_data["owner_gender"] == 1
dog_data["is_male_dog"] = dog_data["dog_gender"] == 1

# Drop the columns we just used to create the new columns
dog_data.drop(columns=["owner_gender", "dog_gender"], inplace=True)

In [16]:
# Unique values for dog colors
dog_colors = dog_data["dog_color"].str.lower().unique()

# Translate dog colors
dog_color_translations = translate_list_to_dict(dog_colors)
dog_data["dog_color_en"] = dog_data["dog_color"].str.lower().map(
    dog_color_translations)

# Unique values for breed_1
breeds_1 = dog_data["breed_1"].str.lower().unique()

# Translate breed_1
breed_1_translations = translate_list_to_dict(breeds_1)
dog_data["breed_1_en"] = dog_data["breed_1"].str.lower().map(
    breed_1_translations)

# Unique values for breed_2
breeds_2 = dog_data["breed_2"].str.lower().unique()

# Translate breed_2
breed_2_translations = translate_list_to_dict(breeds_2)
dog_data["breed_2_en"] = dog_data["breed_2"].str.lower().map(
    breed_2_translations)


##### Breed Standardization
To ensure consistency in the analysis, we standardize the breed names in the dataset. Because the "breed" column is free text that dog owners fill in during registration, the same breed can appear under several spellings. To address this, we use the dataframe collected in the previous notebook, which contains the breeds recognized by the FCI (Fédération Cynologique Internationale). For each recognized FCI breed, this dataframe has a column listing its name in different languages along with alternative, unofficial names.

This approach helps capture variations in breed names and facilitates grouping similar breeds together.
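
The idea behind the matching step can be sketched with the standard library alone. The helper `hf.apply_fuzzy_matching_to_breed_column` is not shown in this notebook, so what follows is an assumption about the general approach, not its implementation: it swaps in `difflib.SequenceMatcher` for `fuzz.WRatio`, and the function `match_breed`, the `cutoff` value, and the toy lookup table are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical lookup: standard breed name -> known alternative spellings
FCI_ALT_NAMES = {
    "poodle": ["poodle", "pudel", "caniche"],
    "dachshund": ["dachshund", "dackel", "teckel"],
    "german shepherd dog": ["german shepherd dog", "deutscher schäferhund"],
}


def match_breed(raw_name, alt_names=FCI_ALT_NAMES, cutoff=0.8):
    """Return the standard breed whose alternative name is most similar
    to raw_name, or None if no similarity reaches the cutoff."""
    query = raw_name.strip().lower()
    best_standard, best_score = None, 0.0
    for standard, names in alt_names.items():
        for name in names:
            # ratio() is a similarity score between 0.0 and 1.0
            score = SequenceMatcher(None, query, name).ratio()
            if score > best_score:
                best_standard, best_score = standard, score
    return best_standard if best_score >= cutoff else None


# An exact alternative name, a misspelling, and an unlisted breed
matches = {name: match_breed(name) for name in ["Pudel", "Dakel", "elo"]}
print(matches)
```

Names that stay unmatched come back as `None`, which mirrors how the notebook keeps a `NaN` mask and handles the leftovers (such as `elo`) by hand.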




In [17]:
# Get the FCI dataframe with the recognized breeds
fci_breeds = pd.read_json("../data/fci_breeds.json")
fci_breeds[["alt_names", "breed_en"]]

# Create a DataFrame with translated breed names
breeds_df = pd.DataFrame.from_dict(
    {**breed_1_translations, **breed_2_translations}, orient="index"
).reset_index()
breeds_df.columns = ["breed_de", "breed_en"]

# Initialize a "standard" column for breed standardization
breeds_df["standard"] = None
nan_mask = breeds_df["standard"].isna()

# Match on the German and English breed names in turn
# (the still-empty "standard" column is not used as a matching source)
for col in ["breed_de", "breed_en"]:
    matched_value = hf.apply_fuzzy_matching_to_breed_column(
        breeds_df.loc[nan_mask], col, fci_breeds, [fuzz.WRatio]
    )
    breeds_df.loc[nan_mask, "standard"] = matched_value[nan_mask]
    nan_mask = breeds_df["standard"].isna()

# Update the standard column for specific cases
breeds_df.loc[nan_mask, "standard"] = breeds_df.loc[nan_mask, "breed_en"]
breeds_df.loc[breeds_df["breed_de"] == "elo", "standard"] = "elo"
breeds_df.loc[breeds_df["breed_de"] == "keine", "standard"] = "none"
breeds_df.loc[breeds_df["breed_de"] == "mischling", "standard"] = "hybrid"

# Convert breed_1 and breed_2 to lowercase for merging
dog_data["breed_1"] = dog_data["breed_1"].str.lower()
dog_data["breed_2"] = dog_data["breed_2"].str.lower()

# Merge with the breeds_df for standardized breed names
dog_data = dog_data.merge(
    breeds_df.drop(columns=["breed_en"]),
    left_on="breed_1",
    right_on="breed_de",
    suffixes=("", "_1"),
)

dog_data = dog_data.merge(
    breeds_df.drop(columns=["breed_en"]),
    left_on="breed_2",
    right_on="breed_de",
    suffixes=("", "_2"),  # Add suffix to distinguish columns
)


##### Filtering Doodle Dogs

We flag the dogs with 'doodle' in their breed name. Doodles are a designer breed that is not yet officially recognized, so we convert these dogs to mixed breeds and update their breed information accordingly.


In [18]:
# Create a mask that flags the doodle dogs
doodle_mask = dog_data["breed_1"].str.contains("doodle", na=False, case=False)
print(f"Number of doodle dogs: {doodle_mask.sum()}")

# Convert them to mixed breeds with a poodle as the second breed
dog_data.loc[doodle_mask, "is_pure_breed"] = False
dog_data.loc[doodle_mask, "breed_2"] = "Pudel"
dog_data.loc[doodle_mask, "mixed_type"] = "BB"

# breed_1 was lowercased above, so compare the first letter case-insensitively
dog_data.loc[doodle_mask, "breed_1"] = dog_data.loc[doodle_mask, "breed_1"].apply(
    lambda x: "Golden Retriever" if x.lower().startswith("g") else "Labrador Retriever"
)

Number of doodle dogs: 27



Each row states how many dogs it represents in the `number_of_dogs` column. These are 'brothers and sisters': dogs that share the same owner and the same characteristics, e.g.

- `standard` (breed)
- `dog_color_en` (dog color), etc.


We expand the dataset to one dog per row by repeating each row as many times as its `number_of_dogs` value. We then reset the index so that every row has a unique index.
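
The row expansion described above can be sketched on toy data as follows. This is a minimal illustration under assumptions, not the notebook's own code: the frame `toy` and its values are made up, and only the column name `number_of_dogs` comes from the text.

```python
import pandas as pd

# Toy roster: each row is a group of identical dogs with the same owner
toy = pd.DataFrame(
    {
        "owner_id": [1, 2, 3],
        "standard": ["poodle", "dachshund", "shih tzu"],
        "number_of_dogs": [1, 3, 2],
    }
)

# Repeat each index label number_of_dogs times, then rebuild a unique index
expanded = toy.loc[toy.index.repeat(toy["number_of_dogs"])].reset_index(drop=True)

# Each row now stands for exactly one dog, so the count column can go
expanded = expanded.drop(columns="number_of_dogs")
print(expanded)
```

With one dog per row, counting rows per owner and roster gives the number of dogs that owner keeps in that year.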


In [19]:
# Calculate total dogs per owner and roster
dog_data["pet_count"] = dog_data.groupby(["owner_id", "roster"])["breed_1"].transform(
    "count"
)

##### Missing Values
At first glance the data appears to have no missing values, but a closer look shows that missing entries are encoded as placeholder values. We replace these placeholders with `NaN` so that they are not mistaken for real values. The one exception is `dog_size`, where the placeholder `UN` is replaced with `K`. Only a few rows are affected in `sub_district`, `dog_age`, and `district`, so we simply drop those rows. For the remaining column with missing values, `age_group_10` (the dog owner's age group), we:

- flag the missing values in `age_group_missing`,
- fill them from the same owner's other rosters where possible, using the owner's most frequent age group,
- fill whatever is still missing with `-1`.


Finally, we create `age_group_20`, which groups ages into 20-year increments, approximating the length of a generation.
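
To make the effect of these steps concrete, here is a small worked example on made-up data. It is a sketch under assumptions: the placeholder value `999`, the frame `toy`, and the helper `owner_mode` are hypothetical, and only the column names follow the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "roster": [2015, 2016, 2017, 2015],
        "owner_id": [1, 1, 1, 2],
        "age_group_10": [999, 50, 50, 999],  # 999 is a hypothetical placeholder
    }
)

# Turn implausible placeholder ages into real missing values
toy["age_group_10"] = toy["age_group_10"].where(toy["age_group_10"] <= 100)

# Record which rows were missing before any filling
toy["age_group_missing"] = toy["age_group_10"].isna().astype(int)


def owner_mode(ages):
    """Most frequent known age group for one owner, NaN if none is known."""
    modes = ages.mode()
    return modes.iloc[0] if not modes.empty else np.nan


# Fill from the owner's other rosters, then fall back to -1
owner_fill = toy.groupby("owner_id")["age_group_10"].transform(owner_mode)
toy["age_group_10"] = toy["age_group_10"].fillna(owner_fill).fillna(-1).astype(int)

# Collapse 10-year groups into 20-year groups, keeping -1 as the missing marker
toy["age_group_20"] = toy["age_group_10"].where(
    toy["age_group_10"] == -1, (toy["age_group_10"] // 20) * 20
)
print(toy)
```

Owner 1's missing 2015 entry is recovered from the later rosters, while owner 2 has no known age group and keeps the `-1` marker in both age columns.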


In [20]:
display(
    dog_data.describe(include="all")
    .T.sort_values(by="unique")
    .infer_objects(copy=False)
    .fillna("")
)

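| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| is_male_owner | 71212.0 | 2.0 | False | 49144.0 | | | | | | | |
| is_male_dog | 71212.0 | 2.0 | False | 35709.0 | | | | | | | |
| is_pure_breed | 71212.0 | 2.0 | True | 50902.0 | | | | | | | |
| dog_size | 71212.0 | 4.0 | K | 43841.0 | | | | | | | |
| mixed_type | 71212.0 | 4.0 | PB | 50902.0 | | | | | | | |
| sub_district | 71212.0 | 40.0 | 092 | 5556.0 | | | | | | | |
| standard_2 | 71212.0 | 133.0 | none | 50926.0 | | | | | | | |
| breed_2_en | 71212.0 | 173.0 | no | 50926.0 | | | | | | | |
| breed_de_2 | 71212.0 | 176.0 | keine | 50926.0 | | | | | | | |
| breed_2 | 71212.0 | 177.0 | keine | 50902.0 | | | | | | | |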


In [21]:
# Create a list of sub_districts to be used for validation
sub_districts_list = neighborhood_gdf["sub_district"].unique().tolist()

# Define a dictionary of conditions and corresponding columns to be updated
conditions = {
    "dog_size": dog_data["dog_size"] == "UN",
    "age_group_10": dog_data["age_group_10"] > 100,
    "district": dog_data["district"] > 12,
    "dog_age": dog_data["dog_age"] > 30,
    "sub_district": ~dog_data["sub_district"].isin(sub_districts_list),
}

# Identify and print unique breeds with 'UN' dog size
un_breeds = dog_data.loc[conditions["dog_size"], "breed_1"].unique()
print(f"Dogs breeds of those missing dog_size data:\n{un_breeds}")

# Replace 'UN' dog size with 'K' and other invalid values with NaN
for column, condition in conditions.items():
    dog_data.loc[condition, column] = "K" if column == "dog_size" else np.nan

# Display the number of NaN values in each column
print("\nNumber of NaN values in each column:")
print(dog_data.isna().sum().sort_values(ascending=False))

Dogs breeds of those missing dog_size data:
['unbekannt' 'podengo portugues klein' 'mischling']

Number of NaN values in each column:
age_group_10     227
sub_district      18
dog_age            8
district           4
roster             0
is_male_dog        0
standard_2         0
breed_de_2         0
standard           0
breed_de           0
breed_2_en         0
breed_1_en         0
dog_color_en       0
is_pure_breed      0
is_male_owner      0
owner_id           0
breed_2            0
breed_1            0
dog_color          0
mixed_type         0
dog_size           0
pet_count          0
dtype: int64


In [22]:
dog_data.dropna(subset=["dog_age", "district", "sub_district"], inplace=True)

In [26]:
# convert the numerical columns which had NaN values to int
dog_data["dog_age"] = dog_data["dog_age"].astype(int)
dog_data["district"] = dog_data["district"].astype(int)

In [27]:
# Create an indicator variable for missing 'age_group_10' values
dog_data["age_group_missing"] = dog_data["age_group_10"].isna().astype(int)

# Fill in the missing 'age_group_10' values
dog_data["age_group_10"] = dog_data["age_group_10"].fillna(
    dog_data.groupby("owner_id")["age_group_10"].transform(
        lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan
    )
)

dog_data["age_group_10"] = dog_data["age_group_10"].fillna(-1).astype(int)
dog_data["age_group_20"] = dog_data["age_group_10"].apply(
    lambda x: -1 if x == -1 else (x // 20) * 20
)

In [31]:
dog_data.sample(3)

| | roster | owner_id | age_group_10 | dog_size | dog_age | mixed_type | dog_color | breed_1 | breed_2 | district | sub_district | is_pure_breed | is_male_owner | is_male_dog | dog_color_en | breed_1_en | breed_2_en | breed_de | standard | breed_de_2 | standard_2 | pet_count | age_group_missing | age_group_20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27668 | 2018 | 136635 | 60 | K | 6 | PB | braun/weiss | shih tzu | keine | 11 | 111 | True | False | True | brown/white | shih tzu | no | shih tzu | shih tzu | keine | none | 1 | 0 | 60 |
| 60279 | 2022 | 157251 | 40 | K | 1 | PB | creme | zwergschnauzer | keine | 4 | 44 | True | False | True | cream | miniature schnauzer | no | zwergschnauzer | miniature schnauzer | keine | none | 1 | 0 | 40 |
| 25764 | 2018 | 123101 | 30 | K | 5 | PB | schwarz/braun | dachshund | keine | 11 | 119 | True | False | False | black/brown | dachshund | no | dachshund | dachshund | keine | none | 1 | 0 | 20 |


In [None]:
dog_data_columns_to_keep = [
    "roster",
    "owner_id",
    "dog_size",
    "dog_age",
    "age_group_10",
    "age_group_20",
    "mixed_type",
    "is_pure_breed",
    "is_male_owner",
    "is_male_dog",
    "dog_color_en",
    "standard",
    "standard_2",
    "pet_count",
    "district",
    "sub_district",
    "neighborhood",
    "age_group_missing",
]

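# Keep only the listed columns that exist in dog_data; "neighborhood" is not
# among the columns shown in the sample above and would raise a KeyError
columns_present = [
    col for col in dog_data_columns_to_keep if col in dog_data.columns
]

# Save the processed dog data to disk
dog_data[columns_present].to_csv("../data/processed_dog_data.csv", index=False)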

In [None]:
# Import the scaler and encoder used for the data transformation
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Work on a copy so that the cleaned dog_data stays untouched
df = dog_data.copy()

# Standardize the numerical columns (zero mean, unit variance)
numerical_columns = ["dog_age", "pet_count"]
scaler = StandardScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

# Encode the categorical columns as integer codes, one encoder per column
# so that each mapping can be inverted later
categorical_columns = ["dog_size", "mixed_type"]
encoders = {}
for column in categorical_columns:
    encoders[column] = LabelEncoder()
    df[column] = encoders[column].fit_transform(df[column])

# Display the transformed dataframe
df.head()

##### Visualizing the Cleaned and Transformed Data
We use matplotlib and seaborn to visualize the cleaned and transformed data and to look for patterns, trends, and outliers.

In [None]:
# Import matplotlib and seaborn for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Visualizing the distribution of numerical columns
for column in numerical_columns:
    plt.figure(figsize=(10, 6))
    sns.histplot(df[column], kde=True)
    plt.title(f'Distribution of {column}')
    plt.show()

# Visualizing the count of categorical columns
for column in categorical_columns:
    plt.figure(figsize=(10, 6))
    sns.countplot(x=df[column])
    plt.title(f'Count of {column}')
    plt.show()

# Visualizing the correlation between numerical columns
plt.figure(figsize=(10, 6))
sns.heatmap(df[numerical_columns].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation between numerical columns')
plt.show()

# Visualizing the relationship between numerical columns using pairplot
sns.pairplot(df[numerical_columns])
plt.show()

# Visualizing the relationship between categorical and numerical columns using boxplot
for numerical_column in numerical_columns:
    for categorical_column in categorical_columns:
        plt.figure(figsize=(10, 6))
        sns.boxplot(x=df[categorical_column], y=df[numerical_column])
        plt.title(f'Relationship between {categorical_column} and {numerical_column}')
        plt.show()