# Do prices signal construction activity?

# 0. Preprocessing and geocoding the needed dataset

The following notebook presents how we preprocessed the datasets "Construction of residentials" and "Construction of non-residentials". We do not advise the reader to run that code, as it will probably crash. However, the code used to produce the CSV files used below is available [here](0.%20preprocessing-geocoding.ipynb).

# 1. Processing of the "Construction of residentials" dataset

## Importing libraries and dataset

In [None]:
import sys 
!{sys.executable} -m pip install xlrd
!{sys.executable} -m pip install h3
!{sys.executable} -m pip install "folium>=0.12" matplotlib mapclassify
!{sys.executable} -m pip install openpyxl

In [None]:
import pandas as pd
import re
from matplotlib import pyplot as plt 
import seaborn as sns
import geopandas as gpd
from shapely.geometry import Point
import numpy as np
import h3

In [None]:
df = pd.read_csv("https://minio.lab.sspcloud.fr/mligeret1/constructions_resid_geocoded_cleaned.csv",sep=",",on_bad_lines="warn")


In [None]:
df = df.iloc[1:] #we delete the first line that indicates the code for the variable 
df.head(2) #we check that everything is good 

## Grouping columns together
To do so, we group columns whose names share a common keyword.

In [None]:
for c in df.columns: #we look at every column in the dataset 
    print(c)

In [None]:
col_metadata = [x for x in df.columns if "DAU" in x] + ["Unnamed: 0"]
col_meta_location = [x for x in df.columns if "lieu des travaux" in x]
col_dates_travaux = [x for x in df.columns if "Date" in x]
col_demandeurs = [x for x in df.columns if "demandeur" in x ]
col_precise_location = [x for x in df.columns if "du terrain" in x and ("Superficie du terrain" not in x)] 
col_cadastres = [x for x in df.columns if "cadastr" in x]
col_construction_details = [x for x in df.columns if "Présence" in x or "Indicateur" in x] #presence of a feature / indicator in the project 
col_number_created_housings_details = [x for x in df.columns if "logements" in x and "créés" in x and "Nombre total de logements créés" not in x]
col_details_transf = [x for x in df.columns if "Surface" in x]
col_from_geocodage = [i for i in df.columns if "result_" in i]

col_irrelevant = col_metadata + col_dates_travaux + col_precise_location + col_cadastres + col_construction_details + col_number_created_housings_details + col_details_transf + col_from_geocodage
col_relevant =  [c for c in df.columns if c not in col_irrelevant] 
for c in col_relevant: #we check again what's left in the group of columns col_relevant 
    print(c)

In [None]:
df.loc[:,col_relevant].sample()

In [None]:
#was run to observe where the missing values were 
#plt.figure(figsize=(20,10))
#sns.heatmap(df.isna(), cbar=False)

## Arranging columns types 


In [None]:
dico_variables = pd.read_excel("https://data.statistiques.developpement-durable.gouv.fr/dido/api/files/ab799b04-0b03-4f96-949c-eb23c478a8e8")

In [None]:
dico_variables.head(2)

### Retrieve from the dictionary 

In [None]:
def variable_types(ligne):
    if ("Année" in ligne["Description de la variable"]):
        return None
    if "Alphanumérique" in ligne["Format"]:
        return "string"
    if "Numérique" in ligne["Format"]:
        return "float64"
        
dico_variables["Format_python"] = dico_variables.apply(variable_types, axis=1)

In [None]:
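dtype_map = dict(zip(dico_variables["Description de la variable"], dico_variables["Format_python"]))
#we keep only the columns present in df that have a known target type (year columns are handled separately below)
dtype_map = {col : python_type for col, python_type in dtype_map.items() if col in df.columns and pd.notna(python_type)}
df = df.astype(dtype_map, errors="ignore")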

In [None]:
df.dtypes.to_frame().style

### Setting dates 

In [None]:
date_dpc = "Date (mois) de prise en compte (DPC) du premier évènement reçu dans Sitadel (dépôt de la demande ou autorisation)"
print(df[date_dpc].value_counts().head()) #to verify that dates follow the format %Y-%m, i.e. a 4-digit year, a hyphen, then the month
df[date_dpc] = pd.to_datetime(df[date_dpc], format="%Y-%m", errors="coerce")
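
As a minimal sketch of the parsing step above, on made-up values (not taken from the dataset): with `errors="coerce"`, any entry that does not match the expected format becomes `NaT` instead of raising an error, so it is worth counting how many values were lost.

```python
import pandas as pd

# Toy values (hypothetical): two well-formed months and one malformed entry
raw = pd.Series(["2019-03", "2021-11", "not a date"])

# Entries that do not match "%Y-%m" become NaT rather than raising an error
parsed = pd.to_datetime(raw, format="%Y-%m", errors="coerce")

# Counting the NaT values shows how many entries could not be parsed
n_unparsed = int(parsed.isna().sum())
print(parsed.tolist(), n_unparsed)
```

A large number of unparsed values would suggest that the format string does not match the data.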


We repeat the operation for the other columns that involve date/year dtypes


In [None]:
for col_date in col_dates_travaux:
    print(df[col_date].value_counts().to_frame()) #to verify the format of the date

In [None]:
col_date_format_ymd = [col_dates_travaux[i] for i in [0,1,2,3]] #we select the columns that have year/month/day
col_date_format_ym = [col_dates_travaux[i] for i in [4,5,6]] #the one that only have year and month 

In [None]:
for col in col_date_format_ymd:
    print(df[col])
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d")

In [None]:
for col in col_date_format_ym:
    print(df[col])
    df[col] = pd.to_datetime(df[col], format="%Y-%m")


In [None]:
col = "Année de dépôt de la DAU"
print(df[col])
df[col] = df[col].astype("int64") #we assign the type int64 to years columns  

### Completing the last columns

In [None]:
df = df.astype({col : "float64" for col in col_details_transf + ["Superficie du terrain"]})

df = df.astype({col : "string" for col in col_meta_location + col_cadastres + ["Code zone opératoire"]})
df = df.astype({col : "string" for col in ["Adresse_complete"]})


In [None]:
df.dtypes.to_frame().style #expand output if one wants to see the list of columns and assigned types

## Cleaning per column



### Cleaning dates 

In [None]:
df.loc[:,col_dates_travaux].sample(5)

We look at the date columns to see if they have coherent values, which seems to be the case 

In [None]:
#first columns besides Année de dépôt de la DAU 
df[df[col_dates_travaux].gt(pd.Timestamp.today()).any(axis=1)][col_dates_travaux+["Année de dépôt de la DAU"]].head(20)

It is however not the case for the column `Année de dépôt de la DAU`

In [None]:
df[df["Année de dépôt de la DAU"]>2025][col_dates_travaux+["Année de dépôt de la DAU"]].sample(1)


In [None]:
ligns_gt_today = df["Année de dépôt de la DAU"]>2025 
#we take those lines and replace the "Année de dépôt de la DAU" with the year from other columns which are valid
df.loc[ligns_gt_today,"Année de dépôt de la DAU"] = df.loc[ligns_gt_today,"Date (mois) de prise en compte (DPC) du premier évènement reçu dans Sitadel (dépôt de la demande ou autorisation)"].dt.year

In [None]:
df[df["Année de dépôt de la DAU"]>2025][col_dates_travaux+["Année de dépôt de la DAU"]].head(20)

We would like to identify the years for which there is too little data to build reliable time series.

In [None]:
df["Année de dépôt de la DAU"].value_counts().sort_index().plot.bar()

We drop every year up to and including 2012, as well as 2025, the current (incomplete) year.

In [None]:
ligns_few_values = (df["Année de dépôt de la DAU"]<=2012) | (df["Année de dépôt de la DAU"] == 2025)
df = df.drop(df.index[ligns_few_values])

In [None]:
df["Année de dépôt de la DAU"].value_counts().sort_index().plot.bar()

### Cleaning Departments 

In [None]:
df["Code du département du lieu des travaux - Code de la zone"].nunique()

We see that there are 102 departments instead of 101. As in the previous notebook, we suspect that one department is coded twice: once with one digit (2, Aisne) and once with two (02, Aisne). We harmonize the codes by zero-padding them to two characters.

In [None]:
df.loc[df["Code du département du lieu des travaux - Code de la zone"].str.len()==1,"Code du département du lieu des travaux - Code de la zone"] 
df["Code du département du lieu des travaux - Code de la zone"]=df["Code du département du lieu des travaux - Code de la zone"].str.zfill(2)
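
The harmonization above can be illustrated on made-up codes (a sketch, not the real column): `str.zfill(2)` left-pads short codes with zeros and leaves longer or alphanumeric codes unchanged.

```python
import pandas as pd

# Hypothetical codes: "2" and "02" denote the same department; "2A" is a Corsican code
codes = pd.Series(["2", "02", "75", "2A", "971"], dtype="string")

# Left-pad with zeros to a minimum width of 2; longer codes are left unchanged
harmonized = codes.str.zfill(2)

print(codes.nunique(), harmonized.nunique())
```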

We separate the overseas departments (DOM) from the rest, as they are often absent from the other databases we will use.

In [None]:
df_dom = df.loc[df["Code du département du lieu des travaux - Code de la zone"].str.startswith("97")] 
df = df.drop(df_dom.index)
df_dom.groupby(["Code du département du lieu des travaux - Code de la zone", "Année de dépôt de la DAU"]).size().unstack(0).sort_index().plot()


### Replacing numeric category codes with their labels

In [None]:
destinations = [
    "habitation",
    "hébergement hôtelier",
    "bureaux",
    "commerce",
    "artisanat",
    "industrie",
    "agriculture",
    "entrepôt",
    "service public ou d'intérêt collectif"
]

dict_destination_principale = {key: value for key, value in zip(range(1, 10), destinations)}


In [None]:
df["Destination principale"] = pd.to_numeric(df["Destination principale"],errors="coerce")
df["Destination principale"] = df["Destination principale"].map(dict_destination_principale) #unlike a direct dict lookup, map leaves missing or unknown codes as NaN instead of raising KeyError
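
A small sketch, on toy codes, of how a code-to-label replacement behaves when some codes are missing or unknown. The dictionary below is only an excerpt and the values are made up.

```python
import numpy as np
import pandas as pd

labels = {1: "habitation", 2: "hébergement hôtelier"}  # excerpt of the mapping
codes = pd.Series([1.0, 2.0, np.nan, 12.0])  # toy codes with a missing and an unknown value

# Series.map returns NaN for values absent from the dict (including NaN itself)
mapped = codes.map(labels)

# A direct dict lookup raises KeyError on the same unknown code
try:
    labels[12.0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(mapped.tolist(), lookup_failed)
```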


In [None]:
df["Catégorie du demandeur (maître d’ouvrage) selon Sitadel"].value_counts()


It is more practical to keep this column as it is, since the number allows us to identify the category.

In [None]:
previous_use = [
    "logements",
    "hébergement hôtelier",
    "bureaux",
    "commerce",
    "artisanat",
    "industrie",
    "agriculture",
    "entrepôt",
    "service public ou d'intérêt collectif"
]

dict_previous_use = {key: value for key, value in zip(range(1, 10), previous_use)}


In [None]:
df["Type principal des locaux d’origine transformés"] = df["Type principal des locaux d’origine transformés"].map(dict_previous_use)


### Some boolean indicators 

In [None]:
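bool_map = {
    "True": True,
    "False": False,
    True: True,
    False: False,
}

for col in col_construction_details:
    if df[col].dtypes != "bool":
        #the nullable "boolean" dtype keeps missing values as <NA>; a plain astype(bool) would silently turn them into True
        df[col] = df[col].map(bool_map).astype("boolean")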

for col in col_construction_details:
    print(df[col].dtypes) #to check if everything is set as bool
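
A minimal sketch, on toy values, of why the choice of boolean dtype matters when an indicator has missing entries: a cast to plain `bool` treats `NaN` as `True`, which inflates the share of `True`, whereas the nullable `"boolean"` dtype keeps the value missing.

```python
import numpy as np
import pandas as pd

raw = pd.Series(["True", "False", np.nan], dtype="object")  # toy indicator column
mapped = raw.map({"True": True, "False": False})             # NaN stays NaN

as_plain_bool = mapped.astype(bool)       # NaN is truthy, so it becomes True
as_nullable = mapped.astype("boolean")    # nullable dtype keeps it as <NA>

# The two means differ because the missing value is counted as True in the first case
print(as_plain_bool.mean(), as_nullable.mean())
```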

In [None]:
#share of projects for which each indicator is True, at the national level
df[col_construction_details].mean().sort_values().plot.barh(figsize=(6, 6))

plt.xlabel("Share of True")
plt.tight_layout()
plt.show()


In [None]:
def compare_indicators(depcode): #compares a department with the rest of the country, to identify the specificities of the given department
    dep_col = "Code du département du lieu des travaux - Code de la zone"
    compare = pd.DataFrame({
        f"Dept {depcode}": df.loc[df[dep_col] == depcode, col_construction_details].mean(),
        "Rest of France": df.loc[df[dep_col] != depcode, col_construction_details].mean(),
    })
    ax = (
        compare
        .sort_values(f"Dept {depcode}")
        .plot.barh(figsize=(8, 6))
    )

    ax.set_xlabel("Share of projects with the characteristic")
    ax.set_title("Comparison of construction characteristics")
    plt.tight_layout()
    plt.show()


## Descriptive statistics 

### What was destroyed to create these new housing units?

In [None]:
df["Type principal des locaux d’origine transformés"].value_counts()

### Creating the Construction activity Time Series

In [None]:
df["Année de dépôt de la DAU"].value_counts().sort_index().plot()


In [None]:
df["Date"] = df["Date (mois) de prise en compte (DPC) du premier évènement reçu dans Sitadel (dépôt de la demande ou autorisation)"]

In [None]:
df[df["Date"].dt.year<=2012][col_dates_travaux+["Année de dépôt de la DAU"]] 
#we check why there are still values from 2012 even though they were supposed to be deleted 
#we see that it's due to the lag between requesting the right to construct and the time it was registered in the database

### What are the trends?

#### ... some preliminary visualization

In [None]:
#we only display the plot here; the table activity_per_department is built further below
df.groupby(["Code du département du lieu des travaux - Code de la zone", "Année de dépôt de la DAU"]).size().unstack(0).sort_index().plot(legend=False)

It's very difficult to read ...

In [None]:
df.groupby(["Code du département du lieu des travaux - Code de la zone", "Année de dépôt de la DAU"]).size().unstack(0).sort_index().plot(
    subplots=True,
    layout=(10,10),
    figsize=(15,15),
    legend=True 
)


In [None]:
deps = gpd.read_file("https://raw.githubusercontent.com/gregoiredavid/france-geojson/master/departements.geojson")

In [None]:
def add_department_name(df):
    df = df.merge(
        deps[["code", "nom"]],
        left_on="Code du département du lieu des travaux - Code de la zone",
        right_on="code",
        how="left"
    )

    df["Code du département du lieu des travaux - Code de la zone"] = (
        df["code"].astype(str) + " - " + df["nom"]
    )

    df = (
        df
        .drop(columns=["code", "nom"])
        .set_index("Code du département du lieu des travaux - Code de la zone")
    )

    return df

In [None]:
activity_per_department = df.groupby(["Code du département du lieu des travaux - Code de la zone", "Année de dépôt de la DAU"]).size().unstack(1)
add_department_name(activity_per_department).sample(4)

This is still not very convenient to read and analyze. Let's try to make it clearer, this time in a more direct fashion, i.e. by clustering departments with similar trajectories.

#### Clustering groups of departments

In [None]:
X = activity_per_department.fillna(0)
X = (X - X.mean(axis=1).values[:, None]) / X.std(axis=1).values[:, None]
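
The row-wise standardization above can be sketched on toy counts (hypothetical departments "A" and "B"). Each row is centred on its own mean and scaled by its own standard deviation, so two departments with the same trajectory but very different volumes end up with identical curves.

```python
import pandas as pd

# Toy counts (hypothetical): two departments with the same shape but different scale
counts = pd.DataFrame(
    {2013: [10, 1000], 2014: [20, 2000], 2015: [30, 3000]},
    index=["A", "B"],
)

# Subtract each row's mean and divide by each row's standard deviation
z = counts.sub(counts.mean(axis=1), axis=0).div(counts.std(axis=1), axis=0)
print(z)
```

This is why the clustering groups departments by the shape of their trajectory rather than by their size. A department with a constant series would have a zero standard deviation and produce missing values.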

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0)
labels = kmeans.fit_predict(X.values)
activity_per_department["cluster"] = labels


In [None]:
activity_per_department

In [None]:
for k in sorted(activity_per_department["cluster"].unique()):
    plt.figure(figsize=(6,4))
    for _, row in activity_per_department[activity_per_department["cluster"] == k].drop(columns="cluster").iterrows():
        plt.plot(row.index, row.values, alpha=0.4) #we plot against the actual years rather than positions 0, 1, 2...
    plt.title(f"cluster {k}")
    plt.xlabel("Year")
    plt.ylabel("Construction activity")
    plt.show()

In [None]:
plt.figure(figsize=(8,5))
for _, row in X.iterrows():
    plt.plot(row.index, row.values, alpha=0.2) #we plot against the actual years rather than positions 0, 1, 2...
plt.title("Curves after normalisation")
plt.xlabel("year")
plt.ylabel("Normalized value")
plt.show()

#### Quantile analysis 

In [None]:
activity_per_department.head()


In [None]:
q = 10
activity_per_department["quantile_2013"] = pd.qcut(
    activity_per_department[2013],
    q=q,
    labels=[f"Q{i}" for i in range(1,q+1)]
)


In [None]:
fig, (axQ2,axQ10) = plt.subplots(1, 2, figsize=(14, 7))
#we drop the helper columns (decile label and, if present, cluster label) so that only yearly counts are plotted
helper_cols = ["quantile_2013", "cluster"]
activity_per_department.loc[activity_per_department["quantile_2013"] == "Q2"].drop(columns=helper_cols, errors="ignore").T.plot(legend=True, ax=axQ2)
activity_per_department.loc[activity_per_department["quantile_2013"] == "Q10"].drop(columns=helper_cols, errors="ignore").T.plot(legend=True, ax=axQ10)

axQ2.legend(
    title="Departments of the 2nd decile",
    bbox_to_anchor=(1.05, 1),
    loc="upper left"
)
axQ10.legend(
    title="Departments of the 10th decile",
    bbox_to_anchor=(1.05, 1),
    loc="upper left"
)


In [None]:
(
    activity_per_department
    .drop(columns=["quantile_2013", "cluster"], errors="ignore") #the cluster label is not a yearly count
    .groupby(activity_per_department["quantile_2013"])
    .mean()
    .T
    .plot()
)


### In which departments did we build the most?

Let's first determine, for a given year, the departments in which we built the most.

#### Q1: For the year 2024, in which department did we build the most?

In [None]:
construction_activity2024 = activity_per_department[2024].sort_values().reset_index()

In [None]:
construction_activity2024

#### -> Q1 : Let's represent departments on a map 

In [None]:
deps = gpd.read_file("https://raw.githubusercontent.com/gregoiredavid/france-geojson/master/departements.geojson")


In [None]:
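deps["code"] = deps["code"].astype(str).str.zfill(2) #we make sure department codes are zero-padded strings, as in df (the merge below uses the "code" column, not the index)
deps.head()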


In [None]:
construction_activity2024 = construction_activity2024.merge(
    deps,
    left_on="Code du département du lieu des travaux - Code de la zone",
    right_on="code",
    how="left"
)

construction_activity2024.head(5)


In [None]:
construction_activity2024 = gpd.GeoDataFrame(
    construction_activity2024,
    geometry="geometry",
    crs=deps.crs
)

In [None]:
construction_activity2024.plot(
    column=2024,
    cmap="OrRd",
    legend=True,
    figsize=(10, 10),
    edgecolor="black"
) 

#### -> Q1 What's happening in Loire-Atlantique (department 44, Nantes area)?

In [None]:
df.loc[df["Code du département du lieu des travaux - Code de la zone"] == "44", "Type principal des locaux d’origine transformés"].value_counts()


It seems that people are transforming old houses and land to create new housing (however, the same is true everywhere else in the country). Looking at some indicators might be useful.

In [None]:

compare_indicators("44")

#### Q2: In which department was growth the strongest?

In [None]:
activity_per_department = df.groupby(["Code du département du lieu des travaux - Code de la zone", "Année de dépôt de la DAU"]).size().unstack(1)

In [None]:
growth_per_department = activity_per_department.pct_change(axis=1)
growth_per_department = growth_per_department.mean(axis=1).reset_index()
growth_per_department = growth_per_department.merge(
    deps,
    left_on="Code du département du lieu des travaux - Code de la zone",
    right_on="code",
    how="left"
)
growth_per_department = growth_per_department.rename(columns={0:"Avg_growth_rate"})
growth_per_department = gpd.GeoDataFrame(
    growth_per_department,
    geometry="geometry",
    crs=deps.crs
)
growth_per_department.plot(
    column="Avg_growth_rate",
    cmap="OrRd",
    legend=True,
    figsize=(10, 10),
    edgecolor="black"
) 

In [None]:
growth_per_department.sort_values(by="Avg_growth_rate", ascending=False).head(10)
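
One caveat about averaging yearly percentage changes, sketched on a toy series (made-up numbers): the arithmetic mean of year-on-year changes can be positive even when activity ends exactly where it started. The compound annual growth rate, computed here as an alternative and not used in this notebook, does not have this property.

```python
import pandas as pd

# Toy series (hypothetical): activity halves, then doubles back to its starting level
activity = pd.Series([100.0, 50.0, 100.0], index=[2013, 2014, 2015])

# Arithmetic mean of year-on-year changes: (-50% + 100%) / 2
mean_pct_change = activity.pct_change().mean()

# Compound annual growth rate between the first and last year
n_periods = len(activity) - 1
cagr = (activity.iloc[-1] / activity.iloc[0]) ** (1 / n_periods) - 1

print(mean_pct_change, cagr)
```

Departments with volatile or small yearly counts are therefore likely to rank high on the averaged measure.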

In [None]:
add_department_name(activity_per_department.iloc[85:90])

#### Total number of housing units created
Strictly speaking, the above results identify the departments with the highest number of housing construction projects, rather than the number of housing units actually created. Consequently, a high level of activity may reflect a large number of relatively small projects rather than fewer, large-scale developments. To fully assess construction intensity, it is therefore necessary to examine the distribution of project sizes.

In [None]:
thereshold_big_project = 4
#we compute the count first: reusing double quotes inside an f-string is a syntax error before Python 3.12
n_big_projects = df.loc[df["Nombre total de logements créés"]>thereshold_big_project, col_relevant].shape[0]
print(f"Number of big construction projects: {n_big_projects}")
print(f"Proportion of housing creation projects that are 'big construction projects': {n_big_projects/df.shape[0]*100}%")

Instead of testing every value, we can visualize the effect of the threshold.

In [None]:
percentages = [(df.loc[df["Nombre total de logements créés"] > i, col_relevant].shape[0] / df.shape[0] * 100) for i in range(1, 20)]
pd.DataFrame({"threshold": range(1, 20), "percentage": percentages}).plot(x="threshold", y="percentage")
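
The threshold sweep above can be sketched on a handful of made-up project sizes: for each threshold `t`, we compute the share of projects that create strictly more than `t` units. By construction the curve can only decrease as the threshold rises.

```python
import numpy as np

# Toy project sizes (hypothetical), in housing units per project
sizes = np.array([1, 1, 1, 2, 1, 6, 1, 30, 1, 1])

thresholds = np.arange(1, 20)
# For each threshold t, share (in %) of projects creating strictly more than t units
share_above = [float((sizes > t).mean() * 100) for t in thresholds]

print(list(zip(thresholds.tolist(), share_above))[:6])
```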

Depending on the definition used, we estimate that about 5% of housing creation projects are carried out by real estate developers (we can check whether this matches the information provided by the column "Catégorie du demandeur (maître d’ouvrage) selon Sitadel", i.e. the category of the builder).


In [None]:
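#computed outside the f-string: reusing double quotes inside an f-string is a syntax error before Python 3.12
share_private = df["Catégorie du demandeur (maître d’ouvrage) selon Sitadel"].astype(str).str.startswith("1").mean()*100
print(f"Proportion of projects that were launched by private individuals : {share_private}%")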

In [None]:
developer_built_units = int(df.loc[df["Nombre total de logements créés"]>thereshold_big_project,"Nombre total de logements créés"].sum())
private_built_units = int(df.loc[df["Nombre total de logements créés"]<=thereshold_big_project,"Nombre total de logements créés"].sum())
print(f"Proportion of housings that were built as part of development projects {int(developer_built_units/(developer_built_units+private_built_units)*100)}%")
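
To make the distinction between projects and housing units concrete, here is a toy example (made-up sizes, illustrative threshold): a small share of projects can account for most of the units created.

```python
import pandas as pd

# Toy project sizes (hypothetical): nine single houses and one 31-unit development
units = pd.Series([1] * 9 + [31])
threshold = 4  # illustrative cut-off between small and big projects

big = units > threshold
share_of_projects = big.mean() * 100                     # 1 project out of 10
share_of_units = units[big].sum() / units.sum() * 100    # 31 units out of 40

print(round(share_of_projects, 1), round(share_of_units, 1))
```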

We verify that each housing creation project contains at least one created house

In [None]:
int(df["Nombre total de logements créés"].min())

We can now wonder if the departments where developers chose to build are the same where the general population (i.e. the remaining 95%) chose to build

In [None]:
developer_built_per_departments = df.loc[df["Nombre total de logements créés"]>thereshold_big_project].groupby("Code du département du lieu des travaux - Code de la zone")["Nombre total de logements créés"].sum().to_frame()


In [None]:
add_department_name(developer_built_per_departments)["Nombre total de logements créés"].sort_values(ascending=False).head(20).plot.barh()

In [None]:
private_built_per_departments = df.loc[df["Nombre total de logements créés"]<=thereshold_big_project].groupby("Code du département du lieu des travaux - Code de la zone")["Nombre total de logements créés"].sum().to_frame() #<= so that projects exactly at the threshold are not left out of both groups


In [None]:
add_department_name(private_built_per_departments)["Nombre total de logements créés"].sort_values(ascending=False).head(20).plot.barh()

#### Controlling for population
What are the results if we control for population, i.e. if we look at the number of new housing units per additional inhabitant between 2011 and 2025?

In [None]:
def deps_pop_init():
    deps_pop = pd.read_excel("https://www.insee.fr/fr/statistiques/fichier/2012713/TCRD_004.xlsx")
    deps_pop.columns = deps_pop.iloc[2]
    deps_pop = deps_pop.iloc[3:]
    deps_pop = deps_pop.reset_index()
    deps_pop.columns = ["to_del", "code", "deps_name", "2025", "share_pop","2022", "2016","2011","1999"]
    deps_pop = deps_pop.loc[0:101,["code","2011","2025"]] #"code" stays a regular column because add_department_pop merges on it
    return deps_pop

deps_pop = deps_pop_init()

In [None]:
def add_department_pop(df):
    df = df.reset_index()
    df = df.merge(
        deps_pop[["code", "2011", "2025"]],
        left_on="Code du département du lieu des travaux - Code de la zone",
        right_on="code",
        how="left"
    )
    df = df.drop(columns=["code"]).set_index("Code du département du lieu des travaux - Code de la zone")
    return df

In [None]:
developer_built_per_departments_pop = add_department_pop(developer_built_per_departments)
developer_built_per_departments_pop["Nb housings created per new person"] = developer_built_per_departments_pop["Nombre total de logements créés"]/(developer_built_per_departments_pop["2025"]-developer_built_per_departments_pop["2011"])

In [None]:
private_built_per_departments_pop = add_department_pop(private_built_per_departments)
private_built_per_departments_pop["Nb housings created per new person"] = private_built_per_departments_pop["Nombre total de logements créés"]/(private_built_per_departments_pop["2025"]-private_built_per_departments_pop["2011"])
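
A sketch, on made-up figures, of one pitfall of this ratio: it is only easy to interpret where population actually grew. With zero growth the ratio is infinite, and with a shrinking population it is negative. One possible treatment (an assumption, not what the notebook does) is to keep the ratio only where the population change is positive.

```python
import pandas as pd

# Toy figures (hypothetical): units created and population in 2011 and 2025
toy = pd.DataFrame(
    {
        "units": [500, 500, 500],
        "2011": [10_000, 10_000, 10_000],
        "2025": [11_000, 10_000, 9_500],
    },
    index=["growing", "stable", "shrinking"],
)

pop_change = toy["2025"] - toy["2011"]
# Keep the ratio only where population actually grew; elsewhere mark it as missing
ratio = (toy["units"] / pop_change).where(pop_change > 0)

print(ratio)
```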

In [None]:
add_department_name(developer_built_per_departments_pop["Nb housings created per new person"].sort_values(ascending=False).head(20).to_frame().reset_index())

In [None]:
add_department_name(developer_built_per_departments_pop["Nb housings created per new person"].sort_values(ascending=True).head(10).to_frame().reset_index())

In [None]:
plt.scatter(
    developer_built_per_departments_pop["2025"] 
    - developer_built_per_departments_pop["2011"],
    developer_built_per_departments_pop["Nombre total de logements créés"]
)



In [None]:
developer_built_per_departments_pop.loc[developer_built_per_departments_pop["2025"] - developer_built_per_departments_pop["2011"]<-100000]

In [None]:
add_department_name(developer_built_per_departments_pop.loc[developer_built_per_departments_pop["2025"] - developer_built_per_departments_pop["2011"]>200000])

In [None]:
plt.scatter(
    private_built_per_departments_pop["2025"] 
    - private_built_per_departments_pop["2011"],
    private_built_per_departments_pop["Nombre total de logements créés"]
)



In [None]:
private_built_per_departments_pop["pop_growth"] = private_built_per_departments_pop["2025"].astype("float")- private_built_per_departments_pop["2011"].astype("float")

In [None]:
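#log1p is undefined below -1: departments whose population declined get NaN here and drop out of the regression plot
private_built_per_departments_pop["pop_growth"] = np.log1p(private_built_per_departments_pop["pop_growth"] )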


In [None]:
private_built_per_departments_pop["Nombre total de logements créés"] = np.log1p(private_built_per_departments_pop["Nombre total de logements créés"])


In [None]:
import seaborn as sns

sns.regplot(
    x="pop_growth",
    y="Nombre total de logements créés",
    data=private_built_per_departments_pop,
    ci=95
)


### In which places did we build the most? (more granular analysis)

#### Where the most housing construction projects were launched

In [None]:
df['Geometry'] = df.apply(
    lambda x: Point(x["longitude"], x["latitude"]) 
    if pd.notna(x["longitude"]) and pd.notna(x["latitude"]) #explicit check: NaN is truthy and a valid coordinate of 0.0 is falsy
    else None,
    axis = 1
)

gdf = gpd.GeoDataFrame(
    df,              # the data
    geometry="Geometry",     # the geometry column
    crs='EPSG:4326'       # coordinate reference system (WGS 84, longitude/latitude)
)
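
A short sketch of why a plain truthiness test is not a reliable way to detect missing coordinates. The values are made up: `NaN` is truthy in Python, while a legitimate coordinate of exactly `0.0` is falsy.

```python
import math

lon, lat = float("nan"), 48.85  # toy coordinates with a missing longitude

# NaN is truthy, so a plain truthiness test does not detect it
passes_truthiness = bool(lon and lat)

# An explicit NaN check does
passes_nan_check = not (math.isnan(lon) or math.isnan(lat))

# Conversely, a valid longitude of exactly 0.0 fails the truthiness test
valid_zero_rejected = not bool(0.0 and 44.0)

print(passes_truthiness, passes_nan_check, valid_zero_rejected)
```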

In [None]:
gdf = gdf.to_crs(epsg=2154)  # reprojection to Lambert-93 (coordinates in metres) for the hexbin plot


In [None]:
gdf_clean = gdf[
    (gdf.geometry.x.between(200_000, 1_200_000)) &
    (gdf.geometry.y.between(6_000_000, 7_200_000))
]


In [None]:
fig, ax = plt.subplots(figsize=(7, 7))

graph_heatmap_resid = ax.hexbin(
    gdf_clean.geometry.x,
    gdf_clean.geometry.y,
    gridsize=120,
    cmap="inferno",
    mincnt=5,
    bins="log" 
)

ax.set_aspect("equal")
ax.set_axis_off()

plt.colorbar(graph_heatmap_resid, label="Points per hexagon")
plt.title("Number of construction projects from 2013 to 2024")
plt.show()


Note that the previous chart uses a logarithmic colour scale: a change from dark purple to pink represents about 100 times more points, and a change from purple to yellow about 1,000 times more.

#### Where the highest number of housing units were built

In [None]:
col_relevant =  [c for c in df.columns if c not in col_irrelevant] #we refresh the list to include the newly created geometry column 
df.loc[:,col_relevant].sample()

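We then select a resolution for the hexagons. Resolution 7 seems to be a good choice, as each hexagon then covers an average area of roughly 5 km². Knowing that there are approximately 1 million points in the database and that metropolitan France covers roughly 550,000 km² (about 100,000 hexagons), each hexagon will have on the order of ten points on average.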
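
As a rough order-of-magnitude check for the choice of resolution, the arithmetic can be written out as follows. All three inputs are approximations: the average H3 cell area at resolution 7 (about 5.16 km²), the area of metropolitan France, and the number of rows in the dataset.

```python
# Rough orders of magnitude; all three inputs are approximations
avg_hex_area_km2 = 5.16      # average H3 cell area at resolution 7
france_area_km2 = 551_500    # metropolitan France, approximate
n_points = 1_000_000         # approximate number of rows in the dataset

# Number of hexagons needed to tile the territory, and average points per hexagon
n_hexagons = france_area_km2 / avg_hex_area_km2
points_per_hexagon = n_points / n_hexagons

print(round(n_hexagons), round(points_per_hexagon, 1))
```

In practice the points are concentrated in urban areas, so many hexagons will contain far fewer points than this average and a few will contain far more.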

In [None]:
gdf = gpd.GeoDataFrame(
    df,              # the data
    geometry="Geometry",     # the geometry column
    crs='EPSG:4326'       # back to longitude/latitude, which h3 expects
)

In [None]:
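#h3 expects (latitude, longitude): with Point(longitude, latitude), latitude is coord.y and longitude is coord.x
#resolution 7 as stated in the text above (its cell ids start with "87", the prefix checked in section 3)
gdf["hexagon"] = gdf.geometry.apply(lambda coord : h3.latlng_to_cell(coord.y,coord.x,7) if coord is not None else None)

In [None]:
construction_per_hexagon_total = (gdf.groupby("hexagon")["Nombre total de logements créés"].sum().reset_index(name="Number_construction_2013-2024"))
construction_per_hexagon_total.sample(5)

In [None]:
import h3
import geopandas as gpd
from shapely.geometry import Polygon

def h3_get_polygon(hexagon):
    #cell_to_boundary returns (latitude, longitude) pairs, whereas shapely expects (x=longitude, y=latitude)
    return Polygon([(lng, lat) for lat, lng in h3.cell_to_boundary(hexagon)])

construction_per_hexagon_total["geometry"] = construction_per_hexagon_total["hexagon"].apply(h3_get_polygon)
construction_per_hexagon_total = gpd.GeoDataFrame(construction_per_hexagon_total, geometry="geometry", crs="EPSG:4326")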

In [None]:
construction_per_hexagon_total["log_Number_construction_2013-2024"] = np.log1p(construction_per_hexagon_total["Number_construction_2013-2024"])
construction_per_hexagon_total.explore(
    column="log_Number_construction_2013-2024", 
    cmap="viridis",     
    location=[46.6, 2.5],  # France
    zoom_start=6,)

In [None]:
construction_per_hexagon = gdf.groupby(["hexagon", "Année de dépôt de la DAU"])["Nombre total de logements créés"].sum().unstack(1).sort_index()
construction_per_hexagon = construction_per_hexagon.fillna(0)
construction_per_hexagon

In [None]:
import numpy as np
log_growth_per_hexagon = np.log1p(construction_per_hexagon).diff(axis=1)
log_growth_per_hexagon  = log_growth_per_hexagon.mean(axis=1).sort_values(ascending=False).to_frame().reset_index()
log_growth_per_hexagon = log_growth_per_hexagon.rename(columns={0:"Avg_growth_rate"})
log_growth_per_hexagon
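
A property of this measure worth keeping in mind, sketched on toy counts for a single hexagon (made-up numbers): the mean of successive log differences telescopes, so it depends only on the first and last years, not on what happens in between.

```python
import numpy as np
import pandas as pd

# Toy yearly counts for one hexagon (hypothetical)
counts = pd.Series([3, 0, 12, 7], index=[2013, 2014, 2015, 2016])

log_counts = np.log1p(counts)
mean_log_change = log_counts.diff().mean()   # the first difference is NaN and is skipped

# The differences telescope, so only the first and last years matter
endpoint_version = (log_counts.iloc[-1] - log_counts.iloc[0]) / (len(counts) - 1)

print(mean_log_change, endpoint_version)
```

A hexagon with an unusually low first year or an unusually high last year will therefore show strong average growth whatever its intermediate values.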

In [None]:
log_growth_per_hexagon["geometry"] = log_growth_per_hexagon["hexagon"].apply(h3_get_polygon)
log_growth_per_hexagon = gpd.GeoDataFrame(log_growth_per_hexagon, geometry="geometry", crs="EPSG:4326")
log_growth_per_hexagon

In [None]:
log_growth_per_hexagon.explore(
    column="Avg_growth_rate",
    cmap="RdBu_r",
    vmin=-0.05,
    vmax=0.05,
    tiles="CartoDB positron",
    legend=True,
    legend_kwds={"caption": "Log change (narrowed scale)"},
    location=[46.6, 2.5],  # France
    zoom_start=6,
)


# 2. Processing of the "Real estate transaction price" dataset

In this section, we refer to document 2. 
It was put in a separate notebook because of its size. 
Once you reach the end of that notebook, please come back here and start section 3. 

# 3. Modeling price-house building relationship

Are real estate prices high where people create the most housing?
We assume that the only reason why prices differ from one place to another is a difference in demand. 
We therefore take prices as an indicator of aggregate demand.

## Granular level 


In [None]:
pm2_growth_per_hex = pd.read_csv("pm2_growth_per_hex.csv")


In [None]:
main_reg = (
    log_growth_per_hexagon
    .reset_index(drop=True)
    .assign(hexagon=lambda d: d["hexagon"].astype(str).str.strip())
    .merge(
        pm2_growth_per_hex
        .reset_index(drop=True)
        .assign(h3_index=lambda d: d["h3_index"].astype(str).str.strip()),
        left_on="hexagon",
        right_on="h3_index",
        how="inner"
    )
)


In [None]:
log_growth_per_hexagon["hexagon"]

In [None]:
log_growth_per_hexagon["hexagon"].str.startswith("877").sum() #number of hexagon ids starting with "877" (nunique on a boolean Series would only return 1 or 2)

In [None]:
pm2_growth_per_hex["h3_index"].str.startswith("877").sum()

In [None]:
main_reg = log_growth_per_hexagon.merge(pm2_growth_per_hex, left_on="hexagon", right_on="h3_index", how="left")

In [None]:
main_reg

In [None]:
set(log_growth_per_hexagon["hexagon"]).intersection(
    set(pm2_growth_per_hex["h3_index"])
)


In [None]:
log_growth_per_hexagon["hexagon"] = log_growth_per_hexagon["hexagon"].astype(str)
pm2_growth_per_hex["h3_index"] = pm2_growth_per_hex["h3_index"].astype(str)


In [None]:
pm2_growth_per_hex["h3_index"].str.len().describe()

In [None]:
log_growth_per_hexagon["hexagon"].str.len().describe()
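
The key-overlap checks above can also be done in a single step with the `indicator` option of `merge`. The sketch below uses toy frames: the cell ids and the `price_growth` column are hypothetical, only the key column names follow the notebook.

```python
import pandas as pd

# Toy frames (hypothetical cell ids): only "b" is present on both sides
left = pd.DataFrame({"hexagon": ["a", "b", "c"], "Avg_growth_rate": [0.1, 0.0, -0.2]})
right = pd.DataFrame({"h3_index": ["b", "d"], "price_growth": [0.03, 0.05]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
check = left.merge(
    right, left_on="hexagon", right_on="h3_index", how="outer", indicator=True
)
overlap = check["_merge"].value_counts()

print(overlap)
```

If almost every row falls in `left_only` or `right_only`, the two tables were probably not indexed with the same H3 resolution or the same coordinate order.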


## Department level : Analyzing prices and real estate transactions 

In [None]:
nb_transac_per_dep = pd.read_csv("nb_transac_per_dep.csv")

# 4. Processing of the "Creation of non residential buildings" dataset

There should have been a last section looking at the correlation between the creation of non-residential buildings and the creation of residential buildings, but it was not completed due to a lack of time.

However, part of the work was started:
- the database was cleaned in a similar fashion to the "Construction of residentials" database, so that part may not be of much interest to the reader
- the descriptive statistics, however, differ and may be of interest to the reader