# Calculate the distance between all communes of France

In the previous tutorial, we used the **INSEE COG** to identify French communes. In this tutorial, we use the postal code instead. Most communes have a single postal code, but the mapping between `INSEE COG` and postal code is not strictly one-to-one: several communes can share one postal code, and a commune can have more than one postal code.


## Data source

The data used in this tutorial comes from: https://datanova.laposte.fr/datasets/laposte-hexasmal

From this page, you can download three files:

- `base-officielle-codes-postaux.csv` (click on `pièce jointe`): commune name, INSEE code officiel géographique (COG), postal code, and centroid of the commune.
- `019HexaSmal-full.csv` (click on `Téléchargement des données`): commune name, INSEE COG, postal code, and polygon of the commune.
- `019HexaSmal.csv` (click on `Téléchargement des données`): commune name, INSEE COG, and postal code only.

> This data is updated twice a year, and older releases are not available. If you have older data keyed by COG, its COG or postal codes may not fully match the current release.

## 0. Build sedona context

In this tutorial, we will use `sedona-1.7.2` for `spark 3.5.2` with `scala 2.12`. You can find the required jars in `jars/sedona-35-212-172`.

The geotools version is `28.5` for sedona-1.7.2.

In [19]:
from sedona.spark import *
from pyspark.sql import SparkSession, DataFrame
from pathlib import Path
from pyspark.sql.functions import trim, split, expr

In [2]:
# locate the project root (two levels above the current working directory)
project_root_dir = Path.cwd().parent.parent

In [4]:
# collect every jar in the local folder so the session can start offline
jar_folder = project_root_dir / "jars" / "sedona-35-212-172"
jar_list = [str(jar) for jar in jar_folder.iterdir() if jar.is_file()]
jar_path = ",".join(jar_list)

# build a sedona session (sedona = 1.7.2) offline
spark = SparkSession.builder \
    .appName("SedonaParquetExample") \
    .master("local[*]") \
    .config("spark.jars", jar_path) \
    .getOrCreate()

In [5]:
# create a sedona context
sedona = SedonaContext.create(spark)

In [6]:
# get the spark context
sc = sedona.sparkContext

# use UTF-8 as the default charset
sc.setSystemProperty("sedona.global.charset", "utf8")

## 1. Explore the base-officielle-codes-postaux

In [7]:
data_dir = project_root_dir / "data"
commune_file_path = data_dir / "parquet" / "fr_commune_code_postal" / "base-officielle-codes-postaux.parquet"

In [9]:
commune_df = spark.read.parquet(commune_file_path.as_posix())

In [10]:
commune_df.show(5)

+------------------+--------------------+-----------+----------------------+-------+--------------------+
|code_commune_insee|   nom_de_la_commune|code_postal|libelle_d_acheminement|ligne_5|           _geopoint|
+------------------+--------------------+-----------+----------------------+-------+--------------------+
|             01001|L ABERGEMENT CLEM...|      01400|  L ABERGEMENT CLEM...|   NULL|46.15170180297285...|
|             01002|L ABERGEMENT DE V...|      01640|  L ABERGEMENT DE V...|   NULL|46.00713099777772...|
|             01004|   AMBERIEU EN BUGEY|      01500|     AMBERIEU EN BUGEY|   NULL|45.95747066471399...|
|             01005| AMBERIEUX EN DOMBES|      01330|   AMBERIEUX EN DOMBES|   NULL|45.99922938293103...|
|             01006|             AMBLEON|      01300|               AMBLEON|   NULL|45.74831432147182...|
+------------------+--------------------+-----------+----------------------+-------+--------------------+
only showing top 5 rows



In [11]:
commune_df.printSchema()

root
 |-- code_commune_insee: string (nullable = true)
 |-- nom_de_la_commune: string (nullable = true)
 |-- code_postal: string (nullable = true)
 |-- libelle_d_acheminement: string (nullable = true)
 |-- ligne_5: string (nullable = true)
 |-- _geopoint: string (nullable = true)



### 1.1 Trim leading and trailing spaces in the string columns

In [13]:
def trim_all_string_columns(df: DataFrame) -> DataFrame:
    """Strip leading and trailing whitespace from every string column."""
    for col_name, dtype in df.dtypes:
        if dtype == 'string':
            df = df.withColumn(col_name, trim(col_name))
    return df

In [14]:
clean_commune_df = trim_all_string_columns(commune_df)

In [15]:
clean_commune_df.show(5, truncate=False)

+------------------+-----------------------+-----------+-----------------------+-------+------------------------------------+
|code_commune_insee|nom_de_la_commune      |code_postal|libelle_d_acheminement |ligne_5|_geopoint                           |
+------------------+-----------------------+-----------+-----------------------+-------+------------------------------------+
|01001             |L ABERGEMENT CLEMENCIAT|01400      |L ABERGEMENT CLEMENCIAT|NULL   |46.15170180297285,4.930600521664882 |
|01002             |L ABERGEMENT DE VAREY  |01640      |L ABERGEMENT DE VAREY  |NULL   |46.00713099777772,5.42467488805381  |
|01004             |AMBERIEU EN BUGEY      |01500      |AMBERIEU EN BUGEY      |NULL   |45.957470664713995,5.370568254510258|
|01005             |AMBERIEUX EN DOMBES    |01330      |AMBERIEUX EN DOMBES    |NULL   |45.99922938293103,4.911871787269484 |
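|01006             |AMBLEON                |01300      |AMBLEON                |NULL   |45.74831432147182,5.592784714407381 |
+------------------+-----------------------+-----------+-----------------------+-------+------------------------------------+
only showing top 5 rows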

### 1.2 Convert string to geometry

In [17]:
geo_col_name = "_geopoint"
tmp_df = clean_commune_df.withColumn("lat", split(geo_col_name, ",").getItem(0).cast("double")) \
    .withColumn("lon", split(geo_col_name, ",").getItem(1).cast("double"))


In [23]:
rename_map = {
    "code_commune_insee": "code_insee",
    "nom_de_la_commune": "commune_name"
}
# ST_Point takes (x, y), i.e. (longitude, latitude)
geo_df = (
    tmp_df.withColumn("centroid", expr("ST_Point(lon, lat)"))
    .select("code_commune_insee", "nom_de_la_commune", "code_postal", "centroid")
    .withColumnsRenamed(rename_map)
)
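
The `_geopoint` strings store latitude first, while `ST_Point` and the resulting WKT put longitude first. The plain-Python sketch below illustrates that swap on a single value; `geopoint_to_wkt` is a hypothetical helper written for this illustration, not part of Sedona, and the range check is an added assumption about what counts as a valid coordinate.

```python
def geopoint_to_wkt(geopoint: str) -> str:
    """Turn a 'lat,lon' string into a WKT point written as 'POINT (lon lat)'."""
    lat_text, lon_text = (part.strip() for part in geopoint.split(","))
    lat, lon = float(lat_text), float(lon_text)  # raises ValueError on bad input
    # Assumed sanity check: reject values that cannot be a latitude/longitude.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinates out of range: {geopoint!r}")
    # Reuse the original text so no digits are lost or reformatted.
    return f"POINT ({lon_text} {lat_text})"


wkt = geopoint_to_wkt("46.15170180297285,4.930600521664882")
print(wkt)
```

Swapping the two values by mistake places French communes far from France, so a check like this on a few rows can catch the error early.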



In [24]:
geo_df.show(5, truncate=False)

+----------+-----------------------+-----------+--------------------------------------------+
|code_insee|commune_name           |code_postal|centroid                                    |
+----------+-----------------------+-----------+--------------------------------------------+
|01001     |L ABERGEMENT CLEMENCIAT|01400      |POINT (4.930600521664882 46.15170180297285) |
|01002     |L ABERGEMENT DE VAREY  |01640      |POINT (5.42467488805381 46.00713099777772)  |
|01004     |AMBERIEU EN BUGEY      |01500      |POINT (5.370568254510258 45.957470664713995)|
|01005     |AMBERIEUX EN DOMBES    |01330      |POINT (4.911871787269484 45.99922938293103) |
|01006     |AMBLEON                |01300      |POINT (5.592784714407381 45.74831432147182) |
+----------+-----------------------+-----------+--------------------------------------------+
only showing top 5 rows
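
As a quick plausibility check on the centroids, a great-circle (haversine) distance can be computed in plain Python. This is only a sketch: it is not the method used in this tutorial, it gives a straight-line distance rather than a road distance, and `haversine_km` and the Earth radius constant are assumptions introduced here.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_km(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in kilometres between two lon/lat points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Centroids of the first two communes shown above (01001 and 01002).
d = haversine_km(4.930600521664882, 46.15170180297285,
                 5.42467488805381, 46.00713099777772)
print(f"{d:.1f} km")
```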



In [25]:
geo_df.printSchema()

root
 |-- code_insee: string (nullable = true)
 |-- commune_name: string (nullable = true)
 |-- code_postal: string (nullable = true)
 |-- centroid: geometry (nullable = true)



In [28]:
output_path = data_dir / "tmp" / "fr_commune_code_postal_centroid_geopoint"
geo_df.coalesce(1).write \
    .mode("overwrite") \
    .format("geoparquet") \
    .option("geometry", "centroid") \
    .option("crs", "EPSG:4326") \
    .save(output_path.as_posix())

### 1.3 Prepare for osrm distance

Because the OSRM API only accepts coordinates as strings, we need to prepare a table with separate latitude and longitude columns.
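
To illustrate why the coordinates end up as text, here is a minimal sketch that builds a request URL for the OSRM `table` service, which expects `lon,lat` pairs separated by `;` in the path. The base URL `http://localhost:5000`, the helper name `osrm_table_url`, and the six-decimal rounding are assumptions made for this example.

```python
def osrm_table_url(base_url: str, points: list[tuple[float, float]],
                   profile: str = "driving") -> str:
    """Build an OSRM table-service URL from (lon, lat) pairs."""
    # OSRM expects longitude first, then latitude, separated by a comma.
    coords = ";".join(f"{lon:.6f},{lat:.6f}" for lon, lat in points)
    return f"{base_url.rstrip('/')}/table/v1/{profile}/{coords}?annotations=distance"


# (lon, lat) of the first two communes; the host is a hypothetical local server.
url = osrm_table_url(
    "http://localhost:5000",
    [(4.930600521664882, 46.15170180297285), (5.42467488805381, 46.00713099777772)],
)
print(url)
```

Keeping `lon` and `lat` as separate double columns makes it easy to format them this way when the requests are built.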

In [29]:
tmp_df.show(5, truncate=False)

+------------------+-----------------------+-----------+-----------------------+-------+------------------------------------+------------------+-----------------+
|code_commune_insee|nom_de_la_commune      |code_postal|libelle_d_acheminement |ligne_5|_geopoint                           |lat               |lon              |
+------------------+-----------------------+-----------+-----------------------+-------+------------------------------------+------------------+-----------------+
|01001             |L ABERGEMENT CLEMENCIAT|01400      |L ABERGEMENT CLEMENCIAT|NULL   |46.15170180297285,4.930600521664882 |46.15170180297285 |4.930600521664882|
|01002             |L ABERGEMENT DE VAREY  |01640      |L ABERGEMENT DE VAREY  |NULL   |46.00713099777772,5.42467488805381  |46.00713099777772 |5.42467488805381 |
|01004             |AMBERIEU EN BUGEY      |01500      |AMBERIEU EN BUGEY      |NULL   |45.957470664713995,5.370568254510258|45.957470664713995|5.370568254510258|
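|01005             |AMBERIEUX EN DOMBES    |01330      |AMBERIEUX EN DOMBES    |NULL   |45.99922938293103,4.911871787269484 |45.99922938293103 |4.911871787269484|
|01006             |AMBLEON                |01300      |AMBLEON                |NULL   |45.74831432147182,5.592784714407381 |45.74831432147182 |5.592784714407381|
+------------------+-----------------------+-----------+-----------------------+-------+------------------------------------+------------------+-----------------+
only showing top 5 rows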

In [30]:
commune_df = tmp_df.select("code_commune_insee", "nom_de_la_commune", "code_postal", "lon", "lat").withColumnsRenamed(rename_map)
commune_df.show(5, truncate=False)

+----------+-----------------------+-----------+-----------------+------------------+
|code_insee|commune_name           |code_postal|lon              |lat               |
+----------+-----------------------+-----------+-----------------+------------------+
|01001     |L ABERGEMENT CLEMENCIAT|01400      |4.930600521664882|46.15170180297285 |
|01002     |L ABERGEMENT DE VAREY  |01640      |5.42467488805381 |46.00713099777772 |
|01004     |AMBERIEU EN BUGEY      |01500      |5.370568254510258|45.957470664713995|
|01005     |AMBERIEUX EN DOMBES    |01330      |4.911871787269484|45.99922938293103 |
|01006     |AMBLEON                |01300      |5.592784714407381|45.74831432147182 |
+----------+-----------------------+-----------+-----------------+------------------+
only showing top 5 rows



In [31]:
commune_df.printSchema()

root
 |-- code_insee: string (nullable = true)
 |-- commune_name: string (nullable = true)
 |-- code_postal: string (nullable = true)
 |-- lon: double (nullable = true)
 |-- lat: double (nullable = true)



In [32]:
distinct_code_postal_count = commune_df.select("code_postal").distinct().count()

In [33]:
total_count = commune_df.count()

In [34]:
distinct_code_insee_count = commune_df.select("code_insee").distinct().count()

In [35]:
print(f"insee code count: {distinct_code_insee_count}")
print(f"total row count: {total_count}")
print(f"postal code count: {distinct_code_postal_count}")

insee code count: 35007
total row count: 39192
postal code count: 6328
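
These counts show more rows than distinct INSEE codes and far fewer postal codes than INSEE codes, so the mapping between the two codes is not one-to-one. The sketch below shows one way to list the exceptions from `(code_insee, code_postal)` pairs in plain Python; the helper `mapping_exceptions` and the sample codes are made up for illustration and do not come from the dataset.

```python
from collections import defaultdict


def mapping_exceptions(pairs):
    """Return (INSEE codes with several postal codes, postal codes shared by several INSEE codes)."""
    postal_by_insee = defaultdict(set)
    insee_by_postal = defaultdict(set)
    for insee, postal in pairs:
        postal_by_insee[insee].add(postal)
        insee_by_postal[postal].add(insee)
    multi_postal = {k: sorted(v) for k, v in postal_by_insee.items() if len(v) > 1}
    shared_postal = {k: sorted(v) for k, v in insee_by_postal.items() if len(v) > 1}
    return multi_postal, shared_postal


# Hypothetical sample pairs, not real codes.
sample_pairs = [("00001", "00100"), ("00001", "00200"),
                ("00002", "00100"), ("00003", "00300")]
multi_postal, shared_postal = mapping_exceptions(sample_pairs)
print(multi_postal)
print(shared_postal)
```

On the full table, the same idea can be expressed with a `groupBy` on each code followed by a distinct count of the other.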


In [36]:
output_path = data_dir / "tmp" / "fr_commune_code_postal_centroid_double"
commune_df.coalesce(1).write.mode("overwrite").parquet(output_path.as_posix())
