# Calculate distance between all communes of France

In the previous tutorial, we used the **INSEE COG** to identify French communes. In this tutorial, we use the code postal instead. In most cases one `INSEE COG` corresponds to one `code postal`, but there are exceptions: a commune can have several codes postaux, and several communes can share the same one.


## Data source

The data source which we use in this tutorial is from: https://datanova.laposte.fr/datasets/laposte-hexasmal

From this page, you can download three files:

- base-officielle-codes-postaux.csv (click on `pièce jointe`): commune name, INSEE code officiel géographique (COG), code postal, and centroid of the commune.
- 019HexaSmal-full.csv (click on `Téléchargement des données`): commune name, INSEE code officiel géographique (COG), code postal, and polygon of the commune.
- 019HexaSmal.csv (click on `Téléchargement des données`): only commune name, INSEE code officiel géographique (COG), and code postal.

> This data is updated twice a year, and older releases cannot be found on the site. So if you have old data keyed by COG, the new COG or code postal may not be a 100% match.

## 0. Build sedona context

In this tutorial, we will use `sedona-1.7.2` for `spark 3.5.2` with `scala 2.12`. You can find the required jars in `jars/sedona-35-212-172`.

The geotools version is `28.5` for sedona-1.7.2.

In [1]:
from sedona.spark import *
from pyspark.sql import SparkSession, DataFrame
from pathlib import Path
from pyspark.sql.functions import trim, split, expr, col

In [2]:
# locate the project root (two levels above the current working directory)
project_root_dir = Path.cwd().parent.parent

In [3]:
jar_folder = Path(f"{project_root_dir}/jars/sedona-35-212-172")
jar_list = [str(jar) for jar in jar_folder.iterdir() if jar.is_file()]
jar_path = ",".join(jar_list)

# build a sedona session (sedona = 1.7.2) offline
spark = SparkSession.builder \
    .appName("build_extra_routes_with_spec_commune") \
    .master("local[*]") \
    .config("spark.jars", jar_path) \
    .getOrCreate()

In [4]:
# create a sedona context
sedona = SedonaContext.create(spark)

In [5]:
# get the spark context
sc = sedona.sparkContext

# use utf as default encoding
sc.setSystemProperty("sedona.global.charset", "utf8")

## 1. Explore the base-officielle-codes-postaux

In [7]:
data_dir = project_root_dir / "data"
commune_file_path = data_dir / "parquet" / "fr_commune_code_postal" / "base-officielle-codes-postaux.parquet"

In [9]:
commune_df = spark.read.parquet(commune_file_path.as_posix())

In [10]:
commune_df.show(5)

+------------------+--------------------+-----------+----------------------+-------+--------------------+
|code_commune_insee|   nom_de_la_commune|code_postal|libelle_d_acheminement|ligne_5|           _geopoint|
+------------------+--------------------+-----------+----------------------+-------+--------------------+
|             01001|L ABERGEMENT CLEM...|      01400|  L ABERGEMENT CLEM...|   NULL|46.15170180297285...|
|             01002|L ABERGEMENT DE V...|      01640|  L ABERGEMENT DE V...|   NULL|46.00713099777772...|
|             01004|   AMBERIEU EN BUGEY|      01500|     AMBERIEU EN BUGEY|   NULL|45.95747066471399...|
|             01005| AMBERIEUX EN DOMBES|      01330|   AMBERIEUX EN DOMBES|   NULL|45.99922938293103...|
|             01006|             AMBLEON|      01300|               AMBLEON|   NULL|45.74831432147182...|
+------------------+--------------------+-----------+----------------------+-------+--------------------+
only showing top 5 rows



In [11]:
commune_df.printSchema()

root
 |-- code_commune_insee: string (nullable = true)
 |-- nom_de_la_commune: string (nullable = true)
 |-- code_postal: string (nullable = true)
 |-- libelle_d_acheminement: string (nullable = true)
 |-- ligne_5: string (nullable = true)
 |-- _geopoint: string (nullable = true)



### 1.1 Trim leading and trailing spaces in the string columns

In [13]:
def trim_all_string_columns(df: DataFrame) -> DataFrame:
    for col_name, dtype in df.dtypes:
        if dtype == 'string':
            df = df.withColumn(col_name, trim(col_name))
    return df

In [14]:
clean_commune_df = trim_all_string_columns(commune_df)

In [15]:
clean_commune_df.show(5, truncate=False)

+------------------+-----------------------+-----------+-----------------------+-------+------------------------------------+
|code_commune_insee|nom_de_la_commune      |code_postal|libelle_d_acheminement |ligne_5|_geopoint                           |
+------------------+-----------------------+-----------+-----------------------+-------+------------------------------------+
|01001             |L ABERGEMENT CLEMENCIAT|01400      |L ABERGEMENT CLEMENCIAT|NULL   |46.15170180297285,4.930600521664882 |
|01002             |L ABERGEMENT DE VAREY  |01640      |L ABERGEMENT DE VAREY  |NULL   |46.00713099777772,5.42467488805381  |
|01004             |AMBERIEU EN BUGEY      |01500      |AMBERIEU EN BUGEY      |NULL   |45.957470664713995,5.370568254510258|
|01005             |AMBERIEUX EN DOMBES    |01330      |AMBERIEUX EN DOMBES    |NULL   |45.99922938293103,4.911871787269484 |
|01006             |AMBLEON                |01300      |AMBLEON                |NULL   |45.74831432147182,5.5927847144

### 1.2 Convert string to geometry

In [17]:
geo_col_name = "_geopoint"
tmp_df = clean_commune_df.withColumn("lat", split(geo_col_name, ",").getItem(0).cast("double")) \
    .withColumn("lon", split(geo_col_name, ",").getItem(1).cast("double"))


In [23]:
rename_map = {
    "code_commune_insee": "code_insee",
    "nom_de_la_commune": "commune_name"
}
# ST_Point takes (x, y), i.e. (lon, lat)
geo_df = (
    tmp_df.withColumn("centroid", expr("ST_Point(lon,lat)"))
    .select("code_commune_insee", "nom_de_la_commune", "code_postal", "centroid")
    .withColumnsRenamed(rename_map)
)



In [24]:
geo_df.show(5, truncate=False)

+----------+-----------------------+-----------+--------------------------------------------+
|code_insee|commune_name           |code_postal|centroid                                    |
+----------+-----------------------+-----------+--------------------------------------------+
|01001     |L ABERGEMENT CLEMENCIAT|01400      |POINT (4.930600521664882 46.15170180297285) |
|01002     |L ABERGEMENT DE VAREY  |01640      |POINT (5.42467488805381 46.00713099777772)  |
|01004     |AMBERIEU EN BUGEY      |01500      |POINT (5.370568254510258 45.957470664713995)|
|01005     |AMBERIEUX EN DOMBES    |01330      |POINT (4.911871787269484 45.99922938293103) |
|01006     |AMBLEON                |01300      |POINT (5.592784714407381 45.74831432147182) |
+----------+-----------------------+-----------+--------------------------------------------+
only showing top 5 rows
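
Note the axis order in the output above: `_geopoint` stores `lat,lon`, while `ST_Point` and the WKT text use `x y`, i.e. `lon lat`. The plain-Python sketch below mirrors the split-and-cast logic outside Spark. `parse_geopoint` and `to_wkt_point` are hypothetical helper names used for illustration only; they are not part of Sedona.

```python
from typing import Tuple


def parse_geopoint(geopoint: str) -> Tuple[float, float]:
    """Split a 'lat,lon' string into a (lat, lon) pair of floats."""
    lat_str, lon_str = geopoint.split(",")
    return float(lat_str.strip()), float(lon_str.strip())


def to_wkt_point(lat: float, lon: float) -> str:
    """Render a WKT point; WKT uses x y order, i.e. lon first, then lat."""
    return f"POINT ({lon} {lat})"


lat, lon = parse_geopoint("46.15170180297285,4.930600521664882")
wkt = to_wkt_point(lat, lon)
print(wkt)
```

Swapping the two values is an easy mistake to make here, and it would silently place the points outside France.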



In [25]:
geo_df.printSchema()

root
 |-- code_insee: string (nullable = true)
 |-- commune_name: string (nullable = true)
 |-- code_postal: string (nullable = true)
 |-- centroid: geometry (nullable = true)



In [28]:
output_path = data_dir / "tmp" / "fr_commune_code_postal_centroid_geopoint"
geo_df.coalesce(1).write \
    .mode("overwrite") \
    .format("geoparquet") \
    .option("geometry", "centroid") \
    .option("crs", "EPSG:4326") \
    .save(output_path.as_posix())

### 1.3 Prepare for osrm distance

The OSRM API only takes coordinates as text in the request, so we prepare the matrix with separate latitude and longitude columns instead of a geometry column.
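
To make the target format concrete, here is a minimal sketch of how one row of the matrix could be turned into an OSRM route request. The OSRM HTTP API expects coordinates as `lon,lat` pairs separated by `;`. The helper name `osrm_route_url` and the base URL `http://localhost:5000` are assumptions (a locally running OSRM server); adapt them to your own setup.

```python
def osrm_route_url(
    src_lon: float,
    src_lat: float,
    dst_lon: float,
    dst_lat: float,
    base_url: str = "http://localhost:5000",  # assumption: local OSRM server
    profile: str = "driving",
) -> str:
    """Build an OSRM route URL; coordinates are 'lon,lat' pairs joined by ';'."""
    coords = f"{src_lon},{src_lat};{dst_lon},{dst_lat}"
    # overview=false skips the route geometry, we only need distance/duration
    return f"{base_url}/route/v1/{profile}/{coords}?overview=false"


url = osrm_route_url(4.9306, 46.1517, 5.4247, 46.0071)
print(url)
```

The sketch only builds the URL and sends no request, so it can be checked without a running server.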

In [29]:
tmp_df.show(5, truncate=False)

+------------------+-----------------------+-----------+-----------------------+-------+------------------------------------+------------------+-----------------+
|code_commune_insee|nom_de_la_commune      |code_postal|libelle_d_acheminement |ligne_5|_geopoint                           |lat               |lon              |
+------------------+-----------------------+-----------+-----------------------+-------+------------------------------------+------------------+-----------------+
|01001             |L ABERGEMENT CLEMENCIAT|01400      |L ABERGEMENT CLEMENCIAT|NULL   |46.15170180297285,4.930600521664882 |46.15170180297285 |4.930600521664882|
|01002             |L ABERGEMENT DE VAREY  |01640      |L ABERGEMENT DE VAREY  |NULL   |46.00713099777772,5.42467488805381  |46.00713099777772 |5.42467488805381 |
|01004             |AMBERIEU EN BUGEY      |01500      |AMBERIEU EN BUGEY      |NULL   |45.957470664713995,5.370568254510258|45.957470664713995|5.370568254510258|
|01005             |AM

In [30]:
commune_df = tmp_df.select("code_commune_insee", "nom_de_la_commune", "code_postal", "lon", "lat").withColumnsRenamed(
    rename_map)
commune_df.show(5, truncate=False)

+----------+-----------------------+-----------+-----------------+------------------+
|code_insee|commune_name           |code_postal|lon              |lat               |
+----------+-----------------------+-----------+-----------------+------------------+
|01001     |L ABERGEMENT CLEMENCIAT|01400      |4.930600521664882|46.15170180297285 |
|01002     |L ABERGEMENT DE VAREY  |01640      |5.42467488805381 |46.00713099777772 |
|01004     |AMBERIEU EN BUGEY      |01500      |5.370568254510258|45.957470664713995|
|01005     |AMBERIEUX EN DOMBES    |01330      |4.911871787269484|45.99922938293103 |
|01006     |AMBLEON                |01300      |5.592784714407381|45.74831432147182 |
+----------+-----------------------+-----------+-----------------+------------------+
only showing top 5 rows



In [31]:
commune_df.printSchema()

root
 |-- code_insee: string (nullable = true)
 |-- commune_name: string (nullable = true)
 |-- code_postal: string (nullable = true)
 |-- lon: double (nullable = true)
 |-- lat: double (nullable = true)



In [32]:
distinct_code_postal_count = commune_df.select("code_postal").distinct().count()

In [33]:
total_count = commune_df.count()

In [34]:
distinct_code_insee_count = commune_df.select("code_insee").distinct().count()

In [35]:
print(f"insee code count: {distinct_code_insee_count}")
print(f"total code count: {total_count}")
print(f"postal code count: {distinct_code_postal_count}")

insee code count: 35007
total code count: 39192
postal code count: 6328
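
These three counts already show that the mapping is not one-to-one: there are more rows than distinct INSEE codes, and far fewer distinct codes postaux than INSEE codes. The sketch below shows, on a few hypothetical rows (the codes are made up for illustration), how to list the INSEE codes with several codes postaux and the codes postaux shared by several communes.

```python
from collections import defaultdict

# Hypothetical sample rows: (code_insee, code_postal)
rows = [
    ("A0001", "10000"),
    ("A0002", "10000"),  # two communes share one code postal
    ("A0003", "20000"),
    ("A0003", "20100"),  # one commune has two codes postaux
]

postal_by_insee = defaultdict(set)
insee_by_postal = defaultdict(set)
for insee, postal in rows:
    postal_by_insee[insee].add(postal)
    insee_by_postal[postal].add(insee)

# INSEE codes that map to more than one code postal
multi_postal = sorted(k for k, v in postal_by_insee.items() if len(v) > 1)
# codes postaux that cover more than one INSEE code
shared_postal = sorted(k for k, v in insee_by_postal.items() if len(v) > 1)
print(multi_postal, shared_postal)
```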


In [36]:
output_path = data_dir / "tmp" / "fr_commune_code_postal_centroid_double"
commune_df.coalesce(1).write.mode("overwrite").parquet(output_path.as_posix())


## Check the diff between the new and old datasets

In [17]:
# read the data from fr_commune_code_postal_centroid_double.parquet
data_dir = project_root_dir / "data"
commune_file_path = data_dir / "parquet" / "fr_commune_code_postal" / "fr_commune_code_postal_centroid_double.parquet"
commune_df = spark.read.parquet(commune_file_path.as_posix())
commune_df.show(5, truncate=False)

+----------+-----------------------+-----------+-----------------+------------------+
|code_insee|commune_name           |code_postal|lon              |lat               |
+----------+-----------------------+-----------+-----------------+------------------+
|01001     |L ABERGEMENT CLEMENCIAT|01400      |4.930600521664882|46.15170180297285 |
|01002     |L ABERGEMENT DE VAREY  |01640      |5.42467488805381 |46.00713099777772 |
|01004     |AMBERIEU EN BUGEY      |01500      |5.370568254510258|45.957470664713995|
|01005     |AMBERIEUX EN DOMBES    |01330      |4.911871787269484|45.99922938293103 |
|01006     |AMBLEON                |01300      |5.592784714407381|45.74831432147182 |
+----------+-----------------------+-----------+-----------------+------------------+
only showing top 5 rows



In [20]:
# the old dataset
converted_centroid_path = data_dir / "tmp" / "converted_centroid_of_french_commune"
old_commune_df = spark.read.parquet(converted_centroid_path.as_posix())
old_commune_df = old_commune_df.withColumnRenamed("insee", "code_insee")
old_commune_df.show(5, truncate=False)

+------------+----------+-----------------+------------------+
|nom         |code_insee|longitude        |latitude          |
+------------+----------+-----------------+------------------+
|Pie-d'Orezza|2B222     |9.338150861836196|42.374292014354154|
|Lano        |2B137     |9.235357777014519|42.37887024991088 |
|Cambia      |2B051     |9.302107656444328|42.36875223806091 |
|Érone       |2B106     |9.26661425039706 |42.375563316535825|
|Oletta      |2B185     |9.33384508224219 |42.641774511917404|
+------------+----------+-----------------+------------------+
only showing top 5 rows



In [34]:
diff_commune_df = commune_df.join(old_commune_df, on="code_insee", how="left_anti")
distinct_diff_commune_df = diff_commune_df.select("code_insee","commune_name","code_postal").dropDuplicates(['code_insee'])

In [35]:
distinct_diff_commune_df.count()

147
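
A `left_anti` join keeps the rows of the left dataframe whose key has no match in the right one, so it behaves like a set difference on `code_insee`. The following plain-Python sketch shows the same idea on a hypothetical miniature of the two datasets (only a handful of codes, chosen for illustration).

```python
# Hypothetical miniature versions of the two datasets, keyed by code_insee
new_codes = {"13201": "MARSEILLE 01", "14581": "AURSEULLES", "01001": "L ABERGEMENT CLEMENCIAT"}
old_codes = {"13055": "Marseille", "14011": "Aurseulles", "01001": "L'Abergement-Clémenciat"}

# new.join(old, how="left_anti") -> codes present only in the new dataset
only_in_new = sorted(set(new_codes) - set(old_codes))
# old.join(new, how="left_anti") -> codes present only in the old dataset
only_in_old = sorted(set(old_codes) - set(new_codes))

print(only_in_new)
print(only_in_old)
```

The difference is not symmetric, which is why the notebook runs the anti join in both directions.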

In [36]:
distinct_diff_commune_df.orderBy("code_insee").show(100)

+----------+--------------------+-----------+
|code_insee|        commune_name|code_postal|
+----------+--------------------+-----------+
|     12218| CONQUES EN ROUERGUE|      12320|
|     13201|        MARSEILLE 01|      13001|
|     13202|        MARSEILLE 02|      13002|
|     13203|        MARSEILLE 03|      13003|
|     13204|        MARSEILLE 04|      13004|
|     13205|        MARSEILLE 05|      13005|
|     13206|        MARSEILLE 06|      13006|
|     13207|        MARSEILLE 07|      13007|
|     13208|        MARSEILLE 08|      13008|
|     13209|        MARSEILLE 09|      13009|
|     13210|        MARSEILLE 10|      13010|
|     13211|        MARSEILLE 11|      13011|
|     13212|        MARSEILLE 12|      13012|
|     13213|        MARSEILLE 13|      13013|
|     13214|        MARSEILLE 14|      13014|
|     13215|        MARSEILLE 15|      13015|
|     13216|        MARSEILLE 16|      13016|
|     14581|          AURSEULLES|      14240|
|     15031|              CELLES| 

In [42]:
diff_commune_df_bis = old_commune_df.join(commune_df, on="code_insee", how="left_anti")

distinct_diff_commune_df_bis = diff_commune_df_bis.select("code_insee","nom").dropDuplicates(['code_insee'])

In [43]:
distinct_diff_commune_df_bis.count()

95

In [44]:
distinct_diff_commune_df_bis.show(100)

+----------+--------------------+
|code_insee|                 nom|
+----------+--------------------+
|     01039|                Béon|
|     01330|             Ruffieu|
|     02077|        Berzy-le-Sec|
|     02311|              Filain|
|     08294|         La Moncelle|
|     08300|        Le Mont-Dieu|
|     09255|         Saint-Amans|
|     09287|            Senconac|
|     12076| Conques-en-Rouergue|
|     13055|           Marseille|
|     14011|          Aurseulles|
|     14300|             Gerrots|
|     14623|Saint-Martin-de-F...|
|     16140|        Fontclaireau|
|     16147| Gardes-le-Pontaroux|
|     16226|  Montignac-Charente|
|     16238|          Moutonneau|
|     16355|Saint-Sulpice-de-...|
|     17334|Saint-Georges-de-...|
|     18131|   Lugny-Bourbonnais|
|     19223| Saint-Martin-Sepert|
|     19230|Saint-Pardoux-Cor...|
|     22027|          Le Cambout|
|     22043|           Coëtlogon|
|     22200|              Pléven|
|     22309|       Saint-Launeuc|
|     23023|  

## Why I don't want to patch the old dataset

1. Even if I add the special INSEE codes for Lyon, Paris, etc. as departure codes and all old INSEE codes as destinations, the reverse routes are not calculated.
2. The INSEE codes have changed a lot between 2022 and 2025.
   - In 2025, I have 147 codes that do not exist in 2022 (147-20-16-9=102, if I remove the special arrondissement codes of Paris, Marseille and Lyon)
   - In 2022, I have 95 codes that do not exist in 2025

## Clean the dataset for the calculation

Remove all duplicated rows, using (`code_insee`, `longitude`, `latitude`) as the key.

In [46]:
dedup_old_commune_df = old_commune_df.dropDuplicates(['code_insee',"longitude","latitude"])

In [47]:
dedup_old_commune_df.count()

34955

In [48]:
old_commune_df.count()

34955
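
Both counts are 34955, so the old dataset contains no duplicated (`code_insee`, `longitude`, `latitude`) triple. For reference, the deduplication itself can be sketched in plain Python as below; the sample rows are hypothetical and the coordinates are rounded for readability.

```python
# Hypothetical sample rows: (code_insee, longitude, latitude)
rows = [
    ("2B222", 9.338, 42.374),
    ("2B137", 9.235, 42.379),
    ("2B222", 9.338, 42.374),  # exact duplicate of the first row
]

seen = set()
deduped = []
for row in rows:
    if row not in seen:  # the whole tuple is the deduplication key
        seen.add(row)
        deduped.append(row)

print(len(rows), len(deduped))
```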

In [49]:
old_commune_df.show(5)

+------------+----------+-----------------+------------------+
|         nom|code_insee|        longitude|          latitude|
+------------+----------+-----------------+------------------+
|Pie-d'Orezza|     2B222|9.338150861836196|42.374292014354154|
|        Lano|     2B137|9.235357777014519| 42.37887024991088|
|      Cambia|     2B051|9.302107656444328| 42.36875223806091|
|       Érone|     2B106| 9.26661425039706|42.375563316535825|
|      Oletta|     2B185| 9.33384508224219|42.641774511917404|
+------------+----------+-----------------+------------------+
only showing top 5 rows



In [55]:
paris_code_list = ["75101","75102","75103","75104","75105","75106","75107","75108","75109","75110","75111","75112","75113","75114","75115","75116","75117","75118","75119","75120"]
marseil_code_list = ["13201","13202","13203","13204","13205","13206","13207","13208","13209","13210","13211","13212","13213","13214","13215","13216"]
lyon_code_list = ["69381","69382","69383","69384","69385","69386","69387","69388","69389"]
spec_code_list = paris_code_list + marseil_code_list + lyon_code_list
special_code_df = commune_df.select("code_insee","commune_name","lon","lat").filter(col("code_insee").isin(spec_code_list)).dropDuplicates(["code_insee"])

In [58]:
special_code_df.cache()
special_code_df.count()

45
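
The count of 45 matches the size of the three lists (20 + 16 + 9). Since the arrondissement codes are contiguous, the lists could also be generated instead of typed by hand. This is only a sketch of an alternative; the variable names are illustrative.

```python
# Arrondissement INSEE codes follow contiguous ranges
paris_codes = [str(code) for code in range(75101, 75121)]      # 75101..75120
marseille_codes = [str(code) for code in range(13201, 13217)]  # 13201..13216
lyon_codes = [str(code) for code in range(69381, 69390)]       # 69381..69389

special_codes = paris_codes + marseille_codes + lyon_codes
print(len(paris_codes), len(marseille_codes), len(lyon_codes), len(special_codes))
```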

In [57]:
special_code_df.show(25)

+----------+------------+------------------+------------------+
|code_insee|commune_name|               lon|               lat|
+----------+------------+------------------+------------------+
|     13201|MARSEILLE 01| 5.382762418572516|   43.300180147405|
|     13202|MARSEILLE 02| 5.349708114001739|  43.3223695755212|
|     13203|MARSEILLE 03|5.3806244639103245| 43.31130519027072|
|     13204|MARSEILLE 04| 5.400172919457896|43.306252388667076|
|     13205|MARSEILLE 05| 5.400579300317757| 43.29252007050407|
|     13206|MARSEILLE 06| 5.381383694109374| 43.28763998786084|
|     13207|MARSEILLE 07| 5.327364965322443| 43.27961254906469|
|     13208|MARSEILLE 08| 5.325514248331542|43.214958098052385|
|     13209|MARSEILLE 09|  5.45229488318938| 43.23664648410533|
|     13210|MARSEILLE 10|5.4242901733836675|43.274316841834796|
|     13211|MARSEILLE 11| 5.478708665309835| 43.28765586882729|
|     13212|MARSEILLE 12| 5.441757373520363| 43.30714673377935|
|     13213|MARSEILLE 13| 5.430088281481

## Build the matrix

To be able to calculate the distance between all `communes` in France, we need to build a matrix with `source commune` and `destination commune`.
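
Such a matrix is a Cartesian product: every source is paired with every destination, so its size is the product of the two input sizes. The sketch below illustrates this on hypothetical miniature inputs, then applies the same arithmetic to the sizes used in this notebook (45 special codes and 34955 old communes).

```python
from itertools import product

# Hypothetical miniature inputs
sources = ["75101", "69381"]
destinations = ["01001", "01002", "01004"]

# every (source, destination) pair, like a crossJoin
routes = list(product(sources, destinations))
assert len(routes) == len(sources) * len(destinations)

# Sizes used in this notebook: 45 special codes, 34955 old communes
one_direction = 45 * 34955
both_directions = 2 * one_direction
print(one_direction, both_directions)
```

The size grows multiplicatively, which is worth keeping in mind before building the product of two full commune lists.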

## Relation between Code Officiel Géographique (COG) de l’INSEE and code postal

### Paris

| Arrondissement	 | Code INSEE | Code Postal |
|-----------------|------------|-------------|
| Paris 1er	      | 75101      | 75001       |
| Paris 2e	       | 75102      | 75002       |
| Paris 3e	       | 75103      | 75003       |
| Paris 4e	       | 75104      | 75004       |
| Paris 5e	       | 75105      | 75005       |
| Paris 6e	       | 75106      | 75006       |
| Paris 7e	       | 75107      | 75007       |
| Paris 8e	       | 75108      | 75008       |
| Paris 9e	       | 75109      | 75009       |
| Paris 10e	      | 75110      | 75010       |
| Paris 11e	      | 75111      | 75011       |
| Paris 12e	      | 75112      | 75012       |
| Paris 13e	      | 75113      | 75013       |
| Paris 14e	      | 75114      | 75014       |
| Paris 15e	      | 75115      | 75015       |
| Paris 16e	      | 75116      | 75016       |
| Paris 17e	      | 75117      | 75017       |
| Paris 18e	      | 75118      | 75018       |
| Paris 19e	      | 75119      | 75019       |
| Paris 20e	      | 75120      | 75020       |
| Paris ALL       | 75056      | None        |

### Lyon

| Arrondissement	 | Code INSEE | Code Postal |
|-----------------|------------|-------------|
| Lyon 1er	       | 69381      | 69001       |
| Lyon 2e	        | 69382      | 69002       |
| Lyon 3e	        | 69383      | 69003       |
| Lyon 4e	        | 69384      | 69004       |
| Lyon 5e	        | 69385      | 69005       |
| Lyon 6e	        | 69386      | 69006       |
| Lyon 7e	        | 69387      | 69007       |
| Lyon 8e	        | 69388      | 69008       |
| Lyon 9e	        | 69389      | 69009       |
| Lyon ALL	       | 69123      | None        |

### Marseille

| Arrondissement | Code INSEE | Code Postal |
|----------------|------------|-------------|
| Marseille 1er  | 13201      | 13001       |
| Marseille 2e   | 13202      | 13002       |
| Marseille 3e   | 13203      | 13003       |
| Marseille 4e   | 13204      | 13004       |
| Marseille 5e   | 13205      | 13005       |
| Marseille 6e   | 13206      | 13006       |
| Marseille 7e   | 13207      | 13007       |
| Marseille 8e   | 13208      | 13008       |
| Marseille 9e   | 13209      | 13009       |
| Marseille 10e  | 13210      | 13010       |
| Marseille 11e  | 13211      | 13011       |
| Marseille 12e  | 13212      | 13012       |
| Marseille 13e  | 13213      | 13013       |
| Marseille 14e  | 13214      | 13014       |
| Marseille 15e  | 13215      | 13015       |
| Marseille 16e  | 13216      | 13016       |
| Marseille ALL  | 13055      | None        |
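
The arrondissement codes follow a regular pattern (for example, the diff output earlier in this notebook shows 13201 paired with 13001). A minimal sketch of that pattern as a function is given below. `arrondissement_postal_code` is a hypothetical helper, and it only covers the single code postal per arrondissement listed in this section.

```python
def arrondissement_postal_code(code_insee: str) -> str:
    """Map an arrondissement INSEE code to the code postal listed in this section."""
    number = int(code_insee)
    if 75101 <= number <= 75120:   # Paris: 751xx -> 750xx
        return f"750{number - 75100:02d}"
    if 69381 <= number <= 69389:   # Lyon: 6938x -> 6900x
        return f"6900{number - 69380}"
    if 13201 <= number <= 13216:   # Marseille: 132xx -> 130xx
        return f"130{number - 13200:02d}"
    raise ValueError(f"not an arrondissement code: {code_insee}")


print(arrondissement_postal_code("75101"))
print(arrondissement_postal_code("69389"))
print(arrondissement_postal_code("13216"))
```

The city-wide codes (75056, 69123, 13055) have no single code postal, so the function rejects them.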

In [60]:
# 1st step: get all possible routes for special code commune as departure
df_dep1=special_code_df.selectExpr("code_insee as source_insee","commune_name as source_nom","lon as source_long","lat as source_lat")

df_arr1 = old_commune_df.selectExpr("code_insee as dest_insee","nom as dest_nom","longitude as dest_long","latitude as dest_lat")

routes1 = df_dep1.crossJoin(df_arr1)

In [62]:
routes1.count()

1572975

In [64]:
routes1.show(10)

+------------+------------+------------------+------------------+----------+------------+-----------------+------------------+
|source_insee|  source_nom|       source_long|        source_lat|dest_insee|    dest_nom|        dest_long|          dest_lat|
+------------+------------+------------------+------------------+----------+------------+-----------------+------------------+
|       75111|    PARIS 11|2.3815596509625725| 48.86007317753911|     2B222|Pie-d'Orezza|9.338150861836196|42.374292014354154|
|       13204|MARSEILLE 04| 5.400172919457896|43.306252388667076|     2B222|Pie-d'Orezza|9.338150861836196|42.374292014354154|
|       13215|MARSEILLE 15| 5.365308697775667| 43.35309076105492|     2B222|Pie-d'Orezza|9.338150861836196|42.374292014354154|
|       13208|MARSEILLE 08| 5.325514248331542|43.214958098052385|     2B222|Pie-d'Orezza|9.338150861836196|42.374292014354154|
|       75115|    PARIS 15|2.2937166457955085| 48.84165432062318|     2B222|Pie-d'Orezza|9.338150861836196|42.3

In [61]:
# 2nd step: get all possible routes for special code commune as arrival
df_dep2 = old_commune_df.selectExpr("code_insee as source_insee","nom as source_nom","longitude as source_long","latitude as source_lat")

df_arr2 = special_code_df.selectExpr("code_insee as dest_insee","commune_name as dest_nom","lon as dest_long","lat as dest_lat")

routes2 = df_dep2.crossJoin(df_arr2)

In [63]:
routes2.count()

1572975

In [65]:
routes2.show(10)

+------------+------------+-----------------+------------------+----------+------------+------------------+------------------+
|source_insee|  source_nom|      source_long|        source_lat|dest_insee|    dest_nom|         dest_long|          dest_lat|
+------------+------------+-----------------+------------------+----------+------------+------------------+------------------+
|       2B222|Pie-d'Orezza|9.338150861836196|42.374292014354154|     75111|    PARIS 11|2.3815596509625725| 48.86007317753911|
|       2B222|Pie-d'Orezza|9.338150861836196|42.374292014354154|     13204|MARSEILLE 04| 5.400172919457896|43.306252388667076|
|       2B222|Pie-d'Orezza|9.338150861836196|42.374292014354154|     13215|MARSEILLE 15| 5.365308697775667| 43.35309076105492|
|       2B222|Pie-d'Orezza|9.338150861836196|42.374292014354154|     13208|MARSEILLE 08| 5.325514248331542|43.214958098052385|
|       2B222|Pie-d'Orezza|9.338150861836196|42.374292014354154|     75115|    PARIS 15|2.2937166457955085| 48.

In [66]:
# step3: concat routes1 and routes2
all_routes = routes1.union(routes2)

In [67]:
# all_routes now contains every route between the special code communes and the old communes, in both directions
all_routes.count()

3145950

In [69]:
all_routes.rdd.getNumPartitions()

8

In [71]:
output_path = data_dir / "tmp" / "extra_routes_matrix"
all_routes.write.mode("overwrite").parquet(output_path.as_posix())

In [72]:
tmp_routes = spark.read.parquet(output_path.as_posix())

tmp_routes.count()

3145950

In [73]:
tmp_routes.show(5, truncate=False)

+------------+------------+-----------------+------------------+----------+------------+------------------+------------------+
|source_insee|source_nom  |source_long      |source_lat        |dest_insee|dest_nom    |dest_long         |dest_lat          |
+------------+------------+-----------------+------------------+----------+------------+------------------+------------------+
|2B222       |Pie-d'Orezza|9.338150861836196|42.374292014354154|75111     |PARIS 11    |2.3815596509625725|48.86007317753911 |
|2B222       |Pie-d'Orezza|9.338150861836196|42.374292014354154|13204     |MARSEILLE 04|5.400172919457896 |43.306252388667076|
|2B222       |Pie-d'Orezza|9.338150861836196|42.374292014354154|13215     |MARSEILLE 15|5.365308697775667 |43.35309076105492 |
|2B222       |Pie-d'Orezza|9.338150861836196|42.374292014354154|13208     |MARSEILLE 08|5.325514248331542 |43.214958098052385|
|2B222       |Pie-d'Orezza|9.338150861836196|42.374292014354154|75115     |PARIS 15    |2.2937166457955085|48.8