# Exercise: Museums of France

This workflow is an example of a possible solution to the exercise question, using **PySpark**.

## Spark Session

In [None]:
import findspark
findspark.init()
import pyspark

In [None]:
spark = pyspark.sql.SparkSession \
                    .builder \
                    .appName("Spark SQL First Example") \
                    .getOrCreate()

## Reading the Data

Use Spark to read the CSV file:

In [None]:
departements = spark.read.csv("../.assets/data/museums/departements.csv", 
                              sep=";",
                              header=True)
departements.show()

Ignore irrelevant columns:

In [None]:
departements = departements[["Nom du département", "Population totale"]]
departements.show()

Use pandas to read the Excel file, since Spark has no built-in Excel reader:

In [None]:
import pandas
museums_pd = pandas.read_excel("../.assets/data/museums/Liste_musees_de_France.xls")
museums_pd.head()

In [None]:
# Spark does not like NaN values in string columns, so fill them with empty strings
museums = spark.createDataFrame(museums_pd.fillna(""))

Group the museums by department name:

In [None]:
museum_count = museums.groupby("NOMDEP").count()
museum_count.show()

Convert the department names so that they match the names in the museum data (upper case, with spaces instead of hyphens):

In [None]:
from pyspark.sql.functions import upper, col, udf
from pyspark.sql.types import StringType, FloatType

departements = departements.withColumn("Nom du département", upper(col("Nom du département")))

replace_hyphens = udf(lambda s: s.replace("-", " "), StringType())
departements = departements.withColumn("Nom du département", replace_hyphens("Nom du département"))

departements.show()
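The lambda in `replace_hyphens` raises an error if a department name is null, because `None` has no `replace` method. As a minimal sketch, the same normalization can be written as a plain Python function that passes null values through. The name `normalize_department` is hypothetical and not part of the original notebook.

```python
from typing import Optional


def normalize_department(name: Optional[str]) -> Optional[str]:
    """Upper-case a department name and replace hyphens with spaces.

    Hypothetical helper; returns None unchanged so that null values
    do not raise an AttributeError.
    """
    if name is None:
        return None
    return name.upper().replace("-", " ")


print(normalize_department("Alpes-Maritimes"))
```

Such a function could be wrapped with `udf(normalize_department, StringType())` and applied in place of the two separate steps above.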

Join the data frames on the department name:

In [None]:
departements.show()

In [None]:
museum_count.show()

In [None]:
joined = departements.join(museum_count.withColumnRenamed("NOMDEP", "Nom du département"), on="Nom du département")
joined.show()
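`join` performs an inner join by default, so departments whose names do not match exactly in both data frames are dropped without warning. The following sketch shows one way to spot such mismatches with plain Python sets. The function name `unmatched_keys` and the sample names are invented for illustration and are not taken from the data files.

```python
def unmatched_keys(left, right):
    """Return the keys that appear only on the left and only on the right."""
    left_set, right_set = set(left), set(right)
    return sorted(left_set - right_set), sorted(right_set - left_set)


# Invented toy values: one name is spelled differently on each side.
population_names = ["AIN", "COTE D'OR", "VAL D'OISE"]
museum_names = ["AIN", "COTE D OR", "VAL D'OISE"]

only_population, only_museums = unmatched_keys(population_names, museum_names)
print(only_population)
print(only_museums)
```

In the notebook, the two lists could be obtained by collecting the key column of each data frame, assuming both are small enough to fit in driver memory.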

Convert the population to a numeric type and calculate the correlation coefficient:

In [None]:
# The population strings use "." as a thousands separator, so remove it before converting
convert_population = udf(lambda s: float(s.replace(".", "")), FloatType())

joined = joined.withColumn("Population totale", convert_population("Population totale"))

In [None]:
joined.show()
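The conversion above fails on null or empty strings. A minimal sketch of a more defensive parser is shown below, assuming that missing values should become null rather than raise an error. The name `parse_population` is hypothetical.

```python
from typing import Optional


def parse_population(text: Optional[str]) -> Optional[float]:
    """Parse a population string that uses "." as a thousands separator.

    Hypothetical helper; returns None for missing or blank input.
    """
    if text is None or not text.strip():
        return None
    return float(text.strip().replace(".", ""))


print(parse_population("1.234.567"))
```

It could be registered with `udf(parse_population, FloatType())` in place of the lambda.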

In [None]:
joined.corr("Population totale", "count")
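`DataFrame.corr` returns the Pearson correlation coefficient of the two columns. To make clear what that number measures, here is a sketch of the same formula in plain Python. The function name `pearson` and the sample values are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the mean
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented values with a perfect linear relationship
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```

A value close to 1 indicates that departments with a larger population tend to have more museums. A value close to 0 indicates no linear relationship.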

Done.

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_