# French Departmental/Regional Data

In [1]:
import pandas as pd

----
&nbsp;
## Creating new columns `department` & `region`

Compared to the UK, France is considerably more systematic in how it defines its regions.

1. We extract the first two digits of the code postal from `address`, e.g. 06 or 78.
2. These digits identify a particular department.
3. We aggregate the departments into regions.
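The first step can be sketched as follows. This is only an illustration under assumptions: the sample addresses are invented, the helper name `extract_department_code` is hypothetical, and the pattern assumes each address contains a five-digit code postal. Corsican postcodes begin with 20 while the department codes are 2A and 2B, so those would need separate handling.

```python
import pandas as pd

# Hypothetical sample addresses; the real `address` column is not shown in this notebook
sample = pd.DataFrame({
    "address": [
        "12 Rue de la Paix, 75002 Paris",
        "3 Avenue Jean Médecin, 06000 Nice",
        "8 Rue de la Paroisse, 78000 Versailles",
    ]
})

def extract_department_code(addresses: pd.Series) -> pd.Series:
    """Return the first two digits of the five-digit postcode in each address."""
    # Match a standalone five-digit number and capture only its first two digits
    return addresses.str.extract(r"\b(\d{2})\d{3}\b", expand=False)

sample["department_num"] = extract_department_code(sample["address"])
print(sample)
```

Keeping the result as a string preserves leading zeros such as `06`.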

We look up the departments and regions of France on [Wikipedia](https://en.wikipedia.org/wiki/Departments_of_France) and scrape the table using pandas.

In [2]:
url = 'https://en.wikipedia.org/wiki/Departments_of_France'
tables = pd.read_html(url)

In [3]:
# The table of interest was found by inspection (index 4 when this was run; it may shift if the page changes)
departments = tables[4]
departments.head()

Unnamed: 0,INSEE code,Arms 1,Date of establishment,Department,Capital,Region,Named after
0,1,,26 February 1790,Ain,Bourg-en-Bresse,Auvergne-Rhône-Alpes,Ain (river)
1,2,,26 February 1790,Aisne,Laon,Hauts-de-France,Aisne (river)
2,3,,26 February 1790,Allier,Moulins,Auvergne-Rhône-Alpes,Allier (river)
3,4,,26 February 1790,Alpes-de-Haute-Provence 2,Digne-les-Bains,Provence-Alpes-Côte d'Azur,Alps mountains and Provence region
4,5,,26 February 1790,Hautes-Alpes,Gap,Provence-Alpes-Côte d'Azur,Alps mountains
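Because a positional index such as `tables[4]` depends on the page layout, one alternative is to pick the table by the columns it must contain. The sketch below assumes a hypothetical helper `find_table` and uses invented miniature tables in place of the list returned by `pd.read_html`.

```python
import pandas as pd

def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(map(str, table.columns)):
            return table
    raise ValueError(f"No table contains columns {sorted(required)}")

# Stand-in for the list returned by pd.read_html (invented data)
demo_tables = [
    pd.DataFrame({"Year": [1790], "Event": ["Creation"]}),
    pd.DataFrame({
        "INSEE code": ["01"],
        "Department": ["Ain"],
        "Region": ["Auvergne-Rhône-Alpes"],
    }),
]

found = find_table(demo_tables, ["INSEE code", "Department", "Region"])
print(found)
```

With the real data this would be called as `find_table(tables, ['INSEE code', 'Department', 'Region'])`.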


In [4]:
print(departments.columns.tolist())

['INSEE code', 'Arms 1', 'Date of establishment', 'Department', 'Capital', 'Region', 'Named after']


In [5]:
# Drop specified columns from the DataFrame
departments = departments.drop(columns=['Arms 1', 'Date of establishment', 'Capital', 'Named after'])

# Rename the column 'INSEE code' to 'department_num'
departments = departments.rename(columns={'INSEE code': 'department_num'})

# Convert all column names in the DataFrame to lowercase
departments.columns = departments.columns.str.lower()

# Replace the trailing numbers in the 'department' column with an empty string (if it exists)
departments['department'] = departments['department'].str.replace(r'\d+$', '', regex=True)

In [6]:
# Remove trailing whitespace
departments['department'] = departments['department'].str.strip()
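The two cleaning steps above can be checked on a small sample. This sketch uses invented input values and combines the footnote removal and whitespace stripping into one chain; it assumes no department name legitimately ends in a digit.

```python
import pandas as pd

# Invented sample: a footnote marker, a clean name, and trailing spaces
names = pd.Series(["Alpes-de-Haute-Provence 2", "Ain", "Val-d'Oise  "])

cleaned = (
    names
    .str.replace(r"\s*\d+\s*$", "", regex=True)  # drop a trailing footnote number
    .str.strip()                                 # drop leftover whitespace
)
print(cleaned.tolist())
```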

In [7]:
# Display the last 10 rows of the DataFrame
departments.tail(10)

Unnamed: 0,department_num,department,region
92,91,Essonne,Île-de-France
93,92,Hauts-de-Seine,Île-de-France
94,93,Seine-Saint-Denis,Île-de-France
95,94,Val-de-Marne,Île-de-France
96,95,Val-d'Oise,Île-de-France
97,971,Guadeloupe,Guadeloupe
98,972,Martinique,Martinique
99,973,Guyane,French Guiana
100,974,La Réunion,Réunion
101,976,Mayotte,Mayotte


The last five entries are overseas departments, which we remove.

In [8]:
# Remove the last 5 entries (the overseas departments) and keep an independent copy
departments = departments.iloc[:-5].copy()
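Slicing by position relies on the overseas departments being the last rows. A possible alternative, sketched here on invented sample rows, filters on the code itself, since the overseas codes shown above all start with 97.

```python
import pandas as pd

# Invented sample mirroring the tail of the table above
demo = pd.DataFrame({
    "department_num": ["94", "95", "971", "972", "976"],
    "department": ["Val-de-Marne", "Val-d'Oise", "Guadeloupe", "Martinique", "Mayotte"],
})

# Flag overseas departments by their code prefix rather than by row position
is_overseas = demo["department_num"].astype(str).str.startswith("97")
metropolitan = demo[~is_overseas].copy()
print(metropolitan)
```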

In `departments`, department 69 is ambiguous because it appears under two codes. We search `departments` for codes starting with 69.

In [9]:
rhone = departments[departments['department_num'].str.startswith('69')]
rhone

Unnamed: 0,department_num,department,region
69,69D,Rhône,Auvergne-Rhône-Alpes
70,69M,Lyon Metropolis,Auvergne-Rhône-Alpes


We will merge both codes, 69D and 69M, into a single department 69, Rhône.

In [10]:
# Identify rows with department number '69D' or '69M'
rows_to_update = departments['department_num'].isin(['69D', '69M'])

# Update department name and department number for these rows
departments.loc[rows_to_update, 'department'] = 'Rhône'
departments.loc[rows_to_update, 'department_num'] = '69'

# Both rows are now identical, so drop the first of the two (row position 69)
departments = departments.drop(departments.index[69])
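Dropping by position works only while the duplicated row sits at position 69. A sketch of a position-independent variant, using invented sample rows, removes the duplicate with `drop_duplicates` instead.

```python
import pandas as pd

# Invented sample containing the two Rhône entries
demo = pd.DataFrame({
    "department_num": ["68", "69D", "69M", "70"],
    "department": ["Haut-Rhin", "Rhône", "Lyon Metropolis", "Haute-Saône"],
    "region": ["Grand Est", "Auvergne-Rhône-Alpes",
               "Auvergne-Rhône-Alpes", "Bourgogne-Franche-Comté"],
})

# Give both rows the same code and name
mask = demo["department_num"].isin(["69D", "69M"])
demo.loc[mask, "department"] = "Rhône"
demo.loc[mask, "department_num"] = "69"

# Keep a single row per department code
demo = demo.drop_duplicates(subset="department_num").reset_index(drop=True)
print(demo)
```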

We export `departments` as a .csv file so that it can later be merged with `france_data`.

In [11]:
# Export the data to a csv file
departments.to_csv('../../data/France/departments.csv', index=False)

In [12]:
departments.tail(3)

Unnamed: 0,department_num,department,region
94,93,Seine-Saint-Denis,Île-de-France
95,94,Val-de-Marne,Île-de-France
96,95,Val-d'Oise,Île-de-France


----
&nbsp;
## Total population data by department. [INSEE](https://www.insee.fr/fr/statistiques/6013867) 2019 Census

In [13]:
url = 'https://www.insee.fr/fr/statistiques/6013867'
tables = pd.read_html(url)

In [14]:
pop_total = tables[0]
pop_total.head()

Unnamed: 0,Code département,Nom du département,Population municipale
0,1,Ain,652 432
1,2,Aisne,531 345
2,3,Allier,335 975
3,4,Alpes-de-Haute-Provence,164 308
4,5,Hautes-Alpes,141 220


In [15]:
pop_total = pop_total.rename(columns={'Code département': 'department_num',
                                      'Nom du département': 'department',
                                      'Population municipale': 'population'})

In [16]:
# Remove whitespace from 'population' column and format as type integer
pop_total['population'] = pop_total['population'].replace(' ', '', regex=True).astype(int)
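Scraped French tables sometimes use non-breaking spaces as thousands separators, which a plain `' '` replacement would miss. The sketch below, on invented sample strings, removes ordinary, non-breaking and narrow non-breaking spaces explicitly; whether the INSEE page uses them is an assumption to verify.

```python
import pandas as pd

# Invented sample: ordinary space, non-breaking space, narrow non-breaking space
raw = pd.Series(["652 432", "531\u00a0345", "1\u202f301\u202f659"])

# Character class listing the three space characters to strip
separators = "[ \u00a0\u202f]"
population = raw.str.replace(separators, "", regex=True).astype(int)
print(population.tolist())
```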

In [17]:
pop_total.tail(10)

Unnamed: 0,department_num,department,population
90,90,Territoire de Belfort,141318
91,91,Essonne,1301659
92,92,Hauts-de-Seine,1624357
93,93,Seine-Saint-Denis,1644903
94,94,Val-de-Marne,1407124
95,95,Val-d'Oise,1249674
96,971,Guadeloupe,384239
97,972,Martinique,364508
98,973,Guyane,281678
99,974,La Réunion,861210


We remove the overseas departments (the last four rows; Mayotte does not appear in this table).

In [18]:
# Remove the last 4 entries from the DataFrame
pop_total = pop_total.iloc[:-4]

----
&nbsp;
## Population density by department. [INSEE](https://statistiques-locales.insee.fr/#c=indicator&i=pop_depuis_1876.dens&s=2020&t=A01&view=map2) 2020

In [19]:
pop_density = pd.read_csv("../../data/France/france_pop_den.csv", sep=';')
pop_density.head()

Unnamed: 0,Code,Libellé,Densité de population 2020
0,1,Ain,114.2
1,2,Aisne,71.9
2,3,Allier,45.7
3,4,Alpes-de-Haute-Provence,23.9
4,5,Hautes-Alpes,25.3


In [20]:
pop_density.columns.tolist()

['Code', 'Libellé', 'Densité de population 2020']

In [21]:
pop_density.tail(10)

Unnamed: 0,Code,Libellé,Densité de population 2020
91,91,Essonne,723.9
92,92,Hauts-de-Seine,9260.4
93,93,Seine-Saint-Denis,7008.6
94,94,Val-de-Marne,5746.1
95,95,Val-d'Oise,1004.7
96,971,Guadeloupe,235.5
97,972,Martinique,320.2
98,973,Guyane,3.4
99,974,La Réunion,344.7
100,976,Mayotte,N/A - résultat non disponible


In [22]:
pop_density = pop_density.rename(columns={'Code': 'department_num',
                                          'Libellé': 'department',
                                          'Densité de population 2020': 'population_density(pop/area)'})

# Remove the five overseas departments; copy so the assignment below acts on an independent DataFrame
pop_density = pop_density.iloc[:-5].copy()

# Convert the density column to type float
pop_density['population_density(pop/area)'] = pop_density['population_density(pop/area)'].astype(float)
pop_density.head()

Unnamed: 0,department_num,department,population_density(pop/area)
0,1,Ain,114.2
1,2,Aisne,71.9
2,3,Allier,45.7
3,4,Alpes-de-Haute-Provence,23.9
4,5,Hautes-Alpes,25.3
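The cast to float succeeds here only because the row holding `N/A - résultat non disponible` was removed first. If such rows had to be kept, one option is `pd.to_numeric` with `errors="coerce"`, sketched below on invented sample values.

```python
import pandas as pd

# Invented sample including the placeholder text seen for Mayotte
raw_density = pd.Series(["114.2", "3.4", "N/A - résultat non disponible"])

# Unparseable entries become NaN instead of raising an error
density = pd.to_numeric(raw_density, errors="coerce")
print(density)
```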


In [23]:
pop_density.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 3 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   department_num                96 non-null     object 
 1   department                    96 non-null     object 
 2   population_density(pop/area)  96 non-null     float64
dtypes: float64(1), object(2)
memory usage: 2.4+ KB


----
&nbsp;
### Merging `departments`, `pop_total` & `pop_density`

In [24]:
# Sort all three DataFrames by department number

departments = departments.sort_values('department_num')
pop_total = pop_total.sort_values('department_num')
pop_density = pop_density.sort_values('department_num')
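Sorting and merging on `department_num` behave predictably only when the codes have the same type and format in every frame. As a precaution, the codes could be normalised to zero-padded strings first. The sketch below uses invented mixed values; whether the three frames actually differ in type is an assumption to check with `.dtypes`.

```python
import pandas as pd

# Invented sample mixing integers and strings, as can happen after scraping
codes = pd.Series([1, "2A", "69", 971])

# Cast everything to string and pad single digits to two characters
normalised = codes.astype(str).str.zfill(2)
print(normalised.tolist())
```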

We check that the set of `department` names is identical across the three DataFrames.

In [25]:
set1 = set(departments['department'].unique())
set2 = set(pop_total['department'].unique())
set3 = set(pop_density['department'].unique())

print(set1 == set2 == set3)  # This should print True if all sets are equal

True
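Had the check printed `False`, it would be useful to see which names differ. The helper below is a hypothetical sketch (`report_mismatches` is not part of the notebook) run on invented sets with one deliberate spelling difference.

```python
def report_mismatches(**named_sets):
    """For each named set, list the items it lacks relative to the union of all sets."""
    union = set().union(*named_sets.values())
    return {name: sorted(union - values) for name, values in named_sets.items()}

# Invented example: one source spells a name without the hyphen
mismatches = report_mismatches(
    departments={"Ain", "Aisne", "Côte-d'Or"},
    pop_total={"Ain", "Aisne", "Côte-d'Or"},
    pop_density={"Ain", "Aisne", "Côte d'Or"},
)
print(mismatches)
```

An empty list for every key corresponds to the `True` result above.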


In [26]:
from functools import reduce

# List of dataframes to merge
dfs = [departments, pop_total, pop_density]

# Use reduce and merge to merge all dataframes
demographics = reduce(lambda left, right: pd.merge(left, right, on=['department', 'department_num']), dfs)
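An inner merge silently drops rows whose keys do not match, so it can help to validate the join. The sketch below uses three invented two-row frames and a hypothetical helper `merge_checked`; `validate="one_to_one"` makes pandas raise an error if a key is duplicated on either side.

```python
from functools import reduce

import pandas as pd

keys = ["department_num", "department"]

# Invented miniature versions of the three frames
a = pd.DataFrame({"department_num": ["01", "02"], "department": ["Ain", "Aisne"],
                  "region": ["Auvergne-Rhône-Alpes", "Hauts-de-France"]})
b = pd.DataFrame({"department_num": ["01", "02"], "department": ["Ain", "Aisne"],
                  "population": [652432, 531345]})
c = pd.DataFrame({"department_num": ["01", "02"], "department": ["Ain", "Aisne"],
                  "density": [114.2, 71.9]})

def merge_checked(left, right):
    # Raises pandas.errors.MergeError if the keys are not unique on both sides
    return pd.merge(left, right, on=keys, how="inner", validate="one_to_one")

merged = reduce(merge_checked, [a, b, c])

# Confirm that no rows were lost in the inner joins
assert len(merged) == len(a)
print(merged)
```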

In [27]:
demographics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96 entries, 0 to 95
Data columns (total 5 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   department_num                96 non-null     object 
 1   department                    96 non-null     object 
 2   region                        96 non-null     object 
 3   population                    96 non-null     int64  
 4   population_density(pop/area)  96 non-null     float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.5+ KB


By inspection, the population density of [Paris](https://en.wikipedia.org/wiki/Paris) is incorrect, so we overwrite it manually.

In [28]:
demographics.loc[demographics['department'] == 'Paris', 'population_density(pop/area)'] = 20545.0

We will calculate the area of each department by dividing `population` by `population_density(pop/area)`.

In [29]:
demographics['area(sq_km)'] = (demographics['population'] / demographics['population_density(pop/area)']).round(2)
demographics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96 entries, 0 to 95
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   department_num                96 non-null     object 
 1   department                    96 non-null     object 
 2   region                        96 non-null     object 
 3   population                    96 non-null     int64  
 4   population_density(pop/area)  96 non-null     float64
 5   area(sq_km)                   96 non-null     float64
dtypes: float64(2), int64(1), object(3)
memory usage: 5.2+ KB
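As a quick worked example of the area calculation, the sketch below applies the same division to the first two departments, using the population and density figures shown earlier. Since the published density is rounded to one decimal place, the derived area is an approximation rather than the official figure.

```python
import pandas as pd

# Figures for Ain and Aisne taken from the tables above
demo = pd.DataFrame({
    "department": ["Ain", "Aisne"],
    "population": [652432, 531345],
    "population_density(pop/area)": [114.2, 71.9],
})

# area = population / density, rounded to two decimals
demo["area(sq_km)"] = (
    demo["population"] / demo["population_density(pop/area)"]
).round(2)
print(demo)
```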


The `demographics` data can now be exported.

In [30]:
# Export the data to a csv file
demographics.to_csv('../../data/France/demographics.csv', index=False)