# French Departmental/Regional Data from [INSEE](https://statistiques-locales.insee.fr/#c=indicator&view=map2)

In [1]:
import pandas as pd

----
&nbsp;
## Creating new columns `department` & `region`

Compared to the UK, France is considerably more systematic in how it defines its regions.

1. We extract the first two digits of the code postal from `address`, e.g. 06, 78.
2. These digits identify a particular department.
3. We aggregate the departments into regions.
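
The first two steps can be sketched on a few made-up addresses. The sample rows, the address format, and the regular expression are assumptions for illustration only; the real `address` column of `france_data` is not shown in this notebook. Corsican postcodes begin with 20 while the department codes are 2A and 2B, so they would need extra handling.

```python
import pandas as pd

# Hypothetical addresses (assumed format: street, five-digit code postal, town)
france_data = pd.DataFrame({
    "address": [
        "12 Rue de France, 06000 Nice",
        "3 Avenue de Paris, 78000 Versailles",
        "1 Place Bellecour, 69002 Lyon",
    ]
})

# Capture the first two digits of a standalone five-digit number
france_data["department_num"] = france_data["address"].str.extract(
    r"\b(\d{2})\d{3}\b", expand=False
)

print(france_data["department_num"].tolist())
```

Keeping the extracted code as a string preserves leading zeros such as `06`.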

We look up the departments and regions of France on [Wikipedia](https://en.wikipedia.org/wiki/Departments_of_France) and scrape the table using pandas.

In [2]:
url = 'https://en.wikipedia.org/wiki/Departments_of_France'
tables = pd.read_html(url)

In [3]:
# The table of interest is found by inspection
departments = tables[4]
departments.head()

Unnamed: 0,INSEE code,Arms 1,Date of establishment,Department,Capital,Region,Named after
0,1,,26 February 1790,Ain,Bourg-en-Bresse,Auvergne-Rhône-Alpes,Ain (river)
1,2,,26 February 1790,Aisne,Laon,Hauts-de-France,Aisne (river)
2,3,,26 February 1790,Allier,Moulins,Auvergne-Rhône-Alpes,Allier (river)
3,4,,26 February 1790,Alpes-de-Haute-Provence 2,Digne-les-Bains,Provence-Alpes-Côte d'Azur,Alps mountains and Provence region
4,5,,26 February 1790,Hautes-Alpes,Gap,Provence-Alpes-Côte d'Azur,Alps mountains
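
Selecting `tables[4]` by position depends on the current layout of the Wikipedia page. A possible alternative, sketched here on small stand-in DataFrames rather than the live page, is to pick the first table that contains the expected column names. The `required` set and the sample tables are assumptions.

```python
import pandas as pd

# Stand-ins for the list of DataFrames returned by pd.read_html
tables = [
    pd.DataFrame({"Year": [1790], "Event": ["Creation of the departments"]}),
    pd.DataFrame({
        "INSEE code": ["01", "02"],
        "Department": ["Ain", "Aisne"],
        "Region": ["Auvergne-Rhône-Alpes", "Hauts-de-France"],
    }),
]

# Column names the table of interest is expected to have (assumption)
required = {"INSEE code", "Department", "Region"}

# Take the first table whose columns include every required name
departments = next(t for t in tables if required.issubset(t.columns))

print(departments.columns.tolist())
```

If no table matches, `next` raises `StopIteration`, which makes a change in the page layout visible immediately.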


In [4]:
print(departments.columns.tolist())

['INSEE code', 'Arms 1', 'Date of establishment', 'Department', 'Capital', 'Region', 'Named after']


In [5]:
# Drop specified columns from the DataFrame
departments = departments.drop(columns=['Arms 1', 'Date of establishment', 'Capital', 'Named after'])

# Rename the column 'INSEE code' to 'department_num'
departments = departments.rename(columns={'INSEE code': 'department_num'})

# Convert all column names in the DataFrame to lowercase
departments.columns = departments.columns.str.lower()

# Replace the trailing numbers in the 'department' column with an empty string (if it exists)
departments['department'] = departments['department'].str.replace(r'\d+$', '', regex=True)

In [6]:
# Remove trailing whitespace
departments['department'] = departments['department'].str.strip()
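
To illustrate what the two cleaning steps do, here is a small sketch on a hand-written sample. The names are taken from the output above, where what appears to be a footnote marker (the trailing `2` in `Alpes-de-Haute-Provence 2`) was scraped along with the name.

```python
import pandas as pd

names = pd.Series(["Ain", "Alpes-de-Haute-Provence 2", "Hautes-Alpes"])

# Remove trailing digits, then any whitespace left behind
cleaned = names.str.replace(r"\d+$", "", regex=True).str.strip()

print(cleaned.tolist())
```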

In [7]:
# Display the last 10 rows of the DataFrame
departments.tail(10)

Unnamed: 0,department_num,department,region
92,91,Essonne,Île-de-France
93,92,Hauts-de-Seine,Île-de-France
94,93,Seine-Saint-Denis,Île-de-France
95,94,Val-de-Marne,Île-de-France
96,95,Val-d'Oise,Île-de-France
97,971,Guadeloupe,Guadeloupe
98,972,Martinique,Martinique
99,973,Guyane,French Guiana
100,974,La Réunion,Réunion
101,976,Mayotte,Mayotte


The last five entries are overseas departments, which we remove.

In [8]:
departments = departments.copy()

# Remove the last 5 entries from the DataFrame
departments = departments.iloc[:-5]
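
Dropping the last five rows relies on the row order of the scraped table. As a hedged alternative, the overseas departments listed above all have codes beginning with `97`, so they can be filtered by code instead. The sample DataFrame below is a stand-in, not the scraped data.

```python
import pandas as pd

# Small stand-in for the scraped table
departments = pd.DataFrame({
    "department_num": ["01", "2A", "69D", "95", "971", "976"],
    "department": ["Ain", "Corse-du-Sud", "Rhône", "Val-d'Oise", "Guadeloupe", "Mayotte"],
})

# Overseas department codes start with 97
is_overseas = departments["department_num"].str.startswith("97")
metropolitan = departments[~is_overseas].copy()

print(metropolitan["department_num"].tolist())
```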

In `departments`, the entry for department 69 is ambiguous. We search `departments` for codes starting with 69.

In [9]:
rhone = departments[departments['department_num'].str.startswith('69')]
rhone

Unnamed: 0,department_num,department,region
69,69D,Rhône,Auvergne-Rhône-Alpes
70,69M,Lyon Metropolis,Auvergne-Rhône-Alpes


We merge both entries (69D and 69M) into a single department, 69 Rhône.

In [10]:
# Identify rows with department number '69D' or '69M'
rows_to_update = departments['department_num'].isin(['69D', '69M'])

# Update department name and department number for these rows
departments.loc[rows_to_update, 'department'] = 'Rhône'
departments.loc[rows_to_update, 'department_num'] = '69'

# Both rows now hold the same values, so drop the one at position 69 to leave a single entry for Rhône
departments = departments.drop(departments.index[69])
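
Dropping by position works here but depends on the row order. A sketch of one alternative, on a stand-in DataFrame, is to let pandas remove the duplicate after the two rows have been given the same code and name. This swaps in `drop_duplicates` for the positional drop used above.

```python
import pandas as pd

# Stand-in rows around department 69
departments = pd.DataFrame({
    "department_num": ["68", "69D", "69M", "70"],
    "department": ["Haut-Rhin", "Rhône", "Lyon Metropolis", "Haute-Saône"],
    "region": ["Grand Est", "Auvergne-Rhône-Alpes",
               "Auvergne-Rhône-Alpes", "Bourgogne-Franche-Comté"],
})

rows_to_update = departments["department_num"].isin(["69D", "69M"])
departments.loc[rows_to_update, "department"] = "Rhône"
departments.loc[rows_to_update, "department_num"] = "69"

# Keep only the first row for each department number
departments = departments.drop_duplicates(subset="department_num").reset_index(drop=True)

print(departments[["department_num", "department"]].values.tolist())
```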

We export `departments` as a .csv file to merge with `france_data`

In [11]:
# Export the data to a csv file
departments.to_csv('../../data/France/Demographics/departments.csv', index=False)
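
One thing to watch when the exported file is read back: if a column contains only digits, pandas infers an integer type by default and leading zeros are lost. The sketch below uses an in-memory CSV and a stand-in DataFrame to show the effect and one way around it (passing `dtype` to `pd.read_csv`).

```python
import io

import pandas as pd

# Stand-in with purely numeric codes
departments = pd.DataFrame({
    "department_num": ["01", "02", "69"],
    "department": ["Ain", "Aisne", "Rhône"],
})

# With no path, to_csv returns the CSV text
csv_text = departments.to_csv(index=False)

# Default parsing infers integers: '01' becomes 1
naive = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype keeps the codes intact
restored = pd.read_csv(io.StringIO(csv_text), dtype={"department_num": str})

print(naive["department_num"].tolist())
print(restored["department_num"].tolist())
```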

In [12]:
departments.tail(3)

Unnamed: 0,department_num,department,region
94,93,Seine-Saint-Denis,Île-de-France
95,94,Val-de-Marne,Île-de-France
96,95,Val-d'Oise,Île-de-France


----
&nbsp;
## *Statistiques locales* by department. [INSEE](https://www.insee.fr/fr/statistiques/6013867) 2020

In [13]:
stats_locale = pd.read_csv("../../data/France/Demographics/stats_locales.csv", sep=';')
stats_locale.head()

Unnamed: 0,Code,Libellé,Taux de pauvreté 2020,Taux de chômage annuel moyen 2022,Salaire net horaire moyen 2021,Population municipale 2020,Densité de population (historique depuis 1876) 2020,Nb de pers. non scolarisées de 15 ans ou + 2020
0,1,Ain,10.5,5.5,15.34,657856,114.2,480283
1,2,Aisne,18.0,10.5,13.92,529374,71.9,394221
2,3,Allier,15.3,7.7,13.63,335628,45.7,263472
3,4,Alpes-de-Haute-Provence,16.6,8.2,14.15,165451,23.9,129106
4,5,Hautes-Alpes,13.9,6.9,13.54,140605,25.3,109699


In [14]:
stats_locale.columns.tolist()

['Code',
 'Libellé',
 'Taux de pauvreté 2020',
 'Taux de chômage annuel moyen 2022',
 'Salaire net horaire moyen 2021',
 'Population municipale 2020',
 'Densité de population (historique depuis 1876) 2020',
 'Nb de pers. non scolarisées de 15 ans ou + 2020']

In [15]:
stats_locale = stats_locale.rename(columns={
    'Code': 'department_num',
    'Libellé': 'department',
    'Taux de pauvreté 2020': 'poverty_rate(%)',
    'Taux de chômage annuel moyen 2022': 'average_annual_unemployment_rate(%)',
    'Salaire net horaire moyen 2021': 'average_net_hourly_wage(€)',
    'Population municipale 2020': 'municipal_population',
    'Densité de population (historique depuis 1876) 2020': 'population_density(inhabitants/sq_km)',
    'Nb de pers. non scolarisées de 15 ans ou + 2020': 'non_schooled_persons_15_and_over'
})

In [16]:
stats_locale.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 8 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   department_num                         101 non-null    object 
 1   department                             101 non-null    object 
 2   poverty_rate(%)                        101 non-null    float64
 3   average_annual_unemployment_rate(%)    101 non-null    object 
 4   average_net_hourly_wage(€)             101 non-null    object 
 5   municipal_population                   101 non-null    object 
 6   population_density(inhabitants/sq_km)  101 non-null    object 
 7   non_schooled_persons_15_and_over       101 non-null    object 
dtypes: float64(1), object(7)
memory usage: 6.4+ KB


In [17]:
# Remove leading and trailing whitespace for all string columns
stats_locale = stats_locale.apply(lambda col: col.str.strip() if col.dtype == "object" else col)

In [18]:
# Define columns to be converted to numeric type
numeric_cols = ['average_annual_unemployment_rate(%)',
                'average_net_hourly_wage(€)',
                'municipal_population',
                'population_density(inhabitants/sq_km)',
                'non_schooled_persons_15_and_over']

# Convert the columns to numeric, setting any errors to NaN
for col in numeric_cols:
    stats_locale[col] = pd.to_numeric(stats_locale[col], errors='coerce')
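
As a small illustration of what `errors='coerce'` does, the sketch below converts a hand-written Series. The placeholder string `"N/A"` is an assumption; the actual marker INSEE uses for missing values is not shown in this notebook.

```python
import pandas as pd

# Hypothetical raw values: two numbers stored as text and one placeholder
raw = pd.Series(["5.5", " 10.5 ", "N/A"])

# Values that cannot be parsed become NaN instead of raising an error
converted = pd.to_numeric(raw.str.strip(), errors="coerce")

print(converted.tolist())
print(converted.isna().sum())
```

Counting the NaN values afterwards shows how many entries could not be parsed.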

In [19]:
# Check if there are any NaN values in the DataFrame
nan_values = stats_locale.isnull().sum()
print(nan_values[nan_values > 0])

average_annual_unemployment_rate(%)      1
average_net_hourly_wage(€)               1
municipal_population                     1
population_density(inhabitants/sq_km)    1
non_schooled_persons_15_and_over         1
dtype: int64


In [20]:
# Return that row of data
nan_rows = stats_locale[stats_locale.isna().any(axis=1)]
nan_rows

Unnamed: 0,department_num,department,poverty_rate(%),average_annual_unemployment_rate(%),average_net_hourly_wage(€),municipal_population,population_density(inhabitants/sq_km),non_schooled_persons_15_and_over
100,976,Mayotte,77.3,,,,,


*`Mayotte`* is an overseas department. It will be removed from the dataset along with the other overseas departments, as the analysis focuses on metropolitan France.

In [21]:
stats_locale.tail(10)

Unnamed: 0,department_num,department,poverty_rate(%),average_annual_unemployment_rate(%),average_net_hourly_wage(€),municipal_population,population_density(inhabitants/sq_km),non_schooled_persons_15_and_over
91,91,Essonne,13.2,6.4,17.85,1306118.0,723.9,906036.0
92,92,Hauts-de-Seine,11.9,5.9,26.02,1626213.0,9260.4,1141945.0
93,93,Seine-Saint-Denis,27.6,10.2,14.98,1655422.0,7008.6,1112210.0
94,94,Val-de-Marne,16.6,7.1,18.86,1407972.0,5746.1,978399.0
95,95,Val-d'Oise,17.0,8.0,17.03,1251804.0,1004.7,852868.0
96,971,Guadeloupe,34.5,18.6,14.93,383559.0,235.5,283923.0
97,972,Martinique,26.7,12.5,14.69,361225.0,320.2,275923.0
98,973,Guyane,52.9,13.1,15.1,285133.0,3.4,164254.0
99,974,La Réunion,35.6,18.1,13.79,863083.0,344.7,594394.0
100,976,Mayotte,77.3,,,,,


We remove the overseas departments

In [22]:
# Remove the last 5 entries from the DataFrame (overseas departments)
# .copy() avoids a SettingWithCopyWarning when columns are assigned later
stats_locale = stats_locale.iloc[:-5].copy()
print(f"Shape of stats_locale: {stats_locale.shape}")

Shape of stats_locale: (96, 8)


The population density that INSEE reports for Paris (10,179.8 inhabitants/sq_km, shown below) disagrees with the figure listed on [Wikipedia](https://en.wikipedia.org/wiki/List_of_French_departments_by_population), 20,454 inhabitants/sq_km. We overwrite the INSEE value with the Wikipedia figure.

In [23]:
paris = stats_locale[stats_locale['department'] == 'Paris']
paris

Unnamed: 0,department_num,department,poverty_rate(%),average_annual_unemployment_rate(%),average_net_hourly_wage(€),municipal_population,population_density(inhabitants/sq_km),non_schooled_persons_15_and_over
75,75,Paris,15.4,5.7,27.14,2145906.0,10179.8,1563175.0


In [24]:
stats_locale.loc[stats_locale['department'] == 'Paris', 'population_density(inhabitants/sq_km)'] = 20454

We derive `area(sq_km)` from the population and the population density.

In [25]:
stats_locale['area(sq_km)'] = (
    stats_locale['municipal_population']
    / stats_locale['population_density(inhabitants/sq_km)']
).round(2)
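
As a worked check of the formula (area = population / density), the sketch below applies it to the Paris figures shown above, once with the density from INSEE and once with the density from Wikipedia. Only the variable names are introduced here.

```python
# Figures for Paris taken from the outputs above
population = 2145906
density_insee = 10179.8      # inhabitants/sq_km, from the INSEE file
density_wikipedia = 20454    # inhabitants/sq_km, from Wikipedia

# area = population / density
area_insee = round(population / density_insee, 2)
area_wikipedia = round(population / density_wikipedia, 2)

print(area_insee)      # 210.8
print(area_wikipedia)  # 104.91
```

The INSEE density implies an area roughly twice as large as the one implied by the Wikipedia density.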

In [26]:
stats_locale.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 9 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   department_num                         96 non-null     object 
 1   department                             96 non-null     object 
 2   poverty_rate(%)                        96 non-null     float64
 3   average_annual_unemployment_rate(%)    96 non-null     float64
 4   average_net_hourly_wage(€)             96 non-null     float64
 5   municipal_population                   96 non-null     float64
 6   population_density(inhabitants/sq_km)  96 non-null     float64
 7   non_schooled_persons_15_and_over       96 non-null     float64
 8   area(sq_km)                            96 non-null     float64
dtypes: float64(7), object(2)
memory usage: 6.9+ KB


----
&nbsp;
### Merging `departments` & `stats_locale`

In [27]:
# Sort both DataFrames by department number

departments = departments.sort_values('department_num')
stats_locale = stats_locale.sort_values('department_num')

We check that the two dataframes contain the same set of `department` names.

In [28]:
set1 = set(departments['department'].unique())
set2 = set(stats_locale['department'].unique())

print(set1 == set2)  # Prints True if both sets are equal

True
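
Had the comparison printed `False`, set differences would show which names disagree. The sketch below uses made-up sets in which one name differs only by its apostrophe character; the sample values are assumptions.

```python
# Hypothetical name sets that differ by one apostrophe style
set1 = {"Ain", "Rhône", "Val-d'Oise"}
set2 = {"Ain", "Rhône", "Val-d\u2019Oise"}  # typographic apostrophe

# Names present in one set but not in the other
only_in_first = sorted(set1 - set2)
only_in_second = sorted(set2 - set1)

print(only_in_first)
print(only_in_second)
```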


In [29]:
# Merge the two dataframes on the department name and number
demographics = pd.merge(departments, stats_locale, on=['department', 'department_num'])
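
An inner merge silently drops rows that have no match on the other side. One way to check for this, sketched on stand-in data, is an outer merge with `indicator=True`, together with `validate='one_to_one'` to confirm that the keys are unique in both tables. The sample frames below are assumptions and deliberately mismatched.

```python
import pandas as pd

# Stand-in tables with one unmatched department on each side
departments = pd.DataFrame({
    "department_num": ["01", "02", "03"],
    "department": ["Ain", "Aisne", "Allier"],
    "region": ["Auvergne-Rhône-Alpes", "Hauts-de-France", "Auvergne-Rhône-Alpes"],
})
stats_locale = pd.DataFrame({
    "department_num": ["01", "02", "04"],
    "department": ["Ain", "Aisne", "Alpes-de-Haute-Provence"],
    "poverty_rate(%)": [10.5, 18.0, 16.6],
})

# The _merge column records where each row came from
check = pd.merge(
    departments, stats_locale,
    on=["department", "department_num"],
    how="outer", indicator=True, validate="one_to_one",
)

# Rows found in only one of the two tables
unmatched = check[check["_merge"] != "both"]
print(unmatched[["department_num", "department", "_merge"]])
```

An empty `unmatched` frame would indicate that every department was found in both tables.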

We drop `non_schooled_persons_15_and_over`, as it appears to be the weakest statistic.

In [30]:
demographics = demographics.drop(columns="non_schooled_persons_15_and_over")
demographics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96 entries, 0 to 95
Data columns (total 9 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   department_num                         96 non-null     object 
 1   department                             96 non-null     object 
 2   region                                 96 non-null     object 
 3   poverty_rate(%)                        96 non-null     float64
 4   average_annual_unemployment_rate(%)    96 non-null     float64
 5   average_net_hourly_wage(€)             96 non-null     float64
 6   municipal_population                   96 non-null     float64
 7   population_density(inhabitants/sq_km)  96 non-null     float64
 8   area(sq_km)                            96 non-null     float64
dtypes: float64(6), object(3)
memory usage: 7.5+ KB


The `demographics` data can now be exported

In [31]:
# Export the data to a csv file
demographics.to_csv('../../data/France/Demographics/demographics.csv', index=False)