# France Paris Weather Data Preparation & Data quality check

The format of the daily weather data for France is a set of two CSV files by department and for a significant period. For example, I have 2 CSV files for Paris department (75) for the period of 1950-2023, one file for the essential parameters (Temperature, Rainfall, etc.) and the other one for the complements (Humidity is included in this file). I have another 2 CSV files for the period of last two years (2023-2024).

Since the three variables that I want to analyze to compare with HK weather information are Temperature, Humidity and Rainfall, I have 4 files representing the weather data of Paris.

In [2]:
# Necessary modules
import pandas as pd  # type: ignore
import time
import datetime
import os
import glob
import numpy as np # type: ignore

In [6]:
list_files = glob.glob("../0_Data/PAR/DonneesMeteo/*.csv")
list_files

['../0_Data/PAR/DonneesMeteo/Q_75_previous-1950-2023_RR-T-Vent.csv',
 '../0_Data/PAR/DonneesMeteo/Q_75_previous-1950-2023_autres-parametres.csv',
 '../0_Data/PAR/DonneesMeteo/Q_75_latest-2024-2025_RR-T-Vent.csv',
 '../0_Data/PAR/DonneesMeteo/Q_75_latest-2024-2025_autres-parametres.csv']

Let's check the first file in the list before preceeding a recursive data importation.

In [None]:
dataset = pd.read_csv(list_files[0], delimiter = ";")
print(dataset.head())

   NUM_POSTE  NOM_USUEL        LAT       LON  ALTI  AAAAMMJJ   RR  QRR  TN  \
0   75101001  INNOCENTS  48.860667  2.348333    37  19500101  0.0  1.0 NaN   
1   75101001  INNOCENTS  48.860667  2.348333    37  19500102  1.8  1.0 NaN   
2   75101001  INNOCENTS  48.860667  2.348333    37  19500103  2.0  1.0 NaN   
3   75101001  INNOCENTS  48.860667  2.348333    37  19500104  0.2  1.0 NaN   
4   75101001  INNOCENTS  48.860667  2.348333    37  19500105  1.0  1.0 NaN   

   QTN  ...  HXI2  QHXI2  FXI3S  QFXI3S  DXI3S  QDXI3S  HXI3S  QHXI3S  DRR  \
0  NaN  ...   NaN    NaN    NaN     NaN    NaN     NaN    NaN     NaN  NaN   
1  NaN  ...   NaN    NaN    NaN     NaN    NaN     NaN    NaN     NaN  NaN   
2  NaN  ...   NaN    NaN    NaN     NaN    NaN     NaN    NaN     NaN  NaN   
3  NaN  ...   NaN    NaN    NaN     NaN    NaN     NaN    NaN     NaN  NaN   
4  NaN  ...   NaN    NaN    NaN     NaN    NaN     NaN    NaN     NaN  NaN   

   QDRR  
0   NaN  
1   NaN  
2   NaN  
3   NaN  
4   NaN  

[

In [16]:
print(dataset['NUM_POSTE'].drop_duplicates().shape)
print(dataset['NOM_USUEL'].drop_duplicates())

(38,)
0                       INNOCENTS
4872              TOUR ST-JACQUES
24931                     PLANTES
33270                  LUXEMBOURG
57194                     LAENNEC
75371               CHAMP DE MARS
76712                 TOUR EIFFEL
86366                   LOUIS XVI
95132                LARIBOISIERE
120870                   ST-LOUIS
135528               ILE DE BERCY
160894             LA FAISANDERIE
182276               LEO LAGRANGE
200913                 ST-ANTOINE
226991               PORTE D'IVRY
230572                SALPETRIERE
247253           PARIS-MONTSOURIS
274281               OBSERVATOIRE
291457              OBS. TERRASSE
296143    PARIS-MONTSOURIS-DOUBLE
298828                  VAUGIRARD
318955                G. POMPIDOU
323109                    AUTEUIL
335710                  BAGATELLE
361607                      PASSY
393194                  LONGCHAMP
399313                BATIGNOLLES
414470                 MONTMARTRE
428267            BUTTES CHAUMONT
451224  

In fact, those parameters are measured from 38 different observatories (Same number of observatories as in Hong Kong!) just for Paris city. It is quite many given that the surface of Hong Kong is about ten times more than Paris surface (1108 square km vs. 105 square km). But population in Hong Kong is much more than population in Paris (7M residents vs. 2M residents). In terms of the population density, it goes up to 6300 people by square km in Hong Kong, versus 19000 people by square km in Paris. - It is quite unexpected that the population density in Paris is the triple than in Hong Kong, since to my feelings as a resident in two cities, Hong Kong seemed to be much more crowded. It has to be put in perspective of the habitable surface though, since the habitable surface is much smaller in Hong Kong because of the mountains - so relatively few people may be densely assembled in a smaller area.

Anyways, to have only one value representing the global weather parameter for Paris, I choose to average these values over observatories.  

I will extract only three variables from these CSV files, which seem to be the most corresponding to HK Data according to their descriptions in the documentation. Those are the followings:
- RR (From the first file) : Rainfall quantity in last 24 hours (measured on the period from 6AM the day to 6AM the day after). The final value picked at 6AM on D+1 will be affected to the data of the day D. (in mm)
- TX (From the first file): maximal temperature in a covered area (in Celcius)
- UM (From the second file): daily average of the hourly maximal humidity (in %)

For more informations about other French weather variables, one can refer to the documentation of this data in French Data Gouv webpage (https://www.data.gouv.fr/fr/datasets/donnees-climatologiques-de-base-quotidiennes/).

In [35]:
period = [x.split("/")[-1].split("_")[-2] for x in list_files]
param = [x.split("/")[-1].split("_")[-1] for x in list_files]
param

['RR-T-Vent.csv',
 'autres-parametres.csv',
 'RR-T-Vent.csv',
 'autres-parametres.csv']

In [133]:
final_dataset = pd.DataFrame()
data_summary = pd.DataFrame()

period = [x.split("/")[-1].split("_")[-2] for x in list_files]
param = [x.split("/")[-1].split("_")[-1] for x in list_files]

for sub_period in set(period):
    sub_list = [x for x in list_files if sub_period in x]
    for sub_file in sub_list:
        if param[0] in sub_file:
            dataset_T1 = pd.read_csv(sub_file, delimiter = ";")
            dataset_T1 = dataset_T1[['NOM_USUEL','AAAAMMJJ', 'RR', 'TX']]
        else:
            dataset_T2 = pd.read_csv(list_files[1], delimiter = ";")
            dataset_T2 = dataset_T2[['NOM_USUEL','AAAAMMJJ', 'UM']]
    dataset_for_period = dataset_T1.set_index(['NOM_USUEL','AAAAMMJJ']).join(dataset_T2.set_index(['NOM_USUEL','AAAAMMJJ']), on = ['NOM_USUEL','AAAAMMJJ'])
    dataset_for_period['Period'] = sub_period
    final_dataset = pd.concat([dataset_for_period, final_dataset])

In [139]:
final_dataset = final_dataset.reset_index().set_index('AAAAMMJJ')

In [140]:
final_dataset['year'] = [int(str(x)[:4]) for x in final_dataset.index]
final_dataset['month'] = [int(str(x)[4:6]) for x in final_dataset.index]
final_dataset['day'] = [int(str(x)[6:8]) for x in final_dataset.index]
final_dataset['Date'] = pd.to_datetime(final_dataset[['year', 'month', 'day']])
final_dataset.reset_index(inplace = True, drop = True)

In [141]:
final_dataset

Unnamed: 0,NOM_USUEL,RR,TX,UM,Period,year,month,day,Date
0,LUXEMBOURG,9.6,11.5,,latest-2024-2025,2024,1,1,2024-01-01
1,LUXEMBOURG,7.1,12.7,,latest-2024-2025,2024,1,2,2024-01-02
2,LUXEMBOURG,3.2,13.0,,latest-2024-2025,2024,1,3,2024-01-03
3,LUXEMBOURG,0.2,11.6,,latest-2024-2025,2024,1,4,2024-01-04
4,LUXEMBOURG,0.2,10.1,,latest-2024-2025,2024,1,5,2024-01-05
...,...,...,...,...,...,...,...,...,...
496913,BELLEVILLE PARC,,18.8,76.0,previous-1950-2023,2015,6,22,2015-06-22
496914,BELLEVILLE PARC,,20.9,64.0,previous-1950-2023,2015,6,23,2015-06-23
496915,BELLEVILLE PARC,,25.4,53.0,previous-1950-2023,2015,6,24,2015-06-24
496916,BELLEVILLE PARC,,29.4,42.0,previous-1950-2023,2015,6,25,2015-06-25


- Daily average value for three variables for Paris:

In [149]:
final_dataset_avg = final_dataset[['Date','RR', 'TX', 'UM']].groupby('Date').mean()

In [None]:
final_dataset[['Date', 'RR', 'TX','UM']]

Unnamed: 0,Date,RR,TX,UM
0,2024-01-01,9.6,11.5,
1,2024-01-02,7.1,12.7,
2,2024-01-03,3.2,13.0,
3,2024-01-04,0.2,11.6,
4,2024-01-05,0.2,10.1,
...,...,...,...,...
496913,2015-06-22,,18.8,76.0
496914,2015-06-23,,20.9,64.0
496915,2015-06-24,,25.4,53.0
496916,2015-06-25,,29.4,42.0


In [162]:

NA_table = final_dataset[['RR', 'TX','UM']].groupby(final_dataset.NOM_USUEL).sum().reset_index()
count_table = final_dataset[['NOM_USUEL', 'RR', 'TX', 'UM']].groupby('NOM_USUEL').count()
totalObs_RR = count_table['RR'].sum()
totalObs_TX = count_table['TX'].sum()
totalObs_UM = count_table['UM'].sum()
count_table['PctTo_TotalOBS_RR']=(count_table['RR']/totalObs_RR)
count_table['PctTo_TotalOBS_TX']=(count_table['TX']/totalObs_TX)
count_table['PctTo_TotalOBS_UM']=(count_table['UM']/totalObs_UM)
count_table

Unnamed: 0_level_0,RR,TX,UM,PctTo_TotalOBS_RR,PctTo_TotalOBS_TX,PctTo_TotalOBS_UM
NOM_USUEL,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
AUTEUIL,24067,5572,0,0.049887,0.04109,0.0
BAGATELLE,25869,9608,0,0.053623,0.070852,0.0
BATIGNOLLES,15005,8858,0,0.031103,0.065322,0.0
BELLEVILLE,1369,0,0,0.002838,0.0,0.0
BELLEVILLE PARC,6281,7049,6864,0.01302,0.051981,0.153636
BUTTES CHAUMONT,22929,6637,0,0.047528,0.048943,0.0
BUTTES RESERV.,365,0,0,0.000757,0.0,0.0
CHAMP DE MARS,1341,0,0,0.00278,0.0,0.0
CHARONNE,8002,0,0,0.016587,0.0,0.0
G. POMPIDOU,4154,0,0,0.008611,0.0,0.0


In [205]:
# Compute the observation period by Observatory, for each Variable :
Period_Obs = pd.DataFrame({
    'NOM_USUEL':final_dataset.NOM_USUEL.drop_duplicates()
})
Period_Obs.set_index('NOM_USUEL', inplace = True)
for variable in ['RR', 'TX', 'UM']:
    Start_Date = final_dataset.loc[final_dataset[variable].isna()==False][['NOM_USUEL','Date']].groupby('NOM_USUEL').min()
    End_Date = final_dataset.loc[final_dataset[variable].isna()==False][['NOM_USUEL','Date']].groupby('NOM_USUEL').max()
    Var_Obs_Period = Start_Date.join(End_Date, lsuffix = "_S", rsuffix = "_E")
    temp = pd.DataFrame({
        'NOM_USUEL':Var_Obs_Period.index,
        'period':[np.ceil(td/np.timedelta64(1, 'D')).astype(int) for td in (Var_Obs_Period['Date_E']-Var_Obs_Period['Date_S'])/(30*12)]
    })
    temp.columns = ['NOM_USUEL', 'Period_'+variable]
    temp.set_index('NOM_USUEL', inplace = True)
    Period_Obs = Period_Obs.join(temp, on = 'NOM_USUEL')

In [206]:
Period_Obs

Unnamed: 0_level_0,Period_RR,Period_TX,Period_UM
NOM_USUEL,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
LUXEMBOURG,69.0,48.0,
TOUR EIFFEL,,29.0,12.0
LARIBOISIERE,77.0,6.0,
PARIS-MONTSOURIS,77.0,77.0,67.0
PARIS-MONTSOURIS-DOUBLE,,9.0,
LONGCHAMP,19.0,19.0,18.0
INNOCENTS,14.0,,
TOUR ST-JACQUES,57.0,31.0,14.0
PLANTES,24.0,,
LAENNEC,52.0,,


In [209]:
final_dataset.loc[(final_dataset['NOM_USUEL']=='LUXEMBOURG')&(final_dataset['UM'].isna()==False)]

Unnamed: 0,NOM_USUEL,RR,TX,UM,Period,year,month,day,Date


In [None]:
dataset_T1 = dataset[['AAAAMMJJ', 'RR', 'TX']]
# Average over all 38 observatories by a date
dataset_T1 = dataset_T1.groupby('AAAAMMJJ').mean(['RR', 'TX'])

dataset_T2 = pd.read_csv(list_files[1], delimiter = ";")
dataset_T2 = dataset_T2[['AAAAMMJJ', 'UM']]
# Average over all 38 observatories by a date
dataset_T2 = dataset_T2.groupby('AAAAMMJJ').mean(['UM'])
dataset = dataset_T1.join(dataset_T2, on = 'AAAAMMJJ')


Unnamed: 0_level_0,RR,TX,UM
AAAAMMJJ,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
19500101,0.000000,2.400000,
19500102,2.282609,9.600000,
19500103,2.791304,10.600000,
19500104,0.521739,10.100000,
19500105,1.026087,9.800000,
...,...,...,...
20231227,0.100000,11.150000,72.5
20231228,0.000000,11.700000,83.5
20231229,0.400000,11.716667,84.0
20231230,0.900000,11.250000,87.0
