# France Paris Weather Data Preparation & Data quality check

The daily weather data for France comes as a set of two CSV files per department for each period. For example, I have 2 CSV files for the Paris department (75) covering 1950-2023: one file for the essential parameters (Temperature, Rainfall, etc.) and another for the complementary parameters (Humidity is in this file). I have another 2 CSV files for the most recent period (2024-2025).

Since the three variables that I want to compare with the HK weather information are Temperature, Humidity and Rainfall, I need all 4 files to represent the weather data of Paris.

In [121]:
# Necessary modules
import pandas as pd  # type: ignore
import time
import datetime
import os
import glob
import numpy as np # type: ignore

In [122]:
list_files = glob.glob("../0_Data/PAR/DonneesMeteo/*.csv")
list_files

['../0_Data/PAR/DonneesMeteo/Q_75_previous-1950-2023_RR-T-Vent.csv',
 '../0_Data/PAR/DonneesMeteo/Q_75_previous-1950-2023_autres-parametres.csv',
 '../0_Data/PAR/DonneesMeteo/Q_75_latest-2024-2025_RR-T-Vent.csv',
 '../0_Data/PAR/DonneesMeteo/Q_75_latest-2024-2025_autres-parametres.csv']
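
The file names listed above appear to follow a `Q_<department>_<period>_<parameters>.csv` pattern. As a minimal sketch, assuming that pattern holds for every file, the parts can be pulled apart as follows (`parse_weather_filename` is a hypothetical helper, not part of the original notebook):

```python
from pathlib import Path


def parse_weather_filename(path):
    """Split 'Q_<department>_<period>_<parameters>.csv' into its parts."""
    stem = Path(path).stem  # file name without directory and '.csv'
    # maxsplit=3 keeps any further underscore inside the last part
    _, department, period, params = stem.split("_", 3)
    return {"department": department, "period": period, "params": params}


example = parse_weather_filename(
    "../0_Data/PAR/DonneesMeteo/Q_75_previous-1950-2023_RR-T-Vent.csv"
)
print(example)
```

Parsing the name once, instead of testing for substrings, makes it explicit which file holds which group of parameters.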

Let's check the first file in the list before proceeding with the import of all the files in a loop.

In [123]:
dataset = pd.read_csv(list_files[0], delimiter = ";")
print(dataset.head())

   NUM_POSTE  NOM_USUEL        LAT       LON  ALTI  AAAAMMJJ   RR  QRR  TN  \
0   75101001  INNOCENTS  48.860667  2.348333    37  19500101  0.0  1.0 NaN   
1   75101001  INNOCENTS  48.860667  2.348333    37  19500102  1.8  1.0 NaN   
2   75101001  INNOCENTS  48.860667  2.348333    37  19500103  2.0  1.0 NaN   
3   75101001  INNOCENTS  48.860667  2.348333    37  19500104  0.2  1.0 NaN   
4   75101001  INNOCENTS  48.860667  2.348333    37  19500105  1.0  1.0 NaN   

   QTN  ...  HXI2  QHXI2  FXI3S  QFXI3S  DXI3S  QDXI3S  HXI3S  QHXI3S  DRR  \
0  NaN  ...   NaN    NaN    NaN     NaN    NaN     NaN    NaN     NaN  NaN   
1  NaN  ...   NaN    NaN    NaN     NaN    NaN     NaN    NaN     NaN  NaN   
2  NaN  ...   NaN    NaN    NaN     NaN    NaN     NaN    NaN     NaN  NaN   
3  NaN  ...   NaN    NaN    NaN     NaN    NaN     NaN    NaN     NaN  NaN   
4  NaN  ...   NaN    NaN    NaN     NaN    NaN     NaN    NaN     NaN  NaN   

   QDRR  
0   NaN  
1   NaN  
2   NaN  
3   NaN  
4   NaN  


In [124]:
print(dataset['NUM_POSTE'].drop_duplicates().shape)
print(dataset['NOM_USUEL'].drop_duplicates())

(38,)
0                       INNOCENTS
4872              TOUR ST-JACQUES
24931                     PLANTES
33270                  LUXEMBOURG
57194                     LAENNEC
75371               CHAMP DE MARS
76712                 TOUR EIFFEL
86366                   LOUIS XVI
95132                LARIBOISIERE
120870                   ST-LOUIS
135528               ILE DE BERCY
160894             LA FAISANDERIE
182276               LEO LAGRANGE
200913                 ST-ANTOINE
226991               PORTE D'IVRY
230572                SALPETRIERE
247253           PARIS-MONTSOURIS
274281               OBSERVATOIRE
291457              OBS. TERRASSE
296143    PARIS-MONTSOURIS-DOUBLE
298828                  VAUGIRARD
318955                G. POMPIDOU
323109                    AUTEUIL
335710                  BAGATELLE
361607                      PASSY
393194                  LONGCHAMP
399313                BATIGNOLLES
414470                 MONTMARTRE
428267            BUTTES CHAUMONT
451224  

In fact, these parameters are measured at 38 different observatories within the city of Paris alone (the same number of observatories as in Hong Kong!). That is a lot, given that the area of Hong Kong is about ten times that of Paris (1108 square km vs. 105 square km). On the other hand, the population of Hong Kong is much larger than that of Paris (7M residents vs. 2M residents). In terms of population density, this gives about 6300 people per square km in Hong Kong, versus about 19000 people per square km in Paris. It is quite unexpected that Paris is three times as dense as Hong Kong, since, having lived in both cities, Hong Kong felt much more crowded to me. This has to be put in perspective with the habitable area, though: it is much smaller in Hong Kong because of the mountains, so the population is concentrated in a smaller part of the territory.
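
The density figures can be checked with a few lines of arithmetic. This is only a sketch that reuses the rounded population and area figures quoted in the text, so the results are approximate:

```python
# Rounded figures quoted in the text
hk_population, hk_area_km2 = 7_000_000, 1108
paris_population, paris_area_km2 = 2_000_000, 105

hk_density = hk_population / hk_area_km2
paris_density = paris_population / paris_area_km2

print(f"Hong Kong: {hk_density:,.0f} people per square km")
print(f"Paris:     {paris_density:,.0f} people per square km")
print(f"Ratio Paris / Hong Kong: {paris_density / hk_density:.1f}")
```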

Anyway, to obtain a single value per weather parameter for the whole of Paris, I choose to average these values across observatories.
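
One consequence of averaging across observatories is that the daily mean is computed only over the stations that reported a value on that day. The toy example below illustrates this; the station names and values are made up:

```python
import numpy as np
import pandas as pd

# Made-up readings: STATION_B has no temperature on the second day
toy = pd.DataFrame({
    "NOM_USUEL": ["STATION_A", "STATION_B", "STATION_A", "STATION_B"],
    "Date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"]),
    "TX": [10.0, 12.0, 8.0, np.nan],
})

daily_mean = toy.groupby("Date")["TX"].mean()    # NaN values are skipped
n_stations = toy.groupby("Date")["TX"].count()   # stations behind each mean
print(pd.DataFrame({"TX_mean": daily_mean, "n_stations": n_stations}))
```

Keeping the station count next to the mean helps to judge how representative each daily value is.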

I will extract only three variables from these CSV files, the ones that seem to correspond best to the HK data according to their descriptions in the documentation. They are the following:
- RR (from the first file): rainfall amount over 24 hours (in mm), measured from 6AM on day D to 6AM on day D+1. The final value read at 6AM on D+1 is assigned to day D.
- TX (from the first file): daily maximum temperature under shelter (in Celsius)
- UM (from the second file): daily average of the hourly maximum humidity (in %)
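
To make the 6AM-to-6AM convention for RR concrete, here is a sketch on a hypothetical hourly rainfall series. The hourly data and the assumption that each timestamp marks the start of its hour are mine; the CSV files themselves already contain the daily totals.

```python
import pandas as pd

# Hypothetical hourly rainfall (mm): 1 mm every hour over two calendar days
hours = pd.date_range("2024-01-01 00:00", periods=48, freq=pd.Timedelta(hours=1))
hourly_rr = pd.Series(1.0, index=hours)

# Shift by 6 hours so that 06:00 on day D up to 05:00 on D+1 all map to day D
rainfall_day = (hourly_rr.index - pd.Timedelta(hours=6)).normalize()
daily_rr = hourly_rr.groupby(rainfall_day).sum()
print(daily_rr)
```

With this convention, the first six hours of 1 January are counted towards 31 December.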

For more information about the other French weather variables, one can refer to the documentation of this dataset on the French data.gouv.fr webpage (https://www.data.gouv.fr/fr/datasets/donnees-climatologiques-de-base-quotidiennes/).

In [125]:
# Aggregate the data of the three variables, one period at a time:

final_dataset = pd.DataFrame()
data_summary = pd.DataFrame()

# File names look like Q_75_<period>_<parameters>.csv
period = [x.split("/")[-1].split("_")[-2] for x in list_files]
param = [x.split("/")[-1].split("_")[-1] for x in list_files]

for sub_period in set(period):
    sub_list = [x for x in list_files if sub_period in x]
    for sub_file in sub_list:
        # param[0] is the suffix of the first listed file (the RR-T-Vent one)
        if param[0] in sub_file:
            # Essential parameters: rainfall (RR) and maximum temperature (TX)
            dataset_T1 = pd.read_csv(sub_file, delimiter = ";")
            dataset_T1 = dataset_T1[['NOM_USUEL','AAAAMMJJ', 'RR', 'TX']]
        else:
            # Complementary parameters: humidity (UM)
            dataset_T2 = pd.read_csv(sub_file, delimiter = ";")
            dataset_T2 = dataset_T2[['NOM_USUEL','AAAAMMJJ', 'UM']]
    # Left join on (observatory, date): every row of the essential file is kept
    dataset_for_period = dataset_T1.set_index(['NOM_USUEL','AAAAMMJJ']).join(dataset_T2.set_index(['NOM_USUEL','AAAAMMJJ']), on = ['NOM_USUEL','AAAAMMJJ'])
    dataset_for_period['Period'] = sub_period
    final_dataset = pd.concat([dataset_for_period, final_dataset])
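
The join above keeps every (observatory, date) row from the essential file and leaves the humidity empty where the complementary file has no matching row. A minimal sketch of the same idea with `DataFrame.merge` on made-up rows (station names and values are hypothetical):

```python
import pandas as pd

keys = ["NOM_USUEL", "AAAAMMJJ"]

# Made-up rows standing in for the two CSV files of one period
essentials = pd.DataFrame({
    "NOM_USUEL": ["STATION_A", "STATION_A", "STATION_B"],
    "AAAAMMJJ": [20240101, 20240102, 20240101],
    "RR": [0.0, 1.8, 0.2],
    "TX": [7.1, 8.4, 6.9],
})
complements = pd.DataFrame({
    "NOM_USUEL": ["STATION_A", "STATION_A"],
    "AAAAMMJJ": [20240101, 20240102],
    "UM": [81.0, 77.0],
})

# how="left" keeps all essential rows; validate raises if a key is duplicated
merged = essentials.merge(complements, on=keys, how="left", validate="one_to_one")
print(merged)
```

The `validate` argument is optional, but it guards against silently duplicated rows if a station reports the same date twice.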

In [126]:
# Reformat the date variable (AAAAMMJJ = YYYYMMDD) into a datetime column
final_dataset = final_dataset.reset_index().set_index('AAAAMMJJ')
final_dataset['year'] = [int(str(x)[:4]) for x in final_dataset.index]
final_dataset['month'] = [int(str(x)[4:6]) for x in final_dataset.index]
final_dataset['day'] = [int(str(x)[6:8]) for x in final_dataset.index]
final_dataset['Date'] = pd.to_datetime(final_dataset[['year', 'month', 'day']])
final_dataset.reset_index(inplace = True, drop = True)
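
As an alternative sketch, the `AAAAMMJJ` integers can also be parsed in a single step by passing an explicit format to `pd.to_datetime`. The sample values below are illustrative:

```python
import pandas as pd

# Dates stored as integers in YYYYMMDD form, as in the AAAAMMJJ column
raw_dates = pd.Series([19500101, 19500102, 20241231], name="AAAAMMJJ")

# Convert to strings first so that the explicit format can be applied
parsed = pd.to_datetime(raw_dates.astype(str), format="%Y%m%d")
print(parsed)
```

Year, month and day remain available afterwards through `parsed.dt.year`, `parsed.dt.month` and `parsed.dt.day`.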

- Daily average value for three variables for Paris:

In [127]:
final_dataset_avg = final_dataset[['Date','RR', 'TX', 'UM']].groupby('Date').mean()

In [128]:
# Count the number of observations by Observatory, and the contribution of each Observatory to the total dataset

# Number of missing values by Observatory, for each variable
NA_table = final_dataset[['RR', 'TX','UM']].isna().groupby(final_dataset.NOM_USUEL).sum().reset_index()
count_table = final_dataset[['NOM_USUEL', 'RR', 'TX', 'UM']].groupby('NOM_USUEL').count()
count_table.columns= "count_"+count_table.columns
totalObs_RR = count_table['count_RR'].sum()
totalObs_TX = count_table['count_TX'].sum()
totalObs_UM = count_table['count_UM'].sum()
count_table['PctTo_TotalOBS_RR']=round((100*count_table['count_RR']/totalObs_RR),1)
count_table['PctTo_TotalOBS_TX']=round((100*count_table['count_TX']/totalObs_TX))
count_table['PctTo_TotalOBS_UM']=round((100*count_table['count_UM']/totalObs_UM))
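
The counting logic relies on the fact that `count()` ignores missing values, so a station's share is based only on the days on which it actually reported. A toy illustration with made-up stations:

```python
import numpy as np
import pandas as pd

# Made-up rainfall readings: 2 valid values for STATION_A, 1 for STATION_B
toy = pd.DataFrame({
    "NOM_USUEL": ["STATION_A"] * 3 + ["STATION_B"] * 3,
    "RR": [0.0, 1.2, np.nan, 0.4, np.nan, np.nan],
})

counts = toy.groupby("NOM_USUEL")["RR"].count()   # NaN values are not counted
share_pct = (100 * counts / counts.sum()).round(1)
print(pd.DataFrame({"count_RR": counts, "PctTo_TotalOBS_RR": share_pct}))
```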

In [129]:
# Compute the observation period by Observatory, for each Variable
# (in years, approximated with 360-day years and rounded up):
Period_Obs = pd.DataFrame({
    'NOM_USUEL':final_dataset.NOM_USUEL.drop_duplicates()
})
Period_Obs.set_index('NOM_USUEL', inplace = True)
for variable in ['RR', 'TX', 'UM']:
    Start_Date = final_dataset.loc[final_dataset[variable].notna()][['NOM_USUEL','Date']].groupby('NOM_USUEL').min()
    End_Date = final_dataset.loc[final_dataset[variable].notna()][['NOM_USUEL','Date']].groupby('NOM_USUEL').max()
    Var_Obs_Period = Start_Date.join(End_Date, lsuffix = "_S", rsuffix = "_E")
    temp = pd.DataFrame({
        'NOM_USUEL':Var_Obs_Period.index,
        'period':[np.ceil(td/np.timedelta64(1, 'D')).astype(int) for td in (Var_Obs_Period['Date_E']-Var_Obs_Period['Date_S'])/(30*12)]
    })
    temp.columns = ['NOM_USUEL', 'Period_'+variable]
    temp.set_index('NOM_USUEL', inplace = True)
    Period_Obs = Period_Obs.join(temp, on = 'NOM_USUEL')
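
The cell above measures the span between the first and last valid observation, using 360-day years rounded up. For comparison, here is a sketch of the same idea with 365.25-day years and no rounding up. It is an alternative, not the notebook's method, and it can give slightly smaller values than those shown in the tables of this notebook. The data is made up:

```python
import pandas as pd

# Made-up first and last valid observations for two stations
toy = pd.DataFrame({
    "NOM_USUEL": ["STATION_A", "STATION_A", "STATION_B", "STATION_B"],
    "Date": pd.to_datetime(["1950-01-01", "2024-12-31", "2000-06-01", "2005-05-31"]),
    "RR": [0.0, 1.0, 2.0, 0.5],
})

valid = toy[toy["RR"].notna()]                       # rows with a measurement
span = valid.groupby("NOM_USUEL")["Date"].agg(["min", "max"])
span["years"] = ((span["max"] - span["min"]).dt.days / 365.25).round(1)
print(span)
```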

In [130]:
count_table = count_table.join(Period_Obs, on = "NOM_USUEL")


In [131]:
count_table.columns

Index(['count_RR', 'count_TX', 'count_UM', 'PctTo_TotalOBS_RR',
       'PctTo_TotalOBS_TX', 'PctTo_TotalOBS_UM', 'Period_RR', 'Period_TX',
       'Period_UM'],
      dtype='object')

In [132]:
# Reorder the columns of "count_table" so that they are grouped by Variable:
count_table_temp = count_table.melt()
col_names = count_table_temp['variable'].str.split("_", expand = True)
count_table_temp['colnames_to_row'] = col_names[2].combine_first(col_names[1])\
                                 .fillna('.')
count_table_temp['colnames_to_row_bis'] = col_names[0].str.lower()
count_table_temp.sort_values(["colnames_to_row","colnames_to_row_bis"], inplace = True)
New_column_order = count_table_temp['variable'].drop_duplicates().values
count_table = count_table[New_column_order]
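
The same column order can be obtained more directly by sorting the column names on their suffix. This is a sketch, assuming every column name ends in `_<variable>`; `variable_then_metric` is a hypothetical helper:

```python
columns = [
    "count_RR", "count_TX", "count_UM",
    "PctTo_TotalOBS_RR", "PctTo_TotalOBS_TX", "PctTo_TotalOBS_UM",
    "Period_RR", "Period_TX", "Period_UM",
]


def variable_then_metric(name):
    # 'PctTo_TotalOBS_RR' -> metric 'PctTo_TotalOBS', variable 'RR'
    metric, _, variable = name.rpartition("_")
    return (variable, metric.lower())


ordered = sorted(columns, key=variable_then_metric)
print(ordered)
```

The resulting list could then be used as `count_table[ordered]`.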

- Which Observatory makes the largest contribution to the variable Rainfall (RR)?

In [133]:
count_table.sort_values(by="PctTo_TotalOBS_RR", ascending = False).head()

| NOM_USUEL | count_RR | PctTo_TotalOBS_RR | Period_RR | count_TX | PctTo_TotalOBS_TX | Period_TX | count_UM | PctTo_TotalOBS_UM | Period_UM |
|---|---|---|---|---|---|---|---|---|---|
| PARIS-MONTSOURIS | 27475 | 5.7 | 77.0 | 27475 | 20.0 | 77.0 | 24349 | 53.0 | 69.0 |
| ST-ANTOINE | 26078 | 5.4 | 75.0 | 0 | 0.0 | NaN | 0 | 0.0 | NaN |
| BAGATELLE | 25869 | 5.4 | 74.0 | 9608 | 7.0 | 28.0 | 0 | 0.0 | NaN |
| LARIBOISIERE | 26185 | 5.4 | 77.0 | 2075 | 2.0 | 6.0 | 0 | 0.0 | NaN |
| ILE DE BERCY | 25366 | 5.3 | 74.0 | 0 | 0.0 | NaN | 0 | 0.0 | NaN |


We also see that some observatories don't contribute at all to the Rainfall records:

In [134]:
count_table.sort_values(by="PctTo_TotalOBS_RR").head()

| NOM_USUEL | count_RR | PctTo_TotalOBS_RR | Period_RR | count_TX | PctTo_TotalOBS_TX | Period_TX | count_UM | PctTo_TotalOBS_UM | Period_UM |
|---|---|---|---|---|---|---|---|---|---|
| TOUR EIFFEL | 0 | 0.0 | NaN | 9658 | 7.0 | 29.0 | 3374 | 7.0 | 12.0 |
| PARIS-MONTSOURIS-DOUBLE | 0 | 0.0 | NaN | 3132 | 2.0 | 9.0 | 0 | 0.0 | NaN |
| BUTTES RESERV. | 365 | 0.1 | 2.0 | 0 | 0.0 | NaN | 0 | 0.0 | NaN |
| HEROLD | 243 | 0.1 | 1.0 | 0 | 0.0 | NaN | 0 | 0.0 | NaN |
| BELLEVILLE | 1369 | 0.3 | 5.0 | 0 | 0.0 | NaN | 0 | 0.0 | NaN |


- For Temperature, the largest contributor is also PARIS-MONTSOURIS, with a wider lead over the others (20% of the observations, against 13% for LUXEMBOURG in second place).

In [135]:
count_table.sort_values(by="PctTo_TotalOBS_TX", ascending = False).head()

| NOM_USUEL | count_RR | PctTo_TotalOBS_RR | Period_RR | count_TX | PctTo_TotalOBS_TX | Period_TX | count_UM | PctTo_TotalOBS_UM | Period_UM |
|---|---|---|---|---|---|---|---|---|---|
| PARIS-MONTSOURIS | 27475 | 5.7 | 77.0 | 27475 | 20.0 | 77.0 | 24349 | 53.0 | 69.0 |
| LUXEMBOURG | 24297 | 5.0 | 69.0 | 17095 | 13.0 | 48.0 | 0 | 0.0 | NaN |
| TOUR ST-JACQUES | 20029 | 4.2 | 57.0 | 10371 | 8.0 | 31.0 | 4468 | 10.0 | 14.0 |
| BAGATELLE | 25869 | 5.4 | 74.0 | 9608 | 7.0 | 28.0 | 0 | 0.0 | NaN |
| TOUR EIFFEL | 0 | 0.0 | NaN | 9658 | 7.0 | 29.0 | 3374 | 7.0 | 12.0 |


As with Rainfall, some observatories don't contribute at all:

In [136]:
count_table.sort_values(by="PctTo_TotalOBS_TX").head()

| NOM_USUEL | count_RR | PctTo_TotalOBS_RR | Period_RR | count_TX | PctTo_TotalOBS_TX | Period_TX | count_UM | PctTo_TotalOBS_UM | Period_UM |
|---|---|---|---|---|---|---|---|---|---|
| LOUIS XVI | 8766 | 1.8 | 25.0 | 0 | 0.0 | NaN | 0 | 0.0 | NaN |
| TENON | 7779 | 1.6 | 25.0 | 0 | 0.0 | NaN | 0 | 0.0 | NaN |
| ST-LOUIS | 14658 | 3.0 | 44.0 | 0 | 0.0 | NaN | 0 | 0.0 | NaN |
| ST-ANTOINE | 26078 | 5.4 | 75.0 | 0 | 0.0 | NaN | 0 | 0.0 | NaN |
| PORTE D'IVRY | 3581 | 0.7 | 15.0 | 0 | 0.0 | NaN | 0 | 0.0 | NaN |


- How about Humidity? Here as well, PARIS-MONTSOURIS is the biggest contributor, with 53% of the data for this variable, recorded over about 69 years.

In [137]:
count_table.sort_values(by="PctTo_TotalOBS_UM", ascending = False).head()

| NOM_USUEL | count_RR | PctTo_TotalOBS_RR | Period_RR | count_TX | PctTo_TotalOBS_TX | Period_TX | count_UM | PctTo_TotalOBS_UM | Period_UM |
|---|---|---|---|---|---|---|---|---|---|
| PARIS-MONTSOURIS | 27475 | 5.7 | 77.0 | 27475 | 20.0 | 77.0 | 24349 | 53.0 | 69.0 |
| BELLEVILLE PARC | 6281 | 1.3 | 20.0 | 7049 | 5.0 | 22.0 | 6864 | 15.0 | 22.0 |
| LONGCHAMP | 6566 | 1.4 | 19.0 | 6543 | 5.0 | 19.0 | 6516 | 14.0 | 19.0 |
| TOUR ST-JACQUES | 20029 | 4.2 | 57.0 | 10371 | 8.0 | 31.0 | 4468 | 10.0 | 14.0 |
| TOUR EIFFEL | 0 | 0.0 | NaN | 9658 | 7.0 | 29.0 | 3374 | 7.0 | 12.0 |


In [140]:
count_table.sort_values(by="PctTo_TotalOBS_UM").head()

| NOM_USUEL | count_RR | PctTo_TotalOBS_RR | Period_RR | count_TX | PctTo_TotalOBS_TX | Period_TX | count_UM | PctTo_TotalOBS_UM | Period_UM |
|---|---|---|---|---|---|---|---|---|---|
| AUTEUIL | 24067 | 5.0 | 56.0 | 5572 | 4.0 | 16.0 | 0 | 0.0 | NaN |
| TENON | 7779 | 1.6 | 25.0 | 0 | 0.0 | NaN | 0 | 0.0 | NaN |
| ST-LOUIS | 14658 | 3.0 | 44.0 | 0 | 0.0 | NaN | 0 | 0.0 | NaN |
| ST-ANTOINE | 26078 | 5.4 | 75.0 | 0 | 0.0 | NaN | 0 | 0.0 | NaN |
| SALPETRIERE | 16652 | 3.5 | 47.0 | 9556 | 7.0 | 28.0 | 0 | 0.0 | NaN |


So far, the French weather data contains some missing values, but I have not noticed any irrelevant string values.
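
One possible way to check for irrelevant string values is to coerce a raw column to numbers and look at the entries that fail to convert. The sketch below uses a made-up column; the `"trace"` entry is a hypothetical example of a stray string, not something found in the French files:

```python
import pandas as pd

# Made-up raw column read as text, with one blank and one stray string
raw = pd.Series(["0.0", "1.8", "", "trace", "2.0"], name="RR")

# Values that cannot be parsed become NaN instead of raising an error
as_number = pd.to_numeric(raw, errors="coerce")

# Blank entries are ordinary missing values; keep only the non-blank failures
suspicious = raw[as_number.isna() & raw.str.strip().ne("")]
print(suspicious)
```

On the real data, a quick first check is `dataset.dtypes`: a measurement column containing strings would show up with the `object` dtype instead of a numeric one.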

Let's save the wrangled data.

In [None]:
treated_data_rep = r'../0_Data/wrangled/' 
if not os.path.exists(treated_data_rep):
    os.makedirs(treated_data_rep)
final_dataset_avg.to_pickle(treated_data_rep+"PARIS_AGG.pkl")