# Step-by-step guide for DISDRODB reader preparation 

This notebook aims to guide you through creating the reader for the raw files logged by a disdrometer device. 

In first place, this notebook will provide you with functions that will display and enable to investigate the content of your raw data files.

Successively, you will define a series of parameters defining the reader behaviour. These pieces of code will be consolidated in the [`reader_template.py`](https://github.com/ltelab/disdrodb/blob/main/disdrodb/L0/readers/reader_template.py) file to generate a DISDRODB L0 reader.


In this notebook, we uses a lightweight dataset for illustratory purposes. You may use it and readapt it for exploring your own dataset, when preparing a new reader. 

Following the documentation chapter [`Add a new reader`](https://disdrodb.readthedocs.io/en/latest/readers.html#adding-a-new-reader), we will follow 3 steps: 

* Step 1 : We set up the data within the correct directory structure
* Step 2 : We start digging into the data to set up the transformation parameters.
* Step 3 : We create the new reader

## Step 1: Set up the data within the correct directory structure

For this example, you will find the sample data in the folder [`data`](https://github.com/ltelab/disdrodb/tree/main/data/DISDRODB) of the [disdrodb](https://github.com/ltelab/disdrodb/) repository. 
It corresponds to some measurements taken at two stations (`station_name_1` and `station_name_2`) during two days of a field campaign led by the EPFL LTE laboratory.

```
  📁 DISDRODB
  ├── 📁 Raw
      ├── 📁 DATA_SOURCE
          ├── 📁 CAMPAIGN_NAME
              ├── 📁 data
                  ├── 📁 station_name_1
                  ├── 📜 file60_20180817.dat.gz
                  ├── 📜 file60_20180818.dat.gz
                  ├── 📁 station_name_2
                  ├── 📜 file61_20180817.dat.gz
                  ├── 📜 file61_20180818.dat.gz
              ├── 📁 info
              ├── 📁 issue
                  ├── 📜 station_name_1.yml
                  ├── 📜 station_name_2.yml
              ├── 📁 metadata
                  ├── 📜 station_name_1.yml
                  ├── 📜 station_name_2.yml
```

This structure fulfills the requirements described in the documentation to [Add a new reader](https://disdrodb.readthedocs.io/en/latest/readers.html#adding-a-new-reader).


## Step 2: Read and analyse the data

Once the dataset and metadata are set up in the correct directory structure, we can now start analysing our data. 

The objectives of Step 2 is to define the specifications to read the raw data into a dataframe and ensure that the dataframe columns match the DISDRODB standards.

At the end, you should be able to generate Apache Parquet files from your input raw data. 


--------------------------------------------------------------------
Here we load the modules and packages required. *Nothing must be changed here*. 

In [2]:
# Define project root directory
import os

root_path = os.path.dirname(
    os.getcwd()
)  # something like /home/ghiggi/Projects/disdrodb
print(root_path)

/home/ghiggi/Projects/disdrodb


In [3]:
# If you didn't installed disdrodb, but you are running this tutorial within the cloned repository:
import sys

sys.path.insert(0, root_path)

In [4]:
import logging
import pandas as pd

# Directory
from disdrodb.l0.io import (
    get_campaign_name,
    create_initial_directory_structure,
    get_raw_file_list,
)


# Tools to develop the reader
from disdrodb.l0.template_tools import (
    check_column_names,
    infer_df_str_column_names,
    print_df_first_n_rows,
    print_df_random_n_rows,
    print_df_column_names,
    print_valid_L0_column_names,
    get_df_columns_unique_values_dict,
    print_df_columns_unique_values,
    print_df_summary_stats,
)

# L0A processing
from disdrodb.l0.l0a_processing import (
    read_raw_data,
    read_raw_file_list,
    cast_column_dtypes,
    write_l0a,
)

# L0B processing
from disdrodb.l0.l0b_processing import (
    retrieve_l0b_arrays,
    create_l0b_from_l0a,
    set_encodings,
)

# Metadata
from disdrodb.l0.metadata import read_metadata

# Standards
from disdrodb.l0.check_standards import check_sensor_name, check_l0a_column_names

**1. Define paths and running parameters**

In the following section, define the raw and processed directory paths. *This may be changed if you are using another folder*.

NB:
- In the real use case, the `DATA_SOURCE` and `CAMPAIGN_NAME`should be replaced by meaningul names ! 
- The `raw_dir` and `processed_dir` must end with the same `CAMPAIGN_NAME` (in upper case format)

In [13]:
disdrodb_dir = os.path.join(root_path, "data", "DISDRODB")
raw_dir = os.path.join(disdrodb_dir, "Raw", "DATA_SOURCE", "CAMPAIGN_NAME")
processed_dir = os.path.join(disdrodb_dir, "Processed", "DATA_SOURCE", "CAMPAIGN_NAME")
assert os.path.exists(raw_dir), "Raw directory does not exist"
print(f"raw_dir: {raw_dir}")
print(f"processed_dir: {processed_dir}")

raw_dir: /home/ghiggi/Projects/disdrodb/data/DISDRODB/Raw/DATA_SOURCE/CAMPAIGN_NAME
processed_dir: /home/ghiggi/Projects/disdrodb/data/DISDRODB/Processed/DATA_SOURCE/CAMPAIGN_NAME


Then we define the reader execution parameters. When the new reader will be created, these parameters will be become the reader function arguments. Please have a look [at the documentation](https://disdrodb.readthedocs.io/en/latest/readers.html#runing-a-reader) to get a full description. 

In [15]:
force = True
parallel = False
verbose = True
debugging_mode = True
sensor_name = "OTT_Parsivel"

**3. Selection of the station**

In this example, we choose  to implement and run the reader for station `station_name_1`. However, feel free to change the station name :)

In [16]:
station_name = "station_name_1"

**2. Initialization**

We initiate some checks, and get some variable. *Nothing must be changed here.*

In [17]:
# Create directory structure
create_initial_directory_structure(
    raw_dir=raw_dir,
    processed_dir=processed_dir,
    station_name=station_name,
    force=force,
    verbose=False,
)

Please, be sure to run the cell above only one time. If it is run many times, the log file blocks the folder creation.  

**4. Get the list of file to process**

We now list all files that are in selected station.
Here we need to specify the [glob pattern](https://en.wikipedia.org/wiki/Glob_(programming)) that enables to select all the relevant data files. 
Since the files in this case study are named like `file<XXX>_<TIME>.dat.gz`, we define the glob pattern `"*.dat*"`. Note that also `"*.dat.gz"` or `"file*.dat.gz"` would have worked.


In [18]:
glob_pattern = "*.dat*"

file_list = get_raw_file_list(
    raw_dir=raw_dir,
    station_name=station_name,
    glob_patterns=glob_pattern,
    verbose=verbose,
    debugging_mode=debugging_mode,
)

print(file_list)

 -  - 2 files to process in /home/ghiggi/Projects/disdrodb/data/DISDRODB/Raw/DATA_SOURCE/CAMPAIGN_NAME/data/station_name_1
['/home/ghiggi/Projects/disdrodb/data/DISDRODB/Raw/DATA_SOURCE/CAMPAIGN_NAME/data/station_name_1/file60_20180817.dat.gz', '/home/ghiggi/Projects/disdrodb/data/DISDRODB/Raw/DATA_SOURCE/CAMPAIGN_NAME/data/station_name_1/file60_20180818.dat.gz']


🚨 The `glob_pattern` variable definition will be transferred into the [`reader_template.py`](https://github.com/ltelab/disdrodb/blob/main/disdrodb/L0/readers/reader_template.py) file at the end of this notebook.

Remember that the `glob_pattern` variable depends on the file extensions of your dataset !!!

**5. Retrieve metadata from YAML files**

We now load the metadata file of the station.

If the name of the station is not correctly defined, an error message is raised.

In [19]:
# Retrieve metadata
attrs = read_metadata(campaign_dir=raw_dir, station_name=station_name)

# Retrieve sensor name
sensor_name = attrs["sensor_name"]
check_sensor_name(sensor_name)

**5. Load the one file into a dataframe**

In the  `reader_kwargs` dictionary, you may set [any arguments](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) that need to be passed to read the raw text file into a `pandas.DataFrame`.

In [20]:
reader_kwargs = {}

# - Define delimiter
reader_kwargs["delimiter"] = ","

# - Avoid first column to become df index !!!
reader_kwargs["index_col"] = False

# Since column names are expected to be passed explicitly, header is set to None
reader_kwargs["header"] = None

# - Number of rows to be skipped at the beginning of the file
reader_kwargs["skiprows"] = None

# - Define behaviour when encountering bad lines
reader_kwargs["on_bad_lines"] = "skip"

# - Define reader engine
#   - C engine is faster
#   - Python engine is more feature-complete
reader_kwargs["engine"] = "python"

# - Define on-the-fly decompression of on-disk data
#   - Available: gzip, bz2, zip
reader_kwargs["compression"] = "infer"

# - Strings to recognize as NA/NaN and replace with standard NA flags
#   - Already included: ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’,
#                       ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’,
#                       ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’
reader_kwargs["na_values"] = ["na", "", "error"]


# -----------------------------------------------------------
# Select first file
filepath = file_list[0]

# Try to read the raw file
df_raw = read_raw_data(filepath, column_names=None, reader_kwargs=reader_kwargs)
# Print the dataframe
print(f"Dataframe for the file {os.path.basename(filepath)} :")
display(df_raw)

Dataframe for the file file60_20180817.dat.gz :


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
0,362511,4612.0301,00847.4977,01-08-2018 12:44:30,,OK,0000.000,0056.49,00,00,...,035,0.06,24.9,0,005.649,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
1,362512,4612.0301,00847.4978,01-08-2018 12:45:01,,OK,0000.000,0056.49,00,00,...,035,0.06,24.9,0,005.649,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
2,362513,4612.0301,00847.4985,01-08-2018 12:45:30,,OK,0000.000,0056.49,00,00,...,035,0.06,24.9,0,005.649,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
3,362514,4612.0305,00847.4990,01-08-2018 12:46:01,,OK,0000.000,0056.49,00,00,...,035,0.05,24.9,0,005.649,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
4,362515,4612.0303,00847.4992,01-08-2018 12:46:31,,OK,0000.000,0056.49,00,00,...,034,0.06,24.9,0,005.649,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4736,367249,4612.0313,00847.4956,03-08-2018 04:13:25,,OK,0000.000,0056.71,00,00,...,015,0.06,24.9,0,005.671,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
4737,367250,4612.0313,00847.4955,03-08-2018 04:13:56,,OK,0000.000,0056.71,00,00,...,015,0.06,24.9,0,005.671,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
4738,367251,4612.0313,00847.4955,03-08-2018 04:14:26,,OK,0000.000,0056.71,00,00,...,015,0.06,24.9,0,005.671,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
4739,367252,4612.0313,00847.4954,03-08-2018 04:14:55,,OK,0000.000,0056.71,00,00,...,015,0.06,24.9,0,005.671,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0


In [22]:
print("Column names:", df_raw.columns)
print("Row Index:", df_raw.index)

Column names: Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23],
           dtype='int64')
Row Index: RangeIndex(start=0, stop=4741, step=1)


Here we expect the `df_raw` to have: 
- numeric column names (i.e.  `Int64Index`) 
- numeric row index (i.e. `RangeIndex`)  

If the structure of the dataframe looks fine (no header and no row index), we are on the good track ! 


Depending on the schema of your data, this `reader_kwargs` dictionary may be fairly different from the one above. 

> 🚨 The `reader_kwargs` dictionary will be transferred to the [`reader_template.py`](https://github.com/ltelab/disdrodb/blob/main/disdrodb/L0/readers/reader_template.py) file at the end of this notebook. 

**6. Data exploration**

The settings for the loading of the data is now ready, we can now load one file and analyse its content to see if there is any errors or inconsistencies.

Here are some instructions : 

* Do not assign column names to the dataframe columns yet
* Do not assign a dtype to the dataframe columns yet
* Possibly look at multiple files ;)


We print the content first 3 rows :
 (*Feel free to change the value of n to see more/less rows*)

In [23]:
print_df_first_n_rows(df_raw, n=2, column_names=False)

 - Column 0 :
      ['362511' '362512' '362513']
 - Column 1 :
      ['4612.0301' '4612.0301' '4612.0301']
 - Column 2 :
      ['00847.4977' '00847.4978' '00847.4985']
 - Column 3 :
      ['01-08-2018 12:44:30' '01-08-2018 12:45:01' '01-08-2018 12:45:30']
 - Column 4 :
      [nan nan nan]
 - Column 5 :
      ['OK' 'OK' 'OK']
 - Column 6 :
      ['0000.000' '0000.000' '0000.000']
 - Column 7 :
      ['0056.49' '0056.49' '0056.49']
 - Column 8 :
      ['00' '00' '00']
 - Column 9 :
      ['00' '00' '00']
 - Column 10 :
      ['-9.999' '-9.999' '-9.999']
 - Column 11 :
      ['9999' '9999' '9999']
 - Column 12 :
      ['12611' '12617' '12600']
 - Column 13 :
      ['00000' '00000' '00000']
 - Column 14 :
      ['035' '035' '035']
 - Column 15 :
      ['0.06' '0.06' '0.06']
 - Column 16 :
      ['24.9' '24.9' '24.9']
 - Column 17 :
      ['0' '0' '0']
 - Column 18 :
      ['005.649' '005.649' '005.649']
 - Column 19 :
      ['000' '000' '000']
 - Column 20 :
      ['-9.999,-9.999,-9.999,-9

In [24]:
df_raw.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
0,362511,4612.0301,847.4977,01-08-2018 12:44:30,,OK,0.0,56.49,0,0,...,35,0.06,24.9,0,5.649,0,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
1,362512,4612.0301,847.4978,01-08-2018 12:45:01,,OK,0.0,56.49,0,0,...,35,0.06,24.9,0,5.649,0,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
2,362513,4612.0301,847.4985,01-08-2018 12:45:30,,OK,0.0,56.49,0,0,...,35,0.06,24.9,0,5.649,0,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0


We print the content of n rows picked randomly : 

In [25]:
print_df_random_n_rows(df_raw, n=6, with_column_names=False)

- Column 0 : ['365205' '363869' '366700' '366371' '366659' '363330']
- Column 1 : ['4612.0319' '4612.0293' '4612.0293' '4612.0312' '4612.0305' '4612.0328']
- Column 2 : ['00847.4989' '00847.4946' '00847.4936' '00847.4958' '00847.4923'
 '00847.4942']
- Column 3 : ['02-08-2018 11:11:31' '02-08-2018 00:03:31' '02-08-2018 23:39:01'
 '02-08-2018 20:54:30' '02-08-2018 23:18:31' '01-08-2018 19:34:01']
- Column 4 : [nan nan nan nan nan nan]
- Column 5 : ['OK' 'OK' 'OK' 'OK' 'OK' 'OK']
- Column 6 : ['0000.000' '0000.000' '0000.000' '0000.000' '0000.000' '0000.000']
- Column 7 : ['0056.67' '0056.67' '0056.71' '0056.71' '0056.71' '0056.67']
- Column 8 : ['00' '00' '00' '00' '00' '00']
- Column 9 : ['00' '00' '00' '00' '00' '00']
- Column 10 : ['-9.999' '-9.999' '-9.999' '-9.999' '-9.999' '-9.999']
- Column 11 : ['9999' '9999' '9999' '9999' '9999' '9999']
- Column 12 : ['12628' '12562' '11699' '12305' '11694' '12501']
- Column 13 : ['00000' '00000' '00000' '00000' '00000' '00000']
- Column 14 : ['

Get the number of column :

In [26]:
len(df_raw.columns)

24

Look at unique values for a single column :

In [27]:
print_df_columns_unique_values(df_raw, column_indices=11, column_names=False)

 - Column 11 :
      ['0824', '0906', '1363', '1397', '2921', '3203', '3326', '3816', '4465', '9999']


Look at unique values for a few columns :

Note: Use `column_indices=None` to get the unique values for all columns

In [28]:
print_df_columns_unique_values(df_raw, column_indices=slice(10, 12), column_names=False)

 - Column 10 :
      ['-9.999', '02.669', '04.241', '04.745', '04.826', '04.879', '05.430', '06.095', '06.220', '07.415', '08.436', '08.489', '08.506', '08.724', '08.956', '09.079', '09.894', '10.057', '10.567', '11.705', '12.097', '12.390', '12.923', '13.114', '13.407', '13.684', '14.324', '15.060', '16.530', '16.636', '16.668', '17.194', '17.382', '17.829', '17.918', '18.334', '18.655', '19.526', '20.329', '21.134', '21.426', '23.098', '23.664', '23.760', '24.472', '25.473', '25.957', '29.270', '31.271', '32.255', '33.844', '36.196']
 - Column 11 :
      ['0824', '0906', '1363', '1397', '2921', '3203', '3326', '3816', '4465', '9999']


Get the unique values as dictionary

In [29]:
get_df_columns_unique_values_dict(
    df_raw, column_indices=slice(10, 12), column_names=False
)

{'Column 10': ['-9.999',
  '02.669',
  '04.241',
  '04.745',
  '04.826',
  '04.879',
  '05.430',
  '06.095',
  '06.220',
  '07.415',
  '08.436',
  '08.489',
  '08.506',
  '08.724',
  '08.956',
  '09.079',
  '09.894',
  '10.057',
  '10.567',
  '11.705',
  '12.097',
  '12.390',
  '12.923',
  '13.114',
  '13.407',
  '13.684',
  '14.324',
  '15.060',
  '16.530',
  '16.636',
  '16.668',
  '17.194',
  '17.382',
  '17.829',
  '17.918',
  '18.334',
  '18.655',
  '19.526',
  '20.329',
  '21.134',
  '21.426',
  '23.098',
  '23.664',
  '23.760',
  '24.472',
  '25.473',
  '25.957',
  '29.270',
  '31.271',
  '32.255',
  '33.844',
  '36.196'],
 'Column 11': ['0824',
  '0906',
  '1363',
  '1397',
  '2921',
  '3203',
  '3326',
  '3816',
  '4465',
  '9999']}

**7. Columns name**

Now we have validated the content of our data. It's time to care about its structure (column names). 

The function `infer_df_str_column_names()` tries to guess the column name based on string patterns according to `L0A_encodings.yml` and the type of sensor.

In [30]:
infer_df_str_column_names(df_raw, sensor_name=sensor_name)

{0: [],
 1: [],
 2: [],
 3: [],
 4: [],
 5: [],
 6: ['rainfall_rate_32bit'],
 7: ['rainfall_accumulated_32bit', 'rainfall_accumulated_16bit'],
 8: ['weather_code_synop_4680', 'weather_code_synop_4677'],
 9: ['weather_code_synop_4680', 'weather_code_synop_4677'],
 10: ['reflectivity_32bit', 'rainfall_rate_16bit'],
 11: ['mor_visibility'],
 12: ['number_particles', 'sample_interval', 'laser_amplitude'],
 13: ['number_particles', 'sample_interval', 'laser_amplitude'],
 14: ['error_code', 'sensor_temperature'],
 15: ['sensor_heating_current'],
 16: ['sensor_battery_voltage'],
 17: ['sensor_status'],
 18: ['rainfall_amount_absolute_32bit'],
 19: ['error_code', 'sensor_temperature'],
 20: ['raw_drop_average_velocity', 'raw_drop_concentration'],
 21: ['raw_drop_average_velocity', 'raw_drop_concentration'],
 22: ['raw_drop_number'],
 23: ['sensor_status']}

This can help us to define later the `column_names` list.

As reference, here is the list of valid columns name (taken from `L0A_encodings.yml`):

In [31]:
print_valid_L0_column_names(sensor_name)

['rainfall_rate_32bit', 'rainfall_accumulated_32bit', 'weather_code_synop_4680', 'weather_code_synop_4677', 'weather_code_metar_4678', 'weather_code_nws', 'reflectivity_32bit', 'mor_visibility', 'sample_interval', 'laser_amplitude', 'number_particles', 'sensor_temperature', 'sensor_serial_number', 'firmware_iop', 'firmware_dsp', 'sensor_heating_current', 'sensor_battery_voltage', 'sensor_status', 'start_time', 'sensor_time', 'sensor_date', 'station_name', 'station_number', 'rainfall_amount_absolute_32bit', 'error_code', 'rainfall_rate_16bit', 'rainfall_rate_12bit', 'rainfall_accumulated_16bit', 'reflectivity_16bit', 'raw_drop_concentration', 'raw_drop_average_velocity', 'raw_drop_number']


It's time now to define our current column names : 

Hint to define the names :
* get information from the disdrometer user guide and the data logger employed. 
* use `infer_df_str_column_names()` to help you
* analyse the content column after column with `print_df_columns_unique_values()`  

In [32]:
column_names = [
    "unknown1",
    "unknown2",
    "unknown3",
    "timestep",
    "unknown4",
    "unknown5",
    "rainfall_rate_32bit",
    "rainfall_accumulated_32bit",
    "weather_code_synop_4680",
    "weather_code_synop_4677",
    "reflectivity_32bit",
    "mor_visibility",
    "laser_amplitude",
    "number_particles",
    "sensor_temperature",
    "sensor_heating_current",
    "sensor_battery_voltage",
    "sensor_status",
    "rainfall_amount_absolute_32bit",
    "error_code",
    "raw_drop_concentration",
    "raw_drop_average_velocity",
    "raw_drop_number",
    "unknown6",
]

> 🚨 The `column_names` list will be transferred  to the [reader_template.py](https://github.com/ltelab/disdrodb/blob/main/disdrodb/L0/readers/reader_template.py) file at the end of this notebook. 

Check the validity of your definition 

In [33]:
check_column_names(column_names, sensor_name)

The following columns do no met the DISDRODB standards: ['unknown2', 'timestep', 'unknown4', 'unknown1', 'unknown6', 'unknown3', 'unknown5'].
Please remove such columns within the df_sanitizer_fun
Please be sure to create the 'time' column within the df_sanitizer_fun.
The 'time' column must be datetime with resolution in seconds (dtype='M8[s]').


Ok, fair enough.
There are columns that need to be removed, and we need to also define a column "time" with dtype `datetime` to meet the DISDRODB standards.

These points will be addressed in Section 9 of this notebook ! 

**8. Read the dataframe with correct columns name**

We can now create a new dataframe with the columns name :

In [34]:
df = read_raw_data(
    filepath=filepath, column_names=column_names, reader_kwargs=reader_kwargs
)

And print the dataframe column names : 

In [35]:
print_df_column_names(df)

 - Column 0 : unknown1
 - Column 1 : unknown2
 - Column 2 : unknown3
 - Column 3 : timestep
 - Column 4 : unknown4
 - Column 5 : unknown5
 - Column 6 : rainfall_rate_32bit
 - Column 7 : rainfall_accumulated_32bit
 - Column 8 : weather_code_synop_4680
 - Column 9 : weather_code_synop_4677
 - Column 10 : reflectivity_32bit
 - Column 11 : mor_visibility
 - Column 12 : laser_amplitude
 - Column 13 : number_particles
 - Column 14 : sensor_temperature
 - Column 15 : sensor_heating_current
 - Column 16 : sensor_battery_voltage
 - Column 17 : sensor_status
 - Column 18 : rainfall_amount_absolute_32bit
 - Column 19 : error_code
 - Column 20 : raw_drop_concentration
 - Column 21 : raw_drop_average_velocity
 - Column 22 : raw_drop_number
 - Column 23 : unknown6


**9. Perform further tests and analysis to check the correctness of `column_names`**

You can for example check some statistics for a specific column.

In [36]:
column_name = "rainfall_rate_32bit"
array_of_values = df.loc[:, [column_name]].astype("float")
print_df_summary_stats(array_of_values)

 - Column 0 ( rainfall_rate_32bit ):
                    
mean  0.005426
min   0.000000
25%   0.000000
50%   0.000000
75%   0.000000
max   2.881000


**10. Final columns formatting**

In [37]:
check_l0a_column_names(df, sensor_name=sensor_name)

The following columns do no met the DISDRODB standards: ['unknown2', 'timestep', 'unknown4', 'unknown1', 'unknown6', 'unknown3', 'unknown5']


ValueError: The following columns do no met the DISDRODB standards: ['unknown2', 'timestep', 'unknown4', 'unknown1', 'unknown6', 'unknown3', 'unknown5']

In [38]:
check_column_names(column_names, sensor_name)

The following columns do no met the DISDRODB standards: ['unknown2', 'timestep', 'unknown4', 'unknown1', 'unknown6', 'unknown3', 'unknown5'].
Please remove such columns within the df_sanitizer_fun
Please be sure to create the 'time' column within the df_sanitizer_fun.
The 'time' column must be datetime with resolution in seconds (dtype='M8[s]').


Now, it's time to remove all the columns that does not match the DISDRODB standard.

In [39]:
df = df.drop(
    columns=["unknown1", "unknown2", "unknown3", "unknown4", "unknown5", "unknown6"]
)

It's also time to define the column `time` which is requested by the DISDRODB standard

In [40]:
df["time"] = pd.to_datetime(df["timestep"], format="%m-%d-%Y %H:%M:%S")
df = df.drop(columns=["timestep"])

Check column names met DISDRODB standards after custom processing :

In [41]:
check_l0a_column_names(df, sensor_name=sensor_name)

Check the dataframe looks as desired :

In [42]:
print_df_column_names(df)

 - Column 0 : rainfall_rate_32bit
 - Column 1 : rainfall_accumulated_32bit
 - Column 2 : weather_code_synop_4680
 - Column 3 : weather_code_synop_4677
 - Column 4 : reflectivity_32bit
 - Column 5 : mor_visibility
 - Column 6 : laser_amplitude
 - Column 7 : number_particles
 - Column 8 : sensor_temperature
 - Column 9 : sensor_heating_current
 - Column 10 : sensor_battery_voltage
 - Column 11 : sensor_status
 - Column 12 : rainfall_amount_absolute_32bit
 - Column 13 : error_code
 - Column 14 : raw_drop_concentration
 - Column 15 : raw_drop_average_velocity
 - Column 16 : raw_drop_number
 - Column 17 : time


In [43]:
print_df_random_n_rows(df, n=5)

- Column 0 (rainfall_rate_32bit) : ['0000.000' '0000.000' '0000.000' '0000.000' '0000.114']
- Column 1 (rainfall_accumulated_32bit) : ['0056.67' '0056.52' '0056.67' '0056.71' '0056.67']
- Column 2 (weather_code_synop_4680) : ['00' '00' '00' '00' '57']
- Column 3 (weather_code_synop_4677) : ['00' '00' '00' '00' '58']
- Column 4 (reflectivity_32bit) : ['-9.999' '-9.999' '-9.999' '-9.999' '10.567']
- Column 5 (mor_visibility) : ['9999' '9999' '9999' '9999' '9999']
- Column 6 (laser_amplitude) : ['12631' '12655' '12606' '11551' '12411']
- Column 7 (number_particles) : ['00000' '00000' '00003' '00000' '00022']
- Column 8 (sensor_temperature) : ['035' '027' '036' '015' '017']
- Column 9 (sensor_heating_current) : ['0.06' '0.06' '0.06' '0.06' '0.06']
- Column 10 (sensor_battery_voltage) : ['24.9' '24.9' '24.9' '24.9' '24.9']
- Column 11 (sensor_status) : ['0' '0' '0' '0' '0']
- Column 12 (rainfall_amount_absolute_32bit) : ['005.667' '005.652' '005.667' '005.671' '005.667']
- Column 13 (error_

In [44]:
print_df_columns_unique_values(df, column_indices=2, column_names=True)

 - Column 2 ( weather_code_synop_4680 ):
      ['00', '57', '61', '62', '71', '72', '88']


**11. Define the dataframe sanitizer function**

The `df_sanitizer_fun` encapsulate the code specific to each reader/dataset that is required to obtain a dataframe compliants with the DISDRODB standards.

With the data used in this notebook, we need to drop some columns and define the `time` column ! 

From the code defined in Section 10, we define the following function: 

In [45]:
def df_sanitizer_fun(df):
    # Import pandas
    import pandas as pd

    # - Drop unvalid columns
    columns_to_drop = [
        "unknown1",
        "unknown2",
        "unknown3",
        "unknown4",
        "unknown5",
        "unknown6",
    ]

    df = df.drop(columns=columns_to_drop)

    # - Convert timestep column to datetime format
    df["time"] = pd.to_datetime(df["timestep"], format="%m-%d-%Y %H:%M:%S")
    df = df.drop(columns=["timestep"])

    # - Return the dataframe
    return df

> 🚨 The `df_sanitizer_fun()` function will be transfered to the [reader_template.py](https://github.com/ltelab/disdrodb/blob/main/disdrodb/L0/readers/reader_template.py) file at the end of this notebook. 

**12. Now let's try calling the reader function as it will be called in the DISDRODB L0 reader**

* You may try with increasing number of files (update `file_list`)


Here we combine all raw files in a single dataframe. 

The function `read_raw_file_list` takes as argument :
* `file_list` : the list of files present in the specified station directory
* `column_names` : the list of column (defined previously)
* `reader_kwargs` : dictionary to data loading  into the dataframe (defined previously)
* `sensor_name` : taken from the `sensor_name` key in the metadata YAML file of the station
* `df_sanitizer_fun`: the function to sanitize the data frame (defined previously)

All these arguments are defined either in the data directory structure, or earlier in the code.

In [46]:
subset_file_list = file_list[:1]

df = read_raw_file_list(
    file_list=subset_file_list,
    column_names=column_names,
    reader_kwargs=reader_kwargs,
    sensor_name=sensor_name,
    verbose=verbose,
    df_sanitizer_fun=df_sanitizer_fun,
)
display(df)

 - 1 / 1 processed successfully. File name: /home/ghiggi/Projects/disdrodb/data/DISDRODB/Raw/DATA_SOURCE/CAMPAIGN_NAME/data/station_name_1/file60_20180817.dat.gz
 -  - 0 of 1 have been skipped.


Unnamed: 0,rainfall_rate_32bit,rainfall_accumulated_32bit,weather_code_synop_4680,weather_code_synop_4677,reflectivity_32bit,mor_visibility,laser_amplitude,number_particles,sensor_temperature,sensor_heating_current,sensor_battery_voltage,sensor_status,rainfall_amount_absolute_32bit,error_code,raw_drop_concentration,raw_drop_average_velocity,raw_drop_number,time
0,0.0,56.490002,0.0,0.0,-9.999,9999.0,12611.0,0.0,35.0,0.06,24.9,0.0,5.649,0.0,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",2018-01-08 12:44:30
1,0.0,56.490002,0.0,0.0,-9.999,9999.0,12617.0,0.0,35.0,0.06,24.9,0.0,5.649,0.0,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",2018-01-08 12:45:01
2,0.0,56.490002,0.0,0.0,-9.999,9999.0,12600.0,0.0,35.0,0.06,24.9,0.0,5.649,0.0,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",2018-01-08 12:45:30
3,0.0,56.490002,0.0,0.0,-9.999,9999.0,12603.0,0.0,35.0,0.05,24.9,0.0,5.649,0.0,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",2018-01-08 12:46:01
4,0.0,56.490002,0.0,0.0,-9.999,9999.0,12606.0,0.0,34.0,0.06,24.9,0.0,5.649,0.0,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",2018-01-08 12:46:31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4736,0.0,56.709999,0.0,0.0,-9.999,9999.0,11059.0,0.0,15.0,0.06,24.9,0.0,5.671,0.0,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",2018-03-08 04:13:25
4737,0.0,56.709999,0.0,0.0,-9.999,9999.0,11175.0,0.0,15.0,0.06,24.9,0.0,5.671,0.0,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",2018-03-08 04:13:56
4738,0.0,56.709999,0.0,0.0,-9.999,9999.0,11275.0,0.0,15.0,0.06,24.9,0.0,5.671,0.0,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",2018-03-08 04:14:26
4739,0.0,56.709999,0.0,0.0,-9.999,9999.0,11361.0,0.0,15.0,0.06,24.9,0.0,5.671,0.0,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",2018-03-08 04:14:55


Here we derive the corresponding xr.Dataset object 

In [47]:
ds = create_l0b_from_l0a(df, attrs, verbose=False)
print(ds)

<xarray.Dataset>
Dimensions:                         (time: 4741, diameter_bin_center: 32,
                                     velocity_bin_center: 32)
Coordinates: (12/13)
  * diameter_bin_center             (diameter_bin_center) float64 0.062 ... 24.5
    diameter_bin_lower              (diameter_bin_center) float64 0.0 ... 23.0
    diameter_bin_upper              (diameter_bin_center) float64 0.1245 ... ...
    diameter_bin_width              (diameter_bin_center) float64 0.125 ... 3.0
  * velocity_bin_center             (velocity_bin_center) float64 0.05 ... 20.8
    velocity_bin_lower              (velocity_bin_center) float64 0.0 ... 19.2
    ...                              ...
    velocity_bin_width              (velocity_bin_center) float64 0.1 ... 3.2
  * time                            (time) datetime64[ns] 2018-01-08T12:44:30...
    crs                             <U5 'WGS84'
    latitude                        float64 46.2
    longitude                       float64 8.792

which can be saved as DISDRODB L0B netCDF by running the following code:

In [48]:
# ds = set_encodings(ds, sensor_name)
# ds.to_netcdf("/path/where/to/save/the/file.nc")

## Step 3 : Create the reader

We have now all the elements to start creating the new reader. 
All the modifications that we did in this notebook must be now transcribed into a DISDRODB L0 reader file.

1. Copy and paste the [`disdrodb\L0\readers\reader_template.py`](https://github.com/ltelab/disdrodb/tree/main/disdrodb/L0/readers) into the folder `disdrodb\L0\readers\DATA_SOURCE`

2. Rename the copied file `<CAMPAIGN_NAME>.py` (or i.e. `<CAMPAIGN_NAME>_<sensor_acronym>.py` if within a single campaign multiple type of sensors have been deployed). This will be the `reader` name that you need to add to the metadata YAML file of the stations that require such reader.  

3. Within the file, update the portion of code described in the next points 4., 5. and 6.

4. Add the `reader` name to the metadata YAML files of the stations.

---------------------------------------------------------------------------------------


4. **Update the `columns_names` list**

   Before :

    ``` python
        column_names = []
    ```
    
    After : 

    ``` python
        column_names = [
            "unknown1",
            "unknown2",
            "unknown3",
            "timestep",
            "unknown4",
            "unknown5",
            "rainfall_rate_32bit",
            "rainfall_accumulated_32bit",
            "weather_code_synop_4680",
            "weather_code_synop_4677",
            "reflectivity_32bit",
            "mor_visibility",
            "laser_amplitude",
            "number_particles",
            "sensor_temperature",
            "sensor_heating_current",
            "sensor_battery_voltage",
            "sensor_status",
            "rainfall_amount_absolute_32bit",
            "error_code",
            "raw_drop_concentration",
            "raw_drop_average_velocity",
            "raw_drop_number",
            "unknown6",
        ]
    ```

---------------------------------------------------------------------------------------
5. **Update the `reader_kwargs` dictionary**

    Before :

    ``` python
        reader_kwargs = {}

    ```
    
    After : 

    ``` python
        reader_kwargs = {}

        # - Define delimiter
        reader_kwargs["delimiter"] = ","

        # - Avoid first column to become df index !!!
        reader_kwargs["index_col"] = False

        # Since column names are expected to be passed explicitly, header is set to None
        reader_kwargs['header'] = None

        # - Number of rows to be skipped at the beginning of the file 
        reader_kwargs['skiprows']= None

        # - Define behaviour when encountering bad lines
        reader_kwargs["on_bad_lines"] = "skip"

        # - Define reader engine
        #   - C engine is faster
        #   - Python engine is more feature-complete
        reader_kwargs["engine"] = "python"

        # - Define on-the-fly decompression of on-disk data
        #   - Available: gzip, bz2, zip
        reader_kwargs["compression"] = "infer"

        # - Strings to recognize as NA/NaN and replace with standard NA flags
        #   - Already included: ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’,
        #                       ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’,
        #                       ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’
        reader_kwargs["na_values"] = ["na", "", "error"]

    ```

---------------------------------------------------------------------------------------
6. **Update the `df_sanitizer_fun()` function**

   Before:

    ``` python

        def df_sanitizer_fun(df):
            # - Import dask or pandas
            import pandas as pd
                    
            # - Add here below the reader required custom code
            pass
        
            # - Return the dataframe 
            return df
        
    ```
    
    After : 

    ``` python
        def df_sanitizer_fun(df):
            # Import pandas
            import pandas as pd

            # - Drop unvalid columns
            columns_to_drop = ["unknown1", "unknown2", "unknown3","unknown4",'unknown5','unknown6']
            df = df.drop(columns=columns_to_drop)

            # - Convert timestep column to datetime format
            df["time"] = pd.to_datetime(df["timestep"], format="%m-%d-%Y %H:%M:%S")
            df = df.drop(columns=["timestep"])
            
            # - Return the dataframe 
            return df
        
    ```
 
 ---------------------------------------------------------------------------------------   

7. **Run the script**

To run the scripts, you need to define the local <disdrodb_dir> directory where all the data and metadata are stored.
On Windows, it will have a path ending by \DISDRODB 
On Mac/Linux, it will have a path ending by /DISDRODB 

To run the processing of a single station, just run:    
     
```
run_disdrodb_l0_station <disdrodb_dir> <DATA_SOURCE> <CAMPAIGN_NAME> <STATION_NAME> -l0b True -f True -v True -d False
```

To run the processing on all stations of a given campaign, just run:    
     
```
run_disdrodb_l0 <disdrodb_dir> --data_sources <DATA_SOURCE> --campaign_names <CAMPAIGN_NAME> -f True -v True -d False
```

Have a look  [here](https://disdrodb.readthedocs.io/en/latest/readers.html#running-a-reader) for a full documentation on how to run specific DISDRODB L0 processing. 

ATTENTION: For this to command to run, you need to have added the `reader` name to the station metadata YAML file !

 ---------------------------------------------------------------------------------------

8. **Check if the script has correctly executed**

    The output folder should be as follow :
    
    ```
    📁 DISDRODB
    ├── 📁 Processed
       ├── 📁 DATA_SOURCE
          ├── 📁 CAMPAIGN_NAME
              ├── 📁 info
                  ├── 📜 station_name_1.yml
                  ├── 📜 station_name_2.yml
              ├── 📁 L0A
                  ├── 📁 station_name_1
                     ├── 📜 *.parquet
                  ├── 📁 station_name_1
                     ├── 📜 *.parquet
              ├── 📁 L0B
                  ├── 📁 station_name_1
                     ├── 📜 *.nc
                  ├── 📁 station_name_2
                     ├── 📜 *.nc
              ├── 📁 logs
                     ├── 📁 L0A
                          ├── 📁 station_name_1
                              ├── 📜 logs_<raw_file_name>.log
                          ├── 📁 station_name_2
                              ├── 📜 logs_<raw_file_name>.log
                     ├── 📁 L0B
                          ├── 📁 station_name_1
                              ├── 📜 logs_<L0B_file_name>.log
                          ├── 📁 station_name_2
                              ├── 📜 logs_<L0B_file_name>.log
              ├── 📁 metadata
                  ├── 📜 station_name_1.yml
                  ├── 📜 station_name_2.yml

    ```

---------------------------------------------------------------------------------------

Well done 👋👋👋 

You should now be able to create a new reader for your own data.
Please consider to share the reader for your data with the community by uploading it on the DISDRODB repository.

Have a look at the [contributors guidelines](https://disdrodb.readthedocs.io/en/latest/contributors_guidelines.html) for more information and do not hesitate to open a [GitHub Issue](https://github.com/ltelab/disdrodb/issues) if you need any clarification. 

The DISDRODB team hope you enjoyed this tutorial 