# Step-by-step guide for DISDRODB reader preparation 

This notebook aims to guide you through creating the reader for the raw files logged by a disdrometer device. 

First, this notebook provides functions that display the content of your raw data files and let you investigate it.

Next, you will define a series of parameters that control the reader behaviour. These pieces of code will then be consolidated in the [`reader_template.py`](https://github.com/ltelab/disdrodb/blob/main/disdrodb/L0/readers/reader_template.py) file to generate a DISDRODB L0 reader.


In this notebook, we use a lightweight dataset for illustrative purposes. You may reuse and adapt the notebook to explore your own dataset when preparing a new reader.

Following the documentation chapter [`Add a new reader`](https://disdrodb.readthedocs.io/en/latest/readers.html#adding-a-new-reader), we will follow 3 steps: 

* Step 1 : We set up the data within the correct directory structure
* Step 2 : We start digging into the data to set up the transformation parameters.
* Step 3 : We create the new reader

## Step 1: Set up the data within the correct directory structure

For this example, you will find the sample data in the folder `data` of the [GitHub disdrodb repository](https://github.com/ltelab/disdrodb/). 
It corresponds to some measurements taken at two stations (`station_name_1` and `station_name_2`) during two days of a field campaign led by the EPFL LTE laboratory.

```
  📁 DISDRODB
  ├── 📁 Raw
      ├── 📁 DATA_SOURCE
          ├── 📁 CAMPAIGN_NAME
              ├── 📁 data
                  ├── 📁 station_name_1
                      ├── 📜 file60_20180817.dat.gz
                      ├── 📜 file60_20180818.dat.gz
                  ├── 📁 station_name_2
                      ├── 📜 file61_20180817.dat.gz
                      ├── 📜 file61_20180818.dat.gz
              ├── 📁 info
              ├── 📁 issue
                  ├── 📜 station_name_1.yml
                  ├── 📜 station_name_2.yml
              ├── 📁 metadata
                  ├── 📜 station_name_1.yml
                  ├── 📜 station_name_2.yml
```

This structure fulfills the requirements described in the documentation to [Add a new reader](https://disdrodb.readthedocs.io/en/latest/readers.html#adding-a-new-reader).
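Before moving on, it can be useful to check that every station folder in `data` has a matching YAML file in `metadata`. The helper below is a minimal sketch using only the standard library. The function name `find_stations_without_metadata` is hypothetical (it is not part of DISDRODB), and the demo runs on a throw-away directory tree rather than on the sample data.

```python
import tempfile
from pathlib import Path


def find_stations_without_metadata(campaign_dir):
    """Return station folders in `data` lacking a `metadata/<station>.yml` file."""
    campaign_dir = Path(campaign_dir)
    stations = sorted(p.name for p in (campaign_dir / "data").iterdir() if p.is_dir())
    return [
        station
        for station in stations
        if not (campaign_dir / "metadata" / f"{station}.yml").is_file()
    ]


# Demo on a temporary tree: station_name_2 has no metadata file on purpose
with tempfile.TemporaryDirectory() as tmp:
    campaign_dir = Path(tmp) / "DISDRODB" / "Raw" / "DATA_SOURCE" / "CAMPAIGN_NAME"
    for station in ["station_name_1", "station_name_2"]:
        (campaign_dir / "data" / station).mkdir(parents=True)
    (campaign_dir / "metadata").mkdir()
    (campaign_dir / "metadata" / "station_name_1.yml").touch()

    missing = find_stations_without_metadata(campaign_dir)
    print("Stations without metadata:", missing)
```

An empty list means that each station has its metadata file in place.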


## Step 2: Read and analyse the data

Once the dataset and metadata are set up in the correct directory structure, we can start analysing our data. 

The objective of Step 2 is to define the specifications to read the raw data into a dataframe and ensure that the dataframe columns match the DISDRODB standards.

At the end, you should be able to generate Apache Parquet files from your input raw data. 


--------------------------------------------------------------------
Here we load the modules and packages required. *Nothing must be changed here*. 

In [1]:
import os
import sys
import logging
import pandas as pd

# Add project root folder into sys path
root_path = os.path.dirname(os.path.dirname(os.path.dirname(os.getcwd())))
sys.path.insert(0, root_path)


# Directory
from disdrodb.L0.io import (
    check_directories,
    get_campaign_name,
    create_directory_structure,
)


# Tools to develop the reader
from disdrodb.L0.template_tools import (
    check_column_names,
    infer_df_str_column_names,
    print_df_first_n_rows,
    print_df_random_n_rows,
    print_df_column_names,
    print_valid_L0_column_names,
    get_df_columns_unique_values_dict,
    print_df_columns_unique_values,
    print_df_summary_stats,
)

# L0A processing
from disdrodb.L0.L0A_processing import (
    read_raw_data,
    get_file_list,
    read_raw_file_list,
    cast_column_dtypes,
    write_df_to_parquet,
)

# L0B processing
from disdrodb.L0.L0B_processing import (
    retrieve_L0B_arrays,
    create_L0B_from_L0A,
    set_encodings,
)

# Metadata
from disdrodb.L0.metadata import read_metadata

# Standards
from disdrodb.L0.check_standards import check_sensor_name, check_L0A_column_names

# Logger
from disdrodb.utils.logger import create_logger

**1. Define paths and running parameters**

In the following section, define the raw and processed directory paths. *Change them if your data are stored in another folder*.

NB: In a real use case, `DATA_SOURCE` and `CAMPAIGN_NAME` should be replaced by meaningful names!

In [2]:
raw_dir = os.path.join(
    root_path, "data", "DISDRODB", "Raw", "DATA_SOURCE", "CAMPAIGN_NAME"
)  # Must end with campaign_name upper case
processed_dir = os.path.join(
    root_path, "data", "DISDRODB", "Processed", "DATA_SOURCE", "CAMPAIGN_NAME"
)  # Must end with campaign_name upper case

assert os.path.exists(raw_dir), "Raw directory does not exist"

Then we define the reader execution parameters. Once the new reader is created, these parameters become the arguments of the reader function. Please have a look [at the documentation](https://disdrodb.readthedocs.io/en/latest/readers.html#runing-a-reader) for a full description. 

In [3]:
force = True
parallel = False
verbose = True
debugging_mode = True
sensor_name = "OTT_Parsivel"

**2. Initialization**

We run some initial checks and retrieve some variables. *Nothing must be changed here.*

In [4]:
# Initial directory checks
raw_dir, processed_dir = check_directories(raw_dir, processed_dir, force=force)


# Retrieve campaign name
campaign_name = get_campaign_name(raw_dir)

# -------------------------------------------------------------------------.
# Define logging settings
create_logger(processed_dir, "reader_" + campaign_name)

# Retrieve logger
logger = logging.getLogger(campaign_name)
logger.info("### Script start ###")

# -------------------------------------------------------------------------.
# Create directory structure
create_directory_structure(raw_dir, processed_dir)

# -------------------------------------------------------------------------.
# List stations
list_station_name = os.listdir(os.path.join(raw_dir, "data"))
print(list_station_name)

['station_name_1', 'station_name_2']


Please make sure to run the cell above only once. If it is run several times, the log file blocks the folder creation.

**3. Selection of the station**

In this example, we read only one station. Feel free to change the station name :) 
The current example has only two stations, so the list index can be either 0 or 1.

In [5]:
station_name = list_station_name[0]
print(station_name)

station_name_1


**4. Get the list of files to process**

We now list all the files of the selected station.
Here we need to specify the [glob pattern](https://en.wikipedia.org/wiki/Glob_(programming)) that selects all the relevant data files. 
Since the files in this case study are named like `file<XXX>_<TIME>.dat.gz`, we define the glob pattern `"*.dat*"`. Note that `"*.dat.gz"` or `"file*.dat.gz"` would also have worked.


In [6]:
glob_pattern = os.path.join("data", station_name, "*.dat*")

file_list = get_file_list(
    raw_dir=raw_dir,
    glob_pattern=glob_pattern,
    verbose=verbose,
    debugging_mode=debugging_mode,
)

print(file_list)

 - 2 files to process in C:\projects\disdrodb-fork\data\DISDRODB\Raw\DATA_SOURCE\CAMPAIGN_NAME
['C:\\projects\\disdrodb-fork\\data\\DISDRODB\\Raw\\DATA_SOURCE\\CAMPAIGN_NAME\\data\\station_name_1\\file60_20180817.dat.gz', 'C:\\projects\\disdrodb-fork\\data\\DISDRODB\\Raw\\DATA_SOURCE\\CAMPAIGN_NAME\\data\\station_name_1\\file60_20180818.dat.gz']


🚨 The `glob_pattern` variable definition will be transferred into the [`reader_template.py`](https://github.com/ltelab/disdrodb/blob/main/disdrodb/L0/readers/reader_template.py) file at the end of this notebook.

Remember that the `glob_pattern` variable depends on the file extensions of your dataset !!!
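To see why several patterns work here, the short sketch below tests the three patterns mentioned above against the sample file names with the standard-library `fnmatch` module, which applies the same wildcard rules that `glob` uses for file names. The extra `station_name_1.yml` entry is added only to show a name that does not match.

```python
from fnmatch import fnmatchcase

# Sample file names from this tutorial, plus one name that must not match
filenames = [
    "file60_20180817.dat.gz",
    "file60_20180818.dat.gz",
    "station_name_1.yml",
]
patterns = ["*.dat*", "*.dat.gz", "file*.dat.gz"]

# For each pattern, keep the file names it selects
matches = {
    pattern: [name for name in filenames if fnmatchcase(name, pattern)]
    for pattern in patterns
}

for pattern, selected in matches.items():
    print(f"{pattern!r:>16} -> {selected}")
```

All three patterns select the two `.dat.gz` files and skip the YAML file. A stricter pattern such as `"file*.dat.gz"` is safer if the station folder may contain other files whose names include `.dat`.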

**5. Retrieve metadata from YAML files**

We now load the metadata file of the station.

If the name of the station is not correctly defined, an error message is raised.

In [7]:
# Retrieve metadata
attrs = read_metadata(campaign_dir=raw_dir, station_name=station_name)

# Retrieve sensor name
sensor_name = attrs["sensor_name"]
check_sensor_name(sensor_name)

**5. Load one file into a dataframe**

In the  `reader_kwargs` dictionary, you may set [any arguments](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) that need to be passed to read the raw text file into a `pandas.DataFrame`.

In [8]:
reader_kwargs = {}

# - Define delimiter
reader_kwargs["delimiter"] = ","

# - Avoid first column to become df index !!!
reader_kwargs["index_col"] = False

# Since column names are expected to be passed explicitly, header is set to None
reader_kwargs["header"] = None

# - Number of rows to be skipped at the beginning of the file
reader_kwargs["skiprows"] = None

# - Define behaviour when encountering bad lines
reader_kwargs["on_bad_lines"] = "skip"

# - Define reader engine
#   - C engine is faster
#   - Python engine is more feature-complete
reader_kwargs["engine"] = "python"

# - Define on-the-fly decompression of on-disk data
#   - Available: gzip, bz2, zip
reader_kwargs["compression"] = "infer"

# - Strings to recognize as NA/NaN and replace with standard NA flags
#   - Already included: ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’,
#                       ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’,
#                       ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’
reader_kwargs["na_values"] = ["na", "", "error"]

# -----------------------------------------------------------
# Select first file
filepath = file_list[0]

# Try to read the raw file
df_raw = read_raw_data(
    filepath,
    column_names=None,
    reader_kwargs=reader_kwargs,
)
# Print the dataframe
print(f"Dataframe for the file {os.path.basename(filepath)} :")
display(df_raw)

Dataframe for the file file60_20180817.dat.gz :


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
0,362511,4612.0301,00847.4977,01-08-2018 12:44:30,,OK,0000.000,0056.49,00,00,...,035,0.06,24.9,0,005.649,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
1,362512,4612.0301,00847.4978,01-08-2018 12:45:01,,OK,0000.000,0056.49,00,00,...,035,0.06,24.9,0,005.649,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
2,362513,4612.0301,00847.4985,01-08-2018 12:45:30,,OK,0000.000,0056.49,00,00,...,035,0.06,24.9,0,005.649,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
3,362514,4612.0305,00847.4990,01-08-2018 12:46:01,,OK,0000.000,0056.49,00,00,...,035,0.05,24.9,0,005.649,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
4,362515,4612.0303,00847.4992,01-08-2018 12:46:31,,OK,0000.000,0056.49,00,00,...,034,0.06,24.9,0,005.649,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4736,367249,4612.0313,00847.4956,03-08-2018 04:13:25,,OK,0000.000,0056.71,00,00,...,015,0.06,24.9,0,005.671,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
4737,367250,4612.0313,00847.4955,03-08-2018 04:13:56,,OK,0000.000,0056.71,00,00,...,015,0.06,24.9,0,005.671,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
4738,367251,4612.0313,00847.4955,03-08-2018 04:14:26,,OK,0000.000,0056.71,00,00,...,015,0.06,24.9,0,005.671,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
4739,367252,4612.0313,00847.4954,03-08-2018 04:14:55,,OK,0000.000,0056.71,00,00,...,015,0.06,24.9,0,005.671,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0


In [9]:
print("Column names:", df_raw.columns)
print("Row Index:", df_raw.index)

Column names: Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23],
           dtype='int64')
Row Index: RangeIndex(start=0, stop=4741, step=1)


Here we expect `df_raw` to have: 
- numeric column names (i.e. `Int64Index`) 
- a numeric row index (i.e. `RangeIndex`)  

If the structure of the dataframe looks fine (no header and no row index), we are on the right track! 


Depending on the schema of your data, your `reader_kwargs` dictionary may differ considerably from the one above. 
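To experiment with these options outside DISDRODB, the same keyword arguments can be passed directly to `pandas.read_csv`. The sketch below is an illustration under assumptions: the two records are made up, `dtype=str` is added here to keep values such as `0000.000` as text, and the way `read_raw_data` forwards `reader_kwargs` to pandas internally is not shown in this notebook.

```python
import io

import pandas as pd

# Two made-up records in a comma-separated layout without header line
raw_text = (
    "362511,01-08-2018 12:44:30,OK,0000.000\n"
    "362512,01-08-2018 12:45:01,error,0000.056\n"
)

reader_kwargs = {
    "delimiter": ",",
    "index_col": False,  # do not use the first column as row index
    "header": None,  # the file has no header line
    "skiprows": None,
    "on_bad_lines": "skip",
    "engine": "python",
    "na_values": ["na", "", "error"],
}

# dtype=str is an assumption of this sketch (keeps leading zeros intact)
df_demo = pd.read_csv(io.StringIO(raw_text), dtype=str, **reader_kwargs)

print(df_demo.columns.tolist())  # integer column names
print(type(df_demo.index).__name__)  # default row index
print(df_demo)
```

The string `error` in the second record is read as a missing value because it is listed in `na_values`.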

> 🚨 The `reader_kwargs` dictionary will be transferred to the [`reader_template.py`](https://github.com/ltelab/disdrodb/blob/main/disdrodb/L0/readers/reader_template.py) file at the end of this notebook. 

**6. Data exploration**

The settings for loading the data are now ready. We can load one file and analyse its content to check for errors or inconsistencies.

Here are some instructions:

* Do not assign column names to the dataframe columns yet
* Do not assign a dtype to the dataframe columns yet
* Possibly look at multiple files ;)


We print the content of the first 3 rows:
 (*Feel free to change the value of n to see more or fewer rows*)

In [10]:
print_df_first_n_rows(df_raw, n=3, column_names=False)

 - Column 0 :
      ['362511' '362512' '362513']
 - Column 1 :
      ['4612.0301' '4612.0301' '4612.0301']
 - Column 2 :
      ['00847.4977' '00847.4978' '00847.4985']
 - Column 3 :
      ['01-08-2018 12:44:30' '01-08-2018 12:45:01' '01-08-2018 12:45:30']
 - Column 4 :
      [nan nan nan]
 - Column 5 :
      ['OK' 'OK' 'OK']
 - Column 6 :
      ['0000.000' '0000.000' '0000.000']
 - Column 7 :
      ['0056.49' '0056.49' '0056.49']
 - Column 8 :
      ['00' '00' '00']
 - Column 9 :
      ['00' '00' '00']
 - Column 10 :
      ['-9.999' '-9.999' '-9.999']
 - Column 11 :
      ['9999' '9999' '9999']
 - Column 12 :
      ['12611' '12617' '12600']
 - Column 13 :
      ['00000' '00000' '00000']
 - Column 14 :
      ['035' '035' '035']
 - Column 15 :
      ['0.06' '0.06' '0.06']
 - Column 16 :
      ['24.9' '24.9' '24.9']
 - Column 17 :
      ['0' '0' '0']
 - Column 18 :
      ['005.649' '005.649' '005.649']
 - Column 19 :
      ['000' '000' '000']
 - Column 20 :
      ['-9.999,-9.999,-9.999,-9

In [11]:
df_raw.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
0,362511,4612.0301,847.4977,01-08-2018 12:44:30,,OK,0.0,56.49,0,0,...,35,0.06,24.9,0,5.649,0,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
1,362512,4612.0301,847.4978,01-08-2018 12:45:01,,OK,0.0,56.49,0,0,...,35,0.06,24.9,0,5.649,0,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
2,362513,4612.0301,847.4985,01-08-2018 12:45:30,,OK,0.0,56.49,0,0,...,35,0.06,24.9,0,5.649,0,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0


We print the content of n rows picked randomly : 

In [12]:
print_df_random_n_rows(df_raw, n=6, with_column_names=False)

- Column 0 : ['366418' '363107' '365120' '363956' '362576' '363091']
- Column 1 : ['4612.0318' '4612.0307' '4612.0321' '4612.0278' '4612.0315' '4612.0301']
- Column 2 : ['00847.4949' '00847.4955' '00847.4983' '00847.4952' '00847.4976'
 '00847.4953']
- Column 3 : ['02-08-2018 21:18:01' '01-08-2018 17:42:31' '02-08-2018 10:29:01'
 '02-08-2018 00:47:00' '01-08-2018 13:17:01' '01-08-2018 17:34:31']
- Column 4 : [nan nan nan nan nan nan]
- Column 5 : ['OK' 'OK' 'OK' 'OK' 'OK' 'OK']
- Column 6 : ['0000.000' '0000.056' '0000.000' '0000.000' '0000.000' '0000.545']
- Column 7 : ['0056.71' '0056.65' '0056.67' '0056.67' '0056.49' '0056.63']
- Column 8 : ['00' '57' '00' '00' '00' '62']
- Column 9 : ['00' '58' '00' '00' '00' '63']
- Column 10 : ['-9.999' '04.826' '-9.999' '-9.999' '-9.999' '17.382']
- Column 11 : ['9999' '9999' '9999' '9999' '9999' '9999']
- Column 12 : ['12246' '12414' '12577' '12538' '12646' '12411']
- Column 13 : ['00000' '00019' '00000' '00000' '00000' '00071']
- Column 14 : ['

Get the number of columns:

In [13]:
len(df_raw.columns)

24

Look at unique values for a single column :

In [14]:
print_df_columns_unique_values(df_raw, column_indices=11, column_names=False)

 - Column 11 :
      ['0824', '0906', '1363', '1397', '2921', '3203', '3326', '3816', '4465', '9999']


Look at unique values for a few columns :

Note: Use `column_indices=None` to get the unique values for all columns

In [15]:
print_df_columns_unique_values(df_raw, column_indices=slice(10, 12), column_names=False)

 - Column 10 :
      ['-9.999', '02.669', '04.241', '04.745', '04.826', '04.879', '05.430', '06.095', '06.220', '07.415', '08.436', '08.489', '08.506', '08.724', '08.956', '09.079', '09.894', '10.057', '10.567', '11.705', '12.097', '12.390', '12.923', '13.114', '13.407', '13.684', '14.324', '15.060', '16.530', '16.636', '16.668', '17.194', '17.382', '17.829', '17.918', '18.334', '18.655', '19.526', '20.329', '21.134', '21.426', '23.098', '23.664', '23.760', '24.472', '25.473', '25.957', '29.270', '31.271', '32.255', '33.844', '36.196']
 - Column 11 :
      ['0824', '0906', '1363', '1397', '2921', '3203', '3326', '3816', '4465', '9999']


Get the unique values as a dictionary:

In [16]:
get_df_columns_unique_values_dict(
    df_raw, column_indices=slice(10, 12), column_names=False
)

{'Column 10': ['-9.999',
  '02.669',
  '04.241',
  '04.745',
  '04.826',
  '04.879',
  '05.430',
  '06.095',
  '06.220',
  '07.415',
  '08.436',
  '08.489',
  '08.506',
  '08.724',
  '08.956',
  '09.079',
  '09.894',
  '10.057',
  '10.567',
  '11.705',
  '12.097',
  '12.390',
  '12.923',
  '13.114',
  '13.407',
  '13.684',
  '14.324',
  '15.060',
  '16.530',
  '16.636',
  '16.668',
  '17.194',
  '17.382',
  '17.829',
  '17.918',
  '18.334',
  '18.655',
  '19.526',
  '20.329',
  '21.134',
  '21.426',
  '23.098',
  '23.664',
  '23.760',
  '24.472',
  '25.473',
  '25.957',
  '29.270',
  '31.271',
  '32.255',
  '33.844',
  '36.196'],
 'Column 11': ['0824',
  '0906',
  '1363',
  '1397',
  '2921',
  '3203',
  '3326',
  '3816',
  '4465',
  '9999']}

**7. Column names**

Now that we have validated the content of our data, it is time to take care of its structure (column names). 

The function `infer_df_str_column_names()` tries to guess the column names from string patterns, based on `L0A_encodings.yml` and the sensor type.

In [17]:
infer_df_str_column_names(df_raw, sensor_name=sensor_name)

{0: [],
 1: [],
 2: [],
 3: [],
 4: [],
 5: [],
 6: ['rainfall_rate_32bit'],
 7: ['rainfall_accumulated_32bit'],
 8: ['weather_code_synop_4680', 'weather_code_synop_4677'],
 9: ['weather_code_synop_4680', 'weather_code_synop_4677'],
 10: ['reflectivity_32bit'],
 11: ['mor_visibility'],
 12: ['number_particles', 'laser_amplitude'],
 13: ['number_particles', 'laser_amplitude'],
 14: ['sensor_temperature', 'error_code'],
 15: ['sensor_heating_current'],
 16: ['sensor_battery_voltage'],
 17: ['sensor_status'],
 18: ['rainfall_amount_absolute_32bit'],
 19: ['sensor_temperature', 'error_code'],
 20: ['raw_drop_average_velocity', 'raw_drop_concentration'],
 21: ['raw_drop_average_velocity', 'raw_drop_concentration'],
 22: ['raw_drop_number'],
 23: ['sensor_status']}

This will help us define the `column_names` list later.
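The idea behind this kind of inference can be sketched with regular expressions: a name is proposed for a column when every value of the column matches the string pattern expected for that variable. The patterns below are hypothetical and chosen only to fit the values printed earlier; they are not read from `L0A_encodings.yml`, and `guess_column_names` is not a DISDRODB function.

```python
import re

# Hypothetical patterns for illustration only (not taken from L0A_encodings.yml)
HYPOTHETICAL_PATTERNS = {
    "rainfall_rate_32bit": r"\d{4}\.\d{3}",
    "rainfall_accumulated_32bit": r"\d{4}\.\d{2}",
    "sensor_heating_current": r"\d\.\d{2}",
}


def guess_column_names(values, patterns=HYPOTHETICAL_PATTERNS):
    """Return the names whose pattern matches every value of the column."""
    return [
        name
        for name, pattern in patterns.items()
        if all(re.fullmatch(pattern, value) for value in values)
    ]


guess_rate = guess_column_names(["0000.000", "0000.056", "0000.545"])
guess_accumulated = guess_column_names(["0056.49", "0056.71"])
guess_status = guess_column_names(["OK", "OK"])

print(guess_rate)
print(guess_accumulated)
print(guess_status)
```

A column can match several candidates or none, which is why the inferred names are only a starting point and must be confirmed with the sensor documentation.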

For reference, here is the list of valid column names (taken from `L0A_encodings.yml`):

In [18]:
print_valid_L0_column_names(sensor_name)

['rainfall_rate_32bit', 'rainfall_accumulated_32bit', 'weather_code_synop_4680', 'weather_code_synop_4677', 'weather_code_metar_4678', 'weather_code_nws', 'reflectivity_32bit', 'mor_visibility', 'sample_interval', 'laser_amplitude', 'number_particles', 'sensor_temperature', 'sensor_serial_number', 'firmware_iop', 'firmware_dsp', 'sensor_heating_current', 'sensor_battery_voltage', 'sensor_status', 'start_time', 'sensor_time', 'sensor_date', 'station_name', 'station_number', 'rainfall_amount_absolute_32bit', 'error_code', 'rainfall_rate_16bit', 'rainfall_rate_12bit', 'rainfall_accumulated_16bit', 'reflectivity_16bit', 'raw_drop_concentration', 'raw_drop_average_velocity', 'raw_drop_number']


It is now time to define our column names: 

Hints to define the names:
* get information from the user guides of the disdrometer and of the data logger employed. 
* use `infer_df_str_column_names()` to help you
* analyse the content column by column with `print_df_columns_unique_values()`  

In [19]:
column_names = [
    "unknown1",
    "unknown2",
    "unknown3",
    "timestep",
    "unknown4",
    "unknown5",
    "rainfall_rate_32bit",
    "rainfall_accumulated_32bit",
    "weather_code_synop_4680",
    "weather_code_synop_4677",
    "reflectivity_32bit",
    "mor_visibility",
    "laser_amplitude",
    "number_particles",
    "sensor_temperature",
    "sensor_heating_current",
    "sensor_battery_voltage",
    "sensor_status",
    "rainfall_amount_absolute_32bit",
    "error_code",
    "raw_drop_concentration",
    "raw_drop_average_velocity",
    "raw_drop_number",
    "unknown6",
]

> 🚨 The `column_names` list will be transferred  to the [reader_template.py](https://github.com/ltelab/disdrodb/blob/main/disdrodb/L0/readers/reader_template.py) file at the end of this notebook. 

Check the validity of your definition 

In [20]:
check_column_names(column_names, sensor_name)

The following columns do no met the DISDRODB standards: ['unknown5', 'unknown4', 'unknown2', 'unknown1', 'unknown6', 'timestep', 'unknown3'].
Please remove such columns within the df_sanitizer_fun
Please be sure to create the 'time' column within the df_sanitizer_fun.
The 'time' column must be datetime with resolution in seconds (dtype='M8[s]').


Ok, fair enough.
Some columns need to be removed, and we also need to define a `time` column with a `datetime` dtype to meet the DISDRODB standards.

These points are addressed in Section 10 of this notebook! 
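Conceptually, the check above is a membership test of each proposed name against the list of valid names. The sketch below reproduces that idea with plain Python, using the valid names printed earlier for this sensor. It is not the DISDRODB implementation, and the message format of `check_column_names` is not reproduced.

```python
# Valid names as printed by print_valid_L0_column_names() for this sensor
valid_names = {
    "rainfall_rate_32bit", "rainfall_accumulated_32bit",
    "weather_code_synop_4680", "weather_code_synop_4677",
    "weather_code_metar_4678", "weather_code_nws", "reflectivity_32bit",
    "mor_visibility", "sample_interval", "laser_amplitude", "number_particles",
    "sensor_temperature", "sensor_serial_number", "firmware_iop", "firmware_dsp",
    "sensor_heating_current", "sensor_battery_voltage", "sensor_status",
    "start_time", "sensor_time", "sensor_date", "station_name", "station_number",
    "rainfall_amount_absolute_32bit", "error_code", "rainfall_rate_16bit",
    "rainfall_rate_12bit", "rainfall_accumulated_16bit", "reflectivity_16bit",
    "raw_drop_concentration", "raw_drop_average_velocity", "raw_drop_number",
}

# Names proposed in this notebook
column_names = [
    "unknown1", "unknown2", "unknown3", "timestep", "unknown4", "unknown5",
    "rainfall_rate_32bit", "rainfall_accumulated_32bit",
    "weather_code_synop_4680", "weather_code_synop_4677", "reflectivity_32bit",
    "mor_visibility", "laser_amplitude", "number_particles",
    "sensor_temperature", "sensor_heating_current", "sensor_battery_voltage",
    "sensor_status", "rainfall_amount_absolute_32bit", "error_code",
    "raw_drop_concentration", "raw_drop_average_velocity", "raw_drop_number",
    "unknown6",
]

# Names that must be dropped or converted in the sanitizer function
invalid_names = [name for name in column_names if name not in valid_names]
print(invalid_names)
```

The resulting list contains the same seven names reported in the message above: the six `unknown` columns and `timestep`.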

**8. Read the dataframe with the correct column names**

We can now create a new dataframe with the column names:

In [21]:
df = read_raw_data(
    filepath=filepath, column_names=column_names, reader_kwargs=reader_kwargs
)

And print the dataframe column names : 

In [22]:
print_df_column_names(df)

 - Column 0 : unknown1
 - Column 1 : unknown2
 - Column 2 : unknown3
 - Column 3 : timestep
 - Column 4 : unknown4
 - Column 5 : unknown5
 - Column 6 : rainfall_rate_32bit
 - Column 7 : rainfall_accumulated_32bit
 - Column 8 : weather_code_synop_4680
 - Column 9 : weather_code_synop_4677
 - Column 10 : reflectivity_32bit
 - Column 11 : mor_visibility
 - Column 12 : laser_amplitude
 - Column 13 : number_particles
 - Column 14 : sensor_temperature
 - Column 15 : sensor_heating_current
 - Column 16 : sensor_battery_voltage
 - Column 17 : sensor_status
 - Column 18 : rainfall_amount_absolute_32bit
 - Column 19 : error_code
 - Column 20 : raw_drop_concentration
 - Column 21 : raw_drop_average_velocity
 - Column 22 : raw_drop_number
 - Column 23 : unknown6


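Check that lazy loading (dask) gives the correct result:

In [23]:
df_dask = read_raw_data(
    filepath=filepath, column_names=column_names, reader_kwargs=reader_kwargs
)
# compute() exists only on dask dataframes; skip it if a pandas dataframe is returned
if hasattr(df_dask, "compute"):
    df_dask = df_dask.compute()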

And print the dataframe column names : 

In [24]:
print_df_column_names(df_dask)

 - Column 0 : unknown1
 - Column 1 : unknown2
 - Column 2 : unknown3
 - Column 3 : timestep
 - Column 4 : unknown4
 - Column 5 : unknown5
 - Column 6 : rainfall_rate_32bit
 - Column 7 : rainfall_accumulated_32bit
 - Column 8 : weather_code_synop_4680
 - Column 9 : weather_code_synop_4677
 - Column 10 : reflectivity_32bit
 - Column 11 : mor_visibility
 - Column 12 : laser_amplitude
 - Column 13 : number_particles
 - Column 14 : sensor_temperature
 - Column 15 : sensor_heating_current
 - Column 16 : sensor_battery_voltage
 - Column 17 : sensor_status
 - Column 18 : rainfall_amount_absolute_32bit
 - Column 19 : error_code
 - Column 20 : raw_drop_concentration
 - Column 21 : raw_drop_average_velocity
 - Column 22 : raw_drop_number
 - Column 23 : unknown6


We can now verify that the pandas and dask readings are identical:

In [25]:
assert df.equals(df_dask)

**9. Perform further tests and analysis to check the correctness of `column_names`**

You can for example check some statistics for a specific column.

In [26]:
column_name = "rainfall_rate_32bit"
array_of_values = df.loc[:, [column_name]].astype("float")
print_df_summary_stats(array_of_values)

 - Column 0 ( rainfall_rate_32bit ):
                    
mean  0.005426
min   0.000000
25%   0.000000
50%   0.000000
75%   0.000000
max   2.881000


**10. Final columns formatting**

In [27]:
check_L0A_column_names(df, sensor_name=sensor_name)

ValueError: The following columns do no met the DISDRODB standards: ['unknown5', 'unknown4', 'unknown2', 'unknown1', 'unknown6', 'timestep', 'unknown3']

In [None]:
check_column_names(column_names, sensor_name)

Now it is time to remove all the columns that do not match the DISDRODB standard.

In [None]:
df = df.drop(
    columns=["unknown1", "unknown2", "unknown3", "unknown4", "unknown5", "unknown6"]
)

It is also time to define the `time` column, which is required by the DISDRODB standard.

In [None]:
# Timestamps are day-first (e.g. "03-08-2018" is 3 August 2018)
df["time"] = pd.to_datetime(df["timestep"], format="%d-%m-%Y %H:%M:%S")
df = df.drop(columns=["timestep"])
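Because strings such as `01-08-2018` parse without error both as day-first and as month-first, it is worth checking the field order before trusting the `format` string. The sketch below uses only the standard library and the first and last timestamps displayed earlier. It assumes that the records are logged roughly every 30 seconds, as the displayed rows suggest.

```python
from datetime import datetime

# First and last timestamps shown in the dataframe preview above
first, last = "01-08-2018 12:44:30", "03-08-2018 04:14:55"


def time_span(fmt):
    """Duration between the first and last record for a given format string."""
    return datetime.strptime(last, fmt) - datetime.strptime(first, fmt)


day_first_span = time_span("%d-%m-%Y %H:%M:%S")
month_first_span = time_span("%m-%d-%Y %H:%M:%S")

print("day-first  :", day_first_span)
print("month-first:", month_first_span)
```

Read day-first, the file covers about 39.5 hours, which is consistent with the 4741 rows of the file at a sampling interval of about 30 seconds. Read month-first, the same rows would span almost two months, so the day-first format `"%d-%m-%Y %H:%M:%S"` is the plausible one for this dataset.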

Check that the column names meet the DISDRODB standards after custom processing:

In [None]:
check_L0A_column_names(df, sensor_name=sensor_name)

Check the dataframe looks as desired :

In [None]:
print_df_column_names(df)

In [None]:
print_df_random_n_rows(df, n=5)

In [None]:
print_df_columns_unique_values(df, column_indices=2, column_names=True)

**11. Define the dataframe sanitizer function**

The `df_sanitizer_fun` encapsulates the code, specific to each reader/dataset, that is required to obtain a dataframe compliant with the DISDRODB standards.

With the data used in this notebook, we need to drop some columns and define the `time` column! 

From the code defined in Section 10, we define the following function: 

In [None]:
def df_sanitizer_fun(df):
    # Import pandas
    import pandas as pd

    # - Drop invalid columns
    columns_to_drop = [
        "unknown1",
        "unknown2",
        "unknown3",
        "unknown4",
        "unknown5",
        "unknown6",
    ]

    df = df.drop(columns=columns_to_drop)

    # - Convert timestep column to datetime format (timestamps are day-first)
    df["time"] = pd.to_datetime(df["timestep"], format="%d-%m-%Y %H:%M:%S")
    df = df.drop(columns=["timestep"])

    # - Return the dataframe
    return df
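A sanitizer of this kind can be tried on a tiny hand-made dataframe before running it on real files. The sketch below is self-contained: `df_sanitizer_sketch` is a stand-in written for this example (it drops every column whose name starts with `unknown` and assumes day-first timestamps), and the toy values are made up.

```python
import pandas as pd


def df_sanitizer_sketch(df):
    """Stand-in sanitizer: drop 'unknown*' columns and build the 'time' column."""
    df = df.drop(columns=[c for c in df.columns if c.startswith("unknown")])
    # Assumption: timestamps are day-first (day-month-year)
    df["time"] = pd.to_datetime(df["timestep"], format="%d-%m-%Y %H:%M:%S")
    return df.drop(columns=["timestep"])


# Toy raw dataframe with string values, as obtained before any dtype casting
df_toy = pd.DataFrame(
    {
        "unknown1": ["362511", "362512"],
        "timestep": ["01-08-2018 12:44:30", "01-08-2018 12:45:01"],
        "rainfall_rate_32bit": ["0000.000", "0000.056"],
    }
)

df_clean = df_sanitizer_sketch(df_toy)
print(df_clean.dtypes)
print(df_clean)
```

The input dataframe is left untouched because `drop` returns a new object, so the sanitizer can be applied repeatedly while experimenting.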

> 🚨 The `df_sanitizer_fun()` function will be transferred to the [reader_template.py](https://github.com/ltelab/disdrodb/blob/main/disdrodb/L0/readers/reader_template.py) file at the end of this notebook. 

**12. Now let's try calling the reader function as it will be called in the DISDRODB L0 reader**

* You may try with increasing number of files (update `file_list`)

Here we combine all raw files in a single dataframe. 

The function `read_raw_file_list` takes the following arguments:
* `file_list`: the list of files present in the specified station directory
* `column_names`: the list of column names (defined previously)
* `reader_kwargs`: the dictionary of arguments used to load the data into the dataframe (defined previously)
* `sensor_name`: taken from the `sensor_name` key in the metadata YAML file of the station
* `df_sanitizer_fun`: the function that sanitizes the dataframe (defined previously)

All these arguments are derived either from the data directory structure or from the code above.

In [None]:
# Select a subset of files (extend the slice to test more files)
subset_file_list = file_list[:1]

df = read_raw_file_list(
    file_list=subset_file_list,
    column_names=column_names,
    reader_kwargs=reader_kwargs,
    sensor_name=sensor_name,
    verbose=verbose,
    df_sanitizer_fun=df_sanitizer_fun,
)
display(df)

Here we derive the corresponding xr.Dataset object 

In [None]:
ds = create_L0B_from_L0A(df, attrs, verbose=False)
print(ds)

which can be saved as DISDRODB L0B netCDF by running the following code:

In [None]:
# ds = set_encodings(ds, sensor_name)
# ds.to_netcdf("/path/where/to/save/the/file.nc")

## Step 3 : Create the reader

We now have all the elements to start creating the new reader. 
All the modifications made in this notebook must now be transcribed into a DISDRODB L0 reader file.

1. Copy the [`disdrodb\L0\readers\reader_template.py`](https://github.com/ltelab/disdrodb/tree/main/disdrodb/L0/readers) file into the folder `disdrodb\L0\readers\DATA_SOURCE`

2. Rename the copied file to `CAMPAIGN_NAME.py`

3. Within the file, update the portions of code described in points 4, 5 and 6 below.

---------------------------------------------------------------------------------------


4. **Update the `column_names` list**

   Before :

    ``` python
        column_names = []
    ```
    
    After : 

    ``` python
        column_names = [
            "unknown1",
            "unknown2",
            "unknown3",
            "timestep",
            "unknown4",
            "unknown5",
            "rainfall_rate_32bit",
            "rainfall_accumulated_32bit",
            "weather_code_synop_4680",
            "weather_code_synop_4677",
            "reflectivity_32bit",
            "mor_visibility",
            "laser_amplitude",
            "number_particles",
            "sensor_temperature",
            "sensor_heating_current",
            "sensor_battery_voltage",
            "sensor_status",
            "rainfall_amount_absolute_32bit",
            "error_code",
            "raw_drop_concentration",
            "raw_drop_average_velocity",
            "raw_drop_number",
            "unknown6",
        ]
    ```

---------------------------------------------------------------------------------------
5. **Update the `reader_kwargs` dictionary**

    Before :

    ``` python
        reader_kwargs = {}

    ```
    
    After : 

    ``` python
        reader_kwargs = {}

        # - Define delimiter
        reader_kwargs["delimiter"] = ","

        # - Avoid first column to become df index !!!
        reader_kwargs["index_col"] = False

        # Since column names are expected to be passed explicitly, header is set to None
        reader_kwargs["header"] = None

        # - Number of rows to be skipped at the beginning of the file
        reader_kwargs["skiprows"] = None

        # - Define behaviour when encountering bad lines
        reader_kwargs["on_bad_lines"] = "skip"

        # - Define reader engine
        #   - C engine is faster
        #   - Python engine is more feature-complete
        reader_kwargs["engine"] = "python"

        # - Define on-the-fly decompression of on-disk data
        #   - Available: gzip, bz2, zip
        reader_kwargs["compression"] = "infer"

        # - Strings to recognize as NA/NaN and replace with standard NA flags
        #   - Already included: ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’,
        #                       ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’,
        #                       ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’
        reader_kwargs["na_values"] = ["na", "", "error"]

    ```

---------------------------------------------------------------------------------------
6. **Update the `df_sanitizer_fun()` function**

   Before:

    ``` python

        def df_sanitizer_fun(df):
            # - Import pandas
            import pandas as pd
                    
            # - Add here below the reader required custom code
            pass
        
            # - Return the dataframe 
            return df
        
    ```
    
    After : 

    ``` python
        def df_sanitizer_fun(df):
            # - Import pandas
            import pandas as pd

            # - Drop invalid columns
            columns_to_drop = ["unknown1", "unknown2", "unknown3", "unknown4", "unknown5", "unknown6"]
            df = df.drop(columns=columns_to_drop)

            # - Convert timestep column to datetime format (timestamps are day-first)
            df["time"] = pd.to_datetime(df["timestep"], format="%d-%m-%Y %H:%M:%S")
            df = df.drop(columns=["timestep"])
            
            # - Return the dataframe 
            return df
        
    ```
 
 ---------------------------------------------------------------------------------------   

 
7. **Run the script**

Just run:

* Windows:

```
run_disdrodb_l0_reader <data_source> <campaign_name> <data_folder>\DISDRODB\Raw\DATA_SOURCE\CAMPAIGN_NAME\ <data_folder>\DISDRODB\Processed\DATA_SOURCE\CAMPAIGN_NAME\ -l0b True -f True -v True -d False
```

     
* Mac/Linux :

``` batch
run_disdrodb_l0_reader <data_source> <campaign_name> <data_folder>/DISDRODB/Raw/DATA_SOURCE/CAMPAIGN_NAME/ <data_folder>/DISDRODB/Processed/DATA_SOURCE/CAMPAIGN_NAME/ -l0b True -f True -v True -d False
```
 
You need to adapt the `<data_folder>` placeholder to your local data folder.
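To avoid typing the two long paths by hand, the command line can be assembled in Python. The sketch below only builds and prints the command shown above for Mac/Linux. The value of `data_folder` is a hypothetical example, and the flags are copied from the command above without further interpretation.

```python
import shlex
from pathlib import PurePosixPath

# Hypothetical local folder: replace it with your own data folder
data_folder = PurePosixPath("/home/user/data")
data_source = "DATA_SOURCE"
campaign_name = "CAMPAIGN_NAME"

# Raw and processed campaign directories follow the DISDRODB layout
raw_dir = data_folder / "DISDRODB" / "Raw" / data_source / campaign_name
processed_dir = data_folder / "DISDRODB" / "Processed" / data_source / campaign_name

command = [
    "run_disdrodb_l0_reader",
    data_source,
    campaign_name,
    str(raw_dir),
    str(processed_dir),
    "-l0b", "True",
    "-f", "True",
    "-v", "True",
    "-d", "False",
]

# shlex.join quotes the arguments if needed (e.g. paths containing spaces)
command_line = shlex.join(command)
print(command_line)
```

The printed line can be pasted into a terminal, or the `command` list can be passed to `subprocess.run` once the reader file is in place.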



Have a look [here](https://disdrodb.readthedocs.io/en/latest/readers.html#running-a-reader) for the full documentation on how to run a reader. 

 ---------------------------------------------------------------------------------------
 
8. **Check if the script has correctly executed**

    The output folder should look as follows:
    
    ```
    📁 DISDRODB
    ├── 📁 Processed
       ├── 📁 DATA_SOURCE
          ├── 📁 CAMPAIGN_NAME
              ├── 📁 info
                  ├── 📜 station_name_1.yml
                  ├── 📜 station_name_2.yml
              ├── 📁 L0A
                  ├── 📁 station_name_1
                     ├── 📜 *.parquet
                  ├── 📁 station_name_2
                     ├── 📜 *.parquet
              ├── 📁 L0B
                  ├── 📁 station_name_1
                     ├── 📜 *.nc
                  ├── 📁 station_name_2
                     ├── 📜 *.nc
              ├── 📁 logs
                  ├── 📜 <date>_LO_reader.log
              ├── 📁 metadata
                  ├── 📜 station_name_1.yml
                  ├── 📜 station_name_2.yml

    ```
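A quick way to compare your output with the tree above is to count the product files per level. The helper below is a minimal sketch with the standard library. `count_products` is a hypothetical name (not a DISDRODB function), and the demo builds an empty fake tree instead of running a reader.

```python
import tempfile
from pathlib import Path


def count_products(processed_dir):
    """Count Parquet files under L0A/<station> and netCDF files under L0B/<station>."""
    processed_dir = Path(processed_dir)
    return {
        "L0A": len(list((processed_dir / "L0A").glob("*/*.parquet"))),
        "L0B": len(list((processed_dir / "L0B").glob("*/*.nc"))),
    }


# Demo on a fake output tree with one product per station and level
with tempfile.TemporaryDirectory() as tmp:
    processed_dir = Path(tmp) / "DISDRODB" / "Processed" / "DATA_SOURCE" / "CAMPAIGN_NAME"
    for station in ["station_name_1", "station_name_2"]:
        (processed_dir / "L0A" / station).mkdir(parents=True)
        (processed_dir / "L0B" / station).mkdir(parents=True)
        (processed_dir / "L0A" / station / "demo.parquet").touch()
        (processed_dir / "L0B" / station / "demo.nc").touch()

    counts = count_products(processed_dir)
    print(counts)
```

A count of zero for a level suggests that the corresponding processing step did not write any file, in which case the log file in `logs` is the first place to look.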

---------------------------------------------------------------------------------------

Well done 👋👋👋 

You should now be able to create a new reader for your own data.
Please consider sharing the reader for your data with the community by uploading it to the DISDRODB repository.

Have a look at the [contributors guidelines](https://disdrodb.readthedocs.io/en/latest/contributors_guidelines.html) for more information and do not hesitate to open a [GitHub Issue](https://github.com/ltelab/disdrodb/issues) if you need any clarification. 

The DISDRODB team hopes you enjoyed this tutorial.