# Checksum Processing and Verification (adler32)

For the DIDs present in both the Rucio dumps and the SEAL dumps, we validate whether the checksums reported in the two dumps match.

## Time Range
We need a time range to filter the DIDs from the dumps. The end time is constrained because the SEAL dumps do not contain checksums for all DIDs after a certain date. We use the following time range for the analysis:

In [40]:
from datetime import datetime
date_start_str = '20220101' # YYYYMMDD
date_end_str = '20220801' # YYYYMMDD

date_start = datetime.strptime(date_start_str, '%Y%m%d')
date_end = datetime.strptime(date_end_str, '%Y%m%d')

# Checksum Processing

In this section, we will process the checksums from the Rucio and SEAL dumps to make them comparable.
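As an illustration of what "comparable" can mean for adler32 values, the sketch below computes a checksum with the standard library and normalizes hex strings to a single form. The helper names `adler32_hex` and `normalize_adler32` are hypothetical, and the 8-digit lowercase hex convention is an assumption about how the two dumps should be aligned, not something the dumps themselves confirm.

```python
import zlib


def adler32_hex(data: bytes) -> str:
    """Return the Adler-32 checksum of `data` as 8 lowercase hex digits."""
    # Mask to 32 bits so the result is always an unsigned value.
    return format(zlib.adler32(data) & 0xFFFFFFFF, "08x")


def normalize_adler32(value) -> str:
    """Normalize a hex checksum string (hypothetical helper).

    Strips whitespace and an optional '0x' prefix, lowercases the digits,
    and left-pads with zeros to 8 characters.
    """
    text = str(value).strip().lower()
    if text.startswith("0x"):
        text = text[2:]
    return text.zfill(8)
```

Reading checksum columns as strings (for example with `dtype=str` in `pd.read_csv`) avoids losing leading zeros before a normalization step like this one.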

## Checksums calculated by SEAL

The `data/seal/checksums/entries_{start_date}_{end_date}.csv` file contains the checksums (adler32, md5, sha256) calculated by SEAL for the DIDs available on their storage. 


In [63]:
import pandas as pd
from core.utils import bytesToTB
seal_checksums_dir = 'data/seal/checksums'  # avoid shadowing the built-in dir()
entries_file = f'{seal_checksums_dir}/entries_{date_start_str}_{date_end_str}.csv'

entries = pd.read_csv(entries_file)
entries['path'] = entries['path'].str.replace('rucio/', '')

total_size_entries = entries['size_bytes'].sum()

print(f'Num Entries: {len(entries)}')
print(f'Total Size: {bytesToTB(total_size_entries)} TB')

Num Entries: 516978
Total Size: 343.429 TB


The `data/seal/checksums/errors_{start_date}_{end_date}.csv` file contains the entries for which SEAL could not calculate the checksum, along with the specific errors that occurred.

- Most of the errors are 404 Not Found, which means SEAL has data for the file, but the last few chunks did not complete transfer.
- Entries with a 500 Internal Server Error are essentially the same as the 404 case, but in addition to the missing chunk(s), the metadata record could not be found.
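To see how the two error types split, the errors could be grouped by HTTP status code. This is a minimal sketch on toy data: the column name `error` and the message format are assumptions, since the actual layout of the errors file is not shown here.

```python
import pandas as pd

# Toy stand-in for the errors file; the 'error' column name is hypothetical.
toy_errors = pd.DataFrame({
    "path": ["scope/file_a", "scope/file_b", "scope/file_c"],
    "error": ["404 Not Found", "500 Internal Server Error", "404 Not Found"],
})

# Extract the first three-digit number in each message as the status code.
toy_errors["status"] = toy_errors["error"].str.extract(r"(\d{3})", expand=False)

# Count how many entries fall under each status code.
error_counts = toy_errors["status"].value_counts()
print(error_counts)
```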

In [78]:
errors_file = f'data/seal/checksums/errors_{date_start_str}_{date_end_str}.csv'
errors = pd.read_csv(errors_file, index_col=False)
errors['path'] = errors['path'].str.replace('rucio/', '')
print(f'Num Errors: {len(errors)}')

0    data15_13TeV/DAOD_PHYSLITE.22956626._000109.po...
1    data15_13TeV/DAOD_PHYSLITE.22956716._000120.po...
2    data15_13TeV/DAOD_PHYSLITE.22956771._000467.po...
3    data16_13TeV/DAOD_PHYSLITE.22956954._000091.po...
4    data16_13TeV/DAOD_PHYSLITE.22956983._000021.po...
Name: path, dtype: object
Num Errors: 4387




## Checksums stored by Rucio

We load the consistent DIDs (a subset of the Rucio dumps) and filter the rows for the selected time range. The consistent DIDs are those present in both the Rucio dumps and the SEAL dumps. This file was prepared in the `manual_auditor` notebook.

In [73]:
consistent_dids_file = 'data/outputs/consistent_dids_20220101-20230410.csv'
consistent_dids = pd.read_csv(consistent_dids_file)

consistent_dids['creation_date'] = pd.to_datetime(consistent_dids['creation_date'])
consistent_dids['update_date'] = pd.to_datetime(consistent_dids['update_date'])
consistent_dids['size'] = pd.to_numeric(consistent_dids['size'])

print(f'Num Consistent DIDs: {len(consistent_dids)}')

Num Consistent DIDs: 2654663


Now, we filter the consistent DIDs for the selected time range.
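For reference, a filter that bounds both ends of the range can be written with `Series.between`, which is inclusive on both sides by default. The sketch below uses toy data, and `toy_dids` is a hypothetical stand-in for the consistent DIDs table.

```python
from datetime import datetime

import pandas as pd

# Toy stand-in for the consistent DIDs table.
toy_dids = pd.DataFrame({
    "path": ["a", "b", "c"],
    "creation_date": pd.to_datetime(["2021-12-31", "2022-03-15", "2022-08-02"]),
})

start = datetime(2022, 1, 1)
end = datetime(2022, 8, 1)

# Keep rows with start <= creation_date <= end.
in_range = toy_dids[toy_dids["creation_date"].between(start, end)]
print(in_range)
```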

In [61]:
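# Only the upper bound is applied here; the input file name (20220101-...)
# suggests the DIDs already start at date_start_str.
consistent_dids_in_time_range = consistent_dids[(consistent_dids['creation_date'] <= date_end)]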

print(f'Num Consistent DIDs in Date Range: {len(consistent_dids_in_time_range)}')

Num Consistent DIDs in Date Range: 793013


## Missing Checksums from SEAL

The `consistent_dids_in_time_range` DataFrame contains the DIDs that are present in both the Rucio dumps and the SEAL dumps in the selected time range. Therefore, the DIDs provided by SEAL in their `entries` and `errors` DataFrames should also be present in the `consistent_dids_in_time_range` DataFrame (except for the `dark_dids`).

The DIDs that are present in `consistent_dids_in_time_range` but in neither the `entries` nor the `errors` files are missing checksums from SEAL. We check whether there are any such DIDs.
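The check described above is an anti-join on `path`. One way to express it, shown here on toy data, is a left merge with `indicator=True`; the `toy_*` frames are hypothetical stand-ins for the real DataFrames, and this is an alternative formulation rather than the method used in this notebook.

```python
import pandas as pd

# Toy stand-ins for consistent_dids_in_time_range, entries and errors.
toy_consistent = pd.DataFrame({"path": ["a", "b", "c", "d"]})
toy_entries = pd.DataFrame({"path": ["a"]})
toy_errors = pd.DataFrame({"path": ["b"]})

# Every path SEAL reported, either with a checksum or with an error.
reported = pd.concat([toy_entries["path"], toy_errors["path"]]).drop_duplicates()

# The left merge marks rows found only in the consistent DIDs as 'left_only'.
merged = toy_consistent.merge(
    reported.to_frame(), on="path", how="left", indicator=True
)
toy_missing = merged.loc[merged["_merge"] == "left_only", ["path"]]
print(toy_missing)
```

The `_merge` column also makes it easy to inspect the matched rows in the same pass, which two chained `isin` filters do not expose.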

In [82]:
dids_missing_checksums = consistent_dids_in_time_range[~consistent_dids_in_time_range['path'].isin(entries['path'])]

# Exclude the DIDs that SEAL reported in the errors file
dids_missing_checksums = dids_missing_checksums[~dids_missing_checksums['path'].isin(errors['path'])]

print(f'Num Missing Checksums from SEAL: {len(dids_missing_checksums)}')



Num Missing Checksums from SEAL: 271658


## Summary

In [88]:
print(f"For the time range {date_start} to {date_end}: {len(consistent_dids_in_time_range)} DIDs registered in Rucio were available at SEAL.")
print(f"Of these, checksums were provided by SEAL for {len(entries)} DIDs. SEAL could not generate checksums for {len(errors)} DIDs and these files should be marked as lost.")
print(f"{len(dids_missing_checksums)} DIDs were missing checksums from SEAL.")

missing_checksums_filename = f'data/outputs/dids_missing_checksums_{date_start_str}_{date_end_str}.csv'
dids_missing_checksums.to_csv(missing_checksums_filename, index=False)
print(f"Missing checksums DIDs written to {missing_checksums_filename}")

For the time range 2022-01-01 00:00:00 to 2022-08-01 00:00:00: 793013 DIDs registered in Rucio were available at SEAL.
Of these, checksums were provided by SEAL for 516978 DIDs. SEAL could not generate checksums for 4387 DIDs and these files should be marked as lost.
271658 DIDs were missing checksums from SEAL.


# Checksum Verification