# Checksum Verification (adler32)

For the DIDs that are present in both the Rucio dumps and the SEAL dumps, we validate whether the checksums reported in the two dumps match.
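A minimal sketch of what such a comparison could look like, assuming each dump has been loaded into a pandas DataFrame. The column names `name` and `adler32` are hypothetical (the real dumps may use different ones), and the toy rows are for illustration only. Checksums are normalized to 8-character lowercase hex before comparing, because leading zeros and letter case can differ between tools.

```python
import pandas as pd


def normalize_adler32(value) -> str:
    """Return an adler32 checksum as 8-character lowercase hex."""
    return str(value).strip().lower().zfill(8)


def compare_checksums(rucio: pd.DataFrame, seal: pd.DataFrame) -> pd.DataFrame:
    """Inner-join both dumps on the DID name and flag whether the checksums agree.

    Assumes hypothetical columns `name` and `adler32` in both frames.
    """
    merged = rucio.merge(seal, on="name", suffixes=("_rucio", "_seal"))
    merged["match"] = (
        merged["adler32_rucio"].map(normalize_adler32)
        == merged["adler32_seal"].map(normalize_adler32)
    )
    return merged


# Toy data, not taken from the real dumps.
rucio_demo = pd.DataFrame(
    {"name": ["a", "b", "c"], "adler32": ["0a1b2c3d", "DEADBEEF", "00000001"]}
)
seal_demo = pd.DataFrame(
    {"name": ["a", "b", "d"], "adler32": ["a1b2c3d", "deadbeee", "00000001"]}
)

result = compare_checksums(rucio_demo, seal_demo)
print(result[["name", "match"]])
```

Only DIDs found in both frames survive the inner join, so `c` and `d` are dropped here. When reading real CSV files, passing `dtype=str` to `pd.read_csv` for the checksum column helps avoid hex strings being parsed as numbers.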

## Time Range
We need to pick a time range to filter the DIDs from the dumps. The end time is constrained because the SEAL dumps do not contain checksums for all DIDs after a certain date. We will use the following time range for the analysis:

In [2]:
time_start = '20220101' # YYYYMMDD
time_end = '20220801' # YYYYMMDD
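As an illustration of how YYYYMMDD bounds like these could be applied to a timestamp, here is a small sketch using only the standard library. The helper `in_time_range` is hypothetical, and treating the range as half-open (start inclusive, end exclusive) is an assumption; the filtering actually applied to the dumps may differ.

```python
from datetime import datetime

DATE_FORMAT = "%Y%m%d"  # matches the YYYYMMDD strings used for the time range


def in_time_range(timestamp: datetime, start: str, end: str) -> bool:
    """Return True if `timestamp` falls in [start, end).

    The half-open interval is an assumption made for this sketch.
    """
    start_dt = datetime.strptime(start, DATE_FORMAT)
    end_dt = datetime.strptime(end, DATE_FORMAT)
    if start_dt >= end_dt:
        raise ValueError("start must be earlier than end")
    return start_dt <= timestamp < end_dt


inside = in_time_range(datetime(2022, 3, 15), "20220101", "20220801")
on_end = in_time_range(datetime(2022, 8, 1), "20220101", "20220801")
print(inside, on_end)
```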

# Checksums calculated by SEAL

The `data/seal/checksums/entries_{start_date}_{end_date}.csv` file contains the checksums (adler32, md5, sha256) calculated by SEAL for the DIDs available on their storage. 


In [11]:
import pandas as pd
from core.utils import bytesToTB
# Use a descriptive name so the built-in dir() is not shadowed
checksums_dir = 'data/seal/checksums'
entries_file = f'{checksums_dir}/entries_{time_start}_{time_end}.csv'
entries = pd.read_csv(entries_file)

total_size_entries = entries['size_bytes'].sum()

print(f'Num Entries: {len(entries)}')
print(f'Total Size: {bytesToTB(total_size_entries)} TB')

Num Entries: 516978
Total Size: 343.429 TB


The `data/seal/checksums/errors_{start_date}_{end_date}.csv` file contains the entries for which SEAL could not calculate the checksum, together with the specific errors that occurred.

- Most of the errors indicate 404 Not Found, which means SEAL has data for the file, but the last few chunks did not complete transfer
- Entries with a 500 Internal Server Error are essentially the same as the 404 case, but in addition to the missing chunk(s), the metadata record could not be found

In [14]:
errors_file = f'data/seal/checksums/errors_{time_start}_{time_end}.csv'
errors = pd.read_csv(errors_file)

print(f'Num Errors: {len(errors)}')

Num Errors: 4387
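To see how the errors split between the 404 and 500 cases described above, the messages could be grouped by status code. The sketch below assumes the status code appears as a standalone number in the error text; the sample messages are made up for illustration and are not taken from the errors file.

```python
import re
from collections import Counter

# Matches the two status codes discussed above as standalone numbers.
STATUS_PATTERN = re.compile(r"\b(404|500)\b")


def classify_error(message: str) -> str:
    """Return '404', '500', or 'other' depending on the status code found in the message."""
    match = STATUS_PATTERN.search(message)
    return match.group(1) if match else "other"


# Made-up messages for illustration only.
demo_messages = [
    "404 Not Found",
    "500 Internal Server Error",
    "404 Not Found",
    "connection reset",
]

counts = Counter(classify_error(message) for message in demo_messages)
print(counts)
```

With the `errors` DataFrame loaded above, the same function could be applied with something like `errors['error'].map(classify_error).value_counts()`, where the column name `error` is a guess and should be replaced by whichever column actually holds the message.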
