hdf5 being rebuilt too often #1078

Closed
carlogrisetti opened this issue Jan 17, 2021 · 10 comments
Labels: bug (Something isn't working)

@carlogrisetti
Contributor

Ludwig from master (and since at least 0.3.2), on different systems.

When changing the config.yaml file, even just changing the dropout values (which have nothing to do with the hdf5 preprocessed file), Ludwig says that the checksum has changed, hence it has to rebuild the hdf5 file.
This also happens when switching the batch size parameter, for example. I suspect it has happened to me even without changing any parameter whatsoever.

I will look into it, but I wanted to keep track of this, since I don't know how much time I will have in the next few days, and maybe it's already a known issue (I have found no existing issues regarding this, anyway).

@carlogrisetti
Contributor Author

Just happened again with the exact same data, config, etc.
Might it have something to do with the fact that I'm using image data as input?

I will find the time to debug this.

@w4nderlust
Collaborator

The logic is the following: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/data/preprocessing.py#L1798-L1811
which means the data is preprocessed again when any of the following changes (see the sketch after the list):

  • ludwig version
  • dataset modification date
  • any parameter in the preprocessing section of the config
  • any name or type of any feature
  • any preprocessing section of any feature
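
To make that concrete, here's a rough sketch of that kind of checksum (not the actual calculate_checksum implementation, just an illustration of hashing those inputs; the helper name and the md5/JSON choices below are made up):

import hashlib
import json
import os

def illustrative_checksum(dataset_path, config, ludwig_version):
    # Gather exactly the pieces listed above: version, dataset modification date,
    # the global preprocessing section, and each feature's name, type and
    # preprocessing section.
    features = config.get('input_features', []) + config.get('output_features', [])
    info = {
        'ludwig_version': ludwig_version,
        'dataset_modification_date': os.path.getmtime(dataset_path),
        'global_preprocessing': config.get('preprocessing', {}),
        'feature_names': [f['name'] for f in features],
        'feature_types': [f['type'] for f in features],
        'feature_preprocessing': [f.get('preprocessing', {}) for f in features],
    }
    # Hash a canonical JSON serialization so that dict key order is irrelevant.
    return hashlib.md5(json.dumps(info, sort_keys=True).encode('utf-8')).hexdigest()

If any of those pieces changes, the digest changes and the cached hdf5/meta.json pair is considered stale.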

If a feature parameter like dropout or a training parameter like batch size is changed, preprocessing should not be triggered again: those parameters are not used to compute the checksum, so the checksum stays identical and preprocessing is skipped. The relevant piece of code is:

checksum = calculate_checksum(dataset, config)
cache_checksum = cache_training_set_metadata.get(CHECKSUM, None)
if checksum == cache_checksum:
    # Cache hit: reuse the existing hdf5 and meta.json.
    logger.info(
        'Found hdf5 and meta.json with the same filename '
        'of the dataset, using them instead'
    )
    dataset = dataset_hdf5_fp
    training_set_metadata = cache_training_set_metadata
    config['data_hdf5_fp'] = dataset
    data_format = 'hdf5'
else:
    # Cache miss: the checksum changed, so the stale cache files are removed.
    logger.info(
        "Found hdf5 and meta.json with the same filename "
        "of the dataset, but checksum don't match, "
        "if saving of processed input is not skipped "
        "they will be overridden"
    )
    os.remove(dataset_hdf5_fp)
    os.remove(training_set_metadata_fp)

Let me know if you can figure out what triggers the spurious recreation of the cache.

@carlogrisetti
Contributor Author

Here it is...
When checking the config, the 'preprocessing': {'in_memory': False, 'num_processes': 6} I specified gets thrown out, and line 1313 only looks at a config that has 'preprocessing': {} in it, hence being different from the actual one.
When saving the meta.json file, at line 1591, the full config (including 'preprocessing': {'in_memory': False, 'num_processes': 6}) gets passed and hashed.

You should be able to repro with a model definition like this one:
config.txt
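
For reference, a minimal config with the same shape (the feature names and the output feature type here are made up for illustration; the preprocessing section is the one quoted above) would look something like:

import yaml

# Hypothetical stand-in for the attached config.txt: an image input feature plus a
# global preprocessing section, which is the combination that triggers the spurious rebuild.
config = {
    'input_features': [
        {'name': 'image_path', 'type': 'image'},  # feature name is made up
    ],
    'output_features': [
        {'name': 'label', 'type': 'category'},  # output feature is assumed
    ],
    'preprocessing': {'in_memory': False, 'num_processes': 6},
}

with open('config.yaml', 'w') as f:
    yaml.safe_dump(config, f)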

@carlogrisetti
Contributor Author

Ok... needed some sleep :)
The part that changes in the info dict you're building before calculating the hash is the last bit:

'feature_preprocessing': [{}, {}, {}, {}]} (when comparing to decide whether to rebuild)

checksum = calculate_checksum(dataset, config)

vs

'feature_preprocessing': [{}, {'in_memory': False, 'num_processes': 6}, {}, {}]} (when saving meta.json)

training_set_metadata[CHECKSUM] = calculate_checksum(

Since that 'in_memory': False, 'num_processes': 6 is specified at the global level, I would expect not to find those keys at the feature level (or to have them copied consistently in both the checksum check and the meta.json saving).
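
As a self-contained way to see why the two calls disagree, here is a small stand-in for calculate_checksum using plain hashlib/json (names and hashing details are illustrative, not Ludwig's actual implementation):

import hashlib
import json

def digest(info):
    # Stand-in for calculate_checksum: hash a canonical JSON serialization.
    return hashlib.md5(json.dumps(info, sort_keys=True).encode('utf-8')).hexdigest()

# What gets hashed when deciding whether to rebuild: per-feature preprocessing is empty.
at_check_time = {'feature_preprocessing': [{}, {}, {}, {}]}

# What gets hashed when saving meta.json: the global section has been merged into one feature.
at_save_time = {
    'feature_preprocessing': [{}, {'in_memory': False, 'num_processes': 6}, {}, {}]
}

print(digest(at_check_time) == digest(at_save_time))  # False, so the cache looks stale next time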

The "merging" happens here:

def build_data(input_df, features, training_set_metadata, backend):
    proc_df = backend.df_engine.empty_df_like(input_df)
    for feature in features:
        if PROC_COLUMN not in feature:
            feature[PROC_COLUMN] = compute_feature_hash(feature)
        if feature[PROC_COLUMN] not in proc_df:
            preprocessing_parameters = \
                training_set_metadata[feature[NAME]][PREPROCESSING]
            input_df = handle_missing_values(
                input_df,
                feature,
                preprocessing_parameters
            )
            add_feature_data = get_from_registry(
                feature[TYPE],
                base_type_registry
            ).add_feature_data
            proc_df = add_feature_data(
                feature,
                input_df,
                proc_df,
                training_set_metadata,
                preprocessing_parameters,
                backend
            )
    return proc_df

@carlogrisetti
Contributor Author

@w4nderlust I don't know how to fix this without breaking that last piece of code... I'd say I'm passing this one over to you guys hehe :)
The issue should be easily reproducible. If not, just ask me for more details.

@w4nderlust
Collaborator

w4nderlust commented Jan 20, 2021

Looks like a bug then, will look into it!

@w4nderlust added the bug label Jan 20, 2021
@w4nderlust
Collaborator

Working on it #1114

@w4nderlust
Collaborator

Merged the PR, @carlogrisetti could you confirm this solves the issue in your specific use case?

@carlogrisetti
Contributor Author

Never saw this message, sorry @w4nderlust. It does indeed fix the issue. I just had it resurface on an install that hadn't been updated to master... and as soon as I updated to master it worked flawlessly.

@w4nderlust
Collaborator

great to hear!
