hdf5 being rebuilt too often #1078

Closed
carlogrisetti opened this issue Jan 17, 2021 · 10 comments
Labels: bug (Something isn't working)

@carlogrisetti
Contributor

Ludwig from master (and since at least 0.3.2), on different systems.

When changing the config.yaml file, even just changing the dropout values (which have nothing to do with the hdf5 preprocessed file), Ludwig says that the checksum has changed, hence it has to rebuild the hdf5 file.
This also happens when switching the batch size parameter, for example. I suspect it has happened to me even without changing any parameter whatsoever.

I will look into it, but I wanted to keep track of this, since I don't know how much time I will have in the next few days, and maybe it's already a known issue (I have found no existing issues regarding this, anyway).

@carlogrisetti
Contributor Author

Just happened again with the exact same data, config, etc.
Might it have something to do with the fact that I'm using image data as input?

I will find the time to debug this.

@w4nderlust
Collaborator

The logic is the following: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/data/preprocessing.py#L1798-L1811
which means the data is preprocessed again when any of the following changes (see the sketch after the list):

  • ludwig version
  • dataset modification date
  • any parameter in the preprocessing section of the config
  • any name or type of any feature
  • any preprocessing section of any feature
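
To make that concrete, here's a rough sketch of that kind of checksum (not the actual calculate_checksum implementation, just an illustration of hashing those inputs; the helper name and the md5/JSON choices below are made up):

import hashlib
import json
import os

def illustrative_checksum(dataset_path, config, ludwig_version):
    # Gather exactly the pieces listed above: version, dataset modification date,
    # the global preprocessing section, and each feature's name, type and
    # preprocessing section.
    features = config.get('input_features', []) + config.get('output_features', [])
    info = {
        'ludwig_version': ludwig_version,
        'dataset_modification_date': os.path.getmtime(dataset_path),
        'global_preprocessing': config.get('preprocessing', {}),
        'feature_names': [f['name'] for f in features],
        'feature_types': [f['type'] for f in features],
        'feature_preprocessing': [f.get('preprocessing', {}) for f in features],
    }
    # Hash a canonical JSON serialization so that dict key order is irrelevant.
    return hashlib.md5(json.dumps(info, sort_keys=True).encode('utf-8')).hexdigest()

If any of those pieces changes, the digest changes and the cached hdf5/meta.json pair is considered stale.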

If a feature parameter like dropout or a training parameter like batch size is changed, preprocessing should not be triggered again: those parameters are not used to compute the checksum, so the checksum stays identical and preprocessing is skipped. The relevant piece of code is:

checksum = calculate_checksum(dataset, config)
cache_checksum = cache_training_set_metadata.get(CHECKSUM, None)
if checksum == cache_checksum:
    # Cache hit: reuse the existing hdf5 and meta.json.
    logger.info(
        'Found hdf5 and meta.json with the same filename '
        'of the dataset, using them instead'
    )
    dataset = dataset_hdf5_fp
    training_set_metadata = cache_training_set_metadata
    config['data_hdf5_fp'] = dataset
    data_format = 'hdf5'
else:
    # Cache miss: the checksum changed, so the stale cache files are removed.
    logger.info(
        "Found hdf5 and meta.json with the same filename "
        "of the dataset, but checksum don't match, "
        "if saving of processed input is not skipped "
        "they will be overridden"
    )
    os.remove(dataset_hdf5_fp)
    os.remove(training_set_metadata_fp)

Let me know if you can figure out what triggers the spurious recreation of the cache.

@carlogrisetti
Contributor Author

Here it is...
When checking the config, the 'preprocessing': {'in_memory': False, 'num_processes': 6} I specified gets thrown out, and line 1313 only looks at a config that has 'preprocessing': {} in it, hence being different from the actual one.
When saving the meta.json file, at line 1591, the full config (including 'preprocessing': {'in_memory': False, 'num_processes': 6}) gets passed and hashed.

You should be able to repro with a model definition like this one:
config.txt
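
For reference, a minimal config with the same shape (the feature names and the output feature type here are made up for illustration; the preprocessing section is the one quoted above) would look something like:

import yaml

# Hypothetical stand-in for the attached config.txt: an image input feature plus a
# global preprocessing section, which is the combination that triggers the spurious rebuild.
config = {
    'input_features': [
        {'name': 'image_path', 'type': 'image'},  # feature name is made up
    ],
    'output_features': [
        {'name': 'label', 'type': 'category'},  # output feature is assumed
    ],
    'preprocessing': {'in_memory': False, 'num_processes': 6},
}

with open('config.yaml', 'w') as f:
    yaml.safe_dump(config, f)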

@carlogrisetti
Contributor Author

Ok... needed some sleep :)
The part that changes in the info dict you're building before calculating the hash is the last bit:

'feature_preprocessing': [{}, {}, {}, {}]} (when comparing to decide whether to rebuild)

checksum = calculate_checksum(dataset, config)

vs

'feature_preprocessing': [{}, {'in_memory': False, 'num_processes': 6}, {}, {}]} (when saving meta.json)

training_set_metadata[CHECKSUM] = calculate_checksum(

Since that 'in_memory': False, 'num_processes': 6 is specified at the global level, I would expect not to find those keys at the feature level (or to have them copied consistently in both the checksum check and the meta.json saving).
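
As a self-contained way to see why the two calls disagree, here is a small stand-in for calculate_checksum using plain hashlib/json (names and hashing details are illustrative, not Ludwig's actual implementation):

import hashlib
import json

def digest(info):
    # Stand-in for calculate_checksum: hash a canonical JSON serialization.
    return hashlib.md5(json.dumps(info, sort_keys=True).encode('utf-8')).hexdigest()

# What gets hashed when deciding whether to rebuild: per-feature preprocessing is empty.
at_check_time = {'feature_preprocessing': [{}, {}, {}, {}]}

# What gets hashed when saving meta.json: the global section has been merged into one feature.
at_save_time = {
    'feature_preprocessing': [{}, {'in_memory': False, 'num_processes': 6}, {}, {}]
}

print(digest(at_check_time) == digest(at_save_time))  # False, so the cache looks stale next time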

The "merging" happens here:

def build_data(input_df, features, training_set_metadata, backend):
    proc_df = backend.df_engine.empty_df_like(input_df)
    for feature in features:
        if PROC_COLUMN not in feature:
            feature[PROC_COLUMN] = compute_feature_hash(feature)
        if feature[PROC_COLUMN] not in proc_df:
            preprocessing_parameters = \
                training_set_metadata[feature[NAME]][PREPROCESSING]
            input_df = handle_missing_values(
                input_df,
                feature,
                preprocessing_parameters
            )
            add_feature_data = get_from_registry(
                feature[TYPE],
                base_type_registry
            ).add_feature_data
            proc_df = add_feature_data(
                feature,
                input_df,
                proc_df,
                training_set_metadata,
                preprocessing_parameters,
                backend
            )
    return proc_df

@carlogrisetti
Contributor Author

@w4nderlust I don't know how to fix this without breaking that last piece of code... I'd say I'm passing this one over to you guys hehe :)
The issue should be easily reproducible. If not, just ask me for more details.

@w4nderlust
Collaborator

w4nderlust commented Jan 20, 2021

Looks like a bug then, will look into it!

@w4nderlust added the bug label Jan 20, 2021
@w4nderlust
Collaborator

Working on it #1114

@w4nderlust
Collaborator

Merged the PR, @carlogrisetti could you confirm this solves the issue in your specific use case?

@carlogrisetti
Contributor Author

Never saw this message, sorry @w4nderlust. It does indeed fix the issue. I just had it resurface on an install that hadn't been updated to master... and as soon as I updated to master it worked flawlessly.

@w4nderlust
Collaborator

great to hear!
