Open
Description
Can DSIR compute KL reduction, the data metric described in the paper?
Also, what data preprocessing is required before resampling a custom dataset? My use case is importance resampling of Alpaca-style data, and my current code is as follows:
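from data_selection import HashedNgramDSIR

# Raw pool to select from, and the target data whose distribution should be matched
raw_datasets = ["/dsir/original_data/train_30k.jsonl"]
target_datasets = ["/dsir/original_data/target.jsonl"]

dsir = HashedNgramDSIR(raw_datasets, target_datasets, cache_dir='/dsir/dsir_cache')
# Fit the hashed n-gram importance estimator on the raw and target data
dsir.fit_importance_estimator(num_tokens_to_fit='auto')
# Score every raw example with its importance weight
dsir.compute_importance_weights()
# Draw 10,000 examples according to the importance weights
dsir.resample(out_dir='resampled', num_to_sample=10000, cache_dir='/dsir/resampled_cache')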
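To make both questions concrete, here is a minimal standalone sketch of what I have in mind. It does not use any DSIR API. As I understand the paper, KL reduction measures how much closer the selected data is to the target than the raw pool is, in hashed n-gram feature space: KL(target || raw) - KL(target || selected), averaged over targets when there are several. Everything below is my own assumption for illustration: the helper names (`alpaca_to_text`, `hashed_ngram_dist`, `kl_reduction`), the whitespace tokenization, the CRC32 hash, the bucket count, and the add-one smoothing. The library's actual featurization may differ.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 10_000  # assumption: illustrative bucket count


def alpaca_to_text(example):
    """Hypothetical helper: join the non-empty Alpaca fields into one string."""
    parts = [example.get(k, "") for k in ("instruction", "input", "output")]
    return "\n".join(p for p in parts if p)


def hashed_ngram_dist(texts, num_buckets=NUM_BUCKETS, smoothing=1.0):
    """Smoothed distribution over hashed unigram and bigram buckets."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()  # assumption: plain whitespace tokenization
        ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
        for ng in ngrams:
            # crc32 is deterministic across runs, unlike the built-in hash()
            counts[zlib.crc32(ng.encode("utf-8")) % num_buckets] += 1
    total = sum(counts.values()) + smoothing * num_buckets
    return [(counts.get(b, 0) + smoothing) / total for b in range(num_buckets)]


def kl(p, q):
    """KL(p || q) for two distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def kl_reduction(target_texts, raw_texts, selected_texts, num_buckets=NUM_BUCKETS):
    """KL(target || raw) - KL(target || selected); higher means a closer match."""
    p_target = hashed_ngram_dist(target_texts, num_buckets)
    p_raw = hashed_ngram_dist(raw_texts, num_buckets)
    p_selected = hashed_ngram_dist(selected_texts, num_buckets)
    return kl(p_target, p_raw) - kl(p_target, p_selected)
```

In my setup I would flatten each JSONL record with something like `alpaca_to_text` before computing the metric on the target file, the raw file, and the resampled output. I do not know whether the same flattening also needs to be passed to DSIR itself, which is part of what I am asking.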