How to Calculate KL Reduction? #13

Can DSIR compute KL reduction, the data metric described in the paper?
Also, what data preprocessing is required before resampling a custom dataset? My use case is importance resampling of Alpaca-style data, and my current processing code is as follows:
```
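from data_selection import HashedNgramDSIR

# Raw pool to resample from, and the target set that defines the desired distribution
raw_datasets = ["/dsir/original_data/train_30k.jsonl"]
target_datasets = ["/dsir/original_data/target.jsonl"]

dsir = HashedNgramDSIR(raw_datasets, target_datasets, cache_dir='/dsir/dsir_cache')
# Fit the hashed n-gram importance estimator on the raw and target data
dsir.fit_importance_estimator(num_tokens_to_fit='auto')
# Score every raw example with its importance weight
dsir.compute_importance_weights()
# Draw 10,000 examples according to the importance weights
dsir.resample(out_dir='resampled', num_to_sample=10000, cache_dir='/dsir/resampled_cache')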
```
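
To make the question concrete, here is a minimal sketch of what I understand KL reduction to be: how much closer the selected data is to the target than the raw pool is, measured as KL(target || raw) minus KL(target || selected) over hashed n-gram feature distributions. As I read the paper, it also averages this over several target tasks; the sketch handles a single target. This is not the DSIR library's implementation. The function names, the `crc32` hashing, the whitespace tokenization, the bucket count, the smoothing constant, and the Alpaca field names (`instruction`, `input`, `output`) are all assumptions for illustration.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 10_000  # assumption: illustrative bucket count


def alpaca_to_text(example):
    """Join Alpaca-style fields into one string (field names are assumed)."""
    parts = [example.get("instruction", ""),
             example.get("input", ""),
             example.get("output", "")]
    return "\n".join(p for p in parts if p)


def hashed_ngram_dist(texts, num_buckets=NUM_BUCKETS, smoothing=1e-8):
    """Smoothed distribution over hashed unigram and bigram buckets."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()  # assumption: simple whitespace tokenizer
        ngrams = words + [" ".join(pair) for pair in zip(words, words[1:])]
        for ng in ngrams:
            counts[zlib.crc32(ng.encode("utf-8")) % num_buckets] += 1
    total = sum(counts.values()) + smoothing * num_buckets
    return [(counts.get(b, 0) + smoothing) / total for b in range(num_buckets)]


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def kl_reduction(target_texts, raw_texts, selected_texts, num_buckets=NUM_BUCKETS):
    """KL(target || raw) - KL(target || selected); larger means a better match."""
    p_target = hashed_ngram_dist(target_texts, num_buckets)
    q_raw = hashed_ngram_dist(raw_texts, num_buckets)
    p_selected = hashed_ngram_dist(selected_texts, num_buckets)
    return kl_divergence(p_target, q_raw) - kl_divergence(p_target, p_selected)
```

With something like this, I would load the target, raw, and resampled JSONL files, convert each record with `alpaca_to_text`, and pass the three lists of strings to `kl_reduction`. A positive value would mean the resampled subset is closer to the target than the raw pool is. I do not know whether this matches how the paper's numbers were produced, hence the question.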
