Add oversampling strategies to interleave datasets #4831
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Awesome thank you ! This is amazing :)

Could you add a test case in test_arrow_dataset.py to test this stopping strategy please ?

I think we can also mention it in the documentation in process.mdx (rendered at https://hf.co/docs/datasets/process): what about a section on Interleave right after the Concatenate section ? Currently there's just a "Tip" that redirects to the streaming/interleave docs.
src/datasets/combine.py (Outdated)

if (iterable and stopping_strategy != "first_exhausted") or (
    stopping_strategy not in ["first_exhausted", "all_exhausted"]
):
    raise ValueError(
I think it should be a ValueError if stopping_strategy not in ["first_exhausted", "all_exhausted"], and a NotImplementedError if iterable and stopping_strategy != "first_exhausted".
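For example, something along these lines (a rough sketch of the suggested split; the exact error messages are placeholders):

```python
if stopping_strategy not in ["first_exhausted", "all_exhausted"]:
    # Unknown strategy name: a user error.
    raise ValueError(
        f"{stopping_strategy} is not supported. `stopping_strategy` must be one of "
        "['first_exhausted', 'all_exhausted']."
    )
if iterable and stopping_strategy != "first_exhausted":
    # Valid strategy, but not yet supported for iterable datasets.
    raise NotImplementedError(
        f"The {stopping_strategy} strategy is not implemented yet for iterable datasets."
    )
```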
…nterleave_map_style_datasets and add comments
…ntation of process.mdx
Hi @lhoestq,
It looks all good to me ! Thanks again, and for the tests and docs as well :)
Merging !
# Reasoning behind the following operation: for each dataset's indices (i.e. each column), repeat the indices so that there are max_length indices per dataset.
# For example, if max_length is 5 and the i-th dataset has 3 samples, the i-th column will be [0, 1, 2, 0, 1].
indices = np.mod(np.arange(max(lengths)).reshape(-1, 1), np.array(lengths).reshape(1, -1))
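A toy illustration of what this produces (the lengths are made up for the example):

```python
import numpy as np

lengths = [5, 3, 2]  # hypothetical dataset lengths

# Row i is [i % 5, i % 3, i % 2]: each column cycles through the indices
# of one dataset until max(lengths) rows are filled.
indices = np.mod(np.arange(max(lengths)).reshape(-1, 1), np.array(lengths).reshape(1, -1))
print(indices)
# [[0 0 0]
#  [1 1 1]
#  [2 2 0]
#  [3 0 1]
#  [4 1 0]]
```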
nice trick !
Thanks for that comment and for the review!
@ylacombe Thanks for your effort!
May I ask why that is, and how to solve it? In some scenarios, such as domain adaptation with limited resources, it is normal to have a big generic dataset and a small in-domain dataset. Here is an example with data sizes 8:2 and oversampling ratios 0.2:0.8:

from datasets import Dataset, interleave_datasets
d1 = Dataset.from_dict({"a": [1, 2, 3, 4, 5, 6, 7, 8]})
d2 = Dataset.from_dict({"a": [9, 10]})
new_d = interleave_datasets([d1, d2], probabilities=[0.2, 0.8], seed=42, stopping_strategy="all_exhausted")
print(len(new_d))
print(new_d["a"])
The sampling ratios from the two original datasets into the output dataset are correct. However, the length of the output dataset is 37, which is too big. I think it should only be large enough to make the smaller dataset similar in size to the bigger dataset. Any solution for this? Many thanks!
Hi @ymoslem, it's a great question and yes, it's normal to have two different-sized datasets to interleave! My recommendation here would be to either use probabilities more biased towards the large dataset (e.g. …)

Let me know if I need to be clearer!
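The length of ~37 follows from the all_exhausted semantics: every dataset must be fully seen at least once, and seeing all 8 samples of d1 at probability 0.2 takes on the order of 8 / 0.2 = 40 draws. A rough sketch of the "bias the probabilities towards the larger dataset" suggestion (the probability values below are chosen for illustration, not taken from the original reply):

```python
from datasets import Dataset, interleave_datasets

d1 = Dataset.from_dict({"a": [1, 2, 3, 4, 5, 6, 7, 8]})
d2 = Dataset.from_dict({"a": [9, 10]})

# Probabilities leaning towards the larger dataset: d2 is still oversampled
# (40% of draws instead of its natural 20%), but exhausting d1 now takes far
# fewer draws, so the output stays much shorter than with [0.2, 0.8].
new_d = interleave_datasets(
    [d1, d2], probabilities=[0.6, 0.4], seed=42, stopping_strategy="all_exhausted"
)
print(len(new_d))
print(new_d["a"])
```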
@ylacombe Many thanks for your prompt response! As we needed to implement certain oversampling experiments, we ended up using Pandas, considering each dataset a class with a distinct "label":

import pandas as pd
def oversample(df):
    # Count rows per label; value_counts() sorts in descending order,
    # so the first entry is the majority class.
    classes = df.label.value_counts().to_dict()
    most = max(classes.values())
    classes_list = []
    for key in classes:
        classes_list.append(df[df['label'] == key])
    # Resample every minority class (with replacement) up to the majority size.
    classes_sample = []
    for i in range(1, len(classes_list)):
        classes_sample.append(classes_list[i].sample(most, replace=True))
    df_maybe = pd.concat(classes_sample)
    # Put the majority class back and reset the index.
    final_df = pd.concat([df_maybe, classes_list[0]], axis=0)
    final_df = final_df.reset_index(drop=True)
    return final_df
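A small usage example of the function above (the column values and labels are made up for illustration):

```python
import pandas as pd

# Two "datasets" tagged with distinct labels, concatenated into one DataFrame.
generic = pd.DataFrame({"a": list(range(8)), "label": ["generic"] * 8})
in_domain = pd.DataFrame({"a": [100, 101], "label": ["in_domain"] * 2})
df = pd.concat([generic, in_domain], ignore_index=True)

# oversample() is the function defined above: every minority label is resampled
# (with replacement) up to the majority count.
balanced = oversample(df)
print(balanced.label.value_counts())  # both labels now appear 8 times
```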
Hello everyone,
Here is a proposal to improve the interleave_datasets function. Following Issue #3064, and @lhoestq's comment, I propose here code that performs oversampling when interleaving a Dataset list.

I have encountered this problem myself while trying to implement training on a multilingual dataset, following a training strategy similar to that of the XLSUM paper, a multilingual abstractive summarization dataset where the multilingual training dataset is created by sampling from the languages following a smoothing strategy. The main idea is to sample languages that have a low number of samples more frequently than the other languages.
As in Issue #3064, the current default strategy is an undersampling strategy, which stops as soon as a dataset runs out of samples. The new all_exhausted strategy stops building the new dataset as soon as all samples in every dataset have been added at least once.

How does it work in practice:
- If probabilities is None and the strategy is all_exhausted, it simply performs a round-robin interleaving that stops when the longest dataset is out of samples. Here the new dataset length will be the number of datasets times the length of the longest dataset.
- If probabilities is not None and the strategy is all_exhausted, it keeps track of the datasets which were out of samples but continues to add samples from them to the new dataset, and stops as soon as every dataset has run out of samples at least once.

More on the last sentence:
Take the previous example of interleave_datasets. With my implementation,

dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)

gives:

>>> dataset["a"]
[10, 0, 11, 1, 2]

because d1 is already out of samples just after 2 is added.

Example of the results of applying the different strategies:
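A minimal sketch to reproduce such a comparison (the toy datasets below are chosen for illustration):

```python
from datasets import Dataset, interleave_datasets

d1 = Dataset.from_dict({"a": [0, 1, 2]})
d2 = Dataset.from_dict({"a": [10, 11, 12]})
d3 = Dataset.from_dict({"a": [20, 21, 22]})

for strategy in ("first_exhausted", "all_exhausted"):
    # Round-robin interleaving (no probabilities given).
    round_robin = interleave_datasets([d1, d2, d3], stopping_strategy=strategy)
    # Probability-based sampling, as in the example above.
    sampled = interleave_datasets(
        [d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42, stopping_strategy=strategy
    )
    print(strategy, round_robin["a"], sampled["a"])
```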
Final note: I've been using this code for a research project involving a large-scale multilingual dataset. One should be careful when using oversampling to avoid exploding the size of the dataset. For example, if a very large dataset has a low probability of being sampled, the final dataset may be several times the size of that large dataset.