Dear authors,
Let me start by thanking you for the open-source release of GReaT. I noticed an implementation detail that slows down the generation of samples, especially on larger datasets.
Problem Description
Looking at the GPU utilization (using nvtop), I found that the CPU workload (everything outside of sampling the model) takes increasingly longer: GPU utilization becomes worse as the number of sampling iterations grows.
Proposed Solution
Digging into the code, I found that the accumulator (df_gen) and the generated data (pd.DataFrame(td)) are concatenated in each iteration:
https://github.com/kathrinse/be_great/blob/c568617763ba954fb39fc6b6e222e3abaef0886a/be_great/great.py#LL147C21-L147C21
(and similarly in be_great/great_utils.py, line 97 in c568617).
This incurs O(N^2) overhead, since each iteration allocates memory for a new DataFrame large enough to hold all rows accumulated so far. This can be resolved by collecting the per-iteration data frames in a list and concatenating them once at the end of the generation process. For GReaT.sample this would require a minor change, similar to the following:
```python
# Create an accumulation list for generated data
dfs = []
...
while n > already_generated:
    ...
    df_gen = _convert_text_to_tabular_data(text_data, df_gen)
    ...
    dfs.append(df_gen)
    already_generated += len(dfs[-1])
    pbar.update(len(dfs[-1]))

df_gen = pd.concat(dfs)
df_gen = df_gen.reset_index(drop=True)
...
```
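To illustrate why this matters, here is a small, self-contained toy benchmark (not GReaT code; the function names and data are made up) contrasting per-iteration concatenation with collecting batches in a list and concatenating once:

```python
import time

import pandas as pd


def concat_per_iteration(n_iters, batch):
    # Re-concatenating every iteration reallocates the full frame each
    # time, so total work grows quadratically with the number of rows.
    df = pd.DataFrame()
    for _ in range(n_iters):
        df = pd.concat([df, batch])
    return df.reset_index(drop=True)


def concat_once(n_iters, batch):
    # Collect the batches in a list and concatenate a single time at the
    # end; total work grows linearly with the number of rows.
    dfs = [batch for _ in range(n_iters)]
    return pd.concat(dfs).reset_index(drop=True)


if __name__ == "__main__":
    batch = pd.DataFrame({"a": range(100), "b": range(100)})
    for fn in (concat_per_iteration, concat_once):
        t0 = time.perf_counter()
        out = fn(500, batch)
        print(fn.__name__, len(out), f"{time.perf_counter() - t0:.3f}s")
```

Both functions produce identical results; only the allocation pattern differs, which is exactly the change proposed above.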
The _convert_text_to_tabular_data function can be improved similarly by making it return a DataFrame constructed from a list of dictionaries.
With this change, generation time for a dataset containing 20K+ samples went from 40+ minutes to about 3 minutes. Smaller datasets also benefit, but the effect is less pronounced, since the per-iteration overhead grows linearly with the iteration count.
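As a rough sketch of that idea (hypothetical code, not the actual GReaT implementation; the function name and the "column is value" text format are assumptions for illustration), the converter could parse each generated string into one dictionary per row and build the DataFrame in a single pass:

```python
import pandas as pd


def convert_text_to_rows(text_data):
    # Hypothetical sketch: parse strings of the assumed form
    # "col1 is val1, col2 is val2" into dicts, then construct the
    # DataFrame once from the list of dicts instead of appending
    # row-by-row to an accumulator frame.
    rows = []
    for text in text_data:
        row = {}
        for feature in text.split(","):
            parts = feature.strip().split(" is ", 1)
            if len(parts) == 2:
                row[parts[0].strip()] = parts[1].strip()
        rows.append(row)
    return pd.DataFrame(rows)


df = convert_text_to_rows(
    ["age is 39, income is 5000", "age is 21, income is 1200"]
)
```

Building the frame once from `rows` avoids the repeated reallocation entirely.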
Example implementation
Looking at related work, the REaLTabFormer implementation provides an example of this approach:
https://github.com/worldbank/REaLTabFormer/blob/bf1a38ef8f202372956ac57a363289c505967982/src/realtabformer/rtf_sampler.py#L610-L674
Side note
Likely this could also (slightly) improve the inference/generation performance reported for GReaT in Appendix B.5 of your paper.