About the pre-training dataset #41

wyx502 · 2022-11-16T02:15:16Z

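The paper says the pre-training dataset totals 30 GB: 15 GB from public datasets and 15 GB from CSTNET. On the UNB website I found many datasets, including ISCXVPN2016 and ISCXTor2016. Could the author say which datasets make up the 15 GB of public data, and what processing was applied to them? Also, is there any way to obtain the 15 GB non-public CSTNET dataset? Thanks.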

wyx502 · 2022-11-17T07:30:14Z

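Hello, I found the figure above that you published and would like to confirm: is this the 15 GB public dataset? Also, was the 4.3 GB encrypted_traffic_burst.txt file generated from the full 30 GB of data? Thanks.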

linwhitehat · 2022-11-18T15:40:14Z

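1. No particular constraints were applied when selecting the pre-training data, so you can substitute traffic that covers as wide a range of protocols as possible.
2. encrypted_traffic_burst.txt is generated from the pre-training data.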

wyx502 · 2022-11-23T09:16:25Z

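Sorry to bother you again. The main README says to use vocab_process/main.py to generate the corpora, while the README in data_process says the pre-training stage uses data_generation to generate bursts. Which one actually produces encrypted_traffic_burst.txt? I ask because both of them appear to generate txt files. Could you explain in more detail? Thanks.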

linwhitehat · 2022-12-17T13:16:05Z

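Hello. data_process generates the traffic burst data needed for the pre-training corpora, and vocab_process then generates the corresponding corpora from it. You can think of the two parts as traffic data preprocessing and pre-training data generation, respectively.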
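For readers following along, the two-stage flow described in this reply can be sketched roughly as below. This is only an illustration and not the repository's code: the packet representation, the function names `group_into_bursts` and `bursts_to_corpus`, the burst definition (a run of consecutive packets in the same direction), and the overlapping hex-window tokenization are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple

# Hypothetical packet representation: (direction, payload as a hex string).
Packet = Tuple[str, str]


def group_into_bursts(packets: Iterable[Packet]) -> List[str]:
    """Stage 1 (traffic preprocessing): merge consecutive same-direction packets.

    Assumption: a "burst" is a run of packets travelling in one direction.
    """
    bursts: List[str] = []
    last_direction = None
    for direction, payload_hex in packets:
        if direction == last_direction:
            bursts[-1] += payload_hex  # extend the current burst
        else:
            bursts.append(payload_hex)  # direction changed: start a new burst
            last_direction = direction
    return bursts


def bursts_to_corpus(bursts: Iterable[str], token_chars: int = 4) -> List[str]:
    """Stage 2 (corpus generation): one line of space-separated tokens per burst.

    Assumption: tokens are overlapping hex windows of token_chars characters,
    advanced one byte (two hex characters) at a time.
    """
    lines: List[str] = []
    for burst in bursts:
        tokens = [
            burst[i:i + token_chars]
            for i in range(0, len(burst) - token_chars + 1, 2)
        ]
        lines.append(" ".join(tokens))
    return lines


if __name__ == "__main__":
    sample = [("out", "aabb"), ("out", "cc"), ("in", "ddee")]
    for line in bursts_to_corpus(group_into_bursts(sample)):
        print(line)
```

In this sketch, the output of the first function plays the role of the intermediate burst data, and the lines returned by the second function are what would be written to a corpus text file.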

linwhitehat added the help wanted and question labels on Nov 18, 2022

linwhitehat closed this as completed Dec 17, 2022
