
About the pre-training dataset #41

Closed
wyx502 opened this issue Nov 16, 2022 · 4 comments

Labels
help wanted Extra attention is needed question Further information is requested

Comments

@wyx502

wyx502 commented Nov 16, 2022

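The paper says the pre-training dataset totals 30 GB, with 15 GB from public datasets and 15 GB from CSTNET. On the UNB website I found many datasets, such as ISCXVPN2016 and ISCXTor2016. Could the authors say which datasets the 15 GB of public data refers to, and what processing was applied to them? Also, is there any way to obtain the 15 GB of non-public CSTNET data? Thank you.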

@wyx502
Author

wyx502 commented Nov 17, 2022

image
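Hello, I found the figure above that you published and would like to confirm: is this the 15 GB of public datasets? Also, was the 4.3 GB encrypted_traffic_burst.txt file generated from the full 30 GB of data? Thank you.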

@linwhitehat linwhitehat added help wanted Extra attention is needed question Further information is requested labels Nov 18, 2022
@linwhitehat
Owner

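1. No particular constraints were imposed when selecting the pre-training data, so you can substitute traffic that covers as wide a variety of protocols as possible.
2. encrypted_traffic_burst.txt was generated from the pre-training data.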

@wyx502
Author

wyx502 commented Nov 23, 2022

image
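Sorry to bother you again. The main README says to use vocab_process/main.py to generate the corpora, while the data_process README says the pre-training stage uses data_generation to generate bursts. Which of the two actually produces encrypted_traffic_burst.txt? I ask because I found that both can generate txt files. Could you explain in more detail? Thank you.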

@linwhitehat
Owner

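Hello. data_process generates the traffic burst data needed for the pre-training corpora, and vocab_process then generates the corresponding corpora from it. You can think of the two parts as traffic data preprocessing and pre-training data generation, respectively.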
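
For readers who want a concrete picture of that two-stage flow, here is a minimal sketch. It is not the repository's code: the function names `packets_to_bursts` and `bursts_to_corpus_lines` are hypothetical, and both the burst definition (consecutive packets in the same direction) and the hex-token line format are assumptions made for illustration.

```python
from itertools import groupby

# Hypothetical stage 1 (stands in for data_process): group consecutive
# same-direction packets of one flow into bursts. The burst definition
# used here is an assumption, not taken from the repository.
def packets_to_bursts(packets):
    """packets: list of (direction, payload_bytes) tuples in arrival order."""
    bursts = []
    for _, group in groupby(packets, key=lambda p: p[0]):
        # Concatenate the payloads of one run of same-direction packets.
        bursts.append(b"".join(payload for _, payload in group))
    return bursts

# Hypothetical stage 2 (stands in for vocab_process): write each burst as
# one text line of space-separated hex byte tokens (assumed format).
def bursts_to_corpus_lines(bursts):
    return [" ".join(f"{byte:02x}" for byte in burst) for burst in bursts]

# Toy input: two outgoing packets followed by one incoming packet.
packets = [("out", b"\x16\x03"), ("out", b"\x01"), ("in", b"\x16\x03\x03")]
bursts = packets_to_bursts(packets)
lines = bursts_to_corpus_lines(bursts)
print("\n".join(lines))
```

The point of the split is that the first stage works on raw traffic and the second only on its output, so either stage can be rerun or replaced independently.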
