The paper mentions that the pre-training dataset totals 30 GB, with 15 GB from public datasets and 15 GB from CSTNET. On the UNB website I found many datasets such as ISCXVPN2016 and ISCXTor2016. Could the authors clarify which datasets make up the 15 GB of public data, or what processing was applied to them? Also, is there any way to obtain the 15 GB non-public CSTNET dataset? Thanks.
Hello, I found the figure you posted above and would like to confirm: is that the 15 GB of public datasets? Also, was the 4.3 GB encrypted_traffic_burst.txt file generated from the 30 GB of data? Thanks.
1. There are no particular constraints on what goes into the pre-training dataset, so you can substitute traffic covering as rich a set of protocols as possible. 2. encrypted_traffic_burst.txt is generated from the pre-training data.
Sorry to bother you again. The main README says to use vocab_process/main.py to generate the corpora, while the data_process README says the pre-training stage uses data_generation to generate bursts. Which one actually produces encrypted_traffic_burst.txt? I found that both can generate txt files. Could you explain this in more detail? Thanks.
Hi, data_process generates the traffic burst data required for the pre-training corpora, and vocab_process then generates the corresponding corpora from that output. You can think of the two parts as traffic-data preprocessing and pre-training data generation, respectively.
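For readers following along, here is a minimal sketch of the two-stage idea described above (raw pcaps → burst text → corpora input). The packet-grouping logic, function name `pcap_to_bursts`, and output file name are illustrative assumptions only, not the repository's actual implementation; refer to data_process and vocab_process/main.py in the repo for the real scripts.

```python
# Illustrative sketch (assumption): group consecutive same-direction packets
# into "bursts" and write their payload hex, one burst per line, roughly
# mirroring the preprocessing step that precedes corpora generation.
from scapy.all import rdpcap, IP, TCP, UDP

def pcap_to_bursts(pcap_path, out_path="encrypted_traffic_burst.txt"):
    packets = rdpcap(pcap_path)
    bursts, current, last_dir = [], [], None
    for pkt in packets:
        if not pkt.haslayer(IP):
            continue
        if pkt.haslayer(TCP):
            payload = bytes(pkt[TCP].payload)
        elif pkt.haslayer(UDP):
            payload = bytes(pkt[UDP].payload)
        else:
            continue
        if not payload:
            continue
        direction = (pkt[IP].src, pkt[IP].dst)   # packet direction within the capture
        if last_dir is not None and direction != last_dir:
            bursts.append(current)               # direction changed: close current burst
            current = []
        current.append(payload.hex())
        last_dir = direction
    if current:
        bursts.append(current)
    with open(out_path, "w") as f:
        for burst in bursts:                     # one burst (concatenated payload hex) per line
            f.write("".join(burst) + "\n")

# Example usage: pcap_to_bursts("example.pcap")
```

The resulting burst text file would then be fed to the vocabulary/corpora generation step (vocab_process) to produce the pre-training corpora.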