
Can we upload our own dataset? #4

Closed
sandro272 opened this issue Jul 25, 2019 · 5 comments

Comments

@sandro272

Do you have scripts available, or any easy way, to convert raw data into your processed dataset files, so that I can test your model on my own dataset?

@svjan5
Member

svjan5 commented Jul 25, 2019

Hi @sandro272,
By dataset, do you mean the training dataset (Wikipedia corpus) or the evaluation data?

@sandro272
Author

@svjan5 I mean that I want to use our own dataset, so could you provide a script or method that converts raw data into your processed data (e.g. voc2id.txt, etc.)? Thank you!

@svjan5
Member

svjan5 commented Jul 26, 2019

Ok, got it. Actually, I cannot give a script for that because it requires getting a dependency parse of the text, which requires Stanford CoreNLP. So you first need to obtain a dependency parse of the text; after that, I think everything is quite straightforward. voc2id.txt contains the mapping of tokens to their unique ids, and data.txt contains the listing of tokens and dependency parse edges for each sentence in the corpus. Let me know if you face any difficulty in the whole process.
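The steps described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual preprocessing code: the exact on-disk formats of voc2id.txt and data.txt are assumptions here, and the hardcoded parses stand in for real output from Stanford CoreNLP.

```python
# Hypothetical sketch of the preprocessing described above.
# Assumption: each sentence comes with its tokens and dependency edges
# (head index, dependent index, relation label), e.g. from Stanford CoreNLP.
parsed_sentences = [
    (["the", "cat", "sat"], [(1, 0, "det"), (2, 1, "nsubj")]),
    (["dogs", "bark"], [(1, 0, "nsubj")]),
]

# Build the token -> unique id mapping (voc2id.txt).
voc2id = {}
for tokens, _ in parsed_sentences:
    for tok in tokens:
        voc2id.setdefault(tok, len(voc2id))

# Assumed format: one "token<TAB>id" line per unique token.
with open("voc2id.txt", "w") as f:
    for tok, idx in voc2id.items():
        f.write(f"{tok}\t{idx}\n")

# Assumed format for data.txt: token ids for the sentence,
# then the sentence's dependency edges, one sentence per line.
with open("data.txt", "w") as f:
    for tokens, edges in parsed_sentences:
        ids = " ".join(str(voc2id[t]) for t in tokens)
        edge_str = " ".join(f"{h},{d},{lbl}" for h, d, lbl in edges)
        f.write(f"{ids}|{edge_str}\n")
```

The only non-trivial step in practice is the first one (running the dependency parser); once edges are available per sentence, writing both files is a simple pass over the corpus.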

@sandro272
Author

@svjan5 OK, thank you!

