Changing Dataset #4

Closed
camelot2002 opened this issue Aug 1, 2021 · 7 comments

@camelot2002

I want to change the dataset, but I can't work out how the document IDs are mapped to the documents. A short clarification of that in README.md would be really helpful.
Thank you.

@maifeng
Collaborator

maifeng commented Aug 1, 2021

The document ids are either unique IDs provided by the data vendor or they can be incremental IDs. If you have a CSV file with no other unique identifiers, you can save the row numbers as the document IDs.
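A minimal sketch of the row-number approach, assuming a hypothetical CSV with a `text` column. The column name, the sample rows, and the helper `row_number_ids` are illustrative assumptions, not part of this repository.

```python
import csv
import io

def row_number_ids(csv_file, text_column="text"):
    """Return (ids, texts), using the 1-based row number as each document ID."""
    reader = csv.DictReader(csv_file)
    ids, texts = [], []
    for row_number, row in enumerate(reader, start=1):
        ids.append(str(row_number))      # incremental ID for this row
        texts.append(row[text_column])   # the document body for this row
    return ids, texts

# Hypothetical in-memory CSV standing in for a real file.
sample = io.StringIO(
    "ticker,text\n"
    "AAA,First call transcript.\n"
    "BBB,Second call transcript.\n"
)
ids, texts = row_number_ids(sample)
```

Any other column that is unique per row (a vendor ID, for instance) could be used in place of the row number.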

@maifeng maifeng closed this as completed Aug 1, 2021
@camelot2002
Author

I don't have a CSV file; all I have is the data.

@camelot2002
Author

camelot2002 commented Aug 1, 2021

I have a ticker to tell the companies apart. But in your CSV files, one document seems to have multiple document IDs, and I don't understand how a document has been broken down.

@maifeng
Collaborator

maifeng commented Aug 1, 2021

One input document corresponds to one unique ID. The document file has the same number of rows as the document-ID file.
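As a rough illustration of that one-to-one rule, the check below pairs the i-th ID with the i-th document and fails if the counts differ or an ID repeats. The function name `check_alignment` and the sample IDs are hypothetical, not taken from this repository.

```python
def check_alignment(documents, document_ids):
    """Return {id: document}, or raise ValueError if the two lists do not line up."""
    if len(documents) != len(document_ids):
        raise ValueError(
            f"{len(documents)} documents but {len(document_ids)} IDs"
        )
    if len(set(document_ids)) != len(document_ids):
        raise ValueError("document IDs are not unique")
    # The i-th ID belongs to the i-th document.
    return dict(zip(document_ids, documents))

mapping = check_alignment(
    ["first document text", "second document text"],
    ["AAA_2020", "BBB_2020"],
)
```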

@camelot2002
Author

camelot2002 commented Aug 1, 2021

The document.txt in the input folder contains several documents, right? Each line has a unique ID, and each document also has a unique ID. How does it tell the different documents apart in all that text?

@maifeng
Collaborator

maifeng commented Aug 1, 2021

Each line in document.txt is a unique document with line breaks removed.
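For raw text that is not yet in this shape, one possible way to prepare it is sketched below: collapse each document onto a single line and write the IDs to a parallel file in the same order. The ID file name `document_ids.txt`, the helpers `flatten` and `write_corpus`, and the ticker-based IDs are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

def flatten(text):
    """Collapse line breaks and runs of whitespace so the document fits on one line."""
    return " ".join(text.split())

def write_corpus(docs_by_id, out_dir):
    """Write one document per line, plus a parallel file with one ID per line."""
    out_dir = Path(out_dir)
    ids = list(docs_by_id)
    lines = [flatten(docs_by_id[doc_id]) for doc_id in ids]
    (out_dir / "document.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    # Assumed file name for the ID file; adjust to whatever the project expects.
    (out_dir / "document_ids.txt").write_text("\n".join(ids) + "\n", encoding="utf-8")
    return lines

# Hypothetical documents keyed by ticker.
docs = {
    "AAA": "Good morning.\nWelcome to the call.",
    "BBB": "Revenue grew\n\nthis quarter.",
}
out_dir = tempfile.mkdtemp()
lines = write_corpus(docs, out_dir)
```

A ticker only works as an ID if each company contributes a single document; with several documents per company, something like ticker plus date would be needed to keep the IDs unique.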

@camelot2002
Author

Okay, thank you.
