Allow multiple tokens per feature data row #99

Open
de-code opened this issue Apr 1, 2020 · 3 comments

Comments

@de-code
Contributor

de-code commented Apr 1, 2020

This is carried over from #90 (comment)

Since the segmentation data uses the first two tokens of each line, it would make sense to have an option to use both of them in DeLFT. Currently only the first one is used.

Potential solution:

  • an option to specify the columns with the tokens (similar to the features)
  • concatenate the word embeddings and other token related vectors

This would probably require changing a few places that expect a single token as input.
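
To make the proposal concrete, here is a minimal sketch of the two bullet points: selecting token columns by index and concatenating their embeddings. It is not DeLFT code. The names `EMBEDDINGS`, `embed_token`, `embed_row` and `token_indices`, the toy vectors, and the feature values in the example row are all hypothetical and only there for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical toy embedding table; a real setup would load pretrained vectors.
EMBEDDINGS: Dict[str, List[float]] = {
    "Abstract": [0.1, 0.2],
    "Introduction": [0.3, 0.4],
    "1": [0.5, 0.6],
}
# Fallback for tokens that are not in the table.
UNKNOWN_VECTOR: List[float] = [0.0, 0.0]


def embed_token(token: str) -> List[float]:
    """Look up a single token, falling back to a zero vector."""
    return EMBEDDINGS.get(token, UNKNOWN_VECTOR)


def embed_row(row: Sequence[str], token_indices: Sequence[int]) -> List[float]:
    """Concatenate the embeddings of the selected token columns of one data row."""
    vector: List[float] = []
    for index in token_indices:
        vector.extend(embed_token(row[index]))
    return vector


# Made-up data row: the first two columns are tokens, the rest are features.
row = ["1", "Introduction", "LINESTART", "NOPUNCT"]

# Current behaviour: only the first column is treated as the token.
single = embed_row(row, token_indices=[0])

# Proposed option: use the first two columns and concatenate their vectors.
double = embed_row(row, token_indices=[0, 1])
```

With two token columns the per-row vector is twice as long, which is why anything downstream that assumes a fixed single-token embedding size would need to be adjusted.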

/cc @kermitt2 @lfoppiano

@de-code
Contributor Author

de-code commented Apr 7, 2020

I have now implemented something in: elifesciences/sciencebeam-trainer-delft#185

I also included low-level results. I am not sure whether they are conclusive, as I only have a single run with the updated dataset (which has line numbers removed). There seems to be a difference of about 1 percentage point.

@lfoppiano
Collaborator

We can think about it once the features channel is merged.

@de-code
Contributor Author

de-code commented Aug 4, 2020

Related to that, for the segmentation model I have now implemented an optional feature where I add the whole line as a separate feature (at the end). It is then tokenized within DeLFT, or left untokenized if only character features are used. I could see a slight improvement with max chars set to 30, for example.
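
As a rough illustration of the character-only case, the sketch below encodes a whole line as a fixed-length sequence of character indices, truncated or padded to a maximum length. This is an assumption about how such a feature could be encoded, not the code from the linked implementation. `build_char_vocab`, `encode_line_chars`, `PAD_INDEX` and `UNKNOWN_INDEX` are hypothetical names.

```python
from typing import Dict, List

# Hypothetical reserved indices for padding and unseen characters.
PAD_INDEX = 0
UNKNOWN_INDEX = 1


def build_char_vocab(lines: List[str]) -> Dict[str, int]:
    """Assign an index (starting at 2) to every character seen in the lines."""
    vocab: Dict[str, int] = {}
    for line in lines:
        for char in line:
            vocab.setdefault(char, len(vocab) + 2)
    return vocab


def encode_line_chars(
    line: str, vocab: Dict[str, int], max_chars: int = 30
) -> List[int]:
    """Encode a whole line as exactly max_chars character indices."""
    # Truncate to the first max_chars characters of the line.
    indices = [vocab.get(char, UNKNOWN_INDEX) for char in line[:max_chars]]
    # Pad shorter lines so every row has the same length.
    indices.extend([PAD_INDEX] * (max_chars - len(indices)))
    return indices


vocab = build_char_vocab(["ab"])
short_line = encode_line_chars("abz", vocab, max_chars=5)
long_line = encode_line_chars("ab" * 40, vocab)
```

Here `short_line` is padded up to 5 entries, while `long_line` is cut down to the default of 30, so only the start of each line contributes to the feature.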

Related PRs:
