In your experience, performance does not improve when adding more than a certain number of sequences (around 1500-2000, if I remember correctly). I have also experienced segmentation faults from Wapiti when I tried with more nonetheless, although that might have been related to structural issues in the data (such as leading or trailing <note> fields).
This means that just throwing more and more training data at the algorithm is not the smart way. Instead, the task would be to select those sequences that carry the most new information (highest entropy) and to toss out those which merely repeat what has already been learned.
I am not a CS person so I am not in a good position to figure out how to do this, but maybe someone has an idea how this could be implemented with AnyStyle.
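For what it's worth, selecting the most informative sequences is usually framed as uncertainty sampling in the active-learning literature: score each candidate by how uncertain the current model is about its labels, and keep the most uncertain ones. This is not something AnyStyle exposes today; the sketch below is a hypothetical illustration in Python that assumes the trained model can provide per-token marginal label probabilities (which a CRF like Wapiti can compute in principle):

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of one probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def select_most_informative(sequences, k):
    """Rank candidate sequences by the model's average per-token
    label entropy and keep the k most uncertain ones.

    `sequences` is a list of (sequence_id, token_label_probs) pairs,
    where token_label_probs holds one probability distribution over
    labels per token. Both names are hypothetical, for illustration.
    """
    scored = []
    for seq_id, token_probs in sequences:
        avg_h = sum(entropy(p) for p in token_probs) / len(token_probs)
        scored.append((avg_h, seq_id))
    scored.sort(reverse=True)  # most uncertain first
    return [seq_id for _, seq_id in scored[:k]]

# Toy example: the model is confident about the first sequence and
# uncertain about the second, so the second is the one worth keeping.
confident = ("seq-a", [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]])
uncertain = ("seq-b", [[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]])
print(select_most_informative([confident, uncertain], k=1))  # → ['seq-b']
```

In practice this would loop: train on a seed set, score the remaining sequences, add the top-k, and retrain, which would also keep the training set well under whatever limit triggers the Wapiti crashes.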