As discussed with Robert, we should remove all references to context_type in the module to dramatically simplify the code. After using the package, it's become clear that the workflow often involves creating a new Corpus object per analysis, rather than a single multiply realizable Corpus object shared across book-, page-, and sentence-level analyses.
Upon further reflection and usage, context_type is incredibly useful. However, the concept was never formalized. One way I've started thinking about it is as a four-level tokenization hierarchy (a rough sketch follows the list):
folder
file
paragraph
sentence
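
To pin the concept down, here is a minimal sketch in plain Python (not vsm's actual API; the function name, the `.txt` layout, and the blank-line/regex splitting rules are all assumptions) of how one flat token stream can carry all four levels as index boundaries, which is essentially what distinct context_type entries provide:

```python
import os
import re

def tokenize_hierarchy(root):
    """Hypothetical sketch: flatten a folder of .txt files into one word
    list, recording where each folder, file, paragraph, and sentence ends."""
    words = []
    # one boundary list per level of the hierarchy
    boundaries = {'folder': [], 'file': [], 'paragraph': [], 'sentence': []}
    for folder, _, files in sorted(os.walk(root)):
        for name in sorted(f for f in files if f.endswith('.txt')):
            with open(os.path.join(folder, name)) as fp:
                text = fp.read()
            for para in text.split('\n\n'):
                for sent in re.split(r'(?<=[.!?])\s+', para):
                    words.extend(sent.split())
                    boundaries['sentence'].append(len(words))
                boundaries['paragraph'].append(len(words))
            boundaries['file'].append(len(words))
        boundaries['folder'].append(len(words))
    return words, boundaries
```

Each boundary list partitions the same words array, so a corpus tokenized at the file level and one tokenized at the sentence level differ only in which boundary list they use, not in the tokens themselves.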
One interesting idea for an experiment is to take vsm.extensions.corpusbuilders.collection_corpus and extend it to support multiple indices over the same token range, allowing greater hierarchy depth. For example, one could model a title and its individual volumes (e.g., title/vol0, title/vol1, ...) and compare the individual volumes to the aggregate whole. At the file level, each document would only be modeled once.
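
A rough sketch of that overlapping-index idea, again in plain Python rather than vsm.extensions.corpusbuilders itself; `add_title`, `view`, and the (start, stop, label) layout are illustrative assumptions:

```python
def add_title(tokens, contexts, title, volumes):
    """Hypothetical helper: append one multi-volume title to a flat token
    list, recording a 'volume' index entry per volume and a single 'title'
    entry spanning the same range, so no tokens are stored twice."""
    start = len(tokens)
    for i, vol in enumerate(volumes):
        vol_start = len(tokens)
        tokens.extend(vol)
        contexts.setdefault('volume', []).append(
            (vol_start, len(tokens), '%s/vol%d' % (title, i)))
    contexts.setdefault('title', []).append((start, len(tokens), title))

def view(tokens, contexts, context_type):
    """Return (label, token slice) pairs for one level of the index."""
    return [(label, tokens[i:j]) for i, j, label in contexts[context_type]]

tokens, contexts = [], {}
add_title(tokens, contexts, 'title0',
          [['the', 'first', 'volume'], ['and', 'the', 'second']])
# 'title' and 'volume' are two indices over the same token positions
print(view(tokens, contexts, 'title'))
print(view(tokens, contexts, 'volume'))
```

Because the aggregate and its volumes point at the same token positions, the tokens themselves aren't duplicated; whether the model can treat the overlapping contexts as independent documents is the separate problem noted below.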
There are complications: the viewer doesn't allow one to take the same tokens and examine them at a finer granularity, due to the bag-of-words assumption underlying LDA. Furthermore, each multi-volume document would effectively be added to the corpus twice, once as the aggregate and once per individual volume, which brings back the multiple-copies modeling issue and further violates the textual independence assumptions.