Refactor `context_type` #95

JaimieMurdock · 2015-01-30T19:08:40Z

As discussed with Robert, we should remove all references to `context_type` in the module to dramatically simplify the code. After using the package, it's become clear that the usual workflow creates a new Corpus object for each analysis, rather than one Corpus object realized at multiple levels for book-, page-, and sentence-level analyses.


JaimieMurdock · 2015-05-30T16:21:40Z

Upon further reflection and usage, `context_type` is incredibly useful. However, the concept was never formalized. One way I've started thinking about it is as a four-level tokenization hierarchy:

folder
file
paragraph
sentence
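To make the hierarchy concrete, here is a minimal sketch of one way it could be represented: a single flat token stream plus, for each level, the offsets where that level's units end. The `tokens` data, the `boundaries` layout, and the `view` helper are hypothetical illustrations, not the package's actual data structures.

```python
# Flat token stream shared by every level of the hierarchy (toy data).
tokens = ["the", "cat", "sat", "it", "purred", "dogs", "bark", "loudly"]

# Hypothetical layout: for each level, the exclusive end offset of each unit.
boundaries = {
    "folder": [8],
    "file": [5, 8],
    "paragraph": [3, 5, 8],
    "sentence": [3, 5, 6, 8],
}


def view(level):
    """Split the shared token stream into the units of one level."""
    ends = boundaries[level]
    starts = [0] + ends[:-1]
    return [tokens[s:e] for s, e in zip(starts, ends)]


for level in ("folder", "file", "paragraph", "sentence"):
    print(level, len(view(level)))
```

Every level is a different segmentation of the same tokens, so moving between granularities means choosing a different boundary list rather than building a new token array.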

One interesting idea for an experiment is to take `vsm.extensions.corpusbuilders.collection_corpus` and extend it to allow multiple indices over the same range, giving greater hierarchy depth. For example, one could model a title and its individual volumes (e.g., title/vol0, title/vol1, ...) and compare the individual volumes to the aggregate whole. At the file level, each document would only be modeled once.
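As a rough illustration of multiple indices over the same range, the sketch below keeps one token array and two sets of labelled spans over it, one for the whole title and one for its volumes. It does not use or modify `collection_corpus`; the `indices` layout, the labels, and the `counts` helper are assumptions made up for this example.

```python
from collections import Counter

# One shared token array for a two-volume title (toy data).
tokens = ["whale", "sea", "ship", "whale", "sea", "storm", "whale"]

# Hypothetical labelled spans (label, start, end) over the same token range.
indices = {
    "title": [("title", 0, 7)],
    "volume": [("title/vol0", 0, 4), ("title/vol1", 4, 7)],
}


def counts(level):
    """Word counts for each labelled span at the given level."""
    return {label: Counter(tokens[s:e]) for label, s, e in indices[level]}


title_counts = counts("title")["title"]
volume_counts = counts("volume")

# The volumes partition the title's range, so their counts sum to the aggregate.
total = sum(volume_counts.values(), Counter())
print(total == title_counts)
```

Here the tokens are stored only once and each index is just another way of slicing them, which is what would let an individual volume be compared against the aggregate.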

There are complications: the viewer doesn't allow one to look at the same tokens at a finer granularity, due to the bag-of-words assumption underlying LDA. Furthermore, each multi-volume document would be added to the corpus twice, once as the aggregate and once as its individual volumes, which brings back the multiple-copies modeling issue and further violates textual independence assumptions.
