Refactor context_type #95

Open
JaimieMurdock opened this issue Jan 30, 2015 · 1 comment

Comments

@JaimieMurdock
Member

As discussed with Robert, we should remove all references to context_type in the module to dramatically simplify the code. After using the package, it has become clear that in practice one often creates a new Corpus object per workflow, rather than keeping multiply realizable Corpus objects for book-, page-, and sentence-level analyses.

@JaimieMurdock
Member Author

Upon further reflection and usage, context_type is incredibly useful. However, the concept was never formalized. One way I've started thinking about it is as a four-level tokenization hierarchy:

  1. folder
  2. file
  3. paragraph
  4. sentence
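A minimal sketch of how such a hierarchy could be formalized, assuming each level is stored as a list of exclusive end offsets into one shared token list. The names `context_ends`, `view_contexts`, and `is_nested` are hypothetical and only for illustration; they are not the package's API.

```python
# Hypothetical toy corpus: one flat token list shared by every level.
tokens = ["a", "b", "c", "d", "e", "f", "g", "h"]

# Hypothetical mapping: context_type -> exclusive end offsets into `tokens`.
# Every boundary of a coarser level is also a boundary of each finer level.
context_ends = {
    "folder": [8],
    "file": [5, 8],
    "paragraph": [3, 5, 8],
    "sentence": [2, 3, 5, 6, 8],
}


def view_contexts(tokens, ends):
    """Split the flat token list into contexts at the given end offsets."""
    starts = [0] + ends[:-1]
    return [tokens[s:e] for s, e in zip(starts, ends)]


def is_nested(coarse, fine):
    """Check that every coarse boundary is also a fine boundary."""
    return set(coarse) <= set(fine)


files = view_contexts(tokens, context_ends["file"])
sentences = view_contexts(tokens, context_ends["sentence"])
```

Under this assumption, "formalizing" the hierarchy amounts to requiring that `is_nested` holds for each adjacent pair of levels.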

One interesting idea for an experiment is to take vsm.extensions.corpusbuilders.collection_corpus and extend it to allow multiple indices over the same range, giving greater hierarchy depth. For example, one could model a title and its individual volumes (e.g., title/vol0, title/vol1, ...) and compare the individual volumes to the aggregate whole. At the file level, each document would only be modeled once.
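To make "multiple indices over the same range" concrete, here is a rough sketch that does not use collection_corpus itself. The `spans` layout and the `bag_of_words` helper are hypothetical; the point is only that a title-level index and a volume-level index can address the same stored tokens.

```python
from collections import Counter

# Hypothetical tokens of one multi-volume title, stored once.
tokens = ["whale", "sea", "ship", "whale", "sea", "whale"]

# Hypothetical: two index sets over the same token range,
# each mapping a label to a (start, end) span with exclusive end.
spans = {
    "title": {"title": (0, 6)},
    "volume": {"title/vol0": (0, 3), "title/vol1": (3, 6)},
}


def bag_of_words(tokens, span):
    """Count the tokens inside one (start, end) span."""
    start, end = span
    return Counter(tokens[start:end])


aggregate = bag_of_words(tokens, spans["title"]["title"])
volumes = {
    label: bag_of_words(tokens, span)
    for label, span in spans["volume"].items()
}
```

Because both index sets point at the same tokens, the per-volume counts sum back to the aggregate counts, which is what would make a volume-versus-whole comparison well defined.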

There are complications: the viewer doesn't allow one to take the same tokens and look at them at a finer granularity, due to the bag-of-words assumption underlying LDA. Furthermore, each multi-volume document would be added to the corpus twice, once as the aggregate and once as its individual volumes, returning us to the multiple-copies modeling issue, which violates further textual independence assumptions.
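The double-counting concern can be illustrated with a toy example, assuming the aggregate and the individual volumes are all added as separate bag-of-words documents. The data and the `corpus_counts` helper are hypothetical.

```python
from collections import Counter

# Hypothetical volumes of one title.
vol0 = ["whale", "sea", "ship"]
vol1 = ["whale", "sea", "whale"]


def corpus_counts(documents):
    """Total word counts a bag-of-words model would see across documents."""
    total = Counter()
    for doc in documents:
        total.update(doc)
    return total


# Modeling only the volumes: each token is seen once.
volumes_only = corpus_counts([vol0, vol1])

# Modeling the aggregate plus each volume: each token is seen twice.
with_aggregate = corpus_counts([vol0 + vol1, vol0, vol1])
```

In this sketch every word of the multi-volume title carries twice the weight it would have if modeled once, while single-volume documents elsewhere in the corpus would not be inflated.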
