Context-dependent word level estimates #734
Conversation
Codecov Report

```
@@           Coverage Diff           @@
##           master     #734   +/-   ##
=======================================
  Coverage   84.67%   84.67%
=======================================
  Files          67       67
  Lines        3106     3106
=======================================
  Hits         2630     2630
  Misses        476      476
```
@tyarkoni (cc @rbroc) I notice BERT returns a list in a single field, rather than one value per column. Currently, we store all values as strings in the db, so we have to be careful about how we handle this. Would splat-ing this array onto individual columns be pybids' job, or should custom neuroscout logic prior to ingestion ingest them as separate features? We could also allow a […]. Otherwise, I've tested the BERT extractor, and the windowing logic is implemented, so once we figure that out we should be ready to go.
I see this as NeuroScout's job to handle. @rbroc and I talked it over, and from a pliers standpoint, I don't like representing each dimension as its own feature, because the user will almost always want to use that vector as a single variable. So I think probably you'll need to break this up during ingestion. (I agree that passing it to PyBIDS to deal with doesn't make sense.)
I suppose we could consider adding an initialization argument to the pliers extractor that controls whether the embedding is returned in one field or one per dimension. I'd be fine with that if it turns out to be a PITA to implement the custom logic in NeuroScout (though it doesn't seem unreasonable to me to think this is something we might eventually want anyway, for other features). If it's not too much work, I think I'd prefer sticking with doing it in NeuroScout. I wonder if we could just adopt a universal heuristic that all iterable values found in pliers features are to be broken up and serially numbered during ingestion into NeuroScout. Are there cases where that policy would fail us?
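For concreteness, a minimal sketch of that heuristic (the function name and the `.N` suffix convention are illustrative, not NeuroScout's actual ingestion code):

```python
# Sketch of the proposed ingestion heuristic: any iterable value is broken
# up into serially numbered scalar sub-features; scalars pass through as-is.
def flatten_feature(name, value):
    """Yield (feature_name, scalar_value) pairs for one extracted feature."""
    if isinstance(value, (list, tuple)):
        for i, v in enumerate(value):
            yield f"{name}.{i}", v
    else:
        yield name, value

# A 768-d BERT embedding would become bert.0 ... bert.767; here, a toy 3-d one:
print(list(flatten_feature("bert", [0.12, -0.53, 0.08])))
# [('bert.0', 0.12), ('bert.1', -0.53), ('bert.2', 0.08)]
```

One case where a naive version of the policy would fail: strings are iterable too, so the check has to be restricted to lists/tuples (as above) rather than to any iterable.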
I think your last suggestion probably works. If the value is a list, create one serially numbered feature per element. The main downside is that if we ever want to include these features on the frontend, it will be unwieldy. In that case, we'd need a way to group features.
Also, it seems unlikely that you'd want to use fewer than all of the embedding dimensions, so having them as individual features is less useful. That said, individual dimensions are what they literally are, so it's consistent with the way we do things.
I think we'll probably want to include all of the embedding dimensions in most models where any are included, so I agree that it would be good to have some front-end short-cut for grabbing/grouping all of them (while also allowing the user to select individual ones). But I guess this isn't really any different from how we handle confounds, right (except for there being more dimensions)? Do we currently require users to check each confound box individually, or can they just "select all" to do it once?
After removing the […], a single extraction of a […] yields about 1461 events per run, which is less than one per second on average. This actually seems fairly reasonable. Obviously, the events will be higher frequency when speech is occurring, with stretches of no speech in between. For now, I will try rounding values to the 5th decimal place to see how this affects the size, then run models with both and compare.
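As a rough illustration of why rounding shrinks storage when values are kept as strings, a toy comparison (the 768-dimension count comes from the later comments; the data here is random):

```python
# Toy comparison: serialized size of one 768-d embedding, full precision
# vs. rounded to 5 decimal places.
import json
import random

embedding = [random.uniform(-1, 1) for _ in range(768)]

full = json.dumps(embedding)
rounded = json.dumps([round(v, 5) for v in embedding])

print(len(full), len(rounded))  # rounding cuts the string size roughly in half
```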
Looks like rounded and unrounded values result in identical f-test images. So much so that I think we can round even more if necessary. |
Okay, looks like I'm using the right column type for […]. However, the problem is that each row has about ~72 bytes of overhead, including the onset, duration, object_id, foreign keys, and the necessary indexes (primary key, foreign keys). That means, assuming ~90 bytes per row: 90 bytes * 768 embedding dimensions * 1400 words = 96,768,000 bytes ≈ 96.77 MB per run. Assuming we do this for all runs with speech in Neuroscout (~25), that's about 2.5 GB. Not totally crazy, but not very efficient. For reference, if we were to store all dimensions in one row as a string (which we then parse), it would only be about 164 MB for all datasets.
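The arithmetic above, spelled out (numbers taken directly from the comment):

```python
# Back-of-the-envelope storage estimate for one-row-per-dimension storage.
bytes_per_row = 90      # value plus ~72 bytes of row overhead and indexes
dims = 768              # BERT embedding dimensions
words = 1400            # approximate words per run
runs = 25               # runs with speech in Neuroscout (approximate)

per_run = bytes_per_row * dims * words   # 96,768,000 bytes ≈ 96.77 MB
total = per_run * runs                   # ≈ 2.42 GB, i.e. "about 2.5 GB"
print(f"{per_run / 1e6:.2f} MB per run, {total / 1e9:.2f} GB total")
```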
Another solution is to store values for predictors we know to be large and repetitive as a […]. If stored as a list of lists, we would use about 25% of the size it would take to store them as proper rows, and minimize the custom code needed to deal with this (only upon dumping to […]).
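A sketch of what single-row, list-of-lists storage could look like (the row layout and column type are hypothetical, not the actual schema):

```python
# Hypothetical single-row storage for a large, dense predictor: all of a
# run's events serialized together, parsed only when dumping to output files.
import json

# One entry per event: [onset, duration, dim_0, ..., dim_767]
events = [
    [0.5, 0.30] + [0.12] * 768,
    [0.9, 0.25] + [-0.07] * 768,
]
blob = json.dumps(events)  # stored in a single TEXT/JSON column

# Only the dump step needs custom code: expand the blob back into event rows.
for onset, duration, *values in json.loads(blob):
    ...  # write one output row per event
```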
Looks like everything went well with running the BERT analysis, so ready to merge. |
Closes #713
My preferred solution is one that is slightly hacky but fits in with the current way of doing things as much as possible.
When ingesting transcripts, each word is represented in an abstract sense, with no timing information. Here, I'm adding a representation of the whole transcript, with `content` including the whole text for a run. These are differentiated by a mimetype of `text/csv` (to indicate multiple rows) instead of `text/plain`.

At runtime, these will get recreated into `ComplexTextStim` objects with word-level onsets (by querying the db). Then on extraction, extractors that require the whole transcript will have to be denoted in a different way. I'm thinking this may be worth having its own field in the `JSON` object.

In terms of tokenization, we also need to specify ways to split this `ComplexTextStim` for approaches such as sliding windows, or even for those that require the whole text.

Still need to:

- Recreate `ComplexTextStim` objects from the db for existing datasets/transcripts (see the sketch below)
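For reference, a rough sketch of that reconstruction step, assuming pliers' `TextStim`/`ComplexTextStim` API and hypothetical (word, onset, duration) rows from the db:

```python
# Sketch: rebuild a ComplexTextStim from stored word-level rows.
# The row format below is illustrative, not the actual db schema.
from pliers.stimuli import ComplexTextStim, TextStim

rows = [("the", 0.5, 0.20), ("quick", 0.7, 0.30), ("fox", 1.0, 0.25)]

stim = ComplexTextStim(
    elements=[TextStim(text=w, onset=o, duration=d) for w, o, d in rows]
)
```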