
Context-dependent word level estimates #734

Merged
merged 49 commits into master from enh/complex_text on Feb 18, 2020

Conversation

adelavega
Collaborator

@adelavega adelavega commented Jan 31, 2020

Closes #713

My preferred solution is one that is slightly hacky but fits in with the current way of doing things as much as possible.

When ingesting transcripts, each word is represented in an abstract sense, with no timing information. Here, I'm adding a representation of the whole transcript, whose content is the complete text for a run. These are differentiated by a mimetype of text/csv (to indicate multiple rows) instead of text/plain.

At runtime, these will be recreated into ComplexTextStimuli with word-level onsets (by querying the db).

Then, on extraction, extractors that require the whole transcript will have to be denoted in a different way. I'm thinking this may be worth its own field in the JSON object.

In terms of tokenization, we also need to specify ways to split this ComplexTextStim for approaches such as a sliding window, or even for those that require the whole text.

Still need to:

  • Recreate ComplexTextStimuli from db for existing datasets/transcripts
  • Add custom extraction logic for these objects (recreate CTS object from db)
  • Implement tokenization/moving window logic.
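
For reference, a minimal sketch of the runtime reconstruction step described above, assuming pliers' TextStim/ComplexTextStim constructors and a hypothetical get_run_words query helper (not the actual Neuroscout implementation):

```python
# Minimal sketch (not the actual implementation): rebuild a ComplexTextStim
# from word-level rows queried out of the db. get_run_words is hypothetical
# and assumed to return (text, onset, duration) tuples for a run.
from pliers.stimuli import ComplexTextStim, TextStim

def rebuild_transcript(run_id, get_run_words):
    words = get_run_words(run_id)
    elements = [TextStim(text=w, onset=onset, duration=dur)
                for (w, onset, dur) in words]
    return ComplexTextStim(elements=elements)
```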

@codecov-io

codecov-io commented Feb 3, 2020

Codecov Report

Merging #734 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #734   +/-   ##
=======================================
  Coverage   84.67%   84.67%           
=======================================
  Files          67       67           
  Lines        3106     3106           
=======================================
  Hits         2630     2630           
  Misses        476      476
Impacted Files Coverage Δ
neuroscout/populate/extract.py 67.66% <ø> (ø) ⬆️
neuroscout/tasks/upload.py 54.92% <ø> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2b52961...8b5b481.

@adelavega
Collaborator Author

@tyarkoni (cc @rbroc) I notice BERT returns a list, in a single field, rather than n encoding columns.

Currently, we store all values as strings in the db, so we have to be careful about how we handle this. Should splat-ing this array onto individual columns be pybids' job, or should custom Neuroscout logic prior to ingestion ingest them as separate features? We could also allow a JSONB field to hold arrays of values, which would get split into individual columns at report/compile time. However, that would make it largely incompatible with the frontend, as the column names would be unknown ahead of time.

Otherwise, I've tested the BERT extractor, and the windowing logic is implemented, so once we figure that out we should be ready to go.

@tyarkoni

tyarkoni commented Feb 6, 2020

I see this as NeuroScout's job to handle. @rbroc and I talked it over, and from a pliers standpoint, I don't like representing each dimension as its own feature, because the user will almost always want to use that vector as a single variable. So I think probably you'll need to break this up during ingestion. (I agree that passing it to PyBIDS to deal with doesn't make sense.)

@tyarkoni

tyarkoni commented Feb 6, 2020

I suppose we could consider adding an initialization argument to the pliers extractor that controls whether the embedding is returned in one field or one per dimension. I'd be fine with that if it turns out to be a PITA to implement the custom logic in NeuroScout (though it doesn't seem unreasonable to me to think this is something we might eventually want anyway, for other features). If it's not too much work, I think I'd prefer sticking with doing it in NeuroScout.

I wonder if we could just adopt a universal heuristic that all iterable values found in pliers features are to be broken up and serially numbered during ingestion into NeuroScout. Are there cases where that policy would fail us?

@adelavega
Collaborator Author

I think your last suggestion probably works. If the value is a list, create n features that end with _{n}.

The main downside is that if we ever want to include these features on the frontend, it will be unwieldy. In that case, we'd need a way to group features.
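
A minimal sketch of that _{n} splitting heuristic (illustrative names only, not the actual ingestion code):

```python
# Minimal sketch of the proposed heuristic: list-valued features are split
# into serially numbered scalar features at ingestion time.
def split_iterable_values(name, value):
    if isinstance(value, (list, tuple)):
        return {f"{name}_{i}": v for i, v in enumerate(value)}
    return {name: value}

# e.g. a 768-dim BERT embedding becomes encoding_0 ... encoding_767
split_iterable_values("encoding", [0.12, -0.43, 0.98])
# {'encoding_0': 0.12, 'encoding_1': -0.43, 'encoding_2': 0.98}
```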

@adelavega
Collaborator Author

Also, it seems unlikely that you'd want to use only some of the embeddings, so having them as individual features is less useful. That said, that's what they literally are, so it's consistent with the way we do things.

@tyarkoni

tyarkoni commented Feb 6, 2020

I think we'll probably want to include all of the embedding dimensions in most models where any are included, so I agree that it would be good to have some front-end short-cut for grabbing/grouping all of them (while also allowing the user to select individual ones). But I guess this isn't really any different from how we handle confounds, right (except for there being more dimensions)? Do we currently require users to check each confound box individually, or can they just "select all" to do it once?

@adelavega
Collaborator Author

After removing the history column, the total size of the extracted_event table is 76M rows at 636MB.

A single extraction of a PretrainedBertExtractor creates around 1.1M rows which is about 93MB.

This is about 1461 events per run, which is less than one per second on average. That actually seems fairly reasonable. Obviously, the events will occur at a higher frequency when speech is occurring, with stretches of no speech in between.

For now, I will try rounding values to the 5th decimal place to see how this affects the size.

Then run models with both and compare.

@adelavega
Collaborator Author

adelavega commented Feb 11, 2020

Looks like rounded and unrounded values result in identical f-test images. So much so that I think we can round even more if necessary.

@adelavega
Collaborator Author

Okay, it looks like I'm using the right column type for value, and rounding did shorten the column's length/size on disk from about 20 bytes to 8-9 bytes.

However, the problem is that each row has ~72 bytes of overhead, including the onset, duration, object_id, foreign keys, and the necessary indexes (primary key, foreign keys).

That means, assuming 90 bytes per row * 768 embedding dimensions * 1400 words, we get 96,768,000 bytes ≈ 96.77 MB per run.

Assuming we do this for all runs with speech in Neuroscout (~25), that's about 2.5 GB. Not totally crazy, but not very efficient.

For reference, if we were to store all dimensions in one row as a string (which we then parse), it would only be about 164 MB for all datasets.
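
For reference, a quick back-of-the-envelope check of the numbers above, using only the figures from this comment:

```python
# Rough size estimate for storing each embedding dimension as its own row
# (all figures taken from the comment above).
bytes_per_row = 90        # ~72 bytes of row overhead plus the rounded value
dims = 768                # BERT embedding dimensions
words_per_run = 1400      # approximate words per run with speech
runs_with_speech = 25     # approximate number of runs with speech

per_run_bytes = bytes_per_row * dims * words_per_run
total_gb = per_run_bytes * runs_with_speech / 1e9

print(f"{per_run_bytes / 1e6:.2f} MB per run")              # ~96.77 MB
print(f"~{total_gb:.1f} GB across {runs_with_speech} runs")  # ~2.4 GB
```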

@adelavega
Collaborator Author

Another solution is to store values for predictors we know to be large and repetitive as a JSON column in Predictors (still allowing us to split predictors for selection).

If stored as a list of lists, we would use about 25% of the size it would take to store them as proper rows, and minimize the custom code needed to deal with this (only upon dumping to PredictorEvents).
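
A minimal sketch of what that could look like, assuming SQLAlchemy on Postgres; the model and column names are illustrative, not the actual Neuroscout schema:

```python
# Illustrative sketch only: store all embedding values for a predictor as a
# JSONB list-of-lists (one inner list per word/event), splitting them into
# individual PredictorEvents only when dumping.
from sqlalchemy import Column, Integer, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.dialects.postgresql import JSONB

Base = declarative_base()

class Predictor(Base):
    __tablename__ = 'predictor'
    id = Column(Integer, primary_key=True)
    name = Column(Text)
    values = Column(JSONB)  # e.g. [[0.12, -0.43, ...], [0.08, 0.31, ...], ...]
```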

@adelavega
Collaborator Author

Looks like everything went well with running the BERT analysis, so ready to merge.

@adelavega adelavega merged commit 32f2bc0 into master Feb 18, 2020
@adelavega adelavega deleted the enh/complex_text branch February 18, 2020 20:08
Development

Successfully merging this pull request may close these issues.

ComplexText Stimulus Representation
3 participants