
Context-dependent word level estimates #734

Merged
merged 49 commits into master from enh/complex_text on Feb 18, 2020

Conversation

adelavega
Collaborator

@adelavega adelavega commented Jan 31, 2020

Closes #713

My preferred solution is one that is slightly hacky but fits in with the current way of doing things as much as possible.

When ingesting transcripts, each word is represented in an abstract sense, with no timing information. Here, I'm adding a representation of the whole transcript, whose content is the complete text for a run. These are differentiated by a mimetype of text/csv (to indicate multiple rows) instead of text/plain.

At runtime, these will be recreated into ComplexTextStimuli with word-level onsets (by querying the db).

Then, on extraction, extractors that require the whole transcript will have to be denoted in a different way. I'm thinking this may be worth its own field in the JSON object.

In terms of tokenization, we also need to specify ways to split this ComplexTextStim for approaches such as a sliding window, or even for those that require the whole text.

Still need to:

  • Recreate ComplexTextStimuli from db for existing datasets/transcripts
  • Add custom extraction logic for these objects (recreate CTS object from db)
  • Implement tokenization/moving window logic.
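
For reference, a minimal sketch of the runtime reconstruction step described above, assuming pliers' TextStim/ComplexTextStim constructors and a hypothetical get_run_words query helper (not the actual Neuroscout implementation):

```python
# Minimal sketch (not the actual implementation): rebuild a ComplexTextStim
# from word-level rows queried out of the db. get_run_words is hypothetical
# and assumed to return (text, onset, duration) tuples for a run.
from pliers.stimuli import ComplexTextStim, TextStim

def rebuild_transcript(run_id, get_run_words):
    words = get_run_words(run_id)
    elements = [TextStim(text=w, onset=onset, duration=dur)
                for (w, onset, dur) in words]
    return ComplexTextStim(elements=elements)
```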

@codecov-io

codecov-io commented Feb 3, 2020

Codecov Report

Merging #734 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #734   +/-   ##
=======================================
  Coverage   84.67%   84.67%           
=======================================
  Files          67       67           
  Lines        3106     3106           
=======================================
  Hits         2630     2630           
  Misses        476      476
Impacted Files Coverage Δ
neuroscout/populate/extract.py 67.66% <ø> (ø) ⬆️
neuroscout/tasks/upload.py 54.92% <ø> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2b52961...8b5b481.

@adelavega
Collaborator Author

@tyarkoni (cc @rbroc) I notice BERT returns a list, in a single field, rather than n encoding columns.

Currently, we store all values as strings in the db, so we have to be careful about how we handle this. Should splat-ing this array onto individual columns be pybids' job, or should custom Neuroscout logic prior to ingestion ingest them as separate features? We could also allow a JSONB field to hold arrays of values, which would get split into individual columns at report/compile time. However, that would make it largely incompatible with the frontend, as the column names would be unknown ahead of time.

Otherwise, I've tested the BERT extractor, and the windowing logic is implemented, so once we figure that out we should be ready to go.

@tyarkoni

tyarkoni commented Feb 6, 2020

I see this as NeuroScout's job to handle. @rbroc and I talked it over, and from a pliers standpoint, I don't like representing each dimension as its own feature, because the user will almost always want to use that vector as a single variable. So I think probably you'll need to break this up during ingestion. (I agree that passing it to PyBIDS to deal with doesn't make sense.)

@tyarkoni

tyarkoni commented Feb 6, 2020

I suppose we could consider adding an initialization argument to the pliers extractor that controls whether the embedding is returned in one field or one per dimension. I'd be fine with that if it turns out to be a PITA to implement the custom logic in NeuroScout (though it doesn't seem unreasonable to me to think this is something we might eventually want anyway, for other features). If it's not too much work, I think I'd prefer sticking with doing it in NeuroScout.

I wonder if we could just adopt a universal heuristic that all iterable values found in pliers features are to be broken up and serially numbered during ingestion into NeuroScout. Are there cases where that policy would fail us?

@adelavega
Collaborator Author

I think your last suggestion probably works. If the value is a list, create n features that end with _{n}.

The main downside is that if we ever want to include these features on the frontend, it will be unwieldy. In that case, we'd need a way to group features.
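
A minimal sketch of that _{n} splitting heuristic (illustrative names only, not the actual ingestion code):

```python
# Minimal sketch of the proposed heuristic: list-valued features are split
# into serially numbered scalar features at ingestion time.
def split_iterable_values(name, value):
    if isinstance(value, (list, tuple)):
        return {f"{name}_{i}": v for i, v in enumerate(value)}
    return {name: value}

# e.g. a 768-dim BERT embedding becomes encoding_0 ... encoding_767
split_iterable_values("encoding", [0.12, -0.43, 0.98])
# {'encoding_0': 0.12, 'encoding_1': -0.43, 'encoding_2': 0.98}
```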

@adelavega
Collaborator Author

Also, it seems unlikely that you'd want to use only some of the embeddings, so having them as individual features is less useful. That said, that's what they literally are, so it's consistent with the way we do things.

@tyarkoni

tyarkoni commented Feb 6, 2020

I think we'll probably want to include all of the embedding dimensions in most models where any are included, so I agree that it would be good to have some front-end short-cut for grabbing/grouping all of them (while also allowing the user to select individual ones). But I guess this isn't really any different from how we handle confounds, right (except for there being more dimensions)? Do we currently require users to check each confound box individually, or can they just "select all" to do it once?

@adelavega
Collaborator Author

After removing the history column, the total size of the extracted_event table is 76M rows at 636MB.

A single extraction of a PretrainedBertExtractor creates around 1.1M rows which is about 93MB.

This is about 1461 events per run, which is less than one per second on average. That actually seems fairly reasonable. Obviously, the events will occur at a higher frequency when speech is occurring, with stretches of no speech in between.

For now, I will try rounding values to the 5th decimal place to see how this affects the size.

Then run models with both and compare.

@adelavega
Collaborator Author

adelavega commented Feb 11, 2020

Looks like rounded and unrounded values result in identical f-test images. So much so that I think we can round even more if necessary.

@adelavega
Collaborator Author

Okay, it looks like I'm using the right column type for value, and rounding did shorten the column's length/size on disk from about 20 bytes to 8-9 bytes.

However, the problem is that each row has ~72 bytes of overhead, including the onset, duration, object_id, foreign keys, and the necessary indexes (primary key, foreign keys).

That means, assuming 90 bytes per row * 768 embedding dimensions * 1400 words, we get 96,768,000 bytes ≈ 96.77 MB per run.

Assuming we do this for all runs with speech in Neuroscout (~25), that's about 2.5 GB. Not totally crazy, but not very efficient.

For reference, if we were to store all dimensions in one row as a string (which we then parse), it would only be about 164 MB for all datasets.
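
For reference, a quick back-of-the-envelope check of the numbers above, using only the figures from this comment:

```python
# Rough size estimate for storing each embedding dimension as its own row
# (all figures taken from the comment above).
bytes_per_row = 90        # ~72 bytes of row overhead plus the rounded value
dims = 768                # BERT embedding dimensions
words_per_run = 1400      # approximate words per run with speech
runs_with_speech = 25     # approximate number of runs with speech

per_run_bytes = bytes_per_row * dims * words_per_run
total_gb = per_run_bytes * runs_with_speech / 1e9

print(f"{per_run_bytes / 1e6:.2f} MB per run")              # ~96.77 MB
print(f"~{total_gb:.1f} GB across {runs_with_speech} runs")  # ~2.4 GB
```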

@adelavega
Collaborator Author

Another solution is to store values for predictors we know to be large and repetitive as a JSON column in Predictors (still allowing us to split predictors for selection).

If stored as a list of lists, we would use about 25% of the size it would take to store them as proper rows, and minimize the custom code needed to deal with this (only upon dumping to PredictorEvents).
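
A minimal sketch of what that could look like, assuming SQLAlchemy on Postgres; the model and column names are illustrative, not the actual Neuroscout schema:

```python
# Illustrative sketch only: store all embedding values for a predictor as a
# JSONB list-of-lists (one inner list per word/event), splitting them into
# individual PredictorEvents only when dumping.
from sqlalchemy import Column, Integer, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.dialects.postgresql import JSONB

Base = declarative_base()

class Predictor(Base):
    __tablename__ = 'predictor'
    id = Column(Integer, primary_key=True)
    name = Column(Text)
    values = Column(JSONB)  # e.g. [[0.12, -0.43, ...], [0.08, 0.31, ...], ...]
```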

@adelavega
Collaborator Author

Looks like everything went well with running the BERT analysis, so ready to merge.

@adelavega adelavega merged commit 32f2bc0 into master Feb 18, 2020
@adelavega adelavega deleted the enh/complex_text branch February 18, 2020 20:08
Development

Successfully merging this pull request may close these issues.

ComplexText Stimulus Representation
3 participants