
Add graphs to windowed extraction #756

Merged
merged 35 commits into master from enh/graph on May 31, 2020

Conversation

adelavega
Collaborator

To support BERT Entropy

@codecov-io

codecov-io commented May 7, 2020

Codecov Report

Merging #756 into master will decrease coverage by 0.54%.
The diff coverage is 7.69%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #756      +/-   ##
==========================================
- Coverage   83.38%   82.84%   -0.55%     
==========================================
  Files          63       63              
  Lines        2823     2874      +51     
==========================================
+ Hits         2354     2381      +27     
- Misses        469      493      +24     
Impacted Files Coverage Δ
neuroscout/populate/annotate.py 86.27% <ø> (ø)
neuroscout/populate/extract.py 66.91% <7.69%> (-0.76%) ⬇️
neuroscout/resources/analysis/reports.py 44.68% <0.00%> (-3.97%) ⬇️
neuroscout/models/analysis.py 84.09% <0.00%> (-0.80%) ⬇️
neuroscout/schemas/analysis.py 97.97% <0.00%> (+0.38%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bdc9002...c9bae10. Read the comment docs.

@adelavega
Collaborator Author

Okay @rbroc, I've added the ability to extract Graphs using "tokenized_extractors". Tokenized extractors are those that operate on text that is combined in various ways, most commonly either by run or by a moving window of variable size.

I extracted BERT LM Entropy with a window of 25 for Merlin Movie. Feel free to take a look, although I'm not sure what else we can do to validate. Next up, I'll extract with n=10 with the following Graph:

[
    [
        {
            "transformer": "BertLMExtractor",
            "parameters": {"mask": 10, "top_n": 100},
            "children": [
                {
                    "transformer": "MetricExtractor",
                    "parameters": {"functions": "scipy.stats.entropy"}
                }
            ]
        }
    ],
    {"window": "pre", "n": 10}
]

If that looks alright, I'll go ahead.
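
For reference, here's roughly how the graph part of a spec like that gets run; a minimal sketch assuming pliers' Graph API, with the window text as a placeholder (the second element of the spec, the windowing, is applied on the Neuroscout side before the graph runs):

# Minimal sketch (not the actual Neuroscout extraction code). The node dicts
# use the same keys as the spec above; the window text is a placeholder.
from pliers.graph import Graph
from pliers.stimuli import ComplexTextStim

nodes = [
    {
        "transformer": "BertLMExtractor",
        "parameters": {"mask": 10, "top_n": 100},
        "children": [
            {
                "transformer": "MetricExtractor",
                "parameters": {"functions": "scipy.stats.entropy"}
            }
        ]
    }
]

graph = Graph(nodes=nodes)

# One "pre" window of transcript text ending at the word to be predicted.
window = ComplexTextStim(text="placeholder eleven plus word window of transcript text ending at the target word")
results = graph.run([window])  # merged results as a DataFrame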

The only other thing to check is that this latest change didn't mess up the extraction for normal BERT, as it now also has to be a graph. I'll look into that tomorrow.

@rbroc
Collaborator

rbroc commented May 14, 2020

okay @adelavega, I think we'll go for a window size of 25 words, with top_n=100, mask=24, return_softmax=True and, I guess, framework='tf'. MetricExtractor takes functions='scipy.stats.entropy'. Note that the mask argument is equal to window_size - 1, as what we're doing here is masking the last word in the sequence (indexing starts at 0) and trying to predict it based on the previous words.

One small issue. I have been trying to check the extracted values for Merlin with window size 25, top_n=100, mask=24 and return_softmax left to default, but I can't manage to reproduce them. Probably worth having a chat to find out why.

One more thing. Can't remember if, for the BERT encodings, we fed 25-word text chunks to the extractor and only picked the encoding for the last word, or if we averaged across all word-level encodings and assigned the resulting encoding to the next word in the transcript.
In the case of the entropy extractor, the entropy value we get should be assigned to the last word in the 25-word text chunk.
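
In code, the configuration I have in mind looks roughly like this (just a sketch: the window text is a placeholder, and the chaining of MetricExtractor onto BertLMExtractor happens through the graph spec on your side):

from pliers.stimuli import ComplexTextStim
from pliers.extractors import BertLMExtractor, MetricExtractor

window_size = 25

# mask = window_size - 1: mask the last word of each 25-word window
# (0-based index 24) and predict it from the 24 preceding words.
bert_lm = BertLMExtractor(mask=window_size - 1, top_n=100,
                          return_softmax=True, framework='tf')

# Entropy is then computed over the returned probability distribution.
metric = MetricExtractor(functions='scipy.stats.entropy')

# Standalone, the language-model probabilities for a single window come from:
window = ComplexTextStim(text="twenty-five actual transcript words would go here")  # placeholder
lm_result = bert_lm.transform(window)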

@adelavega
Collaborator Author

adelavega commented May 14, 2020

About not being able to reproduce those values, maybe we can put that on hold and try again with the parameters you gave me. I think the index was set to the same number as the window size.

This actually reveals a bigger issue, which is that the result object has the MetricExtractor as its extractor, so I lose the parameters of the BERT extractor in my metadata...

I may be able to address that within Neuroscout only, but we'll probably need a proper fix at some point (not super urgent, though).

@adelavega
Collaborator Author

I believe that you feed 25-word chunks to the extractor, and only pick the encoding for the last word.

@adelavega
Collaborator Author

@rbroc I have re-extracted BERT entropy. The name of the extractor is BERTLM_pre_25_entropy.

It is available for the Sherlock task in SherlockMerlin.

@rbroc
Collaborator

rbroc commented May 20, 2020

ok, still can't reproduce the values. Maybe worth having a chat about it before extracting the rest.

@codecov-commenter

codecov-commenter commented May 21, 2020

Codecov Report

Merging #756 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #756   +/-   ##
=======================================
  Coverage   83.11%   83.11%           
=======================================
  Files          63       63           
  Lines        2837     2837           
=======================================
  Hits         2358     2358           
  Misses        479      479           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b1bd12d...b1bd12d. Read the comment docs.

@adelavega
Collaborator Author

adelavega commented May 21, 2020

For SherlockMerlin, the first window is:
"how 's your going yeah good very good you have n't written a word have you you just wrote still has trust issues and you read my"

@adelavega
Collaborator Author

and the result for that first window is:
0.00452 with onset of 137.232 and duration of 0.29.

That should correspond to the word "my".

Although, actually, this looks a bit funny: the onset for "my" in the original ComplexTextStim object is 139.22.

That could be the source of the mismatch.

@adelavega
Collaborator Author

adelavega commented May 21, 2020

I tried this with BERTLM outside of Neuroscout, and it's giving back the 137 onset. This seems like a problem with the extractor.

Otherwise, the manually calculated entropy value also seems correct.

The onset is relative to the onset of the movie in the run, which is 25.5.

So the first onset for the BERTLM extractor entropy value should be 25.5 + 139.922 = 165.422.

In actuality, since the returned onset is 137.232, it is 162.732.
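
As a quick sanity check of that arithmetic:

# Onsets are relative to the movie onset within the run (25.5 s).
movie_onset = 25.5
print(movie_onset + 139.922)  # 165.422: where the value should land
print(movie_onset + 137.232)  # 162.732: where it actually lands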

The value in the db for that onset is 0.005, which is correct.

One issue is that I'm rounding these values, so maybe I shouldn't be, given their range.

The one thing I don't understand is that there are values before that onset with really high entropy, and I'm not sure where they are coming from.

@adelavega
Collaborator Author

adelavega commented May 21, 2020

  1. I'm going to round to one more decimal place.
  2. We need to fix the onset coming from pliers, or I can force-fix it in Neuroscout by taking the onset from the last element and forcing it onto the result (see the sketch after this list).
  3. Figure out where the first few values are coming from.
  4. I can also add custom logic to fix the apostrophe issue. Should we remove it and spaces, or just spaces?
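
For item 2, the Neuroscout-side force-fix would look something like this (purely illustrative; the helper and the result's field names are hypothetical, but ComplexTextStim elements do carry per-word onsets and durations):

# Hypothetical helper: take onset/duration from the last element of the
# windowed ComplexTextStim (the masked/predicted word) and force them onto
# the extracted value, instead of trusting the onset returned by pliers.
def fix_result_onset(value_row, window_stim):
    last_word = window_stim.elements[-1]
    value_row['onset'] = last_word.onset
    value_row['duration'] = last_word.duration
    return value_row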

@adelavega
Collaborator Author

Okay, looks like for this window:
"good very good you have n't written a word have you you just wrote still has trust issues and you read my writing upside down you" which should have an onset of 143.602, the graph returns an onset of 134.863, with a value of 2.811.

This onset is wrong, so this is probably why we're having weird results.

I could fix this in NS, but I think it's a Pliers issue.

@adelavega
Collaborator Author

In /media/neuroscout-data/neuroscout/file_data I saved two .pkls, which are the ComplexTextStims that correspond to those two window slices, so you can play with them @rbroc.

@rbroc
Collaborator

rbroc commented May 22, 2020

okay, looked into this.

Onset issue
The onset issue was indeed a pliers issue with BertLMExtractor, sorry. It should be fixed by this PR: PsychoinformaticsLab/pliers#397.
The problem was that instead of picking the onset of the 24th word, it picked the onset of the 24th token: indexing was based on the BERT-tokenized text rather than on words, as it should have been.

There was also an additional little issue, namely that if the last word in the sequence (the one we want to mask) also occurred earlier in the sequence, the extractor would by default mask the first occurrence. This should also be fixed now.
Unfortunately, this means that we'll also have to re-extract for SherlockMerlin, as there may be cases where the last word occurs twice in the text.

Could you try pulling pliers from my PR, and checking if the onset issue is fixed?
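
To illustrate the indexing drift with a toy example (Hugging Face tokenizer; this is not the pliers code itself, and the sentence is made up):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

words = "still has trust issues and you read my neuroscout transcript".split()
tokens = tokenizer.tokenize(" ".join(words))

# Whenever a word is split into multiple WordPiece tokens (e.g. "neuroscout"),
# the token count exceeds the word count, so the Nth token is no longer the
# Nth word, and looking up onsets by token index grabs the wrong word's onset.
print(len(words), len(tokens))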

Mismatch in values
I still get slightly different values. Maybe we're using different parameters for BertLMExtractor? I'm setting return_softmax=True, top_n=100, framework='tf', mask=24.

Input to BertLM
I don't think the apostrophe is an issue. But if we could strip spaces from within the text field of each element of the ComplexTextStim, that'd be great.
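
Something along these lines on your end should do it (illustrative; window_stim stands in for the windowed ComplexTextStim):

# Remove spaces from within each element's text field before the window is
# fed to the BERT extractor.
for elem in window_stim.elements:
    elem.text = elem.text.replace(" ", "")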

@adelavega
Collaborator Author

Okay, now that BERT surprisal and BERT entropy are extracted, I'm going to merge.

@adelavega adelavega merged commit 1a68a80 into master May 31, 2020
@adelavega adelavega deleted the enh/graph branch January 29, 2021 00:15