
Add graphs to windowed extraction #756

Merged
merged 35 commits into master from enh/graph on May 31, 2020

Conversation

adelavega
Collaborator

To support BERT Entropy

@codecov-io

codecov-io commented May 7, 2020

Codecov Report

Merging #756 into master will decrease coverage by 0.54%.
The diff coverage is 7.69%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #756      +/-   ##
==========================================
- Coverage   83.38%   82.84%   -0.55%     
==========================================
  Files          63       63              
  Lines        2823     2874      +51     
==========================================
+ Hits         2354     2381      +27     
- Misses        469      493      +24     
Impacted Files Coverage Δ
neuroscout/populate/annotate.py 86.27% <ø> (ø)
neuroscout/populate/extract.py 66.91% <7.69%> (-0.76%) ⬇️
neuroscout/resources/analysis/reports.py 44.68% <0.00%> (-3.97%) ⬇️
neuroscout/models/analysis.py 84.09% <0.00%> (-0.80%) ⬇️
neuroscout/schemas/analysis.py 97.97% <0.00%> (+0.38%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bdc9002...c9bae10. Read the comment docs.

@adelavega
Collaborator Author

Okay @rbroc, I've added the ability to extract Graphs using "tokenized_extractors". Tokenized extractors are those that operate on text that is combined in various ways, most commonly either by run or by a moving window of variable size.

I extracted BERT LM Entropy with a window of 25 for Merlin Movie. Feel free to take a look, although I'm not sure what else we can do to validate. Next up, I'll extract with n=10 with the following Graph:

[
    [
        {
            "transformer": "BertLMExtractor",
            "parameters": {"mask": 10, "top_n": 100},
            "children": [
                {
                    "transformer": "MetricExtractor",
                    "parameters": {"functions": "scipy.stats.entropy"}
                }
            ]
        }
    ],
    {"window": "pre", "n": 10}
]

If that looks alright, I'll go ahead.
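
For reference, here's roughly how the graph part of a spec like that gets run; a minimal sketch assuming pliers' Graph API, with the window text as a placeholder (the second element of the spec, the windowing, is applied on the Neuroscout side before the graph runs):

# Minimal sketch (not the actual Neuroscout extraction code). The node dicts
# use the same keys as the spec above; the window text is a placeholder.
from pliers.graph import Graph
from pliers.stimuli import ComplexTextStim

nodes = [
    {
        "transformer": "BertLMExtractor",
        "parameters": {"mask": 10, "top_n": 100},
        "children": [
            {
                "transformer": "MetricExtractor",
                "parameters": {"functions": "scipy.stats.entropy"}
            }
        ]
    }
]

graph = Graph(nodes=nodes)

# One "pre" window of transcript text ending at the word to be predicted.
window = ComplexTextStim(text="placeholder eleven plus word window of transcript text ending at the target word")
results = graph.run([window])  # merged results as a DataFrame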

The only other thing to check is that this latest change didn't mess up the extraction for normal BERT, as it now also has to be a graph. I'll look into that tomorrow.

@rbroc
Collaborator

rbroc commented May 14, 2020

okay @adelavega, I think we'll go for a window size of 25 words, with top_n=100, mask=24, return_softmax=True and, I guess, framework='tf'. MetricExtractor takes functions='scipy.stats.entropy'. Note that the mask argument is equal to window_size - 1, as what we're doing here is masking the last word in the sequence (indexing starts at 0) and trying to predict it based on the previous words.

One small issue. I have been trying to check the extracted values for Merlin with window size 25, top_n=100, mask=24 and return_softmax left to default, but I can't manage to reproduce them. Probably worth having a chat to find out why.

One more thing. Can't remember if, for the BERT encodings, we fed 25-word text chunks to the extractor and only picked the encoding for the last word, or if we averaged across all word-level encodings and assigned the resulting encoding to the next word in the transcript.
In the case of the entropy extractor, the entropy value we get should be assigned to the last word in the 25-word text chunk.
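
In code, the configuration I have in mind looks roughly like this (just a sketch: the window text is a placeholder, and the chaining of MetricExtractor onto BertLMExtractor happens through the graph spec on your side):

from pliers.stimuli import ComplexTextStim
from pliers.extractors import BertLMExtractor, MetricExtractor

window_size = 25

# mask = window_size - 1: mask the last word of each 25-word window
# (0-based index 24) and predict it from the 24 preceding words.
bert_lm = BertLMExtractor(mask=window_size - 1, top_n=100,
                          return_softmax=True, framework='tf')

# Entropy is then computed over the returned probability distribution.
metric = MetricExtractor(functions='scipy.stats.entropy')

# Standalone, the language-model probabilities for a single window come from:
window = ComplexTextStim(text="twenty-five actual transcript words would go here")  # placeholder
lm_result = bert_lm.transform(window)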

@adelavega
Collaborator Author

adelavega commented May 14, 2020

About not being able to reproduce those values, maybe we can put that on hold and try again with the parameters you gave me. I think the index was set to the same number as the window size.

This actually reveals a bigger issue, which is that the result object has the MetricExtractor as its extractor, so I lose the parameters of the BERT extractor in my metadata...

I may be able to address that within Neuroscout only, but we'll probably need a proper fix at some point (not super urgent, though).

@adelavega
Collaborator Author

I believe that you feed 25-word chunks to the extractor, and only pick the encoding for the last word.

@adelavega
Collaborator Author

@rbroc I have re-extracted BERT entropy. The name of the extractor is BERTLM_pre_25_entropy.

It is available for the Sherlock task in SherlockMerlin.

@rbroc
Collaborator

rbroc commented May 20, 2020

ok, still can't reproduce the values. Maybe worth having a chat about it before extracting the rest.

@codecov-commenter

codecov-commenter commented May 21, 2020

Codecov Report

Merging #756 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #756   +/-   ##
=======================================
  Coverage   83.11%   83.11%           
=======================================
  Files          63       63           
  Lines        2837     2837           
=======================================
  Hits         2358     2358           
  Misses        479      479           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b1bd12d...b1bd12d. Read the comment docs.

@adelavega
Collaborator Author

adelavega commented May 21, 2020

For SherlockMerlin, the first window is:
"how 's your going yeah good very good you have n't written a word have you you just wrote still has trust issues and you read my"

@adelavega
Collaborator Author

and the result for that first window is:
0.00452 with onset of 137.232 and duration of 0.29.

That should correspond to the word "my".

Although, actually, this looks a bit funny: the onset for "my" in the original ComplexTextStim object is 139.22.

That could be the source of the mismatch.

@adelavega
Collaborator Author

adelavega commented May 21, 2020

I tried this with BERTLM outside of Neuroscout, and it's giving back the 137 onset. This seems like a problem with the extractor.

Otherwise, the manually calculated entropy value also seems correct.

The onset is relative to the onset of the movie in the run, which is 25.5.

So the first onset for the BERTLM extractor entropy value should be 25.5 + 139.922 = 165.422.

In actuality, since the returned onset is 137.232, it is 162.732.
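
As a quick sanity check of that arithmetic:

# Onsets are relative to the movie onset within the run (25.5 s).
movie_onset = 25.5
print(movie_onset + 139.922)  # 165.422: where the value should land
print(movie_onset + 137.232)  # 162.732: where it actually lands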

The value in the db for that onset is 0.005, which is correct.

One issue is that I'm rounding these values, so maybe I shouldn't be, given their range.

The one thing I don't understand is that there are values before that onset with really high entropy, and I'm not sure where they are coming from.

@adelavega
Collaborator Author

adelavega commented May 21, 2020

  1. I'm going to round to one more decimal place.
  2. We need to fix the onset coming from pliers, or I can force-fix it in Neuroscout by taking the onset from the last element and forcing it onto the result (see the sketch after this list).
  3. Figure out where the first few values are coming from.
  4. I can also add custom logic to fix the apostrophe issue. Should we remove it and spaces, or just spaces?
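
For item 2, the Neuroscout-side force-fix would look something like this (purely illustrative; the helper and the result's field names are hypothetical, but ComplexTextStim elements do carry per-word onsets and durations):

# Hypothetical helper: take onset/duration from the last element of the
# windowed ComplexTextStim (the masked/predicted word) and force them onto
# the extracted value, instead of trusting the onset returned by pliers.
def fix_result_onset(value_row, window_stim):
    last_word = window_stim.elements[-1]
    value_row['onset'] = last_word.onset
    value_row['duration'] = last_word.duration
    return value_row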

@adelavega
Collaborator Author

Okay, looks like for this window:
"good very good you have n't written a word have you you just wrote still has trust issues and you read my writing upside down you" which should have an onset of 143.602, the graph returns an onset of 134.863, with a value of 2.811.

This onset is wrong, so this is probably why we're having weird results.

I could fix this in NS, but I think it's a Pliers issue.

@adelavega
Collaborator Author

In /media/neuroscout-data/neuroscout/file_data I saved two .pkls, which are the ComplexTextStims that correspond to those two window slices, so you can play with them @rbroc.

@rbroc
Collaborator

rbroc commented May 22, 2020

okay, looked into this.

Onset issue
The onset issue was indeed a pliers issue with BertLMExtractor, sorry. It should be fixed by this PR: PsychoinformaticsLab/pliers#397.
The problem was that instead of picking the onset of the 24th word, it picked the onset of the 24th token: indexing was based on the BERT-tokenized text rather than on words, as it should have been.

There was also an additional little issue, namely that if the last word in the sequence (the one we want to mask) also occurred earlier in the sequence, the extractor would by default mask the first occurrence. This should also be fixed now.
Unfortunately, this means that we'll also have to re-extract for SherlockMerlin, as there may be cases where the last word occurs twice in the text.

Could you try pulling pliers from my PR, and checking if the onset issue is fixed?
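
To illustrate the indexing drift with a toy example (Hugging Face tokenizer; this is not the pliers code itself, and the sentence is made up):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

words = "still has trust issues and you read my neuroscout transcript".split()
tokens = tokenizer.tokenize(" ".join(words))

# Whenever a word is split into multiple WordPiece tokens (e.g. "neuroscout"),
# the token count exceeds the word count, so the Nth token is no longer the
# Nth word, and looking up onsets by token index grabs the wrong word's onset.
print(len(words), len(tokens))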

Mismatch in values
I still get slightly different values. Maybe we're using different parameters for BertLMExtractor? I'm setting return_softmax=True, top_n=100, framework='tf', mask=24.

Input to BertLM
I don't think the apostrophe is an issue. But if we could strip spaces from within the text field of each element of the ComplexTextStim, that'd be great.
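
Something along these lines on your end should do it (illustrative; window_stim stands in for the windowed ComplexTextStim):

# Remove spaces from within each element's text field before the window is
# fed to the BERT extractor.
for elem in window_stim.elements:
    elem.text = elem.text.replace(" ", "")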

@adelavega
Collaborator Author

Okay, now that BERT surprisal and BERT entropy are extracted, I'm going to merge.

@adelavega adelavega merged commit 1a68a80 into master May 31, 2020
@adelavega adelavega deleted the enh/graph branch January 29, 2021 00:15