Add graphs to windowed extraction #756
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master     #756      +/-   ##
==========================================
- Coverage   83.38%   82.84%   -0.55%
==========================================
  Files          63       63
  Lines        2823     2874      +51
==========================================
+ Hits         2354     2381      +27
- Misses        469      493      +24

Continue to review full report at Codecov.
Okay @rbroc, I've added the ability to extract Graphs using "tokenized_extractors". Tokenized extractors are those that operate on text that is combined in various ways, most commonly either by run or by a moving window of variable size. I extracted BERT LM entropy with a window of 25 for the Merlin movie. Feel free to take a look, although I'm not sure what else we can do to validate. Next up, I'll extract with n=10 with the following Graph:
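As a rough illustration of the moving-window idea, here is a minimal sketch with hypothetical names (`moving_windows`, `transcript`); this is not the actual pliers/Neuroscout API:

```python
# Minimal sketch of a moving-window combination over a word-level
# transcript. Hypothetical names; not the actual pliers/Neuroscout API.

def moving_windows(words, n=25):
    """Yield, for each position i, the window of up to n words ending at i."""
    for i in range(len(words)):
        start = max(0, i - n + 1)
        yield words[start:i + 1]

transcript = ["it", "was", "a", "dark", "and", "stormy", "night"]
for window in moving_windows(transcript, n=3):
    print(window)
```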
If that looks all right I'll go ahead. The only other thing to check is that this latest change didn't mess up the extraction for normal BERT, as it now also has to be a Graph. I'll look into that tomorrow.
Okay @adelavega, I think we'll go for a window size of 25 words, with:

One small issue. I have been trying to check the extracted values for Merlin with window size 25,

One more thing. I can't remember if, for the BERT encodings, we fed 25-word text chunks to the extractor and only picked the encoding for the last word, or if we averaged across all word-level encodings and assigned the resulting encoding to the next word in the transcript.
About not being able to reproduce those values, maybe we can put that on hold and try again with the parameters you gave me. I think the index was set to the same number as the window size. This actually reveals a bigger issue, which is that the result object has the

I may be able to address that within Neuroscout only, but we probably need to fix it more generally at some point (not super urgent, though).
I believe that you feed 25-word chunks to the extractor, and only pick the encoding of the last word.
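For concreteness, extracting just the last word's encoding might look roughly like this. This is a sketch using the Hugging Face transformers library directly, not the pliers code, and it glosses over sub-word tokenization:

```python
# Sketch: feed a chunk of words to BERT and keep only the encoding of the
# last word. Illustrative only; not the actual pliers implementation.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

chunk = "the detective walked into the room and saw my"
inputs = tokenizer(chunk, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)

# Index -2 skips the trailing [SEP] token; assuming the last word maps to
# a single sub-word token, this is its encoding.
last_word_encoding = hidden[0, -2]
print(last_word_encoding.shape)  # torch.Size([768])
```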
@rbroc I have re-extracted the BERT entropy extractor. The name of the extractor is:

It is available for the Sherlock task in SherlockMerlin.
Ok, still can't reproduce the values. Maybe worth having a chat about it before extracting the rest.
Codecov Report

@@           Coverage Diff           @@
##           master     #756   +/-   ##
=======================================
  Coverage   83.11%   83.11%
=======================================
  Files          63       63
  Lines        2837     2837
=======================================
  Hits         2358     2358
  Misses        479      479

Continue to review full report at Codecov.
For SherlockMerlin, this is the first window:
and the result for that first window is:

That should correspond to the word "my". Although, actually, this looks a bit funny: the onset for "my" in the original complex text object is 139.22. That could be the source of the mismatch.
I tried this with BertLM outside of Neuroscout, and it's giving back the 137 onset, so this seems like a problem with the extractor. Otherwise, the manually calculated entropy value also seems correct.

The onset is relative to the onset of the movie in the run, which is 25.5, so the first onset for the BertLM extractor entropy value should be 25.5 + 139.922 = 165.422. In actuality, since the returned onset is 137.232, it is 162.732. The value in the

One issue is that I'm rounding these values, so maybe I shouldn't be, given their range. The one thing I don't understand is that there are values before that onset with really high entropy values, and I'm not sure where they are coming from.
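For reference, the kind of manual entropy check described above might look roughly like the following sketch. It assumes bert-base-uncased and natural-log entropy; the extractor's exact configuration may differ:

```python
# Sketch of manually computing masked-LM entropy for the final word.
# Assumes bert-base-uncased and natural-log entropy; the extractor's
# exact setup may differ.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "the detective walked into the room and saw [MASK]"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Entropy of the predicted distribution over the vocabulary at the mask.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
probs = torch.softmax(logits[0, mask_pos], dim=-1)
entropy = -(probs * probs.log()).sum()
print(float(entropy))
```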
Okay, looks like for this window:

This onset is wrong, so this is probably why we're getting weird results. I could fix this in NS, but I think it's a Pliers issue.
Okay, looked into this.

Onset issue

There was also an additional little issue, namely that if the last word in the sequence (the one we want to mask) also occurred earlier in the sequence, the extractor would by default mask the first occurrence. This should also be fixed now. Could you try pulling pliers from my PR and checking if the onset issue is fixed?

Mismatch in values

Input to BertLM
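As a rough sketch of that fix (illustrative only, not the actual pliers diff), masking should target the last occurrence of the word rather than the first:

```python
# Sketch: mask the LAST occurrence of the target word, not the first.
# Illustrative only; not the actual pliers change.

def mask_target(tokens, target, mask_token="[MASK]"):
    # tokens.index(target) would find the first occurrence; searching the
    # reversed list finds the last one (the position we want to predict).
    idx = len(tokens) - 1 - tokens[::-1].index(target)
    masked = list(tokens)
    masked[idx] = mask_token
    return masked, idx

tokens = ["the", "dog", "saw", "the", "dog"]
print(mask_target(tokens, "dog"))
# (['the', 'dog', 'saw', 'the', '[MASK]'], 4)
```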
Okay, now that BERT surprisal and BERT entropy are extracted, I'm going to merge.
To support BERT Entropy