Remove tale variants from atu_seq when a motif is repeated in sequence #45
Comments
Which tale was this? Can you please add the ATU number? Point (3) suggests to
me that we may be dealing with story variants that seriously influence plot
structure, as opposed to minor variants. In that case we should retain
whatever we can.
This is btw the typical open ended question where we can ask for expert
advice, involving those in the know. Why not speak out and spare future
criticism from them. They should decide how they want to contribute to our
design, and then fun could be doubled while frustration could be halved. We
could identify such neuralgic points for them to join the crew.
On Wed, 24 Jan 2024 at 12:21, Joshh wrote:
This is based on an issue identified by @salmonix
<https://github.com/salmonix>. In the example identified, the sequences
of the tale variants are as follows for tale 1341A:
1. "J2356","J2136","J2136"
2. "J2356","J2136","J581"
3. "J2356","J581","J2136"
4. "J2356","J581","J581"
The text runs as follows: "...The thieves kill him, too [J581, J2136].
(3) Two foolish slaves are recaptured because of their talkativeness [J581,
J2136]..." The motifs identified are: J581 (*Wisdom and Folly,
Foolishness Of Noise-Making When Enemies Overhear*) and J2136.1 (*Wisdom
and Folly*).
@salmonix <https://github.com/salmonix> suggests that, in our cleared
data *perhaps* we should only retain variants 2 and 3 above, and remove 1
and 4, where the same motif is repeated in a row.
Adding this as an issue for discussion: @sdaranyi
<https://github.com/sdaranyi> and @salmonix <https://github.com/salmonix>,
are we certain that we want to remove all tale variants where a motif is
repeated 2 or more times in a row?
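A minimal sketch of the filter under discussion, assuming each variant is held as a plain list of motif codes (the real atu_seq structure may differ). The function name `has_adjacent_repeat` is made up for illustration:

```python
def has_adjacent_repeat(seq):
    """Return True if any motif is immediately followed by itself."""
    return any(a == b for a, b in zip(seq, seq[1:]))


# The four variants listed above for tale 1341A.
variants_1341a = [
    ["J2356", "J2136", "J2136"],
    ["J2356", "J2136", "J581"],
    ["J2356", "J581", "J2136"],
    ["J2356", "J581", "J581"],
]

# Keep only variants without a motif repeated in a row.
kept = [seq for seq in variants_1341a if not has_adjacent_repeat(seq)]
```

Applied to 1341A, this keeps variants 2 and 3 and drops 1 and 4. Whether that is the right call for every tale is exactly the open question here.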
|
Yes, exactly the problem I had in mind. Until we devise a plan on how to deal with variants -- based on the AaTh btw --, we cannot resolve this. Now off to lunch before my next telco. |
This is tale 1341A. I would be comfortable tagging this and potentially other questions with a 'question' tag and suggesting to experts that this is one way they could contribute to the dataset. I'd want to remove it from the milestone of things we want to resolve prior to initial publishing of the dataset, before the conference. |
Exactly. Decision point no 1. But then we should also exclude these types
from the string set as well — with 68 K at hand we can afford delegating
such problems to the wise, thereby making them co-own the effort.
|
@sdaranyi, if you are certain that this is a problem, it would not be difficult to remove such occurrences from the dataset. I just want to ensure that it truly is a problem, i.e. that motifs occur twice in a row too rarely to justify retaining such instances in the permutations of sequences we generate. |
As long as we know the ones we are excluding for this reason (which means
that at some point they will be welcome back), I don't see a problem.
I am not certain that variants are the source of this problem, but that is
how I understand Uther. Aarne and Thompson were more generous with
variants, but Uther merged them and generalized his shorthand to a next
interpretation level, suggesting a more abstract common content
denominator, only he knows why. It may have worked for the profession until
now, but clearly variant strings must be separated, not merged.
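One way to keep track of what gets excluded, so that those variants can be welcomed back later, is to set them aside with a reason instead of deleting them. This is only a sketch: the function `partition_variants`, the record fields, and the reason label `adjacent_repeat` are all assumptions, not part of the existing codebase:

```python
def has_adjacent_repeat(seq):
    """Return True if any motif is immediately followed by itself."""
    return any(a == b for a, b in zip(seq, seq[1:]))


def partition_variants(tale, variants):
    """Split variants into kept sequences and a log of excluded ones."""
    kept, excluded = [], []
    for seq in variants:
        if has_adjacent_repeat(seq):
            # Record why the variant was set aside so it can be restored later.
            excluded.append({"tale": tale, "seq": seq, "reason": "adjacent_repeat"})
        else:
            kept.append(seq)
    return kept, excluded


# The four variants of tale 1341A from the original issue text.
kept, excluded = partition_variants(
    "1341A",
    [
        ["J2356", "J2136", "J2136"],
        ["J2356", "J2136", "J581"],
        ["J2356", "J581", "J2136"],
        ["J2356", "J581", "J581"],
    ],
)
```

The `excluded` list could be written to its own file alongside the cleaned data, which would also give outside experts a concrete list to review.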
|
I have asked my daughter to revise these cases. She will do it this month
and I will guide her.
For now I would leave the repetitions out, for 2 reasons:
1. If they are recursive elements, they would not, imo, add much to the key
points of the tale structures. Example: performing 3 tasks instead of one.
2. If they are due to a parsing error (the data is recorded with
ambiguity), they should be left out.
We lose important information only if we eliminate sequences such as A,B and
B,A where A and B can both be terminal (resolution: punish the evil
stepmother and marry the girl OR marry the girl and punish the stepmother).
But I have not seen much of that.
I would leave them out for now, until we have some manually revised data.
Also, regarding contracting the numbers to 2 digits (K1076.2.3 -> K1076.2,
for instance): it seems that _most_ of the time this is like referring to
motifs by a superclass tag instead of the particular motif, like saying
'bringing out an object' instead of 'bringing out a mirror' and 'bringing
out a mortar'.
However, I see that in some cases this categorization is wrong, and it may
lead to errors. The check I can imagine is:
-> take the full token with all the digits: K1076.2.3
-> make graph 1
-> reduce the digits: K1076.2
-> make graph 2
Compare the two graphs to see whether they have the same main
characteristics. If yes, we know that K1076.2.3 can be substituted with
K1076.2. That would be the theory.
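A rough sketch of that check, in plain Python without a graph library. Several things here are assumptions: a "graph" is taken to be the set of directed edges between consecutive motifs, `truncate_motif`, `build_edges` and `summarize` are hypothetical helper names, and node and edge counts are placeholders for whichever characteristics we decide matter. The sample sequence is the tale 1692 motif list mentioned elsewhere in this thread:

```python
def truncate_motif(code, levels=2):
    """Keep only the first `levels` dot-separated parts: K1076.2.3 -> K1076.2."""
    return ".".join(code.split(".")[:levels])


def build_edges(sequences):
    """Collect directed edges between consecutive motifs across all sequences."""
    edges = set()
    for seq in sequences:
        edges.update(zip(seq, seq[1:]))
    return edges


def summarize(edges):
    """Placeholder summary; the real comparison criteria are still undecided."""
    nodes = {node for edge in edges for node in edge}
    return {"nodes": len(nodes), "edges": len(edges)}


sequences = [["J2136", "J2461.1.7", "J2461.1.7.1"]]
reduced = [[truncate_motif(m) for m in seq] for seq in sequences]

graph_1 = summarize(build_edges(sequences))  # full tokens
graph_2 = summarize(build_edges(reduced))    # reduced tokens
```

In this small case the reduced sequence becomes J2136, J2461.1, J2461.1, so truncation itself creates a motif repeated in a row (a self-loop in graph 2). That interaction with the repetition question above seems worth keeping in mind.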
|
Thanks @salmonix and thanks to your daughter. Will she be using R or Python? Let me know if you have thoughts about how best to integrate the changed code into the existing codebase. Regarding your comment on reducing the digits, I've made an issue over in our other repo (j-hagedorn/folktale_dna#5). |
Emma will annotate the text manually, so we can re-parse it searching for a
given pattern. There will be an additional line right below the tale with a
tag and the motifs extracted.
So far the line looks like this (using tale 1692 as an example):
## ANN: 1692, J2136, J2461.1.7, J2461.1.7.1, [T: J2136.5.6,J2136.5.7,J2136.5.5]
T: means tail variants.
We also thought of another tag, like [R: motif 1, motif 2], marking that
the motifs are reversible.
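To make the re-parsing step concrete, here is a sketch of how such a line could be read back. It assumes the format stays as in the example: a `## ANN:` prefix, the tale number first, comma-separated motifs, and optional bracketed tags with a one-letter key. The function name `parse_ann` and the shape of the returned dict are made up for illustration:

```python
import re

# Matches bracketed tags such as [T: J2136.5.6,J2136.5.7] or [R: A, B].
TAG_RE = re.compile(r"\[(\w):\s*([^\]]*)\]")


def parse_ann(line):
    """Parse one annotation line into tale number, motif sequence, and tags."""
    prefix = "## ANN:"
    if not line.startswith(prefix):
        raise ValueError("not an annotation line")
    body = line[len(prefix):]

    # Pull out the bracketed tags first, then drop them from the body.
    tags = {}
    for key, value in TAG_RE.findall(body):
        items = [v.strip() for v in value.split(",") if v.strip()]
        tags.setdefault(key, []).extend(items)
    body = TAG_RE.sub("", body)

    fields = [f.strip() for f in body.split(",") if f.strip()]
    if not fields:
        raise ValueError("annotation line has no tale number")
    return {"tale": fields[0], "motifs": fields[1:], "tags": tags}


parsed = parse_ann(
    "## ANN: 1692, J2136, J2461.1.7, J2461.1.7.1, [T: J2136.5.6,J2136.5.7,J2136.5.5]"
)
```

An `[R: ...]` tag would land in `tags["R"]` the same way. Lines with a different prefix are rejected, so any additional line type would need its own handling.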
This cleanup will strictly focus on understanding the human text of ATU.
I also thought of adding a subjective tag, maybe on a separate line. E.g. in
many tales the motifs are interchangeable. In the example above I _can_
imagine that J2461.1.7 and J2461.1.7.1 are two motifs (the mortar and the
mirror) whose order does not really matter. The order is recorded in the
catalogue as is, but human readers who know intuitively how stories run know
that here it does not matter. So maybe we can add one more line, like
## ANN*, with this version. * would stand for a 'reconstructed' version, as
in linguistics.
Any further ideas welcome. We would have a manual check and let's use it
for the best result. She jumps into it from next week on.
Yeah, she is on my cost partly anyway, so a good reason to pay her from my
company. :D
|
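That's great news, @salmonix . Just to be clear, she will be going through the original .txt file and manually producing a .csv file? From the example you give, it sounds as though that file's structure will look a lot like this (image.png: https://github.com/j-hagedorn/trilogy/assets/7065685/a1a73b2c-473d-4db9-a5e0-39db339a59dc), which is the structure of the current script at this point (https://github.com/j-hagedorn/trilogy/blob/master/fetch/fetch_taletypes.R#L101). That's nice because we can pretty easily apply the remainder of the logic to produce a variant of atu_seq based on the more accurate manual annotations. Actually, I'm thinking that her version will become the primary, and we'll just archive the other.
Notes on method
- *If in sequence, no need for terminal tag.* If she is removing or re-ordering the motifs in such a way that the final motif in the sequence is always the one which occurs last in the story narrative, then we don't need her to note that it is the terminal motif, since that will be clear from the structure.
- *Switched orders are variants?* For the example that you gave, if they both occur and are distinct motifs, then I'm not sure what you mean by saying that "their order actually does not really matter". If both orderings occur, then wouldn't these be discrete variants of the tale?
- *Resolving unforeseen challenges.* I imagine that she will encounter a number of questions that we can't foresee. Will she be noting these and e-mailing us about them?
Impact on current issues
As I see it, the creation and subsequent flattening of the manually-annotated file will allow us to close #45 (this one), #44 (since the main need for the AT was its more logical structure), and #46 (since Emma will be manually applying consistent structure to denote variants). With the tale sequence laid out clearly and accurately, it should be easy to close #40 as well. That's great! |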
She will add the annotation to the source text file.
- *If in sequence, no need for terminal tag*. Hmm...ok.
- *Switched orders are variants?* For the example that you gave, if they
both occur and are distinct motifs, then I'm not sure what you mean by
saying that "their order actually does not really matter". If both
orderings occur, then wouldn't these be discrete variants of the tale?
- In this case I would not see them as variants, more like accidentals
of expression: the storyteller chose that order and not another. "The
evil stepmother was punished and the girl married" and "the girl married
and the evil stepmother was punished" are equivalents, not variants.
Changing the order will not change the story.
- *Resolving unforeseen challenges.* I imagine that she will encounter a
number of questions that we can't foresee. Will she be noting these and
e-mailing us about them?
Of course.
Note that the tagging will only cover tales where the text is
ambiguous. Where it is straightforward, we just parse as now.
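If order-swapped motifs are to be treated as equivalents, not as variants, one possible approach is to reduce each sequence to a canonical form before comparing. The sketch below assumes the reversible motifs are known per tale (for instance from an `[R: ...]` tag) and that only motifs standing next to each other may swap. The function name `canonicalize` and the sorting rule are illustrative choices, not an agreed method:

```python
def canonicalize(seq, reversible):
    """Sort each contiguous run of reversible motifs; leave the rest in place."""
    out, run = [], []
    for motif in seq:
        if motif in reversible:
            run.append(motif)
        else:
            # A non-reversible motif ends the current run.
            out.extend(sorted(run))
            run = []
            out.append(motif)
    out.extend(sorted(run))
    return tuple(out)


# The mortar/mirror pair from tale 1692, assumed reversible for this example.
reversible = {"J2461.1.7", "J2461.1.7.1"}
a = ["J2136", "J2461.1.7", "J2461.1.7.1"]
b = ["J2136", "J2461.1.7.1", "J2461.1.7"]

same_story = canonicalize(a, reversible) == canonicalize(b, reversible)
```

Here both orderings collapse to the same tuple, so they would count once. Sequences that differ in a non-reversible position stay distinct.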
|
As we approach the conference, I'm trying to prioritize efforts within this milestone. This one seems high value but also high effort, so I'm not working on it for now and am assuming that its completion will be contingent upon the manual annotation, @salmonix. |