Remove tale variants from atu_seq when a motif is repeated in sequence #45

j-hagedorn · 2024-01-24T11:21:15Z

This is based on an issue identified by @salmonix. In the example identified, the sequences of the tale variants are as follows for tale 1341A:

1. "J2356","J2136","J2136"
2. "J2356","J2136","J581"
3. "J2356","J581","J2136"
4. "J2356","J581","J581"

The text runs as follows: "...The thieves kill him, too [J581, J2136]. (3) Two foolish slaves are recaptured because of their talkativeness [J581, J2136]..." The motifs identified are: J581 (Wisdom and Folly, Foolishness Of Noise-Making When Enemies Overhear) and J2136.1 (Wisdom and Folly).

@salmonix suggests that, in our cleared data, perhaps we should retain only variants 2 and 3 above and remove 1 and 4, where the same motif is repeated in a row.

Adding this as an issue for discussion: @sdaranyi and @salmonix, are we certain that we want to remove all tale variants where a motif is repeated 2 or more times in a row?
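
To make the proposal concrete for discussion, here is a minimal sketch of the filter, written in Python and not taken from the existing codebase. The function names `has_adjacent_repeat` and `split_by_repeat` are hypothetical; the input is the four 1341A variants listed above.

```python
from typing import List, Sequence, Tuple


def has_adjacent_repeat(seq: Sequence[str]) -> bool:
    """Return True if any motif is immediately followed by itself."""
    return any(a == b for a, b in zip(seq, seq[1:]))


def split_by_repeat(
    variants: Sequence[Sequence[str]],
) -> Tuple[List[Sequence[str]], List[Sequence[str]]]:
    """Split variants into (kept, dropped); dropped ones have an adjacent repeat."""
    kept: List[Sequence[str]] = []
    dropped: List[Sequence[str]] = []
    for seq in variants:
        (dropped if has_adjacent_repeat(seq) else kept).append(seq)
    return kept, dropped


# The four variants of tale 1341A from this issue.
variants_1341a = [
    ["J2356", "J2136", "J2136"],
    ["J2356", "J2136", "J581"],
    ["J2356", "J581", "J2136"],
    ["J2356", "J581", "J581"],
]

kept, dropped = split_by_repeat(variants_1341a)
```

Returning the dropped variants alongside the kept ones means excluded sequences stay on record rather than silently disappearing.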

sdaranyi · 2024-01-24T11:28:25Z

Which tale was this? Can you pls add the ATU number? (3) suggests to me that maybe we are dealing with story variants, seriously influencing plot structure, apart from minor variants. In that case we should retain whatever we can. This is btw the typical open-ended question where we can ask for expert advice, involving those in the know. Why not speak out and spare future criticism from them? They should decide how they want to contribute to our design, and then fun could be doubled while frustration could be halved. We could identify such neuralgic points for them to join the crew.


sdaranyi · 2024-01-24T11:33:52Z

Yes, exactly the problem I had in mind. Until we devise a plan on how to deal with variants -- based on the AaTh btw --, we cannot resolve this. Now off to lunch before my next telco.

j-hagedorn · 2024-01-24T11:34:15Z

This is tale 1341A. I would be comfortable tagging this and potentially other questions with a 'question' tag and suggesting to experts that this is one way they could contribute to the dataset. I'd want to remove it from the milestone of things we want to resolve prior to initial publishing of the dataset, before the conference.

sdaranyi · 2024-01-24T11:37:52Z

Exactly. Decision point no 1. But then we should also exclude these types from the string set as well — with 68 K at hand we can afford delegating such problems to the wise, thereby making them co-own the effort.


j-hagedorn · 2024-01-24T11:37:52Z

> Yes, exactly the problem I had in mind. Until we devise a plan on how to deal with variants -- based on the AaTh btw --, we cannot resolve this. Now off to lunch before my next telco.

@sdaranyi , if you are certain that this is a problem, it would not be difficult to remove such occurrences from the dataset. I just want to ensure that it truly is a problem. I.e. that we can reasonably expect that motifs do not occur twice in a row with enough frequency to retain such instances in the permutations of sequences we generate.

sdaranyi · 2024-01-24T12:08:25Z

As long as we know the ones we are excluding for this reason (which means that at some point they will be welcome back), I don't see a problem. I am not certain that variants are the source of this problem, but understand Uther like that. Aarne and Thompson were more generous with variants, but Uther merged them and generalized his shorthand to a next interpretation level, suggesting a more abstract common content denominator, only he knows why. It may have worked for the profession until now, but clearly variant strings must be separated, not merged.


salmonix · 2024-01-24T18:56:15Z

I put my daughter on to revise these cases. She will do it this month and I will guide her.

For now I would leave the repetitions out, for 2 reasons:

1. If they are recursive elements, they would not, imo, add much to the key points of the tale structures. Like: performing 3 tasks instead of one.
2. If they are due to a parsing error (as the data is recorded with ambiguity), they should be out.

We lose important information only if we eliminate sequences such as A,B and B,A where A and B can both be terminal (resolution: punish the evil stepmother and marry the girl OR marry the girl and punish stepmom). But I have not seen much of that. I would leave it out for now, until we have some manually revised data.

Also, regarding contracting the numbers to 2 digits (K1076.2.3 -> K1076.2, for instance): it seems that _most_ of the time it is like referring to motifs by a superclass tag instead of the particular motif. Like saying 'bringing out an object' instead of 'bringing out a mirror' and 'bringing out a mortar'. However, I see that in some cases this categorization is wrong, and it may lead to errors. What I can imagine here is:

- take the full token with all the digits (K1076.2.3) -> make graph 1
- reduce the digits (K1076.2) -> make graph 2

Compare the two graphs to see whether they have the same main base characteristics. If yes, we know that K1076.2.3 can be substituted with K1076.2. That would be the theory.
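
A rough sketch of that comparison, in Python and under stated assumptions: the thread does not define "main base characteristics", so node count, edge count and self-loops are used here purely as stand-ins, and all function names are hypothetical. The toy input reuses the three main motifs from the tale 1692 annotation example quoted further down this thread.

```python
from typing import Dict, List, Sequence, Set, Tuple

Edge = Tuple[str, str]


def truncate_motif(code: str, depth: int = 2) -> str:
    """Keep the first `depth` dot-separated levels, e.g. K1076.2.3 -> K1076.2."""
    return ".".join(code.split(".")[:depth])


def transition_edges(sequences: Sequence[Sequence[str]]) -> Set[Edge]:
    """Directed edges between consecutive motifs across all sequences."""
    edges: Set[Edge] = set()
    for seq in sequences:
        edges.update(zip(seq, seq[1:]))
    return edges


def graph_summary(sequences: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Stand-in 'characteristics': node count, edge count, self-loop count."""
    edges = transition_edges(sequences)
    nodes = {motif for seq in sequences for motif in seq}
    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "self_loops": sum(1 for a, b in edges if a == b),
    }


def compare_contraction(
    sequences: Sequence[Sequence[str]], depth: int = 2
) -> Dict[str, Dict[str, int]]:
    """Summarise graph 1 (full tokens) and graph 2 (tokens cut to `depth`)."""
    full: List[List[str]] = [list(seq) for seq in sequences]
    reduced = [[truncate_motif(m, depth) for m in seq] for seq in full]
    return {"full": graph_summary(full), "reduced": graph_summary(reduced)}


# Toy input: main motifs of the 1692 example, used only for illustration.
result = compare_contraction([["J2136", "J2461.1.7", "J2461.1.7.1"]])
```

On this toy input the contraction merges J2461.1.7 and J2461.1.7.1 into one node and creates a self-loop, that is, an adjacent repeat of the kind this issue is about. So the two questions (repeats and digit reduction) probably interact.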


j-hagedorn · 2024-01-25T01:32:09Z

Thanks @salmonix and thanks to your daughter. Will she be using R or Python? Let me know if you have thoughts about how best to integrate the changed code into the existing codebase. Regarding your comment on reducing the digits, I've made an issue over in our other repo (j-hagedorn/folktale_dna#5).

salmonix · 2024-01-25T07:38:16Z

Emma will annotate the text manually, so we can re-parse it searching for a given pattern. There is going to be an additional line right below the tale with a tag and the motifs extracted. So far the line will look like this (for tale 1692 as an example):

`## ANN: 1692, J2136, J2461.1.7, J2461.1.7.1, [T: J2136.5.6,J2136.5.7,J2136.5.5]`

T: means tail variants. We also thought of another tag, like [R: motif 1, motif 2], marking that the motifs can be reversed.

This cleanup will strictly focus on understanding the human text of ATU. I also thought of adding a subjective tag, maybe on a separate line. E.g. in many tales the motifs are interchangeable. In the example above I _can_ imagine that J2461.1.7 and J2461.1.7.1 are two motifs (the mortar and the mirror) whose order does not really matter. It is put into the catalogue as is, but human readers, who know intuitively how stories run, know that here it does not matter. So maybe we can add one more line, like `## ANN*`, with this version. * would stand for a 'reconstructed' version, as in linguistics.

Any further ideas welcome. We will have a manual check, so let's use it for the best result. She jumps into it from next week on. Yeah, she is partly on my cost anyway, so a good reason to pay her from my company. :D
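
For the re-parsing step, a minimal Python sketch of how such a line could be read. This is an assumption about the final format, not the agreed parser: the names `Annotation` and `parse_annotation` are hypothetical, the `## ANN*:` spelling with a colon is a guess, and tags are assumed to be a single capital letter inside square brackets.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# "## ANN:" or "## ANN*:" followed by the comma-separated body.
ANN_LINE = re.compile(r"^##\s*ANN(?P<star>\*)?\s*:\s*(?P<body>.*)$")
# Bracketed groups such as [T: a,b,c] or [R: a, b].
TAG_GROUP = re.compile(r"\[(?P<tag>[A-Z])\s*:\s*(?P<items>[^\]]*)\]")


@dataclass
class Annotation:
    tale: str
    motifs: List[str]
    tags: Dict[str, List[List[str]]] = field(default_factory=dict)
    reconstructed: bool = False


def _split(items: str) -> List[str]:
    """Split on commas, dropping surrounding whitespace and empty pieces."""
    return [part.strip() for part in items.split(",") if part.strip()]


def parse_annotation(line: str) -> Optional[Annotation]:
    """Parse one annotation line; return None for any other line."""
    match = ANN_LINE.match(line.strip())
    if match is None:
        return None
    body = match.group("body")
    tags: Dict[str, List[List[str]]] = {}
    for group in TAG_GROUP.finditer(body):
        tags.setdefault(group.group("tag"), []).append(_split(group.group("items")))
    # What remains after removing the bracketed groups: tale number, then motifs.
    plain = _split(TAG_GROUP.sub("", body))
    if not plain:
        return None
    return Annotation(
        tale=plain[0],
        motifs=plain[1:],
        tags=tags,
        reconstructed=match.group("star") is not None,
    )


example = parse_annotation(
    "## ANN: 1692, J2136, J2461.1.7, J2461.1.7.1, [T: J2136.5.6,J2136.5.7,J2136.5.5]"
)
```

Each tag maps to a list of groups so that a line could carry more than one `[R: ...]` pair if that turns out to be needed.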


j-hagedorn · 2024-01-25T11:49:27Z

That's great news, @salmonix. Just to be clear, she will be going through the original .txt file and manually producing a .csv file? From the example you give, it sounds as though that file's structure will look a lot like this:

[image.png: https://github.com/j-hagedorn/trilogy/assets/7065685/a1a73b2c-473d-4db9-a5e0-39db339a59dc]

...which is the structure of the current script at this point (https://github.com/j-hagedorn/trilogy/blob/master/fetch/fetch_taletypes.R#L101). That's nice because we can pretty easily apply the remainder of the logic to produce a variant of atu_seq based on the more accurate manual annotations. Actually, I'm thinking that her version will become the primary, and we'll just archive the other.

Notes on method

- *If in sequence, no need for terminal tag.* If she is removing or re-ordering the motifs in such a way that the final motif in the sequence is always the one which occurs last in the story narrative, then we don't need her to note that it is the terminal motif, since that will be clear from the structure.
- *Switched orders are variants?* For the example that you gave, if they both occur and are distinct motifs, then I'm not sure what you mean by saying that "their order actually does not really matter". If both orderings occur, then wouldn't these be discrete variants of the tale?
- *Resolving unforeseen challenges.* I imagine that she will encounter a number of questions that we can't foresee. Will she be noting these and e-mailing us about them?

Impact on current issues

As I see it, the creation and subsequent flattening of the manually-annotated file will allow us to close #45 (this one), #44 (since the main need for the AT was its more logical structure), and #46 (since Emma will be manually applying a consistent structure to denote variants). With the tale sequence laid out clearly and accurately, it should be easy to close #40 as well. That's great!

salmonix · 2024-01-25T21:35:57Z

She will add the annotation to the source text file.

- *If in sequence, no need for terminal tag.* Hmm...ok.
- *Switched orders are variants?* In this case I would not see them as variants. More like accidentals of the expression. The storyteller chose that order and not another. "The evil stepmother was punished and the girl married" == "the girl married and the evil stepmother was punished" are equivalents and not variants. Changing the order will not change the story.
- *Resolving unforeseen challenges.* Of course.

Note that the tagging will only be about tales where the text is ambiguous. Where it is straightforward, we just parse as now.
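
If order-equivalent sequences are to be treated as one, a possible approach is to map each sequence to a canonical form before comparing. The Python sketch below is only an illustration under assumptions: `canonical_order` is a hypothetical name, the set of reversible motifs is assumed to come from something like the proposed [R: ...] tag, and "X", "A", "B" are placeholder labels rather than real motif codes.

```python
from typing import Iterable, List, Set, Tuple


def canonical_order(seq: Iterable[str], reversible: Set[str]) -> Tuple[str, ...]:
    """Sort each run of consecutive reversible motifs so A,B and B,A coincide.

    Simplification: any adjacent reversible motifs are treated as one run,
    even if they were tagged as belonging to different reversible groups.
    """
    out: List[str] = []
    run: List[str] = []
    for motif in seq:
        if motif in reversible:
            run.append(motif)
        else:
            out.extend(sorted(run))  # flush the pending reversible run
            run = []
            out.append(motif)
    out.extend(sorted(run))  # flush a run that ends the sequence
    return tuple(out)


# Placeholder example: A and B are both possible endings in either order.
first = canonical_order(["X", "A", "B"], {"A", "B"})
second = canonical_order(["X", "B", "A"], {"A", "B"})
same_story = first == second
```

Motifs are only reordered within an uninterrupted run, so A,X,B and B,X,A would still count as different sequences.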


j-hagedorn · 2024-03-05T22:05:15Z

@salmonix , how is the manually-annotated file going? Does Emma have any questions that @sdaranyi or I can help with?

j-hagedorn · 2024-03-05T23:16:57Z

As we approach the conference, I'm trying to prioritize efforts within this milestone. This one seems high value but also high effort, so I'm not working on it and am assuming that its completion will be contingent upon the manual annotation, @salmonix.

j-hagedorn added the question Further information is requested label Jan 24, 2024

j-hagedorn added this to the Resolve issues with Trilogy data structure prior to 'publication' milestone Jan 24, 2024

j-hagedorn assigned j-hagedorn, salmonix and sdaranyi and unassigned j-hagedorn Jan 24, 2024

j-hagedorn removed the question Further information is requested label Jan 25, 2024

j-hagedorn unassigned sdaranyi Mar 5, 2024

j-hagedorn added high effort Requires substantial effort to address high value labels Mar 5, 2024

j-hagedorn removed this from the Resolve issues with Trilogy data structure prior to 'publication' milestone Mar 7, 2024

j-hagedorn mentioned this issue Mar 8, 2024

Tag root and terminal (leaf) node in ATU datasets #40

Closed
