Remove tale variants from atu_seq when a motif is repeated in sequence #45

Open
j-hagedorn opened this issue Jan 24, 2024 · 13 comments
Labels: high effort (Requires substantial effort to address), high value

@j-hagedorn
Owner

This is based on an issue identified by @salmonix. In the example identified, the sequences of the tale variants are as follows for tale 1341A:

  1. "J2356","J2136","J2136"
  2. "J2356","J2136","J581"
  3. "J2356","J581","J2136"
  4. "J2356","J581","J581"

The text runs as follows: "...The thieves kill him, too [J581, J2136]. (3) Two foolish slaves are recaptured because of their talkativeness [J581, J2136]..." The motifs identified are J581 (Wisdom and Folly, Foolishness Of Noise-Making When Enemies Overhear) and J2136.1 (Wisdom and Folly).
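
To illustrate how the four variants above could arise, here is a minimal sketch. It assumes each bracketed group in the text is treated as a slot of alternative motifs and that variants are the Cartesian product of those slots. The slot layout, including a leading slot that holds only J2356, is inferred from the listed sequences, and `expand_variants` is a hypothetical name, not a function from the existing codebase.

```python
from itertools import product

def expand_variants(slots):
    """Return every sequence formed by picking one motif from each slot, in order."""
    return [list(combo) for combo in product(*slots)]

# Assumed slot layout for tale 1341A, inferred from the four variants listed above.
slots_1341a = [["J2356"], ["J2136", "J581"], ["J2136", "J581"]]

variants = expand_variants(slots_1341a)
for number, seq in enumerate(variants, start=1):
    print(number, seq)
```

Under these assumptions the two identical slots are what produce variants 1 and 4, where the same motif appears twice in a row.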

@salmonix suggests that, in our cleaned data, perhaps we should retain only variants 2 and 3 above and remove variants 1 and 4, where the same motif is repeated in a row.
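
As a rough sketch of what that rule would mean in practice, the snippet below drops any sequence in which a motif is immediately followed by itself. It assumes each variant is available as an ordered list of motif codes; `has_adjacent_repeat` is a hypothetical helper, not part of the current scripts.

```python
def has_adjacent_repeat(seq):
    """True if any motif is immediately followed by the same motif."""
    return any(a == b for a, b in zip(seq, seq[1:]))

# The four variants of tale 1341A as listed in this issue.
variants_1341a = [
    ["J2356", "J2136", "J2136"],
    ["J2356", "J2136", "J581"],
    ["J2356", "J581", "J2136"],
    ["J2356", "J581", "J581"],
]

# Keep only the variants without a motif repeated in a row.
kept = [seq for seq in variants_1341a if not has_adjacent_repeat(seq)]
print(kept)
```

For this tale, only variants 2 and 3 survive the filter.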

Adding this as an issue for discussion: @sdaranyi and @salmonix, are we certain that we want to remove all tale variants where a motif is repeated 2 or more times in a row?

@sdaranyi
Collaborator

sdaranyi commented Jan 24, 2024 via email

@sdaranyi
Collaborator

Yes, exactly the problem I had in mind. Until we devise a plan on how to deal with variants -- based on the AaTh btw --, we cannot resolve this. Now off to lunch before my next telco.

@j-hagedorn
Owner Author

This is tale 1341A. I would be comfortable tagging this and potentially other questions with a 'question' tag and suggesting to experts that this is one way they could contribute to the dataset. I'd want to remove it from the milestone of things we want to resolve prior to initial publishing of the dataset, before the conference.

@sdaranyi
Collaborator

sdaranyi commented Jan 24, 2024 via email

@j-hagedorn
Owner Author

> Yes, exactly the problem I had in mind. Until we devise a plan on how to deal with variants -- based on the AaTh btw --, we cannot resolve this. Now off to lunch before my next telco.

@sdaranyi, if you are certain that this is a problem, it would not be difficult to remove such occurrences from the dataset. I just want to ensure that it truly is a problem, i.e. that motifs occur twice in a row too rarely to justify retaining such instances in the permutations of sequences we generate.
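
One way to gauge how much would be removed before committing to the rule is sketched below. It assumes the generated sequences can be loaded as ordered lists of motif codes; `adjacent_repeat_stats` is a hypothetical name, and the sample data is just the four 1341A variants from this issue.

```python
from collections import Counter

def adjacent_repeat_stats(sequences):
    """Return (share of sequences with an adjacent repeat, per-motif counts)."""
    per_motif = Counter()
    flagged = 0
    for seq in sequences:
        # Motifs that are immediately followed by themselves in this sequence.
        repeated = {a for a, b in zip(seq, seq[1:]) if a == b}
        if repeated:
            flagged += 1
            per_motif.update(repeated)
    share = flagged / len(sequences) if sequences else 0.0
    return share, per_motif

sample = [
    ["J2356", "J2136", "J2136"],
    ["J2356", "J2136", "J581"],
    ["J2356", "J581", "J2136"],
    ["J2356", "J581", "J581"],
]
share, per_motif = adjacent_repeat_stats(sample)
print(share, dict(per_motif))
```

Run over the full set of sequences, the share and the per-motif counts would show how many instances the removal affects and which motifs are involved.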

@sdaranyi
Collaborator

sdaranyi commented Jan 24, 2024 via email

@salmonix
Collaborator

salmonix commented Jan 24, 2024 via email

@j-hagedorn removed the question label (Further information is requested) on Jan 25, 2024
@j-hagedorn
Owner Author

Thanks @salmonix and thanks to your daughter. Will she be using R or Python? Let me know if you have thoughts about how best to integrate the changed code into the existing codebase. Regarding your comment on reducing the digits, I've made an issue over in our other repo.

@salmonix
Collaborator

salmonix commented Jan 25, 2024 via email

@j-hagedorn
Owner Author

That's great news, @salmonix. Just to be clear, she will be going through the original .txt file and manually producing a .csv file? From the example you give, it sounds as though that file's structure will look a lot like this:

image

...which is the structure of the current script at this point. That's nice because we can pretty easily apply the remainder of the logic to produce a variant of atu_seq based on the more accurate manual annotations. Actually, I'm thinking that her version will become the primary, and we'll just archive the other.

Notes on method

  • If in sequence, no need for terminal tag. If she is removing or re-ordering the motifs in such a way that the final motif in the sequence is always the one which occurs last in the story narrative, then we don't need her to note that it is the terminal motif, since that will be clear from the structure.
  • Switched orders are variants? For the example that you gave, if they both occur and are distinct motifs, then I'm not sure what you mean by saying that "their order actually does not really matter". If both orderings occur, then wouldn't these be discrete variants of the tale?
  • Resolving unforeseen challenges. I imagine that she will encounter a number of questions that we can't foresee. Will she be noting these and e-mailing us about them?
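
The first two notes can be sketched as follows, assuming the annotated file yields one narrative-ordered list of motif codes per variant. Both helper names are hypothetical, and the two sequences are variants 2 and 3 of tale 1341A from the top of this issue.

```python
def terminal_motif(seq):
    """The terminal motif is the last element of a narrative-ordered sequence."""
    return seq[-1] if seq else None

def same_motifs_different_order(seq_a, seq_b):
    """True when two sequences use the same motifs but in a different order."""
    return seq_a != seq_b and sorted(seq_a) == sorted(seq_b)

variant_2 = ["J2356", "J2136", "J581"]
variant_3 = ["J2356", "J581", "J2136"]

# No explicit terminal tag is needed: position alone identifies it.
print(terminal_motif(variant_2), terminal_motif(variant_3))

# Compared as ordered lists, the two stay distinct variants.
print(same_motifs_different_order(variant_2, variant_3))
```

Keeping sequences as ordered lists preserves switched orders as discrete variants; collapsing them would require an explicit decision to compare them without regard to order.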

Impact on current issues

As I see it, the creation and subsequent flattening of the manually-annotated file will allow us to close #45 (this one), #44 (since the main need for the AT was its more logical structure), and #46 (since Emma will be manually applying a consistent structure to denote variants). With the tale sequence laid out clearly and accurately, it should be easy to close #40 as well. That's great!

@salmonix
Collaborator

salmonix commented Jan 25, 2024 via email

@j-hagedorn
Owner Author

@salmonix, how is the manually-annotated file going? Does Emma have any questions that @sdaranyi or I can help with?

@j-hagedorn
Owner Author

j-hagedorn commented Mar 5, 2024

As we approach the conference, I'm trying to prioritize efforts within this milestone. This one seems high value but also high effort, so I'm not working on it for now and am assuming that its completion will be contingent upon the manual annotation, @salmonix.
