Removing words starting with 0 #20

Open
wants to merge 2 commits into
base: main
from


milamarcheva

This removes omitted words (annotated as 0word), because they were added by the annotator and are not part of the speech the child actually produced.

  • [done] Add a concise title to this pull request on the GitHub web interface.

  • [done] Add a description in this box to describe what this pull request is about.

  • If code behavior is being updated (e.g., a bug fix), relevant tests should be added.

  • The CircleCI builds should pass, including both the code styling checks by
    black and flake8 as well as the test suite.

  • Add an entry to CHANGELOG.md at the repository's root level.

@jacksonllee
Owner

Hello @milamarcheva, thank you for making this pull request! I haven't thought about how or whether to handle words annotated with a preceding 0, so this is a good opportunity to reflect on this.

Thank you for indicating the source data (CHILDES -> Biling -> Perez -> Shelia) for an instance of a 0-word. I've taken a look at the CHAT data file to see if there are clues to help decide what to do with these 0-words. I found the occurrence of "*CHI: I 0am done ." (utterance #136 in the data file), but this file has the transcribed utterances only and doesn't have dependent tiers such as %mor and %gra. I spot-checked other CHILDES datasets, and found a 0-word instance with %mor and %gra: https://sla.talkbank.org/TBB/childes/Eng-NA/Brown/Eve/010600a.cha, in utterance #2:

*MOT: you 0v more cookies ?
%mor: pro:per|you 0v|v qn|more n|cookie-PL ?
%gra: 1|2|SUBJ 2|0|ROOT 3|4|QUANT 4|2|OBJ 5|2|PUNCT

In this example, the "0v" in the utterance corresponds to "0v|v" in the %mor tier and to "2|0|ROOT" in %gra. This example suggests that although the 0-words aren't part of the produced speech or are inaudible somehow (as you've also pointed out), they still have a role in other annotation tiers. For pylangacq, an important goal is to correctly align the pieces across an utterance and its associated %mor and %gra tiers (if available) to create the parsed tokens. I see you're proposing to remove these 0-words in your commit, but given this example of "0v" from the American English Brown dataset, it would seem like pylangacq should not drop these 0-words, or else it wouldn't be able to align the utterance with the %mor and %gra tiers.
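To make the alignment concern concrete, here is a minimal sketch in plain Python. It is not pylangacq's actual parsing code: the `align` helper is hypothetical, and it assumes a simple one-to-one match between whitespace-separated items on the three tiers, which holds for this Eve utterance but ignores complications such as clitics in %mor.

```python
# Hypothetical illustration only; not pylangacq's implementation.
# Assumes a one-to-one match between whitespace-separated items on each tier.

utterance = "you 0v more cookies ?"
mor = "pro:per|you 0v|v qn|more n|cookie-PL ?"
gra = "1|2|SUBJ 2|0|ROOT 3|4|QUANT 4|2|OBJ 5|2|PUNCT"


def align(utterance, mor, gra):
    """Pair each word with its %mor and %gra item, or fail if counts differ."""
    words = utterance.split()
    mor_items = mor.split()
    gra_items = gra.split()
    if not (len(words) == len(mor_items) == len(gra_items)):
        raise ValueError(
            f"cannot align: {len(words)} words, "
            f"{len(mor_items)} %mor items, {len(gra_items)} %gra items"
        )
    return list(zip(words, mor_items, gra_items))


# With the 0-word kept, the three tiers line up item by item.
aligned = align(utterance, mor, gra)

# With the 0-word dropped from the utterance only, the counts no longer match.
without_zero = " ".join(w for w in utterance.split() if not w.startswith("0"))
try:
    align(without_zero, mor, gra)
    alignment_failed = False
except ValueError:
    alignment_failed = True
```

Under these assumptions, "0v" pairs with "0v|v" and "2|0|ROOT" when it is kept, and the alignment fails once it is removed from the utterance while %mor and %gra still carry it.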

To think out loud a bit more -- If a code change is needed within pylangacq, what are the options? I see the following:

  1. Drop the 0-words as you've proposed, but the problem is that pylangacq wouldn't correctly align the utterances with %mor and %gra tiers, as explained above.
  2. Keep the 0-words, but just remove the "0"? Not good, since there would be no indication that these words either aren't in the actual speech or are inaudible.
  3. Keep the 0-words, but remove the "0" and find another way to indicate the non-existence of these words. But what way? Is this the purpose of the "0" in the first place?
  4. Do nothing. If these untreated 0-words affect a pylangacq user, then the user has to handle these 0-words on their own. For instance, if a user is interested in word count in general, then the 0-words slightly inflate the word count numbers, in which case the user could detect and subtract these 0-words.

Option 1 is a deal breaker for pylangacq. Options 2 and 3 don't make sense. So I'm leaning towards option 4, which means no code change is needed.
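As a rough sketch of what option 4 could look like on the user's side, the snippet below separates 0-words from a plain list of word strings. The `split_zero_words` helper is hypothetical and not part of pylangacq; the word list is typed in by hand here, though in practice it could be whatever list of words a user obtains from the library (for example via `reader.words()`). It assumes that every token starting with "0" is an omitted word, which also matches a bare "0" token if one appears.

```python
# Hypothetical user-side helper; not part of pylangacq.
# Assumption: any token that starts with "0" is an annotator-added omitted word.

def split_zero_words(words):
    """Return (spoken, omitted): words without and with a leading "0"."""
    spoken = [w for w in words if not w.startswith("0")]
    omitted = [w for w in words if w.startswith("0")]
    return spoken, omitted


# Tokens of "*CHI: I 0am done ." written out by hand for illustration.
words = ["I", "0am", "done", "."]
spoken, omitted = split_zero_words(words)

# A count that ignores 0-words is the total minus the omitted ones.
adjusted_count = len(words) - len(omitted)
```

This leaves the parsed data untouched, so the alignment with %mor and %gra is unaffected, and only the user's own counts change.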

Am I missing something? Let me know what you think, and thank you again for raising the issue!

@milamarcheva
Author

milamarcheva commented Jan 12, 2024 via email

@jacksonllee
Owner

Thank you, Mila, for using pylangacq and for your interest in contributing to it -- really appreciate it!

  • +/ -- interruption, the current library leaves the slash in the
    processed string. I could attempt to fix that

It looks like pylangacq doesn't handle +/ currently, but because CHILDES / TalkBank datasets are updated from time to time, it's possible that +/ is a new thing that pylangacq might need to deal with. May I know which CHILDES / TalkBank data files have occurrences of +/? Just wondering if I should take a quick look first before you start putting together another pull request.

  • [/] -- repetition; [//] -- retracing: the library removes any repeated
    words or phrases, but I think an option to leave them behind might be
    useful in some cases, when focusing on production

If you're interested in the original, unparsed utterance (with the repeated words retained, among other things), the utterance objects preserve the original tiers from CHAT data. Please let me know if it's not clear how to access the unparsed utterance line.

In case email (rather than the public GitHub platform here) is a preferred way to discuss these or any other questions/ideas you may have, I'm reachable at jacksonlunlee@gmail.com


2 participants