glottojoin_language_level (or something similar) #54

Open
HedvigS opened this issue Feb 2, 2022 · 13 comments

@HedvigS

HedvigS commented Feb 2, 2022

Thanks @SietzeN for making this package.

It would be great if there was a function that joins different datasets together such that dialects of the same language are matched up and assigned the glottocode of their common ancestor, probably the language-level parent.

Currently in glottolog-cldf there is a column called "Language_ID" in the languages table which reports, if the languoid is a dialect, the glottocode of the parent languoid that has the level "language". This needn't be the direct parent of the dialect; sometimes dialects are nested within dialects and so on (I think the maximum depth I've seen is 4 or 5). I used to have a script that looped through each dialect languoid and checked which of its parents is a language and what glottocode it has. Then I convinced Robert F to add this information to the languages table, which has simplified things a lot. Essentially what I do now is fill in the Language_ID column for language- and family-level languoids with their own glottocodes, and I join datasets by that column instead (with different methods for handling more than one dialect of the same language). I've also taken the liberty of renaming this column "Language_level_ID" so as not to cause confusion with the columns called "Language_ID" elsewhere in CLDF tables (even if there isn't another column with that name in the languages table specifically).
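
A minimal base-R sketch of the column described above. It assumes a glottolog-cldf-style languages table with the columns `Glottocode`, `Level` and `Language_ID`, and that `Language_ID` is `NA` for anything that is not a dialect; the rows and glottocodes below are made up for illustration.

```r
# Toy stand-in for the glottolog-cldf languages table (made-up glottocodes).
languages <- data.frame(
  Glottocode  = c("aaaa1111", "aaaa1112", "aaaa1113", "bbbb1111"),
  Level       = c("language", "dialect", "dialect", "family"),
  Language_ID = c(NA, "aaaa1111", "aaaa1111", NA),
  stringsAsFactors = FALSE
)

# Dialects keep the glottocode of their language-level parent;
# languages and families are assigned their own glottocode.
# Assumption: a missing Language_ID is coded as NA, not as an empty string.
languages$Language_level_ID <- ifelse(
  is.na(languages$Language_ID),
  languages$Glottocode,
  languages$Language_ID
)
```

With this column in place, two datasets can be joined on `Language_level_ID` instead of on the raw glottocode.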

An improvement of this method could be to merge dialects based on their common ancestor, even if that is in itself also a dialect. That may be overdoing it though; joining based on the language-level parent is probably best after all.

So, in summary: either an option to an existing function or a new function which

  • checks if the two datasets have any directly matching dialect glottocodes (if both have the same dialect then there's no need to reduce to the language level)
  • for any remaining dialects, check if they can be matched via the language level
  • in any cases where one of the datasets has more than 1 dialect for the same language, merge them in a principled way. The user could choose between these:
    • combine all datapoints for all the dialects and if there is more than one datapoint for a given feature/word/variable choose randomly between them
    • users manually specify which dialects they prefer if they have to choose
    • choose randomly between the dialects
    • choose the dialect which has the most datapoints and only use those in the merged dataset
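
The steps above could be sketched roughly as follows in base R. This is only an illustration under stated assumptions: `level_dataset`, `lookup` and `Join_ID` are hypothetical names, the glottocodes are made up, each dataset is assumed to have one row per languoid, and only the "dialect with the most datapoints" option is implemented.

```r
# Lookup from any glottocode to its language-level glottocode (made-up codes).
lookup <- c(
  lang1111 = "lang1111",  # a language
  dial1111 = "lang1111",  # dialect of lang1111
  dial1112 = "lang1111",  # another dialect of lang1111
  dial2221 = "lang2222"   # dialect of lang2222
)

# Hypothetical helper: glottocodes that also occur in the other dataset are
# kept as they are; all others are replaced by their language-level code.
# If several rows end up with the same code, the row with the most
# non-missing datapoints is kept.
level_dataset <- function(df, other_codes, lookup) {
  direct <- df$Glottocode %in% other_codes
  df$Join_ID <- ifelse(direct, df$Glottocode, unname(lookup[df$Glottocode]))
  value_cols <- setdiff(names(df), c("Glottocode", "Join_ID"))
  n_points <- rowSums(!is.na(df[, value_cols, drop = FALSE]))
  df <- df[order(df$Join_ID, -n_points), ]
  df[!duplicated(df$Join_ID), ]
}

a <- data.frame(
  Glottocode = c("dial1111", "dial1112", "dial2221"),
  feat1 = c(1, 0, 1),
  feat2 = c(NA, 1, 0),
  stringsAsFactors = FALSE
)
b <- data.frame(
  Glottocode = c("lang1111", "dial2221"),
  word1 = c("x", "y"),
  stringsAsFactors = FALSE
)

a_lvl <- level_dataset(a, b$Glottocode, lookup)
b_lvl <- level_dataset(b, a$Glottocode, lookup)

# Inner join on the levelled code; the original glottocodes are kept as
# Glottocode_a and Glottocode_b so the source of each row stays visible.
joined <- merge(a_lvl, b_lvl, by = "Join_ID", suffixes = c("_a", "_b"))
```

Here `dial2221` is matched directly, while the two `lang1111` dialects in `a` are reduced to the one with more datapoints before being matched to `lang1111` in `b`. The other merge options would replace the ordering step inside the helper.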

Once again, glad you're doing this and happy to be invited along!

@HedvigS
Author

HedvigS commented Feb 23, 2022

@SietzeN Would you like me to write some pseudocode or tidyverse code to exemplify this operation?

@SietzeN
Collaborator

SietzeN commented Feb 23, 2022

That would be great. Perhaps we can add a new function to glottojoin.R, or build on this one: https://github.com/SietzeN/glottospace/blob/fe0b3ed8fb1ff87105fd932411de44c162c1e180/R/glottojoin.R#L196

@SietzeN
Collaborator

SietzeN commented Feb 23, 2022

p.s. I'm not sure whether random assignment is such a good idea. I think it's better if the user specifies the kind of join: https://r4ds.had.co.nz/relational-data.html

@HedvigS
Author

HedvigS commented Feb 23, 2022

> p.s. I'm not sure whether random assignment is such a good idea. I think it's better if the user specifies the kind of join: https://r4ds.had.co.nz/relational-data.html

Sure, it's not always a great idea, but if users have no a priori reason to pick one or the other, picking randomly is better than defaulting to, for example, always picking the first, which I've seen people do in similar circumstances.
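
As a rough illustration of a random pick that can still be reproduced, here is a small base-R sketch. The table, the glottocodes and the seed value are made up, and `Language_level_ID` is assumed to hold the language-level glottocode of each dialect.

```r
dialects <- data.frame(
  Glottocode        = c("dial1111", "dial1112", "dial1113", "dial2221"),
  Language_level_ID = c("lang1111", "lang1111", "lang1111", "lang2222"),
  stringsAsFactors  = FALSE
)

set.seed(54)  # arbitrary seed so the random choice can be reproduced

# Shuffle the rows, then keep the first row per language: each dialect of a
# language is equally likely to be the one that is kept.
shuffled <- dialects[sample(nrow(dialects)), ]
picked <- shuffled[!duplicated(shuffled$Language_level_ID), ]
```

Unlike always taking the first row, the result does not depend on the order in which the dialects happen to be listed, and fixing the seed keeps the choice stable between runs.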

@HedvigS
Author

HedvigS commented Feb 23, 2022

> That would be great. Perhaps we can add a new function to glottojoin.R, or build on this one:
>
> https://github.com/SietzeN/glottospace/blob/fe0b3ed8fb1ff87105fd932411de44c162c1e180/R/glottojoin.R#L196

Okay I'll have a go!

@HedvigS
Author

HedvigS commented Jun 23, 2022

Late, sorry. But here is a first draft of an approach:

https://github.com/HedvigS/personal-cookbook/blob/main/R/glottojoin_language_level.R

@HedvigS
Author

HedvigS commented Dec 19, 2022

What do you think @SietzeN ?

@HedvigS
Author

HedvigS commented May 3, 2023

@SietzeN if this isn't a good fit for this package, I might move to try to convince Simon to put it in rcldf instead.

@HedvigS
Author

HedvigS commented May 3, 2023

I take that back: I've apparently already tried that and it hasn't worked. Oh well.

@SietzeN
Collaborator

SietzeN commented May 9, 2023

Hi @HedvigS !

Thanks for the message. Sorry, I now see that I missed your previous message with the link to the script.

Implementing the suggested changes in glottojoin might not be the easiest route, because the function's behaviour changes depending on the input type. So a standalone function might be easier. How about 'glottodialang'?

If you create a pull request, I'm happy to add it!

Cheers!
Sietze

@HedvigS
Author

HedvigS commented May 31, 2023

It's alright. I've suggested a function for language-levelling datasets to rgrambank (grambank/rgrambank#3); once two sets are levelled, they can be joined together. Maybe this is the better path to take.

@SietzeN
Collaborator

SietzeN commented May 31, 2023 via email

@HedvigS
Author

HedvigS commented May 31, 2023 via email
