glottojoin_language_level (or something similar) #54

HedvigS · 2022-02-02T15:26:11Z

Thanks @SietzeN for making this package.

It would be great if there was a function that joins different datasets together such that dialects of the same language are matched up and assigned the glottocode of their common ancestor, probably the language-level parent.

Currently, glottolog-cldf has a column called "Language_ID" in the languages table which reports, if the languoid is a dialect, the glottocode of the parent languoid that has the level "language". This needn't be the direct parent of the dialect: sometimes dialects are nested within dialects and so on (I think the maximum depth I've seen is 4 or 5 levels). I used to have a script that looped through each dialect languoid and checked which of its parents is a language and what glottocode that has. Then I convinced Robert F to add this information to the languages table, which has simplified things a lot. Essentially, what I do now is add in the glottocodes of language- and family-level languoids to the Language_ID column and join datasets by that column instead (with different methods for what to do if there is more than one dialect of the same language). I've also taken the liberty of renaming this column "Language_level_ID" so as not to cause confusion with the other columns called "Language_ID" elsewhere in CLDF tables (even if there isn't another column called that in the languages table specifically).
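To make the lookup concrete, here is a rough sketch of the "walk up the parents until you hit a language" idea, written in plain Python rather than R just to keep it short. Everything in it is an assumption for illustration: the toy table, the made-up glottocodes and the function name `language_level_id` are hypothetical, and none of it is the actual glottolog-cldf or glottospace code.

```python
# Hypothetical toy languoid table: glottocode -> (level, parent glottocode).
# The codes are made up for illustration, not real Glottolog entries.
LANGUOIDS = {
    "fami1234": ("family", None),
    "lang1234": ("language", "fami1234"),
    "dial1234": ("dialect", "lang1234"),
    "subd1234": ("dialect", "dial1234"),  # a dialect nested within a dialect
}


def language_level_id(glottocode, languoids=LANGUOIDS):
    """Return the glottocode to join on for one languoid.

    Dialects are walked up the parent chain until a languoid with the
    level "language" is found. Languages and families are returned
    unchanged, mirroring the filled-in "Language_level_ID" column.
    """
    level, parent = languoids[glottocode]
    if level != "dialect":
        return glottocode
    current = parent
    while current is not None:
        level, parent = languoids[current]
        if level == "language":
            return current
        current = parent
    return None  # no language-level ancestor recorded in the table


# One join key per languoid in the toy table.
language_level_ids = {code: language_level_id(code) for code in LANGUOIDS}
```

With the column already present in the languages table this loop is no longer needed in practice; the sketch only shows what the column encodes.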

An improvement of this method could be to merge dialects based on their common ancestor, even if that ancestor is itself a dialect. That may be overdoing it though; joining based on the language-level parent is probably best after all.

So, in summary: either an option to an existing function or a new function which

1. checks if the two datasets have any directly matching dialect glottocodes (if both have the same dialect then there's no need to reduce to the language level)
2. for any remaining dialects, checks if they can be matched via the language level
3. in any cases where one of the datasets has more than one dialect for the same language, merges them in a principled way. The user could choose between these:
   - combine all datapoints for all the dialects and, if there is more than one datapoint for a given feature/word/variable, choose randomly between them
   - users manually specify which dialects they prefer if they have to choose
   - choose randomly between the dialects
   - choose the dialect which has the most datapoints and only use those in the merged dataset
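The steps above could be sketched roughly as follows, again in plain Python with dictionaries standing in for data frames. All names here (`LANGUAGE_LEVEL_ID`, `pick_row`, `reduce_to_language`, `join_language_level`, the strategy labels) and all glottocodes are hypothetical. The sketch assumes each dataset maps a glottocode to a dictionary of feature values and that a language-level lookup is already available. It does an inner join only, and ties in "most_datapoints" go to the dialect listed first.

```python
import random

# Hypothetical lookup from glottocode to language-level glottocode
# (the "Language_level_ID" idea); all codes are made up.
LANGUAGE_LEVEL_ID = {
    "lang1234": "lang1234", "dial1111": "lang1234", "dial2222": "lang1234",
    "lang5678": "lang5678", "dial3333": "lang5678", "dial4444": "lang5678",
}


def pick_row(members, strategy, preferred, rng):
    """Merge several (glottocode, row) pairs of one language into one row."""
    if strategy == "combine":
        # Pool all datapoints, then choose randomly per feature.
        values = {}
        for _, row in members:
            for feature, value in row.items():
                values.setdefault(feature, []).append(value)
        return {feature: rng.choice(vals) for feature, vals in values.items()}
    if strategy == "manual":
        for code, row in members:
            if code in preferred:
                return dict(row)
        raise ValueError("no preferred dialect given for this language")
    if strategy == "random":
        return dict(rng.choice(members)[1])
    if strategy == "most_datapoints":
        return dict(max(members, key=lambda m: len(m[1]))[1])
    raise ValueError(f"unknown strategy: {strategy}")


def reduce_to_language(dataset, keep, strategy="most_datapoints",
                       preferred=(), rng=None):
    """Steps 2 and 3: codes not in `keep` are grouped by language level."""
    rng = rng or random.Random()
    groups = {}
    for code, row in dataset.items():
        key = code if code in keep else LANGUAGE_LEVEL_ID[code]
        groups.setdefault(key, []).append((code, row))
    return {
        key: dict(members[0][1]) if len(members) == 1
        else pick_row(members, strategy, preferred, rng)
        for key, members in groups.items()
    }


def join_language_level(a, b, **kwargs):
    """Inner join of two datasets on directly shared or language-level codes."""
    shared = set(a) & set(b)  # step 1: glottocodes present in both datasets
    a_red = reduce_to_language(a, shared, **kwargs)
    b_red = reduce_to_language(b, shared, **kwargs)
    return {code: (a_red[code], b_red[code]) for code in a_red if code in b_red}


# Toy datasets: `a` holds feature values, `b` holds word forms.
a = {"dial1111": {"f1": 1, "f2": 0}, "dial2222": {"f1": 0},
     "dial3333": {"f1": 1}, "dial4444": {"f1": 0, "f2": 1}}
b = {"dial1111": {"w1": "x"}, "lang1234": {"w1": "y"}, "lang5678": {"w1": "z"}}

joined = join_language_level(a, b, strategy="most_datapoints")
```

In this toy example "dial1111" is matched directly, "dial2222" is matched to "lang1234" via the language level, and "dial3333" and "dial4444" are merged into "lang5678" before joining. Passing a seeded `random.Random` as `rng` would make the random strategies reproducible.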

Once again, glad you're doing this and happy to be invited along!

HedvigS · 2022-02-23T13:04:57Z

@SietzeN Would you like me to write some pseudocode or tidyverse code to exemplify this operation?

SietzeN · 2022-02-23T14:04:11Z

That would be great. Perhaps we can add a new function to glottojoin.R, or build on this one: https://github.com/SietzeN/glottospace/blob/fe0b3ed8fb1ff87105fd932411de44c162c1e180/R/glottojoin.R#L196

SietzeN · 2022-02-23T14:06:55Z

P.S. I'm not sure whether random assignment is such a good idea. I think it's better if the user specifies the kind of join: https://r4ds.had.co.nz/relational-data.html

HedvigS · 2022-02-23T15:54:04Z

> p.s. I'm not sure whether random assignment is such a good idea. I think it's better if the user specifies the kind of join: https://r4ds.had.co.nz/relational-data.html

Sure, it's not always a great idea, but if users have no a priori reason to pick one or the other, picking randomly is better than defaulting to, for example, always picking the first, which I've seen people do in similar circumstances.

HedvigS · 2022-02-23T15:54:20Z

> That would be great. Perhaps we can add a new function to glottojoin.R, or build on this one:
> https://github.com/SietzeN/glottospace/blob/fe0b3ed8fb1ff87105fd932411de44c162c1e180/R/glottojoin.R#L196

Okay I'll have a go!

HedvigS · 2022-06-23T16:07:06Z

Late, sorry. But here is a first draft of an approach:

https://github.com/HedvigS/personal-cookbook/blob/main/R/glottojoin_language_level.R

HedvigS · 2022-12-19T12:17:06Z

What do you think, @SietzeN?

HedvigS · 2023-05-03T20:18:48Z

@SietzeN if this isn't a good fit for this package, I might move to try to convince Simon to put it in rcldf instead.

HedvigS · 2023-05-03T20:19:20Z

I take that back: I've apparently already tried that and it hasn't worked. Oh well.

SietzeN · 2023-05-09T08:47:20Z

Hi @HedvigS !

Thanks for the message. Sorry, I now see that I missed your previous message with the link to the script.

Implementing the suggested changes in glottojoin might not be the easiest route, because the function's behaviour changes depending on the input type. So a standalone function might be easier. How about 'glottodialang'?

If you create a pull request, I'm happy to add it!

Cheers!
Sietze

HedvigS · 2023-05-31T13:06:28Z

It's alright. I've suggested a function for language-levelling datasets to rgrambank (grambank/rgrambank#3); once two datasets are levelled, they can be joined together. Maybe this is the better path to take.

SietzeN · 2023-05-31T13:18:04Z

Great! Yes, that sounds like a more suitable approach. And congrats on your nice paper!

HedvigS · 2023-05-31T13:36:08Z

Thank you :)!

HedvigS mentioned this issue Dec 19, 2022

adding reduce_to_language SimonGreenhill/rcldf#25

Closed
