selection of subsets based on coverage #6
Comments
The average coverage is given as a percentage, I got it. The average coverage is calculated as the average of the mutual coverage and converted into a percentage. Average coverage seems to include almost all the languages. The average is a skewed measure that is easily influenced by a few high-coverage language pairs. Minimal mutual coverage is much more reasonable, since language pairs with a mutual coverage below the threshold are excluded. What happens when a language has low mutual coverage (<100) with one other language but high mutual coverage with the rest of the languages? Regarding AN: I added Bouchard-Côté's dataset of 640 languoids to the repo. |
Wait: average coverage is just the mean over all languages of (concepts-of-language-n / concepts-in-data). Minimal mutual coverage is the minimum over all language pairs of len(set(concepts-language-1).intersection(concepts-language-2)). |
The procedure you can see in the script is easy: iterate over all languages, exclude those with low average coverage, and check whether the minimal mutual coverage increases; if it does and the score is good, keep this set. |
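In code, the two scores and a greedy version of that pruning loop could look roughly like this. This is only a sketch over a mapping from each language to its set of attested concepts, not the actual check_data.py:

from itertools import combinations

def average_coverage(concepts_by_language, all_concepts):
    """Mean share of the concept list that each language covers (a fraction)."""
    shares = [len(c) / len(all_concepts) for c in concepts_by_language.values()]
    return sum(shares) / len(shares)

def minimal_mutual_coverage(concepts_by_language):
    """Smallest number of concepts shared by any pair of languages."""
    return min(len(a & b) for a, b in combinations(concepts_by_language.values(), 2))

def prune_languages(concepts_by_language, target_mmc):
    """Greedy variant of the procedure described above: drop the language with
    the fewest attested concepts until every pair shares at least target_mmc
    concepts (or only two languages remain)."""
    kept = dict(concepts_by_language)
    while len(kept) > 2 and minimal_mutual_coverage(kept) < target_mmc:
        worst = min(kept, key=lambda lang: len(kept[lang]))
        del kept[worst]
    return kept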
We'll need to clean the 640 languoids, which is a pain in the neck. I wonder whether we should ask Pavel whether the 400 languages we had already prepared for the SVM paper, but split into 4 sets, could be combined (they are more or less cleaned), and we pick the one with the highest coverage? |
I understand the table much better now. So, mutual coverage reduces PN and AA by more than 50%. One good thing is that the Bayesian programs will converge faster. I added the ABVD 400-language dataset to the repo. |
excellent, I'll test right away. |
Okay, this result hurts (but I believe there is a systematic error in the data that we introduced when first working on it): there are 640 languages, but with a coverage of 89%, I get only 31 languages! Mutual coverage here is then 154. |
If we accept 86% average coverage, we get 65 languages, with a mutual coverage of 136. |
What about the other dataset of 396 languages used in the Gray et al. paper? https://github.com/PhyloStar/AutoCogPhylo/blob/master/data/abvd2.tsv |
Ugly ASJP, I don't want to touch it. But I'll give it a try. |
Coverage is better there: 33 languages, MC of 162, 92% AC. |
Yes. Both coverages are so low. Sorry about the ASJP. I uploaded the ASJP file and added the original file extracted by Pavel. |
No problem ;) we could also handle it with the SVM approach, so it's no big deal. |
Okay. This is much better. 33 languages for ABVD is not bad given the coverage. |
I extracted an Oceanic subset of 160 languages from the full ABVD. Do you want to test the coverage on that dataset as well? |
The 160 languages come from the punctuational bursts paper of Atkinson. |
26 languages, MC 157, 89% AC for Oceanic... |
Oceanic can be thrown out then. |
What is the command to run the coverage test? I want to check the mutual coverage for a lower average coverage of around 70%. |
You run: $ python check_data.py an coverage cutout 100. The 100 means: retain only those languages that have a word for more than 100 different concepts. It returns some scores, first for the original word list, then for the derived one. You need to change the "IPA" column to "tokens", unless I have already done so; "Gloss" -> "concept" works better. I usually use cutout=180 or cutout=170 for 200-concept lists. |
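For reference, a self-contained sketch of what such a cutout check boils down to. This is not the actual check_data.py; it assumes a TSV word list with DOCULECT and CONCEPT columns, and the file name is hypothetical:

import csv
from itertools import combinations

def load_concepts(path):
    """Map each language to the set of concepts for which it has a word."""
    concepts = {}
    with open(path, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            concepts.setdefault(row["DOCULECT"], set()).add(row["CONCEPT"])
    return concepts

def report(concepts):
    """(number of languages, average coverage in %, minimal mutual coverage)."""
    total = len(set.union(*concepts.values()))
    ac = 100 * sum(len(c) / total for c in concepts.values()) / len(concepts)
    mmc = min(len(a & b) for a, b in combinations(concepts.values(), 2))
    return len(concepts), round(ac), mmc

concepts = load_concepts("an.tsv")                            # hypothetical file
print("original:", report(concepts))                          # scores before pruning
pruned = {l: c for l, c in concepts.items() if len(c) > 100}  # cutout 100
print("cutout 100:", report(pruned))                          # scores after pruning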
I'm glad you immediately grasped the point about mutual pairwise coverage. It is just one score, for the worst language pair, but it is a warning sign. There are more refined algorithms in lingpy, but this is the easiest way to at least see whether you have a problem in the data. It's funny: we always thought that having 200-concept lists should be fine for lingpy-lexstat, but we never checked the actual number of concepts for which there is a word in the data. Only when I realized this did I understand why lingpy constantly performs so badly on AN languages. In our ST dataset (not the one here, but our own freshly collected data), we said that we only retain languages with a coverage of 80% of our base list. Later we reduced the base list, so we now have 89% coverage, which is a good score, I think, but ideally it should be 90% or higher. I discussed this a lot with colleagues; many did not believe me that it is a problem if you have a couple of languages with low mutual coverage or low average coverage. While it is evident why lingpy struggles, it was only my gut feeling that told me this should also have an impact on phylogenies. In fact, ASJP retains languages with 32 words. This means the potential lowest mutual coverage of two languages is 40 - 8 - 8, as the lists may be skewed, which is just 24 words, and the average coverage may be below 70%. If you allow each language to cover only 70% of the concepts in the sample, you may end up with a mutual coverage of 40%! This deserves more attention in computational historical linguistics in general! |
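The worst-case arithmetic behind this is easy to make explicit (a small illustration, not taken from any script in the repo):

def worst_case_mutual_coverage(n_concepts, coverage):
    """If every language covers a fraction `coverage` of an n_concepts list,
    each misses (1 - coverage) * n_concepts items; in the worst case the two
    missing sets do not overlap, leaving (2 * coverage - 1) * n_concepts."""
    return max(0, (2 * coverage - 1) * n_concepts)

print(worst_case_mutual_coverage(40, 32 / 40))   # ASJP: 40 - 8 - 8 = 24.0 words
print(worst_case_mutual_coverage(200, 0.7))      # 70% coverage -> 80.0, i.e. 40% of the list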
A coverage of 70% means that the worst-case mutual coverage is 40%. This is really terrifying for longer word lists; imagine what can happen with ABVD. What we need to know is the average mutual coverage after the initial pruning threshold of X%, which depends on the dataset. The minimal mutual coverage then needs to be at least 70% of the size of the concept list. I tried to look for a minimal mutual coverage of 70% of the concept list for each dataset. So, here are the statistics for the different datasets.
It seems like we can salvage more languages with a 70% minimal mutual coverage cutoff. |
I'd say, as a rule of thumb, for lexstat-like operations everything above 100 should in principle work, but I'd prefer even more. In ST we don't have a chance and don't need to bother, so 90 is about the best we can get, but we can also go for the 70% thingy. The good thing with MMC is that we know it is the worst case, so PN, IE, and AA are above this threshold, and for ST we don't have a chance. So we could go with that. Should we just discard AN, or is there a simple way to find out whether the concepts there are equally badly distributed? E.g., by reducing the number of concepts initially and concentrating on the best ones, taking some 180 (as I did in PN, where I only took Swadesh and ABVD concepts, which is why we have 183; compare pn-full.tsv where we have all concepts, very skewed): could we increase MMC and ACC and arrive at a better combination of the two? I have the gut feeling that this is a hard problem to optimize; I was thinking a lot about it, but couldn't come up with a deterministic algorithmic solution. |
In general, I think this is quite an interesting problem: find the partition of the dataset, by deleting languages and concepts, which maximizes the number of languages in the sample and the number of concepts per language for which there is a word. It is difficult to counterbalance these two, and it is not trivial to decide which concepts to delete... |
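As a rough illustration of what a heuristic for this could look like (by no means a deterministic or optimal solution, just a sketch over a set of (language, concept) pairs): greedily drop whichever single language or concept is currently worst covered until the minimal mutual coverage reaches a target.

from itertools import combinations

def greedy_partition(entries, target_mmc):
    """entries: set of (language, concept) pairs for which a word is attested.
    Greedily drop the relatively worst-covered language or concept until the
    minimal mutual coverage reaches target_mmc (or only two languages remain)."""
    entries = set(entries)
    while True:
        langs, concepts = {}, {}
        for l, c in entries:
            langs.setdefault(l, set()).add(c)      # language -> its concepts
            concepts.setdefault(c, set()).add(l)   # concept -> its languages
        if len(langs) <= 2:
            return langs
        mmc = min(len(a & b) for a, b in combinations(langs.values(), 2))
        if mmc >= target_mmc:
            return langs
        worst_l = min(langs, key=lambda l: len(langs[l]))
        worst_c = min(concepts, key=lambda c: len(concepts[c]))
        # drop whichever of the two is relatively worse covered
        if len(langs[worst_l]) / len(concepts) <= len(concepts[worst_c]) / len(langs):
            entries = {(l, c) for l, c in entries if l != worst_l}
        else:
            entries = {(l, c) for l, c in entries if c != worst_c}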
We will be left with 54 languages and an MMC of 153, which is more than 70%, for the abvd2 dataset. Reviewers will quite likely ask why we discarded abvd2. A lot of effort went into ABVD, and it is still in such a shape. :( |
It's something nobody considered some 10 years ago, I'm afraid. I am so glad we did better now with our ST database; I was initially insisting on 80% coverage, knowing we could still discard some meanings. If we say in one sentence, quoting lingpy-2.6, that our choice of test sets is based on coverage, as we know that this may influence cognate detection, we should be fine, though. So should we start from there? I can add a statement tomorrow and output the datasets in the revised form. But it's also comforting to see that Bouchard-Côté's ABVD was not better either. This shows me and Pavel didn't mess things up when cleaning ABVD for the SVM paper... |
Yes, an 80% threshold is a simple way to prune languages with low coverage. I agree that quoting LingPy should be sufficient. I will run the Turchin, PMI, and Levenshtein systems and generate nexus files. Getting the gold-standard trees should be fast; I will do it if Johannes is on vacation. Gerhard should be back from vacation tomorrow, and he can generate the SVM nexus files. I will put the Bayesian runs on the server. The runs should be fast since we work with a smaller number of languages and do not perform dating. I was thinking of a simple procedure to test the effect of coverage on LexStat. On a dataset, apply coverage thresholds ranging from 10% to 100% and prune the languages accordingly. Estimate the LexStat parameters on each of the pruned datasets. Then use the trained system to cluster the unpruned dataset; when a language pair is missing from LexStat, use SCA to calculate the word similarities for that pair. This would demonstrate the effect of missing languages on the estimation of sound alignment probabilities in LexStat. This is not for the current paper, but for a separate paper where we look at the effect of such hyperparameters, e.g. the weighting parameter in LexStat; in the case of PMI it would be the number of word pairs processed in each batch. The effect of the input sound-class model also needs to be compared. This could go to LREC or a SIGMORPHON workshop paper. |
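A sketch of only the first half of that experiment, the coverage-threshold sweep (not the scorer-transfer step with the SCA fallback, which would need more plumbing). File names, column names (DOCULECT, CONCEPT, a gold-standard COGID, plus the ID and IPA/TOKENS columns LexStat expects) and the clustering threshold are all assumptions:

import csv
from lingpy import LexStat
from lingpy.evaluate.acd import bcubes

def prune_tsv(in_path, out_path, threshold):
    """Write a copy of the word list keeping only languages whose coverage
    (attested concepts / all concepts in the file) is at least `threshold`."""
    with open(in_path, encoding="utf-8") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    per_lang = {}
    for r in rows:
        per_lang.setdefault(r["DOCULECT"], set()).add(r["CONCEPT"])
    total = len(set.union(*per_lang.values()))
    keep = {l for l, c in per_lang.items() if len(c) / total >= threshold}
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        w = csv.DictWriter(f, fieldnames=rows[0].keys(), delimiter="\t")
        w.writeheader()
        w.writerows(r for r in rows if r["DOCULECT"] in keep)

for threshold in [i / 10 for i in range(1, 11)]:        # 10% ... 100%
    prune_tsv("abvd2.tsv", "abvd2_pruned.tsv", threshold)
    lex = LexStat("abvd2_pruned.tsv")
    lex.get_scorer(runs=1000)
    lex.cluster(method="lexstat", threshold=0.6, ref="lexstatid")
    p, r, f = bcubes(lex, "cogid", "lexstatid", pprint=False)
    print(threshold, round(f, 3))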
Let me check the cleaning of the data first. I'll recompute the cognates then, as Sino-Tibetan was flawed, and we had some bugs in the other data as well. I'll try to add a column for each of SCA, DOLGO, and ASJP in your preferred format, okay? |
A somewhat relevant topic for our discussion -- on looking into the data -- on the NAACL blog by the COLING chair: https://naacl2018.wordpress.com/2017/12/19/putting-the-linguistics-in-computational-linguistics/ At least the first point is quite relevant for us. |
Yes, I think, we should consider making this a little spin-off project, looking at the degree to which phylogenetic reconstruction algos suffer from distorted data. In fact: this is easy to simulate. Just take a high-coverage dataset, compute the trees and dates, and then delete data-points randomly, recompute, and compare. Problem: Bayesian approaches take a long time. Running many analyses will be a pain in the neck. But one could probably approximate by using distance measures. |
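A minimal sketch of that simulation step, random deletion of data points, assuming a TSV word list; file names and the deletion rates are hypothetical:

import csv
import random

def distort(in_path, out_path, deletion_rate, seed=42):
    """Randomly delete a fraction of the word-list entries, so that trees and
    dates can be recomputed on the distorted data and compared to the full set."""
    random.seed(seed)
    with open(in_path, encoding="utf-8") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    kept = [r for r in rows if random.random() > deletion_rate]
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        w = csv.DictWriter(f, fieldnames=rows[0].keys(), delimiter="\t")
        w.writeheader()
        w.writerows(kept)

for rate in (0.1, 0.2, 0.3):             # delete 10%, 20%, 30% of the entries
    distort("high_coverage.tsv", "distorted_{0}.tsv".format(int(rate * 100)), rate)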
If you consider making this a spin-off project, you should have a look at Igor's work: https://www.lorentzcenter.nl/lc/web/2015/767/abstracts.pdf (p. 35). According to his webpage, there is a corresponding paper currently under review. There is another thing about MMC we should consider: how bad is it for a particular dataset? By this I mean, is it just one pair of languages with a low MC, while the mean of the remaining language pairs is much higher, i.e. what does the distribution of MC scores look like? This may be another point to add to the paper, to convince reviewers why there is a need to throw out certain languages or even entire datasets. |
Yes, I mean, it is trivial to even average MC (AMC) across all languages. I just refrained from this, as it may hide a particularly bad dataset. LingPy offers both scores, but so far I only considered MMC, since it's faster to compute, and ACC (average concept coverage) additionally shows us whether we have a problem (AN HAS a problem, this is clear now, as its ACC is very low as well). We might add the AMC to our calculations; in lingpy it's just:
>>> from lingpy.compare.sanity import mutual_coverage
>>> mc = mutual_coverage(wordlist)
>>> amc = sum(mc.values()) / len(mc)
As to Igor's paper: I remember I read it, but it's a pity it was never published, as in this dense form I really don't know what to do with it. But you're right that it may point in a similar direction, although Igor is less pessimistic about humans messing up the coding... |
Okay, I just checked PN languages, and I set up the following requirements for the data quality:
1. average coverage should be 95% and higher
2. mutual coverage should be > 100 for concept lists with more than 100 concepts, minimally 90
This leaves us with the following scores:
AN is out, as it is by no means close to our criteria; it cannot be used in this form, unless you provide another dataset.
I can prepare the data accordingly and submit reduced lists along the lines mentioned above. |