{{ message }}
/ vocabulary-tools Public
Switch branches/tags
Nothing to show

Cannot retrieve contributors at this time
86 lines (63 sloc) 2.88 KB

This is a literate doctest. Run `python -m doctest -v log_rank_differences.rst` to test.

Let's first of all grab Vanessa Gorman's lemmatisation of parts of Thucydides and Xenophon.

`>>> from lemma_counts import LemmaCounts`
```>>> gorman = LemmaCounts()
```>>> thucydides = gorman.get_counts("0003")
>>> xenophon = gorman.get_counts("0032")```

We can easily see the most common lemmas in each:

```>>> thucydides.most_common(10)
[('ὁ', 3672), ('καί', 1836), ('δέ', 959), ('τε', 608), ('εἰμί', 593), ('αὐτός', 593), ('οὐ', 452), ('μέν', 337), ('εἰς', 329), ('ὅς', 312)]```
```>>> xenophon.most_common(10)
[('ὁ', 6236), ('καί', 3101), ('δέ', 2325), ('εἰμί', 1303), ('οὗτος', 1031), ('αὐτός', 1026), ('μέν', 771), ('τε', 563), ('οὐ', 562), ('τις', 536)]```

But what if we were interested in lemmas that were common in one but not the other. We can easily compare the rank of any given lemma in each subcorpus:

`>>> from calc import rank_with_ties`
```>>> r_thuc = rank_with_ties(thucydides)
>>> r_xeno = rank_with_ties(xenophon)```
```>>> r_thuc["Κῦρος"], r_xeno["Κῦρος"]
(885, 30)```
```>>> r_thuc["δράω"], r_xeno["δράω"]
(196, 2345)```

In other words, Κῦρος is rank 30 in Xenophon but only rank 885 in Thucydides. δράω is rank 196 in Thucydides but only rank 2345 in Xenophon.

What if we wanted a list of all the lemmas in both sorted by how far their ranks differ? This is where log_rank_differences comes in.

`>>> from calc import log_rank_differences`
`>>> log_rank_diff = log_rank_differences(thucydides, xenophon)`
```>>> for lemma, log_diff in sorted(log_rank_diff.items(), key=lambda i: i, reverse=True)[:10]:
...     print(lemma, round(log_diff, 3))
Κῦρος 4.883
Μυτιληναῖοι 4.537
Κορίνθιος 3.964
Ἀττική 3.919
ἀμύνω 3.713
σοι 3.637
δράω 3.581
Πελοπόννησος 3.565
ἐπιμελέομαι 3.56
θύω1 3.268```

As well as comparing authors (or works) we can compare different lemmatisations of the same subcorpus.

Let's load counts for the subcorpus of Gorman and Diorisis that overlaps.

```>>> gorman_overlap = LemmaCounts()
>>> diorisis_overlap = LemmaCounts()
```>>> overlap_log_rank_diff = log_rank_differences(gorman_overlap.get_counts(), diorisis_overlap.get_counts())
>>> for lemma, log_diff in sorted(overlap_log_rank_diff.items(), key=lambda i: i, reverse=True)[:10]:
...     print(lemma, round(log_diff, 3))
λέγω 5.974
Ἀθήναια 5.159
ἔπειμι 4.507
εὐθύς 4.487
ἡδονή 4.356
πρόσειμι 4.326
πάρειμι 4.323
τίς 4.309
ποτός 4.265
ὁδός 4.201```

If the texts were identically lemmatised, they'd have log diffs of zero.

But this can be used to help find systemic differences in lemmatisation.