In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import pandas as pd
%matplotlib widget
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10, 5)

from uriel import Uriel
from utils import fam
from papers import papers
from visualizations import show_methods, show_families, show_diff, show_languages

In [3]:
u = Uriel(load=True, umap=True)

In [4]:
def diagnose(paper):
    # Show family coverage, the language map, and per-method scores for one paper.
    languages, scores = papers[paper]
    show_families(u, languages)
    show_languages(u, languages)
    show_methods(u, languages, scores, fam.gis)


def compare(paper, method1, method2):
    # method1 and method2 are column indices into the paper's score matrix.
    languages, scores = papers[paper]
    print(scores[:, method1], scores[:, method2])
    show_diff(u, languages, scores[:, method1], scores[:, method2])

- Most datasets seem to be stable.
- There is usually no particular difference between using GIS and Indo-European languages, probably because not many non-IE languages are used.
- I found a linguistic bias in only two cases: Rahimi 2019 NER and Heinzerling 2019 POS. Are the differences statistically significant (especially in the Heinzerling case)? Are there any other examples that have the same problem?
- Not many papers do a proper multilingual evaluation with all the scores reported. When a multilingual evaluation is done, it is usually not compared against different approaches, probably because it is already hard to make it work with one approach.
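
One way to approach the significance question above is a permutation test on the per-language gain of one method over another, split into GIS and non-GIS languages. The sketch below is only an illustration: `permutation_test` is a hypothetical helper, not part of this notebook's modules, and the toy numbers are invented rather than taken from any paper.

```python
import numpy as np


def permutation_test(gain, is_gis, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of group means.

    gain   : per-language score difference between two methods
    is_gis : boolean mask, True for GIS languages
    """
    gain = np.asarray(gain, dtype=float)
    is_gis = np.asarray(is_gis, dtype=bool)
    rng = np.random.default_rng(seed)

    observed = gain[is_gis].mean() - gain[~is_gis].mean()
    hits = 0
    for _ in range(n_resamples):
        # Shuffle the group labels while keeping the group sizes fixed.
        shuffled = rng.permutation(is_gis)
        stat = gain[shuffled].mean() - gain[~shuffled].mean()
        hits += int(abs(stat) >= abs(observed))
    # The add-one correction keeps the p-value strictly positive.
    p_value = (hits + 1) / (n_resamples + 1)
    return observed, p_value


# Toy numbers for illustration only; they are not taken from any paper.
toy_gain = np.array([8.0, 9.5, 7.0, 10.0, 8.5, 9.0, 0.5, -1.0, 1.5, 0.0])
toy_is_gis = np.array([True] * 6 + [False] * 4)
observed, p_value = permutation_test(toy_gain, toy_is_gis)
print(f"difference of means: {observed:.2f}, p = {p_value:.4f}")
```

With only a handful of non-GIS languages the number of distinct label assignments is small, so the smallest attainable p-value is bounded from below; that is worth keeping in mind for the Heinzerling POS case.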

# Rahimi 2019

[Massively Multilingual Transfer for NER](https://arxiv.org/pdf/1902.00193.pdf). 

![image.png](attachment:2069ec99-f46a-4e14-8a36-e128e935ca5a.png)

They compare high-resource (1) and low-resource (2) training, supervised multi-source transfer learning (3-6), and unsupervised multi-source transfer learning (7-13). Unsupervised transfer learning falls behind significantly on non-GIS languages compared to low-resource supervised learning. Compare 2 and 10: overall, 10 is better by almost 8 F1, but the two have almost identical performance on non-GIS languages. This contrast might be caused by the small number of non-GIS languages in the training data. However, it might mislead people into an over-confident assessment of the capabilities of transfer learning methods.

Compare the results with Figure 3 from the paper (reproduced below), which seems to be quite conclusive. They address the linguistic imbalance partially by saying _Further analysis show that majority voting works reasonably well for Romance and Germanic languages, which are well represented in the dataset, but fails miserably compared to single best for Slavic languages (e.g. ru, uk, bg) where there are only a few related languages._

![image.png](attachment:5b68f06a-6788-4daf-b66c-db536c8a7ddf.png)

In [5]:
diagnose('rahimi_ner')

                      Family  Count  Ratio
0             Atlantic-Congo      0    0.0
1               Austronesian      3    7.5
2              Indo-European     29   72.5
3                     Slavic     10   25.0
4                   Germanic      6   15.0
5                     Italic      6   15.0
6               Afro-Asiatic      2    5.0
7                    Semitic      2    5.0
8               Sino-Tibetan      0    0.0
9   Nuclear_Trans_New_Guinea      0    0.0
10              Pama-Nyungan      0    0.0
11               Otomanguean      0    0.0
12             Austroasiatic      1    2.5
13                 Dravidian      1    2.5
14                    Turkic      1    2.5
15                    Uralic      3    7.5



In [6]:
compare('rahimi_ner', 2, 11)

[59. 64. 61. 70. 63. 62. 62. 62. 58. 61. 63. 64. 74. 67. 57. 53. 68. 61.
 59. 67. 60. 62. 68. 67. 66. 59. 65. 61. 59. 66. 53. 62. 64. 69. 69. 54.
 66. 61. 60. 55.] [74. 54. 54. 60. 77. 79. 72. 79. 64. 34. 76. 71. 52. 69. 73. 46. 58. 77.
 69. 61. 76. 75. 67. 48. 63. 78. 77. 77. 74. 74. 36. 76. 76. 76. 69. 25.
 57. 67. 49. 48.]
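
The two vectors printed above are hard to read side by side. As a rough sketch, they can be summarised per language; the arrays below are copied from that output, while the variable names and the choice of summary statistics are mine.

```python
import numpy as np

# Scores copied from the output of compare('rahimi_ner', 2, 11) above.
method_2 = np.array([
    59, 64, 61, 70, 63, 62, 62, 62, 58, 61, 63, 64, 74, 67, 57, 53, 68, 61,
    59, 67, 60, 62, 68, 67, 66, 59, 65, 61, 59, 66, 53, 62, 64, 69, 69, 54,
    66, 61, 60, 55,
], dtype=float)
method_11 = np.array([
    74, 54, 54, 60, 77, 79, 72, 79, 64, 34, 76, 71, 52, 69, 73, 46, 58, 77,
    69, 61, 76, 75, 67, 48, 63, 78, 77, 77, 74, 74, 36, 76, 76, 76, 69, 25,
    57, 67, 49, 48,
], dtype=float)

# Positive values mean method 11 scores higher on that language.
diff = method_11 - method_2
wins = int((diff > 0).sum())
losses = int((diff < 0).sum())
ties = int((diff == 0).sum())

print(f"mean: {method_2.mean():.2f} vs {method_11.mean():.2f}")
print(f"std:  {method_2.std():.2f} vs {method_11.std():.2f}")
print(f"method 11 wins/losses/ties: {wins}/{losses}/{ties}")
print(f"largest drop: {diff.min():.0f}, largest gain: {diff.max():.0f}")
```

The mean difference is small compared to the per-language swings, which is the kind of pattern an aggregate score hides. Which languages sit at the extremes has to be read from the `show_diff` plot, since the printed arrays carry no language labels.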


# Heinzerling 2019

[Sequence Tagging with Contextual and Non-Contextual Subword Representations: A Multilingual Evaluation](https://arxiv.org/pdf/1906.01569.pdf)

There seems to be a problem with POS tagging, which covers a smaller number of languages. Is it even statistically significant (only 6 non-GIS languages)? On the other hand, we can see that BERT (6) is much worse than the rest of the pack, and its improvements (7 and 8) are also at the bottom of the pack, similarly to 9 and 10. There is also `heinzerling_pos_low`, which has 6 additional languages, but only for 4 methods.

**Note:** The original data had NRM for Norman Wikipedia, but NRM is the ISO code for the Narom language from Malaysia. NRM was changed to NRF (the Norman ISO code) in the data. Similarly, ARC (Old Aramaic) was corrected to SYR (Syriac).
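
The correction described in the note can be expressed as a small lookup applied to the language codes before they are matched against URIEL. This is a minimal sketch: `ISO_FIXES` and `fix_iso_codes` are hypothetical names, and lowercase codes are an assumption about how the data is stored.

```python
# Wrong code -> intended code, as described in the note above.
ISO_FIXES = {"nrm": "nrf", "arc": "syr"}


def fix_iso_codes(codes):
    """Lowercase each code and replace the known wrong ones."""
    return [ISO_FIXES.get(code.lower(), code.lower()) for code in codes]


print(fix_iso_codes(["NRM", "arc", "fra"]))
```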

In [None]:
diagnose('heinzerling_ner')

In [None]:
diagnose('heinzerling_pos')

In [None]:
compare('heinzerling_pos', 0, 10)

# Artetxe 2020

[Translation Artifacts in Cross-lingual Transfer Learning](https://www.aclweb.org/anthology/2020.emnlp-main.618.pdf)

XNLI seems to be quite stable. GIS languages make up only 39% of it.

In [None]:
diagnose('artetxe_nli')

In [None]:
compare('artetxe_nli', 1, 8)

In [None]:
diagnose('artetxe_nli_2')

# Huang 2019

[Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks](https://arxiv.org/abs/1909.00964)

Similarly to Artetxe above, NLI seems to be stable.

In [None]:
diagnose('huang_nli')

# Longpre 2020

[MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering](https://arxiv.org/pdf/2007.15207.pdf)

There are slight changes in EM score between some models, but nothing dramatic is going on here. F1 looks particularly stable.

In [None]:
diagnose('longpre_qa_em')

In [None]:
diagnose('longpre_qa_f1')

# Wang 2020

[Extending Multilingual BERT to Low-Resource Languages](https://arxiv.org/pdf/2004.13640.pdf)

Very diverse set of languages. No bias there. They use some languages as source languages (Sinhala, Hindi), but I have not found any bias coming from these either.

In [None]:
diagnose('wang_ner')

In [None]:
compare('wang_ner', 1, 3)

# _UD_ performance

There are some underperformers, e.g. _SLT-Interactions (Bengaluru)_ seems really GIS-oriented compared to _IBM NY (Yorktown Heights)_. But we are talking about $\pm2\%$.

In [None]:
diagnose('ud')

In [None]:
compare('ud', 23, 15)

# XTREME

XLMR has great results, but fails completely on some languages where MMNT (based on an NMT system) is still okay. This further shows how MLLMs fail on certain languages, and that most of them are dominated by IE languages and the Latin writing system. Nothing interesting is happening in XTREME NER; the only thing is that mBERT seems to be better for Indo-Aryan languages than XLM or XLMR.
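
Languages where one model collapses while another still works can be flagged by thresholding the per-language gap. The sketch below assumes two aligned score vectors; `collapsed_languages`, the `margin` default, and the toy values are hypothetical and are not XTREME numbers.

```python
import numpy as np


def collapsed_languages(languages, strong, fallback, margin=20.0):
    """Return (language, gap) pairs where `strong` trails `fallback` by more than `margin`."""
    strong = np.asarray(strong, dtype=float)
    fallback = np.asarray(fallback, dtype=float)
    gap = fallback - strong
    # Largest gaps first.
    order = np.argsort(-gap)
    return [(languages[i], float(gap[i])) for i in order if gap[i] > margin]


# Toy values for illustration only; placeholder names instead of real languages.
toy_languages = ["lang_a", "lang_b", "lang_c", "lang_d"]
toy_strong = [90.0, 88.0, 30.0, 85.0]
toy_fallback = [85.0, 86.0, 70.0, 80.0]
print(collapsed_languages(toy_languages, toy_strong, toy_fallback))
```

In the notebook, the two vectors would be columns of `scores` from `papers['xtreme_pos']`, the same ones passed to `compare`.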

In [None]:
diagnose('xtreme_pos')

In [None]:
compare('xtreme_pos', 2, 3)

In [None]:
diagnose('xtreme_ner')

In [None]:
compare('xtreme_ner', 2, 3)

In [None]:
compare('heinzerling_pos_low', 2, 3)

In [None]:
diagnose('massive')

In [5]:
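import bokeh.palettes
import bokeh.plotting
import umap.plot  # the plot submodule is not loaded by a bare `import umap`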

languages, scores = papers['rahimi_ner']
scores = scores[:,2:]
scores = dict(zip(languages, scores))
winner = {
    lang: min(np.where(row == max(row))[0])
    for lang, row in scores.items()
}


metadata = u.language_metadata(u.uriel_languages)
p = umap.plot.interactive(
    u.umap_vectors,
    hover_data=metadata,
    width=600,
    height=600,
    interactive_text_search=True,
    point_size=0.1,
    interactive_text_search_alpha_contrast=0.99,
#     color='#dddddd',
)
metadata = metadata[metadata['code'].isin(languages)]
# `isos` is not defined in this notebook, so report the bare language code instead.
for l in set(languages) - set(metadata['code']):
    print(f'Language {l} not supported. It will be removed from visualization')
metadata = metadata.assign(
    x=u.umap_vectors.embedding_[metadata.index][:,0],
    y=u.umap_vectors.embedding_[metadata.index][:,1],
    color=[bokeh.palettes.Set3[12][winner[l]] for l in metadata['code']],
)
p.children[1].circle(
    x='x',
    y='y',
    size=10,
    alpha=1,
    color='color',
    source=metadata
)

bokeh.plotting.output_notebook() 
bokeh.plotting.show(p)