# Few-shot Learning with Multilingual Language Models (XGLM)

In [5]:
!pip install fairseq
!pip install sentencepiece
!pip install datasets


## Introduction

In this work, we train a family of multilingual generative language models, dubbed XGLM, on a balanced corpus covering a diverse set of languages, and study their few- and zero-shot learning capabilities in a wide range of tasks. Our largest model, with 7.5 billion parameters, sets a new state of the art in few-shot learning on more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (+7.4 accuracy points for 0-shot, +9.4 for 4-shot) and natural language inference (+5.4 for 0-shot, +5.4 for 4-shot). We have included a [model card](model_card.md) of XGLM for transparency and accountability.



## Data and Languages
XGLM models are trained on a new multilingual corpus extracted from CommonCrawl (CC100-XL), a significantly larger multilingual dataset covering 68 Common Crawl (CC) snapshots (from [Summer 2013](http://commoncrawl.org/2013/11/new-crawl-data-available/) to [March/April 2020](https://commoncrawl.org/2020/04/march-april-2020-crawl-archive-now-available/)) and consisting of 134 languages. The detailed languages and data statistics are reported in the paper (Table A.1).



## Pre-trained models

Model | Layers | Model Dim | FFN Dim | Languages | Download
---|---|---|---|---|---
`XGLM 564M` | 24 | 1024 | 4096 | trained on 30 languages|  [xglm.564M.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/xglm/xglm.564M.tar.gz)
`XGLM 1.7B` | 24 | 2048 | 8192 | trained on 30 languages|  [xglm.1.7B.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/xglm/xglm.1.7B.tar.gz)
`XGLM 2.9B` | 48 | 2048 | 8192 | trained on 30 languages|  [xglm.2.9B.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/xglm/xglm.2.9B.tar.gz)
`XGLM 7.5B` | 32 | 4096 | 16384 | trained on 30 languages|  [xglm.7.5B.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/xglm/xglm.7.5B.tar.gz)
`XGLM 4.5B` | 48 | 2048 | 16384 | trained on 134 languages|  [xglm.4.5B.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/xglm/xglm.4.5B.tar.gz)

## Pre-training Data Format
Our models were pre-trained with data in the following format (i.e., paragraphs are separated by single newlines and documents by double newlines).
```
<doc0,para0,tok0> ... <doc0,para0,tokX0> # X0: number of tokens in para0 of doc0
<doc0,para1,tok0> ... <doc0,para1,tokY0> # Y0: number of tokens in para1 of doc0

<doc1,para0,tok0> ... <doc1,para0,tokX1> # X1: number of tokens in para0 of doc1
<doc1,para1,tok0> ... <doc1,para1,tokY1> # Y1: number of tokens in para1 of doc1

...
```
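As a minimal sketch of the layout above, the helper below serializes a list of documents, each given as a list of paragraph strings, into a single text blob. The function name `to_pretraining_format` and the sample documents are hypothetical; they only illustrate the separator convention and are not part of fairseq.

```python
def to_pretraining_format(docs):
    """Join paragraphs with single newlines and documents with blank lines."""
    return "\n\n".join("\n".join(paragraphs) for paragraphs in docs) + "\n"

# Hypothetical sample: one document with two paragraphs, one with a single paragraph.
docs = [
    ["First paragraph of the first document.", "Second paragraph of the first document."],
    ["First paragraph of the second document."],
]
text = to_pretraining_format(docs)
print(text)
```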
Fairseq's preprocessing replaces newlines with the end-of-sentence symbol (`</s>`). As a result, the models never saw newline characters during pre-training, so the same preprocessing should be run prior to few-shot inference to maximize performance. For example, our language model scoring function has a `replace_newlines_with_eos` argument to trigger this preprocessing:

In [2]:
from fairseq.models.transformer_lm import TransformerLanguageModel

model_dir = 'https://dl.fbaipublicfiles.com/fairseq/models/xglm/xglm.564M.tar.gz'
lm = TransformerLanguageModel.from_pretrained(model_dir, bpe='sentencepiece')

text = """First paragraph of the first document.
Second paragraph of the first document.

First paragraph of the second document.
"""
tokens = lm.score(text, replace_newlines_with_eos=True)['tokens']
assert '\n' not in lm.decode(tokens)  # no newlines were encoded
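To make the effect of this preprocessing concrete, here is a string-level sketch. fairseq works on token IDs rather than strings, so this is only an illustration under the assumption that each line break maps to exactly one `</s>`. The helper name `newlines_to_eos` is hypothetical, and the collapsing of repeated newlines follows the `get_logprobs` helper used in the evaluation cells of this notebook.

```python
import re

EOS = "</s>"  # end-of-sentence symbol named in the text above

def newlines_to_eos(text):
    """Collapse runs of newlines, then replace each remaining newline with EOS."""
    collapsed = re.sub(r"\n+", "\n", text.strip())
    return f" {EOS} ".join(collapsed.split("\n"))

sample = "First paragraph.\nSecond paragraph.\n\nNext document.\n"
flattened = newlines_to_eos(sample)
print(flattened)
assert "\n" not in flattened  # no newline characters remain
```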



## Evaluation



In [None]:
from fairseq.models.transformer_lm import TransformerLanguageModel

model_dir = 'https://dl.fbaipublicfiles.com/fairseq/models/xglm/xglm.564M.tar.gz'
lm = TransformerLanguageModel.from_pretrained(model_dir, bpe='sentencepiece')
lm = lm.eval()
lm = lm.half()
lm = lm.cuda()

### XCOPA

In [None]:
from datasets import load_dataset

langs_xcopa = ["et", "ht", "it", "id", "qu", "sw", "zh", "ta", "th", "tr", "vi"]

xcopa = {}
for lang in langs_xcopa:
    xcopa[lang] = load_dataset("xcopa", lang)

In [18]:
xcopa["et"]["validation"][0]

{'premise': 'Mees keeras kraani lahti.',
 'choice1': 'Tualett täitus veega.',
 'choice2': 'Tilast voolas vett.',
 'question': 'effect',
 'label': 1,
 'idx': 0,
 'changed': False}

In [55]:
import re

from tqdm import tqdm

def get_logprobs(prompt):
    prompt = re.sub('\n+', '\n', prompt)  # collapse repeated newlines, which indicate separate documents
    return lm.score(prompt, replace_newlines_with_eos=True)['positional_scores']

# Zero-shot evaluation for the Choice of Plausible Alternatives (COPA) task.
# A return value of 0 indicates that the first alternative is more plausible,
# while 1 indicates that the second alternative is more plausible.
def XCOPA_eval(prompt, alternative1, alternative2):
    lprob1 = get_logprobs(prompt + "\n" + alternative1).sum()
    lprob2 = get_logprobs(prompt + "\n" + alternative2).sum()
    return 0 if lprob1 > lprob2 else 1

results_xcopa = {"idx": xcopa["et"]["test"]["idx"], 
           "label": xcopa["et"]["test"]["label"]}
for lang in langs_xcopa:
    predictions = []
    for idx, example in tqdm(enumerate(xcopa[lang]["test"])):
        predict = XCOPA_eval(example["premise"], example["choice1"], example["choice2"])
        predictions.append(predict)
    results_xcopa[lang] = predictions
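The decision rule in `XCOPA_eval` can be isolated from the model. The sketch below takes any scoring function and returns the index of the best-scoring alternative; `pick_alternative` and `toy_score` are hypothetical names, and the toy scorer is a stand-in for the summed log-probabilities, not a language model.

```python
def pick_alternative(score_fn, prompt, alternatives):
    """Return the index of the alternative whose full text scores highest."""
    scores = [score_fn(prompt + "\n" + alt) for alt in alternatives]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical stand-in scorer: a fixed penalty per character, so shorter text scores higher.
def toy_score(text):
    return -0.5 * len(text)

print(pick_alternative(toy_score, "premise", ["short", "a much longer choice"]))
```

Two details differ from the cell above and are worth keeping in mind: on an exact tie this sketch returns the first alternative, whereas `XCOPA_eval` returns 1, and because every token contributes a non-positive log-probability, a summed score tends to favor shorter alternatives, which the toy scorer exaggerates on purpose.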



In [56]:
import pandas as pd

# to_csv returns None when given a path, so its result is not assigned.
pd.DataFrame(results_xcopa).to_csv("XCOPA_xglm-564M.tsv", sep="\t", index=False)

In [57]:
results_xcopa_df = pd.read_csv("XCOPA_xglm-564M.tsv", delimiter="\t")

In [58]:
results_xcopa_df

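| | idx | label | et | ht | it | id | qu | sw | zh | ta | th | tr | vi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | 495 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
| 496 | 496 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 497 | 497 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| 498 | 498 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |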


In [60]:
accuracy = {}
for lang in langs_xcopa:
    compare = results_xcopa_df["label"] == results_xcopa_df[lang]
    acc = list(compare).count(True) / len(list(compare)) * 100
    accuracy[lang] = round(acc, 2)

accuracy

{'et': 52.4,
 'ht': 54.2,
 'it': 52.0,
 'id': 55.8,
 'qu': 49.2,
 'sw': 52.8,
 'zh': 53.2,
 'ta': 54.4,
 'th': 55.6,
 'tr': 53.0,
 'vi': 55.8}
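When reading these numbers, it helps to know how far a two-way accuracy can drift from 50% by chance alone. The sketch below uses the normal approximation to the binomial distribution and assumes independent examples; `chance_interval` is a hypothetical helper, not part of the evaluation code.

```python
import math

def chance_interval(n_examples, p_chance=0.5, z=1.96):
    """Approximate 95% interval (in accuracy points) around chance for n examples."""
    std_err = math.sqrt(p_chance * (1 - p_chance) / n_examples)
    half_width = 100 * z * std_err
    return 100 * p_chance - half_width, 100 * p_chance + half_width

low, high = chance_interval(500)  # each XCOPA test split above has 500 examples
print(f"{low:.1f} to {high:.1f}")  # 45.6 to 54.4
```

Under these assumptions, several of the per-language accuracies above lie inside this band, so they are hard to distinguish from random guessing with the 564M model.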

### XStoryCloze

In [8]:
%cd /content/drive/MyDrive/PhD Julen Etxaniz/phd/xglm/XStoryCloze

/content/drive/MyDrive/PhD Julen Etxaniz/phd/xglm/XStoryCloze


In [None]:
from datasets import load_dataset

langs_xstory = ["en", "ru", "zh", "es", "ar", "hi", "id", "te", "sw", "eu", "my"]

x_story_cloze = {}
for lang in langs_xstory:
    x_story_cloze[lang] = load_dataset('x_story_cloze.py', lang)

In [14]:
x_story_cloze["en"]["train"][0]

{'story_id': '138d5bfb-05cc-41e3-bf2c-fa85ebad14e2',
 'input_sentence_1': 'Rick grew up in a troubled household.',
 'input_sentence_2': 'He never found good support in family, and turned to gangs.',
 'input_sentence_3': "It wasn't long before Rick got shot in a robbery.",
 'input_sentence_4': 'The incident caused him to turn a new leaf.',
 'sentence_quiz1': 'He is happy now.',
 'sentence_quiz2': 'He joined a gang.',
 'answer_right_ending': 1}

In [None]:
import re

import pandas as pd
from tqdm import tqdm

def get_logprobs(prompt):
    prompt = re.sub('\n+', '\n', prompt)  # collapse repeated newlines, which indicate separate documents
    return lm.score(prompt, replace_newlines_with_eos=True)['positional_scores']

# Zero-shot evaluation for the XStoryCloze task.
# A return value of 1 indicates that the first ending is more plausible,
# while 2 indicates the second, matching the 1-based `answer_right_ending` labels.
def XStoryCloze_eval(prompt, alternative1, alternative2):
    lprob1 = get_logprobs(prompt + "\n" + alternative1).sum()
    lprob2 = get_logprobs(prompt + "\n" + alternative2).sum()
    return 1 if lprob1 > lprob2 else 2

# Gold labels are read from the English split, assuming every language shares the same example order.
results_xstory = {"idx": list(range(len(x_story_cloze["en"]["eval"]))),
                  "label": x_story_cloze["en"]["eval"]["answer_right_ending"]}
for lang in langs_xstory:
    predictions = []
    for idx, example in tqdm(enumerate(x_story_cloze[lang]["eval"])):
        # The four context sentences are joined with spaces to form the prompt.
        input_sentences = " ".join(example[f"input_sentence_{i}"] for i in range(1, 5))
        predict = XStoryCloze_eval(input_sentences, example["sentence_quiz1"], example["sentence_quiz2"])
        predictions.append(predict)
    results_xstory[lang] = predictions
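For a single record, the two texts that get scored can be built as follows. This is a sketch with a hypothetical helper name, `build_candidates`; the record is copied from the English training example printed earlier, with `story_id` omitted.

```python
def build_candidates(example):
    """Return the two texts to score: the four-sentence context plus each ending."""
    context = " ".join(example[f"input_sentence_{i}"] for i in range(1, 5))
    return [context + "\n" + example[key] for key in ("sentence_quiz1", "sentence_quiz2")]

example = {
    "input_sentence_1": "Rick grew up in a troubled household.",
    "input_sentence_2": "He never found good support in family, and turned to gangs.",
    "input_sentence_3": "It wasn't long before Rick got shot in a robbery.",
    "input_sentence_4": "The incident caused him to turn a new leaf.",
    "sentence_quiz1": "He is happy now.",
    "sentence_quiz2": "He joined a gang.",
    "answer_right_ending": 1,
}
candidates = build_candidates(example)
# Endings are labelled 1 and 2, so the 0-based list index is the label minus one.
gold_index = example["answer_right_ending"] - 1
print(candidates[gold_index])
```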

In [30]:
# to_csv returns None when given a path, so its result is not assigned.
pd.DataFrame(results_xstory).to_csv("XStoryCloze_xglm-564M.tsv", sep="\t", index=False)

In [74]:
results_xstory_df = pd.read_csv("XStoryCloze_xglm-564M.tsv", delimiter="\t")

In [63]:
accuracy = {}
for lang in langs_xstory:
    compare = results_xstory_df["label"] == results_xstory_df[lang]
    acc = list(compare).count(True) / len(list(compare)) * 100
    accuracy[lang] = round(acc, 1)

accuracy

{'en': 60.0,
 'ru': 55.9,
 'zh': 53.1,
 'es': 54.3,
 'ar': 49.6,
 'hi': 52.2,
 'id': 54.1,
 'te': 55.9,
 'sw': 53.3,
 'eu': 53.1,
 'my': 51.6}

## Transformers

In [64]:
!pip install transformers



In [65]:
import torch
import torch.nn.functional as F

from transformers import XGLMTokenizer, XGLMForCausalLM

tokenizer = XGLMTokenizer.from_pretrained("facebook/xglm-564M")
model = XGLMForCausalLM.from_pretrained("facebook/xglm-564M")

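def get_logprobs(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    # The logits at position i predict token i + 1, so the targets are the inputs shifted by one.
    input_ids, output_ids = inputs["input_ids"], inputs["input_ids"][:, 1:]
    with torch.no_grad():  # inference only, so no gradients are needed
        outputs = model(**inputs, labels=input_ids)
    logits = outputs.logits
    # output_ids is one position shorter than logits, so gather reads only the
    # first len - 1 positions and the prediction after the last token is ignored.
    logprobs = torch.gather(F.log_softmax(logits, dim=2), 2, output_ids.unsqueeze(2))
    return logprobs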
 
 
# Zero-shot evaluation for the Choice of Plausible Alternatives (COPA) task.
# A return value of 0 indicates that the first alternative is more plausible,
# while 1 indicates that the second alternative is more plausible.
def COPA_eval(premise, choice1, choice2):
    lprob1 = get_logprobs(premise + "\n" + choice1).sum()
    lprob2 = get_logprobs(premise + "\n" + choice2).sum()
    return 0 if lprob1 > lprob2 else 1
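The shift between logits and targets in `get_logprobs` is easy to get wrong, so here is a plain-Python sketch of the same alignment without torch. The names `log_softmax` and `sequence_logprob` and the toy inputs are hypothetical and serve only as an illustration.

```python
import math

def log_softmax(row):
    """Log-softmax of one list of logits, computed stably."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_norm for x in row]

def sequence_logprob(logits, token_ids):
    """Sum log p(token[i + 1] | tokens up to i); logits[i] is the prediction made at position i."""
    total = 0.0
    for position, next_token in enumerate(token_ids[1:]):
        total += log_softmax(logits[position])[next_token]
    return total

# Hypothetical toy input: 3 tokens, a vocabulary of 4, and uniform logits.
token_ids = [2, 0, 3]
logits = [[0.0] * 4 for _ in token_ids]
print(sequence_logprob(logits, token_ids))  # equals 2 * log(1/4) for uniform logits
```

The first token is never scored because nothing precedes it, and the last row of logits is never read because no token follows it.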


In [66]:
data_samples = {
    'en': [
        {
            "premise": "I wanted to conserve energy.", 
            "choice1": "I swept the floor in the unoccupied room.", 
            "choice2": "I shut off the light in the unoccupied room.",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "The flame on the candle went out.",
            "choice1": "I blew on the wick.", 
            "choice2": "I put a match to the wick.",
            "question": "cause",
            "label": "0"
        }
    ],
    'zh': [
        {
            "premise": "我想节约能源。", 
            "choice1": "我在空着的房间里扫了地板。", 
            "choice2": "我把空房间里的灯关了。",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "蜡烛上的火焰熄灭了。",
            "choice1": "我吹灭了灯芯。", 
            "choice2": "我把一根火柴放在灯芯上。",
            "question": "cause",
            "label": "0"
        }
    ],
    'ht': [  # Haitian Creole (language code "ht"; "hi" would be Hindi)
        {
            "premise": "M te vle konsève enèji.", 
            "choice1": "Mwen te fin baleye chanm lib la.", 
            "choice2": "Mwen te femen limyè nan chanm lib la.",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "Flam bouji a te etenn.",
            "choice1": "Mwen te soufle bouji a.", 
            "choice2": "Mwen te limen mèch bouji a.",
            "question": "cause",
            "label": "0"
        }
    ]
}

In [69]:
for lang in ['en', 'zh', 'ht']:
    for idx, example in enumerate(data_samples[lang]):
        predict = COPA_eval(example["premise"], example["choice1"], example["choice2"])
        print(f'{lang}-{idx}', predict, example['label'])

# en-0 1 1
# en-1 0 0
# zh-0 1 1
# zh-1 0 0
# ht-0 1 1
# ht-1 0 0

en-0 1 1
en-1 0 0
zh-0 1 1
zh-1 0 0
ht-0 1 1
ht-1 0 0
