Large language models (LLMs) learn which words tend to occur together, and in doing so they also pick up factual knowledge. For example, given the prompt "The United Nations headquarters is in __" or "Manila is the capital of __", GPT-2 can often fill in the blank with the correct answer.

Facebook (now Meta) developed a benchmark named LAMA (LAnguage Model Analysis), which builds factual sentences in English from templates. Each template has a format like "{x} was born in {y}", and the template code can fill in {x}, {y}, or both. In this example, a left-to-right generative model can most easily predict {y}, because it comes at the end of the sentence. https://github.com/facebookresearch/LAMA
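
To make the point about left-to-right models concrete, here is a minimal sketch of template filling. The function names `fill_template` and `left_to_right_prompt` are my own for illustration and are not part of the LAMA code; the sketch assumes the `{x}` / `{y}` placeholder style quoted above.

```python
def fill_template(template, subject=None, obj=None):
    """Replace {x} and/or {y}; a slot stays untouched when its value is None."""
    filled = template
    if subject is not None:
        filled = filled.replace("{x}", subject)
    if obj is not None:
        filled = filled.replace("{y}", obj)
    return filled


def left_to_right_prompt(template, subject):
    """Return the text before {y}, or None when {y} is not the last slot.

    A left-to-right model can only continue a prefix, so a template is
    usable as a prompt only if nothing but punctuation follows {y}.
    """
    filled = fill_template(template, subject=subject)
    prefix, placeholder, suffix = filled.partition("{y}")
    if not placeholder or suffix.strip(" ."):
        return None
    return prefix.rstrip()


usable = left_to_right_prompt("{x} was born in {y}", "Ada Lovelace")
unusable = left_to_right_prompt("{y} is the capital of {x}", "France")
print(usable)
print(unusable)
```

The second template returns `None` because the blank comes first, so there is no prefix for the model to continue.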

Researchers at LMU Munich created a multilingual version of the benchmark (mLAMA) in 2021. Template sentences are available in multiple languages and are filled in with data from Wikidata. https://github.com/norakassner/mlama

This notebook, created for a workshop at ML Prague, documents how I filled the Czech-language templates.

In [2]:
# https://github.com/conda-incubator/condacolab
! pip install -q condacolab
import condacolab
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:30
🔁 Restarting kernel...


In [56]:
! pip uninstall -y overrides
! pip install overrides==3.1.0

Found existing installation: overrides 6.1.0
Uninstalling overrides-6.1.0:
  Successfully uninstalled overrides-6.1.0
Collecting overrides==3.1.0
  Using cached overrides-3.1.0-py3-none-any.whl
Installing collected packages: overrides
Successfully installed overrides-3.1.0


In [None]:
%%capture
! rm -rf mlama
! rm -rf LAMA
! git clone https://github.com/norakassner/mlama
! git clone https://github.com/facebookresearch/LAMA

In [48]:
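# Note: each `!` command runs in its own subshell, so the `conda activate`
# in this cell and the next does not carry over to later cells.
! conda create -n lama37 -y python=3.7 && conda activate lama37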

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.9.2
  latest version: 4.12.0

Please update conda by running

    $ conda update -n base conda



## Package Plan ##

  environment location: /usr/local/envs/lama37

  added / updated specs:
    - python=3.7


The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-2_gnu
  ca-certificates    conda-forge/linux-64::ca-certificates-2021.10.8-ha878542_0
  ld_impl_linux-64   conda-forge/linux-64::ld_impl_linux-64-2.36.1-hea4e1c9_2
  libffi             conda-forge/linux-64::libffi-3.4.2-h7f98852_5
  libgcc-ng          conda-forge/linux-64::libgcc-ng-12.1.0-h8d9b700_16
  libgomp            conda-forge/linux-64::libgomp-12.1.0-h8d9b700_16
  libnsl             conda-forge/linux-64::libnsl

In [2]:
! conda create -n mlama -y python=3.7 && conda activate mlama

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.9.2
  latest version: 4.12.0

Please update conda by running

    $ conda update -n base conda



## Package Plan ##

  environment location: /usr/local/envs/mlama

  added / updated specs:
    - python=3.7


The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-2_gnu
  ca-certificates    conda-forge/linux-64::ca-certificates-2021.10.8-ha878542_0
  ld_impl_linux-64   conda-forge/linux-64::ld_impl_linux-64-2.36.1-hea4e1c9_2
  libffi             conda-forge/linux-64::libffi-3.4.2-h7f98852_5
  libgcc-ng          conda-forge/linux-64::libgcc-ng-12.1.0-h8d9b700_16
  libgomp            conda-forge/linux-64::libgomp-12.1.0-h8d9b700_16
  libnsl             conda-forge/linux-64::libnsl-2.0.0-h7f98852_0
  l

In [3]:
! cd mlama && pip install -r requirements.txt

Collecting Cython==0.29.2
  Downloading Cython-0.29.2-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
Collecting numpy==1.15.1
  Downloading numpy-1.15.1-cp37-cp37m-manylinux1_x86_64.whl (13.8 MB)
Collecting torch==1.0.1
  Downloading torch-1.0.1-cp37-cp37m-manylinux1_x86_64.whl (560.1 MB)
Collecting pytorch-pretrained-bert==0.6.1
  Downloading pytorch_pretrained_bert-0.6.1-py3-none-any.whl (114 kB)
Collecting allennlp==0.8.5
  Downloading allennlp-0.8.5-py3-none-any.whl (7.4 MB)
Collecting spacy==2.1.8
  Downloading spacy-2.1.8-cp37-cp37m-manylinux1_x86_64.whl (30.8 MB)
Collecting tqdm==4.26.0

In [None]:
%%capture
! wget http://cistern.cis.lmu.de/mlama/mlama1.1.zip
! unzip mlama1.1.zip
! rm mlama1.1.zip
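
Before loading the data, it helps to know how the unzipped folder is organised. Judging from the reader class used later in this notebook, each language folder holds a `templates.jsonl` file plus one `<relation>.jsonl` file of triples. The sketch below checks that layout on a toy stand-in directory so it runs without the download; the helper name `summarize_language` and the sample triple (including its `obj_label`) are illustrative assumptions, not values taken from mLAMA.

```python
import json
import os
import tempfile


def summarize_language(root, language):
    """Return (relations that have a template, relations that have a triples file)."""
    lang_dir = os.path.join(root, language)
    with open(os.path.join(lang_dir, "templates.jsonl"), encoding="utf-8") as fp:
        templated = sorted(json.loads(line)["relation"] for line in fp if line.strip())
    files = sorted(
        name[: -len(".jsonl")]
        for name in os.listdir(lang_dir)
        if name.endswith(".jsonl") and name != "templates.jsonl"
    )
    return templated, files


# Toy stand-in for ./mlama1.1 with a single Czech relation (illustrative data).
with tempfile.TemporaryDirectory() as root:
    cs_dir = os.path.join(root, "cs")
    os.makedirs(cs_dir)
    with open(os.path.join(cs_dir, "templates.jsonl"), "w", encoding="utf-8") as fp:
        row = {"relation": "P176", "template": "[X] je produkován [Y]."}
        fp.write(json.dumps(row, ensure_ascii=False) + "\n")
    with open(os.path.join(cs_dir, "P176.jsonl"), "w", encoding="utf-8") as fp:
        row = {"lineid": 0, "sub_label": "Fiat Multipla", "obj_label": "Fiat"}
        fp.write(json.dumps(row, ensure_ascii=False) + "\n")

    templated, files = summarize_language(root, "cs")

print(templated, files)
```

On the real data, calling `summarize_language("./mlama1.1", "cs")` and comparing the two lists should show which relations lack a template.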

In [1]:
import sys
sys.path.append('/content/LAMA')
sys.path.append('/content/mlama')

In [3]:
from mlama import build_encoded_dataset

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


In [8]:
from typing import Dict, List, Text
import os
import json

class MLama(object):
    """Reader for the mLAMA data: loads per-language templates and triples, then fills the templates."""

    def __init__(self, path: Text) -> None:
        super(MLama, self).__init__()
        self.path = path
        self.data = {}

    def get_all_languages(self) -> List[Text]:
        # Templates are not available for every language in this directory.
        return os.listdir(self.path)

    def get_official_languages(self) -> List[Text]:
        return ["ca", "az", "en", "ar", "uk", "fa", "tr", "it", "el", "ru", "hr", "hi", "sv", "sq", "fr", "ga", "eu", "de", "nl", "et", "he", "es", "bn", "ms", "sr",
                "hy", "ur", "hu", "la", "sl", "cs", "af", "gl", "fi", "ro", "ko", "cy", "th", "be", "id", "pt", "vi", "ka", "ja", "da", "bg", "zh", "pl", "lv", "sk", "lt", "ta", "ceb"]

    def get_relations(self, language) -> List[Text]:
        files = os.listdir(os.path.join(self.path, language))
        return [file.replace(".jsonl", "") for file in files if file != "templates.jsonl"]

    @staticmethod
    def _load_templates(path: Text) -> Dict[Text, Text]:
        templates = {}
        with open(path) as fp:
            for line in fp:
                line = json.loads(line)
                templates[line["relation"]] = line["template"]
        return templates

    @staticmethod
    def _load_triples(path: Text) -> Dict[Text, Dict[Text, Text]]:
        triples = {}
        with open(path) as fp:
            for line in fp:
                line = json.loads(line)
                triples[line["lineid"]] = line
        return triples

    def load(self, languages: List[Text] = [], relations: List[Text] = []) -> None:
        self.data = {}
        if not languages:
            languages = self.get_official_languages()
        for language in languages:
            self.data[language] = {}
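            # Resolve relations per language; reassigning `relations` here would
            # leak the first language's file list into every later language.
            language_relations = relations or self.get_relations(language)
            templates = self._load_templates(os.path.join(self.path, language, "templates.jsonl"))
            for relation in language_relations: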
                self.data[language][relation] = {}
                if relation not in templates:
                    print("Template missing for relation {} in language {}.".format(relation, language))
                self.data[language][relation]["template"] = templates.get(relation, "")
                self.data[language][relation]["triples"] = self._load_triples(
                    os.path.join(self.path, language, relation + ".jsonl"))

    @staticmethod
    def is_valid_template(template: Text) -> bool:
        return ("[X]" in template and "[Y]" in template)

    def _fill_templates(self, template: Text, triples: Dict[Text, Dict[Text, Text]], mode: Text) -> Dict[Text, Text]:
        '''
        Fill the placeholders of `template` for every triple.
        mode in ["x", "y", "xy"]: replace [X], [Y], or both.
        '''
        if not self.is_valid_template(template):
            print("Invalid template: {}".format(template))
            return {}
        else:
            filled_templates = {}
            for triple_id, triple in triples.items():
                filled_templates[triple_id] = template
                if "x" in mode:
                    filled_templates[triple_id] = filled_templates[triple_id].replace("[X]", triple["sub_label"])
                if "y" in mode:
                    filled_templates[triple_id] = filled_templates[triple_id].replace("[Y]", triple["obj_label"])
            return filled_templates

    def fill_all_templates(self, mode: Text):
        for language in self.data:
            for relation in self.data[language]:
                self.data[language][relation]["filled_templates"] = self._fill_templates(
                    self.data[language][relation]["template"], self.data[language][relation]["triples"], mode)


In [30]:
ml = MLama("./mlama1.1")
ml.load(languages=["cs"])
ml.fill_all_templates("x")
#ml.fill_all_templates("y")
#ml.fill_all_templates("xy")
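
The three modes differ only in which placeholders get replaced. Here is a small standalone sketch of the same replacement logic on one Czech template; the function name `fill` and the sample triple (in particular its `obj_label`) are illustrative assumptions rather than data read from mLAMA.

```python
def fill(template, triple, mode):
    """Replace [X] when 'x' is in mode and [Y] when 'y' is in mode."""
    filled = template
    if "x" in mode:
        filled = filled.replace("[X]", triple["sub_label"])
    if "y" in mode:
        filled = filled.replace("[Y]", triple["obj_label"])
    return filled


template = "[X] je produkován [Y]."
triple = {"sub_label": "Fiat Multipla", "obj_label": "Fiat"}  # illustrative triple

results = {mode: fill(template, triple, mode) for mode in ("x", "y", "xy")}
for mode, sentence in results.items():
    print(mode, "->", sentence)
```

Mode "x" leaves `[Y]` in place as the blank for the model to predict. Note that plain string replacement does not inflect the inserted label, so in Czech a filled sentence may not be grammatical.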

In [26]:
ml.data['cs'].keys()

dict_keys(['P264', 'P937', 'P19', 'date_of_birth', 'P176', 'P103', 'P39', 'P449', 'P138', 'P37', 'P131', 'P108', 'place_of_birth', 'P364', 'P31', 'P127', 'P495', 'P136', 'P413', 'P1303', 'P279', 'P159', 'P530', 'P47', 'P1376', 'P106', 'P101', 'P740', 'P361', 'P140', 'P463', 'P1412', 'P17', 'P36', 'P178', 'P30', 'P407', 'P20', 'P527', 'P27', 'P276', 'place_of_death', 'P190', 'P1001'])

In [31]:
ml.data['cs']['P176']["filled_templates"]

{0: 'Fiat Multipla je produkován [Y].',
 3: 'Renault Vel Satis je produkován [Y].',
 5: 'Chevrolet Malibu je produkován [Y].',
 8: 'Chevrolet Cobalt je produkován [Y].',
 13: 'Nissan GT-R je produkován [Y].',
 15: 'Ferrari Enzo je produkován [Y].',
 17: 'Fiat 126 je produkován [Y].',
 19: 'Windows 2000 je produkován [Y].',
 20: 'Fiat Idea je produkován [Y].',
 21: 'Fiat Ritmo je produkován [Y].',
 22: 'Triumph Herald je produkován [Y].',
 23: 'Suzuki SV 650 je produkován [Y].',
 26: 'Ferrari Daytona je produkován [Y].',
 27: 'Toyota Supra je produkován [Y].',
 29: 'Honda CB-750 je produkován [Y].',
 34: 'Fiat Tipo je produkován [Y].',
 38: 'Honda ST 1100 Pan European je produkován [Y].',
 39: 'Airbus A330 je produkován [Y].',
 40: 'Chevrolet Monte Carlo je produkován [Y].',
 41: 'BMW řady 5 Gran Turismo je produkován [Y].',
 43: 'Windows Server 2003 je produkován [Y].',
 45: 'Audi 80 je produkován [Y].',
 46: 'Ferrari 212 Inter je produkován [Y].',
 48: 'Kindle Fire je produkován [Y].'

In [34]:
open('./cs-templates.json', 'w').write(json.dumps(ml.data))

3629054
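
Two details are worth knowing when reading this file back. JSON object keys are always strings, so the integer triple ids become "0", "3", and so on after a round trip, and by default `json.dumps` escapes non-ASCII characters such as the Czech "á". A minimal sketch on a one-entry stand-in for `ml.data`:

```python
import json

# One-entry stand-in with the same nesting as ml.data.
data = {"cs": {"P176": {"filled_templates": {0: "Fiat Multipla je produkován [Y]."}}}}

escaped = json.dumps(data)                       # default: non-ASCII written as \uXXXX
readable = json.dumps(data, ensure_ascii=False)  # keeps Czech characters as-is

restored = json.loads(readable)
filled = restored["cs"]["P176"]["filled_templates"]

# Keys come back as strings; convert them if integer ids are needed again.
filled_by_id = {int(key): sentence for key, sentence in filled.items()}

print(list(filled.keys()))
print(filled_by_id[0])
```

If the file should stay human-readable, pass `ensure_ascii=False` and open the file with `encoding="utf-8"` when writing.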