<img src="https://github.com/UBC-NLP/afrolid/raw/main/images/afrolid_logo.jpg">

AfroLID is a neural language identification (LID) toolkit for 517 African languages and varieties. It exploits a multi-domain web dataset manually curated from across 14 language families that use five orthographic systems. AfroLID is described in this paper:
[**AfroLID: A Neural Language Identification Tool for African Languages**](https://arxiv.org/abs/2210.11744).


## (1) Install AfroLID

In [1]:
%pip install pip==24.0



In [2]:
!pip install torch==1.11.0 sentencepiece==0.1.96 fairseq==0.12.2 tqdm==4.63.0

Collecting torch==1.11.0
  Using cached torch-1.11.0-cp310-cp310-manylinux1_x86_64.whl.metadata (24 kB)
Collecting sentencepiece==0.1.96
  Using cached sentencepiece-0.1.96-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting fairseq==0.12.2
  Using cached fairseq-0.12.2.tar.gz (9.6 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tqdm==4.63.0
  Using cached tqdm-4.63.0-py2.py3-none-any.whl.metadata (57 kB)
Collecting hydra-core<1.1,>=1.0.7 (from fairseq==0.12.2)
  Using cached hydra_core-1.0.7-py3-none-any.whl.metadata (3.7 kB)
Collecting omegaconf<2.1 (from fairseq==0.12.2)
  Using cached omegaconf-2.0.6-py3-none-any.whl.metadata (3.0 kB)
Collecting sacrebleu>=1.4.12 (from fairseq==0.12.2)
  Using cached sacrebleu-2.4.3-py3-none-any.whl.metadata (51 kB)
Collecti

In [3]:
!pip install -U git+https://github.com/UBC-NLP/afrolid.git --q

  Preparing metadata (setup.py) ... done
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 125.4/125.4 kB 2.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 6.3 MB/s eta 0:00:00
  Building wheel for afrolid (setup.py) ... done
DEPRECATION: omegaconf 2.0.6 has a non-standard dependency specifier PyYAML>=5.1.*. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of omegaconf or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-ai-generativelanguage 0.6.10 requires protobuf!=4.21.0,!=4.21.1,!=4.21.

In [4]:
# !pip install torch -U
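
Because the install step pins exact versions, it can help to confirm what actually landed in the environment before continuing. The following is a minimal sketch using only the standard library. The `PINNED` mapping mirrors the versions in the install command above, and `check_versions` is a hypothetical helper, not part of AfroLID. Local build suffixes such as `+cu102` are ignored when comparing, on the assumption that only the base version matters here.

```python
from importlib import metadata

# Versions copied from the pip install command above.
PINNED = {
    "torch": "1.11.0",
    "sentencepiece": "0.1.96",
    "fairseq": "0.12.2",
    "tqdm": "4.63.0",
}

def check_versions(pinned):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Drop a local build suffix (e.g. "1.11.0+cu102") before comparing.
        base = installed.split("+")[0] if installed else None
        if base != expected:
            mismatches[name] = (expected, installed)
    return mismatches

print(check_versions(PINNED) or "All pinned packages match.")
```

An empty result means every pinned package is present at the expected version.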

## (2) Download the model

In [5]:
! wget https://demos.dlnlp.ai/afrolid/afrolid_model.tar.gz
!tar -xf afrolid_model.tar.gz

--2024-12-04 12:03:07--  https://demos.dlnlp.ai/afrolid/afrolid_model.tar.gz
Resolving demos.dlnlp.ai (demos.dlnlp.ai)... 74.208.236.113, 2607:f1c0:100f:f000::264
Connecting to demos.dlnlp.ai (demos.dlnlp.ai)|74.208.236.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2277022086 (2.1G) [application/gzip]
Saving to: ‘afrolid_model.tar.gz’


2024-12-04 12:03:38 (70.0 MB/s) - ‘afrolid_model.tar.gz’ saved [2277022086/2277022086]
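
The archive is about 2.1 GB, so an interrupted download is a plausible failure mode. A rough sketch of a size check follows. `archive_is_complete` is a hypothetical helper, and the expected byte count is taken from the `Length` line in the wget log above; it would need updating if the hosted archive changes. A matching size does not prove the file is uncorrupted, only that the transfer finished.

```python
import os

# "Length" reported by wget in the log above (an assumption if the archive is updated).
EXPECTED_BYTES = 2277022086

def archive_is_complete(path, expected_bytes=EXPECTED_BYTES):
    """True if the file exists and its size matches the expected byte count."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes

if not archive_is_complete("afrolid_model.tar.gz"):
    print("Archive missing or incomplete; consider re-running the download.")
```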



## (3) Initialize the AfroLID object

In [6]:
import os, sys
import logging
from afrolid.main import classifier

In [7]:
logging.basicConfig(
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=os.environ.get("LOGLEVEL", "INFO").upper(),
    force=True, # Resets any previous configuration
)
logger = logging.getLogger("afrolid")


In [8]:
cl = classifier(logger, model_path="/content/afrolid_model")

2024-12-04 12:04:32 | INFO | afrolid | Initalizing AfroLID's task and model.


| [input] dictionary: 64001 types
| [label] dictionary: 528 types


## (4) Get language prediction(s)

In [9]:
## Gold label = dip
text="6Acï looi aya në wuöt dït kɔ̈k yiic ku lɔ wuöt tɔ̈u tëmec piny de Manatha ku Eparaim ku Thimion , ku ɣään mec tɔ̈u të lɔ rut cï Naptali"
predicted_langs = cl.classify(text) # default max_outputs=3
print("Predicted languages:")
for lang in predicted_langs:
  print("     |-- ISO: {}\tName: {}\tScript: {}\tScore: {}%".format(
                      lang,
                      predicted_langs[lang]['name'],
                      predicted_langs[lang]['script'],
                      predicted_langs[lang]['score']))

2024-12-04 12:04:40 | INFO | afrolid | Input text: 6Acï looi aya në wuöt dït kɔ̈k yiic ku lɔ wuöt tɔ̈u tëmec piny de Manatha ku Eparaim ku Thimion , ku ɣään mec tɔ̈u të lɔ rut cï Naptali


Predicted languages:
     |-- ISO: dip	Name: Dinka, Northeastern	Script: Latin	Score: 100.0%


In [10]:
## Gold label = kmy
text="Ama vuodieke nɩŋ mana n Chʋa Ŋmɩŋ dɩ nagɩna yɩ mɩŋ , nan keŋ n jigiŋ a yi mɩŋ yada , ta n kaaŋ yagɩ vuodieke nɩŋ dɩ kienene n jigiŋ"
predicted_langs = cl.classify(text)  # default max_outputs=3
print("Predicted languages:")
for lang in predicted_langs:
  print("     |-- ISO: {}\tName: {}\tScript: {}\tScore: {}%".format(
                      lang,
                      predicted_langs[lang]['name'],
                      predicted_langs[lang]['script'],
                      predicted_langs[lang]['score']))

2024-12-04 12:04:41 | INFO | afrolid | Input text: Ama vuodieke nɩŋ mana n Chʋa Ŋmɩŋ dɩ nagɩna yɩ mɩŋ , nan keŋ n jigiŋ a yi mɩŋ yada , ta n kaaŋ yagɩ vuodieke nɩŋ dɩ kienene n jigiŋ


Predicted languages:
     |-- ISO: kma	Name: Konni	Script: Latin	Score: 68.42%
     |-- ISO: kmy	Name: Koma	Script: Latin	Score: 31.58%
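
In the second example the gold label (`kmy`) is ranked below `kma`, so it can be useful to ask where the gold label falls among the returned candidates. The sketch below assumes the result of `cl.classify` is a dict mapping ISO codes to dicts with a `score` key, as the printing loop above suggests. `gold_rank` is a hypothetical helper, and the `example` dict is a mock built from the printed values so the snippet runs without the model.

```python
def gold_rank(predictions, gold_iso):
    """Return the 1-based rank of gold_iso among predictions, or None if absent.

    Assumes `predictions` maps ISO codes to dicts with a 'score' key.
    """
    ranked = sorted(predictions, key=lambda iso: predictions[iso]["score"], reverse=True)
    return ranked.index(gold_iso) + 1 if gold_iso in ranked else None

# Mock of the second example's output (values copied from the printout above).
example = {
    "kma": {"name": "Konni", "script": "Latin", "score": 68.42},
    "kmy": {"name": "Koma", "script": "Latin", "score": 31.58},
}
print(gold_rank(example, "kmy"))  # → 2
```

A rank of `None` means the gold label was not among the returned candidates at all, which is distinct from being ranked low.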


## (5) Integrate with Pandas


In [None]:
!wget https://raw.githubusercontent.com/UBC-NLP/afrolid/main/examples/examples.tsv -O examples.tsv

In [None]:
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
df = pd.read_csv("examples.tsv", sep="\t")
df

In [None]:
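def get_afrolid_prediction(text):
  # Return (iso, score, name, script) for the top prediction.
  predictions = cl.classify(text, max_outputs=1)
  for lang in predictions:
    return lang, predictions[lang]['score'], predictions[lang]['name'], predictions[lang]['script']
  # No prediction returned: keep the tuple shape so the zip(*...) unpacking still works.
  return None, None, None, None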

In [None]:
df['predict_iso'], df['predict_score'], df['predict_name'], df['predict_script'] = zip(*df['content'].progress_apply(get_afrolid_prediction))
df
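
As an alternative to unpacking four columns through `zip`, the predictions can be collected as one record per row. This is only a sketch: `classify_to_records` and `stub_classify` are hypothetical names, and the stub stands in for the real classifier so the snippet runs without the model. It assumes the classifier returns a dict keyed by ISO code, with `score`, `name`, and `script` fields, as in the cells above.

```python
def classify_to_records(texts, classify_fn):
    """Build one dict per input text holding the top prediction (or None fields)."""
    rows = []
    for text in texts:
        predictions = classify_fn(text)
        if predictions:
            # Take the first (and, with max_outputs=1, only) entry.
            iso, info = next(iter(predictions.items()))
            rows.append({
                "content": text,
                "predict_iso": iso,
                "predict_score": info["score"],
                "predict_name": info["name"],
                "predict_script": info["script"],
            })
        else:
            rows.append({
                "content": text,
                "predict_iso": None,
                "predict_score": None,
                "predict_name": None,
                "predict_script": None,
            })
    return rows

def stub_classify(text):
    """Placeholder classifier: returns a fixed fake prediction for non-empty text."""
    if not text:
        return {}
    return {"xxx": {"name": "Placeholder", "script": "Latin", "score": 100.0}}

rows = classify_to_records(["some text", ""], stub_classify)
print(rows[0]["predict_iso"], rows[1]["predict_iso"])
```

With the real model, one could pass `lambda t: cl.classify(t, max_outputs=1)` as `classify_fn` and hand the resulting list to `pd.DataFrame(rows)`.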