# Tagging multilingual texts
`flair` provides a multilingual POS model that covers German and French, among other languages. Its authors report high accuracy across languages. Therefore, it is used here to tag the source languages that are not English or Arabic.

First, the necessary packages are installed and imported.


In [None]:
%%bash
pip install flair
pip install nltk
pip install tqdm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting flair
  Downloading flair-0.11.3-py3-none-any.whl (401 kB)
Collecting pptree
  Downloading pptree-3.1.tar.gz (3.0 kB)
Collecting mpld3==0.3
  Downloading mpld3-0.3.tar.gz (788 kB)
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
Collecting sentencepiece==0.1.95
  Downloading sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2 MB)
Collecting conllu>=4.0
  Downloading conllu-4.5.1-py2.py3-none-any.whl (16 kB)
Collecting segtok>=1.5.7
  Downloading segtok-1.5.11-py3-none-any.whl (24 kB)
Collecting janome
  Downloading Janome-0.4.2-py2.py3-none-any.whl (19.7 MB)
Collecting sqlitedict>=1.6.0
  Downloading sqlitedict-2.0.0.tar.gz (46 kB)
Collecting ftfy
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
Collecting transformers>=4.0.0
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting wikipedia-api
  Downloading Wikipedia-API-0.5.4.tar.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
markdown 3.4.1 requires importlib-metadata>=4.4; python_version < "3.10", but you have importlib-metadata 3.10.1 which is incompatible.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.28.1 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.


In [None]:
import os
from collections import defaultdict
import pickle


from flair.data import Sentence
from flair.models import SequenceTagger
from google.colab import drive
from tqdm import tqdm
drive.mount('/content/drive')

Mounted at /content/drive


Next, we define the necessary paths and constants.

In [None]:
# Change the path to where the parallel data files are located on your device/drive
DATA_PATH = ["drive/MyDrive/Data/parallel/Tanzil-de-ha.txt",
             "drive/MyDrive/Data/parallel/Tanzil-fr-ha.txt"]

# These are the paths where the output should be stored.
OUT_PATH = ["/content/drive/MyDrive/Data/tagged/Tanzil-de-ha.tagged",
            "/content/drive/MyDrive/Data/tagged/Tanzil-fr-ha.tagged"]

# This is the tag used for unknown words.
UNK = "<unk>"

Next, the tagger is initialized. We use the multilingual model, which performs POS tagging for 12 languages.

In [None]:
# Load tagger
tagger = SequenceTagger.load("flair/upos-multi")



Downloading:   0%|          | 0.00/314M [00:00<?, ?B/s]

2022-07-21 12:42:51,322 loading file /root/.flair/models/upos-multi/1a44f168663182024fd3ea6d7dcaeee47fe5bcb537cc737ad058b64ad4db9736.5f899f25846741510a6567b89027d988bd6f634b2776a7c3e834fea4629367cb
2022-07-21 12:42:51,734 SequenceTagger predicts: Dictionary with 21 tags: <unk>, O, PROPN, PUNCT, ADJ, NOUN, VERB, DET, ADP, AUX, PRON, PART, SCONJ, NUM, ADV, CCONJ, X, INTJ, SYM, <START>, <STOP>


Additionally, we define some functions to load and save temporary results, which are stored in the same directory as the output files.

In [None]:
def load_pickle(file_name):
  with open(file_name, "rb") as file:
    return pickle.load(file)

def save_pickle(obj, file_name):
  with open(file_name, "wb") as file:
    pickle.dump(obj, file)
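
As a minimal sketch of how these helpers can support resuming an interrupted run, the snippet below round-trips a small result dictionary through a temporary file and falls back to an empty store when no checkpoint exists. The helper `load_or_init`, the temporary file name, and the sample tags are hypothetical and only serve the illustration; the two pickle helpers are repeated so that the snippet runs on its own.

```python
import os
import pickle
import tempfile
from collections import defaultdict


def load_pickle(file_name):
    with open(file_name, "rb") as file:
        return pickle.load(file)


def save_pickle(obj, file_name):
    with open(file_name, "wb") as file:
        pickle.dump(obj, file)


def load_or_init(file_name):
    """Hypothetical helper: return (results, start_line) for a checkpoint path."""
    try:
        results = load_pickle(file_name)
    except (FileNotFoundError, EOFError):
        # No usable checkpoint: start from scratch.
        return defaultdict(list), 0
    # Resume at the highest line index that was stored.
    start = max(results) if results else 0
    return results, start


with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = os.path.join(tmp_dir, "example.pickle")  # hypothetical path

    # First run: nothing has been saved yet.
    results, start = load_or_init(checkpoint)
    fresh_state = (dict(results), start)

    # Store made-up tags for two lines and write a checkpoint.
    results[0] = [("NOUN", 0.99)]
    results[1] = [("VERB", 0.95), ("PUNCT", 1.0)]
    save_pickle(results, checkpoint)

    # Second run: the checkpoint is found and tagging would resume at line 1.
    resumed, resume_at = load_or_init(checkpoint)
```

Resuming at the highest stored index means that one line may be tagged twice, which is harmless because the result for that line is simply overwritten.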

Now, tagging is performed. We iterate over all files in `DATA_PATH`, each of which contains lines in the following format: `source language ||| target language`. The left side is extracted and tagged with the multilingual flair POS model.
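
To illustrate the line format, here is a small sketch of the splitting step in isolation. The function name `split_parallel_line` and the sample lines are made up for illustration; only the `|||` delimiter comes from the data format described above. Unlike the notebook cell, this sketch also strips surrounding whitespace from both sides.

```python
def split_parallel_line(line, delimiter="|||"):
    """Return (source, target), or None if the line is malformed."""
    parts = line.rstrip("\n").split(delimiter)
    # A well-formed line has exactly one delimiter, hence two parts.
    if len(parts) != 2:
        return None
    source, target = (part.strip() for part in parts)
    return source, target


# Made-up sample lines: one well-formed, one without a delimiter.
good = split_parallel_line("Das ist ein Beispiel . ||| target side tokens\n")
bad = split_parallel_line("a line without any delimiter\n")

# The source side is already tokenized, so whitespace splitting yields the words.
words = good[0].split()
```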

In [None]:
for file, out in zip(DATA_PATH, OUT_PATH):
  print(file)
  tmp_path = "{}.pickle".format(os.path.splitext(out)[0])
  print(tmp_path)
  with open(file, encoding="utf-8") as f:
    # Store sentences that have been tagged before.
    tag_dict = dict()
    start = 0 # Line to start at.
    try:
      line_dict = load_pickle(tmp_path)
      if line_dict.keys():
        start = max(line_dict.keys())
    except (FileNotFoundError, EOFError):
      # No usable checkpoint: start with an empty result store.
      print("No saved results found. Starting from scratch.")
      line_dict = defaultdict(list)
    for idx, line in tqdm(enumerate(f)):
      if idx < start:
        continue
      if idx%1000 == 0:
        save_pickle(line_dict, tmp_path)
      # Split line at delimiter
      line = line.split("|||")
      # Only process lines with the right format.
      if len(line) != 2:
        print("ERROR in line {}".format(idx))
        continue
      src, _ = line
      words = src.split()
      # Default is unknown tag for each word.
      tags = [(UNK, 0) for i in range(len(words))]
      sentence = Sentence(src, use_tokenizer=False)
      # If sentence has been tagged before, we don't need to tag it again.
      if src in tag_dict:
        tags = tag_dict[src]
      else:
        tagger.predict(sentence)
        # Store tag and tag confidence.
        tags = [(label["value"], label["confidence"]) for label in sentence.to_dict()["all labels"]]
      line_dict[idx] = tags
      tag_dict[src] = tags
  print("Writing results to file...")
  # Write results to file.
  with open(out, "w", encoding="utf-8") as out_file:
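    # Iterate up to the highest line index so that skipped (malformed) lines
    # produce an empty line instead of truncating the end of the output.
    for i in range(max(line_dict, default=-1) + 1):
      tags = line_dict.get(i, [])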
      tag_str = ["{}-{}".format(tag, score) for tag, score in tags]
      out_file.write("{}\n".format(" ".join(tag_str)))

drive/MyDrive/Data/parallel/Tanzil-de-ha.txt
/content/drive/MyDrive/Data/tagged/Tanzil-de-ha.pickle


46434it [20:24, 37.93it/s]


Writing results to file...
drive/MyDrive/Data/parallel/Tanzil-fr-ha.txt
/content/drive/MyDrive/Data/tagged/Tanzil-fr-ha.pickle


11762it [05:11, 37.80it/s]


Writing results to file...
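
Each line of an output file holds one `TAG-score` item per source word, separated by spaces. A minimal sketch for reading such a line back into (tag, confidence) pairs could look as follows; the function name `parse_tagged_line` and the sample line are illustrative assumptions and not part of the notebook.

```python
def parse_tagged_line(line):
    """Parse 'TAG-score TAG-score ...' into a list of (tag, float) pairs."""
    pairs = []
    for item in line.split():
        # Split at the first hyphen only: UPOS tags and "<unk>" contain no
        # hyphen, while a score in scientific notation (e.g. 1e-05) may.
        tag, score = item.split("-", 1)
        pairs.append((tag, float(score)))
    return pairs


sample = "DET-0.98 NOUN-0.75 PUNCT-1.0 <unk>-0"  # made-up values
parsed = parse_tagged_line(sample)
```

An empty line, as written for a source line that could not be tagged, parses to an empty list.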
