<a href="https://colab.research.google.com/github/XuZuoLizzie/Archived_Work/blob/main/Bio_Entity_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extract Biomedical Entities from Literature

This is a demo on how we can use data from Europe PMC Annotations to train a pipeline that extracts Cell entities from full-text articles.

Please note that this notebook only demonstrates the process from data preparation to model training. To further improve the NER model trained in the notebook, we need to process more data and allocate more computational resources.

## Prepare Dataset

In [None]:
! pip install pubmed_parser
! pip install scispacy

Collecting pubmed_parser
  Downloading pubmed_parser-0.3.1.tar.gz (21 kB)
  Preparing metadata (setup.py) ... done
Collecting unidecode (from pubmed_parser)
  Downloading Unidecode-1.3.8-py3-none-any.whl (235 kB)
Collecting pytest-cov (from pubmed_parser)
  Downloading pytest_cov-4.1.0-py3-none-any.whl (21 kB)
Collecting coverage[toml]>=5.2.1 (from pytest-cov->pubmed_parser)
  Downloading coverage-7.4.3-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (234 kB)
Building wheels for collected packages: pubmed_parser
  Building wheel for pubmed_parser (setup.py) ... done
  Created wheel for pubmed_parser: filename=pubmed_parser-0.3.1-py3-none-any.whl size=18495 sha256=63f9d9bae

### Load entities

Select and extract biomedical entities from the Europe PMC Annotations API. This task uses the 'Cell' entity type.

In [None]:
import requests
import os
import time
import json
import pandas as pd

In [None]:
directory_path = "/content/data"
if not os.path.exists(directory_path):
   os.makedirs(directory_path)
   print("Data folder created.")

Data folder created.


In [None]:
def make_api_calls_and_save(num_calls):
    cursor_mark = 0
    for i in range(num_calls):
        url = f"https://www.ebi.ac.uk/europepmc/annotations_api/annotationsBySectionAndOrType?type=Cell&filter=1&format=JSON&pageSize=8&cursorMark={cursor_mark}"
        response = requests.get(url)

        # Check if the response is successful
        if response.status_code == 200:
            data = response.json()
            # Update the cursor_mark for the next call, assuming the new cursor_mark is part of the response
            cursor_mark = data.get('nextCursorMark', cursor_mark)

            # Save the response data into a JSON file
            file_name = f'cell_ann_{i}.json'
            file_path = os.path.join(directory_path, file_name)
            with open(file_path, 'w') as file:
                json.dump(data, file, indent=4)
            print(f"Data saved to {file_name}")
        else:
            print(f"Failed to get data for call {i}: Status code {response.status_code}")
            break  # Stop making further calls if there's a failure

        # Wait for 0.5 seconds before making the next call
        time.sleep(0.5)

In [None]:
num_calls = 2  # Number of API calls; 2 calls * 8 articles per call = 16 articles in total
make_api_calls_and_save(num_calls)

Data saved to cell_ann_0.json
Data saved to cell_ann_1.json
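
The download loop follows the API's cursor: each response is expected to carry a `nextCursorMark` that seeds the next request. Here is a minimal sketch of the same pattern with the HTTP call factored out, so it can be tried without network access. `paginate` and `fetch_page` are hypothetical names, the pages are made-up data, and stopping when the cursor no longer advances is an assumption rather than documented API behaviour.

```python
from typing import Callable, Dict, Iterator


def paginate(fetch_page: Callable[[str], Dict], max_pages: int,
             start_cursor: str = "0") -> Iterator[Dict]:
    """Yield up to max_pages pages, following 'nextCursorMark' between calls."""
    cursor = start_cursor
    for _ in range(max_pages):
        page = fetch_page(cursor)
        yield page
        next_cursor = str(page.get("nextCursorMark", cursor))
        # Assumption: an unchanged or missing cursor means there are no further pages.
        if next_cursor == cursor:
            break
        cursor = next_cursor


# Fake three-page source standing in for the Annotations API (illustrative data only).
FAKE_PAGES = {
    "0": {"articles": ["a", "b"], "nextCursorMark": "1"},
    "1": {"articles": ["c"], "nextCursorMark": "2"},
    "2": {"articles": [], "nextCursorMark": "2"},
}

pages = list(paginate(lambda cursor: FAKE_PAGES[cursor], max_pages=10))
print(len(pages))
```

Passing the fetch function in as an argument keeps the cursor logic testable; in the notebook it would wrap the `requests.get` call.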


Load saved JSON response into a DataFrame.

In [None]:
# Function to process a single JSON file
def process_json_file(file_path):
    with open(file_path, 'r') as f:
        data = json.load(f)
        articles_data = []
        for article in data['articles']:
            for annotation in article['annotations']:
                prefix = annotation.get('prefix', '')
                exact = annotation.get('exact', '')
                postfix = annotation.get('postfix', '')
                prefix_exact_postfix = f"{prefix}{exact}{postfix}"
                articles_data.append({
                    'pmcid': article.get('pmcid', None),
                    'exact': exact,
                    'type': annotation.get('type', None),
                    'prefix_exact_postfix': prefix_exact_postfix
                })
        return articles_data
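
To see what this flattening produces, the sketch below applies the same logic to an in-memory response. The sample dictionary is made up for illustration; only the field names (`articles`, `annotations`, `prefix`, `exact`, `postfix`, `type`, `pmcid`) come from the notebook code, and `flatten_annotations` is a hypothetical helper.

```python
def flatten_annotations(data):
    """Flatten an Annotations API response into one row per annotation."""
    rows = []
    for article in data.get("articles", []):
        for annotation in article.get("annotations", []):
            exact = annotation.get("exact", "")
            # The context string is the text before, the entity itself, and the text after.
            context = annotation.get("prefix", "") + exact + annotation.get("postfix", "")
            rows.append({
                "pmcid": article.get("pmcid"),
                "exact": exact,
                "type": annotation.get("type"),
                "prefix_exact_postfix": context,
            })
    return rows


# Hypothetical response with a single article and a single annotation.
sample = {
    "articles": [{
        "pmcid": "PMC0000000",
        "annotations": [{
            "prefix": "inhibitory impact of M2 ",
            "exact": "macrophages",
            "postfix": " on the activity",
            "type": "Cell",
        }],
    }]
}

rows = flatten_annotations(sample)
print(rows[0]["prefix_exact_postfix"])
```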

In [None]:
extracted_data = []

# Process each JSON file in the directory
for filename in os.listdir(directory_path):
    if filename.endswith('.json'):
        file_path = os.path.join(directory_path, filename)
        extracted_data.extend(process_json_file(file_path))

df = pd.DataFrame(extracted_data)
display(df)

index,pmcid,exact,type,prefix_exact_postfix
0,PMC6833180,macrophage,Cell,s a minimally toxic macrophage repolarizing ag...
1,PMC6833180,macrophages,Cell,bitory impact of M2 macrophages on the activit...
2,PMC6833180,macrophages,Cell,e repolarization of macrophages by RRx-001 may...
3,PMC6833180,T,Cell,A-4 (anti cytotoxic T-lymphocyte-associat
4,PMC6833180,lymphocyte,Cell,4 (anti cytotoxic T-lymphocyte-associated prot...
...,...,...,...,...
20573,PMC6919427,hematopoietic stem,Cell,kground:\nAutologous hematopoietic stem and pr...
20574,PMC6919427,T cell,Cell,"CCR5-CD4+, and CD8+ T cell counts and SHIV pla"
20575,PMC6919427,T cell,Cell,odel simulations of T cell and SHIV dynamics a
20576,PMC6919427,T cell,Cell,"driver of CD4+CCR5− T cell growth, and rapid l"


### Load articles

Download the full-text articles from the Europe PMC Articles RESTful API.

In [None]:
from lxml import etree
import pubmed_parser as pp
import scispacy
import spacy

In [None]:
pmcid_list = df['pmcid'].unique().tolist()
print(pmcid_list[:10])
print("%d articles in total." % len(pmcid_list))

['PMC6833180', 'PMC6833189', 'PMC6854655', 'PMC6763540', 'PMC6501469', 'PMC6821132', 'PMC6802965', 'PMC6937151', 'PMC6584520', 'PMC6504235']
16 articles in total.


In [None]:
def download_pmc_article(pmcid):
    url = f'https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML'

    # Check if the XML file already exists
    file_path = os.path.join(directory_path, f'{pmcid}.xml')
    if os.path.exists(file_path):
        print(f'{pmcid} already exists, skipping download.')
        return

    # Make the request
    response = requests.get(url)

    if response.status_code == 200:
        # Save the article as XML
        with open(file_path, 'wb') as file:
            file.write(response.content)
        print(f'{pmcid} downloaded successfully.')
    else:
        print(f'Error downloading {pmcid}. Status code: {response.status_code}')

    time.sleep(0.5)

In [None]:
for pmcid in pmcid_list:
    download_pmc_article(pmcid)

PMC6833180 downloaded successfully.
PMC6833189 downloaded successfully.
PMC6854655 downloaded successfully.
PMC6763540 downloaded successfully.
PMC6501469 downloaded successfully.
PMC6821132 downloaded successfully.
PMC6802965 downloaded successfully.
PMC6937151 downloaded successfully.
PMC6584520 downloaded successfully.
PMC6504235 downloaded successfully.
PMC6636997 downloaded successfully.
PMC6726422 downloaded successfully.
PMC6851788 downloaded successfully.
PMC6636905 downloaded successfully.
PMC6592685 downloaded successfully.
PMC6919427 downloaded successfully.


Parse the title, abstract and main text from XML files using Pubmed Parser.

In [None]:
def parse_full_text(pmc_file_path):
    para_dict = pp.parse_pubmed_paragraph(pmc_file_path, all_paragraph=False)
    main_text_list = []

    for paragraph in para_dict:
        cleaned_paragraph = paragraph['text'].strip()
        main_text_list.append(cleaned_paragraph)

    main_text = '\n'.join(main_text_list)
    info_dict = pp.parse_pubmed_xml(pmc_file_path)
    title = info_dict['full_title'].strip()
    abstract = info_dict['abstract'].strip()
    full_text = title + '\n\n' + abstract + '\n\n' + main_text

    return full_text
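
The parsing function assumes every article has both a title and an abstract. A hedged sketch of a more defensive join that skips missing or blank parts is given here; `join_sections` is a hypothetical helper and not part of Pubmed Parser.

```python
def join_sections(title, abstract, paragraphs):
    """Join title, abstract and body paragraphs, skipping missing or blank parts."""
    body = "\n".join(p.strip() for p in paragraphs if p and p.strip())
    parts = [(part or "").strip() for part in (title, abstract, body)]
    # Sections are separated by a blank line, paragraphs by a single newline.
    return "\n\n".join(part for part in parts if part)


text = join_sections("A title", None, ["  First paragraph. ", "", "Second paragraph."])
print(text)
```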

Segment the full text into sentences using scispaCy.

In [None]:
! pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_sm-0.5.3.tar.gz

Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_sm-0.5.3.tar.gz
  Downloading https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_sm-0.5.3.tar.gz (14.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: en-core-sci-sm
  Building wheel for en-core-sci-sm (setup.py) ... done
  Created wheel for en-core-sci-sm: filename=en_core_sci_sm-0.5.3-py3-none-any.whl size=14776165 sha256=e8328228b769ba7a07a7580b262041eeb5e4b6112917573b034122e52ac260ad
  Stored in directory: /root/.cache/pip/wheels/1a/27/08/5863b9fc5a65254f943eff433dd1e0fafc7ac4595be28d789d
Successfully built en-core-sci-sm
Installing collected packages: en-core-sci-sm
Successfully installed en-core-sci-sm-0.5.3


In [None]:
nlp = spacy.load("en_core_sci_sm")

full_text_dict = {}
for file_name in os.listdir(directory_path):
    if file_name.endswith('.xml'):
        pmcid = file_name.replace('.xml', '')
        full_text = parse_full_text(os.path.join(directory_path, file_name))
        doc = nlp(full_text)
        full_text_dict[pmcid] = [sentence.text.strip() for sentence in doc.sents]
        print(f'Processed {file_name}')


Processed PMC6833189.xml
Processed PMC6726422.xml
Processed PMC6501469.xml
Processed PMC6802965.xml
Processed PMC6636905.xml
Processed PMC6584520.xml
Processed PMC6851788.xml
Processed PMC6854655.xml
Processed PMC6636997.xml
Processed PMC6919427.xml
Processed PMC6937151.xml
Processed PMC6592685.xml
Processed PMC6821132.xml
Processed PMC6833180.xml
Processed PMC6504235.xml
Processed PMC6763540.xml


In [None]:
# Convert the segmented full-text to a DataFrame
article_df = pd.DataFrame(list(full_text_dict.items()), columns=['pmcid', 'sentences'])

# Expand the 'sentences' column vertically
article_df = article_df.explode('sentences').reset_index(drop=True)
display(article_df)

index,pmcid,sentences
0,PMC6833189,34th Annual Meeting & Pre-Conference Programs ...
1,PMC6726422,Astrocyte morphogenesis is dependent on BDNF s...
2,PMC6726422,"Herein, we demonstrate astrocytes express high..."
3,PMC6726422,"Using a novel culture paradigm, we show that a..."
4,PMC6726422,"Deletion of TrkB.T1, globally and astrocyte-sp..."
...,...,...
5188,PMC6504235,Paired-end (75 × 75 bp) sequencing was perform...
5189,PMC6504235,The number of biological replicates used were:...
5190,PMC6504235,Each sample was sequenced to a depth of approx...
5191,PMC6504235,Sequencing data have been deposited in GEO und...


### Extract the whole sentence

In [None]:
# Merge the DataFrames on 'pmcid'
merged_df = pd.merge(df, article_df, on='pmcid', how='inner')

# Keep only rows where the annotation context ('prefix_exact_postfix') occurs inside the sentence
merged_df['is_included'] = merged_df.apply(lambda row: row['prefix_exact_postfix'] in row['sentences'], axis=1)
included_df = merged_df[merged_df['is_included']]

ner_df = included_df[['pmcid', 'exact', 'type', 'prefix_exact_postfix', 'sentences']].copy().reset_index(drop=True)
ner_df = ner_df.rename(columns={'sentences': 'sentence'})
display(ner_df)

index,pmcid,exact,type,prefix_exact_postfix,sentence
0,PMC6854655,neuronal,Cell,s absorbed into the neuronal mass model that is,The homogeneous local connectivity is absorbed...
1,PMC6854655,neuron,Cell,tion tuning of a V1 neuron is restricted by th,By expanding the statistical wiring model prop...
2,PMC6854655,ON,Cell,ocal arrangement of ON and OFF retinal gan,By expanding the statistical wiring model prop...
3,PMC6854655,retinal ganglion cells,Cell,ement of ON and OFF retinal ganglion cells (RG...,By expanding the statistical wiring model prop...
4,PMC6854655,RGCs,Cell,"nal ganglion cells (RGCs) [2, 3], we suggest",By expanding the statistical wiring model prop...
...,...,...,...,...,...
3554,PMC6592685,T cell,Cell,"y, both CD4 and CD8 T cell subsets could be fo","Additionally, both CD4 and CD8 T cell subsets ..."
3555,PMC6592685,T cell,Cell,here was no obvious T cell or monocyte marker,"Thus, there was no obvious T cell or monocyte ..."
3556,PMC6592685,monocyte,Cell,o obvious T cell or monocyte marker that could d,"Thus, there was no obvious T cell or monocyte ..."
3557,PMC6592685,T cell,Cell,at the frequency of T cell:monocyte complexes,We found that the frequency of T cell:monocyte...
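
The containment test keeps an annotation only when its context string occurs verbatim inside a single sentence, so any context that differs from the parsed text in whitespace is dropped. One possible mitigation, sketched here with the standard library, is to compare whitespace-normalised strings. `normalize_ws` and `find_sentence` are hypothetical helpers, and the example sentences are illustrative rather than taken from the articles.

```python
import re


def normalize_ws(text):
    """Collapse every run of whitespace into a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def find_sentence(context, sentences):
    """Return the first sentence containing the context, ignoring whitespace differences."""
    needle = normalize_ws(context)
    for sentence in sentences:
        if needle and needle in normalize_ws(sentence):
            return sentence
    return None


sentences = [
    "Thus, there was no obvious T cell or monocyte marker that could distinguish them.",
    "We found that the frequency of T cell:monocyte complexes varied.",
]
# The context contains a newline where the sentence has a space.
match = find_sentence("here was no obvious T cell\nor monocyte marker", sentences)
print(match)
```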


## Train an NER Model

### Preprocess Dataset

Tokenize sentences and convert the dataset into BIO (Begin, Inside, Outside) format.
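
As a small illustration of the BIO scheme: the first token of an entity is tagged `B-<type>`, the remaining tokens of the same entity `I-<type>`, and every other token `O`. The sketch uses whitespace tokens and a hand-written span instead of the spaCy tokenizer; `bio_tag` is a hypothetical helper.

```python
def bio_tag(tokens, spans):
    """Tag tokens in BIO format.

    spans holds (start, end, label) tuples with inclusive token indices.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"
    return list(zip(tokens, tags))


tokens = "both CD4 and CD8 T cell subsets".split()
# "T cell" covers token indices 4 and 5.
tagged = bio_tag(tokens, [(4, 5, "Cell")])
print(tagged)
```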

In [None]:
from typing import List, Tuple, Dict
import re

In [None]:
grouped_ner_df = ner_df.groupby(['pmcid', 'sentence']).agg({
    'exact': lambda x: list(x),
    'type': lambda x: list(x),
    'prefix_exact_postfix': lambda x: list(x)
}).reset_index()

display(grouped_ner_df)

index,pmcid,sentence,exact,type,prefix_exact_postfix
0,PMC6504235,(C) Heat map showing the relative enrichment o...,"[hair, cells, supporting cells, cochlear hair ...","[Cell, Cell, Cell, Cell, Cell, Cell, Cell, Cel...","[eq peaks in utricle hair cells, utricle supp,..."
1,PMC6504235,112 of these genes appear to be generic marker...,"[hair cells, neonatal, cochlear hair cells]","[Cell, Cell, Cell]","[ generic markers of hair cells, as they are a..."
2,PMC6504235,24 hr of exposure to gentamicin led to signifi...,[hair cell],[Cell],[ led to significant hair cell loss in the utr...
3,PMC6504235,70 of the top 100 enriched utricle hair cell g...,"[hair cell, hair cells]","[Cell, Cell]",[00 enriched utricle hair cell genes have been...
4,PMC6504235,A number of transcription factors have been pr...,[hair cell],[Cell],"[e with Atoh1 during hair cell induction, such..."
...,...,...,...,...,...
1953,PMC6937151,When we examined the proximal zebrafish intest...,[EECs],[Cell],[iscovered that most EECs had adopted a close]
1954,PMC6937151,Whereas HF feeding normally reduces the EEC mo...,[EEC],[Cell],"[ormally reduces the EEC morphology score, t]"
1955,PMC6937151,Whether EECs adopt the same mechanisms as neur...,[neurons],[Cell],[ same mechanisms as neurons to prune their cell]
1956,PMC6937151,Wild-type adult EKW zebrafish were bred and cl...,[eggs],[Cell],[red and clutches of eggs from three distinct]


In [None]:
# Function to find start and end token indices for exact entity matches
def find_exact_entity_span(doc, entities):
    exact_entity_spans = []
    for ent in entities:
        # Note: str.find returns only the first occurrence of the entity text
        start_char = doc.text.find(ent)
        if start_char != -1:
            end_char = start_char + len(ent)
            start_token = None
            end_token = None
            for token in doc:
                token_end = token.idx + len(token)
                # Start token: the first token that ends after start_char, so a
                # match that begins mid-token maps to the token containing it
                if start_token is None and start_char < token_end:
                    start_token = token.i
                # End token: the first token that ends at or after end_char
                if end_token is None and end_char <= token_end:
                    end_token = token.i
                    break
            if start_token is not None and end_token is not None:
                exact_entity_spans.append((start_token, end_token, ent))
    return exact_entity_spans

In [None]:
def tokenize_and_tag(row):
    doc = nlp(row['sentence'])
    exact_entity_spans = find_exact_entity_span(doc, row['exact'])
    bio_tags = ['O'] * len(doc)

    for start, end, ent in exact_entity_spans:
        entity_type = row['type'][row['exact'].index(ent)]
        if start is not None:
            bio_tags[start] = f'B-{entity_type}'
            for i in range(start + 1, end + 1):
                bio_tags[i] = f'I-{entity_type}'

    return [(token.text, tag) for token, tag in zip(doc, bio_tags)]
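
The span lookup relies on `str.find`, which returns only the first occurrence, so a mention repeated within a sentence is tagged once. One possible alternative, sketched here on character offsets, is to collect every occurrence with `re.finditer` and prefer longer entities when matches overlap. `all_char_spans` is a hypothetical helper, and mapping the offsets back to spaCy tokens is left out.

```python
import re


def all_char_spans(text, entities):
    """Return non-overlapping (start, end, entity) character spans for all occurrences."""
    candidates = []
    for ent in set(entities):
        if not ent:
            continue  # an empty pattern would match at every position
        for match in re.finditer(re.escape(ent), text):
            candidates.append((match.start(), match.end(), ent))
    # Longest match first, then leftmost, so "hair cells" wins over "hair".
    candidates.sort(key=lambda span: (-(span[1] - span[0]), span[0]))
    chosen = []
    for start, end, ent in candidates:
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, ent))
    return sorted(chosen)


text = "hair cells and supporting cells surround each hair cell"
spans = all_char_spans(text, ["hair", "hair cells", "hair cell"])
print(spans)
```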

In [None]:
# Apply the function to each row
grouped_ner_df['bio_tags'] = grouped_ner_df.apply(tokenize_and_tag, axis=1)
display(grouped_ner_df)

index,pmcid,sentence,exact,type,prefix_exact_postfix,bio_tags
0,PMC6504235,(C) Heat map showing the relative enrichment o...,"[hair, cells, supporting cells, cochlear hair ...","[Cell, Cell, Cell, Cell, Cell, Cell, Cell, Cel...","[eq peaks in utricle hair cells, utricle supp,...","[((, O), (C, O), (), O), (Heat, O), (map, O), ..."
1,PMC6504235,112 of these genes appear to be generic marker...,"[hair cells, neonatal, cochlear hair cells]","[Cell, Cell, Cell]","[ generic markers of hair cells, as they are a...","[(112, O), (of, O), (these, O), (genes, O), (a..."
2,PMC6504235,24 hr of exposure to gentamicin led to signifi...,[hair cell],[Cell],[ led to significant hair cell loss in the utr...,"[(24, O), (hr, O), (of, O), (exposure, O), (to..."
3,PMC6504235,70 of the top 100 enriched utricle hair cell g...,"[hair cell, hair cells]","[Cell, Cell]",[00 enriched utricle hair cell genes have been...,"[(70, O), (of, O), (the, O), (top, O), (100, O..."
4,PMC6504235,A number of transcription factors have been pr...,[hair cell],[Cell],"[e with Atoh1 during hair cell induction, such...","[(A, O), (number, O), (of, O), (transcription,..."
...,...,...,...,...,...,...
1953,PMC6937151,When we examined the proximal zebrafish intest...,[EECs],[Cell],[iscovered that most EECs had adopted a close],"[(When, O), (we, O), (examined, O), (the, O), ..."
1954,PMC6937151,Whereas HF feeding normally reduces the EEC mo...,[EEC],[Cell],"[ormally reduces the EEC morphology score, t]","[(Whereas, O), (HF, O), (feeding, O), (normall..."
1955,PMC6937151,Whether EECs adopt the same mechanisms as neur...,[neurons],[Cell],[ same mechanisms as neurons to prune their cell],"[(Whether, O), (EECs, O), (adopt, O), (the, O)..."
1956,PMC6937151,Wild-type adult EKW zebrafish were bred and cl...,[eggs],[Cell],[red and clutches of eggs from three distinct],"[(Wild-type, O), (adult, O), (EKW, O), (zebraf..."


Shuffle the dataset and split it into train and test sets at a 4:1 ratio, then save both sets to TSV files.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Shuffle the DataFrame (fixed seed so the shuffle is reproducible)
df_shuffled = grouped_ner_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Split the DataFrame into train and test sets (80% train, 20% test)
df_train, df_test = train_test_split(df_shuffled, test_size=0.2, random_state=42)
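
Because the split is made at sentence level, sentences from one article can land in both sets. If that is a concern, one option is to split by `pmcid` instead. This is a minimal sketch with the standard library on made-up rows; `split_by_group` is a hypothetical helper, and the 80/20 ratio is applied to articles rather than sentences.

```python
import random


def split_by_group(rows, group_key, test_fraction=0.2, seed=42):
    """Split rows so that all rows sharing a group value stay in the same set."""
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [row for row in rows if row[group_key] not in test_groups]
    test = [row for row in rows if row[group_key] in test_groups]
    return train, test


# Ten made-up articles with three sentences each.
rows = [{"pmcid": f"PMC{i}", "sentence": f"s{j}"} for i in range(10) for j in range(3)]
train, test = split_by_group(rows, "pmcid")
print(len(train), len(test))
```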

In [None]:
def save_to_tsv(df, output_file_path):
    with open(output_file_path, "w") as f:
        # Iterate over each row in the DataFrame
        for _, row in df.iterrows():
            # Retrieve the list of (token, tag) tuples from the specified column
            bio_tags = row['bio_tags']
            # Write each tuple to the file, token and tag separated by a tab
            for token, tag in bio_tags:
                f.write(f"{token}\t{tag}\n")
            # Write a blank line after each sentence's tags to separate the data
            f.write("\n")

In [None]:
# Save each set to TSV files
save_to_tsv(df_train, '/content/data/train_data.txt')
save_to_tsv(df_test, '/content/data/test_data.txt')
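
A quick way to check the written files is to read them back. The sketch writes two sentences in the same layout (one `token<TAB>tag` pair per line, a blank line between sentences) to a temporary file and parses it again. `read_tsv_sentences` is a hypothetical helper, separate from Stanza's own `read_tsv`, and the sample sentences are illustrative.

```python
import os
import tempfile


def read_tsv_sentences(path):
    """Parse a token<TAB>tag file where blank lines separate sentences."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                # A blank line closes the current sentence.
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, tag = line.rsplit("\t", 1)
            current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


original = [[("T", "B-Cell"), ("cell", "I-Cell"), ("counts", "O")], [("EECs", "B-Cell")]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        for sentence in original:
            for token, tag in sentence:
                f.write(f"{token}\t{tag}\n")
            f.write("\n")
    restored = read_tsv_sentences(path)

print(restored == original)
```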

### Training and Evaluation

Here I used the Stanza Python NLP Library to train the NER model.

In [None]:
! git clone https://github.com/stanfordnlp/stanza.git

Cloning into 'stanza'...
remote: Enumerating objects: 40118, done.
remote: Counting objects: 100% (2205/2205), done.
remote: Compressing objects: 100% (683/683), done.
remote: Total 40118 (delta 1680), reused 1961 (delta 1520), pack-reused 37913
Receiving objects: 100% (40118/40118), 83.21 MiB | 17.44 MiB/s, done.
Resolving deltas: 100% (30760/30760), done.


In [None]:
%cd /content/stanza

/content/stanza


Check out the dev branch, create a working branch named cell_ner, and print PYTHONPATH to check the environment.

In [None]:
! git checkout dev
! git checkout -b cell_ner
! echo $PYTHONPATH

Branch 'dev' set up to track remote branch 'dev' from 'origin'.
Switched to a new branch 'dev'
Switched to a new branch 'cell_ner'
/env/python


In [None]:
! pip install -e .

Obtaining file:///content/stanza
  Preparing metadata (setup.py) ... done
Collecting emoji (from stanza==1.8.0)
  Downloading emoji-2.10.1-py2.py3-none-any.whl (421 kB)
Installing collected packages: emoji, stanza
  Running setup.py develop for stanza
Successfully installed emoji-2.10.1 stanza-1.8.0


Copy the input data to the application directory.

In [None]:
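# exist_ok=True lets this cell be re-run without raising FileExistsError
os.makedirs('/content/stanza/data/Input', exist_ok=True)
os.makedirs('/content/stanza/data/ner', exist_ok=True)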

In [None]:
! cp /content/data/train_data.txt /content/stanza/data/Input
! cp /content/data/test_data.txt /content/stanza/data/Input

Here I modified a few source files so that we can quickly go through the training process. More changes are needed to implement a proper training component.

In [None]:
%%writefile stanza/utils/datasets/ner/convert_bn_daffodil.py
"""
Convert a Bengali NER dataset to our internal .json format

The dataset is here:

https://github.com/Rifat1493/Bengali-NER/tree/master/Input
"""

import argparse
import os
import random
import tempfile

from stanza.utils.datasets.ner.utils import read_tsv, write_dataset

def redo_time_tags(sentences):
    """
    Replace all TIM, TIM with B-TIM, I-TIM

    A brief use of Google Translate suggests the time phrases are
    generally one phrase, so we don't want to turn this into B-TIM, B-TIM
    """
    new_sentences = []

    for sentence in sentences:
        new_sentence = []
        prev_time = False
        for word, tag in sentence:
            if tag == 'TIM':
                if prev_time:
                    new_sentence.append((word, "I-TIM"))
                else:
                    prev_time = True
                    new_sentence.append((word, "B-TIM"))
            else:
                prev_time = False
                new_sentence.append((word, tag))
        new_sentences.append(new_sentence)

    return new_sentences

def strip_words(dataset):
    return [[(x[0].strip().replace('\ufeff', ''), x[1]) for x in sentence] for sentence in dataset]

def filter_blank_words(train_file, train_filtered_file):
    """
    As of July 2022, this dataset has blank words with O labels, which is not ideal

    This method removes those lines
    """
    with open(train_file, encoding="utf-8") as fin:
        with open(train_filtered_file, "w", encoding="utf-8") as fout:
            for line in fin:
                if line.strip() == 'O':
                    continue
                fout.write(line)

def filter_broken_tags(train_sentences):
    """
    Eliminate any sentences where any of the tags were empty
    """
    return [x for x in train_sentences if not any(y[1] is None for y in x)]

def filter_bad_words(train_sentences):
    """
    Not bad words like poop, but characters that don't exist

    These characters look like n and l in emacs, but they are really
    0xF06C and 0xF06E
    """
    return [[x for x in sentence if x[0] not in ("\uf06c", "\uf06e")] for sentence in train_sentences]

def read_datasets(in_directory):
    """
    Reads & splits the train data, reads the test data

    There is no validation data, so we split the training data into
    two pieces and use the smaller piece as the dev set

    Also performed is a conversion of TIM -> B-TIM, I-TIM
    """
    # make sure we always get the same shuffle & split
    random.seed(1234)

    train_file = os.path.join(in_directory, "Input", "train_data.txt")
    with tempfile.TemporaryDirectory() as tempdir:
        train_filtered_file = os.path.join(tempdir, "train.txt")
        filter_blank_words(train_file, train_filtered_file)
        train_sentences = read_tsv(train_filtered_file, text_column=0, annotation_column=1, keep_broken_tags=True)
    train_sentences = filter_broken_tags(train_sentences)
    train_sentences = filter_bad_words(train_sentences)
    train_sentences = redo_time_tags(train_sentences)
    train_sentences = strip_words(train_sentences)

    test_file = os.path.join(in_directory, "Input", "test_data.txt")
    test_sentences = read_tsv(test_file, text_column=0, annotation_column=1, keep_broken_tags=True)
    test_sentences = filter_broken_tags(test_sentences)
    test_sentences = filter_bad_words(test_sentences)
    test_sentences = redo_time_tags(test_sentences)
    test_sentences = strip_words(test_sentences)

    random.shuffle(train_sentences)
    split_len = len(train_sentences) * 9 // 10
    dev_sentences = train_sentences[split_len:]
    train_sentences = train_sentences[:split_len]

    datasets = (train_sentences, dev_sentences, test_sentences)
    return datasets

def convert_dataset(in_directory, out_directory):
    """
    Reads the datasets using read_datasets, then write them back out
    """
    datasets = read_datasets(in_directory)
    # write_dataset(datasets, out_directory, "bn_daffodil")
    write_dataset(datasets, out_directory, "en_cell")

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_path', type=str, default="/home/john/extern_data/ner/bangla/Bengali-NER", help="Where to find the files")
    parser.add_argument('--output_path', type=str, default="/home/john/stanza/data/ner", help="Where to output the results")
    args = parser.parse_args()

    convert_dataset(args.input_path, args.output_path)


Overwriting stanza/utils/datasets/ner/convert_bn_daffodil.py
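
Since there is no separate validation file, the conversion script shuffles the training sentences with a fixed seed and holds out the last tenth as the dev set. The same split logic in isolation, as a sketch on placeholder data; unlike the script, this version uses a local `random.Random` instance and shuffles a copy, so the exact ordering will differ. `split_train_dev` is a hypothetical helper.

```python
import random


def split_train_dev(sentences, seed=1234):
    """Shuffle a copy of the sentences and hold out the last tenth as the dev set."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic: the first 9/10 (rounded down) go to train, the rest to dev.
    split_len = len(shuffled) * 9 // 10
    return shuffled[:split_len], shuffled[split_len:]


data = [f"sentence {i}" for i in range(25)]
train, dev = split_train_dev(data)
print(len(train), len(dev))
```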


In [None]:
%%writefile stanza/utils/datasets/ner/prepare_ner_dataset.py

"""Converts raw data files into json files usable by the training script.

Currently it supports converting wikiner datasets, available here:
  https://figshare.com/articles/dataset/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
  - download the language of interest to {Language}-WikiNER
  - then run
    prepare_ner_dataset.py French-WikiNER

Also, Finnish Turku dataset, available here:
  - https://turkunlp.org/fin-ner.html
  - https://github.com/TurkuNLP/turku-ner-corpus
    git clone the repo into $NERBASE/finnish
    you will now have a directory
    $NERBASE/finnish/turku-ner-corpus
  - prepare_ner_dataset.py fi_turku

FBK in Italy produced an Italian dataset.
  - KIND: an Italian Multi-Domain Dataset for Named Entity Recognition
    Paccosi T. and Palmero Aprosio A.
    LREC 2022
  - https://arxiv.org/abs/2112.15099
  The processing here is for a combined .tsv file they sent us.
  - prepare_ner_dataset.py it_fbk
  There is a newer version of the data available here:
    https://github.com/dhfbk/KIND
  TODO: update to the newer version of the data

IJCNLP 2008 produced a few Indian language NER datasets.
  description:
    http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=3
  download:
    http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
  The models produced from these datasets have extremely low recall, unfortunately.
  - prepare_ner_dataset.py hi_ijc

FIRE 2013 also produced NER datasets for Indian languages.
  http://au-kbc.org/nlp/NER-FIRE2013/index.html
  The datasets are password locked.
  For Stanford users, contact Chris Manning for license details.
  For external users, please contact the organizers for more information.
  - prepare_ner_dataset.py hi-fire2013

HiNER is another Hindi dataset option
  https://github.com/cfiltnlp/HiNER
  - HiNER: A Large Hindi Named Entity Recognition Dataset
    Murthy, Rudra and Bhattacharjee, Pallab and Sharnagat, Rahul and
    Khatri, Jyotsana and Kanojia, Diptesh and Bhattacharyya, Pushpak
  There are two versions:
    hi_hinercollapsed and hi_hiner
  The collapsed version has just PER, LOC, ORG
  - convert data as follows:
    cd $NERBASE
    mkdir hindi
    cd hindi
    git clone git@github.com:cfiltnlp/HiNER.git
    python3 -m stanza.utils.datasets.ner.prepare_ner_dataset hi_hiner
    python3 -m stanza.utils.datasets.ner.prepare_ner_dataset hi_hinercollapsed

Ukrainian NER is provided by lang-uk, available here:
  https://github.com/lang-uk/ner-uk
  git clone the repo to $NERBASE/lang-uk
  There should be a subdirectory $NERBASE/lang-uk/ner-uk/data at that point
  Conversion script graciously provided by Andrii Garkavyi @gawy
  - prepare_ner_dataset.py uk_languk

Two Hungarian datasets are available here:
  https://rgai.inf.u-szeged.hu/node/130
  http://www.lrec-conf.org/proceedings/lrec2006/pdf/365_pdf.pdf
  We combined them and give them the label hu_rgai
  You can also build individual pieces with hu_rgai_business or hu_rgai_criminal
  Create a subdirectory of $NERBASE, $NERBASE/hu_rgai, and download both of
    the pieces and unzip them in that directory.
  - prepare_ner_dataset.py hu_rgai

Another Hungarian dataset is here:
  - https://github.com/nytud/NYTK-NerKor
  - git clone the entire thing in your $NERBASE directory to operate on it
  - prepare_ner_dataset.py hu_nytk

The two Hungarian datasets can be combined with hu_combined
  TODO: verify that there is no overlap in text
  - prepare_ner_dataset.py hu_combined

BSNLP publishes NER datasets for Eastern European languages.
  - In 2019 they published BG, CS, PL, RU.
  - http://bsnlp.cs.helsinki.fi/bsnlp-2019/shared_task.html
  - In 2021 they added some more data, but the test sets
    were not publicly available as of April 2021.
    Therefore, currently the model is made from 2019.
    In 2021, the link to the 2021 task is here:
    http://bsnlp.cs.helsinki.fi/shared-task.html
  - The below method processes the 2019 version of the corpus.
    It has specific adjustments for the BG section, which has
    quite a few typos or mis-annotations in it.  Other languages
    probably need similar work in order to function optimally.
  - make a directory $NERBASE/bsnlp2019
  - download the "training data are available HERE" and
    "test data are available HERE" to this subdirectory
  - unzip those files in that directory
  - we use the code name "bg_bsnlp19".  Other languages from
    bsnlp 2019 can be supported by adding the appropriate
    functionality in convert_bsnlp.py.
  - prepare_ner_dataset.py bg_bsnlp19

NCHLT produced NER datasets for many African languages.
  Unfortunately, it is difficult to make use of many of these,
  as there is no corresponding UD data from which to build a
  tokenizer or other tools.
  - Afrikaans:  https://repo.sadilar.org/handle/20.500.12185/299
  - isiNdebele: https://repo.sadilar.org/handle/20.500.12185/306
  - isiXhosa:   https://repo.sadilar.org/handle/20.500.12185/312
  - isiZulu:    https://repo.sadilar.org/handle/20.500.12185/319
  - Sepedi:     https://repo.sadilar.org/handle/20.500.12185/328
  - Sesotho:    https://repo.sadilar.org/handle/20.500.12185/334
  - Setswana:   https://repo.sadilar.org/handle/20.500.12185/341
  - Siswati:    https://repo.sadilar.org/handle/20.500.12185/346
  - Tsivenda:   https://repo.sadilar.org/handle/20.500.12185/355
  - Xitsonga:   https://repo.sadilar.org/handle/20.500.12185/362
  Agree to the license, download the zip, and unzip it in
  $NERBASE/NCHLT

UCSY built a Myanmar dataset.  They have not made it publicly
  available, but they did make it available to Stanford for research
  purposes.  Contact Chris Manning or John Bauer for the data files if
  you are Stanford affiliated.
  - https://arxiv.org/abs/1903.04739
  - Syllable-based Neural Named Entity Recognition for Myanmar Language
    by Hsu Myat Mo and Khin Mar Soe

Hanieh Poostchi et al produced a Persian NER dataset:
  - git@github.com:HaniehP/PersianNER.git
  - https://github.com/HaniehP/PersianNER
  - Hanieh Poostchi, Ehsan Zare Borzeshi, Mohammad Abdous, and Massimo Piccardi,
    "PersoNER: Persian Named-Entity Recognition"
  - Hanieh Poostchi, Ehsan Zare Borzeshi, and Massimo Piccardi,
    "BiLSTM-CRF for Persian Named-Entity Recognition; ArmanPersoNERCorpus: the First Entity-Annotated Persian Dataset"
  - Conveniently, this dataset is already in BIO format.  It does not have a dev split, though.
    git clone the above repo, unzip ArmanPersoNERCorpus.zip, and this script will split the
    first train fold into a dev section.

SUC3 is a Swedish NER dataset provided by Språkbanken
  - https://spraakbanken.gu.se/en/resources/suc3
  - The splitting tool is generously provided by
    Emil Stenstrom
    https://github.com/EmilStenstrom/suc_to_iob
  - Download the .bz2 file at this URL and put it in $NERBASE/sv_suc3shuffle
    It is not necessary to unzip it.
  - Gustafson-Capková, Sophia and Britt Hartmann, 2006,
    Manual of the Stockholm Umeå Corpus version 2.0.
    Stockholm University.
  - Östling, Robert, 2013, Stagger
    an Open-Source Part of Speech Tagger for Swedish
    Northern European Journal of Language Technology 3: 1–18
    DOI 10.3384/nejlt.2000-1533.1331
  - The shuffled dataset can be converted with dataset code
    prepare_ner_dataset.py sv_suc3shuffle
  - If you fill out the license form and get the official data,
    you can get the official splits by putting the provided zip file
    in $NERBASE/sv_suc3licensed.  Again, not necessary to unzip it
    python3 -m stanza.utils.datasets.ner.prepare_ner_dataset sv_suc3licensed

DDT is a reformulation of the Danish Dependency Treebank as an NER dataset
  - https://danlp-alexandra.readthedocs.io/en/latest/docs/datasets.html#dane
  - direct download link as of late 2021: https://danlp.alexandra.dk/304bd159d5de/datasets/ddt.zip
  - https://aclanthology.org/2020.lrec-1.565.pdf
    DaNE: A Named Entity Resource for Danish
    Rasmus Hvingelby, Amalie Brogaard Pauli, Maria Barrett,
    Christina Rosted, Lasse Malm Lidegaard, Anders Søgaard
  - place ddt.zip in $NERBASE/da_ddt/ddt.zip
    python3 -m stanza.utils.datasets.ner.prepare_ner_dataset da_ddt

NorNE is the Norwegian Dependency Treebank with NER labels
  - LREC 2020
    NorNE: Annotating Named Entities for Norwegian
    Fredrik Jørgensen, Tobias Aasmoe, Anne-Stine Ruud Husevåg,
    Lilja Øvrelid, and Erik Velldal
  - both Bokmål and Nynorsk
  - This dataset is in a git repo:
    https://github.com/ltgoslo/norne
    Clone it into $NERBASE
    git clone git@github.com:ltgoslo/norne.git
    python3 -m stanza.utils.datasets.ner.prepare_ner_dataset nb_norne
    python3 -m stanza.utils.datasets.ner.prepare_ner_dataset nn_norne

tr_starlang is a set of constituency trees for Turkish
  The words in this dataset (usually) have NER labels as well

  A dataset in three parts from the Starlang group in Turkey:
  Neslihan Kara, Büşra Marşan, et al
    Creating A Syntactically Felicitous Constituency Treebank For Turkish
    https://ieeexplore.ieee.org/document/9259873
  git clone the following three repos
    https://github.com/olcaytaner/TurkishAnnotatedTreeBank-15
    https://github.com/olcaytaner/TurkishAnnotatedTreeBank2-15
    https://github.com/olcaytaner/TurkishAnnotatedTreeBank2-20
  Put them in
    $CONSTITUENCY_HOME/turkish    (yes, the constituency home)
  python3 -m stanza.utils.datasets.ner.prepare_ner_dataset tr_starlang

GermEval2014 is a German NER dataset
  https://sites.google.com/site/germeval2014ner/data
  https://drive.google.com/drive/folders/1kC0I2UGl2ltrluI9NqDjaQJGw5iliw_J
  Download the files in that directory
    NER-de-train.tsv NER-de-dev.tsv NER-de-test.tsv
  put them in
    $NERBASE/germeval2014
  then run
    python3 -m stanza.utils.datasets.ner.prepare_ner_dataset de_germeval2014

The UD Japanese GSD dataset has a conversion by Megagon Labs
  https://github.com/megagonlabs/UD_Japanese-GSD
  https://github.com/megagonlabs/UD_Japanese-GSD/tags
  - r2.9-NE has the NE tagged files inside a "spacy"
    folder in the download
  - expected directory for this data:
    unzip the .zip of the release into
      $NERBASE/ja_gsd
    so it should wind up in
      $NERBASE/ja_gsd/UD_Japanese-GSD-r2.9-NE
    python3 -m stanza.utils.datasets.ner.prepare_ner_dataset ja_gsd

L3Cube is a Marathi dataset
  - https://arxiv.org/abs/2204.06029
    https://arxiv.org/pdf/2204.06029.pdf
    https://github.com/l3cube-pune/MarathiNLP
  - L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models
    Parth Patil, Aparna Ranade, Maithili Sabane, Onkar Litake, Raviraj Joshi

  Clone the repo into $NERBASE/marathi
    git clone git@github.com:l3cube-pune/MarathiNLP.git
  Then run
    python3 -m stanza.utils.datasets.ner.prepare_ner_dataset mr_l3cube

Daffodil University produced a Bangla NER dataset
  - https://github.com/Rifat1493/Bengali-NER
  - https://ieeexplore.ieee.org/document/8944804
  - Bengali Named Entity Recognition:
    A survey with deep learning benchmark
    Md Jamiur Rahman Rifat, Sheikh Abujar, Sheak Rashed Haider Noori,
    Syed Akhter Hossain

  Clone the repo into a "bangla" subdirectory of $NERBASE
    cd $NERBASE/bangla
    git clone git@github.com:Rifat1493/Bengali-NER.git
  Then run
    python3 -m stanza.utils.datasets.ner.prepare_ner_dataset bn_daffodil

LST20 is a Thai NER dataset from 2020
  - https://arxiv.org/abs/2008.05055
    The Annotation Guideline of LST20 Corpus
    Prachya Boonkwan, Vorapon Luantangsrisuk, Sitthaa Phaholphinyo,
    Kanyanat Kriengket, Dhanon Leenoi, Charun Phrombut,
    Monthika Boriboon, Krit Kosawat, Thepchai Supnithi
  - This script processes a version which can be downloaded here after registration:
    https://aiforthai.in.th/index.php
  - There is another version downloadable from HuggingFace
    The script will likely need some modification to be compatible
    with the HuggingFace version
  - Download the data in $NERBASE/thai/LST20_Corpus
    There should be "train", "eval", "test" directories after downloading
  - Then run
    python3 -m stanza.utils.datasets.ner.prepare_ner_dataset th_lst20

Thai-NNER is another Thai NER dataset, from 2022
  - https://github.com/vistec-AI/Thai-NNER
  - https://aclanthology.org/2022.findings-acl.116/
    Thai Nested Named Entity Recognition Corpus
    Weerayut Buaphet, Can Udomcharoenchaikit, Peerat Limkonchotiwat,
    Attapol Rutherford, and Sarana Nutanong
  - git clone the data to $NERBASE/thai
  - On the git repo, there should be a link to a more complete version
    of the dataset.  For example, in Sep. 2023 it is here:
    https://github.com/vistec-AI/Thai-NNER#dataset
    The Google drive it goes to has "postproc".
    Put the train.json, dev.json, and test.json in
    $NERBASE/thai/Thai-NNER/data/scb-nner-th-2022/postproc/
  - Then run
    python3 -m stanza.utils.datasets.ner.prepare_ner_dataset th_nner22


NKJP is a Polish NER dataset
  - http://nkjp.pl/index.php?page=0&lang=1
    About the Project
  - http://zil.ipipan.waw.pl/DistrNKJP
    Wikipedia subcorpus used to train charlm model
  - http://clip.ipipan.waw.pl/NationalCorpusOfPolish?action=AttachFile&do=view&target=NKJP-PodkorpusMilionowy-1.2.tar.gz
    Annotated subcorpus to train NER model.
    Download and extract to $NERBASE/Polish-NKJP or leave the gzip in $NERBASE/polish/...

kk_kazNERD is a Kazakh dataset published in 2021
  - https://github.com/IS2AI/KazNERD
  - https://arxiv.org/abs/2111.13419
    KazNERD: Kazakh Named Entity Recognition Dataset
    Rustem Yeshpanov, Yerbolat Khassanov, Huseyin Atakan Varol
  - in $NERBASE, make a "kazakh" directory, then git clone the repo there
    mkdir -p $NERBASE/kazakh
    cd $NERBASE/kazakh
    git clone git@github.com:IS2AI/KazNERD.git
  - Then run
    python3 -m stanza.utils.datasets.ner.prepare_ner_dataset kk_kazNERD

Masakhane NER is a set of NER datasets for African languages
  - MasakhaNER: Named Entity Recognition for African Languages
    Adelani, David Ifeoluwa; Abbott, Jade; Neubig, Graham;
    D’souza, Daniel; Kreutzer, Julia; Lignos, Constantine;
    Palen-Michel, Chester; Buzaaba, Happy; Rijhwani, Shruti;
    Ruder, Sebastian; Mayhew, Stephen; Azime, Israel Abebe;
    Muhammad, Shamsuddeen H.; Emezue, Chris Chinenye;
    Nakatumba-Nabende, Joyce; Ogayo, Perez; Anuoluwapo, Aremu;
    Gitau, Catherine; Mbaye, Derguene; Alabi, Jesujoba;
    Yimam, Seid Muhie; Gwadabe, Tajuddeen Rabiu; Ezeani, Ignatius;
    Niyongabo, Rubungo Andre; Mukiibi, Jonathan; Otiende, Verrah;
    Orife, Iroro; David, Davis; Ngom, Samba; Adewumi, Tosin;
    Rayson, Paul; Adeyemi, Mofetoluwa; Muriuki, Gerald;
    Anebi, Emmanuel; Chukwuneke, Chiamaka; Odu, Nkiruka;
    Wairagala, Eric Peter; Oyerinde, Samuel; Siro, Clemencia;
    Bateesa, Tobius Saul; Oloyede, Temilola; Wambui, Yvonne;
    Akinode, Victor; Nabagereka, Deborah; Katusiime, Maurice;
    Awokoya, Ayodele; MBOUP, Mouhamadane; Gebreyohannes, Dibora;
    Tilaye, Henok; Nwaike, Kelechi; Wolde, Degaga; Faye, Abdoulaye;
    Sibanda, Blessing; Ahia, Orevaoghene; Dossou, Bonaventure F. P.;
    Ogueji, Kelechi; DIOP, Thierno Ibrahima; Diallo, Abdoulaye;
    Akinfaderin, Adewale; Marengereke, Tendai; Osei, Salomey
  - https://github.com/masakhane-io/masakhane-ner
  - git clone the repo to $NERBASE
  - Then run
    python3 -m stanza.utils.datasets.ner.prepare_ner_dataset lcode_masakhane
  - You can use the full language name, the 3 letter language code,
    or in the case of languages with a 2 letter language code,
    the 2 letter code for lcode.  The tool will throw an error
    if the language is not supported in Masakhane.

SiNER is a Sindhi NER dataset
  - https://aclanthology.org/2020.lrec-1.361/
    SiNER: A Large Dataset for Sindhi Named Entity Recognition
    Wazir Ali, Junyu Lu, Zenglin Xu
  - It is available via git repository
    https://github.com/AliWazir/SiNER-dataset
    As of Nov. 2022, there were a few changes to the dataset
    to update a couple instances of broken tags & tokenization
  - Clone the repo to $NERBASE/sindhi
    mkdir $NERBASE/sindhi
    cd $NERBASE/sindhi
    git clone git@github.com:AliWazir/SiNER-dataset.git
  - Then, prepare the dataset with this script:
    python3 -m stanza.utils.datasets.ner.prepare_ner_dataset sd_siner

en_sample is the toy dataset included with stanza-train
  https://github.com/stanfordnlp/stanza-train
  this is not meant for any kind of actual NER use

ArmTDP-NER is an Armenian NER dataset
  - https://github.com/myavrum/ArmTDP-NER.git
    ArmTDP-NER: The corpus was developed by the ArmTDP team led by Marat M. Yavrumyan
    at the Yerevan State University by the collaboration of "Armenia National SDG Innovation Lab"
    and "UC Berkeley's Armenian Linguists' network".
  - in $NERBASE, make an "armenian" directory, then git clone the repo there
    mkdir -p $NERBASE/armenian
    cd $NERBASE/armenian
    git clone https://github.com/myavrum/ArmTDP-NER.git
  - Then run
    python3 -m stanza.utils.datasets.ner.prepare_ner_dataset hy_armtdp

en_conll03 is the classic 2003 4 class CoNLL dataset
  - The version we use is posted on HuggingFace
  - https://huggingface.co/datasets/conll2003
  - The prepare script will download from HF
    using the datasets package, then convert to json
  - Introduction to the CoNLL-2003 Shared Task:
    Language-Independent Named Entity Recognition
    Tjong Kim Sang, Erik F. and De Meulder, Fien
  - python3 stanza/utils/datasets/ner/prepare_ner_dataset.py en_conll03

en_conll03ww is CoNLL 03 with Worldwide added to the training data.
  - python3 stanza/utils/datasets/ner/prepare_ner_dataset.py en_conll03ww

en_conllpp is a test set from 2020 newswire
  - https://arxiv.org/abs/2212.09747
  - https://github.com/ShuhengL/acl2023_conllpp
  - Do CoNLL-2003 Named Entity Taggers Still Work Well in 2023?
    Shuheng Liu, Alan Ritter
  - git clone the repo in $NERBASE
  - then run
    python3 stanza/utils/datasets/ner/prepare_ner_dataset.py en_conllpp

en_ontonotes is the OntoNotes 5 on HuggingFace
  - https://huggingface.co/datasets/conll2012_ontonotesv5
  - python3 stanza/utils/datasets/ner/prepare_ner_dataset.py en_ontonotes
  - this downloads the "v12" version of the data

en_worldwide-4class is an English non-US newswire dataset
  - annotated by MLTwist and Aya Data, with help from Datasaur,
    collected at Stanford
  - work to be published at EMNLP Findings
  - the 4 class version is converted to the 4 classes in conll,
    then split into train/dev/test
  - clone https://github.com/stanfordnlp/en-worldwide-newswire
    into $NERBASE/en_worldwide

en_worldwide-9class is an English non-US newswire dataset
  - annotated by MLTwist and Aya Data, with help from Datasaur,
    collected at Stanford
  - work to be published at EMNLP Findings
  - the 9 class version is not edited
  - clone https://github.com/stanfordnlp/en-worldwide-newswire
    into $NERBASE/en_worldwide

zh-hans_ontonotes is the ZH split of the OntoNotes dataset
  - https://catalog.ldc.upenn.edu/LDC2013T19
  - https://huggingface.co/datasets/conll2012_ontonotesv5
  - python3 stanza/utils/datasets/ner/prepare_ner_dataset.py zh-hans_ontonotes
  - this downloads the "v4" version of the data


AQMAR is a small dataset of Arabic Wikipedia articles
  - http://www.cs.cmu.edu/~ark/ArabicNER/
  - Recall-Oriented Learning of Named Entities in Arabic Wikipedia
    Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, and Noah A. Smith.
    In Proceedings of the 13th Conference of the European Chapter of
    the Association for Computational Linguistics, Avignon, France,
    April 2012.
  - download the .zip file there and put it in
    $NERBASE/arabic/AQMAR
  - there is a challenge for it here:
    https://www.topcoder.com/challenges/f3cf483e-a95c-4a7e-83e8-6bdd83174d38
  - alternatively, we just randomly split it ourselves
  - currently, running the following reproduces the random split:
    python3 stanza/utils/datasets/ner/prepare_ner_dataset.py ar_aqmar

"""

import glob
import os
import random
import re
import shutil
import sys
import tempfile

from stanza.models.common.constant import treebank_to_short_name, lcode2lang, lang_to_langcode, two_to_three_letters
import stanza.utils.default_paths as default_paths

from stanza.utils.datasets.ner.preprocess_wikiner import preprocess_wikiner
from stanza.utils.datasets.ner.split_wikiner import split_wikiner
import stanza.utils.datasets.ner.build_en_combined as build_en_combined
import stanza.utils.datasets.ner.conll_to_iob as conll_to_iob
import stanza.utils.datasets.ner.convert_ar_aqmar as convert_ar_aqmar
import stanza.utils.datasets.ner.convert_bn_daffodil as convert_bn_daffodil
import stanza.utils.datasets.ner.convert_bsf_to_beios as convert_bsf_to_beios
import stanza.utils.datasets.ner.convert_bsnlp as convert_bsnlp
import stanza.utils.datasets.ner.convert_en_conll03 as convert_en_conll03
import stanza.utils.datasets.ner.convert_fire_2013 as convert_fire_2013
import stanza.utils.datasets.ner.convert_ijc as convert_ijc
import stanza.utils.datasets.ner.convert_kk_kazNERD as convert_kk_kazNERD
import stanza.utils.datasets.ner.convert_lst20 as convert_lst20
import stanza.utils.datasets.ner.convert_nner22 as convert_nner22
import stanza.utils.datasets.ner.convert_mr_l3cube as convert_mr_l3cube
import stanza.utils.datasets.ner.convert_my_ucsy as convert_my_ucsy
import stanza.utils.datasets.ner.convert_ontonotes as convert_ontonotes
import stanza.utils.datasets.ner.convert_rgai as convert_rgai
import stanza.utils.datasets.ner.convert_nytk as convert_nytk
import stanza.utils.datasets.ner.convert_starlang_ner as convert_starlang_ner
import stanza.utils.datasets.ner.convert_nkjp as convert_nkjp
import stanza.utils.datasets.ner.prepare_ner_file as prepare_ner_file
import stanza.utils.datasets.ner.convert_sindhi_siner as convert_sindhi_siner
import stanza.utils.datasets.ner.ontonotes_multitag as ontonotes_multitag
import stanza.utils.datasets.ner.simplify_en_worldwide as simplify_en_worldwide
import stanza.utils.datasets.ner.suc_to_iob as suc_to_iob
import stanza.utils.datasets.ner.suc_conll_to_iob as suc_conll_to_iob
import stanza.utils.datasets.ner.convert_hy_armtdp as convert_hy_armtdp
from stanza.utils.datasets.ner.utils import convert_bio_to_json, get_tags, read_tsv, write_dataset, random_shuffle_by_prefixes, read_prefix_file, combine_files

SHARDS = ('train', 'dev', 'test')

class UnknownDatasetError(ValueError):
    def __init__(self, dataset, text):
        super().__init__(text)
        self.dataset = dataset

def process_turku(paths, short_name):
    assert short_name == 'fi_turku'
    base_input_path = os.path.join(paths["NERBASE"], "finnish", "turku-ner-corpus", "data", "conll")
    base_output_path = paths["NER_DATA_DIR"]
    for shard in SHARDS:
        input_filename = os.path.join(base_input_path, '%s.tsv' % shard)
        if not os.path.exists(input_filename):
            raise FileNotFoundError('Cannot find %s component of %s in %s' % (shard, short_name, input_filename))
        output_filename = os.path.join(base_output_path, '%s.%s.json' % (short_name, shard))
        prepare_ner_file.process_dataset(input_filename, output_filename)

def process_it_fbk(paths, short_name):
    assert short_name == "it_fbk"
    base_input_path = os.path.join(paths["NERBASE"], short_name)
    csv_file = os.path.join(base_input_path, "all-wiki-split.tsv")
    if not os.path.exists(csv_file):
        raise FileNotFoundError("Cannot find the FBK dataset in its expected location: {}".format(csv_file))
    base_output_path = paths["NER_DATA_DIR"]
    split_wikiner(base_output_path, csv_file, prefix=short_name, suffix="io", shuffle=False, train_fraction=0.8, dev_fraction=0.1)
    convert_bio_to_json(base_output_path, base_output_path, short_name, suffix="io")
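
# Illustrative sketch only, not part of the conversion pipeline: how
# train_fraction=0.8 and dev_fraction=0.1 could carve an unshuffled list
# of sentences into train/dev/test.  The real logic lives in split_wikiner
# and may round differently; _example_fraction_split is a hypothetical name.
def _example_fraction_split(sentences, train_fraction=0.8, dev_fraction=0.1):
    num_train = int(len(sentences) * train_fraction)
    num_dev = int(len(sentences) * dev_fraction)
    train = sentences[:num_train]
    dev = sentences[num_train:num_train + num_dev]
    # whatever is left over becomes the test section
    test = sentences[num_train + num_dev:]
    return train, dev, test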


def process_languk(paths, short_name):
    assert short_name == 'uk_languk'
    base_input_path = os.path.join(paths["NERBASE"], 'lang-uk', 'ner-uk', 'data')
    base_output_path = paths["NER_DATA_DIR"]
    train_test_split_fname = os.path.join(paths["NERBASE"], 'lang-uk', 'ner-uk', 'doc', 'dev-test-split.txt')
    convert_bsf_to_beios.convert_bsf_in_folder(base_input_path, base_output_path, train_test_split_file=train_test_split_fname)
    for shard in SHARDS:
        input_filename = os.path.join(base_output_path, convert_bsf_to_beios.CORPUS_NAME, "%s.bio" % shard)
        if not os.path.exists(input_filename):
            raise FileNotFoundError('Cannot find %s component of %s in %s' % (shard, short_name, input_filename))
        output_filename = os.path.join(base_output_path, '%s.%s.json' % (short_name, shard))
        prepare_ner_file.process_dataset(input_filename, output_filename)


def process_ijc(paths, short_name):
    """
    Splits the ijc Hindi dataset in train, dev, test

    The original data had train & test splits, so we randomly divide
    the files in train to make a dev set.

    The expected location of the IJC data is hi_ijc.  This method
    should be possible to use for other languages, but we have very
    little support for the other languages of IJC at the moment.
    """
    base_input_path = os.path.join(paths["NERBASE"], short_name)
    base_output_path = paths["NER_DATA_DIR"]

    test_files = [os.path.join(base_input_path, "test-data-hindi.txt")]
    test_csv_file = os.path.join(base_output_path, short_name + ".test.csv")
    print("Converting test input %s to space separated file in %s" % (test_files[0], test_csv_file))
    convert_ijc.convert_ijc(test_files, test_csv_file)

    train_input_path = os.path.join(base_input_path, "training-hindi", "*utf8")
    train_files = glob.glob(train_input_path)
    train_csv_file = os.path.join(base_output_path, short_name + ".train.csv")
    dev_csv_file = os.path.join(base_output_path, short_name + ".dev.csv")
    print("Converting training input from %s to space separated files in %s and %s" % (train_input_path, train_csv_file, dev_csv_file))
    convert_ijc.convert_split_ijc(train_files, train_csv_file, dev_csv_file)

    for csv_file, shard in zip((train_csv_file, dev_csv_file, test_csv_file), SHARDS):
        output_filename = os.path.join(base_output_path, '%s.%s.json' % (short_name, shard))
        prepare_ner_file.process_dataset(csv_file, output_filename)
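
# Illustrative sketch only: one way to hold out a dev set by randomly
# dividing the training files, as the process_ijc docstring describes.
# The actual split is done by convert_ijc.convert_split_ijc and may use a
# different fraction or seed; the name, dev_fraction and seed here are
# assumptions.  Relies on the random module imported at the top of the file.
def _example_split_train_dev(filenames, dev_fraction=0.1, seed=1234):
    shuffled = sorted(filenames)
    # a private, seeded generator keeps the split reproducible
    random.Random(seed).shuffle(shuffled)
    num_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[num_dev:], shuffled[:num_dev]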


def process_fire_2013(paths, dataset):
    """
    Splits the FIRE 2013 dataset into train, dev, test

    The provided datasets are all mixed together at this point, so it
    is not possible to recreate the original test conditions used in
    the bakeoff
    """
    short_name = treebank_to_short_name(dataset)
    langcode, _ = short_name.split("_")
    short_name = "%s_fire2013" % langcode
    if langcode not in ("hi", "en", "ta", "bn", "mal"):
        raise UnknownDatasetError(dataset, "Language %s not one of the FIRE 2013 languages" % langcode)
    language = lcode2lang[langcode].lower()

    # for example, FIRE2013/hindi_train
    base_input_path = os.path.join(paths["NERBASE"], "FIRE2013", "%s_train" % language)
    base_output_path = paths["NER_DATA_DIR"]

    train_csv_file = os.path.join(base_output_path, "%s.train.csv" % short_name)
    dev_csv_file   = os.path.join(base_output_path, "%s.dev.csv" % short_name)
    test_csv_file  = os.path.join(base_output_path, "%s.test.csv" % short_name)

    convert_fire_2013.convert_fire_2013(base_input_path, train_csv_file, dev_csv_file, test_csv_file)

    for csv_file, shard in zip((train_csv_file, dev_csv_file, test_csv_file), SHARDS):
        output_filename = os.path.join(base_output_path, '%s.%s.json' % (short_name, shard))
        prepare_ner_file.process_dataset(csv_file, output_filename)

def process_wikiner(paths, dataset):
    short_name = treebank_to_short_name(dataset)

    base_input_path = os.path.join(paths["NERBASE"], dataset)
    base_output_path = paths["NER_DATA_DIR"]

    expected_filename = "aij*wikiner*"
    input_files = [x for x in glob.glob(os.path.join(base_input_path, expected_filename)) if not x.endswith("bz2")]
    if len(input_files) == 0:
        raw_input_path = os.path.join(base_input_path, "raw")
        input_files = [x for x in glob.glob(os.path.join(raw_input_path, expected_filename)) if not x.endswith("bz2")]
        if len(input_files) > 1:
            raise FileNotFoundError("Found too many raw wikiner files in %s: %s" % (raw_input_path, ", ".join(input_files)))
    elif len(input_files) > 1:
        raise FileNotFoundError("Found too many raw wikiner files in %s: %s" % (base_input_path, ", ".join(input_files)))

    if len(input_files) == 0:
        raise FileNotFoundError("Could not find any raw wikiner files in %s or %s" % (base_input_path, raw_input_path))

    csv_file = os.path.join(base_output_path, short_name + "_csv")
    print("Converting raw input %s to space separated file in %s" % (input_files[0], csv_file))
    try:
        preprocess_wikiner(input_files[0], csv_file)
    except UnicodeDecodeError:
        preprocess_wikiner(input_files[0], csv_file, encoding="iso8859-1")

    # this should create train.bio, dev.bio, and test.bio
    print("Splitting %s to %s" % (csv_file, base_output_path))
    split_wikiner(base_output_path, csv_file, prefix=short_name)
    convert_bio_to_json(base_output_path, base_output_path, short_name)

def get_rgai_input_path(paths):
    return os.path.join(paths["NERBASE"], "hu_rgai")

def process_rgai(paths, short_name):
    base_output_path = paths["NER_DATA_DIR"]
    base_input_path = get_rgai_input_path(paths)

    if short_name == 'hu_rgai':
        use_business = True
        use_criminal = True
    elif short_name == 'hu_rgai_business':
        use_business = True
        use_criminal = False
    elif short_name == 'hu_rgai_criminal':
        use_business = False
        use_criminal = True
    else:
        raise UnknownDatasetError(short_name, "Unknown subset of hu_rgai data: %s" % short_name)

    convert_rgai.convert_rgai(base_input_path, base_output_path, short_name, use_business, use_criminal)
    convert_bio_to_json(base_output_path, base_output_path, short_name)

def get_nytk_input_path(paths):
    return os.path.join(paths["NERBASE"], "NYTK-NerKor")

def process_nytk(paths, short_name):
    """
    Process the NYTK dataset
    """
    assert short_name == "hu_nytk"
    base_output_path = paths["NER_DATA_DIR"]
    base_input_path = get_nytk_input_path(paths)

    convert_nytk.convert_nytk(base_input_path, base_output_path, short_name)
    convert_bio_to_json(base_output_path, base_output_path, short_name)

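def concat_files(output_file, *input_files):
    """
    Concatenate the input files into output_file

    Each input is made to end with a blank line so that the last
    sentence of one file does not run into the first sentence of the next.
    Raises ValueError if any input file is empty.
    """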
    input_lines = []
    for input_file in input_files:
        with open(input_file) as fin:
            lines = fin.readlines()
        if not len(lines):
            raise ValueError("Empty input file: %s" % input_file)
        if not lines[-1]:
            lines[-1] = "\n"
        elif lines[-1].strip():
            lines.append("\n")
        input_lines.append(lines)
    with open(output_file, "w") as fout:
        for lines in input_lines:
            for line in lines:
                fout.write(line)
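
# Illustrative sketch only: why concat_files guarantees a blank line at the
# end of each input.  In .bio files a blank line separates sentences, so
# joining two files without one would fuse the last sentence of the first
# file with the first sentence of the second.  This hypothetical helper
# mirrors the idea on in-memory lists of lines instead of files.
def _example_join_bio_lines(*line_blocks):
    joined = []
    for lines in line_blocks:
        joined.extend(lines)
        # add a sentence separator if the block does not already end with one
        if lines and lines[-1].strip():
            joined.append("\n")
    return joined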


def process_hu_combined(paths, short_name):
    assert short_name == "hu_combined"

    base_output_path = paths["NER_DATA_DIR"]
    rgai_input_path = get_rgai_input_path(paths)
    nytk_input_path = get_nytk_input_path(paths)

    with tempfile.TemporaryDirectory() as tmp_output_path:
        convert_rgai.convert_rgai(rgai_input_path, tmp_output_path, "hu_rgai", True, True)
        convert_nytk.convert_nytk(nytk_input_path, tmp_output_path, "hu_nytk")

        for shard in SHARDS:
            rgai_input = os.path.join(tmp_output_path, "hu_rgai.%s.bio" % shard)
            nytk_input = os.path.join(tmp_output_path, "hu_nytk.%s.bio" % shard)
            output_file = os.path.join(base_output_path, "hu_combined.%s.bio" % shard)
            concat_files(output_file, rgai_input, nytk_input)

    convert_bio_to_json(base_output_path, base_output_path, short_name)

def process_bsnlp(paths, short_name):
    """
    Process files downloaded from http://bsnlp.cs.helsinki.fi/bsnlp-2019/shared_task.html

    If you download the training and test data zip files and unzip
    them without rearranging in any way, the layout is somewhat weird.
    Training data goes into a specific subdirectory, but the test data
    goes into the top level directory.
    """
    base_input_path = os.path.join(paths["NERBASE"], "bsnlp2019")
    base_train_path = os.path.join(base_input_path, "training_pl_cs_ru_bg_rc1")
    base_test_path = base_input_path

    base_output_path = paths["NER_DATA_DIR"]

    output_train_filename = os.path.join(base_output_path, "%s.train.csv" % short_name)
    output_dev_filename   = os.path.join(base_output_path, "%s.dev.csv" % short_name)
    output_test_filename  = os.path.join(base_output_path, "%s.test.csv" % short_name)

    language = short_name.split("_")[0]

    convert_bsnlp.convert_bsnlp(language, base_test_path, output_test_filename)
    convert_bsnlp.convert_bsnlp(language, base_train_path, output_train_filename, output_dev_filename)

    for shard, csv_file in zip(SHARDS, (output_train_filename, output_dev_filename, output_test_filename)):
        output_filename = os.path.join(base_output_path, '%s.%s.json' % (short_name, shard))
        prepare_ner_file.process_dataset(csv_file, output_filename)

NCHLT_LANGUAGE_MAP = {
    "af":  "NCHLT Afrikaans Named Entity Annotated Corpus",
    # none of the following have UD datasets as of 2.8.  Until they
    # exist, we assume the NCHLT language codes are sufficient
    "nr":  "NCHLT isiNdebele Named Entity Annotated Corpus",
    "nso": "NCHLT Sepedi Named Entity Annotated Corpus",
    "ss":  "NCHLT Siswati Named Entity Annotated Corpus",
    "st":  "NCHLT Sesotho Named Entity Annotated Corpus",
    "tn":  "NCHLT Setswana Named Entity Annotated Corpus",
    "ts":  "NCHLT Xitsonga Named Entity Annotated Corpus",
    "ve":  "NCHLT Tshivenda Named Entity Annotated Corpus",
    "xh":  "NCHLT isiXhosa Named Entity Annotated Corpus",
    "zu":  "NCHLT isiZulu Named Entity Annotated Corpus",
}

def process_nchlt(paths, short_name):
    language = short_name.split("_")[0]
    if language not in NCHLT_LANGUAGE_MAP:
        raise UnknownDatasetError(short_name, "Language %s not part of NCHLT" % language)
    short_name = "%s_nchlt" % language

    base_input_path = os.path.join(paths["NERBASE"], "NCHLT", NCHLT_LANGUAGE_MAP[language], "*Full.txt")
    input_files = glob.glob(base_input_path)
    if len(input_files) == 0:
        raise FileNotFoundError("Cannot find NCHLT dataset in '%s'  Did you remember to download the file?" % base_input_path)

    if len(input_files) > 1:
        raise ValueError("Unexpected number of files matched '%s'  There should only be one" % base_input_path)

    base_output_path = paths["NER_DATA_DIR"]
    split_wikiner(base_output_path, input_files[0], prefix=short_name, remap={"OUT": "O"})
    convert_bio_to_json(base_output_path, base_output_path, short_name)

def process_my_ucsy(paths, short_name):
    assert short_name == "my_ucsy"
    language = "my"

    base_input_path = os.path.join(paths["NERBASE"], short_name)
    base_output_path = paths["NER_DATA_DIR"]
    convert_my_ucsy.convert_my_ucsy(base_input_path, base_output_path)
    convert_bio_to_json(base_output_path, base_output_path, short_name)

def process_fa_arman(paths, short_name):
    """
    Converts fa_arman dataset

    The conversion is quite simple, actually.
    Just need to split the train file and then convert bio -> json
    """
    assert short_name == "fa_arman"
    language = "fa"
    base_input_path = os.path.join(paths["NERBASE"], "PersianNER")
    train_input_file = os.path.join(base_input_path, "train_fold1.txt")
    test_input_file = os.path.join(base_input_path, "test_fold1.txt")
    if not os.path.exists(train_input_file) or not os.path.exists(test_input_file):
        full_corpus_file = os.path.join(base_input_path, "ArmanPersoNERCorpus.zip")
        if os.path.exists(full_corpus_file):
            raise FileNotFoundError("Please unzip the file {}".format(full_corpus_file))
        raise FileNotFoundError("Cannot find the arman corpus in the expected directory: {}".format(base_input_path))

    base_output_path = paths["NER_DATA_DIR"]
    test_output_file = os.path.join(base_output_path, "%s.test.bio" % short_name)

    split_wikiner(base_output_path, train_input_file, prefix=short_name, train_fraction=0.8, test_section=False)
    shutil.copy2(test_input_file, test_output_file)
    convert_bio_to_json(base_output_path, base_output_path, short_name)

def process_sv_suc3licensed(paths, short_name):
    """
    The .zip provided for SUC3 includes train/dev/test splits already

    This extracts those splits without needing to unzip the original file
    """
    assert short_name == "sv_suc3licensed"
    language = "sv"
    train_input_file = os.path.join(paths["NERBASE"], short_name, "SUC3.0.zip")
    if not os.path.exists(train_input_file):
        raise FileNotFoundError("Cannot find the officially licensed SUC3 dataset in %s" % train_input_file)

    base_output_path = paths["NER_DATA_DIR"]
    suc_conll_to_iob.process_suc3(train_input_file, short_name, base_output_path)
    convert_bio_to_json(base_output_path, base_output_path, short_name)

def process_sv_suc3shuffle(paths, short_name):
    """
    Uses an externally provided script to read the SUC3 XML file, then splits it
    """
    assert short_name == "sv_suc3shuffle"
    language = "sv"
    train_input_file = os.path.join(paths["NERBASE"], short_name, "suc3.xml.bz2")
    if not os.path.exists(train_input_file):
        train_input_file = train_input_file[:-4]
    if not os.path.exists(train_input_file):
        raise FileNotFoundError("Unable to find the SUC3 dataset in {}.bz2".format(train_input_file))

    base_output_path = paths["NER_DATA_DIR"]
    train_output_file = os.path.join(base_output_path, "sv_suc3shuffle.bio")
    suc_to_iob.main([train_input_file, train_output_file])
    split_wikiner(base_output_path, train_output_file, prefix=short_name)
    convert_bio_to_json(base_output_path, base_output_path, short_name)

def process_da_ddt(paths, short_name):
    """
    Processes Danish DDT dataset

    This dataset is in a conll file with the "name" attribute in the
    misc column for the NER tag.  This function uses a script to
    convert such CoNLL files to .bio
    """
    assert short_name == "da_ddt"
    language = "da"
    IN_FILES = ("ddt.train.conllu", "ddt.dev.conllu", "ddt.test.conllu")

    base_output_path = paths["NER_DATA_DIR"]
    OUT_FILES = [os.path.join(base_output_path, "%s.%s.bio" % (short_name, shard)) for shard in SHARDS]

    zip_file = os.path.join(paths["NERBASE"], "da_ddt", "ddt.zip")
    if os.path.exists(zip_file):
        for in_filename, out_filename, shard in zip(IN_FILES, OUT_FILES, SHARDS):
            conll_to_iob.process_conll(in_filename, out_filename, zip_file)
    else:
        for in_filename, out_filename, shard in zip(IN_FILES, OUT_FILES, SHARDS):
            in_filename = os.path.join(paths["NERBASE"], "da_ddt", in_filename)
            if not os.path.exists(in_filename):
                raise FileNotFoundError("Could not find zip in expected location %s and could not find the %s file in %s" % (zip_file, shard, in_filename))

            conll_to_iob.process_conll(in_filename, out_filename)
    convert_bio_to_json(base_output_path, base_output_path, short_name)


def process_norne(paths, short_name):
    """
    Processes Norwegian NorNE

    Can handle either Bokmål or Nynorsk

    Converts GPE_LOC and GPE_ORG to GPE
    """
    language, name = short_name.split("_", 1)
    assert language in ('nb', 'nn')
    assert name == 'norne'

    if language == 'nb':
        IN_FILES = ("nob/no_bokmaal-ud-train.conllu", "nob/no_bokmaal-ud-dev.conllu", "nob/no_bokmaal-ud-test.conllu")
    else:
        IN_FILES = ("nno/no_nynorsk-ud-train.conllu", "nno/no_nynorsk-ud-dev.conllu", "nno/no_nynorsk-ud-test.conllu")

    base_output_path = paths["NER_DATA_DIR"]
    OUT_FILES = [os.path.join(base_output_path, "%s.%s.bio" % (short_name, shard)) for shard in SHARDS]

    CONVERSION = { "GPE_LOC": "GPE", "GPE_ORG": "GPE" }

    for in_filename, out_filename, shard in zip(IN_FILES, OUT_FILES, SHARDS):
        in_filename = os.path.join(paths["NERBASE"], "norne", "ud", in_filename)
        if not os.path.exists(in_filename):
            raise FileNotFoundError("Could not find %s file in %s" % (shard, in_filename))

        conll_to_iob.process_conll(in_filename, out_filename, conversion=CONVERSION)

    convert_bio_to_json(base_output_path, base_output_path, short_name)

def process_ja_gsd(paths, short_name):
    """
    Convert ja_gsd from MegagonLabs

    for example, can download from https://github.com/megagonlabs/UD_Japanese-GSD/releases/tag/r2.9-NE
    """
    language, name = short_name.split("_", 1)
    assert language == 'ja'
    assert name == 'gsd'

    base_output_path = paths["NER_DATA_DIR"]
    output_files = [os.path.join(base_output_path, "%s.%s.bio" % (short_name, shard)) for shard in SHARDS]

    search_path = os.path.join(paths["NERBASE"], "ja_gsd", "UD_Japanese-GSD-r2.*-NE")
    versions = glob.glob(search_path)
    max_version = None
    base_input_path = None
    version_re = re.compile(r"GSD-r2\.([0-9]+)-NE$")

    for ver in versions:
        match = version_re.search(ver)
        if not match:
            continue
        ver_num = int(match.group(1))
        if max_version is None or ver_num > max_version:
            max_version = ver_num
            base_input_path = ver

    if base_input_path is None:
        raise FileNotFoundError("Could not find any copies of the NE conversion of ja_gsd here: {}".format(search_path))
    print("Most recent version found: {}".format(base_input_path))

    input_files = ["ja_gsd-ud-train.ne.conllu", "ja_gsd-ud-dev.ne.conllu", "ja_gsd-ud-test.ne.conllu"]

    def conversion(x):
        if x[0] == 'L':
            return 'E' + x[1:]
        if x[0] == 'U':
            return 'S' + x[1:]
        # B, I unchanged
        return x

    for in_filename, out_filename, shard in zip(input_files, output_files, SHARDS):
        in_path = os.path.join(base_input_path, in_filename)
        if not os.path.exists(in_path):
            in_spacy = os.path.join(base_input_path, "spacy", in_filename)
            if not os.path.exists(in_spacy):
                raise FileNotFoundError("Could not find %s file in %s or %s" % (shard, in_path, in_spacy))
            in_path = in_spacy

        conll_to_iob.process_conll(in_path, out_filename, conversion=conversion, allow_empty=True, attr_prefix="NE")

    convert_bio_to_json(base_output_path, base_output_path, short_name)

def process_starlang(paths, short_name):
    """
    Process a Turkish dataset from Starlang
    """
    assert short_name == 'tr_starlang'

    PIECES = ["TurkishAnnotatedTreeBank-15",
              "TurkishAnnotatedTreeBank2-15",
              "TurkishAnnotatedTreeBank2-20"]

    chunk_paths = [os.path.join(paths["CONSTITUENCY_BASE"], "turkish", piece) for piece in PIECES]
    datasets = convert_starlang_ner.read_starlang(chunk_paths)

    write_dataset(datasets, paths["NER_DATA_DIR"], short_name)

def remap_germeval_tag(tag):
    """
    Simplify tags for GermEval2014 using a simple rubric

    all tags become their parent tag
    OTH becomes MISC
    """
    if tag == "O":
        return tag
    if tag[1:5] == "-LOC":
        return tag[:5]
    if tag[1:5] == "-PER":
        return tag[:5]
    if tag[1:5] == "-ORG":
        return tag[:5]
    if tag[1:5] == "-OTH":
        return tag[0] + "-MISC"
    raise ValueError("Unexpected tag: %s" % tag)

def process_de_germeval2014(paths, short_name):
    """
    Process the TSV of the GermEval2014 dataset
    """
    in_directory = os.path.join(paths["NERBASE"], "germeval2014")
    base_output_path = paths["NER_DATA_DIR"]
    datasets = []
    for shard in SHARDS:
        in_file = os.path.join(in_directory, "NER-de-%s.tsv" % shard)
        sentences = read_tsv(in_file, 1, 2, remap_fn=remap_germeval_tag)
        datasets.append(sentences)
    tags = get_tags(datasets)
    print("Found the following tags: {}".format(sorted(tags)))
    write_dataset(datasets, base_output_path, short_name)

def process_hiner(paths, short_name):
    in_directory = os.path.join(paths["NERBASE"], "hindi", "HiNER", "data", "original")
    convert_bio_to_json(in_directory, paths["NER_DATA_DIR"], short_name, suffix="conll", shard_names=("train", "validation", "test"))

def process_hinercollapsed(paths, short_name):
    in_directory = os.path.join(paths["NERBASE"], "hindi", "HiNER", "data", "collapsed")
    convert_bio_to_json(in_directory, paths["NER_DATA_DIR"], short_name, suffix="conll", shard_names=("train", "validation", "test"))

def process_lst20(paths, short_name, include_space_char=True):
    convert_lst20.convert_lst20(paths, short_name, include_space_char)

def process_nner22(paths, short_name, include_space_char=True):
    convert_nner22.convert_nner22(paths, short_name, include_space_char)

def process_mr_l3cube(paths, short_name):
    base_output_path = paths["NER_DATA_DIR"]
    in_directory = os.path.join(paths["NERBASE"], "marathi", "MarathiNLP", "L3Cube-MahaNER", "IOB")
    input_files = ["train_iob.txt", "valid_iob.txt", "test_iob.txt"]
    input_files = [os.path.join(in_directory, x) for x in input_files]
    for input_file in input_files:
        if not os.path.exists(input_file):
            raise FileNotFoundError("Could not find the expected piece of the l3cube dataset %s" % input_file)

    datasets = [convert_mr_l3cube.convert(input_file) for input_file in input_files]
    write_dataset(datasets, base_output_path, short_name)

def process_bn_daffodil(paths, short_name):
    in_directory = os.path.join(paths["NERBASE"], "bangla", "Bengali-NER")
    out_directory = paths["NER_DATA_DIR"]
    convert_bn_daffodil.convert_dataset(in_directory, out_directory)

def process_pl_nkjp(paths, short_name):
    out_directory = paths["NER_DATA_DIR"]
    candidates = [os.path.join(paths["NERBASE"], "Polish-NKJP"),
                  os.path.join(paths["NERBASE"], "polish", "Polish-NKJP"),
                  os.path.join(paths["NERBASE"], "polish", "NKJP-PodkorpusMilionowy-1.2.tar.gz"),]
    for in_path in candidates:
        if os.path.exists(in_path):
            break
    else:
        raise FileNotFoundError("Could not find %s  Looked in %s" % (short_name, " ".join(candidates)))
    convert_nkjp.convert_nkjp(in_path, out_directory)

def process_kk_kazNERD(paths, short_name):
    in_directory = os.path.join(paths["NERBASE"], "kazakh", "KazNERD", "KazNERD")
    out_directory = paths["NER_DATA_DIR"]
    convert_kk_kazNERD.convert_dataset(in_directory, out_directory, short_name)

def process_masakhane(paths, dataset_name):
    """
    Converts Masakhane NER datasets to Stanza's .json format

    If we let N be the length of the first sentence, the NER files
    (in version 2, at least) are all of the form

    word tag
    ...
    word tag
      (blank line for sentence break)
    word tag
    ...

    Once the dataset is git cloned in $NERBASE, the directory structure is

    $NERBASE/masakhane-ner/MasakhaNER2.0/data/$lcode/{train,dev,test}.txt

    The only tricky thing here is that for some languages, we treat
    the 2 letter lcode as canonical thanks to UD, but Masakhane NER
    uses 3 letter lcodes for all languages.
    """
    language, dataset = dataset_name.split("_")
    lcode = lang_to_langcode(language)
    if lcode in two_to_three_letters:
        masakhane_lcode = two_to_three_letters[lcode]
    else:
        masakhane_lcode = lcode

    mn_directory = os.path.join(paths["NERBASE"], "masakhane-ner")
    if not os.path.exists(mn_directory):
        raise FileNotFoundError("Cannot find Masakhane NER repo.  Please check the setting of NERBASE or clone the repo to %s" % mn_directory)
    data_directory = os.path.join(mn_directory, "MasakhaNER2.0", "data")
    if not os.path.exists(data_directory):
        raise FileNotFoundError("Apparently found the repo at %s but the expected directory structure is not there - was looking for %s" % (mn_directory, data_directory))

    in_directory = os.path.join(data_directory, masakhane_lcode)
    if not os.path.exists(in_directory):
        raise UnknownDatasetError(dataset_name, "Found the Masakhane repo, but there was no %s in the repo at path %s" % (dataset_name, in_directory))
    convert_bio_to_json(in_directory, paths["NER_DATA_DIR"], "%s_masakhane" % lcode, "txt")

def process_sd_siner(paths, short_name):
    in_directory = os.path.join(paths["NERBASE"], "sindhi", "SiNER-dataset")
    if not os.path.exists(in_directory):
        raise FileNotFoundError("Cannot find SiNER checkout in $NERBASE/sindhi  Please git clone the repo in that directory")
    in_filename = os.path.join(in_directory, "SiNER-dataset.txt")
    if not os.path.exists(in_filename):
        in_filename = os.path.join(in_directory, "SiNER dataset.txt")
        if not os.path.exists(in_filename):
            raise FileNotFoundError("Found an SiNER directory at %s but the directory did not contain the dataset" % in_directory)
    convert_sindhi_siner.convert_sindhi_siner(in_filename, paths["NER_DATA_DIR"], short_name)

def process_en_worldwide_4class(paths, short_name):
    simplify_en_worldwide.main(args=['--simplify'])

    in_directory = os.path.join(paths["NERBASE"], "en_worldwide", "4class")
    out_directory = paths["NER_DATA_DIR"]

    destination_file = os.path.join(paths["NERBASE"], "en_worldwide", "en-worldwide-newswire", "regions.txt")
    prefix_map = read_prefix_file(destination_file)

    random_shuffle_by_prefixes(in_directory, out_directory, short_name, prefix_map)

def process_en_worldwide_9class(paths, short_name):
    simplify_en_worldwide.main(args=['--no_simplify'])

    in_directory = os.path.join(paths["NERBASE"], "en_worldwide", "9class")
    out_directory = paths["NER_DATA_DIR"]

    destination_file = os.path.join(paths["NERBASE"], "en_worldwide", "en-worldwide-newswire", "regions.txt")
    prefix_map = read_prefix_file(destination_file)

    random_shuffle_by_prefixes(in_directory, out_directory, short_name, prefix_map)

def process_en_ontonotes(paths, short_name):
    ner_input_path = paths['NERBASE']
    ontonotes_path = os.path.join(ner_input_path, "english", "en_ontonotes")
    ner_output_path = paths['NER_DATA_DIR']
    convert_ontonotes.process_dataset("en_ontonotes", ontonotes_path, ner_output_path)

def process_zh_ontonotes(paths, short_name):
    ner_input_path = paths['NERBASE']
    ontonotes_path = os.path.join(ner_input_path, "chinese", "zh_ontonotes")
    ner_output_path = paths['NER_DATA_DIR']
    convert_ontonotes.process_dataset(short_name, ontonotes_path, ner_output_path)

def process_en_conll03(paths, short_name):
    ner_input_path = paths['NERBASE']
    conll_path = os.path.join(ner_input_path, "english", "en_conll03")
    ner_output_path = paths['NER_DATA_DIR']
    convert_en_conll03.process_dataset("en_conll03", conll_path, ner_output_path)

def process_en_conll03_worldwide(paths, short_name):
    """
    Adds the training data for conll03 and worldwide together
    """
    print("============== Preparing CoNLL 2003 ===================")
    process_en_conll03(paths, "en_conll03")
    print("========== Preparing 4 Class Worldwide ================")
    process_en_worldwide_4class(paths, "en_worldwide-4class")
    print("============== Combined Train Data ====================")
    input_files = [os.path.join(paths['NER_DATA_DIR'], "en_conll03.train.json"),
                   os.path.join(paths['NER_DATA_DIR'], "en_worldwide-4class.train.json")]
    output_file = os.path.join(paths['NER_DATA_DIR'], "%s.train.json" % short_name)
    combine_files(output_file, *input_files)
    shutil.copyfile(os.path.join(paths['NER_DATA_DIR'], "en_conll03.dev.json"),
                    os.path.join(paths['NER_DATA_DIR'], "%s.dev.json" % short_name))
    shutil.copyfile(os.path.join(paths['NER_DATA_DIR'], "en_conll03.test.json"),
                    os.path.join(paths['NER_DATA_DIR'], "%s.test.json" % short_name))

def process_en_ontonotes_ww_multi(paths, short_name):
    """
    Combine the worldwide data with the OntoNotes data in a multi channel format
    """
    print("=============== Preparing OntoNotes ===============")
    process_en_ontonotes(paths, "en_ontonotes")
    print("========== Preparing 9 Class Worldwide ================")
    process_en_worldwide_9class(paths, "en_worldwide-9class")
    # TODO: pass in options?
    ontonotes_multitag.build_multitag_dataset(paths['NER_DATA_DIR'], short_name, True, True)

def process_en_combined(paths, short_name):
    """
    Combine WW, OntoNotes, and CoNLL into a 3 channel dataset
    """
    print("================= Preparing OntoNotes =================")
    process_en_ontonotes(paths, "en_ontonotes")
    print("========== Preparing 9 Class Worldwide ================")
    process_en_worldwide_9class(paths, "en_worldwide-9class")
    print("=============== Preparing CoNLL 03 ====================")
    process_en_conll03(paths, "en_conll03")
    build_en_combined.build_combined_dataset(paths['NER_DATA_DIR'], short_name)


def process_en_conllpp(paths, short_name):
    """
    This is ONLY a test set

    the test set has entities start with I- instead of B- unless they
    are in the middle of a sentence, but that should be fine, as
    process_tags in the NER model converts those to B- in a BIOES
    conversion
    """
    base_input_path = os.path.join(paths["NERBASE"], "acl2023_conllpp", "dataset", "conllpp.txt")
    base_output_path = paths["NER_DATA_DIR"]
    sentences = read_tsv(base_input_path, 0, 3, separator=None)
    sentences = [sent for sent in sentences if len(sent) > 1 or sent[0][0] != '-DOCSTART-']
    write_dataset([sentences], base_output_path, short_name, shard_names=["test"], shards=["test"])

def process_armtdp(paths, short_name):
    assert short_name == 'hy_armtdp'
    base_input_path = os.path.join(paths["NERBASE"], "armenian", "ArmTDP-NER")
    base_output_path = paths["NER_DATA_DIR"]
    convert_hy_armtdp.convert_dataset(base_input_path, base_output_path, short_name)
    for shard in SHARDS:
        input_filename = os.path.join(base_output_path, f'{short_name}.{shard}.tsv')
        if not os.path.exists(input_filename):
            raise FileNotFoundError('Cannot find %s component of %s in %s' % (shard, short_name, input_filename))
        output_filename = os.path.join(base_output_path, '%s.%s.json' % (short_name, shard))
        prepare_ner_file.process_dataset(input_filename, output_filename)

def process_toy_dataset(paths, short_name):
    convert_bio_to_json(os.path.join(paths["NERBASE"], "English-SAMPLE"), paths["NER_DATA_DIR"], short_name)

def process_ar_aqmar(paths, short_name):
    base_input_path = os.path.join(paths["NERBASE"], "arabic", "AQMAR", "AQMAR_Arabic_NER_corpus-1.0.zip")
    base_output_path = paths["NER_DATA_DIR"]
    convert_ar_aqmar.convert_shuffle(base_input_path, base_output_path, short_name)

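def process_en_cell(paths, short_name):
    """
    Process the Europe PMC 'Cell' annotations prepared earlier in this notebook

    Reuses the bn_daffodil converter.  The input and output directories
    are hard-coded to the Colab checkout of Stanza rather than taken from paths.
    """
    in_directory = "/content/stanza/data/"
    out_directory = "/content/stanza/data/ner"
    convert_bn_daffodil.convert_dataset(in_directory, out_directory)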

DATASET_MAPPING = {
    "ar_aqmar":          process_ar_aqmar,
    "bn_daffodil":       process_bn_daffodil,
    "da_ddt":            process_da_ddt,
    "de_germeval2014":   process_de_germeval2014,
    "en_conll03":        process_en_conll03,
    "en_conll03ww":      process_en_conll03_worldwide,
    "en_conllpp":        process_en_conllpp,
    "en_ontonotes":      process_en_ontonotes,
    "en_ontonotes-ww-multi": process_en_ontonotes_ww_multi,
    "en_combined":       process_en_combined,
    "en_worldwide-4class": process_en_worldwide_4class,
    "en_worldwide-9class": process_en_worldwide_9class,
    "fa_arman":          process_fa_arman,
    "fi_turku":          process_turku,
    "hi_hiner":          process_hiner,
    "hi_hinercollapsed": process_hinercollapsed,
    "hi_ijc":            process_ijc,
    "hu_nytk":           process_nytk,
    "hu_combined":       process_hu_combined,
    "hy_armtdp":         process_armtdp,
    "it_fbk":            process_it_fbk,
    "ja_gsd":            process_ja_gsd,
    "kk_kazNERD":        process_kk_kazNERD,
    "mr_l3cube":         process_mr_l3cube,
    "my_ucsy":           process_my_ucsy,
    "pl_nkjp":           process_pl_nkjp,
    "sd_siner":          process_sd_siner,
    "sv_suc3licensed":   process_sv_suc3licensed,
    "sv_suc3shuffle":    process_sv_suc3shuffle,
    "tr_starlang":       process_starlang,
    "th_lst20":          process_lst20,
    "th_nner22":         process_nner22,
    "zh-hans_ontonotes": process_zh_ontonotes,
    "en_cell":           process_en_cell,
}

def main(dataset_name):
    paths = default_paths.get_default_paths()
    print("Processing %s" % dataset_name)

    random.seed(1234)

    if dataset_name in DATASET_MAPPING:
        DATASET_MAPPING[dataset_name](paths, dataset_name)
    elif dataset_name in ('uk_languk', 'Ukranian_languk', 'Ukranian-languk'):
        process_languk(paths, dataset_name)
    elif dataset_name.endswith("FIRE2013") or dataset_name.endswith("fire2013"):
        process_fire_2013(paths, dataset_name)
    elif dataset_name.endswith('WikiNER'):
        process_wikiner(paths, dataset_name)
    elif dataset_name.startswith('hu_rgai'):
        process_rgai(paths, dataset_name)
    elif dataset_name.endswith("_bsnlp19"):
        process_bsnlp(paths, dataset_name)
    elif dataset_name.endswith("_nchlt"):
        process_nchlt(paths, dataset_name)
    elif dataset_name in ("nb_norne", "nn_norne"):
        process_norne(paths, dataset_name)
    elif dataset_name == 'en_sample':
        process_toy_dataset(paths, dataset_name)
    elif dataset_name.lower().endswith("_masakhane"):
        process_masakhane(paths, dataset_name)
    else:
        raise UnknownDatasetError(dataset_name, f"dataset {dataset_name} currently not handled by prepare_ner_dataset")
    print("Done processing %s" % dataset_name)

if __name__ == '__main__':
    main(sys.argv[1])


Overwriting stanza/utils/datasets/ner/prepare_ner_dataset.py


Convert the BIO files to Stanza's JSON input format.

In [None]:
! python3 stanza/utils/datasets/ner/prepare_ner_dataset.py en_cell

Processing en_cell
Converting /content/stanza/data/ner/en_cell.train.bio to /content/stanza/data/ner/en_cell.train.json
1426 examples loaded from /content/stanza/data/ner/en_cell.train.bio
Generated json file /content/stanza/data/ner/en_cell.train.json
Converting /content/stanza/data/ner/en_cell.dev.bio to /content/stanza/data/ner/en_cell.dev.json
159 examples loaded from /content/stanza/data/ner/en_cell.dev.bio
Generated json file /content/stanza/data/ner/en_cell.dev.json
Converting /content/stanza/data/ner/en_cell.test.bio to /content/stanza/data/ner/en_cell.test.json
344 examples loaded from /content/stanza/data/ner/en_cell.test.bio
Generated json file /content/stanza/data/ner/en_cell.test.json
Done processing en_cell
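
To make the conversion step less of a black box, here is a minimal sketch of a BIO-to-JSON mapping. It assumes the `.bio` files hold one `token tag` pair per line with a blank line between sentences, and that the JSON output is a list of sentences, each a list of `{"text", "ner"}` dicts. Both are assumptions worth confirming by opening the generated files. The helper name `bio_to_sentences`, the `CELL` tag name, and the sample sentences are made up for illustration and are not part of Stanza or of the real `en_cell` data.

```python
import json


def bio_to_sentences(bio_text):
    """Group 'token tag' lines into sentences; blank lines end a sentence."""
    sentences, current = [], []
    for line in bio_text.splitlines():
        pieces = line.split()
        if not pieces:
            # A blank line closes the sentence collected so far
            if current:
                sentences.append(current)
                current = []
            continue
        # First column is the token, last column is the tag
        current.append({"text": pieces[0], "ner": pieces[-1]})
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample, not taken from the real en_cell files
sample_bio = """\
Human B-CELL
T I-CELL
cells I-CELL
were O
isolated O
. O

Macrophages B-CELL
were O
stained O
. O
"""

sentences = bio_to_sentences(sample_bio)
print(len(sentences), "sentences")
print(json.dumps(sentences[1], indent=1))
```

Running the same kind of check on `en_cell.train.json` (loading it with `json.load` and counting the top-level items) should give a number comparable to the example count reported in the log above.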


In [None]:
%%writefile /content/stanza/stanza/utils/training/run_ner.py
"""
Trains or scores an NER model.

Will attempt to guess the appropriate word vector file if none is
specified, and will use the charlms specified in the resources
for a given dataset or language if possible.

Example command line:
  python3 -m stanza.utils.training.run_ner hu_combined

This script expects the prepared data to be in
  data/ner/{lang}_{dataset}.train.json, {lang}_{dataset}.dev.json, {lang}_{dataset}.test.json

If those files don't exist, it will make an attempt to rebuild them
using the prepare_ner_dataset script.  However, this will fail if the
data is not already downloaded.  More information on where to find
most of the datasets online is in that script.  Some of the datasets
have licenses which must be agreed to, so no attempt is made to
automatically download the data.
"""

import logging
import os

from stanza.models import ner_tagger
from stanza.resources.common import DEFAULT_MODEL_DIR
from stanza.utils.datasets.ner import prepare_ner_dataset
from stanza.utils.training import common
from stanza.utils.training.common import Mode, add_charlm_args, build_charlm_args, choose_charlm, find_wordvec_pretrain

from stanza.resources.default_packages import default_charlms, default_pretrains, ner_charlms, ner_pretrains

# extra arguments specific to a particular dataset
DATASET_EXTRA_ARGS = {
    "da_ddt":   [ "--dropout", "0.6" ],
    "fa_arman": [ "--dropout", "0.6" ],
    "vi_vlsp":  [ "--dropout", "0.6",
                  "--word_dropout", "0.1",
                  "--locked_dropout", "0.1",
                  "--char_dropout", "0.1" ],
    "en_cell":   [ "--max_steps", "200",
                   "--batch_size", "4"],
}

logger = logging.getLogger('stanza')

def add_ner_args(parser):
    add_charlm_args(parser)

    parser.add_argument('--use_bert', default=False, action="store_true", help='Use the default transformer for this language')


def build_pretrain_args(language, dataset, charlm="default", command_args=None, extra_args=None, model_dir=DEFAULT_MODEL_DIR):
    """
    Returns one list with the args for this language & dataset's charlm and pretrained embedding
    """
    charlm = choose_charlm(language, dataset, charlm, default_charlms, ner_charlms)
    charlm_args = build_charlm_args(language, charlm, model_dir=model_dir)

    wordvec_args = []
    if extra_args is None or '--wordvec_pretrain_file' not in extra_args:
        # will throw an error if the pretrain can't be found
        wordvec_pretrain = find_wordvec_pretrain(language, default_pretrains, ner_pretrains, dataset, model_dir=model_dir)
        wordvec_args = ['--wordvec_pretrain_file', wordvec_pretrain]

    bert_args = common.choose_transformer(language, command_args, extra_args, warn=False)

    return charlm_args + wordvec_args + bert_args


# TODO: refactor?  tagger and depparse should be pretty similar
def build_model_filename(paths, short_name, command_args, extra_args):
    short_language, dataset = short_name.split("_", 1)

    # TODO: can avoid downloading the charlm at this point, since we
    # might not even be training
    pretrain_args = build_pretrain_args(short_language, dataset, command_args.charlm, command_args, extra_args)

    dataset_args = DATASET_EXTRA_ARGS.get(short_name, [])

    train_args = ["--shorthand", short_name,
                  "--mode", "train"]
    train_args = train_args + pretrain_args + dataset_args + extra_args
    if command_args.save_name is not None:
        train_args.extend(["--save_name", command_args.save_name])
    if command_args.save_dir is not None:
        train_args.extend(["--save_dir", command_args.save_dir])
    args = ner_tagger.parse_args(train_args)
    save_name = ner_tagger.model_file_name(args)
    return save_name


# Technically NER datasets are not necessarily treebanks
# (usually not, in fact)
# However, to keep the naming consistent, we leave the
# method which does the training as run_treebank
# TODO: rename treebank -> dataset everywhere
def run_treebank(mode, paths, treebank, short_name,
                 temp_output_file, command_args, extra_args):
    ner_dir = paths["NER_DATA_DIR"]
    language, dataset = short_name.split("_", 1)

    train_file = os.path.join(ner_dir, f"{treebank}.train.json")
    dev_file   = os.path.join(ner_dir, f"{treebank}.dev.json")
    test_file  = os.path.join(ner_dir, f"{treebank}.test.json")

    # if any files are missing, try to rebuild the dataset
    # if that still doesn't work, we have to throw an error
    missing_file = [x for x in (train_file, dev_file, test_file) if not os.path.exists(x)]
    if len(missing_file) > 0:
        logger.warning(f"The data for {treebank} is missing or incomplete.  Cannot find {missing_file}  Attempting to rebuild...")
        try:
            prepare_ner_dataset.main(treebank)
        except Exception as e:
            raise FileNotFoundError(f"An exception occurred while trying to build the data for {treebank}  At least one portion of the data was missing: {missing_file}  Please correctly build these files and then try again.") from e

    pretrain_args = build_pretrain_args(language, dataset, command_args.charlm, command_args, extra_args)

    if mode == Mode.TRAIN:
        # VI example arguments:
        #   --wordvec_pretrain_file ~/stanza_resources/vi/pretrain/vtb.pt
        #   --train_file data/ner/vi_vlsp.train.json
        #   --eval_file data/ner/vi_vlsp.dev.json
        #   --lang vi
        #   --shorthand vi_vlsp
        #   --mode train
        #   --charlm --charlm_shorthand vi_conll17
        #   --dropout 0.6 --word_dropout 0.1 --locked_dropout 0.1 --char_dropout 0.1
        dataset_args = DATASET_EXTRA_ARGS.get(short_name, [])

        train_args = ['--train_file', train_file,
                      '--eval_file', dev_file,
                      '--shorthand', short_name,
                      '--mode', 'train']
        train_args = train_args + pretrain_args + dataset_args + extra_args
        logger.info("Running train step with args: {}".format(train_args))
        ner_tagger.main(train_args)

    if mode == Mode.SCORE_DEV or mode == Mode.TRAIN:
        dev_args = ['--eval_file', dev_file,
                    '--shorthand', short_name,
                    '--mode', 'predict']
        dev_args = dev_args + pretrain_args + extra_args
        logger.info("Running dev step with args: {}".format(dev_args))
        ner_tagger.main(dev_args)

    if mode == Mode.SCORE_TEST or mode == Mode.TRAIN:
        test_args = ['--eval_file', test_file,
                     '--shorthand', short_name,
                     '--mode', 'predict']
        test_args = test_args + pretrain_args + extra_args
        logger.info("Running test step with args: {}".format(test_args))
        ner_tagger.main(test_args)


def main():
    common.main(run_treebank, "ner", "nertagger", add_ner_args, ner_tagger.build_argparse(), build_model_filename=build_model_filename)

if __name__ == "__main__":
    main()



Overwriting /content/stanza/stanza/utils/training/run_ner.py
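
The only change I made to `run_ner.py` is the `en_cell` entry in `DATASET_EXTRA_ARGS`. The sketch below illustrates how such an entry ends up in the training arguments: `run_treebank` concatenates the base arguments, the dataset-specific arguments, and any extra command-line arguments, in that order. The parser here is a stand-in built with `argparse`, not the real `ner_tagger` parser, and its default values are chosen only for illustration. `build_train_args` is a hypothetical helper that mirrors the list concatenation in the script.

```python
import argparse

# Same idea as the entry added to run_ner.py
DATASET_EXTRA_ARGS = {
    "en_cell": ["--max_steps", "200", "--batch_size", "4"],
}


def build_train_args(short_name, extra_args):
    """Hypothetical helper: base args + dataset args + extra args, in that order."""
    base_args = ["--shorthand", short_name, "--mode", "train"]
    dataset_args = DATASET_EXTRA_ARGS.get(short_name, [])
    return base_args + dataset_args + list(extra_args)


# Stand-in parser; the defaults are illustrative, not Stanza's real defaults
parser = argparse.ArgumentParser()
parser.add_argument("--shorthand")
parser.add_argument("--mode")
parser.add_argument("--max_steps", type=int, default=1000)
parser.add_argument("--batch_size", type=int, default=32)

defaults = parser.parse_args(build_train_args("en_cell", []))
overridden = parser.parse_args(build_train_args("en_cell", ["--max_steps", "2000"]))

print(defaults.max_steps, defaults.batch_size)      # 200 4
print(overridden.max_steps, overridden.batch_size)  # 2000 4
```

Because `argparse` keeps the last value when a flag is repeated, anything placed in the extra arguments wins over the dataset defaults. If the training script forwards unrecognized command-line flags as extra arguments, as I understand it does, then appending something like `--max_steps 2000` to the training command should override the value set here without editing the file again.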


Call the NER training component and evaluate the results on the test set.
Given the amount of training data and the computational resources available on Colab, I set the training hyper-parameters as follows:
*   batch size = 4
*   max steps = 200

We can see that the model performance is very low. With more computational resources, we could process more data and tune the hyper-parameters.
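
A quick back-of-the-envelope calculation shows how little training these settings allow. It assumes that each training step consumes one batch of `batch_size` sentences, which is my reading of the settings rather than something verified against the trainer. The training set size comes from the conversion log above.

```python
train_sentences = 1426  # "1426 examples loaded" in the conversion log
batch_size = 4
max_steps = 200

# Assumption: one step processes one batch of batch_size sentences
sentences_seen = batch_size * max_steps
epochs_covered = sentences_seen / train_sentences

print(f"{sentences_seen} sentences seen, about {epochs_covered:.2f} epochs")
# 800 sentences seen, about 0.56 epochs
```

Under that assumption the model stops before it has seen the whole training set once, which is consistent with the low scores.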

In [None]:
! python -m stanza.utils.training.run_ner en_cell

2024-02-28 01:33:44 INFO: Training program called with:
/content/stanza/stanza/utils/training/run_ner.py en_cell
2024-02-28 01:33:44 DEBUG: en_cell: en_cell
2024-02-28 01:33:44 INFO: Using model /root/stanza_resources/en/forward_charlm/1billion.pt for forward charlm
2024-02-28 01:33:44 INFO: Using model /root/stanza_resources/en/backward_charlm/1billion.pt for backward charlm
2024-02-28 01:33:44 INFO: Using default pretrain for language, found in /root/stanza_resources/en/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-02-28 01:33:44 INFO: en_cell: saved_models/ner/en_cell_charlm_nertagger.pt does not exist, training new model
2024-02-28 01:33:44 INFO: Using model /root/stanza_resources/en/forward_charlm/1billion.pt for forward charlm
2024-02-28 01:33:44 INFO: Using model /root/stanza_resources/en/backward_charlm/1billion.pt for backward charlm
2024-02-28 01:33:44 INFO: Using default pretrain for language, found in /root/stanza_resources/en/pretra