# Data Preparation for Flair

Flair has no built-in loader for our NER dataset, so we need to convert it to Flair's column format: one token per line, with its tags in whitespace-separated columns and an empty line between sentences ([more info](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md#reading-your-own-sequence-labeling-dataset)).
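
To make the target layout concrete, here is a minimal sketch that turns token rows into column-format lines, with one token per line and an empty line between sentences. It is an illustration only: `to_flair_lines` and `sample_rows` are hypothetical names, and the rows are assumed to be (sentence marker, word, POS, tag) tuples in which the marker is set only on the first token of each sentence.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Assumed row layout: (sentence marker, word, POS tag, NER tag)
Row = Tuple[Optional[str], str, str, str]


def to_flair_lines(rows: Iterable[Row]) -> Iterator[str]:
    """Yield column-format lines, with an empty line before each new sentence."""
    first = True
    for marker, word, pos, tag in rows:
        # A non-empty marker means this row starts a new sentence
        if marker and not first:
            yield ''
        first = False
        yield f'{word} {pos} {tag}'


# Toy rows for illustration (not taken from the dataset)
sample_rows = [
    ('Sentence: 1', 'They', 'PRP', 'O'),
    (None, 'marched', 'VBD', 'O'),
    (None, '.', '.', 'O'),
    ('Sentence: 2', 'Police', 'NNS', 'O'),
    (None, 'agreed', 'VBD', 'O'),
    (None, '.', '.', 'O'),
]
flair_lines = list(to_flair_lines(sample_rows))
print('\n'.join(flair_lines))
```

Because the separator is emitted as its own line, no token row has to be overwritten. With pandas, missing markers are `NaN` (which is truthy), so the check would need `pd.notna(marker)` instead of a plain truth test.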

Create text files for train, validation, and test sets:

In [50]:
import pandas as pd
import numpy as np

# Load the dataset
ner_dataset = pd.read_csv('/Users/julia/Datasets/ner_corpus/ner_dataset.csv', 
    encoding='latin1', on_bad_lines='warn', low_memory=False)

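# Flair expects an empty line between sentences. Blank out the row just before each
# sentence start, i.e. the last token of the previous sentence (usually its final
# punctuation mark), so that it becomes the separator. The overwritten token is lost.
indices_sent_end = ner_dataset[ner_dataset['Sentence #'].notnull()].index - 1
# Skip the first index (-1), which precedes the first sentence
ner_dataset.loc[indices_sent_end[1:]] = ['', '', '', '']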

# Remove the 'Sentence #' column, to comply with Flair's format
ner_dataset = ner_dataset.drop(labels='Sentence #', axis=1)

# Train-validation-test split (roughly 80/10/10 by rows; the cut indices are hard-coded for this dataset)
train_df = ner_dataset[:838862]
valid_df = ner_dataset[838862:943743]
test_df = ner_dataset[943743:]
print(f'Splits: Train {len(train_df)/len(ner_dataset)*100:.0f}%, Validation {len(valid_df)/len(ner_dataset)*100:.0f}%, Test {len(test_df)/len(ner_dataset)*100:.0f}%')

# Save to text files
np.savetxt('/Users/julia/Datasets/ner_corpus/ner_flair_train.txt', train_df.values, fmt='%s')
np.savetxt('/Users/julia/Datasets/ner_corpus/ner_flair_valid.txt', valid_df.values, fmt='%s')
np.savetxt('/Users/julia/Datasets/ner_corpus/ner_flair_test.txt', test_df.values, fmt='%s')

Splits: Train 80%, Validation 10%, Test 10%
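
The cut points above are fixed row numbers. One possible way to derive them, sketched here under assumptions, is to snap each target fraction to the next row that starts a sentence, so that no sentence is split across two files. `sentence_aligned_cuts` and the toy values are hypothetical and only illustrate the idea.

```python
from bisect import bisect_left
from typing import List, Sequence


def sentence_aligned_cuts(sentence_starts: Sequence[int], n_rows: int,
                          fractions: Sequence[float] = (0.8, 0.9)) -> List[int]:
    """For each fraction, return the first sentence-start row at or after it."""
    cuts = []
    for frac in fractions:
        target = int(n_rows * frac)
        # sentence_starts must be sorted in ascending order
        pos = bisect_left(sentence_starts, target)
        # If no sentence starts at or after the target, cut at the end
        cuts.append(sentence_starts[pos] if pos < len(sentence_starts) else n_rows)
    return cuts


# Toy example: 50 rows, sentences starting at these row indices
toy_starts = [0, 4, 9, 15, 18, 25, 31, 36, 42, 47]
cuts = sentence_aligned_cuts(toy_starts, n_rows=50)
print(cuts)  # [42, 47]
```

On the real data, `sentence_starts` could be taken from the index of the rows where `'Sentence #'` is not null, collected before that column is dropped.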


Read the text files into Flair:

In [51]:
from flair.data import Corpus
from flair.datasets import ColumnCorpus

# Columns in the text files
columns = {0: 'text', 1: 'pos', 2: 'ner'}

# Data folder in which train, test and dev files reside
data_folder = '/Users/julia/Datasets/ner_corpus'

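# Initialize a corpus from the column format, the data folder, and the train and dev file names.
# No test_file is passed, so the saved ner_flair_test.txt is not loaded here;
# pass test_file='ner_flair_test.txt' to include it.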
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file='ner_flair_train.txt',
                              dev_file='ner_flair_valid.txt')

2022-12-02 14:06:27,626 Reading data from /Users/julia/Datasets/ner_corpus
2022-12-02 14:06:27,627 Train: /Users/julia/Datasets/ner_corpus/ner_flair_train.txt
2022-12-02 14:06:27,627 Dev: /Users/julia/Datasets/ner_corpus/ner_flair_valid.txt
2022-12-02 14:06:27,628 Test: None
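
As a rough mental model of what reading a column file involves, the sketch below splits text into sentences on blank lines and maps each column index to a name using the same `columns` dictionary. It is a simplified stand-in for illustration, not Flair's actual implementation; `read_column_text` and `sample_text` are hypothetical, and the `B-geo`/`I-geo` tags in the sample are assumed.

```python
from typing import Dict, List


def read_column_text(text: str, columns: Dict[int, str]) -> List[List[Dict[str, str]]]:
    """Parse column-format text into sentences of {column name: value} tokens."""
    sentences, current = [], []
    for line in text.splitlines():
        # An empty or whitespace-only line closes the current sentence
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        current.append({name: fields[i] for i, name in columns.items()})
    if current:
        sentences.append(current)
    return sentences


columns = {0: 'text', 1: 'pos', 2: 'ner'}
# The separator line holds only spaces, as a row of empty strings would
# when saved with a space delimiter
sample_text = 'They PRP O\nmarched VBD O\n  \nHyde NNP B-geo\nPark NNP I-geo\n'
sentences = read_column_text(sample_text, columns)
print(len(sentences), [tok['text'] for tok in sentences[1]])
```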


Verify that the text files were correctly read in:

In [54]:
len(corpus.train), corpus.train[0].to_tagged_string('ner'), corpus.train[1].to_tagged_string('ner'), corpus.train[2].to_tagged_string('ner')

(34511,
 'Sentence: "Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as " Bush Number One Terrorist " and " Stop the Bombings ." → ["Bush"/per]',
 'Sentence: "They marched from the Houses of Parliament to a rally in Hyde Park" → ["Hyde Park"/geo]',
 'Sentence: "Police put the number of marchers at 10,000 while organizers claimed it was 1,00,000"')

In [55]:
len(corpus.dev), corpus.dev[0].to_tagged_string('ner'), corpus.dev[1].to_tagged_string('ner'), corpus.dev[2].to_tagged_string('ner')

(4778,
 'Sentence: "They were successful in getting it out of the plane and home"',
 'Sentence: "When they took it for a float on the river , they were quite surprised by a Coast Guard helicopter coming towards them" → ["Coast Guard"/org]',
 'Sentence: "It turned out that the chopper was homing in on the emergency locator that is automatically activated when the raft is inflated"')

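Looks good! The sentences and their entity spans load as expected. Note that each sentence is missing its final token (usually the period), because that row was blanked out to create the sentence separator.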
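
Spans such as `["Hyde Park"/geo]` in the output above are assembled from per-token tags. The following sketch shows one way such merging can work, assuming BIO-style tags like `B-geo` and `I-geo` (the tag column itself is not shown in this section, so that scheme is an assumption). `bio_to_spans` is a hypothetical helper, not a Flair function.

```python
from typing import List, Sequence, Tuple


def bio_to_spans(tokens: Sequence[str], tags: Sequence[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (text, label) spans."""
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition('-')
        # An I- tag with the same label continues the open span
        if prefix == 'I' and label == current_label:
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one
        if current_tokens:
            spans.append((' '.join(current_tokens), current_label))
        if prefix in ('B', 'I') and label:
            # Start a new span (a stray I- tag is treated leniently as a start)
            current_tokens, current_label = [token], label
        else:
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((' '.join(current_tokens), current_label))
    return spans


tokens = ['They', 'marched', 'to', 'Hyde', 'Park']
tags = ['O', 'O', 'O', 'B-geo', 'I-geo']
spans = bio_to_spans(tokens, tags)
print(spans)  # [('Hyde Park', 'geo')]
```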