## Install and import

In [None]:
!pip install -r requirements.txt

In [14]:
import pandas as pd
import nltk

nltk.download('wordnet')
nltk.download('punkt_tab')

import string
import contractions
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package wordnet to /Users/vika/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/vika/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Base dataset overview

For this task, we decided to create a custom dataset containing texts about the classes from the image dataset used for the classification model.  
The custom dataset is based on a GitHub dataset of fun facts about animals; its size and columns are described below.  
Link: https://github.com/ekohrt/animal-fun-facts-dataset

In [17]:
# Classes chosen for image classification; we need to check that all of them are present in the dataset for the NER task
animal_classes = ['cat', 'cow', 'dog', 'elephant', 'gorilla', 'hippo', 'lizard', 'monkey', 'mouse', 'panda', 'tiger', 'zebra']
print(len(animal_classes))

12


In [19]:
# load dataset for ner task
ff_df = pd.read_csv('animal-fun-facts-dataset.csv')  

In [29]:
# get an overview of the dataset
ff_df.describe()

Unnamed: 0,animal_name,source,text,media_link,wikipedia_link
count,7734,7734,7731,247,7522
unique,2501,3107,7622,247,2191
top,platypus,https://factanimal.com/dolphins/,"For more information about bony fishes, explor...",https://v.redd.it/tcwv55l0n41a1,/wiki/Platypus
freq,47,29,23,1,47


In [21]:
# view first 5 samples
ff_df.head()

Unnamed: 0,animal_name,source,text,media_link,wikipedia_link
0,aardvark,https://www.animalfactsencyclopedia.com/Aardva...,"Aardvarks are sometimes called ""ant bears"", ""e...",,/wiki/Aardvark
1,aardvark,https://www.animalfactsencyclopedia.com/Aardva...,Aardvarks\nhave rather primitive brains that a...,,/wiki/Aardvark
2,aardvark,https://www.animalfactsencyclopedia.com/Aardva...,Aardvarks\nteeth are lined with fine upright t...,,/wiki/Aardvark
3,aardvark,https://www.animalfactsencyclopedia.com/Aardva...,"The aardvarks Latin family name ""Tubulidentata...",,/wiki/Aardvark
4,aardvark,https://www.animalfactsencyclopedia.com/Aardva...,Baby aardvarks are born with front teeth that ...,,/wiki/Aardvark


To ensure that the dataset contains information about the animals defined in animal_classes, get all animal names from the DataFrame and check their intersection with animal_classes.

In [23]:
names = ff_df['animal_name'].unique()

In [25]:
# keep only the classes that are present in both the NER dataset and animal_classes
res = [value for value in animal_classes if value in names]
print(res)
len(res)

['cat', 'cow', 'dog', 'elephant', 'gorilla', 'hippo', 'lizard', 'monkey', 'mouse', 'panda', 'tiger', 'zebra']


12

The animal names intersect perfectly: all 12 classes are present in the chosen dataset.
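
The same check can also be phrased as a set difference, which reports directly which classes are missing. This is a minimal sketch with hypothetical sample values; `sample_classes` and `sample_names` are made-up stand-ins for `animal_classes` and `names`, not data from the dataset.

```python
# Hypothetical stand-ins for animal_classes and ff_df['animal_name'].unique()
sample_classes = ['cat', 'cow', 'dog']
sample_names = ['aardvark', 'cat', 'dog']

# Classes requested for image classification but absent from the text dataset
missing = sorted(set(sample_classes) - set(sample_names))
print(missing)
```

An empty list would mean that every class is covered, which is the situation reported above for the real data.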

## Custom dataset creation

Now we need to create a custom dataset with data suitable for our task. 

In [165]:
# sentences will be used to create tokens and ner_tags, so keep only rows whose ff_df['text'] value is a string
ff_df = ff_df[ff_df["text"].apply(lambda x: isinstance(x, str))]

Our custom dataset will have 3 columns, each holding an array per row: words, ner_tags, and bio (the BIO tags).
Before being added to the dataset, each text from the base dataset is preprocessed:
- remove all punctuation and expand contractions
- lowercase all remaining words and replace them with their lemmas for better performance
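
The punctuation and lowercasing steps can be illustrated with the standard library alone. The sketch below is an assumption-laden illustration, not the notebook's implementation: `drop_punctuation` is a hypothetical helper, and the token list is hand-written rather than real tokenizer output. It drops any token made up entirely of punctuation characters, so multi-character tokens such as quote marks are removed as well, while hyphenated words are kept.

```python
import string

# Set of single punctuation characters from the standard library
PUNCT = set(string.punctuation)

def drop_punctuation(tokens):
    """Hypothetical helper: remove tokens that consist only of punctuation."""
    return [t for t in tokens if not all(ch in PUNCT for ch in t)]

# Hand-written tokens for illustration (not actual tokenizer output)
tokens = ["Aardvarks", "are", "called", "``", "ant", "bears", "''", ",", "all-fours", "."]
cleaned = [t.lower() for t in drop_punctuation(tokens)]
print(cleaned)
```

Checking every character against a set differs from a substring test against `string.punctuation`, which only matches tokens that happen to appear as a contiguous run inside that string.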

In [167]:
lemmatizer = WordNetLemmatizer()

unique_labels = ["O", "B-AN", "I-AN"]
label2id = {k: v for v, k in enumerate(unique_labels)}

# convert sentence into array of word lemmas, without punctuation and with expanded contractions
def process_tokens(sentence):
    sentence = contractions.fix(sentence)
    tokens = word_tokenize(sentence)
    tokens = [token for token in tokens if token not in string.punctuation]
    tokens = [token.lower() for token in tokens]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens, ' '.join(tokens)

# check whether any of the target animals is mentioned among the tokens
def contains_animal(tokens):
    return any(token in animal_classes for token in tokens)

# create array of ner tags: 1 if the token is one of the target animal classes, 0 otherwise
def create_ner_tags(tokens):
    ner_tags = []
    for token in tokens:
        if token in animal_classes:
            ner_tags.append(1)
        else:
            ner_tags.append(0)
    ner_tags, bio_tags = bio_tagger(ner_tags)
    return ner_tags, bio_tags

# convert binary flags into BIO tags and their label ids (O=0, B-AN=1, I-AN=2)
def bio_tagger(ner_tags):
    bio_tagged = []
    prev_tag = 0
    for tag in ner_tags:
        if tag == 0:
            bio_tagged.append('O')
        elif prev_tag == 0:
            bio_tagged.append('B-AN')
        else:
            bio_tagged.append('I-AN')
        prev_tag = tag
    # map each BIO tag to its id so that ner_tags and bio stay consistent
    ner_tagged = [label2id[bio] for bio in bio_tagged]
    return ner_tagged, bio_tagged

# create empty df for new training data
train_df = pd.DataFrame(columns=['words','ner_tags', 'bio'])

# fill df
for index, row in ff_df.iterrows():
    text = row['text']
    processed_tokens, sentence = process_tokens(text)
    if contains_animal(processed_tokens):
        ner_tags, bio_tags = create_ner_tags(processed_tokens)
        train_df.loc[len(train_df)] = [processed_tokens, ner_tags, bio_tags]

The custom dataset is now created.
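
To make the BIO scheme concrete, here is a small self-contained sketch on toy input. The names `bio_tags`, `LABELS` and `LABEL2ID` are hypothetical and chosen for illustration; the sketch follows the same idea as the cell above (first matching token gets `B-AN`, an immediately following match gets `I-AN`, everything else `O`) but is not the notebook's own code.

```python
LABELS = ["O", "B-AN", "I-AN"]
LABEL2ID = {label: idx for idx, label in enumerate(LABELS)}

def bio_tags(tokens, targets):
    """Hypothetical helper: tag tokens found in `targets` with B-AN / I-AN."""
    tags = []
    previous_is_target = False
    for token in tokens:
        is_target = token in targets
        if not is_target:
            tags.append("O")
        elif previous_is_target:
            tags.append("I-AN")   # continues a run of target tokens
        else:
            tags.append("B-AN")   # starts a new run
        previous_is_target = is_target
    return tags

tokens = ["the", "mandrill", "is", "the", "largest", "monkey", "specie"]
tags = bio_tags(tokens, {"monkey", "dog"})
ids = [LABEL2ID[tag] for tag in tags]
print(tags)
print(ids)
```

With this scheme, `I-AN` only appears when two target tokens are adjacent, for example `["cat", "dog"]`.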

In [42]:
# get dataset size
len(train_df)

NameError: name 'train_df' is not defined

In [None]:
# get an overview of the dataset
ff_df.describe()

In [169]:
# view first 5 rows
train_df.head()

Unnamed: 0,words,ner_tags,bio
0,"[wild, dog, are, known, by, many, different, n...","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[O, B-AN, O, O, O, O, O, O, O, O, B-AN, O, O, ..."
1,"[wild, dog, do, not, use, a, kill, bite, when,...","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[O, B-AN, O, O, O, O, O, O, O, O, O, O, O, O, ..."
2,"[the, african, wild, dog, is, the, second, lar...","[0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]","[O, O, O, B-AN, O, O, O, O, B-AN, O, O, O, O, O]"
3,"[the, mandrill, is, the, largest, monkey, specie]","[0, 0, 0, 0, 0, 1, 0]","[O, O, O, O, O, B-AN, O]"
4,"[baboon, walk, on, all-fours, like, a, dog]","[0, 0, 0, 0, 0, 0, 1]","[O, O, O, O, O, O, B-AN]"


Save the custom dataset for further use

In [175]:
# save df
train_df.to_csv('ner_dataset.csv')
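
One thing to keep in mind when reloading the file: CSV has no list type, so list-valued columns such as `words`, `ner_tags` and `bio` are stored as their string representation. The sketch below shows, with the standard library only and a hypothetical in-memory row, how such strings could be parsed back into lists with `ast.literal_eval`. With pandas, the same idea could presumably be applied through the `converters` argument of `pd.read_csv`.

```python
import ast
import csv
import io

# Hypothetical row shaped like train_df; lists are written as their string form
rows = [
    {"words": str(["wild", "dog"]), "ner_tags": str([0, 1]), "bio": str(["O", "B-AN"])},
]

# Write to an in-memory buffer instead of a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["words", "ner_tags", "bio"])
writer.writeheader()
writer.writerows(rows)

# Read back and turn each stringified list into a real Python list
buffer.seek(0)
restored = [
    {column: ast.literal_eval(value) for column, value in row.items()}
    for row in csv.DictReader(buffer)
]
print(restored[0]["words"])
```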