## Creating an NER Model for the Corpus

### Loading the docs

The following code loads the description of each caste as a spaCy `Doc` object from the pickle file created earlier and keeps only the descriptions that are more than one sentence long. It then saves these, with the relevant doc ID, to a dataframe that is written out as a CSV file <span style='color:red'>(but the spaCy `Doc` objects are converted to strings in the CSV file)</span>.

In [33]:
# Importing the required libraries
import pickle
import pandas as pd

# Loading the stored file of doc entities
with open("/Users/ilamanish/Documents/dh_ism/scripts/data/descriptions_docs.pkl", "rb") as descriptions_docs:
    docs = pickle.load(descriptions_docs)

# Collecting the indices and corresponding Doc objects for all descriptions longer than one sentence (hereafter referred to as long descriptions)
long_descriptions = []
ids = []

for doc_index, doc in enumerate(docs):
    # Counting the sentences in the description
    sent_counter = sum(1 for _ in doc.sents)
    if sent_counter > 1:
        ids.append(doc_index)
        long_descriptions.append(doc)

# Saving the long descriptions and corresponding doc indices as a CSV (each Doc is written as its string form)
long_descriptions_df = pd.DataFrame({'doc_id': ids, 'Description': long_descriptions})
long_descriptions_df.to_csv('./data/long_descriptions.csv', index=False)
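
Because `to_csv` writes each `Doc` as its string form, the tokens and sentence boundaries are lost in the CSV. One possible way to keep them, sketched below, is spaCy's `DocBin` format. This is not part of the original notebook: the blank English pipeline with a `sentencizer`, the two toy texts, and the file name `long_descriptions.spacy` are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

# Assumption: a blank English pipeline with a rule-based sentencizer stands in
# for whichever pipeline produced the pickled docs
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Toy descriptions (illustrative only)
texts = [
    "A sub-division of Vāniyan. The name refers to the peculiar tāli.",
    "A single sentence.",
]
docs = list(nlp.pipe(texts))

# Keeping only descriptions longer than one sentence, with their indices
kept = [(i, doc) for i, doc in enumerate(docs) if len(list(doc.sents)) > 1]
kept_ids = [i for i, _ in kept]

# Serialising the Doc objects in spaCy's binary format
doc_bin = DocBin()
for _, doc in kept:
    doc_bin.add(doc)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "long_descriptions.spacy"  # hypothetical file name
    doc_bin.to_disk(path)
    # Reloading yields Doc objects again rather than plain strings
    restored = list(DocBin().from_disk(path).get_docs(nlp.vocab))
```

The doc IDs are kept in a separate list here (`kept_ids`), so they could still be saved to a CSV alongside the binary file.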

### Extracting Sentences

Sentences are individually extracted and saved to enable sentence-based analysis and modelling. A set of 1000 sample sentences is then extracted for manual annotation.

In [38]:
import pandas as pd
import random

sent_tokens_list = []
sent_text_list = []
sent_indice_list = []

# iterrows() yields (row index, row) pairs; the doc ID is read from the 'doc_id' column
for _, row in long_descriptions_df[['doc_id', 'Description']].iterrows():
    for sent in row['Description'].sents:
        sent_tokens_list.append(sent)
        sent_text_list.append(sent.text)
        sent_indice_list.append(row['doc_id'])

castes_dataframe = pd.read_csv('./data/castes_dataframe.csv')
# Looking up the caste name (second column) for each sentence's doc index
castes = [castes_dataframe.iloc[i, 1] for i in sent_indice_list]
    
column_names = ['Caste', 'Text Description', 'Tokenized Description']
sentences_df = pd.DataFrame(list(zip(castes, sent_text_list, sent_tokens_list)),
                  columns = column_names)

display(sentences_df)

# Save the DataFrame to a CSV file
sentences_df.to_csv('./data/all_sentences.csv', index=False)

# Extract 1000 random sentences for manual annotation
sample_sentences = random.sample(sentences_df['Text Description'].tolist(), k=1000)

# Save the sample sentences as a txt file, one sentence per line
# (UTF-8 is set explicitly because the text contains diacritics such as ā and ō)
with open('./data/sample_sentences.txt', 'w', encoding='utf-8') as file:
    for value in sample_sentences:
        file.write(str(value) + '\n')

| | Caste | Text Description | Tokenized Description |
|---|---|---|---|
| 0 | Acchu Tāli | A sub-division of Vāniyan. | (A, sub, -, division, of, Vāniyan, .) |
| 1 | Acchu Tāli | The name refers to the peculiar tāli (marriage... | (The, name, refers, to, the, peculiar, tāli, (... |
| 2 | Acchuvāru | Recorded, in the Madras Census Report, 1901, a... | (Recorded, ,, in, the, Madras, Census, Report,... |
| 3 | Acchuvāru | Treated as a sub-division of Gaudo.” | (Treated, as, a, sub, -, division, of, Gaudo, ... |
| 4 | Acchuvāru | The Acchuvārus are not Oriya people, but are a... | (The, Acchuvārus, are, not, Oriya, people, ,, ... |
| ... | ... | ... | ... |
| 41866 | Yōgi Gurukkal | It is recorded, in the Gazetteer of Malabar, t... | (It, is, recorded, ,, in, the, Gazetteer, of, ... |
| 41867 | Yōgi Gurukkal | They perform sakti pūja in their own houses, t... | (They, perform, sakti, pūja, in, their, own, h... |
| 41868 | Yōgi Gurukkal | They are celebrated sorcerers and exorcists, a... | (They, are, celebrated, sorcerers, and, exorci... |
| 41869 | Zonnala | Zonnala, or the equivalent Zonnakūti, has been... | (Zonnala, ,, or, the, equivalent, Zonnakūti, ,... |
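
`random.sample` draws a different set on every run, so the 1000 sentences sent for annotation cannot be regenerated later from the same notebook. A minimal sketch of a reproducible draw follows. The notebook itself does not seed the sampler: the seed value `42`, the helper name `sample_sentences_reproducibly`, and the toy sentences are assumptions for illustration.

```python
import random

def sample_sentences_reproducibly(sentences, k, seed=42):
    """Return k sentences drawn without replacement using a seeded generator."""
    rng = random.Random(seed)   # local generator, leaves the global random state untouched
    k = min(k, len(sentences))  # guard against corpora smaller than k
    return rng.sample(sentences, k)

# Toy stand-in for sentences_df['Text Description'].tolist()
toy_sentences = [f"Sentence {i}." for i in range(50)]

first_draw = sample_sentences_reproducibly(toy_sentences, k=10)
second_draw = sample_sentences_reproducibly(toy_sentences, k=10)
```

With the same seed and the same input list, both draws contain the same sentences in the same order.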


### Building the NER Model

#### Creating the training data

The following code loads manual annotations of **material entities** and **social relations** across 1000 sentences from the corpus, which serve as the training data. The annotation was done with tecoholic's NER Annotator: https://tecoholic.github.io/ner-annotator/.

In [1]:
# Importing the requisite libraries
import json

# Loading the annotations from the exported JSON file
# (UTF-8 is set explicitly because the sentences contain diacritics)
with open('./data/annotations.json', 'r', encoding='utf-8') as file:
    TRAIN_DATA = json.load(file)

print(len(TRAIN_DATA))
print(TRAIN_DATA[45])

1000
['The persons meanwhile, whose names have been given out by the woman as having been implicated in the offence, have to vindicate their character on pain of excommunication.', {'entities': []}]
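
Each item in `TRAIN_DATA` pairs a sentence with a dictionary whose `entities` list holds `[start, end, label]` triples. Before training, it can help to check how many annotations each label has and whether every offset falls inside its sentence. The sketch below shows one way to do this. The helper name `summarise_annotations` and the toy data are illustrative and not taken from the notebook.

```python
from collections import Counter

def summarise_annotations(train_data):
    """Count annotations per label and collect offsets that fall outside the text."""
    label_counts = Counter()
    out_of_range = []
    for index, (text, annot) in enumerate(train_data):
        for start, end, label in annot["entities"]:
            label_counts[label] += 1
            # A valid span is non-empty and lies within the sentence
            if not (0 <= start < end <= len(text)):
                out_of_range.append((index, start, end, label))
    return label_counts, out_of_range

# Toy data in the same shape as the exported annotations (illustrative only)
toy_data = [
    ["The grain is stored in a pot.",
     {"entities": [[4, 9, "MATERIAL_ENTITY"], [25, 28, "MATERIAL_ENTITY"]]}],
    ["They perform sakti pūja in their own houses.", {"entities": []}],
    ["A short text.", {"entities": [[2, 40, "MATERIAL_ENTITY"]]}],  # deliberately invalid
]

label_counts, out_of_range = summarise_annotations(toy_data)
```

Running `summarise_annotations(TRAIN_DATA)` on the real annotations would give the per-label totals for the annotated sample.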


#### Training the NER Model
Code adapted from https://ner.pythonhumanities.com/03_02_train_spacy_ner_model.html 

The following code converts the training data into the binary format required in spaCy 3. It then splits the training data into a training set and a validation set and saves these in the right formats.

In [2]:
# Importing the necessary libraries
import spacy
import warnings
from pathlib import Path

from spacy.tokens import DocBin

# Converting the training dataset into spaCy 3 binary format
def convert(lang: str, TRAIN_DATA, output_path: Path):
    nlp = spacy.blank(lang)
    db = DocBin()
    for text, annot in TRAIN_DATA:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annot["entities"]:
            span = doc.char_span(start, end, label=label)
            # Skipping entities whose character offsets do not align with token boundaries
            if span is None:
                msg = f"Skipping entity [{start}, {end}, {label}] in the following text because the character span '{doc.text[start:end]}' does not align with token boundaries:\n\n{repr(text)}\n"
                warnings.warn(msg)
            else:
                ents.append(span)
        doc.ents = ents
        db.add(doc)
    db.to_disk(output_path)

# Determining the proportion of data for the training set and validation set
train_proportion = 0.8  # Adjust this value based on your preference

# Calculating the index to split the data
split_index = int(len(TRAIN_DATA) * train_proportion)

# Splitting the data
train_data = TRAIN_DATA[:split_index]
valid_data = TRAIN_DATA[split_index:]

# Performing spaCy conversion (paths are wrapped in Path to match the function signature)
convert("en", train_data, Path("./data/train.spacy"))
convert("en", valid_data, Path("./data/valid.spacy"))
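
The `convert` function drops any annotation for which `doc.char_span` returns `None`, which happens when the annotated offsets cut through a token. The sketch below illustrates that case and one possible alternative, spaCy's `alignment_mode="expand"`, which snaps the span outward to the nearest token boundaries. This is not what the notebook does, and expanding changes the annotated text, so the results would need review. The example sentence and offsets are made up for illustration.

```python
import spacy

nlp = spacy.blank("en")

# Illustrative sentence in which the annotation covers "bullock" but the token is "bullocks"
text = "Two bullocks were given."
start, end, label = 4, 11, "MATERIAL_ENTITY"

doc = nlp.make_doc(text)

# Default (strict) alignment: offsets must match token boundaries exactly
strict_span = doc.char_span(start, end, label=label)

# Expanded alignment: the span grows to cover every token it touches
expanded_span = doc.char_span(start, end, label=label, alignment_mode="expand")

print(strict_span)
print(expanded_span.text, expanded_span.label_)
```

Here the strict call returns `None`, so `convert` would skip the entity with a warning, while the expanded call recovers the whole token.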

The following code creates a complete config.cfg file by filling in the base_config.cfg file generated with the quickstart widget in the spaCy documentation: https://spacy.io/usage/training#quickstart.

In [3]:
# Reformatting the base_config.cfg file into a properly formatted config.cfg file using spaCy

!python3 -m spacy init fill-config /Users/ilamanish/Documents/dh_ism/scripts/data/base_config.cfg data/config.cfg

✔ Auto-filled config with all values
✔ Saved config
data/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


The following code trains the NER model using the training and validation sets created earlier.

In [4]:
# Training the spaCy model

!python3 -m spacy train data/config.cfg --output ./models/material_models/output

✔ Created output directory: models/material_models/output
ℹ Saving to output directory: models/material_models/output
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

[2024-01-12 12:16:09,439] [INFO] Set up nlp object from config
[2024-01-12 12:16:09,449] [INFO] Pipeline: ['tok2vec', 'ner']
[2024-01-12 12:16:09,451] [INFO] Created vocabulary
[2024-01-12 12:16:09,451] [INFO] Finished initializing nlp object
[2024-01-12 12:16:09,997] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     39.00    0.00    0.00    0.00    0.00
  0     200         55.97   2018.32   54.13   64.17   46.81    0.54
  1     400         84.29   1000.17   
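
In the training log, `ENTS_P`, `ENTS_R` and `ENTS_F` are the entity-level precision, recall and F-score on the validation set, shown as percentages, and the F-score is the harmonic mean of the other two. The short check below recomputes the F-score from the precision and recall in the row for step 200. The function name `f_score` is illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the step-200 row of the training log
ents_p, ents_r = 64.17, 46.81
ents_f = f_score(ents_p, ents_r)

print(round(ents_f, 2))  # → 54.13, matching the ENTS_F column
```

The gap between precision (64.17) and recall (46.81) at this step means the model is missing more entities than it is mislabelling.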

### Implementing the NER Model

The following code applies the trained NER model to the descriptions in the corpus and collects the unique material entities it finds.

In [6]:
# Importing the required libraries
import pandas as pd
import spacy

# Loading the descriptions
file_path = '/Users/ilamanish/Documents/dh_ism/scripts/data/castes_dataframe.csv'
df = pd.read_csv(file_path)
df['Description'] = df['Description'].astype(str)
trained_nlp = spacy.load("./models/material_models/output/model-best") # Loading the trained model
docs_ner = list(trained_nlp.pipe(df.Description))

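# Creating a list of unique material entities
material_entities = []

def create_entities_list(docs, entity, entities_list):
    # Appending each distinct entity text with the given label, in order of first appearance
    # (the parameter is named entities_list to avoid shadowing the built-in list)
    for doc in docs:
        for ent in doc.ents:
            if ent.label_ == entity and ent.text not in entities_list:
                entities_list.append(ent.text)
    print(len(entities_list))
    display(entities_list)

create_entities_list(docs_ner, 'MATERIAL_ENTITY', material_entities)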

7143


['tāli',
 'marriage badge',
 'grain',
 'bullocks',
 'structures',
 'hand-loom',
 'dressing-bag',
 'tree',
 'food',
 'head',
 'females gōsha',
 'cup',
 'molten metal',
 'offerings',
 'liquor',
 'sacred thread',
 'sīmantam',
 'Gāyatri',
 'cadjan',
 'palm leaf',
 'umbrella',
 'nuzur',
 'gift',
 'money',
 'patron',
 'fanams',
 'slave money',
 'Malayālam',
 'Sanskrit amba',
 'root āham',
 'significations',
 'earth',
 'meanings',
 'root',
 'step',
 'steps',
 'aham',
 'water',
 'cock',
 'degrees',
 'fire-pot',
 'corpse',
 'cow-dung',
 'cloth',
 'jewels',
 'flowers',
 'wall',
 'plank',
 'Betel leaves',
 'areca nuts',
 'turmeric-dyed',
 'neck',
 'conch shell',
 'musical',
 'feast',
 'planets',
 'animal',
 'horse',
 'cow',
 'elephant',
 'dog',
 'ruling planets',
 'abdomen',
 'trees',
 'idol',
 'pariyam cloth',
 'invitation',
 'pariyam',
 'bride’s money',
 'laterite earth',
 'turmeric paste',
 'paste',
 'gingelly',
 'Sesamum',
 'oil',
 'boiled rice',
 'sacred fire',
 'hōmam',
 'pot',
 'lamp',
 'b