## Dataset Creation

In [6]:
import pandas as pd
import re

`top_100_mountain_names.csv`:

- This file contains a list of the top 100 most popular mountains.
- The data was gathered from open data sources that provide information on well-known mountains around the world.

`sentences.csv`:

The `generate_sentences_with_AI` script uses OpenAI's API to generate sentences that mention mountains from the top-100 list. It first loads the mountain names from the CSV file and uses `nltk` to tokenize the generated sentences. It then uses a regular expression to check whether each sentence contains a possessive form of the mountain name. The result is saved in `sentences.csv`.
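The possessive check could look roughly like the sketch below. This is not the code from `generate_sentences_with_AI`: the function name `contains_possessive`, the pattern, and the sample sentences are assumptions made for illustration. It also assumes that sentences with possessives are dropped, which the description above does not state explicitly. One plausible reason to drop them is that a token such as `Everest's` would not match the plain mountain name during labeling.

```python
import re

def contains_possessive(sentence: str, mountain: str) -> bool:
    """Return True if `mountain` is immediately followed by an apostrophe."""
    # re.escape protects against regex metacharacters in the name.
    # The character class covers both the straight and the curly apostrophe,
    # so "Everest's" and "Alps'" style possessives are both detected.
    pattern = re.compile(rf"\b{re.escape(mountain)}['’]", re.IGNORECASE)
    return pattern.search(sentence) is not None

# Hypothetical generated sentences, invented for this example.
sentences = [
    "Mount Everest's summit is often hidden by clouds.",
    "Climbers usually reach Mount Everest in May.",
]

# Keep only sentences in which the name appears without a possessive suffix.
clean = [s for s in sentences if not contains_possessive(s, "Mount Everest")]
print(clean)
```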

Let's add labels to our sentences. We will tag mountain names as entities for training the NER model:
- `O`: the word is not part of a mountain name entity
- `B-MOUNTAIN`: the word is the beginning of a mountain name entity
- `I-MOUNTAIN`: the word is inside a mountain name entity, after its first word

We will save the result in `labeled_sentences.csv`.
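To make the tagging scheme concrete, here is how one sentence would be labeled. The sentence is invented for illustration and is not taken from `sentences.csv`.

```python
# A hypothetical tokenized sentence and the tag each token should receive.
tokens = ["We", "climbed", "Mont", "Blanc", "in", "July", "."]
tags = ["O", "O", "B-MOUNTAIN", "I-MOUNTAIN", "O", "O", "O"]

# Every token gets exactly one tag; only the two-word name is non-O.
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

The first word of the name gets `B-MOUNTAIN`, every following word of the same name gets `I-MOUNTAIN`, and everything else, including punctuation, gets `O`.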

In [None]:
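def label_ner_data(sentence, mountain):
    """Tag each token of `sentence` with O, B-MOUNTAIN, or I-MOUNTAIN."""
    # Words may contain apostrophes and hyphens; punctuation marks are separate tokens.
    token_pattern = r"[\w'-]+|[.,!?;]"
    words = re.findall(token_pattern, sentence)
    # Tokenize the mountain name the same way, so that names containing
    # punctuation still line up with the sentence tokens.
    mountain_parts = re.findall(token_pattern, mountain)
    labeled_sentence = []
    index = 0

    while index < len(words):
        # An empty name would match everywhere without advancing, so require at least one part.
        if mountain_parts and words[index:index + len(mountain_parts)] == mountain_parts:
            # First word of the name opens the entity, the rest continue it.
            for position, word in enumerate(mountain_parts):
                label = 'B-MOUNTAIN' if position == 0 else 'I-MOUNTAIN'
                labeled_sentence.append((word, label))
            index += len(mountain_parts)
        else:
            labeled_sentence.append((words[index], 'O'))
            index += 1

    return labeled_sentence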

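# Each row holds one generated sentence and the mountain it mentions.
data = pd.read_csv('datasets/sentences.csv')
labeled_data = []

for index, row in data.iterrows():
    sentence = row['sentence']
    mountain = row['mountain']
    labeled_sentence = label_ner_data(sentence, mountain)

    # Store one (sentence index, word, label) record per token.
    for word, label in labeled_sentence:
        labeled_data.append([index, word, label])

labeled_df = pd.DataFrame(labeled_data, columns=['sentence_idx', 'word', 'label'])
# Collapse the per-token rows back into one row per sentence,
# with parallel lists of tokens and tags.
grouped = labeled_df.groupby('sentence_idx').agg(
    tokens=('word', list),
    tags=('label', list)
).reset_index()

# The list columns are written to the CSV as their string representation.
grouped.to_csv('data/labeled_sentences.csv', index=False)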
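Because the `tokens` and `tags` columns hold Python lists, they end up in the CSV as strings such as `"['O', 'B-MOUNTAIN']"` and need to be parsed when the file is read back. The sketch below shows one way to do that with the standard library. The file contents in `csv_text` are a hypothetical sample that mimics the layout of `labeled_sentences.csv`, not real rows from it.

```python
import ast
import csv
import io

# Hypothetical sample in the same layout as labeled_sentences.csv.
csv_text = '''sentence_idx,tokens,tags
0,"['We', 'climbed', 'Mont', 'Blanc', '.']","['O', 'O', 'B-MOUNTAIN', 'I-MOUNTAIN', 'O']"
'''

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # literal_eval safely converts the string "['We', ...]" back into a list.
    tokens = ast.literal_eval(row["tokens"])
    tags = ast.literal_eval(row["tags"])
    rows.append((int(row["sentence_idx"]), tokens, tags))

print(rows[0])
```

When loading with pandas instead, the `converters` argument of `pd.read_csv` can apply `ast.literal_eval` to the `tokens` and `tags` columns during parsing.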