<a href="https://colab.research.google.com/github/mary-lev/NER_evaluation/blob/main/NER_Models_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Extraction and Processing
## Project Introduction
This Jupyter notebook is part of a project that analyzes the literary scene in St. Petersburg using a dataset derived from **SPbLitGuide**, a newsletter compiled by Daria Sukhovey. Published since May 1999, the newsletter is a rich source of information on literary events and on St. Petersburg's cultural life.

In [None]:
import re
import csv
import pandas as pd
from IPython.display import Code
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup
import ipywidgets as widgets
from IPython.display import display

from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/data/litgid/

Mounted at /content/drive
/content/drive/My Drive/data/litgid


## Data Source
The raw data for this project is sourced from the WordPress database containing issues of the SPbLitGuide newsletter. This data is initially structured in an XML format, with each post containing details such as event titles, dates, descriptions, and hyperlinks.

### Sample Data Structure
A typical post in the XML dataset looks like this:

In [None]:
post = "post.html"
with open(post, "r", encoding="utf-8") as f:
  text = f.read()
Code(text[:3000])
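The export file itself is not displayed in this notebook. Judging only by the tag names that the parsing code reads (`post`, `Title`, `Permalink`, `Date`, `Content`), its layout is presumably close to the miniature below. The root tag name, the URL, and all values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the export: only the tag names are taken
# from the notebook's parsing code, everything else is made up.
sample_xml = """<data>
  <post>
    <Title>SPbLitGuide 23-02-1</Title>
    <Permalink>https://example.org/2023/02/issue-1/</Permalink>
    <Date>2023-02-01</Date>
    <Content><![CDATA[<b>01.02.23 среда 19.00 Книжный клуб</b>
Презентация книги.]]></Content>
  </post>
</data>"""

root = ET.fromstring(sample_xml)
for post in root.findall("post"):
    # Each post carries issue metadata plus the raw HTML body
    print(post.find("Title").text, "|", post.find("Date").text)
    print(post.find("Content").text)
```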

## Data Processing Objectives

The primary goal is to transform this rich yet unstructured dataset into a clean, structured format suitable for analysis. Key steps in this process include:

- Parsing XML Data: Extracting relevant information from the XML structure.
- Data Cleaning: Removing unnecessary HTML tags, handling missing values, and standardizing text.
- Text Processing: Employing techniques like Regular Expressions and BeautifulSoup to parse and structure the content.
- Named Entity Recognition: Using NLP models to identify key entities such as dates, names, and places.
- Creating a Pandas DataFrame: Aggregating the cleaned data into a structured format for easy analysis and visualization.
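As a rough illustration of the text-processing step, the sketch below applies the notebook's `record_pattern` and `info_source_pattern` to a small invented post body. The sample content is an assumption, and the tag stripping uses a crude regular expression as a stand-in for BeautifulSoup so the example has no dependencies.

```python
import re

# Patterns copied from the notebook's parsing cell
record_pattern = re.compile(
    r'((?:<[^>]+>)*\d{2}\.\d{2}\.\d{2}.*?)(?=\n(?:<[^>]+>)*\d{2}\.\d{2}\.\d{2}|\Z)',
    re.DOTALL,
)
info_source_pattern = re.compile(r'\[(.*?)\]')

# Invented sample: two event records, each starting with a dd.mm.yy date
content = (
    "<b>01.02.23 среда 19.00 Книжный клуб</b>\n"
    "Презентация книги. [организаторы]\n"
    "<strong>02.02.23 четверг 18.30 Библиотека</strong>\n"
    "Вечер поэзии."
)

records = record_pattern.findall(content)
for record in records:
    plain = re.sub(r'<[^>]+>', '', record)  # crude stand-in for BeautifulSoup
    source = info_source_pattern.search(plain)
    print(repr(plain), "->", source.group(1) if source else "")
```

Each record runs from one line that begins with a date to just before the next such line, which is why the lookahead in `record_pattern` tests for a newline followed by optional tags and a date.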

In [None]:
# Regular expression patterns
date_pattern = re.compile(r'^\d{2}\.\d{2}\.\d{2}', re.MULTILINE)
date_full_pattern = re.compile(r'\d{2}\.\d{2}\.\d{2} [а-яА-Я]+ \d{2}\.\d{2}')
record_pattern = re.compile(r'((?:<[^>]+>)*\d{2}\.\d{2}\.\d{2}.*?)(?=\n(?:<[^>]+>)*\d{2}\.\d{2}\.\d{2}|\Z)', re.DOTALL)
info_source_pattern = re.compile(r'\[(.*?)\]')
remove_places_pattern = re.compile(r'<b>МЕСТА</b>.*', re.DOTALL)

# Path to the XML file
xml_file = "Export-2023-November-16-2231.xml"
tree = ET.parse(xml_file)
root = tree.getroot()

events = []

# Iterate over the posts
for post in root.findall('post'):
    title = post.find("Title").text
    permalink = post.find("Permalink").text
    publication_date = post.find("Date").text
    content = post.find('Content').text.replace("<wbr />", '').replace("</em></strong><strong><em>", "")
    content = remove_places_pattern.sub('', content)

    # Find all matches of the record pattern in the content
    matches = record_pattern.findall(content)

    for match in matches:
        # Process each match as a separate record
        soup = BeautifulSoup(match, 'html.parser')
        text = soup.get_text(strip=True)

        # Check if the text starts with a date
        if date_pattern.match(text):
            info_source = ""

            tag = soup.find_all(['b', 'strong'])
            if tag:
                event_data = tag[0].get_text(strip=True)
                event_description = text.replace(event_data, "")
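            else:
                try:
                    event_data, event_description = text.split("\n", 1)
                except ValueError:
                    # No line break to split on: keep the whole text as the
                    # header so values from a previous record are never reused
                    event_data, event_description = text, ""
                    if date_full_pattern.match(text):
                        print(text)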

            # Extract info source if present
            info_source_match = info_source_pattern.search(text)
            if info_source_match:
                info_source = info_source_match.group(1)
                event_description = info_source_pattern.sub('', event_description).strip()

            events.append({
                "Event Data": event_data.replace("\n", ""),
                "Event Description": event_description.replace("\n", ""),
                "Info Source": info_source,
                "Issue": title,
                "Permalink": permalink,
                "Publication Date": publication_date,
            })

# Creating a DataFrame
df_events = pd.DataFrame(events)

df_events.to_csv("exported_from_2023_xml.csv",
                 sep='\t',  # Tab as delimiter
                 index=False,
                 encoding='utf-8',
                 quoting=csv.QUOTE_NONNUMERIC,
                 escapechar='\\',
                 )




## Named Entity Recognition with DeepPavlov

### Overview
Named Entity Recognition (NER) is utilized in this project to identify and classify key information such as names, dates, and locations from the SPbLitGuide newsletter text.

### Why DeepPavlov?
DeepPavlov, an open-source NLP library, is chosen for its proficiency with the Russian language, essential for processing our dataset. It offers pre-trained models specifically designed for NER tasks, making it well-suited for extracting relevant entities from the newsletter content.

### Implementation
In this part of the notebook, we:
- Load DeepPavlov's pre-trained NER models.
- Apply the models to extract entities from the newsletter text.
- Structure the extracted entities (e.g., names, dates) as character spans for further analysis of the literary events in St. Petersburg.

This step is crucial for transforming the raw text data into a structured format, enabling more effective analysis and insight generation.

In [None]:
!pip install -q deeppavlov
!python -m deeppavlov install ner_ontonotes_bert

In [None]:
from deeppavlov import build_model
mult_model = build_model("ner_ontonotes_bert_mult", download=True, install=True)

2024-04-20 08:32:30.928 INFO in 'deeppavlov.core.data.utils'['utils'] at line 97: Downloading from http://files.deeppavlov.ai/v1/ner/ner_ontonotes_bert_mult_torch_crf.tar.gz to /root/.deeppavlov/models/ner_ontonotes_bert_mult_torch_crf.tar.gz
2024-04-20 08:33:22.935 INFO in 'deeppavlov.core.data.utils'['utils'] at line 284: Extracting /root/.deeppavlov/models/ner_ontonotes_bert_mult_torch_crf.tar.gz archive into /root/.deeppavlov/models/ner_ontonotes_torch_bert_mult_crf

In [None]:
rus_ner_model = build_model("ner_collection3_bert", download=True, install=True)

2024-04-20 08:35:30.907 INFO in 'deeppavlov.core.data.utils'['utils'] at line 97: Downloading from http://files.deeppavlov.ai/v1/ner/ner_rus_bert_coll3_torch.tar.gz to /root/.deeppavlov/models/ner_rus_bert_coll3_torch.tar.gz
2024-04-20 08:36:22.435 INFO in 'deeppavlov.core.data.utils'['utils'] at line 284: Extracting /root/.deeppavlov/models/ner_rus_bert_coll3_torch.tar.gz archive into /root/.deeppavlov/models/ner_rus_bert_coll3_torch

In [None]:
!pip install ipymarkup
from ipymarkup import show_box_markup
from ipymarkup.palette import palette, BLUE, RED, GREEN




Load the labeled data, a doccano-annotated sample of 1,000 texts in which only person names are marked:

In [None]:
import json
data = []
with open("doccano_annotated_sample_1000_only_names.json1", "r") as f:
  d = f.readlines()
  for each in d:
    data.append(json.loads(each))
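For orientation, each line of the annotation file is presumably one JSON object with `id`, `text`, and `labels` fields, where every label is a `[start, end, type]` triple of character offsets. The record below is a sketch based on the first row of the results table later in the notebook, with the text shortened for illustration.

```python
import json

# One line in the assumed doccano JSONL layout (text shortened for illustration)
line = '{"id": 2201, "text": "Вечер поэта Томаса Венцлова (США).", "labels": [[12, 27, "PERSON"]]}'

record = json.loads(line)
for start, end, label in record["labels"]:
    # Offsets are character positions: start inclusive, end exclusive
    print(label, repr(record["text"][start:end]))
```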

In [None]:
df = pd.DataFrame.from_records(data)
df["mult_model"] = None
df["rus_ner_model"] = None
df["roberta_large"] = None
df["gpt"] = None
df["gpt4"] = None
df["spacy"] = None

In [None]:
def create_spans(text, tokens, labels):
    """
    Create spans of named entities from tokenized text and NER labels.

    Args:
        text (str): Original text.
        tokens (List[str]): List of tokens.
        labels (List[str]): Corresponding list of NER labels.

    Returns:
        List[Tuple[int, int, str]]: List of tuples (start, end, entity_type).
    """
    spans = []
    current_entity = None
    start, end = None, None

    # Calculate the start position of each token
    positions = []
    current_pos = 0
    for token in tokens:
        current_pos = text.find(token, current_pos)
        positions.append(current_pos)
        current_pos += len(token)

    for i, (token, label) in enumerate(zip(tokens, labels)):
        if label.startswith('B-'):
            # Save previous entity if it exists
            if current_entity is not None:
                spans.append((start, positions[i], current_entity))

            current_entity = label.split('-', 1)[1]  # Get entity type
            start = positions[i]

        elif label.startswith(('I-', 'E-')) and current_entity is not None:
            # Continue the current entity
            continue
        else:
            # End of the current entity. The end offset is the start of the
            # next token, so it may include one trailing whitespace character.
            if current_entity is not None:
                end = positions[i]
                spans.append((start, end, current_entity))
                current_entity = None

    # Flush an entity that runs up to the last token
    if current_entity is not None:
        spans.append((start, len(text), current_entity))

    return spans



In [None]:
def create_person_spans(text, tokens, labels):
    """
    Create spans of named entities for persons from tokenized text and NER labels.

    Args:
        text (str): Original text.
        tokens (List[str]): List of tokens.
        labels (List[str]): Corresponding list of NER labels.

    Returns:
        List[Tuple[int, int, str]]: List of tuples (start, end, 'PERSON').
    """
    spans = []
    current_entity = None
    start, end = None, None

    # Calculate the start position of each token
    positions = []
    current_pos = 0
    for token in tokens:
        current_pos = text.find(token, current_pos)
        positions.append(current_pos)
        current_pos += len(token)

    for i, (token, label) in enumerate(zip(tokens, labels)):
        if label == 'B-PERSON':
            # Save previous entity if it exists and is a person
            if current_entity == 'PERSON':
                spans.append((start, positions[i], current_entity))

            # Start a new person entity
            current_entity = 'PERSON'
            start = positions[i]

        elif label == 'I-PERSON' and current_entity == 'PERSON':
            # Continue the current person entity
            continue

        else:
            # End of the current entity if it's a person
            if current_entity == 'PERSON':
                end = positions[i] if i < len(positions) else len(text)
                spans.append((start, end, current_entity))
                current_entity = None

    # Check if the last entity is a person and needs to be added
    if current_entity == 'PERSON':
        spans.append((start, len(text), current_entity))

    return spans
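Both helpers above close a span at the start of the next token, so an end offset can include trailing whitespace. This is visible in the results table further down, where DeepPavlov yields `(12, 28, 'PER')` against the annotated `[12, 27]`. The sketch below is a hypothetical variant (`bio_to_spans` is not part of the notebook) that ends each span at the end of its last token instead. The tokens and labels are invented to mimic a model's output.

```python
from typing import List, Tuple

def bio_to_spans(text: str, tokens: List[str], labels: List[str]) -> List[Tuple[int, int, str]]:
    """Hypothetical variant of create_spans that ends a span at its last token's end."""
    spans = []
    current = None  # (start, end, entity_type) of the entity being built
    pos = 0
    for token, label in zip(tokens, labels):
        found = text.find(token, pos)
        if found == -1:
            # Token was altered by the tokenizer: close any open entity and skip it
            if current is not None:
                spans.append(current)
                current = None
            continue
        start, end = found, found + len(token)
        pos = end
        prefix, _, entity_type = label.partition('-')
        if prefix == 'B':
            if current is not None:
                spans.append(current)
            current = (start, end, entity_type)
        elif prefix in ('I', 'E') and current is not None and current[2] == entity_type:
            current = (current[0], end, entity_type)  # extend the open entity
        else:
            if current is not None:
                spans.append(current)
                current = None
    if current is not None:
        spans.append(current)  # entity that runs to the last token
    return spans

text = "Вечер поэта Томаса Венцлова (США)."
tokens = ["Вечер", "поэта", "Томаса", "Венцлова", "(", "США", ")", "."]
labels = ["O", "O", "B-PER", "I-PER", "O", "B-LOC", "O", "O"]
print(bio_to_spans(text, tokens, labels))
```

With exact token ends, the evaluation would depend less on its boundary tolerance, but the scores reported in this notebook were produced with the original helpers.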


## DeepPavlov

In [None]:
for i, row in df.iterrows():
  text = row["text"]
  try:
    # The model takes a batch of texts and returns (tokens, labels) per text
    result = rus_ner_model([text])
    tokens = result[0][0]
    labels = result[1][0]
    df.at[i, "rus_ner_model"] = create_spans(text, tokens, labels)
  except RuntimeError:
    # Leave the cell empty for texts the model fails on
    pass



In [None]:
for i, row in df[:5].iterrows():
  show_box_markup(row["text"], row["gpt4"])
  print(" ")


In [None]:
import ast

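def convert_to_list_of_tuples(data_str):
    # Values that are already lists (not yet saved to CSV) pass through
    if isinstance(data_str, list):
        return data_str
    # Missing values (None or NaN) become an empty list
    if not isinstance(data_str, str):
        return []
    try:
        return ast.literal_eval(data_str)
    except (ValueError, SyntaxError) as e:
        print(f"Error converting string to list: {e}")
        return []  # Return an empty list in case of error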

df['labels'] = df['labels'].apply(convert_to_list_of_tuples)
df['mult_model'] = df['mult_model'].apply(convert_to_list_of_tuples)
df['rus_ner_model'] = df['rus_ner_model'].apply(convert_to_list_of_tuples)
df['spacy'] = df['spacy'].apply(convert_to_list_of_tuples)
df['gpt'] = df['gpt'].apply(convert_to_list_of_tuples)
df['gpt4'] = df['gpt4'].apply(convert_to_list_of_tuples)
df["roberta_large"] = df["roberta_large"].apply(convert_to_list_of_tuples)
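The conversion is needed because span lists come back as plain strings once the frame has been written to CSV and reloaded with `pd.read_csv`. A minimal sketch of what `ast.literal_eval` does with such a cell, and with a missing one (the sample string is taken from the results table):

```python
import ast

# How a span list looks after a CSV round trip: a string, not a list
stored = "[(12, 28, 'PER'), (29, 32, 'LOC')]"
spans = ast.literal_eval(stored)
print(type(stored).__name__, "->", type(spans).__name__, spans[0])

# An empty CSV cell is read back as float NaN, which literal_eval rejects
try:
    ast.literal_eval(float("nan"))
except ValueError as exc:
    print("ValueError:", exc)
```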


In [None]:
def evaluate_ner_performance(df, ground_truth_col, model_col):
    true_positives = 0
    false_positives = 0
    false_negatives = 0

    # Iterate over each row in the DataFrame
    for index, row in df.iterrows():
        # Default to empty list if None
        ground_truth = row[ground_truth_col] if row[ground_truth_col] else []
        predictions = row[model_col] if row[model_col] else []

        # Filter for 'PERSON' entities and standardize to 'PERSON' type
        ground_truth = [(start, end, 'PERSON') for start, end, typ in ground_truth if typ in ['PERSON', 'PER']]
        predictions = [(start, end, 'PERSON') for start, end, typ in predictions if typ in ['PERSON', 'PER']]

        # Sort by the start and then end positions
        gt_set = sorted(ground_truth, key=lambda x: (x[0], x[1]))
        pred_set = sorted(predictions, key=lambda x: (x[0], x[1]))

        # Calculate matches allowing for slight boundary mismatches
        matched_gt = set()
        matched_pred = set()

        for gt in gt_set:
            for pred in pred_set:
                # Check if entities match and boundaries are within tolerance
                if gt[2] == pred[2] and abs(gt[0] - pred[0]) <= 2 and abs(gt[1] - pred[1]) <= 2:
                    matched_gt.add(gt)
                    matched_pred.add(pred)

        # Update counts
        true_positives += len(matched_gt)
        false_positives += len(pred_set) - len(matched_pred)
        false_negatives += len(gt_set) - len(matched_gt)

        # Debug output if there are mismatches
        # if len(pred_set) - len(matched_pred) > 0:
        #     print(f"False positives")
        #     print("Ground truth labels :", gt_set)
        #     show_box_markup(row["text"], gt_set)
        #     print("Model labels:", pred_set)
        #     show_box_markup(row["text"], pred_set)
        #     print("")
        # if len(gt_set) - len(matched_gt) > 0:
        #   print("False negatives")
        #   print("Ground truth labels:", gt_set)
        #   show_box_markup(row["text"], gt_set)
        #   print("Model labels:", pred_set)
        #   show_box_markup(row["text"], pred_set)
        #   print(" ")

    # Calculate precision, recall, and F1 score
    precision = true_positives / (true_positives + false_positives) if true_positives + false_positives > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if true_positives + false_negatives > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0
    print(true_positives, false_positives, false_negatives)

    return precision, recall, f1
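The matching rule inside `evaluate_ner_performance` counts a prediction as correct when both boundaries lie within two characters of an annotated span. The sketch below isolates that test; `boundaries_match` is a hypothetical helper name, and the spans come from the first row of the results table plus one invented miss.

```python
def boundaries_match(gt, pred, tolerance=2):
    """Same boundary test as in evaluate_ner_performance, for one pair of spans."""
    return abs(gt[0] - pred[0]) <= tolerance and abs(gt[1] - pred[1]) <= tolerance

ground_truth = (12, 27, "PERSON")
print(boundaries_match(ground_truth, (12, 28, "PERSON")))  # end differs by 1
print(boundaries_match(ground_truth, (12, 32, "PERSON")))  # end differs by 5
```

The tolerance absorbs off-by-one ends such as a trailing space, at the cost of also accepting predictions that clip or overrun a name by up to two characters.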


In [None]:
precision, recall, f1 = evaluate_ner_performance(df, 'labels', 'rus_ner_model')
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1 Score: {f1:.2f}")


3396 132 1805
Precision: 0.96, Recall: 0.65, F1 Score: 0.78


In [None]:
precision, recall, f1 = evaluate_ner_performance(df, 'labels', 'mult_model')
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1 Score: {f1:.2f}")


3667 216 1534
Precision: 0.94, Recall: 0.71, F1 Score: 0.81


In [None]:
precision, recall, f1 = evaluate_ner_performance(df, 'labels', 'roberta_large')
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1 Score: {f1:.2f}")

3988 341 1213
Precision: 0.92, Recall: 0.77, F1 Score: 0.84


In [None]:
precision, recall, f1 = evaluate_ner_performance(df, 'labels', 'spacy')
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1 Score: {f1:.2f}")

4206 783 995
Precision: 0.84, Recall: 0.81, F1 Score: 0.83


In [None]:
precision, recall, f1 = evaluate_ner_performance(df, 'labels', 'gpt')
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1 Score: {f1:.2f}")

327 14 4874
Precision: 0.96, Recall: 0.06, F1 Score: 0.12


In [None]:
precision, recall, f1 = evaluate_ner_performance(df, 'labels', 'gpt4')
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1 Score: {f1:.2f}")

314 2 4887
Precision: 0.99, Recall: 0.06, F1 Score: 0.11
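To compare the runs side by side, the sketch below recomputes the metrics from the counts printed above (true positives, false positives, false negatives). It also prints the number of annotated spans each run was scored against, which is 5,201 in every case. Since the `df.info()` output further down reports only 100 non-null values in the `gpt` and `gpt4` columns, their recall of 0.06 most likely reflects unprocessed rows rather than model quality alone. That reading is an inference from the outputs, not something the notebook states.

```python
# (TP, FP, FN) exactly as printed by the evaluation cells above
counts = {
    "rus_ner_model": (3396, 132, 1805),
    "mult_model": (3667, 216, 1534),
    "roberta_large": (3988, 341, 1213),
    "spacy": (4206, 783, 995),
    "gpt": (327, 14, 4874),
    "gpt4": (314, 2, 4887),
}

def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

for name, (tp, fp, fn) in counts.items():
    p, r, f1 = prf(tp, fp, fn)
    # tp + fn is the number of annotated spans the run was scored against
    print(f"{name:<14} gold={tp + fn}  P={p:.2f}  R={r:.2f}  F1={f1:.2f}")
```

A fairer comparison for the GPT models would restrict the evaluation to the rows they actually processed.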


In [None]:
# index=False avoids accumulating "Unnamed: 0" columns on each save/load cycle
df.to_csv("dataset_with_models_1000.csv", index=False)

In [None]:
df = pd.read_csv("dataset_with_models_1000.csv")
df

Unnamed: 0.1,Unnamed: 0,id,text,labels,mult_model,rus_ner_model,roberta_large,spacy,gpt,gpt4
0,0,2201,Вечер поэта Томаса Венцлова (США). Презентация...,"[[12, 27, 'PERSON']]","[(12, 28, 'PERSON'), (29, 32, 'GPE'), (53, 71,...","[(12, 28, 'PER')]","[[12, 27, 'PER'], [71, 72, 'LOC'], [75, 78, 'O...","[[12, 27, 'PER'], [29, 32, 'LOC']]","[(12, 27, 'PERSON')]","[(12, 27, 'PERSON')]"
1,1,2202,Лекция Александра Степанова «Genius loci. Смыс...,"[[7, 27, 'PERSON']]","[(7, 28, 'PERSON'), (28, 54, 'WORK_OF_ART')]","[(7, 28, 'PER')]","[[7, 27, 'PER']]","[[7, 27, 'PER']]","[(7, 27, 'PERSON')]","[(7, 27, 'PERSON')]"
2,2,3134,В магазине книг «Фаренгейт 451» писатель Роман...,"[[41, 53, 'PERSON']]","[(17, 30, 'WORK_OF_ART'), (41, 54, 'PERSON'), ...","[(41, 54, 'PER'), (197, 214, 'ORG')]","[[41, 53, 'PER'], [55, 61, 'LOC'], [197, 212, ...","[[41, 53, 'PER'], [55, 61, 'LOC'], [79, 87, 'P...","[(41, 53, 'PERSON')]","[(41, 53, 'PERSON')]"
3,3,3135,Настройки / Settings. Александр Ильянен (лекци...,"[[50, 68, 'PERSON'], [22, 39, 'PERSON']]","[(22, 40, 'PERSON'), (50, 69, 'PERSON')]","[(22, 40, 'PER'), (50, 69, 'PER')]","[[22, 39, 'PER'], [50, 68, 'PER']]","[[22, 39, 'PER'], [50, 68, 'PER']]","[(22, 39, 'PERSON'), (50, 68, 'PERSON')]","[(22, 39, 'PERSON'), (50, 68, 'PERSON')]"
4,4,2203,21 марта во Всемирный День Поэзии состоится Те...,[],"[(0, 9, 'DATE'), (12, 34, 'EVENT'), (44, 74, '...",[],"[[0, 8, 'ORG'], [12, 33, 'ORG'], [44, 584, 'OR...","[[0, 8, 'DATE'], [44, 66, 'LOC'], [67, 72, 'LO...","[(44, 72, 'PERSON')]",[]
...,...,...,...,...,...,...,...,...,...,...
995,995,3195,Творческая встреча с русским и американским ху...,"[[75, 92, 'PERSON'], [193, 207, 'PERSON'], [71...","[(21, 29, 'NORP'), (31, 44, 'NORP'), (75, 94, ...","[(75, 94, 'PER'), (193, 208, 'PER'), (391, 406...","[[75, 92, 'PER'], [107, 113, 'LOC'], [134, 143...","[[75, 92, 'PER'], [107, 113, 'LOC'], [134, 143...",,
996,996,3196,"Выставка-исследование ""Постсоветское Красное З...","[[1439, 1451, 'PERSON'], [1512, 1524, 'PERSON'...",[],[],"[[95, 116, 'LOC'], [1081, 1088, 'ORG'], [1094,...","[[23, 50, 'ORG'], [62, 75, 'ORG'], [95, 116, '...",,
997,997,3197,"Презентация книги Виктора Тихомирова ""ПРО ШПИО...","[[18, 36, 'PERSON'], [126, 142, 'PERSON']]","[(18, 37, 'PERSON'), (37, 50, 'WORK_OF_ART'), ...","[(18, 37, 'PER'), (126, 143, 'PER')]","[[18, 36, 'PER'], [126, 142, 'PER']]","[[18, 36, 'PER'], [126, 142, 'PER']]",,
998,998,3198,"На заседании клуба ""Берггассе, 19"" состоится т...","[[68, 79, 'PERSON']]","[(19, 35, 'ORG'), (68, 79, 'PERSON')]","[(20, 33, 'ORG'), (68, 79, 'PER')]","[[68, 79, 'PER']]","[[20, 29, 'ORG'], [68, 79, 'PER']]",,


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             1000 non-null   int64 
 1   text           1000 non-null   object
 2   labels         1000 non-null   object
 3   mult_model     1000 non-null   object
 4   rus_ner_model  1000 non-null   object
 5   roberta_large  1000 non-null   object
 6   spacy          1000 non-null   object
 7   gpt            100 non-null    object
 8   gpt4           100 non-null    object
dtypes: int64(1), object(8)
memory usage: 70.4+ KB


## XLM-RoBERTa Large

In [None]:
!pip install transformers torch




In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("yqelz/xml-roberta-large-ner-russian")
model = AutoModelForTokenClassification.from_pretrained("yqelz/xml-roberta-large-ner-russian")



In [None]:
def create_entity_spans(entities):
    spans = []
    current_span = None

    for entity in entities:
        # Check if it's the beginning of a new entity
        if entity['entity'].startswith('B-'):
            # If there's an active entity being built, add it before starting a new one
            if current_span:
                spans.append([current_span['start'], current_span['end'], current_span['entity']])
            # Start a new entity
            current_span = {'entity': entity['entity'][2:],  # Remove the 'B-' prefix
                            'start': entity['start'],
                            'end': entity['end']}
        elif entity['entity'].startswith('I-') and current_span:
            # Continue the entity if the types match and it's an interior token
            if current_span['entity'] == entity['entity'][2:]:
                current_span['end'] = entity['end']
        else:
            # Non-continuous entity token or different entity type, finalize the current span
            if current_span:
                spans.append([current_span['start'], current_span['end'], current_span['entity']])
                current_span = None

    # Finalize the last entity if it exists
    if current_span:
        spans.append([current_span['start'], current_span['end'], current_span['entity']])

    return spans

# Example data from the NER model's prediction
entities = [
    {'entity': 'B-PER', 'score': 0.998906, 'index': 3, 'word': '▁Александра', 'start': 18, 'end': 31},
    {'entity': 'I-PER', 'score': 0.99793005, 'index': 4, 'word': '▁Степан', 'start': 32, 'end': 39},
    {'entity': 'I-PER', 'score': 0.99669254, 'index': 5, 'word': 'ова', 'start': 39, 'end': 42},
    {'entity': 'B-PER', 'score': 0.9985, 'index': 6, 'word': '▁Елена', 'start': 34, 'end': 40},
    {'entity': 'I-PER', 'score': 0.9980, 'index': 7, 'word': '▁Литвинцева', 'start': 41, 'end': 51}
]

spans = create_entity_spans(entities)
print(spans)


[[18, 42, 'PER'], [34, 51, 'PER']]


In [None]:
from transformers import pipeline
classifier = pipeline("ner", model=model, tokenizer=tokenizer)
for i, row in df.iterrows():
  text = row["text"]
  result = classifier(text)
  spans = create_entity_spans(result)
  df.at[i, "roberta_large"] = spans
  #show_box_markup(text, spans)


## spaCy

In [None]:
!pip install https://huggingface.co/Dessan/ru_spacy_ru_updated/resolve/main/ru_spacy_ru_updated-any-py3-none-any.whl

# Load the installed pipeline by package name
# (importing ru_spacy_ru_updated and calling its load() is equivalent)
import spacy
nlp = spacy.load("ru_spacy_ru_updated")


In [None]:
for i, row in df.iterrows():
  doc = nlp(row["text"])
  entities = [[ent.start_char, ent.end_char, ent.label_] for ent in doc.ents]
  df.at[i, "spacy"] = entities


## Named Entity Recognition with GPT-4

In this section, we use OpenAI's GPT-4 models, called through the OpenAI API, as an alternative NER method.

Unlike the models above, which assign a label to every token, GPT-4 is prompted to return the list of person names found in each text. Character offsets are then recovered by searching for each name in the source text.

### Implementation Process
- **Model Utilization**: Each text from the SPbLitGuide sample is sent to the model together with a fixed prompt.
- **Entity Extraction**: The prompt asks only for the names of real people, in the form in which they appear in the text; titles of works and names of organizations are excluded.
- **Data Structuring**: The returned names are converted to `(start, end, 'PERSON')` spans and stored in the DataFrame alongside the other models' predictions.

This gives a prompt-based baseline to compare with the token-classification models.




In [None]:
!pip install openai
import openai
import ast



In [None]:
with open("prompt2.txt", "r") as f:
  prompt = f.read()

In [None]:
prompt

'You are a Named Entity Recognition (NER) system designed to process Russian text. Your task is to extract the names of real people mentioned in the text. You are given the text. Follow these steps:\n\n1. Read the text and identify the names of the people mentioned in the text. \n2. Consider only the names of people if they describe a person. We don\'t need titles of works of art or names of organizations if they contain the names of people.\n3. Return the list of these names in the same form as they appear in the text. Don\'t change the form of any name!\n4. Be sure that all occurences of the names in the text are included in the list even if they are mentioned several times. \nReturn only the list of names without any additional comments or formatting. \n\nExample 1: \nInput: text = "Встреча с писательницей Сюзанной Кулешовой Презентация книги «Последний глоток божоле на двоих». Кулешова Сюзанна Марковна, член Союза писателей Санкт-Петербурга. Закончила Горный институт, работала пале

In [None]:
from google.colab import userdata

def extract_persons(text):
    """Send `text` to the OpenAI chat API and return the raw reply string."""
    # Read the API key from Colab's secret storage
    openai.api_key = userdata.get("OPENAI_API_KEY")

    response = openai.chat.completions.create(
        #model="gpt-4",
        #model="gpt-3.5-turbo",
        model="gpt-4-turbo-2024-04-09",
        #model="gpt-4-1106-preview",
        messages=[
            {
                "role": "system",
                # Instructions and input text share a single system message
                "content": f"{prompt}. This is the text to process: {text}"
            },
        ],
        temperature=1,
        max_tokens=1000,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

    # The caller parses this reply as a list of names
    return response.choices[0].message.content
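
`extract_persons` packs both the instructions and the event text into one system message. A possible alternative, sketched below, keeps the two apart by using the system role for the instructions and the user role for the input. `build_messages` is a hypothetical helper, not part of the notebook, and the results shown in this notebook were not produced with it.

```python
def build_messages(prompt, text):
    """Build a chat message list with instructions and input kept apart.

    Hypothetical helper: the system message carries the NER instructions,
    the user message carries only the text to process.
    """
    return [
        {"role": "system", "content": prompt},
        {"role": "user", "content": text},
    ]
```

The returned list could be passed as the `messages` argument of `openai.chat.completions.create` in place of the single system message.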

In [None]:
def find_name_positions(text, names):
    results = []
    start_index = 0

    for name in names:
        start = text.find(name, start_index)
        if start != -1:
            end = start + len(name)
            results.append((start, end, 'PERSON'))
            start_index = end  # Prevent re-finding the same name

    return results

# Example usage
text = "Очередная встреча проекта «Открытая читка – юность» – литературный салон, где чтение вслух есть способ общения и творческого самовыражения отрока, тинэйджера, школьника – «Человек Читающий». Творческое объединение «Digital Chehov» в рамках проектов «Человек Читающий» и «Читающие дети» при поддержке «Открытой гостиной», издательств: «Питер», «Самокат», «Нигма», «Белая ворона» и портала «Литературно» открывает новый сезон «Открытой Читки – юность»! В этот день соберутся любители литературного чтения от 7 до 17 лет. Ребята будут читать вслух собственные сочинения и чужие произведения: проза или поэзия, «по листочку» или наизусть – выбор способа выражения остаётся за участниками. Главное – свободное объяснение со своими литературными предпочтениями. На «Читке» будем снимать видеоролики и фото для проекта «Человек Читающий». Завершится творческий вечер выбором трёх лучших чтецов: ребята получат подарки от издательств: «Питер», «Самокат», «Нигма», «Белая ворона». Все юные смельчаки гарантированно получат внимание и аплодисменты слушателей. «Открытая читка» – это не просто возможность особенным образом поделиться своими мыслями с окружающими, но и шанс найти неравнодушных к твоим интересам друзей. Читка – это способ литературной коммуникации – выход в оффлайн, возвращение к «бородатой» традиции чтения вслух перед аудиторией, которая сейчас опять становится востребованной. Куратор − Черток Анна [маяковка]"
names = ['Черток Анна']

positions = find_name_positions(text, names)
print(positions)

show_box_markup(text, positions)


[(1398, 1409, 'PERSON')]
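
`find_name_positions` moves `start_index` forward after every match, so it assumes the names arrive in the order they occur in the text; a name listed out of order is silently skipped. The sketch below is a more forgiving variant, assuming it is acceptable to search from the beginning of the text each time and to skip occurrences that overlap a span already assigned. The name `find_name_positions_any_order` is hypothetical and not part of the notebook.

```python
def find_name_positions_any_order(text, names, label="PERSON"):
    """Locate each name in text, tolerating names listed out of order.

    Hypothetical helper: repeated names are matched to successive
    occurrences; names that cannot be found are skipped.
    """
    taken = []  # (start, end) spans already assigned

    def overlaps(start, end):
        return any(start < e and s < end for s, e in taken)

    for name in names:
        if not name:
            continue
        start = text.find(name)
        # Move past occurrences that overlap a span already assigned
        while start != -1 and overlaps(start, start + len(name)):
            start = text.find(name, start + 1)
        if start != -1:
            taken.append((start, start + len(name)))

    # Return spans in text order, in the same tuple format as above
    return [(s, e, label) for s, e in sorted(taken)]
```

Because the spans are returned sorted by position, the result can be passed to `show_box_markup` in the same way as the output of `find_name_positions`.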


In [None]:
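print("NER GPT")
for i, row in df[90:100].iterrows():
    # The model reply is parsed as a Python list literal of names
    result = extract_persons(row["text"])
    result = ast.literal_eval(result)
    print(result)
    # Map each returned name back to character offsets in the source text
    spans = find_name_positions(row["text"], result)
    print(spans)
    df.at[i, "gpt4"] = spans
    show_box_markup(row["text"], spans)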

NER GPT
['Лоры Кутузовой']
[(149, 163, 'PERSON')]


['Павла Заруцкого', 'Павел Заруцкий']
[(7, 22, 'PERSON'), (186, 200, 'PERSON')]


['Вероника Капустина', 'Александр Гуревич', 'Наталья Перевезенцева', 'Татьяна Алфёрова', 'Нина Савушкина', 'Борис Григорин', 'Вадим Пугач', 'Александр Фролов']
[(62, 80, 'PERSON'), (82, 99, 'PERSON'), (101, 122, 'PERSON'), (124, 140, 'PERSON'), (142, 156, 'PERSON'), (158, 172, 'PERSON'), (174, 185, 'PERSON'), (187, 203, 'PERSON')]


['Наринэ Абгарян']
[(18, 32, 'PERSON')]


['Димы Олейника', 'Евгения Мори']
[(17, 30, 'PERSON'), (33, 45, 'PERSON')]


['Мария Амфилохиева', 'Юрий Санников']
[(102, 119, 'PERSON'), (148, 161, 'PERSON')]


['Анны Ямпольской']
[(37, 52, 'PERSON')]


['Марии Агаповой']
[(18, 32, 'PERSON')]


['Григорий Тульчинский']
[(0, 20, 'PERSON')]


['Леонида Прайсмана', 'Леонид Григорьевич Прайсман']
[(18, 35, 'PERSON'), (702, 729, 'PERSON')]
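
`ast.literal_eval` raises an exception when the reply is not a valid Python literal, for example when the model wraps the list in a code fence or adds a comment. The sketch below is a more defensive parser, assuming an empty list is an acceptable fallback for an unusable reply. `parse_name_list` is a hypothetical helper, not part of the notebook.

```python
import ast
import re

def parse_name_list(reply):
    """Parse a model reply into a list of name strings.

    Hypothetical helper: returns [] if no list literal can be recovered.
    """
    # Take the text from the first '[' to the last ']', ignoring anything around it
    match = re.search(r"\[.*\]", reply, flags=re.DOTALL)
    if match is None:
        return []
    try:
        value = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return []
    if not isinstance(value, (list, tuple)):
        return []
    # Keep only non-empty strings
    return [item for item in value if isinstance(item, str) and item.strip()]
```

In the loop above, `parse_name_list(extract_persons(row["text"]))` could stand in for the direct `ast.literal_eval` call, so that one malformed reply does not stop the whole run.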


## Create a sample

In [None]:
import json

def save_as_jsonl(df, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        for _, row in df.iterrows():
            # Create a dictionary for each row
            json_obj = {"text": row['description'], "id": row['id']}
            # Write one JSON object per line; ensure_ascii=False keeps the
            # Cyrillic text readable instead of \uXXXX escapes
            f.write(json.dumps(json_obj, ensure_ascii=False) + '\n')

# Call the function with the DataFrame and a filename
save_as_jsonl(sampled_data, 'sampled_data.jsonl')

In [None]:
df_events_export.to_json("data_export_for_doccano.json")

In [None]:
import pandas as pd

# Step 1: Load the CSV data
file_path = 'events_final.csv'  # Update this to your actual file path
df = pd.read_csv(file_path)

def simple_random_sample(df, desc_length=100, n_samples=1000):
    # Treat missing values in 'description' as empty strings and trim whitespace
    df['description'] = df['description'].fillna('').str.strip()

    # Create a new column for description length
    df['desc_length'] = df['description'].apply(len)

    # Filter data to only include non-empty descriptions longer than desc_length
    filtered_df = df[df['desc_length'] > desc_length]

    # Draw up to n_samples rows at random and renumber the index
    n_rows = min(n_samples, len(filtered_df))
    sampled_df = filtered_df.sample(n=n_rows).reset_index(drop=True)

    return sampled_df



# Sample the DataFrame using the simplified random sampling function
sampled_data = simple_random_sample(df, desc_length=50, n_samples=1000)  # Adjusted desc_length for broader inclusion
print(sampled_data)



        id                                        description   
0    14661  1000 солнц. Выступление жёстких брутальных и э...  \
1     3849  Вечер поэта Томаса Венцлова (США). Презентация...   
2     1785  Лекция Александра Степанова «Genius loci. Смыс...   
3     1462  21 марта во Всемирный День Поэзии состоится Те...   
4    10497  Встреча с историком Баиром Иринчеевым. Презент...   
..     ...                                                ...   
995   2115  Творческая встреча с русским и американским ху...   
996   2661  Выставка-исследование "Постсоветское Красное З...   
997   1453  Презентация книги Виктора Тихомирова "ПРО ШПИО...   
998   5073  На заседании клуба "Берггассе, 19" состоится т...   
999   7221  В литературном мини-отеле «Старая Вена» состои...   

                    date                                      place   
0    2001-05-02 18:00:00                             Пушкинская, 10  \
1    2002-06-13 17:00:00                            Музей Ахматовой   
2    2

In [None]:
sampled_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           1000 non-null   int64  
 1   description  1000 non-null   object 
 2   date         1000 non-null   object 
 3   place        1000 non-null   object 
 4   address      1000 non-null   object 
 5   latitude     1000 non-null   float64
 6   longitude    1000 non-null   float64
 7   desc_length  1000 non-null   int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 62.6+ KB
