<a href="https://colab.research.google.com/github/mary-lev/spblitguide/blob/main/SPbLitGuide_DataProcessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Extraction and Processing
## Project Introduction
This Jupyter notebook is part of a project that analyzes the literary scene in St. Petersburg using a dataset derived from **SPbLitGuide**, a newsletter compiled by Daria Sukhovey. Published since May 1999, the newsletter is a detailed record of the city's literary events.

In [1]:
import re
import csv
import pandas as pd
from IPython.display import Code
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup, NavigableString
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/data/litgid/

Mounted at /content/drive
/content/drive/My Drive/data/litgid


## Data Source
The raw data for this project comes from the WordPress database that hosts the issues of the SPbLitGuide newsletter, exported as XML. Each `post` element carries a `Title`, a `Permalink`, a `Date`, and a `Content` field; the event titles, dates, descriptions, and hyperlinks sit inside the HTML stored in `Content`.
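As a minimal sketch of that layout, the snippet below parses a small hand-written XML string with the same four field names the notebook reads (`Title`, `Permalink`, `Date`, `Content`). The root element name, the URL, and all values are assumptions made up for illustration; only the field names come from the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical export fragment: the values and the <data> root are invented
xml_text = """<data>
  <post>
    <Title>Sample issue</Title>
    <Permalink>https://example.org/sample-issue</Permalink>
    <Date>2023-03-01</Date>
    <Content><![CDATA[<b>01.03.23 среда 19.00</b> Поэтический вечер.]]></Content>
  </post>
</data>"""

root = ET.fromstring(xml_text)

# Turn every <post> into a plain dict; empty elements become empty strings
posts = [
    {child.tag: (child.text or "") for child in post}
    for post in root.findall("post")
]

print(posts[0]["Title"])
print(posts[0]["Content"])
```

The `Content` value still contains HTML markup at this point, which is why the later steps strip tags before any text analysis.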

### Sample Data Structure
A typical post in the XML dataset looks like this:

In [2]:
post = "post.html"
with open(post, "r") as f:
  text = f.read()
Code(text[:3000])

## Data Processing Objectives

The primary goal is to transform this rich but loosely structured dataset into a clean, tabular format suitable for analysis. Key steps in this process include:

- Parsing XML Data: Extracting relevant information from the XML structure.
- Data Cleaning: Removing unnecessary HTML tags, handling missing values, and standardizing text.
- Text Processing: Employing techniques like Regular Expressions and BeautifulSoup to parse and structure the content.
- Named Entity Recognition: Using NLP models to identify key entities such as dates, names, and places.
- Creating a Pandas DataFrame: Aggregating the cleaned data into a structured format for easy analysis and visualization.
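As a rough illustration of the parsing and cleaning steps listed above, the sketch below splits a small made-up snippet into dated records with a regular expression, strips the tags, and pulls out a bracketed information source. The sample text and the layout of `records` are assumptions for illustration, not newsletter data. The notebook itself removes tags with BeautifulSoup; a plain regex is used here only to keep the sketch dependency-free.

```python
import re

# Hypothetical two-event snippet imitating the newsletter layout (not real data)
sample = (
    "<b>01.03.23 среда 19.00 Книжный магазин</b>\n"
    "Поэтический вечер. [соб. инф.]\n"
    "<b>02.03.23 четверг 18.30 Библиотека</b>\n"
    "Лекция о прозе. [организаторы]"
)

# A record starts at a date (optionally wrapped in tags) and runs until
# the next line that starts with a date, or until the end of the text
record_re = re.compile(
    r'((?:<[^>]+>)*\d{2}\.\d{2}\.\d{2}.*?)'
    r'(?=\n(?:<[^>]+>)*\d{2}\.\d{2}\.\d{2}|\Z)',
    re.DOTALL,
)
tag_re = re.compile(r'<[^>]+>')      # crude tag stripper for this sketch
source_re = re.compile(r'\[(.*?)\]')  # information source in square brackets

records = []
for raw in record_re.findall(sample):
    plain = tag_re.sub('', raw)
    header, _, body = plain.partition('\n')
    source = source_re.search(body)
    records.append({
        "header": header.strip(),
        "description": source_re.sub('', body).strip(),
        "source": source.group(1) if source else "",
    })

for record in records:
    print(record)
```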

In [3]:
# Regular expression patterns
date_pattern = re.compile(r'^\d{2}\.\d{2}\.\d{2}', re.MULTILINE)
date_full_pattern = re.compile(r'\d{2}\.\d{2}\.\d{2} [а-яА-Я]+ \d{2}\.\d{2}')
record_pattern = re.compile(r'((?:<[^>]+>)*\d{2}\.\d{2}\.\d{2}.*?)(?=\n(?:<[^>]+>)*\d{2}\.\d{2}\.\d{2}|\Z)', re.DOTALL)
info_source_pattern = re.compile(r'\[(.*?)\]')
remove_places_pattern = re.compile(r'<b>МЕСТА</b>.*', re.DOTALL)

# Path to your XML file
xml_file = "Export-2023-November-16-2231.xml"

# Parse the XML file
tree = ET.parse(xml_file)
root = tree.getroot()

events = []
count = 0

# Iterate over the posts
for post in root.findall('post'):
    title = post.find("Title").text
    permalink = post.find("Permalink").text
    publication_date = post.find("Date").text
    # An empty <Content> element yields None, so fall back to an empty string
    content = (post.find('Content').text or '').replace("<wbr />", '').replace("</em></strong><strong><em>", "")
    content = remove_places_pattern.sub('', content)

    # Find all matches of the record pattern in the content
    matches = record_pattern.findall(content)

    for match in matches:
        # Process each match as a separate record
        soup = BeautifulSoup(match, 'html.parser')
        text = soup.get_text(strip=True)

        # Check if the text starts with a date
        if date_pattern.match(text):
            info_source = ""

            tag = soup.find_all(['b', 'strong'])
            if tag:
                # The first bold element holds the event header
                event_data = tag[0].get_text(strip=True)
                event_description = text.replace(event_data, "")
            else:
                try:
                    event_data, event_description = text.split("\n", 1)
                except ValueError:
                    # No line break to split on: report the record and skip it,
                    # so that values from the previous record are not reused
                    if date_full_pattern.match(text):
                        print(text)
                        count += 1
                    continue

            # Extract info source if present
            info_source_match = info_source_pattern.search(text)
            if info_source_match:
                info_source = info_source_match.group(1)
                event_description = info_source_pattern.sub('', event_description).strip()

            events.append({
                "Event Data": event_data.replace("\n", ""),
                "Event Description": event_description.replace("\n", ""),
                "Info Source": info_source,
                "Issue": title,
                "Permalink": permalink,
                "Publication Date": publication_date,
            })

# Creating a DataFrame
df_events = pd.DataFrame(events)

df_events.to_csv("exported_from_2023_xml.csv",
                 sep='\t',  # Tab as delimiter
                 index=False,
                 encoding='utf-8',
                 quoting=csv.QUOTE_NONNUMERIC,
                 escapechar='\\',
                 )
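The `Event Data` column keeps the event header as raw text. A minimal sketch of turning such a header into structured fields is shown below. It assumes the header follows the `DD.MM.YY weekday HH.MM` layout implied by `date_full_pattern`, optionally followed by a venue; the function name `parse_header` and the sample headers are hypothetical.

```python
import re
from datetime import datetime

# Date, Russian weekday, time, then whatever follows (assumed to be the venue)
header_re = re.compile(
    r'^(\d{2}\.\d{2}\.\d{2}) ([а-яА-ЯёЁ]+) (\d{2}\.\d{2})\s*(.*)$'
)

def parse_header(header):
    """Return a dict with start time, weekday and venue, or None if no match."""
    m = header_re.match(header)
    if not m:
        return None
    date_s, weekday, time_s, place = m.groups()
    # %y maps two-digit years 69-99 to 1969-1999 and 00-68 to 2000-2068
    start = datetime.strptime(f"{date_s} {time_s}", "%d.%m.%y %H.%M")
    return {"start": start, "weekday": weekday, "place": place}

print(parse_header("01.03.23 среда 19.00 Книжный магазин"))
print(parse_header("15.05.99 суббота 18.00"))
print(parse_header("без даты"))
```

Headers that do not fit the assumed layout come back as `None`, so they can be counted and inspected rather than silently mis-parsed.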




## Named Entity Recognition with DeepPavlov

### Overview
Named Entity Recognition (NER) is used in this project to identify and classify key information, such as names, dates, and locations, in the SPbLitGuide newsletter text.

### Why DeepPavlov?
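DeepPavlov is an open-source NLP library with good support for Russian, which is essential for our dataset. It ships pre-trained NER models, so entities can be extracted from the newsletter content without training a model from scratch. The config loaded below, `ner_ontonotes_bert`, builds on `bert-base-cased`; DeepPavlov also provides `ner_ontonotes_bert_mult` and `ner_rus_bert`, which may be worth comparing on Russian text.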

### Implementation
In this part of the notebook, we:
- Load DeepPavlov's pre-trained NER model.
- Apply the model to extract entities from the newsletter text.
- Structure the extracted entities (e.g., names, dates) for further analysis of the literary events in St. Petersburg.

This step turns the raw event descriptions into labeled tokens that can be counted, linked, and compared across issues.
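In the run recorded in this notebook, calling the model on a list of texts returns two parallel batches: one list of tokens per text and one list of BIO labels per text. The sketch below shows one way to group such labels into entity strings. The `output` literal is an abridged copy of the recorded result for the first event, and `group_entities` is a hypothetical helper, not part of DeepPavlov.

```python
# Abridged from the recorded model output: [tokens_batch, labels_batch]
output = [
    [['Дмитрий', 'Артис', ',', 'Борис', 'Кутенков', '(', 'Москва', ')']],
    [['B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-GPE', 'O']],
]

# One text was passed in, so take the first item of each batch
tokens, labels = output[0][0], output[1][0]

def group_entities(tokens, labels):
    """Join consecutive B-/I- tokens into (entity_text, entity_type) pairs."""
    entities, current, kind = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:  # a new entity starts right after another one
                entities.append((" ".join(current), kind))
            current, kind = [token], label[2:]
        elif label.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), kind))
            current, kind = [], None
    if current:  # entity that runs to the last token
        entities.append((" ".join(current), kind))
    return entities

entities = group_entities(tokens, labels)
print(entities)
```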

In [4]:
!pip install -q deeppavlov
!python -m deeppavlov install ner_ontonotes_bert


In [5]:
from deeppavlov import build_model

ner_model = build_model('ner_ontonotes_bert', download=True, install=True)

2023-11-19 18:46:13.525 INFO in 'deeppavlov.core.data.utils'['utils'] at line 95: Downloading from http://files.deeppavlov.ai/v1/ner/ner_ontonotes_bert_torch_crf.tar.gz to /root/.deeppavlov/models/ner_ontonotes_bert_torch_crf.tar.gz
100%|██████████| 1.13G/1.13G [00:39<00:00, 28.4MB/s]
2023-11-19 18:46:54.450 INFO in 'deeppavlov.core.data.utils'['utils'] at line 276: Extracting /root/.deeppavlov/models/ner_ontonotes_bert_torch_crf.tar.gz archive into /root/.deeppavlov/models/ner_ontonotes_bert_torch_crf

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['cl

In [6]:
!pip install ipymarkup
from ipymarkup import show_box_markup
from ipymarkup.palette import palette, BLUE, RED, GREEN
df_events = pd.read_csv("exported_from_2023_xml.csv",sep='\t')

Collecting ipymarkup
  Downloading ipymarkup-0.9.0-py3-none-any.whl (14 kB)
Collecting intervaltree>=3 (from ipymarkup)
  Downloading intervaltree-3.1.0.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: intervaltree
  Building wheel for intervaltree (setup.py) ... done
  Created wheel for intervaltree: filename=intervaltree-3.1.0-py2.py3-none-any.whl size=26097 sha256=70c5afc74479152a9deb08a8e5454f86e382c7546e9e5d905ea1cd0989ce69ee
  Stored in directory: /root/.cache/pip/wheels/fa/80/8c/43488a924a046b733b64de3fac99252674c892a4c3801c0a61
Successfully built intervaltree
Installing collected packages: intervaltree, ipymarkup
Successfully installed intervaltree-3.1.0 ipymarkup-0.9.0


In [7]:
for _, event in df_events[:10].iterrows():
  print(event["Event Description"])
  try:
    print(ner_model([event["Event Description"]]))
  except RuntimeError as err:
    # Report the failure instead of skipping the record silently
    print(f"NER failed: {err}")
    continue

Большой поэтический вечер в самом начале весны! Дмитрий Артис, Борис Кутенков (Москва),Дмитрий Шабанов, Рахман Кусимов, Серафима Сапрыкина, Ася Анистратенко.




[[['Большой', 'поэтический', 'вечер', 'в', 'самом', 'начале', 'весны', '!', 'Дмитрий', 'Артис', ',', 'Борис', 'Кутенков', '(', 'Москва', ')', ',', 'Дмитрий', 'Шабанов', ',', 'Рахман', 'Кусимов', ',', 'Серафима', 'Сапрыкина', ',', 'Ася', 'Анистратенко', '.']], [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-GPE', 'O', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'O']]]
"Наполеоновские войны и Кавказ - взглядом Ермолова". Читает историк, публицист, главный редактор журнала «Звезда» - Яков Аркадьевич Гордин. Герой Наполеоновских войн, проконсул Кавказа, генерал от инфантерии и артиллерии Алексей Петрович Ермолов оставил обширную переписку. Письма Ермолова — неисчерпаемый источник сведений о его взглядах и планах, которые были осуществлены или остались лишь намерениями, они дают возможность приблизиться к пониманию масштабности фигуры их автора и проникнуть в психол

In [8]:
def create_spans(text, tokens, labels):
    """
    Create character-level spans of named entities from tokens and BIO labels.

    Args:
        text (str): Original text.
        tokens (List[str]): List of tokens.
        labels (List[str]): Corresponding list of NER labels.

    Returns:
        List[Tuple[int, int, str]]: List of tuples (start, end, entity_type),
        where end is the offset just past the last token of the entity.
    """
    spans = []
    current_entity = None
    start = None

    # Locate each token in the original text, searching left to right
    positions, ends = [], []
    current_pos = 0
    for token in tokens:
        found = text.find(token, current_pos)
        if found == -1:
            # Token not present verbatim: keep the cursor where it is
            positions.append(current_pos)
            ends.append(current_pos)
            continue
        positions.append(found)
        current_pos = found + len(token)
        ends.append(current_pos)

    n = min(len(tokens), len(labels))
    for i in range(n):
        label = labels[i]
        if label.startswith('B-'):
            # Save previous entity if it exists; it ends with the previous token
            if current_entity is not None:
                spans.append((start, ends[i - 1], current_entity))

            current_entity = label.split('-', 1)[1]  # Get entity type
            start = positions[i]

        elif label.startswith('I-') and current_entity is not None:
            # Continue the current entity
            continue
        else:
            # End of the current entity
            if current_entity is not None:
                spans.append((start, ends[i - 1], current_entity))
                current_entity = None

    # Flush an entity that runs to the last token
    if current_entity is not None:
        spans.append((start, ends[n - 1], current_entity))

    return spans

In [12]:
# Example usage
text = "Большой поэтический вечер в самом начале весны! Дмитрий Артис, Борис Кутенков (Москва),Дмитрий Шабанов, Рахман Кусимов, Серафима Сапрыкина, Ася Анистратенко."
tokens = ['Большой', 'поэтический', 'вечер', 'в', 'самом', 'начале', 'весны', '!', 'Дмитрий', 'Артис', ',', 'Борис', 'Кутенков', '(', 'Москва', ')', ',', 'Дмитрий', 'Шабанов', ',', 'Рахман', 'Кусимов', ',', 'Серафима', 'Сапрыкина', ',', 'Ася', 'Анистратенко', '.']
labels = ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-GPE', 'O', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'O']

spans = create_spans(text, tokens, labels)
print(spans)

[(48, 61, 'PERSON'), (63, 77, 'PERSON'), (79, 85, 'GPE'), (87, 102, 'PERSON'), (104, 118, 'PERSON'), (120, 138, 'PERSON'), (140, 156, 'PERSON')]


In [9]:
show_box_markup(text, spans, palette=palette(BLUE))
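Before rendering, it can help to check spans by slicing the text with them, since an off-by-one offset is easy to miss in a visual display. The sketch below is a minimal check under that assumption; `span_texts` and `untrimmed` are hypothetical helper names, and the short example text is invented for illustration.

```python
def span_texts(text, spans):
    """Return the substring and label covered by each (start, end, label) span."""
    return [(text[start:end], label) for start, end, label in spans]

def untrimmed(text, spans):
    """Return spans whose substring begins or ends with whitespace."""
    return [
        (start, end, label)
        for start, end, label in spans
        if text[start:end] != text[start:end].strip()
    ]

# Hypothetical example text and spans
example_text = "Борис Кутенков (Москва)"
example_spans = [(0, 14, "PERSON"), (16, 22, "GPE")]

print(span_texts(example_text, example_spans))
print(untrimmed(example_text, [(0, 15, "PERSON")]))  # end offset one character too far
```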


In [35]:
text = 'публичная лекция цикла «Петербургский текст в пространстве современной литературы: Евгений Водолозкин, лауреат премии “Большая книга” (2013)». Читает кандидат филологических наук, доцент Санкт-Петербургского Гуманитарного университета профсоюзов Мария Дмитриевна АНДРИАНОВА.'
ner = ner_model([text])
tokens = ner[0][0]
ner_labels = ner[1][0]
ner_spans = create_spans(text, tokens, ner_labels)
show_box_markup(text, ner_spans, palette=palette(BLUE))

## Named Entity Recognition with GPT-4

In this section, we apply GPT-4, a large language model developed by OpenAI and accessed here through the `openai` Python client, to the same Named Entity Recognition (NER) task.

GPT-4 handles complex sentence structures well, which makes it a reasonable candidate for identifying and categorizing named entities in our text data.

### Implementation Process
- **Model Utilization**: We send each event text to GPT-4 together with a prompt read from `prompt1.txt` and the tokens produced by DeepPavlov.
- **Entity Extraction**: The model labels the tokens, marking entities such as names of individuals, locations, and dates relevant to the literary events.
- **Data Structuring**: The labels returned by GPT-4 are parsed into a Python list and converted into spans, so they can be integrated into our dataset.

GPT-4 thus serves as an alternative NER method whose output can be compared token by token with the DeepPavlov results.




In [11]:
!pip install openai
import openai
import ast

Collecting openai
  Downloading openai-1.3.3-py3-none-any.whl (220 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.25.1-py3-none-any.whl (75 kB)
Collecting httpcore (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.2-py3-none-any.whl (76 kB)
Installing collected packages: httpcore, httpx, openai
ERROR: pip's dependency resolver does not currently 

In [45]:
with open("prompt1.txt", "r") as f:
  prompt = f.read()

In [18]:
from google.colab import userdata
def extract_persons(text, tokens):
    openai.api_key = userdata.get("OPENAI_KEY")

    response = openai.chat.completions.create(
        model="gpt-4",
        #model="gpt-3.5-turbo",
        #model="gpt-4-1106-preview",
        messages=[
            {
                "role": "system",
                # Interpolate the function arguments, not notebook globals
                "content": f"{prompt}. This is the text to process: {text}. This is the text tokenized: {tokens}"
            },
        ],
        temperature=1,
        max_tokens=1000,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

    return (response.choices[0].message.content)

In [47]:
# Reuse the tokens produced by the DeepPavlov NER model so that both label
# sequences share one tokenization and can be compared position by position
result = extract_persons(text, tokens)
gpt_labels = ast.literal_eval(result)
print(gpt_labels)

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PERSON', 'I-PERSON', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'B-PERSON', 'I-PERSON', 'I-PERSON', 'O']
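Because the model reply is free-form text, `ast.literal_eval` can raise an error or return something other than a list of labels. The sketch below shows one defensive way to parse such a reply; `parse_labels` is a hypothetical helper, and the rule that the reply must contain exactly one label per token is an assumption about what the prompt asks for.

```python
import ast

def parse_labels(reply, n_tokens):
    """Parse a reply into a list of label strings, or return None if it is unusable."""
    try:
        labels = ast.literal_eval(reply.strip())
    except (ValueError, SyntaxError):
        return None  # reply was not a Python literal
    if not isinstance(labels, list) or len(labels) != n_tokens:
        return None  # wrong type, or not one label per token
    if not all(isinstance(label, str) for label in labels):
        return None
    return labels

print(parse_labels("['O', 'B-PERSON']", 2))
print(parse_labels("Sure! Here are the labels", 2))
print(parse_labels("['O']", 2))
```

Returning `None` instead of raising lets the caller decide whether to retry the request or skip the record.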


In [42]:
text

'публичная лекция цикла «Петербургский текст в пространстве современной литературы: Евгений Водолозкин, лауреат премии “Большая книга” (2013)». Читает кандидат филологических наук, доцент Санкт-Петербургского Гуманитарного университета профсоюзов Мария Дмитриевна АНДРИАНОВА.'

In [54]:
ner_spans = create_spans(text, tokens, ner_labels)
show_box_markup(text, ner_spans, palette=palette(BLUE))
print("-------------------------")
spans = create_spans(text, tokens, gpt_labels)
show_box_markup(text, spans, palette=palette(BLUE))

-------------------------
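Beyond the visual comparison, the two label sequences can also be compared numerically, since they share one tokenization. The sketch below computes a token-level agreement rate and lists the tokens where the two sequences differ. The function name `compare_labels` and the short label lists are hypothetical and are not results from either model.

```python
def compare_labels(tokens, labels_a, labels_b):
    """Return (agreement_rate, disagreements) for two aligned label sequences."""
    if not (len(tokens) == len(labels_a) == len(labels_b)):
        raise ValueError("tokens and both label lists must have equal length")
    disagreements = [
        (token, a, b)
        for token, a, b in zip(tokens, labels_a, labels_b)
        if a != b
    ]
    agreement = 1 - len(disagreements) / len(tokens) if tokens else 1.0
    return agreement, disagreements

# Hypothetical labels for a short token list (not actual model output)
sample_tokens = ['Читает', 'Мария', 'Дмитриевна', 'АНДРИАНОВА', '.']
labels_a = ['O', 'B-PERSON', 'I-PERSON', 'O', 'O']
labels_b = ['O', 'B-PERSON', 'I-PERSON', 'I-PERSON', 'O']

agreement, diffs = compare_labels(sample_tokens, labels_a, labels_b)
print(agreement)
print(diffs)
```

Token-level agreement is a coarse measure: a single differing token can change the boundary of a whole entity, so entity-level comparison may be worth adding on top.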
