# 1. Базовый парсер заголовков

Вытаскивает из latex-кода заголовки статей и их расположение в файлах.

Разбивка происходит в полуручном режиме, т.к. нет уверенности в формате заголовков.

В тексте ищутся слова, содержащие в своём составе заглавные буквы на русском и английском языках в отношении, большем или равным заданному (по умолчанию 0.51, при меньших значениях количество вхождений значительно возрастает, например за счёт двухбуквенных предлогов). Предполагается, что таким образом удаётся обнаруживать неправильно машинно распознанный капс. Слова или цепочки слов, состоящие из одного строчного символа включаются в заголовок, если стоят между слов, определённых как часть заголовка. При этом, одиночные заглавные буквы, а также инициалы не воспринимаются как начало заголовка.

## Использование
- При удовлетворительном определении заголовка нажать `Enter` без дополнительного ввода.
- Если предложенное место заголовком не является ввести `"n"`
- При неправильном определении границ заголовка ввести два корректировочных числа для сдвига левой и правой границы.
  - ЗАМЕЧАНИЕ: сдвиг производится попробельно, т.е. двойной пробел будет распознан как слово нулевой длины.
  - ЗАМЕЧАНИЕ: границы отображаемого фрагмента текста будут передвинуты автоматически. Длины левой и правой границ в словах задаются в параметрах.
  - ПРИМЕРЫ:
    - `out: a [B C] d e f` -> `in: 0 2` -> `out: a [B C D E] f`
    - `out: a b c [D E] f` -> `in: 2 -1` -> `out: a [B C D] e f`
- Также возможен посимвольный сдвиг правой границы в случае "сращивания" заголовка статьи и её текста. Ввести одно число, начиная с точки.
  - ПРИМЕРЫ:
    - `out: a[BC]def` -> `in: .2` -> `out: a[BCDE]f`
    - `out: a[BCDE]f` -> `in: .-1` -> `out: a[BCD]ef`

В выводе в терминале переносы строк для удобства заменены на `"$"`

### Прочее
- Для определителя капса доступны исключения, которые никогда не будут рассматриваться, как потенциальные начала заголовков, см. опции. По умолчанию: первые 10 римских цифр, "МэВ" и "ГэВ". Также определитель не реагирует на "СМ.", что часто встречается в ссылках сразу после заголовков.
- Использовать системный терминал для взаимодействия оказывается удобнее, чем использовать jupyter, поэтому рекомендуется запускать файл `base_titles_parser.py` из терминала или с использованием Python Launcher (но вы всё ещё можете запустить ячейку ниже).
- При положительном определении заголовка файл дополняется немедленно, прервать процесс можно в любой момент, как и продолжить после -- итоговый файл будет дополняться, а не перезаписываться с нуля при новом запуске программы (главное не забыть предварительно удалить из конца файла дубликаты, если вы начинаете с той страницы, на которой закончили в прошлый раз, а не со следующей).
- В случае пропуска парсером заголовка его можно добавить вручную двумя способами:
  1) Сдвинуть границы заголовка назад, как описано в инструкции выше. Подходит, если была пропущена небольшая (обычно ссылочная) статья, примерно 20 слов, плюс-минус. При этом после ввода заголовка поиск продолжится с __его__ конца, поэтому следующий заголовок "вместо" которого был введён пропущенный будет определён заново и пропущен не будет.
  2) Воспользоваться ячейкой 1.1. Для этого в сыром tex-файле страницы нужно отыскать заголовок, скопировать его и __в точности__ вставить в разделе параметров, а также указать номер страницы. Скрипт парсера при этом можно не закрывать, последующая нумерация подстроится автоматически.

In [None]:
# !pip3 uninstall enchant
# !pip3 uninstall pyenchant


In [None]:
import base_titles_parser
import importlib

importlib.reload(base_titles_parser)
base_titles_parser.run()

## 1.1. Добавление заголовков по одному

В разделе параметров указать номер страницы и ТОЧНУЮ формулировку заголовка из сырого latex-текста, а также номер страницы, после чего запустить ячейку.

Закрывать скрипт парсера не обязательно, это не вызовет ошибок и его нумерация подстроится автоматически.

In [1]:
# 1.1. Добавление заголовков по одному

from lib import *

# -------------------------- VARS --------------------------------
PAGES_DIR = GLOBAL_WORK_DIR + GLOBAL_PAGES_DIR
EXIT_DIR = GLOBAL_WORK_DIR
EXIT_FILE = SINGLE_TITLE_PARSER_OUTPUT_FILE
# Search parameters
PAGE = 146
TITLE = 'глюоний'
# ----------------------------------------------------------------



class Article:
    start_title = 0
    end_title = 0
    filename = ''


# Get filenames needed
filenames_raw = get_filenames(PAGES_DIR)
filenames = []
for i in range(PAGE, PAGE + 1):
    for filename in filenames_raw:
        beginning = "rp-" + str(i) + "_"
        if filename[:len(beginning)] == beginning and filename[-4:] == ".mmd":
            filenames.append(filename)


# Check for existing xml
filenames_raw = get_filenames(EXIT_DIR)
if not(EXIT_FILE in filenames_raw):
    root = ElementTree.Element('data')
    xml_write(root, EXIT_DIR + EXIT_FILE)


root = parse_xml(EXIT_DIR + EXIT_FILE)


# Add article title and metadata to xml tree
def add_article(article_local:Article, etree_root:ElementTree.Element, number:int):
    elem_article = ElementTree.SubElement(etree_root, 'article', {'n':str(number)})
    elem_title = ElementTree.SubElement(elem_article, 'title')
    elem_title.text = file[article_local.start_title + 1:article_local.end_title]
    elem_title_meta = ElementTree.SubElement(elem_article, 'title-meta')
    elem_title_file = ElementTree.SubElement(elem_title_meta, 'title-file')
    elem_title_file.text = article_local.filename
    elem_title_start = ElementTree.SubElement(elem_title_meta, 'title-start')
    elem_title_start.text = str(article_local.start_title + 1)
    elem_title_end = ElementTree.SubElement(elem_title_meta, 'title-end')
    elem_title_end.text = str(article_local.end_title)
    xml_write(etree_root, EXIT_DIR + EXIT_FILE)

# Read requested file
with codecs.open(PAGES_DIR + filenames[0], 'r', 'utf-8') as f:
    file = f.read()

# Find titles and add them
start_title = 0
end_title = 0
num = len(root) + 1
while file.find(TITLE, end_title) != -1:
    start_title = file.find(TITLE, start_title)
    end_title = start_title + len(TITLE)
    start_title -= 1 # Set on space before the title

    article = Article()
    article.start_title = max(start_title, 0)
    article.end_title = min(end_title, len(file))
    article.filename = filenames[0]
    add_article(article, root, num)

<module 'enchant' from 'C:\\Users\\User\\AppData\\Roaming\\Python\\Python312\\site-packages\\enchant\\__init__.py'>


# 2. Исправление ошибок в заголовках

Состоит из двух частей: "составитель пар" и "подстановщик".

## 2.1. Составитель пар "оригинальный - исправленный" для заголовков

Формирует xml-список всех заголовков с возможными автоматическими исправлениями (в формате было / стало):
1. замена латиницы на аналогичную кириллицу;
2. замена заданных буквосочетаний (см. параметры)
3. удаление обрамляющих знаков препинания;
4. замена всех букв на заглавные (в том числе это избавляет дальнейшей необходимости исправлять имена);
5. слияние разорванных на отдельные буквы слов (если рядом оказываются несколько таких слов, то они оказываются слиты вместе).

Этот список необходимо просмотреть и исправить оставшиеся ошибки.

Дополнительно, для помощи в поиске орфографических ошибок, формируется строка с изменениями, предложенными спеллчекером. ВНИМАНИЕ: спеллчекер может делать ошибки в именах, специфических терминах и т.п., поэтому следует использовать его результаты лишь для ориентира.

In [2]:
# 2.1. Составитель пар "оригинальный - исправленный" для заголовков:

from lib import *

# -------------------------- VARS --------------------------------
WORK_DIR = GLOBAL_WORK_DIR
INPUT_FILE = BASE_TITLES_PARSER_OUTPUT_FILE
CORRECTION_FILE = TITLES_CHECKER_CORRECTIONS_FILE
COMBINATIONS_CORR = dict_merge(COMBINATIONS_CORR_GLOBAL, {
    'ХК' : 'Ж',
    'ЬI' : 'Ы',
    'II' : 'Ш',
    'I' : 'П',
    'J' : 'Л',
    'ЛАГРАНХ' : 'ЛАГРАНЖ',
    'ЛАТРАНХ' : 'ЛАГРАНЖ',
})
SPELLCHECK_ONLY = False # Use if the only thing you need from this script is spellcheck
# ----------------------------------------------------------------


# Check for existing xml
filenames_raw = get_filenames(WORK_DIR)
if not(INPUT_FILE in filenames_raw):
    root = ElementTree.Element('data')
    xml_write(root, WORK_DIR + CORRECTION_FILE)


root = parse_xml(WORK_DIR + INPUT_FILE)


# Get all the titles into a dict
titles_dict = {}
pages_dict = {}
for article in root:
    title = get_xml_elem(article, 'title').text
    titles_dict[title] = (title, title)
    title_file = get_xml_elem(article, 'title-meta/title-file')
    pages_dict[title] = title_file.text[title_file.text.find('-')+1:title_file.text.find('_')]


if not SPELLCHECK_ONLY:
    print("Processing general corrections...")

    # Correct preferred combinations and latin letters
    for title in tqdm(titles_dict.keys()):
        title_new = title_handle_latin(titles_dict[title][0], COMBINATIONS_CORR)
        titles_dict[title] = (title_new, title_new)

    # Remove bounding symbols
    for title in tqdm(titles_dict.keys()):
        title_new = title_handle_bounding(titles_dict[title][0])
        titles_dict[title] = (title_new, title_new)

    # CAPS
    for title in tqdm(titles_dict.keys()):
        title_new = titles_dict[title][0].upper()
        titles_dict[title] = (title_new, title_new)

    # Merge single-lettered words
    for title in tqdm(titles_dict.keys()):
        title_new = title_handle_merge(titles_dict[title][0])
        titles_dict[title] = (title_new, title_new)

    # Revert changes for aux formulas in titles
    for title in tqdm(titles_dict.keys()):
        title_new = title_handle_formulas(titles_dict[title][0], title)
        titles_dict[title] = (title_new, title_new)

# Try spellcheck on titles
print("Processing spellcheck...")
spellcheck_dict_update()
for title in tqdm(titles_dict.keys()):
    title_new = titles_dict[title][0]
    title_suggestions = do_spellcheck(title_new)
    for i in range(len(title_new)):
        title_new = title_new[:i] + ('_' if title_new[i] not in [' ', '\n', '\r'] else title_new[i]) + (title_new[i+1:] if i + 1 <= len(title_new) else '')
    for pos in sorted(title_suggestions.keys(), reverse=True):
        title_new = title_new[:pos] + title_suggestions[pos][1] + title_new[pos+len(title_suggestions[pos][0]):]
    titles_dict[title] = (titles_dict[title][0], title_new)


# Write corrections xml
root = ElementTree.Element('data')
for i in titles_dict.items():
    pair = ElementTree.SubElement(root, 'pair')
    title_old = ElementTree.SubElement(pair, 'title_old')
    title_old.text = i[0]
    title_new = ElementTree.SubElement(pair, 'title_new')
    title_new.text = i[1][0]
    title_new = ElementTree.SubElement(pair, 'title__sc')
    title_new.text = i[1][1]
    page = ElementTree.SubElement(pair, 'page')
    page.text = pages_dict[i[0]]
xml_write(root, WORK_DIR + CORRECTION_FILE)

Processing general corrections...


  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

Processing spellcheck...


  0%|          | 0/3 [00:00<?, ?it/s]

## 2.2. Подстановщик исправленных заголовков

Заменяет все заголовки на исправленные согласно списку пар.

In [3]:
# 2.2. Подстановщик исправленных заголовков:

from lib import *

# -------------------------- VARS --------------------------------
WORK_DIR = GLOBAL_WORK_DIR
INPUT_FILE = BASE_TITLES_PARSER_OUTPUT_FILE
CORRECTION_FILE = TITLES_CHECKER_CORRECTIONS_FILE
EXIT_FILE = CHECKED_TITLES_FILE
# ----------------------------------------------------------------



root = parse_xml(WORK_DIR + CORRECTION_FILE)


# Get all the corrections into a dict
titles_dict = {}
for pair in tqdm(root):
    titles_dict[get_xml_elem(pair, 'title_old').text] = get_xml_elem(pair, 'title_new').text


root = parse_xml(WORK_DIR + INPUT_FILE)


# Replace titles
for article in tqdm(root):
    get_xml_elem(article, 'title').text = titles_dict[get_xml_elem(article, 'title').text]
xml_write(root, WORK_DIR + EXIT_FILE)

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

# 3. Сортировщик / сливщик файлов с заголовками

Сортирует статьи в файлах из основного списка в порядке страница-расположение, т.е. (если не сказано иного) в алфавитном порядке и выводит в один выходной файл. Также порядковый номер заменяется uri формата "http://libmeta.ru/me/article/1_Kraevaya". (Созданные uri кешируются по номеру страницы и позиции заголовка в тексте и при последующих запусках остаются неизменными, если включен `URI_SAFER`).

Также в конец выходного файл добавляются заголовки из "ручного" файла, в том же формате, но без сортировки, что позволяет добавлять случайно забытые статьи без изменения uri и имён файлов всех остальных статей.

In [2]:
# 3. Сортировщик / сливщик файлов с заголовками

from lib import *

# -------------------------- VARS --------------------------------
WORK_DIR = GLOBAL_RESULTS_DIR
TITLES_DIR = GLOBAL_TITLES_DIR
MANUALLY_ADDED_FILE = SINGLE_TITLE_PARSER_OUTPUT_FILE
URI_CACHE = GLOBAL_URI_CACHE
INPUT_FILES = get_filenames(WORK_DIR + TITLES_DIR)
INPUT_FILES.remove(URI_CACHE)
INPUT_FILES.remove(MANUALLY_ADDED_FILE)
EXIT_FILE = GLOBAL_MERGED_TITLES_FILE
# Uri safer prevents already existing uri from being changed. Set to False ONLY IF you need to update an existing uris.
URI_SAFER = True
# ----------------------------------------------------------------



class Article:
    title = ''
    start_title = ''
    end_title = ''
    filename = ''



# Try to get uri from the cache for title with given page and pos
def get_uri(title_page:str, title_pos:str) -> str:
    global cache_root
    for elem_uri in cache_root:
        if elem_uri.tag == 'uri' and elem_uri.attrib['page'] == title_page and elem_uri.attrib['pos'] == title_pos:
            return elem_uri.text
    return ''
# Cache given uri
def cache_uri(title_page:str, title_pos:str, uri_str:str):
    global cache_root
    elem_uri = ElementTree.SubElement(cache_root, 'uri', {'page':title_page, 'pos':title_pos})
    elem_uri.text = uri_str
    xml_write(cache_root, WORK_DIR + TITLES_DIR + URI_CACHE)


# Add article title and metadata to xml tree
def add_article(article_local:Article, etree_root:ElementTree.Element, number:int):
    page_str = article_local.filename[article_local.filename.find('-') + 1: article_local.filename.find('_')]
    uri_cached = get_uri(page_str, article_local.start_title)
    translitted = translit(article_local.title[:article_local.title.find(' ')], 'ru', True)
    while translitted.find('/') != -1:
        translitted = translitted[:translitted.find('/')] + '_' + translitted[translitted.find('/')+1:]		# Prevent slash being counted as subfolder in further
    uri_str = URI_PREFIX + "article/" + str(number) + "_" + translitted
    if URI_SAFER and uri_cached != '':
        uri_str = uri_cached
    else:
        cache_uri(page_str, article_local.start_title, uri_str)
    elem_article = ElementTree.SubElement(etree_root, 'article', {'uri':uri_str})
    elem_title = ElementTree.SubElement(elem_article, 'title')
    elem_title.text = article_local.title
    elem_title_meta = ElementTree.SubElement(elem_article, 'title-meta')
    elem_title_file = ElementTree.SubElement(elem_title_meta, 'title-file')
    elem_title_file.text = article_local.filename
    elem_title_start = ElementTree.SubElement(elem_title_meta, 'title-start')
    elem_title_start.text = str(int(article_local.start_title) + 1)
    elem_title_end = ElementTree.SubElement(elem_title_meta, 'title-end')
    elem_title_end.text = article_local.end_title


# Check for existing uri list
filenames_raw = get_filenames(WORK_DIR + TITLES_DIR)
if not(URI_CACHE in filenames_raw):
    root = ElementTree.Element('data')
    xml_write(root, WORK_DIR + TITLES_DIR + URI_CACHE)
cache_root = parse_xml(WORK_DIR + TITLES_DIR + URI_CACHE)


# Collect all the articles
print("Parsing main input files...")
articles_dict = {}
for filename in tqdm(INPUT_FILES):
    root = parse_xml(WORK_DIR + TITLES_DIR + filename)
    for article in root:
        title = get_xml_elem(article, 'title').text
        elem = get_xml_elem(article, 'title-meta/title-file')
        page = elem.text[elem.text.find('-')+1:elem.text.find('_')]
        pos = get_xml_elem(article, 'title-meta/title-start').text
        start = get_xml_elem(article, 'title-meta/title-start').text
        end = get_xml_elem(article, 'title-meta/title-end').text
        file = get_xml_elem(article, 'title-meta/title-file').text
        num = (int(page), int(pos))
        articles_dict[num] = {'title':title, 'file':file, 'start':start, 'end':end}

# Same for manually added articles
articles_dict_man = {}
nums_list_man = []
if len(MANUALLY_ADDED_FILE):
    print("Parsing \"manual\" file...")
    root = parse_xml(WORK_DIR + TITLES_DIR + MANUALLY_ADDED_FILE)
    for article in root:
        title = get_xml_elem(article, 'title').text
        elem = get_xml_elem(article, 'title-meta/title-file')
        page = elem.text[elem.text.find('-')+1:elem.text.find('_')]
        pos = get_xml_elem(article, 'title-meta/title-start').text
        start = get_xml_elem(article, 'title-meta/title-start').text
        end = get_xml_elem(article, 'title-meta/title-end').text
        file = get_xml_elem(article, 'title-meta/title-file').text
        num = (int(page), int(pos))
        articles_dict_man[num] = {'title':title, 'file':file, 'start':start, 'end':end}
        nums_list_man.append(num)


# Sort keys and write articles accordingly
root = ElementTree.Element('data')
nums_list = sorted(list(i for i in articles_dict.keys()))
print("Writing articles...")
for num in tqdm(range(len(nums_list))):
    article = Article()
    article.title = articles_dict[nums_list[num]]['title']
    article.start_title = articles_dict[nums_list[num]]['start']
    article.end_title = articles_dict[nums_list[num]]['end']
    article.filename = articles_dict[nums_list[num]]['file']
    add_article(article, root, num + 1)
if len(MANUALLY_ADDED_FILE):
    for num in tqdm(range(len(nums_list_man))):
        article = Article()
        article.title = articles_dict_man[nums_list_man[num]]['title']
        article.start_title = articles_dict_man[nums_list_man[num]]['start']
        article.end_title = articles_dict_man[nums_list_man[num]]['end']
        article.filename = articles_dict_man[nums_list_man[num]]['file']
        add_article(article, root, num + 1 + len(nums_list))
xml_write(root, WORK_DIR + EXIT_FILE)

<module 'enchant' from 'C:\\Users\\User\\AppData\\Roaming\\Python\\Python312\\site-packages\\enchant\\__init__.py'>
Parsing main input files...


  0%|          | 0/9 [00:00<?, ?it/s]

Parsing "manual" file...
Writing articles...


  0%|          | 0/3584 [00:00<?, ?it/s]

  0%|          | 0/2 [00:00<?, ?it/s]

# 4. Парсер текстов статей

По информации из указанного файла с заголовками вытаскивает в сыром виде тексты статей. Каждая статья помещается в свой .xml файл, с именем, содержащим номер статьи и первое слово из заголовка транслитом.

In [3]:
# 4. Парсер текстов статей

from lib import *

# -------------------------- VARS --------------------------------
TITLES_FILE = GLOBAL_RESULTS_DIR + GLOBAL_MERGED_TITLES_FILE
PAGES_DIR = GLOBAL_WORK_DIR + GLOBAL_PAGES_DIR
EXIT_DIR = GLOBAL_RESULTS_DIR + GLOBAL_ARTICLES_DIR
COMBINATIONS_CORR = {
    'І' : 'I'		# These two are different!
}
# ----------------------------------------------------------------


class Article:
    start_file = ''
    start_pos = 0
    end_file = ''
    end_pos = 0
    text = ''
    text_orig = ''
    uri = ''
    num = ''
    title = ''
    xml = ''

    def get_text(self):
        # Get filenames
        filenames_raw_local = get_filenames(PAGES_DIR)
        filenames_local = []
        for filename_local in filenames_raw_local:
            if filename_local[-4:] == ".mmd":
                filenames_local.append(filename_local)
        if self.start_file == self.end_file:
            with codecs.open(PAGES_DIR + self.start_file, 'r', 'utf-8') as f_in:
                self.text += f_in.read()[self.start_pos:self.end_pos]
        else:
            with codecs.open(PAGES_DIR + self.start_file, 'r', 'utf-8') as f_in:
                self.text += f_in.read()[self.start_pos:]
            for page_local in range(int(self.start_file[3:self.start_file.find('_')]) + 1, int(self.end_file[3:self.end_file.find('_')])):
                for filename_local in filenames_local:
                    if int(filename_local[3:filename_local.find('_')]) == page_local:
                        self.text += ' ' # Add a space to prevent word merging
                        with codecs.open(PAGES_DIR + filename_local, 'r', 'utf-8') as f_in:
                            self.text += f_in.read()
            self.text += ' ' # Add a space to prevent word merging
            with codecs.open(PAGES_DIR + self.end_file, 'r', 'utf-8') as f_in:
                self.text += f_in.read()[:self.end_pos]
        for comb_local in COMBINATIONS_CORR.keys():
            while self.text.find(comb_local) != -1:
                self.text = self.text[:self.text.find(comb_local)] + COMBINATIONS_CORR[comb_local] + self.text[self.text.find(comb_local) + len(comb_local):]
        while self.text is not None and len(self.text) and self.text[0] in [' ', ',', '.', ':', ';', '-', '\n', '\r']:
            self.text = self.text[1:]
        while self.text is not None and len(self.text) and self.text[-1] in [' ', '\n', '\r']:
            self.text = self.text[:-1]
        self.text_orig = self.text
        # Fix several capital symbols per word
        word_left = 0
        while word_left < len(self.text):
            word_right = min(len(self.text), self.text.find(' ', word_left) if self.text.find(' ', word_left) != -1 else len(self.text))
            word_right = min(word_right, self.text.find('\n', word_left) if self.text.find('\n', word_left) != -1 else len(self.text))
            word_right = min(word_right, self.text.find('\r', word_left) if self.text.find('\r', word_left) != -1 else len(self.text))
            word_right = min(word_right, self.text.find('-', word_left) if self.text.find('-', word_left) != -1 else len(self.text))
            word_right = min(word_right, self.text.find('.', word_left) if self.text.find('.', word_left) != -1 else len(self.text))
            word_str = self.text[word_left:word_right]
            if word_str is not None and len(word_str) > 1 and not check_in_uri(self.text, word_left) and not check_in_formula(self.text, word_left) and not check_in_link(self.text, word_left):
                word_str = word_str[0] + word_str[1:len(word_str)].lower()
                self.text = self.text[:word_left] + word_str + self.text[word_right:]
            word_left = word_right + 1

    def make_xml(self):
        self.get_text()

        elem_article = ElementTree.Element("article", {'uri':self.uri, 'alphabetic_pos':self.num})
        elem_title = ElementTree.SubElement(elem_article, 'title')
        elem_title.text = self.title
        elem_author = ElementTree.SubElement(elem_article, 'authors')
        elem_author.text = None
        #elem_title_short = ElementTree.SubElement(elem_article, 'title_short')
        #elem_title_short.text = None
        elem_pages = ElementTree.SubElement(elem_article, 'pages')
        elem_start = ElementTree.SubElement(elem_pages, 'start')
        elem_start.text = self.start_file[3:self.start_file.find('_', 3)]
        elem_end = ElementTree.SubElement(elem_pages, 'end')
        elem_end.text = self.end_file[3:self.end_file.find('_', 3)]
        elem_literature = ElementTree.SubElement(elem_article, 'literature')
        elem_literature_orig = ElementTree.SubElement(elem_literature, 'literature_orig')
        elem_literature_orig.text = None
        elem_formulas_remote = ElementTree.SubElement(elem_article, 'formulas_main')
        elem_formulas_remote.text = None
        elem_formulas_inline = ElementTree.SubElement(elem_article, 'formulas_aux')
        elem_formulas_inline.text = None
        elem_relations = ElementTree.SubElement(elem_article, 'relations', {'n': '0'})
        elem_relations.text = None
        elem_text = ElementTree.SubElement(elem_article, 'text')
        elem_text.text = self.text
        elem_text_orig = ElementTree.SubElement(elem_article, 'text_orig')
        elem_text_orig.text = self.text_orig

        self.xml = prettify(elem_article)



class Title:
    text = ''
    file = ''
    start_pos = 0
    end_pos = 0
    uri = ''


def get_titles_dict(etree_root:ElementTree.Element) -> dict:
    titles_dict_local = {}
    for elem_title in etree_root:
        elem_uri = elem_title.attrib['uri']
        elem_text = get_xml_elem(elem_title, 'title').text
        elem_file = get_xml_elem(elem_title, 'title-meta/title-file').text
        elem_page = int(elem_file[elem_file.find('-') + 1 : elem_file.find('_')])
        elem_start_pos = int(get_xml_elem(elem_title, 'title-meta/title-start').text)
        elem_end_pos = int(get_xml_elem(elem_title, 'title-meta/title-end').text)
        titles_dict_local[(elem_page, elem_start_pos)] = Title()
        titles_dict_local[(elem_page, elem_start_pos)].uri = elem_uri
        titles_dict_local[(elem_page, elem_start_pos)].text = elem_text
        titles_dict_local[(elem_page, elem_start_pos)].file = elem_file
        titles_dict_local[(elem_page, elem_start_pos)].start_pos = elem_start_pos
        titles_dict_local[(elem_page, elem_start_pos)].end_pos = elem_end_pos
    return titles_dict_local


def get_title(number:int, dict_with_titles:dict) -> Title:
    out_title = Title()
    titles_dict_keys = sorted(dict_with_titles.keys())
    for p in range(len(titles_dict_keys)):
        if p == number:
            out_title = dict_with_titles[titles_dict_keys[p]]
    return out_title


root = parse_xml(TITLES_FILE)

# Create articles list
articles_list = []
title = Title()
titles_dict = get_titles_dict(root)
print("Getting articles info...")
for i in tqdm(range(len(root))):
    title = get_title(i, titles_dict)
    if i:
        articles_list[-1].end_file = title.file
        articles_list[-1].end_pos = max(title.start_pos - 2, 0) # There is a shift for some reason
    articles_list.append(Article())
    articles_list[-1].uri = title.uri
    articles_list[-1].num = str(i + 1)
    articles_list[-1].title = title.text
    articles_list[-1].start_file = title.file
    articles_list[-1].start_pos = title.end_pos
    articles_list[-1].end_file = title.file
    with codecs.open(PAGES_DIR + title.file, 'r', 'utf-8') as f:
        articles_list[-1].end_pos = len(f.read())

# Parse texts themselves and write
print("Parsing articles...")
for i in tqdm(range(len(articles_list))):
    articles_list[i].make_xml()
    with codecs.open(EXIT_DIR + '' + articles_list[i].uri[len(URI_PREFIX) + 8:] + '.xml', 'w', 'utf-8') as f:
        f.write(articles_list[i].xml)

Getting articles info...


  0%|          | 0/3586 [00:00<?, ?it/s]

Parsing articles...


  0%|          | 0/3586 [00:00<?, ?it/s]

# 5. Проверка правописания в текстах

## 5.1. Сканер

Сканирует тексты из указанного диапазона статей и выносит все показавшиеся подозрительными слова в отдельный xml следующего формата:
- Статья (имя файла в аттрибутах)
  - Слово (позиция в тексте и флаги в аттрибутах)
    - Исходный вариант
    - Контекстная строка (размер задаётся в разделе параметров скрипта)
    - Предложенная замена

Предлагается два флага для определения дальнейшей "судьбы" слова: "результат" (0 -- исходное, 1 -- предложенное) и "добавление в словарь" (0 -- не добавлять, 1 -- добавить как есть, 2 -- перевести в нижний регистр и добавить (для первого слова в предложении), 3 -- сделать первую букву заглавной и добавить (для имён, случайно распознанных без заглавной; применяется к выбранному результату)

In [4]:
# 5.1. Проверка правописания в текстах. Сканер.

from lib import *

# -------------------------- VARS --------------------------------
ARTICLES_DIR = GLOBAL_RESULTS_DIR + GLOBAL_ARTICLES_DIR
EXIT_DIR = GLOBAL_WORK_DIR
CONTEXT_SIZE = 20
START_ARTICLE = 3587
END_ARTICLE = 3586
# Flags for usual cases
DEFAULT_RESULT_FLAG = '1'
DEFAULT_ADD_TO_PWL_FLAG = '0'
# Flags if name is detected
"""NAME_RESULT_FLAG = '0'
NAME_ADD_TO_PWL_FLAG = '1'"""
# Cases that have to be overriden
OVERRIDE_FORCE_CYRILLIC = {
    'Ссср' : 'СССР',
    'Усср' : 'УССР',
    'Церн' : 'ЦЕРН'
}
OVERRIDE_AS_IS = {
}
# ----------------------------------------------------------------


spellcheck_dict_update()

# Get filenames needed
filenames = get_filenames(ARTICLES_DIR)
#filenames = ['4_ABELEVA.xml']

root = ElementTree.Element('data')

total_wois = 0
for filename in tqdm(filenames):
    article_number = int(filename[:filename.find('_')])
    if article_number < START_ARTICLE or article_number > END_ARTICLE:
        continue

    #print(f'{filename}: found ', end='')
    article = parse_xml(ARTICLES_DIR + filename)
    text = get_xml_elem(article, 'text')
    if text.text is None:
        continue
    elif not len(text.text):
        continue

    #add_to_pwl(filename[filename.find('_')+1:filename.find('.xml')])

    text_suggestions = do_spellcheck(text.text)
    #print(len(text_suggestions.keys()))
    total_wois += len(text_suggestions.keys())
    if len(text_suggestions.keys()):
        article = ElementTree.SubElement(root, 'article', {'filename': filename})
        for pos in text_suggestions.keys():
            #print(f'{pos}: {text_suggestions[pos][0]} -> {text_suggestions[pos][1]}')
            local_result_flag = DEFAULT_RESULT_FLAG
            local_add_to_pwl_flag = DEFAULT_ADD_TO_PWL_FLAG
            # Process possible name case
            '''if len(text_suggestions[pos][0]) >= 2 and len(text_suggestions[pos][1]) >= 2:
                is_name_orig = re.match(r"[А-ЯA-Z]", text_suggestions[pos][0][0]) is not None and re.match(r"[а-яa-z]", text_suggestions[pos][0][1]) is not None
                is_name_sugg = re.match(r"[А-ЯA-Z]", text_suggestions[pos][1][0]) is not None and re.match(r"[а-яa-z]", text_suggestions[pos][1][1]) is not None
                if is_name_orig and is_name_sugg:
                    local_result_flag = NAME_RESULT_FLAG
                    local_add_to_pwl_flag = NAME_ADD_TO_PWL_FLAG'''
            # Override specific cases
            suggestion_text = text_suggestions[pos][1]
            if title_handle_latin(text_suggestions[pos][0], COMBINATIONS_CORR_GLOBAL) in OVERRIDE_FORCE_CYRILLIC.keys():
                suggestion_text = OVERRIDE_FORCE_CYRILLIC[title_handle_latin(text_suggestions[pos][0], COMBINATIONS_CORR_GLOBAL)]
                local_result_flag = DEFAULT_RESULT_FLAG
                local_add_to_pwl_flag = DEFAULT_ADD_TO_PWL_FLAG
            if text_suggestions[pos][0] in OVERRIDE_AS_IS.keys():
                suggestion_text = OVERRIDE_AS_IS[text_suggestions[pos][0]]
                local_result_flag = DEFAULT_RESULT_FLAG
                local_add_to_pwl_flag = DEFAULT_ADD_TO_PWL_FLAG
            word = ElementTree.SubElement(article, 'word', {'pos': str(pos), 'result': local_result_flag, 'add_to_pwl': local_add_to_pwl_flag})
            source = ElementTree.SubElement(word, 'source')
            source.text = text_suggestions[pos][0]
            context = ElementTree.SubElement(word, 'context')
            context_string = text.text[max(0, pos - CONTEXT_SIZE):min(len(text.text), pos + len(text_suggestions[pos][0]) + CONTEXT_SIZE)]
            while context_string.find('\n') != -1:
                context_string = context_string[:context_string.find('\n')] + '\\n' + context_string[context_string.find('\n')+1:]
            while context_string.find('\r') != -1:
                context_string = context_string[:context_string.find('\r')] + '\\r' + context_string[context_string.find('\r')+1:]
            context.text = context_string
            suggestion = ElementTree.SubElement(word, 'suggestion')
            suggestion.text = suggestion_text

print("Cases found:", total_wois)


with codecs.open(EXIT_DIR + f'MEspellcheck-a{START_ARTICLE}-{END_ARTICLE}.xml', 'w', 'utf-8') as f:
    f.write(prettify(root))

  0%|          | 0/3586 [00:00<?, ?it/s]

Cases found: 0


## 5.2. Пополнение словаря

Добавляет отмеченные флагом "добавление в словарь" слова из всех файлов в директории спеллчека
- Учитывается, было ли выбрано оригинальное слово или исправленное флагом "результат".
- Словарь сортируется по алфавиту при каждом запуске.
- Дубликаты удаляются при каждом запуске (символы разного регистра одинаковыми не считаются).
- Слова добавленные вручную при запуске не удаляются.

Чтобы объединить ваш словарь с другим, скопируйте и вставьте всё содержимое нового словаря в ваш, после чего запустите скрипт. Дубликаты будут удалены, итоговый словарь будет отсортирован.

In [5]:
# 5.2. Проверка правописания в текстах. Пополнение словаря.

from lib import *

# -------------------------- VARS --------------------------------
SPELLCHECK_DIR = GLOBAL_RESULTS_DIR + GLOBAL_SPELLCHECK_DIR
# ----------------------------------------------------------------


# Read PWL and form word list
with codecs.open(PERSONAL_WORD_LIST, 'r', 'utf-8') as f:
    PWL_text = f.read()
additions = [i.strip() for i in PWL_text.split('\n')]
while '' in additions:
    additions.remove('')
PWL_text = ''

# Read all spellcheck outputs and create additions list
# Get filenames needed
filenames = get_filenames(SPELLCHECK_DIR)

print("Scanning for PWL additions...")
for filename in tqdm(filenames):
    root = parse_xml(SPELLCHECK_DIR + filename)
    for article in root:
        if article.tag == "article":
            for word in article:
                if word.tag == "word" and word.attrib["add_to_pwl"] != '0':
                    word_text = get_xml_elem(word, 'suggestion').text.strip() if word.attrib["result"] == '1' else get_xml_elem(word, 'source').text.strip()
                    # Check (and correct) that the word has no latin and cyrillic letters at the same time
                    if word_text is not None and len(word_text):
                        exist_from_comb = False
                        exist_rus = False
                        for i in range(len(word_text)):
                            exist_from_comb = True if word_text[i] in COMBINATIONS_CORR_GLOBAL.keys() else exist_from_comb
                            exist_rus = True if re.match(r"[А-Яа-я]", word_text[i]) is not None else exist_rus
                        if exist_from_comb and exist_rus:
                            for i in range(len(word_text)):
                                word_text = word_text[:i] + (COMBINATIONS_CORR_GLOBAL[word_text[i]] if word_text[i] in COMBINATIONS_CORR_GLOBAL.keys() else word_text[i]) + (word_text[i+1:] if (i + 1) <= len(word_text) else '')
                    if word.attrib["add_to_pwl"] == '2':
                        additions.append(word_text.lower())
                    elif word.attrib["add_to_pwl"] == '3' and len(word_text):
                        additions.append(word_text[0].upper() + word_text[1:] if len(word_text) > 1 else '')
                    else:
                        additions.append(word_text)

# Make new PWL list and sort it
print("Processing PWL...")
PWL_list_new = []
for word in tqdm(additions):
    # Append word to the list if not present yet
    if word is not None and len(word) and not word in PWL_list_new:
        PWL_list_new.append(word)
PWL_list_new.sort()

# Write PWL
print("Writing PWL...")
for word in tqdm(PWL_list_new):
    PWL_text = PWL_text + word + '\n'
with codecs.open(PERSONAL_WORD_LIST, 'w', 'utf-8') as f:
    f.write(PWL_text)

Scanning for PWL additions...


  0%|          | 0/69 [00:00<?, ?it/s]

Processing PWL...


  0%|          | 0/8420 [00:00<?, ?it/s]

Writing PWL...


  0%|          | 0/4508 [00:00<?, ?it/s]

## 5.3. Подстановка исправленной орфографии

Подставляет в исходный текст исправленные слова или оригиналы, в зависимости от установленного флага "результат".

In [6]:
# 5.3. Проверка правописания в текстах. Подстановка исправленной орфографии.

from lib import *

# -------------------------- VARS --------------------------------
SPELLCHECK_DIR = GLOBAL_RESULTS_DIR + GLOBAL_SPELLCHECK_DIR
ARTICLES_DIR = GLOBAL_RESULTS_DIR + GLOBAL_ARTICLES_DIR
# ----------------------------------------------------------------


corrections = {}

# Get filenames needed
filenames = get_filenames(SPELLCHECK_DIR)

print('Getting corrections...')
for filename_data in tqdm(filenames):
    # Parse articles corrections file
    root_data = parse_xml(SPELLCHECK_DIR + filename_data)
    for article in root_data:
        if article.tag == 'article':
            filename_article = article.attrib['filename']
            if filename_article not in corrections.keys():
                corrections[filename_article] = []
            # Parse corrections in one article
            for word in article:
                if word.tag == 'word':
                    pos = int(word.attrib['pos'])
                    len_src = len(get_xml_elem(word, 'source').text)
                    word_text = get_xml_elem(word, 'suggestion').text if word.attrib['result'] == '1' else get_xml_elem(word, 'source').text
                    if word.attrib['add_to_pwl'] == '3' and len(word_text):
                        word_text = word_text[0].upper() + word_text[1:]
                    corrections[filename_article].append((pos, word_text, len_src))

print('Correcting articles...')
for filename_article in tqdm(corrections.keys()):
    # Apply corrections
    root_article = parse_xml(ARTICLES_DIR + filename_article)
    text = get_xml_elem(root_article, 'text')
    corrections[filename_article].sort(reverse=True)
    for word in corrections[filename_article]:
        pos = word[0]
        len_src = word[2]
        word_text = word[1]
        text.text = text.text[:pos] + word_text + (text.text[pos+len_src:] if pos+len_src <= len(text.text) else '')
    # Write corrected article xml
    with codecs.open(ARTICLES_DIR + filename_article, 'w', 'utf-8') as f:
        f.write(prettify(root_article))

Getting corrections...


  0%|          | 0/69 [00:00<?, ?it/s]

Correcting articles...


  0%|          | 0/1412 [00:00<?, ?it/s]

# 6. Парсер авторов статьи

Ищет в конце текста статей конструкции типа ` [Xxxx]. [Xxxx]. [Xxxx]` или ` [Xxxx].[Xxxx]. [Xxxx]` и интерпретирует её как автора статьи.

In [7]:
# 6. Парсинг авторов статьи

from lib import *

# -------------------------- VARS --------------------------------
ARTICLES_DIR = GLOBAL_RESULTS_DIR + GLOBAL_ARTICLES_DIR
COMBINATIONS_CORR = dict_merge(COMBINATIONS_CORR_UNICODE, {
    'II' : 'П'
})
# ----------------------------------------------------------------


n = 0
LOCAL_DICT = {'0':'О', '3':'З', '6':'б'}

# Get filenames needed
filenames = get_filenames(ARTICLES_DIR)

for filename in tqdm(filenames):
    article = parse_xml(ARTICLES_DIR + filename)
    textelem = get_xml_elem(article, 'text')
    text = textelem.text
    authors = get_xml_elem(article, 'authors')

    auth_start = 1
    auth_list = []
    while auth_start and text is not None:
        # Find first non-space from the end
        while text[-1] == ' ' or text[-1] == '\n' or text[-1] == '\r':
            text = text[:-1]

        auth_start = 0
        # Try recognize
        first_space = max(text.rfind(' ', 0, len(text)), text.rfind('\n', 0, len(text)), text.rfind('\r', 0, len(text)))
        second_space = max(text.rfind(' ', 0, first_space), text.rfind('\n', 0, first_space), text.rfind('\r', 0, first_space))
        third_space = max(text.rfind(' ', 0, second_space), text.rfind('\n', 0, second_space), text.rfind('\r', 0, second_space))
        if first_space >= 0 and text[first_space-1] == '.' and second_space >= 0:
            if text.find('.', second_space, first_space-1) != -1: # If there's no space between initials
                third_space = second_space
                second_space = first_space
            if text[second_space-1] == '.' and third_space >= 0:
                # Check if first letters of each word are capitals
                keep = text
                for comb in LOCAL_DICT.keys():
                    while text[third_space+1:].find(comb) != -1:
                        text = text[:third_space+1+text[third_space+1:].find(comb)] + LOCAL_DICT[comb] + text[third_space+2+text[third_space+1:].find(comb):]
                if re.match(r"[A-ZА-ЯІ]", text[first_space+1]) is not None and re.match(r"[A-ZА-ЯІ]", text[second_space+1]) is not None and re.match(r"[A-ZА-ЯІ]", text[third_space+1]) is not None:
                    auth_start = third_space + 1
                text = keep

        if auth_start: # Suggest that an article cannot consist of author only and therefore auth_start should be > 0
            #print(article.attrib['uri'], author_text)
            author_text = text[auth_start:]
            if author_text[author_text.find('.')+1] != ' ': # Add space if there's no one between initials
                author_text = author_text[:author_text.find('.')+1] + ' ' + author_text[author_text.find('.')+1:]
            if author_text[-1] == '.' or author_text[-1] == ',':
                author_text = author_text[:-1]
            # convert wrong symbols
            for comb in dict_merge(COMBINATIONS_CORR, LOCAL_DICT).keys():
                while author_text.find(comb) != -1:
                    author_text = author_text[:author_text.find(comb)] + dict_merge(COMBINATIONS_CORR, LOCAL_DICT)[comb] + author_text[author_text.find(comb) + len(comb):]

            auth_list.append(author_text)
            text = text[:auth_start]

    # add authors, reverse their order to alphabetic
    for auth in reversed(auth_list):
        n += 1
        author = ElementTree.SubElement(authors, 'author')
        author.text = auth

    textelem.text = text
    with codecs.open(ARTICLES_DIR + filename, 'w', 'utf-8') as f:
        f.write(prettify(article))

print("Authors found in total:", n)

  0%|          | 0/3586 [00:00<?, ?it/s]

Authors found in total: 1287


# 7. Парсер литературы

После извлечения авторов статьи в конце за текстом статьи присутствует только строчка литературы, если вообще присутствует. Поэтому ищется и извлекается фрагмент начиная с "`Лит.:`". Он разделяется на сегменты по "`[num]`", а сегменты на подфрагменты по запятым. Общий вид сегмента полагается следующим: "`[Авторы (возможно несколько, определяются по наличию инициалов в конце подфрагмента)], Название (возможно содержит запятые), Номер тома (может отсутствовать), [Информация об издании (может частично или полностью отсутствовать)], Год, [Прочее (главы, страницы и прочее, может отсутствовать)];`"

In [8]:
# 7. Парсинг литературы

from lib import *

# -------------------------- VARS --------------------------------
ARTICLES_DIR = GLOBAL_RESULTS_DIR + GLOBAL_ARTICLES_DIR
COMBINATIONS_CORR_LOCAL = dict_merge(dict_merge(COMBINATIONS_CORR_ALPHABET, COMBINATIONS_CORR_UNICODE), {'J':'Л'})
# ----------------------------------------------------------------


class Unit:
    authors = []
    title = ""
    publication = ""
    year = ""
    other = ""


n_lit = 0
n_pub = 0

# Get filenames needed
filenames = get_filenames(ARTICLES_DIR)

for filename in tqdm(filenames):
    article = parse_xml(ARTICLES_DIR + filename)
    textelem = get_xml_elem(article, 'text')
    text = textelem.text
    literature = get_xml_elem(article, 'literature')
    literature_orig = get_xml_elem(literature, 'literature_orig')

    if textelem.text is not None and len(textelem.text):
        #Find literature start position and extract if present
        for key in COMBINATIONS_CORR_LOCAL.keys():
            while text.find(key) != -1:
                text = text[:text.find(key)] + COMBINATIONS_CORR_LOCAL[key] + text[text.find(key)+1:]
        text = text.upper()
        lit_pos = text.rfind('\nЛИТ.: ')
        lit_pos = text.rfind('\rЛИТ.: ') if lit_pos == -1 else lit_pos
        lit_pos = text.rfind(' ЛИТ.: ') if lit_pos == -1 else lit_pos
        if lit_pos != -1:
            n_lit += 1
            literature_orig.text = textelem.text[lit_pos:]
            while literature_orig.text[0] in [' ', '\n', '\r']:
                literature_orig.text = literature_orig.text[1:]
            textelem.text = textelem.text[:lit_pos]
            while textelem.text[-1] in [' ', '\n', '\r']:
                textelem.text = textelem.text[:-1]


            # Parse literature string
            text = literature_orig.text
            units = []
            num = 1
            while text.find('['+str(num)+']') != -1:
                units.append(text[text.find('['+str(num)+']')+len('['+str(num)+']'):(text.find('['+str(num+1)+']') if text.find('['+str(num+1)+']') != -1 else len(text))])
                n_pub += 1
                num += 1
            for unit in units:
                logical_parts = Unit()
                logical_parts.authors.clear()
                subunits = unit.split(',')
                while '' in subunits:
                    subunits.remove('')
                pos_last_auth = -1
                pos_last_title = -1
                pos_thome = -1
                pos_transl = -1
                pos_pub_num = -1
                pos_pub_place = -1
                pos_year = -1


                # Define positions of most common pats of literature string
                for i in range(len(subunits)):
                    text = subunits[i]
                    while text[-1] in [' ', '\n', '\r', ';']:
                        text = text[:-1]
                    while text[0] in [' ', '\n', '\r']:
                        text = text[1:]
                    subunits[i] = text

                    if pos_last_auth + 1 == i: # Recognize authors
                        keep = text
                        pos_initials = 0
                        for j in range(len(text)):
                            if text[j] in COMBINATIONS_CORR_UNICODE:
                                text = text[:j] + COMBINATIONS_CORR_UNICODE[text[j]] + text[j+1:]
                        if text[-1] == '.' and re.match(r"[[А-ЯA-Z]", text[-2]) is not None and text[-3] == ' ' and text[-4] == '.' and re.match(r"[[А-ЯA-Z]", text[-5]) is not None:
                            # "X. X."
                            pos_last_auth = i
                            pos_initials = -5
                        elif text[-1] == '.' and re.match(r"[[А-ЯA-Z]", text[-2]) is not None and text[-3] == '.' and re.match(r"[[А-ЯA-Z]", text[-4]) is not None:
                            # "X.X."
                            pos_last_auth = i
                            text = text[:-2] + ' ' + text[-2:]
                            pos_initials = -5
                        elif text[-1] == '.' and re.match(r"[[А-ЯA-Z]", text[-2]) is not None:
                            # "X."
                            pos_last_auth = i
                            pos_initials = -2
                        else: # Title starts
                            text = keep
                        # If correct
                        if pos_last_auth == i:
                            surname = text[:pos_initials]
                            while surname.find(' ') != -1:
                                surname = surname[:surname.find(' ')] + surname[surname.find(' ')+1:]
                            text = surname + ' ' + text[pos_initials:]
                            j = 1
                            while j < len(text):
                                if re.match(r"[А-ЯA-Z]", text[j]) is not None and re.match(r"[а-яa-z]", text[j-1]) is not None:
                                    text = text[:j] + ' ' + text[j:]
                                    j = 1
                                else:
                                    j += 1
                            subunits[i] = text
                    else:
                        if pos_thome == -1: # Recognize thome
                            keep = text
                            for j in range(len(text)):
                                if text[j] in COMBINATIONS_CORR_GLOBAL:
                                    text = text[:j] + COMBINATIONS_CORR_GLOBAL[text[j]] + text[j+1:]
                            if text.upper().find('Т.') != -1:
                                pos_last_title = (i - 1) if pos_last_title == -1 else pos_last_title
                                pos_thome = i
                            text = keep
                        if pos_transl == -1: # Recognize publication number
                            keep = text
                            for j in range(len(text)):
                                if text[j] in COMBINATIONS_CORR_GLOBAL:
                                    text = text[:j] + COMBINATIONS_CORR_GLOBAL[text[j]] + text[j+1:]
                            if text.upper().find('ПЕР.') != -1:
                                pos_last_title = (i - 1) if pos_last_title == -1 else pos_last_title
                                pos_transl = i
                            text = keep
                        if pos_pub_num == -1: # Recognize publication number
                            keep = text
                            for j in range(len(text)):
                                if text[j] in COMBINATIONS_CORR_GLOBAL:
                                    text = text[:j] + COMBINATIONS_CORR_GLOBAL[text[j]] + text[j+1:]
                            if text.upper().find('ИЗД.') != -1:
                                pos_last_title = (i - 1) if pos_last_title == -1 else pos_last_title
                                pos_pub_num = i
                            text = keep
                        if pos_pub_place == -1: # Recognize publication place
                            keep = text
                            for j in range(len(text)):
                                if text[j] in COMBINATIONS_CORR_GLOBAL:
                                    text = text[:j] + COMBINATIONS_CORR_GLOBAL[text[j]] + text[j+1:]
                            if text.upper() in ['М.', 'Л.', 'СПБ.', 'М.Л.', 'Л.М.', 'М.СПБ.', 'СПБ.М.']:
                                pos_last_title = (i - 1) if pos_last_title == -1 else pos_last_title
                                pos_pub_place = i
                            text = keep
                        # If correct
                        if pos_last_auth != i and (pos_thome == i or pos_pub_num == i or pos_pub_place == i):
                            for j in range(len(text)):
                                if text[j] in COMBINATIONS_CORR_UNICODE:
                                    subunits[i] = text[:j] + COMBINATIONS_CORR_UNICODE[text[j]] + text[j+1:]

                        if pos_year == -1 and len(text) >= 4: # Recognize year
                            numbers = ['0','1','2','3','4','5','6','7','8','9']
                            j = 0
                            for j in range(len(text) - 3):
                                if text[j] in numbers and text[j+1] in numbers and text[j+2] in numbers and text[j+3] in numbers:
                                    pos_last_title = (i - 1) if pos_last_title == -1 else pos_last_title
                                    pos_year = i
                                    break
                            # if correct
                            if pos_year == i:
                                subunits[i] = text[j:j+4]


                # Extract info from literature string using positions defined above
                for i in range(len(subunits)):
                    text = subunits[i]
                    if pos_last_auth >= i: # Author
                        logical_parts.authors.append(text)
                    elif pos_last_auth < i <= pos_last_title: # Title
                        logical_parts.title = logical_parts.title + ('' if len(logical_parts.title) == 0 else ', ') + text
                    elif pos_year == i: # Year
                        logical_parts.year = logical_parts.year + ('' if len(logical_parts.year) == 0 else ', ') + text
                    elif ((pos_pub_num <= i and pos_pub_num != -1) or (pos_pub_place <= i and pos_pub_place != -1) or (pos_transl <= i and pos_transl != -1) or (pos_thome + 1 <= i and pos_thome != -1)) and pos_year > i: # Publication
                        logical_parts.publication = logical_parts.publication + ('' if len(logical_parts.publication) == 0 else ', ') + text
                    else: # Other
                        logical_parts.other = logical_parts.other + ('' if len(logical_parts.other) == 0 else ', ') + text


                # Debug section
                """print('\n', filename, unit)
                print('authors:', logical_parts.authors)
                print('title:', logical_parts.title)
                print('publication:', logical_parts.publication)
                print('year:', logical_parts.year)
                print('other:', logical_parts.other)
                print(pos_last_auth, pos_last_title, pos_thome, pos_transl, pos_pub_num, pos_pub_place, pos_year)"""


                # Add literature unit
                unit = ElementTree.SubElement(literature, "unit")
                for auth_str in logical_parts.authors:
                    author = ElementTree.SubElement(unit, "author")
                    author.text = auth_str
                title = ElementTree.SubElement(unit, "title")
                title.text = logical_parts.title
                publication = ElementTree.SubElement(unit, "publication")
                publication.text = logical_parts.publication
                year = ElementTree.SubElement(unit, "year")
                year.text = logical_parts.year
                other = ElementTree.SubElement(unit, "other")
                other.text = logical_parts.other


            # Write xml
            with codecs.open(ARTICLES_DIR + filename, 'w', 'utf-8') as f:
                f.write(prettify(article))

print("Literature found in", n_lit, "articles")
print("Publications found in total:", n_pub)

  0%|          | 0/3586 [00:00<?, ?it/s]

  if text[-1] == '.' and re.match(r"[[А-ЯA-Z]", text[-2]) is not None and text[-3] == ' ' and text[-4] == '.' and re.match(r"[[А-ЯA-Z]", text[-5]) is not None:


Literature found in 1040 articles
Publications found in total: 3214


# 8. Парсер формул

Выносит из текстов ранее подготовленных xml-файлов статей сначала выносные, а затем строчные формулы, оставляя на их месте ссылку внутри математического окружения.

Минимальная длина в символах, которой должна обладать строчная формула, настраивается.

In [9]:
# 8. Парсер формул

from lib import *

# -------------------------- VARS --------------------------------
ARTICLES_DIR = GLOBAL_RESULTS_DIR + GLOBAL_ARTICLES_DIR
MIN_INLINE_LEN = 0
# ----------------------------------------------------------------


# Get filenames needed
filenames = get_filenames(ARTICLES_DIR)

n_main = 0
n_aux = 0
for filename in tqdm(filenames):
    article = parse_xml(ARTICLES_DIR + filename)
    #print('REMOTES: ' + article.attrib['uri'])
    text = get_xml_elem(article, 'text')
    formulas_main = get_xml_elem(article, 'formulas_main')
    formulas_aux = get_xml_elem(article, 'formulas_aux')

    # Get main formulas
    pos_find = 0
    pos_start = 0
    pos_end = 0
    n = 1
    while text.text is not None and text.text.find('\\[', pos_find) != -1:
        pos_start = text.text.find('\\[', pos_find) + 2
        pos_end = text.text.find('\\]', pos_start)
        while text.text[pos_start] == '\n':
            pos_start += 1
        while text.text[pos_end-1] == '\n':
            pos_end -= 1
        pos_find = pos_start
        uri = URI_PREFIX + 'formula/main' + article.attrib['uri'][article.attrib['uri'].rfind('/', 0, article.attrib['uri'].find('_')):article.attrib['uri'].find('_')+1] + str(n) + article.attrib['uri'][article.attrib['uri'].find('_'):]
        n += 1
        formula = ElementTree.SubElement(formulas_main, 'formula', {'uri':uri})
        formula.text = text.text[pos_start:pos_end]
        text.text = text.text[:pos_start] + 'URI[[' + uri + ']]/URI' + text.text[pos_end:]
    n_main += n

    # Get auxiliary formulas
    pos_find = 0
    pos_start = 0
    pos_end = 0
    cnt = 0
    n = 1
    # Count dollar symbols
    while text.text is not None and text.text.find('$', pos_find) != -1:
        pos_find = text.text.find('$', pos_find) + 1
        cnt += 1
    # If cnt is not even assume that first one is garbage from title
    pos_find = 0
    if cnt % 2:
        pos_find = text.text.find('$', pos_find)
        text.text = text.text[:pos_find] + '#' + text.text[pos_find+1:]
    while text.text is not None and text.text.find('$', pos_find) != -1:
        pos_start = text.text.find('$', pos_find) + 1
        pos_end = text.text.find('$', pos_start)
        if not check_in_uri(text.text, pos_start) and not check_in_uri(text.text, pos_end):
            while text.text[pos_start] == '\n':
                pos_start += 1
            while text.text[pos_end-1] == '\n':
                pos_end -= 1
            pos_find = pos_start
            if pos_end - pos_start >= MIN_INLINE_LEN:
                uri = URI_PREFIX + 'formula/aux' + article.attrib['uri'][article.attrib['uri'].rfind('/', 0, article.attrib['uri'].find('_')):article.attrib['uri'].find('_')+1] + str(n) + article.attrib['uri'][article.attrib['uri'].find('_'):]
                n += 1
                formula = ElementTree.SubElement(formulas_aux, 'formula', {'uri':uri})
                formula.text = text.text[pos_start:pos_end]
                text.text = text.text[:pos_start] + 'URI[[' + uri + ']]/URI' + text.text[pos_end:]
            pos_find = text.text.find('$', pos_find) + 1
        else:
            pos_find = pos_end + 1
    n_aux += n

    with codecs.open(ARTICLES_DIR + filename, 'w', 'utf-8') as f:
        f.write(prettify(article))

print("Found main formulas:", n_main)
print("Found auxiliary formulas:", n_aux)

  0%|          | 0/3586 [00:00<?, ?it/s]

Found main formulas: 9959
Found auxiliary formulas: 38421


## 8.1. Вынос формул

Выносит все формулы в отдельный файл с указанием типа для возможной последующей обработки.

In [10]:
# 8.1. Вынос формул

from lib import *

# -------------------------- VARS --------------------------------
ARTICLES_DIR = GLOBAL_RESULTS_DIR + GLOBAL_ARTICLES_DIR
EXIT_FILE = GLOBAL_RESULTS_DIR + GLOBAL_EXTRACTED_FORMULAS_FILE
# ----------------------------------------------------------------


# Get filenames needed
filenames = get_filenames(ARTICLES_DIR)


formulas = ElementTree.Element('formulas')

for filename in tqdm(filenames):
    root = parse_xml(ARTICLES_DIR + filename)
    fmain = get_xml_elem(root, 'formulas_main')
    faux = get_xml_elem(root, 'formulas_aux')

    for formula in fmain:
        formulas.append(formula)
    for formula in faux:
        formulas.append(formula)

with codecs.open(EXIT_FILE, 'w', 'utf-8') as f:
    f.write(prettify(formulas))

  0%|          | 0/3586 [00:00<?, ?it/s]

## 8.2. Проверка формул

Случайным образом выбирает 20 случайных формул (из случайных статей) и вставляет их в математическое окружение Markdown для визуальной проверки

In [11]:
# 8.2. Проверка формул

from lib import *

# -------------------------- VARS --------------------------------
ARTICLES_DIR = GLOBAL_RESULTS_DIR + GLOBAL_ARTICLES_DIR
EXIT_FILE = GLOBAL_WORK_DIR + GLOBAL_FORMULAS_CHECK_FILE
NUMBER = 20
# ----------------------------------------------------------------


# Get filenames needed
filenames = get_filenames(ARTICLES_DIR)


file = ''

i = 0
while i < NUMBER:
    root = parse_xml(ARTICLES_DIR + filenames[randint(0, len(filenames)-1)])

    # Get all the info from article
    fmain = get_xml_elem(root, 'formulas_main')
    start = get_xml_elem(root, 'pages/start').text


    # if there's no formulas in the article try another one
    total_num = 0
    for formula in fmain:
        total_num += 1
    if not total_num:
        continue
    i += 1

    num = randint(0, 100) % total_num

    formula = fmain[num].text

    file += f'{i}. Статья: {root.attrib["uri"]}, Начало на стр. {start}, формула {num + 1}:\n$${formula}$$\n'

with codecs.open(EXIT_FILE, 'w', 'utf-8') as f:
    f.write(file)

# 9. Парсер ссылок типа "смотри также"

Ищет в тексте ссылки начинающиеся на `"см. [другие опциональные вводные слова]"` и пытается найти соответствующие им статьи в энциклопедии.

In [12]:
# 9. Парсер ссылок типа "смотри также"

print("Preparing search base...")

import relations as r
import importlib
importlib.reload(r)

# -------------------------- VARS --------------------------------
r.BRUTE_FORCE_MODE = False  # Maximum amount of links to find, but takes more time (very slow)
r.USE_MULTIPROCESSING = False  # WARNING: Does not work inside Jupyter!!!; Significantly speeds up scanning process
# ----------------------------------------------------------------

r.run()

Preparing search base...


  0%|          | 0/3586 [00:00<?, ?it/s]

  0%|          | 0/3586 [00:00<?, ?it/s]

Searching relations in articles...


  0%|          | 0/3586 [00:00<?, ?it/s]

Relations found in total: 1983


# 10. RDF конвертер

Преобразует полученные "проприетарные" xml-файлы статей в формат RDF, пригодный для загрузки в базу данных.

In [13]:
# 10. RDF конвертер

from lib import *

# -------------------------- VARS --------------------------------
ARTICLES_DIR = GLOBAL_RESULTS_DIR + GLOBAL_ARTICLES_DIR
EXIT_DIR = GLOBAL_RESULTS_DIR + GLOBAL_RDF_DIR
# Resources links
RESOURCE_CONCEPT = RDF_RESOURCE_CONCEPT
RESOURCE_PERSON = RDF_RESOURCE_PERSON
RESOURCE_PUBLICATION = RDF_RESOURCE_PUBLICATION
RESOURCE_FORMULA = RDF_RESOURCE_FORMULA
# Uri prefixes
CORE_URL = RDF_CORE_URL
CONCEPTS_URI_POSTPREFIX = RDF_CONCEPTS_URI_POSTPREFIX
CONCEPTS_URI_PREFIX = RDF_CONCEPTS_URI_PREFIX
PERSONS_URI_PREFIX = RDF_PERSONS_URI_PREFIX
PUBLICATIONS_URI_PREFIX = RDF_PUBLICATIONS_URI_PREFIX
FORMULAS_URI_PREFIX = RDF_FORMULAS_URI_PREFIX
# Filename and uri ranges
CONCEPTS_NUM_RANGE = RDF_CONCEPTS_NUM_RANGE
OBJECTS_NUM_RANGE = RDF_OBJECTS_NUM_RANGE
# Option that adds ".xml" file type for automatic highlighting in text editors, `False` by default
XML_FILETYPE = False
# ----------------------------------------------------------------


class Node:
    attrib = {}
    type = ''
    text = ''
    file = ''
    link = ''

    def __init__(self):
        self.contents = []
        self.attrib = {}
        self.type = ''
        self.text = ''
        self.file = ''
        self.link = ''


def add_person(name: str, src_type: str) -> str:
    global objects
    global objects_index
    global doubles_person
    # Split name into parts
    last = ''
    first = ''
    middle = ''
    num = len(name.split(' '))
    if src_type == "art":
        last = name.split(' ')[-1] if num >= 1 else ''
        first = name.split(' ')[0] if num >= 2 else ''
        middle = name.split(' ')[1] if num >= 3 else ''
    if src_type == "lit":
        last = name.split(' ')[0] if num >= 1 else ''
        first = name.split(' ')[1] if num >= 2 else ''
        middle = name.split(' ')[2] if num >= 3 else ''
    # Try to find an existing one
    for index in objects.keys():
        if objects[index].type == "person":
            match = 0
            match += 1 if objects[index].attrib["last"] == last else 0
            match += 1 if objects[index].attrib["first"] == first else 0
            match += 1 if objects[index].attrib["middle"] == middle else 0
            if match >= num:
                doubles_person += 1
                return objects[index].link
    # If not found create a new one
    index = str(objects_index)
    objects_index += 1
    objects[index] = Node()
    objects[index].type = 'person'
    objects[index].attrib["last"] = last
    objects[index].attrib["first"] = first
    objects[index].attrib["middle"] = middle
    objects[index].link = PERSONS_URI_PREFIX + index
    return objects[index].link


def add_publication(node: Node) -> Node:
    global objects
    global objects_index
    global doubles_publication
    # Try to find an existing one
    for index in objects.keys():
        if objects[index].type == "publication":
            match = True
            for author_in in node.attrib["authors"]:
                exist = False
                for author_ref in objects[index].attrib["authors"]:
                    exist = True if author_in.link == author_ref.link else exist
                match = match and exist
            match = False if node.attrib['title'] != objects[index].attrib['title'] else match
            match = False if node.attrib['publication'] != objects[index].attrib['publication'] else match
            match = False if node.attrib['year'] != objects[index].attrib['year'] else match
            match = False if node.attrib['other'] != objects[index].attrib['other'] else match
            if match:
                doubles_publication += 1
                return objects[index]
    # If not found create a new one
    index = str(objects_index)
    objects_index += 1
    node.type = 'publication'
    node.link = PUBLICATIONS_URI_PREFIX + index
    objects[index] = node
    return objects[index]


def add_formula(node: Node) -> Node:
    global objects
    global objects_index
    global doubles_formula
    # Try to find an existing one
    for index in objects.keys():
        if objects[index].type == "formula":
            if node.text == objects[index].text:
                doubles_formula += 1
                return objects[index]
    # If not found create a new one
    index = str(objects_index)
    objects_index += 1
    node.type = "formula"
    node.link = FORMULAS_URI_PREFIX + index
    objects[index] = node
    return objects[index]


def get_ct() -> str:
    ct = datetime.datetime.now(datetime.timezone.utc)
    return f' {ct.day}-{ct.month}-{ct.year} {ct.hour}:{ct.minute} '


def make_person(node: Node) -> ElementTree.Element:
    person_root = ElementTree.Element('rdf:RDF', {'xmlns:rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
                                      'xmlns:lbm': 'http://libmeta.ru/'})
    subroot = ElementTree.SubElement(person_root, 'lbm:InformationObject', {'rdf:about': node.link})
    ElementTree.SubElement(subroot, 'lbm:type', {'rdf:resource': RESOURCE_PERSON})
    ElementTree.SubElement(subroot, 'lbm:description')
    person_elem = ElementTree.SubElement(subroot, 'lbm:dateCreated')
    person_elem.text = get_ct()
    person_elem = ElementTree.SubElement(subroot, 'lbm:dateUpdated')
    person_elem.text = get_ct()

    subroot = ElementTree.SubElement(subroot, 'lbm:properties')

    subsubroot = ElementTree.SubElement(subroot, 'lbm:property')
    ElementTree.SubElement(subsubroot, 'lbm:type', {'rdf:resource': 'http://libmeta.ru/attribute#first'})
    person_elem = ElementTree.SubElement(subsubroot, 'lbm:value')
    person_elem.text = node.attrib['first']

    subsubroot = ElementTree.SubElement(subroot, 'lbm:property')
    ElementTree.SubElement(subsubroot, 'lbm:type', {'rdf:resource': 'http://libmeta.ru/attribute#last'})
    person_elem = ElementTree.SubElement(subsubroot, 'lbm:value')
    person_elem.text = node.attrib['last']

    subsubroot = ElementTree.SubElement(subroot, 'lbm:property')
    ElementTree.SubElement(subsubroot, 'lbm:type', {'rdf:resource': 'http://libmeta.ru/attribute#middle'})
    person_elem = ElementTree.SubElement(subsubroot, 'lbm:value')
    person_elem.text = node.attrib['middle']

    """subsubroot = ElementTree.SubElement(subroot, 'lbm:property')
    elem = ElementTree.SubElement(subsubroot, 'lbm:type', {'rdf:resource':'http://libmeta.ru/attribute#email'})
    elem = ElementTree.SubElement(subsubroot, 'lbm:value')
    elem.text = ''"""

    return person_root


def make_publication(node: Node) -> ElementTree.Element:
    publication_root = ElementTree.Element('rdf:RDF', {'xmlns:rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
                                           'xmlns:lbm': 'http://libmeta.ru/'})
    subroot = ElementTree.SubElement(publication_root, 'lbm:InformationObject', {'rdf:about': node.link})
    ElementTree.SubElement(subroot, 'lbm:type', {'rdf:resource': RESOURCE_PUBLICATION})
    ElementTree.SubElement(subroot, 'lbm:description')
    publication_elem = ElementTree.SubElement(subroot, 'lbm:dateCreated')
    publication_elem.text = get_ct()
    publication_elem = ElementTree.SubElement(subroot, 'lbm:dateUpdated')
    publication_elem.text = get_ct()

    subroot = ElementTree.SubElement(subroot, 'lbm:properties')

    for auth in node.attrib['authors']:
        subsubroot = ElementTree.SubElement(subroot, 'lbm:property')
        ElementTree.SubElement(subsubroot, 'lbm:type', {'rdf:resource': 'http://libmeta.ru/attribute#auth'})
        ElementTree.SubElement(subsubroot, 'lbm:value', {'rdf:resource': auth.link})

    """subsubroot = ElementTree.SubElement(subroot, 'lbm:property')
    ElementTree.SubElement(subsubroot, 'lbm:type', {'rdf:resource': 'http://libmeta.ru/attribute/doi'})
    pub_elem = ElementTree.SubElement(subsubroot, 'lbm:value')
    pub_elem.text = ''"""

    """subsubroot = ElementTree.SubElement(subroot, 'lbm:property')
    ElementTree.SubElement(subsubroot, 'lbm:type', {'rdf:resource': 'http://libmeta.ru/attribute#keywords'})
    pub_elem = ElementTree.SubElement(subsubroot, 'lbm:value')
    pub_elem.text = ''"""

    """subsubroot = ElementTree.SubElement(subroot, 'lbm:property')
    ElementTree.SubElement(subsubroot, 'lbm:type', {'rdf:resource': 'http://libmeta.ru/attribute#lang'})
    pub_elem = ElementTree.SubElement(subsubroot, 'lbm:value')
    pub_elem.text = ''"""

    subsubroot = ElementTree.SubElement(subroot, 'lbm:property')
    ElementTree.SubElement(subsubroot, 'lbm:type', {'rdf:resource': 'http://libmeta.ru/attribute#originalTitle'})
    pub_elem = ElementTree.SubElement(subsubroot, 'lbm:value')
    pub_elem.text = node.attrib['title']

    """subsubroot = ElementTree.SubElement(subroot, 'lbm:property')
    ElementTree.SubElement(subsubroot, 'lbm:type', {'rdf:resource': 'http://libmeta.ru/attribute#udc'})
    pub_elem = ElementTree.SubElement(subsubroot, 'lbm:value')
    pub_elem.text = ''"""

    return publication_root


def make_formula(node: Node) -> (ElementTree.Element, str):
    formula_root = ElementTree.Element('rdf:RDF', {'xmlns:rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
                                       'xmlns:lbm': 'http://libmeta.ru/'})
    subroot = ElementTree.SubElement(formula_root, 'lbm:InformationObject', {'rdf:about': node.link})
    ElementTree.SubElement(subroot, 'lbm:type', {'rdf:resource': RESOURCE_FORMULA})
    ElementTree.SubElement(subroot, 'lbm:description')
    formula_elem = ElementTree.SubElement(subroot, 'lbm:dateCreated')
    formula_elem.text = get_ct()
    formula_elem = ElementTree.SubElement(subroot, 'lbm:dateUpdated')
    formula_elem.text = get_ct()

    subroot = ElementTree.SubElement(subroot, 'lbm:properties')

    """subsubroot = ElementTree.SubElement(subroot, 'lbm:property')
    ElementTree.SubElement(subsubroot, 'lbm:type', {'rdf:resource': 'http://libmeta.ru/attribute/simplemathml'})
    formula_elem = ElementTree.SubElement(subsubroot, 'lbm:value')
    formula_elem.text = ''"""

    subsubroot = ElementTree.SubElement(subroot, 'lbm:property')
    ElementTree.SubElement(subsubroot, 'lbm:type', {'rdf:resource': 'http://libmeta.ru/attribute/mathml'})
    ElementTree.SubElement(subsubroot, 'lbm:value')

    subsubroot = ElementTree.SubElement(subroot, 'lbm:property')
    ElementTree.SubElement(subsubroot, 'lbm:type', {'rdf:resource': 'http://libmeta.ru/attribute/tex'})
    formula_elem = ElementTree.SubElement(subsubroot, 'lbm:value')
    formula_elem.text = '$$' + (node.text if node.text is not None else '') + '$$'

    converted2mathml = ''
    # noinspection PyBroadException
    try:
        converted2mathml = tex2mml(node.text)
    except:
        pass
    if converted2mathml is None:
        converted2mathml = ''
    return formula_root, converted2mathml


def make_concept(node: Node, index) -> ElementTree.Element:
    concept_root = ElementTree.Element('rdf:RDF', {'xmlns:rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
                                       'xmlns:lbm': 'http://libmeta.ru/', 'xmlns:core': 'http://libmeta.ru/core'})
    subroot = ElementTree.SubElement(concept_root, 'lbm:Concept', {'rdf:about': node.link})
    ElementTree.SubElement(subroot, 'lbm:thesaurus', {'rdf:resource': RESOURCE_CONCEPT})
    concept_elem = ElementTree.SubElement(subroot, 'lbm:code')
    concept_elem.text = node.link[len(CONCEPTS_URI_PREFIX):]
    concept_elem = ElementTree.SubElement(subroot, 'core:url')
    concept_elem.text = CORE_URL + index
    ElementTree.SubElement(subroot, 'lbm:descriptor')
    ElementTree.SubElement(subroot, 'lbm:comment')

    properties_root = ElementTree.SubElement(subroot, 'lbm:properties')

    for auth in node.attrib["authors"]:
        property_elem = ElementTree.SubElement(properties_root, 'lbm:property',
                                               {'rdf:resource':
                                                    'http://libmeta.ru/thesaurus/attribute/author_of_the_article'})
        ElementTree.SubElement(property_elem, 'lbm:value', {'rdf:resource': auth.link})

    """property_elem = ElementTree.SubElement(properties_root, 'lbm:property',
                                           {'rdf:resource': 'http://libmeta.ru/thesaurus/attribute/msc'})
    ElementTree.SubElement(property_elem, 'lbm:value', {'rdf:resource': ''})"""

    """property_elem = ElementTree.SubElement(properties_root, 'lbm:property',
                                           {'rdf:resource': 'http://libmeta.ru/thesaurus/attribute/theme_odu'})
    ElementTree.SubElement(property_elem, 'lbm:value', {'rdf:resource': ''})"""

    for pub in node.attrib["lit"]:
        property_elem = ElementTree.SubElement(properties_root, 'lbm:property',
                                               {'rdf:resource': 'http://libmeta.ru/thesaurus/attribute/lit'})
        ElementTree.SubElement(property_elem, 'lbm:value', {'rdf:resource': pub.link})

    formulas_added = []
    for form in node.attrib["f_main"]:
        if form.link not in formulas_added:
            formulas_added.append(form.link)
            property_elem = ElementTree.SubElement(properties_root, 'lbm:property',
                                                   {'rdf:resource':
                                                        'http://libmeta.ru/thesaurus/attribute/mainFormula'})
            ElementTree.SubElement(property_elem, 'lbm:value', {'rdf:resource': form.link})
    for form in node.attrib["f_aux"]:
        if form.link not in formulas_added:
            formulas_added.append(form.link)
            property_elem = ElementTree.SubElement(properties_root, 'lbm:property',
                                                   {'rdf:resource':
                                                        'http://libmeta.ru/thesaurus/attribute/additonalFormula'})
            ElementTree.SubElement(property_elem, 'lbm:value', {'rdf:resource': form.link})

    property_elem = ElementTree.SubElement(properties_root, 'lbm:property',
                                           {'rdf:resource': 'http://libmeta.ru/thesaurus/attribute/article'})
    ElementTree.SubElement(property_elem, 'lbm:value')

    property_elem = ElementTree.SubElement(properties_root, 'lbm:property',
                                           {'rdf:resource': 'http://libmeta.ru/thesaurus/attribute/original_text'})
    ElementTree.SubElement(property_elem, 'lbm:value')

    relations_added = []
    for relat in node.attrib["relations"]:
        if relat.link not in relations_added:
            relation_elem = ElementTree.SubElement(subroot, 'lbm:familyRelation',
                                                   {'type': 'http://libmeta.ru/relation/family#related'})
            ElementTree.SubElement(relation_elem, 'lbm:value', {'rdf:resource': relat.link})

    return concept_root


def make_link(text: str, link: str, link_type: str) -> str:
    return f'<a href="/{link_type}/show?uri={link}">{text}</a>'


def make_concept_uri(uri: str) -> str:
    return CONCEPTS_URI_PREFIX + CONCEPTS_URI_POSTPREFIX + uri[len(URI_PREFIX) + len("article/"):]


def prepare_texts(node: Node) -> Node:
    text = node.attrib["text"] if node.attrib["text"] is not None else ''

    # Process publications
    if len(node.attrib["lit"]):
        text += '\n\n<i>Лит.</i>: '
    for pub_idx in range(1, len(node.attrib["lit"]) + 1):
        pub = node.attrib["lit"][pub_idx - 1]
        link = make_link(f'[{pub_idx}]', pub.link, 'object')
        text_pos = 0
        while text.find(f'[{pub_idx}]', text_pos) != -1:
            text = text[:text.find(f'[{pub_idx}]', text_pos)] + link +\
                   text[text.find(f'[{pub_idx}]', text_pos)+len(f'[{pub_idx}]'):]
            text_pos = text.find(f'[{pub_idx}]', text_pos)+len(f'[{pub_idx}]')
        text += f'{link} '
        pub_lst = []
        for auth in pub.attrib["authors"]:
            if auth.text != '':
                pub_lst.append(make_link(auth.text, auth.link, 'object'))
        text += str.join(', ', pub_lst) + ', '
        pub_lst = []
        if pub.attrib["title"] is not None and pub.attrib["title"] != '':
            pub_lst.append(pub.attrib["title"])
        if pub.attrib["publication"] is not None and pub.attrib["publication"] != '':
            pub_lst.append(pub.attrib["publication"])
        if pub.attrib["year"] is not None and pub.attrib["year"] != '':
            pub_lst.append(pub.attrib["year"])
        if pub.attrib["other"] is not None and pub.attrib["other"] != '':
            pub_lst.append(pub.attrib["other"])
        text += make_link(str.join(', ', pub_lst), pub.link, 'object') +\
                ('; ' if pub_idx < len(node.attrib["lit"]) else '.')

    # Process authors
    if len(node.attrib["authors"]):
        text += '\n\n'
        auth_lst = []
        for auth in node.attrib["authors"]:
            if auth.text is not None and auth.text != '':
                auth_lst.append(make_link(auth.text, auth.link, 'object'))
        text += '<i>' + str.join(', ', auth_lst) + '.</i>'

    # Process main formulas
    for form in node.attrib["f_main"]:
        link = make_link(f'$${form.text}$$', form.link, 'object')
        pos_start = text.find(form.attrib["uri"])
        if pos_start >= 0:
            pos_end = pos_start
            while text[pos_start:pos_start+2] != '\\[':
                if pos_start == 0:
                    break
                pos_start -= 1
            while text[pos_end-2:pos_end] != '\\]':
                if pos_end == len(text):
                    break
                pos_end += 1
            text = (text[:pos_start] if pos_start > 0 else '') + link + (text[pos_end:] if pos_end < len(text) else '')
    # Process auxiliary formulas
    for form in node.attrib["f_aux"]:
        link = make_link(f'$${form.text}$$', form.link, 'object')
        pos_start = text.find(form.attrib["uri"])
        if pos_start >= 0:
            pos_end = pos_start
            while text[pos_start:pos_start+1] != '$':
                if pos_start == 0:
                    break
                pos_start -= 1
            while text[pos_end-1:pos_end] != '$':
                if pos_end == len(text):
                    break
                pos_end += 1
            text = (text[:pos_start] if pos_start > 0 else '') + link + (text[pos_end:] if pos_end < len(text) else '')

    # Process relations
    for relat in node.attrib["relations"]:
        link = make_link(relat.text, relat.link, 'concept')
        pos_start = text.find(relat.attrib["uri"])
        if pos_start >= 0:
            pos_end = pos_start
            while text[pos_start:pos_start+5] != 'URI[[':
                if pos_start == 0:
                    break
                pos_start -= 1
            while text[pos_end-6:pos_end] != ']]/URI':
                if pos_end == len(text):
                    break
                pos_end += 1
            text = (text[:pos_start] if pos_start > 0 else '') + link + (text[pos_end:] if pos_end < len(text) else '')

    paragraphs = text.split('\n\n')
    for p in range(len(paragraphs)):
        paragraphs[p] = f'<P>{paragraphs[p]}<BR/></P>'
    text = str.join('', paragraphs)

    node.attrib["text"] = text
    return node


concepts = {}
concepts_index = CONCEPTS_NUM_RANGE[0]
objects = {}
objects_index = OBJECTS_NUM_RANGE[0]
doubles_person = 0
doubles_publication = 0
doubles_formula = 0


# Prepare directories
deep = 0
for next_dir in EXIT_DIR.split("/"):
    if next_dir not in ('.', ''):
        deep += 1
        os.chdir(next_dir)
try:
    os.chdir("concept")
    os.chdir("..")
except FileNotFoundError:
    os.mkdir("concept")
try:
    os.chdir("object")
    os.chdir("..")
except FileNotFoundError:
    os.mkdir("object")
for i in range(deep):
    os.chdir('..')

# Convert articles into a node tree
filenames = get_filenames(ARTICLES_DIR)
# filenames = filenames[:5]
# filenames = ["13_AVTOKOLEBANI.xml", "29_ADAMARA-PERRONA.xml"]

print("\nScanning articles...")
for filename in tqdm(filenames):
    root = parse_xml(ARTICLES_DIR + filename)
    article = Node()
    article.attrib["uri"] = root.attrib["uri"]
    article.link = make_concept_uri(root.attrib["uri"])
    article.text = get_xml_elem(root, "title").text

    # Extract article authors
    article.attrib["authors"] = []
    elem = get_xml_elem(root, "authors")
    for subelem in elem:
        if subelem.tag == 'author':
            author = Node()
            author.text = subelem.text
            # Check for duplicates
            author.link = add_person(author.text, 'art')
            article.attrib["authors"].append(author)

    # Extract literature
    article.attrib["lit"] = []
    elem = get_xml_elem(root, "literature")
    for subelem in elem:
        if subelem.tag == 'unit':
            unit = Node()
            unit.attrib["authors"] = []
            # Extract authors
            for subsubelem in subelem:
                if subsubelem.tag == 'author':
                    author = Node()
                    author.text = subsubelem.text
                    author.link = add_person(author.text, 'lit')
                    unit.attrib["authors"].append(author)
            # Extract other attributes
            unit.attrib['title'] = get_xml_elem(subelem, 'title').text
            unit.attrib['publication'] = get_xml_elem(subelem, 'publication').text
            unit.attrib['year'] = get_xml_elem(subelem, 'year').text
            unit.attrib['other'] = get_xml_elem(subelem, 'other').text
            # Check for duplicates and add
            article.attrib["lit"].append(add_publication(unit))

    # Extract formulas
    article.attrib["f_main"] = []
    elem = get_xml_elem(root, "formulas_main")
    for subelem in elem:
        if subelem.tag == 'formula':
            formula = Node()
            formula.text = subelem.text
            # Check for duplicates and add
            formula = add_formula(formula)
            formula.attrib["uri"] = subelem.attrib["uri"]
            article.attrib["f_main"].append(formula)
    article.attrib["f_aux"] = []
    elem = get_xml_elem(root, "formulas_aux")
    for subelem in elem:
        if subelem.tag == 'formula':
            formula = Node()
            formula.text = subelem.text
            # Check for duplicates and add
            formula = add_formula(formula)
            formula.attrib["uri"] = subelem.attrib["uri"]
            article.attrib["f_aux"].append(formula)

    # Extract relations
    article.attrib["relations"] = []
    elem = get_xml_elem(root, "relations")
    for subelem in elem:
        if subelem.tag == 'relation':
            relation = Node()
            relation.text = get_xml_elem(subelem, "rel_text").text
            relation.attrib['uri'] = subelem.attrib['uri']
            relation.attrib['tgt'] = get_xml_elem(subelem, 'target').text
            relation.link = make_concept_uri(relation.attrib['tgt'])
            article.attrib["relations"].append(relation)

    # Extract modified text
    article.attrib['text'] = get_xml_elem(root, 'text').text
    # Extract original text
    article.attrib['text_orig'] = get_xml_elem(root, 'text_orig').text

    # Create concepts
    idx = str(concepts_index)
    concepts_index += 1
    concepts[idx] = article

print(f"Total objects: {len(objects)}\n"
      f"Person duplicates found: {doubles_person}\n"
      f"Publication duplicates found: {doubles_publication}\n"
      f"Formula duplicates found: {doubles_formula}\n")

print("\nWriting objects...")
for idx in tqdm(objects.keys()):
    obj = ElementTree.Element('_')
    mml = ''
    if objects[idx].type == 'person':
        obj = make_person(objects[idx])
    elif objects[idx].type == 'publication':
        obj = make_publication(objects[idx])
    elif objects[idx].type == 'formula':
        obj, mml = make_formula(objects[idx])

    xml_out = prettify(obj)
    if objects[idx].type == 'formula':
        xml_out = insert_texts(xml=xml_out,
                               fragment='mathml"/>\n        <lbm:value/>',
                               left_scope='mathml"/>\n        <lbm:value>',
                               right_scope='</lbm:value>',
                               text=mml)
    with codecs.open(EXIT_DIR + 'object/' + idx + ('.xml' if XML_FILETYPE else ''), 'w', 'utf-8') as f:
        f.write(xml_out)

print("\nWriting concepts...")
for idx in tqdm(concepts.keys()):
    concept_node = prepare_texts(concepts[idx])
    obj = make_concept(concept_node, idx)
    xml_out = prettify(obj)

    xml_out = insert_texts(xml=xml_out,
                           fragment='<lbm:descriptor/>',
                           left_scope='<lbm:descriptor><![CDATA[',
                           right_scope=']]></lbm:descriptor>',
                           text=concept_node.text)

    xml_out = insert_texts(xml=xml_out,
                           fragment='attribute/article">\n        <lbm:value/>',
                           left_scope='attribute/article">\n        <lbm:value><![CDATA[',
                           right_scope=']]></lbm:value>',
                           text=concept_node.attrib['text'])

    xml_out = insert_texts(xml=xml_out,
                           fragment='attribute/original_text">\n        <lbm:value/>',
                           left_scope='attribute/original_text">\n        <lbm:value><![CDATA[',
                           right_scope=']]></lbm:value>',
                           text=concept_node.attrib['text_orig'])

    with codecs.open(EXIT_DIR + 'concept/' + idx + ('.xml' if XML_FILETYPE else ''), 'w', 'utf-8') as f:
        f.write(xml_out)


Scanning articles...


  0%|          | 0/3586 [00:00<?, ?it/s]

Total objects: 26482
Person duplicates found: 4030
Publication duplicates found: 583
Formula duplicates found: 18818


Writing objects...


  0%|          | 0/26482 [00:00<?, ?it/s]


Writing concepts...


  0%|          | 0/3586 [00:00<?, ?it/s]