# 0. Basic title parser

Extracts article titles and their positions in the files from the LaTeX source.

The splitting is semi-manual because the title format cannot be relied on.

The parser looks for words in which the share of uppercase Russian or English letters is greater than or equal to a given threshold (0.51 by default; lower values sharply increase the number of matches, for example because of two-letter prepositions). The assumption is that this also catches capitalized text that OCR has recognized incorrectly. Words or runs of words consisting of a single lowercase character are included in the title if they stand between words identified as part of the title. Single uppercase letters and initials are not treated as the start of a title.
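The detection rule can be sketched as follows. This is a simplified illustration rather than the parser's actual code: `is_title_candidate` and `count_caps` are hypothetical names, and the sketch omits the punctuation stripping, the exception list, and the initials check that the real cell applies.

```python
import re

CAPS_THRESHOLD = 0.51  # same default as CAPS_QUOT in the parser cell


def count_caps(word: str) -> int:
    """Number of uppercase Latin or Cyrillic letters in the word."""
    return len(re.findall(r"[A-ZА-Я]", word))


def is_title_candidate(word: str, threshold: float = CAPS_THRESHOLD) -> bool:
    """True if the word has at least two capitals and enough of it is uppercase."""
    if not word:
        return False
    caps = count_caps(word)
    return caps >= 2 and caps / len(word) >= threshold


for w in ["ВАРИАЦИЯ", "BAPИАция", "функция", "В", "На"]:
    print(w, is_title_candidate(w))
```

Requiring at least two capitals is what keeps single uppercase letters and ordinary capitalized words such as "На" from starting a title.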

## Usage
- If the title is detected correctly, press `Enter` without typing anything.
- If the suggested fragment is not a title, enter `"n"` (the Cyrillic `"т"`, which sits on the same key in the Russian layout, is accepted as well).
- If the title boundaries are wrong, enter two correction numbers that shift the left and right boundaries. A positive number widens the title on that side, a negative number narrows it.
  - NOTE: the shift is counted by spaces, so a double space is treated as a zero-length word.
  - NOTE: the boundaries of the displayed text fragment are moved automatically. The number of context words shown on the left and right is set in the parameters.
  - EXAMPLES:
    - `out: a [B C] d e f` -> `in: 0 2` -> `out: a [B C D E] f`
    - `out: a b c [D E] f` -> `in: 2 -1` -> `out: a [B C D] e f`
- The right boundary can also be shifted character by character when the article title has "fused" with its text. Enter a single number preceded by a dot.
  - EXAMPLES:
    - `out: a[BC]def` -> `in: .2` -> `out: a[BCDE]f`
    - `out: a[BCDE]f` -> `in: .-1` -> `out: a[BCD]ef`

In the terminal output, line breaks are replaced with `"$"` for convenience.

### Miscellaneous
- The caps detector supports exceptions that are never considered as a potential start of a title; see the options. Defaults: the first 10 Roman numerals, "МэВ" (MeV), and "ГэВ" (GeV). The detector also ignores "СМ." ("see"), which often appears in cross-references right after titles.
- Using the system terminal for interaction turns out to be more convenient than Jupyter, so you can copy the code cell into a file `scripter.py` and run it.
- When a title is confirmed, the output file is updated immediately. You can interrupt the process at any moment and resume later: on a new run the output file is appended to rather than rewritten from scratch. If you restart from the page you stopped at rather than from the next one, first delete the duplicates from the end of the file.
- If the parser misses a title, it can be added manually in two ways:
  1) Shift the title boundaries back as described in the instructions above. This works when a small (usually cross-reference) article of roughly 20 words was skipped. After the title is entered, the search continues from __its__ end, so the following title, "in place of" which the missed one was entered, will be detected again and not skipped.
  2) Use cell 0.1. Find the title in the raw tex file of the page, copy it, and paste it __exactly__ into the parameters section, along with the page number. The parser script does not have to be closed; subsequent numbering adjusts automatically.
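Duplicates left over after restarting from an already processed page can be located before deleting them by hand. The following is a minimal sketch, assuming the output file has the layout written by the parser cell (`article` elements with `title-meta/title-file` and `title-meta/title-start`); `find_duplicate_articles` is a hypothetical helper, not part of the notebook.

```python
import xml.etree.ElementTree as ET


def find_duplicate_articles(root):
    """Return the `n` attributes of articles whose (file, start) pair was already seen."""
    seen = set()
    duplicates = []
    for article in root.findall('article'):
        key = (article.findtext('title-meta/title-file'),
               article.findtext('title-meta/title-start'))
        if key in seen:
            duplicates.append(article.get('n'))
        else:
            seen.add(key)
    return duplicates


# Tiny in-memory sample; the file names are made up for illustration
sample = ET.fromstring(
    '<data>'
    '<article n="1"><title>A</title><title-meta><title-file>rp-5_a.mmd</title-file>'
    '<title-start>10</title-start><title-end>11</title-end></title-meta></article>'
    '<article n="2"><title>B</title><title-meta><title-file>rp-5_a.mmd</title-file>'
    '<title-start>40</title-start><title-end>41</title-end></title-meta></article>'
    '<article n="3"><title>A</title><title-meta><title-file>rp-5_a.mmd</title-file>'
    '<title-start>10</title-start><title-end>11</title-end></title-meta></article>'
    '</data>'
)
print(find_duplicate_articles(sample))
```

The sketch only reports duplicates; removing them and renumbering is left to the user, as in the manual procedure described above.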

In [None]:
# 0. Basic title parser

from os import walk
import xml.etree.ElementTree as ET
from xml.dom import minidom
import re
import codecs


############################ VARS ################################
PAGES_DIR = "./matphys/rpages/"
EXIT_DIR = "./matphys/"
EXIT_FILE = "FMEv2.xml"
# First and last pages to be parsed
START_PAGE = 639
END_PAGE = 700
# How many words to display before and after a potential title
LEAD_WORDS = 5
AFT_WORDS = 5
# Minimum share of uppercase letters for a word to count as caps (see the description above)
CAPS_QUOT = 0.51
EXCEPTIONS = ['I', 'II', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', 'IX', 'X', 'МэВ', 'ГэВ']
# Symbols excluded in xml have to be converted back
XML_EXCLUDES = {'&quot;' : '"', '&apos;' : "'", '&lt;' : '<',	'&gt;' : '>',	'&amp;' : '&'}
##################################################################



class Article:
	start_title = 0
	end_title = 0
	filename = ''



# Write xml tree to file
def prettify_1(elem:ET.Element) -> str:
	# Pretty-printed XML string for the Element.
	rough_string = ET.tostring(elem, 'utf-8')
	reparsed = minidom.parseString(rough_string)
	return reparsed.toprettyxml(indent="  ")
def xml_write_1(root:ET.Element):
	with codecs.open(EXIT_DIR + EXIT_FILE, 'w', 'utf-8') as f:
		f.write(prettify_1(root))


# Get filenames needed
filenames_raw = next(walk(PAGES_DIR), (None, None, []))[2]  # [] if no file
filenames = []
for i in range(START_PAGE, END_PAGE + 1):
	for filename in filenames_raw:
		beginning = "rp-" + str(i) + "_"
		if filename[:len(beginning)] == beginning and filename[-4:] == ".mmd":
			filenames.append(filename)
						

# Check for existing xml
filenames_raw = next(walk(EXIT_DIR), (None, None, []))[2]  # [] if no file
if not(EXIT_FILE in filenames_raw):
	root = ET.Element('data')
	xml_write_1(root)


# Convert xml excluded symbols
def xml_excluded_convert (text:str) -> str:
	for key in XML_EXCLUDES.keys():
		while text.find(key) != -1:
			pos = text.find(key)
			text = text[:pos] + XML_EXCLUDES[key] + text[pos+len(key):]
	return text
def remove_xml_spaces_1(elem:ET.Element) -> ET.Element:
	elem.tail = None
	if elem.text != None:
		is_space = True
		for letter in elem.text:
			is_space = False if letter != ' ' else is_space
		elem.text = None if is_space else xml_excluded_convert(elem.text)
	for subelem in elem:
		subelem = remove_xml_spaces_1(subelem)
	return elem
def parse_xml_1() -> ET.Element:
	# Parse existing xml (string parsing is needed to avoid extra newlines appearing)
	exit_string = ''
	with codecs.open(EXIT_DIR + EXIT_FILE, 'r', 'utf-8') as f:
		for i in f.readlines():
			exit_string += i[:-1]
	root = ET.fromstring(exit_string)
	# Remove empty tails and texts
	root = remove_xml_spaces_1(root)
	return root
root = parse_xml_1()
num = len(root) + 1


# Add article title and metadata to xml tree
def add_artice_1(elem:Article) -> int:
	# Update root in case it's been changed
	root = parse_xml_1()
	num = len(root) + 1
	article = ET.SubElement(root, 'article', {'n':str(num)})
	title = ET.SubElement(article, 'title')
	title.text = file[elem.start_title+1:elem.end_title]
	title_meta = ET.SubElement(article, 'title-meta')
	title_file = ET.SubElement(title_meta, 'title-file')
	title_file.text = elem.filename
	title_start = ET.SubElement(title_meta, 'title-start')
	title_start.text = str(elem.start_title + 1)
	title_end = ET.SubElement(title_meta, 'title-end')
	title_end.text = str(elem.end_title)
	xml_write_1(root)
	return num


# Count number of alphabetic letters in word
def count_letters_1(word:str) -> int:
	num = 0
	for letter in word:
		num += 0 if re.match(r"[A-ZА-Яa-zа-я]", letter) == None else 1
	return num

# Check if word is written in CAPS
def check_caps_1(word:str) -> int:
	num = 0
	len_word = 0
	while len(word) and re.match(r"[!#$%&'*+-.^_`|~:]", word[-1]) != None:
		word = word[:-1]
	while len(word) and re.match(r"[!#$%&'*+-.^_`|~:]", word[0]) != None:
		word = word[1:]
	for letter in word:
		#num += 0 if re.match(r"[A-ZА-Я0-9]|[!#$%&'*+-.^_`|~:]", letter) == None else 1					# Too many symbols, math formulas are being detected
		len_word += 1 if re.match(r"[!#$%&'*+-.^_`|~:]", letter) == None else 0
		num += 0 if re.match(r"[A-ZА-Я]", letter) == None else 1
	return 0 if len_word == 0 or num / len_word < CAPS_QUOT or word in EXCEPTIONS else num				# Also exclude common roman numbers

# Check for initials like "I.E."
def check_initials_1(word:str) -> bool:
	initials = True
	for i in range(len(word) - 1):
		type_1 = 0 if re.match(r"[A-ZА-Яa-zа-я]", word[i]) == None else 1
		type_2 = 0 if re.match(r"[A-ZА-Яa-zа-я]", word[i + 1]) == None else 1
		initials = False if type_1 and type_2 else initials
	return initials

# Check if the word is "CM." which happens often
def check_link_1(word:str) -> bool:
	word = word.upper()
	# Convert to cyrillic
	for i in range(len(word)):
		word = (word[:i] + 'С' + word[i+1:]) if word[i] == 'C' else word
		word = (word[:i] + 'М' + word[i+1:]) if word[i] == 'M' else word
	return True if word == 'СМ.' else False


# Find next or previous word boundary (space / newline)
def prev_from_1(pos:int, file:str) -> int:
	pos = max(pos, 0)
	prev_space = file.rfind(' ', 0, pos)
	prev_nl = file.rfind('\n', 0, pos)
	prev_space = -1 if prev_space == -1 else prev_space
	prev_nl = -1 if prev_nl == -1 else prev_nl
	return max(prev_nl, prev_space)
def next_from_1(pos:int, file:str, end_replace = True) -> int:
	next_space = file.find(' ', pos + 1)
	next_nl = file.find('\n', pos + 1)
	if end_replace:
		next_space = len(file) if next_space == -1 else next_space
		next_nl = len(file) if next_nl == -1 else next_nl
	return max(next_nl, next_space) if next_space == -1 or next_nl == -1 else min(next_nl, next_space)


# Main loop
for filename in filenames:
	print()
	print("################################ " + filename + " ################################")
	with codecs.open(PAGES_DIR + filename, 'r', 'utf-8') as f:
		file = f.read()
	
	word_bound_l = -1
	word_bound_r = next_from_1(word_bound_l, file, end_replace=False)
	EOF_reached = False

	while not EOF_reached:
		if word_bound_r == -1:
			word_bound_r = len(file)
			EOF_reached = True


		if check_caps_1(file[word_bound_l+1:word_bound_r]) < 2 or check_initials_1(file[word_bound_l+1:word_bound_r]) or check_link_1(file[word_bound_l+1:word_bound_r]):
			word_bound_l = word_bound_r
			word_bound_r = next_from_1(word_bound_l, file, end_replace=False)
		
		else: # Possibly found a title
			# Left border of a title is already known
			start_title = word_bound_l

			# Define right border of a title
			defined_end = False
			end_title = word_bound_r
			while not defined_end:
				word_bound_l = word_bound_r
				word_bound_r = next_from_1(word_bound_l, file)

				if word_bound_l == len(file):
					defined_end = True
				elif check_link_1(file[word_bound_l+1:word_bound_r]):
					# A "CM." link, not a title
					pass
				elif not check_caps_1(file[word_bound_l+1:word_bound_r]) and count_letters_1(file[word_bound_l+1:word_bound_r]) < 2:
					if re.match(r"[A-ZА-Яa-zа-я]", file[word_bound_l+1]) != None:
						# Most likely belongs to the title
						end_title = word_bound_r
					else:
						# Most likely does NOT belong to the title
						pass
				elif check_caps_1(file[word_bound_l+1:word_bound_r]):
					end_title = word_bound_r
				else:
					defined_end = True

			next_title = False
			while not next_title:
				# Update root in case it's been changed
				root = parse_xml_1()
				num = len(root) + 1

				# Console output for further user actions
				segment_start = start_title
				segment_end = end_title
				for i in range(LEAD_WORDS):
					segment_start = prev_from_1(segment_start, file)
				for i in range(AFT_WORDS):
					segment_end = next_from_1(segment_end, file)
				
				out_str = file[segment_start+1:segment_end]

				# Format
				for i in range(len(out_str)):
					out_str = out_str[:i] + ('$' if out_str[i] == '\n' else out_str[i]) + out_str[i+1:]
				out_str = f"{num})\n" + out_str + '\n' + ' ' * (start_title - segment_start) + '^' * (end_title - start_title - 1)
				# Check for "section" in the string. This is referred to alphabetic tip at the bottom of the page
				"""if 'section' in out_str or 'title' in out_str:
					out_str += '     ############################### Title or section found! ###############################'""" # Not Used
				print(out_str)

				# User actions
				response = input()
				try:
					if response == '':
						# Add article
						article = Article()
						article.start_title = start_title
						article.end_title = end_title
						article.filename = filename
						num = add_artice_1(article)
						next_title = True
						word_bound_l = end_title
						word_bound_r = next_from_1(word_bound_l, file, end_replace=False)
						print(f'Adding article, n="{num}", title="{file[start_title+1:end_title]}"\n\n')
					elif response == 'n' or response == 'т':
						# Do not add this one
						next_title = True
						print("Not an article, skipping\n\n")
					elif response[0] == '.':
						end_title += int(response[1:])
						print("Changing title right border\n\n")
					else:
						# Change title borders
						corrections = response.split(' ')
						corrections[0] = int(corrections[0])
						corrections[1] = int(corrections[1])
						if corrections[0] > 0:
							for i in range(abs(corrections[0])):
								start_title = prev_from_1(start_title, file)
						if corrections[0] < 0:
							for i in range(abs(corrections[0])):
								start_title = next_from_1(start_title, file)
						if corrections[1] < 0:
							for i in range(abs(corrections[1])):
								end_title = prev_from_1(end_title, file)
						if corrections[1] > 0:
							for i in range(abs(corrections[1])):
								end_title = next_from_1(end_title, file)
						print("Changing title borders\n\n")
				except Exception:  # A bare except would also swallow Ctrl+C (KeyboardInterrupt)
					print("########## !!! Failed on input, try again !!! ##########\n\n")


# End reached
print('###########################################################################################')
print('Last requested page processed. Press "Enter" to close this window.')
response = input()

## 0.1. Adding titles one at a time

In the parameters section, specify the page number and the EXACT wording of the title from the raw LaTeX text, then run the cell. Cell 1 (shared code) must be run first, because this cell uses `xml_write` and `parse_xml` defined there.

The parser script does not need to be closed: this causes no errors, and its numbering adjusts automatically.
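Because the title must match the raw text exactly, the lookup amounts to a plain substring search. A minimal sketch of finding every occurrence on a page follows; `find_all` is a hypothetical helper, and the sample strings are made up.

```python
def find_all(text: str, needle: str) -> list:
    """Return (start, end) character positions of every non-overlapping match."""
    if not needle:
        return []  # An empty needle would otherwise match forever
    positions = []
    pos = text.find(needle)
    while pos != -1:
        positions.append((pos, pos + len(needle)))
        # Continue after the end of this match so it is not found again
        pos = text.find(needle, pos + len(needle))
    return positions


print(find_all("ab TITLE cd TITLE", "TITLE"))
```

Restarting each search from the end of the previous match is what guarantees that the loop terminates and that no occurrence is reported twice.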

In [None]:
# 0.1. Adding titles one at a time

############################ VARS ################################
PAGES_DIR = "./matphys/rpages/"
EXIT_DIR = "./matphys/"
EXIT_FILE = "FMEv2.xml"
# Search parameters
PAGE = 73
TITLE = 'BAPИА山ия -'
##################################################################



class Article:
	start_title = 0
	end_title = 0
	filename = ''


# Get filenames needed
filenames_raw = next(walk(PAGES_DIR), (None, None, []))[2]  # [] if no file
filenames = []
for i in range(PAGE, PAGE + 1):
	for filename in filenames_raw:
		beginning = "rp-" + str(i) + "_"
		if filename[:len(beginning)] == beginning and filename[-4:] == ".mmd":
			filenames.append(filename)
						

# Check for existing xml
filenames_raw = next(walk(EXIT_DIR), (None, None, []))[2]  # [] if no file
if not(EXIT_FILE in filenames_raw):
	root = ET.Element('data')
	xml_write(root, EXIT_DIR + EXIT_FILE)


root = parse_xml(EXIT_DIR + EXIT_FILE)


# Add article title and metadata to xml tree
def add_artice(elem:Article, root:ET.Element, num:int):
	article = ET.SubElement(root, 'article', {'n':str(num)})
	title = ET.SubElement(article, 'title')
	title.text = file[elem.start_title+1:elem.end_title]
	title_meta = ET.SubElement(article, 'title-meta')
	title_file = ET.SubElement(title_meta, 'title-file')
	title_file.text = elem.filename
	title_start = ET.SubElement(title_meta, 'title-start')
	title_start.text = str(elem.start_title + 1)
	title_end = ET.SubElement(title_meta, 'title-end')
	title_end.text = str(elem.end_title)
	xml_write(root, EXIT_DIR + EXIT_FILE)

# Read requested file
with codecs.open(PAGES_DIR + filenames[0], 'r', 'utf-8') as f:
	file = f.read()

# Find titles and add them
start_title = 0
end_title = 0
num = len(root) + 1
while file.find(TITLE, end_title) != -1:
	# Search from the end of the previous match so the same occurrence is not found twice
	start_title = file.find(TITLE, end_title)
	end_title = start_title + len(TITLE)
	start_title -= 1 # Set on the space before the title

	article = Article()
	article.start_title = max(start_title, -1) # -1 means the title starts at the very beginning of the file
	article.end_title = min(end_title, len(file))
	article.filename = filenames[0]
	add_artice(article, root, num)
	num += 1 # The next occurrence gets the next number

# 1. Shared code

This cell imports the libraries and defines the functions used in cells 0.1 (above), 2, and later (below), to reduce the amount of code.

In [21]:
# 1. Shared code

from os import walk
import xml.etree.ElementTree as ET
from xml.dom import minidom
import re
import codecs
from transliterate import translit, get_available_language_codes
from random import randint
import enchant
from enchant.checker import SpellChecker
from enchant.tokenize import EmailFilter, URLFilter
import difflib


# Small dictionaries merger
def dict_merge(dict1:dict, dict2:dict) -> dict:
	dict0 = {}
	for key in dict1.keys():
		dict0[key] = dict1[key]
	for key in dict2.keys():
		dict0[key] = dict2[key]
	return dict0


############################ VARS ################################
# Symbols and combinations that have to be corrected after OCR
COMBINATIONS_CORR_ALPHABET = {
	'A':'А', 'a':'а', 'B':'В', 'b':'Ь', 'C':'С', 'c':'с', 'E':'Е', 'e':'е', 'H':'Н', 'K':'К', 'M':'М', 'O':'О', 'P':'Р', 'p':'р', 'T':'Т', 'X':'Х', 'y':'у', 'x':'х',
	'U' : 'И',
	'u' : 'и',
	'r' : 'г',
	'N' : 'П',
	'n' : 'п',
	'm' : 'т',
	'Y' : 'У',
	#'S' : 'Я',		# Seems irrelevant
}
COMBINATIONS_CORR_UNICODE = {
	'І' : 'I',		# These two "I" are different!
	'ก' : 'п',
	'山' : 'Ц',
	'כ' : 'э',
	'חи' : 'пи',
}
COMBINATIONS_CORR_OTHER = {
	' -' : '-',
	'- ' : '-',
	'0' : 'О',
	'3' : 'З',
	'6' : 'б',
}
COMBINATIONS_CORR_GLOBAL = dict_merge(COMBINATIONS_CORR_ALPHABET, dict_merge(COMBINATIONS_CORR_UNICODE, COMBINATIONS_CORR_OTHER))
# Symbols excluded in xml have to be converted back
XML_EXCLUDES = {
	'&quot;' : '"',
	'&apos;' : "'",
	'&lt;' : '<',
	'&gt;' : '>',
	'&amp;' : '&'
}
PERSONAL_WORD_LIST = "./matphys/PWL.txt"
##################################################################


# Write xml tree to file
def prettify(elem:ET.Element) -> str:
	# Pretty-printed XML string for the Element.
	rough_string = ET.tostring(elem, 'utf-8')
	reparsed = minidom.parseString(rough_string)
	return reparsed.toprettyxml(indent="  ")
def xml_write(root:ET.Element, filename:str):
	with codecs.open(filename, 'w', 'utf-8') as f:
		f.write(prettify(root))


# Convert xml excluded symbols
def xml_excluded_convert (text:str) -> str:
	for key in XML_EXCLUDES.keys():
		while text.find(key) != -1:
			pos = text.find(key)
			text = text[:pos] + XML_EXCLUDES[key] + text[pos+len(key):]
	return text


def remove_xml_spaces(elem:ET.Element, filename:str) -> ET.Element:
	elem.tail = None
	if elem.text != None:
		is_space = True
		for letter in elem.text:
			is_space = False if letter != ' ' else is_space
		if is_space:
			elem.text = None
		else:
			if elem.tag == 'text':
				elem.text = get_texts(filename)[0]
			elif elem.tag == 'text_orig':
				elem.text = get_texts(filename)[1]
			elem.text = xml_excluded_convert(elem.text)
	for subelem in elem:
		subelem = remove_xml_spaces(subelem, filename)
	return elem
def parse_xml(filename:str) -> ET.Element:
	# Parse existing xml (string parsing is needed to avoid extra newlines appearing)
	exit_string = ''
	with codecs.open(filename, 'r', 'utf-8') as f:
		for i in f.readlines():
			exit_string += i[:-1]
	root = ET.fromstring(exit_string)
	root = remove_xml_spaces(root, filename)
	return root


# !!!BUG!!! newlines disappear in texts in parsed xml (parse_xml() joins the lines without their
# trailing newline characters), so extract article texts manually and replace
def get_texts(filename:str) -> tuple:
	# Read the file once and cut out both text blocks from the raw string
	with codecs.open(filename, 'r', 'utf-8') as f:
		file = f.read()
	text = file[file.find('<text>')+6:file.find('</text>')]
	text_orig = file[file.find('<text_orig>')+11:file.find('</text_orig>')]
	return (text, text_orig)


# Get xml tree element with a certain tag name
def get_xml_elem(root:ET.Element, elem_path:str) -> ET.Element:
	tgt = elem_path.split('/')[0]
	for elem in root:
		if elem.tag == tgt:
			if elem_path.find('/') != -1:
				return get_xml_elem(elem, elem_path[elem_path.find('/')+1:])
			else:
				return elem
	return None


## Titles handling functions
# Correct preferred combinations and latin letters
def title_handle_latin(title_new: str, COMBINATIONS_CORR: dict) -> str:
	if title_new == None or len(title_new) == 0:
		return title_new
	for comb in COMBINATIONS_CORR.keys():
		while title_new.find(comb) != -1:
			title_new = title_new[:title_new.find(comb)] + COMBINATIONS_CORR[comb] + title_new[title_new.find(comb) + len(comb):]
	return title_new
# Remove bounding symbols
def title_handle_bounding(title_new: str) -> str:
	if title_new == None or len(title_new) == 0:
		return title_new
	while len(title_new) and re.match(r"[!#%&'*+-.^_`|~:;]", title_new[0]) != None:
		title_new = title_new[1:]
	while len(title_new) and re.match(r"[!#%&'*+-.^_`|~:;]", title_new[-1]) != None:
		title_new = title_new[:-1]
	return title_new
# Merge single-lettered words
def title_handle_merge(title_new: str) -> str:
	if title_new == None or len(title_new) == 0:
		return title_new
	title_new = ' ' + title_new + ' '
	for i in range(len(title_new) - 4):
		if (title_new[i] == ' ' or title_new[i] == '№') and title_new[i + 2] == ' ' and title_new[i + 4] == ' ':
			title_new = title_new[:i+2] + '№' + title_new[i+3:]
	i = 0
	while i < len(title_new):
		if title_new[i] == '№':
			title_new = title_new[:i] + title_new[i+1:]
			i = 0
		else:
			i += 1
	while title_new[0] == ' ':
		title_new = title_new[1:]
	while title_new[-1] == ' ':
		title_new = title_new[:-1]
	return title_new
# Revert changes for aux formulas in titles
def title_handle_formulas(title_new: str, title: str) -> str:
	if title_new == None or len(title_new) == 0:
		return title_new
	pos_old = 0
	pos_new = 0
	while title.find('$', pos_old) != -1 and title_new.find('$', pos_new) != -1:
		pos_old = title.find('$', pos_old) + 1
		pos_new = title_new.find('$', pos_new) + 1
		pos_old_next = title.find('$', pos_old) if title.find('$', pos_old) != -1 else len(title)
		pos_new_next = title_new.find('$', pos_new) if title_new.find('$', pos_new) != -1 else len(title_new)
		title_new = title_new[:pos_new] + title[pos_old:pos_old_next] + title_new[pos_new_next:]
		pos_old = pos_old_next + 1
		pos_new = (title_new.find('$', pos_new) if title_new.find('$', pos_new) != -1 else len(title_new)) + 1
	while title_new[-2:] == '-$' or title_new[-2:] == ',$' or title_new[-2:] == ':$':
		title_new = title_new[:-2] + '$'
	return title_new


# Position checkers
# Checks if given position is between opening and closing scopes
def check_in_scopes(text:str, pos:int, scope_open:str, scope_close:str) -> bool:
	if text == None:
		return False
	open_prev = text.rfind(scope_open, 0, pos)
	close_prev = text.rfind(scope_close, 0, pos)
	open_next = text.find(scope_open, pos)
	close_next = text.find(scope_close, pos)
	after_open = True if ((open_prev != -1 and close_prev == -1) or (open_prev > close_prev and open_prev != -1 and close_prev != -1)) else False
	before_close = True if ((open_next == -1 and close_next != -1) or (open_next > close_next and open_next != -1 and close_next != -1)) else False
	return (after_open and before_close)

def check_in_uri(text:str, pos:int) -> bool:
	return check_in_scopes(text, pos, 'URI[[', ']]/URI')

def check_in_link(text:str, pos:int) -> bool:
	return check_in_scopes(text, pos, '![](', ')')

def check_in_formula(text:str, pos:int) -> bool:
	# Main formulas
	in_main = check_in_scopes(text, pos, '\\[', '\\]')
	# Aux formulas
	if text == None:
		return False
	pos_find = 0
	cnt = 0
	found_before = 0
	# Count dollar symbols and find target position
	while text.find('$', pos_find) != -1:
		pos_find = text.find('$', pos_find)
		cnt += 1
		if not found_before and pos <= pos_find:
			found_before = cnt
		pos_find += 1
	# If cnt is not even assume that first one is garbage from title
	in_aux = not ((found_before + cnt) % 2) and found_before > 0
	return in_main or in_aux


# Prepare spellcheckers
ru_dict = enchant.DictWithPWL("ru_RU", PERSONAL_WORD_LIST)
ru_checker = SpellChecker(ru_dict, filters=[EmailFilter, URLFilter])
en_dict = enchant.DictWithPWL("en_US", PERSONAL_WORD_LIST)
en_checker = SpellChecker(en_dict, filters=[EmailFilter, URLFilter])
def spellcheck_dict_update():
	global ru_dict
	global ru_checker
	global en_dict
	global en_checker
	ru_dict = enchant.DictWithPWL("ru_RU", PERSONAL_WORD_LIST)
	ru_checker = SpellChecker(ru_dict, filters=[EmailFilter, URLFilter])
	en_dict = enchant.DictWithPWL("en_US", PERSONAL_WORD_LIST)
	en_checker = SpellChecker(en_dict, filters=[EmailFilter, URLFilter])

# Use PyEnchant spellchecker
def do_spellcheck(text: str) -> dict:
	global ru_dict
	global ru_checker
	global en_dict
	global en_checker
	dictionaries = [ru_dict, en_dict]
	checkers = [ru_checker, en_checker]
	text_suggestions = dict()

	# Spellcheck
	for i in range(len(checkers)):
		checker = checkers[i]
		dictionary = dictionaries[i]

		checker.set_text(text)
		for woi in checker:
			# Exclude some wois to reduce computation time and output
			if check_in_uri(text, woi.wordpos) or check_in_link(text, woi.wordpos) or check_in_formula(text, woi.wordpos) or len(woi.word) < 4 or woi.wordpos in text_suggestions.keys() or text[min(woi.wordpos + len(woi.word), len(text)-1)] in ['.']:
				continue
			# Check if word is correct in some other language
			word_is_correct_in_other_dict = False
			for _dictionary in dictionaries:
				word_is_correct_in_other_dict = True if _dictionary.check(woi.word) else word_is_correct_in_other_dict
			if word_is_correct_in_other_dict:
				continue
			# Generate a suggestion
			sim = dict()
			word_suggestions = set(dictionary.suggest(woi.word))
			for word in word_suggestions:
				measure = difflib.SequenceMatcher(None, woi.word, word).ratio()
				sim[measure] = word
			suggest = sim[max(sim.keys())] if len(sim.keys()) else None
			# Exclude some wois to reduce computation time and output
			if suggest == None or suggest == woi.word:
				continue
			else:
				text_suggestions[woi.wordpos] = (woi.word, suggest)

	return text_suggestions

def add_to_pwl(word: str):
	word = word.strip()
	with codecs.open(PERSONAL_WORD_LIST, 'r', 'utf-8') as f:
		pwl = f.read()
	# Compare against whole lines so that the first entry in the file is matched too
	if word not in pwl.splitlines():
		with codecs.open(PERSONAL_WORD_LIST, 'a', 'utf-8') as f:
			f.write(f"{word}\n")

# 2. Correcting errors in titles

Consists of two parts: the "pair builder" and the "substituter".

## 2.1. Builder of "original - corrected" pairs for titles

Builds an XML list of all titles with the possible automatic corrections (in before / after format):
1. replacing Latin letters with the look-alike Cyrillic ones;
2. replacing the specified letter combinations (see the parameters);
3. removing surrounding punctuation;
4. converting all letters to uppercase (among other things, this removes the later need to fix names);
5. merging words broken into separate letters (if several such words are adjacent, they are merged together).

This list must be reviewed and the remaining errors corrected.

Additionally, to help find spelling errors, a line with the changes suggested by the spellchecker is generated. WARNING: the spellchecker can be wrong about names, specialized terms, and so on, so its results should be used only as a guide. When `SPELLCHECK_ONLY` is `True` (the value set in the cell below), steps 1-5 are skipped and only the spellchecker line is produced.
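The correction steps can be sketched on a single title as follows. This is a reduced illustration under stated assumptions, not the notebook's implementation: the look-alike table covers only a few uppercase letters, `correct_title` and `merge_single_letters` are hypothetical names, and the handling of formulas is omitted.

```python
# Latin letters and the Cyrillic letters they resemble; the Cyrillic side is
# written as escapes because the two alphabets look identical on screen.
LATIN = "ABCEHKMOPTX"
CYRILLIC = "\u0410\u0412\u0421\u0415\u041d\u041a\u041c\u041e\u0420\u0422\u0425"
TO_CYRILLIC = str.maketrans(LATIN, CYRILLIC)


def merge_single_letters(title: str) -> str:
    """Join runs of adjacent one-character words into a single word."""
    merged = []
    run = ''
    for word in title.split():
        if len(word) == 1:
            run += word
        else:
            if run:
                merged.append(run)
                run = ''
            merged.append(word)
    if run:
        merged.append(run)
    return ' '.join(merged)


def correct_title(title: str) -> str:
    title = title.translate(TO_CYRILLIC)   # step 1: Latin -> Cyrillic look-alikes
    title = title.strip(" .,:;-")          # step 3: drop surrounding punctuation
    title = title.upper()                  # step 4: uppercase everything
    return merge_single_letters(title)     # step 5: merge split-up words


print(correct_title("-BAP\u0418\u0410\u0426\u0418\u042f."))
print(merge_single_letters("a b c de f"))
```

Step 2 (custom letter combinations) would be one more replacement pass driven by a dictionary and is left out here to keep the sketch short.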

In [33]:
# 2.1. Builder of "original - corrected" pairs for titles:

############################ VARS ################################
WORK_DIR = "./matphys/"
INPUT_FILE = "FMEv2.xml"
CORRECTION_FILE = "FMEcorr.xml"
COMBINATIONS_CORR = dict_merge(COMBINATIONS_CORR_GLOBAL, {
	'ХК' : 'Ж',
	'ЬI' : 'Ы',
	'II' : 'Ш',
	'I' : 'П',
	'J' : 'Л',
	'ЛАГРАНХ' : 'ЛАГРАНЖ',
	'ЛАТРАНХ' : 'ЛАГРАНЖ',
})
SPELLCHECK_ONLY = True # Use if the only thing you need from this script is spellcheck
##################################################################
						

# Check for existing xml
filenames_raw = next(walk(WORK_DIR), (None, None, []))[2]  # [] if no file
if not(INPUT_FILE in filenames_raw):
	root = ET.Element('data')
	xml_write(root, WORK_DIR + CORRECTION_FILE)


root = parse_xml(WORK_DIR + INPUT_FILE)


# Get all the titles into a dict
titles_dict = {}
pages_dict = {}
for article in root:
	title = get_xml_elem(article, 'title').text
	titles_dict[title] = (title, title)
	title_file = get_xml_elem(article, 'title-meta/title-file')
	pages_dict[title] = title_file.text[title_file.text.find('-')+1:title_file.text.find('_')]


if not SPELLCHECK_ONLY:
	# Correct preferred combinations and latin letters
	for title in titles_dict.keys():
		title_new = title_handle_latin(titles_dict[title][0], COMBINATIONS_CORR)
		titles_dict[title] = (title_new, title_new)

	# Remove bounding symbols
	for title in titles_dict.keys():
		title_new = title_handle_bounding(titles_dict[title][0])
		titles_dict[title] = (title_new, title_new)

	# CAPS
	for title in titles_dict.keys():
		title_new = titles_dict[title][0].upper()
		titles_dict[title] = (title_new, title_new)

	# Merge single-lettered words
	for title in titles_dict.keys():
		title_new = title_handle_merge(titles_dict[title][0])
		titles_dict[title] = (title_new, title_new)

	# Revert changes for aux formulas in titles
	for title in titles_dict.keys():
		title_new = title_handle_formulas(titles_dict[title][0], title)
		titles_dict[title] = (title_new, title_new)

# Try spellcheck on titles
spellcheck_dict_update()
for title in titles_dict.keys():
	title_new = titles_dict[title][0]
	title_suggestions = do_spellcheck(title_new)
	for i in range(len(title_new)):
		title_new = title_new[:i] + ('_' if title_new[i] not in [' ', '\n', '\r'] else title_new[i]) + (title_new[i+1:] if i + 1 <= len(title_new) else '')
	for pos in sorted(title_suggestions.keys(), reverse=True):
		title_new = title_new[:pos] + title_suggestions[pos][1] + title_new[pos+len(title_suggestions[pos][0]):]
	titles_dict[title] = (titles_dict[title][0], title_new)


# Write corrections xml
root = ET.Element('data')
for i in titles_dict.items():
	pair = ET.SubElement(root, 'pair')
	title_old = ET.SubElement(pair, 'title_old')
	title_old.text = i[0]
	title_new = ET.SubElement(pair, 'title_new')
	title_new.text = i[1][0]
	title_new = ET.SubElement(pair, 'title__sc')
	title_new.text = i[1][1]
	page = ET.SubElement(pair, 'page')
	page.text = pages_dict[i[0]]
xml_write(root, WORK_DIR + CORRECTION_FILE)

## 2.2. Substituter of corrected titles

Replaces all titles with the corrected ones according to the list of pairs.
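The substitution is a dictionary lookup keyed by the original title. Below is a minimal sketch using only the standard library, assuming the `pair` / `title_old` / `title_new` layout written by cell 2.1. `apply_corrections` is a hypothetical helper; unlike a plain dictionary lookup, it collects titles that have no pair instead of raising `KeyError`.

```python
import xml.etree.ElementTree as ET


def apply_corrections(articles_root, pairs_root):
    """Replace article titles in place; return titles that had no correction pair."""
    corrections = {
        pair.findtext('title_old'): pair.findtext('title_new')
        for pair in pairs_root.findall('pair')
    }
    missing = []
    for article in articles_root.findall('article'):
        title = article.find('title')
        if title.text in corrections:
            title.text = corrections[title.text]
        else:
            missing.append(title.text)
    return missing


# Made-up sample data for illustration
articles = ET.fromstring(
    '<data><article n="1"><title>ABC-</title></article>'
    '<article n="2"><title>XYZ</title></article></data>'
)
pairs = ET.fromstring(
    '<data><pair><title_old>ABC-</title_old><title_new>ABC</title_new></pair></data>'
)
missing = apply_corrections(articles, pairs)
print([a.findtext('title') for a in articles], missing)
```

Reporting unmatched titles makes it easier to notice articles that were added to the input file after the pair list was built.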

In [None]:
# 2.2. Substituter of corrected titles:

############################ VARS ################################
WORK_DIR = "./matphys/"
INPUT_FILE = "FMEv2.xml"
CORRECTION_FILE = "FMEcorr.xml"
EXIT_FILE = "FMEtitles.xml"
##################################################################



root = parse_xml(WORK_DIR + CORRECTION_FILE)


# Get all the corrections into a dict
titles_dict = {}
for pair in root:
	titles_dict[get_xml_elem(pair, 'title_old').text] = get_xml_elem(pair, 'title_new').text


root = parse_xml(WORK_DIR + INPUT_FILE)


# Replace titles
for article in root:
	get_xml_elem(article, 'title').text = titles_dict[get_xml_elem(article, 'title').text]
xml_write(root, WORK_DIR + EXIT_FILE)

# 3. Sorter / merger of title files

Sorts the articles from the listed files in page-position order, i.e. (unless stated otherwise) alphabetically, and writes them to a single output file. The sequence number is also replaced with a URI of the form "http://libmeta.ru/fme/article/1_Kraevaya".
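The ordering and URI scheme can be sketched as follows. This is an illustration under stated assumptions: the file names follow the `rp-<page>_...` pattern used by the parser, the slugs are supplied already transliterated (the notebook uses the `transliterate` package for that), and `sort_key` and `make_uri` are hypothetical names.

```python
def sort_key(entry):
    """Order by (page number, start position), both compared as integers."""
    filename, start, _slug = entry
    page = int(filename[filename.find('-') + 1:filename.find('_')])
    return (page, int(start))


def make_uri(num, slug):
    # A slash in the slug would read as a path separator, so replace it
    return "http://libmeta.ru/fme/article/" + str(num) + "_" + slug.replace('/', '_')


# (file name, title start position, transliterated first word); made-up data
entries = [
    ("rp-12_b.mmd", "40", "Kraevaya"),
    ("rp-5_a.mmd", "300", "Abelev"),
    ("rp-12_b.mmd", "7", "Variatsiya"),
]
uris = [make_uri(i + 1, e[2]) for i, e in enumerate(sorted(entries, key=sort_key))]
print(uris)
```

Comparing integer tuples rather than strings is what places page 12 after page 5 and position 40 after position 7.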

In [None]:
# 3. Sorter / merger of title files

############################ VARS ################################
WORK_DIR = "./results/"
TITLES_DIR = "FMEtitles/"
INPUT_FILES = ["FMEtitles-p5-100.xml", "FMEtitles-p101-200.xml", "FMEtitles-p201-300.xml", "FMEtitles-p301-400.xml", "FMEtitles-p301-400-add.xml",
							 "FMEtitles-p401-500.xml", "FMEtitles-p501-600.xml", "FMEtitles-p601-692.xml", "FMEtitles-p601-692-add.xml"]
EXIT_FILE = "FMEtitles-merged.xml"
DISABLED = False # Use to prevent accidental URI changes
##################################################################



class Article:
	title = ''
	start_title = ''
	end_title = ''
	filename = ''



# Add article title and metadata to xml tree
def add_artice(elem:Article, root:ET.Element, num:int):
	translitted = translit(elem.title[:elem.title.find(' ')], 'ru', True)
	translitted = translitted.replace('/', '_')		# Prevent a slash from being treated as a subfolder later on
	article = ET.SubElement(root, 'article', {'uri':"http://libmeta.ru/fme/article/"+str(num)+"_" + translitted})
	title = ET.SubElement(article, 'title')
	title.text = elem.title
	title_meta = ET.SubElement(article, 'title-meta')
	title_file = ET.SubElement(title_meta, 'title-file')
	title_file.text = elem.filename
	title_start = ET.SubElement(title_meta, 'title-start')
	title_start.text = str(int(elem.start_title) + 1)
	title_end = ET.SubElement(title_meta, 'title-end')
	title_end.text = elem.end_title


# Collect all the articles
articles_dict = {}
for filename in INPUT_FILES:
	root = parse_xml(WORK_DIR + TITLES_DIR + filename)
	for article in root:
		num = ()
		title = get_xml_elem(article, 'title').text
		elem = get_xml_elem(article, 'title-meta/title-file')
		page = elem.text[elem.text.find('-')+1:elem.text.find('_')]
		pos = get_xml_elem(article, 'title-meta/title-start').text
		start = get_xml_elem(article, 'title-meta/title-start').text
		end = get_xml_elem(article, 'title-meta/title-end').text
		file = get_xml_elem(article, 'title-meta/title-file').text
		num = (int(page), int(pos))
		articles_dict[num] = {'title':title, 'file':file, 'start':start, 'end':end}


# Sort keys and write articles accordingly
root = ET.Element('data')
nums_list = sorted(articles_dict)
for num in range(len(nums_list)):
	article = Article()
	article.title = articles_dict[nums_list[num]]['title']
	article.start_title = articles_dict[nums_list[num]]['start']
	article.end_title = articles_dict[nums_list[num]]['end']
	article.filename = articles_dict[nums_list[num]]['file']
	add_artice(article, root, num + 1)
if not DISABLED:
	xml_write(root, WORK_DIR + EXIT_FILE)

# 4. Article text parser

Using the information in the given titles file, extracts the raw text of each article. Every article goes into its own .xml file, named after the article number and the transliterated first word of the title.
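
The core of the extraction is slicing text between two (page, offset) positions. A minimal sketch, assuming the pages are already loaded into a dict keyed by page number; the function name `slice_article` and the sample pages are hypothetical:

```python
def slice_article(pages, start_page, start_pos, end_page, end_pos):
    """Join the text between (start_page, start_pos) and (end_page, end_pos).

    `pages` maps a page number to the full text of that page.
    """
    if start_page == end_page:
        return pages[start_page][start_pos:end_pos]
    parts = [pages[start_page][start_pos:]]
    # Whole pages lying strictly between the first and the last one.
    for page in range(start_page + 1, end_page):
        if page in pages:
            parts.append(pages[page])
    parts.append(pages[end_page][:end_pos])
    # A space between pages keeps words from merging.
    return ' '.join(parts)

# Hypothetical pages: the article starts after "TITLE A " and ends before " TITLE B"
pages = {1: 'TITLE A body of a', 2: 'continues here', 3: 'and ends. TITLE B text'}
text = slice_article(pages, 1, 8, 3, 9)
```

Here the article body starts where its own title ends and stops where the next title begins, which is how the cell that follows derives the offsets.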

In [None]:
# 4. Article text parser

############################ VARS ################################
TITLES_FILE = "./results/FMEtitles-merged.xml"
PAGES_DIR = "./matphys/rpages/"
EXIT_DIR = "./results/FMEarticles/"
COMBINATIONS_CORR = {
	'І' : 'I'		# These two are different!
}
##################################################################


class Article:
	start_file = ''
	start_pos = 0
	end_file = ''
	end_pos = 0
	text = ''
	text_orig = ''
	uri = ''
	title = ''
	xml = ''

	def get_text(self):
		# Get filenames
		filenames_raw = next(walk(PAGES_DIR), (None, None, []))[2]  # [] if no file
		filenames = []
		for filename in filenames_raw:
			if filename[-4:] == ".mmd":
				filenames.append(filename)
		if self.start_file == self.end_file:
			with codecs.open(PAGES_DIR + self.start_file, 'r', 'utf-8') as f:
				self.text += f.read()[self.start_pos:self.end_pos]
		else:
			with codecs.open(PAGES_DIR + self.start_file, 'r', 'utf-8') as f:
				self.text += f.read()[self.start_pos:]
			for page in range(int(self.start_file[3:self.start_file.find('_')]) + 1, int(self.end_file[3:self.end_file.find('_')])):
				for filename in filenames:
					if int(filename[3:filename.find('_')]) == page:
						self.text += ' ' # Add a space to prevent word merging
						with codecs.open(PAGES_DIR + filename, 'r', 'utf-8') as f:
							self.text += f.read()
			self.text += ' ' # Add a space to prevent word merging
			with codecs.open(PAGES_DIR + self.end_file, 'r', 'utf-8') as f:
				self.text += f.read()[:self.end_pos]
		for comb in COMBINATIONS_CORR.keys():
			self.text = self.text.replace(comb, COMBINATIONS_CORR[comb])
		# Trim leading separators and punctuation, then trailing whitespace
		self.text = self.text.lstrip(' ,.:;-\n\r').rstrip(' \n\r')
		self.text_orig = self.text
		# Fix several capital symbols per word
		word_left = 0
		word_right = 0
		while word_left < len(self.text):
			word_right = min(len(self.text), self.text.find(' ', word_left) if self.text.find(' ', word_left) != -1 else len(self.text))
			word_right = min(word_right, self.text.find('\n', word_left) if self.text.find('\n', word_left) != -1 else len(self.text))
			word_right = min(word_right, self.text.find('\r', word_left) if self.text.find('\r', word_left) != -1 else len(self.text))
			word_right = min(word_right, self.text.find('-', word_left) if self.text.find('-', word_left) != -1 else len(self.text))
			word_right = min(word_right, self.text.find('.', word_left) if self.text.find('.', word_left) != -1 else len(self.text))
			word = self.text[word_left:word_right]
			if word != None and len(word) > 1 and not check_in_uri(self.text, word_left) and not check_in_formula(self.text, word_left) and not check_in_link(self.text, word_left):
				word = word[0] + word[1:len(word)].lower()
				self.text = self.text[:word_left] + word + self.text[word_right:]
			word_left = word_right + 1
	
	def make_xml(self):
		self.get_text()

		article = ET.Element("article", {'uri':self.uri})
		title = ET.SubElement(article, 'title')
		title.text = self.title
		author = ET.SubElement(article, 'authors')
		title_short = ET.SubElement(article, 'title_short')
		pages = ET.SubElement(article, 'pages')
		start = ET.SubElement(pages, 'start')
		start.text = self.start_file[3:self.start_file.find('_', 3)]
		end = ET.SubElement(pages, 'end')
		end.text = self.end_file[3:self.end_file.find('_', 3)]
		literature = ET.SubElement(article, 'literature')
		literature_orig = ET.SubElement(literature, 'literature_orig')
		formulas_remote = ET.SubElement(article, 'formulas_main')
		formulas_inline = ET.SubElement(article, 'formulas_aux')
		relations = ET.SubElement(article, 'relations')
		text = ET.SubElement(article, 'text')
		text.text = self.text
		text_orig = ET.SubElement(article, 'text_orig')
		text_orig.text = self.text_orig

		self.xml = prettify(article)
	
	

class Title:
	text = ''
	file = ''
	start_pos = 0
	end_pos = 0
	uri = ''


def get_title(n:int, root:ET.Element) -> Title:
	otitle = Title()
	for title in root:
		if int(title.attrib['uri'][30:title.attrib['uri'].find('_', 30)]) == n:
			otitle.uri = title.attrib['uri']
			otitle.text = get_xml_elem(title, 'title').text
			otitle.file = get_xml_elem(title, 'title-meta/title-file').text
			otitle.start_pos = int(get_xml_elem(title, 'title-meta/title-start').text)
			otitle.end_pos = int(get_xml_elem(title, 'title-meta/title-end').text)
	return otitle


root = parse_xml(TITLES_FILE)

# Create articles list
articles_list = []
title = Title()
for i in range(len(root)):
	title = get_title(i + 1, root)
	if i:
		articles_list[-1].end_file = title.file
		articles_list[-1].end_pos = max(title.start_pos - 2, 0) # There is a shift for some reason
	articles_list.append(Article())
	articles_list[-1].uri = title.uri
	articles_list[-1].title = title.text
	articles_list[-1].start_file = title.file
	articles_list[-1].start_pos = title.end_pos
	articles_list[-1].end_file = title.file
	with codecs.open(PAGES_DIR + title.file, 'r', 'utf-8') as f:
		articles_list[-1].end_pos = len(f.read())

# Parse texts themselves and write
for i in range(len(articles_list)):
	articles_list[i].make_xml()
	with codecs.open(EXIT_DIR + '' + articles_list[i].uri[30:] + '.xml', 'w', 'utf-8') as f:
		f.write(articles_list[i].xml)

# 5. Spell checking of article texts

## 5.1. Scanner

Scans the texts of the articles in the given range and writes every word that looks suspicious to a separate xml file of the following format:
- Article (file name in the attributes)
  - Word (position in the text and flags in the attributes)
    - Original spelling
    - Context string (its size is set in the script parameters)
    - Suggested replacement

Two flags decide what happens to a word afterwards: "result" (0 = keep the original, 1 = use the suggestion) and "add to dictionary" (0 = do not add, 1 = add; applies to whichever result was chosen).
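
To make the record layout concrete, here is a sketch that builds one `<word>` entry with its context window. The helper name `add_word` and the sample sentence are hypothetical; the element and attribute names follow the cell that follows.

```python
import xml.etree.ElementTree as ET

def add_word(article, text, pos, source, suggestion, context_size=20):
    """Append one <word> record with its context window to <article>."""
    word = ET.SubElement(article, 'word',
                         {'pos': str(pos), 'result': '1', 'add_to_pwl': '0'})
    ET.SubElement(word, 'source').text = source
    lo = max(0, pos - context_size)
    hi = min(len(text), pos + len(source) + context_size)
    # Line breaks are escaped so that the context stays on one line.
    context = text[lo:hi].replace('\n', '\\n').replace('\r', '\\r')
    ET.SubElement(word, 'context').text = context
    ET.SubElement(word, 'suggestion').text = suggestion
    return word

root = ET.Element('data')
article = ET.SubElement(root, 'article', {'filename': '4_ABELEVA.xml'})
text = 'the grup is\ncommutative'  # hypothetical article text
word = add_word(article, text, text.find('grup'), 'grup', 'group', context_size=5)
```

A reviewer then only edits the `result` and `add_to_pwl` attributes of each record by hand.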

In [None]:
# 5.1. Spell checking of article texts. Scanner.

############################ VARS ################################
ARTICLES_DIR = "./results/FMEarticles/"
EXIT_DIR = "./matphys/"
CONTEXT_SIZE = 20
START_ARTICLE = 1
END_ARTICLE = 1
DEFAULT_RESULT_FLAG = '1'
DEFAULT_ADD_TO_PWL_FLAG = '0'
##################################################################


spellcheck_dict_update()

# Get filenames needed
filenames = next(walk(ARTICLES_DIR), (None, None, []))[2]  # [] if no file
#filenames = ['4_ABELEVA.xml']

root = ET.Element('data')

for filename in filenames:
	article_number = int(filename[:filename.find('_')])
	if article_number < START_ARTICLE or article_number > END_ARTICLE:
		continue

	print(f'{filename}: found ', end='')
	article = parse_xml(ARTICLES_DIR + filename)
	text = get_xml_elem(article, 'text')

	#add_to_pwl(filename[filename.find('_')+1:filename.find('.xml')])

	text_suggestions = do_spellcheck(text.text)
	print(len(text_suggestions.keys()))
	if len(text_suggestions.keys()):
		article = ET.SubElement(root, 'article', {'filename': filename})
		for pos in text_suggestions.keys():
			#print(f'{pos}: {text_suggestions[pos][0]} -> {text_suggestions[pos][1]}')
			word = ET.SubElement(article, 'word', {'pos': str(pos), 'result': DEFAULT_RESULT_FLAG, 'add_to_pwl': DEFAULT_ADD_TO_PWL_FLAG})
			source = ET.SubElement(word, 'source')
			source.text = text_suggestions[pos][0]
			context = ET.SubElement(word, 'context')
			context_string = text.text[max(0, pos - CONTEXT_SIZE):min(len(text.text), pos + len(text_suggestions[pos][0]) + CONTEXT_SIZE)]
			while context_string.find('\n') != -1:
				context_string = context_string[:context_string.find('\n')] + '\\n' + context_string[context_string.find('\n')+1:]
			while context_string.find('\r') != -1:
				context_string = context_string[:context_string.find('\r')] + '\\r' + context_string[context_string.find('\r')+1:]
			context.text = context_string
			suggestion = ET.SubElement(word, 'suggestion')
			suggestion.text = text_suggestions[pos][1]


with codecs.open(EXIT_DIR + f'FMEspellcheck-a{START_ARTICLE}-{END_ARTICLE}.xml', 'w', 'utf-8') as f:
	f.write(prettify(root))

## 5.2. Dictionary update

Adds the words marked with the "add to dictionary" flag from all files in the spellcheck directory.
- The "result" flag determines whether the original or the corrected word is added.
- The dictionary is sorted alphabetically on every run.
- Duplicates are removed on every run (characters that differ only in case are not considered equal).
- Words added by hand are not removed on a run.

To merge your dictionary with another one, copy the entire contents of the new dictionary into yours and run the script. Duplicates will be removed and the resulting dictionary will be sorted.
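
The merge rule amounts to a case-sensitive set union followed by a sort. A minimal sketch, with the hypothetical helper name `merge_word_lists` and made-up sample words:

```python
def merge_word_lists(existing, additions):
    """Return a sorted, duplicate-free list; comparison is case-sensitive."""
    words = {w.strip() for w in list(existing) + list(additions)}
    words.discard('')  # drop blank lines
    return sorted(words)

merged = merge_word_lists(['тензор', 'Абель', 'тензор', ''], ['абель', 'Гильберт'])
```

Note that plain `sorted` orders by code point, so capitalised words come before lowercase ones; "alphabetical" holds only within each case.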

In [None]:
# 5.2. Spell checking of article texts. Dictionary update.

############################ VARS ################################
SPELLCHECK_DIR = "./results/FMEspellcheck/"
##################################################################


# Read PWL and form word list
with codecs.open(PERSONAL_WORD_LIST, 'r', 'utf-8') as f:
	PWL_text = f.read()
PWL_list_old = [i.strip() for i in PWL_text.split('\n')]
while '' in PWL_list_old:
	PWL_list_old.remove('')
PWL_text = ''

# Read all spellcheck outputs and create additions list
# Get filenames needed
filenames = next(walk(SPELLCHECK_DIR), (None, None, []))[2]  # [] if no file

additions = []
for filename in filenames:
	root = parse_xml(SPELLCHECK_DIR + filename)
	for article in root:
		if article.tag == "article":
			for word in article:
				if word.tag == "word":
					if word.attrib["add_to_pwl"] == '1' and word.attrib["result"] == '1':
						additions.append(get_xml_elem(word, 'suggestion').text.strip())
					elif word.attrib["add_to_pwl"] == '1' and word.attrib["result"] == '0':
						additions.append(get_xml_elem(word, 'source').text.strip())

# Make new PWL list and sort it
PWL_list_new = []
for word in PWL_list_old:
	if not word in PWL_list_new:
		PWL_list_new.append(word)
for word in additions:
	if not word in PWL_list_new:
		PWL_list_new.append(word)
PWL_list_new.sort()

# Write PWL
for word in PWL_list_new:
	PWL_text = PWL_text + word + '\n'
with codecs.open(PERSONAL_WORD_LIST, 'w', 'utf-8') as f:
	f.write(PWL_text)

## 5.3. Applying the corrected spelling

Substitutes the corrected words into the source text, or keeps the originals, depending on the "result" flag.
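
No cell is given for this step. One possible approach is sketched here, assuming each record carries the `pos`, `source`, `suggestion` and `result` values written by the scanner and that `pos` is a character offset into the article text. The function name `apply_corrections` and the sample sentence are hypothetical.

```python
def apply_corrections(text, words):
    """Apply (pos, source, suggestion, result) records to `text`.

    result == '1' substitutes the suggestion, '0' keeps the source.
    """
    # Work from the end so that earlier offsets stay valid.
    for pos, source, suggestion, result in sorted(words, reverse=True):
        if result != '1':
            continue
        if text[pos:pos + len(source)] != source:
            continue  # offset no longer matches, skip rather than corrupt
        text = text[:pos] + suggestion + text[pos + len(source):]
    return text

words = [(4, 'grup', 'group', '1'), (12, 'abelan', 'abelian', '0'), (19, 'iff', 'if', '1')]
fixed = apply_corrections('the grup is abelan iff ...', words)
```

Processing right to left means a replacement of a different length never shifts the offsets of records that are still pending.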

# 6. Article author parser

Looks at the end of each article text for constructs such as ` [Xxxx]. [Xxxx]. [Xxxx]` or ` [Xxxx].[Xxxx]. [Xxxx]` and interprets them as the article's author.
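
The pattern can also be expressed as a regular expression. This is a sketch rather than the method used in the cell that follows: it does not repair misrecognised digits such as `0` or `3`, and the names `AUTHOR_RE`, `pop_author` and `pop_authors` as well as the sample sentences are hypothetical.

```python
import re

# whitespace, two capitalised words ending in '.', then a capitalised surname
AUTHOR_RE = re.compile(
    r'\s([A-ZА-ЯЁ][^\s.]*\.)\s?([A-ZА-ЯЁ][^\s.]*\.)\s([A-ZА-ЯЁ][^\s.,]*)[.,]?\s*$'
)

def pop_author(text):
    """Split a trailing 'I. O. Surname' signature off the text, if present."""
    match = AUTHOR_RE.search(text)
    if match is None:
        return text, None
    author = ' '.join(match.groups())  # normalises 'А.Б.' to 'А. Б.'
    return text[:match.start()].rstrip(), author

def pop_authors(text):
    """Strip signatures repeatedly; return them in their original order."""
    authors = []
    while True:
        text, author = pop_author(text)
        if author is None:
            break
        authors.append(author)
    return text, authors[::-1]

body, authors = pop_authors('Группа коммутативна. А. Б. Иванов, В.Г. Петров')
```

Like the original heuristic, this can misfire on an article that merely ends with a capitalised phrase after two abbreviations, so the results still need a manual look.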

In [None]:
# 6. Article author parser

############################ VARS ################################
ARTICLES_DIR = "./results/FMEarticles/"
COMBINATIONS_CORR = dict_merge(COMBINATIONS_CORR_UNICODE, {
	'II' : 'П'
})
LOCAL_DICT = {'0':'О', '3':'З', '6':'б'}
##################################################################


# Get filenames needed
filenames = next(walk(ARTICLES_DIR), (None, None, []))[2]  # [] if no file

for filename in filenames:
	article = parse_xml(ARTICLES_DIR + filename)
	textelem = get_xml_elem(article, 'text')
	text = textelem.text
	authors = get_xml_elem(article, 'authors')

	auth_start = 1
	auth_list = []
	while auth_start and text != None:
		# Find first non-space from the end
		while text[-1] == ' ' or text[-1] == '\n' or text[-1] == '\r':
			text = text[:-1]

		auth_start = 0
		# Try recognize
		first_space = max(text.rfind(' ', 0, len(text)), text.rfind('\n', 0, len(text)), text.rfind('\r', 0, len(text)))
		second_space = max(text.rfind(' ', 0, first_space), text.rfind('\n', 0, first_space), text.rfind('\r', 0, first_space))
		third_space = max(text.rfind(' ', 0, second_space), text.rfind('\n', 0, second_space), text.rfind('\r', 0, second_space))
		if first_space >= 0 and text[first_space-1] == '.' and second_space >= 0:
			if text.find('.', second_space, first_space-1) != -1: # If there's no space between initials
				third_space = second_space
				second_space = first_space
			if text[second_space-1] == '.' and third_space >= 0:
				# Check if first letters of each word are capitals
				keep = text
				for comb in LOCAL_DICT.keys():
					while text[third_space+1:].find(comb) != -1:
						text = text[:third_space+1+text[third_space+1:].find(comb)] + LOCAL_DICT[comb] + text[third_space+2+text[third_space+1:].find(comb):]
				if re.match(r"[A-ZА-ЯІ]", text[first_space+1]) != None and re.match(r"[A-ZА-ЯІ]", text[second_space+1]) != None and re.match(r"[A-ZА-ЯІ]", text[third_space+1]) != None:
					auth_start = third_space + 1
				text = keep

		if auth_start: # Assume an article cannot consist of an author only, so auth_start must be > 0
			#print(article.attrib['uri'], author_text)
			author_text = text[auth_start:]
			if author_text[author_text.find('.')+1] != ' ': # Add space if there's no one between initials
				author_text = author_text[:author_text.find('.')+1] + ' ' + author_text[author_text.find('.')+1:]
			if author_text[-1] == '.' or author_text[-1] == ',':
				author_text = author_text[:-1]
			# convert wrong symbols
			for comb in dict_merge(COMBINATIONS_CORR, LOCAL_DICT).keys():
				while author_text.find(comb) != -1:
					author_text = author_text[:author_text.find(comb)] + dict_merge(COMBINATIONS_CORR, LOCAL_DICT)[comb] + author_text[author_text.find(comb) + len(comb):]
			
			auth_list.append(author_text)
			text = text[:auth_start]

	# add authors, reverse their order to alphabetic
	for auth in reversed(auth_list):
		author = ET.SubElement(authors, 'author')
		author.text = auth

	textelem.text = text
	with codecs.open(ARTICLES_DIR + filename, 'w', 'utf-8') as f:
		f.write(prettify(article))

# 7. Bibliography parser

Once the article authors have been extracted, the only thing left after the article text is the bibliography line, if there is one at all. The fragment starting with "`Лит.:`" is therefore located and extracted. It is split into segments at "`[num]`", and each segment is split into sub-fragments at commas. A segment is assumed to have the following general form: "`[Authors (possibly several, recognised by initials at the end of a sub-fragment)], Title (may contain commas), Volume number (may be absent), [Publication details (may be partly or wholly absent)], Year, [Other (chapters, pages and so on, may be absent)];`"
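
The first two stages, splitting at `[num]` and picking the leading author fragments, can be sketched as follows. The bibliography string is an invented placeholder, not a real reference, and the helper names `split_units`, `looks_like_author` and `leading_authors` are hypothetical.

```python
def split_units(literature):
    """Split 'Лит.: [1] ...; [2] ...' into a list of entry strings."""
    units = []
    num = 1
    while f'[{num}]' in literature:
        marker = f'[{num}]'
        start = literature.find(marker) + len(marker)
        end = literature.find(f'[{num + 1}]')
        end = len(literature) if end == -1 else end
        units.append(literature[start:end].strip(' ;\n\r'))
        num += 1
    return units

def looks_like_author(fragment):
    """True if the fragment ends with an initial such as 'И.'."""
    fragment = fragment.strip()
    return len(fragment) >= 2 and fragment[-1] == '.' and fragment[-2].isupper()

def leading_authors(unit):
    """Collect comma-separated fragments up to the first non-author one."""
    authors = []
    for fragment in unit.split(','):
        if not looks_like_author(fragment):
            break  # the title starts here
        authors.append(fragment.strip())
    return authors

# Invented placeholder entries
literature = ('Лит.: [1] Иванов И. И., Название, т. 1, М., 1970; '
              '[2] Петров П. П., Сидоров С. С., Другое название, 2 изд., Л., 1980.')
units = split_units(literature)
```

Only the leading fragments are tested for initials because a later fragment such as `М.` (the place of publication) would otherwise pass the same check.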

In [None]:
# 7. Bibliography parser

############################ VARS ################################
ARTICLES_DIR = "./results/FMEarticles/"
COMBINATIONS_CORR_LOCAL = dict_merge(dict_merge(COMBINATIONS_CORR_ALPHABET, COMBINATIONS_CORR_UNICODE), {'J':'Л'})
##################################################################


class Unit:
	authors = []
	title = ""
	publication = ""
	year = ""
	other = ""


# Get filenames needed
filenames = next(walk(ARTICLES_DIR), (None, None, []))[2]  # [] if no file

for filename in filenames:
	article = parse_xml(ARTICLES_DIR + filename)
	textelem = get_xml_elem(article, 'text')
	text = textelem.text
	literature = get_xml_elem(article, 'literature')
	literature_orig = get_xml_elem(literature, 'literature_orig')

	if textelem.text != None and len(textelem.text):
		#Find literature start position and extract if present
		for key in COMBINATIONS_CORR_LOCAL.keys():
			while text.find(key) != -1:
				text = text[:text.find(key)] + COMBINATIONS_CORR_LOCAL[key] + text[text.find(key)+1:]
		text = text.upper()
		lit_pos = text.rfind('\nЛИТ.: ')
		lit_pos = text.rfind('\rЛИТ.: ') if lit_pos == -1 else lit_pos
		lit_pos = text.rfind(' ЛИТ.: ') if lit_pos == -1 else lit_pos
		if lit_pos != -1:
			literature_orig.text = textelem.text[lit_pos:]
			while literature_orig.text[0] in [' ', '\n', '\r']:
				literature_orig.text = literature_orig.text[1:]
			textelem.text = textelem.text[:lit_pos]
			while textelem.text[-1] in [' ', '\n', '\r']:
				textelem.text = textelem.text[:-1]


			# Parse literature string
			text = literature_orig.text
			units = []
			num = 1
			while text.find('['+str(num)+']') != -1:
				units.append(text[text.find('['+str(num)+']')+len('['+str(num)+']'):(text.find('['+str(num+1)+']') if text.find('['+str(num+1)+']') != -1 else len(text))])
				num += 1
			for unit in units:
				logical_parts = Unit()
				logical_parts.authors.clear()
				subunits = unit.split(',')
				while '' in subunits:
					subunits.remove('')
				pos_last_auth = -1
				pos_last_title = -1
				pos_thome = -1
				pos_transl = -1
				pos_pub_num = -1
				pos_pub_place = -1
				pos_year = -1


				# Define positions of the most common parts of the literature string
				for i in range(len(subunits)):
					text = subunits[i]
					while text[-1] in [' ', '\n', '\r', ';']:
						text = text[:-1]
					while text[0] in [' ', '\n', '\r']:
						text = text[1:]
					subunits[i] = text

					if pos_last_auth + 1 == i: # Recognize authors
						keep = text
						for j in range(len(text)):
							if text[j] in COMBINATIONS_CORR_UNICODE:
								text = text[:j] + COMBINATIONS_CORR_UNICODE[text[j]] + text[j+1:]
						if text[-1] == '.' and re.match(r"[А-ЯA-Z]", text[-2]) != None and text[-3] == ' ' and text[-4] == '.' and re.match(r"[А-ЯA-Z]", text[-5]) != None:
							# "X. X."
							pos_last_auth = i
							pos_initials = -5
						elif text[-1] == '.' and re.match(r"[А-ЯA-Z]", text[-2]) != None and text[-3] == '.' and re.match(r"[А-ЯA-Z]", text[-4]) != None:
							# "X.X."
							pos_last_auth = i
							text = text[:-2] + ' ' + text[-2:]
							pos_initials = -5
						elif text[-1] == '.' and re.match(r"[А-ЯA-Z]", text[-2]) != None:
							# "X."
							pos_last_auth = i
							pos_initials = -2
						else: # Title starts
							text = keep
						# If correct
						if pos_last_auth == i:
							surname = text[:pos_initials]
							while surname.find(' ') != -1:
								surname = surname[:surname.find(' ')] + surname[surname.find(' ')+1:]
							text = surname + ' ' + text[pos_initials:]
							j = 1
							while j < len(text):
								if re.match(r"[А-ЯA-Z]", text[j]) != None and re.match(r"[а-яa-z]", text[j-1]) != None:
									text = text[:j] + ' ' + text[j:]
									j = 1
								else:
									j += 1
							subunits[i] = text
					else:
						if pos_thome == -1: # Recognize volume number ("т.")
							keep = text
							for j in range(len(text)):
								if text[j] in COMBINATIONS_CORR_GLOBAL:
									text = text[:j] + COMBINATIONS_CORR_GLOBAL[text[j]] + text[j+1:]
							if text.upper().find('Т.') != -1:
								pos_last_title = (i - 1) if pos_last_title == -1 else pos_last_title
								pos_thome = i
							text = keep
						if pos_transl == -1: # Recognize translation note ("пер.")
							keep = text
							for j in range(len(text)):
								if text[j] in COMBINATIONS_CORR_GLOBAL:
									text = text[:j] + COMBINATIONS_CORR_GLOBAL[text[j]] + text[j+1:]
							if text.upper().find('ПЕР.') != -1:
								pos_last_title = (i - 1) if pos_last_title == -1 else pos_last_title
								pos_transl = i
							text = keep
						if pos_pub_num == -1: # Recognize publication number
							keep = text
							for j in range(len(text)):
								if text[j] in COMBINATIONS_CORR_GLOBAL:
									text = text[:j] + COMBINATIONS_CORR_GLOBAL[text[j]] + text[j+1:]
							if text.upper().find('ИЗД.') != -1:
								pos_last_title = (i - 1) if pos_last_title == -1 else pos_last_title
								pos_pub_num = i
							text = keep
						if pos_pub_place == -1: # Recognize publication place
							keep = text
							for j in range(len(text)):
								if text[j] in COMBINATIONS_CORR_GLOBAL:
									text = text[:j] + COMBINATIONS_CORR_GLOBAL[text[j]] + text[j+1:]
							if text.upper() in ['М.', 'Л.', 'СПБ.', 'М.Л.', 'Л.М.', 'М.СПБ.', 'СПБ.М.']:
								pos_last_title = (i - 1) if pos_last_title == -1 else pos_last_title
								pos_pub_place = i
							text = keep
						# If correct
						if pos_last_auth != i and (pos_thome == i or pos_pub_num == i or pos_pub_place == i):
							fixed = text
							for j in range(len(fixed)):
								if fixed[j] in COMBINATIONS_CORR_UNICODE:
									fixed = fixed[:j] + COMBINATIONS_CORR_UNICODE[fixed[j]] + fixed[j+1:]
							subunits[i] = fixed		# Keep every replacement, not only the last one

						if pos_year == -1 and len(text) >= 4: # Recognize year
							numbers = ['0','1','2','3','4','5','6','7','8','9']
							for j in range(len(text) - 3):
								if text[j] in numbers and text[j+1] in numbers and text[j+2] in numbers and text[j+3] in numbers:
									pos_last_title = (i - 1) if pos_last_title == -1 else pos_last_title
									pos_year = i
									break
							# if correct
							if pos_year == i:
								subunits[i] = text[j:j+4]


				# Extract info from literature string using positions defined above
				for i in range(len(subunits)):
					text = subunits[i]
					if pos_last_auth >= i: # Author
						logical_parts.authors.append(text)
					elif pos_last_auth < i and pos_last_title >= i: # Title
						logical_parts.title = logical_parts.title + ('' if len(logical_parts.title) == 0 else ', ') + text
					elif pos_year == i: # Year
						logical_parts.year = logical_parts.year + ('' if len(logical_parts.year) == 0 else ', ') + text
					elif ((pos_pub_num <= i and pos_pub_num != -1) or (pos_pub_place <= i and pos_pub_place != -1) or (pos_transl <= i and pos_transl != -1) or (pos_thome + 1 <= i and pos_thome != -1)) and pos_year > i: # Publication
						logical_parts.publication = logical_parts.publication + ('' if len(logical_parts.publication) == 0 else ', ') + text
					else: # Other
						logical_parts.other = logical_parts.other + ('' if len(logical_parts.other) == 0 else ', ') + text


				# Debug section
				"""print('\n', filename, unit)
				print('authors:', logical_parts.authors)
				print('title:', logical_parts.title)
				print('publication:', logical_parts.publication)
				print('year:', logical_parts.year)
				print('other:', logical_parts.other)
				print(pos_last_auth, pos_last_title, pos_thome, pos_transl, pos_pub_num, pos_pub_place, pos_year)"""


				# Add literature unit
				unit = ET.SubElement(literature, "unit")
				for auth_str in logical_parts.authors:
					author = ET.SubElement(unit, "author")
					author.text = auth_str
				title = ET.SubElement(unit, "title")
				title.text = logical_parts.title
				publication = ET.SubElement(unit, "publication")
				publication.text = logical_parts.publication
				year = ET.SubElement(unit, "year")
				year.text = logical_parts.year
				other = ET.SubElement(unit, "other")
				other.text = logical_parts.other


			# Write xml
			with codecs.open(ARTICLES_DIR + filename, 'w', 'utf-8') as f:
				f.write(prettify(article))

# 8. "See also" link parser

Looks in the text for references that start with `"см. [other optional lead-in words]"` and tries to find the matching articles in the encyclopedia.
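
The matching idea is to grow the candidate word by word and compare it with the leading words of every title. The sketch below is a simplification and not the exact control flow of the cell that follows: it always prefers the longest match and skips the Latin-letter and formula normalisation. The names `match_titles` and `resolve_link` and the sample titles are hypothetical.

```python
def match_titles(words, titles):
    """Return indices of titles whose leading words equal `words`."""
    return [index for index, title in enumerate(titles)
            if title.split(' ')[:len(words)] == words]

def resolve_link(text_after_see, titles):
    """Grow the candidate word by word; return the best title index or None."""
    words = []
    best = None
    for word in text_after_see.upper().split():
        words.append(word.strip('.,;:'))
        hits = match_titles(words, titles)
        if not hits:
            break  # no title starts with this word sequence
        exact = [i for i in hits if titles[i].split(' ') == words]
        if exact:
            best = exact[0]   # the whole title is covered
        elif len(hits) == 1:
            best = hits[0]    # unique title that is longer than the link text
    return best

titles = ['АБЕЛЕВА ГРУППА', 'АБЕЛЕВА ФУНКЦИЯ', 'ГРУППА']
found = resolve_link('Абелева группа, а также', titles)
```

A single word that is a prefix of several titles is left unresolved until the following words disambiguate it.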

In [None]:
# 8. "See also" link parser

############################ VARS ################################
ARTICLES_DIR = "./results/FMEarticles/"
##################################################################


class Article:
	title = ''
	file = ''


# Find previous space / newline from given position
def find_prev_space(text: str, start_pos: int) -> int:
	newpos_s = text.rfind(' ', 0, start_pos)
	newpos_s = 0 if newpos_s == -1 else newpos_s
	newpos_n = text.rfind('\n', 0, start_pos)
	newpos_n = 0 if newpos_n == -1 else newpos_n
	newpos_r = text.rfind('\r', 0, start_pos)
	newpos_r = 0 if newpos_r == -1 else newpos_r
	return max(newpos_s, newpos_n, newpos_r)
# Find next space / newline from given position
def find_next_space(text: str, start_pos: int) -> int:
	newpos_s = text.find(' ', start_pos + 1)
	newpos_s = len(text) if newpos_s == -1 else newpos_s
	newpos_n = text.find('\n', start_pos + 1)
	newpos_n = len(text) if newpos_n == -1 else newpos_n
	newpos_r = text.find('\r', start_pos + 1)
	newpos_r = len(text) if newpos_r == -1 else newpos_r
	return min(newpos_s, newpos_n, newpos_r)


# Try to find a title matching the given word sequence.
# Returns (match_possible, match_single, match_exact, position in titles_list)
def find_matching_title(seq: str, titles_list: list) -> (bool, bool, bool, int):
	match_possible = False
	matches_list = []
	list_pos = 0
	seq_list = seq.split(' ')
	while list_pos < len(titles_list):
		title = titles_list[list_pos].title.split(' ')
		match_local = False
		if len(title) >= len(seq_list):
			match_local = True
			for i in range(len(seq_list)):
				match_local = False if seq_list[i] != title[i] else match_local
			match_possible = match_possible or match_local
			if match_local and len(title) == len(seq_list):
				# Solid match found
				return (True, True, True, list_pos)
		if match_local:
			matches_list.append(list_pos)
		
		list_pos += 1

	# If only one local match consider possible solid match where title is longer than the link sequence
	if len(matches_list) == 1:
		return (True, True, False, matches_list[0])

	# No solid match found
	return (match_possible, False, False, -1)


# Get filenames needed
filenames = next(walk(ARTICLES_DIR), (None, None, []))[2]  # [] if no file

# Get all the titles into a list
titles_list = []
for filename in filenames:
	article = parse_xml(ARTICLES_DIR + filename)
	title = get_xml_elem(article, 'title').text
	article_obj = Article()
	article_obj.title = title
	article_obj.file = filename
	titles_list.append(article_obj)

### DEBUG												###
#cnt = 5													###
#filenames = ['3564_JuLA-FARRI.xml']	###
###################################
n = 0
for filename in filenames:
	### DEBUG				###
	#cnt -= 1				###
	#if not cnt:			###
	#	break					###
	###################
	article = parse_xml(ARTICLES_DIR + filename)
	textelem = get_xml_elem(article, 'text')
	text = textelem.text

	if text != None and len(text):
		# Move along the text from right to left to allow easier uri insertion
		find_right = len(text)
		find_left = find_prev_space(text, find_right)

		# Find link starting word
		while find_left != -1:
			word = title_handle_latin(text[find_left:find_right].strip(), COMBINATIONS_CORR_GLOBAL).upper()
			if word == 'СМ.':
				border_left = find_right
				border_right = find_right
				_border_right = find_right
				border_find_allowed = True
				match_possible = False
				match_single = False
				_match_single = False
				match_exact = False
				match_pos = -1
				_match_pos = -1
				#border_cnt = 0																## DEBUG
				while border_find_allowed:
					border_right = find_next_space(text, border_right)
					#border_cnt += 1														## DEBUG
					border_find_allowed = False if border_right == len(text) else True
					event = title_handle_latin(text[find_right:border_right].strip(), COMBINATIONS_CORR_GLOBAL).upper()
					if event in ['В', 'ПРИ']:
						# Possible starting words continuation
						continue
					if event in ['ТАКЖЕ', 'В СТ.', 'ПРИ СТ.', 'seeAlso', 'sameAs']:
						# Confirmed starting words continuation
						find_right = border_right
						#border_cnt = 0														## DEBUG
						continue

					# Extract word sequence and try find a matching title from list
					event = title_handle_formulas(title_handle_bounding(title_handle_latin(text[border_left:border_right].strip(), COMBINATIONS_CORR_GLOBAL).upper()), text[border_left:border_right].strip())
					(match_possible, match_single, match_exact, match_pos) = find_matching_title(event, titles_list)
					border_find_allowed = border_find_allowed and match_possible
					#border_find_allowed = border_find_allowed or border_cnt <= 5 ##DEBUG
					# Remember if single match
					if match_single:
						_border_right = border_right
						_match_pos = match_pos
					# Consider last single match as exact
					if (not match_single and _match_single) or (not border_find_allowed and match_single):
						border_right = _border_right
						match_pos = _match_pos
						match_exact = True
					_match_single = match_single
					print(event, match_possible, match_single, match_exact, match_pos, titles_list[match_pos].title)
					# Process exact match
					if match_exact:
						_match_single = False
						# Add an inter-link
						n += 1
						border_left += 1 if text[border_left] in [' ', '\n', '\r'] else 0
						while re.match(r"[!#%&'*+-.^_`|~:;]", text[border_right - 1]) != None:
							border_right -= 1
						uri = 'http://libmeta.ru/fme/relation' + article.attrib['uri'][article.attrib['uri'].rfind('/', 0, article.attrib['uri'].find('_')):article.attrib['uri'].find('_')+1] + str(n) + article.attrib['uri'][article.attrib['uri'].find('_'):]
						relations = get_xml_elem(article, 'relations')
						relation = ET.SubElement(relations, 'relation', {'uri':uri})
						rel_text = ET.SubElement(relation, 'rel_text')
						rel_text.text = text[border_left:border_right]
						rel_tgt = ET.SubElement(relation, 'target')
						related_article = parse_xml(ARTICLES_DIR + titles_list[match_pos].file)
						rel_tgt.text = related_article.attrib['uri']
						text = text[:border_left] + 'URI[[' + uri + ']]/URI' + text[border_right:]
						# Continue in case of multilink
						border_left += len('URI[[' + uri + ']]/URI')
						while border_left < len(text) and not text[border_left] in [' ', '\n', '\r']:
							border_left += 1
						border_right = border_left

				## DEBUG
				#print(f'\nFound in {filename}:\n	{text[find_left:find_right].strip()} ||| {text[find_right:border_right].strip()} {"" if match_exact else "NO MATCH FOUND"}')
				#if match_exact:
				#	print(f'	Match in {titles_list[match_pos].file}, \"{titles_list[match_pos].title}\"')

			find_right = find_left
			find_left = find_prev_space(text, find_left) if find_left else -1

	# Write xml
	textelem.text = text
	with codecs.open(ARTICLES_DIR + filename, 'w', 'utf-8') as f:
		f.write(prettify(article))

# 9. Formula parser

Moves formulas out of the texts of the previously prepared article xml files, display formulas first and inline formulas second, leaving a reference in their place inside the original math environment.

The minimum length in characters that an inline formula must have is configurable.
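
For display formulas the replacement can be sketched with a regular expression instead of manual index arithmetic. The names `FORMULA_RE` and `extract_display_formulas` and the sample text are hypothetical; the URI layout `main/<article number>_<formula number>_<slug>` mirrors what the cell that follows builds from the article URI.

```python
import re

# \[ , optional newlines, the formula body, optional newlines, \]
FORMULA_RE = re.compile(r'\\\[(\n*)(.*?)(\n*)\\\]', flags=re.DOTALL)

def extract_display_formulas(text, article_uri):
    """Replace display-formula bodies with URI placeholders.

    Returns the new text and a dict mapping each formula URI to its LaTeX.
    """
    # '.../article/1_Kraevaya' -> ('1', 'Kraevaya')
    number, slug = article_uri.rsplit('/', 1)[1].split('_', 1)
    formulas = {}

    def replace(match):
        uri = f'http://libmeta.ru/fme/formula/main/{number}_{len(formulas) + 1}_{slug}'
        formulas[uri] = match.group(2)
        # Keep the delimiters and the surrounding newlines in the text.
        return '\\[' + match.group(1) + 'URI[[' + uri + ']]/URI' + match.group(3) + '\\]'

    return FORMULA_RE.sub(replace, text), formulas

text = 'Let\n\\[\na^2+b^2=c^2\n\\]\nhold.'
new_text, formulas = extract_display_formulas(text, 'http://libmeta.ru/fme/article/1_Kraevaya')
```

Because the placeholder stays inside `\[ ... \]`, the text still shows where a display formula stood and the formula can be restored from the dict by its URI.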

In [None]:
# 9. Formula parser

############################ VARS ################################
ARTICLES_DIR = "./results/FMEarticles/"
MIN_INLINE_LEN = 0
##################################################################


# Get filenames needed
filenames = next(walk(ARTICLES_DIR), (None, None, []))[2]  # [] if no file

for filename in filenames:
	article = parse_xml(ARTICLES_DIR + filename)
	#print('REMOTES: ' + article.attrib['uri'])
	text = get_xml_elem(article, 'text')
	formulas_main = get_xml_elem(article, 'formulas_main')
	formulas_aux = get_xml_elem(article, 'formulas_aux')
			
	# Get main formulas
	pos_find = 0
	pos_start = 0
	pos_end = 0
	n = 1
	while text.text is not None and text.text.find('\\[', pos_find) != -1:
		pos_start = text.text.find('\\[', pos_find) + 2
		pos_end = text.text.find('\\]', pos_start)
		# Stop if the closing delimiter is missing
		if pos_end == -1:
			break
		while pos_start < pos_end and text.text[pos_start] == '\n':
			pos_start += 1
		while pos_end > pos_start and text.text[pos_end-1] == '\n':
			pos_end -= 1
		pos_find = pos_start
		uri = 'http://libmeta.ru/fme/formula/main' + article.attrib['uri'][article.attrib['uri'].rfind('/', 0, article.attrib['uri'].find('_')):article.attrib['uri'].find('_')+1] + str(n) + article.attrib['uri'][article.attrib['uri'].find('_'):]
		n += 1
		formula = ET.SubElement(formulas_main, 'formula', {'uri':uri})
		formula.text = text.text[pos_start:pos_end]
		text.text = text.text[:pos_start] + 'URI[[' + uri + ']]/URI' + text.text[pos_end:]

	# Get auxiliary (inline) formulas
	pos_find = 0
	pos_start = 0
	pos_end = 0
	cnt = 0
	n = 1
	# Count dollar symbols
	while text.text != None and text.text.find('$', pos_find) != -1:
		pos_find = text.text.find('$', pos_find) + 1
		cnt += 1
	# If cnt is not even assume that first one is garbage from title
	pos_find = 0
	if cnt % 2:
		pos_find = text.text.find('$', pos_find)
		text.text = text.text[:pos_find] + '#' + text.text[pos_find+1:]
	while text.text != None and text.text.find('$', pos_find) != -1:
		pos_start = text.text.find('$', pos_find) + 1
		pos_end = text.text.find('$', pos_start)
		if not check_in_uri(text.text, pos_start) and not check_in_uri(text.text, pos_end):
			while text.text[pos_start] == '\n':
				pos_start += 1
			while text.text[pos_end-1] == '\n':
				pos_end -= 1
			pos_find = pos_start
			if pos_end - pos_start >= MIN_INLINE_LEN:
				uri = 'http://libmeta.ru/fme/formula/aux' + article.attrib['uri'][article.attrib['uri'].rfind('/', 0, article.attrib['uri'].find('_')):article.attrib['uri'].find('_')+1] + str(n) + article.attrib['uri'][article.attrib['uri'].find('_'):]
				n += 1
				formula = ET.SubElement(formulas_aux, 'formula', {'uri':uri})
				formula.text = text.text[pos_start:pos_end]
				text.text = text.text[:pos_start] + 'URI[[' + uri + ']]/URI' + text.text[pos_end:]
			pos_find = text.text.find('$', pos_find) + 1
		else:
			pos_find = pos_end + 1

	with codecs.open(ARTICLES_DIR + filename, 'w', 'utf-8') as f:
		f.write(prettify(article))

## 9.1. Formula export

Collects all formulas into a separate file, keeping the type of each one, for possible further processing.
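
As a rough illustration of the merge, the sketch below collects the children of `formulas_main` and `formulas_aux` from several article trees under a single `<formulas>` root. It is an assumption-laden sketch: the function name `collect_formulas` is hypothetical, and it uses the standard `Element.find` where the notebook uses its own `get_xml_elem` helper, whose behaviour is not shown here. Elements are copied so the source trees stay unchanged.

```python
import copy
import xml.etree.ElementTree as ET


def collect_formulas(articles):
    """Merge formulas from several article trees under one <formulas> root."""
    merged = ET.Element('formulas')
    for article in articles:
        for container in ('formulas_main', 'formulas_aux'):
            parent = article.find(container)
            if parent is None:
                continue  # article has no such container
            for formula in parent:
                # Copy so the source tree is left unchanged
                merged.append(copy.deepcopy(formula))
    return merged


article = ET.fromstring(
    '<article>'
    '<formulas_main><formula uri="u/main/1">a</formula></formulas_main>'
    '<formulas_aux><formula uri="u/aux/1">b</formula></formulas_aux>'
    '</article>'
)
merged = collect_formulas([article])
```

In this sketch the formula type remains recoverable from the `uri` attribute (`main` or `aux`), which is copied along with the element.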

In [None]:
# 9.1. Formula export

############################ VARS ################################
ARTICLES_DIR = "./results/FMEarticles/"
EXIT_FILE = "./results/FMEformulas.xml"
##################################################################


# Get filenames needed
filenames = next(walk(ARTICLES_DIR), (None, None, []))[2]  # [] if no file


formulas = ET.Element('formulas')

for filename in filenames:
	root = parse_xml(ARTICLES_DIR + filename)
	fmain = get_xml_elem(root, 'formulas_main')
	faux = get_xml_elem(root, 'formulas_aux')
	
	for formula in fmain:
		formulas.append(formula)
	for formula in faux:
		formulas.append(formula)

with codecs.open(EXIT_FILE, 'w', 'utf-8') as f:
	f.write(prettify(formulas))

## 9.2. Formula check

Picks 20 formulas at random (from randomly chosen articles; the count is set by `NUMBER`) and places them in a Markdown math environment for visual inspection.
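
The rendering idea can be sketched on an in-memory list of formulas. This is a simplified illustration with assumptions: `render_sample` is a hypothetical helper, the formulas are passed in as `(uri, body)` pairs, and sampling is done without replacement from one flat list, whereas the cell below first picks a random article and then a random formula in it.

```python
import random


def render_sample(formulas, number, seed=None):
    """Return Markdown with up to `number` randomly chosen formulas.

    `formulas` is a list of (uri, body) pairs; each body is wrapped
    in $$...$$ so that a Markdown viewer renders it as display math.
    """
    rng = random.Random(seed)
    chosen = rng.sample(formulas, min(number, len(formulas)))
    lines = []
    for i, (uri, body) in enumerate(chosen, start=1):
        lines.append(f'{i}. Formula {uri}:\n$${body}$$\n')
    return ''.join(lines)


markdown = render_sample([('u/1', 'a+b'), ('u/2', 'c')], 5, seed=0)
```

Passing a `seed` makes the selection reproducible, which helps when the same sample has to be reviewed twice.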

In [None]:
# 9.2. Formula check

############################ VARS ################################
ARTICLES_DIR = "./results/FMEarticles/"
EXIT_FILE = "./matphys/FMEformulas_check.md"
NUMBER = 20
##################################################################


# Get filenames needed
filenames = next(walk(ARTICLES_DIR), (None, None, []))[2]  # [] if no file


file = ''

i = 0
while i < NUMBER:
	root = parse_xml(ARTICLES_DIR + filenames[randint(0, len(filenames)-1)])

	# Get all the info from article
	fmain = get_xml_elem(root, 'formulas_main')
	start = get_xml_elem(root, 'pages/start').text
	

	# If there are no formulas in the article, try another one
	total_num = len(fmain)
	if not total_num:
		continue
	i += 1
	
	# Pick a formula index uniformly at random
	num = randint(0, total_num - 1)

	formula = fmain[num].text

	file += f'{i}. Article: {root.attrib["uri"]}, starts on p. {start}, formula {num + 1}:\n$${formula}$$\n'

with codecs.open(EXIT_FILE, 'w', 'utf-8') as f:
	f.write(file)