## Sources


1. Texts by N. I. Novikov and A. N. Radishchev's "Journey from St. Petersburg to Moscow": lib.ru (Maksim Moshkov's Library);
2. [Poems](http://rupoem.ru/radischev/all.aspx) by A. N. Radishchev.

## Task description

Different authors' writing styles have distinctive features: a vocabulary characteristic of that author alone, the number of words used, syntactic constructions, and so on. It turns out that the frequencies of the 100 to 200 words an author uses most often already differ markedly from one author to another, which makes them an important indicator for attributing authorship (Evert, Proisl, 2017). All of this opens the way to statistical analysis of texts and to stylometry, a method of attributing authorship on the basis of statistically extracted features.

Our task is to determine whether N. I. Novikov, to whom the work is officially attributed, really is the author of "Путешествие из И в Т" ("Journey from I to T"), or whether it was in fact written by A. N. Radishchev.

## Methods for authorship attribution

Various metrics have been developed for authorship attribution, among them the Delta index, which is based on measuring the distance between texts. Delta is the distance between the vector representations of texts in a multidimensional space in which each word is one axis (Argamon 2008).


Texts are represented as a "bag of words": each text becomes a vector of relative word frequencies, with one component per word of the corpus vocabulary (other ways of counting frequencies can also be used).
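The bag-of-words representation can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two toy texts are made up, and the helper names `relative_frequencies` and `frequency_table` are hypothetical, not part of the `delta` library used below.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in one text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def frequency_table(texts, n_mfw):
    """Build one vector per text over the n_mfw most frequent corpus words."""
    corpus_counts = Counter()
    for tokens in texts.values():
        corpus_counts.update(tokens)
    vocabulary = [word for word, _ in corpus_counts.most_common(n_mfw)]
    table = {}
    for name, tokens in texts.items():
        freqs = relative_frequencies(tokens)
        # Words that do not occur in a text get frequency 0.0
        table[name] = [freqs.get(word, 0.0) for word in vocabulary]
    return vocabulary, table


# Toy, made-up texts (not taken from the corpus analysed here)
texts = {
    "text_a": "и я не знаю что и как".split(),
    "text_b": "что и в не и на".split(),
}
vocabulary, table = frequency_table(texts, n_mfw=3)
print(vocabulary)
print(table)
```

Each row of `table` is one point in a space whose axes are the words in `vocabulary`; all distance measures discussed below operate on such rows.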

Different distance measures can be used:
1. Manhattan distance
2. Euclidean distance
3. Cosine similarity
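The three measures listed above can be written out directly. The sketch below uses only the standard library and two made-up vectors; the cosine measure is expressed as a distance (one minus the cosine similarity), which is an assumption about how it is turned into a distance.

```python
import math


def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Made-up vectors, chosen so the results are easy to check by hand
u = [3.0, 0.0, 4.0]
v = [0.0, 0.0, 8.0]
print(manhattan(u, v))        # 3 + 0 + 4 = 7.0
print(euclidean(u, v))        # sqrt(9 + 16) = 5.0
print(cosine_distance(u, v))  # 1 - 32 / (5 * 8), approximately 0.2
```

Defined this way, the cosine distance lies between 0 and 2: it exceeds 1 whenever the two vectors point in broadly opposite directions, which can happen once frequencies are standardized and take negative values.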

Combining a distance measure with a normalization method yields different Deltas:
1. Manhattan distance + normalization by relative word frequencies = Linear Delta
2. Euclidean distance + Z-score normalization = Quadratic Delta (assumes that word frequencies are independent of each other)
3. There are also other variations, such as Eder's Delta, which reduces the weights of rarely occurring words.
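To make the role of normalization concrete, the sketch below standardizes each word column to Z-scores and then applies two distances. The frequencies of "и", "в", "не" for three texts are copied from the frequency table printed further down in this notebook; computing Z-scores over only three texts and three words is purely illustrative, and the use of the sample standard deviation is an assumption (implementations differ on this point).

```python
import math
import statistics


def z_scores(table):
    """Standardize every word column: subtract its mean, divide by its std."""
    names = list(table)
    columns = list(zip(*(table[name] for name in names)))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]  # sample std (assumption)
    return {
        name: [(x - m) / s for x, m, s in zip(table[name], means, stds)]
        for name in names
    }


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Relative frequencies of "и", "в", "не" (three texts from the table below)
freqs = {
    "Новиков_Кошелек": [0.037543, 0.02733, 0.024022],
    "Новиков_Трутень": [0.038239, 0.020633, 0.02696],
    "Радищев_Журналы": [0.043231, 0.029419, 0.019695],
}
z = z_scores(freqs)

a, b = "Новиков_Кошелек", "Новиков_Трутень"
d_manhattan = manhattan(z[a], z[b])  # Manhattan distance on Z-scores
d_euclidean = euclidean(z[a], z[b])  # Euclidean distance on Z-scores (item 2)
print(d_manhattan, d_euclidean)
```

Standardizing first means that a very frequent word such as "и" no longer dominates the distance simply because its raw frequency is large; every word contributes on the same scale.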

All of these methods start from an $n \times n$ matrix of pairwise distances between the $n$ texts of the corpus. The texts are then clustered on the basis of this distance matrix, for example with hierarchical clustering visualized as a dendrogram.
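As a rough illustration of the clustering step, here is a minimal average-linkage agglomerative clustering in plain Python. The four text labels and their distances are invented for the example, and `average_linkage` is a hypothetical helper; in practice `scipy.cluster.hierarchy.linkage` and `dendrogram` would be the usual tools.

```python
def average_linkage(names, dist):
    """Merge the two closest clusters until one is left.

    Returns the merges as (cluster_a, cluster_b, distance) tuples, which is
    the information a dendrogram draws.
    """
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average distance over all cross-cluster pairs of texts
                pairs = [dist[a][b] for a in clusters[i] for b in clusters[j]]
                d = sum(pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Invented pairwise distances between four hypothetical texts
pairs = {
    ("a1", "a2"): 0.2, ("b1", "b2"): 0.3,
    ("a1", "b1"): 0.9, ("a1", "b2"): 1.0,
    ("a2", "b1"): 0.8, ("a2", "b2"): 0.9,
}
names = ["a1", "a2", "b1", "b2"]
dist = {name: {} for name in names}
for (x, y), d in pairs.items():
    dist[x][y] = d
    dist[y][x] = d  # the distance matrix is symmetric

merges = average_linkage(names, dist)
for left, right, d in merges:
    print(left, right, round(d, 3))
```

In this toy case the two texts of each hypothetical author merge first and the two author clusters are joined last, which is the pattern one hopes to see when Delta separates authors well.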


In [1]:
import re

import delta
import pymorphy2
from tqdm import tqdm

# Morphological analyzer for Russian, used below for lemmatization
morph = pymorphy2.MorphAnalyzer()

KMedoids clustering not available.
You need a patched scikit-learn, see README.txt
  warn("KMedoidsClustering not available")


In [2]:
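def slurp(path):
    """Read a whole text file into a string (assumes UTF-8 files)."""
    with open(path, 'r', encoding='utf-8') as fo:
        return fo.read()


def preprocess(path):
    """Lowercase the text and keep only Cyrillic words, hyphens, and digits."""
    text = slurp(path)
    pattern = re.compile(r'[а-яё\-\d]+')
    tokens = pattern.findall(text.lower())
    return ' '.join(tokens)


def lemmatize_text(text):
    """Replace every token with the normal form of its first (top-ranked) parse."""
    lemmas = [morph.parse(word)[0].normal_form for word in tqdm(text.split())]
    return ' '.join(lemmas)


def write_to_file(text, filename):
    """Write the processed text to a file (UTF-8)."""
    with open(filename, 'w', encoding='utf-8') as fo:
        fo.write(text)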

In [9]:
raw_corpus = delta.Corpus('/home/nst/mount/data/linguistics_hse/digital_humanities/corpus/')



In [10]:
raw_corpus

Unnamed: 0,и,в,не,что,на,я,его,с,то,но,...,одарен,огурцы,огромный,огромность,огромности,огромном,огромной,огромное,огорожены,яству
Новиков_Кошелек,261.0,190.0,167.0,117.0,30.0,149.0,23.0,53.0,57.0,57.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Новиков_Пословицы,575.0,342.0,250.0,221.0,159.0,91.0,204.0,118.0,63.0,123.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Новиков_Путешествие_из_И_в_Т,77.0,35.0,29.0,41.0,19.0,30.0,8.0,6.0,4.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
Новиков_Трутень,278.0,150.0,196.0,145.0,52.0,118.0,34.0,50.0,54.0,47.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Радищев_Журналы,698.0,475.0,318.0,260.0,251.0,53.0,88.0,118.0,167.0,125.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Радищев_Путешествие_из_Петербурга,1705.0,1517.0,980.0,578.0,711.0,462.0,455.0,320.0,320.0,277.0,...,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0
Радищев_Стихи,164.0,222.0,112.0,50.0,68.0,67.0,19.0,48.0,28.0,11.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0


In [11]:
c2500 = raw_corpus.get_mfw_table(2500)



In [12]:
c2500

Unnamed: 0,и,в,не,что,на,я,его,с,то,но,...,земледелие,старый,первой,земледелию,таковом,монастырь,веселий,мнении,обязаны,заблуждении
Новиков_Кошелек,0.037543,0.02733,0.024022,0.01683,0.004315,0.021433,0.003308,0.007624,0.008199,0.008199,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000144
Новиков_Пословицы,0.041513,0.024691,0.018049,0.015956,0.011479,0.00657,0.014728,0.008519,0.004548,0.00888,...,0.0,7.2e-05,0.0,7.2e-05,0.0,0.000217,0.0,0.0,0.0,0.0
Новиков_Путешествие_из_И_в_Т,0.044253,0.020115,0.016667,0.023563,0.01092,0.017241,0.004598,0.003448,0.002299,0.004023,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Новиков_Трутень,0.038239,0.020633,0.02696,0.019945,0.007153,0.016231,0.004677,0.006878,0.007428,0.006465,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Радищев_Журналы,0.043231,0.029419,0.019695,0.016103,0.015546,0.003283,0.00545,0.007308,0.010343,0.007742,...,0.000124,0.0,0.000248,0.000248,6.2e-05,0.0,0.0,0.000124,0.0,0.0
Радищев_Путешествие_из_Петербурга,0.033285,0.029615,0.019132,0.011284,0.01388,0.009019,0.008883,0.006247,0.006247,0.005408,...,5.9e-05,7.8e-05,2e-05,0.0,7.8e-05,3.9e-05,5.9e-05,5.9e-05,9.8e-05,7.8e-05
Радищев_Стихи,0.018106,0.024509,0.012365,0.00552,0.007507,0.007397,0.002098,0.005299,0.003091,0.001214,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000221,0.0,0.0,0.0


In [13]:
distances = delta.functions.cosine_delta(c2500)
distances



Unnamed: 0,Новиков_Кошелек,Новиков_Пословицы,Новиков_Путешествие_из_И_в_Т,Новиков_Трутень,Радищев_Журналы,Радищев_Путешествие_из_Петербурга,Радищев_Стихи
Новиков_Кошелек,0.0,1.145267,1.136182,1.031943,1.152155,1.224562,1.288683
Новиков_Пословицы,1.145267,0.0,1.206848,1.138565,1.185157,1.06762,1.184207
Новиков_Путешествие_из_И_в_Т,1.136182,1.206848,0.0,1.088864,1.168855,1.314118,1.138681
Новиков_Трутень,1.031943,1.138565,1.088864,0.0,1.163076,1.252569,1.256193
Радищев_Журналы,1.152155,1.185157,1.168855,1.163076,0.0,1.092874,1.243326
Радищев_Путешествие_из_Петербурга,1.224562,1.06762,1.314118,1.252569,1.092874,0.0,1.008259
Радищев_Стихи,1.288683,1.184207,1.138681,1.256193,1.243326,1.008259,0.0
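One simple way to read the matrix above is to rank the other texts by their distance to the disputed one. The sketch below copies the row for `Новиков_Путешествие_из_И_в_Т` from the output; the `author_of` helper, which takes the author from the part of the label before the first underscore, is a hypothetical convenience based on the file names.

```python
# Distances from the disputed text, copied from the cosine Delta matrix above
disputed = "Новиков_Путешествие_из_И_в_Т"
row = {
    "Новиков_Кошелек": 1.136182,
    "Новиков_Пословицы": 1.206848,
    "Новиков_Трутень": 1.088864,
    "Радищев_Журналы": 1.168855,
    "Радищев_Путешествие_из_Петербурга": 1.314118,
    "Радищев_Стихи": 1.138681,
}


def author_of(label):
    """Author name is the part of the label before the first underscore."""
    return label.split("_", 1)[0]


# Rank the reference texts from nearest to farthest
ranked = sorted(row.items(), key=lambda item: item[1])
for name, d in ranked:
    print(f"{d:.6f}  {name}")

# Mean distance from the disputed text to each author's reference texts
by_author = {}
for name, d in row.items():
    by_author.setdefault(author_of(name), []).append(d)
mean_by_author = {a: sum(ds) / len(ds) for a, ds in by_author.items()}
print(mean_by_author)
```

On these numbers the nearest text is `Новиков_Трутень`, and the mean distance to Novikov's texts is smaller than the mean distance to Radishchev's. The margins are small, however (the third-nearest text is `Радищев_Стихи`), so the ranking alone is weak evidence.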


In [14]:
distances.evaluate()



F-Ratio         0.433715
Fisher's LD     0.257565
Simple Score    0.966662
dtype: float64
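To give a feel for what such a score measures, the sketch below compares distances between texts with the same author label to distances between texts with different labels, using the upper triangle of the cosine Delta matrix above. The formula (difference of the two means divided by the sample standard deviation of all pairwise distances) is an assumption about how a "simple score" of this kind can be defined, not a statement about the internals of the `delta` library; on these numbers it comes out at about 0.97, close to the Simple Score printed above.

```python
import statistics
from itertools import combinations

labels = [
    "Новиков_Кошелек",
    "Новиков_Пословицы",
    "Новиков_Путешествие_из_И_в_Т",
    "Новиков_Трутень",
    "Радищев_Журналы",
    "Радищев_Путешествие_из_Петербурга",
    "Радищев_Стихи",
]

# Upper triangle of the distance matrix above, row by row
values = [
    1.145267, 1.136182, 1.031943, 1.152155, 1.224562, 1.288683,
    1.206848, 1.138565, 1.185157, 1.06762, 1.184207,
    1.088864, 1.168855, 1.314118, 1.138681,
    1.163076, 1.252569, 1.256193,
    1.092874, 1.243326,
    1.008259,
]


def author_of(label):
    """Author name is the part of the label before the first underscore."""
    return label.split("_", 1)[0]


same, different = [], []
# combinations() walks the index pairs in the same row-by-row order as values
for (i, j), d in zip(combinations(range(len(labels)), 2), values):
    if author_of(labels[i]) == author_of(labels[j]):
        same.append(d)
    else:
        different.append(d)

mean_same = statistics.mean(same)
mean_different = statistics.mean(different)
# Assumed definition: gap between the means in units of the overall std
score = (mean_different - mean_same) / statistics.stdev(values)
print(mean_same, mean_different, score)
```

Note that the disputed text carries the label "Новиков" from its file name, so it is counted as a Novikov text here; any such score therefore already presupposes the official attribution.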

## References
_Evert, Stefan, Proisl, Thomas_. Understanding and explaining Delta measures for authorship attribution, Digital Scholarship in the Humanities, Vol. 32, Supplement 2, 2017,
[URL](https://watermark.silverchair.com/fqx023.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAAcQwggHABgkqhkiG9w0BBwagggGxMIIBrQIBADCCAaYGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQM2zhQHBBa5PGycPXSAgEQgIIBd75BalV28g3663OrlPFPOCOhIWn3i0cj4XzGWQfiyxC3blSa1ukzk74M9CwpfIc9Smvi0aYfYJO_AluXPVE53uHAsc3TEFEPgYuFuGPZafOVdOu9FrH0v8M3iWaExHRoPksPZHci5LKyws1d_JTLzHcOmzCUeqlLN-EsuurJcuWGDsh-cnk5VpIYD-6WCHgwTKy95dUylozKTIw8rotGyff1d45DhDM0rsYkyrzVWp5hgrEyMSEiVHTBoLmnq1q7KD7mf4DkZq7Ytb0S6iHDdFw0hPfqsxyHrw4euKYO8bsQHBh0Xke-VtDAG7dYTQlJZnqVgGZ8Nwhbb07_SWYgWiXXyCDLRPq3W1wP3yT8b9L4bolqnqUdgyoEZys8K-VcZgyqPE5Ng1zTzRveEVQPUUMi2Vs3xEKZlHn3K5mIsl6kEMsqj7CLKyTYg86bqONzaHhdrtUu035Zt6P3gS_wFvnz5iWhcWPgDs-mJWBGwXTDxBmpAvDrhA).