# Hypothesis testing

The hypothesis to be tested states that writing styles which differ from a lexical perspective nevertheless have the same distribution of parts of speech.
<br><br>
Styles to test:
- science
- blogs

Each style has a corresponding file of articles collected from the internet: geographical and geological articles from open e-libraries for science, and articles from Leonid Kaganov's blog, Drogoy Journal, etc. for blogs. All articles are grouped into two files.

In [1]:
from nltk import FreqDist 
from nltk.tokenize import WhitespaceTokenizer
from pymorphy2 import MorphAnalyzer
from string import punctuation
exclude = set(punctuation + '0123456789[]—«»–')

## Science texts

In [2]:
with open('science.txt', 'r', encoding='utf-8') as t:
    text = t.read().replace('-\n', '')
    buf = ''.join(ch for ch in text if ch not in exclude)
    science_tokens = WhitespaceTokenizer().tokenize(buf.lower())
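
The cleaning steps in the cell above can be tried on a short string. This is a minimal sketch: the helper name `clean_and_tokenize` and the sample sentence are made up for illustration, and `str.split()` stands in for `WhitespaceTokenizer` so that the snippet needs only the standard library.

```python
from string import punctuation

# Same exclusion set as in the notebook.
exclude = set(punctuation + '0123456789[]—«»–')

def clean_and_tokenize(text):
    """Hypothetical helper mirroring the notebook's preprocessing."""
    # Rejoin words that were hyphenated across a line break.
    text = text.replace('-\n', '')
    # Delete punctuation and digits (they are removed, not replaced by spaces).
    kept = ''.join(ch for ch in text if ch not in exclude)
    # Lowercase and split on runs of whitespace.
    return kept.lower().split()

sample = 'The river del-\nta (see [1]) covers 120 km.'
tokens = clean_and_tokenize(sample)
print(tokens)
```

Because excluded characters are deleted rather than replaced with a space, a hyphenated word such as `north-east` becomes the single token `northeast`.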

In [3]:
print('Tokens: ', len(science_tokens))
print('Types: ', len(set(science_tokens)))

Tokens:  18623
Types:  6565


In [4]:
morph = MorphAnalyzer()
freq = FreqDist(science_tokens)
science_parts_of_speech = {}
for word in set(science_tokens):
    # The first four characters of the tag string name the part of speech
    # (or a pseudo-tag such as LATN, PNCT, ROMN, UNKN for non-words).
    part_of_speech = str(morph.parse(word)[0].tag)[:4]
    # Add every occurrence of this word to its part-of-speech total.
    science_parts_of_speech[part_of_speech] = (
        science_parts_of_speech.get(part_of_speech, 0) + freq[word]
    )
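
The counting step boils down to tallying tokens by the first four characters of their tag string. A small sketch with `collections.Counter` shows the idea; the tag strings below are hard-coded in the OpenCorpora style that pymorphy2 prints and are assumptions for illustration, not real analyzer output.

```python
from collections import Counter

# Illustrative (token, tag string) pairs; hard-coded, not produced by pymorphy2.
tagged = [
    ("река", "NOUN,inan,femn sing,nomn"),
    ("течёт", "VERB,impf,intr sing,3per,pres,indc"),
    ("в", "PREP"),
    ("море", "NOUN,inan,neut sing,accs"),
    ("abc", "LATN"),
]

def pos_counts(pairs):
    """Tally tokens by the first four characters of their tag string."""
    return Counter(tag[:4] for _, tag in pairs)

counts = pos_counts(tagged)
print(counts)
```

Slicing the string, rather than reading a dedicated part-of-speech attribute, keeps pseudo-tags such as `LATN` in the tally alongside the real parts of speech.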

In [5]:
science_parts_of_speech

{'ADJF': 3246,
 'ADJS': 144,
 'ADVB': 485,
 'COMP': 26,
 'CONJ': 1438,
 'GRND': 62,
 'INFN': 274,
 'INTJ': 52,
 'LATN': 196,
 'NOUN': 7832,
 'NPRO': 274,
 'NUMR': 45,
 'PNCT': 12,
 'PRCL': 270,
 'PRED': 40,
 'PREP': 2134,
 'PRTF': 414,
 'PRTS': 150,
 'ROMN': 28,
 'UNKN': 463,
 'VERB': 1038}

## Blogs

In [6]:
with open('blogs.txt', 'r', encoding='utf-8') as t:
    text = t.read().replace('-\n', '')
    buf = ''.join(ch for ch in text if ch not in exclude)
    blog_tokens = WhitespaceTokenizer().tokenize(buf.lower())

In [7]:
print('Tokens: ', len(blog_tokens))
print('Types: ', len(set(blog_tokens)))

Tokens:  12684
Types:  5511


In [8]:
# Reuse the `morph` analyzer created for the science texts.
freq = FreqDist(blog_tokens)
blogs_parts_of_speech = {}
for word in set(blog_tokens):
    # The first four characters of the tag string name the part of speech
    # (or a pseudo-tag such as LATN, ROMN, UNKN for non-words).
    part_of_speech = str(morph.parse(word)[0].tag)[:4]
    # Add every occurrence of this word to its part-of-speech total.
    blogs_parts_of_speech[part_of_speech] = (
        blogs_parts_of_speech.get(part_of_speech, 0) + freq[word]
    )

In [9]:
blogs_parts_of_speech

{'ADJF': 1599,
 'ADJS': 127,
 'ADVB': 842,
 'COMP': 63,
 'CONJ': 1286,
 'GRND': 40,
 'INFN': 404,
 'INTJ': 37,
 'LATN': 143,
 'NOUN': 3802,
 'NPRO': 644,
 'NUMR': 63,
 'PRCL': 641,
 'PRED': 74,
 'PREP': 1411,
 'PRTF': 84,
 'PRTS': 94,
 'ROMN': 7,
 'UNKN': 33,
 'VERB': 1290}
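
The two corpora differ in size (18623 and 12684 tokens), so raw counts are easier to compare as shares of all tokens. A small sketch using a few counts copied from the outputs above; `share` and `selected` are names introduced here for illustration.

```python
def share(count, total):
    """Return count as a percentage of total."""
    return 100 * count / total

# Token totals and a few tag counts copied from the notebook outputs.
science_total, blogs_total = 18623, 12684
selected = {
    "ADJF": (3246, 1599),
    "NOUN": (7832, 3802),
    "NPRO": (274, 644),
    "VERB": (1038, 1290),
}

for tag, (science_count, blogs_count) in selected.items():
    print(f"{tag}: science {share(science_count, science_total):.1f}%, "
          f"blogs {share(blogs_count, blogs_total):.1f}%")
```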

## Correlation

In [10]:
from scipy.stats import spearmanr

In [12]:
print(len(science_parts_of_speech), len(blogs_parts_of_speech)) 

21 20


The dictionaries differ in size, which means that one tag found in the science texts never occurs in the blogs. Spearman's correlation coefficient requires two sequences of equal length, so we add the missing tag to the blogs dictionary with a count of zero.

In [13]:
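# Tags present in one dictionary but missing from the other get a zero count.
diff = sorted(set(science_parts_of_speech) ^ set(blogs_parts_of_speech))
if diff:
    print(diff)
for tag in diff:
    science_parts_of_speech.setdefault(tag, 0)
    blogs_parts_of_speech.setdefault(tag, 0)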

['PNCT']


In [14]:
# Order both count lists by tag name so that positions match.
science_freq = [count for _, count in sorted(science_parts_of_speech.items())]
blogs_freq = [count for _, count in sorted(blogs_parts_of_speech.items())]
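
Filling in missing tags and ordering the counts can also be done in one step over the union of keys. A minimal sketch, assuming plain `dict` inputs; `align_counts` is a hypothetical helper, and the toy dictionaries are not the notebook's data.

```python
def align_counts(first, second):
    """Return the sorted union of tags and two count lists in that order.

    A tag missing from one dictionary is given a count of zero.
    """
    tags = sorted(set(first) | set(second))
    return tags, [first.get(t, 0) for t in tags], [second.get(t, 0) for t in tags]

# Toy input: PNCT occurs only in the first dictionary.
tags, a, b = align_counts({"NOUN": 5, "PNCT": 1, "VERB": 2}, {"NOUN": 3, "VERB": 4})
print(tags, a, b)
```

This variant leaves the input dictionaries unchanged and works whichever side is missing a tag.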

In [15]:
spearmanr(science_freq, blogs_freq)

SpearmanrResult(correlation=0.83885640025990893, pvalue=2.025563223320523e-06)
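
For reference, Spearman's coefficient is the Pearson correlation computed on ranks instead of raw values, with tied values sharing the mean of their ranks. The sketch below is a plain-Python illustration of that definition, not SciPy's implementation; all function names are introduced here.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

print(spearman_rho([1, 2, 3, 4], [10, 20, 30, 40]))
```

Only the order of the counts enters the coefficient, so two lists with very different magnitudes can still correlate strongly.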

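The low p-value lets us reject the null hypothesis of the Spearman test, which is that the two frequency lists are not monotonically related. Together with a coefficient of about 0.84, this means that parts of speech are ranked in a similar order in both writing styles: tags that are frequent in science texts also tend to be frequent in blogs. The test says nothing about whether the distributions themselves are equal.

In conclusion, rank correlation alone can neither confirm nor reject the hypothesis that the two styles have the same distribution of parts of speech. The shares do look different (for example, ADJF makes up about 17% of science tokens but about 13% of blog tokens, and NPRO about 1.5% versus about 5%), but judging whether such differences are significant calls for a test of homogeneity, such as a chi-square test on the two count tables. Whether other writing styles differ in the same way is a separate hypothesis that would have to be validated.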
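
Whether two count distributions are equal is the kind of question a chi-square test of homogeneity addresses. Below is a minimal plain-Python sketch of the statistic for two rows of counts. The function name and the toy table are assumptions for illustration, both rows are assumed to contain at least one non-zero count, and no result for the notebook's data is claimed here.

```python
def chi_square_homogeneity(row_a, row_b):
    """Chi-square statistic and degrees of freedom for a 2 x K count table."""
    # Drop categories that are empty in both rows; they carry no information.
    columns = [(a, b) for a, b in zip(row_a, row_b) if a + b > 0]
    total_a = sum(a for a, _ in columns)
    total_b = sum(b for _, b in columns)
    grand_total = total_a + total_b

    statistic = 0.0
    for a, b in columns:
        column_total = a + b
        for observed, row_total in ((a, total_a), (b, total_b)):
            # Expected count if both rows shared one distribution.
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = len(columns) - 1  # (rows - 1) * (columns - 1) with two rows
    return statistic, dof

# Toy table, not the notebook's data.
statistic, dof = chi_square_homogeneity([10, 20], [20, 10])
print(statistic, dof)
```

The statistic is compared with a chi-square distribution with `dof` degrees of freedom; `scipy.stats.chi2_contingency` performs the same computation on a contingency table and also returns the p-value.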