#### Preparation

Import the relevant packages and read the text files.
Each .txt file contains text by a single author: a, b, and c, respectively.
The files differ in length; "author c" is the shortest, with about 15k words.
Unnecessary line breaks are then removed.
<br>Since punctuation marks are easier to remove before a text is split into words, I revised the cell that reads the files as follows, creating two versions of each text: one lowercased but otherwise unchanged, and one with punctuation marks removed and split into words.

In [1]:
import nltk
import os
import time
import random
import pandas as pd
import collections
import string

In [2]:
# Read each file, replace line breaks with spaces, and lowercase the text.
# The with-statement closes the file once it has been read.
with open('author_a.txt','r') as f:
    text_a_orig = f.read().replace("\n", " ").lower()
# Strip punctuation from the unsplit string, then split it into words.
text_a = text_a_orig.translate(str.maketrans("","", string.punctuation))
print("Punctuation marks are deleted.")
text_a = text_a.split()

with open('author_b.txt','r') as f:
    text_b_orig = f.read().replace("\n", " ").lower()
text_b = text_b_orig.translate(str.maketrans("","", string.punctuation))
print("Punctuation marks are deleted.")
text_b = text_b.split()

with open('author_c.txt','r') as f:
    text_c_orig = f.read().replace("\n", " ").lower()
text_c = text_c_orig.translate(str.maketrans("","", string.punctuation))
print("Punctuation marks are deleted.")
text_c = text_c.split()

print("text_X_orig is the original text with all lowercase letters, and text_X is prepared by removing punctuation marks and splitting it in word units.")

Punctuation marks are deleted.
Punctuation marks are deleted.
Punctuation marks are deleted.
text_X_orig is the original text with all lowercase letters, and text_X is prepared by removing punctuation marks and splitting it in word units.


### Lexical Measurement
#### Type-token ratio
First, I calculate the type-token ratio for each author: the number of distinct words (types) divided by the total number of words (tokens).

In [3]:
print(len(text_a))
print(len(text_b))
print(len(text_c))

21654
22897
15234


In [4]:
# number of types (distinct words)

type_a = len(set(text_a))
type_b = len(set(text_b))
type_c = len(set(text_c))

# type-token ratio: number of types divided by number of tokens
def lexical_diversity(text):
    return round(len(set(text))/len(text),4)

print("There are ", type_a,",", type_b,",", type_c, "types in each respective text and the type-token ratio is ", lexical_diversity(text_a), ",", lexical_diversity(text_b), ",", lexical_diversity(text_c), "respectively.")

There are  3560 , 3640 , 2287 types in each respective text and the type-token ratio is  0.1644 , 0.159 , 0.1501 respectively.


#### Simpson's D index
Next, I wrote some code that calculates the "Simpson's D" index, one of the measures of vocabulary richness introduced in Savoy (2020). The closer the value is to 0, the more diverse the vocabulary is.

*n* is the size of the corpus, in other words, the number of tokens.<br>
*VOC(r)* is the number of word types that appear exactly *r* times in the given text.
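
Written out with these symbols, the quantity the code is meant to compute is the following. This is my reading of the formula, so Savoy (2020) remains the authoritative source:

$$
D = \sum_{r=1}^{n} \mathrm{VOC}(r) \cdot \frac{r}{n} \cdot \frac{r-1}{n-1}
$$

Since $\frac{r}{n} \cdot \frac{r-1}{n-1} = \frac{r^2 - r}{n^2 - n}$, this is the same expression that appears in the `return` line of the function. *D* can be read as the probability that two tokens drawn at random from the text, without replacement, belong to the same type.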

This code works well except when *r*=*n*, in other words, when the given text consists of a single type repeated *n* times.
In that case the sum returns 0.0, because `range(1, n)` stops at *n*-1 and the *r*=*n* term is never added. Since my sample texts are not going to contain any such instance, I just added a couple of lines that tell the function to return 1 (which is the correct value when *r*=*n*, since there is no vocabulary diversity).

My first attempt looked like this:

In [None]:
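# Simpson_D_version1

def simpson_D(text):
    types = set(text)
    n = len(text)
    def VOC(r):
        # Number of types that occur exactly r times in the text.
        n_types = 0
        for word in types:
            if text.count(word) == r:  # rescans the whole text each time (slow)
                n_types += 1
        return n_types
    # Workaround for a single-type text: if no type has a frequency
    # between 1 and n-2, return 1 (no vocabulary diversity).
    if sum(VOC(r) for r in range(1, n-1)) == 0:
        return 1
    # range(1, n) covers r = 1 .. n-1.
    return sum(VOC(r) * (r**2 - r) / (n**2 - n) for r in range(1,n))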

print(simpson_D(text_a))
print(simpson_D(text_b))
print(simpson_D(text_c))

This works, but it takes far too long to compute, because `text.count(word)` rescans the whole text for every type and every value of *r*. I wrote another version:

In [5]:
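# Simpson_D_version2: count every word once with collections.Counter
# (text is still the split word list, e.g. text_a)

def simpson_D(text):
    count = collections.Counter(text)  # word -> number of occurrences
    types = set(text)
    n = len(text)
    def VOC(r):
        # Number of types that occur exactly r times in the text.
        n_types = 0
        for i in types: # i is a word(type)
            if count.get(i) == r:
                n_types += 1
        return n_types
    # Workaround for a single-type text: if no type has a frequency
    # between 1 and n-2, return 1 (no vocabulary diversity).
    if sum(VOC(r) for r in range(1, n-1)) == 0:
        return 1
    # range(1, n) covers r = 1 .. n-1.
    return sum(VOC(r) * (r**2 - r) / (n**2 - n) for r in range(1,n))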

print(simpson_D(text_a))
print(simpson_D(text_b))
print(simpson_D(text_c))

0.0070774527083991255
0.00654202743748956
0.00979478876975365
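
Both versions above still call `VOC(r)` for every possible *r*, so they loop over all types roughly *n* times. A faster variant is sketched below: it builds the frequency spectrum once and sums only over the frequencies that actually occur. The name `simpson_D_fast` is my own, and letting *r* run up to *n* (so that a single-type text yields 1 without a special case) is an assumption about the intended formula. Raising an error for texts with fewer than two tokens is likewise just one possible convention.

```python
import collections

def simpson_D_fast(tokens):
    """Simpson's D from a list of word tokens (hypothetical helper name)."""
    n = len(tokens)
    if n < 2:
        # (n**2 - n) would be 0, so the index is undefined here.
        raise ValueError("need at least two tokens")
    # How often each type occurs, e.g. {"the": 2, "cat": 1}.
    type_counts = collections.Counter(tokens)
    # Frequency spectrum: r -> VOC(r), the number of types occurring exactly r times.
    spectrum = collections.Counter(type_counts.values())
    # Only frequencies that actually occur contribute, including r == n.
    return sum(voc * (r**2 - r) for r, voc in spectrum.items()) / (n**2 - n)

print(simpson_D_fast("the cat saw the dog".split()))
```

On my three texts this should agree with the values printed above, up to floating-point rounding, since no type in them occurs *n* times.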


#### Mean word length and word length distribution
Next, I wanted to know the mean word length and the mean sentence length. <br>
Pre-processing for sentence length was going to be tedious because of the kind of text my samples are: blog posts. <br>
Blog posts typically have multiple small headers, which were not taken into account when the texts were collected. <br>
Normally, we could assume that sentence boundaries are where full stops (.), exclamation marks (!), and question marks (?) appear. However, since these small headers are effectively titles for each small section and thus have no such marks at the end, it is difficult to take them into account without manually marking or deleting them, which would defeat the whole point of calculating with a computer. <br><br>
Another method I thought of was to define a sentence as "the words between two (aforementioned) sentence-ending marks". But this does not help with the headers either, and I do not know how it would affect the sample size.

Hence, I decided to calculate only the mean word length. I took the first 5k words from each file.

In [20]:
# first 5,000 tokens of each text
text_a_5k = text_a[:5000]
text_b_5k = text_b[:5000]
text_c_5k = text_c[:5000]

# number of types in each 5k sample
print(len(set(text_a_5k)))
print(len(set(text_b_5k)))
print(len(set(text_c_5k)))

# mean word length: total number of characters divided by number of tokens
average_a = sum(len(word) for word in text_a_5k) / len(text_a_5k)
average_b = sum(len(word) for word in text_b_5k) / len(text_b_5k)
average_c = sum(len(word) for word in text_c_5k) / len(text_c_5k)

print(average_a)
print(average_b)
print(average_c)

1217
1306
1140
4.4192
4.6376
4.8154
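
The heading also mentions a word length distribution, which the cell above does not compute. A minimal sketch follows, assuming the same kind of punctuation-free token list as `text_a_5k`. The function name `word_length_distribution` and the toy sentence are only for illustration.

```python
import collections

def word_length_distribution(tokens):
    """Map each word length to the share of tokens with that length."""
    lengths = collections.Counter(len(word) for word in tokens)
    total = len(tokens)
    # Sort by length so the result reads like a histogram.
    return {length: count / total for length, count in sorted(lengths.items())}

toy_tokens = "a blog post has many small headers".split()
for length, share in word_length_distribution(toy_tokens).items():
    print(length, round(share, 3))
```

With the real samples, the call would be something like `word_length_distribution(text_a_5k)`, and the three resulting dictionaries could be compared side by side.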


One thing I am curious about, but haven't tried yet, is whether these values (simpson_D, mean word length, etc.) change meaningfully depending on how many words the text contains. Since I already have the code, it will be simple enough to check: for example, I could take the first 5k, 10k, 15k and 20k words from text_a and see whether the **simpson_D** value changes, or whether it stabilizes beyond a certain threshold. This is on my to-do list.
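
A possible shape for that check is sketched here. `measure_by_prefix` is a hypothetical helper, not something from Savoy (2020), and the toy tokens with `mean_word_length` only stand in for the real texts and for `simpson_D`.

```python
def measure_by_prefix(tokens, sizes, measure):
    """Apply `measure` to the first k tokens for each k in `sizes`."""
    results = {}
    for k in sizes:
        if k > len(tokens):
            # Skip prefixes longer than the text instead of silently reusing it.
            continue
        results[k] = measure(tokens[:k])
    return results

def mean_word_length(tokens):
    return sum(len(word) for word in tokens) / len(tokens)

# Toy data; with the real texts this would be, for example,
# measure_by_prefix(text_a, [5000, 10000, 15000, 20000], simpson_D)
toy_tokens = "one three five seven nine eleven".split()
print(measure_by_prefix(toy_tokens, [2, 4, 6, 8], mean_word_length))
```

Passing the measure in as a function means the same loop can be reused for the type-token ratio or any other index without rewriting it.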

#### Lexical density
Lexical density is the ratio of lexical items (all tokens that are not function words) to the text length, that is, 1 minus the proportion of function words.

For this, I thought of creating a list of function words (articles, prepositions, conjunctions, pronouns, etc.) and counting how often they occur.
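
A minimal sketch of that idea follows. The word list `FUNCTION_WORDS` is a short, deliberately incomplete illustration that I made up for this example; a real analysis would need a much fuller list, and the resulting value depends heavily on which words are included.

```python
# Illustrative, deliberately incomplete list of English function words.
FUNCTION_WORDS = {
    "a", "an", "the",                              # articles
    "in", "on", "at", "of", "to", "with",          # prepositions
    "and", "or", "but", "because",                 # conjunctions
    "i", "you", "he", "she", "it", "we", "they",   # pronouns
}

def lexical_density(tokens, function_words=FUNCTION_WORDS):
    """Share of tokens that are not in `function_words`."""
    if not tokens:
        raise ValueError("empty token list")
    n_function = sum(1 for word in tokens if word in function_words)
    return 1 - n_function / len(tokens)

print(lexical_density("the cat sat on the mat and it purred".split()))
```

Because the texts were lowercased during preparation, the list only needs lowercase entries.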


Some reference values were necessary for comparison: I followed Savoy (2020, p. 30), which says (and I paraphrase) that an LD value of around 0.3 is the norm for oral production, and around 0.4 or higher for written texts.

There are other measures that make use of vocabulary, such as the "Big word index", which refers to the percentage of words with 6 letters or more.
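
Going by that description alone, the index could be computed roughly as follows. The function name `big_word_index` and the `min_length` parameter are my own choices, and the toy sentence is only for illustration.

```python
def big_word_index(tokens, min_length=6):
    """Percentage of tokens with at least `min_length` letters."""
    if not tokens:
        raise ValueError("empty token list")
    n_big = sum(1 for word in tokens if len(word) >= min_length)
    return 100 * n_big / len(tokens)

print(big_word_index("simple words and longer vocabulary items".split()))
```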

### Distance-based method
#### Burrows's Delta (Savoy 2020: 34-36)

Burrows's Delta considers the 40-150 most frequent word types; style is taken to be reflected in word choice. I will start with the 150 most frequent types (MFWs). I assume I can simply change a number in the code to take more word types from the frequency list, so I will check whether the results change meaningfully if I raise the threshold to, say, 300 or 500.
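
I have not written this part yet, so what follows is only a sketch of how I understand the usual formulation: take the most frequent word types, turn each text's relative frequencies into z-scores, and average the absolute z-score differences between two texts. Several details are assumptions on my part and should be checked against Savoy (2020): the MFW list is taken from the pooled texts, the population standard deviation is used, and words with zero variance are dropped. The names `burrows_delta` and `n_mfw` are hypothetical.

```python
import collections
import statistics

def burrows_delta(texts, n_mfw=150):
    """Pairwise Delta between token lists in `texts` (dict: label -> tokens)."""
    # Most frequent word types over all texts pooled together (assumption).
    pooled = collections.Counter()
    for tokens in texts.values():
        pooled.update(tokens)
    mfw = [word for word, _ in pooled.most_common(n_mfw)]

    labels = list(texts)
    counts = {label: collections.Counter(texts[label]) for label in labels}
    z = {label: [] for label in labels}
    for word in mfw:
        # Relative frequency of this word in each text.
        column = [counts[label][word] / len(texts[label]) for label in labels]
        mean = statistics.mean(column)
        sd = statistics.pstdev(column)  # population std (assumption)
        if sd == 0:
            continue  # same frequency everywhere: carries no information
        for label, freq in zip(labels, column):
            z[label].append((freq - mean) / sd)

    n_kept = len(z[labels[0]])
    if n_kept == 0:
        raise ValueError("no word varies between the texts")
    # Delta = mean absolute difference of z-scores over the kept words.
    deltas = {}
    for idx, a in enumerate(labels):
        for b in labels[idx + 1:]:
            deltas[(a, b)] = sum(abs(x - y) for x, y in zip(z[a], z[b])) / n_kept
    return deltas

toy_texts = {
    "a": "the cat and the dog".split(),
    "b": "the cat and the dog".split(),
    "c": "a bird or a fish or a frog".split(),
}
print(burrows_delta(toy_texts))
```

Raising the threshold to 300 or 500 would then only mean changing `n_mfw`, for example `burrows_delta(texts, n_mfw=300)`.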