# Progress report: as of the morning of 21 Apr 2023
### Jihyeong Lee
### LING4181

## Preparation

Import relevant packages and read the text files.
Each .txt file includes text from a single author, a, b, c respectively.
The length of each .txt file is different, with "author c" being the shortest with about 15k words.
Then, unnecessary line breaks are removed.
<br>Since punctuation marks cannot be removed for an already split text, I revise the first cell like the following, creating 2 versions of a text - one with lowercase letters, and the other split and without punctuation marks.

In [67]:
import nltk
import os
import time
import random
import pandas as pd
import collections
import string
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.probability import FreqDist
pd.set_option("display.max_rows", None)

nltk.download('punkt')
print("Done!")

Done!


[nltk_data] Downloading package punkt to /home/jupyter/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
# preprocessing functions

def preprocess(filename):
    text = open(filename, 'r').read().replace("\n", " ").lower()
    return text.translate(str.maketrans("","", string.punctuation)).split()
    
def preprocess2(filename):
    text = open(filename, 'r').read().lower()
    t = sent_tokenize(text)
    return t

text_a = preprocess('author_a.txt') # word-unit
text_a_s = preprocess('author_a.txt') # sentence-unit
text_b = preprocess('author_b.txt')
text_c = preprocess('author_c.txt')

print(text_a)



## 1. Lexical Measurement
### type-token ratio
First, I could calculate type-token ratio for each author.

In [3]:
# number of tokens

print(len(text_a))
print(len(text_b))
print(len(text_c))

21654
22897
15234


In [4]:
# number of types

type_a = (len(set(text_a)))
type_b = (len(set(text_b)))
type_c = (len(set(text_c)))

# type - token ratio
def lexical_diversity(text):
    return round(len(set(text))/len(text),4)

lexical_diversity(text_a)
lexical_diversity(text_b)
lexical_diversity(text_c)

print("There are ", type_a,",", type_b,",", type_c, "types in each respective text and the type-token ratio is ", lexical_diversity(text_a), ",", lexical_diversity(text_b), ",", lexical_diversity(text_c), "respectively.")

There are  3560 , 3640 , 2287 types in each respective text and the type-token ratio is  0.1644 , 0.159 , 0.1501 respectively.


### Simpson's D index
Next, I wrote some codes that calculate "Simpson's D" index, which is one of the ways that were introduced in Savoy(2020) to measure vocabulary richness. The closer the value is to 0, the more diverse the vocabulary is.

*n* is the size of the corpus - in other words, token size.<br>
*VOC(r)* is the number of words(type) that appear exactly *r* times in the given text.

This code works well except when *r*=*n*, in other words, when the given text has only one type. 
When this is the case, the code returns 0.0. I don't know what the problem is here, but since my sample texts are not going to have any such instance, I just added a couple of lines that tells the function to return 1 (which is what happens when *r*=*n*, since there is no vocabulary diversity).

First attempt looked like this:

In [5]:
# Simpson_D_version1

def simpson_D_1(text):
    types = set(text)
    n = len(text)
    def VOC(r):
        VOC = 0
        for word in types:
            if text.count(word) == r:
                VOC += 1
        return VOC
    for r in range(1,n):
        if sum(VOC(r) for r in range(1, n-1)) == 0:
            return 1
        else:
            return round(sum(VOC(r) * (r**2 - r) / (n**2 - n) for r in range(1,n)),4)

In [6]:
# DO NOT RUN THIS:

print(simpson_D_1(text_a))
print(simpson_D_1(text_b))
print(simpson_D_1(text_c))

KeyboardInterrupt: 

This works, but it took too long time to compute. I wrote another version:

In [84]:
# Simpson_D_version2: Use text that have not already been split

def simpson_D(text):
    count = collections.Counter(text)
    types = set(text)
    n = len(text)
    def VOC(r):
        VOC = 0
        for i in types: # i is a word(type)
            if count.get(i) == r:
                VOC += 1
        return VOC
    if sum(VOC(r) for r in range(1, n-1)) == 0:
        return 1
    else:
        return round(sum(VOC(r) * (r**2 - r) / (n**2 - n) for r in range(1,n)),4)

In [85]:
text = ["text","text","text","text","text","text","text"]
print(simpson_D(text))

1


In [None]:
print(simpson_D(text_a))
print(simpson_D(text_b))
print(simpson_D(text_c))

### Mean word length and word length distribution
Next, I wanted to know mean word length and mean sentence length. <br>
As for the sentence length, the pre-processing was going to be tedious, because of the kind of text my samples are: blog posts. <br>
Blog posts typically have multiple small headers, which was not considered when the texts were collected. <br>
Usually, we could imagine sentence borders are where full stops (.), exclamation marks (!), and question marks (?) appear. However, since these small headers are effectively titles for each small section and thus do not have such marks at the end, it is difficult to take them into account without manually marking or deleting them - which would defy the whole point of calculating with a computer. <br><br>
Another method I thought of was to get sentences by defining a sentences as "words between two (aforementioned) sentence-ending marks. But this does not help with the headers either, and I have no idea what this could mean to the sample size.

Hence, I decided to just calculate mean word length. I took the first 5k words from each file. 

In [8]:
text_a_5k = text_a[:5000]
text_b_5k = text_b[:5000]
text_c_5k = text_c[:5000]

print(len(set(text_a_5k)))
print(len(set(text_b_5k)))
print(len(set(text_c_5k)))

def average(text):
    return sum(len(word) for word in text) / len(text)

print(average(text_a))
print(average(text_b))
print(average(text_c))

print(average(text_a_5k))
print(average(text_b_5k))
print(average(text_c_5k))

1217
1306
1140
4.5131153597487765
4.736472026903088
4.8074701325981355
4.4192
4.6376
4.8154


Looking at **word length distribution** might be interesting, too. It looks simple enough, too. (I've learned to not underestimate the time and effort it takes to code a simple thing in a way I want, though.)

One thing I am curious about, but haven't tried yet, is whether these values (simpson_D, mean word length, etc) change meaningfully depending on how many words the text contains. Since I already have the codes, it will be simple enough to try and check this: for example, I could take the first 5k, 10k, 15k and 20k words from text_a and see if there is any change in the **simpson_D** value, or if the value becomes stable after a certain threshold. I haven't done this yet, but it is on the list.

### Lexical density
Lexical density is the ratio between the number of lexical items (1-functional words) and the text length.

For this I thought of creating a list of function words - determiners, pronouns, conjunctions, prepositions, auxiliary words etc., and counting their numbers. <br>

Some comparative values were necessary: I followed Savoy(2020, p.30) where it said (and I paraphrase) that an LD value of around 0.3 for an oral production and around 0.4 and higher for writings are the norm.

I manually created a list of function words as a .txt file, and I only included single-word function words as the purpose is to see its percentage against the number of all words used. Hence, function words(phrases) like "thanks to", "in favor of", "instead of" etc were excluded, but "to", "in", "of", "instead" etc were included and counted.

I acknowledge that there might be some missing words, but not significant.

In [10]:
# lexical density based on function words 

f_words = open('function_words.txt','r')
f_words = f_words.read().lower().split()

def f_ratio(text): # use split text, before removing punctuation marks! (text_X_orig)
    return round((sum(text.count(word) for word in f_words) / len(text)),4)
def l_density(text):
    return round((1 - f_ratio(text)),4)

print("text a's lexical density is", l_density(text_a))
print("text b's lexical density is", l_density(text_b))
print("text c's lexical density is", l_density(text_c))

text a's lexical density is 0.5247
text b's lexical density is 0.5777
text c's lexical density is 0.6133


There were other measures that makes use of vocabulary, such as "Big word index" which refers to the percentage of words with 6 letters or more. <br> In the codes below instead, I used 7 as the threshold, because I was sceptical of this, since 4/5 character long nouns can take plural form and become 6 character long. (Same for present/past tense verbs.)

In [26]:
def BWI1(text):
    big_word = 0
    for word in text:
        if len(word) >= 7:
            big_word += 1
    return round(big_word / len(text),4)


def BWI2(text):
    big_words = list(word for word in text if len(word) >= 7)
    return round((sum(text.count(word) for word in set(big_words)) / len(text)),4)

The two functions in the cell above both work, as seen in the next cell:

In [31]:
print(BWI1(text_a))
print(BWI2(text_a))

0.1906
0.1906


however, as we can see from the cell below, the second function takes significantly longer processing time.

In [33]:
start = time.time()

print(BWI1(text_a))

print("big word function 1:", f"{time.time()-start:.4f} sec")

start = time.time()

print(BWI2(text_a))

print("big word function 2:", f"{time.time()-start:.4f} sec")

0.1906
big word function 1: 0.0053 sec
0.1906
big word function 2: 0.5469 sec


## 2. Distance-based method
### Burrow's Delta (Savoy 2020: 34-36)

Burrow's Delta considers 40-150 most frequent word types, and the style is reflected through the word choice. I will also consider 150 most frequent types(*MFWs*) first. I will check out if the results change meaningfully if I raise the threshold to, say, 300 or 500. (I assume I could just change some numbers in the code to get more word types from the frequency list.)

얻어야 하는 값: 
1. 문장부호 제거하고 split한 텍스트에서 가장 흔한 word type (MFW)를 추출함. 
1-1. 문제: 스펠링이 같고 의미가 다른 homograph들을 구별할 수 없음. 이 점은 Savoy(2020:p.34)도 지적하는데, POS 태깅은 일정 부분 manual correction 없이 완전 자동화하기 어려운 과정임.
2. 전체 token에 대한 MFW의 비율도 알면 좋겠음. (Savoy(2020: p.34)에 의하면 150개의 MFW만 추출해도 대개 50~65%의 토큰을 cover한다고 함.

In [4]:
# Getting the list of most common words in a given text (threshold: top 15

def MFW(text):
    freq = FreqDist(text)
    MFWlist = freq.most_common(300) # list of 150 most common words with the frequency
    return MFWlist

def MFW_100(text): # the percentage of MFW tokens in relation to the entire text
    return 100 * sum(i[1] for i in MFW(text)) / len(text)

print(sorted(MFW(text_c)))
print(MFW_100(text_a))

MFW(text_b)


[('10', 10), ('12', 13), ('3', 10), ('6', 9), ('a', 412), ('about', 12), ('add', 24), ('after', 27), ('all', 12), ('allow', 15), ('also', 30), ('an', 40), ('and', 460), ('annual', 9), ('any', 16), ('are', 138), ('areas', 9), ('around', 11), ('as', 99), ('at', 49), ('available', 10), ('avoid', 15), ('basil', 54), ('be', 93), ('because', 35), ('bed', 10), ('beds', 14), ('before', 25), ('begonia', 67), ('begonias', 16), ('benefits', 14), ('best', 31), ('between', 15), ('bin', 17), ('blooming', 12), ('blue', 54), ('both', 13), ('bright', 16), ('brown', 20), ('bush', 9), ('but', 24), ('by', 47), ('can', 84), ('care', 29), ('cause', 12), ('choose', 11), ('clean', 14), ('cold', 11), ('color', 10), ('colors', 12), ('common', 20), ('companion', 30), ('compost', 57), ('composter', 10), ('composting', 17), ('conditions', 27), ('container', 16), ('cool', 16), ('cover', 10), ('cut', 27), ('cutting', 14), ('cuttings', 17), ('day', 11), ('days', 22), ('develop', 10), ('direct', 12), ('do', 10), ('doe

[('the', 776),
 ('a', 687),
 ('to', 667),
 ('and', 604),
 ('of', 543),
 ('in', 501),
 ('is', 269),
 ('for', 245),
 ('garden', 235),
 ('you', 233),
 ('i', 202),
 ('are', 197),
 ('that', 192),
 ('plant', 186),
 ('can', 177),
 ('with', 175),
 ('or', 171),
 ('your', 165),
 ('it', 164),
 ('as', 156),
 ('this', 145),
 ('plants', 145),
 ('from', 141),
 ('soil', 137),
 ('my', 130),
 ('on', 118),
 ('be', 106),
 ('when', 96),
 ('some', 92),
 ('them', 90),
 ('they', 90),
 ('have', 89),
 ('flowers', 85),
 ('like', 85),
 ('also', 84),
 ('if', 79),
 ('an', 78),
 ('new', 78),
 ('growing', 74),
 ('seed', 74),
 ('at', 68),
 ('will', 68),
 ('so', 67),
 ('seeds', 66),
 ('more', 66),
 ('grasses', 64),
 ('but', 62),
 ('one', 60),
 ('these', 60),
 ('into', 55),
 ('planting', 55),
 ('it’s', 54),
 ('grow', 53),
 ('rose', 52),
 ('bulbs', 52),
 ('spring', 51),
 ('potting', 51),
 ('meadow', 51),
 ('about', 50),
 ('blooms', 50),
 ('lily', 50),
 ('up', 48),
 ('by', 48),
 ('which', 48),
 ('other', 47),
 ('out', 47)

In [43]:
## This part might not be useful

def merge_set(x1,x2,x3):
    a = []
    b = []
    c = []
    for i in MFW(x1):
        a.append(i[0])
    for i in MFW(x2):    
        b.append(i[0]) 
    for i in MFW(x2):    
        c.append(i[0]) 
    d = list(set(a+b+c))
    return d

set_merge = merge_set(text_a, text_b, text_c)
len(set_merge)
print(sorted(set_merge))

['a', 'about', 'add', 'after', 'all', 'also', 'am', 'an', 'and', 'are', 'around', 'as', 'at', 'back', 'be', 'because', 'been', 'before', 'best', 'bloom', 'blooms', 'both', 'bulbs', 'but', 'by', 'can', 'can’t', 'choose', 'compost', 'conditions', 'container', 'cosmos', 'courtesy', 'craig', 'cuban', 'cut', 'different', 'disease', 'do', 'don’t', 'down', 'early', 'easy', 'fall', 'few', 'first', 'flower', 'flowers', 'foliage', 'for', 'from', 'full', 'garden', 'gardeners', 'gardening', 'gardens', 'get', 'go', 'going', 'good', 'grass', 'grasses', 'great', 'greenhouse', 'grow', 'growing', 'had', 'hardy', 'has', 'have', 'he', 'her', 'here', 'his', 'how', 'i', 'if', 'in', 'include', 'into', 'is', 'it', 'its', 'it’s', 'i’m', 'i’ve', 'just', 'keep', 'know', 'leaves', 'like', 'lilies', 'lily', 'little', 'long', 'look', 'lot', 'love', 'make', 'many', 'mary', 'may', 'me', 'meadow', 'mix', 'more', 'most', 'much', 'my', 'native', 'need', 'new', 'no', 'not', 'now', 'of', 'off', 'on', 'once', 'one', 'or',

In [17]:
# absolute frequency table

def abs_table(xa, xb, xc):
    dict_b = (dict(MFW(xb)))
    dict_c = (dict(MFW(xc)))
    table = pd.DataFrame(MFW(xa)).rename(columns={0: 'word', 1:'a'})
    table.set_index('word',inplace=True)
    table["b"] = ""
    table["c"] = ""
    for n in MFW(xa):
        word = n[0]
        if dict_b.get(word) != None and dict_c.get(word) != None:
            table.loc[word,"b"] = dict_b.get(word)
            table.loc[word,"c"] = dict_c.get(word)
        else:
            table.loc[word,"b"] = np.nan
            table.loc[word,"c"] = np.nan
        table.dropna(inplace= True)
    return table

abs_table(text_a,text_b,text_c)

Unnamed: 0_level_0,a,b,c
word,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
the,914,776,994
and,678,604,460
to,670,667,482
of,456,543,321
a,447,687,412
...,...,...,...
cut,12,26,27
green,12,17,32
best,11,25,31
watering,10,16,13


In [19]:
# making a relative frequency table and adding mean & sd

def rel_table(xa, xb, xc):
    table = abs_table(xa, xb, xc)
    table = table.astype(float)
    table["words"] = table.index
    table.loc[:,"a"] = round(table["a"] / len(xa),5)
    table.loc[:,"b"] = round(table["b"] / len(xb),5)
    table.loc[:,"c"] = round(table["c"] / len(xc),5)
    table.loc[:,"mean"] = table.mean(axis='columns')
    table.loc[:,"sd"] = table.std(axis='columns')
    return table

table = rel_table(text_a,text_b,text_c)
table

  table.loc[:,"mean"] = table.mean(axis='columns')
  table.loc[:,"sd"] = table.std(axis='columns')


Unnamed: 0_level_0,a,b,c,words,mean,sd
word,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
the,0.04221,0.03389,0.06525,the,0.047117,0.013264
and,0.03131,0.02638,0.03020,and,0.029297,0.002112
to,0.03094,0.02913,0.03164,to,0.030570,0.001058
of,0.02106,0.02371,0.02107,of,0.021947,0.001247
a,0.02064,0.03000,0.02704,a,0.025893,0.003906
...,...,...,...,...,...,...
cut,0.00055,0.00114,0.00177,cut,0.001153,0.000498
green,0.00055,0.00074,0.00210,green,0.001130,0.000690
best,0.00051,0.00109,0.00203,best,0.001210,0.000626
watering,0.00046,0.00070,0.00085,watering,0.000670,0.000161


In [29]:
d = dict(collections.Counter(q))
d.get("the")

161

In [70]:
# z-score table + delta table

q = preprocess('q3.txt')

def zscore_table(a,b,c):
    dict_q = dict(collections.Counter(q))
    table = abs_table(a, b, c)
    table = table.astype(float)
    table["words"] = table.index
    table["q"] = ""
    table.loc[:,"a"] = round(table["a"] / len(a),5)
    table.loc[:,"b"] = round(table["b"] / len(b),5)
    table.loc[:,"c"] = round(table["c"] / len(c),5)
    table.loc[:,"mean"] = table.mean(axis='columns')
    table.loc[:,"sd"] = table.std(axis='columns')
    for word in table["words"]:
        if dict_q.get(word) != None:
            table.loc[word,"q"] = round((dict_q.get(word) / len(q)),5)
        else:
            table.loc[word,"q"] = np.nan
    table.loc[:,"z_a"] = (table["a"] - table["mean"]) / table["sd"] # calculates z-scores for columns a,b,c,q
    table.loc[:,"z_b"] = (table["b"] - table["mean"]) / table["sd"]
    table.loc[:,"z_c"] = (table["c"] - table["mean"]) / table["sd"]
    table.loc[:,"z_q"] = (table["q"] - table["mean"]) / table["sd"]
    table.dropna(inplace= True) # deletes rows that contain NaN
    table.drop('words', axis = 'columns',inplace= True) # deletes the redundant column
    return table

zscore = zscore_table(text_a,text_b,text_c)
zscore

  table.loc[:,"mean"] = table.mean(axis='columns')
  table.loc[:,"sd"] = table.std(axis='columns')


Unnamed: 0_level_0,a,b,c,q,mean,sd,z_a,z_b,z_c,z_q
word,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
the,0.04221,0.03389,0.06525,0.10675,0.047117,0.013264,-0.369911,-0.997151,1.367061,4.495722
and,0.03131,0.02638,0.0302,0.0219,0.029297,0.002112,0.953467,-1.381264,0.427797,-3.502886
to,0.03094,0.02913,0.03164,0.03467,0.03057,0.001058,0.349857,-1.361604,1.011748,3.876791
of,0.02106,0.02371,0.02107,0.03102,0.021947,0.001247,-0.711113,1.414206,-0.703093,7.276878
a,0.02064,0.03,0.02704,0.02555,0.025893,0.003906,-1.344843,1.051299,0.293544,-0.087893
in,0.01755,0.02188,0.02022,0.01642,0.019883,0.001784,-1.308162,1.119413,0.188749,-1.941686
that,0.01746,0.00839,0.00807,0.00821,0.011307,0.004353,1.413577,-0.670032,-0.743544,-0.711383
is,0.01298,0.01175,0.01326,0.0073,0.012663,0.000656,0.482825,-1.392568,0.909744,-8.177528
it,0.00998,0.00716,0.00387,0.00091,0.007003,0.002497,1.192166,0.062746,-1.254912,-2.440403
are,0.00859,0.0086,0.00906,0.00091,0.00875,0.000219,-0.72979,-0.684178,1.413968,-35.759716


In [68]:
# Now, Delta:
# Delta is a distance value, so the bigger it is, the more distant the query text and the sample text are.

def delta(df): # calculates delta score between the column a in the given dataframe and the query text
    delta_a = round(sum(list(abs(df["z_a"]-df["z_q"]))) / len(df),5)
    delta_b = round(sum(list(abs(df["z_b"]-df["z_q"]))) / len(df),5)
    delta_c = round(sum(list(abs(df["z_c"]-df["z_q"]))) / len(df),5)
    return delta_a, delta_b, delta_c, 'are Delta distance values between the query text and text a, b, c, respectively.'

delta(zscore)

(4.97545,
 4.91509,
 4.62071,
 'are Delta distance values between the query text and text a, b, c, respectively.')

Since the frequency of MFW will be heavily influenced by the size of the corpus, a relative term frequency (*rtf*) is more useful. <br> Burrows proposes a standardized score Z. Z score is obtained through the following process:
A word(type) *t*i's relative frequency *rtf*i,j in a document *Dj* is computed.
Then we substract the mean*i* and divide by the standard deviation.

#### Z-score
I will proceed as following:
1. I will create a list of most frequent words from each author. --> I could use the same function again - collections.Counter().most_common(150). From this I create the MFW list for each author. The Counter dictionary will be used later to fetch frequency value (to compare it to the number of all the tokens.
2. The three MFW lists will be merged - since there are a lot of overlap, the merged list will be shorter than 450 items. --> this step is not very important, since the end purpose is to compare the MFW frequency to a new text, not to each other.
3. The absolute frequency chart can be created at this stage.
4. For the relative frequency list, the results from (3) can be exploited; I will take these numbers and divide it by the number of all tokens for each text.
5. From (4), we can calculate mean and sd for each item and calculate Z-score.
6. I want to have a neat chart or graph to visualize the results.
7. I also want to know how much of the original texts MFWs take up. I will use the three lists from (1) and calculate this against each text.


Problem: in the first step, I will already have the MFW list and the items' frequency in the respective texts. In (3) and (4), there are some new items to be counted (i.e. an item that was not in the top 150 list for author A but author B or C has to be tested against author A as well). Can I avoid having to calculate everything once again? Or, will calculating everything once again be easier than creating a list of items that were not in (at least) one of the MFW lists and calculating only them in (3) and (4)? (I could try and see.)

#### Profile-based Distance value (Savoy(2020:p.37-38))
Using the Z-score, I can compare an author profile(or text) A and a mystery query text Q.
The process looks like the following:
1. Get the MFW list for Q and compute the Z-scores the same way
2. The distance value Delta(A,Q): get the sum of the absolute value of the Z-score difference for each term and divide it by the number of terms.
3. Do the same for author B and C, and compare the distance values. The closest profile is "the least unlikely" to be the author of the text.

#### instance-based distance value (Savoy p.38)
Not sure if this one is worth trying.

Note: It might be better to find a different way to remove punctuation marks, with more liberty in choosing what marks to remove. The current code has a list of predetermined punctuation marks, but often I want to remove all marks but full stops(.) etc.
<br>Note_2: comparing the texts only according to use of pronouns might be interesting.

#### some possible amendments/suggestions to this method
1. Contracted forms are important?
2. Pronouns shouldn't be included in the list?
3. One can remove terms occuring with very high frequency in one text (e.g. personal names in a novel)

### Kullback-Leibler Divergence method
if the very frequent word-types of a given language correspond to the functional words, it is not required to define them according to a corpus. I could simply define them prior to investigating the corpus.
I might be use the "function_words" list for this, too.
The degree of disagreement between the two probabilistic distributions can be evaluated. (Formula: p.40) The more similar the two texts are, the smaller the resulting value is.

### Labbe's Intertextual distance method (p.43)
This method takes the entire lexis (all word-types) into account.
This method also takes each text from a single author as a *surrogate*. I might have to import each short text instead of using one long merged .txt file, by which the size of each sample will be very small.

## 3. Machine learning models
I am studying the methods and codes in this study: https://github.com/alexanderbira/Using-Machine-Learning-Techniques-for-Authorship-Attribution . <br>
This uses 1. predetermined "stylistic features" and 2. BERT, and find the latter more useful.

## 4. Ways forward
1. Analysing texts using the methods mentioned above: on two sets of text sets. (And first of all, finish collecting text for set 2) <br>
The two sets differ in the domain(topic) of the included texts. The first set involves 3 blogs(authors) about gardening, and the other includes another 3 blogs(authors) about true crime. <br> I haven't finished collecting text for the latter set, but when I do, I want to make it a lot bigger than the gardening set. For many of these methods, the corpus size seems to be an important factor.


2. I want to try these methods with different thresholds and corpus size. <br>
Intuitively the result will be more accurate/useful with bigger corpus size, but how big is big? Is there a certain threshold where the improvement in accuracy according to the corpus size is not meaningful anymore? (In which case we could use it as a practical maximum corpus size.) <br> Or is there a threshold where there is a huge jump in the index values (in which case we could use it as a minimum corpus size.)? <br>
It will be able to use some good visualization, too.


