# 9. Diction

The term 'diction' generally refers to the stylistic choices that are made by an author while writing a text. A study of the diction of an author may concentrate, among other things, on the words that are chosen. In stylometric research, it can be interesting to study the words that are characteristic of a given author, and to examine how the words that are chosen by one author differ from the words chosen by other authors. 

This notebook explains how you can compare the words in two different texts, and how you can identify the words that are distinctive within each of these texts.

## Defining 'subcorpora'

The comparison is made between two 'subcorpora'. Within the overarching corpus, you may have collected texts by two different authors, texts in two different genres, or texts from different periods. The code below first enables you to define such subcorpora. In the lists named `corpus1` and `corpus2`, you need to list all the texts from the two subcorpora whose words you want to compare.

The code in this notebook carries out a comparative analysis of the diction in two texts only: *Through the Looking Glass* and *Heart of Darkness*. The former text is added as an item to `corpus1`, and the latter is added to `corpus2`.

In [None]:
dir = 'Corpus'

corpus1 = [ 'ThroughtheLookingGlass.txt' ]
corpus2 = [ 'HeartofDarkness.txt' ]

## Calculating frequencies

The code in the cell below reads in the full text of the files that are listed in `corpus1`. In this case, we are dealing with one text file only. The text is then tokenised and lowercased, and punctuation and English stopwords are removed. Next, we calculate the frequencies of the remaining words. These frequencies are stored in a dictionary named `freq1`.

Once the first subcorpus has been processed, the code does the same for the texts in `corpus2`. The word frequencies are placed in a dictionary named `freq2`.

After running this code, the variable `full_text1` will contain the *complete*, concatenated text of all the files in `corpus1`. The dictionary named `freq1` will contain the frequencies of all the words in this full text, with the exception of the stopwords.

The variables `full_text2` and `freq2` store the same type of information for the texts in `corpus2`.
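
The counting step itself can be tried out on a toy sentence before it is applied to a whole novel. The sketch below is an illustration only: it splits on whitespace instead of using `word_tokenize()`, it does not remove stopwords, and the names `toy_text` and `toy_freq` are made up for this example.

```python
from collections import Counter

# A made-up sentence, for illustration only
toy_text = "the cat sat on the mat and the dog sat too"
tokens = toy_text.split()

# Same counting idiom as in the notebook: start at 0 and add 1 per occurrence
toy_freq = dict()
for w in tokens:
    toy_freq[w] = toy_freq.get(w, 0) + 1

# collections.Counter from the standard library produces the same counts
assert toy_freq == dict(Counter(tokens))

print(toy_freq['the'], toy_freq['sat'])   # 3 2
```

A dictionary built in this way maps each word to the number of times it occurs, which is all the information the comparison of two subcorpora needs.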

In [None]:
from tdmh import *
from os.path import join
from nltk import word_tokenize

from nltk.corpus import stopwords

stopwords = stopwords.words('english')


def tokenise_remove_stopwords(full_text):
    # split the text into tokens, lowercase them, and keep only
    # alphanumeric tokens that are not English stopwords
    words = word_tokenize(full_text)
    new_list= []
    for w in words:
        w = w.lower().strip()
        if w.isalnum() and w not in stopwords:
            new_list.append( w )
    return new_list


freq1 = dict()
full_text1 = ''

for text in corpus1:
    print('Reading ' + text + ' ... ')
    with open( join( dir,text) ) as file_handler:
        full_text1 += file_handler.read() + ' '

words = tokenise_remove_stopwords( full_text1  )

for w in words:
    freq1[w] = freq1.get(w,0) +1
    
        
        
freq2 = dict()
full_text2 = ''
    
for text in corpus2:
    print('Reading ' + text + ' ... ')
    with open( join( dir,text) ) as file_handler:
        full_text2 += file_handler.read() + ' '

words = tokenise_remove_stopwords(  full_text2 )

for w in words:
    freq2[w] = freq2.get(w,0) +1
    


## Dunning's log likelihood

One of the statistical methods that can be used to find distinctive words is *Dunning's log likelihood*. In short, it analyses the distinctiveness of a word in one set of texts compared to the texts in a reference corpus, by calculating probabilities based on word frequencies. A good explanation of the formula can be found on the [WordHoard](https://wordhoard.northwestern.edu/userman/analysis-comparewords.html#loglike) website.
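
For reference, the quantity that is computed in this notebook can be written out as follows, where $a$ and $b$ are the frequencies of the word in the first and the second corpus, and $c$ and $d$ are the total numbers of tokens in these two corpora. The symbols follow the variable names used in the `log_likelihood()` code further down; the notation is a sketch of that code and not a quotation from Dunning's paper.

$$
E_1 = \frac{c\,(a+b)}{c+d}, \qquad E_2 = \frac{d\,(a+b)}{c+d}, \qquad G^2 = 2\left(a \ln\frac{a}{E_1} + b \ln\frac{b}{E_2}\right)
$$

$E_1$ and $E_2$ are the frequencies that would be expected if the word were equally common in both corpora.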

Using the frequencies that have been calculated above, the cell below calculates the Dunning log likelihood scores for all of the words that occur both in `corpus1` and in `corpus2`. The actual calculation takes place in a function named `log_likelihood()`. The scores that are calculated are all stored in a dictionary named `ll_scores`.

The formula that is implemented in the `log_likelihood()` function returns a number which can be either positive or negative. A positive score indicates that the word is relatively more frequent in the first corpus. A negative score indicates that the word is more common in the second corpus. The tokens that are assigned the highest scores, in other words, are also the most distinctive of the first corpus.
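
A small worked example may help to read these scores. The sketch below re-implements the calculation on made-up counts; the function name `signed_g2()` and the numbers are hypothetical. Unlike the notebook's own function, it also accepts a count of zero, on the assumption that a zero count contributes nothing to the sum.

```python
import math

def signed_g2(a, b, c, d):
    """Signed log likelihood for a word that occurs a times in a corpus
    of c tokens and b times in a corpus of d tokens."""
    # expected frequencies if the word were equally common in both corpora
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    # assumption: a zero count adds nothing (x * ln(x) tends to 0)
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    g2 *= 2
    # negative sign when the word is relatively rarer in the first corpus
    return g2 if a / c >= b / d else -g2

# made-up counts in two corpora of 1000 tokens each
print(round(signed_g2(30, 10, 1000, 1000), 3))   # 10.465
print(round(signed_g2(10, 30, 1000, 1000), 3))   # -10.465
print(signed_g2(20, 20, 1000, 1000))             # 0.0
```

Swapping the two counts only changes the sign of the score, and a word that is equally frequent in both corpora receives a score of zero.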

The code below lists the 25 words with the highest positive log likelihood scores, i.e. the words that are most distinctive of the first corpus.

In [None]:
import math

def log_likelihood( word_count1 , word_count2, total1 , total2 ):

    # a, b: frequency of the word in corpus 1 and in corpus 2
    # c, d: total number of tokens in corpus 1 and in corpus 2
    a = word_count1
    b = word_count2
    c = total1
    d = total2

    # expected frequencies if the word were equally common in both corpora
    E1 = c*(a+b)/(c+d)
    E2 = d*(a+b)/(c+d)

    ln1 = math.log(a/E1)
    ln2 = math.log(b/E2)
    G2 = 2*((a* ln1) + (b* ln2))

    # make the score negative if the word is less frequent
    # in corpus 1 than expected
    if a < E1:
        G2 = -G2

    return G2



ll_scores = dict()

total1 = 0
total2 = 0

for word1 in freq1:
    total1 += freq1[word1]
for word2 in freq2:
    total2 += freq2[word2]

for word in freq1:
    if word in freq2:

        ll_score = log_likelihood( freq1[word] , freq2[word] , total1 , total2 )
        ll_scores[word] = ll_score

# named max_words so that the built-in max() is not overwritten
max_words = 25
i = 0

for word in sortedByValue(ll_scores , ascending = False ):
    print( word , ll_scores[word] )
    i += 1
    if i == max_words:
        break

Words with negative log likelihood scores are more likely to appear in the reference corpus (i.e. the second corpus) than in the first corpus. 

The code below lists the 25 words with the lowest (most negative) scores.

In [None]:
max_words = 25
i = 0

for word in sortedByValue(ll_scores ) :
    print( word , ll_scores[word] )
    i += 1
    if i == max_words:
        break

## Mann-Whitney ranks test

In a [blog post on identifying literary diction](https://tedunderwood.com/2011/11/09/identifying-the-terms-that-characterize-an-author-or-genre-why-dunnings-may-not-be-the-best-method/), Ted Underwood argues that Dunning's log likelihood function also has a number of disadvantages. It is sensitive to outliers, for example. He explains that the Mann-Whitney ranks test can be a good alternative.

To perform the Mann-Whitney ranks test, we first need to collect the vocabulary of the two corpora to be compared, i.e. all the words that occur in at least one of them. Next, we need to divide the full texts of the two corpora into smaller chunks, all of the same size. These can be fragments of 500 words, for instance. Next, we need to count the number of times each word occurs in each of these chunks. As a final step, the test ranks all the chunks from both corpora by these counts, and checks whether the chunks from one corpus systematically rank higher than the chunks from the other corpus. If it is found, using these steps, that a word is consistently more common in one of the two corpora, this word can be identified as a distinctive word. The Mann-Whitney ranks test looks at occurrences across the whole corpus, and it neutralises the effect of exceptionally high counts in one or two of these chunks.
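
The effect of a single exceptional chunk can be illustrated on made-up data. The sketch below is an illustration under stated assumptions: the three toy 'corpora', the word 'whale', the chunk size of 10 and the helper `counts_per_chunk()` are all hypothetical and are not part of this notebook.

```python
from scipy.stats import mannwhitneyu

def counts_per_chunk(words, target, size):
    """Count how often `target` occurs in each consecutive chunk of `size` words."""
    return [words[i:i + size].count(target) for i in range(0, len(words), size)]

# Three made-up 'corpora' of 100 tokens each; 'x' stands for any other word
steady = (['whale'] + ['x'] * 9) * 10      # one 'whale' in every chunk
bursty = ['whale'] * 10 + ['x'] * 90       # ten times 'whale', all in the first chunk
reference = ['x'] * 100                    # 'whale' never occurs

size = 10
steady_counts = counts_per_chunk(steady, 'whale', size)
bursty_counts = counts_per_chunk(bursty, 'whale', size)
reference_counts = counts_per_chunk(reference, 'whale', size)

# two-sided test of each toy corpus against the reference
p_steady = mannwhitneyu(steady_counts, reference_counts, alternative='two-sided').pvalue
p_bursty = mannwhitneyu(bursty_counts, reference_counts, alternative='two-sided').pvalue

print(steady_counts, p_steady)
print(bursty_counts, p_bursty)
```

In both toy corpora the word occurs ten times in total, so a measure that is based only on overall frequencies cannot tell them apart. In this sketch, the ranks test yields a small p-value only for the corpus in which the word is spread evenly over the chunks.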

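The Mann-Whitney test can be performed in Python using the `mannwhitneyu()` function from the `scipy.stats` module. For each word, the code below stores the resulting p-value in a dictionary named `rho`. The p-value is made negative if the test statistic indicates that the word is more common in the second corpus.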

In [None]:
from scipy.stats import mannwhitneyu

## make a list of all the words in both corpora
words1 = tokenise_remove_stopwords(full_text1)
words2 = tokenise_remove_stopwords(full_text2)

def divide_into_chunks(words, length):

    chunks=[]
    ## chunk contains dictionaries
    # with word frequencies
    
    for i in range(0, len(words), length):
        counts = dict()
        for j in range(length):
            if i+j < len(words):
                word = words[i+j]
                counts[word] = counts.get(word,0)+1
        chunks.append(counts)
    return chunks


length = 500
chunks1 = divide_into_chunks(words1,length)
chunks2 = divide_into_chunks(words2,length)


# vocab is the union of terms in both sets
all_words = dict()
    
for chunk in chunks1:
    for word in chunk:
        all_words[word]= all_words.get(word,0) + 1
for chunk in chunks2:
    for word in chunk:
        all_words[word]= all_words.get(word,0) + 1
    
rho =  dict()
    
for word in all_words:
        
    a=[]
    b=[]
        
    for chunk in chunks1:
        a.append(chunk.get(word,0))
    for chunk in chunks2:
        b.append(chunk.get(word,0))

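    # two-sided test: do the chunk counts in a and b differ systematically?
    # in recent SciPy versions, stat is the U statistic of the first sample (a)
    stat,pval=mannwhitneyu(a,b, alternative="two-sided")
    # value of U that is expected if there is no difference
    mean =len(chunks1)*len(chunks2)*0.5
    # make the p-value negative if the word is more typical of corpus 2
    if stat <= mean:
        pval = 0 - pval

    rho[word]= ( pval )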


The words that are most distinctive in corpus 1 have a positive value. The closer this value is to zero, the more distinctive the word is.
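
The sign convention can be illustrated with a handful of made-up values. The sketch below assumes, as in the code above, that a positive signed p-value points to corpus 1 and a negative one to corpus 2. The dictionary `toy_rho` and the function `most_distinctive()` are hypothetical and are not part of this notebook or of the `tdmh` module; the built-in `sorted()` is used in place of `sortedByValue()`.

```python
# Made-up signed p-values, for illustration only
toy_rho = {
    'alice': 0.0001, 'queen': 0.02, 'said': 0.7,
    'river': -0.0003, 'darkness': -0.04, 'time': -0.9,
}

def most_distinctive(signed_p, corpus, n=25):
    """Return up to n words for corpus 1 or 2, smallest p-value first."""
    if corpus == 1:
        candidates = {w: p for w, p in signed_p.items() if p > 0}
    else:
        # drop the sign so that both corpora can be sorted in the same way
        candidates = {w: -p for w, p in signed_p.items() if p < 0}
    return sorted(candidates, key=candidates.get)[:n]

print(most_distinctive(toy_rho, 1))   # ['alice', 'queen', 'said']
print(most_distinctive(toy_rho, 2))   # ['river', 'darkness', 'time']
```

In both lists the words with the smallest p-values come first, which is why values close to zero mark the most distinctive words on either side.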

In [None]:
print( "The following words are most distinctive in corpus 1:" )  

i = 0
max_words = 25

for word in sortedByValue( rho ):
    if rho[word] > 0:
        print( f'{word}\t{rho[word]:.22f}' ) 
        i += 1
        if i == max_words:
            break

The words that are most distinctive in corpus 2 have a negative value. Here too, the values that are closest to zero indicate the most distinctive words.

In [None]:
print( "The following words are most distinctive in corpus 2:"  )  

i = 0
max_words = 25

for word in sortedByValue( rho , ascending = False ) :
    if rho[word] < 0:
        print( f'{word}: {rho[word]:.22f}' ) 
        i += 1
        if i == max_words:
            break

## Bibliography

* Dunning, Ted, 'Accurate Methods for the Statistics of Surprise and Coincidence', in *Computational Linguistics*, 19:1 (1993).
* Rayson, P. and Garside, R., 'Comparing corpora using frequency profiling', in *Proceedings of the workshop on Comparing Corpora, held in conjunction with the 38th annual meeting of the Association for Computational Linguistics (ACL 2000)* (2000).
* Mann, H. and Whitney, D., 'On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other', in *Ann. Math. Statist.*, 18:1 (1947). <https://doi.org/10.1214/aoms/1177730491>
* Kilgarriff, Adam, 'Comparing Corpora', in *International Journal of Corpus Linguistics*, 6:1 (2001). <https://doi.org/10.1075/ijcl.6.1.05kil>

# Exercises

## Exercise 9.1

Can you compare the diction of *Pride and Prejudice* and *Ulysses* using the Mann-Whitney ranks test?

In [None]:
dir = 'Corpus'

corpus1 = [ 'PrideandPrejudice.txt' ]
corpus2 = [ 'Ulysses.txt' ]


def tokenise_remove_stopwords(full_text):
    words = word_tokenize(full_text)
    new_list= []
    for w in words:
        w = w.lower().strip()
        orig = ''
        if w.isalnum() and w not in stopwords:
            new_list.append( w )
    return new_list


full_text1 = ''
full_text2 = ''

for text in corpus1:
    print('Reading ' + text + ' ... ')
    with open( join( dir,text) ) as file_handler:
        full_text1 += file_handler.read() + ' '

for text in corpus2:
    print('Reading ' + text + ' ... ')
    with open( join( dir,text) ) as file_handler:
        full_text2 += file_handler.read() + ' '

from scipy.stats import mannwhitneyu

## make a list of all the words in both corpora
words1 = tokenise_remove_stopwords(full_text1)
words2 = tokenise_remove_stopwords(full_text2)

def divide_into_chunks(words, length):

    chunks=[]
    ## chunk contains dictionaries
    # with word frequencies
    
    for i in range(0, len(words), length):
        counts = dict()
        for j in range(length):
            if i+j < len(words):
                word = words[i+j]
                counts[word] = counts.get(word,0)+1
        chunks.append(counts)
    return chunks


length = 500
chunks1 = divide_into_chunks(words1,length)
chunks2 = divide_into_chunks(words2,length)


# vocab is the union of terms in both sets
all_words = dict()
    
for chunk in chunks1:
    for word in chunk:
        all_words[word]= all_words.get(word,0) + 1
for chunk in chunks2:
    for word in chunk:
        all_words[word]= all_words.get(word,0) + 1
    
rho =  dict()
    
for word in all_words:
        
    a=[]
    b=[]
        
    for chunk in chunks1:
        a.append(chunk.get(word,0))
    for chunk in chunks2:
        b.append(chunk.get(word,0))

    stat,pval=mannwhitneyu(a,b, alternative="two-sided")
    mean =len(chunks1)*len(chunks2)*0.5
    if stat <= mean:
        pval = 0 - pval
            
    rho[word]= ( pval )
    
print( f"\nThe following words are most distinctive in {corpus1}" )  

i = 0
max_words = 25

for word in sortedByValue( rho ):
    if rho[word] > 0:
        print( f'{word}' ) 
        i += 1
        if i == max_words:
            break
            

print( f"\nThe following words are most distinctive in {corpus2}" )  

i = 0
max_words = 25

for word in sortedByValue( rho , ascending = False ) :
    if rho[word] < 0:
        print( f'{word}' ) 
        i += 1
        if i == max_words:
            break