# Diction

The term 'diction' generally refers to the stylistic choices an author makes while writing a text. A study of an author's diction may concentrate, among other things, on the choice of words. In stylometric research, it can be interesting to study the words that are characteristic of a given author, and to examine how these words differ from the words chosen by other authors.

One of the statistical methods that can be used to find such distinctive words is *Dunning's log likelihood*. In short, it measures how distinctive a word is in one set of texts compared to the texts in a reference corpus, by comparing the observed frequency of the word in each corpus with the frequency that would be expected if the word were equally common in both. A good explanation of the formula can be found on the [WordHoard](https://wordhoard.northwestern.edu/userman/analysis-comparewords.html#loglike) website.

This notebook explains how to compare the diction used within two distinct corpora using *Dunning's log likelihood*.

## Defining corpora

As a first step, we need to define the two corpora whose words need to be compared. In this notebook, the words in two early 20th-century novels will be compared to the words used in two 19th-century novels.

The code below defines two lists named `corpus1` and `corpus2`. The files that are mentioned can all be found in a folder named `Corpus`.

In [None]:
dir = 'Corpus'

corpus1 = [ 'ARoomWithAView.txt' , 'SonsandLovers.txt' ]
corpus2 = [ 'Ivanhoe.txt' , 'TreasureIsland.txt' ]
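
Before the files are read, it can help to confirm that they are actually present in the folder. The following is a minimal sketch, assuming the folder layout described above. The helper `missing_files` and the variable `corpus_dir` are illustrative names that are not part of the notebook.

```python
from pathlib import Path

def missing_files(folder, filenames):
    """Return the file names that cannot be found in the given folder."""
    return [name for name in filenames if not (Path(folder) / name).is_file()]

# Same folder and file names as in the cell above
corpus_dir = 'Corpus'
corpus1 = ['ARoomWithAView.txt', 'SonsandLovers.txt']
corpus2 = ['Ivanhoe.txt', 'TreasureIsland.txt']

# Report every file that is listed but not present on disk
for name in missing_files(corpus_dir, corpus1 + corpus2):
    print('Not found:', name)
```

If nothing is printed, all four files were found.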

## Calculating frequencies

The code in the cell below reads the full text of each of the files listed in `corpus1`. Using the function `findWordsFrequencies`, it finds the frequencies of the tokens in each text. These frequencies are added to a dictionary named `freq1`, using the method `update()`.

After this, the code does the same for the texts in `corpus2`. The word frequencies are placed in a dictionary named `freq2`.

In [None]:
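import tdm
from collections import Counter
from os.path import join

def findWordsFrequencies( file ):
    # Count how often each token occurs in a single text file
    freq = dict()
    with open( file ) as file_handler:
        full_text = file_handler.read()

    words = tdm.word_tokenise( full_text )
    for w in words:
        freq[w] = freq.get(w,0) +1
    return freq

# A Counter is a dictionary whose update() method adds counts
# instead of overwriting them, so a word that occurs in several
# texts keeps its combined frequency
freq1 = Counter()

for text in corpus1:
    print(text)
    freq_text = findWordsFrequencies( join(dir,text) )
    freq1.update(freq_text)

freq2 = Counter()

for text in corpus2:
    print(text)
    freq_text = findWordsFrequencies( join(dir,text) )
    freq2.update(freq_text)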
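
When the frequencies of several texts are merged, the counts of words that occur in more than one text need to be added together. The sketch below uses two invented frequency tables to show how `update()` behaves on a plain dictionary and on a `collections.Counter`. The variable names and the numbers are assumptions made for the illustration.

```python
from collections import Counter

# Invented frequency tables for two short texts
text_a = {'the': 3, 'sea': 1}
text_b = {'the': 2, 'ship': 4}

# dict.update() replaces the value of a key that already exists
plain = dict()
plain.update(text_a)
plain.update(text_b)

# Counter.update() adds the new counts to the existing ones
summed = Counter()
summed.update(text_a)
summed.update(text_b)

print(plain['the'], summed['the'])
```

With a plain dictionary, the count for `the` ends up as 2 (the value from the last text), whereas the `Counter` holds the combined count of 5.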


Finally, using the frequencies that have been calculated in this way, the Dunning log likelihood scores are calculated for all of the words that occur in both `corpus1` and `corpus2`. The actual calculation takes place in a function named `log_likelihood()`. The scores that are calculated are all stored in a dictionary named `ll_scores`.

The formula that is implemented in `log_likelihood()` returns a number which can be either positive or negative. A positive score indicates that the word is relatively more likely to be used in the first corpus than in the second corpus. The tokens that are assigned the highest scores, in other words, are also the most distinctive of the first corpus.
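
Written out, the calculation performed in `log_likelihood()` takes the following form. This is a transcription of the code in this notebook, in which $a$ and $b$ are the frequencies of a word in the first and the second corpus, and $c$ and $d$ are the total numbers of tokens in these two corpora. $E_1$ and $E_2$ are the frequencies that would be expected if the word were equally common in both corpora.

$$
E_1 = \frac{c\,(a+b)}{c+d}, \qquad E_2 = \frac{d\,(a+b)}{c+d}
$$

$$
G^2 = 2\left(a \ln\frac{a}{E_1} + b \ln\frac{b}{E_2}\right)
$$

The code reverses the sign of $G^2$ when $a < E_1$, that is, when the word is relatively less frequent in the first corpus than in the second.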

The code below lists the words that are given a positive log likelihood score. 

In [None]:
import tdm
import math

def log_likelihood( word_count1 , word_count2 , total1 , total2 ):

    a = word_count1
    b = word_count2
    c = total1
    d = total2

    # Expected frequencies if the word were equally common in both corpora
    E1 = c*(a+b)/(c+d)
    E2 = d*(a+b)/(c+d)

    # Both counts must be greater than zero, as math.log(0) raises an error
    ln1 = math.log(a/E1)
    ln2 = math.log(b/E2)
    G2 = 2*((a* ln1) + (b* ln2))

    # The word is less frequent in the first corpus than expected:
    # make the score negative
    if a * ln1 < 0:
        G2 = -G2

    return G2



ll_scores = dict()

total1 = 0
total2 = 0

for word1 in freq1:
    total1 += freq1[word1]
for word2 in freq2:
    total2 += freq2[word2]

for word in freq1:
    if word in freq2:

        ll_score = log_likelihood( freq1[word] , freq2[word] , total1 , total2 )
        ll_scores[word] = ll_score
        
for word in reversed( tdm.sortedByValue(ll_scores) ):
    # Stop as soon as the scores are no longer positive
    if ll_scores[word] <= 0:
        break
    print( word , ll_scores[word] )

Words with negative log likelihood scores are more likely to appear in the reference corpus (i.e. the second corpus) than in the first corpus. 

The code below lists the words with the lowest scores. 

In [None]:
for word in tdm.sortedByValue(ll_scores) :
    # Stop as soon as the scores are no longer negative
    if ll_scores[word] >= 0:
        break
    print( word , ll_scores[word] )
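
To get a feel for the sign and the size of the scores, the sketch below restates the calculation in a compact form and applies it to invented counts. The function name `signed_g2` and all the numbers are assumptions made for the illustration. They are not taken from the novels in the corpora.

```python
import math

def signed_g2(a, b, c, d):
    """Signed log likelihood for a word that occurs a times in a corpus
    of c tokens and b times in a corpus of d tokens (a and b above zero)."""
    e1 = c * (a + b) / (c + d)   # expected count in the first corpus
    e2 = d * (a + b) / (c + d)   # expected count in the second corpus
    g2 = 2 * (a * math.log(a / e1) + b * math.log(b / e2))
    # Negative when the word is relatively rarer in the first corpus
    return -g2 if a < e1 else g2

# Invented counts in two corpora of 1,000 tokens each
overused = signed_g2(30, 10, 1000, 1000)    # more frequent in the first corpus
underused = signed_g2(10, 30, 1000, 1000)   # more frequent in the second corpus
balanced = signed_g2(20, 20, 1000, 1000)    # equally frequent in both

print(round(overused, 2), round(underused, 2), round(balanced, 2))
```

The first two scores have the same magnitude (roughly 10.46) but opposite signs, while a word that is equally frequent in both corpora receives a score of zero.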

## Bibliography

* Dunning, Ted, 'Accurate Methods for the Statistics of Surprise and Coincidence', in *Computational Linguistics*, 19:1 (1993).
* Rayson, P. and Garside, R., 'Comparing corpora using frequency profiling', in *Proceedings of the workshop on Comparing Corpora, held in conjunction with the 38th annual meeting of the Association for Computational Linguistics (ACL 2000)* (2000).