# IST664 Natural Language Processing 
## Homework 1 
### Martin Alonso 
### 2019-04-27

For this homework, we'll be working with two texts by James Joyce. The first, _The Dead_, is a novella and the last story in the _Dubliners_ collection. The second is Joyce's final, most ambitious, and most questionable work, _Finnegans Wake_, which plays with myriad forms of English words, transforming sentences so that each can carry, at a bare minimum, three separate interpretations.  
We will compare the two works to understand how Joyce's use of language changed over the course of his writing career, both in the words he uses most frequently and in the bigrams those words form. Word frequency scores will also be computed for both works.  
Finally, we want to answer how Joyce's choice of word pairs changes over time, allowing him to build more complex mental imagery with far more interpretative possibilities than at the beginning of his writing career, when his style was more straightforward. For this, we will see how Joyce's early writing focuses on characters and the actions they perform, while his last work focuses on actions themselves, along with adjectives, adverbs, and nouns. 

#### Obtaining the Corpora
Both corpora were obtained from the University of Adelaide's library collection. The books were downloaded in ebook format and converted to text files outside the Python script.  
No further text manipulation was done to either text, keeping the title, author, and entire corpus. 

#### Text Preprocessing
We will begin by loading the _nltk_ and *re* packages, which will allow us to process both texts and do some initial cleaning, such as lowercasing and stemming words. 

In [37]:
# Loading required packages
import nltk
from nltk.corpus import stopwords
from nltk import bigrams
from nltk.collocations import BigramCollocationFinder
import re 

In [2]:
# Open texts. Latin-1 encoding is used because of Joyce's penchant for borrowing words (and characters) from other languages. 
text1 = open("James Joyce The Dead.txt", "r", encoding="latin-1")
text2 = open("James Joyce Finnegans Wake.txt", "r", encoding="latin-1")

In [4]:
# Reading the texts, then closing the file handles
the_dead = text1.read()
finnegans_wake = text2.read()
text1.close()
text2.close()
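
One caveat about the encoding choice is worth a quick check. The sketch below uses a hypothetical sample word, not the actual files. It shows that Latin-1 decoding never raises an error, so if a converted text file were actually stored as UTF-8 (an assumption we have not verified here), accented characters would be silently read as two wrong characters rather than failing loudly.

```python
# Hypothetical sample: a word with an accented character, stored as UTF-8 bytes.
raw_bytes = "café".encode("utf-8")

# Latin-1 maps every byte to a character, so decoding never raises an error...
as_latin1 = raw_bytes.decode("latin-1")

# ...but only the matching codec recovers the original word.
as_utf8 = raw_bytes.decode("utf-8")

print(repr(as_latin1))
print(repr(as_utf8))
```

If the first line printed shows stray characters in place of the accent, the file encoding and the `encoding` argument do not match.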

We will first lowercase every word within each corpus. Then, we will remove stopwords from both corpora; however, we will not add any extra stopwords to nltk's stopwords list at this stage. If any additional words show up within the corpora that are deemed unnecessary, they will be added to the list and the corpora will be reprocessed.  
Finally, we will stem every token within the corpora. This step comes with two caveats. Given how Joyce uses complex language and sometimes replaces certain letters within words to create different meanings and puns, we will limit ourselves to the less aggressive Porter stemmer. Nevertheless, this could leave several very similar tokens, especially within _Finnegans Wake_, which may in turn lead to many near-duplicate tokens and bigrams.  
We look forward to finding these, though, as they could lead to more interesting insight into Joyce's idiosyncratic use of language to create mental imagery.   

In [5]:
# Tokenize words to check length of corpora
the_dead_tokens = nltk.word_tokenize(the_dead)
finnegans_wake_tokens = nltk.word_tokenize(finnegans_wake)

# Print token length
print('The Dead: {} tokens'.format(len(the_dead_tokens)))
print('Finnegans Wake: {} tokens'.format(len(finnegans_wake_tokens)))

The Dead: 17983 tokens
Finnegans Wake: 258500 tokens


In [6]:
# Create lowercase function that returns a list with every token in a corpus lowercased. 
def lowercase(text):
    return [word.lower() for word in text]

In [7]:
# Lowercase the tokens in each text to make stop word removal, stemming, and the
# frequency distributions easier for each corpus. 
td_lower = lowercase(the_dead_tokens)
fw_lower = lowercase(finnegans_wake_tokens)

In [9]:
# Remove stop words
stop_words = set(stopwords.words('english'))  # a set makes the membership checks below much faster
td_lower_stop = [word for word in td_lower if word not in stop_words]
fw_lower_stop = [word for word in fw_lower if word not in stop_words]

In [10]:
# Count the new number of tokens now that stopwords have been removed.
print('The Dead: {} tokens'.format(len(td_lower_stop)))
print('Finnegans Wake: {} tokens'.format(len(fw_lower_stop)))

The Dead: 10548 tokens
Finnegans Wake: 160874 tokens


This is very interesting. By removing stopwords from both corpora, we see that _The Dead_ loses roughly 7,400 tokens, while _Finnegans Wake_ loses roughly 97,600. Proportionally, _The Dead_ loses 41.3 percent of its tokens, while _Finnegans Wake_ loses 37.8 percent.  
This may mean that Joyce not only avoids more traditional stopwords in his later novel but, given his manipulation of characters within tokens in _Finnegans Wake_, may also have created new forms of stopwords that elude nltk's stopwords list.   
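
As a quick check on these proportions, the arithmetic can be reproduced from the four token counts printed above. This is a minimal sketch; the helper name `share_removed` is ours and not part of the original notebook.

```python
# Token counts reported in the cells above (before and after stopword removal).
counts = {
    "The Dead": (17983, 10548),
    "Finnegans Wake": (258500, 160874),
}


def share_removed(before, after):
    """Return the percentage of tokens dropped by the stopword filter."""
    return 100 * (before - after) / before


for title, (before, after) in counts.items():
    print(f"{title}: {before - after} tokens removed "
          f"({share_removed(before, after):.1f} percent)")
```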

In [11]:
# stem the words 
porter = nltk.PorterStemmer()

In [12]:
# Stem every remaining token in each corpus with the Porter stemmer
td_stemmed = [porter.stem(word) for word in td_lower_stop]
fw_stemmed = [porter.stem(word) for word in fw_lower_stop]


In [17]:
# Remove special characters from each stemmed token and drop any tokens left empty.
# The Dead
td_clean = [re.sub(r'[^a-zA-Z0-9]+', '', word) for word in td_stemmed]
td_clean = list(filter(None, td_clean))

#Finnegans Wake
fw_clean = [re.sub(r'[^a-zA-Z0-9]+', '', word) for word in fw_stemmed]
fw_clean = list(filter(None, fw_clean))
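
One side effect of this ordering is worth noting: the frequency lists further down still contain tokens such as 'i', 'it', and 'he', which are in nltk's English stopwords list. A possible explanation, which we have not verified against the corpora, is that some tokens only turn into stopwords once stemming and cleaning have changed them. The sketch below uses hypothetical tokens and a tiny stand-in stopword set to show how a second stopword pass after cleaning would catch them.

```python
import re

# Tiny stand-in for nltk's English stopwords list (assumption: illustration only).
stop_words = {"i", "it", "he", "the"}

# Hypothetical tokens that still carry punctuation after tokenization.
tokens = ["it.", "gabriel", "'i", "he--", "aunt", "..."]

# First pass, as in the cells above: none of these match a stopword exactly.
first_pass = [t for t in tokens if t not in stop_words]

# Same cleaning regex as the cell above, dropping tokens that become empty.
cleaned = [re.sub(r"[^a-zA-Z0-9]+", "", t) for t in first_pass]
cleaned = [t for t in cleaned if t]

# Second pass: remove stopwords that only surfaced after cleaning.
second_pass = [t for t in cleaned if t not in stop_words]

print(cleaned)
print(second_pass)
```

Whether a second pass is desirable depends on the analysis; we keep the original single pass in what follows so that the reported counts stay unchanged.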

#### Text Analysis
We'll now move on to checking the 50 most common unigrams and bigrams for the corpora. We'll then review mutual information scores for each bigram, provided that the bigram occurs at least five times in its corpus. 

In [32]:
# Use NLTK's FreqDist function to calculate token counts on the cleaned corpora
td_freq = nltk.FreqDist(td_clean)
fw_freq = nltk.FreqDist(fw_clean)

# Print the 50 most common tokens in The Dead...
print("The 50 most common words in The Dead are:\n")
print(td_freq.most_common(50))

The 50 most common words in The Dead are:

[('said', 187), ('gabriel', 158), ('aunt', 109), ('mr', 79), ('kate', 75), ('miss', 67), ('brown', 59), ('julia', 56), ('mari', 54), ('jane', 53), ('would', 50), ('ask', 48), ('one', 44), ('like', 40), ('go', 39), ('i', 39), ('well', 38), ('come', 35), ('malin', 34), ('face', 34), ('freddi', 33), ('hand', 33), ('voic', 33), ('eye', 32), ('turn', 31), ('o', 31), ('good', 29), ('look', 29), ('back', 27), ('see', 27), ('could', 26), ('it', 26), ('ivor', 26), ('room', 25), ('know', 25), ('darci', 25), ('never', 24), ('long', 24), ('moment', 24), ('gretta', 24), ('old', 23), ('think', 23), ('came', 22), ('year', 22), ('still', 22), ('mrs', 22), ('smile', 22), ('he', 22), ('head', 21), ('answer', 21)]


In [33]:
# ...and the 50 most common in Finnegans Wake. 
print("The 50 most common words in Finnegans Wake are:\n")
print(fw_freq.most_common(50))

The 50 most common words in Finnegans Wake are:

[('one', 686), ('like', 589), ('old', 469), ('would', 355), ('time', 342), ('say', 311), ('us', 306), ('two', 287), ('well', 280), ('come', 271), ('man', 269), ('may', 269), ('till', 268), ('see', 267), ('let', 254), ('make', 246), ('first', 237), ('go', 226), ('way', 224), ('know', 218), ('good', 218), ('ever', 207), ('tell', 195), ('look', 192), ('could', 192), ('never', 185), ('love', 181), ('ye', 180), ('take', 177), ('upon', 172), ('three', 172), ('day', 169), ('everi', 163), ('back', 162), ('still', 162), ('littl', 157), ('hear', 156), ('call', 153), ('mr', 152), ('hand', 148), ('it', 148), ('night', 147), ('made', 147), ('ill', 146), ('round', 143), ('long', 141), ('yet', 140), ('eye', 139), ('shall', 138), ('name', 138)]


It's interesting to note that _The Dead_ is more character driven than _Finnegans Wake_. The most common words in the first corpus refer to characters, their titles, and actions involving those characters, especially what each character says.  
_Finnegans Wake_, however, is more focused on actions, which is interesting given that the novel is very convoluted and has no clear storyline, only a stream of ideas that are open to interpretation.  
We'll now repeat the same exercise for bigrams, finding the 50 most common ones. 
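
Before running the finder, it may help to see what a raw bigram frequency amounts to. The sketch below uses only the standard library and a short hypothetical token list standing in for `td_clean`. It counts adjacent token pairs and divides by the number of tokens, which, as far as we can tell from the scores reported below, is the normalisation that `raw_freq` applies.

```python
from collections import Counter

# Hypothetical cleaned tokens, standing in for td_clean.
tokens = ["aunt", "kate", "said", "aunt", "julia", "said", "aunt", "kate"]

# Count each pair of adjacent tokens.
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Raw frequency: bigram count divided by the number of tokens
# (assumption: this mirrors the normalisation used by NLTK's raw_freq).
n_tokens = len(tokens)
raw_freq = {pair: count / n_tokens for pair, count in bigram_counts.items()}

# Print the pairs from most to least frequent, ties in alphabetical order.
for pair, score in sorted(raw_freq.items(), key=lambda item: (-item[1], item[0])):
    print(pair, score)
```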

In [40]:
# Create the bigram finder and score bigram frequency
# We'll first work with The Dead 
bigram_measures = nltk.collocations.BigramAssocMeasures()
td_finder = BigramCollocationFinder.from_words(td_clean)
td_scored = td_finder.score_ngrams(bigram_measures.raw_freq)

In [44]:
# Print the first 50 bigram scores for The Dead
for bscore in td_scored[:50]:
    print (bscore)

(('aunt', 'kate'), 0.008059592135791917)
(('mari', 'jane'), 0.006472096715105629)
(('mr', 'brown'), 0.005373061423861277)
(('aunt', 'julia'), 0.0041519111002564415)
(('said', 'aunt'), 0.0035413359384540237)
(('said', 'gabriel'), 0.0032971058737330567)
(('freddi', 'malin'), 0.002930760776651606)
(('miss', 'ivor'), 0.002930760776651606)
(('said', 'mr'), 0.001953840517767737)
(('said', 'mari'), 0.0018317254854072536)
(('bartel', 'darci'), 0.00170961045304677)
(('mr', 'bartel'), 0.0015874954206862866)
(('mrs', 'conroy'), 0.001465380388325803)
(('mr', 'darci'), 0.0013432653559653192)
(('ask', 'gabriel'), 0.0012211503236048357)
(('miss', 'morkan'), 0.0012211503236048357)
(('gabriel', 'said'), 0.0010990352912443521)
(('miss', 'dali'), 0.0009769202588838686)
(('mrs', 'malin'), 0.0009769202588838686)
(('ask', 'mr'), 0.000854805226523385)
(('kate', 'said'), 0.000854805226523385)
(('ladi', 'gentlemen'), 0.000854805226523385)
(('old', 'gentleman'), 0.000854805226523385)
(('said', 'miss'), 0.000854

The bigrams for _The Dead_ confirm the idea suggested by the token frequency distribution: most of the novella is character driven through conversation. The bigrams either name a character or describe what a character has said, done, or asked.  
We'll now review the bigrams for _Finnegans Wake_. 

In [45]:
# Bigram scorer for Finnegans Wake
fw_finder = BigramCollocationFinder.from_words(fw_clean)
fw_scored = fw_finder.score_ngrams(bigram_measures.raw_freq)

# Print the first 50 bigram scores for Finnegans Wake
for bscore in fw_scored[:50]:
    print (bscore)

(('let', 'us'), 0.00030433638218069355)
(('poor', 'old'), 0.0001316049220240837)
(('come', 'back'), 0.00012337961439757848)
(('anna', 'livia'), 0.000106928999144568)
(('look', 'like'), 0.000106928999144568)
(('one', 'one'), 0.000106928999144568)
(('ah', 'ho'), 9.870369151806278e-05)
(('tell', 'us'), 9.870369151806278e-05)
(('one', 'time'), 9.047838389155755e-05)
(('shaun', 'repli'), 9.047838389155755e-05)
(('tell', 'tell'), 9.047838389155755e-05)
(('would', 'like'), 9.047838389155755e-05)
(('ay', 'ay'), 8.225307626505231e-05)
(('everi', 'time'), 7.402776863854709e-05)
(('grand', 'old'), 7.402776863854709e-05)
(('one', 'two'), 7.402776863854709e-05)
(('see', 'see'), 7.402776863854709e-05)
(('thousand', 'one'), 7.402776863854709e-05)
(('could', 'tell'), 6.580246101204185e-05)
(('hide', 'seek'), 6.580246101204185e-05)
(('old', 'man'), 6.580246101204185e-05)
(('one', 'day'), 6.580246101204185e-05)
(('say', 'noth'), 6.580246101204185e-05)
(('wait', 'till'), 6.580246101204185e-05)
(('well', 

The first thing that's noticeable about the 50 most common bigrams is that many are word repetitions: 'ay ay', 'ho ho', and 'hand hand' are some examples. The second is that there are almost no characters compared to _The Dead_. There are three (Anna Livia, Shaun, and the old man), a much reduced number. What's also interesting is that there are recurring instances of counting within the corpus ('one two', 'two three').  
But, though most of the text is action based and for the most part reads like rambling nonsense, there is a bigram ('four', 'us') that shows how Joyce borrows the word 'four' to replace 'for' and create a new meaning. It would take looking at the entire sentence to (slightly) understand why this change was made. Nevertheless, we can appreciate how he mixes up word usage to convey multiple ideas.  
  
Finally, let's look at bigrams using Pointwise Mutual Information (PMI) to see whether the top-ranked token combinations change and what they can tell us about these two corpora. For this last part, we'll use the pmi function rather than raw frequency because PMI measures how much more often two tokens occur together than their individual frequencies would predict. 
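
To make the measure concrete, here is a minimal sketch of the PMI calculation, log2 of P(x, y) / (P(x) P(y)), written with the standard library only. The counts for 'aunt' and 'kate' come from the frequency list above. The corpus size of 8189 cleaned tokens and the 66 occurrences of ('aunt', 'kate') are not printed anywhere in the notebook; they are assumptions inferred from the raw frequency scores (for example, 0.008059... is 66 / 8189).

```python
import math


def pmi(bigram_count, first_count, second_count, n_tokens):
    """Pointwise mutual information: log2( P(x, y) / (P(x) * P(y)) )."""
    p_xy = bigram_count / n_tokens
    p_x = first_count / n_tokens
    p_y = second_count / n_tokens
    return math.log2(p_xy / (p_x * p_y))


# Assumption inferred from the raw frequency scores: 8189 cleaned tokens.
N_TOKENS = 8189

# 'aunt' (109) and 'kate' (75) are from the frequency list; 66 is inferred.
aunt_kate = pmi(bigram_count=66, first_count=109, second_count=75,
                n_tokens=N_TOKENS)

# Hypothetical pair whose two tokens occur five times each, always together.
rare_pair = pmi(bigram_count=5, first_count=5, second_count=5,
                n_tokens=N_TOKENS)

print(round(aunt_kate, 3), round(rare_pair, 3))
```

The second value is much larger than the first even though the pair is far less frequent. PMI rewards tokens that rarely appear apart, which is why a minimum frequency filter is applied before scoring.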

In [49]:
# Apply a minimum bigram frequency of five to The Dead and print the top 50 bigrams. 
td_finder.apply_freq_filter(5)
td_scored = td_finder.score_ngrams(bigram_measures.pmi)
for bscore in td_scored[:50]:
    print (bscore)

(('michael', 'furey'), 10.677543477645322)
(('jolli', 'gay'), 10.192116650475079)
(('gay', 'fellow'), 8.999471572532684)
(('old', 'gentleman'), 8.283264538533277)
(('bartel', 'darci'), 8.256079709207047)
(('mrs', 'conroy'), 8.037539613366205)
(('young', 'men'), 7.803074359729179)
(('ladi', 'gentlemen'), 7.784458681561835)
(('gone', 'away'), 7.6071541497539235)
(('freddi', 'malin'), 7.452577112645049)
(('mari', 'jane'), 7.244584070369218)
(('miss', 'dali'), 6.933382382074912)
(('miss', 'ocallaghan'), 6.933382382074912)
(('miss', 'furlong'), 6.93338238207491)
(('miss', 'ivor'), 6.817905164654977)
(('young', 'man'), 6.677543477645321)
(('could', 'hear'), 6.620959949278953)
(('mr', 'bartel'), 6.489239946888153)
(('mrs', 'malin'), 6.452577112645047)
(('miss', 'morkan'), 6.447955554904668)
(('mr', 'brown'), 6.2724793936310395)
(('aunt', 'kate'), 6.046862676618332)
(('aunt', 'julia'), 5.511395166948493)
(('mr', 'darci'), 5.511266253218151)
(('o', 'mr'), 4.063422608856067)
(('ask', 'mr'), 3.91

Using PMI with a minimum bigram frequency of five, we find that there are only 38 qualifying bigrams. Furthermore, these all refer to characters and actions associated with those characters. Joyce's novella is thus mostly a work about several characters and how they act toward and address one another, rather than a work built around a repeating idea or ideas, as _Finnegans Wake_ appears to be. 
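
The role of the frequency filter can be illustrated with a toy example. The sketch below uses hypothetical tokens and the standard library only; `top_pair` is a helper of ours that imitates the effect of `apply_freq_filter` by discarding bigrams seen fewer than `min_count` times before ranking by PMI.

```python
import math
from collections import Counter

# Hypothetical tokens: 'aunt kate' recurs, 'jolli gay' appears only once.
tokens = ["aunt", "kate", "said", "aunt", "kate", "ask", "aunt", "kate",
          "jolli", "gay", "said", "aunt", "julia"]

word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_tokens = len(tokens)


def pmi(pair):
    """PMI of an adjacent pair, computed from counts over the token list."""
    first, second = pair
    joint = bigram_counts[pair] * n_tokens
    return math.log2(joint / (word_counts[first] * word_counts[second]))


def top_pair(min_count):
    """Highest-PMI pair among bigrams seen at least min_count times."""
    kept = [pair for pair, count in bigram_counts.items() if count >= min_count]
    return max(kept, key=pmi)


print(top_pair(min_count=1))  # filter effectively off
print(top_pair(min_count=3))  # only recurring bigrams survive
```

Without the filter, a pair seen once between two words that occur nowhere else takes the top spot; with it, only bigrams that actually recur are ranked.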

In [50]:
# Repeat the same exercise for Finnegans Wake
fw_finder.apply_freq_filter(5)
fw_scored = fw_finder.score_ngrams(bigram_measures.pmi)
for bscore in fw_scored[:50]:
    print (bscore)

(('whoish', 'whoish'), 13.598717183444904)
(('nin', 'nin'), 13.054997664955627)
(('puff', 'pive'), 12.98460833706423)
(('jarl', 'von'), 12.343062307976707)
(('von', 'hoother'), 12.343062307976707)
(('hee', 'hee'), 12.0435020261178)
(('anna', 'livia'), 11.930160553041855)
(('n', 'k'), 11.869131119644297)
(('marcu', 'lyon'), 11.705632387361415)
(('rann', 'rann'), 11.399645836343074)
(('shaun', 'repli'), 10.579441081809447)
(('wather', 'part'), 10.448555436824021)
(('hide', 'seek'), 9.59229091428547)
(('hat', 'lipoleum'), 9.284168618923138)
(('rain', 'rain'), 9.136611430509284)
(('ay', 'ay'), 8.812547591277927)
(('six', 'seven'), 8.496607566230162)
(('ho', 'ho'), 8.35400317078736)
(('ah', 'ho'), 8.296241451223715)
(('hold', 'hard'), 8.27863043538171)
(('tri', 'hide'), 7.975619553836976)
(('im', 'sorri'), 7.913742006344588)
(('stone', 'stone'), 7.8125475912779265)
(('order', 'order'), 7.8106814050644235)
(('la', 'bell'), 7.681558252382558)
(('right', 'enough'), 7.520500231217458)
(('dont',

Again, we find that most bigrams in _Finnegans Wake_ feature word repetition, which may evoke a character's impatience (or perhaps the author's) to keep the action flowing. There are many descriptive bigrams involving adjectives, nouns, or adverbs, but there is no clear idea or character. Except for 'anna livia', who seems to be a character, and the constant word repetition, most of the bigrams are action based. Joyce clearly moves away from character-centred writing toward more action-based writing, which is evident given that only four characters appear here, compared to 13 in _The Dead_.   

For both texts, the raw frequency and PMI bigrams do not differ much. Both sets of bigrams for _The Dead_ reference multiple characters who appear throughout the novella. Furthermore, these characters appear alongside their title, their familial relationship, or the word 'said', meaning that characters are mostly speaking or being referred to by other characters.  

Regarding _Finnegans Wake_, the bigrams are mostly similar, though PMI puts more weight on word repetition than the raw frequency distribution does. The issue with _Finnegans Wake_ is that the novel does not make much sense at face value, and it needs a more in-depth reading to draw any definite idea of what it is about. The raw frequency shows that most bigrams are action driven, joined by adjectives or adverbs; this lessens under PMI. 

Overall, within each text, the raw frequency and PMI bigrams do not differ much. There is, however, a clear difference between the two corpora, showing how Joyce's writing style evolves from the last story in _Dubliners_, his first collection of short stories, to his final work. 

#### Conclusions
As we can see, there is a definite difference in Joyce's writing style between the two texts. The evolution takes place over 25 years, starting with _The Dead_, published as part of _Dubliners_ in 1914, and ending with _Finnegans Wake_, first published in 1939.  

It's interesting how, when contemplating Joyce's oeuvre, most of his work is very character focused. In _Ulysses_, however, there is a mix of both elements, with the novel focusing on characters and their actions, along with surreal elements in its later parts. _Finnegans Wake_ finally submerges itself deeply in the surreal aspect of Joyce's writing, with characters taking a secondary role and the language coming to the forefront. This is evident in that, apart from 'anna livia' and 'shaun repli', almost no characters are mentioned within the bigrams for the novel. 

What is also interesting is how the bigrams are ranked. In the case of _The Dead_, the raw frequency scores show a mix of character names and these characters speaking. When looking at the PMI scores, however, the character name bigrams receive the highest scores, because the two tokens in a name rarely appear apart. The bigrams that involve the token 'said' still appear, but are grouped together among the lower scores.  
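
The reshuffling can be seen directly from the scores printed earlier. The sketch below copies the raw frequency and PMI scores for six bigrams from _The Dead_ that appear in both outputs and ranks them under each measure; the selection of these six is ours, for illustration.

```python
# (raw frequency, PMI) copied from the two outputs above for shared bigrams.
scores = {
    ("aunt", "kate"): (0.008059592135791917, 6.046862676618332),
    ("mari", "jane"): (0.006472096715105629, 7.244584070369218),
    ("mr", "brown"): (0.005373061423861277, 6.2724793936310395),
    ("bartel", "darci"): (0.00170961045304677, 8.256079709207047),
    ("mrs", "conroy"): (0.001465380388325803, 8.037539613366205),
    ("old", "gentleman"): (0.000854805226523385, 8.283264538533277),
}

# Rank the same bigrams by each measure, highest score first.
by_raw_freq = sorted(scores, key=lambda pair: scores[pair][0], reverse=True)
by_pmi = sorted(scores, key=lambda pair: scores[pair][1], reverse=True)

for pair in by_raw_freq:
    print(pair, "raw rank", by_raw_freq.index(pair) + 1,
          "PMI rank", by_pmi.index(pair) + 1)
```

Within this subset, 'aunt kate' drops from first place under raw frequency to last under PMI, since 'aunt' and 'kate' each also occur in many other contexts, while 'old gentleman' moves from last to first.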
  
In the case of _Finnegans Wake_, the difference in how the bigrams are scored can be noticed in the number of bigrams that repeat the same token (e.g. 'rann rann' or 'whoish whoish'). These are more prominent in the PMI scores than in the raw frequency scores, but still appear in the latter. 

There are clear differences in Joyce's writing style between his first novella and his last novel. Though these differences are evident between the two corpora, the PMI and raw frequency scores for each text are not completely different: many of the same bigrams appear but are ranked differently. 


#### Resources
* https://ebooks.adelaide.edu.au/j/joyce/james/j8d/chapter15.html
* https://ebooks.adelaide.edu.au/j/joyce/james/j8f/complete.html