## Topic Analysis: Anne of Green Gables Series

The Canadian author L.M. Montgomery published over 20 novels and many short stories and poems. One of her best known is the series beginning with *Anne of Green Gables*, which was published in 1908. The series follows Anne Shirley, an orphan who is adopted by an older couple on Prince Edward Island. The eight books cover Anne's adoption at the age of 11, her years in school, teaching, and university, her first years of marriage, and finally the experience of first world war, where the focus is on her 15 year old daughter.

The content of the novels evolves as Anne and her family move through many different life experiences. We can use various natural language processing techniques to take a closer look at these novels and see how they are both similiar and different. For this project, we're going to look at the topics in the novels

For this project, we'd like to look at the corpus of the *Anne* novels and see what sort of topics they contain. Question we could ask are "Which novels are most similar to each other?" or "Are any of the novels really different from the rest?".

As Anne said in *Anne of Green Gables*: "People laugh at me because I use big words. But if you have big ideas, you have to use big words to express them, haven’t you?"

### Topic analysis: Introduction

First, let's define what a topic is. A piece of text is usually written about a specific topic, like astronomy or cats or cooking. We would expect to find similar words in any documents that are about the same topic. Articles and books about astronomy would include terms like "planet", "orbit", and "telescope". A text about cooking would have different words; for example, "mixing", "oven", and "diced" are common terms used in cooking. But, these documents on two different topics would also have words in common, such as "temperature", "scale", and "measure".

So how do we determine which "topics" are represented in a document? Math! We can use various mathematical and statistical techniques to find out what these topics are and their representation in a text. This module will focus on a topic modeling technique called [latent Dirichlet allocation(LDA)](https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0).

#### Latent Dirichlet allocation

The basis of the LDA model is that it assumes that a document is a mixture of topics and that all the words in the document (after removing stop words and stemming/lemmatization) belong to a topic. A simpler way might be to say each document is a mixture of topics and each topic is a mixture of words.

More generally, LDA is an unsupervised learning (clustering) technique, where the clusters are the topics. You can also think about representing a document in "topic space" in the same way that we use word embeddings to represent a word in a vector space.

One of the parameters to specify when fitting an LDA model is the number of topics. This isn't something that can be measured or determined before actually assigning words to topics. But, the number of topics can be optimized by determining how well different numbers of topics works by measuring the performance using [topic coherence](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0).

### Document preparation

For this topic analysis, we'll use six of the eight *Anne of Green Gables* novels; these six are in the public domain. Each of the six novels was downloaded from [Project Gutenberg](https://www.gutenberg.org/), cleaned of the Gutenberg beginning and ending text, and uploaded to a [repository](https://github.com/nwhoffman/sentiment).

After reading in each of the text files, we'll create a string of text for each novel, and append them to a list; this is our corpus or collection of documents.

In [26]:
# Imports
import urllib.request
import re

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [2]:
# List of text files in the corpus
text_files = ['green_gables', 'avonlea', 'anne_island',
             'house_dreams', 'rainbow_valley', 'rilla_ingleside']

In [3]:
# Load all text files
path = "https://raw.githubusercontent.com/nwhoffman/sentiment/master/text_files/"

# Initial a list to hold the novels (documents)
all_text = []

for text in text_files:
    
    text_temp = []

    for line in urllib.request.urlopen(path+text+'.txt'):
        text_temp.append(line.decode('utf-8'))
    
    # Combine into a single string of text
    text_str = ''.join(text_temp)
    # Append to the list of all text files
    all_text.append(text_str)

### Cleaning the raw text

The text downloaded from Gutenberg has some extra stuff that we need to clean: `\n` characters, extra spaces, and for some reason, lots of ` . . .`. We'll also convert all characters to lower case.

In [4]:
def clean_text(text):
    """
    Accepts a single text document and uses regex and string methods
    to clean the text.
    """
    # remove newlines and extra punctuation
    text = text.replace("\n", " ")
    text = text.replace(" . . .", ".")
    
    # remove spaces
    multi_white_spaces = "[ ]{2,}"
    text = re.sub(multi_white_spaces, " ", text)
    
    # apply case normalization and remove extra space
    return text.lower().lstrip().rstrip()

In [5]:
# Clean the text with the above function

# Initialize a list to hold the cleaned text
all_text_cleaned = []
for doc in all_text:
    all_text_cleaned.append(clean_text(doc))

# Look at the results
all_text_cleaned[1][:400]

'i an irate neighbor a tall, slim girl, “half-past sixteen,” with serious gray eyes and hair which her friends called auburn, had sat down on the broad red sandstone doorstep of a prince edward island farmhouse one ripe afternoon in august, firmly resolved to construe so many lines of virgil. but an august afternoon, with blue hazes scarfing the harvest slopes, little winds whispering elfishly in t'

This text looks clean and ready for tokenization and lemmatization!

### Tokenization



In [6]:
# Import NLP library: spacy
import spacy

# Download the language model if you haven't already
#python -m spacy download en_core_web_sm

In [7]:
# load in the spaCy language model
nlp = spacy.load("en_core_web_sm")

Before we lemmatize the text, we'll want to remove stop words. Having run a topic model earlier, we also noticed that the words "mr" and "mrs" were very common and should be added to a custom stop word list to be removed. We can do additional filtering of the lemmatized tokens later to remove tokens that represent extremes - words that are either very common or not very common.

In [8]:
# Print out the list of default stop words
#print(nlp.Defaults.stop_words)

# Add to default stop word list
custom = {"mrs", "mr", "dr", "em", "ma'am"}
nlp.Defaults.stop_words |= custom

all_stopwords = nlp.Defaults.stop_words

# Define the function to remove stop words and lemmatize
def tokenize(text):
    """
    Parse a raw string return lemmas
    """
    
    doc = nlp(text)
    lemmas = []
    
    for token in doc:
        if (token.is_stop == False) and (token.is_punct == False) and (token.pos != 'PRON'):
            lemmas.append(token.lemma_)
            
    # Do one more pass to remove lemmas that are now stop words
    # e.g. "said" becomes "say" when lemmatized
    lemmas_nostop = [word for word in lemmas if not word in all_stopwords]
    
    return lemmas_nostop

In [9]:
# Create all document lemmas

# Inititalize a list to hold the lemmas
all_lemmas = []
for book in all_text_cleaned:
    all_lemmas.append(tokenize(book))

In [27]:
# View some of the lemmas to spot-check the results
all_lemmas[1][975:1000]

['donnell',
 'thing',
 'rent',
 'peter',
 'sloane',
 'old',
 'house',
 'peter',
 'hire',
 'man',
 'run',
 'mill',
 'belong',
 'east',
 'know',
 'shiftless',
 'timothy',
 'cotton',
 'family',
 'white',
 'sand',
 'simply',
 'burden',
 'public',
 'consumption']

### Topic model

Now that we have lemmas for each of our documents, we can create a topic model. We'll use the `gensim` library and utilize an LDA model (described abmove) to find the topics in our collection of novels.


#### Topic modeling steps:

* Create the dictionary to match each lemma to an id
* Filter out extremes of lemma frequencies
* Create the corpus which is the collection of all the words in all of the documents
* Train the topic model!

In [28]:
# Import topic model library: gensim
import gensim
import gensim.corpora as corpora
from gensim import models

In [29]:
# Create dictionary
id2word = corpora.Dictionary(all_lemmas)

# How many words do we have in the dictionary?
print('Words in dictionary: ', len(id2word.keys()))

# Apply some filtering
# Keep tokens which are contained in at least 2 documents
# Keep tokens which are contained in no more than 0.95 
id2word.filter_extremes(no_below=2, no_above=0.95)

print('Words in dictionary (after filtering): ', len(id2word.keys()))

Words in dictionary:  13085
Words in dictionary (after filtering):  5571


In [13]:
# Create corpus
corpus = [id2word.doc2bow(text) for text in all_lemmas]

In [14]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters
num_topics = 10
chunksize = 500
passes = 20
iterations = 25
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Instantiate the LDA model
model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

Now that we've modeled the topics, let's see what they look like. Remember that a topic is composed of a mixture of words and a word can occur in more than one topic. Topics can also be either very similar or very different from each other.

In [15]:
# Print out 5 words for each topic

topics = model.print_topics(num_words=5)
for topic in topics:
    print(topic)

(0, '0.000*"rilla" + 0.000*"susan" + 0.000*"jim" + 0.000*"cornelia" + 0.000*"walter"')
(1, '0.000*"leslie" + 0.000*"jim" + 0.000*"captain" + 0.000*"cornelia" + 0.000*"susan"')
(2, '0.035*"una" + 0.029*"meredith" + 0.023*"rosemary" + 0.018*"ellen" + 0.017*"cornelia"')
(3, '0.000*"susan" + 0.000*"rilla" + 0.000*"meredith" + 0.000*"una" + 0.000*"cornelia"')
(4, '0.000*"rilla" + 0.000*"susan" + 0.000*"davy" + 0.000*"walter" + 0.000*"cornelia"')
(5, '0.040*"rilla" + 0.031*"susan" + 0.013*"walter" + 0.013*"davy" + 0.012*"jem"')
(6, '0.023*"davy" + 0.022*"phil" + 0.012*"priscilla" + 0.011*"redmond" + 0.010*"patty"')
(7, '0.043*"leslie" + 0.037*"captain" + 0.037*"jim" + 0.031*"cornelia" + 0.017*"dick"')
(8, '0.034*"matthew" + 0.015*"barry" + 0.010*"allan" + 0.009*"ruby" + 0.007*"pye"')
(9, '0.000*"susan" + 0.000*"rilla" + 0.000*"jim" + 0.000*"cornelia" + 0.000*"matthew"')


Because it can sometimes be difficult to interpret the topics, using some sort of visualization can help. The pyLDAvis library is a very useful tool but won't be explain in detail here. Basically, we can "project" the term-topic space onto a 2D plane for easier visualization.

In [30]:
# Import the visualization library
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()

In [31]:
# Feed the LDA model into the pyLDAvis instance
lda_viz = gensimvis.prepare(model, corpus, id2word)

In [32]:
lda_viz