<a href="https://colab.research.google.com/github/madelinewilson/intro-digital-humanities/blob/main/IDH_Final_Maddy_Wilson.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Model of William Shakespeare's Sonnets
by Maddy Wilson


## Introduction

This program creates a topic model visualization of Shakespeare's *Sonnets*, a collection of 154 sonnets on a variety of themes, first published in 1609. The corpus was acquired from Project Gutenberg. The program uses the pandas and spaCy libraries to clean the text by lemmatizing it and removing stopwords, and then builds a dataframe from the result. It saves that dataframe as a .csv file containing the author (Shakespeare), the title of the work (The Sonnets), the text of each passage, and its lemmas. Then, using latent Dirichlet allocation (LDA), the program builds a visual topic model that presents patterns of co-occurring words across all 154 sonnets as topics. The visualization may reveal connections across the collection that would be difficult for a human reader to identify. What topics emerge across the body of sonnets? How does Shakespeare's use of "love" or "beauty" change in different contexts?

## Code

In [1]:
# importing the necessary libraries and tools for a topic model
import requests
import pandas as pd
import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipes('ner', 'parser')

['ner', 'parser']

In [3]:
# URL is a .txt file of William Shakespeare's Sonnets
response = requests.get('https://ia800300.us.archive.org/5/items/shakespearessonn01041gut/wssnt10.txt')
text = response.text

In [4]:
# making sure the URL presents the text correctly
text[:1000]

"*******The Project Gutenberg Etext of Shakespeare's Sonnets******\r\n#2 in our series by William Shakespeare\r\n\r\n[#1 in our series is the Complete Works of Shakespeare,\r\nas presented to use by the World Library, copyrighted.\r\nWe will be presenting those as individual plays, now that\r\nwe have we have reached the presentation of Etext #1,000]\r\n\r\nThis Etext was prepared by the Project Gutenberg Shakespeare Team.\r\nThis Etext is an independent production presented as Public Domain.\r\n\r\n\r\nCopyright laws are changing all over the world, be sure to check\r\nthe copyright laws for your country before posting these files!!\r\n\r\nPlease take a look at the important information in this header.\r\nWe encourage you to keep this file on your own disk, keeping an\r\nelectronic path open for the next readers.  Do not remove this.\r\n\r\n\r\n**Welcome To The World of Free Plain Vanilla Electronic Texts**\r\n\r\n**Etexts Readable By Both Humans and By Computers, Since 1971**\r\n\r\n

In [5]:
# locating the beginning of the actual text
text.find('From fairest creatures we desire increase,')

11875

In [6]:
# locating the end of the actual text
text.find('End of The Project Gutenberg Etext of Shakespeare')

110188

In [7]:
# setting the start and end offsets of the sonnets within the downloaded file
start = 11875
end = 110188 - 1

In [8]:
# setting a variable for the text
sonnets = text[start:end]
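
The offsets 11875 and 110188 are specific to this copy of the file. Below is a minimal sketch of deriving them from the marker strings instead, so that the slice still works if the header length changes. `extract_between` is a hypothetical helper name, not part of the original notebook, and the sample string is made up for illustration.

```python
def extract_between(full_text, start_marker, end_marker):
    """Return full_text from start_marker up to (not including) end_marker."""
    start = full_text.find(start_marker)
    # str.find returns -1 when the marker is missing, which would silently
    # produce a wrong slice, so fail loudly instead
    if start == -1:
        raise ValueError("start marker not found")
    # search for the end marker only after the start position
    end = full_text.find(end_marker, start)
    if end == -1:
        raise ValueError("end marker not found")
    return full_text[start:end]


# tiny made-up stand-in for the downloaded file
sample = "HEADER\r\nFrom fairest creatures\r\nlast line\r\nEnd of The Project"
body = extract_between(sample, "From fairest", "End of The Project")
print(repr(body))
```

In the notebook, the same call could take `text` together with the two marker strings already used in the `text.find` cells above.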

In [9]:
# splitting the text into paragraphs
sonnets_paras = sonnets.split('\r\n\r\n')
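
Because the sonnet numbers sit in paragraphs of their own, this split returns headings such as `II` and `III` as separate entries alongside the sonnets (they appear as rows 1 and 3 of the dataframe). Below is a hedged sketch of filtering them out before building the dataframe, assuming every heading consists only of Roman-numeral letters. `drop_numeral_headings` is a hypothetical helper, not part of the original notebook.

```python
import re

# matches a paragraph made up solely of Roman-numeral letters, e.g. 'II' or 'CLIV'
NUMERAL_HEADING = re.compile(r'[IVXLC]+')


def drop_numeral_headings(paragraphs):
    """Keep only paragraphs that are neither empty nor a bare Roman-numeral heading."""
    kept = []
    for para in paragraphs:
        stripped = para.strip()
        if stripped and not NUMERAL_HEADING.fullmatch(stripped):
            kept.append(para)
    return kept


paras = ["From fairest creatures we desire increase,", "II",
         "When forty winters shall besiege thy brow,", "III", ""]
print(drop_numeral_headings(paras))
```

Applied to `sonnets_paras`, this should leave roughly one entry per sonnet, though that count has not been verified here.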

In [10]:
# creating empty lists for the dataset
author = []
title = []

In [11]:
# filling the author and title lists with one entry per paragraph
for para in sonnets_paras:
    author.append('William Shakespeare')
    title.append('The Sonnets')

In [12]:
# creating a dataframe
sonnets_df = pd.DataFrame(list(zip(author, title, sonnets_paras)), columns=['author', 'title', 'text'])

In [13]:
# displaying the first five rows of the dataframe
sonnets_df.head()

|   | author | title | text |
|---|---|---|---|
| 0 | William Shakespeare | The Sonnets | From fairest creatures we desire increase,\r\n... |
| 1 | William Shakespeare | The Sonnets | II |
| 2 | William Shakespeare | The Sonnets | When forty winters shall besiege thy brow,\r\n... |
| 3 | William Shakespeare | The Sonnets | III |
| 4 | William Shakespeare | The Sonnets | Look in thy glass and tell the face thou viewe... |

In [39]:
# Elizabethan stopwords that supplement spaCy's default English list (all lowercase)
E_STOPWORDS = {'thy', 'thou', 'thee', 'thine', 'o', 'shalt', 'shall', 'art', 'doth', 'dost'}

# processes text by removing line breaks, stopwords, and punctuation, then lemmatizing
def process_text(text):
    # the source file uses Windows line endings, so handle '\r\n' as well as '\n'
    text = text.replace('\r\n', ' ').replace('\n', ' ')
    doc = nlp(text)
    lemmas = [
        token.lemma_.lower()
        for token in doc
        # compare in lowercase so capitalized forms such as 'Thy' are removed too
        if token.is_alpha and not token.is_stop and token.text.lower() not in E_STOPWORDS
    ]
    return ' '.join(lemmas)

In [40]:
# adding a new column to the dataframe for lemmas and applying the process function
sonnets_df['lemmas'] = sonnets_df['text'].apply(process_text)

In [41]:
# displaying the first five rows again, now with the lemmas column
sonnets_df.head()

|   | author | title | text | lemmas |
|---|---|---|---|---|
| 0 | William Shakespeare | The Sonnets | From fairest creatures we desire increase,\r\n... | fair creature desire increase beauty rose die ... |
| 1 | William Shakespeare | The Sonnets | II | ii |
| 2 | William Shakespeare | The Sonnets | When forty winters shall besiege thy brow,\r\n... | winter besiege brow dig deep trench beauty fie... |
| 3 | William Shakespeare | The Sonnets | III | iii |
| 4 | William Shakespeare | The Sonnets | Look in thy glass and tell the face thou viewe... | look glass tell face viewest time face form fr... |

In [42]:
# saving the dataframe as a .csv
sonnets_df.to_csv('shakespeare_dataframe.csv', index=False)

In [43]:
# installing the necessary libraries
! pip install funcy



In [19]:
! pip install tzdata

Collecting tzdata
  Downloading tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Installing collected packages: tzdata
Successfully installed tzdata-2023.3


In [20]:
! pip install --no-dependencies pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Installing collected packages: pyLDAvis
Successfully installed pyLDAvis-3.4.1


In [21]:
! pip install wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9655 sha256=18492c4dff45f6afba85ec21fc07bd2d6e17b344f931cb3e1231cf1057e056e5
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [44]:
from collections import defaultdict
import wget
from gensim import corpora, models
from google.colab import files  # <-- only necessary if you are using Colab
import io # <-- only necessary if you are using Colab
import pandas as pd
import pyLDAvis.gensim_models  # pyLDAvis 3.3+ renamed its gensim helper module from pyLDAvis.gensim
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [45]:
documents = sonnets_df['lemmas'].to_list()

In [46]:
# checking whether any documents are floats (i.e. NaN) rather than strings
floats = []
for d in documents:
    if isinstance(d, float):
        floats.append(d)

In [47]:
floats

[]

In [48]:
clean = sonnets_df.dropna()


In [49]:
# saving the clean dataframe as a csv (index=False avoids writing an extra index column)
clean.to_csv('shakespeare_dataframe_clean.csv', index=False)

In [50]:
# reading the csv as a dataframe
df = pd.read_csv('shakespeare_dataframe_clean.csv')

In [51]:
df.head()

|   | author | title | text | lemmas |
|---|---|---|---|---|
| 0 | William Shakespeare | The Sonnets | From fairest creatures we desire increase,\r\n... | fair creature desire increase beauty rose die ... |
| 1 | William Shakespeare | The Sonnets | II | ii |
| 2 | William Shakespeare | The Sonnets | When forty winters shall besiege thy brow,\r\n... | winter besiege brow dig deep trench beauty fie... |
| 3 | William Shakespeare | The Sonnets | III | iii |
| 4 | William Shakespeare | The Sonnets | Look in thy glass and tell the face thou viewe... | look glass tell face viewest time face form fr... |


In [52]:
# tokenizing each document by splitting its lemma string on whitespace
texts = [document.lower().split() for document in documents]

In [53]:
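# counting how often each token occurs across the whole corpus
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

In [54]:
# keeping only tokens that occur more than once in the corpus
texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]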

In [55]:
dictionary = corpora.Dictionary(texts)

In [56]:
corpus = [dictionary.doc2bow(text) for text in texts]
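
Conceptually, the dictionary maps each distinct token to an integer id, and `doc2bow` turns a document into (id, count) pairs. The following plain-Python illustration shows that idea only. It does not reproduce gensim's internals, and the ids gensim assigns may differ. `build_vocab` and `to_bow` are hypothetical names.

```python
from collections import Counter


def build_vocab(token_lists):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_bow(tokens, vocab):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in tokens if token in vocab)
    return sorted(counts.items())


docs = [["love", "beauty", "love"], ["time", "love"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Each inner list in `bows` plays the role of one entry in `corpus`: word order is discarded and only the counts per token remain, which is the input LDA works from.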

In [57]:
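# fitting an LDA model with 20 topics; without a fixed random_state the topics can differ between runs
lda_model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, passes=50)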

In [58]:
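import pyLDAvis.gensim_models  # the gensim helpers live in this module in pyLDAvis 3.3 and later

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
vis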

## Discussion

Because the corpus is so small (154 fourteen-line poems), the topic model is less expansive than it would be with a larger corpus, and the topics it produces are quite repetitive. However, certain patterns do emerge across these topics.

Based on the topic model created from the *Sonnets* dataset, "love" appears to be the most salient term across all of the sonnets, and topics 1 and 2 are both related to it. This is not a surprise, since many of the sonnets are known to be about love. It is interesting, however, that the word recurs across so many different topics and alongside quite different words. Topic 1 also includes "beauty," "time," "sweet," and "old," so it seems to gather words associated with affectionate love towards another person, or love towards something ephemeral. Topic 2, by contrast, includes "know," "eye," "truth," and "good." The "love" with which topic 2 is associated may be more of an intellectual or spiritual love, or one that is more nebulous and less directed towards a particular person or object. A further development of this project could explore these various associations with love through a philosophical lens to see how they are similar or different.
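
The comparison of topics 1 and 2 can be made explicit with set operations. The small sketch below uses only the terms named in the paragraph above. The real topics contain more terms, and `compare_topics` is a hypothetical helper rather than part of the notebook.

```python
def compare_topics(terms_a, terms_b):
    """Return the terms two topics share and the terms unique to each."""
    set_a, set_b = set(terms_a), set(terms_b)
    return {
        "shared": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }


# terms taken from the discussion above, not the full topic-term lists
topic_1 = ["love", "beauty", "time", "sweet", "old"]
topic_2 = ["love", "know", "eye", "truth", "good"]
result = compare_topics(topic_1, topic_2)
print(result)
```

With a fitted gensim model, the term lists could instead come from `lda_model.show_topic(topic_id)`, which returns (word, probability) pairs. Note that pyLDAvis by default orders topics by size and numbers them from 1, so its labels generally do not match gensim's topic ids.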

Topics 15 and 18 are also interesting. They are among the most distinct from the others, as the visualization shows by placing them towards the outer edges of the axes. Topic 18 does not include the word "love" at all but does include "hate," and is one of the few topics to do so. Based on this topic, it appears that the left side of the plot is less associated with the word and idea of "love." The most salient term for topic 15 is "think," and its other words include "world" and "earth." "Think" is also associated with topic 2, which we could perhaps call intellectual love, whereas "world" and "earth" are not associated with topic 2 at all but are associated with topic 1, which we could perhaps call earthly or ephemeral love. Thus the distinction drawn by the two most salient topics holds as we examine other topics throughout the corpus.

In short, even a cursory topic model analysis of Shakespeare's sonnets can reveal an interesting pattern: the word "love" carries different connotations as it is used across the poems. Exploring this difference further, especially in connection with philosophical analyses of love (I am thinking particularly of Spinoza), could be enlightening as to how Elizabethan conceptions of love varied across usage, and how they resemble or differ from the way we understand the concept today.