# Modeling Shakespeare with PCA

**Complete by: Tuesday 25 Nov. at class time**  
Data: (See below.)

At the start of the semester, we looked at what data analysis could show us about the history of film. Since then we've explored many different subjects where we might expect to find lots of data: sports, ecology, business, health. Now we need to ask: can we use data analysis to understand a subject when we don't have any numbers at all?

Shakespeare might seem like the farthest possible thing from data science, but the reality is that people have been analyzing Shakespeare with data just as long as they've been writing books and essays about him. In this workshop, we'll explore all 38 plays in our Shakespeare dataset using data.

We can use a combination of PCA and clustering to help us understand a question that readers of Shakespeare's plays have argued over for generations: what genre categories do the plays belong to? In the First Folio (the first collected edition of Shakespeare's plays, published in 1623), the publishers attempted to categorize the plays in the table of contents:

<a title="William Shakespeare, Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:First_Folio,_Shakespeare_-_0017.jpg"><img width="512" alt="First Folio, Shakespeare - 0017" src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/First_Folio%2C_Shakespeare_-_0017.jpg/512px-First_Folio%2C_Shakespeare_-_0017.jpg"></a>

This is a reasonable first attempt! We've got a nice even set of 3 categories: Comedy, Tragedy, and History. Scholars have since added a fourth category, Romance or Tragicomedy, that includes plays like *The Tempest*, *The Winter's Tale*, *Cymbeline*, and *Pericles*. Last week, you clustered Shakespeare's plays to determine what potential groupings of plays may exist. **In this week's workshop, you'll use Principal Component Analysis to explore your data and get more accurate clusters.** Here are the steps you should take:

## Instructions

Your report should have the following sections:

1. **Data Wrangling**: Using the same files and the same code as last time, import the Shakespeare data and turn it into a DataFrame of TF-IDF scores. Remember to remove the `.ipynb_checkpoints` row. (N.B. You can add the genre column at any time, but you **don't** want to include this column when you run PCA. It will disrupt your results! Use your data wrangling skills so that the genre column is present only when you need it.)
2. **PCA for Exploration**: Run PCA on the TF-IDF data to reduce it to just *two* dimensions (features), with all the necessary steps. Create a scatter plot of your two principal components, and color the plot according to genre. You'll need to add the list of genres that I've included below to your results DataFrame. Create a second scatter plot of two features (words) from your original data, with colors according to genre. How are these plots similar or different? What are you learning about your data through PCA?
3. **PCA for Modeling**: Run PCA on the TF-IDF data to reduce it to *ten* dimensions. Use the resulting dataset to run K-means clustering again, with all the necessary assessment and validation steps. How are these results similar to or different from the clustering you ran last week?
4. **Conclusion**: Write a brief paragraph summarizing your results. How effective was PCA as a tool for exploration and for modeling? Were your results this week improved over last week's? What other things might you try with this dataset?
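
As a rough illustration of the shape of steps 2 and 3, here is a minimal sketch that runs on made-up data rather than the real plays. Everything in it is an assumption for illustration only: the random `tfidf` DataFrame, the helper name `reduce_with_pca`, the choice to standardize with `StandardScaler`, the use of four clusters, and the silhouette score as one possible validation measure. Your own report should follow the steps and settings we used in class.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the TF-IDF DataFrame: 38 "plays" by 200 "words".
rng = np.random.default_rng(0)
tfidf = pd.DataFrame(
    rng.random((38, 200)),
    index=[f"play-{i}" for i in range(38)],
    columns=[f"word{j}" for j in range(200)],
)

# Assumption: standardize each word column before PCA so that no single
# word dominates the components.
scaled = StandardScaler().fit_transform(tfidf)


def reduce_with_pca(data, index, n_components):
    """Fit PCA and return the fitted model plus a DataFrame of components."""
    pca = PCA(n_components=n_components)
    columns = [f"PC{i + 1}" for i in range(n_components)]
    components = pd.DataFrame(pca.fit_transform(data), index=index, columns=columns)
    return pca, components


# Exploration: two components, which are easy to draw as a scatter plot.
pca2, pcs2 = reduce_with_pca(scaled, tfidf.index, 2)
print(pca2.explained_variance_ratio_)

# Modeling: ten components used as the input to K-means.
pca10, pcs10 = reduce_with_pca(scaled, tfidf.index, 10)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(pcs10)

# Assumption: silhouette score as one way to assess the clustering.
print(silhouette_score(pcs10, labels))
```

Because the data here are random, the explained variance and silhouette values carry no meaning. The point is only the order of operations: prepare the numeric data, reduce it, then cluster the reduced data.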

This should be a polished and clearly formatted report. Remember that in all of these steps **your interpretations are just as important as your code**. You should take time to interpret at each stage of your report, and make sure you are interpreting things *completely, accurately, and in terms of the data*.

```python
import pandas as pd

# Genre label for each play, keyed by the play's file name.
genres = {'much-ado-about-nothing': 'comedy',
 'richard-iii': 'history',
 'the-winters-tale': 'romance',
 'richard-ii': 'history',
 'henry-vi-part-3': 'history',
 'the-two-noble-kinsmen': 'romance',
 'timon-of-athens': 'tragedy',
 'the-merchant-of-venice': 'comedy',
 'loves-labors-lost': 'comedy',
 'troilus-and-cressida': 'tragedy',
 'a-midsummer-nights-dream': 'comedy',
 'henry-iv-part-1': 'history',
 'henry-vi-part-1': 'history',
 'henry-v': 'history',
 'pericles': 'romance',
 'the-merry-wives-of-windsor': 'comedy',
 'as-you-like-it': 'comedy',
 'king-john': 'history',
 'cymbeline': 'romance',
 'alls-well-that-ends-well': 'comedy',
 'henry-viii': 'history',
 'julius-caesar': 'tragedy',
 'the-tempest': 'romance',
 'macbeth': 'tragedy',
 'hamlet': 'tragedy',
 'the-taming-of-the-shrew': 'comedy',
 'coriolanus': 'tragedy',
 'othello': 'tragedy',
 'romeo-and-juliet': 'tragedy',
 'measure-for-measure': 'comedy',
 'antony-and-cleopatra': 'tragedy',
 'henry-vi-part-2': 'history',
 'titus-andronicus': 'tragedy',
 'twelfth-night': 'comedy',
 'henry-iv-part-2': 'history',
 'king-lear': 'tragedy',
 'the-comedy-of-errors': 'comedy',
 'the-two-gentlemen-of-verona': 'comedy'}

# Convert to a Series indexed by play name so it can align with a DataFrame.
genres = pd.Series(genres)
```
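
One possible way to use a genre Series like this is sketched below. It uses a small subset of the plays and a made-up `results` DataFrame standing in for your PCA output, so the names `results`, `features`, and `plays` and the random numbers are all hypothetical. It assumes your results DataFrame is indexed by the same play names as the Series, and it uses plain matplotlib for the plot, which may differ from the plotting library you prefer.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# A few entries from the genre list, as a Series indexed by play name.
genres = pd.Series({
    "hamlet": "tragedy",
    "macbeth": "tragedy",
    "twelfth-night": "comedy",
    "as-you-like-it": "comedy",
    "henry-v": "history",
    "the-tempest": "romance",
})

# Hypothetical PCA results: one row per play, in a different order than `genres`.
rng = np.random.default_rng(1)
plays = ["the-tempest", "hamlet", "henry-v", "twelfth-night", "macbeth", "as-you-like-it"]
results = pd.DataFrame(rng.normal(size=(6, 2)), index=plays, columns=["PC1", "PC2"])

# Assigning a Series aligns on the index, so the row order does not need to match.
results["genre"] = genres

# Numeric columns only, for anything that feeds back into PCA or K-means.
features = results.drop(columns="genre")

# One scatter call per genre gives each genre its own color and legend entry.
fig, ax = plt.subplots()
for genre, group in results.groupby("genre"):
    ax.scatter(group["PC1"], group["PC2"], label=genre)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend(title="Genre")
```

Keeping `features` separate from `results` is one way to make sure the genre labels are available for coloring plots but never passed into PCA.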