# Comparing Detective Novels from Agatha Christie and Arthur Conan Doyle
Leah Hogenmiller (lmh2ur)

UVA Box: https://virginia.box.com/s/qtgurhhcp0w7d2vucegrn1ppr0ys3wxw \
GitHub: https://github.com/lmh2ur/detective_novels_text_analysis

## Introduction

I am a big fan of anything crime related, whether that be true crime podcasts, murder mystery movies, or TV shows. One of my favorite TV shows is Sherlock Holmes, which is based on the books written by Arthur Conan Doyle. I also love watching murder mystery movies, many of which were inspired by the books of Agatha Christie. I thought that it would be interesting to compare the literature of Agatha Christie and Arthur Conan Doyle, as they are both prominent authors in the detective fiction genre. They are both British writers from the late 19th and early 20th centuries and are widely regarded as some of the best detective novelists, with billions of copies of their works sold.

For this project, I chose seven novels by Agatha Christie and six novels by Arthur Conan Doyle to compare. Both authors wrote many more novels and short stories, but only a few of them were available on Project Gutenberg. I chose to focus on novels rather than on both novels and short stories because novels have a more defined structure, with chapters, paragraphs, and sentences that fit better into an OHCO structure. I wanted to see how two prominent British writers of that era compare in terms of their novels, and how Agatha Christie, a female author, compares to Arthur Conan Doyle, a male author. Although they both write in the same genre, there ought to be similarities and differences between them that can be looked into further using text analysis. In particular: how similar or different are their novels, which topics are the most relevant, and what are the emotions and sentiment of each novel?

## Source Data

Source Data UVA Box: https://virginia.box.com/s/xhgiqb0xh6q1rol1gq3ytl553o261x7p

The source files were obtained from Project Gutenberg, a website that provides free eBooks, in the Detective Fiction bookshelf (https://www.gutenberg.org/ebooks/bookshelf/30). All novels were downloaded as plain text UTF-8 files. There are seven novels by Agatha Christie and six novels by Arthur Conan Doyle. While Project Gutenberg has both novels and short stories from both authors, novels were specifically chosen because they have a more standard OHCO form, which makes them easier to compare. The average book length in the corpus is around 67,564 tokens. Both Agatha Christie and Arthur Conan Doyle published many more novels, but I was only able to obtain the ones available on Project Gutenberg's website. It would be interesting in the future to see how the analysis would change if all of their books were included.

![LIB-table.png](attachment:d3aa0925-483e-4890-9ad1-f7a4c96943a9.png)

*LIB Table*

## Data Model

Analytical Tables UVA Box: https://virginia.box.com/s/sjmixy83obo18w896jkf0p4kvwn9capf

![tables.png](attachment:bfc498d3-ba32-48c3-82f6-9cadc9e9862c.png)

*All Created Tables*

## Exploration

### Hierarchical Clustering Analysis

The first analysis used hierarchical clustering to see how the novels by both authors would cluster together under different distance metrics. This was calculated using chapters as the bag, computing TFIDF using the sum method, and then aggregating TFIDF for each book. The first dendrogram shows clustering using cosine as the distance metric and Ward as the linkage method. The green cluster contains mostly works by Conan Doyle but also includes two novels by Christie, The Murder on the Links and The Man in the Brown Suit. The orange cluster contains mostly works by Christie but includes A Study in Scarlet by Conan Doyle. The same clustering pattern was seen when using Jaccard distance with complete linkage, as well as Jensen-Shannon distance with Ward linkage. This suggests that, for reasons the clustering alone does not explain, Christie's novels The Murder on the Links and The Man in the Brown Suit are more similar to Conan Doyle's detective novels, while his novel A Study in Scarlet is more similar to Christie's novels.

![hca_cosine_ward.png](attachment:440b59b0-5fb0-4e73-a053-eba8a3c75db6.png)

*Hierarchical Clustering Analysis using Cosine with Ward Linkage*

![hca_jaccard_complete.png](attachment:efb7482f-584c-46f6-b721-4df4ccd5c9dc.png)

*Hierarchical Clustering Analysis using Jaccard with Complete Linkage*

![HCA_js_ward.png](attachment:0857e89f-cda9-4edf-8d2f-362e2f3b4e04.png)

*Hierarchical Clustering Analysis using Jensen-Shannon with Ward Linkage*
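The inputs to the clustering above can be sketched in a few lines. This is only a minimal illustration, not the notebook's actual code: the toy `chapters` dictionary is invented, the "sum" term frequency is assumed to mean a term's count divided by the total count in its chapter, and the IDF is assumed to be `log2(N / df)`. It computes TFIDF per chapter, sums it up to the book level, and compares books by cosine distance.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: (book, chapter number) -> list of tokens.
chapters = {
    ("book_a", 1): "the detective found the body in the study".split(),
    ("book_a", 2): "the detective questioned the butler".split(),
    ("book_b", 1): "holmes examined the footprints near the moor".split(),
    ("book_b", 2): "holmes and watson followed the hound".split(),
    ("book_c", 1): "poirot the detective questioned the maid".split(),
}


def tfidf_by_book(chapters):
    """Compute TFIDF with chapters as bags, then sum the weights per book."""
    n_docs = len(chapters)
    # Document frequency: the number of chapters that contain each term.
    df = Counter()
    for tokens in chapters.values():
        df.update(set(tokens))
    books = defaultdict(Counter)
    for (book, _chapter), tokens in chapters.items():
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)            # assumed "sum" method
            idf = math.log2(n_docs / df[term])  # assumed IDF formula
            books[book][term] += tf * idf
    return books


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


books = tfidf_by_book(chapters)
d_ab = cosine_distance(books["book_a"], books["book_b"])
d_ac = cosine_distance(books["book_a"], books["book_c"])
print(f"a vs b: {d_ab:.3f}  a vs c: {d_ac:.3f}")
```

A full pairwise distance matrix built this way is what a linkage routine such as `scipy.cluster.hierarchy.linkage` takes as input to produce the dendrograms.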

### Principal Component Analysis

To compare the authors' work in a different way, principal component analysis was used. This was done using only the nouns from the corpus, applying L2 normalization, centering the term vectors, and then keeping the top 10 components. Looking at PC 0 and 1, the novels in the corpus fall into an odd triangular shape. The same shape appeared when plotting by author and when plotting the loadings. All of Conan Doyle's novels are in the center, while Christie's are located both in the center and outside it.

![PC01_novels.png](attachment:3e9b11fa-3ffd-49d0-a8d6-fe8bbb37dd63.png)

*PC 0 and 1: Novels*
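The preprocessing steps described above (L2 normalization, centering, top 10 components) can be sketched with NumPy. This is a hedged illustration rather than the notebook's code: the random `tfidf` matrix stands in for the real noun TFIDF table, and the function name `pca_from_tfidf` is made up for this example.

```python
import numpy as np


def pca_from_tfidf(X, n_components=10):
    """PCA via eigendecomposition of the term covariance matrix."""
    # L2-normalize each document row so document length does not dominate.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)
    # Center each term column on its mean.
    X = X - X.mean(axis=0)
    # Covariance between terms (columns are the variables).
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns ascending eigenvalues; keep the largest n_components.
    order = np.argsort(eigvals)[::-1][:n_components]
    loadings = eigvecs[:, order]   # term weights per component
    scores = X @ loadings          # document positions per component
    return scores, loadings, eigvals[order]


# Hypothetical stand-in data: 13 documents by 30 noun features.
rng = np.random.default_rng(0)
tfidf = rng.random((13, 30))

scores, loadings, variances = pca_from_tfidf(tfidf, n_components=10)
print(scores.shape, loadings.shape)
```

Plotting two columns of `scores` gives a document scatter like the ones shown here, and plotting the same two columns of `loadings` gives the corresponding noun plot.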

Principal components 2 and 3 showed a clearer distribution that was easier to interpret. Christie's novels have a wider distribution than Conan Doyle's, which may be because all of the selected Conan Doyle novels are Sherlock Holmes books, while the Christie novels include Poirot novels as well as others. Looking at the novels, Christie's The Man in the Brown Suit is on the left side with the Conan Doyle novels, which agrees with the hierarchical clustering, but The Murder on the Links does not sit with them as it did there. Along the same lines, Conan Doyle's A Study in Scarlet is the closest of his novels to Christie's, but it is not especially close and does not overlap with them. Finally, looking at the distribution of the nouns, on one side there is Holmes, the main character of Conan Doyle's novels, and on the other side there are Poirot, Tuppence, Tommy, Renauld, and Anthony, who are all major characters in Christie's stories. It is interesting that the principal component analysis fully separated each author's main characters, and that many of the other nouns on the outer portion of the distribution are characters as well.

![PC23_authors.png](attachment:a41b40d8-d7db-4741-bcaf-a00743b9e078.png)

*PC 2 and 3: Authors*

![PC23_novels.png](attachment:84337f02-0e98-4f0c-a44d-37bac85c6eee.png)

*PC 2 and 3: Novels*

![PC23_words.png](attachment:5bd9cc3a-0a8b-4be5-a745-4eadfa54dd7c.png)

*PC 2 and 3: Nouns*

### Word Embeddings

To look more closely at the words used by both authors, word2vec word embeddings were used. The corpus and vocabulary were separated by author, and only the nouns and verbs were used to create the embeddings. The word2vec parameters were a window of 2, a size of 246, a min_count of 50, and 4 workers, while the TSNE parameters were a perplexity of 40, 2 n_components, pca init, and 2500 n_iter. Looking at a cluster of words from Christie, there does not seem to be much of a relationship between the words. Open, passage, and stairs, as well as front, inside, and step, make sense together since they all describe a location, but the word dead does not seem to relate to the words around it. The words from Conan Doyle seem to be more related to each other in this specific cluster. Dog, bell, and cry, which are all sounds, appear next to silence, which is interesting, and body and blood are next to each other, as expected.

![christie_words.png](attachment:de2c57ab-0207-4779-971f-279804bb0aa2.png)

*Word Embedding Cluster from Christie*

![conan_words.png](attachment:35aefceb-fa79-4314-ac0b-2da89794423c.png)

*Word Embedding Cluster from Conan Doyle*

Using the same word embeddings, we can find the most similar words to murder, crime, and detective for both Christie and Conan Doyle. Murder is the most interesting to me, as the most similar word to murder for Christie is husband and the second most similar is mother. In many crime novels it is common for husbands to murder their wives, so this does not surprise me too much, especially coming from a woman writer. For Conan Doyle, murder is most similar to words that would be used to analyze how a murder was committed, which seems like something that Sherlock Holmes would write. Looking at the word crime, both authors had murder as one of the most similar words. Again, Christie has wife related to crime, which is consistent with the earlier point about husbands murdering their wives. Finally, the word detective is interesting because for both authors the word week was similar to it, and I am not sure why.

![sim_murder.png](attachment:57764519-0123-4e7d-9a39-992d02e2aa31.png)![sim_crime.png](attachment:ed71c10e-2ac5-428f-bbd5-caf1a81c26a2.png)![sim_detective.png](attachment:6f7c1b23-d2c2-495d-a525-6ad160e3d7f2.png)

*Similar words*
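Similar-word lists like these come from ranking every other word by the cosine similarity of its vector to the query word's vector. The sketch below shows that ranking step only. The three-dimensional `vectors` are invented for illustration and are not taken from either author's model, and `most_similar` here is a stand-alone helper written for this example.

```python
import math

# Hypothetical toy embeddings; real word2vec vectors have many more dimensions.
vectors = {
    "murder": [0.9, 0.1, 0.2],
    "crime": [0.8, 0.2, 0.3],
    "garden": [0.1, 0.9, 0.1],
    "week": [0.2, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = most_similar("murder", vectors)
for other, score in ranking:
    print(f"{other}: {score:.3f}")
```

Because the ranking depends only on vector direction, a word such as week can rank highly for detective simply because the two appear in similar narrow contexts, not because they are related in meaning.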

### Topic Modeling

Next, Latent Dirichlet Allocation was used for topic modeling. This was done using chapters as the bag and only using nouns. The CountVectorizer parameters were 5000 max_features and an ngram_range of (1, 2), and the LDA parameters were 20 n_components, 5 max_iter, and 50 learning_offset. T10 had the highest document weight sum, while T16 had the highest term frequency in the corpus. Both of these topics are more strongly associated with the works of Christie, while Conan Doyle is more associated with T14 and T00. The most common words in the topics are man, house, room, head, and hand. Christie seems to be more associated with topics whose words are more varied and descriptive, while Conan Doyle's topics repeat more of the same simple words.

![topics_with_weights.png](attachment:5e939e80-5a13-4339-904e-59d5443ba853.png)

*Topic Models*

### Sentiment Analysis

Finally, sentiment analysis was performed to see each novel's emotions and overall sentiment. All the novels in the corpus had an overall negative sentiment, which is unsurprising for detective novels that deal with murder. Interestingly, trust and fear are the strongest emotions in the novels. The Adventures of Sherlock Holmes and The Memoirs of Sherlock Holmes by Conan Doyle have the lowest overall emotion and sentiment values in the corpus, while Christie's novels tend to have the highest.

![sentiment.png](attachment:1767d841-4990-4a8b-9e2a-7eb5dc30f8f1.png)

*Emotions and Overall Sentiment for Each Novel*

Comparing The Mysterious Affair at Styles by Christie and The Adventures of Sherlock Holmes by Conan Doyle, we see that the overall sentiment of each novel is driven by a different emotion. The sentiment for The Mysterious Affair at Styles looks to be the inverse of the fear throughout the novel: as the fear increases, the sentiment becomes more negative. The only part of the novel with a positive sentiment is the beginning, and after chapter 2 the sentiment becomes negative. In The Adventures of Sherlock Holmes, the sentiment looks to be correlated with trust throughout the novel: as the trust increases, the sentiment increases, and vice versa. The overall sentiment is not as negative as it is for Christie. It looks to peak around chapter 4, drops at chapter 8, and is around neutral when the novel ends.

![mysterious_sentiment.png](attachment:cab05abd-c344-40fc-a10c-f0657be16bf7.png)

*Sentiment for The Mysterious Affair at Styles by Christie*

![adventures_sentiment.png](attachment:fa94fbbc-426b-47b5-8dc3-f7a7cce7d554.png)

*Sentiment for The Adventures of Sherlock Holmes by Conan Doyle*
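The relationships described above, sentiment moving against fear in one novel and with trust in the other, can be checked numerically with a per-chapter correlation. The sketch below is a toy illustration only: the `LEXICON` entries and the three sample chapters are invented, and the mean-per-token scoring is an assumption about how chapter values are computed.

```python
import math

# Hypothetical mini-lexicon mapping words to emotion and sentiment scores.
LEXICON = {
    "murder": {"fear": 1, "trust": 0, "sentiment": -1},
    "poison": {"fear": 1, "trust": 0, "sentiment": -1},
    "dead": {"fear": 1, "trust": 0, "sentiment": -1},
    "friend": {"fear": 0, "trust": 1, "sentiment": 1},
    "doctor": {"fear": 0, "trust": 1, "sentiment": 1},
}


def chapter_scores(tokens):
    """Mean score per token for each lexicon column (unknown words count as 0)."""
    totals = {"fear": 0.0, "trust": 0.0, "sentiment": 0.0}
    for token in tokens:
        for key, value in LEXICON.get(token, {}).items():
            totals[key] += value
    return {key: total / len(tokens) for key, total in totals.items()}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical chapters of one novel, in order.
novel = [
    "friend doctor friend garden".split(),
    "murder poison friend garden".split(),
    "murder dead poison doctor".split(),
]
scores = [chapter_scores(chapter) for chapter in novel]
fear = [s["fear"] for s in scores]
trust = [s["trust"] for s in scores]
sentiment = [s["sentiment"] for s in scores]

r_fear = pearson(fear, sentiment)
r_trust = pearson(trust, sentiment)
print(f"fear vs sentiment: {r_fear:.2f}, trust vs sentiment: {r_trust:.2f}")
```

A strongly negative fear correlation would back up the "inverse of fear" reading for a novel, and a strongly positive trust correlation would back up the trust-driven reading.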

## Interpretation and Conclusion

The analysis of works by Agatha Christie and Arthur Conan Doyle showed that there were both similarities and differences between them. The hierarchical clustering analysis showed that The Murder on the Links and The Man in the Brown Suit by Christie were the most similar to Conan Doyle's works, as they clustered with his novels even when different distance metrics and linkage methods were used. This was partially confirmed by the principal component analysis, which placed The Man in the Brown Suit on the same side as Conan Doyle's novels. This is one of the novels that does not feature Poirot, one of Christie's most famous characters and the subject of many of her other novels, which may be why it does not cluster with her other works. For Conan Doyle, A Study in Scarlet was the most similar to Christie's novels according to the hierarchical clustering analysis, as it kept clustering with her works. In the principal component analysis, this was the closest Conan Doyle novel to Christie's novels, but it was not as close as I expected given the hierarchical clustering results. It could be that, rather than being the most similar to Christie's novels, A Study in Scarlet is simply the most dissimilar to his other works, since it was the first book that he wrote and published. A writer's style often changes as they write more, and this might be the case with Conan Doyle.

Taking a further look at the nouns from the principal component analysis, it did a good job of placing the characters from Conan Doyle's novels, namely Holmes, on one side and the characters from Christie's novels on the other. This shows the polarity between these characters and how distinctly the authors have written them. It fits with the fact that both authors' main characters, Holmes and Poirot, are beloved characters and are what is most notable about these novels. It was difficult to draw any other conclusions about the nouns from the principal component analysis because they were all very closely clustered in the middle. This shows that the other words used in the novels are very similar, which is to be expected since both authors are writing detective novels. The word embedding graphs for both authors did not tell me anything new or especially interesting, but the word similarities did. For Christie, the top similar word for murder is husband and wife is among the similar words for crime, which is interesting since many novels include stories about husbands murdering their wives. The top similar words for murder and crime for Conan Doyle were a little more generic and analytical, which is expected from stories about Sherlock Holmes, a very straightforward, no-frills detective. The words similar to detective were not as insightful, which I was not expecting since these are detective novels.

Topics most associated with Christie seemed to be more descriptive and had more variety in their words, while those associated with Conan Doyle used simpler words. It does not surprise me that Christie's topics are more descriptive, as she is regarded as one of the best crime writers of her time, if not the best. Her novels in this corpus are also more varied, while Conan Doyle's works here are all Sherlock Holmes novels. It is interesting that hand, head, man, room, and house are the most common words seen in the topics. I am guessing that most of the crimes occur in a room in a house and involve a man, and that hand and head have to do with the body or perhaps with the murderer, but it is hard to know. The sentiment analysis showed that all the novels in the corpus had a negative sentiment, with the strongest emotions being fear and trust, which is expected for novels about crime and murder. On average, Christie's work had more emotion and sentiment than Conan Doyle's, which might be because she is a woman and we tend to write with more emotion. The sentiment analysis also showed that Conan Doyle's A Study in Scarlet had the most similar sentiment to Christie's novels, which could be another reason why the hierarchical clustering grouped it with Christie's works. Looking just at Christie's The Mysterious Affair at Styles and Conan Doyle's The Adventures of Sherlock Holmes, we can see a difference in which emotions are most associated with the overall sentiment of the novels. For The Mysterious Affair at Styles, the sentiment is the inverse of the fear in the novel, so as fear increases, the sentiment becomes more negative. The fear is also highest at the beginning of the novel, which makes me wonder whether this corresponds to when the crime happens or the investigation begins, which would be when characters are the most fearful. For The Adventures of Sherlock Holmes, on the other hand, the sentiment correlates with the trust in the novel, so as trust increases, the sentiment becomes more positive. Trust is lowest close to the end of the novel, possibly marking the point where the characters are having trouble figuring out who the murderer is, and then it increases in the next chapter when the killer is caught.

Overall, it was very insightful to look at the novels by Agatha Christie and Arthur Conan Doyle. While I have watched TV shows and movies based on the works of both of these authors, I have never actually read any of their works. Having prior knowledge about crime and murder mysteries made it easier to process the analysis of both authors and see how they are similar or different. They both use similar words in their novels because they are writing about the same things, but they write about them in different ways, which can be seen in the word embeddings and topic modeling. Sentiment analysis showed that both authors' novels had an overall negative sentiment, but the amount of emotion and the way the emotions influence the sentiment are different. Most of each author's novels cluster with that author's other works, showing that there is a distinct difference between the two authors, but a few novels overlap, and if I had access to all of their works this could be investigated further.