# Final Project Notebook

DS 5001 Exploratory Text Analytics | Spring 2024

# Metadata

- Full Name: Jacqui Unciano
- Userid: jdu5sq
- GitHub Repo URL: https://github.com/jacquiunciano/MSDS-at-UVA-2023/tree/main/DS5001/final_project_files
- UVA Box URL: https://virginia.box.com/s/0bhf4ws7qbcdktkk2aknw8vp7yn5y0l6

# Overview

The goal of the final project is for you to create a **digital analytical edition** of a corpus using the tools, practices, and perspectives you’ve learned in this course. You will select a corpus that has already been digitized and transcribed, parse that into an F-compliant set of tables, and then generate and visualize the results of a series of fitted models. You will also draw some tentative conclusions regarding the linguistic, cultural, psychological, or historical features represented by your corpus. The point of the exercise is to have you work with a corpus through the entire pipeline from ingestion to interpretation.

Specifically, you will acquire a collection of long-form texts and perform the following operations:

- **Convert** the collection from their source formats (F0) into a set of tables that conform to the Standard Text Analytic Data Model (F2).
- **Annotate** these tables with statistical and linguistic features using NLP libraries such as NLTK (F3).
- **Produce** a vector representation of the corpus to generate TFIDF values to add to the TOKEN (aka CORPUS) and VOCAB tables (F4).
- **Model** the annotated and vectorized model with tables and features derived from the application of unsupervised methods, including PCA, LDA, and word2vec (F5).
- **Explore** your results using statistical and visual methods.
- **Present** conclusions about patterns observed in the corpus by means of these operations.

When you are finished, you will make the results of your work available in GitHub (for code) and UVA Box (for data). You will submit to Gradescope (via Canvas) a PDF version of a Jupyter notebook that contains the information listed below.

# Some Details

- Please fill out your answers in each task below by editing the markdown cell. 
- Replace text that asks you to insert something with the thing, i.e. replace `(INSERT IMAGE HERE)` with an image element, e.g. `![](image.png)`.
- For URLs, just paste the raw URL directly into the text area. Don't worry about providing link labels using `[label](link)`.
- Please do not alter the structure of the document or cell, i.e. the bulleted lists. 
- You may add explanatory paragraphs below the bulleted lists.
- Please name your tables as they are named in each task below.
- Tasks are indicated by headers with point values in parentheses.

# Raw Data

## Source Description (1)

Provide a brief description of your source material, including its provenance and content. Tell us where you found it and what kind of content it contains.

<font color='blue'>The source materials are three book series: one that I read in my childhood (ages 5-7), one that I read in my pre-teens (ages 10-12), and one that I read in my late teens (ages 16-18). Each file contains the full text of one book, in either pdf or txt format, depending on which was easier to parse (i.e. for finding chapters and paragraphs). Most were found on the [Internet Archive](https://archive.org/) using the following Google search scheme: "\<book title>" "\<author name>" internet archive. The Warrior Cats books, however, were obtained from [Weebly](https://readwarriorbooks.weebly.com/warriors.html) using the Google search: into the wild erin hunter free pdf. In both cases, I then followed the link that had the pdf and/or txt file available for download.</font>

<font color='blue'>The first series is Warrior Cats, written between 2003 and 2004 under a collective pseudonym, Erin Hunter. The second series is The Last Dragon Chronicles, written between 2001 and 2013 by Chris D'Lacey. The last series is A Song of Ice and Fire (ASOIAF), written by George R.R. Martin, with the first novel published in 1996 and the most recent in 2011. The series is currently incomplete, so I have only obtained the works that have been completed and published (Books 1-5).</font>

## Source Features (1)

Add values for the following items. (Do this for all following bulleted lists.)

- Source URL
  
<font color='blue'>
Warrior Cats:
<ol>
    <li>https://readwarriorbooks.weebly.com/warriors.html</li>
</ol>
The Last Dragon Chronicles:
<ol>
    <li>search terms on Internet Archive: creator: (chris d'lacey)</li>
    <li>https://archive.org/search?query=creator%3A%28Chris+D%27Lacey%29&page=2</li>
</ol>
A Song of Ice and Fire:
<ol>
    <li>search terms on Internet Archive: a game of thrones AND creator:(george r.r. martin)</li>
    <li>https://archive.org/search?query=a+game+of+thrones+AND+creator%3A%28george+r.r.+martin%29&sin=TXT</li>
</ol>
</font>

- UVA Box URL: https://virginia.box.com/s/0bhf4ws7qbcdktkk2aknw8vp7yn5y0l6
- Number of raw documents: <font color='blue'>18 documents (6 Warrior Cats, 7 Dragon Chronicles, 5 Song of Ice and Fire)</font>
- Total size of raw documents (e.g. in MB): <font color='blue'>~54.7 MB</font>
- File format(s), e.g. XML, plaintext, etc.: 

<font color='blue'>
<ol>
    <li>Warrior Cats: 6 pdfs</li>
    <li>The Last Dragon Chronicles: 6 pdfs, 1 txt</li>
    <li>A Song of Ice and Fire: 5 txt</li>
</ol>
</font>

## Source Document Structure (1)

Provide a brief description of the internal structure of each document. That is, describe the typical elements found in each document and their relation to each other. For example, a corpus of letters might be described as having a date, an addressee, a salutation, a set of content paragraphs, and a closing. If there are various structures, state that.

<font color='blue'>Each document has a table of contents and chapters, with paragraphs in each chapter. Most had prologues and epilogues to open and close the novels, which were treated as additional chapters since they contained content related to the story. Some of the pdfs had drawings of maps at the beginning, and if I remember correctly, both ASOIAF and Warrior Cats also had an index of character names and associations. In ASOIAF it was at the end under Appendix, and in Warrior Cats it was at the beginning under Allegiances. Most of them also had About the Author sections at the end, which were not included in the analysis.</font>

# Parsed and Annotated Data

Parse the raw data into the three core tables of your edition: the `LIB`, `CORPUS`, and `VOCAB` tables.

These tables will be stored as CSV files with header rows.

You may consider using `|` as a delimiter.

Provide the following information for each.

## LIB (2)

The source documents the corpus comprises. These may be books, plays, newspaper articles, abstracts, blog posts, etc. 

Note that these are *not* documents in the sense used to describe a bag-of-words representation of a text, e.g. chapter.

- UVA Box URL: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- GitHub URL for notebook used to create: https://github.com/jacquiunciano/MSDS-at-UVA-2023/blob/main/DS5001/final_project_files/make_starter_tables.ipynb
- Delimiter: <font color='blue'>comma</font>
- Number of observations: <font color='blue'>18</font>
- List of features, including at least three that may be used for model summarization (e.g. date, author, etc.):

<font color='blue'>
<ol>
    <li>the title of the book (book_title)</li>
    <li>the number of terms in the book (book_length)</li>
    <li>the number of chapters in the book (n_chaps)</li>
    <li>the series the book belongs to (series)</li>
    <li>the year the book was published (year)</li>
    <li>the author of the book (author)</li>
</ol>
</font>
    
- Average length of each document in characters: <font color='blue'>The average length was calculated by taking the .mean() of the book_length column, which gives ~155327 terms per book (note that book_length counts terms, not characters).</font>

## CORPUS (2)

The sequence of word tokens in the corpus, indexed by their location in the corpus and document structures.

- UVA Box URL: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- GitHub URL for notebook used to create: https://github.com/jacquiunciano/MSDS-at-UVA-2023/blob/main/DS5001/final_project_files/make_starter_tables.ipynb
- Delimiter: <font color='blue'>comma</font>
- Number of observations (should be between 500,000 and 2,000,000): <font color='blue'>2795883</font>
- OHCO Structure (as delimited column names): <font color='blue'>'book_title', 'chap_num', 'para_num', 'sent_num', 'token_num'</font>
- Columns (as delimited column names, including `token_str`, `term_str`, `pos`, and `pos_group`): <font color='blue'>'token_str', 'term_str', 'pos', 'pos_group'</font>

## VOCAB (2)

The unique word types (terms) in the corpus.

- UVA Box URL: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- GitHub URL for notebook used to create: https://github.com/jacquiunciano/MSDS-at-UVA-2023/blob/main/DS5001/final_project_files/make_starter_tables.ipynb
- Delimiter: <font color='blue'>comma</font>
- Number of observations: <font color='blue'>2086493</font>
- Columns (as delimited names, including `n`, `p`, `i`, `dfidf`, `porter_stem`, `max_pos` and `max_pos_group`, `stop`): <font color='blue'>'df', 'dfidf', 'n', 'p', 'i', 'pos_tag', 'max_pos', 'pos_group', 'max_pos_group', 'stop', 'stem_porter', 'ngram_length'</font>
- Note: Your VOCAB may contain ngrams. If so, add a feature for `ngram_length`.
- List the top 20 significant words in the corpus by DFIDF.

<font color='blue'>
    
1.	effect
2.	forests
3.	claws raked
4.	lurked
5.	wasn listening
6.	wasn long
7.	nodded slowly
8.	kindled
9.	exquisite
10.	think asked
11.	whispered ear
12.	outlaws
13.	beetle
14.	cat thought
15.	borrowed
16.	did wish
17.	dead eyes
18.	expressionless
19.	puckered
20.	say wanted

</font>
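As a rough illustration of how a DFIDF ranking like the one above can be produced, here is a minimal sketch. It assumes the common definition DFIDF = DF × log2(N / DF); the toy `BOW` table, its column names, and its counts are made up for the example and are not the project's data.

```python
import numpy as np
import pandas as pd

# Toy BOW: one row per (document, term) pair with a count `n`.
# The documents and counts are invented for illustration only.
BOW = pd.DataFrame(
    {
        "doc": ["a", "a", "b", "b", "c"],
        "term_str": ["cat", "claws", "cat", "dragon", "cat"],
        "n": [3, 1, 2, 4, 1],
    }
)

N_docs = BOW["doc"].nunique()

# DF: number of documents each term appears in.
DF = BOW.groupby("term_str")["doc"].nunique()

# Assumed definition: DFIDF = DF * log2(N / DF).
# A term that appears in every document scores zero.
DFIDF = (DF * np.log2(N_docs / DF)).rename("dfidf")

# Rank terms from most to least significant.
top_terms = DFIDF.sort_values(ascending=False).head(20)
```

Under this definition, terms that occur in every document drop to the bottom of the ranking, while terms that occur in a moderate share of documents score highest.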

# Derived Tables

## BOW (3)

A bag-of-words representation of the CORPUS.

- UVA Box URL: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- GitHub URL for notebook used to create: https://github.com/jacquiunciano/MSDS-at-UVA-2023/blob/main/DS5001/final_project_files/make_ana_tables.ipynb
- Delimiter: <font color='blue'>comma</font>
- Bag (expressed in terms of OHCO levels): <font color='blue'>BOOKS and CHAPS</font>
- Number of observations: <font color='blue'>34786 for bag=BOOKS and 34786 for bag=CHAPS</font>
- Columns (as delimited names, including `n`, `tfidf`): <font color='blue'>n, max_pos, dfidf, mean_tfidf</font>

## DTM (3)

A representation of the BOW as a sparse count matrix.

- UVA Box URL: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- UVA Box URL of BOW used to generate (if applicable): https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- GitHub URL for notebook used to create: https://github.com/jacquiunciano/MSDS-at-UVA-2023/blob/main/DS5001/final_project_files/make_ana_tables.ipynb
- Delimiter: <font color='blue'>comma</font>
- Bag (expressed in terms of OHCO levels): <font color='blue'>BOOKS and CHAPS</font>

## TFIDF (3)

A Document-Term matrix with TFIDF values.

- UVA Box URL: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- UVA Box URL of BOW used to generate (if applicable): https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- GitHub URL for notebook used to create: https://github.com/jacquiunciano/MSDS-at-UVA-2023/blob/main/DS5001/final_project_files/make_ana_tables.ipynb
- Delimiter: <font color='blue'>comma</font>
- Description of TFIDF formula ($\LaTeX$ OK):

<font color='blue'>
<ol>
    <li>Applied log scaling to the term frequency: TF = (np.log2(1 + DTCM.T)).T</li>
    <li>Applied a smoothing factor to the IDF, following the sklearn method: IDF = np.log2((N_docs + 1)/(DF + 1)) + 1</li>
</ol>
</font>
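The two steps listed above can be sketched as follows. The small count matrix and its term names are made up; only the TF and IDF expressions come from the description above, and the final multiplication is the usual way of combining them rather than a confirmed detail of the project notebook.

```python
import numpy as np
import pandas as pd

# Toy document-term count matrix (rows = bags, columns = terms); values are made up.
DTCM = pd.DataFrame(
    [[2, 0, 1], [0, 3, 1], [1, 1, 1]],
    index=["doc_a", "doc_b", "doc_c"],
    columns=["cat", "dragon", "sword"],
)

N_docs = DTCM.shape[0]

# Step 1: log-scaled term frequency.
TF = np.log2(1 + DTCM)

# Document frequency: number of bags in which each term occurs.
DF = (DTCM > 0).sum(axis=0)

# Step 2: smoothed IDF.
IDF = np.log2((N_docs + 1) / (DF + 1)) + 1

# Combine: each term column of TF is multiplied by that term's IDF.
TFIDF = TF * IDF
```

Because of the `+ 1` at the end of the IDF, a term that appears in every bag (here `sword`) keeps a weight of 1 instead of being zeroed out.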

## Reduced and Normalized TFIDF_L2 (3)

A Document-Term matrix with L2 normalized TFIDF values.

- UVA Box URL: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- UVA Box URL of BOW used to generate (if applicable): https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- GitHub URL for notebook used to create: https://github.com/jacquiunciano/MSDS-at-UVA-2023/blob/main/DS5001/final_project_files/make_ana_tables.ipynb
- Delimiter: <font color='blue'>comma</font>
- Number of features (i.e. significant words): <font color='blue'>1000</font>
- Principle of significant word selection: <font color='blue'>used dfidf to reduce the table to the top 1000 features. I'm also only interested in adjectives, so I filtered for words with the following max_pos:</font>

<font color='blue'>
pos_list = ['JJ','JJR','JJS']
</font>
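A minimal sketch of this selection and normalization, under stated assumptions: the toy `VOCAB` and `TFIDF` tables are invented, the order of operations (filter by POS first, then take the top terms by DFIDF) is a guess, and `n_features` is set to 2 instead of 1000 to keep the example readable.

```python
import numpy as np
import pandas as pd

pos_list = ["JJ", "JJR", "JJS"]
n_features = 2  # the project used 1000; 2 keeps the toy example small

# Toy VOCAB with made-up values.
VOCAB = pd.DataFrame(
    {
        "max_pos": ["JJ", "NN", "JJS", "JJ", "VB"],
        "dfidf": [5.0, 9.0, 7.0, 1.0, 8.0],
    },
    index=pd.Index(["dark", "cat", "fiercest", "old", "run"], name="term_str"),
)

# Toy TFIDF matrix with one column per VOCAB term.
TFIDF = pd.DataFrame(
    [[3.0, 1.0, 4.0, 0.0, 2.0], [6.0, 2.0, 8.0, 5.0, 1.0]],
    index=["doc_a", "doc_b"],
    columns=VOCAB.index,
)

# Keep adjectives only, then take the top terms by DFIDF (assumed order of steps).
SIGS = VOCAB[VOCAB["max_pos"].isin(pos_list)].nlargest(n_features, "dfidf").index

# Reduce the matrix to the selected terms and scale each row to unit L2 length.
TFIDF_sub = TFIDF[SIGS]
row_norms = np.sqrt((TFIDF_sub**2).sum(axis=1))
TFIDF_L2 = TFIDF_sub.div(row_norms, axis=0)
```

After normalization every row has length 1, so long and short bags become comparable in the later PCA step.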

# Models

## PCA Components (4)

- UVA Box URL: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- UVA Box URL of the source TFIDF_L2 table: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- GitHub URL for notebook used to create: https://github.com/jacquiunciano/MSDS-at-UVA-2023/blob/main/DS5001/final_project_files/make_pca_tables.ipynb
- Delimiter: <font color='blue'>comma</font>
- Number of components: <font color='blue'>3</font>
- Library used to generate: <font color='blue'>sklearn</font>
- Top 5 positive terms for first component: <font color='blue'>for PC0 (books): evil, spiked, equal, brittle, fierce; for PC0 (chaps): hundred, bloody, black, red, sweet</font>
- Top 5 negative terms for second component: <font color='blue'>for PC1 (books): sweet, shattered, brittle, dumb, wooden; for PC1 (chaps): polar, human, red, ten, little</font>

<font color='blue'>For bag=BOOKS</font>

![image.png](attachment:6543e12f-87ca-4224-b278-8d45c0d21573.png)

<font color='blue'>For bag=CHAPS</font>

![image.png](attachment:9b3470ff-f2c3-450b-8f8b-0db892bc7c29.png)
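For reference, a sketch of how three components and their extreme terms can be pulled out with sklearn. The input is random stand-in data, not the project's TFIDF_L2 table, and the term list is only illustrative. Note that `components_` holds unit-length component weights, so calling them loadings here is a simplification.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for TFIDF_L2: 6 bags x 5 adjective features of random values.
terms = ["black", "bloody", "red", "sweet", "little"]
X = pd.DataFrame(rng.random((6, 5)), columns=terms)
X = X.div(np.sqrt((X**2).sum(axis=1)), axis=0)  # L2-normalize rows

pca = PCA(n_components=3)

# Document-component matrix (DCM): one row per bag, one column per component.
DCM = pd.DataFrame(pca.fit_transform(X), index=X.index, columns=["PC0", "PC1", "PC2"])

# Component-term weights: one row per term, one column per component.
LOADINGS = pd.DataFrame(pca.components_.T, index=terms, columns=DCM.columns)

# Terms at the positive end of PC0 and the negative end of PC1.
top_pos_pc0 = LOADINGS["PC0"].nlargest(2).index.tolist()
top_neg_pc1 = LOADINGS["PC1"].nsmallest(2).index.tolist()
```

The sign of a component is arbitrary, so "positive" and "negative" terms can swap between runs or libraries without changing the interpretation of the polarity.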

## PCA DCM (4)

The document-component matrix generated.

- UVA Box URL: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- UVA Box URL of the source TFIDF_L2 table: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- GitHub URL for notebook used to create: https://github.com/jacquiunciano/MSDS-at-UVA-2023/blob/main/DS5001/final_project_files/make_pca_tables.ipynb
- Delimiter: <font color='blue'>comma</font>

## PCA Loadings (4)

The component-term matrix generated.

- UVA Box URL: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- UVA Box URL of the source TFIDF_L2 table: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- GitHub URL for notebook used to create: https://github.com/jacquiunciano/MSDS-at-UVA-2023/blob/main/DS5001/final_project_files/make_pca_tables.ipynb
- Delimiter: <font color='blue'>comma</font>

## PCA Visualization 1 (4)

Include a scatterplot of documents in the space created by the first two components.

Color the points based on a metadata feature associated with the documents.

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

<font color='blue'>Graphs are for bag=CHAPS</font>
![image.png](attachment:7678662c-dfef-49cb-9d45-6dd2be0af670.png)

![image.png](attachment:4d457b5d-4d24-4529-a029-094ac499fa27.png)

Briefly describe the nature of the polarity you see in the first component:

<font color='blue'>Since the first principal component had terms like "evil," "spiked," and "brittle," that suggests to me that this component may be separating books that deal with heavier themes from those with lighter, more benign content. These words have a darker, more intense feeling. Comparing these to "sweet," "friendly," and "peaceful" further tells me what the first component captures: it seems to relate to some type of theme or conflict, maybe in terms of relationships or interpersonal communication. This DOES surprise me a little bit because, from what I recall, all three series are pretty dark and intense (as dark and intense as a children's book can get?). So maybe the model is just not picking up on that like a human would...</font>

## PCA Visualization 2 (4)

Include a scatterplot of documents in the space created by the second two components.

Color the points based on a metadata feature associated with the documents.

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

<font color='blue'>Graphs are for bag=CHAPS</font>
![image.png](attachment:40101f6d-f8bb-462a-8f97-952d9678cc7a.png)

![image.png](attachment:996219b6-9562-457a-bcca-18215f14f5c4.png)

Briefly describe the nature of the polarity you see in the second component:

<font color='blue'>The second principal component reminds me of personality or emotion, or the range of emotion one can feel. Terms such as "sweet," "shattered," and "wooden" are all really intense, descriptive words that someone typically uses to convey a more vivid or imaginative emotion. So I'm thinking that this component could be distinguishing between books based on the emotional depth and variety they offer, from more straightforward storytelling to complex, emotionally charged narratives. That makes sense since we've got children's books vs. pre-teen books vs. high school/college(?) books.</font>

## LDA TOPIC (4)

- UVA Box URL: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- UVA Box URL of count matrix used to create: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- GitHub URL for notebook used to create: https://github.com/jacquiunciano/MSDS-at-UVA-2023/blob/main/DS5001/final_project_files/lda_topics.ipynb
- Delimiter: <font color='blue'>comma</font>
- Library used to compute: <font color='blue'>sklearn</font>
- A description of any filtering, e.g. POS (Nouns and Verbs only): <font color='blue'>filtered for words with adjective pos tag (JJ, JJR, JJS)</font>
- Number of components: <font color='blue'>20</font>
- Any other parameters used:

<font color='blue'>
    
  1. n_features = 10000
  2. stopwords = 'english'
  3. lda_max_iter = 10
  4. lda_n_top_terms = 5

</font>

- Top 5 words and best-guess labels for the top five topics by mean document weight:
![image.png](attachment:d0fe08fd-0e9d-499b-a6a4-bdc87f9935f9.png)

<font color='blue'>The first topic, the one with the highest mean document weight, may point to strength, or heroic strength. I assume the "ll" is from "will" (i.e. I'll --> I will, you'll --> you will, etc.), which has an assertive tone. Paired with "warrior" and "good," it makes me think of strength, willpower, and dominance. The "old" and "young" might have to do with life or protecting life, feeding into the "heroic" part of the best-guess label. Sort of like, "young (strength) protects old, old (wisdom/mentor) protects young".</font>

<font color='blue'>For the next topic, it has a more morbid tone with words like "black", "dead", and "red". Both colors plus "old" remind me of death, like darkness, blood, old age. "Will" reminds me of "will to live" in this context, whether it's having one or not having one. I think this topic could be about the more somber or dramatic moments in the books, focusing on the existential struggles of the characters.</font>

<font color='blue'>My best-guess label for the third topic might be a bit of a stretch, but it might be about turning points or opportunities. I imagine that "little", "open", and "small" could be about how the characters only have a small opportunity to turn around a situation, or how there's a little opening that will lead to something that will change the trajectory of their situation. It's a bit of a stretch, but combined with "will" and "have", I think it works.</font>

<font color='blue'>As for the fourth topic, this is also a stretch, but it oddly reminds me of goodness or purity, in relation to morals and heroism. The term "great" feeds into the heroism aspect of the topic, while "white", "good", and "little" remind me of the goodness or purity side of the topic.</font>

<font color='blue'>Finally, for the last topic from the top 5 list, I'm sensing a time-related theme. "Old" and "dead" evoke how death is inevitable, in the sense that we have limited time to live. "Long" makes me think of a long journey ahead, feeding into an adventurous theme, but in the sense of an adventure that will span months, if not years. And then "little" and "good" feed into the first part. Something like "do good with the little time you have" or, on the contrary, "why do good when there's little time left" etc.</font>

<font color='blue'>All of these themes, from what I remember, deserve to be in the top five because they're all very prevalent in each series. Every series, despite the varying reading levels, has heroic themes, discusses death and dying in vivid detail, and much more, as seen in these top 5 topics.</font>

## LDA THETA (4)

- UVA Box URL: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- GitHub URL for notebook used to create: https://github.com/jacquiunciano/MSDS-at-UVA-2023/blob/main/DS5001/final_project_files/lda_topics.ipynb
- Delimiter: <font color='blue'>comma</font>

## LDA PHI (4)

- UVA Box URL: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- GitHub URL for notebook used to create: https://github.com/jacquiunciano/MSDS-at-UVA-2023/blob/main/DS5001/final_project_files/lda_topics.ipynb
- Delimiter: <font color='blue'>comma</font>

## LDA + PCA Visualization (4)

Apply PCA to the PHI table and plot the topics in the space opened by the first two components.

Size the points based on the mean document weight of each topic (using the THETA table).

Color the points based on a metadata feature from the LIB table.

Provide a brief interpretation of what you see.

![image.png](attachment:0d7faa3b-917e-4195-9d20-5510694bdd28.png)

<font color='blue'>I'm not sure if I made this graph correctly, but I wanted the visual to show clustering that could correspond to the thematic (topic) groupings by book series. For example, T15 (the topic about death) is very prevalent in each series, but from the visual, you can see that the document weight for ASOIAF is much larger than for the other two. This makes sense since ASOIAF is for a far more mature audience than the Warrior Cats and Dragon Chronicles series.</font>

<font color='blue'>As for the size variation of points, I used the mean document weight, so it highlights the dominance or subtlety of these themes in the corpus, grouped by series. We can see that the Warrior Cats series dominates T06, T04, T10, T19, etc., which also makes sense because if you think about T06 in a more literal context, words like "small" and "little" would fit more with a series about cats as opposed to a series about dragons or humans.</font>

## Sentiment VOCAB_SENT (4)

Sentiment values associated with a subset of the VOCAB from a curated sentiment lexicon.

- UVA Box URL: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- UVA Box URL for source lexicon: https://www.dropbox.com/scl/fo/0k07nufmrurva2nv8m4vn/h/lexicons?dl=0&subfolder_nav_tracking=1
- GitHub URL for notebook used to create: https://github.com/jacquiunciano/MSDS-at-UVA-2023/blob/main/DS5001/final_project_files/lda_topics.ipynb
- Delimiter: <font color='blue'>comma</font>

## Sentiment BOW_SENT (4)

Sentiment values from VOCAB_SENT mapped onto BOW.

- UVA Box URL: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- GitHub URL for notebook used to create: https://github.com/jacquiunciano/MSDS-at-UVA-2023/blob/main/DS5001/final_project_files/sent_ana.ipynb
- Delimiter: <font color='blue'>comma</font>

## Sentiment DOC_SENT (4)

Computed sentiment per bag computed from BOW_SENT.

- UVA Box URL: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- GitHub URL for notebook used to create: https://github.com/jacquiunciano/MSDS-at-UVA-2023/blob/main/DS5001/final_project_files/sent_ana.ipynb
- Delimiter: <font color='blue'>comma</font>
- Document bag expressed in terms of OHCO levels: <font color='blue'>'book_title', 'chap_num'</font>
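The chain from VOCAB_SENT to BOW_SENT to DOC_SENT can be sketched as below. The lexicon values, counts, and the count-weighted mean used for aggregation are all assumptions made for illustration; the actual notebook may aggregate differently.

```python
import pandas as pd

# Toy sentiment lexicon (VOCAB_SENT): made-up emotion flags per term.
VOCAB_SENT = pd.DataFrame(
    {"fear": [1, 0, 1], "trust": [0, 1, 0], "sadness": [1, 0, 0]},
    index=pd.Index(["dead", "loyal", "monster"], name="term_str"),
)

# Toy BOW keyed by the bag (book_title, chap_num) and term_str; counts are made up.
BOW = pd.DataFrame(
    {
        "book_title": ["A", "A", "A", "B", "B"],
        "chap_num": [1, 1, 2, 1, 1],
        "term_str": ["dead", "loyal", "monster", "loyal", "sword"],
        "n": [2, 1, 3, 4, 5],
    }
)

bag = ["book_title", "chap_num"]
emo_cols = list(VOCAB_SENT.columns)

# BOW_SENT: an inner join keeps only the terms found in the lexicon.
BOW_SENT = BOW.merge(VOCAB_SENT, left_on="term_str", right_index=True, how="inner")

# Weight each emotion value by the term count.
BOW_SENT[emo_cols] = BOW_SENT[emo_cols].multiply(BOW_SENT["n"], axis=0)

# DOC_SENT: count-weighted mean of each emotion within each bag.
grouped = BOW_SENT.groupby(bag)
DOC_SENT = grouped[emo_cols].sum().div(grouped["n"].sum(), axis=0)
```

Dividing by the number of lexicon-matched tokens in each bag keeps long chapters from scoring higher simply because they contain more words.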

## Sentiment Plot (4)

Plot sentiment over some metric space, such as time.

If you don't have a metric metadata feature, plot sentiment over a feature of your choice.

You may use a bar chart or a line graph.

![image.png](attachment:37ca858d-d96e-4f04-ab77-f2ebd44ccdfd.png)
![image.png](attachment:45eadde9-c50f-43bc-89de-9826ef627465.png)

## VOCAB_W2V (4)

A table of word2vec features associated with terms in the VOCAB table.

- UVA Box URL: https://virginia.box.com/s/pi0zdnmifnj5risjq0a70erc5wsaj41c
- GitHub URL for notebook used to create: https://github.com/jacquiunciano/MSDS-at-UVA-2023/blob/main/DS5001/final_project_files/wrod2vec.ipynb
- Delimiter: <font color='blue'>comma</font>
- Document bag expressed in terms of OHCO levels: <font color='blue'>'book_title', 'chap_num'</font>
- Number of features generated: <font color='blue'>3573</font>
- The library used to generate the embeddings: <font color='blue'>Gensim</font>

## Word2vec tSNE Plot (4)

Plot word embedding features in two dimensions using t-SNE.

Describe a cluster in the plot that captures your attention.

![image.png](attachment:2218645b-fd0e-4243-93f2-ccaf1898349c.png)

### By Series...
<font color='blue'>ASOIAF</font>
![image.png](attachment:b25cfb60-8962-4505-976c-68c25ec75402.png)

<font color='blue'>Cats</font>
![image.png](attachment:a1158223-dac2-4d85-bd76-2e0e46e0e366.png)

<font color='blue'>Dragons</font>
![image.png](attachment:33b18f82-bc06-4612-8f65-5d74834fabcb.png)

<font color='blue'>Cluster that was interesting: (ASOIAF) wings, claws, fur, gaze, eyes, jaws, (Cats) rule, hold, prove, accept, forgive, require, (Dragons) monster, laughter, music, whisper, screaming, (All) rich, bold, foolish, strong, stubborn</font>

<font color='blue'>The general cluster for all series seems to focus on character traits. These adjectives describe personalities or behaviors that are central to hero and antihero narratives. "Rich" might refer to detailed, layered characterizations or material wealth; "bold," "strong," and "stubborn" remind me of characters with strong wills, possibly leading to pivotal actions within the stories; and "foolish" introduces a humanizing flaw, creating complex, relatable characters.</font>

<font color='blue'>For the ASOIAF cluster, maybe it's about the physical and sensory elements associated with creatures or characters. These terms refer to vivid descriptions of dragons, direwolves, and other fantastical beings that play pivotal roles in the series. And the focus on parts of the body like "eyes" and "gaze" suggests a narrative emphasis on perception and interaction, possibly indicating moments of intense action or emotional connection.</font>

<font color='blue'>For Warrior Cats, there are a lot of verbs in this cluster that reflect interpersonal relationships and social dynamics within the series. These terms may highlight themes of leadership, social structure, and the moral and ethical decisions that characters must navigate, which makes sense since the series is about the main character's climb to clan leader. Words like "rule" and "forgive" suggest a focus on governance, conflict resolution, and the maintenance of order within the cat clans, emphasizing the social and moral complexities faced by the characters.</font>

<font color='blue'>And finally, for the Last Dragon Chronicles, this cluster holds a range of emotional and atmospheric settings, from the terrifying to the whimsical. "Monster" might represent the fantastical creatures encountered, but it might also be about a person who acts like a monster. Meanwhile "laughter," "music," and "screaming" suggest scenes varying from joyful to distressing. This combination could reflect the dramatic fluctuations in tone and mood typical of fantasy narratives that explore both dark and light elements.</font>

# Riffs

Provide at least three visualizations that combine the preceding model data in interesting ways.

These should provide insight into how features in the LIB table are related. 

The nature of this relationship is left open to you -- it may be correlation, or mutual information, or something less well defined. 

In doing so, consider the following visualization types:

- Hierarchical cluster diagrams
- Heatmaps
- Scatter plots
- KDE plots
- Dispersion plots
- t-SNE plots
- etc.

## Riff 1 (5)

![image.png](attachment:2b015eff-c95f-44eb-b158-ea690525b804.png)

<font color='blue'>For my first riff visualization, I created a hierarchical cluster dendrogram to visualize the similarities between different books, based on Ward's method applied to Euclidean distances. With the colors, the different clusters can be distinguished and you can see how the different series can be separated. I thought that was really cool to see, especially the distance between the more mature series (ASOIAF and Dragon Chronicles) and the children's series. From this dendrogram, I can see and develop an understanding of the relationships and similarities between the various books within a larger corpus. For example, ASOIAF is more similar to Dragon Chronicles than to Warrior Cats, which makes sense considering ASOIAF and Dragon Chronicles are for more mature audiences and their themes are quite similar based on the other graphs and analyses done earlier.</font>
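A sketch of the clustering behind a dendrogram like this one, using scipy. The six two-dimensional points are invented so that the groups separate cleanly; the real input would be a book-level feature matrix such as TFIDF_L2, and the labels here are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy feature vectors for six "books" in three groups; values are made up.
labels = ["cats_1", "cats_2", "dragons_1", "dragons_2", "asoiaf_1", "asoiaf_2"]
X = np.array([
    [0.0, 0.1], [0.1, 0.0],
    [5.0, 5.1], [5.1, 5.0],
    [6.0, 6.1], [6.1, 6.0],
])

# Ward linkage on pairwise Euclidean distances.
Z = linkage(pdist(X, metric="euclidean"), method="ward")

# Cut the tree into two and three flat clusters.
two = fcluster(Z, t=2, criterion="maxclust")
three = fcluster(Z, t=3, criterion="maxclust")

# scipy.cluster.hierarchy.dendrogram(Z, labels=labels) would draw the tree.
```

In this toy data the "dragons" and "asoiaf" points merge before either joins the "cats" points, which mirrors the pattern described above where the two more mature series sit closer together.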

## Riff 2 (5)

![image.png](attachment:a59c189b-17f4-49e1-b8cf-701a606b2d69.png) ![image.png](attachment:e0452509-2034-4437-b714-8be873778bf8.png)

<font color='blue'>Along with the dendrogram, I wanted to try K-Means, once with k=3 and once with k=4, mostly because I wanted to see if k=4 would be able to distinguish which books were written by which author. k=3 was able to differentiate between the book series, but k=4 was not able to cluster the authors together. That didn't really surprise me, since I never noticed that Erin Hunter was actually a group of authors until later in life. So either the editors have been doing a really great job at keeping things consistent or the authors are like-minded in writing styles.</font>
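The k=3 versus k=4 comparison can be sketched like this. The points and series labels are toy values chosen to separate cleanly, and `n_init` and `random_state` are set only for repeatability; they are not taken from the project notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy feature vectors for six "books"; values are made up so the series separate cleanly.
series = ["Cats", "Cats", "Dragons", "Dragons", "ASOIAF", "ASOIAF"]
X = np.array([
    [0.0, 0.1], [0.1, 0.0],
    [5.0, 5.1], [5.1, 5.0],
    [9.0, 0.1], [9.1, 0.0],
])

# Fit K-Means once per value of k and keep the cluster assignments.
results = {}
for k in (3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    results[k] = km.fit_predict(X)

# Cross-tabulate cluster assignments against the known series labels.
table_k3 = pd.crosstab(
    pd.Series(series, name="series"), pd.Series(results[3], name="cluster")
)
```

A cross-tabulation like `table_k3` makes it easy to check whether clusters line up with a metadata feature; swapping in an author column would give the corresponding check for k=4.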

## Riff 3 (5)

![image.png](attachment:414f75aa-88b3-4508-8f1a-ad9b1b4bcf82.png)

![image.png](attachment:3005575d-80be-4eac-8fa8-4a1cc135055c.png)

<font color='blue'>These two graphs are "follow ups" to the first graph, Fear by Series. When I was looking at the sentiment scores, I saw that fear was number one for all series, but trust and sadness fluctuated between 2nd and 3rd place for each series. I wanted to see how differently each series scored for trust and sadness. I found that the Dragon Chronicles and ASOIAF have similar, almost identical, scores and trajectories for trust. Even for Warrior Cats, the trajectory is similar to the other two series. Meanwhile, surprisingly, ASOIAF and Warrior Cats share similar trajectories for sadness.</font>

<font color='blue'>The similarity in the trust scores between ASOIAF and the Dragon Chronicles suggests to me that the plot or character development centers on the establishment, betrayal, and/or restoration of trust, indicating that trust plays a critical role in the development of the plot or relationships in both series. The similar upward trajectory in trust across all three series could also reflect a common narrative style/structure where initial distrust or conflict gradually leads to uniting against common challenges. So, building alliances.</font>

<font color='blue'>As for the similarities between ASOIAF and Warrior Cats in terms of sadness, for ASOIAF, it corroborates the well-known intense and often dark narrative twists found in the series. The fact that Warrior Cats also shares a similar trajectory suggests that it, too, deals with increasingly serious or dark themes, despite being a series that is aimed at children.</font>

<font color='blue'>As a last note, I would like to mention that since sadness increases alongside fear, I think it tells of a narrative where losses are accumulating, or the stakes are getting higher, affecting both the characters and my investment in the story. It also indicates how my taste in books didn't change as I grew.</font>

# Interpretation (4)

Describe something interesting about your corpus that you discovered during the process of completing this assignment.

At a minimum, use 250 words, but you may use more. You may also add images if you'd like.

<font color='blue'>I think the most interesting thing that I noticed while completing the process was the similarities between ASOIAF and Warrior Cats. For the most part, ASOIAF and the Dragon Chronicles were more similar to each other than to Warrior Cats. To me, that was expected since both series are targeted at teenagers and young adults, both have a more fantastical theme with mythical creatures, and the political climate in both series is complex and often dark. However, in some aspects, I was surprised to see Warrior Cats resonate more with ASOIAF than the Dragon Chronicles did. For example, the last riff visualization really highlighted the surprisingly mature content/thematic elements that could be found in Warrior Cats despite the series' target audience. It was comparable to ASOIAF, which is leagues more explicit than Warrior Cats.</font>

<font color='blue'>Another example would be when I was looking at the PCA visualizations for PC1 vs PC2. I had already established how PC1 might have something to do with emotional themes and such, and while obviously the Dragon Chronicles and ASOIAF were clustered closer together, I saw how ASOIAF was slightly more to the right and closer to Warrior Cats than the Dragon Chronicles were. That was slightly surprising because, based on what I interpreted PC1 to involve, I expected the order to be ASOIAF, Dragon, then Cat, not Dragon, ASOIAF, then Cat. I'm thinking that both series might be exploring these emotional themes through different lenses appropriate for the targeted age group (i.e. fantasy and political intrigue in ASOIAF, and animal allegory in Warrior Cats).</font>

<font color='blue'>In addition, what this also tells me is that power struggles, survival themes, and the experiences of trust, fear, and loss are compelling at any age, and both Warrior Cats and ASOIAF are able to detail this on different levels. Hence, while less explicit, Warrior Cats still shares a thematic and emotional common ground with the more adult-oriented ASOIAF and Dragon Chronicles.</font>