# Final Project Notebook

DS 5001 Text as Data | Spring 2025

# Metadata

- Full Name: John Hope
- Userid: jah9kqn
- GitHub Repo URL: https://github.com/johnhope829/movie-review-text-analytics
- UVA Box URL: https://virginia.box.com/s/zf7r4jxkvc2udvrchmwh2efqw5g8lm13

# Overview

The goal of the final project is for you to create a **digital analytical edition** of a corpus using the tools, practices, and perspectives you’ve learned in this course. You will select a corpus that has already been digitized and transcribed, parse that into an F-compliant set of tables, and then generate and visualize the results of a series of fitted models. You will also draw some tentative conclusions regarding the linguistic, cultural, psychological, or historical features represented by your corpus. The point of the exercise is to have you work with a corpus through the entire pipeline from ingestion to interpretation. 

Specifically, you will acquire a collection of long-form texts and perform the following operations:

- **Convert** the collection from their source formats (F0) into a set of tables that conform to the Standard Text Analytic Data Model (F2).
- **Annotate** these tables with statistical and linguistic features using NLP libraries such as NLTK (F3).
- **Produce** a vector representation of the corpus to generate TFIDF values to add to the TOKEN (aka CORPUS) and VOCAB tables (F4).
- **Model** the annotated and vectorized model with tables and features derived from the application of unsupervised methods, including PCA, LDA, and word2vec (F5).
- **Explore** your results using statistical and visual methods.
- **Present** conclusions about patterns observed in the corpus by means of these operations.

When you are finished, you will make the results of your work available in GitHub (for code) and UVA Box (for data). You will submit to Gradescope (via Canvas) a PDF version of a Jupyter notebook that contains the information listed below.

# Some Details

- Please fill out your answers in each task below by editing the markdown cell. 
- Replace text that asks you to insert something with the thing, i.e. replace `(INSERT IMAGE HERE)` with an image element, e.g. `![](image.png)`.
- For URLs, just paste the raw URL directly into the text area. Don't worry about providing link labels using `[label](link)`.
- Please do not alter the structure of the document or cell, i.e. the bulleted lists. 
- You may add explanatory paragraphs below the bulleted lists.
- Please name your tables as they are named in each task below.
- Tasks are indicated by headers with point values in parentheses.

# Raw Data

## Source Description (1)

Provide a brief description of your source material, including its provenance and content. Tell us where you found it and what kind of content it contains.

The data for this project comes from the Large Movie Review Dataset, which contains IMDb movie reviews. The dataset holds 50,000 reviews covering more than 7,000 movies, balanced evenly between positive and negative reviews (25,000 each). More information is available on the [dataset homepage](https://ai.stanford.edu/~amaas/data/sentiment/).

The dataset contains the review texts and URLs pointing to the corresponding IMDb pages. Because it does not include information about the movies/shows being reviewed, this information had to be collected separately. For this purpose, the [IMDbPY](https://pypi.org/project/IMDbPY/) package (now cinemagoer) was used to gather movie/show metadata, including titles, release years, and genres.
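
Since the reviews link back to IMDb only by URL, a title identifier has to be pulled out of each URL before any metadata lookup can happen. The following is a minimal sketch of that step, assuming the URLs contain a `/title/tt<digits>` segment; the function name `extract_title_id` and the sample URL are illustrative and are not taken from the project notebooks.

```python
import re
from typing import Optional

# Assumed URL shape: .../title/tt<digits>/... (illustrative, not confirmed by the source)
TITLE_ID_PATTERN = re.compile(r"/title/(tt\d+)")


def extract_title_id(url: str) -> Optional[str]:
    """Return the IMDb title ID (e.g. 'tt0453418') found in a URL, or None."""
    match = TITLE_ID_PATTERN.search(url)
    return match.group(1) if match else None


sample_url = "http://www.imdb.com/title/tt0453418/usercomments"  # hypothetical example
print(extract_title_id(sample_url))
```

The extracted ID can then serve as the key for a metadata lookup and as a join key between reviews and the LIB table.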

## Source Features (1)

Add values for the following items. (Do this for all following bulleted lists.)

- Source URL: https://ai.stanford.edu/~amaas/data/sentiment/
- UVA Box URL: https://virginia.box.com/s/iajc1o8j068wyyb8dn88g0febmctrjkg
- Number of raw documents: 50,000
- Total size of raw documents (e.g. in MB): ~63.4 MB
- File format(s), e.g. XML, plaintext, etc.: plaintext (.txt)

## Source Document Structure (1)

Provide a brief description of the internal structure of each document. That is, describe the typical elements found in a document and their relation to each other. For example, a corpus of letters might be described as having a date, an addressee, a salutation, a set of content paragraphs, and closing. If there are various structures, state that.

Each document consists of an individual movie/show review. A review typically contains one or more content paragraphs giving a holistic assessment of the movie/show.

# Parsed and Annotated Data

Parse the raw data into the three core tables of your edition: the `LIB`, `CORPUS`, and `VOCAB` tables.

These tables will be stored as CSV files with header rows.

You may consider using `|` as a delimiter.

Provide the following information for each.

## LIB (2)

The source documents the corpus comprises. These may be books, plays, newspaper articles, abstracts, blog posts, etc. 

Note that these are *not* documents in the sense used to describe a bag-of-words representation of a text, e.g. chapter.

- UVA Box URL: https://virginia.box.com/s/lg8kxpizyplaes1s6qw1juankksvm9lb
- GitHub URL for notebook used to create: https://github.com/johnhope829/movie-review-text-analytics/blob/main/src/tokenization.ipynb
- Delimiter: comma (.csv)
- Number of observations: 5,000 (subset from original 50,000)
- List of features, including at least three that may be used for model summarization (e.g. date, author, etc.): Title, Release Year, Rating, Genre
- Average length of each document in characters: 1316

## CORPUS (2)

The sequence of word tokens in the corpus, indexed by their location in the corpus and document structures.

- UVA Box URL: https://virginia.box.com/s/jc3e5nm006rrueq6xuv8ugcstyedbdhu
- GitHub URL for notebook used to create: https://github.com/johnhope829/movie-review-text-analytics/blob/main/src/tokenization.ipynb
- Delimiter: comma (.csv)
- Number of observations (should be >= 500,000 and <= 2,000,000 observations): 1,329,085
- OHCO Structure (as delimited column names): [`review_id`, `para_num`, `sent_num`, `token_num`]
- Columns (as delimited column names): [`pos_tuple`, `pos`, `pos_group`, `token_str`, `term_str`]
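
To illustrate how the OHCO index above maps onto a review, here is a rough sketch that turns one review into token rows keyed by `review_id`, `para_num`, `sent_num`, and `token_num`. It uses plain regular expressions rather than NLTK, and it assumes that paragraphs are separated by `<br />` tags; the splitting rules and the function name `tokenize_review` are illustrative, not the project's actual tokenizer.

```python
import re


def tokenize_review(review_id, text):
    """Return one row per token, keyed by the four OHCO levels."""
    rows = []
    # Assumption: paragraphs are separated by HTML line breaks such as <br />.
    paragraphs = [p for p in re.split(r"(?:<br\s*/?>\s*)+", text) if p.strip()]
    for para_num, para in enumerate(paragraphs):
        # Naive sentence split: ., ! or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", para.strip()) if s]
        for sent_num, sent in enumerate(sentences):
            # Hypothetical token rule: runs of word characters and apostrophes.
            for token_num, token_str in enumerate(re.findall(r"[\w']+", sent)):
                rows.append({
                    "review_id": review_id,
                    "para_num": para_num,
                    "sent_num": sent_num,
                    "token_num": token_num,
                    "token_str": token_str,
                    "term_str": token_str.lower(),
                })
    return rows


rows = tokenize_review(0, "Great film. Loved it!<br /><br />Would watch again.")
for row in rows:
    print(row)
```

Each row is uniquely identified by the four index values, so any coarser bag (sentence, paragraph, or review) can be recovered by grouping on a prefix of the index.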

## VOCAB (2)

The unique word types (terms) in the corpus.

- UVA Box URL: https://virginia.box.com/s/xi7nco98p5py46mzhon7aan78s54ysw1
- GitHub URL for notebook used to create: https://github.com/johnhope829/movie-review-text-analytics/blob/main/src/tokenization.ipynb
- Delimiter: comma (.csv)
- Number of observations: 44,495
- Columns (as delimited names, including `n`, `p`, `i`, `dfidf`, `porter_stem`, `max_pos` and `max_pos_group`, `stop`): [`n`, `p`, `i`, `porter_stem`, `max_pos`, `max_pos_group`, `stop`, `df`, `idf`, `dfidf`]
- Note: Your VOCAB may contain ngrams. If so, add a feature for `ngram_length`.
- List the top 20 significant words in the corpus by DFIDF.

1. good
2. more
3. some
4. when
5. what
6. would
7. up
8. very
9. only
10. if
11. has
12. out
13. he
14. time
15. or
16. can
17. just
18. see
19. even
20. no
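
The notebook lists DFIDF rankings without stating the formula. The sketch below assumes one common definition, DFIDF = df × log2(N / df), where df is a term's document frequency and N is the number of documents. Under that assumption the score peaks for terms found in roughly N/e (about 37%) of documents, which would be consistent with moderately common words ranking highly in the list above. The function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def dfidf_scores(docs):
    """Compute df, idf and dfidf per term from a list of token lists."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    scores = {}
    for term, d in df.items():
        idf = math.log2(n_docs / d)  # assumed (unsmoothed) idf definition
        scores[term] = {"df": d, "idf": idf, "dfidf": d * idf}
    return scores


# Hypothetical toy documents, not taken from the corpus.
docs = [
    ["good", "movie", "plot"],
    ["good", "movie"],
    ["good", "acting"],
    ["bad", "movie"],
]
scores = dfidf_scores(docs)
ranked = sorted(scores, key=lambda t: scores[t]["dfidf"], reverse=True)
print(ranked)
```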

# Derived Tables

## BOW (3)

A bag-of-words representation of the CORPUS.

- UVA Box URL: https://virginia.box.com/s/gteuhpfex0axitl43zi4p8y4nqtu24oa
- GitHub URL for notebook used to create: https://github.com/johnhope829/movie-review-text-analytics/blob/main/src/bow_tfidf.ipynb
- Delimiter: comma (.csv)
- Bag (expressed in terms of OHCO levels): `review_id`
- Number of observations: 694,927
- Columns (as delimited names): [`n`, `tf`, `tfidf`]
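
As a small illustration of how a BOW with `review_id` as the bag can be derived from token rows, the sketch below counts each term per review (`n`) and divides by the review length (`tf`). It uses plain dictionaries instead of a DataFrame to stay self-contained; the function name `build_bow` and the toy rows are hypothetical.

```python
from collections import Counter


def build_bow(token_rows):
    """Count term occurrences per review and add a relative term frequency."""
    counts = Counter((r["review_id"], r["term_str"]) for r in token_rows)
    doc_lengths = Counter(r["review_id"] for r in token_rows)
    return {
        key: {"n": n, "tf": n / doc_lengths[key[0]]}
        for key, n in counts.items()
    }


# Hypothetical token rows; the real CORPUS table has more columns.
token_rows = [
    {"review_id": 1, "term_str": "good"},
    {"review_id": 1, "term_str": "movie"},
    {"review_id": 1, "term_str": "good"},
    {"review_id": 2, "term_str": "movie"},
]
bow = build_bow(token_rows)
print(bow[(1, "good")])
```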

## DTM (3)

A representation of the BOW as a sparse count matrix.

- UVA Box URL: https://virginia.box.com/s/6eos5hq6a5ga66x5ihbgikqyajzx0x3x
- UVA Box URL of BOW used to generate (if applicable): 
- GitHub URL for notebook used to create: https://github.com/johnhope829/movie-review-text-analytics/blob/main/src/bow_tfidf.ipynb
- Delimiter: comma (.csv)
- Bag (expressed in terms of OHCO levels): `review_id`

## TFIDF (3)

A Document-Term matrix with TFIDF values.

- UVA Box URL: https://virginia.box.com/s/t0hisy338zygh2wl8l7ymdrhf2q5luh9
- UVA Box URL of DTM or BOW used to create:
- GitHub URL for notebook used to create: https://github.com/johnhope829/movie-review-text-analytics/blob/main/src/bow_tfidf.ipynb
- Delimiter: comma (.csv)
- Description of TFIDF formula ($\LaTeX$ OK): 

$$
TF(t,d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of terms in } d}
$$

$$
IDF(t) = \log \left( \frac{N}{1 + df} \right)
$$

$$
TFIDF(t,d) = TF(t,d) \times IDF(t)
$$
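
The three formulas above can be sketched directly in code. This version assumes the logarithm is the natural log, since the formula does not state a base; the function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return {(doc_index, term): weight} with TF = count / doc length
    and IDF = log(N / (1 + df))."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency: one count per document
    weights = {}
    for i, tokens in enumerate(docs):
        for term, n in Counter(tokens).items():
            tf = n / len(tokens)
            idf = math.log(n_docs / (1 + df[term]))  # natural log is an assumption
            weights[(i, term)] = tf * idf
    return weights


# Hypothetical toy documents.
docs = [["good", "movie", "good"], ["bad", "movie"], ["movie", "plot"], ["movie"]]
weights = tfidf(docs)
print(weights[(0, "good")])
```

One consequence of the `1 + df` smoothing is that a term present in every document gets a slightly negative IDF, because N / (1 + N) is less than 1.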


## Reduced and Normalized TFIDF_L2 (3)

A Document-Term matrix with L2 normalized TFIDF values.

- UVA Box URL: https://virginia.box.com/s/srkx21qs5q0b7jw29dwk8vjj66kw16ou
- UVA Box URL of source TFIDF table:
- GitHub URL for notebook used to create: https://github.com/johnhope829/movie-review-text-analytics/blob/main/src/bow_tfidf.ipynb
- Delimiter: comma (.csv)
- Number of features (i.e. significant words): 5000
- Principle of significant word selection: Top 5000 nouns
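
L2 normalization rescales each document's TFIDF row to unit Euclidean length, so that long and short reviews become comparable. A minimal sketch on a sparse row stored as a dictionary (the function name and values are hypothetical):

```python
import math


def l2_normalize(row):
    """Scale a sparse row (term -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in row.values()))
    if norm == 0:
        return dict(row)  # leave all-zero rows unchanged
    return {term: w / norm for term, w in row.items()}


row = {"story": 3.0, "director": 4.0}  # hypothetical TFIDF weights
unit_row = l2_normalize(row)
print(unit_row)
```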

# Models

## PCA Components (4)

- UVA Box URL: https://virginia.box.com/s/aavwkqsrs562plackefcexbqn4ircqx5
- UVA Box URL of the source TFIDF_L2 table:
- GitHub URL for notebook used to create: https://github.com/johnhope829/movie-review-text-analytics/blob/main/src/pca.ipynb
- Delimiter: comma (.csv)
- Number of components: 10
- Library used to generate: Scikit-learn
- Top 5 positive terms for first component: [show, episode, series, episodes, tv]
- Top 5 negative terms for second component: [movie, movies, show, horror, acting]
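
To make the idea of component loadings and polarity concrete, the sketch below estimates only the first principal component of a tiny document-term matrix using power iteration on the covariance matrix. This is a swapped-in teaching technique, not the Scikit-learn routine used in the project, and the matrix values and column meanings are hypothetical.

```python
import math


def first_component(matrix, n_iter=200):
    """Estimate the first principal component by power iteration.

    Returns (loadings, scores): one loading per column, one score per row.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Center each column, as PCA requires.
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]
    # Sample covariance matrix of the centered columns.
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n_rows - 1) for b in range(n_cols)]
        for a in range(n_cols)
    ]
    vec = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(n_iter):
        nxt = [sum(cov[a][b] * vec[b] for b in range(n_cols)) for a in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Project the centered rows onto the component to get document scores.
    scores = [sum(r[j] * vec[j] for j in range(n_cols)) for r in centered]
    return vec, scores


# Hypothetical columns: "show", "episode", "movie"; rows are four reviews.
matrix = [
    [0.8, 0.6, 0.0],
    [0.7, 0.7, 0.1],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 1.0],
]
loadings, scores = first_component(matrix)
print([round(x, 3) for x in loadings])
print([round(x, 3) for x in scores])
```

In this toy case the first two columns load with one sign and the third with the opposite sign, which is the kind of polarity the "top positive" and "top negative" term lists describe. The overall sign of a component is arbitrary.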

## PCA DCM (4)

The document-component matrix generated.

- UVA Box URL: https://virginia.box.com/s/z0osarw0z9gu0j987q1k7m1kblpmw4d1
- GitHub URL for notebook used to create: https://github.com/johnhope829/movie-review-text-analytics/blob/main/src/pca.ipynb
- Delimiter: comma (.csv)

## PCA Loadings (4)

The component-term matrix generated.

- UVA Box URL: https://virginia.box.com/s/0qcnm5ee3djsrj06fuj71hxdm1699fp8
- GitHub URL for notebook used to create: https://github.com/johnhope829/movie-review-text-analytics/blob/main/src/pca.ipynb
- Delimiter: comma (.csv)

## PCA Visualization 1 (4)

Include a scatterplot of documents in the space created by the first two components.

Color the points based on a metadata feature associated with the documents.

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

![](../figures/pca_1.png)

![](../figures/pca_2.png)

Briefly describe the nature of the polarity you see in the first component:



Looking at the visuals above, the first component looks broadly similar across genres, with a few differences worth noting. In the bar plot, most genres have very similar medians; what differentiates them is the width of their ranges. More popular genres, like Comedy, Action, and Drama, have much wider ranges than less popular genres, like Game-Show, Western, and Adult, which makes sense because the more popular genres contain many more documents. The same is true for word groups: nouns have much larger ranges than adjectives and verbs, likely because they occur more often in the documents.

## PCA Visualization 2 (4)

Include a scatterplot of documents in the space created by the second two components.

Color the points based on a metadata feature associated with the documents.

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

![](../figures/pca_3.png)

![](../figures/pca_4.png)

Briefly describe the nature of the polarity you see in the second component:

Looking at the visuals above, the second component, like the first, looks broadly similar across genres. In the bar plot, most genres have very similar medians; what differentiates them is the width of their ranges. The key difference here is the large increase in range for the Game-Show genre, which had a very narrow range in the first component. As in the previous part, the same pattern holds for word groups: nouns have much larger ranges than adjectives and verbs, likely because they occur more often in the documents.

## LDA TOPIC (4)

- UVA Box URL: https://virginia.box.com/s/jxmskmo5yiwb5528frkaj9ghhmfbel8l
- UVA Box URL of count matrix used to create:
- GitHub URL for notebook used to create: https://github.com/johnhope829/movie-review-text-analytics/blob/main/src/lda.ipynb
- Delimiter: comma (.csv)
- Library used to compute: Scikit-learn
- A description of any filtering, e.g. POS (Nouns and Verbs only): nouns and plural nouns
- Number of components: 20
- Any other parameters used:
  - n_terms = 5000
  - n_topics = 20
  - max_iter = 5
  - n_top_terms = 7
  - ngram_range = (1, 2)
- Top 5 words and best-guess labels for the top five topics by mean document weight:
  - T00: movie time way director story (Label: Game-Show)
  - T01: film films movie plot characters (Label: News)
  - T02: time family people story life (Label: Documentary)
  - T03: game movie life time scene (Label: Film-Noir)
  - T04: movie seat character card cat (Label: Western)
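
The two quantities used in the list above, top terms per topic and mean document weight per topic, can be read off the PHI (topic-term) and THETA (document-topic) tables. A small sketch with nested dictionaries standing in for those tables; all values and function names are hypothetical.

```python
def top_terms(phi, n_top=5):
    """For each topic, return its n_top highest-weighted terms."""
    return {
        topic: [
            term
            for term, _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n_top]
        ]
        for topic, weights in phi.items()
    }


def mean_doc_weight(theta):
    """Average each topic's weight over all documents."""
    topics = list(next(iter(theta.values())).keys())
    return {t: sum(doc[t] for doc in theta.values()) / len(theta) for t in topics}


# Hypothetical toy values, not taken from the project's PHI and THETA tables.
phi = {
    "T00": {"movie": 0.5, "time": 0.3, "game": 0.2},
    "T01": {"film": 0.6, "plot": 0.3, "movie": 0.1},
}
theta = {
    "doc_a": {"T00": 0.9, "T01": 0.1},
    "doc_b": {"T00": 0.4, "T01": 0.6},
}
print(top_terms(phi, n_top=2))
print(mean_doc_weight(theta))
```

Sorting topics by the second result gives the ranking "by mean document weight" that the list refers to.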

## LDA THETA (4)

- UVA Box URL: https://virginia.box.com/s/clrd4vie6zxwg6rs8ej6fzi4nc4p2f3f
- GitHub URL for notebook used to create: https://github.com/johnhope829/movie-review-text-analytics/blob/main/src/lda.ipynb
- Delimiter: comma (.csv)

## LDA PHI (4)

- UVA Box URL: https://virginia.box.com/s/spwpt4c0o0nab52yexlblxcti341c56o
- GitHub URL for notebook used to create: https://github.com/johnhope829/movie-review-text-analytics/blob/main/src/lda.ipynb
- Delimiter: comma (.csv)

## LDA + PCA Visualization (4)

Apply PCA to the PHI table and plot the topics in the space opened by the first two components.

Size the points based on the mean document weight of each topic (using the THETA table).

Color the points based on a metadata feature from the LIB table.

Provide a brief interpretation of what you see.

![](../figures/lda_pca.png)

Looking at the above visual, there appear to be a few interesting patterns.

First, the size of the points tends to increase as the value of the second component (1) decreases. This indicates that the more dominant topics may be less distinctive or more general.

Second, all of the genres assigned to the topics are less popular ones. None of the topics map to the most popular genres (action, comedy, horror, drama, crime). This indicates that documents representing the popular genres don't have very distinctive topic distributions, whereas more niche genres do.

## Sentiment VOCAB_SENT (4)

Sentiment values associated with a subset of the VOCAB from a curated sentiment lexicon.

- UVA Box URL: https://virginia.box.com/s/qfkl3o3aclj1jvaf592wydjip148hyft
- UVA Box URL for source lexicon:
- GitHub URL for notebook used to create: https://github.com/johnhope829/movie-review-text-analytics/blob/main/src/sentiment_analysis.ipynb
- Delimiter: comma (.csv)

## Sentiment BOW_SENT (4)

Sentiment values from VOCAB_SENT mapped onto BOW.

- UVA Box URL: https://virginia.box.com/s/adhc8c42gcnbr9fa7y327j15gx07osqg
- GitHub URL for notebook used to create: https://github.com/johnhope829/movie-review-text-analytics/blob/main/src/sentiment_analysis.ipynb
- Delimiter: comma (.csv)

## Sentiment DOC_SENT (4)

Computed sentiment per bag computed from BOW_SENT.

- UVA Box URL: https://virginia.box.com/s/60w4v2k59j3jp4wpnhlpgfq6yqhyi02b
- GitHub URL for notebook used to create: https://github.com/johnhope829/movie-review-text-analytics/blob/main/src/sentiment_analysis.ipynb
- Delimiter: comma (.csv)
- Document bag expressed in terms of OHCO levels: `review_id`
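
One plausible way to get from BOW_SENT to DOC_SENT is a count-weighted mean of term polarities per review, ignoring terms that are absent from the lexicon. The notebook does not state its aggregation rule or name its lexicon here, so both the weighting and the lexicon values below are assumptions made for illustration.

```python
def doc_sentiment(bow, lexicon):
    """Count-weighted mean polarity per review, skipping terms not in the lexicon."""
    totals, weights = {}, {}
    for (review_id, term), n in bow.items():
        if term in lexicon:
            totals[review_id] = totals.get(review_id, 0.0) + n * lexicon[term]
            weights[review_id] = weights.get(review_id, 0) + n
    return {rid: totals[rid] / weights[rid] for rid in totals}


# Hypothetical lexicon and counts.
lexicon = {"good": 1.0, "bad": -1.0, "boring": -0.5}
bow = {(1, "good"): 2, (1, "boring"): 1, (1, "movie"): 3, (2, "bad"): 1}
sentiment = doc_sentiment(bow, lexicon)
print(sentiment)
```

Reviews with no lexicon terms are simply absent from the result, so they need explicit handling (for example, a neutral default) before plotting.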

## Sentiment Plot (4)

Plot sentiment over some metric space, such as time.

If you don't have a metric metadata feature, plot sentiment over a feature of your choice.

You may use a bar chart or a line graph.

![](../figures/sentiment_genre.png)

![](../figures/sentiment_year.png)

From the visuals above, we can see interesting trends in sentiment across genre and time. Looking at the bar plot for genre, we see that genres like News, Reality-TV, and Family have among the most positive average sentiments, whereas Sports and Game-Show are the two lowest/most negative. These results seem plausible, as viewers are likely more receptive to informative and grounded stories like those in the positive genres, whereas sports and game shows promote competition, which may sway viewers to be harsher. However, all of these genres have limited representation in the corpus, so these results may not generalize well.

Looking at sentiment over time, there don't appear to be any major changes, apart from perhaps a slight increase in sentiment after the 1980s. The line does appear to converge starting in the 1960s, which is likely due simply to more films/shows from these periods being represented in the corpus.

## VOCAB_W2V (4)

A table of word2vec features associated with terms in the VOCAB table.

- UVA Box URL: https://virginia.box.com/s/dqwdmdb4fhfqg8tsloe7ikavh3inlx9v
- GitHub URL for notebook used to create: https://github.com/johnhope829/movie-review-text-analytics/blob/main/src/word2vec.ipynb
- Delimiter: comma (.csv)
- Document bag expressed in terms of OHCO levels: `review_id`
- Number of features generated: 256
- The library used to generate the embeddings: Gensim

## Word2vec tSNE Plot (4)

Plot word embedding features in two dimensions using t-SNE.

Describe a cluster in the plot that captures your attention.

![](../figures/word2vec.png)

Looking at the visual above, there appear to be a few clusters. One interesting one is the group of verbs (purple) at around (-60, 10) on the graph. Many of these verbs relate to viewing the associated films/shows, including words like saw, seen, viewed, liked, and enjoyed. The embedding model was evidently able to capture the semantic similarity of these words within the context of these reviews.
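
Clusters like this one reflect terms whose embedding vectors point in similar directions, which is usually measured with cosine similarity. The sketch below uses hypothetical 3-dimensional toy vectors rather than the 256-dimensional Gensim embeddings, and the helper names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(term, vectors, k=2):
    """Return the k terms most similar to `term`, excluding the term itself."""
    others = [
        (t, cosine_similarity(vectors[term], vec))
        for t, vec in vectors.items()
        if t != term
    ]
    return sorted(others, key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical toy vectors, not real word2vec output.
vectors = {
    "saw": [0.9, 0.1, 0.0],
    "seen": [0.8, 0.2, 0.1],
    "viewed": [0.85, 0.15, 0.05],
    "blood": [0.0, 0.1, 0.9],
}
print(nearest("saw", vectors))
```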

# Riffs

Provide at least three visualizations that combine the preceding model data in interesting ways.

These should provide insight into how features in the LIB table are related. 

The nature of this relationship is left open to you -- it may be correlation, or mutual information, or something less well defined. 

In doing so, consider the following visualization types:

- Hierarchical cluster diagrams
- Heatmaps
- Scatter plots
- KDE plots
- Dispersion plots
- t-SNE plots
- etc.

## Riff 1 (5)

![](../figures/genre_decade_sentiment.png)

The above heatmap shows the relationship between average sentiment across genres and decades. While there appears to be relatively neutral sentiments for the majority of points here, we can see moderate increases in sentiment scores for genres like Comedy, Action, and Drama throughout time, while genres like Horror and Musical show more fluctuating or negative sentiments over time.
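
The cells of a heatmap like this one are group means over (genre, decade) pairs. A minimal sketch of that aggregation, assuming each record carries a genre, a release year, and a document sentiment score; the field names and values are hypothetical.

```python
from collections import defaultdict


def genre_decade_means(records):
    """Average sentiment for each (genre, decade) cell."""
    cells = defaultdict(list)
    for rec in records:
        decade = (rec["year"] // 10) * 10  # e.g. 1994 -> 1990
        cells[(rec["genre"], decade)].append(rec["sentiment"])
    return {cell: sum(vals) / len(vals) for cell, vals in cells.items()}


# Hypothetical records, not taken from the LIB or DOC_SENT tables.
records = [
    {"genre": "Comedy", "year": 1994, "sentiment": 0.2},
    {"genre": "Comedy", "year": 1998, "sentiment": 0.4},
    {"genre": "Horror", "year": 1985, "sentiment": -0.1},
    {"genre": "Comedy", "year": 2003, "sentiment": 0.5},
]
means = genre_decade_means(records)
print(means)
```

Keeping the per-cell counts alongside the means helps flag cells that rest on only a handful of reviews.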

## Riff 2 (5)

![](../figures/sentiment_by_rating.png)

An interesting takeaway from this visual is the clear bimodal distribution: the lowest-rated and highest-rated movies have the highest average sentiments, though the sample size for these movies is limited compared to the other bins. This could indicate that very highly rated movies are spoken well of, and that the lowest-rated movies may draw high sentiment out of nostalgia or because of how comically bad they were. A comparable case is Morbius, a very poorly rated movie that became the subject of memes and a lot of comedy.

## Riff 3 (5)

![](../figures/wordcloud_comedy.png)
![](../figures/wordcloud_drama.png)
![](../figures/wordcloud_horror.png)

Looking at the word clouds of the top words (by TFIDF) across some of the most popular genres, there is substantial overlap in the most common terms. Words like "film", "movie", and "story" are among the largest in all three. But there are also a few interesting differences. For comedy, "love" is more prominent, and "fun" and "laugh" also appear. For drama, "characters" and "people" are much more frequent than in the other genres, and "plot" is slightly larger. For horror, we see key additions, like "blood", but less emphasis on words like "plot" and "characters". 

# Interpretation (4)

Describe something interesting about your corpus that you discovered during the process of completing this assignment.

At a minimum, use 250 words, but you may use more. You may also add images if you'd like.

Throughout this project, analyzing this corpus of movie reviews gave me interesting and notable insights into the text and into differences across genre, time, and rating.

One interesting discovery was the distribution of sentiment across genres. Genres like Comedy, Action, and Drama exhibited much more neutral average sentiments, with broader ranges, suggesting a wider variety of emotional responses from audiences. Though this is likely due to the higher presence of these genres in the data, it makes sense due to their popularity in modern film. In contrast, more specific genres such as Game-Show and Western had more narrow sentiment distributions, possibly due to their more niche appeal. This indicates that more popular genres tend to evoke a wider spectrum of emotions, while niche genres elicit more consistent reactions.

Another interesting observation came from the topic modeling results using PCA and LDA. When visualizing the PCA components, it was clear that genres like Comedy and Action had a more dispersed distribution, pointing to a greater diversity in themes within those genres. On the other hand, the more niche genre reviews were more concentrated in the feature space, having higher mean document weights, reflecting a more focused set of thematic elements.

In addition, analyzing TFIDF revealed some key linguistic differences across genres. For instance, words like "laugh", "fun", and "love" dominated in Comedy, while Drama saw terms like "characters" and "story" appear more frequently, and Horror featured "blood". These findings reinforced how the language used in movie reviews reflects the genre’s emotional and thematic focus.

Overall, this project revealed how sentiment and thematic content vary across genres and the reviews corresponding to them, highlighting how the language in such reviews mirrors the expectations and characteristics of different movie categories over time.