# Final Project Notebook

DS 5001 Text as Data | Spring 2025

# Metadata

- Full Name: Mustakim Muhurto Rahman
- Userid: bur6yx 
- GitHub Repo URL:
- UVA Box URL:

# Overview

The goal of the final project is for you to create a **digital analytical edition** of a corpus using the tools, practices, and perspectives you’ve learned in this course. You will select a corpus that has already been digitized and transcribed, parse that into an F-compliant set of tables, and then generate and visualize the results of a series of fitted models. You will also draw some tentative conclusions regarding the linguistic, cultural, psychological, or historical features represented by your corpus. The point of the exercise is to have you work with a corpus through the entire pipeline from ingestion to interpretation.

Specifically, you will acquire a collection of long-form texts and perform the following operations:

- **Convert** the collection from their source formats (F0) into a set of tables that conform to the Standard Text Analytic Data Model (F2).
- **Annotate** these tables with statistical and linguistic features using NLP libraries such as NLTK (F3).
- **Produce** a vector representation of the corpus to generate TFIDF values to add to the TOKEN (aka CORPUS) and VOCAB tables (F4).
- **Model** the annotated and vectorized model with tables and features derived from the application of unsupervised methods, including PCA, LDA, and word2vec (F5).
- **Explore** your results using statistical and visual methods.
- **Present** conclusions about patterns observed in the corpus by means of these operations.

When you are finished, you will make the results of your work available in GitHub (for code) and UVA Box (for data). You will submit to Gradescope (via Canvas) a PDF version of a Jupyter notebook that contains the information listed below.

# Some Details

- Please fill out your answers in each task below by editing the markdown cell. 
- Replace text that asks you to insert something with the thing, i.e. replace `(INSERT IMAGE HERE)` with an image element, e.g. `![](image.png)`.
- For URLs, just paste the raw URL directly into the text area. Don't worry about providing link labels using `[label](link)`.
- Please do not alter the structure of the document or cell, i.e. the bulleted lists. 
- You may add explanatory paragraphs below the bulleted lists.
- Please name your tables as they are named in each task below.
- Tasks are indicated by headers with point values in parentheses.

# Raw Data

## Source Description (1)

Provide a brief description of your source material, including its provenance and content. Tell us where you found it and what kind of content it contains.

My source material consists of selected works by Franz Kafka, primarily those that have been translated from German into English (a limitation imposed for the sake of interpretability). The collection includes Kafka's major novels (*The Trial*, *The Castle*, and *Amerika*), novellas such as *The Metamorphosis* and *The Judgement*, and a number of short stories, including *A Country Doctor* and *The Hunger Artist*. The translations used were primarily by Ian Johnston, Edwin and Willa Muir, and Tania and James Stern. I accessed these texts through publicly available literary archives such as Project Gutenberg, as well as other online repositories of classic literature. While an estimated 90% of Kafka's work was lost or destroyed (often by Kafka himself), what remains forms the core of my analysis of one of my favorite authors of all time.

## Source Features (1)

Add values for the following items. (Do this for all following bulleted lists.)

- Source URL: https://www.gutenberg.org/ebooks/author/1735 , https://antilogicalism.com/wp-content/uploads/2017/07/kafka.pdf , https://www.kafka-online.info/works.htm 
- UVA Box URL: https://virginia.box.com/s/k6s0w9oq0vj23160dd98ifok9gnr6hy3
- Number of raw documents: 20
- Total size of raw documents (e.g. in MB): 2.33 MB
- File format(s), e.g. XML, plaintext, etc.: plaintext UTF-8 encoded txt files

## Source Document Structure (1)

Provide a brief description of the internal structure of each document. That is, describe the typical elements found in each document and their relation to each other. For example, a corpus of letters might be described as having a date, an addressee, a salutation, a set of content paragraphs, and a closing. If there are various structures, state that.

The internal structure of each document varies with the type of work. Kafka's novels, such as *The Trial* and *The Castle*, are typically divided into chapters, which are often unnamed or simply numbered with Roman numerals. Short stories (such as *A Country Doctor*) and novellas, however, typically consist of a continuous body of text without chapter breaks. Works like *Meditation* are structured as collections of short stories and prose pieces, the shortest of which are a single paragraph long and often more allegorical than narrative in nature.
Because of this variety, we anticipate that analytical approaches like bag-of-words models will face challenges when applied uniformly across the texts. To account for these differences, we may isolate texts by form, for instance analyzing only the novels or only the short stories when structural consistency is needed. In other scenarios, such as sentiment analysis or thematic clustering, it might make more sense to include the entire corpus regardless of structural variation.
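
To make the contrast between chaptered and unchaptered texts concrete, here is a minimal sketch of how a Roman-numeral heading pattern could assign chapter numbers to lines. The sample lines and the `label_chapters` helper are hypothetical, and this is not necessarily how the course's `TextParser` handles milestones.

```python
import re

# A line that holds nothing but a Roman numeral is treated as a chapter heading
ROMAN_HEADING = re.compile(r"^\s*[IVXLCM]+\s*$")

def label_chapters(lines, pattern=ROMAN_HEADING):
    """Pair each non-heading line with the number of the chapter it falls in.

    Lines before the first heading get chapter 0 (a convention chosen for this sketch).
    """
    chap_num = 0
    labeled = []
    for line in lines:
        if pattern.match(line):
            chap_num += 1  # a heading opens a new chapter and is not kept as text
            continue
        labeled.append((chap_num, line))
    return labeled

# Made-up lines standing in for a chaptered text
sample = ["Front matter", "I", "One morning...", "II", "Gregor's injury..."]
chapters = label_chapters(sample)
```

One caveat of this kind of pattern is that a line consisting only of the pronoun "I" would also be read as a heading, so the matches are worth spot-checking per text.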

# Parsed and Annotated Data

Parse the raw data into the three core tables of your edition: the `LIB`, `CORPUS`, and `VOCAB` tables.

These tables will be stored as CSV files with header rows.

You may consider using `|` as a delimiter.
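
As a small illustration of why a pipe delimiter is convenient, the sketch below writes and re-reads a table whose values contain commas. The two-row `LIB_demo` table is hypothetical, and an in-memory buffer stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical two-row LIB table, for illustration only
LIB_demo = pd.DataFrame(
    {"author": ["KAFKA, FRANZ", "KAFKA, FRANZ"], "title": ["THE TRIAL", "THE CASTLE"]},
    index=pd.Index([7849, 6969], name="book_id"),
)

buffer = io.StringIO()             # stands in for a file such as LIB.csv
LIB_demo.to_csv(buffer, sep="|")   # the header row is written by default
header_line = buffer.getvalue().splitlines()[0]

buffer.seek(0)
LIB_back = pd.read_csv(buffer, sep="|", index_col="book_id")
```

Because the author values contain commas, a comma-delimited file would need quoting, whereas the pipe-delimited version round-trips without it.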

Provide the following information for each.

In [52]:
    import pandas as pd
    import numpy as np
    from glob import glob
    import re
    import nltk
    import plotly.express as px
    import configparser
    import os
    config = configparser.ConfigParser()
    config.read("../../../env-sample.ini")
    data_home = '/Users/muhur/OneDrive/Desktop/Muhurto/Data Science Grad School/DS5001/KafkaFinal/data'
    output_dir = '/Users/muhur/OneDrive/Desktop/Muhurto/Data Science Grad School/DS5001/KafkaFinal/output'
    local_lib = '/Users/muhur/OneDrive/Desktop/Muhurto/Data Science Grad School/DS5001/DS5001-2025-01-R/lessons/lib'
    import sys
    sys.path.append(local_lib)
    from textparser import TextParser

    clip_pats = [
        r"\*\*\*\s*START OF",
        r"\*\*\*\s*END OF"
    ]

    # All are 'chap'and 'm'
    roman = '[IVXLCM]+'
    caps = "[A-Z';, -]+"
    ohco_pat_list = [
        (5200,   rf"^\s*CHAPTER\s+{roman}\s*$"), #Metamorphosis
        (7849,   rf"^\s*{roman}\s*$"), #The Trial
        (6969,  rf"^\s*LETTER .* to .*$"), # The Castle
        (6262,   rf"^CHAPTER\s+{roman}$"), # Amerika
        (6161,   rf"^CHAPTER\s+\d+$"), # The Judgement
        (6060,   rf"^Chapter\s+\d+$"), # Dearest Father
        (6363,  rf"^Chapter\s+\d+$"), # In the Penal colony
        (6464,   rf"^CHAPTER\s+\d+$"), # The Hunger Artist
        (6565, rf"^\s*CHAPTER\s+{roman}\."), # The Jackals and Arabs
        (6666, rf"^\s*CHAPTER\s+{roman}\s*$"), # A Country Doctor
        (6767, rf"^\s*CHAPTER\s+{roman}\s*$"), # An Imperial Message
        (5959,  rf"^(?:ETYMOLOGY|EXTRACTS|CHAPTER)"), # A report for an Academy
        (5858,  rf"^\s*CHAPTER\s+{roman}\.\s*$"), # The Great Wall of China
        (5757, rf"^\s*{roman}\.\s*$"), # The Hunter Gracchus
        (5656,  rf"^\s*{roman}\. .*$"), # Up in the Gallery
        (5555, rf"^CHAPTER\s+{roman}\.?$"), # Before the Law
        (5454, rf"^\s*[A-Z,;-]+\.\s*$"), # Josephine the Songstress
        (5353,  rf"^CHAPTER "), # The Burrow
        (5252, rf"^CHAPTER\s+{roman}\.\s*$"), # Blumfeld
        (23532, rf"Chapter\s+{roman}") # Meditation
    ]
    chapter_regexes = [
        (5200,   rf"^\s*{roman}\s*$"),
        (7849,   rf"^\s*Chapter\s+(?:One|Two|Three|Four|Five|Six|Seven|Eight|Nine|Ten)\s*$"),
        (6969,   rf"^\s*\d+\s*$"),
        (6262,   rf"^\s*\d+\s*$"),
        (6161,   "NOCHAPTERS"),
        (6060,   "NOCHAPTERS"),
        (6363,   "NOCHAPTERS"),
        (6464,   "NOCHAPTERS"),
        (6565,   "NOCHAPTERS"),
        (6666,   "NOCHAPTERS"),
        (6767,   "NOCHAPTERS"),
        (5959,   "NOCHAPTERS"),
        (5858,   "NOCHAPTERS"),
        (5757,   "NOCHAPTERS"),
        (5656,   "NOCHAPTERS"),
        (5555,   "NOCHAPTERS"),
        (5454,   "NOCHAPTERS"),
        (5353,   "NOCHAPTERS"),
        (5252,   "NOCHAPTERS"),
        (23532,  rf"^(Children on the country road|Unmasking a con artist|The Sudden Walk|Resolutions|The trip to the mountains|The Bachelor's Misfortune|The Merchant|Distracted Looking Out|The Way Home|The Passers-by|Passenger|Dresses|The rejection|Food for thought for gentlemen riders|The Alley Window|Desire to become an Indian|The Trees|Unhappiness)$")  # Poem title on line 1
    ]
    # Use the per-book chapter patterns; this overrides the draft `ohco_pat_list` defined above
    ohco_pat_list = chapter_regexes
    source_files = f'{data_home}'
    source_file_list = sorted(glob(f"{source_files}/*.*"))

    book_data = []
    for source_file_path in source_file_list:
        # Get the filename only, e.g. 'pg5353.txt'
        filename = os.path.basename(source_file_path)
        # Extract the numeric ID from the filename (remove 'pg' and '.txt')
        book_id = int(filename.replace('pg', '').replace('.txt', ''))
        # Use filename (without extension) as a raw title (optional: clean further)
        book_title = filename.replace('.txt', '').replace('_', ' ')
        # Append a tuple of (book_id, path, title)
        book_data.append((book_id, source_file_path, book_title))
    # Convert to DataFrame
    LIB = pd.DataFrame(book_data, columns=['book_id', 'source_file_path', 'raw_title']) \
            .set_index('book_id') \
            .sort_index()
    book_titles = {
        5200: "Metamorphosis",
        7849: "The Trial",
        6969: "The Castle",
        6262: "Amerika",
        6161: "The Judgement",
        6060: "Dearest Father",
        6363: "In the Penal Colony",
        6464: "The Hunger Artist",
        6565: "The Jackals and Arabs",
        6666: "A Country Doctor",
        6767: "An Imperial Message",
        5959: "A Report for an Academy",
        5858: "The Great Wall of China",
        5757: "The Hunter Gracchus",
        5656: "Up in the Gallery",
        5555: "Before the Law",
        5454: "Josephine the Songstress",
        5353: "The Burrow",
        5252: "Blumfeld",
        23532: "Meditation"
    }
    book_titles = {f'pg{key}': value for key, value in book_titles.items()}
    try:
        LIB['author'] = 'KAFKA, FRANZ'
        LIB['title'] = LIB.raw_title.replace(book_titles).str.upper()
        LIB = LIB.drop('raw_title', axis=1)
    except AttributeError:
        pass
    LIB['chap_regex'] = LIB.index.map(pd.Series({x[0]:x[1] for x in ohco_pat_list}))
    LIB

| book_id | source_file_path | author | title | chap_regex |
|---|---|---|---|---|
| 5200 | /Users/muhur/OneDrive/Desktop/Muhurto/Data Sci... | KAFKA, FRANZ | METAMORPHOSIS | `^\s*[IVXLCM]+\s*$` |
| 5252 | /Users/muhur/OneDrive/Desktop/Muhurto/Data Sci... | KAFKA, FRANZ | BLUMFELD | NOCHAPTERS |
| 5353 | /Users/muhur/OneDrive/Desktop/Muhurto/Data Sci... | KAFKA, FRANZ | THE BURROW | NOCHAPTERS |
| 5454 | /Users/muhur/OneDrive/Desktop/Muhurto/Data Sci... | KAFKA, FRANZ | JOSEPHINE THE SONGSTRESS | NOCHAPTERS |
| 5555 | /Users/muhur/OneDrive/Desktop/Muhurto/Data Sci... | KAFKA, FRANZ | BEFORE THE LAW | NOCHAPTERS |
| 5656 | /Users/muhur/OneDrive/Desktop/Muhurto/Data Sci... | KAFKA, FRANZ | UP IN THE GALLERY | NOCHAPTERS |
| 5757 | /Users/muhur/OneDrive/Desktop/Muhurto/Data Sci... | KAFKA, FRANZ | THE HUNTER GRACCHUS | NOCHAPTERS |
| 5858 | /Users/muhur/OneDrive/Desktop/Muhurto/Data Sci... | KAFKA, FRANZ | THE GREAT WALL OF CHINA | NOCHAPTERS |
| 5959 | /Users/muhur/OneDrive/Desktop/Muhurto/Data Sci... | KAFKA, FRANZ | A REPORT FOR AN ACADEMY | NOCHAPTERS |
| 6060 | /Users/muhur/OneDrive/Desktop/Muhurto/Data Sci... | KAFKA, FRANZ | DEAREST FATHER | NOCHAPTERS |


In [53]:
    # This cell takes 16 seconds to run
    def tokenize_collection(LIB):
        """Parse every book in LIB with TextParser and return one combined token table."""

        # Patterns marking the Project Gutenberg header and footer to clip
        clip_pats = [
            r"\*\*\*\s*START OF",
            r"\*\*\*\s*END OF"
        ]

        books = []
        for book_id in LIB.index:

            # Announce
            print("Tokenizing", book_id, LIB.loc[book_id].title)

            # Define vars
            chap_regex = LIB.loc[book_id].chap_regex
            ohco_pats = [('chap', chap_regex, 'm')]
            src_file_path = LIB.loc[book_id].source_file_path

            # Create object
            text = TextParser(src_file_path, ohco_pats=ohco_pats, clip_pats=clip_pats, use_nltk=True)
            # text = TextImporter(src_file_path, ohco_pats=ohco_pats, clip_pats=clip_pats)

            # Define parameters
            text.verbose = True
            text.strip_hyphens = True
            text.strip_whitespace = True

            # Parse
            text.import_source().parse_tokens()

            # Name things: prepend book_id to the OHCO index
            text.TOKENS['book_id'] = book_id
            text.TOKENS = text.TOKENS.reset_index().set_index(['book_id'] + text.OHCO)

            # Add to list
            books.append(text.TOKENS)

        # Combine into a single dataframe
        CORPUS = pd.concat(books).sort_index()

        # Clean up
        del books, text

        print("Done")

        return CORPUS

    CORPUS = tokenize_collection(LIB)

    Tokenizing 5200 METAMORPHOSIS
    Importing  /Users/muhur/OneDrive/Desktop/Muhurto/Data Science Grad School/DS5001/KafkaFinal/data\pg5200.txt
    Clipping text
    Parsing OHCO level 0 chap_id by milestone ^\s*[IVXLCM]+\s*$
    line_str chap_str
    Index(['chap_str'], dtype='object')
    Parsing OHCO level 1 para_num by delimitter \n\n
    Parsing OHCO level 2 sent_num by NLTK model
    Parsing OHCO level 3 token_num by NLTK model
    Tokenizing 5252 BLUMFELD
    Importing  /Users/muhur/OneDrive/Desktop/Muhurto/Data Science Grad School/DS5001/KafkaFinal/data\pg5252.txt
    Clipping text
    Parsing OHCO level 0 chap_id by milestone NOCHAPTERS
    line_str chap_str
    Index(['chap_str'], dtype='object')
    Parsing OHCO level 1 para_num by delimitter \n\n
    Parsing OHCO level 2 sent_num by NLTK model
    Parsing OHCO level 3 token_num by NLTK model
    Tokenizing 5353 THE BURROW
    Importing  /Users/muhur/OneDrive/Desktop/Muhurto/Data Science Grad School/DS5001/KafkaFinal/data\pg5353.txt
    Clipping text
    Parsing OHCO level 0 chap_id by milestone NOCHAPTERS
    li

      div_lines = self.TOKENS[src_col].str.contains(div_pat, regex=True, case=True)

    Done


## LIB (2)

The source documents the corpus comprises. These may be books, plays, newspaper articles, abstracts, blog posts, etc. 

Note that these are *not* documents in the sense used to describe a bag-of-words representation of a text, e.g. chapter.

- UVA Box URL:
- GitHub URL for notebook used to create:
- Delimiter:
- Number of observations:
- List of features, including at least three that may be used for model summarization (e.g. date, author, etc.):
- Average length of each document in characters: 
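
The observation count and average document length can be read off the table directly. The sketch below uses a hypothetical `LIB_demo` with the text held in a `text` column; in practice the text would be read from each row's `source_file_path`.

```python
import pandas as pd

# Hypothetical stand-in for LIB; the `text` column is an assumption for this sketch
LIB_demo = pd.DataFrame(
    {"title": ["A", "B", "C"], "text": ["abcd", "ab", "abcdef"]},
    index=pd.Index([1, 2, 3], name="book_id"),
)

n_obs = len(LIB_demo)                              # number of observations (rows)
LIB_demo["n_chars"] = LIB_demo["text"].str.len()   # characters per document
avg_len = LIB_demo["n_chars"].mean()               # average document length
```

Keeping `n_chars` as a column also gives one numeric feature that can be used later for model summarization.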

## CORPUS (2)

The sequence of word tokens in the corpus, indexed by their location in the corpus and document structures.

- UVA Box URL:
- GitHub URL for notebook used to create:
- Delimiter:
- Number of observations (should be between 500,000 and 2,000,000):
- OHCO Structure (as delimited column names):
- Columns (as delimited column names, including `token_str`, `term_str`, `pos`, and `pos_group`):
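
As a rough sketch of how the `term_str` and `pos_group` columns can be derived once tokens carry part-of-speech tags, consider the example below. The rows are made up, and the normalization rule (lowercasing and stripping non-word characters) is one possible choice rather than a required one.

```python
import pandas as pd

# OHCO levels matching the parser output above, with book_id prepended
OHCO = ["book_id", "chap_id", "para_num", "sent_num", "token_num"]

# Hypothetical tagged tokens (Penn Treebank tags)
rows = [
    (5200, 1, 0, 0, 0, "One", "CD"),
    (5200, 1, 0, 0, 1, "morning,", "NN"),
    (5200, 1, 0, 0, 2, "Gregor", "NNP"),
    (5200, 1, 0, 0, 3, "woke", "VBD"),
]
CORPUS_demo = pd.DataFrame(rows, columns=OHCO + ["token_str", "pos"]).set_index(OHCO)

# term_str: lowercase the token and strip punctuation and underscores
CORPUS_demo["term_str"] = (
    CORPUS_demo["token_str"].str.lower().str.replace(r"[\W_]+", "", regex=True)
)

# pos_group: first two characters of the tag, so NNP and NN both become NN
CORPUS_demo["pos_group"] = CORPUS_demo["pos"].str[:2]
```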

## VOCAB (2)

The unique word types (terms) in the corpus.

- UVA Box URL:
- GitHub URL for notebook used to create:
- Delimiter:
- Number of observations:
- Columns (as delimited names, including `n`, `p`, `i`, `dfidf`, `porter_stem`, `max_pos`, `max_pos_group`, and `stop`):
- Note: Your VOCAB may contain ngrams. If so, add a feature for `ngram_length`.
- List the top 20 significant words in the corpus by DFIDF.

(INSERT LIST HERE)
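
The frequency-based VOCAB features can be sketched as follows. The token table is hypothetical, and the formulas assumed here are `p = n / sum(n)`, `i = -log2(p)`, and `dfidf = df * log2(N / df)`, where `df` is the number of documents containing a term and `N` is the number of documents.

```python
import numpy as np
import pandas as pd

# Hypothetical token table: one row per token, with the document it came from
tokens = pd.DataFrame({
    "doc_id":   [1, 1, 1, 2, 2, 3],
    "term_str": ["k", "court", "k", "k", "castle", "castle"],
})

VOCAB_demo = tokens.groupby("term_str").size().to_frame("n")
VOCAB_demo["p"] = VOCAB_demo["n"] / VOCAB_demo["n"].sum()   # relative frequency
VOCAB_demo["i"] = -np.log2(VOCAB_demo["p"])                  # self-information in bits

# Document frequency and DFIDF under the assumed formula
N_docs = tokens["doc_id"].nunique()
VOCAB_demo["df"] = tokens.groupby("term_str")["doc_id"].nunique()
VOCAB_demo["dfidf"] = VOCAB_demo["df"] * np.log2(N_docs / VOCAB_demo["df"])

# The top-20 list is then a sort on dfidf
top_terms = VOCAB_demo.sort_values("dfidf", ascending=False).head(20)
```

Under this formula a term that appears in every document scores zero, so the list favors terms spread across some but not all documents.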

# Derived Tables

## BOW (3)

A bag-of-words representation of the CORPUS.

- UVA Box URL:
- GitHub URL for notebook used to create:
- Delimiter:
- Bag (expressed in terms of OHCO levels):
- Number of observations:
- Columns (as delimited names, including `n`, `tfidf`):
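
A bag-of-words table amounts to a count of terms within each bag. The sketch below assumes chapters as bags (the first two OHCO levels) and uses made-up rows.

```python
import pandas as pd

OHCO = ["book_id", "chap_id", "para_num", "sent_num", "token_num"]
bag = OHCO[:2]  # chapters as bags; an assumption for this sketch

# Hypothetical token rows with normalized terms
rows = [
    (5200, 1, 0, 0, 0, "gregor"),
    (5200, 1, 0, 0, 1, "woke"),
    (5200, 1, 1, 0, 0, "gregor"),
    (5200, 2, 0, 0, 0, "sister"),
]
CORPUS_demo = pd.DataFrame(rows, columns=OHCO + ["term_str"]).set_index(OHCO)

# Count each term within each bag; the result is indexed by bag levels plus term_str
BOW_demo = CORPUS_demo.groupby(bag + ["term_str"]).size().to_frame("n")
```

Changing `bag` to `OHCO[:1]` or `OHCO[:3]` would switch the bag to whole books or to paragraphs without touching the rest of the code.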

## DTM (3)

A representation of the BOW as a sparse count matrix.

- UVA Box URL:
- UVA Box URL of BOW used to generate (if applicable):
- GitHub URL for notebook used to create:
- Delimiter:
- Bag (expressed in terms of OHCO levels):
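
Turning a BOW table into a document-term matrix is a pivot of the term level into columns. The sketch below uses a hypothetical three-row BOW and produces a dense frame; a truly sparse representation would need an additional conversion step.

```python
import pandas as pd

# Hypothetical BOW indexed by bag levels plus term_str
idx = pd.MultiIndex.from_tuples(
    [(5200, 1, "gregor"), (5200, 1, "woke"), (5200, 2, "sister")],
    names=["book_id", "chap_id", "term_str"],
)
BOW_demo = pd.DataFrame({"n": [2, 1, 1]}, index=idx)

# Pivot terms into columns; absent (bag, term) pairs become 0
DTM_demo = BOW_demo["n"].unstack(fill_value=0)
```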

## TFIDF (3)

A Document-Term matrix with TFIDF values.

- UVA Box URL:
- UVA Box URL of DTM or BOW used to create:
- GitHub URL for notebook used to create:
- Delimiter:
- Description of TFIDF formula ($\LaTeX$ OK):
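
One common formulation is sketched below. It is offered as an assumption for illustration and is not necessarily the variant used in this project. Here $n_{w,d}$ is the count of term $w$ in bag $d$, $N$ is the number of bags, and $\mathrm{df}(w)$ is the number of bags that contain $w$.

$$
\mathrm{tf}(w,d) = \frac{n_{w,d}}{\sum_{w'} n_{w',d}}, \qquad
\mathrm{idf}(w) = \log_2 \frac{N}{\mathrm{df}(w)}, \qquad
\mathrm{tfidf}(w,d) = \mathrm{tf}(w,d) \cdot \mathrm{idf}(w)
$$

Other choices of term frequency (raw counts, max-normalized counts) and of logarithm base are equally common, so the description should state which ones were actually applied.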

## Reduced and Normalized TFIDF_L2 (3)

A Document-Term matrix with L2 normalized TFIDF values.

- UVA Box URL:
- UVA Box URL of source TFIDF table:
- GitHub URL for notebook used to create:
- Delimiter:
- Number of features (i.e. significant words):
- Principle of significant word selection:
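
Reduction and L2 normalization can be sketched in two steps. The table is hypothetical, and keeping the top `k` terms by total TFIDF is only a stand-in for whatever significance principle is actually chosen.

```python
import numpy as np
import pandas as pd

# Hypothetical TFIDF table: documents in rows, terms in columns
TFIDF_demo = pd.DataFrame(
    [[3.0, 4.0, 0.0], [0.0, 1.0, 0.0]],
    index=["doc_a", "doc_b"], columns=["court", "castle", "father"],
)

# Step 1: keep the k terms with the largest total TFIDF (assumed selection rule)
k = 2
top_terms = TFIDF_demo.sum().nlargest(k).index
reduced = TFIDF_demo[top_terms]

# Step 2: divide each row by its Euclidean (L2) norm so every document has length 1
# (a row of all zeros would yield NaN here and would need separate handling)
norms = np.sqrt((reduced ** 2).sum(axis=1))
TFIDF_L2_demo = reduced.div(norms, axis=0)
```

After normalization, dot products between rows are cosine similarities, which removes the effect of document length on later comparisons.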

# Models

## PCA Components (4)

- UVA Box URL:
- UVA Box URL of the source TFIDF_L2 table:
- GitHub URL for notebook used to create:
- Delimiter:
- Number of components:
- Library used to generate:
- Top 5 positive terms for first component:
- Top 5 negative terms for second component:
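
For reference, PCA can be computed by eigendecomposition of the term covariance matrix. The sketch below uses NumPy on random data standing in for a TFIDF_L2 table; it is one way to obtain components, loadings, and a document-component matrix, not a record of the library used in this project.

```python
import numpy as np

# Hypothetical data: 30 documents by 5 terms, with two correlated term columns
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=30)

Xc = X - X.mean(axis=0)                    # center each term column
cov = np.cov(Xc, rowvar=False)             # term-by-term covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order

order = np.argsort(eig_vals)[::-1]         # sort components by variance, largest first
n_comps = 2
loadings = eig_vecs[:, order[:n_comps]]    # terms x components
dcm = Xc @ loadings                        # documents x components
explained = eig_vals[order[:n_comps]] / eig_vals.sum()  # share of variance per component
```

The top positive and negative terms for a component are then the largest and smallest entries in the corresponding column of `loadings`. The sign of each component is arbitrary, so "positive" and "negative" only have meaning relative to each other.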

## PCA DCM (4)

The document-component matrix generated.

- UVA Box URL:
- GitHub URL for notebook used to create:
- Delimiter:

## PCA Loadings (4)

The component-term matrix generated.

- UVA Box URL:
- GitHub URL for notebook used to create:
- Delimiter:

## PCA Visualization 1 (4)

Include a scatterplot of documents in the space created by the first two components.

Color the points based on a metadata feature associated with the documents.

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

(INSERT IMAGE HERE)

(INSERT IMAGE HERE)

Briefly describe the nature of the polarity you see in the first component:

(INSERT DESCRIPTION HERE)

## PCA Visualization 2 (4)

Include a scatterplot of documents in the space created by the second two components.

Color the points based on a metadata feature associated with the documents.

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

(INSERT IMAGE HERE)

(INSERT IMAGE HERE)

Briefly describe the nature of the polarity you see in the second component:

(INSERT DESCRIPTION HERE)

## LDA TOPIC (4)

- UVA Box URL:
- UVA Box URL of count matrix used to create:
- GitHub URL for notebook used to create:
- Delimiter:
- Library used to compute:
- A description of any filtering, e.g. POS (Nouns and Verbs only):
- Number of components:
- Any other parameters used:
- Top 5 words and best-guess labels for the top five topics by mean document weight:
  - T00:
  - T01:
  - T02:
  - T03:
  - T04:
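
Once THETA and PHI exist, ranking topics by mean document weight and listing their top words takes a few lines. The two tables below are hypothetical, and the sketch lists two words per topic instead of five to keep the example small.

```python
import pandas as pd

# Hypothetical THETA (documents x topics) and PHI (topics x terms) tables
THETA_demo = pd.DataFrame(
    [[0.7, 0.2, 0.1], [0.6, 0.1, 0.3], [0.2, 0.1, 0.7]],
    columns=["T00", "T01", "T02"],
)
PHI_demo = pd.DataFrame(
    [[0.5, 0.3, 0.1, 0.1], [0.1, 0.1, 0.45, 0.35], [0.2, 0.1, 0.6, 0.1]],
    index=["T00", "T01", "T02"], columns=["court", "law", "father", "castle"],
)

# Rank topics by their mean weight across documents
mean_weight = THETA_demo.mean().sort_values(ascending=False)

# For each topic, join its highest-weighted terms into a label string
n_words = 2  # the task asks for 5
top_words = PHI_demo.apply(lambda row: " ".join(row.nlargest(n_words).index), axis=1)

# Combine both, ordered from the heaviest topic to the lightest
summary = pd.DataFrame({"mean_weight": mean_weight, "top_words": top_words})
summary = summary.loc[mean_weight.index]
```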

## LDA THETA (4)

- UVA Box URL:
- GitHub URL for notebook used to create:
- Delimiter:

## LDA PHI (4)

- UVA Box URL:
- GitHub URL for notebook used to create:
- Delimiter:

## LDA + PCA Visualization (4)

Apply PCA to the PHI table and plot the topics in the space opened by the first two components.

Size the points based on the mean document weight of each topic (using the THETA table).

Color the points based on a metadata feature from the LIB table.

Provide a brief interpretation of what you see.

(INSERT IMAGE HERE)

(INSERT INTERPRETATION HERE)

## Sentiment VOCAB_SENT (4)

Sentiment values associated with a subset of the VOCAB from a curated sentiment lexicon.

- UVA Box URL:
- UVA Box URL for source lexicon:
- GitHub URL for notebook used to create:
- Delimiter:

## Sentiment BOW_SENT (4)

Sentiment values from VOCAB_SENT mapped onto BOW.

- UVA Box URL:
- GitHub URL for notebook used to create:
- Delimiter:

## Sentiment DOC_SENT (4)

Computed sentiment per bag computed from BOW_SENT.

- UVA Box URL:
- GitHub URL for notebook used to create:
- Delimiter:
- Document bag expressed in terms of OHCO levels:
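
The chain from VOCAB_SENT to BOW_SENT to DOC_SENT can be sketched as a merge followed by a weighted average. The tables and the `polarity` values are made up, and the count-weighted mean is one reasonable aggregation among several.

```python
import pandas as pd

# Hypothetical BOW with chapters as bags
idx = pd.MultiIndex.from_tuples(
    [(5200, 1, "fear"), (5200, 1, "love"), (5200, 2, "fear"), (5200, 2, "door")],
    names=["book_id", "chap_id", "term_str"],
)
BOW_demo = pd.DataFrame({"n": [2, 1, 1, 3]}, index=idx)

# Hypothetical lexicon scores; real values would come from the chosen sentiment lexicon
VOCAB_SENT_demo = pd.DataFrame(
    {"polarity": [-1.0, 1.0]}, index=pd.Index(["fear", "love"], name="term_str")
)

# Map lexicon values onto BOW; terms missing from the lexicon drop out of the inner merge
BOW_SENT_demo = (
    BOW_demo.reset_index()
    .merge(VOCAB_SENT_demo.reset_index(), on="term_str", how="inner")
    .set_index(["book_id", "chap_id", "term_str"])
)

# Count-weighted mean polarity per bag
bag = ["book_id", "chap_id"]
weighted = BOW_SENT_demo["polarity"] * BOW_SENT_demo["n"]
DOC_SENT_demo = (
    weighted.groupby(level=bag).sum() / BOW_SENT_demo["n"].groupby(level=bag).sum()
).to_frame("polarity")
```

Note that the denominator counts only tokens found in the lexicon, so the score reflects the balance of emotional vocabulary and not its density in the bag.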

## Sentiment Plot (4)

Plot sentiment over some metric space, such as time.

If you don't have a metric metadata feature, plot sentiment over a feature of your choice.

You may use a bar chart or a line graph.

(INSERT IMAGE HERE)

## VOCAB_W2V (4)

A table of word2vec features associated with terms in the VOCAB table.

- UVA Box URL:
- GitHub URL for notebook used to create:
- Delimiter:
- Document bag expressed in terms of OHCO levels:
- Number of features generated:
- The library used to generate the embeddings:

## Word2vec tSNE Plot (4)

Plot word embedding features in two dimensions using t-SNE.

Describe a cluster in the plot that captures your attention.

(INSERT IMAGE HERE)

(INSERT DESCRIPTION HERE)

# Riffs

Provide at least three visualizations that combine the preceding model data in interesting ways.

These should provide insight into how features in the LIB table are related. 

The nature of this relationship is left open to you -- it may be correlation, or mutual information, or something less well defined. 

In doing so, consider the following visualization types:

- Hierarchical cluster diagrams
- Heatmaps
- Scatter plots
- KDE plots
- Dispersion plots
- t-SNE plots
- etc.

## Riff 1 (5)

(INSERT IMAGE HERE)

(INSERT INTERPRETATION HERE)

## Riff 2 (5)

(INSERT IMAGE HERE)

(INSERT INTERPRETATION HERE)

## Riff 3 (5)

(INSERT IMAGE HERE)

(INSERT INTERPRETATION HERE)

# Interpretation (4)

Describe something interesting about your corpus that you discovered during the process of completing this assignment.

At a minimum, use 250 words, but you may use more. You may also add images if you'd like.

(INSERT INTERPRETATION HERE)