# Data Model

DS 5001 Spring 2023 Final Project

Rachel Grace Treene

rg5xm@virginia.edu

## Output: directory of files produced to meet requirements of the project
- **CORPUS**: annotated tokens table (https://github.com/rachelgracetreene/text-analytics-final-project/blob/main/output/CORPUS.csv)
    - pos_tuple: a tuple representing the part of speech and the token string
    - pos: abbreviation representing the part of speech
    - token_str: string representing the token with its formatting (capital letters, etc.)
    - term_str: string representing the term without formatting like capital letters
- **LDA-PHI**: topics and term counts (https://github.com/rachelgracetreene/text-analytics-final-project/blob/main/output/LDA-PHI.csv)
    - T00, T01, T02, ... T18, T19: features representing topics 1-20 produced in LDA
- **LDA-THETA**: document and topic concentrations (https://github.com/rachelgracetreene/text-analytics-final-project/blob/main/output/LDA-THETA.csv)
    - T00, T01, T02, ... T18, T19: features representing topics 1-20 produced in LDA
- **LIB**: metadata for the source files (https://github.com/rachelgracetreene/text-analytics-final-project/blob/main/output/LIB.csv)
    - title: title of each document
    - chapter_regex: regex corresponding with the chapter titles
    - book_len: number of tokens
    - n_chaps: number of chapters
    - kendall_sum: Kendall statistic for rank correlation measurement
- **PCA-DCM**: table of documents and components (https://github.com/rachelgracetreene/text-analytics-final-project/blob/main/output/PCA-DCM.csv)
    - PC0, PC1, ... PC8, PC9: features representing principal components 1-10 produced in PCA
    - title: title of each document
    - chapter_regex: regex corresponding with the chapter titles
    - book_len: number of tokens
    - n_chaps: number of chapters
    - kendall_sum: Kendall statistic for rank correlation measurement
    - doc: title of each document with corresponding chapter numbers of each observation
- **PCA-LOADINGS**: table of components and word counts (https://github.com/rachelgracetreene/text-analytics-final-project/blob/main/output/PCA-LOADINGS.csv)
    - PC0, PC1, ... PC8, PC9: features representing principal components 1-10 produced in PCA
- **SA-DOCEMOTIONS**: sentiment polarity and emotions for each document (https://github.com/rachelgracetreene/text-analytics-final-project/blob/main/output/SA-DOCEMOTIONS.csv)
    - anger: measurement of anger sentiment in each book
    - anticipation: measurement of anticipation sentiment in each book
    - disgust: measurement of disgust sentiment in each book
    - fear: measurement of fear sentiment in each book
    - joy: measurement of joy sentiment in each book
    - sadness: measurement of sadness sentiment in each book
    - surprise: measurement of surprise sentiment in each book
    - trust: measurement of trust sentiment in each book
    - polarity: measurement of overall sentiment in each book
- **SA-VOCAB**: sentiment and emotion values as features (https://github.com/rachelgracetreene/text-analytics-final-project/blob/main/output/SA-VOCAB.csv)
    - n: count of term in corpus
    - p: probability of term occurrence in corpus
    - i: information for each term in corpus
    - n_chars: number of characters for each term in corpus
    - max_pos: most frequently occurring part of speech for term
    - n_pos: number of unique parts of speech for the term
    - cat_pos: tag for each unique part of speech for the term
    - stop: dummy variable indicating whether a term is a stopword
    - stem_porter: stem for the term according to the Porter method
    - stem_snowball: stem for the term according to the Snowball method
    - stem_lancaster: stem for the term according to the Lancaster method
    - dfidf: global boolean term entropy
    - mean_tfidf: average significance of the term in a document
    - anger, anticipation, disgust, fear, joy, sadness, surprise, trust, sentiment, negative, positive: features representing the sentiment analysis lexicon mapped to corpus terms
    - polarity: measurement of overall sentiment for each term
- **VOCAB**: extracted vocabulary from corpus (https://github.com/rachelgracetreene/text-analytics-final-project/blob/main/output/VOCAB.csv)
    - n: count of term in corpus
    - p: probability of term occurrence in corpus
    - i: information for each term in corpus
    - n_chars: number of characters for each term in corpus
    - max_pos: most frequently occurring part of speech for term
    - n_pos: number of unique parts of speech for the term
    - cat_pos: tag for each unique part of speech for the term
    - stop: dummy variable indicating whether a term is a stopword
    - stem_porter: stem for the term according to the Porter method
    - stem_snowball: stem for the term according to the Snowball method
    - stem_lancaster: stem for the term according to the Lancaster method
    - dfidf: global boolean term entropy
    - mean_tfidf: average significance of the term in a document
- **W2V-VOCAB**: word2vec terms and embeddings (https://github.com/rachelgracetreene/text-analytics-final-project/blob/main/output/W2V-VOCAB.csv)
    - n: count of term in corpus
    - p: probability of term occurrence in corpus
    - i: information for each term in corpus
    - n_chars: number of characters for each term in corpus
    - max_pos: most frequently occurring part of speech for term
    - n_pos: number of unique parts of speech for the term
    - cat_pos: tag for each unique part of speech for the term
    - stop: dummy variable indicating whether a term is a stopword
    - stem_porter: stem for the term according to the Porter method
    - stem_snowball: stem for the term according to the Snowball method
    - stem_lancaster: stem for the term according to the Lancaster method
    - dfidf: global boolean term entropy
    - mean_tfidf: average significance of the term in a document
    - vector: representation of vectorized term in (x-coord, y-coord) form
    - x: x-coordinate for vectorized term
    - y: y-coordinate for vectorized term
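
The `n`, `p`, `i`, and `n_chars` columns shared by VOCAB, SA-VOCAB, and W2V-VOCAB can be illustrated with a minimal sketch. It is not the project's actual code: the toy token list is hypothetical, and taking `i` as base-2 self-information (`-log2(p)`) is an assumption, since the log base is not stated here.

```python
import math
from collections import Counter

# Hypothetical toy tokens standing in for the term_str column of CORPUS.
terms = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]

counts = Counter(terms)
total = sum(counts.values())

# One record per term, mirroring four of the VOCAB columns.
vocab = {
    term: {
        "n": n,                       # count of term in corpus
        "p": n / total,               # relative frequency of the term
        "i": -math.log2(n / total),   # assumption: self-information in bits
        "n_chars": len(term),         # number of characters in the term
    }
    for term, n in counts.items()
}

for term, row in sorted(vocab.items(), key=lambda kv: -kv[1]["n"]):
    print(term, row)
```

Frequent terms get a high `p` and a low `i`; rare terms get the reverse.
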
## Output-Viz: directory of files produced to be imported in the visualization notebook
- **DOCS**: chapter-level document statistics (https://github.com/rachelgracetreene/text-analytics-final-project/blob/main/output-viz/DOCS.csv)
    - n: number of tokens in the chapter
    - book_chap_sig: significance of the chapter
- **MT**: top 1000 terms by DFIDF (https://github.com/rachelgracetreene/text-analytics-final-project/blob/main/output-viz/MT.csv)
    - pressed, recognized, perfectly...: one column per term, holding the aggregate TF-IDF (significance of the term in a book) for the top 1000 terms of the corpus by DFIDF, excluding proper nouns
- **PAIRS**: correlation and distance measures between pairs of books (https://github.com/rachelgracetreene/text-analytics-final-project/blob/main/output-viz/PAIRS.csv)
    - correl: correlation between documents a and b
    - euclidean: Euclidean distance between documents a and b
    - cosine: cosine distance between documents a and b
    - cityblock: city block (Manhattan) distance between documents a and b
    - jaccard: Jaccard distance between documents a and b
    - js: Jensen-Shannon distance between documents a and b
- **POS-GROUP**: part of speech group characteristics (https://github.com/rachelgracetreene/text-analytics-final-project/blob/main/output-viz/POS-GROUP.csv)
    - n: count of words of each part of speech in corpus
    - pos_def: definition of each part of speech
    - p: probability of each part of speech
    - i: information for each part of speech
    - h: entropy of each part of speech
    - n_terms: count of unique terms of each part of speech
    - n_tokens: count of tokens of each part of speech
- **POS**: parts of speech (https://github.com/rachelgracetreene/text-analytics-final-project/blob/main/output-viz/POS.csv)
    - pos_def: definition of each part of speech
    - n: count of each part of speech in corpus
    - pos_group: part of speech group associated with each part of speech
    - punc: boolean, True for parts of speech that are punctuation
- **SA-CHAPEMOTIONS**: sentiment polarity and emotions for each chapter (https://github.com/rachelgracetreene/text-analytics-final-project/blob/main/output-viz/SA-CHAPEMOTIONS.csv)
    - anger: measurement of anger sentiment in each chapter
    - anticipation: measurement of anticipation sentiment in each chapter
    - disgust: measurement of disgust sentiment in each chapter
    - fear: measurement of fear sentiment in each chapter
    - joy: measurement of joy sentiment in each chapter
    - sadness: measurement of sadness sentiment in each chapter
    - surprise: measurement of surprise sentiment in each chapter
    - trust: measurement of trust sentiment in each chapter
    - polarity: measurement of overall sentiment in each chapter
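
For readers unfamiliar with the measures listed under PAIRS, the sketch below gives the standard textbook definitions of five of them for two document vectors. It is an illustration only: the vectors are hypothetical, and how the project actually computed these columns (library, vector space, log base for `js`) is not stated here. The base-2 logarithm in `js` is an assumption that bounds the distance between 0 and 1.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cityblock(a, b):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

def jaccard(a, b):
    """Share of terms present in only one document, among terms present in either."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1 - both / either if either else 0.0

def js(a, b):
    """Jensen-Shannon distance: square root of the JS divergence (base 2 assumed)."""
    p = [x / sum(a) for x in a]
    q = [y / sum(b) for y in b]
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(u, v):
        # Kullback-Leibler divergence, skipping zero-probability terms of u.
        return sum(x * math.log2(x / y) for x, y in zip(u, v) if x > 0)

    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))

# Hypothetical term-count vectors for two documents over a four-term vocabulary.
doc_a = [1, 0, 2, 0]
doc_b = [0, 1, 2, 2]

for name, fn in [("euclidean", euclidean), ("cosine", cosine),
                 ("cityblock", cityblock), ("jaccard", jaccard), ("js", js)]:
    print(name, fn(doc_a, doc_b))
```

All five functions return 0 for identical documents and grow as the documents diverge.
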