# 4. Part-of-Speech Tagging & DocuScope's Rhetorical Tagging

**Goal**: Learn about part-of-speech (POS) tagging, a textual analysis method, and DocuScope's categorical tagging.

Tagging methods help us computationally parse, sort, and filter sentences in a large corpus, so that we can better understand words of interest in their contexts of use.

We will be working with the English-language `docuscospacy` model in this lesson, a spaCy language model developed by rhetoricians at Carnegie Mellon University. I'd like to thank them for their helpful documentation, which I have adapted for this lesson.

This lesson is limited to the English language. To work with other languages, feel free to check out Melanie Walsh's and Quinn Dombrowski's curated tutorials:

- [Chinese POS Tagging](Multilingual/Chinese/03-POS-Keywords-Chinese)
- [Danish POS Tagging](Multilingual/Danish/03-POS-Keywords-Danish)
- [Portuguese POS Tagging](Multilingual/Portuguese/03-POS-Keywords-Portuguese)
- [Russian POS Tagging](Multilingual/Russian/03-POS-Keywords-Russian)
- [Spanish POS Tagging](Multilingual/Spanish/03-POS-Keywords-Spanish)


## 4.1 What are Part-of-Speech & DocuScope Tagging?

### 4.1.1 Part-of-Speech tagging

Part-of-speech (POS) tagging, also called grammatical tagging, is the culmination of decades of labor to annotate lexicons by their typical syntactical role(s), through both manual and probabilistic/computational means. (See [Lancaster U's summary](https://ucrel.lancs.ac.uk/annotation.html#POS) of their original process for more information about POS tagging.)

If you have ever diagrammed a sentence, like the one below, then you already have a sense of POS. Note how the parts of speech have been tagged in the sentence. Look up those tags in [Lancaster's table of POS tags](https://ucrel.lancs.ac.uk/annotation.html#POS).

<img src="../images/pos-tagging-example.png" alt="">

These syntactical relations may feel dense to understand and use at first. That's normal! With some practice, POS tagging can help you explore and surface interesting language patterns, used in a particular context and domain, that are worth reporting.

### 4.1.2 DocuScope's *Categorical* Tagging

In addition to POS tags, DocuScope's creators (David Kaufer, Suguru Ishizaki, and Kerry Ishizaki) have spent over 20 years developing a lexicon-based tagger that tags texts for the variety of experiences a writer can create for readers. DocuScope can distinguish variations of a word token based on the tokens around it. For instance, consider the following rhetorical patterns involving "I":

- "I" = self-reference
- "I know" = self-confidence
- "I have to do chores today" = self-reluctance
- "I get to do chores today" = eagerness

As the example above demonstrates, DocuScope's rhetorical approach to language assumes that meaning is better isolated in "streams of words," i.e., co-occurrences, than in individual words. *In other words*, a single word can be interpreted differently depending on the words that accompany it. Take the word "sick" below.

- I'm feeling as sick as a dog. = I am feeling very ill.
- That was so sick! = I thought something was really cool.
- I'm sick of talking about AI! = I don't want to talk about AI.
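
To make the "streams of words" idea concrete, here is a minimal sketch of greedy longest-match tagging. It is a toy: the `TOY_LEXICON` entries and labels are hypothetical, borrowed from the "I" examples above, and the function `tag_streams` is an illustration of the general idea rather than DocuScope's actual implementation.

```python
# Toy lexicon: word streams (as tuples of lowercase tokens) mapped to labels.
# ASSUMPTION: these entries are invented for illustration; they are not
# DocuScope's real lexicon or category names.
TOY_LEXICON = {
    ("i",): "self-reference",
    ("i", "know"): "self-confidence",
    ("i", "have", "to"): "self-reluctance",
    ("i", "get", "to"): "eagerness",
}

# Longest stream in the lexicon, so we know how far ahead to look.
MAX_LEN = max(len(stream) for stream in TOY_LEXICON)


def tag_streams(text):
    """Tag lowercase whitespace tokens, preferring the longest matching stream."""
    tokens = text.lower().split()
    tagged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate stream first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            stream = tuple(tokens[i:i + n])
            if stream in TOY_LEXICON:
                tagged.append((" ".join(stream), TOY_LEXICON[stream]))
                i += n
                break
        else:
            # No stream starts here: keep the token untagged.
            tagged.append((tokens[i], None))
            i += 1
    return tagged


for sentence in ["I know", "I have to do chores today", "I get to do chores today"]:
    print(tag_streams(sentence))
```

The same word "I" receives a different label in each sentence because the longest stream that contains it wins.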

As of 2019, DS's database included 44,000 semantic stream classes and over 70 million unique English language patterns. The result? DocuScope offers people **categorical tags** to pair with POS tags.

Categorical tags isolate potential rhetorical patterns within the corpus. To understand what I mean by rhetorical patterns, let's consider DS's categorical tags of `Reasoning` and `Character`. `Reasoning` isolates words or streams of words that are associated with reasoning statements, such as "because," "therefore," "even if," etc., while `Character` isolates multiple dimensions of a person. For example, depending on the word's company, the name "Pauline" might be tagged in relation to work-related characteristics or other domain-specific contextual information. 

So, if you decide to filter your corpus by the co-occurrence of `Reasoning`-tagged words with `Character`-tagged words, you could analyze the most and least frequent relationships between *who* is represented in the corpus (or not!) and the words related to how they *reason*. From there, you can perhaps begin to imagine how you could conduct an exploratory data analysis (EDA) to ask questions of the patterns.
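
As a rough sketch of that kind of filtering, the snippet below counts which `Character` tokens share a sentence with which `Reasoning` tokens. The sentences, the tags assigned to them, and the helper `cooccurring_pairs` are all hypothetical; real tags would come from the tagger, not from a hand-written list.

```python
from collections import Counter

# ASSUMPTION: hand-tagged toy sentences, each a list of (token, category) pairs.
tagged_sentences = [
    [("Pauline", "Character"), ("left", None), ("because", "Reasoning"),
     ("she", "Character"), ("was", "InformationStates"), ("tired", None)],
    [("The", None), ("report", None), ("is", "InformationStates"), ("late", None)],
    [("Therefore", "Reasoning"), ("the", None),
     ("representatives", "Character"), ("agreed", None)],
]


def cooccurring_pairs(sentences, tag_a, tag_b):
    """Count (token_a, token_b) pairs that appear in the same sentence."""
    pairs = Counter()
    for sentence in sentences:
        a_tokens = [tok.lower() for tok, tag in sentence if tag == tag_a]
        b_tokens = [tok.lower() for tok, tag in sentence if tag == tag_b]
        for a in a_tokens:
            for b in b_tokens:
                pairs[(a, b)] += 1
    return pairs


pairs = cooccurring_pairs(tagged_sentences, "Character", "Reasoning")
print(pairs.most_common())
```

With a real corpus, `pairs.most_common(10)` would surface the top relationships, and slicing the end of `pairs.most_common()` would surface the bottom ones.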

#### How DS tags categories

According to [DS's creators](https://docuscospacy.readthedocs.io/en/latest/docuscope.html), DS "consists of an enormous lexicon organized into a 3-level taxonomy. An analogue would be the lexicons typically used in sentiment analysis." A sentiment analysis typically categorizes words and phrases into either positive or negative sentiments. DS works similarly, but it "organizes its strings into many more categories and is orders of magnitude larger. A typical sentiment lexicon may match 3-5 thousand strings. DocuScope matches 100s of millions."

Let's walk through an example sentence and what categorical tags offer in addition to POS tags:

> Jaws is a shrewd cinematic equation which not only gives you one or two very nasty turns when you least expect them but, possibly more important, knows when to make you think another is coming without actually providing it.

When DS analyzes this sentence, it yields the table below. The headers are as follows: `tag_` is the POS tag; `ent_` marks a token's position within a tagged span (`B` begins a span, `I` continues it, and `O` falls outside any span); and `ent_type_` is the DS categorical tag.

Note how, in addition to POS tagging, DS categorical tags add a helpful rhetorical and contextual dimension to each token. The reference table that follows this example lists the other categorical tags that DS offers.

##### *Categorical tag* results from the "Jaws is a shrewd ..." sentence above.

|      | **text**  | **tag\_** | **ent\_** | **ent\_type\_**       |
|------|-----------|-----------|-----------|-----------------------|
| 0    | Jaws      | NN1       | B         | Character             |
| 1    | is        | VBZ       | B         | InformationStates     |
| 2    | a         | AT1       | I         | InformationStates     |
| 3    | shrewd    | JJ        | B         | Strategic             |
| 4    | cinematic | JJ        | B         | PublicTerms           |
| 5    | equation  | NN1       | B         | AcademicTerms         |
| 6    | which     | DDQ       | B         | SyntacticComplexity   |
| 7    | not       | XX        | B         | ForceStressed         |
| 8    | only      | RR        | I         | ForceStressed         |
| 9    | gives     | VVZ       | B         | Interactive           |
| 10   | you       | PPY       | I         | Interactive           |
| 11   | one       | MC1       | O         |                       |
| 12   | or        | CC        | B         | MetadiscourseCohesive |
| 13   | two       | MC        | B         | InformationExposition |
| 14   | very      | RG        | B         | ConfidenceHigh        |
| 15   | nasty     | JJ        | B         | Negative              |
| 16   | turns     | NN2       | O         |                       |
| 17   | when      | RRQ       | B         | Narrative             |
| 18   | you       | PPY       | I         | Narrative             |
| 19   | least     | RRT       | B         | InformationExposition |
| 20   | expect    | VV0       | B         | Future                |
| 21   | them      | PPHO2     | B         | Narrative             |
| 22   | but       | CCB       | B         | MetadiscourseCohesive |
| 23   | ,         | Y         | B         | Contingent            |
| 24   | possibly  | RR        | I         | Contingent            |
| 25   | more      | RGR       | B         | InformationExposition |
| 26   | important | JJ        | I         | InformationExposition |
| 27   | ,         | Y         | O         |                       |
| 28   | knows     | VVZ       | B         | ConfidenceHigh        |
| 29   | when      | RRQ       | I         | ConfidenceHigh        |
| 30   | to        | TO        | O         |                       |
| 31   | make      | VVI       | B         | Interactive           |
| 32   | you       | PPY       | I         | Interactive           |
| 33   | think     | VVI       | B         | Character             |
| 34   | another   | DD1       | B         | MetadiscourseCohesive |
| 35   | is        | VBZ       | B         | InformationStates     |
| 36   | coming    | VVG       | O         |                       |
| 37   | without   | IW        | O         |                       |
| 38   | actually  | RR        | B         | ForceStressed         |
| 39   | providing | VVG       | B         | Facilitate            |
| 40   | it        | PPH1      | O         |                       |
| 41   | \.        | Y         | O         |                       |
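
One way to read the `ent_` column is to stitch `B` and `I` tokens back into the multi-word streams that DS tagged. The sketch below does this for rows 6 through 11 of the table above, copied by hand; the helper `group_spans` is a hypothetical illustration and assumes the usual begin/inside/outside reading of `B`, `I`, and `O`.

```python
# Rows 6-11 of the table above, as (text, ent_, ent_type_) tuples.
rows = [
    ("which", "B", "SyntacticComplexity"),
    ("not", "B", "ForceStressed"),
    ("only", "I", "ForceStressed"),
    ("gives", "B", "Interactive"),
    ("you", "I", "Interactive"),
    ("one", "O", ""),
]


def group_spans(rows):
    """Merge each B token with the I tokens that follow it into one span."""
    spans = []
    current = None  # the span being built, or None right after an O token
    for text, iob, category in rows:
        if iob == "B":
            current = {"tokens": [text], "category": category}
            spans.append(current)
        elif iob == "I" and current is not None:
            current["tokens"].append(text)
        else:
            current = None  # O token: close any open span
    return [(" ".join(span["tokens"]), span["category"]) for span in spans]


for phrase, category in group_spans(rows):
    print(f"{category}: {phrase}")
```

Read this way, "not only" is a single `ForceStressed` stream and "gives you" is a single `Interactive` stream, rather than four unrelated tokens.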


#### Reference Table of DS Categorical Tags

| **\*\*Category \(Cluster\)\*\*** | **\*\*Description\*\***                                                                                                                                                                                                                                                       | **\*\*Examples\*\***                                                                                |
|----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
| **Academic Terms**               | Abstract, rare, specialized, or disciplinary\-specific terms that are indicative of informationally dense writing                                                                                                                                                             | \*market price\*, \*storage capacity\*, \*regulatory\*, \*distribution\*                            |
| **Academic Writing Moves**       | Phrases and terms that indicate academic writing moves, which are common in research genres and are derived from the work of Swales \(1981\) and Cotos et al\. \(2015, 2017\)                                                                                                 | \*in the first section\*, \*the problem is that\*, \*payment methodology\*, \*point of contention\* |
| **Character**                    | References multiple dimensions of a character or human being as a social agent, both individual and collective                                                                                                                                                                | \*Pauline\*, \*her\*, \*personnel\*, \*representatives\*                                            |
| **Citation**                     | Language that indicates the attribution of information to, or citation of, another source\.                                                                                                                                                                                   | \*according to\*, \*is proposing that\*, \*quotes from\*                                            |
| **Citation Authorized**          | Referencing the citation of another source that is represented as true and not arguable                                                                                                                                                                                       | \*confirm that\*, \*provide evidence\*, \*common sense\*                                            |
| **Citation Hedged**              | Referencing the citation of another source that is presented as arguable                                                                                                                                                                                                      | \*suggest that\*, \*just one opinion\*                                                              |
| **Confidence Hedged**            | Referencing language that presents a claim as uncertain                                                                                                                                                                                                                       | \*tends to get\*, \*maybe\*, \*it seems that\*                                                      |
| **Confidence High**              | Referencing language that presents a claim with certainty                                                                                                                                                                                                                     | \*most likely\*, \*ensure that\*, \*know that\*, \*obviously\*                                      |
| **Confidence Low**               | Referencing language that presents a claim as extremely unlikely                                                                                                                                                                                                              | \*unlikely\*, \*out of the question\*, \*impossible\*                                               |
| **Contingent**                   | Referencing contingency, typically contingency in the world, rather than contingency in one's knowledge                                                                                                                                                                       | \*subject to\*, \*if possible\*, \*just in case\*, \*hypothetically\*                               |
| **Description**                  | Language that evokes sights, sounds, smells, touches and tastes, as well as scenes and objects                                                                                                                                                                                | \*stay quiet\*, \*gas\-fired\*, \*solar panels\*, \*soft\*, \*on my desk\*                          |
| **Facilitate**                   | Language that enables or directs one through specific tasks and actions                                                                                                                                                                                                       | \*let me\*, \*worth a try\*, \*I would suggest\*                                                    |
| **First Person**                 | This cluster captures first person\.                                                                                                                                                                                                                                          | \*I\*, \*as soon as I\*, \*we have been\*                                                           |
| **Force Stressed**               | Language that is forceful and stressed, often using emphatics, comparative forms, or superlative forms                                                                                                                                                                        | \*really good\*, \*the sooner the better\*, \*necessary\*                                           |
| **Future**                       | Referencing future actions, states, or desires                                                                                                                                                                                                                                | \*will be\*, \*hope to\*, \*expected changes\*                                                      |
| **Information Change**           | Referencing changes of information, particularly changes that are more neutral                                                                                                                                                                                                | \*changes\*, \*revised\*, \*growth\*, \*modification to\*                                           |
| **Information Change Negative**  | Referencing negative change\.                                                                                                                                                                                                                                                 | \*going downhill\*, \*slow erosion\*, \*get worse\*                                                 |
| **Information Change Positive**  | Referencing positive change\.                                                                                                                                                                                                                                                 | \*improving\*, \*accrued interest\*, \*boost morale\*                                               |
| **Information Exposition**       | Information in the form of expository devices, or language that describes or explains, frequently in regards to quantities and comparisons                                                                                                                                    | \*final amount\*, \*several\*, \*three\*, \*compare\*, \*80%\*                                      |
| **Information Place**            | Language designating places\.                                                                                                                                                                                                                                                 | \*the city\*, \*surrounding areas\*, \*Houston\*, \*home\*                                          |
| **Information Report Verbs**     | Informational verbs and verb phrases of reporting\.                                                                                                                                                                                                                           | \*report\*, \*posted\*, \*release\*, \*point out\*                                                  |
| **Information States**           | Referencing information states, or states of being\.                                                                                                                                                                                                                          | \*is\*, \*are\*, \*existing\*, \*been\*                                                             |
| **Information Topics**           | Referencing topics, usually nominal subjects or objects, that indicate the “aboutness” of a text                                                                                                                                                                              | \*time\*, \*money\*, \*stock price\*, \*phone interview\*                                           |
| **Inquiry**                      | Referencing inquiry, or language that points to some kind of inquiry or investigation                                                                                                                                                                                         | \*find out\*, \*let me know if you have any questions\*, \*wondering if\*                           |
| **Interactive**                  | Addresses from the author to the reader or from persons in the text to other persons\. The address comes in the language of everyday conversation, colloquy, exchange, questions, attention\-getters, feedback, interactive genre markers, and the use of the second person\. | \*can you\*, \*thank you for\*, \*please see\*, \*sounds good to me\*                               |
| **Metadiscourse Cohesive**       | The use of words to build cohesive markers that help the reader navigate the text and signal linkages in the text, which are often additive or contrastive                                                                                                                    | \*or\*, \*but\*, \*also\*, \*on the other hand\*, \*notwithstanding\*, \*that being said\*          |
| **Metadiscourse Interactive**    | The use of words to build cohesive markers that interact with the reader                                                                                                                                                                                                      | \*I agree\*, \*let’s talk\*, \*by the way\*                                                         |
| **Narrative**                    | Language that involves people, description, and events extending in time                                                                                                                                                                                                      | \*today\*, \*tomorrow\*, \*during the\*, \*this weekend\*                                           |
| **Negative**                     | Referencing dimensions of negativity, including negative acts, emotions, relations, and values                                                                                                                                                                                | \*does not\*, \*sorry for\*, \*problems\*, \*confusion\*                                            |
| **Positive**                     | Referencing dimensions of positivity, including actions, emotions, relations, and values                                                                                                                                                                                      | \*thanks\*, \*approval\*, \*agreement\*, \*looks good\*                                             |
| **Public Terms**                 | Referencing public terms, concepts from public language, media, the language of authority, institutions, and responsibility                                                                                                                                                   | \*discussion\*, \*amendment\*, \*corporation\*, \*authority\*, \*settlement\*                       |
| **Reasoning**                    | Language that has a reasoning focus, supporting inferences about cause, consequence, generalization, concession, and linear inference either from premise to conclusion or conclusion to premise                                                                              | \*because\*, \*therefore\*, \*analysis\*, \*even if\*, \*as a result\*, \*indicating that\*         |
| **Responsibility**               | Referencing the language of responsibility\.                                                                                                                                                                                                                                  | \*supposed to\*, \*requirements\*, \*obligations\*                                                  |
| **Strategic**                    | This dimension is active when the text structures strategies activism, advantage\-seeking, game\-playing cognition, plans, and goal\-seeking\.                                                                                                                                | \*plan\*, \*trying to\*, \*strategy\*, \*decision\*, \*coordinate\*, \*look at the\*                |
| **Syntactic Complexity**         | The features in this category are often what are called “function words,” like determiners and prepositions\.                                                                                                                                                                 | \*the\*, \*to\*, \*for\*, \*in\*, \*a lot of\*                                                      |
| **Uncertainty**                  | References uncertainty, when confidence levels are unknown\.                                                                                                                                                                                                                  | \*kind of\*, \*I have no idea\*, \*for some reason\*                                                |
| **Updates**                      | References updates that anticipate someone searching for information and receiving it                                                                                                                                                                                         | \*already\*, \*a new\*, \*now that\*, \*here are some\*                                             |

## 4.2 Why is Speech Tagging Useful?

<img src="https://imgs.xkcd.com/comics/language_nerd.png" alt="xkcd comic 1443, 'Language Nerd'">

I don't mean to go all [Language Nerd](https://xkcd.com/1443/) on you, but parts of speech are important. Even if they seem kind of boring. *Parts of speech* are the grammatical units of language, such as (in English) nouns, verbs, adjectives, adverbs, pronouns, and prepositions. Each of these parts of speech plays a different role in a sentence.


By computationally identifying parts of speech, we can start exploring *syntax*, the relationships between words, rather than focusing only on words in isolation, as analyses such as *tf-idf* do.

Though parts of speech may seem pedantic and difficult to apply at first, POS and DS tags (*LANGUAGE AS DIGITAL DATA!*) help us work with our computational media, i.e., computers, toward that ever-elusive abstract noun: *meaning*.

The *xkcd* comic above nicely illustrates how words are not monolithic or static in meaning. Why? Language is a dynamic part of the human experience. Language has a history, one that feeds into what we do with language right now, which in turn shapes how words develop in the future. Consequently, syntactical tagging of word relationships, even within the seemingly simple language unit that we call a sentence, is difficult. Words can and do shift roles and functions, e.g., "legit" as *adjective* and *adverb*, and shift how they impact social relationships (rhetoric!), because language is more than a "bag of words" and their syntactical relationships. We've tangled up language in all aspects of our world.

## 4.3 Speech Tagging Lesson

### 4.3.1 Install necessary libraries

In [None]:
%pip install -U spacy==3.5.0
# Special modified version of tmtoolkit to work with docuscospacy (I logged the issue with DS, so this works with 0.2.3)
%pip install https://github.com/lingeringcode/temptmtoolkit/archive/refs/heads/main.zip
%pip install docuscospacy==0.2.3
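
Because this lesson pins specific versions, it can help to confirm what actually got installed before moving on. The sketch below uses only the standard library's `importlib.metadata`; the `PINNED` dictionary and the helper `check_pins` are hypothetical names for this illustration, with the versions copied from the install cell above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install cell above.
PINNED = {"spacy": "3.5.0", "docuscospacy": "0.2.3"}


def check_pins(pins):
    """Return {package: (wanted, found, matches)}; found is None if not installed."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        report[name] = (wanted, found, found == wanted)
    return report


for name, (wanted, found, ok) in check_pins(PINNED).items():
    status = "OK" if ok else "CHECK"
    print(f"{status}: {name} wanted {wanted}, found {found}")
```

If a line reports `CHECK`, rerunning the install cell and restarting the kernel is a reasonable first step.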

### 4.3.2 Import Libraries + EDA

In [None]:
import spacy
from spacy import displacy
from temptmtoolkit.corpus import Corpus, vocabulary_size, corpus_num_tokens
import re
from docuscospacy.corpus_analysis import convert_corpus, frequency_table, tags_table, ngrams_table, coll_table, tags_dtm, kwic_center_node, keyness_table
import pandas as pd
import json
import statistics

#### Quick EDA to understand the data

In [None]:
##############
## Import Data
##############

# Open/read the sample JSON file and convert it into a Python dict object.
# The `with` block closes the file automatically, even if loading fails.
with open('../data/04-textual-analysis/kaggle-nb-example/dict_viz_and_adjacent_cells.json', 'r') as file_adj_corpus:
  dict_adj_corpus = json.load(file_adj_corpus)

# Check the type of the Python object
print(type(dict_adj_corpus))

# Iterate through the dictionary
# And print the key: value pairs
# for key, value in dict_adj_corpus.items():
#   print(f"\nKey: {key}")
#   print(f"Value: {value}\n")

#### What is the sum of cells adjacent to visual cells per position?

**PROVENANCE**: This data comes from Kaggle.com, where data scientists share data and notebooks that analyze it. This set specifically collected notebooks that employ machine-learning techniques with the goal to understand how people used visuals throughout the process. Consequently, the data are organized by two factors:

1. whether or not a notebook cell is a "code" cell or a "Markdown" cell, and 
2. whether the cell is positioned before or after a cell that produces a visualization.

Review the summary stats work below to get to know the data some more before we use other more specific `.txt` files that I created from this particular data set.
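
It may help to picture one record before reading the functions that follow. The sketch below is inferred only from the keys those functions read; the key name `vcell_0`, every value, and the helper `adjacent_cell_type` are invented for illustration, and the real file may contain additional fields.

```python
# ASSUMPTION: a toy record shaped like the entries the lesson's functions read.
toy_corpus = {
    "vcell_0": {
        "source": "plt.plot(x, y)",       # code of the visualization cell
        "cell_position_in_nb": 12,        # where the cell sits in its notebook
        "total_cells_in_nb": 40,          # how many cells the notebook has
        "adjacent_before": {"cell_type": "markdown", "source": "## Plot the results"},
        "adjacent_after": "None",         # the string 'None', not Python's None
    },
}


def adjacent_cell_type(record, side):
    """Return 'markdown' or 'code' for the neighbor on `side`, or None if absent."""
    neighbor = record[f"adjacent_{side}"]
    if neighbor == "None":  # a missing neighbor is stored as the string 'None'
        return None
    return neighbor["cell_type"]


print(adjacent_cell_type(toy_corpus["vcell_0"], "before"))
print(adjacent_cell_type(toy_corpus["vcell_0"], "after"))
```

Note the comparison against the string `'None'`: the lesson's code checks `!= 'None'`, which suggests missing neighbors are stored as text rather than as a JSON `null`.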

In [None]:
####################################################################
# 1. What is the sum of cells adjacent to visual cells per position?
####################################################################
def sum_adj_cells(c):
  md_before_count = 0
  md_after_count = 0
  code_before_count = 0
  code_after_count = 0
  for vcell_key in c:
    if c[vcell_key]['adjacent_before'] != 'None':
      if c[vcell_key]['adjacent_before']['cell_type'] == 'markdown':
        md_before_count = md_before_count + 1
      elif c[vcell_key]['adjacent_before']['cell_type'] == 'code':
        code_before_count = code_before_count + 1
    # Check the cell after independently, so it is tallied even when a cell before exists
    if c[vcell_key]['adjacent_after'] != 'None':
      if c[vcell_key]['adjacent_after']['cell_type'] == 'markdown':
        md_after_count = md_after_count + 1
      elif c[vcell_key]['adjacent_after']['cell_type'] == 'code':
        code_after_count = code_after_count + 1

  print(
    'Q1: What is the sum of cells adjacent to visual cells per position?'
    '\n- md_before_count:', md_before_count,
    '\n- md_after_count:', md_after_count,
    '\n- code_before_count:', code_before_count,
    '\n- code_after_count:', code_after_count,
    '\n- TOTAL:', md_before_count+md_after_count+code_before_count+code_after_count
  )

sum_adj_cells(dict_adj_corpus)

#### What positions are visual cells in the notebook?

In [None]:
##################################################
# 2. What positions are visual cells in the notebook?
##################################################
def describe_viz_cell_positions(c):
  # Overall positioning
  cp_occurrences = 0
  ni_cp_occurrences = 0
  overall_cell_position_sum = 0
  no_import_cell_position_sum = 0
  sum_cells_sum = 0
  overall_position_list = []
  no_import_position_list = []
  sum_of_cells_list = []
  for vcell_key in c:
    # Tally occurrences
    cp_occurrences = cp_occurrences+1
    sum_cells_sum = sum_cells_sum+c[vcell_key]['total_cells_in_nb']

    # List total cells in notebooks
    sum_of_cells_list.append(c[vcell_key]['total_cells_in_nb'])

    overall_cell_position_sum = overall_cell_position_sum + c[vcell_key]['cell_position_in_nb']
    overall_position_list.append(c[vcell_key]['cell_position_in_nb'])

    # Filter out imports
    re_import = r'import\s'
    import_match = re.match(re_import, c[vcell_key]['source'])
    # print(import_match)
    if import_match is None:
      ni_cp_occurrences = ni_cp_occurrences+1
      no_import_cell_position_sum = no_import_cell_position_sum + c[vcell_key]['cell_position_in_nb']
      no_import_position_list.append(c[vcell_key]['cell_position_in_nb'])

  # MEDIAN
  # Sort the lists in ascending order
  overall_position_list = sorted(overall_position_list)
  no_import_position_list = sorted(no_import_position_list)
  sum_of_cells_list = sorted(sum_of_cells_list)

  overall_position_median = statistics.median(overall_position_list)
  no_imports_position_median = statistics.median(no_import_position_list)
  sum_cells_median = statistics.median(sum_of_cells_list)

  # MODE
  # Round numbers in list to whole number
  whole_num_overall_position_list = []
  for op in overall_position_list:
    whole_num_overall_position_list.append(round(op))
  
  whole_num_ni_position_list = []
  for nip in no_import_position_list:
    whole_num_ni_position_list.append(round(nip))
  
  overall_position_mode = statistics.mode(whole_num_overall_position_list)
  no_imports_position_mode = statistics.mode(whole_num_ni_position_list)
  sum_cells_mode = statistics.mode(sum_of_cells_list)

  print(
    '\n\nQ2: What positions are visual cells in the notebook?',
    '\n- OVERALL Avg Mean Cell Position:', overall_cell_position_sum / cp_occurrences,
    '\n- OVERALL Median Cell Position:', overall_position_median,
    '\n- OVERALL Mode Cell Position:', overall_position_mode,
    '\n- NO IMPORTS Avg Mean Cell Position:', no_import_cell_position_sum / ni_cp_occurrences,
    '\n- NO IMPORTS Median Cell Position:', no_imports_position_median,
    '\n- NO IMPORTS Mode Cell Position:', no_imports_position_mode,
    '\n- TOTAL CELLS Avg Mean:', sum_cells_sum / cp_occurrences,
    '\n- TOTAL CELLS Median:', sum_cells_median,
    '\n- TOTAL CELLS Mode:', sum_cells_mode,
    '\n- OVERALL Sum Total Cells:', cp_occurrences,
    '\n- NO IMPORTS Sum Total Cells:', ni_cp_occurrences,
    '\n- IMPORTS ONLY Sum Total Cells:', (cp_occurrences - ni_cp_occurrences),
    '\n'
  )

describe_viz_cell_positions(dict_adj_corpus)
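
As a quick sanity check on the three summary statistics reported above, here is a minimal sketch that computes them on a handful of made-up cell positions. The values in `positions` are hypothetical and are not taken from the corpus.

```python
import statistics

# Hypothetical cell positions for five visual cells (NOT real corpus data)
positions = [2.0, 4.0, 4.0, 7.0, 13.0]

# Mean: sum of the values divided by how many there are
mean_position = sum(positions) / len(positions)

# Median: the middle value once the list is sorted
median_position = statistics.median(positions)

# Mode: the most common value, after rounding to whole numbers
mode_position = statistics.mode([round(p) for p in positions])

print(mean_position, median_position, mode_position)
```

Note how the single large value (13.0) pulls the mean above the median, which is one reason to report all three.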

#### Other EDA?

Feel free to use the pandas dataframes below for any further EDA, if you'd like. The main goal of this lesson, though, is to learn more about POS and categorical tagging.

In [None]:
# Isolate NON-IMPORT-ADJACENT Markdown Cells
def get_non_import_adj_md_cells(c):
  list__all_cell_content = []
  list__ni_md_before_cell_content = []
  list__ni_md_after_cell_content = []
  for vcell_key in c:
    # Filter out imports
    re_import = r'import\s'
    import_match = re.match(re_import, c[vcell_key]['source'])
    if import_match is None:
      # Before Vcell
      if c[vcell_key]['adjacent_before'] != 'None' and (c[vcell_key]['adjacent_before']['cell_type'] == 'markdown'):
        list__ni_md_before_cell_content.append(c[vcell_key]['adjacent_before']['source'])
        list__all_cell_content.append(c[vcell_key]['adjacent_before']['source'])
      # After Vcell
      elif c[vcell_key]['adjacent_after'] != 'None' and (c[vcell_key]['adjacent_after']['cell_type'] == 'markdown'):
        list__ni_md_after_cell_content.append(c[vcell_key]['adjacent_after']['source'])
        list__all_cell_content.append(c[vcell_key]['adjacent_after']['source'])

  dict__ni_md_cell_contents = [
    {'all_vcells': list__all_cell_content},
    {'before_vcells': list__ni_md_before_cell_content},
    {'after_vcells': list__ni_md_after_cell_content}
  ]

  return dict__ni_md_cell_contents

In [None]:
# Remove "Import" cells from samples
dict__ni_adj_corpus = get_non_import_adj_md_cells(dict_adj_corpus)

# Convert dictionaries to dataframes
df__ALL_ni_adj_corpus = pd.DataFrame(dict__ni_adj_corpus[0]['all_vcells']).rename(columns={0:'all_vcells'})
df__BEFORE_ni_adj_corpus = pd.DataFrame(dict__ni_adj_corpus[1]['before_vcells']).rename(columns={0:'before_vcells'})
df__AFTER_ni_adj_corpus = pd.DataFrame(dict__ni_adj_corpus[2]['after_vcells']).rename(columns={0:'after_vcells'})

In [None]:
# Optional EDA code

### 4.3.3 Download LLM

OK, now that you know the data a little better than before, we can start our analysis of POS and categorical tags.

First, we need to download DocuScope's LLM: `en_docusco_spacy`. Their (English-only) LLM will process and make predictions about our texts.

- **NOTE**: Another commonly used LLM is spaCy's `en_core_web_sm`. This LLM was trained on the annotated "OntoNotes" corpus.

To download and install the `en_docusco_spacy` model,

1. Execute the cell below, which runs `%pip install` on the `en_docusco_spacy` wheel file hosted on Hugging Face.
2. Execute the subsequent cell, which uses `spacy` to load the `en_docusco_spacy` model and assign it to the variable `nlp`.

Now, we can use DS's LLM!

In [None]:
%pip install https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-any-py3-none-any.whl

In [None]:
nlp = spacy.load('en_docusco_spacy')

### 4.3.4 Conducting a corpus analysis with docuscospacy

In this part of the lesson, we will follow the `docuscospacy` package's [documentation tutorial](https://docuscospacy.readthedocs.io/en/latest/corpus_analysis.html). 

Using this LLM, we will conduct some of the following actions to conduct a corpus analysis:

- Token frequencies
- Ngrams
- Word collocations around a node word
- Keyword comparisons against a reference corpus

What's great about `docuscospacy` (DS) is how it provides methods by which to review the results from the analysis, which is another form of EDA!

For example, DS can render outputs, such as tables with either POS or DS tags. This will help you differentiate potential moments when a word like *can* functions as a *noun* vs. *can* as a *verb*. It may prove interesting! Who knows until we see what language patterns may be found!

In another example, we'll review how `docuscospacy` conducts and aggregates multi-token sequenced tags, e.g., where *in spite of* is tagged as a token sequence, it is combined into a single token.

Overall, our goal is to get to know the data and get creative with how we analyze it.

### 4.3.5 Load Corpus Data and Pre-Processing the Corpus

Accurate tagging requires some processing/cleaning of the data. In the function below, we

1. Make all of the text uniformly lowercase, because computers treat a lowercase `k` and an uppercase `K` as two different characters.
2. Remove any potential HTML tags, common markdown emphasis markers, and URLs in this particular corpus.
3. Split the possessive `its` into two tokens (`it s`).
4. Remove carriage returns, tabs, extra spaces, etc.

In [None]:
def pre_process(txt):
    # 1. normalize to all lowercase
    txt = txt.lower()
    # 2. remove HTML tags, common markdown syntax, newline chars, URLs
    txt = re.sub(r"</?.*?>", '', txt)
    txt = re.sub(r"\*{1,3}", '', txt)
    txt = re.sub(r"\\n", ' ', txt)  # literal backslash-n sequences left in the text
    txt = re.sub(r"http\S+", '', txt)
    # 3. split the possessive "its" into two tokens
    # (the text is already lowercase, so no separate rule for "Its" is needed)
    txt = re.sub(r'\bits\b', 'it s', txt)
    # 4. collapse runs of whitespace (spaces, tabs, newlines) into single spaces
    txt = " ".join(txt.split())

    return txt
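
To see what these cleaning steps do to a piece of text, here is a simplified, self-contained sketch. The function `clean_text` is a hypothetical stand-in that mirrors most of the steps in `pre_process` (it leaves out the literal `\n` rule), and the sample string is made up for illustration.

```python
import re

def clean_text(txt):
    """Hypothetical, simplified stand-in for the pre_process function."""
    txt = txt.lower()                        # 1. lowercase everything
    txt = re.sub(r"</?.*?>", "", txt)        # 2. strip HTML tags
    txt = re.sub(r"\*{1,3}", "", txt)        #    strip markdown emphasis markers
    txt = re.sub(r"http\S+", "", txt)        #    strip URLs
    txt = re.sub(r"\bits\b", "it s", txt)    # 3. split the possessive "its"
    return " ".join(txt.split())             # 4. collapse whitespace

# Made-up sample that mixes HTML, markdown, extra spaces, a newline, and a URL
sample = "<b>Pandas</b> and **its**   API:\n see https://pandas.pydata.org"
cleaned = clean_text(sample)
print(cleaned)  # pandas and it s api: see
```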

In [None]:
# You may need to change this variable, if you end up using different data sets
nlp.max_length = 1300000 #1256620
# 1300000 for Kaggle's by-position
print(nlp.max_length)

In [None]:
%%time
corp = Corpus.from_folder(
    # Note how this uses a folder reference, not just a file.
    # It parses txt files divvied up into differently sorted parts of a corpus/es

    # CHOOSE YOUR CORPUS! :-) The lesson starts by using the Kaggle 'by-position' data.
    # Use this data first. Then, for your application, choose another set if you'd like

    '../data/04-textual-analysis/kaggle-nb-example/by-position', 
    # '../data/04-textual-analysis/kaggle-nb-example/by-votes',
    
    # NOTE: For the sake of file sizes and in-memory work, I randomly sampled the Jeopardy questions (10000 per document set)
    # '../data/04-textual-analysis/jeopardy/split', 
    spacy_instance=nlp, # Our loaded LLM
    raw_preproc=[pre_process], # Cleaning function
    spacy_token_attrs=['tag', 'ent_iob', 'ent_type', 'is_punct'] #Run these tagging functions
)

### 4.3.6 Basic Summary Stats About Corpus

It is always a good idea to calculate and review some aggregated information about your corpus. These stats can help you identify potential oddities before you even begin the other work. And, some of the values will be useful later.

In [None]:
corpus_total = corpus_num_tokens(corp)
corpus_types = vocabulary_size(corp)
total_punct = []
for i in range(0,len(corp)):
    total_punct.append(sum(corp[i]['is_punct']))
total_punct = sum(total_punct)
non_punct = corpus_total - total_punct

print(
    'Alphanumeric tokens:', non_punct, 
    '\nPunctuation tokens:', total_punct, 
    '\nTotal tokens:', corpus_total, 
    '\nToken types:', corpus_types
)
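
If the difference between tokens and types is new to you, the following minimal sketch shows the same four numbers on a tiny, made-up token list. It uses plain Python rather than the corpus functions above, and whether the library counts punctuation as a type in exactly this way is an assumption.

```python
import string

# Made-up, already-tokenized text (NOT from the corpus)
tokens = ["the", "plot", "shows", "the", "data", ".", "the", "end", "."]

# Total tokens: every item counts, including repeats and punctuation
total_tokens = len(tokens)

# Punctuation tokens: items made up entirely of punctuation characters
punct_tokens = sum(
    1 for tok in tokens if all(ch in string.punctuation for ch in tok)
)
non_punct_tokens = total_tokens - punct_tokens

# Token types: the number of distinct tokens (the vocabulary)
token_types = len(set(tokens))

print(non_punct_tokens, punct_tokens, total_tokens, token_types)
```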

### 4.3.7 Converting the corpus to a DS class object

Before we generate any tables, we first need to convert the corpus into a convenient object that we can manipulate. From `docuscospacy.corpus_analysis` we will import a number of functions including `convert_corpus`, which takes the object produced by the `Corpus.from_folder` function and converts it into a dictionary, whose keys are the names of the corpus files.

In [None]:
from docuscospacy.corpus_analysis import convert_corpus, frequency_table, tags_table, ngrams_table, coll_table, tags_dtm, kwic_center_node, keyness_table

In [None]:
tp = convert_corpus(corp)

In [None]:
# Listify the keys to review them for accuracy: Did it get all of the desired files?
list(tp.keys())

In [None]:
# Listify one file's tagged tokens: change the index [1] to pick another file and the slice [10:] to pick other tokens
list(tp.values())[1][10:]

### 4.3.8 Generate Frequency Tables

We can use DS's `frequency_table` function to easily review counts. Here's the description from the API:

<blockquote>
<h4><a href="https://docuscospacy.readthedocs.io/en/latest/api.html#corpus-analysis-frequency-table-tok-n-tokens-count-by-pos">corpus_analysis.frequency_table(tok, n_tokens, count_by=’pos’)</a></h4>
<p>Generate a count of token frequencies.</p>
<div><dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>tok</strong> - A dictionary of tuples as generated by the <cite>convert_corpus</cite> function</p></li>
<li><p><strong>n_tokens</strong> - A count of total tokens against which to normalize</p></li>
<li><p><strong>count_by</strong> - One of ‘pos’ or ‘ds’ for aggregating tokens</p></li>
</ul>
</dd>
<dt class="field-even">Return</dt>
<dd class="field-even"><ul class="simple">
<li><p>a dataframe of absolute frequencies, normalized frequencies (per million tokens) and ranges</p></li>
</ul>
</dd>
</dl>
</div>
</blockquote>

`frequency_table` also [normalizes](https://en.wikipedia.org/wiki/Normalization_(statistics)) the counts, so that they can be compared across corpora of different sizes. In the returned dataframe, you can compare and contrast absolute frequencies (col `AF`), normalized relative frequencies per million tokens (col `RF`), and ranges (col `Range`). Because of the normalization, every token shares a common scale.
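
To illustrate why normalization matters, here is a minimal sketch of a per-million relative frequency. The helper `relative_frequency` and the counts are hypothetical; the sketch only assumes that `RF` is the absolute frequency scaled to a million tokens, as the API description states.

```python
def relative_frequency(absolute_freq, n_tokens, per=1_000_000):
    """Scale a raw count to a rate per `per` tokens (hypothetical helper)."""
    return absolute_freq * per / n_tokens

# Made-up counts: a word occurs 50 times in a small corpus
# and 120 times in a corpus four times as large
rf_small = relative_frequency(50, 20_000)
rf_large = relative_frequency(120, 80_000)

print(rf_small, rf_large)
```

Although the raw count is higher in the larger corpus, the word is relatively more frequent in the smaller one, which is the comparison that `RF` makes possible.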

#### 4.3.8.1 POS tables

In our first frequency table, we will normalize the counts with `non_punct` (the total number of tokens that are not punctuation), because POS token counts omit tokens tagged as punctuation. Later we will count ALLTHETHINGS, since DS includes punctuation.

In [None]:
freq_table_POS = frequency_table(
    tok=tp, # converted corpus into dict objs
    count_by='pos', # Count by POS tags
    n_tokens=non_punct # norm the corpus by excluding punctuation
)

In [None]:
freq_table_POS.head(10).style.hide(axis='index').format(precision=2)

Since it's a dataframe, we can filter and sort it easily with the `query()` method. Yay!

Below, we filter for an AF greater than 10 and for tokens tagged as verbs (tags starting with ‘V’).

Refer to the [CLAWS7 POS table](https://ucrel.lancs.ac.uk/claws7tags.html) for desired speech tags to filter against.

In [None]:
freq_table_POS.query('AF > 10 and Tag.str.startswith("V")').head(10).style.hide(axis='index').format(precision=2)

In [None]:
# Filter for adverbs, yo! How are we modifying those verbs?
freq_table_POS.query('Tag.str.startswith("R")').head(20).style.hide(axis='index').format(precision=2)

#### 4.3.8.2 DS category tables

Note how we normalize by `corpus_total`, as DocuScope includes punctuation in its tagging system.

In [None]:
freq_table_DS = frequency_table(
    tok=tp, 
    n_tokens=corpus_total, 
    count_by='ds'
)

In [None]:
freq_table_DS.head(10).style.hide(axis='index').format(precision=2)

Let's filter by DS tags!

Review the DS tags in their [user docs](https://docuscospacy.readthedocs.io/en/latest/docuscope.html#categories).

In [None]:
# By INQUIRY
freq_table_DS.query('Tag.str.startswith("Inquiry")').head(10).style.hide(axis='index').format(precision=2)

In [None]:
# By INFORMATION TOPICS
freq_table_DS.query('Tag.str.startswith("Information Topics")').head(20).style.hide(axis='index').format(precision=2)

In [None]:
# By INFORMATION REPORT VERBS
freq_table_DS.query('Tag.str.startswith("Information Report Verbs")').head(20).style.hide(axis='index').format(precision=2)

In [None]:
# By INFORMATION CHANGE
freq_table_DS.query('Tag.str.startswith("Information Change")').head(10).style.hide(axis='index').format(precision=2)

In [None]:
# By REASONING
freq_table_DS.query('Tag.str.startswith("Reasoning")').head(10).style.hide(axis='index').format(precision=2)

In [None]:
# By UNCERTAINTY
freq_table_DS.query('Tag.str.startswith("Uncertainty")').head(10).style.hide(axis='index').format(precision=2)

#### 4.3.8.3 Tags Frequency Tables

We can focus on the tag counts by using the `tags_table` function. It works just like the `frequency_table` function, taking a dictionary created by the `convert_corpus` function, an integer against which to normalize, and a `count_by` argument of either `'pos'` or `'ds'`.

In [None]:
tags_freq_table_POS = tags_table(
    tok=tp,
    n_tokens=non_punct,
    count_by='pos'
)

In [None]:
tags_freq_table_POS.head(10).style.hide(axis='index').format(precision=2)

And by DocuScope category:

In [None]:
tags_freq_table_DS = tags_table(
    tok=tp,
    n_tokens=corpus_total,
    count_by='ds'
)

In [None]:
tags_freq_table_DS.sort_values('RF', ascending=False).head(10).style.hide(axis='index').format(precision=2)

#### 4.3.8.4 N-gram Frequency Tables

Ngrams (between bigrams and 5-grams) can be calculated using the `ngrams_table` function. It works much like the `frequency_table` function but with the addition of a span argument `ng_span` consisting of an integer between 2 and 5.

This will return a table of 3-grams:

In [None]:
ngram_freq_table_POS = ngrams_table(
    tok=tp,
    ng_span=3,
    n_tokens=non_punct,
    count_by='pos'
)

The returned data frame includes both the sequence of tokens, as well as the sequence of tags:

In [None]:
ngram_freq_table_POS.head(10).style.hide(axis='index').format(precision=2)
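
To make the idea of an n-gram concrete, here is a minimal sketch that counts trigrams in a made-up sentence with plain Python. The helper `ngrams` is hypothetical and is not how `ngrams_table` is implemented; it only illustrates the sliding-window idea.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens (hypothetical helper)."""
    # zip over n staggered copies of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

# Made-up sentence (NOT from the corpus)
tokens = "we can see that we can see the data".split()

trigram_counts = Counter(ngrams(tokens, 3))
print(trigram_counts.most_common(1))
```

A list of 9 tokens yields 7 trigrams, and the only repeated one here is `('we', 'can', 'see')`.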

Now we can filter ngrams that, for example, start with a verb:

In [None]:
ngram_freq_table_POS.query('Tag1.str.startswith("V")').head(20).style.hide(axis='index').format(precision=2)

Or sequences that end with a *past participle* (‘VVN’) preceded by a *to be verb* (‘VB’), thus showing passive constructions:

In [None]:
ngram_freq_table_POS.query('Tag3.str.startswith("VVN") and Tag2.str.startswith("VB")').head(15).style.hide(axis='index').format(precision=2)

Similar ngram tables can be created for DocuScope sequences. Here we generate trigrams again:

In [None]:
ngram_freq_table_DS = ngrams_table(
    tok=tp,
    ng_span=3,
    n_tokens=corpus_total,
    count_by='ds'
)

In [None]:
ngram_freq_table_DS.head(10).style.hide(axis='index').format(precision=2)

Let's find trigrams that begin with a token tagged as ‘Positive’, while filtering out sequences whose second tag is Untagged (‘O’) or ‘Syntactic Complexity’, or whose third tag is Untagged:

In [None]:
ngram_freq_table_DS.query('Tag1.str.startswith("Positive") and (~Tag2.str.startswith("Syntactic") and ~Tag2.str.startswith("Untagged") and ~Tag3.str.startswith("Untagged"))').head(10).style.hide(axis='index').format(precision=2)

In [None]:
ngram_freq_table_DS.query('Tag1.str.startswith("Negative") and (~Tag2.str.startswith("Syntactic") and ~Tag2.str.startswith("Untagged") and ~Tag3.str.startswith("Untagged"))').head(10).style.hide(axis='index').format(precision=2)

#### 4.3.8.5 Collocation Frequency Tables

*Collocations* within a left-and-right span of a node word can be calculated according to several association measures.

Default span is 4 tokens to the left and 4 tokens to the right of the node word.

Like `frequency_table`, `coll_table` requires a dictionary of the type generated by the `convert_corpus` function. It also requires 

- a node word, 
- a node tag, and 
- an association measure statistic

<blockquote>
<h4><a href="https://docuscospacy.readthedocs.io/en/latest/api.html#corpus-analysis-coll-table-tok-node-word-l-span-4-r-span-4-statistic-pmi-count-by-pos-node-tag-none-tag-ignore-false">corpus_analysis.coll_table(tok, node_word, l_span=4, r_span=4, statistic=’pmi’, count_by=’pos’, node_tag=None, tag_ignore=False)</a></h4>
<div><dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>tok</strong> - A dictionary of tuples as generated by the <cite>convert_corpus</cite> function</p></li>
<li><p><strong>node_word</strong> - The token around with collocations are measured</p></li>
<li><p><strong>l_span</strong> - An integer between 0 and 9 representing the span to the left of the node word</p></li>
<li><p><strong>r_span</strong> - An integer between 0 and 9 representing the span to the right of the node word</p></li>
<li><p><strong>statistic</strong> - The association measure to be calculated. One of: ‘pmi’, ‘npmi’, ‘pmi2’, ‘pmi3’</p></li>
<li><p><strong>count_by</strong> - One of ‘pos’ or ‘ds’ for aggregating tokens</p></li>
<li><p><strong>node_tag</strong> - A value specifying the tag of the node word. If the node_word were ‘can’, a node_tag ‘V’ would search for can as a verb.</p></li>
<li><p><strong>tag_ignore</strong> - A boolean value indicating whether or not tags should be ignored during analysis.</p></li>
</ul>
</dd>
<dt class="field-even">Return</dt>
<dd class="field-even"><ul class="simple">
<li><p>a dataframe containing collocate tokens, tags, the absolute frequency the collocate in the corpus, the absolute frequency of the collocate within the designated span, and the association measure.</p></li>
</ul>
</dd>
</dl>
</div>
</blockquote>
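
To give a feel for what an association measure like PMI captures, here is a minimal sketch on a made-up token list. It uses the textbook definition, log2 of the observed co-occurrence count times the corpus size, divided by the product of the two words' total counts. The function `pmi_table` is hypothetical, and `coll_table` may compute its statistic differently (for example, by adjusting for the span size), so treat this as an illustration only.

```python
import math
from collections import Counter

def pmi_table(tokens, node_word, l_span=4, r_span=4):
    """Textbook PMI for words near node_word (hypothetical helper)."""
    n_tokens = len(tokens)
    totals = Counter(tokens)   # how often each word occurs overall
    co_occur = Counter()       # how often each word occurs near the node
    for i, tok in enumerate(tokens):
        if tok != node_word:
            continue
        left = tokens[max(0, i - l_span):i]
        right = tokens[i + 1:i + 1 + r_span]
        co_occur.update(left + right)
    return {
        word: math.log2((freq * n_tokens) / (totals[node_word] * totals[word]))
        for word, freq in co_occur.items()
    }

# Made-up sentence (NOT from the corpus)
tokens = "we can plot the data and we can clean the data".split()
scores = pmi_table(tokens, node_word="data", l_span=1, r_span=1)
print(scores)
```

A higher score means the word appears near the node word more often than its overall frequency would lead us to expect.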

##### POS tags

In [None]:
collocations_freq_table_POS = coll_table(
    tok=tp,
    node_word='can',
    node_tag='V',
    statistic='pmi',
    count_by='pos'
)
collocations_freq_table_POS.head(10).style.hide(axis='index').format(precision=2)

In [None]:
collocations_freq_table_POS.query('`Freq Total` > 5 and MI > 3 and Tag.str.startswith("V")').head(10).style.hide(axis='index').format(precision=2)

In [None]:
collocations_freq_table_POS_npmi = coll_table(
    tok=tp,
    node_word='data',
    node_tag='N',
    statistic='npmi',
    count_by='pos',
    # tag_ignore=True
)
collocations_freq_table_POS_npmi.head(10).style.hide(axis='index').format(precision=2)

##### DS tags

In [None]:
# DS Time!
collocations_freq_table_DS_npmi = coll_table(
    tok=tp,
    node_word='data',
    node_tag='Academic Terms',
    statistic='npmi', 
    count_by='ds'
)
collocations_freq_table_DS_npmi.head(10).style.hide(axis='index').format(precision=2)

We can also calculate collocations while ignoring tags completely by setting `tag_ignore` to `True`.

In [None]:
collocations_freq_table_DS_npmi_ignore_tag = coll_table(
    tok=tp,
    # Dial up the span of words to search for
    l_span=6,
    r_span=6,
    node_word='justice',
    tag_ignore=True, 
    statistic='npmi'
)
# Note the sorting method used too
collocations_freq_table_DS_npmi_ignore_tag.sort_values(by=['Freq Total'], ascending=False).head(10).style.hide(axis='index').format(precision=2)

## 4.4 Document-term matrices for tags

Document-term matrices (DTMs) are basic data structures for text analysis. Each row is a document (observation) and each column is a token (variable). These can be produced by tmtoolkit using the `dtm` function.

The `docuscospacy` package allows for the creation of DTMs with (*raw*) tag counts (rather than token counts) as variables.

These are produced by the `tags_dtm` function, which takes a dictionary created by the `convert_corpus` function and a `count_by` argument of either `'pos'` or `'ds'`.
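
As a rough picture of what a tag-based DTM holds, here is a minimal sketch that builds one by hand from two tiny, made-up documents. The `(token, tag)` layout of `tagged_docs` is an assumption for illustration and may not match the exact structure that `convert_corpus` returns.

```python
from collections import Counter

# Made-up tagged documents: file name -> list of (token, tag) pairs
tagged_docs = {
    "doc_a.txt": [("the", "AT"), ("data", "NN"), ("shows", "VVZ")],
    "doc_b.txt": [("plots", "NN2"), ("show", "VV0"), ("the", "AT"), ("the", "AT")],
}

# Count the tags (not the tokens) in each document
tag_counts = {
    doc_id: Counter(tag for _, tag in pairs)
    for doc_id, pairs in tagged_docs.items()
}

# Columns: every tag seen anywhere in the corpus, in a fixed order
columns = sorted({tag for counts in tag_counts.values() for tag in counts})

# Rows: one list of raw counts per document (0 when a tag is absent)
dtm_rows = {
    doc_id: [counts[tag] for tag in columns]
    for doc_id, counts in tag_counts.items()
}

print(columns)
print(dtm_rows)
```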

### 4.4.1 POS DTM

In [None]:
dtm_POS = tags_dtm(
    tok=tp,
    count_by='pos'
)

In [None]:
dtm_POS.head(10).style.hide(axis='index').format(precision=0)

### 4.4.2 DS DTM

In [None]:
dtm_DS = tags_dtm(
    tok=tp,
    count_by='ds'
)
dtm_DS.head(10).style.hide(axis='index').format(precision=0)

### 4.4.3 Count frequencies, then give weight (importance) to document-term counts

We can use the tmtoolkit library to create weighted counts (using its `tf_proportions` function), TF-IDF values (using its `tfidf` function), or other types of data structures.

In [None]:
from temptmtoolkit.bow.bow_stats import tf_proportions, tfidf

#### 4.4.3.1 Count the document term frequencies (raw counts)

In [None]:
dtm_POS.set_index('doc_id', inplace=True)
dtm_POS.head(10).style.format(precision=0)

In [None]:
dtm_DS.set_index('doc_id', inplace=True)
dtm_DS.head(10).style.format(precision=0)

#### 4.4.3.2 Weigh the document-terms by proportions

Simply divide each tag's count in a document by the total count of all tags in that document, e.g., AT (article) below comes out to roughly 6% of the tags across the documents.

In [None]:
tf_proportions(dtm_POS).head(10)

In [None]:
tf_proportions(dtm_DS).head(10)
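
The proportion weighting can be checked by hand. Below is a minimal sketch for one made-up document row; the helper `row_proportions` and the counts are hypothetical, and the sketch assumes that each count is divided by the document's total.

```python
def row_proportions(row):
    """Divide each tag count by the row's total (hypothetical helper)."""
    total = sum(row.values())
    return {tag: count / total for tag, count in row.items()}

# Made-up tag counts for a single document (NOT from the corpus)
doc_counts = {"AT": 6, "NN1": 30, "VV0": 14, "OTHER": 50}

proportions = row_proportions(doc_counts)
print(proportions["AT"])  # 6 of 100 tags
```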

#### 4.4.3.3 Weigh document terms by Term Frequencies in relation to Inverse Document Frequency (TF-IDF)

What's TF-IDF? It solves the raw-count problem, where all terms are treated equally. For instance, we would expect a corpus about pets to include high frequencies of terms like the nouns of pets: *cat*, *dog*, etc. Those raw counts aren't necessarily helpful in many cases, so one way to weigh the importance of terms to derive more targeted meaning from the corpus is by calculating either TF, IDF (Inverse Document Frequency), or TF-IDF. 

- TF: Relative frequency of a term within a document, calculated by dividing the number of times the term occurs in the document by the total number of words in the document.
- IDF: Calculation that diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely. Hence, the IDF of a rare term is high, whereas the IDF of a frequent term is likely to be low.

TF-IDF is the composite measure of TF and IDF. The TF-IDF weighting scheme considers each document in a corpus as a vector, so it assigns to term $t$ a weight in document $d$. The end result? The weight is:

1. highest when $t$ occurs many times within a small number of documents (thus lending high discriminating power to those documents);
2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
3. lowest when the term occurs in virtually all documents.
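
The three cases above can be seen in a minimal sketch that uses the unsmoothed textbook formula: TF (count divided by document length) multiplied by IDF (the natural log of the number of documents divided by the number of documents containing the term). The function `tfidf_weights` and the counts are hypothetical, and library implementations often use a smoothed variant of IDF, so the exact numbers may differ.

```python
import math

def tfidf_weights(doc_counts):
    """Unsmoothed textbook TF-IDF per document (hypothetical helper)."""
    n_docs = len(doc_counts)
    # Document frequency: in how many documents does each term occur?
    doc_freq = {}
    for counts in doc_counts.values():
        for term in counts:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    weights = {}
    for doc_id, counts in doc_counts.items():
        doc_len = sum(counts.values())
        weights[doc_id] = {
            term: (count / doc_len) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights

# Made-up tag counts for three documents (NOT from the corpus)
doc_counts = {
    "doc1": {"AT": 6, "NN": 3, "VVZ": 1},
    "doc2": {"AT": 5, "NN": 5},
    "doc3": {"AT": 4, "UH": 4},
}

weights = tfidf_weights(doc_counts)
print(weights["doc1"])
```

Here `AT` occurs in every document, so its weight is zero everywhere, while `VVZ` occurs in only one document and keeps a positive weight there.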

<p style="background-color:#FFCCCB;color:black;margin:1rem;padding:1rem;font-size:0.9rem"><strong>NOTE</strong>: IDF should be based on a decently large corpus that is representative of the texts from which you would be extracting keywords. Several articles on the web compute the IDF using only a handful of documents, so just be aware of this requirement.<br><br>To understand why IDF should be based on a fairly large collection, please read this <a href="./readings/IDF-explainer.pdf" target="_blank" rel="noopener noreferrer">page from Stanford's IR book</a> (<a href="https://nlp.stanford.edu/IR-book/html/htmledition/inverse-document-frequency-1.html" target="_blank" rel="noopener noreferrer">original src</a>).</p>

**Factoid**: Approximately 83% of text-based recommender systems in digital libraries use TF-IDF. (See Breitinger, Corinna; Gipp, Bela; Langer, Stefan (2015). Research-paper recommender systems: A literature survey. *International Journal on Digital Libraries*, 17(4), 305–338. doi:10.1007/s00799-015-0156-0.)

In [None]:
tfidf(dtm_POS).head(10)

In [None]:
tfidf(dtm_DS).head(10)

## 4.5 Application: Your turn to explore the data

Go back through the sections above. Explore the words and tags of your own interest. Consider importing a different dataset from the following folder: `../data/04-textual-analysis`.

Recall how this notebook uses data that I collected for a study on Kaggle notebooks. This NB uses data that I processed by sampling markdown cells by the position in which they occur in the NB: the first third, second third, and last third (`../data/04-textual-analysis/kaggle-nb-examples/by-position`). Feel free to import the following other datasets and explore them.

- **Kaggle NBs divided by distribution of their total votes**: `../data/04-textual-analysis/kaggle-nb-examples/by-votes`
- **Amazon "Video Game" reviews by product ratings (1-5)** ([src](https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/)): `../data/04-textual-analysis/amazon`
- **Jeopardy questions**: `../data/04-textual-analysis/jeopardy`

## 4.6 Other Codebits

In [None]:
samp = """Who knew language and media were so tangled up with code, data, and computation?!"""
demo_doc = nlp_web_sm(samp)
options = {"compact": True, "distance": 50, "color": "yellow", "bg": "black", "font": "Gill Sans"}
displacy.render(demo_doc, style="dep", options=options)