<a href="https://colab.research.google.com/github/juliajung11/Basic-Text-Analysis-using-R/blob/master/Phrase_BERT_notebook_copy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Moving from words to phrases when doing NLP
- [Abe Handler](https://www.abehandler.com/) University of Colorado, Boulder
- [Shufan Wang](https://people.cs.umass.edu/~shufanwang/) University of  Massachusetts, Amherst

## Phrase-BERT

Learning representations of phrases is important
for many tasks, such as semantic parsing, translation, question answering, or general corpus exploration.

### 1. Why is BERT insufficient?
While pretrained language models such as BERT have led to performance improvements in a variety of NLP tasks, we find that
they still struggle to produce semantically meaningful embeddings for shorter linguistic units (sentences and phrases). In fact, BERT, when used as an off-the-shelf model to produce sentence or phrase embeddings, often underperforms simple baselines such as averaging GloVe vectors
in semantic textual similarity tasks! That makes BERT less effective for use cases that involve phrase understanding.


### 2. Modify BERT to understand phrases
We develop Phrase-BERT, by
fine-tuning BERT with a contrastive learning objective to produce more powerful phrase embeddings. Specifically, we target two major weaknesses
of BERT for phrase embeddings: 

(1)
BERT never sees short texts (e.g., phrases) during pretraining, as its inputs are chunks of 512 tokens;

(2) BERT relies heavily on lexical similarity (word content overlap) to determine semantic relatedness. 

Hence, we construct two datasets: one of lexically diverse phrasal paraphrases, and one of phrases paired with the contexts in which they appear.
We then use the paraphrase data and the contextual information to fine-tune BERT with a contrastive learning objective. The goal is for the model to place each phrase embedding close to the embeddings of both its paraphrases
and the contexts in which it appears.
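
To make the idea of a contrastive objective concrete, here is a minimal sketch of a triplet-style loss on toy vectors. This is an assumption for illustration: the function names, the margin value, and the use of Euclidean distance are ours, and the exact objective used to train Phrase-BERT may differ in its details.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by at least `margin`."""
    return max(0.0, margin + euclidean(anchor, positive) - euclidean(anchor, negative))

# Toy 2-d "embeddings" (made up for this illustration).
anchor = [1.0, 0.0]     # a phrase
positive = [0.9, 0.1]   # its paraphrase, already nearby
negative = [-1.0, 0.0]  # an unrelated phrase, far away

well_separated = triplet_loss(anchor, positive, negative)
confused = triplet_loss(anchor, negative, positive)  # roles swapped: large loss

print(well_separated, confused)
```

When the paraphrase is already much closer than the unrelated phrase, the loss is zero and nothing needs to change; when the roles are swapped, the loss is large, which is the signal that pushes embeddings into a better arrangement during fine-tuning.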



### 3. Phrase-BERT-based Neural Topic Model (PNTM)
Most existing topic models use lists of *unigrams* to describe topics, but we believe that adding phrases to the mix can help corpus exploration too. Here, we show that Phrase-BERT can be easily integrated with an autoencoder model to
build a phrase-based neural topic model (PNTM). PNTM is aware of phrase semantics and phrasal diversity, and can hence present topics as mixtures of words, phrases, and even short sentences.

Despite its simple architecture, PNTM outperforms
other topic model baselines in our human evaluation studies in terms of topic coherence and topic-to-document relatedness.
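
One way to picture how a topic can be presented as a mixture of words and phrases: if topics and vocabulary items live in the same embedding space, a topic can be described by the vocabulary items nearest to it. The sketch below assumes exactly that and uses made-up 2-d vectors; `describe_topic` is a hypothetical helper, not a function from the PNTM codebase.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def describe_topic(topic_vec, vocab_vecs, top_k=3):
    """Return the top_k vocabulary items most similar to the topic vector."""
    ranked = sorted(vocab_vecs,
                    key=lambda item: cosine(topic_vec, vocab_vecs[item]),
                    reverse=True)
    return ranked[:top_k]

# Toy vocabulary mixing unigrams and phrases (vectors are invented).
vocab_vecs = {
    'bank': [0.9, 0.1],
    'the government regulators': [0.8, 0.3],
    'spouse': [0.1, 0.9],
    'gender discrimination': [0.2, 0.8],
}
finance_topic = [1.0, 0.2]  # hypothetical topic embedding

top_items = describe_topic(finance_topic, vocab_vecs, top_k=2)
print(top_items)
```

Because unigrams and phrases are ranked by the same similarity score, nothing in this procedure favors one over the other, which is how a single topic description can mix both.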




## Quickstart on Phrase-BERT

Phrase-BERT is essentially an encoder that produces meaningful embeddings for phrases. In particular, given a phrase, Phrase-BERT outputs a 768-dimensional vector. This vector lies in a phrase embedding space that is "semantically coherent": embeddings of semantically similar phrases are placed close together, while embeddings of dissimilar phrases are placed far apart.



Let's download the dependencies and then Phrase-BERT:

In [None]:
!pip install transformers==3.0.2
!pip install sentence-transformers==0.3.3
!git clone https://github.com/sf-wa-326/phrase-bert-topic-model.git

fatal: destination path 'phrase-bert-topic-model' already exists and is not an empty directory.


Next, let's download a pretrained Phrase-BERT model

In [None]:
!wget https://storage.googleapis.com/phrase-bert/phrase-bert/phrase-bert-model.zip
!unzip phrase-bert-model.zip -d phrase-bert-model/
!rm phrase-bert-model.zip

--2022-02-16 05:22:03--  https://storage.googleapis.com/phrase-bert/phrase-bert/phrase-bert-model.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.152.128, 209.85.145.128, 142.250.125.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.152.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 405370341 (387M) [application/zip]
Saving to: ‘phrase-bert-model.zip’


2022-02-16 05:22:06 (134 MB/s) - ‘phrase-bert-model.zip’ saved [405370341/405370341]

Archive:  phrase-bert-model.zip
replace phrase-bert-model/pooled_context_para_triples_p=0.8/config.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


This is how we can load a Phrase-BERT model using the sentence-transformers library:

In [None]:
from sentence_transformers import SentenceTransformer
model_path = '/content/phrase-bert-model/pooled_context_para_triples_p=0.8'
model = SentenceTransformer(model_path)

Next, let us try some phrases and get their embeddings from Phrase-BERT

In [None]:
phrase_list = [ 'play an active role', 'participate actively', 'active lifestyle']
phrase_embs = model.encode( phrase_list )
[p1, p2, p3] = phrase_embs

In [None]:
for phrase, embedding in zip(phrase_list, phrase_embs):
    print("Phrase:", phrase)
    print("Embedding:", embedding.shape)
    print("")

Phrase: play an active role
Embedding: (768,)

Phrase: participate actively
Embedding: (768,)

Phrase: active lifestyle
Embedding: (768,)



Once we convert phrases into vectors, we can compute their similarity using the dot product.

In [None]:
import numpy as np
print(f'The dot product between phrase 1 and 2 is: {np.dot(p1, p2)}') 
print(f'The dot product between phrase 1 and 3 is: {np.dot(p1, p3)}') 
print(f'The dot product between phrase 2 and 3 is: {np.dot(p2, p3)}') 

The dot product between phrase 1 and 2 is: 218.43597412109375
The dot product between phrase 1 and 3 is: 165.48489379882812
The dot product between phrase 2 and 3 is: 160.51705932617188


Or we can also use cosine similarity:

In [None]:
import torch 
from torch import nn
cos_sim = nn.CosineSimilarity(dim=0)
print(f'The cosine similarity between phrase 1 and 2 is: {cos_sim( torch.tensor(p1), torch.tensor(p2))}')
print(f'The cosine similarity between phrase 1 and 3 is: {cos_sim( torch.tensor(p1), torch.tensor(p3))}')
print(f'The cosine similarity between phrase 2 and 3 is: {cos_sim( torch.tensor(p2), torch.tensor(p3))}')

The cosine similarity between phrase 1 and 2 is: 0.814253568649292
The cosine similarity between phrase 1 and 3 is: 0.6130305528640747
The cosine similarity between phrase 2 and 3 is: 0.5848934054374695
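
The dot products above are in the hundreds while the cosine similarities stay between -1 and 1. That is because cosine similarity is the dot product divided by the lengths of the two vectors, so it ignores vector magnitude. A minimal sketch with toy vectors (the numbers are made up and are not Phrase-BERT embeddings):

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

def cosine(u, v):
    """Dot product normalized by both vector lengths."""
    return dot(u, v) / (norm(u) * norm(v))

u = [3.0, 4.0]
v = [4.0, 3.0]
v_scaled = [40.0, 30.0]  # same direction as v, ten times longer

print(dot(u, v), dot(u, v_scaled))        # the dot product grows with vector length
print(cosine(u, v), cosine(u, v_scaled))  # the cosine similarity does not
```

If the embedding vectors have very different lengths, the two measures can rank pairs differently, so it is worth deciding which one suits the task.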


## Topic Model Case Study: Let's get the data first!

Continue using the dataset from convokit:

In [None]:
!pip install convokit==2.5.2



Then download the "supreme-corpus" dataset from convokit:

In [None]:
import convokit
data_dir = '/content/data/'
root_dir = convokit.download('supreme-corpus', data_dir=data_dir)
! wget https://zissou.infosci.cornell.edu/convokit/datasets/supreme-corpus/cases.jsonl -O cases.jsonl

Dataset already exists at /content/data/supreme-corpus
--2022-02-16 05:23:06--  https://zissou.infosci.cornell.edu/convokit/datasets/supreme-corpus/cases.jsonl
Resolving zissou.infosci.cornell.edu (zissou.infosci.cornell.edu)... 128.253.51.178
Connecting to zissou.infosci.cornell.edu (zissou.infosci.cornell.edu)|128.253.51.178|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13337468 (13M) [application/octet-stream]
Saving to: ‘cases.jsonl’


2022-02-16 05:23:06 (27.3 MB/s) - ‘cases.jsonl’ saved [13337468/13337468]



We load the downloaded corpus and extract utterances from all justices

In [None]:
import json
from convokit import Corpus
corpus = Corpus( root_dir )


In [None]:
from tqdm import tqdm 
import os

def get_justices(input_file='/content/cases.jsonl'):
    '''Get names of all justices in the dataset'''
    all_justices = set()
    with open(input_file, "r") as inf:
        for j in inf:
            j = json.loads(j)
            if j["votes"] is not None:
                for justice in j["votes"].keys():
                    all_justices.add(justice)
    return all_justices

all_justices = get_justices()



utterances = [] # build a list of the utterances we are interested in

for u in tqdm(corpus.get_utterance_ids()):
    u = corpus.get_utterance(u)
    if u.speaker.id in all_justices and u.meta["case_id"][0:3] == "201":
        utterances.append(u)

100%|██████████| 1700789/1700789 [00:03<00:00, 540282.19it/s]


We will use the utterances from two Justices: Justice Roberts and Justice Ginsburg

In [None]:
jrj_text_list = []
for u in tqdm( utterances ):
    if u.speaker.meta['name'] == 'John G. Roberts, Jr.' and 200 < len( u.text ) < 400:
        jrj_text_list.append( u.text )

rbg_text_list = []
for u in tqdm( utterances ):
    if 'ginsburg' in u.speaker.meta['name'].lower() and 200 < len( u.text ) < 400:
        rbg_text_list.append( u.text )

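# make sure the output directories exist before writing to them
for justice_dir in ('/content/data/jrj', '/content/data/rbg'):
    os.makedirs(justice_dir, exist_ok=True)

with open( os.path.join('/content/data/jrj', 'text_list.json'), 'w' ) as f:
    print(len(jrj_text_list))
    json.dump( jrj_text_list, f, indent=4 ) 

with open( os.path.join('/content/data/rbg', 'text_list.json'), 'w' ) as f:
    print(len(rbg_text_list))
    json.dump( rbg_text_list, f, indent=4) 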

100%|██████████| 78598/78598 [00:00<00:00, 311348.81it/s]
100%|██████████| 78598/78598 [00:00<00:00, 502540.27it/s]

2016
1879





## Unigram Topic Model (LDA)

The code in the unigram topic section uses an LDA (Latent Dirichlet Allocation) model and is modified from this [notebook](https://github.com/kapadias/mediumposts/blob/master/natural_language_processing/topic_modeling/notebooks/Introduction%20to%20Topic%20Modeling.ipynb) (credit to the author Shashank Kapadia)

Now that we have the text data, let's train a standard topic model (LDA) on it!

In [None]:
import gensim
from gensim.utils import simple_preprocess
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Remove stopwords for the topic model training

In [None]:
stop_words = stopwords.words('english')

def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; the tokenization itself already drops punctuation
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) 
             if word not in stop_words] for doc in texts]

import json
name = 'jrj'
text_fname = f'/content/data/{name}/text_list.json'

with open(text_fname, 'r') as f:
    text_list = json.load(f)
print( len(text_list) )

data_words = list(sent_to_words(text_list))

# remove stop words
data_words = remove_stopwords(data_words)

2016


Build the topic model dictionary in gensim:

In [None]:
import gensim.corpora as corpora

# Create Dictionary
id2word = corpora.Dictionary(data_words)

# Create Corpus
texts = data_words

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
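
To see what the dictionary and bag-of-words steps produce, here is a plain-Python sketch that mimics the shape of the output: each document becomes a list of `(token_id, count)` pairs. This is an illustration only, not gensim's implementation; `build_vocab` and `to_bow` are hypothetical helpers, and gensim may assign ids in a different order.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Convert a tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Two tiny tokenized "documents" (made up for illustration).
docs = [['statute', 'says', 'statute'], ['government', 'says']]
token2id = build_vocab(docs)
bows = [to_bow(d, token2id) for d in docs]

print(token2id)
print(bows)
```

Word order is discarded in this representation: only which tokens occur, and how often, is passed on to the topic model.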


Train the LDA topic model!

In [None]:
from pprint import pprint

# number of topics
num_topics = 20

# Build LDA model
# LdaMulticore from gensim is typically pretty fast for training 
# But other libraries like Mallet (and there is a gensim wrapper on that) may give better topics 
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)

# Print the top keywords for each topic
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.012*"well" + 0.010*"would" + 0.010*"know" + 0.008*"going" + 0.008*"mean" '
  '+ 0.007*"government" + 0.006*"get" + 0.006*"case" + 0.006*"different" + '
  '0.006*"one"'),
 (1,
  '0.073*"know" + 0.048*"mr" + 0.047*"stay" + 0.047*"granddaughter" + '
  '0.034*"united" + 0.033*"states" + 0.024*"get" + 0.024*"things" + 0.024*"go" '
  '+ 0.024*"back"'),
 (2,
  '0.048*"foreign" + 0.037*"entity" + 0.029*"case" + 0.028*"got" + '
  '0.028*"different" + 0.028*"effective" + 0.021*"know" + 0.015*"one" + '
  '0.015*"would" + 0.014*"like"'),
 (3,
  '0.020*"say" + 0.016*"know" + 0.015*"well" + 0.013*"mean" + 0.011*"think" + '
  '0.009*"want" + 0.008*"case" + 0.008*"would" + 0.008*"right" + 0.007*"says"'),
 (4,
  '0.012*"well" + 0.010*"would" + 0.006*"going" + 0.006*"statute" + '
  '0.005*"state" + 0.005*"mean" + 0.005*"say" + 0.005*"may" + 0.004*"know" + '
  '0.004*"says"'),
 (5,
  '0.138*"land" + 0.047*"words" + 0.047*"somebody" + 0.047*"anything" + '
  '0.046*"else" + 0.046*"someone" + 0.04

Topic descriptions that use only unigrams may not always be informative, especially when the corpus contains complicated abstract concepts. Hence, we would like to add phrases to the mix to help describe the topics.

## Phrase-Based Neural Topic Model (PNTM)

### Get phrases through constituency parsing

In [None]:
"""                 Sentence
                     |
       +-------------+------------+
       |                          |
  Noun Phrase                Verb Phrase
       |                          |
     John                 +-------+--------+
                          |                |
                        Verb          Noun Phrase
                          |                |
                        sees              Bill
"""
# the above example is taken from: https://stackoverflow.com/a/10401433


In [None]:
!pip install benepar

Collecting transformers[tokenizers,torch]>=4.2.2
  Using cached transformers-4.16.2-py3-none-any.whl (3.5 MB)
ERROR: Operation cancelled by user


The dependencies to perform constituency chunking

In [None]:
import pickle
import json
from collections import Counter
import spacy
import benepar
from benepar.integrations.spacy_plugin import SentenceWrapper
import os, time

Load the saved text data.

In [None]:

import json
name = 'jrj'
text_fname = f'/content/data/{name}/text_list.json'

with open(text_fname, 'r') as f:
    text_list = json.load(f)
print( len(text_list) )


Run the constituency parsing process

In [None]:
# !python -m spacy download en_core_web_md # download the required data for spacy if needed 
import nltk
from tqdm import tqdm
benepar.download('benepar_en3')
nlp = spacy.load('en_core_web_md',
            exclude=['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
print('Loaded spacy model')
nlp.add_pipe('sentencizer')
nlp.add_pipe("benepar", config={"model": "benepar_en3"})

parse_string_list = []
for text in tqdm(text_list):
    doc = nlp(text)
    sent = list(doc.sents)[0]
    parse_string = sent._.parse_string
    parse_string_list.append( parse_string )


Get the constituency chunks

In [None]:
import nltk
from nltk.tree import * 
def ExtractPhrases( myTree, target_types):
    myPhrases = []
    if (myTree.label() in target_types):
        myPhrases.append( myTree.copy(True) )
    for child in myTree:
        if (type(child) is Tree):
            list_of_phrases = ExtractPhrases(child, target_types)
            if (len(list_of_phrases) > 0):
                myPhrases.extend(list_of_phrases)
    return myPhrases

target_types = [ 'VP', 'ADJP', 'ADVP', 'PP', 'NP' ]
all_result_phrases = []

from collections import Counter

for parse_string in tqdm( parse_string_list ):
    t1 = Tree.fromstring(parse_string)
    result_phrases = ExtractPhrases(t1, target_types)
    all_result_phrases.extend( result_phrases)
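
To show what the recursive extraction does without needing a parser, here is a self-contained sketch that runs the same idea on the "John sees Bill" example from earlier. It reads a bracketed parse string with a tiny hand-written parser instead of `nltk.tree.Tree`; all helper names are hypothetical, and the bracketed string is hand-written rather than real benepar output.

```python
def parse_brackets(s):
    """Parse '(S (NP John) ...)' into nested (label, children) tuples."""
    tokens = s.replace('(', ' ( ').replace(')', ' ) ').split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip '('
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(read())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1                      # skip ')'
        return (label, children)

    return read()

def leaves(node):
    """Collect the words under a node, left to right."""
    words = []
    for child in node[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

def extract_phrases(node, target_types):
    """Return (label, text) for every constituent whose label is in target_types."""
    found = []
    if node[0] in target_types:
        found.append((node[0], ' '.join(leaves(node))))
    for child in node[1]:
        if isinstance(child, tuple):
            found.extend(extract_phrases(child, target_types))
    return found

tree = parse_brackets('(S (NP John) (VP (VBZ sees) (NP Bill)))')
phrases = extract_phrases(tree, {'NP', 'VP'})
print(phrases)
```

Nested constituents are all returned, so "sees Bill" and "Bill" both appear. Single-word constituents such as "John" are still present at this stage and are only dropped in a later filtering step.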

Get the constituency chunks collection as our phrases set

In [None]:
result_counter = Counter()
for phrase in all_result_phrases:
    if len(phrase.leaves()) > 1: # keep only multi-word constituents; single-word ones (unigrams) are skipped
        label = phrase.label()
        phrase_str = ' '.join( phrase.leaves() )
        phrase_str = phrase_str.lower()
        result_counter.update([phrase_str])

with open( f'/content/data/{name}/phrase_counter.pkl', 'wb') as f:
    print(len(result_counter))
    pickle.dump(result_counter, f)

In [None]:
result_counter.most_common(10)

### Get unigrams through tokenizing

In [None]:

import string
import pickle, json
from tqdm import tqdm
from transformers import BasicTokenizer

justice_name = 'rbg'
save_fname = f'/content/data/{justice_name}/unigram_counter.pkl'
with open(f'/content/data/{justice_name}/text_list.json', 'r') as f:
    text_list = json.load(f)

tokenizer = BasicTokenizer()

doc_tokens_list = [] # a flat list of tokens pooled across all documents
for doc in tqdm( text_list ):
    doc_tokens = tokenizer.tokenize(doc)
    filtered_doc_tokens = []
    for t in doc_tokens:
        if t in string.punctuation:
            continue
        if t in string.digits:
            continue
        if len(t) == 1:
            continue
        filtered_doc_tokens.append( t )
    doc_tokens_list.extend(filtered_doc_tokens)

from collections import Counter
unigram_counter = Counter(doc_tokens_list)
with open(save_fname, 'wb') as f:
    pickle.dump(unigram_counter, f)

print( unigram_counter.most_common(10) )

100%|██████████| 1879/1879 [00:03<00:00, 543.24it/s]

[('the', 6859), ('that', 3509), ('to', 2548), ('you', 2329), ('it', 2205), ('is', 2044), ('of', 1722), ('and', 1685), ('in', 1477), ('was', 1028)]





### Combine the phrases and the unigrams to form the vocab set

In [None]:
justice_name = 'jrj'
outdir = f'/content/data/{justice_name}/'
unigram_fname = f'/content/data/{justice_name}/unigram_counter.pkl'
phrase_fname = f'/content/data/{justice_name}/phrase_counter.pkl'
with open(unigram_fname, 'rb') as f:
    unigrams_counter = pickle.load(f)

with open(phrase_fname, 'rb') as f:
    phrase_counter = pickle.load(f)

print( f'The number of unigrams loaded: {len(unigrams_counter)}' )
print( f'The Number of phrases loaded: {len(phrase_counter)}' )

vocab_list = [ k for k, v in unigrams_counter.items()] + \
    [ k for k, v in phrase_counter.items()]
vocab_list = list(set(vocab_list))
print(len(vocab_list))

word2id_dict = {}
id2word_dict = {}
id2freq_dict = {}
for id, vocab in enumerate(vocab_list):
    word2id_dict[vocab] = id
    id2word_dict[id] = vocab

for id, vocab in id2word_dict.items():
    id2freq_dict[id] = unigrams_counter[vocab] if vocab in unigrams_counter else phrase_counter[vocab]

print( f'The number of vocabulary items (phrases + unigrams pooled): {len(word2id_dict)}')
print( f'The number of vocabulary items (phrases + unigrams pooled): {len(id2word_dict)}')
print( f'The number of vocabulary items (phrases + unigrams pooled): {len(id2freq_dict)}')

import os
with open( os.path.join(outdir, 'combined_word2id_dict.pkl'), 'wb') as f:
    pickle.dump(word2id_dict, f)

with open( os.path.join(outdir, 'combined_id2word_dict.pkl'), 'wb') as f:
    pickle.dump(id2word_dict, f)

with open( os.path.join(outdir, 'id2freq_dict.pkl'), 'wb') as f:
    pickle.dump(id2freq_dict, f)


The number of unigrams loaded: 6111
The Number of phrases loaded: 20457
26568
The number of vocabulary items (phrases + unigrams pooled): 26568
The number of vocabulary items (phrases + unigrams pooled): 26568
The number of vocabulary items (phrases + unigrams pooled): 26568


### Use Phrase-BERT to produce embeddings for input text and vocabularies

In [None]:
!pip install transformers==3.0.2 # the constituency parsing process earlier requires transformers version 4.6.1 so we have to reinstall the correct one
# Having multiple reinstallations is not great. If you have conflicting dependencies in your project, you may use virtual environments and anaconda.



In [None]:
!python -u /content/phrase-bert-topic-model/phrase-topic-model/preprocess.py \
    --topic_model_data_path "/content/data/rbg/" \
    --emb_model_path "/content/phrase-bert-model/pooled_context_para_triples_p=0.8"

loaded 25434 vocabs
25434
Batches: 100% 3180/3180 [01:12<00:00, 44.13it/s]
1879
Batches: 100% 235/235 [00:19<00:00, 12.26it/s]
Done


In [None]:

!python -u /content/phrase-bert-topic-model/phrase-topic-model/run_topic_model.py \
    --num_topics 20 \
    --num_epochs 100 \
    --random_seed 42 \
    --topic_model_data_path "/content/data/rbg/" \
    --emb_model phrase-bert > "/content/data/rbg/output.txt"

  0% 0/5 [00:00<?, ?it/s]100% 5/5 [00:00<00:00, 422.39it/s]


Topics produced by PNTM:

In [None]:
# topic 0 : the legislature, in the north carolina statute, congress, the puerto rico government, imposed under the internal revenue codes, executive , legislature, under the internal revenue codes, government as sovereign, the legislature and the government, congress , the legislature ,
# topic 1 : to live with one 's spouse, between males and females, disabilities, sexual abuse, women of child - bearing age, sexual abuse of an adult, gender discrimination, with disabilities, gender, men and women who are parents
# topic 2 : set standards for emissions, 25 minutes, using the taser, contains hazardous substances, hazardous substances, over the standard amount, a lethal injection protocol, a federal safety standard, this questionable drug, narcotics
# topic 3 : the very basic argument, technicality, argument 's sake, specification, formulation, evidentiary, argument 's, the essential argument, the simple argument, the technicality
# topic 4 : the internal revenue codes, banks, the consumer finance protection bureau, distributor, bankshares, shopkeeping, export administration, bank, the bank, the government regulators
# topic 5 : paid the lawyer, noticed, his other money, gave the proceeds to his, had all of his other money, paid out of trust funds, was very nervous, asked for that relief, on my income tax return, my income tax return
# topic 6 : the federal rules, any specific jurisdiction case, an ordinary litigation in federal court, the federal law, choice of law, federal law, by a federal common law rule, dealing with choice of law, the court 's rationale, a federal common law rule
# topic 7 : make an arrest, be stopped, make the stop, intercept, the stop, pass, get out at the same time, commence, sign, to pass
# topic 8 : a domestic assault, employer, inherent authority, this obligation, rests on an employer acting unlawfully, the offense of possession, an unlawful employment practice, the inherent authority, on an employer acting unlawfully, an employer acting unlawfully
# topic 9 : to 30 percent, on the 10 percent, 30 within 30, the 10 - year limitation, ten years mandatory minimum for that, this 3 - year outside limit, require further reductions, at the foreclosure amount, trebled the amount at the outset, was additional insurance in 1028
# topic 10 : suspected, the pretrial detainee, the unlawful detention, contains hazardous substances, traceable, mentioned the pretrial detainee, violation of a federal safety standard, this questionable drug, unlawful entry, detainee
# topic 11 : embassy, foreign governments, need not be a government officer, foreign plaintiffs, deference to state domestic relations law, the iran u.s. claims commission, the foreign state, governed by foreign law, a resident alien, state domestic relations law
# topic 12 : escape, the inmate contraband, to suffer pain, work out the back pay, getting disability pay, induce this unconscious state, detainee, pay or compensation, be possible to find the victims, be likely subject to torture
# topic 13 : commencement of a lawsuit, a state claim for negligence, a conviction entered by a court, the statutory damages, breach of fiduciary claims, the compensatory damages, be brought as a class action, for breach of fiduciary claims, an implied damages action, punitive damages
# topic 14 : gone into court, be a debt owed to, an effort to get a warrant, get a warrant, the prosecutor, have the informant 's tip, for false arrest, a debt owed to, a warrant, the informant 's tip
# topic 15 : parole, can release the prisoner, the minimum term of imprisonment, release the prisoner, arrest, go to trial, have a criminal trial, plead guilty, into an arrest, make an arrest
# topic 16 : on court, the lawyer for the defense, injuring california people, tough, a lawyer, any careful lawyer, bad things, the lawyer, paid the lawyer, do bad things
# topic 17 : use the word " person, husband, father / child relationships, men and women who are parents, want, spouse, live with one 's spouse, your spouse, to live with one 's spouse, for the wife
# topic 18 : 's what this court has declared, entered by a court, decree, by its sovereign immunity, a petition filed, its sovereign immunity, was announced by this court, before foreign tribunals, the mandate issued, announced by this court
# topic 19 : was he wanted the testimony, asked before, have the informant 's tip, about the police officer, ask what i assume was, was not telling them, want to go to trial, go to trial, suppose he had survived, for suspicion


Comparing PNTM vs LDA:

In [None]:
# the "international" topic from LDA:
# mr, case, international, one, domestic, foreign, entity, states, united, position

# the "international" topic from PNTM: 
# embassy, foreign governments, need not be a government officer, foreign plaintiffs, deference to state domestic relations law, 
# the iran u.s. claims commission, the foreign state, governed by foreign law, a resident alien, state domestic relations law



Comparing two PNTM-produced topics:

In [None]:

# the "finance" topic from PNTM (Justice Roberts):
# monetary, the compensation, pensions, payment, paycheck, money remuneration, financial, budgeting, budget, the hourly wage

# the "finance" topic from PNTM (Justice Ginsburg):
# the internal revenue codes, banks, the consumer finance protection bureau, distributor, bankshares, shopkeeping, export administration, bank, the bank, the government regulators
