<a href="https://colab.research.google.com/github/katyasmpsn/thesis/blob/main/clusters.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [32]:
# only run the install line below if the kernel has re-started 
# !pip install transformers
import torch
import pandas as pd
from collections import Counter, defaultdict
from transformers import BertTokenizerFast, BertModel

model = BertModel.from_pretrained('bert-base-uncased')
t = BertTokenizerFast.from_pretrained('bert-base-uncased')


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# 🧹 Pre-Processing


---



The note text has been pre-cleaned locally due to the size of the raw JSON blob. See `thesis/dev/thesis-rough-work.ipynb`.

1.   Usernames, hashtags, URLs, punctuation, and digits omitted
2.   Stop words removed

**TO DO**: try without removing stop words. BERT was trained on natural text, so its embeddings might pick up on the "unnaturalness" of note text with the stop words stripped out.



In [33]:
df = pd.read_csv("/content/drive/MyDrive/Thesis/cleaned_data.csv")
# create a column with a list of words for each text snippet so that it's easier to calculate term frequencies over the corpus
df['noteTextList'] = df['noteText'].str.lower().str.split()
df = df[~df['noteTextList'].isnull()]  # why are there empty notes at this point? 

# 🔢 Corpus Statistics 

Sia et al. found that term frequency worked best as their corpus statistic. I'll also use term frequency, following their notation:

tf $= \frac{n_{t}}{\sum_{t'} n_{t'}}$, where $n_{t}$ is the count of the word type $t$.
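
As a quick check of the formula, here is a minimal sketch on a made-up two-note corpus (the notes are hypothetical, not taken from the dataset). Each weight is a word type's count divided by the total number of word tokens, so the weights sum to 1.

```python
from collections import Counter

# Hypothetical toy corpus: each note is already lower-cased and split into words
toy_notes = [
    ["comet", "real", "comet"],
    ["look", "comet"],
]

# n_t: count of each word type over the whole corpus
counts = Counter()
for note in toy_notes:
    counts.update(note)

# denominator: total number of word tokens in the corpus
total = sum(counts.values())

# tf weight per word type
tf = {word: n / total for word, n in counts.items()}
print(tf)
```

Here "comet" occurs 3 times out of 5 tokens, so its weight is 3/5.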


In [34]:
# initializing Counter object 
vocab_counts = Counter()
# updating the counter 
df['noteTextList'].apply(vocab_counts.update)
corpus_denominator = sum(vocab_counts.values())
# tf weights 
vocab_counts = {key:value/corpus_denominator for (key,value) in vocab_counts.items()}

# 💽 Generate Word Type Embeddings

We get the last hidden state from BERT for each word token, using the entire note/tweet/reply as the context window. Then the embeddings are averaged over each word type. 
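
The subword-to-word averaging step can be sketched in plain Python, with tiny made-up vectors standing in for BERT's 768-dimensional hidden states. The `word_ids` list mimics what a fast tokenizer's `word_ids()` returns: `None` for special tokens and a repeated index for subwords of the same word. `average_by_word` is a hypothetical helper for illustration only, not part of the notebook.

```python
from collections import defaultdict

def average_by_word(word_ids, vectors):
    """Average the vectors of all subword tokens that share a word id."""
    groups = defaultdict(list)
    for wid, vec in zip(word_ids, vectors):
        if wid is not None:  # skip special tokens such as [CLS] / [SEP]
            groups[wid].append(vec)
    # element-wise mean of each group, returned in word order
    return [
        [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for _, vecs in sorted(groups.items())
    ]

# Toy input: [CLS], a single-token word, a word split into two subwords, [SEP]
word_ids = [None, 0, 1, 1, None]
vectors = [[9.0, 9.0], [1.0, 2.0], [2.0, 4.0], [4.0, 8.0], [9.0, 9.0]]

word_vectors = average_by_word(word_ids, vectors)
print(word_vectors)  # [[1.0, 2.0], [3.0, 6.0]]
```

Dividing by the group size is what makes this an average; multiplying a 0/1 indicator matrix by the hidden states without that division would give the sum of the subword vectors instead.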



In [35]:
def getTokenEmbeddings(t1):
  """
  INPUT: full text
  OUTPUT: dictionary with each token's last hidden state. If the original token
  was broken down into subwords, the average over subword representations is
  returned

  {token : 768-dim vector}
  """

  # `t` and `model` are the module-level tokenizer and model instantiated
  # in the first code block

  # tokenize the whitespace-split words so that word_ids() lines up with `words`
  # (BERT's own pre-tokenizer would split e.g. "don't" into several words)
  words = t1.split()
  encoded_input = t(words, is_split_into_words=True, return_tensors='pt')
  words_ids = encoded_input.word_ids()

  # no gradients are needed for feature extraction
  with torch.no_grad():
    output = model(**encoded_input)

  # Average subword representations
  # Generate dummies for words_ids (special tokens are None and get dropped),
  # then row-normalize so the matmul averages the subword vectors instead of summing them
  wi_d = torch.from_numpy(pd.get_dummies(pd.Series(words_ids)).T.values.astype('float32'))
  wi_d = wi_d / wi_d.sum(dim=1, keepdim=True)
  squeezed_states = torch.squeeze(output['last_hidden_state'], dim=0)
  reduced_states = torch.matmul(wi_d, squeezed_states)

  # note: if a word occurs more than once in the text, the last occurrence wins
  res = {words[i]: reduced_states[i] for i in range(len(words))}
  return res


In [36]:
fake_data = ["don't look up the comet can't be real", "look up everything is about to change because of the comet"]
fake = pd.DataFrame(fake_data, columns=["text"])

In [37]:
# df.tweetText.tolist()

In [41]:
text = fake.text.tolist()
# iterate through the list of text strings and build a list of dicts: [{t1_tok1: embed, t1_tok2: embed}, {t2_tok1: embed, t2_tok2: embed}]
text = [getTokenEmbeddings(x) for x in text]

In [45]:
# just a toy snippet for now, but it's creating a master dictionary for the vocab.
# in the example: "look" is used twice, and is in two separate dictionaries in `text`.
# this adds all of the embeddings for "look" into one list
d = {}
for d_t in text:
  for k, v in d_t.items():
    # create the list on first sight of a word, then append
    d.setdefault(k, []).append(v)

# this now averages all of the embeddings for tokens that have more than one,
# keeping each value as a one-element list

for k,v in d.items():
  if len(v) > 1:
    d[k] = [torch.mean(torch.stack(v), dim=0)]

In [43]:
text[0]["look"].shape

torch.Size([768])

In [44]:
text[1]["look"].shape

torch.Size([768])

In [47]:
d["look"][0].shape

torch.Size([768])

In [None]:
tensor_list = [text[0]["look"], text[1]["look"]]
mean = torch.mean(torch.stack(tensor_list), dim=0)


In [None]:
mean.shape

torch.Size([768])

In [None]:
# text[1]["look"]

In [None]:
text[1].keys()

dict_keys(['look', 'up', 'everything', 'is', 'about', 'to', 'change', 'because', 'of', 'the', 'comet'])

In [None]:
embeddings_dict = defaultdict(list,{ k:[] for k in list(vocab_counts.keys()) })

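# assumption: `notes` is not defined anywhere above; taking it to be the list of cleaned note texts
notes = df['noteText'].tolist()

for t1 in notes[:2]:

  # tokenize the whitespace-split words so that word_ids() lines up with `words`
  words = t1.split()
  encoded_input = t(words, is_split_into_words=True, return_tensors='pt')
  words_ids = encoded_input.word_ids()
  output = model(**encoded_input)

  ## Generate dummies for words_ids, row-normalize, multiply by the tensor
  ## (normalizing makes this an average over subwords rather than a sum)
  wi_d = torch.from_numpy(pd.get_dummies(pd.Series(words_ids)).T.values.astype('float32'))
  wi_d = wi_d / wi_d.sum(dim=1, keepdim=True)
  squeezed_states = torch.squeeze(output['last_hidden_state'], dim=0)
  reduced_states = torch.matmul(wi_d, squeezed_states)

  print(words)
  for i in range(len(words)):
    embeddings_dict[words[i]].append(reduced_states[i])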
  



['blm', 'organization', 'terrorist', 'organization', 'approximately', 'percent', 'recent', 'blm', 'protests', 'peaceful', 'national', 'organization', 'specifically', 'calls', 'peaceful', 'protest']
['post', 'claims', 'blm', 'organization', 'deserve', 'nobel', 'peace', 'prize', 'incited', 'violence', 'blm', 'organization', 'specifically', 'non-violent', 'explicitly', 'calls', 'page', 'vast', 'majority', 'blm', 'related', 'protests', 'us', 'peaceful']


In [None]:
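# not sure if I should average these or not
# note: without a dim argument, torch.mean reduces over every element and returns a scalar, not a 768-dim vector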
torch.mean(torch.stack(embeddings_dict['organization']))

tensor(-0.0102, grad_fn=<MeanBackward0>)

In [None]:
df.iloc[0]['noteText']

'blm organization terrorist organization approximately percent recent blm protests peaceful national organization specifically calls peaceful protest'

In [None]:
# getTokenEmbeddings(df.iloc[0]['noteText'])

In [None]:
tokens = t(t1, return_attention_mask=False, return_token_type_ids=False)
words_ids = tokens.word_ids()


In [None]:
words_ids

[None,
 0,
 1,
 2,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 8,
 9,
 10,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 21,
 22,
 23,
 24,
 25,
 None]

In [None]:
marked_text = "[CLS] " + "bella is my cat" + " [SEP]"

# Tokenize our sentence with the BERT tokenizer.
tokenized_text = t.tokenize(marked_text)

# Print out the tokens.
print(tokenized_text)

['[CLS]', 'bella', 'is', 'my', 'cat', '[SEP]']


In [None]:
encoded_input = t(t1, return_tensors='pt')
output = model(**encoded_input)

In [None]:
output['last_hidden_state'].shape

torch.Size([1, 32, 768])

In [None]:
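## Generate dummies for words_ids, multiply by the tensor
## (note: with 0/1 dummies this sums the subword vectors of each word; divide by the row sums to get an average)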
wi_d = pd.get_dummies(pd.Series(words_ids)).T
squeezed_states = torch.squeeze(output['last_hidden_state'])
reduced_states = torch.matmul(torch.from_numpy(wi_d.values.astype('float32')), squeezed_states)


In [None]:
reduced_states.shape

torch.Size([26, 768])

In [None]:
wi_d.shape

(26, 32)

In [None]:
# words = t1.split()
res = {words[i]: reduced_states[i] for i in range(len(words))}

In [None]:
words

['post',
 'claims',
 'blm',
 'organization',
 'deserve',
 'nobel',
 'peace',
 'prize',
 'incited',
 'violence',
 'blm',
 'organization',
 'specifically',
 'non-violent',
 'explicitly',
 'calls',
 'page',
 'vast',
 'majority',
 'blm',
 'related',
 'protests',
 'us',
 'peaceful']
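
Comparing the outputs above, `words_ids` runs from 0 to 25 (26 words) while the whitespace split yields only 24 words. That is consistent with the tokenizer splitting `non-violent` into `non`, `-`, and `violent`. Below is a minimal sketch of a sanity check for this kind of mismatch, using the values printed above; `check_alignment` is a hypothetical helper, not part of the notebook.

```python
def check_alignment(words, word_ids):
    """Return (n_whitespace_words, n_tokenizer_words) for comparison."""
    n_tokenizer_words = len({wid for wid in word_ids if wid is not None})
    return len(words), n_tokenizer_words

# Values copied from the notebook output above
words = ("post claims blm organization deserve nobel peace prize incited violence "
         "blm organization specifically non-violent explicitly calls page vast "
         "majority blm related protests us peaceful").split()
word_ids = [None, 0, 1, 2, 2, 3, 4, 5, 6, 7, 8, 8, 9, 10, 10, 11, 12, 13, 14,
            15, 16, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, None]

n_split, n_tok = check_alignment(words, word_ids)
print(n_split, n_tok)  # 24 26
if n_split != n_tok:
    print("word list and word ids are misaligned; indexing one by the other is unsafe")
```

When the two counts differ, `words[i]` and `reduced_states[i]` no longer refer to the same word from the first split word onward. One way to keep them aligned is to pass the pre-split word list to the tokenizer with `is_split_into_words=True`, so that `word_ids()` indexes into that list.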