Note: copied from: https://github.com/arsena-k/discourse_atoms/blob/master/DATM%20Tutorial%20Part%202%20of%202%20-%20Assigning%20Atoms%20-%20for%20Public.ipynb

My notes (18-02-2022):
- I had to update the code for Gensim 4.0, which I think I did correctly (otherwise it probably wouldn't work)
- I also didn't quite understand what datatype their corpus was. I thought it was lists of tokens nested inside a list of comments, but it seems it was actually a dataframe. From past experience I think pandas dataframes aren't very good at being iterated over, so I avoided using dataframes as much as possible. Only at the end I create a dataframe, but even then I think it's better to store it as a sparse matrix rather than a csv file.
- Also since I'm not combining two datasets as they do, I skipped half the parts they do double. 
- In the end it seems every went correctly, but tbh I'm in over my head so I am not certain

My notes (17-3-2022):
- I changed the corpus to a gensim LineSentence style (i.e., a text file with each comment on a line and each term seperated by a whitespace
- When I tried to include all comments (instead of a sample), I was getting a few (2 in 2 million) weighted sents of length zero, which caused a divide by zero error. I'm not sure how this is possible
- I also don't understand why there were 2 samples taken per comment in the sample_cts function. They way it was coded (the same 3 lines copy/pasted below each other) makes it seem unintentional. I only take nne per comment
- Furthermore, I realized that this windowsize doesn't need to be the same as that of the word embedding vector.


# Discourse Atom Topic Modeling (DATM) Tutorial 

## Part 2 of 2: Mapping Atoms to Text Data

* This code is written in Python 3.7.2, and uses Gensim version 3.8.3. 

* This code is provides an the outline of how we assigned atoms to our cleaned data, which we show how to identify in Part 1 of 2. Note that we cannot redistribute the data used in our paper "Integrating Topic Modeling and Word Embedding" in any form, and researchers must apply directly to the Centers for Disease Control and Prevention for access. Details on data access are provided in the paper. We add comments with tips for adapting this code to your data.  
* In our case, the goal of this code is to take a given narrative, get rolling windows of contexts from this narrative, find the SIF sentence embedding from each rolling window, and match the SIF embedding onto the closest (by cosine similarity) atom in the Dictionary loaded in earler. The SIF embedding is the maximum a posteriori (MAP) estimate of what the atom is for that sentence. So we'll get out a rolling window (i.e., a sequence) of atoms underlying the narrative.
* In our case, we get atoms separately for law enforcement (narle) and medical examiner (narcme) narratives, and then combine the two distributions, as described in our paper. 

In [1]:
# Used
import pandas as pd
import pickle
from gensim.models import Word2Vec
import numpy as np
from random import seed, sample
from time import perf_counter
from tqdm import tqdm
from scipy.sparse import csr_matrix, save_npz

from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import linear_kernel

from gensim.models.word2vec import LineSentence

In [2]:
# Not used?
from __future__ import division
import cython
import math
from collections import Counter
from ksvd import ApproximateKSVD 
from itertools import combinations
import matplotlib.pyplot as plt
import string
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.linalg import norm

%matplotlib inline

# Setup

In [2]:
# Import
chosen_wv_model = './models/gensim_model_window25_vector300'
chosen_datm_model = './datm/325comp_5nonzeros_dictionary'
# ngram_comments_file = './data/ngram_comments.p' #del
tokenized_comments_v2_file = './data/tokenized_comments' #del
tokenized_comments_text_file = './data/tokenized_comments.txt'

# Export atom loadings per comment
# I thought this window size had to be the same as that for word embedding, but it doesn't
# A-K et al choose a word embedding with window size 5, but had a window size 10 for datm
atoms_npz = "./data/atoms.npz" #this was with window_size =25 (meaning only comments of over 50 words were looked at)
atoms_window5_npz = "./data/atoms_window5.npz"

In [56]:
weight_a = 0.0001 #reasonable range for weight a is .001 to .0001 based on Arora et al SIF embeddings. The extent to which we re-weight words is controlled by the parameter $a$, where a lower value for $a$ means that frequent words are more aggressively down-weighted compared to less frequent words. 
seed_number = 5

# n_sample currently ignored because looping over full corpus.
n_sample = 100000 #adjusting here to corpus sample, but consider using full corpus for final SIF embeddings. 
window_size = 5 # From chosen word embedding model ?? or not
vector_size = 300 # From chosen word embedding model
min_tokens = window_size*2-1 #this is set so that only narratives with at least 19 tokens in the w2v model vocab are considered. 

In [5]:
w2vmodel=Word2Vec.load(chosen_wv_model)

In [3]:
with open(chosen_datm_model,'rb') as f:
    mydictionary=pickle.load(f)

In [12]:
len(mydictionary)

325

### Prepare two pieces of informatin from the text data, which we will need to compute SIF Sentence Embeddings (MAP) of any given sentence 

* SIF Sentence Embedding is from: "A Simple but tough-to-beat baseline for sentence embedding" https://github.com/PrincetonML/SIF
* To do SIF embeddings, we need to prep functions and two pieces of information from the raw text: (1) frequency weights for each word, and (2) the "common discourse vector" ($C_0$)

In [7]:
corpus = LineSentence(tokenized_comments_text_file)

**1. The first input to SIF embeddings is an estimate of the frequency weights (based on probabilites) for each word in the corpus. Compute this here.**
* This will naturally downweight stopwords when we compute a sentence embedding. It requires the raw text data of the corpus. 
* Either train a dictonary of weights (1), or upload a saved dictionary (2). 

In [8]:
def get_freq_dict(w2vmodel, weight_a=.0001): 
    freq_dictionary = {word: w2vmodel.wv.get_vecattr(word, "count") for word in w2vmodel.wv.index_to_key} 
    total= sum(freq_dictionary.values())
    freq_dictionary = {word: weight_a/(weight_a + (w2vmodel.wv.get_vecattr(word, "count") / total)) for word in w2vmodel.wv.index_to_key} #best values according to arora et al are between .001 and .0001
    return(freq_dictionary)

#function to yield a weighted sentence, using the above weight dictionary
def get_weighted_sent(tokedsent,freq_dict, w2vmodel=w2vmodel): 
    weightedsent= [freq_dict[word]*w2vmodel.wv[word] for word in tokedsent if word in freq_dict.keys()]
    if len(weightedsent) != 0:
        return(sum(weightedsent)/len(weightedsent)) #weightedsent_avg  #divide the weighted average by the number of words in the sentence
    else:
        return None


In [9]:
freq_dict= get_freq_dict(w2vmodel, weight_a)

**2. The second input to SIF embeddings is $C_0$, the common discourse vector. Compute this here.**
* Get this with a random sample of discourse vectors since the data is so large, or compute using all narratives. 

In [10]:
# I use all docs instead of a sample
# I added a if for sampvecs.append because otherwise sometimes divided by zero. Note: also changed to code for get_weighted_sent
# also why are there two samples per doc???

def samp_cts(docs, windowsize, freq_dictionary, vector_size):
    #sampnarrs=  sample(docs, n_sample) #sample of narratives. Will take 1 random window and discourse vector of this window, from each narrative. 
    sampvecs= []

    for i in tqdm(docs): 
        if len(i)>windowsize: #want window length to be at least windowsize words
            n= sample(range(0,len(i)-windowsize), 1)[0] #get some random positon in the narrative (at least windowsize steps behind the last one though)
            sent= i[n:n+windowsize] #random context window 
            weighted_sent = get_weighted_sent(i, freq_dictionary)
            if weighted_sent is not None: # length zero
                sampvecs.append(weighted_sent) #sample a discourse vector, and append to a list of sample discourse vectors.
    sampvecs= np.asarray(sampvecs)

    return(sampvecs)

def get_c0(sampvecs):
    svd = TruncatedSVD(n_components=1, n_iter=10, random_state=0) #only keeping top component, using same method as in SIF embedding code
    svd.fit(sampvecs) #1st singular vector  is now c_o
    return(svd.components_[0])

def remove_c0(comdiscvec, modcontextvecs):
    curcontextvec= [X - X.dot(comdiscvec.transpose()) * comdiscvec for X in modcontextvecs] #remove c_0 from all the cts
    curcontextvec=np.asarray(modcontextvecs)
    return(curcontextvec)

In [11]:
# With 100,000 sample took about 30 secs
# With full corpus (2mil comments) took 9 minutes

sampvecs2_narcme= samp_cts(corpus, window_size, freq_dict, vector_size)

2114724it [04:48, 7327.97it/s] 


In [12]:
sampvecs2_narcme= normalize(sampvecs2_narcme, axis=1) #l2 normalize the resulting context vectors

pc0_narcme= get_c0(sampvecs2_narcme)
sampvecs2_narcme = remove_c0(pc0_narcme, sampvecs2_narcme) 

In [13]:
for c, i in enumerate(sampvecs2_narcme):
    if i is None:
        #print(len(i))
        print(f"Position {c}")
        print(i)
        print(type(i))

In [14]:
len(sampvecs2_narcme)

948627

###  Resulting function to get SIF MAPs along rolling windows, for a given narrative. 

* This the function we use to find rolling windows and assign MAPs to them, for a given narrative.
* Note that this is set for our embedding size, which was 200-dimensions. 

In [31]:
def sif_atom_seqs(toked_narrative, window_size, vector_size, topics_dictionary, c0_vector, freq_dict, w2vmodel): 
    
    toked_narr2 = [i for i in toked_narrative if i in w2vmodel.wv.index_to_key] #remove words not in vocab
    if len(toked_narr2)> min_tokens :  
        it = iter(toked_narr2)
        win = [next(it) for cnt in range(0,window_size)] #first context window
        MAPs= normalize(remove_c0( c0_vector, get_weighted_sent(win, freq_dict, w2vmodel).reshape(1,vector_size))) #doing the SIF map here. Hardcoding in the dimensionality of the space to speed this up.
        for e in it: # Subsequent windows
            win[:-1] = win[1:]
            win[-1] = e
            MAPs = np.vstack((MAPs, normalize(remove_c0(c0_vector, get_weighted_sent(win, freq_dict, w2vmodel).reshape(1,vector_size)))))  #this will be matrix of MAPs

        costri= linear_kernel(MAPs, topics_dictionary) 
        atomsseq= np.argmax(costri, axis=1) #this is for the index of the closest atom to each of the MAPs
        #maxinRow = np.amax(costri, axis=1) #this is for the closest atom's cossim value to each of the maps
        return(atomsseq.tolist()) #returns sequence of the closest atoms to the MAPs
    else:
        return(None)


## Transform Documents in a Corpus into a Sequences of Atoms 

* This is the **final result** we want from all code above
* First, get c0 from narcme narratives, and then get the atom sequence for the narcme narratives
* Then, get c0 from narle narratives, and then get the atom sequence for the narle narratives

Get SIF Atom Seqs on NARCME narratives

In [57]:
# This took about 3 hours for 2.1 million comments

atom_seq = []
with open(tokenized_comments_text_file, 'r') as f:
    
    for x in tqdm(f.readlines()):
        x = x.replace("\n", "").split(" ")
        x = sif_atom_seqs(x, window_size, vector_size, mydictionary , pc0_narcme, freq_dict, w2vmodel)
        if not(x == None):
            x = ' '.join([str(elem) for elem in x])
        else:
            x = ''

        atom_seq.append(x)

100%|█████████████████████████████████████████████████████████████████████| 2116273/2116273 [3:10:06<00:00, 185.54it/s]


## Reformatting the Resulting Atom Sequences into Variables, by Vectorizing the Atoms

Transform each sequence into a distribution over topics

In [58]:
bow_transformer = TfidfVectorizer(analyzer = 'word', norm='l1', use_idf=False, token_pattern='\S+') #need token pattern, otherwise splits using using 0-9 single digits too! #note that atoms that are part of all or no documents will not be transformed here, can reset this default, but I left as is for now since makes prediction easier (fewer features). #includes l1 normalization so that longer documents don't get more weight, l1 normalizes with abs value but all our values are pos anyways
#bow_transformer.fit([x for x in atom_seq if x != '']) #corpus needs to be in format ['word word word'], NOT tokenized already
bow_transformer.fit(atom_seq)

vecked = bow_transformer.transform(atom_seq).toarray() #consider instead:  vecked = bow_transformer.transform(corpus['narcme_narle_atom_seq_combined'].dropna(inplace=True).tolist()).toarray() #this is the "feature" data, now in an array for sklearn models

In [59]:
df = pd.DataFrame(vecked, columns = bow_transformer.get_feature_names())

## Save it as a sparse matrix

In [60]:
df_c = csr_matrix(df.astype(pd.SparseDtype("float", 0)).sparse.to_coo())

In [61]:
df_c

<2116273x325 sparse matrix of type '<class 'numpy.float64'>'
	with 23672877 stored elements in Compressed Sparse Row format>

In [62]:
save_npz(atoms_window5_npz, df_c)

## Inspect

In [51]:
pd.options.display.max_columns = 60

In [65]:
df.head(10)

Unnamed: 0,0,1,10,100,101,102,103,104,105,106,107,108,109,11,110,111,112,113,114,115,116,117,118,119,12,120,121,122,123,124,...,72,73,74,75,76,77,78,79,8,80,81,82,83,84,85,86,87,88,89,9,90,91,92,93,94,95,96,97,98,99
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.176471,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.004902,0.009804,0.0,0.0,0.0,0.0,0.0,0.0,0.029412,0.0,0.0,0.0,0.02451,0.0,0.009804,0.0,0.0,0.029412,0.0,0.0,0.034314,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.039216,0.0,0.0,0.0,0.0,0.0,0.014706,0.0,0.004902,0.009804,0.004902,0.0,0.0,0.0,0.014706,0.014706,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.009434,0.0,0.0,0.0,0.0,0.0,0.0,0.018868,0.0,0.0,0.0,0.023585,0.0,0.009434,0.0,0.0,0.018868,0.0,0.0,0.033019,0.0,0.0,0.004717,0.0,0.009434,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.037736,0.0,0.0,0.0,0.0,0.0,0.014151,0.0,0.009434,0.009434,0.0,0.0,0.0,0.0,0.014151,0.014151,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.044444,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.0,0.0,0.05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.019608,0.0,0.0,0.0,0.0,0.0,0.019608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.039216,0.0,0.0,0.0,0.0,0.04902,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.009804,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019608,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [66]:
x=0
for i in corpus:
    x+=1
    print(" ".join(i))
    print()
    if x == 10:
        break
# visual inspection confirms that current topic names (i.e., 1-250 match the discourse atoms)
# For example, corpus[4] scores strongly on topic 94, which is she/her and a bunch of female first names

additionally thank_you so much for caring enough to ask in a genuinely_curious and respectful way i 'm_curious to hear your thoughts

i am watching shape of water it says some stuff in russian how can i find out what they are saying

extremely underwhelmed by the film it just felt dull the 'romance between fish-dick and mute-chick was too quick and i never felt any connection or empathy for the 'asset the pie-shop scenes were so quick and short and ultimately meant nothing i want to get to know you ew no and that was it just the film saying hah look the 40 's hated gays the strange musical imagination scene was a complete tonal change from the rest of the film and it seems the film just ignored all logic to tell the story the 'asset had no security at all no guards no cameras the cleaner was allowed in and out whenever she felt like it most of the time when the guy 's hand was injured it took ages for any help to arrive in a top-security lab a cleaner and her elderly friend manage to b