## 1. SGNS

### How to build the ingroup/outgroup dimensions?
Build one vector for each dimension based on a set of words *L* from the replacement experiments coded as "ingroup" or "outgroup".

* Top word (|*L*| = 1), given some measure, e.g. frequency or "keyness" (see below)
* Top *k* words (|*L*| = *k*), given some measure, e.g. frequency or "keyness"
* Manual selection of *L*; can be more or less motivated by some measure, e.g. frequency or "keyness"

Given *L* the association with a dog whistle expression (DWE) over time can be measured by:

* **CENTROID**: The similarity of the vector of the DWE at *t* and the mean of vectors of words in *L* (at *t*)
* **PAIRWISE**: The mean of the similarities between the vector of the DWE (at *t*) and the vector of each word in *L*. With this approach, a *t*-test can be used to check whether in-group and out-group association differ; the methodological implications of this should be thought through, though. **Does this matter?**
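
A minimal sketch of the two measures on toy vectors, assuming cosine similarity as the similarity function. The helper names (`cosine`, `centroid_association`, `pairwise_association`) and the vectors are made up for illustration and are not part of the project code.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def centroid_association(dwe_vec, term_vecs):
    """CENTROID: similarity between the DWE vector and the mean of the term vectors."""
    return cosine(dwe_vec, np.mean(term_vecs, axis=0))

def pairwise_association(dwe_vec, term_vecs):
    """PAIRWISE: mean of the similarities between the DWE vector and each term vector."""
    return float(np.mean([cosine(dwe_vec, v) for v in term_vecs]))

# Toy data (hypothetical): one DWE vector and two term vectors of very different length
dwe = np.array([1.0, 0.0])
L_vecs = np.array([[10.0, 0.0], [0.0, 1.0]])

centroid_score = centroid_association(dwe, L_vecs)
pairwise_score = pairwise_association(dwe, L_vecs)
print(centroid_score, pairwise_score)
```

In this toy case the two measures disagree because the unnormalised centroid is dominated by the longer vector, so the choice can matter. Normalising the term vectors before averaging should reduce this effect.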


### Morphology
The replacement tests are done with specific forms, e.g. the plural of *globalist*: "globalister". Hence, the responses (replacements) will typically align with the target form: "elitist*er*". Also, *förortsgäng* is a collective noun, and its replacement is typically plural ("kriminella ungdom*ar*"). These considerations pose a question about how (if at all) to expand *L* by including morphologically related forms when building the in/outgroup dimensions.

### Should multiple words be combined (e.g., *skicka tillbaka*)? If so, how? 
(https://www.baeldung.com/cs/sentence-vectors-word2vec)

For the SGNS models, it is easier if we ignore phrases and exclude "stopwords". But otherwise, the following methods can be considered to combine vectors:

* addition
* averaging
* weighted averaging
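
The three options can be sketched as follows. The function name `combine` and the toy vectors are assumptions for illustration only; real vectors would come from the SGNS model.

```python
import numpy as np

def combine(vectors, method="average", weights=None):
    """Combine word vectors (one per row) into a single phrase vector."""
    vectors = np.asarray(vectors, dtype=float)
    if method == "addition":
        return vectors.sum(axis=0)
    if method == "average":
        return vectors.mean(axis=0)
    if method == "weighted":
        # np.average normalises by the sum of the weights
        return np.average(vectors, axis=0, weights=weights)
    raise ValueError(f"unknown method: {method}")

# Toy vectors (hypothetical) standing in for "skicka" and "tillbaka"
skicka = [1.0, 2.0, 0.0]
tillbaka = [3.0, 0.0, 2.0]

added = combine([skicka, tillbaka], "addition")
averaged = combine([skicka, tillbaka], "average")
weighted = combine([skicka, tillbaka], "weighted", weights=[3, 1])
print(added, averaged, weighted)
```

Addition and averaging differ only by a scale factor, so they give the same cosine similarity to any other vector. Only weighting changes the direction of the combined vector.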

### How to avoid overlap of dimensions? Keyness?
For example, for *förortsgäng*, *kriminell* is a frequent term in replacements of both the ingroup and the outgroup. To handle this, some kind of keyword methodology can be considered (cf. corpus linguistics and measures of "keyness", which consider the probability of term *t* in an ingroup vs. an outgroup "corpus").
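
One simple keyness measure is the ratio of a term's relative frequency in the two sets of replacements. The sketch below uses invented counts, and the function name `relative_frequency_ratio` is hypothetical; other measures (e.g. log-likelihood) are equally possible.

```python
from collections import Counter

def relative_frequency_ratio(target, reference, n_target, n_reference):
    """Ratio of a term's relative frequency in the target vs. the reference replacements.

    Terms that never occur in the reference get float('inf').
    """
    scores = {}
    for word, count in target.items():
        if reference[word] == 0:  # a Counter returns 0 for missing keys
            scores[word] = float("inf")
        else:
            scores[word] = (count / n_target) / (reference[word] / n_reference)
    return scores

# Invented counts: number of replacements ("documents") containing each term
ingroup = Counter({"kriminell": 40, "invandrare": 30, "gäng": 10})
outgroup = Counter({"kriminell": 50, "ungdomar": 25, "gäng": 5})

scores = relative_frequency_ratio(ingroup, outgroup, n_target=100, n_reference=100)
print(scores)
```

With these invented numbers, *kriminell* scores below 1 (slightly more typical of the outgroup), while a term that only occurs in the ingroup replacements gets an infinite score.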

### Some definitions
* *I*<sub>1</sub> =<sub>df.</sub> words in first replacement coded as ingroup meanings
* *O*<sub>1</sub> =<sub>df.</sub> words in first replacement coded as outgroup meanings
* *I*<sub>2</sub> =<sub>df.</sub> words in second replacement coded as ingroup meanings
* *O*<sub>2</sub> =<sub>df.</sub> words in second replacement coded as outgroup meanings

* *I*<sub>*both*</sub> =<sub>df.</sub> *I*<sub>1</sub> ⋃ *I*<sub>2</sub> 
* *O*<sub>*both*</sub> =<sub>df.</sub> *O*<sub>1</sub> ⋃ *O*<sub>2</sub>

#### A-List
*A*<sub>*X*</sub> =<sub>df.</sub> a selected set of words of set *X*, where *X*=*I* (ingroup) or *X*=*O* (outgroup), given some selection function ***Select()***, such that *Select*(*X*) = *A*<sub>*X*</sub>.

For example, let ***Select*<sub>*All*</sub>** be the function that selects all words of *X* (*Select*<sub>*All*</sub>(*X*) = *X*), and let ***Select*<sub>*TopK*</sub>** be the function that selects the *K* most frequent words in *X*; i.e., *Select*<sub>*TopK*</sub>(*X*) = {*x*: *x* is among the first *K* elements of *rank*(*X*, *f*)}, where *rank*(*X*, *f*) ranks the elements of *X* given some feature *f* (e.g., frequency).
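
A minimal sketch of the two selection functions, assuming *X* is represented as a `Counter` mapping each word to its frequency. The names `select_all` and `select_top_k` and the counts are illustrative only.

```python
from collections import Counter

def select_all(X):
    """Select_All: keep every word of X."""
    return set(X)

def select_top_k(X, k):
    """Select_TopK: the k highest-ranked words of X, ranked here by frequency."""
    return {word for word, _ in Counter(X).most_common(k)}

# X as a Counter mapping word -> frequency (invented numbers)
X = Counter({"elitister": 12, "eliten": 7, "makthavare": 3, "rika": 1})

print(select_all(X))
print(select_top_k(X, 2))
```

Ranking by another feature *f* (e.g. keyness) would only require replacing the frequency counts with the corresponding scores.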

#### B-List
*B*<sub>*X*</sub> =<sub>df.</sub> the set of words used to model the vector of dimensions **ingroup** (*X*=*I*) or **outgroup** (*X*=*O*). 

*B*<sub>*X*</sub> is related to *A*<sub>*X*</sub> via the function ***F***, such that *F*(*A*<sub>*X*</sub>) = *B*<sub>*X*</sub>. *F* should at least select the words in *A*<sub>*X*</sub> that are also in the vocabulary *V* of the `word2vec` model (matrix). A minimalist version of *F*, i.e. ***WINV***, does this and nothing else. More or less sophisticated versions of *F* can be defined by adding further steps to this minimal requirement (see below).
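
As a sketch, *WINV* amounts to a vocabulary filter. The function name `winv` and the toy vocabulary are assumptions; with a gensim `KeyedVectors` object, its `key_to_index` mapping could play the role of *V*.

```python
def winv(A, vocabulary):
    """WINV: keep the words of A that are in the model vocabulary, preserving order."""
    return [w for w in A if w in vocabulary]

# Toy vocabulary and A-list (hypothetical)
V = {"elitister", "eliten", "makthavare"}
A = ["elitister", "etablissemanget", "eliten"]

B = winv(A, V)
print(B)
```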

### Methods:
#### Really Naive (RN)
* *B*<sub>*I*</sub> = *WINV*(*I*<sub>*both*</sub>)
* *B*<sub>*O*</sub> = *WINV*(*O*<sub>*both*</sub>)

*Problems:* 
* Overlap of dimensions (*I* and *O* share terms for some DWEs)
* Misses morphologically related forms
* Gives equal attention to common and less common replacements

#### Naive No Overlap (NNO)
* *B*<sub>*I*</sub> = *WINV*({*w*: *w* in *I*<sub>*both*</sub> and *w* not in *O*<sub>*both*</sub>})
* *B*<sub>*O*</sub> = *WINV*({*w*: *w* in *O*<sub>*both*</sub> and *w* not in *I*<sub>*both*</sub>})

*Problems:* 
* The NO criterion can be too strong (note coding of replacements)!
* Shares problems with RN
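
The NNO construction can be sketched with plain set operations. The function name and the word sets below are invented for illustration; the returned `shared` set shows what the NO criterion throws away.

```python
def naive_no_overlap(I_both, O_both, vocabulary):
    """Build B_I and B_O from words that occur in only one of the two sets."""
    shared = set(I_both) & set(O_both)
    B_I = sorted(w for w in set(I_both) - shared if w in vocabulary)
    B_O = sorted(w for w in set(O_both) - shared if w in vocabulary)
    return B_I, B_O, shared

# Invented example sets and vocabulary
I_both = {"kriminell", "invandrare", "invandrargäng"}
O_both = {"kriminell", "ungdomar", "ungdomsgäng"}
V = {"kriminell", "invandrare", "ungdomar", "ungdomsgäng"}

B_I, B_O, shared = naive_no_overlap(I_both, O_both, V)
print(B_I, B_O, shared)
```

Inspecting `shared` for a given DWE is one way to judge whether the NO criterion is too strong.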

#### Top 1 (T1), including NO


#### Top 3 (T3), including NO


#### Multi-Steps (MS) approaches

For ***Select()***:

|Feature  |Top|Threshold|
|---------|---|---------|
|Frequency|*a*|*b*      |
|Keyness  |*c*|*d*      |


For ***F()***:
for *w* in *A*:
* Identify lemma (or stem) *s* of *w*
* (split compounds?)
* Identify the inflectional paradigm *G* for *s*; *G* = {wordform *w*: *w* is an inflectional variant of *s*}
* *G*<sup>*WINV*</sup> = *WINV*(*G*)
* *C* = *FREQ*(*G*<sup>*WINV*</sup>)

Next, three options:
* Ignorant: ignore *C*, *β* = *G*<sup>*WINV*</sup>; *B* = *β*<sub>1</sub> ⋃ ... *β*<sub>*n*</sub> (*n* is the length of *A*)
* Threshold: *β* = {*w*: *C*(*w*) < *T*} where T is some threshold 
* Weighting: Use *C* as weights in building the dimension
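
The weighting option could look like the following sketch, which assumes that *C* maps each word form to its corpus frequency and that the dimension is a weighted mean of the word-form vectors. The function name, vectors and counts are hypothetical.

```python
import numpy as np

def weighted_dimension(vectors, counts):
    """Frequency-weighted mean of word-form vectors (the "Weighting" option)."""
    forms = sorted(vectors)
    matrix = np.array([vectors[w] for w in forms], dtype=float)
    weights = np.array([counts[w] for w in forms], dtype=float)
    return np.average(matrix, axis=0, weights=weights)

# Toy vectors and frequencies (hypothetical) for two forms of one paradigm
vectors = {"elitist": [1.0, 0.0], "elitister": [0.0, 1.0]}
counts = {"elitist": 1, "elitister": 3}

dimension = weighted_dimension(vectors, counts)
print(dimension)
```

Here the more frequent form pulls the dimension towards its own vector. Whether raw or dampened (e.g. log-scaled) frequencies are preferable is an open design choice.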




HOW SIMILAR ARE WORDS WITHIN AN INFLECTIONAL PARADIGM? ---> ignore outliers - or keep them!?

#### Manual!
...


## 2. AVG BERT
### Scope
Complete or partial? I.e., all `meaning = 1` or just a selection? (a question similar to determine *L* above)

### Out of vocabulary
Will out of vocabulary be a problem for a non-domain adapted BERT? Will the tokenization of BERT "take care" of this? I.e., given the short "sentences" in replacements, will BERT be able to differentiate in/outgroup dimensions? This is the question of Hertzberg (et al.). And the answer was "yes" (but note fine-tuned vs. not fine-tuned). 

## 3. CLT BERT
The same questions of scope and OOV as for AVG BERT apply to CLT BERT. Additional options for how to relate in/outgroup dimensions to clusters include:
* Use the centroids of clusters
* Use Jaccard similarity with the top terms of clusters, given some weighted frequency distribution; consider, for example, class-based TF-IDF.
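
The Jaccard option can be sketched as below. The term sets are invented; in practice the cluster's top terms would come from whichever weighting scheme is chosen.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of terms: |a ∩ b| / |a ∪ b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

# Invented example: in-group terms vs. the top terms of one cluster
ingroup_terms = {"utvisa", "deportering", "tvinga"}
cluster_top_terms = {"utvisa", "tvinga", "flytta", "hemresa"}

score = jaccard(ingroup_terms, cluster_top_terms)
print(score)
```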

What about *hjälpa på plats*?

In [None]:
##### Stemmer

In [None]:
# https://www.nltk.org/api/nltk.stem.SnowballStemmer.html?highlight=stopwords
from nltk.stem import SnowballStemmer # See which languages are supported

In [None]:
print(" ".join(SnowballStemmer.languages)) 
# arabic danish dutch english finnish french german hungarian
# italian norwegian porter portuguese romanian russian
# spanish swedish
stemmer = SnowballStemmer("german")
stemmer.stem("Autobahnen") # Stem a word
# 'autobahn'

In [None]:
stemmer = SnowballStemmer("swedish")
stemmer.stem("deporterar") # Stem a word

In [None]:
import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize')
doc = nlp('This is a test sentence for stanza. This is another sentence.')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')


In [None]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma')
doc = nlp('Barack Obama was born in Hawaii.')
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')

In [1]:
import stanza
nlp = stanza.Pipeline(lang='sv', processors='tokenize,pos,lemma')

2023-11-15 09:16:28 INFO: Loading these models for language: sv (Swedish):
| Processor | Package   |
-------------------------
| tokenize  | talbanken |
| pos       | talbanken |
| lemma     | talbanken |

2023-11-15 09:16:28 INFO: Use device: cpu
2023-11-15 09:16:28 INFO: Loading: tokenize
2023-11-15 09:16:29 INFO: Loading: pos
2023-11-15 09:16:30 INFO: Loading: lemma
2023-11-15 09:16:30 INFO: Done loading processors!


In [4]:
#nlp = stanza.Pipeline(lang='sv', processors='tokenize,pos,lemma')
doc = nlp("invandrarungdomar")
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')
# doc = nlp('This is a test sentence for stanza. This is another sentence.')
# for i, sentence in enumerate(doc.sentences):
#     print(f'====== Sentence {i+1} tokens =======')
#     print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

word: invandrarungdomar 	lemma: invandrarungdom


In [6]:
doc.sentences[0].words[0].lemma

'invandrarungdom'

In [7]:
def lemmatizer(w):
    doc = nlp(w)
    return doc.sentences[0].words[0].lemma

In [8]:
lemmatizer("banan")

'bana'

In [10]:
lemmatizer("bananen")

'banan'

In [11]:
lemmatizer("kluckelimuckare")

'kluckelimuckare'

In [12]:
import unimorph
dir(unimorph)

['CITATION',
 'List',
 'UNIMORPH_DIR',
 'UNIMORPH_DIR_',
 'USERHOME',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '__version__',
 'analyze_word',
 'argparse',
 'download_unimorph',
 'get_list_of_datasets',
 'inflect_word',
 'is_empty',
 'load_dataset',
 'logging',
 'main',
 'not_loaded',
 'os',
 'parse_args',
 'pathlib',
 'pd',
 'subprocess',
 'sys']

In [17]:
for s in ["äter", "elit"]:
    doc = nlp(s)
    for sent in doc.sentences:
        for word in sent.words:
            print(word.text, "-->", word.lemma)
            print(unimorph.inflect_word(word.lemma, lang="swe"))
    

äter --> äta
äta	äta	V;NFIN;ACT
äta	ätas	V;NFIN;PASS
äta	ätit	V;V.CVB;ACT
äta	ätits	V;V.CVB;PASS
äta	ät	V;IMP;ACT
äta	äter	V;IND;SG;ACT;PRS
äta	åt	V;IND;SG;ACT;PST
äta	äts	V;IND;SG;PASS;PRS
äta	ätes	V;IND;SG;PASS;PRS
äta	åts	V;IND;SG;PASS;PST
äta	äta	V;IND;PL;ACT;PRS
äta	åto	V;IND;PL;ACT;PST
äta	ätas	V;IND;PL;PASS;PRS
äta	åtos	V;IND;PL;PASS;PST
äta	äte	V;SBJV;ACT;PRS
äta	åte	V;SBJV;ACT;PST
äta	ätes	V;SBJV;PASS;PRS
äta	åtes	V;SBJV;PASS;PST
äta	ätande	V;V.PTCP;PRS
äta	äten	V;V.PTCP;PST

elit --> elit



In [23]:
unimorph.inflect_word("mata", lang="swe")

'mata\tmata\tV;NFIN;ACT\nmata\tmatas\tV;NFIN;PASS\nmata\tmatat\tV;V.CVB;ACT\nmata\tmatats\tV;V.CVB;PASS\nmata\tmata\tV;IMP;ACT\nmata\tmatar\tV;IND;SG;ACT;PRS\nmata\tmatade\tV;IND;SG;ACT;PST\nmata\tmatas\tV;IND;SG;PASS;PRS\nmata\tmatades\tV;IND;SG;PASS;PST\nmata\tmata\tV;IND;PL;ACT;PRS\nmata\tmatade\tV;IND;PL;ACT;PST\nmata\tmatas\tV;IND;PL;PASS;PRS\nmata\tmatades\tV;IND;PL;PASS;PST\nmata\tmate\tV;SBJV;ACT;PRS\nmata\tmatade\tV;SBJV;ACT;PST\nmata\tmates\tV;SBJV;PASS;PRS\nmata\tmatades\tV;SBJV;PASS;PST\nmata\tmatande\tV;V.PTCP;PRS\nmata\tmatad\tV;V.PTCP;PST\n'

In [26]:
[line.split("\t")[1] for line in unimorph.inflect_word("mata", lang="swe").split("\n") if line != ""]

['mata',
 'matas',
 'matat',
 'matats',
 'mata',
 'matar',
 'matade',
 'matas',
 'matades',
 'mata',
 'matade',
 'matas',
 'matades',
 'mate',
 'matade',
 'mates',
 'matades',
 'matande',
 'matad']

In [28]:
lemma = "mat"
wfs = [line.split("\t")[1] for line in unimorph.inflect_word(lemma, lang="swe").split("\n") if line != ""]
print(wfs)

[]
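
The list comprehension used above returns the same form several times (e.g. *mata*). A small helper can parse the tab-separated output and deduplicate the forms. This is a sketch that only assumes the lemma / form / features layout visible in the outputs above; the name `unique_forms` is hypothetical.

```python
def unique_forms(inflection_output):
    """Return the unique word forms (second column) in first-seen order."""
    forms = []
    for line in inflection_output.splitlines():
        columns = line.split("\t")
        if len(columns) >= 2 and columns[1] not in forms:
            forms.append(columns[1])
    return forms

# A shortened sample in the same layout as the output shown above
sample = "mata\tmata\tV;NFIN;ACT\nmata\tmatas\tV;NFIN;PASS\nmata\tmata\tV;IMP;ACT\n"

forms = unique_forms(sample)
print(forms)
```

An empty string, as returned for lemmas that are not covered, simply yields an empty list.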


In [None]:
# http://www.cs.cmu.edu/~aanastas/software/unimorph_inflect.html
# https://pypi.org/project/unimorph/
# https://unimorph.github.io/
# NOT GOOD! Rule based

In [None]:
pd.DataFrame([wf.split("\t") for wf in unimorph.inflect_word("invandring", lang="swe").split("\n")])

In [None]:
unimorph.inflect_word("invandrarungdomar", lang="swe")

In [None]:
unimorph.analyze_word("invandring", lang="swe")

In [None]:
import sys
sys.path

In [None]:
import os
import pandas as pd
import numpy as np
from gensim.models import KeyedVectors
from pathlib import Path
from collections import Counter
from itertools import combinations
#from util import load_metric, read_util

from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import ttest_ind

In [None]:
import stanza
nlp = stanza.Pipeline(lang='sv', processors='tokenize,pos,lemma')

## Data

In [None]:
with open(Path("../../data/utils/stopwords-sv.txt")) as f:
    stopwords = [sw.replace("\n", "") for sw in f.readlines()]

In [None]:
path_to_data = Path("/home/max/Documents/research/replacement_data/panel_wide_onlyreplace.csv")
replacement_data = pd.read_csv(path_to_data, sep="\t")

In [None]:
#replacement_data

In [None]:
#replacement_data = replacement_data.drop(0)

In [None]:
#replacement_data

In [None]:
# Lowercase all string cells; other values (e.g. NaN, numeric codes) are left untouched
replacement_data = replacement_data.applymap(lambda s: s.lower() if isinstance(s, str) else s)

In [None]:
replacement_data

In [None]:
set(w.split("_")[0] for w in replacement_data.columns)

## Functions

In [None]:
replacement_data.columns

In [None]:
replacement_data.loc[replacement_data['forortsgang_w1_C'] == 1, "forortsgang_text_w1"]

In [None]:
def inspect(
    df,            # Replacement Dataframe
    dwe,           # Dog Whistle Expression
    meaning,       # 1 for ingroup, 2 for outgroup
    phase,         # 1 for first phase of data collection, 2 for second phase
    sw = None,     # stopwords
    punct = None,  # remove punctuations
    verbose = True,
    multi = False, # Keep the multi-word units of the replacements
    rel_freq = False # use relative frequencies: freq / no. of documents
):
    
    counter = Counter()
    
    if type(df) == pd.DataFrame:
        column = df.loc[df[f"{dwe}_w{phase}_C"] == meaning, f"{dwe}_text_w{phase}"]
    else:
        column = df
    
    for x in column:
        if punct != None:
            for p in punct:
                x = x.replace(p, "")
        x = x.split()
        if sw != None:
            x = [w for w in x if w not in sw]
        
        if multi:
            x = ["_".join(x)]
        
        counter.update(set(x)) # Obs. terms are only counted once per "document"
    
    if rel_freq:
        counter = Counter({w: c/len(column) for w,c in counter.items()})
        
    if verbose:
        for w, f in sorted(counter.items(), key = lambda x: x[1], reverse = True)[:15]:
            print(f"{w:<30}{f}")
        print("-----------------------")
        print("Total no. of types:", len(counter))
    
    

    return counter

In [None]:
x = inspect(replacement_data,"forortsgang",1,1,sw=stopwords,verbose=False)

In [None]:
sum(x.values())

In [None]:
max(x.values())

In [None]:
len(x) # Number of words, length of vocabulary

In [None]:
replacement_data[replacement_data["forortsgang_w1_C"] == 1].shape[0] # Number of "documents"

In [None]:
def keyness(trg, ref, min_frq = 3, verbose = True): # Consider metric
    
    d = dict()
    
#     trg_tot = sum(trg.values())
#     ref_tot = sum(ref.values())
    trg_tot = len(trg)
    ref_tot = len(ref)
    
    for w in trg.keys():
        if trg[w] < min_frq:
            continue
        if w in ref:
            d[w] = (trg[w] / trg_tot) / (ref[w] / ref_tot) # Ratio of relative frequencies (referred to as OR here; strictly speaking not an odds ratio)
        else:
            d[w] = np.inf
    
    if verbose:
        for word, trg_freq, keyness  in sorted([(w, trg[w], k) for w, k in d.items()], key = lambda x: x[1], reverse = True)[:20]:
            if word in ref:
                ref_freq = ref[word]
            else:
                ref_freq = 0
            print(f"{word:<30}{trg_freq:<4}{(trg_freq/trg_tot):<6.3f}{ref_freq:<4}{(ref_freq/ref_tot):<6.3f}{keyness:.4}")        
    
    return d
    

In [None]:
def select_A(
    df,            # Replacement Dataframe
    dwe,           # Dog Whistle Expression
    phase,         # 1 for first phase of data collection, 2 for second phase, "both" for both
    sw = None,     # stopwords
    punct = None,  # remove punctuations
    k = None,
    min_freq = None,
    min_OR = None,
    empty_intersect = False
):
    
    if type(k) == tuple:
        k_in, k_out = k
    else:
        k_in  = k
        k_out = k
    if type(min_freq) == tuple:
        min_freq_in, min_freq_out = min_freq
    else:
        min_freq_in  = min_freq
        min_freq_out = min_freq
    if type(min_OR) == tuple:
        min_OR_in, min_OR_out = min_OR
    else:
        min_OR_in  = min_OR
        min_OR_out = min_OR
    
    if phase == "both":
        x = pd.concat([
            df.loc[df[f"{dwe}_w{1}_C"] == 1, f"{dwe}_text_w{1}"],
            df.loc[df[f"{dwe}_w{2}_C"] == 1, f"{dwe}_text_w{2}"]
        ]).to_list()
                
        y = pd.concat([
            df.loc[df[f"{dwe}_w{1}_C"] == 2, f"{dwe}_text_w{1}"],
            df.loc[df[f"{dwe}_w{2}_C"] == 2, f"{dwe}_text_w{2}"]
        ]).to_list()

        ingroup = inspect(x, dwe, None, None, sw, punct, verbose = False, rel_freq = True)
        outgroup = inspect(y, dwe, None, None, sw, punct, verbose = False, rel_freq = True)
#         _ = inspect(x, dwe, None, None, sw, punct, multi = True)   
#         _ = inspect(y, dwe, None, None, sw, punct, multi = True)   

        keyness_in2out = keyness(ingroup, outgroup, verbose = False, min_frq = -1)
        keyness_out2in = keyness(outgroup, ingroup, verbose = False, min_frq = -1)
        
    else:    
    
        ingroup = inspect(df, dwe, 1, phase, sw, punct, verbose = False, rel_freq = True)
        outgroup = inspect(df, dwe, 2, phase, sw, punct, verbose = False, rel_freq = True)
#         _ = inspect(df, dwe, 1, phase, sw, punct, multi = True)
#         _ = inspect(df, dwe, 2, phase, sw, punct, multi = True)
        keyness_in2out = keyness(ingroup, outgroup, verbose = False, min_frq = -1)
        keyness_out2in = keyness(outgroup, ingroup, verbose = False, min_frq = -1)
    
    A_in  = [w for w in ingroup.keys()]
    A_out = [w for w in outgroup.keys()]
    #print(A_out)
    
    if empty_intersect:
        A_in  = [w for w in A_in if w not in outgroup.keys()]
        A_out = [w for w in A_out if w not in ingroup.keys()]
        
    if min_freq != None:
        A_in  = [w for w in A_in if ingroup[w] >= min_freq_in]
        A_out = [w for w in A_out if outgroup[w] >= min_freq_out]
    
    #print(A_out)
    
    if min_OR != None:
        A_in  = [w for w in A_in if keyness_in2out[w] >= min_OR_in]
        A_out = [w for w in A_out if keyness_out2in[w] >= min_OR_out] # too strict to have the same threshold for both
        
    #print(A_out)    
    
    if k != None:
        A_in  = [w for w,_ in sorted(ingroup.items(), key = lambda x: x[1], reverse = True) if w in A_in][:k_in]
        A_out = [w for w,_ in sorted(outgroup.items(), key = lambda x: x[1], reverse = True) if w in A_out][:k_out]
    
    
    return A_in, A_out

In [None]:
def lemmatizer(word):
    
    doc = nlp(word)
    
    lemma = doc.sentences[0].words[0].lemma
    
    return lemma

In [None]:
doc = nlp("äter")
doc.sentences[0].words[0].lemma

In [None]:
# for s in ["äter", "maten"]:
#     doc = nlp(s)
#     for sent in doc.sentences:
#         for word in sent.words:
#             print(word.text, "-->", word.lemma)
#             print(unimorph.inflect_word(word.lemma, lang="swe"))    

In [None]:
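def build_B_prim(A, lemmatize = False, realize = False):
    
    B_prim = list(A)
    
    if lemmatize:
        B_prim = [lemmatizer(w) for w in B_prim]
    if realize:
        pass # TODO: expand each lemma to its inflectional paradigm (not implemented yet)
        
    return B_prim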

In [None]:
def build_B(B_prim, vocab, wv, min_freq, exclude_marginal, margin):
    
    B = [w for w in B_prim if w in wv.key_to_index and w in vocab]
    
    if min_freq:
        pass
    if exclude_marginal:
        pass
    
#     B.sort()
    
    return B

    
    
    
    

In [None]:
def other(df, dwe, meaning, phase, criteria, verbose = True):
    
    oth = []
    
    if type(df) == pd.DataFrame:
        column = df.loc[df[f"{dwe}_w{phase}_C"] == meaning, f"{dwe}_text_w{phase}"]
    else:
        column = df

    for x in column:
        keep = True
        for criterion in criteria:
            if criterion in x:
                keep = False
                break # one matching criterion is enough
        if keep:
            oth.append(x)
    
    if verbose:
        for x in oth:
            print(x)
    
    return oth


In [None]:
def inspect_all(
    df,            # Replacement Dataframe
    dwe,           # Dog Whistle Expression
    phase,         # 1 for first phase of data collection, 2 for second phase
    sw = None,     # stopwords
    punct = None,  # remove punctuations
):
    
    if phase == "both":
        print("Most frequent")
        print("=============")        
        
        x = pd.concat([
            df.loc[df[f"{dwe}_w{1}_C"] == 1, f"{dwe}_text_w{1}"],
            df.loc[df[f"{dwe}_w{2}_C"] == 1, f"{dwe}_text_w{2}"]
        ]).to_list()
        
                
        y = pd.concat([
            df.loc[df[f"{dwe}_w{1}_C"] == 2, f"{dwe}_text_w{1}"],
            df.loc[df[f"{dwe}_w{2}_C"] == 2, f"{dwe}_text_w{2}"]
        ]).to_list()

        print("In-group:")
        print("(a) Split")
        ingroup = inspect(x, dwe, None, None, sw, punct)
        print()
        print("(b) Keep phrases")
        _ = inspect(x, dwe, None, None, sw, punct, multi = True)   
        print("Out-group:")
        print("(a) Split")
        outgroup = inspect(y, dwe, None, None, sw, punct)
        print()
        print("(b) Keep phrases")
        _ = inspect(y, dwe, None, None, sw, punct, multi = True)   

        
        print()
        print("Keyness")
        print("=======")
        print(f"(a) in-group --> out-group")
        _ = keyness(ingroup, outgroup)
        print()
        print(f"(b) out-group --> in-group")
        _ = keyness(outgroup, ingroup)
        print()        
        
    else:    
    
        print("Most frequent")
        print("=============")
        print("In-group:")
        print("(a) Split")
        ingroup = inspect(df, dwe, 1, phase, sw, punct)
        print()
        print("(b) Keep phrases")
        _ = inspect(df, dwe, 1, phase, sw, punct, multi = True)
        print()
        print("Out-group:")
        print("(a) Split")
        outgroup = inspect(df, dwe, 2, phase, sw, punct)
        print()
        print("(b) Keep phrases")
        _ = inspect(df, dwe, 2, phase, sw, punct, multi = True)
        print()

        print("Keyness")
        print("=======")
        print("(a) in-group --> out-group")
        _ = keyness(ingroup, outgroup)
        print()

        print("(b) out-group --> in-group")
        _ = keyness(outgroup, ingroup)
        print()

    

In [None]:
#inspect_all(replacement_data, "forortsgang", "both", stopwords, [","])

In [None]:
def collect_forms(base_or_paradigm, m_dir, f_dir, stop = None):
    """ Collects word forms starting with `base`
    """
    
    summary = []
    sim_matrices = {}
    
    for file in sorted(os.listdir(m_dir)):
        
        if not file.endswith(".w2v"):
            continue
        print(file)    
        year = int(file.replace(".w2v", ""))
        if stop != None and year > stop:
            break
            
        wv = KeyedVectors.load_word2vec_format(Path(m_dir) / file)
        vocab = load_metric(Path(f_dir) / file.replace(".w2v", ".txt")) # `load_metric` is expected from the project's `util` module (see the commented-out import)
        
        print(len(wv), len(vocab))

        if type(base_or_paradigm) == str:
            g = [w for w in wv.key_to_index if w.startswith(base_or_paradigm)]
        else:
            assert type(base_or_paradigm) == list, "`base_or_paradigm` is neither string nor list"
            g = [w for w in base_or_paradigm if w in wv.key_to_index]
        for w in g:
            if w not in vocab:
                print(w, "not in vocab")
        g = sorted(w for w in g if w in vocab)
    #     print(g)

        
        c = [(w, int(vocab[w])) for w in g]
        summary.append(
            {
            "year": year, 
            "n": len(g), 
            "forms": g, 
            "counts": c,
            "forms_str": ", ".join([f"{w} {n}" for w, n in [(w, vocab[w]) for w in g]])
            }
        )

        vectors = [wv[w] for w in g]

        sim_matrix = cosine_similarity(vectors)

        sim_matrices[year] = pd.DataFrame(sim_matrix, index=g, columns=g)
    
    
    return summary, sim_matrices

In [None]:
for x, y in combinations(np.array([[1,2,3,4],[4,5,7,1],[4,2,8,7]]), 2):
    print(cosine_similarity([x,y])[0][1])

In [None]:
np.array([[1,2,3], [1,2,3]]).mean(axis=0)

In [None]:
def pairwise_comparison(vectors):
    
    s = []
    
    for x, y in combinations(vectors, 2):
        similarity = cosine_similarity([x,y])[0][1] # returns a matrix
        s.append(similarity)
    
    s = np.array(s)
    
    return s
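
For larger sets of vectors, the same pairwise similarities can be computed without an explicit Python loop. The following is a sketch using NumPy only; it assumes that no vector is all zeros, and the name `pairwise_cosine` is hypothetical.

```python
import numpy as np

def pairwise_cosine(vectors):
    """Cosine similarity for every unordered pair of rows, as a 1-D array."""
    matrix = np.asarray(vectors, dtype=float)
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)  # row-normalise
    sims = unit @ unit.T                                           # full similarity matrix
    rows, cols = np.triu_indices(len(matrix), k=1)                 # upper triangle, diagonal excluded
    return sims[rows, cols]

# Toy vectors (hypothetical)
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

s = pairwise_cosine(vectors)
print(s)
```

The pairs come out in the same order as `itertools.combinations(vectors, 2)`, i.e. (0, 1), (0, 2), (1, 2).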

In [None]:
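def pipeline(dwe, repl_data, m_dir, f_dir, sw = None, punct = None,
             min_freq = None, exclude_marginal = False, margin = None):
    # Draft. Assumes that `dwe` is both the column prefix in `repl_data`
    # and a key in the word2vec models.
    
    results = {"ingroup": {}, "outgroup": {}}
    
    # A-lists for both dimensions (selection settings left at their defaults)
    A_in, A_out = select_A(repl_data, dwe, "both", sw = sw, punct = punct)
    
    for dim_term, A in [("ingroup", A_in), ("outgroup", A_out)]:
        
        B_prim = build_B_prim(A)

        for file in sorted(os.listdir(m_dir)):

            if not file.endswith(".w2v"):
                continue
            
            year = int(file.replace(".w2v", ""))
            
            wv = KeyedVectors.load_word2vec_format(Path(m_dir) / file)
            vocab = load_metric(Path(f_dir) / file.replace(".w2v", ".txt")) # from the project's `util` module
            
            B = build_B(B_prim, vocab, wv, min_freq, exclude_marginal, margin)
            
            if dwe not in wv.key_to_index or len(B) == 0:
                continue # nothing to compare for this year
            
            dwe_vec = wv[dwe]
            vectors = np.array([wv[w] for w in B])
            avg_vec = vectors.mean(axis = 0)
            
            similarity = cosine_similarity([dwe_vec, avg_vec])[0][1] # returns a matrix; pick the off-diagonal cell
            series = pairwise_comparison(vectors) # similarities among the terms of B
            
            results[dim_term][year] = {
                "centroid": similarity,
                "series": series,
                "pairwise_mean": series.mean() if len(series) > 0 else np.nan
            }
    
    # Compare the two dimensions for each year in which both are available
    for year in sorted(set(results["ingroup"]) & set(results["outgroup"])):
        t, p = ttest_ind(results["ingroup"][year]["series"], results["outgroup"][year]["series"])
        for dim_term in ["ingroup", "outgroup"]:
            results[dim_term][year]["t"] = t
            results[dim_term][year]["p"] = p
    
    return results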

In [None]:
for dim in [1, 2]:
    dim_term = "ingroup" if dim == 1 else "outgroup"
    print(dim_term)

##### Development

In [None]:
cosine_similarity([[1,2,3], [2,3,4], [4,5,6]])

In [None]:
models_at = Path("/home/max/Results/fb_pol-yearly-rad3/models")
vocabs_at = Path("/home/max/Corpora/flashback-pol-time/yearly/fb-pt-radical3/vocab")

In [None]:
summary, sim_mtrx = collect_forms("invandrare", models_at, vocabs_at, stop = 2005)

In [None]:
pd.DataFrame(summary)

In [None]:
for year in sim_mtrx.keys():
    print(year)
    df = sim_mtrx[year]
    print(df)
    print()


In [None]:
a = [np.array([1.2, 2.3]), np.array([5.6, 4.2])]

In [None]:
np.array(a).shape

In [None]:
vocab = {
    "mamma": 1,
    "mamman": 3,
    "mu": 100,
    "kråka": 1,
    "kråkor":20
}


Bigt = [("mamma", "mamman"), ("mu",), ("kråka", "kråkan", "kråkor")]
voc_B = {lexeme[0].upper(): {w: vocab[w] for w in lexeme if w in vocab} for lexeme in Bigt}

In [None]:
voc_B

In [None]:
voc_B["MAMMA"]

In [None]:
prop = {lexeme: {w: (voc_B[lexeme][w]/sum(voc_B[lexeme].values())) for w in voc_B[lexeme]} for lexeme in voc_B}

In [None]:
prop

In [None]:
1/20

In [None]:
sorted(voc_B["MAMMA"].items(), key = lambda x: x[1], reverse=True)[:1000]

In [None]:
#os.listdir(models_at)

In [None]:
load_metric(vocabs_at / "2000.txt")

In [None]:
model = "2004.w2v"
wv = KeyedVectors.load_word2vec_format(models_at / model)

In [None]:
type(wv["mamma"]) == np.ndarray


In [None]:
#wv.get_vector("sverige")

In [None]:
#wv["sverige"]

In [None]:
for model in sorted(os.listdir(models_at)):
    if not model.endswith(".w2v"):
        continue
    wv = KeyedVectors.load_word2vec_format(models_at / model)
    g = [w for w in wv.key_to_index if w.startswith("invandrare")]
    g.sort()
#     print(g)
    print(model[:4], len(g), ", ".join(g))

In [None]:
for model in sorted(os.listdir(models_at)):
    if not model.endswith(".w2v"):
        continue
    wv = KeyedVectors.load_word2vec_format(models_at / model)
    g = [w for w in wv.key_to_index if w.startswith("deportera")]
    g.sort()
#     print(g)
    print(model[:4], len(g), ", ".join(g))

In [None]:
for model in sorted(os.listdir(models_at)):
    if not model.endswith(".w2v"):
        continue
    wv = KeyedVectors.load_word2vec_format(models_at / model)
    g = [w for w in wv.key_to_index if w.startswith("elitist")]
    g.sort()
#     print(g)
    print(model[:4], len(g), ", ".join(g))

In [None]:
len(wv)

In [None]:
#wv[1]

In [None]:
"sverige" in wv

In [None]:
#wv["sverige"]

In [None]:
[w for w in wv.key_to_index if w.startswith("invandr")]

In [None]:
# get_mean_vector(keys, weights=None, pre_normalize=True, post_normalize=False, ignore_missing=True)
wv.get_mean_vector(['invandrare', 'invandrarna'], weights=[0.7, 0.3], pre_normalize=True, post_normalize=False, ignore_missing=True)

In [None]:
wv.similarity('invandrare', 'invandrarna')

In [None]:
wv.n_similarity(['invandrare'], ['invandrarna', 'invandrarnas'])

In [None]:
wv.similarity('invandrare', 'invandrargäng')

In [None]:
wv.similarity('invandrare', 'människor')

In [None]:
wv.most_similar("V1_berika")

In [None]:

igt = [
    "förstöra",
    "utnyttja",
    "förpesta",
    "försämra", 
    "brott",
    "fördärva"
    ]
ogt = [
    "förbättra",
    "tillföra",
    "bidra",
    "förgylla",
    "nytta",
    "hjälpa", 
    "främja",
    "bättre",
    "utveckla",
    "stärka",
    "gynna",
    "förhöja",
    "positiv"
    ]
wv.n_similarity(["V1_berika"], igt), wv.n_similarity(["V1_berika"], ogt)

##### Association metric

In [None]:
def association(
    trg,        # Targets (list)
    igt,        # In-group terms
    ogt,        # Out-group terms
    kwv,        # Keyed word vector (`gensim.models.KeyedVectors`)
    method = "centroid", # "centroid" for similarity of averaged vector (centroid); 
                         # "mean" for mean of similarity of each comparison
    weights_trg = None,  # weights of trg
    weights_igt = None,  # weights of igt
    weights_ogt = None,  # weights of ogt
):
    
    if method == "centroid":
        
        trg_vec = kwv.get_mean_vector(trg, weights=weights_trg, pre_normalize=True, post_normalize=False, ignore_missing=True)
        igt_vec = kwv.get_mean_vector(igt, weights=weights_igt, pre_normalize=True, post_normalize=False, ignore_missing=True)
        ogt_vec = kwv.get_mean_vector(ogt, weights=weights_ogt, pre_normalize=True, post_normalize=False, ignore_missing=True)
        
        igs = cosine_similarity([trg_vec, igt_vec])[0][1] # returns a matrix! must be indexed
        ogs = cosine_similarity([trg_vec, ogt_vec])[0][1]
        
        t = None
        p = None

    elif method == "mean":
        
        # `kwv.similarity` raises a KeyError for out-of-vocabulary items, so drop them first
        trg_iv = [w for w in trg if w in kwv.key_to_index]
        igt_iv = [w for w in igt if w in kwv.key_to_index]
        ogt_iv = [w for w in ogt if w in kwv.key_to_index]
        
        ig_scores = [kwv.similarity(trg_i, igt_i) for trg_i in trg_iv for igt_i in igt_iv]
        og_scores = [kwv.similarity(trg_i, ogt_i) for trg_i in trg_iv for ogt_i in ogt_iv]
        
        igs = np.mean(ig_scores) # mean over all target/term pairs
        ogs = np.mean(og_scores)
        
        t, p = ttest_ind(ig_scores, og_scores) # it is not perfectly clear that a t-test is a good idea here
    
    else:
        raise ValueError(f"Unknown method: {method}")
        
    return igs, ogs, t, p
    
    

In [None]:
##### Test: återvandring

In [None]:
cosine_similarity([[1,2,2], [2,33,1]])

In [None]:
igt = [
        "utvisa", 
        "skicka", 
        "deportering", 
        "tvångsförflyttning", 
        "kasta", 
        "sända", 
        "tvinga", 
        "slänga", 
        "avvisa",
        "sparka",
        "avhysa"
]
ogt = [
    "återvända",
    "återreas",
    "repatriering",
    "hemresa",
    "hemåtvända",
    "utvandring",
    "flytta"
    ]
trg = ["N1_återvandring"]
for file in sorted([file for file in os.listdir(models_at) if file.endswith(".w2v")]):
    
    wv = KeyedVectors.load_word2vec_format(models_at / file)
    
    print(file, association(trg, igt, ogt, wv))

In [None]:
# sorted({"a": 1, "b":20}.items(), key = lambda x: x[1], reverse=True)
sorted({"a": 1, "b":20, "c":10}.items(), key = lambda x: x[1])

## Explorations

### (1) Förortsgäng

In [None]:
inspect_all(replacement_data, "forortsgang", "both", stopwords, [","])

Notes:
* keyness > 5

Some considerations in coding:
* Coded as *O* when they should not be:
    * invandrargäng 9
    * invandrare 12
    * invandrarungdomar 2
    * invandrarkillar 1
    * utländsk 4
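
The notes above use a keyness threshold, and the `min_OR` argument of `select_A` suggests an odds ratio. How `inspect_all` actually computes keyness is not shown here, so the following is only a sketch of one common definition: the odds of a term among ingroup replacements divided by its odds among outgroup replacements, with 0.5 added to every cell to avoid division by zero. The function name `odds_ratio` and all counts are made up for illustration.

```python
def odds_ratio(count_in, total_in, count_out, total_out, smooth=0.5):
    """Smoothed odds ratio of a term in ingroup vs. outgroup replacements."""
    a = count_in + smooth               # term occurrences, ingroup
    b = total_in - count_in + smooth    # all other tokens, ingroup
    c = count_out + smooth              # term occurrences, outgroup
    d = total_out - count_out + smooth  # all other tokens, outgroup
    return (a / b) / (c / d)

# Made-up counts: 30 of 200 ingroup tokens vs. 3 of 200 outgroup tokens.
skewed = odds_ratio(30, 200, 3, 200)
# A term used equally often in both groups has an odds ratio of 1.
shared = odds_ratio(20, 200, 20, 200)
print(skewed > 5, shared)
```

Under this definition a term such as *kriminell*, which is frequent in both groups, ends up near 1 and falls below a threshold like 5.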

In [None]:
select_A(
    replacement_data,            # Replacement Dataframe
    "forortsgang",           # Dog Whistle Expression
    "both",         # 1 for first phase of data collection, 2 for second phase, "both" for both
    sw = stopwords,     # stopwords
    punct = [","],  # punctuation marks to remove
    k = 5,
    min_freq = None,
    min_OR = None,
    empty_intersect = True
)

In [None]:
select_A(
    replacement_data,            # Replacement Dataframe
    "forortsgang",           # Dog Whistle Expression
    "both",         # 1 for first phase of data collection, 2 for second phase, "both" for both
    sw = stopwords,     # stopwords
    punct = [","],  # punctuation marks to remove
    k = None,
    min_freq = 0.03,
    min_OR = (10, 2),
    empty_intersect = False
)
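
The calls above pass `k` and `empty_intersect` to `select_A`, whose implementation is not shown in this section. As a rough illustration of what such a selection might do (an assumption, not the notebook's actual logic), the sketch below takes the `k` most frequent terms per group and, when `empty_intersect` is true, drops terms that appear in both lists. The helper `top_k_terms` and the counts are hypothetical.

```python
from collections import Counter

def top_k_terms(in_counts, out_counts, k, empty_intersect=True):
    """Top-k terms per group, optionally with shared terms removed."""
    top_in = [w for w, _ in Counter(in_counts).most_common(k)]
    top_out = [w for w, _ in Counter(out_counts).most_common(k)]
    if empty_intersect:
        shared = set(top_in) & set(top_out)
        top_in = [w for w in top_in if w not in shared]
        top_out = [w for w in top_out if w not in shared]
    return top_in, top_out

# Made-up frequencies; "kriminell" is frequent in both groups.
in_counts = {"kriminell": 40, "invandrare": 25, "gäng": 10}
out_counts = {"kriminell": 35, "ungdomar": 20, "gäng": 5}
ig_terms, og_terms = top_k_terms(in_counts, out_counts, k=2)
print(ig_terms, og_terms)
```

Note that removing shared terms after taking the top `k` can leave fewer than `k` terms per group.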

#### Phase 1

In [None]:
inspect_all(replacement_data, "forortsgang", 1, stopwords, [","])

##### Ingroup terms

In [None]:
igt = [
        "invandr", 
        "blatt", 
        "utländsk", 
        "nysvensk", 
        "indvad",     # Misspelling ...
        "invandar", 
        "invadra", 
        "innvandr", 
        "indvandr"
    ]
_ = other(
    df = replacement_data, 
    dwe = "forortsgang", 
    meaning = 1, 
    phase = 1, 
    criteria = igt
)
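
The entries in `igt` are truncated stems (`invandr`, `blatt`) rather than full word forms, presumably so that one entry covers several inflections and compounds. How `other` applies them is not shown here. A minimal sketch of substring matching, with the hypothetical helper `split_by_stems` and made-up responses, could look like this:

```python
def split_by_stems(responses, stems):
    """Split responses into those containing any stem and the remainder."""
    matched, rest = [], []
    for response in responses:
        text = response.lower()
        if any(stem in text for stem in stems):
            matched.append(response)
        else:
            rest.append(response)
    return matched, rest

stems = ["invandr", "utländsk"]
responses = ["invandrargäng", "Invandrare", "utländska ungdomar", "kriminella"]
covered, uncovered = split_by_stems(responses, stems)
print(covered)
print(uncovered)
```

Substring matching is permissive: a short stem can also match unrelated words, so the uncovered and covered lists are both worth inspecting by hand.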

##### Outgroup terms

Hard to select; see the coding considerations above.

#### Phase 2

In [None]:
inspect_all(replacement_data, "forortsgang", 2, stopwords, [","])

##### Ingroup terms

In [None]:
igt = [
        "invandr", 
        "blatt", 
        "babb",
        "svartskall",
        "neger",
        "utlänning", 
        "utlandsfödd", 
        "ickesvensk",
        "nysvensk",
        "arab",
        "utländsk",
        "utomnordisk",
        "utomeuropeisk",
        "invanda",        # Misspelling
        "invadr",
        "ivandr"
    ] 
_ = other(
    df = replacement_data, 
    dwe = "forortsgang", 
    meaning = 1, 
    phase = 2, 
    criteria = igt
)
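
Listing each misspelling by hand (`invanda`, `invadr`, `ivandr`) can easily leave variants out. A possible alternative, sketched below with the standard-library `difflib` module, is to accept any response token whose similarity ratio to a reference form reaches a cutoff. The helper `is_near`, the reference form and the cutoff of 0.8 are assumptions that would need tuning against the actual responses.

```python
from difflib import SequenceMatcher

def is_near(token, reference, cutoff=0.8):
    """True if `token` is close enough to `reference` by difflib's ratio."""
    return SequenceMatcher(None, token.lower(), reference).ratio() >= cutoff

reference = "invandrare"
for token in ["indvandrare", "ivandrare", "kriminella"]:
    print(token, is_near(token, reference))
```

This compares whole tokens, so compounds such as *invandrargäng* would still need the stem-based criteria.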

##### Outgroup terms

Hard to select; see the coding considerations above.

### (2) Återvandring

In [None]:
inspect_all(replacement_data, "aterinvandring", "both", sw = stopwords, punct = [","])

In [None]:
select_A(
    replacement_data,            # Replacement Dataframe
    "aterinvandring",           # Dog Whistle Expression
    "both",         # 1 for first phase of data collection, 2 for second phase, "both" for both
    sw = stopwords,     # stopwords
    punct = [","],  # punctuation marks to remove
    k = None,
    min_freq = 0.03,
    min_OR = (10, 2),
    empty_intersect = False
)

#### Phase 1

In [None]:
inspect_all(replacement_data, "aterinvandring", 1, sw = stopwords, punct = [","])

##### Ingroup terms

In [None]:
igt = [
        "utvis", 
        "skick", 
        "deport", 
        "tvångsförflyttning", 
        "kast", 
        "sänd", 
        "tvinga", 
        "släng", 
        "avvis"
    ]
_ = other(
    df = replacement_data, 
    dwe = "aterinvandring", 
    meaning = 1, 
    phase = 1, 
    criteria = igt
)

*Comment:* Possibly miscoded: återvändande, tillbakaflytt, repatriering, hemvändande

##### Outgroup terms

In [None]:
ogt = [
    "hemvänd",
    "återvänd",
    "återres",
    "repatrier",
    "hemres",
    "hemåtvända",
    "utvandring",
    "flytt"
    ]
_ = other(
    df = replacement_data, 
    dwe = "aterinvandring", 
    meaning = 2, 
    phase = 1, 
    criteria = ogt
)

*Comment:* Possibly miscoded, in several cases: utvisning, etnisk rensning, skicka tillbaka, avvisa

#### Phase 2

In [None]:
inspect_all(replacement_data, "aterinvandring", 2, sw = stopwords, punct = [","])

##### Ingroup terms

In [None]:
igt = [
        "utvis", 
        "skick", 
        "deport", 
        "tvångsförflyttning", 
        "kast", 
        "sänd", 
        "tvinga", 
        "släng", 
        "avvis",
        "spark",
        "avhys",
        "förvis"
    ]
_ = other(
    df = replacement_data, 
    dwe = "aterinvandring", 
    meaning = 1, 
    phase = 2, 
    criteria = igt
)

*Comment:* wrong code? 

##### Outgroup terms

In [None]:
ogt = [
    "hemvänd",
    "återvänd",
    "återres",
    "repatrier",
    "hemres",
    "hemåtvända",
    "utvandring",
    "flytt"
    ]
_ = other(
    df = replacement_data, 
    dwe = "aterinvandring", 
    meaning = 2, 
    phase = 2, 
    criteria = ogt
)

*Comment:* Possibly miscoded: förpassning

### (3) Berikar

In [None]:
inspect_all(replacement_data, "berikar", "both", sw = stopwords, punct = [","])

#### Phase 1

In [None]:
inspect_all(replacement_data, "berikar", 1, sw = stopwords, punct = [","])

##### Ingroup terms

In [None]:
igt = [
    "förstör",
    "utnyttja",
    "förpest",
    "försämra", 
    "brott",
    "fördärv"
    ]
_ = other(
    df = replacement_data, 
    dwe = "berikar", 
    meaning = 1, 
    phase = 1, 
    criteria = igt
)

##### Outgroup terms

In [None]:
ogt = [
    "förbättra",
    "tillför",
    "bidra",
    "förgyll",
    "nytta",
    "hjälp", 
    "främja",
    "bättre",
    "utveckla",
    "stärk",
    "gynna",
    "förhöj",
    "positiv"
    ]
_ = other(
    df = replacement_data, 
    dwe = "berikar", 
    meaning = 2, 
    phase = 1, 
    criteria = ogt
)

#### Phase 2

In [None]:
inspect_all(replacement_data, "berikar", 2, sw = stopwords, punct = [","])

##### Ingroup terms

In [None]:
igt = [
    "förstör",
    "utnyttja",
    "förpest",
    "försämra", 
    "brott",
    "fördärv"
    ]
_ = other(
    df = replacement_data, 
    dwe = "berikar", 
    meaning = 1, 
    phase = 2, 
    criteria = igt
)

##### Outgroup terms

In [None]:
ogt = [
    "förbättra",
    "tillför",
    "bidra",
    "förgyll",
    "nytta",
    "hjälp", 
    "främja",
    "bättre",
    "utveckla",
    "stärk",
    "gynna",
    "förhöj", 
    "positiv"
    ]
_ = other(
    df = replacement_data, 
    dwe = "berikar", 
    meaning = 2, 
    phase = 2, 
    criteria = ogt
)

### (4) Globalister

In [None]:
inspect_all(replacement_data, "globalister", "both", sw = stopwords, punct = [","])

#### Phase 1

In [None]:
inspect_all(replacement_data, "globalister", 1, sw = stopwords, punct = [","])

##### Ingroup terms

In [None]:
igt = [
    "judar",
    "elit"
    ]
_ = other(
    df = replacement_data, 
    dwe = "globalister", 
    meaning = 1, 
    phase = 1, 
    criteria = igt
)

##### Outgroup terms

In [None]:
ogt = [
    "internationell" # ?

    ]
_ = other(
    df = replacement_data, 
    dwe = "globalister", 
    meaning = 2, 
    phase = 1, 
    criteria = ogt
)

#### Phase 2

In [None]:
inspect_all(replacement_data, "globalister", 2, sw = stopwords, punct = [",", "."])

##### Ingroup terms

In [None]:
igt = [
    "judar",
    "judisk",
    "soro",
    "elit"
    ]
_ = other(
    df = replacement_data, 
    dwe = "globalister", 
    meaning = 1, 
    phase = 2, 
    criteria = igt
)

##### Outgroup terms

In [None]:
ogt = [
    "internationell", # ?
    "värld"

    ]
_ = other(
    df = replacement_data, 
    dwe = "globalister", 
    meaning = 2, 
    phase = 2, 
    criteria = ogt
)

*Comment:* Polysemous dogwhistle... 
* landsförrädare, landsförädare, icke-nationalister; 
* kapitalister; 
* liberalister; 
* sossar