# The Worldview Toolkit

The worldview toolkit is a software package that accompanies the paper: "Discovering Multidimensional Worldview and Ideology with Embedding Alignment" (EMNLP 2021). The software was written by Jeremiah Milbauer.

## Introduction

This software package lets you replicate the experiments and analysis from our paper. It also lets you perform the same kind of analysis on whatever data might be interesting to you! For the experiments in our paper, we optimized some of the code for our particular computing infrastructure. This package does not include those optimizations.

## Usage

What follows below is a step-by-step guide to using the toolkit.

### 1. Set up the environment

First, set up your environment using either <code>requirements.txt</code> (pip) or <code>requirements.yml</code> (conda):

<code>pip install -r requirements.txt</code>

<code>conda create --name wvtk --file requirements.yml</code>

### 2. Collect Data

To use this toolkit, you must collect text data from the communities you wish to analyze. Ideally, you will be able to collect at least 1,000,000 sentences from each community. You can draw this text from anywhere -- we used online social media communities, but some work has also looked at how culture changes over time, treating each decade as a "community".

Once you have collected your data, preprocess it so that the following constraints are respected:
- Punctuation is not attached to words
- Words are lower case

You can also enforce some optional constraints, which may yield interesting results:
- Common bigrams and trigrams have been merged into phrases
- Words have been stemmed or lemmatized

For text preprocessing, consider using NLTK. In our paper, we used NLTK to preprocess the text and form phrases.
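The two required constraints can be sketched in a few lines. This is a minimal illustration that uses only the Python standard library rather than NLTK, so it is not the preprocessing used in the paper; the function name `preprocess_line` and the tokenization pattern are assumptions made for this example.

```python
import re

# Assumed tokenization rule for this sketch: a token is either a run of
# word characters (optionally joined by internal apostrophes or hyphens)
# or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")


def preprocess_line(line: str) -> str:
    """Lowercase a line and separate punctuation from words with spaces."""
    tokens = TOKEN_PATTERN.findall(line.lower())
    return " ".join(tokens)
```

For example, `preprocess_line("Hello, World!")` returns `"hello , world !"`. Phrase merging and stemming, the optional constraints, are left out of this sketch.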

### 3. Compute Corpus Statistics

This step computes the corpus statistics that the later steps depend on: word frequencies and word clusters.

You will need to compute the word frequencies for each worldview file. This will give you the shared vocabularies to use for alignment.

Run the following: <code>python3 ./src/stats/counts.py ./corpus ./data/counts.json</code>
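To make the idea concrete, here is a rough sketch of per-community word counting and of deriving a shared vocabulary from the counts. It is not the code in <code>counts.py</code>; the function names, the toy corpora, and the JSON layout (community name mapped to word counts) are all assumptions for illustration.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Set


def count_words(lines: Iterable[str]) -> Counter:
    """Count whitespace-separated tokens in preprocessed lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def shared_vocabulary(counts_by_community: Dict[str, Counter]) -> Set[str]:
    """Return the words that occur at least once in every community."""
    vocabularies = [set(counts) for counts in counts_by_community.values()]
    return set.intersection(*vocabularies) if vocabularies else set()


# Toy corpora standing in for two worldview files (hypothetical data).
counts_by_community = {
    "community_a": count_words(["the tax cut helps", "the market is free"]),
    "community_b": count_words(["the tax hurts", "the market is rigged"]),
}
shared = shared_vocabulary(counts_by_community)

# Counter is a dict subclass, so the counts serialize directly to JSON.
serialized = json.dumps(counts_by_community, sort_keys=True)
```

Only words in `shared` have a vector in every community model, which is why the counts matter for alignment.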

You will also need to compute a general-purpose embedding to use for word clustering. Choose a text file that best represents "generic" language among your communities. One option is the union of the worldview files; this is what we did, and it is the method implemented by <code>multitrain.sh</code>.

First, build the general-purpose embedding: <code>./src/modeling/multitrain.sh ./corpus ./data/master.model</code>

Then, compute the clusters: <code>python3 ./src/stats/topics.py ./data/master.model ./data/clusters.json</code>

Now you're ready to move on to model training!

### 4. Train the Community Models

Place your preprocessed text files, each representing one worldview, into <code>corpus/</code> (the same directory used in step 3).

Run: <code>./src/modeling/train.sh ./corpus ./data/models</code>

You will now have trained gensim word2vec models in <code>data/models</code>.

### 5. Align the Models

Your models should be located at <code>data/models/</code>

Run one of the following aligners:
   
- <code>python3 ./src/alignment/align.py ./data/models/ ./data/aligners/cca/ ./data/counts.json cca 1000</code>
- <code>python3 ./src/alignment/align.py ./data/models/ ./data/aligners/svd/ ./data/counts.json svd 1000</code> (Khudabukhsh, et al.)
- <code>python3 ./src/alignment/align.py ./data/models/ ./data/aligners/lstsq/ ./data/counts.json lstsq 1000</code>

If you have a lot of embeddings, this process may take a long time while the embeddings load.

Each command learns pairwise alignments between the model files in the directory and saves an <code>Aligner</code> object, which can be used to analyze the alignment.
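For intuition about what a pairwise alignment is, the sketch below fits a linear map between two embedding matrices whose rows correspond to the same anchor words. It assumes that <code>lstsq</code> refers to an ordinary least-squares fit and <code>svd</code> to an orthogonal Procrustes solution; neither is confirmed by this README, and the function names are hypothetical. CCA is omitted, and this is not the toolkit's <code>Aligner</code> class.

```python
import numpy as np


def fit_lstsq(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Unconstrained linear map W minimizing ||source @ W - target||."""
    W, _, _, _ = np.linalg.lstsq(source, target, rcond=None)
    return W


def fit_orthogonal(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Orthogonal map W minimizing ||source @ W - target|| (Procrustes)."""
    U, _, Vt = np.linalg.svd(source.T @ target)
    return U @ Vt


# Toy data: 50 anchor words in 5 dimensions, where the target space is an
# exact orthogonal transformation of the source space.
rng = np.random.default_rng(0)
source = rng.normal(size=(50, 5))
true_map, _ = np.linalg.qr(rng.normal(size=(5, 5)))
target = source @ true_map

W_lstsq = fit_lstsq(source, target)
W_orth = fit_orthogonal(source, target)
```

On this noise-free toy data both fits recover `true_map`. On real embeddings they generally differ: the orthogonal map preserves distances within the source space, while the least-squares map does not.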

You can also experiment with the stopword strategy, which is set by the final argument (<code>1000</code> in the commands above):
- <code>-1</code> uses all the shared words between the communities.
- <code>n</code> uses the <code>n</code> most frequent shared words between the communities.
- <code>0</code> uses NLTK stopwords (see Khudabukhsh, et al.)
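The three strategies above could be expressed roughly as follows. This sketch makes several assumptions that the README does not confirm: "most frequent" is taken to mean the highest total count summed over communities, the stopword strategy is taken to mean the stopwords that also appear in the shared vocabulary, and the stopword set is passed in by the caller instead of being loaded from NLTK. The function name is hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, List


def select_anchor_words(
    counts_by_community: Dict[str, Counter],
    strategy: int,
    stopwords: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Pick the words used to fit an alignment between communities."""
    if not counts_by_community:
        return []
    if strategy < -1:
        raise ValueError("strategy must be -1, 0, or a positive integer")

    shared = set.intersection(*(set(c) for c in counts_by_community.values()))

    if strategy == -1:
        # All shared words.
        return sorted(shared)
    if strategy == 0:
        # Only stopwords that every community has seen (assumed reading).
        return sorted(shared & stopwords)

    # The n most frequent shared words, ranked by total count across
    # communities (assumed ranking), with ties broken alphabetically.
    totals = {w: sum(c[w] for c in counts_by_community.values()) for w in shared}
    ranked = sorted(shared, key=lambda w: (-totals[w], w))
    return ranked[:strategy]
```

With NLTK installed, the stopword set could be built from `nltk.corpus.stopwords.words("english")` after downloading the `stopwords` corpus.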

### 6. Analyze the Alignments

Now we have a number of trained aligner objects, and we can begin to use them to explore the ideological dialects of the communities they connect!

Check out each of the analysis notebooks for ideas.
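As one possible starting point, the sketch below "translates" a word from one community into another by projecting its vector through an alignment matrix and taking the nearest neighbors by cosine similarity. It works on plain NumPy arrays and does not use the toolkit's <code>Aligner</code> API, whose interface is not described here; the function name, argument layout, and toy vectors are all assumptions.

```python
from typing import List

import numpy as np


def translate(
    word: str,
    source_vocab: List[str],
    source_vectors: np.ndarray,
    W: np.ndarray,
    target_vocab: List[str],
    target_vectors: np.ndarray,
    k: int = 1,
) -> List[str]:
    """Return the k target words closest to the projected source word."""
    projected = source_vectors[source_vocab.index(word)] @ W
    projected = projected / np.linalg.norm(projected)
    # Normalize target rows so the dot product is cosine similarity.
    norms = np.linalg.norm(target_vectors, axis=1, keepdims=True)
    similarities = (target_vectors / norms) @ projected
    best = np.argsort(-similarities)[:k]
    return [target_vocab[i] for i in best]


# Toy 2-D example with a hand-picked alignment matrix (hypothetical data).
source_vocab = ["freedom", "tax"]
source_vectors = np.array([[1.0, 0.0], [0.0, 1.0]])
target_vocab = ["liberty", "burden"]
target_vectors = np.array([[0.0, 1.0], [-1.0, 0.0]])
W = np.array([[0.0, 1.0], [-1.0, 0.0]])
```

A word whose nearest neighbor in the other community is a different word is a candidate for further inspection.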