For citation information, please see the "Source Information" section listed in the associated README file: https://github.com/stephbuon/digital-history/tree/master/hist3368-week12-word-context-vectors

# Hist 3368 - Week 4: Word Context Vectors for Congress

#### By Jo Guldi

In this notebook, we'll learn some more basics of working with Python. We'll also use our newfound skills at "running" code in Jupyter to create some very intensive visualizations of how word usage has changed over time in Congress.

## Basics: loading software from elsewhere

#### Importing new software packages

Users of code often borrow software written by other people because it means taking a shortcut rather than reinventing the wheel.

We will frequently 'import' software packages so that we can use commands and variables that other people have invented.

In general, we will import software packages with the command 'import.'

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from nltk.corpus import wordnet as wn

Run the cell above. If all goes well, you won't get an error message. You'll just see a star (*) by the cell while the computer thinks for a moment.

You will need to 'import' a software package at the beginning of each session in which you use that package.  Most of our future notebooks will begin with a string of 'import' commands to set things up.
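The import cell above uses three different forms of the 'import' command. As a small sketch, here they are side by side using `statistics`, a package that ships with Python (chosen here only because it needs no installation; it is not otherwise used in this notebook):

```python
import statistics             # import the whole package; commands are called as statistics.mean
import statistics as st       # the same package under a shorter nickname, like 'pandas as pd'
from statistics import mean   # import just one command, like 'from nltk.corpus import wordnet'

numbers = [1, 2, 3]

# All three lines call the same command and give the same answer
print(statistics.mean(numbers))
print(st.mean(numbers))
print(mean(numbers))
```

Whichever form you choose determines how you refer to the command later in the notebook.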

#### Installing new software packages

The first time you use a new software package, that package needs to be 'installed' on your part of the supercomputer.

To install a new software package, we need special punctuation (the leading '!') that tells M2 to run the line as a system command rather than as Python. Here is the syntax to install a package called 'gensim.' We used many commands of this kind in the first-time-setup notebook:

In [None]:
!pip install gensim --upgrade --user # you only need to run this once

In [None]:
import gensim
from gensim.models import KeyedVectors

If any of the above 'import' commands failed, you can use the **!pip install xxx --user** syntax (replacing xxx with the package's name) to install the missing package now.

#### Error messages that indicate installations/importing are needed

You will also get error messages if you try to use a command from a package that hasn't been installed or imported.  

For example, here is a function that I made up:

In [None]:
breakdance(x)

If breakdance were a legitimate function, seeing the 'NameError' next to a function would be a good sign that we need to try installing and importing a software package that defines the function 'breakdance().'
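To see the pattern with a real function, here is a minimal sketch that is not part of the course code. It uses `sqrt` from Python's built-in `math` package as a stand-in for a function like 'breakdance()': calling it before the import produces a 'NameError', and the same call works after the import.

```python
# Before importing, Python does not recognize the name 'sqrt'
try:
    sqrt(16)
    message = 'no error'
except NameError as err:
    message = str(err)
print(message)

# After importing the function from its package, the same call works
from math import sqrt
result = sqrt(16)
print(result)
```

If the import itself fails, that is the sign that the package still needs to be installed.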


## Working With Word Embeddings

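In this notebook, we're going to load some files called "word embeddings" that were generated by your professor and stored on M2. Word embeddings are multi-dimensional models of language: each word is represented as a list of numbers (a vector), and words used in similar contexts end up with similar vectors. Later in the notebook, we'll compare embeddings from different periods to study language over time.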



#### Loading pre-generated word embeddings models

The following line of code tells the computer to look in the course folder.

In [None]:
cd /scratch/group/history/hist_3368-jguldi/

Next, we tell the computer to load the pre-generated files.

In [None]:
wv = KeyedVectors.load("word2vec.wordvectors", mmap='r')

#### Calling the model to investigate words

Now that we have the word embeddings model loaded, we can use the commands 'wv[word]' (which returns the word's vector) and 'wv.similar_by_vector(word_vector, n)' to find the words that are 'similar' in terms of the model's understanding of speech.

With 'wv[word]' and 'wv.similar_by_vector(word_vector, n)', you can switch out the "word" for any word in the model's vocabulary. You can switch out n for any number; that many words will be returned.

Note that the output is a list of words that were used in contexts similar to those of the word "man." Each word is paired with a number, the "similarity score," which measures how closely that word's contexts match the contexts of "man"; higher scores mean more similar usage.

In [None]:
man_vector = wv['man']
wv.similar_by_vector(man_vector, 25)
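To make the "similarity score" concrete, here is a minimal sketch of the kind of calculation behind it, assuming the score is the cosine similarity between word vectors (which is what gensim uses for these commands). The three-number vectors and the names `toy_vectors` and `cosine_similarity` are invented for illustration; the vectors in the real model are much longer.

```python
import math

# Made-up 3-dimensional vectors; real embeddings have many more dimensions
toy_vectors = {
    'man': [0.9, 0.1, 0.3],
    'gentleman': [0.8, 0.2, 0.4],
    'budget': [0.1, 0.9, 0.2],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means they point the same way."""
    dot = math.fsum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(math.fsum(x * x for x in a))
    norm_b = math.sqrt(math.fsum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rank every other word by its similarity to 'man', highest first
target = toy_vectors['man']
ranked = sorted(
    ((word, cosine_similarity(target, vec))
     for word, vec in toy_vectors.items() if word != 'man'),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

In this toy data, 'gentleman' ranks above 'budget' because its vector points in nearly the same direction as the vector for 'man'.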

### Explore the Contents of Your Vector Model

Please note that in this exercise, you are less interested in learning every single command than you are in "playing" with the model to understand what it is telling you about speech in Congress.

#### Find the CONTEXT for One Word

In [None]:
woman_vector = wv['woman']
wv.similar_by_vector(woman_vector)

In [None]:
person_vector = wv['person']
wv.similar_by_vector(person_vector)

In [None]:
soldier_vector = wv['soldier']
wv.similar_by_vector(soldier_vector)

#### Interpreting vector similarity

Try your own hand at interpreting these outputs. 

How do you interpret these similarities?

In [None]:
wv.most_similar("iraq", topn = 20)

In [None]:
wv.most_similar("america", topn = 20)

In [None]:
wv.most_similar("britain", topn = 20)

## Subtracting Vectors

In [None]:
diff = wv['man'] - wv['woman']
wv.similar_by_vector(diff)

In [None]:
diff = wv['woman'] - wv['boy']
wv.similar_by_vector(diff)

In [None]:
diff = wv['people'] - wv['person']
wv.similar_by_vector(diff)

In [None]:
diff = wv['person'] - wv['people']
wv.similar_by_vector(diff)

In [None]:
diff = wv['think'] - wv['heart']
wv.similar_by_vector(diff)

In [None]:
diff = wv['feel'] - wv['think']
wv.similar_by_vector(diff)

### Adding vectors to find synonyms

In [None]:
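keyword_context = [word[0] for word in wv.most_similar("woman", topn = 100)]

# start from the first word's vector (named context_sum so we don't hide Python's built-in sum)
context_sum = wv[keyword_context[0]] 

# add the vectors of the remaining words
for word in keyword_context[1:]:
    next_vector = wv[word] 
    context_sum = context_sum + next_vector
    
wv.similar_by_vector(context_sum)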

In [None]:
keyword_context = [word[0] for word in wv.most_similar("soldier", topn = 100)]
context_sum = wv[keyword_context[0]]  # named context_sum so we don't hide Python's built-in sum
for word in keyword_context[1:]:
    next_vector = wv[word] 
    context_sum = context_sum + next_vector
wv.similar_by_vector(context_sum)

In [None]:
keyword_context = [word[0] for word in wv.most_similar("happy", topn = 100)]
context_sum = wv[keyword_context[0]]  # named context_sum so we don't hide Python's built-in sum
for word in keyword_context[1:]:
    next_vector = wv[word] 
    context_sum = context_sum + next_vector
wv.similar_by_vector(context_sum)

In [None]:
keyword_context = [word[0] for word in wv.most_similar("american", topn = 100)]
context_sum = wv[keyword_context[0]]  # named context_sum so we don't hide Python's built-in sum
for word in keyword_context[1:]:
    next_vector = wv[word] 
    context_sum = context_sum + next_vector
wv.similar_by_vector(context_sum)

### Distance and Similarity with Vectors in GENSIM

With similarity, the higher the number, the more alike two terms are in the context in which they are used. 

In [None]:
wv.similarity('woman', 'female')

In [None]:
wv.similarity('woman', 'man')

In [None]:
wv.similarity('soldier', 'man')

In [None]:
wv.similarity('woman', 'person')

In [None]:
wv.similarity('woman', 'rock')
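Similarity and distance are two ways of reading the same measurement. The sketch below assumes the common convention that cosine distance is one minus cosine similarity; the short vectors are made up for illustration and are not taken from the Congress model.

```python
import math

def cosine_similarity(a, b):
    """Higher values mean the two vectors point in more similar directions."""
    dot = math.fsum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def cosine_distance(a, b):
    """Lower values mean the two vectors are closer together."""
    return 1.0 - cosine_similarity(a, b)

# Made-up vectors standing in for three words
woman = [0.7, 0.6, 0.1]
female = [0.6, 0.7, 0.2]
rock = [-0.2, 0.1, 0.9]

for name, vec in [('female', female), ('rock', rock)]:
    print(name,
          round(cosine_similarity(woman, vec), 3),
          round(cosine_distance(woman, vec), 3))
```

A pair with high similarity has low distance, and the reverse, so either number can be used to compare how two words are used.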

#### Visualize the similarities as a Dendrogram

In [None]:
keywords = ['dream',  'war',  'wealth', 'happy',  'tomorrow', 'past', 'present', 'future', 'america', 'democracy', 'riot', 'dictator', 'money', 'oppression', 'prison',  'britain', 'china', 'democrat', 'republican', 'welfare', 'communism', 'russia', 'congress', 'protest']

In [None]:
keyword_vectors = wv[keywords]

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
links = linkage(keyword_vectors, method='complete', metric='seuclidean')

In [None]:
from matplotlib import pyplot as plt

# draw the full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.ylabel('word')
plt.xlabel('distance')

dendrogram(
    links,
    leaf_rotation=0,  # rotation of the leaf (word) labels
    leaf_font_size=16,  # font size of the leaf (word) labels
    orientation='left',  # root on the left, so the words run down the y axis
    leaf_label_func=lambda v: str(keywords[v])  # label each leaf with its keyword
)
plt.show()


*Note: if you get a KeyError above, one of the keywords is missing from the model's vocabulary. Delete that word from the `keywords` list and rerun the cells.*

### Visualizing Abstract Relatedness

In [None]:
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn.decomposition import PCA
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors

In [None]:
#%matplotlib inline

def display_pca_scatterplot(model, words=None, sample=0):
    # 'model' is expected to be a KeyedVectors object such as wv
    if words is None:
        if sample > 0:
            # pick a random sample of words from the vocabulary
            words = np.random.choice(list(model.key_to_index.keys()), sample)
        else:
            words = [ word for word in model.key_to_index ]
        
    word_vectors = np.array([model[w] for w in words])

    # reduce the vectors to their first two principal components
    twodim = PCA().fit_transform(word_vectors)[:,:2]
    
    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)

In [None]:
display_pca_scatterplot(wv, keywords)

## Study change over time

In [None]:
cd '/scratch/group/history/hist_3368-jguldi/congress-embeddings'

In [None]:
dataname = 'lemmatized-stopworded-bigrammed-congress-model'

In [None]:
keyword1 = 'woman'  # the word you want to research

In [None]:
periodnames = []
for i in range(1870,2010, 5):
    periodnames.append(i) 
periodnames

In [None]:
#########  for each period, collect the words most similar to keyword1
keyword_context = []
dates_found = []

# cycle through each period
for period1 in periodnames:
    print('working on ', period1)
    
    # load the model from period1
    #period_model = gensim.models.Word2Vec.load(dataname + '-model-' + str(period1)) # to load a saved model
    period_wv = KeyedVectors.load(dataname + '--wv-model-' + str(period1)) # load the saved model
    
    ## is the keyword found?
    if keyword1 in period_wv.key_to_index:
        print('found ', keyword1)
        
        # get the context vector for keyword1
        keyword_context_period = period_wv.most_similar(keyword1, topn = 5000) 
        
        # save it for later
        keyword_context.append(keyword_context_period) # save the context of how women were talked about for later
        dates_found.append(period1)

#### Visualize it

In [None]:
# helper function to abstract only unique values while keeping the list in the same order -- the order of first appearance
def unique2(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]
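As a quick check of what the helper does, here is a small sketch with a made-up word list. It repeats the function so that the example runs on its own, and compares the result with `dict.fromkeys`, a built-in alternative that also keeps the order of first appearance (in Python 3.7 and later).

```python
def unique2(seq):
    seen = set()
    seen_add = seen.add
    # seen_add(x) returns None, so each new x is recorded and kept exactly once
    return [x for x in seq if not (x in seen or seen_add(x))]

# Made-up example list with repeated words
sample_words = ['girl', 'mother', 'girl', 'wife', 'mother', 'child']

deduped = unique2(sample_words)
print(deduped)

# Dictionary keys are unique and keep insertion order, so this gives the same list
alternative = [*dict.fromkeys(sample_words)]
print(alternative == deduped)
```

Keeping the order stable matters later in the notebook, because the position of each word in the de-duplicated list is used to pick its color.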

In [None]:
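numwords = 10  # how many of the top words to keep from each period

all_words = []
for i in range(len(dates_found)):
    words = [item[0] for item in keyword_context[i]][:numwords]
    all_words.append(words)

# flatten the per-period lists into one long list of words
all_words2 = []
for period_words in all_words:
    for word in period_words:
        all_words2.append(word)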


In [None]:
# go to your home directory
%cd ~/

In [None]:
%matplotlib inline
#from matplotlib.colors import ListedColormap, LinearSegmentedColormap

from adjustText import adjust_text
from numpy import linspace
from matplotlib import cm

colors = [ cm.viridis(x) for x in linspace(0, 1, len(unique2(all_words2))+10) ]

# change the figure's size here
plt.figure(figsize=(20,20), dpi = 300)

texts = []

# plt.annotate only plots one label per iteration, so we have to use a for loop 
for i in range(len(dates_found)):    # cycle through the period names
    
    #yyy = int(keyword_per_year[keyword_per_year['5yrperiod'] == int(xx)]['count'])   # how many times was the keyword used that year?
                     
    for j in range(5):     # cycle through the top five words (you can raise this number, up to numwords)
        
        xx = dates_found[i]        # on the x axis, plot the period name
        txt, yy = keyword_context[i][j]   # the similar word and, for the y axis, its similarity score to the keyword
        colorindex = unique2(all_words2).index(txt)   # this command keeps all dots for the same word the same color
        
        plt.scatter(                                             # plot dots
            xx, #x axis
            yy, # y axis
            linewidth=1, 
            color = colors[colorindex],
            edgecolors = 'darkgray',
            s = 100, # dot size
            alpha=0.8)  # dot transparency

        # make a label for each word
        texts.append(plt.text(xx, yy, txt))

# Code to help with overlapping labels -- may take a minute to run
adjust_text(texts, force_points=0.2, force_text=.7, 
                    expand_points=(1, 1), expand_text=(1, 1),
                    arrowprops=dict(arrowstyle="-", color='black', lw=0.5))

plt.xticks(rotation=90)

# Add titles
plt.title("What words were most similar to '" + keyword1 + "' in Congress?", fontsize=20, fontweight=0, color='Red')
plt.xlabel("period")
plt.ylabel("similarity to " + keyword1)


filename = 'words-similar-to-' + keyword1 + '-' + dataname + '.png'  # save as a PNG image
plt.savefig(filename)