# Word embeddings

The previous analyses have shown that fake news tends to use consistently inflammatory and subjective vocabulary, and tends to cover issues that may incite controversy.

Let's drill down to the word level and look for connotations among words used in both fake and real news. This could reveal underlying biases that shape how certain words like `election` or `president` are perceived.

1. [Load data](#Load-data)
2. [Train embeddings](#Train-embedding-models)
3. [Nearest neighbors](#Look-at-nearest-neighbors)
4. [Different neighbors](#Find-words-with-different-neighbors)
5. [Exploration](#Exploration)

### Load data

In [None]:
## data = fake news challenge
import pandas as pd
fake_news_article_data = pd.read_csv('data/fake_news_challenge/Fake.csv', sep=',', index_col=False)
real_news_article_data = pd.read_csv('data/fake_news_challenge/True.csv', sep=',', index_col=False)
display(fake_news_article_data.head())

In [None]:
## clean data: split each article into sentences, then each sentence into word tokens
from nltk.tokenize import PunktSentenceTokenizer, WordPunctTokenizer
sent_tokenizer = PunktSentenceTokenizer()
word_tokenizer = WordPunctTokenizer()
def get_sentence_word_tokens(text, word_tokenizer, sent_tokenizer):
    text_sents = sent_tokenizer.tokenize(text)
    text_sent_tokens = list(map(word_tokenizer.tokenize, text_sents))
    return text_sent_tokens
fake_news_sentences = fake_news_article_data.loc[:, 'text'].apply(lambda x: get_sentence_word_tokens(x, word_tokenizer, sent_tokenizer))
real_news_sentences = real_news_article_data.loc[:, 'text'].apply(lambda x: get_sentence_word_tokens(x, word_tokenizer, sent_tokenizer))
# flatten the per-article sentence lists into a single list of tokenized sentences
def flatten_list_data(data):
    flat_data = []
    for x in data:
        flat_data.extend(x)
    return flat_data
fake_news_sentences = flatten_list_data(fake_news_sentences)
real_news_sentences = flatten_list_data(real_news_sentences)

### Train embedding models
Let's first train the word embedding models on the full data.

In [None]:
## train word2vec embeddings
from gensim.models.word2vec import Word2Vec
def train_word2vec_model(text_sents, model_out_file):
    dim = 50
    alpha = 0.025
    window = 5
    min_count = 5
    # note: `size` is the gensim 3.x argument name; gensim 4+ renamed it to `vector_size`
    model = Word2Vec(sentences=text_sents, size=dim, alpha=alpha, window=window, min_count=min_count, seed=123)
    model.save(model_out_file)
fake_news_word2vec_model_out_file = 'data/fake_news_challenge/fake_news_word2vec_embed.model'
real_news_word2vec_model_out_file = 'data/fake_news_challenge/real_news_word2vec_embed.model'
## skipping these steps to save time during tutorial
# train_word2vec_model(fake_news_sentences, fake_news_word2vec_model_out_file)
# train_word2vec_model(real_news_sentences, real_news_word2vec_model_out_file)

In [None]:
## load from file
fake_news_word2vec_embed_model = Word2Vec.load(fake_news_word2vec_model_out_file)
real_news_word2vec_embed_model = Word2Vec.load(real_news_word2vec_model_out_file)

In [None]:
## train Glove embeddings
from glove import Glove, Corpus
def fit_glove_model(text_sents, model_out_file):
    dim = 50
    learning_rate = 0.05
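    # note: in glove-python, alpha is the exponent of the co-occurrence weighting
    # function (library default 0.75), not a learning rate as in word2vec
    alpha = 0.025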
    random_state = 123
    train_epochs = 100
    num_threads = 4
    window = 5
    glove_corpus = Corpus()
    glove_corpus.fit(text_sents, window=window)
    glove_embed_model = Glove(no_components=dim, learning_rate=learning_rate, 
                              alpha=alpha, random_state=random_state)
    # note: this takes ~ 5 minutes with 4 threads on a server
    glove_embed_model.fit(glove_corpus.matrix, epochs=train_epochs,
                          no_threads=num_threads, verbose=True)
    glove_embed_model.add_dictionary(glove_corpus.dictionary)
    glove_embed_model.save(model_out_file)
fake_news_glove_model_out_file = 'data/fake_news_challenge/fake_news_glove_embed.model'
real_news_glove_model_out_file = 'data/fake_news_challenge/real_news_glove_embed.model'
## skipping these steps to save time during tutorial
# print('fitting Glove embeddings for fake news')
# fit_glove_model(fake_news_sentences, fake_news_glove_model_out_file)
# print('fitting Glove embeddings for real news')
# fit_glove_model(real_news_sentences, real_news_glove_model_out_file)

In [None]:
## reload models after training
fake_news_glove_embed_model = Glove.load(fake_news_glove_model_out_file)
real_news_glove_embed_model = Glove.load(real_news_glove_model_out_file)

### Look at nearest neighbors

Let's start out by looking at the nearest neighbors for some test words. 

We'll get the test words by filtering from the most frequent words.

In [None]:
from collections import Counter
from stop_words import get_stop_words
import pandas as pd
pd.set_option('display.max_rows', 100)
news_word_counter = Counter()
for sent_i in fake_news_sentences:
    news_word_counter.update(sent_i)
for sent_i in real_news_sentences:
    news_word_counter.update(sent_i)
news_word_counts = pd.Series(dict(news_word_counter)).sort_values(inplace=False, ascending=False)
en_stops = set(get_stop_words('en')) & set(news_word_counts.index)
news_word_counts.drop(en_stops, inplace=True)
display(news_word_counts.head(100))

In [None]:
test_words = ['Trump', 'President', 'election', 'Republicans', 'Democratic']

In [None]:
## test word2vec first
N_neighbors = 10
for test_word_i in test_words:
    print(f'testing word = {test_word_i}')
    print(f'\tfake news neighbors')
    # query the KeyedVectors (`.wv`); calling most_similar on the model itself is deprecated
    print(fake_news_word2vec_embed_model.wv.most_similar(test_word_i, topn=N_neighbors))
    print(f'\treal news neighbors')
    print(real_news_word2vec_embed_model.wv.most_similar(test_word_i, topn=N_neighbors))

In [None]:
## test Glove embeddings
N_neighbors = 10
for test_word_i in test_words:
    print(f'testing word = {test_word_i}')
    print(f'\tfake news neighbors')
    print(fake_news_glove_embed_model.most_similar(test_word_i, number=N_neighbors))
    print(f'\treal news neighbors')
    print(real_news_glove_embed_model.most_similar(test_word_i, number=N_neighbors))

These test words show some signs of potential bias.

For `word2vec`:
- `Trump` is associated almost exclusively with Republican politicians in fake news and with a mix of politicians in real news
- `President` is associated more with U.S. politics in fake news and more with international politicians in real news
- `Democratic` is associated more with U.S. politics in fake news and more with international politics in real news

For `Glove`:
- `Trump` is associated with himself (and with `Q13FOXWATCH`, possibly a news network tag) in fake news and with other presidents in real news
- `President` is associated with Trump and Obama in fake news and more with international politicians in real news
- `Democratic` is associated with U.S. party politics in both fake and real news

This qualitative analysis suggests that some words diverge noticeably in connotation between the two data sets, while others are more stable.

### Find words with different neighbors

Which words are the most different across the data?

We'll measure "difference" as one minus the overlap between a word's nearest neighbors in the fake news model and in the real news model (i.e. the Jaccard distance).

$$\text{diff}(\text{word}) = 1 - \frac{|\text{neighbors}_{\text{fake}}(\text{word}) \: \cap \: \text{neighbors}_{\text{real}}(\text{word})|}{|\text{neighbors}_{\text{fake}}(\text{word}) \: \cup \: \text{neighbors}_{\text{real}}(\text{word})|}$$

A difference of 100% means that the word has no neighbors in common across the two models, while a difference of 0% means that the word has identical neighbors in both.
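To make the measure concrete, here is a small worked example. The neighbor lists are made-up placeholders (`a` through `f`), not output from either model: two shared neighbors out of six distinct ones gives a difference of about 67%.

```python
# made-up neighbor lists for one hypothetical word (placeholders, not model output)
fake_neighbors = ['a', 'b', 'c', 'd']
real_neighbors = ['a', 'b', 'e', 'f']

# neighbors that appear in both lists
shared = set(fake_neighbors) & set(real_neighbors)
# all distinct neighbors across both lists
combined = set(fake_neighbors) | set(real_neighbors)

# one minus the Jaccard overlap
toy_diff = 1 - len(shared) / len(combined)
print(f'{len(shared)} shared of {len(combined)} distinct neighbors, difference = {toy_diff:.2f}')
```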

In [1]:
def compute_neighbor_diff(neighbors_1, neighbors_2):
    neighbor_intersect = set(neighbors_1) & set(neighbors_2)
    neighbor_union = set(neighbors_1) | set(neighbors_2)
    neighbor_diff = 1 - len(neighbor_intersect) / len(neighbor_union)
    return neighbor_diff
def compute_neighbor_diff_model(word, model_1, model_2, N_neighbor, model_type='word2vec'):
    if(model_type == 'word2vec'):
        neighbors_1, neighbor_scores_1 = zip(*model_1.wv.most_similar(word, topn=N_neighbor))
        neighbors_2, neighbor_scores_2 = zip(*model_2.wv.most_similar(word, topn=N_neighbor))
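    elif(model_type == 'glove'):
        neighbors_1, neighbor_scores_1 = zip(*model_1.most_similar(word, number=N_neighbor))
        neighbors_2, neighbor_scores_2 = zip(*model_2.most_similar(word, number=N_neighbor))
    else:
        # fail early instead of hitting an undefined-variable error below
        raise ValueError(f'unknown model_type: {model_type}')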
    neighbor_diff = compute_neighbor_diff(neighbors_1, neighbors_2)
    return neighbor_diff

In [None]:
# get shared vocabulary
shared_word2vec_vocab = list(set(fake_news_word2vec_embed_model.wv.vocab.keys()) & set(real_news_word2vec_embed_model.wv.vocab.keys()))
print(f'{len(shared_word2vec_vocab)} words in word2vec vocab')
# compute neighbor differences for all valid words
model_type = 'word2vec'
N_neighbor = 10
fake_vs_real_word2vec_neighbor_diffs = list(map(lambda x: compute_neighbor_diff_model(x, fake_news_word2vec_embed_model, real_news_word2vec_embed_model, N_neighbor, model_type=model_type), shared_word2vec_vocab))
# add vocabulary as index
fake_vs_real_word2vec_neighbor_diffs = pd.Series(fake_vs_real_word2vec_neighbor_diffs, index=shared_word2vec_vocab)
fake_vs_real_word2vec_neighbor_diffs.sort_values(inplace=True, ascending=False)

In [None]:
top_k = 20
print('words with most neighbor difference')
print(fake_vs_real_word2vec_neighbor_diffs.head(top_k))
print('words with most neighbor similarity')
print(fake_vs_real_word2vec_neighbor_diffs.tail(top_k))

The words with the biggest neighbor differences are not very informative and may simply reflect topical differences (e.g. fake news tends to discuss `Charlie` more often and therefore has more consistent nearest neighbors for it).

What if we restrict to the top-1000 most frequent words?

In [None]:
# only keep the words that are in the word2vec vocab
# (use isin rather than `index & set(...)`, which newer pandas no longer treats as an intersection)
word2vec_vocab_news_word_counts = news_word_counts.loc[news_word_counts.index.isin(shared_word2vec_vocab)].sort_values(inplace=False, ascending=False)
top_N_words = word2vec_vocab_news_word_counts.iloc[:1000].index.tolist()
top_N_fake_vs_real_word2vec_neighbor_diffs = fake_vs_real_word2vec_neighbor_diffs.loc[top_N_words].sort_values(inplace=False, ascending=False)
top_k = 50
print('frequent words with most neighbor difference')
print(top_N_fake_vs_real_word2vec_neighbor_diffs.head(top_k))

OK! This leaves us with some interesting words to investigate:

- `left` (related to politics?)
- `Barack`
- `twitter`
- `Black`
- `Islamic`
- `corruption`

In [None]:
# print neighbors for all high-difference words
high_diff_words = ['left', 'Barack', 'twitter', 'Black', 'Islamic', 'corruption']
N_neighbors = 10
for word_i in high_diff_words:
    print(f'testing word = {word_i}')
    print(f'\tfake news neighbors')
    # query the KeyedVectors (`.wv`); calling most_similar on the model itself is deprecated
    print(fake_news_word2vec_embed_model.wv.most_similar(word_i, topn=N_neighbors))
    print(f'\treal news neighbors')
    print(real_news_word2vec_embed_model.wv.most_similar(word_i, topn=N_neighbors))

These neighbors suggest that the fake news articles frame several of these words in a noticeably biased way.

- `left` is more associated with extreme political views in fake news, and more associated with the traditional verb sense in real news
- `Barack` is more associated with the Obama administration (and his middle name `Hussein`, often framed as "unusual") in fake news, and more associated with world leaders in real news
- `twitter` is more associated with "alternative" news sources in fake news, and more associated with social media in general in real news
- `Black` is more associated with the Black Lives Matter movement and other left-wing movements (`antifa`) in fake news, and more associated with a variety of organizations in real news
- `Islamic` is more associated with terrorist and perceived "radical" movements in fake news, and more associated with Middle Eastern politics in real news

In [3]:
## TODO: visualize?? https://stackoverflow.com/questions/43776572/visualise-word2vec-generated-from-gensim
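
One possible way to approach the visualization noted in the TODO above is to project the embedding vectors onto two dimensions with PCA and then scatter-plot the result. The sketch below is only an illustration under assumptions: `project_to_2d` is a hypothetical helper, and the words and 4-dimensional vectors are made-up stand-ins for real embeddings.

```python
import numpy as np

def project_to_2d(vectors):
    """Project row vectors onto their first two principal components (PCA via SVD)."""
    X = np.asarray(vectors, dtype=float)
    # center each dimension before finding the principal directions
    X_centered = X - X.mean(axis=0)
    # rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

# toy stand-in for embeddings: hypothetical words with made-up vectors
toy_words = ['senator', 'governor', 'tweet', 'post']
toy_vectors = [[1.0, 0.9, 0.0, 0.1],
               [0.9, 1.0, 0.1, 0.0],
               [0.0, 0.1, 1.0, 0.9],
               [0.1, 0.0, 0.9, 1.0]]
coords = project_to_2d(toy_vectors)
for word, (x, y) in zip(toy_words, coords):
    print(f'{word}\t{x:.2f}\t{y:.2f}')
```

With the trained word2vec models, the vectors could be collected with `model.wv[word]` for each word of interest, and the 2-D coordinates passed to a plotting library such as matplotlib.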

### Exploration
Now it's time for you to try out some more tests with word embeddings!

- Increasing the **window size** when training embeddings can help the embeddings capture more global context (e.g. associating `tomato` with cooking details from the wider sentence context). How would this help capture divides between fake news and real news?
- One way to determine the **connotation** of a word in embedding space is to look at its proximity to positive and negative words: e.g. if `Barack` is consistently closer to words like `bad` and `terrible` than to `good` and `nice`. Can you come up with a way to test word connotations using this kind of approach, and determine whether some words have consistently better or worse connotations in fake news articles?
- Another useful aspect of word embeddings is their tendency to **cluster** words into general semantic fields, e.g. grouping all politician names near one another. By projecting the embeddings into two dimensions (e.g. with PCA or t-SNE) and plotting them, try to find words that (1) consistently fall into neat clusters and (2) sometimes appear outside of the expected clusters in the data. Which political and organizational words tend to be represented outside of their expected cluster, and why do you think that happens?
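
As a starting point for the connotation exercise above, here is a minimal sketch of one possible scoring scheme: the mean cosine similarity to positive seed words minus the mean cosine similarity to negative seed words. The function names, the toy 2-dimensional vectors, and the placeholder words `word_a` and `word_b` are all assumptions for illustration; only the seed words `good`, `nice`, `bad`, and `terrible` come from the exercise.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

def connotation_score(word, embeddings, positive_words, negative_words):
    """Mean similarity to positive seeds minus mean similarity to negative seeds."""
    word_vec = embeddings[word]
    pos_sim = sum(cosine_similarity(word_vec, embeddings[w]) for w in positive_words) / len(positive_words)
    neg_sim = sum(cosine_similarity(word_vec, embeddings[w]) for w in negative_words) / len(negative_words)
    return pos_sim - neg_sim

# toy embeddings (made-up vectors); a real test would look up vectors from each trained model
toy_embeddings = {
    'good': [1.0, 0.1], 'nice': [0.9, 0.2],
    'bad': [0.1, 1.0], 'terrible': [0.2, 0.9],
    'word_a': [0.8, 0.3], 'word_b': [0.3, 0.8],
}
positive_seeds = ['good', 'nice']
negative_seeds = ['bad', 'terrible']
score_a = connotation_score('word_a', toy_embeddings, positive_seeds, negative_seeds)
score_b = connotation_score('word_b', toy_embeddings, positive_seeds, negative_seeds)
print(f'word_a score = {score_a:.3f}, word_b score = {score_b:.3f}')
```

A positive score means the word sits closer to the positive seeds. Computing the same score for a word in the fake news model and in the real news model, and comparing the two, is one way to check whether its connotation shifts between the data sets.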