Earlier this summer FiveThirtyEight shared a corpus of nearly three million tweets associated with accounts linked to Russia's Internet Research Agency. The evidence suggests these tweets were part of a campaign to influence the 2016 US election. What was communicated, and how do we make sense of it?
One possibility is to simulate a conversation among the trolls using a word embedding model and tf-idf transforms.
- Build an embedding model of all the words in the Russian troll tweets corpus using Gensim's Word2Vec module. The model's most_similar function can then generate an analogy for each word in a given text from a pair of pre-selected words (such as liberal and conservative).
- Transform the corpus of tweets into a tf-idf matrix.
- Beginning with a randomly selected tweet, repeat the following steps until 50,000 words have been printed:
  - Print the tweet.
  - Remove the tf-idf vector for the tweet from the matrix (this avoids repetition).
  - Replace each word in the tweet by analogy, using the word pair and the embedding model.
  - Print the modified tweet.
  - Transform the modified tweet into a tf-idf vector with the same vocabulary and weights as the matrix.
  - Select the tweet whose vector in the matrix is most similar to the vector of the modified tweet (using cosine similarity), and continue from that tweet.
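The steps above can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not the code used for the project: the function names (tokenize, fit_idf, tfidf_vector, cosine, gensim_analogy, simulate) are hypothetical, the tokenizer is a plain lowercase whitespace split, and tf-idf is computed by hand with smoothed idf weights so the sketch runs on the standard library alone. The analogy step is passed in as a function; gensim_analogy shows one way it might wrap the most_similar method of a Gensim KeyedVectors object.

```python
import math
import random
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse, unit-length tf-idf vector; terms unseen in the corpus are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. cosine similarity."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def gensim_analogy(wv, a, b):
    """Wrap a KeyedVectors-like object: word -> nearest word to (word - a + b)."""
    def swap(word):
        try:
            return wv.most_similar(positive=[word, b], negative=[a], topn=1)[0][0]
        except KeyError:  # word missing from the embedding vocabulary
            return word
    return swap


def simulate(tweets, swap, word_limit=50_000, seed=None):
    """Yield (tweet, modified_tweet) pairs until word_limit words are emitted.

    Assumes `tweets` is a non-empty list of strings.
    """
    rng = random.Random(seed)
    idf = fit_idf(tweets)
    # One tf-idf vector per tweet, keyed by index (stands in for the matrix).
    remaining = {i: tfidf_vector(t, idf) for i, t in enumerate(tweets)}
    current = rng.choice(list(remaining))  # random starting tweet
    words = 0
    while True:
        tweet = tweets[current]
        del remaining[current]  # remove its vector to avoid repetition
        modified = " ".join(swap(w) for w in tokenize(tweet))
        yield tweet, modified
        words += len(tweet.split()) + len(modified.split())
        if words >= word_limit or not remaining:
            break
        # Pick the unused tweet closest to the modified tweet.
        query = tfidf_vector(modified, idf)
        current = max(remaining, key=lambda i: cosine(query, remaining[i]))
```

With a trained Gensim Word2Vec model named model (an assumed variable), the swap function could be built as gensim_analogy(model.wv, "liberal", "conservative"), and the conversation printed by looping over simulate(tweets, swap) and printing each tweet followed by its modified version. Keeping the vectors in a dictionary makes removal trivial; for a corpus of millions of tweets a sparse matrix library would likely be a better fit.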