# Extract information for the speaker landscape

This notebook reads the `word_embedding.emb` and `clean_data.txt` files and extracts the information needed to visualise and analyse the speaker landscape.

The agent information is stored in a pandas dataframe and saved to the file `landscape_info.pkl`, which can be read back into a dataframe using `output = pd.read_pickle("landscape_info.pkl")`. The annotating words are saved in the same way to `annotations_info.pkl`.

In [1]:
from gensim.models import KeyedVectors
import pandas as pd
import numpy as np
import umap.umap_ as umap

## 1. Create a dataframe with authors, their tweets and vector representations

In [2]:
umap_seed = 42
retain_threshold = 0 # NOTE: The original paper used retain_threshold=15, but for this small example such a value will remove all data samples

In [3]:
embedding = KeyedVectors.load("word_embedding.emb")

In [4]:
# Turn data text file into pandas dataframe

quotes = []
agents = []

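# each line is expected to hold the author token followed by the tweet text
# NOTE: encoding="utf-8" is an assumption about how clean_data.txt was written
with open("clean_data.txt", "r", encoding="utf-8") as f:
    for line in f:
        tokens = line.strip().split(" ")
        # skip blank lines and lines that contain an author but no text
        if len(tokens) > 1:
            agents.append(tokens[0])
            quotes.append([" ".join(tokens[1:])])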

df = pd.DataFrame({"author": agents, "quotes": quotes})

# merge all tweets of the same agent into a single list
df = df.groupby(["author"], as_index=False).agg({'quotes': 'sum'})
print("Number of agents in training set: ", len(df.index))

# keep only agents with more than `retain_threshold` tweets
df = df[df.quotes.map(len) > retain_threshold]
print("Number of agents with more than " + str(retain_threshold) + " tweets: ", len(df.index))

# store the vector representation of the agent
df["vec"] = df.apply(lambda row: embedding[row.author], axis=1)

Number of agents in training set:  442
Number of agents with more than 0 tweets:  442
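
The grouping and filtering in the cell above can be illustrated without pandas. The following is a minimal sketch, assuming each line has the form `<author> <tweet text>`; the sample lines and the helper `group_tweets` are hypothetical and are not part of the notebook.

```python
from collections import defaultdict

# Hypothetical sample lines in the assumed "<author> <tweet text>" format.
sample_lines = [
    "agent_alice @bob hello there\n",
    "agent_bob good morning\n",
    "agent_alice second tweet\n",
    "agent_carol\n",  # author without any text: skipped
    "\n",             # blank line: skipped
]


def group_tweets(lines, retain_threshold=0):
    """Collect tweets per author, then keep authors with more than
    `retain_threshold` tweets (strict comparison, as in the cell above)."""
    tweets_by_author = defaultdict(list)
    for line in lines:
        # Split off the first token as the author; the rest is the tweet.
        author, _, text = line.strip().partition(" ")
        if author and text:
            tweets_by_author[author].append(text)
    return {
        author: tweets
        for author, tweets in tweets_by_author.items()
        if len(tweets) > retain_threshold
    }


grouped = group_tweets(sample_lines)
frequent = group_tweets(sample_lines, retain_threshold=1)
print(grouped)
print(frequent)
```

Because the comparison is strict, `retain_threshold = 0` keeps every agent with at least one tweet, while `retain_threshold = 1` keeps only agents with two or more.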


In [5]:
# reduce vectors to 2-d representation using UMAP and add to the dataframe

vecs = df["vec"].tolist() 
reducer = umap.UMAP(metric="cosine", min_dist=0.01, n_neighbors=40, random_state=umap_seed)
smaller_vecs = reducer.fit_transform(vecs)

df["low_dim_vec"] = list(smaller_vecs)

df

,author,quotes,vec,low_dim_vec
0,agent_0sternchen,[@mountaindream5 @spaet68er @zuma_monty nur fü...,"[-0.0072948337, 0.08525834, 0.088255815, 0.030...","[9.962139, 6.6214237]"
1,agent_1st_rins,"[@zuma_okemaru @1st_rins 1st_rins, @zuma_okema...","[-0.07338741, 0.038354628, 0.07268287, 0.04553...","[10.167288, 7.139133]"
2,agent_80pfarelo,[@frasimphi @coruscakhaya you will follow this...,"[-0.030285044, -0.027092239, 0.012962351, 0.04...","[3.380585, 7.387287]"
3,agent___xmo4,"[@zuma_okemaru ご飯食べたらあそぼ！, @zuma_okemaru ご飯食べた...","[-0.044664677, 0.028426673, 0.043664575, 0.035...","[9.915042, 6.950782]"
4,agent__africansoil,[💻pres zuma discussion with the top 6 presiden...,"[-0.097087175, 0.06697575, -0.17402579, -0.052...","[5.191036, 6.15981]"
...,...,...,...,...
437,agent_zenande_monegi,[@mugabebobby @flawmade @100kmokone @mightijam...,"[-0.022949548, -0.10549712, 0.0016295762, 0.18...","[3.0828862, 5.9407344]"
438,agent_zukile_lize,[@advobarryroux zuma is the worst to tell us t...,"[0.06030377, -0.13764329, -0.09635882, 0.13789...","[3.6614578, 5.330434]"
439,agent_zuma0240,[参加型第5人格！！サバイバー達我が勝利への糧となれ！ランク戦まで！ #identityv ...,"[-0.10415595, 0.056165773, 0.17486429, 0.07228...","[10.337763, 7.356607]"
440,agent_zuma_0807,[2021年2月27日 zuma_0807さんがnew眠しました。 時刻 615 入眠潜時 ...,"[-0.07631513, 0.059613544, 0.12885502, 0.09154...","[10.303881, 7.2586274]"
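
The UMAP call above uses `metric="cosine"`, so agents are compared by the angle between their embedding vectors rather than by vector length. As a rough illustration, the sketch below computes cosine distance in plain Python, assuming the usual definition of one minus cosine similarity; the function `cosine_distance` is a hypothetical helper, not UMAP's own implementation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Vectors pointing in the same direction are at distance 0, whatever their length.
same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
# Orthogonal vectors are at distance 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
# Opposite vectors are at distance 2.
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```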


In [6]:
df.to_pickle("landscape_info.pkl")

## 2. Create a dataframe with annotating words and their vector representation

In [7]:
# Extract hashtags as an example of annotating words

vocab = embedding.index_to_key
hashtags = [word for word in vocab if "#" in word]
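
Note that the test `"#" in word` selects every token that contains a `#` anywhere, not only tokens that begin with one. The sketch below contrasts it with a stricter prefix test on a small, hypothetical vocabulary; whether the stricter variant is preferable depends on how the corpus was tokenised.

```python
# Hypothetical vocabulary; the real one comes from embedding.index_to_key.
sample_vocab = ["#zuma", "agent_alice", "zuma", "c#", "#identityv"]

# Same test as in the cell above: any token containing "#".
contains_hash = [word for word in sample_vocab if "#" in word]

# Stricter alternative: only tokens that start with "#".
starts_with_hash = [word for word in sample_vocab if word.startswith("#")]

print(contains_hash)
print(starts_with_hash)
```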

In [8]:
# look up the full embedding vectors and project them into the 2-d space fitted on the agents
hashtags_vecs = [embedding[h] for h in hashtags]
hashtags_low_dim_vecs = list(reducer.transform(np.array(hashtags_vecs)))

In [9]:
df_annotations = pd.DataFrame({"word": hashtags, "vec": hashtags_vecs, "low_dim_vec": hashtags_low_dim_vecs})
df_annotations.to_pickle("annotations_info.pkl")