### In this notebook we build extended inputs for our social/graph models

Everything related to building community graphs, creating graph embeddings and using social features lives in this notebook. We use the Waseem & Hovy data. The structure is as follows:

* Import the Data (this time also user data)
* Build graphs (community graph, extended graph, ...)
* Create graph embedding (from graph to dense vectors)

#### What 'raw data' do we use in general?
For now, I stuck to simple inputs such as:
* tweets (text)
* tweets' authors
* authors' friends



# Import data from MongoDB
The data is imported from a MongoDB instance set up by Linda; ask her for more information.

In [2]:
import pymongo
import pandas as pd
import numpy as np

In [5]:
# open Mongo DB and read Data
db_name = 'hatespeech_WaseemHovy'
client = pymongo.MongoClient(port=27017)
db = client[db_name]

In [6]:
# extract tweet and follow relationship tables from mongodb 
annotated_tweets = pd.DataFrame(db.annotated_tweet.find())
follow_relationships = pd.DataFrame(db.follower.find({},{'_id': 0}))

# condense information (just take the information that we use)
annotated_tweets = annotated_tweets[['text', 'user', 'label']]
annotated_tweets['user'] = annotated_tweets['user'].transform(lambda x: x['id']) 

display(annotated_tweets.head(), follow_relationships.head())

Unnamed: 0,text,user,label
0,So Drasko just said he was impressed the girls...,2498963143,racism
1,Drasko they didn't cook half a bird you idiot ...,110114783,racism
2,Hopefully someone cooks Drasko in the next ep ...,38650214,racism
3,of course you were born in serbia...you're as ...,2587278392,racism
4,These girls are the equivalent of the irritati...,2601524623,racism


Unnamed: 0,user_id,follower_id
0,16754078,949291260
1,8197942,949291260
2,238250804,949291260
3,224302874,949291260
4,915643998,949291260
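
The `user` field coming out of MongoDB holds a nested document per tweet, which is why the cell above reduces it to the user id. A minimal sketch of that step on made-up records (the ids, names and texts below are invented for illustration, not taken from the database):

```python
import pandas as pd

# toy records mimicking the shape of the annotated_tweet collection (values are invented)
records = [
    {"text": "first tweet", "user": {"id": 11, "name": "a"}, "label": "none"},
    {"text": "second tweet", "user": {"id": 22, "name": "b"}, "label": "sexism"},
    {"text": "third tweet", "user": {"id": 11, "name": "a"}, "label": "none"},
]
toy_tweets = pd.DataFrame(records)[["text", "user", "label"]]

# replace each nested user document by its id
toy_tweets["user"] = toy_tweets["user"].map(lambda u: u["id"])

print(toy_tweets["user"].tolist())
print(toy_tweets["user"].nunique())
```

After this step the `user` column contains plain ids, so `np.unique` can later be used to collect the distinct authors.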


# Creating a Graph
Here we create graphs from the data we have, using NetworkX. <br>
Docs for NetworkX: https://networkx.github.io/documentation/stable/

**COMMUNITY GRAPH**: Author profiling for abuse detection (Mishra 2018) <br>
**EXTENDED GRAPH**: Abusive language detection with graph convolutional networks (Mishra 2019)

In [7]:
import networkx as nx
import matplotlib.pyplot as plt

In [8]:
# get unique users
users = np.unique(annotated_tweets['user'].to_numpy())
users = np.array(list(map(str, users)))

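# get relationships between users (this file was given to me by Linda)
# note: nx.read_gpickle was removed in NetworkX 3.0; on newer versions load the file with pickle.load instead
G_followers = nx.read_gpickle('pickle_files/users_data/user_root_follower_network_WaseemHovy_Linda.pkl')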

# get links between tweets and users
tweets = np.unique(annotated_tweets['text'].to_numpy())
author_relationship = annotated_tweets[['text', 'user']].to_numpy()
author_relationship[:,1] = np.array(list(map(str, author_relationship[:,1])))

In [9]:
# create a community graph 
G_community = nx.Graph()
G_community.add_nodes_from(users)
G_community.add_edges_from(G_followers.edges())
print('FOLLOWER GRAPH')
print('Number of nodes: ', G_followers.number_of_nodes(), ' Number of edges: ', G_followers.number_of_edges())
print('COMMUNITY GRAPH')
print('Number of nodes: ', G_community.number_of_nodes(), ' Number of edges: ', G_community.number_of_edges())

FOLLOWER GRAPH
Number of nodes:  1207  Number of edges:  7860
COMMUNITY GRAPH
Number of nodes:  2031  Number of edges:  5514
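
The community graph ends up with more nodes but fewer edges than the follower graph. The extra nodes are users without any follow edge. For the edges, one possible explanation, assuming `G_followers` is a directed graph, is that `nx.Graph` stores a mutual follow (A follows B and B follows A) as a single undirected edge. The toy graph below is invented to illustrate both effects:

```python
import networkx as nx

# invented directed follower graph: a<->b is mutual, b->c is one-way
toy_followers = nx.DiGraph()
toy_followers.add_edges_from([("a", "b"), ("b", "a"), ("b", "c")])

# copy the edges into an undirected graph, as done for G_community
toy_community = nx.Graph()
toy_community.add_nodes_from(["a", "b", "c", "d"])  # "d" has no follow edge at all
toy_community.add_edges_from(toy_followers.edges())

print(toy_followers.number_of_nodes(), toy_followers.number_of_edges())
print(toy_community.number_of_nodes(), toy_community.number_of_edges())
```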


In [10]:
# Sanity check on degree and density
count_empty = 0
value_sum = 0
for value in dict(G_community.degree()).values():
    if value == 0:
        count_empty += 1
    else:
        value_sum += value

no_of_nodes = G_community.number_of_nodes()
no_of_edges = G_community.number_of_edges()
graph_density = 2*no_of_edges / (no_of_nodes*(no_of_nodes - 1))
        
# note: no_of_edges/no_of_nodes is the number of edges per node; the mean degree is twice that (2*E/N)
print('Solitary users: {}, Edges check: {}, Edges per Node: {}, Graph Density: {}'\
     .format(count_empty, value_sum, no_of_edges/no_of_nodes, graph_density))

Solitary users: 824, Edges check: 11028, Edges per Node: 2.7149187592319053, Graph Density: 0.00267479680712503
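
The same quantities can be cross-checked against NetworkX built-ins. This is a sketch on an invented toy graph, not on `G_community`. Note that the mean degree of an undirected graph is `2 * E / N`, i.e. twice the ratio of edges to nodes.

```python
import networkx as nx

toy = nx.Graph()
toy.add_nodes_from(range(5))                   # nodes 0..4
toy.add_edges_from([(0, 1), (1, 2), (0, 2)])   # nodes 3 and 4 stay isolated

n_nodes = toy.number_of_nodes()
n_edges = toy.number_of_edges()

solitary = list(nx.isolates(toy))                   # nodes with degree 0
degree_sum = sum(deg for _, deg in toy.degree())    # equals 2 * n_edges
mean_degree = degree_sum / n_nodes                  # 2E / N
density = nx.density(toy)                           # 2E / (N * (N - 1)) for undirected graphs

print(solitary, degree_sum, mean_degree, density)
```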


In [11]:
# create an extended graph
G_extended = nx.Graph()
G_extended.add_nodes_from(tweets)
print('TWEETS GRAPH')
print('Number of nodes: ', G_extended.number_of_nodes(), ' Number of edges: ', G_extended.number_of_edges())
G_extended.add_edges_from([tuple(x) for x in author_relationship])
print('AUTHOR GRAPH')
print('Number of nodes: ', G_extended.number_of_nodes(), ' Number of edges: ', G_extended.number_of_edges())
G_extended.add_edges_from(G_followers.edges())
print('EXTENDED GRAPH')
print('Number of nodes: ', G_extended.number_of_nodes(), ' Number of edges: ', G_extended.number_of_edges())

TWEETS GRAPH
Number of nodes:  16849  Number of edges:  0
AUTHOR GRAPH
Number of nodes:  18873  Number of edges:  16849
EXTENDED GRAPH
Number of nodes:  18880  Number of edges:  22363
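
Because the tweets are added first and the users only appear once the authorship and follower edges are added, the node order of the extended graph is tweets first, then users. NetworkX keeps nodes in insertion order, which matters whenever rows of the adjacency matrix are split into a tweet part and a user part. A small sketch with invented tweets and user ids:

```python
import networkx as nx

toy_tweets = ["tweet one", "tweet two", "tweet three"]
toy_authorship = [("tweet one", "u1"), ("tweet two", "u2"), ("tweet three", "u1")]
toy_follows = [("u1", "u2"), ("u2", "u3")]    # u3 never authored a tweet

toy_extended = nx.Graph()
toy_extended.add_nodes_from(toy_tweets)       # tweet nodes come first
toy_extended.add_edges_from(toy_authorship)   # adds author nodes on first use
toy_extended.add_edges_from(toy_follows)      # adds remaining users (here: u3)

node_order = list(toy_extended.nodes())
print(node_order)
print(toy_extended.number_of_nodes(), toy_extended.number_of_edges())
```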


# Graph embedding
Create a graph embedding for our graphs

For now the best options seem to be _node2vec_ and _Structural Deep Network Embedding_ (also used in Mishra 2018). <br>
I found everything here: https://towardsdatascience.com/graph-embeddings-the-summary-cc6075aba007 <br>
<br>
REF: https://github.com/eliorc/node2vec
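
For intuition, node2vec turns the graph into "sentences" of node ids by running random walks and then trains a word2vec model on them. The sketch below generates plain uniform random walks on an invented toy graph. This is a simplification: the actual node2vec walks are biased by its return and in-out parameters `p` and `q`. The helper name `uniform_random_walks` is hypothetical and not part of the `node2vec` package.

```python
import random
import networkx as nx

def uniform_random_walks(graph, walk_length, num_walks, seed=0):
    """Return num_walks walks per node, each at most walk_length nodes long."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in graph.nodes():
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:          # isolated node: the walk stops immediately
                    break
                walk.append(rng.choice(neighbors))
            walks.append([str(node) for node in walk])  # node ids as string tokens
    return walks

toy = nx.Graph([("a", "b"), ("b", "c"), ("c", "a")])
toy.add_node("d")                          # isolated user
walks = uniform_random_walks(toy, walk_length=5, num_walks=2)
print(len(walks), walks[0])
```

Isolated nodes only ever produce one-element walks, so their embeddings carry no neighbourhood information.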

In [None]:
from node2vec import Node2Vec
import pickle

In [None]:
#create node2vec embedding object
node2vec = Node2Vec(G_community, dimensions=128, walk_length=10, num_walks=100, workers=1)

In [None]:
#create the embedding (this might take a long time)
model = node2vec.fit(window=20, min_count=3, batch_words=4)

In [None]:
# save both the embedding model and the embedded node vectors
# (reload the model with gensim's Word2Vec.load() and the vectors with KeyedVectors.load_word2vec_format())
EMBEDDING_FILENAME = 'pickle_files/embedded_vectors'
EMBEDDING_MODEL_FILENAME = 'pickle_files/embedding_model'
model.wv.save_word2vec_format(EMBEDDING_FILENAME)
model.save(EMBEDDING_MODEL_FILENAME)

In [None]:
# creates a dictionary where embeddings are stored for each node
# (model.wv.vectors is ordered by the word2vec vocabulary, not by G_community.nodes,
#  so we look each node up by its key; node2vec stores node ids as strings)
users_dict = {}
for usr in G_community.nodes:
    users_dict[usr] = model.wv[str(usr)]

In [None]:
# save the users_dict to pickle
with open('pickle_files/users_dict.p', 'wb') as handle:
    pickle.dump(users_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Create adjacency matrix

Here we create the adjacency matrices which can be used by our graph/social model

In [12]:
import pickle
import numpy as np

In [13]:
adjacency_matrix = nx.adjacency_matrix(G_extended)
adjacency_matrix.shape

(18880, 18880)

In [14]:
# check that every row has at least one nonzero element; this should not print anything
nonzero_rows = set(adjacency_matrix.nonzero()[0])
for i in range(adjacency_matrix.shape[0]):
    if i not in nonzero_rows:
        print(i)

In [None]:
# save the extended graph adjacency 
with open('pickle_files/adjacency_matrix_extended_graph.p', 'wb') as handle:
    pickle.dump(adjacency_matrix, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [15]:
# create a reduced adjacency matrix to save memory
# (not necessary in general, but was necessary for me due to memory constraints)
no_of_tweets = 16849

# there are 16849 tweets and the rest are users, so we save the user-user submatrix
users_adjacency_matrix = adjacency_matrix[no_of_tweets:, no_of_tweets:]
display(users_adjacency_matrix.shape)
np.save("pickle_files/users_data/users_adjacency_matrix.npy", users_adjacency_matrix.todense())

(2031, 2031)
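
The slicing above works because the rows and columns of `nx.adjacency_matrix` follow the node order of the graph, so the last rows and columns belong to the users. A sketch on an invented toy graph (it needs SciPy, which NetworkX uses for sparse matrices), including a vectorized version of the empty-row check:

```python
import numpy as np
import networkx as nx

toy = nx.Graph()
toy.add_nodes_from(["t1", "t2", "t3"])                           # tweets first
toy.add_edges_from([("t1", "u1"), ("t2", "u2"), ("t3", "u1")])   # authorship
toy.add_edges_from([("u1", "u2")])                               # follow relation

n_tweets = 3
adjacency = nx.adjacency_matrix(toy)       # sparse, ordered like list(toy.nodes())

# bottom-right block: user-user connections only
users_block = adjacency[n_tweets:, n_tweets:].toarray()
print(users_block)

# every node should have at least one connection: check all row sums in one go
row_sums = np.asarray(adjacency.sum(axis=1)).ravel()
empty_rows = np.flatnonzero(row_sums == 0)
print(empty_rows)                          # empty if no row is all zeros
```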

In [17]:
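# for each tweet, create an index to look up author information and connections
# (important if we use the reduced adjacency matrix from before)
authors_idx = np.zeros((no_of_tweets, ), dtype=int)
for tweet_idx in range(no_of_tweets):
    # column indices of the nonzero entries in this tweet's row, i.e. its author
    author_idx = adjacency_matrix[tweet_idx:tweet_idx + 1].nonzero()[1]
    # each tweet is expected to have exactly one author; shift into user-submatrix coordinates
    authors_idx[tweet_idx] = author_idx[0] - no_of_tweets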
    
# we save the array
np.save("pickle_files/users_data/authorship.npy", authors_idx)
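
As an alternative to the loop, the author of each tweet can be read from the tweet-user block of the adjacency matrix in one step. This is a sketch on an invented toy graph and assumes every tweet has exactly one author. On the real data the dense block may be large, so the loop can still be the safer choice under memory constraints.

```python
import numpy as np
import networkx as nx

toy = nx.Graph()
toy.add_nodes_from(["t1", "t2", "t3"])     # tweets first
toy.add_edges_from([("t1", "u1"), ("t2", "u2"), ("t3", "u1"), ("u1", "u2")])

n_tweets = 3
adjacency = nx.adjacency_matrix(toy)

# rows = tweets, columns = users; each row holds a single 1 at the author's column
tweet_user_block = adjacency[:n_tweets, n_tweets:].toarray()
assert (tweet_user_block.sum(axis=1) == 1).all()   # one author per tweet

authors = tweet_user_block.argmax(axis=1)  # index into the user submatrix
users = list(toy.nodes())[n_tweets:]
print(authors, [users[i] for i in authors])
```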