# From Annotations to Features

This Python notebook describes the processing of the following five files:

- `users_infected_diffusion.graphml` with the users, their attributes, and the diffusion;

- `tweets.csv` with the tweets and their respective users;

- `users_to_annotate.csv` a csv file with the 5071 users to be annotated;

- `annotated.csv` a csv file with the results of the annotation;

- `created_at.csv` a csv file with the creation date for the annotated users. This was collected after the main data collection, due to a bug in the data collection script (which has been fixed).

Into the following files:

- `users_all_neigh.csv` a csv file with the features extracted for the $100000$ users.

- `users_all_neigh_anon.csv` an anonymous version of the previous file.

- A set of files to be used by GraphSage:

    - `sw-G.json` -- A networkx-specified json file describing the input graph. Nodes have 'val' and 'test' attributes specifying if they are a part of the validation and test sets, respectively.
    - `sw-id_map.json` -- A json-stored dictionary mapping the graph node ids to consecutive integers.
    - `sw-class_map.json` -- A json-stored dictionary mapping the graph node ids to classes.
    - `sw-feats.npy` -- A numpy-stored array of node features; ordering given by `sw-id_map.json`. Can be omitted, in which case only identity features will be used.
    
## Profile-based attributes
    
We begin by extracting the median and average time between consecutive tweets for each user, using the `tweets.csv` file:

`tweets.csv` $\rightarrow$ `time_diff.csv`

In [None]:
import pandas as pd

tweets = pd.read_csv("../data/tweets.csv")
tweets.sort_values(by=["user_id", "tweet_creation"], ascending=True, inplace=True)
tweets["time_diff"] = tweets.groupby("user_id", sort=False).tweet_creation.diff()
time_diff_series_mean = tweets.groupby("user_id", sort=False).time_diff.mean()
time_diff_series_median = tweets.groupby("user_id", sort=False).time_diff.median()
time_diff = time_diff_series_mean.to_frame()
time_diff["time_diff_median"] = time_diff_series_median
time_diff.to_csv("../data/time_diff.csv")
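
As a rough illustration of what the cell above computes, the sketch below applies the same sort, per-user `diff`, and mean/median aggregation to a made-up table. The values, and the assumption that `tweet_creation` is a numeric timestamp in seconds, are ours and are not taken from `tweets.csv`; if the column held date strings it would likely need converting (for example with `pd.to_datetime`) before `diff` is meaningful.

```python
import pandas as pd

# Toy data (made up): two users, timestamps in seconds.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "tweet_creation": [100, 160, 400, 430, 50, 80],
})

toy = toy.sort_values(by=["user_id", "tweet_creation"])

# diff() inside each user group: a user's first tweet has no predecessor,
# so its time_diff is NaN and is ignored by mean() and median().
toy["time_diff"] = toy.groupby("user_id").tweet_creation.diff()

summary = toy.groupby("user_id").time_diff.agg(["mean", "median"])
print(summary)
```

A user with a single tweet ends up with `NaN` for both statistics, which is worth keeping in mind when these columns are used as features later on.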

We then combine these time differences with the diffusion graph and the annotations, attaching them to the graph as node attributes. We also calculate centrality measures for the graph: betweenness, eigenvector, in-degree and out-degree.

We also flag the **neighbors** of the users annotated as hateful or normal.

`time_diff.csv` `users_infected_diffusion.graphml` `annotated.csv` $\rightarrow$ `users_hate.graphml`

In [None]:
import networkx as nx
import time
import csv

# Read annotated users (1 = hateful, 0 = normal; any other value is skipped)

set_users = dict()

with open("../data/annotated.csv", "r") as f:
    csv_reader = csv.DictReader(f)
    for line in csv_reader:
        if line["hate"] == "1":
            set_users[line["user_id"]] = 1
        elif line["hate"] == "0":
            set_users[line["user_id"]] = 0

# Read intervals between tweets

users_interval_median = dict()
users_interval_average = dict()

with open("../data/time_diff.csv", "r") as f:
    csv_reader = csv.DictReader(f)
    for line in csv_reader:
        users_interval_median[line["user_id"]] = line["time_diff_median"]
        users_interval_average[line["user_id"]] = line["time_diff"]

# Set hate attributes

nx_graph = nx.read_graphml("../data/users_infected_diffusion.graphml")
nx.set_node_attributes(nx_graph, name="hate", values=-1)
nx.set_node_attributes(nx_graph, name="hate", values=set_users)

# Set hateful and normal neighbors attribute

nodes = nx_graph.nodes(data='hate')

hateful_neighbors = dict()
normal_neighbors = dict()

for i in nodes:
    if i[1] == 1:  # hateful node
        for j in nx_graph.neighbors(i[0]):
            hateful_neighbors[j] = True
    if i[1] == 0:
        for j in nx_graph.neighbors(i[0]):
            normal_neighbors[j] = True

nx.set_node_attributes(nx_graph, name="hateful_neighbors", values=False)
nx.set_node_attributes(nx_graph, name="hateful_neighbors", values=hateful_neighbors)
nx.set_node_attributes(nx_graph, name="normal_neighbors", values=False)
nx.set_node_attributes(nx_graph, name="normal_neighbors", values=normal_neighbors)

# Set median and average interval attributes

nx.set_node_attributes(nx_graph, name="median_interval", values=users_interval_median)
nx.set_node_attributes(nx_graph, name="average_interval", values=users_interval_average)

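# Set node network-based attributes, such as betweenness and eigenvector
vt = time.time()
# k=16258 estimates betweenness from a sample of 16258 nodes instead of all nodes
betweenness = nx.betweenness_centrality(nx_graph, k=16258, normalized=False)
eigenvector = nx.eigenvector_centrality(nx_graph)
in_degree = nx.in_degree_centrality(nx_graph)
out_degree = nx.out_degree_centrality(nx_graph)
print("centrality measures computed in {0:.1f}s".format(time.time() - vt))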

nx.set_node_attributes(nx_graph, name="betweenness", values=betweenness)
nx.set_node_attributes(nx_graph, name="eigenvector", values=eigenvector)
nx.set_node_attributes(nx_graph, name="in_degree", values=in_degree)
nx.set_node_attributes(nx_graph, name="out_degree", values=out_degree)

nx.write_graphml(nx_graph, "../data/users_hate.graphml")
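
To make the neighbor flags and the degree centralities concrete, here is a small sketch on a made-up directed graph. The node names and labels are hypothetical. Note that on a directed graph `neighbors()` returns successors, so a node is flagged when an annotated user has an edge pointing to it.

```python
import networkx as nx

# Toy directed graph (made up for illustration).
g = nx.DiGraph([("a", "b"), ("a", "c"), ("d", "c"), ("c", "e")])
labels = {"a": 1, "d": 0}  # 1 = hateful, 0 = normal; other nodes are unannotated

nx.set_node_attributes(g, name="hate", values=-1)      # default for every node
nx.set_node_attributes(g, name="hate", values=labels)  # overwrite annotated ones

hateful_neighbors = {n: False for n in g}
normal_neighbors = {n: False for n in g}
for node, hate in g.nodes(data="hate"):
    for neigh in g.neighbors(node):  # successors of `node` in a DiGraph
        if hate == 1:
            hateful_neighbors[neigh] = True
        elif hate == 0:
            normal_neighbors[neigh] = True

# In-degree centrality is the in-degree divided by (number of nodes - 1).
in_degree = nx.in_degree_centrality(g)
print(hateful_neighbors, normal_neighbors, in_degree["c"])
```

Here `c` is reached from both a hateful and a normal user, so it carries both flags, which is the situation the two separate attributes are meant to capture.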

We then create a csv file with users and these attributes:

    user_id            - unique identifier of a user 
    hate               - hateful|normal|other
    hate_neigh         - True|False
    normal_neigh       - True|False
    statuses_count     - number of statuses
    followers_count    - number of followers
    followees_count    - number of followees
    favorites_count    - number of favorites
    listed_count       - number of lists that include the user
    median_int         - median interval between tweets
    average_int        - average interval between tweets
    betweenness        - centrality measure
    eigenvector        - centrality measure
    in_degree          - centrality measure
    out_degree         - centrality measure
    
`users_hate.graphml` $\rightarrow$ `users_attributes.csv`

In [None]:
import networkx as nx
import pandas as pd

nx_graph = nx.read_graphml("../data/users_hate.graphml")

hate = nx.get_node_attributes(nx_graph, "hate")
hate_n = nx.get_node_attributes(nx_graph, "hateful_neighbors")
normal_n = nx.get_node_attributes(nx_graph, "normal_neighbors")
betweenness = nx.get_node_attributes(nx_graph, "betweenness")
eigenvector = nx.get_node_attributes(nx_graph, "eigenvector")
in_degree = nx.get_node_attributes(nx_graph, "in_degree")
out_degree = nx.get_node_attributes(nx_graph, "out_degree")
statuses_count = nx.get_node_attributes(nx_graph, "statuses_count")
followers_count = nx.get_node_attributes(nx_graph, "followers_count")
followees_count = nx.get_node_attributes(nx_graph, "followees_count")
favorites_count = nx.get_node_attributes(nx_graph, "favorites_count")
listed_count = nx.get_node_attributes(nx_graph, "listed_count")
median_interval = nx.get_node_attributes(nx_graph, "median_interval")
average_interval = nx.get_node_attributes(nx_graph, "average_interval")

users = []

for user_id in hate.keys():
    hateful = "other"

    if hate[user_id] == 1:
        hateful = "hateful"

    elif hate[user_id] == 0:
        hateful = "normal"

    median_int = None if user_id not in median_interval else median_interval[user_id]

    average_int = None if user_id not in average_interval else average_interval[user_id]

    users.append((user_id, hateful, hate_n[user_id], normal_n[user_id],  # General Stuff
                  statuses_count[user_id], followers_count[user_id], followees_count[user_id],
                  favorites_count[user_id], listed_count[user_id], median_int,  average_int,  # Numeric attributes
                  betweenness[user_id], eigenvector[user_id],  # Network Attributes
                  in_degree[user_id], out_degree[user_id]))

columns = ["user_id", "hate", "hate_neigh", "normal_neigh", "statuses_count", "followers_count", "followees_count",
           "favorites_count", "listed_count", "median_int", "average_int",
           "betweenness", "eigenvector", "in_degree", "out_degree"]

df = pd.DataFrame.from_records(users, columns=columns)
df.to_csv("../data/users_attributes.csv", index=False)
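
The cell above gathers each attribute into its own dictionary before building the rows. As an alternative sketch (not the approach used in this notebook), the node attributes can be turned into a DataFrame in one step. The toy graph and its attribute values below are made up, and only a few of the real columns are shown.

```python
import networkx as nx
import pandas as pd

# Toy graph with a handful of node attributes (hypothetical values).
g = nx.DiGraph()
g.add_node("u1", hate=1, followers_count=10, betweenness=0.5)
g.add_node("u2", hate=-1, followers_count=3, betweenness=0.0)

# g.nodes(data=True) yields (node, attribute_dict) pairs, so dict(...) gives
# {node: {attr: value}}; orient="index" makes one row per node.
df = pd.DataFrame.from_dict(dict(g.nodes(data=True)), orient="index")
df.index.name = "user_id"

# Translate the numeric label into the hateful|normal|other strings.
df["hate"] = df["hate"].map({1: "hateful", 0: "normal"}).fillna("other")
df = df.reset_index()
print(df)
```

Nodes that lack an attribute simply get `NaN` in that column, which plays the role of the explicit `None` handling for the interval attributes above.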

Now we link `created_at.csv` with the annotations.

`created_at.csv` `users_attributes.csv` $\rightarrow$ `created_at_hate.csv`

In [None]:
import pandas as pd

# This is in case created_at_hate.csv is not generated
df1 = pd.read_csv("./created_at.csv")
df2 = pd.read_csv("./users_attributes.csv")
hateful = dict()

# Keep only the annotated users (user_id -> "hateful" or "normal")
for _, row in df2.iterrows():
    if row["hate"] in ("hateful", "normal"):
        hateful[row["user_id"]] = row["hate"]

# The first column of created_at.csv holds the user id;
# users without an annotation get the string "None"
to_append = list()
for _, row in df1.iterrows():
    to_append.append(hateful.get(row.iloc[0], "None"))

df1['hate'] = pd.Series(to_append, index=df1.index)

df1.to_csv("created_at_hate.csv", index=False)
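
The same linking can also be expressed as a left join. The sketch below uses made-up data and assumes that `created_at.csv` has `user_id` and `created_at` columns, which is our guess and is not confirmed by this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for created_at.csv and users_attributes.csv.
created = pd.DataFrame({
    "user_id": [1, 2, 3],
    "created_at": ["2010-01-01", "2012-05-05", "2015-09-09"],
})
attrs = pd.DataFrame({
    "user_id": [1, 2, 3],
    "hate": ["hateful", "other", "normal"],
})

# Keep only annotated users, then left-join so that every row of `created`
# survives; users without an annotation get the string "None".
labels = attrs.loc[attrs["hate"].isin(["hateful", "normal"]), ["user_id", "hate"]]
linked = created.merge(labels, on="user_id", how="left")
linked["hate"] = linked["hate"].fillna("None")
print(linked)
```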

---

## Content-based attributes


In [None]:
from multiprocessing import Process
import pandas as pd
import numpy as np
import time
import csv
import re

# Raw strings keep the regex backslashes from being read as string escapes
hashtags = re.compile(r"#(\w+)")
regex_mentions = re.compile(r"@(\w+)")
urls = re.compile(r"http(s)?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+")
regex_bad_words = re.compile("(" + "|".join(pd.read_csv("../data/bad_words.txt")["words"].values) + ")")


def mentions(tweets):
    ments = []
    for tweet in tweets.values:
        ments += regex_mentions.findall(tweet)

    return len(ments)

def bad_words(tweets):
    baddies = []
    for tweet in tweets.values:
        baddies += regex_bad_words.findall(tweet)

    return len(baddies)


def urls_all(tweets):
    urlss = []
    for tweet in tweets.values:
        urlss += urls.findall(tweet)

    return len(urlss)


def hashtags_all(tweets):
    hts = []
    for tweet in tweets.values:
        hts += hashtags.findall(tweet)

    return len(hts), " ".join(hts)


def get_values(tweets):
    c = 0
    for tweet in tweets.values:
        if tweet != "":
            c += 1

    return c


def tweet_size(tweets):
    c = 0
    for tweet in tweets.values:
        c += len(tweet)
    return c/len(tweets)


def processing(vals, columns, iterv):
    users = pd.DataFrame(vals)
    users = users[columns]

    print("{0}-------------".format(iterv))

    # HASHTAGS

    users["any_text"] = users["tweet_text"] + users["rt_text"] + users["qt_text"]
    users_hashtags = users.groupby(["user_id"])["any_text"].apply(lambda x: hashtags_all(x))
    hashtags_cols = np.array(list(users_hashtags.values))
    df_hashtags = pd.DataFrame(hashtags_cols, columns=["number hashtags", "hashtags"], index=users_hashtags.index)
    df_hashtags.index.names = ['user_id']

    # TWEETS NUMBER

    df_tweet_number = users.groupby(["user_id"])["tweet_text"].apply(lambda x: get_values(x)).reset_index()
    df_tweet_number.set_index("user_id", inplace=True)
    df_tweet_number.columns = ["tweet number"]

    df_retweet_number = users.groupby(["user_id"])["rt_text"].apply(lambda x: get_values(x)).reset_index()
    df_retweet_number.set_index("user_id", inplace=True)
    df_retweet_number.columns = ["retweet number"]

    df_quote_number = users.groupby(["user_id"])["qt_text"].apply(lambda x: get_values(x)).reset_index()
    df_quote_number.set_index("user_id", inplace=True)
    df_quote_number.columns = ["quote number"]

    df_tweet_length = users.groupby(["user_id"])["any_text"].apply(lambda x: tweet_size(x)).reset_index()
    df_tweet_length.set_index("user_id", inplace=True)
    df_tweet_length.columns = ["status length"]

    df_urls = users.groupby(["user_id"])["any_text"].apply(lambda x: urls_all(x)).reset_index()
    df_urls.set_index("user_id", inplace=True)
    df_urls.columns = ["number urls"]

    df_baddies = users.groupby(["user_id"])["any_text"].apply(lambda x: bad_words(x)).reset_index()
    df_baddies.set_index("user_id", inplace=True)
    df_baddies.columns = ["baddies"]

    df_mentions = users.groupby(["user_id"])["any_text"].apply(lambda x: mentions(x)).reset_index()
    df_mentions.set_index("user_id", inplace=True)
    df_mentions.columns = ["mentions"]

    df = pd.DataFrame(pd.concat([df_hashtags, df_tweet_number, df_retweet_number, df_quote_number,
                                 df_tweet_length, df_urls, df_baddies, df_mentions], axis=1))
    df.to_csv("../data/tmp2/users_content_{0}.csv".format(iterv))
    print("-------------{0}".format(iterv))


f = open("../data/tweets.csv", "r")

cols = ["user_id", "screen_name", "tweet_id", "tweet_text", "tweet_creation", "tweet_fav", "tweet_rt", "rp_flag",
        "rp_status", "rp_user", "qt_flag", "qt_user_id", "qt_status_id", "qt_text", "qt_creation", "qt_fav",
        "qt_rt", "rt_flag", "rt_user_id", "rt_status_id", "rt_text", "rt_creation", "rt_fav", "rt_rt"]

csv_dict_reader = csv.DictReader(f)

acc_vals = []

iter_vals, count, count_max, last_u, v = 1, 0, 50000, None, []
for line in csv_dict_reader:
    if last_u is not None and last_u != line["user_id"]:
        acc_vals.append((v, cols, iter_vals))

        count, last_u, v = 0, None, []
        iter_vals += 1

    if len(acc_vals) == 2:
        s = time.time()
        processes = []
        for i in acc_vals:
            p = Process(target=processing, args=(i[0], i[1], i[2]))
            processes.append(p)
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        print(time.time() - s)
        acc_vals = []

    v.append(line)
    count += 1
    if count >= count_max:
        last_u = line["user_id"]

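f.close()

# Rows still buffered in `v` form the last (possibly smaller) chunk;
# without this step they would never be processed
if v:
    acc_vals.append((v, cols, iter_vals))

s = time.time()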
processes = []
for i in acc_vals:
    p = Process(target=processing, args=(i[0], i[1], i[2]))
    processes.append(p)
for p in processes:
    p.start()
for p in processes:
    p.join()
print(time.time() - s)
acc_vals = []
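
To show what the per-user counts in the cell above look like, here is a minimal sketch on two made-up tweets. The function name `content_counts` is hypothetical, and the URL pattern is a deliberately simplified stand-in for the longer one used in the notebook.

```python
import re

hashtag_re = re.compile(r"#(\w+)")
mention_re = re.compile(r"@(\w+)")
# Simplified URL pattern, for illustration only.
url_re = re.compile(r"https?://\S+")


def content_counts(tweets):
    """Count hashtags, mentions and URLs over a list of tweet strings."""
    hashtags = [h for t in tweets for h in hashtag_re.findall(t)]
    return {
        "number hashtags": len(hashtags),
        "hashtags": " ".join(hashtags),
        "mentions": sum(len(mention_re.findall(t)) for t in tweets),
        "number urls": sum(len(url_re.findall(t)) for t in tweets),
        "status length": sum(len(t) for t in tweets) / len(tweets),
    }


toy_tweets = ["hello @bob #python", "see https://example.com #data #python"]
counts = content_counts(toy_tweets)
print(counts)
```

Repeated hashtags are kept in the joined string, as in the notebook's `hashtags_all`, so the column can later be used to count how often a user employs each tag.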

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from multiprocessing import Process
from empath import Empath
import pandas as pd
import numpy as np
import textblob
import spacy
import time
import csv
import re


stopWords = set(stopwords.words('english'))
nlp = spacy.load('en_core_web_lg')
prog = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z' \t])|(\w+:\/\/\S+)")
prog2 = re.compile(" +")
lexicon = Empath()
empath_cols = ["{0}_empath".format(v) for v in lexicon.cats.keys()]
glove_cols = ["{0}_glove".format(v) for v in range(300)]


def lemmatization(x, nlp):
    tweets = " ".join(list(x.values))
    letters_only = prog.sub(" ", tweets)
    lemmatized = []
    for token1 in nlp(letters_only):
        # Compare the token's text (not the Token object) with the stop-word list
        if token1.lemma_ != "-PRON-" and token1.text.lower() not in stopWords:
            lemmatized.append(token1.lemma_)
        else:
            lemmatized.append(token1.text)
    final = prog2.sub(" ", " ".join(lemmatized))
    return final


def empath_analysis(x):
    val = lexicon.analyze(x, normalize=True)
    if val is None:
        return lexicon.analyze(x)
    else:
        return val


def processing(vals, columns, iterv):
    users = pd.DataFrame(vals)
    users = users[columns]

    print("{0}-------------".format(iterv))

    # PRE-PROCESSING

    users["any_text"] = users["tweet_text"] + users["rt_text"] + users["qt_text"]
    users_text = users.groupby(["user_id"])["any_text"].apply(lambda x: lemmatization(x, nlp)).reset_index()
    print("{0}-------------PRE-PROCESSING".format(iterv))

    # GLOVE ANALYSIS

    glove_arr = np.array(list(users_text["any_text"].apply(lambda x: list(nlp(x).vector)).values))
    df_glove = pd.DataFrame(glove_arr, columns=glove_cols, index=users_text.user_id.values)
    print("{0}-------------GLOVE".format(iterv))

    # SENTIMENT ANALYSIS

    sentiment_arr = np.array(list(users_text["any_text"].apply(lambda x: textblob.TextBlob(str(x)).sentiment).values))
    sentiment_cols = ["sentiment", "subjectivity"]
    df_sentiment = pd.DataFrame(sentiment_arr, columns=sentiment_cols, index=users_text.user_id.values)
    print("{0}-------------SENTIMENT".format(iterv))

    # EMPATH ANALYSIS

    lexicon_arr = np.array(list(users_text["any_text"].apply(lambda x: empath_analysis(x)).values))
    df_empath = pd.DataFrame.from_records(index=users_text.user_id.values, data=lexicon_arr)
    df_empath.columns = empath_cols
    print("{0}-------------EMPATH".format(iterv))

    # MERGE TO SINGLE

    df = pd.DataFrame(pd.concat([df_empath, df_sentiment, df_glove], axis=1))
    # The user ids are already the index; name it so the CSV header reads "user_id"
    df.index.name = "user_id"
    df.to_csv("../data/tmp/users_content_{0}.csv".format(iterv))
    print("-------------{0}".format(iterv))


f = open("../data/tweets.csv", "r")

cols = ["user_id", "screen_name", "tweet_id", "tweet_text", "tweet_creation", "tweet_fav", "tweet_rt", "rp_flag",
        "rp_status", "rp_user", "qt_flag", "qt_user_id", "qt_status_id", "qt_text", "qt_creation", "qt_fav",
        "qt_rt", "rt_flag", "rt_user_id", "rt_status_id", "rt_text", "rt_creation", "rt_fav", "rt_rt"]

csv_dict_reader = csv.DictReader(f)

acc_vals = []

iter_vals, count, count_max, last_u, v = 1, 0, 50000, None, []
for line in csv_dict_reader:
    if last_u is not None and last_u != line["user_id"]:
        # s = time.time()
        # processing(v, cols, iter_vals)
        # print(time.time() - s)
        acc_vals.append((v, cols, iter_vals))

        count, last_u, v = 0, None, []
        iter_vals += 1

    if len(acc_vals) == 2:
        s = time.time()
        processes = []
        for i in acc_vals:
            p = Process(target=processing, args=(i[0], i[1], i[2]))
            processes.append(p)
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        print(time.time() - s)
        acc_vals = []

    v.append(line)
    count += 1
    if count >= count_max:
        last_u = line["user_id"]

f.close()

# Rows still buffered in `v` form the last (possibly smaller) chunk;
# without this step they would never be processed
if v:
    acc_vals.append((v, cols, iter_vals))

s = time.time()
processes = []
for i in acc_vals:
    p = Process(target=processing, args=(i[0], i[1], i[2]))
    processes.append(p)
for p in processes:
    p.start()
for p in processes:
    p.join()
print(time.time() - s)
acc_vals = []
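
Both content cells read `tweets.csv` in chunks of at least `count_max` rows and only close a chunk when the `user_id` changes, so that all tweets of a user land in the same chunk. The generator below is a minimal sketch of that idea on its own. The name `chunk_by_user` is hypothetical, and it assumes, as the cells above do, that the rows are already grouped by user.

```python
def chunk_by_user(rows, min_rows):
    """Yield lists of rows holding at least `min_rows` rows each (except
    possibly the last), never splitting one user's rows across two chunks.

    Assumes `rows` is already grouped by "user_id".
    """
    chunk = []
    for row in rows:
        # Close the chunk only once it is large enough AND the user changes.
        if len(chunk) >= min_rows and row["user_id"] != chunk[-1]["user_id"]:
            yield chunk
            chunk = []
        chunk.append(row)
    if chunk:  # trailing, possibly smaller, chunk
        yield chunk


rows = [{"user_id": u} for u in "aaabbcdddd"]
chunks = list(chunk_by_user(rows, min_rows=4))
chunk_users = ["".join(r["user_id"] for r in c) for c in chunks]
print(chunk_users)
```

Each yielded chunk could then be handed to a worker process in the same way the cells above pass `(v, cols, iter_vals)` to `processing`.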

In [None]:
%%bash
OutFileName="../data/users_content.csv"            # Fix the output name
i=0                                                # Reset a counter
for filename in ../data/tmp/*.csv; do
    if [ "$filename"  != "$OutFileName" ] ;        # Avoid recursion
    then
    if [[ $i -eq 0 ]] ; then
       head -1  $filename >   $OutFileName         # Copy header if it is the first file
    fi
    tail -n +2  $filename >>  $OutFileName         # Append from the 2nd line each file
    i=$(( $i + 1 ))                                # Increase the counter
    fi
done
OutFileName="../data/users_content2.csv"           # Fix the output name
i=0                                                # Reset a counter
for filename in ../data/tmp2/*.csv; do
    if [ "$filename"  != "$OutFileName" ] ;        # Avoid recursion
    then
    if [[ $i -eq 0 ]] ; then
       head -1  $filename >   $OutFileName         # Copy header if it is the first file
    fi
    tail -n +2  $filename >>  $OutFileName         # Append from the 2nd line each file
    i=$(( $i + 1 ))                                # Increase the counter
    fi
done
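
For readers who prefer to stay in Python, the concatenation performed by the bash cell above can be sketched as follows. The helper name `concat_csvs` is hypothetical, and the demonstration writes two throwaway files to a temporary directory instead of touching `../data/tmp`.

```python
import glob
import os
import tempfile


def concat_csvs(pattern, out_path):
    """Concatenate the CSV files matching `pattern` into `out_path`,
    keeping the header line of the first file only."""
    out_abs = os.path.abspath(out_path)
    paths = sorted(p for p in glob.glob(pattern) if os.path.abspath(p) != out_abs)
    with open(out_path, "w") as out:
        for i, path in enumerate(paths):
            with open(path) as part:
                header = part.readline()
                if i == 0:
                    out.write(header)  # copy the header once
                out.writelines(part)   # copy the remaining lines
    return len(paths)


# Demonstration on two small made-up files.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("p1.csv", "user_id,x\n1,a\n"), ("p2.csv", "user_id,x\n2,b\n")]:
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write(body)
    out_file = os.path.join(tmp, "merged.out")
    n = concat_csvs(os.path.join(tmp, "*.csv"), out_file)
    with open(out_file) as fh:
        merged = fh.read()

print(n, repr(merged))
```

Like the `head`/`tail` version, this assumes every part file has the same single-line header.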

In [None]:
import pandas as pd

users_attributes = pd.read_csv("../data/users_attributes.csv")
users_content = pd.read_csv("../data/users_content.csv")
users_content2 = pd.read_csv("../data/users_content2.csv")

df = pd.merge(users_attributes, users_content, on="user_id", how="inner")
df = pd.merge(df, users_content2, on="user_id", how="inner")

df.to_csv("../data/users_all.csv")
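
Because both merges use `how="inner"`, a user missing from any one of the three files is silently dropped from `users_all.csv`. The sketch below, on made-up data, shows one way to check which users would be lost, using the `indicator` option of `pd.merge`.

```python
import pandas as pd

# Hypothetical stand-ins for users_attributes.csv and users_content.csv.
attrs = pd.DataFrame({"user_id": [1, 2, 3], "hate": ["hateful", "normal", "other"]})
content = pd.DataFrame({"user_id": [2, 3, 4], "mentions": [5, 0, 7]})

# The inner join keeps only users present in both tables.
inner = pd.merge(attrs, content, on="user_id", how="inner")

# An outer join with indicator=True adds a "_merge" column that records
# whether each row came from the left table, the right table, or both.
check = pd.merge(attrs, content, on="user_id", how="outer", indicator=True)
dropped = check.loc[check["_merge"] != "both", "user_id"].tolist()
print(inner["user_id"].tolist(), sorted(dropped))
```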