# Moderator Mining

We are building a home-brew index for identifying moderating statements.
- [ ] compute the variance of author timelines for a set of tweets (topic_var)
- [ ] compute the sentiment delta function for each tweet
- [ ] compute the topic delta function for each tweet
- [ ] compute unweighted moderator index
- [ ] display the tweets with the highest m_index rating

In [13]:
from util.sql_switch import get_query_native

# and bertopic_id >= 0"
df_conversations = get_query_native(
    "SELECT tw.id, tw.text, tw.author_id, tw.bertopic_id, tw.bert_visual, tw.conversation_id, tw.sentiment_value, tw.created_at, a.timeline_bertopic_id \
     FROM delab_tweet tw join delab_tweetauthor a on tw.author_id = a.twitter_id where language = 'en' and a.timeline_bertopic_id > 0 and a.has_timeline is TRUE")
df_conversations.head(2)

using postgres


Unnamed: 0,id,text,author_id,bertopic_id,bert_visual,conversation_id,sentiment_value,created_at,timeline_bertopic_id
0,9721,"@Magogo232 @MadiBoity This is a good question, but then again they will say, it's the ANC of the time that employed clueless persons.",891933923500056577,-2,5_zuma_anc_africa_africans_african_zulu_afrikanism_lesotho_coalitions_guptas,1451256052205379596,15.692247,2021-10-22 21:02:52+00:00,5
1,9729,"@Nqobaz007 @SabeloComputer @MadiBoity @coolkat_1 The ANC will win as always, others parties are just hoping for the best.",1220343987879448576,-2,5_zuma_anc_africa_africans_african_zulu_afrikanism_lesotho_coalitions_guptas,1451256052205379596,-3.785848,2021-10-25 06:47:20+00:00,5


The `bertopic_id` is the topic assigned to the tweet. The BERTopic model also provides a probability, but
the distribution is stable and changes only when the model is recomputed.

A "distance" between two topic ids can be calculated by summing the word vectors of each topic's representative
words into a topic vector and taking the cosine distance between the two topic vectors. The word vectors are loaded
from fastText. The probability of finding a fastText vector for a given word is about 92 %. As all word vectors
reside in the same linear space with a fixed number of dimensions, the distance between topic A (word vectors $a_i$)
and topic B (word vectors $b_i$) is: $$ A = \sum_i a_i, \quad B = \sum_i b_i, \quad d(A, B) = 1 - \frac{A \cdot B}{\Vert A \Vert \Vert B \Vert} $$
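
As a minimal sketch of this distance, the following standalone example uses made-up two-dimensional vectors in place of the fastText vectors. The function name `topic_cosine_distance` is hypothetical and not part of the notebook.

```python
import numpy as np


def topic_cosine_distance(word_vectors_a, word_vectors_b):
    """Cosine distance between two topics, each given as rows of word vectors."""
    a = np.asarray(word_vectors_a, dtype=float).sum(axis=0)  # topic vector A
    b = np.asarray(word_vectors_b, dtype=float).sum(axis=0)  # topic vector B
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0:
        # the distance is undefined when a topic vector has zero length
        return float("nan")
    return 1.0 - float(a @ b) / norm


# toy "word vectors", invented for illustration only
topic_a = [[1.0, 0.0], [1.0, 0.0]]
topic_b = [[0.0, 2.0]]
topic_c = [[3.0, 0.0]]

d_ab = topic_cosine_distance(topic_a, topic_b)  # orthogonal topic vectors
d_ac = topic_cosine_distance(topic_a, topic_c)  # topic vectors pointing the same way
```

Because the cosine does not change when a vector is scaled by a positive factor, summing the word vectors and averaging them give the same distance.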

The formula for suggesting a moderator candidate combines the following ideas:

1. sentiment change after and before the candidate tweet
2. number of authors involved
3. number of deletions by the Twitter devs (should be lower after)
4. number of arguments used after and before
5. number of ethotic attacks / personal attacks after and before
6. the topic variance of author timelines should be higher after
   (because then more authors with different backgrounds would be involved)
7. the sentiment of the tweet should be more or less neutral

![Formula for the Moderator Measurement](notebooks/moderator_index.jpg)

# 1. Sentiment Changes

In the following we compute a column that represents whether the conversation sentiment
has changed for the better after the tweet.

In [14]:
from tqdm import tqdm

df_conversations = df_conversations.sort_values(by=['conversation_id', 'created_at'])
df_conversations.reset_index(drop=True, inplace=True)
df_conversations.head(10)


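def compute_sentiment_change_candidate(df):
    """
    Compute, for every tweet, the summed sentiment of the tweets after it minus the
    summed sentiment of the tweets before it, within the same conversation.

    :param df: tweets sorted by conversation_id and created_at, with a 0..n-1 index
    :return: list with one sentiment change per row of df
    """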
    n = len(df.sentiment_value)
    result = []
    for index in tqdm(range(n)):
        candidate_sentiment_value = 0
        conversation_id = df.at[index, "conversation_id"]
        conversation_length = df[df["conversation_id"] == conversation_id].conversation_id.count()
        # print(conversation_length)
        # by definition, the candidate cannot come later in the conversation than the middle
        for index_delta in range(conversation_length // 2):
            previous_tweets_index = index - index_delta
            following_tweets_index = index + index_delta
            # we assert that there are as many predecessors as there are followers
            if previous_tweets_index > 0 and following_tweets_index < n:
                if (df.at[previous_tweets_index, "conversation_id"] == conversation_id and
                        df.at[following_tweets_index, "conversation_id"] == conversation_id
                ):
                    candidate_sentiment_value -= df.at[previous_tweets_index, "sentiment_value"]
                    candidate_sentiment_value += df.at[following_tweets_index, "sentiment_value"]
        result.append(candidate_sentiment_value)
    return result


candidate_sentiment_values = compute_sentiment_change_candidate(df_conversations)
df_conversations = df_conversations.assign(candidate_sentiment_value=candidate_sentiment_values)
# df_conversations.head(3)


100%|██████████| 2009/2009 [00:02<00:00, 955.48it/s] 
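
To see what this column measures, here is a small standalone sketch on invented data. It is not the cell above: the helper name `sentiment_change` is hypothetical, and as simplifications of this sketch it skips offset 0 (where the candidate itself would be subtracted and added again) and lets row 0 count as a predecessor.

```python
import pandas as pd


def sentiment_change(df, index):
    """Sentiment after minus sentiment before row `index`, within its conversation."""
    conversation_id = df.at[index, "conversation_id"]
    length = int((df["conversation_id"] == conversation_id).sum())
    change = 0.0
    # same maximum offset as range(length // 2), but without offset 0
    for delta in range(1, length // 2):
        before, after = index - delta, index + delta
        if before < 0 or after >= len(df):
            continue  # one side of the window falls outside the frame
        if (df.at[before, "conversation_id"] == conversation_id
                and df.at[after, "conversation_id"] == conversation_id):
            change += df.at[after, "sentiment_value"] - df.at[before, "sentiment_value"]
    return change


# invented data: a conversation that warms up, followed by a short second one
toy = pd.DataFrame({
    "conversation_id": [1, 1, 1, 1, 1, 2, 2],
    "sentiment_value": [-3.0, -1.0, 0.0, 2.0, 4.0, 5.0, -5.0],
})
changes = [sentiment_change(toy, i) for i in range(len(toy))]
```

Rows in the middle of the first conversation receive a positive value because the tweets after them are more positive than the tweets before them. Rows at a conversation boundary and rows in conversations with fewer than four tweets receive 0.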


# 2. Number of Authors Involved

The idea here is that it is beneficial (and more deliberative) if more authors take part in
a conversation after the moderating tweet than before it.

In [15]:
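def compute_number_of_authors_changed(df):
    """
    Compute, for every tweet, the number of distinct authors after it minus the
    number of distinct authors before it, within the same conversation.

    :param df: tweets sorted by conversation_id and created_at, with a 0..n-1 index
    :return: list with one author-count change per row of df
    """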
    n = len(df.sentiment_value)
    result = []
    for index in tqdm(range(n)):
        candidate_number_authors_before = set()
        candidate_number_authors_after = set()
        conversation_id = df.at[index, "conversation_id"]
        conversation_length = df[df["conversation_id"] == conversation_id].conversation_id.count()
        # print(conversation_length)
        # by definition, the candidate cannot come later in the conversation than the middle
        for index_delta in range(conversation_length // 2):
            previous_tweets_index = index - index_delta
            following_tweets_index = index + index_delta
            # we assert that there are as many predecessors as there are followers
            if previous_tweets_index > 0 and following_tweets_index < n:
                if (df.at[previous_tweets_index, "conversation_id"] == conversation_id and
                        df.at[following_tweets_index, "conversation_id"] == conversation_id
                ):
                    candidate_number_authors_before.add(df.at[previous_tweets_index, "author_id"])
                    candidate_number_authors_after.add(df.at[following_tweets_index, "author_id"])
        result.append(len(candidate_number_authors_after) - len(candidate_number_authors_before))
    return result


candidate_author_numbers = compute_number_of_authors_changed(df_conversations)
df_conversations = df_conversations.assign(candidate_author_number_changed=candidate_author_numbers)
# df_conversations.head(3)

100%|██████████| 2009/2009 [00:02<00:00, 899.32it/s] 
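
The windowing loop in `compute_number_of_authors_changed` is the same as in the sentiment step. One possible refactoring, sketched here with the hypothetical helper names `window_pairs` and `author_count_change`, moves that loop into a generator so that each measure only states what it collects. The toy data is invented.

```python
import pandas as pd


def window_pairs(df, index):
    """Yield (before, after) row labels at equal offsets around `index`."""
    conversation_id = df.at[index, "conversation_id"]
    length = int((df["conversation_id"] == conversation_id).sum())
    for delta in range(length // 2):
        before, after = index - delta, index + delta
        # mirrors the cell above, including the `> 0` check that skips row 0
        if before > 0 and after < len(df):
            if (df.at[before, "conversation_id"] == conversation_id
                    and df.at[after, "conversation_id"] == conversation_id):
                yield before, after


def author_count_change(df, index):
    """Distinct authors after the candidate minus distinct authors before it."""
    authors_before, authors_after = set(), set()
    for before, after in window_pairs(df, index):
        authors_before.add(df.at[before, "author_id"])
        authors_after.add(df.at[after, "author_id"])
    return len(authors_after) - len(authors_before)


# invented data: one conversation in which new authors join late
toy = pd.DataFrame({
    "conversation_id": [1, 1, 1, 1, 1, 1],
    "author_id": ["a", "a", "a", "b", "c", "d"],
})
author_changes = [author_count_change(toy, i) for i in range(len(toy))]
```

With such a helper, the sentiment and topic measures could iterate over the same pairs instead of repeating the index arithmetic.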


# 3. Topic Variance in Author Timelines

The basic idea here is that an author's timeline represents their general interests. The more diverse the authors are,
the better it is for the conversation (hypothesis).

In [16]:
import os
import torch
import re
import pandas as pd
from collections import defaultdict
from bertopic import BERTopic
import json
from scipy import spatial
import numpy as np

torch.cuda.empty_cache()


# a utility function for retrieving the words of a single topic, given its list of (word, score) pairs
def topic2wordvec(topic_model):
    result = []
    for t_word in topic_model:
        str_w = t_word[0]
        result.append(str_w)
    return result


# loading the bertopic model
BERTOPIC_MODEL_LOCATION = "BERTopic"
# load is a classmethod, so no BERTopic instance has to be constructed first
bertopic_model = BERTopic.load(BERTOPIC_MODEL_LOCATION,
                               embedding_model="sentence-transformers/all-mpnet-base-v2")
topic_info = bertopic_model.get_topic_info()

# create topic-word map
topic2word = defaultdict(list)
for topic_id in topic_info.Topic:
    topic_model = bertopic_model.get_topic(topic_id)
    words = topic2wordvec(topic_model)
    topic2word[topic_id] = topic2word[topic_id] + words

# loading the word vectors from the database (maybe this needs filtering at some point)
word2vec = get_query_native(
    "SELECT word, ft_vector from delab_topicdictionary")


# a function that computes the cosine distance between the averaged word vectors of two topics
def get_topic_delta(topic_id_1, topic_id_2):
    words1 = topic2word.get(topic_id_1)
    words2 = topic2word.get(topic_id_2)
    if words1 is not None and words2 is not None:
        filtered_w2v1 = word2vec[word2vec["word"].isin(words1)]
        filtered_w2v2 = word2vec[word2vec["word"].isin(words2)]
        ft_vectors_1 = filtered_w2v1.ft_vector.apply(lambda x: pd.Series(json.loads(x)))
        ft_vectors_2 = filtered_w2v2.ft_vector.apply(lambda x: pd.Series(json.loads(x)))
        len1 = len(ft_vectors_1)
        len2 = len(ft_vectors_2)
        if len1 == 0 or len2 == 0:
            # no fastText vector was found for any word of at least one topic
            return np.nan
        mean_v1 = (ft_vectors_1.sum(axis=0) / len1)  # we assume the vectors are embedded in a linear space
        mean_v2 = (ft_vectors_2.sum(axis=0) / len2)
        # scipy's cosine() returns the distance, i.e. 1 - cosine similarity
        distance = spatial.distance.cosine(mean_v1, mean_v2)
        return distance
    else:
        return np.nan


# similar to the above shown approaches we create a column that shows the quality of the candidates regarding this "topic variance" measure
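def compute_author_topic_variance(df):
    """
    Compute, for every tweet, the change in the spread of author timeline topics
    (after minus before) within the same conversation.

    :param df: tweets sorted by conversation_id and created_at, with a timeline_bertopic_id column
    :return: list with one topic-spread change per row of df
    """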
    n = len(df.author_id)
    result = []
    for index in tqdm(range(n)):
        authors_before = set()
        authors_after = set()
        conversation_id = df.at[index, "conversation_id"]
        conversation_length = df[df["conversation_id"] == conversation_id].conversation_id.count()
        # print(conversation_length)
        # by definition, the candidate cannot come later in the conversation than the middle
        for index_delta in range(conversation_length // 2):
            previous_tweets_index = index - index_delta
            following_tweets_index = index + index_delta
            # we assert that there are as many predecessors as there are followers
            if previous_tweets_index > 0 and following_tweets_index < n:
                if (df.at[previous_tweets_index, "conversation_id"] == conversation_id and
                        df.at[following_tweets_index, "conversation_id"] == conversation_id
                ):
                    # authors_before.add(df.at[previous_tweets_index, "author_id"])
                    # authors_after.add(df.at[following_tweets_index, "author_id"])
                    authors_before.add(df.at[previous_tweets_index, "timeline_bertopic_id"])
                    authors_after.add(df.at[following_tweets_index, "timeline_bertopic_id"])

        author_topic_var_before = 0
        author_topic_var_after = 0
        n_author_before = len(authors_before)
        n_author_after = len(authors_after)
        if n_author_after > 0:
            author_before_pivot = authors_before.pop()
            for author in authors_before:
                delta = get_topic_delta(author_before_pivot, author)
                author_topic_var_before += delta
                author_before_pivot = author
            author_topic_var_before = author_topic_var_before / n_author_before

            author_after_pivot = authors_after.pop()
            for author in authors_after:
                delta = get_topic_delta(author_after_pivot, author)
                author_topic_var_after += delta
                author_after_pivot = author
            author_topic_var_after = author_topic_var_after / n_author_after

        result.append(author_topic_var_after - author_topic_var_before)
    return result


candidate_author_topic_variance = compute_author_topic_variance(df_conversations)
df_conversations = df_conversations.assign(candidate_author_topic_variance=candidate_author_topic_variance)
df_conversations.head(3)

using postgres


100%|██████████| 2009/2009 [01:15<00:00, 26.65it/s] 


Unnamed: 0,id,text,author_id,bertopic_id,bert_visual,conversation_id,sentiment_value,created_at,timeline_bertopic_id,candidate_sentiment_value,candidate_author_number_changed,candidate_author_topic_variance
0,2567,Let me get this straight..\n\nYou need proof you got a non-FDA approved vaccine to grocery shop but requiring an ID to vote is going too far.\n\nLmao.,4316769252,-2,,1422614889827225603,5.293154,2021-08-03 17:46:35+00:00,47,0.0,0,0.0
1,2590,@MsBlaireWhite @WhoseBacon Based NY,1341960673484562432,15,15_tweet_retweets_follow_retweet_tweets_followers_followed_tweeting_unfollowed_twitters,1422614889827225603,-12.198787,2021-10-19 14:19:00+00:00,47,0.0,0,0.0
2,10842,"As Canada's Oil Sand companies, we know climate change is a critical challenge.\n\nThat's why we're investing in proven technologies that reduce emissions now. By working together, we can reach our goal of net zero greenhouse gas emissions by 2050.",1428712722670268418,-2,23_climate_energy_sustainable_solar_carbon_scientists_ecosystem_fuels_veg_exxon,1448358350450724868,,2021-10-13 18:41:54+00:00,23,,0,0.0
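
The loop in `compute_author_topic_variance` averages the distances between consecutive elements of a Python set, so the result depends on the set's arbitrary iteration order, and a single `NaN` distance turns the whole value into `NaN`. One possible order-independent alternative, which is not the notebook's method, is the mean distance over all pairs of topic ids. The names `mean_pairwise_distance` and `toy_distance` below are hypothetical; in the notebook, `get_topic_delta` would take the place of the toy distance.

```python
import math
from itertools import combinations


def mean_pairwise_distance(topic_ids, distance):
    """Mean of distance(a, b) over all unordered pairs of distinct topic ids."""
    pairs = combinations(sorted(set(topic_ids)), 2)
    deltas = [distance(a, b) for a, b in pairs]
    # skip pairs for which no distance could be computed
    deltas = [d for d in deltas if not math.isnan(d)]
    if not deltas:
        return 0.0  # fewer than two topics, or no usable pair
    return sum(deltas) / len(deltas)


def toy_distance(a, b):
    """Invented stand-in for a topic distance; undefined for topic id 0."""
    if a == 0 or b == 0:
        return float("nan")
    return abs(a - b) / 10


spread = mean_pairwise_distance({1, 3, 6}, toy_distance)         # pairs: 0.2, 0.5, 0.3
spread_single = mean_pairwise_distance({5}, toy_distance)        # no pair available
spread_with_nan = mean_pairwise_distance({0, 1, 3}, toy_distance)  # only (1, 3) is usable
```

The difference between the value after and the value before a candidate could then be used in the same way as `author_topic_var_after - author_topic_var_before`.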


Having computed the available intermediate measures, we are now
ready to compute the moderator index for each candidate.

In [17]:
def normalize(sv):
    return (sv - sv.min()) / (sv.max() - sv.min())


sv = df_conversations.sentiment_value
df_conversations = df_conversations.assign(sentiment_value_normalized=normalize(df_conversations.sentiment_value))
df_conversations = df_conversations.assign(c_author_number_changed_normalized=normalize(df_conversations.candidate_author_number_changed))
df_conversations = df_conversations.assign(c_sentiment_value_norm=normalize(df_conversations.candidate_sentiment_value))
df_conversations = df_conversations.assign(c_author_topic_variance_norm=normalize(df_conversations.candidate_author_topic_variance))
df_conversations = df_conversations.assign(moderator_index=  df_conversations.c_author_number_changed_normalized
                                                            + df_conversations.c_sentiment_value_norm
                                                            + df_conversations.c_author_topic_variance_norm
                                                            - abs(df_conversations.sentiment_value_normalized)
                                           )
df_conversations.head(3)

Unnamed: 0,id,text,author_id,bertopic_id,bert_visual,conversation_id,sentiment_value,created_at,timeline_bertopic_id,candidate_sentiment_value,candidate_author_number_changed,candidate_author_topic_variance,sentiment_value_normalized,c_author_number_changed_normalized,c_sentiment_value_norm,c_author_topic_variance_norm,moderator_index
0,2567,Let me get this straight..\n\nYou need proof you got a non-FDA approved vaccine to grocery shop but requiring an ID to vote is going too far.\n\nLmao.,4316769252,-2,,1422614889827225603,5.293154,2021-08-03 17:46:35+00:00,47,0.0,0,0.0,0.462238,0.661017,0.619828,0.487047,1.305655
1,2590,@MsBlaireWhite @WhoseBacon Based NY,1341960673484562432,15,15_tweet_retweets_follow_retweet_tweets_followers_followed_tweeting_unfollowed_twitters,1422614889827225603,-12.198787,2021-10-19 14:19:00+00:00,47,0.0,0,0.0,0.351079,0.661017,0.619828,0.487047,1.416814
2,10842,"As Canada's Oil Sand companies, we know climate change is a critical challenge.\n\nThat's why we're investing in proven technologies that reduce emissions now. By working together, we can reach our goal of net zero greenhouse gas emissions by 2050.",1428712722670268418,-2,23_climate_energy_sustainable_solar_carbon_scientists_ecosystem_fuels_veg_exxon,1448358350450724868,,2021-10-13 18:41:54+00:00,23,,0,0.0,,0.661017,,0.487047,
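
In the cell above, `abs` is applied after min-max normalization, when every value already lies between 0 and 1, so it has no effect and the last term penalizes positive sentiment rather than distance from neutral. If the intent is that the candidate's own sentiment should be more or less neutral, one possible variant takes the absolute value first. This is a sketch on invented numbers, assuming that a `sentiment_value` of 0 means neutral; the zero-span guard in `normalize` is also an addition of this sketch.

```python
import pandas as pd


def normalize(series):
    """Min-max normalization to [0, 1]; a constant series maps to 0."""
    span = series.max() - series.min()
    if pd.isna(span) or span == 0:
        return series * 0.0  # avoid dividing by zero
    return (series - series.min()) / span


# invented sentiment values, 0.0 assumed to be neutral
sentiment = pd.Series([-10.0, -1.0, 0.0, 2.0, 10.0])

as_in_cell = normalize(sentiment).abs()          # grows with positivity
neutrality_penalty = normalize(sentiment.abs())  # grows with distance from neutral
```

With `neutrality_penalty`, the strongly negative and the strongly positive tweet receive the same penalty, while the neutral tweet receives none.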


In [21]:
# get the 10 best candidates
candidates = df_conversations.nlargest(10, ["moderator_index"])
candidates.text

773                                                                                                                                                                            @MadiBoity @MasekoThembaJ Ooohhh it’s true because is said by him
161    @GOPLeader @RepSamGraves @TransportGOP @RepWesterman @SteveScalise @NatResources @RepStefanik @HouseGOP @RepFrankLucas @housesciencegop @HouseAgGOP You’re a clown. You love that asshat Trump more than the USA. https://t.co/gJzQRfAmTx
776                                                                     @MnguniSakhy @MadiBoity We are discussing economy here... It's not a language a grade 2 dropout will understand... He'd understand conspiracies and other extreme things
777                                                                                                                                            @MadiBoity @MasekoThembaJ Very profound statement about the eople who have infiltrated the ANC...
775                                 