# Moderator Mining

We are building our own (home-brew) index for identifying moderating statements in Twitter conversations. Open tasks:
- [ ] compute the variance of author timelines for a set of tweets (topic_var)
- [ ] compute the sentiment delta function for each tweet
- [ ] compute the topic delta function for each tweet
- [ ] compute unweighted moderator index
- [ ] display the tweets with the highest m_index rating

In [26]:
from util.sql_switch import get_query_native

# optional additional filter: "and bertopic_id >= 0"
df_conversations = get_query_native(
    "SELECT id, text, author_id, bertopic_id, bert_visual, conversation_id, sentiment_value, created_at "
    "FROM delab_tweet tw WHERE language = 'en'")
df_conversations.head(2)

using postgres


Unnamed: 0,id,text,author_id,bertopic_id,bert_visual,conversation_id,sentiment_value,created_at
0,9138,@zivnk @dagensnyheter @sanna_bjorling stay out of sweden. thx 😁,559404995,-2,,1452225328945012736,,2021-10-25 06:52:47+00:00
1,7748,@chippycss Kato14 and 15 are better but for the price they are definietly the best ones,991229073891479552,22,22_giveaway_fn_1x_pirates_csgo_ports_karambit_card_nflx_usb,1451526982697689113,-5.489332,2021-10-22 13:03:59+00:00


The bertopic_id is the topic that BERTopic assigned to the tweet. BERTopic also provides a probability for the assignment, but
the topic distribution is fixed once the model is fitted and can only be changed by recomputing it.

A "distance" between two topic ids can be calculated by treating the representative words of each topic as a topic
vector and taking the cosine distance between the two vectors. The word vectors are loaded from fastText; a vector is found
for about 92 % of the words. Because all word vectors reside in the same linear space with a fixed number
of dimensions, they can be summed. With word vectors $a_i$ for topic A and $b_j$ for topic B, the distance is: $$ A = \sum_i a_i, \quad B = \sum_j b_j, \quad d(A, B) = 1 - \frac{A \cdot B}{\Vert A \Vert \Vert B \Vert} $$
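
A minimal sketch of this computation with NumPy, assuming each topic is given as a list of word vectors of equal dimension. The function name `topic_cosine_distance` and the toy vectors are illustrative and not part of the project code.

```python
import numpy as np


def topic_cosine_distance(vectors_a, vectors_b):
    """Cosine distance between the summed word vectors of two topics."""
    a = np.sum(np.asarray(vectors_a, dtype=float), axis=0)
    b = np.sum(np.asarray(vectors_b, dtype=float), axis=0)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0.0:
        return float("nan")  # undefined when a topic has no usable vectors
    return 1.0 - float(np.dot(a, b) / norm)


# toy 2-dimensional "word vectors" (made up for illustration)
topic_a = [[1.0, 0.0], [1.0, 0.0]]
topic_b = [[0.0, 2.0]]
print(topic_cosine_distance(topic_a, topic_a))  # identical topics
print(topic_cosine_distance(topic_a, topic_b))  # orthogonal topics
```

Because the cosine only depends on direction, summing the word vectors and averaging them give the same distance.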

The formula for suggesting a candidate combines the following ideas:

1. sentiment change between the tweets before and after the candidate tweet
2. number of authors involved
3. number of deletions by Twitter (should be lower after the candidate)
4. number of arguments used before and after
5. number of ethotic (personal) attacks before and after
6. the topic variance of the author timelines should be higher after the candidate (because then more authors with different backgrounds would be involved)
7. the sentiment of the candidate tweet itself should be more or less neutral

![Formula for the Moderator Measurement](notebooks/moderator_index.jpg)
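
Since the formula itself is only given in the image, the following is merely one possible reading of an unweighted index: standardize each component column and add the results with equal weight. The column names are the ones created in this notebook; the helper `unweighted_m_index`, the z-score normalisation and the toy values are assumptions made for illustration.

```python
import pandas as pd


def unweighted_m_index(df, columns):
    """Sum of z-scored component columns, each with weight 1 (assumed scheme)."""
    components = df[columns].astype(float)
    # a constant column has standard deviation 0; dividing by 1 makes it contribute 0
    std = components.std(ddof=0).replace(0.0, 1.0)
    z_scores = (components - components.mean()) / std
    return z_scores.sum(axis=1, skipna=True)


toy = pd.DataFrame({
    "id": [1, 2, 3],
    "candidate_sentiment_value": [0.0, 10.0, -10.0],
    "candidate_author_number_changed": [0, 2, -2],
    "author_topic_variance": [0.5, 0.5, 0.5],
})
component_columns = [
    "candidate_sentiment_value",
    "candidate_author_number_changed",
    "author_topic_variance",
]
toy["m_index"] = unweighted_m_index(toy, component_columns)
# display the tweets with the highest m_index rating
top_candidates = toy.sort_values("m_index", ascending=False).head(2)
print(top_candidates[["id", "m_index"]])
```

Standardizing first keeps a component with a large numeric range, such as the summed sentiment values, from dominating the others.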

# 1. Sentiment Changes

In the following we compute a column that represents whether the conversation sentiment
has changed for the better after the tweet.

In [21]:
from tqdm import tqdm

df_conversations = df_conversations.sort_values(by=['conversation_id', 'created_at'])
df_conversations.reset_index(drop=True, inplace=True)
df_conversations.head(10)


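def compute_sentiment_change_candidate(df):
    """
    Compute, for every tweet, the sentiment sum of the following tweets minus that of
    the preceding tweets within a symmetric window in the same conversation.
    :param df: conversations sorted by conversation_id and created_at, with a fresh integer index
    :return: list with one sentiment change value per row of df
    """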
    n = len(df.sentiment_value)
    result = []
    for index in tqdm(range(n)):
        candidate_sentiment_value = 0
        conversation_id = df.at[index, "conversation_id"]
        conversation_length = df[df["conversation_id"] == conversation_id].conversation_id.count()
        # by definition, the candidate cannot be later in the conversation than the middle
        for index_delta in range(conversation_length // 2):
            previous_tweets_index = index - index_delta
            following_tweets_index = index + index_delta
            # only count offsets that have both a predecessor and a follower inside the frame
            if previous_tweets_index > 0 and following_tweets_index < n:
                if (df.at[previous_tweets_index, "conversation_id"] == conversation_id and
                        df.at[following_tweets_index, "conversation_id"] == conversation_id
                ):
                    candidate_sentiment_value -= df.at[previous_tweets_index, "sentiment_value"]
                    candidate_sentiment_value += df.at[following_tweets_index, "sentiment_value"]
        result.append(candidate_sentiment_value)
    return result


candidate_sentiment_values = compute_sentiment_change_candidate(df_conversations)
df_conversations = df_conversations.assign(candidate_sentiment_value=candidate_sentiment_values)
df_conversations.head(3)


100%|██████████| 5770/5770 [00:14<00:00, 409.70it/s] 


Unnamed: 0,id,text,author_id,bertopic_id,bert_visual,conversation_id,sentiment_value,created_at,candidate_sentiment_value
0,2567,Let me get this straight..\n\nYou need proof you got a non-FDA approved vaccine to grocery shop but requiring an ID to vote is going too far.\n\nLmao.,4316769252,-2,,1422614889827225603,5.293154,2021-08-03 17:46:35+00:00,0.0
1,2591,@MsBlaireWhite What's interesting is how they are getting vaccinated with no ID. All of a sudden they find one real quick huh?,17147493,-2,,1422614889827225603,4.631117,2021-10-17 01:38:42+00:00,0.0
2,2590,@MsBlaireWhite @WhoseBacon Based NY,1341960673484562432,16,16_tweet_twitter_tweeted_tweets_retweet_follow_cringeeeee_tweeter_twittering_hashtag,1422614889827225603,-12.198787,2021-10-19 14:19:00+00:00,0.593054
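
To make the windowing concrete, here is a small self-contained sketch that applies a before/after comparison to a toy conversation. `sentiment_change` is a simplified, illustrative variant and not the notebook's function: it pairs the k-th predecessor with the k-th follower for as many offsets as fit on both sides of a tweet, which is not exactly the offset range used in the cell above. The toy values are made up.

```python
import pandas as pd


def sentiment_change(df):
    """Per row: sum of sentiment after minus sum of sentiment before, symmetric window."""
    df = df.sort_values(["conversation_id", "created_at"]).reset_index(drop=True)
    result = []
    # after sorting, each conversation is a contiguous block in row order
    for _, conversation in df.groupby("conversation_id", sort=False):
        values = conversation["sentiment_value"].tolist()
        n = len(values)
        for pos in range(n):
            # largest window that fits on both sides of the candidate
            width = min(pos, n - 1 - pos)
            before = sum(values[pos - width:pos])
            after = sum(values[pos + 1:pos + 1 + width])
            result.append(after - before)
    return df.assign(candidate_sentiment_value=result)


toy = pd.DataFrame({
    "conversation_id": [1, 1, 1, 1, 1],
    "created_at": [1, 2, 3, 4, 5],
    "sentiment_value": [-2.0, -1.0, 0.0, 3.0, 4.0],
})
out = sentiment_change(toy)
print(out["candidate_sentiment_value"].tolist())
```

In the toy conversation the middle tweet gets the highest value, because the two tweets after it are clearly more positive than the two before it.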


# 2. Number of Authors Involved

The idea here is that it is beneficial (also more deliberative) if there are more authors in
a conversation after the moderation.

In [23]:
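def compute_number_of_authors_changed(df):
    """
    Compute, for every tweet, the number of distinct authors after it minus the number
    of distinct authors before it within a symmetric window in the same conversation.
    :param df: conversations sorted by conversation_id and created_at, with a fresh integer index
    :return: list with one author count difference per row of df
    """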
    n = len(df.sentiment_value)
    result = []
    for index in tqdm(range(n)):
        candidate_number_authors_before = set()
        candidate_number_authors_after = set()
        conversation_id = df.at[index, "conversation_id"]
        conversation_length = df[df["conversation_id"] == conversation_id].conversation_id.count()
        # by definition, the candidate cannot be later in the conversation than the middle
        for index_delta in range(conversation_length // 2):
            previous_tweets_index = index - index_delta
            following_tweets_index = index + index_delta
            # only count offsets that have both a predecessor and a follower inside the frame
            if previous_tweets_index > 0 and following_tweets_index < n:
                if (df.at[previous_tweets_index, "conversation_id"] == conversation_id and
                        df.at[following_tweets_index, "conversation_id"] == conversation_id
                ):
                    candidate_number_authors_before.add(df.at[previous_tweets_index, "author_id"])
                    candidate_number_authors_after.add(df.at[following_tweets_index, "author_id"])
        result.append(len(candidate_number_authors_after) - len(candidate_number_authors_before))
    return result


candidate_author_numbers = compute_number_of_authors_changed(df_conversations)
df_conversations = df_conversations.assign(candidate_author_number_changed=candidate_author_numbers)
df_conversations.head(30)

100%|██████████| 5770/5770 [00:14<00:00, 407.30it/s] 


Unnamed: 0,id,text,author_id,bertopic_id,bert_visual,conversation_id,sentiment_value,created_at,candidate_sentiment_value,candidate_author_number_changed
0,2567,Let me get this straight..\n\nYou need proof you got a non-FDA approved vaccine to grocery shop but requiring an ID to vote is going too far.\n\nLmao.,4316769252,-2,,1422614889827225603,5.293154,2021-08-03 17:46:35+00:00,0.0,0
1,2591,@MsBlaireWhite What's interesting is how they are getting vaccinated with no ID. All of a sudden they find one real quick huh?,17147493,-2,,1422614889827225603,4.631117,2021-10-17 01:38:42+00:00,0.0,0
2,2590,@MsBlaireWhite @WhoseBacon Based NY,1341960673484562432,16,16_tweet_twitter_tweeted_tweets_retweet_follow_cringeeeee_tweeter_twittering_hashtag,1422614889827225603,-12.198787,2021-10-19 14:19:00+00:00,0.593054,0
3,2589,@MsBlaireWhite @WhoseBacon Interesting you say that but don't know Pfizer is FDA approved. So how can people believe this?,1207518301586444289,-2,,1422614889827225603,5.224171,2021-10-21 07:48:19+00:00,18.279857,-1
4,2568,@MsBlaireWhite @babie_sunflower It makes sense to.just let anyone go shopping now when you don't know if they're vaccinated? I think it's either that or mandate masks. Or shop online like a lot do. You know that's a thing?,1207518301586444289,-2,,1422614889827225603,1.531483,2021-10-21 07:50:34+00:00,8.951714,-1
5,2569,"@Codeman43447853 @MsBlaireWhite Yes, the vaccine only lessens the person who gets it’s symptoms and there’s proof showing the vaccine only last 10 months and boosters are doing more harm than good. If I get the vaccine it’s doing you no good. Only me theoretically but I had covid and I’m fine and don’t need-",1263529518943424524,-2,,1422614889827225603,9.180704,2021-10-22 01:48:04+00:00,0.736618,-2
6,2570,@Codeman43447853 @MsBlaireWhite The vaccine for lessened symptoms,1263529518943424524,-2,,1422614889827225603,-2.860934,2021-10-22 01:48:16+00:00,1.007877,-2
7,2588,@babie_sunflower @MsBlaireWhite 10 huh. What proof? Boosters are huh. I don't think I've heard that. Wow. You had it and didn't bother you much?,1207518301586444289,-2,,1422614889827225603,0.288445,2021-10-22 04:25:27+00:00,-1.456743,-2
8,2587,@babie_sunflower @MsBlaireWhite Interesting she didn't reply. I guess some sadly like spreading misinformation or lies,1207518301586444289,-2,,1422614889827225603,-2.257419,2021-10-22 04:26:28+00:00,-28.526924,-2
9,2571,@babie_sunflower @MsBlaireWhite Interesting you didn't reply to my Batman comment,1207518301586444289,16,16_tweet_twitter_tweeted_tweets_retweet_follow_cringeeeee_tweeter_twittering_hashtag,1422614889827225603,4.754509,2021-10-22 04:26:57+00:00,-32.394459,-2
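
Both functions above determine the conversation length by filtering the whole frame once per row. A possible alternative, sketched here on a toy frame with made-up ids, is to compute all lengths in a single pass with a grouped `transform`; the column name `conversation_length` is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "conversation_id": [10, 10, 10, 20, 20],
    "author_id": [1, 2, 1, 3, 4],
})
# one pass over the frame: every row receives the size of its own conversation
toy["conversation_length"] = (
    toy.groupby("conversation_id")["conversation_id"].transform("count")
)
print(toy["conversation_length"].tolist())
```

Inside the loop the length could then be read with `df.at[index, "conversation_length"]` instead of being recomputed.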


# 3. Topic Variance in Author Timelines

The basic idea here is that an author's timeline represents their general interests. The hypothesis is that the more diverse
the authors are, the better it is for the conversation.
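
One way to turn "diversity of authors" into a number is the mean pairwise distance between the authors' topic vectors. The sketch below assumes that every author is already represented by one vector; `mean_pairwise_distance`, `cosine_distance` and the toy vectors are hypothetical names and values, and averaging over all pairs is an assumption rather than the measure used in this notebook.

```python
from itertools import combinations

import numpy as np


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def mean_pairwise_distance(vectors, dist=cosine_distance):
    """Average distance over all unordered pairs; 0.0 for fewer than two vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(dist(u, v) for u, v in pairs) / len(pairs)


similar_authors = [np.array([1.0, 0.0]), np.array([2.0, 0.0])]
diverse_authors = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(mean_pairwise_distance(similar_authors))
print(mean_pairwise_distance(diverse_authors))
```

An average over all pairs does not depend on the order in which the authors are visited, which matters when the authors are stored in a set.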

In [31]:
import re
import pandas as pd
from collections import defaultdict
from bertopic import BERTopic
import json
from scipy import spatial
import numpy as np


def clean_corpus(corpus_for_fitting_sentences):
    """
    Typical preprocessing to improve the outcome of the topic analysis.
    :param corpus_for_fitting_sentences: iterable of raw tweet texts
    :return: list of cleaned texts with more than three words
    """
    result = []
    for temp in corpus_for_fitting_sentences:
        # removing mentions and hashtags
        temp = re.sub(r"@[A-Za-z0-9_]+", "", temp)
        temp = re.sub(r"#[A-Za-z0-9_]+", "", temp)
        # removing links
        temp = re.sub(r"http\S+", "", temp)
        temp = re.sub(r"www\.\S+", "", temp)
        # removing punctuation and bracketed text
        temp = re.sub(r"[()!?]", " ", temp)
        temp = re.sub(r"\[.*?\]", " ", temp)
        # keeping only alphanumeric characters
        temp = re.sub(r"[^a-z0-9A-Z]", " ", temp)
        # removing the retweet marker (as a whole word only)
        temp = re.sub(r"\bRT\b", "", temp)
        temp = temp.strip()

        # split() without arguments ignores repeated whitespace left by the substitutions
        has_enough_words = len(temp.split()) > 3
        if has_enough_words:
            result.append(temp)
    return result

# a utility function for retrieving the words given a bertopic model
def topic2wordvec(topic_model):
    result = []
    for t_word in topic_model:
        str_w = t_word[0]
        result.append(str_w)
    return result

# loading the bertopic model
BERTOPIC_MODEL_LOCATION = "BERTopic"
bertopic_model = BERTopic.load(BERTOPIC_MODEL_LOCATION, embedding_model="sentence-transformers/all-mpnet-base-v2")
topic_info = bertopic_model.get_topic_info()

# create topic-word map
topic2word = defaultdict(list)
for topic_id in tqdm(topic_info.Topic):
    topic_model = bertopic_model.get_topic(topic_id)
    words = topic2wordvec(topic_model)
    topic2word[topic_id] = topic2word[topic_id] + words

# loading the word vectors from the database (maybe this needs filtering at some point)
word2vec = get_query_native(
    "SELECT word, ft_vector from delab_topicdictionary")

# a function that computes the cosine distance between the averaged word vectors of two topics
def get_topic_delta(topic_id_1, topic_id_2):
    words1 = topic2word.get(topic_id_1)
    words2 = topic2word.get(topic_id_2)
    if words1 is not None and words2 is not None:
        filtered_w2v1 = word2vec[word2vec["word"].isin(words1)]
        filtered_w2v2 = word2vec[word2vec["word"].isin(words2)]
        ft_vectors_1 = filtered_w2v1.ft_vector.apply(lambda x: pd.Series(json.loads(x)))
        ft_vectors_2 = filtered_w2v2.ft_vector.apply(lambda x: pd.Series(json.loads(x)))
        if len(ft_vectors_1) == 0 or len(ft_vectors_2) == 0:
            return np.nan  # no word vector was found for one of the topics
        # we assume the vectors are embedded in a linear space, so they can be averaged
        mean_v1 = ft_vectors_1.mean(axis=0)
        mean_v2 = ft_vectors_2.mean(axis=0)
        # scipy's cosine() is a distance: 0 for identical directions, 1 for orthogonal ones
        distance = spatial.distance.cosine(mean_v1, mean_v2)
        return distance
    else:
        return np.nan

# use the bert model to assign the author's timeline tweets to a topic
def calculate_author_topic(author_id, bert_topic_model):
    df_timelines = get_query_native(
        "SELECT id, text, author_id FROM delab_timeline tl where author_id = " + str(author_id))
    df_timelines_cleaned = clean_corpus(df_timelines.text)
    author_text = "".join(text + ". " for text in df_timelines_cleaned)
    # transform() returns one prediction per document, so the first entry belongs to the author text
    suggested_topics, _ = bert_topic_model.transform([author_text])
    # return the hashable topic id (not the topic's word list) so that get_topic_delta can look it up in topic2word
    return int(suggested_topics[0])

# similar to the above shown approaches we create a column that shows the quality of the candidates regarding this "topic variance" measure
def compute_author_topic_variance(df, bert_topic_model):
    """
    Compare how much the author topics differ after a candidate tweet with how much they differ before it.
    :param df: conversations sorted by conversation_id and created_at, with a fresh integer index
    :param bert_topic_model: fitted BERTopic model
    :return: list with one value per row of df (mean topic delta after minus mean topic delta before)
    """
    n = len(df.author_id)
    result = []
    for index in tqdm(range(n)):
        authors_before = set()
        authors_after = set()
        conversation_id = df.at[index, "conversation_id"]
        conversation_length = df[df["conversation_id"] == conversation_id].conversation_id.count()
        # by definition, the candidate cannot be later in the conversation than the middle
        for index_delta in range(conversation_length // 2):
            previous_tweets_index = index - index_delta
            following_tweets_index = index + index_delta
            # only count offsets that have both a predecessor and a follower inside the frame
            if previous_tweets_index > 0 and following_tweets_index < n:
                if (df.at[previous_tweets_index, "conversation_id"] == conversation_id and
                        df.at[following_tweets_index, "conversation_id"] == conversation_id
                ):
                    authors_before.add(df.at[previous_tweets_index, "author_id"])
                    authors_after.add(df.at[following_tweets_index, "author_id"])

        author_topic_var_before = 0
        author_topic_var_after = 0
        n_author_before = len(authors_before)
        n_author_after = len(authors_after)
        # both sets are filled together, so they are either both empty or both non-empty
        if n_author_after > 0:
            # sum the topic deltas between consecutive authors, starting from an arbitrary pivot
            pivot_topic = calculate_author_topic(authors_before.pop(), bert_topic_model)
            for author in authors_before:
                author_topic = calculate_author_topic(author, bert_topic_model)
                author_topic_var_before += get_topic_delta(pivot_topic, author_topic)
                pivot_topic = author_topic
            author_topic_var_before = author_topic_var_before / n_author_before

            pivot_topic = calculate_author_topic(authors_after.pop(), bert_topic_model)
            for author in authors_after:
                author_topic = calculate_author_topic(author, bert_topic_model)
                author_topic_var_after += get_topic_delta(pivot_topic, author_topic)
                pivot_topic = author_topic
            author_topic_var_after = author_topic_var_after / n_author_after

        result.append(author_topic_var_after - author_topic_var_before)
    return result


candidate_author_topic_variance = compute_author_topic_variance(df_conversations, bertopic_model)
df_conversations = df_conversations.assign(author_topic_variance=candidate_author_topic_variance)
df_conversations.head(30)
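
Because the same author usually appears in the windows of many candidate tweets, the per-author topic lookup can be memoised so that each author is processed only once. A minimal sketch with `functools.lru_cache`; `expensive_author_topic` is a hypothetical stand-in for the database query plus model call and returns a fake topic id.

```python
from functools import lru_cache

call_counter = {"n": 0}


def expensive_author_topic(author_id):
    """Stand-in for the database query plus BERTopic transform (hypothetical)."""
    call_counter["n"] += 1
    return author_id % 5  # fake topic id, for illustration only


@lru_cache(maxsize=None)
def cached_author_topic(author_id):
    """Memoised wrapper: repeated calls for the same author hit the cache."""
    return expensive_author_topic(author_id)


for author in [17147493, 4316769252, 17147493, 17147493]:
    cached_author_topic(author)
print(call_counter["n"])  # the expensive function ran once per distinct author
```

The cache key must be hashable, which holds for integer author ids; the model itself would have to be reached through a module-level variable rather than passed as an argument.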