# Evaluating Topic Distances of Authors in Twitter-Conversations

From the project plan, this addresses the following tasks:

4. Use BERTopic to analyze the topics
   - [ ] encode the

In [2]:
import os
import sqlite3
import pandas as pd

from twitter.nlp_util import process_tweet

# Open a connection to the local SQLite database.
cnx = sqlite3.connect('db.sqlite3')

df = pd.read_sql_query(
    "SELECT id, conversation_id, created_at, text, author_id,in_reply_to_user_id FROM delab_timeline WHERE lang='en'",
    cnx)
df.head(3)

Unnamed: 0,id,conversation_id,created_at,text,author_id,in_reply_to_user_id
0,1,1435703745304612870,2021-09-08 20:37:01,RT @OregonOEM: 🚩🌩🔥 Red Flag and #FireWeather w...,14838508,
1,2,1435664738353098756,2021-09-08 18:02:01,It's #NationalPreparednessMonth. Help ensure t...,14838508,
2,3,1435663883595706370,2021-09-08 17:58:37,RT @OregonOEM: Oregon is still recovering from...,14838508,


In [3]:
df_reduced = df[["author_id", "text", "id"]]
#df_reduced = df_reduced.groupby('author_id')
# df_reduced.count()

df_reshaped = df_reduced.pivot(index="id", columns="author_id", values="text")
# Drop every author column with fewer than 400 distinct tweets.
too_few = df_reshaped.nunique() < 400
df_reshaped = df_reshaped.drop(columns=too_few[too_few].index)
df_reshaped.nunique()  # number of distinct tweets per author with at least 400 tweets

author_id
16558158               478
18616003               946
26998226               469
382814447              446
1005470991668084736    445
1106611172462219265    491
1162371171805011968    447
1239172010363826183    427
1292908140975943681    414
1402252385427222528    414
1403930956428460035    441
dtype: int64
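
The same filtering could also be done without the wide pivot table. The following is a minimal sketch on made-up toy data, not the notebook's actual query result: the frame `toy`, the threshold `MIN_TWEETS`, and the variable names are assumptions chosen for illustration, and the threshold is lowered from 400 so that the toy data passes.

```python
import pandas as pd

# Toy stand-in for the (id, author_id, text) columns; all values are made up.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "author_id": [10, 10, 10, 20, 20],
    "text": ["a", "b", "c", "d", "d"],
})

MIN_TWEETS = 3  # the notebook uses 400; lowered here for the toy data

# Count distinct tweet texts per author, mirroring nunique() on the pivot.
unique_per_author = toy.groupby("author_id")["text"].nunique()
prolific = unique_per_author[unique_per_author >= MIN_TWEETS].index

# Build one Series of tweets per prolific author, indexed by tweet id.
kept = toy[toy["author_id"].isin(prolific)]
corpora = {
    author_id: group.set_index("id")["text"]
    for author_id, group in kept.groupby("author_id")
}
```

Because no pivot is involved, no missing values are introduced and no `dropna()` is needed afterwards.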

The following cell converts the pandas DataFrame to a dictionary with the author IDs as keys and each author's tweet
corpus as the value. Missing entries introduced by the pivot are dropped.

In [4]:
#df_reshaped.shape
author_corpora_cleaned = {}
author_corpora = df_reshaped.to_dict(orient="series")
for author_id, tweets in author_corpora.items():
    author_corpora_cleaned[author_id] = tweets.dropna()

example_corpus = author_corpora_cleaned[next(iter(author_corpora))]
example_corpus

id
8247                                   @ncreen_same ouch!
8248    That last tweet got some responses from spambo...
8249    'virtually' virtually means: real\n\nSo it sho...
8250                               Sounds great on paper!
8251    Any .com.au registrar recommendations? So far ...
                              ...                        
8742    RT @paydirtapp: Check out our Free Invoice Cre...
8743    @taitems @mmilo yeah nice site, suggestion: ht...
8744    RT @MichaelFHansen: Zendesk eyes Southeast Asi...
8745    While LinkedIn has been changing drastically o...
8746    Another LinkedIn email fail… now they want me ...
Name: 16558158, Length: 478, dtype: object

This uses BERTopic and Sentence-BERT (sentence transformers), cited as:

```bibtex
    @misc{grootendorst2020bertopic,
      author       = {Maarten Grootendorst},
      title        = {BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.},
      year         = 2020,
      publisher    = {Zenodo},
      version      = {v0.7.0},
      doi          = {10.5281/zenodo.4381785},
      url          = {https://doi.org/10.5281/zenodo.4381785}
    }

    @inproceedings{reimers-2019-sentence-bert,
        title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
        author = "Reimers, Nils and Gurevych, Iryna",
        booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
        month = "11",
        year = "2019",
        publisher = "Association for Computational Linguistics",
        url = "https://arxiv.org/abs/1908.10084",
    }
```

In [5]:
from bertopic import BERTopic

from sentence_transformers import SentenceTransformer

sentences = list(example_corpus)
#sentences = ["This is an example sentence with Trump and Merkel as NER", "Each sentence is converted and it is a great Civil War"]

#model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
#embeddings = model.encode(sentences)
#print(embeddings)

topic_model = BERTopic(embedding_model="sentence-transformers/all-mpnet-base-v2", verbose=True)
topics, probs = topic_model.fit_transform(sentences)
topic_model.get_topic_info()

2021-09-17 14:56:17.405806: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


2021-09-17 14:56:25,783 - BERTopic - Transformed documents to Embeddings
2021-09-17 14:56:34,542 - BERTopic - Reduced dimensionality with UMAP
2021-09-17 14:56:34,570 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Unnamed: 0,Topic,Count,Name
0,0,141,0_ncreen_same_http_mmilo_it
1,-1,118,-1_on_just_co_it
2,1,55,1_melbourne_co_7pmanywhere_headstart
3,2,52,2_support_crazydomains_bugherd_issue
4,3,46,3_bugherd_zendesk_http_zapier
5,4,33,4_paydirtapp_https_http_money
6,5,19,5_drinking_brewsmithau_beer_heineken
7,6,14,6_web_job_software_hire


**This looks promising: unlike the NER approach, the topic descriptions include verbs!**
- It requires less data because the transformer model can use the full sentence
- The BERTopic model now needs to be fitted to all the tweets available in a given language
- [ ] collect all the tweets from the conversations
- [X] collect all the tweets from the authors

In [41]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

author_tweets_texts = list(df.text)
print("the author tweets are: {}".format(len(author_tweets_texts)))
print(author_tweets_texts[0:5])

df_conversations = pd.read_sql_query(
    "SELECT id, text, author_id FROM delab_tweet",
    cnx)
# TODO: this query should filter with WHERE lang='en', but delab_tweet has no lang field yet

conversation_tweets_texts = list(df_conversations.text)
print("the conversation tweets are: {}".format(len(conversation_tweets_texts)))
print(conversation_tweets_texts[0:5])


corpus_for_fitting = author_tweets_texts + conversation_tweets_texts
# corpus_for_fitting = author_tweets_texts
corpus_for_fitting_sentences = []
for tweet in corpus_for_fitting:
    for sentence in tweet.split("."):
        #clean_tweet = process_tweet(sentence)
        #clean_tweet_string = ' '.join(clean_tweet)
        #corpus_for_fitting_sentences.append(clean_tweet_string)
        corpus_for_fitting_sentences.append(sentence)
print("corpus for fitting is: {}".format(len(corpus_for_fitting_sentences)))

topic_model_2 = BERTopic(embedding_model="sentence-transformers/all-mpnet-base-v2", verbose=True)
topics, probs = topic_model_2.fit_transform(corpus_for_fitting_sentences)
topic_model_2.get_topic_info()

the author tweets are: 7362
the conversation tweets are: 400
["☀️ ☀️ We've said it a lot this summer, but if your home gets too hot, libraries, malls, &amp; theaters are great places to escape too. Also call 211 or visit @211info and your local county for additional cooling center resources. 🥵 https://t.co/bdaaCCW942", "@RedCrossCasc @OregonGovBrown @211info Biggest scandal in world and Oregonian history. They have injected billions with a NON approved never before used experimental mRNA shot, telling all it was safe and would protect them, and they LIED. Now they are blaming 'delta'with NO proof, lying again. It is the shots people!", '@RedCrossCasc @OregonGovBrown @211info You are being lied to. There is no "delta". Proof here...ask yourselves and the governor, for them to again take idiotic non-science masking to us, where is the REAL PROOF of this? Show us the isolated "delta" virus now! https://t.co/IpRD84rYZr', '@RedCrossCasc @OregonGovBrown @211info But make sure you wear your m

2021-09-17 15:31:03,539 - BERTopic - Transformed documents to Embeddings
2021-09-17 15:31:11,402 - BERTopic - Reduced dimensionality with UMAP
2021-09-17 15:31:13,048 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Unnamed: 0,Topic,Count,Name
0,-1,6025,-1_she_know_trump_her
1,0,4235,0_fatal_fascists_fathom_fathoms
2,1,1336,1_zxxmo0yfrh_gc0xaznhpq_3ds_emulator
3,2,318,2_watch_thread_kip_boldness
4,3,293,3_covid_vaccine_mrna_vaccinated
...,...,...,...
200,203,10,203_abc_lie_dailyexposeteam_crocodilekatie
199,204,10,204_protection_protect_participation_fighters
198,205,10,205_puke_acid_swallow_quantities
196,201,10,201_acosta_hardball_editorialize_jimbo
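
The cell above splits tweets on every `.`, which also cuts through URLs such as `https://t.co/...` and produces empty fragments. This may contribute to topics made of URL residue (topic 1 above looks like one), although that has not been verified here. A minimal sketch of a more conservative splitter follows, using only the standard library; the function name `split_sentences` and the regular expression are assumptions for illustration, not part of the project code.

```python
import re

def split_sentences(text):
    """Split on '.', '!' or '?' only when followed by whitespace.

    Dots inside URLs or abbreviations without trailing whitespace stay
    intact, and empty fragments are dropped.
    """
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]

tweet = "Hello world. See https://t.co/abc now! Ok"

# Naive split as in the cell above: the URL is broken apart.
naive = "See https://t.co/abc now".split(".")

# Conservative split: the URL survives as part of its sentence.
sentences = split_sentences(tweet)
```

This is still a heuristic: abbreviations followed by a space (for example "U.S. policy") would be split, so a proper sentence tokenizer may be preferable for a final pipeline.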


In [47]:
topic_model_2.get_topic(3)

[('covid', 0.024164840806527484),
 ('vaccine', 0.017095397907528627),
 ('mrna', 0.0111956963463027),
 ('vaccinated', 0.009954721382282957),
 ('vaccines', 0.009954721382282957),
 ('pandemic', 0.009559790766417903),
 ('coronavirus', 0.008164559412939898),
 ('vaxxx', 0.0075630464177102955),
 ('inoculation', 0.006482611215180254),
 ('vax', 0.006291761201900802)]

After fitting the model to the language data available in the database,
we can now predict the topics of the authors.

In [87]:
import numpy as np

# topics2, probs2 = topic_model.fit_transform(example_corpus_2)
# topic_info = topic_model.get_topic_info()
# print(topic_info)

author_ids_list = list(author_corpora_cleaned.keys())

#example_corpus_1 = list(author_corpora_cleaned[author_ids_list[3]])
#print("example corpus 1 is:\n {}\n".format(example_corpus_1[0:5]))
#suggested_topics = topic_model_2.transform(example_corpus_1)
#np_suggested_topics = np.array(suggested_topics[0])
#index_of_suggested_topic = np_suggested_topics.argmax()
#print ("suggested_topic for corpus 1 is {}".format(topic_model_2.get_topic(index_of_suggested_topic)))


example_corpus_2 = list(author_corpora_cleaned[author_ids_list[4]])
print("example corpus 2 is:\n {}".format(example_corpus_2[0:5]))
suggested_topics2 = topic_model_2.transform(example_corpus_2)
np_suggested_topics2 = np.array(suggested_topics2[0])
pd_suggested_topics2 = pd.DataFrame(np_suggested_topics2, columns=["topic"])  # one predicted topic id per tweet
value_counts2 = pd_suggested_topics2[pd_suggested_topics2>=0].value_counts()
value_counts2[0]
#value_counts2.shape
#index_of_suggested_topic2 = np_suggested_topics2.argmax()
#print(index_of_suggested_topic2)
#print ("suggested_topic for corpus 2 is {}".format(topic_model_2.get_topic(index_of_suggested_topic2)))

example corpus 2 is:
 ['RT @aevanko: When you’ve played too much Soul Sacrifice, everything else in your life suffers. https://t.co/t0nt6lZPKj', 'RT @equalityAlec: Thread.  Have you ever heard of "civil asset forfeiture"? You\'re never going to think about the police the same way again…', 'RT @equalityAlec: If all the crimes committed by police and jail/prison guards was counted, it would completely change the police crime sta…', 'RT @equalityAlec: A few thoughts about "crime."  The concept of “crime” is created and manipulated by people who have power. Throughout U.S…', 'RT @equalityAlec: UPDATED THREAD. You\'re going to hear a lot about how cops need more resources because "crime is surging" in the next few…']


53.0     24
8.0      22
3.0      20
130.0    13
47.0     12
113.0     9
110.0     8
159.0     7
19.0      6
11.0      5
14.0      5
134.0     4
9.0       4
30.0      4
140.0     3
42.0      3
20.0      3
6.0       3
16.0      2
89.0      2
66.0      2
69.0      2
126.0     2
75.0      2
80.0      2
129.0     1
205.0     1
131.0     1
152.0     1
173.0     1
165.0     1
145.0     1
147.0     1
160.0     1
0.0       1
74.0      1
116.0     1
111.0     1
104.0     1
96.0      1
82.0      1
71.0      1
70.0      1
59.0      1
37.0      1
33.0      1
27.0      1
18.0      1
7.0       1
4.0       1
206.0     1
dtype: int64
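
Topic counts like the ones above could be turned into a distance between two authors, which is what the title of this notebook aims at. The notebook does not define that distance yet, so the following is only a sketch under assumptions: it normalizes the per-tweet topic ids into a relative frequency vector (ignoring the outlier topic -1) and compares two authors with the Jensen-Shannon distance. The choice of metric, the function names, and the example topic ids are all hypothetical.

```python
import math
from collections import Counter

def topic_distribution(topic_ids, n_topics):
    """Relative frequency of each topic id, ignoring the outlier topic -1."""
    counts = Counter(t for t in topic_ids if t >= 0)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no non-outlier topics assigned")
    return [counts.get(t, 0) / total for t in range(n_topics)]

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (log base 2), in [0, 1]."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b is the mixture, so b > 0 when a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))

# Hypothetical per-tweet topic assignments for two authors (made-up values).
author_a = [3, 3, 8, -1, 53]
author_b = [3, 8, 8, -1, -1]

p = topic_distribution(author_a, n_topics=60)
q = topic_distribution(author_b, n_topics=60)
distance = jensen_shannon_distance(p, q)
```

A distance of 0 means both authors spread their tweets over the topics in the same proportions, and a distance of 1 means they share no topic at all. Other metrics, such as cosine distance on the same vectors, would fit the same interface.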