# Case Study on Tweets from Ostrava

This notebook demonstrates how the proposed framework can be used in real world. We use tweets published from Ostrava and extract and aggregate sentiment on topics detected in these tweets.

In [1]:
import json
import re

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from bertopic import BERTopic
from sklearn.metrics.pairwise import cosine_similarity
from textblob import TextBlob
from umap import UMAP

plt.rcParams.update({'font.size': 14})

## Data loading and preprocessing

We start by loading the dataset:

In [2]:
df = pd.read_csv('../datasets/tweets_ostrava_translated.csv', header=None)
df.columns = ['author_id', 'text_orig', 'date_published', 'likes', 'retweets', 'text_en']
df.head()

Unnamed: 0,author_id,text_orig,date_published,likes,retweets,text_en
0,2765861635,"Bohuzel budou do Tutecka letat dale, jen tank...",2023-03-17 02:58:29+00:00,4,1,"Unfortunately, they will continue to fly to T..."
1,946371211411513346,Já si mohu uložit tvé fotky? Bez svolení? To ...,2023-03-17 00:40:37+00:00,0,0,Can I save your pictures? Without permission?...
2,729345398,"Arsenal-Sporting, 3 hodiny zábavy 👌",2023-03-16 22:47:06+00:00,7,1,"Arsenal-Sporting, 3 hours of fun 👌"
3,864219563172532224,Levice si mne získala již ve velmi útlém věku. ✨,2023-03-16 21:21:56+00:00,31,2,The left won me over at a very young age. ✨
4,917807011152191494,tim jsem byla posedla kdyz mi bylo 11 let,2023-03-16 20:18:15+00:00,7,1,I was obsessed with it when I was 11 years old.


We are only interested in tweets that are not too short and do not containe a weird word `holytrainer`.

In [3]:
MIN_TEXT_LENGTH = 60
texts = []
texts_orig = []
retweets = []
likes = []

for i, t in enumerate(df['text_en']):
    if len(str(t)) > 60 and 'holytrainer' not in t.lower():
        texts.append(t.lower().strip())
        texts_orig.append(df['text_orig'].iloc[i])
        retweets.append(df['retweets'].iloc[i] + 1)
        likes.append(df['likes'].iloc[i] + 1)

texts[:5]


['unfortunately, they will continue to fly to tuteck, but they will refuel in soci.',
 'can i save your pictures? without permission? is that allowed? 😂😂 only ones you send me directly ☺️',
 "banik's box office fraud is a completely different transfer",
 "where the democrats rule, it looks that way, and it's getting worse.",
 "i'm not saying they shouldn't be paid on time. i'm saying that as long as these people are getting paid, it's wrong, they're doing black things with the benefits... without them, they'd have two choices, they'd either steal or they'd start working properly, that's up to them. that's what the police are there for, to motivate them."]

We perform some basic preprocessing: replacement of contractions and removal of unnecessary characters and of stopwords.

In [4]:
nltk.download('stopwords')
nltk.download('wordnet')

texts_clean = texts

replacements = [
    ("it's", "it is"),
    ("#", ''),
    (r"[^\w\s]", ''),
    (r"\s+", ' '),
    ('dont', 'do not'),
    (' im', ' i am'),
    (r"^im", 'i am'),
    ('theres', 'there is'),
    ('thats', 'that is'),
    ('youre', 'you are'),
    ('doesnt', 'does not'),
    ('didnt', 'did not'),
    ('cant', 'can not'),
    ('couldnt', 'could not'),
    ('shouldnt', 'should not')
]

for replacement in replacements:
    texts_clean = [re.sub(replacement[0], replacement[1], t) for t in texts_clean]

stopwords_en = nltk.corpus.stopwords.words('english')
lemmatizer = nltk.stem.WordNetLemmatizer()

texts_lemmatized = [[lemmatizer.lemmatize(w) if w not in ['has', 'was'] else w for w in t.split()] for t in texts_clean]
texts_no_sw = [' '.join([w for w in t if w not in stopwords_en]) for t in texts_lemmatized]

[nltk_data] Downloading package stopwords to /home/milos/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/milos/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Topic modeling

I first define a custom `BERTopic` class that fixes the issue with setting custom initial topics described in https://github.com/MaartenGr/BERTopic/issues/1421

In [5]:
class MyBERTopic(BERTopic):
    def _guided_topic_modeling(self, embeddings: np.ndarray):
        seed_topic_list = [" ".join(seed_topic) for seed_topic in self.seed_topic_list]
        seed_topic_embeddings = self._extract_embeddings(seed_topic_list, verbose=self.verbose)
        seed_topic_embeddings = np.vstack([seed_topic_embeddings, embeddings.mean(axis=0)])
        sim_matrix = cosine_similarity(embeddings, seed_topic_embeddings)
        y = [np.argmax(sim_matrix[index]) for index in range(sim_matrix.shape[0])]
        y = [val if val != len(seed_topic_list) else -1 for val in y]
        for seed_topic in range(len(seed_topic_list)):
            indices = [index for index, topic in enumerate(y) if topic == seed_topic]
            embeddings[indices] = np.average([embeddings[indices], np.tile(
                [seed_topic_embeddings[seed_topic]], (len(indices), 1))], weights=[3, 1], axis=0)
        return np.array(y), embeddings


Now I fit the topic model with a UMAP random state set to a fixed value for result replication

In [6]:

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', low_memory=False, random_state=43)
topic_model = MyBERTopic(
    language='english', top_n_words=7, n_gram_range=(1, 3), umap_model=umap_model,
    min_topic_size=8, calculate_probabilities=True, verbose=False,
    seed_topic_list=[
        ['doctor', 'hospital', 'patient', 'healthcare'],
        ['school', 'teaching', 'university', 'education'],
        ['car', 'parking', 'road', 'traffic'],
        ['free time', 'leasure', 'recreation', 'hobbies'],
        ['social services', 'social workers', 'welfare', 'elderly care']])
topic_model.fit(texts_no_sw)


<__main__.MyBERTopic at 0x7f07f4ab2650>

In [7]:
topics = topic_model.get_topics()
topics

{-1: [('like', 0.0028634403227088927),
  ('one', 0.0027603013062464485),
  ('people', 0.0027528029448525013),
  ('would', 0.0026202665684966872),
  ('doe', 0.0025193993294223803),
  ('time', 0.002491933002421932),
  ('get', 0.0023895751827105287)],
 0: [('ukraine', 0.021640652470606522),
  ('russian', 0.02097968026755494),
  ('putin', 0.016837325812832208),
  ('russia', 0.013187600313292412),
  ('ukrainian', 0.00815226963019064),
  ('world', 0.005434353692557727),
  ('war', 0.005325761664983474)],
 1: [('vote', 0.019449935855224344),
  ('election', 0.017484933727280026),
  ('voter', 0.014011025162475361),
  ('ballot', 0.009386681724426795),
  ('ideology', 0.008012564703275625),
  ('common sense', 0.007197804164157738),
  ('party', 0.006466710861044977)],
 2: [('doctor', 0.02813715758975389),
  ('hospital', 0.018602934692687357),
  ('patient', 0.01573204569021384),
  ('nurse', 0.008787251700721465),
  ('medicine', 0.007526770847245028),
  ('care', 0.006498132624854652),
  ('surgery', 0.

Next we determine the overall representation of each topic in the dataset:

In [8]:
if -1 in topics:
    del topics[-1]

topic_reprs = (topic_model.probabilities_ * np.array(likes)[:, np.newaxis]).sum(axis=0)
topic_reprs = (topic_model.probabilities_).sum(axis=0)
topic_reprs = {r[0]: r[1] for r in zip(topic_model.get_topics().keys(), topic_reprs)}
topic_reprs


# 2, 0, 1, 3, 4
# 0, 2, 3,  

{0: 98.77123943912048,
 1: 97.8288191401551,
 2: 45.95718358476957,
 3: 57.25420865110905,
 4: 71.33329012092702,
 5: 33.81629103643526,
 6: 32.6106065729883,
 7: 61.89682301794085,
 8: 56.06524755706684,
 9: 39.969770128135494,
 10: 51.42727415815939,
 11: 35.28978737915821,
 12: 42.75682031897574,
 13: 39.68091908906326,
 14: 40.66381085296776,
 15: 41.96250040532382,
 16: 33.635057777825544,
 17: 34.57198505492522,
 18: 45.786770305050524,
 19: 30.091677252726164,
 20: 30.086536938866253,
 21: 40.34811889171058,
 22: 56.00999803075944,
 23: 40.52318016070297,
 24: 37.6810328257493,
 25: 30.180921716847557,
 26: 29.49528733890183,
 27: 29.420356105122636,
 28: 47.076419807221285,
 29: 43.628844896062574,
 30: 29.389925036852734,
 31: 31.298879561894267,
 32: 28.99659345249103,
 33: 28.8913500732549,
 34: 31.974099536087454,
 35: 28.040314105636554,
 36: 30.32451557729071,
 37: 29.49891369829127,
 38: 30.01425292610233,
 39: 38.643795467655075,
 40: 34.26280129553523,
 41: 28.91357335

## Sentiment analysis

When determining polarities we will also deploy a simple masking to ignore tweets whose polarity is very close to 0.

In [9]:
sentiments = []
mask = []
sentiments_discrete = []

for t in texts:
    if 'holytrainer' in t.lower():
        continue
    polarity = TextBlob(t).sentiment.polarity
    sentiments.append(polarity)
    sentiments_discrete.append(1 if polarity > 0 else -1)
    if abs(polarity) < 0.05:
        mask.append([0] * 218)
    else:
        mask.append([1] * 218)

sentiments = np.array(sentiments).reshape(-1, 1)
sentiments_discrete = np.array(sentiments_discrete).reshape(-1, 1)

Next we calculate mean sentiment for each topic

In [10]:
probas_masked = topic_model.probabilities_ * np.array(
    mask)  # np.multiply(np.array(mask), np.sqrt(np.array(retweets))[:, np.newaxis])
print(probas_masked.shape)
mean_sentiments = (sentiments * probas_masked).sum(axis=0) / probas_masked.sum(axis=0)
mean_sentiments_discrete = (sentiments_discrete * probas_masked).sum(axis=0) / probas_masked.sum(axis=0)
diversity_discrete = (((((sentiments_discrete - mean_sentiments_discrete) ** 2) * probas_masked).sum(axis=0)) / \
                      probas_masked.sum(axis=0))

positive_opinions = np.clip((sentiments_discrete * probas_masked), a_min=0, a_max=np.inf).sum(axis=0)
negative_opinions = - np.clip((sentiments_discrete * probas_masked), a_min=-np.inf, a_max=0).sum(axis=0)

total = negative_opinions + positive_opinions
positive_opinions = positive_opinions / total
negative_opinions = negative_opinions / total

entropy = - positive_opinions * np.log2(positive_opinions) - negative_opinions * np.log2(negative_opinions)

ValueError: operands could not be broadcast together with shapes (9766,218) (9766,102) 

Then we determine the semivariance for both sides:

In [None]:
def semivariance(data, mean: np.ndarray, probas: np.ndarray, side):
    semivariances = []
    for i in range(len(mean)):
        if side == 'left':
            semi_data = data[data < mean[i]]
            semi_probas = probas[(data < mean[i]).reshape(-1), i]
        else:
            semi_data = data[data > mean[i]]
            semi_probas = probas[(data > mean[i]).reshape(-1), i]
        nominator = (((semi_data - mean[i]) ** 2) * semi_probas).sum(axis=0)
        denominator = semi_probas.sum(axis=0)
        result = np.sqrt(nominator / denominator)
        semivariances.append(result)
    return semivariances


semivariance_left = semivariance(sentiments, mean_sentiments, probas_masked, 'left')
semivariance_right = semivariance(sentiments, mean_sentiments, probas_masked, 'right')

semivariance_left[:5], semivariance_right[:5]

# Conformity calculation

In [None]:
def conformity_positive(answer_fuzzy, neg_a, neg_b):
    if answer_fuzzy[1] >= neg_b:
        possibility = 1
    elif answer_fuzzy[2] <= neg_a:
        possibility = 0
    else:
        x = (neg_a * answer_fuzzy[1] - answer_fuzzy[2] * neg_b) / \
            (neg_a - neg_b + answer_fuzzy[1] - answer_fuzzy[2])
        possibility = (x - neg_a) / (neg_b - neg_a)
    return possibility


def conformity_negative(answer_fuzzy, pos_a, pos_b):
    answer_fuzzy_rev = (- answer_fuzzy[2], - answer_fuzzy[1], - answer_fuzzy[0])
    return conformity_positive(answer_fuzzy_rev, pos_b, pos_a)


## Visualization

Each topic gets visualized as a TFN in a plot. Fuzzy sets representing positive and negative opinion are also added. While diplaying the plot, we also create a dataframe for further exploration

In [None]:
plt.figure(figsize=(15, 5))
plt.xlabel('Polarity')
plt.plot([-1, 0.75, 1], [1, 0, 0], color='red', label='Negative opinion')
plt.plot([-1, -0.32, 1], [0, 0, 1], color='green', label='Positive opinion')

topics_processed = []

for topic in topic_model.get_topics().items():
    if topic[0] == -1:
        continue
    #if (topic[0] - 1) % 6 != 0:
    #    continue
    core = mean_sentiments[topic[0]]
    support_left = core - semivariance_left[topic[0]] * 1.5
    support_right = core + semivariance_right[topic[0]] * 1.5
    con_pos = conformity_positive([support_left, core, support_right], -0.32, 1)
    con_neg = conformity_negative([support_left, core, support_right], 1, -0.75)
    topic_name = str([t[0] for t in topic[1][:4]])
    plt.plot([support_left, core, support_right], [0, 1, 0], label=topic_name)
    print(f'TOPIC: {topic_name}, CON.POS.: {con_pos:.2f}, CON.NEG.: {con_neg:.2f}')

    topics_processed.append({
        'name': topic_name,
        'sent_mean': core,
        'sent_mean_discrete': mean_sentiments_discrete[topic[0]],
        'support_left': support_left,
        'support_right': support_right,
        'con_pos': con_pos,
        'con_neg': con_neg,
        'controversy': support_right - support_left,
        'controversy_discrete': diversity_discrete[topic[0]],
        'mass': topic_reprs[topic[0]],
        'entropy': entropy[topic[0]]
    })

topics_processed = pd.DataFrame(topics_processed)

plt.legend()
plt.show()


In [None]:
top_topics = topics_processed.sort_values("mass", ascending=False)[:8]

plt.figure(figsize=(15, 8))
plt.xlabel('Polarity')
plt.plot([-1, 0.75, 1], [1, 0, 0], color='red', label='Negative opinion')
plt.plot([-1, -0.32, 1], [0, 0, 1], color='green', label='Positive opinion')

for _, topic in top_topics.iterrows():
    plt.plot([topic['support_left'], topic['sent_mean'], topic['support_right']], [0, 1, 0], label=topic['name'])

plt.legend()
plt.show()

### Exploration of topics

In [None]:
con_min = min(topics_processed.con_pos.min(), topics_processed.con_neg.min())
con_max = max(topics_processed.con_pos.max(), topics_processed.con_neg.max())
topics_processed['con_pos'] = (topics_processed.con_pos - con_min) / (con_max - con_min)
topics_processed['con_neg'] = (topics_processed.con_neg - con_min) / (con_max - con_min)
topics_processed

In [None]:
print(topics_processed[['sent_mean', 'sent_mean_discrete']].corr())
print(topics_processed[['sent_mean', 'sent_mean_discrete', 'controversy', 'controversy_discrete', 'entropy']].corr())

# Data export

We want to export topic data for dashboard visualization. This visualization also requires a short summary of positive and negative aspects of each topic. For these purposes we export several positive and negative tweets for each topic

In [None]:
MIN_PROBA = 0.10

orig_text_series = pd.Series(texts_orig)
texts_no_sw_series = pd.Series(texts_no_sw)
sentiment_series = pd.Series(sentiments[:, 0])


def topic_posts(sentiment, topic_id):
    topic_name = topics_processed['name'].loc[topic_id]
    print(topic_id, topic_name)
    topic_selector = topic_model.probabilities_[:, topic_id] > MIN_PROBA
    topic_sentiments = sentiment_series[topic_selector]
    sent_median = topic_sentiments.median()
    topic_posts = orig_text_series[topic_selector]
    sentiment_selector = topic_sentiments > sent_median if sentiment == 'positive' else topic_sentiments < sent_median
    topic_posts = topic_posts[sentiment_selector][:60]
    posts_clean = texts_no_sw_series[topic_selector][sentiment_selector][:60]
    return {'clean': posts_clean.tolist(), 'original': [tt.strip() for tt in topic_posts.tolist()]}


topic_posts = [
    {'positive': topic_posts('positive', t), 'negative': topic_posts('negative', t)}
    for t in range(len(topics_processed))]

# print(topic_posts)
with open('../datasets/tweets_brno_processed.json', 'w') as tweets_fd:
    json.dump(topic_posts, tweets_fd)

We now use the function to actually summarize positive and negative aspects of each topic.

In [None]:
topics_processed.to_csv('../datasets/topics_brno_processed.csv', index=False)