# Examining the Twitter Discourse Surrounding Large Language Models

Justin Liu

## Introduction

### Motivation

Over the past year or so, the field of generative artificial intelligence has seen a huge rise in popularity. In particular, large language models (LLMs) that have been trained on unprecedented amounts of data can process language and respond to user inputs at a humanlike level. A prime example of this is [ChatGPT](https://openai.com/blog/chatgpt), a chatbot released on November 30, 2022, that can answer (almost) any question that it is given. LLMs are also used in generative AI art models like [DALL-E](https://openai.com/research/dall-e) and [Midjourney](https://www.midjourney.com/), which can turn text prompts into realistic images. With the increasing availability of these tools to the general public, it is becoming easier than ever to use LLMs without much technical experience. In fact, many have praised them as revolutionary and believe that they will only improve over time.

However, the use of these models has also been at the center of countless debates. There have been heated discussions about whether AI-generated art that "steals" work from actual artists can be considered real art, with controversies ranging from an image created by Midjourney winning first prize at an art contest ([link](https://www.nytimes.com/2022/09/02/technology/ai-artificial-intelligence-artists.html)) to using AI to save time on drawing backgrounds from scratch in animated films ([link](https://www.polygon.com/23581376/netflix-wit-studio-short-film-ai-controversy)). And ChatGPT, with its capability to perform a wide array of often very specific tasks, could threaten to replace numerous jobs over the next several years ([link](https://www.businessinsider.com/chatgpt-jobs-at-risk-replacement-artificial-intelligence-ai-labor-trends-2023-02)).

The present analysis seeks to answer a seemingly simple question: *What are people actually talking about when it comes to LLMs?* As many of these tools are currently available for public use, it makes sense to look at how everyday people (not just specialists) are interacting with them. As a case study, we will focus on the social media platform Twitter since it provides an abundant source of data that can be used to analyze the discourse surrounding LLMs.

### Dataset

The dataset we use in this analysis ([Large Language Models: the tweets](https://www.kaggle.com/datasets/konradb/chatgpt-the-tweets)) is made publicly available by Konrad Banachewicz on [Kaggle](https://www.kaggle.com/). It includes English tweets about LLMs from a wide range of Twitter users and comes with metadata (date of tweet, whether the user is verified, etc.). The tweets start from December 2022, and the dataset is updated daily with new tweets.

### Questions

1. *What kinds of topics are brought up in the online discourse surrounding LLMs?*

    - **Hypothesis:** The discourse surrounding LLMs spans a variety of topics (e.g. advances in the sciences, questions relating to ethics and the humanities) that reflect the diversity of social media users.
    - **Methods:** We implement topic modeling by fitting a Latent Dirichlet Allocation (LDA) model to find the best grouping of tweets about LLMs. We also look into how the distribution of the resulting topics changes over time.

2. *What kinds of sentiments are associated with online discussions about LLMs?*

    - **Hypothesis:** There is a balance between positive and negative sentiments, reflecting a split between proponents and critics of AI.
    - **Methods:** We carry out sentiment analysis on our tweets, which are each classified as "positive", "neutral", or "negative". We also examine how these sentiments vary over time.

## Code

### Prerequisites

In order to access the dataset, we need to download it from Kaggle.

**Note:** At the time of this writing (June 15, 2023), the latest version of the dataset contains nothing. Instead, we will use the last version that had the tweets ([Version 172](https://www.kaggle.com/datasets/konradb/chatgpt-the-tweets/versions/172)), which has already been downloaded and stored in Google Drive. The commands below download that dataset.

In [None]:
#@title
!rm -rf chatgpt-the-tweets
!gdown 1Oax8ZEqZ4mzU8Pr0gbD4ZXXZt-GZHdVE
!unzip chatgpt-the-tweets.zip -d ./chatgpt-the-tweets
!rm chatgpt-the-tweets.zip

Downloading...
From: https://drive.google.com/uc?id=1Oax8ZEqZ4mzU8Pr0gbD4ZXXZt-GZHdVE
To: /content/chatgpt-the-tweets.zip
100% 95.1M/95.1M [00:02<00:00, 44.4MB/s]
Archive:  chatgpt-the-tweets.zip
  inflating: ./chatgpt-the-tweets/tweets.csv  


The code below downloads the latest version of the dataset (currently commented out; see the note above).

In [None]:
#@title
# #@title
# # get API token and dataset from Kaggle
# api_token = {"username": "KAGGLE_USERNAME", "key": "KAGGLE_KEY"}
# dataset = "konradb/chatgpt-the-tweets"

# dataset_name = dataset.split("/")[1]
# dataset_filename = dataset_name + ".zip"

# !rm -rf {dataset_name}
# !rm -rf ~/.kaggle
# !mkdir ~/.kaggle
# !touch ~/.kaggle/kaggle.json

# import json
# with open("/root/.kaggle/kaggle.json", "w") as file:
#     json.dump(api_token, file)

# !chmod 600 ~/.kaggle/kaggle.json

# !kaggle datasets download -d {dataset}
# !unzip {dataset_filename} -d ./{dataset_name}
# !rm {dataset_filename}

We then import the necessary packages.

In [None]:
%%capture
#@title
# install packages (a cell magic such as %%capture must be the first line of the cell)
!pip install pyLDAvis

# import packages
import gensim
import pyLDAvis.gensim
import pandas as pd
import spacy
import re
import warnings
import altair as alt
from operator import itemgetter
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nlp = spacy.load("en_core_web_sm")
nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()
warnings.filterwarnings("ignore", category = DeprecationWarning)

### Cleaning the data

We'll take a look at the dataset, dropping rows where either the tweet (`text`) or date (`date`) is missing.

In [None]:
#@title
# read the data, dropping rows where the tweet or date is missing
tweets = pd.read_csv("chatgpt-the-tweets/tweets.csv").dropna(subset = ["text", "date"])
tweets.head()


Unnamed: 0,user_name,text,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,hashtags,source
0,reigndomains 👑,https://t.co/6tFaOonLtv 🔥 for sale .\n\n#Royal...,,Brand Name | https://t.co/Z4d6GWXyWz | https:/...,2019-09-11 04:04:06+00:00,267.0,256.0,1300,False,2023-06-10 12:37:16+00:00,"['RoyalGPT', 'Royal', 'Domains', 'ai', 'Web3',...",Twitter for iPhone
1,MidJourney LIVE,Exquisite realism photography showcasing an ex...,Follow for Inspiration,🎨 Live feed of Art generated by Midjourney AI 🎨,2018-08-28 02:01:04+00:00,100.0,1.0,0,False,2023-06-10 12:36:56+00:00,,MidjourneyLIVE
2,The Tech Trend,Top 10 ChatGPT Plugins You Should Use Right No...,Worldwide,"A Tech community for industry experts, connect...",2020-09-15 15:37:37+00:00,4380.0,4668.0,242,False,2023-06-10 12:35:00+00:00,"['ChatGPT', 'bestChatGPTplugins']",Buffer
3,The Time Blawg,What lawyers will get out of ChatGPT: legal ca...,Scotland... and Beyond,"The past, present and future practice of law (...",2010-12-29 18:03:14+00:00,5897.0,6499.0,4693,False,2023-06-10 12:34:49+00:00,,Twitter for Android
4,Christine Lopez,down an a But the state of summer8 being money...,,,2023-05-06 11:03:29+00:00,0.0,5.0,0,False,2023-06-10 12:33:14+00:00,"['车震', '嫩穴', 'chatGPT']",Twitter Web App


Since we can't see any full tweets in the table above, we sample some random tweets and print them out below.

In [None]:
#@title
# sample 30 random tweets and print them out
sampled_tweets_1 = tweets.sample(30, random_state = 1).text
for i in range(30):
    print("-" * 50)
    print(sampled_tweets_1.iloc[i])
print("-" * 50)

--------------------------------------------------
🚀 Boost Your Sales by using the "Sealing the Deal" template on Jeda Ai's All-in-One Workspace Canvas.

Get your Daily 10K FREE AI Tokens at https://t.co/8NK5W5P55J 🤩

#JedaAI #AI #template #sales #sealthedeal #ChatGPT #GPT4 https://t.co/svVecsO7XF
--------------------------------------------------
Why Seattle's ban on students using ChatGPT is doomed — and what comes next - The Seattle Times https://t.co/chXeUKv874 #chatgpt #AI #openAI
--------------------------------------------------
We are bringing to you the world's most efficient AI-powered virtual trading assistant that trades on financial markets 10 times faster than humans. Get started with these easy steps 👇👇👇🔥🔥🔥

#TradesGPT5 #AI #TradeGPT5 #ChatGPT https://t.co/XMlDajBpAi
--------------------------------------------------
Are there any #lowcode #nocode tools to building #autogpt like apps? Essentially building AI agents.
--------------------------------------------------
Pret

Looking at some of the tweets above, a few are very likely to be spam (e.g., tweets that talk about crypto and/or have an abnormally high number of hashtags). Since these tweets are unrelated to the discussion of large language models, we will try to filter them out. (Note that the methods implemented below are not perfect: legitimate tweets could be filtered out, while some spam tweets could still remain.) After this process, we sample some of the remaining tweets and print them out below.

In [None]:
#@title
def count_items(str_list):
    """Takes in a list as a string and returns the number of items in the list
    (example: "['word', 'number']" would return 2). Returns 0 in the case of
    a TypeError.
    """
    try:
        # remove the brackets, convert to a list, and count the number of items
        brackets_removed = re.sub(r"\[|\]|'", "", str_list)
        list_split = brackets_removed.split(", ")
        return len(list_split)
    except TypeError:
        # for cases when the value is NaN, return 0
        return 0

def remove_outliers(df, col_name):
    """Returns the dataframe with rows where the outliers in the specified
    column are removed.
    """
    # calculate interquartile range (IQR)
    q1 = df[col_name].quantile(0.25)
    q3 = df[col_name].quantile(0.75)
    iqr = q3 - q1

    # remove outliers using the 1.5 * IQR method
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    df_out = df[(df[col_name] > lower) & (df[col_name] < upper)]
    return df_out

# get the number of hashtags in each tweet and the 'hashtags' column
tweets_cleaned = tweets.copy()
tweets_cleaned["num_hashtags_text"] = tweets_cleaned["text"].str.count("#")
tweets_cleaned["num_hashtags_data"] = tweets_cleaned["hashtags"].map(count_items)

# remove rows where number of hashtags is an outlier
tweets_cleaned = remove_outliers(tweets_cleaned, "num_hashtags_text")
tweets_cleaned = remove_outliers(tweets_cleaned, "num_hashtags_data")

# convert text to lowercase
tweets_cleaned["text_clean"] = tweets_cleaned["text"].str.lower()

# create regex for removing tweets with spam (note that this isn't perfect)
# '\d{10}' is for phone numbers, '[\u4e00-\u9fff]+' is for Chinese characters
filter_out = ["crypto", r"\$", "🚨", "🚀", "nft", "coin", "weatherupdate", "temu", r"\d{10}", "[\u4e00-\u9fff]+"]
filter_out_str = "|".join(filter_out)

# filter out tweets with any of the above words
tweets_cleaned["hashtags_clean"] = tweets_cleaned["hashtags"].str.strip('[|]').str.lower()
tweets_cleaned = tweets_cleaned[~tweets_cleaned["hashtags_clean"].str.contains(filter_out_str, na = False)]
tweets_cleaned = tweets_cleaned[~tweets_cleaned["text_clean"].str.contains(filter_out_str, regex = True)]

# sample 20 random tweets and print them out
sampled_tweets_2 = tweets_cleaned["text"].sample(20, random_state = 1)
for i in range(20):
    print("-" * 50)
    print(sampled_tweets_2.iloc[i])
print("-" * 50)

--------------------------------------------------
🌟 Enhance your business with cutting-edge AI technology! Our #ChatGPT for Beginners course offers the perfect introduction for companies embracing the digital world. Sign up now: https://t.co/kAF7l0d2qN #BusinessInnovation #AI https://t.co/KwmFGmh8bW
--------------------------------------------------
I have accessed the gpt-4-32k API and want to create some more interesting products based on it. 

Would any genius be willing to give me some suggestions?

#ChatGPT #AIGC #developers
--------------------------------------------------
How accurate is #ChatGPT ? Better ask Stanford computational law experts. But first where did the data derive for the program. If from #fakenews then it will fail tremendously on a wide spectrum but the scope is like a scoop of ice cream the kind @SpeakerPelosi likes to eat. 🤭
--------------------------------------------------
The announcement of GPT-4, a language model equipped with an astounding 100 trillio
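To make the two filtering ideas in the cell above more concrete, here is a minimal standard-library sketch on toy data. The hashtag counts, the example texts, and the shortened keyword pattern are all made up for illustration; `statistics.quantiles` with `method="inclusive"` is used as a stand-in for the pandas `quantile` call.

```python
import re
import statistics

def iqr_bounds(values):
    """Return the (lower, upper) fences of the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# hypothetical hashtag counts for eight tweets; the last one looks like spam
hashtag_counts = [0, 2, 1, 3, 2, 1, 0, 12]
lower, upper = iqr_bounds(hashtag_counts)
kept_counts = [n for n in hashtag_counts if lower < n < upper]

# shortened, hypothetical version of the keyword pattern
spam_pattern = re.compile(r"crypto|\$|nft|coin|\d{10}")
toy_texts = [
    "chatgpt helped me debug my code",
    "get free crypto with our ai bot",
    "call 5551234567 for gpt deals",
    "what a coincidence, chatgpt agrees with me",  # legitimate, but contains 'coin'
]
kept_texts = [t for t in toy_texts if not spam_pattern.search(t)]

print(lower, upper)   # fences of the outlier rule
print(kept_counts)    # the count of 12 is dropped
print(kept_texts)     # only the first text survives
```

The last toy text shows the caveat mentioned earlier: a substring match on `coin` also removes a legitimate tweet that contains the word "coincidence".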

We then clean the data a bit more so that the words can be processed in our models. The main things are:

- converting everything to lowercase (done in the previous cell when filtering out spam),
- removing hashtag symbols (`#`), usernames, and links (the hashtag words themselves are kept), and
- removing extra whitespace.

Some extra filtering steps include:

- converting all occurrences of `"&amp;"` (HTML symbol for `"&"`) and `"artificialintelligence"` (most likely from hashtags) to `"and"` and `"artificial intelligence"`, respectively, as well as
- dropping every tweet whose text matches another tweet after preprocessing (all copies are removed, not just the repeats), which filters out more possible spam.

Again, we sample some of the resulting tweets below.


In [None]:
#@title
# remove hashtag symbols ('#'), usernames, and links (hashtag words are kept)
tweets_cleaned.loc[:, "text_clean"] = tweets_cleaned["text_clean"].map(lambda x: re.sub(r"#|@\S+|http\S+", "", x))

# collapse runs of whitespace into single spaces
tweets_cleaned.loc[:, "text_clean"] = tweets_cleaned["text_clean"].map(lambda x: " ".join(x.split()))

# convert ampersand to 'and'
tweets_cleaned.loc[:, "text_clean"] = tweets_cleaned["text_clean"].map(lambda x: re.sub(r"&amp;", "and", x))

# convert 'artificialintelligence' (most likely combined in hashtags) to 'artificial intelligence'
tweets_cleaned.loc[:, "text_clean"] = tweets_cleaned["text_clean"].map(lambda x: re.sub(r"artificialintelligence", "artificial intelligence", x))

# remove all rows with duplicates (high probability of spam)
tweets_cleaned = tweets_cleaned.drop_duplicates(subset = ["text_clean"], keep = False)

# sample 20 random tweets and print them out
sampled_tweets_3 = tweets_cleaned["text_clean"].sample(20, random_state = 1)
for i in range(20):
    print("-" * 50)
    print(sampled_tweets_3.iloc[i])
print("-" * 50)

--------------------------------------------------
so many topics how to use chatgpt. does anyone have any concerns regarding security and privacy of the data processed through it? startups security privacy
--------------------------------------------------
this nyt articles ( starts with a pertinent question: how society will greet true artificial intelligence, if and when it arrives. (1) will we panic? (2) start sucking up to our new robot overlords? (3) ignore it and go about our daily lives? chatgpt
--------------------------------------------------
looks like the ultimate meh, middle of the road, not terribly wrong but not great or insightful either take on software testing, which is what i would expect from something like chatgpt. it's like the most average of takes.
--------------------------------------------------
fantastic article! it's amazing to see how openai ai chatgpt can be used to create unique experiences.
--------------------------------------------------
revolutioni
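As a rough illustration of the cleaning steps above, the sketch below applies them to a few invented tweets using only the standard library. The helper name `clean_tweet` and the example strings are hypothetical; the regular expression is the one used in the cell above, and the `Counter`-based filter mimics `drop_duplicates(..., keep = False)`.

```python
import re
from collections import Counter

def clean_tweet(text):
    """Apply the same cleaning steps as the cell above to one tweet."""
    text = text.lower()
    text = re.sub(r"#|@\S+|http\S+", "", text)   # '#' symbols, usernames, links
    text = " ".join(text.split())                # collapse extra whitespace
    text = text.replace("&amp;", "and")
    text = text.replace("artificialintelligence", "artificial intelligence")
    return text

# hypothetical raw tweets; the last two become identical after cleaning
raw_tweets = [
    "Loving #ChatGPT &amp; #ArtificialIntelligence!  Thanks @OpenAI https://t.co/abc123",
    "Try our bot today https://t.co/xyz #ChatGPT",
    "Try our bot today https://t.co/qrs #chatgpt",
]
cleaned = [clean_tweet(t) for t in raw_tweets]

# keep only texts that occur exactly once (every copy of a duplicate is dropped)
occurrences = Counter(cleaned)
unique_cleaned = [t for t in cleaned if occurrences[t] == 1]

print(cleaned[0])
print(unique_cleaned)
```

Because only the `#` symbol is stripped, hashtag words such as `chatgpt` stay in the text and later count as ordinary tokens.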

We check to see how many rows and columns are in our resulting dataset.

In [None]:
#@title
dims = tweets_cleaned.shape
print(f"Our cleaned dataset has {dims[0]} rows (tweets) and {dims[1]} columns.")

Our cleaned dataset has 374683 rows (tweets) and 16 columns.


### Number of tweets over time

Now that we have our cleaned dataset, we can move forward with our pipeline. But before that, let's take a look at the distribution of tweets over time.

In [None]:
#@title
# convert the date column to be in YYYY-MM-DD format
tweets_cleaned["date"] = pd.to_datetime(tweets_cleaned["date"],
                                        errors = "coerce",
                                        utc = True).dt.date

# count the number of tweets for each date (some dates are missing!)
tweets_date_count = tweets_cleaned.value_counts("date", sort = False).reset_index()

# get start and end dates for the data
start_date = min(tweets_date_count["date"]).strftime("%Y-%m-%d")
end_date = max(tweets_date_count["date"]).strftime("%Y-%m-%d")

# merge with dataframe of all possible dates
tweets_date_count_all = pd.DataFrame(
    pd.date_range(start = start_date, end = end_date).date
).rename(
    {0: "date"},
    axis = 1
).merge(
    tweets_date_count,
    on = "date",
    how = "left"
)

# convert date column to be a datetime object (for plotting)
tweets_date_count_all["date"] = pd.to_datetime(tweets_date_count_all["date"])

# show the dataframe
tweets_date_count_all

| | date | count |
|---|---|---|
| 0 | 2022-12-05 | 2053.0 |
| 1 | 2022-12-06 | 6124.0 |
| 2 | 2022-12-07 | 4503.0 |
| 3 | 2022-12-08 | 4655.0 |
| 4 | 2022-12-09 | 4395.0 |
| ... | ... | ... |
| 183 | 2023-06-06 | 968.0 |
| 184 | 2023-06-07 | 2074.0 |
| 185 | 2023-06-08 | 2156.0 |
| 186 | 2023-06-09 | 1018.0 |


In [None]:
#@title
# create a line plot of number of tweets vs. date
line = alt.Chart(tweets_date_count_all).mark_line(
    color = "#26a7de"
).encode(
    x = alt.X("date:T", title = "Date"),
    y = alt.Y("count", title = "Number of tweets")
)

# make the plot interactive
line.interactive()

Looking at the line plot, the number of tweets is not very consistent: the counts fluctuate a lot. Not only are there large dips (near 0) during February and April 2023, but there also seem to be many missing dates, especially in January. We can confirm this by getting the dates where there are no tweets in our data.

In [None]:
#@title
# get all dates where tweet count is missing
counts = tweets_date_count_all["count"]
tweets_date_count_all[counts.isna()]["date"].reset_index(drop = True)

0    2022-12-14
1    2022-12-15
2    2022-12-16
3    2022-12-17
4    2022-12-18
5    2023-01-07
6    2023-01-08
7    2023-01-09
8    2023-01-10
9    2023-01-11
10   2023-01-12
11   2023-01-13
12   2023-01-14
13   2023-01-15
14   2023-01-16
15   2023-01-17
16   2023-01-18
17   2023-01-19
18   2023-01-20
19   2023-01-21
20   2023-01-22
21   2023-01-23
22   2023-01-24
23   2023-03-04
24   2023-03-19
25   2023-03-20
26   2023-03-21
27   2023-03-22
28   2023-03-23
29   2023-06-01
Name: date, dtype: datetime64[ns]
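The same gap check can be sketched without pandas: build every calendar day between the first and last observed date and keep the days that never appear. The helper name `missing_dates` and the four observed dates are a toy example chosen to reproduce the first gap in the output above.

```python
from datetime import date, timedelta

def missing_dates(observed):
    """Return the calendar days between min and max that are not observed."""
    observed = set(observed)
    start, end = min(observed), max(observed)
    n_days = (end - start).days + 1
    all_days = (start + timedelta(days=i) for i in range(n_days))
    return [day for day in all_days if day not in observed]

# toy subset of dates surrounding the first gap
observed_dates = [
    date(2022, 12, 12), date(2022, 12, 13),
    date(2022, 12, 19), date(2022, 12, 20),
]
gaps = missing_dates(observed_dates)
print([d.isoformat() for d in gaps])
```

The left merge used in the notebook does the same thing in table form: dates with no matching row end up with a missing count.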

It is highly unlikely that there were no tweets about LLMs on the dates above, so the missing tweets may be an issue with the data collection itself. This means we have less data for January 2023 compared to other months, as shown by the bar chart below.

In [None]:
#@title
# add month column
tweets_cleaned["month"] = pd.to_datetime(tweets_cleaned["date"]).dt.to_period("M").dt.strftime("%Y-%m")

# get tweet counts per month
tweets_by_month = tweets_cleaned.value_counts(
    "month"
).reset_index(
).sort_values(
    "month"
).reset_index(
    drop = True
)

# show the dataframe
tweets_by_month

| | month | count |
|---|---|---|
| 0 | 2022-12 | 51250 |
| 1 | 2023-01 | 33162 |
| 2 | 2023-02 | 88251 |
| 3 | 2023-03 | 47722 |
| 4 | 2023-04 | 72827 |
| 5 | 2023-05 | 67286 |
| 6 | 2023-06 | 14163 |


In [None]:
#@title
# create bar chart of tweet counts per month
alt.Chart(tweets_by_month).mark_bar(
    color = "#26a7de"
).encode(
    x = alt.X("month:O", title = "Month"),
    y = alt.Y("sum(count)", title = "Number of tweets")
)

This shouldn't affect our analysis too much as we still have tens of thousands of tweets for most of the months (with the exception of June 2023 since the current dataset was downloaded during the middle of the month).

### Topic modeling

The next step is tokenization, which involves breaking up the text into units called tokens. Since we are extracting topics from tweets, we ideally want to keep words that carry some meaning. This means we should remove tokens that are either stopwords (words that don't contribute much to the meaning of a sentence, e.g., *a*, *the*, *I*) or punctuation marks. The remaining tokens are lemmatized (e.g., the lemmatized forms of *asked* and *asks* are both *ask*) so that we can find similar words between tweets. To automate this process, we use [spaCy](https://spacy.io/), a popular Python library for natural language processing.

**Note:** The code takes around 30 minutes to run.

In [None]:
#@title
def tokenize(doc):
    """Takes in a spaCy Doc object (containing tokens) and returns a
    list of the tokens that are not stopwords or punctuation marks.
    """
    # initialize list of tokens to keep
    tokens = []

    # add the lemmatized form of a word if it isn't a stopword or punctuation mark
    for token in doc:
        if not token.is_stop and not token.is_punct:
            lemma = token.lemma_
            tokens.append(lemma)

    return tokens

# tokenize every tweet (will take around 30 minutes to run)
docs = list(nlp.pipe(tweets_cleaned["text_clean"]))

# keep only the meaningful tokens
tokens_list = [tokenize(doc) for doc in docs]

# show example
print(f"Cleaned text:\n{docs[0]}")
print(f"\nTokenized text:\n{tokens_list[0]}")

Cleaned text:
top 10 chatgpt plugins you should use right now read more:- chatgpt bestchatgptplugins aichatbot topchatgptplugins thetechtrend

Tokenized text:
['10', 'chatgpt', 'plugin', 'use', 'right', 'read', 'more:-', 'chatgpt', 'bestchatgptplugin', 'aichatbot', 'topchatgptplugin', 'thetechtrend']
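For readers without spaCy installed, the filtering logic can be imitated on a toy scale. This is only a sketch: the stopword set and the lemma lookup table below are tiny hand-written stand-ins (hypothetical, not spaCy's), and real lemmatization relies on a trained pipeline rather than a dictionary.

```python
import string

# hypothetical, hand-written stand-ins for spaCy's stopword list and lemmatizer
TOY_STOPWORDS = {"a", "the", "i", "and"}
TOY_LEMMAS = {"asked": "ask", "asks": "ask", "plugins": "plugin", "answered": "answer"}

def toy_tokenize(text):
    """Split on whitespace, drop punctuation and stopwords, map words to lemmas."""
    tokens = []
    for word in text.split():
        word = word.strip(string.punctuation)
        if not word or word in TOY_STOPWORDS:
            continue
        tokens.append(TOY_LEMMAS.get(word, word))
    return tokens

toy_tokens = toy_tokenize("i asked chatgpt, and the plugins answered.")
print(toy_tokens)
```

The structure mirrors `tokenize` above: one pass over the tokens, a stopword and punctuation check, then the lemma.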


One thing that we need to take into account is that some pairs of words can frequently occur together, so they should be treated as one "word" (e.g., *artificial intelligence*) – these are called bigrams. We train a bigram model on our tweets and get back the same tokens, with the only difference being that the bigrams contain an underscore (e.g., the bigram `"artificial intelligence"` would show up as `"artificial_intelligence"`).

In [None]:
#@title
# create a bigram model
#   min_count: ignore words and bigrams that appear fewer than this many times
#   threshold: minimum score for a pair to be merged (higher value = fewer bigrams)
bigram_model = gensim.models.phrases.Phrases(tokens_list, min_count = 25, threshold = 100)
bigram_phraser = gensim.models.phrases.Phraser(bigram_model)

# run the bigram model over all of the tweets
texts = [bigram_phraser[sentence] for sentence in tokens_list]

# show example
texts[0]

['10',
 'chatgpt',
 'plugin',
 'use',
 'right',
 'read',
 'more:-',
 'chatgpt',
 'bestchatgptplugin',
 'aichatbot',
 'topchatgptplugin',
 'thetechtrend']
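To give some intuition for how `min_count` and `threshold` interact, the sketch below scores word pairs on a toy corpus. It assumes gensim's default scorer takes the form `(bigram_count - min_count) / (count_a * count_b) * vocab_size`; treat that formula, the helper name `bigram_scores`, and the toy sentences as assumptions for illustration and not as a reimplementation of gensim.

```python
from collections import Counter

def bigram_scores(sentences, min_count=1):
    """Score each adjacent word pair; higher scores suggest a stronger bigram."""
    unigrams = Counter(word for s in sentences for word in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        for (a, b), n_ab in bigrams.items()
    }

# hypothetical tokenized tweets
toy_sentences = [
    ["artificial", "intelligence", "be", "fun"],
    ["artificial", "intelligence", "help"],
    ["chatgpt", "be", "useful"],
    ["chatgpt", "help"],
]
scores = bigram_scores(toy_sentences, min_count=1)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

In this toy corpus only `("artificial", "intelligence")` appears more often than `min_count`, so it is the only pair with a positive score; a pair would be merged only if its score also exceeded `threshold`.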

Next, we create a dictionary and corpus that our model will take as input.

- The dictionary (`id2word`) maps each word to an index.
- The corpus (`corpus`) contains the term frequency of each word within each doc. Each doc is stored as a list of tuples, where each tuple can be read as (word index, word frequency).

In [None]:
#@title
# create dictionary
id2word = gensim.corpora.Dictionary(texts)

# create corpus (with term frequency)
corpus = [id2word.doc2bow(text) for text in texts]

# show example
print(f"First 5 words and indices in the dictionary: {[(id2word[i], i) for i in range(5)]}")
print(f"First document in the corpus: {corpus[0]}")

First 5 words and indices in the dictionary: [('10', 0), ('aichatbot', 1), ('bestchatgptplugin', 2), ('chatgpt', 3), ('more:-', 4)]
First document in the corpus: [(0, 1), (1, 1), (2, 1), (3, 2), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]
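The dictionary and bag-of-words representation can be mimicked in a few lines of plain Python. The helper names and the two toy documents are hypothetical, and the way IDs are assigned here (alphabetically within each new document) is a choice made for this sketch that may not match gensim's internals.

```python
from collections import Counter

def build_dictionary(texts):
    """Assign an integer ID to every distinct token."""
    token2id = {}
    for text in texts:
        for token in sorted(set(text)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc_to_bow(text, token2id):
    """Return a sorted list of (token ID, frequency) tuples for one document."""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())

# hypothetical tokenized tweets
toy_texts = [
    ["10", "chatgpt", "plugin", "use", "chatgpt"],
    ["chatgpt", "write", "code"],
]
token2id = build_dictionary(toy_texts)
toy_corpus = [doc_to_bow(text, token2id) for text in toy_texts]
print(token2id)
print(toy_corpus)
```

Note how the repeated `chatgpt` in the first toy document shows up as a frequency of 2, just as `(3, 2)` does in the real output above.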


Finally, we fit our model to get the possible groupings of our tweets. The method we are using is called Latent Dirichlet Allocation or LDA for short (see [this article](https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2) by Ria Kulshrestha for a more detailed explanation). In addition to taking the dictionary and corpus above as inputs, we also need to specify how many topics we want to group our texts into.

However, we don't necessarily know how many groups would be "best" for our data. One solution is to use the CV coherence score, which allows us to quantify how interpretable the topics are. The basic idea is that it takes the most frequent words from each topic and measures how similar they are. A higher coherence score means the top words in each topic are more related to each other.

The code below fits an LDA model for $k = 1, 2, \dots, 10$ topics, calculating the CV coherence score each time. We choose the number of topics that returns the highest coherence score.

**Note:** The code takes around 30 minutes to run.

In [None]:
#@title
def get_best_num_topics(corpus, id2word, texts, min_topics = 1, max_topics = 10, seed = 1):
    """Runs an LDA model for each number of topics between min_topics and max_topics, returning
    the number of topics that achieves the highest coherence score.
    """
    # initialize list of scores
    scores_list = []

    # for each number of topics
    for i in range(min_topics, max_topics + 1):
        # run LDA model
        lda_model = gensim.models.LdaModel(corpus = corpus,
                                           id2word = id2word,
                                           num_topics = i,
                                           random_state = seed)

        # run coherence score model
        coherence_model = gensim.models.CoherenceModel(model = lda_model,
                                                       texts = texts,
                                                       dictionary = id2word,
                                                       coherence = "c_v")

        # print coherence score
        coherence_lda = coherence_model.get_coherence()
        print(f"Coherence score for {i} topic(s): ", coherence_lda)

        # append score to list of scores
        scores_list.append((i, coherence_lda))

    # get the best number of topics based on the highest coherence score
    best_num_topics, best_score = max(scores_list, key = itemgetter(1))
    print(f"\nThe highest coherence score ({best_score}) occurs when there are {best_num_topics} topics.")

    return best_num_topics

# save the best number of topics in a variable (takes around 30 minutes to run)
seed = 1
best_num_topics = get_best_num_topics(corpus, id2word, texts, seed = seed)

Coherence score for 1 topic(s):  0.33048730772552914
Coherence score for 2 topic(s):  0.3320327290592944
Coherence score for 3 topic(s):  0.4115078807185606
Coherence score for 4 topic(s):  0.36942085462608226
Coherence score for 5 topic(s):  0.4235322851378503
Coherence score for 6 topic(s):  0.35399183349686414
Coherence score for 7 topic(s):  0.3666506074753511
Coherence score for 8 topic(s):  0.3873614361334937
Coherence score for 9 topic(s):  0.38842431325228616
Coherence score for 10 topic(s):  0.38441892727108595

The highest coherence score (0.4235322851378503) occurs when there are 5 topics.
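The selection step is just an arg-max over the printed scores. The sketch below repeats it on the scores above (rounded to four decimals) and also reports the runner-up, which gives a rough sense of how close the decision was. The variable names are illustrative.

```python
from operator import itemgetter

# coherence scores copied from the output above, rounded to four decimals
scores = [(1, 0.3305), (2, 0.3320), (3, 0.4115), (4, 0.3694), (5, 0.4235),
          (6, 0.3540), (7, 0.3667), (8, 0.3874), (9, 0.3884), (10, 0.3844)]

# rank the candidate topic counts from highest to lowest coherence
ranked = sorted(scores, key=itemgetter(1), reverse=True)
(best_k, best_score), (runner_up_k, runner_up_score) = ranked[:2]
gap = best_score - runner_up_score

print(f"best: k={best_k}, runner-up: k={runner_up_k}, gap: {gap:.4f}")
```

The gap between 5 and 3 topics is small (about 0.012), so the chosen number of topics may be sensitive to the random seed.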


According to the output above, the LDA model achieves the highest coherence score with 5 topics. We re-run this model to get an interactive visualization, allowing us to see the most frequent terms overall as well as in each of the topics.

In [None]:
#@title
# re-run model with highest coherence score
lda_model = gensim.models.LdaModel(corpus = corpus,
                                   id2word = id2word,
                                   num_topics = best_num_topics,
                                   random_state = seed)

# output interactive visualization
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(lda_model, corpus, id2word)

Just as a note before we move on: the topic numbers in the visualization above are ordered differently from the topic numbers we will see later. This is most likely because pyLDAvis, by default, renumbers topics in order of their prevalence in the corpus instead of using the model's own topic IDs. For the sake of consistency, the topic numbers we use from here on are the model's (shifted to start at 1), and the list below maps each of them to its circle number above.

Based on the top words in each topic, we can roughly interpret the groups as follows:

- Topic 1 (circle #3 above): **AI as a field**
  - Top words: *chatgpt*, *google*, *ai*, *model*, *language*, *openai*, *answer*, *search*, *like*, *new*
- Topic 2 (circle #4 above): **LLMs in general**
  - Top words: *chatgpt*, *ai*, *openai*, *intelligence*, *artificial*, *chatbot*, *gpt*, *chat*, *human*, *chatgpt3*
- Topic 3 (circle #1 above): **LLM prompts**
  - Top words: *chatgpt*, *ask*, *write*, *ai*, *like*, *good*, *code*, *try*, *question*, *think*
- Topic 4 (circle #5 above): **AI art**
  - Top words: *chatgpt*, *art*, *ok*, *midjourney*, *probably*, *aiart*, *dalle2*, *nice*, *go_to*, *image*
- Topic 5 (circle #2 above): **Innovation and impact**
  - Top words: *chatgpt*, *ai*, *future*, *technology*, *tool*, *new*, *openai*, *use*, *learn*, *world*

Our model can also be used to classify each tweet into one of the corresponding topics above. This is done by getting the individual probabilities of the tweet belonging to each topic, then choosing the topic that yields the highest probability.

In [None]:
#@title
def get_topic_and_prob(corpus_doc, model = lda_model):
    """Returns the classified topic and corresponding probability for a document
    based on a given LDA model.
    """
    # get the probabilities of belonging to each topic
    probs = model.get_document_topics(corpus_doc)

    # return the topic that yields the highest probability
    topic, prob = max(probs, key = itemgetter(1))
    return (topic + 1, prob) # add 1 to topic since topic numbers start from 0

# initialize lists
all_topics = list()
all_probs = list()

# get topics and probabilities for each doc
for doc in corpus:
    topic, prob = get_topic_and_prob(doc)
    all_topics.append(topic)
    all_probs.append(prob)

# add to dataframe
tweets_cleaned["topic"] = all_topics
tweets_cleaned["probability"] = all_probs

# show example
print("Tweet:", tweets_cleaned["text"].iloc[0])
print("Topic:", tweets_cleaned["topic"].iloc[0])
print("Topic probability:", tweets_cleaned["probability"].iloc[0])

Tweet: Top 10 ChatGPT Plugins You Should Use Right Now
Read More:- https://t.co/p7jvcGsrwk 
#ChatGPT #bestChatGPTplugins #AIchatbot #topChatGPTplugins #TheTechTrend
Topic: 3
Topic probability: 0.58303314
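The assignment rule can be sketched on its own, independent of gensim. In the example below, the probability lists are invented (only the 0.58 for Topic 3 echoes the output above), and the `min_prob` cutoff is a hypothetical addition that flags tweets whose best topic is still a weak match.

```python
from operator import itemgetter

def assign_topic(topic_probs, min_prob=0.5):
    """Return (1-based topic, probability, is_confident) for one document.

    topic_probs is a list of (0-based topic ID, probability) tuples.
    min_prob is a hypothetical cutoff, not something the notebook uses.
    """
    topic, prob = max(topic_probs, key=itemgetter(1))
    return topic + 1, prob, prob >= min_prob

# invented probabilities for two documents
clear_doc = [(0, 0.12), (1, 0.08), (2, 0.58), (3, 0.05), (4, 0.17)]
vague_doc = [(0, 0.30), (1, 0.25), (2, 0.25), (3, 0.20)]

print(assign_topic(clear_doc))
print(assign_topic(vague_doc))
```

A tweet like `vague_doc` still gets a topic label under the arg-max rule even though no topic clearly dominates, so a label alone does not guarantee a good fit.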


For each topic, we sample and print out some tweets.

In [None]:
#@title
# get each unique topic in the dataset
topics = tweets_cleaned["topic"].unique()
topics.sort()

# number of tweets per topic
num_tweets = 10

# print some random tweets from each topic
for i in topics:
    # sample tweets from the topic
    sample = tweets_cleaned[tweets_cleaned["topic"] == i].sample(num_tweets, random_state = 1)

    # get the most common words for each topic
    most_common_words_id = lda_model.get_topic_terms(i - 1) # topic IDs start at 0 instead of 1
    most_common_words_list = [id2word[word_id] for (word_id, _) in most_common_words_id]
    most_common_words = ", ".join(most_common_words_list)

    # print heading for the topic
    print("-" * 100)
    print(f"⭐ TOPIC {i}: {most_common_words}")
    print("-" * 100)

    # print tweets, with a separator between (but not after) them
    for j in range(num_tweets - 1):
        print(sample["text"].iloc[j])
        print("-" * 50)
    print(sample["text"].iloc[num_tweets - 1])
print("-" * 100)

----------------------------------------------------------------------------------------------------
⭐ TOPIC 1: chatgpt, google, ai, model, language, openai, answer, search, like, new
----------------------------------------------------------------------------------------------------
"Introducing 7 new data formats for enterprises that want to unify their search experiences. Because nothing says 'unified' like compatibility issues!" #Terminator #Sarcasm 
Link: https://t.co/7kSde1gtXz
#AI #ChatGPT #OpenAI #GenerativeAI
--------------------------------------------------
#Bard even suggested what to do about this situation... https://t.co/IsapI2Rxom
--------------------------------------------------
Are you ready to take your accounting to the next level? Introducing the power of Artificial Intelligence in the field of accounting! 🤖 Say goodbye to manual data entry, error-prone calculations, and tedious tasks.#AIinAccounting #FutureOfAccounting #chatgpt #ai https://t.co/TtDdKNJf7U
-------

Just by looking at a sample of the tweets, we notice that some tweets don't necessarily fit into *any* of the topics that we defined – remember that the model is just classifying tweets into a topic based on the highest probability. It seems that these kinds of tweets are trying to gain traction by using popular hashtags, often putting closely related hashtags in the same tweet. For example, #aiart could be paired together with #dalle2 and #midjourney (which are AI programs that can generate images from text input) not because the tweet is talking about these topics but because it is more likely to be viewed when looking up these topics.

Despite these findings, what we see above provides valuable insights as to what kinds of words tend to be used together. We will continue our analysis with the topics we defined earlier, keeping in the back of our minds that some tweets don't necessarily have content about AI and/or LLMs.
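One optional way to gauge how many tweets are only weakly tied to their assigned topic is to look at the probability of the winning topic and flag tweets that fall below some cutoff. The snippet below is a minimal sketch on toy data rather than part of the analysis: the `topic_prob` column and the `MIN_TOPIC_PROB` cutoff are hypothetical names and values chosen for illustration.

```python
import pandas as pd

# Toy stand-in for tweets_cleaned; "topic_prob" is a hypothetical column
# holding the probability of each tweet's highest-scoring topic.
toy_tweets = pd.DataFrame({
    "text": ["tweet a", "tweet b", "tweet c", "tweet d"],
    "topic": [1, 3, 4, 5],
    "topic_prob": [0.58, 0.27, 0.91, 0.33],
})

# Hypothetical cutoff: with 5 topics, a uniform guess would give 0.20,
# so 0.40 is an arbitrary "at least twice as likely as chance" threshold.
MIN_TOPIC_PROB = 0.40

# Flag tweets whose winning topic is reasonably confident
toy_tweets["confident"] = toy_tweets["topic_prob"] >= MIN_TOPIC_PROB
share_confident = toy_tweets["confident"].mean()

print(toy_tweets[["topic", "topic_prob", "confident"]])
print(f"Share of confidently assigned tweets: {share_confident:.2f}")
```

A low share of confident assignments would suggest that many tweets, such as hashtag-stuffed ones, do not belong clearly to any single topic.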

Now we'll take a look at how the distribution of these topics shifts over time.

In [None]:
#@title
# get counts of each topic for each month
topics_by_month = tweets_cleaned.value_counts(
    ["topic", "month"]
).reset_index(
).sort_values(
    ["topic", "month"]
).reset_index(
    drop = True
)

# create normalized bar chart of tweet counts by topic over time
alt.Chart(topics_by_month).mark_bar().encode(
    x = alt.X("month:O", title = "Month"),
    y = alt.Y("sum(count)", title = "Normalized count", stack = "normalize"),
    color = alt.Color("topic:N", title = "Topic")
)

Interestingly, the proportion of tweets relating to Topic 5 (Innovation and Impact) seems to be increasing over time while the proportion of tweets relating to Topic 3 (LLM prompts) seems to be decreasing. One possibility is that the release of ChatGPT in November 2022 led to a large influx of users experimenting with it and tweeting about what they were using it for (more about LLM prompts). After a while, this craze died down and more people began to focus on the implications of having such AI tools in their daily lives (more about innovation and impact). Of course, this is only speculation, as there could be other reasons for the change we see in the plot.

### Sentiment analysis

Now we move on to the second part of this project: gauging the feelings, or sentiments, expressed in the tweets. To save ourselves some work, we will use an off-the-shelf, lexicon- and rule-based sentiment analyzer called [VADER](https://vadersentiment.readthedocs.io/en/latest/), which is tuned to pick up sentiments in social media text. We run this model on each tweet, getting back a compound score between -1 (very negative) and 1 (very positive). We then use this score to classify the tweet as "positive" (score of at least 0.05), "negative" (score of at most -0.05), or "neutral" (anything in between), following the thresholds outlined [here](https://github.com/cjhutto/vaderSentiment#about-the-scoring). The table below shows a breakdown of the classifications by counts and proportions.

In [None]:
#@title
def get_sentiment(text):
    """Returns the sentiment (positive, negative, or neutral) of the input text."""
    # get sentiment score
    scores = sia.polarity_scores(text)
    compound_score = scores["compound"]

    # classify as positive, negative, or neutral
    if compound_score >= 0.05:
        return "positive"
    elif compound_score <= -0.05:
        return "negative"
    else:
        return "neutral"

# apply get_sentiment() function to all tweets
tweets_cleaned["sentiment"] = tweets_cleaned["text_clean"].apply(get_sentiment)

# show number and proportion of tweets for each sentiment
sentiment_counts = tweets_cleaned.value_counts("sentiment").reset_index()
sentiment_prop = tweets_cleaned.value_counts("sentiment", normalize = True)
sentiment_counts.merge(
    sentiment_prop,
    on = "sentiment",
    how = "left"
)

| | sentiment | count | proportion |
|---|---|---|---|
| 0 | positive | 215062 | 0.573984 |
| 1 | neutral | 100040 | 0.266999 |
| 2 | negative | 59581 | 0.159017 |


The most common sentiment among the tweets is positive (57.4%), followed by neutral (26.7%) and negative (15.9%). This suggests that people on Twitter generally have positive or neutral feelings towards LLMs; negative tweets are less common.
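As a rough check on whether positive and negative tweets are balanced, we can compare the positive share among non-neutral tweets against 50% with a one-sample z-test for a proportion. This is only a sketch under a strong assumption: it treats tweets as independent observations, which is unlikely to hold given retweets and leftover spam, so the statistic should be read as indicative rather than exact. The counts are taken from the table above.

```python
import math

# Counts of positive and negative tweets from the table above
positive = 215062
negative = 59581

# Restrict to non-neutral tweets and compute the share that is positive
n = positive + negative
p_hat = positive / n

# Under the "balanced" hypothesis, the true share is 0.5
standard_error = math.sqrt(0.5 * 0.5 / n)
z = (p_hat - 0.5) / standard_error

print(f"Share positive among non-neutral tweets: {p_hat:.3f}")
print(f"z statistic: {z:.1f}")
```

A z statistic far above the usual cutoff of about 2 indicates that the imbalance is much larger than what sampling noise alone would produce.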

Our results above appear to reject the proposed hypothesis that there is a balance between positive and negative tweets. But we are also interested in the kinds of tweets that are most "characteristic" of each sentiment, so we sample a few of them below.

In [None]:
#@title
# get each unique sentiment
sentiments = tweets_cleaned["sentiment"].unique()
sentiments.sort()

# number of tweets per sentiment
num_tweets = 10

# print some random tweets from each sentiment
for s in sentiments:
    # sample tweets from the sentiment
    sample = tweets_cleaned[tweets_cleaned["sentiment"] == s].sample(num_tweets, random_state = 1)

    # print heading for the sentiment
    print("-" * 100)
    print(f"⭐ SENTIMENT: {s}")
    print("-" * 100)

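    # print tweets, with a separator between (but not after) them
    for j in range(num_tweets - 1):
        print(sample["text"].iloc[j])
        print("-" * 50)
    # print the last sampled tweet (index num_tweets - 1)
    print(sample["text"].iloc[num_tweets - 1])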
print("-" * 100)

----------------------------------------------------------------------------------------------------
⭐ SENTIMENT: negative
----------------------------------------------------------------------------------------------------
it has been a few days since the #ChatGPT is all over the internet and I'm so tired of it already... the last time something annoyed me this much, this fast, was Friday by Rebecca Black
--------------------------------------------------
Bard is a cheat code:

Copy and paste any article with a paywall and have bard summarize the conversation

Tell me what articles you’ve tried it with I’ll start:

#ai #bard #Google #TechNews #ChatGPT
--------------------------------------------------
highaiartdump 15 of 24ish I dont think I shared these here, if i did not all 4 in one post, kind of spooky? #ai #aiart #aiartwork #digitalart #GenerativeAI #ChatGPT #midjourney #stablediffusion #toomanyedibleslore #ayyeyeart https://t.co/252Id6n3kC
---------------------------------------

It seems like the tweets that are labeled as positive tend to praise LLMs like ChatGPT for saving time and solving specific problems. On the other hand, some of the tweets classified as negative aren't actually negative; they were probably labeled that way because they contain words that VADER scores as negative. (Now is a good time to note that VADER maps each word to a valence score, sums these scores after adjusting them with a few heuristic rules, and normalizes the sum into a compound score between -1 and 1. You can read more about how it works [here](https://medium.com/@piocalderon/vader-sentiment-analysis-explained-f1c4f9101cd9).) Then again, this is only a sample of the tweets, so we're not necessarily getting the full picture here.
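To make the scoring a bit more concrete, here is a minimal sketch of the normalization step as we understand it from the vaderSentiment source: the summed word valences are squashed into the range between -1 and 1 using a smoothing constant `alpha`, which we assume defaults to 15. The valence sums below are made-up inputs for illustration, not scores from our tweets, and the sketch omits the rule-based adjustments (e.g., for negation and punctuation) that VADER applies before summing.

```python
import math

def normalize(valence_sum, alpha=15):
    """Squash an unbounded sum of word valences into the range (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

# Hypothetical valence sums, from strongly negative to strongly positive
for valence_sum in [-6.0, -1.0, 0.0, 1.0, 6.0]:
    print(valence_sum, round(normalize(valence_sum), 3))
```

Because the score depends on a sum of word-level valences, a single strongly negative word can push a short, otherwise harmless tweet below the -0.05 threshold, which is consistent with the misclassifications we see in the sample.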

Moving on: how do the sentiments of these tweets change over time?

In [None]:
#@title
# get counts of each sentiment for each month
sentiment_by_month = tweets_cleaned.value_counts(
    ["sentiment", "month"]
).reset_index(
).sort_values(
    ["sentiment", "month"]
).reset_index(
    drop = True
)

# create normalized bar chart of tweet counts by sentiment over time
alt.Chart(sentiment_by_month).mark_bar().encode(
    x = alt.X("month:O", title = "Month"),
    y = alt.Y("sum(count)", title = "Normalized count", stack = "normalize"),
    color = alt.Color("sentiment:N",
                      title = "Sentiment",
                      scale = alt.Scale(domain = ["negative", "neutral", "positive"],
                                        range = ["red", "orange", "green"]))
)

For the most part, the distribution of sentiments doesn't vary much between months. Despite a slight increase in positive tweets over time, it seems that Twitter users are generally consistent in their opinions on LLMs.

Since we have both the topics and sentiments for all of the tweets, we can see if certain topics tend to have lower or higher proportions of positive sentiments.

In [None]:
#@title
# get counts of each sentiment by topic
topics_by_sentiment = tweets_cleaned.value_counts(
    ["topic", "sentiment"]
).reset_index(
).sort_values(
    ["topic", "sentiment"]
).reset_index(
    drop = True
)

# create normalized bar chart of tweet counts by sentiment for each topic
alt.Chart(topics_by_sentiment).mark_bar().encode(
    x = alt.X("topic:O", title = "Topic"),
    y = alt.Y("sum(count)", title = "Normalized count", stack = "normalize"),
    color = alt.Color("sentiment:N",
                      title = "Sentiment",
                      scale = alt.Scale(domain = ["negative", "neutral", "positive"],
                                        range = ["red", "orange", "green"]))
)

Topics 3 (LLM prompts) and 5 (Innovation and impact) stand out once again: these two topics appear to have higher proportions of positive tweets than the rest. Around half of the tweets in each of the other topics (AI as a field, LLMs in general, AI art) are classified as positive.
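To read exact proportions rather than estimate them from the chart, one option is a row-normalized cross-tabulation. The snippet below is a sketch on toy data (the `toy` DataFrame is made up for illustration); the same call should work on the `topic` and `sentiment` columns of `tweets_cleaned`.

```python
import pandas as pd

# Toy data standing in for the "topic" and "sentiment" columns
toy = pd.DataFrame({
    "topic": [1, 1, 1, 1, 3, 3, 3, 3],
    "sentiment": ["positive", "negative", "neutral", "positive",
                  "positive", "positive", "positive", "neutral"],
})

# Each row is a topic; each row's values sum to 1 across the sentiments
sentiment_share = pd.crosstab(toy["topic"], toy["sentiment"], normalize="index")
print(sentiment_share.round(2))
```

Passing `normalize="index"` divides each count by its row total, which mirrors the `stack = "normalize"` setting used in the chart.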

## Discussion

### Summary of methods

- We focused on tweets about LLMs ranging from December 2022 to the beginning of June 2023 to understand the online discourse surrounding them.
- We cleaned the tweets, which included filtering out spam tweets and standardizing the text (e.g., lowercasing, lemmatizing, tokenizing).
- We performed topic modeling and sentiment analysis on the remaining tweets and also looked at how the resulting topics and sentiments changed over time.

### Answers

1. *What kinds of topics are brought up in the online discourse surrounding LLMs?*

    - The discourse surrounding LLMs tended to fall into one of the 5 topics: AI as a field, LLMs in general, LLM prompts, AI art, and Innovation and impact.
    - There was an initial surge in tweets about LLM prompts after the launch of ChatGPT in November 2022, though tweets in the later months shifted towards innovation and impact.
    - These topics are not as clear-cut as initially thought; in fact, the topics have considerable overlap.

2. *What kinds of sentiments are associated with online discussions about LLMs?*

    - Tweets about LLMs tended to be positive (around 57.4%) or neutral (around 26.7%); negative tweets made up a smaller proportion (around 15.9%).
    - Over time, the distribution of these sentiments generally did not change – there were still more positive and neutral tweets compared to negative ones.
    - When taking a closer look at legitimate tweets about LLMs (i.e., not spam), positive tweets generally praise LLMs for being revolutionary and efficient while negative tweets tend to be critical about their impact. However, the sentiment labels are a bit hazy since VADER can be prone to misclassifications.

### Limitations

- **It is difficult to manually filter out spam.** A lot of the methods used to filter out spam in this analysis required hard-coding values (e.g., filtering out certain hashtags). This process is by no means perfect as spam tweets could still pass through while other legitimate tweets could be filtered out.
- **Only tweets were used in this analysis.** We only looked at tweets about LLMs since the dataset was readily available. However, the results generalize at most to people who use Twitter, a group that does not include everyone on the Internet who has an opinion on LLMs.

### Future directions

- **Find a better way to filter out spam, possibly through machine learning.** One method to try out in the future would be to train a classification model on a labeled dataset of spam and non-spam tweets, then tweak it to filter out spam in our data.
- **Use other sources.** In addition to using tweets, an extension of this project could compare how the discourse changes across different social media platforms (e.g., Reddit, Facebook) and news outlets (e.g., The New York Times, Fox News).