# Tweet Topic Extraction with Kaggle

In this notebook, we'll use Themefinder to assign topics to tweets from the [Kaggle COVID19 Tweets dataset](https://www.kaggle.com/datasets/gpreda/covid19-tweets/).

In [None]:
import dotenv
import os

# get the OPENAI_API_KEY from the .env file or enter it manually

dotenv.load_dotenv()

open_ai_api_key = os.getenv('OPENAI_API_KEY')
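
If the `.env` file is missing or the variable is blank, `os.getenv` returns `None` (or an empty string) and the failure only shows up later, when the LLM client is called. A minimal sketch of an early check follows. The helper name `resolve_api_key` is hypothetical and not part of Themefinder or `dotenv`; it accepts any mapping so that it can be tried without touching the real environment.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(env: Mapping[str, str], name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored under `name`, or raise a clear error if it is missing or blank.

    Hypothetical helper for illustration only.
    """
    value: Optional[str] = env.get(name)
    if value is None or not value.strip():
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value.strip()


# Example with a stand-in mapping; in the notebook you would pass os.environ instead.
demo_env = {"OPENAI_API_KEY": "sk-example"}
demo_key = resolve_api_key(demo_env)
```

In the notebook itself, `open_ai_api_key = resolve_api_key(os.environ)` after `dotenv.load_dotenv()` would give the same value as the cell above, but fail immediately when the key is absent.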

In [1]:
import pandas as pd

In [None]:
!kaggle datasets download -d "gpreda/covid19-tweets"
!unzip covid19-tweets.zip -d kaggle-data

In [None]:
!git clone https://github.com/i-dot-ai/themefinder.git
%pip install -e themefinder

In [2]:
tweets_df = pd.read_csv('kaggle-data/covid19_tweets.csv')
tweets_df.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,ᏉᎥ☻լꂅϮ,astroworld,wednesday addams as a disney princess keepin i...,2017-05-26 05:46:42,624,950,18775,False,2020-07-25 12:27:21,If I smelled the scent of hand sanitizers toda...,,Twitter for iPhone,False
1,Tom Basile 🇺🇸,"New York, NY","Husband, Father, Columnist & Commentator. Auth...",2009-04-16 20:06:23,2253,1677,24,True,2020-07-25 12:27:17,Hey @Yankees @YankeesPR and @MLB - wouldn't it...,,Twitter for Android,False
2,Time4fisticuffs,"Pewee Valley, KY",#Christian #Catholic #Conservative #Reagan #Re...,2009-02-28 18:57:41,9275,9525,7254,False,2020-07-25 12:27:14,@diane3443 @wdunlap @realDonaldTrump Trump nev...,['COVID19'],Twitter for Android,False
3,ethel mertz,Stuck in the Middle,#Browns #Indians #ClevelandProud #[]_[] #Cavs ...,2019-03-07 01:45:06,197,987,1488,False,2020-07-25 12:27:10,@brookbanktv The one gift #COVID19 has give me...,['COVID19'],Twitter for iPhone,False
4,DIPR-J&K,Jammu and Kashmir,🖊️Official Twitter handle of Department of Inf...,2017-02-12 06:45:15,101009,168,101,False,2020-07-25 12:27:08,25 July : Media Bulletin on Novel #CoronaVirus...,"['CoronaVirusUpdates', 'COVID19']",Twitter for Android,False


In [3]:
sample_tweets_df = tweets_df.sample(1000)
sample_tweets_df

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
54288,Blick Calle,"Philadelphia, NYC, Reading, PA","Street Photographer\n\n""And to Think That I Sa...",2013-09-26 21:32:47,248,278,339,False,2020-08-01 18:02:51,Alien Moonlight Mask #flickr https://t.co/FEXV...,"['flickr', 'covid19', 'pandemic', 'pandemiclif...",Twitter Web App,False
93078,Dr Iris Sutter G,Alpine triangle A-D-CH,Book on Anglo-German-Jewish #Kindertransport▪ ...,2017-01-05 12:40:45,4643,3882,115454,False,2020-08-09 07:18:27,#COVID19 #Sturgis \nMotorcycle purchases will ...,"['COVID19', 'Sturgis']",Twitter for iPhone,False
40839,Tara Calishain,,"Obsessed with search engines, info collections...",2007-06-20 22:25:01,5542,3435,9531,False,2020-07-29 16:15:24,New York Times: I Was a Screen Time Expert. Th...,"['covid19', 'parenting', 'screentime']",Hootsuite Inc.,False
28008,shubham pawar,,,2020-07-09 06:41:26,0,5,4,False,2020-07-27 04:58:16,"We, the students, are the future of our state,...",['COVID19'],Twitter for Android,False
122356,Glimpse 33 Media Ltd,"Lagos, Nigeria",Helping our clients thrive in a connected worl...,2018-04-26 17:29:43,1295,1172,464,False,2020-08-13 08:21:30,Do you agree? Yes or no?\n#QueenErica #unilag ...,"['QueenErica', 'unilag', 'npower', 'LayconandV...",Twitter for Android,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129802,Global Tactics,"Seattle, WA",We assist business leaders in making informati...,2012-10-24 16:07:25,156,779,1572,False,2020-08-14 03:30:03,"""The #NBA may save us from #Covid19 as we take...","['NBA', 'Covid19', 'Sweden', 'WeChat', 'TikTok']",Hootsuite Inc.,False
167415,DailyaddaaNews,New Delhi,Breaking news alerts from India.,2016-10-22 09:18:42,563,29,88,False,2020-08-30 08:28:52,All people entering the state now will have to...,,Twitter Web App,False
142235,Khaled Abdul Jawad,"Miami, FL",Postdoc Research Fellow at @UMiamiMedicine Ryd...,2009-06-23 20:30:38,3065,2243,11,False,2020-08-17 07:11:53,Lebanon is heading to an imminent collapse of ...,['Covid19'],Twitter for Mac,False
49987,VT_News_Daily,T. Nagar,Media/NewsCompany\nFor any promotions📸\nCommer...,2020-04-12 14:33:15,6,63,2,False,2020-07-31 17:38:05,"#VT News 📰\n31:07:2020\nFor more promotions📜, ...",['VT'],Twitter for Android,False


We begin by reformatting our dataframe as per the instructions in the quickstart: one row per response, with a `response_id` column and a `response` column.

In [11]:
# reformat our df according to instructions

responses_df = (
    sample_tweets_df[['text']]
    .reset_index()
    .rename(columns={'index': 'response_id', 'text': 'response'})
    .copy()
)

responses_df.sample(5)

Unnamed: 0,response_id,response
427,52170,One thing I have definitely not missed during ...
534,148447,231 new #covid19 cases on the island of irelan...
262,172715,This is FUN &amp; DELICIOUS…we just want to ma...
726,172061,This is so cool. A must have in these times.\n...
33,140766,👐 Keep your hands and face as clean as possibl...
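
Before sending the frame to the pipeline, it can be worth confirming that it has the shape produced above. The following is a minimal sketch, assuming the only requirements are the two column names used in this notebook, unique IDs, and no missing text. The helper `check_responses` is hypothetical and not part of Themefinder.

```python
from typing import List

import pandas as pd


def check_responses(df: pd.DataFrame) -> List[str]:
    """Return a list of problems found in a responses frame (empty if none).

    Hypothetical helper; the checks are assumptions based on this notebook.
    """
    problems = [
        f"missing column: {col}"
        for col in ("response_id", "response")
        if col not in df.columns
    ]
    if problems:
        return problems
    if df["response_id"].duplicated().any():
        problems.append("duplicate response_id values")
    if df["response"].isna().any():
        problems.append("missing response text")
    return problems


# Small stand-in frame, renamed the same way as responses_df above.
demo = pd.DataFrame({"index": [54288, 93078], "text": ["first tweet", "second tweet"]})
demo = demo.rename(columns={"index": "response_id", "text": "response"})
demo_problems = check_responses(demo)
```

Calling `check_responses(responses_df)` in the notebook should return an empty list if the reformatting step worked as intended.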


We then run the quickstart code, with a few changes to account for running in a notebook (for example, awaiting the coroutine directly in the cell).

In [13]:
import asyncio
from dotenv import load_dotenv
import pandas as pd
from langchain_openai import ChatOpenAI
from themefinder import find_themes

# Load LLM API settings
load_dotenv()

# Initialize LLM
llm = ChatOpenAI(
    api_key=open_ai_api_key,
    model="gpt-4o",
    temperature=0,
    model_kwargs={"response_format": {"type": "json_object"}},
)

# Question and system prompt
question = "What is your opinion?"
system_prompt = "You are an AI evaluation tool analyzing survey responses about a Python package."

# Define and run async function
async def get_themes():
    result = await find_themes(responses_df, llm, question, system_prompt)
    return result

# Run and store result
result = await get_themes()
result

2025-01-22 15:55:51,229 INFO: Running sentiment analysis on 1000 responses
2025-01-22 15:55:51,230 INFO: Running batch and run with batch size 10
2025-01-22 15:56:33,912 INFO: Running theme generation on 1000 responses
2025-01-22 15:56:33,913 INFO: Running batch and run with batch size 50
2025-01-22 15:56:38,924 INFO: Running theme condensation on 8 topics
2025-01-22 15:56:38,925 INFO: Running batch and run with batch size 10000
2025-01-22 15:56:45,269 INFO: Running topic refinement on 4 responses
2025-01-22 15:56:45,270 INFO: Running batch and run with batch size 10000
2025-01-22 15:56:48,659 INFO: Running theme mapping on 1000 responses using 4 themes
2025-01-22 15:56:48,661 INFO: Running batch and run with batch size 20
2025-01-22 15:56:52,722 INFO: Failed integrity check
2025-01-22 15:56:52,723 INFO: Present in original but not returned from LLM: {'143350', '117662', '110545', '125093', '42520', '35806', '79534', '135030', '18254', '122579', '167855', '19433', '46169', '157170', '1

{'question': 'What is your opinion?',
 'sentiment':      response_id                                           response position
 0          54288  Alien Moonlight Mask #flickr https://t.co/FEXV...  unclear
 1          93078  #COVID19 #Sturgis \nMotorcycle purchases will ...  unclear
 2          40839  New York Times: I Was a Screen Time Expert. Th...  unclear
 3          28008  We, the students, are the future of our state,...  unclear
 4         122356  Do you agree? Yes or no?\n#QueenErica #unilag ...  unclear
 ..           ...                                                ...      ...
 995       129802  "The #NBA may save us from #Covid19 as we take...  unclear
 996       167415  All people entering the state now will have to...  unclear
 997       142235  Lebanon is heading to an imminent collapse of ...  unclear
 998        49987  #VT News 📰\n31:07:2020\nFor more promotions📜, ...  unclear
 999        76526  US #Covid19 numbers have miraculously dropped ...  unclear
 
 [1000 rows

Our result object is a dictionary containing the outputs from each stage of our pipeline.

In [16]:
for key in result:
    print(f"Key: {key}")

Key: question
Key: sentiment
Key: topics
Key: condensed_topics
Key: refined_topics
Key: mapping


For easy manipulation, we'll take our mapping as a dataframe.

In [25]:
topic_mapping_df = pd.DataFrame(result['mapping'])
topic_mapping_df

Unnamed: 0,response_id,response,position,reasons,labels,stances
0,76724,"First posted 01/01/2020 ""Britains Death Knell""...",unclear,[],[],[]
1,86737,@MattHancock @BorisJohnson and the #COVIDIDIOT...,unclear,[The response criticizes the management of COV...,[A],[NEGATIVE]
2,160803,BBC News - Coronavirus pandemic could be over ...,unclear,[],[],[]
3,115745,Help slow the spread of #COVID19 and identify ...,unclear,[The response encourages self-reporting of sym...,[A],[POSITIVE]
4,158857,"my life as a piece of string: ""into lockdown- ...",unclear,[],[],[]
...,...,...,...,...,...,...
515,145930,Covid-19 social distance Halloween: The Kids d...,unclear,[],[],[]
516,146065,GT Capital posts lower H1 net income due to #C...,unclear,[],[],[]
517,92185,What does #DigitalTransformation mean to you? ...,unclear,[],[],[]
518,32073,While everyone in the world were stress buying...,unclear,[],[],[]
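
The run log above reports a failed integrity check, with some IDs "present in original but not returned from LLM". A rough way to see which responses are affected is to compare the IDs that were sent with the IDs in the mapping. This is a sketch under assumptions: the helper `missing_response_ids` is hypothetical, and IDs are compared as strings because the log prints them as strings.

```python
import pandas as pd


def missing_response_ids(original: pd.DataFrame, mapped: pd.DataFrame) -> set:
    """Return the response_id values present in `original` but absent from `mapped`.

    Hypothetical helper; IDs are cast to str so that 42 and '42' compare equal.
    """
    sent = set(original["response_id"].astype(str))
    returned = set(mapped["response_id"].astype(str))
    return sent - returned


# Stand-in frames: four responses sent, two returned (as strings).
demo_original = pd.DataFrame({"response_id": [1, 2, 3, 4]})
demo_mapped = pd.DataFrame({"response_id": ["1", "3"]})
demo_missing = missing_response_ids(demo_original, demo_mapped)
```

In the notebook, `missing_response_ids(responses_df, topic_mapping_df)` would list the tweets that never received a mapping, provided `response_id` is still a column of `responses_df` at that point.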


The "condensed_topics" entry is a table giving an overview of each topic.

In [65]:
result['condensed_topics']

Unnamed: 0,topic_label,topic_description,position,response_id
0,Mistrust in Public Health and Government,A significant theme is the mistrust in public ...,disagreement,0
1,Skepticism and Criticism of COVID-19 Narrative,There is a notable skepticism about the existe...,disagreement,1
2,Privacy and Surveillance Concerns,"Privacy concerns were highlighted, particularl...",disagreement,2
3,Safety Concerns in School Reopenings,There is opposition to reopening schools witho...,disagreement,3


Most of our tweets (688 of 1,000) were not assigned any topic, and most of the rest fit within a single topic. A few have been assigned more than one, so let's focus on visualising those.

In [64]:
topic_mapping_df['labels'].value_counts()

labels
[]        688
[A]       186
[B]        71
[D]        41
[C]        10
[A, B]      3
[C, B]      1
Name: count, dtype: int64
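
The counts above treat each distinct list as its own category, so `[A, B]` and `[A]` are tallied separately. Two complementary views can be sketched with plain pandas: how many labels each tweet received, and how often each topic appears overall. The small `demo_mapping` frame stands in for `topic_mapping_df` and is made up for illustration.

```python
import pandas as pd

# Stand-in for topic_mapping_df: one list of topic labels per tweet.
demo_mapping = pd.DataFrame({"labels": [[], ["A"], ["A", "B"], ["C", "B"], ["B"]]})

# Number of labels per tweet, then how many tweets have 0, 1, 2, ... labels.
labels_per_tweet = demo_mapping["labels"].apply(len)
summary = labels_per_tweet.value_counts().sort_index()

# One row per (tweet, label) pair; tweets with no labels become NaN and are dropped.
per_topic = demo_mapping["labels"].explode().dropna().value_counts()
```

Applied to the real mapping, `summary` separates unlabelled, single-topic, and multi-topic tweets, while `per_topic` counts a tweet once for every topic it was assigned.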

In [48]:
print('topic A is about')
print(result['refined_topics']['A'][0])

print('topic B is about')
print(result['refined_topics']['B'][0])

topic A is about
Public Health and Government Trust: The level of trust in public health measures and government competence during the COVID-19 pandemic, including views on the necessity and effectiveness of restrictions and the scientific basis of measures like mask mandates.
topic B is about
COVID-19 Narrative and Pharmaceutical Industry: Perspectives on the existence or severity of COVID-19 and the role of pharmaceutical companies, including concerns about motivations behind public health policies.


Now let's display each tweet that has been assigned both topics, along with the reasons for the assignment.

In [61]:
# Select tweets labelled with both A and B, regardless of label order
filtered_df = topic_mapping_df[topic_mapping_df['labels'].apply(lambda x: set(x) == {'A', 'B'})].reset_index(drop=True)

# Index a copy of responses_df by response_id, so re-running this cell is safe
responses_by_id = responses_df.set_index('response_id')

# Print each response text, followed by the reasons it was assigned
for idx, row in filtered_df.iterrows():
    print(f"Response: {responses_by_id.loc[row['response_id'], 'response']}")
    print(f"Reason: {row['reasons']}")

Response: I wear my mask, I take all precautions so I am not down playing the #COVID19 at all but politics is the real virus… https://t.co/19k0yDORU5
Reason: ['The response mentions taking precautions and wearing a mask, which indicates a recognition of the necessity and effectiveness of public health measures.', "The statement 'politics is the real virus' suggests skepticism about the motivations behind public health policies, hinting at a narrative that questions the role of politics in the pandemic response."]
Response: Wait until furlough ends in October !
#NoNewNormal #COVID19 #NoMasks #Freedom #Liberty #Plandemic #KeepBritainFree https://t.co/pbMiv3OD2L
Reason: ['The use of hashtags like #NoMasks and #Plandemic suggests skepticism towards the necessity and effectiveness of public health measures, including mask mandates.', 'The hashtag #Plandemic implies a belief in a narrative that questions the severity or existence of COVID-19 and suggests ulterior motives behind public health