## Topic Modeling ChatGPT Tweets - Iteration 1
The first iteration models the topics in the tweet corpus without specifying any custom stopwords.  The resulting topics then inform a second iteration, which removes words that do not appear to add information to the model.  

### Import Cleaned Dataset
Information about the dataset is available [here](https://www.kaggle.com/datasets/konradb/chatgpt-the-tweets).  This [notebook](https://app.hex.tech/5b266aaf-b343-4ae7-bdea-218e8fe3001f/hex/87ba702b-030a-4821-8ee1-8f7bf0117139/draft/logic) provides more detail on how the dataset was cleaned, profiled and analyzed prior to modeling.  

In [2]:
import pandas as pd
data = pd.read_csv(r'C:\Users\joelm\Documents\Projects\JMS-Analytics\ChatGPT Topic Modeling\Data\tweets_clean.csv')

In [3]:
data.head()

Unnamed: 0,Unnamed: 1,user_name,tweet,user_followers,user_friends,user_duration_days,user_verified,tweet_date
0,0,Harry Ax,Struggling to make progress on your New Year's resolutions? ChatGpt can provide the guidance and support you need. Watch my latest video to learn more: \n\nhttps://t.co/YqTe9Ogia8\n\n #newyear #resolutions #chatgpt,149,575,4404,False,2023-01-02 20:50:45
1,1,Shiva Chandrashekher,To monetize #ChatGPT at scale #OpenAI need to figure how to incentivize web-content publishers so they can maintain the training data inflow to improve ML models. $GOOG does link-outs to pubs who make money on site Ads/subs. Pubs losing traffic means fewer sites &amp; poor trng data,43,855,4710,False,2023-01-02 20:50:23
2,2,Holly Sawyer,"#AI, all ya’ll! #ChatGPT a rap for me on why #ESL teachers should incorporate coding in their language instruction, but don’t take my word for it! #codeitup #CSforall #edtech 🎧🎶 https://t.co/fjVodZklLQ",471,592,1423,False,2023-01-02 20:50:05
3,3,Michel S.Chbeir,#ChatGPT to replace @Google ?\nThe first serious threat to the giant's search engine...,134,282,3811,False,2023-01-02 20:47:34
4,4,blaquiere guillaume,"My first article of the year is not about @GoogleCloudTech but about #ChatGPT.\n\nA personal opinion of that outstanding and incomplete AI.\n\nDon't hesitate to comment and provide feedback, I'm not a data scientist, only a user and a dreamer!\n\nhttps://t.co/f4pNirqckt",838,123,3701,False,2023-01-02 20:47:30


### Text Pre-processing
All numeric and special characters in each tweet were replaced with blanks.  Tweets were then tokenized, basic stopwords were removed, and bigrams and trigrams were extracted.  Finally, the corpus was lemmatized.  The resulting corpus had a vocabulary of just over 32,000 terms.  

To carry out these pre-processing steps with minimal code, the [PyCaret](https://pycaret.org/) library was used. 
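
For readers who want to see what the first few steps amount to, the following is a minimal standard-library sketch, not PyCaret's actual implementation. The stopword list, the function names `clean_and_tokenize` and `ngrams`, and the rule for dropping single-letter tokens are all assumptions made for illustration. It also enumerates every adjacent pair and triple, whereas collocation-based extraction would typically keep only frequent n-grams, and it skips lemmatization entirely.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are far larger).
STOPWORDS = {"a", "an", "the", "to", "of", "and", "on", "for", "your", "my"}

def clean_and_tokenize(tweet):
    # Replace every run of non-letter characters with a single blank.
    letters_only = re.sub(r"[^A-Za-z]+", " ", tweet)
    tokens = letters_only.lower().split()
    # Drop stopwords and single-letter leftovers such as the "s" from "Year's".
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]

def ngrams(tokens, n):
    # Join each window of n consecutive tokens with underscores.
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = "Struggling to make progress on your New Year's resolutions? #chatgpt"
tokens = clean_and_tokenize(sample)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(tokens)
print(bigrams[:2])
print(trigrams[:1])
```

The sample text is taken from the first tweet shown above. PyCaret's own pipeline removes more than this sketch does, so its output for the same tweet differs.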

In [4]:
from pycaret.nlp import *
lda_iter1 = setup(data = data, target = 'tweet', session_id = 123)

Description,Value
session_id,123
Documents,60504
Vocab Size,32016
Custom Stopwords,False


### Model Training & Assignment
A Latent Dirichlet Allocation (LDA) model was trained with an initial "guess" of 6 topics, and its topic weights were then assigned to each tweet in the dataset.

In [5]:
lda = create_model('lda', num_topics = 6, multi_core = True)

In [6]:
print(lda)

LdaModel(num_terms=32016, num_topics=6, decay=0.5, chunksize=100)


In [7]:
lda_results = assign_model(lda)
lda_results.head()

Unnamed: 0,Unnamed: 1,user_name,tweet,user_followers,user_friends,user_duration_days,user_verified,tweet_date,Topic_0,Topic_1,Topic_2,Topic_3,Topic_4,Topic_5,Dominant_Topic,Perc_Dominant_Topic
0,0,Harry Ax,struggle make progress provide guidance support need watch late video learn resolution,149,575,4404,False,2023-01-02 20:50:45,0.012864,0.839116,0.012824,0.012882,0.012852,0.109462,Topic 1,0.84
1,1,Shiva Chandrashekher,scale openai need figure incentivize web content publisher maintain training datum inflow improve ml model goog link pub make money site ad sub pub lose traffic mean site amp poor trng datum,43,855,4710,False,2023-01-02 20:50:23,0.522373,0.005355,0.005336,0.374658,0.005335,0.086944,Topic 0,0.52
2,2,Holly Sawyer,ai teacher incorporate code language instruction take word codeitup fjvodzkllq,471,592,1423,False,2023-01-02 20:50:05,0.137707,0.018319,0.018053,0.35924,0.018067,0.448613,Topic 5,0.45
3,3,Michel S.Chbeir,giant,134,282,3811,False,2023-01-02 20:47:34,0.083376,0.083376,0.083376,0.083376,0.083376,0.583122,Topic 5,0.58
4,4,blaquiere guillaume,year personal opinion outstanding ai hesitate comment provide feedback user,838,123,3701,False,2023-01-02 20:47:30,0.015205,0.716759,0.106256,0.015279,0.015184,0.131316,Topic 1,0.72
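
In the rows shown above, `Dominant_Topic` matches the topic with the largest weight and `Perc_Dominant_Topic` matches that weight rounded to two decimals. The sketch below reproduces that relationship for the first and fourth rows. The helper name `dominant_topic` is hypothetical, and the logic is inferred from the displayed output rather than taken from PyCaret's source.

```python
def dominant_topic(weights):
    # Index of the largest topic weight (first index wins on ties).
    best = max(range(len(weights)), key=lambda i: weights[i])
    # Label and rounded share, mirroring the two summary columns.
    return f"Topic {best}", round(weights[best], 2)

# Topic_0 .. Topic_5 weights copied from rows 0 and 3 of the output above.
row0 = [0.012864, 0.839116, 0.012824, 0.012882, 0.012852, 0.109462]
row3 = [0.083376, 0.083376, 0.083376, 0.083376, 0.083376, 0.583122]

label0, share0 = dominant_topic(row0)
label3, share3 = dominant_topic(row3)

print(label0, share0)
print(label3, share3)
```

Row 3 is worth noting: after pre-processing the tweet is reduced to the single token "giant", so five of its six topic weights are identical and the assignment to Topic 5 rests on very little evidence.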


### Corpus and Topic Analysis
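Dominant words across the entire corpus included: co, ai, https, use, chatgpt, ask, write, make, get.  The tokens co and https are likely remnants of shortened `https://t.co/...` links left behind after special characters were replaced with blanks.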

In [8]:
plot_model()

On a topic-by-topic basis:

* Topic 0: make, take, chatgpt, try, pretty, time, use, write, get, thing
* Topic 1: ai, chatgpt, go, see, think, people, new, use, make, work
* Topic 2: thank, story, happen, image, live, stuff, incredible, human, team, true
* Topic 3: use, write, ai, generate, create, content, text, model, human, chatgpt
* Topic 4: co, https, ask, write, explain, know, get, seem, ai, think
* Topic 5: ask, answer, question, give, chatgpt, write, good, get, say, would
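
Several words recur across these lists, which is one signal that they may not help separate topics. As a rough sketch of how candidates for the second iteration's custom stopwords could be shortlisted, the code below counts how many topics each top word appears in. The threshold `MIN_TOPICS = 3` is a hypothetical choice, and this heuristic is an illustration rather than the method actually used in the next iteration.

```python
from collections import Counter

# Top-10 words per topic, copied from the list above.
topics = {
    0: ["make", "take", "chatgpt", "try", "pretty", "time", "use", "write", "get", "thing"],
    1: ["ai", "chatgpt", "go", "see", "think", "people", "new", "use", "make", "work"],
    2: ["thank", "story", "happen", "image", "live", "stuff", "incredible", "human", "team", "true"],
    3: ["use", "write", "ai", "generate", "create", "content", "text", "model", "human", "chatgpt"],
    4: ["co", "https", "ask", "write", "explain", "know", "get", "seem", "ai", "think"],
    5: ["ask", "answer", "question", "give", "chatgpt", "write", "good", "get", "say", "would"],
}

# Count the number of topics in which each word appears.
topic_counts = Counter(word for words in topics.values() for word in set(words))

MIN_TOPICS = 3  # hypothetical cutoff: a word shared by half the topics or more
candidates = sorted(w for w, c in topic_counts.items() if c >= MIN_TOPICS)

print(candidates)  # ['ai', 'chatgpt', 'get', 'use', 'write']
```

Words such as co and https fall below this cutoff because they appear in only one topic, so a shortlist like this would still need manual review before being passed as custom stopwords.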

In [9]:
plot_model(lda, plot = 'topic_distribution')

A t-distributed stochastic neighbor embedding (t-SNE) plot shows substantial overlap between topics, with Topic 5 dispersed widely across the map.  

In [12]:
plot_model(lda, plot = 'tsne')

### Save Model
The model is saved in the notebook directory and can be reloaded later with `saved_lda = load_model('Iter1_lda')`.

In [13]:
save_model(lda,'Iter1_lda')

Model Succesfully Saved


(<gensim.models.ldamulticore.LdaMulticore at 0x1aac2b796d0>, 'Iter1_lda.pkl')