## Topic Modeling ChatGPT Tweets - Iteration 2
In this second iteration, terms from the first iteration that contributed little information to the model (such as "ChatGPT" and "AI") were removed as custom stopwords, and the model was retrained.  This iteration also starts with 5 topics rather than 6.  

### Import Cleaned Dataset
Information about the dataset is available [here](https://www.kaggle.com/datasets/konradb/chatgpt-the-tweets).  This [notebook](https://app.hex.tech/5b266aaf-b343-4ae7-bdea-218e8fe3001f/hex/87ba702b-030a-4821-8ee1-8f7bf0117139/draft/logic) provides more detail on how the dataset was cleaned, profiled and analyzed prior to modeling.  

In [1]:
import pandas as pd
data = pd.read_csv(r'C:\Users\joelm\Documents\Projects\JMS-Analytics\ChatGPT Topic Modeling\Data\tweets_clean.csv')

In [2]:
data.head()

Unnamed: 0,Unnamed: 1,user_name,tweet,user_followers,user_friends,user_duration_days,user_verified,tweet_date
0,0,Harry Ax,Struggling to make progress on your New Year's resolutions? ChatGpt can provide the guidance and support you need. Watch my latest video to learn more: \n\nhttps://t.co/YqTe9Ogia8\n\n #newyear #resolutions #chatgpt,149,575,4404,False,2023-01-02 20:50:45
1,1,Shiva Chandrashekher,To monetize #ChatGPT at scale #OpenAI need to figure how to incentivize web-content publishers so they can maintain the training data inflow to improve ML models. $GOOG does link-outs to pubs who make money on site Ads/subs. Pubs losing traffic means fewer sites &amp; poor trng data,43,855,4710,False,2023-01-02 20:50:23
2,2,Holly Sawyer,"#AI, all ya’ll! #ChatGPT a rap for me on why #ESL teachers should incorporate coding in their language instruction, but don’t take my word for it! #codeitup #CSforall #edtech 🎧🎶 https://t.co/fjVodZklLQ",471,592,1423,False,2023-01-02 20:50:05
3,3,Michel S.Chbeir,#ChatGPT to replace @Google ?\nThe first serious threat to the giant's search engine...,134,282,3811,False,2023-01-02 20:47:34
4,4,blaquiere guillaume,"My first article of the year is not about @GoogleCloudTech but about #ChatGPT.\n\nA personal opinion of that outstanding and incomplete AI.\n\nDon't hesitate to comment and provide feedback, I'm not a data scientist, only a user and a dreamer!\n\nhttps://t.co/f4pNirqckt",838,123,3701,False,2023-01-02 20:47:30


### Text Pre-processing
All numeric and special characters in each tweet were replaced with blanks, and the tweets were then tokenized.  Basic and custom stopwords were removed, and bigrams and trigrams were extracted.  Finally, the corpus was lemmatized.  The resulting corpus had a vocabulary of just under 45,000 terms.  

The [PyCaret](https://pycaret.org/) library was used again here.

In [6]:
from pycaret.nlp import *
lda_iter2 = setup(data = data, target = 'tweet', session_id = 123,
                 custom_stopwords = ['co', 'ai', 'https', 'use', 'chatgpt', 'ask', 'write', 'make', 'get',
                                    'answer', 'question', 'give', 'say', 'would'],           
                 )

Description,Value
session_id,123
Documents,60504
Vocab Size,44893
Custom Stopwords,True
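
For intuition, the character-stripping, tokenizing, and stopword-removal steps can be approximated in plain Python. This is only a rough sketch, not PyCaret's internal pipeline: `rough_clean` and `BASIC_STOPWORDS` are hypothetical names, the basic stopword list is a tiny stand-in, and n-gram extraction and lemmatization are left out, so its output will differ from the processed tweets PyCaret produces.

```python
import re

# Custom stopwords copied from the setup() call above.
CUSTOM_STOPWORDS = {
    "co", "ai", "https", "use", "chatgpt", "ask", "write", "make", "get",
    "answer", "question", "give", "say", "would",
}

# Tiny stand-in for a general English stopword list (assumption: the real
# list used by the library is much larger).
BASIC_STOPWORDS = {"a", "an", "and", "the", "to", "of", "on", "is", "in", "for"}


def rough_clean(tweet):
    """Replace non-letters with blanks, lowercase, tokenize, drop stopwords."""
    letters_only = re.sub(r"[^A-Za-z]+", " ", tweet)
    tokens = letters_only.lower().split()
    stopwords = BASIC_STOPWORDS | CUSTOM_STOPWORDS
    return [token for token in tokens if token not in stopwords]


# One of the tweets shown in data.head() above.
sample = "#ChatGPT to replace @Google ?\nThe first serious threat to the giant's search engine..."
print(rough_clean(sample))
```

Even in this simplified form, the custom stopword "chatgpt" drops out of the token list, which is the effect this iteration relies on.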


### Model Training & Assignment
A Latent Dirichlet Allocation (LDA) model was trained with 5 topics in this iteration, and the resulting topic weights were assigned to each tweet in the dataset.

In [7]:
lda = create_model('lda', num_topics = 5, multi_core = True)

In [8]:
print(lda)

LdaModel(num_terms=44893, num_topics=5, decay=0.5, chunksize=100)


In [9]:
lda_results = assign_model(lda)
lda_results.head()

Unnamed: 0,Unnamed: 1,user_name,tweet,user_followers,user_friends,user_duration_days,user_verified,tweet_date,Topic_0,Topic_1,Topic_2,Topic_3,Topic_4,Dominant_Topic,Perc_Dominant_Topic
0,0,Harry Ax,struggle progress new_year_resolution provide guidance support need watch late video learn resolution,149,575,4404,False,2023-01-02 20:50:45,0.298359,0.131039,0.015948,0.538305,0.016349,Topic 3,0.54
1,1,Shiva Chandrashekher,monetize scale openai need figure incentivize web content publisher maintain training datum inflow improve ml model goog link pub money site ad sub pub lose traffic mean site amp poor trng datum,43,855,4710,False,2023-01-02 20:50:23,0.86427,0.006337,0.006349,0.006374,0.11667,Topic 0,0.86
2,2,Holly Sawyer,teacher incorporate code language instruction take word codeitup,471,592,1423,False,2023-01-02 20:50:05,0.293563,0.024607,0.024258,0.024929,0.632643,Topic 4,0.63
3,3,Michel S.Chbeir,giant,134,282,3811,False,2023-01-02 20:47:34,0.599828,0.100043,0.100043,0.100043,0.100043,Topic 0,0.6
4,4,blaquiere guillaume,year googlecloudtech personal opinion outstanding incomplete hesitate comment provide feedback user dreamer fpnirqckt,838,123,3701,False,2023-01-02 20:47:30,0.016029,0.110107,0.16859,0.016175,0.689099,Topic 4,0.69
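
The `Dominant_Topic` and `Perc_Dominant_Topic` columns in the output above are consistent with taking the highest of the five topic weights and rounding it to two decimals. The sketch below reproduces that for the first row; `dominant_topic` is a hypothetical helper, and the rule is inferred from the displayed values rather than taken from PyCaret's source.

```python
# Topic weights for the first document, copied from the assign_model() output.
weights = {
    "Topic_0": 0.298359,
    "Topic_1": 0.131039,
    "Topic_2": 0.015948,
    "Topic_3": 0.538305,
    "Topic_4": 0.016349,
}


def dominant_topic(topic_weights):
    """Return (label, weight rounded to 2 decimals) for the largest weight."""
    name = max(topic_weights, key=topic_weights.get)
    label = name.replace("_", " ")  # "Topic_3" -> "Topic 3", as displayed
    return label, round(topic_weights[name], 2)


print(dominant_topic(weights))  # ('Topic 3', 0.54)
```

The five weights sum to 1, so each row can be read as a probability distribution over topics, and a low `Perc_Dominant_Topic` signals a tweet that does not belong clearly to any single topic.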


### Corpus and Topic Analysis
The top 10 words across the corpus in this iteration were: good, think, go, know, new, try, see, create, time, work

In [10]:
plot_model()

On a topic-by-topic basis:

* Topic 0: amp, model, information, problem, datum, provide, train, student, may, could
* Topic 1: know, people, see, human, think, even, future, already, world, go
* Topic 2: time, wonder, day, guess, second, hour, name, stackoverflow, startup, feel
* Topic 3: try, good, go, think, word, take, see, thing, well, time
* Topic 4: generate, create, code, chatbot, text, content, human, new, prompt, tool

In [11]:
plot_model(lda, plot = 'topic_distribution')
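
A topic distribution of this kind can be read as a count of documents per dominant topic. As a minimal sketch, assuming that is what the plot summarizes, the tally below uses only the five `Dominant_Topic` values visible in the `assign_model` output; the full dataset has 60,504 documents, so these counts are illustrative only.

```python
from collections import Counter

# Dominant_Topic values for the five rows shown by lda_results.head().
dominant = ["Topic 3", "Topic 0", "Topic 4", "Topic 0", "Topic 4"]

counts = Counter(dominant)
for topic in sorted(counts):
    print(topic, counts[topic])
```

On the full results, `lda_results['Dominant_Topic'].value_counts()` gives the same kind of tally directly in pandas.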

The t-distributed stochastic neighbor embedding (t-SNE) plot still shows substantial topic overlap, with topics 3 and 4 dispersed widely across the map.  

In [12]:
plot_model(lda, plot = 'tsne')

### Save Model
The model is saved in the notebook directory and can be loaded later with `saved_lda = load_model('Iter2_lda')`.

In [13]:
save_model(lda,'Iter2_lda')

Model Succesfully Saved


(<gensim.models.ldamulticore.LdaMulticore at 0x20561942ca0>, 'Iter2_lda.pkl')

### Export Data with Topic Assignments
The final dataset, with topic assignments, is exported for use in downstream visualization tools.

In [18]:
lda_results.to_csv(r'C:\Users\joelm\Documents\Projects\JMS-Analytics\ChatGPT Topic Modeling\Data\tweets_clean_topics.csv')