# Environment set-up

In [None]:
%%capture
!pip install gensim==3.8.3
!pip install numpy==1.17.4
# pyLDAvis was pinned twice (3.2.2, then 2.1.2); only the last pin takes effect, so keep that one
!pip install pyLDAvis==2.1.2
!pip install -U transformers
!pip install bertopic

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/Codes/Experiments/Topic_modeling/SA_LDA

In [None]:
%%capture
import warnings
import re
import pandas as pd
from tqdm.notebook import tqdm
from tqdm.notebook import tqdm_notebook
from itertools import chain
# import networkx as nx
from collections import Counter
from calendar import month_name,day_name
# from normalizer import normalize 
from topic_modeling import Topic_modeling
from string import punctuation
from gensim.summarization import summarize
import matplotlib.pyplot as plt
import nltk
from nltk import FreqDist, bigrams
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
import spacy
import pyLDAvis
from transformers import pipeline
from wordcloud import WordCloud
import seaborn as sns
import plotly.express as px
# from bidi.algorithm import get_display
nltk.download('stopwords')

stop_words = list(set(stopwords.words('english')))
lemmatizer = WordNetLemmatizer()

ps = PorterStemmer()

# %load_ext blackcellmagic

# Load and Overview of Initial Data

For topic modeling we only need a single text column and its index.

In [None]:
# The data frame before preprocessing
df = pd.read_csv("training_input.tsv",sep='\t')

print("First three rows:\n\n\n")
df.head(3)

First three rows:





Unnamed: 0,paper_sentence_id,paper,sentence_id,sentence_text,sentence_text_plus,label,split,modified_data,external_data,qc_paper
0,annotator claim project 1_1/economics_166.txt.tsv0,annotator claim project 1_1/economics_166.txt.tsv,0,social learning equilibria,"social learning equilibria sentence id: 0, sentence placement: 0.0",0.0,train,0,0.0,0
1,annotator claim project 1_1/economics_166.txt.tsv1,annotator claim project 1_1/economics_166.txt.tsv,1,We consider a large class of social learning models in which a group of agents face uncertainty regarding a state of the world share the same utility function observe private signals and interact ...,We consider a large class of social learning models in which a group of agents face uncertainty regarding a state of the world share the same utility function observe private signals and interact ...,0.0,train,0,0.0,0
2,annotator claim project 1_1/economics_166.txt.tsv2,annotator claim project 1_1/economics_166.txt.tsv,2,We introduce social learning equilibria a static equilibrium concept that abstracts away from the details of the given extensive form but nevertheless captures the corresponding asymptotic equilib...,We introduce social learning equilibria a static equilibrium concept that abstracts away from the details of the given extensive form but nevertheless captures the corresponding asymptotic equilib...,1.0,train,0,0.0,0


In [None]:
print("Columns:\n\n\n")
df.columns

Columns:





Index(['paper_sentence_id', 'paper', 'sentence_id', 'sentence_text',
       'sentence_text_plus', 'label', 'split', 'modified_data',
       'external_data', 'qc_paper'],
      dtype='object')

## More Statistical Info

In [None]:
# Overview of the raw data


print("Overview of columns, non-null counts and dtypes:\n\n\n")
df.info()
print("Statistical overview of columns:\n\n\n")
df.describe()

# Keep only sentences shorter than 100 characters, and only the text column
df = df[df["sentence_text"].apply(lambda x: len(x)<100)]
df = df[["sentence_text"]]

Overview of columns, non-null counts and dtypes:



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48725 entries, 0 to 48724
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   paper_sentence_id   48725 non-null  object 
 1   paper               48725 non-null  object 
 2   sentence_id         48725 non-null  int64  
 3   sentence_text       48725 non-null  object 
 4   sentence_text_plus  48725 non-null  object 
 5   label               48725 non-null  float64
 6   split               48725 non-null  object 
 7   modified_data       48725 non-null  int64  
 8   external_data       48283 non-null  float64
 9   qc_paper            48725 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 3.7+ MB
Statistical overview of columns:





The data used for topic modeling consists of a single text column, "sentence_text", restricted to documents shorter than 100 characters.

In [None]:
df.head(3)

# Preprocessing on documents


#### Tokenize, Reindex, Normalize, Filter

Filtering: keep only the tokens tagged as nouns or adjectives, based on each token's part of speech (POS).

In [None]:
%%capture
nlp = spacy.load("en_core_web_sm")

def proc_text(text, stemming=None, return_token=False, filter_POS_based=False):
    """Process the input text: strip punctuation, lowercase, tokenize and drop stop words.

    Args:
        text (str): the text to be processed
        stemming (str): default=None
            - 'stem': stem words using nltk's PorterStemmer
            - 'lemma': lemmatize words using nltk's WordNetLemmatizer
        return_token (bool): return a list of tokens instead of a joined string
        filter_POS_based (bool): keep only nouns and adjectives (as spaCy lemmas), based on the POS of tokens
    """
    text_no_punct = ''.join([c for c in text if c not in punctuation])
    text_no_punct = text_no_punct.lower()
    # Drop stop words and tokens of one or two characters
    tokens = [word for word in word_tokenize(text_no_punct) if word not in stop_words and len(word) > 2]
    if stemming == 'lemma':
        tokens = [lemmatizer.lemmatize(token) for token in tokens]
    elif stemming == 'stem':
        tokens = [ps.stem(token) for token in tokens]
    if filter_POS_based:
        tags = ['NOUN', 'ADJ']
        doc = nlp(" ".join(tokens))
        tokens = [token.lemma_ for token in doc if token.pos_ in tags]
    if return_token:
        return tokens
    return ' '.join(tokens)

tqdm.pandas()  # registers progress_apply on pandas objects, using the notebook progress bar
nltk.download('punkt')

# df["sentence_text"] = df["sentence_text"].progress_apply(lambda x: normalize(x))
df.reset_index(drop=True, inplace=True)
df["tokens"] = df["sentence_text"].progress_apply(lambda x: proc_text(x,return_token=True,filter_POS_based=True))

The data after preprocessing:

In [None]:
df.head()

Unnamed: 0,sentence_text,tokens
0,social learning equilibria,[]
1,Estimating the long-term effects of treatments is of interest in many fields .,"[longterm, effect, treatment, interest, many, field]"
2,The validity of the surrogacy condition is often controversial .,"[condition, controversial]"
3,nonparametric tests for treatment effect heterogeneity with duration outcomes,"[treatment, effect, heterogeneity, duration, outcome]"
4,at the parametric -rate being the sample size .,"[parametric, rate, sample, size]"


## Parsing the token column with ast.literal_eval

When the data frame is saved to CSV and read back, the token lists come back as strings. The ast.literal_eval function can safely evaluate strings containing Python literals from unknown sources without us having to parse the values ourselves. However, complex expressions involving indexing or operators cannot be evaluated using this function.
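As a small illustration of this behaviour (the sample strings below are made up for the example), a token list that went through a CSV round trip can be restored with `ast.literal_eval`, while an expression that involves indexing is rejected:

```python
import ast

# A token list as it appears after a round trip through CSV: a plain string.
serialized = "['treatment', 'effect', 'heterogeneity']"

tokens = ast.literal_eval(serialized)  # parsed back into a real Python list
print(type(tokens).__name__, tokens)

# Expressions with indexing (or function calls) are not literals and raise ValueError.
try:
    ast.literal_eval("['a', 'b'][0]")
    rejected = False
except ValueError:
    rejected = True
print("indexing expression rejected:", rejected)
```

Because only literals are accepted, arbitrary code embedded in a CSV cell is never executed.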

In [None]:
import ast

df = pd.read_csv("clean_df.csv")
# The token lists were serialized as strings in the CSV; parse them back into Python lists
df["tokens"] = df["tokens"].apply(ast.literal_eval)

df.head()

Unnamed: 0,sentence_text,tokens
0,social learning equilibria,[]
1,Estimating the long-term effects of treatments is of interest in many fields.,"[longterm, effect, treatment, interest, many, field]"
2,The validity of the surrogacy condition is often controversial.,"[condition, controversial]"
3,nonparametric tests for treatment effect heterogeneity with duration outcomes,"[treatment, effect, heterogeneity, duration, outcome]"
4,at the parametric -rate being the sample size.,"[parametric, rate, sample, size]"


# Training and Fitting the BERTopic Model



BERTopic is a topic modeling technique that leverages transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

In [None]:
%%capture
from bertopic import BERTopic
import pandas as pd

df_train = df

############## Load
# Loads a model saved by an earlier run of the training cell below; skip this line on a first run
BERTopic_model = BERTopic.load("BERTopic_model")


In [None]:
############# Train

BERTopic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = BERTopic_model.fit_transform(df_train["sentence_text"].values)

BERTopic_model.save("BERTopic_model")


In [None]:
BERTopic_model.get_topic(14)

[('model', 0.044388181040100554),
 ('models', 0.03205055570846602),
 ('robustness', 0.02853781709840847),
 ('proposed', 0.02172129852699391),
 ('limited', 0.018763067103441423),
 ('predictions', 0.017861591182225023),
 ('empirical', 0.017528958057896796),
 ('stateoftheart', 0.016192009977522712),
 ('results', 0.01494917645010978),
 ('confidence', 0.014502910824566681)]

# Prediction with the BERTopic Model

## Predicting the topical cluster of a new document

In [None]:
import sys

if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
    
new_doc = "By semantically optimizing your content, you add more meaning to the words you use. You optimize for the true intent of your users, not just answering a simple query. This means answering the first question, then answering the second, third, fourth, and fifth questions right after that."

BERTopic_model.transform([new_doc])

([-1], array([[1.28115954e-86, 4.39525155e-86, 5.43352318e-86, 2.55100709e-84,
         5.76482981e-86, 5.63452782e-87, 5.11227293e-86, 6.11347090e-86,
         4.25884274e-86, 3.51313234e-87, 6.22409512e-88, 7.05914482e-88,
         1.71919871e-65, 2.15926480e-88, 1.12482722e-65, 2.01048350e-71,
         2.49990285e-88, 2.22815768e-84, 8.15905202e-85, 6.04844669e-87,
         5.01742726e-88, 4.76248403e-86, 1.08397365e-65, 5.02190268e-87,
         2.55991972e-01, 4.34769079e-86, 1.60854336e-71, 1.08315926e-86,
         2.06214947e-86, 5.82919828e-87, 2.83988646e-88, 1.07258230e-86,
         2.43585896e-01, 3.51047376e-87, 2.94489474e-86, 5.06129630e-87,
         6.71329685e-87, 4.63801443e-86, 4.52048334e-86, 1.03010074e-86,
         8.40492829e-88, 4.82589061e-86, 4.27745823e-86, 1.65258767e-86,
         4.53996704e-86, 1.09057100e-86, 3.75098522e-84, 4.35493344e-86,
         4.62847736e-86, 4.58100388e-86, 2.11772486e-71, 1.78299971e-71,
         3.75038646e-86, 1.92968560e-71, 4.85
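
The first element of the result, `-1`, is BERTopic's outlier label: the document was not assigned to any topic. The second element holds one probability per topic. A minimal sketch of ranking the most likely topics from such a row, using a short made-up row instead of the truncated array above (`top_k_topics` is a hypothetical helper, not part of BERTopic):

```python
def top_k_topics(prob_row, k=3):
    """Return (topic_index, probability) pairs for the k largest probabilities."""
    ranked = sorted(enumerate(prob_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up probabilities for five topics; a real row has one entry per fitted topic.
example_row = [1.3e-86, 0.256, 4.4e-86, 0.244, 1.7e-65]
best = top_k_topics(example_row, k=2)
print(best)
```

Even when the hard assignment is `-1`, the ranked probabilities indicate which topics the document is closest to.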

# Analysis and Evaluation 


## Overview of the Topics in BERTopic_model

In [None]:
freq = BERTopic_model.get_topic_info()
freq.head(5)

Unnamed: 0,Topic,Count,Name
0,-1,4876,-1_the_of_to_is
1,0,407,0_quantum_entanglement_states_spin
2,1,223,1_patients_surgery_surgical_months
3,2,208,2_no_not_did_there
4,3,208,3_cancer_tumor_tumors_breast


# Coherence Analysis

We can use the coherence score in topic modeling to measure how interpretable the topics are to humans. In this case, topics are represented as the top N words with the highest probability of belonging to that particular topic. Briefly, the coherence score measures how similar these words are to each other.
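
To make the idea concrete, here is a minimal sketch of a coherence score based on normalized pointwise mutual information (NPMI) over document co-occurrence. It is a simplified stand-in chosen for illustration, not gensim's `c_v` measure; the function name and the toy documents are assumptions made for the example.

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs of a topic, based on document co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every given word
        return sum(all(w in doc for w in words) for doc in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs: lowest possible NPMI
        elif p12 == 1:
            scores.append(1.0)   # degenerate case (both words in every document), scored as 1 here
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)

# Toy corpus of tokenized documents (made up for the example)
docs = [["model", "robustness"], ["model", "robustness"],
        ["market", "price"], ["market", "price"]]
coherent = npmi_coherence(["model", "robustness"], docs)
mixed = npmi_coherence(["model", "price"], docs)
print(coherent, mixed)
```

Words that always appear together score close to 1, while words that never share a document score -1, so the first topic is rated as more coherent than the second.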

In [None]:
df_train.columns

Index(['sentence_text', 'tokens'], dtype='object')

## Phrase Modeling: Bigram and Trigram Models

min_count: ignore all words and bigrams with a total collected count lower than this. Its default value is 5.

threshold: the score threshold for forming phrases (higher means fewer phrases). A phrase of words a and b is accepted if (cnt(a, b) - min_count) * N / (cnt(a) * cnt(b)) > threshold, where N is the total vocabulary size. Its default value is 10.0.
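
The scoring rule above can be checked by hand. The sketch below evaluates the formula for one candidate bigram; the counts are made up for illustration and `phrase_score` is a hypothetical helper, not a gensim function.

```python
def phrase_score(count_a, count_b, count_ab, min_count, vocab_size):
    """Score from the formula above: (cnt(a, b) - min_count) * N / (cnt(a) * cnt(b))."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Made-up counts for a candidate bigram such as ("social", "learning")
score = phrase_score(count_a=30, count_b=25, count_ab=20, min_count=5, vocab_size=1000)
print(score)           # 20.0
print(score > 10.0)    # True: accepted at the default threshold
print(score > 100)     # False: rejected at threshold=100
```

This is why raising the threshold to 100, as in the cell below, keeps only pairs that co-occur far more often than their individual frequencies would suggest.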

In [None]:
import gensim
from gensim.utils import simple_preprocess

# Phrases expects an iterable of token lists, so tokenize the raw sentences first
sentences = [simple_preprocess(str(doc)) for doc in df_train['sentence_text']]

# Build the bigram and trigram models
# higher threshold fewer phrases.
bigram = gensim.models.Phrases(sentences, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[sentences], threshold=100)
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

In [None]:
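# export_phrases is a method and has to be called; gensim 3.8.x expects the tokenized sentences as input
tokenized_sentences = [str(doc).lower().split() for doc in df_train['sentence_text']]
list(bigram.export_phrases(tokenized_sentences))[:10]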

## Remove Stopwords, Make Bigrams and Lemmatize

Define the functions that remove stop words, build bigrams and trigrams, and lemmatize, then call them in sequence.

In [None]:
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]
    
def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document with spaCy, keeping only tokens whose POS tag is allowed.

    POS tags: https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [None]:
%%capture
import spacy
# Remove Stop Words
data_words_nostops = remove_stopwords(df_train['sentence_text'])
# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)
# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatized[:1])

In [None]:
from gensim.models import CoherenceModel
from gensim.utils import simple_preprocess
import gensim.corpora as corpora

# CoherenceModel needs tokenized texts and a dictionary built from the same tokens
texts = [simple_preprocess(str(doc)) for doc in df_train['sentence_text']]
id2word = corpora.Dictionary(texts)

# BERTopic is not a gensim model, so pass its top words per topic through `topics=`
# (skipping the outlier topic -1 and any word missing from the dictionary)
topic_words = [[word for word, _ in BERTopic_model.get_topic(topic) if word in id2word.token2id]
               for topic in BERTopic_model.get_topics() if topic != -1]
topic_words = [words for words in topic_words if len(words) > 1]

# Compute Coherence Score
coherence_model = CoherenceModel(topics=topic_words, texts=texts, dictionary=id2word, coherence='c_v')

# The custom LDA wrapper could not be passed as `model=` either: "This topic model is not
# currently supported. Supported topic models should implement the `get_topics` method."
# coherence_model = CoherenceModel(model=top, texts=df_train['sentence_text'], dictionary=id2word, coherence='c_v')

coherence = coherence_model.get_coherence()
print('\nCoherence Score: ', coherence)


# BERTopic Visualization

The outputs of a topic model can be visualized in several ways, including word clouds and sentence coloring, which show intuitively which topic is dominant in each document.

To see the probability of each topic occurring in a document, we use visualize_distribution.

## Per-topic visualization

In [None]:
BERTopic_model.visualize_distribution(probs[200], min_probability=0.015)

In [None]:
BERTopic_model.visualize_barchart()

*   Each bubble represents a topic. The larger the bubble, the higher the percentage of claims in the corpus that are about that topic.
*   Blue bars represent the overall frequency of each word in the corpus. If no topic is selected, the blue bars of the most frequently used words are displayed.
*   Red bars give the estimated number of times a given term was generated by a given topic.
*   The further the bubbles are from each other, the more different the topics are. This is also a good validation of our topic modeling, since the topics are well separated and do not overlap.

In [None]:
top = Topic_modeling(tokenized_text=df["tokens"])
vis = top.visualize(lda_model=lda_model,save=True)
pyLDAvis.display(vis)

The heatmap shows the pairwise similarity between topics.
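
One common way to obtain the matrix behind such a heatmap is to compute the cosine similarity between per-topic vectors. The sketch below assumes each topic is already represented by a numeric vector; the three vectors are made up, and the helper names are hypothetical rather than part of BERTopic.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Square matrix whose entry [i][j] is the similarity of topic i and topic j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Made-up topic vectors: topics 0 and 1 are close, topic 2 is unrelated to topic 0
topic_vectors = [[1.0, 0.0, 1.0], [1.0, 0.1, 0.9], [0.0, 1.0, 0.0]]
matrix = similarity_matrix(topic_vectors)
for row in matrix:
    print([round(value, 2) for value in row])
```

The diagonal is always 1 (each topic compared with itself), and the matrix is symmetric, which is why such heatmaps look mirrored along the diagonal.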

In [None]:
BERTopic_model.visualize_heatmap()





## Dynamic Topic Modeling

Dynamic topic modeling (DTM) is a collection of techniques aimed at analyzing the evolution of topics over time. These methods allow you to understand how a topic is represented over time. 
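
A rough sketch of the counting step behind a topics-over-time view is shown below: timestamps are split into equal-width bins and documents are counted per bin and topic. It covers only the frequency side of the analysis; the function name and the toy data are assumptions made for the example.

```python
from collections import Counter

def topic_counts_over_time(topics, timestamps, nr_bins):
    """Count documents per (time bin, topic) using equal-width bins over the timestamp range."""
    start, end = min(timestamps), max(timestamps)
    width = (end - start) / nr_bins
    if width == 0:
        width = 1  # all timestamps identical: everything falls into bin 0
    counts = Counter()
    for topic, stamp in zip(topics, timestamps):
        # The latest timestamp would land in bin nr_bins, so clamp it into the last bin
        bin_index = min(int((stamp - start) / width), nr_bins - 1)
        counts[(bin_index, topic)] += 1
    return counts

# Made-up topic assignments and publication years for six documents
topics_example = [0, 0, 1, 1, 1, 0]
years_example = [2010, 2011, 2012, 2018, 2019, 2020]
counts = topic_counts_over_time(topics_example, years_example, nr_bins=2)
print(sorted(counts.items()))
```

In this toy data, topic 0 dominates the first bin and topic 1 the second, which is the kind of shift a topics-over-time chart is meant to reveal.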

In [None]:
df.head(2)

Unnamed: 0,sentence_text,tokens
0,social learning equilibria,[]
1,Estimating the long-term effects of treatments is of interest in many fields.,"[longterm, effect, treatment, interest, many, field]"


In [None]:
# topics_over_time needs one timestamp per document. This dataset has no date column,
# so the calls below only run when such a column is available.
if "date" in df.columns:
    timestamps = df["date"].to_list()
    topics_over_time = BERTopic_model.topics_over_time(df["sentence_text"].to_list(), topics, timestamps, nr_bins=20)
    BERTopic_model.visualize_topics_over_time(topics_over_time, top_n_topics=6).show()

# Topic per class analysis

If the dataset has a class column, we can visualize the topics per class.

In [None]:
# classes = df["?"]
# topics_per_class = BERTopic_model.topics_per_class(df["sentence_text"], topics, classes)
# BERTopic_model.visualize_topics_per_class(topics_per_class, top_n=10)

In [None]:
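# Runs only once topics_per_class has been computed (see the commented cell above)
if "topics_per_class" in globals():
    BERTopic_model.visualize_topics_per_class(topics_per_class).show()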

In [None]:
# import re
# import pandas as pd

# trump = df = pd.read_csv("training_input.tsv",sep='\t')

# trump.sentence_text = trump.apply(lambda row: re.sub(r"http\S+", "", row.sentence_text).lower(), 1)
# trump.sentence_text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.sentence_text.split())), 1)
# trump.sentence_text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.sentence_text).split()), 1)
# trump = trump.loc[(trump.isRetweet == "f") & (trump.sentence_text != ""), :]
# timestamps = trump.date.to_list()
# tweets = trump.text.to_list()

## Topic Reduction

In [None]:
model = BERTopic(nr_topics=20)
# Fit the reduced model itself (not BERTopic_model) and keep its outputs under separate names
reduced_topics, reduced_probs = model.fit_transform(df_train["sentence_text"].values)

In [None]:
model.visualize_heatmap()

## Similar Topic Analysis

In [None]:
# BERTopic_model.visualize_term_rank()

If the dataset contained a date-time column, we could illustrate how the topics evolve over time.

In [None]:
"""
topics_over_time = BERTopic_model.topics_over_time(df['sentence_text'], dic, 10)
BERTopic_model.visualize_topics_over_time(topics_over_time)
"""

In [None]:
"""
import re
import pandas as pd
from bertopic import BERTopic

# Prepare data
trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
timestamps = trump.date.to_list()
tweets = trump.text.to_list()

# Create topics over time
model = BERTopic(verbose=True)
topics, probs = model.fit_transform(tweets)
topics_over_time = model.topics_over_time(tweets, topics, timestamps)

# Visualize topics over time with the updated colors
model.visualize_topics_over_time(topics_over_time)

"""

In [None]:
similar_topics, similarity = BERTopic_model.find_topics("cell", top_n=5)
similar_topics

[10, 0, 8, 53, 79]

Change the number of topics and see how the similarity changes.

In [None]:
# for i in range(3,50):
#   new_topics, new_probs = BERTopic_model.reduce_topics(df["sentence_text"].values, topics, probs, nr_topics=i);
#   similar_topics, similarity = BERTopic_model.find_topics("cell", top_n=5); 
#   similar_topics

The chart below draws a hierarchy tree for the 50 most frequent topics, showing how they relate to one another: topics that are joined lower in the tree share more similar words than topics that are only joined near the root.

In [None]:
# Hierarchical analysis
BERTopic_model.visualize_hierarchy(top_n_topics=50)





## Heatmap Model

In [None]:
BERTopic_model.visualize_heatmap()

IndexError: ignored

In [None]:
BERTopic_model.visualize_topics()

TypeError: ignored







In [None]:
BERTopic_model.visualize_barchart(top_n_topics=12)

In [None]:
new_topics, new_probs = BERTopic_model.reduce_topics(df["sentence_text"].values, topics, probs, nr_topics=20)

## Frequent-word analysis of three topics

In [None]:
def plot_top_n_words(word_list, title="Word Count", n=20, save_name=None):
  """Plot the n most frequent items of a list of strings.

  Args:
    - word_list (list): list of strings; these can be unigrams, bigrams, ...
    - title (str): title of the plot
    - n (int): number of words (list elements) to be displayed in the plot
    - save_name (str): if given, the figure is also written to "<save_name>.html"
  """
  fdist = FreqDist(word_list)
  words_df = pd.DataFrame({'word': list(fdist.keys()), 'count': list(fdist.values())})
  top_words = words_df.nlargest(columns="count", n=n)
  fig = px.bar(top_words, x='word', y='count', title=title)
  if save_name is not None:
    fig.write_html(f"{save_name}.html")
  fig.show()
  
for numtopic in df_dominant_topic["Dominant_Topic"].unique():
  top1_toks = df_dominant_topic[df_dominant_topic["Dominant_Topic"] == numtopic]["tokens"].tolist()
  flat_list = [item for sublist in top1_toks for item in sublist]
  plot_top_n_words(flat_list, title=f"Top keywords of topic {numtopic}", n=25)

In [None]:
top0 = df_dominant_topic[df_dominant_topic["Dominant_Topic"] == 0]
top1 = df_dominant_topic[df_dominant_topic["Dominant_Topic"] == 1]
top2 = df_dominant_topic[df_dominant_topic["Dominant_Topic"] == 2]

**Topic 0: top 10 most important sentences**

In [None]:
top0.sort_values(["Topic_Perc_Contrib"], ascending=False)[["Topic_Perc_Contrib","Dominant_Topic", "sentence_text"]][:10]

**Topic 1: top 10 most important sentences**

In [None]:
top1.sort_values(["Topic_Perc_Contrib"], ascending=False)[["Topic_Perc_Contrib","Dominant_Topic", "sentence_text"]][:10]

**Topic 2: top 10 most important sentences**

In [None]:
top2.sort_values(["Topic_Perc_Contrib"], ascending=False)[["Topic_Perc_Contrib","Dominant_Topic", "sentence_text"]][:10]

**Conclusion:**
- Topic0:

  This category is mainly about biochemical research. The most frequent words in this topic are patient, model and result (250 to 300 occurrences), alongside other descriptive words such as treatment, cancer, clinical and infection. From the top 10 most important sentences in this topic, we infer that the texts mostly describe models and methods concerning chemical activity in the human body (amino acids, proteomic findings, synaptic activity) in relation to mortality and immunology and, more generally, health care (WHO, food, organic).

- Topic1:

  As the chart of the 25 most frequent words and the most important sentences of this class show, the contents are mainly about neural network methods and technologies for treating health problems. The significant phrases detected in this topic include TENS, knee arthrosis, encoding of deep neural networks, heart sound classification, adaptive image-feature learning for disease classification, and the utterance-level Permutation Invariant Training (uPIT) technique, which is a technique for minimizing mean squared error.

- Topic2:

  In general, this group is about strategies and analysis in marketing and trading: words such as market, price, optimal, strategy, process, analysis, information theory, agent and prediction appear among the 25 most frequent words. The most important sentences support this description, with subjects such as seasonal and trend forecasting of tourist arrivals, the most common pricing rules, cryptocurrency price prediction, and trading.