In this article we implement topic models with gensim. Topic models are
probabilistic models that capture information about the topics in a
text. A topic is like a theme: an underlying idea represented in the text.
For example, in a corpus of **Spanish newspaper articles**,
possible topics would be politics, conflicts, elections, and so on.
Here the corpus is a sample of Udemy course reviews.

#### Load the libraries

In [1]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
from nltk.corpus import stopwords
import string
import matplotlib.pyplot as plt
import gensim
from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel
from gensim.corpora import Dictionary
# from gensim.models.wrappers import LdaMallet


In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)

### Load the data

In [3]:
import pandas as pd
df_news = pd.read_pickle("udemy_reviews.pkl")
# Tag each review as positive (rating above 4) or negative
df_news['tag']=df_news['rating'].apply(lambda x: 'pos' if x > 4 else 'neg')
# Drop one-word comments (e.g. "Excelente!")
df_news = df_news[df_news['comment'].str.contains(r"\s")]
# Drop comments shorter than 12 characters (e.g. "Excelente!")
df_news = df_news[df_news['comment'].str.len() >= 12]
df_news = df_news.sample(2000)
df_news.sample(10)

Unnamed: 0,id,course,rating,comment,user,tag
51893,61574892,922644,5.0,"Para mi fue una fuente continua de información útil para aplicar, todos los mensajes son claros y precisos, encontré mucho para mejorar, entendí porque pierdo el interés en la mayoría de las presentaciones empresariales y que no debo ser el único. Totalmente para recomendar!!!\n\nPD: Algunos temas quizás sea bueno dividirlos, especialmente aquellos que durán más de 5 a 7 minutos.",Mariano Buñirigo,pos
70521,2606180,166170,4.5,Da uma visão clara sobre o negocíos.,Roseana Santos de Freitas,pos
63186,35583622,749262,4.0,Me ayudo a entender mucho algunas cosas que no sabia,Lorenzo Santiago Saul Arias Villegas,neg
77793,80370794,3314342,5.0,"Es un excelente curso, enseñas muy bien, gracias.",Kevin,pos
67355,80597074,1254714,4.0,fue buena pero demaciada explicacion un poko confusa pero necesaria,Alexis Fernando Espinoza Arce,neg
25409,65666142,3332670,5.0,"me parecio muy bueno,pero me hubiera gustado que explicara los tonos para cada tipo d piel y cuando se tiene q hacer de nuevo o como saber si es al ano o algo asi",Elizabeth Aranda,pos
20194,60168386,2096652,5.0,muy buen curso a persa que se aborda no profundiza me paresio muy buen,Paulina Guadalupe López Espinoza,pos
6207,72222592,1133110,5.0,"Perfecto, el profesor es muy atento y ayuda con lo que puede. se aprende bastante, mucho mejor que otros cursos del mismo nivel.",Jose Manuel Lopez Gomez,pos
6887,57956543,1363082,5.0,"MUY BUENO, EXCELENTE!!!!!!",Robeff Angel Darío,pos
43301,68937214,2426324,5.0,fue muy buena eleccion avian cosas que me costaban y las entendi,Soledad Garro,pos


In [4]:
df_news.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 7767 to 67010
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   id       2000 non-null   int64  
 1   course   2000 non-null   int64  
 2   rating   2000 non-null   float64
 3   comment  2000 non-null   object 
 4   user     2000 non-null   object 
 5   tag      2000 non-null   object 
dtypes: float64(1), int64(2), object(3)
memory usage: 109.4+ KB


# Data Preprocessing

We define a list of custom words to exclude from our dataset.

We then create the cleaner function to clean the Spanish text: it removes non-alphanumeric characters, repeated letters, Spanish acute accents, and digits.
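
As a rough illustration of the cleaning steps listed above, here is a minimal sketch that uses only the standard library. The helper name `basic_clean` and its `min_len` parameter are hypothetical and not part of this notebook. Unlike the notebook's own cleaner, it strips accents through Unicode normalization, which also maps ñ to n instead of dropping it.

```python
import re
import unicodedata

def basic_clean(text, min_len=4):
    """Hypothetical helper: lowercase, strip accents, drop non-letters, collapse repeats."""
    text = text.lower()
    # Decompose accented characters and drop the combining marks (á -> a, ñ -> n).
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace every run of non-letters (digits, punctuation, spaces) with one space.
    text = re.sub(r"[^a-z]+", " ", text)
    # Collapse runs of three or more identical letters ("excelenteee" -> "excelente").
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Keep only tokens with at least `min_len` characters.
    return [tok for tok in text.split() if len(tok) >= min_len]

sample = "¡Excelenteee curso 100%! Enseña muy bien, explicación clara."
print(basic_clean(sample))
```

The sketch skips stopword removal and stemming on purpose, so each step can be checked in isolation before it is combined with NLTK or spaCy.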

In [5]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Faolin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Faolin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
# Quick check of the substring filter used later in cleaner()
st = ['aaaaa','asd','asdadsad']
y = ['aa','bbb','ccc']
[x for x in st if any(sub in x for sub in y)]


['aaaaa']

In [7]:
from nltk import word_tokenize
from nltk.corpus import stopwords
stop = set(stopwords.words('spanish'))

black_list = ['excelen', 'buen',
              'muchas', 'graci'
              ]

additional_stopwords=set(black_list)

stopwords_sp = stop.union(additional_stopwords)

from nltk.stem import SnowballStemmer
spanish_stemmer = SnowballStemmer('spanish')
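def stemmization(texts):
    # Replace runs of punctuation (and any trailing spaces) with a single space
    texts = re.sub(r"""
                   [,.;@#?!&$]+  # one or more punctuation characters
                   \ *           # followed by zero or more spaces
                   """,
                   " ",          # replaced with a single space
                   texts, flags=re.VERBOSE)
    # Note: stem() is called on the whole string, so it lowercases and strips accents,
    # but in practice only the trailing word gets stemmed (e.g. 'creo' -> 'cre' when it comes last).
    return spanish_stemmer.stem(texts).split()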


import spacy
nlp = spacy.load('es_core_news_md')
def lemmatization(texts, allowed_postags=('NOUN',)):
    #x = nlp(texts)
    #print([(xx.text,xx.pos_) for xx in x])
    texts_out = [ token.text for token in nlp(texts) if token.pos_ in 
                 allowed_postags and token.text not in black_list and len(token.text)>2]
    return texts_out

In [8]:
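%%time
# Caveat: Phrases expects an iterable of token lists. Raw strings are iterated
# character by character, so word-level bigrams are not learned here; tokenize
# the comments first if real bigrams are wanted.
bigram = gensim.models.Phrases(df_news['comment'].to_list()) 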

Wall time: 329 ms


In [9]:
def cleaner(word):
    # Remove URLs and text emoticons such as :) or ;-D
    word = re.sub(r'((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*', '', word, flags=re.MULTILINE)
    word = re.sub(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', "", word)
    word = re.sub(r'ee\.uu', 'eeuu', word)
    word = re.sub(r'\#\.', '', word)
    # Normalize line breaks, commas, hyphens and ellipses
    word = re.sub(r'\n', '', word)
    word = re.sub(r',', ' ', word)
    word = re.sub(r'\-', ' ', word)
    word = re.sub(r'\.{3}', ' ', word)
    # Collapse repeated letters
    word = re.sub(r'a{2,}', 'a', word)
    word = re.sub(r'é{2,}', 'é', word)
    word = re.sub(r'i{2,}', 'i', word)
    word = re.sub(r'ja{2,}', 'ja', word) 
    # Strip acute accents
    word = re.sub(r'á', 'a', word)
    word = re.sub(r'é', 'e', word)
    word = re.sub(r'í', 'i', word)
    word = re.sub(r'ó', 'o', word)
    word = re.sub(r'ú', 'u', word)  
    # Keep ASCII letters only. Caveat: this also turns 'ñ' into a space, which
    # likely explains truncated tokens such as 'ense' or 'acompa' in later outputs.
    word = re.sub('[^a-zA-Z]', ' ', word)
    # Drop stopwords and tokens of three characters or fewer
    wordlist = [ token for token in nltk.word_tokenize(word) if token.lower() not in stopwords_sp and len(token)>3 ]
    # Drop tokens containing a black-listed substring (the match is case-sensitive)
    wordlist = [x for x in wordlist if not any(string for string in black_list if string in x)]
    word = " ".join(wordlist)
    list_word_clean = []
    # split() with no argument splits on whitespace; split(r"\s") would look for a literal backslash-s
    for w1 in word.split():
        if  w1.lower() not in stopwords_sp:
            list_word_clean.append(w1.lower())

    bigram_list = bigram[list_word_clean]
    out_text = stemmization(" ".join(bigram_list))
    return out_text

In [10]:
cleaner('hola soy un gérmen y tendría pulgas. Pero no,creo que no. O si. excelente! excelente')

['hola', 'germen', 'tendria', 'pulgas', 'cre']

The `lemmatization` function defined above keeps **only nouns**, dropping adverbs, adjectives, verbs, etc. This is done with spaCy.

gensim expects a list of tokenized texts, so we need to convert the DataFrame column to a list.

In [11]:
stemmization('hola soy un gérmen y tendría pulgas. Pero no,creo que no. O si')

['hola',
 'soy',
 'un',
 'germen',
 'y',
 'tendria',
 'pulgas',
 'pero',
 'no',
 'creo',
 'que',
 'no',
 'o',
 'si']

In [12]:
lemmatization('hola soy un gérmen y tendría pulgas. Pero no,creo que no. O si')

['gérmen', 'pulgas']

In [13]:
len(df_news)

2000

In [14]:
# !python -m spacy download es_core_news_md

In [15]:
df_news['comment'].sample(3)

13092    Gracias por estos sencillos consejos de Flexbox, la verdad me sirvieron de mucho. Saludos!!!
63917            Excelente explicación y dominio del tema, ejemplos prácticos y de fácil aprendizaje.
15768                                    excelente curso, felicitaciones al profesor.\nMuchas gracias
Name: comment, dtype: object

In [16]:
cleaner(df_news['comment'].iloc[3])

['excelente', 'aclare', 'dudas', 'respecto', 'tem']

The cleaner function works as expected.

##### Let's clean all the text

In [17]:
from tqdm.notebook import tqdm
tqdm.pandas()

df_news['comment_cleaned'] = df_news['comment'].progress_apply(cleaner)

  0%|          | 0/2000 [00:00<?, ?it/s]

Now we need to build the *corpus* and the *dictionary* that gensim needs. To do that, we pass it a list of token lists.
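
To make the two objects concrete: the dictionary maps each distinct token to an integer id, and the corpus stores each document as a list of `(token_id, count)` pairs. The pure-Python sketch below mimics that idea on toy data. It only illustrates the data structures and is not gensim's implementation; `build_vocab` and `to_bow` are hypothetical names, and gensim may assign ids in a different order.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens missing from the vocabulary."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

toy_docs = [["curso", "claro", "curso"], ["profesor", "claro"]]
vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, vocab) for doc in toy_docs]

print(vocab)       # {'curso': 0, 'claro': 1, 'profesor': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Tokens that are filtered out of the vocabulary simply disappear from the bag-of-words representation, which is the effect that pruning rare or very frequent words has on the corpus.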

In [18]:
df_news['comment_cleaned'].iloc[200:210]

45448                                                                                                      [profesor, entiende, bien, palabras, acelera, traduce, spanish, english]
29693                                                              [bien, metodologia, manera, enfocar, contenidos, aprendizaje, permite, imprimir, certificado, visualiza, opcion]
24716                                                                                                                                         [profesor, experiencia, sabe, explic]
24324                                                                                                                                                   [curso, dinamico, entreten]
9710                                                                                                                                                           [explica, paso, pas]
12482                          [informacion, presentada, manera, concisa, concreta, acompa, graficas

In [19]:
dictionary = Dictionary(df_news['comment_cleaned'].to_list())
dictionary.compactify()
# Filter extremes
#dictionary.filter_extremes(no_below=5, no_above=0.3)#, keep_n=10000)
#dictionary.filter_extremes(no_below=2, no_above=0.97, keep_n=None)
dictionary.filter_extremes(no_below=5, no_above=0.2, keep_n=None)
dictionary.compactify()

corpus = [dictionary.doc2bow(text) for text in df_news['comment_cleaned'].to_list()]

# Now let's do the modeling part

We compare three topic modeling algorithms: Latent Dirichlet Allocation (LDA), Latent
Semantic Analysis (LSA, named LSI in gensim), and Hierarchical Dirichlet Process
(HDP). To evaluate the topic models we use **topic coherence**, a measure of how
interpretable the topics are for human beings.
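
To give an intuition for what a coherence score measures, here is a simplified document co-occurrence score in the spirit of the UMass measure. This is an assumption-laden sketch: it is not the `c_v` measure used in this notebook (which is more involved) and it is not gensim's implementation. The function name `umass_like_coherence` and the toy documents are made up, and the function assumes every topic word appears in at least one document.

```python
import math
from itertools import combinations

def umass_like_coherence(topic_words, docs):
    """Sum log((D(w_i, w_j) + 1) / D(w_i)) over word pairs of one topic.

    D(w) counts documents containing w; D(w_i, w_j) counts documents containing both.
    Assumes each topic word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    # combinations() yields (earlier, later) pairs in the topic's word order.
    for earlier, later in combinations(topic_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score

toy_docs = [
    ["curso", "claro", "profesor"],
    ["curso", "claro"],
    ["video", "audio"],
    ["video", "audio", "curso"],
]
coherent = umass_like_coherence(["curso", "claro"], toy_docs)
mixed = umass_like_coherence(["curso", "audio"], toy_docs)
print(coherent, mixed)
```

Words that tend to occur in the same documents score closer to zero, while words that rarely co-occur score more negatively, so in this toy example the first topic is rated as more coherent than the second.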

In [22]:
%%capture
!pip install pyLDAvis==2.1.2

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot  as plt
from collections import Counter
import numpy as np
from nltk import word_tokenize, sent_tokenize
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords

import gensim
from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel
#from gensim.models.wrappers import LdaMallet
from gensim.corpora import Dictionary
from gensim import corpora

import pyLDAvis.gensim
pyLDAvis.enable_notebook()

import os, re, operator, warnings
warnings.filterwarnings('ignore')  
%matplotlib inline

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Faolin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Faolin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [24]:
lsamodel = LsiModel(corpus=corpus, num_topics=25, id2word=dictionary)

In [26]:
lsamodel.print_topics(10,4)

[(0,
  '-0.782*"bien" + -0.236*"explicado" + -0.141*"excelente" + -0.139*"explica"'),
 (1, '0.587*"excelente" + -0.432*"bien" + 0.291*"curs" + -0.185*"explicado"'),
 (2,
  '-0.657*"excelente" + -0.246*"curs" + 0.244*"instructor" + -0.202*"bien"'),
 (3,
  '-0.544*"instructor" + 0.249*"bastante" + 0.231*"profesor" + 0.176*"entender"'),
 (4,
  '-0.564*"explica" + -0.489*"curs" + 0.416*"explicacion" + -0.211*"profesor"'),
 (5, '-0.586*"curs" + 0.438*"explica" + 0.371*"claro" + -0.205*"explicado"'),
 (6,
  '-0.766*"explicacion" + -0.326*"curs" + 0.238*"excelente" + -0.186*"clara"'),
 (7, '0.754*"claro" + 0.302*"curs" + -0.277*"explica" + -0.129*"excelente"'),
 (8, '0.393*"entender" + 0.358*"cosas" + -0.290*"profesor" + 0.248*"facil"'),
 (9, '0.461*"facil" + 0.401*"explicado" + -0.258*"cosas" + -0.221*"explic"')]

In [27]:
ldamodel = LdaModel(corpus=corpus, num_topics=25, id2word=dictionary, iterations = 2000, passes=10)

In [28]:
ldamodel.print_topics(10, 6)

[(17,
  '0.030*"parecio" + 0.027*"conocimientos" + 0.023*"excelente" + 0.022*"ense" + 0.022*"primer" + 0.021*"deja"'),
 (3,
  '0.047*"cosas" + 0.046*"forma" + 0.038*"aprendido" + 0.027*"explicaciones" + 0.023*"bien" + 0.018*"clara"'),
 (22,
  '0.099*"interesante" + 0.028*"gustado" + 0.027*"cursos" + 0.026*"ense" + 0.020*"principi" + 0.018*"graci"'),
 (7,
  '0.100*"parece" + 0.047*"mucha" + 0.045*"buena" + 0.041*"recom" + 0.037*"conocimiento" + 0.025*"explicacion"'),
 (9,
  '0.051*"completo" + 0.041*"sido" + 0.040*"ahora" + 0.040*"claro" + 0.038*"profesor" + 0.037*"sencillo"'),
 (23,
  '0.089*"entender" + 0.028*"rapido" + 0.027*"facil" + 0.027*"falta" + 0.025*"explicar" + 0.023*"simple"'),
 (14,
  '0.050*"bien" + 0.019*"aprendiendo" + 0.019*"programacion" + 0.018*"software" + 0.017*"nuevas" + 0.016*"iniciacion"'),
 (19,
  '0.218*"explicacion" + 0.076*"clara" + 0.061*"momento" + 0.024*"basicos" + 0.022*"manera" + 0.020*"entiende"'),
 (21,
  '0.023*"programacion" + 0.022*"instructor" + 0.

## Hierarchical Dirichlet Process Model

In [20]:
hdpmodel = HdpModel(corpus=corpus, id2word=dictionary, random_state= 30)

And the topics of this model:

In [21]:
def display_topics(model, model_type="lda"):
    for topic_idx, topic in enumerate(model.print_topics()):
        print ("Topic %d:" % (topic_idx))
        if model_type== "hdp":
            print (" ".join(re.findall( r'\*(.[^\*-S]+).?', topic[1])), "\n")
        else:
            print (" ".join(re.findall( r'\"(.[^"]+).?', topic[1])), "\n")


In [22]:
# hdpmodel.show_topics() 

display_topics(hdpmodel, model_type="hdp")

Topic 0:
toda  bueno  visto  habia  siendo  lenguaje  clar  entienda  bast  photoshop 

Topic 1:
bien  demas  resultado  iniciarse  deberia  todas  utilidad  temas  basicas  equipo 

Topic 2:
dado  vida  nivel  conten  practicos  aprende  hablando  encuentro  llevo  real 

Topic 3:
ejercicios  expectativas  bases  profesor  necesario  comprar  actividades  algun  proyectos  cantidad 

Topic 4:
empieza  viene  explicar  practicas  recien  pronunciacion  explicacion  final  bi  calificacion 

Topic 5:
ayudado  temas  podria  gratuito  cosas  explicado  actividades  explica  cosa  responde 

Topic 6:
repasar  herramientas  dominio  hice  conocimiento  calidad  mundo  pienso  cero  persona 

Topic 7:
parecido  mala  aplicacion  principiantes  ando  explicando  detallado  recomendable  dado  trabajo 

Topic 8:
primer  varios  ayudan  termine  tipo  encima  practicar  deja  video  cortos 

Topic 9:
dedicacion  conciso  contenido  puesto  aspectos  entender  introduccion  objetivo  concepto  

As we can see, there are 20 topics; however, they are difficult to interpret or follow, so we move on to another model.





In [23]:
def evaluate_graph(dictionary, corpus, texts, limit, model):
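    """
    Fit one model per number of topics and plot the c_v coherence of each.
    
    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : list of tokenized documents, used to compute the coherence
    limit : topic limit (exclusive); models are fitted for 1 .. limit - 1 topics
    model : 'lsi' to fit LsiModel, anything else fits LdaModel
    
    Returns:
    -------
    lm_list : List of fitted topic models
    c_v : Coherence values corresponding to each model in lm_list
    """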
    c_v = []
    lm_list = []
    for num_topics in range(1, limit):
        if model == 'lsi':
            lm = LsiModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)
        else:
            lm = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)
        lm_list.append(lm)
        cm = CoherenceModel(model=lm, texts=texts, dictionary=dictionary, coherence='c_v')
        c_v.append(cm.get_coherence())
        
    # Show graph
    x = range(1, limit)
    plt.plot(x, c_v)
    plt.xlabel("num_topics")
    plt.ylabel("Coherence score")
    plt.legend(["c_v"], loc='best')  # a list, so the label is not split into characters
    plt.show()
    
    return lm_list, c_v

## LSI Model

In [24]:
lsimodel = LsiModel(corpus=corpus, num_topics=10, id2word=dictionary)

In [26]:
display_topics(lsimodel)  # Showing the topics

Topic 0:
bien explicado explica explic excelente profesor facil instructor curs temas 

Topic 1:
bien excelente curs explic explicacion instructor temas explicado explica manera 

Topic 2:
excelente curs explica temas profesor solo bien hace explic videos 

Topic 3:
explicacion curs facil explica clara entender profesor explicado manera tema 

Topic 4:
temas explica curs profesor solo visto tema informacion explicado explicacion 

Topic 5:
facil explica manera informacion curs explicado entender explic explicacion herramientas 

Topic 6:
curs excelente explica temas informacion solo explicacion instructor parte explicado 

Topic 7:
explicacion informacion profesor eleccion facil gran curs explica entender conceptos 

Topic 8:
claro eleccion momento curs explicado parte temas excelente instructor paso 

Topic 9:
informacion explica claro bastante eleccion explicado cosas curs ense clara 



With 10 topics, the same keywords (bien, excelente, explica, curs, profesor, explicacion) recur across most topics, so it is still difficult to get much insight. Because of this, we try to select the best number of topics by iterating over a range of values and looking at the coherence.

In [None]:
%%time
lmlist_lsi, c_v = evaluate_graph(dictionary=dictionary, corpus=corpus, texts=df_news['comment_cleaned'].to_list(), limit=21, model= "lsi")

According to the coherence, the best number of topics is between 3 and 7; however, you should select the number of topics using both the coherence and visual inspection.
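
Because `evaluate_graph` starts at one topic, the model at list index `i` has `i + 1` topics. The sketch below shows that mapping when picking the highest coherence value. The helper `best_num_topics` is a hypothetical name and the scores are made up for illustration; they are not results from this notebook.

```python
def best_num_topics(coherence_values, start=1):
    """Return (num_topics, score) for the highest coherence value.

    `start` is the number of topics used for the first entry of the list.
    """
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return start + best_index, coherence_values[best_index]

# Made-up scores for 1..6 topics, for illustration only.
example_scores = [0.31, 0.35, 0.42, 0.40, 0.38, 0.33]
num_topics, score = best_num_topics(example_scores)

print(num_topics, score)               # 3 0.42
print("list index:", num_topics - 1)   # index 2 holds the 3-topic model
```

The highest score is only a starting point: nearby values with similar coherence are worth inspecting by keyword as well.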


In [None]:
display_topics(lmlist_lsi[2])

Now, let's try another model.

## Latent Dirichlet Allocation Model

In [None]:
ldamodel = LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)

In [None]:
display_topics(ldamodel)

Find out the optimal number of topics for the LDA model based on the coherence metric:

In [None]:
%%time
lmlist, c_v = evaluate_graph(dictionary=dictionary, corpus=corpus, texts=df_news['comment_cleaned'].to_list(), limit=21, model= "lda")

For this model, the best number of topics seems to be 9 or 18; again, we must check the keywords too.

### Comparing the Model Coherence of the Best Models

We built three models; now let's compare their coherence.

In [None]:
ldamodel = lmlist[11]
lsimodel = lmlist_lsi[2]

lsitopics = [[word for word, prob in topic] for topicid, topic in lsimodel.show_topics(formatted=False)]

hdptopics = [[word for word, prob in topic] for topicid, topic in hdpmodel.show_topics(formatted=False)]

ldatopics = [[word for word, prob in topic] for topicid, topic in ldamodel.show_topics(formatted=False)]

In [None]:
lsi_coherence = CoherenceModel(topics=lsitopics[:10], texts=df_news['comment_cleaned'].to_list(), dictionary=dictionary, window_size=10).get_coherence()

hdp_coherence = CoherenceModel(topics=hdptopics[:10], texts=df_news['comment_cleaned'].to_list(), dictionary=dictionary, window_size=10).get_coherence()

lda_coherence = CoherenceModel(topics=ldatopics, texts=df_news['comment_cleaned'].to_list(), dictionary=dictionary, window_size=10).get_coherence()

In [None]:
import seaborn as sns

coherences = [lsi_coherence, hdp_coherence, lda_coherence]
n = len(coherences)
x = ['lsi_coherence','hdp_coherence', 'lda_coherence']
sns.barplot(x=x, y=coherences)  # keyword arguments; recent seaborn versions reject positional x/y


We can see that the **LdaModel** model **with 12 topics** has the highest
coherence value.

Examine the keywords to identify the topics of the best model:

In [None]:

display_topics(ldamodel)

It looks like the topics are:
* Topic 0: felicitaciones
* Topic 1: expectativas
* Topic 2: experiencia
* Topic 3: contenido
* Topic 4: instructor
* Topic 5: material
* Topic 6: video
* Topic 7: lenguaje
* Topic 8: ejercicios
* Topic 9: titulo
* Topic 10: temas
* Topic 11: explicación


In [None]:
label_dicc = {0:'felicitaciones', 1:'expectativas', 2:'experiencia', 3: 'contenido', 4:'instructor', 5:'material', 6:'video', 
              7:'lenguaje', 8:'ejercicios', 9: 'titulo', 10:'temas', 11:'explicación'}

Let's check the keywords when we select a different number of topics (`lmlist[16]` holds the 17-topic model, since the list starts at one topic):

In [None]:
ldamodel_16 =lmlist[16]


In [None]:
display_topics(ldamodel_16)

# Classifying all documents

Now that we have selected the best model and number of topics, it is time to assign a topic to each document, that is, to **cluster** the documents according to their topics.
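
The core of this step is picking, for each document, the topic with the highest probability. The sketch below shows that selection on toy data in the `(topic_id, probability)` format. The helper `dominant_topic`, the probabilities, and the two-entry label mapping are hypothetical and only serve as an illustration.

```python
def dominant_topic(topic_distribution):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_distribution, key=lambda pair: pair[1])

# Toy per-document distributions; the values are made up.
toy_distributions = [
    [(0, 0.10), (3, 0.75), (7, 0.15)],
    [(2, 0.55), (5, 0.45)],
]
toy_labels = {2: "experiencia", 3: "contenido"}  # hypothetical subset of the topic labels

for dist in toy_distributions:
    topic_id, prob = dominant_topic(dist)
    print(topic_id, toy_labels.get(topic_id, "unlabeled"), prob)
```

The probability of the dominant topic doubles as a rough confidence value: a document whose best topic scores 0.55 is a much weaker assignment than one scoring 0.75.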

In [None]:
from tqdm.notebook import tqdm

def format_topics_sentences(ldamodel, corpus=corpus, texts=None):
    # Collect one row per document
    rows = []

    # Get main topic in each document
    for row in tqdm(ldamodel[corpus], total=len(corpus)):
        # Sort the (topic_id, probability) pairs so the dominant topic comes first
        row = sorted(row, key=lambda x: x[1], reverse=True)
        topic_num, prop_topic = row[0]
        # Get the Dominant topic, Perc Contribution and Keywords for the document
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])
    # Build the DataFrame once (DataFrame.append was removed in pandas 2.0)
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df


df_topic_sents_keywords = format_topics_sentences(ldamodel, corpus=corpus, texts=df_news['comment_cleaned'].to_list())



In [None]:
# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
df_dominant_topic.head(10)

We selected the LDA model with 12 topics and assigned a dominant topic to each document. Now let's map each topic to a label.

First, let's create the dictionary:

In [None]:
label_dicc = {0:'felicitaciones', 1:'expectativas', 2:'experiencia', 3: 'contenido', 4:'instructor', 5:'material', 6:'video', 
              7:'lenguaje', 8:'ejercicios', 9: 'titulo', 10:'temas', 11:'explicación'}

In [None]:
df_dominant_topic['Dominant_Topic'] = df_dominant_topic['Dominant_Topic'].astype('int64')


In [None]:
df_dominant_topic['Dominant_Topic'] = df_dominant_topic['Dominant_Topic'].map(label_dicc)
df_dominant_topic.head(10)

In [None]:
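# df_dominant_topic has a fresh 0..n-1 index while df_news keeps its original one,
# so assign by position rather than by index label
df_news['labels'] = df_dominant_topic['Dominant_Topic'].to_numpy()
df_news['label_confidence'] = df_dominant_topic['Topic_Perc_Contrib'].to_numpy()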

Let's examine some text and its topics

In [None]:
df_news[['comment', 'labels']].head(10)

In [None]:
df_news[ df_news['labels'] == 'instructor'].sort_values(by='label_confidence',ascending=False).head(10)

In [None]:
df_news.sort_values(by='label_confidence',ascending=False).head(10)

### Let's see the distribution of topics


In [None]:
ax = df_dominant_topic['Dominant_Topic'].value_counts().plot(kind='bar')
plt.show()

The topics are almost balanced, so we are good.

Finally, now that our models are set up and analyzed, we can go
ahead and visualize them.

In [None]:
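# Same pinned version as earlier; newer releases renamed pyLDAvis.gensim to pyLDAvis.gensim_models
!pip install pyLDAvis==2.1.2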

In [None]:
import pyLDAvis

pyLDAvis.enable_notebook()

In [None]:
# %%time
import pyLDAvis.gensim
pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)