<a href="https://colab.research.google.com/github/mariaeduardagimenes/NLP/blob/master/Tutorial_NLP_AEVO_TopicModeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TOPIC MODELING

## Introduction

Another popular text-analysis technique is topic modeling. Its goal is to find the topics that are present in your corpus. Each document in the corpus is made up of at least one topic, and often several.

I will walk through the steps for Latent Dirichlet Allocation (LDA), one of many topic modeling techniques. It was designed specifically for text data.

To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you want the algorithm to find.
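
To make the first input concrete, here is a minimal sketch of a document-term matrix built with only the standard library. The two toy documents and the helper name `build_dtm` are made up for illustration; the notebook itself loads a matrix that was prepared in an earlier step.

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocabulary, rows): one row of term counts per document."""
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    vocabulary = sorted(set().union(*counts.values()))
    rows = {name: [c[term] for term in vocabulary] for name, c in counts.items()}
    return vocabulary, rows

# Hypothetical toy corpus, not the notebook's data
toy_docs = {
    "doc_a": "inovação aberta inovação",
    "doc_b": "soft skills inovação",
}
vocabulary, dtm = build_dtm(toy_docs)
print(vocabulary)
for name, row in dtm.items():
    print(name, row)
```

Each row is a document and each column is a term, which is the same layout as the DataFrame loaded in the next section.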

Once the topic modeling technique has been applied, our job as humans is to interpret the results and check whether the mix of words in each topic makes sense. If it does not, you can try changing the number of topics, the terms in the document-term matrix, the model parameters, or even a different model.


## Topic Modeling - Attempt #1 (All Text)


In [None]:
# Read in our document-term matrix
import pandas as pd
import pickle

data = pd.read_pickle('dtm_stop.pkl')
data

Unnamed: 0,abacaxi,aberta,abertos,abrange,abrir,acaba,acelera,acima,acompanha,acompanhe,...,whatsapp,zona,às,ágil,álcool,área,áreas,ótima,única,útil
inovacao aberta,0,9,1,0,1,0,0,1,1,1,...,1,0,2,2,2,5,4,2,0,0
inovacao incremental,1,0,0,0,0,0,0,3,0,0,...,0,0,0,0,0,0,0,0,0,0
intraempreendedorismo,0,1,0,1,0,1,1,2,0,0,...,0,1,3,0,0,3,0,0,1,1


In [None]:
# Import the modules needed for LDA with gensim

from gensim import matutils, models
import scipy.sparse

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


In [None]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

Unnamed: 0,inovacao aberta,inovacao incremental,intraempreendedorismo
abacaxi,0,1,0
aberta,9,0,1
abertos,1,0,0
abrange,0,0,1
abrir,1,0,0


In [None]:
# Convert the term-document matrix into a gensim format: DataFrame -> sparse matrix -> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
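
As a rough picture of what this conversion produces: a gensim corpus is, conceptually, one bag-of-words per document, each a list of `(term_id, count)` pairs for the non-zero entries. The plain-Python sketch below mimics that structure for a tiny hypothetical term-document matrix (rows are terms, columns are documents). It is not gensim's implementation, and `to_bow` is an illustrative name.

```python
# Hypothetical term-document matrix: rows = terms, columns = documents
toy_tdm = [
    [0, 1, 0],  # term 0
    [9, 0, 1],  # term 1
    [1, 0, 0],  # term 2
]

def to_bow(tdm):
    """Turn each column (document) into a list of (term_id, count) pairs."""
    n_docs = len(tdm[0])
    return [
        [(term_id, row[doc]) for term_id, row in enumerate(tdm) if row[doc] != 0]
        for doc in range(n_docs)
    ]

toy_corpus = to_bow(toy_tdm)
for bow in toy_corpus:
    print(bow)
```

This is also why the matrix is transposed first: `Sparse2Corpus` treats columns as documents by default.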

In [None]:
# Gensim also requires a dictionary that maps each term's position in the term-document matrix to the term itself
with open("cv_stop.pkl", "rb") as f:
    cv = pickle.load(f)
id2word = {v: k for k, v in cv.vocabulary_.items()}
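
The inversion is easier to see on a small example. The sketch below uses a hypothetical three-term vocabulary in the same shape as `CountVectorizer.vocabulary_` (term -> column index) and flips it into the index -> term mapping that gensim expects.

```python
# Hypothetical vocabulary in the shape of CountVectorizer.vocabulary_: term -> column index
toy_vocabulary = {"abacaxi": 0, "aberta": 1, "abertos": 2}

# Invert it so the term can be looked up from its column index
toy_id2word = {index: term for term, index in toy_vocabulary.items()}
print(toy_id2word)
```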

In [None]:
# Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term),
# we need to specify two other parameters as well: the number of topics and the number of passes
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

2020-08-16 13:14:43,956 : INFO : using symmetric alpha at 0.5
2020-08-16 13:14:43,957 : INFO : using symmetric eta at 0.5
2020-08-16 13:14:43,958 : INFO : using serial LDA version on this node
2020-08-16 13:14:43,963 : INFO : running online (multi-pass) LDA training, 2 topics, 10 passes over the supplied corpus of 3 documents, updating model once every 3 documents, evaluating perplexity every 3 documents, iterating 50x with a convergence threshold of 0.001000
2020-08-16 13:14:44,004 : INFO : -7.760 per-word bound, 216.8 perplexity estimate based on a held-out corpus of 3 documents with 3720 words
2020-08-16 13:14:44,005 : INFO : PROGRESS: pass 0, at document #3/3
2020-08-16 13:14:44,014 : INFO : topic #0 (0.500): 0.031*"que" + 0.022*"para" + 0.017*"em" + 0.014*"da" + 0.011*"um" + 0.010*"com" + 0.010*"se" + 0.009*"os" + 0.008*"uma" + 0.008*"como"
2020-08-16 13:14:44,015 : INFO : topic #1 (0.500): 0.021*"que" + 0.017*"para" + 0.010*"da" + 0.009*"em" + 0.009*"como" + 0.008*"com" + 0.008*"

[(0,
  '0.029*"que" + 0.022*"para" + 0.015*"da" + 0.014*"em" + 0.011*"com" + 0.010*"como" + 0.010*"um" + 0.009*"os" + 0.008*"uma" + 0.007*"na"'),
 (1,
  '0.024*"que" + 0.017*"para" + 0.017*"se" + 0.016*"soft" + 0.014*"skills" + 0.014*"em" + 0.010*"um" + 0.009*"você" + 0.008*"da" + 0.008*"uma"')]
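
Each topic is printed as a single string of `weight*"term"` pairs, which is awkward to compare across runs. A small sketch, assuming only the string format shown above, splits such a string into `(term, weight)` pairs; `parse_topic` is a hypothetical helper, and the example string is the start of topic 1 from the output above.

```python
import re

def parse_topic(topic_string):
    """Split a '0.029*"que" + 0.022*"para"' style string into (term, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(term, float(weight)) for weight, term in pairs]

example = '0.024*"que" + 0.017*"para" + 0.017*"se" + 0.016*"soft" + 0.014*"skills"'
for term, weight in parse_topic(example):
    print(f"{term:8s} {weight:.3f}")
```

If you have the model object at hand, gensim's `show_topic` method returns such pairs directly, so parsing is mainly useful when only the printed output is available.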

In [None]:
# LDA with num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

2020-08-16 13:14:47,048 : INFO : using symmetric alpha at 0.3333333333333333
2020-08-16 13:14:47,049 : INFO : using symmetric eta at 0.3333333333333333
2020-08-16 13:14:47,050 : INFO : using serial LDA version on this node
2020-08-16 13:14:47,052 : INFO : running online (multi-pass) LDA training, 3 topics, 10 passes over the supplied corpus of 3 documents, updating model once every 3 documents, evaluating perplexity every 3 documents, iterating 50x with a convergence threshold of 0.001000
2020-08-16 13:14:47,072 : INFO : -8.070 per-word bound, 268.7 perplexity estimate based on a held-out corpus of 3 documents with 3720 words
2020-08-16 13:14:47,073 : INFO : PROGRESS: pass 0, at document #3/3
2020-08-16 13:14:47,080 : INFO : topic #0 (0.333): 0.030*"que" + 0.024*"para" + 0.018*"em" + 0.011*"se" + 0.011*"um" + 0.011*"da" + 0.009*"os" + 0.009*"uma" + 0.008*"com" + 0.008*"como"
2020-08-16 13:14:47,081 : INFO : topic #1 (0.333): 0.029*"que" + 0.022*"para" + 0.015*"da" + 0.013*"com" + 0.012

2020-08-16 13:14:47,341 : INFO : topic #0 (0.333): 0.027*"que" + 0.020*"para" + 0.019*"se" + 0.018*"soft" + 0.016*"skills" + 0.016*"em" + 0.011*"um" + 0.010*"você" + 0.009*"da" + 0.009*"uma"
2020-08-16 13:14:47,342 : INFO : topic #1 (0.333): 0.031*"que" + 0.023*"para" + 0.016*"da" + 0.015*"em" + 0.012*"com" + 0.011*"como" + 0.010*"um" + 0.010*"os" + 0.009*"uma" + 0.007*"na"
2020-08-16 13:14:47,343 : INFO : topic #2 (0.333): 0.001*"que" + 0.001*"da" + 0.001*"para" + 0.001*"em" + 0.001*"se" + 0.001*"os" + 0.001*"um" + 0.001*"com" + 0.001*"como" + 0.001*"soft"
2020-08-16 13:14:47,343 : INFO : topic diff=0.011563, rho=0.301511
2020-08-16 13:14:47,344 : INFO : topic #0 (0.333): 0.027*"que" + 0.020*"para" + 0.019*"se" + 0.018*"soft" + 0.016*"skills" + 0.016*"em" + 0.011*"um" + 0.010*"você" + 0.009*"da" + 0.009*"uma"
2020-08-16 13:14:47,345 : INFO : topic #1 (0.333): 0.031*"que" + 0.023*"para" + 0.016*"da" + 0.015*"em" + 0.012*"com" + 0.011*"como" + 0.010*"um" + 0.010*"os" + 0.009*"uma" + 0.0

[(0,
  '0.027*"que" + 0.020*"para" + 0.019*"se" + 0.018*"soft" + 0.016*"skills" + 0.016*"em" + 0.011*"um" + 0.010*"você" + 0.009*"da" + 0.009*"uma"'),
 (1,
  '0.031*"que" + 0.023*"para" + 0.016*"da" + 0.015*"em" + 0.012*"com" + 0.011*"como" + 0.010*"um" + 0.010*"os" + 0.009*"uma" + 0.007*"na"'),
 (2,
  '0.001*"que" + 0.001*"da" + 0.001*"para" + 0.001*"em" + 0.001*"se" + 0.001*"os" + 0.001*"um" + 0.001*"com" + 0.001*"como" + 0.001*"soft"')]

In [None]:
# LDA with num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

2020-08-16 13:14:49,614 : INFO : using symmetric alpha at 0.25
2020-08-16 13:14:49,615 : INFO : using symmetric eta at 0.25
2020-08-16 13:14:49,616 : INFO : using serial LDA version on this node
2020-08-16 13:14:49,617 : INFO : running online (multi-pass) LDA training, 4 topics, 10 passes over the supplied corpus of 3 documents, updating model once every 3 documents, evaluating perplexity every 3 documents, iterating 50x with a convergence threshold of 0.001000
2020-08-16 13:14:49,640 : INFO : -8.443 per-word bound, 348.1 perplexity estimate based on a held-out corpus of 3 documents with 3720 words
2020-08-16 13:14:49,641 : INFO : PROGRESS: pass 0, at document #3/3
2020-08-16 13:14:49,649 : INFO : topic #0 (0.250): 0.033*"que" + 0.026*"para" + 0.015*"da" + 0.014*"em" + 0.013*"com" + 0.011*"se" + 0.010*"uma" + 0.010*"um" + 0.009*"os" + 0.009*"como"
2020-08-16 13:14:49,650 : INFO : topic #1 (0.250): 0.026*"que" + 0.017*"em" + 0.014*"da" + 0.013*"para" + 0.012*"um" + 0.008*"com" + 0.008*"

2020-08-16 13:14:49,851 : INFO : topic #2 (0.250): 0.002*"para" + 0.002*"que" + 0.001*"um" + 0.001*"em" + 0.001*"se" + 0.001*"da" + 0.001*"os" + 0.001*"com" + 0.001*"como" + 0.001*"uma"
2020-08-16 13:14:49,852 : INFO : topic #3 (0.250): 0.002*"que" + 0.002*"para" + 0.001*"em" + 0.001*"da" + 0.001*"como" + 0.001*"se" + 0.001*"uma" + 0.001*"soft" + 0.001*"um" + 0.001*"skills"
2020-08-16 13:14:49,853 : INFO : topic diff=0.031234, rho=0.333333
2020-08-16 13:14:49,870 : INFO : -6.803 per-word bound, 111.7 perplexity estimate based on a held-out corpus of 3 documents with 3720 words
2020-08-16 13:14:49,871 : INFO : PROGRESS: pass 8, at document #3/3
2020-08-16 13:14:49,874 : INFO : topic #0 (0.250): 0.033*"que" + 0.028*"para" + 0.017*"em" + 0.015*"da" + 0.012*"com" + 0.012*"se" + 0.010*"um" + 0.010*"como" + 0.009*"uma" + 0.009*"os"
2020-08-16 13:14:49,875 : INFO : topic #1 (0.250): 0.026*"que" + 0.015*"um" + 0.014*"em" + 0.014*"seu" + 0.011*"da" + 0.011*"colaboradores" + 0.010*"incremental" 

[(0,
  '0.033*"que" + 0.028*"para" + 0.017*"em" + 0.015*"da" + 0.012*"com" + 0.012*"se" + 0.010*"um" + 0.010*"como" + 0.009*"uma" + 0.009*"os"'),
 (1,
  '0.026*"que" + 0.015*"um" + 0.014*"em" + 0.014*"seu" + 0.011*"da" + 0.011*"colaboradores" + 0.011*"incremental" + 0.010*"uma" + 0.009*"para" + 0.009*"os"'),
 (2,
  '0.001*"para" + 0.001*"que" + 0.001*"um" + 0.001*"em" + 0.001*"se" + 0.001*"da" + 0.001*"os" + 0.001*"com" + 0.001*"como" + 0.001*"uma"'),
 (3,
  '0.001*"que" + 0.001*"para" + 0.001*"em" + 0.001*"da" + 0.001*"como" + 0.001*"se" + 0.001*"uma" + 0.001*"soft" + 0.001*"um" + 0.001*"skills"')]
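
The training logs above report a per-word bound together with a perplexity estimate. In these log lines the two are related by perplexity = 2^(-bound), which the short check below reproduces using the first logged value of each run. Note that the "held-out" chunk here is the same 3 documents used for training, so these numbers say little about how well the model generalizes.

```python
# (per-word bound, logged perplexity) pairs copied from the logs above
logged = [(-7.760, 216.8), (-8.070, 268.7), (-8.443, 348.1)]

def perplexity(per_word_bound):
    """Perplexity as 2 raised to the negative per-word bound."""
    return 2 ** (-per_word_bound)

for bound, reported in logged:
    print(f"bound={bound:.3f} computed={perplexity(bound):.1f} logged={reported}")
```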

These topics don't look very good: they are dominated by function words such as "que", "para" and "em". We have tried changing the parameters; now let's try changing the list of terms as well.

## Topic Modeling - Attempt #2 (Nouns Only)

A popular trick is to look only at terms from a single part of speech (only nouns, only adjectives, etc.). See the UPenn tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. Note that NLTK's default `pos_tag` model is trained on English, so its tags for this Portuguese text are only approximate.

In [None]:
# Create a function to extract the nouns from a string of text
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize it and keep only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)]
    return ' '.join(all_nouns)
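
The prefix test works because the Penn Treebank noun tags (`NN`, `NNS`, `NNP`, `NNPS`) all start with `NN`, and the adjective tags (`JJ`, `JJR`, `JJS`) all start with `JJ`. The sketch below applies the same idea to hand-written `(word, tag)` pairs, so it runs without NLTK; the tagged tokens and the helper `keep_by_prefix` are illustrative assumptions, not output from `pos_tag`.

```python
# Hypothetical (word, tag) pairs in Penn Treebank style
tagged = [("companies", "NNS"), ("share", "VBP"), ("new", "JJ"), ("ideas", "NNS"),
          ("with", "IN"), ("Google", "NNP")]

def keep_by_prefix(tagged_tokens, prefixes):
    """Keep words whose tag starts with one of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(tuple(prefixes))]

nouns_only = keep_by_prefix(tagged, ["NN"])
nouns_and_adjectives = keep_by_prefix(tagged, ["NN", "JJ"])
print(nouns_only)
print(nouns_and_adjectives)
```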

In [None]:
# Read in the cleaned data, from before the CountVectorizer step
data_clean = pd.read_pickle('data_clean.pkl')
data_clean

Unnamed: 0,textos
inovacao aberta,se você já acompanha nossos conteúdos há algum...
inovacao incremental,o termo inovação incremental ganhou força em ...
intraempreendedorismo,as soft skills são habilidades subjetivas de d...


In [None]:
# Apply the nouns function to the texts to keep only the nouns
data_nouns = pd.DataFrame(data_clean.textos.apply(nouns))
data_nouns

Unnamed: 0,textos
inovacao aberta,se você já acompanha conteúdos há algum tempo ...
inovacao incremental,o termo ganhou força marcando presença livro b...
intraempreendedorismo,skills habilidades das tradução livre o termo ...


In [None]:
# Create a new document-term matrix using only nouns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Add the stop words again, since we are recreating the document-term matrix
add_stop_words = ['de', 'a', 'o', 'que', 'e', 'do', 'da', 'em', 'um', 'para', 'é', 'com', 'não', 'uma', 'os', 'no', 'se', 'na', 'por', 'mais', 'as', 'dos', 'como', 'mas', 'foi', 'ao', 'ele', 'das', 'tem', 'à', 'seu', 'sua', 'ou', 'ser', 'quando', 'muito', 'há', 'nos', 'já', 'está', 'eu', 'também', 'só', 'pelo', 'pela', 'até', 'isso', 'ela', 'entre', 'era', 'depois', 'sem', 'mesmo', 'aos', 'ter', 'seus', 'quem', 'nas', 'me', 'esse', 'eles']
# Converted to a list because recent scikit-learn versions do not accept a frozenset here
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# Recreate the document-term matrix with only nouns
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns.textos)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index
data_dtmn

Unnamed: 0,abacaxi,aberta,abertos,abrange,abrir,acelera,acima,acompanha,acompanhe,acontece,...,vídeos,webinars,zona,às,álcool,área,áreas,ótima,única,útil
inovacao aberta,0,3,1,0,1,0,0,1,1,2,...,1,1,0,2,1,4,2,2,0,0
inovacao incremental,1,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
intraempreendedorismo,0,1,0,1,0,1,0,0,0,0,...,0,0,1,1,0,3,0,0,1,1


In [None]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [None]:
# Start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
ldan.print_topics()


2020-08-16 13:14:57,185 : INFO : using symmetric alpha at 0.5
2020-08-16 13:14:57,186 : INFO : using symmetric eta at 0.5
2020-08-16 13:14:57,187 : INFO : using serial LDA version on this node
2020-08-16 13:14:57,188 : INFO : running online (multi-pass) LDA training, 2 topics, 10 passes over the supplied corpus of 3 documents, updating model once every 3 documents, evaluating perplexity every 3 documents, iterating 50x with a convergence threshold of 0.001000
2020-08-16 13:14:57,202 : INFO : -7.499 per-word bound, 180.9 perplexity estimate based on a held-out corpus of 3 documents with 1659 words
2020-08-16 13:14:57,203 : INFO : PROGRESS: pass 0, at document #3/3
2020-08-16 13:14:57,210 : INFO : topic #0 (0.500): 0.009*"skills" + 0.009*"colaboradores" + 0.007*"você" + 0.006*"mirella" + 0.006*"pode" + 0.005*"produto" + 0.005*"inovação" + 0.005*"intraempreendedor" + 0.005*"seja" + 0.004*"valor"
2020-08-16 13:14:57,211 : INFO : topic #1 (0.500): 0.011*"skills" + 0.008*"você" + 0.007*"cola

[(0,
  '0.015*"colaboradores" + 0.010*"produto" + 0.009*"business" + 0.008*"seja" + 0.008*"core" + 0.007*"você" + 0.007*"mercado" + 0.007*"entender" + 0.005*"pode" + 0.005*"estratégia"'),
 (1,
  '0.015*"skills" + 0.008*"você" + 0.008*"mirella" + 0.007*"intraempreendedor" + 0.006*"pode" + 0.005*"valor" + 0.005*"empresa" + 0.005*"momento" + 0.005*"inovação" + 0.004*"colaboradores"')]

In [None]:
# Try 3 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.022*"skills" + 0.014*"você" + 0.012*"colaboradores" + 0.010*"intraempreendedor" + 0.008*"seja" + 0.007*"produto" + 0.007*"pode" + 0.006*"business" + 0.006*"mercado" + 0.006*"são"'),
 (1,
  '0.001*"skills" + 0.001*"intraempreendedor" + 0.001*"você" + 0.001*"empresa" + 0.001*"habilidades" + 0.001*"pode" + 0.001*"mirella" + 0.001*"valor" + 0.001*"tornar" + 0.001*"inovação"'),
 (2,
  '0.012*"mirella" + 0.007*"valor" + 0.007*"momento" + 0.007*"crise" + 0.007*"basf" + 0.007*"empresa" + 0.006*"pode" + 0.006*"inovação" + 0.006*"colaboradores" + 0.006*"cliente"')]

In [None]:
# Try 4 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

2020-08-16 13:15:00,297 : INFO : using symmetric alpha at 0.25
2020-08-16 13:15:00,298 : INFO : using symmetric eta at 0.25
2020-08-16 13:15:00,299 : INFO : using serial LDA version on this node
2020-08-16 13:15:00,301 : INFO : running online (multi-pass) LDA training, 4 topics, 10 passes over the supplied corpus of 3 documents, updating model once every 3 documents, evaluating perplexity every 3 documents, iterating 50x with a convergence threshold of 0.001000
2020-08-16 13:15:00,320 : INFO : -8.519 per-word bound, 366.8 perplexity estimate based on a held-out corpus of 3 documents with 1659 words
2020-08-16 13:15:00,320 : INFO : PROGRESS: pass 0, at document #3/3
2020-08-16 13:15:00,327 : INFO : topic #0 (0.250): 0.007*"pode" + 0.007*"colaboradores" + 0.007*"skills" + 0.006*"você" + 0.005*"inovação" + 0.005*"mirella" + 0.005*"momento" + 0.004*"valor" + 0.004*"business" + 0.004*"produto"
2020-08-16 13:15:00,328 : INFO : topic #1 (0.250): 0.009*"mirella" + 0.006*"você" + 0.006*"colabor

2020-08-16 13:15:00,482 : INFO : topic #2 (0.250): 0.020*"colaboradores" + 0.013*"produto" + 0.012*"business" + 0.010*"seja" + 0.010*"core" + 0.009*"você" + 0.009*"mercado" + 0.009*"entender" + 0.007*"pode" + 0.007*"estratégia"
2020-08-16 13:15:00,483 : INFO : topic #3 (0.250): 0.035*"skills" + 0.016*"você" + 0.016*"intraempreendedor" + 0.010*"habilidades" + 0.010*"skill" + 0.009*"suas" + 0.009*"intraempreendedorismo" + 0.009*"tornar" + 0.007*"vai" + 0.007*"profissionais"
2020-08-16 13:15:00,484 : INFO : topic diff=0.051820, rho=0.353553
2020-08-16 13:15:00,496 : INFO : -6.866 per-word bound, 116.6 perplexity estimate based on a held-out corpus of 3 documents with 1659 words
2020-08-16 13:15:00,496 : INFO : PROGRESS: pass 7, at document #3/3
2020-08-16 13:15:00,499 : INFO : topic #0 (0.250): 0.001*"pode" + 0.001*"colaboradores" + 0.001*"skills" + 0.001*"você" + 0.001*"inovação" + 0.001*"mirella" + 0.001*"momento" + 0.001*"valor" + 0.001*"business" + 0.001*"produto"
2020-08-16 13:15:00,

[(0,
  '0.001*"pode" + 0.001*"colaboradores" + 0.001*"skills" + 0.001*"você" + 0.001*"inovação" + 0.001*"mirella" + 0.001*"momento" + 0.001*"valor" + 0.001*"business" + 0.001*"produto"'),
 (1,
  '0.013*"mirella" + 0.008*"valor" + 0.008*"momento" + 0.007*"empresa" + 0.007*"basf" + 0.007*"crise" + 0.006*"colaboradores" + 0.006*"pode" + 0.006*"inovação" + 0.006*"cliente"'),
 (2,
  '0.020*"colaboradores" + 0.014*"produto" + 0.012*"business" + 0.010*"seja" + 0.010*"core" + 0.009*"você" + 0.009*"mercado" + 0.009*"entender" + 0.007*"pode" + 0.007*"estratégia"'),
 (3,
  '0.036*"skills" + 0.016*"você" + 0.016*"intraempreendedor" + 0.010*"habilidades" + 0.010*"skill" + 0.009*"suas" + 0.009*"intraempreendedorismo" + 0.009*"tornar" + 0.007*"vai" + 0.007*"profissionais"')]

## Topic Modeling - Attempt #3 (Nouns and Adjectives)

In [None]:
# Create a function to extract the nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    selected = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(selected)

In [None]:
# Apply the nouns_adj function to the texts to keep only nouns and adjectives
data_nouns_adj = pd.DataFrame(data_clean.textos.apply(nouns_adj))
data_nouns_adj

Unnamed: 0,textos
inovacao aberta,se você já acompanha nossos conteúdos há algum...
inovacao incremental,o termo inovação incremental ganhou força marc...
intraempreendedorismo,soft skills habilidades emocional das em tradu...


In [None]:
# Create a new document-term matrix using only nouns and adjectives, and also remove common words with max_df
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
cvna = CountVectorizer(stop_words=stop_words, max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.textos)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index
data_dtmna

Unnamed: 0,abacaxi,aberta,abertos,abrange,abrir,acelera,acompanha,acompanhe,acontece,acontecer,...,vídeos,webinars,zona,às,álcool,área,áreas,ótima,única,útil
inovacao aberta,0,7,1,0,1,0,1,1,2,1,...,1,1,0,2,1,4,3,2,0,0
inovacao incremental,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
intraempreendedorismo,0,1,0,1,0,1,0,0,0,0,...,0,0,1,3,0,3,0,0,1,1
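
With only 3 documents, `max_df=.8` has a simple effect: a term that appears in all 3 documents has a document frequency of 3/3 = 1.0 and is dropped, while a term that appears in 1 or 2 documents is kept. The sketch below reproduces that rule in plain Python on hypothetical tokenized documents; `filter_by_max_df` is an illustrative helper, not scikit-learn's code.

```python
def filter_by_max_df(docs, max_df):
    """Drop terms whose document frequency (as a share of documents) exceeds max_df."""
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, df in doc_freq.items() if df / n_docs <= max_df)

# Hypothetical tokenized documents
sample_docs = [
    ["empresa", "inovação", "aberta"],
    ["empresa", "inovação", "incremental"],
    ["empresa", "skills"],
]
kept_terms = filter_by_max_df(sample_docs, max_df=0.8)
print(kept_terms)
```

Here "empresa" occurs in every document and is removed, which is the kind of corpus-wide common word the cell above is meant to filter out.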


In [None]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [None]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

2020-08-16 13:15:10,333 : INFO : using symmetric alpha at 0.5
2020-08-16 13:15:10,334 : INFO : using symmetric eta at 0.5
2020-08-16 13:15:10,335 : INFO : using serial LDA version on this node
2020-08-16 13:15:10,336 : INFO : running online (multi-pass) LDA training, 2 topics, 10 passes over the supplied corpus of 3 documents, updating model once every 3 documents, evaluating perplexity every 3 documents, iterating 50x with a convergence threshold of 0.001000
2020-08-16 13:15:10,357 : INFO : -7.621 per-word bound, 196.9 perplexity estimate based on a held-out corpus of 3 documents with 1732 words
2020-08-16 13:15:10,357 : INFO : PROGRESS: pass 0, at document #3/3
2020-08-16 13:15:10,365 : INFO : topic #0 (0.500): 0.014*"soft" + 0.010*"skills" + 0.006*"intraempreendedor" + 0.005*"incremental" + 0.005*"empresa" + 0.005*"profissional" + 0.004*"produto" + 0.004*"business" + 0.004*"suas" + 0.004*"tornar"
2020-08-16 13:15:10,366 : INFO : topic #1 (0.500): 0.009*"skills" + 0.008*"soft" + 0.00

[(0,
  '0.021*"soft" + 0.018*"skills" + 0.009*"incremental" + 0.008*"intraempreendedor" + 0.008*"profissional" + 0.006*"produto" + 0.005*"suas" + 0.005*"tornar" + 0.005*"habilidades" + 0.005*"business"'),
 (1,
  '0.010*"mirella" + 0.008*"empresa" + 0.006*"momento" + 0.005*"aberta" + 0.005*"basf" + 0.005*"crise" + 0.005*"soluções" + 0.005*"cliente" + 0.005*"ambev" + 0.004*"respeito"')]

In [None]:
# Try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

2020-08-16 13:15:22,579 : INFO : using symmetric alpha at 0.3333333333333333
2020-08-16 13:15:22,581 : INFO : using symmetric eta at 0.3333333333333333
2020-08-16 13:15:22,582 : INFO : using serial LDA version on this node
2020-08-16 13:15:22,583 : INFO : running online (multi-pass) LDA training, 3 topics, 10 passes over the supplied corpus of 3 documents, updating model once every 3 documents, evaluating perplexity every 3 documents, iterating 50x with a convergence threshold of 0.001000
2020-08-16 13:15:22,609 : INFO : -8.105 per-word bound, 275.2 perplexity estimate based on a held-out corpus of 3 documents with 1732 words
2020-08-16 13:15:22,609 : INFO : PROGRESS: pass 0, at document #3/3
2020-08-16 13:15:22,617 : INFO : topic #0 (0.333): 0.010*"soft" + 0.009*"skills" + 0.008*"incremental" + 0.006*"produto" + 0.005*"business" + 0.005*"intraempreendedor" + 0.005*"frente" + 0.005*"empresa" + 0.005*"core" + 0.004*"mirella"
2020-08-16 13:15:22,618 : INFO : topic #1 (0.333): 0.010*"soft

2020-08-16 13:15:22,849 : INFO : topic diff=0.030634, rho=0.333333
2020-08-16 13:15:22,865 : INFO : -6.885 per-word bound, 118.2 perplexity estimate based on a held-out corpus of 3 documents with 1732 words
2020-08-16 13:15:22,865 : INFO : PROGRESS: pass 8, at document #3/3
2020-08-16 13:15:22,870 : INFO : topic #0 (0.333): 0.018*"incremental" + 0.012*"produto" + 0.011*"business" + 0.009*"core" + 0.008*"frente" + 0.008*"google" + 0.006*"estratégia" + 0.006*"alto" + 0.006*"conhecimento" + 0.005*"escala"
2020-08-16 13:15:22,871 : INFO : topic #1 (0.333): 0.011*"mirella" + 0.009*"empresa" + 0.007*"momento" + 0.006*"aberta" + 0.006*"crise" + 0.006*"basf" + 0.005*"soluções" + 0.005*"cliente" + 0.005*"ambev" + 0.004*"lucas"
2020-08-16 13:15:22,872 : INFO : topic #2 (0.333): 0.033*"soft" + 0.029*"skills" + 0.013*"intraempreendedor" + 0.012*"profissional" + 0.008*"habilidades" + 0.008*"skill" + 0.008*"suas" + 0.008*"tornar" + 0.007*"intraempreendedorismo" + 0.006*"profissionais"
2020-08-16 13:

[(0,
  '0.018*"incremental" + 0.012*"produto" + 0.011*"business" + 0.009*"core" + 0.008*"frente" + 0.008*"google" + 0.006*"estratégia" + 0.006*"alto" + 0.006*"conhecimento" + 0.005*"escala"'),
 (1,
  '0.011*"mirella" + 0.009*"empresa" + 0.007*"momento" + 0.006*"aberta" + 0.006*"crise" + 0.006*"basf" + 0.005*"soluções" + 0.005*"cliente" + 0.005*"ambev" + 0.004*"lucas"'),
 (2,
  '0.033*"soft" + 0.029*"skills" + 0.013*"intraempreendedor" + 0.012*"profissional" + 0.008*"habilidades" + 0.008*"skill" + 0.008*"suas" + 0.008*"tornar" + 0.007*"intraempreendedorismo" + 0.006*"profissionais"')]

In [None]:
# Try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

2020-08-16 13:15:30,841 : INFO : using symmetric alpha at 0.25
2020-08-16 13:15:30,843 : INFO : using symmetric eta at 0.25
2020-08-16 13:15:30,844 : INFO : using serial LDA version on this node
2020-08-16 13:15:30,845 : INFO : running online (multi-pass) LDA training, 4 topics, 10 passes over the supplied corpus of 3 documents, updating model once every 3 documents, evaluating perplexity every 3 documents, iterating 50x with a convergence threshold of 0.001000
2020-08-16 13:15:30,864 : INFO : -8.691 per-word bound, 413.3 perplexity estimate based on a held-out corpus of 3 documents with 1732 words
2020-08-16 13:15:30,864 : INFO : PROGRESS: pass 0, at document #3/3
2020-08-16 13:15:30,872 : INFO : topic #0 (0.250): 0.016*"soft" + 0.010*"skills" + 0.006*"profissional" + 0.006*"empresa" + 0.006*"mirella" + 0.004*"suas" + 0.004*"intraempreendedor" + 0.004*"momento" + 0.004*"tornar" + 0.004*"capacidade"
2020-08-16 13:15:30,872 : INFO : topic #1 (0.250): 0.017*"skills" + 0.012*"soft" + 0.00

2020-08-16 13:15:31,018 : INFO : topic #1 (0.250): 0.025*"soft" + 0.022*"skills" + 0.011*"incremental" + 0.010*"intraempreendedor" + 0.009*"profissional" + 0.007*"produto" + 0.006*"business" + 0.006*"habilidades" + 0.006*"tornar" + 0.006*"skill"
2020-08-16 13:15:31,019 : INFO : topic #2 (0.250): 0.012*"mirella" + 0.010*"empresa" + 0.007*"momento" + 0.006*"crise" + 0.006*"aberta" + 0.006*"basf" + 0.006*"cliente" + 0.006*"soluções" + 0.006*"ambev" + 0.005*"respeito"
2020-08-16 13:15:31,020 : INFO : topic #3 (0.250): 0.002*"soft" + 0.001*"skills" + 0.001*"empresa" + 0.001*"mirella" + 0.001*"profissional" + 0.001*"produto" + 0.001*"incremental" + 0.001*"intraempreendedor" + 0.001*"basf" + 0.001*"frente"
2020-08-16 13:15:31,021 : INFO : topic diff=0.051147, rho=0.353553
2020-08-16 13:15:31,034 : INFO : -7.088 per-word bound, 136.0 perplexity estimate based on a held-out corpus of 3 documents with 1732 words
2020-08-16 13:15:31,035 : INFO : PROGRESS: pass 7, at document #3/3
2020-08-16 13:15

[(0,
  '0.002*"soft" + 0.001*"skills" + 0.001*"profissional" + 0.001*"empresa" + 0.001*"suas" + 0.001*"mirella" + 0.001*"capacidade" + 0.001*"intraempreendedor" + 0.001*"tornar" + 0.001*"habilidades"'),
 (1,
  '0.025*"soft" + 0.022*"skills" + 0.011*"incremental" + 0.010*"intraempreendedor" + 0.009*"profissional" + 0.007*"produto" + 0.006*"business" + 0.006*"habilidades" + 0.006*"tornar" + 0.006*"skill"'),
 (2,
  '0.012*"mirella" + 0.010*"empresa" + 0.007*"momento" + 0.006*"crise" + 0.006*"aberta" + 0.006*"basf" + 0.006*"cliente" + 0.006*"soluções" + 0.006*"ambev" + 0.005*"respeito"'),
 (3,
  '0.001*"soft" + 0.001*"skills" + 0.001*"empresa" + 0.001*"mirella" + 0.001*"profissional" + 0.001*"produto" + 0.001*"incremental" + 0.001*"intraempreendedor" + 0.001*"basf" + 0.001*"frente"')]

## Identify the Topics in Each Document

Of the 9 topic models we examined, the nouns-and-adjectives model with 4 topics made the most sense. So let's pull it up here and run it for more passes to get more refined topics.
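
Once a final model is chosen, identifying the topic of each document amounts to taking the highest-probability topic from that document's topic distribution. The sketch below shows only that selection step, on made-up distributions in the `(topic_id, probability)` form that gensim returns for a document. The document names and numbers are illustrative, not results from this notebook.

```python
# Hypothetical per-document topic distributions: lists of (topic_id, probability)
toy_doc_topics = {
    "doc_a": [(0, 0.05), (2, 0.93)],
    "doc_b": [(0, 0.97)],
    "doc_c": [(1, 0.10), (3, 0.88)],
}

def dominant_topic(distribution):
    """Return the topic_id with the highest probability."""
    return max(distribution, key=lambda pair: pair[1])[0]

assignments = {name: dominant_topic(dist) for name, dist in toy_doc_topics.items()}
print(assignments)
```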

In [None]:
# Our final model (for now)
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=80)
ldana.print_topics()

2020-08-16 13:15:42,967 : INFO : using symmetric alpha at 0.25
2020-08-16 13:15:42,968 : INFO : using symmetric eta at 0.25
2020-08-16 13:15:42,969 : INFO : using serial LDA version on this node
2020-08-16 13:15:42,971 : INFO : running online (multi-pass) LDA training, 4 topics, 80 passes over the supplied corpus of 3 documents, updating model once every 3 documents, evaluating perplexity every 3 documents, iterating 50x with a convergence threshold of 0.001000
2020-08-16 13:15:42,987 : INFO : -8.696 per-word bound, 414.6 perplexity estimate based on a held-out corpus of 3 documents with 1732 words
2020-08-16 13:15:42,987 : INFO : PROGRESS: pass 0, at document #3/3
2020-08-16 13:15:42,996 : INFO : topic #0 (0.250): 0.010*"incremental" + 0.007*"soft" + 0.007*"skills" + 0.006*"business" + 0.006*"produto" + 0.006*"core" + 0.005*"empresa" + 0.005*"estratégia" + 0.004*"mirella" + 0.004*"google"
2020-08-16 13:15:42,997 : INFO : topic #1 (0.250): 0.008*"soft" + 0.006*"skills" + 0.006*"empresa

2020-08-16 13:15:43,128 : INFO : topic #2 (0.250): 0.012*"mirella" + 0.010*"empresa" + 0.007*"momento" + 0.006*"aberta" + 0.006*"basf" + 0.006*"crise" + 0.006*"ambev" + 0.006*"cliente" + 0.006*"soluções" + 0.005*"cenário"
2020-08-16 13:15:43,129 : INFO : topic #3 (0.250): 0.036*"soft" + 0.032*"skills" + 0.014*"intraempreendedor" + 0.013*"profissional" + 0.009*"suas" + 0.009*"tornar" + 0.009*"skill" + 0.009*"habilidades" + 0.008*"intraempreendedorismo" + 0.007*"capacidade"
2020-08-16 13:15:43,129 : INFO : topic diff=0.056324, rho=0.353553
2020-08-16 13:15:43,143 : INFO : -6.939 per-word bound, 122.7 perplexity estimate based on a held-out corpus of 3 documents with 1732 words
2020-08-16 13:15:43,144 : INFO : PROGRESS: pass 7, at document #3/3
2020-08-16 13:15:43,147 : INFO : topic #0 (0.250): 0.020*"incremental" + 0.014*"produto" + 0.012*"business" + 0.010*"core" + 0.009*"google" + 0.009*"frente" + 0.007*"estratégia" + 0.007*"conhecimento" + 0.007*"alto" + 0.005*"desse"
2020-08-16 13:15

2020-08-16 13:15:43,261 : INFO : topic #1 (0.250): 0.001*"soft" + 0.001*"skills" + 0.001*"empresa" + 0.001*"mirella" + 0.001*"momento" + 0.001*"produto" + 0.001*"profissional" + 0.001*"aberta" + 0.001*"soluções" + 0.001*"business"
2020-08-16 13:15:43,262 : INFO : topic #2 (0.250): 0.012*"mirella" + 0.010*"empresa" + 0.007*"momento" + 0.006*"aberta" + 0.006*"basf" + 0.006*"crise" + 0.006*"ambev" + 0.006*"cliente" + 0.006*"soluções" + 0.005*"cenário"
2020-08-16 13:15:43,263 : INFO : topic #3 (0.250): 0.036*"soft" + 0.032*"skills" + 0.014*"intraempreendedor" + 0.013*"profissional" + 0.009*"suas" + 0.009*"tornar" + 0.009*"skill" + 0.009*"habilidades" + 0.008*"intraempreendedorismo" + 0.007*"capacidade"
2020-08-16 13:15:43,263 : INFO : topic diff=0.003747, rho=0.258199
2020-08-16 13:15:43,276 : INFO : -6.936 per-word bound, 122.5 perplexity estimate based on a held-out corpus of 3 documents with 1732 words
2020-08-16 13:15:43,276 : INFO : PROGRESS: pass 14, at document #3/3
2020-08-16 13:15

2020-08-16 13:15:43,402 : INFO : PROGRESS: pass 20, at document #3/3
2020-08-16 13:15:43,406 : INFO : topic #0 (0.250): 0.020*"incremental" + 0.014*"produto" + 0.012*"business" + 0.010*"core" + 0.009*"google" + 0.009*"frente" + 0.007*"estratégia" + 0.007*"conhecimento" + 0.007*"alto" + 0.005*"minutos"
2020-08-16 13:15:43,407 : INFO : topic #1 (0.250): 0.001*"soft" + 0.001*"skills" + 0.001*"empresa" + 0.001*"mirella" + 0.001*"momento" + 0.001*"produto" + 0.001*"profissional" + 0.001*"aberta" + 0.001*"soluções" + 0.001*"business"
2020-08-16 13:15:43,408 : INFO : topic #2 (0.250): 0.012*"mirella" + 0.010*"empresa" + 0.007*"momento" + 0.006*"aberta" + 0.006*"basf" + 0.006*"crise" + 0.006*"ambev" + 0.006*"soluções" + 0.006*"cliente" + 0.005*"cenário"
2020-08-16 13:15:43,408 : INFO : topic #3 (0.250): 0.036*"soft" + 0.032*"skills" + 0.014*"intraempreendedor" + 0.013*"profissional" + 0.009*"suas" + 0.009*"tornar" + 0.009*"skill" + 0.009*"habilidades" + 0.008*"intraempreendedorismo" + 0.007*"c

2020-08-16 13:15:43,535 : INFO : topic diff=0.000108, rho=0.188982
2020-08-16 13:15:43,548 : INFO : -6.936 per-word bound, 122.5 perplexity estimate based on a held-out corpus of 3 documents with 1732 words
2020-08-16 13:15:43,549 : INFO : PROGRESS: pass 27, at document #3/3
2020-08-16 13:15:43,551 : INFO : topic #0 (0.250): 0.020*"incremental" + 0.014*"produto" + 0.012*"business" + 0.010*"core" + 0.009*"google" + 0.009*"frente" + 0.007*"estratégia" + 0.007*"conhecimento" + 0.007*"alto" + 0.005*"minutos"
2020-08-16 13:15:43,552 : INFO : topic #1 (0.250): 0.001*"soft" + 0.001*"skills" + 0.001*"empresa" + 0.001*"mirella" + 0.001*"momento" + 0.001*"produto" + 0.001*"profissional" + 0.001*"aberta" + 0.001*"soluções" + 0.001*"business"
2020-08-16 13:15:43,553 : INFO : topic #2 (0.250): 0.012*"mirella" + 0.010*"empresa" + 0.007*"momento" + 0.006*"aberta" + 0.006*"basf" + 0.006*"crise" + 0.006*"soluções" + 0.006*"ambev" + 0.006*"cliente" + 0.005*"respeito"

2020-08-16 13:15:43,666 : INFO : topic #3 (0.250): 0.036*"soft" + 0.033*"skills" + 0.014*"intraempreendedor" + 0.013*"profissional" + 0.009*"suas" + 0.009*"tornar" + 0.009*"skill" + 0.009*"habilidades" + 0.008*"intraempreendedorismo" + 0.007*"capacidade"
2020-08-16 13:15:43,667 : INFO : topic diff=0.000023, rho=0.169031
2020-08-16 13:15:43,688 : INFO : -6.936 per-word bound, 122.5 perplexity estimate based on a held-out corpus of 3 documents with 1732 words
2020-08-16 13:15:43,689 : INFO : PROGRESS: pass 34, at document #3/3
2020-08-16 13:15:43,696 : INFO : topic #0 (0.250): 0.020*"incremental" + 0.014*"produto" + 0.012*"business" + 0.010*"core" + 0.009*"google" + 0.009*"frente" + 0.007*"estratégia" + 0.007*"conhecimento" + 0.007*"alto" + 0.005*"minutos"
2020-08-16 13:15:43,697 : INFO : topic #1 (0.250): 0.001*"soft" + 0.001*"skills" + 0.001*"empresa" + 0.001*"mirella" + 0.001*"momento" + 0.001*"produto" + 0.001*"profissional" + 0.001*"aberta" + 0.001*"soluções" + 0.001*"business"

2020-08-16 13:15:43,817 : INFO : topic #2 (0.250): 0.012*"mirella" + 0.010*"empresa" + 0.007*"momento" + 0.006*"aberta" + 0.006*"basf" + 0.006*"crise" + 0.006*"soluções" + 0.006*"ambev" + 0.006*"cliente" + 0.005*"respeito"
2020-08-16 13:15:43,817 : INFO : topic #3 (0.250): 0.036*"soft" + 0.033*"skills" + 0.014*"intraempreendedor" + 0.013*"profissional" + 0.009*"suas" + 0.009*"tornar" + 0.009*"skill" + 0.009*"habilidades" + 0.008*"intraempreendedorismo" + 0.007*"capacidade"
2020-08-16 13:15:43,818 : INFO : topic diff=0.000006, rho=0.154303
2020-08-16 13:15:43,838 : INFO : -6.936 per-word bound, 122.5 perplexity estimate based on a held-out corpus of 3 documents with 1732 words
2020-08-16 13:15:43,838 : INFO : PROGRESS: pass 41, at document #3/3
2020-08-16 13:15:43,845 : INFO : topic #0 (0.250): 0.020*"incremental" + 0.014*"produto" + 0.012*"business" + 0.010*"core" + 0.009*"google" + 0.009*"frente" + 0.007*"estratégia" + 0.007*"conhecimento" + 0.007*"alto" + 0.005*"escala"

2020-08-16 13:15:43,972 : INFO : topic #1 (0.250): 0.001*"soft" + 0.001*"empresa" + 0.001*"skills" + 0.001*"momento" + 0.001*"produto" + 0.001*"mirella" + 0.001*"aberta" + 0.001*"profissional" + 0.001*"business" + 0.001*"soluções"
2020-08-16 13:15:43,973 : INFO : topic #2 (0.250): 0.012*"mirella" + 0.010*"empresa" + 0.007*"momento" + 0.006*"aberta" + 0.006*"basf" + 0.006*"crise" + 0.006*"soluções" + 0.006*"ambev" + 0.006*"cliente" + 0.005*"respeito"
2020-08-16 13:15:43,974 : INFO : topic #3 (0.250): 0.036*"soft" + 0.033*"skills" + 0.014*"intraempreendedor" + 0.013*"profissional" + 0.009*"suas" + 0.009*"tornar" + 0.009*"skill" + 0.009*"habilidades" + 0.008*"intraempreendedorismo" + 0.007*"capacidade"
2020-08-16 13:15:43,975 : INFO : topic diff=0.000002, rho=0.142857
2020-08-16 13:15:43,989 : INFO : -6.936 per-word bound, 122.5 perplexity estimate based on a held-out corpus of 3 documents with 1732 words
2020-08-16 13:15:43,989 : INFO : PROGRESS: pass 48, at document #3/3

2020-08-16 13:15:44,106 : INFO : PROGRESS: pass 54, at document #3/3
2020-08-16 13:15:44,111 : INFO : topic #0 (0.250): 0.020*"incremental" + 0.014*"produto" + 0.012*"business" + 0.010*"core" + 0.009*"google" + 0.009*"frente" + 0.007*"estratégia" + 0.007*"alto" + 0.007*"conhecimento" + 0.005*"minutos"
2020-08-16 13:15:44,112 : INFO : topic #1 (0.250): 0.001*"empresa" + 0.001*"soft" + 0.001*"momento" + 0.001*"produto" + 0.001*"skills" + 0.001*"aberta" + 0.001*"profissional" + 0.001*"business" + 0.001*"frente" + 0.001*"respeito"
2020-08-16 13:15:44,113 : INFO : topic #2 (0.250): 0.012*"mirella" + 0.010*"empresa" + 0.007*"momento" + 0.006*"aberta" + 0.006*"basf" + 0.006*"crise" + 0.006*"soluções" + 0.006*"ambev" + 0.006*"cliente" + 0.005*"respeito"
2020-08-16 13:15:44,113 : INFO : topic #3 (0.250): 0.036*"soft" + 0.033*"skills" + 0.014*"intraempreendedor" + 0.013*"profissional" + 0.009*"suas" + 0.009*"tornar" + 0.009*"habilidades" + 0.009*"skill" + 0.008*"intraempreendedorismo" + 0.007*"c

2020-08-16 13:15:44,230 : INFO : topic diff=0.000000, rho=0.127000
2020-08-16 13:15:44,243 : INFO : -6.936 per-word bound, 122.5 perplexity estimate based on a held-out corpus of 3 documents with 1732 words
2020-08-16 13:15:44,243 : INFO : PROGRESS: pass 61, at document #3/3
2020-08-16 13:15:44,247 : INFO : topic #0 (0.250): 0.020*"incremental" + 0.014*"produto" + 0.012*"business" + 0.010*"core" + 0.009*"google" + 0.009*"frente" + 0.007*"estratégia" + 0.007*"alto" + 0.007*"conhecimento" + 0.005*"minutos"
2020-08-16 13:15:44,248 : INFO : topic #1 (0.250): 0.001*"momento" + 0.001*"cada" + 0.001*"empresa" + 0.001*"terá" + 0.001*"dedicado" + 0.001*"contínua" + 0.001*"tornase" + 0.001*"processos" + 0.001*"termo" + 0.001*"parte"
2020-08-16 13:15:44,248 : INFO : topic #2 (0.250): 0.012*"mirella" + 0.010*"empresa" + 0.007*"momento" + 0.006*"aberta" + 0.006*"crise" + 0.006*"basf" + 0.006*"soluções" + 0.006*"ambev" + 0.006*"cliente" + 0.005*"respeito"

2020-08-16 13:15:44,367 : INFO : topic diff=0.000000, rho=0.120386
2020-08-16 13:15:44,380 : INFO : -6.936 per-word bound, 122.5 perplexity estimate based on a held-out corpus of 3 documents with 1732 words
2020-08-16 13:15:44,380 : INFO : PROGRESS: pass 68, at document #3/3
2020-08-16 13:15:44,384 : INFO : topic #0 (0.250): 0.020*"incremental" + 0.014*"produto" + 0.012*"business" + 0.010*"core" + 0.009*"google" + 0.009*"frente" + 0.007*"estratégia" + 0.007*"alto" + 0.007*"conhecimento" + 0.005*"minutos"
2020-08-16 13:15:44,384 : INFO : topic #1 (0.250): 0.001*"contínua" + 0.001*"dedicado" + 0.001*"cada" + 0.001*"terá" + 0.001*"tornase" + 0.001*"processos" + 0.001*"termo" + 0.001*"aqui" + 0.001*"pouco" + 0.001*"deu"
2020-08-16 13:15:44,386 : INFO : topic #2 (0.250): 0.012*"mirella" + 0.010*"empresa" + 0.007*"momento" + 0.006*"aberta" + 0.006*"crise" + 0.006*"basf" + 0.006*"soluções" + 0.006*"cliente" + 0.006*"ambev" + 0.005*"respeito"

2020-08-16 13:15:44,504 : INFO : topic diff=0.000000, rho=0.114708
2020-08-16 13:15:44,518 : INFO : -6.936 per-word bound, 122.5 perplexity estimate based on a held-out corpus of 3 documents with 1732 words
2020-08-16 13:15:44,518 : INFO : PROGRESS: pass 75, at document #3/3
2020-08-16 13:15:44,522 : INFO : topic #0 (0.250): 0.020*"incremental" + 0.014*"produto" + 0.012*"business" + 0.010*"core" + 0.009*"google" + 0.009*"frente" + 0.007*"estratégia" + 0.007*"conhecimento" + 0.007*"alto" + 0.005*"vez"
2020-08-16 13:15:44,523 : INFO : topic #1 (0.250): 0.001*"dedicado" + 0.001*"cada" + 0.001*"processos" + 0.001*"tornase" + 0.001*"termo" + 0.001*"terá" + 0.001*"contínua" + 0.001*"aqui" + 0.001*"aevo" + 0.001*"deste"
2020-08-16 13:15:44,523 : INFO : topic #2 (0.250): 0.012*"mirella" + 0.010*"empresa" + 0.007*"momento" + 0.006*"aberta" + 0.006*"basf" + 0.006*"crise" + 0.006*"soluções" + 0.006*"cliente" + 0.006*"ambev" + 0.005*"respeito"

[(0,
  '0.020*"incremental" + 0.014*"produto" + 0.012*"business" + 0.010*"core" + 0.009*"google" + 0.009*"frente" + 0.007*"estratégia" + 0.007*"conhecimento" + 0.007*"alto" + 0.005*"vez"'),
 (1,
  '0.001*"contínua" + 0.001*"terá" + 0.001*"tornase" + 0.001*"processos" + 0.001*"cada" + 0.001*"dedicado" + 0.001*"termo" + 0.001*"deste" + 0.001*"aqui" + 0.001*"aevo"'),
 (2,
  '0.012*"mirella" + 0.010*"empresa" + 0.007*"momento" + 0.006*"aberta" + 0.006*"basf" + 0.006*"crise" + 0.006*"soluções" + 0.006*"cliente" + 0.006*"ambev" + 0.005*"respeito"'),
 (3,
  '0.036*"soft" + 0.033*"skills" + 0.014*"intraempreendedor" + 0.013*"profissional" + 0.009*"suas" + 0.009*"habilidades" + 0.009*"tornar" + 0.009*"skill" + 0.008*"intraempreendedorismo" + 0.007*"capacidade"')]
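
A short aside on reading the training log above: gensim reports each perplexity estimate as 2 raised to the negative per-word bound. The sketch below redoes that arithmetic for the logged bound of -6.936. The variable names are illustrative and are not part of the original notebook.

```python
# Per-word bound as printed in the log (rounded to three decimals).
per_word_bound = -6.936

# Perplexity is 2 ** (-bound).
perplexity = 2 ** (-per_word_bound)

# The log rounds the bound, so the true value lies within +/- 0.0005 of it.
low = 2 ** (-(per_word_bound + 0.0005))
high = 2 ** (-(per_word_bound - 0.0005))

print(f"perplexity ~ {perplexity:.2f} (between {low:.2f} and {high:.2f})")
```

The result lands between roughly 122.4 and 122.5, consistent with the logged 122.5 once the rounding of the bound is taken into account. The bound stays at -6.936 from pass 14 onward while `topic diff` shrinks to 0.000000, which suggests the model had stabilized well before the final pass.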

Topic 0 = incremental, produto, business - Incremental Innovation

Topic 1 = contínua, processos (every weight is about 0.001, so it carries no clear theme)

Topic 2 = mirella, empresa, momento, aberta, crise, soluções - Open Innovation

Topic 3 = soft, skills - Intrapreneurship
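
One way to support labels like these is to parse the topic strings and look at their weights. The sketch below is a minimal, hypothetical helper (`parse_topic` is not part of gensim or of the original notebook) that splits a string in the printed format into (word, weight) pairs. The two sample strings are shortened copies of topics 1 and 3 from the printed output.

```python
import re

def parse_topic(topic_str):
    """Split a string like '0.036*"soft" + 0.033*"skills"' into (word, weight) pairs."""
    pairs = re.findall(r'(\d+\.\d+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]

# Shortened copies of two topics from the printed output.
topics = {
    1: '0.001*"contínua" + 0.001*"terá" + 0.001*"tornase"',
    3: '0.036*"soft" + 0.033*"skills" + 0.014*"intraempreendedor"',
}

for topic_id, topic_str in topics.items():
    words = parse_topic(topic_str)
    total = sum(weight for _, weight in words)
    # Print the topic id, its top words, and the summed weight of those words.
    print(topic_id, [w for w, _ in words], round(total, 3))
```

A topic whose listed weights are all near 0.001, like topic 1 in the printed output, has no word that stands out and is hard to label with confidence.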

In [None]:
# Which topic each text contains.
# Note: the [(a, b)] pattern assumes exactly one (topic, probability) pair per document.
corpus_transformed = ldana[corpusna]
list(zip([a for [(a, b)] in corpus_transformed], data_dtmna.index))

[(2, 'inovacao aberta'),
 (0, 'inovacao incremental'),
 (3, 'intraempreendedorismo')]
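
The list comprehension above unpacks exactly one (topic, probability) pair per document, which works here because each text is dominated by a single topic. As a more defensive sketch, the hypothetical helper below picks the most probable topic when a document has several. The sample distributions are made up for illustration and are not results from this notebook.

```python
def dominant_topic(doc_topics):
    """Return the topic id with the highest probability for one document."""
    topic_id, _ = max(doc_topics, key=lambda pair: pair[1])
    return topic_id

# Made-up topic distributions in the (topic_id, probability) format gensim returns.
sample_docs = {
    "doc_a": [(2, 0.99)],
    "doc_b": [(0, 0.70), (3, 0.30)],
}

assignments = [(dominant_topic(topics), name) for name, topics in sample_docs.items()]
print(assignments)
```

With the notebook's own objects, the same helper could be applied as `[dominant_topic(doc) for doc in ldana[corpusna]]`, assuming every document receives at least one topic.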