# Latent Dirichlet Allocation (LDiA)
The general idea of this algorithm is that a topic can be thought of as a combination of words, characterized by how frequently those words appear in the documents.

LDiA assumes that each document is a mixture of some number of topics; that number is one of the hyperparameters you give the algorithm.
It also assumes that each topic can be represented by a distribution of word frequencies.

Thus, the weights of the words in each topic, as well as the weights of the topics in each document, are assumed to follow a Dirichlet distribution.

Training LDiA therefore means finding, for each topic, the mixture of words, and for each document, the mixture of topics, that fit these Dirichlet distributions.
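
To make the two assumptions concrete, here is a minimal sketch of the generative story LDiA assumes, written with NumPy only. The vocabulary, the number of topics, and the Dirichlet concentration value of 0.5 are hypothetical choices for illustration; they are not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and topic count (assumptions for illustration)
vocab = ["free", "prize", "call", "home", "dinner", "tonight"]
n_topics = 2

# Each topic is a distribution over the vocabulary, drawn from a Dirichlet prior
topic_word = rng.dirichlet(alpha=np.full(len(vocab), 0.5), size=n_topics)

# Each document is a mixture of topics, also drawn from a Dirichlet prior
doc_topic = rng.dirichlet(alpha=np.full(n_topics, 0.5))

# Generate a 10-word document: pick a topic for each word, then a word from that topic
topics = rng.choice(n_topics, size=10, p=doc_topic)
words = [vocab[rng.choice(len(vocab), p=topic_word[t])] for t in topics]

print(doc_topic.round(2))
print(words)
```

Training runs this story in reverse: given only the observed words, it estimates `topic_word` and `doc_topic`.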

Training the algorithm is somewhat involved.
You can use either
- "Batch Variational Bayes for LDA" or
- "Online variational Bayes for LDA"
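
As a rough sketch of how the two options look in scikit-learn: `LatentDirichletAllocation` takes `learning_method='batch'` or `learning_method='online'`, and `partial_fit` performs online updates on mini-batches. The four-document corpus and the mini-batch size of 2 are hypothetical.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not the SMS data set)
docs = [
    "free prize call now",
    "win free cash prize",
    "dinner at home tonight",
    "see you at home",
]
X = CountVectorizer().fit_transform(docs)

# Batch variational Bayes: every update uses the whole corpus
batch_lda = LatentDirichletAllocation(
    n_components=2, learning_method="batch", random_state=0
).fit(X)

# Online variational Bayes: update the model one mini-batch at a time
online_lda = LatentDirichletAllocation(
    n_components=2, learning_method="online", random_state=0
)
for start in range(0, X.shape[0], 2):
    online_lda.partial_fit(X[start:start + 2])

print(batch_lda.components_.shape, online_lda.components_.shape)
```

Batch updates are usually fine for a corpus that fits in memory, as in this notebook; the online variant is meant for larger or streaming corpora.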

In [1]:
import pandas as pd
from nlpia.data.loaders import get_data
pd.options.display.width = 120

sms = get_data('sms-spam')
# Build readable row labels: 'sms<i>', with a trailing '!' for spam messages
index = ['sms{}{}'.format(i, '!'*j) for (i,j) in zip(range(len(sms)), sms.spam)]
sms.index = index
sms.head(6)

Unnamed: 0,spam,text
sms0,0,"Go until jurong point, crazy.. Available only ..."
sms1,0,Ok lar... Joking wif u oni...
sms2!,1,Free entry in 2 a wkly comp to win FA Cup fina...
sms3,0,U dun say so early hor... U c already then say...
sms4,0,"Nah I don't think he goes to usf, he lives aro..."
sms5!,1,FreeMsg Hey there darling it's been 3 week's n...


PCA tries to keep apart things that started apart, since it looks for the directions in the matrix $X$ that have maximum variance.

In contrast, LDiA tries to keep together things that started together.

In addition, since this algorithm takes word frequencies into account, we need a bag-of-words vectorization based on counts (not binary).
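
The difference between a count and a binary bag of words can be seen on a toy example. This is only a sketch: the two documents are hypothetical, and it uses `CountVectorizer`'s default tokenizer rather than `casual_tokenize`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents
docs = ["free free prize", "call me"]

count_vec = CountVectorizer()              # keeps how many times each token occurs
binary_vec = CountVectorizer(binary=True)  # only records presence or absence

counts = count_vec.fit_transform(docs).toarray()
flags = binary_vec.fit_transform(docs).toarray()

# Vocabulary in column order: ['call', 'free', 'me', 'prize']
print(sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get))
print(counts[0])  # [0 2 0 1]
print(flags[0])   # [0 1 0 1]
```

The repeated "free" is visible only in the count version, which is the information LDiA relies on.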

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import casual_tokenize
import numpy as np

np.random.seed(42)

counter = CountVectorizer(tokenizer=casual_tokenize)
bow_docs = pd.DataFrame(counter.fit_transform(raw_documents=sms.text).toarray(), index=index)

# Replace the token indices with the tokens themselves, for easier reading
column_nums, terms = zip(*sorted(zip(counter.vocabulary_.values(), counter.vocabulary_.keys())))
bow_docs.columns = terms
bow_docs

Unnamed: 0,!,"""",#,#150,#5000,$,%,&,',(,...,ü'll,–,—,‘,’,“,…,┾,〨ud,鈥
sms0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sms1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sms2!,0,0,0,0,0,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
sms3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sms4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
sms4832!,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sms4833,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sms4834,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sms4835,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


This is our matrix $X$.
Once we have it, we can use scikit-learn's implementation of **Latent Dirichlet Allocation (LDiA)**.

You must give it the number of topics. The model also depends on a second quantity, the mean number of words per document, which the algorithm computes from the bag of words.
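
For reference, the mean number of words per document can be read off a bag-of-words matrix directly. The 3x3 matrix below is a hypothetical stand-in for `bow_docs`.

```python
import numpy as np

# Hypothetical bag-of-words matrix: 3 documents (rows) x 3 tokens (columns)
bow = np.array([
    [2, 0, 1],
    [0, 1, 0],
    [1, 1, 1],
])

doc_lengths = bow.sum(axis=1)     # number of tokens in each document
mean_length = doc_lengths.mean()  # average document length

print(doc_lengths, mean_length)
```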

In [3]:
from sklearn.decomposition import LatentDirichletAllocation as LDiA
ldia = LDiA(n_components=16, learning_method='batch')
ldia = ldia.fit(bow_docs)

In [4]:
ldia.components_.shape

(16, 9232)

Let's look at the token weights in each topic.

In [6]:
pd.set_option('display.width', 75)
columns = ['topic{}'.format(i) for i in range(16)]
components = pd.DataFrame(ldia.components_.T, index=terms, columns=columns)
components.round(2).head(10)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10,topic11,topic12,topic13,topic14,topic15
!,184.03,15.0,72.22,394.95,45.48,36.14,9.55,44.81,0.43,90.23,37.42,44.18,64.4,297.29,41.16,11.7
"""",0.68,4.22,2.41,0.06,152.35,0.06,0.06,0.06,0.45,0.68,8.42,11.42,0.07,62.72,12.27,0.06
#,0.06,0.06,0.06,0.06,0.06,2.07,0.06,0.06,0.06,0.06,0.06,0.06,1.07,4.05,0.06,0.06
#150,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,1.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06
#5000,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,3.06,0.06,0.06,0.06,0.06,0.06,0.06
$,1.09,2.99,0.06,0.06,1.13,0.06,0.06,1.06,8.68,0.06,0.06,1.06,0.06,5.42,2.06,0.06
%,0.06,0.06,0.06,1.06,0.06,4.95,0.06,0.06,0.06,0.06,0.06,2.17,0.06,0.06,2.06,0.06
&,10.26,0.06,0.06,47.49,22.58,9.97,19.01,0.06,0.06,107.26,10.09,0.06,0.06,50.24,7.42,10.31
',0.06,0.06,0.06,0.06,21.08,0.06,0.06,0.06,0.06,3.39,0.06,0.06,0.06,7.87,0.06,127.92
(,0.06,0.06,0.35,2.16,9.95,0.06,13.42,0.06,0.06,52.09,3.75,0.06,0.06,0.89,4.88,0.06


In [7]:
components.topic3.sort_values(ascending=False)[:10]

!       394.952246
.       218.049724
to      119.533134
u       118.857546
call    111.948541
£       107.358914
,        96.954384
*        90.314783
your     90.215961
is       75.750037
Name: topic3, dtype: float64

Here we see that the symbols "!", "." and "£" are among the 10 most important tokens of topic 3, which suggests we should remove symbols and perhaps stopwords.
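
One way to do that, sketched here under assumptions: let `CountVectorizer` use its default token pattern (which keeps only alphanumeric tokens of two or more characters) together with scikit-learn's built-in English stop-word list. This is a different tokenization from the `casual_tokenize` used above, and the two sample messages are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample messages
docs = [
    "Call to win a £1000 prize !!!",
    "U dun say so early hor ...",
]

# Default token pattern drops punctuation and symbols;
# stop_words='english' drops common function words such as "to" and "so"
vectorizer = CountVectorizer(stop_words="english")
vectorizer.fit(docs)
vocab = set(vectorizer.vocabulary_)

print(sorted(vocab))
```

Whether this helps the classifier is an empirical question; in spam detection, punctuation such as "!" may carry signal, so it is worth comparing both versions.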

Now we transform the bag-of-words vectors into LDiA topic vectors.

In [8]:
ldia16_topic_vectors = ldia.transform(bow_docs)
ldia16_topic_vectors = pd.DataFrame(ldia16_topic_vectors, index=index, columns=columns)
ldia16_topic_vectors.round(2).head(10)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10,topic11,topic12,topic13,topic14,topic15
sms0,0.0,0.62,0.0,0.0,0.0,0.0,0.0,0.0,0.34,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sms1,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.78,0.01,0.01,0.12,0.01,0.01,0.01,0.01
sms2!,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.98,0.0,0.0,0.0,0.0,0.0,0.0
sms3,0.0,0.0,0.0,0.0,0.09,0.0,0.0,0.0,0.85,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sms4,0.39,0.0,0.33,0.0,0.0,0.0,0.14,0.0,0.0,0.0,0.0,0.0,0.09,0.0,0.0,0.0
sms5!,0.0,0.0,0.28,0.0,0.0,0.0,0.0,0.17,0.0,0.26,0.05,0.0,0.11,0.08,0.05,0.0
sms6,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.45,0.0,0.0
sms7,0.0,0.0,0.0,0.0,0.97,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sms8!,0.57,0.0,0.0,0.16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0
sms9!,0.0,0.0,0.0,0.43,0.0,0.0,0.0,0.0,0.0,0.11,0.0,0.0,0.0,0.44,0.0,0.0
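
Each row in a table like this is a topic mixture for one document, so its entries should add up to 1 (up to rounding). A small check on a hypothetical toy corpus, assuming a scikit-learn version whose `transform` returns normalized document-topic distributions:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not the SMS data set)
docs = [
    "free prize call now",
    "win free cash prize",
    "dinner at home tonight",
    "see you at home",
]
X = CountVectorizer().fit_transform(docs)

toy_ldia = LatentDirichletAllocation(
    n_components=3, learning_method="batch", random_state=0
)
topic_vectors = toy_ldia.fit_transform(X)

print(topic_vectors.round(2))
print(topic_vectors.sum(axis=1))  # each row sums to 1
```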


# Train and Test Sets
If you fit the model to all of the data, you may be overfitting. To check that this is not the case, the data set is split into two sets, "train" and "test".

The model is fitted on the first set, and the second is used to verify that the model discriminates well.
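
A minimal sketch of such a split on hypothetical data. The `stratify=y` argument is an addition that the notebook does not use; it keeps the class proportions similar in both halves, which can matter when one class (here, spam) is much rarer than the other.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, 5 positives out of 20
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.sum(), y_test.sum())  # positives are spread across both halves
```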

In [10]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(ldia16_topic_vectors, sms.spam, test_size=0.5, random_state=271828)

In [11]:
X_train.shape

(2418, 16)

Now we fit LDA (Linear Discriminant Analysis) to our training set.

In [12]:
lda = LDA(n_components=1)
lda = lda.fit(X_train, y_train)
sms['ldia16_spam'] = lda.predict(ldia16_topic_vectors)

In [13]:
print(round(float(lda.score(X_train, y_train)), 2))
print(round(float(lda.score(X_test, y_test)), 2))

0.92
0.94


That is quite good.

Compare this with the result from the previous class:

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize

tfidf = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf.fit_transform(raw_documents=sms.text).toarray()
tfidf_docs = tfidf_docs - tfidf_docs.mean(axis=0)

(Note that the mean is being subtracted here; this centering step should always be done.)
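
What the centering step does, shown on a hypothetical 3x2 matrix: subtracting the column means leaves every column with mean zero.

```python
import numpy as np

# Hypothetical feature matrix: 3 documents x 2 features
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 60.0],
])

# Subtract each column's mean (axis=0 averages over documents)
X_centered = X - X.mean(axis=0)

print(X_centered)
print(X_centered.mean(axis=0))  # approximately [0. 0.]
```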

In [18]:
X_train, X_test, y_train, y_test = train_test_split(tfidf_docs, sms.spam.values, test_size=0.5, random_state=271828)
lda = LDA(n_components=1)
lda = lda.fit(X_train, y_train)

In [22]:
# Ver la puntuacion para el conjunto train
round(float(lda.score(X_train, y_train)), 3)

1.0

In [23]:
# Ver la puntuacion para el conjunto test
round(float(lda.score(X_test, y_test)), 3)

0.748

Now let's compare this method with LDiA, this time reducing the dimensionality to 32 components.

In [24]:
ldia32 = LDiA(n_components=32, learning_method='batch')
ldia32 = ldia32.fit(bow_docs)
ldia32.components_.shape

(32, 9232)

In [25]:
ldia32_topic_vectors = ldia32.transform(bow_docs)
columns32 = ['topic{}'.format(i) for i in range(ldia32.n_components)]
ldia32_topic_vectors = pd.DataFrame(ldia32_topic_vectors, index=index, columns=columns32)
ldia32_topic_vectors.round(2).head()

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,...,topic22,topic23,topic24,topic25,topic26,topic27,topic28,topic29,topic30,topic31
sms0,0.0,0.0,0.0,0.06,0.14,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sms1,0.0,0.0,0.0,0.0,0.53,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.14,0.0,0.0
sms2!,0.0,0.0,0.0,0.0,0.0,0.65,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.33,0.0,0.0,0.0,0.0,0.0,0.0
sms3,0.0,0.11,0.0,0.0,0.39,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sms4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.09,0.0,0.0,0.47,0.0,0.0,0.0,0.0


In [27]:
X_train, X_test, y_train, y_test = train_test_split(ldia32_topic_vectors, sms.spam, test_size=0.5, random_state=271828)
lda = LDA(n_components=1)
lda = lda.fit(X_train, y_train)
sms['ldia32_spam'] = lda.predict(ldia32_topic_vectors)
print(round(float(lda.score(X_train, y_train)), 3))
print(round(float(lda.score(X_test, y_test)), 3))

0.933
0.936


### Even so, it is better to use cross-validation

In [30]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(lda, ldia32_topic_vectors, sms.spam, cv=5)
"Accuracy: {:.2f}(+/-{:.2f})".format(scores.mean(), scores.std()*2)

'Accuracy: 0.93(+/-0.01)'
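
A possible refinement, not part of the original notebook: in the cell above, the topic vectors were computed from all documents before cross-validation. Wrapping the vectorizer, LDiA, and LDA in a single `Pipeline` makes every fold refit the whole chain on its own training portion only. The sketch below uses a hypothetical 12-message corpus and 3 folds, so the score it prints says nothing about the SMS data.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; 1 = spam, 0 = not spam
texts = [
    "free prize call now", "win free cash today", "claim your free prize",
    "cash prize winner call", "free entry win cash", "urgent call to claim prize",
    "see you at home", "dinner at home tonight", "are you coming home",
    "call me when you arrive", "lunch tomorrow at noon", "running late see you soon",
]
labels = [1] * 6 + [0] * 6

# Vectorizer, topic model and classifier are refit inside every fold
pipeline = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=3, learning_method="batch", random_state=0),
    LinearDiscriminantAnalysis(),
)

scores = cross_val_score(pipeline, texts, labels, cv=3)
print("Accuracy: {:.2f} (+/- {:.2f})".format(scores.mean(), scores.std() * 2))
```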

## Comparison
Now we perform dimensionality reduction with PCA and classification with LDA.

In [35]:
from sklearn.decomposition import PCA

pca = PCA(n_components=16)
pca = pca.fit(tfidf_docs)
pca_topic_vectors = pca.transform(tfidf_docs)
columns = ['topic{}'.format(i) for i in range(pca.n_components)]
pca_topic_vectors = pd.DataFrame(pca_topic_vectors, columns=columns, index=index)

X_train, X_test, y_train, y_test = train_test_split(pca_topic_vectors.values, sms.spam, test_size=0.5, random_state=271828)
lda = LDA(n_components=1, priors=None, shrinkage=None, solver='svd', store_covariance=False, tol=0.0001)
lda.fit(X_train, y_train)
lda.score(X_test, y_test).round(3)

0.962

Now, using cross-validation:

In [36]:
lda = LDA(n_components=1)
scores = cross_val_score(lda, pca_topic_vectors, sms.spam, cv=5)
"Accuracy: {:.2f}(+/-{:.2f})".format(scores.mean(), scores.std()*2)

'Accuracy: 0.96(+/-0.01)'