# Implementation of the Random Projections method for the Twenty News Groups corpus

The method proposed in this [paper](./random-indexing-dr-explained.pdf) is implemented below.

The idea is to generate the matrix $M'_{p \times m}$ from the matrix $M_{p \times n}$ and the projection matrix $R_{n \times m}$, that is, $M' = M R$.
Where:

* p = number of documents
* n = number of words in the vocabulary
* m = number of topics (the dimension of the reduced space)

The matrix $M$ contains one row per document and one column per term in the vocabulary.
The element $M_{i,j}$ holds a measure of how often term $j$ appears in document $i$. In our case, that measure is the TF-IDF weight. The matrix $M$ is sparse because, for any given document, the vast majority of the vocabulary words do not appear.

The matrix $M'$ contains one row per document and one column per topic. The element $M'_{i,j}$ holds a measure of how much topic $j$ contributes to document $i$.

The method proposed in the paper consists of:

1) Building the matrix $M_{p \times n}$ from the corpus formed by the Twenty News Groups articles. This is done in a separate Python script to keep secondary code out of this notebook.

2) Generating the projection matrix $R$. The generation of this matrix is detailed further below.

3) Computing the matrix $M'$ as the projection, defined by $R$, of the matrix $M$ onto the lower-dimensional space.
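The three steps above can be sketched end to end on a toy corpus. This is a minimal illustration, not the notebook's actual pipeline: the four sentences in `toy_docs` and the value `m = 3` are made up for the example, and scikit-learn's `SparseRandomProjection` is assumed as the source of $R$.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.random_projection import SparseRandomProjection

# Toy corpus (hypothetical; it stands in for the Twenty News Groups articles)
toy_docs = [
    "the rocket reached orbit after launch",
    "the launch of the rocket was delayed",
    "the team won the hockey game",
    "the hockey season starts after the game",
]

# Step 1: M has one row per document and one column per term (p x n)
M = TfidfVectorizer().fit_transform(toy_docs)
p, n = M.shape

# Step 2: draw a sparse random projection onto m dimensions
m = 3  # arbitrary small value, for illustration only
rp = SparseRandomProjection(n_components=m, random_state=0)
rp.fit(M)

# Step 3: M' = M R; scikit-learn stores R transposed in rp.components_ (m x n)
M_prime = rp.transform(M)
print(M.shape, "->", M_prime.shape)
```

Each document keeps its row index, so row $i$ of `M_prime` is the reduced representation of row $i$ of `M`.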

Once this transformation has been computed, the following analyses are possible:

1) Take a single word from the vocabulary (represented as a document in the n-dimensional space whose components are all zero except the one for the chosen word, which is one), transform it with $R$ and retrieve its k nearest documents.

2) Take several words from the vocabulary (same procedure as above, but with more components set to one), transform them with $R$ and retrieve the k documents nearest to this vector in the reduced space.

3) Take reference documents from $M'$ and find their k nearest documents in the reduced space.

The results of each of the points above can be evaluated qualitatively. **Discuss a quantitative evaluation mechanism.**
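The queries listed above could be sketched as follows. Everything here is an assumption made for illustration: the toy corpus, the helper names `neighbours_of_words` and `neighbours_of_document`, and the choice of cosine similarity as the notion of "nearest" (the text does not fix a distance).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.random_projection import SparseRandomProjection

# Toy corpus (hypothetical)
toy_docs = [
    "the rocket reached orbit after launch",
    "the launch of the rocket was delayed",
    "the team won the hockey game",
    "the hockey season starts after the game",
]

vectorizer = TfidfVectorizer()
M = vectorizer.fit_transform(toy_docs)
rp = SparseRandomProjection(n_components=8, density=1 / 3,
                            dense_output=True, random_state=0)
M_prime = rp.fit_transform(M)  # dense (p x m) array


def neighbours_of_words(words, k=2):
    """Queries 1 and 2: project a bag of vocabulary words, return k document indices."""
    query = np.zeros((1, M.shape[1]))
    for word in words:
        query[0, vectorizer.vocabulary_[word]] = 1.0
    sims = cosine_similarity(rp.transform(query), M_prime).ravel()
    return np.argsort(-sims)[:k]


def neighbours_of_document(i, k=2):
    """Query 3: k documents closest to document i, excluding i itself."""
    sims = cosine_similarity(M_prime[i:i + 1], M_prime).ravel()
    sims[i] = -np.inf
    return np.argsort(-sims)[:k]


print(neighbours_of_words(["rocket"]))
print(neighbours_of_words(["hockey", "game"]))
print(neighbours_of_document(0))
```

With so few documents and dimensions the rankings are noisy; the sketch only shows the mechanics of building the query vector and projecting it with the same $R$ as the corpus.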

## Construction of the matrix $M$

The matrix $M$ is built from the articles extracted in this [notebook](./Corpus que se utilizará para la comparación de métodos.ipynb).

The articles are loaded first, and then the matrix $M$ is built with sklearn's `TfidfVectorizer`.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import json
from time import time
from scipy.sparse import csr_matrix
import tensorflow as tf
import numpy as np
import pickle

In [2]:
with open('art_filt.txt', 'rb') as fp:
    articulos = pickle.load(fp)
with open('art_filt_labels', 'rb') as fp:
    labels = pickle.load(fp)

In [3]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, norm="l2")
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(articulos)
print("done in %0.3fs." % (time() - t0))
p = tfidf.shape[0]  # number of documents
n = tfidf.shape[1]  # vocabulary size
print("Transformed {} articles; the vocabulary size is: {}".format(p, n))

done in 1.219s.
Transformed 11314 articles; the vocabulary size is: 28966


## Generation of the projection matrix R

According to the reference paper [1], the elements $r_{i,j}$ of the matrix $R$ are computed as:

$$ r_{i,j} = \sqrt{s}\,\begin{cases}
+1 & \text{with probability } \frac{1}{2s} \\
0 & \text{with probability } 1-\frac{1}{s} \\
-1 & \text{with probability } \frac{1}{2s}
\end{cases}$$
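To make the distribution concrete, here is a small NumPy sketch that samples a matrix entry by entry according to the formula. The function name `sample_projection_matrix` and the sizes used are illustrative assumptions. Note that scikit-learn's `SparseRandomProjection`, used below, additionally divides each entry by $\sqrt{m}$ so that the projection approximately preserves norms.

```python
import numpy as np


def sample_projection_matrix(n, m, s, seed=0):
    """Sample an (n x m) matrix with entries sqrt(s) * {+1, 0, -1}."""
    rng = np.random.default_rng(seed)
    values = np.array([1.0, 0.0, -1.0])
    probs = [1 / (2 * s), 1 - 1 / s, 1 / (2 * s)]
    return np.sqrt(s) * rng.choice(values, size=(n, m), p=probs)


R = sample_projection_matrix(n=2000, m=50, s=3)
print("fraction of non-zeros:", np.mean(R != 0))  # should be close to 1/s
print("mean of r^2:", np.mean(R ** 2))            # should be close to 1
```

The factor $\sqrt{s}$ compensates for the zeros: each entry has zero mean and unit variance regardless of how sparse the matrix is.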

In [4]:
import numpy as np
from scipy import spatial
from sklearn.random_projection import SparseRandomProjection

# Compute m from the tolerated error
error = 0.1
m = int(np.log(p) / (error**2)) + 1
print("The minimum value of m for an error of {} is: {}".format(error, m))

The minimum value of m for an error of 0.1 is: 934
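For reference, the value above follows from $m = \lfloor \ln(p)/\varepsilon^2 \rfloor + 1$ with $p = 11314$ and $\varepsilon = 0.1$. scikit-learn also provides `johnson_lindenstrauss_min_dim`, which uses the more conservative bound $4\ln(p) / (\varepsilon^2/2 - \varepsilon^3/3)$. The sketch below only compares the two numbers; it does not claim either is the right choice for this corpus.

```python
import numpy as np
from sklearn.random_projection import johnson_lindenstrauss_min_dim

p = 11314    # number of training documents reported above
error = 0.1  # tolerated distortion

# Bound used in this notebook
m_notebook = int(np.log(p) / error**2) + 1

# Bound implemented by scikit-learn
m_sklearn = johnson_lindenstrauss_min_dim(n_samples=p, eps=error)

print("notebook bound:", m_notebook)
print("scikit-learn bound:", m_sklearn)
```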


In [5]:
# Set the density of the projection matrix to 1/sqrt(n), following the paper
density = 1 / np.sqrt(n)
#density=1/3

t0 = time()
rp=SparseRandomProjection(n_components=m, density=density)
rp.fit(tfidf)
print("done in %0.3fs." % (time() - t0))

done in 0.226s.


In [6]:
tfidf

<11314x28966 sparse matrix of type '<class 'numpy.float64'>'
	with 1067657 stored elements in Compressed Sparse Row format>

In [7]:
# Shape of the projection matrix stored by scikit-learn (m x n, i.e. R transposed)
rp.components_.shape

(934, 28966)

In [8]:
# Project the documents with the sparse product (m x n) . (n x p) -> (m x p)
M_prime = rp.components_.dot(tfidf.T)

In [9]:
M_prime=M_prime.T
M_prime.shape

(11314, 934)
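The manual product above should give the same result as calling `transform` on the fitted projection. The following sketch checks that equivalence on synthetic data; the random sparse matrix `X` and its dimensions are assumptions chosen only to keep the example small.

```python
import numpy as np
from scipy import sparse
from sklearn.random_projection import SparseRandomProjection

# Synthetic sparse "documents x terms" matrix (hypothetical data)
X = sparse.random(20, 500, density=0.05, format="csr", random_state=0)
rp = SparseRandomProjection(n_components=30, random_state=0).fit(X)

# Manual projection: X (p x n) times components_.T (n x m)
manual = (X @ rp.components_.T).toarray()

# Library projection; the result is sparse for sparse input unless dense_output=True
via_transform = rp.transform(X)
if sparse.issparse(via_transform):
    via_transform = via_transform.toarray()

print(manual.shape, np.allclose(manual, via_transform))
```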

In [10]:
# Check one vector to verify that its norm is approximately one
# (the projection preserves norms only up to the tolerated error)
np.linalg.norm(M_prime[3].toarray())

1.0496854095778152
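The norm is close to one rather than exactly one because the projection only preserves lengths approximately. A rough way to see the spread is to project many unit-norm vectors and look at the range of the resulting norms. The data below is synthetic and the sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.random_projection import SparseRandomProjection

rng = np.random.default_rng(0)
X = normalize(rng.random((50, 5000)))  # 50 unit-norm rows, 5000 features

rp = SparseRandomProjection(n_components=1000, random_state=0)
X_prime = rp.fit_transform(X)

norms = np.linalg.norm(X_prime, axis=1)
print("min norm: %.3f, max norm: %.3f" % (norms.min(), norms.max()))
```

Increasing `n_components` should narrow the range, at the cost of a larger reduced representation.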

In [11]:
# Convert the labels to one-hot format
labels_oh = np.zeros((M_prime.shape[0], 20))
labels_oh[np.arange(M_prime.shape[0]), labels] = 1
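The indexing trick used above sets, for each row, the column given by that row's label. A tiny example with made-up labels shows the result, together with an equivalent formulation based on `np.eye`.

```python
import numpy as np

labels = np.array([2, 0, 1, 2])  # hypothetical class indices
n_classes = 3

# Same construction as in the cell above
one_hot_indexed = np.zeros((len(labels), n_classes))
one_hot_indexed[np.arange(len(labels)), labels] = 1

# Equivalent: pick rows of the identity matrix
one_hot_eye = np.eye(n_classes)[labels]

print(one_hot_indexed)
```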

In [12]:
M_prime=M_prime.toarray()

In [13]:
# Build the test set
t0 = time()
with open('art_filt_test.txt', 'rb') as fp:
    articulos_test = pickle.load(fp)
with open('art_filt_test_labels.txt', 'rb') as fp:
    labels_test = pickle.load(fp)
# Reuse the vectorizer fitted on the training set so both sets share a vocabulary
tfidf_test = tfidf_vectorizer.transform(articulos_test)
print("done in %0.3fs." % (time() - t0))
p_test = tfidf_test.shape[0]
n_test = tfidf_test.shape[1]
print("Transformed {} articles; the vocabulary size is: {}".format(p_test, n_test))

done in 1.129s.
Transformed 7532 articles; the vocabulary size is: 28966


In [14]:
# Project the test set into the reduced space with the same matrix R

M_prime_test = rp.components_.dot(tfidf_test.T)
M_prime_test = M_prime_test.T.toarray()
# Check one vector to verify that its norm is approximately one
print(np.linalg.norm(M_prime_test[3]))


1.0094327733981099


In [15]:
# Convert the test labels to one-hot format
labels_test_oh = np.zeros((M_prime_test.shape[0], 20))
labels_test_oh[np.arange(M_prime_test.shape[0]), labels_test] = 1

In [16]:
def next_batch(num, data, labels):
    '''
    Return `num` samples and their labels, drawn at random without replacement.
    '''
    idx = np.arange(0, len(data))
    np.random.shuffle(idx)
    idx = idx[:num]
    data_shuffle = [data[i] for i in idx]
    labels_shuffle = [labels[i] for i in idx]
    return np.asarray(data_shuffle), np.asarray(labels_shuffle)


# Densify the TF-IDF matrices so they can be fed to the network directly
tfidf = tfidf.toarray()
tfidf_test = tfidf_test.toarray()
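When `data` and `labels` are NumPy arrays, the same sampling can be written with fancy indexing instead of Python list comprehensions. The variant below is a sketch under that assumption; the name `next_batch_vectorized` and the explicit `rng` argument are not part of the notebook.

```python
import numpy as np


def next_batch_vectorized(num, data, labels, rng):
    """Return `num` rows of `data` and `labels`, sampled without replacement."""
    idx = rng.permutation(len(data))[:num]
    return data[idx], labels[idx]


rng = np.random.default_rng(0)
data = np.arange(20).reshape(10, 2)  # row i is [2*i, 2*i + 1]
labels = np.arange(10)               # label i belongs to row i

batch_x, batch_y = next_batch_vectorized(4, data, labels, rng)
print(batch_x.shape, batch_y.shape)
```

Because both arrays are indexed with the same `idx`, each sampled row stays paired with its label.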

In [17]:
import tensorflow as tf
import shutil
from tensorboard import summary as summary_lib
logs_path="log_dir"
# Parameters
learning_rate = 0.01
training_epochs = 12
batch_size = 256
display_step = 1
hidden_units=30

# Network Parameters
n_input = M_prime.shape[1]  # Reduced dimension m (number of projected features)
n_classes = 20  # Number of classes in Twenty News Groups


with tf.name_scope("inputs"):
    # tf Graph input
    X = tf.placeholder("float", [None, n_input],name="X")
with tf.name_scope("labels"):
    Y = tf.placeholder("float", [None, n_classes],name="Y")

# Construct model
with tf.name_scope('Capa1'):
    # Model
    weights1= tf.Variable(tf.random_normal([n_input, hidden_units]),name="weights1")
    bias1= tf.Variable(tf.random_normal([hidden_units]),name="bias1")
    act1= tf.nn.sigmoid(tf.matmul(X,weights1)+bias1, name="activacion_1")

with tf.name_scope('Capa2'):
    # Model
    weights2= tf.Variable(tf.random_normal([hidden_units, n_classes]),name="weights2")
    bias2= tf.Variable(tf.random_normal([n_classes]),name="bias2")
    logits= tf.matmul(act1,weights2)+bias2

with tf.name_scope('Loss'):
# Define loss and optimizer
    loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(
        logits=logits, labels=Y),name="costo")
with tf.name_scope('BGD'):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate,name="optimizador")
    train_op = optimizer.minimize(loss_op)
with tf.name_scope('Accuracy'):
    # Accuracy
    #pred = tf.nn.softmax(logits) # Softmax
    acc_op = tf.equal(tf.argmax(logits, 1), tf.argmax(Y, 1))
    acc_op = tf.reduce_mean(tf.cast(acc_op, tf.float32),name="acc_red_mean")
    
# Initializing the variables
init = tf.global_variables_initializer()
# Create a summary to monitor cost tensor
tf.summary.scalar("loss", loss_op)
# Create a summary to monitor accuracy tensor
tf.summary.scalar("accuracy", acc_op)
# Create a histogram summary of the first-layer weights
tf.summary.histogram('histogram', weights1)
# Merge all summaries into a single op
merged_summary_op = tf.summary.merge_all()


t0 = time()
with tf.Session() as sess:
    sess.run(init)
    # op to write logs to Tensorboard
    summary_writer = tf.summary.FileWriter(logs_path, graph=sess.graph)
    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(M_prime.shape[0]/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_x, batch_y = next_batch(batch_size,M_prime,labels_oh)
            # Run optimization op (backprop) and cost op (to get loss value)
            _, c= sess.run([train_op, loss_op], feed_dict={Y: batch_y,
                                                            X: batch_x})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        if epoch % display_step == 0:
            #batch_x, batch_y = next_batch(batch_size,M_prime,labels_oh)
            #run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
            #run_metadata = tf.RunMetadata()
            summary, test_cost,_ = sess.run([merged_summary_op,loss_op,acc_op],
                                  feed_dict={X: M_prime_test, Y: labels_test_oh})#,
                                  #options=run_options,
                                  #run_metadata=run_metadata)
            summary_writer.add_summary(summary, epoch)
            print("Epoch:", '%04d' % (epoch+1), "train loss={:.9f} crossval loss={:.9f}".format(avg_cost,test_cost))
    print("Optimization Finished!")

    # Test model
    pred = tf.nn.softmax(logits)  # Apply softmax to logits
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(Y, 1))
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    print("Accuracy:", accuracy.eval({X: M_prime_test, Y: labels_test_oh})) 
print("done in %0.3fs." % (time() - t0))


Epoch: 0001 train loss=3.780539534 crossval loss=2.858926296
Epoch: 0002 train loss=2.387215094 crossval loss=2.384835005
Epoch: 0003 train loss=1.925596245 crossval loss=2.066501617
Epoch: 0004 train loss=1.544671796 crossval loss=1.842457891
Epoch: 0005 train loss=1.310173788 crossval loss=1.685601115
Epoch: 0006 train loss=1.115884893 crossval loss=1.571999073
Epoch: 0007 train loss=0.950864618 crossval loss=1.494475126
Epoch: 0008 train loss=0.859440439 crossval loss=1.443039179
Epoch: 0009 train loss=0.770634378 crossval loss=1.400556564
Epoch: 0010 train loss=0.689865410 crossval loss=1.379191160
Epoch: 0011 train loss=0.651552646 crossval loss=1.362631679
Epoch: 0012 train loss=0.587656390 crossval loss=1.334321737
Optimization Finished!
Accuracy: 0.6188263
done in 17.972s.
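The graph trained above is a single sigmoid hidden layer followed by a linear layer whose logits feed a softmax cross-entropy loss. The forward pass and loss can be written in plain NumPy to make the shapes explicit. This is an illustrative sketch with random weights and random inputs, not a reimplementation of the training loop.

```python
import numpy as np


def forward(X, W1, b1, W2, b2):
    """Sigmoid hidden layer followed by a linear output layer (logits)."""
    hidden = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))
    return hidden @ W2 + b2


def softmax_cross_entropy(logits, Y):
    """Mean cross-entropy between one-hot targets Y and softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(Y * log_probs).sum(axis=1).mean()


rng = np.random.default_rng(0)
n_input, hidden_units, n_classes = 934, 30, 20  # sizes used in the cell above

# Random unit-norm inputs and random one-hot targets (hypothetical data)
X = rng.normal(size=(4, n_input))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = np.eye(n_classes)[rng.integers(0, n_classes, size=4)]

W1, b1 = rng.normal(size=(n_input, hidden_units)), rng.normal(size=hidden_units)
W2, b2 = rng.normal(size=(hidden_units, n_classes)), rng.normal(size=n_classes)

logits = forward(X, W1, b1, W2, b2)
loss = softmax_cross_entropy(logits, Y)
accuracy = np.mean(logits.argmax(axis=1) == Y.argmax(axis=1))
print(logits.shape, loss, accuracy)
```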


In [17]:
import tensorflow as tf
import shutil
from tensorboard import summary as summary_lib
logs_path="log_dir_tfidf"
# Parameters
learning_rate = 0.01
training_epochs = 12
batch_size = 256
display_step = 1
hidden_units=5

# Network Parameters
n_input =  tfidf.shape[1] # Vocab size 
n_classes = 20 # Twenty news groups # classes


with tf.name_scope("inputs"):
    # tf Graph input
    X = tf.placeholder("float", [None, n_input],name="X")
with tf.name_scope("labels"):
    Y = tf.placeholder("float", [None, n_classes],name="Y")

# Construct model
with tf.name_scope('Capa1'):
    # Model
    weights1= tf.Variable(tf.random_normal([n_input, hidden_units]),name="weights1")
    bias1= tf.Variable(tf.random_normal([hidden_units]),name="bias1")
    act1= tf.nn.sigmoid(tf.matmul(X,weights1)+bias1, name="activacion_1")

with tf.name_scope('Capa2'):
    # Model
    weights2= tf.Variable(tf.random_normal([hidden_units, n_classes]),name="weights2")
    bias2= tf.Variable(tf.random_normal([n_classes]),name="bias2")
    logits= tf.matmul(act1,weights2)+bias2

with tf.name_scope('Loss'):
# Define loss and optimizer
    loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(
        logits=logits, labels=Y),name="costo")
with tf.name_scope('BGD'):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate,name="optimizador")
    train_op = optimizer.minimize(loss_op)
with tf.name_scope('Accuracy'):
    # Accuracy
    #pred = tf.nn.softmax(logits) # Softmax
    acc_op = tf.equal(tf.argmax(logits, 1), tf.argmax(Y, 1))
    acc_op = tf.reduce_mean(tf.cast(acc_op, tf.float32),name="acc_red_mean")
    
# Initializing the variables
init = tf.global_variables_initializer()
# Create a summary to monitor cost tensor
tf.summary.scalar("loss", loss_op)
# Create a summary to monitor accuracy tensor
tf.summary.scalar("accuracy", acc_op)
# Create a histogram summary of the first-layer weights
tf.summary.histogram('histogram', weights1)
# Merge all summaries into a single op
merged_summary_op = tf.summary.merge_all()

t0=time()
with tf.Session() as sess:
    sess.run(init)
    # op to write logs to Tensorboard
    summary_writer = tf.summary.FileWriter(logs_path, graph=sess.graph)
    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(tfidf.shape[0]/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_x, batch_y = next_batch(batch_size,tfidf,labels_oh)
            # Run optimization op (backprop) and cost op (to get loss value)
            _, c= sess.run([train_op, loss_op], feed_dict={Y: batch_y,
                                                            X: batch_x})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        if epoch % display_step == 0:
            #batch_x, batch_y = next_batch(batch_size,M_prime,labels_oh)
            #run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
            #run_metadata = tf.RunMetadata()
            summary, test_cost,_ = sess.run([merged_summary_op,loss_op,acc_op],
                                  feed_dict={X: tfidf_test, Y: labels_test_oh})#,
                                  #options=run_options,
                                  #run_metadata=run_metadata)
            summary_writer.add_summary(summary, epoch)
            print("Epoch:", '%04d' % (epoch+1), "train loss={:.9f} crossval loss={:.9f}".format(avg_cost,test_cost))
    print("Optimization Finished")

    # Test model
    pred = tf.nn.softmax(logits)  # Apply softmax to logits
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(Y, 1))
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    print("Accuracy:", accuracy.eval({X: tfidf_test, Y: labels_test_oh})) 
print("done in %0.3fs." % (time() - t0))

Epoch: 0001 train loss=3.491955540 crossval loss=3.109154224
Epoch: 0002 train loss=2.822648222 crossval loss=2.738812208
Epoch: 0003 train loss=2.421804791 crossval loss=2.453038931
Epoch: 0004 train loss=2.058099600 crossval loss=2.194232941
Epoch: 0005 train loss=1.727975141 crossval loss=1.967353582
Epoch: 0006 train loss=1.437958284 crossval loss=1.777373075
Epoch: 0007 train loss=1.207580090 crossval loss=1.624869823
Epoch: 0008 train loss=1.008474644 crossval loss=1.508641243
Epoch: 0009 train loss=0.855171972 crossval loss=1.413915753
Epoch: 0010 train loss=0.732527908 crossval loss=1.342490554
Epoch: 0011 train loss=0.632608278 crossval loss=1.286498785
Epoch: 0012 train loss=0.542441987 crossval loss=1.243498325
Optimization Finished
Accuracy: 0.65361124
done in 48.007s.
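When reading the two runs side by side, it helps to keep in mind that they differ in both input dimension and hidden-layer size. The snippet below computes the number of trainable parameters of each network from the layer sizes used above. The helper `count_parameters` is an illustrative name, and the accuracies and times are copied from the outputs above.

```python
def count_parameters(n_input, hidden_units, n_classes):
    """Weights and biases of a network with one hidden layer."""
    layer1 = n_input * hidden_units + hidden_units
    layer2 = hidden_units * n_classes + n_classes
    return layer1 + layer2


runs = {
    # name: (n_input, hidden_units, reported accuracy, reported time in seconds)
    "random projection": (934, 30, 0.6188263, 17.972),
    "full TF-IDF": (28966, 5, 0.65361124, 48.007),
}

for name, (n_input, hidden_units, acc, secs) in runs.items():
    params = count_parameters(n_input, hidden_units, n_classes=20)
    print("{:>18}: {:>7} parameters, accuracy {:.4f}, {:.1f}s".format(
        name, params, acc, secs))
```

Since the hidden layers have different sizes (30 versus 5 units), the accuracy gap cannot be attributed to the projection alone.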
