[View in Colaboratory](https://colab.research.google.com/github/paulanavarretec/Recsys-practicos/blob/master/Copia_de_Pra%CC%81ctico_pyreclab_3.ipynb)

# Tutorial Session 3 - Recommender Systems: Pyreclab

**Name(s):** Paula Navarrete - Astrid San Martín

## Setup

**Step 1:** Download the following files:

*   `dictionary.p`
*   `dictionary-stemm.p`
*   `tfidf_model.p`
*   `tfidf_model-stemm.p`

In [27]:
# Download the files by running this command
!curl -L -o 'resources.tar.gz' "https://drive.google.com/uc?export=download&id=1_Vp-veFfqCFkaEs-qVx99DYrAexBfq8w"

# Extract the archive
!tar -xvf resources.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   388    0   388    0     0    388      0 --:--:--  0:00:01 --:--:--   308
100 1754k  100 1754k    0     0  1754k      0  0:00:01  0:00:01 --:--:-- 1754k
resources/
resources/dictionary-stemm.p
resources/dictionary.p
resources/tfidf_model-stemm.p
resources/tfidf_model.p


**Step 1.5:** Download the dataset:

In [28]:
# You can download the dataset by running the following command
!curl -L -o 'dataset.tar.gz' "https://drive.google.com/uc?export=download&id=1by4BZRPeUSnQRbwJWc-OKpIF6YBpCa7s"

# And extract it with
!tar -xvf dataset.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   388    0   388    0     0    129      0 --:--:--  0:00:03 --:--:--   124
100 3117k    0 3117k    0     0  1039k      0 --:--:--  0:00:03 --:--:-- 1039k
./._corpus1.csv
corpus1.csv


**Step 2:** This practical requires the following dependencies to be installed:

In [29]:
!pip install nltk
!pip install scikit-learn
!pip install gensim
!pip install pandas
!pip install numpy



In [0]:
import os
import nltk
import sklearn
import gensim
import string
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer 
from gensim import corpora, models, similarities
from sklearn.neighbors import NearestNeighbors

## Preprocesamiento de datos

First, download the required NLTK resources:

In [31]:
# Download corpora
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

To begin, we load the dataset into a Pandas *dataframe* and print the first 5 records to inspect the structure of the data.

In [32]:
corpus_df = pd.read_csv('./corpus1.csv', sep='\t', header=None, encoding='latin')
corpus_df.columns = ['id', 'title', 'abstract']
corpus_df = corpus_df[['id', 'title', 'abstract']]
corpus_df[:5]

Unnamed: 0,id,title,abstract
0,100002,Nonlinear Shape Statistics in Mumford{Shah Bas...,We present a variational integration of nonlin...
1,100007,On the Complexity of Equilibria,"We prove complexity, approximability, and inap..."
2,100008,On QoS-Aware Publish-Subscribe,This position paper addresses the issue of sup...
3,10001,Checking Mergeable Priority Queues,We present an efficient algorithm which can ch...
4,100012,Mobile Code Security by Java Bytecode Instrume...,Mobile code provides significant opportunities...


Next, we implement a function that turns unstructured text into a list of processed tokens.

In [33]:
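stemm = False  # set to True to apply Porter stemming to each token
stemmer = PorterStemmer()

def get_tokens(text):
    # Lowercase, strip punctuation, then split into word tokens
    lowers = text.lower()
    no_punctuation = lowers.translate({ord(c): None for c in string.punctuation})
    tokens = nltk.word_tokenize(no_punctuation)
    if stemm:
        # Build a list: in Python 3, map() would return a lazy iterator
        tokens = [stemmer.stem(token) for token in tokens]

    return tokens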

get_tokens("I'm a super student for recommender systems!")

['im', 'a', 'super', 'student', 'for', 'recommender', 'systems']

**Question:** Explain in your own words what the function `get_tokens()` does.

**Answer:** It first converts the text to lowercase and then removes punctuation (via `translate`). Finally, it splits the text into a list with one word per position, without removing stop words. If `stemm` is enabled, each token is also reduced to its stem.
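
To make these steps concrete, here is a minimal sketch of the same pipeline using only the standard library. It assumes that splitting on whitespace is an acceptable stand-in for `nltk.word_tokenize`; `simple_tokens` and `PUNCT_TABLE` are hypothetical names that do not appear in the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def simple_tokens(text):
    """Lowercase, strip punctuation, split on whitespace (assumption: no NLTK)."""
    return text.lower().translate(PUNCT_TABLE).split()

tokens = simple_tokens("I'm a super student for recommender systems!")
print(tokens)
```

Stop words such as "a" and "for" remain in the result, which matches the behaviour described in the answer above.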



Next, we need to build a dictionary containing every word in the *corpus*.

We recommend reviewing the gensim documentation on how to use dictionaries: [corpora.dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html)

In [34]:
# Tokenize first: building the dictionary below needs this column
corpus_df['tokenized_abstract'] = corpus_df.abstract.map(get_tokens)

dict_file = './resources/dictionary-stemm.p' if stemm else './resources/dictionary.p'
if os.path.isfile(dict_file):
    # Reuse the precomputed dictionary downloaded in Step 1
    dictionary = corpora.Dictionary.load(dict_file)
else:
    dictionary = corpora.Dictionary(documents=corpus_df.tokenized_abstract.tolist())
    dictionary.save(dict_file)

corpus_df[:5]

Unnamed: 0,id,title,abstract,tokenized_abstract
0,100002,Nonlinear Shape Statistics in Mumford{Shah Bas...,We present a variational integration of nonlin...,"[we, present, a, variational, integration, of,..."
1,100007,On the Complexity of Equilibria,"We prove complexity, approximability, and inap...","[we, prove, complexity, approximability, and, ..."
2,100008,On QoS-Aware Publish-Subscribe,This position paper addresses the issue of sup...,"[this, position, paper, addresses, the, issue,..."
3,10001,Checking Mergeable Priority Queues,We present an efficient algorithm which can ch...,"[we, present, an, efficient, algorithm, which,..."
4,100012,Mobile Code Security by Java Bytecode Instrume...,Mobile code provides significant opportunities...,"[mobile, code, provides, significant, opportun..."


In [35]:
print(corpus_df['tokenized_abstract'][0][0], corpus_df['tokenized_abstract'][0][1],'...',corpus_df['tokenized_abstract'][0][len(corpus_df['tokenized_abstract'][0])-1])

we present ... framework


**Question:** Explain what the `tokenized_abstract` column of the dataframe contains.

**Answer:** The column holds the list of words produced by passing the abstract through *get_tokens*: the abstract as a list, with each word in a consecutive position, lowercased, and with stop words included.


In [36]:
corpus_df['bow'] = corpus_df.tokenized_abstract.map(dictionary.doc2bow)
del corpus_df['tokenized_abstract']
corpus = corpus_df['bow'].tolist()
corpus_df[:5]

Unnamed: 0,id,title,abstract,bow
0,100002,Nonlinear Shape Statistics in Mumford{Shah Bas...,We present a variational integration of nonlin...,"[(0, 1), (1, 1), (3, 1), (4, 1), (5, 1), (6, 1..."
1,100007,On the Complexity of Equilibria,"We prove complexity, approximability, and inap...","[(4, 5), (7, 2), (8, 1), (10, 1), (30, 3), (35..."
2,100008,On QoS-Aware Publish-Subscribe,This position paper addresses the issue of sup...,"[(1, 1), (4, 6), (7, 1), (16, 1), (22, 1), (27..."
3,10001,Checking Mergeable Priority Queues,We present an efficient algorithm which can ch...,"[(4, 7), (7, 2), (8, 1), (10, 1), (16, 1), (17..."
4,100012,Mobile Code Security by Java Bytecode Instrume...,Mobile code provides significant opportunities...,"[(4, 4), (5, 1), (7, 1), (10, 2), (16, 5), (22..."


**Question:** Explain what the `bow` column contains.

**Answer:** Bag of Words (BoW) is a way of representing a document in terms of the words present in the dictionary, so that it can be used in a model. It represents the set of words found in the document or abstract (after all the preprocessing described in the previous sections). Each entry of the BoW vector holds, in its first component, the dictionary index of a word contained in the abstract and, in its second component, the total number of times that word occurs in the document. This model ignores word order and only records occurrence.
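
As an illustration of the `(index, count)` pairs described above, the following sketch builds a toy vocabulary and bag-of-words vectors in plain Python. The helpers `build_vocabulary` and `to_bow` are hypothetical; gensim's `Dictionary` may assign ids in a different order, so only the structure of the output is meant to match `doc2bow`.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

docs = [["we", "present", "a", "model"], ["a", "model", "of", "a", "model"]]
vocab = build_vocabulary(docs)
bows = [to_bow(d, vocab) for d in docs]
print(bows)
```

In the second toy document, "a" and "model" each occur twice, so their entries carry a count of 2, while word order is lost.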



## Tf-idf

In [37]:
tfidf_model_file = 'resources/tfidf_model-stemm.p' if stemm else 'resources/tfidf_model.p'
if os.path.isfile(tfidf_model_file):
    # Reuse the precomputed model downloaded in Step 1
    tfidf_model = models.TfidfModel.load(tfidf_model_file)
else:
    tfidf_model = models.TfidfModel(corpus, dictionary=dictionary)
    tfidf_model.save(tfidf_model_file)

# Apply the model to every bag-of-words vector
corpus_df['tf_idf'] = tfidf_model[corpus_df.bow.tolist()]
corpus_df[:5]

Unnamed: 0,id,title,abstract,bow,tf_idf
0,100002,Nonlinear Shape Statistics in Mumford{Shah Bas...,We present a variational integration of nonlin...,"[(0, 1), (1, 1), (3, 1), (4, 1), (5, 1), (6, 1...","[(0, 0.19689725999527163), (1, 0.0861613877917..."
1,100007,On the Complexity of Equilibria,"We prove complexity, approximability, and inap...","[(4, 5), (7, 2), (8, 1), (10, 1), (30, 3), (35...","[(4, 0.0033554011043417254), (7, 0.02333778550..."
2,100008,On QoS-Aware Publish-Subscribe,This position paper addresses the issue of sup...,"[(1, 1), (4, 6), (7, 1), (16, 1), (22, 1), (27...","[(1, 0.06276351152911328), (4, 0.0049492930133..."
3,10001,Checking Mergeable Priority Queues,We present an efficient algorithm which can ch...,"[(4, 7), (7, 2), (8, 1), (10, 1), (16, 1), (17...","[(4, 0.0022699486545179476), (7, 0.01127724975..."
4,100012,Mobile Code Security by Java Bytecode Instrume...,Mobile code provides significant opportunities...,"[(4, 4), (5, 1), (7, 1), (10, 2), (16, 5), (22...","[(4, 0.001715799318906219), (5, 0.031751265629..."


**Question:** Explain what the `tf_idf` column contains and why it is useful in text processing. Name its 2 main components as part of explaining the score.

**Answer:** The goal is to score words by their frequency in order to capture context and identify the dominant words in a text. Some uninformative words appear very often, and their influence should be reduced. TF-IDF penalizes words that occur frequently across documents, such as "the" or "and", but carry no relevant information about the document's context for classification.

The TF-IDF approach computes the following scores:

- TF (*term frequency*): a score for how often the word occurs in the current document.
- IDF (*inverse document frequency*): a score for how rare the word is across documents.

The scores act as a weighting in which not all words are equally important or interesting, and they highlight words that are distinctive (that carry useful information) in a given document. This is useful in document processing because it distinguishes a frequent word that genuinely helps to identify the context from one that is merely a stop word (article, pronoun, etc.).
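
The two components can be sketched in a few lines of plain Python. This example assumes one common formulation: weight = count × log2(N / df), followed by L2 normalisation of each document vector. It is meant to illustrate the idea rather than reproduce the exact numbers from the saved `tfidf_model`; `tfidf_vectors` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf_vectors(bows):
    """Weight each (term_id, count) by count * log2(N / df), then L2-normalise."""
    n_docs = len(bows)
    # Document frequency: in how many documents each term appears.
    df = Counter(term_id for bow in bows for term_id, _ in bow)
    vectors = []
    for bow in bows:
        weighted = [(t, c * math.log2(n_docs / df[t])) for t, c in bow]
        # Terms present in every document get weight 0 and are dropped.
        weighted = [(t, w) for t, w in weighted if w > 0]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        vectors.append([(t, w / norm) for t, w in weighted] if norm else [])
    return vectors

# Term 0 plays the role of a stop word: it occurs in all three documents.
bows = [[(0, 1), (1, 2)], [(0, 3), (2, 1)], [(0, 1), (1, 1), (2, 1)]]
vectors = tfidf_vectors(bows)
for vec in vectors:
    print(vec)
```

Term 0 appears in every toy document, so its IDF is zero and it disappears from all vectors, even from the document where it has the highest raw count.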


## Generar recomendaciones

This section implements the functions needed to generate recommendations from what a user has consumed. We artificially sample 3 random documents to represent the target user (`samples`). You will then generate different recommendations and evaluate the results.

In [38]:
# Random user

samples = corpus_df.sample(3)

for n, (ix, paper) in enumerate(samples.iterrows()):
  idx, title, abstract, bow, tf_idf = paper
  print('%d) %s' % (n+1, title))
  print('')
  print(abstract)
  print('\n' )

1) From a Trickle to a Flood: Active Attacks on Several Mix Types

The literature contains a variety of different mixes, some of which have been used in deployed anonymity systems. We explore their anonymity and message delay properties, and show how to mount active attacks against them by altering the traffic between the mixes.


2) Reasoning about Infinite Computations

We investigate extensions of temporal logic by connectives defined by finite automata on infinite words. We consider three different logics, corresponding to three different types of acceptance conditions (finite, looping and repeating) for the automata. It turns out, however, that these logics all have the same expressive power and that their decision problems are all PSPACE-complete. We also investigate connectives defined by alternating automata and show that they do not increase the expressive power of the logic or the complexity of the decision problem. 1 Introduction For many years, logics of programs have been 

In [0]:
# Recommendation functions

N = len(dictionary)

def to_sparse(matrix):
    return csr_matrix([gensim.matutils.sparse2full(row, length=N) for row in matrix]) 

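def make_recommendations(model, metric, neighbors):
    # Index every document using the chosen representation (a column name in corpus_df)
    X = to_sparse(corpus_df[model].tolist())
    # Ask for one extra neighbor because each query document is its own nearest match
    document_index = NearestNeighbors(n_neighbors=(neighbors + 1), algorithm='brute', metric=metric).fit(X)
    return document_index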

def print_recommendations(indexes, model):
    for n, (ix, paper) in enumerate(samples.iterrows()):
        dists, neighbors = indexes.kneighbors([gensim.matutils.sparse2full(paper[model],length=N)])
        print(paper['title'])
        print('')
        print('Documentos cercanos: ')
        i = 1
        for neighbour in neighbors[0]:
            if ix != neighbour:
                line = str(i) + ". " + corpus_df.iloc[neighbour]['title']
                print(line)
                i+=1
        print('\n')

Next, use the functions implemented above to generate new recommendations by varying the model parameters. Add new cells for each implementation and/or question.


**Question:** Run the model using the tf-idf representation and a Euclidean distance metric. Set the number of nearest neighbors to 5, 10, and 20. What effect does this have on the observed recommendations?

**Answer:** The existing recommendations do not change; new ones are simply appended as the number of nearest neighbors returned grows from 5 to 10 to 20.
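
This behaviour follows from how a brute-force search works: all documents are ranked by distance to the query and the first k are returned, so a smaller result list is always a prefix of a larger one. The sketch below shows this on toy 2-D points; `k_nearest` is a hypothetical helper (not scikit-learn's `NearestNeighbors`) and it requires Python 3.8+ for `math.dist`.

```python
import math

def k_nearest(query, points, k):
    """Brute force: sort all points by Euclidean distance to the query, keep k."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return order[:k]

points = [(0.0, 1.0), (0.9, 0.1), (0.5, 0.5), (1.0, 0.0), (0.2, 0.8)]
query = (1.0, 0.2)

top2 = k_nearest(query, points, 2)
top4 = k_nearest(query, points, 4)
print(top2, top4)
```

Increasing k only extends the list with documents that are progressively farther from the query.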

**Question:** Choosing a fixed value for *nearest neighbors* and using the tf-idf representation, run the model with the *cosine* distance metric. What effect does the distance metric have on the observed recommendations?

**Answer:** The *cosine* metric produces no change in the recommendations compared with Euclidean distance: when requesting the 10 nearest neighbors, they are the same for both metrics.
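
A likely explanation, assuming the tf-idf vectors are normalised to unit length (which gensim's `TfidfModel` does by default, as far as we know), is that for unit vectors the two metrics are monotonically related: ‖a − b‖² = 2(1 − cos(a, b)). The following sketch checks this on toy vectors; every name in it is hypothetical.

```python
import math

def normalise(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # Inputs are assumed to be unit vectors, so the dot product is the cosine.
    return 1.0 - sum(x * y for x, y in zip(a, b))

docs = [normalise(v) for v in ([1, 2, 0], [0, 1, 3], [2, 2, 1], [1, 0, 0])]
query = normalise([1, 1, 1])

by_euclidean = sorted(range(len(docs)), key=lambda i: euclidean(query, docs[i]))
by_cosine = sorted(range(len(docs)), key=lambda i: cosine_distance(query, docs[i]))
print(by_euclidean, by_cosine)
```

Both rankings coincide here. With unnormalised vectors (for example raw `bow` counts) the two metrics can order neighbors differently, because Euclidean distance is then sensitive to document length.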


In [41]:
# Recommendation example
 
doc_idx = make_recommendations('tf_idf', 'euclidean', 5)
print_recommendations(doc_idx, 'tf_idf')

From a Trickle to a Flood: Active Attacks on Several Mix Types

Documentos cercanos: 
1. Comparison Between Two Practical Mix Designs
2. Mix-networks with Restricted Routes
3. Wireless Authentication Protocol Preserving User Anonymity
4. Gap { Practical Anonymous Networking
5. Probabilistic Analysis of Anonymity


Reasoning about Infinite Computations

Documentos cercanos: 
1. Stability and Sequentiality in Dataflow Networks
2. An Automata-Theoretic Approach to Linear Temporal Logic
3. Enhanced Propositional Dynamic Logic for Reasoning about Concurrent Actions (extended abstract)
4. Real-time Logics: Complexity and Expressiveness
5. Locally Linear Time Temporal Logic


On Probabilistic Model Checking

Documentos cercanos: 
1. Algebraic Reasoning for Probabilistic Concurrent Systems
2. A Testing Scenario for Probabilistic Automata
3. Probabilistic Temporal Logics via the Modal Mu-Calculus
4. A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database Sys

In [42]:
# Recommendations: tf-idf, Euclidean distance, 10 nearest neighbors
doc_idx1 = make_recommendations('tf_idf', 'euclidean', 10)
print_recommendations(doc_idx1, 'tf_idf')

From a Trickle to a Flood: Active Attacks on Several Mix Types

Documentos cercanos: 
1. Comparison Between Two Practical Mix Designs
2. Mix-networks with Restricted Routes
3. Wireless Authentication Protocol Preserving User Anonymity
4. Gap { Practical Anonymous Networking
5. Probabilistic Analysis of Anonymity
6. On the Economics of Anonymity
7. Synchronous Batching: From Cascades to Free Routes
8. Generalising Mixes
9. Anonymizing Censorship Resistant Systems
10. Defending Anonymous Communications Against Passive Logging Attacks


Reasoning about Infinite Computations

Documentos cercanos: 
1. Stability and Sequentiality in Dataflow Networks
2. An Automata-Theoretic Approach to Linear Temporal Logic
3. Enhanced Propositional Dynamic Logic for Reasoning about Concurrent Actions (extended abstract)
4. Real-time Logics: Complexity and Expressiveness
5. Locally Linear Time Temporal Logic
6. Evolving Deterministic Finite Automata Using Cellular Encoding
7. State Clock Logic: a Decidable 

In [43]:
# Recommendations: tf-idf, Euclidean distance, 20 nearest neighbors
doc_idx2 = make_recommendations('tf_idf', 'euclidean', 20)
print_recommendations(doc_idx2, 'tf_idf')

From a Trickle to a Flood: Active Attacks on Several Mix Types

Documentos cercanos: 
1. Comparison Between Two Practical Mix Designs
2. Mix-networks with Restricted Routes
3. Wireless Authentication Protocol Preserving User Anonymity
4. Gap { Practical Anonymous Networking
5. Probabilistic Analysis of Anonymity
6. On the Economics of Anonymity
7. Synchronous Batching: From Cascades to Free Routes
8. Generalising Mixes
9. Anonymizing Censorship Resistant Systems
10. Defending Anonymous Communications Against Passive Logging Attacks
11. The Eternity Service
12. Remote Timing Attacks are Practical
13. Provably Secure Public-Key Encryption for Length-Preserving Chaumian Mixes
14. Secure Routing in Wireless Sensor Networks: Attacks and Countermeasures
15. Attacking DDoS at the Source
16. Steady-State Analysis of the Rate-Based Congestion Control Mechanism for ABR Services in ATM Networks
17. Secure and Resilient Peer-to-Peer E-Mail: Design and Implementation
18. Fair Off-Line e-Cash made e

In [44]:
# Recommendations: tf-idf, cosine distance, 10 nearest neighbors
doc_idx1 = make_recommendations('tf_idf', 'cosine', 10)
print_recommendations(doc_idx1, 'tf_idf')

From a Trickle to a Flood: Active Attacks on Several Mix Types

Documentos cercanos: 
1. Comparison Between Two Practical Mix Designs
2. Mix-networks with Restricted Routes
3. Wireless Authentication Protocol Preserving User Anonymity
4. Gap { Practical Anonymous Networking
5. Probabilistic Analysis of Anonymity
6. On the Economics of Anonymity
7. Synchronous Batching: From Cascades to Free Routes
8. Generalising Mixes
9. Anonymizing Censorship Resistant Systems
10. Defending Anonymous Communications Against Passive Logging Attacks


Reasoning about Infinite Computations

Documentos cercanos: 
1. Stability and Sequentiality in Dataflow Networks
2. An Automata-Theoretic Approach to Linear Temporal Logic
3. Enhanced Propositional Dynamic Logic for Reasoning about Concurrent Actions (extended abstract)
4. Real-time Logics: Complexity and Expressiveness
5. Locally Linear Time Temporal Logic
6. Evolving Deterministic Finite Automata Using Cellular Encoding
7. State Clock Logic: a Decidable 