# Part 4: Ranked Retrieval and Document Vectorization (RRDV) with GENSIM
## Team members
* Juan Esteban Arboleda
* Luccas Rojas

## 1. Preprocessing
The first step toward ranked retrieval is to tokenize the documents and build the vocabulary. The vocabulary must be sorted, free of stop words, stemmed, and normalized.

* The documents and queries are loaded into a data structure below. Set DOCUMENTS_PATH and QUERIES_PATH to the folders that contain the documents and the queries, respectively.
* The output file is saved to GENSIM_RESULTS_FILE_PATH in the required format.

In [14]:
import os
import pandas as pd
import numpy as np
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import math
import time
from gensim import corpora, models

# Paths to set according to where the files are located
DOCUMENTS_PATH = '../data/docs-raw-texts'
QUERIES_PATH = '../data/queries-raw-texts'
GENSIM_RESULTS_FILE_PATH = "../data/GENSIM-consultas_resultados"
RELEVANCE_JUDGMENTS_FILE_PATH = "../data/relevance-judgments.tsv"

def load_documents(folder_path: str) -> pd.DataFrame:
    """
    Returns a Pandas DataFrame where each row represents a document in folder_path.
    The DataFrame will have as many rows as there are documents in folder_path

        Parameters
        ----------
            folder_path: str
                The path to the folder that contains the documents to load
    
        Returns
        --------
            documents: pd.DataFrame
                Pandas DataFrame with two columns: "filename" and "body"
    """
    documents = []
    index = []
    id = 1
    columns = ['filename', 'body']
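    # os.listdir() returns entries in arbitrary order; sort so that row ids follow the file numbering
    for filename in sorted(os.listdir(folder_path)):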
        text = pd.read_xml(os.path.join(folder_path, filename))['raw'].tolist()[1]
        filtered_text = text.replace('\n', ' ').replace('\xa0', ' ')
        document = [filename, filtered_text]
        documents.append(document)
        index.append(id)
        id += 1

    return pd.DataFrame(documents, index, columns)

documents = load_documents(DOCUMENTS_PATH)
queries = load_documents(QUERIES_PATH)

documents

| | filename | body |
|---|---|---|
| 1 | wes2015.d001.naf | William Beaumont and the Human Digestion. Wil... |
| 2 | wes2015.d002.naf | Selma Lagerlöf and the wonderful Adventures of... |
| 3 | wes2015.d003.naf | Ferdinand de Lesseps and the Suez Canal. Ferd... |
| 4 | wes2015.d004.naf | Walt Disney’s ‘Steamboat Willie’ and the Rise ... |
| 5 | wes2015.d005.naf | Eugene Wigner and the Structure of the Atomic ... |
| ... | ... | ... |
| 327 | wes2015.d327.naf | James Parkinson and Parkinson’s Disease. Wood... |
| 328 | wes2015.d328.naf | Juan de la Cierva and the Autogiro. Demonstra... |
| 329 | wes2015.d329.naf | Squire Whipple – The Father of the Iron Bridge... |
| 330 | wes2015.d330.naf | William Playfair and the Beginnings of Infogra... |


First, we tokenize the text using `word_tokenize` from the NLTK library.

In [15]:
nltk.download('punkt')
nltk.download('stopwords')

documents['tokens'] = documents['body'].apply(word_tokenize)
queries['tokens'] = queries['body'].apply(word_tokenize)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\juanc\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\juanc\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


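We drop punctuation tokens and single-character non-numeric tokens, then lowercase the rest (normalization). Note that multi-character contraction fragments produced by `word_tokenize`, such as `n't` or `'s`, pass this filter.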

In [16]:
def remove_punctuation(token_list):
    return [token.lower() for token in token_list if (token not in string.punctuation and (len(token)>1 or token.isnumeric()))]

documents['tokens']=documents['tokens'].apply(lambda x: remove_punctuation(x))
queries['tokens']=queries['tokens'].apply(lambda x: remove_punctuation(x))
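
To make the behavior of the filter concrete, here is a minimal sketch that runs the same predicate on a hand-written token list. The token list is hypothetical and stands in for the output of `word_tokenize`; no NLTK call is made.

```python
import string

def remove_punctuation(token_list):
    # Same predicate as in the cell above: keep a token if it is not a
    # punctuation character and is either longer than one character or numeric.
    return [
        token.lower()
        for token in token_list
        if token not in string.punctuation and (len(token) > 1 or token.isnumeric())
    ]

# Hypothetical tokens, shaped like what a tokenizer would emit
tokens = ["He", "did", "n't", "buy", "3", "apples", ",", "a", "pity", "."]
filtered = remove_punctuation(tokens)
print(filtered)
```

The single digit `3` survives because it is numeric, the article `a` is dropped because it is a single non-numeric character, and `n't` survives because it is neither a punctuation character nor a single character.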

After tokenizing and lowercasing, we remove stop words to shrink the vocabulary and keep them from affecting the final result. For this we use NLTK's `stopwords.words('english')`.

In [17]:
stop_words = set(stopwords.words('english'))

# TODO: confirm whether lowercasing alone counts as normalization
def remove_stop_words(token_list):
    return [token for token in token_list if token not in stop_words]

documents['tokens']=documents['tokens'].apply(lambda x: remove_stop_words(x))
queries['tokens']=queries['tokens'].apply(lambda x: remove_stop_words(x))

After removing the stop words, the remaining words are stemmed.

In [18]:
stemmer = PorterStemmer()
def stemming(token_list):
    return [stemmer.stem(token) for token in token_list]

documents['tokens']=documents['tokens'].apply(lambda x: stemming(x))
queries['tokens']=queries['tokens'].apply(lambda x: stemming(x))

At this point the text of every document and query is in a form that is easier to process, so we move on to the vector representation of documents and queries.

## 2. Data representation

First, the dictionary and the corpus are created.

In [19]:
dictionary = corpora.Dictionary(documents["tokens"])
corpus = [dictionary.doc2bow(text) for text in documents["tokens"]]
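
As an illustration of what a bag-of-words corpus looks like, the following pure-Python sketch builds a token-to-id mapping and converts each document into sorted `(token_id, count)` pairs. It only mimics the shape of the data; the toy documents, the alphabetical id assignment, and the helper `to_bow` are assumptions made for this example and are not how gensim assigns ids internally.

```python
from collections import Counter

# Toy tokenized documents (hypothetical, already preprocessed)
docs = [
    ["suez", "canal", "egypt", "canal"],
    ["iron", "bridg", "canal"],
]

# One integer id per distinct token; alphabetical order is an arbitrary choice here
vocabulary = sorted({token for doc in docs for token in doc})
token2id = {token: i for i, token in enumerate(vocabulary)}

def to_bow(tokens):
    """Return sorted (token_id, count) pairs, skipping tokens outside the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

corpus_bow = [to_bow(doc) for doc in docs]
print(token2id)
print(corpus_bow)
```

Tokens that are not in the vocabulary are silently ignored, which is also what happens to query terms that never appear in the document collection.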

The TF-IDF model is built from the corpus.

In [20]:
tfidf = models.TfidfModel(corpus)

The TF-IDF weights are computed for the corpus.

In [21]:
corpus_tfidf = tfidf[corpus]
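
The weighting can be sketched by hand. The example below assumes the scheme that `TfidfModel` applies with default parameters, namely raw term frequency times `log2(N / df)`, followed by L2 normalization of each vector. The toy corpus and the helper `tfidf_vector` are hypothetical and serve only to show the arithmetic.

```python
import math

# Toy bag-of-words corpus: each document maps token_id -> raw count (hypothetical data)
corpus_bow = [
    {0: 2, 1: 1},
    {1: 1, 2: 3},
    {2: 1},
]

def tfidf_vector(bow, corpus):
    """Weight each term by tf * log2(N / df), then scale the vector to unit length."""
    n_docs = len(corpus)
    weights = {}
    for token_id, tf in bow.items():
        df = sum(1 for doc in corpus if token_id in doc)  # document frequency
        weights[token_id] = tf * math.log2(n_docs / df)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    # Terms that occur in every document get weight 0 and are dropped
    return {t: w / norm for t, w in weights.items() if w != 0}

vector = tfidf_vector(corpus_bow[0], corpus_bow)
print(vector)
```

In the first document, token 0 occurs in only one document and ends up with a much larger weight than token 1, which is shared with another document.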

The similarity matrix is created. It can be queried for the similarity between a document and every document in the corpus.

In [22]:
from gensim import similarities
index = similarities.MatrixSimilarity(corpus_tfidf)
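
Conceptually, querying the index amounts to computing the cosine similarity between the query vector and each document vector and then sorting the scores. The sketch below shows that idea on sparse vectors stored as dictionaries; the vectors and the helper `cosine` are hypothetical and do not reflect how gensim stores the index internally.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {token_id: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF vectors, keyed by a 1-based document id
doc_vectors = {
    1: {0: 0.6, 1: 0.8},
    2: {1: 1.0},
    3: {2: 1.0},
}
query = {1: 1.0}

# Rank documents by decreasing similarity to the query
ranking = sorted(
    ((doc_id, cosine(query, vec)) for doc_id, vec in doc_vectors.items()),
    key=lambda item: item[1],
    reverse=True,
)
print(ranking)
```

Document 3 shares no term with the query, so its similarity is zero and it lands at the bottom of the ranking.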

## 3. Evaluation

The cosine similarity between each query and the corpus is computed, and the results file is written.

In [23]:
# This dictionary will be used to evaluate metrics
ret_docs = {}
output_lines = []

# Loop through queries
for _, query in queries.iterrows():
    query_str = query['filename'].replace('.naf', '').replace('wes2015.', '')

    # Compute the cosine similarity between the query and every document in the corpus
    query_bow = dictionary.doc2bow(query["tokens"])
    query_tfidf = tfidf[query_bow]
    sims = index[query_tfidf]
    sims = sorted(enumerate(sims, start=1), key=lambda item: item[1], reverse=True)

    ret_docs[query_str] = []
    written = []
    for docId, sim in sims:
        # Zero-padded document id, e.g. 7 -> "d007"
        entry = f"d{docId:03d}:" + str(sim)
        # Only documents with positive similarity go to the output file
        if sim > 0:
            written.append(entry)
        # Every document is added to ret_docs, in ranked order
        ret_docs[query_str].append(entry)

    # Joining avoids a trailing comma at the end of each line
    output_lines.append(query_str + " " + ",".join(written))

# Write the output file; mode "w" discards any previous contents
with open(GENSIM_RESULTS_FILE_PATH, "w") as file:
    file.write("\n".join(output_lines))
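
The line format of the results file can be sketched in isolation: a query id, a space, and a comma-separated list of `document:score` pairs restricted to positive scores. The helper `format_result_line` and the sample ranking are hypothetical and exist only for this illustration.

```python
def format_result_line(query_id, ranked):
    """Build one output line from (doc_id, score) pairs already sorted by score."""
    # Zero-pad document ids to three digits and keep only positive scores
    entries = [f"d{doc_id:03d}:{score}" for doc_id, score in ranked if score > 0]
    return f"{query_id} " + ",".join(entries)

line = format_result_line("q01", [(12, 0.5), (3, 0.25), (140, 0.0)])
print(line)
```

Building the list first and joining it with commas avoids having to decide, inside the loop, whether the current entry is the last one.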

## 4. Metrics

### Definition of the metric functions
These are the same functions defined in Part 1.

In [24]:
def precision(relevance: list) -> float:
    """
    Returns the precision of a query result.

        Params
        ------
            relevance: list
                A binary vector. The kth element indicates whether
                the kth returned document is relevant to the query:
                1 means relevant, 0 means not relevant.
    """
    relevance = np.array(relevance)
    num = np.sum(relevance)
    den = len(relevance)

    return num / den

def precision_at_k(relevance: list, k: int) -> float:
    """
    Returns the precision @ k of a query result.

        Params
        ------
            relevance: list
                A binary vector. The kth element indicates whether
                the kth returned document is relevant to the query:
                1 means relevant, 0 means not relevant.

            k: int
                Position until which the metric should be evaluated.
    """
    relevance = relevance[:k]
    
    return precision(relevance)

def recall_at_k(relevance: list, n_relevant_docs: int, k: int) -> float:
    """
    Returns the recall @ k of a query result.

        Params
        ------
            relevance: list
                A binary vector. The kth element indicates whether
                the kth returned document is relevant to the query:
                1 means relevant, 0 means not relevant.

            n_relevant_docs: int
                The number of documents relevant to the query.

            k: int
                Position until which the metric should be evaluated.
    """
    relevance = np.array(relevance)
    relevance = relevance[:k]

    num = np.sum(relevance)
    den = n_relevant_docs

    return num / den

def average_precision(relevance: list) -> float:
    """
    Returns the average precision of a query result

        Params
        ------
            relevance: list
                A binary vector. The kth element indicates whether
                the kth returned document is relevant to the query:
                1 means relevant, 0 means not relevant.
                The relevance list MUST contain all the relevant documents;
                otherwise recall never reaches 1 and the loop runs past the end.
    """

    k = 1
    n_relevant_documents = np.sum(relevance)
    current_rel_documents = 0
    current_p_at_k_sum = 0
    rec_at_k = 0
    while rec_at_k < 1:
        if relevance[k-1] == 1:
            current_rel_documents += 1
            current_p_at_k_sum += precision_at_k(relevance, k)
        
        rec_at_k = recall_at_k(relevance, n_relevant_documents, k)
        k += 1

    return current_p_at_k_sum / current_rel_documents

def dcg_i(relevance: list, i: int) -> float:
    """
    Returns the DCG_i. i.e. relevance_i / log2(max(i,2))

    Params
        ------
            relevance: list
                A numeric vector where the kth component of the vector
                represents the relevance of the kth returned document.
    """
    return relevance[i - 1] / math.log2(max(i,2))


def dcg_at_k(relevance: list, k: int) -> float:
    """
    Returns the DCG @ k of a query result.

        Params
        ------
            relevance: list
                A numeric vector where the kth component of the vector
                represents the relevance of the kth returned document.

            k: int
                Position until which the metric should be evaluated.
    """
    relevance = np.array(relevance)
    cr_sum = 0
    for i in range(1, k+1):
        cr_sum += dcg_i(relevance, i)

    return cr_sum

def ndcg_at_k(relevance: list, k: int) -> float:
    """
    Returns normalized DCG @ k of a query result.

        Params
        ------
            relevance: list
                A numeric vector where the kth component of the vector
                represents the relevance of the kth returned document.

            k: int
                Position until which the metric should be evaluated.
    """
    ordered_relevance = relevance.copy()
    ordered_relevance.sort(reverse=True)

    cr_sum1 = 0
    cr_sum2 = 0

    for i in range(1, k+1):
        cr_sum1 += dcg_i(relevance, i)
        cr_sum2 += dcg_i(ordered_relevance, i)

    return cr_sum1 / cr_sum2
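
A small worked example may help check the definitions by hand. The sketch below uses compact re-implementations of the same formulas (they are not the functions above) on a hypothetical ranking with three relevant documents, with the same discount `log2(max(i, 2))` for DCG.

```python
import math

def precision_at_k(rel, k):
    top = rel[:k]
    return sum(top) / len(top)

def recall_at_k(rel, n_relevant, k):
    return sum(rel[:k]) / n_relevant

def average_precision(rel):
    # Mean of P@k over the positions k that hold a relevant document
    hits = [precision_at_k(rel, k) for k in range(1, len(rel) + 1) if rel[k - 1] == 1]
    return sum(hits) / len(hits)

def dcg_at_k(gains, k):
    return sum(g / math.log2(max(i, 2)) for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical ranking: relevant documents at positions 1, 3 and 4 (M = 3)
bin_relevance = [1, 0, 1, 1, 0]
num_relevance = [3, 0, 2, 1, 0]
M = 3

p_at_m = precision_at_k(bin_relevance, M)   # 2 relevant in the top 3
r_at_m = recall_at_k(bin_relevance, M, M)   # 2 of the 3 relevant documents found
ap = average_precision(bin_relevance)       # (1/1 + 2/3 + 3/4) / 3
ndcg = ndcg_at_k(num_relevance, M)
print(f"P@M={p_at_m:.2f} R@M={r_at_m:.2f} AP={ap:.2f} NDCG@M={ndcg:.2f}")
```

Because the cutoff M equals the number of relevant documents, P@M and R@M share the same numerator and denominator, which is why the two columns coincide for every query in the results.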

### Metric evaluation

In [26]:
# Read the relevance judgments; the context manager closes the file afterwards
with open(RELEVANCE_JUDGMENTS_FILE_PATH, "r") as rj_file:
    rj_lines = rj_file.readlines()

results = {}

for rj in rj_lines:
    # Skip blank lines (readlines() keeps the trailing newline, so compare after stripping)
    if not rj.strip():
        continue
    
    temp = rj.strip().split()

    query_str = temp[0]
    relevant_docs_lst = temp[1].split(",")
    relevant_docs = {}
    returned_docs = [doc.split(":")[0] for doc in  ret_docs[query_str]]

    for doc in relevant_docs_lst:
        temp2 = doc.split(":")
        relevant_docs[temp2[0]] = float(temp2[1])


    n_relevant_docs = 0
    bin_relevance = []
    num_relevance = []
    
    i = 0
    while n_relevant_docs < len(relevant_docs):
        if i < len(returned_docs):
            ret_doc = returned_docs[i]
            if ret_doc in relevant_docs:
                bin_relevance.append(1)
                num_relevance.append(relevant_docs[ret_doc])
                n_relevant_docs += 1
            else:
                bin_relevance.append(0)
                num_relevance.append(0)
        else:
            bin_relevance.append(1)
            n_relevant_docs += 1
        
        i += 1

    results[query_str] = {
        "P@M": precision_at_k(bin_relevance, len(relevant_docs)),
        "R@M": recall_at_k(bin_relevance, len(relevant_docs), len(relevant_docs)),
        "AP": average_precision(bin_relevance),
        "NDCG@M": ndcg_at_k(num_relevance, len(relevant_docs))
    }


# Print results and calculate map
ap_sum = 0
for q_id in results:
    result = results[q_id]
    print(str(q_id), " - ",
          "P@M: ", "{:.2f}".format(result["P@M"]), ",\t",
          "R@M: ", "{:.2f}".format(result["R@M"]), ",\t",
          "AP: ", "{:.2f}".format(result["AP"]),",\t",
          "NDCG@M: ", "{:.2f}".format(result["NDCG@M"]), sep="")
    
    ap_sum += result["AP"]
        
mean_average_precision = ap_sum / len(results)
print("MAP - ", "{:.4f}".format(mean_average_precision))

q01 - P@M: 0.33,	R@M: 0.33,	AP: 0.70,	NDCG@M: 0.40
q02 - P@M: 0.64,	R@M: 0.64,	AP: 0.69,	NDCG@M: 0.63
q03 - P@M: 1.00,	R@M: 1.00,	AP: 1.00,	NDCG@M: 0.98
q04 - P@M: 0.71,	R@M: 0.71,	AP: 0.88,	NDCG@M: 0.78
q06 - P@M: 0.67,	R@M: 0.67,	AP: 0.86,	NDCG@M: 0.81
q07 - P@M: 0.25,	R@M: 0.25,	AP: 0.17,	NDCG@M: 0.26
q08 - P@M: 0.67,	R@M: 0.67,	AP: 0.77,	NDCG@M: 0.75
q09 - P@M: 0.83,	R@M: 0.83,	AP: 0.93,	NDCG@M: 0.89
q10 - P@M: 0.38,	R@M: 0.38,	AP: 0.30,	NDCG@M: 0.41
q12 - P@M: 1.00,	R@M: 1.00,	AP: 1.00,	NDCG@M: 0.96
q13 - P@M: 0.80,	R@M: 0.80,	AP: 0.74,	NDCG@M: 0.72
q14 - P@M: 0.58,	R@M: 0.58,	AP: 0.48,	NDCG@M: 0.48
q16 - P@M: 0.50,	R@M: 0.50,	AP: 0.36,	NDCG@M: 0.60
q17 - P@M: 0.50,	R@M: 0.50,	AP: 0.44,	NDCG@M: 0.70
q18 - P@M: 0.71,	R@M: 0.71,	AP: 0.81,	NDCG@M: 0.86
q19 - P@M: 0.50,	R@M: 0.50,	AP: 0.50,	NDCG@M: 1.00
q22 - P@M: 0.57,	R@M: 0.57,	AP: 0.51,	NDCG@M: 0.53
q23 - P@M: 0.25,	R@M: 0.25,	AP: 0.29,	NDCG@M: 0.60
q24 - P@M: 0.00,	R@M: 0.00,	AP: 0.09,	NDCG@M: 0.00
q25 - P@M: 0.50,	R@M: 0.50,	AP: