# Installing the relevant libraries

Throughout this project I use two BM25 implementations, the `BM25Okapi` class from the `rank_bm25` package and one written from scratch, together with HuggingFace's `evaluate` module, which provides the `trec_eval` metric, and its dependency `trectools`.

In [1]:
!pip install rank_bm25
!pip install evaluate
!pip install trectools

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.4/81.4 KB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m132.0/132.0 KB[0m [31m7.8 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub>=0.7.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.3/190.3 KB[0m [31m16.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting responses

# Mounting the drive

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Importing the relevant libraries

In [3]:
import numpy as np
import pandas as pd
import random
import string
import nltk
import json
import math
from time import time
from evaluate import load
from rank_bm25 import BM25Okapi
import warnings
warnings.filterwarnings('ignore')

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Defining the DataLoader

The DataLoader is a class responsible for loading all the files needed for this activity. Its implementation has 4 methods:


*   `_load_document_set` -> Loads the documents from the CISI.ALL file
*   `_load_query_set` -> Loads the queries from the CISI.QRY file
*   `_load_relevant_set` -> Loads the relevance judgments (which documents are relevant to each query) from the CISI.REL file
*   `load` -> Runs the other three methods and returns everything they loaded.
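
To make the file layout concrete, here is a minimal sketch of the parsing idea on an in-memory sample. It assumes the CISI.ALL layout that the loader below relies on (`.I` for the id, `.X` closing a record, every other `.`-prefixed marker carrying text); `parse_cisi_documents` and the sample text are hypothetical and are not part of the notebook.

```python
def parse_cisi_documents(raw_text):
    """Parse CISI.ALL-style text into {zero_based_id: text}."""
    # Join continuation lines onto the field marker that precedes them.
    records = []
    for line in raw_text.splitlines():
        line = line.strip()
        if line.startswith("."):
            records.append(line)
        elif records:
            records[-1] += " " + line

    documents = {}
    doc_id, parts = None, []
    for record in records:
        if record.startswith(".I"):        # .I carries the 1-based document id
            doc_id = int(record.split()[1]) - 1
        elif record.startswith(".X"):      # .X (cross references) closes the record
            documents[doc_id] = " ".join(parts)
            doc_id, parts = None, []
        else:                              # title, author, abstract, ...
            parts.append(record[3:].strip())
    return documents


# Made-up sample in the assumed layout
sample = """.I 1
.T
A Sample Title
.A
Doe, J.
.W
First line of the abstract.
Second line of the abstract.
.X
1\t5\t1
.I 2
.T
Another Title
.W
Another abstract.
.X
2\t5\t2
"""

docs = parse_cisi_documents(sample)
print(docs[0])
```

As in the loader below, ids are shifted to start at zero so they can double as positions in the tokenized corpus.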



In [4]:
PATH = '/content/drive/MyDrive/Colab Notebooks/IR/data/'

In [5]:
class DataLoader():
    """
    Class responsible for handling data load. It is implemented specifically for CISI dataset,
    but it can be modified to handle other datasets.

    Methods:
        _load_document_set: responsible for loading the document set (CISI.ALL)
            Returns: 
                document_set: a dict with keys as indexes of documents and values as texts.
        _load_query_set: responsible for loading the query set (CISI.QRY)
            Returns:
                query_set: a dict with keys as indexes of queries and values as texts.
        _load_relevant_set: responsible for loading the relevant set (CISI.REL)
            Returns:
                relevant_set: a dict with keys as indexes of queries and values as lists of relevant document indexes.
    """

    def __init__(self, PATH):
        self.path = PATH

    def _load_document_set(self):
        # Getting document set
        document_set = {}
        doc_id = ""
        doc_text = ""

        # Opening file
        with open(self.path + 'CISI.ALL') as f:
            lines = ""
            
            # Iterating through lines and parsing 
            for l in f.readlines():
                lines += "\n" + l.strip() if l.startswith(".") else " " + l.strip()
            lines = lines.lstrip("\n").split("\n")

        # Getting relevant lines
        for l in lines:
            if l.startswith(".I"): # .I stands for index
                doc_id = int(l.split(" ")[1].strip())-1
            elif l.startswith(".X"): # .X may not be relevant
                document_set[doc_id] = doc_text.lstrip(" ")
                doc_id = ""
                doc_text = ""
            else: # Concating author, title and text
                doc_text += l.strip()[3:] + " " 
        
        return document_set


    def _load_query_set(self):
        # Getting query set
        query_set = {}
        query_id = ""

        # Opening file
        with open(self.path + 'CISI.QRY') as f:
            lines = ""

            # Iterating through lines and parsing
            for l in f.readlines():
                lines += "\n" + l.strip() if l.startswith(".") else " " + l.strip()
            lines = lines.lstrip("\n").split("\n")
            
        # Getting relevant lines
        for l in lines:
            if l.startswith(".I"): # .I stands for index
                query_id = int(l.split(" ")[1].strip()) -1
            elif l.startswith(".W"): # .W stands for the query text
                query_set[query_id] = l.strip()[3:]
                query_id = ""

        return query_set

    def _load_relevant_set(self):
        # Getting relevant set
        relevant_set = {}

        # Opening file
        with open(self.path + 'CISI.REL') as f:

            # Iterating through lines and parsing 
            for l in f.readlines():
                qry_id = int(l.lstrip(" ").strip("\n").split("\t")[0].split(" ")[0]) -1
                doc_id = int(l.lstrip(" ").strip("\n").split("\t")[0].split(" ")[-1])-1
                if qry_id in relevant_set:
                    relevant_set[qry_id].append(doc_id)
                else:
                    relevant_set[qry_id] = []
                    relevant_set[qry_id].append(doc_id)

        return relevant_set


    def load(self):
        document_set = self._load_document_set()
        query_set    = self._load_query_set()
        relevant_set = self._load_relevant_set()

        return document_set, query_set, relevant_set

In [6]:
dataloader = DataLoader(PATH)

In [7]:
document_set, query_set, relevant_set = dataloader.load()

In [8]:
document_set[0]

"18 Editions of the Dewey Decimal Classifications Comaromi, J.P. The present study is a history of the DEWEY Decimal Classification.  The first edition of the DDC was published in 1876, the eighteenth edition in 1971, and future editions will continue to appear as needed.  In spite of the DDC's long and healthy life, however, its full story has never been told.  There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad. "

In [9]:
query_set[0]

'What problems and concerns are there in making up descriptive titles? What difficulties are involved in automatically retrieving articles from approximate titles? What is the usual relevance of the content of articles to their titles?'

In [10]:
relevant_set[0]

[27,
 34,
 37,
 41,
 42,
 51,
 64,
 75,
 85,
 149,
 188,
 191,
 192,
 194,
 214,
 268,
 290,
 319,
 428,
 464,
 465,
 481,
 482,
 509,
 523,
 540,
 575,
 581,
 588,
 602,
 649,
 679,
 710,
 721,
 725,
 782,
 812,
 819,
 867,
 868,
 893,
 1161,
 1163,
 1194,
 1195,
 1280]

# List of queries without relevant documents

Some queries have no entry in the relevance file. I keep them in a dedicated variable for later use, so that no metrics are computed for those queries.
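
As an illustration of what that variable holds, the same difference can be sketched with plain Python sets. The two dictionaries here are made-up toy data; the notebook itself uses `np.setdiff1d` on the real key lists.

```python
# Toy stand-ins for query_set and relevant_set (hypothetical data)
toy_query_set = {0: "first query", 1: "second query", 2: "third query"}
toy_relevant_set = {0: [3, 7], 1: [4]}

# Query ids that never appear in the relevance judgments
toy_queries_without_rel = sorted(set(toy_query_set) - set(toy_relevant_set))
print(toy_queries_without_rel)
```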

In [11]:
queries_with_no_relevant_docs = np.setdiff1d(list(query_set.keys()),list(relevant_set.keys()))

# Cleaning the text and building the corpus

Data cleaning is an important step, and a customizable one, since different text-cleaning strategies can be chosen.
Here I chose the following path:

- remove punctuation and special characters (the characters in `string.punctuation`);
- apply .lower() to every word;
- tokenization;
- removal of English stopwords;
- stemming (reducing each word to its stem).
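
The steps above can be sketched on a single sentence using only the standard library. This is a simplification under stated assumptions: `TOY_STOPWORDS` is a tiny hypothetical list, whitespace splitting stands in for `nltk.tokenize.word_tokenize`, and stemming is left out, so the tokens differ from what the real pipeline produces.

```python
import string

# Hypothetical, tiny stopword set; the notebook uses NLTK's English list
TOY_STOPWORDS = {"the", "is", "a", "of", "in", "and"}


def toy_preprocess(text, stopwords=TOY_STOPWORDS):
    # 1) strip punctuation and 2) lowercase
    cleaned = text.translate(str.maketrans("", "", string.punctuation)).lower()
    # 3) whitespace tokenization (stand-in for NLTK's tokenizer)
    tokens = cleaned.split()
    # 4) stopword removal; stemming is intentionally omitted in this sketch
    return [tok for tok in tokens if tok not in stopwords]


tokens = toy_preprocess("The history of the Dewey Decimal Classification, in brief.")
print(tokens)
```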

In [12]:
class PreProcessing:
    """
    Class that preprocesses all corpus data
    Methods:
        _remove_special_char: removes special characters based on string.punctuation module
            Args:
                input_string: string to clean
            Returns:
                string without punctuations
        _preprocess_string: Apply _remove_special_char followed by tokenization and stem, removing stopwords
            Args:
                input_string: string to clean
            Returns:
                tokens: list of cleaned tokens
        preprocess_corpus: Apply _preprocess_string to a set of documents
            Args:
                corpus_set: a dict containing the texts to clean in its values
            Returns:
                a list of cleaned and tokenized texts
    """

    def __init__(self, stopwords: list, stemmer):
        self.stopwords = stopwords
        self.stemmer   = stemmer 

    def _remove_special_char(self, input_string: str):
        return input_string.translate(str.maketrans('','', string.punctuation)).lower()

    def _preprocess_string(self, input_string: str):
     
        # Removing special characters
        txt = self._remove_special_char(input_string)

        # creating tokens
        tokens = nltk.tokenize.word_tokenize(txt) 
        
        # removing stopwords and applying stemmer
        tokens = [self.stemmer.stem(tk) for tk in tokens if tk not in self.stopwords]

        return tokens

    def preprocess_corpus(self, corpus_set: dict):
        return [self._preprocess_string(txt) for txt in corpus_set.values()]

In [13]:
# Creating the stemmer and stopwords list
stemmer = nltk.stem.PorterStemmer()
stopwords = nltk.corpus.stopwords.words('english')

# Applying preprocessing class to documents and queries
preprocess = PreProcessing(stopwords = stopwords, stemmer = stemmer)
document_corpus = preprocess.preprocess_corpus(document_set)
query_corpus    = preprocess.preprocess_corpus(query_set)

The same cleaning applied to the documents must also be applied to the queries, so that query tokens can match corpus tokens.

# BM25 class from scratch
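
For reference, the scoring computed by the class in the next cell can be written as follows. The notation is my own and does not appear in the notebook: $N$ is the number of documents, $n(t)$ the number of documents containing term $t$, $f(t, D)$ the frequency of $t$ in document $D$, $|D|$ the length of $D$ in tokens, and $\mathrm{avgdl}$ the average document length. The class defaults are $k_1 = 1.5$ and $b = 0.75$.

$$
\mathrm{idf}(t) = \ln\left(1 + \frac{N - n(t) + 0.5}{n(t) + 0.5}\right)
$$

$$
\mathrm{score}(D, Q) = \sum_{t \in Q,\; f(t, D) > 0} \mathrm{idf}(t)\,\frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
$$

The sum runs over the query tokens as given, so a term repeated in the query contributes once per occurrence.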

In [14]:
class BM25:
    """
    Implementation of BM25 algorithm. It computes the term frequencies, document frequencies
    and scores for a given set of documents and query parameters.
    Methods:
        fit: given a set of corpus, it computes all the necessary statistics to 
            get bm25 matching scores 
        _score: computes a single score based on a query and an index
        get_scores: computes all scores based on a query and the given corpus

    """

    def __init__(self, k1=1.5, b=0.75):
        self.b = b
        self.k1 = k1

    def fit(self, corpus):
        """
        Function that fits the statistics that are required to calculate BM25 ranking
        score using a given corpus.

        Args:
            corpus : list[list[str]]
                Each element in the list represents a document, and each document
                is a list of the terms.
        Returns:
            self
        """
        tf = []
        df = {}
        idf = {}
        doc_len = []
        corpus_size = 0
        for document in corpus:
            corpus_size += 1
            doc_len.append(len(document))

            # compute tf (term frequency) per document
            frequencies = {}
            for term in document:
                term_count = frequencies.get(term, 0) + 1
                frequencies[term] = term_count

            tf.append(frequencies)

            # compute df (document frequency) per term
            for term, _ in frequencies.items():
                df_count = df.get(term, 0) + 1
                df[term] = df_count

        # compute the inverse document frequency
        for term, freq in df.items():
            idf[term] = math.log(1 + (corpus_size - freq + 0.5) / (freq + 0.5))

        self.tf_ = tf
        self.df_ = df
        self.idf_ = idf
        self.doc_len_ = doc_len
        self.corpus_ = corpus
        self.corpus_size_ = corpus_size
        self.avg_doc_len_ = sum(doc_len) / corpus_size
        return self

    def _score(self, query, index):
        """
        Function that computes the BM25 score of one document for a query
        Args:
            query: tokenized query (list of terms)
            index: the index of the document in the corpus
        Returns:
            score: BM25 score of that document for the given query
        """
        score = 0.0

        doc_len = self.doc_len_[index]
        frequencies = self.tf_[index]
        for term in query:
            if term not in frequencies:
                continue

            freq = frequencies[term]
            numerator = self.idf_[term] * freq * (self.k1 + 1)
            denominator = freq + self.k1 * (1 - self.b + self.b * doc_len / self.avg_doc_len_)
            score += (numerator / denominator)

        return score

    def get_scores(self, query):
        """
        Function that get scores for all the documents in the corpus
        Args:
            query: input query
        Returns:
            scores: a list of all the scores
        """
        scores = [self._score(query, index) for index in range(self.corpus_size_)]
        return scores
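
To see the scoring at work, here is a compact, self-contained sketch on a three-document toy corpus. It follows the same idf and normalization as the class above, but `bm25_scores` and the toy tokens are hypothetical and only meant for illustration.

```python
import math
from collections import Counter


def bm25_scores(corpus, query, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus` against a tokenized `query`."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    term_freqs = [Counter(doc) for doc in corpus]
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for tf in term_freqs for term in tf)

    scores = []
    for doc, tf in zip(corpus, term_freqs):
        score = 0.0
        for term in query:
            f = tf.get(term, 0)
            if f == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = 1 - b + b * len(doc) / avg_len
            score += idf * f * (k1 + 1) / (f + k1 * norm)
        scores.append(score)
    return scores


# Made-up, already stemmed tokens
toy_corpus = [["titl", "retriev", "articl"], ["librari", "classif"], ["titl", "index"]]
toy_scores = bm25_scores(toy_corpus, ["titl", "retriev"])
print(toy_scores)
```

The first document matches both query terms and should rank first; the second shares no term with the query and scores zero.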

# Defining the `SearchEngine` class

This step implements a class that handles the whole search-engine logic: loading the data, fitting the search algorithm (only BM25 is used here, but the class was designed to be algorithm-agnostic), and computing the metrics.

In [15]:
def timer_func(func):
    # This function shows the execution time of 
    # the function object passed
    def wrap_func(*args, **kwargs):
        t1 = time()
        result = func(*args, **kwargs)
        t2 = time()
        print(f'Function {func.__name__!r} executed in {(t2-t1):.4f}s')
        return result
    return wrap_func


class SearchEngine:
    """
    This class implements all the logic for a search engine, from data loading to
    preprocessing, tokenization, searching with a given algorithm, and evaluation following
    the trec_eval framework for metrics extraction.
    Methods:
        _load_data: given a DataLoader object, performs its methods for data loading
        _preprocess: given a PreProcessing object, performs its methods for cleaning and tokenization
        _fit: given an algorithm (either Okapi or the scratch BM25), fits it to the tokenized corpus
        _results_from_query: given an index and a tokenized query, searches in 
            the entire corpus, scoring documents and retrieving relevant matches
        _data_format_trec_eval: formats relevant set and query results in order to fit the
            trec_eval API.
        _extract_metrics: performs metrics extractions given by the trec_eval API.
        _search: given all the query set, performs the _results_from_query method.
        run: runs the entire pipeline returning metrics and a set of retrieved documents.
    """

    def __init__(self, dataloader, preprocessing, algorithm, trec_eval):
        self.dataloader    = dataloader
        self.preprocessing = preprocessing
        self.algorithm     = algorithm
        self.trec_eval     = trec_eval

    @timer_func
    def _load_data(self):
        """
        Function responsible for loading all data based on the DataLoader object provided.
        """
        data = self.dataloader.load()

        self.document_set = data[0]
        self.query_set    = data[1]
        self.relevant_set = data[2]
        self.queries_with_no_relevant_docs = np.setdiff1d(list(self.query_set.keys()),list(self.relevant_set.keys()))

    @timer_func
    def _preprocess(self):
        """
        Function responsible for preprocessing and creating the document and query corpora
        based on the PreProcessing object provided.
        """

        self.document_corpus = self.preprocessing.preprocess_corpus(self.document_set)
        self.query_corpus    = self.preprocessing.preprocess_corpus(self.query_set)


    @timer_func
    def _fit(self):
        """ 
        If algorithm has the "fit" attribute, it is the scratch implementation,
        otherwise it is the okapi implementation.
        This function can generalize as long as the algorithm follows this syntax.
        """
        if hasattr(self.algorithm, 'fit'):
            self.algorithm = self.algorithm.fit(self.document_corpus)
        else:
            self.algorithm = self.algorithm(self.document_corpus)


    def _results_from_query(self, idx, tokenized_query):
        """Score all documents for one query and compute its trec_eval metrics

        Args:
            idx: index of the tokenized query
            tokenized_query: tokenized query to submit to the algorithm
        Returns:
            most_relevant_retrieved_docs: document indexes sorted by decreasing score
            metrics: trec_eval metrics
        """
        relevant_docs = []

        # Retrieving relevant document
        if idx in self.relevant_set:
            relevant_docs = self.relevant_set[idx]

        # Scoring query using algorithm
        scores = self.algorithm.get_scores(tokenized_query)

        # Creating a masked relevant documents 
        # of 1's and 0's for documents in the relevant set
        masked_relevant_docs = np.zeros(len(scores))
        masked_relevant_docs[relevant_docs] = 1

        # Getting indexes of most relevant retrieved documents
        most_relevant_retrieved_docs = np.argsort([-1*x for x in scores])

        # trec_eval format
        qrel, run = self._data_format_trec_eval(idx, masked_relevant_docs, most_relevant_retrieved_docs, scores)

        # Getting metrics
        metrics = self._extract_metrics(qrel, run)

        return most_relevant_retrieved_docs, metrics

    @timer_func
    def retrieve_docs_from_query(self, query, n=10):
        """
        Function that takes a user's query and retrieves n documents
        Args:
            query: user's query (string)
            n: number of documents to retrieve
        Returns:
            retrieved_docs: documents that the algorithms matched
        """

        # preprocess user's query
        tokenized_query = self.preprocessing._preprocess_string(query)

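        # fitting algorithm
        try:
            # if not fitted, must fit
            self._fit()
        except TypeError:
            # an already fitted BM25Okapi instance is not callable, so _fit
            # raises TypeError; the algorithm is ready, just pass
            pass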

        # Scoring query using algorithm
        scores = self.algorithm.get_scores(tokenized_query)

        # Getting indexes of most relevant retrieved documents
        most_relevant_retrieved_docs = np.argsort([-1*x for x in scores])

        # Getting retrieved docs
        retrieved_docs = [self.document_set[i] for i in most_relevant_retrieved_docs[:n]]

        return retrieved_docs



    def _data_format_trec_eval_bkp(self, idx, masked_relevant_docs, most_relevant_retrieved_docs, scores):
        """
        #####################
        #### DEPRECATED #####
        #####################

        Function that transforms a set of scores, retrieved and relevant documents
        into trec_eval format used by HuggingFace evaluate module.
        Args:
            idx: index of the query
            masked_relevant_docs: 0/1 mask marking the relevant documents
            most_relevant_retrieved_docs: list of retrieved documents
            scores: list with BM25 scores
        Returns:
            qrel: dict with relevant documents in trec_eval format
            run: dict with retrieved documents and scores in trec_eval format.
        ## SOURCE: https://huggingface.co/spaces/evaluate-metric/trec_eval
        """
        
        N = len(masked_relevant_docs)

        qrel = {
            'query': [idx] * N,
            'q0': ['q0'] * N,
            "docid": [str(x) for x in masked_relevant_docs],
            "rel": [str(x) for x in masked_relevant_docs]
        }

        run = {
            "query": [idx] * N,
            "q0": ["q0"] * N,
            "docid": [str(x) for x in list(most_relevant_retrieved_docs[:N])],
            "rank": list(range(N)),
            "score": sorted(scores)[::-1][:N],
            "system": ["test"] * N
        }

        return qrel, run


    def _data_format_trec_eval(self, idx, masked_relevant_docs, most_relevant_retrieved_docs, scores):
        """
        Function that transforms a set of scores, retrieved and relevant documents
        into trec_eval format used by HuggingFace evaluate module.
        Args:
            idx: index of the query
            masked_relevant_docs: 0/1 mask marking the relevant documents
            most_relevant_retrieved_docs: list of retrieved documents
            scores: list with BM25 scores
        Returns:
            qrel: dict with relevant documents in trec_eval format
            run: dict with retrieved documents and scores in trec_eval format.
        ## SOURCE: https://huggingface.co/spaces/evaluate-metric/trec_eval
        """
        
        N_REL_DOCS = len(masked_relevant_docs)
        N_RET_DOCS = len(most_relevant_retrieved_docs)

        qrel = {
            'query': [idx] * N_REL_DOCS,
            'q0': ['q0'] * N_REL_DOCS,
            "docid": [str(int(x)) for x in range(N_REL_DOCS)],
            "rel": [str(int(x)) for x in masked_relevant_docs]
        }

        run = {
            "query": [idx] * N_RET_DOCS,
            "q0": ["q0"] * N_RET_DOCS,
            "docid": [str(x) for x in list(most_relevant_retrieved_docs[:N_RET_DOCS])],
            "rank": list(range(N_RET_DOCS)),
            "score": sorted(scores)[::-1][:N_RET_DOCS],
            "system": ["test"] * N_RET_DOCS
        }

        return qrel, run


    def _extract_metrics(self, qrel, run):
        """
        Function that extract trec_eval metrics from qrel and run
        Args:
            qrel: dict with relevant documents in trec_eval format
            run: dict with retrieved documents and scores in trec_eval format
        Returns:
            metrics: dict with trec_eval metrics.
        """
        return self.trec_eval.compute(references=[qrel], predictions=[run])

    @timer_func
    def _search(self):
        """
        Function that executes all queries against the document set and outputs
        both the retrieved documents and the metrics
        """
        # Running search for all queries
        output = [
              self._results_from_query(idx, tokenized_query) 
              for idx, tokenized_query in enumerate(self.query_corpus)
              if idx not in self.queries_with_no_relevant_docs
        ]

        # Extracting most relevant documents and metrics
        most_relevant_retrieved_docs = [x[0] for x in output]
        metrics                      = [x[1] for x in output]

        return most_relevant_retrieved_docs, pd.DataFrame(metrics)

    @timer_func
    def run(self):
        """
        Function responsible for running all steps of the information retrieval pipeline.
        """
        # Loading data
        self._load_data()

        # Preprocess data
        self._preprocess()

        # Fitting algorithm
        self._fit()

        # Searching 
        most_relevant_retrieved_docs, df_metrics = self._search()

        return most_relevant_retrieved_docs, df_metrics
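
The shape of the `qrel` and `run` dictionaries built by `_data_format_trec_eval` can be sketched on toy data. The key names mirror the ones used in the class above; `to_trec_eval_format` and the toy inputs are hypothetical, and the sketch only builds the dictionaries without calling the `evaluate` metric.

```python
def to_trec_eval_format(query_id, relevant_doc_ids, scores, system="test"):
    """Build qrel/run dicts for one query from a score per document."""
    n_docs = len(scores)
    relevant = set(relevant_doc_ids)
    # document indexes sorted by decreasing score
    ranking = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

    qrel = {
        "query": [query_id] * n_docs,
        "q0": ["q0"] * n_docs,
        "docid": [str(i) for i in range(n_docs)],
        "rel": ["1" if i in relevant else "0" for i in range(n_docs)],
    }
    run = {
        "query": [query_id] * n_docs,
        "q0": ["q0"] * n_docs,
        "docid": [str(i) for i in ranking],
        "rank": list(range(n_docs)),
        "score": [scores[i] for i in ranking],
        "system": [system] * n_docs,
    }
    return qrel, run


# Toy data: three documents, only document 1 is relevant
toy_qrel, toy_run = to_trec_eval_format(0, [1], [0.2, 1.5, 0.7])
print(toy_run["docid"], toy_run["score"])
```

Every document appears in both dictionaries: `qrel` lists them in index order with a 0/1 judgment, while `run` lists them in ranked order with their scores.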


# Running the full pipeline

- Definition of the classes that `SearchEngine` depends on
  - `DataLoader`
    - depends on `PATH`
  - `PreProcessing`
    - depends on the `stopwords` and the `stemmer`
  - `Algorithm`
    - `Okapi` or `Scratch`
  - `trec_eval`

In [16]:
# DataLoader
PATH = '/content/drive/MyDrive/Colab Notebooks/IR/data/' 
dataloader = DataLoader(PATH)

# PreProcessing
stemmer = nltk.stem.PorterStemmer()
stopwords = nltk.corpus.stopwords.words('english')
preprocessing = PreProcessing(stopwords = stopwords, stemmer = stemmer)

# Algorithm
algo1 = BM25Okapi
algo2 = BM25()

# trec eval
trec_eval = load("trec_eval")

# SearchEngine1
search_engine_1 = SearchEngine(dataloader, preprocessing, algo1, trec_eval)
most_relevant_docs_1, df_metrics_1 = search_engine_1.run()

# SearchEngine2
search_engine_2 = SearchEngine(dataloader, preprocessing, algo2, trec_eval)
most_relevant_docs_2, df_metrics_2 = search_engine_2.run()


Function '_load_data' executed in 0.0867s
Function '_preprocess' executed in 3.8765s
Function '_fit' executed in 0.0630s
Function '_search' executed in 41.1351s
Function 'run' executed in 45.1619s
Function '_load_data' executed in 0.0745s
Function '_preprocess' executed in 5.2036s
Function '_fit' executed in 0.0950s
Function '_search' executed in 39.8502s
Function 'run' executed in 45.2258s


# Checking the metrics: Okapi vs. scratch BM25

- Mean average precision; 
- Geometric mean average precision;
- Binary preference score;
- Precision@R;
- Reciprocal Rank.
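
As a rough guide to what three of these numbers mean for a single query, here is a minimal sketch using their textbook definitions. The input is a 0/1 list marking, in ranked order, whether each retrieved document is relevant. The function names are hypothetical, and the exact conventions of the `trec_eval` metric (for example around ties) may differ.

```python
def average_precision(ranked_relevance, n_relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / n_relevant if n_relevant else 0.0


def reciprocal_rank(ranked_relevance):
    """1 / rank of the first relevant document, or 0 if there is none."""
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def r_precision(ranked_relevance, n_relevant):
    """Precision over the first R results, where R is the number of relevant documents."""
    return sum(ranked_relevance[:n_relevant]) / n_relevant if n_relevant else 0.0


# Toy ranking: relevant documents at ranks 1 and 3, two relevant documents in total
toy_ranking = [1, 0, 1, 0, 0]
print(average_precision(toy_ranking, 2), reciprocal_rank(toy_ranking), r_precision(toy_ranking, 2))
```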

In [17]:
set1 = [
    'map',
    'gm_map',
    'bpref',
    'Rprec',
    'recip_rank'
]

In [18]:
df_metrics_1[set1].describe().merge(
    df_metrics_2[set1].describe(), 
    how='left', 
    left_index=True, 
    right_index=True, 
    suffixes=('_Okapi', '_Scratch')
)

| | map_Okapi | gm_map_Okapi | bpref_Okapi | Rprec_Okapi | recip_rank_Okapi | map_Scratch | gm_map_Scratch | bpref_Scratch | Rprec_Scratch | recip_rank_Scratch |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 |
| mean | 0.225431 | 0.225431 | 0.202589 | 0.245756 | 0.661638 | 0.224083 | 0.224083 | 0.201311 | 0.246825 | 0.65752 |
| std | 0.157095 | 0.157095 | 0.154056 | 0.160005 | 0.372463 | 0.154845 | 0.154845 | 0.146955 | 0.152068 | 0.376336 |
| min | 0.011458 | 0.011458 | 0.0 | 0.0 | 0.013889 | 0.011614 | 0.011614 | 0.0 | 0.0 | 0.014286 |
| 25% | 0.109605 | 0.109605 | 0.083363 | 0.142143 | 0.333333 | 0.104506 | 0.104506 | 0.090604 | 0.145089 | 0.333333 |
| 50% | 0.196504 | 0.196504 | 0.187077 | 0.25 | 1.0 | 0.201266 | 0.201266 | 0.179778 | 0.25 | 1.0 |
| 75% | 0.308895 | 0.308895 | 0.276886 | 0.339112 | 1.0 | 0.311406 | 0.311406 | 0.283473 | 0.351972 | 1.0 |
| max | 0.833584 | 0.833584 | 0.833333 | 0.833333 | 1.0 | 0.780052 | 0.780052 | 0.740741 | 0.722222 | 1.0 |


## Precision
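
For reference, P@k is the fraction of the top k results that are relevant. A minimal sketch, with a hypothetical function name and a made-up ranking:

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the first k ranked documents that are relevant (0/1 flags)."""
    return sum(ranked_relevance[:k]) / k


# Toy ranking with relevant documents at ranks 1, 3 and 6
toy_ranked_relevance = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
print(precision_at_k(toy_ranked_relevance, 5), precision_at_k(toy_ranked_relevance, 10))
```

Because the denominator is always k, P@k necessarily shrinks for large k once k exceeds the number of relevant documents for a query.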

In [19]:
precision_cols = [
    'P@5',
    'P@10',
    'P@15',
    'P@20',
    'P@30',
    'P@100',
    'P@200',
    'P@500',
    'P@1000',
]

In [20]:
df_metrics_1[precision_cols].describe().merge(
    df_metrics_2[precision_cols].describe(), 
    how='left', 
    left_index=True, 
    right_index=True, 
    suffixes=('_Okapi', '_Scratch')
)

| | P@5_Okapi | P@10_Okapi | P@15_Okapi | P@20_Okapi | P@30_Okapi | P@100_Okapi | P@200_Okapi | P@500_Okapi | P@1000_Okapi | P@5_Scratch | P@10_Scratch | P@15_Scratch | P@20_Scratch | P@30_Scratch | P@100_Scratch | P@200_Scratch | P@500_Scratch | P@1000_Scratch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 |
| mean | 0.410526 | 0.369737 | 0.314035 | 0.283553 | 0.239035 | 0.147237 | 0.105132 | 0.060947 | 0.038 | 0.423684 | 0.356579 | 0.327193 | 0.289474 | 0.241228 | 0.149079 | 0.105921 | 0.061632 | 0.038105 |
| std | 0.291926 | 0.252465 | 0.209922 | 0.19363 | 0.166568 | 0.113356 | 0.08357 | 0.052218 | 0.032844 | 0.293879 | 0.239074 | 0.215868 | 0.198211 | 0.16732 | 0.110395 | 0.082546 | 0.051723 | 0.032929 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.005 | 0.002 | 0.001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.005 | 0.002 | 0.001 |
| 25% | 0.2 | 0.175 | 0.133333 | 0.15 | 0.1 | 0.06 | 0.04 | 0.022 | 0.01275 | 0.2 | 0.175 | 0.133333 | 0.15 | 0.1 | 0.06 | 0.04375 | 0.022 | 0.01275 |
| 50% | 0.4 | 0.4 | 0.333333 | 0.25 | 0.2 | 0.12 | 0.09 | 0.048 | 0.0295 | 0.4 | 0.3 | 0.3 | 0.25 | 0.2 | 0.14 | 0.0875 | 0.049 | 0.0295 |
| 75% | 0.6 | 0.6 | 0.466667 | 0.4125 | 0.366667 | 0.1925 | 0.135 | 0.0825 | 0.04925 | 0.6 | 0.525 | 0.466667 | 0.45 | 0.366667 | 0.2025 | 0.14125 | 0.085 | 0.049 |
| max | 1.0 | 0.9 | 0.8 | 0.75 | 0.566667 | 0.5 | 0.38 | 0.218 | 0.134 | 1.0 | 0.9 | 0.8 | 0.65 | 0.633333 | 0.43 | 0.38 | 0.202 | 0.134 |


## Normalized Discounted Cumulative Gain
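
As a reminder of what NDCG@k measures, here is a minimal sketch of one common formulation, with gain discounted by `log2(rank + 1)` and normalized by the ideal ordering. It assumes the ranked list covers every judged document, as the rankings in this notebook do; the function names are hypothetical and the exact variant used by the `trec_eval` metric may differ.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions (ranks start at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains, k):
    """DCG@k divided by the DCG@k of the ideal (best possible) ordering."""
    ideal_dcg = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# A perfect ranking scores 1.0; pushing the only relevant document to rank 2 lowers it
print(ndcg_at_k([1, 1, 0], 3), ndcg_at_k([0, 1], 2))
```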

In [21]:
ndcg_cols = [
    'NDCG@5',
    'NDCG@10',
    'NDCG@15',
    'NDCG@20',
    'NDCG@30',
    'NDCG@100',
    'NDCG@200',
    'NDCG@500',
    'NDCG@1000'
]

In [22]:
df_metrics_1[ndcg_cols].describe().merge(
    df_metrics_2[ndcg_cols].describe(), 
    how='left', 
    left_index=True, 
    right_index=True, 
    suffixes=('_Okapi', '_Scratch')
)

| | NDCG@5_Okapi | NDCG@10_Okapi | NDCG@15_Okapi | NDCG@20_Okapi | NDCG@30_Okapi | NDCG@100_Okapi | NDCG@200_Okapi | NDCG@500_Okapi | NDCG@1000_Okapi | NDCG@5_Scratch | NDCG@10_Scratch | NDCG@15_Scratch | NDCG@20_Scratch | NDCG@30_Scratch | NDCG@100_Scratch | NDCG@200_Scratch | NDCG@500_Scratch | NDCG@1000_Scratch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 |
| mean | 0.437259 | 0.407284 | 0.377304 | 0.362763 | 0.345349 | 0.387496 | 0.450308 | 0.53211 | 0.591084 | 0.446872 | 0.399259 | 0.382963 | 0.3641 | 0.343428 | 0.388335 | 0.451207 | 0.535172 | 0.591227 |
| std | 0.304352 | 0.260309 | 0.223411 | 0.210981 | 0.197175 | 0.195955 | 0.188573 | 0.177706 | 0.168439 | 0.308564 | 0.254274 | 0.229491 | 0.217347 | 0.199925 | 0.193561 | 0.188291 | 0.180216 | 0.171697 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.075815 | 0.136392 | 0.191298 | 0.192022 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.076309 | 0.121592 | 0.187927 | 0.193426 |
| 25% | 0.16958 | 0.219384 | 0.203825 | 0.202789 | 0.200805 | 0.252744 | 0.332118 | 0.425279 | 0.500071 | 0.16958 | 0.209938 | 0.202881 | 0.211868 | 0.209567 | 0.2509 | 0.326761 | 0.400283 | 0.491053 |
| 50% | 0.426966 | 0.396891 | 0.381293 | 0.334246 | 0.345198 | 0.356438 | 0.431062 | 0.533139 | 0.623669 | 0.477797 | 0.406278 | 0.381447 | 0.359974 | 0.339945 | 0.363492 | 0.421079 | 0.547889 | 0.626217 |
| 75% | 0.684352 | 0.608999 | 0.550456 | 0.533233 | 0.500274 | 0.539224 | 0.57103 | 0.665318 | 0.715886 | 0.684352 | 0.602245 | 0.571163 | 0.531985 | 0.479205 | 0.521657 | 0.580309 | 0.671344 | 0.709674 |
| max | 1.0 | 0.936379 | 0.864362 | 0.879198 | 0.912788 | 0.912788 | 0.912788 | 0.912788 | 0.945759 | 1.0 | 0.933746 | 0.861179 | 0.804155 | 0.89771 | 0.89771 | 0.89771 | 0.91487 | 0.930482 |


# Retrieving documents based on the user's query

There is also the option for the user to submit a query and retrieve a set of N documents matched by the algorithms. To do so, just call the `retrieve_docs_from_query` method with the query and the number of documents to retrieve.
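
The selection step behind that method boils down to picking the indexes of the n highest scores. A minimal standard-library sketch, with a hypothetical helper and made-up scores (the class itself uses `np.argsort` on the negated scores):

```python
import heapq


def top_n_indices(scores, n=10):
    """Indexes of the n highest scores, best first."""
    return heapq.nlargest(n, range(len(scores)), key=lambda i: scores[i])


toy_doc_scores = [0.1, 2.0, 0.5, 1.2]
best = top_n_indices(toy_doc_scores, n=2)
print(best)
```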

In [23]:
# Defining query
query = '''
What problems and concerns are there in making up descriptive titles? 
What difficulties are involved in automatically retrieving articles from approximate titles? 
What is the usual relevance of the content of articles to their titles?'''

# Retrieving documents for both algorithm implementations 
retrieved_docs_1 = search_engine_1.retrieve_docs_from_query(query, n=10)
retrieved_docs_2 = search_engine_2.retrieve_docs_from_query(query, n=10)

Function 'retrieve_docs_from_query' executed in 0.0142s
Function '_fit' executed in 0.0555s
Function 'retrieve_docs_from_query' executed in 0.0623s


In [24]:
print('Number of retrieved documents:', len(retrieved_docs_1))
print('Best match:', retrieved_docs_1[0])

Number of retrieved documents: 10
Best match: The Information Content of Titles in Engineering Literature Bottle, Robert T. Since many alerting and information services rely very heavily on the use of titles to transfer information to the potential user, it is essential that he be aware of the proportion of the information contained in the complete document which will not be deducible from the title and which he will therefore miss.. Methods will be discussed for analyzing the relative information content of the titles of engineering paper and results presented for the amount and type of information lost through scanning title listing only.. Between one-third and one-half of indexable terms are not retrievable from article titles even if all possible synonyms and  related terms are used.. If all synonyms are used instead of one keyword the amount of information retrieved is increased by about 70 percent.. The problems of dealing with synonyms and with syntactical variants in searc

In [25]:
print('Number of retrieved documents:', len(retrieved_docs_2))
print('Best match:', retrieved_docs_2[0])

Number of retrieved documents: 10
Best match: The Information Content of Titles in Engineering Literature Bottle, Robert T. Since many alerting and information services rely very heavily on the use of titles to transfer information to the potential user, it is essential that he be aware of the proportion of the information contained in the complete document which will not be deducible from the title and which he will therefore miss.. Methods will be discussed for analyzing the relative information content of the titles of engineering paper and results presented for the amount and type of information lost through scanning title listing only.. Between one-third and one-half of indexable terms are not retrievable from article titles even if all possible synonyms and  related terms are used.. If all synonyms are used instead of one keyword the amount of information retrieved is increased by about 70 percent.. The problems of dealing with synonyms and with syntactical variants in searc