# Part 2: Binary search using an inverted index (BSII)
## Team members
* Juan Esteban Arboleda
* Luccas Rojas

## 1. Preprocessing
The first step towards binary search over an inverted index is to tokenize the documents and build the vocabulary. The vocabulary must be sorted, free of stop words, stemmed and normalized.
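As a rough illustration of these requirements, the sketch below runs a simplified pipeline using only the standard library. The names `TOY_STOP_WORDS`, `toy_preprocess` and `toy_vocabulary`, the regular-expression tokenizer and the two sample sentences are assumptions made for this example. Stemming is left out because the notebook relies on NLTK for it.

```python
import re

# Hypothetical miniature stop-word list; the notebook uses NLTK's English list
TOY_STOP_WORDS = {"the", "and", "of", "a", "in"}

def toy_preprocess(text):
    """Lowercases the text, splits it into alphanumeric tokens and drops stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in TOY_STOP_WORDS]

def toy_vocabulary(texts):
    """Returns the sorted list of distinct tokens found in all the texts."""
    return sorted({token for text in texts for token in toy_preprocess(text)})

# Made-up sample texts
texts = [
    "William Beaumont and the Human Digestion.",
    "The structure of the atom",
]
print(toy_vocabulary(texts))
# ['atom', 'beaumont', 'digestion', 'human', 'structure', 'william']
```

With a stemmer applied before building the set, `digestion` would be reduced to a stem, as in the notebook's own tokens.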

* The documents and the queries are loaded into a data structure below. Set `DOCUMENTS_PATH` and `QUERIES_PATH` to the folders that contain the documents and the queries, respectively.

In [2]:
import os
import pandas as pd
import numpy as np

# Paths to set according to where the files are located
DOCUMENTS_PATH = '../data/docs-raw-texts'
QUERIES_PATH = '../data/queries-raw-texts'

def load_documents(folder_path: str) -> pd.DataFrame:
    """
    Returns a Pandas DataFrame where each row represents a document in folder_path.
    The DataFrame will have as many rows as there are documents in folder_path

        Parameters
        ----------
            folder_path: str
                The path to the folder that contains the documents to load
    
        Returns
        --------
            documents: pd.DataFrame
                Pandas DataFrame with two columns: "filename" and "body"
    """
    documents = []
    index = []
    doc_id = 1
    # Use a list (not a set) so that the column order is guaranteed
    columns = ['filename', 'body']
    # Sort the filenames so that document ids are reproducible across runs
    for filename in sorted(os.listdir(folder_path)):
        text = pd.read_xml(os.path.join(folder_path, filename))['raw'].tolist()[1]
        filtered_text = text.replace('\n', ' ').replace('\xa0', ' ')
        documents.append([filename, filtered_text])
        index.append(doc_id)
        doc_id += 1

    return pd.DataFrame(documents, index=index, columns=columns)

documents = load_documents(DOCUMENTS_PATH)
queries = load_documents(QUERIES_PATH)

First we tokenize the text with `word_tokenize` from the NLTK library.

In [3]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')

documents['tokens'] = documents['body'].apply(word_tokenize)
queries['tokens'] = queries['body'].apply(word_tokenize)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\juanc\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\juanc\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We remove punctuation marks and the leftovers of English contractions, and lowercase the whole text (normalization).

In [4]:
import string

def remove_punctuation(token_list):
    # Drop punctuation tokens and one-character tokens that are not numeric
    # (e.g. the "s" left over from a possessive), then lowercase the rest
    return [
        token.lower()
        for token in token_list
        if token not in string.punctuation and (len(token) > 1 or token.isnumeric())
    ]

documents['tokens'] = documents['tokens'].apply(remove_punctuation)
queries['tokens'] = queries['tokens'].apply(remove_punctuation)

print(documents)

             filename                                               body  \
1    wes2015.d001.naf  William Beaumont and the Human Digestion.  Wil...   
2    wes2015.d002.naf  Selma Lagerlöf and the wonderful Adventures of...   
3    wes2015.d003.naf  Ferdinand de Lesseps and the Suez Canal.  Ferd...   
4    wes2015.d004.naf  Walt Disney’s ‘Steamboat Willie’ and the Rise ...   
5    wes2015.d005.naf  Eugene Wigner and the Structure of the Atomic ...   
..                ...                                                ...   
327  wes2015.d327.naf  James Parkinson and Parkinson’s Disease.  Wood...   
328  wes2015.d328.naf  Juan de la Cierva and the Autogiro.  Demonstra...   
329  wes2015.d329.naf  Squire Whipple – The Father of the Iron Bridge...   
330  wes2015.d330.naf  William Playfair and the Beginnings of Infogra...   
331  wes2015.d331.naf  Juan Bautista de Anza and the Route to San Fra...   

                                                tokens  
1    [william, beaumont, and, 
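
To make the effect of the filter above concrete, the following standalone sketch applies the same condition to a small token list. The list `sample_tokens` is made up for illustration; it imitates what a tokenizer may produce for text that uses curly apostrophes.

```python
import string

def remove_punctuation(token_list):
    # Same condition as in the notebook cell: drop punctuation tokens and
    # one-character tokens that are not numeric, then lowercase the rest
    return [
        token.lower()
        for token in token_list
        if token not in string.punctuation and (len(token) > 1 or token.isnumeric())
    ]

# Hypothetical tokenizer output
sample_tokens = ["Walt", "Disney", "’", "s", "Steamboat", "Willie", ",", "1928", "7", "."]
print(remove_punctuation(sample_tokens))
# ['walt', 'disney', 'steamboat', 'willie', '1928', '7']
```

Note that `token not in string.punctuation` is a substring test against a string, so a multi-character token such as `--` is kept because it does not appear as a contiguous substring of `string.punctuation`.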

After tokenizing and lowercasing, we remove the stop words to reduce the vocabulary and keep them from affecting the final result. For this we use NLTK's `stopwords.words('english')`.

In [5]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# TODO: confirm whether lowercasing alone counts as normalization
def remove_stop_words(token_list):
    return [token for token in token_list if token not in stop_words]

documents['tokens'] = documents['tokens'].apply(remove_stop_words)
queries['tokens'] = queries['tokens'].apply(remove_stop_words)

After removing the stop words, the remaining words are stemmed.

In [6]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
def stemming(token_list):
    return [stemmer.stem(token) for token in token_list]

documents['tokens'] = documents['tokens'].apply(stemming)
queries['tokens'] = queries['tokens'].apply(stemming)

At this point the text of every document and query is in a form that is easier to process, so we move on to building the inverted index representation of the documents.

## 2. Data representation

The implementation below transforms the previous DataFrame into an inverted index structure so that binary searches can be run against it.

In [7]:
print(documents.loc[1])

filename                                     wes2015.d001.naf
body        William Beaumont and the Human Digestion.  Wil...
tokens      [william, beaumont, human, digest, william, be...
Name: 1, dtype: object

In [19]:
def create_inverted_index(documents: pd.DataFrame) -> dict:
    """
    Creates the inverted index for a document set.

    Params
    ------
        documents: pd.DataFrame
            A Pandas DataFrame that represents the document set. The
            DataFrame should have a "tokens" column with the preprocessed
            tokens of each document. The DataFrame's index should correspond
            to the document id and be sorted in ascending order
        
    Returns
    -------
        inverted_index: dict
            A python dictionary that represents the inverted index.
            Keys are the terms in the vocabulary.
            Each value has a "df" (document frequency) and "postings".
            "postings" is a sorted numpy array of document ids
    """
    inverted_index = {}

    for doc_id, document in documents.iterrows():
        # dict.fromkeys keeps each distinct token once, in order of appearance
        for token in dict.fromkeys(document['tokens']):
            entry = inverted_index.setdefault(token, {"df": 0, "postings": []})
            entry["df"] += 1
            entry["postings"].append(doc_id)

    # Documents are visited in index order, so every postings list is already
    # sorted; convert the lists to numpy arrays once, at the end
    for entry in inverted_index.values():
        entry["postings"] = np.array(entry["postings"], dtype=np.uint64)

    return inverted_index

inverted_index = create_inverted_index(documents)
inverted_index

{'william': {'df': 50,
  'postings': array([  1,   9,  15,  28,  35,  55,  56,  69,  78,  88,  91,  92,  95,
          98, 102, 106, 109, 111, 129, 136, 138, 147, 175, 179, 180, 189,
         190, 191, 193, 197, 212, 230, 231, 241, 254, 257, 266, 272, 273,
         274, 289, 291, 294, 299, 300, 309, 310, 320, 323, 330],
        dtype=uint64)},
 'beaumont': {'df': 1, 'postings': array([1], dtype=uint64)},
 'human': {'df': 59,
  'postings': array([  1,   2,   7,  12,  32,  36,  46,  50,  51,  53,  54,  55,  62,
          84,  86,  91, 102, 113, 118, 120, 131, 146, 161, 173, 174, 190,
         191, 192, 194, 197, 198, 204, 206, 210, 213, 215, 219, 222, 232,
         241, 249, 251, 261, 263, 266, 276, 279, 283, 290, 294, 296, 300,
         306, 310, 312, 313, 314, 317, 319], dtype=uint64)},
 'digest': {'df': 6,
  'postings': array([  1,  49,  54, 102, 263, 314], dtype=uint64)},
 'physiolog': {'df': 12,
  'postings': array([  1,  37,  46,  62, 113, 120, 133, 191, 261, 286, 294, 314],
      

As the code above shows, the inverted index is stored in a dictionary that is built by walking through every document and its tokens. Each token of the vocabulary is added together with the list of documents that contain it. The final vocabulary has 14682 tokens.
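
The same construction can be followed on a tiny example. The sketch below uses plain Python lists instead of numpy arrays, and the corpus `toy_documents` and the helper `build_toy_index` are made up for illustration; only the `df` and `postings` layout mirrors the notebook.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> already preprocessed tokens
toy_documents = {
    1: ["william", "beaumont", "human", "digest", "william"],
    2: ["human", "anatomi"],
    3: ["digest", "human", "physiolog"],
}

def build_toy_index(docs):
    """Builds {term: {"df": ..., "postings": [...]}} from {doc_id: tokens}."""
    postings = defaultdict(list)
    # Visiting ids in ascending order keeps every postings list sorted
    for doc_id in sorted(docs):
        # dict.fromkeys removes repeated tokens within one document
        for term in dict.fromkeys(docs[doc_id]):
            postings[term].append(doc_id)
    return {term: {"df": len(ids), "postings": ids} for term, ids in postings.items()}

toy_index = build_toy_index(toy_documents)
print(toy_index["human"])    # {'df': 3, 'postings': [1, 2, 3]}
print(toy_index["william"])  # {'df': 1, 'postings': [1]}
print(len(toy_index))        # 6
```

The repeated `william` in document 1 is counted once, which is why its document frequency is 1.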

## 3. Modeling

In [9]:
# AND query evaluated by walking the sorted postings lists in parallel

def organize_tokens(tokens, inverted_index):
    """Returns the distinct query terms sorted by ascending document frequency."""
    unique_tokens = list(dict.fromkeys(tokens))
    return sorted(unique_tokens, key=lambda token: inverted_index[token]['df'])

def and_search(tokens, inverted_index):
    """Returns the ids of the documents that contain every term in tokens."""
    # A term that is missing from the vocabulary makes the conjunction empty
    if not tokens or any(token not in inverted_index for token in tokens):
        return []
    organized_tokens = organize_tokens(tokens, inverted_index)
    postings = [inverted_index[token]['postings'] for token in organized_tokens]
    # One cursor per postings list; every cursor only moves forward
    token_document_index = [0] * len(postings)
    relevant_documents = []
    # Candidates come from the rarest term, which has the shortest postings list
    for document in postings[0]:
        document_is_relevant = True
        for token_index in range(1, len(postings)):
            current = postings[token_index]
            # Skip the ids that are smaller than the candidate document
            while (token_document_index[token_index] < len(current)
                   and current[token_document_index[token_index]] < document):
                token_document_index[token_index] += 1
            if (token_document_index[token_index] == len(current)
                    or current[token_document_index[token_index]] != document):
                document_is_relevant = False
                break
        if document_is_relevant:
            relevant_documents.append(int(document))
    return relevant_documents

print(and_search(queries.iloc[2]['tokens'], inverted_index))
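
The idea behind the AND query can be checked in isolation on a small index. The sketch below is one possible formulation, not necessarily the one intended for this notebook: it intersects two sorted postings lists at a time with the classic two-pointer merge and processes the terms in ascending document frequency. The names `intersect` and `toy_and_search` and the contents of `toy_index` are assumptions made for this example.

```python
def intersect(left, right):
    """Two-pointer merge: returns the ids present in both sorted lists."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result

def toy_and_search(terms, index):
    """Returns the ids of the documents that contain every term."""
    # An unknown term cannot match any document
    if not terms or any(term not in index for term in terms):
        return []
    # Rarest term first, so intermediate results stay as small as possible
    ordered = sorted(set(terms), key=lambda term: index[term]["df"])
    result = index[ordered[0]]["postings"]
    for term in ordered[1:]:
        result = intersect(result, index[term]["postings"])
    return result

# Hypothetical postings lists, sorted by document id
toy_index = {
    "william": {"df": 4, "postings": [1, 9, 15, 28]},
    "human": {"df": 5, "postings": [1, 2, 7, 12, 28]},
    "digest": {"df": 2, "postings": [1, 49]},
}
print(toy_and_search(["william", "human"], toy_index))            # [1, 28]
print(toy_and_search(["william", "human", "digest"], toy_index))  # [1]
print(toy_and_search(["william", "unknown"], toy_index))          # []
```

Each call to `intersect` takes time proportional to the combined length of its two lists, which is why starting from the shortest list helps.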