# Assignment: Inverted Index
- Martínez Ostoa Néstor Iván
- Datos Masivos I (Massive Data I)
- Data Science, IIMAS, UNAM
- April 2021

\* Notebook based on: https://github.com/gibranfp/CursoDatosMasivosI/blob/main/notebooks/3b_indice_inverso.ipynb

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from collections import  Counter
import time

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

import nltk
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.corpus.reader.wordnet import NOUN, VERB, ADV, ADJ
nltk.download(['punkt','averaged_perceptron_tagger','wordnet'])

[nltk_data] Downloading package punkt to /home/nestor/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/nestor/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/nestor/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
morphy_tag = {
    'JJ' : ADJ,
    'JJR' : ADJ,
    'JJS' : ADJ,
    'VB' : VERB,
    'VBD' : VERB,
    'VBG' : VERB,
    'VBN' : VERB,
    'VBP' : VERB,
    'VBZ' : VERB,
    'RB' : ADV,
    'RBR' : ADV,
    'RBS' : ADV
}

def doc_a_tokens(doc):
    """
        Convert a document into a list of "clean" tokens:
        lowercase, tokenize, POS-tag and lemmatize each token
    """
    tagged = pos_tag(word_tokenize(doc.lower()))
    lemmatizer = WordNetLemmatizer()
    tokens = []
    for p,t in tagged:
        # Tags that are not in morphy_tag are lemmatized as nouns
        tokens.append(lemmatizer.lemmatize(p, pos=morphy_tag.get(t, NOUN)))
    return tokens
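
The fallback in `morphy_tag.get(t, NOUN)` means that every tag missing from the table is lemmatized as a noun. A minimal sketch of that lookup follows, with WordNet's part-of-speech constants written out as the single-letter strings NLTK uses so that it runs without NLTK. The `tag_to_wordnet` helper, the reduced table and the sample tags are illustrative assumptions, not part of the notebook.

```python
# WordNet POS constants as plain strings (the values NLTK uses for ADJ, VERB, ADV, NOUN).
ADJ, VERB, ADV, NOUN = 'a', 'v', 'r', 'n'

# Reduced version of the morphy_tag table, for illustration only.
morphy_tag = {'JJ': ADJ, 'VBD': VERB, 'VBG': VERB, 'RB': ADV}

def tag_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS, defaulting to NOUN."""
    return morphy_tag.get(penn_tag, NOUN)

# Hypothetical (token, tag) pairs, shaped like the output of pos_tag.
tagged = [('was', 'VBD'), ('really', 'RB'), ('small', 'JJ'), ('car', 'NN'), ('.', '.')]
for token, tag in tagged:
    print(token, tag, '->', tag_to_wordnet(tag))
```

Punctuation and unknown tags fall through to `'n'`, which is why tokens such as `.` and `,` survive lemmatization unchanged in the corpus.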

## 1. Building the corpus

In [3]:
db = fetch_20newsgroups(remove=('headers','footers','quotes'))

In [4]:
corpus = []
for d in db.data:
  d = d.replace('\n',' ').replace('\r',' ').replace('\t',' ')
  tokens = doc_a_tokens(d)
  corpus.append(' '.join(tokens))

### 1.1 Example of a document in the corpus
- *corpus*: documents from the 20 newsgroups dataset shipped with sklearn, after tokenization and lemmatization

In [5]:
print(corpus[0])

i be wonder if anyone out there could enlighten me on this car i saw the other day . it be a 2-door sport car , look to be from the late 60s/ early 70 . it be call a bricklin . the door be really small . in addition , the front bumper be separate from the rest of the body . this be all i know . if anyone can tellme a model name , engine spec , year of production , where this car be make , history , or whatever info you have on this funky look car , please e-mail .


## 2. Bags of words (BoW) of the corpus

The bag of words is a sparse matrix (scipy CSR matrix) with one row per document and one column per vocabulary word

In [6]:
v = CountVectorizer(stop_words='english', max_features=5000, max_df=0.8)
bolsas = v.fit_transform(corpus)
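
To make the matrix layout concrete, here is a minimal pure-Python sketch of what a bag of words stores: one `{word_id: count}` mapping per document over a shared, alphabetically sorted vocabulary. It does not reproduce `CountVectorizer` (no stop words, `max_features` or `max_df`); the toy documents and the helper names are assumptions made for illustration.

```python
from collections import Counter

def build_vocabulary(docs):
    """Assign an integer id to every distinct token, in alphabetical order."""
    words = sorted({w for d in docs for w in d.split()})
    return {w: i for i, w in enumerate(words)}

def bag_of_words(doc, vocab):
    """Return {word_id: frequency} for the tokens of doc that are in vocab."""
    counts = Counter(w for w in doc.split() if w in vocab)
    return {vocab[w]: c for w, c in counts.items()}

toy_docs = ['car engine car', 'space mission', 'car mission']
vocab = build_vocabulary(toy_docs)
bags = [bag_of_words(d, vocab) for d in toy_docs]
print(vocab)  # {'car': 0, 'engine': 1, 'mission': 2, 'space': 3}
print(bags)   # [{0: 2, 1: 1}, {3: 1, 2: 1}, {0: 1, 2: 1}]
```

Each dictionary plays the role of one row of the sparse matrix: only the words that occur in the document are stored.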

In [7]:
print(f'BoW: \n\t{bolsas.shape[0]} documents - {bolsas.shape[1]} distinct words\n')
print(f'Example, first document: \n\t(doc_id, word_id) \t word frequency\n\n{bolsas[0]}')

BoW: 
	11314 documents - 5000 distinct words

Example, first document: 
	(doc_id, word_id) 	 word frequency

  (0, 4905)	1
  (0, 984)	4
  (0, 3986)	1
  (0, 1407)	1
  (0, 1603)	2
  (0, 4253)	1
  (0, 2775)	2
  (0, 2666)	1
  (0, 1649)	1
  (0, 309)	1
  (0, 3740)	1
  (0, 4181)	1
  (0, 441)	1
  (0, 4063)	1
  (0, 3850)	1
  (0, 848)	1
  (0, 2638)	1
  (0, 2996)	1
  (0, 1725)	1
  (0, 4231)	1
  (0, 4972)	1
  (0, 3566)	1
  (0, 2833)	1
  (0, 2269)	1
  (0, 2418)	1
  (0, 2823)	1


## 3. Inverted index

In [8]:
class IndiceInverso:
  """Inverted index: for each word id, the list of documents that contain it."""
  def __init__(self):
    self.ifs = []

  def  __getitem__(self, idx):
    return self.ifs[idx]

  def __repr__(self):
    return "<IFS :%s >" % str(self)

  def __str__(self):
    contenido = ['%d::%s' % (i, self.ifs[i]) for i in range(len(self.ifs))]
    return '\n'.join(contenido)

  def recupera(self, l):
    # l is a list of (word_id, frequency) pairs; the result counts,
    # for each document, how many of those words it contains
    return Counter([j for (i,_) in l for j in self.ifs[i]])

  def from_csr(self, csr):
    # One posting list per column (word) of the document-term matrix
    self.ifs = [[] for _ in range(csr.shape[1])]
    coo = csr.tocoo()
    for i,j in zip(coo.row, coo.col):
      self.ifs[j].append(i)
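
The same idea can be sketched without scipy. The example below builds posting lists from toy bags of words (`{word_id: count}` dictionaries, an assumed format) and mirrors what `recupera` does: for each document, it counts how many of the query words it contains. The function names are illustrative.

```python
from collections import Counter

def build_inverted_index(bags, n_words):
    """postings[w] lists the ids of the documents that contain word w."""
    postings = [[] for _ in range(n_words)]
    for doc_id, bag in enumerate(bags):
        for word_id in bag:
            postings[word_id].append(doc_id)
    return postings

def retrieve(postings, query):
    """Count, per document, how many of the query's words it contains."""
    return Counter(doc_id for word_id, _ in query for doc_id in postings[word_id])

# Toy bags over a 4-word vocabulary.
bags = [{0: 2, 1: 1}, {2: 1, 3: 1}, {0: 1, 2: 1}]
postings = build_inverted_index(bags, n_words=4)
print(postings)  # [[0, 2], [0], [1, 2], [1]]

query = [(0, 1), (2, 1)]  # (word_id, frequency) pairs, the format recupera expects
hits = retrieve(postings, query)
print(hits.most_common())  # [(2, 2), (0, 1), (1, 1)]
```

Document 2 ranks first because it contains both query words; the frequencies in the query are ignored, just as in `recupera`.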

### 3.1 Instance of ```IndiceInverso```

In [9]:
ifs = IndiceInverso()
ifs.from_csr(bolsas)

In [10]:
print(f'Documents that contain the word with index 0: ({len(ifs[0])}) \n\n{ifs[0]}')

Documents that contain the word with index 0: (243) 

[59, 109, 372, 444, 472, 498, 502, 514, 531, 548, 583, 621, 769, 782, 794, 795, 846, 884, 939, 940, 949, 986, 1033, 1125, 1189, 1197, 1310, 1331, 1459, 1505, 1546, 1554, 1569, 1572, 1584, 1623, 1689, 1704, 1729, 1948, 2064, 2070, 2138, 2142, 2168, 2287, 2319, 2455, 2539, 2592, 2634, 2639, 2746, 2821, 2895, 2936, 3050, 3127, 3166, 3186, 3197, 3210, 3282, 3425, 3483, 3520, 3530, 3549, 3569, 3600, 3674, 3790, 3850, 3864, 3875, 3883, 3953, 3970, 4080, 4166, 4168, 4301, 4352, 4449, 4498, 4520, 4524, 4604, 4639, 4730, 4767, 4796, 4867, 4960, 4963, 5014, 5022, 5045, 5100, 5125, 5133, 5265, 5274, 5317, 5337, 5481, 5566, 5586, 5666, 5744, 5765, 5795, 5869, 5888, 5993, 6012, 6087, 6209, 6253, 6256, 6261, 6266, 6286, 6334, 6440, 6459, 6476, 6483, 6484, 6511, 6522, 6631, 6673, 6696, 6703, 6719, 6759, 6822, 6838, 6847, 6880, 6891, 6908, 6914, 6987, 6998, 7039, 7042, 7139, 7165, 7202, 7302, 7409, 7411, 7432, 7442, 7478, 7499, 7554, 7627,

### 3.2 Converting CSR to lists of lists + queries

In [11]:
def csr_to_ldb(csr):
  """Convert a CSR matrix into one list of (column, value) pairs per row."""
  ldb = [[] for _ in range(csr.shape[0])]
  coo = csr.tocoo()
  for i,j,val in zip(coo.row, coo.col, coo.data):
    ldb[i].append((j, val))
  return ldb
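
For reference, a CSR matrix keeps three flat arrays: `data` (the nonzero values), `indices` (their column ids) and `indptr` (where each row starts in the other two). The sketch below walks those arrays directly in pure Python to produce the same list-of-lists layout. The array contents are a hand-made toy example, not taken from `bolsas`, and `csr_arrays_to_lists` is a hypothetical helper.

```python
def csr_arrays_to_lists(indptr, indices, data):
    """Turn CSR arrays into one [(column, value), ...] list per row."""
    rows = []
    for r in range(len(indptr) - 1):
        start, end = indptr[r], indptr[r + 1]
        rows.append(list(zip(indices[start:end], data[start:end])))
    return rows

# Toy 3x4 matrix:
# [[2, 1, 0, 0],
#  [0, 0, 1, 1],
#  [1, 0, 1, 0]]
indptr = [0, 2, 4, 6]
indices = [0, 1, 2, 3, 0, 2]
data = [2, 1, 1, 1, 1, 1]
print(csr_arrays_to_lists(indptr, indices, data))
# [[(0, 2), (1, 1)], [(2, 1), (3, 1)], [(0, 1), (2, 1)]]
```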

In [12]:
consultas = []
for c in ['nasa space mission satellite','government crime enforcement Security']:
  tokens = doc_a_tokens(c)
  consultas.append(' '.join(tokens))

bolsa_consultas = v.transform(consultas)
cl = csr_to_ldb(bolsa_consultas)

In [13]:
cl

[[(2979, 1), (3083, 1), (3981, 1), (4223, 1)],
 [(1337, 1), (1723, 1), (2129, 1), (4041, 1)]]

In [14]:
print('Word \t\t| BoW (keyword, frequency)')
print('---------------------------------\n')
for t_idx, text in enumerate(consultas):
    # Rows of cl are ordered by word id, not by position in the query,
    # so each word is looked up in the vocabulary instead of being paired by position
    frecuencias = dict(cl[t_idx])
    for word in text.split(' '):
        word_id = v.vocabulary_[word]
        print(f'{word} --> \t\t{(word_id, frecuencias[word_id])}\n')
    print('-----------------------------------------')
    print('-----------------------------------------')

Word 		| BoW (keyword, frequency)
---------------------------------

nasa --> 		(3083, 1)

space --> 		(4223, 1)

mission --> 		(2979, 1)

satellite --> 		(3981, 1)

-----------------------------------------
-----------------------------------------
government --> 		(2129, 1)

crime --> 		(1337, 1)

enforcement --> 		(1723, 1)

security --> 		(4041, 1)

-----------------------------------------
-----------------------------------------


### 3.3 Using the inverted index to retrieve the documents that contain the words of the first query

-> 'nasa space mission satellite'

In [15]:
recs = ifs.recupera(cl[0])
top = recs.most_common()[0]

In [16]:
print(f'Most common documents:')
for e in recs.most_common()[:25]: print(e, end=', ')

Most common documents:
(59, 4), (153, 4), (545, 4), (1830, 4), (2800, 4), (3285, 4), (3564, 4), (3864, 4), (4425, 4), (5356, 4), (6197, 4), (6719, 4), (7554, 4), (8525, 4), (9096, 4), (9154, 4), (9986, 4), (10855, 4), (11198, 4), (953, 3), (1071, 3), (1459, 3), (3044, 3), (3665, 3), (4443, 3), 

In [17]:
article_id = top[0]
print(f'Most similar document: id->{article_id}\n')
print(db.data[article_id])

Most similar document: id->59

Archive-name: space/new_probes
Last-modified: $Date: 93/04/01 14:39:17 $

UPCOMING PLANETARY PROBES - MISSIONS AND SCHEDULES

    Information on upcoming or currently active missions not mentioned below
    would be welcome. Sources: NASA fact sheets, Cassini Mission Design
    team, ISAS/NASDA launch schedules, press kits.


    ASUKA (ASTRO-D) - ISAS (Japan) X-ray astronomy satellite, launched into
    Earth orbit on 2/20/93. Equipped with large-area wide-wavelength (1-20
    Angstrom) X-ray telescope, X-ray CCD cameras, and imaging gas
    scintillation proportional counters.


    CASSINI - Saturn orbiter and Titan atmosphere probe. Cassini is a joint
    NASA/ESA project designed to accomplish an exploration of the Saturnian
    system with its Cassini Saturn Orbiter and Huygens Titan Probe. Cassini
    is scheduled for launch aboard a Titan IV/Centaur in October of 1997.
    After gravity assists of Venus, Earth and Jupiter in a VVEJGA
    trajector

## 4. Searching for similar documents

### 4.1 Query document

In [18]:
dc = db.data[0]
print(dc)

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


### 4.2 BoW of ``dc``

In [19]:
tokens = doc_a_tokens(dc)
bolsa_dc = v.transform([' '.join(tokens)])

In [20]:
print('Query tokens: \n{0}'.format(tokens), end='\n\n')
print('Query bag of words: \n[{0}]'.format(bolsa_dc))

Query tokens: 
['i', 'be', 'wonder', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'i', 'saw', 'the', 'other', 'day', '.', 'it', 'be', 'a', '2-door', 'sport', 'car', ',', 'look', 'to', 'be', 'from', 'the', 'late', '60s/', 'early', '70', '.', 'it', 'be', 'call', 'a', 'bricklin', '.', 'the', 'door', 'be', 'really', 'small', '.', 'in', 'addition', ',', 'the', 'front', 'bumper', 'be', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', '.', 'this', 'be', 'all', 'i', 'know', '.', 'if', 'anyone', 'can', 'tellme', 'a', 'model', 'name', ',', 'engine', 'spec', ',', 'year', 'of', 'production', ',', 'where', 'this', 'car', 'be', 'make', ',', 'history', ',', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'look', 'car', ',', 'please', 'e-mail', '.']

Query bag of words: 
[  (0, 309)	1
  (0, 441)	1
  (0, 848)	1
  (0, 984)	4
  (0, 1407)	1
  (0, 1603)	2
  (0, 1649)	1
  (0, 1725)	1
  (0, 2269)	1
  (0, 2418)	1
  (0, 2638)	1
  (0, 2666

### 4.3 Similarity functions

In [21]:
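def similitud_coseno(x, y):
    """Cosine similarity between two single-row sparse matrices."""
    x = x.toarray()[0]
    y = y.toarray()[0]
    pnorma = (np.sqrt(x @ x) * np.sqrt(y @ y))
    if pnorma > 0:
        return (x @ y) / pnorma
    return np.nan 

def distancia_euclidiana(x, y):   
    """Euclidean distance (a smaller value means more similar)."""
    x = x.toarray()[0]
    y = y.toarray()[0]
    return np.sqrt(np.sum((x - y)**2))

def similitud_jaccard(x, y):
    """Jaccard similarity over the sets of words present in each row."""
    x = x.toarray()[0]
    y = y.toarray()[0]
    inter = np.count_nonzero(x * y)
    union = np.count_nonzero(x) + np.count_nonzero(y) - inter
    # Both rows empty: the similarity is undefined
    return inter / union if union > 0 else np.nan

def similitud_minmax(x, y):
    """MinMax similarity: sum of element-wise minima over sum of maxima."""
    x = x.toarray()[0]
    y = y.toarray()[0]
    c = np.vstack((x,y))
    mn = np.sum(np.min(c, axis=0))
    mx = np.sum(np.max(c, axis=0))
    # Both rows empty: the similarity is undefined
    return mn / mx if mx > 0 else np.nan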
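
A small worked example helps to check the four measures by hand. The sketch below reimplements them on plain Python lists (the names `cosine`, `euclidean`, `jaccard` and `minmax` are illustrative, and the two count vectors are made up) so the results can be verified with pencil and paper.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm > 0 else float('nan')

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def jaccard(x, y):
    inter = sum(1 for a, b in zip(x, y) if a and b)
    union = sum(1 for a, b in zip(x, y) if a or b)
    return inter / union if union else float('nan')

def minmax(x, y):
    mx = sum(max(a, b) for a, b in zip(x, y))
    return sum(min(a, b) for a, b in zip(x, y)) / mx if mx else float('nan')

x = [1, 2, 0, 1]
y = [1, 0, 3, 1]
print(round(cosine(x, y), 4))     # 0.2462  (2 / sqrt(6 * 11))
print(round(euclidean(x, y), 4))  # 3.6056  (sqrt(0 + 4 + 9 + 0))
print(jaccard(x, y))              # 0.5     (2 shared words / 4 words in total)
print(round(minmax(x, y), 4))     # 0.2857  (2 / 7)
```

Cosine, Jaccard and MinMax grow as the vectors become more alike, whereas the Euclidean value shrinks.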

### 4.3 Brute-force similarity search

In [22]:
def fuerza_bruta(base, consulta, fd):
  """Apply fd between the query and every row of base."""
  medidas = np.zeros(base.shape[0])
  for i,x in enumerate(base):
    medidas[i] = fd(consulta, x)
  return medidas

def info(sims, db, elapsed, distance):
    print(f'Tiempo elapsado: \n\t{elapsed} [s]')
    print('--------------------------------------------------\n')
    # base is bolsas[1:], so position i in sims corresponds to document i + 1.
    # nanargmax picks the largest value: for a distance (Euclidean) that is
    # the farthest document, not the closest one
    most_similar_doc_id = np.nanargmax(sims) + 1
    max_similitude = np.nanmax(sims)
    document = db.data[most_similar_doc_id]
    print(f'Similitud máxima ({distance.upper()}): \n\t {max_similitude} con el documento {most_similar_doc_id}\n')
    print('--------------------------------------------------\n')
    print(f'Documento: \n{most_similar_doc_id}: \n {document}')

def mide_fuerza_bruta(consulta, base, func_similitud, sim_title):
    """Time the brute-force search and print the best-scoring document."""
    start = time.time()
    sims = fuerza_bruta(base, consulta, func_similitud)
    end = time.time()
    info(sims, db, elapsed=end-start, distance=sim_title)
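
One detail is worth keeping in mind when reading the results below: `info` always takes the largest value in `sims`. That is the right choice for similarities (cosine, Jaccard, MinMax), but for a distance the closest document is the one with the smallest value. A small sketch of a selector that handles both cases follows; `best_match` and its `higher_is_better` flag are hypothetical names that are not part of the notebook, and the scores are made up.

```python
import math

def best_match(scores, higher_is_better=True):
    """Return (index, score) of the best entry, ignoring NaN values."""
    valid = [(s, i) for i, s in enumerate(scores) if not math.isnan(s)]
    if not valid:
        raise ValueError('no valid scores')
    pick = max if higher_is_better else min
    score, idx = pick(valid, key=lambda pair: pair[0])
    return idx, score

similarities = [0.10, float('nan'), 0.46, 0.16]
distances = [5.2, 40.0, 3.6, 7.1]
print(best_match(similarities))                       # (2, 0.46)
print(best_match(distances, higher_is_better=False))  # (2, 3.6)
print(best_match(distances))                          # (1, 40.0), the farthest entry
```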

#### 4.3.1 Brute force + cosine similarity

In [23]:
mide_fuerza_bruta(consulta=bolsa_dc, base=bolsas[1:], func_similitud=similitud_coseno,
    sim_title='Coseno')

Tiempo elapsado: 
	1.906834363937378 [s]
--------------------------------------------------

Similitud máxima (COSENO): 
	 0.4589534811637672 con el documento 6997

--------------------------------------------------

Documento: 
6997: 
 
Perhaps instead of this silly argument about what backup lights
are for, couldn't we agree that they serve the dual purpose of
letting people behind your car know that you have it in reverse
and that they can also light up the area behind your car while
you're backing up so you can see?

Backup lamps on current models are much brighter than they used
to be on older cars. Those on my Taurus Wagon are quite bright
enough to illuminate a good area behind the car, and they're 
MUCH brighter than those on my earlier cars from the 60s and 70s. 

Insofar as Vettes having side backup lights, look at a '92 or '93
model (or perhaps a year or two earlier too) and you'll see
red side marker lamps and white side marker lamps both near the
car's hindquarters.  Those

#### 4.3.2 Brute force + Euclidean distance

In [24]:
mide_fuerza_bruta(consulta=bolsa_dc, base=bolsas[1:], func_similitud=distancia_euclidiana,
    sim_title='Euclideana')

Tiempo elapsado: 
	1.3075840473175049 [s]
--------------------------------------------------

Similitud máxima (EUCLIDEANA): 
	 11180.764106267514 con el documento 4772

--------------------------------------------------

Documento: 
4772: 
 
------------ Part 12 of 14 ------------
MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'
MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'
MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'
MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'
MAX>',<3$9@L+I:5'W]_?W]_?W]_?>GIZ*BJ[N[M>7EY>7EY>7@,#`P,#`P->
M`P,#`P->*BHJ*KN[N[M>`P,#`P->NRHJ*BIZ1PMF,8>'AX>'AX>'AX>'AX>'
MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'
MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'
MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'
MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'
MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'
MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX

#### 4.3.3 Brute force + Jaccard similarity

In [25]:
mide_fuerza_bruta(consulta=bolsa_dc, base=bolsas[1:], func_similitud=similitud_jaccard,
    sim_title='Jaccard')

Tiempo elapsado: 
	1.5316853523254395 [s]
--------------------------------------------------

Similitud máxima (JACCARD): 
	 0.16049382716049382 con el documento 5282

--------------------------------------------------

Documento: 
5282: 
 Alright, beat this automobile sighting.

Driving along just a hair north of Atlanta, I noticed an old, run down
former car dealership which appeared to deal with, and repair, older
rare or exotic foreign sports cars. I saw:

Ford GT-40 (!), the famous model from Ford, that seemed to win most of 
its races in the late 60s, including Le-Mans 4 or 6 times.

Two Jensen Interceptors, one a convertable, one a hatchback?

Porsche 911 (boring compared to the rest)

THREE Ferarries, a Mondial, a 308 prepared for racing, and a red 60s model
that I couldn't identify.

And at the bottom, a late 70s MG convertable.

Outside there was a rotting Rover 3500 saloon, which was never regularly
sold in the U.S.

And in the showroom, there was a small italian body, eithe

#### 4.3.4 Brute force + MinMax similarity

In [26]:
mide_fuerza_bruta(consulta=bolsa_dc, base=bolsas[1:], func_similitud=similitud_minmax,
    sim_title='MinMax')

Tiempo elapsado: 
	1.580542802810669 [s]
--------------------------------------------------

Similitud máxima (MINMAX): 
	 0.14736842105263157 con el documento 5282

--------------------------------------------------

Documento: 
5282: 
 Alright, beat this automobile sighting.

Driving along just a hair north of Atlanta, I noticed an old, run down
former car dealership which appeared to deal with, and repair, older
rare or exotic foreign sports cars. I saw:

Ford GT-40 (!), the famous model from Ford, that seemed to win most of 
its races in the late 60s, including Le-Mans 4 or 6 times.

Two Jensen Interceptors, one a convertable, one a hatchback?

Porsche 911 (boring compared to the rest)

THREE Ferarries, a Mondial, a 308 prepared for racing, and a red 60s model
that I couldn't identify.

And at the bottom, a late 70s MG convertable.

Outside there was a rotting Rover 3500 saloon, which was never regularly
sold in the U.S.

And in the showroom, there was a small italian body, either 

### 4.4 Similarity search with the inverted index

In [44]:
def encontrar_docs_a_comparar(doc_consulta, v, ifs, f_token, f_csr_ldb):
    """
        Use the inverted index to select the documents that will be
        compared against the query document
    """
    bolsa_consulta = v.transform([' '.join(f_token(doc_consulta))])
    bolsa_consulta_listas = f_csr_ldb(bolsa_consulta)
    recs = ifs.recupera(bolsa_consulta_listas[0])
    # Skip the first entry to leave out the query document itself; this
    # assumes the query belongs to the corpus and therefore ranks first
    return (bolsa_consulta, recs.most_common()[1:])

def obtener_doc_mas_similar(doc_consulta, bolsas, func_distancia, v, ifs, f_token, f_csr_ldb):
    """
        Find the largest value of func_distancia between the query document
        and the candidate documents, and return it together with the id of
        the document that attains it. For a distance (Euclidean) the largest
        value corresponds to the farthest candidate
    """
    # Get the documents to compare through the inverted index
    (bolsa_doc_consulta, docs_a_comparar) = encontrar_docs_a_comparar(doc_consulta, v, ifs, f_token, f_csr_ldb)

    # Look for the maximum value and the index of the document that attains it
    max_distance = -1
    most_similar_doc_idx = -1
    for element in docs_a_comparar:
        doc_idx = element[0]
        bolsa_a_comparar = bolsas[doc_idx]
        d = func_distancia(bolsa_doc_consulta, bolsa_a_comparar)
        if d > max_distance:
            max_distance = d
            most_similar_doc_idx = doc_idx
    return (max_distance, most_similar_doc_idx)

def mide_indice_inverso(doc_consulta, bolsas, fdist, fdist_title, v, ifs, f_token, f_csr_ldb):
    """
        Print the time taken to search, through the inverted index, for the
        document that scores highest against the query under a given measure
        (cosine, Euclidean, Jaccard, MinMax)
    """
    start = time.time()
    (max_dist, most_similar_doc) = obtener_doc_mas_similar(doc_consulta, bolsas, fdist, v, ifs, f_token, f_csr_ldb)
    end = time.time()
    print(f'\t----------INDICE INVERSO----------\n\t\t\t{fdist_title.upper()}\n')
    print(f'Tiempo elapsado: \n\t{end - start} [s]\n')
    document = db.data[most_similar_doc]
    print(f'Similitud máxima: \n\t {max_dist} con el documento {most_similar_doc}\n')
    print(f'Documento: {most_similar_doc}: \n {document}')
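
The whole procedure can be condensed into a toy example: use posting lists to collect the documents that share at least one word with the query, drop the query itself by id rather than by ranking position, and score only those candidates. The sketch uses Jaccard over sets of word ids and made-up documents; `most_similar` is an illustrative helper, not the notebook's function.

```python
def jaccard_sets(a, b):
    """Jaccard similarity between two sets of word ids."""
    union = a | b
    return len(a & b) / len(union) if union else float('nan')

def most_similar(query_id, docs, postings):
    """Score only the documents that share a word with docs[query_id]."""
    candidates = {d for w in docs[query_id] for d in postings[w]}
    candidates.discard(query_id)  # exclude the query by id, not by position
    best_id, best_sim = None, -1.0
    for d in sorted(candidates):
        sim = jaccard_sets(docs[query_id], docs[d])
        if sim > best_sim:
            best_id, best_sim = d, sim
    return best_id, best_sim, len(candidates)

# Toy corpus: each document is the set of word ids it contains.
docs = [{0, 1, 2}, {3, 4}, {0, 2, 5}, {1, 6, 7, 8}]
postings = {}
for doc_id, words in enumerate(docs):
    for w in words:
        postings.setdefault(w, []).append(doc_id)

print(most_similar(0, docs, postings))  # (2, 0.5, 2)
```

Document 1 shares no word with the query, so it is never scored: two comparisons are made instead of three.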


In [45]:
doc_consulta = db.data[0]
print(doc_consulta)

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


In [51]:
mide_indice_inverso(doc_consulta=doc_consulta, bolsas=bolsas, fdist=similitud_coseno, fdist_title='coseno',
    v=v, ifs=ifs, f_token=doc_a_tokens, f_csr_ldb=csr_to_ldb)

	----------INDICE INVERSO----------
			COSENO

Tiempo elapsado: 
	0.8900790214538574 [s]

Similitud máxima: 
	 0.4589534811637672 con el documento 6997

Documento: 6997: 
 
Perhaps instead of this silly argument about what backup lights
are for, couldn't we agree that they serve the dual purpose of
letting people behind your car know that you have it in reverse
and that they can also light up the area behind your car while
you're backing up so you can see?

Backup lamps on current models are much brighter than they used
to be on older cars. Those on my Taurus Wagon are quite bright
enough to illuminate a good area behind the car, and they're 
MUCH brighter than those on my earlier cars from the 60s and 70s. 

Insofar as Vettes having side backup lights, look at a '92 or '93
model (or perhaps a year or two earlier too) and you'll see
red side marker lamps and white side marker lamps both near the
car's hindquarters.  Those aren't just white reflectors. 


In [52]:
mide_indice_inverso(doc_consulta=doc_consulta, bolsas=bolsas, fdist=distancia_euclidiana, fdist_title='euclideana',
    v=v, ifs=ifs, f_token=doc_a_tokens, f_csr_ldb=csr_to_ldb)

	----------INDICE INVERSO----------
			EUCLIDEANA

Tiempo elapsado: 
	0.7350575923919678 [s]

Similitud máxima: 
	 6020.848113015309 con el documento 4515

Documento: 4515: 
 
------------ Part 8 of 14 ------------
MAX>'AX>'AX>'AX>'AYZ>8=#0T"L!A%(!*]#0T"4E)<-N;@+(U,C(;CZ;1-/3
MTXU*IH?4JH!75\B`P2&((F[JZ$/Q[/&J3;YJ+2WE<"3T*0:4E,SUAX>'AY\*
MFYL^/J"@PR4E)=#0T"4E2VY+)1$1D&[4U-34U-34AX>'AX>'AX>'AX>'AX>'
MAX>'AX>'AX>'AX>'AX>'AX>'AX<Y824E%!17@$JN%!1-(;GO?NKHM:=:6FQ*
MKNSL\3ND\>QJ+2F'AX>'AX&!UM8DFYL^/CZ@B+40B*`^/IN;FYN;)24E)6%A
MT-#0T-!A)25+P\/#-L#`P%L45U?R(>#@AX>'AX>'AX>'AX>'AX>'AX>'AX>'
MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'
MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'
MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'
MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'
MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'
MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'
MAX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'AX>'

In [53]:
mide_indice_inverso(doc_consulta=doc_consulta, bolsas=bolsas, fdist=similitud_jaccard, fdist_title='Jaccard',
    v=v, ifs=ifs, f_token=doc_a_tokens, f_csr_ldb=csr_to_ldb)

	----------INDICE INVERSO----------
			JACCARD

Tiempo elapsado: 
	0.8227863311767578 [s]

Similitud máxima: 
	 0.16049382716049382 con el documento 5282

Documento: 5282: 
 Alright, beat this automobile sighting.

Driving along just a hair north of Atlanta, I noticed an old, run down
former car dealership which appeared to deal with, and repair, older
rare or exotic foreign sports cars. I saw:

Ford GT-40 (!), the famous model from Ford, that seemed to win most of 
its races in the late 60s, including Le-Mans 4 or 6 times.

Two Jensen Interceptors, one a convertable, one a hatchback?

Porsche 911 (boring compared to the rest)

THREE Ferarries, a Mondial, a 308 prepared for racing, and a red 60s model
that I couldn't identify.

And at the bottom, a late 70s MG convertable.

Outside there was a rotting Rover 3500 saloon, which was never regularly
sold in the U.S.

And in the showroom, there was a small italian body, either an Alpha Romeo
or a Lancia. It was about the size of an Austin M

In [54]:
mide_indice_inverso(doc_consulta=doc_consulta, bolsas=bolsas, fdist=similitud_minmax, fdist_title='minmax',
    v=v, ifs=ifs, f_token=doc_a_tokens, f_csr_ldb=csr_to_ldb)

	----------INDICE INVERSO----------
			MINMAX

Tiempo elapsado: 
	1.2106525897979736 [s]

Similitud máxima: 
	 0.14736842105263157 con el documento 5282

Documento: 5282: 
 Alright, beat this automobile sighting.

Driving along just a hair north of Atlanta, I noticed an old, run down
former car dealership which appeared to deal with, and repair, older
rare or exotic foreign sports cars. I saw:

Ford GT-40 (!), the famous model from Ford, that seemed to win most of 
its races in the late 60s, including Le-Mans 4 or 6 times.

Two Jensen Interceptors, one a convertable, one a hatchback?

Porsche 911 (boring compared to the rest)

THREE Ferarries, a Mondial, a 308 prepared for racing, and a red 60s model
that I couldn't identify.

And at the bottom, a late 70s MG convertable.

Outside there was a rotting Rover 3500 saloon, which was never regularly
sold in the U.S.

And in the showroom, there was a small italian body, either an Alpha Romeo
or a Lancia. It was about the size of an Austin Mi

## 5. Analysis

The goal of the analysis is to compare how long it takes to find similar documents. This assignment used two approaches:
- Brute force: compares the bag of words of the query document against the bags of words of every other document in the corpus. It evaluates the similarity function (cosine, Euclidean, Jaccard, MinMax) $N-1$ times and keeps the document with the highest value. The approach is simple, but it is the most expensive computationally because its running time grows linearly with the size of the corpus.
- Inverted index: aims to make fewer comparisons between the query document and the rest. The index is used to select only the documents worth comparing, namely those that share at least one word with the query document. As a consequence, the number of comparisons $M \leq N-1$ may not be much smaller than $N-1$; nevertheless, these are the times (in seconds) obtained for the query document:

|Similarity / Method|Brute force (s)|Inverted index (s)|
|---|---|---|
|Cosine|1.9068|0.8462|
|Euclidean|1.3076|0.7362|
|Jaccard|1.5317|0.8226|
|MinMax|1.5805|0.9872|

What we can observe is that using the inverted index clearly reduces the comparison times. The reduction is not dramatic because every document that shares at least one word with the query is still considered. For query document $0$ in particular, brute force compares roughly 11,300 documents, whereas the inverted index compares only about 7,000: fewer documents, but not few enough for the times to drop drastically.
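
As a rough back-of-the-envelope check, assume that the cost of a search is dominated by the number of pairwise comparisons (an assumption that ignores the time spent querying the index itself). With $N = 11314$ documents in the corpus and the approximate figure $M \approx 7000$ quoted above, the fraction of comparisons that remains after filtering is:

$$
\frac{M}{N-1} \approx \frac{7000}{11313} \approx 0.62
$$

Under that assumption, the inverted index skips only around 38% of the comparisons for this query.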