# Introduction to Information Retrieval

## Exercise list 1

<hr>

In [1]:
import pandas as pd
import gzip
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
from nltk.tokenize import WordPunctTokenizer
import re 
from nltk.stem import PorterStemmer
from collections import defaultdict
from spellchecker import SpellChecker

## Dataset

The dataset is available at http://jmcauley.ucsd.edu/data/amazon/ and contains the text and ratings of reviews from Amazon's video games section.

In [2]:
def parse(path):
    # NOTE: eval executes each line as Python code; use it only on trusted files.
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield eval(l)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient = 'index')

df = getDF('reviews_Video_Games_5.json.gz')
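
The loader above relies on `eval`. As a hedged alternative, the sketch below reads the same kind of file without executing its contents. It assumes each line is either strict JSON or a Python literal dictionary; `parse_safe` is a hypothetical name, not part of the dataset's official loader.

```python
import ast
import gzip
import json

def parse_safe(path):
    """Yield one record per line without executing the line as code."""
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            line = line.strip()
            if not line:
                continue
            try:
                # Strict JSON (double-quoted keys and strings).
                yield json.loads(line)
            except json.JSONDecodeError:
                # Python-literal dictionaries (e.g. single-quoted strings).
                yield ast.literal_eval(line)
```

It could be swapped in by calling `parse_safe(path)` instead of `parse(path)` inside `getDF`.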

In [3]:
df.head()

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A2HD75EMZR8QLN | 700099867 | 123 | [8, 12] | Installing the game was a struggle (because of... | 1.0 | Pay to unlock content? I don't think so. | 1341792000 | 07 9, 2012 |
| 1 | A3UR8NLLY1ZHCX | 700099867 | Alejandro Henao "Electronic Junky" | [0, 0] | If you like rally cars get this game you will ... | 4.0 | Good rally game | 1372550400 | 06 30, 2013 |
| 2 | A1INA0F5CWW3J4 | 700099867 | Amazon Shopper "Mr.Repsol" | [0, 0] | 1st shipment received a book instead of the ga... | 1.0 | Wrong key | 1403913600 | 06 28, 2014 |
| 3 | A1DLMTOTHQ4AST | 700099867 | ampgreen | [7, 10] | I got this version instead of the PS3 version,... | 3.0 | awesome game, if it did not crash frequently !! | 1315958400 | 09 14, 2011 |
| 4 | A361M14PU2GUEG | 700099867 | Angry Ryan "Ryan A. Forrest" | [2, 2] | I had Dirt 2 on Xbox 360 and it was an okay ga... | 4.0 | DIRT 3 | 1308009600 | 06 14, 2011 |


Note that **asin** is the identifier of a specific product.

In [4]:
nrow = df.shape[0]
print(df.shape)

(231780, 9)


Example review

In [5]:
df.reviewText[3]

'I got this version instead of the PS3 version, which turned out to be a mistake. Console versions of games look 95 percent as good as their PC versions, but you do not have to deal with driver issues and the numerous things that can go wrong with windows. First off the installation takes about 30 minutes, which is ridiculous. I have never had a game take this long to load, Shift 2 took about 20 minutes which seemed too long also. Next many of the latest games for PC are forcing you to have an internet connection in order to install the game, regardless of whether you want to play only offline single player games. Shift 2 unleashed is also like this, so be forewarned. Internet requirements are not prominently displayed on the boxes. The game pushes you, but does not require you to, sign up for a Games For Windows Live account, which is required to get the game patches and updates. More time wasted signing up for that. Finally after about one hour the game was up and running , but the m

<hr>

### Exercise 1: Stemming and recall.

Based on the inverted index built in practice session 1, compute the difference in recall with and without stemming when building the index.

<hr>

Lowercasing the text, removing URLs and punctuation, and tokenizing

In [6]:
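not_stemmed = []
# The third argument of maketrans lists characters to delete, so punctuation is
# removed rather than replaced by whitespace ("don't" becomes "dont").
# See https://stackoverflow.com/questions/34860982/replace-the-punctuation-with-whitespace
transtable = str.maketrans('', '', string.punctuation)
# Raw string avoids invalid-escape warnings. Note that the class [!*\(\), ]
# includes a space, so a match can run past the end of the URL itself.
url_re = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
tokenizer = WordPunctTokenizer()
for t in df['reviewText']:
    t = url_re.sub(' ', t)
    t = t.translate(transtable)
    t = re.sub(' +', ' ', t).strip()
    txt = [token.lower() for token in tokenizer.tokenize(t)]
    not_stemmed.append(txt)

ps = PorterStemmer()

# Stem every token of every review.
stemmed = [[ps.stem(token) for token in t] for t in not_stemmed]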
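
To make the effect of the cleaning steps concrete, the sketch below applies them to a single made-up sentence. It is only an illustration: `clean_tokens`, `URL_RE` and `PUNCT_TABLE` are hypothetical names, the URL pattern is a simplified `https?://\S+` rather than the one used in the cell above, and `str.split` stands in for `WordPunctTokenizer`, which should behave similarly once ASCII punctuation has been removed.

```python
import re
import string

# Simplified URL pattern (an assumption, not the notebook's regex).
URL_RE = re.compile(r'https?://\S+')
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_tokens(text):
    text = URL_RE.sub(' ', text)         # drop URLs
    text = text.translate(PUNCT_TABLE)   # drop punctuation
    return text.lower().split()          # lowercase and split on whitespace

print(clean_tokens("Don't miss it: https://example.com/a?b=1 GREAT game!!"))
# ['dont', 'miss', 'it', 'great', 'game']
```

Deleting the apostrophe merges "don't" into "dont", which is worth keeping in mind when writing queries against the index.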

Building the inverted indexes

In [7]:
ind_not_stemmed = defaultdict(set)
for tid, t in enumerate(not_stemmed):
    for term in t:
        ind_not_stemmed[term].add(tid)

In [8]:
ind_stemmed = defaultdict(set)
for tid, t in enumerate(stemmed):
    for term in t:
        ind_stemmed[term].add(tid)
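
As a small illustration of what these indexes hold, the sketch below builds the same kind of term-to-document mapping over three made-up token lists. The names `toy_docs` and `toy_index` are hypothetical and exist only for this example.

```python
from collections import defaultdict

toy_docs = [
    ["zelda", "is", "great"],
    ["great", "graphics"],
    ["zelda", "zelda", "again"],
]

# term -> set of ids of the documents that contain it
toy_index = defaultdict(set)
for doc_id, tokens in enumerate(toy_docs):
    for term in tokens:
        toy_index[term].add(doc_id)

print(sorted(toy_index["zelda"]))  # [0, 2]
print(sorted(toy_index["great"]))  # [0, 1]
```

Because the postings are sets, a term that repeats inside one document (as "zelda" does in the third one) is recorded only once for that document.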

In [9]:
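def busca_not_stemmed(consulta):
    # Boolean AND search over the unstemmed index.
    ret = set(range(nrow))
    # The index holds lowercased tokens, so lowercase the query too.
    toks = WordPunctTokenizer().tokenize(consulta.lower())
    for tok in toks:
        # .get avoids adding empty entries to the defaultdict for unknown terms.
        ret = ret & ind_not_stemmed.get(tok, set())
    return ret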

In [10]:
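def busca_stemmed(consulta):
    # Boolean AND search over the stemmed index.
    ret = set(range(nrow))
    toks = WordPunctTokenizer().tokenize(consulta.lower())
    # Stem the query terms the same way the index terms were stemmed.
    toks = [ps.stem(tok) for tok in toks]
    for tok in toks:
        # .get avoids adding empty entries to the defaultdict for unknown terms.
        ret = ret & ind_stemmed.get(tok, set())
    return ret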
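
The toy example below sketches why a stemmed index can retrieve more documents for the same query. It is self-contained and makes two loud assumptions: `crude_stem` is a hypothetical suffix stripper standing in for `PorterStemmer` (it is not the Porter algorithm), and queries are split on whitespace.

```python
from collections import defaultdict

def crude_stem(token):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_index(docs, normalize=lambda t: t):
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for tok in tokens:
            index[normalize(tok)].add(doc_id)
    return index

def search_and(query, index, normalize=lambda t: t):
    # Boolean AND: keep only documents that contain every query term.
    terms = [normalize(t) for t in query.lower().split()]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

toy_docs = [["smash", "melee"], ["smashing", "melee"], ["melee", "weapons"]]
toy_plain = build_index(toy_docs)
toy_stemmed = build_index(toy_docs, crude_stem)

print(sorted(search_and("smash melee", toy_plain)))                # [0]
print(sorted(search_and("smash melee", toy_stemmed, crude_stem)))  # [0, 1]
```

The second document is found only through the stemmed index, because "smashing" and "smash" collapse to the same term there.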

Testing the search

In [11]:
busca_not_stemmed("zelda")

{57344,
 8193,
 57346,
 40963,
 57345,
 57347,
 57348,
 57349,
 57350,
 57351,
 57352,
 40971,
 57354,
 57355,
 32782,
 57356,
 57357,
 57359,
 40978,
 8211,
 57360,
 8213,
 8214,
 57362,
 8216,
 57363,
 49178,
 57366,
 57367,
 57369,
 8222,
 57370,
 32,
 33,
 34,
 8225,
 8227,
 8229,
 38,
 8230,
 8232,
 57376,
 57377,
 57378,
 57379,
 57380,
 57381,
 8239,
 8240,
 57384,
 57385,
 57386,
 57387,
 57388,
 41014,
 57390,
 57391,
 8249,
 57393,
 57394,
 57395,
 57398,
 57399,
 57400,
 57401,
 57402,
 57403,
 57404,
 41028,
 57405,
 57406,
 57408,
 57409,
 57410,
 41034,
 57411,
 41036,
 41037,
 41038,
 57414,
 57415,
 73806,
 16470,
 106590,
 57439,
 188532,
 8316,
 155772,
 155777,
 131203,
 106630,
 41095,
 73866,
 188556,
 155789,
 139406,
 180366,
 139408,
 188557,
 229519,
 149,
 153,
 229531,
 157,
 229537,
 229539,
 229540,
 139429,
 229542,
 16552,
 229544,
 41130,
 229548,
 24749,
 188589,
 229549,
 229550,
 229552,
 229554,
 229555,
 229556,
 57525,
 229557,
 229559,
 229560,
 2

In [12]:
df.iloc[57344]

reviewerID                                           A35R8PJSEURHHF
asin                                                     B0009UBR3A
reviewerName                                           Aaron Merkel
helpful                                                     [3, 16]
reviewText        Graphics are bland and there is little detail ...
overall                                                           2
summary                                 When did Zelda become lame?
unixReviewTime                                           1167696000
reviewTime                                               01 2, 2007
Name: 57344, dtype: object

In [13]:
df.reviewText[57344]

"Graphics are bland and there is little detail in most areas.  I feel things just don't flow together at all either.  The story seems very choppy.  I search each map extensvely and sometimes there is just nothing there at all, why bother having huge area with nothing in it.  The music gets very tedious after awhile; everytime an enemy approaches you get the same annoying music. I expected a graphical and musical extravaganza, but received mediocre graphics and sound.  I guess I just expected more.  Dungeons are very aggrivating, until you figure out what to do.  I just don't see the fun in this game and that's unfortunate since I'm a fan of the Zelda series and it feels as this game was rushed."

Looks fine!

To compute precision and recall, consider the product "Super Smash Bros Melee" (asin = B00005Q8M0) and the query "smash melee". Every review of that product is treated as relevant.

In [14]:
relevantes = set(df[df.asin == "B00005Q8M0"].index)
len(relevantes)

225

In [15]:
retrieved_not_stemmed = busca_not_stemmed("smash melee")
len(retrieved_not_stemmed)

282

In [16]:
retrieved_stemmed = busca_stemmed("smash melee")
len(retrieved_stemmed)

325

In [17]:
precision_not_stemmed = len(relevantes & retrieved_not_stemmed) / len(retrieved_not_stemmed)
print("Precision without stemming:", round(precision_not_stemmed, 4))

Precision without stemming: 0.2021


In [18]:
recall_not_stemmed = len(relevantes & retrieved_not_stemmed) / len(relevantes)
print("Recall without stemming:", round(recall_not_stemmed, 4))

Recall without stemming: 0.2533


In [19]:
precision_stemmed = len(relevantes & retrieved_stemmed) / len(retrieved_stemmed)
print("Precision with stemming:", round(precision_stemmed, 4))

Precision with stemming: 0.1815


In [20]:
recall_stemmed = len(relevantes & retrieved_stemmed) / len(relevantes)
print("Recall with stemming:", round(recall_stemmed, 4))

Recall with stemming: 0.2622
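
The four cells above repeat the same two formulas, and the exercise asks for the difference in recall. The sketch below wraps the formulas in a helper and computes that difference on made-up sets; `precision_recall` and the `toy_` names are hypothetical and not part of the original notebook.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of document ids."""
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

toy_relevant = {1, 2, 3, 4}
p_plain, r_plain = precision_recall(toy_relevant, {1, 2, 9})
p_stem, r_stem = precision_recall(toy_relevant, {1, 2, 3, 8, 9})

print(round(r_stem - r_plain, 4))  # 0.25
print(round(p_stem - p_plain, 4))  # -0.0667
```

With the values printed above for the real data, recall rises from 0.2533 to 0.2622 with stemming (a gain of 0.0089), while precision falls from 0.2021 to 0.1815 (a loss of 0.0206).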


<hr>

### Exercise 2: Query expansion
Create equivalence groups for some search terms and compute the difference in recall, and possibly precision, between expanded and non-expanded queries. Hint: use verb tenses, pluralization, synonyms, etc.

<hr>

Consider the product "PlayStation 3 Dualshock 3 Wireless Controller (Black)" (asin = B0015AARJI), the query "Dualshock 3" and the equivalence group ["Dualshock 3", "DS3", "Playstation 3 Controller"]. In the code, the second variant is queried as "ds 3", that is, as two tokens.

In [21]:
relevantes = set(df[df.asin == "B0015AARJI"].index)
len(relevantes)

652

In [22]:
retrieved_not_expanded = busca_not_stemmed("dualshock 3")
len(retrieved_not_expanded)

360

In [23]:
retrieved_expanded = busca_not_stemmed("dualshock 3") | busca_not_stemmed("ds 3") | busca_not_stemmed("playstation 3 controller")
len(retrieved_expanded)

1811

In [24]:
precision_not_expanded = len(relevantes & retrieved_not_expanded) / len(retrieved_not_expanded)
print("Precision without expansion:", round(precision_not_expanded, 4))

Precision without expansion: 0.1611


In [25]:
recall_not_expanded = len(relevantes & retrieved_not_expanded) / len(relevantes)
print("Recall without expansion:", round(recall_not_expanded, 4))

Recall without expansion: 0.089


In [26]:
precision_expanded = len(relevantes & retrieved_expanded) / len(retrieved_expanded)
print("Precision with expansion:", round(precision_expanded, 4))

Precision with expansion: 0.0392


In [27]:
recall_expanded = len(relevantes & retrieved_expanded) / len(relevantes)
print("Recall with expansion:", round(recall_expanded, 4))

Recall with expansion: 0.1089
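
The expansion above is written out by hand as a union of three searches. A minimal sketch of the same idea as a reusable helper follows; `search_expanded`, `toy_postings` and `toy_search` are hypothetical names, and the toy search function simply looks up precomputed results so that the example runs on its own.

```python
def search_expanded(variants, search_fn):
    """Union of the results of search_fn over every variant in the group."""
    results = set()
    for variant in variants:
        results |= set(search_fn(variant))
    return results

# Made-up results for each variant of the equivalence group.
toy_postings = {
    "dualshock 3": {1, 2},
    "ds 3": {2, 5},
    "playstation 3 controller": {7},
}

def toy_search(query):
    return toy_postings.get(query, set())

group = ["dualshock 3", "ds 3", "playstation 3 controller"]
print(sorted(search_expanded(group, toy_search)))  # [1, 2, 5, 7]
```

In the notebook the helper would presumably be called as `search_expanded(group, busca_not_stemmed)`. For the real data, the values printed above show recall rising from 0.089 to 0.1089 with expansion (a gain of 0.0199), at the cost of precision dropping from 0.1611 to 0.0392.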


<hr>

### Exercise 3: Spell checking

Implement query expansion through spelling correction. Use the [Pyenchant](http://pythonhosted.org/pyenchant/) spell checker to make the corrections.

<hr>

I could not install Pyenchant on Windows, so I used https://github.com/barrust/pyspellchecker instead.

In [28]:
spell = SpellChecker()

In [29]:
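def busca_not_stemmed(consulta, check = False):
    ret = set(range(nrow))
    # The index holds lowercased tokens, so lowercase the query too.
    toks = WordPunctTokenizer().tokenize(consulta.lower())
    if check:
        # Recent pyspellchecker releases return None when no correction is
        # found, so fall back to the original token in that case.
        toks = [spell.correction(tok) or tok for tok in toks]
    for tok in toks:
        # .get avoids adding empty entries to the defaultdict for unknown terms.
        ret = ret & ind_not_stemmed.get(tok, set())
    return ret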

In [30]:
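def busca_stemmed(consulta, check = False):
    ret = set(range(nrow))
    toks = WordPunctTokenizer().tokenize(consulta.lower())
    if check:
        # Recent pyspellchecker releases return None when no correction is
        # found, so fall back to the original token in that case.
        toks = [spell.correction(tok) or tok for tok in toks]
    # Correct first, then stem, so the stemmer sees the corrected word.
    toks = [ps.stem(tok) for tok in toks]
    for tok in toks:
        ret = ret & ind_stemmed.get(tok, set())
    return ret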
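
A general-purpose dictionary may not know domain words such as game titles. As a hedged alternative, the sketch below corrects query tokens against the vocabulary of the index itself, using `difflib` from the standard library. This is a swapped-in technique, not what pyspellchecker does; `correct_with_vocabulary` and `toy_vocabulary` are hypothetical names.

```python
import difflib

def correct_with_vocabulary(token, vocabulary, cutoff=0.8):
    """Return the closest vocabulary word, or the token itself if none is close."""
    if token in vocabulary:
        return token
    close = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return close[0] if close else token

toy_vocabulary = {"zelda", "twilight", "princess"}
query = "zelda twiligth princess"
print([correct_with_vocabulary(t, toy_vocabulary) for t in query.split()])
# ['zelda', 'twilight', 'princess']
```

In the notebook the vocabulary could be taken from `ind_not_stemmed.keys()`, although scanning a large vocabulary with `difflib` for every query token is likely to be slow.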

Testing a query for the game "The Legend of Zelda: Twilight Princess" (note the misspelled "twiligth")

In [31]:
busca_not_stemmed("zelda twiligth princess", check = True) 

{637,
 674,
 761,
 6249,
 6251,
 6270,
 6328,
 10721,
 10872,
 10883,
 10884,
 10885,
 10888,
 10927,
 12034,
 12066,
 16778,
 16787,
 23639,
 35494,
 35500,
 35509,
 35559,
 35569,
 35577,
 35619,
 35631,
 35632,
 36703,
 43803,
 46720,
 46793,
 46806,
 48423,
 48798,
 57345,
 57347,
 57349,
 57350,
 57352,
 57355,
 57357,
 57359,
 57360,
 57362,
 57366,
 57367,
 57369,
 57372,
 57373,
 57376,
 57378,
 57380,
 57383,
 57384,
 57386,
 57389,
 57390,
 57393,
 57395,
 57398,
 57400,
 57402,
 57404,
 57405,
 57406,
 57409,
 57411,
 57996,
 58008,
 58024,
 58082,
 58084,
 58092,
 58185,
 58203,
 58241,
 58253,
 58282,
 58283,
 58284,
 58313,
 58315,
 58329,
 58336,
 58343,
 58386,
 58396,
 58829,
 63883,
 69072,
 69144,
 69168,
 69212,
 69248,
 69292,
 69294,
 69385,
 69389,
 69710,
 69711,
 69713,
 69714,
 69716,
 69721,
 69729,
 69733,
 69737,
 69749,
 69751,
 69752,
 69753,
 69760,
 69770,
 69772,
 69774,
 69778,
 69781,
 69786,
 69791,
 69794,
 69796,
 69808,
 69811,
 69812,
 69820,
 6

In [32]:
df.reviewText[637]

'Every once in a while I experience a piece of entertainment that is *so* overrated it\'s hard to believe I\'m even living on the same planet as those that actually believe this is supposed to be the best. The Legend of Zelda: Ocarina of Time is that very piece of overrated entertainment. Talk about a video game that not only doesn\'t live up to the hype (not even close) but it doesn\'t do a darn thing to impress me even *if* I were to ignore all the exaggerated hype. This is a serious letdown of a video game. But again, please understand I\'m not from the same planet as those that enjoy this game. I care about gameplay and replay value more than anything else. Apparently most Zelda fans believe puzzle-based dungeons and soulless overworlds are what defines the Zelda series.One thing that positively stinks is that the land of Hyrule, despite being presented as beautiful (and realistic- don\'t tell me Nintendo wasn\'t riding the 3D high in 1998 and going for realism here) it\'s basicall

<hr>

### Exercise 4: Phrase queries
Implement an inverted index that supports phrase queries, as defined in lecture 2.

<hr>

Building an index for phrase queries

In [33]:
ind = defaultdict(lambda: defaultdict(list))

for t_id, t in enumerate(not_stemmed):
    for tok_id, tok in enumerate(t):
        ind[tok][t_id].append(tok_id)

Testing

In [34]:
ind["zelda"]

defaultdict(list,
            {32: [37],
             33: [4],
             34: [206,
              266,
              279,
              324,
              351,
              359,
              427,
              432,
              465,
              511,
              560,
              586,
              668,
              728,
              760,
              775,
              793,
              803,
              857],
             38: [80],
             149: [38, 73, 246, 352],
             153: [397],
             157: [53],
             372: [23],
             373: [151],
             508: [34],
             527: [80],
             560: [477],
             619: [11, 37, 81, 102, 113, 125],
             621: [2, 10, 60],
             624: [19],
             626: [5, 13, 26, 82],
             627: [11, 58, 81, 229],
             629: [17, 37],
             630: [8, 93, 150],
             632: [268],
             634: [6, 140, 153],
             635: [6],
             637: [41,
 

In [35]:
df.reviewText[32]

'Great game! I love the storyline and graphics, as well as the fighting style. Minus the super long time it takes traveling the ocean, this game is a blast, and a must-have for any fan of the Zelda franchise.'

In [36]:
not_stemmed[32][37]

'zelda'

Search function:

In [37]:
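def busca_frase(consulta, check = False):
    # Phrase search. Returns one entry per occurrence of the phrase, so a
    # document appears more than once when the phrase repeats in it
    # (wrap the result in set() to deduplicate).
    ret = []

    # The index holds lowercased tokens, so lowercase the query too.
    toks = WordPunctTokenizer().tokenize(consulta.lower())
    if not toks:
        return ret

    if check:
        # Recent pyspellchecker releases return None when no correction is found.
        toks = [spell.correction(tok) or tok for tok in toks]

    # Candidate documents must contain every token of the phrase.
    # .get avoids inserting empty entries into the defaultdict index.
    first_token = ind.get(toks[0], {})
    matches = set(first_token)
    for tok in toks[1:]:
        matches.intersection_update(ind.get(tok, {}))

    for mat in matches:
        doc = not_stemmed[mat]
        for ft in first_token[mat]:
            # The phrase occurs at position ft if the tokens that follow line up.
            if doc[ft:ft + len(toks)] == toks:
                ret.append(mat)

    return ret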
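
The function above checks candidate positions against the tokenized documents. A common alternative is to use only the positional index and intersect position lists shifted by each token's offset in the phrase. The sketch below illustrates that variant on made-up documents; `build_positional_index`, `phrase_search` and the `toy_` names are hypothetical, and each matching document is returned once.

```python
from collections import defaultdict

def build_positional_index(docs):
    # term -> {doc_id -> [positions of the term in that document]}
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in enumerate(docs):
        for pos, tok in enumerate(tokens):
            index[tok][doc_id].append(pos)
    return index

def phrase_search(tokens, index):
    """Return the sorted ids of documents containing the tokens consecutively."""
    if not tokens or any(tok not in index for tok in tokens):
        return []
    # Documents that contain every token of the phrase.
    candidates = set(index[tokens[0]])
    for tok in tokens[1:]:
        candidates &= set(index[tok])
    found = []
    for doc_id in sorted(candidates):
        # Start positions of the first token that are still viable.
        starts = set(index[tokens[0]][doc_id])
        for offset, tok in enumerate(tokens[1:], start=1):
            starts &= {pos - offset for pos in index[tok][doc_id]}
        if starts:
            found.append(doc_id)
    return found

toy_docs = [
    "the legend of zelda".split(),
    "zelda the legend".split(),
    "legend of the zelda of legend".split(),
]
toy_pos_index = build_positional_index(toy_docs)

print(phrase_search(["legend", "of"], toy_pos_index))  # [0, 2]
print(phrase_search(["of", "zelda"], toy_pos_index))   # [0]
```

The membership test `tok not in index` does not create entries in the `defaultdict`, so searching for unknown words leaves the index unchanged.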

Testing the phrase search

In [38]:
busca_frase("the legend of zelda twilight princess")

[57347,
 74254,
 111137,
 57378,
 57378,
 57378,
 57378,
 193571,
 152613,
 225324,
 69168,
 57393,
 57402,
 57409,
 57409,
 57409,
 77894,
 69714,
 78420,
 69212,
 81511,
 157298,
 96884,
 211061,
 81527,
 78465,
 78979,
 57996,
 178325,
 134296,
 48798,
 69794,
 69796,
 93888,
 69827,
 69827,
 69827,
 69829,
 112837,
 69836,
 69839,
 69855,
 79583,
 58092,
 69888,
 69385,
 69903,
 129816,
 76083,
 130873,
 99666,
 99673,
 58203,
 99675,
 58241,
 121733,
 178587,
 83883,
 178618,
 178625,
 178625,
 178625,
 178625,
 76229,
 69072,
 178654,
 76260,
 83944,
 76265,
 78834]

In [39]:
df.reviewText[78834]

"With the successof the Nintendo Wii, many gamers though have complained about the sensor bar, because it sometimes falls down whenever get a bit too close, and that it is still wired to the Wii. Nyko, which has made some previous lackluster controllers for Nintendo products before has tried it again, with their Wireless Sensor Bar. It is easy to setup and connect with the Wii. Unfortunately, there has been a huge problem with the gameplay. At times, whenever I'm playing a game like The Legend Of Zelda Twilight Princess, my Wii controller stops working, and I have to change the batteries, and it affects the gameplay constantly. To be honest with you, I think Nintendo can make a better wireless sensor bar than this one. In the meantime, stick with the one you got with the Wii.Price: C+Convience: C 1/2-Setup: C+Overall: C"

<hr>

### Exercise 5: Hybrid query

Modify the solution above to return alternative results when the phrase yields no matches. For example, return documents that contain part of the phrase, or run a simple boolean search combining the words of the phrase.

<hr>

In [40]:
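def busca_ex_5(consulta, check = False):
    # Try the exact phrase first.
    ret = busca_frase(consulta, check)
    if len(ret) > 0:
        return ret
    else:
        # No phrase match: fall back to a boolean AND of the query words.
        # Note: this branch returns a set, while the phrase branch returns a list.
        return busca_not_stemmed(consulta, check)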
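
The exercise also mentions returning documents that contain only part of the phrase. One possible sketch of that alternative, not used in the solution above, ranks documents by how many distinct query tokens they contain. `partial_match` and `toy_index` are hypothetical names, and the index is a plain dictionary of made-up postings.

```python
from collections import Counter

def partial_match(tokens, index):
    """Rank documents by how many distinct query tokens they contain."""
    counts = Counter()
    for tok in set(tokens):
        for doc_id in index.get(tok, ()):
            counts[doc_id] += 1
    # Most matching tokens first; ties are broken by document id.
    return sorted(counts, key=lambda d: (-counts[d], d))

toy_index = {"zelda": {0, 1, 2}, "game": {1, 2}, "cube": {2, 5}}
print(partial_match(["zelda", "game", "cube"], toy_index))  # [2, 1, 0, 5]
```

Unlike the boolean AND fallback, this still returns something when no document contains every word of the query.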

Testing when the phrase returns results (same output as in exercise 4)

In [41]:
busca_ex_5("the legend of zelda twilight princess")

[57347,
 74254,
 111137,
 57378,
 57378,
 57378,
 57378,
 193571,
 152613,
 225324,
 69168,
 57393,
 57402,
 57409,
 57409,
 57409,
 77894,
 69714,
 78420,
 69212,
 81511,
 157298,
 96884,
 211061,
 81527,
 78465,
 78979,
 57996,
 178325,
 134296,
 48798,
 69794,
 69796,
 93888,
 69827,
 69827,
 69827,
 69829,
 112837,
 69836,
 69839,
 69855,
 79583,
 58092,
 69888,
 69385,
 69903,
 129816,
 76083,
 130873,
 99666,
 99673,
 58203,
 99675,
 58241,
 121733,
 178587,
 83883,
 178618,
 178625,
 178625,
 178625,
 178625,
 76229,
 69072,
 178654,
 76260,
 83944,
 76265,
 78834]

Testing when the phrase returns no results

In [42]:
busca_ex_5("the legend of zelda nintendo game cube")

{683,
 728,
 8870,
 8895,
 10890,
 10919,
 23559,
 23639,
 23652,
 23728,
 26974,
 35597,
 43203,
 46803,
 46814,
 57395,
 57402,
 58014}

In [43]:
df.reviewText[683]

"The Legend of Zelda: Ocarina of Time is, simply put, the best adventure game I have ever played. I'm a newcomeer to the Zelda series, so I'm not too sure how Ocarina rates with the others, but I love it. For an Nintendo 64, the graphics are absolutely beautiful. The lighting rivals that of a Playstation 2 or Xbox or Game Cube. The sound is terrific. Each different sound effect was done in detail. The sword against stone and the sword against wood are not the same as in a lot of games and sound just like they should. The echos when you enter the room and the monsters roaring in the distance are incredibly realistic and Link's footsteps are done perfectly. The puzzles are challenging, but not so that you want to through the controller down in frustration but so that you won't put the controller down until you've figured them out. The minigames and the sidequests will keep you playing long past the time you've beaten the game itself because they are so plentiful. The controls take some g