# Introduction to Information Retrieval

## Exercise list 1

<hr>

In [1]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
from nltk.tokenize import WordPunctTokenizer
import re 
from nltk.stem import PorterStemmer # https://pythonprogramming.net/stemming-nltk-tutorial/
from collections import defaultdict
from spellchecker import SpellChecker

## Dataset

The dataset is available at http://jmcauley.ucsd.edu/data/amazon/ and contains the text and ratings of reviews from Amazon's video game section.

In [2]:
import pandas as pd
import gzip

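def parse(path):
    # Yield one review (a dict) per line of the gzipped file.
    with gzip.open(path, 'rb') as g:
        for l in g:
            # eval() executes each line as Python, so only use it on trusted files.
            yield eval(l)

def getDF(path):
    # Collect the parsed reviews into a DataFrame indexed by position.
    df = {}
    for i, d in enumerate(parse(path)):
        df[i] = d
    return pd.DataFrame.from_dict(df, orient = 'index')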

df = getDF('reviews_Video_Games_5.json.gz')

In [3]:
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A2HD75EMZR8QLN,700099867,123,"[8, 12]",Installing the game was a struggle (because of...,1.0,Pay to unlock content? I don't think so.,1341792000,"07 9, 2012"
1,A3UR8NLLY1ZHCX,700099867,"Alejandro Henao ""Electronic Junky""","[0, 0]",If you like rally cars get this game you will ...,4.0,Good rally game,1372550400,"06 30, 2013"
2,A1INA0F5CWW3J4,700099867,"Amazon Shopper ""Mr.Repsol""","[0, 0]",1st shipment received a book instead of the ga...,1.0,Wrong key,1403913600,"06 28, 2014"
3,A1DLMTOTHQ4AST,700099867,ampgreen,"[7, 10]","I got this version instead of the PS3 version,...",3.0,"awesome game, if it did not crash frequently !!",1315958400,"09 14, 2011"
4,A361M14PU2GUEG,700099867,"Angry Ryan ""Ryan A. Forrest""","[2, 2]",I had Dirt 2 on Xbox 360 and it was an okay ga...,4.0,DIRT 3,1308009600,"06 14, 2011"


In [4]:
nrow = df.shape[0]

Example review

In [5]:
df.reviewText[3]

'I got this version instead of the PS3 version, which turned out to be a mistake. Console versions of games look 95 percent as good as their PC versions, but you do not have to deal with driver issues and the numerous things that can go wrong with windows. First off the installation takes about 30 minutes, which is ridiculous. I have never had a game take this long to load, Shift 2 took about 20 minutes which seemed too long also. Next many of the latest games for PC are forcing you to have an internet connection in order to install the game, regardless of whether you want to play only offline single player games. Shift 2 unleashed is also like this, so be forewarned. Internet requirements are not prominently displayed on the boxes. The game pushes you, but does not require you to, sign up for a Games For Windows Live account, which is required to get the game patches and updates. More time wasted signing up for that. Finally after about one hour the game was up and running , but the m

In [6]:
raw_text = df['reviewText']
len(raw_text)

231780

<hr>

### Exercise 1: Stemming and recall

Based on the inverted index built in practice session 1, compute the difference in recall with and without stemming when building the index.

<hr>

Lowercasing the text, removing URLs, punctuation, and stop words, and tokenizing

In [7]:
sw = set(stopwords.words('english'))  # a set makes the membership test below fast

In [8]:
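not_stemmed = []
# The third argument of maketrans lists characters to delete, so punctuation is removed (not replaced by spaces).
transtable = str.maketrans('', '', string.punctuation) # https://stackoverflow.com/questions/34860982/replace-the-punctuation-with-whitespace
tokenizer = WordPunctTokenizer()
url_re = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
for t in raw_text:
    t = url_re.sub(' ', t)            # drop URLs
    t = t.translate(transtable)       # drop punctuation
    t = re.sub(' +', ' ', t).strip()  # collapse repeated spaces
    # The stop-word test runs before lowercasing, so capitalized stop words ("The", "Of") are kept.
    txt = [token.lower() for token in tokenizer.tokenize(t) if token not in sw]
    not_stemmed.append(txt)

ps = PorterStemmer()

# Stemmed copy of every tokenized review
stemmed = []
for t in not_stemmed:
    txt = [ps.stem(token) for token in t]
    stemmed.append(txt)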
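
For comparison, the sketch below applies the same steps to a single string but lowercases before the stop-word test, so capitalized stop words are removed as well. It is self-contained and rests on assumptions that are not in the notebook: a tiny hypothetical `STOP_WORDS` set instead of NLTK's list, a simplified URL pattern, and a regex tokenizer in place of `WordPunctTokenizer`.

```python
import re
import string

# Hypothetical, tiny stop-word list for illustration (the notebook uses NLTK's English list).
STOP_WORDS = {"the", "of", "a", "is", "and", "to"}

URL_RE = re.compile(r"https?://\S+")  # simplified URL pattern (assumption)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text):
    """Lowercase, strip URLs and punctuation, tokenize, and drop stop words."""
    text = URL_RE.sub(" ", text.lower())
    text = text.translate(PUNCT_TABLE)
    tokens = re.findall(r"\w+", text)
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("The Legend of Zelda is great! See https://example.com/zelda"))
```

With this ordering "The" and "Of" disappear from the index, so a phrase query would have to drop stop words too before matching.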

Building the inverted indexes

In [9]:
# term -> set of ids of the reviews that contain it (unstemmed tokens)
ind_not_stemmed = defaultdict(set)
for tid, t in enumerate(not_stemmed):
    for term in t:
        ind_not_stemmed[term].add(tid)

In [10]:
# Same structure, built from the stemmed tokens
ind_stemmed = defaultdict(set)
for tid, t in enumerate(stemmed):
    for term in t:
        ind_stemmed[term].add(tid)

In [11]:
def busca_not_stemmed(consulta):
    # Boolean AND search: start from every document id and intersect the postings of each query token.
    ret = set(range(nrow))
    toks = WordPunctTokenizer().tokenize(consulta)
    for tok in toks:
        tmp = ind_not_stemmed[tok]
        ret = ret & tmp
    return ret

In [12]:
def busca_stemmed(consulta):
    # Same AND search, with the query tokens stemmed to match the stemmed index.
    ret = set(range(nrow))
    toks = WordPunctTokenizer().tokenize(consulta)
    toks = [ps.stem(tok) for tok in toks]
    for tok in toks:
        tmp = ind_stemmed[tok]
        ret = ret & tmp
    return ret
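
The exercise asks for the difference in recall, which requires a set of relevant documents. The notebook has no relevance judgments, so one possible approach (an assumption, not something the exercise defines) is to treat the union of both result sets as the relevant pool and measure each search against it. The helper names and the document ids below are hypothetical.

```python
def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)

def recall_difference(unstemmed_hits, stemmed_hits):
    """Recall of each search against the pooled results, plus the difference."""
    pool = set(unstemmed_hits) | set(stemmed_hits)  # assumed relevant set
    r_unstemmed = recall(unstemmed_hits, pool)
    r_stemmed = recall(stemmed_hits, pool)
    return r_unstemmed, r_stemmed, r_stemmed - r_unstemmed

# Toy ids: the stemmed search finds everything the unstemmed one does, plus two more.
unstemmed_hits = {1, 2, 3}
stemmed_hits = {1, 2, 3, 4, 5}
print(recall_difference(unstemmed_hits, stemmed_hits))
```

With the functions above, the same helper could be called as `recall_difference(busca_not_stemmed("zelda"), busca_stemmed("zelda"))`. Pooling overstates recall, because every retrieved document counts as relevant.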

Testing

In [13]:
busca_not_stemmed("zelda")

{57344,
 8193,
 57346,
 40963,
 57345,
 57347,
 57348,
 57349,
 57350,
 57351,
 57352,
 40971,
 57354,
 57355,
 32782,
 57356,
 57357,
 57359,
 40978,
 8211,
 57360,
 8213,
 8214,
 57362,
 8216,
 57363,
 49178,
 57366,
 57367,
 57369,
 8222,
 57370,
 32,
 33,
 34,
 8225,
 8227,
 8229,
 38,
 8230,
 8232,
 57376,
 57377,
 57378,
 57379,
 57380,
 57381,
 8239,
 8240,
 57384,
 57385,
 57386,
 57387,
 57388,
 41014,
 57390,
 57391,
 8249,
 57393,
 57394,
 57395,
 57398,
 57399,
 57400,
 57401,
 57402,
 57403,
 57404,
 41028,
 57405,
 57406,
 57408,
 57409,
 57410,
 41034,
 57411,
 41036,
 41037,
 41038,
 57414,
 57415,
 73806,
 16470,
 106590,
 57439,
 188532,
 8316,
 155772,
 155777,
 131203,
 106630,
 41095,
 73866,
 188556,
 155789,
 139406,
 180366,
 139408,
 188557,
 229519,
 149,
 153,
 229531,
 157,
 229537,
 229539,
 229540,
 139429,
 229542,
 16552,
 229544,
 41130,
 229548,
 24749,
 188589,
 229549,
 229550,
 229552,
 229554,
 229555,
 229556,
 57525,
 229557,
 229559,
 229560,
 2

In [14]:
df.iloc[57344]

reviewerID                                           A35R8PJSEURHHF
asin                                                     B0009UBR3A
reviewerName                                           Aaron Merkel
helpful                                                     [3, 16]
reviewText        Graphics are bland and there is little detail ...
overall                                                           2
summary                                 When did Zelda become lame?
unixReviewTime                                           1167696000
reviewTime                                               01 2, 2007
Name: 57344, dtype: object

In [15]:
df.reviewText[57344]

"Graphics are bland and there is little detail in most areas.  I feel things just don't flow together at all either.  The story seems very choppy.  I search each map extensvely and sometimes there is just nothing there at all, why bother having huge area with nothing in it.  The music gets very tedious after awhile; everytime an enemy approaches you get the same annoying music. I expected a graphical and musical extravaganza, but received mediocre graphics and sound.  I guess I just expected more.  Dungeons are very aggrivating, until you figure out what to do.  I just don't see the fun in this game and that's unfortunate since I'm a fan of the Zelda series and it feels as this game was rushed."

In [16]:
df[df.asin == "B0009UBR3A"] # filter by this product

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
57344,A35R8PJSEURHHF,B0009UBR3A,Aaron Merkel,"[3, 16]",Graphics are bland and there is little detail ...,2.0,When did Zelda become lame?,1167696000,"01 2, 2007"
57345,A1PID2MT8MMPAF,B0009UBR3A,"A. Griffiths ""Adrian""","[1, 1]",The new Zelda game is the last one for the Gam...,5.0,GREAT FUN...but not that original,1170633600,"02 5, 2007"
57346,A130YN8T37O833,B0009UBR3A,"Always Samsung ""ravereviews""","[0, 0]",..........I own both the GameCube and the Nint...,5.0,Addictive....BUT.........,1167264000,"12 28, 2006"
57347,A25JI9SJZ00WV6,B0009UBR3A,Alyssa,"[0, 0]",I bought The Legend of Zelda Twilight Princess...,5.0,One of the GREATEST games ever made!!,1342224000,"07 14, 2012"
57348,A38T1JFRUIG19W,B0009UBR3A,"Amazon Customer ""Don't Forget to Breath""","[1, 1]",This has to be the best Zelda game from the se...,5.0,Best Zelda Yet!,1170374400,"02 2, 2007"
...,...,...,...,...,...,...,...,...,...
57411,A3BN2M8R5QOFE5,B0009UBR3A,"Summer Paulus ""FL""","[0, 0]","After playing this game since Christmas, I fou...",5.0,"Awww, good times.",1168732800,"01 14, 2007"
57412,A1K9VDWEOWSV65,B0009UBR3A,Talitha Snyder,"[0, 6]",Link turns into a wolf?? I mean... My step dad...,1.0,link turns into a wolf???,1395792000,"03 26, 2014"
57413,A1Z0Z84FDWHTFI,B0009UBR3A,teburns81,"[0, 1]",Zelda and all it's other games in the series a...,5.0,Love this game,1360195200,"02 7, 2013"
57414,AROWZGGO4VTJU,B0009UBR3A,"The Tech Fanatic ""The Tech Fanatic""","[1, 1]",This game is one of those that stand out for m...,5.0,One of THE Best Games of this Console Generation,1370908800,"06 11, 2013"


<hr>

### Exercise 2: Query expansion
Create equivalence groups for some search terms and compute the difference in recall, and possibly precision, between expanded and non-expanded queries. Hint: use verb tenses, pluralization, synonyms, etc.
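
One way to approach this exercise, sketched under assumptions: each equivalence group is a hand-written set of related terms, a query term is expanded to its whole group, postings inside a group are combined with OR, and the groups are combined with AND. The toy documents, the `EQUIVALENCE_GROUPS` list, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical equivalence groups (plural forms, verb tenses, synonyms).
EQUIVALENCE_GROUPS = [
    {"game", "games"},
    {"play", "plays", "played", "playing"},
    {"great", "awesome", "excellent"},
]

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for tok in tokens:
            index[tok].add(doc_id)
    return index

def expand(term):
    """Return the equivalence group of a term, or the term alone."""
    for group in EQUIVALENCE_GROUPS:
        if term in group:
            return group
    return {term}

def search(index, query, expanded=False):
    """AND across query terms; OR inside each expanded group."""
    result = None
    for term in query.split():
        variants = expand(term) if expanded else {term}
        postings = set()
        for variant in variants:
            postings |= index.get(variant, set())
        result = postings if result is None else result & postings
    return result if result is not None else set()

docs = [
    ["great", "game"],
    ["awesome", "games"],
    ["played", "great", "games"],
    ["boring", "game"],
]
index = build_index(docs)
plain = search(index, "great game")
wide = search(index, "great game", expanded=True)
print(plain, wide)
```

Comparing `plain` and `wide` against a set of relevant ids gives the change in recall. Precision can drop when a group is too broad.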
    
<hr>

<hr>

### Exercise 3: Spell checking

Implement query expansion through spelling correction. Use the [Pyenchant](http://pythonhosted.org/pyenchant/) spell checker to make the corrections.

<hr>

I could not install Pyenchant, so I used https://github.com/barrust/pyspellchecker instead.

In [17]:
spell = SpellChecker()

In [18]:
def busca_not_stemmed(consulta, check = False):
    ret = set(range(nrow))
    toks = WordPunctTokenizer().tokenize(consulta)
    if check:
        # correction() may return None in recent pyspellchecker releases; keep the original token then.
        toks = [spell.correction(tok) or tok for tok in toks]
    for tok in toks:
        tmp = ind_not_stemmed[tok]
        ret = ret & tmp
    return ret

In [19]:
def busca_stemmed(consulta, check = False):
    ret = set(range(nrow))
    toks = WordPunctTokenizer().tokenize(consulta)
    if check:
        # Correct the spelling first, then stem the corrected tokens.
        toks = [spell.correction(tok) or tok for tok in toks]
    toks = [ps.stem(tok) for tok in toks]
    for tok in toks:
        tmp = ind_stemmed[tok]
        ret = ret & tmp
    return ret
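
A general-purpose dictionary may not know domain terms such as game titles. As an alternative sketch (not what the notebook does), query terms can be corrected against the index vocabulary itself with the standard library's `difflib.get_close_matches`. The `vocabulary` list, the `cutoff` value, and the function name are assumptions for illustration.

```python
import difflib

def correct_query(query, vocabulary, cutoff=0.8):
    """Replace each unknown query term with its closest vocabulary term, if any."""
    known = set(vocabulary)
    corrected = []
    for term in query.lower().split():
        if term in known:
            corrected.append(term)
            continue
        candidates = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
        # Keep the original term when nothing is similar enough.
        corrected.append(candidates[0] if candidates else term)
    return corrected

# Toy vocabulary; in the notebook it could be list(ind_not_stemmed.keys()).
vocabulary = ["zelda", "twilight", "princess", "legend"]
print(correct_query("twiligth princess", vocabulary))
```

`get_close_matches` compares the term against every vocabulary entry, so on a vocabulary as large as this dataset's it would likely need a pre-filter (for example by first letter or length).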

In [20]:
busca_not_stemmed("twiligth princess", check = True) # zelda twilight princess

{637,
 674,
 680,
 702,
 761,
 6249,
 6251,
 6270,
 6328,
 10721,
 10872,
 10883,
 10884,
 10885,
 10888,
 10927,
 12034,
 12066,
 16778,
 16787,
 23639,
 35494,
 35500,
 35509,
 35559,
 35569,
 35577,
 35619,
 35631,
 35632,
 36703,
 43803,
 46720,
 46793,
 46806,
 48315,
 48384,
 48423,
 48798,
 57345,
 57347,
 57349,
 57350,
 57352,
 57355,
 57357,
 57359,
 57360,
 57362,
 57366,
 57367,
 57369,
 57372,
 57373,
 57376,
 57378,
 57380,
 57383,
 57384,
 57386,
 57389,
 57390,
 57393,
 57395,
 57398,
 57400,
 57402,
 57404,
 57405,
 57406,
 57409,
 57411,
 57996,
 58008,
 58024,
 58082,
 58084,
 58092,
 58131,
 58170,
 58185,
 58193,
 58203,
 58241,
 58253,
 58275,
 58282,
 58283,
 58284,
 58309,
 58313,
 58315,
 58329,
 58336,
 58343,
 58349,
 58386,
 58396,
 58818,
 58829,
 63883,
 65340,
 69072,
 69144,
 69168,
 69171,
 69212,
 69248,
 69287,
 69292,
 69294,
 69371,
 69381,
 69385,
 69389,
 69710,
 69711,
 69713,
 69714,
 69716,
 69721,
 69729,
 69733,
 69734,
 69737,
 69749,
 69751

In [21]:
df.reviewText[637]

'Every once in a while I experience a piece of entertainment that is *so* overrated it\'s hard to believe I\'m even living on the same planet as those that actually believe this is supposed to be the best. The Legend of Zelda: Ocarina of Time is that very piece of overrated entertainment. Talk about a video game that not only doesn\'t live up to the hype (not even close) but it doesn\'t do a darn thing to impress me even *if* I were to ignore all the exaggerated hype. This is a serious letdown of a video game. But again, please understand I\'m not from the same planet as those that enjoy this game. I care about gameplay and replay value more than anything else. Apparently most Zelda fans believe puzzle-based dungeons and soulless overworlds are what defines the Zelda series.One thing that positively stinks is that the land of Hyrule, despite being presented as beautiful (and realistic- don\'t tell me Nintendo wasn\'t riding the 3D high in 1998 and going for realism here) it\'s basicall

<hr>

### Exercise 4: Phrase queries
Implement an inverted index that supports phrase queries, as defined in lecture 2.

<hr>

Building an index for phrase queries

In [22]:
ind = defaultdict(lambda: defaultdict(list))

for t_id, t in enumerate(not_stemmed):
    for tok_id, tok in enumerate(t):
        ind[tok][t_id].append(tok_id)

Testing

In [23]:
ind["zelda"]

defaultdict(list,
            {32: [20],
             33: [3],
             34: [138,
              173,
              180,
              206,
              223,
              229,
              270,
              273,
              295,
              321,
              348,
              363,
              412,
              446,
              463,
              472,
              486,
              491,
              522],
             38: [44],
             149: [20, 44, 145, 202],
             153: [226],
             157: [31],
             372: [16],
             373: [89],
             508: [22],
             527: [58],
             560: [253],
             619: [7, 20, 44, 59, 66, 72],
             621: [2, 7, 30],
             624: [12],
             626: [3, 9, 17, 50],
             627: [3, 25, 37, 129],
             629: [9, 21],
             630: [4, 52, 91],
             632: [159],
             634: [2, 75, 81],
             635: [3],
             637: [18,
             

In [24]:
df.reviewText[32]

'Great game! I love the storyline and graphics, as well as the fighting style. Minus the super long time it takes traveling the ocean, this game is a blast, and a must-have for any fan of the Zelda franchise.'

In [25]:
not_stemmed[32][20]

'zelda'

Padding each text with empty strings at the end, so that the phrase lookup does not index past the end of the token list

In [26]:
for i in range(len(not_stemmed)):
    for j in range(10):
        not_stemmed[i].append("")

In [27]:
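def busca_frase(consulta, check = False):

    ret = []

    toks = WordPunctTokenizer().tokenize(consulta)
    #toks = [tok for tok in toks if tok not in sw]

    if check:
        # correction() may return None in recent pyspellchecker releases; keep the original token then.
        toks = [spell.correction(tok) or tok for tok in toks]

    # Candidate documents: those containing every token after the first.
    matches = set(range(nrow))
    for tok in toks[1:]:
        matches = matches & ind[tok].keys()

    first_token = ind[toks[0]]

    for mat in matches:
        # .get avoids creating empty index entries for documents without the first token.
        for ft in first_token.get(mat, []):
            # Count the query tokens found at the expected offsets after position ft.
            # This relies on the 10-element padding; phrases longer than 11 tokens can overrun it.
            cont = 1
            for tok in range(1, len(toks)):
                if not_stemmed[mat][ft+tok] == toks[tok]:
                    cont += 1
            if cont == len(toks):
                ret.append(mat)  # appended once per matching occurrence, so ids can repeat

    return ret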
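
The same lookup can be done with the positional index alone, without re-reading the token lists and therefore without padding. The sketch below shows this on toy documents: it keeps the start positions of the first token and discards those whose following offsets do not hold the next tokens. The function names and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions]}"""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in enumerate(docs):
        for pos, tok in enumerate(tokens):
            index[tok][doc_id].append(pos)
    return index

def phrase_search(index, tokens):
    """Return the sorted ids of documents containing the tokens consecutively."""
    if not tokens or any(tok not in index for tok in tokens):
        return []
    # Documents that contain every token of the phrase.
    candidates = set(index[tokens[0]])
    for tok in tokens[1:]:
        candidates &= set(index[tok])
    hits = []
    for doc_id in sorted(candidates):
        # Valid start positions shrink as each following token is checked.
        starts = set(index[tokens[0]][doc_id])
        for offset, tok in enumerate(tokens[1:], start=1):
            following = set(index[tok][doc_id])
            starts = {s for s in starts if s + offset in following}
        if starts:
            hits.append(doc_id)
    return hits

docs = [
    "the legend of zelda twilight princess".split(),
    "twilight princess is a zelda game".split(),
    "princess of the twilight".split(),
]
index = build_positional_index(docs)
print(phrase_search(index, ["twilight", "princess"]))
```

Because each document is added at most once, the result has no repeated ids.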

In [28]:
busca_frase("the legend of zelda twilight princess")

[78465,
 78979,
 121733,
 69385,
 178325,
 129816,
 134296,
 69794,
 152613,
 69168,
 57393,
 76083,
 57402,
 93888,
 178625,
 178625,
 178625,
 178625,
 76229,
 77894,
 69072,
 178654,
 79583,
 81511,
 78834,
 157298,
 211061]

In [29]:
df.reviewText[78834]

"With the successof the Nintendo Wii, many gamers though have complained about the sensor bar, because it sometimes falls down whenever get a bit too close, and that it is still wired to the Wii. Nyko, which has made some previous lackluster controllers for Nintendo products before has tried it again, with their Wireless Sensor Bar. It is easy to setup and connect with the Wii. Unfortunately, there has been a huge problem with the gameplay. At times, whenever I'm playing a game like The Legend Of Zelda Twilight Princess, my Wii controller stops working, and I have to change the batteries, and it affects the gameplay constantly. To be honest with you, I think Nintendo can make a better wireless sensor bar than this one. In the meantime, stick with the one you got with the Wii.Price: C+Convience: C 1/2-Setup: C+Overall: C"

<hr>

### Exercise 5: Hybrid query

Modify the solution above to allow alternative answers when the phrase returns no results. For example, return documents that contain part of the phrase, or run a simple boolean search combining the words of the phrase.

<hr>

In [30]:
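def ex_5(consulta, check = False):
    # Try the exact phrase first; fall back to a boolean AND of the query terms if nothing matches.
    # Note: the phrase search returns a list, the boolean search returns a set.
    ret = busca_frase(consulta, check)
    if len(ret) > 0:
        return ret
    else:
        return busca_not_stemmed(consulta, check)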
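
The exercise also mentions returning documents that contain only part of the phrase. A possible extension, sketched on toy documents with hypothetical names: when neither the phrase nor the boolean AND returns anything, rank documents by how many distinct query terms they contain. For brevity the sketch scans the token lists directly instead of using an index.

```python
from collections import Counter

def contains_phrase(tokens, phrase):
    """True if phrase occurs as a consecutive run inside tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def hybrid_search(docs, query):
    """Return (strategy, doc_ids), trying phrase, then AND, then partial match."""
    phrase = query.lower().split()
    if not phrase:
        return "none", []
    exact = [i for i, toks in enumerate(docs) if contains_phrase(toks, phrase)]
    if exact:
        return "phrase", exact
    all_terms = [i for i, toks in enumerate(docs) if set(phrase) <= set(toks)]
    if all_terms:
        return "and", all_terms
    # Partial match: rank by the number of distinct query terms present.
    scores = Counter()
    for i, toks in enumerate(docs):
        overlap = len(set(phrase) & set(toks))
        if overlap:
            scores[i] = overlap
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return "partial", ranked

docs = [
    "the legend of zelda twilight princess".split(),
    "zelda on the nintendo game cube".split(),
    "a great nintendo game".split(),
]
print(hybrid_search(docs, "legend of zelda"))
print(hybrid_search(docs, "zelda game"))
print(hybrid_search(docs, "zelda wii game"))
```

Returning the strategy name alongside the ids lets the caller tell an exact phrase hit from a weaker fallback.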

In [31]:
ex_5("the legend of zelda nintendo game cube")

{728, 8895, 10890, 10919, 35597, 46814, 57402}

In [32]:
df.reviewText[728]

"Miyamoto slacked off on this one.1.This is certainly not the best game ever. Not even the best game on Nintendo 64. Super Mario 64, and Perfect Dark both stomp this game into submission.2.This is not a role-playing game. Role-playing games have a lot of text and character involvement. This game has a script that is about two pages long, maybe less. What you say or do in this game really has no effect on the games out come.3.This is not the best Zelda game. The Legend Of Zelda, Links Awakening, and A Link To The Past are all much better games, especially for their time.4.This game is not 256 Mega Bytes. It is actually 256 Mega Bits, which is equal to 32 Mega Bytes. So sorry the game isn't nearly the size of a 650/700 Mega Byte CD-ROM.5.The game has no replay value, I played through it once, then about a year later, I attempted playing through it again, I couldn't bare it, the game is just too boring. I played through Final Fantasy VII and VIII twice each, and those took alot longer tha