# Introduction to Information Retrieval

## Exercise list 1

<hr>

In [1]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
from nltk.tokenize import WordPunctTokenizer
import re 
from nltk.stem import PorterStemmer # https://pythonprogramming.net/stemming-nltk-tutorial/
from collections import defaultdict

## Dataset

The dataset is available at http://jmcauley.ucsd.edu/data/amazon/ and contains the text and ratings of reviews from the Video Games section of Amazon.

In [2]:
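import ast
import gzip

import pandas as pd

def parse(path):
  # Each line holds one review as a dict literal; ast.literal_eval parses it
  # without executing arbitrary code, unlike eval.
  with gzip.open(path, 'rb') as g:
    for l in g:
      yield ast.literal_eval(l.decode('utf-8'))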

def getDF(path):
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')

df = getDF('reviews_Video_Games_5.json.gz')

In [3]:
df.head()

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A2HD75EMZR8QLN | 700099867 | 123 | [8, 12] | Installing the game was a struggle (because of... | 1.0 | Pay to unlock content? I don't think so. | 1341792000 | 07 9, 2012 |
| 1 | A3UR8NLLY1ZHCX | 700099867 | Alejandro Henao "Electronic Junky" | [0, 0] | If you like rally cars get this game you will ... | 4.0 | Good rally game | 1372550400 | 06 30, 2013 |
| 2 | A1INA0F5CWW3J4 | 700099867 | Amazon Shopper "Mr.Repsol" | [0, 0] | 1st shipment received a book instead of the ga... | 1.0 | Wrong key | 1403913600 | 06 28, 2014 |
| 3 | A1DLMTOTHQ4AST | 700099867 | ampgreen | [7, 10] | I got this version instead of the PS3 version,... | 3.0 | awesome game, if it did not crash frequently !! | 1315958400 | 09 14, 2011 |
| 4 | A361M14PU2GUEG | 700099867 | Angry Ryan "Ryan A. Forrest" | [2, 2] | I had Dirt 2 on Xbox 360 and it was an okay ga... | 4.0 | DIRT 3 | 1308009600 | 06 14, 2011 |


Example review

In [4]:
df.reviewText[3]

'I got this version instead of the PS3 version, which turned out to be a mistake. Console versions of games look 95 percent as good as their PC versions, but you do not have to deal with driver issues and the numerous things that can go wrong with windows. First off the installation takes about 30 minutes, which is ridiculous. I have never had a game take this long to load, Shift 2 took about 20 minutes which seemed too long also. Next many of the latest games for PC are forcing you to have an internet connection in order to install the game, regardless of whether you want to play only offline single player games. Shift 2 unleashed is also like this, so be forewarned. Internet requirements are not prominently displayed on the boxes. The game pushes you, but does not require you to, sign up for a Games For Windows Live account, which is required to get the game patches and updates. More time wasted signing up for that. Finally after about one hour the game was up and running , but the m

In [5]:
raw_text = df['reviewText']
len(raw_text)

231780

<hr>

### Exercise 1: Stemming and recall.

Based on the inverted index built in practice session 1, compute the difference in recall with and without stemming when building the index.
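
Recall is the fraction of the relevant documents that a query actually retrieves, so the exercise comes down to computing that fraction once per index and subtracting. A minimal sketch, assuming a set of relevant document ids is available; the `relevant`, `retrieved_plain` and `retrieved_stemmed` sets below are made-up toy values, not results from this dataset:

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        raise ValueError("recall is undefined without relevant documents")
    return len(retrieved & relevant) / len(relevant)

# Toy document ids (hypothetical, for illustration only).
relevant = {1, 2, 3, 4}
retrieved_plain = {1, 2, 9}       # index built without stemming
retrieved_stemmed = {1, 2, 3, 9}  # index built with stemming

recall_plain = recall(retrieved_plain, relevant)      # 2 of 4 relevant docs
recall_stemmed = recall(retrieved_stemmed, relevant)  # 3 of 4 relevant docs
recall_gain = recall_stemmed - recall_plain
print(recall_plain, recall_stemmed, recall_gain)
```

With real data, the retrieved sets would come from the two search functions and the relevant set from some relevance judgment, which this notebook does not define.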

<hr>

Lowercasing, removing URLs, punctuation and stop words, and tokenizing the text

In [6]:
sw = stopwords.words('english')

In [7]:
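not_stemmed = []
# The third argument of str.maketrans lists characters to delete, so punctuation is removed.
transtable = str.maketrans('', '', string.punctuation) # https://stackoverflow.com/questions/34860982/replace-the-punctuation-with-whitespace
# Raw string avoids invalid-escape warnings; the space was dropped from the
# [!*\(\),] class so a URL match no longer runs on into the following words.
url_re = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
sw_set = set(sw)  # faster membership tests than the list
tokenizer = WordPunctTokenizer()
for t in raw_text:
    t = url_re.sub(' ', t)
    t = t.translate(transtable)
    t = re.sub(' +', ' ', t).strip()
    # Lowercase before the stop-word test so capitalized stop words ("The") are dropped too.
    txt = [token.lower() for token in tokenizer.tokenize(t) if token.lower() not in sw_set]
    not_stemmed.append(txt)

ps = PorterStemmer()

stemmed = []
for t in not_stemmed:
    txt = [ps.stem(token) for token in t]
    stemmed.append(txt)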
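
To make the effect of each preprocessing step visible, here is a small standard-library sketch applied to a single sentence. It is an approximation, not the notebook's pipeline: `STOP_WORDS` is a hypothetical, tiny stand-in for NLTK's English list, `URL_RE` is a simplified URL pattern, and `str.split()` roughly replaces `WordPunctTokenizer` once punctuation has been removed.

```python
import re
import string

# Hypothetical, tiny stop-word list; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "a", "is", "and", "it", "this"}
URL_RE = re.compile(r"https?://\S+")  # simplified URL pattern (assumption)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text):
    """Lowercase, drop URLs and punctuation, then remove stop words."""
    text = URL_RE.sub(" ", text.lower())
    text = text.translate(PUNCT_TABLE)
    # Whitespace splitting stands in for the NLTK tokenizer here.
    return [tok for tok in text.split() if tok not in STOP_WORDS]

tokens = preprocess("The new Zelda is GREAT, see https://example.com/review!")
print(tokens)
```

Lowercasing first matters: the stop-word list is lowercase, so a capitalized "The" would otherwise survive the filter.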

Building the inverted indexes

In [8]:
ind_not_stemmed = defaultdict(set)
for tid, t in enumerate(not_stemmed):
    for term in t:
        ind_not_stemmed[term].add(tid)

In [9]:
ind_stemmed = defaultdict(set)
for tid, t in enumerate(stemmed):
    for term in t:
        ind_stemmed[term].add(tid)
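
An inverted index maps each term to the set of ids of the documents that contain it. The same construction on a toy collection, as a sketch (the `build_index` helper and the three token lists are hypothetical and only illustrate the data structure):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for term in tokens:
            index[term].add(doc_id)
    return index

# Toy token lists (hypothetical), assumed to be already preprocessed.
docs = [["best", "zelda", "game"], ["zelda", "music"], ["racing", "game"]]
index = build_index(docs)
print(index["zelda"])  # ids of the documents mentioning "zelda"
print(index["game"])
```

Using sets as posting lists keeps each document id once per term and makes the union used by OR queries a single operation.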

In [24]:
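def busca_not_stemmed(consulta):
    ret = set()
    # Lowercase the query so it matches the lowercased index terms.
    toks = WordPunctTokenizer().tokenize(consulta.lower())
    for tok in toks:
        # .get avoids inserting empty entries into the defaultdict for unseen terms.
        ret |= ind_not_stemmed.get(tok, set())
    return ret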

In [31]:
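def busca_stemmed(consulta):
    ret = set()
    # Lowercase and stem the query the same way the indexed terms were processed.
    toks = WordPunctTokenizer().tokenize(consulta.lower())
    toks = [ps.stem(tok) for tok in toks]
    for tok in toks:
        # .get avoids inserting empty entries into the defaultdict for unseen terms.
        ret |= ind_stemmed.get(tok, set())
    return ret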
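
Both functions implement an OR query: a document is returned if it contains at least one query term. The toy sketch below shows why the stemmed index can retrieve more documents for the same query. The two indexes are hand-built, hypothetical examples in which the terms were stemmed by hand rather than by `PorterStemmer`.

```python
def search(index, query_terms):
    """Return the ids of documents containing at least one query term."""
    hits = set()
    for term in query_terms:
        hits |= index.get(term, set())
    return hits

# Toy indexes (hypothetical). Document 1 only mentions the plural "zeldas",
# which stemming is assumed to reduce to "zelda".
index_plain = {"zelda": {0}, "zeldas": {1}, "games": {2}}
index_stemmed = {"zelda": {0, 1}, "game": {2}}

hits_plain = search(index_plain, ["zelda"])
hits_stemmed = search(index_stemmed, ["zelda"])
print(hits_plain, hits_stemmed)
```

Because stemming merges inflected forms into one term, the stemmed result here is a superset of the unstemmed one, which is what raises recall.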

Testing

In [41]:
busca_not_stemmed("zelda")

{57344,
 8193,
 57345,
 40963,
 57346,
 57347,
 57348,
 57349,
 57350,
 57351,
 57352,
 40971,
 57354,
 57355,
 32782,
 57356,
 57357,
 57359,
 40978,
 8211,
 57360,
 8213,
 8214,
 57362,
 8216,
 57363,
 49178,
 57366,
 57367,
 57369,
 8222,
 57370,
 32,
 33,
 34,
 8225,
 8227,
 8229,
 38,
 8230,
 8232,
 57376,
 57377,
 57378,
 57379,
 57380,
 57381,
 8239,
 8240,
 57384,
 57385,
 57386,
 57387,
 57388,
 41014,
 57390,
 57391,
 8249,
 57393,
 57394,
 57395,
 57398,
 57399,
 57400,
 57401,
 57402,
 57403,
 57404,
 41028,
 57405,
 57406,
 57408,
 57409,
 57410,
 41034,
 57411,
 41036,
 41037,
 41038,
 57414,
 57415,
 73806,
 16470,
 106590,
 57439,
 188532,
 8316,
 155772,
 155777,
 131203,
 106630,
 41095,
 73866,
 188556,
 155789,
 139406,
 180366,
 139408,
 188557,
 229519,
 149,
 153,
 229531,
 157,
 229537,
 229539,
 229540,
 139429,
 229542,
 16552,
 229544,
 41130,
 229548,
 24749,
 188589,
 229549,
 229550,
 229552,
 229554,
 229555,
 229556,
 57525,
 229557,
 229559,
 229560,
 2

In [42]:
df.iloc[57344]

reviewerID                                           A35R8PJSEURHHF
asin                                                     B0009UBR3A
reviewerName                                           Aaron Merkel
helpful                                                     [3, 16]
reviewText        Graphics are bland and there is little detail ...
overall                                                           2
summary                                 When did Zelda become lame?
unixReviewTime                                           1167696000
reviewTime                                               01 2, 2007
Name: 57344, dtype: object

In [43]:
df.reviewText[57344]

"Graphics are bland and there is little detail in most areas.  I feel things just don't flow together at all either.  The story seems very choppy.  I search each map extensvely and sometimes there is just nothing there at all, why bother having huge area with nothing in it.  The music gets very tedious after awhile; everytime an enemy approaches you get the same annoying music. I expected a graphical and musical extravaganza, but received mediocre graphics and sound.  I guess I just expected more.  Dungeons are very aggrivating, until you figure out what to do.  I just don't see the fun in this game and that's unfortunate since I'm a fan of the Zelda series and it feels as this game was rushed."

In [46]:
df[df.asin == "B0009UBR3A"] # filter by this product

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 57344 | A35R8PJSEURHHF | B0009UBR3A | Aaron Merkel | [3, 16] | Graphics are bland and there is little detail ... | 2.0 | When did Zelda become lame? | 1167696000 | 01 2, 2007 |
| 57345 | A1PID2MT8MMPAF | B0009UBR3A | A. Griffiths "Adrian" | [1, 1] | The new Zelda game is the last one for the Gam... | 5.0 | GREAT FUN...but not that original | 1170633600 | 02 5, 2007 |
| 57346 | A130YN8T37O833 | B0009UBR3A | Always Samsung "ravereviews" | [0, 0] | ..........I own both the GameCube and the Nint... | 5.0 | Addictive....BUT......... | 1167264000 | 12 28, 2006 |
| 57347 | A25JI9SJZ00WV6 | B0009UBR3A | Alyssa | [0, 0] | I bought The Legend of Zelda Twilight Princess... | 5.0 | One of the GREATEST games ever made!! | 1342224000 | 07 14, 2012 |
| 57348 | A38T1JFRUIG19W | B0009UBR3A | Amazon Customer "Don't Forget to Breath" | [1, 1] | This has to be the best Zelda game from the se... | 5.0 | Best Zelda Yet! | 1170374400 | 02 2, 2007 |
| 57349 | A31RM5QU797HPJ | B0009UBR3A | Amazon Customer | [1, 3] | Same old Zelda, same old game play. If it's n... | 5.0 | An Adventure in Twilight | 1167350400 | 12 29, 2006 |
| 57350 | A33AJ1MSMEIA3C | B0009UBR3A | Anne | [5, 5] | Loved it. Fell short of Ocarina of Time (whic... | 5.0 | Almost as good as OOT . . . Almost. | 1197590400 | 12 14, 2007 |
| 57351 | A37XJZF145XH5B | B0009UBR3A | A. Trujillo | [0, 0] | This for me is the best Zelda adventure yet. I... | 5.0 | Best Zelda adventure yet! | 1200614400 | 01 18, 2008 |
| 57352 | A1T9PJBFBFKSGD | B0009UBR3A | Auburn Niewiadomski | [0, 0] | I'm not accustomed to the Legend of Zelda fran... | 5.0 | The Legend of Zelda - Twilight Princess | 1287273600 | 10 17, 2010 |
| 57353 | A1U9W1U7UFBZMN | B0009UBR3A | Blue Roman | [0, 0] | This is one of the best zeldas I have ever pla... | 5.0 | An awesome game | 1335312000 | 04 25, 2012 |
