# IIC-3670 NLP UC

- Library versions (Python 3.8.10):

- numpy 1.20.3
- nltk 3.7
- rank_bm25 0.2.2

In [1]:
# Requires the NLTK data package: nltk.download('product_reviews_1')
from nltk.corpus import product_reviews_1
camera_reviews = product_reviews_1.reviews('Canon_G3.txt')

In [2]:
review = camera_reviews[0]
review.sents()[0]

['i',
 'recently',
 'purchased',
 'the',
 'canon',
 'powershot',
 'g3',
 'and',
 'am',
 'extremely',
 'satisfied',
 'with',
 'the',
 'purchase',
 '.']

____________________________________________________________________________________________________________

## In-class activity

Build a **BM25** query engine that works over camera_reviews. To do so:

- Create the corpus from camera_reviews. Note that each review has several sentences. The corpus must contain one document per review, represented as a list of words.
- Preprocess the text so that the list of words representing each review is clean. Reuse the examples from class.
- Run the query *'best price'*. You may use the rank_bm25 library.
- Run the query *'quality'*. You may use the rank_bm25 library.
- Run the query *'best price and quality'*. You may use the rank_bm25 library.
- Show the reviews ordered by relevance (top 5).
- When you finish, let me know so I can award you an **L (logrado, i.e. achieved)**.
- Remember that L marks count as a bonus toward the final course grade.

***You have until the end of class.***
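
For reference, the score a BM25 engine assigns to each document can be sketched in plain Python. This is a minimal illustration of the common Okapi BM25 formula, not the rank_bm25 implementation: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, the smoothed IDF `log(1 + (N - n + 0.5) / (n + 0.5))`, and the toy corpus are all assumptions made for the example, and a library may handle IDF differently.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus` against a tokenized `query`."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF (assumed variant, always positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up corpus: one token list per document.
toy_corpus = [
    ["best", "price", "camera"],
    ["great", "picture", "quality"],
    ["battery", "life", "short"],
]
scores = bm25_scores(["best", "price"], toy_corpus)
```

Only the first toy document contains the query terms, so it is the only one with a nonzero score.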

_________________________________________________________________________________________________________________

# Solution

In [3]:
reviews = []

# Build one string per review by joining the tokens of all its sentences
for review in camera_reviews:
    sentences = []
    for sentence in review.sents():
        text = " ".join(sentence)
        sentences.append(text)
    document = " ".join(sentences)
    reviews.append(document)
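
Since each sentence is already a list of tokens, a review could also be flattened in a single step. The sketch below shows that idea on made-up data: `toy_review_sents` stands in for the result of `review.sents()`, and the helper name `flatten_review` is hypothetical.

```python
from itertools import chain

def flatten_review(sentences):
    """Concatenate a list of tokenized sentences into one flat token list."""
    return list(chain.from_iterable(sentences))

# Made-up stand-in for review.sents(): a list of token lists.
toy_review_sents = [
    ["i", "recently", "purchased", "the", "canon", "powershot", "g3", "."],
    ["the", "camera", "is", "easy", "to", "use", "."],
]

flat_tokens = flatten_review(toy_review_sents)

# Joining the flat list gives the same string as joining sentence by sentence.
joined = " ".join(flat_tokens)
joined_by_sentence = " ".join(" ".join(s) for s in toy_review_sents)
```

Both approaches produce the same document string, so the choice is mostly a matter of readability.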


In [4]:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer, sent_tokenize
from nltk.stem import WordNetLemmatizer

# Load stop words (requires nltk.download('stopwords'))
stop_words = set(stopwords.words('english'))

# Initialize tokenizer: keep runs of letters and apostrophes
tokenizer = RegexpTokenizer(r"['a-zA-Z]+")

# Initialize lemmatizer (requires nltk.download('wordnet'))
# A stemmer, or a stemmer combined with a lemmatizer, could also be tried
lemmatizer = WordNetLemmatizer()

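def tokenize(document):
    """Return lowercased, lemmatized tokens, without stop words or tokens shorter than 3 characters."""
    words = []

    # sent_tokenize requires nltk.download('punkt')
    for sentence in sent_tokenize(document):
        for t in tokenizer.tokenize(sentence):
            token = t.lower()
            if token not in stop_words and len(token) > 2:
                words.append(lemmatizer.lemmatize(token))

    return words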
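
To see what the tokenizer pattern keeps and drops, the same character class can be tried with the standard `re` module. This is only an approximation of the cell above: the tiny stop-word set `toy_stop_words` is a made-up stand-in for NLTK's English list, the helper name `simple_tokenize` is hypothetical, and no lemmatization is applied.

```python
import re

# Same character class as the tokenizer above: runs of letters and apostrophes.
TOKEN_PATTERN = re.compile(r"['a-zA-Z]+")

# Hypothetical, tiny stop-word set used only for this illustration.
toy_stop_words = {"i", "the", "and", "am", "with"}

def simple_tokenize(text):
    """Lowercase, split on the pattern, drop stop words and tokens under 3 characters."""
    tokens = (t.lower() for t in TOKEN_PATTERN.findall(text))
    return [t for t in tokens if t not in toy_stop_words and len(t) > 2]

# Digits are not part of the pattern, so "g3" is cut down to "g".
raw = TOKEN_PATTERN.findall("the canon powershot g3")

clean = simple_tokenize(
    "I recently purchased the Canon PowerShot G3 and am extremely satisfied."
)
```

Because the pattern excludes digits, the model name "g3" is reduced to "g" and then removed by the length filter, which is why it does not appear in the printed results further down.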

In [5]:
corpus = []

for review in reviews:
    document = tokenize(review)
    corpus.append(document)


In [6]:
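from rank_bm25 import BM25Okapi

# Index the tokenized corpus
bm25 = BM25Okapi(corpus)
query = "best price"
tokenized_query = tokenize(query)

# One BM25 score per document, in corpus order
doc_scores = bm25.get_scores(tokenized_query)
# The five highest-scoring documents, most relevant first
result = bm25.get_top_n(tokenized_query, corpus, n=5)

for res in result:
    print(' '.join(res))
    print('')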

recent price drop made best bargain digital camera currently available advanced photobugs find creative control imaginable newbie find full auto setting give perfect picture right box megapixels enough anybody photo quality awesome get fooled megapixel marketing hype unless want print mural need used camera find comfortable friendly use anyone looking point shoot make huge step moderate price difference extended zoom range faster lense put top class expect please year come

canon improves almost way fact beat nikon coolpix performance picture quality battery life amazing megapixel camera canon megapixel camera canon optic better believe processing algorithm also better simply canon best digital camera today price point canon allows change lens accepts ibm microdrive type compact flash gigabyte storage fine resolution setting maximum close add image gig card battery life camera twice nikon better anything else seen minor nit camera fairly boxy looking need wrist strap instead neck strap
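
`get_top_n` returns the documents themselves, not their scores or their positions in the corpus. One possible way to recover both is to sort the indices of the score list. The sketch below uses made-up values in `toy_scores` in place of `doc_scores`, and the helper name `top_n_indices` is hypothetical.

```python
def top_n_indices(scores, n=5):
    """Return (index, score) pairs for the n highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:n]]

# Made-up scores, one per document, standing in for doc_scores.
toy_scores = [0.0, 2.4, 0.7, 3.1, 0.0, 1.2]
top3 = top_n_indices(toy_scores, n=3)

for index, score in top3:
    print(f"review {index}: score {score:.2f}")
```

With the notebook's variables, the returned indices could be used to look up the untokenized text in `reviews`, which may be easier to read than the cleaned token lists.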

In [7]:
query = "quality"
tokenized_query = tokenize(query)

doc_scores = bm25.get_scores(tokenized_query)
result = bm25.get_top_n(tokenized_query, corpus, n=5)

for res in result:
    print(' '.join(res))
    print('')

recently purchased canon powershot extremely satisfied purchase camera easy use fact recent trip past week asked take picture vacationing elderly group took picture camera offered take picture told press halfway wait box turn green press rest way fired away picture turned quite nicely picture thusfar work constituants owned highly recommended canon picture quality easily enlarging picture visable loss picture quality even using best possible setting yet super fine ensure get larger flash selling larger flash pinch quickly want larger flash card camera bottom line well made camera easy use flexible powerful feature include ability use external flash lense filter choice highly recommend camera anyone looking excellent quality picture combination ease use flexibility get advanced many option adjust like great job canon

using six week proven advertised hand comparison nikon coolpix sony dsc lack quality feel feature ultimately chose outstanding image quality resolution coloration superior

In [8]:
query = "best price and quality"
# 'and' is a stop word, so tokenize() should reduce this query to: best, price, quality
tokenized_query = tokenize(query)

doc_scores = bm25.get_scores(tokenized_query)
result = bm25.get_top_n(tokenized_query, corpus, n=5)

for res in result:
    print(' '.join(res))
    print('')

recent price drop made best bargain digital camera currently available advanced photobugs find creative control imaginable newbie find full auto setting give perfect picture right box megapixels enough anybody photo quality awesome get fooled megapixel marketing hype unless want print mural need used camera find comfortable friendly use anyone looking point shoot make huge step moderate price difference extended zoom range faster lense put top class expect please year come

canon improves almost way fact beat nikon coolpix performance picture quality battery life amazing megapixel camera canon megapixel camera canon optic better believe processing algorithm also better simply canon best digital camera today price point canon allows change lens accepts ibm microdrive type compact flash gigabyte storage fine resolution setting maximum close add image gig card battery life camera twice nikon better anything else seen minor nit camera fairly boxy looking need wrist strap instead neck strap
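
The two reviews printed here are the same ones printed for the *'best price'* query. One contributing factor is that BM25 sums a separate contribution for each query term, so adding a term to the query only adds that term's contribution to each document's score. The sketch below illustrates this additivity with a compact scorer; the function `score`, its defaults `k1=1.5` and `b=0.75`, the smoothed IDF, and the toy corpus are assumptions for the example, not the rank_bm25 internals.

```python
import math
from collections import Counter

def score(query, doc, corpus, k1=1.5, b=0.75):
    """Compact Okapi-style BM25 score of one document (illustrative only)."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc)
    total = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        total += idf * tf[term] * (k1 + 1) / norm
    return total

# Made-up corpus: one token list per document.
toy_corpus = [
    ["best", "price", "quality", "camera"],
    ["picture", "quality", "awesome"],
    ["battery", "life"],
]
doc = toy_corpus[0]

combined = score(["best", "price", "quality"], doc, toy_corpus)
separate = score(["best", "price"], doc, toy_corpus) + score(["quality"], doc, toy_corpus)
```

In this toy setup the combined query scores the document as the sum of the two shorter queries, so a document that already ranks well for *'best price'* keeps that advantage when *'quality'* is added.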