In [1]:
import os
import random
import numpy as np
import pandas as pd
from functools import partial
from collections import defaultdict
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer
from utils import show_example, process_doc
from gensim import models, corpora
from datasets import load_dataset

pd.options.display.max_colwidth = 500
N_SAMPLES = 11_000
N_TOPICS = 10
MIN_DOC_LEN = 10
np.random.seed(0)
random.seed(0)

In [2]:
# reviews = load_dataset('app_reviews')['train']['review']
reviews = load_dataset('amazon_polarity')['test']['content']
reviews = random.sample(
    list(filter(lambda x: len(x.split(' ')) > MIN_DOC_LEN, reviews)),
    k=N_SAMPLES
)

Reusing dataset amazon_polarity (/home/honza/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/ac31acedf6cda6bc2aa50d448f48bbad69a3dd8efc607d2ff1a9e65c2476b4c1)


In [3]:
def remove_unique(docs):
    frequency = defaultdict(int)
    for doc in docs:
        for token in doc:
            frequency[token] += 1
            
    return [
        [token for token in doc if frequency[token] > 1]
        for doc in docs
    ]
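
As a quick check of `remove_unique`, here is a minimal sketch on a toy corpus. The three token lists are made up for illustration and are not taken from the dataset. Tokens that occur only once across the whole corpus are dropped, while document order and token order are kept.

```python
from collections import defaultdict


def remove_unique(docs):
    # Count how often each token occurs across the whole corpus.
    frequency = defaultdict(int)
    for doc in docs:
        for token in doc:
            frequency[token] += 1

    # Keep only tokens that were seen more than once in total.
    return [
        [token for token in doc if frequency[token] > 1]
        for doc in docs
    ]


# Toy corpus, invented for this example.
toy_docs = [
    ['great', 'battery', 'life'],
    ['battery', 'died', 'fast'],
    ['great', 'sound'],
]
filtered = remove_unique(toy_docs)
print(filtered)
```

A document made up entirely of one-off tokens would come back as an empty list, so some documents may end up with an empty bag-of-words vector later on.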

In [4]:
docs = [process_doc(doc) for doc in reviews]

tfidf = TfidfVectorizer()
tfidf.fit([' '.join(doc) for doc in docs])

docs = remove_unique(docs)
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=N_TOPICS)
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=N_TOPICS)

In [5]:
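def transform_doc(doc, model, dictionary):
    # Convert raw text to a bag-of-words vector over the training vocabulary.
    vec_sparse = dictionary.doc2bow(process_doc(doc))

    if isinstance(model, models.LdaModel):
        topics = model.get_document_topics(
            vec_sparse, minimum_probability=0
        )

    elif isinstance(model, models.LsiModel):
        topics = model[vec_sparse]

    else:
        raise ValueError(f'Unknown model {type(model)}')

    # The model may return fewer than num_topics (topic_id, weight) pairs,
    # e.g. for a document with no known tokens, so scatter the pairs into a
    # fixed-length vector to keep every embedding the same size.
    vec_dense = np.zeros(model.num_topics)
    for topic_id, weight in topics:
        vec_dense[topic_id] = weight

    return vec_dense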

In [6]:
review_embeddings = np.vstack(
    [
        transform_doc(doc, lsi, dictionary)
        for doc in reviews
    ]
)

In [7]:
knn = NearestNeighbors(n_neighbors=40, metric='cosine')
knn.fit(review_embeddings)

NearestNeighbors(metric='cosine', n_neighbors=40)
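
For intuition about `metric='cosine'`, here is a minimal NumPy sketch of the distance a brute-force query would compute: one minus the cosine similarity between the query and each stored vector. The function name `cosine_distances_to` and the toy embeddings are assumptions made for this example. It is not how scikit-learn implements the search.

```python
import numpy as np


def cosine_distances_to(query, embeddings):
    # Cosine distance = 1 - cosine similarity.
    # Assumes neither the query nor any row is an all-zero vector.
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return 1.0 - e @ q


# Made-up 3-dimensional "embeddings" for illustration.
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as row 0, different length
    [0.0, 1.0, 0.0],  # orthogonal to row 0
    [1.0, 1.0, 0.0],  # 45 degrees from row 0
])
dist = cosine_distances_to(toy_embeddings[0], toy_embeddings)
order = np.argsort(dist, kind='stable')  # nearest first
print(dist)
print(order)
```

Because the vectors are normalised first, only direction matters: rows 0 and 1 are at distance 0 from each other even though their lengths differ.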

In [11]:
show_example(
    i=0,
    docs=reviews,
    p_transform_doc=partial(
        transform_doc,
        model=lsi,
        dictionary=dictionary,
    ),
    knn=knn,
)

text,cosine
I believe this product is helping me lose weight. You cannot take it and not change your lifestyle and expect to lose but if you are eating healthy and working out this gives you a boost. The only thing I do not like about it is it does increase my night time hot flashes immensely but they go away as soon as I stop taking it. I ordered more because I decided I could deal with that while losing weight. To me this increased heat means it is doing what it says it does. I definitely lose more quickly with this than without.,0.003345
"I've just mounted it up and tried it out for the first time. It's hard to judge thus far because I haven't received my whip antenna yet, but a functioning cb with PA cabilities for $40 is tough to beat. The construction is a little cheap, so is the mounting hardware, but it functions like it's supposed to and install is about as straight forward as it comes. It's a smaller unit too so if you want to throw it in a cramped cockpit you should be able to find a spot.",0.006544
I've had an older version of this breadmaker and use it all the time. I especially like that it's programmable. The person I purchased the new one for loves it. I would recommend this breadmaker to anyone looking to purchase one.,0.006555
"Had to return my first copy (the DVD froze after 22 mins and chopped out about 2 mins of the feature) - Amazon very kindly replaced the set (and charged me $11 more for freight this time, instead of $3 as previous) ... I returned my faulty copy to Amazon Returns (I was reimbused @ $3 freight - it actually cost me more than $12 to send).The replacement set arrived this morning and, guess what? The DVD disc gets about 6 mins in this time - and FREEZES again.Quality problems like this means it's a huge disappointment.No replacement set this time, thanks Amazon - just a total refund.How many other crap quality sets of this long awaited Karloff 'classic' are out there like this???BUYER BEWARE !!!",0.007561
"I bought a set of these clubs for each of my sons last year. A couple of weeks later, the kids were using their clubs at the driving range, and the head of one of the irons came off the club in the middle of a backswing! Fortunately, it didn't hit anyone. Then a few months later, the head came off one of the putters while my son was practicing. We have glued the heads back on, and the boys are still using the clubs, but we are disappointed with the quality of these clubs. Also, the wood is too long, even for my 7 year old son. The reason I gave it two stars instead of one is because I like the bag - at least that part of the set has worked out.",0.009199
"The family had gathered round, we were all set for a sparkling night's entertaiment. I turned off the lights and pressedthe ""Play"" button. Nothing. Nothing!! Then a message to the effect that the disc wouldnot play in my region. Nothing on the Amazon site oron the blueray disc or packaging to indicate there might be a problem. I have bought many DVDs and a few bluerays. AmaZon are well aware that I live in New Zealand. In New Zealand law it is the retailersresponsibility to ensure that the goods they sellare fit for their intended purpose.Amazon won't let me award the zero stars I would like to give.",0.009675
"Having a Dremel is worth nothing if you don't have the right attachments to get the job done. I purchased this kit at Wally World to satisfy my need for vengeance. Did I say vengeance? I meant... ummm.. never mind. The savings here is enormous. If I ever run out of anything I'll just buy another one of these and sell off what I don't need, huge profits are to be had,My one suggestion to Dremel is to make the case a little more, well, square. It has to be stored on top of everything else because of the odd shape. Now, the handle, and holes for frequently used attachments is nice and all, but it does get in the way of storing the thing. And having the storage compartments split in two like it is makes it a little awkward to get the tools you need.Either way, this set is perfect for any Dremel owner, you never know when you're going to need one of the twenty or so tools included.",0.01219
"I got this for my son after seeing all of the commercials on tv and saw how ""cool"" it was. We got it home, put it together, and started spraying all over the back deck, the house and the window. It came out like silly string so I figured it would dissipate. The web shooter only lasted about a minute. We bought an extra one and were finished in less than 5 minutes total. Very disappointing. It has been about 4 weeks and our deck still has blue goo all over it. We have to get the deck redone because when we try to clean it up, the finish comes off. It is still on the house and the door. I can only imagine what it would do to a car. Will not buy again. Fun for about 5 min and that's it.",0.01426
"We have slept on this mattress for a week now and no more backaches! My husband and I were used to getting up 4-5 times a night and then feeling very tired in the morning, we are now sleeping through the night and both feel rested when morning comes. It is a awesome mattress! We received the mattress in 3 days after we ordered it. It took 1 1/2 hours for the mattress to take its shape. The chemical smell was strong so we have left the covers off during the day so the smell could escape and that has helped so now that we can barely detect it I am now ready to make the bed everyday! We almost bought a Tempur-pedic mattress but we are happy we chose this mattress as we like this one so much better for the price! You will not go wrong purchasing this mattress, you will love it!",0.014288
"I have had my system for over a year and it has been very reliable -- to keep the system speed up, I added more memory and clean the system -- reinstall the operating system ever 6 months -- it's all the extra programs a person adds over time that slow ANY computer down.The battery last about 1.5 hours and still keeps a chargeI like the restart aspect of the systemWireless works 100% of the time -- I take the unit out on my deck about 100 feet and it still picks up well",0.014599
