# Sanity Checks within KNN

This notebook shows a few ways to inspect what kNN is actually doing on the IMDb data and where problems can occur.

In [74]:
from joblib import load
from sklearn.neighbors import NearestNeighbors
import pandas as pd
import numpy as np
from tqdm import tqdm
from knodle.trainer.utils.denoise import get_majority_vote_probs

In [43]:
movie_reviews = pd.read_csv('../../tutorials/ImdbDataset/imdb_data_preprocessed.csv')

In [44]:
movie_reviews

Unnamed: 0,review,sentiment,reviews_preprocessed,label_id
0,One of the other reviewers has mentioned that ...,positive,One reviewers mentioned watching just 1 Oz epi...,1
1,A wonderful little production. <br /><br />The...,positive,A wonderful little production. The filming tec...,1
2,I thought this was a wonderful way to spend ti...,positive,I thought wonderful way spend time hot summer ...,1
3,Basically there's a family where a little boy ...,negative,Basically there's family little boy (Jake) thi...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"Petter Mattei's ""Love Time Money"" visually stu...",1
...,...,...,...,...
49995,I thought this movie did a down right good job...,positive,I thought movie did right good job. It wasn't ...,1
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,"Bad plot, bad dialogue, bad acting, idiotic di...",0
49997,I am a Catholic taught in parochial elementary...,negative,I Catholic taught parochial elementary schools...,0
49998,I'm going to have to disagree with the previou...,negative,I'm going disagree previous comment Maltin one...,0


In [6]:
tfidf = load('../../tutorials/ImdbDataset/tfidf.lib')

In [222]:
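# Querying the training data itself returns each point as its own first neighbor,
# so n_neighbors=5 yields the point plus its 4 closest other reviews.
neighbors = NearestNeighbors(n_neighbors=5, n_jobs=-1).fit(tfidf)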

In [223]:
distances, indices = neighbors.kneighbors(tfidf)

In [224]:
distances

array([[0.        , 1.12913239, 1.14412834, 1.15702102, 1.16699781],
       [0.        , 1.32093838, 1.33250982, 1.3380306 , 1.33981033],
       [0.        , 1.19229358, 1.22969181, 1.23894052, 1.25502857],
       ...,
       [0.        , 1.19657199, 1.23502444, 1.25436794, 1.28945385],
       [0.        , 1.2606487 , 1.28046746, 1.28632515, 1.29973308],
       [0.        , 1.17322008, 1.2050353 , 1.20907236, 1.22807817]])

In [225]:
indices

array([[    0, 33546, 34461, 25999, 42867],
       [    1, 16764, 24062, 33664, 34279],
       [    2, 30202,   647, 28846, 48161],
       ...,
       [49997, 49844,  5356, 11953, 34663],
       [49998, 17635, 31421,    76, 27825],
       [49999, 41329, 11565,  6385, 49300]])
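Two quick checks help when reading these arrays. First, every row should list the query point itself as its first neighbor with distance 0. Second, if the tf-idf rows are L2-normalized (the `TfidfVectorizer` default, which is an assumption about how `tfidf.lib` was built), a Euclidean distance `d` maps to a cosine similarity of `1 - d**2 / 2`, so a distance of about 1.13 would correspond to a similarity of roughly 0.36. The sketch below uses a made-up toy corpus, and all `toy_*` names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy corpus, only for illustration
docs = [
    "a wonderful light comedy with witty dialogue",
    "witty dialogue and a wonderful cast",
    "bad plot and bad acting",
    "idiotic plot with terrible acting",
]

# TfidfVectorizer L2-normalizes each row by default (norm="l2")
toy_tfidf = TfidfVectorizer().fit_transform(docs)
toy_nn = NearestNeighbors(n_neighbors=2).fit(toy_tfidf)
toy_distances, toy_indices = toy_nn.kneighbors(toy_tfidf)

# Check 1: each point is its own nearest neighbor
self_is_first = np.array_equal(toy_indices[:, 0], np.arange(len(docs)))

# Check 2: for unit-length vectors, d^2 = 2 - 2 * cos, so cos = 1 - d^2 / 2
cosine_sim = 1.0 - toy_distances ** 2 / 2.0

print(self_is_first)
print(cosine_sim.round(3))
```

If the vectors are not unit length, the conversion does not hold and the raw distances have to be interpreted on their own.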

## Look at neighbors

Here are some examples of the similar texts that kNN finds when it compares reviews by their tf-idf vectors.

In [226]:
for index, row in movie_reviews.loc[indices[2]].iterrows():
    print("===")
    print("Sentiment: {}".format(row.sentiment))
    print(row.reviews_preprocessed)

===
Sentiment: positive
I thought wonderful way spend time hot summer weekend, sitting air conditioned theater watching light-hearted comedy. The plot simplistic, dialogue witty characters likable (even bread suspected serial killer). While disappointed realize Match Point 2: Risk Addiction, I thought proof Woody Allen fully control style grown love.This I'd laughed Woody's comedies years (dare I say decade?). While I've impressed Scarlet Johanson, managed tone "sexy" image jumped right average, spirited young woman.This crown jewel career, wittier "Devil Wears Prada" interesting "Superman" great comedy friends.
===
Sentiment: positive
Great Woody Allen? No. Good Woody Allen? Definitely. I myself, audience attendance, laughing hard best Woody Allen lines we've heard while. The aging Allen created appropriate role Scarlett Johansson's "father" ... well, sort of. Some said Johansson plays "a young Dianne Keaton." I beg differ. She plays Woody's dialogue, which, comedies, similar feel...l

## Majority Vote

- Next, we check whether the final majority-vote prediction changes once the rule matches are denoised.

In [227]:
rule_matches_z = load('../../tutorials/ImdbDataset/rule_matches.lib')
rule_label_t = load('../../tutorials/ImdbDataset/mapping_rules_labels.lib')

In [228]:
# Majority Vote without denoising
maj_vote = get_majority_vote_probs(rule_matches_z, rule_label_t)

  rule_counts_probs = rule_counts / rule_counts.sum(axis=1).reshape(-1, 1)


In [229]:
# Predictions without denoising
preds_wo_denoising = np.argmax(maj_vote, axis=1)
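The line printed by the majority-vote cell suggests that the probabilities are obtained by dividing per-class rule counts by their row sums. The sketch below assumes exactly that (counts as `Z @ T`, then row normalization); it is not taken from the knodle source, and the `toy_*` arrays are made up. It shows why samples without any rule match deserve a separate look: their row sum is 0, the division yields `NaN`, and `np.argmax` then returns class 0 for them.

```python
import numpy as np

# Hypothetical toy data: 3 samples, 3 rules, 2 classes
toy_z = np.array([[1, 0, 1],   # sample 0 matches rules 0 and 2
                  [0, 1, 0],   # sample 1 matches rule 1
                  [0, 0, 0]])  # sample 2 matches no rule
toy_t = np.array([[1, 0],      # rule 0 -> class 0
                  [0, 1],      # rule 1 -> class 1
                  [1, 0]])     # rule 2 -> class 0

rule_counts = toy_z @ toy_t                      # votes per class
row_sums = rule_counts.sum(axis=1, keepdims=True)

# Assumed normalization; rows with zero votes become NaN
with np.errstate(divide="ignore", invalid="ignore"):
    naive_probs = rule_counts / row_sums

has_no_match = row_sums.ravel() == 0
naive_preds = np.argmax(naive_probs, axis=1)     # NaN rows fall back to index 0

print(has_no_match)
print(naive_preds)
```

Counting `has_no_match` on the real matrix would show how many predictions are decided by this fallback rather than by any rule.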

In [230]:
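# Denoise now
def _activate_all_neighbors(lfs: np.ndarray, indices: np.ndarray) -> np.ndarray:
    """
    Copy every rule match of a sample to all of its nearest neighbors.
    Args:
        lfs: Rule match matrix of shape (n_samples, n_rules); 0 means no match.
        indices: Neighbor indices of shape (n_samples, n_neighbors),
            as returned by NearestNeighbors.kneighbors.
    Returns:
        Matrix of the same shape as lfs in which each match is also set
        for the neighbors of the matching sample.
    """
    new_lfs_array = np.full(lfs.shape, fill_value=0)

    for index, lf in tqdm(enumerate(lfs)):

        try:
            # Columns (rules) that matched for this sample
            matched_lfs = np.where(lf != 0)[0]
            if len(matched_lfs) == 0:
                continue
            # Shape (n_matched, 1) so that it broadcasts against the neighbor indices
            matched_lfs = matched_lfs[:, np.newaxis]
            neighbors = indices[index]
            to_replace = new_lfs_array[neighbors, matched_lfs]
            label_matched_lfs = lf[matched_lfs][:, 0]
            # Repeat each matched value once per neighbor: shape (n_matched, n_neighbors)
            tiled_labels = np.tile(
                np.array(label_matched_lfs), (to_replace.shape[1], 1)
            ).transpose()
            new_lfs_array[neighbors, matched_lfs] = tiled_labels
        except IndexError:
            # Note: samples that raise an IndexError are skipped silently
            pass

    return new_lfs_array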


In [231]:
denoised_z = _activate_all_neighbors(rule_matches_z, indices)

50000it [00:00, 55382.88it/s]
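To see what the denoising step does on a case small enough to verify by hand, the following sketch re-implements the same idea with plain loops: every rule that matches a sample is also switched on for that sample's neighbors. The function name `spread_matches_to_neighbors` and the `toy_*` arrays are hypothetical and only serve as an illustration.

```python
import numpy as np

def spread_matches_to_neighbors(z: np.ndarray, neighbor_idx: np.ndarray) -> np.ndarray:
    """Set each matched rule of sample i for all rows listed in neighbor_idx[i]."""
    out = np.zeros_like(z)
    for i, row in enumerate(z):
        for j in np.flatnonzero(row):       # rules matched by sample i
            out[neighbor_idx[i], j] = row[j]
    return out

# Hypothetical toy data: 3 samples, 2 rules, 2 neighbors each (self first)
toy_z = np.array([[1, 0],
                  [0, 0],
                  [0, 1]])
toy_neighbors = np.array([[0, 1],
                          [1, 0],
                          [2, 1]])

toy_denoised = spread_matches_to_neighbors(toy_z, toy_neighbors)
print(toy_denoised)
print(toy_z.sum(), "->", toy_denoised.sum())
```

Sample 1 has no match of its own but inherits both rules from its neighbors, so it ends up with conflicting votes. A sample keeps its own matches only because it appears in its own neighbor list.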


In [232]:
np.unique(rule_matches_z)

array([0, 1])

In [233]:
np.unique(denoised_z)

array([0, 1])

In [234]:
maj_vote_denoised = get_majority_vote_probs(denoised_z, rule_label_t)

In [235]:
preds_with_denoising = np.argmax(maj_vote_denoised, axis=1)

In [236]:
# Check if final predictions have changed with denoising
np.all(preds_with_denoising == preds_wo_denoising)

False

In [237]:
rule_matches_z.sum()

188181

In [238]:
denoised_z.sum()

463272

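The `np.all` comparison above returns `True` only if every prediction is identical before and after denoising. Here it returns `False`, so denoising changed at least one prediction. The denoised matrix also contains more rule matches: its sum grows from 188181 to 463272.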
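A single boolean does not say how many predictions moved. A minimal sketch for counting the changed predictions and listing their positions is shown below; the two `toy_preds_*` arrays are made-up stand-ins for `preds_wo_denoising` and `preds_with_denoising`.

```python
import numpy as np

# Hypothetical stand-ins for the two prediction vectors
toy_preds_before = np.array([0, 1, 1, 0, 1, 0])
toy_preds_after = np.array([0, 1, 0, 0, 1, 1])

changed = toy_preds_before != toy_preds_after   # boolean mask per sample
n_changed = int(changed.sum())
share_changed = changed.mean()
changed_positions = np.flatnonzero(changed)     # row indices worth inspecting

print(n_changed, share_changed, changed_positions)
```

The positions can then be used to look up the affected reviews and their neighbors.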

## Different sentiments per neighbor

In [247]:
df = pd.DataFrame(indices)
column_names = ["neighbor_{}".format(i) for i in range(len(df.columns))]
df.columns = column_names
df.head()

Unnamed: 0,neighbor_0,neighbor_1,neighbor_2,neighbor_3,neighbor_4
0,0,33546,34461,25999,42867
1,1,16764,24062,33664,34279
2,2,30202,647,28846,48161
3,3,39638,7400,31299,4635
4,4,700,41941,4602,2028


In [248]:
df = df.assign(sentiment1 = df['neighbor_0'].apply(lambda x:movie_reviews.loc[x, 'sentiment']))

In [249]:
df = df.assign(sentiment2 = df['neighbor_1'].apply(lambda x:movie_reviews.loc[x, 'sentiment']))

In [250]:
df = df.assign(sameSentiments= df.sentiment1 == df.sentiment2)

In [251]:
df.sameSentiments.value_counts()

True     38187
False    11813
Name: sameSentiments, dtype: int64
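The comparison above only covers the closest neighbor. The same check can be extended to every neighbor rank at once by indexing a label array with the neighbor matrix. The sketch below uses made-up `toy_labels` and `toy_indices`; on the real data, the label array would presumably come from the `label_id` column.

```python
import numpy as np

# Hypothetical toy data: 5 samples, each with itself plus 2 neighbors
toy_labels = np.array([1, 1, 0, 0, 1])
toy_indices = np.array([[0, 1, 4],
                        [1, 0, 2],
                        [2, 3, 0],
                        [3, 2, 1],
                        [4, 0, 3]])

# Label of every neighbor, shape (n_samples, n_neighbors)
neighbor_labels = toy_labels[toy_indices]

# Compare each neighbor (skipping the self column) with the sample's own label
agree = neighbor_labels[:, 1:] == toy_labels[:, None]

# Share of samples whose neighbor at rank 1, 2, ... has the same label
agreement_by_rank = agree.mean(axis=0)
print(agreement_by_rank)
```

A steep drop in agreement at higher ranks would suggest that a smaller `n_neighbors` spreads fewer wrong matches during denoising.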

In [252]:
movie_reviews.loc[df.loc[df.sameSentiments == False].iloc[0]['neighbor_0'], 'review']

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

In [253]:
movie_reviews.loc[df.loc[df.sameSentiments == False].iloc[0]['neighbor_1'], 'review']

'"Jake Speed" is a fine movie with a wonderful message. It has its flaws of course. At times it\'s a little slow. It introduces its villain too far into the story. It\'s action is paced at the rate of a snail\'s heartbeat. It has a Z-grade cast (Although I\'ve always admired the work of Karen Kopins, who has the straight-laced good looks of Sandra Bullock).<br /><br />But with all this going against it, "Jake Speed" really is inspiring, thanks to a charming script by Wayne Crawford(who plays the title role) and Andrew Lane.<br /><br />Why do I find it so inspiring? Because it says to me "Hey, why not try to be a good person."<br /><br />The story is essentially a "stranger in a strange land" premise, that is good-and-heroic Jake Speed is placed in the real world where bad things happen to good people. Jake is more than a Boy Scout. He\'s more than a knight in shining armor. Jake Speed is the patron saint of optimism in a dirty, mean and evil world.<br /><br />It\'s because of this that

In [254]:
movie_reviews.loc[df.loc[df.sameSentiments == False].iloc[1]['neighbor_0'], 'review']

'Petter Mattei\'s "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a variation on the Arthur Schnitzler\'s play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.<br /><br />The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case wit

In [255]:
movie_reviews.loc[df.loc[df.sameSentiments == False].iloc[1]['neighbor_1'], 'review']

"Unfortunately the only spoiler in this review is that there's nothing to spoil about that movie.Even if B. Mattei had never done any master piece he use to do his job with a bit of humor and craziness that made him a fun Eurotrash director. But for the last 10 years he seemed to have lost it.This film is just empty, nothing at all to wake us up from the deep sleep you sink into after the first 10 min.No sex, no blood(it's suppose to be about snuff?),no actors, no dialogs, just as bad as an 90'T.V film.It's even worse than his last cannibals and zombies epics.So Rest in peace Bruno, you will stay in our minds forever anyway, thanks to such unforgettable gems as:Zombi 3, Robowar,Rats, l'altro inferno,Virus, Cruel jaws and few others.So except if you want to see B Mattei possessed by jess Franco's spirit's new film, pass on this one.But if you don't know this nice artisan's career track down his old films and have fun."