# Search Engine

#### Plan:
Deadline: 20.12.2017
1. Dictionary
    - Stop-word removal
    - Stemming (Porter Stemmer, Porter Stemmer 2)
    - Reduction (optional)
2. Building the bag of words (indexing)
    - documentTermMatrix (a word-count vector for each article)
    - Inverse Document Frequency: multiply the column of word w in documentTermMatrix by IDF(w)
    - Normalization (scale each document vector to unit length; useful when documents vary in length)
3. Query
    - Curse of dimensionality
    - Dot product
4. SVD (LRMA)
    - a library that computes the first k vectors
    - a library for sparse data
5. * Semantics in a non-trivial way
    - Latent semantic indexing
    - Latent Dirichlet allocation
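
Steps 2 and 3 of the plan can be written as formulas. This is a sketch in notation that is not part of the original notes: $N$ is the number of documents, $\mathrm{tf}_{d,t}$ the count of term $t$ in document $d$, $\mathrm{df}_t$ the number of documents containing $t$, $x_d$ the weighted vector of document $d$, and $q$ the query vector.

```latex
\begin{aligned}
\mathrm{idf}_t &= \log\frac{N}{\mathrm{df}_t}, \\
x_{d,t} &= \mathrm{tf}_{d,t}\cdot\mathrm{idf}_t, \\
\hat{x}_d &= \frac{x_d}{\lVert x_d\rVert_2}, \\
\mathrm{score}(q, d) &= \hat{q}\cdot\hat{x}_d
  = \frac{q\cdot x_d}{\lVert q\rVert_2\,\lVert x_d\rVert_2}.
\end{aligned}
```

With unit-length vectors the dot product from step 3 equals the cosine of the angle between query and document, so long documents gain no advantage from their length alone.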

In [44]:
import pandas as pd
from tqdm import tqdm
import re
import numpy as np
from nltk.stem import PorterStemmer, snowball
from nltk.stem import WordNetLemmatizer
from scipy.sparse import csr_matrix, lil_matrix
import itertools
from math import log

### Params

In [96]:
filename = "data.tsv"
column_name = 'review'
# filename = "articles2.csv"
# column_name = 'Article'
# filename = 'buzzfeed.csv'
# column_name = ?
stopwords_filename = "stopwords"

In [97]:
texts = pd.read_csv(filename, header=0, delimiter="\t")
# texts = pd.read_csv(filename)

In [59]:
texts.shape

(44064, 5)

In [92]:
texts['title'][31]

'Street Food In Bangkok That Will Make You Eat To The Top'

In [83]:
texts['first_paragraph'][3]

'Carol Pilon is one of a handful people on Earth who risk their lives by performing on the wings of airplanes flying 150 miles an hour hundreds of feet above the ground. And she’s trying to mentor a new generation to carry on the dying — and deadly — art form.'

In [84]:
texts['text'][3]

'before I could see it, the engine growling loudly. And then the scarlet biplane came into view, a dark figure suspended between the parallel wings becoming more distinct as it flew toward the Lake Erie shoreline where thousands of spectators were waiting for the next air show act. That figure was Carol Pilon, one of the world’s last remaining wingwalkers, and she appeared to be floating, the finger-width flying wires supporting her invisible to the audience from that distance. Her longtime pilot, Marcus Paine, accelerated into a loop, drawing a large cursive O in the sky with smoke; only centrifugal force and Pilon’s own strength kept her from falling as the plane flew upside down and then righted itself barely a hundred feet above the waterfront showbox. Then the 1940 Boeing Stearman turned away from the shore, providing the audience with a silhouette of the biplane, Pilon suspended like a spider over her carefully spun web. Paine gained altitude and turned back toward the crowd befo

In [98]:
bag_of_words = set()
with open(stopwords_filename, "r") as file:
    stopwords = set(file.read().splitlines()[1:])

In [99]:
# ps = PorterStemmer()
ps = snowball.EnglishStemmer()
wnl = WordNetLemmatizer()

In [100]:
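# Tokenize every document: lowercase, drop stop words, lemmatize, strip non-letters, stem.
# The token lists are collected and assigned back in one step, because rows yielded
# by iterrows() may be copies, so writing to them does not reliably update `texts`.
processed = []
for i, row in tqdm(texts.iterrows()):
    words = [ps.stem(re.sub('[^a-z]', '', wnl.lemmatize(word)))
             for word in row[column_name].lower().split()
             if word not in stopwords]
    processed.append(words)
    bag_of_words.update(words)
texts[column_name] = pd.Series(processed, index=texts.index)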

25000it [01:29, 279.04it/s]
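
In the loop above the stop-word check runs before punctuation is stripped, so a token such as `the,` or `it.` passes the filter and is only afterwards reduced to `the` or `it`. Fragments of HTML tags such as `<br` are likewise reduced to `br`. A minimal sketch of one possible reordering follows. It assumes a tiny illustrative stop-word set and leaves stemming out; `tokenize`, `EXAMPLE_STOPWORDS`, `TAG_RE` and `NON_LETTER_RE` are hypothetical names, not part of the notebook.

```python
import re

# Illustrative stop-word set (an assumption; the notebook loads its own list from a file).
EXAMPLE_STOPWORDS = {"the", "it", "this", "and", "is", "a", "for"}

TAG_RE = re.compile(r"<[^>]+>")        # HTML tags such as <br />
NON_LETTER_RE = re.compile(r"[^a-z]")  # anything that is not a lowercase letter


def tokenize(text, stopwords=EXAMPLE_STOPWORDS):
    """Lowercase, drop HTML tags, strip non-letters, then filter stop words."""
    text = TAG_RE.sub(" ", text.lower())
    tokens = []
    for raw in text.split():
        word = NON_LETTER_RE.sub("", raw)
        # Filter after cleaning, so "the," and "it." are caught as well.
        if len(word) > 1 and word not in stopwords:
            tokens.append(word)
    return tokens


print(tokenize("This film, it is great.<br /><br />The END"))
```

A stemmer call could be applied to each `word` just before it is appended.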


In [101]:
len(bag_of_words)

87329

In [102]:
frequency = {word: 0 for word in bag_of_words}
for i, row in tqdm(texts.iterrows()):
    for word in row[column_name]:
        if word in bag_of_words:
            frequency[word] += 1

25000it [00:03, 6410.94it/s]


In [103]:
frequency = dict(filter(lambda x : len(x[0]) > 1, sorted(frequency.items(), key=lambda x: x[1], reverse=True)))

In [104]:
frequency

{'br': 57204,
 'movi': 49381,
 'film': 45774,
 'time': 14718,
 'good': 14230,
 'charact': 13795,
 'watch': 13650,
 'it': 12803,
 'the': 12511,
 'stori': 11620,
 'scene': 10472,
 'show': 9780,
 'bad': 9413,
 'great': 9221,
 'love': 8969,
 'peopl': 8827,
 'play': 8460,
 'act': 8375,
 'life': 6704,
 'plot': 6484,
 'actor': 6456,
 'work': 6420,
 'year': 5927,
 'man': 5484,
 'this': 5455,
 'find': 5302,
 'perform': 5066,
 'part': 5026,
 'feel': 4939,
 'lot': 4884,
 'guy': 4742,
 'interest': 4662,
 'director': 4500,
 'real': 4497,
 'funni': 4253,
 'music': 4186,
 'give': 4106,
 'enjoy': 4049,
 'cast': 4045,
 'end': 3978,
 'start': 3929,
 'role': 3918,
 'woman': 3692,
 'thought': 3672,
 'girl': 3671,
 'set': 3669,
 'point': 3665,
 'turn': 3658,
 'star': 3638,
 'kill': 3623,
 'well': 3611,
 'world': 3567,
 'minut': 3562,
 'fact': 3557,
 'pretti': 3556,
 'day': 3549,
 'effect': 3517,
 'origin': 3515,
 'comedi': 3482,
 'horror': 3476,
 'direct': 3468,
 'and': 3397,
 'friend': 3379,
 'young': 334

In [105]:
ids = dict((word, i) for i,word in enumerate(frequency))

In [106]:
word_from_id = {v: k for k, v in ids.items()}

# Bag of Words:

In [107]:
number_of_words = len(ids)
number_of_docs = texts.shape[0]
backOfWords = lil_matrix((number_of_docs, number_of_words), dtype=float)

In [108]:
for i, row in tqdm(texts.iterrows()):
    for word in row[column_name]:
        if word in ids:
            backOfWords[i ,ids[word]] += 1

25000it [00:32, 773.24it/s]
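
Filling a `lil_matrix` one cell at a time works, but the same counts can also be assembled in a single pass from (row, column, value) triplets. The sketch below assumes each document is already a list of tokens; `build_term_matrix`, `docs` and `vocab` are illustrative names. It relies on `csr_matrix` summing duplicate (row, column) entries when built from triplets, which yields the term counts.

```python
import numpy as np
from scipy.sparse import csr_matrix


def build_term_matrix(docs, vocab):
    """Return a documents x terms count matrix in CSR format."""
    rows, cols = [], []
    for doc_id, tokens in enumerate(docs):
        for token in tokens:
            term_id = vocab.get(token)
            if term_id is not None:  # ignore out-of-vocabulary tokens
                rows.append(doc_id)
                cols.append(term_id)
    data = np.ones(len(rows), dtype=float)
    # Duplicate (row, col) pairs are summed, giving per-document term counts.
    return csr_matrix((data, (rows, cols)), shape=(len(docs), len(vocab)))


# Toy data, made up for illustration.
docs = [["movi", "good", "movi"], ["film", "bad"], ["good", "film", "unknown"]]
vocab = {"movi": 0, "film": 1, "good": 2, "bad": 3}
counts = build_term_matrix(docs, vocab)
print(counts.toarray())
```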


In [109]:
doc_word_counts = {word: 0 for word in bag_of_words}
for i, row in tqdm(texts.iterrows()):
    for word in set(row[column_name]):
        doc_word_counts[word] += 1

25000it [00:03, 7524.72it/s]


#### Number of docs that the given word appears in:

In [110]:
dict(filter(lambda x : len(x[0]) > 1, sorted(doc_word_counts.items(), key=lambda x: x[1], reverse=True)))

{'movi': 15933,
 'film': 14623,
 'br': 14537,
 'time': 9684,
 'good': 9209,
 'watch': 9006,
 'it': 8812,
 'charact': 8102,
 'the': 7742,
 'stori': 7443,
 'act': 6599,
 'scene': 6486,
 'great': 6349,
 'peopl': 6046,
 'bad': 6022,
 'love': 5784,
 'play': 5648,
 'show': 5332,
 'plot': 4986,
 'actor': 4929,
 'work': 4822,
 'life': 4666,
 'this': 4523,
 'year': 4522,
 'find': 4196,
 'part': 3883,
 'lot': 3864,
 'man': 3803,
 'feel': 3801,
 'perform': 3766,
 'interest': 3764,
 'give': 3576,
 'real': 3530,
 'director': 3501,
 'enjoy': 3370,
 'cast': 3345,
 'end': 3332,
 'guy': 3202,
 'start': 3163,
 'funni': 3115,
 'thought': 3100,
 'turn': 3049,
 'well': 3033,
 'point': 2997,
 'role': 2991,
 'set': 2979,
 'direct': 2978,
 'fact': 2933,
 'one': 2911,
 'day': 2895,
 'star': 2888,
 'and': 2871,
 'music': 2836,
 'minut': 2836,
 'pretti': 2819,
 'all': 2769,
 'effect': 2760,
 'long': 2743,
 'that': 2713,
 'big': 2713,
 'world': 2675,
 'put': 2666,
 'line': 2662,
 'bit': 2652,
 'friend': 2649,
 'f

# 5. IDF

In [111]:
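# Call `function` on every stored (row, column, value) entry of a sparse matrix;
# iterating over the COO representation is the fastest way to do this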
def for_each(function, x):
    cx = x.tocoo()    
    for i,j,v in tqdm(zip(cx.row, cx.col, cx.data)):
        function((i,j,v))

In [112]:
def mapAsIDF(x):
    backOfWords[x[0],x[1]] = x[2] * log(float(number_of_docs)/doc_word_counts[word_from_id[x[1]]])
    
for_each(mapAsIDF , backOfWords)

2000629it [00:26, 74482.15it/s]
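
The per-entry update above can also be expressed as a single sparse product: multiplying the count matrix on the right by a diagonal matrix of IDF values scales every term column at once. The following is a sketch, not the notebook's implementation; `apply_idf` is a hypothetical helper, and the document frequencies are taken from the matrix itself rather than from `doc_word_counts`.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags


def apply_idf(counts):
    """Scale each term column of a documents x terms matrix by log(N / df)."""
    counts = csr_matrix(counts, dtype=float)
    n_docs = counts.shape[0]
    # Document frequency: number of rows with a stored non-zero entry per column.
    df = counts.getnnz(axis=0)
    idf = np.zeros(counts.shape[1])
    seen = df > 0
    idf[seen] = np.log(n_docs / df[seen])  # terms that never occur keep weight 0
    return counts @ diags(idf)


# Toy count matrix, made up for illustration.
counts = csr_matrix(np.array([[2, 0, 1], [0, 1, 1], [0, 0, 3]], dtype=float))
weighted = apply_idf(counts)
print(weighted.toarray())
```

In this toy matrix the third term occurs in every document, so its IDF is `log(3/3) = 0` and the whole column vanishes.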


# 6. Search

In [113]:
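def convert_to_vector(query):
    # Preprocess the query exactly like the documents (stop words, lemmatize,
    # strip non-letters, stem once) so that its terms map to the same ids.
    words = set(ps.stem(re.sub('[^a-z]', '', wnl.lemmatize(word)))
                for word in query.lower().split() if word not in stopwords)
    vec = lil_matrix((1, number_of_words), dtype=float)
    for word in words:
        if word in ids:
            vec[0, ids[word]] = 1
    return vec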

In [114]:
def get_best_result(query):
    # Score every document by the dot product of the query with its TF-IDF row,
    # divided by the number of non-zero terms in the query and in the document.
    query_vec = convert_to_vector(query)
    query_length = query_vec.count_nonzero()
    matches = {i: 0 for i in range(number_of_docs)}
    for i in tqdm(range(number_of_docs)):
        denominator = query_length * backOfWords[i].count_nonzero()
        if denominator:  # empty documents or queries with no known words keep score 0
            matches[i] = query_vec.multiply(backOfWords[i]).sum() / denominator
    return dict(sorted(matches.items(), key=lambda x: x[1], reverse=True))
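
The plan lists normalization to unit vectors, whereas the scoring above divides by the number of non-zero entries. The sketch below shows the cosine-similarity variant from the plan. It is an alternative, not the notebook's scoring; `normalize_rows` and `rank_documents` are hypothetical names, and the toy matrix is made up.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags


def normalize_rows(matrix):
    """Scale every non-empty row of a sparse matrix to unit Euclidean length."""
    matrix = csr_matrix(matrix, dtype=float)
    norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return diags(1.0 / norms) @ matrix


def rank_documents(doc_matrix, query_vec, top_k=3):
    """Return (doc_id, cosine score) pairs, best first."""
    docs = normalize_rows(doc_matrix)
    query = normalize_rows(query_vec)
    # One sparse product scores all documents at once.
    scores = (docs @ query.T).toarray().ravel()
    order = np.argsort(-scores, kind="stable")[:top_k]
    return [(int(i), float(scores[i])) for i in order]


doc_matrix = csr_matrix(np.array([[3.0, 0.0, 0.0],
                                  [1.0, 1.0, 0.0],
                                  [0.0, 0.0, 2.0]]))
query_vec = csr_matrix(np.array([[1.0, 1.0, 0.0]]))
results = rank_documents(doc_matrix, query_vec)
print(results)
```

Normalizing the document matrix once, outside the query function, would avoid repeating that work for every query.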

In [115]:
a = pd.read_csv(filename, header=0, delimiter="\t")
a.at[2, column_name]  # get_value() was deprecated and later removed from pandas

'All in all, this is a movie for kids. We saw it tonight and my child loved it. At one point my kid\'s excitement was so great that sitting was impossible. However, I am a great fan of A.A. Milne\'s books which are very subtle and hide a wry intelligence behind the childlike quality of its leading characters. This film was not subtle. It seems a shame that Disney cannot see the benefit of making movies from more of the stories contained in those pages, although perhaps, it doesn\'t have the permission to use them. I found myself wishing the theater was replaying \\Winnie-the-Pooh and Tigger too\\", instead. The characters voices were very good. I was only really bothered by Kanga. The music, however, was twice as loud in parts than the dialog, and incongruous to the film.<br /><br />As for the story, it was a bit preachy and militant in tone. Overall, I was disappointed, but I would go again just to see the same excitement on my child\'s face.<br /><br />I liked Lumpy\'s laugh...."'

In [116]:
query = "movie for kids best for child"
query_vec = convert_to_vector(query)
best_matches = get_best_result(query)

100%|██████████| 25000/25000 [00:18<00:00, 1365.43it/s]


In [117]:
best_matches

{10862: 0.1386102753107166,
 1145: 0.13424520962586711,
 22730: 0.12907613737952342,
 10311: 0.10065093942150002,
 3927: 0.098891896009910996,
 7964: 0.09870320611171958,
 14006: 0.097343759434908,
 19431: 0.093516045296729328,
 858: 0.093130152629912574,
 21000: 0.086301762017125083,
 12888: 0.084203198435498214,
 23642: 0.080576320820699612,
 4890: 0.076823979455938857,
 10462: 0.076823979455938857,
 4988: 0.07223726178847327,
 10809: 0.070499828051165059,
 20767: 0.06971595401513303,
 22870: 0.067199970056363023,
 4173: 0.066791100254655844,
 20865: 0.066498328370950377,
 22692: 0.065927930673273988,
 24699: 0.065695051931269885,
 2312: 0.064964495397481267,
 9193: 0.064774082988848694,
 23822: 0.064487958866564263,
 21973: 0.064397085662993112,
 22745: 0.062490416895398845,
 21353: 0.062131641364675597,
 8869: 0.060390563652900013,
 18773: 0.060096631508820553,
 11927: 0.059915323436707697,
 6597: 0.059312604796898274,
 15164: 0.059030661857743687,
 848: 0.058546380135535842,
 9345

In [33]:
a.at[10862, column_name]

'This movie is stupid and i hate it!!! i turned it off before it reached half i hate this movie. Amitabh sucks in this movie i wanna throw eggs at the person who directed this movie. This movie is stupid and i hate it!!! i turned it off before it reached half i hate this movie. Amitabh sucks in this movie i wanna throw eggs at the person who directed this movie. This movie is stupid and i hate it!!! i turned it off before it reached half i hate this movie. Amitabh sucks in this movie i wanna throw eggs at the person who directed this movie. This movie is stupid and i hate it!!! i turned it off before it reached half i hate this movie. Amitabh sucks in this movie i wanna throw eggs at the person who directed this movie. This movie is stupid and i hate it!!! i turned it off before it reached half i hate this movie. Amitabh sucks in this movie i wanna throw eggs at the person who directed this movie.'
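
Step 4 of the plan (SVD for a low-rank approximation) asks for a library that computes only the first k singular vectors and handles sparse data. `scipy.sparse.linalg.svds` is one function that fits that description. Below is a sketch on a toy matrix. The value `k = 2`, the variable names, and the choice of `u * s` as document coordinates are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy documents x terms matrix (values are made up for illustration).
term_matrix = csr_matrix(np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 3.0, 1.0],
]))

k = 2  # number of singular triplets to keep; must be smaller than min(term_matrix.shape)
u, s, vt = svds(term_matrix, k=k)

# Sort the triplets explicitly so that the largest singular value comes first.
order = np.argsort(s)[::-1]
u, s, vt = u[:, order], s[order], vt[order, :]

doc_embeddings = u * s    # documents in the k-dimensional latent space
low_rank = (u * s) @ vt   # rank-k approximation of the original matrix
print(s)
print(doc_embeddings.shape, low_rank.shape)
```

A query vector `q` over the same vocabulary can be mapped into this space with `q @ vt.T` and compared with the rows of `doc_embeddings`.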