In [3]:
%load_ext autoreload
%autoreload 2

## HyperLogLog

In this task we use the HLL data structure to produce an estimate of the number of distinct strings in the file `hash.txt`.

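The exact count, *139000000* distinct strings, was obtained with the command `cat hash.txt | uniq | wc -l`. Note that `uniq` only collapses adjacent duplicate lines, so this count is exact only if duplicates are adjacent in the file (for example, if it is sorted); `sort -u hash.txt | wc -l` does not rely on that.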

The underlying hash function used by HLL has been implemented according to the **multiply-shift** scheme defined on [Wikipedia](https://en.wikipedia.org/wiki/Universal_hashing). 
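
As a rough illustration of the multiply-shift scheme, here is a minimal sketch for integer keys. The word size `w = 64`, the output width of 32 bits, and the name `make_multiply_shift` are assumptions made for this example and are not taken from `my_hyperloglog`; turning a string into an integer key is a separate step that is not shown.

```python
import random


def make_multiply_shift(out_bits=32, w=64, seed=None):
    """Return a multiply-shift hash h(x) = ((a * x) mod 2**w) >> (w - out_bits).

    `a` is a random odd w-bit multiplier; keys are assumed to be
    non-negative integers smaller than 2**w.
    """
    rng = random.Random(seed)
    a = rng.getrandbits(w) | 1      # force the multiplier to be odd
    mask = (1 << w) - 1             # reduces the product modulo 2**w

    def h(x):
        # Keep the top `out_bits` bits of the w-bit product.
        return ((a * x) & mask) >> (w - out_bits)

    return h


h = make_multiply_shift(seed=42)
print(h(123456789))  # some integer in [0, 2**32)
```

Drawing a fresh odd multiplier `a` for each hash function is what makes the family (approximately) universal; fixing `seed` only serves to make the example reproducible.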

We use *32 bits* for the hash, i.e. $$h: U \rightarrow [2^{32}]$$

We then use *11 bits* of each hash value for the buckets, i.e. $m = 2^{11} = 2048$ buckets, so according to [Flajolet et al.](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) we can expect a relative accuracy of about $1.04/\sqrt{m}$.
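
To make the bucket mechanics concrete, here is a minimal HyperLogLog sketch in the spirit of Flajolet et al. It is not the code in `my_hyperloglog`: the function name `hll_estimate`, the parameter `p` (number of bucket bits, so $2^p$ buckets), and the use of truncated SHA-1 as a stand-in 32-bit hash are all assumptions made for illustration. The large-range correction from the paper is omitted.

```python
import hashlib
import math

HASH_BITS = 32  # width of the hash values, as in the text


def _alpha(m):
    """Bias-correction constant from the HyperLogLog paper."""
    return {16: 0.673, 32: 0.697, 64: 0.709}.get(m, 0.7213 / (1 + 1.079 / m))


def hll_estimate(items, p=11):
    """Estimate the number of distinct strings using 2**p registers."""
    m = 1 << p
    registers = [0] * m
    rest_bits = HASH_BITS - p
    for item in items:
        # Stand-in 32-bit hash (assumption): first 4 bytes of SHA-1.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        x = int.from_bytes(digest[:4], "big")
        j = x >> rest_bits                   # top p bits select the register
        w = x & ((1 << rest_bits) - 1)       # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in w
        # (rest_bits + 1 when w is zero).
        rank = rest_bits - w.bit_length() + 1
        registers[j] = max(registers[j], rank)
    # Normalised harmonic mean of 2**register across all registers.
    raw = _alpha(m) * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw


sample = [f"item-{i}" for i in range(50_000)] * 2  # every string appears twice
print(round(hll_estimate(sample)))  # should land close to 50000
```

Duplicates never change a register, because the same string always hashes to the same bucket and rank; this is why the memory footprint depends only on the number of registers and not on the input size.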

Recall that relative accuracy is given by:

$$RE_{accuracy} = \frac{\mbox{absolute error}}{\mbox{"true" value}} \cdot 100\%$$

In this setting, with $m = 2048$, we should therefore expect a relative accuracy of about **2.30%**, since $1.04/\sqrt{2048} \approx 0.0230$.
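
The expected figure can be checked directly from the $1.04/\sqrt{m}$ formula. The helper name `hll_standard_error` below is made up for this example.

```python
import math


def hll_standard_error(m):
    """Expected relative accuracy of HyperLogLog with m buckets."""
    return 1.04 / math.sqrt(m)


print(f"{hll_standard_error(2048):.2%}")  # 2.30%
print(f"{hll_standard_error(64):.2%}")    # 13.00%
```

The 2.30% figure therefore corresponds to 2048 buckets; with only 64 buckets the same formula gives 13%.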

In [40]:
from my_hyperloglog import estimate_distinct, relative_accuracy

estimate = estimate_distinct("full_hash.txt")
re_acc = relative_accuracy(estimate)

print("Distinct elements: ", estimate)
print("Relative accuracy: {:.2%}".format(re_acc))

Distinct elements:  141943507
Relative accuracy: 2.12%
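
The reported figure can be reproduced by hand from the estimate and the exact count given above. The helper `relative_error` is a hypothetical two-argument stand-in; it is not the `relative_accuracy` function imported from `my_hyperloglog`, whose internals are not shown here.

```python
def relative_error(estimate, true_count):
    """Absolute error divided by the true value."""
    return abs(estimate - true_count) / true_count


print(f"{relative_error(141_943_507, 139_000_000):.2%}")  # 2.12%
```

The observed 2.12% is within the expected 2.30%.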


## Cluster analysis of Amazon Reviews

We now focus on analyzing the reviews for [fine foods from Amazon](https://www.kaggle.com/snap/amazon-fine-food-reviews) using the **KMeans algorithm** implemented from scratch.

Here's the dataset we're working on:

In [41]:
from data_loader import load_data

reviews_df = load_data()
reviews_df.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


First we need to preprocess the `Text` column. We do this by calling `load_data(processed=True)`; if the preprocessed `.csv` has not been generated yet, it is created on the fly.
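
As a rough idea of what such preprocessing could look like, here is a minimal sketch that lowercases the text and strips punctuation. The function name `preprocess_text` and the exact cleaning rules are assumptions; the actual steps inside `load_data` may differ (for example in how apostrophes or digits are handled).

```python
import re


def preprocess_text(text):
    """Lowercase, replace non-alphanumeric characters with spaces,
    and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()   # normalise whitespace


print(preprocess_text("Great taffy at a great price.  There was a wide"))
# great taffy at a great price there was a wide
```

With pandas, a function like this could be applied column-wise, e.g. `df["Text"].map(preprocess_text)`.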

In [43]:
processed_reviews_df = load_data(processed=True)
processed_reviews_df.head()

| | ProductId | UserId | Score | Text | ProcessedText |
|---|---|---|---|---|---|
| 0 | B001E4KFG0 | A3SGXH7AUHU8GW | 5 | I have bought several of the Vitality canned d... | i have bought several of the vitality canned d... |
| 1 | B00813GRG4 | A1D87F6ZCVE5NK | 1 | Product arrived labeled as Jumbo Salted Peanut... | product arrived labeled as jumbo salted peanut... |
| 2 | B000LQOCH0 | ABXLMWJIXXAIN | 4 | This is a confection that has been around a fe... | this is a confection that has been around a fe... |
| 3 | B000UA0QIQ | A395BORC6FGVXV | 2 | If you are looking for the secret ingredient i... | if you are looking for the secret ingredient i... |
| 4 | B006K2ZZ7K | A1UQRSCLF8GW1T | 5 | Great taffy at a great price. There was a wid... | great taffy at a great price there was a wide ... |
