# Five Cool Things in Five Minutes
Using a 100,000-row sample of the [Amazon Fine Food Reviews dataset](https://www.kaggle.com/snap/amazon-fine-food-reviews/).



# [bit.ly/python-5ct](http://bit.ly/python-5ct)

## 1. SQL Magic
```
pip install ipython-sql
```
https://github.com/catherinedevlin/ipython-sql

In [None]:
%load_ext sql
%sql sqlite:///database-sample.sqlite

In [None]:
%%sql

SELECT name, sql FROM sqlite_master
ORDER BY name;

In [None]:
review_data = %sql SELECT productId, Score FROM reviews ORDER BY productId

In [None]:
review_data[0:5]
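
Outside a notebook, roughly the same query can be run with the standard-library `sqlite3` module. This is a minimal sketch, not what `ipython-sql` does internally: it uses an in-memory database with a few made-up rows so it runs on its own. The `reviews` table and the `productId` and `Score` columns come from the query above; the sample values are hypothetical.

```python
import sqlite3

# In-memory stand-in for database-sample.sqlite; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (productId TEXT, Score INTEGER)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [("B002", 5), ("B001", 4), ("B002", 3), ("B001", 1)],
)

# Same query as the %sql line above; fetchall() returns a list of tuples.
rows = conn.execute(
    "SELECT productId, Score FROM reviews ORDER BY productId"
).fetchall()
conn.close()

print(rows[0:5])
```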

## 2. `from itertools import groupby`
We want to find the average score for each product. I see you reaching for `import pandas as pd`... but we're going `pandas`-free!

https://docs.python.org/3/library/itertools.html#itertools.groupby

In [None]:
from itertools import groupby
from statistics import mean
help(groupby)

In [None]:
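product_means = []
# groupby only merges adjacent rows, so this relies on the ORDER BY productId above
for product, reviews in groupby(review_data, key=lambda x: x[0]):
    reviews = list(reviews)  # materialize the group so it can be iterated twice
    product_means.append((product, mean(review[1] for review in reviews), len(reviews)))

# Rank by mean score, breaking ties by number of reviews
top5_products = sorted(product_means, key=lambda x: (x[1], x[2]), reverse=True)[:5]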

print(*top5_products, sep="\n")
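
`groupby` only merges *adjacent* items with equal keys, which is why the query sorts by `productId` first. A small sketch with made-up rows shows what happens when the input is not sorted (`group_keys` is a hypothetical helper for illustration):

```python
from itertools import groupby

# Hypothetical (productId, score) rows, deliberately NOT sorted by product.
unsorted_rows = [("A", 5), ("B", 3), ("A", 1)]

def group_keys(rows):
    """Return the sequence of keys that groupby emits for rows."""
    return [key for key, _ in groupby(rows, key=lambda row: row[0])]

print(group_keys(unsorted_rows))          # ['A', 'B', 'A']
print(group_keys(sorted(unsorted_rows)))  # ['A', 'B']
```

Product `A` shows up as two separate groups in the unsorted case, so its mean would be computed twice on partial data.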

## 3. Printing Sparkline Histograms
```
pip install sparklines
```

https://github.com/deeplook/sparklines

In [None]:
from sparklines import sparklines
import numpy as np

In [None]:
def generate_sparkline(array):
    try:
        # minlength=6 pads missing scores with 0 so we always get five bars (scores 1-5); index 0 is dropped
        bins = np.bincount(array, minlength=6)[1:6]
        sparkline = sparklines(bins) # a list of bars
        return ''.join(sparkline)
    except ValueError:
        return ''

In [None]:
print(generate_sparkline([1, 4, 3, 4, 2, 1, 2, 3, 4, 3, 5, 5, 3, 2, 4]))
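
To see what goes into those bars, here is a rough standard-library sketch of the same idea: count how many ratings fall on each score from 1 to 5, then scale each count to one of eight block characters. This is an illustration under simple assumptions, not the `sparklines` package's actual scaling logic, and `text_histogram` is a hypothetical helper name.

```python
from collections import Counter

BARS = "▁▂▃▄▅▆▇█"  # eight block characters, shortest to tallest

def text_histogram(scores, lowest=1, highest=5):
    """Draw one bar per score value, scaled relative to the largest count."""
    counts = Counter(scores)
    bins = [counts.get(score, 0) for score in range(lowest, highest + 1)]
    tallest = max(bins)
    if tallest == 0:
        return ""  # no ratings in range
    # Map each count onto an index into BARS (0 for empty, 7 for the tallest).
    return "".join(BARS[round(count / tallest * (len(BARS) - 1))] for count in bins)

print(text_histogram([1, 4, 3, 4, 2, 1, 2, 3, 4, 3, 5, 5, 3, 2, 4]))
```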

## 🎊 3.5 BONUS: Change CSS in Notebook

In [None]:
from IPython.display import HTML
HTML("""
<style>pre {font-family: SFMono-Regular,Consolas,Liberation Mono,Menlo,Courier,monospace;}</style>""")

Let's combine groupby with our sparkline function...

In [None]:
product_sparklines = []
for product, reviews in groupby(review_data, key=lambda x: x[0]):
    reviews = list(reviews)
    product_sparklines.append((product, generate_sparkline([review[1] for review in reviews])))

In [None]:
for product, sparkline in product_sparklines[:5]:
    print(product, sparkline, sep="\t")
    print()

## 4. Sorting by number of ratings and overall rating?

```
pip install statsmodels
```

Basically: **I want to sort on number of reviews AND review score.**

📄 **Good blog post**: https://www.evanmiller.org/how-not-to-sort-by-average-rating.html
    
> We need to balance the proportion of positive ratings with the uncertainty of a small number of observations. Fortunately, the math for this was worked out in 1927 by Edwin B. Wilson. What we want to ask is: *Given the ratings I have, there is a 95% chance that the “real” fraction of positive ratings is at least what?* 

In [None]:
from statsmodels.stats.proportion import proportion_confint
from statistics import stdev, median

In [None]:
product_data = []
# We're going to do some extra aggregations here
# because we'll need them for the next part
for product, reviews in groupby(review_data, key=lambda x: x[0]):
    review_scores = [review[1] for review in reviews]
    mean_score = mean(review_scores)
    median_score = median(review_scores)
    stdev_score = round(stdev(review_scores), 2) if len(review_scores) > 1 else 0
    # A review counts as positive when its score is 4 or 5
    positive_reviews = [int(r) > 3 for r in review_scores]
    p_positive = round(mean(positive_reviews), 2)
    
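    # Here's the money: lower bound of the 95% Wilson interval
    # (alpha=0.05 is the default) on the fraction of positive reviews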
    wilson_score = proportion_confint(sum(positive_reviews), len(review_scores), method='wilson')[0]
    wilson_score = round(wilson_score, 2)
    
    sparkline = generate_sparkline(review_scores)
    product_data.append(
        dict(
            product=product,
            n_ratings=len(review_scores),
            mean=mean_score,
            median=median_score,
            stdev=stdev_score,
            p_positive=p_positive,
            wilson_score=wilson_score,
            sparkline=sparkline,
        )
    )
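
For intuition about what `method='wilson'` computes, the lower bound can be written out with the standard library alone. This is a sketch of the textbook Wilson score interval, assuming the two-sided 95% level that matches `proportion_confint`'s default `alpha=0.05`; `wilson_lower_bound` is a hypothetical helper, not part of `statsmodels`.

```python
from math import sqrt
from statistics import NormalDist

def wilson_lower_bound(n_positive, n_total, alpha=0.05):
    """Lower end of the two-sided Wilson score interval for a proportion."""
    if n_total == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha=0.05
    p_hat = n_positive / n_total
    centre = p_hat + z**2 / (2 * n_total)
    margin = z * sqrt(p_hat * (1 - p_hat) / n_total + z**2 / (4 * n_total**2))
    return (centre - margin) / (1 + z**2 / n_total)

# One perfect review is weaker evidence than 95 positives out of 100.
print(round(wilson_lower_bound(1, 1), 2))
print(round(wilson_lower_bound(95, 100), 2))
```

With this bound, a product with a single positive review ranks below one with 95 positive reviews out of 100, even though its raw positive fraction is higher.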

## 5. ipywidgets
### 🎊 5.5 BONUS: `tabulate`
```
pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension
pip install tabulate
```

https://ipywidgets.readthedocs.io/en/stable/index.html

In [None]:
from ipywidgets import interact
from tabulate import tabulate

In [None]:
@interact(
    by=["n_ratings", "mean", "median", "stdev", "p_positive", "wilson_score"],
    descending=[True, False],
    top=(5, 50, 5),
)
def product_sorter(by, descending, top=5):
    sorted_data = sorted(product_data, key=lambda x: x[by], reverse=descending)[:top]
    print(tabulate(sorted_data, headers="keys", tablefmt="grid"))