# Recommendation Engines 1

### September 21 | Week 10 Day 1
### Instructor: Monique Wong


## Agenda

1. What is a recommendation engine?
2. One technique: content-based filtering
3. Last lecture with me! Open Q&A.

In [4]:
import numpy as np
import pandas as pd

import sklearn
from sklearn.neighbors import NearestNeighbors

## What is a recommendation engine?
- Also called a recommender system
- Put simply, it's a set of algorithms that lets a machine suggest things you might want to try or might like


**Discussion**: Where do we see this in practice?

## Why is this a separate topic? 

- It doesn't fit neatly into supervised or unsupervised learning
- Why? 

### Thought experiment: what constitutes a good recommendation? 

## Each recommendation system has ...
- **Users:** Purchaser of Amazon products, Netflix binge-watcher, social media subscriber...
- **Items:** Amazon products, Netflix shows, social media posts

### The basic purpose of a recommendation system is to find and recommend items that a user is most likely to be interested in.

## There are two techniques:

1. **Content-Based Recommender:** Use knowledge of each product to recommend similar products (product-based recommendation).
    - E.g., a customer comes in wanting to buy a computer as close as possible to 8 GB RAM, 125 GB HDD, and 6 hours of battery life...

2. **Collaborative Filtering:** Use knowledge of a user's past purchases/selections, or similar decisions by other users, to recommend products (user-based recommendation).
    - E.g., Netflix recommending shows to me based on what other people who watched similar shows have also watched

## Pros and cons

| Technique | Pros | Cons |
|:-|:---|:---|
|**Content-Based Recommender**| Works without user reviews or behaviours | Needs descriptive data for every product, which is hard to maintain for a large inventory |
|**Collaborative Filtering** | Product knowledge not required | Can't recommend products when no user reviews or behaviours are available, so recommendations for new users are difficult |

We will cover **content-based recommendations** today. Collaborative filtering is being covered tomorrow.

## Content-based recommendations

- E.g., Customer comes in wanting to buy a computer as close to 8 GB RAM, 125 GB HDD, 6 hour battery life as possible
- This recommendation algorithm gives us "similar to previous products you purchased" functionality or "items similar to what you searched" functionality. (What is the flaw with this?)

### What do we need to make this type of recommendation?
1. We need a **list of items** and **features that describe them**
2. We need a **definition of "distance"** between products

## Step 1: Items and Features

For movies, this can be as simple as:

| Movie | Action | Comedy | Romance | Drama | Runtime | Actor 1 | Actor 2 | Actor 3| ... |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| Movie 1 | 1 | 0 | 0 | 0 | 123 | 1 | 0 | 1 | ... |
| Movie 2 | 0 | 0 | 1 | 0 | 96 | 0 | 0 | 1 | ... |
| Movie 3 | 0 | 1 | 0 | 0 | 89 | 0 | 0 | 0 | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


- It is possible that many features will need to be engineered before this begins to work well. 
- For example, you can use the synopsis of these movies to generate features using NLP... 
- You can also see how this gets computationally and memory intensive when there are a lot of items...

## Step 2: Definition of distance

- For this recommendation system, we're trying to find the "closest product"
- We need some definition of "closest" 

### Option 1: Euclidean distance
- We've used this before
```
def euclidean_distance(x, y):   
    return np.sqrt(np.sum((x - y) ** 2))
```

### Option 2: Cosine similarity
- This is the measure more commonly used in recommender systems
```
def cosine_similarity(x, y):
    return np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))
```
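For reference, the two functions above correspond to the following formulas, assuming $x$ and $y$ are feature vectors of the same length and $\theta$ is the angle between them:

$$
d(x, y) = \sqrt{\sum_i (x_i - y_i)^2},
\qquad
\cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
= \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}
$$

Note that a *smaller* distance means "closer", while a *larger* cosine similarity means "closer".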

## Euclidean distance vs. cosine similarity

Visualizing the difference:
<img src='imgs/euclidean-cosine.png' width=300>

- **Euclidean distance** is $d$
    - The magnitude of each feature affects the result
- **Cosine similarity** is $\cos\theta$
    - Looks only at the angle $\theta$ between the two feature vectors, not their magnitudes
    
### Which one you use ends up being dependent on your features
- E.g., if you're counting how many times "science" is used in an article, do more instances of "science" mean that the article is more "science-related"?
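To make the difference concrete, here is a small sketch using made-up word counts for three articles. The counts and variable names are hypothetical; the two functions are the ones defined earlier.

```python
import numpy as np

def euclidean_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))

# Hypothetical counts of the words ("science", "sports") in three articles.
short_science = np.array([2.0, 0.0])
long_science = np.array([20.0, 0.0])
short_sports = np.array([0.0, 2.0])

# Euclidean distance is driven by magnitude: the long science article ends up
# far from the short one, and the short sports article looks "closer".
d_long = euclidean_distance(short_science, long_science)
d_sports = euclidean_distance(short_science, short_sports)

# Cosine similarity only looks at direction: both science articles point the
# same way, and the sports article is orthogonal to them.
cos_long = cosine_similarity(short_science, long_science)
cos_sports = cosine_similarity(short_science, short_sports)

print(d_long, d_sports)
print(cos_long, cos_sports)
```

If article length should not matter, cosine similarity is the more natural choice here; if "more mentions" really does mean "more science-related", Euclidean distance may be what you want.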

## Step 3: Making a recommendation

Just find the closest product to the target product!
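A minimal brute-force sketch of this step, assuming Euclidean distance and a tiny made-up laptop catalogue (the `recommend` helper, the laptop names, and their specs are all hypothetical):

```python
import numpy as np

def recommend(target, items, names):
    """Return the name of the item with the smallest Euclidean distance to target."""
    distances = np.sqrt(np.sum((items - target) ** 2, axis=1))
    return names[int(np.argmin(distances))]

# Hypothetical catalogue: [RAM in GB, HDD in GB, battery life in hours].
names = ["Laptop A", "Laptop B", "Laptop C"]
items = np.array([
    [4.0, 250.0, 5.0],
    [8.0, 128.0, 6.0],
    [16.0, 500.0, 9.0],
])

# The customer's wish list from the earlier example.
target = np.array([8.0, 125.0, 6.0])
print(recommend(target, items, names))
```

This computes the distance to every item, which is fine for a small catalogue; libraries such as scikit-learn offer data structures that speed up the search for larger ones.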


### Quick demo using cars

In [8]:
data_url = 'https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv'
cars = pd.read_csv(data_url)
cars.columns = ['car_names', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb']
cars.head()

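| | car_names | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |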


## Using sklearn Nearest Neighbors

In [12]:
# Use a subset of columns (mpg, disp, hp, wt) to keep the demo small
X = cars.iloc[:, [1, 3, 4, 6]].values

# Target car we want to match, in the same column order: mpg, disp, hp, wt
t = [15, 300, 160, 3.2]

# Fit a nearest-neighbours model that returns the single closest car
nbrs = NearestNeighbors(n_neighbors=1, metric='euclidean').fit(X)

# kneighbors returns (distances, indices); look up the row of the closest car
distances, indices = nbrs.kneighbors([t])
print(cars.iloc[indices[0][0]])

car_names    AMC Javelin
mpg                 15.2
cyl                    8
disp                 304
hp                   150
drat                3.15
wt                 3.435
qsec                17.3
vs                     0
am                     0
gear                   3
carb                   2
Name: 22, dtype: object
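One thing to keep in mind with the demo above: the columns are on very different scales (`disp` is in the hundreds, `wt` is around 3), so unscaled Euclidean distance is dominated by the large columns. The sketch below shows two possible variations, standardizing the features first or switching to `metric='cosine'`. To stay self-contained it uses only the five rows shown by `cars.head()`, so its result is not comparable to the full-dataset recommendation; the variable names are made up for this example.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# First five rows of the cars table, restricted to the four columns used above.
cars_small = pd.DataFrame({
    "car_names": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                  "Hornet 4 Drive", "Hornet Sportabout"],
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
    "disp": [160.0, 160.0, 108.0, 258.0, 360.0],
    "hp": [110, 110, 93, 110, 175],
    "wt": [2.62, 2.875, 2.32, 3.215, 3.44],
})
X = cars_small[["mpg", "disp", "hp", "wt"]].values
t = [[15, 300, 160, 3.2]]  # same target as in the demo

# Variation 1: standardize each column (mean 0, std 1) before measuring distance.
scaler = StandardScaler().fit(X)
nbrs_scaled = NearestNeighbors(n_neighbors=1, metric="euclidean").fit(scaler.transform(X))
_, idx_scaled = nbrs_scaled.kneighbors(scaler.transform(t))
scaled_pick = cars_small["car_names"].iloc[idx_scaled[0][0]]

# Variation 2: compare directions only, using cosine distance (1 - cosine similarity).
nbrs_cos = NearestNeighbors(n_neighbors=1, metric="cosine").fit(X)
_, idx_cos = nbrs_cos.kneighbors(t)
cosine_pick = cars_small["car_names"].iloc[idx_cos[0][0]]

print(scaled_pick, cosine_pick)
```

Whether to scale, and which metric to use, depends on what "similar" should mean for your items.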


## Break

Let's reconvene at ...

### This is my last lecture with you!

When we come back, I'd like to do an open Q&A for the remainder of the time. Please think of things you want to ask / discuss over the break. 