A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run All_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 4. Recommender Systems

In this problem, we use a sample data set to learn more about recommender systems.

In [None]:
import numpy as np
import pandas as pd

from nose.tools import assert_is_instance, assert_equal
from numpy.testing import assert_array_equal

Suppose we are given the following data frame representing users' ratings of movies on a 1–5 scale, with 5 the highest rating. Note that most user-movie pairs have `NaN`s, meaning the user has not rated the movie.

```python
>>> print(movies_df)
```
```
         0    1    2    3    4    5    6
Alice  4.0  NaN  NaN  5.0  1.0  NaN  NaN
Bob    5.0  5.0  4.0  NaN  NaN  NaN  NaN
Carol  NaN  NaN  NaN  2.0  4.0  5.0  NaN
Dave   NaN  3.0  NaN  NaN  NaN  NaN  4.0
```

In [None]:
user_ids = ["Alice", "Bob", "Carol", "Dave"]
movie_ids = list(range(7))

data = np.array(
    [[4, np.nan, np.nan, 5, 1, np.nan, np.nan],
     [5, 5, 4, np.nan, np.nan, np.nan, np.nan],
     [np.nan, np.nan, np.nan, 2, 4, 5, np.nan],
     [np.nan, 3, np.nan, np.nan, np.nan, np.nan, 4]]
)

movies_df = pd.DataFrame(
    data=data,
    index=user_ids,
    columns=movie_ids
)

print(movies_df)

## Convert data frame to a 2-d numpy array of favorable ratings

The values in the above data frame hold the actual ratings. For simplicity, we want to restrict our analysis to only favorable ratings, which, since the movies are rated on a five-star system, we take to mean ratings greater than three. Thus,

- Write a function named `favorable_matrix()` that takes a data frame.
- The function converts the data frame to hold one for favorable ratings and zero for unfavorable ratings (including unrated movies, i.e. `NaN`), and converts the result to a 2-d numpy array.
- One possible way to do this is to use the [pandas.DataFrame.applymap()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.applymap.html) function, although there are many ways to accomplish the same result. See the [Introduction to Recommender Systems](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week7/notebooks/intro2rs.ipynb) notebook for examples.
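
One detail worth checking before writing the function is how `NaN` behaves in a comparison. The snippet below is a small illustration on a hypothetical toy series (the name `toy` and its values are not part of the assignment data): any comparison against `NaN` evaluates to `False`, so an unrated movie never counts as favorable.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: one favorable, one unfavorable, one missing.
toy = pd.Series([5.0, 3.0, np.nan])

# Element-wise comparison; NaN > 3 evaluates to False.
mask = toy > 3
print(mask.tolist())  # [True, False, False]
```

This is consistent with the expected output, where every `NaN` entry becomes 0.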

```python
>>> matrix = favorable_matrix(movies_df)
>>> print(matrix)
```
```
[[1 0 0 1 0 0 0]
 [1 1 1 0 0 0 0]
 [0 0 0 0 1 1 0]
 [0 0 0 0 0 0 1]]
```

In [None]:
def favorable_matrix(df):
    """
    Takes a pandas data frame and returns a 2-d numpy array
    with 1 for ratings > 3 and 0 for ratings <= 3 or missing (NaN).
    
    Parameters
    ----------
    df: A pandas.DataFrame
    
    Returns
    -------
    A 2-d numpy array.
    """
    
    # YOUR CODE HERE
    
    return data

In [None]:
matrix = favorable_matrix(movies_df)
print(matrix)

In [None]:
assert_is_instance(matrix, np.ndarray)
assert_array_equal(
    matrix,
   [[1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1]]
)

df0 = pd.DataFrame(
   [[5, 5, 5],
    [4, 4, 4],
    [3, 3, 3],
    [2, 2, 2],
    [1, 1, 1]]
)
mat0 = favorable_matrix(df0)
assert_array_equal(mat0[0], [1, 1, 1])
assert_array_equal(mat0[1], [1, 1, 1])
assert_array_equal(mat0[2], [0, 0, 0])
assert_array_equal(mat0[3], [0, 0, 0])
assert_array_equal(mat0[4], [0, 0, 0])

df1 = pd.DataFrame(
   [[5, 4, 3, 2, 1],
    [5, 4, 3, 2, 1],
    [5, 4, 3, 2, 1]]
)
mat1 = favorable_matrix(df1)
assert_array_equal(mat1[:, 0], [1, 1, 1])
assert_array_equal(mat1[:, 1], [1, 1, 1])
assert_array_equal(mat1[:, 2], [0, 0, 0])
assert_array_equal(mat1[:, 3], [0, 0, 0])
assert_array_equal(mat1[:, 4], [0, 0, 0])

We use the same cosine similarity function from the [Introduction to Recommender Systems](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week7/notebooks/intro2rs.ipynb).

In [None]:
def cosine_similarity(u, v):
    """
    The Cosine Similarity function from intro2rs.ipynb
    """
    return np.dot(u, v) / np.sqrt((np.dot(u, u) * np.dot(v, v)))

To find the best matching user, we apply the `cosine_similarity()` function to each row by using [numpy.apply_along_axis](http://docs.scipy.org/doc/numpy/reference/generated/numpy.apply_along_axis.html). If the second parameter in `apply_along_axis()` is 0, the function is applied to each column (`axis=0`); if it is 1, the function is applied to each row (`axis=1`).

After the cosine similarity is calculated for each user, we return the index of the best matching user's row in the `matrix`. For example, if the best matching user was `Alice`, `find_best_match` will return 0; if it was `Bob`, 1 is returned; and so on.
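
As a rough illustration of the row-wise call described above, the sketch below applies the same cosine similarity formula to every row of a favorable-ratings matrix. The array `favorable` is typed in by hand from the expected output earlier in this problem, and the variable names are assumptions made for this example only.

```python
import numpy as np

def cosine_similarity(u, v):
    # Same formula as the cosine_similarity() function in this notebook.
    return np.dot(u, v) / np.sqrt(np.dot(u, u) * np.dot(v, v))

# Favorable ratings for Alice, Bob, Carol, Dave (copied by hand).
favorable = np.array([
    [1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1],
])
target = np.array([1, 0, 0, 0, 0, 0, 0])

# axis=1: each row is passed as u, and target is passed along as v.
similarities = np.apply_along_axis(cosine_similarity, 1, favorable, target)
print(np.round(similarities, 3))  # roughly 0.707, 0.577, 0, 0
```

The largest value is in position 0, which corresponds to Alice.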

In [None]:
def find_best_match(x, y):
    
    # Compute similarity, find maximum value
    similarities = np.apply_along_axis(cosine_similarity, 1, x, y)
    maximum = np.nanmax(similarities)

    # Find the best matching user
    user_index = np.where(similarities == maximum)[0][0]
    
    return user_index
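
A note on why `np.nanmax` is used rather than `np.max`: a row with no favorable ratings is a zero vector, so its cosine similarity is 0/0, which numpy evaluates to `nan`. The sketch below uses a hypothetical 3-by-3 matrix (not the assignment data) to show the difference; `np.errstate` is only there to silence the division warning.

```python
import numpy as np

def cosine_similarity(u, v):
    # Same formula as the cosine_similarity() function in this notebook.
    return np.dot(u, v) / np.sqrt(np.dot(u, u) * np.dot(v, v))

# Hypothetical matrix: the first row has no favorable ratings at all.
rows = np.array([
    [0, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
])
target = np.array([1, 0, 0])

with np.errstate(invalid="ignore"):
    sims = np.apply_along_axis(cosine_similarity, 1, rows, target)

plain_max = np.max(sims)      # nan, because nan propagates through max
best = np.nanmax(sims)        # 1.0, nan entries are skipped
best_index = np.where(sims == best)[0][0]
print(plain_max, best, best_index)
```

Because `nan == best` is always `False`, the `np.where` lookup never selects the all-zero row.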

In [None]:
def print_best_match(df, user):
    
    mat = favorable_matrix(df)
    best_match = find_best_match(mat, user)
    titles = df.index.values
    
    print("Best match = {0}, Cosine Similarity = {1:4.3f}".format(
        titles[best_match],
        cosine_similarity(mat[best_match], user)
    ))

We first create a fake user, `eve`, by selecting only the first movie as favorable.

In [None]:
eve = np.array([1, 0, 0, 0, 0, 0, 0])

Given this new vector, we identify the user who is most similar to this new user.

In [None]:
print_best_match(movies_df, eve)

## Recommend movies

- Write a function named `recommend` that makes a movie recommendation.
- You might want to start by using `favorable_matrix()` to convert the data frame `df` into a 2-d matrix.
- Recall that `find_best_match()` returns the index of the row that represents the best matching user. You can use this index in a 2-d array, e.g., `matrix`, to extract that one row. For example,
```python
>>> array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> match = 0
>>> print(array[match])
```
```
[1 2 3]
```
```python
>>> match = 1
>>> print(array[match])
```
```
[4 5 6]
```

- When we subtract the best matching row from `user`, any negative value is a movie to recommend. For example, if we want to recommend movies for Eve, and the user who is most similar to Eve is Alice,
```python
>>> alice = np.array([1, 0, 0, 1, 0, 0, 0])
>>> eve = np.array([1, 0, 0, 0, 0, 0, 0])
```
we should recommend movie 3 because 3 is the index of -1 in
```python
>>> print(eve - alice)
```
```
[ 0  0  0 -1  0  0  0]
```
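
One possible way to read off the positions of the negative entries is sketched below, using the `eve` and `alice` vectors from the example above. `np.flatnonzero` is just one option chosen for this illustration; `np.where` or boolean indexing would serve equally well, and the variable name `negative_positions` is an assumption of this sketch.

```python
import numpy as np

eve = np.array([1, 0, 0, 0, 0, 0, 0])
alice = np.array([1, 0, 0, 1, 0, 0, 0])

# Negative entries mark movies the match likes but the user has not marked.
diff = eve - alice
negative_positions = np.flatnonzero(diff < 0)
print(negative_positions)  # [3]
```

Positive entries in `diff` (movies the user likes but the match does not) are ignored, since there is nothing to recommend there.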

In [None]:
def recommend(df, user):
    """
    Finds the best matching row in "df" according to cosine similarity,
    and returns the indices where diff < 0, where diff = user - best matching row.
    
    Parameters
    ----------
    df: A pandas.DataFrame
    user: A 1-d numpy array of favorable ratings from one user.
    
    Returns
    -------
    A 1-d numpy array of recommended movies.
    """
    
    # YOUR CODE HERE
    
    return result

In [None]:
print("Recommended movies: {}".format(recommend(movies_df, eve)))

In [None]:
r0 = recommend(movies_df, [1, 0, 0, 0, 0, 0, 0])
assert_is_instance(r0, np.ndarray)
assert_array_equal(r0, [3])

r1 = recommend(movies_df, [1, 0, 1, 0, 0, 0, 0])
assert_array_equal(r1, [1])

r2 = recommend(movies_df, [0, 1, 0, 0, 0, 0, 0])
assert_array_equal(r2, [0, 2])

r3 = recommend(movies_df, [0, 0, 0, 0, 1, 0, 0])
assert_array_equal(r3, [5])

r4 = recommend(movies_df, [0, 0, 0, 0, 0, 1, 0])
assert_array_equal(r4, [4])

r5 = recommend(movies_df, [1, 0, 0, 1, 0, 0, 0])
assert_equal(len(r5), 0)

r6 = recommend(movies_df, [1, 1, 1, 0, 0, 0, 0])
assert_equal(len(r6), 0)

r7 = recommend(movies_df, [1, 1, 1, 1, 0, 0, 0])
assert_equal(len(r7), 0)

So, our algorithm recommends movie 3 for Eve. Does this make sense? Recall that Eve had only the first movie as favorable.

```python
eve = np.array([1, 0, 0, 0, 0, 0, 0])
```


And we found that Alice was most similar to Eve.

```python
>>> print_best_match(movies_df, eve)
```
```
Best match = Alice, Cosine Similarity = 0.707
```

Alice has values:
```
Alice  4.0  NaN  NaN  5.0  1.0  NaN  NaN
```
which correspond to
```
[1, 0, 0, 1, 0, 0, 0]
```

It makes sense. Alice likes movies 0 and 3. Eve likes movie 0. So our algorithm recommends movie 3.

Let's try one more fake user.

In [None]:
frank = np.array([1, 0, 1, 0, 0, 0, 0])
print("Recommended movies: {}".format(recommend(movies_df, frank)))

Does this make sense? As an optional exercise, try the test cases (`r1`, `r2`, ...) and more fake users.