### Evaluation Methods

In this notebook, you will get more comfortable with the different methods for evaluating your recommendation system.  


Before working with actual data, let's review a few high-level ideas to make sure you are comfortable with the fundamentals.  

**Question 1:** Suppose we want to classify, for each movie, whether an individual will click on it when it is recommended to them. Which of the following is **not** a metric that could be used to evaluate a recommendation system in this situation?

In [None]:
import solution_part1 as sp

a = "accuracy"
b = "precision"
c = "recall"
d = "rmse"
e = "f1-score"

your_answer = #a

sp.answer_one(your_answer)

**Question 2:** In the above scenario, imagine we know that the average click-through rate of any recommendation is 10%, while 90% of recommended movies are not clicked through.  

You build a recommendation engine that predicts with 88% accuracy whether a recommended movie will be clicked through.  What is your reaction to this result?

In [None]:
a = "At first glance, 88% seems like a good accuracy based on what we know."
b = "At first glance, 88% accuracy could be good depending on the break out between TP, TN, FP, FN."
c = "At first glance, 88% seems not great given 90% accuracy's possible by predicting every movie won't be clicked."

your_answer = #a

sp.answer_two(your_answer)

**Question 3:** When evaluating how well your recommendation system is working, it is important to use train-test splits of the data (often along with cross-validation).  What is the name of a common problem that can occur when you train your recommendation system on all of your data and then evaluate it by how well it fits that same data?

In [None]:
a = "You are likely to overfit."
b = "You are likely to always over estimate with your predictions."
c = "You are likely to always under estimate with your predictions."
d = "None of the above, you don't need to use train-test splits or cross-validation."

your_answer = #a

sp.answer_three(your_answer)

There are smart ways to make your train-test split.  One way `turicreate` helps with this is the `random_split_by_user` function.  You can find more information on this function in the [documentation here](https://apple.github.io/turicreate/docs/api/generated/turicreate.recommender.util.random_split_by_user.html).  The key takeaway is that rather than taking a random sample of rows for the training and test sets (as a scikit-learn split might do), this technique first randomly selects users, and then randomly selects items within each of those users.

**"`tc.recommender.random_split_by_user` generates a test set by first choosing a subset of the users at random, then choosing a random subset of that user's items. By default, it chooses 1000 users and, for each of these users, 20% of their items on average. Note that not all users may be represented by the test set, as some users may not have any of their items randomly selected for the test set."**
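To make the idea in the quotation concrete, here is a minimal sketch of a user-based split written in plain `pandas` and `numpy`. It is an illustration only, not `turicreate`'s actual implementation: the helper `split_by_user_sketch` and the toy data are hypothetical, and the parameter names simply mirror the defaults described above (1000 users, roughly 20% of each chosen user's items).

```python
import numpy as np
import pandas as pd

def split_by_user_sketch(df, user_col="user_id", max_num_users=1000,
                         item_test_proportion=0.2, seed=0):
    """Hypothetical helper: illustrates the idea, not turicreate's own code."""
    rng = np.random.default_rng(seed)
    users = df[user_col].unique()

    # Step 1: choose a random subset of users (keep all users if max_num_users is None).
    if max_num_users is not None and max_num_users < len(users):
        users = rng.choice(users, size=max_num_users, replace=False)
    eligible = df[user_col].isin(users).to_numpy()

    # Step 2: each row of a chosen user lands in the test set with the given
    # probability, so about 20% of that user's items on average.
    draws = rng.random(len(df)) < item_test_proportion
    in_test = eligible & draws

    return df[~in_test], df[in_test]

# Toy data (made up for illustration): 50 users who each rated the same 20 items.
toy = pd.DataFrame({
    "user_id": np.repeat(np.arange(50), 20),
    "item_id": np.tile(np.arange(20), 50),
    "rating": 5,
})

toy_train, toy_test = split_by_user_sketch(toy, max_num_users=10)
print(len(toy_train), len(toy_test))
print(toy_test["user_id"].nunique())  # at most 10 users can appear in the test set
```

Because rows are only ever drawn from the selected users, every other user keeps their full history in the training set, which is the main difference from a plain row-level random split.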

You will use this functionality to split your data into training and testing in the cells below, and then answer the following questions about your results.

In [None]:
# run this cell to read in the libraries and data needed
import numpy as np
import pandas as pd
import turicreate as tc

ratings_dat = pd.read_csv('../../data/ratings.dat', sep='::', engine='python',
                          header=None, names=['user_id', 'movie_id', 'rating', 'time'])

ratings_dat2 = ratings_dat.copy(deep=True)
ratings_dat2.columns = ['user_id', 'item_id', 'rating', 'time']
ratings_sframe = tc.SFrame(ratings_dat2[['user_id', 'item_id', 'rating']])

In [None]:
# split your data into train and test
train, test = tc.recommender.util.random_split_by_user(ratings_sframe, 
                                                       user_id = 'user_id',
                                                       item_id = 'item_id',
                                                       max_num_users=None)

**Question 4:** What proportion of the full data ended up as `test` when using the above case with `max_num_users=None`?  

In [None]:
your_answer = #1.0235

sp.answer_four(your_answer)