## Case Study Background
This week, the Head of Marketing asks for your help to develop a Recommender System that predicts users' interest in upcoming movies based on their ratings of previously viewed movies.

## Learning objectives
- Learn the differences between Item and User-based Recommender Systems
- Learn how to compute the predictions for Item and User-based Recommender Systems
- Learn the pros and cons of using a Collaborative Filtering Recommender System

## Workshop Overview
- Implement the Item and User-based Recommender System approaches by hand and with Python

# <u>Concept: Collaborative Filtering</u>

Collaborative Filtering is one type of Recommender System. It makes predictions about a user's missing data according to the **collective** behaviour of many other users. It is based on the assumption that people who like similar things will give similar ratings, and that people who give similar ratings will like similar things. There are 2 approaches to Collaborative Filtering: **Item-Based** and **User-Based**.

Let's try both approaches on the small dataset shown below:

|UserName|Aquaman|Avengers: Infinity War|Venom|Black Panther|Ant-Man and the Wasp|Deadpool 2|
|-----------|-----|-----|-----|-----|-----|---|
| Akira    | 3   | -   | 3   | 3.5 | 2.5 | 3 |
| Eve       | 4   | 3.5 | 2.5 | 4   | 3   | 3 |
| Chris     | 3   | 3   | 3   | -   | -   | 4 |
| Pauline    | -   | 3.5 | 2   | 4   | 2.5 | - |
| Josh       | -   | 3   | -   | 3   | 2   | 5 |
| Daniel | 2.5 | 4   | -   | 5   | 3.5 | 5 |
| Grady     | -   | 4.5 | 3   | 4   | 2   | - |

Note that a `-` denotes _missing values_.

Hint: Start off by imputing the missing values. Remember that imputed values are only used to measure similarity; we should not be using imputed values for the final prediction!
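The imputation step in the hint can be sketched in `pandas` as follows. This is a minimal sketch that assumes column (item) mean imputation, which is what the Python implementation later in this workshop uses. The table is typed in by hand here so the block runs on its own, and the names `ratings` and `imputed` are illustrative.

```python
import numpy as np
import pandas as pd

# Ratings table from above; np.nan marks a missing rating.
ratings = pd.DataFrame(
    {
        "Aquaman": [3, 4, 3, np.nan, np.nan, 2.5, np.nan],
        "Avengers: Infinity War": [np.nan, 3.5, 3, 3.5, 3, 4, 4.5],
        "Venom": [3, 2.5, 3, 2, np.nan, np.nan, 3],
        "Black Panther": [3.5, 4, np.nan, 4, 3, 5, 4],
        "Ant-Man and the Wasp": [2.5, 3, np.nan, 2.5, 2, 3.5, 2],
        "Deadpool 2": [3, 3, 4, np.nan, 5, 5, np.nan],
    },
    index=["Akira", "Eve", "Chris", "Pauline", "Josh", "Daniel", "Grady"],
)
ratings.index.name = "UserName"

# Column mean imputation: each gap takes that movie's average rating.
imputed = ratings.fillna(ratings.mean())

# `imputed` is only for measuring similarity; `ratings` keeps its gaps
# so that the final prediction uses original ratings only.
print(imputed.round(2))
```

Keeping the original and the imputed tables as two separate objects makes it harder to use an imputed value in the final weighted score by accident.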

<blockquote style="padding: 10px; background-color: #FFD392;">

## Exercise (By Hand)
    
Do the following questions by hand:
1. Use the **Item-based** Recommender System approach discussed in lectures to predict Daniel's rating for Venom. Use the 3 most similar items to calculate the weighted score.
2. Use the **User-based** Recommender System approach to predict Daniel's rating for Venom. Use the 3 most similar users to calculate the weighted score.
3. Identify and discuss the advantages and disadvantages of each approach.
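
For questions 1 and 2, the weighted score can be written as below. This is a sketch that assumes the similarity measure used in the Python answer later in this workshop (one over one plus the Euclidean distance); if the lectures define similarity differently, substitute that definition for $s_j$. The symbols $\hat{r}_{u,i}$, $r_{u,j}$, $s_j$, $N$ and $d$ are notation introduced here for illustration.

$$
\hat{r}_{u,i} = \frac{\sum_{j \in N} s_j \, r_{u,j}}{\sum_{j \in N} s_j},
\qquad
s_j = \frac{1}{1 + d(i, j)}
$$

For the item-based approach, $N$ is the set of the 3 items most similar to item $i$ among those that user $u$ has rated, $r_{u,j}$ is the user's original (not imputed) rating of item $j$, and $d(i, j)$ is the Euclidean distance between the imputed rating columns of items $i$ and $j$. The user-based approach is analogous: $N$ holds the 3 most similar users who rated item $i$, and their original ratings of item $i$ are averaged with weights given by user-to-user similarity.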

## Answer to question 3
The item-based approach works by considering the ratings of similar movies, while the user-based approach works by considering the opinions of similar users. Item-based measures tend to perform better in many practical cases. Generally, users are likely to be added more frequently than movies, meaning that the offline computation needs to be updated more often for a user-based approach. It's also very difficult to make recommendations to new users with a user-based approach, because a new user has few or no ratings from which to find similar users.

Now that we have worked it out by hand, let's implement the item-based approach using `pandas` and `numpy`.

In [0]:
import pandas as pd
import numpy as np

In [0]:
data = pd.read_csv('recommendersystem.csv')
data = data.set_index('UserName')
data

<blockquote style="padding: 10px; background-color: #FFD392;">

## Exercise
    
Fill in the `...` sections of the following function, which implements the item-based recommender system.

In [0]:
def get_itembased_scores(user, item, df, n=3):
    """
    Return the predicted `user` rating for `item`, using the `n` most similar items.
    """
    
    # Get the original ratings for the current user
    current_ratings = df.loc[user,:]
    
    # Column mean imputation
    imputed_df = df.fillna(df.mean())
    
    # Get the imputed ratings for the current item
    x = imputed_df.loc[:,item]
    
    # Initialise a dictionary of similarity scores, keyed by item
    similarity = {}
    
    # Only include items that user has rated
    rated_items = [col for col in df.columns if not np.isnan(current_ratings[col])]
    
    # Calculate the similarity scores
    for compare_item in rated_items:
        ...

    # Convert `similarity` to a series
    similarity = pd.Series(similarity)
    
    # Create `top_n`: a LIST of the top n item labels to calculate the weighted predicted score
    top_n = ...
    
    # Calculate the predicted score
    predicted_score = (current_ratings[top_n]*similarity[top_n]).sum() / similarity[top_n].sum()
    
    return predicted_score

In [0]:
# ANSWER
def get_itembased_scores(user, item, df, n=3):
    """
    Return the predicted `user` rating for `item`, using the `n` most similar items.
    """

    # Get the original ratings for the current user
    current_ratings = df.loc[user,:]
    
    # Column mean imputation
    imputed_df = df.fillna(df.mean())
    
    # Get the imputed ratings for the current item
    x = imputed_df.loc[:,item]
    
    # Initialise a dictionary of similarity scores, keyed by item
    similarity = {}
    
    # Only include items that user has rated
    rated_items = [col for col in df.columns if not np.isnan(current_ratings[col])]
    
    # Calculate the similarity scores
    for compare_item in rated_items:
        y = imputed_df.loc[:, compare_item]
        eucl_dist = np.sqrt(np.sum([(a-b)*(a-b) for a, b in zip(x, y)]))
        similarity[compare_item] = 1/(1+eucl_dist)

    # Convert `similarity` to a series, and find weights
    similarity = pd.Series(similarity)
    
    # Create `top_n`: a LIST of the top n item labels to calculate the weighted predicted score
    top_n = similarity.sort_values(ascending=False).head(n).index
    
    # Calculate the predicted score
    predicted_score = (current_ratings[top_n]*similarity[top_n]).sum() / similarity[top_n].sum()
    
    return predicted_score
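
The similarity in the answer above is one over one plus the Euclidean distance between two imputed rating columns. As a sketch, the same quantity can be computed in vectorised form with `np.linalg.norm`. The helper name `euclidean_similarity` is hypothetical and is not part of the workshop code.

```python
import numpy as np

def euclidean_similarity(x, y):
    """Map the Euclidean distance between two rating vectors into (0, 1]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    distance = np.linalg.norm(x - y)  # sqrt of the sum of squared differences
    return 1.0 / (1.0 + distance)

print(euclidean_similarity([3, 4, 3], [3, 4, 3]))  # identical columns, distance 0
print(euclidean_similarity([0, 0], [3, 4]))        # distance 5
```

Identical columns have similarity 1, and the similarity shrinks towards 0 as the columns move further apart, which is why larger values receive more weight in the predicted score.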

In [0]:
# Test the function output
get_itembased_scores('Daniel', 'Venom', data)

As a bonus, our tutors have written a program that asks for your own inputs and gives out predicted ratings using the function you have just completed. Try it out!

In [0]:
new_user = {}
new_username = input('Provide your username: ')
print()

missing_movies = []
# Movie names must match the dataframe's column labels exactly
for movie in ['Aquaman', 'Avengers: Infinity War', 'Venom', 'Black Panther', 'Ant-Man and the Wasp', 'Deadpool 2']:
    new_input = input(f'Provide a 0-5 rating for {movie}. Press Enter to skip if you have not watched it: ')
    if new_input == '':
        new_user[movie] = np.nan
        missing_movies.append(movie)
    else:
        new_user[movie] = float(new_input)
        
if len(missing_movies) > 3:
    print("\nYou haven't rated enough movies to provide useful recommendations.")
    
else:
    # Update the dataframe (`DataFrame.append` was removed in pandas 2.0, so use `pd.concat`)
    new_row = pd.DataFrame.from_dict({new_username: new_user}, orient='index')
    new_data = pd.concat([data, new_row])

    # Loop through movies without a rating and perform item-based recommendation
    for movie in missing_movies:
        print(f"\nYou haven't watched {movie}, but we think that you would rate it:", get_itembased_scores(new_username, movie, new_data))

<blockquote style="padding: 10px; background-color: #FFD392;">

## Discussion Questions
1. Recommender systems are challenged by the _cold start_ problem - how to make recommendations to new users, about whom little is known, and how to make recommendations about new items. 
    - For example, if a new user `Anam` has just signed up to our platform, how can the system know which recommendations to make?
    - Likewise, if a new Avengers movie is to be released, whom should the system recommend it to?

Suggest three strategies that might be used to address this.

2. Recommender systems are sometimes criticised for over-recommending popular items to users and under-recommending less well-known items. Why do you think this happens? How might it be addressed?

## Answer to question 1
For new items:
- Find similar items in the current dataset (based on description, author, title, category, ...) and use the mean, or another summary, of these neighbours' ratings to initialise the new item
- Pay someone to rate the new item

For new users:
- Use similar users' data to initialise the new user (possibly based on gender, age, ...)
- Ask the user questions to obtain more data up front, e.g., by providing a collection of items and asking the user to choose the ones they like
- Use the most popular items system-wide as the initial suggestions (there is not yet enough data to provide a good personalised recommendation)
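
The "most popular items" fallback can be sketched as follows. This sketch assumes that popularity means the number of users who rated a movie, with ties broken by mean rating; other definitions are equally reasonable. The function name `most_popular_items` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def most_popular_items(df, k=3):
    """Rank items by how many users rated them, breaking ties by mean rating."""
    stats = pd.DataFrame({"n_ratings": df.count(), "mean_rating": df.mean()})
    ranked = stats.sort_values(["n_ratings", "mean_rating"], ascending=False)
    return list(ranked.head(k).index)

# Toy ratings: rows are users, columns are movies, np.nan is a missing rating.
toy = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, np.nan],
        "Movie B": [3.0, 3.0, 3.0],
        "Movie C": [np.nan, 5.0, np.nan],
    },
    index=["user_1", "user_2", "user_3"],
)

print(most_popular_items(toy, k=2))
```

Because this ranking ignores the new user entirely, it only serves as a starting point until the user has rated enough movies for collaborative filtering to take over.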

## Answer to question 2
This is a feedback loop: popular items are recommended more often, so users are more likely to give high ratings to these items, and the popularity of these items increases further.

The system has no understanding of the items themselves, so over- and under-recommending occur naturally as the system performs its job.

To address this:
- Add an extra step that adjusts the results to ensure diversity (covering a broad range of items)
- Give weight to the timeliness of an item (how old it is), to prevent users from getting the same recommendations all the time

# <u> Challenge questions </u>

1. Implement a `get_userbased_scores` function for the User-based approach.
2. "Collaborative filtering is a regression variation of the kNN classifier". Is this true or false and why?
3. Improve on the user-input program to handle the cold-start problem (e.g., a user who skips all of the ratings). Your new program should recommend the top 3 most popular movies if the user has no ratings for any movie!
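
One possible starting point for challenge question 1 is sketched below. It is not a reference solution. It mirrors `get_itembased_scores` by assuming the same column mean imputation and the same similarity (one over one plus the Euclidean distance), but compares rows (users) instead of columns (items). The toy dataframe is hypothetical.

```python
import numpy as np
import pandas as pd

def get_userbased_scores(user, item, df, n=3):
    """
    Return the predicted `user` rating for `item`, using the `n` most similar
    users who have rated `item`.
    """
    # Column mean imputation, used only to measure similarity
    imputed_df = df.fillna(df.mean())
    x = imputed_df.loc[user, :].to_numpy()

    # Only include other users who have actually rated the item
    raters = [u for u in df.index if u != user and not np.isnan(df.loc[u, item])]

    # Similarity between the current user and each candidate neighbour
    similarity = {}
    for other in raters:
        y = imputed_df.loc[other, :].to_numpy()
        similarity[other] = 1 / (1 + np.linalg.norm(x - y))
    similarity = pd.Series(similarity)

    # Labels of the n most similar users
    top_n = similarity.sort_values(ascending=False).head(n).index

    # Weighted average of the neighbours' original ratings for the item
    neighbour_ratings = df.loc[top_n, item]
    return (neighbour_ratings * similarity[top_n]).sum() / similarity[top_n].sum()

# Toy check: u1 and u2 are equally similar to u0, so the prediction
# is the plain average of their ratings for item "c".
toy = pd.DataFrame(
    {"a": [1.0, 1.0, 1.0], "b": [1.0, 1.0, 1.0], "c": [np.nan, 4.0, 2.0]},
    index=["u0", "u1", "u2"],
)
print(get_userbased_scores("u0", "c", toy))
```

As with the item-based function, imputed values only enter the similarity calculation; the weighted score uses the neighbours' original ratings.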