<div style="float:left;margin:5px 10px 5px 10px" markdown="1">
    <img src="images/auc.png" width="300">
</div>

<div style="float:right;margin-top:10px" markdown="1">
    <h3><i>Text Mining & Collective Intelligence</i></h3>
</div>

<br><br><br><br>

<center><h1>Making Recommendations</h1>

<br>

<h3>by Gianluca E. Lebani</h3>
<h4>• 31 Oct. 2017 •</h4>

</center>

<br>

>### Today
>
>- [the MovieLens 1M Dataset](#the-MovieLens-1M-Dataset)
>
>
>- [user-to-user kNN](#user-to-user-kNN)
>
>
>- [item-to-item kNN](#item-to-item-kNN)

---

In [1]:
from __future__ import division

from itertools import product, combinations
from operator import itemgetter
from collections import defaultdict

from scipy.spatial import distance

from sklearn.metrics import mean_absolute_error

## the MovieLens 1M Dataset

The MovieLens 1M dataset has been developed by the members of the [GroupLens](https://grouplens.org/) lab in the Department of Computer Science and Engineering at the University of Minnesota.

The MovieLens 1M dataset in brief:

- Ratings: 1 million
- Users: 6040
- Rated Movies: 3592
- Rating scale: {1, ... , 5}
- Additional information on the users: gender, age range, occupation, zip-code
- Additional information on the movies: genre

(A zipped version of this dataset should be available in the `./data` folder; if it is not, [download it](https://grouplens.org/datasets/movielens/1m/) and unzip it.)

The dataset is composed of three files:

- `movies.dat`, providing information about the rated movies 
    - it follows the format `MovieID::Title::Genres`
    

- `users.dat`, providing information about the users
    - it follows the format `UserID::Gender::Age::Occupation::Zip-code`
    - each user has at least 20 ratings


- `ratings.dat`, encoding the ratings
    - it follows the format `UserID::MovieID::Rating::Timestamp`
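
To make the `::`-separated layout concrete, here is a minimal sketch that parses a few lines in the `ratings.dat` format into a nested dictionary. The sample lines and their values are made up for illustration and are not taken from the dataset.

```python
# Hypothetical sample lines in the `UserID::MovieID::Rating::Timestamp` layout
# (the values are invented, not read from ratings.dat).
sample_lines = [
    "1::10::5::978300760\n",
    "1::20::3::978302109\n",
    "2::10::4::978301968\n",
]

ratings = {}
for line in sample_lines:
    # keep the first three fields as integers and drop the timestamp
    user_id, movie_id, rating = [int(field) for field in line.strip().split("::")[:3]]
    ratings.setdefault(user_id, {})[movie_id] = rating

print(ratings)
```

The result maps each user id to a dictionary of `{movie id: rating}`, which is the shape used throughout the rest of this notebook.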

In [2]:
id2movie = dict()

with open("data/ml-1m/movies.dat", "rb") as infile:
    for line in infile:
        movieId, movie, _ = line.split("::")
        id2movie[int(movieId)] = movie


In [None]:
user2movies_ratings = defaultdict(dict)

with open("data/ml-1m/ratings.dat", "rb") as infile:
    for line in infile:
        userId, movieId, rating = [int(el) for el in line.split("::")[:3]]
        user2movies_ratings[userId][movieId] = rating

---

<div style="float:left;margin:0 25px 10px 20px">
    <img src="images/your_turn.jpg" width="110">
</div>

#### Your Turn.

Explore the dataset:

- what is the average rating?


- which are the top-rated movies? And which are the lowest-rated ones?


- what are the average ratings for men and women?


- which movies received the highest ratings from men? Which ones from women?


In [None]:
# your code here

---

### Use Case

In the following exercise we will try to model the ratings:

- given by subject `4447`
    - who gave 982 ratings


- for the movies:

    - `Back to the Future (1985)`: id = `1270`
    - `Silence of the Lambs, The (1991)` : id = `593`
    - `Raiders of the Lost Ark (1981)` : id = `1198`

In [None]:
target_movies = [593, 1198, 1270]

In [None]:
# let's remove our target ratings from the dataset

target_ratings = dict([(i, user2movies_ratings[4447].pop(i)) for i in target_movies])

## user-to-user kNN

Given an active user *a*:
- use  a similarity measure to determine the *k* most-similar users to *a*


- obtain the prediction on item *i* for user *a* by using one of the following aggregation approaches on the ratings from the neighborhood:
    - average
    - weighted sum
    - adjusted weighted aggregation (deviation-from-mean)
    
    
- choose the top-*n* items by selecting the *n* items with the highest scores, calculated by applying the previous steps to the items that haven't been rated by user *a*
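
The three aggregation approaches listed above can be compared on a toy neighborhood. This is only a sketch: the similarities, ratings and mean ratings below are made up, and the variable names are not part of the notebook.

```python
# Toy neighborhood for an active user `a` (all numbers are invented).
# Each neighbor is (similarity to a, rating for item i, neighbor's own mean rating).
neighbors = [
    (0.9, 5, 4.0),
    (0.6, 3, 3.5),
    (0.3, 4, 2.0),
]
mean_a = 3.0  # assumed mean rating of the active user

total_similarity = sum(s for s, _, _ in neighbors)

# 1) average: plain mean of the neighbors' ratings
average = sum(r for _, r, _ in neighbors) / float(len(neighbors))

# 2) weighted sum: ratings weighted by similarity, normalized by the total similarity
weighted = sum(s * r for s, r, _ in neighbors) / total_similarity

# 3) deviation-from-mean: weight each neighbor's deviation from their own mean,
#    then add the result to the active user's mean
adjusted = mean_a + sum(s * (r - m) for s, r, m in neighbors) / total_similarity

print(average)
print(weighted)
print(adjusted)
```

The adjusted variant compensates for neighbors who are systematically generous or strict, which is why it can give a noticeably different prediction from the other two.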

### STEP 1: finding the top-50 similar users

Let's build the neighborhood for our target user `4447` by calculating his similarity with the other raters

In [None]:
def calculate_similarity(ratings, id1, id2, measure, threshold = 0):
    # get the sorted list of items rated by both ids
    shared = sorted(set(ratings[id1].keys()).intersection(set(ratings[id2].keys())))

    # ignore comparisons with too few overlapping ratings (default is 0)
    if len(shared) <= threshold:
        return 0
    
    # build the two rating vectors, aligned on the shared items
    sel_ratings = [[ratings[i][k] for k in shared] for i in [id1, id2]]
    
    # compute the distance (named `dist` so that it does not shadow scipy's `distance` module)
    dist = measure(*sel_ratings)
    
    # transform the distance into a similarity score
    if measure == distance.euclidean:
        return 1 / (1 + dist)
    else:
        return 1 - dist
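
To see why the two measures can disagree, here is a small pure-Python sketch on made-up rating vectors. It does not call scipy: `euclidean` and `pearson` are hypothetical helpers written for this example, relying on the fact that scipy's correlation distance is one minus the Pearson correlation.

```python
from math import sqrt

def euclidean(u, v):
    # straight-line distance between two rating vectors
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pearson(u, v):
    # Pearson correlation, i.e. 1 minus the correlation distance
    n = float(len(u))
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / sqrt(var_u * var_v)

# two invented users: the second always rates one star lower than the first
u = [5, 3, 4]
v = [4, 2, 3]

euclidean_similarity = 1 / (1 + euclidean(u, v))
correlation_similarity = pearson(u, v)

print(euclidean_similarity)
print(correlation_similarity)
```

The two users have identical tastes up to a constant offset, so their correlation is perfect, while the Euclidean-based similarity is penalized by the offset.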

In [None]:
# let's calculate the similarities by using both the euclidean similarity and correlation 
measure2function = {"euclidean" : distance.euclidean, "correlation" : distance.correlation}

similarities = dict()
for measure, function in measure2function.items():
    similarities[measure] = dict()
    for id1, id2 in product([4447], user2movies_ratings.keys()):
        # do not compare our target user with himself
        if id1 == id2:
            continue
        similarities[measure][id2] = calculate_similarity(user2movies_ratings, id1, id2, function)

In [None]:
# select the most similar users according to each measure
neighborhood = dict()
for measure in similarities.keys():
    neighborhood[measure] = dict(sorted(similarities[measure].iteritems(), key = itemgetter(1), reverse = True)[:50])
print neighborhood
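
As an alternative sketch for the selection step, `heapq.nlargest` from the standard library picks the *k* highest-scoring entries without sorting the whole dictionary. The user ids and similarity scores below are made up.

```python
import heapq
from operator import itemgetter

# invented similarity scores, keyed by hypothetical user ids
toy_similarities = {11: 0.2, 12: 0.9, 13: 0.5, 14: 0.7}
k = 2

# keep the k (user, similarity) pairs with the highest similarity
top_k = dict(heapq.nlargest(k, toy_similarities.items(), key=itemgetter(1)))

print(top_k)
```

For a neighborhood of 50 out of a few thousand users the difference is small, so the sort-and-slice version above is perfectly adequate here.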

### STEP 2: obtain the predictions for all the items of interest 

- in a real-life scenario, we should obtain predictions for all the items that were not rated by our user


- in this example, we will get predictions for just our three target movies

Let's use the following weighted score to aggregate our ratings:
 
- take the votes of all other critics and multiply them by their similarity with our target user


- sum these weighted votes for each item of interest


- in order to handle the sparseness of the dataset (no movie has been rated by all the users), divide this score by the sum of all the similarities for critics that reviewed that movie

See, for instance, the example from Segaran (2007: 15):

![alt text](images/weighting-users.png)
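
In symbols, the weighted score described above can be sketched as follows. The notation is an assumption introduced here, not taken from the original notebook: $N_i(a)$ is the set of neighbors of user $a$ who rated item $i$, $r_{u,i}$ is the rating given by user $u$ to item $i$, and $\mathrm{sim}(a,u)$ is the similarity between $a$ and $u$.

$$\hat{r}_{a,i} = \frac{\sum_{u \in N_i(a)} \mathrm{sim}(a,u)\; r_{u,i}}{\sum_{u \in N_i(a)} \mathrm{sim}(a,u)}$$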

In [None]:
def getPredictions(movieId, neighborhood, ratings):
    weighted_scores = []
    similarities = []

    # only the neighbors that actually rated the movie contribute to the score
    for user, sim in neighborhood.iteritems():
        if movieId in ratings[user]:
            weighted_scores.append(sim * ratings[user][movieId])
            similarities.append(sim)
    
    # NOTE: this fails with a ZeroDivisionError if no neighbor rated the movie
    return sum(weighted_scores) / sum(similarities)

In [None]:
recommendations = defaultdict(dict)

for measure in similarities.keys():
    for movie in target_movies:
        recommendations[measure][movie] = getPredictions(movie, neighborhood[measure], user2movies_ratings)

### STEP 3: choose the top-items 

- in a real life scenario, you should choose the top-rated items and recommend them to the user


- in this exercise, we will compare the **rating predictions** produced by our recommender system (RS) against the ratings actually given by our user

Is the **ranking** preserved?

In [None]:
print "- original ratings:"

for movieID in target_ratings:
    print id2movie[movieID], "-->", target_ratings[movieID]

In [None]:
for measure in similarities.keys():
    print "-", measure, "ratings:"
    for movieID in target_ratings:
        print id2movie[movieID], "-->", recommendations[measure][movieID]
    print
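
One way to answer the ranking question programmatically is to sort the movies by true rating and by predicted rating and compare the two orderings. This is a minimal sketch on made-up numbers; the movie ids and the `ranking` helper are hypothetical.

```python
# invented true and predicted ratings, keyed by hypothetical movie ids
toy_true = {101: 5, 102: 3, 103: 4}
toy_predicted = {101: 4.4, 102: 3.9, 103: 4.1}

def ranking(scores):
    # movie ids sorted from the highest to the lowest score
    return sorted(scores, key=scores.get, reverse=True)

ranking_preserved = ranking(toy_true) == ranking(toy_predicted)

print(ranking(toy_true))
print(ranking_preserved)
```

Keep in mind that true ratings are integers from 1 to 5, so ties are common; when two movies share the same true rating, their relative order in this comparison is arbitrary.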

Let's calculate the **Mean Absolute Error** (i.e. the average absolute difference between the real ratings and those predicted by the RS)

In [None]:
true_ratings = [target_ratings[movieID] for movieID in target_ratings]

for measure in similarities.keys():
    print "-", measure, "MAE:",
    predicted = [recommendations[measure][movieID] for movieID in target_ratings]
    print mean_absolute_error(true_ratings, predicted)
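
For reference, the MAE can also be computed by hand in a couple of lines. The sketch below uses made-up ratings and plain Python rather than scikit-learn.

```python
# invented true and predicted ratings for three movies
toy_true = [5, 3, 4]
toy_predicted = [4.4, 3.9, 4.1]

# absolute error for each movie, then their mean
absolute_errors = [abs(t - p) for t, p in zip(toy_true, toy_predicted)]
mae = sum(absolute_errors) / float(len(absolute_errors))

print(mae)
```

With only three target movies the MAE is a very coarse indicator: a single badly predicted movie dominates the score.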

---

<div style="float:left;margin:0 25px 10px 20px">
    <img src="images/your_turn.jpg" width="110">
</div>

#### Your Turn.

See what happens if:

- we specify a minimum rating overlap threshold (say, 10) in the `calculate_similarity()` function


- we change the size of the neighborhood (say, to 100)

---

## item-to-item kNN

Given an active user *a*:

- for each item in the database, use a similarity measure to determine its *k* most-similar items


- for each item *i* not rated by *a*, predict its rating on the basis of *a*'s previous ratings of the items in *i*'s neighborhood


- choose the top-*n* items by selecting the *n* items with the highest scores calculated in the previous step


### STEP 1: finding the top-25 similar items

Re-arranging the ratings dictionary allows us to use the `calculate_similarity()` function to calculate the item-based similarities as well

In [None]:
# let's rearrange the dictionary of ratings 

movies2user_ratings = defaultdict(dict)

for user, user_ratings in user2movies_ratings.iteritems():
    for movie, rating in user_ratings.iteritems():
        movies2user_ratings[movie][user] = rating

To speed up the process, we will ignore all the movies that have not been rated by at least 1500 users

In [None]:
filtered_movies2user_ratings = dict()
for movie in movies2user_ratings.keys():
    if len(movies2user_ratings[movie]) < 1500:
        continue
    else:
        filtered_movies2user_ratings[movie] = movies2user_ratings[movie]

print "-", len(movies2user_ratings) - len(filtered_movies2user_ratings), "movies have been discarded"
print "-", len(filtered_movies2user_ratings), "movies have been selected"

In [None]:
# let's calculate the similarities by using only correlation (NOTE: this is very inefficient!)

similarities = defaultdict(dict)
for id1, id2 in combinations(filtered_movies2user_ratings.keys(), 2):
    sim = calculate_similarity(movies2user_ratings, id1, id2, distance.correlation)
    # similarity is symmetric: store it in both directions so that every movie gets a full neighborhood
    similarities[id1][id2] = sim
    similarities[id2][id1] = sim
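
A note on the cost and shape of this loop, sketched on a toy list of items: `combinations` yields each unordered pair exactly once, so *n* items produce *n(n-1)/2* comparisons, and a lookup keyed by either item of a pair needs the score stored in both directions. The item names and the placeholder score are made up.

```python
from itertools import combinations
from collections import defaultdict

toy_items = ["a", "b", "c", "d"]
pairs = list(combinations(toy_items, 2))

# n items give n * (n - 1) / 2 unordered pairs
expected_pairs = len(toy_items) * (len(toy_items) - 1) // 2

symmetric = defaultdict(dict)
for x, y in pairs:
    score = 1.0  # placeholder standing in for an actual similarity
    symmetric[x][y] = score
    symmetric[y][x] = score

print(len(pairs))
print(sorted(symmetric["d"]))
```

Without the mirrored entry, the last item in the list would never appear as a first-level key and would end up with no neighborhood at all.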

In [None]:
# select the most similar items
neighborhood = dict()
for movie in similarities.keys():
    neighborhood[movie] = dict(sorted(similarities[movie].iteritems(), key = itemgetter(1), reverse = True)[:25])

### STEP 2: obtain the predictions for all the items of interest 

For all the items that the user hasn't rated (here we restrict ourselves to our three target movies), the following weighted score is used to aggregate our ratings:

- for each pair of items composed of one item rated by our user and one of our items of interest, we calculate a score by multiplying their pairwise similarity by the user's rating for the known movie


- for each item of interest we sum all these scores


- the score is normalized by dividing this total by the sum of the pairwise similarity scores involving the item of interest

See, for instance, the example from Segaran (2007: 24):

![alt text](images/weighting-item.png)
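
The item-based weighted score can be traced by hand on a toy case. This is a sketch under made-up data: the movie names, similarities and ratings below are invented for illustration.

```python
# invented similarities between one target movie and three other movies
item_similarities = {"Movie A": 0.8, "Movie B": 0.5, "Movie C": 0.2}
# invented ratings by our user, who has not rated "Movie C"
user_ratings = {"Movie A": 4, "Movie B": 2}

weighted_total = 0.0
similarity_total = 0.0
for item, sim in item_similarities.items():
    # only the items the user has already rated contribute
    if item in user_ratings:
        weighted_total += sim * user_ratings[item]
        similarity_total += sim

# normalize by the similarities that were actually used
prediction = weighted_total / similarity_total

print(prediction)
```

The prediction lands closer to the rating of "Movie A" than to that of "Movie B" because "Movie A" is the more similar item.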

In [None]:
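def getPredictionsForItems(userId, neighborhood, ratings):
    weighted_scores = []
    similarities = []

    # only the neighboring items that the user has rated contribute to the score
    for item, sim in neighborhood.iteritems():
        if userId in ratings[item]:
            weighted_scores.append(sim * ratings[item][userId])
            similarities.append(sim)
    
    # NOTE: this fails with a ZeroDivisionError if the user rated none of the neighbors
    return sum(weighted_scores) / sum(similarities)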

In [None]:
recommendations = defaultdict(dict)

for movie in target_movies:
    recommendations[movie] = getPredictionsForItems(4447, neighborhood[movie], filtered_movies2user_ratings)

### STEP 3: choose the top-items 

- in a real life scenario, you should choose the top-rated items and recommend them to the user


- in this exercise, we will compare the **rating predictions** produced by our recommender system (RS) against the ratings actually given by our user

Is the **ranking** preserved?

In [None]:
print "- original ratings:"

for movieID in target_ratings:
    print id2movie[movieID], "-->", target_ratings[movieID]

In [None]:
print "- predicted ratings:"
for movieID in target_ratings:
    print id2movie[movieID], "-->", recommendations[movieID]

Let's calculate the **Mean Absolute Error** (i.e. the average absolute difference between the real ratings and those predicted by the RS)

In [None]:
true_ratings = [target_ratings[movieID] for movieID in target_ratings]
predicted = [recommendations[movieID] for movieID in target_ratings]

print "- MAE:", mean_absolute_error(true_ratings, predicted)
