## Information Retrieval lab5

- Martyna Stasiak id.156071
- Maria Musiał id.156062
----

The purpose of the exercise is to implement a recommendation system for a movie search engine.

When selecting a movie that our user will like, let's first consider what data we have available. First of all, the database records how our user rated the movies they have watched. Note that this is far from all of the movies in the database; most often it is a very small subset of a huge catalogue. Still, it tells us which movies our user liked and which ones they didn't.

Is this all the data available? Well, no! We also have information about the preferences of other users! So we can find in the data a sample of users who have similar movie taste to our user. Note that virtually every such other user has watched some movies that our user has never watched before! The idea behind collaborative filtering is very simple: if another user with similar tastes rated a movie highly, our user will probably rate it highly too! Let's recommend movies that users with similar tastes have rated highly!


Let's formalize some ideas:
 - How do we measure the similarity between users' tastes?

 Just calculate the correlation between their movie ratings. Users with a strongly positive correlation have similar tastes, and those with a strongly negative correlation have opposite tastes ;)

 - Having found similar users, how do we compute our user's predicted rating for a movie?

 We compute the weighted average of the ratings given by users with similar tastes, where the weight is the similarity measure (correlation). The closer a user's tastes are to ours, the more weight their rating carries. (slide 27, http://www.mmds.org/mmds/v2.1/ch09-recsys1.pdf)
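
As a minimal sketch of the similarity measure described above, the snippet below computes Pearson's correlation using only the movies that both users have rated. The toy ratings and the helper name `similarity` are made up for illustration and are not part of the lab data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are movies, columns are users, NaN = not rated.
toy = pd.DataFrame(
    {
        "me":    [5.0, 4.0, 1.0, np.nan],
        "alice": [4.5, 4.0, 1.5, 5.0],
        "bob":   [1.0, 2.0, 5.0, 1.0],
    },
    index=["m1", "m2", "m3", "m4"],
)

def similarity(ratings, userA, userB):
    """Pearson correlation over the movies rated by both users."""
    common = ratings[[userA, userB]].dropna()  # keep co-rated movies only
    return common[userA].corr(common[userB])   # Series.corr defaults to Pearson

simAlice = similarity(toy, "me", "alice")  # strongly positive: similar taste
simBob = similarity(toy, "me", "bob")      # strongly negative: opposite taste
print(f"alice: {simAlice:.2f}, bob: {simBob:.2f}")
```

In this toy data `alice` rates the shared movies almost like `me`, while `bob` rates them the other way round, so the two correlations end up close to the opposite ends of the scale.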


In [1]:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
import random

df = pd.read_csv('./ratings.csv')
df

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [2]:
random.seed(0)

-----

### <b>Task 1
Modify the dataframe to have movieId as the index, userId as the columns and rating as the values

In [3]:
dfTask = df.pivot(index='movieId', columns='userId', values='rating')
dfTask.head()

movieId \ userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,4.0,,,,4.0,,4.5,,,,...,4.0,,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,,,,,,4.0,,4.0,,,...,,4.0,,5.0,3.5,,,2.0,,
3,4.0,,,,,5.0,,,,,...,,,,,,,,2.0,,
4,,,,,,3.0,,,,,...,,,,,,,,,,
5,,,,,,5.0,,,,,...,,,,3.0,,,,,,
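
To make the reshaping step concrete, here is a small sketch of `DataFrame.pivot` on a made-up three-row ratings table (the values are hypothetical). It assumes every (movieId, userId) pair occurs at most once, which is what `pivot` requires; with duplicate pairs it raises a `ValueError`, and `pivot_table` with an aggregation function would be one alternative.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair.
toyRatings = pd.DataFrame(
    {
        "userId":  [1, 1, 2],
        "movieId": [10, 20, 10],
        "rating":  [4.0, 3.0, 5.0],
    }
)

# Wide format: one row per movie, one column per user, NaN where no rating exists.
wide = toyRatings.pivot(index="movieId", columns="userId", values="rating")
print(wide)
```

User 2 never rated movie 20, so that cell is NaN in the wide table, exactly like the many empty cells in `dfTask`.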


#### Now let's also see some stats about our movie database

In [4]:
numMovies = dfTask.shape[0]
numUsers = dfTask.shape[1]

numNonNan = dfTask.notna().sum().sum()
numNan = dfTask.isna().sum().sum()

#most and least watched movies:
movieWatchCount = dfTask.count(axis=1)
mostWatchedMovie = movieWatchCount.idxmax()
mostWatchedMovieWatchCount = movieWatchCount.max()
leastWatchedMovie = movieWatchCount.idxmin()
leastWatchedMovieWatchCount = movieWatchCount.min()

#most and least active users:
userWatchCount = dfTask.count(axis=0)
mostActiveUser = userWatchCount.idxmax()
mostActiveUserWatchCount = userWatchCount.max()
leastActiveUser = userWatchCount.idxmin()
leastActiveUserWatchCount = userWatchCount.min()


print(f"Dataset summary:")
print(f"Number of movies in the dataset: {numMovies}")
print(f"Number of users in the dataset: {numUsers}")
print(f"Number of non-NaN values in the dataset: {numNonNan}")
print(f"Number of NaN values in the dataset: {numNan}\n")

print(f"Most watched movie: {mostWatchedMovie} ({mostWatchedMovieWatchCount} watches)")
print(f"Least watched movie: {leastWatchedMovie} ({leastWatchedMovieWatchCount} watches)\n")

print(f"Most active user: {mostActiveUser} ({mostActiveUserWatchCount} movies rated)")
print(f"Least active user: {leastActiveUser} ({leastActiveUserWatchCount} movies rated)\n")


Dataset summary:
Number of movies in the dataset: 9724
Number of users in the dataset: 610
Number of non-NaN values in the dataset: 100836
Number of NaN values in the dataset: 5830804

Most watched movie: 356 (329 watches)
Least watched movie: 49 (1 watches)

Most active user: 414 (2698 movies rated)
Least active user: 53 (20 movies rated)
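
The numbers above also tell us how sparse the rating matrix is. The short calculation below is only a sketch that re-uses the printed counts as constants; the variable name `density` is our own.

```python
# Counts copied from the dataset summary printed above.
numMovies = 9724
numUsers = 610
numRatings = 100836

totalCells = numMovies * numUsers   # every possible (movie, user) pair
density = numRatings / totalCells   # fraction of cells that hold a rating
print(f"Filled cells: {density:.2%} of {totalCells}")
```

Fewer than 2% of the cells hold a rating, which is why two users often share only a handful of movies.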



Small remark: <br>
The most/least watched movie and the most/least active user shown above are not necessarily unique: several movies (or users) can share the same count, and `idxmax`/`idxmin` return only the first of them :)
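
A possible way to see such ties, sketched on made-up watch counts (the series below is hypothetical, not taken from `dfTask`): `idxmin` reports only the first label with the minimum, while a boolean filter lists all of them.

```python
import pandas as pd

# Hypothetical watch counts indexed by movieId; movies 49 and 77 tie for the minimum.
watchCounts = pd.Series({356: 329, 49: 1, 77: 1, 318: 317})

firstLeastWatched = watchCounts.idxmin()  # first label that reaches the minimum
allLeastWatched = watchCounts[watchCounts == watchCounts.min()].index.tolist()

print(firstLeastWatched, allLeastWatched)
```

The same filter applied to `movieWatchCount` or `userWatchCount` would show how many movies or users share the extreme value.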

--------

### <b>Task 2
Let's try to recommend movies for user 610. Calculate the correlation between this user and the remaining ones.

In [5]:
user = 610
userRatings = dfTask[user]  # one column of the pivot table: all ratings of this user
print(f"User {user} has rated {userRatings.count()} movies")
print(f"Ratings of user {user}:\n {userRatings.dropna()}")


User 610 has rated 1302 movies
Ratings of user 610:
 movieId
1         5.0
6         5.0
16        4.5
32        4.5
47        5.0
         ... 
166534    4.0
168248    5.0
168250    5.0
168252    5.0
170875    3.0
Name: 610, Length: 1302, dtype: float64


In [18]:
def CalculatetCorrelations(user, commonMovies=2, moviesdf=dfTask):
    """Return (otherUser, Pearson r) pairs sorted from the most to the least similar user."""
    correlations = {}
    userRatings = moviesdf[user].dropna()

    for otherUser in moviesdf.columns:
        if otherUser == user:
            continue
        otherUserRatings = moviesdf[otherUser].dropna()
        # movies rated by both users
        commonRatings = userRatings.index.intersection(otherUserRatings.index)

        # skip users who share fewer than `commonMovies` rated movies with `user`
        if len(commonRatings) >= commonMovies:
            correlations[otherUser] = pearsonr(userRatings[commonRatings], otherUserRatings[commonRatings])[0]

    # pearsonr gives NaN when one of the inputs is constant, so we drop those users
    valid_correlations = {k: v for k, v in correlations.items() if not np.isnan(v)}

    sorted_correlations = sorted(valid_correlations.items(), key=lambda x: x[1], reverse=True)
    return sorted_correlations

In [19]:
user610Correlations = CalculatetCorrelations(user=610)
print("Top correlated users with the user 610 are:")
# use `otherUser` as the loop variable so the global `user` (610) is not overwritten
for otherUser, corr in user610Correlations[:10]:
    print(f"User {otherUser} with correlation {corr:.2f}")


Top correlated users with the user 610 are:
User 442 with correlation 1.00
User 545 with correlation 1.00
User 576 with correlation 1.00
User 158 with correlation 0.91
User 92 with correlation 0.90
User 595 with correlation 0.89
User 120 with correlation 0.88
User 463 with correlation 0.82
User 138 with correlation 0.82
User 494 with correlation 0.81


### <b>Task 2b
There are a few users with the perfect match. Isn't it suspicious? Check it

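The top of the list above contains three users (442, 545 and 576) with a correlation of 1.00. Such perfect matches are suspicious: with the default `commonMovies=2`, a user who shares only two rated movies with user 610 always gets a correlation of exactly +1 or -1, because any two points lie on a straight line. These scores therefore most likely reflect a tiny overlap rather than truly identical taste. Our function already drops the NaN correlations, and once we require at least 5 common movies (Task 3) no perfect match remains.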
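
One way to check this is sketched below: it shows that Pearson's r over two co-rated movies is always +1 or -1, and counts the overlap between two users. The helper `commonMovieCount` and the toy frame are hypothetical; in the notebook the same helper could be called on `dfTask` for the suspicious users.

```python
import numpy as np
import pandas as pd

# Two points always lie on a line, so r is +1 or -1 regardless of the actual ratings.
rTwoUp = np.corrcoef([5.0, 4.0], [3.0, 2.5])[0, 1]    # both users prefer the first movie
rTwoDown = np.corrcoef([5.0, 4.0], [2.0, 4.5])[0, 1]  # the users disagree on the order

def commonMovieCount(ratings, userA, userB):
    """Number of movies rated by both users (rows = movies, columns = users)."""
    return int((ratings[userA].notna() & ratings[userB].notna()).sum())

# Hypothetical ratings of two users over four movies.
toy = pd.DataFrame(
    {610: [5.0, 4.0, np.nan, 3.0], 442: [3.0, 2.5, 4.0, np.nan]},
    index=[1, 6, 16, 32],
)
overlap = commonMovieCount(toy, 610, 442)
print(rTwoUp, rTwoDown, overlap)
```

A very small overlap next to a correlation of 1.00 is a strong hint that the match is an artifact of the sample size.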


### <b>Task 3
Find 5 users with at least 5 common movies with user=610 and the highest correlation with that user

In [20]:
numTopUsers = 5
commonMovies = 5
user = 610  # the user we are recommending for
user610Correlations = CalculatetCorrelations(user=user, commonMovies=commonMovies)
Best5CorrelatedUsers = user610Correlations[:numTopUsers]

print(f"Top {numTopUsers} correlated users with the user {user}, who have watched at least {commonMovies} of the same movies are:")
for otherUser, correlation in Best5CorrelatedUsers:
    print(f"User {otherUser} with correlation {correlation:.2f}")


Top 5 correlated users with the user 610, who have watched at least 5 of the same movies are:
User 92 with correlation 0.90
User 120 with correlation 0.88
User 463 with correlation 0.82
User 138 with correlation 0.82
User 494 with correlation 0.81


### <b> Task 4
Predict scores for each movie based on the most correlated users. Use a weighted average with the correlation coefficients as weights.
$$\hat{y}_j = \frac{\sum_{i \in U} w_i y_{ij}}{\sum_{i \in U} w_i}$$

$U$ is the set of considered users who also watched the $j$th movie, $w_i$ denotes the correlation between our user and the $i$th user, and $y_{ij}$ is the score given by the $i$th user to the $j$th movie.
Use only movies watched by at least two users from the considered set.
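
A small worked example of the formula, as a sketch: the weights are the rounded correlations of users 92 and 120 from the Task 3 output, while the two ratings are invented for illustration.

```python
# Correlations (weights) of two neighbours, rounded as printed in Task 3.
weights = {92: 0.90, 120: 0.88}
# Hypothetical ratings these two users gave to some movie j.
neighbourRatings = {92: 4.0, 120: 3.0}

numerator = sum(weights[u] * neighbourRatings[u] for u in weights)  # sum of w_i * y_ij
denominator = sum(weights.values())                                 # sum of w_i
predicted = numerator / denominator
print(f"Predicted rating: {predicted:.2f}")
```

Because both weights are almost equal, the prediction lands close to the plain average of 4.0 and 3.0, slightly pulled towards the rating of the more similar user.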

In [25]:
def predictScores(user, moviesdf = dfTask, commonMovies = 2, topUsers = 5):
    # compute correlations on the same dataframe that is used for the predictions
    userXCorrelations = CalculatetCorrelations(user, commonMovies, moviesdf)
    topNUsers = userXCorrelations[:topUsers]

    predictedScores = {}

    for movie in moviesdf.index:
        # predict only for movies the user has not rated yet
        if np.isnan(moviesdf.loc[movie, user]):
            predictedScore = 0
            sumCorr = 0

            for otherUser, correlation in topNUsers:
                otherUserRating = moviesdf.loc[movie, otherUser]
                if not np.isnan(otherUserRating):
                    predictedScore += otherUserRating * correlation
                    sumCorr += correlation

            # sumCorr stays 0 when none of the top users rated this movie
            # note: the "at least two users" rule from the task is not enforced here
            if sumCorr != 0:
                predictedScores[movie] = predictedScore / sumCorr

    return predictedScores
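
The function above keeps a prediction even if only one of the top users rated the movie. A possible variant that enforces the "at least two users" rule is sketched below. The name `predictWithMinRaters`, its parameters and the toy data are our own assumptions, and it expects positive weights so that the denominator cannot cancel out.

```python
import numpy as np
import pandas as pd

def predictWithMinRaters(moviesdf, user, neighbours, minRaters=2):
    """neighbours: list of (userId, correlation) pairs, e.g. a top-N list."""
    weights = pd.Series(dict(neighbours))
    unseen = moviesdf.index[moviesdf[user].isna()]           # movies to predict
    ratings = moviesdf.loc[unseen, list(weights.index)]
    ratings = ratings[ratings.notna().sum(axis=1) >= minRaters]
    rated = ratings.notna().astype(float)                     # 1.0 where a neighbour rated
    numerator = (ratings.fillna(0) * weights).sum(axis=1)     # sum of w_i * y_ij
    denominator = (rated * weights).sum(axis=1)               # sum of w_i over raters only
    return (numerator / denominator).to_dict()

# Hypothetical data: user 1 has not seen movies 20 and 30.
toy = pd.DataFrame(
    {1: [5.0, np.nan, np.nan], 2: [4.0, 4.0, 3.0], 3: [np.nan, 2.0, np.nan]},
    index=[10, 20, 30],
)
result = predictWithMinRaters(toy, user=1, neighbours=[(2, 0.9), (3, 0.6)])
print(result)
```

Movie 30 was rated by only one neighbour, so it is left out; movie 20 gets the weighted average of both neighbours' ratings.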

In [26]:
ratings = predictScores(610)
ratings


{29: 3.5,
 44: 2.5,
 107: 5.0,
 193: 3.0,
 327: 3.3066304667940374,
 362: 2.5,
 372: 1.0,
 468: 1.5,
 524: 2.0,
 610: 1.0,
 616: 1.5,
 720: 3.0,
 748: 3.0,
 837: 3.744747806435169,
 1015: 4.0,
 1021: 3.0,
 1093: 3.0,
 1186: 1.0,
 1188: 1.0,
 1223: 4.0,
 1231: 1.5,
 1267: 5.0,
 1272: 0.5,
 1321: 4.0,
 1345: 3.761598988245987,
 1347: 1.5,
 1367: 3.262646961504371,
 1373: 1.0,
 1409: 2.5,
 1438: 4.0,
 1479: 5.0,
 1619: 4.0,
 1644: 0.5,
 1779: 3.0,
 1801: 3.0,
 1876: 4.5,
 1909: 4.0,
 1911: 3.0,
 1960: 3.0,
 2020: 1.0,
 2085: 3.5,
 2087: 4.5,
 2109: 3.5,
 2114: 3.0,
 2139: 4.0,
 2145: 2.0,
 2289: 5.0,
 2300: 2.5,
 2398: 4.0,
 2427: 4.0,
 2428: 4.0,
 2454: 4.0,
 2490: 5.0,
 2501: 4.0,
 2572: 4.246848683861101,
 2580: 4.5,
 2605: 3.8079949412299356,
 2641: 3.0,
 2664: 3.5,
 2671: 4.0,
 2723: 3.2847969647379616,
 2763: 4.0,
 2826: 4.500000000000001,
 2872: 2.5,
 2881: 0.5,
 2908: 2.0,
 2948: 3.0,
 3082: 4.5,
 3107: 0.5,
 3168: 4.5,
 3301: 4.0,
 3363: 2.2152030352620384,
 3386: 0.5,
 3510: 1.0

### <b> Task 5
How to check the quality of our recommendations? 

We have to remove a few scores from the dataset and then compare predictions with the real ones.
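
A minimal sketch of such a hold-out check is shown below. The helper `holdOutMae`, the stand-in predictor `meanOfOthers` and the toy data are all hypothetical; any function that takes a ratings dataframe and a user and returns a `{movieId: predictedScore}` dictionary could be plugged in instead.

```python
import numpy as np
import pandas as pd

def holdOutMae(moviesdf, user, predictFn, nHidden=2, seed=0):
    """Hide nHidden known ratings of `user`, predict them, return (MAE, number compared)."""
    rng = np.random.default_rng(seed)
    rated = moviesdf.index[moviesdf[user].notna()].to_numpy()
    hidden = rng.choice(rated, size=min(nHidden, len(rated)), replace=False)
    masked = moviesdf.copy()
    masked.loc[hidden, user] = np.nan          # pretend the user never rated these movies
    predictions = predictFn(masked, user)
    errors = [abs(predictions[m] - moviesdf.loc[m, user]) for m in hidden if m in predictions]
    mae = float(np.mean(errors)) if errors else float("nan")
    return mae, len(errors)

def meanOfOthers(df, user):
    """Stand-in predictor: average rating given by all other users."""
    others = df.drop(columns=user).mean(axis=1)
    return others[df[user].isna()].dropna().to_dict()

# Hypothetical ratings: rows are movies, columns are users.
toy = pd.DataFrame(
    {1: [5.0, 4.0, 2.0, 3.0], 2: [4.0, 4.0, 1.0, 3.0], 3: [5.0, 3.0, 2.0, np.nan]},
    index=[10, 20, 30, 40],
)
mae, compared = holdOutMae(toy, user=1, predictFn=meanOfOthers)
print(mae, compared)
```

The second return value matters as much as the error itself: a predictor that covers only a few of the hidden movies can look accurate while being of little use.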

Try to improve the system, you can use the following ideas:
 - Can we use more users (e.g. with negative correlation)?
 - Which difference is more important: predicting 5 when the real score is 4, or predicting 3 instead of 2?
 - Did we use the best value for the minimal number of common movies?
 - Is prediction for a movie seen by just one user trustworthy?
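
Regarding the first idea, one commonly used way to include negatively correlated users is to work with deviations from each user's mean rating and to divide by the sum of absolute weights. The sketch below illustrates that variant on invented numbers; it is not the method used in this notebook, and the function name `predictCentered` is our own.

```python
def predictCentered(userMean, neighbours):
    """neighbours: list of (weight, neighbourRating, neighbourMean) triples."""
    # a negative weight flips the neighbour's deviation from their own mean
    numerator = sum(w * (rating - mean) for w, rating, mean in neighbours)
    denominator = sum(abs(w) for w, _, _ in neighbours)
    if denominator == 0:
        return userMean  # no usable neighbours: fall back to the user's mean
    return userMean + numerator / denominator

# Hypothetical neighbours: one similar user who liked the movie more than usual,
# one opposite user who liked it less than usual.
predicted = predictCentered(3.5, [(0.9, 5.0, 4.0), (-0.8, 1.0, 3.0)])
print(f"Predicted rating: {predicted:.2f}")
```

Both neighbours push the prediction upwards here: the similar user rated the movie above their average, and the opposite user rated it below theirs. In practice the result may need to be clipped to the rating scale.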
 
 
Describe your approach, its strengths and weaknesses, and analyze the results. Send the report (notebook with comments/markdown) within 144 hours after the class to gmiebs@cs.put.poznan.pl, start the subject with [IR]

Credits to F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872 and Mateusz Lango