# Recommendation System Framework

This projects aims to provide a framework like API in order to make high quality recommendations through various methods and also provide metrics in order to measure the results in detail.

The example usage of the framework is as follows:

1. Choose a data set.
2. Choose a Similarity Measure
3. Provide Hyperparameters
4. Choose performance metrics
5. Get the results

## Current Framework Features

### Similarity Measures

* Pearson Correlation         (Linear Similarity)
* Mutual Information          (Non-Linear Similarity)
* Timebin-Based Neighbourhood

### Data Sets:

Detailed dataset analysis are provided as extra notebooks.

* Movielens 100k
* Movielens 1M
* Netflix Prize (This version not exactly)

### Performance Metrics:

* Accuracy
* Balanced Accuracy
* Informedness
* Markedness
* F1
* MCC
* Precision
* Recall
* Specificity
* NPV

## Features In Progress

* Refactor The Framework

* Implement Timebin-Based Neighbourhood With Deep Learning

* Add Deep Learning Based Methods

## Index

The following shortucts can be used to navigate to the related code snippets.

### Helper Classes

* [Time Constraint](#TimeConstraint)

* [Accuracy](#Accuracy)

* [Dataset](#Dataset)


* [Dataset Operator](#DatasetOperator)
  
  * [Dataset User Operator](#DatasetUserOperator)

  * [Dataset Movie Operator](#DatasetMovieOperator)


* [Similarity Measure](#SimilarityMeasure)
    
  * [Pearson Similarity](#PearsonSimilarity)
  

* [Predict](#Predict)

  * [Pearson Predict](#PearsonPredict)
  

* [In Progress](#InProgress)

  * [Timebin Similarity](#TimebinSimilarity)

  * [Timebin Predict](#TimebinPredict)

  * [Analize Timebin Similarity](#AnalizeTimebinSimilarity)


### Similarity Measure Implementations

### Testing The Framework

##### Imports

In [1]:
from abc import ABC, abstractmethod
from datetime import datetime
from datetime import timedelta
from collections import defaultdict
from timeit import default_timer
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os
import math
import random
from random import sample

##### TimeConstraint

In [2]:
class TimeConstraint:
    """
    TimeConstraint is a constraint on the timestamp of the movie ratings.
    We classify a TimeConstraint as either max_time_constraint or time_bin_constraint.
    max_time_constraint is used to simulate real life in which we do not know the future but all the data up until one point in time.
    time_bin_constraint is used to grab a portion of a time interval where starting and ending points are strictly defined and data is well known.
    """
    
    def __init__(self, end_dt, start_dt=None):
        """
        When end_dt is only given, system will have a max time constraint only.
        When end_dt and start_dt are given, system will have a time_bin_constraint.
        
        :param end_dt: The ending time boundary.
        :param start_dt: The starting time boundary.
            Always set start_dt to None if you change the object from time_bin to max_limit.
        """
        self.end_dt = end_dt
        self.start_dt = start_dt

    def is_valid_time_bin(self) -> bool:
        """
        Check whether this TimeConstraint object represents a valid time bin.
        """
        if self.is_time_bin() and (self._end_dt > self._start_dt):
            return True
        return False

    def is_valid_max_limit(self) -> bool:
        """
        Check whether this TimeConstraint represents a valid max time limit.
        """
        if (self._end_dt is not None) and (self._start_dt is None):
            return True

    def is_time_bin(self) -> bool:
        if (self._start_dt is not None) and (self._end_dt is not None):
            return True
        return False

    # Comparing TimeConstraints

    def __eq__(self, other):
        if other is None:
            return False
        return self._start_dt == other.start_dt and self._end_dt == other.end_dt

    def __ne__(self, other):
        if other is None:
            return False
        return self._start_dt != other.start_dt or self._end_dt != other.end_dt

    # Properties

    @property
    def end_dt(self):
        return self._end_dt

    @end_dt.setter
    def end_dt(self, value):
        self._end_dt = value

    @property
    def start_dt(self):
        return self._start_dt

    @start_dt.setter
    def start_dt(self, value):
        self._start_dt = value

    # Printing TimeConstraints

    def __repr__(self):
        return f"(start = {self._start_dt}, end= {self._end_dt})"

    def __str__(self):
        return f"(start = {self._start_dt}, end= {self._end_dt})"


##### Accuracy

Accuracy currently only supports ratings in between 0.5 and 5 with 0.5 increments.

* Add min_rating - max_rating and increment parameters to accuracy class

In [3]:
class Accuracy:
    """
    Accuracy class provides deffirent metrics in order to measure accuracy of our analysis.
    
    Supported Measures:
    rmse, accuracy, balanced accuracy, informedness, markedness, 
    f1, mcc, precision, recall, specificity, NPV and 
    other threshold measures where we round ratings less than 3.5 to min rating, upper to max rating and use supported measures on this data.
    """
    
    @staticmethod
    def rmse(predictions) -> float:
        """
        Calculate Root Mean Square Error of given list or Dataframe of (prediction, actual) data.
        
        In case rmse value is found 0, it is returned as 0.001 to differentiate between successfull rmse
        calculation and erroneous calculations where no prediction data is provided.
        """
        
        # In case dataframe of predictions wher each row[0]=prediction, row[1]=actual rating
        if type(predictions) is pd.DataFrame:
            number_of_predictions = 0
            sum_of_square_differences = 0.0
            for row in predictions.itertuples(index=False):
                prediction = row[0]
                # In case valid prediction is made(0 is invalid, minimum 0.5 in movielens dataset)
                if prediction != 0:
                    # Round the ratings to the closest half or exact number
                    # since movielens dataset only containst ratings 0.5, 1, 1.5,..., 4, 4.5, 5
                    actual = Accuracy.half_round_rating(row[1])
                    prediction = Accuracy.half_round_rating(prediction)
                    
                    sum_of_square_differences += (actual - prediction) ** 2
                    number_of_predictions += 1
                
                if number_of_predictions == 0:
                    return 0 
                rmse_value = sum_of_square_differences / number_of_predictions
            return rmse_value if rmse_value != 0 else 0.001
        # In case list of predictions where each element is (prediction, actual)
        elif type(predictions) is list:
            number_of_predictions = 0
            sum_of_square_differences = 0.0
            for prediction, actual in predictions:
                if prediction != 0:                  # if the prediction is valid
                    actual = Accuracy.half_round_rating(actual)
                    prediction = Accuracy.half_round_rating(prediction)
                    
                    sum_of_square_differences += (actual - prediction) ** 2
                    number_of_predictions += 1
                
            if number_of_predictions == 0:
                return 0
        
            rmse_value = sum_of_square_differences / number_of_predictions 
            return rmse_value if rmse_value != 0 else 0.001    
        return 0
    
    @staticmethod
    def threshold_accuracy(predictions) -> float:
        """
        Threshold accuracy is the rate of sucessful prediction when we round 
        ratings between 0.5 and 3.5 to the lowest rating(0.5) ,
        ratings between 3.5 and 5 to the highest rating(5)
        
        Accuracy = (TP + TN) / (TP + TN + FP + FN)
        
        """
        
        if type(predictions) is pd.DataFrame:
            number_of_predictions = 0
            number_of_hit = 0
            for row in predictions.itertuples(index=False):
                # row[1] : actual rating, row[0] : prediction
                prediction = row[0]
                if prediction != 0:
                    actual = Accuracy.threshold_round_rating(row[1])
                    prediction = Accuracy.threshold_round_rating(prediction)
                    
                    if actual == prediction:
                        number_of_hit += 1
                    number_of_predictions += 1
            return number_of_hit / number_of_predictions if number_of_predictions != 0 else 0
        elif type(predictions) is list:            
            number_of_predictions = 0
            number_of_hit = 0
            for prediction, actual in predictions:
                if prediction != 0:
                    actual = Accuracy.threshold_round_rating(actual)
                    prediction = Accuracy.threshold_round_rating(prediction)
                    
                    if actual == prediction:
                        number_of_hit += 1
                        
                    number_of_predictions += 1
            return number_of_hit / number_of_predictions if number_of_predictions != 0 else 0
        return 0
    
    @staticmethod
    def threshold_analize(predictions):
        """
        Analize the threshold predictions with all metrics found in the Accuracy class.
        """
        
        TP, FN, FP, TN = Accuracy.threshold_confusion_matrix(predictions)
        precision = Accuracy.precision(TP, FP)     # also called PPV
        recall = Accuracy.recall(TP, FN)           # also called TPR
        specificity = Accuracy.specificity(FP, TN) # also called TNR
        NPV = Accuracy.negative_predictive_value(FN, TN)
        
        accuracy = Accuracy.accuracy(TP, FN, FP, TN)
        balanced_accuracy = Accuracy.balanced_accuracy(TPR=recall, TNR=specificity)
        informedness = Accuracy.informedness(TPR=recall, TNR=specificity)
        markedness = Accuracy.markedness(PPV=precision, NPV=NPV)
        
        f1 = Accuracy.f_measure(precision, recall)
        mcc = Accuracy.mcc(TP, FN, FP, TN)
        
                
        output = {
                  "accuracy"         :round(accuracy, 3),
                  "balanced_accuracy":round(balanced_accuracy, 3),
                  "informedness"     :round(informedness, 3),
                  "markedness"       :round(markedness, 3),
                  "f1"               :round(f1, 3),
                  "mcc"              :round(mcc, 3),
                  "precision"        :round(precision, 3),
                  "recall"           :round(recall, 3),
                  "specificity"      :round(specificity, 3),
                  "NPV"              :round(NPV, 3)
                 }
        
        return output
    
    @staticmethod
    def analize(predictions):
        """
        Analize the threshold predictions with all metrics found in the Accuracy class.
        
        https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
        
        Returns analysis for each class as list
        :return: accuracy, balanced_accuracy, informedness, markedness, f1, mcc, precision, recall, specificity, NPV
        """
        confusion_mtr = Accuracy.confusion_matrix(predictions)
        
        # Use macro averaging (https://stats.stackexchange.com/questions/187768/matthews-correlation-coefficient-with-multi-class)
        precision = [0] * 10   # 10 is the number of classes found 
        recall = [0] * 10      # 0.5 -> Class 0 , 1 -> Class 1, 1.5 -> Class 2 .... 
        specificity = [0] * 10
        NPV = [0] * 10 
        
        accuracy = [0] * 10 
        balanced_accuracy = [0] * 10 
        informedness = [0] * 10 
        markedness = [0] * 10 
        
        f1 = [0] * 10 
        mcc = [0] * 10
        
        for i in range(0, 10): # For Each Class
            TP, FN, FP, TN = Accuracy.confusion_matrix_one_against_all(confusion_mtr, i)
            precision[i] = Accuracy.precision(TP, FP)     # also called PPV
            recall[i] = Accuracy.recall(TP, FN)           # also called TPR
            specificity[i] = Accuracy.specificity(FP, TN) # also called TNR
            NPV[i] = Accuracy.negative_predictive_value(FN, TN)

            accuracy[i] = Accuracy.accuracy(TP, FN, FP, TN)
            balanced_accuracy[i] = Accuracy.balanced_accuracy(TPR=recall[i], TNR=specificity[i])
            informedness[i] = Accuracy.informedness(TPR=recall[i], TNR=specificity[i])
            markedness[i] = Accuracy.markedness(PPV=precision[i], NPV=NPV[i])

            f1[i] = Accuracy.f_measure(precision[i], recall[i])
            mcc[i] = Accuracy.mcc(TP, FN, FP, TN)
        
        output = {
                  "accuracy"         :Accuracy.round_list_elements(accuracy, 3),
                  "balanced_accuracy":Accuracy.round_list_elements(balanced_accuracy, 3),
                  "informedness"     :Accuracy.round_list_elements(informedness, 3),
                  "markedness"       :Accuracy.round_list_elements(markedness, 3),
                  "f1"               :Accuracy.round_list_elements(f1, 3),
                  "mcc"              :Accuracy.round_list_elements(mcc, 3),
                  "precision"        :Accuracy.round_list_elements(precision, 3),
                  "recall"           :Accuracy.round_list_elements(recall, 3),
                  "specificity"      :Accuracy.round_list_elements(specificity, 3),
                  "NPV"              :Accuracy.round_list_elements(NPV, 3)
                 }
        
        return output
    
    @staticmethod
    def round_list_elements(l, precision):
        """
        :param l: list of floats
        :param precision: precision after dot
        """
        return [ round(x, precision) for x in l ]
    
    @staticmethod
    def accuracy_multi_class(confusion_mtr):
        length = len(confusion_mtr)
        numenator = 0
        denuminator = 0 
        for i in range(length):
            for j in range(length):
                temp = confusion_mtr[i][j]
                denuminator += temp
                if i == j:
                    numenator += temp
        return numenator / denominator
    
    @staticmethod
    def accuracy(TP, FN, FP, TN):
        return  (TP + TN) / (TP + FN + FP + TN)
    
    @staticmethod
    def balanced_accuracy(TPR, TNR):
        """
        :param TPR : True Positive Rate or recall or sensitivity
        :param TNR : True Negative Rate or specificity or  selectivity
        """
        return (TPR + TNR) / 2
    
    @staticmethod
    def informedness(TPR, TNR):
        """
        :param TPR : True Positive Rate or recall or sensitivity
        :param TNR : True Negative Rate or specificity or  selectivity
        """
        return TPR + TNR - 1
    
    @staticmethod
    def markedness(PPV, NPV):
        """
        :param PPV: Positive Predictive Value also known as precision
        :param NPV: Negative Predictive Value
        """
        return PPV + NPV - 1 
    
    @staticmethod
    def precision(TP, FP):
        """
        Also called as precision or positive predictive value (PPV)
        
        Precision = TP / (TP + FP) for binary class
        Precision = TP / (All Predicted Positive) for multi class
        """
        denuminator = TP + FP
        return TP / denuminator if denuminator != 0 else 0

    @staticmethod
    def negative_predictive_value(FN, TN):
        """     
        NPV = TN / (TN + FN) for binary class
        NPV = TN / (All Predicted Negative) for multi class
        """
        denuminator = TN + FN
        return TN / denuminator if denuminator != 0 else 0    
    
    @staticmethod
    def recall(TP, FN):
        """
        Also called as sensitivity, recall, hitrate, or true positive rate(TPR)
        Recall = TP / (TP + FN) for binary class
        Recall = TP / (All Actual Positive) for multi class
        """
        denuminator = TP + FN
        return TP / denuminator if denuminator != 0 else 0    
    
    @staticmethod
    def specificity(FP, TN):
        """
        Also called as specificity, selectivity or true negative rate (TNR)
        specificity = TN / (TP + FN) for binary class
        specificity = TN / (All Actual Negative) for multi class
        """
        denuminator = FP + TN 
        return TN / denuminator if denuminator != 0 else 0 
    
    @staticmethod
    def f_measure(precision, recall):
        """
        F-Measure is the harmonic mean of the precision and recall.
        """
        sum_of_both = precision + recall
        return (2 * precision * recall) / sum_of_both if sum_of_both != 0 else 0
    
    @staticmethod
    def mcc(TP, FN, FP, TN):
        """
        MCC(Matthews Correlation Coefficient)
        """
        # Calulate Matthews Correlation Coefficient
        numenator   = (TP * TN) - (FP * FN) 
        denominator = (TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)
        denominator = math.sqrt(denominator) if denominator > 0 else 0
        return numenator / denominator if denominator != 0 else 0
  
    @staticmethod
    def confusion_matrix_one_against_all(confusion_mtr, class_i):
        """
        Create binary confusion matrix out of multi-class confusion matrix
        
        Positive Class: class_i
        Negative Class: non class_i
                
        TP: True Positive   FN: False Negative
        FP: False Positive  TN: True Negative
        
        "TP of Class_1" is all Class_1 instances that are classified as Class_1.
        "TN of Class_1" is all non-Class_1 instances that are not classified as Class_1.
        "FP of Class_1" is all non-Class_1 instances that are classified as Class_1.
        "FN of Class_1" is all Class_1 instances that are not classified as Class_1.
        # https://www.researchgate.net/post/How_do_you_measure_specificity_and_sensitivity_in_a_multiple_class_classification_problem
        
        --> Input matrix
                 | 0 Prediction | 1 Prediction | 2 Prediction | .....
        0 Class  |     T0       |     ..       |      ..      |
        1 Class  |     ..       |     T1       |      ..      | 
        2 Class  |     ..       |     ..       |      T2      |
        
        --> Output matrix
        
                        | Positive Prediction | Negative Prediction
        Positive Class  |       TP            |       FN
        Negative Class  |       FP            |       TN
        
        :param confusion_mtr: 10 class confusion matrix designed for movielens
        :param class_i: index of the class we are interested in(0-9)
        :return: TP, FN, FP, TN
        """
        length = len(confusion_mtr)

        TP = confusion_mtr[class_i][class_i] 
        
        actual_class_i_count = 0 
        for i in range(length):  # sum of the row
            actual_class_i_count += confusion_mtr[class_i][i]
        FN = actual_class_i_count - TP
        
        predicted_class_i_count = 0
        for i in range(length): # sum of the column
            predicted_class_i_count += confusion_mtr[i][class_i]
        FP = predicted_class_i_count - TP
        
        # sum of matrix
        sum_of_matrix = np.sum(confusion_mtr)
        # TN is found by summing up all values except the row and column of the class 
        TN = sum_of_matrix - predicted_class_i_count - actual_class_i_count - TP 
        
        return TP, FN, FP, TN
        
    @staticmethod
    def confusion_matrix(predictions):
        """
        Create confusion matrix and then return TP, FN, FP, TN

        0 Class: 0.5
        1 Class: 1
        2 Class: 1.5
        3 Class: 2
        4 Class: 2.5
        5 Class: 3
        6 Class: 3.5
        7 Class: 4
        8 Class: 4.5
        9 Class: 5

        T0: True 0
        F0: False 0
        T1: True 1
        F1: False 1
        ...

                 | 0 Prediction | 1 Prediction | 2 Prediction | .....
        0 Class  |     T0       |     ..       |      ..      |
        1 Class  |     ..       |     T1       |      ..      | 
        2 Class  |     ..       |     ..       |      T2      |
        ...
        """
        # Create multiclass confusion matrix

        conf_mtr = np.zeros( (10,10) )

        for prediction in predictions:
            predicted = Accuracy.half_round_rating(prediction[0])
            actual    = Accuracy.half_round_rating(prediction[1])

            predicted_class_index = int( (predicted * 2) - 1 )
            actual_class_index = int( (actual * 2) - 1 )

            conf_mtr[actual_class_index][predicted_class_index] += 1

        return conf_mtr
    
    @staticmethod
    def threshold_confusion_matrix(predictions):
        """
        Create confusion matrix and then return TP, FN, FP, TN

        Positive Class: 5
        Negative Class: 0.5

        TP: True Positive
        TN: True Negative
        FP: False Positive
        FN: False Negative

                        | Positive Prediction | Negative Prediction
        Positive Class  |       TP            |       FN
        Negative Class  |       FP            |       TN

        """
        # Create confusion matrix
        TP = 0
        TN = 0
        FP = 0
        FN = 0
        for prediction in predictions:
            predicted = Accuracy.threshold_round_rating(prediction[0])
            actual = Accuracy.threshold_round_rating(prediction[1])
            if predicted == 5 and actual == 5:
                TP += 1
            elif predicted == 5 and actual == 0.5:
                FP += 1
            elif predicted == 0.5 and actual == 0.5:
                TN += 1
            elif predicted == 0.5 and actual == 5:
                FN += 1
        return TP, FN, FP, TN

    @staticmethod
    def half_round_rating(rating):
        """
        Round ratings to the closest match in the movielens dataset
        For ex.
          ratings between 2 and 2.25 -> round to 2
          ratings between 2.25 and 2.5 -> round to 2.5
          ratings between 2.5 and 2.75 -> round to 2.5
          ratings between 2.75 and 3 -> round to 3

        """
        floor_value = math.floor(rating)
        if(rating > floor_value + 0.75):
            return floor_value + 1
        elif(rating > floor_value + 0.5 or rating > floor_value + 0.25):
            return floor_value + 0.5
        else:
            return floor_value
    
    @staticmethod
    def threshold_round_rating(rating):
        """
        Round ratings to the closest match in threshold fashion
          ratings between 0.5 and 3.5 -> round to 0.5
          ratings between 3.5 and 5 -> round to 5
        """
        if (0.5 <= rating < 3.5):
            return 0.5
        elif (3.5 <= rating <= 5):
            return 5
        else:
            return 0

##### Dataset

In [4]:
class Dataset(ABC):
    """
    Dataset class and its subclasses provides utilities in order to import datasets.
    
    """
    @staticmethod
    @abstractmethod
    def load():
        """ Every subclass must provide static load method"""
        pass


class MovieLensDataset(Dataset):
    def __init__(self,
                 ratings_col_names=('user_id', 'item_id', 'rating', 'timestamp'),
                 ratings_path=r'C:\Users\Yukawa\datasets\ml-latest-small\ratings.csv',
                 movies_col_names=('item_id', 'title', 'genres'),
                 movies_path=r'C:\Users\Yukawa\datasets\ml-latest-small\movies.csv',
                 is_ratings_cached=True,
                 is_movies_cached=True):
        Dataset.__init__(self)
        self.is_ratings_cached = is_ratings_cached
        self.is_movies_cached = is_movies_cached
        self.ratings = MovieLensDataset.load_ratings(ratings_path,
                                                     ratings_col_names) if self.is_ratings_cached else None
        self.movies = MovieLensDataset.load_movies(movies_path,
                                                   movies_col_names) if self.is_movies_cached else None

    @staticmethod
    def load_movies(movies_path,
                    movies_col_names=('item_id', 'title', 'genres')):
        if not os.path.isfile(movies_path) or not movies_col_names:
            return None

        # read movies
        movies = pd.read_csv(movies_path, sep=',', header=1, names=movies_col_names)

        # Extract Movie Year
        movies['year'] = movies.title.str.extract("\((\d{4})\)", expand=True)
        movies.year = pd.to_datetime(movies.year, format='%Y')
        movies.year = movies.year.dt.year  # As there are some NaN years, resulting type will be float (decimals)

        # Remove year part from the title
        movies.title = movies.title.str[:-7]

        return movies

    @staticmethod
    def load_ratings(ratings_path,
                     ratings_col_names=('user_id', 'item_id', 'rating', 'timestamp')):
        if not os.path.isfile(ratings_path) or not ratings_col_names:
            return None

        # read ratings
        ratings = pd.read_csv(ratings_path, sep=',', header=1, names=ratings_col_names)

        # Convert timestamp into readable format
        ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s', origin='unix')

        return ratings

    @staticmethod
    def create_movie_ratings(ratings, movies):
        return pd.merge(ratings, movies, on='item_id')

    @staticmethod
    def load(ratings_col_names=('user_id', 'item_id', 'rating', 'timestamp'),
             ratings_path=r'C:\Users\Yukawa\datasets\ml-latest-small\ratings.csv',
             movies_col_names=('item_id', 'title', 'genres'),
             movies_path=r'C:\Users\Yukawa\datasets\ml-latest-small\movies.csv'
             ):
        # Load movies
        movies = MovieLensDataset.load_movies(movies_path=movies_path, movies_col_names=movies_col_names)
        # Load ratings
        ratings = MovieLensDataset.load_ratings(ratings_path=ratings_path, ratings_col_names=ratings_col_names)

        # Merge the ratings and movies
        movie_ratings = pd.merge(ratings, movies, on='item_id')

        return movie_ratings

##### DatasetOperator

In [5]:
class DatasetOperator:
    @staticmethod
    def apply_time_constraint(movie_ratings : pd.DataFrame, time_constraint : TimeConstraint = None) -> pd.DataFrame:
        """
        movie_ratings can be ratings or movie_ratings.
        """
        if time_constraint is None:
            return movie_ratings

        if time_constraint.is_valid_max_limit():
            return movie_ratings.loc[movie_ratings.timestamp < time_constraint.end_dt]
        
        if time_constraint.is_valid_time_bin():
            return movie_ratings.loc[(movie_ratings.timestamp >= time_constraint.start_dt) & (movie_ratings.timestamp < time_constraint.end_dt)]
        
        return movie_ratings

##### DatasetUserOperator

In [6]:
class DatasetUserOperator:
    """
    Provides utility methods to get user data out of datasets.
    """
    
    @staticmethod
    def get_users(ratings : pd.DataFrame):
        """
        Get list of unique 'user_id's

        :return: the ids of the users found in movie_ratings
        """
        return pd.unique(ratings['user_id'])
    
    @staticmethod
    def get_active_users(ratings : pd.DataFrame, n : int=10) -> pd.DataFrame:
        """
        Get Users in sorted order where the first one is the one who has given most ratings.

        :param n: Number of users to retrieve.
        :return: user DataFrame with index of 'user_id' and columns of ['mean_rating', 'ratings_count'] .
        """
        active_users = pd.DataFrame(ratings.groupby('user_id')['rating'].mean())
        active_users['ratings_count'] = pd.DataFrame(ratings.groupby('user_id')['rating'].count())
        active_users.sort_values(by=['ratings_count'], ascending=False, inplace=True)
        active_users.columns = ['mean_rating', 'ratings_count']
        return active_users.head(n)
    
    @staticmethod
    def get_random_users(ratings : pd.DataFrame, n : int=1):
        """
        Get list of random n number of 'user_id's

        :param n: Number of random users
        :return: List of random 'user_id's
        """
        return random.choices(population=DatasetUserOperator.get_users(ratings), k=n)
    
    @staticmethod
    def get_user_ratings(ratings : pd.DataFrame, user_id : int) -> pd.DataFrame:
        """
        Get all the ratings given by of the chosen users

        :param user_id: id of the chosen user
        :return: Ratings given by the 'user_id'
        """
        return ratings.loc[ratings['user_id'] == user_id]
    
    @staticmethod
    def get_user_avg(ratings : pd.DataFrame, user_id : int):
        user_ratings = DatasetUserOperator.get_user_ratings(ratings, user_id)
        return user_ratings.rating.mean() if not user_ratings.empty else 0
    
    @staticmethod
    def get_timestamp(ratings : pd.DataFrame, user_id: int, movie_id: int):
        """
        Get the timestamp of the given rating

        :param user_id: the users whose rating timestamp we are searching
        :param movie_id: id of the movie that the user gave the rating
        :return: if found the datetime object otherwise None
        """
        timestamp = ratings.loc[(ratings['user_id'] == user_id) & (ratings['item_id'] == movie_id)]
        return timestamp.values[0, 3] if not timestamp.empty else None
    
    @staticmethod
    def get_first_timestamp(ratings : pd.DataFrame):
        return ratings['timestamp'].min()
    
    @staticmethod
    def get_user_avg_timestamp(ratings : pd.DataFrame, user_id: int):
        user_ratings = DatasetUserOperator.get_user_ratings(ratings, user_id)
        return user_ratings.timestamp.mean() if not user_ratings.empty else 0
    
    @staticmethod
    def get_user_ratings_at(ratings : pd.DataFrame, user_id: int, at: datetime) -> pd.DataFrame:
        """
        Get user ratings up until the given datetime
        :param user_id: id of the chosen user
        :param at: only those ratings that are before this date will be taken into account
        :return: Ratings given by the 'user_id' before given datetime
        """
        return ratings.loc[(ratings['user_id'] == user_id) & (ratings.timestamp < at)]

    @staticmethod
    def get_user_avg_at(ratings : pd.DataFrame, user_id: int, at: datetime):
        user_ratings = DatasetUserOperator.get_user_ratings_at(ratings, user_id, at)
        return user_ratings.rating.mean() if not user_ratings.empty else 0

##### DatasetMovieOperator

In [7]:
class DatasetMovieOperator:
    """
    Provides utility methods to get user data out of datasets.
    """
    
    @staticmethod
    def get_movie(movies : pd.DataFrame, movie_id : int) -> pd.DataFrame:
        """
        Get Movie Record

        :return: DataFrame which contains the given 'movie_id's details. If not found empty DataFrame .
        """
        return movies.loc[movies['item_id'] == movie_id]
    
    @staticmethod
    def get_movies(movies : pd.DataFrame) -> list:
        """
        Get list of unique movies.

        :return: List of movie ids
        """
        return movies['item_id'].values.tolist()
    
    @staticmethod
    def get_random_movies(movies, n=10):
        """
        Get list of random n number of 'item_id's or in other words the movies

        :param n: Number of random movies
        :return: List of random 'movie_id's
        """
        return random.choices(population=DatasetMovieOperator.get_movies(movies), k=n)
    
    @staticmethod
    def get_movies_watched(movie_ratings : pd.DataFrame, 
                           user_id: int, 
                           time_constraint: TimeConstraint = None) -> pd.DataFrame:
        """
        Get all the movies watched by the chosen user.

        :param user_id: the user that we want to get the movies he-she has watched.
        :param time_constraint: type of the time constraint.
        :return: DataFrame of all movies watched with 'item_id', 'rating' columns
        """
        filtered_movie_ratings = DatasetOperator.apply_time_constraint(movie_ratings, time_constraint)
        return filtered_movie_ratings.loc[filtered_movie_ratings['user_id'] == user_id]
        
    
    @staticmethod
    def get_movie_rating(ratings : pd.DataFrame, movie_id: int, user_id: int) -> int:
        """
        Get the movie rating taken by the chosen user

        :param movie_id: the movie chosen movie's id
        :param user_id: id of the chosen user
        :return: Rating given by user. If not found, returns 0
        """
        movie_rating = ratings.loc[(ratings['user_id'] == user_id) & (ratings['item_id'] == movie_id)]
        return movie_rating.values[0, 2] if not movie_rating.empty else 0

    @staticmethod
    def get_random_movie_watched(movie_ratings : pd.DataFrame, user_id: int) -> int:
        """
        Get random movie id watched.

        :param user_id: User of interest
        :return:  movie_id or item_id of the random movie watched by the user.
                  In case non-valid user_id supplied then returns 0
        """
        movies_watched = DatasetMovieOperator.get_movies_watched(movie_ratings, user_id)[['item_id', 'rating']]
        return random.choice(movies_watched['item_id'].values.tolist()) if not movies_watched.empty else 0

    @staticmethod
    def get_random_movies_watched(movie_ratings : pd.DataFrame, user_id: int, n=2) -> pd.DataFrame:
        """
        Get random n movies watched by the user. Only use when n > 2

        Use get_random_movie_watched if n=1 since that one 2 fold faster.

        :param user_id: the user of interest
        :param n: number of random movies to get
        :return: DataFrame of movies, if none found then empty DataFrame
        """
        movies_watched = DatasetMovieOperator.get_movies_watched(movie_ratings, user_id)[['item_id', 'rating']]
        return random.choices(population=movies_watched['item_id'].values.tolist(),
                              k=n) if not movies_watched.empty else movies_watched

    @staticmethod
    def get_random_movie_per_user(movie_ratings : pd.DataFrame, user_id_list : list) -> list:
        """
        Get random movie for each user given in the 'user_id_list'

        :param user_id_list: List of valid user_ids
        :return: List of (user_id, movie_id) tuples
                where each movie_id is randomly chosen from watched movies of the user_id .
                In case any one of the user_id's supplies invalid, then the movie_id will be 0 for that user.
        """
        user_movie_list = list()
        for user_id in user_id_list:
            user_movie_list.append((user_id, DatasetMovieOperator.get_random_movie_watched(movie_ratings, user_id)))
        return user_movie_list

#### SimilarityMeasure

##### PearsonSimilarity

In [8]:
class PearsonSimilarity:
    """
    Pearson Correlation is the classic way of giving recommendations. 
    """
    
    @staticmethod
    def get_knn(movie_ratings:pd.DataFrame, 
                user_id:int, k:int=10, 
                min_common_elements: int = 5,
                user_corr_matrix = None):
        """
        :param user_id: the user of interest
        :param k: number of neighbours to retrieve, None to get all
        :param time_constraint: time constraint when choosing neighbours
        :param user_corr_matrix: Provide correlation matrix if you want to optimize the process
        :return: Returns the k neighbours and correlations in between them. If no neighbours found, returns None
                 DataFrame which has 'Correlation' column and 'user_id' index.
        """
        
        if(user_corr_matrix is None):
            user_corr_matrix = PearsonSimilarity.create_user_corrs(movie_ratings, min_common_elements)
            # Exit if matrix is None after creation
            if user_corr_matrix is None:
                return None

        # Get the chosen 'user_id's correlations
        user_correlations = user_corr_matrix.get(user_id)
        if user_correlations is None:
            return None

        # Drop any null, if found
        user_correlations.dropna(inplace=True)

        # Create A DataFrame from not-null correlations of the 'user_id'
        users_alike = pd.DataFrame(user_correlations)
        
        # Rename the only column to 'correlation'
        users_alike.columns = ['correlation']

        # Sort the user correlations in descending order
        # so that first one is the most similar, last one least similar
        users_alike.sort_values(by='correlation', ascending=False, inplace=True)

        # Eliminate Correlation to itself by deleting first row,
        # since biggest corr is with itself it is in first row
        return users_alike.iloc[1:k+1] if k is not None else users_alike.iloc[1:]
        
    @staticmethod
    def create_user_corrs(movie_ratings, min_common_elements):
        user_movie_matrix = movie_ratings.pivot_table(index='title', columns='user_id', values='rating')
        return user_movie_matrix.corr(method="pearson", min_periods=min_common_elements)

#### Prediction

##### PearsonPredict

In [9]:
class PearsonPredict:
    """
    Only neighbourhood based methods are supported.
    """
    
    @staticmethod
    def predict_movie(movie_ratings, user_id, movie_id, 
                      time_constraint: TimeConstraint = None, k = 10, min_common_elements = 5):
        """
        Predict a movie of a user.
        
        For optimization, use predict function instead of this.
        """
        movie_ratings = DatasetOperator.apply_time_constraint(movie_ratings)
        knn = PearsonSimilarity.get_knn(movie_ratings, user_id, k, min_common_elements)
        return PearsonPredict.predict(movie_ratings, user_id, movie_id=movie_id, k_neighbours=knn)
    
    @staticmethod
    def predict(ratings, user_id, movie_id, k_neighbours: pd.DataFrame) -> float:
        """
        Predict the movie for given user using Pearson Correlation

        :param user_id: user of interest
        :param movie_id: the movie's rating is the one we we want to predict
        :param k_neighbours: k nearest neighbours in DataFrame where index user_id, column correlation in between.
        :return: Prediction rating
        """
        # If a movie with movie_id not exists, predict 0
        if DatasetMovieOperator.get_movie(ratings, movie_id).empty:
            return 0

        if k_neighbours is None or k_neighbours.empty:
            return 0
        
        user_avg_rating = DatasetUserOperator.get_user_avg(ratings, user_id)
        weighted_sum = 0.0
        sum_of_weights = 0.0
        for neighbour_id, data in k_neighbours.iterrows():
            # Get each neighbour's correlation 'user_id' and her rating to 'movie_id'
            neighbour_corr = data['correlation']
            neighbour_rating = DatasetMovieOperator.get_movie_rating(ratings, movie_id, neighbour_id)
            # If the neighbour doesnt give rating to the movie_id, pass this around of the loop
            if neighbour_rating == 0:
                continue
            neighbour_avg_rating = DatasetUserOperator.get_user_avg(ratings, neighbour_id)
            neighbour_mean_centered_rating = neighbour_rating - neighbour_avg_rating
            # Calculate Weighted sum and sum of weights
            weighted_sum += neighbour_mean_centered_rating * neighbour_corr
            sum_of_weights += neighbour_corr
            
        if sum_of_weights != 0:
            prediction = user_avg_rating + (weighted_sum / sum_of_weights)
        else:
            prediction = 0  # In this case, none of the neighbours have given rating to 'the movie'

        return prediction
    
    @staticmethod
    def predict_movies_watched(movie_ratings, user_id, n=10, k=10, time_constraint=None, min_common_elements = 5) -> pd.DataFrame:
        """
        :param user_id: user of interest
        :param n: Number of movies to predict
        :param k: k neighbours to take into account
        :param time_constraint: When calculating k neighbours,
                                only those that comply to time_constraints will be taken into account.
        :return: DataFrame of Predictions where columns = ['prediction', 'rating'] index = 'movie_id'
        """
        
        movie_ratings = DatasetOperator.apply_time_constraint(movie_ratings)
        
        # Get all movies watched by a user
        movies_watched = DatasetMovieOperator.get_movies_watched(movie_ratings, user_id)[['item_id', 'rating']]

        if movies_watched.empty:
            return None
        
        # Cache user correlations because of the repeated predictions
        user_corr_matrix = PearsonSimilarity.create_user_corrs(movie_ratings, min_common_elements)
        
        predictions = list()
        number_of_predictions = 0
        for row in movies_watched.itertuples(index=False):
            knn = PearsonSimilarity.get_knn(movie_ratings, user_id, k, min_common_elements, user_corr_matrix)
            prediction = PearsonPredict.predict(movie_ratings, user_id, movie_id=row[0], k_neighbours=knn)
            if number_of_predictions == n:
                break
            predictions.append([prediction, row[1], row[0]])
            number_of_predictions += 1

        predictions_df = pd.DataFrame(predictions, columns=['prediction', 'rating', 'movie_id'])
        predictions_df.movie_id = predictions_df.movie_id.astype(int)
        return predictions_df.set_index('movie_id')

## InProgress

In [10]:
class NeighbourhoodBasedSimilarity(ABC):
    """
    Similarity class. Its subclasses provides utilities in order to calculate similarities between users.
    """
    
    @abstractmethod
    def get_neighbours(user_id:int, movie_id:int) -> pd.DataFrame:
        """ 
        Calculate similarities between neighbours and returns each neighbour and its similarity. 
        
        :return: DataFrame of [rating, corr] 
        """
        pass

    @abstractmethod
    def get_weighted_neighbours(user_id:int, movie_id:int) -> pd.DataFrame:
        """ 
        Apply signifance weighting. 
        
        :return: DataFrame of [rating, corr] 
        """
        pass

In [11]:
class Predict(ABC):
    """
    Prediction class. Its subclassses provides utilities for making prediction.
    """
    pass

##### TimebinSimilarity

In [12]:
class TimebinSimilarity:
    """
    TimebinSimilarity is a way to provide personalized timebin based recommendations.
    Here we define an alternative to classical pearson by providing temporal timebin aspect.
    Here is how we do in summary:
      1. Choose a person to make prediction on (as always).
      2. Take the last s movie ratings of this person as a timebin from a random timepoint.(user must have at least s movie watched before!)
      3. Find neighbour timebins to the selected timebin.(using correlations between them, higher, better)
      4. Predict rating for the person using the timebin neighbours the same way as we do with k nearest neighbour.
    """
       
    @staticmethod
    def timebin_similarity(timebin_a:pd.DataFrame, timebin_a_user_avg_rating:float,
                           timebin_b:pd.DataFrame, timebin_b_user_avg_rating:float)->(float,int):
        """
        Find the correlation between the timebin and its given neighbour timebin
        
        :param merged: Merged version of timebin and neighbour timebin in this order.
        """
        
        # Filter unrelated movie ratings and only keep common movie ratings
        merged = timebin_b.merge(timebin_a, on='item_id')

        # Calculate Pearson Correlation in between the self.timebin and given neighbour
        numenator = ((merged['rating_x'] - timebin_a_user_avg_rating) * (merged['rating_y'] - timebin_b_user_avg_rating)).sum()
        denominator = math.sqrt(((merged['rating_x'] - timebin_a_user_avg_rating) ** 2).sum())
        denominator *= math.sqrt(((merged['rating_y'] - timebin_b_user_avg_rating) ** 2).sum())
        pearson = numenator / denominator if denominator != 0 else 0
        
        return pearson, len(merged)
    
    @staticmethod
    def get_timebin(ratings, user_id, timebin_i, timebin_size):
        """
        Generate the timebin by using given info.
        timebin_i is the index the first movie
        timebin_size is the number of movies to take starting at timebin_i th index.
        """
        all_movies_watched = DatasetMovieOperator.get_movies_watched(ratings, user_id)[['item_id', 'rating', 'timestamp']].set_index('item_id')
        return all_movies_watched.iloc[timebin_i:timebin_i+timebin_size]
    
    @staticmethod
    def get_neighbour_timebins(ratings, neighbour_id:int, movie_id:int,
                               neighbour_min_s, neighbour_max_s, neighbour_step_size)->list:
        """
        Given a neighbour_id, take the all the timebins of this neighbour which includes the target movie.
        
        :param neighbour_min_s: Minimum number of movies to include inside neighbour timebins
        :param neighbour_max_s: Maximum number of movies to include inside neighbour timebins
        :param neighbour_step_size: Number of movies to extend per step the timebin size
        """
        # Start by taking all movies watched by the neighbour
        all_movies_watched = DatasetMovieOperator.get_movies_watched(ratings, neighbour_id)[['item_id', 'rating', 'timestamp']].set_index('item_id')
        n_movies = len(all_movies_watched)
        neighbour_timebins = list()

        # For each timebin_size
        for timebin_size in range(neighbour_min_s, neighbour_max_s, neighbour_step_size):
            # Traverse the all movies by taking 'timebin_size' piece of ratings per loop
            for i in range(0, n_movies, timebin_size):
                timebin = all_movies_watched.iloc[i:i+timebin_size]
                # Take only timebins which includes this movie
                if not (movie_id in timebin.index):
                    continue
                # Insert the timebin, its index i, and its size timebin_size into the list as a tuple
                neighbour_timebins.append(  (timebin, i, timebin_size)  )
        return neighbour_timebins

    @staticmethod
    def get_possible_neighbour_list(ratings, user_id:int, timebin_i, timebin_size, movie_id, min_common_neighbour_ratings):
        """
        Get possible neighbour Ids as list.
        
        Possible neighbours have watched'min_common_neighbour_ratings' number movies in common with the target timebin.
        Possible neighbours also have to watch the target movie('movie_id') that we want to make prediction on.
        """
        
        # Recreate the user timebin using its identifiers
        timebin = TimebinSimilarity.get_timebin(ratings, user_id, timebin_i, timebin_size)
        timebin.drop(timebin.tail(1).index, inplace=True)
       
        # :::: Change filter on ratings.loc by adding time constraint in case we filter future neighbour timebins::::
        
        # Count number of common ratings with other users
        userlist = [0 for i in range(611)]
        for movie_id in timebin.index.values.tolist():
            users_who_watched = ratings.loc[(ratings['item_id'] == movie_id)][['user_id']].values.tolist()
            for user_who_watched in users_who_watched:
                userlist[user_who_watched[0]] += 1

        # save as neighbour, if common rating count greater than min_common_neighbour_ratings and given rating to the movie index at i
        neighbour_id_list = []
        for j in range(0, 611):
            if( (userlist[j] > min_common_neighbour_ratings) and 
               (DatasetMovieOperator.get_movie_rating(ratings, movie_id, j) != 0) and 
               (j != user_id) ):
                neighbour_id_list.append(j)

        return neighbour_id_list
    
    @staticmethod
    def get_knn(ratings, user_id, k, timebin_i, timebin_size,
                         neighbour_min_s, neighbour_max_s, neighbour_step_size,
                         corr_threshold, movie_id, min_common_elements, min_common_neighbour_ratings):
        """
        ratings must be sorted in 'timestamp' order.
        
        :param timebin_i: timebin identifier, identifies start of timebin
        :param timebin_size: timebin identifier, identifies size of the timebin starting at timebin_i index
        :param neighbour_min_s: Minimum number of movies to include inside neighbour timebins 
        :param neighbour_max_s: Maximum number of movies to include inside neighbour timebins
        :param neighbour_step_size: Number of movies to extend per step the timebin size
        :param min_common_elements: Minimum number of common elements in between timebins in order them to become neighbour.
                                              This  value is used to find similar timebins.
        :param min_common_neighbour_ratings: Minimum number of ratings in common between two users neighbour when taking timebins of them.
                                   This value is used to create possible timebins.
        """
        
        # Recreate the user timebin using its identifiers
        timebin = TimebinSimilarity.get_timebin(ratings, user_id, timebin_i, timebin_size)
        timebin.drop(timebin.tail(1).index, inplace=True)  
        
        timebin_user_avg = DatasetUserOperator.get_user_avg_at(ratings, user_id, timebin.iloc[-1]['timestamp'])
        
        # Neighbours contains users who has rated min_common_neighbour_ratings movies in common 
        # and also rated last movie of the timebin by the way, remember we are trying to predict last movie
        neighbours = TimebinSimilarity.get_possible_neighbour_list(ratings, user_id, 
                                                                             timebin_i, timebin_size, movie_id, min_common_neighbour_ratings)

        data = list()
        neighbour_count = 0
        for neighbour_id in neighbours:
            # Get all possible timebins of the neighbour that includes the movie with a timebin size with given constraints.
            neighbour_timebins = TimebinSimilarity.get_neighbour_timebins(ratings, neighbour_id, movie_id,
                                                                          neighbour_min_s, neighbour_max_s, 
                                                                          neighbour_step_size)
            
            # For each neighbour timebin, calculate pearson between the timebin and neighbour timebin 
            for neighbour_timebin, timebin_i, timebin_size in neighbour_timebins:
                
                # Calculate user avg for each timebin by assuming system at the time neighbour rated the last movie
                neighbour_time_constraint = TimeConstraint(end_dt=neighbour_timebin.iloc[-1]['timestamp'])
                neighbour_avg_rating = DatasetUserOperator.get_user_avg_at(ratings, neighbour_id, neighbour_time_constraint.end_dt)
            
                # Calculate Pearson Correlation between neighbour timebin and self.timebin
                corr, common_elements = TimebinSimilarity.timebin_similarity(timebin, timebin_user_avg,
                                                                             neighbour_timebin, neighbour_avg_rating)
                
                # If less than min_timebin_neighbour_ratings movies in common rated in the neighbour_timebin, dont process this timebin
                if common_elements < min_common_elements:
                    continue
                
                # Dont include the neighbour timebin if corr is invalid or less than corr_threshold
                if math.isnan(corr) or corr < corr_threshold:                     
                    continue
                
                if neighbour_count > k:
                    break
                
                data.append( (neighbour_id, common_elements, corr, timebin_i, timebin_size) )
                neighbour_count += 1
                
        return pd.DataFrame(data, columns=['neighbour_id', 'n_common','pearson_corr', 'timebin_i', 'timebin_size'])
    
    @staticmethod
    def store_neighbour_data(ratings, similar_timebins, movie_id):
        """
            Collect structured data aimed to generate predictions to the target movie later when needed.
        """
        data = list()
        
        for row in similar_timebins.itertuples(index=False):
            # Get neighbour timebin data
            neighbour_id = row[0]
            n_common = row[1]
            corr = row[2]
            timebin_i = row[3]
            timebin_size = row[4]
            
            # Create the neighbour timebin from its data
            neighbour_bin = TimebinSimilarity.get_timebin(ratings, neighbour_id, timebin_i, timebin_size)
            rating = None
            try: 
                rating = neighbour_bin.loc[movie_id]     # test whether the neighbour has rating for the item
            except KeyError:
                continue                                 # if neighbour dont have rating for the item, then pass this round of loop
            
            # insert rating for the neighbour into data 
            #  and also insert other data identifying the neighbour(for in case we use them in analysis)
            if rating is not None:
                data.append( (rating[0], corr, neighbour_id, n_common, timebin_i, timebin_size) )
        
        return data 
    
    @staticmethod
    def store_common_neighbour_ratings(ratings, timebin, similar_timebins):
        """
        Collect neighbour data for each commonly rated movie.
        """
        # Store data in order to return as result
        data = defaultdict(list)
        for row in similar_timebins.itertuples(index=False):
            # Get neighbour timebin data
            neighbour_id = row[0]
            n_common = row[1]
            corr = row[2]
            timebin_i = row[3]
            timebin_size = row[4]
            
            # Create the neighbour timebin from its data
            neighbour_bin = TimebinSimilarity.get_timebin(ratings, neighbour_id, timebin_i, timebin_size)
            
            # Get rid of movie ratings of the neighbour that are not found in the timebin
            merged_bin = pd.merge(timebin, neighbour_bin, left_index=True, right_index=True)
            
            # For each common movie rating, store neighbour rating and its correlation to the timebin
            for bin_row in merged_bin.itertuples(index=True):
                curr_movie = bin_row[0]
                neighbour_rating = bin_row[3]
                #print(bin_row[0], bin_row[1], bin_row[2], bin_row[3], bin_row[4])
                data[curr_movie].append( (neighbour_rating, corr, neighbour_id, n_common, timebin_i, timebin_size) )
        return data
    
    @staticmethod
    def get_timebin_size(tc: TimeConstraint):
        return abs((tc.start_dt - tc.end_dt).days)
    

##### TimebinPredict

In [13]:
class TimebinPredict:
    
    @staticmethod
    def predict_knn_data(knn_data,  min_neighbour_count=5):
        """
        Make prediction by using the neighbour timebin data on the movie of interest.
        K nearest neighbours is used only.
        """
        
        # Sort the data by corr
        knn_data.sort(key=lambda data_element: data_element[1])
        
        # each row of data ->(rating, corr, neighbour_id, n_common, timebin_i, timebin_size, timebin_size_in_days)
        weighted_sum = 0
        weight_sum = 0
        count = 0
        for (rating, corr, _, _, _, _) in knn_data:
            weighted_sum += rating * corr
            weight_sum += corr            
            count += 1
            
        if count < min_neighbour_count:
            return 0
        
        prediction = weighted_sum / weight_sum
        
        return prediction

    @staticmethod
    def predict(ratings, movie_id, knn, min_neighbour_count=5):
        neighbours_data = knn[['neighbour_id', 'pearson_corr']].groupby('neighbour_id').mean()
        neighbours_data.sort_values(by='pearson_corr', ascending=False, inplace=True)
        
        weighted_sum = 0
        weight_sum = 0
        count = 0
        for index, row in knn[['neighbour_id', 'pearson_corr']].groupby('neighbour_id').mean().iterrows():
            neighbour_id = index
            corr = row[0]
            rating = DatasetMovieOperator.get_movie_rating(ratings, movie_id, neighbour_id)
            weighted_sum += rating * corr
            weight_sum += corr
            
            count += 1

        if count < min_neighbour_count:
            return 0
        
        prediction = weighted_sum / weight_sum
        
        return prediction

##### AnalizeTimebinSimilarity

In [25]:
class AnalizeTimebinSimilarity:
    
    @staticmethod
    def analize_timebin_prediction(movie_ratings, ratings, n, s, k=10, min_neighbour=3, corr_threshold=0.1, 
                                   neighbour_min_s=5, neighbour_max_s=50, neighbour_step_size=5, 
                                   min_common_elements=3,
                                   min_common_neighbour_ratings=3):
        """
        :param s: Timebin size
        :param n: Number of prediction to make where each prediction is for a randomly chosen users' a random movie
        :param k: K nearest neighbour of the timebin based neighbourhood will be taken when making prediction.
        :param min_neighbour: Minimum number of neighbour timebins when making prediction.
        :param corr_threshold: Overrides the default behaviour if you set any value other than None
        :param neighbour_min_s: Minimum number of movies to include inside neighbour timebins
                                Overrides default behaviour of the class if given any value other than None 
        :param neighbour_max_s: Maximum number of movies to include inside neighbour timebins
                                Overrides default behaviour of the class if given any value other than None
        :param neighbour_step_size: Increase of number of movies in the timebin per step
                                Overrides default behaviour of the class if given any value other than None
        :param min_common_elements: Minimum number of common elements in between timebins in order them to become neighbour.
                                              This  value is used to find similar timebins.
        :param min_common_ratings: Minimum number of ratings in common between two users neighbour when taking timebins of them.
                                   This value is used to create possible timebins.
        """
        
        # Correlation threshold should be greater than or equal to 0 for best quality. 
        if corr_threshold < 0:
            corr_threshold = 0
        
        output = list()
        
        # collected data count
        count = 0
        normal_predictions = list()
        timebin_predictions = list()
        
        while count < n:
            
            # Choose a random user
            user_id = DatasetUserOperator.get_random_users(ratings, n=1)[0]
            
            ### Choose Random timebin
            
            # Get all movies that has been watched by the user
            movies_watched = DatasetMovieOperator.get_movies_watched(ratings, user_id)
            movies_watched = movies_watched[['item_id', 'rating', 'timestamp']].set_index('item_id')
            
            # In order to simulate that we are making prediction in random point in time
            # take a random starting point and take s piece movies starting from that point as timebin.
            max_movie_index = ( len(movies_watched) - 1 ) - s
            if max_movie_index <= 0:
                continue
            timebin_i = random.randint(0, max_movie_index)
            timebin = TimebinSimilarity.get_timebin(ratings, user_id, timebin_i, s)
            
            # Make sure we have a valid timebin before moving forward
            if timebin is None or s < min_common_elements:
                continue
            
            ### Predict last movie of the timebin with Normal way(KNN with Pearson Correlation)
            movie_id = timebin.index[-1]
            
            # Get max limit time constraint where max limit is the time we rate the last movie inside the timebin
            time_constraint = TimeConstraint(end_dt=(timebin.iloc[-1]['timestamp'] - timedelta(milliseconds=1))) 
            normal_prediction = PearsonPredict.predict_movie(movie_ratings, user_id, movie_id, time_constraint)
            
            # In order to compare predictions, normal prediction must exists
            if normal_prediction == 0:
                continue
            
            ### Predict last movie of the timebin with Timebin Based K Nearest Neighbourhood
            similar_timebins = TimebinSimilarity.get_knn(ratings, user_id, k,
                                                         timebin_i, s,
                                                         neighbour_min_s=neighbour_min_s,
                                                         neighbour_max_s=neighbour_max_s,
                                                         neighbour_step_size=neighbour_step_size,
                                                         corr_threshold=corr_threshold,
                                                         movie_id=movie_id, 
                                                         min_common_elements=min_common_elements, 
                                                         min_common_neighbour_ratings=min_common_neighbour_ratings)
            if similar_timebins.empty:
                continue  

            # Drop duplicate (neighbour_id, n_common, pearson_corr, timebin_i, timebin_size) tuples so that now no repeating timebin exists
            similar_timebins.drop_duplicates(['neighbour_id','n_common', 'pearson_corr', 'timebin_i'], inplace=True)

            prediction = TimebinPredict.predict(ratings, movie_id, similar_timebins, min_neighbour_count=min_neighbour)
            actual = DatasetMovieOperator.get_movie_rating(ratings, movie_id, user_id)
            
            # In order to compare predictions, timebin based prediction must exists
            if prediction == 0:
                continue
            
            print(movie_id, normal_prediction, prediction, actual)  # Debug Output
            
            normal_predictions.append( (normal_prediction, actual) )
            timebin_predictions.append( (prediction, actual) )
            
            #### !!!!!!!!!!! Add movieId to output in order to calculate coverage.
            
            # Save the data
            
            output.append( (user_id, timebin_i, s, similar_timebins) )
            # we saved a data succesfully, increment collected data cout
            count += 1
        
        # RMSE for debug output
        normal_rmse = Accuracy.rmse(normal_predictions)
        timebins_rmse = Accuracy.rmse(timebin_predictions)
        print(f"Normal RMSE: {normal_rmse:.2} Timebin RMSE:{timebins_rmse:.2}")
        
        result = {
            "output":output,
            "normal_predictions":normal_predictions,
            "timebin_predictions":timebin_predictions
        }
        
        return result         

##### DatasetUserOperator

In [15]:
movielens = MovieLensDataset()
movie_ratings = movielens.create_movie_ratings(movielens.ratings ,movielens.movies)
movie_ratings

Unnamed: 0,user_id,item_id,rating,timestamp,title,genres,year
0,1,3,4.0,2000-07-30 18:20:47,Grumpier Old Men,Comedy|Romance,1995.0
1,6,3,5.0,1996-10-17 12:11:36,Grumpier Old Men,Comedy|Romance,1995.0
2,19,3,3.0,2000-08-08 04:07:16,Grumpier Old Men,Comedy|Romance,1995.0
3,32,3,3.0,1997-02-23 22:16:12,Grumpier Old Men,Comedy|Romance,1995.0
4,42,3,4.0,2001-07-27 08:04:05,Grumpier Old Men,Comedy|Romance,1995.0
...,...,...,...,...,...,...,...
100616,610,160341,2.5,2016-11-19 08:55:49,Bloodmoon,Action|Thriller,1997.0
100617,610,160527,4.5,2016-11-19 08:43:18,Sympathy for the Underdog,Action|Crime|Drama,1971.0
100618,610,160836,3.0,2017-05-03 20:53:14,Hazard,Action|Drama|Thriller,2005.0
100619,610,163937,3.5,2017-05-03 21:59:49,Blair Witch,Horror|Thriller,2016.0


In [16]:
DatasetMovieOperator.get_movies_watched(movie_ratings, 448)

Unnamed: 0,user_id,item_id,rating,timestamp,title,genres,year
37,448,3,3.0,2002-04-18 11:15:36,Grumpier Old Men,Comedy|Romance,1995.0
311,448,47,4.0,2002-04-18 12:19:46,Seven (a.k.a. Se7en),Mystery|Thriller,1995.0
514,448,50,4.0,2003-09-28 09:35:27,"Usual Suspects, The",Crime|Mystery|Thriller,1995.0
628,448,101,3.5,2004-02-09 12:09:46,Bottle Rocket,Adventure|Comedy|Crime|Romance,1996.0
981,448,163,3.5,2004-01-18 10:06:59,Desperado,Action|Romance|Western,1995.0
...,...,...,...,...,...,...,...
98697,448,165347,2.0,2017-07-09 17:18:29,Jack Reacher: Never Go Back,Action|Crime|Drama|Mystery|Thriller,2016.0
98698,448,168350,2.5,2017-04-21 18:18:38,100 Streets,Drama,2016.0
98699,448,168456,2.0,2017-03-12 16:53:23,Mercury Plains,Action|Adventure|Drama,2016.0
98700,448,169180,2.5,2017-06-04 16:04:55,American Fable,Thriller,2017.0


In [17]:
t1 = TimebinSimilarity.get_timebin(movielens.ratings.sort_values(by='timestamp'), 448, 0, 20) ### Sort ratings by timestamp first
t1

Unnamed: 0_level_0,rating,timestamp
item_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1210,5.0,2002-04-18 09:57:46
3507,4.0,2002-04-18 09:57:46
1374,3.0,2002-04-18 09:58:00
2349,2.0,2002-04-18 09:58:00
1961,3.0,2002-04-18 09:58:14
1193,5.0,2002-04-18 09:58:48
507,3.0,2002-04-18 09:58:48
1097,4.0,2002-04-18 09:59:01
541,4.0,2002-04-18 09:59:20
4499,4.0,2002-04-18 10:00:05


In [18]:
t2 = TimebinSimilarity.get_timebin(movielens.ratings.sort_values(by='timestamp'), 334, 0, 50) ### Sort ratings by timestamp first

In [19]:
TimebinSimilarity.timebin_similarity(t1, DatasetUserOperator.get_user_avg(movielens.ratings, 448),
                                     t2, DatasetUserOperator.get_user_avg(movielens.ratings, 334))

(0.9725947945938099, 3)

In [20]:
TimebinSimilarity.get_neighbour_timebins(movielens.ratings, 
                                         448, 1214,
                                         5, 100, 5)

[(         rating           timestamp
  item_id                            
  1214        3.0 2002-04-18 10:02:54
  1220        3.0 2002-04-23 12:06:22
  1221        5.0 2002-04-18 10:02:27
  1223        4.0 2002-04-18 10:44:21
  1227        5.0 2002-04-18 11:53:58,
  180,
  5),
 (         rating           timestamp
  item_id                            
  1214        3.0 2002-04-18 10:02:54
  1220        3.0 2002-04-23 12:06:22
  1221        5.0 2002-04-18 10:02:27
  1223        4.0 2002-04-18 10:44:21
  1227        5.0 2002-04-18 11:53:58
  1228        5.0 2011-12-24 16:31:07
  1230        5.0 2002-08-12 06:55:09
  1234        4.0 2002-04-18 10:52:25
  1240        3.0 2002-04-18 10:03:14
  1244        4.0 2002-08-05 06:44:10,
  180,
  10),
 (         rating           timestamp
  item_id                            
  1214        3.0 2002-04-18 10:02:54
  1220        3.0 2002-04-23 12:06:22
  1221        5.0 2002-04-18 10:02:27
  1223        4.0 2002-04-18 10:44:21
  1227        5.0 200

In [22]:
TimebinSimilarity.get_knn(ratings=movielens.ratings, 
                                   user_id=448,
                                   k=10,
                                   timebin_i=60, 
                                   timebin_size=43, 
                                   neighbour_min_s=5, 
                                   neighbour_max_s=50, 
                                   neighbour_step_size=5, 
                                   corr_threshold=0, 
                                   movie_id=1961,
                                   min_common_neighbour_ratings=3, 
                                   min_common_elements=3)

Unnamed: 0,neighbour_id,n_common,pearson_corr,timebin_i,timebin_size
0,153,5,0.611595,0,45


In [27]:
AnalizeTimebinSimilarity.analize_timebin_prediction(movie_ratings, movielens.ratings, n=1, s=43, k=10,
                           min_neighbour=3, corr_threshold=0.1, 
                           neighbour_min_s=5, neighbour_max_s=50, neighbour_step_size=5, 
                           min_common_elements=3,
                           min_common_neighbour_ratings=3)

5952 4.628616472558159 4.442627215825005 3.5
Normal RMSE: 1.0 Timebin RMSE:1.0


{'output': [(334,
   43,
   43,
       neighbour_id  n_common  pearson_corr  timebin_i  timebin_size
   0            165         3      0.859772         30            30
   1            288         5      0.461292        840            30
   2            288         5      0.381806        840            35
   3            288         5      0.534001        840            40
   4            477         3      0.222607        380            20
   5            480         3      0.165869        520            40
   6            489         3      0.546175        420            35
   7            504         4      0.249971         35            35
   8            504         3      0.299974         40            40
   9            600         3      0.881719        595            35
   10           600         4      0.574978        585            45)],
 'normal_predictions': [(4.628616472558159, 3.5)],
 'timebin_predictions': [(4.442627215825005, 3.5)]}