# Matrix Factorization

In week 4 of the inzva Applied AI program, we apply the matrix factorization algorithm FunkSVD using Recompy. We will go through the source code of the FunkSVD module of the Recompy library.

### User-Item Matrix

In a recommender system, the user-item matrix is almost always sparse: each user interacts with only a few of the available items. What can we do to find similar users with such data at hand?

- The simplest approach is to do nothing. By nothing we mean that we do not train any model to fill the sparse matrix. We use only the user-item interactions at hand and calculate similarities between users over the items both have rated.

- A slightly better (though not always) approach is to fill the matrix with the global mean rating.

- A slightly better approach still is to fill the matrix with each user's mean rating.
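
The two filling strategies can be sketched with pandas. The small ratings table below is made up for illustration and is not MovieLens data; rows are users, columns are items, and `NaN` marks a missing rating.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix (hypothetical data): rows are users, columns are items.
ratings = pd.DataFrame(
    [[5.0, np.nan, 3.0],
     [np.nan, 2.0, np.nan],
     [4.0, 4.0, np.nan]],
    index=["u1", "u2", "u3"],
    columns=["i1", "i2", "i3"],
)

# Global mean: average over every observed rating, ignoring NaN.
global_mean = float(np.nanmean(ratings.to_numpy()))
filled_global = ratings.fillna(global_mean)

# User means: fill each row with that user's own average rating.
filled_user = ratings.apply(lambda row: row.fillna(row.mean()), axis=1)

print(filled_global)
print(filled_user)
```

Filling with user means keeps each user's rating scale (a generous rater stays generous), which is one reason it tends to work better than a single global value.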

All of the above approaches, as well as matrix factorization (MF), need a similarity measure to find similar users or items. Common choices are cosine similarity, Euclidean distance, Pearson correlation, and variants of these.
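
As a rough illustration, the three measures can be computed with NumPy over the items two users have both rated. The function names and the rating vectors are hypothetical and are not Recompy's `Similarities` implementation.

```python
import numpy as np

def co_rated(u, v):
    """Keep only the positions where both users have a rating (not NaN)."""
    mask = ~np.isnan(u) & ~np.isnan(v)
    return u[mask], v[mask]

def cosine_similarity(u, v):
    a, b = co_rated(u, v)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(u, v):
    a, b = co_rated(u, v)
    return float(np.linalg.norm(a - b))

def pearson_correlation(u, v):
    a, b = co_rated(u, v)
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical rating vectors over the same five items; NaN means "not rated".
alice = np.array([5.0, 3.0, np.nan, 1.0, 4.0])
bob = np.array([4.0, np.nan, 2.0, 2.0, 5.0])

print(cosine_similarity(alice, bob))
print(euclidean_distance(alice, bob))
print(pearson_correlation(alice, bob))
```

Cosine similarity and Pearson correlation grow with similarity, while Euclidean distance shrinks, so the ranking direction has to be flipped when switching between them.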

### Recompy

Recompy is a Python library for recommender system algorithms. It is still under development, but the current version supports the FunkSVD algorithm and ships with the MovieLens-100k dataset built in.

https://github.com/CanBul/recompy

### MovieLens Dataset

The dataset contains the ratings that users gave to movies. Let's explore the MovieLens data.

In [21]:
from recompy import load_movie_data
import pandas as pd

data = load_movie_data()
df_ratings = pd.DataFrame(data, columns = ['userId', 'itemId', 'rating'])

In [22]:
df_ratings.head()

|  | userId | itemId | rating |
|---|---|---|---|
| 0 | 196.0 | 242.0 | 3.0 |
| 1 | 186.0 | 302.0 | 3.0 |
| 2 | 22.0 | 377.0 | 1.0 |
| 3 | 244.0 | 51.0 | 2.0 |
| 4 | 166.0 | 346.0 | 1.0 |


In [23]:
df_ratings = df_ratings.astype(int)

In [24]:
df_ratings_pivot = df_ratings.pivot(index='itemId', columns='userId', values='rating')


In [25]:
df_ratings_pivot

userId,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
itemId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,4.0,,,4.0,4.0,,,,4.0,...,2.0,3.0,4.0,,4.0,,,5.0,,
2,3.0,,,,3.0,,,,,,...,4.0,,,,,,,,,5.0
3,4.0,,,,,,,,,,...,,,4.0,,,,,,,
4,3.0,,,,,,5.0,,,4.0,...,5.0,,,,,,2.0,,,
5,3.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1678,,,,,,,,,,,...,,,,,,,,,,
1679,,,,,,,,,,,...,,,,,,,,,,
1680,,,,,,,,,,,...,,,,,,,,,,
1681,,,,,,,,,,,...,,,,,,,,,,
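
Most cells of the pivot table are empty. One way to quantify this is the fraction of missing cells. The sketch below uses a small made-up ratings frame with the same column names as `df_ratings`, so it runs without Recompy; the same expression should apply to `df_ratings_pivot`.

```python
import pandas as pd

# Hypothetical ratings in the same (userId, itemId, rating) layout as df_ratings.
toy = pd.DataFrame(
    {"userId": [1, 1, 2, 3], "itemId": [10, 20, 10, 30], "rating": [5, 3, 4, 2]}
)

# Items as rows and users as columns, as in the pivot above.
toy_pivot = toy.pivot(index="itemId", columns="userId", values="rating")

# Fraction of (item, user) cells that hold no rating.
sparsity = float(toy_pivot.isna().to_numpy().mean())
print(f"{sparsity:.2%} of the cells are empty")
```

For MovieLens-100k, 100,000 ratings spread over 943 users and 1,682 items fill about 6.3% of the cells, so roughly 93.7% of the matrix is empty.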


In [1]:
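import bisect
import copy

import numpy as np

# initializer, Test and Similarities are helper classes defined elsewhere in Recompy.


class FunkSVD():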

    def __init__(self):
        # Initialize default hyperparameters
        self.set_hyperparameters()


    def set_hyperparameters(self, initialization_method='random', max_epoch=5, n_latent=10, learning_rate=0.01, regularization=0.1, early_stopping=False, init_mean=0, init_std=1):
        """Initialization method, epoch num, latent feature num, learning rate,
           regularization, early stopping condition...
        """
        self.initialization_method = initialization_method
        self.max_epoch = max_epoch
        self.n_latent = n_latent
        self.learning_rate = learning_rate
        self.regularization = regularization
        self.early_stopping = early_stopping
        self.init_mean = init_mean
        self.init_std = init_std

        self.min_train_error = np.inf
        self.min_test_error = np.inf

        
    def __set_data(self, data, test_portion):

        # get distinct users, items and user_existing_ratings, items_existing_users
        self.user_existing_ratings = {}
        self.items_rated_by_users = {}
        self.user_ids = []
        self.item_ids = []

        np.random.shuffle(data)

        # variables for train and test split
        user_dictionary = {}
        item_dictionary = {}
        self.train_data = []
        self.test_data = []

        self.train_data_user_ids = []
        self.train_data_item_ids = []
        self.test_data_user_ids = []
        self.test_data_item_ids = []

        for user, item, score in data:
            # Unique users and items

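            # Normalize numeric ids such as 196.0 to 196; keep non-numeric ids unchanged
            try:
                user = int(user)
            except (TypeError, ValueError):
                pass
            try:
                item = int(item)
            except (TypeError, ValueError):
                pass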

            user = str(user)
            item = str(item)
            score = float(score)

            if user not in self.user_existing_ratings:
                self.user_ids.append(user)
            if item not in self.items_rated_by_users:
                self.item_ids.append(item)

            self.items_rated_by_users.setdefault(item, []).append(user)
            self.user_existing_ratings.setdefault(user, []).append(item)

            ratio = len(self.test_data) / (len(self.train_data)+0.001)

            if self.test_split:
                # train and test set
                user_dictionary.setdefault(user, 0)
                item_dictionary.setdefault(item, 0)

                if user_dictionary[user] * test_portion >= 1 and item_dictionary[item] * test_portion >= 1 and ratio <= test_portion+0.02:

                    self.test_data.append([user, item, score])
                    if user not in self.test_data_user_ids: self.test_data_user_ids.append(user)
                    if item not in self.test_data_item_ids: self.test_data_item_ids.append(item)

                    user_dictionary[user] -= 1
                    item_dictionary[item] -= 1

                else:
                    self.train_data.append([user, item, score])
                    if user not in self.train_data_user_ids: self.train_data_user_ids.append(user)
                    if item not in self.train_data_item_ids: self.train_data_item_ids.append(item)

                    user_dictionary[user] += 1
                    item_dictionary[item] += 1
            else:
                self.train_data.append([user, item, score])
                if user not in self.train_data_user_ids: self.train_data_user_ids.append(user)
                if item not in self.train_data_item_ids: self.train_data_item_ids.append(item)

        print('Your data has {} distinct users and {} distinct items.'.format(
            len(self.user_ids), len(self.item_ids)))

        if len(self.test_data) < 1 and self.test_split:
            self.test_split = False
            self.early_stopping = False
            print("Training set doesn't have enough data for given test portion.")

        if self.test_split:

            print('Your data has been split into train and test set.')
            print('Length of training set is {}. Length of Test set is {}'.format(
                len(self.train_data), len(self.test_data)))
        else:

            print('Your data has no test set.')
            print('Length of training set is {}'.format(len(self.train_data)))

    def fit(self, data, test_split=True, test_portion=0.1, search_parameter_space=False):

        # Set train_data, test_data, user_ids etc. if search parameter is False
        # If True, this lets us search parameter space with the same train-test split
        if not search_parameter_space:

            self.test_split = test_split
            self.__set_data(data, test_portion)

        # Initialization
        print('Initializing features for Users and Items...')
        initial = initializer(self.user_ids, self.item_ids, self.initialization_method,
                              self.n_latent, self.init_mean, self.init_std)

        self.user_features, self.item_features = initial.initialize_latent_vectors()

        # Training
        print('Starting training...')
        error_counter = 0
        for epoch in range(self.max_epoch):

            # updating user and item features
            for user, item, rating in self.train_data:

                error = rating - \
                    np.dot(self.user_features[user], self.item_features[item])
                # Keep a copy of the user vector so the item update uses the
                # pre-update values (+= modifies the NumPy array in place).
                temp = self.user_features[user].copy()

                # Update user and item feature for each user, item and rating pair
                self.user_features[user] += self.learning_rate * \
                    (error * self.item_features[item] -
                     self.regularization * self.user_features[user])
                self.item_features[item] += self.learning_rate * \
                    (error * temp - self.regularization *
                     self.item_features[item])

            # Calculate errors
            error_counter += 1
            train_error = Test.rmse_error(
                self.train_data, self.user_features, self.item_features)

            # Show error to Client
            if self.test_split:
                test_error = Test.rmse_error(
                    self.test_data, self.user_features, self.item_features)
                print('Epoch Number: {}/{} Training RMSE: {:.2f} Test RMSE: {}'.format(epoch+1, self.max_epoch,
                                                                                       train_error, test_error))

            else:
                print('Epoch Number: {}/{} Training RMSE: {:.2f}'.format(epoch+1, self.max_epoch,
                                                                         train_error))

            # Save best features depending on test_error
            if self.test_split and test_error < self.min_test_error:
                self.min_test_error = test_error
                self.best_user_features = copy.deepcopy(self.user_features)
                self.best_item_features = copy.deepcopy(self.item_features)

                error_counter = 0
            # Save best features if test data is False
            elif not self.test_split and train_error < self.min_train_error:
                self.min_train_error = train_error
                self.best_user_features = copy.deepcopy(self.user_features)
                self.best_item_features = copy.deepcopy(self.item_features)

            # Break if test_error didn't improve for the last n rounds and early stopping is true
            if self.early_stopping and error_counter >= self.early_stopping:

                print("Test error didn't get lower for the last {} epochs. Training is stopped.".format(
                    error_counter))
                print('Best test error is: {:.2f}. Best features are saved.'.format(
                    self.min_test_error))
                break

        print('Training has ended...')
        self.user_features = copy.deepcopy(self.best_user_features)
        self.item_features = copy.deepcopy(self.best_item_features)

    def get_recommendation_for_existing_user(self, user_id, howMany=10):
        result_list = []
        # this might be more effective using matrix multiplication
        for item in self.item_ids:
            # if user did not already rate the item
            if item not in self.user_existing_ratings[user_id]:
                prediction = np.dot(
                    self.user_features[user_id], self.item_features[item])
                bisect.insort(result_list, [prediction, item])

        return [x[1] for x in result_list[::-1][0:howMany]]

    def get_recommendation_for_new_user(self, user_ratings,
                                        similarity_measure='mean_squared_difference', howManyUsers=3, howManyItems=5):

        # Get user predictions on same movies
        user_predictions = self.__user_prediction_for_same_movies(user_ratings)
        # Find the most similar user_ids
        user_ids = Similarities.get_most_similar_users(
            user_ratings, user_predictions, similarity_measure, howManyUsers)

        result_list = []
        # get user features for users who are most similar to given new user
        for user in user_ids:
            for item, item_feature in self.item_features.items():
                # predict ratings for most similar users
                prediction = np.dot(
                    self.user_features[user], item_feature)
                bisect.insort(result_list, [prediction, item])

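        # result_list is sorted in ascending order, so walk it from the
        # highest prediction down and skip duplicate items
        return_list = []
        for pair in reversed(result_list):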
            if len(return_list) >= howManyItems:
                break
            if pair[1] in return_list:
                continue

            return_list.append(pair[1])

        return return_list

    def get_similar_products(self, item_id, howMany=10):

        result_list = []
        product_features = self.item_features[item_id]

        for item in self.item_ids:

            if item == item_id:
                continue
            # cosine similarity between this item's features and the query item's features
            cos_sim = Similarities.cosine_similarity(
                self.item_features[item], product_features)

            bisect.insort(result_list, [cos_sim, item])

        return [x[1] for x in result_list[::-1][0:howMany]]

    def __user_prediction_for_same_movies(self, user_ratings):
        result = {}
        for key in user_ratings:
            if key not in self.item_features:
                continue

            for user in self.user_features:
                result.setdefault(user, []).append(
                    np.dot(self.user_features[user], self.item_features[key]))

        return result
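
The inner loop of `fit` performs one stochastic gradient descent step per rating. In the notation below, which is introduced here and not used in Recompy, $p_u$ is the latent vector of user $u$ (`user_features[user]`), $q_i$ is the latent vector of item $i$ (`item_features[item]`), $\eta$ is `learning_rate`, and $\lambda$ is `regularization`:

$$
\begin{aligned}
e_{ui} &= r_{ui} - p_u^\top q_i \\
p_u &\leftarrow p_u + \eta \left( e_{ui}\, q_i - \lambda\, p_u \right) \\
q_i &\leftarrow q_i + \eta \left( e_{ui}\, p_u^{\text{old}} - \lambda\, q_i \right)
\end{aligned}
$$

Here $p_u^{\text{old}}$ is the user vector before its update, which is the role `temp` plays in the code. Each step is a gradient step on the per-rating loss $\tfrac{1}{2}\left[ e_{ui}^2 + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right) \right]$.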

In [26]:
from recompy import load_movie_data, FunkSVD

# get MovieLens data
data = load_movie_data()


In [27]:
# initialization of FunkSVD model
myFunk = FunkSVD()
# training of the model
myFunk.fit(data)

Your data has 943 distinct users and 1682 distinct items.
Your data has been split into train and test set.
Length of training set is 89285. Length of Test set is 10715
Initializing features for Users and Items...
Starting training...
Epoch Number: 1/5 Training RMSE: 0.96 Test RMSE: 0.9731815442500207
Epoch Number: 2/5 Training RMSE: 0.93 Test RMSE: 0.959175458385427
Epoch Number: 3/5 Training RMSE: 0.93 Test RMSE: 0.9552681191042977
Epoch Number: 4/5 Training RMSE: 0.92 Test RMSE: 0.9535045239093901
Epoch Number: 5/5 Training RMSE: 0.92 Test RMSE: 0.9523772647489239
Training has ended...
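
The RMSE values above come from `Test.rmse_error`, whose source is not shown here. A minimal sketch of what such a function plausibly computes, assuming predictions are dot products of the latent vectors, could look like this (the function name `rmse` and the toy vectors are hypothetical):

```python
import numpy as np

def rmse(data, user_features, item_features):
    """Root mean squared error of dot-product predictions over (user, item, rating) rows."""
    squared_errors = [
        (rating - np.dot(user_features[user], item_features[item])) ** 2
        for user, item, rating in data
    ]
    return float(np.sqrt(np.mean(squared_errors)))

# Hypothetical two-dimensional latent vectors, keyed by string ids as in FunkSVD.
user_features = {"1": np.array([1.0, 2.0])}
item_features = {"10": np.array([2.0, 1.0]), "20": np.array([0.5, 0.5])}
data = [["1", "10", 5.0], ["1", "20", 1.5]]

print(rmse(data, user_features, item_features))
```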


In [28]:
myFunk.user_features

{'736': array([0.36814515, 0.25417254, 0.57835613, 0.69447491, 0.07827329,
        0.28544296, 0.61303373, 0.94034699, 0.7417107 , 0.32442466]),
 '450': array([0.66685188, 0.59660607, 0.59591379, 0.5748041 , 0.65981093,
        0.67269402, 0.67182818, 0.58609582, 0.58523575, 0.62346937]),
 '642': array([0.61122313, 0.62523391, 0.50008371, 0.49017622, 0.74807609,
        0.61623778, 0.80059483, 0.56132414, 0.78082057, 0.54321945]),
 '234': array([0.47072978, 0.53343584, 0.54770234, 0.54961141, 0.42471647,
        0.39663615, 0.44293665, 0.48367576, 0.62575665, 0.41276302]),
 '65': array([0.76746219, 0.33612493, 0.73056089, 0.67669114, 0.43600184,
        0.74192323, 0.7517373 , 0.54196074, 0.49281737, 0.29909904]),
 '561': array([0.35992301, 0.49930103, 0.50921879, 0.55644691, 0.48218586,
        0.56364182, 0.35567478, 0.49259288, 0.32443374, 0.45345726]),
 '181': array([0.32646449, 0.29357253, 0.26427155, 0.17474059, 0.33280045,
        0.19643736, 0.33848626, 0.26803731, 0.28492893, 
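
The dictionary above maps each user id to a vector of `n_latent` numbers; a predicted rating is the dot product of a user vector and an item vector. The sketch below learns such vectors on made-up ratings, using the same per-rating update as `fit`. The data, the uniform initialization, and the hyperparameter values are assumptions chosen for a small demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (user, item, rating) triples with string ids.
train = [("1", "a", 5.0), ("1", "b", 1.0), ("2", "a", 4.0),
         ("2", "c", 2.0), ("3", "b", 1.0), ("3", "c", 3.0)]

n_latent, learning_rate, regularization = 2, 0.05, 0.01
user_features = {u: rng.uniform(0, 1, n_latent) for u in ["1", "2", "3"]}
item_features = {i: rng.uniform(0, 1, n_latent) for i in ["a", "b", "c"]}

def rmse():
    errors = [r - user_features[u] @ item_features[i] for u, i, r in train]
    return float(np.sqrt(np.mean(np.square(errors))))

rmse_before = rmse()
for epoch in range(200):
    for u, i, r in train:
        error = r - user_features[u] @ item_features[i]
        # Copy the user vector so the item update sees the pre-update values.
        old_user = user_features[u].copy()
        user_features[u] += learning_rate * (
            error * item_features[i] - regularization * user_features[u])
        item_features[i] += learning_rate * (
            error * old_user - regularization * item_features[i])
rmse_after = rmse()

print(rmse_before, rmse_after)
print(user_features["1"] @ item_features["a"])  # prediction for a known rating of 5.0
```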

In [5]:
data

array([[ 659., 1064.,    5.],
       [ 910.,  414.,    4.],
       [ 847.,  239.,    5.],
       ...,
       [ 269.,  716.,    4.],
       [ 919.,  264.,    3.],
       [ 184.,   66.,    4.]])

In [29]:
# Ratings of a new user: keys are item ids (as strings), values are ratings
new_user = {'1': 5,
            '2': 4,
            '4': 3}


In [30]:
# Find the single most similar user by cosine similarity, then recommend 5 items based on that user
myFunk.get_recommendation_for_new_user(new_user, 
                                       similarity_measure = 'cosine_similarity', 
                                       howManyUsers = 1, howManyItems = 5)

['424', '438', '1230', '758', '743']
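
To make the new-user flow more concrete: `get_recommendation_for_new_user` first predicts, for every existing user, ratings on the items the new user has rated, and then asks `Similarities.get_most_similar_users` for the closest users. That helper is not shown in this notebook, so the sketch below is only a guess at the idea, with made-up two-dimensional features and a hypothetical `cosine_similarity` function.

```python
import numpy as np

# Hypothetical learned features (2 latent dimensions for readability).
user_features = {
    "u1": np.array([2.0, 0.5]),
    "u2": np.array([0.5, 2.0]),
}
item_features = {
    "1": np.array([2.0, 0.5]),
    "2": np.array([1.5, 1.0]),
    "4": np.array([1.0, 1.0]),
}
new_user = {"1": 5, "2": 4, "4": 3}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Only items the model knows about can be compared.
rated_items = [item for item in new_user if item in item_features]
target = np.array([new_user[item] for item in rated_items], dtype=float)

# Compare the new user's ratings with each existing user's predicted ratings.
similarities = {}
for user, features in user_features.items():
    predicted = np.array(
        [np.dot(features, item_features[item]) for item in rated_items])
    similarities[user] = cosine_similarity(target, predicted)

most_similar = max(similarities, key=similarities.get)
print(most_similar, similarities)
```

Items are then ranked by the predicted ratings of the selected users, which is what the loop over `self.item_features` does in the class above.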