## Introduction
The following code uploads files from your local file system, [as seen here](https://colab.research.google.com/notebooks/io.ipynb#scrollTo=BaCkyg5CV5jF).

Browse to your Netflix data and upload ```TestingRatings.txt``` and ```TrainingRatings.txt```.

Each line of the dataset has the format _MovieID_, _CustomerID_, _Rating_.

In [0]:
from google.colab import files
 
uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving TestingRatings.txt to TestingRatings.txt
Saving TrainingRatings.txt to TrainingRatings.txt
User uploaded file "TestingRatings.txt" with length 1806166 bytes
User uploaded file "TrainingRatings.txt" with length 58524008 bytes


Next, let's load both datasets as two-dimensional NumPy arrays.

In [0]:
import numpy as np

training_ratings = np.loadtxt("TrainingRatings.txt",delimiter=',')
testing_ratings = np.loadtxt("TestingRatings.txt",delimiter=',')
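
As a quick sanity check of the expected layout, the sketch below parses a few lines in the same _MovieID_, _CustomerID_, _Rating_ format from an in-memory buffer instead of the real files. The sample values are invented for illustration and are not taken from the Netflix data.

```python
import io
import numpy as np

# Three invented lines in the MovieID,CustomerID,Rating format (not real data).
sample_text = "8,1744889,1.0\n8,1395430,2.0\n28,1205593,4.0\n"

# np.loadtxt also accepts file-like objects, so no file on disk is needed here.
sample_ratings = np.loadtxt(io.StringIO(sample_text), delimiter=',')

print(sample_ratings.shape)   # one row per rating, three columns
print(sample_ratings[:, 0])   # movie IDs
print(sample_ratings[:, 1])   # customer IDs
print(sample_ratings[:, 2])   # ratings
```

Note that ```np.loadtxt``` returns floating-point values by default, so the IDs come back as floats as well.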

## Code

The steps to estimate the rating are:
1. Define the _active user_ (subscript a) and the _movie_ (subscript j) to be rated.
2. Get all user IDs from the training set.
3. For each user:
    - Get their votes (set of logs).
    - If they voted on the _movie_ defined above:
        - Get the movies both users (the active one and the training one) voted on, then calculate ```w(a,i)``` and the difference between the training user's vote on the _movie_ and their average vote.
4. Compute the normalizing factor _k_.
5. Estimate the active user's rating for the _movie_.
6. Return the estimate.
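
For reference, the quantities in these steps can be written out as formulas. This is a sketch inferred from the code and its comments further down, not an authoritative statement of the method. The symbol $$p_{a,j}$$ for the estimate is a label introduced here. In the first formula the sum runs over the training users _i_ who voted on movie _j_. In the second, the sums run over the movies _j_ that both users voted on.

$$p_{a,j} = \overline{v_a} + k \sum_{i} w(a,i)\,(v_{i,j} - \overline{v_i})$$

$$w(a,i) = \frac{\sum_{j} (v_{a,j} - \overline{v_a})(v_{i,j} - \overline{v_i})}{\sqrt{\sum_{j} (v_{a,j} - \overline{v_a})^2 \, \sum_{j} (v_{i,j} - \overline{v_i})^2}}$$

$$k = \frac{1}{\sum_{i} |w(a,i)|}$$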

Our dataset has the movie IDs in the first column, the user IDs in the second, and the ratings in the last one. Let's define these column indices.

In [0]:
user_column = 1
ratings_column = 2
movie_column = 0

def mtxslv_user_ratings(user_id, dataset):
  """
  Receives user_id and dataset. Looks for all
  occurrences of user_id in dataset and returns
  that subset of rows.

  If user_id is not found, returns an empty
  numpy array.
  """
  # the same thing as I_i (set of items user_id has voted on)
  # a boolean mask selects this user's rows without a Python-level loop
  subset = dataset[dataset[:,user_column] == user_id]
  return subset

At this point we can compute the set of items user _i_ has voted on:
 $$I_i$$

The mean vote of user _i_ is $$\overline{v_i}$$. It can be computed by calling ```np.mean()``` on the ```ratings_column``` column (all rows) of the array returned by ```mtxslv_user_ratings(user_id, training_ratings)```.
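
A minimal sketch of that computation on a tiny invented dataset is shown below. So that it runs on its own, it selects the user's rows with a boolean mask rather than calling ```mtxslv_user_ratings```. The array ```toy_logs``` and its values are made up for illustration.

```python
import numpy as np

user_column = 1
ratings_column = 2

# Invented logs in the MovieID, CustomerID, Rating layout.
toy_logs = np.array([[1, 10, 4],
                     [2, 10, 2],
                     [1, 11, 5],
                     [3, 10, 3]])

# Keep only the rows of user 10, then average the ratings column.
user_10_logs = toy_logs[toy_logs[:, user_column] == 10]
mean_vote_user_10 = np.mean(user_10_logs[:, ratings_column])

print(mean_vote_user_10)  # (4 + 2 + 3) / 3
```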




Watch out! Notice the following:
* _User 1567095_ has voted on only 100 movies;
* _User 1174811_ has voted on 224 movies.

So not every user has voted on every movie.
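
Counts like these can be obtained for every user at once. The sketch below uses ```np.unique``` with ```return_counts=True``` on a small invented array. On the real data the same call would be applied to the user column of ```training_ratings```.

```python
import numpy as np

user_column = 1

# Invented logs in the MovieID, CustomerID, Rating layout.
toy_logs = np.array([[1, 10, 4],
                     [2, 10, 2],
                     [1, 11, 5],
                     [3, 10, 3]])

# Distinct user IDs and how many rows (votes) each one has.
user_ids, ratings_per_user = np.unique(toy_logs[:, user_column], return_counts=True)

for user_id, count in zip(user_ids, ratings_per_user):
    print(f"user {user_id} voted on {count} movie(s)")
```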

In [0]:
# Python program to illustrate the intersection
# of two lists in the simplest way
# from https://www.geeksforgeeks.org/python-intersection-two-lists/
def intersection(lst1, lst2):
  lst3 = [value for value in lst1 if value in lst2]
  return lst3
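
As a possible alternative, NumPy's ```np.intersect1d``` does the same job for arrays of IDs. It is not what the notebook uses. Unlike the list comprehension above, it returns the common values sorted and without duplicates. The two lists below are small illustrative sets of movie IDs.

```python
import numpy as np

movies_user_a = [1, 3, 4, 6]
movies_user_b = [1, 2, 3, 5, 6]

# Sorted, unique values present in both inputs.
common_movies = np.intersect1d(movies_user_a, movies_user_b)

print(common_movies.tolist())
```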

In [0]:
def mtxslv_collab_filter(instance, product_id,training_set):
  """
  Receives the instance (a set of logs), the product_id whose rating
  needs to be estimated, and the training set (a bigger set of logs).
  instance and training_set must be numpy arrays.

  The user in instance is the active one.

  Returns the estimated rating the user in instance would give to product_id.
  """
  average_rating_active_user = np.mean(instance[:,ratings_column]) # \overline{v_a}

  w = [] # list of all weights/correlation measures
  k = 0  # normalizing term
  user_vote_minus_average = [] # v_ij - \overline{v_i}

  # let's get all user ids
  user_ids = set(training_set[:,user_column])

  # now iterate over each user, looking for those that have rated product_id
  for user in user_ids:
    user_subset = mtxslv_user_ratings(user,training_set)
    #print("user ", user)
    #print(user_subset)
    if(user_subset[:,movie_column].tolist().count(product_id)): # has user voted product_id? If yes
      #print("user ",user," has voted in product ", product_id)
      user_average = np.mean(user_subset[:,ratings_column])   # get the average of this user ratings
      #print("user ",user,"has average ",user_average)
      movies_both_users_voted = intersection(user_subset[:,movie_column].tolist(),
                                             instance[:,movie_column].tolist()) # and calculate the movies this user and the active user rated
      #print("movies both users voted:",movies_both_users_voted)
      user_index_product_id = user_subset[:,movie_column].tolist().index(product_id) # Calculate what index for user
                                                                                     # is related to product_id 
      
      w_numerator = 0                             # Numerator for w formula: \sum_j (v_{a,j} - \overline{v_a})(v_{i,j} - \overline{v_i})
      w_active_denominator_factor = 0             # Active user factor for w denominator: \sum_j (v_{a,j} - \overline{v_a})^2
      w_training_usr_denominator_factor = 0       # Training user factor for w denominator: \sum_j (v_{i,j} - \overline{v_i})^2

      for movie in movies_both_users_voted: # Now, iterate over the movies both users rated 
        
        training_user_index_for_movie = user_subset[:,movie_column].tolist().index(movie) # index of user vote for movie
        active_user_index_for_movie = instance[:,movie_column].tolist().index(movie)      # index of active user vote for movie

        w_numerator = w_numerator + (instance[active_user_index_for_movie,ratings_column]-average_rating_active_user)*(user_subset[training_user_index_for_movie,ratings_column]-user_average)
        w_active_denominator_factor = w_active_denominator_factor + np.power((instance[active_user_index_for_movie,ratings_column]-average_rating_active_user),2)
        w_training_usr_denominator_factor = w_training_usr_denominator_factor + np.power( (user_subset[training_user_index_for_movie,ratings_column]-user_average) ,2)

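      w_denominator = np.sqrt(w_active_denominator_factor*w_training_usr_denominator_factor)
      if w_denominator == 0: # weight undefined (no movies in common, or no spread on them): skip this user
        continue
      w.append(w_numerator/w_denominator)
      user_vote_minus_average.append(user_subset[user_index_product_id,ratings_column]-user_average)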

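  #print(w)
  sum_abs_w = np.sum( np.abs(w) )
  if sum_abs_w == 0: # no usable training user: fall back to the active user's average
    return average_rating_active_user
  k = 1 / sum_abs_w
  #print(k)
  #print(user_vote_minus_average)
  estimation =  average_rating_active_user + k* np.dot(w,user_vote_minus_average)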

  return estimation
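
To look at the weight ```w(a,i)``` in isolation, here is a minimal sketch for a single pair of users. The helper ```pearson_weight``` and its dictionary inputs are hypothetical and not part of the notebook. Like ```mtxslv_collab_filter```, it takes each mean over all of a user's ratings and sums only over the movies both users rated. Returning 0.0 when the weight is undefined is an assumption made for this sketch.

```python
import numpy as np

def pearson_weight(active_ratings, other_ratings):
    """Correlation-style weight between two users.

    Both arguments are dicts mapping movie ID -> rating (hypothetical layout).
    """
    # Means over ALL ratings of each user, as in mtxslv_collab_filter.
    active_mean = np.mean(list(active_ratings.values()))
    other_mean = np.mean(list(other_ratings.values()))

    # Movies rated by both users.
    common = sorted(set(active_ratings) & set(other_ratings))
    if not common:
        return 0.0  # assumption: no overlap means no evidence either way

    active_dev = np.array([active_ratings[m] for m in common]) - active_mean
    other_dev = np.array([other_ratings[m] for m in common]) - other_mean

    denominator = np.sqrt(np.sum(active_dev ** 2) * np.sum(other_dev ** 2))
    if denominator == 0:
        return 0.0  # assumption: zero spread makes the weight undefined
    return float(np.dot(active_dev, other_dev) / denominator)

# Two small illustrative users (movie ID -> rating).
user_a = {1: 2, 3: 4, 4: 4, 6: 5}
user_b = {1: 1, 2: 5, 3: 4, 5: 3, 6: 4}

weight = pearson_weight(user_a, user_b)
print(round(weight, 4))
```

A weight close to 1 means the two users deviate from their own averages in the same direction on the movies they share. A weight close to -1 means they deviate in opposite directions.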


## A Minor Example
As an example, consider the following dataset.
* There are 6 products (R1,R2,R3,R4,R5 and R6);
* There are 3 "training set" users (Bob, Chris and Diana).
* The query instance is Alice's ratings.

For the training dataset, the rows are the users and the columns are the products. A ```nan``` entry marks a product the user has not rated.

In [0]:
a_minor_example_list = [[1,5,4,float("nan"),3,4],
                        [5,2,float("nan"),2,1,float("nan")],
                        [3,float("nan"),2,2,float("nan"),4]]
a_minor_example_dataset = np.array(a_minor_example_list)                        

query_instance = [2,float("nan"),4,4,float("nan"),5]                 

We want to estimate Alice's rating for product 5 (fifth column). The manually calculated estimate is 4.336075363.

Let's turn that dataset into "logs", following the format below. Alice is user 1, and Bob, Chris and Diana are users 2, 3 and 4.

_movieID_, _customerID_, _rating_:

In [0]:
training_set_example = np.array([[1,2,1],[2,2,5],[3,2,4],[5,2,3],
                         [6,2,4],[1,3,5],[2,3,2],[4,3,2],
                         [5,3,1],[1,4,3],[3,4,2],[4,4,2],
                         [6,4,4]])
training_set_example_list = training_set_example.tolist()

testing_instance = np.array([[1,1,2],[3,1,4],[4,1,4],[6,1,5]])                     
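
The logs above were typed in by hand. As a cross-check, the sketch below derives them from the users-by-products matrix. The helper ```matrix_to_logs``` and its parameters are hypothetical names introduced here. It assumes product columns are numbered from 1 and that the user IDs of consecutive rows are consecutive.

```python
import numpy as np

def matrix_to_logs(matrix, first_user_id, first_movie_id=1):
    """Turn a users-by-products matrix (NaN = not rated) into
    rows of movieID, customerID, rating."""
    matrix = np.asarray(matrix, dtype=float)
    # Row and column positions of the actual ratings, in row-major order.
    rows, cols = np.nonzero(~np.isnan(matrix))
    return np.column_stack((cols + first_movie_id,
                            rows + first_user_id,
                            matrix[rows, cols]))

nan = float("nan")
matrix = [[1, 5, 4, nan, 3, 4],
          [5, 2, nan, 2, 1, nan],
          [3, nan, 2, 2, nan, 4]]

logs = matrix_to_logs(matrix, first_user_id=2)
print(logs.astype(int))
```

Calling the same helper on Alice's row with ```first_user_id=1``` should give the rows of ```testing_instance```.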

In [0]:
usuario_2 = mtxslv_user_ratings(2,training_set_example)
usuario_3 = mtxslv_user_ratings(3,training_set_example)
usuario_4 = mtxslv_user_ratings(4,training_set_example)

In [75]:
usuario_2

array([[1, 2, 1],
       [2, 2, 5],
       [3, 2, 4],
       [5, 2, 3],
       [6, 2, 4]])

In [76]:
training_set_example

array([[1, 2, 1],
       [2, 2, 5],
       [3, 2, 4],
       [5, 2, 3],
       [6, 2, 4],
       [1, 3, 5],
       [2, 3, 2],
       [4, 3, 2],
       [5, 3, 1],
       [1, 4, 3],
       [3, 4, 2],
       [4, 4, 2],
       [6, 4, 4]])

In [0]:
estimativa = mtxslv_collab_filter(testing_instance,5,training_set_example)

In [78]:
estimativa

4.336096188777122
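
To see where this number comes from, the arithmetic can be traced by hand. Alice (user 1) is the active user, with average $$\overline{v_a} = 3.75$$. Only users 2 and 3 rated product 5. Their averages are 3.4 and 2.5, and their ratings of product 5 are 3 and 1. The symbol $$p_{a,5}$$ for the estimate is a label introduced here. Intermediate values are rounded to four decimals, so the last digits differ slightly from the output above.

$$w(a,2) = \frac{5.1}{\sqrt{4.6875 \times 6.48}} \approx 0.9254$$

$$w(a,3) = \frac{-4.5}{\sqrt{3.125 \times 6.5}} \approx -0.9985$$

$$k = \frac{1}{|0.9254| + |-0.9985|} \approx 0.5198$$

$$p_{a,5} \approx 3.75 + 0.5198\,\left[\,0.9254\,(3 - 3.4) + (-0.9985)\,(1 - 2.5)\,\right] \approx 4.336$$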

## References

* https://stackoverflow.com/questions/21860605/python-remove-lists-from-list-of-lists-similar-functionality-to-pop

* https://www.geeksforgeeks.org/python-intersection-two-lists/

* https://www.programiz.com/python-programming/methods/list/index