# Recommendation system for Movies :  Alternating Least Square (ALS) 

Link to download the data : https://grouplens.org/datasets/movielens/

ALS : Matrix factorization models map both users and items to a joint latent factor space of dimensionality $f$, , such that user-item interactions are modeled as inner products in that space.
each item i is associated with a vector $q_{i}$ ∈ 	$\mathbb{R}^f$, and each user u is associated with a vector $p_{u}$∈$\mathbb{R}^f$.

The ratings are modeled as follows:

$$ \hat{r}_{ui} =  q_{i}^Tp_{u} $$

$µ$ over all average rating.

$b_{u}$ and $b_{i}$ deviations of user u and item i, respectively,from the average.



To minimize : 
$$\min_{ p^*,q^*} \sum_{r_{ui} \in R_{train}} \left(r_{ui} -  q_{i}^Tp_{u} \right)^2 + \lambda(||p_{u}||^2 + ||q_{i}||^2)$$

In [4]:
import pandas as pd 
from surprise import Dataset, accuracy, Reader, SVD
from surprise.model_selection import cross_validate
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from sklearn.model_selection import train_test_split
pd.set_option('display.max_columns', None)

def get_top_n_recommendations(model,df,user_id, n=3):
  user_movies = df[df['userId'] == user_id]['movieId'].unique()
  all_movies = df['movieId'].unique()
  movies_to_predict = list(set(all_movies) - set(user_movies))
  user_movie_pairs = [(user_id, movie_id, 0) for movie_id in movies_to_predict]
  predictions_cf = model.test(user_movie_pairs)
  top_n_recommendations = sorted(predictions_cf, key = lambda x: x.est)[:n]
  for pred in top_n_recommendations:
    predicted_rating = pred.est
    print(predicted_rating)
  top_n_movie_ids = [int(pred.iid) for pred in top_n_recommendations]
  top_n_movies = movie_encoder.inverse_transform(top_n_movie_ids)
  return top_n_movies

## Read Data : 

In [5]:
data_size = "ml-latest-small"# "ml-32m"
ratings = pd.read_csv(f"./data/{data_size}/ratings.csv")
movies = pd.read_csv(f"./data/{data_size}/movies.csv")



In [6]:
movies_ratings = pd.merge(left=ratings, right=movies, how='left' , on="movieId")
movies_ratings.drop(columns=["title"], inplace=True)
movies_ratings.head(2)

Unnamed: 0,userId,movieId,rating,timestamp,genres
0,1,1,4.0,964982703,Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Comedy|Romance


## Preprocess data : 

In [7]:
user_encoder = LabelEncoder()
movie_encoder = LabelEncoder()
mlb = MultiLabelBinarizer()

movies_ratings['userId'] = user_encoder.fit_transform(movies_ratings['userId'])
movies_ratings['movieId'] = movie_encoder.fit_transform(movies_ratings['movieId'])

In [8]:
genres_list_df= movies_ratings.pop('genres').str.split('|')
genres_list_df

0         [Adventure, Animation, Children, Comedy, Fantasy]
1                                         [Comedy, Romance]
2                                 [Action, Crime, Thriller]
3                                       [Mystery, Thriller]
4                                [Crime, Mystery, Thriller]
                                ...                        
100831                            [Drama, Horror, Thriller]
100832                            [Action, Crime, Thriller]
100833                                             [Horror]
100834                                     [Action, Sci-Fi]
100835                     [Action, Crime, Drama, Thriller]
Name: genres, Length: 100836, dtype: object

In [9]:
# For each film find the genres that represent it. put 0 for each genres.
movies_ratings = movies_ratings.join(pd.DataFrame(mlb.fit_transform(genres_list_df), columns = mlb.classes_, index = movies_ratings.index ))
movies_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,4.0,964982703,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,2,4.0,964981247,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,0,5,4.0,964982224,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
3,0,43,5.0,964983815,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0
4,0,46,5.0,964982931,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0


In [10]:
movies_ratings.drop(columns=["(no genres listed)"], inplace=True)

#### Split Data : 

In [11]:
train, test = train_test_split(movies_ratings, test_size=0.25)
print(f"Train size = {train.size}")
print(f"Test  size = {test.size}")

Train size = 1739421
Test  size = 579807


In [12]:
# As I am loading a custom dataset, we need to define a reader.
reader = Reader(rating_scale = (0.5, 5))
train_data = Dataset.load_from_df(train[["userId", "movieId", "rating"]], reader).build_full_trainset()
testset = train_data.build_anti_testset()

## train model using collab filtering: 

In [13]:
# Alternating Least Squares (ALS), same as SVD but bias is not added.
svd = SVD(random_state=0, n_factors=200, n_epochs=30,biased=False, verbose=True)
svd.fit(train_data)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Processing epoch 20
Processing epoch 21
Processing epoch 22
Processing epoch 23
Processing epoch 24
Processing epoch 25
Processing epoch 26
Processing epoch 27
Processing epoch 28
Processing epoch 29


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1e32de91570>

#### Root mean squared error

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{}(r_{ui} - \hat{r}_{ui})^2}$$




In [14]:
predictions_svd = svd.test(testset)
rsme_value = accuracy.rmse(predictions_svd)
print(f"Root mean squared error = {rsme_value}")

RMSE: 1.7818
Root mean squared error = 1.7817934665946054


## Recommendations : 

In [15]:
user_id = 58
n_recommendation = 3
recommendations = get_top_n_recommendations(svd, movies_ratings,user_id, n_recommendation)
top_n_movies_titles = movies[movies['movieId'].isin(recommendations)]['title'].tolist()
print(f"Top {n_recommendation} Recommendations for User {user_id}:")
for i, title in enumerate(top_n_movies_titles, 1):
  print(f"{i}.{title}")

0.5
0.5
0.5
Top 3 Recommendations for User 58:
1.Man of the Year (1995)
2.Reckless (1995)
3.Miami Rhapsody (1995)
