# Recommender Systems

**What is a recommender system?**

A recommender system predicts the preferences or interests of users and recommends items (such as products, movies, music, articles, or other content) that are likely to interest them.

**Why are recommender systems important?**

- Content discovery
- Personalization
- Increase in sales

Practical applications of recommender systems include:

- Amazon product recommendations
- Netflix movie recommendations
- Spotify music recommendations
- Facebook friend suggestions
- LinkedIn job suggestions
- YouTube video recommendations
- Google search suggestions
- Google News personalization
- Airbnb personalized search results
- Reddit comment recommendations
- Twitter personalized feed



**Types of recommender systems**

1. **Collaborative Filtering:**
   - **User-based:** Recommends items based on the preferences of users who are similar to the target user.
   - **Item-based:** Recommends items similar to those that the target user has liked or interacted with.

2. **Content-Based Filtering:**
   - Recommends items based on the characteristics or features of the items and the user's past preferences.
   - This approach focuses on the content of the items and the user's profile.

3. **Hybrid Methods:**
   - Combines both collaborative filtering and content-based filtering to leverage the strengths of each approach.

4. **Matrix Factorization:**
   - A model-based form of collaborative filtering: techniques such as Singular Value Decomposition (SVD) decompose the user-item interaction matrix to identify latent factors that influence user preferences.

5. **Deep Learning-Based Recommenders:**
   - Utilizes neural networks and deep learning techniques to capture complex patterns and relationships in user behavior and item features.

[Data Source](https://www.kaggle.com/datasets/danielgrijalvas/movies)

In [1]:
import pandas as pd
import numpy as np

In [5]:
df = pd.read_csv('data/movies.csv')
# print total number of rows
print(len(df))
df.head()

7668


| | name | rating | genre | year | released | score | votes | director | writer | star | country | budget | gross | company | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shining | R | Drama | 1980 | June 13, 1980 (United States) | 8.4 | 927000.0 | Stanley Kubrick | Stephen King | Jack Nicholson | United Kingdom | 19000000.0 | 46998772.0 | Warner Bros. | 146.0 |
| 1 | The Blue Lagoon | R | Adventure | 1980 | July 2, 1980 (United States) | 5.8 | 65000.0 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | United States | 4500000.0 | 58853106.0 | Columbia Pictures | 104.0 |
| 2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 1980 | June 20, 1980 (United States) | 8.7 | 1200000.0 | Irvin Kershner | Leigh Brackett | Mark Hamill | United States | 18000000.0 | 538375067.0 | Lucasfilm | 124.0 |
| 3 | Airplane! | PG | Comedy | 1980 | July 2, 1980 (United States) | 7.7 | 221000.0 | Jim Abrahams | Jim Abrahams | Robert Hays | United States | 3500000.0 | 83453539.0 | Paramount Pictures | 88.0 |
| 4 | Caddyshack | R | Comedy | 1980 | July 25, 1980 (United States) | 7.3 | 108000.0 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | United States | 6000000.0 | 39846344.0 | Orion Pictures | 98.0 |


In [6]:
#Clean the data
df.dropna(inplace=True)
df.drop(["year", "released", 'votes', 'writer', 'country', 'budget', 'gross'], axis=1, inplace=True)

In [40]:
print(len(df))
df.head()

5421


| | name | rating | genre | score | director | star | company | runtime |
|---|---|---|---|---|---|---|---|---|
| 0 | The Shining | R | Drama | 8.4 | Stanley Kubrick | Jack Nicholson | Warner Bros. | 146.0 |
| 1 | The Blue Lagoon | R | Adventure | 5.8 | Randal Kleiser | Brooke Shields | Columbia Pictures | 104.0 |
| 2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 8.7 | Irvin Kershner | Mark Hamill | Lucasfilm | 124.0 |
| 3 | Airplane! | PG | Comedy | 7.7 | Jim Abrahams | Robert Hays | Paramount Pictures | 88.0 |
| 4 | Caddyshack | R | Comedy | 7.3 | Harold Ramis | Chevy Chase | Orion Pictures | 98.0 |


### Content Based

- Focuses on item features. 
- Recommends items whose features are similar to those of items the target user has liked or interacted with.
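
To make the idea of feature similarity concrete, here is a minimal sketch of cosine similarity between two bag-of-words vectors, written in plain Python. The feature strings are invented for illustration and are not rows from the dataset, and the whitespace tokenization is a simplifying assumption rather than what a library vectorizer does.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two strings."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the words the two strings share
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up feature strings (hypothetical, for illustration only)
liked = "Action PG Lucasfilm"
candidate_1 = "Action PG Paramount"
candidate_2 = "Drama R Orion"

print(cosine(liked, candidate_1))  # two of three words shared, about 0.67
print(cosine(liked, candidate_2))  # no shared words, 0.0
```

An item that shares more feature words with something the user liked gets a higher score, which is the ranking signal used in the cells that follow.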

In [8]:
def all_into_1_column(df, cols):
    """Join the values of `cols` in each row into one space-separated string."""
    features = []
    # Loop over every row (assumes a default 0..n-1 index, hence the reset_index call below)
    for i in range(df.shape[0]):
        to_add = " "
        # Append the value of each selected column, separated by spaces
        for col in cols:
            to_add += str(df[col][i]) + ' '
        # Collect the combined string so the list can become a column in our dataframe
        features.append(to_add)
    return features

In [9]:
# Reset the index, since dropna() removed some of the rows
df.reset_index(drop=True, inplace=True)
df['combined'] = all_into_1_column(df, df.columns.tolist())

In [10]:
df.head()

| | name | rating | genre | score | director | star | company | runtime | combined |
|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shining | R | Drama | 8.4 | Stanley Kubrick | Jack Nicholson | Warner Bros. | 146.0 | The Shining R Drama 8.4 Stanley Kubrick Jack ... |
| 1 | The Blue Lagoon | R | Adventure | 5.8 | Randal Kleiser | Brooke Shields | Columbia Pictures | 104.0 | The Blue Lagoon R Adventure 5.8 Randal Kleise... |
| 2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 8.7 | Irvin Kershner | Mark Hamill | Lucasfilm | 124.0 | Star Wars: Episode V - The Empire Strikes Bac... |
| 3 | Airplane! | PG | Comedy | 7.7 | Jim Abrahams | Robert Hays | Paramount Pictures | 88.0 | Airplane! PG Comedy 7.7 Jim Abrahams Robert H... |
| 4 | Caddyshack | R | Comedy | 7.3 | Harold Ramis | Chevy Chase | Orion Pictures | 98.0 | Caddyshack R Comedy 7.3 Harold Ramis Chevy Ch... |


In [13]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

matrix = CountVectorizer().fit_transform(df["combined"])

In [39]:
matrix.shape

(5421, 9778)

In [14]:
cs_matrix = cosine_similarity(matrix)

In [15]:
cs_matrix

array([[1.        , 0.09534626, 0.08164966, ..., 0.        , 0.153393  ,
        0.07905694],
       [0.09534626, 1.        , 0.07784989, ..., 0.18181818, 0.21938173,
        0.07537784],
       [0.08164966, 0.07784989, 1.        , ..., 0.07784989, 0.18786729,
        0.12909944],
       ...,
       [0.        , 0.18181818, 0.07784989, ..., 1.        , 0.14625448,
        0.        ],
       [0.153393  , 0.21938173, 0.18786729, ..., 0.14625448, 1.        ,
        0.12126781],
       [0.07905694, 0.07537784, 0.12909944, ..., 0.        , 0.12126781,
        1.        ]])

In [16]:
# Get the index of the row containing the movie in the df
index = df[df["name"] == "Superman II"].index[0]

# Get a list of (row index, similarity score) pairs for this movie
scores = list(enumerate(cs_matrix[index]))

# Sort the list by similarity, highest first
scores = sorted(scores, key=lambda x: x[1], reverse=True)
scores

[(8, 1.0000000000000002),
 (180, 0.6363636363636365),
 (1107, 0.3481553119113957),
 (2461, 0.3223291856101521),
 (567, 0.3113995776646092),
 (4245, 0.3113995776646092),
 (209, 0.2860387767736777),
 (222, 0.2860387767736777),
 (464, 0.2860387767736777),
 (513, 0.2860387767736777),
 (804, 0.2860387767736777),
 (989, 0.2860387767736777),
 (1549, 0.2860387767736777),
 (1995, 0.2860387767736777),
 (2484, 0.2860387767736777),
 (3853, 0.2860387767736777),
 (4488, 0.2860387767736777),
 (1405, 0.2842676218074806),
 (301, 0.2727272727272727),
 (327, 0.2727272727272727),
 (341, 0.2727272727272727),
 (505, 0.2727272727272727),
 (790, 0.2727272727272727),
 (1090, 0.2727272727272727),
 (2460, 0.2727272727272727),
 (2633, 0.2727272727272727),
 (2785, 0.2727272727272727),
 (3033, 0.2727272727272727),
 (3314, 0.2727272727272727),
 (3348, 0.2727272727272727),
 (3704, 0.2727272727272727),
 (3730, 0.2727272727272727),
 (4416, 0.2727272727272727),
 (4625, 0.2727272727272727),
 (4855, 0.2727272727272727),
 

In [17]:
# Drop the first entry, which is the movie itself (similarity 1.0)
scores = scores[1:]

In [18]:
#get top 5 movie titles
for movie in scores[:5]:
    mv = df["name"][movie[0]]
    print(mv)

Superman III
Company Business
Behind Enemy Lines
Superman IV: The Quest for Peace
Snow White and the Huntsman
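
The lookup, sort, and top-5 steps above can be wrapped into one reusable function. The sketch below is a hypothetical helper (`recommend_similar` is not part of the notebook or of any library); it assumes `titles` is a plain list and `similarity` is any 2-D structure indexable by row, and it uses toy data so that it runs on its own.

```python
def recommend_similar(title, titles, similarity, n=5):
    """Return the n titles most similar to `title` (hypothetical helper)."""
    if title not in titles:
        raise ValueError(f"{title!r} not found")
    index = titles.index(title)
    # Pair every other row position with its similarity to the chosen title
    scores = [(i, s) for i, s in enumerate(similarity[index]) if i != index]
    # Highest similarity first
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [titles[i] for i, _ in scores[:n]]

# Toy data standing in for df["name"].tolist() and cs_matrix
toy_titles = ["A", "B", "C"]
toy_similarity = [
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.1],
    [0.7, 0.1, 1.0],
]
print(recommend_similar("A", toy_titles, toy_similarity, n=2))  # ['C', 'B']
```

With the notebook's data, the call would presumably look like `recommend_similar("Superman II", df["name"].tolist(), cs_matrix)`. Excluding the query by position rather than dropping the first sorted entry avoids removing the wrong row when another movie also has a similarity of 1.0.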


### Collaborative filtering

- Focuses on user behavior.
- Recommends items based on the preferences of users who are similar to the target user.

In [42]:
users = {
    "name":['user1', 'user2', 'user3', 'user4', 'user5']
}
watched_movies = {
    'name': ['user1', 'user1', 'user2', 'user3', 'user3', 'user3', 'user4', 'user4', 'user5', 'user5', 'user5'],
    'movie_index': [1, 21, 21, 2, 32, 4, 2, 1, 21, 7, 8]
}
users = pd.DataFrame(users)
watched_movies = pd.DataFrame(watched_movies)
#Now we join the 2 dataframes into 1 to have a clearer view
users = pd.merge(users, watched_movies, on='name')

In [43]:
users

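| | name | movie_index |
|---|---|---|
| 0 | user1 | 1 |
| 1 | user1 | 21 |
| 2 | user2 | 21 |
| 3 | user3 | 2 |
| 4 | user3 | 32 |
| 5 | user3 | 4 |
| 6 | user4 | 2 |
| 7 | user4 | 1 |
| 8 | user5 | 21 |
| 9 | user5 | 7 |
| 10 | user5 | 8 |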


In [44]:
# Add a new column with value 1 for each row
users['watched'] = 1

# Use pivot_table to create the matrix
result = users.pivot_table(index='name', columns='movie_index', values='watched', fill_value=0)
result.head()

| name \ movie_index | 1 | 2 | 4 | 7 | 8 | 21 | 32 |
|---|---|---|---|---|---|---|---|
| user1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| user2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| user3 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| user4 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| user5 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |


In [26]:
result = result.T
result.head()

| movie_index \ name | user1 | user2 | user3 | user4 | user5 |
|---|---|---|---|---|---|
| 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [27]:
correlation = result.corr()

In [29]:
correlation

| name | user1 | user2 | user3 | user4 | user5 |
|---|---|---|---|---|---|
| user1 | 1.0 | 0.645497 | -0.547723 | 0.3 | 0.091287 |
| user2 | 0.645497 | 1.0 | -0.353553 | -0.258199 | 0.471405 |
| user3 | -0.547723 | -0.353553 | 1.0 | 0.091287 | -0.75 |
| user4 | 0.3 | -0.258199 | 0.091287 | 1.0 | -0.547723 |
| user5 | 0.091287 | 0.471405 | -0.75 | -0.547723 | 1.0 |
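
By default, `DataFrame.corr()` computes the Pearson correlation between columns, which here means between users' watch vectors. As a sanity check, the following sketch recomputes two entries of the matrix by hand in plain Python; the vectors are copied from the pivot table above, and the `pearson` helper is written for illustration only.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Watch vectors over movies [1, 2, 4, 7, 8, 21, 32]
user1 = [1, 0, 0, 0, 0, 1, 0]
user2 = [0, 0, 0, 0, 0, 1, 0]
user3 = [0, 1, 1, 0, 0, 0, 1]

print(round(pearson(user1, user2), 6))  # 0.645497
print(round(pearson(user1, user3), 6))  # -0.547723
```

Note that user1 and user3 share no movies, yet their correlation is negative rather than zero: with binary watch data, Pearson correlation also penalizes users for watching different things.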


In [30]:
# Get movies watched by user1
movies_user1 = users[users['name'] == 'user1']['movie_index'].tolist()

# Find similar users based on correlation from most to least similar
similar_users = correlation['user1'].drop('user1').sort_values(ascending=False)

In [31]:
def get_recommendations(user_movies, similar_users, users):
    # Collect movie recommendations for the target user from the top 3 neighbors
    movie_recommendations = []
    for user in similar_users.index[:3]:
        # Get the list of movies watched by this neighbor
        movies_similar_user = users[users['name'] == user]['movie_index'].tolist()
        # Keep only the movies the target user has not watched yet
        new_movies = [movie for movie in movies_similar_user if movie not in user_movies]
        # Append them to our list of suggestions
        movie_recommendations.extend(new_movies)

    # Remove duplicates from recommendations
    return list(set(movie_recommendations))

In [32]:
movie_recommendations = get_recommendations(movies_user1, similar_users, users)

print("Recommendations for user1:", movie_recommendations)

Recommendations for user1: [8, 2, 7]


In [33]:
movies_user3 = users[users['name'] == 'user3']['movie_index'].tolist()
similar_users = correlation['user3'].drop('user3').sort_values(ascending=False)

movie_recommendations = get_recommendations(movies_user3, similar_users, users)

print("Recommendations for user3:", movie_recommendations)

Recommendations for user3: [1, 21]
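
The function above treats the top three neighbors equally, even when a neighbor's correlation with the target user is negative. One possible refinement, sketched below under stated assumptions, is to keep only positively correlated neighbors and rank each unseen movie by the summed similarity of the neighbors who watched it. `weighted_recommendations` is a hypothetical helper; the watch lists and user1's correlations are copied from the cells above into plain dictionaries so the example runs on its own.

```python
from collections import defaultdict

def weighted_recommendations(target, watched, similarity, top_k=3):
    """Rank unseen movies by the summed similarity of the neighbors who watched them."""
    seen = set(watched[target])
    # Keep only positively correlated neighbors, most similar first
    neighbors = sorted(
        ((user, sim) for user, sim in similarity[target].items()
         if user != target and sim > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )[:top_k]
    scores = defaultdict(float)
    for user, sim in neighbors:
        for movie in watched[user]:
            if movie not in seen:
                scores[movie] += sim
    # Highest accumulated score first; ties keep insertion order
    return sorted(scores, key=scores.get, reverse=True)

watched = {
    'user1': [1, 21],
    'user2': [21],
    'user3': [2, 32, 4],
    'user4': [2, 1],
    'user5': [21, 7, 8],
}
similarity = {
    'user1': {'user2': 0.645497, 'user3': -0.547723, 'user4': 0.3, 'user5': 0.091287},
}
print(weighted_recommendations('user1', watched, similarity))  # [2, 7, 8]
```

Movie 2 ranks first because it comes from user4, who is more similar to user1 than user5, the source of movies 7 and 8.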
