# Business Problem

Recommender systems are a set of methods and algorithms that suggest "relevant" items to users. Ideally, the suggestions are as pertinent to the user as possible, so that the user engages with them. Typically, the suggestions support decisions such as which product to purchase, which music to listen to, or which online news to read. Recommender systems are particularly useful when an individual needs to choose from a potentially overwhelming number of items that a service offers.

Learn more: [Recommender Systems](https://en.wikipedia.org/wiki/Recommender_system)

One widely used approach to the design of recommender systems is **collaborative filtering**. Collaborative filtering is based on the assumption that people who agreed in the past will agree in the future, and that they will like items similar to those they liked in the past. The system generates recommendations using only information about the rating profiles of different users or items.

**Item-based** (item-to-item) collaborative filtering recommends items based on the similarity between items, calculated from people's ratings of those items. Item-item collaborative filtering was invented and used by Amazon.com in 1998.
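
As a minimal sketch of the idea, assuming a tiny hand-made ratings table (the users, movies and scores below are invented for illustration and are not taken from MovieLens), item-to-item similarity can be computed as the Pearson correlation between the movie columns of a user-item matrix:

```python
import pandas as pd

# Hypothetical toy user-item matrix: rows are users, columns are movies.
toy_ratings = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 1.0, 2.0],
        "Movie B": [4.0, 5.0, 2.0, 1.0],
        "Movie C": [1.0, 2.0, 5.0, 4.0],
    },
    index=["u1", "u2", "u3", "u4"],
)

# Pearson correlation between every pair of movie columns.
item_similarity = toy_ratings.corr()

# Movies most similar to "Movie A", excluding the movie itself.
similar_to_a = item_similarity["Movie A"].drop("Movie A").sort_values(ascending=False)
print(similar_to_a)
```

In this toy table, "Movie B" is rated much like "Movie A" by the same users, so it ranks first, while "Movie C" is rated in the opposite way and ranks last.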

**User-based collaborative filtering** predicts the items a user might like from the ratings given to those items by other users whose tastes are similar to the target user's.
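
The following sketch illustrates this under stated assumptions: the ratings are invented, similarity is measured with Pearson correlation, and only positively correlated users count as neighbours. The unseen movie is scored as the similarity-weighted average of the neighbours' ratings, which is one common choice rather than the only one.

```python
import pandas as pd

# Hypothetical toy ratings; NaN marks a movie the user has not rated.
toy_ratings = pd.DataFrame(
    {
        "Movie A": [5.0, 5.0, 1.0, 4.0],
        "Movie B": [3.0, 4.0, 5.0, 3.0],
        "Movie C": [1.0, 2.0, 5.0, 2.0],
        "Movie D": [float("nan"), 5.0, 2.0, 3.0],
    },
    index=["target", "u1", "u2", "u3"],
)

# Correlate users on their co-rated movies (pandas skips NaN pairwise).
user_similarity = toy_ratings.T.corr()
neighbours = user_similarity["target"].drop("target")

# Keep only positively correlated neighbours.
neighbours = neighbours[neighbours > 0]

# Predict the unseen movie as the similarity-weighted average of their ratings.
neighbour_ratings = toy_ratings.loc[neighbours.index, "Movie D"]
predicted = (neighbours * neighbour_ratings).sum() / neighbours.sum()
print(neighbours)
print(predicted)
```

Here `u2` rates movies in the opposite way to the target user, so it is excluded, and the prediction for "Movie D" falls between the ratings given by `u1` and `u3`.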

**Matrix factorization algorithms** work by decomposing the user-item interaction matrix into the product of two lower-dimensional rectangular matrices. The prediction results can be improved by assigning different regularization weights to the latent factors based on items' popularity and users' activeness.
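
A rough NumPy sketch of the decomposition is shown here, assuming a small invented ratings matrix, plain stochastic gradient descent, and a single shared regularization weight. The hyperparameter values are illustrative guesses, and this is not the implementation used by any particular library.

```python
import numpy as np

# Hypothetical toy user-item matrix; 0 marks a missing rating.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
observed = R > 0

# Assumed hyperparameters, chosen only for this illustration.
n_factors, lr, reg, n_epochs = 2, 0.01, 0.02, 2000

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(R.shape[0], n_factors))  # user factors
Q = rng.normal(scale=0.1, size=(R.shape[1], n_factors))  # item factors


def rmse(P, Q):
    """Root mean squared error on the observed ratings only."""
    err = (R - P @ Q.T)[observed]
    return float(np.sqrt(np.mean(err ** 2)))


rmse_before = rmse(P, Q)
for _ in range(n_epochs):
    for u, i in zip(*np.nonzero(observed)):
        e = R[u, i] - P[u] @ Q[i]            # prediction error for this rating
        p_u = P[u].copy()                    # keep the old user factors for the item update
        P[u] += lr * (e * Q[i] - reg * P[u])
        Q[i] += lr * (e * p_u - reg * Q[i])
rmse_after = rmse(P, Q)

# The product of the two small matrices fills in every cell, including missing ones.
R_hat = P @ Q.T
print(rmse_before, rmse_after)
```

The entries of `R_hat` at positions where `R` is 0 are the model's estimates for ratings the users never gave.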

Make 10 movie recommendations for the user whose ID is given, using the item-based and user-based recommender methods.

The dataset is provided by MovieLens, a movie recommendation service. It contains movies along with the ratings given to them: 20,000,263 ratings across 27,278 movies. The dataset was generated on October 17, 2016. It includes 138,493 users and covers 09 January 1995 to 31 March 2015. Users were selected at random, and every selected user had rated at least 20 movies.

**movie.csv**
* movieId: Unique movie number.
* title: Movie name
* genres: Movie genres

**rating.csv**
* userId: Unique user number. (UniqueID)
* movieId: Unique movie number.
* rating: Rating given to the movie by the user
* timestamp: Date of the rating

# Data Preprocessing

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)
pd.set_option('display.expand_frame_repr', False)
import plotly.express as px
from pandas_profiling import ProfileReport
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from surprise import Reader, SVD, Dataset, accuracy
from surprise.model_selection import GridSearchCV, train_test_split, cross_validate

In [None]:
def check_df(dataframe, head=10):
    print("##################### Head #####################")
    print(dataframe.head(head))
    print("##################### Variables #####################")
    print(dataframe.columns)
    print("##################### Descriptive Stats #####################")
    print(dataframe.describe().T)
    print("##################### Null Values #####################")
    print(dataframe.isnull().sum())
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Info #####################")
    print(dataframe.info())

In [None]:
movie = pd.read_csv('../input/movie-rating-datasets/movie.csv')
rating = pd.read_csv('../input/movie-rating-datasets/rating.csv')

In [None]:
check_df(movie)

In [None]:
check_df(rating)

In [None]:
profile = ProfileReport(rating, title="Pandas Profiling Report")
profile

In [None]:
df = movie.merge(rating, how="left", on="movieId")
df.head()

In [None]:
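# Number of ratings per title, kept as a Series so the code does not depend on
# the column name value_counts() produces (it differs between pandas versions).
comment_counts = df["title"].value_counts()

In [None]:
# Movies with 1000 ratings or fewer are treated as rare and excluded.
rare_movies = comment_counts[comment_counts <= 1000].index
common_movies = df[~df["title"].isin(rare_movies)]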

In [None]:
popular_movies = common_movies["title"].value_counts().head(20)

In [None]:
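# Sort by total rating so that head(20) below picks the top movies rather than the first 20 titles alphabetically.
popular_movies_by_ratings = common_movies.groupby("title").agg({"rating": "sum"}).reset_index().sort_values("rating", ascending=False)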

In [None]:
popular_movies_by_ratings["all"] = "all"
fig = px.treemap(popular_movies_by_ratings.head(20), path=["all", 'title'], 
                 values='rating', color=popular_movies_by_ratings["rating"].head(20), hover_data=["title"])
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()

In [None]:
fig = go.Figure(go.Bar(
            x=popular_movies.values,
            y=popular_movies.index,
            orientation='h'))

fig.show()

In [None]:
user_movie_df = common_movies.pivot_table(index=["userId"], columns=["title"], values="rating")

In [None]:
user_movie_df.head(10)

## User-Based Collaborative Filtering

In [None]:
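# Take the single sampled value explicitly; calling int() on a one-element array is deprecated in recent NumPy.
random_user = int(pd.Series(user_movie_df.index).sample(1, random_state=45).values[0])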
random_user_df = user_movie_df[user_movie_df.index == random_user]
movies_watched = random_user_df.columns[random_user_df.notna().any()].tolist()

In [None]:
movies_watched # by random user

In [None]:
movies_watched_df = user_movie_df[movies_watched]

In [None]:
movies_watched_df

In [None]:
user_movie_count = movies_watched_df.T.notnull().sum()
user_movie_count

In [None]:
user_movie_count = user_movie_count.reset_index()
user_movie_count.columns = ["userId", "movie_count"]
# Keep only users who have watched more than 60% of the movies the random user has watched.
perc = len(movies_watched) * 60 / 100
users_same_movies = user_movie_count[user_movie_count["movie_count"] > perc]["userId"]

In [None]:
final_df = pd.concat([movies_watched_df[movies_watched_df.index.isin(users_same_movies)],
                      random_user_df[movies_watched]])

In [None]:
final_df

In [None]:
# User rating correlations
corr_df = final_df.T.corr().unstack().sort_values().drop_duplicates()

corr_df = pd.DataFrame(corr_df, columns=["corr"])

corr_df.index.names = ['user_id_1', 'user_id_2']

corr_df = corr_df.reset_index()

corr_df

Now that we have the correlations between users' ratings, we can recommend films to our random user based on the ratings of the users most similar to them.

In [None]:
top_users = corr_df[(corr_df["user_id_1"] == random_user) & (corr_df["corr"] >= 0.65)][
    ["user_id_2", "corr"]].reset_index(drop=True)

top_users = top_users.sort_values(by='corr', ascending=False)

top_users.rename(columns={"user_id_2": "userId"}, inplace=True)

In [None]:
top_users_ratings = top_users.merge(rating[["userId", "movieId", "rating"]], how='inner')

top_users_ratings = top_users_ratings[top_users_ratings["userId"] != random_user]

In [None]:
top_users_ratings.sort_values("corr", ascending=False).head(10).style.background_gradient(subset= "corr", cmap='Reds')

Correlation alone is not enough for a movie recommendation. Weighting each user's rating by that user's correlation with the random user gives a more accurate result.
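
As a small worked example with invented numbers (two hypothetical neighbours and two hypothetical movies), the sketch below shows how the weighting can separate movies that a plain average would rank equally:

```python
import pandas as pd

# Hypothetical neighbours: their correlation with the target user and their ratings.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "corr": [0.9, 0.9, 0.7, 0.7],
    "movieId": [10, 20, 10, 20],
    "rating": [4.0, 5.0, 5.0, 4.0],
})

# Both movies have the same plain average rating (4.5).
plain_mean = toy.groupby("movieId")["rating"].mean()

# Weight each rating by the rater's correlation, then average per movie.
toy["weighted_rating"] = toy["corr"] * toy["rating"]
scores = toy.groupby("movieId")["weighted_rating"].mean().sort_values(ascending=False)
print(plain_mean)
print(scores)
```

Movie 20 comes out ahead because the more similar neighbour (correlation 0.9) gave it the higher rating.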

In [None]:
top_users_ratings['weighted_rating'] = top_users_ratings['corr'] * top_users_ratings['rating']
recommendation_df = top_users_ratings.groupby('movieId').agg({"weighted_rating": "mean"})
recommendation_df = recommendation_df.reset_index()
movies_to_be_recommend = recommendation_df[recommendation_df["weighted_rating"] > 3.5].merge(movie[["movieId", "title"]])
# movies_watched holds titles, so filter on title (not movieId) before taking the top 5.
movies_to_be_recommend = movies_to_be_recommend[~movies_to_be_recommend["title"].isin(movies_watched)]
movies_to_be_recommend.sort_values("weighted_rating", ascending=False).iloc[0:5]
# Up to 5 movies recommended.

## Item-Based Recommendation

In [None]:
# Random user chosen
user = 108170

Choose the most recently rated movie among those to which the user gave the highest rating (5.0).

In [None]:
movie_id = rating[(rating["userId"] == user) & (rating["rating"] == 5.0)].sort_values(by="timestamp", ascending=False)["movieId"][0:1].values[0]
movie_name = movie[movie["movieId"] == movie_id]["title"].values[0]

In [None]:
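# Keep the title in movie_name and store the rating column under a separate name,
# so the cell can be re-run without overwriting the title.
movie_ratings = user_movie_df[movie_name]
user_movie_df.corrwith(movie_ratings).sort_values(ascending=False).head(10)

In [None]:
# Choose based on correlation of ratings; the top entry is normally the movie itself (correlation 1.0), so skip it.
user_movie_df.corrwith(movie_ratings).sort_values(ascending=False).iloc[1:6]
# 5 movies recommended.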

## Matrix Factorization

First we prepare the dataset, then we estimate the missing ratings with latent variables. For more detailed information: [Matrix Factorization](https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems))
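
The `SVD` model from Surprise, used in the modeling cells of this section, estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user's and the item's latent factor vectors. The sketch below reproduces that arithmetic by hand. All parameter values are invented for illustration, and the rating scale of 0.5 to 5 is an assumption.

```python
import numpy as np

# Hypothetical learned parameters for one user and one movie.
global_mean = 3.5                 # mu: mean of all training ratings
user_bias, item_bias = 0.2, 0.4   # b_u, b_i
p_u = np.array([0.5, -0.2])       # latent factors of the user
q_i = np.array([0.6, 0.1])        # latent factors of the movie

# Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u
raw_estimate = global_mean + user_bias + item_bias + float(q_i @ p_u)

# Clip to the rating scale, assumed here to be 0.5 to 5.
estimate = float(np.clip(raw_estimate, 0.5, 5.0))
print(estimate)
```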

In [None]:
movie_ids = [130219, 356, 4422, 541]
sample_df = df[df.movieId.isin(movie_ids)]
sample_df.head()

In [None]:
# MovieLens 20M ratings run from 0.5 to 5 in half-star steps.
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(sample_df[['userId',
                                       'movieId',
                                       'rating']], reader)

### Modeling

In [None]:
trainset, testset = train_test_split(data, test_size=.25)
svd_model = SVD()
svd_model.fit(trainset)
predictions = svd_model.test(testset)

accuracy.rmse(predictions)

In [None]:
svd_model.predict(uid=1.0, iid=356, verbose=True)

### Model Tuning

In [None]:
param_grid = {'n_epochs': [5, 10, 20],
              'lr_all': [0.002, 0.005, 0.007]}


gs = GridSearchCV(SVD,
                  param_grid,
                  measures=['rmse', 'mae'],
                  cv=3,
                  n_jobs=-1,
                  joblib_verbose=True)

gs.fit(data)

print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

In the last stage, the model is retrained with the best parameters on the whole data set. Because this is only an example, just a few hyperparameters were tuned above; this is a simple **hyperparameter optimization**, not an exhaustive one.

In [None]:
svd_model = SVD(**gs.best_params['rmse'])

data = data.build_full_trainset()
svd_model.fit(data)

svd_model.predict(uid=1.0, iid=356, verbose=True)