## Demographic

In [2]:
from os import sep
import pandas as pd
import numpy as np
from classes import User, Movie
from scipy.stats.stats import pearsonr


# Genres
generos = pd.read_csv("data/genre.txt", names=["genre_id", "genre_name"], sep="\t")


# Read users
users_df = pd.read_csv(
    "data/users.txt", names=["user_id", "edad", "gender", "occupation"], sep="\t"
)

# Read movies
all_genre = generos.genre_name.values.tolist()
all_genre = ["movie_id"] + all_genre + ["title"]
films = []
films_df = pd.read_csv("data/items.txt", names=all_genre, sep="\t")
for idx, row in films_df.iterrows():
    films.append(Movie(row["movie_id"], row[all_genre].values.tolist(), row["title"]))


# Ratings
ratings = pd.read_csv(
    "data/u1_base.txt", names=["user_id", "movie_id", "rating"], sep="\t"
)

ratings = ratings.merge(users_df, on="user_id")
print(f"Data: {len(users_df)} users, {len(films_df)} movies, {len(ratings)} ratings")


Data: 943 users, 1682 movies, 80000 ratings


In [18]:
# -------------------------
# Demographic recos
def recommend_me_demographic(user,n):
    user = users_df[users_df["user_id"]==user]

    # Search for films seen by this profession
    aux_ratings = ratings[ratings["occupation"] == user.occupation.values[0]]
    aux_ratings = ratings[ratings["gender"] == user.gender.values[0]]
    aux_ratings = (
        aux_ratings[["movie_id", "rating"]]
        .groupby("movie_id")
        .agg(count=("rating", "count"), mean=("rating", "mean"))
        .reset_index()
    )
    C = aux_ratings["mean"].mean()
    M = aux_ratings["count"].quantile(0.9)

    def weighted_rating(x, m=M, C=C):
        v = x["count"]
        R = x["mean"]
        # Calculation based on the IMDB formula
        return (v / (v + m) * R) + (m / (m + v) * C)

    aux_ratings["score"] = aux_ratings.apply(weighted_rating, axis=1)
    aux_ratings = aux_ratings.merge(films_df,on="movie_id")
    # print(aux_ratings.sort_values("score", ascending=False))
    return aux_ratings.sort_values("score")["title"].values[:n]

In [19]:
def get_top_n_demographic(users,n):
    top_n = defaultdict(list)
    for user in users:
        recos = recommend_me_demographic(user,n)
        top_n[user]=recos
    return top_n

In [20]:

print(recommend_me_demographic(5,5))


['Cable Guy, The (1996)' 'Beavis and Butt-head Do America (1996)'
 'Vegas Vacation (1997)' 'Spawn (1997)' 'Airheads (1994)']


In [None]:
get_top_n_demographic(users_df.user_id.unique(),5)

###  3. Collaborative Filtering
This system matches persons with similar interests and provides recommendations based on this matching. Collaborative filters do not require item metadata like its content-based counterparts. 

##### Example - 
If person A likes 3 movies, say Interstellar, Inception and Predestination, and person B likes Inception, Predestination and The Prestige, then they have almost similar interests. We can say with some certainty that A should like The Prestige and B should like Interstellar. 

The collaborative filtering algorithm uses “User Behavior” for recommending items. This is one of the most commonly used algorithms in the industry as it is not dependent on any additional information. 



The collaborative filtering can be modeled in any of the following ways:-

#### User-User collaborative filtering
It finds similarity scores between users to pick the most similar users and  recommends products which these similar users have liked or bought previously.

For movies, this algorithm finds the similarity between each user based on the ratings they have previously given to different movies. The prediction of an item for a user u is calculated by computing the weighted sum of the user ratings given by other users to an item i.
Thus the prediction for user 'u' is given as:

P(u,i) = E [r(v,i) * s(u,v)] / E [S(u,v)]

where,
* Pu,i is the prediction of an item
* Rv,i is the rating given by a user v to a movie i
* Su,v is the similarity between users

Basic steps for this-
* For predictions we need the similarity between the user u and v. We can make use of Pearson correlation.
* First we find the items rated by both the users and based on the ratings, correlation between the users is calculated.
* The predictions can be calculated using the similarity values. This algorithm, first of all calculates the similarity between each user and then based on each similarity calculates the predictions. Users having higher correlation will tend to be similar.


Disadvantage-
This algorithm is quite time consuming as it involves calculating the similarity for each user and then calculating prediction for each similarity score. One way of handling this problem is to select only a few users (neighbors) instead of all to make predictions, i.e. instead of making predictions for all similarity values, we choose only few similarity values. There are various ways to select the neighbors:

* Select a threshold similarity and choose all the users above that value
* Randomly select the users
* Arrange the neighbors in descending order of their similarity value and choose top-N users
* Use clustering for choosing neighbors



This algorithm is useful when the number of users is less. Its not effective when there are a large number of users as it will take a lot of time to compute the similarity between all user pairs. This leads us to item-item collaborative filtering, which is effective when the number of users is more than the items being recommended.


#### Item-Item collaborative filtering
Similarity is found between each items. Thus for movies similarities between movies is found and based on that recommendations of similar movies are made for the user. 

This algorithm works similar to user-user collaborative filtering with just a little change – instead of taking the weighted sum of ratings of “user-neighbors”, we take the weighted sum of ratings of “item-neighbors”.

##### What will happen if a new user or a new item is added in the dataset? 
It is called a Cold Start. 

* Visitor Cold Start
Visitor Cold Start means that a new user is introduced in the dataset. Since there is no history of that user, the system does not know the preferences of that user. It becomes harder to recommend products to that user. So, how can we solve this problem? One basic approach could be to apply a popularity based strategy, i.e. recommend the most popular products. These can be determined by what has been popular recently overall or regionally. Once we know the preferences of the user, recommending products will be easier.

*  Product Cold Start
Product Cold Start means that a new product is launched in the market or added to the system. User action is most important to determine the value of any product. More the interaction a product receives, the easier it is for our model to recommend that product to the right user. We can make use of Content based filtering to solve this problem. The system first uses the content of the new product for recommendations and then eventually the user actions on that product.

In [None]:
from collections import defaultdict


In [None]:
import pandas as pd 
import numpy as np 

In [None]:
# credits = pd.read_csv("D:/C-Drive Project's Datasets/tmdb-5000-movie-dataset/tmdb_5000_credits.csv")
movies = pd.read_csv("data/processed/movies.csv")

In [None]:
movies.head()

In [None]:
from surprise import Reader, Dataset, SVD, KNNBasic
from surprise.model_selection import cross_validate

reader = Reader()

In [None]:


ratings = pd.read_csv("data/u1_base.txt",sep="\t",names=["user_id", "movie_id", "rating"])
ratings.head()

In [None]:
data = Dataset.load_from_df(ratings[["user_id", "movie_id", "rating"]], reader)

In [None]:
svd = KNNBasic()
cross_validate(svd, data, measures=['RMSE', 'MAE'],cv=5)

In [None]:
trainset = data.build_full_trainset()
svd.fit(trainset)

In [None]:
data

In [None]:
recos = svd.test(trainset.build_testset())

In [None]:
def get_top_n(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [None]:
top_n = get_top_n(recos, n=10)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

 For movies id = 654 we got estimated prediction of 3.83.
 
This recommender system doesn't care what the movie is (or what it contains). It works purely on the basis of an assigned movie ID and tries to predict ratings based on how the other users have predicted the movie.

#### Conclusion
We create recommenders using demographic , content- based and collaborative filtering. While demographic filtering is very elemantary and cannot be used practically, Hybrid Systems can take advantage of content-based and collaborative filtering as the two approaches are proved to be almost complimentary. This model was very baseline and only provides a fundamental framework to start with.

References:
* https://www.kaggle.com/rounakbanik/movie-recommender-systems
* https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/#
        