# **Module 2: AI problem solving**
## DAT410

### Group 29 

We hereby declare that we have both actively participated in solving every exercise. All solutions are entirely our own work, without having taken part of other solutions.

___


## 1) Article summary and take-aways

The articles highlight the difficulty and complexity of building accurate recommendation systems. *The Netflix Prize* was a competition launched by Netflix in 2006 that challenged participants to develop algorithms that could beat the accuracy of the company's existing recommendation system. The contest fostered an environment that was both collaborative and competitive, which led to major advances in recommendation algorithms. One notable design feature was the large dataset provided (over 100 million movie ratings from Netflix users), which made it possible to develop algorithms that predict preferences from patterns observed across a vast user base. Another was that multiple methods were combined to improve accuracy. Different models may capture different aspects of user behavior and preferences, so blending them creates a more robust recommendation system.
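As a rough illustration of the blending idea (not the method used by the prize winners), the sketch below averages the predictions of two hypothetical models and compares root-mean-square errors. All ratings and model outputs are made-up values, and `rmse` and `blend` are names introduced only for this example.

```python
import numpy as np

# Made-up true ratings and predictions from two hypothetical models.
true_ratings = np.array([4.0, 3.0, 5.0, 2.0, 4.0])
model_a = np.array([4.5, 3.5, 4.0, 2.0, 4.5])
model_b = np.array([3.5, 2.0, 5.0, 3.0, 4.0])

def rmse(pred, target):
    """Root-mean-square error between predictions and true ratings."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two prediction vectors."""
    return weight_a * pred_a + (1.0 - weight_a) * pred_b

blended = blend(model_a, model_b)

print("RMSE model A:", rmse(model_a, true_ratings))
print("RMSE model B:", rmse(model_b, true_ratings))
print("RMSE blend:  ", rmse(blended, true_ratings))
```

In this toy case the blend has a lower error than either model because their errors partly cancel. That is not guaranteed in general; it depends on how correlated the models' errors are.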

Our major takeaways were that combining different methods led to the best recommendations, and the ways the articles suggest for improving on earlier algorithms. In the end, our implementation is simpler and does not implement all of the suggestions from the articles, but reading them gave us ideas and context for the problem.

___

## 2) Implementation Discussion

### System description (why we made the choices we did)

A recommendation system can be implemented using various methods. One approach, *content-based filtering*, recommends new movies to a user based on features of movies they have already rated. This means that we use a single user's ratings to find their most liked films and examine the given attributes of those films, in this case the genres. Then, we look for movies of the same or similar genres that the user has not yet rated. This approach is purely mathematical, as it relies on vectors and distance calculations rather than a trained model. There are different ways to measure how close two vectors are, for example Euclidean distance or cosine similarity. We chose to implement a k-nearest neighbors (KNN) search with cosine distance in this assignment because we felt it was the most appropriate choice given the data we received and the nature of the assignment.
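To make the distance calculation concrete, here is a minimal sketch of cosine similarity between genre vectors. The three-genre layout and the example vectors are made up for illustration, and `cosine_sim` is a helper written for this example rather than part of our implementation.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Hypothetical genre columns: [action, comedy, horror]
liked_movie = [0, 0, 1]       # a horror movie the user rated highly
pure_horror = [0, 0, 1]       # same genre
horror_comedy = [0, 1, 1]     # shares one genre
pure_action = [1, 0, 0]       # shares no genre

for name, vec in [("pure_horror", pure_horror),
                  ("horror_comedy", horror_comedy),
                  ("pure_action", pure_action)]:
    print(name, round(cosine_sim(liked_movie, vec), 3))
```

Movies that share more genres with the liked movie score closer to 1, and movies with no genre in common score 0. The cosine distance used by the KNN search is one minus this similarity.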

Another approach is *collaborative filtering*, which matches the profiles of multiple users to each other. It recommends movies that users with similar rating patterns have enjoyed. We struggled to come up with a good way to combine this with the content-based system, so we did not use it in the final predictions.
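For reference, one simple form of user-based collaborative filtering could look like the sketch below. It is not part of our final system. The rating matrix is made up, `recommend_from_neighbor` is a hypothetical helper, and treating unrated movies as zeros when comparing users is a simplifying assumption.

```python
import numpy as np
import pandas as pd

# Made-up rating matrix: rows are users, columns are movies, 0 = not rated.
ratings = pd.DataFrame(
    [[5, 4, 0, 0],
     [5, 5, 4, 1],
     [1, 0, 5, 4]],
    index=["u1", "u2", "u3"],
    columns=["m1", "m2", "m3", "m4"],
    dtype=float,
)

def recommend_from_neighbor(ratings, user, n=1):
    """Recommend up to n unseen movies rated highest by the most similar user."""
    target = ratings.loc[user].to_numpy()
    others = ratings.drop(index=user)
    mat = others.to_numpy()

    # Cosine similarity between the target user and every other user
    norms = np.linalg.norm(mat, axis=1) * np.linalg.norm(target)
    sims = mat @ target / np.where(norms == 0, 1.0, norms)
    neighbor = others.index[int(np.argmax(sims))]

    # Rank the neighbor's ratings for movies the target user has not rated
    unseen = ratings.columns[target == 0]
    ranked = ratings.loc[neighbor, unseen].sort_values(ascending=False)
    return neighbor, ranked[ranked > 0].index[:n].tolist()

print(recommend_from_neighbor(ratings, "u1"))
```

Here `u1` is matched with `u2`, who rated the same movies similarly, and receives the unseen movie that `u2` rated highest.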

Additionally, a model could be trained for each user, with the movies they have rated as training data and the movies' features as inputs, in order to predict which of the unrated movies they are most likely to want to watch. However, the data provided in this assignment is insufficient for this type of solution.

### Strengths and Weaknesses

The strength of our system is that it finds movies similar to what the user already prefers. If a user loves horror movies and rates them highly, our system will keep recommending that type of movie. This is, however, also an inherent weakness: how would a user ever get into new styles of movies? Perhaps the user would actually love reality TV if they ever tried watching it, or perhaps the user gets tired of horror movies and wants to try something new. Our system will keep recommending horror movies, even for a while after the user has started to rate other kinds of movies highly. This could, for example, be countered by weighting recently rated genres higher, but it remains an issue. Ideally, we would combine our approach with some *collaborative filtering*. In our case, we struggled to come up with a combination that would fit within five recommendations. We do like the idea of making the fifth movie a wildcard from collaborative filtering, although we did not implement this as we were not sure of a good approach.
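The recency idea could be sketched as follows, assuming we knew the order in which a user rated their movies. This is an assumption: we do not use any such ordering in our implementation. The function name `recency_weighted_profile`, the `half_life` parameter, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def recency_weighted_profile(genres, ratings, half_life=2.0):
    """Genre profile where newer ratings count more.

    genres:  DataFrame (movies x genres), rows ordered oldest to newest
    ratings: Series of ratings with the same index as genres
    """
    age = np.arange(len(ratings))[::-1]      # 0 for the newest rating
    weights = 0.5 ** (age / half_life)       # weight halves every half_life ratings
    profile = genres.T.dot(ratings * weights)
    return profile / profile.sum()

# Toy history: four older horror movies, then two recent comedies, all rated 5.
movies = ["h1", "h2", "h3", "h4", "c1", "c2"]
genres = pd.DataFrame(
    {"horror": [1, 1, 1, 1, 0, 0], "comedy": [0, 0, 0, 0, 1, 1]},
    index=movies,
)
ratings = pd.Series([5.0] * 6, index=movies)

unweighted = genres.T.dot(ratings)
unweighted = unweighted / unweighted.sum()
weighted = recency_weighted_profile(genres, ratings)

print(unweighted)
print(weighted)
```

Without weighting, horror dominates the profile. With the decay, the two recent comedies outweigh the four older horror movies, so the recommendations would shift sooner.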

Further improvements include using more features beyond ratings and genres, for instance movie keywords, actors, director, release year, etc. If we had more data, better predictions could be made and more factors could be taken into consideration. For example, a viewer may love movies made by a certain director because of their storytelling or style, regardless of the genre of each movie. In that case, our system would not be able to make proper recommendations.

___

## 3) Why is it difficult to assess the quality before deployment?

Recommendation systems rely on historical user data to make predictions. The data itself, however, may be skewed: only a small subset of items receives most of the ratings, while a much larger number of items have few or no ratings. This skew may concentrate recommendations around popular items and neglect less popular ones. It becomes a balancing issue, because the system may be good at predicting what a user will like based on their past behavior while failing to introduce them to new or diverse content.
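One way to check for this kind of skew is to measure what share of all ratings goes to the most-rated items. The sketch below assumes a rating matrix where 0 means "not rated", as in our data. The helper `top_share` and the toy matrix are made up for illustration.

```python
import numpy as np
import pandas as pd

def top_share(reviews, top_fraction=0.1):
    """Share of all ratings that go to the most-rated fraction of movies."""
    counts = (reviews != 0).sum(axis=0).sort_values(ascending=False)
    n_top = max(1, int(np.ceil(top_fraction * len(counts))))
    total = counts.sum()
    return float(counts.iloc[:n_top].sum() / total) if total > 0 else 0.0

# Toy matrix: 4 users x 10 movies. Everyone rated movie 0, few rated anything else.
toy = np.zeros((4, 10))
toy[:, 0] = 5
toy[0, 1] = 3
toy[1, 2] = 4
toy_reviews = pd.DataFrame(toy, columns=[f"movie_{i}" for i in range(10)])

print(top_share(toy_reviews, top_fraction=0.1))
```

In the toy matrix, a single movie (the top 10%) accounts for four of the six ratings.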

Personalized recommendations for new or inactive users are also difficult, because there is limited or no history to base predictions on.

Essentially we have data on what movies the users like, but we do not have any data on how much they like the recommendations. If the system is deployed for a while, we can use the rating of the recommendations as additional data for our model to make further adjustments to the recommendations.

If we had data on recommendation satisfaction, we could address the previously mentioned example of the user who has a history of loving horror movies but no longer does. If we see that the user dislikes more and more horror recommendations, we can use this data to change our recommendations.
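Before deployment, the closest substitute we can think of is an offline check: hide a movie the user rated highly, rebuild the profile from the remaining ratings, and see whether the hidden movie is recommended. The sketch below is one possible version of such a check, not something we ran. `hit_at_k` and the toy data are hypothetical, and cosine similarity on genre vectors stands in for our KNN search.

```python
import numpy as np
import pandas as pd

def hit_at_k(genres, ratings, held_out, k=2):
    """True if held_out is among the top-k recommendations after hiding its rating."""
    visible = ratings.drop(held_out)
    visible = visible[visible != 0]

    # Rating-weighted genre profile built only from the visible ratings
    profile = genres.loc[visible.index].T.dot(visible).to_numpy(dtype=float)

    # Candidates: every movie the user has not (visibly) rated
    candidates = genres.drop(visible.index)
    mat = candidates.to_numpy(dtype=float)
    norms = np.linalg.norm(mat, axis=1) * np.linalg.norm(profile)
    sims = mat @ profile / np.where(norms == 0, 1.0, norms)

    top = candidates.index[np.argsort(-sims, kind="stable")[:k]].tolist()
    return held_out in top

# Toy data: the user likes horror; we hide the rating for "h3".
genres = pd.DataFrame(
    {"horror": [1, 1, 1, 0, 0], "comedy": [0, 0, 0, 1, 1]},
    index=["h1", "h2", "h3", "c1", "c2"],
)
ratings = pd.Series([5.0, 5.0, 4.0, 0.0, 0.0], index=genres.index)

print(hit_at_k(genres, ratings, "h3", k=1))
```

Such a check only tells us whether the system can recover movies the user already liked. It says nothing about how satisfied the user would be with recommendations they have never seen, which is the part that requires deployment.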

___

## Implementation

We start by reading the data from the CSV files into dataframes.

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

# Read in the data
user_reviews = pd.read_csv("user_reviews.csv", index_col=1).drop('Unnamed: 0', axis=1)
movie_genres = pd.read_csv('movie_genres.csv', index_col=1).drop('Unnamed: 0', axis=1)

This is mostly used for testing

In [2]:
movie_names = movie_genres.index.tolist()
movie_ratings_matrix = np.array(user_reviews)
movie_genres_matrix = np.array(movie_genres)

Here we experiment with KNN for finding movies that are similar to other movies.

We use two different approaches.

**Content based filtering:**

Here we use k-NN to find the five movies whose genres are closest to those of a given movie. In the example we use a Harry Potter movie, and we can see that the algorithm returns other Harry Potter movies.

**Collaborative filtering:**

We experiment with finding movies that were rated similarly by other users. This part is not fully implemented and was mostly used for experimentation.

In [3]:
def find_similar_movies(movie_name, movie_names, features_matrix):
    # Check if the movie is in our database
    if movie_name not in movie_names:
        return "Movie not found."
    
    knn_model = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=6, n_jobs=-1)
    knn_model.fit(features_matrix)

    # Find the index of the movie
    movie_index = movie_names.index(movie_name)

    # Get the feature vector for the selected movie
    movie_features = features_matrix[movie_index].reshape(1, -1)

    # Find the 6 nearest neighbors; the closest is normally the movie itself, which the caller drops with [1:]
    distances, indices = knn_model.kneighbors(movie_features)

    # Retrieve the names of the nearest neighbors
    similar_movies = [movie_names[index] for index in indices[0]]

    return similar_movies

test_movie = 'Harry Potter and the Order of the Phoenix'

# KNN Genre similarity
print(find_similar_movies(test_movie, movie_names, movie_genres_matrix)[1:])

# KNN User rating similarity
print(find_similar_movies(test_movie, movie_names, movie_ratings_matrix.copy().T)[1:])

['Harry Potter and the Prisoner of Azkaban', 'Harry Potter and the Chamber of Secrets', 'Harry Potter and the Half-Blood Prince', 'The Spiderwick Chronicles', 'Alice in Wonderland']
['Escape from L.A.', 'Sense and Sensibility', 'Glitter', 'Teen Wolf Too', 'Quantum of Solace']


**User profile creation**

Here we create a user profile, which returns the movies the user has already rated as well as the user's content profile: a vector of how much the user likes each genre.

In [16]:
def user_profile(reviews_df, genres_df, user_idx) -> tuple[pd.Series, pd.Series]:
    # All ratings given by this user; a rating of 0 is treated as "not rated"
    reviews = reviews_df.loc[user_idx]
    rated_movies = reviews[reviews != 0]

    # Genre indicators of the movies the user has rated
    genres_rated_movies = genres_df.loc[rated_movies.index]

    # Rating-weighted sum of genre indicators, normalized to sum to 1
    content_profile = genres_rated_movies.T.dot(rated_movies)
    content_profile /= np.sum(content_profile)

    return content_profile, rated_movies
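To illustrate what the content profile contains, here is a small worked example with made-up data: two genres and three movies, one of which is unrated. The movie names and ratings are hypothetical; the computation mirrors the rating-weighted genre sum used in `user_profile`.

```python
import pandas as pd

# Made-up genre indicators for three movies
genres = pd.DataFrame(
    {"genre_comedy": [1, 1, 0], "genre_drama": [1, 0, 1]},
    index=["A", "B", "C"],
)
# Made-up ratings for one user; 0 means "not rated"
ratings = pd.Series({"A": 5, "B": 3, "C": 0})

rated = ratings[ratings != 0]

# Each genre's score is the sum of the ratings of the rated movies in that genre:
# comedy = 5 (A) + 3 (B) = 8, drama = 5 (A) = 5
profile = genres.loc[rated.index].T.dot(rated)
profile = profile / profile.sum()   # normalize so the weights sum to 1

print(profile)
```

After normalization the profile is 8/13 for comedy and 5/13 for drama. The unrated movie `C` does not contribute.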

**Movie recommendation using k-NN**

By using the user's category profile and filtering out the movies they have already rated, we can make five recommendations based on that user's category preferences.

In [17]:
def find_movie_from_features(user_profile, movie_category_df, neighbors=5):

    user_categories, user_rated_movies = user_profile[0], user_profile[1]

    # Remove seen movies
    movie_category_df = movie_category_df.copy().drop(user_rated_movies.index)
    movie_names = movie_category_df.index.tolist()

    # Create and fit the KNN model
    knn_model = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=neighbors, n_jobs=-1)
    knn_model.fit(movie_category_df.values)

    # Turn the user's genre profile into a single-row feature matrix
    user_categories_matrix = np.array(user_categories.to_list()).reshape(1, -1)

    # Find the unseen movies closest to the profile (`neighbors` of them, 5 by default)
    distances, indices = knn_model.kneighbors(user_categories_matrix)

    # Retrieve the names of the nearest neighbors
    similar_movies = [movie_names[index] for index in indices[0]]

    return similar_movies, (distances[0], indices[0])



**Movie recommendations**

Using the `user_profile` and `find_movie_from_features` functions, we give each of the users five recommendations for movies to watch in the future.

In [18]:
user_names = ['Vincent', 'Edgar', 'Addilyn', 'Marlee', 'Javier']

# Recommendations for all users

for user in user_names:
    user_prof = user_profile(user_reviews, movie_genres, user)
    recommended_movies = find_movie_from_features(user_prof, movie_genres, 5)[0]
    print(f"Recommendations for {user}: {recommended_movies}\n")
    

Recommendations for Vincent: ['Kites', "Perrier's Bounty", 'Top Gun', 'Crouching Tiger, Hidden Dragon', 'The Good Thief']

Recommendations for Edgar: ['Safe Haven', 'Down in the Valley', 'Match Point', 'The Color Purple', 'Lies in Plain Sight']

Recommendations for Addilyn: ['Adaptation.', 'Stepmom', 'The Upside of Anger', 'The Intern', 'Ghost World']

Recommendations for Marlee: ['Striptease', 'Freeway', 'Novocaine', 'Made', 'The Informant!']

Recommendations for Javier: ['Chicken Run', 'Sinbad: Legend of the Seven Seas', 'Alpha and Omega 4: The Legend of the Saw Toothed Cave', 'Babe: Pig in the City', 'Babe']



Above, we see the recommended movies for the first five users. A manual look at the preferred genres of the user Javier and his recommendations suggests that the implemented system works as intended. For instance, his three most liked genres are comedy, drama, and family, and the movie *Chicken Run*, which he has not rated yet, is labelled with those genres among others.

In [27]:
# Javier's top 5 genres

user_prof[0].sort_values(ascending=False)[:5]

genre_comedy       0.179856
genre_drama        0.158273
genre_family       0.129496
genre_animation    0.086331
genre_adventure    0.071942
dtype: float64

In [28]:
# Genres of the movie 'Chicken Run'

chickenrun = movie_genres.loc['Chicken Run']
chickenrun[chickenrun == 1]

genre_adventure    1
genre_animation    1
genre_comedy       1
genre_drama        1
genre_family       1
Name: Chicken Run, dtype: int64
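As a quick sanity check on this overlap, the sketch below sums the weight that Javier's profile places on the genres of *Chicken Run*. The numbers are copied from the two outputs above. Only his top five genre weights are shown there, which happen to be exactly the five genres of the movie; the variable names are ours.

```python
import pandas as pd

# Javier's top five genre weights, copied from the profile output
top_genres = pd.Series({
    "genre_comedy": 0.179856,
    "genre_drama": 0.158273,
    "genre_family": 0.129496,
    "genre_animation": 0.086331,
    "genre_adventure": 0.071942,
})

# Genres of 'Chicken Run', copied from the movie_genres output
chicken_run_genres = [
    "genre_adventure",
    "genre_animation",
    "genre_comedy",
    "genre_drama",
    "genre_family",
]

# Share of the normalized profile that falls on the movie's genres
covered = top_genres.reindex(chicken_run_genres).sum()
print(f"Profile weight covered by Chicken Run's genres: {covered:.3f}")
```

Roughly 63% of his profile weight falls on the genres of this one movie, which is consistent with it ranking first among his recommendations.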