# Group Project: Movie Recommendations (2487-T2 Machine Learning) [Group 2]
- Nova School of Business and Economics, Portugal
- Instructor: Qiwei Han, Ph.D.
- Program: Masters Program in Business Analytics
- Group Members: 
    - **Luca Silvano Carocci (53942)**
    - **Fridtjov Höyerholt Stokkeland (52922)**
    - **Diego García Rieckhof (53046)**
    - **Matilde Pesce (53258)**
    - **Florian Fritz Preiss (54385)**<br>

---
# Phase 3: Data Preparation [04 Feature Engineering]

## 3.1 Content-Based Recommender System

In this section, we engineer the following features for our recommender system: vote average, vote count, score, genres, actors, directors, languages, collection name, combined text, movie age, key words, and description sentiment. We also preprocess the data further and merge datasets as needed.


**Vote Average and Vote Count:** These two features are created by aggregating the ratings given by users for each movie. The vote average is the mean rating for each movie, while the vote count is the number of ratings given by users for each movie. These features are useful for a recommender system because they give an indication of how popular or well-received a movie is.

**Score (weighted rating):** The score feature is a weighted rating that combines the vote average and the vote count. A movie with few ratings is pulled toward the mean rating across all movies, while a movie with many ratings keeps a score close to its own vote average. This feature is useful for a recommender system because a high average based on a handful of votes no longer outranks a solid average based on thousands of votes.
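
Written out with the symbols used in the code for this feature ($v$: the movie's vote count, $R$: its vote average, $m$: the 90th percentile of vote counts across all movies, $C$: the mean vote average across all movies), the weighted rating is:

$$
\text{score} = \frac{v}{v + m}\,R + \frac{m}{v + m}\,C
$$

The two weights sum to 1. As $v$ grows relative to $m$, the score approaches $R$; for $v = 0$ (and $m > 0$) it equals $C$.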

**Genres, Actors, Directors, Languages, and Collection Name:** These features are created by extracting information from the movie's metadata, including its genres, actors, directors, languages, and collection name. These features are useful for a recommender system because they provide information about the movie's content and production that can be used to recommend similar movies.

**Combined Text:** The combined text feature is created by concatenating the movie's genres, tags, collection name, description root words, actors, directors, original language, and spoken languages into a single string. This feature is useful for a recommender system because it allows the system to perform text-based analysis and recommend movies that are similar in content or theme.
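
One way such a string can be compared across movies is a bag-of-words cosine similarity. The notebook does not fix the similarity measure at this stage, so the following is only a minimal sketch under that assumption. The function name `cosine_similarity` and the three toy strings are hypothetical and stand in for real `combined_text` values.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    # Dot product over the tokens that both texts share
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical, shortened combined_text values
toy_story = "adventure animation children comedy fantasy tomhanks"
jumanji = "adventure children fantasy robinwilliams"
heat = "action crime thriller alpacino"

sim_close = cosine_similarity(toy_story, jumanji)  # three shared genre tokens
sim_far = cosine_similarity(toy_story, heat)       # no shared tokens
print(round(sim_close, 3), sim_far)
```

Because each name is collapsed into a single token (for example `tomhanks`), an actor or director only contributes to the similarity when the full name matches.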

**Movie Age:** The movie age feature is created by subtracting the movie's release year from the current year. This feature is useful for a recommender system because it allows the system to recommend newer or older movies depending on the user's preference.

**Key Words:** The key words feature is created by using a text analysis tool to extract the most important words from the combined text feature. This feature is useful for a recommender system because it allows the system to recommend movies that have similar content or themes, even if they don't share the same metadata features.
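
The notebook imports `Rake` from `rake_nltk`, a keyword-extraction library. To illustrate the idea without that dependency, here is a simplified sketch of RAKE-style scoring. It is not the library's code, and the function `rake_keywords`, the tiny stop-word list, and the sample sentence are all hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "by", "on", "with"}


def rake_keywords(text, stopwords=STOPWORDS):
    """Return (phrase, score) pairs, highest score first."""
    # Tokens are either runs of letters or single punctuation/digit characters
    tokens = re.findall(r"[a-z]+|[^a-z\s]", text.lower())

    # Candidate phrases are the runs of words between stop words and punctuation
    phrases, current = [], []
    for tok in tokens:
        if tok in stopwords or not tok.isalpha():
            if current:
                phrases.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        phrases.append(current)

    # Word score = degree / frequency, where degree sums the lengths of the
    # phrases a word appears in
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # Phrase score = sum of its word scores
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


sample = "A cowboy doll is threatened by a new spaceman action figure."
ranked = rake_keywords(sample)
for phrase, score in ranked:
    print(f"{score:5.1f}  {phrase}")
```

Longer phrases made of words that rarely occur elsewhere score highest, which is why multi-word phrases tend to rank above single words.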

**Description Sentiment:** The description sentiment feature is the polarity of the movie's description as computed by TextBlob, a value between -1 (negative) and 1 (positive). This feature is useful for a recommender system because it allows the system to recommend movies that have a similar mood or tone.

In [1]:
# Required libraries
import pandas as pd
import numpy as np
import datetime
import re
from ast import literal_eval
from rake_nltk import Rake
from sklearn.preprocessing import MinMaxScaler
import nltk
from textblob import TextBlob
import warnings

# Settings
nltk.download('stopwords')
nltk.download('punkt')
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\flori\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\flori\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# Load the movies dataset
movies_df = pd.read_csv('../00_Data/pre-processed/prepr_movies.csv',
                        lineterminator='\n',
                        dtype={'movieId': object, 'tmdbId': object},
                        converters={'production_countries': literal_eval,
                                    'spoken_languages': literal_eval})

movies_df = movies_df.drop('Unnamed: 0', axis=1)
movies_df['year'] = pd.to_datetime(movies_df['year'], format='%Y')
movies_df['genres'] = movies_df['genres'].apply(lambda x: literal_eval(str(x)))
movies_df.loc[pd.isnull(movies_df['director']), 'director'] = 'None'
movies_df['director'] = movies_df['director'].apply(lambda x: literal_eval(str(x)))

movies_df.loc[pd.isnull(movies_df['actors']), 'actors'] = 'None'
movies_df['actors'] = movies_df['actors'].apply(lambda x: literal_eval(str(x)))
movies_df.head(2)

Unnamed: 0,movieId,title,genres,year,tmdbId,tag,collection_name,original_language,description,runtime,...,description_meanword_wsw,description_nchars,description_nchars_wsw,description_diff_nchars,description_root_wrds,description_jj_n,description_nn_n,description_prp_n,description_rb_n,description_vb_n
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",1995-01-01,862,['2009reissueinstereoscopic3-d' '3d'\n '55movi...,Toy Story Collection,en,"Led by Woody, Andy's toys live happily in his ...",81.0,...,5.575758,297,216,81,led woody andy toy live happily room andy birt...,2.0,25.0,4.0,4.0,2.0
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",1995-01-01,8844,['adaptationofbook' 'adaptedfrom:book' 'advent...,Jumanji Collection,en,When siblings Judy and Peter discover an encha...,104.0,...,5.675,391,266,125,sibling judy peter discover enchanted board ga...,3.0,27.0,3.0,3.0,7.0


In [3]:
# Load the ratings dataset
ratings_df = pd.read_csv('../00_Data/pre-processed/prepr_ratings.csv', dtype={'userId': object, 'movieId': object})
ratings_df['timestamp'] = pd.to_datetime(ratings_df['timestamp'], unit='s', origin='unix')
ratings_df = ratings_df.drop('Unnamed: 0', axis=1)
ratings_df.head(2)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,2006-05-17 15:34:04
1,1,306,3.5,2006-05-17 12:26:57


In [4]:
# Preprocess the datasets
movies_df['year'] = pd.to_datetime(movies_df['year'], format='%Y')
movies_df['genres'] = movies_df['genres'].apply(lambda x: literal_eval(str(x)))
movies_df['director'] = movies_df['director'].fillna('None').apply(lambda x: literal_eval(str(x)))
movies_df['actors'] = movies_df['actors'].fillna('None').apply(lambda x: literal_eval(str(x)))

ratings_df['timestamp'] = pd.to_datetime(ratings_df['timestamp'], unit='s', origin='unix')

In [5]:
# Aggregate vote data and merge with movies dataframe
vote_data = (
    ratings_df.groupby('movieId')['rating']
    .agg(vote_average='mean', vote_count='count')
    .reset_index()
)
movies_df = movies_df.merge(vote_data, on='movieId', how='left')
movies_df['vote_average'] = movies_df['vote_average'].fillna(0)
movies_df['vote_count'] = movies_df['vote_count'].fillna(0)

**a. Generating the feature 'score'**

In [6]:
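# Calculate the weighted rating
# m: vote-count threshold (90th percentile of vote counts)
# C: mean vote average over all movies (unrated movies were filled with 0 above and are included)
m = movies_df['vote_count'].quantile(0.9)
C = movies_df['vote_average'].mean()

def weighted_rating(x, m, C):
    v = x["vote_count"]
    R = x["vote_average"]
    # Weight R by the share of actual votes and C by the share of the threshold
    return (v / (v + m) * R) + (m / (m + v) * C)

# The function works on whole columns, so no row-wise apply is needed
movies_df['score'] = weighted_rating(movies_df, m, C)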
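
To see how the weighting behaves, here is a small worked example in plain Python. The values of `m` and `C` and the two movies are made up for illustration and are not taken from the dataset. The function takes scalars, so it is a sketch of the same formula and not the notebook's function.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's mean rating R with the global mean C.

    v is the movie's vote count; m is the vote-count threshold.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


# Hypothetical threshold and global mean
m, C = 100, 3.0

# A widely rated movie keeps a score close to its own average
popular = weighted_rating(v=10_000, R=4.0, m=m, C=C)

# A movie with five perfect ratings is pulled strongly toward the global mean
niche = weighted_rating(v=5, R=5.0, m=m, C=C)

print(f"popular: {popular:.3f}, niche: {niche:.3f}")
```

Although the niche movie has the higher raw average, it ends up with the lower score, which is the intended effect of the weighting.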

**b. Functions for further processing the data**

In [7]:
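def process_names(names):
    # Missing cast/director entries are parsed to None above; treat them as empty
    if not isinstance(names, (list, tuple, set)):
        return ""
    # Collapse each name into one lowercase token, e.g. "Tom Hanks" -> "tomhanks"
    return " ".join([name.replace(" ", "").lower() for name in names])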

def process_genres_and_languages(genres_and_languages):
    if not isinstance(genres_and_languages, (list, tuple, set)):
        genres_and_languages = [genres_and_languages]

    cleaned_genres_and_languages = []
    for g in genres_and_languages:
        if not isinstance(g, float):
            cleaned_g = re.sub(r'[^a-zA-Z\s]', '', str(g))
            cleaned_genres_and_languages.append(cleaned_g.lower())

    return " ".join(cleaned_genres_and_languages)
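
A short usage sketch shows what these helpers produce. The definitions are repeated so the snippet runs on its own; the non-list guard in `process_names` is a defensive assumption for missing cast or director entries, and the input values are made-up examples.

```python
import re


def process_names(names):
    # Assumption: anything that is not a collection of names counts as missing
    if not isinstance(names, (list, tuple, set)):
        return ""
    return " ".join([name.replace(" ", "").lower() for name in names])


def process_genres_and_languages(genres_and_languages):
    if not isinstance(genres_and_languages, (list, tuple, set)):
        genres_and_languages = [genres_and_languages]

    cleaned = []
    for g in genres_and_languages:
        if not isinstance(g, float):  # skips NaN, which is a float
            # Keep only letters and whitespace, then lowercase
            cleaned.append(re.sub(r'[^a-zA-Z\s]', '', str(g)).lower())
    return " ".join(cleaned)


names_out = process_names(["Tom Hanks", "Tim Allen"])
genres_out = process_genres_and_languages(["Sci-Fi", "Film-Noir"])
missing_out = process_genres_and_languages(float("nan"))
tags_out = process_genres_and_languages("['3d' 'pixar']")

print(repr(names_out))    # 'tomhanks timallen'
print(repr(genres_out))   # 'scifi filmnoir'
print(repr(missing_out))  # ''
print(repr(tags_out))     # 'd pixar'
```

The last call shows a side effect worth knowing: digits are stripped along with punctuation, so a tag such as `3d` is reduced to `d`.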

**c. Generating the feature 'combined_text'**

In [8]:
# Create combined_text column
# Missing text fields are replaced with "" so that one NaN does not turn the whole string into NaN
movies_df['combined_text'] = (
    movies_df['genres'].apply(process_genres_and_languages) + " " +
    movies_df['tag'].apply(process_genres_and_languages) + " " +
    movies_df['collection_name'].fillna("").str.lower() + " " +
    movies_df['description_root_wrds'].fillna("") + " " +
    movies_df['actors'].apply(process_names) + " " +
    movies_df['director'].apply(process_names) + " " +
    movies_df['original_language'].fillna("") + " " +
    movies_df['spoken_languages'].apply(process_genres_and_languages)
)

**d. Perform sentiment analysis on the movie description**

In [9]:
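# Sentiment analysis on the movie descriptions
def sentiment_analysis(text):
    # TextBlob polarity ranges from -1.0 (negative) to 1.0 (positive)
    return TextBlob(text).sentiment.polarity

# str() turns missing (NaN) descriptions into text so that TextBlob does not raise
movies_df['sentiment'] = movies_df['description'].apply(lambda x: sentiment_analysis(str(x)))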

**e. Generating the feature 'movie_age'**

In [10]:
# Create movie_age column
# Note: the age is relative to the run date, so re-running in a later year shifts every value
current_year = datetime.datetime.now().year
movies_df['movie_age'] = current_year - movies_df['year'].dt.year

In [11]:
# Keep only relevant columns
relevant_columns = ['movieId', 'title', 'movie_age', 'genres', 'combined_text',
                    'vote_average', 'vote_count', 'score', 'sentiment']
movies_df = movies_df[relevant_columns]
movies_df.head(2)

Unnamed: 0,movieId,title,movie_age,genres,combined_text,vote_average,vote_count,score,sentiment
0,1,Toy Story (1995),28,"[Adventure, Animation, Children, Comedy, Fantasy]",adventure animation children comedy fantasy re...,3.893708,57309.0,3.883305,0.112121
1,2,Jumanji (1995),28,"[Adventure, Children, Fantasy]",adventure children fantasy adaptationofbook ad...,3.251527,24228.0,3.242912,-0.21875


In [13]:
# Save the processed dataset
movies_df.to_csv('../00_Data/engineered/movies_df_engineered.csv', index=False)