# Group Project: Movie Recommendations (2487-T2 Machine Learning) [Group 2]
- Nova School of Business and Economics, Portugal
- Instructor: Qiwei Han, Ph.D.
- Program: Masters Program in Business Analytics
- Group Members: 
    - **Luca Silvano Carocci (53942)**
    - **Fridtjov Höyerholt Stokkeland (52922)**
    - **Diego García Rieckhof (53046)**
    - **Matilde Pesce (53258)**
    - **Florian Fritz Preiss (54385)**<br>
---
# Phase 2: Data Understanding [01 Data Collection]
## 1. MovieLens Data

This project uses the MovieLens dataset published by GroupLens (available at https://grouplens.org/datasets/movielens/). The data is split across several CSV files:

### A. Ratings Data File Structure (ratings.csv)

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId,movieId,rating,timestamp

* Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

* Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
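
As a minimal sketch of how the timestamp column can be decoded, the snippet below converts epoch seconds into timezone-aware datetimes with pandas. The sample rows and the `rated_at` column name are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Hypothetical rows in the ratings.csv layout (values are illustrative only)
ratings = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [1, 2],
    "rating": [4.0, 3.5],
    "timestamp": [0, 86400],
})

# Interpret the integers as seconds since 1970-01-01 00:00:00 UTC
ratings["rated_at"] = pd.to_datetime(ratings["timestamp"], unit="s", utc=True)

print(ratings[["timestamp", "rated_at"]])
```

Keeping the values in UTC avoids shifting ratings across day boundaries when the data is later grouped by date.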

### B. Movies Data File Structure (movies.csv)

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres

* Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

* Genres are a pipe-separated list, and are selected from the following: Action, Adventure, Animation, Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western, (no genres listed)
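
The two conventions above (release year in parentheses, pipe-separated genres) could be parsed roughly as follows. This is a sketch under assumptions: the sample rows, the `year` and `genre_list` column names, and the choice to map `(no genres listed)` to an empty list are ours, and titles with inconsistent formatting may need extra handling.

```python
import pandas as pd

# Hypothetical rows in the movies.csv layout (genres are illustrative only)
movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Untitled Example"],
    "genres": ["Animation|Comedy", "(no genres listed)"],
})

# Year: four digits in parentheses at the end of the title; NaN when absent
year_text = movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
movies["year"] = pd.to_numeric(year_text)

# Genres: split on the pipe and treat the placeholder as "no genres"
movies["genre_list"] = movies["genres"].apply(
    lambda g: [] if g == "(no genres listed)" else g.split("|")
)

print(movies[["title", "year", "genre_list"]])
```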

### C. Links Data File Structure (links.csv)

Identifiers that can be used to link to other sources of movie data are contained in the file `links.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,imdbId,tmdbId

* movieId is an identifier for movies used by <https://movielens.org>. E.g., the movie Toy Story has the link <https://movielens.org/movies/1>.

* imdbId is an identifier for movies used by <http://www.imdb.com>. E.g., the movie Toy Story has the link <http://www.imdb.com/title/tt0114709/>.

* tmdbId is an identifier for movies used by <https://www.themoviedb.org>. E.g., the movie Toy Story has the link <https://www.themoviedb.org/movie/862>.
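
To illustrate how the three identifiers map to URLs, here is a small sketch. The helper name `build_links` is hypothetical. It assumes the identifiers are kept as strings, since `imdbId` values such as `0114709` would lose their leading zero if parsed as integers.

```python
def build_links(movie_id: str, imdb_id: str, tmdb_id: str) -> dict:
    """Return the external URLs for one row of links.csv (IDs passed as strings)."""
    return {
        "movielens": f"https://movielens.org/movies/{movie_id}",
        "imdb": f"http://www.imdb.com/title/tt{imdb_id}/",
        "tmdb": f"https://www.themoviedb.org/movie/{tmdb_id}",
    }


# Toy Story, using the identifiers from the examples above
toy_story = build_links("1", "0114709", "862")
for source, url in toy_story.items():
    print(source, url)
```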


### D. Tags Data File Structure (tags.csv)

All tags are contained in the file `tags.csv`. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

    userId,movieId,tag,timestamp

* Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

* Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
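
Because tags are free text, the same idea can appear in several spellings. One possible way to collect them per movie is sketched below. The sample rows, the `tag_clean` column, and the normalization (trimming and lower-casing) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows in the tags.csv layout (values are illustrative only)
tags = pd.DataFrame({
    "userId": [1, 2, 2],
    "movieId": [1, 1, 2],
    "tag": ["Pixar", " pixar ", "time travel"],
    "timestamp": [0, 0, 0],
})

# Normalize whitespace and case so variants of the same tag collapse together
tags["tag_clean"] = tags["tag"].str.strip().str.lower()

# One sorted list of distinct tags per movie
tags_per_movie = tags.groupby("movieId")["tag_clean"].apply(lambda s: sorted(set(s)))

print(tags_per_movie)
```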

***Harper, F. Maxwell and Konstan, Joseph A. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>***

## 2. TMDB Data

The `movies.csv` file from GroupLens provides only titles and genres, which is not enough to build a sound content-based recommender system. We therefore retrieve the following additional fields for each movie from the TMDB API (https://www.themoviedb.org/):

   * 'collection_name': The name of the collection the movie belongs to (missing for movies that are not part of any collection).
   * 'original_language': The original language of the movie.
   * 'description': A short description of the movie plot.
   * 'runtime': The duration of the movie in minutes.
   * 'actors': A list containing the top 3 most popular actors starring in the movie according to the popularity index by TMDB.
   * 'director': The director(s) of the movie.
   * 'production_countries': The countries in which the movie has been produced.
   * 'spoken_languages': A list of the languages spoken in the movie.
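
The mapping from a TMDB movie record to the fields above can be sketched as follows. The `sample` dictionary is hand-written and only mimics the keys this notebook reads (`belongs_to_collection`, `overview`, `iso_3166_1`, `iso_639_1`, and so on). It is not an actual API response, and `extract_fields` is a hypothetical helper.

```python
def extract_fields(movie_data: dict) -> dict:
    """Pick the fields of interest from a TMDB-style movie dictionary."""
    # 'belongs_to_collection' may be None when the movie is not in a collection
    collection = movie_data.get("belongs_to_collection") or {}
    return {
        "collection_name": collection.get("name"),
        "original_language": movie_data.get("original_language"),
        "description": movie_data.get("overview"),
        "runtime": movie_data.get("runtime"),
        "production_countries": [
            country["iso_3166_1"] for country in movie_data.get("production_countries", [])
        ],
        "spoken_languages": [
            language["iso_639_1"] for language in movie_data.get("spoken_languages", [])
        ],
    }


# Hand-written stand-in for a movie record (illustrative values only)
sample = {
    "belongs_to_collection": None,
    "original_language": "en",
    "overview": "An illustrative plot summary.",
    "runtime": 90,
    "production_countries": [{"iso_3166_1": "US", "name": "United States of America"}],
    "spoken_languages": [{"iso_639_1": "en", "name": "English"}],
}

fields = extract_fields(sample)
print(fields)
```

Using `.get()` with defaults keeps a missing key from aborting the extraction of the remaining fields.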

In [1]:
import pandas as pd
import numpy as np
import re
import tmdbsimple as tmdb

In [8]:
links_df = pd.read_csv('../00_Data/00_raw/links.csv', dtype={'movieId': object, 'imdbId': object, 'tmdbId': object})
links_df = links_df.drop_duplicates(subset='tmdbId')
links_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 57878 entries, 0 to 58097
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  57878 non-null  object
 1   imdbId   57878 non-null  object
 2   tmdbId   57877 non-null  object
dtypes: object(3)
memory usage: 1.8+ MB
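
The `info()` output shows 57,878 rows but only 57,877 non-null `tmdbId` values, so one row has no TMDB identifier (`drop_duplicates` keeps a single row for all missing values). A possible way to exclude it before querying the API is sketched below. The sample frame and the name `valid_links` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows in the links.csv layout; two of them lack a tmdbId
links = pd.DataFrame({
    "movieId": ["1", "2", "3", "4"],
    "imdbId": ["0114709", "0000001", "0000002", "0000003"],
    "tmdbId": ["862", None, None, "12345"],
})

# As in the notebook: missing tmdbId values count as duplicates of each other
deduplicated = links.drop_duplicates(subset="tmdbId")

# Additionally drop the remaining row without an identifier
valid_links = deduplicated.dropna(subset=["tmdbId"])

print(len(links), len(deduplicated), len(valid_links))
```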


In [9]:
tmdb.API_KEY = '' # Insert API-Key here
tmdb.REQUESTS_TIMEOUT = (2, 5)  # seconds, for connect and read specifically
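
Rather than pasting the key into the notebook, it could be read from the environment. The sketch below assumes an environment variable named `TMDB_API_KEY`. Both that name and the helper `load_api_key` are hypothetical.

```python
import os


def load_api_key(env_var: str = "TMDB_API_KEY") -> str:
    """Read the API key from an environment variable and fail early if it is unset."""
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before calling the API")
    return key


# In the notebook this would replace the hard-coded empty string:
# tmdb.API_KEY = load_api_key()
```

This keeps the key out of version control and makes a missing key visible immediately instead of through failed requests.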

In [None]:
tmdb_df = pd.DataFrame(columns=['tmdbId', 'title', 'release_date', 'collection_name',
                                'original_language', 'description', 'runtime', 'actors',
                                'director', 'production_countries', 'spoken_languages'])

def get_top_3_actors(people):
    # Filter the list by the 'known_for_department' key
    filtered_list = [person for person in people if person['known_for_department'] == 'Acting']
    
    # Sort the list by the 'popularity' key in descending order
    sorted_list = sorted(filtered_list, key=lambda x: x['popularity'], reverse=True)
    
    top_three = sorted_list[:3]
    actor_names = [person['original_name'] for person in top_three]
    return actor_names



for tmdb_id in links_df['tmdbId'].iloc[1:]:
    try:
        movie = tmdb.Movies(tmdb_id)
        movie_data = movie.info()

        # Get TMDB Movie Title
        try:
            title = movie_data['title']
        except:
            title = ''

        # Get Release Date
        try:
            release_date = movie_data['release_date']
        except:
            release_date = ''

        # Get Collection Name if movie belongs to collection
        try:
            collection_name = movie_data['belongs_to_collection']['name']
        except:
            collection_name = ''

        # Get original movie language
        try:
            original_language = movie_data['original_language']
        except:
            original_language = ''

        # Get plot description
        try:
            description = movie_data['overview']
        except:
            description = ''

        # Get runtime
        try:
            runtime = movie_data['runtime']
        except:
            runtime = ''

        # Fetch the credits once and reuse them for actors and director(s)
        try:
            movie_credits = movie.credits()
        except Exception:
            movie_credits = {}

        # Get top 3 most popular actors from movie
        try:
            actors = get_top_3_actors(movie_credits['cast'])
        except Exception:
            actors = np.nan

        # Get director(s)
        try:
            director = [crew_member['original_name'] for crew_member in movie_credits['crew'] if crew_member['job'] == 'Director']
        except Exception:
            director = np.nan

        # Get Production Countries
        try:
            production_countries = [item['iso_3166_1'] for item in movie_data['production_countries']]
        except:
            production_countries = ''

        # Get spoken languages
        try:
            spoken_languages = [item['iso_639_1'] for item in movie_data['spoken_languages']]
        except:
            spoken_languages = ''
    
    # Error handling if movie cannot be found
    except:
        title = np.nan
        release_date = np.nan
        collection_name = np.nan
        original_language = np.nan
        description = np.nan
        runtime = np.nan
        actors = np.nan
        director = np.nan  # must match the 'director' variable written to the row below
        production_countries = np.nan
        spoken_languages = np.nan
        
    # Build a one-row frame for this movie and append it to the CSV right away
    # (no header row; the DataFrame index is written as the first column)
    tmdb_df = pd.DataFrame({'tmdbId': [tmdb_id],
                            'title': [title],
                            'release_date': [release_date],
                            'collection_name': [collection_name],
                            'original_language': [original_language],
                            'description': [description],
                            'runtime': [runtime],
                            'actors': [actors],
                            'director': [director],
                            'production_countries': [production_countries],
                            'spoken_languages': [spoken_languages]})
    tmdb_df.to_csv('../00_Data/00_raw/tmdb.csv', mode='a', header=False)
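
Since the rows are appended without a header and with the index as the first column, the file has to be read back with explicit column names, and list-valued fields arrive as strings. The following is a sketch of one way to do that. It round-trips a made-up row through an in-memory buffer instead of the real `tmdb.csv`, and the names `row_index` and `parse_list` are hypothetical.

```python
import ast
import io

import pandas as pd

# Column layout produced by the loop: the DataFrame index first, then the fields
columns = ["row_index", "tmdbId", "title", "release_date", "collection_name",
           "original_language", "description", "runtime", "actors",
           "director", "production_countries", "spoken_languages"]
list_columns = ["actors", "director", "production_countries", "spoken_languages"]

# Write one made-up row the same way the loop does (no header, index included)
buffer = io.StringIO()
pd.DataFrame({
    "tmdbId": ["862"], "title": ["Example Title"], "release_date": ["1995-01-01"],
    "collection_name": [""], "original_language": ["en"],
    "description": ["An example plot."], "runtime": [81],
    "actors": [["Actor A", "Actor B"]], "director": [["Director C"]],
    "production_countries": [["US"]], "spoken_languages": [["en"]],
}).to_csv(buffer, header=False)
buffer.seek(0)

# Read it back with explicit names; keep tmdbId as a string
tmdb_df = pd.read_csv(buffer, header=None, names=columns, index_col=0,
                      dtype={"tmdbId": str})


def parse_list(value):
    """Turn the string form of a list back into a list; missing values become []."""
    if pd.isna(value):
        return []
    return ast.literal_eval(value)


for column in list_columns:
    tmdb_df[column] = tmdb_df[column].apply(parse_list)

print(tmdb_df.iloc[0])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for parsing the stored lists.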