# Creating a dataset of the highest-grossing movies

In this notebook, I will collect data on the 5,000 highest-grossing movies listed on the [TMDB website](https://www.themoviedb.org/). Using its API, I will:
1. Get the ids of the movies with the highest revenue.
2. Get information about the selected movies and combine it in a single dataframe.

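The request URLs in the cells below embed the API key and query string directly. As a minimal sketch of an alternative, the key could be read from an environment variable and the query passed as a parameter dictionary. The variable name `TMDB_API_KEY` and the helper `discover_params` are hypothetical and not part of this notebook; the parameter names are the ones used in the discover URL below.

```python
import os

# Hypothetical environment variable name; it is an assumption, not part of the notebook.
API_KEY = os.environ.get("TMDB_API_KEY", "")

BASE_URL = "https://api.themoviedb.org/3"


def discover_params(page, api_key=API_KEY):
    """Build the query parameters for one page of /discover/movie, sorted by revenue."""
    return {
        "api_key": api_key,
        "sort_by": "revenue.desc",
        "include_adult": "false",
        "include_video": "false",
        "page": page,
    }


# Usage (needs network access and the requests library):
# r = requests.get(f"{BASE_URL}/discover/movie", params=discover_params(1))
```

Passing the parameters as a dictionary lets `requests` handle the URL encoding and keeps the key out of the notebook text.
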
In [1]:
#importing necessary libraries
import numpy as np
import pandas as pd
import requests
from tqdm import tqdm

pd.set_option('display.max_columns', 50)

In [14]:
#getting ids of 5,000 highest grossing movies in the database
movies_list = []
for page in tqdm(range(1, 251)):
    url = f'https://api.themoviedb.org/3/discover/movie?api_key=c3eee2ef7290f6733d92e617c223b7c0&sort_by=revenue.desc&include_adult=false&include_video=false&page={page}'
    r = requests.get(url)
    movies_dict = r.json()
    movies_list.append(movies_dict)

movies_dfs = []
for i in range(0, 250):
    movies_df = pd.DataFrame(movies_list[i]['results'])
    movies_dfs.append(movies_df)

movies_combined = pd.concat(movies_dfs, ignore_index=True)
movies_ids = movies_combined.id.values

100%|████████████████████████████████████████████████████████████████████████████████| 250/250 [00:34<00:00,  7.17it/s]

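Since 250 pages yield 5,000 ids, each page contributes 20 results. The page-by-page dataframe construction above could also be written as a single flattening step. The sketch below uses made-up toy pages rather than real TMDB responses, and the helper name `pages_to_frame` is hypothetical.

```python
import pandas as pd


def pages_to_frame(pages):
    """Flatten a list of discover responses into one DataFrame of movies."""
    # A page without a 'results' key (for example, an error payload) adds no rows.
    rows = [movie for page in pages for movie in page.get("results", [])]
    return pd.DataFrame(rows)


# Toy stand-ins for API pages (made-up ids and titles, not real TMDB data)
toy_pages = [
    {"page": 1, "results": [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}]},
    {"page": 2, "results": [{"id": 3, "title": "C"}]},
    {"status_message": "error"},
]

toy_ids = pages_to_frame(toy_pages)["id"].to_numpy()
```

Using `.get("results", [])` means a failed page is skipped instead of raising a `KeyError`, at the cost of silently returning fewer ids, so the final count is worth checking.
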

In [15]:
#getting info about movies with selected ids
movies_dicts = []
for movie_id in tqdm(movies_ids):
    url = f'https://api.themoviedb.org/3/movie/{movie_id}?api_key=c3eee2ef7290f6733d92e617c223b7c0'
    r = requests.get(url)
    movies_dict = r.json()
    movies_dicts.append(movies_dict)

100%|██████████████████████████████████████████████████████████████████████████████| 5000/5000 [12:39<00:00,  6.59it/s]

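The loop above stores whatever the API returns, including error payloads if a request fails partway through the 5,000 calls. Below is a minimal sketch of a more defensive fetch, assuming a non-200 status should be retried a few times. The helper `fetch_json` and its parameters are hypothetical; `session` is assumed to be any object with a `get` method like that of `requests.Session()`.

```python
import time


def fetch_json(session, url, params=None, retries=3, pause=1.0):
    """Return the decoded JSON body, or None if every attempt fails."""
    for attempt in range(retries):
        response = session.get(url, params=params, timeout=10)
        if response.status_code == 200:
            return response.json()
        # Wait briefly before the next attempt (e.g. after a rate-limit response).
        time.sleep(pause)
    return None


# Usage (needs network access and the requests library):
# session = requests.Session()
# movie = fetch_json(session, f'https://api.themoviedb.org/3/movie/{movie_id}',
#                    params={'api_key': API_KEY})
```

Entries that come back as `None` can then be filtered out or re-requested before building the dataframe.
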

In [16]:
#creating a dataframe with obtained info
movies_details = pd.DataFrame(movies_dicts)
movies_details.shape

(5000, 25)

In [17]:
#getting additional info about movies with selected ids
movies_credits_dicts = []
for movie_id in tqdm(movies_ids):
    url = f'https://api.themoviedb.org/3/movie/{movie_id}/credits?api_key=c3eee2ef7290f6733d92e617c223b7c0'
    r = requests.get(url)
    movies_credits_dict = r.json()
    movies_credits_dicts.append(movies_credits_dict)

100%|██████████████████████████████████████████████████████████████████████████████| 5000/5000 [13:28<00:00,  6.19it/s]


In [18]:
#creating a dataframe with obtained additional info
movies_credits = pd.DataFrame(movies_credits_dicts)
movies_credits.shape

(5000, 3)

In [20]:
#merging the two dataframes on their shared id column
movies = pd.merge(movies_details, movies_credits, on='id')

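The merge relies on both dataframes sharing the `id` column. As a small illustration on made-up data, the key can be stated explicitly and pandas can be asked to verify that each id appears once on both sides through the `validate` argument of `pd.merge`. The toy frames below are assumptions for demonstration only.

```python
import pandas as pd

# Toy stand-ins for movies_details and movies_credits (made-up values)
toy_details = pd.DataFrame({"id": [1, 2], "revenue": [100, 200]})
toy_credits = pd.DataFrame({"id": [1, 2], "cast": [[], []], "crew": [[], []]})

# validate='one_to_one' raises pandas.errors.MergeError if an id is duplicated
# in either frame, instead of silently multiplying rows.
toy_merged = pd.merge(
    toy_details, toy_credits, on="id", how="inner", validate="one_to_one"
)
```

With the real data, the merged dataframe should still have 5,000 rows if every id is unique in both inputs.
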
In [21]:
#function for getting necessary values from existing columns and storing them in separate lists

def get_values(column):
    #each cell of the column holds a list of dicts;
    #join the 'name' field of every dict into one comma-separated string per movie
    return [', '.join(d['name'] for d in items) for items in movies[column]]

In [22]:
#modified function for getting the names of directors or producers and storing them in separate lists

def get_values_director_producer(role):
    #each cell of movies.crew holds a list of dicts, one per crew member;
    #keep only the names of those whose 'job' matches the given role
    return [
        ', '.join(d['name'] for d in crew if d['job'] == role)
        for crew in movies.crew
    ]

In [23]:
#adding columns with the necessary information to the dataframe
movies['actors'] = get_values('cast')
movies['genres'] = get_values('genres')
movies['studios'] = get_values('production_companies')
movies['countries'] = get_values('production_countries')
movies['directors'] = get_values_director_producer('Director')
movies['producers'] = get_values_director_producer('Producer')

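To make the extraction step concrete, here is a small sketch on a made-up row. It uses `Series.apply` with a single helper instead of the two functions above; the helper name `names` and the toy values are hypothetical, but the nested structure (a list of dicts with `name` and, for crew, `job`) follows the fields the functions above read.

```python
import pandas as pd


def names(items, job=None):
    """Join the 'name' fields of a list of dicts, optionally filtered by 'job'."""
    return ", ".join(
        d["name"] for d in items if job is None or d.get("job") == job
    )


# One toy row with made-up people and genres
toy = pd.DataFrame({
    "genres": [[{"id": 1, "name": "Action"}, {"id": 2, "name": "Drama"}]],
    "crew": [[
        {"job": "Director", "name": "Person A"},
        {"job": "Producer", "name": "Person B"},
        {"job": "Director", "name": "Person C"},
    ]],
})

toy["genre_names"] = toy["genres"].apply(names)
toy["directors"] = toy["crew"].apply(names, job="Director")
```

Here `toy["genre_names"]` holds `"Action, Drama"` and `toy["directors"]` holds `"Person A, Person C"`.
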
In [24]:
#selecting only columns that are relevant for the analysis
cols_to_keep = [
    'budget', 
    'genres', 
    'original_language', 
    'original_title', 
    'release_date', 
    'revenue', 
    'runtime',
    'actors',
    'studios',
    'countries',
    'directors',
    'producers'
]

#subsetting the movies dataframe
movies = movies[cols_to_keep]

#saving the dataframe as a csv file
movies.to_csv('data/movies.csv', index=False)

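`to_csv` fails if the target directory does not exist, so `data/` has to be created beforehand. The sketch below shows one way to do that and to read the file back with `release_date` parsed as a date. It writes a made-up one-row frame to a temporary directory, so it does not touch the real `data/movies.csv`.

```python
import os
import tempfile

import pandas as pd

# Toy frame with made-up values, using column names from cols_to_keep
toy = pd.DataFrame({
    "original_title": ["A"],
    "release_date": ["2009-12-10"],
    "revenue": [123],
})

# Create the output directory if needed (a temporary location for this example)
out_dir = os.path.join(tempfile.mkdtemp(), "data")
os.makedirs(out_dir, exist_ok=True)

path = os.path.join(out_dir, "movies.csv")
toy.to_csv(path, index=False)

# Reading back: parse_dates turns the release_date strings into datetimes
restored = pd.read_csv(path, parse_dates=["release_date"])
```
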
In the next notebook, I will use the created dataset for data analysis, feature engineering, and predicting movie revenue.