# CINEMA. KEYS FOR SUCCESS

![](MoviesImages/collage1.jpg)

## 1. Introduction.

Cinema is known as the seventh art, but it is also a big industry. According to the EFE agency, cultural and creative industries represent 3% of worldwide GDP and employ 29.5 million people (source [EFE](https://www.efe.com/efe/espana/cultura/las-industrias-creativas-representan-el-3-por-ciento-del-pib-mundial-segun-la-unesco/10005-2780740)). So it is easy to understand the strong interest in making cinema profitable, and it is not difficult to find articles about the keys to doing so (https://phys.org/news/2017-11-explores-movie-successful.html). This topic has already been analyzed with data science models, but there is always something else to say about it. This study digs into cinema metrics to shed some light on one question: **what are the keys that determine the revenue of a movie?** I will try to answer it by building a model to predict movie revenue.

The main source of information I have used in this study is **The Movie Database** ([TMDb](https://www.themoviedb.org/)). As written on their website:

*The Movie Database (TMDb) is a community built movie and TV database. Every piece of data has been added by our amazing community dating back to 2008. TMDb's strong international focus and breadth of data is largely unmatched and something we're incredibly proud of. Put simply, we live and breathe community and that's precisely what makes us different.* ([source](https://www.themoviedb.org/about))

The reasons for choosing TMDb were:
* It allows downloading its entire database of movies, actors and crew, with more than 400,000 movies and 1,200,000 actors and crew members across the whole industry.
* Its very good support, with good documentation and help forums. This was a big help when downloading and understanding the data.

I would like to thank TMDb for making this study possible!

Additionally, I have used:

* A dataset of 5,000 movies downloaded from **data.world** (access through this [link](https://data.world/popculture/imdb-5000-movie-dataset)), whose original source is **Internet Movie Database** ([IMDb](https://www.imdb.com/)).
* **[The Numbers](https://www.the-numbers.com/)**, to double-check revenue and budget data for specific movies.
* https://www.usinflationcalculator.com/, to get US inflation information from 1913 to 2018.

With regard to the code, it is all original (of course, I got help from [Stack Overflow](https://stackoverflow.com/) and many other forums). The only exception is the Python script that downloads movie and cast/crew information through the TMDb API, which I slightly adapted from the original by **galli-leo**. Thank you, galli-leo! (access through this [link](https://gist.github.com/galli-leo/6398f9128ffc20af70c6c7eedfeb0a65)).

The code for this study consists of **eight Jupyter notebooks in Python 3**, plus **this additional one** containing my final report and one **R file** with some extra models. Notebooks 01 and 02 were run on a Linux computer and download all the movie and cast/crew information through the TMDb API. Notebooks 03 to 08 were run on **Google Colab**, and some parts of the code assume this. I recommend running notebooks 03 to 08 on Google Colab, because some Colab-specific code reads the data I used from my Google Drive account.

## 2. The data.

Data from TMDb were downloaded with the scripts in notebooks *01_Getting_Movies_Data* and *02_Getting_People_Data* through the TMDb API (API [overview](https://www.themoviedb.org/documentation/api) and [documentation](https://developers.themoviedb.org/3/getting-started/introduction)). The TMDb API is available for everyone to use; you only need a TMDb user account to request an API key.

However, the API is rate limited. The limit is 40 requests every 10 seconds, and it is applied per IP address, not per API key. This allows a burst of up to 40 requests in a single second, or an average of 4 requests per second. The timer resets 10 seconds after the first request within the current 10-second "bucket". This means that once the limit is triggered, it may be necessary to wait up to 9 seconds before the timer resets, although depending on where we are within the 10-second window, the reset could come in the very next second.
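
The bucket behaviour described above can be respected on the client side with a small limiter. The following is only a minimal sketch, not the code used in the notebooks: the class name `WindowRateLimiter` and its parameters are hypothetical, and the clock and sleep functions are injectable so the logic can be checked without really waiting.

```python
import time


class WindowRateLimiter:
    """Fixed-window limiter: at most `max_calls` per `period` seconds.

    The window opens at the first call and resets `period` seconds later,
    mirroring the 10-second "bucket" described in the text.
    """

    def __init__(self, max_calls=40, period=10.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._window_start = None
        self._calls = 0

    def acquire(self):
        """Block until one more request is allowed; return seconds waited."""
        now = self._clock()
        # Open a fresh window on the first call or once the old one expired.
        if self._window_start is None or now - self._window_start >= self.period:
            self._window_start = now
            self._calls = 0
        waited = 0.0
        if self._calls >= self.max_calls:
            # Budget exhausted: wait for the remainder of the current window.
            waited = self.period - (now - self._window_start)
            self._sleep(waited)
            self._window_start = self._clock()
            self._calls = 0
        self._calls += 1
        return waited
```

Calling `limiter.acquire()` before every API request keeps a single machine under the limit. Because the limit is applied per IP address, one limiter has to be shared by every download process running behind the same address.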

The API returns a JSON file with all the information for every movie id and person (cast or crew) id. So we have two different sets of JSON files:
* One set of 409,791 JSON files for movies (one JSON file = one movie).
* One set of 1,197,556 JSON files for cast/crew (one JSON file = one person).

The API call can be customized to define which information we want to get. At this initial point, I selected a wide range of fields to download, simply because they could be used later to extend this study or even for other studies (why not a recommendation system?). Keep in mind that the whole download process took about 2 weeks!
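
One way to customize the call in TMDb API v3 is the `append_to_response` query parameter, which bundles several sub-requests into a single response. The sketch below only builds the request URL; the helper name `build_movie_url`, the chosen list of extras and the movie id are illustrative assumptions, not necessarily the exact call used in the notebooks.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.themoviedb.org/3"


def build_movie_url(movie_id, api_key,
                    extras=("credits", "keywords", "release_dates", "reviews")):
    """Build a TMDb v3 movie URL that bundles extra sub-requests in one call."""
    params = {"api_key": api_key, "append_to_response": ",".join(extras)}
    # safe="," keeps the comma-separated list readable instead of %2C-encoded.
    return f"{API_ROOT}/movie/{movie_id}?{urlencode(params, safe=',')}"


# The movie id and the API key are placeholders.
url = build_movie_url(550, "YOUR_API_KEY", extras=("credits", "keywords"))
print(url)
```

Bundling the extra fields into one request matters under the rate limit: every additional call per movie would multiply the total download time.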

After getting the JSON files, it is time to import them. For both datasets (movies and people) I first ran a script to check the structure of the files (in notebooks *03_Tidying_Movies_Data* for movies and *05_Tidying_People_Data* for people). This was useful to understand the nested structure of the dictionaries and to decide which information I would finally import. At this point, the memory limit was a key constraint, and I had to estimate whether working on a cluster would be necessary. In the end, it was not.
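
A structure check of this kind can be sketched as a recursive walk over one parsed JSON file. This is an illustration rather than the code in the notebooks: the function name `describe_structure` is hypothetical and the sample record contains made-up values.

```python
import json


def describe_structure(value):
    """Map every dotted key path in a parsed JSON value to the types seen."""
    found = {}

    def walk(node, path):
        if isinstance(node, dict):
            for key, child in node.items():
                walk(child, f"{path}.{key}" if path else key)
        elif isinstance(node, list):
            found.setdefault(path, set()).add("list")
            for child in node:
                # "[]" marks "any element of this list".
                walk(child, path + "[]")
        else:
            found.setdefault(path, set()).add(type(node).__name__)

    walk(value, "")
    return found


# Toy record with the same kind of nesting as a movie file (values made up).
sample = json.loads(
    '{"id": 1, "budget": 1000,'
    ' "genres": [{"id": 18, "name": "Drama"}],'
    ' "credits": {"cast": [{"name": "Example Actor", "order": 0}]}}'
)
structure = describe_structure(sample)
for path in sorted(structure):
    print(path, sorted(structure[path]))
```

Running the same walk over a sample of files and merging the results shows which keys are nested lists of dictionaries (such as cast and crew) and which are plain values, which helps decide what to flatten before importing.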

These are the fields I finally imported for **movies**:

* **id** for TMDb (*int*)
* **titles**: list of alternative titles (*list of str*)
* **belongs_to_collection_id** (*int*)
* **belongs_to_collection_name** (*str*)
* **budget** (*int*) in US dollars.
* **credits cast**: list of cast with different additional features: **cast_id** (*int*), **character** (*str*), **gender** (*int*), **id** (*int*), **name** (*str*), **order** (*int*)
* **credits crew**: list of crew with different additional features: **department** (*str*), **gender** (*int*), **id** (*int*), **job** (*str*), **name** (*str*)
* **imdb_id** (*int*)
* **genres**: list of id (*int*) - name (*str*) of genres associated with every movie
* **keywords**: list of id (*int*) - name (*str*)
* **original_language** (*str*)
* **original_title** (*str*)
* **overview** (*str*)
* **popularity** (*float*): numeric value rating popularity coming from TMDb community
* **production_companies**: list of companies and additional features: **id** (*int*), **name** (*str*), **origin_country** (*str*)
* **production_countries**: list of **iso_3166_1** (*str*) and **name** (*str*) for production countries
* **release_date** (*str*)
* **release_dates**: list of dictionaries with information for every release: **iso_3166_1** (*str*), **certification** (*str*), **iso_639_1** (*str*), **release_date** (*str*), **type** (*int*) (numeric code for type of release)
* **revenue** (*int*) in US dollars.
* **reviews_page** (*int*)
* **reviews_results**: list of additional features: **author** (*str*), **content** (*str*), **id** (*int*)
* **reviews_total_pages** (*int*)
* **reviews_total_results** (*int*)
* **runtime** (*int*)
* **spoken_languages**: list of **iso_639_1** (*str*) and **name** (*str*) for spoken languages in the movie
* **status** (*str*) (released/production/planned...)
* **tagline** (*str*)
* **title** (*str*): title in English
* **vote_average** (*float*) (from TMDb users)
* **vote_count** (*int*)  (from TMDb users)

These are the fields I finally imported for **people**:

* **id** (*int*) for TMDb
* **adult** (*bool*): whether or not the person comes from adult cinema
* **birthday** (*str*)
* **deathday** (*str*)
* **gender** (*int*)
* **imdb_id**  (*int*)
* **known_for_department** (*str*)
* **movie_credits_cast**: list of movies and additional features where the person worked as cast: **adult** (*bool*), **character** (*str*), **genre_ids** (*list of int*), **id** (*int*), **original_language** (*str*), **original_title** (*str*), **popularity** (*int*), **release_date** (*str*), **title** (*str*), **vote_average** (*float*), **vote_count** (*int*)
* **movie_credits_crew**: list of movies and additional features where the person worked as crew: **adult** (*bool*), **department** (*str*), **genre_ids** (*list of int*), **id** (*int*), **job** (*str*), **original_language** (*str*), **original_title** (*str*), **popularity** (*int*), **release_date** (*str*), **title** (*str*), **vote_average** (*float*), **vote_count** (*int*)
* **name** (*str*)
* **place_of_birth** (*str*)
* **popularity** (*int*): numeric value rating popularity coming from TMDb community
* **tv_credits_cast**: list of tv movies and additional features where the person worked as cast: **character** (*str*), **episode_count** (*int*), **first_air_date** (*str*), **genre_ids** (*list of int*), **id** (*int*), **name** (*str*), **origin_country** (*str*), **original_language** (*str*), **original_name** (*str*), **popularity** (*int*), **vote_average** (*float*), **vote_count** (*int*)
* **tv_credits_crew**: list of tv movies and additional features where the person worked as crew: **department** (*str*), **episode_count** (*int*), **first_air_date** (*str*), **genre_ids** (*list of int*), **id** (*int*), **job** (*str*), **name** (*str*), **origin_country** (*str*), **original_language** (*str*), **original_name** (*str*), **popularity** (*int*), **vote_average** (*float*), **vote_count** (*int*)

Although this is the set of features I finally imported, not all of them were used in the model. I decided to exclude variables that can be affected by the revenue itself, because their value is obtained some time after the release of the movie. **My goal is to analyze the variables that have an effect on the later revenue but that can be controlled before the release**. For example, *popularity* will surely be higher for movies with higher revenue, and *votes* are collected after the release. Neither feature is available when planning a movie, so features of this kind are excluded. The case of *belongs to collection* is similar: second and third parts are shot precisely because the previous movies in the collection were a success! So we won't consider that feature either.

It would have been interesting to use TV information in our model, so that we could include the effect of actors or crew working first on TV and then in cinema. But there is a problem: one key feature in our analysis is the release date. We will create features for our movies that hold information about their cast and crew, **about their work up to that movie**. So it is critical to know the dates of their works, and we don't have that information for TV. We have the number of episodes our cast or crew worked on and the first air date of the TV series or TV movie, but we don't know the actual air dates of the episodes. So this information could be misleading.
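
To make the idea of "work up to that movie" concrete, here is a minimal sketch that counts a person's earlier movie credits relative to a given release date. The function name `prior_credits` and the toy filmography are hypothetical; the real features are built in the later notebooks and may be defined differently.

```python
from datetime import date


def prior_credits(credits, release_date):
    """Return the credits released strictly before `release_date` (ISO string)."""
    cutoff = date.fromisoformat(release_date)
    earlier = []
    for credit in credits:
        raw = credit.get("release_date")
        if not raw:
            continue  # undated credits cannot be placed on the timeline
        if date.fromisoformat(raw) < cutoff:
            earlier.append(credit)
    return earlier


# Toy filmography for one hypothetical person (all values are made up).
filmography = [
    {"id": 1, "title": "First Film", "release_date": "2001-05-04"},
    {"id": 2, "title": "Second Film", "release_date": "2005-11-18"},
    {"id": 3, "title": "Undated Film", "release_date": ""},
    {"id": 4, "title": "Later Film", "release_date": "2012-07-20"},
]

# Experience of this person when "Second Film" was released.
experience = len(prior_credits(filmography, "2005-11-18"))
print(experience)
```

The strict comparison leaves out the movie being described, so a feature built this way never uses information from the movie itself or from anything released later. TV credits, which only carry a first air date, cannot be placed on this timeline reliably.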

But the **key issue I had with the data was revenue**: after getting our 409,791 JSON files for movies, I discovered that fewer than 3% had revenue reported. This is one of the reasons why I got the 5,000-movie dataset from data.world (originally from IMDb): to increase the number of films with revenue reported. This new dataset included the IMDb id, which we also had in our TMDb dataset, so merging was possible.
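
The merge on the IMDb id can be sketched as a simple lookup. This is an illustration under stated assumptions: both sources are reduced to records with hypothetical `imdb_id` and `revenue` keys, a missing revenue is assumed to be stored as `0` or `None`, and the ids and amounts are made up.

```python
def fill_missing_revenue(tmdb_movies, imdb_movies):
    """Fill missing TMDb revenue from a second source keyed by IMDb id."""
    imdb_revenue = {
        m["imdb_id"]: m["revenue"] for m in imdb_movies if m.get("revenue")
    }
    merged = []
    for movie in tmdb_movies:
        row = dict(movie)  # copy so the input records stay untouched
        if not row.get("revenue"):
            # Stays None when the second source has no value either.
            row["revenue"] = imdb_revenue.get(row.get("imdb_id"))
        merged.append(row)
    return merged


# Toy records (ids and amounts are made up).
tmdb = [
    {"imdb_id": "tt0000001", "title": "A", "revenue": 0},
    {"imdb_id": "tt0000002", "title": "B", "revenue": 500},
    {"imdb_id": "tt0000003", "title": "C", "revenue": None},
]
imdb = [
    {"imdb_id": "tt0000001", "revenue": 1200},
    {"imdb_id": "tt0000002", "revenue": 999},
]

merged = fill_missing_revenue(tmdb, imdb)
print([m["revenue"] for m in merged])
```

In this sketch an existing TMDb value is never overwritten; the second source only fills gaps.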

The second reason for adding this dataset was quality control: the information in TMDb is maintained by a big community. There are rules for doing so, as shown in the [TMDb bible](https://www.themoviedb.org/bible/movie#59f73b759251416e71000007). But there are many opportunities for errors, some errors produce outliers, and outliers are a problem. An easy mistake is entering revenue (or budget) in millions instead of units, or in a different currency. The rule is to *include revenue and budget information in dollars at the time the movie was released*, but this is not always followed. Having another source of information was useful to apply some quality control and clean up some outliers.
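
A cross-check between two sources can be sketched as a ratio test. The function name `classify_amount`, the thresholds and the labels are hypothetical choices for illustration, not the rules applied in the notebooks: a ratio close to 1 is accepted, a ratio close to one million (in either direction) suggests an amount entered in millions instead of units, and anything else is left for manual review (for example, a different currency).

```python
import math


def classify_amount(tmdb_value, reference_value, tolerance=0.25):
    """Compare one amount from TMDb with the same amount from a second source."""
    if not tmdb_value or not reference_value:
        return "missing"
    ratio = tmdb_value / reference_value
    if abs(ratio - 1) <= tolerance:
        return "ok"
    # A factor of about 10^6 either way hints at millions entered as units.
    if abs(abs(math.log10(ratio)) - 6) < 0.2:
        return "possible unit error"
    return "mismatch"


# Toy amounts in dollars (made up).
print(classify_amount(95_000_000, 100_000_000))
print(classify_amount(100, 100_000_000))
print(classify_amount(60_000_000, 100_000_000))
```

Rows flagged as a possible unit error or as a mismatch can then be checked by hand against a third source such as The Numbers before deciding whether to correct or drop them.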