# Data Collection Using the API of www.themoviedb.org

## 1. Import Libraries


In [7]:
import requests
import pandas as pd
import matplotlib.pyplot as plt
import json
import numpy as np
np.random.seed(102)

## 2. Defining Functions

### 2a. Function to get API key
See the link below for an introduction to the API and instructions for obtaining an API key:
[Getting Started](https://developers.themoviedb.org/3/getting-started/introduction)

In [8]:
# Function to load the API keys
def get_keys(path):
    '''
    Function:
    Returns the API keys stored in a JSON file as a dictionary.
    
    Parameters:
    path: (type string). Path to the JSON file with the API keys.

    '''
    with open(path) as f:
        return json.load(f)

# Using the function to open and load all keys in that file 
api_keys = get_keys("/Users/ASM/Documents/secrets/tmdb_api.json")

# Setting the first (and only) value as a variable
key = list(api_keys.values())[0]
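
The key file itself is not shown, so the following is only a sketch of a layout that would work with `get_keys()`: a JSON object with a single entry. The field name `api_key` and the placeholder value are assumptions made for illustration. The sketch writes such a file to a temporary directory and reads it back the same way as the cell above.

```python
import json
import os
import tempfile

def get_keys(path):
    """Load the JSON file at `path` and return its contents as a dict."""
    with open(path) as f:
        return json.load(f)

# Hypothetical file contents: the field name and the value are placeholders.
sample = {"api_key": "YOUR_TMDB_KEY"}

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "tmdb_api.json")
    with open(sample_path, "w") as f:
        json.dump(sample, f)
    loaded = get_keys(sample_path)

# Same extraction as above: take the first (and only) value.
sample_key = list(loaded.values())[0]
print(sample_key)
```

Keeping the key in a separate file outside the project folder means it never appears in the notebook itself.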

### 2b. Function to get lists with movie IDs and dates
The function returns two lists: movie IDs and the corresponding release dates.
For more information, see the [TMDB DISCOVER API](https://developers.themoviedb.org/3/discover/movie-discover) documentation.

In [9]:
# Function to get the list of movie IDs
# and the list of movie release dates
def discover_request(pages = 1, year=2013):
    '''
    Function:
    Returns a list of movie IDs and a list of the corresponding release dates.
    The source is the TMDB.org Discover API.
    
    Parameters:
    
    pages: (type integer). Default = 1. 
    The number of result pages to request. NOTE: each page holds 20 movies.
    
    year: (type integer). Default = 2013.
    Lower bound for the release year (looking at all release dates): only movies released in this year or later are requested.
    '''
    pages_list = range(1,pages+1)
    movies_list = []
    movies_list_date = []
    for page in pages_list:
        discover_response = requests.get(f'https://api.themoviedb.org/3/discover/movie?api_key={key}&language=en-US&region=US&sort_by=popularity.desc&include_adult=false&include_video=false&page={page}&release_date.gte={year}&with_original_language=en')
        discover_list = discover_response.json()['results']
        discover_list_id = [i['id'] for i in discover_list]
        discover_list_date = [i['release_date'] for i in discover_list]
        movies_list += discover_list_id
        movies_list_date += discover_list_date
    
    return movies_list, movies_list_date
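
The request URL above packs every filter into one long f-string. As a rough sketch of the same query, the parameters can be kept in a dictionary and encoded with the standard library. The helper name `build_discover_url` and the placeholder key are hypothetical; the parameter names and values are the ones used in `discover_request()`.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://api.themoviedb.org/3/discover/movie"

def build_discover_url(api_key, page=1, year=2013):
    """Return the Discover URL for one results page (hypothetical helper)."""
    params = {
        "api_key": api_key,
        "language": "en-US",
        "region": "US",
        "sort_by": "popularity.desc",
        "include_adult": "false",
        "include_video": "false",
        "page": page,
        "release_date.gte": year,
        "with_original_language": "en",
    }
    return f"{BASE_URL}?{urlencode(params)}"

# Build the URL for page 3 with a placeholder key and read the query back.
url = build_discover_url("YOUR_TMDB_KEY", page=3, year=2000)
query = parse_qs(urlsplit(url).query)
print(query["page"], query["release_date.gte"])
```

With `requests`, the same dictionary can be passed directly as `requests.get(BASE_URL, params=params)`, which performs the encoding itself.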


### 2c. Function to get lists of dictionaries with detailed data for each movie.
The function returns the title, genres, popularity, average vote, budget, revenue, production company and country, release date, etc.
For more information, see the [TMDB MOVIES API](https://developers.themoviedb.org/3/movies/get-movie-details) documentation.


In [10]:
# Function to get the list with detailed data.
# Expected input is the movies_list returned by the discover_request() function.
def movies_request(movies_list):
    '''
    Function:
    Returns a list of dictionaries with detailed information for each movie (title, genres, popularity, average vote, budget, revenue, production company and country, release date, etc.)
    The source is the TMDB.org Movie API.
    
    Parameters:
    movies_list: (type list). List of movie IDs. 
    
    '''
    movies_list_complete = []
    # Each element of movies_list is a numeric movie ID
    for movie_id in movies_list:
        response = requests.get(f'https://api.themoviedb.org/3/movie/{movie_id}?api_key={key}&language=en-US')
        movies_list_complete.append(response.json())
    return movies_list_complete
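
`movies_request()` appends whatever each request returns, so a request that fails can either stop the loop or add a row that is not movie data. One possible way to guard against this, sketched below, is to separate the loop from the request and record the IDs that failed. The names `collect_details` and `fake_fetch` are hypothetical, the IDs are made up, and the stand-in function replaces the real HTTP call so that the sketch runs offline.

```python
import time

def collect_details(movie_ids, fetch, pause=0.0):
    """Call fetch(movie_id) for every ID and keep only successful results.

    `fetch` is any callable that returns a dict or raises an exception;
    `pause` is the delay in seconds between consecutive calls.
    """
    collected, failed = [], []
    for movie_id in movie_ids:
        try:
            collected.append(fetch(movie_id))
        except Exception:
            # Remember the ID so that it can be retried later.
            failed.append(movie_id)
        time.sleep(pause)
    return collected, failed

# Stand-in for a real request: ID 2 always fails.
def fake_fetch(movie_id):
    if movie_id == 2:
        raise ValueError("simulated failed request")
    return {"id": movie_id, "title": f"Movie {movie_id}"}

details, failed_ids = collect_details([1, 2, 3], fake_fetch)
print(len(details), failed_ids)
```

In a real run, `fetch` could wrap `requests.get(...)` followed by `response.raise_for_status()`, so that HTTP errors raise an exception and end up in the failed list.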

## 3. Getting Data by Sending API Requests

Request 500 pages of results, filtered to movies released in 2000 or later.
[TMDB DISCOVER API](https://developers.themoviedb.org/3/discover/movie-discover)

In [13]:
# Uncomment the line below only if you need to query the API again:

# movies_list, movies_list_date= discover_request(pages=500,year=2000)

The code below combines the two lists into a data frame and sorts it by release year.
It then builds a list of IDs for movies released in 2000 or later (in case the site did not apply the filter correctly).

In [547]:
# Create data frame with IDs and release date
df_id_date = pd.DataFrame(list(zip(movies_list, movies_list_date)), 
               columns =['ID', 'release_date'])

# convert release date to 'datetime' type
df_id_date['release_date'] = pd.to_datetime(df_id_date.release_date) 

# create a column with years only
df_id_date['release_date_2'] = df_id_date.release_date.dt.year

# sort the data frame by release year in descending order (from the most recent to the oldest)
df_id_date = df_id_date.sort_values('release_date_2', ascending=False)

# get the list of IDs of movies released in 2000 or later
list_2000 = list(df_id_date.loc[(df_id_date['release_date_2'] >= 2000)]["ID"])
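
To make the filter easier to check by hand, here is a plain-Python sketch of the same steps on a few made-up rows. It assumes that a missing release date arrives as an empty string, and it drops such rows, which is also what the `>= 2000` comparison does to rows without a year. The sample IDs and dates are invented for illustration.

```python
from datetime import date

# Made-up sample rows: (movie ID, release date string).
sample_rows = [(101, "2015-06-12"), (102, "1999-03-31"), (103, ""), (104, "2000-01-01")]

def release_year(date_string):
    """Return the year of a YYYY-MM-DD string, or None when it is empty."""
    if not date_string:
        return None
    return date.fromisoformat(date_string).year

# Pair every ID with its release year.
dated = [(movie_id, release_year(d)) for movie_id, d in sample_rows]

# Keep rows with a known year of 2000 or later, most recent year first.
kept = sorted(
    ((movie_id, year) for movie_id, year in dated if year is not None and year >= 2000),
    key=lambda row: row[1],
    reverse=True,
)
sample_list_2000 = [movie_id for movie_id, _ in kept]
print(sample_list_2000)
```

Only the movies from 2015 and 2000 survive; the 1999 release and the row without a date are removed.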

Request the detailed data for each movie, returned as a list of dictionaries.
[TMDB MOVIES API](https://developers.themoviedb.org/3/movies/get-movie-details)

In [12]:
# Uncomment the lines below only if you need to query the API again, as this sends a few thousand requests:

# movies_list_complete= movies_request(list_2000)
# movies_list_complete

## 4. Create a Data Frame and Save It as a CSV File

NOTE: all data were collected using the API of themoviedb.org. For the terms of use of the data below, please consult the [TMDB terms-of-use](https://www.themoviedb.org/documentation/api/terms-of-use).

In [629]:
df= pd.DataFrame(movies_list_complete)

In [630]:
df.to_csv('df_movies.csv')
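
Some fields in the movie details, such as genres, are lists of dictionaries, and they are written to the CSV file as the text form of those lists. If a flat column is preferred, one option is to reduce such a field to a string of names before building the data frame. The sketch below is only an illustration: the helper `flatten_names` is hypothetical and the sample record is made up, with each genre entry assumed to carry a `name` field.

```python
def flatten_names(movie, field="genres"):
    """Join the `name` of every entry in a list-of-dicts field into one string."""
    return ", ".join(entry["name"] for entry in movie.get(field) or [])

# Made-up record with a nested genres field.
sample_movie = {
    "id": 1,
    "title": "Example Movie",
    "genres": [{"id": 18, "name": "Drama"}, {"id": 35, "name": "Comedy"}],
}

# Copy the record, replacing the nested list with a flat string.
flat_movie = dict(sample_movie, genres=flatten_names(sample_movie))
print(flat_movie["genres"])
```

The original record is left unchanged, so the nested data stays available if it is needed later.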