# Exploratory Data Analysis using Python
## Movies
_by Virginia Herrero_

## Data collection
The Movie Database (TMDb) is a popular database for movies and TV shows. TMDb provides a RESTful web API. To access the TMDb API, it is necessary to register on their website and generate an API key. The process is as follows:

1. Register at https://www.themoviedb.org/account/signup
2. After successful registration, go to Settings > API
3. Generate the API key and store it somewhere safe

The TMDb API expects a movie ID. All movie IDs can be downloaded from their website as a compressed JSON file. For more information, see https://developer.themoviedb.org/docs/daily-id-exports. Each daily export file is only available for approximately 3 months, so it is recommended to use the current date to avoid problems when downloading the file. Note: the date in the file name uses the format MONTH_DAY_YEAR.
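The date part of the file name can be generated instead of typed by hand. The following is a minimal sketch, assuming the URL pattern used in this notebook; the helper name `daily_export_url` is hypothetical and not part of any library.

```python
import datetime


def daily_export_url(day):
    """Build the movie ID export URL for a given date (MONTH_DAY_YEAR)."""
    # The format codes pad month and day with zeros, e.g. 11_14_2024
    return f"http://files.tmdb.org/p/exports/movie_ids_{day:%m_%d_%Y}.json.gz"


# Example with a fixed date; datetime.date.today() could be passed instead
print(daily_export_url(datetime.date(2024, 11, 14)))
```

If the file for the current day has not been published yet, falling back to the previous day is a reasonable precaution.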

To download the daily ID export file, the urllib.request library is used. The compressed file is then decompressed using gzip. The result is binary data, so it must be decoded to text using UTF-8. Finally, the decoded text is stored in the variable content.

In [None]:
# Import all required libraries
import urllib.request
import gzip
import json

In [None]:
# URL of the daily ID export file
url = "http://files.tmdb.org/p/exports/movie_ids_11_14_2024.json.gz"

In [None]:
# Request to the API
request = urllib.request.Request(url, headers = {"Accept-Encoding": "gzip"})
# Response from the API
response = urllib.request.urlopen(request)
# Decompress the retrieved data
results = gzip.decompress(response.read())
# UTF-8 decoding
content = results.decode("utf-8")
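The decompress and decode steps can be tried without a network connection. The following is a minimal sketch on in-memory bytes; the sample line is illustrative and not real export data.

```python
import gzip

# Illustrative payload standing in for the downloaded file (not real TMDb data)
sample_text = '{"id": 1, "original_title": "Sample Movie"}\n'
compressed = gzip.compress(sample_text.encode("utf-8"))

# Same two steps as in the download cell: decompress the bytes, then decode as UTF-8
raw_bytes = gzip.decompress(compressed)
decoded = raw_bytes.decode("utf-8")

# Show that decompression yields bytes and decoding yields text
print(type(raw_bytes).__name__, type(decoded).__name__)
```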

Now the content of the daily ID export file is saved in the variable **content**. However, the file as a whole is not a valid JSON object; instead, each line of the file is one. Therefore, each line must be parsed as a separate JSON object and added to a list named **data**.

In [None]:
# Initialize empty data list
data = []

# Iterate through the lines of the variable content, skipping empty ones (the last line is empty)
for item in content.split("\n"):
    if not item:
        continue
    # Parse the JSON object into a Python dictionary
    parse_json = json.loads(item)
    # Append the dictionary to the data list
    data.append(parse_json)

data
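A file with one JSON object per line is often called JSON Lines. The following is a minimal sketch of the parsing step on a small illustrative string (the two records are made up for the example), skipping blank lines instead of relying on the position of the last line.

```python
import json

# Made-up records in the one-object-per-line layout
sample_content = (
    '{"id": 1, "original_title": "First Sample"}\n'
    '{"id": 2, "original_title": "Second Sample"}\n'
)

# Parse every non-empty line as its own JSON object
sample_data = [
    json.loads(line) for line in sample_content.split("\n") if line.strip()
]

print(len(sample_data), sample_data[0]["id"])
```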

Loop through the first 10 rows of the list **data** to see the IDs from the first 10 movies.

In [None]:
# Loop to obtain the first 10 movie IDs
for element in data[0:10]:
    print(element["id"])

Now the movie IDs and other information are stored in the list named **data**. To access and download all the information about the movies, it is necessary to use the TMDb API. For this purpose, use the Python library tmdbsimple and initialize it with the API key.

In [None]:
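import os
import tmdbsimple as tmdb
# Read the API key from an environment variable instead of hard-coding it (the name TMDB_API_KEY is an assumption)
tmdb.API_KEY = os.environ["TMDB_API_KEY"]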

Loop through the first 10 entries of the list **data** to take a look at the movie titles.

In [None]:
# Loop to obtain the first 10 movie titles
for element in data[0:10]:
    print(element["original_title"])

Now all the information about the movies can be accessed with the tmdbsimple library and the TMDb API. The following loop searches for each of the first 10 movies by its original title and prints the title, ID, release date, and popularity of every search result:

In [None]:
# Create search class instance
search = tmdb.Search()
# Iterate through the first 10 elements of the list data
for element in data[0:10]:
    response = search.movie(query = element["original_title"]) # Search movie by original title
    for result in search.results: # Iterate through the results of the search
        print(result["title"], result["id"], result["release_date"], result["popularity"]) # Print title, id, release date and popularity of each result

Using the tmdbsimple library, create a new variable called **movies** that stores, for each movie in **data**, its primary information together with its cast and crew information.

Iterate over the first 1000 entries of **data** to obtain a movie dataset of 1000 items. All the data is then stored in a JSON file named tmdb_movies.json.

In [None]:
# Initialize empty movies list
movies = []

# Loop through the first 1000 elements of the list data
for element in data[0:1000]:
    # Create a movie class instance
    movie = tmdb.Movies(element["id"])
    response = movie.info() # Get all the primary info available of the movie
    movie_all_info = response.copy() # Create a copy of the dictionary
    response2 = movie.credits() # Get all the info from cast and crew of that movie
    movie_all_info.update(response2) # Add the dictionary containing cast and crew info to the one with the primary information
    movies.append(movie_all_info) # Add the dictionary with all the info to the list movies

movies
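A single failed request would stop a long-running loop like this one. The following is a minimal sketch of a variant that records failures and continues. The helper `collect_movies` and the stand-in fetchers are hypothetical, and catching every exception is a deliberately broad choice for the sketch.

```python
def collect_movies(ids, fetch_info, fetch_credits):
    """Merge primary info and credits for each ID, skipping IDs that fail."""
    collected, failed = [], []
    for movie_id in ids:
        try:
            record = dict(fetch_info(movie_id))    # Copy of the primary info
            record.update(fetch_credits(movie_id))  # Add cast and crew info
        except Exception as error:
            # Remember the failing ID and move on to the next one
            failed.append((movie_id, str(error)))
            continue
        collected.append(record)
    return collected, failed


# Stand-in fetchers so the sketch runs without network access
def fake_info(movie_id):
    if movie_id == 2:
        raise ValueError("not found")
    return {"id": movie_id, "title": f"Movie {movie_id}"}


def fake_credits(movie_id):
    return {"id": movie_id, "cast": [], "crew": []}


collected, failed = collect_movies([1, 2, 3], fake_info, fake_credits)
print(len(collected), failed)
```

With tmdbsimple, the two callables could be `lambda i: tmdb.Movies(i).info()` and `lambda i: tmdb.Movies(i).credits()`, matching the calls used in the loop above.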

Create a file named tmdb_movies.json and store all the information of the variable **movies** in it to use it later as the dataset of this project. 

In [None]:
# Open the JSON file and serialize the list movies into it; the file is closed automatically
with open("tmdb_movies.json", "w") as tmdb_movies:
    json.dump(movies, tmdb_movies)
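Reading the file back is a quick way to confirm that the dataset was written correctly. The following is a minimal sketch of such a round trip, using a made-up record and a temporary file so that it does not touch tmdb_movies.json.

```python
import json
import os
import tempfile

# Made-up record standing in for the real movies list
sample_movies = [{"id": 1, "title": "Sample Movie", "cast": [], "crew": []}]

# Write to a temporary directory, then load the file again
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample_movies.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(sample_movies, handle)
    with open(path, encoding="utf-8") as handle:
        reloaded = json.load(handle)

# The reloaded data should equal what was written
print(reloaded == sample_movies)
```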