## Webscraping TheMovieDB website

### Goal
This notebook extracts ratings and vote counts for movies from TMDB, for use in a parent project built on MovieLens data.
Movie IDs come from the MovieLens *links.csv* file.

- This notebook uses the API provided by TMDB. A TMDB account and an API key are required.
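As a minimal sketch of the API setup, the key can be read from an environment variable and combined with the movie-details URL used later in this notebook. The helper name `build_movie_url` is hypothetical, and casting the ID to `int` is an assumption (the IDs in links.csv load as floats such as `862.0`); the notebook itself passes the float as-is.

```python
import os
from urllib.parse import urlencode

# Base URL taken from the request made in Step 3 of this notebook
TMDB_BASE_URL = "https://api.themoviedb.org/3/movie"


def build_movie_url(tmdb_id, api_key):
    """Return the movie-details URL for a single TMDB id (hypothetical helper)."""
    # Assumption: normalise float ids such as 862.0 to 862 for the URL path
    query = urlencode({"api_key": api_key})
    return "{}/{}?{}".format(TMDB_BASE_URL, int(tmdb_id), query)


# Same environment variable name as in Step 3
api_key = os.environ.get("TMDB_API_KEY", "")
if not api_key:
    print("TMDB_API_KEY is not set; API requests would likely be rejected")
```

Keeping the key in the environment rather than in the notebook avoids committing the secret along with the code.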

### Process
The TMDB API returns data for one movie at a time.
General information for ~58k unique movies is retrieved and written to the **tmdb_json_dump.txt** file in raw JSON format.
Each JSON response is parsed, and the rating and vote count needed for the MovieLens analytics project are saved in **tmdb_ratings_file.csv**.
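Because the dump file holds one JSON document per line, the ratings fields could also be re-extracted from it later without calling the API again. The sketch below shows the idea for a single line. The helper name `parse_dump_line` is hypothetical, the `id` field is assumed to be present in the response, and the sample line is invented for illustration, not real TMDB data.

```python
import json


def parse_dump_line(line):
    """Pull the fields used downstream out of one line of the raw dump."""
    record = json.loads(line)
    # Column names mirror those used for tmdb_ratings_file.csv in Step 3
    return {
        "tmdbId": record.get("id"),  # assumption: the response carries an "id" field
        "tmdbId_title": record.get("title"),
        "tmdbId_vote_average": record.get("vote_average"),
        "tmdbId_vote_count": record.get("vote_count"),
    }


# Invented sample line, for illustration only
sample = '{"id": 1, "title": "Example Movie", "vote_average": 5.0, "vote_count": 2}'
print(parse_dump_line(sample))
```

Applying this to every line of tmdb_json_dump.txt would rebuild the ratings table from the raw dump.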

NOTE: The data is extracted in a loop, which took more than 10 hours and was interrupted twice by application (Python) failures. Step 2 exists to handle such restarts.

##### Step 1
- Load the links.csv file to get the TMDB movie IDs required for data extraction.
- Remove nulls.
- Convert to a list.

In [1]:
#Imports
import pandas as pd
import numpy as np

#read file
ml_links = pd.read_csv('links.csv')
ml_links.head(10)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
5,6,113277,949.0
6,7,114319,11860.0
7,8,112302,45325.0
8,9,114576,9091.0
9,10,113189,710.0


In [2]:
#check nulls
ml_links[ml_links['tmdbId'].isnull()]

Unnamed: 0,movieId,imdbId,tmdbId
709,721,114103,
718,730,125877,
756,769,116992,
757,770,38426,
778,791,113610,
...,...,...,...
24855,114963,322250,
25029,115715,3670792,
25057,115821,3900116,
29740,128734,4438688,


In [3]:
ml_links.shape

(58098, 3)

In [4]:
#remove nulls
tmdbid = ml_links['tmdbId'].dropna().tolist()

In [5]:
len(tmdbid)

57917

##### Step 2
- Use this step when data is collected in batches or the program has to be restarted for any reason. It excludes IDs that have already been collected.

In [6]:
#remove id's from above list that are already collected
collected_ids = pd.read_csv('tmdb_ratings_file.csv')
collected_tmdbId = collected_ids['tmdbId'].tolist()
len(collected_tmdbId)

42184

In [7]:
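#a set makes each membership check fast compared with scanning a list
collected_set = set(collected_tmdbId)
tmdbid_remaining = [tmid for tmid in tmdbid if tmid not in collected_set]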
len(tmdbid_remaining)

15706
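The Step 2 cells assume that tmdb_ratings_file.csv already exists, which is not the case on the very first run. A minimal sketch of a helper that covers both cases is shown below. The function name `remaining_ids` and its `ratings_path` parameter are hypothetical; only the file name and the `tmdbId` column come from this notebook.

```python
import os

import pandas as pd


def remaining_ids(all_ids, ratings_path="tmdb_ratings_file.csv"):
    """Return the ids from all_ids that are not yet in the ratings file."""
    # First run: nothing has been collected, so every id is still pending
    if not os.path.exists(ratings_path):
        return list(all_ids)
    collected = set(pd.read_csv(ratings_path)["tmdbId"].tolist())
    # Keep the original order of all_ids
    return [tmdb_id for tmdb_id in all_ids if tmdb_id not in collected]
```

With this helper, the same cell could be used for the first run and for every restart, for example `tmdbid_remaining = remaining_ids(tmdbid)`.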

##### Step 3
- Open the API connection, pass one movie ID at a time, append the JSON response to the txt file, and extract the required fields to the csv file.
- If a movie ID is not found, a short message is printed. Data for ~600 movies was not found.

In [None]:
import urllib.request
import json
import os
import time

start_time = time.time()

API_KEY = os.environ.get('TMDB_API_KEY')

for ID in tmdbid_remaining:
    tmdb_ratings = pd.DataFrame(columns=["tmdbId", "tmdbId_title", "tmdbId_vote_average", "tmdbId_vote_count"])
    temp_list = []
    try:
        with urllib.request.urlopen("https://api.themoviedb.org/3/movie/{}?api_key={}".format(ID, API_KEY)) as url:
            data = json.loads(url.read().decode())
            #write json dump to file
            with open('tmdb_json_dump.txt', 'a') as json_dump:
                json_dump.write(json.dumps(data))
                json_dump.write("\n")
            
            #get required fields to create a dataframe
            temp_list.append(ID)
            temp_list.append(data['title'])
            temp_list.append(data['vote_average'])
            temp_list.append(data['vote_count'])
            tmdb_ratings.loc[len(tmdb_ratings)] = temp_list
            if os.path.exists('tmdb_ratings_file.csv'):
                tmdb_ratings.to_csv('tmdb_ratings_file.csv', mode='a', header=False)
            else:
                tmdb_ratings.to_csv('tmdb_ratings_file.csv')
        
    except Exception:
        #any failure (HTTP error, network error, missing field) is reported as not found
        print("{} not found".format(ID))
        #tmdb_ratings.loc[len(tmdb_ratings)] = [ID, np.NaN, np.NaN, np.NaN]
    
end_time = time.time()
print(end_time - start_time)

538286.0 not found
12773.0 not found
17882.0 not found
68149.0 not found
24549.0 not found
14980.0 not found
164721.0 not found
140207.0 not found
192936.0 not found
876.0 not found
2413.0 not found
82205.0 not found
149645.0 not found
8677.0 not found
13057.0 not found
119324.0 not found
2670.0 not found
215993.0 not found
47350.0 not found
13519.0 not found
152426.0 not found
30983.0 not found
7096.0 not found
15738.0 not found
11944.0 not found
110147.0 not found
15024.0 not found
206216.0 not found
19341.0 not found
2518.0 not found
36763.0 not found
64699.0 not found
69234.0 not found
13716.0 not found
11343.0 not found
185441.0 not found
18976.0 not found
10700.0 not found
24019.0 not found
37525.0 not found
15594.0 not found
24269.0 not found
41758.0 not found
58923.0 not found
17266.0 not found
17919.0 not found
253768.0 not found
78057.0 not found
34573.0 not found
27138.0 not found
49870.0 not found
244797.0 not found
21847.0 not found
31653.0 not found
14305.0 not found
1354
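The loop in Step 3 prints "not found" for every kind of failure, so a missing movie cannot be told apart from a temporary network problem. The sketch below shows one way the two cases might be separated. It assumes that TMDB answers unknown IDs with HTTP 404; the function name `fetch_movie` and the parameters `opener`, `retries`, and `wait_seconds` are hypothetical and not part of the original notebook.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_movie(url, opener=urllib.request.urlopen, retries=3, wait_seconds=1.0):
    """Return the parsed JSON for url, or None if the API answers 404.

    Other HTTP errors and network errors are retried, then re-raised.
    """
    last_error = RuntimeError("no attempts were made")
    for attempt in range(1, retries + 1):
        try:
            with opener(url) as response:
                return json.loads(response.read().decode())
        except urllib.error.HTTPError as err:
            # Assumption: 404 means the id does not exist, so retrying is pointless
            if err.code == 404:
                return None
            last_error = err
        except urllib.error.URLError as err:
            # Network-level failure (DNS, refused connection, ...): worth retrying
            last_error = err
        if attempt < retries:
            time.sleep(wait_seconds)
    raise last_error
```

A `None` result can then be logged as "not found", while an exception signals a failure worth retrying in a later batch. The `opener` parameter exists only so the function can be exercised without network access.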