In [2]:
import pandas as pd
import requests
import config

## Extract
A single GET request to the API returns one JSON record for the movie_id we specify.
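As a minimal sketch of one such request, assuming the same v3 movie endpoint this notebook uses: the names `build_movie_url`, `fetch_movie`, and the `http_get` parameter are hypothetical helpers introduced for illustration, not part of the original notebook or of the requests library.

```python
def build_movie_url(movie_id):
    """Return the v3 movie-details URL for one movie_id."""
    return f"https://api.themoviedb.org/3/movie/{movie_id}"


def fetch_movie(movie_id, api_key, http_get=None, timeout=10):
    """Request one movie record and return the decoded JSON as a dict.

    http_get is a hypothetical hook so the function can be tried without
    network access; when omitted, requests.get is used.
    """
    if http_get is None:
        import requests  # imported lazily so a stand-in getter works without it
        http_get = requests.get
    response = http_get(
        build_movie_url(movie_id),
        params={"api_key": api_key},  # sent as ?api_key=... in the query string
        timeout=timeout,              # avoid hanging forever on a stalled connection
    )
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
    return response.json()
```

With a real key, a call such as `fetch_movie(550, config.api_key)` should return the record for movie 550 as a Python dict.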
For this exercise, we’re going to request 6 movies with movie_id ranging from 550 to 555. We create a loop that requests each movie one at a time and appends the response to a list.

In [3]:
response_list = []
API_KEY = config.api_key

for movie_id in range(550,556): 
    url = f'https://api.themoviedb.org/3/movie/{movie_id}?api_key={API_KEY}'
    r = requests.get(url)
    response_list.append(r.json())

We now have a list of long, unwieldy JSON records delivered to us from the API. We create a pandas DataFrame from the records using from_dict(), which turns each record into one row:

In [4]:
df = pd.DataFrame.from_dict(response_list)
df

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,/rr7E0NoGKxvbkb89eR1GwfoYjpA.jpg,,63000000,"[{'id': 18, 'name': 'Drama'}]",http://www.foxmovies.com/movies/fight-club,550,tt0137523,en,Fight Club,...,1999-10-15,100853753,139,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Mischief. Mayhem. Soap.,Fight Club,False,8.4,24366
1,False,/v1QEIuBM1vvpvfqalahhIyXY0Cm.jpg,"{'id': 372257, 'name': 'The Poseidon Adventure...",5000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,551,tt0069113,en,The Poseidon Adventure,...,1972-12-13,84563118,117,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,"Hell, upside down.",The Poseidon Adventure,False,7.2,658
2,False,/k4JIHyAXaGHwAwT7y5Skd17f0Wl.jpg,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...",,552,tt0237539,it,Pane e tulipani,...,2000-03-03,8478434,114,"[{'english_name': 'Italian', 'iso_639_1': 'it'...",Released,Imagine your life. Now go live it.,Bread and Tulips,False,7.3,210
3,False,/r3xsFBD1VTUusk393bBc7SsDUJe.jpg,"{'id': 1952, 'name': 'USA: Land of Opportuniti...",10000000,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",,553,tt0276919,en,Dogville,...,2003-05-19,16680836,178,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,A quiet little town not far from here.,Dogville,False,7.8,1950
4,False,/1qwXItFKqvKYyW1CwbYhxyUC8Pj.jpg,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 36, 'name...",,554,tt0308476,ru,Кукушка,...,2002-01-01,0,100,"[{'english_name': 'German', 'iso_639_1': 'de',...",Released,She's Making Peace One Man at a Time.,The Cuckoo,False,7.1,65
5,False,,,0,"[{'id': 53, 'name': 'Thriller'}]",http://www.luecke-im-system.de/,555,tt0442896,en,Absolut,...,2005-04-20,0,94,"[{'english_name': 'French', 'iso_639_1': 'fr',...",Released,,Absolut,False,7.9,20
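To see how a list of records maps onto rows and columns, here is a small sketch on made-up data. The two records are trimmed stand-ins for API responses, and the second one deliberately lacks a `tagline` key. The plain `pd.DataFrame` constructor is used here; it also accepts a list of dictionaries.

```python
import pandas as pd

# Two made-up records shaped like heavily trimmed API responses.
records = [
    {"id": 550, "title": "Fight Club", "tagline": "Mischief. Mayhem. Soap."},
    {"id": 555, "title": "Absolut"},  # no "tagline" key on purpose
]

# Each dict becomes a row; the union of all keys becomes the columns.
sample_df = pd.DataFrame(records)

# A key missing from a record shows up as a missing value in that row.
missing_tagline = sample_df["tagline"].isna().tolist()
```

This is why sparse fields in the real output appear empty for some movies: the record simply had no value to put there.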


## Transform

We create a list of column names called df_columns that allows us to select the columns we want from the main dataframe.

In [5]:
df_columns = ['budget', 'genres', 'id', 'imdb_id', 'original_title', 'release_date', 'revenue', 'runtime']
df_columns

['budget',
 'genres',
 'id',
 'imdb_id',
 'original_title',
 'release_date',
 'revenue',
 'runtime']

### Genres

The genres column holds, for each movie, a list of JSON records. That is hard to read or quickly understand in this format, so we want to expand the column out to make the internal records easy to see and use.

In [6]:
genres_list = df['genres'].tolist()
genres_list

[[{'id': 18, 'name': 'Drama'}],
 [{'id': 28, 'name': 'Action'},
  {'id': 12, 'name': 'Adventure'},
  {'id': 53, 'name': 'Thriller'}],
 [{'id': 35, 'name': 'Comedy'}, {'id': 10749, 'name': 'Romance'}],
 [{'id': 80, 'name': 'Crime'},
  {'id': 18, 'name': 'Drama'},
  {'id': 53, 'name': 'Thriller'}],
 [{'id': 18, 'name': 'Drama'},
  {'id': 36, 'name': 'History'},
  {'id': 10749, 'name': 'Romance'},
  {'id': 35, 'name': 'Comedy'}],
 [{'id': 53, 'name': 'Thriller'}]]

In [7]:
flat_list = [item for sublist in genres_list for item in sublist]
flat_list

[{'id': 18, 'name': 'Drama'},
 {'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 53, 'name': 'Thriller'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 10749, 'name': 'Romance'},
 {'id': 80, 'name': 'Crime'},
 {'id': 18, 'name': 'Drama'},
 {'id': 53, 'name': 'Thriller'},
 {'id': 18, 'name': 'Drama'},
 {'id': 36, 'name': 'History'},
 {'id': 10749, 'name': 'Romance'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 53, 'name': 'Thriller'}]
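The nested comprehension reads as two loops written left to right: the outer one over movies, the inner one over each movie's genres. An equivalent sketch using the standard library's `itertools.chain.from_iterable`, on made-up sample data:

```python
from itertools import chain

# One inner list per movie; the last movie has no genres at all.
nested = [
    [{"id": 18, "name": "Drama"}],
    [{"id": 28, "name": "Action"}, {"id": 53, "name": "Thriller"}],
    [],
]

# chain.from_iterable walks each inner list in turn and yields its items,
# so empty inner lists simply contribute nothing.
flat = list(chain.from_iterable(nested))
```

Either form works; the comprehension avoids an import, while `chain.from_iterable` makes the intent explicit.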

We’ll create a temporary column called genres_all as a list of lists of genres that we can later expand out into a separate column for each genre.

In [8]:
# Build one list of genre names per movie, in the same row order as df.
result = [[genre['name'] for genre in genres] for genres in genres_list]
df = df.assign(genres_all=result)
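The same column can also be derived directly from the genres column, without the intermediate list. This is a sketch on a made-up two-row frame; `genre_names` is a hypothetical helper name.

```python
import pandas as pd

sample = pd.DataFrame({
    "id": [550, 551],
    "genres": [
        [{"id": 18, "name": "Drama"}],
        [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}],
    ],
})


def genre_names(genres):
    """Return just the 'name' field of each genre record."""
    return [genre["name"] for genre in genres]


# apply() calls genre_names once per row and keeps the index aligned.
sample["genres_all"] = sample["genres"].apply(genre_names)
```

Because `apply` preserves the index, this variant stays correct even if the frame has been filtered or reordered since the list was extracted.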

In [9]:
df

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,genres_all
0,False,/rr7E0NoGKxvbkb89eR1GwfoYjpA.jpg,,63000000,"[{'id': 18, 'name': 'Drama'}]",http://www.foxmovies.com/movies/fight-club,550,tt0137523,en,Fight Club,...,100853753,139,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Mischief. Mayhem. Soap.,Fight Club,False,8.4,24366,[Drama]
1,False,/v1QEIuBM1vvpvfqalahhIyXY0Cm.jpg,"{'id': 372257, 'name': 'The Poseidon Adventure...",5000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,551,tt0069113,en,The Poseidon Adventure,...,84563118,117,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,"Hell, upside down.",The Poseidon Adventure,False,7.2,658,"[Action, Adventure, Thriller]"
2,False,/k4JIHyAXaGHwAwT7y5Skd17f0Wl.jpg,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...",,552,tt0237539,it,Pane e tulipani,...,8478434,114,"[{'english_name': 'Italian', 'iso_639_1': 'it'...",Released,Imagine your life. Now go live it.,Bread and Tulips,False,7.3,210,"[Comedy, Romance]"
3,False,/r3xsFBD1VTUusk393bBc7SsDUJe.jpg,"{'id': 1952, 'name': 'USA: Land of Opportuniti...",10000000,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",,553,tt0276919,en,Dogville,...,16680836,178,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,A quiet little town not far from here.,Dogville,False,7.8,1950,"[Crime, Drama, Thriller]"
4,False,/1qwXItFKqvKYyW1CwbYhxyUC8Pj.jpg,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 36, 'name...",,554,tt0308476,ru,Кукушка,...,0,100,"[{'english_name': 'German', 'iso_639_1': 'de',...",Released,She's Making Peace One Man at a Time.,The Cuckoo,False,7.1,65,"[Drama, History, Romance, Comedy]"
5,False,,,0,"[{'id': 53, 'name': 'Thriller'}]",http://www.luecke-im-system.de/,555,tt0442896,en,Absolut,...,0,94,"[{'english_name': 'French', 'iso_639_1': 'fr',...",Released,,Absolut,False,7.9,20,[Thriller]


Here’s where we create the genres table. from_records() builds a DataFrame from the flat list of genre records, and drop_duplicates() keeps one row per genre. Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_records.html

In [10]:
df_genres = pd.DataFrame.from_records(flat_list).drop_duplicates()

In [11]:
df_genres

Unnamed: 0,id,name
0,18,Drama
1,28,Action
2,12,Adventure
3,53,Thriller
4,35,Comedy
5,10749,Romance
6,80,Crime
10,36,History
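Notice that the index in the output jumps from 6 to 10. drop_duplicates() keeps the label of the first occurrence of each row and discards the rest, so the gaps mark where the repeated genres used to be. A small sketch on made-up data, with an optional reset_index() step that the original notebook does not use:

```python
import pandas as pd

flat = [
    {"id": 18, "name": "Drama"},
    {"id": 53, "name": "Thriller"},
    {"id": 18, "name": "Drama"},    # duplicate of row 0, will be dropped
    {"id": 36, "name": "History"},
]

# The surviving rows keep their original labels: 0, 1 and 3.
deduped = pd.DataFrame.from_records(flat).drop_duplicates()

# Optional: renumber the rows 0..n-1 and discard the old labels.
renumbered = deduped.reset_index(drop=True)
```

Since the table is later written with index=False, the gaps never reach the CSV file; renumbering only matters if we rely on the index elsewhere.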


In [12]:
df_columns = ['budget', 'id', 'imdb_id', 'original_title', 'release_date', 'revenue', 'runtime']
df_columns

['budget',
 'id',
 'imdb_id',
 'original_title',
 'release_date',
 'revenue',
 'runtime']

In [13]:
df_genre_columns = df_genres['name'].to_list()
df_genre_columns 

['Drama',
 'Action',
 'Adventure',
 'Thriller',
 'Comedy',
 'Romance',
 'Crime',
 'History']

In [14]:
# The extend() method adds the specified list elements (or any iterable) to the end of the current list.
# https://www.w3schools.com/python/ref_list_extend.asp 
df_columns.extend(df_genre_columns)
df_columns

['budget',
 'id',
 'imdb_id',
 'original_title',
 'release_date',
 'revenue',
 'runtime',
 'Drama',
 'Action',
 'Adventure',
 'Thriller',
 'Comedy',
 'Romance',
 'Crime',
 'History']

In [15]:
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html 
# Transform each element of a list-like to a row, replicating index values.
s = df['genres_all'].explode()
s

0        Drama
1       Action
1    Adventure
1     Thriller
2       Comedy
2      Romance
3        Crime
3        Drama
3     Thriller
4        Drama
4      History
4      Romance
4       Comedy
5     Thriller
Name: genres_all, dtype: object

In [16]:
# https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html
# pandas.crosstab(index, columns)
pd.crosstab(s.index, s)

row_0,Action,Adventure,Comedy,Crime,Drama,History,Romance,Thriller
0,0,0,0,0,1,0,0,0
1,1,1,0,0,0,0,0,1
2,0,0,1,0,0,0,1,0
3,0,0,0,1,1,0,0,1
4,0,0,1,0,1,1,1,0
5,0,0,0,0,0,0,0,1
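One caveat worth checking on your own data: a movie with an empty genre list explodes to a single missing value, and with default settings crosstab is not expected to produce a row for it, which would leave gaps after a join. The sketch below, on made-up data (movie 999 is invented), reindexes the indicator table against the original frame so such movies get a row of zeros.

```python
import pandas as pd

sample = pd.DataFrame({
    "id": [550, 551, 999],  # 999 is a made-up movie with no genres
    "genres_all": [["Drama"], ["Action", "Thriller"], []],
})

# One row per (movie, genre) pair; the empty list becomes one missing value.
s = sample["genres_all"].explode()

# Count genre occurrences per original row label.
indicators = pd.crosstab(s.index, s)

# Make sure every movie has a row, filling absent ones with zeros.
indicators = indicators.reindex(sample.index, fill_value=0)

joined = sample.join(indicators)
```

The reindex step is harmless when every movie has at least one genre, as in the six movies used here.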


In [17]:
df = df.join(pd.crosstab(s.index, s))
df

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,vote_count,genres_all,Action,Adventure,Comedy,Crime,Drama,History,Romance,Thriller
0,False,/rr7E0NoGKxvbkb89eR1GwfoYjpA.jpg,,63000000,"[{'id': 18, 'name': 'Drama'}]",http://www.foxmovies.com/movies/fight-club,550,tt0137523,en,Fight Club,...,24366,[Drama],0,0,0,0,1,0,0,0
1,False,/v1QEIuBM1vvpvfqalahhIyXY0Cm.jpg,"{'id': 372257, 'name': 'The Poseidon Adventure...",5000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,551,tt0069113,en,The Poseidon Adventure,...,658,"[Action, Adventure, Thriller]",1,1,0,0,0,0,0,1
2,False,/k4JIHyAXaGHwAwT7y5Skd17f0Wl.jpg,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...",,552,tt0237539,it,Pane e tulipani,...,210,"[Comedy, Romance]",0,0,1,0,0,0,1,0
3,False,/r3xsFBD1VTUusk393bBc7SsDUJe.jpg,"{'id': 1952, 'name': 'USA: Land of Opportuniti...",10000000,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",,553,tt0276919,en,Dogville,...,1950,"[Crime, Drama, Thriller]",0,0,0,1,1,0,0,1
4,False,/1qwXItFKqvKYyW1CwbYhxyUC8Pj.jpg,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 36, 'name...",,554,tt0308476,ru,Кукушка,...,65,"[Drama, History, Romance, Comedy]",0,0,1,0,1,1,1,0
5,False,,,0,"[{'id': 53, 'name': 'Thriller'}]",http://www.luecke-im-system.de/,555,tt0442896,en,Absolut,...,20,[Thriller],0,0,0,0,0,0,0,1


### Working with datetimes

Finally, we’ll expand the release_date column into a table of date parts. Pandas has built-in accessors to extract specific parts of a datetime. Notice that we need to convert the release_date column to a datetime first, because the API delivers it as a string.

In [18]:
df['release_date'] = pd.to_datetime(df['release_date'])
df['day'] = df['release_date'].dt.day
df['month'] = df['release_date'].dt.month
df['year'] = df['release_date'].dt.year
df['day_of_week'] = df['release_date'].dt.day_name()
df_time_columns = ['id', 'release_date', 'day', 'month', 'year', 'day_of_week']

In [19]:
df[df_time_columns]

Unnamed: 0,id,release_date,day,month,year,day_of_week
0,550,1999-10-15,15,10,1999,Friday
1,551,1972-12-13,13,12,1972,Wednesday
2,552,2000-03-03,3,3,2000,Friday
3,553,2003-05-19,19,5,2003,Monday
4,554,2002-01-01,1,1,2002,Tuesday
5,555,2005-04-20,20,4,2005,Wednesday
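All six movies here have a release date, but that may not hold for every record. As a sketch under that assumption, an empty string below stands in for a missing date; passing errors="coerce" turns anything unparseable into NaT rather than raising.

```python
import pandas as pd

# The empty string is a made-up stand-in for a record without a release date.
dates = pd.Series(["1999-10-15", "2005-04-20", ""])

# Values that cannot be parsed become NaT instead of raising an error.
parsed = pd.to_datetime(dates, errors="coerce")

parts = pd.DataFrame({
    "release_date": parsed,
    "year": parsed.dt.year,                # missing for NaT rows
    "day_of_week": parsed.dt.day_name(),   # missing for NaT rows
})
```

Rows with NaT carry missing values through every derived column, so they are easy to filter out or inspect before loading.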


## Load

We ended up creating 3 tables for the tmdb schema that we’ll call movies, genres, and datetimes. We export the tables by writing each one to a file. Because the file names are relative paths, this creates 3 .csv files in the current working directory, which is usually the directory the notebook or script is run from.

In [20]:
df[df_columns].to_csv('tmdb_movies.csv', index=False)
df_genres.to_csv('tmdb_genres.csv', index=False)
df[df_time_columns].to_csv('tmdb_datetimes.csv', index=False)
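A quick way to gain confidence in the export is to read a file back and compare it with what was written. This sketch uses a made-up two-row genres table and a temporary directory so that it leaves no files behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

genres = pd.DataFrame({"id": [18, 53], "name": ["Drama", "Thriller"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "tmdb_genres.csv"
    # index=False keeps the row labels out of the file, so no extra
    # unnamed column appears when the file is read back.
    genres.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(genres.columns)
same_rows = reloaded.values.tolist() == genres.values.tolist()
```

If index=False is left out, the reloaded frame gains an extra column holding the old row labels.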

That’s it! We’ve created our first ETL pipeline.

## Visualization

In [22]:
df_genres_all = pd.crosstab(s.index, s)
df_genres_all

row_0,Action,Adventure,Comedy,Crime,Drama,History,Romance,Thriller
0,0,0,0,0,1,0,0,0
1,1,1,0,0,0,0,0,1
2,0,0,1,0,0,0,1,0
3,0,0,0,1,1,0,0,1
4,0,0,1,0,1,1,1,0
5,0,0,0,0,0,0,0,1


In [27]:
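# DataFrame.append was removed in pandas 2.0, so add the 'Total' row with pd.concat.
total_row = df_genres_all.sum().rename('Total').to_frame().T
df_genres_all = pd.concat([df_genres_all, total_row])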

In [30]:
df_genres_all

row_0,Action,Adventure,Comedy,Crime,Drama,History,Romance,Thriller
0,0,0,0,0,1,0,0,0
1,1,1,0,0,0,0,0,1
2,0,0,1,0,0,0,1,0
3,0,0,0,1,1,0,0,1
4,0,0,1,0,1,1,1,0
5,0,0,0,0,0,0,0,1
Total,1,1,2,1,3,1,2,3


In [38]:
df = df_genres_all.T
df

genres_all,0,1,2,3,4,5,Total
Action,0,1,0,0,0,0,1
Adventure,0,1,0,0,0,0,1
Comedy,0,0,1,0,1,0,2
Crime,0,0,0,1,0,0,1
Drama,1,0,0,1,1,0,3
History,0,0,0,0,1,0,1
Romance,0,0,1,0,1,0,2
Thriller,0,1,0,1,0,1,3


In [41]:
import plotly.express as px

fig = px.bar(x=df.index, y=df['Total'])
fig.update_layout(
    title="Most frequent genres",
    xaxis_title="Genres",
    yaxis_title="Counts")
fig.update_traces(hovertemplate = 'Genre: %{x} <br>Counts: %{y}<extra></extra>')
fig.show()
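The per-genre totals behind this chart can also be computed in one step from the exploded genre lists, without building the crosstab, the totals row, and the transpose. This is a sketch on made-up data and an alternative to the approach used in the cells before it.

```python
import pandas as pd

# Made-up genre lists for three movies.
genres_all = pd.Series(
    [["Drama"], ["Action", "Thriller"], ["Drama", "Thriller"]],
    name="genres_all",
)

# explode() yields one row per (movie, genre) pair; value_counts() then
# tallies how many movies carry each genre, most frequent first.
genre_counts = genres_all.explode().value_counts()
```

The resulting Series can be plotted the same way as before, with `genre_counts.index` on the x axis and `genre_counts.values` on the y axis.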