What is it?

TheMovieDB (TMDB) is a movie database, useful for NLP, recsys, and search experimentation. This repo crawls the TMDB API, following the TMDB API rules, and places the results into local gzipped JSON files so you can go forth and experiment with movie data.

OSC uses this dataset in its training classes, which will interest you if you are into search!

Dependencies

Python 3
Requests library

To Run

export TMDB_API_KEY=<get an API key from TMDB>
python tmdb.py

This script crawls TMDB from movie ID 0 up to the latest movie added to TMDB. Every 1000 movies are dumped to the chunks/ folder as gzipped JSON.
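As a rough illustration of the chunking described above, here is a minimal sketch of writing and reading one gzipped JSON chunk. It is not taken from tmdb.py: the function names, the `tmdb.<index>.json.gz` naming scheme, and the layout of a chunk as a dict keyed by movie ID are assumptions made for the example.

```python
import gzip
import json
import os

CHUNK_SIZE = 1000  # from the README: one chunk per 1000 movies


def chunk_path(movie_id, out_dir="chunks"):
    """Hypothetical naming scheme: <out_dir>/tmdb.<chunk index>.json.gz"""
    return os.path.join(out_dir, f"tmdb.{movie_id // CHUNK_SIZE}.json.gz")


def write_chunk(movies, path):
    """Write a dict of {movie ID (str): movie record} as gzipped JSON."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(movies, f)


def read_chunk(path):
    """Read a gzipped JSON chunk back into a dict."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

Opening the file in text mode ("wt" / "rt") lets `json.dump` and `json.load` work directly on the compressed stream, so no unzipped copy has to be kept on disk.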

To clean the data for public use (removing adult films, etc.), a second script collects the results in chunks/ and filters them into a single JSON file of about 50,000 English-language, feature-length films.

python scrub_and_shrink.py

This produces a JSON file named tmdb_dump_{YYYY-MM-DD}.json. The date in the filename versions the data so that existing tutorials are not broken.
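The filtering and dated filename could look roughly like the sketch below. This is not the logic of scrub_and_shrink.py: the function names are hypothetical, the 60-minute runtime cutoff is an assumed stand-in for "feature length", and the script's actual criteria may differ. The `adult`, `original_language`, and `runtime` keys are fields that TMDB movie records carry.

```python
import datetime
import json


def keep_movie(movie, min_runtime=60):
    """Assumed filter: not adult, English original language, feature length.

    min_runtime (minutes) is an assumption, not a value from the repo.
    """
    return (
        not movie.get("adult", False)
        and movie.get("original_language") == "en"
        and (movie.get("runtime") or 0) >= min_runtime
    )


def scrub(movies):
    """Keep only the movies that pass keep_movie; keys are movie IDs."""
    return {mid: m for mid, m in movies.items() if keep_movie(m)}


def dump_filename(today=None):
    """Build the dated output name, e.g. tmdb_dump_2024-01-02.json."""
    today = today or datetime.date.today()
    return f"tmdb_dump_{today.isoformat()}.json"


def write_dump(movies, path):
    """Write the filtered movies to a single uncompressed JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(movies, f)
```

The `(movie.get("runtime") or 0)` form treats a missing or null runtime as 0, so records without runtime information are dropped rather than raising an error.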

Understanding Data Structure

You can use jq to parse the JSON. Unzip a chunk and then run:

cat mychunk.json | jq .

Or, to look at a specific movie, look it up by ID:

jq -c '.["702557"]' temp/tmdb.702.json | jq .
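If jq is not available, the same lookup can be sketched in Python. This assumes, as the jq query above suggests, that a chunk is a JSON object keyed by movie ID strings; `load_chunk` and `lookup` are hypothetical helper names, not functions from this repo.

```python
import gzip
import json


def load_chunk(path):
    """Load a chunk, whether still gzipped (.gz) or already unzipped."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def lookup(chunk, movie_id):
    """Return the record for movie_id, or None if it is not in this chunk."""
    return chunk.get(str(movie_id))


def pretty(record):
    """Format a record as indented JSON, similar to piping through `jq .`."""
    return json.dumps(record, indent=2, sort_keys=True)
```

For example, `print(pretty(lookup(load_chunk("temp/tmdb.702.json"), 702557)))` would mirror the jq command above, provided that file exists locally.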
