<a href="https://colab.research.google.com/github/christopherhuntley/DATA6510/blob/master/HW4_IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/christopherhuntley/BUAN6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **DATA 6510**
# **Homework 4: IMDB** 
_Fun with Movie Listings._

## **Learning Objectives**
### **Theory / Be able to explain ...**
- How to explore structural relationships in a huge dataset
- How data gets loaded into a relational database from CSV files. 

### **Skills / Know how to ...**
- Determine table schema from SQL DDL
- Debug queries that may take a while to run (and crash the database)

The data for this assignment comes from [IMDB](https://www.imdb.com). It is big enough that it *barely* fits in SQLite. While IMDB does not provide a free API, it makes a large sampling of its data [available for download](https://www.imdb.com/interfaces). 

## **0. Boilerplate Code to get us started**

In [1]:
# lock down the package versions due to SQLAlchemy 2.0 compatibility bug
!pip install SQLAlchemy==1.4.46
!pip install PyMySQL==1.0.2 # or whichever
!pip install ipython-sql==0.4.1

# Load %%sql magic
%load_ext sql

# Standard Imports
import sqlite3
import pandas as pd

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting SQLAlchemy==1.4.46
  Downloading SQLAlchemy-1.4.46-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
Installing collected packages: SQLAlchemy
  Attempting uninstall: SQLAlchemy
    Found existing installation: SQLAlchemy 2.0.0
    Uninstalling SQLAlchemy-2.0.0:
      Successfully uninstalled SQLAlchemy-2.0.0
Successfully installed SQLAlchemy-1.4.46
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting PyMySQL==1.0.2
  Downloading PyMySQL-1.0.2-py3-none-any.whl (43 kB)
Installing collected packages: PyMySQL
Successfully installe

## **1. Explore the data source.**

We will be building a database of every movie released since the 1890s, using the data sets that IMDB makes [available for download](https://www.imdb.com/interfaces). Take a moment to read through the download page, which lists each downloadable data set along with its column names and data types.

We will focus on the `title.basics.tsv.gz`, `name.basics.tsv.gz`, and `title.principals.tsv.gz` files. Some notes:
- `titles`, `names`, and `principals` are equivalent to movies, artists, and credits.  
- `principals` is not quite the same as the cast; it includes writers and crew but not every actor who appears. (So, unfortunately, we cannot calculate  [Bacon Numbers](https://oracleofbacon.org/help.php) accurately.) 
- The `.tsv` file extension means that the files are in tab separated values (TSV) format, an ancient cousin to the more common CSV format. In the days before everybody pulled up data sets into a spreadsheet to explore their contents, data was something you would edit in a text editor (note: MS Word is not a text editor). The tabs forced the data to appear in columns, for the most part. 
- The `.gz` file extension indicates that the data has been compressed using the `gzip` utility. In this case the compression is about 5 to 1. 
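If you would rather peek at a file from code than from a text editor, the sketch below shows one way to do it with Python's standard `gzip` and `csv` modules. It writes a tiny stand-in file first so that it runs anywhere; the helper name `peek` and the file `sample.tsv.gz` are made up for illustration and are not part of the assignment.

```python
import csv
import gzip
import os
import tempfile
from itertools import islice

# Build a tiny stand-in file so the sketch runs on its own. In the notebook you
# would point `path` at a downloaded file such as title.basics.tsv.gz instead.
sample_rows = [
    ["tconst", "titleType", "primaryTitle", "startYear"],
    ["tt0000009", "movie", "Miss Jerry", "1894"],
    ["tt0000502", "movie", "Bohemios", "1905"],
]
path = os.path.join(tempfile.mkdtemp(), "sample.tsv.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for row in sample_rows:
        f.write("\t".join(row) + "\n")


def peek(path, n=5):
    """Return the header and the first n data rows of a gzipped TSV file."""
    # Mode "rt" decompresses on the fly and yields text, so the whole file
    # never has to be unpacked on disk or held in memory.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        # QUOTE_NONE treats quote characters as ordinary text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        rows = list(islice(reader, n))
    return header, rows


header, rows = peek(path, n=2)
print(header)
for row in rows:
    print(row)
```

Because only the first few lines are read, the same approach should stay fast even on the full-size downloads.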

**As you are exploring, draw an ERD to represent the design of the database.** (No, there is no need to submit it. You can even use crayon if you like. We'll figure out if it's right below.)





## **2. Create and Load the Database.**
The load process below may take a few minutes to complete. If Google is very busy, it may take a couple of hours.

In [2]:
# Download from IMDB
!wget https://datasets.imdbws.com/name.basics.tsv.gz
!wget https://datasets.imdbws.com/title.basics.tsv.gz
!wget https://datasets.imdbws.com/title.principals.tsv.gz

--2023-02-05 19:41:53--  https://datasets.imdbws.com/name.basics.tsv.gz
Resolving datasets.imdbws.com (datasets.imdbws.com)... 108.156.107.5, 108.156.107.31, 108.156.107.22, ...
Connecting to datasets.imdbws.com (datasets.imdbws.com)|108.156.107.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 240498219 (229M) [binary/octet-stream]
Saving to: ‘name.basics.tsv.gz’


2023-02-05 19:41:55 (150 MB/s) - ‘name.basics.tsv.gz’ saved [240498219/240498219]

--2023-02-05 19:41:55--  https://datasets.imdbws.com/title.basics.tsv.gz
Resolving datasets.imdbws.com (datasets.imdbws.com)... 108.156.107.5, 108.156.107.31, 108.156.107.22, ...
Connecting to datasets.imdbws.com (datasets.imdbws.com)|108.156.107.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 167518382 (160M) [binary/octet-stream]
Saving to: ‘title.basics.tsv.gz’


2023-02-05 19:41:56 (117 MB/s) - ‘title.basics.tsv.gz’ saved [167518382/167518382]

--2023-02-05 19:41:56--  https://datasets

In [3]:
%sql sqlite:///IMDB_Mirror.db

**It may help to refer to your ERD for this ...**

In [4]:
%%sql
DROP TABLE IF EXISTS names;
CREATE TABLE names (
    nconst TEXT PRIMARY KEY,
    primaryName TEXT DEFAULT 'No Name Given',
    birthYear TEXT,
    deathYear TEXT,
    primaryProfession TEXT,
    knownForTitles TEXT
);
DROP TABLE IF EXISTS titles;
CREATE TABLE titles (
    tconst TEXT PRIMARY KEY,
    titleType TEXT NOT NULL,
    primaryTitle TEXT DEFAULT 'Untitled',
    originalTitle TEXT DEFAULT 'Untitled', 
    isAdult INTEGER,
    startYear TEXT NOT NULL,
    endYear TEXT, 
    runtimeMinutes INTEGER DEFAULT 0,
    genres TEXT
);
DROP TABLE IF EXISTS principals;
CREATE TABLE principals (
    principalID INTEGER PRIMARY KEY,
    tconst TEXT NOT NULL,
    nconst TEXT NOT NULL,
    ordering INTEGER,
    category TEXT,
    job TEXT,
    characters TEXT,
    FOREIGN KEY (nconst) REFERENCES names (nconst),
    FOREIGN KEY (tconst) REFERENCES titles (tconst)
);

 * sqlite:///IMDB_Mirror.db
Done.
Done.
Done.
Done.
Done.
Done.


[]
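To check your ERD against what SQLite actually recorded, you can ask the database for its own view of a table. The sketch below is one possible approach, run against a throwaway in-memory database rather than `IMDB_Mirror.db`. It recreates the `principals` table from the DDL above and reads it back with two SQLite pragmas, `table_info` and `foreign_key_list`.

```python
import sqlite3

# Throwaway in-memory database; the notebook's real file is IMDB_Mirror.db.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE principals (
        principalID INTEGER PRIMARY KEY,
        tconst TEXT NOT NULL,
        nconst TEXT NOT NULL,
        ordering INTEGER,
        category TEXT,
        job TEXT,
        characters TEXT,
        FOREIGN KEY (nconst) REFERENCES names (nconst),
        FOREIGN KEY (tconst) REFERENCES titles (tconst)
    )
""")

# table_info rows are (cid, name, type, notnull, dflt_value, pk).
columns = conn.execute("PRAGMA table_info(principals)").fetchall()
for cid, name, col_type, notnull, default, pk in columns:
    flags = ("NOT NULL " if notnull else "") + ("PK" if pk else "")
    print(f"{name:12} {col_type:8} {flags}")

# foreign_key_list rows start with (id, seq, table, from, to, ...).
fks = conn.execute("PRAGMA foreign_key_list(principals)").fetchall()
for fk in fks:
    print(f"{fk[3]} -> {fk[2]}.{fk[4]}")
```

Each foreign key line corresponds to one relationship line on the ERD, which makes this a quick way to compare the two.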

### **Load from files**
- This uses the pandas `pd.read_csv()` function with tab (`\t`) separators.
- Again, note the location of the database file. The file name and location must match the ones used with the `%sql` magic.
- **THIS WILL LIKELY CRASH COLAB AFTER ~5mins -- too much data all at once -- BUT WE HAVE A QUICK FIX**

In [None]:
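# Map each table name to the file it is loaded from
data_conf = {'titles':'title.basics.tsv.gz', 'names': 'name.basics.tsv.gz', 'principals':'title.principals.tsv.gz'}
conn = sqlite3.connect('IMDB_Mirror.db')  # same file as the %sql connection above
with conn:  # the with block commits on success and rolls back on error
  for tbl,fname in data_conf.items():
    print(tbl,fname)
    df = pd.read_csv(fname,sep='\t')  # pandas infers gzip compression from the .gz extension
    df.to_sql(tbl,conn,if_exists='append',index=False)  # append rows to the table created above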

titles title.basics.tsv.gz
names name.basics.tsv.gz
principals title.principals.tsv.gz
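The crash comes from holding an entire file in memory at once. One possible alternative, sketched below, is to stream each file and insert it in batches so that only one batch is in memory at a time. This is an illustration using the standard library, not the method used to build the course database. The helper name `load_in_batches`, the three-column stand-in file, and the tiny `batch_size` are all assumptions made for the demo.

```python
import csv
import gzip
import os
import sqlite3
import tempfile
from itertools import islice

# Tiny stand-in for title.basics.tsv.gz so the sketch is runnable on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.tsv.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("tconst\tprimaryTitle\tstartYear\n")
    for i in range(1, 8):
        f.write(f"tt{i:07d}\tMovie {i}\t{1900 + i}\n")

conn = sqlite3.connect(":memory:")  # the notebook would use IMDB_Mirror.db
conn.execute(
    "CREATE TABLE titles (tconst TEXT PRIMARY KEY, primaryTitle TEXT, startYear TEXT)"
)


def load_in_batches(conn, path, table, batch_size=3):
    """Insert a gzipped TSV into `table`, holding only one batch in memory."""
    total = 0
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        # Table and column names are interpolated directly, so they must come
        # from a trusted source (here, our own file header).
        cols = ", ".join(header)
        marks = ", ".join("?" for _ in header)
        sql = f"INSERT INTO {table} ({cols}) VALUES ({marks})"
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            with conn:  # commit each batch as its own transaction
                conn.executemany(sql, batch)
            total += len(batch)
    return total


loaded = load_in_batches(conn, path, "titles")
count = conn.execute("SELECT COUNT(*) FROM titles").fetchone()[0]
print(loaded, count)
```

On the real files a much larger batch size (tens of thousands of rows) would be more sensible. Note that this sketch stores every value as text exactly as it appears in the file, including the `\N` placeholders.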


**Pop Quiz: Why do we load the principals table last?**

YOUR ANSWER

## **3. Refresh and Test database connection**

Run the cells below before moving on to Part 4. They work around the "too much data" crash in Colab by downloading a prebuilt copy of the database.



In [1]:
# lock down the package versions due to SQLAlchemy 2.0 compatibility bug
!pip install SQLAlchemy==1.4.46
!pip install PyMySQL==1.0.2 # or whichever
!pip install ipython-sql==0.4.1

# Download the database file. 
!pip3 install --upgrade gdown
!gdown https://drive.google.com/uc?id=1MPTKr9xQJc00zyyhT9kbKJ2vXD2vwPx4

# Reload the %sql magic now that the pinned SQLAlchemy version is installed
%load_ext sql
%sql sqlite:///IMDB_Mirror.db

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gdown
  Downloading gdown-4.6.0-py3-none-any.whl (14 kB)
Installing collected packages: gdown
  Attempting uninstall: gdown
    Found existing installation: gdown 4.4.0
    Uninstalling gdown-4.4.0:
      Successfully uninstalled gdown-4.4.0
Successfully installed gdown-4.6.0
Downloading...
From: https://drive.google.com/uc?id=1MPTKr9xQJc00zyyhT9kbKJ2vXD2vwPx4
To: /content/IMDB_Mirror.db
100% 441M/441M [00:03<00:00, 125MB/s]


In [3]:
%%sql @IMDB_Mirror.db
-- A query to make sure we have data loaded
SELECT * FROM titles LIMIT 10;

Done.


tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,\N,45,Romance
tt0000502,movie,Bohemios,Bohemios,0,1905,\N,100,\N
tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,\N,70,"Action,Adventure,Biography"
tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907,\N,90,Drama
tt0000615,movie,Robbery Under Arms,Robbery Under Arms,0,1907,\N,\N,Drama
tt0000630,movie,Hamlet,Amleto,0,1908,\N,\N,Drama
tt0000675,movie,Don Quijote,Don Quijote,0,1908,\N,\N,Drama
tt0000679,movie,The Fairylogue and Radio-Plays,The Fairylogue and Radio-Plays,0,1908,\N,120,"Adventure,Fantasy"
tt0000793,movie,Andreas Hofer,Andreas Hofer,0,1909,\N,\N,Drama
tt0000814,movie,La bocana de Mar Chica,La bocana de Mar Chica,0,1909,\N,\N,\N


In [4]:
%%sql
SELECT * FROM titles LIMIT 10;

 * sqlite:///IMDB_Mirror.db
Done.


tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,\N,45,Romance
tt0000502,movie,Bohemios,Bohemios,0,1905,\N,100,\N
tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,\N,70,"Action,Adventure,Biography"
tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907,\N,90,Drama
tt0000615,movie,Robbery Under Arms,Robbery Under Arms,0,1907,\N,\N,Drama
tt0000630,movie,Hamlet,Amleto,0,1908,\N,\N,Drama
tt0000675,movie,Don Quijote,Don Quijote,0,1908,\N,\N,Drama
tt0000679,movie,The Fairylogue and Radio-Plays,The Fairylogue and Radio-Plays,0,1908,\N,120,"Adventure,Fantasy"
tt0000793,movie,Andreas Hofer,Andreas Hofer,0,1909,\N,\N,Drama
tt0000814,movie,La bocana de Mar Chica,La bocana de Mar Chica,0,1909,\N,\N,\N


In [5]:
%%sql
SELECT * FROM principals LIMIT 10;

 * sqlite:///IMDB_Mirror.db
Done.


principalID,tconst,nconst,ordering,category,job,characters
25,tt0000009,nm0063086,1,actress,\N,"[""Miss Geraldine Holbrook (Miss Jerry)""]"
26,tt0000009,nm0183823,2,actor,\N,"[""Mr. Hamilton""]"
27,tt0000009,nm1309758,3,actor,\N,"[""Chauncey Depew - the Director of the New York Central Railroad""]"
28,tt0000009,nm0085156,4,director,\N,\N
851,tt0000502,nm0215752,1,actor,\N,\N
852,tt0000502,nm0252720,2,actor,\N,\N
853,tt0000502,nm0063413,3,director,\N,\N
854,tt0000502,nm0657268,4,writer,\N,\N
855,tt0000502,nm0675388,5,writer,\N,\N
1043,tt0000574,nm0675239,10,cinematographer,director of photography,\N


## **4. Now for the fun part.**

Write `SELECT` queries to answer the questions below. 

> **Note: Colab will delete your files, including your database, after 12 hours of inactivity. If your session resets then you will need to *rerun* all the cells in Part 3 above to recreate the database.**
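Joins over tables this large can run for a long time, so it can help to ask SQLite how it plans to execute a query before actually running it. The sketch below shows one way to do that with `EXPLAIN QUERY PLAN`, using empty toy tables in an in-memory database. The helper `show_plan`, the index name `idx_principals_nconst`, and the sample `nconst` value are made up for illustration. Whether an index like this is worth building on the real database is a judgment call, since creating it takes time as well.

```python
import sqlite3

# Empty toy tables with the same key columns as the homework schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE titles (tconst TEXT PRIMARY KEY, primaryTitle TEXT);
    CREATE TABLE principals (
        principalID INTEGER PRIMARY KEY,
        tconst TEXT NOT NULL,
        nconst TEXT NOT NULL
    );
""")

query = """
    SELECT primaryTitle
    FROM principals JOIN titles USING (tconst)
    WHERE nconst = 'nm0000001'
"""


def show_plan(conn, query):
    """Return the steps SQLite plans to take for `query`, without running it."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return [row[-1] for row in rows]  # the last column is the description


before = show_plan(conn, query)
conn.execute("CREATE INDEX idx_principals_nconst ON principals (nconst)")
after = show_plan(conn, query)

print("Before index:")
for step in before:
    print("  ", step)
print("After index:")
for step in after:
    print("  ", step)
```

A step that begins with `SCAN` reads every row of a table, while `SEARCH ... USING INDEX` jumps straight to the matching rows. Adding `LIMIT 10` while you are still developing a query is another cheap safeguard.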

### **In what movies did Eli Wallach appear? TV does not count. (three tables, two joins)**

### **How many years long was Eli Wallach's career, from his first film to his last?**

### **Who were Eli Wallach's costars (note: actors only) in movies released in 1996? (two tables, three joins)**

### **How many total co-stars did Eli Wallach have over his career?**

### **Which artists were both actors and directors in movies released in 1996? (That's actor and director in the same movie.)**

### **How many artists were there in the above query?**

### **Who has the record for appearing in the most different movies in one year?**

### **Movie titles are not unique. Which movie title has been reused the most times over the years? (Exclude "Untitled" or similar non-titles. Also be sure to only include movie titles.)**

### **Come up with your own query and post it on Slack. The student with the most interesting query -- as voted in class -- gets a perfect score on the next quiz.**

---
## **On your way out ... Be sure to save your work**.
Save this notebook and turn it in on Google Classroom. 
In Google Drive, drag this notebook file into your `DATA6510` folder so you can find it next time.