<a href="https://colab.research.google.com/github/christopherhuntley/DATA6510/blob/master/HW4_IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/christopherhuntley/BUAN6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **DATA 6510**
# **Homework 4: IMDB** 
_Fun with Movie Listings._

## **Learning Objectives**
### **Theory / Be able to explain ...**
- How to explore structural relationships in a huge dataset
- How data gets loaded into a relational database from CSV files. 

### **Skills / Know how to ...**
- Determine table schema from SQL DDL
- Debug queries that may take a while to run (and crash the database)

The data for this assignment comes from [IMDB](https://www.imdb.com). It is big enough that it *barely* fits in SQLite. While IMDB does not provide a free API, it makes a large sampling of its data [available for download](https://www.imdb.com/interfaces). 

## **0. Boilerplate Code to get us started**

In [None]:
# lock down the package versions due to SQLAlchemy 2.0 compatibility bug
!pip install SQLAlchemy==1.4.46
!pip install PyMySQL==1.0.2 # or whichever
!pip install ipython-sql==0.4.1

# Load %%sql magic
%load_ext sql

# Standard Imports
import sqlite3
import pandas as pd

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


## **1. Explore the data source.**

We will be building a database of every movie released since the 1890s. The data comes from IMDB's [downloadable data sets](https://www.imdb.com/interfaces). Take a moment to read through the download page, which lists the data sets along with their column names and data types.

We will focus on the `title.basics.tsv.gz`, `name.basics.tsv.gz`, and `title.principals.tsv.gz` files. Some notes:
- `titles`, `names`, and `principals` are equivalent to movies, artists, and credits.  
- `principals` is not quite the same as the cast; it includes writers and crew but not every actor who appears. (So, unfortunately, we cannot calculate  [Bacon Numbers](https://oracleofbacon.org/help.php) accurately.) 
- The `.tsv` file extension means that the files are in tab-separated values (TSV) format, an ancient cousin of the more common CSV format. In the days before everybody pulled data sets into a spreadsheet to explore their contents, data was something you would edit in a text editor (note: MS Word is not a text editor). The tabs made the data line up in columns, for the most part. 
- The `.gz` file extension indicates that the data has been compressed using the `gzip` utility. In this case the compression is about 5 to 1. 
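
To make the file format concrete, here is a minimal sketch of previewing a gzipped TSV file with pandas without decompressing it by hand. The file `sample.tsv.gz` and its four columns are hypothetical stand-ins (a subset of the `title.basics.tsv.gz` layout); swap in one of the real IMDB files once it has been downloaded.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample laid out like a slice of title.basics.tsv.gz.
sample = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt0000001\tshort\tCarmencita\t1894\n"
    "tt0000009\tmovie\tMiss Jerry\t1894\n"
)

# Write the sample to a temporary .tsv.gz file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.tsv.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write(sample)

# pandas infers gzip compression from the .gz extension;
# nrows keeps the preview cheap even when the file is very large.
preview = pd.read_csv(path, sep="\t", nrows=5)
print(preview.columns.tolist())
print(preview)
```

Reading only a few rows this way is a quick check of column names before drawing the ERD.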

**As you are exploring, draw an ERD to represent the design of the database.** (No, there is no need to submit it. You can even use crayon if you like. We'll figure out if it's right below.)





## **2. Create and Load the Database.**
The load process below may take a few minutes to complete. If Google is very busy, it may take a couple of hours. 

In [None]:
# Download from IMDB
!wget https://datasets.imdbws.com/name.basics.tsv.gz
!wget https://datasets.imdbws.com/title.basics.tsv.gz
!wget https://datasets.imdbws.com/title.principals.tsv.gz

--2022-09-28 18:45:12--  https://datasets.imdbws.com/name.basics.tsv.gz
Resolving datasets.imdbws.com (datasets.imdbws.com)... 18.64.174.83, 18.64.174.10, 18.64.174.31, ...
Connecting to datasets.imdbws.com (datasets.imdbws.com)|18.64.174.83|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 234036192 (223M) [binary/octet-stream]
Saving to: ‘name.basics.tsv.gz’


2022-09-28 18:45:15 (87.6 MB/s) - ‘name.basics.tsv.gz’ saved [234036192/234036192]

--2022-09-28 18:45:15--  https://datasets.imdbws.com/title.basics.tsv.gz
Resolving datasets.imdbws.com (datasets.imdbws.com)... 18.64.174.83, 18.64.174.10, 18.64.174.31, ...
Connecting to datasets.imdbws.com (datasets.imdbws.com)|18.64.174.83|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 161744281 (154M) [binary/octet-stream]
Saving to: ‘title.basics.tsv.gz’


2022-09-28 18:45:17 (70.3 MB/s) - ‘title.basics.tsv.gz’ saved [161744281/161744281]

--2022-09-28 18:45:17--  https://datasets.imdbws.co

In [None]:
%sql sqlite:///IMDB_Mirror.db

'Connected: @IMDB_Mirror.db'

**It may help to refer to your ERD for this ...**

In [None]:
%%sql
DROP TABLE IF EXISTS names;
CREATE TABLE names (
    nconst TEXT PRIMARY KEY,
    primaryName TEXT DEFAULT 'No Name Given',
    birthYear TEXT,
    deathYear TEXT,
    primaryProfession TEXT,
    knownForTitles TEXT
);
DROP TABLE IF EXISTS titles;
CREATE TABLE titles (
    tconst TEXT PRIMARY KEY,
    titleType TEXT NOT NULL,
    primaryTitle TEXT DEFAULT 'Untitled',
    originalTitle TEXT DEFAULT 'Untitled', 
    isAdult INTEGER,
    startYear TEXT NOT NULL,
    endYear TEXT, 
    runtimeMinutes INTEGER DEFAULT 0,
    genres TEXT
);
DROP TABLE IF EXISTS principals;
CREATE TABLE principals (
    principalID INTEGER PRIMARY KEY,
    tconst TEXT NOT NULL,
    nconst TEXT NOT NULL,
    ordering INTEGER,
    category TEXT,
    job TEXT,
    characters TEXT,
    FOREIGN KEY (nconst) REFERENCES names (nconst),
    FOREIGN KEY (tconst) REFERENCES titles (tconst)
);

 * sqlite:///IMDB_Mirror.db
Done.
Done.
Done.
Done.
Done.
Done.


[]
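
Once tables exist, their schema can be read back from the database itself. The sketch below is one way to do that from Python, assuming a throwaway in-memory database and a cut-down `names` table (not the full `IMDB_Mirror.db` schema). It uses SQLite's `sqlite_master` catalog and the `PRAGMA table_info` statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, not IMDB_Mirror.db
conn.execute("""
    CREATE TABLE names (
        nconst TEXT PRIMARY KEY,
        primaryName TEXT DEFAULT 'No Name Given'
    )
""")

# The DDL text of each table is stored in the sqlite_master catalog table.
ddl = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'names'"
).fetchone()[0]
print(ddl)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = conn.execute("PRAGMA table_info(names)").fetchall()
for cid, name, col_type, notnull, default, pk in columns:
    print(name, col_type, "default:", default, "pk:", pk)
conn.close()
```

The same two statements can also be run against the real database to compare the result with your ERD.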

### **Load from files**
- This uses the pandas `pd.read_csv()` function with tab (`\t`) separators.
- Again, note the location of the database file. The file name and location have to match the connection string passed to the `%sql` magic. 

In [None]:
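# Map each table name to the IMDB file that feeds it
data_conf = {'titles':'title.basics.tsv.gz', 'names': 'name.basics.tsv.gz', 'principals':'title.principals.tsv.gz'}
conn = sqlite3.connect('IMDB_Mirror.db') 
with conn:
  for tbl,fname in data_conf.items():
    print(tbl,fname)
    # pandas decompresses the .gz file on the fly
    df = pd.read_csv(fname,sep='\t')
    # 'append' inserts into the table created by the DDL above instead of replacing it
    df.to_sql(tbl,conn,if_exists='append',index=False)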

titles title.basics.tsv.gz




names name.basics.tsv.gz
principals title.principals.tsv.gz
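
The load cell above reads each file into memory in one piece. If memory becomes a problem, one alternative is to load in chunks. The sketch below is an illustration under assumptions, not part of the assignment: it builds a hypothetical ten-row file `mini.principals.tsv.gz`, uses an in-memory database with a cut-down `principals` table, and relies on the `chunksize` parameter of `pd.read_csv()`.

```python
import gzip
import os
import sqlite3
import tempfile

import pandas as pd

# Hypothetical miniature stand-in for title.principals.tsv.gz.
rows = ["tconst\tordering\tnconst\tcategory"]
rows += [f"tt{i:07d}\t1\tnm{i:07d}\tactor" for i in range(1, 11)]
path = os.path.join(tempfile.mkdtemp(), "mini.principals.tsv.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("\n".join(rows) + "\n")

conn = sqlite3.connect(":memory:")  # throwaway database for the sketch
conn.execute("""
    CREATE TABLE principals (
        principalID INTEGER PRIMARY KEY,
        tconst TEXT NOT NULL,
        nconst TEXT NOT NULL,
        ordering INTEGER,
        category TEXT
    )
""")

# chunksize makes read_csv return an iterator of DataFrames, so only
# one chunk is held in memory at a time. dtype=str skips type guessing.
total = 0
with conn:
    for chunk in pd.read_csv(path, sep="\t", dtype=str, chunksize=4):
        chunk.to_sql("principals", conn, if_exists="append", index=False)
        total += len(chunk)

loaded = conn.execute("SELECT COUNT(*) FROM principals").fetchone()[0]
print(total, loaded)
```

Comparing the number of rows read with `COUNT(*)` afterwards is a cheap way to confirm that the load actually reached the table.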


**Pop Quiz: Why do we load the principals table last?**

YOUR ANSWER

### **Refresh and Test database connection**

Run the cells below before moving on to part 3. 



In [None]:
# Reload the %sql magic after SQLAlchemy runs
%load_ext sql
%sql sqlite:///IMDB_Mirror.db

'Connected: @IMDB_Mirror.db'

In [None]:
%%sql @IMDB_Mirror.db
-- A query to make sure we have data loaded
SELECT * FROM titles LIMIT 10;

Done.


tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,\N,1,Short
tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,\N,1,"Short,Sport"
tt0000008,short,Edison Kinetoscopic Record of a Sneeze,Edison Kinetoscopic Record of a Sneeze,0,1894,\N,1,"Documentary,Short"
tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,\N,45,Romance
tt0000010,short,Leaving the Factory,La sortie de l'usine Lumière à Lyon,0,1895,\N,1,"Documentary,Short"


In [None]:
%%sql
SELECT * FROM titles LIMIT 10;

 * sqlite:///IMDB_Mirror.db
Done.


tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,\N,1,Short
tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,\N,1,"Short,Sport"
tt0000008,short,Edison Kinetoscopic Record of a Sneeze,Edison Kinetoscopic Record of a Sneeze,0,1894,\N,1,"Documentary,Short"
tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,\N,45,Romance
tt0000010,short,Leaving the Factory,La sortie de l'usine Lumière à Lyon,0,1895,\N,1,"Documentary,Short"


In [None]:
import sqlite3
import pandas as pd

data_conf = {'principals':'title.principals.tsv.gz'}
conn = sqlite3.connect('IMDB_Mirror.db') 
with conn:
  for tbl,fname in data_conf.items():
    print(tbl,fname)
    df = pd.read_csv(fname,sep='\t')
    df.to_sql(tbl,conn,if_exists='append',index=False)

principals title.principals.tsv.gz


In [None]:
# Reload the %sql magic after SQLAlchemy runs
%load_ext sql
%sql sqlite:///IMDB_Mirror.db

'Connected: @IMDB_Mirror.db'

In [None]:
%%sql
SELECT * FROM principals LIMIT 10;

 * sqlite:///IMDB_Mirror.db
Done.


principalID,tconst,nconst,ordering,category,job,characters


## **3. Now for the fun part.**

Write `SELECT` queries to answer the questions below. 

> **Note: Colab will delete your files, including your database, after 12 hours of inactivity. If your session resets then you will need to *rerun* all the above cells to recreate the database.**
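
For queries that may take a while to run, it can help to ask SQLite how it plans to execute a query before running it in full. The sketch below uses a toy in-memory `principals` table with made-up rows and a hypothetical index name, `idx_principals_nconst`; it illustrates `EXPLAIN QUERY PLAN` and `LIMIT` and does not answer any of the questions below.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # toy database, not the real IMDB mirror
conn.execute("CREATE TABLE principals (tconst TEXT, nconst TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO principals VALUES (?, ?, ?)",
    [("tt1", "nm1", "actor"), ("tt1", "nm2", "director"), ("tt2", "nm1", "actor")],
)

query = "SELECT tconst FROM principals WHERE nconst = 'nm1'"

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

before = plan(query)  # no index yet, so SQLite has to scan the whole table
conn.execute("CREATE INDEX idx_principals_nconst ON principals (nconst)")
after = plan(query)   # the planner can now search through the index

print(before)
print(after)

# While debugging, cap the result size so a mistaken join cannot flood the notebook.
result = conn.execute(query + " LIMIT 5").fetchall()
print(result)
```

A plan that scans a multi-million-row table inside a join is a hint that the query will be slow. Note that `LIMIT` only caps the rows returned; it does not always reduce the work done, for example in aggregate queries.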

### **In what movies did Eli Wallach appear? TV does not count. (three tables, two joins)**

### **How many years long was Eli Wallach's career, from his first film to his last?**

### **Who were Eli Wallach's costars (note: actors only) in movies released in 1996? (two tables, three joins)**

### **How many total co-stars did Eli Wallach have over his career?**

### **Which artists were both actors and directors in movies released in 1996? (That's actor and director in the same movie.)**

### **How many artists were there in the above query?**

### **Who has the record for appearing in the most different movies in one year?**

### **Movie titles are not unique. Which movie title has been reused the most times over the years? (Exclude "Untitled" or similar non-titles. Also be sure to only include movie titles.)**

### **Come up with your own query and post it on Slack. The student with the most interesting query -- as voted in class -- gets a perfect score on the next quiz.**

---
## **On your way out ... Be sure to save your work**.
Save this notebook and turn it in on Google Classroom. 
In Google Drive, drag this notebook file into your `DATA6510` folder so you can find it next time.