<img src="https://github.com/christopherhuntley/BUAN6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **BUAN 6510**
# **Homework 7: SQL DDL with Big Data** 
_Fun with IMDB data._

## **Learning Objectives**
### **Theory / Be able to explain ...**
- The importance of testing in SQL ETL processes
- How big data is different from small data

### **Skills / Know how to ...**
- Use SQL DDL to define table schemas for imported data
- Import data from CSV files into existing tables
- Apply the Strongest First table load technique

In this assignment you will build a database that is just at the very edge of SQLite's capabilities. In fact, it is likely that Colab will crash at least once along the way. So, you will need to proceed in several passes through the design $\rightarrow$ code $\rightarrow$ test $\rightarrow$ design ... cycle.  

The data will come from [IMDB](https://www.imdb.com). While IMDB does not provide a free API, it makes a large sampling of its data [available for download](https://www.imdb.com/interfaces). 

## **1. Explore the data source.**

We will be building a database of every movie released in 1996. Take a moment to read through the [IMDB download page](https://www.imdb.com/interfaces), which lists the downloadable data sets along with their column names and data types.

In keeping with the Movies Tonight case, we will focus on the `title.basics.tsv.gz`, `name.basics.tsv.gz`, and `title.principals.tsv.gz` files. Some notes:
- `titles`, `names`, and `principals` are equivalent to movies, artists, and credits in the Movies Tonight database.  
- The `.tsv` file extension means that the files are in tab-separated values (TSV) format, an ancient cousin of the more common CSV format. Before it became routine to open data sets in a spreadsheet, data was something you edited in a text editor (note: MS Word is not a text editor). The tabs made the data line up in columns, for the most part. 
- The `.gz` file extension indicates that the data has been compressed using the `gzip` utility. In this case the compression is about 5 to 1. 
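
One way to explore a gzipped TSV file without decompressing it to disk or loading all of it is to stream the first few rows. The sketch below uses only the Python standard library. The sample file, its rows, and the `peek` helper are made up for illustration; the column names mimic those listed on the IMDB download page, which also describes `\N` as the marker for missing values.

```python
import csv
import gzip
import tempfile
from pathlib import Path

# Hypothetical miniature stand-in for a file such as title.basics.tsv.gz
sample_rows = [
    ["tconst", "primaryTitle", "startYear"],
    ["tt0000001", "Example One", "1996"],
    ["tt0000002", "Example Two", "\\N"],  # \N marks a missing value
]
sample_path = Path(tempfile.mkdtemp()) / "sample.tsv.gz"
with gzip.open(sample_path, "wt", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter="\t").writerows(sample_rows)

def peek(path, n=5):
    """Return the header and up to n data rows without reading the whole file."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        rows = [row for _, row in zip(range(n), reader)]
    return header, rows

header, rows = peek(sample_path)
print(header)
for row in rows:
    print(row)
```

Pointing `peek` at one of the real downloads should work the same way, since only the requested rows are ever read.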

**As you are exploring, draw an ERD to represent the design of the database.** (No, there is no need to submit it. We'll figure out if it's right below.)





## **2. Write and test SQL DDL with a nontrivial sample of data.**

The data is much, much too big to work with when writing DDL code. You'll want the design $\rightarrow$ code $\rightarrow$ test loop to take a few seconds, not a couple of hours. 

So, we will work with a sample that includes $-$ you guessed it $-$ every movie released in 1996. Here are URLs for the raw data files:
- https://github.com/christopherhuntley/BUAN6510/raw/master/data/IMDB/titles.csv
- https://github.com/christopherhuntley/BUAN6510/raw/master/data/IMDB/names.csv
- https://github.com/christopherhuntley/BUAN6510/raw/master/data/IMDB/principals.csv

Run each of the cells below, which ...
- download the data to our Colab workspace
- mount Google Drive and create the IMDB folder
- create a symlink for SQLite to work with Google Drive
- initialize a %%sql connection to a new test database


In [None]:
# download the source data from GitHub
!wget https://github.com/christopherhuntley/BUAN6510/raw/master/data/IMDB/titles.csv
!wget https://github.com/christopherhuntley/BUAN6510/raw/master/data/IMDB/names.csv
!wget https://github.com/christopherhuntley/BUAN6510/raw/master/data/IMDB/principals.csv

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Create the BUAN6510/data/IMDB folder in Google Drive
from pathlib import Path
data_root = Path("./drive/My Drive/Colab Notebooks/BUAN6510")
if not data_root.exists():
  print(
      '''
      Warning! The folder '/Colab Notebooks/BUAN6510' could not be found in the connected Google Drive. 
      Please make 100% sure that both Colab and Chrome are set up to use your @student.fairfield.edu account. 
      For now, a new folder with the correct path has been created in whatever Google Drive it found. 
      ''')
data_root = data_root / 'data' / 'IMDB'
data_root.mkdir(parents=True, exist_ok=True)



In [None]:
%%bash
# -f and -n replace an existing link so the cell can be rerun safely
ln -sfn drive/My\ Drive/Colab\ Notebooks/BUAN6510 buan6510

In [None]:
# Load %%sql magic
%load_ext sql

# Standard Imports
import sqlite3
import pandas as pd

# Database connection
%sql sqlite:///buan6510/data/IMDB/IMDB_movies_1996.db

Now that we have all that set up, we can write some code. In the cells below write SQL DDL to design, load, and test  your new database.
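- Name the tables after the IMDB data sets: `titles`, `names`, and `principals`. The names must match the keys used in the load cell, or the data will land in new, untyped tables. 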
- Use `DROP TABLE ...` statements just before the `CREATE TABLE` statements to make the code rerunnable.
- Use the design $\rightarrow$ code $\rightarrow$ test technique to avoid writing too much code at a time. Write a little, (re)run the DDL code, (re)load the data, (re)test with the loaded data. Keep going until everything seems to work. 
- Get the data types right before worrying about keys, etc. Once the data types seem to work, then define the primary keys. Finally set any foreign keys. If you try to do it all at once, then debugging is much harder. 
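
The drop-then-create pattern can be sketched as follows. This is a minimal illustration, not the assignment's schema: the `demo_widgets` table and its columns are hypothetical, and the script runs against a throwaway in-memory database. Adding `IF EXISTS` to `DROP TABLE` keeps the very first run from failing when the table has not been created yet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for illustration

# Hypothetical table, unrelated to the IMDB schema you are designing
ddl = """
DROP TABLE IF EXISTS demo_widgets;
CREATE TABLE demo_widgets (
    widget_id TEXT PRIMARY KEY,
    label     TEXT NOT NULL,
    weight_g  INTEGER
);
"""

# Running the script twice must not raise "table already exists"
for _ in range(2):
    conn.executescript(ddl)

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)
```

The same two statements can go in a `%%sql` cell; the Python wrapper here only makes the rerun easy to demonstrate.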

### **SQL DDL** 


In [None]:
%%sql
YOUR DDL CODE HERE

### **Load from Files**
The cell below follows the Strongest First loading technique. 

In [None]:
# Load the data from csv files
data_conf = {'titles':'titles.csv', 'names': 'names.csv', 'principals':'principals.csv', }
conn = sqlite3.connect('buan6510/data/IMDB/IMDB_movies_1996.db') 
with conn:
  for tbl,fname in data_conf.items():
    print(tbl,fname)
    df = pd.read_csv(fname)
    df.to_sql(tbl,conn,if_exists='append',index=False)
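
Why load order matters can be shown with a small sketch, assuming that Strongest First means loading the tables that other tables reference before the tables that reference them. The `parent` and `child` tables are hypothetical, and foreign key enforcement is switched on explicitly because SQLite leaves it off by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default

# Hypothetical parent/child pair, not the assignment's schema
conn.executescript("""
CREATE TABLE parent (parent_id TEXT PRIMARY KEY);
CREATE TABLE child (
    child_id  INTEGER PRIMARY KEY,
    parent_id TEXT NOT NULL REFERENCES parent(parent_id)
);
""")

# Weak table first: the referenced parent row does not exist yet
try:
    conn.execute("INSERT INTO child (parent_id) VALUES ('p1')")
    child_first_ok = True
except sqlite3.IntegrityError:
    child_first_ok = False

# Strong table first, then the rows that depend on it
conn.execute("INSERT INTO parent VALUES ('p1')")
conn.execute("INSERT INTO child (parent_id) VALUES ('p1')")
child_count = conn.execute("SELECT COUNT(*) FROM child").fetchone()[0]

print(child_first_ok, child_count)
```

In the load cell, the order of the keys in `data_conf` plays the same role: the tables that `principals` depends on are loaded before `principals` itself.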

### **Test with `SELECT` queries.**

In [None]:
%%sql 
-- some test queries

## **3. Now try with the real thing.**

The load process below may take 30 minutes or more per pass. If Google is very busy then it may take a couple of hours. Imagine if you had to wait that long for each little change to your SQL DDL. 

### **Data Download**
This step only needs to be done once per Colab session. 



In [None]:
# Download from IMDB
!wget https://datasets.imdbws.com/name.basics.tsv.gz
!wget https://datasets.imdbws.com/title.basics.tsv.gz
!wget https://datasets.imdbws.com/title.principals.tsv.gz

### **Initialize database**
- To avoid filling up your Google Drive, the database is located in your Colab workspace. You should see it in the file browser to the left. If not then refresh the file browser to be sure it isn't lost. 
- If Colab crashes then delete the database file before rerunning your code.

In [None]:
%sql sqlite:///IMDB_Mirror.db

### **SQL DDL**
You should be able to copy and paste from part 2. 

In [None]:
%%sql
YOUR DDL CODE HERE

### **Load from files**
- This uses the pandas `pd.read_csv()` function with tab (`\t`) separators.
- Again, note the location of the database file. The file name and location must match the `%sql` connection above. 

In [None]:
# Table names must match the DDL from part 2 (titles, names, principals)
data_conf = {'titles':'title.basics.tsv.gz', 'names': 'name.basics.tsv.gz', 'principals':'title.principals.tsv.gz', }
conn = sqlite3.connect('IMDB_Mirror.db') 
with conn:
  for tbl,fname in data_conf.items():
    print(tbl,fname)
    df = pd.read_csv(fname,sep='\t')
    df.to_sql(tbl,conn,if_exists='append',index=False)
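
If the load cell runs out of RAM, one possible alternative is to stream the file and insert it in batches, so that only a small slice is in memory at any time. The sketch below is an assumption-laden illustration using the standard library: the `demo_items` table, the generated file, and the `load_in_batches` helper are all hypothetical. pandas offers a similar option through the `chunksize` argument of `read_csv`.

```python
import csv
import gzip
import sqlite3
import tempfile
from itertools import islice
from pathlib import Path

# Build a tiny gzipped TSV so the sketch runs anywhere (made-up data)
path = Path(tempfile.mkdtemp()) / "demo.tsv.gz"
with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
    w = csv.writer(f, delimiter="\t")
    w.writerow(["item_id", "label"])
    w.writerows([f"id{i}", f"label {i}"] for i in range(10))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_items (item_id TEXT PRIMARY KEY, label TEXT)")

def load_in_batches(conn, path, table, batch_size=4):
    """Insert rows batch by batch so only batch_size rows sit in memory."""
    total = 0
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)  # skip the header, use it to count columns
        placeholders = ", ".join("?" for _ in header)
        sql = f"INSERT INTO {table} VALUES ({placeholders})"
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            with conn:  # commit after each batch
                conn.executemany(sql, batch)
            total += len(batch)
    return total

loaded = load_in_batches(conn, path, "demo_items")
row_count = conn.execute("SELECT COUNT(*) FROM demo_items").fetchone()[0]
print(loaded, row_count)
```

A realistic batch size would be in the tens of thousands of rows; the value of 4 here only makes the batching visible on a ten-row file.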

### **Test with `SELECT` queries**
- Take care to limit your query results.
- Test each table but avoid lots of complex joins.
- Be prepared for Colab to crash on you when it runs out of RAM. When that happens rerun the code cells to this point. 
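
Two cheap kinds of test query are sketched below: an aggregate that returns a single row regardless of table size, and a preview capped with `LIMIT`. The `demo_rows` table is hypothetical and lives in an in-memory database; the same `SELECT` statements can be adapted to your tables in a `%%sql` cell.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_rows (n INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO demo_rows VALUES (?)", ((i,) for i in range(1000)))

# A single aggregate row is cheap to return no matter how big the table is
row_count = conn.execute("SELECT COUNT(*) FROM demo_rows").fetchone()[0]

# LIMIT caps how many rows come back to the notebook
preview = conn.execute("SELECT n FROM demo_rows ORDER BY n LIMIT 5").fetchall()

print(row_count, preview)
```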

In [None]:
%%sql 
-- some test queries

## **Want to keep a copy?**
Drag the `IMDB_Mirror.db` file to your Google Drive.

---
## **On your way out ... Be sure to save your work**.
In Google Drive, drag this notebook file into your `BUAN6510` folder so you can find it next time.