# IMDB Dataset Processing

## Overview

This notebook performs data engineering tasks for the IMDB dataset. This data engineering is part of the [larger example project](https://github.com/donald-f-ferguson/W4111-Project-Template) for [Prof. Donald Ferguson's](https://github.com/donald-f-ferguson) sections of _W4111 - Introduction to Databases_ at [Columbia University](https://www.cs.columbia.edu). Students in this class use the example project to help them get started with and implement their own projects.

## Setup

### Install Packages

In [7]:
# Install packages for the project. Students using the project template should be doing so in a new virtual environment
# for the overall project. Users only need to run this cell once to install the packages. For the first execution,
# uncomment the #%pip installs in the cells. After the first execution, students can comment out the installs.

#%pip install pandas

In [25]:
import pandas

In [8]:
#%pip install pymysql

In [26]:
import pymysql

In [9]:
#%pip install sqlalchemy

In [27]:
import sqlalchemy

In [10]:
#%pip install ipython-sql

### Import Additional Packages

__Note:__ These cells follow the same pattern. Students only need to uncomment and execute the pip installs
on the first execution.

In [36]:
#%pip install requests
import requests

### Test Setup

#### ipython-sql

In [15]:
%load_ext sql

In [12]:
# Please set these values to the correct values for your local installation of MySQL.
#
mysql_user = "root"
mysql_password = "dbuserdbuser"

In [13]:
database_url = f"mysql+pymysql://{mysql_user}:{mysql_password}@localhost"
database_url

'mysql+pymysql://root:dbuserdbuser@localhost'
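The f-string above works for simple credentials. If a password contains characters such as `@` or `:`, the URL can be parsed incorrectly. The following is a minimal sketch of one way to guard against that, assuming a hypothetical helper named `make_database_url` (not part of the original notebook) that percent-encodes the credentials with the standard library:

```python
from urllib.parse import quote_plus


def make_database_url(user, password, host="localhost"):
    """Hypothetical helper: build a mysql+pymysql URL with escaped credentials."""
    # quote_plus percent-encodes characters such as '@' and ':' that would
    # otherwise be confused with the URL's own separators.
    return f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}@{host}"


# The first call reproduces the URL used in this notebook.
print(make_database_url("root", "dbuserdbuser"))
# The second call uses a made-up password to show the escaping.
print(make_database_url("root", "p@ss:word"))
```

For plain alphanumeric passwords the result is identical to the f-string version, so the helper only matters when special characters are present.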

In [17]:
# Test ipython-sql
%sql {database_url}

In [18]:
# Test the connection. NOTE: The output will show the databases on your system. The output below is from mine.
#
%sql show databases;

 * mysql+pymysql://root:***@localhost
19 rows affected.


Database
classicmodels
columbia_model
course_management
course_student_coupons
courseworks_videos
db_book
F24_examples
F24_IMDB_clean
F24_IMDB_Raw
fitness


#### SQLAlchemy

In [23]:
from sqlalchemy import create_engine

In [24]:
engine = create_engine(database_url)

In [28]:
the_databases = pandas.read_sql(
    "show databases", con=engine
)

In [29]:
the_databases

Unnamed: 0,Database
0,classicmodels
1,columbia_model
2,course_management
3,course_student_coupons
4,courseworks_videos
5,db_book
6,F24_examples
7,F24_IMDB_clean
8,F24_IMDB_Raw
9,fitness


#### PyMySQL

In [30]:
mysql_con = pymysql.connect(
    user=mysql_user,
    password=mysql_password,
    host="localhost",
    port=3306,
    cursorclass=pymysql.cursors.DictCursor,
    autocommit=True
)

In [31]:
show_db_sql = "show databases;"

cur = mysql_con.cursor()
res = cur.execute(show_db_sql)

if res:
    result = cur.fetchall()
else:
    result = "Something bad happened."
    
result

[{'Database': 'classicmodels'},
 {'Database': 'columbia_model'},
 {'Database': 'course_management'},
 {'Database': 'course_student_coupons'},
 {'Database': 'courseworks_videos'},
 {'Database': 'db_book'},
 {'Database': 'F24_examples'},
 {'Database': 'F24_IMDB_clean'},
 {'Database': 'F24_IMDB_Raw'},
 {'Database': 'fitness'},
 {'Database': 'import_web_data'},
 {'Database': 'information_schema'},
 {'Database': 'lahmansbaseballdb'},
 {'Database': 'lor_data'},
 {'Database': 'mysql'},
 {'Database': 'p1_database'},
 {'Database': 'performance_schema'},
 {'Database': 'sys'},
 {'Database': 'testdb'}]
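Because the connection uses `DictCursor`, each row comes back as a dictionary keyed by column name. As a small sketch of working with that shape, the hypothetical helper `user_databases` below (an illustration, not part of the original notebook) drops MySQL's built-in system schemas from a result like the one above:

```python
# MySQL's built-in system schemas.
SYSTEM_SCHEMAS = {"information_schema", "mysql", "performance_schema", "sys"}


def user_databases(rows):
    """Hypothetical helper: return non-system database names from DictCursor rows."""
    return [
        row["Database"]
        for row in rows
        if row["Database"].lower() not in SYSTEM_SCHEMAS
    ]


# A shortened copy of the output above, used as sample input.
sample = [
    {"Database": "classicmodels"},
    {"Database": "information_schema"},
    {"Database": "F24_IMDB_Raw"},
    {"Database": "sys"},
]
print(user_databases(sample))
```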

## Get IMDB Dataset

### Overview

Students should NOT execute the cells in this section. The dataset files are very large, and processing the data may
overwhelm a student's computer or MySQL installation.

Students wanting to execute later content in this notebook can use the reduced size ```.csv``` files in the [Students_Data](../data/Students_Data) directory.

### Download the Datasets

The "free" and "open" IMDB datasets are available from the [IMDB web site.](https://developer.imdb.com/non-commercial-datasets/)

In [32]:
# Set the URL of the page with the download links to the datasets.
#
imdb_data_download_url = "https://datasets.imdbws.com/"

In [33]:
# The names of the data files, matching the file names on the IMDB download page.
#
file_name = [
    "name.basics",
    "title.akas",
    "title.basics",
    "title.crew",
    "title.episode",
    "title.principals",
    "title.ratings"
]

# The data files are gzip-compressed, tab-separated (tsv) files.
#
file_suffix = "tsv.gz"
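The base URL, file names, and suffix defined above combine into one download URL per file. A minimal sketch of that step follows, assuming a hypothetical helper `build_download_urls` (not part of the original notebook) and using only two of the dataset names for brevity:

```python
def build_download_urls(base_url, names, suffix):
    """Hypothetical helper: map each dataset name to its full download URL."""
    # Strip any trailing slash so the joined URL has exactly one separator.
    base = base_url.rstrip("/")
    return {name: f"{base}/{name}.{suffix}" for name in names}


urls = build_download_urls(
    "https://datasets.imdbws.com/",
    ["name.basics", "title.principals"],
    "tsv.gz",
)
for name, url in urls.items():
    print(name, "->", url)
```

Keeping the mapping in a dictionary makes it straightforward to loop over the entries and pass each URL, together with a local file name, to a download function.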

In [None]:
def download_file(url, filename):
    """Stream the file at `url` to the local path `filename`."""
    try:
        # Send a GET request to the URL. The timeout keeps the call from hanging
        # indefinitely if the server stops responding.
        response = requests.get(url, stream=True, timeout=60)
        # Raise an exception for 4xx/5xx responses
        response.raise_for_status()

        # Open the file in binary write mode and write chunks of data to it
        with open(filename, "wb") as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
        print(f"File downloaded successfully: {filename}")

    except requests.exceptions.RequestException as e:
        print(f"Error downloading file: {e}")

# Usage. The URL and file name below are placeholders; replace them with a real dataset URL.
url = "https://example.com/file.zip"
filename = "file.zip"
download_file(url, filename)
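Once a file is on disk, it can be inspected without decompressing it first. The sketch below is one possible approach using only the standard library. The helper `read_tsv_gz` is hypothetical, and the tiny sample file and its row are invented so that the example runs without downloading anything. It assumes that missing values are written as `\N`, as in the IMDB data files.

```python
import csv
import gzip
import os
import tempfile


def read_tsv_gz(path, limit=5):
    """Hypothetical helper: return up to `limit` rows as dicts, mapping '\\N' to None."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters inside values as ordinary text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(rows) >= limit:
                break
            rows.append({k: (None if v == r"\N" else v) for k, v in row.items()})
    return rows


# Build a tiny stand-in file (invented data) so the sketch is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.tsv.gz")
with gzip.open(sample_path, "wt", encoding="utf-8", newline="") as handle:
    handle.write("tconst\taverageRating\tnumVotes\n")
    handle.write("tt0000001\t5.7\t\\N\n")

sample_rows = read_tsv_gz(sample_path)
print(sample_rows)
```

Reading only the first few rows keeps memory use small, which matters given how large the full files are.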
