<a href="https://colab.research.google.com/github/matthewpecsok/data_engineering/blob/main/tutorials/de_tutorial_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to Jupyter!

This is a Jupyter notebook. Some blocks are plain text and others contain Python code or shell commands.

We run the notebook from top to bottom, in sequence, because each block of code often requires the prior blocks to have been executed.

# Step 1: Install required packages

The standard Python distribution doesn't contain all the functionality required for many tasks. To extend it we install packages. Packages can be installed with pip (the package installer for Python), a command-line tool. To call it from a notebook we use !

### !

To call pip we use the ! prefix, which tells the Jupyter notebook to execute the rest of the line as a shell command. Shell commands are a common way to interact with the operating system on Unix/Linux platforms. Your Google Colab environment is running the Ubuntu Linux operating system.

## pip details

### pip -q

The "-q" flag tells pip to be quiet about its output. This hides the verbose output that would normally be printed while the packages install.

### install --upgrade

The install command tells pip we are planning to install a package. Since the ipython package is already installed we use the --upgrade flag to tell pip to upgrade ipython to the latest version.

### Which package?

Finally we tell pip which package we want to install. In this case we are installing/upgrading 2 packages, ipython and ipython-sql.

In [None]:
!pip -q install --upgrade ipython
!pip -q install --upgrade ipython-sql
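
To confirm that an install or upgrade took effect, the installed version can be read from Python itself. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and is not part of pip or Jupyter.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Package names taken from the install cell above.
for name in ("ipython", "ipython-sql"):
    print(name, installed_version(name))
```

A result of `None` means the package is not installed in the current environment.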

### YOUDO

Install a package named boto3 and import boto3.



> Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python, which allows Python developers to write software that makes use of services like Amazon S3 and Amazon EC2. You can find the latest, most up to date, documentation at our doc site, including a list of services that are supported.



### YOUDO ANSWER

In [None]:
!pip -q install boto3
import boto3

# Step 2: Import sqlite3 package

sqlite3 is a package that allows us to use a lightweight, file-based SQL database. We'll use this simple database to perform common database operations without needing to install a more complex database like MySQL or Postgres. For our purposes a simple single-file database is sufficient.

In [None]:
import sqlite3

How do you learn what these packages can do? Read the documentation.

https://www.sqlite.org/doclist.html

In [None]:
# help(sqlite3)

# Step 3: Download the database to a local file

Use the shell command 'wget' to retrieve the movies database file from GitHub. After running the code you can see the file in the files pane on the left nav bar in Colab.

If you run this code on a Windows operating system it will likely fail, because wget is not normally available there by default; it is a standard tool on Unix-based platforms.

In [None]:
!wget -O movies.db "https://github.com/matthewpecsok/data_engineering/blob/main/data/movies.sqlite?raw=true"
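
On a system without `wget`, the same download can be done from Python. This is a sketch using `urllib.request.urlretrieve` from the standard library; the helper name `download_file` and the constant `MOVIES_URL` are assumptions for illustration, and the URL is the one from the cell above.

```python
import urllib.request

MOVIES_URL = (
    "https://github.com/matthewpecsok/data_engineering/"
    "blob/main/data/movies.sqlite?raw=true"
)


def download_file(url, destination):
    """Save the resource at `url` to the local path `destination`."""
    path, _headers = urllib.request.urlretrieve(url, destination)
    return path


# Uncomment to fetch the database (requires network access):
# download_file(MOVIES_URL, "movies.db")
```

The call is left commented out so the block can be read and run without network access.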

# Step 4: Create a sqlite connection object.

To interact with the database we can create a connection object. We'll call it 'movie_con' but be aware we can give it any name we like, and we can have multiple connection objects to multiple databases open at any time.

In [None]:
movie_con = sqlite3.connect('movies.db')

In [None]:
type(movie_con) # use type to tell us what type of object movie_con is.
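
A connection object can also run SQL directly, without any notebook magic. The sketch below uses an in-memory database (`":memory:"`) and a hypothetical one-column table named `demo`, so it does not touch `movies.db`.

```python
import sqlite3

# ":memory:" creates a temporary database that lives only in RAM.
demo_con = sqlite3.connect(":memory:")

demo_con.execute("create table demo (title text)")
demo_con.execute("insert into demo values (?)", ("Example Movie",))
demo_con.commit()

# execute() returns a cursor; fetchall() returns the rows as a list of tuples.
rows = demo_con.execute("select title from demo").fetchall()
print(rows)

demo_con.close()
```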

In [None]:
# Here we set up Jupyter so we can interact with the sqlite database easily within code blocks.

%load_ext sql
%sql sqlite:///movies.db

%config SqlMagic.autopandas = True # automatically return a pandas dataframe object.
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

# Querying the sqlite_master table

What table names are in our database?

Use the %%sql cell magic to run a multi-line SQL statement against the sqlite_master table, which lists all of the tables in the database.


The two tables of interest are directors and movies.

Let's explore those tables.

In [None]:
%%sql SELECT name, sql FROM sqlite_master
WHERE type='table'
ORDER BY name;
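
The same catalog query works on any SQLite database. A small sketch, assuming an in-memory database with two empty, hypothetical tables that only borrow the names `movies` and `directors`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical, simplified schemas; the real tables have more columns.
con.execute("create table directors (id integer primary key, name text)")
con.execute(
    "create table movies (id integer primary key, title text, director_id integer)"
)

# sqlite_master holds one row per table, index, view, or trigger.
table_names = [
    name
    for (name,) in con.execute(
        "select name from sqlite_master where type = 'table' order by name"
    )
]
print(table_names)
con.close()
```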

# Step 5: Sql queries

## sql select * from table

Notice that the result is a pandas dataframe and that the display shows the first 5 and last 5 movies.

\* returns all columns from the table.


In [None]:
%sql select * from movies

# sql count

%sql runs a single line sql statement. How many rows are in our movies table?



In [None]:
%sql select count(1) as movie_count from movies

### YOUDO

How many rows are in the directors table?

### YOUDO ANSWER

In [None]:
%sql select count(1) as director_count from directors

# sql order by

In [None]:
%sql select popularity,title from movies order by popularity desc limit 5

In [None]:
%sql select popularity,title from movies order by popularity limit 5

# sql Limit

Show the first 5 rows in the movies table.

* id: the primary key of the table
* original_title: the original name of the movie
* budget: the cost of the movie (in dollars)
* popularity: a score for how popular the movie was
* revenue: how much the movie brought in (in dollars)
* title: the name of the movie
* vote_average: the average vote on a 10-point scale
* vote_count: the count of votes
* overview: a description of the movie
* tagline: the tagline of the movie
* uid: (ignorable)
* director_id: foreign key to the directors table


In [None]:
%sql select * from movies limit 5

### YOUDO

Show the first 3 rows in the directors table

### YOUDO ANSWER

In [None]:
%sql select * from directors limit 3

# sql max function

Show both the maximum budget and maximum revenue from all movies. Note these may be from 2 different movies.

In [None]:
%sql select max(budget),max(revenue) from movies

### YOUDO

Find the minimum budget and revenue in the movies table.

### YOUDO ANSWER

In [None]:
%sql select min(budget),min(revenue) from movies

# Which movie(s) had the max budget?

In [None]:
%sql select * from movies where budget = 380000000

Which movie(s) had the max revenue?

In [None]:
%sql select * from movies where revenue = 2787965087
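
Hard-coding the maximum means the query stops being correct if the data changes. One alternative, sketched here against a tiny hypothetical table with made-up rows, is to let a subquery supply the maximum:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table movies (title text, budget integer)")
con.executemany(
    "insert into movies values (?, ?)",
    [("A", 100), ("B", 300), ("C", 300), ("D", 50)],  # made-up sample rows
)

# The inner query finds the maximum; the outer query returns every row tied for it.
top_budget = con.execute(
    "select title, budget from movies "
    "where budget = (select max(budget) from movies) "
    "order by title"
).fetchall()
print(top_budget)
con.close()
```

In the notebook the same idea would be written as a single `%sql` statement with the subquery in the `where` clause.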

# sql left join

In [None]:
%sql select count(1) from movies m left join directors d on m.director_id = d.id

# sql inner join

In [None]:
%sql select count(1) from movies m inner join directors d on m.director_id = d.id
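
The two counts differ only when some movies have no matching director: a left join keeps those movies (with NULL director columns), while an inner join drops them. A minimal sketch with made-up rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table directors (id integer, name text)")
con.execute("create table movies (title text, director_id integer)")
con.execute("insert into directors values (1, 'Director One')")
con.executemany(
    "insert into movies values (?, ?)",
    [("Has director", 1), ("Orphan movie", 99)],  # 99 matches no director
)

join_sql = (
    "select count(1) from movies m {kind} join directors d on m.director_id = d.id"
)
left_count = con.execute(join_sql.format(kind="left")).fetchone()[0]
inner_count = con.execute(join_sql.format(kind="inner")).fetchone()[0]
print(left_count, inner_count)
con.close()
```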

Select specific columns from the movies table, sort the rows by budget in descending order, and limit the result to the top 10 budgets.

Left join to the directors table so we can also retrieve the director's name.

In [None]:
%sql select budget,title,d.name from movies m left join directors d on m.director_id = d.id order by budget desc limit 10

### YOUDO

Select the title and budget from the movies table, sort the movies by revenue in descending order, and limit the result to 5 movies.

### YOUDO ANSWER

In [None]:
%sql select title,budget from movies order by revenue desc limit 5

We might be curious what the earliest release date was, and the latest release date. Finally, we might want to know what range of years our dataset covers.

We can subtract the min release_date from the max release_date to get the number of years the dataset spans.

In [None]:
%sql select min(release_date),max(release_date),max(release_date) - min(release_date) as years_of_releases from movies
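
The subtraction works because SQLite converts each text date to a number by reading its leading digits, so only the year part takes part in the arithmetic. A sketch with two made-up dates:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table movies (release_date text)")
con.executemany(
    "insert into movies values (?)",
    [("1990-12-25",), ("2010-01-01",)],  # hypothetical release dates
)

earliest, latest, years = con.execute(
    "select min(release_date), max(release_date), "
    "max(release_date) - min(release_date) from movies"
).fetchone()
# '2010-01-01' - '1990-12-25' is evaluated as 2010 - 1990.
print(earliest, latest, years)
con.close()
```

Note that the result is a difference of calendar years, not an exact elapsed time: the two dates here are a little over 19 years apart.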

# sql avg and group by

We can compute the average budget and revenue over our dataset grouped by director. We then show which directors had the highest average budget.

While Rob Marshall had the most expensive movie by budget, his average movie budget was lower than that of some other directors. It's worth noting that many directors, such as Byron Howard, have only 1 movie in the dataset.

In [None]:
%%sql select

avg(budget),
avg(revenue),
name as director_name,
count(1) as movie_count

from movies m left join directors d on m.director_id = d.id
group by director_id order by (avg(budget)) desc
limit 8

# sql distinct

In [None]:
%sql select distinct(department) from directors

In [None]:
%%sql select *

from movies m
left join directors d on m.director_id = d.id


limit 5

# comparing max and average budgets by director.



## Top 5 directors by average budget.

In [None]:
%%sql select
avg(budget),
max(budget),
avg(revenue),
max(revenue),
director_id,
name,
count(1) as movie_count


from movies m
left join directors d on m.director_id = d.id

group by director_id
order by avg(budget)
desc
limit 5


# sql like statement

Query for any movie whose title contains the words star wars. Notice that the like operator in sqlite is case insensitive by default (for ASCII characters).

'%text%' matches the text anywhere within the column.


In [None]:
%sql select title from movies where title like '%star wars%'

# sql ends with %

Search for any title ending with the word 'star'.

In [None]:
%sql select title from movies where title like '%star'

In [None]:
%sql select * from movies where overview like '%comedy%'
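
A quick sketch of the three wildcard positions, using a few sample titles in a hypothetical in-memory table. The helper name `titles_like` is an assumption for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table movies (title text)")
con.executemany(
    "insert into movies values (?)",
    [("Star Wars",), ("Lone Star",), ("Starship Troopers",)],  # sample rows
)


def titles_like(pattern):
    """Return titles matching a LIKE pattern, sorted alphabetically."""
    rows = con.execute(
        "select title from movies where title like ? order by title", (pattern,)
    )
    return [title for (title,) in rows]


contains = titles_like("%star%")     # 'star' anywhere in the title
ends_with = titles_like("%star")     # title ends with 'star'
starts_with = titles_like("star%")   # title starts with 'star'
print(contains, ends_with, starts_with)
```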

### YOUDO

Find all movies with a tagline starting with the text hero.

### YOUDO ANSWER

In [None]:
%sql select * from movies where tagline like 'hero%'

# sql greater than, greater than or equal

Some movies have a very high vote_average... but only 1 person voted for them. Requiring a minimum vote_count filters these out.

In [None]:
%sql select * from movies where vote_average > 9

In [None]:
%sql select * from movies where vote_average >= 8.5 and vote_count >= 10

# sql having clause

We can group by and then use the having clause to filter on aggregated data. For example, we can keep only the directors who have directed at least 5 movies.

There are 211 directors who have directed at least 5 movies.

In [None]:
%sql select d.name,count(1) from movies m left join directors d on m.director_id = d.id group by director_id having count(1) >= 5 order by count(1) desc
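
A `where` clause filters rows before grouping, while `having` filters the groups afterwards. A sketch with made-up data and a lower threshold of 2 movies:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table movies (title text, director_id integer)")
con.executemany(
    "insert into movies values (?, ?)",
    [("A", 1), ("B", 1), ("C", 1), ("D", 2), ("E", 3), ("F", 3)],  # made-up rows
)

# having is applied to each group, after count(1) has been computed.
busy_directors = con.execute(
    "select director_id, count(1) as movie_count from movies "
    "group by director_id "
    "having count(1) >= 2 "
    "order by movie_count desc"
).fetchall()
print(busy_directors)
con.close()
```

Director 2 has only one movie, so that group is removed by the `having` clause.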

# year extraction

What if we want to extract the year itself from the date?

In [None]:
%sql select release_date from movies limit 5

## solution 1: substring

We can use substr to get the first through fourth characters.

In [None]:
%sql select release_date,substr(release_date,1,4) as year from movies limit 5

## solution 2: string format

We can use date-specific string formats to do this more easily (the code is also more readable).

In [None]:
%sql select strftime('%Y', release_date) as year from movies limit 5
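
Both approaches agree for well-formed `YYYY-MM-DD` text, but they differ on bad input: `substr` returns whatever characters are there, whereas `strftime` returns NULL when it cannot parse the date. A sketch with one valid and one deliberately invalid value; the helper name `year_parts` is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")


def year_parts(date_text):
    """Return (substr result, strftime result) for one date string."""
    return con.execute(
        "select substr(?, 1, 4), strftime('%Y', ?)", (date_text, date_text)
    ).fetchone()


valid = year_parts("2009-12-10")
invalid = year_parts("unknown")  # not a date: strftime yields NULL (None in Python)
print(valid, invalid)
con.close()
```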

### YOUDO

Extract the day from the release_date text using both substr and strftime.



### YOUDO ANSWER

In [None]:
%sql select release_date, strftime('%d', release_date) as day,substr(release_date,9,2) from movies limit 5

# sql count of movies by year

In [None]:
%sql select strftime('%Y', release_date) as movie_year,count(1) as year_count from movies group by movie_year order by year_count desc limit 5

# VIEWS

A view is a stored SQL statement that can be queried like a table. Views are great for exposing data to your users without giving them direct access to the underlying tables.



In [None]:
%sql select * from sqlite_master

In [None]:
%sql create view movie_year_count as select strftime('%Y', release_date) as movie_year,count(1) as year_count from movies group by movie_year order by year_count desc

In [None]:
%sql select * from sqlite_master

Now we have a view and can query for all views

In [None]:
%sql select * from sqlite_master where type = 'view'

In [None]:
%sql select * from movie_year_count
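
Because a view stores the query rather than its results, it reflects rows added after the view was created. A sketch with a hypothetical `movies` table and a hypothetical view named `movie_count_by_year`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table movies (title text, release_date text)")
con.execute("insert into movies values ('A', '2001-05-01')")
con.execute(
    "create view movie_count_by_year as "
    "select strftime('%Y', release_date) as movie_year, count(1) as year_count "
    "from movies group by movie_year"
)

before = con.execute("select * from movie_count_by_year").fetchall()
con.execute("insert into movies values ('B', '2001-09-15')")
# The view re-runs its query, so the new row is counted.
after = con.execute("select * from movie_count_by_year").fetchall()
print(before, after)
con.close()
```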

# Table creation and population

## drop the table (if it exists)

In [None]:
%sql drop table if exists user_favorite_movies

## create the table

Use if not exists to avoid an error if this code re-runs and the table already exists. This makes the cell safe to run repeatedly.

In [None]:
%sql create table if not exists user_favorite_movies (user_id integer, movie_id integer,user_score real)
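
What `if not exists` buys us can be seen by running a `create table` statement twice. In this sketch, against an in-memory database and a hypothetical table named `demo`, the plain statement raises `sqlite3.OperationalError` on the second run while the guarded one does nothing:

```python
import sqlite3

con = sqlite3.connect(":memory:")
plain = "create table demo (id integer)"
guarded = "create table if not exists demo (id integer)"

con.execute(plain)        # first run succeeds
try:
    con.execute(plain)    # second run fails: the table already exists
    plain_rerun_failed = False
except sqlite3.OperationalError as err:
    plain_rerun_failed = True
    print("error:", err)

con.execute(guarded)      # safe to re-run any number of times
con.execute(guarded)
con.close()
```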

## insert 5 rows into the table

In [None]:
%sql insert into user_favorite_movies values (1,1,10)
%sql insert into user_favorite_movies values (1,2,7.5)
%sql insert into user_favorite_movies values (1,3,7.5)
%sql insert into user_favorite_movies values (1,5,3.5)
%sql insert into user_favorite_movies values (1,9,5.5)

## query the table.

In [None]:
%sql select * from user_favorite_movies

# round score

In [None]:
%sql select round(user_score,0) as rounded_score, user_score from user_favorite_movies
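
In SQLite, `round(x, 0)` returns a floating-point value even for whole numbers. A sketch that applies it to the same five scores, passed as literals rather than read from the table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
scores = [10, 7.5, 7.5, 3.5, 5.5]  # the scores inserted above

# Each value is bound as a parameter and rounded by SQLite, not by Python.
rounded = [con.execute("select round(?, 0)", (s,)).fetchone()[0] for s in scores]
print(rounded)
con.close()
```

Python's built-in `round` uses different tie-breaking rules, so results for values ending in .5 can differ between the two.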