In [1]:
import pandas as pd
import sqlalchemy as sa
import psycopg2 as ps
from sqlalchemy import create_engine

In [2]:
%load_ext sql
%sql postgresql://postgres:lingga28@localhost:2828/datacamp
# Use the same port (2828) as the %sql connection above
conn = create_engine('postgresql://postgres:lingga28@localhost:2828/datacamp')

# 1. Sorting text
### Exercises
SQL provides the ORDER BY keyword to sort results by one or more fields, in ascending or descending order, which makes results easier to interpret.

How does ORDER BY sort a column of text values by default?

### Possible Answers

- A. Alphabetically (A-Z)
- B. Reverse alphabetically (Z-A)
- C. There's no natural ordering to text data
- D. By number of characters (fewest to most)

Answer: A
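
As a quick illustration of the default direction, here is a minimal sketch that runs the same kind of query from Python. It assumes the standard-library `sqlite3` module with an in-memory table as a stand-in for the PostgreSQL database, and the three names are made up for the example. Text collation rules can differ between database engines, so this only sketches the default ascending behaviour.

```python
import sqlite3

# In-memory SQLite database: an assumed stand-in for the PostgreSQL setup above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (name TEXT)")
db.executemany(
    "INSERT INTO people (name) VALUES (?)",
    [("Cara",), ("Abel",), ("Bram",)],  # hypothetical names
)

# No direction keyword: ORDER BY defaults to ascending (A-Z).
default_order = [row[0] for row in db.execute("SELECT name FROM people ORDER BY name")]

# DESC reverses the direction (Z-A).
reverse_order = [row[0] for row in db.execute("SELECT name FROM people ORDER BY name DESC")]

print(default_order)  # ['Abel', 'Bram', 'Cara']
print(reverse_order)  # ['Cara', 'Bram', 'Abel']
db.close()
```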

# 2. Sorting single fields
### Exercises
Now that you understand how ORDER BY works, you'll put it into practice. In this exercise, you'll work on sorting single fields only. This can be helpful to extract quick insights such as the top-grossing or top-scoring film.

The following exercises will help you gain further insights into the film database.

### task 1
### Instruction
Select the name of each person in the people table, sorted alphabetically.

In [5]:
%%sql

-- Select name from people and sort alphabetically
SELECT name
FROM cinema.people
ORDER BY name
LIMIT 3; -- added only to keep the output short

 * postgresql://postgres:***@localhost:2828/datacamp
3 rows affected.


name
50 Cent
A. Michael Baldwin
A. Raven Cruz


### task 2
### Instruction
Select the title and duration for every film, from longest duration to shortest.

In [6]:
%%sql

-- Select the title and duration from longest to shortest film
SELECT title, duration
FROM cinema.films
ORDER BY duration DESC
LIMIT 3; -- added only to keep the output short

 * postgresql://postgres:***@localhost:2828/datacamp
3 rows affected.

The rows below come from the default ascending sort (ORDER BY duration without DESC), which lists the shortest films first:

title,duration
The Touch,7
Vessel,14
Wal-Mart: The High Cost of Low Price,20


# 3. Sorting multiple fields
### Exercises
ORDER BY can also be used to sort on multiple fields. It will sort by the first field specified, then sort by the next, and so on. As an example, you may want to sort the people data by age and keep the names in alphabetical order.

Try using ORDER BY to sort multiple columns.
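
To make the "first field, then next field" behaviour concrete, here is a small sketch using Python's standard-library `sqlite3` module as an assumed stand-in for the PostgreSQL database. The `films` rows are hypothetical. Each field in ORDER BY carries its own direction, so `release_year DESC` only reverses the order within each certification.

```python
import sqlite3

# In-memory SQLite database: an assumed stand-in for the PostgreSQL setup above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE films (title TEXT, certification TEXT, release_year INTEGER)")
db.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [  # hypothetical rows
        ("Film A", "PG", 2001),
        ("Film B", "G", 1999),
        ("Film C", "PG", 2010),
        ("Film D", "G", 2005),
    ],
)

# Sort by certification A-Z first; ties are broken by release_year, newest first.
sorted_rows = db.execute(
    """
    SELECT certification, release_year, title
    FROM films
    ORDER BY certification, release_year DESC
    """
).fetchall()

for row in sorted_rows:
    print(row)
db.close()
```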

### task 1
### Instruction
Select the release_year, duration, and title of films ordered by their release year and duration, in that order.

In [7]:
%%sql

-- Select the release year, duration, and title sorted by release year and duration
SELECT release_year, duration, title
FROM cinema.films
ORDER BY release_year, duration
LIMIT 3; -- added only to keep the output short

 * postgresql://postgres:***@localhost:2828/datacamp
3 rows affected.


release_year,duration,title
1916.0,123,Intolerance: Love's Struggle Throughout the Ages
1920.0,110,Over the Hill to the Poorhouse
1925.0,151,The Big Parade


### task 2
### Instruction
Select the certification, release_year, and title from films ordered first by certification (alphabetically) and second by release year, starting with the most recent year.

In [8]:
%%sql

-- Select the certification, release year, and title sorted by certification and release year
SELECT certification, release_year, title
FROM cinema.films
ORDER BY certification, release_year DESC
LIMIT 3; -- added only to keep the output short

 * postgresql://postgres:***@localhost:2828/datacamp
3 rows affected.

The rows below come from sorting both fields in ascending order (no DESC on release_year), so the earliest years appear first within each certification:

certification,release_year,title
Approved,1933.0,She Done Him Wrong
Approved,1935.0,Top Hat
Approved,1936.0,The Charge of the Light Brigade


# 4. GROUP BY single fields
### Exercises
GROUP BY is a SQL keyword that allows you to group and summarize results with the additional use of aggregate functions. For example, films can be grouped by the certification and language before counting the film titles in each group. This allows you to see how many films had a particular certification and language grouping.

In the following steps, you'll summarize other groups of films to learn more about the films in your database.
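
The grouping idea can be sketched on a tiny table. The example below assumes Python's standard-library `sqlite3` module as a stand-in for PostgreSQL and uses hypothetical rows: every distinct `release_year` collapses to one output row, and the aggregate functions summarize the rows inside each group.

```python
import sqlite3

# In-memory SQLite database: an assumed stand-in for the PostgreSQL setup above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE films (title TEXT, release_year INTEGER, duration INTEGER)")
db.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [  # hypothetical rows
        ("Film A", 2000, 100),
        ("Film B", 2000, 120),
        ("Film C", 2001, 90),
    ],
)

# One output row per release_year, with a count and an average for that group.
per_year = db.execute(
    """
    SELECT release_year, COUNT(*) AS film_count, AVG(duration) AS avg_duration
    FROM films
    GROUP BY release_year
    ORDER BY release_year
    """
).fetchall()

print(per_year)  # [(2000, 2, 110.0), (2001, 1, 90.0)]
db.close()
```

The ORDER BY is included because a grouped result has no guaranteed row order without one.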

### task 1
### Instruction
Select the release_year and count of films released in each year aliased as film_count.

In [9]:
%%sql

-- Find the release_year and film_count of each year
SELECT release_year, count(*) as film_count
FROM cinema.films
GROUP BY release_year
LIMIT 3; -- added only to keep the output short

 * postgresql://postgres:***@localhost:2828/datacamp
3 rows affected.


release_year,film_count
1958.0,1
1936.0,2
1991.0,31


### task 2
### Instruction
Select the release_year and average duration aliased as avg_duration of all films, grouped by release_year.

In [10]:
%%sql

-- Find the release_year and average duration of films for each year
SELECT release_year, AVG(duration) as avg_duration
FROM cinema.films
GROUP BY release_year
LIMIT 3; -- added only to keep the output short

 * postgresql://postgres:***@localhost:2828/datacamp
3 rows affected.


release_year,avg_duration
1958.0,108.0
1936.0,93.5
1991.0,113.06451612903224


# 5. GROUP BY multiple fields
### Exercises
GROUP BY becomes more powerful when used across multiple fields or combined with ORDER BY and LIMIT.

Perhaps you're interested in learning about budget changes throughout the years in individual countries. You'll use grouping in this exercise to look at the maximum budget for each country in each year there is data available.
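
As a rough sketch of grouping on two fields, the example below assumes Python's standard-library `sqlite3` module as a stand-in for PostgreSQL and uses hypothetical rows. Each distinct (release_year, country) pair forms its own group, and the ORDER BY makes the row order deterministic.

```python
import sqlite3

# In-memory SQLite database: an assumed stand-in for the PostgreSQL setup above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE films (release_year INTEGER, country TEXT, budget INTEGER)")
db.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [  # hypothetical rows
        (2000, "USA", 50),
        (2000, "USA", 80),
        (2000, "UK", 30),
        (1999, "USA", 20),
    ],
)

# One group per (release_year, country) pair; MAX picks the largest budget in each.
max_budgets = db.execute(
    """
    SELECT release_year, country, MAX(budget) AS max_budget
    FROM films
    GROUP BY release_year, country
    ORDER BY release_year, country
    """
).fetchall()

print(max_budgets)  # [(1999, 'USA', 20), (2000, 'UK', 30), (2000, 'USA', 80)]
db.close()
```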

### Instructions
Select the release_year, country, and the maximum budget aliased as max_budget for each year and each country; sort your results by release_year and country.

In [13]:
%%sql

-- Find the release_year, country, and max_budget, then group and order by release_year and country
SELECT release_year, country, MAX(budget) as max_budget
FROM cinema.films
GROUP BY release_year, country
ORDER BY release_year, country
LIMIT 3; -- added only to keep the output short

 * postgresql://postgres:***@localhost:2828/datacamp
3 rows affected.

The rows below were returned without the ORDER BY clause, so they appear in no particular order:

release_year,country,max_budget
2000.0,Germany,60000000
1970.0,USA,25000000
1998.0,UK,75000000


# 6. Answering business questions
### Exercises
In the real world, every SQL query starts with a business question. Then it is up to you to decide how to write the query that answers the question. Let's try this out.

Which release_year had the most language diversity?

Take your time to translate this question into code. We'll get you started; then it's up to you to test your queries in the console.

"Most language diversity" can be interpreted as COUNT(DISTINCT ___). Now over to you.

### Possible Answers
- A. 2005
- B. 1916
- C. 2006
- D. 1990

Answer: C
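
One way to translate "most language diversity" into a query is sketched below. It assumes Python's standard-library `sqlite3` module as a stand-in for PostgreSQL, and the years and languages are hypothetical toy data, so the result says nothing about the real films table. COUNT(DISTINCT language) counts each language once per year, whereas COUNT(language) counts every non-NULL row.

```python
import sqlite3

# In-memory SQLite database: an assumed stand-in for the PostgreSQL setup above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE films (release_year INTEGER, language TEXT)")
db.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [  # hypothetical rows
        (2001, "English"),
        (2001, "French"),
        (2001, "English"),
        (2002, "English"),
        (2002, "English"),
    ],
)

# Compare the plain count with the distinct count for each year.
counts = db.execute(
    """
    SELECT release_year, COUNT(language), COUNT(DISTINCT language)
    FROM films
    GROUP BY release_year
    ORDER BY release_year
    """
).fetchall()

# Year with the most distinct languages: sort by the distinct count and keep one row.
most_diverse = db.execute(
    """
    SELECT release_year, COUNT(DISTINCT language) AS language_count
    FROM films
    GROUP BY release_year
    ORDER BY language_count DESC
    LIMIT 1
    """
).fetchone()

print(counts)        # [(2001, 3, 2), (2002, 2, 1)]
print(most_diverse)  # (2001, 2)
db.close()
```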

# 7. Filter with HAVING
### Exercises
Your final keyword is HAVING. It works similarly to WHERE in that it is a filtering clause, with the difference that HAVING filters grouped data.

Filtering grouped data can be especially handy when working with a large dataset. When working with thousands or even millions of rows, HAVING lets you keep only the groups you want, such as release years whose films average more than two hours in length!

Practice using HAVING to find out which countries (or country) have the most varied film certifications.
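
The difference between WHERE and HAVING can be sketched on a toy table. The example assumes Python's standard-library `sqlite3` module as a stand-in for PostgreSQL, with hypothetical rows. WHERE removes individual rows before grouping, while HAVING removes whole groups after the aggregates have been computed.

```python
import sqlite3

# In-memory SQLite database: an assumed stand-in for the PostgreSQL setup above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE films (country TEXT, certification TEXT)")
db.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [  # hypothetical rows
        ("USA", "G"),
        ("USA", "PG"),
        ("USA", "R"),
        ("UK", "PG"),
        ("UK", "PG"),
    ],
)

# HAVING filters groups: only countries with more than 2 distinct certifications remain.
having_only = db.execute(
    """
    SELECT country, COUNT(DISTINCT certification) AS certification_count
    FROM films
    GROUP BY country
    HAVING COUNT(DISTINCT certification) > 2
    """
).fetchall()

# WHERE runs first and drops the 'G' rows, so no group reaches 3 certifications.
where_then_having = db.execute(
    """
    SELECT country, COUNT(DISTINCT certification) AS certification_count
    FROM films
    WHERE certification <> 'G'
    GROUP BY country
    HAVING COUNT(DISTINCT certification) > 2
    """
).fetchall()

print(having_only)        # [('USA', 3)]
print(where_then_having)  # []
db.close()
```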

### Instructions
- Select country from the films table, and get the distinct count of certification aliased as certification_count.
- Group the results by country.
- Filter the unique count of certifications to only results greater than 10.

In [14]:
%%sql

-- Select the country and distinct count of certification as certification_count
SELECT country, COUNT(DISTINCT certification) as certification_count
FROM cinema.films

-- Group by country
GROUP BY country
-- Filter results to countries with more than 10 different certifications
HAVING count(DISTINCT certification) > 10;

 * postgresql://postgres:***@localhost:2828/datacamp
1 rows affected.


country,certification_count
USA,12


# 8. HAVING and sorting
### Exercises
Filtering and sorting go hand in hand: ordering the filtered results makes them easier to interpret.

Let's see this magic at work by writing a query showing what countries have the highest average film budgets.

### Instructions
- Select the country and the average budget as average_budget, rounded to two decimal places, from films.
- Group the results by country.
- Filter the results to countries with an average budget of more than one billion (1000000000).
- Sort by descending order of the average_budget.

In [15]:
%%sql

-- Select the country and average_budget from films
SELECT country, ROUND(AVG(budget), 2) AS average_budget
FROM cinema.films
-- Group by country
GROUP BY country
-- Filter to countries with an average_budget of more than one billion
HAVING AVG(budget) > 1000000000
-- Order by descending order of the aggregated budget
ORDER BY average_budget DESC;

 * postgresql://postgres:***@localhost:2828/datacamp
2 rows affected.


country,average_budget
South Korea,1383960000.0
Hungary,1260000000.0


# 9. All together now
### Exercises
It's time to use much of what you've learned in one query! This is good preparation for using SQL in the real world, where you'll often be asked to write more complex queries, since basic questions can often be answered in a spreadsheet application.

In this exercise, you'll write a query that returns the average budget and gross earnings for films each year after 1990 if the average budget is greater than 60 million.

This will be a big query, but you can handle it!
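
Before building the query step by step, it may help to see how the clauses fit together. The sketch below assumes Python's standard-library `sqlite3` module as a stand-in for PostgreSQL, with hypothetical rows and a threshold of 60 instead of 60 million. The clauses are written in the order WHERE, GROUP BY, HAVING, ORDER BY, LIMIT: rows are filtered, then grouped, then groups are filtered, and only then are the results sorted and trimmed.

```python
import sqlite3

# In-memory SQLite database: an assumed stand-in for the PostgreSQL setup above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE films (release_year INTEGER, budget INTEGER, gross INTEGER)")
db.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [  # hypothetical rows
        (1989, 100, 500),  # dropped by WHERE (not after 1990)
        (1995, 50, 80),
        (1995, 90, 120),
        (2001, 80, 60),
        (2001, 100, 40),
        (2003, 10, 300),   # its group is dropped by HAVING (average budget 10)
    ],
)

base_query = """
    SELECT release_year, AVG(budget) AS avg_budget, AVG(gross) AS avg_gross
    FROM films
    WHERE release_year > 1990
    GROUP BY release_year
    HAVING AVG(budget) > 60
    ORDER BY avg_gross DESC
"""

# All years that survive both filters, highest average gross first.
surviving_years = db.execute(base_query).fetchall()

# LIMIT 1 keeps only the top row of that sorted result.
top_year = db.execute(base_query + " LIMIT 1").fetchone()

print(surviving_years)  # [(1995, 70.0, 100.0), (2001, 90.0, 50.0)]
print(top_year)         # (1995, 70.0, 100.0)
db.close()
```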

### task 1
### Instruction
Select the release_year for each film in the films table, filter for records released after 1990, and group by release_year.

In [18]:
%%sql

-- Select the release_year for films released after 1990, grouped by year
SELECT release_year
FROM cinema.films
WHERE release_year > 1990
GROUP BY release_year
LIMIT 3; -- added only to keep the output short

 * postgresql://postgres:***@localhost:2828/datacamp
3 rows affected.


release_year
1991
2009
2013


### task 2
### Instruction
Modify the query to include the average budget aliased as avg_budget and average gross aliased as avg_gross for the results we have so far.

In [19]:
%%sql

-- Modify the query to also list the average budget and average gross
SELECT release_year, AVG(budget) as avg_budget, AVG(gross) as avg_gross
FROM cinema.films
WHERE release_year > 1990
GROUP BY release_year
LIMIT 3; -- added only to keep the output short

 * postgresql://postgres:***@localhost:2828/datacamp
3 rows affected.


release_year,avg_budget,avg_gross
1991,25176548.387096774,53844501.666666664
2009,37073287.03703704,46207440.2
2013,40519044.91549296,56158357.77540106


### task 3
### Instruction
Modify the query once more so that only years with an average budget of greater than 60 million are included.

In [20]:
%%sql

SELECT release_year, AVG(budget) AS avg_budget, AVG(gross) AS avg_gross
FROM cinema.films
WHERE release_year > 1990
GROUP BY release_year
-- Modify the query to see only years with an avg_budget of more than 60 million
HAVING AVG(budget) > 60000000;

 * postgresql://postgres:***@localhost:2828/datacamp
2 rows affected.


release_year,avg_budget,avg_gross
2006,93968929.5774648,39237855.953703694
2005,70323938.23152709,41159143.2906404


### task 4
### Instruction
Finally, order the results from highest to lowest average gross and limit them to one row.

In [21]:
%%sql

SELECT release_year, AVG(budget) AS avg_budget, AVG(gross) AS avg_gross
FROM cinema.films
WHERE release_year > 1990
GROUP BY release_year
HAVING AVG(budget) > 60000000
-- Order the results from highest to lowest average gross and limit to one
ORDER BY avg_gross DESC
LIMIT 1;

 * postgresql://postgres:***@localhost:2828/datacamp
1 rows affected.


release_year,avg_budget,avg_gross
2005,70323938.23152709,41159143.2906404
