# Advanced SQL - Part A

During this course, we will go further into SQL and learn how to perform calculations and use aggregation functions 💗.

## What you will learn in this course 🧐🧐

- Make calculations using the `COUNT`, `SUM`, `AVG` & `ROUND` functions
- Segment your data into different groups with `GROUP BY`
- Find maxima and minima with `MAX` & `MIN`
- Filter your results with `HAVING`


## Aggregation Functions 🍲

Aggregation functions combine the values of several rows into a single result. For example, if you want to sum or average a certain column in your table, you can use an aggregation function.

The `GROUP BY` clause allows you to segment data into different clusters (or groups) and generate aggregated results for each of these clusters.

For example, `GROUP BY` and aggregation functions are useful when you want to determine the average spend of your customers based on their subscription plan and the city they belong to 🦊.
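
As a rough sketch of that scenario, the query below groups a small, made-up `customers` table by subscription plan and city. The table name, its columns (`subscription_plan`, `city`, `spend`) and its values are assumptions invented for illustration; they are not part of the course dataset.

```sql
-- Hypothetical sample data, defined inline so the query runs on its own.
WITH customers AS (
  SELECT 'basic' AS subscription_plan, 'Paris' AS city, 20.0 AS spend UNION ALL
  SELECT 'basic', 'Paris', 30.0 UNION ALL
  SELECT 'premium', 'Paris', 100.0 UNION ALL
  SELECT 'premium', 'Lyon', 80.0
)
-- One result row per (subscription_plan, city) pair.
SELECT subscription_plan, city, AVG(spend) AS average_spend
FROM customers
GROUP BY subscription_plan, city
ORDER BY subscription_plan, city;
-- basic   | Paris | 25.0
-- premium | Lyon  | 80.0
-- premium | Paris | 100.0
```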

## COUNT

The `COUNT()` function allows you to quickly count rows.

```sql
SELECT COUNT(*)
FROM IMDB.movies ;
```

`COUNT()` takes a column as a parameter and counts the rows in which that column is not `NULL`. It is also possible to count all rows by passing `*` to the function. By adding a condition, for example with `WHERE`, you can extract very useful information from a given dataset.
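
To make the difference between `COUNT(*)` and `COUNT(column)` concrete, here is a minimal sketch on three made-up rows. The `sample_movies` data is hypothetical and defined inline; only the column names mirror the course table.

```sql
-- Hypothetical sample data: one movie has no known duration.
WITH sample_movies AS (
  SELECT 'Movie A' AS movie_title, 120 AS duration UNION ALL
  SELECT 'Movie B', NULL UNION ALL
  SELECT 'Movie C', 95
)
SELECT
  COUNT(*) AS all_rows,                   -- 3: every row is counted
  COUNT(duration) AS rows_with_duration   -- 2: the NULL duration is skipped
FROM sample_movies;
```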

Have you ever watched a movie that looked like it would never come to an end? Use `COUNT()` to see how many movies are longer than *4 hours*:

```sql
SELECT COUNT(*)
FROM IMDB.movies
WHERE duration > 240;
```

In this example, we count all the movies in our table whose duration exceeds *240 minutes* (4 hours).

## GROUP BY

### General Use

`GROUP BY` is one of the most useful clauses when working with aggregation functions, and it is most often used together with them. It is written in a `SELECT` statement, after `FROM` (and after `WHERE`, if there is one).

```sql
SELECT title_year, COUNT(*) AS cnt_movies_per_year
FROM IMDB.movies
GROUP BY title_year
ORDER BY cnt_movies_per_year DESC;
```

In the above example, `COUNT()` is the aggregation function. We grouped the rows by `title_year` and counted the rows in each group: this returns the number of movies released each year.

We can also sort our results by the aggregated value. Here we have sorted the years from the largest number of movies released to the smallest.

**NB1**: the aggregation function in `ORDER BY` is not necessarily the same as the one in the `SELECT` clause.
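
A small sketch of NB1: the query below selects a count per year but sorts the years by their average score. The `sample_movies` rows are made up for illustration and defined inline.

```sql
-- Hypothetical sample data using the course column names.
WITH sample_movies AS (
  SELECT 2009 AS title_year, 7.9 AS imdb_score UNION ALL
  SELECT 2009, 6.1 UNION ALL
  SELECT 2009, 5.5 UNION ALL
  SELECT 2012, 8.5 UNION ALL
  SELECT 2012, 8.1
)
SELECT title_year, COUNT(*) AS cnt_movies_per_year
FROM sample_movies
GROUP BY title_year
-- Sorted by an aggregate that is not in the SELECT list.
ORDER BY AVG(imdb_score) DESC;
-- 2012 | 2   (average score 8.3)
-- 2009 | 3   (average score 6.5)
```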

**NB2**: it is recommended to include the `GROUP BY` column in `SELECT`. Otherwise, you cannot tell which group each result row belongs to.

Even with `GROUP BY`, it is possible to add a `WHERE` condition, as below:

```sql
SELECT genres, COUNT(*)
FROM IMDB.movies
WHERE imdb_score >= 9
GROUP BY genres;
```

This instruction returns the count of all movies in the **IMDB.movies** table that have an imdb score (**imdb_score**) greater than or equal to 9, grouped by **genres**.

**NB**: If you select a column in your `SELECT` that is not aggregated, you will have to put it in your `GROUP BY`.
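
The rule above can be sketched as follows. The `sample_movies` rows are hypothetical and defined inline. The commented-out query shows the kind of statement that BigQuery and many other DBMS reject.

```sql
-- Hypothetical sample data using the course column names.
WITH sample_movies AS (
  SELECT 'Drama' AS genres, 2009 AS title_year UNION ALL
  SELECT 'Drama', 2009 UNION ALL
  SELECT 'Drama', 2012 UNION ALL
  SELECT 'Comedy', 2012
)
-- Rejected by BigQuery and many other DBMS: title_year is selected
-- but is neither aggregated nor listed in GROUP BY.
-- SELECT genres, title_year, COUNT(*) FROM sample_movies GROUP BY genres;

-- Valid: every non-aggregated column also appears in GROUP BY.
SELECT genres, title_year, COUNT(*) AS cnt_movies
FROM sample_movies
GROUP BY genres, title_year
ORDER BY genres, title_year;
-- Comedy | 2012 | 1
-- Drama  | 2009 | 2
-- Drama  | 2012 | 1
```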

### Google BigQuery specific use

In some DBMS, including Google BigQuery, `GROUP BY` can also be used to select unique values and thus avoid duplicates. `DISTINCT` is the more common way to do this.

For clarity, let's take an example:

```sql
SELECT genres FROM IMDB.movies;
```

After executing your query, you will see a list of repeated genres:

![](https://essentials-assets.s3.eu-west-3.amazonaws.com/M01-Data_visualisation/D01-Data_visualisation/01-Data_Visualisation_with_Tableau/Advanced_SQL_Part_A_1.png)

In a DBMS such as Google BigQuery, you can use the `DISTINCT` keyword to avoid this kind of result:

```sql
SELECT DISTINCT genres FROM IMDB.movies;
```

However, some DBMS do not support the `DISTINCT` keyword. In that case, use `GROUP BY`, which gives exactly the same result:

```sql
SELECT genres
FROM IMDB.movies
GROUP BY genres;
```

Results:

![](https://essentials-assets.s3.eu-west-3.amazonaws.com/M01-Data_visualisation/D01-Data_visualisation/01-Data_Visualisation_with_Tableau/Advanced_SQL_Part_A_2.png)

Today, both of these queries can be used in Google BigQuery!

## SUM

```sql
SELECT SUM(gross)
FROM IMDB.movies;
```

`SUM()` takes a column as an argument and returns the sum of all values in that column. In the example above, we add up the gross revenue (`gross`) of every row.

Example: which genre makes the most money?

```sql
SELECT genres, SUM(gross) AS revenue
FROM IMDB.movies
GROUP BY genres
ORDER BY revenue DESC;
```

## MAX

```sql
SELECT MAX(gross) FROM IMDB.movies;
```

`MAX()` also takes a column as an argument and returns the largest value in that column. In the example above, it returns the highest revenue in the IMDB.movies table.

Example: find the revenue of the highest-grossing movie in each genre of the IMDB.movies table. To keep things simple, don't try to display the name of the movie, just the revenue it made:

```sql
SELECT genres, MAX(gross)
FROM IMDB.movies
GROUP BY genres
ORDER BY MAX(gross) DESC;
```

## MIN

```sql
SELECT MIN(imdb_score) FROM IMDB.movies;
```

`MIN()` returns the smallest value in a column.

Example: what is the smallest `imdb_score` in each genre?

```sql
SELECT genres, MIN(imdb_score)
FROM IMDB.movies
GROUP BY genres
ORDER BY MIN(imdb_score) DESC;
```

## AVG

The `AVG()` function takes a column as an argument and returns the average of that column. The column must contain numerical data.

```sql
SELECT AVG(imdb_score)
FROM IMDB.movies;
```

This statement returns the average **imdb_score** of all the movies in the **IMDB.movies** table.

Example: calculate the average `imdb_score` of each genre.

```sql
SELECT genres, AVG(imdb_score)
FROM IMDB.movies
GROUP BY genres
ORDER BY AVG(imdb_score) DESC;
```
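
The aggregation functions seen so far can also be combined in a single query. The sketch below computes several of them per genre on a few made-up rows; the `sample_movies` data is hypothetical and defined inline.

```sql
-- Hypothetical sample data using the course column names.
WITH sample_movies AS (
  SELECT 'Drama' AS genres, 8.0 AS imdb_score UNION ALL
  SELECT 'Drama', 6.0 UNION ALL
  SELECT 'Comedy', 7.0 UNION ALL
  SELECT 'Comedy', 5.0 UNION ALL
  SELECT 'Comedy', 6.0
)
SELECT
  genres,
  COUNT(*) AS cnt_movies,
  MIN(imdb_score) AS lowest_score,
  MAX(imdb_score) AS highest_score,
  AVG(imdb_score) AS average_score
FROM sample_movies
GROUP BY genres
ORDER BY average_score DESC;
-- Drama  | 2 | 6.0 | 8.0 | 7.0
-- Comedy | 3 | 5.0 | 7.0 | 6.0
```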

## HAVING

The `HAVING` clause is used to filter groups based on the result of an aggregation function.

For example, if we want to find the most successful genres, we should exclude genres with 50 movies or fewer to get a more representative average. We can use `HAVING` to keep only the genres that have more than 50 movies. Here is the code:

```sql
SELECT genres, AVG(imdb_score) AS average_score
FROM IMDB.movies
GROUP BY genres
HAVING COUNT(*) > 50
ORDER BY average_score DESC;
```
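
`WHERE` and `HAVING` can appear in the same query: `WHERE` filters individual rows before they are grouped, while `HAVING` filters the groups afterwards. Here is a minimal sketch on made-up rows; the `sample_movies` data and the thresholds are assumptions chosen to keep the example small.

```sql
-- Hypothetical sample data using the course column names.
WITH sample_movies AS (
  SELECT 'Drama' AS genres, 2001 AS title_year, 8.0 AS imdb_score UNION ALL
  SELECT 'Drama', 2005, 7.0 UNION ALL
  SELECT 'Drama', 1995, 9.0 UNION ALL
  SELECT 'Comedy', 2003, 6.0 UNION ALL
  SELECT 'Comedy', 1998, 7.5 UNION ALL
  SELECT 'Horror', 2010, 5.0
)
SELECT genres, COUNT(*) AS cnt_movies, AVG(imdb_score) AS average_score
FROM sample_movies
WHERE title_year >= 2000   -- row filter: drops the 1995 and 1998 movies
GROUP BY genres
HAVING COUNT(*) >= 2       -- group filter: drops genres with a single remaining movie
ORDER BY average_score DESC;
-- Drama | 2 | 7.5
```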

## ROUND

By default, SQL returns results with as many decimal places as the calculation produces. This is not always what you want. For example, you may be counting people (who cannot be divided) or handling currency amounts (which usually have no more than two digits after the decimal point).

`ROUND()` rounds your exact result. This function takes a numeric value (a column or an expression) and an integer as arguments. The integer determines how many decimal places you want to round your number to.

```sql
SELECT genres, ROUND(AVG(imdb_score),1) AS average_score_rounded
FROM IMDB.movies
GROUP BY genres
HAVING COUNT(*) > 50
ORDER BY average_score_rounded DESC;
```
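
To see the effect of the second argument on its own, here is a short sketch on a literal value. The number is arbitrary. Calling `ROUND()` with a single argument, which rounds to the nearest whole number, is an addition not covered above.

```sql
SELECT
  ROUND(7.4637) AS no_decimals,       -- 7
  ROUND(7.4637, 1) AS one_decimal,    -- 7.5
  ROUND(7.4637, 2) AS two_decimals;   -- 7.46
```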

## Subqueries (or nested queries)

A subquery is a query within another SQL query. These subqueries can be used with aggregation functions.

Let's take the *IMDB* database: if you want to see all the movies that are longer than the average movie in the database, you can write this statement:

```sql
SELECT movie_title
FROM IMDB.movies
WHERE duration > (
    SELECT AVG(duration)
    FROM IMDB.movies
)
ORDER BY movie_title;
```
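
The same pattern on three made-up rows makes the intermediate value visible. The `sample_movies` data is hypothetical and defined inline.

```sql
-- Hypothetical sample data using the course column names.
WITH sample_movies AS (
  SELECT 'Movie A' AS movie_title, 90 AS duration UNION ALL
  SELECT 'Movie B', 100 UNION ALL
  SELECT 'Movie C', 200
)
SELECT movie_title, duration
FROM sample_movies
WHERE duration > (
    -- The subquery returns a single value: (90 + 100 + 200) / 3 = 130
    SELECT AVG(duration)
    FROM sample_movies
)
ORDER BY movie_title;
-- Movie C | 200
```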

Subqueries can also replace `JOIN`s. The result of the subquery is then integrated into the main query. Here is the syntax:

```sql
SELECT column_name
FROM table_A
WHERE some_id IN (
    SELECT id
    FROM table_B
    WHERE...
);
```

Let's consider a concrete example and count all the movies whose director has more than 10000 likes on Facebook:

```sql
SELECT COUNT(*)
FROM IMDB.movies
WHERE director_id IN (
    SELECT director_id
    FROM IMDB.directors
    WHERE director_facebook_likes > 10000
);
```
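
For comparison, here is a sketch of the `JOIN` that such a subquery can stand in for. The two inline tables are made up for illustration, and the equivalence assumes that each `director_id` appears only once in the directors table; otherwise the `JOIN` would count some movies several times.

```sql
-- Hypothetical sample data using the course column names.
WITH sample_movies AS (
  SELECT 'Movie A' AS movie_title, 1 AS director_id UNION ALL
  SELECT 'Movie B', 1 UNION ALL
  SELECT 'Movie C', 2
),
sample_directors AS (
  SELECT 1 AS director_id, 25000 AS director_facebook_likes UNION ALL
  SELECT 2, 300
)
SELECT COUNT(*) AS cnt_movies
FROM sample_movies AS m
JOIN sample_directors AS d
  ON m.director_id = d.director_id
WHERE d.director_facebook_likes > 10000;
-- 2 (Movie A and Movie B, both by director 1)
```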

Finally, subqueries can be used in the `FROM` clause. For example, you may want to know the average score of films from the USA:

```sql
SELECT AVG(imdb_score)
FROM (
    SELECT imdb_score
    FROM IMDB.movies
    WHERE Country = 'USA'
);
```

However, it is advisable to avoid subqueries when they are not necessary. In our example, the following query gives the same result:

```sql
SELECT AVG(imdb_score)
FROM IMDB.movies
WHERE Country = 'USA';
```

That said, writing a query as a subquery has a benefit, which we will see in the last part.

## Resources 📚📚

- Aggregate functions - [http://bit.ly/2AvOjHU](http://bit.ly/2AvOjHU)
- Aggregate functions in Google BigQuery - [https://bit.ly/2HnMMXJ](https://bit.ly/2HnMMXJ)
- Subqueries - [https://bit.ly/2HJMImE](https://bit.ly/2HJMImE)