# SQL: Aggregation Queries

## Setup

We first install the packages needed to connect to the MySQL database and issue SQL queries from the notebook.

In [1]:
!sudo apt-get install python3-mysqldb
!sudo pip3 install -U sqlalchemy sql_magic

Reading package lists... Done
Building dependency tree       
Reading state information... Done
python3-mysqldb is already the newest version (1.3.10-1build1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Requirement already up-to-date: sqlalchemy in /usr/local/lib/python3.6/dist-packages (1.3.16)
Requirement already up-to-date: sql_magic in /usr/local/lib/python3.6/dist-packages (0.0.4)




In [2]:
%reload_ext sql_magic

In [3]:
from sqlalchemy import create_engine

conn_string = 'mysql://{user}:{password}@{host}/?charset=utf8'.format(
    host='db.ipeirotis.org',
    user='student',
    password='dwdstudent2015')
engine = create_engine(conn_string)

In [4]:
%config SQL.conn_name = 'engine'

## Basic aggregation functions


#### Switch to IMDb

In [5]:
%%read_sql
USE imdb

Query started at 03:20:29 AM UTC; Query executed in 0.01 m

<sql_magic.exceptions.EmptyResult at 0x7fbaaf7f8048>

### `COUNT(*)`

#### Find the number of movies in the database


In [6]:
%%read_sql
SELECT COUNT(*) AS num_movies
FROM movies

Query started at 03:20:29 AM UTC; Query executed in 0.01 m

Unnamed: 0,num_movies
0,388269


#### Find the number of actors in the database


In [7]:
%%read_sql
SELECT COUNT(*) AS num_actors
FROM actors

Query started at 03:20:30 AM UTC; Query executed in 0.01 m

Unnamed: 0,num_actors
0,817718


### `COUNT(attr)`


#### Find the number of movies with a rating



In [8]:
%%read_sql
SELECT COUNT(rank) AS rated_movies
FROM movies

Query started at 03:20:30 AM UTC; Query executed in 0.00 m

Unnamed: 0,rated_movies
0,67245


#### Find the number of roles where the role is not empty

In [9]:
%%read_sql
SELECT COUNT(role) AS named_roles
FROM roles

Query started at 03:20:30 AM UTC; Query executed in 0.05 m

Unnamed: 0,named_roles
0,2511546


In [10]:
%%read_sql
SELECT COUNT(*) AS named_roles
FROM roles
WHERE role IS NOT NULL

Query started at 03:20:34 AM UTC; Query executed in 0.02 m

Unnamed: 0,named_roles
0,2511546
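
The two queries above agree because `COUNT(attr)` skips `NULL` values. As a minimal sketch of that behavior, the snippet below uses an in-memory SQLite database rather than the course's MySQL server; the toy `roles` rows are invented for illustration. It also shows that an empty string is not `NULL`, so it is still counted unless filtered explicitly.

```python
import sqlite3

# Toy stand-in for the IMDb `roles` table (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE roles (actor_id INTEGER, movie_id INTEGER, role TEXT)")
conn.executemany(
    "INSERT INTO roles VALUES (?, ?, ?)",
    [(1, 10, "Hamlet"), (2, 10, None), (3, 11, ""), (4, 11, "Narrator")],
)

# COUNT(*) counts rows; COUNT(role) counts rows where role is not NULL.
total_rows, non_null_roles = conn.execute(
    "SELECT COUNT(*), COUNT(role) FROM roles"
).fetchone()

# Same count as COUNT(role), written with an explicit filter.
(filtered_rows,) = conn.execute(
    "SELECT COUNT(*) FROM roles WHERE role IS NOT NULL"
).fetchone()

# An empty string is a value, not NULL, so it needs its own condition.
(non_empty_roles,) = conn.execute(
    "SELECT COUNT(*) FROM roles WHERE role IS NOT NULL AND role <> ''"
).fetchone()

print(total_rows, non_null_roles, filtered_rows, non_empty_roles)
```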


### `COUNT(DISTINCT attr)`



#### Find the number of distinct genres in the database


In [11]:
%%read_sql
SELECT COUNT(DISTINCT genre) AS num_genres
FROM movies_genres

Query started at 03:20:35 AM UTC; Query executed in 0.01 m

Unnamed: 0,num_genres
0,21


#### Find the number of movies that have a genre associated with them

In [12]:
%%read_sql
SELECT COUNT(DISTINCT movie_id) AS num_movies
FROM movies_genres

Query started at 03:20:35 AM UTC; Query executed in 0.00 m

Unnamed: 0,num_movies
0,269990


Compare the query above with the (incorrect!) query below, which omits `DISTINCT`. Without `DISTINCT`, a `movie_id` is counted once for every genre row it appears in, so the result (417784) is larger than the total number of movies in the database (388269).

In [13]:
%%read_sql
SELECT COUNT(movie_id)
FROM movies_genres

Query started at 03:20:35 AM UTC; Query executed in 0.00 m

Unnamed: 0,COUNT(movie_id)
0,417784
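
The overcount can be reproduced on a few rows. This is a sketch on an in-memory SQLite database, not the course's MySQL server, and the `movies_genres` rows are made up: movie 1 has two genres and movie 2 has a duplicated genre row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies_genres (movie_id INTEGER, genre TEXT)")
conn.executemany(
    "INSERT INTO movies_genres VALUES (?, ?)",
    [(1, "Drama"), (1, "Comedy"), (2, "Drama"), (2, "Drama"), (3, "Short")],
)

# COUNT(movie_id) counts every non-NULL row; DISTINCT collapses repeats first.
row_count, movie_count, genre_count = conn.execute(
    """
    SELECT COUNT(movie_id),
           COUNT(DISTINCT movie_id),
           COUNT(DISTINCT genre)
    FROM movies_genres
    """
).fetchone()

print(row_count, movie_count, genre_count)
```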


### `MIN(attr)`, `MAX(attr)`, `AVG(attr)`, `STDDEV(attr)`, `SUM(attr)`



#### Find the earliest release year and the latest release year for movies


In [14]:
%%read_sql
SELECT 
    MAX(year) AS max_year, 
    MIN(year) AS min_year
FROM movies

Query started at 03:20:36 AM UTC; Query executed in 0.00 m

Unnamed: 0,max_year,min_year
0,2008,1888


#### Find the average rating of the movies and the standard deviation

In [15]:
%%read_sql
SELECT 
    MAX(rank) AS max_rank, 
    MIN(rank) AS min_rank, 
    AVG(rank) AS avg_rank, 
    STDDEV(rank) AS stdev_rank
FROM movies

Query started at 03:20:36 AM UTC; Query executed in 0.00 m

Unnamed: 0,max_rank,min_rank,avg_rank,stdev_rank
0,9.9,1.0,5.874239,1.6227
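
Like `COUNT(attr)`, these aggregates ignore `NULL` values, so `AVG(rank)` averages only the rated movies. The sketch below illustrates this on an in-memory SQLite database; the column is named `rating` here as a stand-in for the notebook's `rank`, and `STDDEV` is left out because SQLite does not provide it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [(1, 8.0), (2, 6.0), (3, None), (4, None)],
)

# AVG divides by the number of non-NULL ratings (2), not by the row count (4).
min_rating, max_rating, avg_rating, sum_over_rows = conn.execute(
    """
    SELECT MIN(rating),
           MAX(rating),
           AVG(rating),
           SUM(rating) / COUNT(*)
    FROM movies
    """
).fetchone()

print(min_rating, max_rating, avg_rating, sum_over_rows)
```

Dividing the sum by `COUNT(*)` would treat unrated movies as if they had a rating of zero, which is rarely what is wanted.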


## `GROUP BY`, Examples on IMDb

#### Switch to IMDb

In [16]:
%%read_sql
USE imdb

Query started at 03:20:36 AM UTC; Query executed in 0.00 m

<sql_magic.exceptions.EmptyResult at 0x7fbaaf3442e8>

#### Count the number of movies that were released in each year

In [17]:
%%read_sql
SELECT year, COUNT(*) AS num_movies
FROM movies
GROUP BY year

Query started at 03:20:36 AM UTC; Query executed in 0.00 m

Unnamed: 0,year,num_movies
0,1888,2
1,1890,3
2,1891,6
3,1892,9
4,1893,2
...,...,...
115,2004,8718
116,2005,1449
117,2006,195
118,2007,7


#### Compute the average rank for the movies released in each year



In [18]:
%%read_sql
SELECT year, AVG(rank) AS avg_rank
FROM movies
GROUP BY year

Query started at 03:20:36 AM UTC; Query executed in 0.00 m

Unnamed: 0,year,avg_rank
0,1888,
1,1890,7.300000
2,1891,3.683333
3,1892,2.866667
4,1893,6.800000
...,...,...
115,2004,6.217399
116,2005,
117,2006,
118,2007,


#### Compute the max, min, average, and standard deviation of the movie ratings in each year


In [19]:
%%read_sql
SELECT year, 
    MAX(rank) AS max_rank, 
    MIN(rank) AS min_rank, 
    AVG(rank) AS avg_rank, 
    STDDEV(rank) AS stdev_rank
FROM movies
GROUP BY year

Query started at 03:20:37 AM UTC; Query executed in 0.00 m

Unnamed: 0,year,max_rank,min_rank,avg_rank,stdev_rank
0,1888,,,,
1,1890,7.3,7.3,7.300000,0.000000
2,1891,4.3,3.2,3.683333,0.362476
3,1892,5.1,1.4,2.866667,1.156623
4,1893,6.8,6.8,6.800000,0.000000
...,...,...,...,...,...
115,2004,9.9,1.0,6.217399,1.810537
116,2005,,,,
117,2006,,,,
118,2007,,,,


#### Examine the difference between `COUNT(*)` and `COUNT(rank)` when reporting movies per year

In [20]:
%%read_sql
SELECT year, 
    COUNT(*) AS num_movies,
    COUNT(rank) AS rated_movies,
    MAX(rank) AS max_rank, 
    MIN(rank) AS min_rank, 
    AVG(rank) AS avg_rank, 
    STDDEV(rank) AS stdev_rank
FROM movies
GROUP BY year

Query started at 03:20:37 AM UTC; Query executed in 0.00 m

Unnamed: 0,year,num_movies,rated_movies,max_rank,min_rank,avg_rank,stdev_rank
0,1888,2,0,,,,
1,1890,3,1,7.3,7.3,7.300000,0.000000
2,1891,6,6,4.3,3.2,3.683333,0.362476
3,1892,9,9,5.1,1.4,2.866667,1.156623
4,1893,2,1,6.8,6.8,6.800000,0.000000
...,...,...,...,...,...,...,...
115,2004,8718,1138,9.9,1.0,6.217399,1.810537
116,2005,1449,0,,,,
117,2006,195,0,,,,
118,2007,7,0,,,,


In [21]:
%%read_sql
SELECT year, 
    COUNT(*) AS num_movies,
    COUNT(rank) AS rated_movies,
    MAX(rank) AS max_rank, 
    MIN(rank) AS min_rank, 
    ROUND(AVG(rank),2) AS avg_rank, 
    ROUND(STDDEV(rank),2) AS stdev_rank
FROM movies
GROUP BY year

Query started at 03:20:37 AM UTC; Query executed in 0.00 m

Unnamed: 0,year,num_movies,rated_movies,max_rank,min_rank,avg_rank,stdev_rank
0,1888,2,0,,,,
1,1890,3,1,7.3,7.3,7.30,0.00
2,1891,6,6,4.3,3.2,3.68,0.36
3,1892,9,9,5.1,1.4,2.87,1.16
4,1893,2,1,6.8,6.8,6.80,0.00
...,...,...,...,...,...,...,...
115,2004,8718,1138,9.9,1.0,6.22,1.81
116,2005,1449,0,,,,
117,2006,195,0,,,,
118,2007,7,0,,,,
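
The empty cells in the outputs above come from years in which no movie has a rating: every aggregate except `COUNT` returns `NULL` for such a group. A small sketch of the same pattern, using an in-memory SQLite database with invented rows and a `rating` column standing in for `rank`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (year INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [(1888, None), (1888, None), (1890, 7.3), (1890, None), (1891, 4.0), (1891, 3.0)],
)

# One output row per year; the aggregates are computed within each group.
per_year = conn.execute(
    """
    SELECT year,
           COUNT(*)      AS num_movies,
           COUNT(rating) AS rated_movies,
           AVG(rating)   AS avg_rating
    FROM movies
    GROUP BY year
    ORDER BY year
    """
).fetchall()

for row in per_year:
    print(row)
```

In Python the SQL `NULL` average for 1888 arrives as `None`, which pandas then displays as an empty cell or `NaN`.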


#### Compute the number of movies per director ID
List the directors with the most movies first




In [22]:
%%read_sql
SELECT director_id, 
    COUNT(*) AS num_movies
FROM movies_directors
GROUP BY director_id
ORDER BY num_movies DESC

Query started at 03:20:38 AM UTC; Query executed in 0.01 m

Unnamed: 0,director_id,num_movies
0,25116,619
1,56530,562
2,30570,536
3,9277,370
4,1958,360
...,...,...
88599,78239,1
88600,50807,1
88601,54909,1
88602,34384,1


#### Compute the number of movies per actor ID
List the actors with the most movies first

In [23]:
%%read_sql
SELECT actor_id, 
    COUNT(*) AS num_movies
FROM roles
GROUP BY actor_id
ORDER BY num_movies DESC

Query started at 03:20:38 AM UTC; Query executed in 0.05 m

Unnamed: 0,actor_id,num_movies
0,45332,909
1,621468,672
2,283127,549
3,41669,544
4,89951,544
...,...,...
817713,332792,1
817714,4221,1
817715,340925,1
817716,216824,1


#### Compute the number of actors per movie ID
List the movies with the most actors first

In [24]:
%%read_sql
SELECT movie_id, 
    COUNT(*) AS num_roles,
    COUNT(DISTINCT actor_id) AS num_actors
FROM roles
GROUP BY movie_id
ORDER BY num_actors DESC

Query started at 03:20:41 AM UTC; Query executed in 0.27 m

Unnamed: 0,movie_id,num_roles,num_actors
0,20625,1274,1274
1,389858,1083,1083
2,385299,907,907
3,385824,747,747
4,380391,680,680
...,...,...,...
300247,287605,1,1
300248,294830,1,1
300249,226739,1,1
300250,302360,1,1


#### Count the number of male actors and the number of female actors

In [25]:
%%read_sql
SELECT gender, COUNT(*) 
FROM actors
GROUP BY gender

Query started at 03:20:58 AM UTC; Query executed in 0.02 m

Unnamed: 0,gender,COUNT(*)
0,F,304412
1,M,513306


#### Compute the number of movies for each genre



In [26]:
%%read_sql
SELECT genre, COUNT(DISTINCT movie_id), COUNT(movie_id)
FROM movies_genres
GROUP BY genre

Query started at 03:20:59 AM UTC; Query executed in 0.12 m

Unnamed: 0,genre,COUNT(DISTINCT movie_id),COUNT(movie_id)
0,Action,14865,14885
1,Adult,20666,20667
2,Adventure,8976,8992
3,Animation,17879,17888
4,Comedy,57829,57860
5,Crime,12929,12966
6,Documentary,42308,42320
7,Drama,74510,74615
8,Family,11221,11232
9,Fantasy,5217,5223


## `GROUP BY`, Examples on Facebook

#### Switch to Facebook

In [27]:
%%read_sql
USE facebook

Query started at 03:21:06 AM UTC; Query executed in 0.00 m

<sql_magic.exceptions.EmptyResult at 0x7fbaaf369160>

#### List the number of males and females


In [28]:
%%read_sql
SELECT Sex, COUNT(*) AS cnt
FROM Profiles
GROUP BY Sex

Query started at 03:21:06 AM UTC; Query executed in 0.00 m

Unnamed: 0,Sex,cnt
0,,4498
1,Female,12311
2,Male,8975


#### List the number of students for each political view

In [29]:
%%read_sql
SELECT PoliticalViews, COUNT(*) AS cnt
FROM Profiles
GROUP BY PoliticalViews

Query started at 03:21:06 AM UTC; Query executed in 0.00 m

Unnamed: 0,PoliticalViews,cnt
0,,11091
1,Apathetic,805
2,Conservative,936
3,Liberal,6461
4,Libertarian,325
5,Moderate,2898
6,Other,824
7,Very Conservative,167
8,Very Liberal,2277


#### List the number of male and female students for each political view

In [30]:
%%read_sql
SELECT Sex, PoliticalViews, COUNT(*) AS cnt
FROM Profiles
GROUP BY Sex, PoliticalViews

Query started at 03:21:06 AM UTC; Query executed in 0.00 m

Unnamed: 0,Sex,PoliticalViews,cnt
0,,,3942
1,,Apathetic,34
2,,Conservative,34
3,,Liberal,211
4,,Libertarian,16
5,,Moderate,75
6,,Other,62
7,,Very Conservative,22
8,,Very Liberal,102
9,Female,,4283


In [31]:
%%read_sql
SELECT Sex, PoliticalViews, COUNT(*) AS cnt
FROM Profiles
WHERE Sex IS NOT NULL AND PoliticalViews IS NOT NULL
GROUP BY Sex, PoliticalViews

Query started at 03:21:06 AM UTC; Query executed in 0.00 m

Unnamed: 0,Sex,PoliticalViews,cnt
0,Female,Apathetic,309
1,Female,Conservative,428
2,Female,Liberal,4054
3,Female,Libertarian,113
4,Female,Moderate,1444
5,Female,Other,280
6,Female,Very Conservative,38
7,Female,Very Liberal,1362
8,Male,Apathetic,462
9,Male,Conservative,474


#### Find the most popular TV Shows and Books

In [32]:
%%read_sql
SELECT Book, COUNT(*) AS cnt
FROM FavoriteBooks
GROUP BY Book
ORDER BY cnt DESC
LIMIT 25

Query started at 03:21:06 AM UTC; Query executed in 0.01 m

Unnamed: 0,Book,cnt
0,Harry Potter,1320
1,Catcher In The Rye,1079
2,The Great Gatsby,963
3,1984,725
4,Pride And Prejudice,602
5,To Kill A Mockingbird,577
6,Catch 22,560
7,Angels And Demons,520
8,Memoirs Of A Geisha,463
9,The Da Vinci Code,445


In [33]:
%%read_sql
SELECT TVShow, COUNT(*) AS cnt
FROM FavoriteTVShows
GROUP BY TVShow
ORDER BY cnt DESC
LIMIT 25

Query started at 03:21:07 AM UTC; Query executed in 0.00 m

Unnamed: 0,TVShow,cnt
0,Family Guy,1146
1,Sex And The City,649
2,Lost,640
3,Arrested Development,610
4,Grey s Anatomy,575
5,Friends,543
6,Seinfeld,520
7,Desperate Housewives,457
8,24,388
9,Curb Your Enthusiasm,353


#### Find the number of students in various relationship statuses

In [34]:
%%read_sql
SELECT Status, COUNT(*) AS cnt
FROM Relationship
GROUP BY Status

Query started at 03:21:07 AM UTC; Query executed in 0.00 m

Unnamed: 0,Status,cnt
0,Engaged,1
1,In a Relationship,4851
2,In an Open Relationship,565
3,It's complicated,17
4,Married,2337
5,Single,7872


#### Find the most popular majors (concentration)

In [35]:
%%read_sql
SELECT Concentration, COUNT(*) AS cnt
FROM Concentration
GROUP BY Concentration
ORDER BY cnt DESC

Query started at 03:21:07 AM UTC; Query executed in 0.00 m

Unnamed: 0,Concentration,cnt
0,Finance,1810
1,Psychology,1571
2,Economics,1533
3,Journalism and Mass Communication,1267
4,Politics,1196
...,...,...
134,Education (minor only; through School of Educa...,1
135,Slavic Studies,1
136,German and Linguistics (major only),1
137,Ancient Studies (minor only),1


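#### List the number of students per birth year
Use the `YEAR(date)` function to extract the year from a date column. Then try to list only the years that have at least 10 students. A `WHERE` clause cannot do this, because the count is known only after grouping; this motivates `HAVING` in the next section.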

In [36]:
%%read_sql
SELECT YEAR(Birthday) AS YoB, COUNT(*) AS cnt
FROM Profiles
WHERE Birthday IS NOT NULL
GROUP BY YoB
ORDER BY cnt DESC

Query started at 03:21:07 AM UTC; Query executed in 0.00 m

Unnamed: 0,YoB,cnt
0,1986,3735
1,1985,3525
2,1984,2985
3,1987,2870
4,1983,2601
...,...,...
59,1923,1
60,1908,1
61,1929,1
62,1903,1
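
Restricting the output to years with enough students requires a condition on the aggregate itself, which is what `HAVING` provides. The following is a sketch under several assumptions: it runs on an in-memory SQLite database, so `strftime('%Y', ...)` replaces MySQL's `YEAR(...)`; the birthdays are invented; and the threshold is lowered from 10 to 2 to fit the toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Profiles (Birthday TEXT)")
conn.executemany(
    "INSERT INTO Profiles VALUES (?)",
    [
        ("1986-01-01",), ("1986-05-02",), ("1986-07-03",),
        ("1985-02-02",), ("1985-03-03",),
        ("1903-01-01",),
        (None,),
    ],
)

min_students = 2  # hypothetical threshold; the exercise asks for 10

# HAVING is evaluated after GROUP BY, so it can test COUNT(*).
popular_years = conn.execute(
    """
    SELECT CAST(strftime('%Y', Birthday) AS INTEGER) AS YoB,
           COUNT(*) AS cnt
    FROM Profiles
    WHERE Birthday IS NOT NULL
    GROUP BY YoB
    HAVING COUNT(*) >= ?
    ORDER BY cnt DESC
    """,
    (min_students,),
).fetchall()

print(popular_years)
```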


## `HAVING`

#### Switch to IMDb

In [37]:
%%read_sql
USE imdb;

Query started at 03:21:07 AM UTC; Query executed in 0.00 m

<sql_magic.exceptions.EmptyResult at 0x7fbaa40e6630>

#### Find the movies (just movie IDs) with more than 100 actors



In [38]:
%%read_sql
SELECT movie_id, 
    COUNT(*) AS num_roles,
    COUNT(DISTINCT actor_id) AS num_actors
FROM roles
GROUP BY movie_id
HAVING num_roles>100
ORDER BY num_actors DESC

Query started at 03:21:07 AM UTC; Query executed in 0.19 m

Unnamed: 0,movie_id,num_roles,num_actors
0,20625,1274,1274
1,389858,1083,1083
2,385299,907,907
3,385824,747,747
4,380391,680,680
...,...,...,...
544,381356,101,101
545,384288,107,97
546,388419,153,84
547,405500,112,74


In [39]:
%%read_sql
SELECT movie_id, 
    COUNT(*) AS num_roles,
    COUNT(DISTINCT actor_id) AS num_actors
FROM roles
GROUP BY movie_id
HAVING num_actors>100
ORDER BY num_actors DESC

Query started at 03:21:19 AM UTC; Query executed in 0.18 m

Unnamed: 0,movie_id,num_roles,num_actors
0,20625,1274,1274
1,389858,1083,1083
2,385299,907,907
3,385824,747,747
4,380391,680,680
...,...,...,...
540,388209,101,101
541,119803,101,101
542,363560,101,101
543,192100,101,101
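
The two queries above differ only in the `HAVING` condition. The first filters on `num_roles` and therefore also returns movies such as 384288, which has 107 roles but only 97 distinct actors; the second filters on `num_actors`, which is what the question asks for. A minimal sketch of the difference, on an in-memory SQLite database with invented rows and a threshold of 2 instead of 100:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE roles (actor_id INTEGER, movie_id INTEGER, role TEXT)")
conn.executemany(
    "INSERT INTO roles VALUES (?, ?, ?)",
    [
        # Movie 1: three actors, one role each.
        (1, 1, "A"), (2, 1, "B"), (3, 1, "C"),
        # Movie 2: actor 1 plays three roles, actor 2 plays one.
        (1, 2, "X"), (1, 2, "Y"), (1, 2, "Z"), (2, 2, "W"),
    ],
)

query = """
    SELECT movie_id
    FROM roles
    GROUP BY movie_id
    HAVING {condition} > 2
    ORDER BY movie_id
"""

# Filtering on the number of role rows keeps both movies.
by_roles = conn.execute(query.format(condition="COUNT(*)")).fetchall()

# Filtering on distinct actors keeps only movie 1.
by_actors = conn.execute(
    query.format(condition="COUNT(DISTINCT actor_id)")
).fetchall()

print(by_roles, by_actors)
```

The aggregate expressions are written out in `HAVING` here; the notebook's alias form (`HAVING num_actors>100`) is a convenience that MySQL accepts.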


#### Find the first names of actors that appear more than 1000 times

In [40]:
%%read_sql
SELECT first_name, COUNT(*) AS cnt
FROM actors
GROUP BY first_name
HAVING cnt>1000

Query started at 03:21:30 AM UTC; Query executed in 0.09 m

Unnamed: 0,first_name,cnt
0,A.,1123
1,Alex,1113
2,Andrew,1091
3,Anna,1612
4,Anne,1091
...,...,...
56,Thomas,1354
57,Tom,1758
58,Tony,1653
59,Walter,1085


#### Find all the movie IDs for movies that have more roles than actors (i.e., the same actor plays multiple roles in the movie)

In [41]:
%%read_sql
SELECT movie_id, 
    COUNT(*) AS num_roles,
    COUNT(DISTINCT actor_id) AS num_actors
FROM roles
GROUP BY movie_id
HAVING num_roles<>num_actors
ORDER BY num_actors DESC

Query started at 03:21:35 AM UTC; Query executed in 0.19 m

Unnamed: 0,movie_id,num_roles,num_actors
0,317309,397,396
1,2252,276,275
2,411420,243,242
3,387120,212,178
4,315678,178,177
...,...,...,...
275,388594,3,2
276,394824,4,2
277,408978,3,2
278,183995,2,1


#### Find all the actor IDs for actors that have more roles than movies (i.e., the actor plays multiple roles in at least one movie)

In [42]:
%%read_sql
SELECT actor_id, 
    COUNT(*) AS num_roles,
    COUNT(DISTINCT movie_id) AS num_movies
FROM roles
GROUP BY actor_id
HAVING num_roles<>num_movies
ORDER BY num_movies DESC

Query started at 03:21:46 AM UTC; Query executed in 0.16 m

Unnamed: 0,actor_id,num_roles,num_movies
0,506067,422,421
1,228392,317,307
2,352778,287,286
3,159402,254,253
4,707739,251,250
...,...,...,...
604,603466,2,1
605,644804,3,1
606,731051,3,1
607,766543,2,1


#### Find data quality issues: In the `movies_genres` table, the same movie ID may be associated multiple times with the same genre. Identify these cases.

In [43]:
%%read_sql
SELECT movie_id, genre, COUNT(*) AS cnt
FROM movies_genres
GROUP BY movie_id, genre 
HAVING cnt>1
ORDER BY cnt DESC

Query started at 03:21:56 AM UTC; Query executed in 0.03 m

Unnamed: 0,movie_id,genre,cnt
0,146416,Short,7
1,264009,Documentary,6
2,264009,Short,6
3,37131,Documentary,3
4,7987,Drama,2
...,...,...,...
320,363796,Crime,2
321,367253,Drama,2
322,367282,Horror,2
323,373383,Animation,2


### Compare `WHERE` and `HAVING`


In [44]:
%%read_sql
SELECT COUNT(*), COUNT(rank)
FROM movies


Query started at 03:21:58 AM UTC; Query executed in 0.00 m

Unnamed: 0,COUNT(*),COUNT(rank)
0,388269,67245


In [45]:
%%read_sql
SELECT COUNT(*), COUNT(rank)
FROM movies
WHERE rank IS NOT NULL


Query started at 03:21:58 AM UTC; Query executed in 0.00 m

Unnamed: 0,COUNT(*),COUNT(rank)
0,67245,67245
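
The key difference is the order of evaluation: `WHERE` removes rows before any grouping or aggregation happens, while `HAVING` removes whole groups after the aggregates are computed. The sketch below contrasts the two on an in-memory SQLite database; the rows are invented and `rating` stands in for the notebook's `rank` column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (year INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [(2000, 8.0), (2000, None), (2001, None), (2001, None), (2002, 5.0)],
)

# WHERE: unrated rows are dropped first, so COUNT(*) sees only rated movies.
with_where = conn.execute(
    """
    SELECT year, COUNT(*) AS num_movies
    FROM movies
    WHERE rating IS NOT NULL
    GROUP BY year
    ORDER BY year
    """
).fetchall()

# HAVING: all rows are grouped, then years without any rating are dropped.
with_having = conn.execute(
    """
    SELECT year, COUNT(*) AS num_movies, COUNT(rating) AS rated_movies
    FROM movies
    GROUP BY year
    HAVING COUNT(rating) > 0
    ORDER BY year
    """
).fetchall()

print(with_where)
print(with_having)
```

Both queries skip 2001, but only the `HAVING` version still reports that 2000 has two movies in total.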


## `JOIN` and `GROUP BY` together

#### For each movie genre, list the average rating of the movies released in 2000

Also list:
* the maximum and minimum ratings
* the standard deviation of the ratings
* the number of rated movies and the total number of movies




In [46]:
%%read_sql
SELECT G.genre, 
    MAX(M.rank) AS max_rating,
    MIN(M.rank) AS min_rating,
    ROUND(AVG(M.rank),2) AS avg_rating,
    ROUND(STDDEV(M.rank),2) AS std_rating,
    COUNT(*) AS num_movies,
    COUNT(M.rank) AS rated_movies
FROM movies M
    INNER JOIN movies_genres G ON M.id = G.movie_id
WHERE M.year = 2000
GROUP BY G.genre
ORDER BY avg_rating DESC

Query started at 03:21:58 AM UTC; Query executed in 0.00 m

Unnamed: 0,genre,max_rating,min_rating,avg_rating,std_rating,num_movies,rated_movies
0,Documentary,9.5,1.0,6.91,1.5,1692,218
1,Short,9.8,1.0,6.56,1.61,2595,495
2,Animation,9.5,2.2,6.51,1.38,569,119
3,Music,8.8,2.5,6.51,1.47,274,24
4,War,8.7,2.3,6.44,1.46,62,27
5,Romance,8.9,1.0,6.2,1.27,423,233
6,Drama,9.6,1.0,6.09,1.44,1869,941
7,Musical,9.0,2.6,6.07,1.43,112,41
8,Mystery,9.3,1.0,5.96,1.69,138,58
9,Fantasy,9.8,1.4,5.94,1.87,236,76


#### For each director, compute:
* The number of rated and total number of movies
* The average, min, max, and standard deviation of the movie ratings
* Limit the results to directors who directed more than 40 movies, with more than 30 rated movies




In [47]:
%%read_sql
SELECT D.*,
    COUNT(*) AS num_movies,
    COUNT(M.rank) AS rated_movies,
    MAX(M.rank) AS max_rating,
    MIN(M.rank) AS min_rating,
    ROUND(AVG(M.rank),2) AS avg_rating,
    ROUND(STDDEV(M.rank),2) AS std_rating
FROM directors D
    JOIN movies_directors MD ON D.id = MD.director_id
    JOIN movies M ON M.id = MD.movie_id
GROUP BY 
    D.id
HAVING 
    num_movies>40
    AND rated_movies>30
ORDER BY 
    avg_rating DESC

Query started at 03:21:58 AM UTC; Query executed in 0.03 m

Unnamed: 0,id,first_name,last_name,num_movies,rated_movies,max_rating,min_rating,avg_rating,std_rating
0,68416,Carole,Roussopoulos,45,42,9.9,6.2,8.82,0.79
1,75143,Lionel,Soukaz,50,41,9.8,1.0,8.35,1.33
2,52215,Robert F.,McGowan,110,43,9.5,5.8,7.93,0.81
3,60041,Yasujiro,Ozu,54,34,9.0,5.0,7.62,0.83
4,69458,Alekos,Sakellarios,52,35,9.2,3.9,7.49,1.24
...,...,...,...,...,...,...,...,...,...
178,34703,Godfrey,Ho,82,48,7.3,1.0,3.66,1.64
179,86830,Jim,Wynorski,52,34,4.5,1.9,3.51,0.61
180,60036,Mariano,Ozores hijo,100,37,6.5,1.7,3.42,1.26
181,19037,David,DeCoteau,49,36,8.5,1.3,3.31,1.48
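
The same shape of query (join the tables, group by the entity of interest, then filter groups with `HAVING`) can be tried on a few rows. This is a sketch on an in-memory SQLite database, not the IMDb data: the directors and ratings are invented, `rating` stands in for `rank`, and the thresholds are lowered to 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE directors (id INTEGER, last_name TEXT);
    CREATE TABLE movies (id INTEGER, rating REAL);
    CREATE TABLE movies_directors (director_id INTEGER, movie_id INTEGER);

    INSERT INTO directors VALUES (1, 'Director A'), (2, 'Director B');
    INSERT INTO movies VALUES (1, 8.0), (2, 6.0), (3, NULL), (4, 9.0);
    INSERT INTO movies_directors VALUES (1, 1), (1, 2), (1, 3), (2, 4);
    """
)

# After the joins there is one row per (director, movie) pair, so
# COUNT(*) counts movies and COUNT(M.rating) counts the rated ones.
rows = conn.execute(
    """
    SELECT D.id, D.last_name,
           COUNT(*)        AS num_movies,
           COUNT(M.rating) AS rated_movies,
           AVG(M.rating)   AS avg_rating
    FROM directors D
        JOIN movies_directors MD ON D.id = MD.director_id
        JOIN movies M ON M.id = MD.movie_id
    GROUP BY D.id, D.last_name
    HAVING COUNT(*) >= 2 AND COUNT(M.rating) >= 2
    ORDER BY avg_rating DESC
    """
).fetchall()

print(rows)
```

Listing `D.last_name` in `GROUP BY` keeps the query valid in databases that do not infer, as MySQL does, that the name is determined by `D.id`.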


#### What roles have the best movie ratings? 
* Do not include movies without ratings in the calculations for number of movies
* Limit to only roles that appear in at least 10 distinct movies
* Limit to only roles played by at least 10 distinct actors

In [49]:
%%read_sql
SELECT R.role,
    COUNT(*) AS num_roles,
    COUNT(DISTINCT movie_id) AS num_movies,
    COUNT(DISTINCT actor_id) AS num_actors,
    MAX(M.rank) AS max_rating,
    MIN(M.rank) AS min_rating,
    ROUND(AVG(M.rank),2) AS avg_rating,
    ROUND(STDDEV(M.rank),2) AS std_rating
FROM roles R
    JOIN movies M ON M.id = R.movie_id
WHERE
    M.rank IS NOT NULL
GROUP BY 
    R.role
HAVING
    num_movies>=10
    AND
    num_actors>=10
ORDER BY 
    avg_rating DESC
LIMIT 50

Query started at 03:33:30 AM UTC; Query executed in 1.48 m

Unnamed: 0,role,num_roles,num_movies,num_actors,max_rating,min_rating,avg_rating,std_rating
0,Varvara,10,10,10,8.4,6.4,7.43,0.61
1,Tango Dancer,59,15,59,8.7,2.7,7.42,1.63
2,Churchgoer,55,11,55,8.3,3.4,7.41,1.25
3,The Young Man,11,11,10,9.6,5.2,7.36,1.30
4,Volodya,14,14,13,9.2,4.8,7.32,1.25
...,...,...,...,...,...,...,...,...
7111,Ninja,57,17,56,9.5,1.8,3.80,1.42
7112,Shower Girl,24,13,24,6.6,2.1,3.80,1.30
7113,American soldier,122,31,120,8.7,1.4,3.77,2.38
7114,Eulalia,28,28,12,7.7,2.3,3.72,1.43
