# SQL with Python Reference Guide 9
# Movie-Rating Query Exercises
## (Justin M. Olds)
Based on Stanford SQL course: https://lagunita.stanford.edu/courses/DB/SQL/SelfPaced/info

---


In [2]:
import sqlite3
import pandas as pd

conn = sqlite3.connect("class.db")
c = conn.cursor()

---
### **Tables (Relations)**

* Movie(mID int, title text, year int, director text)
* Reviewer(rID int, name text)
* Rating(rID int, mID int, stars int, ratingDate date)
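
To double-check that a database really contains the relations listed above, the column names can be read back from SQLite itself. The following is a minimal sketch, not part of the original notebook: it assumes an in-memory database in place of `class.db`, and the helper name `column_names` is made up for illustration.

```python
import sqlite3

# Assumption: an in-memory database stands in for class.db.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    create table Movie(mID int, title text, year int, director text);
    create table Reviewer(rID int, name text);
    create table Rating(rID int, mID int, stars int, ratingDate date);
""")

def column_names(connection, table):
    """Return the column names of `table` in declaration order."""
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    rows = connection.execute(f"PRAGMA table_info({table})").fetchall()
    return [row[1] for row in rows]

for table in ("Movie", "Reviewer", "Rating"):
    print(table, column_names(demo, table))
```

The same helper should work against the notebook's `conn` once the tables have been created.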

In [3]:
c.executescript("""
/* Delete the tables if they already exist */
drop table if exists Movie;
drop table if exists Reviewer;
drop table if exists Rating;

/* Create the schema for our tables */
create table Movie(mID int, title text, year int, director text);
create table Reviewer(rID int, name text);
create table Rating(rID int, mID int, stars int, ratingDate date);

/* Populate the tables with our data */
insert into Movie values(101, 'Gone with the Wind', 1939, 'Victor Fleming');
insert into Movie values(102, 'Star Wars', 1977, 'George Lucas');
insert into Movie values(103, 'The Sound of Music', 1965, 'Robert Wise');
insert into Movie values(104, 'E.T.', 1982, 'Steven Spielberg');
insert into Movie values(105, 'Titanic', 1997, 'James Cameron');
insert into Movie values(106, 'Snow White', 1937, null);
insert into Movie values(107, 'Avatar', 2009, 'James Cameron');
insert into Movie values(108, 'Raiders of the Lost Ark', 1981, 'Steven Spielberg');

insert into Reviewer values(201, 'Sarah Martinez');
insert into Reviewer values(202, 'Daniel Lewis');
insert into Reviewer values(203, 'Brittany Harris');
insert into Reviewer values(204, 'Mike Anderson');
insert into Reviewer values(205, 'Chris Jackson');
insert into Reviewer values(206, 'Elizabeth Thomas');
insert into Reviewer values(207, 'James Cameron');
insert into Reviewer values(208, 'Ashley White');

insert into Rating values(201, 101, 2, '2011-01-22');
insert into Rating values(201, 101, 4, '2011-01-27');
insert into Rating values(202, 106, 4, null);
insert into Rating values(203, 103, 2, '2011-01-20');
insert into Rating values(203, 108, 4, '2011-01-12');
insert into Rating values(203, 108, 2, '2011-01-30');
insert into Rating values(204, 101, 3, '2011-01-09');
insert into Rating values(205, 103, 3, '2011-01-27');
insert into Rating values(205, 104, 2, '2011-01-22');
insert into Rating values(205, 108, 4, null);
insert into Rating values(206, 107, 3, '2011-01-15');
insert into Rating values(206, 106, 5, '2011-01-19');
insert into Rating values(207, 107, 5, '2011-01-20');
insert into Rating values(208, 104, 3, '2011-01-02');

""")
conn.commit()

---
### Examine the Movie table to be sure the database was created. 


In [7]:
df = pd.read_sql_query("""
    SELECT *
    FROM Movie
    """, conn);df

Unnamed: 0,mID,title,year,director
0,101,Gone with the Wind,1939,Victor Fleming
1,102,Star Wars,1977,George Lucas
2,103,The Sound of Music,1965,Robert Wise
3,104,E.T.,1982,Steven Spielberg
4,105,Titanic,1997,James Cameron
5,106,Snow White,1937,
6,107,Avatar,2009,James Cameron
7,108,Raiders of the Lost Ark,1981,Steven Spielberg


### Q1: Find the titles of all movies directed by Steven Spielberg. 

In [8]:
df = pd.read_sql_query("""
    SELECT title
    FROM Movie
    WHERE director = 'Steven Spielberg'
    """, conn);df


Unnamed: 0,title
0,E.T.
1,Raiders of the Lost Ark
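
The director name can also be supplied as a bound parameter instead of being written into the SQL text, which avoids quoting problems when the value comes from a variable. The following is a minimal sketch using the `sqlite3` module directly; the in-memory database and the helper name `titles_by_director` are assumptions for illustration.

```python
import sqlite3

# Assumption: an in-memory copy of the Movie table stands in for class.db.
demo = sqlite3.connect(":memory:")
demo.execute("create table Movie(mID int, title text, year int, director text)")
demo.executemany("insert into Movie values (?, ?, ?, ?)", [
    (101, "Gone with the Wind", 1939, "Victor Fleming"),
    (102, "Star Wars", 1977, "George Lucas"),
    (103, "The Sound of Music", 1965, "Robert Wise"),
    (104, "E.T.", 1982, "Steven Spielberg"),
    (105, "Titanic", 1997, "James Cameron"),
    (106, "Snow White", 1937, None),
    (107, "Avatar", 2009, "James Cameron"),
    (108, "Raiders of the Lost Ark", 1981, "Steven Spielberg"),
])

def titles_by_director(connection, director):
    """Return the titles directed by `director`, ordered by mID."""
    # The ? placeholder is filled in by sqlite3, so no manual quoting is needed.
    rows = connection.execute(
        "SELECT title FROM Movie WHERE director = ? ORDER BY mID", (director,)
    )
    return [title for (title,) in rows]

print(titles_by_director(demo, "Steven Spielberg"))
```

`pd.read_sql_query` accepts the same values through its `params` argument, so the pattern carries over to the DataFrame-based cells.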


### Q2: Find all years that have a movie that received a rating of 4 or 5, and sort them in increasing order. 

One way to answer this is with a subquery in the WHERE clause. First, check which ratings the subquery will match.

In [19]:
df = pd.read_sql_query("""
    SELECT *
    FROM Rating
    WHERE stars = 4 OR stars = 5
""", conn);df

Unnamed: 0,rID,mID,stars,ratingDate
0,201,101,4,2011-01-27
1,202,106,4,
2,203,108,4,2011-01-12
3,205,108,4,
4,206,106,5,2011-01-19
5,207,107,5,2011-01-20


In [23]:
df = pd.read_sql_query("""
    SELECT year
    FROM Movie 
    WHERE mID IN
        (SELECT mID
        FROM Rating
        WHERE stars = 4 OR stars = 5)
    ORDER BY year ASC
    """, conn);df

Unnamed: 0,year
0,1937
1,1939
2,1981
3,2009
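
The same years can also be produced with a join rather than a subquery, provided DISTINCT removes the duplicates that appear when a movie has several high ratings. The following is a sketch, assuming an in-memory copy of the Movie and Rating data trimmed to the columns this query touches.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
# Assumption: only the columns this query needs are recreated here.
demo.executescript("""
    create table Movie(mID int, year int);
    create table Rating(mID int, stars int);
""")
demo.executemany("insert into Movie values (?, ?)", [
    (101, 1939), (102, 1977), (103, 1965), (104, 1982),
    (105, 1997), (106, 1937), (107, 2009), (108, 1981),
])
demo.executemany("insert into Rating values (?, ?)", [
    (101, 2), (101, 4), (106, 4), (103, 2), (108, 4), (108, 2), (101, 3),
    (103, 3), (104, 2), (108, 4), (107, 3), (106, 5), (107, 5), (104, 3),
])

# Subquery form: each movie appears at most once, so no DISTINCT is needed.
with_subquery = [year for (year,) in demo.execute("""
    SELECT year
    FROM Movie
    WHERE mID IN (SELECT mID FROM Rating WHERE stars IN (4, 5))
    ORDER BY year
""")]

# Join form: one row per matching rating, so DISTINCT collapses the repeats.
with_join = [year for (year,) in demo.execute("""
    SELECT DISTINCT year
    FROM Movie INNER JOIN Rating USING(mID)
    WHERE stars IN (4, 5)
    ORDER BY year
""")]

print(with_subquery)
print(with_join)
```

The two forms agree on this data because no two movies share a release year; if they did, the subquery form would list that year once per movie while the join form would list it once.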


### Q3: Find the titles of all movies that have no ratings. 

In [25]:
df = pd.read_sql_query("""
    SELECT title
    FROM Movie
    WHERE mID NOT IN 
       (SELECT mID
        FROM Rating)
""", conn);df

Unnamed: 0,title
0,Star Wars
1,Titanic
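
`NOT IN` has a well-known pitfall: if the subquery returns even one NULL, the comparison is never true and the outer query returns no rows. The Rating data in this notebook has no NULL `mID`, so the query above is safe, but `NOT EXISTS` avoids the issue altogether. The sketch below assumes a trimmed in-memory copy of the data plus one hypothetical rating row with a NULL `mID`, added purely for demonstration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    create table Movie(mID int, title text);
    create table Rating(rID int, mID int, stars int);
""")
demo.executemany("insert into Movie values (?, ?)", [
    (101, "Gone with the Wind"), (102, "Star Wars"), (103, "The Sound of Music"),
    (104, "E.T."), (105, "Titanic"), (106, "Snow White"), (107, "Avatar"),
    (108, "Raiders of the Lost Ark"),
])
demo.executemany("insert into Rating values (?, ?, ?)", [
    (201, 101, 2), (201, 101, 4), (202, 106, 4), (203, 103, 2), (203, 108, 4),
    (203, 108, 2), (204, 101, 3), (205, 103, 3), (205, 104, 2), (205, 108, 4),
    (206, 107, 3), (206, 106, 5), (207, 107, 5), (208, 104, 3),
])

NOT_IN = """
    SELECT title FROM Movie
    WHERE mID NOT IN (SELECT mID FROM Rating)
    ORDER BY mID
"""
NOT_EXISTS = """
    SELECT title FROM Movie
    WHERE NOT EXISTS (SELECT 1 FROM Rating WHERE Rating.mID = Movie.mID)
    ORDER BY mID
"""

def titles(query):
    """Run a single-column query and return its values as a list."""
    return [title for (title,) in demo.execute(query)]

before = titles(NOT_IN)

# Hypothetical row, not part of the course data: a rating with no movie ID.
demo.execute("insert into Rating values (209, NULL, 3)")

after_not_in = titles(NOT_IN)          # the NULL makes every NOT IN test unknown
after_not_exists = titles(NOT_EXISTS)  # unaffected by the NULL row
print(before, after_not_in, after_not_exists)
```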


### Q4: Some reviewers didn't provide a date with their rating. Find the names of all reviewers who have ratings with a NULL value for the date. 

In [26]:
df = pd.read_sql_query("""
    SELECT name
    FROM Reviewer
    WHERE rID IN
       (SELECT rID
        FROM Rating
        WHERE ratingDate is NULL)
""", conn);df

Unnamed: 0,name
0,Daniel Lewis
1,Chris Jackson


### Q5: Write a query to return the ratings data in a more readable format: reviewer name, movie title, stars, and ratingDate. Also, sort the data, first by reviewer name, then by movie title, and lastly by number of stars. 

In [29]:
df = pd.read_sql_query("""
    SELECT name, title, stars, ratingDate
    FROM Reviewer INNER JOIN Rating USING(rID) INNER JOIN Movie USING(mID)
    ORDER BY name, title, stars
""", conn);df

Unnamed: 0,name,title,stars,ratingDate
0,Ashley White,E.T.,3,2011-01-02
1,Brittany Harris,Raiders of the Lost Ark,2,2011-01-30
2,Brittany Harris,Raiders of the Lost Ark,4,2011-01-12
3,Brittany Harris,The Sound of Music,2,2011-01-20
4,Chris Jackson,E.T.,2,2011-01-22
5,Chris Jackson,Raiders of the Lost Ark,4,
6,Chris Jackson,The Sound of Music,3,2011-01-27
7,Daniel Lewis,Snow White,4,
8,Elizabeth Thomas,Avatar,3,2011-01-15
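9,Elizabeth Thomas,Snow White,5,2011-01-19
10,James Cameron,Avatar,5,2011-01-20
11,Mike Anderson,Gone with the Wind,3,2011-01-09
12,Sarah Martinez,Gone with the Wind,2,2011-01-22
13,Sarah Martinez,Gone with the Wind,4,2011-01-27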


### Q6: For all cases where the same reviewer rated the same movie twice and gave it a higher rating the second time, return the reviewer's name and the title of the movie. 


In [57]:
df = pd.read_sql_query("""
    SELECT 
        a.rID, a.mID, a.stars, a.ratingDate, 
        b.rID, b.mID, b.stars, b.ratingDate,
        title, name
    FROM Rating AS a, Rating AS b, Movie, Reviewer
    WHERE 
        a.rID = Reviewer.rID AND     
        a.mID = Movie.mID AND
        a.rID = b.rID AND 
        a.mID = b.mID AND
        a.ratingDate > b.ratingDate AND   -- a is the later of the two ratings
        a.stars > b.stars
    ORDER BY name
    
""", conn);df

Unnamed: 0,rID,mID,stars,ratingDate,rID.1,mID.1,stars.1,ratingDate.1,title,name
0,201,101,4,2011-01-27,201,101,2,2011-01-22,Gone with the Wind,Sarah Martinez


In [58]:
df = pd.read_sql_query("""
    SELECT 
        name, title
    FROM Rating AS a, Rating AS b, Movie, Reviewer
    WHERE 
        a.rID = Reviewer.rID AND     
        a.mID = Movie.mID AND
        a.rID = b.rID AND 
        a.mID = b.mID AND
        a.ratingDate > b.ratingDate AND   -- a is the later of the two ratings
        a.stars > b.stars
   
""", conn);df

Unnamed: 0,name,title
0,Sarah Martinez,Gone with the Wind


### Q7: For each movie that has at least one rating, find the highest number of stars that movie received. Return the movie title and number of stars. Sort by movie title. 

In [72]:
df = pd.read_sql_query("""
    SELECT 
        title, MAX(stars)
    FROM Rating INNER JOIN Movie USING(mID)
    GROUP BY title
    ORDER BY title
""", conn);df

Unnamed: 0,title,MAX(stars)
0,Avatar,5
1,E.T.,3
2,Gone with the Wind,4
3,Raiders of the Lost Ark,4
4,Snow White,5
5,The Sound of Music,3


### Q8: For each movie, return the title and the 'rating spread', that is, the difference between highest and lowest ratings given to that movie. Sort by rating spread from highest to lowest, then by movie title. 

In [81]:
df = pd.read_sql_query("""
    SELECT 
        title, (MAX(stars) - MIN(stars)) AS RatingsSpread
    FROM Rating INNER JOIN Movie USING(mID)
    GROUP BY title  
    ORDER BY RatingsSpread DESC, title 
""", conn);df

Unnamed: 0,title,RatingsSpread
0,Avatar,2
1,Gone with the Wind,2
2,Raiders of the Lost Ark,2
3,E.T.,1
4,Snow White,1
5,The Sound of Music,1


### Q9: Find the difference between the average rating of movies released before 1980 and the average rating of movies released after 1980. (Make sure to calculate the average rating for each movie, then the average of those averages for movies before 1980 and movies after. Don't just calculate the overall average rating before and after 1980.) 

In [11]:
df = pd.read_sql_query("""
    SELECT 
        title, AVG(stars), year
    FROM Rating INNER JOIN Movie USING(mID)
    WHERE year > 1980
    GROUP BY title
""", conn);df

Unnamed: 0,title,AVG(stars),year
0,Avatar,4.0,2009
1,E.T.,2.5,1982
2,Raiders of the Lost Ark,3.333333,1981


In [25]:
df = pd.read_sql_query("""
    SELECT
        (SELECT AVG(FilmRatingsAvg) AS OverallAVG
         FROM
            (SELECT title, AVG(stars) AS FilmRatingsAvg, year
             FROM Rating INNER JOIN Movie USING(mID)
             WHERE year < 1980
             GROUP BY title) AS Before1980)
        -
        (SELECT AVG(FilmRatingsAvg) AS OverallAVG
         FROM
            (SELECT title, AVG(stars) AS FilmRatingsAvg, year
             FROM Rating INNER JOIN Movie USING(mID)
             WHERE year > 1980
             GROUP BY title) AS After1980)
        AS RatingsDiff
""", conn);df

Unnamed: 0,RatingsDiff
0,0.055556
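
The warning in the question matters for this data set. The sketch below computes the difference both ways: first averaging per movie and then across movies, and then pooling all ratings on each side of 1980. It assumes an in-memory copy of the data trimmed to the columns involved, and the variable names `per_movie` and `pooled` are illustrative.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
# Assumption: only the columns needed for the averages are recreated.
demo.executescript("""
    create table Movie(mID int, year int);
    create table Rating(mID int, stars int);
""")
demo.executemany("insert into Movie values (?, ?)", [
    (101, 1939), (102, 1977), (103, 1965), (104, 1982),
    (105, 1997), (106, 1937), (107, 2009), (108, 1981),
])
demo.executemany("insert into Rating values (?, ?)", [
    (101, 2), (101, 4), (106, 4), (103, 2), (108, 4), (108, 2), (101, 3),
    (103, 3), (104, 2), (108, 4), (107, 3), (106, 5), (107, 5), (104, 3),
])

# Average each movie first, then average those per-movie values on each side.
per_movie = demo.execute("""
    SELECT AVG(CASE WHEN year < 1980 THEN movie_avg END)
         - AVG(CASE WHEN year > 1980 THEN movie_avg END)
    FROM (SELECT mID, year, AVG(stars) AS movie_avg
          FROM Rating INNER JOIN Movie USING(mID)
          GROUP BY mID, year)
""").fetchone()[0]

# Pool all ratings on each side of 1980, which weights movies by rating count.
pooled = demo.execute("""
    SELECT AVG(CASE WHEN year < 1980 THEN stars END)
         - AVG(CASE WHEN year > 1980 THEN stars END)
    FROM Rating INNER JOIN Movie USING(mID)
""").fetchone()[0]

print(round(per_movie, 6), round(pooled, 6))
```

The per-movie difference matches the 0.055556 shown above, while the pooled difference comes out to zero because each side happens to have seven ratings that sum to 23 stars. The `CASE` expressions rely on `AVG` skipping NULL values.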


### Q2.1: Find the names of all reviewers who rated Gone with the Wind. 

In [28]:
df = pd.read_sql_query("""
    SELECT DISTINCT name
    FROM Reviewer INNER JOIN Rating USING(rID) INNER JOIN Movie USING(mID)
    WHERE title = 'Gone with the Wind'
""", conn);df

Unnamed: 0,name
0,Sarah Martinez
1,Mike Anderson


### Q2.2: For any rating where the reviewer is the same as the director of the movie, return the reviewer name, movie title, and number of stars. 

In [62]:
df = pd.read_sql_query("""
    SELECT name, title, stars
    FROM Reviewer INNER JOIN Rating USING(rID) INNER JOIN Movie USING(mID)
    WHERE director = name
""", conn);df

Unnamed: 0,name,title,stars
0,James Cameron,Avatar,5


### Q2.3: Return all reviewer names and movie names together in a single list, alphabetized. (Sorting by the first name of the reviewer and first word in the title is fine; no need for special processing on last names or removing "The".) 

In [35]:
df = pd.read_sql_query("""
    SELECT name AS x
    FROM Reviewer
    UNION
    SELECT title AS x
    FROM Movie
    ORDER BY x
""", conn);df

Unnamed: 0,x
0,Ashley White
1,Avatar
2,Brittany Harris
3,Chris Jackson
4,Daniel Lewis
5,E.T.
6,Elizabeth Thomas
7,Gone with the Wind
8,James Cameron
9,Mike Anderson


### Q2.4: Find the titles of all movies not reviewed by Chris Jackson. 

In [60]:
df = pd.read_sql_query("""
    SELECT title
    FROM Movie
    WHERE mID NOT IN
        (SELECT mID
        FROM Reviewer INNER JOIN Rating USING(rID)
        WHERE name = 'Chris Jackson')
""", conn);df

Unnamed: 0,title
0,Gone with the Wind
1,Star Wars
2,Titanic
3,Snow White
4,Avatar


### Q2.5: For all pairs of reviewers such that both reviewers gave a rating to the same movie, return the names of both reviewers. Eliminate duplicates, don't pair reviewers with themselves, and include each pair only once. For each pair, return the names in the pair in alphabetical order. 

In [91]:
df = pd.read_sql_query("""
    SELECT DISTINCT X.name, Y.name
    FROM (Rating INNER JOIN Reviewer USING(rID)) AS X,
         (Rating INNER JOIN Reviewer USING(rID)) AS Y
    WHERE Y.name > X.name AND X.mID = Y.mID
    ORDER BY X.name
""", conn);df

Unnamed: 0,name,name.1
0,Ashley White,Chris Jackson
1,Brittany Harris,Chris Jackson
2,Daniel Lewis,Elizabeth Thomas
3,Elizabeth Thomas,James Cameron
4,Mike Anderson,Sarah Martinez


### Q2.6: For each rating that is the lowest (fewest stars) currently in the database, return the reviewer name, movie title, and number of stars. 

In [99]:
df = pd.read_sql_query("""
    SELECT name, title, stars
    FROM Reviewer INNER JOIN Rating USING(rID) INNER JOIN Movie USING(mID)
    WHERE stars = 
        (SELECT MIN(stars)
        FROM Rating)
""", conn);df 

Unnamed: 0,name,title,stars
0,Sarah Martinez,Gone with the Wind,2
1,Brittany Harris,The Sound of Music,2
2,Brittany Harris,Raiders of the Lost Ark,2
3,Chris Jackson,E.T.,2


### Q2.7: List movie titles and average ratings, from highest-rated to lowest-rated. If two or more movies have the same average rating, list them in alphabetical order. 

In [103]:
df = pd.read_sql_query("""
    SELECT title, AVG(stars) AS AverageRating
    FROM Reviewer INNER JOIN Rating USING(rID) INNER JOIN Movie USING(mID)
    GROUP BY title
    ORDER BY AverageRating DESC, title
""", conn);df 

Unnamed: 0,title,AverageRating
0,Snow White,4.5
1,Avatar,4.0
2,Raiders of the Lost Ark,3.333333
3,Gone with the Wind,3.0
4,E.T.,2.5
5,The Sound of Music,2.5
