# Analyzing Streaming Service Content in SQL
## By: Hrishikesh Dipak Desai

## Exploring our data
Let's start by looking at the data we will be working with, beginning with the `amazon`, `hulu`, `netflix`, and `disney` tables.

In [6]:
SELECT *
FROM amazon

Unnamed: 0,id,title,year,age,imdb,rotten_tomatoes,type
0,3784,Underworld: Evolution,2006,18+,,76/100,0
1,3788,Across the Universe,2007,13+,,76/100,0
2,3821,Resident Evil: Apocalypse,2004,18+,,74/100,0
3,3861,Breach,2007,13+,,72/100,0
4,3838,Robot & Frank,2012,13+,,73/100,0
...,...,...,...,...,...,...,...
2952,2104,Elfen Lied,2004,18+,8.0/10,77/100,1
2953,2117,Stargate Atlantis,2004,7+,8.1/10,76/100,1
2954,2301,Ghost Whisperer,2005,7+,6.4/10,68/100,1
2955,2289,The Amazing Race,2001,7+,7.6/10,68/100,1


In [11]:
SELECT *
FROM hulu

Unnamed: 0,id,title,year,age,imdb,rotten_tomatoes,type
0,4083,Assassination Nation,2018,18+,,64/100,0
1,4092,Soul Food,1997,18+,,63/100,0
2,4094,Hail Satan?,2019,18+,,63/100,0
3,4095,Tiny Toon Adventures: How I Spent My Vacation,1992,,,63/100,0
4,4096,Little Fish,2021,16+,,63/100,0
...,...,...,...,...,...,...,...
1670,2545,Whose Line Is It Anyway? (UK),1988,7+,8.2/10,60/100,1
1671,2544,The Incredible Dr. Pol,2011,7+,8.6/10,60/100,1
1672,2541,The Real Housewives of Beverly Hills,2010,16+,5.1/10,60/100,1
1673,2540,Dharma & Greg,1997,16+,6.3/10,60/100,1


In [12]:
SELECT * 
FROM netflix

Unnamed: 0,id,title,year,age,imdb,rotten_tomatoes,type
0,3605,Arango y Sanint: Ríase el show,2018,,,32/100,0
1,3606,Match,2017,,,32/100,0
2,3607,One Like It,2020,,,32/100,0
3,3608,What the Jatt!!,2015,all,,32/100,0
4,3609,Deewana Main Deewana,2013,,,32/100,0
...,...,...,...,...,...,...,...
4743,1945,Million Yen Women,2017,,,14/100,1
4744,1955,Paul Hollywood's Big Continental Road Trip,2017,,,13/100,1
4745,2716,Paradox,2009,16+,7.0/10,56/100,1
4746,2637,Midnight Sun,2016,,7.5/10,58/100,1


In [13]:
SELECT *
FROM disney

Unnamed: 0,id,title,year,age,imdb,rotten_tomatoes,type
0,4132,The Kid,2019,18+,,62/100,0
1,3959,Diary of a Wimpy Kid,2010,7+,,67/100,0
2,4204,Teen Spirit,2019,13+,,60/100,0
3,4187,The Hunchback of Notre Dame,1923,,,60/100,0
4,4864,Pinocchio,2019,13+,,72/100,0
...,...,...,...,...,...,...,...
779,2396,Good Luck Charlie,2010,all,7.0/10,65/100,1
780,2616,America's Funniest Home Videos,1989,7+,6.2/10,58/100,1
781,2595,Running Wild with Bear Grylls,2014,7+,7.7/10,59/100,1
782,2564,Life Below Zero,2013,7+,8.0/10,60/100,1


We can also inspect the `genres` table, which is structured differently from the other tables: each row pairs a title with a comma-separated list of genres.

In [14]:
SELECT *
FROM genres

Unnamed: 0,title,genre
0,Sara's Notebook,"Dramas, International Movies, Thrillers"
1,Rare Exports: A Christmas Tale,"Action, Adventure, Comedy"
2,Gretel & Hansel,"Drama, Horror, Mystery"
3,Mr. Jones,"Drama, History, Thriller"
4,The Limehouse Golem,"Crime, Mystery, Thriller"
...,...,...
9590,The Incredible Dr. Pol,"Documentary, Reality-TV"
9591,The Real Housewives of Beverly Hills,Reality
9592,Dharma & Greg,"Comedy, Sitcom"
9593,Mobile Suit Gundam Wing,"Anime, International, Science Fiction"


## Preparing our data
### Joining the different tables
The four service tables appear to share the same columns, so we can combine them with a series of [`UNION`](https://www.postgresql.org/docs/14/queries-union.html)s, which append each table's rows to the previous result.

We use `UNION ALL` to preserve any possible duplicate rows, as we will want to count entries if they appear in multiple services.
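As a minimal sketch of the difference, the two queries below stack a few made-up rows written inline with `VALUES` (they are illustrative only, not taken from the dataset). The first should keep the repeated row, while the second should drop it.

```sql
-- Hypothetical inline rows, not taken from the streaming tables.
-- UNION ALL keeps every row, including the repeated one (3 rows).
SELECT title, year
FROM (VALUES ('Example Show', 2004), ('Another Show', 2010)) AS a(title, year)
UNION ALL
SELECT title, year
FROM (VALUES ('Example Show', 2004)) AS b(title, year);

-- UNION removes exact duplicate rows (2 rows).
SELECT title, year
FROM (VALUES ('Example Show', 2004), ('Another Show', 2010)) AS a(title, year)
UNION
SELECT title, year
FROM (VALUES ('Example Show', 2004)) AS b(title, year);
```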

In [17]:
SELECT *
FROM amazon
UNION ALL
SELECT *
FROM hulu
UNION ALL
SELECT * 
FROM netflix
UNION ALL
SELECT *
FROM disney

Unnamed: 0,id,title,year,age,imdb,rotten_tomatoes,type
0,3784,Underworld: Evolution,2006,18+,,76/100,0
1,3788,Across the Universe,2007,13+,,76/100,0
2,3821,Resident Evil: Apocalypse,2004,18+,,74/100,0
3,3861,Breach,2007,13+,,72/100,0
4,3838,Robot & Frank,2012,13+,,73/100,0
...,...,...,...,...,...,...,...
10159,2396,Good Luck Charlie,2010,all,7.0/10,65/100,1
10160,2616,America's Funniest Home Videos,1989,7+,6.2/10,58/100,1
10161,2595,Running Wild with Bear Grylls,2014,7+,7.7/10,59/100,1
10162,2564,Life Below Zero,2013,7+,8.0/10,60/100,1


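One problem with the above approach is that we lose the streaming service information. So let's repeat our query, but add a `service` column to each table! Note that this version uses plain `UNION`: because the `service` label makes rows from different tables distinct, only exact duplicates within a single service would be removed, and the order of the returned rows is no longer guaranteed.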
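A small sketch of the labeling idea, using a made-up title that is assumed to appear on two services: the constant `service` column tags each row with its source, so the two rows stay distinct even under `UNION`.

```sql
-- Hypothetical rows: the same title offered by two services.
-- The constant service column makes the rows differ, so UNION keeps both.
SELECT title, 'amazon' AS service
FROM (VALUES ('Example Show')) AS a(title)
UNION
SELECT title, 'hulu' AS service
FROM (VALUES ('Example Show')) AS h(title);
```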

In [16]:
SELECT *, 'amazon' AS service
FROM amazon
UNION
SELECT *, 'hulu' AS service
FROM hulu
UNION
SELECT *, 'netflix' AS service
FROM netflix
UNION
SELECT *, 'disney' AS service
FROM disney

Unnamed: 0,id,title,year,age,imdb,rotten_tomatoes,type,service
0,4536,The Current Occupant,2020,,,49/100,0,hulu
1,751,Sofia the First,2013,all,6.9/10,57/100,1,netflix
2,1182,Donald Glover: Weirdo,2012,,,60/100,0,netflix
3,4543,Shangri-La Suite,2016,16+,,49/100,0,hulu
4,874,Intersection,2016,,6.6/10,54/100,1,netflix
...,...,...,...,...,...,...,...,...
10159,1833,The Lovers,2017,18+,,54/100,0,netflix
10160,5613,Slackers,2002,18+,,58/100,0,amazon
10161,287,A Futile and Stupid Gesture,2018,18+,,75/100,0,netflix
10162,97,Monty Python's Flying Circus,1969,16+,8.8/10,81/100,1,netflix


Great! But we have one more table that might prove useful. Let's add in the genre information with a join.

To do this, we place the combined query in a [Common Table Expression](https://www.postgresql.org/docs/current/queries-with.html), or CTE, and join the `genres` table to it.
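The pattern can be sketched on two miniature tables defined inline (the names `shows` and `show_genres` and their rows are hypothetical). A `LEFT JOIN` keeps every row from the left side, so a title without a genre entry still appears, with `NULL` in the genre column.

```sql
-- Hypothetical miniature tables, defined inline as CTEs.
WITH shows AS (
    SELECT * FROM (VALUES ('Example Show'), ('Unlisted Show')) AS s(title)
),
show_genres AS (
    SELECT * FROM (VALUES ('Example Show', 'Comedy')) AS g(film, genre)
)
-- LEFT JOIN keeps every row from shows; 'Unlisted Show' has no match,
-- so its genre comes back as NULL.
SELECT s.title, g.genre
FROM shows AS s
LEFT JOIN show_genres AS g
    ON s.title = g.film;
```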

In [1]:
WITH service_data AS (
    SELECT *, 'amazon' AS service
    FROM amazon
    UNION
    SELECT *, 'hulu' AS service
    FROM hulu
    UNION
    SELECT *, 'netflix' AS service
    FROM netflix
    UNION
    SELECT *, 'disney' AS service
    FROM disney
)

SELECT *
FROM service_data AS sd
LEFT JOIN genres AS g
    ON sd.title = g.film

Unnamed: 0,id,title,year,age,imdb,rotten_tomatoes,type,service,film,genre
0,4536,The Current Occupant,2020,,,49/100,0,hulu,The Current Occupant,Horror
1,751,Sofia the First,2013,all,6.9/10,57/100,1,netflix,Sofia the First,Kids' TV
2,1182,Donald Glover: Weirdo,2012,,,60/100,0,netflix,Donald Glover: Weirdo,Stand-Up Comedy
3,4543,Shangri-La Suite,2016,16+,,49/100,0,hulu,Shangri-La Suite,"Comedy, Crime, Drama"
4,874,Intersection,2016,,6.6/10,54/100,1,netflix,Intersection,"Drama, Suspense"
...,...,...,...,...,...,...,...,...,...,...
10159,1833,The Lovers,2017,18+,,54/100,0,netflix,The Lovers,"Comedies, Dramas, Independent Movies"
10160,5613,Slackers,2002,18+,,58/100,0,amazon,Slackers,Comedy
10161,287,A Futile and Stupid Gesture,2018,18+,,75/100,0,netflix,A Futile and Stupid Gesture,Comedies
10162,97,Monty Python's Flying Circus,1969,16+,8.8/10,81/100,1,netflix,Monty Python's Flying Circus,"British TV Shows, Classic & Cult TV, Internati..."


### Inspecting missing data
It looks like we are missing some values in the `age` and `imdb` columns. We will also check the `rotten_tomatoes` column because we may use it later. Let's see how extensive this problem is.

To count the null values in each column, we combine [`SUM()`](https://www.postgresql.org/docs/current/functions-aggregate.html) with [`CASE WHEN`](https://www.postgresql.org/docs/current/functions-conditional.html): each row contributes 1 if the value is null and 0 otherwise.
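Here is a small sketch of that counting pattern on made-up values (two of the four are missing). The second column shows PostgreSQL's aggregate `FILTER` clause as one possible alternative, which should give the same count.

```sql
-- Hypothetical ratings; two of the four imdb values are NULL.
-- Both columns count the NULL values, so both evaluate to 2.
SELECT
    SUM(CASE WHEN imdb IS NULL THEN 1 ELSE 0 END) AS imdb_missing,
    COUNT(*) FILTER (WHERE imdb IS NULL) AS imdb_missing_filter
FROM (VALUES ('8.0/10'), (NULL), ('6.4/10'), (NULL)) AS t(imdb);
```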

In [1]:
WITH service_data AS (
    SELECT *, 'amazon' AS service
    FROM amazon
    UNION
    SELECT *, 'hulu' AS service
    FROM hulu
    UNION
    SELECT *, 'netflix' AS service
    FROM netflix
    UNION
    SELECT *, 'disney' AS service
    FROM disney
),

all_data AS (
    SELECT *
    FROM service_data AS sd
    LEFT JOIN genres AS g
        ON sd.title = g.film
)

SELECT 
    SUM(CASE WHEN imdb IS NULL THEN 1 ELSE 0 END) AS imdb_missing,
    SUM(CASE WHEN age IS NULL THEN 1 ELSE 0 END) AS age_missing,
    SUM(CASE WHEN rotten_tomatoes IS NULL THEN 1 ELSE 0 END) AS rt_missing
FROM all_data

Unnamed: 0,imdb_missing,age_missing,rt_missing
0,6901,3633,7


## Analyzing our data
### Which is the most family-friendly streaming service?
Let's start by finding the most family-friendly streaming service, measured by the percentage of its content geared towards children.

We have our `genre` column, but matching on a single exact value would leave out some content, since many entries list several genres. A better way may be to use [pattern matching](https://www.postgresql.org/docs/current/functions-matching.html) to find any references to "kids", "family", or "children".
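As a rough illustration on made-up genre strings, `ILIKE` matches case-insensitively and `%` stands for any run of characters, so the pattern finds "Family" anywhere in the list. A `NULL` genre yields `NULL` rather than true, so a `WHERE` clause would filter that row out.

```sql
-- Hypothetical genre strings to illustrate the matching rules.
-- Expected matches_family values, in order: true, true, false, NULL.
SELECT
    genre,
    genre ILIKE '%family%' AS matches_family
FROM (VALUES
    ('Children & Family Movies, Comedies'),
    ('Animation, Short, Comedy, Family'),
    ('Crime, Mystery, Thriller'),
    (NULL)
) AS t(genre);
```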

In [10]:
WITH service_data AS (
    SELECT *, 'amazon' AS service
    FROM amazon
    UNION
    SELECT *, 'hulu' AS service
    FROM hulu
    UNION
    SELECT *, 'netflix' AS service
    FROM netflix
    UNION
    SELECT *, 'disney' AS service
    FROM disney
),

all_data AS (
    SELECT *
    FROM service_data AS sd
    LEFT JOIN genres AS g
        ON sd.title = g.film
)

SELECT
    *
FROM all_data
WHERE genre ILIKE '%kids%' 
    OR genre ILIKE '%family%' 
    OR genre ILIKE '%children%'

Unnamed: 0,id,title,year,age,imdb,rotten_tomatoes,type,service,film,genre
0,751,Sofia the First,2013,all,6.9/10,57/100,1,netflix,Sofia the First,Kids' TV
1,2703,A Fairly Odd Summer,2014,all,,45/100,0,hulu,A Fairly Odd Summer,"Children & Family Movies, Comedies"
2,9266,Hawaiian Holiday,1937,,,49/100,0,disney,Hawaiian Holiday,"Animation, Short, Comedy, Family, Sport"
3,9196,All in a Nutshell,1949,,,52/100,0,disney,All in a Nutshell,"Family, Comedy, Animation, Short"
4,1559,Motown Magic,2018,all,7.7/10,41/100,1,netflix,Motown Magic,"Kids' TV, TV Sci-Fi & Fantasy"
...,...,...,...,...,...,...,...,...,...,...
1536,2171,For the Birds,2018,,,50/100,0,amazon,For the Birds,"Animation, Short, Comedy, Family"
1537,9096,Minutemen,2008,all,,56/100,0,disney,Minutemen,"Adventure, Comedy, Family, Sci-Fi"
1538,9381,Yellowstone Cubs,1963,all,,45/100,0,disney,Yellowstone Cubs,Family
1539,8703,Frozen II,2019,7+,,78/100,0,disney,Frozen II,"Animation, Adventure, Comedy, Family, Fantasy,..."


Great! That seems to be working. Let's adapt our query and use `CASE WHEN` to perform an aggregation and see which platform has the highest percentage of family content.
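The idea behind the aggregation can be sketched on four made-up rows: averaging a 1/0 flag gives the share of rows that match, and multiplying by 100 turns it into a percentage. Rows with a `NULL` genre fall into the `ELSE` branch, so they count as non-family in the denominator.

```sql
-- Hypothetical genres: two of the four rows mention "family".
-- The average of the flags is 0.5, so pct_family is 50.
SELECT
    AVG(CASE WHEN genre ILIKE '%family%' THEN 1.0 ELSE 0.0 END) * 100 AS pct_family
FROM (VALUES ('Family'), ('Drama'), ('Comedy, Family'), (NULL)) AS t(genre);
```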

In [8]:
WITH service_data AS (
    SELECT *, 'amazon' AS service
    FROM amazon
    UNION
    SELECT *, 'hulu' AS service
    FROM hulu
    UNION
    SELECT *, 'netflix' AS service
    FROM netflix
    UNION
    SELECT *, 'disney' AS service
    FROM disney
),

all_data AS (
    SELECT *
    FROM service_data AS sd
    LEFT JOIN genres AS g
        ON sd.title = g.film
)

SELECT
    service,
    AVG(CASE WHEN genre ILIKE '%kids%' 
        OR genre ILIKE '%family%' 
        OR genre ILIKE '%children%' THEN 1.0000 ELSE 0.0000 END) * 100 AS pct_family
FROM all_data
GROUP BY service
ORDER BY pct_family DESC

Unnamed: 0,service,pct_family
0,disney,74.744898
1,netflix,11.057287
2,hulu,10.985075
3,amazon,8.319242


![image.png](attachment:image.png)

### Which has the highest-rated content?
We also have the rating of each piece of content in the `rotten_tomatoes` column, stored as text such as `76/100`. We use [`SPLIT_PART()`](https://www.postgresql.org/docs/current/functions-string.html) to extract the number before the slash and then cast (`::`) the result to `NUMERIC`.
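A quick sketch of the extraction on two literal values in the same format as the `rotten_tomatoes` and `imdb` columns: `SPLIT_PART(text, '/', 1)` returns the part before the first slash, and the cast makes it usable in arithmetic.

```sql
-- Literal strings in the same "score/scale" format as the rating columns.
SELECT
    SPLIT_PART('76/100', '/', 1)::NUMERIC AS rt_score,         -- 76
    SPLIT_PART('8.1/10', '/', 1)::NUMERIC * 10 AS imdb_score;  -- 81.0
```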

We will also split the data into movies and TV shows and visualize the result in a grouped bar chart.

In [15]:
WITH service_data AS (
    SELECT *, 'amazon' AS service
    FROM amazon
    UNION
    SELECT *, 'hulu' AS service
    FROM hulu
    UNION
    SELECT *, 'netflix' AS service
    FROM netflix
    UNION
    SELECT *, 'disney' AS service
    FROM disney
),

all_data AS (
    SELECT *
    FROM service_data AS sd
    LEFT JOIN genres AS g
        ON sd.title = g.film
)

SELECT
    service,
    CASE WHEN type = 1 THEN 'TV' ELSE 'Movie' END AS type,
    AVG(SPLIT_PART(rotten_tomatoes, '/', 1)::NUMERIC) AS rt_score
FROM all_data
GROUP BY service, type
ORDER BY rt_score DESC

Unnamed: 0,service,type,rt_score
0,hulu,Movie,60.482517
1,disney,Movie,60.047934
2,hulu,TV,59.690625
3,netflix,Movie,54.965913
4,disney,TV,54.486034
5,netflix,TV,54.229586
6,amazon,TV,52.377207
7,amazon,Movie,51.990146


![image.png](attachment:image.png)

### Have critics and audiences diverged over time?
Okay, for our final analysis, let's put the service aside and look into whether critics and audiences were more aligned on TV shows in the past. We compare the two available scores, `imdb` and `rotten_tomatoes`, for titles that have both.

To prepare the data for the chart cell, we use [`TO_DATE()`](https://www.postgresql.org/docs/current/functions-formatting.html) to convert the year into a date. We also multiply the IMDb score by 10 so that both scores are on a 0-100 scale.
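A minimal sketch of the conversion on a literal year, assuming the `year` column holds integers: `TO_DATE()` with only a `'YYYY'` pattern yields January 1 of that year. `MAKE_DATE()` is shown as a possible alternative that skips the text cast.

```sql
-- Both expressions evaluate to the date 2004-01-01.
SELECT
    TO_DATE(2004::TEXT, 'YYYY') AS from_to_date,
    MAKE_DATE(2004, 1, 1) AS from_make_date;
```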

In [13]:
WITH service_data AS (
    SELECT *, 'amazon' AS service
    FROM amazon
    UNION
    SELECT *, 'hulu' AS service
    FROM hulu
    UNION
    SELECT *, 'netflix' AS service
    FROM netflix
    UNION
    SELECT *, 'disney' AS service
    FROM disney
),

all_data AS (
    SELECT *
    FROM service_data AS sd
    LEFT JOIN genres AS g
        ON sd.title = g.film
)

SELECT
    date,
    AVG(ABS(imdb_score - rt_score)) AS avg_difference
FROM (
    SELECT 
        TO_DATE(year::TEXT, 'YYYY') AS date,
        SPLIT_PART(rotten_tomatoes, '/', 1)::NUMERIC AS rt_score,
        SPLIT_PART(imdb, '/', 1)::NUMERIC * 10 AS imdb_score
    FROM all_data
    WHERE imdb IS NOT NULL
        AND rotten_tomatoes IS NOT NULL
        AND year >= 2000
) AS sub
GROUP BY date
ORDER BY date

Unnamed: 0,date,avg_difference
0,2000-01-01 00:00:00+00:00,9.714286
1,2001-01-01 00:00:00+00:00,9.588235
2,2002-01-01 00:00:00+00:00,10.210526
3,2003-01-01 00:00:00+00:00,5.517241
4,2004-01-01 00:00:00+00:00,9.258065
5,2005-01-01 00:00:00+00:00,7.867925
6,2006-01-01 00:00:00+00:00,8.574074
7,2007-01-01 00:00:00+00:00,13.025
8,2008-01-01 00:00:00+00:00,10.5
9,2009-01-01 00:00:00+00:00,8.746988


![image.png](attachment:image.png)