## Mid-Semester Project - CS 181

**Stage**: #3 - Data Cleaning and Processing

**Name**: Hieu Tran

**Central Question**: What are the average durations of movies and TV shows on popular streaming services, and what is the ratio between the two?

In [1]:
# import necessary modules
import os
import os.path
import pandas as pd
import csv
from statistics import mean

**Stage #2 - Data Parsing**

In [2]:
# Amazon Prime
with open ("data/amazon_prime_titles.csv", 'r') as amazon:
    header = amazon.readline().strip().split(',')
    amazon_DoL = {}
    
    for line in csv.reader(amazon): 
        for i in range(len(header)):
            amazon_DoL.setdefault(header[i], []).append(line[i]) 

amazon_df = pd.DataFrame(amazon_DoL)


# Disney Plus
with open ("data/disney_plus_titles.csv", 'r') as disney:
    header = disney.readline().strip().split(',')
    disney_file = []
        
    for line in csv.reader(disney):
        disney_file.append(line)

disney_df = pd.DataFrame(disney_file, columns=header) 

# Netflix
netflix_df = pd.read_csv('data/netflix_titles.csv')

# Hulu
hulu_df = pd.read_csv('data/hulu_titles.csv')


**1. Cleaning the data**

Here, we will select only the 4 columns (`show_id`, `title`, `type`, and `duration`) from each DataFrame, and concatenate those DataFrames into one.

In [3]:
# Select only columns required for analysis
amazon_df = amazon_df[['show_id', 'title', 'type', 'duration']]
amazon_df.insert(0, 'service', 'Amazon') # insert a column in each dataset for the provider 

disney_df = disney_df[['show_id', 'title', 'type', 'duration']]
disney_df.insert(0, 'service', 'Disney')

netflix_df = netflix_df[['show_id', 'title', 'type', 'duration']]
netflix_df.insert(0, 'service', 'Netflix')

hulu_df = hulu_df[['show_id', 'title', 'type', 'duration']]
hulu_df.insert(0, 'service', 'Hulu')

# Tidying up the data by concatenating
data = pd.concat([amazon_df, disney_df, netflix_df, hulu_df], ignore_index=True)
data.set_index(['service','show_id'], inplace=True)
data

service,show_id,title,type,duration
Amazon,s1,The Grand Seduction,Movie,113 min
Amazon,s2,Take Care Good Night,Movie,110 min
Amazon,s3,Secrets of Deception,Movie,74 min
Amazon,s4,Pink: Staying True,Movie,69 min
Amazon,s5,Monster Maker,Movie,45 min
...,...,...,...,...
Hulu,s3069,Star Trek: The Original Series,TV Show,3 Seasons
Hulu,s3070,Star Trek: Voyager,TV Show,7 Seasons
Hulu,s3071,The Fades,TV Show,1 Season
Hulu,s3072,The Twilight Zone,TV Show,5 Seasons


In this dataset, the independent variables are `service` and `show_id`, the dependent variables are `title`, `type`, and `duration`.
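
The `(service, show_id)` index described above can also be built by `pd.concat` itself through its `keys` and `names` parameters, instead of inserting a `service` column first. The following is a minimal sketch of that alternative, not the approach used in this notebook; the two small frames are toy stand-ins whose rows are copied from the output above.

```python
import pandas as pd

# Toy stand-ins for two of the per-service tables (rows copied from the output above).
amazon = pd.DataFrame(
    {"show_id": ["s1", "s2"],
     "title": ["The Grand Seduction", "Take Care Good Night"],
     "type": ["Movie", "Movie"],
     "duration": ["113 min", "110 min"]}
).set_index("show_id")
hulu = pd.DataFrame(
    {"show_id": ["s3071", "s3072"],
     "title": ["The Fades", "The Twilight Zone"],
     "type": ["TV Show", "TV Show"],
     "duration": ["1 Season", "5 Seasons"]}
).set_index("show_id")

# keys= labels each input frame; names= names the resulting index levels.
combined = pd.concat([amazon, hulu], keys=["Amazon", "Hulu"], names=["service", "show_id"])

# show_id values are only meaningful together with the service, so look rows up by the pair.
print(combined.index.is_unique)
print(combined.loc[("Hulu", "s3072"), "title"])
```

Either way, a row is identified by the pair of service and `show_id` rather than by `show_id` alone.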

**2. Prepare the data for analysis**

2.1. First, for each service, we will split the DataFrame in two based on the `type` column (movies or TV shows) by using the pandas `.groupby()` method.

In [4]:
group = data.groupby(["service", "type"]) # split by "service" and "type" columns

amazon_movies_df = group.get_group(("Amazon", "Movie")).copy() # get the subset for this "service" and "type"; .copy() makes it an independent DataFrame that is safe to modify later
amazon_tv_shows_df = group.get_group(("Amazon", "TV Show")).copy()

disney_movies_df = group.get_group(("Disney", "Movie")).copy()
disney_tv_shows_df = group.get_group(("Disney", "TV Show")).copy()

netflix_movies_df = group.get_group(("Netflix", "Movie")).copy()
netflix_tv_shows_df = group.get_group(("Netflix", "TV Show")).copy()

hulu_movies_df = group.get_group(("Hulu", "Movie")).copy()
hulu_tv_shows_df = group.get_group(("Hulu", "TV Show")).copy()
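
As a possible alternative to eight separate variables, the same split can be held in one dictionary keyed by `(service, type)`. This is only a sketch on a toy table (rows copied from the output above); the name `subsets` is an assumption for illustration and does not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for `data`, using rows from the output above.
data = pd.DataFrame(
    {"service": ["Amazon", "Amazon", "Hulu", "Hulu"],
     "show_id": ["s1", "s2", "s3071", "s3072"],
     "title": ["The Grand Seduction", "Take Care Good Night",
               "The Fades", "The Twilight Zone"],
     "type": ["Movie", "Movie", "TV Show", "TV Show"],
     "duration": ["113 min", "110 min", "1 Season", "5 Seasons"]}
).set_index(["service", "show_id"])

# Iterating over a GroupBy yields (key, sub-frame) pairs; with two grouping
# keys, each key is a (service, type) tuple.
subsets = {key: frame.copy() for key, frame in data.groupby(["service", "type"])}

print(sorted(subsets))
print(len(subsets[("Hulu", "TV Show")]))
```

A dictionary like this lets later steps loop over all subsets instead of repeating the same line once per service and type.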

2.2. Second, we will convert the durations in all the new DataFrames from strings to numbers by using the pandas `.replace()` method and the `pd.to_numeric()` function.

In [None]:
# Step 1: Strip everything in the 'duration' column that is not a number (' min', ' Season', ' Seasons') by replacing it with an empty string

amazon_movies_df['duration'] = amazon_movies_df['duration'].replace({' min': ''}, regex=True) # regular expression replacement
amazon_tv_shows_df['duration'] = amazon_tv_shows_df['duration'].replace({' Seasons?': ''}, regex=True) # 's?' matches both ' Season' and ' Seasons'

disney_movies_df['duration'] = disney_movies_df['duration'].replace({' min': ''}, regex=True)
disney_tv_shows_df['duration'] = disney_tv_shows_df['duration'].replace({' Seasons?': ''}, regex=True)

hulu_movies_df['duration'] = hulu_movies_df['duration'].replace({' min': ''}, regex=True)
hulu_tv_shows_df['duration'] = hulu_tv_shows_df['duration'].replace({' Seasons?': ''}, regex=True)

netflix_movies_df['duration'] = netflix_movies_df['duration'].replace({' min': ''}, regex=True)
netflix_tv_shows_df['duration'] = netflix_tv_shows_df['duration'].replace({' Seasons?': ''}, regex=True)


# Step 2: Convert the durations into integers

amazon_tv_shows_df['duration'] = pd.to_numeric(amazon_tv_shows_df['duration'], errors='coerce') # errors='coerce': any invalid data will be set as NaN
amazon_movies_df['duration'] = pd.to_numeric(amazon_movies_df['duration'], errors='coerce')

disney_tv_shows_df['duration'] = pd.to_numeric(disney_tv_shows_df['duration'], errors='coerce')
disney_movies_df['duration'] = pd.to_numeric(disney_movies_df['duration'], errors='coerce')

hulu_tv_shows_df['duration'] = pd.to_numeric(hulu_tv_shows_df['duration'], errors='coerce')
hulu_movies_df['duration'] = pd.to_numeric(hulu_movies_df['duration'], errors='coerce')

netflix_tv_shows_df['duration'] = pd.to_numeric(netflix_tv_shows_df['duration'], errors='coerce')
netflix_movies_df['duration'] = pd.to_numeric(netflix_movies_df['duration'], errors='coerce')

In [6]:
# Example
amazon_movies_df.head()

service,show_id,title,type,duration
Amazon,s1,The Grand Seduction,Movie,113
Amazon,s2,Take Care Good Night,Movie,110
Amazon,s3,Secrets of Deception,Movie,74
Amazon,s4,Pink: Staying True,Movie,69
Amazon,s5,Monster Maker,Movie,45
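
The two-step conversion above (strip the unit, then convert) can also be sketched as a single extraction of the leading digits. This is an alternative under the assumption that every valid duration starts with a whole number; it uses `Series.str.extract` and is not the method used in the cells above.

```python
import pandas as pd

# Duration strings in the formats shown above, plus one blank entry.
raw = pd.Series(["113 min", "1 Season", "5 Seasons", ""])

# Capture the first run of digits; rows without any digits become NaN.
digits = raw.str.extract(r"(\d+)", expand=False)
duration = pd.to_numeric(digits, errors="coerce")

print(duration.tolist())
```

Because the pattern does not depend on the unit, the same line works for both the movie and the TV show subsets.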


**3. Analyze the processed data and answer the central question**

Question: What are the average durations of movies and TV shows on popular streaming services, and what is the ratio between the two?

We will answer this question by applying the `.describe()` method to each service's movie and TV show DataFrames, which returns, among other statistics, the count and mean of `duration`. These statistics exclude NaN values, so only movies and TV shows with an available duration are counted, for the sake of consistency.

In [7]:
# Amazon
print (amazon_movies_df.describe().round(1))
print (amazon_tv_shows_df.describe().round(1))

       duration
count    7814.0
mean       91.3
std        40.3
min         0.0
25%        75.0
50%        91.0
75%       106.0
max       601.0
       duration
count    1854.0
mean        1.7
std         1.8
min         1.0
25%         1.0
50%         1.0
75%         2.0
max        29.0


---
On Amazon Prime:
- The average duration of movies is **91.3 minutes** while that of TV shows is **1.7 seasons**.
- The ratio between TV shows and movies is 1854 : 7814 or **1 : 4.2**
---

In [8]:
# Disney Plus
print (disney_movies_df.describe().round(1))
print (disney_tv_shows_df.describe().round(1))

       duration
count    1052.0
mean       71.9
std        40.6
min         1.0
25%        44.0
50%        85.0
75%        98.0
max       183.0
       duration
count     398.0
mean        2.1
std         2.4
min         1.0
25%         1.0
50%         1.0
75%         2.0
max        32.0


---
On Disney Plus:
- The average duration of movies is **71.9 minutes** while that of TV shows is **2.1 seasons**.
- The ratio between TV shows and movies is 398 : 1052 or **1 : 2.6**
---

In [9]:
# Hulu
print (hulu_movies_df.describe().round(1))
print (hulu_tv_shows_df.describe().round(1))

       duration
count    1005.0
mean       98.3
std        21.2
min         1.0
25%        89.0
50%        97.0
75%       109.0
max       192.0
       duration
count    1589.0
mean        2.7
std         3.2
min         1.0
25%         1.0
50%         1.0
75%         3.0
max        34.0


---
On Hulu:
- The average duration of movies is **98.3 minutes** while that of TV shows is **2.7 seasons**.
- The ratio between TV shows and movies is 1589 : 1005 or **1 : 0.6**
---

In [10]:
# Netflix
print (netflix_movies_df.describe().round(1))
print (netflix_tv_shows_df.describe().round(1))

       duration
count    6128.0
mean       99.6
std        28.3
min         3.0
25%        87.0
50%        98.0
75%       114.0
max       312.0
       duration
count    2676.0
mean        1.8
std         1.6
min         1.0
25%         1.0
50%         1.0
75%         2.0
max        17.0


---
On Netflix:
- The average duration of movies is **99.6 minutes** while that of TV shows is **1.8 seasons**.
- The ratio between TV shows and movies is 2676 : 6128 or **1 : 2.3**
---
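
The count and mean reported per service above could also be gathered into one table with a single `groupby` followed by `.agg()`. The sketch below runs on a toy table: the movie durations come from the Amazon rows shown earlier and the season counts from the Hulu rows, and the name `toy` is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for the cleaned table; durations are already numeric here.
toy = pd.DataFrame(
    {"service": ["Amazon", "Amazon", "Amazon", "Hulu", "Hulu", "Hulu"],
     "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
     "duration": [113, 110, 74, 3, 7, 1]}
)

# One row per (service, type) with the number of non-NaN durations and their mean.
summary = toy.groupby(["service", "type"])["duration"].agg(["count", "mean"]).round(1)

print(summary)
```

Like `.describe()`, both `count` and `mean` skip NaN values, so the same consistency rule applies.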

In [11]:
# Get overall statistics (each overall duration is the unweighted mean of the four per-service means)

movie_dfs = [amazon_movies_df, disney_movies_df, netflix_movies_df, hulu_movies_df]
tv_show_dfs = [amazon_tv_shows_df, disney_tv_shows_df, netflix_tv_shows_df, hulu_tv_shows_df]

overall_movie_duration = mean([df['duration'].mean() for df in movie_dfs])
print("Overall movie duration:", round(overall_movie_duration, 1)) # built-in round() works for any numeric type

overall_tv_show_duration = mean([df['duration'].mean() for df in tv_show_dfs])
print("Overall TV show duration:", round(overall_tv_show_duration, 1))

# .count() ignores NaN, so only titles with a valid duration are counted
print("Overall ratio between TV shows and movies:", sum(df['duration'].count() for df in tv_show_dfs),
      ":", sum(df['duration'].count() for df in movie_dfs))

Overall movie duration: 90.3
Overall TV show duration: 2.1
Overall ratio between TV shows and movies: 6517 : 15999


---
**Overall (across all platforms)**
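- The average duration of movies is **90.3 minutes** while that of TV shows is **2.1 seasons** (each is the unweighted mean of the four per-service averages, so every service counts equally regardless of catalogue size)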
- The ratio between TV shows and movies is 6517 : 15999 or **1 : 2.5**
---
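
The overall movie duration above averages the four per-service means, so each service counts equally. A title-weighted average, where each movie counts equally, gives a somewhat different number because Amazon and Netflix carry far more titles than Disney Plus and Hulu. The sketch below compares the two using the counts and rounded means copied from the `.describe()` outputs above, so the weighted figure is approximate.

```python
# (count, mean duration in minutes) per service, copied from the outputs above.
movies = {
    "Amazon": (7814, 91.3),
    "Disney": (1052, 71.9),
    "Hulu": (1005, 98.3),
    "Netflix": (6128, 99.6),
}

# Unweighted: every service counts equally (the approach used in this notebook).
unweighted = sum(m for _, m in movies.values()) / len(movies)

# Weighted: every title counts equally, so large catalogues dominate.
total = sum(n for n, _ in movies.values())
weighted = sum(n * m for n, m in movies.values()) / total

print(f"unweighted: {unweighted:.1f} min, weighted: {weighted:.1f} min over {total} titles")
```

Which figure is more appropriate depends on whether the question is about a typical service or a typical title.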

---
**Some interesting information we have gathered from this project:**
- The overall ratio between TV shows and movies indicates that the streaming services carry about 2.5 times as many movies as TV shows, except Hulu, which carries about 1.6 times as many TV shows as movies
- On most streaming services, the average TV show duration is around the 2-season mark, except on Hulu, where it is 2.7 seasons
- On most streaming services, the average movie duration is between 90 and 100 minutes, except on Disney Plus, where it is only 71.9 minutes

---