<a href="https://www.kaggle.com/code/appoooo/streaming-services-comparison?scriptVersionId=178463374" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Which streaming service should I choose? Netflix or Disney+

I am a beginner in data analysis. The purpose of this project is to analyze and visualize the data of both services, and then find out which service suits which kind of viewer. Everyone has different preferences for movies and TV shows, so after reading this notebook you should have a better idea of which service to choose.

Datasets for this project

Netflix: https://www.kaggle.com/datasets/shivamb/netflix-shows

Disney+: https://www.kaggle.com/datasets/shivamb/disney-movies-and-tv-shows

**These datasets were collected up to mid-2021, so there may be discrepancies with the current situation.**

# Reference
JOSH, Netflix Data Visualization, https://www.kaggle.com/code/joshuaswords/netflix-data-visualization/notebook

I learned a lot from his project, especially about data visualization.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Load and clean the data

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
dataset_netflix = pd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv')
dataset_disney = pd.read_csv("/kaggle/input/disney-movies-and-tv-shows/disney_plus_titles.csv")

In [None]:
def dataset_observation(dataset):
    for i in dataset.columns:
        rate = dataset[i].isna().sum() / len(dataset) * 100
        if rate > 0: print('The missing rate of {} is {}'.format(i, round(rate, 2)))

In [None]:
# Observe the dataset
title=['Netflix', 'Disney+']
for i, dataset in enumerate([dataset_netflix, dataset_disney]):
    print('-'*10 + title[i] + '-'*10)
    dataset_observation(dataset)
    print()
    dataset.info()
    print()

# Clean the dataset
* I don't think 'director', 'cast' and 'description' are useful for my analysis, so I drop them first.
* The missing rate of 'country' is high and the missing values cannot be inferred from other columns, so I fill them with 'unknown'.
* The missing rates of 'rating' and 'duration' are low, so rows with a missing 'rating' or 'duration' can be dropped. (The code below drops every row that still has a missing value, which also covers 'date_added'.)

# Processing some columns
* 'date_added' is stored as object type; converting it to a datetime type makes it easier to analyze.
* Movies or TV shows may be co-produced by several countries. I assume that the first country listed is the main production country.
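
As a quick illustration of the two processing steps above, here is a minimal sketch on a tiny hand-made frame. The rows and the name `toy` are made up for illustration and are not taken from the datasets.

```python
import pandas as pd

# Hypothetical sample rows, only to illustrate the two steps
toy = pd.DataFrame({
    'date_added': [' September 25, 2021', 'November 12, 2019 '],
    'country': ['United States, India', 'unknown'],
})

# strip stray spaces, then parse the text into datetime values
toy['date_added'] = pd.to_datetime(toy['date_added'].str.strip())

# keep only the first listed country as the main production country
toy['main_country'] = toy['country'].str.split(',').str[0].str.strip()

print(toy[['date_added', 'main_country']])
```

A title with several countries is counted once, under its first country, so co-productions are not credited to the other countries.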

In [None]:
def dataset_processing(dataset):
    df = dataset.drop(['director', 'cast', 'description'], axis=1)
    df.country = df.country.fillna('unknown')
    df.dropna(inplace = True)
    df.drop_duplicates(inplace=True)
    
    # convert date
    df.date_added = df.date_added.str.strip()
    df.date_added = pd.to_datetime(df.date_added)
    df['add_year'] = df.date_added.dt.year
    df['add_month'] = df.date_added.dt.month
    
    # assume the movies or tv shows are mainly created by first country
    df['main_country'] = df['country'].apply(lambda x : x.replace(' ,', ',').replace(', ', ',').split(',')[0])
    return df

In [None]:
# check, no null value in the dataset
print('-'*10 + title[0] + '-'*10)
net_df = dataset_processing(dataset_netflix)
print(net_df.info())
print()

print('-'*10 + title[1] + '-'*10)
dis_df = dataset_processing(dataset_disney)
print(dis_df.info())
print()

# Others 
* Set the colors for each dataset.
* After observing the ratio between movies and TV shows, split each dataset by type for the following analysis.

In [None]:
# red, black
net_color = ["#b20710", "#221f1f"]
# dark blue, light blue
dis_color = ['#000079', '#84C1FF']
# all color
all_color = [net_color, dis_color]
def select_type(df, ty):
    return df[df.type == ty]

net_movie = select_type(net_df, 'Movie')
net_tv = select_type(net_df, 'TV Show')
dis_movie = select_type(dis_df, 'Movie')
dis_tv = select_type(dis_df, 'TV Show')

# The ratio between movies and tv shows

In [None]:
def data_type(df):
    type_df = df.groupby('type').type.count()
    r = round(type_df / len(df), 2)
    return pd.DataFrame(r).T

In [None]:
def q1_imageSet(rate_df, ax, title, color):
    # use bar to build the basic image
    ax.barh(rate_df.index, rate_df['Movie'], color=color[0])
    ax.barh(rate_df.index, rate_df['TV Show'], left=rate_df['Movie'], color=color[1])
    ax.set_xlim(0, 1)
    ax.set_xticks([])
    ax.set_yticks([])

    # set the text (round() instead of int(), so that e.g. 0.29 * 100 is shown as 29 and not 28)
    for i in rate_df.index:
        movie, tv = rate_df['Movie'][i], rate_df['TV Show'][i]
        text_style = dict(fontweight='light', fontfamily='serif', color='white', va='center', ha='center')
        ax.annotate(f"{round(movie * 100)}%", xy=(movie / 2, 0), fontsize=40, **text_style)
        ax.annotate("Movie", xy=(movie / 2, -0.25), fontsize=15, **text_style)

        ax.annotate(f"{round(tv * 100)}%", xy=(movie + tv / 2, 0), fontsize=40, **text_style)
        ax.annotate("TV Show", xy=(movie + tv / 2, -0.25), fontsize=15, **text_style)

    # display
    for s in ['top', 'left', 'right', 'bottom']:
        ax.spines[s].set_visible(False)
    ax.legend().set_visible(False)
    ax.set_title(title)

In [None]:
q1_title = ['Netflix: The percentage of Movie and TV show', 'Disney+: The percentage of Movie and TV show']
net_type = data_type(net_df)
dis_type = data_type(dis_df)
for i, df in enumerate([net_type, dis_type]):
    fig, ax = plt.subplots(1, 1, figsize=(6.5, 2.5))
    q1_imageSet(df, ax, q1_title[i], all_color[i])

    plt.tight_layout()
    plt.show()

Both services have a similar ratio of movies to TV shows, about 7:3.
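
One way to check such a ratio numerically, sketched here on made-up data, is `value_counts(normalize=True)`. The series `toy_type` is a hypothetical stand-in for the 'type' column of `net_df` or `dis_df`.

```python
import pandas as pd

# Hypothetical column standing in for the 'type' column of net_df or dis_df
toy_type = pd.Series(['Movie'] * 7 + ['TV Show'] * 3, name='type')

# normalize=True returns shares instead of raw counts
share = toy_type.value_counts(normalize=True).round(2)
print(share)
```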

# The number of movies and TV shows added in each year

In [None]:
def q2_imageSet(df, fig, ax, title, color):
    for i, mtv in enumerate(df['type'].value_counts().index):
        mtv_rel = df[df['type']==mtv].date_added.dt.year.value_counts().sort_index()
        ax.plot(mtv_rel.index, mtv_rel, color=color[i], label=mtv)
        ax.fill_between(mtv_rel.index, 0, mtv_rel, color=color[i])

    # adjust the image; take the year range from the whole dataset, not only from the last type plotted
    years = df.date_added.dt.year
    minR, maxR = years.min(), years.max()
    ax.yaxis.tick_right()
    ax.axhline(y = 0, color = 'black', linewidth = 1.4)
    for s in ['top', 'right', 'bottom', 'left']:
        ax.spines[s].set_visible(False)
    ax.set_xlim(minR, maxR)
    ax.set_xticks(np.arange(minR, maxR + 1, 1))
    fig.text(0.5,0.7,"Movie", fontweight="bold", fontfamily='serif', fontsize=15, color=color[0])
    fig.text(0.56,0.7,"|", fontweight="bold", fontfamily='serif', fontsize=15, color='black')
    fig.text(0.57,0.7,"TV Show", fontweight="bold", fontfamily='serif', fontsize=15, color=color[1])
    ax.set_title(title)

In [None]:
q2_title = ['Netflix: The number of movies and tv shows which were added', 'Disney+: The number of movies and tv shows which were added']
for i, df in enumerate([net_df, dis_df]):
    fig, ax = plt.subplots(1, 1, figsize=(12, 6))
    q2_imageSet(df, fig, ax, q2_title[i], all_color[i])
    
    plt.tight_layout()
    plt.show()

These figures also show when the online services started: around 2008 for Netflix and 2019 for Disney+. (According to Wikipedia, Netflix started in 2007.)

* Netflix: The number of movies and TV shows added increased significantly from 2016 to 2020. The drop after 2020 appears because the data was only collected until mid-2021.
* Disney+: They added a lot of movies in their first year, but the number decreased significantly afterward. The number of TV shows added remains stable, but is very low.

Netflix has more options for users.
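
To read the exact numbers behind the area charts, the yearly counts can also be put into a table. This is a minimal sketch with made-up rows; `toy` is a hypothetical stand-in for `net_df` or `dis_df` after processing.

```python
import pandas as pd

# Hypothetical rows standing in for net_df / dis_df after processing
toy = pd.DataFrame({
    'type': ['Movie', 'Movie', 'TV Show', 'Movie', 'TV Show'],
    'add_year': [2019, 2020, 2020, 2020, 2021],
})

# rows are years, columns are types, cells are the number of titles added
added_per_year = pd.crosstab(toy['add_year'], toy['type'])
print(added_per_year)
```

Years in which a type has no titles show up as 0 instead of being left out.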

# Top 10 countries that produced the movies or TV shows

In [None]:
def q3_imageSet(data, ax, title, color):
    # highlight the top 3 countries, use the second color for the rest
    color_map = [color[0] if i < 3 else color[1] for i in range(len(data))]
    sns.barplot(x=data.index, y=data.values, ax=ax, palette=color_map)

    # show the values
    offset = data.max() * 0.05
    for i, value in enumerate(data.values):
        ax.annotate(f"{value}", xy=(i, value+offset), va='center', ha='center', fontweight='light', fontfamily='serif')

    ax.set_title(title)
    for s in ['top', 'left', 'right']:
        ax.spines[s].set_visible(False)
    ax.tick_params(axis='both', length=0)
    ax.set_xlabel('')

    # set grid line to make figure more pretty
    maxR = data.values.max()
    ax.grid(axis='y', linestyle='-', alpha=0.5)
    ax.set_yticks(np.arange(0, maxR, 200))
    ax.set_axisbelow(True)

In [None]:
q3_title = [('Top 10 countries with the most movies on Netflix', 'Top 10 countries with the most TV shows on Netflix')
           , ('Top 10 countries with the most movies on Disney+', 'Top 10 countries with the most TV shows on Disney+')]
for i, data in enumerate([(net_movie, net_tv), (dis_movie, dis_tv)]):
    movie, tv = data[0], data[1]
    movie_df = movie.main_country.value_counts()[:10]
    tv_df = tv.main_country.value_counts()[:10]
    fig, ax = plt.subplots(2, 1, figsize=(12, 6))
    q3_imageSet(movie_df, ax[0], q3_title[i][0], all_color[i])
    q3_imageSet(tv_df, ax[1], q3_title[i][1], all_color[i])
    plt.tight_layout()
    plt.show()

The USA produces the most movies and TV shows on both services, which is no surprise.

* Netflix: I knew that Europe also produces many movies, but India, Egypt and Nigeria really surprised me; I have never seen movies from these countries before. For TV shows, as a Taiwanese, I didn't know there are 70 Taiwanese TV shows on Netflix; the number exceeded my expectations.

* Disney+: apart from the USA, there are not many options.

# Observe the release years of movies and TV shows

In [None]:
def q4_processing(data):
    return data.groupby('release_year').agg(cnt=('show_id', 'count')).reset_index() \
    .sort_values(by='release_year', ascending=False)[:10]

In [None]:
def q4_imageSet(data, ax, title, color):
    img = sns.barplot(x='release_year', y='cnt', data=data, ax=ax, color=color)
    ax.set_title(title)
    ax.set_xlabel('')
    ax.set_ylabel('')
    for s in ['top', 'right']:
        ax.spines[s].set_visible(False)
    ax.bar_label(img.containers[0], label_type='center', color='white', fontsize=16)

In [None]:
# titles for the release-year figures
q4_title = [('Number of movies released in each year on Netflix', 'Number of TV shows released in each year on Netflix')
           , ('Number of movies released in each year on Disney+', 'Number of TV shows released in each year on Disney+')]
for i, data in enumerate([(net_movie, net_tv), (dis_movie, dis_tv)]):
    movie, tv = data[0], data[1]
    movie_df = q4_processing(movie)
    tv_df = q4_processing(tv)

    # create the image
    fig, ax = plt.subplots(2, 1, figsize=(12, 6))
    q4_imageSet(movie_df, ax[0], q4_title[i][0], all_color[i][0])
    q4_imageSet(tv_df, ax[1], q4_title[i][1], all_color[i][1])
    plt.tight_layout()
    plt.show()

Both services add older movies for their users.
* Netflix: Most of their movies were released 4 to 5 years ago. They tend not to add movies from too long ago, nor too many recent ones.
* Disney+: They add a lot of movies released in the last 3 years.

Neither service adds TV shows from too long ago.
* Netflix: They keep producing their own TV shows, and the number increases every year.
* Disney+: They started their service in 2019, so they added some older TV shows. However, they also keep producing their own TV shows.

# How long after release are the movies added?

Following the analysis above, I am curious how long it takes between release and being added to each service for the top 10 countries.

In [None]:
def q5_processing(df):
    country_list = df.groupby('main_country').show_id.agg(cnt='count').sort_values(by='cnt', ascending=False).reset_index()[:10]
    df = df[df.main_country.isin(country_list.main_country)]
    return df.groupby('main_country').agg(mean_rel=('release_year', lambda x: round(x.mean())), mean_add=('add_year', lambda x: round(x.mean()))) \
            .sort_values(by='mean_rel').reset_index()

In [None]:
def q5_imageSet(df, ax, title, color, r):
    ax.hlines(y=r, xmin=df['mean_rel'], xmax=df['mean_add'], color='grey')
    ax.scatter(df['mean_rel'], r, color=color[1], s=100, marker=4)
    ax.scatter(df['mean_add'], r, color=color[0], s=100, marker=5)

    for s in ['top', 'bottom', 'left', 'right']:
        ax.spines[s].set_visible(False)
        
    ax.tick_params(axis='both', length=0)
    ax.yaxis.tick_right()
    ax.set_title(title, loc='left', fontname='serif', fontsize=18)
    ax.set_yticks(r)
    ax.set_yticklabels(df['main_country'], fontname='serif', fontsize=12)

In [None]:
# for each type, select the top 10 countries with the most titles
q5_title = [('The average time between movie release and addition on Netflix', 'The average time between TV show release and addition on Netflix')
           , ('The average time between movie release and addition on Disney+', 'The average time between TV show release and addition on Disney+')]

for i, data in enumerate([(net_movie, net_tv), (dis_movie, dis_tv)]):
    movie, tv = data[0], data[1]
    movie_df = q5_processing(movie)
    tv_df = q5_processing(tv)

    rm = range(1, len(movie_df) + 1)
    rt = range(1, len(tv_df) + 1)

    fig, ax = plt.subplots(2, 1, figsize=(12, 8))
    q5_imageSet(movie_df, ax[0], q5_title[i][0], all_color[i], rm)
    q5_imageSet(tv_df, ax[1], q5_title[i][1], all_color[i], rt)

    fig.text(0.1, 0.56, 'Released', fontweight='bold', fontfamily='serif', fontsize=12, color=all_color[i][1])
    fig.text(0.77, 0.56, 'Added', fontweight='bold', fontfamily='serif', fontsize=12, color=all_color[i][0])
    fig.text(0.1, 0.07, 'Released', fontweight='bold', fontfamily='serif', fontsize=12, color=all_color[i][1])
    fig.text(0.77, 0.07, 'Added', fontweight='bold', fontfamily='serif', fontsize=12, color=all_color[i][0])

    plt.tight_layout()
    plt.show()

This analysis uses averages, so it isn't very accurate, but it is still interesting.
* Netflix: according to the analysis above, the USA and the UK produce many movies, but it takes quite a long time before those movies are added to Netflix. TV shows are stable, with an interval of around 2 years, even for the USA.
* Disney+: combined with the analysis above, the source of new movies may not be the USA. Many movies come from the USA, but most of them are very old. There are 105 movies with an unknown country, and their lag is only 3 years. TV shows have a similar issue, especially those from the USA.
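
Because a few very old titles can pull an average far away from what is typical, one possible refinement is to compute the lag per title and compare the mean with the median. The sketch below uses made-up numbers; `toy` and the `lag` column are hypothetical and not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical rows: one very old title among otherwise recent ones
toy = pd.DataFrame({
    'main_country': ['United States'] * 4 + ['India'] * 2,
    'release_year': [1990, 2018, 2019, 2020, 2017, 2019],
    'add_year':     [2020, 2019, 2020, 2021, 2019, 2020],
})

# lag in years between release and being added, per title
toy['lag'] = toy['add_year'] - toy['release_year']

# the median is less sensitive to the single old title than the mean
lag_summary = toy.groupby('main_country')['lag'].agg(['mean', 'median'])
print(lag_summary)
```

In this toy data the mean lag for the United States is 8.25 years while the median is 1 year, which shows how one classic title can dominate the average.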



# Rating analysis of the movies and TV shows

In [None]:
def add_miss(ori_df, df):
    # align the counts with every rating of the full dataset; ratings that do not occur get 0
    ori = ori_df.rating.value_counts().index
    return df.reindex(ori, fill_value=0)

In [None]:
def q6_imageSet(rate_movie_df, rate_tv_df, ax, title, color):
    # draw on the given axes instead of relying on the current one
    sns.barplot(x=rate_movie_df.index, y=rate_movie_df.values, color=color[0], width=0.5, ax=ax)
    sns.barplot(x=rate_tv_df.index, y=rate_tv_df.values, color=color[1], width=0.5, ax=ax)

    # show the value
    offset = max(rate_movie_df.max(), rate_tv_df.abs().max()) * 0.05
    for i, value in enumerate(rate_movie_df.values):
        ax.annotate(f"{value}", xy=(i, value + offset), va='center', ha='center',fontweight='light', fontfamily='serif')
    for i, value in enumerate(rate_tv_df.values):
        ax.annotate(f"{-1 * value}", xy=(i, value - offset), va='center', ha='center',fontweight='light', fontfamily='serif')

    # set the figure background
    for s in ['top', 'left', 'right', 'bottom']:
        ax.spines[s].set_visible(False)
        
    ax.set_xlabel('')
    ax.tick_params(axis='both', length=0)
    ax.set_title(title)
    ax.set_yticks([])
    #plt.tight_layout()

In [None]:
# groupby rating
q6_title = ['Ratings of Movies and TV Shows on Netflix', 'Ratings of Movies and TV Shows on Disney+']
ori_list = [net_df, dis_df]
for i, data in enumerate([(net_movie, net_tv), (dis_movie, dis_tv)]):
    movie, tv = data[0], data[1]
    rate_movie_df = add_miss(ori_list[i], movie.rating.value_counts())
    rate_tv_df = add_miss(ori_list[i], tv.rating.value_counts())
    rate_tv_df = -1 * rate_tv_df

    fig, ax = plt.subplots(1, 1, figsize=(12, 6))
    q6_imageSet(rate_movie_df, rate_tv_df, ax, q6_title[i], all_color[i])
    fig.text(0.7, 0.76, 'Movie', fontweight="bold", fontfamily='serif', fontsize=15, color=all_color[i][0])
    fig.text(0.76, 0.76, '|', fontweight="bold", fontfamily='serif', fontsize=15, color=all_color[i][1])
    fig.text(0.77, 0.76, 'TV Show', fontweight="bold", fontfamily='serif', fontsize=15, color=all_color[i][1])

    plt.tight_layout()
    plt.show()

* Netflix: their target audience is clearly adults, as many of the movies and TV shows provided are rated TV-MA or R.
* Disney+: their target audience is very different from Netflix's; they don't provide any TV-MA or R content. TV-G, TV-PG, G and PG are the most common ratings. They provide a lot of content for teens and children.
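
The share of adult-rated titles can also be expressed as a single number per type. This is a minimal sketch with made-up rows; `toy` is hypothetical, and treating only TV-MA and R as adult ratings is an assumption that follows the two ratings named above.

```python
import pandas as pd

# Hypothetical rows standing in for net_df or dis_df
toy = pd.DataFrame({
    'type':   ['Movie', 'Movie', 'Movie', 'TV Show', 'TV Show'],
    'rating': ['R', 'TV-MA', 'PG', 'TV-MA', 'TV-PG'],
})

# True for titles rated TV-MA or R (assumed definition of adult content)
mature = toy['rating'].isin(['TV-MA', 'R'])

# the mean of a boolean column is the share of True values
mature_share = mature.groupby(toy['type']).mean().round(2)
print(mature_share)
```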

# Observe the genres of movies and TV shows

In [None]:
def compute_genre(data):
    # split each comma-separated genre string into a list of genres
    data = data.apply(lambda x : x.replace(' ,', ',').replace(', ', ',').split(','))
    # count how many titles are listed under each genre
    table = {}
    for genres in data:
        for g in genres:
            table[g] = table.get(g, 0) + 1
    df = pd.DataFrame.from_dict(table, orient='index', columns=['cnt'])
    # keep the 12 most common genres
    return df.sort_values(by='cnt', ascending=False)[:12]

In [None]:
def q7_pie(data, ax, title):
    ax.set_title(title)
    ax.pie(data.cnt, labels=data.index, wedgeprops={'linewidth':3, 'edgecolor':'w', 'width':0.6}, autopct='%.1f%%')

In [None]:
# clean the genre, and calculate the rate of genre 
q7_title = [('The percentage of Movie genre on Netflix', 'The percentage of TV Show genre on Netflix')
           , ('The percentage of Movie genre on Disney+', 'The percentage of TV Show genre on Disney+')]
for i, data in enumerate([(net_movie, net_tv), (dis_movie, dis_tv)]):
    movie, tv = data[0], data[1]
    movie_genre = compute_genre(movie['listed_in'])
    tv_genre = compute_genre(tv['listed_in'])

    fig, ax = plt.subplots(1, 2, figsize=(12, 8))
    q7_pie(movie_genre, ax[0], q7_title[i][0])
    q7_pie(tv_genre, ax[1], q7_title[i][1])

    plt.tight_layout()
    plt.show()

* Netflix: much of the content is drama, documentary and crime, which is consistent with the results of the rating analysis.
* Disney+: much of the content is family, animation and kids. They provide many classic animations for teens and children, such as Frozen, Snow White, and Finding Nemo.
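
Note that a title listed under several genres is counted once per genre, so the pie charts show shares of genre tags (within the top 12), not shares of titles. The same counting can be sketched with pandas string methods; `toy_listed_in` is a made-up stand-in for the 'listed_in' column.

```python
import pandas as pd

# Hypothetical values standing in for the 'listed_in' column
toy_listed_in = pd.Series([
    'Dramas, International Movies',
    'Documentaries',
    'Dramas, Comedies',
])

# one row per genre tag, then the share of each tag among all tags
genre_share = (
    toy_listed_in.str.split(',')
    .explode()
    .str.strip()
    .value_counts(normalize=True)
)
print(genre_share)
```

Here 'Dramas' appears in 2 of the 5 tags, so its share is 0.4 even though it is listed on 2 of the 3 titles.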

After reading this analysis, which service do you prefer?

Personally, I prefer Netflix. As an adult who enjoys thrilling movies and TV shows, I find that Netflix provides more content that aligns with my interests. I tend to watch more American content, and compared to Disney+, Netflix makes the content I want available more quickly and also adds content released in recent years.

If I have children, maybe I will subscribe to Disney+ for them.

# Thanks for reading.