# Introduction 

## Datasets Overview

This project uses five Kaggle datasets, one per streaming platform, each listing the platform's catalog together with IMDb ratings. The datasets are updated daily, so the catalogs stay current.

1. **Netflix**

    * Source: [Netflix Movies & TV Series Dataset](https://www.kaggle.com/datasets/octopusteam/full-netflix-dataset)
    * Description: A complete collection of Netflix's available titles (movies and TV series) with IMDb-specific data such as IMDb ID, average rating, and number of votes.

2. **Apple TV+**

    * Source: [Full Apple TV+ Dataset](https://www.kaggle.com/datasets/octopusteam/full-apple-tv-dataset)
    * Description: A dataset covering all Apple TV+ titles, including key IMDb data for in-depth analysis of content quality.

3. **HBO Max**

    * Source: [Full HBO Max Dataset](https://www.kaggle.com/datasets/octopusteam/full-hbo-max-dataset)
    * Description: An extensive collection of titles on HBO Max with associated IMDb data for comparison.

4. **Amazon Prime**

    * Source: [Full Amazon Prime Dataset](https://www.kaggle.com/datasets/octopusteam/full-amazon-prime-dataset)
    * Description: Comprehensive data on Amazon Prime's movie and TV series offerings, including IMDb-specific metrics.

5. **Hulu**

    * Source: [Full Hulu Dataset](https://www.kaggle.com/datasets/octopusteam/full-hulu-dataset)
    * Description: A dataset detailing Hulu's catalog with IMDb-related columns for evaluating content quality and popularity.

Each of the streaming platform datasets includes the following columns:

* **title**: Name of the content.
* **type**: Either "movie" or "tv series."
* **genres**: Genres associated with the title.
* **releaseYear**: Year the title was released.
* **imdbId**: Unique IMDb identifier.
* **imdbAverageRating**: Average user rating on IMDb.
* **imdbNumVotes**: Number of votes received on IMDb.
* **availableCountries**: Countries where the title is available.

This initial exploration will set the stage for more advanced analyses, such as clustering, statistical comparisons, and the evaluation of platform value.

### Imports

In [1]:
# Import Libraries

import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import sys
sys.path.append(r"C:\Users\kimbe\Documents\StreamingAnalysis\scripts")  # Make the custom scripts importable
# Set to show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)


In [3]:
# Import custom scripts

from utils import *
from explore import *


In [4]:
# Define the base path
base_path = r"C:\Users\kimbe\Documents\StreamingAnalysis\data\raw_data"

# Load the datasets
amazon_df = pd.read_csv(os.path.join(base_path, 'amazon_catalog.csv'))
hulu_df = pd.read_csv(os.path.join(base_path, 'hulu_catalog.csv'))
netflix_df = pd.read_csv(os.path.join(base_path, 'netflix_catalog.csv'))
hbo_df = pd.read_csv(os.path.join(base_path, 'hbo_catalog.csv'))
apple_df = pd.read_csv(os.path.join(base_path, 'apple_catalog.csv'))

# List of datasets
datasets = [amazon_df, hulu_df, netflix_df, hbo_df, apple_df]
platforms = ["Amazon", "Hulu", "Netflix", "HBO", "Apple"]


In [5]:
# Count and remove rows that are not available in the US
cleaned_dfs = process_platform_data(datasets, platforms)
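
`process_platform_data` lives in the custom scripts and is not shown here. As a rough illustration of the US filter it describes, the sketch below assumes `availableCountries` holds a comma-separated string of country codes (for example `"US, CA"`); both that format and the helper name `filter_us_titles` are assumptions, not the project's actual implementation.

```python
import pandas as pd


def filter_us_titles(df: pd.DataFrame, country_code: str = "US") -> pd.DataFrame:
    """Keep rows whose availableCountries entry contains country_code."""
    # Assumption: availableCountries is a comma-separated string such as "US, CA".
    countries = df["availableCountries"].fillna("").str.split(",")
    is_available = countries.apply(
        lambda codes: country_code in [code.strip() for code in codes]
    )
    # Return a copy so later in-place edits do not touch the original frame.
    return df[is_available].copy()


# Toy data standing in for one platform catalog.
sample = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "availableCountries": ["US, CA", "GB", None],
    }
)
us_only = filter_us_titles(sample)
print(f"Removed {len(sample) - len(us_only)} of {len(sample)} rows")
```

Matching on the split codes rather than on a substring avoids false positives from codes that merely contain the letters "US".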


## Platform Dataset Exploration

In [10]:
# Compute the percentage of rows missing imdbId per platform, then remove those rows
missing_percent_df, cleaned_datasets = calculate_missing_percent_and_clean(datasets, platforms)

# Display the missing percentages
missing_percent_df

index,Platform,Missing Percent
1,Hulu,9.307972
0,Amazon,8.441807
4,Apple,7.342813
2,Netflix,6.437875
3,HBO,5.236838
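
The helper `calculate_missing_percent_and_clean` is defined in the custom scripts. A minimal sketch of how such a report could be produced is shown below; the function name `missing_imdb_id_percent` and the toy frames are hypothetical.

```python
import pandas as pd


def missing_imdb_id_percent(frames: dict) -> pd.DataFrame:
    """Percentage of rows with a missing imdbId, one row per platform."""
    rows = []
    for platform, df in frames.items():
        # isna().mean() is the fraction of missing values in the column.
        percent = df["imdbId"].isna().mean() * 100
        rows.append({"Platform": platform, "Missing Percent": percent})
    return pd.DataFrame(rows).sort_values("Missing Percent", ascending=False)


# Toy frames standing in for the platform catalogs.
frames = {
    "A": pd.DataFrame({"imdbId": ["tt1", None, "tt3", "tt4"]}),
    "B": pd.DataFrame({"imdbId": ["tt1", "tt2"]}),
}
report = missing_imdb_id_percent(frames)

# Drop the rows without an imdbId, since they cannot be joined to IMDb data.
cleaned = {name: df.dropna(subset=["imdbId"]) for name, df in frames.items()}
print(report)
```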


In [11]:
# Drop rows with a duplicate imdbId from every platform dataset
datasets = drop_duplicates_by_imdbId(datasets)

# Check the number of rows before and after dropping duplicates
for df in datasets:
    print(f"Rows after dropping duplicates: {df.shape[0]}")

Rows after dropping duplicates: 62159
Rows after dropping duplicates: 8961
Rows after dropping duplicates: 18889
Rows after dropping duplicates: 6822
Rows after dropping duplicates: 16680
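
For reference, de-duplicating on `imdbId` can be written with `DataFrame.drop_duplicates`. The sketch below is an assumption about what `drop_duplicates_by_imdbId` does, not its actual source, and it pairs each count with its platform name so the printed lines are easier to read.

```python
import pandas as pd


def drop_duplicate_ids(frames):
    """Keep the first row for each imdbId in every frame."""
    return [
        df.drop_duplicates(subset="imdbId", keep="first").reset_index(drop=True)
        for df in frames
    ]


# Toy data: the first frame contains one repeated imdbId.
toy_names = ["A", "B"]
toy_frames = [
    pd.DataFrame({"imdbId": ["tt1", "tt1", "tt2"], "title": ["x", "x", "y"]}),
    pd.DataFrame({"imdbId": ["tt3"], "title": ["z"]}),
]
deduped = drop_duplicate_ids(toy_frames)
for name, df in zip(toy_names, deduped):
    print(f"{name}: {df.shape[0]} rows after dropping duplicates")
```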


### Check Basic Structure

In [12]:
# Generate the separate summaries
separate_summaries = dataset_summary_separate(datasets, platforms)

# Access individual tables
missing_values_table = separate_summaries["Missing Values"]
data_types_table = separate_summaries["Data Type"]
unique_values_table = separate_summaries["Unique Values"]
duplicates_table = separate_summaries["Duplicates"]
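
`dataset_summary_separate` is a custom helper whose source is not included. One way such per-metric tables could be assembled is sketched below, with dataset columns as rows and platforms as columns; the function name `summarize` and the toy frames are hypothetical.

```python
import pandas as pd


def summarize(frames, names):
    """Return one table per metric: dataset columns as rows, platforms as columns."""
    pairs = list(zip(names, frames))
    return {
        "Missing Values": pd.DataFrame({n: df.isna().sum() for n, df in pairs}),
        "Data Type": pd.DataFrame({n: df.dtypes.astype(str) for n, df in pairs}),
        "Unique Values": pd.DataFrame({n: df.nunique() for n, df in pairs}),
    }


# Toy frames standing in for two platform catalogs.
toy_frames = [
    pd.DataFrame({"imdbId": ["tt1", "tt2"], "releaseYear": [1902.0, None]}),
    pd.DataFrame({"imdbId": ["tt3", "tt3"], "releaseYear": [2015.0, 2015.0]}),
]
summary = summarize(toy_frames, ["A", "B"])
print(summary["Missing Values"])
```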

#### Missing Values

In [13]:
# How many missing values each column contains for every dataset.
print("----- Missing Values -----")
missing_values_table

----- Missing Values -----


Column,Amazon,Hulu,Netflix,HBO,Apple
title,0,0,0,0,0
type,0,0,0,0,0
genres,89,17,1,8,134
releaseYear,3,1,0,1,1
imdbId,0,0,0,0,0
imdbAverageRating,2164,273,151,72,412
imdbNumVotes,2164,273,151,72,412
availableCountries,0,0,0,0,0


In [14]:
# Remove rows missing 'releaseYear'
cleaned_datasets = []
for platform, df in zip(platforms, datasets):
    print(f"--- {platform} Dataset ---")
    print(f"Rows before removal: {df.shape[0]}")
    df_cleaned = df.dropna(subset=['releaseYear'])
    cleaned_datasets.append(df_cleaned)
    print(f"Rows after removal: {df_cleaned.shape[0]}\n")

# Update the datasets list with cleaned datasets
datasets = cleaned_datasets

--- Amazon Dataset ---
Rows before removal: 62159
Rows after removal: 62156

--- Hulu Dataset ---
Rows before removal: 8961
Rows after removal: 8960

--- Netflix Dataset ---
Rows before removal: 18889
Rows after removal: 18889

--- HBO Dataset ---
Rows before removal: 6822
Rows after removal: 6821

--- Apple Dataset ---
Rows before removal: 16680
Rows after removal: 16679



In [15]:
# Work on copies of the original DataFrames
amazon_df = amazon_df.copy()
hulu_df = hulu_df.copy()
netflix_df = netflix_df.copy()
hbo_df = hbo_df.copy()
apple_df = apple_df.copy()

# List of columns to drop
columns_to_drop = ['imdbAverageRating', 'imdbNumVotes']

# Drop these columns from every platform dataset
datasets = drop_columns_from_datasets(datasets, columns_to_drop)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(columns=columns_to_drop, inplace=True, errors='ignore')
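
The message above is pandas' `SettingWithCopyWarning`. It typically appears when a DataFrame obtained by filtering another one (here, the frames produced by `dropna`) is then modified with `inplace=True`. Copying `amazon_df` and the other named frames does not change the objects stored in `datasets`, so the warning still fires. A minimal sketch of one way to avoid it is to return new frames instead of modifying in place; the helper name `drop_columns` is hypothetical.

```python
import pandas as pd


def drop_columns(frames, columns):
    """Return new frames without the given columns; inputs are left untouched."""
    # drop() without inplace=True builds a new DataFrame, so no slice is modified.
    return [df.drop(columns=columns, errors="ignore") for df in frames]


# Toy frame, filtered first to mimic the dropna step earlier in the notebook.
raw = pd.DataFrame(
    {
        "imdbId": ["tt1", "tt2"],
        "releaseYear": [1902.0, None],
        "imdbNumVotes": [10.0, 20.0],
    }
)
filtered = raw.dropna(subset=["releaseYear"])
result = drop_columns([filtered], ["imdbAverageRating", "imdbNumVotes"])
print(result[0].columns.tolist())
```

Calling `.copy()` on each filtered frame before an in-place drop would also silence the warning.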


In [16]:
# Reprint the missing values table after cleaning
missing_values_table = print_missing_values(datasets, platforms)
missing_values_table

----- Missing Values -----


Column,Amazon,Hulu,Netflix,HBO,Apple
title,0,0,0,0,0
type,0,0,0,0,0
genres,89,17,1,8,134
releaseYear,0,0,0,0,0
imdbId,0,0,0,0,0
availableCountries,0,0,0,0,0


In [17]:
# Number of unique values in each column for every dataset.
print("----- Unique Values -----")
unique_values_table

----- Unique Values -----


Column,Amazon,Hulu,Netflix,HBO,Apple
title,58087,8790,18235,6703,16122
type,2,2,2,2,2
genres,1113,550,678,504,712
releaseYear,117,90,106,116,105
imdbId,62159,8961,18889,6822,16680
imdbAverageRating,90,75,82,77,80
imdbNumVotes,13605,4739,9450,4923,7940
availableCountries,11714,3,6175,797,315


In [18]:
# Data type of each column for every dataset
print("----- Data Types -----")
data_types_table

----- Data Types -----


Column,Amazon,Hulu,Netflix,HBO,Apple
title,object,object,object,object,object
type,object,object,object,object,object
genres,object,object,object,object,object
releaseYear,float64,float64,float64,float64,float64
imdbId,object,object,object,object,object
imdbAverageRating,float64,float64,float64,float64,float64
imdbNumVotes,float64,float64,float64,float64,float64
availableCountries,object,object,object,object,object


In [19]:
# Number of repeated values in each column for every dataset
print("----- Duplicate Values -----")
duplicates_table

----- Duplicate Values -----


Column,Amazon,Hulu,Netflix,HBO,Apple
title,4072,171,654,119,558
type,62157,8959,18887,6820,16678
genres,61045,8410,18210,6317,15967
releaseYear,62041,8870,18783,6705,16574
imdbId,0,0,0,0,0
imdbAverageRating,62068,8885,18806,6744,16599
imdbNumVotes,48553,4221,9438,1898,8739
availableCountries,50445,8958,12714,6025,16365
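
These counts equal the row count minus the number of distinct values in each column. For Amazon's `title`, 62159 - 58087 = 4072. For `genres`, the missing value counts as one extra distinct value, so 62159 - (1113 + 1) = 61045. That behaviour is consistent with `Series.duplicated()`, although the helper that built the table is not shown. A small sketch on toy data:

```python
import pandas as pd

# Toy column with repeated values and repeated missing values.
df = pd.DataFrame({"genres": ["Drama", "Drama", None, None, "Comedy"]})

# duplicated() flags every occurrence after the first, including repeated NaN.
duplicate_count = int(df["genres"].duplicated().sum())

# nunique() skips NaN by default; dropna=False counts it as a distinct value.
unique_count = df["genres"].nunique()
unique_with_nan = df["genres"].nunique(dropna=False)

print(duplicate_count, unique_count, unique_with_nan)
```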


### Filter IMDb datasets so we can use them to fill missing values

In [None]:
# Extract the unique imdbId values from each platform
platform_imdb_ids = {
    "Amazon": amazon_df["imdbId"].unique(),
    "Hulu": hulu_df["imdbId"].unique(),
    "Netflix": netflix_df["imdbId"].unique(),
    "HBO": hbo_df["imdbId"].unique(),
    "Apple": apple_df["imdbId"].unique()
}

# Combine the imdbIds from all platforms into a single set (unique values only)
all_platform_imdb_ids = set()
for platform in platform_imdb_ids:
    all_platform_imdb_ids.update(platform_imdb_ids[platform])



Filtered IMDb DataFrame Shape: (89159, 6)


index,imdbId,type,title,genres,averageRating,numVotes
300,tt0000417,short,A Trip to the Moon,"Adventure,Comedy,Fantasy",8.1,57373
337,tt0000499,short,An Impossible Voyage,"Action,Adventure,Family",7.5,4153
1216,tt0002646,movie,Atlantis,Drama,6.5,502
1285,tt0003014,movie,Ingeborg Holm,Drama,7.0,1479
1543,tt0004181,movie,Judith of Bethulia,Drama,6.2,1482
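
The step that builds `filtered_imdb_info` is not included in this export. A minimal sketch of filtering an IMDb table down to a set of IDs is shown below; `imdb_df` and its contents are hypothetical stand-ins, and only the column names are taken from the output above.

```python
import pandas as pd

# Hypothetical stand-in for the full IMDb table (column names as in the output above).
imdb_df = pd.DataFrame(
    {
        "imdbId": ["tt0000001", "tt0000417", "tt0002646"],
        "title": ["Carmencita", "A Trip to the Moon", "Atlantis"],
    }
)

# IDs collected from the platform catalogs (toy subset).
platform_ids = {"tt0000417", "tt0002646"}

# isin() keeps only the rows whose imdbId appears in the set.
filtered = imdb_df[imdb_df["imdbId"].isin(platform_ids)]
print(f"Filtered IMDb DataFrame Shape: {filtered.shape}")
```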


In [22]:
missing_values = filtered_imdb_info.isnull().sum()
missing_values

imdbId             0
type               0
title              0
genres           605
averageRating      0
numVotes           0
dtype: int64

In [26]:
# Platform Info
platforms = ['Amazon', 'Hulu', 'Netflix', 'HBO', 'Apple']
platform_dfs = [amazon_df, hulu_df, netflix_df, hbo_df, apple_df]

# Apply the function for each platform dataset
for platform_name, df in zip(platforms, platform_dfs):
    add_platform_column(df, platform_name)

## Merging Platforms

In [27]:
# Merge the datasets
merged_platforms_df = merge_platform_datasets(platforms, platform_dfs)
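
`merge_platform_datasets` comes from the custom scripts. The sketch below shows one way to arrive at the same shape of result, with one row per `imdbId` and a 0/1 column per platform. It is an illustration under that assumption, and the helper name `merge_with_flags` is hypothetical.

```python
import pandas as pd


def merge_with_flags(names, frames):
    """One row per imdbId, plus a 0/1 availability column for each platform."""
    id_cols = ["imdbId", "type", "genres", "releaseYear", "title"]
    combined = pd.concat([df[id_cols] for df in frames], ignore_index=True)
    # Keep the first description seen for each title.
    merged = combined.drop_duplicates(subset="imdbId").reset_index(drop=True)
    for name, df in zip(names, frames):
        # 1 if the title appears in this platform's catalog, else 0.
        merged[name] = merged["imdbId"].isin(df["imdbId"]).astype(int)
    return merged


def toy_catalog(ids):
    """Build a tiny catalog with the shared columns for the given IDs."""
    return pd.DataFrame(
        {
            "imdbId": ids,
            "type": "movie",
            "genres": "Drama",
            "releaseYear": 1913.0,
            "title": [f"Title {i}" for i in ids],
        }
    )


toy_merged = merge_with_flags(
    ["A", "B"], [toy_catalog(["tt1", "tt2"]), toy_catalog(["tt2", "tt3"])]
)
print(toy_merged[["imdbId", "A", "B"]])
```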

In [28]:
merged_platforms_df.head()

index,imdbId,type,genres,releaseYear,title,Amazon,Hulu,Netflix,HBO,Apple
0,tt0000417,movie,"Action, Adventure, Comedy",1902.0,A Trip to the Moon,1,0,0,1,1
1,tt0000499,movie,"Action, Adventure, Family",1904.0,An Impossible Voyage,0,0,0,1,0
2,tt0002646,movie,Drama,1913.0,Atlantis,1,0,0,0,0
3,tt0003014,movie,Drama,1913.0,Ingeborg Holm,0,0,1,0,0
4,tt0004181,movie,Drama,1914.0,Judith of Bethulia,1,0,0,0,0


In [29]:
# Convert releaseYear to int; missing years are filled with 0 first
merged_platforms_df['releaseYear'] = merged_platforms_df['releaseYear'].fillna(0).astype(int)
streaming_platforms = merged_platforms_df
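
Filling missing years with 0 keeps the column a plain integer but introduces a release year of 0, which is presumably why `describe()` later reports a minimum `releaseYear` of 0. As an alternative sketch, pandas' nullable `Int64` dtype stores integers while keeping missing values as `<NA>`:

```python
import pandas as pd

years = pd.Series([1902.0, None, 2015.0], name="releaseYear")

# Approach used in the notebook: missing years become 0.
filled = years.fillna(0).astype(int)

# Alternative: nullable integer dtype keeps the gap as <NA>.
nullable = years.astype("Int64")

print(filled.min(), nullable.min(), nullable.isna().sum())
```

With the nullable dtype, summary statistics such as the minimum and mean skip the missing entries instead of being pulled toward 0.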

## Final Output After Cleaning

In [30]:
streaming_platforms.head(10)

index,imdbId,type,genres,releaseYear,title,Amazon,Hulu,Netflix,HBO,Apple
0,tt0000417,movie,"Action, Adventure, Comedy",1902,A Trip to the Moon,1,0,0,1,1
1,tt0000499,movie,"Action, Adventure, Family",1904,An Impossible Voyage,0,0,0,1,0
2,tt0002646,movie,Drama,1913,Atlantis,1,0,0,0,0
3,tt0003014,movie,Drama,1913,Ingeborg Holm,0,0,1,0,0
4,tt0004181,movie,Drama,1914,Judith of Bethulia,1,0,0,0,0
5,tt0004873,movie,"Adventure, Family, Fantasy",1915,Alice in Wonderland,1,0,0,0,0
6,tt0004972,movie,"Drama, War",1915,The Birth of a Nation,1,0,0,0,0
7,tt0005078,movie,"Drama, Romance",1915,The Cheat,1,0,0,0,0
8,tt0005302,movie,Drama,1915,"Fanchon, the Cricket",1,0,0,0,0
9,tt0005339,movie,Drama,1915,A Fool There Was,1,0,0,0,0


In [31]:
filtered_imdb_info.head(10)

index,imdbId,type,title,genres,averageRating,numVotes
300,tt0000417,short,A Trip to the Moon,"Adventure,Comedy,Fantasy",8.1,57373
337,tt0000499,short,An Impossible Voyage,"Action,Adventure,Family",7.5,4153
1216,tt0002646,movie,Atlantis,Drama,6.5,502
1285,tt0003014,movie,Ingeborg Holm,Drama,7.0,1479
1543,tt0004181,movie,Judith of Bethulia,Drama,6.2,1482
1683,tt0004873,movie,Alice in Wonderland,"Adventure,Family,Fantasy",6.1,805
1707,tt0004972,movie,The Birth of a Nation,"Drama,War",6.1,26924
1735,tt0005078,movie,The Cheat,"Drama,Romance",6.5,2893
1777,tt0005302,movie,"Fanchon, the Cricket",Drama,6.4,366
1793,tt0005339,movie,A Fool There Was,Drama,5.7,1080


In [32]:
streaming_platforms.describe()

statistic,releaseYear,Amazon,Hulu,Netflix,HBO,Apple
count,92049.0,92049.0,92049.0,92049.0,92049.0,92049.0
mean,2007.404089,0.675282,0.09735,0.205206,0.074113,0.181208
std,24.008189,0.468272,0.296436,0.403854,0.261956,0.385192
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,2004.0,0.0,0.0,0.0,0.0,0.0
50%,2015.0,1.0,0.0,0.0,0.0,0.0
75%,2020.0,1.0,0.0,0.0,0.0,0.0
max,2025.0,1.0,1.0,1.0,1.0,1.0


In [33]:
filtered_imdb_info.describe()

statistic,averageRating,numVotes
count,89159.0,89159.0
mean,6.114558,12175.78
std,1.311521,66964.57
min,1.0,5.0
25%,5.3,114.0
50%,6.3,508.0
75%,7.1,2594.0
max,9.9,2970653.0


In [34]:
null_counts = streaming_platforms.isnull().sum()
print(null_counts)

imdbId           0
type             0
genres         245
releaseYear      0
title            0
Amazon           0
Hulu             0
Netflix          0
HBO              0
Apple            0
dtype: int64


In [35]:
null_counts = filtered_imdb_info.isnull().sum()
print(null_counts)

imdbId             0
type               0
title              0
genres           605
averageRating      0
numVotes           0
dtype: int64


In [36]:
print(filtered_imdb_info.columns)
print(streaming_platforms.columns)


Index(['imdbId', 'type', 'title', 'genres', 'averageRating', 'numVotes'], dtype='object')
Index(['imdbId', 'type', 'genres', 'releaseYear', 'title', 'Amazon', 'Hulu',
       'Netflix', 'HBO', 'Apple'],
      dtype='object')


In [37]:
# Create a copy of filtered_imdb_info to avoid the SettingWithCopyWarning
filtered_imdb_info_copy = filtered_imdb_info.copy()

# Rename 'genres' column in the copy of filtered_imdb_info
filtered_imdb_info_copy.rename(columns={'genres': 'imdb_genres'}, inplace=True)

# Merge streaming_platforms with the modified filtered_imdb_info copy
final_df = pd.merge(streaming_platforms, filtered_imdb_info_copy[['imdbId', 'imdb_genres', 'averageRating', 'numVotes']], 
                     on='imdbId', how='left')

# Fill missing 'genres' in streaming_platforms with 'imdb_genres' from filtered_imdb_info
final_df['genres'] = final_df['genres'].fillna(final_df['imdb_genres'])

# Drop the temporary 'imdb_genres' column
final_df.drop(columns=['imdb_genres'], inplace=True)

streaming_titles_info = final_df



In [38]:
# Check for NaN and infinite values in the merged dataset
nan_values_merged  = streaming_titles_info.isna().sum()
inf_values_merged = (streaming_titles_info == float('inf')).sum() + (streaming_titles_info == float('-inf')).sum()

# Print results
print(f"NaN values in each column in the merged dataset:\n{nan_values_merged}")
print(f"Infinite values in each column in the merged dataset:\n{inf_values_merged}")


NaN values in each column in the merged dataset:
imdbId              0
type                0
genres            245
releaseYear         0
title               0
Amazon              0
Hulu                0
Netflix             0
HBO                 0
Apple               0
averageRating    2890
numVotes         2890
dtype: int64
Infinite values in each column in the merged dataset:
imdbId           0
type             0
genres           0
releaseYear      0
title            0
Amazon           0
Hulu             0
Netflix          0
HBO              0
Apple            0
averageRating    0
numVotes         0
dtype: int64
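
Since `filtered_imdb_info` has no missing ratings, the 2890 missing `averageRating` values most likely belong to titles with no matching `imdbId` in the IMDb table. A minimal sketch of how unmatched rows could be identified with the `indicator` option of `DataFrame.merge` follows; the toy frames are hypothetical.

```python
import pandas as pd

# Toy stand-ins for streaming_platforms and the renamed IMDb table.
titles = pd.DataFrame(
    {"imdbId": ["tt1", "tt2", "tt3"], "genres": ["Drama", None, None]}
)
ratings = pd.DataFrame(
    {
        "imdbId": ["tt1", "tt2"],
        "imdb_genres": ["Drama", "Comedy"],
        "averageRating": [7.0, 6.5],
    }
)

# validate raises if the IMDb side has repeated IDs; indicator adds a _merge column.
merged = titles.merge(
    ratings, on="imdbId", how="left", validate="many_to_one", indicator=True
)
merged["genres"] = merged["genres"].fillna(merged["imdb_genres"])

# Rows found only on the left side have no IMDb match, hence no rating.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["imdbId"].tolist())
```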


In [39]:
# Specify the file path
file_path = r'C:\Users\kimbe\Documents\StreamingAnalysis\data\cleaned_data\streaming_titles_info.csv'

# Save the final merged DataFrame to the specified path
streaming_titles_info.to_csv(file_path, index=False)
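
The absolute Windows paths used throughout this notebook only work on one machine. As a hedged sketch, `pathlib` can build the output path relative to a project root and create the folder if it is missing. The helper name `save_cleaned` is hypothetical, and a temporary directory stands in for the project root.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_cleaned(df, project_root, filename="streaming_titles_info.csv"):
    """Write df to <project_root>/data/cleaned_data/<filename> and return the path."""
    out_dir = Path(project_root) / "data" / "cleaned_data"
    # Create the folder tree if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    df.to_csv(out_path, index=False)
    return out_path


# A temporary directory stands in for the real project root.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_cleaned(pd.DataFrame({"imdbId": ["tt1"]}), tmp)
    round_trip = pd.read_csv(saved)

print(saved.name)
```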