# Introduction 

The '01_exploration.ipynb' notebook serves as the foundation for understanding the data landscape of this project: streaming platform catalogs and supplemental IMDb data. It explores the structure, content, and quality of the datasets collected from multiple sources.

### Exploration Goals
* Understand the structure of each dataset.
* Identify potential issues (e.g., missing values, duplicates, outliers).
* Generate summary statistics to get an overview.
* Visualize initial trends (e.g., genre distribution, ratings).
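
The goals above map onto a handful of pandas calls. The sketch below is a minimal illustration rather than the notebook's actual code: the helper name `quick_overview` and the toy data are assumptions made for the example.

```python
import pandas as pd


def quick_overview(df: pd.DataFrame) -> dict:
    """Collect basic structure and quality indicators for one dataset.

    Hypothetical helper for illustration; not part of the project's scripts.
    """
    return {
        "shape": df.shape,                             # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),     # column -> dtype name
        "missing": df.isna().sum().to_dict(),          # column -> number of missing values
        "duplicate_rows": int(df.duplicated().sum()),  # rows identical to an earlier row
    }


# Toy data standing in for a platform dataset (values are made up).
sample = pd.DataFrame(
    {
        "title": ["A", "B", "B", "C"],
        "releaseYear": [2001.0, 2015.0, 2015.0, None],
        "imdbAverageRating": [7.1, 6.4, 6.4, None],
    }
)

overview = quick_overview(sample)
print(overview["shape"], overview["missing"], overview["duplicate_rows"])

# Summary statistics for the numeric columns.
print(sample.describe())
```

Returning a plain dictionary keeps the result easy to compare across the platform dataframes, for example by collecting one overview per platform into a single table.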

## Datasets Overview

This project uses five Kaggle datasets covering popular streaming platforms, each enriched with IMDb ratings. Each dataset is updated daily, so the catalogs stay current.

1. **Netflix**

    * Source: [Netflix Movies & TV Series Dataset](https://www.kaggle.com/datasets/octopusteam/full-netflix-dataset)
    * Description: A complete collection of Netflix's available titles (movies and TV series) with IMDb-specific data such as IMDb ID, average rating, and number of votes.

2. **Apple TV+**

    * Source: [Full Apple TV+ Dataset](https://www.kaggle.com/datasets/octopusteam/full-apple-tv-dataset)
    * Description: A dataset covering all Apple TV+ titles, including key IMDb data for in-depth analysis of content quality.

3. **HBO Max**

    * Source: [Full HBO Max Dataset](https://www.kaggle.com/datasets/octopusteam/full-hbo-max-dataset)
    * Description: An extensive collection of titles on HBO Max with associated IMDb data for comparison.

4. **Amazon Prime**

    * Source: [Full Amazon Prime Dataset](https://www.kaggle.com/datasets/octopusteam/full-amazon-prime-dataset)
    * Description: Comprehensive data on Amazon Prime's movie and TV series offerings, including IMDb-specific metrics.

5. **Hulu**

    * Source: [Full Hulu Dataset](https://www.kaggle.com/datasets/octopusteam/full-hulu-dataset)
    * Description: A dataset detailing Hulu's catalog with IMDb-related columns for evaluating content quality and popularity.

Each of the streaming platform datasets includes the following columns:

* **title**: Name of the content.
* **type**: Either "movie" or "tv series".
* **genres**: Genres associated with the title.
* **releaseYear**: Year the title was released.
* **imdbId**: Unique IMDb identifier.
* **imdbAverageRating**: Average user rating on IMDb.
* **imdbNumVotes**: Number of votes received on IMDb.
* **availableCountries**: Countries where the title is available.
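
Two of these columns, genres and availableCountries, can hold several values per title. The sketch below shows one way to count titles per genre, assuming the values are stored as comma-separated strings; the two rows are made up and the variable names are illustrative.

```python
import io

import pandas as pd

# Two made-up rows in the column layout listed above.
# Assumption: multi-valued fields are comma-separated strings.
csv_text = io.StringIO(
    "title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries\n"
    'Example One,movie,"Drama, Romance",2001,tt0000001,7.1,1200,"US, CA"\n'
    "Example Two,tv series,Comedy,2015,tt0000002,6.4,300,US\n"
)

df = pd.read_csv(csv_text)

# Split the genre string into a list, then give each genre its own row.
genres_long = (
    df.assign(genres=df["genres"].str.split(","))
    .explode("genres")
    .reset_index(drop=True)
)

# Remove the spaces left over after splitting on commas.
genres_long["genres"] = genres_long["genres"].str.strip()

# Number of titles per genre.
genre_counts = genres_long["genres"].value_counts()
print(genre_counts)
```

The same split-and-explode pattern would apply to availableCountries if a per-country view is needed.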

Additionally, supplementary IMDb datasets were downloaded directly from IMDb. Of the seven datasets available there, the two listed below complement my project and help expand on its main question.

[IMDb Data Files](https://datasets.imdbws.com/)

The columns I extracted from these files for their usefulness are:

* title.basics.tsv.gz
    * tconst (string) - alphanumeric unique identifier of the title
    * primaryTitle (string) - the more popular title / the title used by the filmmakers on promotional materials at the point of release
    * isAdult (boolean) - 0: non-adult title; 1: adult title
    * runtimeMinutes - primary runtime of the title, in minutes
    * genres (string array) - up to three genres associated with the title

* title.ratings.tsv.gz
    * tconst (string) - alphanumeric unique identifier of the title
    * averageRating - weighted average of all the individual user ratings
    * numVotes - number of votes the title has received
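
Both files share the tconst key, so they can be joined into one table. The sketch below illustrates this on made-up rows; it is not the project's loading code. It relies on two properties of the IMDb files: they are tab-separated, and a missing value is written as `\N`. The variable names and the choice of a left join are assumptions.

```python
import csv
import io

import pandas as pd

# Made-up rows in the IMDb TSV layout; "\N" marks a missing value.
basics_tsv = io.StringIO(
    "tconst\tprimaryTitle\tisAdult\truntimeMinutes\tgenres\n"
    "tt0000001\tExample One\t0\t95\tDrama,Romance\n"
    "tt0000002\tExample Two\t0\t\\N\tComedy\n"
)
ratings_tsv = io.StringIO(
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t7.1\t1200\n"
)

# Shared reader options: tab separator, "\N" treated as missing, and no quote
# handling, because titles can contain stray quote characters.
read_opts = dict(sep="\t", na_values="\\N", quoting=csv.QUOTE_NONE)

basics = pd.read_csv(
    basics_tsv,
    usecols=["tconst", "primaryTitle", "isAdult", "runtimeMinutes", "genres"],
    **read_opts,
)
ratings = pd.read_csv(ratings_tsv, **read_opts)

# A left join keeps every title, including those without a rating.
merged = basics.merge(ratings, on="tconst", how="left")
print(merged)
```

With the real downloads, the `io.StringIO` objects would be replaced by the paths to the `.tsv.gz` files; pandas infers gzip compression from the file extension.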

This initial exploration will set the stage for more advanced analyses, such as clustering, statistical comparisons, and the evaluation of platform value.

### Imports

In [1]:
# Import Libraries

import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

In [2]:
import sys
sys.path.append(r"C:\Users\kimbe\Documents\StreamingAnalysis\scripts")  # Make the project's scripts folder importable (machine-specific path)

from utils import *
from data_load import *
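
The absolute path above resolves on one machine only. A relative alternative is sketched below, assuming the notebook runs from a folder that sits next to a `scripts` folder inside the project; that layout is an assumption, not something this notebook confirms.

```python
import sys
from pathlib import Path

# Assumption: the working directory is <project>/notebooks and the helper
# modules live in <project>/scripts.
scripts_dir = Path.cwd().parent / "scripts"

# Add the folder to the import search path once.
if str(scripts_dir) not in sys.path:
    sys.path.append(str(scripts_dir))

print(scripts_dir)
```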


In [3]:
# Call the function to load and check the data
load_and_print_data()


Dataframe amazon_df successfully loaded. (67912, 8)
Dataframe hulu_df successfully loaded. (9884, 8)
Dataframe netflix_df successfully loaded. (20193, 8)
Dataframe hbo_df successfully loaded. (7199, 8)
Dataframe apple_df successfully loaded. (18004, 8)
Dataframe basics_df successfully loaded. (11278847, 9)
Dataframe ratings_df successfully loaded. (1506750, 3)
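
Each platform dataframe reports 8 columns, which matches the column list given earlier. A quick way to confirm the names as well as the count is sketched below; `missing_columns` is a hypothetical helper, and the demo frame is an empty stand-in for the loaded dataframes.

```python
import pandas as pd

# Column names from the platform dataset description.
EXPECTED_COLUMNS = [
    "title", "type", "genres", "releaseYear",
    "imdbId", "imdbAverageRating", "imdbNumVotes", "availableCountries",
]


def missing_columns(df: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]


# Empty stand-in frame that lacks the last expected column; in the notebook
# this would be amazon_df, hulu_df, and so on.
demo_df = pd.DataFrame(columns=EXPECTED_COLUMNS[:-1])
print(missing_columns(demo_df))
```

An empty list from this check means the frame has every expected column.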


## Check the basic structure of **Platform Datasets**