# AMAZON PRIME VIDEO CONTENT ANALYSIS


# Project Summary

This project performs an in-depth exploratory data analysis (EDA) on Amazon Prime Video’s content available in the United States.
The analysis uses two datasets: one containing detailed information about titles (movies and TV shows) such as genres, release year, runtime, IMDb scores, and popularity metrics, and another containing credits information for actors and directors.

The project begins by understanding and cleaning the datasets: handling missing values, removing duplicates, and ensuring that the data is ready for analysis. The analysis then explores multiple aspects of the content library:

**Content Diversity:** *Understanding which genres dominate the platform, and analyzing patterns in show types (TV shows vs movies).*

**Trends Over Time:** *Examining how the number of titles and their genres have evolved over the years.*

**Ratings & Popularity:** *Identifying top-rated and most popular titles based on IMDb scores, votes, and TMDB popularity.*

**Cast & Crew Analysis:** *Investigating contributions from actors and directors, and identifying recurring patterns.*

The project uses Pandas and NumPy for data manipulation, Matplotlib and Seaborn for visualizations, and optionally Plotly for interactive charts.

By the end of this analysis, the notebook provides actionable insights into content trends, audience preferences, and key metrics that can guide stakeholders in making data-driven decisions for content strategy and platform growth.

# PROBLEM STATEMENT

*This project analyzes the movies and TV shows available on Amazon Prime Video in the United States.*

*The goal is to extract insights about content diversity, genre trends, regional availability, IMDb ratings, and popularity.*

*These insights will help content creators, analysts, and platform stakeholders make data-driven decisions to improve engagement and guide content strategy.*

# BUSINESS CONTEXT


*In the competitive streaming industry, platforms like Amazon Prime Video constantly expand their content library to cater to diverse audiences.*

*With thousands of titles available, understanding user preferences, popular genres, and trending content over time is critical for subscription growth and investment in content.*

*This analysis leverages Amazon Prime’s dataset to uncover patterns that influence strategic decisions and audience satisfaction.*

# LET'S BEGIN

## Importing Libraries and Setting up Notebook

In [2]:
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Ensures plots display inline in the notebook
%matplotlib inline

sns.set_theme(style="whitegrid")  # Sets a clean grid style for Seaborn plots
warnings.filterwarnings('ignore')  # Suppresses all warnings to keep the output readable

## Loading the Dataset

In [5]:
# Load titles dataset
titles = pd.read_csv('/content/titles.csv')

# Load credits dataset
credits = pd.read_csv('/content/credits.csv')

In [6]:
credits.head()

Unnamed: 0,person_id,id,name,character,role
0,59401,ts20945,Joe Besser,Joe,ACTOR
1,31460,ts20945,Moe Howard,Moe,ACTOR
2,31461,ts20945,Larry Fine,Larry,ACTOR
3,21174,tm19248,Buster Keaton,Johnny Gray,ACTOR
4,28713,tm19248,Marion Mack,Annabelle Lee,ACTOR


In [7]:
titles.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts20945,The Three Stooges,SHOW,The Three Stooges were an American vaudeville ...,1934,TV-PG,19,"['comedy', 'family', 'animation', 'action', 'f...",['US'],26.0,tt0850645,8.6,1092.0,15.424,7.6
1,tm19248,The General,MOVIE,"During America’s Civil War, Union spies steal ...",1926,,78,"['action', 'drama', 'war', 'western', 'comedy'...",['US'],,tt0017925,8.2,89766.0,8.647,8.0
2,tm82253,The Best Years of Our Lives,MOVIE,It's the hope that sustains the spirit of ever...,1946,,171,"['romance', 'war', 'drama']",['US'],,tt0036868,8.1,63026.0,8.435,7.8
3,tm83884,His Girl Friday,MOVIE,"Hildy, the journalist former wife of newspaper...",1940,,92,"['comedy', 'drama', 'romance']",['US'],,tt0032599,7.8,57835.0,11.27,7.4
4,tm56584,In a Lonely Place,MOVIE,An aspiring actress begins to suspect that her...,1950,,94,"['thriller', 'drama', 'romance']",['US'],,tt0042593,7.9,30924.0,8.273,7.6
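The preview shows that `genres` is stored as a string that looks like a Python list, so it cannot be counted per genre directly. The following is a minimal sketch of one way to parse and count it, assuming every value is a well-formed list literal. The frame `genres_demo` is a hypothetical toy sample built from two of the rows above, not the full dataset.

```python
import ast

import pandas as pd

# Toy sample (hypothetical name) using two genre strings from the preview above.
genres_demo = pd.DataFrame({
    "id": ["tm82253", "tm83884"],
    "genres": ["['romance', 'war', 'drama']", "['comedy', 'drama', 'romance']"],
})

# Convert the string representation into a real Python list.
genres_demo["genres"] = genres_demo["genres"].apply(ast.literal_eval)

# One row per (title, genre) pair, then count titles per genre.
genre_counts = genres_demo.explode("genres")["genres"].value_counts()
print(genre_counts)
```

`ast.literal_eval` only evaluates literals, which makes it a safer choice than `eval` for this kind of column.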


## Dataset Overview

In [8]:
print("Titles dataset shape:", titles.shape)      #shape of titles dataset
print("Credits dataset shape:", credits.shape)    #shape of credits dataset

Titles dataset shape: (9871, 15)
Credits dataset shape: (124235, 5)
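The credits table has many more rows than the titles table, and both share the `id` column, so each title can have many credit rows. A possible way to combine them for cast and crew analysis is sketched below. The frames `titles_demo` and `credits_demo` are hypothetical toy samples built from the previews above, and the choice of a left join is an assumption rather than the notebook's confirmed approach.

```python
import pandas as pd

# Toy samples (hypothetical names) with rows copied from the previews above.
titles_demo = pd.DataFrame({
    "id": ["ts20945", "tm19248"],
    "title": ["The Three Stooges", "The General"],
    "type": ["SHOW", "MOVIE"],
})
credits_demo = pd.DataFrame({
    "person_id": [59401, 31460, 21174],
    "id": ["ts20945", "ts20945", "tm19248"],
    "name": ["Joe Besser", "Moe Howard", "Buster Keaton"],
    "role": ["ACTOR", "ACTOR", "ACTOR"],
})

# Left join keeps every credit row; validate raises if a title id is not unique.
merged_demo = credits_demo.merge(
    titles_demo, on="id", how="left", validate="many_to_one"
)
print(merged_demo[["name", "role", "title", "type"]])
```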


In [9]:
# Checking column names, data types, and non-null counts to understand dataset structure
titles.info()
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9871 entries, 0 to 9870
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    9871 non-null   object 
 1   title                 9871 non-null   object 
 2   type                  9871 non-null   object 
 3   description           9752 non-null   object 
 4   release_year          9871 non-null   int64  
 5   age_certification     3384 non-null   object 
 6   runtime               9871 non-null   int64  
 7   genres                9871 non-null   object 
 8   production_countries  9871 non-null   object 
 9   seasons               1357 non-null   float64
 10  imdb_id               9204 non-null   object 
 11  imdb_score            8850 non-null   float64
 12  imdb_votes            8840 non-null   float64
 13  tmdb_popularity       9324 non-null   float64
 14  tmdb_score            7789 non-null   float64
dtypes: float64(5), int64(

In [10]:
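# Titles dataset missing values
# Note: a notebook cell renders only its last expression, so this result
# is computed but not displayed; the output below is for credits only.
titles.isnull().sum()

# Credits dataset missing values
credits.isnull().sum()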

column,missing_values
person_id,0
id,0
name,0
character,16287
role,0
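To see the missing-value summary for both datasets from a single cell, each result can be printed explicitly. The sketch below also adds a percentage column. The helper `missing_report` and the toy frames are hypothetical, and the toy values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented toy frames (hypothetical names); the real data comes from read_csv.
titles_demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "age_certification": ["TV-PG", np.nan, np.nan],
})
credits_demo = pd.DataFrame({
    "name": ["X", "Y", "Z"],
    "character": ["Hero", np.nan, "Villain"],
})


def missing_report(df):
    """Return missing-value counts and percentages per column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})


# Printing inside a loop shows every report, not just the last expression.
for label, frame in [("titles", titles_demo), ("credits", credits_demo)]:
    print(f"--- {label} ---")
    print(missing_report(frame))
```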


In [11]:
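# Titles dataset duplicates
# Note: only the last expression in a cell is rendered, so the value
# shown below is the duplicate count for credits, not titles.
titles.duplicated().sum()

# Credits dataset duplicates
credits.duplicated().sum()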

np.int64(56)
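A sketch of how the duplicate counts for both datasets could be reported together, followed by removal of the fully duplicated rows. The toy frames are invented, and dropping exact duplicates is an assumed cleaning choice.

```python
import pandas as pd

# Invented toy frames (hypothetical names) with one exact duplicate in credits.
titles_demo = pd.DataFrame({"id": ["tm1", "tm2"], "title": ["A", "B"]})
credits_demo = pd.DataFrame({
    "person_id": [1, 1, 2],
    "id": ["tm1", "tm1", "tm1"],
    "name": ["X", "X", "Y"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR"],
})

# Count fully duplicated rows in each frame and print both results.
dup_counts = {
    label: int(frame.duplicated().sum())
    for label, frame in [("titles", titles_demo), ("credits", credits_demo)]
}
print(dup_counts)

# Keep the first occurrence of each duplicated row and rebuild the index.
credits_clean = credits_demo.drop_duplicates().reset_index(drop=True)
print(credits_clean.shape)
```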

In [12]:
titles.describe()  # For numeric columns like release_year, runtime, imdb_score

Unnamed: 0,release_year,runtime,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
count,9871.0,9871.0,1357.0,8850.0,8840.0,9324.0,7789.0
mean,2001.327221,85.973052,2.791452,5.976395,8533.614,6.910204,5.984247
std,25.810071,33.512466,4.148958,1.343842,45920.15,30.004098,1.517986
min,1912.0,1.0,1.0,1.1,5.0,1.1e-05,0.8
25%,1995.5,65.0,1.0,5.1,117.0,1.232,5.1
50%,2014.0,89.0,1.0,6.1,462.5,2.536,6.0
75%,2018.0,102.0,3.0,6.9,2236.25,5.634,6.9
max,2022.0,549.0,51.0,9.9,1133692.0,1437.906,10.0
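The statistics above show a median of 462.5 IMDb votes against a mean of about 8,534, so many titles have very few votes and their scores may be less reliable. One possible way to rank top-rated titles is to require a minimum number of votes first. The sketch below uses the five preview rows shown earlier; the threshold `MIN_VOTES` is a hypothetical value chosen only for illustration.

```python
import pandas as pd

# Score and vote values copied from the titles preview shown earlier.
ratings_demo = pd.DataFrame({
    "title": [
        "The Three Stooges",
        "The General",
        "The Best Years of Our Lives",
        "His Girl Friday",
        "In a Lonely Place",
    ],
    "imdb_score": [8.6, 8.2, 8.1, 7.8, 7.9],
    "imdb_votes": [1092.0, 89766.0, 63026.0, 57835.0, 30924.0],
})

MIN_VOTES = 10_000  # hypothetical cut-off, not taken from the notebook

# Drop low-vote titles, then rank by score (votes break ties).
top_rated = (
    ratings_demo[ratings_demo["imdb_votes"] >= MIN_VOTES]
    .sort_values(["imdb_score", "imdb_votes"], ascending=False)
    .reset_index(drop=True)
)
print(top_rated)
```

With this cut-off, a title with a high score but few votes is excluded from the ranking; a different threshold would change which titles qualify.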


In [20]:
titles['type'].value_counts()  # How many Movies vs TV Shows


type,count
MOVIE,8514
SHOW,1357
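The raw counts can also be expressed as shares of the library. The sketch below rebuilds the two counts shown above as a Series and converts them to percentages; on the real data, `titles['type'].value_counts(normalize=True)` would give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
type_counts = pd.Series({"MOVIE": 8514, "SHOW": 1357}, name="count")

# Percentage of the library made up by each content type.
type_share = (type_counts / type_counts.sum() * 100).round(1)
print(type_share)
```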


In [19]:
titles['age_certification'].value_counts()  # Age ratings distribution

age_certification,count
R,1249
PG-13,588
PG,582
G,269
TV-MA,217
TV-14,188
TV-PG,91
TV-Y,78
TV-G,57
TV-Y7,52
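By default, `value_counts()` leaves out missing values, and the `info()` output shows that only 3384 of 9871 titles have an age certification. A sketch of one way to make the missing group visible is shown below, using an invented toy column; the label `"Missing"` is a hypothetical placeholder.

```python
import numpy as np
import pandas as pd

# Invented toy column (hypothetical name) with two missing certifications.
cert_demo = pd.Series(["R", "PG-13", np.nan, "R", np.nan], name="age_certification")

# Label missing values explicitly so they appear in the distribution.
cert_counts = cert_demo.fillna("Missing").value_counts()
print(cert_counts)
```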
