# **Project Name**    - Amazon Prime EDA




##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### Team Member 1 - Meghashyam Parab


# **Project Summary -**

📊 Amazon Prime EDA: Unlocking Insights from Streaming Data

🎬 Welcome to the ultimate deep dive into Amazon Prime's content landscape! This project unravels the mysteries behind movies and TV shows on Amazon Prime through Exploratory Data Analysis (EDA). From genre trends to IMDb ratings, we've crunched the numbers to uncover what makes Prime Video tick.

---

🚀 Project Highlights

🔹 Data Cleaning & Preprocessing – Handling missing values, duplicate entries, and ensuring data integrity.

🔹 Genre & Content Trends – What types of content dominate Prime Video? Movies vs. TV Shows?

🔹 IMDb Ratings & User Preferences – Do higher-rated movies follow a pattern?

🔹 Country-Wise Content Distribution – Which countries contribute the most content?

🔹 Release Year Analysis – Is Prime focusing on newer content or old classics?

🔹 Visualizing Insights – Beautiful plots and interactive charts for better understanding.

---

📈 Tech Stack

✅ Python (Pandas, NumPy, Matplotlib, Seaborn, Plotly)

✅ Jupyter Notebook for EDA

✅ Data sourced from Kaggle/Amazon Prime dataset

----

🔍 Findings & Insights

💡 Most of Amazon Prime's content comes from the United States & India.

💡 TV Shows are generally higher-rated than movies.

💡 Comedy & Drama dominate as the most popular genres.

💡 Amazon Prime has been focusing on recent releases (post-2010) more than classics.

----

🎯 Next Steps

📌 Sentiment Analysis on Amazon Prime reviews

📌 Predictive modeling for content success

📌 Comparing Amazon Prime with Netflix & Disney+

# **GitHub Link -**

https://github.com/meghashyam123/Amazon-Prime-EDA

# **Problem Statement**


Amazon Prime Video hosts a massive library of content, but what truly keeps viewers hooked? Are IMDb ratings a reliable indicator of success? Do specific genres, runtimes, or production countries dominate? How has content evolved over the years?

🔍 Problem Statement:

"To analyze Amazon Prime’s content library through Exploratory Data Analysis (EDA) and uncover key insights about movie/TV show trends, audience preferences, and content performance. This will help optimize recommendations, improve user engagement, and drive strategic content decisions."

🎯 Key Goals:

✅ Identify top-performing genres, countries, and content types

✅ Analyze IMDb scores vs. runtime, release year, and production trends

✅ Spot patterns in audience preferences for better recommendations

✅ Provide data-backed insights to boost engagement & revenue

#### **Define Your Business Objective?**

The primary objective of conducting an Exploratory Data Analysis (EDA) on Amazon Prime is to derive actionable insights that can enhance customer engagement, optimize content strategy, and improve overall business performance. The key goals include:

Understanding Customer Behavior:

* Identify user preferences based on watch history, genres, ratings, and subscription trends.
* Analyze customer retention patterns and churn rates.

Content Performance Analysis:

* Determine which genres, movies, and TV shows perform best in terms of viewership and engagement.
* Identify underperforming content and factors contributing to low engagement.

Competitor Benchmarking:

* Compare Amazon Prime’s performance against competitors like Netflix and Disney+ based on content offerings and user engagement.

Subscription & Revenue Insights:

* Analyze subscription trends across different demographics and geographic regions.
* Identify factors influencing subscription cancellations and renewals.




# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# Core data handling
import re
import warnings
from itertools import cycle

import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.subplots as sp

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
credit_df = pd.read_csv("/content/credits.csv")
title_df = pd.read_csv("/content/titles.csv")

### Dataset First View

In [None]:
# Dataset First Look
print(title_df.head())
print(credit_df.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
title_df.shape

In [None]:
credit_df.shape

### Dataset Information

In [None]:
print(title_df.info())
print(credit_df.info())

#### Duplicate Values

In [None]:
title_df.duplicated().sum()


In [None]:
credit_df.duplicated().sum()


#### Missing Values/Null Values

In [None]:
title_df.isna().sum()


In [None]:
credit_df.isna().sum()


In [None]:
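# Percentage of missing values per column

# print() is needed here: a notebook cell only displays its last bare expression
print(round(100 * (credit_df.isnull().sum() / len(credit_df.index)), 2).sort_values(ascending=False))
print(round(100 * (title_df.isnull().sum() / len(title_df.index)), 2).sort_values(ascending=False))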



In [None]:
# Visualizing the missing values
import missingno as msno
msno.bar(title_df)
plt.show()

In [None]:
msno.bar(credit_df)
plt.show()
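
The missing-value checks above can also be folded into one reusable summary. The following is a minimal sketch, not the notebook's own method: `missing_report` is a hypothetical helper name, and the small `demo` frame only stands in for `title_df`.

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (100 * counts / len(df)).round(2),
    })
    # Keep only columns that actually have gaps, sorted by severity
    return report[report["missing"] > 0].sort_values("percent", ascending=False)

# Hypothetical toy frame standing in for title_df
demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "seasons": [None, None, 2.0, None],
    "imdb_score": [7.1, None, 6.4, 8.0],
})
report = missing_report(demo)
print(report)
```

Calling `missing_report(title_df)` and `missing_report(credit_df)` would give both views in the same layout, which makes the two files easier to compare.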

### What did you know about your dataset?

Overview of the Datasets

**credits.csv (124,235 entries, 5 columns)**

Contains details about cast and crew members.

Key columns: person_id, id (links to titles.csv), name, character (for actors), and role (e.g., actor, director).
Missing values: character column has some missing values.


**titles.csv (9,871 entries, 15 columns)**

Contains details about shows and movies.

Key columns: id, title, type (MOVIE/SHOW), release_year, genres, imdb_score, runtime, production_countries.
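
Because `id` links the two files, cast and crew rows can be joined to their titles. The sketch below uses two hypothetical miniature frames in place of the real CSVs. The upper-case role values are an assumption for illustration only.

```python
import pandas as pd

# Hypothetical miniature versions of titles.csv and credits.csv
titles = pd.DataFrame({
    "id": ["tm1", "tm2", "ts3"],
    "title": ["Film One", "Film Two", "Show Three"],
    "type": ["MOVIE", "MOVIE", "SHOW"],
})
credits = pd.DataFrame({
    "person_id": [10, 11, 10, 12],
    "id": ["tm1", "tm1", "tm2", "ts3"],
    "name": ["Ann", "Bob", "Ann", "Cy"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR", "ACTOR"],  # assumed label format
})

# Many credits point at one title; validate= raises if title ids are not unique
merged = credits.merge(titles, on="id", how="left", validate="many_to_one")

# Number of distinct credited people per title
cast_size = merged.groupby("title")["person_id"].nunique()
print(cast_size)
```

The `validate="many_to_one"` argument is a cheap safeguard: if `titles` ever contained a duplicated `id`, the merge would fail loudly instead of silently multiplying rows.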



## ***2. Understanding Your Variables***

In [None]:
rows, cols = credit_df.shape
rows, cols

In [None]:
rows, cols = title_df.shape
rows, cols

### Variables Description

The dataset contains the following key variables:  

| Column Name      | Description |
|-----------------|-------------|
| `id` | Unique identifier for each show or movie |
| `title` | Name of the show or movie |
| `type` | Specifies whether the content is a movie (`MOVIE`) or a TV show (`SHOW`) |
| `release_year` | Year when the show or movie was released |
| `age_certification` | Age rating or certification of the content (e.g., PG, R, 18+) |
| `runtime` | Duration of the content in minutes |
| `genres` | List of genres associated with the show or movie |
| `production_countries` | Country or countries where the content was produced |
| `seasons` | Number of seasons (if it's a TV show) |
| `imdb_score` | IMDb rating of the show or movie |
| `imdb_votes` | Number of votes the show or movie has on IMDb |
| `tmdb_popularity` | Popularity score on The Movie Database (TMDB) |
| `tmdb_score` | TMDB rating of the show or movie |
| `cast` | List of main actors in the show or movie |
| `crew` | List of crew members, including directors and producers |

---

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in credit_df.columns:
    unique_values = credit_df[column].unique()
    print(f"Unique values for '{column}': {unique_values}")

In [None]:
for x in credit_df.columns:
    print(f"{x} - {credit_df[x].nunique()}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Handling missing values in 'age_certification' and 'runtime'
title_df.fillna({'age_certification': 'Unknown', 'runtime': title_df['runtime'].median()}, inplace=True)

# Removing rows with missing 'imdb_score' as it's crucial for analysis
title_df.dropna(subset=['imdb_score'], inplace=True)

# Removing rows with any missing values in the credit_df
credit_df.dropna(inplace=True)

# Ensuring 'release_year' and 'imdb_score' are numeric
title_df['release_year'] = pd.to_numeric(title_df['release_year'], errors='coerce')
title_df['imdb_score'] = pd.to_numeric(title_df['imdb_score'], errors='coerce')

### What all manipulations have you done and insights you found?





🛠 Data Manipulations Performed :

Loaded the datasets: Read both CSV files using pandas.

Checked dataset structure: Used .info() to understand column types, missing values, and dataset size.

Displayed first few rows: Used .head() to inspect the structure and sample values.

Identified missing values: `age_certification`, `seasons`, `imdb_score`, and `tmdb_score` have missing values in titles.csv; `character` has some missing values in credits.csv.

Checked column relationships:

* `id` is the key linking titles and credits.
* `imdb_score` and `tmdb_score` can be used for rating analysis.
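
The cleaning steps can be sketched as a single function that returns a cleaned copy, so the raw frame stays available for comparison. This is an illustrative variant under stated assumptions, not the notebook's exact code: `clean_titles` is a hypothetical name, and it coerces the numeric columns before dropping rows so that unparseable scores are removed too.

```python
import numpy as np
import pandas as pd

def clean_titles(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of a titles-like frame; the input is left untouched."""
    out = df.copy()
    # Coerce first so unparseable values become NaN and are handled with the rest
    out["imdb_score"] = pd.to_numeric(out["imdb_score"], errors="coerce")
    out["release_year"] = pd.to_numeric(out["release_year"], errors="coerce")
    out["age_certification"] = out["age_certification"].fillna("Unknown")
    out["runtime"] = out["runtime"].fillna(out["runtime"].median())
    # imdb_score is required for the rating analysis, so rows without it are dropped
    return out.dropna(subset=["imdb_score"]).reset_index(drop=True)

# Hypothetical raw rows
raw = pd.DataFrame({
    "release_year": [1999, 2015, 2020],
    "age_certification": ["R", None, "PG"],
    "runtime": [100.0, np.nan, 50.0],
    "imdb_score": ["7.5", "6.0", "n/a"],
})
clean = clean_titles(raw)
print(f"kept {len(clean)} of {len(raw)} rows")
```

Reporting how many rows survive makes the cost of dropping titles without an IMDb score visible before any chart is drawn.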

---

📊 Key Insights Found :

1️⃣ Content Distribution by Type (Movies vs. TV Shows)
The dataset contains both movies and TV shows.
Need to check if one type dominates the dataset.

2️⃣ IMDb & TMDB Scores Missing for Some Titles
Many records lack IMDb or TMDB scores, meaning some popular movies/shows may not have enough rating data.

3️⃣ Genre Distribution & Popularity
Movies & TV shows belong to multiple genres (e.g., ['drama', 'action']).
Further analysis can show which genres are most popular.


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Distribution of Movies vs TV Shows


In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(7,4))
title_df['type'].value_counts().plot(kind='pie', autopct='%1.1f%%', colors=['#ff9999','#66b3ff'])
plt.title('Movies vs. TV Shows Distribution')
plt.ylabel('')
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart suits a part-to-whole comparison between just two categories, so the split between content types on Amazon Prime is readable at a glance:


*   86.8% of the content consists of Movies

*   Only 13.2% consists of TV Shows







##### 2. What is/are the insight(s) found from the chart?

Key Insights from the Chart:

* Movies make up 86.8% of the dataset.
* TV Shows account for only 13.2%, indicating that Amazon Prime's catalog is heavily dominated by movies.
* This imbalance suggests that users have more movie choices than TV series on the platform.
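
The percentages shown on the pie chart can be reproduced numerically with `value_counts(normalize=True)`. The sketch below uses a hypothetical ten-row sample, so its numbers are illustrative and are not the dataset's figures.

```python
import pandas as pd

# Hypothetical sample; the real notebook would use title_df["type"]
types = pd.Series(["MOVIE"] * 7 + ["SHOW"] * 3, name="type")

# normalize=True returns fractions of the total instead of raw counts
share = (types.value_counts(normalize=True) * 100).round(1)
print(share)
```

Printing the shares alongside the chart keeps the quoted figures traceable to the data rather than read off the plot.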


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

📊 **Business Impact of Gained Insights**

✅ Positive Impact:

* Content Strategy: Since 86.8% of the catalog is movies, Amazon Prime can focus on acquiring more TV shows to balance the content mix and attract binge-watchers.
* Genre Optimization: Identifying popular genres helps Amazon invest in high-demand content and tailor recommendations.
* Localized Expansion: Analyzing production countries guides regional content acquisition, boosting engagement in untapped markets.



---



⚠️ Potential Negative Growth Areas:

* Limited TV Shows: With only 13.2% of content as TV series, binge-watchers may shift to competitors (like Netflix, which has a stronger TV lineup).
* Missing Ratings Data: Incomplete IMDb/TMDB scores mean Amazon might be promoting low-quality content, affecting user satisfaction.
* Unbalanced Age Ratings: If most content lacks age certification, family-friendly users may hesitate to subscribe.

#### **Top 10 Most Popular Movies**

In [None]:
top_10_movies = title_df[(title_df['type'] == 'MOVIE')].sort_values(by=['imdb_votes'], ascending=False).head(10)

plt.figure(figsize=(12, 6))
sns.barplot(y='title', x='imdb_votes', data=top_10_movies, palette='viridis')
plt.title('Top 10 Most Popular Movies (Based on IMDb Votes)')
plt.xlabel('IMDb Votes')
plt.ylabel('Movie Title')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart because it offers better readability and clarity, especially when dealing with long movie titles.

Here's why:

📌 Improved Label Visibility – Movie titles are easier to read when listed vertically rather than squeezed at an angle on the x-axis.

📌 Better Comparison – The horizontal format allows for a clearer visual comparison of IMDb votes, making it easy to see which movies are the most popular.

📌 Space Efficiency – Works well for ranking-based data, ensuring that bars are evenly spaced and data isn’t cluttered.

📌 User-Friendly Design – A horizontal layout is more intuitive when displaying top ranked items, as viewers naturally scan from top to bottom.

##### 2. What is/are the insight(s) found from the chart?

📊 **Insights from the Chart: Top 10 Most Popular Movies (Based on IMDb Votes)**

1️⃣ "Titanic" is the most popular movie, receiving the highest IMDb votes, indicating its evergreen appeal and strong rewatch value.

2️⃣ Classic thrillers and dramas dominate the list, including The Usual Suspects, Braveheart, and The Sixth Sense, proving that older, critically acclaimed films still attract significant audience engagement.

3️⃣ Action and Sci-Fi movies remain a stronghold, with The Terminator, Skyfall, and District 9 securing top spots, confirming audience preference for high-adrenaline content.

4️⃣ Animated movies like "Shrek" have high engagement, suggesting a demand for family-friendly content alongside action and thrillers.

5️⃣ Diverse genre representation: The list includes romance, thriller, sci-fi, drama, and animation, showing Amazon Prime's broad audience reach.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

📊 **Business Impact of Gained Insights**

✅ Positive Impact:

* Boost User Engagement 🚀 – Featuring top-voted movies like Titanic and The Usual Suspects in recommendations can increase watch time.
* Content Acquisition Strategy 🎯 – High audience interest in action, thriller, and animation suggests investing more in these genres.
* Targeted Promotions 📢 – Running campaigns on classic and high-voted films can attract both new and returning viewers.




---

⚠️ Potential Negative Growth Areas:


* Over-Reliance on Older Content 🕰️ – Most top movies are older classics. A lack of fresh, high-engagement titles may cause younger audiences to look elsewhere.
* Genre Imbalance ⚖️ – Heavy focus on thrillers and action; underrepresentation of newer niche genres (e.g., documentaries, indie films) might limit audience diversity.


#### Top 10 Most popular shows


In [None]:
top_10_shows = title_df[(title_df['type'] == 'SHOW')].sort_values(by=['imdb_votes'], ascending=False).head(10)
plt.figure(figsize=(12, 6))
sns.barplot(y='title', x='imdb_votes', data=top_10_shows, palette='viridis')
plt.title('Top 10 Most Popular Shows (Based on IMDb Votes)')
plt.xlabel('IMDb Votes')
plt.ylabel('Show Title')
plt.show()


##### 1. Why did you pick the specific chart?

This horizontal bar chart was chosen because it visually showcases viewer preferences based on IMDb votes, helping Amazon Prime identify:

🔹 Top-performing shows that drive engagement.

🔹 Genres and content themes that attract maximum viewership.

🔹 Potential content gaps, guiding future investments in licensing or original productions.

##### 2. What is/are the insight(s) found from the chart?

"Dexter" is the Most Popular Show 🩸 – With the highest number of IMDb votes, "Dexter" leads the chart, indicating its massive fan following.

Comedy Still Dominates 😂 – "How I Met Your Mother" securing second place highlights the strong appeal of sitcoms even years after their release.

Historical & Crime Thrillers Are Fan Favorites ⚔️ – "Vikings" and "Better Call Saul" rank high, suggesting a strong audience interest in period dramas and crime narratives.

Diverse Genres Keep Viewers Hooked 🎭 – From crime ("Dexter"), drama ("House"), and comedy ("Community") to sci-fi ("Mr. Robot") and horror ("American Horror Story"), Amazon Prime offers a well-balanced mix of content.

Older Shows Still Hold Relevance ⏳ – Many shows on the list premiered years ago, proving that timeless storytelling continues to attract new viewers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Growth:

The insights highlight Amazon Prime’s strength in nostalgia-driven content, genre diversity, and high audience engagement. This can guide future investments in reviving classics, acquiring high-rated IMDb shows, and optimizing recommendations to retain viewers.

⚠️ Potential Risk:

A heavy reliance on older, established shows could stagnate growth if fresh, original content isn’t prioritized. The absence of new Prime-exclusive hits in the top 10 suggests a need for stronger in-house productions to stay competitive.

#### **Top 10 Production Countries**

In [None]:
from collections import Counter

# Function to extract countries from a string
def extract_countries(country_str):
  # pd.isna catches every missing-value form; an identity check against np.nan does not
  if pd.isna(country_str):
    return []
  return [country.strip() for country in country_str.split(',')]

# Extract countries and count occurrences
all_countries = title_df['production_countries'].apply(extract_countries).sum()
country_counts = Counter(all_countries)

# Get top 10 countries
top_10_countries = country_counts.most_common(10)

# Convert to dataframe for plotting
countries_df = pd.DataFrame(top_10_countries, columns=['Country', 'Count'])

In [None]:
plt.figure(figsize=(12, 6))
sns.barplot(y='Country', x='Count', data=countries_df, palette='viridis')
plt.title('Top 10 Production Countries for Amazon Prime Content')
plt.xlabel('Number of Titles')
plt.ylabel('Country')
plt.show()

##### 1. Why did you pick the specific chart?

This chart was chosen because it visually represents the top 10 production countries for Amazon Prime content, providing key insights into content distribution and market dominance.

🔹 Clear Identification of Market Leaders – The chart highlights which countries contribute the most to Amazon Prime’s content library, helping understand where the platform's strongest presence is.

🔹 Spotting Opportunities for Growth – By analyzing the disparity between content production across countries, we can identify underrepresented regions where Amazon Prime can expand its content strategy.

##### 2. What is/are the insight(s) found from the chart?

Key Insights from the Chart 📊

1️⃣ US Leads the Production Race 🇺🇸 – The United States dominates Amazon Prime’s content library, contributing the highest number of titles.

2️⃣ India & UK as Strong Contenders 🇮🇳🇬🇧 – India and the UK have significant contributions, highlighting a growing presence of regional content.

3️⃣ Limited Representation from Other Regions 🌍 – Countries outside the top three have considerably fewer titles, indicating an opportunity for Amazon Prime to expand content diversity and attract a more global audience.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

🚀 **Business Impact of Insights from the Chart**


✅ Positive Business Impact:

The insights gained will help Amazon Prime make data-driven decisions in content acquisition and production.

Strengthening Market Position: The dominance of US content suggests Amazon Prime’s stronghold in the American market, allowing strategic partnerships with leading studios.

Emerging Market Opportunities: The growing contribution from India (IN) and the UK (GB) presents an opportunity to expand localized content and increase international subscriber engagement.

Content Diversification: By identifying underrepresented markets, Amazon Prime can invest in regional content to attract diverse audiences.



---



❌ Insights That May Lead to Negative Growth:

Over-Reliance on US Content: A strong dependence on US productions could alienate international audiences, leading to lower engagement in non-US regions.

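Data Inconsistencies: The chart shows duplicate country labels (e.g., multiple variations of "US" and "CA") and an empty category. These are most likely parsing artifacts rather than genuine data errors: if `production_countries` stores stringified lists, splitting on commas leaves brackets and quotes attached to the codes. The counts should be rechecked before they inform strategic decisions.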

Limited Global Appeal: If Amazon Prime does not expand its regional content, competitors like Netflix or Disney+ could capture more market share in diverse regions.
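
The duplicate country labels and the empty category noted above can be checked with a stricter parser. This is a minimal sketch that assumes `production_countries` holds stringified Python lists such as `"['US', 'GB']"`, which the symptoms suggest but the notebook does not confirm. `parse_countries` and the sample `rows` are hypothetical.

```python
import ast
from collections import Counter

def parse_countries(value):
    """Parse a stringified list such as "['US', 'GB']" into ['US', 'GB']."""
    if not isinstance(value, str):
        return []  # NaN or any other non-string input
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal
    if not isinstance(parsed, (list, tuple)):
        return []
    # Normalise whitespace and case, and skip empty entries
    return [str(code).strip().upper() for code in parsed if str(code).strip()]

# Hypothetical sample values, including an empty list and a missing value
rows = ["['US']", "['US', 'GB']", "[]", "['IN']", float("nan")]
counts = Counter(code for row in rows for code in parse_countries(row))
print(counts.most_common(3))
```

With this parser each code is counted under a single clean label, so `"['US'"` and `" 'US']"` no longer appear as separate countries and empty lists contribute nothing.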



#### **Movies and TV Shows Released per Year**

In [None]:
yearly_releases = title_df.groupby(['release_year', 'type'])['id'].count().reset_index(name='count')


plt.figure(figsize=(12, 6))
sns.lineplot(x='release_year', y='count', hue='type', data=yearly_releases)
plt.title('Number of Movies and TV Shows Released per Year')
plt.xlabel('Release Year')
plt.ylabel('Number of Releases')
plt.show()


#### **Duration of Movies and TV Shows**

In [None]:

# Create the box plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='type', y='runtime', data=title_df)
plt.title('Comparing Duration of Movies and TV Shows')
plt.xlabel('Content Type')
plt.ylabel('Runtime (minutes)')
plt.show()

##### 1. Why did you pick the specific chart?

I chose this boxplot because it is the best way to compare the distribution and variability of movie and TV show durations on Amazon Prime. Here’s why it was the ideal choice:

🔹 Clear Comparison – The chart shows how the median, interquartile range (IQR), and outliers differ between movies and TV shows.

🔹 Highlights Outliers – It visually represents unusually long movies and special/long TV show episodes, which could be useful insights for content planning.

🔹 User Engagement Insights – Helps Amazon Prime understand whether longer movies or shorter content are more common and how they can optimize content for user preferences.

##### 2. What is/are the insight(s) found from the chart?

This boxplot compares the duration of Movies and TV Shows on Amazon Prime. Key takeaways:

🔹 Movies are generally longer than TV shows, with a median runtime around 90–100 minutes, while TV shows typically range around 30–40 minutes per episode.

🔹 Greater Variability in Movie Lengths – Movies have a wider spread with many outliers, some exceeding 300 minutes, showing a mix of short films and long-form content.

🔹 TV Shows have a more consistent duration, with most episodes clustered under 50 minutes but still containing some outliers reaching up to 150 minutes, possibly indicating special episodes or mini-series.
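
The medians, spreads, and outliers read off the boxplot can be confirmed numerically. The sketch below uses hypothetical runtimes and flags outliers with the 1.5 × IQR rule, which is also the default whisker length in seaborn's `boxplot`.

```python
import pandas as pd

# Hypothetical runtimes in minutes; the real notebook would use title_df
df = pd.DataFrame({
    "type": ["MOVIE"] * 5 + ["SHOW"] * 5,
    "runtime": [80, 90, 100, 110, 300, 20, 30, 40, 45, 150],
})

rows = {}
for kind, runtimes in df.groupby("type")["runtime"]:
    q1, median, q3 = runtimes.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Points beyond 1.5 * IQR from the quartiles are drawn as outliers
    is_outlier = (runtimes < q1 - 1.5 * iqr) | (runtimes > q3 + 1.5 * iqr)
    rows[kind] = {"median": median, "iqr": iqr, "outliers": int(is_outlier.sum())}

summary = pd.DataFrame.from_dict(rows, orient="index")
print(summary)
```

A table like this lets statements such as "median runtime around 90–100 minutes" be quoted from computed values instead of estimated from the plot.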

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Impact – Insights help Amazon Prime optimize content recommendations, ensuring users get content suited to their viewing habits (e.g., shorter content for casual watchers, longer ones for dedicated movie buffs).

⚠️ Negative Growth Potential – If movies are too long, they might deter audiences preferring bite-sized content, leading to higher drop-off rates. Amazon Prime can balance the content library by introducing more short films or engaging series to maintain user retention.



#### Chart - 6 - Genre Distribution based on No. of Movies and TV Shows

In [None]:
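# Parse the stringified genre lists into Python lists.
# The isinstance guard keeps the cell safe to re-run once the column is already parsed.
title_df["genres"] = title_df["genres"].apply(
    lambda x: re.findall(r"\w+", x) if isinstance(x, str) else x
)

# Collect the distinct genre names
genres = sorted({genre for genre_list in title_df["genres"] for genre in genre_list})

# One-hot encode: add one 0/1 indicator column per genre
for genre in genres:
    title_df[genre] = title_df["genres"].apply(lambda x: int(genre in x))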

print("Number of Genres: ", len(genres))
print("Genres:", genres)

In [None]:
genre_movie_dict = {}

for genre in genres:
    genre_movie_dict[genre] = title_df.query("type == 'MOVIE'")[genre].sum()

genre_movie_dict = dict(sorted(genre_movie_dict.items(), key=lambda x: x[0]))

genre_series_dict = {}

for genre in genres:
    genre_series_dict[genre] = title_df.query("type == 'SHOW'")[genre].sum()

genre_series_dict = dict(sorted(genre_series_dict.items(), key=lambda x: x[0]))

fig = sp.make_subplots(
    rows=2,
    cols=1,
    subplot_titles=["Movies", "Series"],
)

genre_movie_count = go.Bar(
    x=list(genre_movie_dict.keys()),
    y=list(genre_movie_dict.values()),
    marker=dict(color=list(genre_movie_dict.values()),
                colorscale=px.colors.qualitative.Dark2),
    name="Movies",
)

genre_series_count = go.Bar(
    x=list(genre_series_dict.keys()),
    y=list(genre_series_dict.values()),
    marker=dict(color=list(genre_series_dict.values()),
                colorscale=px.colors.qualitative.Dark2),
    name="Series",
)

fig.add_trace(genre_movie_count, row=1, col=1)
fig.update_xaxes(title_text="Genres", row=1, col=1)
fig.update_yaxes(title_text="Count", row=1, col=1)

fig.add_trace(genre_series_count, row=2, col=1)
fig.update_xaxes(title_text="Genres", row=2, col=1)
fig.update_yaxes(title_text="Count", row=2, col=1)

fig.update(
    layout_title_text="Genre Distribution based on No. of Movies and Shows",
    layout_title_font_size=30,
    layout_title_x=0.5,
    layout_template="plotly",
    layout_showlegend=False,
    layout_height=800,
    layout_paper_bgcolor='rgb(239, 247, 255)',  # RGB channels are capped at 255
    layout_plot_bgcolor='rgb(239, 247, 255)',
)

fig.update_annotations(font_size=18)

fig.show()

##### 1. Why did you pick the specific chart?

This genre distribution chart was chosen because it provides a clear comparative analysis of the number of movies and TV shows available across different genres on Amazon Prime.

Here’s why it’s valuable:

🔹 Genre Popularity Insight – Helps understand which genres dominate Amazon Prime’s content library. For instance, Drama, Comedy, and Thriller seem to be leading categories for movies, while Comedy and Drama are dominant in TV shows.

🔹 Strategic Content Decisions – By analyzing underrepresented genres (e.g., animation, war, or sports), Amazon Prime can decide whether to expand its offerings in these areas to attract a more diverse audience.

🔹 Viewer Preference Alignment – This helps in aligning content recommendations and marketing strategies based on popular genres.

##### 2. What is/are the insight(s) found from the chart?

1️⃣ Drama is King 👑

Drama dominates both movies and TV shows, suggesting that viewers on Amazon Prime prefer deep storytelling and character-driven narratives.
This indicates that investing in high-quality drama productions could further enhance engagement.

2️⃣ Comedy & Thriller Are Key Players 🎭🔪

Comedy is a strong genre in both movies and TV shows, showing that light-hearted content is a major attraction for subscribers.
Thrillers have a high presence in movies, meaning there’s a strong demand for suspenseful and intense storytelling.

3️⃣ Low Representation of Certain Genres 🧐

Genres like war, sports, and animation have very few titles in both movies and TV shows.
If there is demand for these genres, Amazon Prime can increase content acquisition or production in these areas to attract niche audiences.
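
The same genre counts can be obtained without adding one indicator column per genre, by reshaping to one row per title and genre pair. This is an alternative sketch on hypothetical rows, assuming genres are stored as stringified lists as the notebook's `re.findall` step implies.

```python
import re
import pandas as pd

# Hypothetical rows standing in for title_df
df = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "SHOW", "SHOW"],
    "genres": ["['drama', 'comedy']", "['drama']", "['comedy']", "[]"],
})

# Parse into real lists, then give each (title, genre) pair its own row
parsed = df.assign(genre=df["genres"].apply(lambda s: re.findall(r"\w+", s)))
long_df = parsed.explode("genre").dropna(subset=["genre"])  # empty lists explode to NaN

# Rows are content types, columns are genres
genre_counts = long_df.groupby(["type", "genre"]).size().unstack(fill_value=0)
print(genre_counts)
```

The long format keeps the original frame narrow and makes per-genre aggregations a single `groupby`, at the cost of repeating each title once per genre.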

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

🚀 Business Impact of Insights

✅ Positive Impact:
The insights help Amazon Prime focus on high-demand genres like Drama, Comedy, and Thriller, ensuring higher engagement and retention. Expanding underrepresented genres (like Sports and Animation) can attract niche audiences, boosting subscriptions.

⚠ Potential Negative Growth:
Over-reliance on Drama & Thriller may create content fatigue, making the platform feel repetitive. Lack of investment in emerging genres could alienate audiences looking for fresh content. A diversified content strategy is key to sustained growth.

🎯 Solution: Balance mainstream hits with experimental content to stay ahead of the competition! 🚀









#### Chart - 7 - Genre Distribution based on IMDB Votes

In [None]:
genre_movies_popularity_dict = {}

for genre in genres:
    # Total IMDb votes across movies tagged with this genre (indicator column == 1)
    is_genre_movie = (title_df["type"] == "MOVIE") & (title_df[genre] == 1)
    genre_movies_popularity_dict[genre] = title_df.loc[is_genre_movie, "imdb_votes"].sum()

genre_movies_popularity_dict = dict(sorted(genre_movies_popularity_dict.items(), key=lambda x: x[0]))

genre_series_popularity_dict = {}

for genre in genres:
    # Total IMDb votes across shows tagged with this genre (indicator column == 1)
    is_genre_show = (title_df["type"] == "SHOW") & (title_df[genre] == 1)
    genre_series_popularity_dict[genre] = title_df.loc[is_genre_show, "imdb_votes"].sum()

genre_series_popularity_dict = dict(sorted(genre_series_popularity_dict.items(), key=lambda x: x[0]))

fig = sp.make_subplots(
    rows=2,
    cols=1,
    subplot_titles=["Movies", "Series"],
)

genre_movies_pop = go.Bar(
    x=list(genre_movies_popularity_dict.keys()),
    y=list(genre_movies_popularity_dict.values()),
    marker=dict(color=list(genre_movies_popularity_dict.values()),
                colorscale=px.colors.qualitative.Dark2),
    hoverinfo="x+y",
)

genre_series_pop = go.Bar(
    x=list(genre_series_popularity_dict.keys()),
    y=list(genre_series_popularity_dict.values()),
    marker=dict(color=list(genre_series_popularity_dict.values()),
                colorscale=px.colors.qualitative.Dark2),
    hoverinfo="x+y",
)

fig.add_trace(genre_movies_pop, row=1, col=1)
fig.update_xaxes(title_text="Genre", row=1, col=1)
fig.update_yaxes(title_text="IMDB Votes", row=1, col=1)

fig.add_trace(genre_series_pop, row=2, col=1)
fig.update_xaxes(title_text="Genre", row=2, col=1)
fig.update_yaxes(title_text="IMDB Votes", row=2, col=1)

fig.update(
    layout_title_text="Genre Distribution based on IMDB Votes",
    layout_title_font_size=30,
    layout_title_x=0.5,
    layout_template="plotly",
    layout_showlegend=False,
    layout_height=800,
    layout_paper_bgcolor='rgb(239, 247, 255)',  # RGB channels are capped at 255
    layout_plot_bgcolor='rgb(239, 247, 255)',
)

fig.update_annotations(font_size=18)

fig.show()


##### 1. Why did you pick the specific chart?

I picked this specific genre distribution chart because:

1️⃣ Clear Business Insights – It visually highlights which genres attract the most IMDB votes for both movies and series, helping businesses understand audience preferences.

2️⃣ Comparative Analysis – The dual representation (movies vs. series) allows for an easy comparison of viewer interest, enabling strategic content investment.

3️⃣ Actionable Decision-Making – The data can guide content creators, streaming platforms, or production houses in choosing genres that maximize engagement and revenue.

##### 2. What is/are the insight(s) found from the chart?


Drama Dominates 🎭 :

Both movies and series see the highest engagement in the drama genre.
Businesses can prioritize drama-based content for maximum audience reach.


Action & Thriller are Popular 🔥 :

Action, thriller, and comedy genres receive significant votes, showing strong audience demand.
Streaming platforms should focus on these genres for high engagement.


Animation & Documentary Lag Behind 🎬 :

These genres have fewer votes, indicating lower mainstream popularity.
Marketing efforts or niche audience targeting might be needed for growth.
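
Summed votes favor genres that simply contain more titles, so it can help to look at votes per title as well. The sketch below is an assumption-laden illustration on hypothetical long-format data (one row per title and genre pair), not a result from the dataset.

```python
import pandas as pd

# Hypothetical long-format data: one row per (title, genre) pair
long_df = pd.DataFrame({
    "genre": ["drama", "drama", "drama", "animation"],
    "imdb_votes": [1000, 2000, 3000, 5000],
})

# Named aggregation: total votes, number of titles, and mean votes per title
stats = long_df.groupby("genre")["imdb_votes"].agg(
    total="sum", titles="count", per_title="mean"
)
print(stats.sort_values("per_title", ascending=False))
```

In this toy example drama has the larger total while animation has the higher votes per title, which shows how the two measures can rank genres differently.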

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact

Prioritizing Drama, Action, and Thriller: These genres receive the highest engagement, indicating strong audience demand. Investing in these will likely boost viewership and revenue.

Strategic Content Planning: Knowing which genres perform well helps businesses allocate resources efficiently, reducing financial risks.

Enhanced Customer Satisfaction: Creating more content in popular genres ensures higher viewer retention and platform loyalty.



⚠️ Potential Negative Growth Areas :

Neglecting Low-Performing Genres: Ignoring genres like animation or documentary may alienate niche audiences who value diverse content.

Oversaturation of Popular Genres: Overloading drama, action, and thriller content without variety might lead to audience fatigue and decreased engagement over time.

Risk of Missing Emerging Trends: While current trends favor drama and action, failing to explore new or rising genres could limit long-term growth.

#### Chart - 8 - IMDB Rating vs TMDB Score

In [None]:
# Supporting view: IMDb score vs. release year
plt.figure(figsize=(12,5))
sns.scatterplot(data=title_df, x='release_year', y='imdb_score', alpha=0.5)
plt.title('IMDb Score vs. Release Year')
plt.show()

In [None]:

# Assuming your dataframe is named 'title_df' and contains 'imdb_score' and 'tmdb_score' columns
plt.figure(figsize=(10, 6))  # Adjust figure size if needed
sns.scatterplot(x='imdb_score', y='tmdb_score', data=title_df)
plt.title('IMDb Rating vs. TMDB Score')
plt.xlabel('IMDb Rating')
plt.ylabel('TMDB Score')
plt.show()

##### 1. Why did you pick the specific chart?

This scatter plot was chosen because it effectively visualizes the relationship between IMDb ratings and TMDB scores. It helps identify patterns, such as whether there is a positive correlation between the two rating systems or if there are significant deviations. By using this plot, one can assess how closely the two rating platforms align and detect any anomalies or trends in audience perception across platforms.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot provides the following insights:

Positive Correlation – There is a general upward trend, suggesting that movies with high IMDb ratings tend to have high TMDB scores, indicating consistency in audience perception across platforms.

Variance in Scores – While many points align closely along an increasing trend, there are several outliers where a movie has a high IMDb rating but a significantly lower TMDB score (or vice versa). This could indicate differences in audience demographics or rating criteria between the platforms.

Clustered Distribution – The majority of the data points are concentrated between IMDb ratings of 4 to 8 and TMDB scores of 4 to 8, suggesting that most movies fall within this moderate-to-high rating range.
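
The upward trend can be quantified rather than judged by eye. The following is a minimal sketch, assuming the `imdb_score` and `tmdb_score` columns plotted above; `sample_df` and `score_agreement` are hypothetical and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample; the notebook itself would pass title_df.
sample_df = pd.DataFrame({
    "imdb_score": [7.5, 6.0, 8.2, 5.1, None],
    "tmdb_score": [7.0, 6.4, 8.0, 5.5, 6.0],
})


def score_agreement(df):
    """Return Pearson and Spearman correlations between IMDb and TMDB scores."""
    # Only titles with both scores can be compared
    paired = df[["imdb_score", "tmdb_score"]].dropna()
    pearson = paired.corr(method="pearson").loc["imdb_score", "tmdb_score"]
    spearman = paired.corr(method="spearman").loc["imdb_score", "tmdb_score"]
    return {"n_titles": len(paired), "pearson": pearson, "spearman": spearman}


agreement = score_agreement(sample_df)
print(agreement)
```

Pearson measures linear agreement, while Spearman only compares rank order, so a large gap between the two would hint that outliers are shaping the picture.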

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact :

Yes! The insights can help businesses make data-driven decisions. The upward trend between IMDb and TMDB scores suggests that audience preferences are fairly consistent across platforms. Studios can therefore use IMDb ratings as a reasonable benchmark for anticipating audience reception on other platforms, which helps when optimizing marketing efforts and selecting movies for streaming services.

Negative Growth Potential :

A key concern is the rating inconsistencies and outliers. If a movie has a high IMDb rating but a low TMDB score, it may indicate biased reviews, platform-specific audience preferences, or manipulated ratings. This could mislead businesses into making poor content acquisition decisions, leading to lower engagement and potential financial losses.

#### Chart - 9 - Top 10 Directors

In [None]:
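# Chart - 9 visualization code
# Keep only director credits; assumes directors carry role == 'DIRECTOR',
# mirroring the 'ACTOR' label used in the next cell
top_directors = credit_df.loc[credit_df['role'] == 'DIRECTOR', 'name'].value_counts().nlargest(10)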
plt.figure(figsize=(10,5))
sns.barplot(x=top_directors.values, y=top_directors.index, palette='coolwarm')
plt.title('Top 10 Directors with Most Content')
plt.xlabel('Number of Shows/Movies')
plt.show()

In [None]:
merged_df = pd.merge(credit_df, title_df, on='id')
# Average IMDb score per actor, highest first
actors_ratings = (
    merged_df[merged_df['role'] == 'ACTOR']
    .groupby('name')['imdb_score']
    .mean()
    .sort_values(ascending=False)
)

top_10_actors = actors_ratings.head(10)
plt.figure(figsize=(12, 6))
sns.barplot(x=top_10_actors.index, y=top_10_actors.values, palette='viridis')
plt.title('Top 10 Actors with Highest Average IMDb Ratings')
plt.xlabel('Actor')
plt.ylabel('Average IMDb Rating')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

The specific chart was chosen because it visually represents the top 10 actors with the highest IMDb ratings, making it easy to identify which actors consistently receive high audience appreciation. The bar chart format effectively communicates comparisons between different actors, ensuring quick interpretation of the data.

##### 2. What is/are the insight(s) found from the chart?

1.   The top 10 actors in terms of IMDb ratings all have exceptionally high scores (~10).
2.   This suggests that their work is highly appreciated by audiences, indicating strong fan bases or participation in critically acclaimed films.
3.   The data may indicate a bias in rating distribution (e.g., limited number of reviews or niche audience influence).
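
One way to reduce the bias mentioned in point 3 is to require a minimum number of rated titles per actor before ranking. This is a sketch only: the threshold `min_titles` is an arbitrary assumption, and `merged_sample` and `top_actors` are hypothetical names with invented data.

```python
import pandas as pd

# Hypothetical sample shaped like merged_df (credits joined with titles).
merged_sample = pd.DataFrame({
    "name": ["A", "A", "A", "B", "C", "C"],
    "role": ["ACTOR"] * 6,
    "imdb_score": [7.0, 8.0, 9.0, 10.0, 6.0, 8.0],
})


def top_actors(df, min_titles=2, n=10):
    """Rank actors by mean IMDb score, keeping only those with enough rated titles."""
    actors = df[df["role"] == "ACTOR"].dropna(subset=["imdb_score"])
    stats = actors.groupby("name")["imdb_score"].agg(
        mean_score="mean", n_titles="count"
    )
    # Drop actors whose average rests on too few titles
    stats = stats[stats["n_titles"] >= min_titles]
    return stats.sort_values("mean_score", ascending=False).head(n)


ranked = top_actors(merged_sample, min_titles=2)
print(ranked)
```

In the sample, actor "B" has a perfect average from a single title and is filtered out, which is the kind of case that can push every bar in the chart toward 10.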



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Strategic Casting Decisions: Since these actors have consistently high IMDb ratings, producers and streaming platforms can leverage their popularity to attract more viewers.

Marketing & Promotions: Using these actors in promotional campaigns can drive engagement and subscriptions.

Content Curation: Platforms like Netflix, Amazon Prime, and Hotstar can recommend movies featuring these actors to increase watch time and user retention.

**Potential Negative Growth Impact:**

Rating Bias & Skewed Perception: If these high ratings are due to limited audience reach or fan-driven bias, over-reliance on them can mislead business decisions (e.g., investing in an actor who lacks broad appeal).

Limited Diversity in Casting: Focusing only on a select few actors may reduce opportunities for emerging talent, leading to stagnation in content variety and innovation.

#### Chart - 10 - IMDB Score Distribution by Genre

In [None]:
fig = ff.create_distplot(
    [title_df[(title_df[genre] == 1) & (title_df['imdb_score'].notna())]['imdb_score'] for genre in sorted(genres)],
    sorted(genres),
    show_hist=False,
    show_rug=False,
)

fig.update_layout(
    title="IMDB Score Distribution by Genre",
    title_font_size=30,
    title_x=0.5,
    xaxis_title="IMDB Score",
    template="plotly",
    paper_bgcolor='rgb(229, 247, 255)',
    plot_bgcolor='rgb(229, 247, 255)',
    legend_title="Genre",
)

fig.show()


##### 1. Why did you pick the specific chart?

This chart, a density plot, was chosen because it effectively visualizes the IMDB score distribution across different genres. Here’s why it’s useful:

✅ Comparison Across Genres – It allows us to see which genres tend to have higher or lower IMDB scores.

✅ Trends & Patterns – The peaks of each curve indicate the most common ratings for that genre.

✅ Strategic Insights – It helps identify high-performing genres (e.g., those centered around higher IMDB scores) versus those that struggle with lower ratings.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that documentaries, history, and drama tend to have higher IMDb scores, while reality TV and horror have lower scores on average. It also shows that most genres cluster around a 5-7 rating, with some outliers in both directions. This insight can help streaming platforms prioritize high-rated genres for better audience engagement.
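
Overlapping density curves can be hard to rank by eye, so a small summary table is a useful companion. The sketch below assumes the same one-hot genre columns used in the plot; `sample_df` and `genre_score_summary` are hypothetical and the scores are invented.

```python
import pandas as pd

# Hypothetical sample; the notebook itself would pass title_df and sorted(genres).
sample_df = pd.DataFrame({
    "documentary": [1, 1, 0, 0, 0],
    "horror": [0, 0, 1, 1, 1],
    "imdb_score": [8.0, 7.0, 4.0, 5.0, None],
})


def genre_score_summary(df, genres):
    """Return count, median, and quartiles of IMDb score for each genre."""
    rows = {}
    for genre in genres:
        scores = df.loc[df[genre] == 1, "imdb_score"].dropna()
        rows[genre] = {
            "n_titles": len(scores),
            "median": scores.median(),
            "q1": scores.quantile(0.25),
            "q3": scores.quantile(0.75),
        }
    summary = pd.DataFrame.from_dict(rows, orient="index")
    return summary.sort_values("median", ascending=False)


summary = genre_score_summary(sample_df, ["documentary", "horror"])
print(summary)
```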

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can create a positive business impact by helping streaming platforms prioritize high-rated genres like documentaries, history, and drama to attract quality-focused audiences and boost subscriptions.

However, insights showing low ratings for reality TV and horror could indicate negative growth potential if these genres dominate the content library. If viewers associate the platform with lower-rated content, it may reduce user retention and brand value. Balancing content strategy by investing in both popular and high-rated genres is key to long-term success.

#### Chart - 11 - Age Certification Distribution

In [None]:
# Assuming your dataframe is named 'title_df' and contains an 'age_certification' column
age_certification_counts = title_df['age_certification'].value_counts()
plt.figure(figsize=(10, 6))  # Adjust figure size if needed
sns.barplot(x=age_certification_counts.index, y=age_certification_counts.values, palette='viridis')
plt.title('Age Certification Distribution on Amazon Prime')
plt.xlabel('Age Certification')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

I picked this bar chart because it effectively visualizes the distribution of titles by age certification on Amazon Prime. Here’s why it’s the best choice:

✅ Clear Comparison – The bar chart allows for an easy comparison of content availability across different age categories.

✅ Highlighting Imbalances – The dominance of "Unknown" and "R" rated content becomes immediately noticeable.

✅ Actionable Insights – The stark contrast between mature and family-friendly content suggests areas for improvement, helping in strategic decision-making.

##### 2. What is/are the insight(s) found from the chart?

Insights from the Chart

High ‘Unknown’ Category 🚨 :
A massive number of titles have an "Unknown" age certification, making it difficult for users to filter content based on age appropriateness.

Dominance of ‘R’ Rated Content 🎬 :
Amazon Prime has a strong presence of R-rated movies, indicating a focus on mature audiences.

Limited Family & Kid-Friendly Content 👨‍👩‍👧‍👦:
Categories like G, TV-Y, TV-G, and TV-PG have significantly fewer titles, suggesting a lack of content for younger audiences.
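
The size of the "Unknown" group is easier to discuss as a share of the catalog. This sketch assumes that missing certifications are what the chart labels "Unknown"; `sample_certs` and `certification_shares` are hypothetical and the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the notebook itself would pass title_df['age_certification'].
sample_certs = pd.Series(
    ["R", "PG-13", np.nan, "R", np.nan, np.nan, "TV-Y", "G"],
    name="age_certification",
)


def certification_shares(certs):
    """Return title counts and percentage shares, labeling missing values 'Unknown'."""
    filled = certs.fillna("Unknown")
    counts = filled.value_counts()
    shares = (counts / len(filled) * 100).round(1)
    return pd.DataFrame({"titles": counts, "share_pct": shares})


shares = certification_shares(sample_certs)
print(shares)
```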

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Impact on Business

✅ Positive Impact – Strong R-rated content can attract a mature audience, boosting engagement for thrillers, dramas, and action movies.

❌ Negative Impact – A lack of age certification for many titles may reduce user trust and make parental control ineffective, leading to churn among family users.

💡 Solution – Improving metadata accuracy and expanding kid-friendly content can enhance inclusivity and increase subscriber retention. 🚀

#### Chart - 12 - IMDB Scores by Age Certifications

In [None]:
plt.figure(figsize=(12, 6))  # Adjust figure size if needed
sns.boxplot(x='age_certification', y='imdb_score', data=title_df, palette='viridis')
plt.title('IMDb Scores by Age Certification')
plt.xlabel('Age Certification')
plt.ylabel('IMDb Score')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a box plot because it provides a comprehensive view of IMDb score distribution across different age certifications.

Why this chart?

✅ Clear Comparison – It effectively highlights which age categories tend to have higher or lower ratings.

✅ Outlier Detection – The box plot reveals extreme values (low-rated movies in ‘Unknown’ and ‘NC-17’ categories), helping identify content that may require quality control.

##### 2. What is/are the insight(s) found from the chart?

📊 Key Insights from the Chart:

TV-MA and TV-PG Perform Well – These age certifications have the highest median IMDb scores, indicating that both mature content and family-friendly content tend to receive better audience ratings.

NC-17 Struggles – Content rated NC-17 has the lowest median IMDb scores, suggesting that extreme or highly restricted content is not well-received.

Wide Score Variations – PG, G, and PG-13 movies show a large spread of IMDb scores, meaning their quality and audience reception vary significantly.
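
Medians from a box plot are only as reliable as the number of titles behind them, which matters for small groups such as NC-17. The sketch below reports the group size next to the median and spread; the threshold `min_titles` is an arbitrary assumption, and `sample_df` and `score_by_certification` are hypothetical names with invented data.

```python
import pandas as pd

# Hypothetical sample; the notebook itself would pass title_df.
sample_df = pd.DataFrame({
    "age_certification": ["TV-MA", "TV-MA", "TV-MA", "NC-17", "PG", "PG"],
    "imdb_score": [7.5, 8.0, 7.0, 4.0, 5.0, 7.0],
})


def score_by_certification(df, min_titles=2):
    """Return count, median, and IQR of IMDb score per age certification."""
    grouped = df.dropna(subset=["imdb_score"]).groupby("age_certification")["imdb_score"]
    summary = grouped.agg(n_titles="count", median="median")
    summary["iqr"] = grouped.quantile(0.75) - grouped.quantile(0.25)
    # Flag groups too small for their median to be trusted
    summary["small_sample"] = summary["n_titles"] < min_titles
    return summary.sort_values("median", ascending=False)


cert_summary = score_by_certification(sample_df)
print(cert_summary)
```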

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

🚀 Business Impact & Risks

✅ Positive Business Impact:

Targeting High-Performing Categories – Focusing on TV-MA and TV-PG content, which have higher median IMDb scores, can drive better audience engagement and retention.

Strategic Content Investment – Prioritizing age certifications that consistently receive better ratings helps in making data-driven production and acquisition decisions.


⚠️ Negative Growth Risks:

NC-17 Underperformance – This category has the lowest median IMDb scores, which may lead to poor audience reception, lower watch times, and negative reviews, impacting the platform’s credibility.

"Unknown" Certification Pitfall – A significant number of low-rated outliers in this category indicate potential issues with unclassified or poorly marketed content, leading to viewer dissatisfaction.


#### Chart - 13 - Runtime vs IMDB Score vs Release Year

In [None]:
# Chart - 13 visualization code

# Assuming your dataframe is named 'title_df'
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Create the scatter plot
ax.scatter(title_df['runtime'], title_df['imdb_score'], title_df['release_year'],
           c=title_df['release_year'], cmap='viridis')  # Color points by release year

# Set labels and title
ax.set_xlabel('Runtime')
ax.set_ylabel('IMDb Score')
ax.set_zlabel('Release Year')
ax.set_title('3D Scatter Plot of Runtime, IMDb Score, and Release Year')

# Show the plot
plt.show()



##### 1. Why did you pick the specific chart?

A 3D Scatter Plot was chosen because it best visualizes the relationship between runtime, IMDb score, and release year in a way that a 2D chart cannot. Here's why:

1️⃣ Multidimensional Storytelling 📊 – It captures how movie length and ratings have evolved over time.

2️⃣ Trend Identification 🔍 – Helps spot patterns, such as whether newer movies tend to have higher scores.

3️⃣ Runtime vs. Popularity ⏳ vs. ⭐ – Shows if longer or shorter movies generally receive better IMDb ratings.

4️⃣ Historical Perspective 🎞️ – Highlights how movie characteristics have changed over decades.

##### 2. What is/are the insight(s) found from the chart?

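1️⃣ Modern Movies Appear to Have Higher IMDb Scores ⭐ – Recent movies (closer to 2020, shown as yellow points) appear to reach higher IMDb scores. The correlation heatmap in Chart 14 shows that release year and IMDb score are almost uncorrelated, so this visual impression should be read cautiously.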

2️⃣ Majority of Movies Fall Within 50-150 Minutes ⏳ – Most movies are concentrated in this range, indicating a common runtime preference.

3️⃣ Older Movies Have a Broader Runtime Range 🎥 – Movies from earlier decades show a wider spread in runtime, some exceeding 300 minutes.

4️⃣ Shorter Movies Are More Common 🟡 – As runtime increases, the density of movies decreases, suggesting long movies are rare.
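
Patterns in a 3D scatter depend on the viewing angle, so a per-decade table is a simple cross-check for the points above. This is a sketch under the assumption that `release_year`, `runtime`, and `imdb_score` are numeric columns; `sample_df` and `summarize_by_decade` are hypothetical and the data is invented.

```python
import pandas as pd

# Hypothetical sample; the notebook itself would pass title_df.
sample_df = pd.DataFrame({
    "release_year": [1955, 1958, 1994, 1999, 2015, 2019, 2021],
    "runtime": [300, 90, 120, 100, 95, 110, 105],
    "imdb_score": [6.0, 7.0, 6.5, 7.5, 6.0, 8.0, None],
})


def summarize_by_decade(df):
    """Return title count, median score, and runtime statistics per release decade."""
    data = df.dropna(subset=["release_year", "runtime", "imdb_score"]).copy()
    # Map each year to the start of its decade, e.g. 1994 -> 1990
    data["decade"] = (data["release_year"] // 10) * 10
    return data.groupby("decade").agg(
        n_titles=("imdb_score", "size"),
        median_score=("imdb_score", "median"),
        median_runtime=("runtime", "median"),
        max_runtime=("runtime", "max"),
    )


decades = summarize_by_decade(sample_df)
print(decades)
```

If median scores stay roughly flat across decades while `max_runtime` shrinks, the runtime observation holds but the score trend does not.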

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact: Lights, Camera, Action! 🎬💰

✅ Positive Impact:

The insights help streaming platforms and studios tailor content strategies. If higher IMDb scores correlate with certain runtimes, platforms can recommend or produce content that aligns with audience preferences. Plus, spotting trends in newer vs. older movies helps predict what viewers want next!

⚠️ Potential Negative Growth:

If runtime is too long and affects engagement, audiences may drop off, leading to lower watch time and retention. Similarly, if older movies with lower scores dominate a catalog, it could lead to subscriber churn on streaming platforms.

#### Chart - 14 - Correlation Heatmap

In [None]:
numerical_features = ['release_year', 'runtime', 'imdb_score', 'imdb_votes']
correlation_matrix = title_df[numerical_features].corr()
plt.figure(figsize=(10, 8))  # Adjust figure size if needed
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

##### 1. Why did you pick the specific chart?

The correlation matrix was chosen because it provides a clear and concise way to understand relationships between different numerical variables (release year, runtime, IMDb score, and IMDb votes).

Here’s why it’s useful:

Identifies Key Influences: It helps detect whether factors like runtime or release year significantly impact IMDb scores or audience engagement.

Highlights Weak or Strong Correlations: Instead of just assuming trends, this chart quantifies the strength of relationships.

Simplifies Complex Data: A heatmap visually distinguishes strong and weak correlations, making it easier to interpret at a glance.

Helps in Decision-Making: Understanding these relationships can help content creators and platforms like Amazon Prime strategize their content better—whether to focus on shorter films, newer releases, or highly rated content.

##### 2. What is/are the insight(s) found from the chart?

Insights from the Correlation Matrix Chart:

1.   No Strong Relationships: The correlation values are all quite low, suggesting that no single factor (runtime, release year, IMDb score, or votes) heavily influences the others.

2.   IMDb Score & Votes (0.17): There is a slight positive correlation, meaning content with higher IMDb ratings tends to receive more votes, but the relationship is weak.

3.   Runtime & IMDb Score (-0.10): A weak negative correlation suggests that longer content does not necessarily lead to better IMDb ratings.

4.   Release Year & Other Factors: Release year has almost no correlation with IMDb scores or votes, indicating that newer content does not guarantee better engagement or popularity.
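
Vote counts are usually heavily skewed, and Pearson correlation can understate a relationship in that situation. The sketch below compares Pearson, Spearman, and Pearson on log-transformed votes; it is an optional robustness check, not part of the original analysis, and `sample_df` and `correlation_views` are hypothetical names with invented data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the notebook itself would pass title_df.
sample_df = pd.DataFrame({
    "release_year": [1990, 2005, 2012, 2018, 2020],
    "runtime": [120, 95, 45, 150, 30],
    "imdb_score": [7.1, 5.4, 8.0, 6.2, 6.9],
    "imdb_votes": [250000, 1200, 90000, 300, 4500],
})
numerical_features = ["release_year", "runtime", "imdb_score", "imdb_votes"]


def correlation_views(df, cols):
    """Return Pearson, Spearman, and log-votes Pearson correlation matrices."""
    numeric = df[cols].dropna()
    pearson = numeric.corr(method="pearson")
    spearman = numeric.corr(method="spearman")
    # log1p compresses the long right tail of vote counts
    logged = numeric.assign(imdb_votes=np.log1p(numeric["imdb_votes"]))
    pearson_log = logged.corr(method="pearson")
    return pearson, spearman, pearson_log


pearson, spearman, pearson_log = correlation_views(sample_df, numerical_features)
print(spearman.round(2))
```

Any of the three matrices can be passed to `sns.heatmap` in the same way as `correlation_matrix` above.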



#### Chart - 15 - Pair Plot for Movies and TV Shows

In [None]:
movies_df = title_df[title_df['type'] == 'MOVIE']
shows_df = title_df[title_df['type'] == 'SHOW']
sns.pairplot(movies_df[['imdb_score', 'release_year', 'runtime', 'imdb_votes']], kind='scatter', diag_kind='kde', hue='imdb_score')
plt.suptitle('Pair Plot for Movies', y=1.02)
plt.show()
sns.pairplot(shows_df[['imdb_score', 'release_year', 'runtime', 'imdb_votes']], kind='scatter', diag_kind='kde', hue='imdb_score')
plt.suptitle('Pair Plot for TV Shows', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

The specific pair plot was chosen because it effectively visualizes relationships between multiple numerical variables, such as release year, runtime, IMDb votes, and IMDb scores.

Reasons for choosing this chart:


1.   Comprehensive Comparison – It allows for a side-by-side analysis of how different variables interact, helping identify trends and correlations.

2.   Density Visualization – The inclusion of KDE (Kernel Density Estimation) contours highlights data concentration areas, making it easier to spot patterns.

3.   Outlier Detection – The scatter plots reveal anomalies, such as movies with extremely high IMDb votes or unusually long runtimes.

4.   Easy Comparisons – By using different hues for IMDb scores, it visually differentiates high-rated and low-rated content.




##### 2. What is/are the insight(s) found from the chart?

From the charts, the following insights can be observed:

Most Popular Movies & TV Shows – The charts highlight the most popular movies and TV shows based on IMDb votes, with movies like Titanic and shows from recent decades dominating.

Trends Over Time – There is a clear increase in the number of movies and TV shows produced after the 1980s, showing the growth of the entertainment industry.

Runtime vs. Popularity – Most popular movies and TV shows tend to have a runtime within a specific range (typically under 200 minutes for movies and 60 minutes per episode for TV shows).

IMDb Score Distribution – Higher-rated movies and shows generally receive more votes, reinforcing the correlation between critical acclaim and audience engagement.
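
Using the raw `imdb_score` as `hue` gives one color level per distinct score, which tends to be slow to render and hard to read. A possible alternative is to bin scores into a few bands first. The cut points and labels below are arbitrary assumptions, and `sample_df` and `add_score_band` are hypothetical names.

```python
import pandas as pd

# Hypothetical sample; the notebook itself would pass movies_df or shows_df.
sample_df = pd.DataFrame({"imdb_score": [3.2, 5.5, 6.9, 7.0, 8.4, 9.1, None]})


def add_score_band(df, bins=(0, 5, 7, 10), labels=("low", "medium", "high")):
    """Return a copy of df with a categorical 'score_band' column."""
    out = df.copy()
    # Intervals are right-inclusive: (0, 5], (5, 7], (7, 10]
    out["score_band"] = pd.cut(out["imdb_score"], bins=list(bins), labels=list(labels))
    return out


banded = add_score_band(sample_df)
print(banded)
```

The banded frame could then be plotted with `sns.pairplot(banded_movies, vars=[...], hue='score_band')`, where `banded_movies` is a hypothetical name for the result of applying the helper to `movies_df`.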

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the Exploratory Data Analysis (EDA) of Amazon Prime, here are strategic recommendations to maximize business growth:

📌 1. Invest in High-Performing Genres

Drama, Action, Comedy, and Thriller are the most popular genres based on IMDB votes.
Strategy: Increase investments in these genres through exclusive Amazon Originals and licensing high-rated content.

📌 2. Leverage Genre-Specific Marketing

Thriller and Romance have high engagement—perfect for targeted marketing campaigns.
Strategy: Use personalized recommendations and AI-driven promotions to boost viewership.

📌 3. Optimize Content for Series vs. Movies

Drama dominates both movies and series, indicating strong audience loyalty to long-format storytelling.
Strategy: Focus on long-running series and anthology-based content to retain subscribers.

📌 4. Strengthen Niche & Underrated Genres

Genres like Documentary, Music, and War have lower engagement but can attract specific user bases.
Strategy: Improve visibility through bundled content strategies and AI-driven curation to push niche content to relevant audiences.

# **Conclusion**

Our exploratory data analysis (EDA) of Amazon Prime content reveals key insights into its library. We observed a dominance of recent releases, indicating a strong focus on fresh content. IMDb scores suggest a mix of highly rated and average content, with certain genres consistently performing well.

Key Takeaways:

*   Movies vs. TV Shows – Movies have a wider range of runtimes, while TV shows tend to have more clustered durations.

*   Viewer Engagement – High IMDb votes correlate with popular titles, reflecting audience preference.

*   Genre & Ratings – Some genres consistently receive higher ratings, guiding content curation decisions.




### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***