<a href="https://colab.research.google.com/github/praneshpagadala-pixel/Pranesh-DataScience-Portfolio/blob/main/Capstone_Project_Exploratory_Data_Analysis_by_Pranesh_Pagadala.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Amazon Prime TV Shows and Movies



<pre>
Project Type - Exploratory Data Analysis (EDA)
Contribution - Individual
Team Member 1 - Pranesh Pagadala
Team Member 2 - N/A
Team Member 3 - N/A
Team Member 4 - N/A
</pre>


# **Project Summary -**

The streaming industry is undergoing rapid transformation, with platforms competing not only on the size of their catalogs but also on the quality, diversity, and engagement potential of their content. This project set out to analyze Prime's catalog using IMDb data, with the goal of identifying actionable insights that could help achieve the business objective of maximizing audience satisfaction, sustaining subscriber growth, and strengthening competitive positioning. Through a structured sequence of 20 rubric-aligned charts, the analysis provided a multi-dimensional view of Prime's content strategy, ranging from quality metrics to audience engagement and catalog diversity.

The early charts focused on content quality and distribution, examining IMDb scores across certifications, genres, runtimes, and release years. Boxplots revealed that family-friendly certifications such as PG and TV-G tend to have higher median scores, while mature certifications like R and TV-MA show wider variability, reflecting polarizing audience reactions. Runtime analysis by genre highlighted that dramas and action titles generally run longer, while comedies and animations are shorter, aligning with audience expectations for pacing. These insights emphasize the importance of tailoring content length and maturity levels to audience preferences, ensuring satisfaction across different demographics.

Subsequent charts shifted attention to audience engagement patterns, particularly through IMDb votes. The votes distribution by certification showed that mature categories attract the highest engagement, while family-friendly titles maintain steady but smaller bases. This duality suggests that Prime must balance its catalog to avoid over-reliance on polarizing mature content while still leveraging its popularity. Genre distribution analysis further revealed that mainstream categories like Drama, Comedy, and Action dominate the catalog, while niche genres such as Documentary and Animation remain underrepresented. This imbalance presents both a risk and an opportunity: while mainstream genres drive mass appeal, niche genres can differentiate Prime and attract specialized audiences.

The analysis also explored catalog evolution and diversity. Release year trends demonstrated steady growth in content production, with fluctuations reflecting industry cycles and external factors such as strikes or pandemics. The IMDb score vs release year chart provided a quality perspective, showing whether newer titles maintain the same standards as older ones. Production country analysis highlighted which nations consistently deliver high-scoring content, guiding international acquisition and co-production strategies. Together, these charts underscore the need for Prime to maintain catalog freshness while ensuring quality consistency and global diversity.

From a business perspective, the insights translate into clear strategic actions. First, Prime should align content quality with audience demand, prioritizing certifications, genres, and countries that consistently deliver strong scores. Second, it should maximize engagement by promoting categories with high vote counts while ensuring diversity to appeal to broader demographics. Third, Prime must sustain catalog growth by balancing new releases with proven classics, and by diversifying production sources to reduce reliance on a few dominant countries or genres. These strategies collectively ensure that Prime's catalog remains competitive, appealing, and resilient in a dynamic market.

The solution to the business objective is therefore not about simply expanding the catalog, but about curating it strategically. By integrating quality insights, engagement patterns, and diversity strategies, Prime can create a balanced offering that resonates with audiences worldwide. Positive growth will come from leveraging high-scoring certifications, optimizing runtimes, and diversifying genres and countries. Negative growth risks, such as over-reliance on mature content or limited diversity, can be mitigated by maintaining balance and foresight in acquisitions and productions.

In conclusion, this project demonstrates that data-driven insights are essential for guiding streaming strategies. The 20 charts collectively provide a roadmap for Prime to achieve its business objective: a catalog that is not only large, but also high-quality, diverse, and engaging. By acting on these findings, Prime can strengthen its brand reputation, attract new subscribers, retain existing ones, and secure long-term growth in the competitive streaming landscape.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Prime Video has rapidly expanded its catalog of movies and TV shows, but faces increasing competition from other streaming platforms. To sustain growth and improve subscriber retention, Prime must understand the **composition, quality, and audience appeal** of its content library.  

Currently, Prime lacks a consolidated view of how its titles perform across key dimensions such as **genres, IMDb scores, runtime, votes, release year, and certifications**. Without these insights, it is difficult to identify strengths, gaps, and opportunities in content strategy.  

This project aims to analyze Prime's catalog using **exploratory data analysis (EDA)** and visualize findings through 20 rubric-aligned charts. The goal is to uncover patterns in content distribution, audience ratings, and correlations between features, ultimately guiding **business decisions in content acquisition, marketing, and recommendation systems**.  

By systematically documenting each chart with **why it was chosen, insights derived, and business impact**, the analysis ensures that Prime's leadership can make **data-driven decisions** to enhance catalog diversity, improve user satisfaction, and strengthen competitive positioning.


#### **Define Your Business Objective?**

The primary business objective of this project is to enable **Prime Video** to make data-driven decisions about its content strategy by analyzing the composition and performance of its catalog.  

Through the creation of 20 rubric-aligned charts, the analysis aims to:  
- **Identify dominant genres and audience preferences** to guide future content acquisition.  
- **Evaluate content quality using IMDb scores and votes**, ensuring Prime maintains a strong reputation for watchable titles.  
- **Discover correlations between features** (runtime, release year, ratings, votes) to understand what drives audience engagement.  
- **Highlight catalog gaps and underrepresented categories**, enabling diversification to attract new demographics.  
- **Support recommendation systems and marketing campaigns** by providing insights into what resonates most with subscribers.  

Ultimately, the objective is to transform raw catalog data into **actionable insights** that improve subscriber retention, enhance user satisfaction, and strengthen Prime's competitive positioning in the streaming market.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# ============================================================
#  Importing Required Libraries
# ============================================================

# Core libraries for data manipulation and numerical operations
import pandas as pd        # For handling datasets and dataframes
import numpy as np         # For numerical computations

# Visualization libraries
import matplotlib.pyplot as plt   # For basic plotting
import seaborn as sns             # For advanced statistical visualizations

# Standard-library helpers
import ast        # For safely parsing Python literals stored as strings
import warnings
warnings.filterwarnings('ignore') # Suppress warnings for cleaner output

# ============================================================
#  Visualization Style Settings
# ============================================================

# Use Seaborn's modern theme for consistent visuals
sns.set_theme(style="whitegrid", palette="Set2")



### Dataset Loading

In [None]:
# ============================================================
# Load Dataset
# ============================================================

# ============================================================
#  Mount Google Drive to Colab
# ============================================================

from google.colab import drive

# Mount Google Drive at /content/drive
drive.mount('/content/drive')


# Attempt to load the datasets from Google Drive
try:
    # Replace the paths below with the actual location of your files in Google Drive
    titles_df = pd.read_csv('/content/drive/MyDrive/Module 2 Datasets/titles.csv')
    credits_df = pd.read_csv('/content/drive/MyDrive/Module 2 Datasets/credits.csv')

    print("✅ Datasets loaded successfully!")

except FileNotFoundError as e:
    print("❌ Error: Dataset file not found. Please check the file path.")
    print(e)

except Exception as e:
    print("❌ An unexpected error occurred while loading the datasets.")
    print(e)


### Dataset First View

In [None]:
# ============================================================
# 👀 Dataset First View
# ============================================================

# Preview the first 10 rows of each dataset
print("\n--- Titles Dataset (First 10 Rows) ---")
print(titles_df.head(10))

print("\n--- Credits Dataset (First 10 Rows) ---")
print(credits_df.head(10))


### Dataset Rows & Columns count

In [None]:
# ============================================================
# 📏 Dataset Rows & Columns Count
# ============================================================

# Titles dataset shape
titles_rows, titles_cols = titles_df.shape
print(f"Titles Dataset → Rows: {titles_rows}, Columns: {titles_cols}")

# Credits dataset shape
credits_rows, credits_cols = credits_df.shape
print(f"Credits Dataset → Rows: {credits_rows}, Columns: {credits_cols}")


### Dataset Information

In [None]:
# ============================================================
# ℹ️ Dataset Info
# ============================================================

# Titles dataset information
print("\n--- Titles Dataset Info ---")
titles_df.info()

# Credits dataset information
print("\n--- Credits Dataset Info ---")
credits_df.info()

#### Duplicate Values

In [None]:
# ============================================================
#  Dataset Duplicate Value Count
# ============================================================

# Count duplicate rows in Titles dataset
titles_duplicates = titles_df.duplicated().sum()
print(f"Titles Dataset → Duplicate Rows: {titles_duplicates}")

# Count duplicate rows in Credits dataset
credits_duplicates = credits_df.duplicated().sum()
print(f"Credits Dataset → Duplicate Rows: {credits_duplicates}")


#### Missing Values/Null Values

In [None]:
# ============================================================
# ❓ Missing Values / Null Values Count
# ============================================================

# Titles dataset missing values
print("\n--- Missing Values in Titles Dataset ---")
print(titles_df.isnull().sum())

# Credits dataset missing values
print("\n--- Missing Values in Credits Dataset ---")
print(credits_df.isnull().sum())


In [None]:
# ============================================================
# 📉 Visualizing Missing Values
# ============================================================

# Titles dataset missing values
titles_missing = titles_df.isnull().sum()
titles_missing = titles_missing[titles_missing > 0]   # Only show columns with missing values

plt.figure(figsize=(10,6))
sns.barplot(x=titles_missing.index, y=titles_missing.values, palette="Set2")
plt.title("Missing Values in Titles Dataset")
plt.ylabel("Count of Missing Values")
plt.xticks(rotation=45)
plt.show()

# Credits dataset missing values
credits_missing = credits_df.isnull().sum()

plt.figure(figsize=(8,5))
sns.barplot(x=credits_missing.index, y=credits_missing.values, palette="Set2")
plt.title("Missing Values in Credits Dataset")
plt.ylabel("Count of Missing Values")
plt.xticks(rotation=45)
plt.show()



### What did you know about your dataset?

Key observations about the two datasets:

<span style="color:blue">Titles Dataset</span>  
- Size: 9,871 rows × 15 columns.
- Content: Contains metadata about Amazon Prime titles (movies and TV shows).
- Key Features:
  - title, type, description → basic identifiers and text.
  - release_year, runtime, seasons → temporal and duration attributes.
  - genres, production_countries, age_certification → categorical descriptors.
  - imdb_score, imdb_votes, tmdb_popularity, tmdb_score → audience ratings and popularity metrics.
- Missing Values:
  - Heavy missingness in age_certification (~65% missing).
  - seasons missing for most entries → confirms majority are movies.
  - Ratings (imdb_score, tmdb_score) and votes have ~10–20% missing.
- Insights:
  - Catalog spans 1912–2022, skewed toward recent years.
  - Median runtime ≈ 89 minutes → typical movie length.
  - Ratings cluster around 6–7, consistent with IMDb/TMDB averages.
  - Votes and popularity are highly skewed → few blockbusters dominate.

<span style="color:blue">Credits Dataset</span>  
- Size: 124,235 rows × 5 columns.
- Content: Contains cast and crew information linked to titles.
- Key Features:
  - person_id, name → identifiers for individuals.
  - role → actor, director, etc.
  - character → missing for ~16k entries (common for directors/crew).
- Insights:
  - Rich dataset for analyzing actor/director frequency.
  - Can be joined with titles_df via id for deeper analysis.

<span style="color:blue">Overall Understanding</span>  
- The dataset is movie-heavy, with fewer TV shows.
- Ratings and popularity metrics are skewed → log transformations help when visualizing them.
- Missing values are concentrated in certifications, seasons, and ratings.
- Strong potential for Univariate, Bivariate, and Multivariate analysis:
  - U → distribution of types, years, runtimes, ratings.
  - B → relationships between ratings, votes, popularity.
  - M → correlations across multiple numeric features, actor/director participation trends.
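The point about skewed vote and popularity counts can be illustrated with a small sketch. The vote counts below are hypothetical values chosen only to mimic a heavy right tail, and `log_votes` is an illustrative column name that does not appear elsewhere in the notebook. `np.log1p` is one common choice because it maps zero votes to zero rather than negative infinity.

```python
import numpy as np
import pandas as pd

# Hypothetical vote counts (not taken from the real dataset).
toy = pd.DataFrame({"imdb_votes": [0, 12, 150, 2_300, 48_000, 1_100_000]})

# log1p computes log(1 + x), so titles with zero votes map to 0 instead of -inf.
toy["log_votes"] = np.log1p(toy["imdb_votes"])

# Compare sample skewness before and after the transform.
raw_skew = toy["imdb_votes"].skew()
log_skew = toy["log_votes"].skew()
print(f"raw skew: {raw_skew:.2f}, log skew: {log_skew:.2f}")
```

On data shaped like this, the transformed column should be much closer to symmetric, which makes histograms and scatter plots easier to read.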


## ***2. Understanding Your Variables***

In [None]:
# ============================================================
# 📑 Dataset Columns
# ============================================================

# Titles dataset columns
print("\n--- Titles Dataset Columns ---")
print(titles_df.columns.tolist())

# Credits dataset columns
print("\n--- Credits Dataset Columns ---")
print(credits_df.columns.tolist())


In [None]:
# ============================================================
# 📊 Dataset Describe
# ============================================================

# Titles dataset descriptive statistics
print("\n--- Titles Dataset Describe ---")
print(titles_df.describe(include='all'))

# Credits dataset descriptive statistics
print("\n--- Credits Dataset Describe ---")
print(credits_df.describe(include='all'))


### Variables Description

<span style="color:blue">Titles Dataset Variables</span>  
- id → Unique identifier for each title.
- title → Name of the movie or TV show.
- type → Category of the title (Movie or TV Show).
- description → Short summary or storyline of the title.
- release_year → Year the title was released.
- age_certification → Age rating (e.g., PG, R, Unrated).
- runtime → Duration of the title in minutes.
- genres → Genres associated with the title (e.g., Drama, Comedy).
- production_countries → Countries where the title was produced.
- seasons → Number of seasons (only for TV shows).
- imdb_id → Unique identifier from IMDb database.
- imdb_score → IMDb rating score (1–10 scale).
- imdb_votes → Number of votes received on IMDb.
- tmdb_popularity → Popularity score from TMDB.
- tmdb_score → Rating score from TMDB (1–10 scale).
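
Multi-valued columns such as genres and production_countries usually need to be split before they can be counted. The sketch below assumes the values are stored as stringified Python lists (for example `"['drama', 'comedy']"`), which is an assumption and not something confirmed by the cells above. The toy rows, the `parse_list` helper, and the `genres_list` column are hypothetical.

```python
import ast
import pandas as pd

# Hypothetical rows; assumes genres arrive as strings that look like Python lists.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["['drama', 'comedy']", "['action']", "[]"],
})

def parse_list(value):
    """Convert a stringified list to a real list; return [] for anything unparseable."""
    if isinstance(value, list):
        return value
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return []
    return parsed if isinstance(parsed, list) else []

toy["genres_list"] = toy["genres"].apply(parse_list)

# explode() yields one row per (title, genre); empty lists become NaN and are dropped.
genre_counts = toy.explode("genres_list")["genres_list"].dropna().value_counts()
print(genre_counts)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text read from a CSV.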

<span style="color:blue">Credits Dataset Variables</span>  
- person_id → Unique identifier for each person (actor, director, crew).
- id → Identifier linking the person to a specific title.
- name → Name of the person (actor, director, etc.).
- character → Character name played (for actors).
- role → Role type (Actor, Director, Producer, etc.).
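
Because the credits table links to titles through the shared id column, the two can be combined with a merge. The following is a minimal sketch on hypothetical miniature frames; the names, ids, scores, and role labels are illustrative and not taken from the real files.

```python
import pandas as pd

# Hypothetical miniature versions of titles_df and credits_df.
toy_titles = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "title": ["Alpha", "Beta", "Gamma"],
    "imdb_score": [7.2, 5.9, 6.4],
})
toy_credits = pd.DataFrame({
    "person_id": [10, 11, 10, 12],
    "id": ["t1", "t1", "t2", "t9"],   # "t9" deliberately has no matching title
    "name": ["Ann", "Bo", "Ann", "Cy"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR", "ACTOR"],
})

# A left join keeps every credit row; indicator=True adds a _merge column
# that makes unmatched ids easy to spot.
merged = toy_credits.merge(toy_titles, on="id", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

# Average IMDb score per person across the titles they appear in.
avg_score_by_person = merged.groupby("name")["imdb_score"].mean()
print(avg_score_by_person)
print(f"Credits without a matching title: {len(unmatched)}")
```

Checking the unmatched rows first is a cheap way to confirm that the key column is consistent between the two files before drawing conclusions from the joined data.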


### Check Unique Values for each variable.

In [None]:
# ============================================================
# 🔎 Check Unique Values for Each Variable
# ============================================================

# Titles dataset unique values
print("\n--- Titles Dataset Unique Values ---")
print(titles_df.nunique())

# Credits dataset unique values
print("\n--- Credits Dataset Unique Values ---")
print(credits_df.nunique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# ============================================================
# 🧹 Make Dataset Analysis Ready
# ============================================================

# --- Titles Dataset Cleaning ---
# Fill missing text fields
titles_df['description'] = titles_df['description'].fillna("No description available")
titles_df['age_certification'] = titles_df['age_certification'].fillna("Unrated")

# Handle seasons: missing means it's a movie
titles_df['seasons'] = titles_df['seasons'].fillna(0)

# Handle numeric columns with missing values
titles_df['imdb_score'] = titles_df['imdb_score'].fillna(titles_df['imdb_score'].median())
titles_df['imdb_votes'] = titles_df['imdb_votes'].fillna(0)
titles_df['tmdb_popularity'] = titles_df['tmdb_popularity'].fillna(0)
titles_df['tmdb_score'] = titles_df['tmdb_score'].fillna(titles_df['tmdb_score'].median())

# --- Credits Dataset Cleaning ---
# Fill missing character names
credits_df['character'] = credits_df['character'].fillna("Unknown")

# --- Remove Duplicates ---
titles_df = titles_df.drop_duplicates()
credits_df = credits_df.drop_duplicates()

# --- Reset Indexes after cleaning ---
titles_df.reset_index(drop=True, inplace=True)
credits_df.reset_index(drop=True, inplace=True)

print("✅ Dataset is now cleaned and analysis-ready!")


### What all manipulations have you done and insights you found?


<span style="color:blue">Data Cleaning Manipulations</span>  
- Filled missing values in description with "No description available".  
- Replaced missing age_certification with "Unrated".  
- Filled missing seasons with 0 (indicating movies).  
- Imputed missing imdb_score and tmdb_score with their median values.  
- Replaced missing imdb_votes and tmdb_popularity with 0.  
- Filled missing character values in credits dataset with "Unknown".  
- Removed duplicate rows from both datasets.  
- Reset indexes after cleaning for consistency.  
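
One side effect of median imputation is that imputed rows become indistinguishable from observed ones. A possible refinement, sketched here on hypothetical scores, is to record a flag before filling. The `imdb_score_imputed` column is an illustrative name and is not part of the wrangling cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with two gaps.
toy = pd.DataFrame({"imdb_score": [6.0, np.nan, 8.0, 7.0, np.nan]})

# Record which rows were originally missing before overwriting them.
toy["imdb_score_imputed"] = toy["imdb_score"].isna()

# Series.median() skips NaN by default, so the median comes from observed values only.
median_score = toy["imdb_score"].median()
toy["imdb_score"] = toy["imdb_score"].fillna(median_score)

# Distribution plots can then use the observed subset when imputed values would distort them.
observed_only = toy.loc[~toy["imdb_score_imputed"], "imdb_score"]
print(toy)
```

With such a flag, charts of score distributions can be drawn with or without the filled rows, which helps avoid an artificial spike at the median.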

<span style="color:blue">Structural Insights</span>  
- Titles dataset has 9,871 entries with 15 variables.  
- Credits dataset has 124,235 entries with 5 variables.  
- Majority of entries are movies, with fewer TV shows.  
- Age certification is heavily missing, limiting analysis of family-friendly content.  
- Seasons column confirms most entries are movies (missing values filled with 0).  

<span style="color:blue">Statistical Insights</span>  
- Release years range from 1912 to 2022, skewed toward recent decades.  
- Median runtime ≈ 89 minutes, typical of movie length.  
- IMDb and TMDB scores cluster around 6–7, showing average quality ratings.  
- IMDb votes and TMDB popularity are highly skewed, with a few blockbusters dominating.  
- Credits dataset shows most roles are actors, with directors and producers forming smaller groups.  

<span style="color:blue">Business Impact Insights</span>  
- Positive: Identifying underrepresented high-rated titles can guide promotion strategies.  
- Negative Risk: Heavy reliance on movies may limit subscriber retention compared to TV shows.  
- Actionable: Expanding TV show catalog and highlighting niche but high-quality titles could improve engagement.  


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# ============================================================
# 📊 Chart 1 - Distribution of Titles by Type (Movies vs TV Shows)
# ============================================================

plt.figure(figsize=(6,6))
sns.countplot(data=titles_df, x="type", palette="Set2")
plt.title("Distribution of Titles by Type")
plt.xlabel("Type of Title")
plt.ylabel("Count")
plt.show()


##### 1. Why did you pick the specific chart?

<span style="color:blue">Reasons for choosing this chart are</span>  
- The variable **type** (Movie vs TV Show) is categorical with only two unique values.  
- A **count plot (bar chart)** is the most effective way to visualize the distribution of such categorical data.  
- It immediately shows the imbalance between movies and TV shows in the dataset.  
- This chart sets the foundation for further analysis by highlighting the **dominant content type** on the platform.  


##### 2. What is/are the insight(s) found from the chart?

<span style="color:blue"></span>  
- The dataset is heavily skewed toward **Movies**, with TV Shows forming a much smaller portion.  
- This imbalance suggests that Amazon Prime's catalog is **movie-dominant**, which could limit long-form engagement opportunities compared to platforms with more TV shows.  
- The smaller share of TV shows highlights a potential **content gap** that could be addressed to attract subscribers who prefer series.  
- For analysis, this means most statistical trends will be driven by movies, and separate treatment of TV shows may be necessary to avoid bias.  
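
The imbalance and the suggested separate treatment of each type can be quantified with a short sketch. The rows below are hypothetical, and the labels `"MOVIE"` and `"SHOW"` are assumed placeholders for whatever values the type column actually holds.

```python
import pandas as pd

# Hypothetical rows; the type labels are assumptions for illustration.
toy = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW"],
    "runtime": [90, 100, 110, 45],
})

# Percentage share of each type instead of raw counts.
type_share = toy["type"].value_counts(normalize=True).mul(100).round(1)

# Per-type statistics keep the majority class from dominating a pooled summary.
median_runtime_by_type = toy.groupby("type")["runtime"].median()

print(type_share)
print(median_runtime_by_type)
```

Reporting shares alongside the count plot makes the size of the imbalance explicit, and grouped summaries show where pooled statistics would mostly reflect movies.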


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

<span style="color:blue">Positive Business Impact</span>  
- The clear dominance of movies in the catalog highlights Amazon Prime's strength in offering a wide variety of films.  
- This can be leveraged in marketing campaigns to attract subscribers who primarily consume movies.  
- Identifying the imbalance also helps in **content strategy planning**: Prime can position itself as a go-to platform for movie lovers while gradually expanding TV show offerings.  

<span style="color:blue">Negative Growth Risk</span>  
- The smaller share of TV shows may limit long-term subscriber engagement, as many users prefer binge-worthy series.  
- Competitors with stronger TV show catalogs (e.g., Netflix) may attract and retain viewers who value serialized content.  
- Without addressing this gap, Prime risks **subscriber churn** among audiences seeking more diverse long‚Äëform entertainment.  

<span style="color:blue">Justification</span>  
- The imbalance is not inherently bad, but it signals a **strategic blind spot**.  
- Movies drive initial attraction, but TV shows sustain engagement over time.  
- Balancing the catalog by investing in high-quality series could mitigate churn and create stronger retention, ensuring sustainable growth.  


#### Chart - 2

In [None]:
# ============================================================
# 📊 Chart 2 - Distribution of Titles by Decade
# ============================================================

titles_df['decade'] = (titles_df['release_year'] // 10) * 10

plt.figure(figsize=(10,6))
sns.countplot(data=titles_df, x="decade", palette="Set2")
plt.title("Distribution of Titles by Decade")
plt.xlabel("Decade")
plt.ylabel("Count of Titles")
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

<span style="color:blue">Reasons for Choosing this Chart are </span>  
- The variable release_year is numerical but represents **time-based categorical data** (discrete years).  
- A count plot (bar chart) is an effective way to visualize how many titles were released in each decade.  
- This chart helps identify **trends over time**, such as growth in content production and acquisition.  
- Grouping by decade instead of individual year avoids overcrowded axis labels and makes the visualization clearer, while still showing the evolution of the catalog.  
- Understanding release year distribution is crucial for analyzing whether Prime's catalog is **modern-focused or balanced with classics**, which directly impacts audience appeal.  
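
The decade grouping relies on integer floor division. As a small worked example on hypothetical years, `(year // 10) * 10` drops the last digit, so 1979 falls in the 1970s bucket and 1980 starts the 1980s bucket.

```python
import pandas as pd

# Hypothetical release years spanning several decades.
years = pd.Series([1912, 1979, 1980, 1999, 2010, 2019, 2022], name="release_year")

# Floor division by 10 removes the final digit; multiplying by 10 restores the scale.
decades = (years // 10) * 10

# Counts per decade, ordered chronologically (the same numbers a count plot would show).
counts = decades.value_counts().sort_index()
print(counts)
```

Note that the most recent bucket covers only a few years, so a lower bar for the 2020s does not by itself indicate a decline.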

##### 2. What is/are the insight(s) found from the chart?

<span style="color:blue"></span>  
- The distribution shows that the majority of titles in the dataset were released in the **2000s and 2010s**, reflecting Amazon Prime's focus on modern content.  
- Very few titles exist before 1980, indicating limited representation of classic films.  
- A sharp increase in releases during the **2010s and early 2020s** highlights the streaming boom and aggressive content acquisition strategies.  
- Grouping by decade makes the trend clearer: steady growth across decades, with the most significant spike in the last two decades.  
- This suggests that Prime's catalog is **contemporary-heavy**, catering more to audiences seeking recent productions rather than archival classics.  


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

<span style="color:blue">Positive Business Impact</span>  
- The surge in titles during the 2010s and 2020s shows Amazon Prime's strong focus on **modern content acquisition**, which aligns with current audience demand.  
- This trend can be leveraged in marketing campaigns to highlight Prime's **up-to-date catalog**, appealing to subscribers who prefer contemporary releases.  
- Identifying the growth pattern also helps in **strategic planning**, ensuring Prime continues to invest in recent productions that drive engagement.  

<span style="color:blue">Negative Growth Risk</span>  
- The limited representation of classic films (pre-1980s) may alienate audiences who value **archival or nostalgic content**.  
- Competitors with richer classic catalogs could attract viewers seeking diversity beyond modern releases.  
- Over-reliance on recent titles risks **content fatigue**, as audiences may feel the platform lacks depth or historical variety.  

<span style="color:blue">Justification</span>  
- While the focus on modern releases supports subscriber growth, the absence of older classics creates a **strategic blind spot**.  
- A balanced catalog, combining contemporary hits with timeless classics, would broaden Prime's appeal and reduce churn.  
- Therefore, the insight is double-edged: it highlights a strength in modern content but also exposes a weakness in **catalog diversity**, which could hinder long-term growth if not addressed.  


#### Chart - 3

In [None]:
# ============================================================
# 📊 Chart 3 - Runtime Distribution
# ============================================================

plt.figure(figsize=(10,6))
sns.histplot(data=titles_df, x="runtime", bins=30, kde=True, color="skyblue")
plt.title("Distribution of Runtimes (All Titles)")  # titles_df is not filtered by type here
plt.xlabel("Runtime (minutes)")
plt.ylabel("Number of Titles")
plt.show()


##### 1. Why did you pick the specific chart?

<span style="color:blue">Reasons for Choosing this Chart are</span>  
- The variable **runtime** is a continuous numerical variable.  
- A histogram is the most effective way to visualize the **distribution of continuous values**, showing how runtimes are spread across the dataset.  
- This chart helps identify the **central tendency** (typical movie length), **spread**, and **outliers** (very short or very long films).  
- Runtime is a critical factor in viewer engagement, so understanding its distribution provides direct insights into **content strategy and audience satisfaction**.  


##### 2. What is/are the insight(s) found from the chart?

<span style="color:blue"></span>  
- The majority of movies in the dataset have runtimes clustered between **80–120 minutes**, which aligns with the industry standard for feature films.  
- A smaller proportion of titles fall below 60 minutes, representing **short films or specials**.  
- There are also a few titles exceeding 180 minutes, indicating **extended features or niche productions**.  
- The distribution is slightly **right-skewed**, showing that while most films are of standard length, there are occasional outliers with very long runtimes.  
- This confirms that Amazon Prime's catalog is primarily built around **mainstream movie lengths**, with limited representation of short-form and long-form content.  
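
The visual reading of central tendency, spread, outliers, and skew can be backed by a few summary numbers. The sketch below uses hypothetical runtimes and the common 1.5 × IQR fence rule for outliers, which is one convention among several and is not taken from the notebook.

```python
import pandas as pd

# Hypothetical runtimes in minutes, including one short and one very long title.
runtime = pd.Series([45, 82, 88, 90, 95, 101, 118, 210], name="runtime")

median = runtime.median()
q1, q3 = runtime.quantile([0.25, 0.75])
iqr = q3 - q1

# Tukey-style fences: values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = runtime[(runtime < lower_fence) | (runtime > upper_fence)]

# Positive sample skewness indicates a longer tail toward high values (right skew).
skew = runtime.skew()

print(f"median={median}, IQR={iqr}, skew={skew:.2f}")
print("outliers:", outliers.tolist())
```

A positive skew value is consistent with a right-skewed histogram, where a handful of very long titles stretch the upper tail.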


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

<span style="color:blue">Positive Business Impact</span>  
- The clustering of runtimes around **80–120 minutes** shows that Prime's catalog aligns with industry norms, ensuring **viewer satisfaction** and reducing drop-off rates.  
- The presence of short films and extended features adds **variety**, which can attract niche audiences and broaden Prime's appeal.  
- Insights from runtime distribution can guide **content acquisition strategy**, balancing mainstream runtimes with specialty formats to maximize engagement.  

<span style="color:blue">Negative Growth Risk</span>  
- Very long runtimes (>180 minutes) may discourage casual viewers, leading to **lower completion rates** and reduced engagement.  
- The limited number of short-form titles may restrict Prime's ability to capture audiences who prefer **quick, snackable content**.  
- Over-reliance on standard runtimes could make the catalog feel **predictable**, reducing differentiation from competitors.  

<span style="color:blue">Justification</span>  
- While the catalog's focus on standard runtimes supports mainstream appeal, ignoring short-form and long-form niches creates a **strategic blind spot**.  
- Expanding both ends of the runtime spectrum would strengthen Prime's competitive edge, improve retention, and reduce churn.  
- Therefore, the insight is double-edged: it highlights a strength in mainstream alignment but also exposes a weakness in **runtime diversity**, which could hinder long-term growth if not addressed.  


#### Chart - 4

In [None]:
# ============================================================
# 📊 Chart 4 - IMDb Score Distribution
# ============================================================

plt.figure(figsize=(10,6))
sns.histplot(data=titles_df, x="imdb_score", bins=20, kde=True, color="mediumseagreen")
plt.title("Distribution of IMDb Scores")
plt.xlabel("IMDb Score")
plt.ylabel("Number of Titles")
plt.show()


##### 1. Why did you pick the specific chart?

<span style="color:blue">Reasons for Choosing this Chart are</span>  
- The variable **IMDb Score** is a continuous numerical variable that reflects aggregated ratings from IMDb users.  
- A histogram is the most effective way to visualize the **distribution of scores**, showing the central tendency, spread, and skewness.  
- This chart helps identify whether Prime's catalog is dominated by **average-rated titles** or contains a significant number of **highly acclaimed or poorly received titles**.  
- IMDb scores are a direct measure of **content quality perception**, making this visualization crucial for understanding catalog strength and guiding business decisions.  


##### 2. What is/are the insight(s) found from the chart?

- The majority of titles have IMDb scores clustered between **5 and 8**, indicating that most of Prime's catalog is of **average to above-average quality**.  
- Very few titles fall below 3, showing that **poorly rated content is minimal** in the catalog.  
- Similarly, only a small fraction of titles exceed 9, meaning **exceptionally acclaimed content is rare**.  
- The distribution is slightly **left-skewed**: the tail extends further toward low scores, and very high-rated titles are scarcer than mid-range ones.  
- This suggests that Prime's catalog is dominated by **moderately successful titles**, with limited representation of both extremes (very poor or very excellent).  
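
The direction of skew can be checked numerically instead of read off the histogram. The sketch below is a minimal illustration using pandas' sample skewness; the helper name `skew_summary` and the toy scores are hypothetical stand-ins for `titles_df["imdb_score"]`.

```python
import pandas as pd

def skew_summary(values: pd.Series) -> dict:
    """Mean, median, and sample skewness of a numeric column."""
    clean = values.dropna()
    return {
        "mean": float(clean.mean()),
        "median": float(clean.median()),
        # Negative skew: longer tail toward low values; positive: toward high values.
        "skew": float(clean.skew()),
    }

# Hypothetical toy scores with one very low outlier.
toy_scores = pd.Series([2.0, 5.5, 6.0, 6.5, 7.0, 7.0, 7.5, 8.0])
summary = skew_summary(toy_scores)
print(summary)
```

A negative `skew` together with a mean below the median is consistent with a left-skewed distribution. The same helper could be applied to `runtime`, where a positive value would be expected.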


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

<span style="color:blue">Positive Business Impact</span>  
- The clustering of IMDb scores between **5 and 8** shows that Prime's catalog is largely composed of **average to above-average quality titles**, which supports consistent viewer satisfaction.  
- The very small proportion of poorly rated titles (<3) indicates that Prime maintains a **quality threshold**, reducing the risk of negative user experiences.  
- These insights can be leveraged in **marketing campaigns** to highlight Prime's strong base of well-rated content, improving brand perception and attracting new subscribers.  
- Identifying the distribution also helps in **content acquisition strategy**, ensuring Prime continues to invest in titles that meet or exceed audience expectations.  

<span style="color:blue">Negative Growth Risk</span>  
- The limited number of highly acclaimed titles (>9) may weaken Prime's ability to compete with platforms that showcase **prestige or award-winning content**.  
- Over-reliance on mid-range titles could make the catalog feel **generic**, reducing differentiation from competitors.  
- Without increasing the share of top-rated titles, Prime risks losing **high-value subscribers** who seek critically acclaimed or premium content.  

<span style="color:blue">Justification</span>  
- While the catalog's dominance of average-rated titles ensures broad appeal, the scarcity of exceptional content creates a **strategic blind spot**.  
- Investing in more **critically acclaimed productions** or acquiring award-winning titles would strengthen Prime's reputation and attract discerning audiences.  
- Therefore, the insight is double-edged: it highlights a strength in maintaining consistent quality but also exposes a weakness in **prestige content representation**, which could hinder long-term growth if not addressed.  


#### Chart - 5

In [None]:
# ============================================================
# 📊 Chart 5 - IMDb Votes Distribution
# ============================================================

plt.figure(figsize=(10,6))
# Bin on a log scale (log_scale=True) so the bars stay evenly spaced;
# titles with zero votes are dropped because log(0) is undefined.
votes_df = titles_df[titles_df["imdb_votes"] > 0]
sns.histplot(data=votes_df, x="imdb_votes", bins=30, kde=True, color="darkorange", log_scale=True)
plt.title("Distribution of IMDb Votes")
plt.xlabel("IMDb Votes (log scale)")
plt.ylabel("Number of Titles")
plt.show()


##### 1. Why did you pick the specific chart?

<span style="color:blue">Reasons for Choosing this Chart are</span>  
- The variable **IMDb Votes** is a continuous numerical variable that reflects **audience engagement and popularity** of titles.  
- A histogram is the most effective way to visualize the **distribution of votes**, showing how widely titles are rated (a proxy for how widely they are watched).  
- Since vote counts vary drastically (from hundreds to millions), applying a **logarithmic scale** makes the visualization clearer and avoids distortion from extreme outliers.  
- This chart helps identify whether Prime's catalog is dominated by **high-engagement blockbusters** or by **low-visibility titles**, which is critical for understanding audience reach and guiding marketing or acquisition strategies.  


##### 2. What is/are the insight(s) found from the chart?

- The distribution shows that most titles have **relatively low IMDb vote counts**, indicating limited audience reach or niche popularity.  
- A small number of titles dominate with **very high vote counts**, reflecting blockbuster movies or globally recognized shows.  
- The data is heavily **right-skewed**, meaning audience engagement is concentrated in a few titles rather than evenly spread across the catalog.  
- This suggests that Prime's catalog contains a **long tail of lesser-known content**, with only a handful of titles driving the majority of visibility and engagement.  
- The log scale highlights this imbalance clearly, showing the gap between **mainstream hits and under-engaged titles**.  
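
One way to quantify the long tail is to measure what share of all votes the top few titles hold. The sketch below is an illustration under stated assumptions: `top_share` and its `top_fraction` parameter are hypothetical names, and the toy vote counts stand in for `titles_df["imdb_votes"]`.

```python
import pandas as pd

def top_share(votes: pd.Series, top_fraction: float = 0.1) -> float:
    """Share of all votes held by the top `top_fraction` of titles."""
    clean = votes.dropna().sort_values(ascending=False)
    # Always keep at least one title in the "top" group.
    n_top = max(1, int(round(len(clean) * top_fraction)))
    return float(clean.iloc[:n_top].sum() / clean.sum())

# Hypothetical toy vote counts with one blockbuster.
toy_votes = pd.Series([50, 80, 120, 200, 300, 500, 900, 1500, 40000, 900000])
share = top_share(toy_votes, top_fraction=0.1)
print(f"Top 10% of titles hold {share:.1%} of votes")
```

A value close to 1 would support the claim that a handful of titles drive most of the engagement; a value near `top_fraction` would indicate an even spread.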


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

<span style="color:blue">Positive Business Impact</span>  
- The presence of a few titles with **very high IMDb vote counts** indicates strong audience engagement and popularity.  
- These blockbuster titles can be leveraged in **marketing campaigns** to attract new subscribers and retain existing ones.  
- The insights help Prime identify which titles are **driving visibility**, allowing focused investment in similar high-engagement content.  
- Understanding vote distribution also supports **content recommendation algorithms**, ensuring popular titles are surfaced to maximize watch time.  

<span style="color:blue">Negative Growth Risk</span>  
- The majority of titles have **low vote counts**, suggesting limited audience reach and weak engagement.  
- This imbalance means Prime's catalog relies heavily on a small set of hits, which is risky if those titles lose relevance.  
- Under-engaged titles may contribute little to subscriber retention, creating inefficiencies in **content acquisition and licensing costs**.  
- If not addressed, the long tail of low-visibility content could lead to **negative growth**, as audiences may perceive the catalog as lacking in standout options.  

<span style="color:blue">Justification</span>  
- While the blockbuster titles provide strong positive impact, the skewed distribution exposes a **strategic vulnerability**.  
- To sustain growth, Prime must balance its catalog by acquiring or producing more **mid-tier and high-engagement titles**, reducing reliance on a few hits.  
- Therefore, the insight is double-edged: it highlights the strength of blockbuster appeal but also reveals a weakness in **overall catalog engagement**, which could hinder long-term growth if not corrected.  


#### Chart - 6

In [None]:
# ============================================================
# 📊 Chart 6 - Age Certification Distribution
# ============================================================

plt.figure(figsize=(10,6))
sns.countplot(data=titles_df, x="age_certification", palette="coolwarm", order=titles_df['age_certification'].value_counts().index)
plt.title("Distribution of Titles by Age Certification")
plt.xlabel("Age Certification")
plt.ylabel("Number of Titles")
plt.show()


##### 1. Why did you pick the specific chart?

<span style="color:blue">Reasons for Choosing this Chart are</span>  
- The variable **Age Certification** is a categorical variable (e.g., G, PG, PG-13, R, etc.).  
- A count plot is the most effective way to visualize the **frequency distribution of categories**, showing how titles are spread across different certification levels.  
- This chart helps identify whether Prime's catalog is more focused on **family-friendly content** or **mature/adult content**, which is critical for audience segmentation.  
- Age certification directly impacts **viewer accessibility and parental controls**, making this visualization essential for understanding catalog positioning and guiding content strategy.  


##### 2. What is/are the insight(s) found from the chart?

*  Unrated titles dominate in volume.  
The largest number of titles in the dataset are Unrated. This suggests that many productions either skip formal certification (common in independent films, international releases, or streaming content) or are categorized outside traditional rating systems. These titles can still be well rated despite lacking certification, although this count plot shows volume only and does not itself measure scores.

*  R-rated titles are the next largest group.  
Mature content has a significant presence, reflecting demand for adult-oriented films and shows. While these attract loyal audiences, they exclude younger viewers, limiting overall reach compared to broader certifications.

*  PG-13 follows after R.  
Although PG-13 is widely recognized as the mainstream commercial category, in this dataset it is not the largest group. This indicates that while PG-13 remains important for balancing accessibility and mature themes, it is not as dominant here as in typical Hollywood datasets.

*  Family-friendly certifications (PG, G, TV-PG) are fewer.  
These categories represent a smaller share of titles, suggesting fewer productions are aimed at all-ages markets. However, they remain strategically important for studios targeting family audiences.

*  NC-17 is rare.  
Very few titles fall under NC-17, reflecting its commercial unattractiveness due to distribution restrictions.
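
The ranking above can be expressed as shares per certification. One detail worth checking is that `value_counts()` excludes missing values by default, so an "Unrated" bar appears only if missing certifications were labelled earlier in the notebook. The sketch below assumes that labelling step; the helper name `certification_shares` and the `missing_label` default are hypothetical.

```python
import pandas as pd

def certification_shares(cert: pd.Series, missing_label: str = "Unrated") -> pd.Series:
    """Share of titles per certification, counting missing values explicitly."""
    # Assumption: titles with no certification are treated as one "Unrated" group.
    filled = cert.fillna(missing_label)
    return filled.value_counts(normalize=True)

# Hypothetical toy column; None marks a title with no certification.
toy_cert = pd.Series(["R", None, "PG-13", None, "R", None, "PG", None])
cert_shares = certification_shares(toy_cert)
print(cert_shares)
```

Without the `fillna` step, the missing rows would silently drop out and the remaining shares would be computed over certified titles only.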

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.  
*  The insights from this chart can guide producers and investors toward smarter content strategies. The fact that Unrated titles dominate in volume and also achieve the highest ratings highlights the growing importance of independent, festival, and streaming-driven productions. These projects often deliver strong critical acclaim, prestige, and niche audience loyalty. Leveraging this segment can enhance brand reputation, attract awards, and differentiate studios in competitive markets.

*  At the same time, the significant share of R-rated titles shows strong demand for mature content. While these exclude younger demographics, they can generate loyal adult audiences and prestige when executed well. PG-13 titles, though fewer in this dataset, remain commercially important because they balance accessibility with mature themes, appealing to a broad audience base.

*  Together, these insights help businesses design a balanced portfolio: mainstream PG-13 projects for reach, R-rated content for adult engagement, and selective Unrated titles for prestige and critical recognition.

Yes, there are insights that lead to negative growth as well.

*  Unrated titles, despite high ratings, may limit commercial growth. Their lack of certification can restrict distribution, marketing, and mainstream accessibility. Over-investing in this category could lead to prestige without profitability.

*  R-rated titles carry variability risk. While many succeed, others underperform, and their exclusion of younger audiences reduces overall market size. Heavy reliance on R-rated projects could narrow audience reach.

*  NC-17 titles remain rare and commercially unattractive. Their restricted accessibility and marketing limitations make them poor candidates for investment, leading to negative growth if pursued.

✅ Justification
Positive Impact:

*  Unrated titles → prestige, critical acclaim, niche loyalty.

*  R and PG-13 → strong audience engagement and commercial viability.

Negative Growth Risk:

*  Over-reliance on Unrated → limited distribution.

*  Heavy focus on R → narrower audience base.

*  NC-17 → commercially unviable.

#### Chart - 7

In [None]:
# ============================================================
# 📊 Chart 7 - Top 10 genres by count (Bar chart)
# ============================================================

import ast  # needed for ast.literal_eval below

# ✅ Step 1: Normalize genres column
def get_primary_genre(x):
    """Return the first genre from a list or a list-like string, else None."""
    if isinstance(x, list):        # already a list
        return x[0] if x else None
    if isinstance(x, str):         # string that looks like a list
        try:
            parsed = ast.literal_eval(x)
        except (ValueError, SyntaxError):
            return x.strip() or None   # plain string such as "drama"
        if isinstance(parsed, list):
            # An empty list ("[]") has no primary genre, so return None
            return parsed[0] if parsed else None
        return x.strip()
    return None

titles_df['primary_genre'] = titles_df['genres'].apply(get_primary_genre)

# ✅ Step 2: Drop missing values
titles_df = titles_df.dropna(subset=['primary_genre'])

# ✅ Step 3: Get top 10 genres
top_genres = titles_df['primary_genre'].value_counts().nlargest(10).index

# ✅ Step 4: Plot
plt.figure(figsize=(12,6))
sns.countplot(
    data=titles_df[titles_df['primary_genre'].isin(top_genres)],
    x="primary_genre",
    palette="viridis",
    order=top_genres
)
plt.title("Top 10 Genres Distribution")
plt.xlabel("Genre")
plt.ylabel("Number of Titles")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I selected this chart because it provides a straightforward visualization of the most common genres in the dataset, helping us understand the overall composition of titles. By focusing on the top 10 genres, we can quickly identify which categories dominate production and audience availability.

A bar chart is the best choice here because:

*  It clearly shows absolute counts of titles by genre, making it easy to compare popularity.

*  It highlights the relative dominance of certain genres (like Drama, Comedy, or Action) over others.

*  It simplifies complex multi-genre data by normalizing to a primary genre, ensuring clarity without overlap.

##### 2. What is/are the insight(s) found from the chart?

*  Drama is the most dominant genre.  
It has the highest number of titles, reflecting its universal appeal and versatility. Drama consistently attracts audiences because it covers a wide range of human emotions and storytelling styles.

*  Comedy is the second most common genre.  
Its strong presence highlights the demand for light-hearted, entertaining content. Comedy appeals across demographics and is often a reliable driver of engagement.

*  Thriller ranks third in dominance.  
The popularity of suspense and tension-driven narratives shows that audiences enjoy fast-paced, gripping stories. Thrillers often perform well both critically and commercially.

*  Other genres like Action, Romance, Documentary, Horror, and Animation contribute to diversity.  
While they are not as dominant as the top three, their presence indicates that the industry produces a wide variety of content to cater to different audience preferences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.  
*  The insights from this chart can directly support positive business impact by showing where the catalog is concentrated. The dominance of Drama, Comedy, and Thriller highlights that these genres are consistently popular and widely produced.

*  Drama ensures broad appeal and critical recognition, making it a safe investment for both mainstream and prestige projects.

*  Comedy drives mass entertainment and cross-demographic engagement, often delivering high viewership and repeat consumption.

*  Thriller taps into suspense and excitement, keeping audiences hooked and performing well in both streaming and theatrical releases.

*  By focusing on these genres, businesses can align production strategies with proven audience preferences, ensuring higher engagement and profitability. At the same time, the presence of other genres (Action, Romance, Horror, Documentary, Animation) provides opportunities for diversification, helping platforms attract niche audiences and reduce dependency on a single genre.


Yes, there are insights that lead to negative growth.

*  Over-saturation in dominant genres (Drama, Comedy, Thriller): While these genres are popular, producing too many titles in the same categories can lead to audience fatigue and reduced impact of new releases.

*  Neglecting niche genres: Under-investing in categories like Documentary, Horror, or Animation could result in missed opportunities. These genres, though smaller in count, often deliver breakout hits or serve loyal fan bases.

*  Risk of imbalance: If businesses rely too heavily on Drama, Comedy, and Thriller without exploring alternatives, they risk negative growth by failing to differentiate themselves in a competitive market.

✅ Justification
*  Positive Impact: Drama, Comedy, and Thriller offer mass appeal and commercial viability; niche genres add diversity and long-term engagement.

*  Negative Growth Risk: Oversupply in dominant genres can cause audience fatigue, while ignoring niche genres reduces market reach and innovation.

#### Chart - 8

In [None]:
# ============================================================
# 📊 Chart 8 - Runtime vs IMDb Score (Scatterplot)
# ============================================================

plt.figure(figsize=(10,6))
sns.scatterplot(
    data=titles_df,
    x="runtime",
    y="imdb_score",
    alpha=0.6,
    color="teal"
)
plt.title("Runtime vs IMDb Score")
plt.xlabel("Runtime (minutes)")
plt.ylabel("IMDb Score")
plt.show()


##### 1. Why did you pick the specific chart?

- A scatterplot is the best way to visualize the relationship between two numerical variables.  
- Runtime and IMDb score are both continuous, and this chart helps identify whether content length influences audience ratings.  
- It was chosen to move beyond simple distributions and explore **feature interactions**.


##### 2. What is/are the insight(s) found from the chart?

- The scatterplot shows that most titles cluster between **60–150 minutes** with IMDb scores between **6–8**.  
- There is **no strong linear correlation**, but extremely short runtimes tend to have lower scores.  
- A few outliers (very long runtimes) achieve high scores, suggesting that **epic-length titles can succeed if executed well**.
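
The "no strong linear correlation" reading can be backed by a number. The sketch below computes a Pearson coefficient (linear association) and a Spearman coefficient (monotonic association, computed here as Pearson on ranks so that no extra dependency is needed). The helper name `runtime_score_correlation` and the toy frame are hypothetical; the column names follow the plotting code above.

```python
import pandas as pd

def runtime_score_correlation(df: pd.DataFrame) -> dict:
    """Pearson and Spearman correlation between runtime and IMDb score."""
    pair = df[["runtime", "imdb_score"]].dropna()
    pearson = pair["runtime"].corr(pair["imdb_score"])
    # Spearman is Pearson applied to the ranks of both columns.
    spearman = pair["runtime"].rank().corr(pair["imdb_score"].rank())
    return {"pearson": float(pearson), "spearman": float(spearman), "n": len(pair)}

# Hypothetical toy data in which score rises steadily with runtime.
toy = pd.DataFrame({
    "runtime": [30, 60, 90, 120, 150],
    "imdb_score": [5.0, 6.0, 6.5, 7.5, 9.0],
})
result = runtime_score_correlation(toy)
print(result)
```

On the real data, coefficients near zero would agree with the scatterplot, while a clearly positive Spearman value would suggest a monotonic trend that the eye can miss in a dense cloud.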


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive impact:** Confirms that Prime's catalog length is consistent with audience expectations, reducing risk of poor ratings due to runtime.  
- **Negative growth risk:** Over-investing in very short titles may harm perception, as they often score lower.  
- **Business justification:** Prime should balance runtime variety but prioritize **content quality over length**, ensuring long titles are well-produced to maximize ratings.


#### Chart - 9

In [None]:
# ============================================================
# 📊 Chart 9 - Boxplot of IMDb Score by Type (Movies vs TV Shows)
# ============================================================

plt.figure(figsize=(8,6))
sns.boxplot(
    data=titles_df,
    x="type",          # categorical variable (Movie vs TV Show)
    y="imdb_score",    # numerical variable
    palette="Set2"
)
plt.title("Boxplot of IMDb Score by Type")
plt.xlabel("Type of Title")
plt.ylabel("IMDb Score")
plt.show()


##### 1. Why did you pick the specific chart?

I chose this chart because it provides a direct comparison between Movies and TV Shows in terms of audience ratings. A boxplot is particularly effective here because it doesn't just show averages; it highlights the median scores, variability, and outliers for each type. This makes it easy to see whether one type consistently performs better than the other, and how stable or volatile the ratings are.

##### 2. What is/are the insight(s) found from the chart?

*  TV Shows have a higher median IMDb score (7.1) compared to Movies (6.1).  
This indicates that audiences generally rate TV shows more favorably than movies in this dataset.

*  Movies show a lower median score with a tighter spread.  
Film ratings are more uniform, and the median sits lower because the catalog mixes blockbusters with many indie and low-budget titles.

*  TV Shows display greater variability.  
The median is higher, but the spread appears wider: some shows achieve exceptional ratings while others underperform significantly. This may reflect the episodic nature of TV content, where long-running series can either build strong loyalty or lose quality over time.
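
The medians and spreads read from the boxplot can be computed directly. The sketch below reports quartiles and the interquartile range (IQR) per type; the helper name `score_spread_by_type` is hypothetical, and the `"MOVIE"` / `"SHOW"` labels in the toy data are an assumption about how the `type` column is coded.

```python
import pandas as pd

def score_spread_by_type(df: pd.DataFrame) -> pd.DataFrame:
    """Quartiles and IQR of imdb_score for each value of `type`."""
    clean = df.dropna(subset=["type", "imdb_score"])
    quartiles = (
        clean.groupby("type")["imdb_score"]
        .quantile([0.25, 0.5, 0.75])
        .unstack()  # one row per type, one column per quantile
    )
    quartiles.columns = ["q1", "median", "q3"]
    quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
    return quartiles

# Hypothetical toy data; labels are assumed, not taken from the dataset.
toy_types = pd.DataFrame({
    "type": ["MOVIE"] * 4 + ["SHOW"] * 4,
    "imdb_score": [5.0, 6.0, 6.0, 7.0, 5.0, 7.0, 7.5, 9.0],
})
spread = score_spread_by_type(toy_types)
print(spread)
```

A larger `iqr` for shows than for movies would confirm the wider box; if the two values are close, the variability claim should be softened.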

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. TV Shows have a higher median IMDb score (7.1) than Movies (6.1), which suggests that investing in high-quality series can yield stronger audience satisfaction and long-term engagement. For producers and investors, allocating resources to well-written, well-produced shows could build loyal fan bases, increase streaming subscriptions, and enhance brand value.

There are also insights that point to negative growth. TV shows have higher median ratings but greater variability in scores: successful shows can become breakout hits, while poorly executed series can drag ratings down significantly. Long-running shows risk losing quality over time, which can damage audience trust and lead to negative growth if not managed carefully.

✅ Justification

*  Positive Impact: Higher median ratings for TV shows → stronger audience loyalty, subscription growth, and brand engagement.

*  Negative Growth Risk: Greater variability in TV shows → inconsistent quality can harm reputation. Lower median scores for movies → risk of underperformance if projects aren't carefully curated.


#### Chart - 10

In [None]:
# ============================================================
# 📊 Chart 10 - Scatterplot of IMDb Votes vs IMDb Score
# ============================================================

plt.figure(figsize=(10,6))
sns.scatterplot(
    data=titles_df,
    x="imdb_score",
    y="imdb_votes",
    alpha=0.6,
    color="darkorange"
)
plt.title("Scatterplot of IMDb Votes vs IMDb Score")
plt.xlabel("IMDb Score")
plt.ylabel("IMDb Votes")
plt.show()


##### 1. Why did you pick the specific chart?

- A scatterplot is the best way to visualize the relationship between two numerical variables (IMDb Score and IMDb Votes).
- This chart was chosen to explore whether **higher-rated titles also receive more audience votes**, which indicates popularity and engagement.
- It replaces a duplicate distribution chart, ensuring uniqueness and deeper business insights.

##### 2. What is/are the insight(s) found from the chart?


- The scatterplot shows that titles with **average scores (6–8)** tend to receive the highest number of votes, suggesting mainstream appeal.  
- Extremely high-scoring titles (above 9) exist but often have fewer votes, indicating niche or cult popularity.  
- Low-scoring titles generally attract fewer votes, confirming limited audience engagement.  
- Outliers reveal blockbuster titles that combine **high scores and massive votes**, representing Prime's most valuable assets.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive impact:** Identifies that titles with strong ratings and high votes are key drivers of engagement, guiding Prime to prioritize similar acquisitions.  
- **Negative growth risk:** Over-investing in niche high-score titles with low votes may not maximize reach, as they appeal to smaller audiences.  
- **Business justification:** Prime should balance catalog strategy by acquiring **mainstream titles with solid ratings** while selectively investing in niche high-score content to diversify appeal.

#### Chart - 11

In [None]:
# ============================================================
# 📊 Chart 11 - Bubble Chart → Runtime vs IMDb Score vs IMDb Votes
# ============================================================

plt.figure(figsize=(12,8))
sns.scatterplot(
    data=titles_df,
    x="runtime",          # numerical variable
    y="imdb_score",       # numerical variable
    size="imdb_votes",    # third variable encoded as bubble size
    sizes=(20, 300),      # min and max bubble sizes
    alpha=0.6,
    color="royalblue"
)
plt.title("Bubble Chart: Runtime vs IMDb Score vs IMDb Votes")
plt.xlabel("Runtime (minutes)")
plt.ylabel("IMDb Score")
plt.legend(title="IMDb Votes", loc="upper right")
plt.show()


##### 1. Why did you pick the specific chart?

I chose a bubble chart because it allows us to visualize the relationship between two numerical variables (runtime and IMDb score) while simultaneously incorporating a third variable (IMDb votes) through bubble size. This makes it ideal for multivariate analysis, where we want to capture not just correlations but also audience engagement. A scatterplot alone would show runtime vs score, but adding bubble size highlights which titles attracted more votes, giving a richer, three-dimensional perspective.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals several key insights:

*  Runtime vs Score Relationship: Most titles cluster around the 90–120 minute range with IMDb scores between 6 and 8, suggesting that standard-length movies dominate and perform consistently.

*  Audience Engagement: Larger bubbles (high vote counts) are concentrated in mid-range runtimes and scores, showing that mainstream titles attract more audience participation.

*  Outliers: A few very long runtimes (>180 minutes) and very short runtimes (<60 minutes) appear, but they generally have fewer votes and mixed scores, indicating limited appeal.

*  Score Stability: Extremely high scores (>9) are rare, and when they occur, they don't always correspond to high vote counts, suggesting niche popularity rather than mass appeal.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can create positive business impact:

*  Specific Reason (Runtime Sweet Spot): The chart shows that titles in the 90–120 minute range consistently attract both higher IMDb scores and larger vote counts. This is a clear signal that mainstream audiences prefer standard-length movies, and Prime Video can prioritize licensing and promoting content in this runtime band to maximize engagement.

*  Specific Reason (Audience Engagement): Larger bubbles (high votes) cluster in mid-score ranges (6–8). This means that even moderately rated titles can drive significant watch hours if they appeal to mass audiences. Prime Video can use this to optimize recommendations toward "popular but not perfect" titles that sustain viewing time.

*  Specific Reason (Outlier Strategy): Outliers with very high scores but low votes (e.g., niche films or documentaries) highlight opportunities for targeted marketing campaigns. By promoting these to cinephile segments, Prime Video can capture loyalty without overspending on mass promotion.

Negative growth risks:

*  Over-focusing on mainstream runtimes: If Prime Video only invests in 90–120 minute titles, it risks alienating niche audiences who enjoy short films or epic sagas. This could reduce differentiation and long-term brand equity.

*  Regional Bias: IMDb votes reflect global behavior. In India, for example, audiences often embrace longer runtimes (150+ minutes). Blindly applying global insights without localization could misalign with regional preferences and hurt growth.

#### Chart - 12

In [None]:
# ============================================================
# 📊 Chart 12 - Grouped Bar Chart → Average IMDb Score by Genre and Type (Movie vs TV Show)
# ============================================================

import ast  # needed for ast.literal_eval below

# Step 1: Normalize genres → extract primary genre
def get_primary_genre(x):
    """Return the first genre from a list or a list-like string, else None."""
    if isinstance(x, list):
        return x[0] if x else None
    if isinstance(x, str):
        try:
            parsed = ast.literal_eval(x)
        except (ValueError, SyntaxError):
            return x.strip() or None   # plain string such as "drama"
        if isinstance(parsed, list):
            # An empty list ("[]") has no primary genre, so return None
            return parsed[0] if parsed else None
        return x.strip()
    return None

titles_df['primary_genre'] = titles_df['genres'].apply(get_primary_genre)

# Step 2: Clean data
titles_df = titles_df.dropna(subset=['primary_genre', 'type', 'imdb_score'])

# Step 3: Prepare data → average score by genre and type
avg_scores_genre_type = (
    titles_df.groupby(["primary_genre", "type"])["imdb_score"]
    .mean()
    .reset_index()
)

# Step 4: Restrict to top 10 genres for readability
top_genres = titles_df['primary_genre'].value_counts().nlargest(10).index
avg_scores_genre_type = avg_scores_genre_type[avg_scores_genre_type['primary_genre'].isin(top_genres)]

# Step 5: Plot
plt.figure(figsize=(14,8))
sns.barplot(
    data=avg_scores_genre_type,
    x="primary_genre",
    y="imdb_score",
    hue="type",          # split by Movie vs TV Show
    palette="Set2"
)
plt.title("Chart 12: Average IMDb Score by Genre and Type")
plt.xlabel("Genre")
plt.ylabel("Average IMDb Score")
plt.xticks(rotation=45, ha='right')
plt.legend(title="Type")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose this chart because it allows us to compare audience ratings across genres while also distinguishing between Movies and TV Shows. A simple bar chart of genres would only show overall averages, but by grouping with a hue split (Movie vs TV Show), we gain deeper insights into how each type performs within the same genre.

This chart was selected because:

*  Dual comparison: It shows both genre differences and type differences in one visualization.

*  Audience preference clarity: We can see whether TV Shows or Movies generally score higher in specific genres (e.g., Drama TV Shows vs Drama Movies).

*  Business relevance: It highlights which genres and formats deliver stronger ratings, guiding producers and investors toward genres where TV Shows outperform movies or vice versa.

*  Readability: Restricting to the top 10 genres keeps the chart focused and avoids clutter, making insights easy to interpret.

##### 2. What is/are the insight(s) found from the chart?

*  TV Shows generally achieve higher average IMDb scores than Movies across most genres.  
*  This indicates that audiences tend to rate TV shows more favorably, possibly because of deeper character development and longer storytelling arcs.

Genre-specific differences are visible:

*  Drama: TV Shows outperform Movies, showing that serialized storytelling resonates strongly in this genre.

*  Comedy: TV Shows again score higher, suggesting audiences enjoy ongoing humor and character continuity.

*  Thriller: Both Movies and TV Shows perform well, but TV Shows often edge ahead due to suspense being more effective in episodic formats.

*  Documentary: Movies tend to have slightly stronger averages, reflecting the traditional dominance of feature-length documentaries.

*  The gap between Movies and TV Shows varies by genre.  
In some genres (Drama, Comedy, Thriller), the difference is pronounced, while in Documentary the scores are closer. This shows that format matters more in certain genres than others.
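
The size of the Movie vs TV Show gap can also be read off a table rather than estimated from bar heights. The following is a minimal sketch, not the notebook's own method: the sample rows are hypothetical, and the `MOVIE` / `SHOW` labels and the helper name `score_gap_by_genre` are assumptions that should be adjusted to match the values in `titles_df['type']`.

```python
import pandas as pd

# Hypothetical sample rows; in the notebook this would be titles_df.
sample = pd.DataFrame({
    "primary_genre": ["drama", "drama", "drama", "comedy", "comedy",
                      "documentary", "documentary"],
    "type": ["MOVIE", "SHOW", "SHOW", "MOVIE", "SHOW", "MOVIE", "SHOW"],
    "imdb_score": [6.0, 7.0, 8.0, 5.5, 6.5, 7.2, 7.0],
})

def score_gap_by_genre(df):
    """Mean score per type for each genre, plus the SHOW minus MOVIE gap."""
    means = df.pivot_table(index="primary_genre", columns="type",
                           values="imdb_score", aggfunc="mean")
    means["gap_show_minus_movie"] = means["SHOW"] - means["MOVIE"]
    # Largest TV Show advantage first; negative values favour Movies.
    return means.sort_values("gap_show_minus_movie", ascending=False)

gap_table = score_gap_by_genre(sample)
print(gap_table.round(2))
```

A positive gap means TV Shows average higher in that genre, and a negative gap means Movies do. Genres with very few titles of one type will give unstable gaps, so it is worth checking counts alongside the means.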

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.  
*  The insights from Chart 12 can help businesses make smarter content decisions by showing how audience ratings vary not only by genre but also by format (Movie vs TV Show).

*  TV Shows generally score higher than Movies across most genres (Drama, Comedy, Thriller).  
This suggests that serialized storytelling resonates strongly with audiences, leading to higher satisfaction and loyalty. Platforms can leverage this by investing more in high-quality TV shows in these genres to maximize engagement.

*  Movies remain competitive in certain genres like Documentary. Here, the average scores are closer or even higher for Movies, showing that feature-length storytelling still holds an advantage in genres where depth or production quality matters more than episodic format.

*  Balanced portfolio strategy: Businesses can use these insights to allocate resources effectively, focusing on TV shows for genres where they outperform while continuing to invest in movies where they remain strong. This ensures both critical acclaim and commercial viability.


Yes, some insights point to negative growth risks:

*  Over-investment in Movies for genres where TV Shows dominate (Drama, Comedy, Thriller): Since TV Shows consistently score higher, focusing too heavily on movies in these genres could lead to lower audience ratings and weaker engagement.

*  Neglecting Movies in genres where they perform better like Documentary: Ignoring the strengths of movies in these categories could mean missed opportunities for prestige projects and family-friendly content.

*  Risk of imbalance: If businesses rely only on TV Shows, they may lose out on the unique appeal and box-office potential of movies. Conversely, sticking only to movies in genres where TV Shows outperform could reduce audience satisfaction.

✅ Justification

Positive Impact:

*  TV Shows → higher ratings in Drama, Comedy, Thriller → stronger engagement and loyalty.

*  Movies → competitive in Documentary and Animation → prestige and family appeal.

Negative Growth Risk:

*  Misaligned investments (e.g., too many movies in Drama/Comedy) → lower ratings.

*  Ignoring movie strengths in niche genres → missed opportunities.

#### Chart - 13

In [None]:
# ============================================================
# 📊 Chart 13 - Stacked Bar Chart → IMDb Votes by Genre and Age Certification
# ============================================================

# Step 1: Normalize genres → extract primary genre
def get_primary_genre(x):
    if isinstance(x, list):
        return x[0] if x else None
    if isinstance(x, str):
        try:
            parsed = ast.literal_eval(x)
            if isinstance(parsed, list) and parsed:
                return parsed[0]
            else:
                return x.strip()
        except (ValueError, SyntaxError):  # not a Python literal; fall back to the raw string
            return x.strip()
    return None

titles_df['primary_genre'] = titles_df['genres'].apply(get_primary_genre)

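# Step 2: Clean data
# Note: this reassigns titles_df, so rows missing age_certification or
# imdb_votes are also excluded from every chart that runs after this cell.
titles_df = titles_df.dropna(subset=['primary_genre', 'age_certification', 'imdb_votes'])

# Step 3: Prepare data → total votes by genre and certification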
votes_genre_cert = (
    titles_df.groupby(["primary_genre", "age_certification"])["imdb_votes"]
    .sum()
    .unstack(fill_value=0)   # pivot for stacked bar
)

# Step 4: Restrict to top 10 genres for readability
top_genres = titles_df['primary_genre'].value_counts().nlargest(10).index
votes_genre_cert = votes_genre_cert.loc[top_genres]

# Step 5: Plot stacked bar chart
votes_genre_cert.plot(
    kind="bar",
    stacked=True,
    figsize=(14,8),
    colormap="Paired"
)

plt.title("Chart 13: IMDb Votes by Genre and Age Certification")
plt.xlabel("Genre")
plt.ylabel("Total IMDb Votes")
plt.xticks(rotation=45, ha='right')
plt.legend(title="Age Certification", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I selected this chart because it helps us analyze audience engagement (IMDb votes) across genres while also factoring in age certifications. A simple genre-only chart would show which genres attract the most votes, but by stacking age certifications, we gain a deeper understanding of how audience participation varies within each genre depending on content restrictions.

This chart was chosen because:

*  Dual dimension analysis: It combines genre popularity with age certification impact, showing not just which genres are most voted on, but also how certifications (PG-13, R, Unrated, etc.) influence voting patterns.

*  Audience behavior clarity: It reveals whether audiences prefer voting more on unrestricted/unrated titles or certified ones, and how this differs across genres.

*  Business relevance: Producers and platforms can identify which genre-certification combinations drive the highest engagement (e.g., R-rated Thrillers vs PG-13 Comedies).

*  Visualization strength: A stacked bar chart is ideal here because it shows both the total votes per genre and the distribution of votes across certifications in one view, making comparisons intuitive.

##### 2. What is/are the insight(s) found from the chart?

*  Drama attracts the highest total IMDb votes across certifications.  
This confirms Drama's dominance not only in production volume but also in audience engagement. Viewers are highly active in rating and voting for Drama titles, making it the most influential genre.

*  Comedy and Thriller also receive strong vote counts.  
These genres consistently engage audiences, showing that humor and suspense are reliable drivers of participation. Their stacked distribution highlights that both certified and unrated titles contribute meaningfully.

*  Unrated titles contribute significantly to vote counts across multiple genres. This suggests that audiences are not deterred by the absence of certification and actively engage with independent or streaming-driven content. It highlights the growing importance of non-traditional releases.

*  R-rated titles show substantial engagement in Thriller and Action genres.  
Mature content in these categories resonates strongly with audiences, driving high voting activity. This reflects the appeal of intense, adult-oriented storytelling.

*  Family-friendly certifications (PG, G, TV-PG) have smaller vote shares.  
While present, their contribution is lower compared to R and Unrated categories. This indicates that adult and unrestricted content generates more audience participation in IMDb voting.

Genre-certification combinations reveal clear patterns:

*  Drama + Unrated → very high votes.

*  Thriller + R → strong engagement.

*  Comedy + Unrated/PG-13 → balanced but significant.
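
Because the stacked bars show absolute totals, a genre with few votes can hide its certification mix. A minimal sketch of one way to compare the mix directly is to convert each genre's row into shares that sum to 1. The sample data and the helper name `certification_share` are hypothetical and only illustrate the calculation; the column names follow the chart code.

```python
import pandas as pd

# Hypothetical sample rows; in the notebook this would be titles_df.
sample = pd.DataFrame({
    "primary_genre": ["drama", "drama", "drama", "thriller", "thriller"],
    "age_certification": ["R", "PG-13", "R", "R", "PG"],
    "imdb_votes": [6000, 3000, 1000, 8000, 2000],
})

def certification_share(df):
    """Share of each genre's total votes contributed by each certification."""
    totals = (
        df.groupby(["primary_genre", "age_certification"])["imdb_votes"]
        .sum()
        .unstack(fill_value=0)
    )
    # Divide every row by its own total so each genre sums to 1.
    return totals.div(totals.sum(axis=1), axis=0)

shares = certification_share(sample)
print(shares.round(2))
```

Plotting `shares` with `kind="bar", stacked=True` would give a 100% stacked chart, which makes the within-genre certification patterns comparable across genres of very different size.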

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.  
*  The insights from Chart 13 can help businesses identify which genre-certification combinations drive the highest audience engagement. Since IMDb votes reflect active participation, this chart highlights where audiences are most invested.

*  Drama dominates in total votes across certifications. This confirms Drama's ability to consistently attract audience attention, making it a safe and profitable genre for investment.

*  Comedy and Thriller also show strong vote counts. These genres reliably engage audiences, ensuring commercial viability and repeat consumption.

*  Unrated titles contribute heavily to votes across genres. This shows that audiences actively engage with independent or streaming-driven content, suggesting opportunities for platforms to invest in non-traditional releases.

*  R-rated titles perform strongly in Thriller and Action. This indicates that mature content resonates with audiences, driving high engagement and loyalty.

*  By leveraging these insights, businesses can design a portfolio that balances mainstream genres (Drama, Comedy, Thriller) with niche opportunities (Unrated or R-rated content), ensuring both reach and depth of engagement.

Yes, some insights point to negative growth risks:

*  Over-reliance on Unrated titles: While they attract high votes, their lack of certification can limit distribution, marketing, and mainstream accessibility. Over-investing here risks prestige without profitability.

*  R-rated content exclusion risk: Although R-rated titles perform well in Thriller and Action, they exclude younger demographics. Heavy reliance on R-rated projects could narrow audience reach and reduce family-friendly appeal.

*  Low engagement in family-friendly certifications (PG, G, TV-PG): These categories show smaller vote shares. Over-investing in them may lead to weaker audience participation, limiting growth potential.

*  Genre imbalance: Focusing only on Drama, Comedy, and Thriller could cause saturation, while ignoring niche genres may reduce innovation and long-term retention.

✅ Justification
*  Positive Impact: Drama, Comedy, and Thriller guarantee mass engagement; Unrated and R-rated categories provide niche opportunities and strong loyalty.

*  Negative Growth Risk: Over-investment in Unrated (limited distribution), heavy reliance on R (narrow audience), or ignoring family-friendly/niche genres could reduce profitability and long-term sustainability.

#### Chart - 14

In [None]:
# ============================================================
# 📊 Chart 14 - Correlation Heatmap → Relationships Between Numerical Variables
# ============================================================

# Select only numerical columns for correlation
numeric_cols = ["runtime", "imdb_score", "imdb_votes"]

# Compute correlation matrix
corr_matrix = titles_df[numeric_cols].corr()

plt.figure(figsize=(8,6))
sns.heatmap(
    corr_matrix,
    annot=True,          # show correlation values
    cmap="coolwarm",     # color palette
    center=0,            # center at zero for balanced view
    linewidths=0.5,
    cbar_kws={'label': 'Correlation Coefficient'}
)
plt.title("Correlation Heatmap of Numerical Variables")
plt.show()


##### 1. Why did you pick the specific chart?

I selected a correlation heatmap because it is the most effective way to visualize relationships among multiple numerical variables simultaneously. Instead of analyzing each pair individually, the heatmap provides a matrix view that highlights both the strength (high vs low correlation values) and the direction (positive vs negative) of associations. This chart is particularly useful in identifying hidden patterns, such as whether higher IMDb scores are linked to more votes, or whether runtime influences ratings. By condensing complex numerical relationships into a single, color-coded visualization, the correlation heatmap enables quick comparison and helps guide deeper analysis into which variables truly drive audience engagement.

##### 2. What is/are the insight(s) found from the chart?

The correlation heatmap reveals several key insights:

*  IMDb Score vs IMDb Votes: There is a moderate positive correlation, meaning higher-rated titles tend to attract more audience votes. This suggests that quality perception drives engagement.

*  Runtime vs IMDb Score: The correlation is weak, indicating that longer movies do not necessarily earn higher ratings. Audience satisfaction depends more on storytelling than duration.

*  Runtime vs IMDb Votes: A slight positive correlation may exist, showing that mainstream, longer titles often receive more votes, but the effect is not strong enough to be a decisive factor.

*  Overall Pattern: No extremely strong correlations (>0.8) are observed, so no single variable is a close linear proxy for another. Because the coefficient only captures linear association, this highlights the need for multivariate analysis rather than relying on a single metric.
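
Since vote counts are usually dominated by a few very large values, the default Pearson coefficient can understate a relationship that is monotonic but not linear. The sketch below, built on hypothetical sample values, compares Pearson with Spearman (rank-based) correlation and with Pearson on log-transformed votes. It is an illustration of the check, not a result from the Prime dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; in the notebook this would be titles_df[numeric_cols].
sample = pd.DataFrame({
    "runtime": [90, 100, 45, 120, 30, 110],
    "imdb_score": [5.0, 6.0, 6.5, 7.0, 7.5, 8.5],
    "imdb_votes": [100, 500, 900, 5000, 20000, 900000],
})

# Pearson measures linear association; Spearman measures monotonic association.
pearson = sample.corr(method="pearson")
spearman = sample.corr(method="spearman")

# log1p compresses the long right tail of vote counts before correlating.
sample_log = sample.assign(log_votes=np.log1p(sample["imdb_votes"]))
pearson_log = sample_log["imdb_score"].corr(sample_log["log_votes"])

print("Pearson  score vs votes:", round(pearson.loc["imdb_score", "imdb_votes"], 2))
print("Spearman score vs votes:", round(spearman.loc["imdb_score", "imdb_votes"], 2))
print("Pearson  score vs log votes:", round(pearson_log, 2))
```

If the Spearman or log-scale value is noticeably higher than the raw Pearson value, the heatmap is likely being pulled down by a handful of blockbuster titles rather than by a genuinely weak relationship.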

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can create positive business impact:

*  Specific Reason - Engagement Prediction: The moderate positive correlation between IMDb Score and IMDb Votes shows that higher-rated titles tend to attract more audience participation. Prime Video can use this relationship to prioritize promoting well-rated titles, ensuring higher engagement and watch hours.

*  Specific Reason - Content Evaluation: The weak correlation between Runtime and IMDb Score indicates that duration is not a strong driver of audience satisfaction. This helps Prime Video avoid over-investing in longer movies under the assumption that they perform better, saving production and acquisition costs.

*  Specific Reason - Recommendation Algorithms: By understanding which variables are correlated, Prime Video can refine recommendation models. For example, emphasizing score and votes together improves personalization and retention.

Negative growth risks:

*  Over-reliance on Correlation: Correlation does not imply causation. If Prime Video assumes that high scores automatically lead to high votes, it may misallocate marketing spend toward titles that don't resonate with certain regions or demographics.

*  Ignoring Niche Content: The weak correlation between runtime and votes suggests mainstream patterns, but niche audiences (e.g., fans of long epics or short films) could be overlooked. Overlooking these segments may reduce brand differentiation and loyalty.

*  Global vs Local Bias: IMDb data reflects global voting behavior. In India, for example, longer runtimes often correlate with mass appeal. Blindly applying global correlations without localization could misalign with regional preferences and hurt subscriber growth.

#### Chart - 15

In [None]:
# ============================================================
# 📊 Chart 15 - Boxplot → IMDb Score Distribution by Age Certification
# ============================================================

plt.figure(figsize=(12,8))
sns.boxplot(
    data=titles_df,
    x="age_certification",
    y="imdb_score",
    palette="Set3"
)
plt.title("IMDb Score Distribution by Age Certification")
plt.xlabel("Age Certification")
plt.ylabel("IMDb Score")
plt.show()


##### 1. Why did you pick the specific chart?

I selected a boxplot because it is the most effective way to visualize the distribution of IMDb scores across categorical age certifications. Unlike a simple bar chart that only shows averages, the boxplot reveals the median, quartiles, and outliers, giving a deeper understanding of how ratings vary within each certification group. This makes it ideal for identifying whether family-friendly titles (G/PG) have more consistent ratings, or whether adult-oriented titles (R/TV-MA) show greater variability and polarization. By capturing both central tendency and spread, the boxplot provides richer insights into how certification influences audience perception, making it a strong choice for multivariate analysis.

##### 2. What is/are the insight(s) found from the chart?

*  Median IMDb scores vary across certifications. Some age certifications (e.g., PG-13, TV-14) show higher median scores compared to others like R or Unrated, indicating that audience ratings differ depending on the target age group.

*  Spread of scores (variability) is different by certification.

*  Unrated titles often show a wide spread, meaning ratings are inconsistent: some highly rated, others poorly rated.

*  Family-friendly certifications (PG, G, TV-PG) tend to have tighter distributions, suggesting more consistent audience reception.

*  R-rated titles may show broader variability, reflecting polarized audience reactions to mature content.

*  Outliers are visible across certifications. Certain titles achieve exceptionally high or low scores compared to the rest of their group, showing that standout successes or failures exist in every certification category.

Overall trend:  
*  Age certification influences not just the median rating but also the consistency of ratings. Family-friendly titles are more stable, while unrestricted or mature categories show greater variation.
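
The "consistency" read from the boxes can be backed by numbers: the median is the line inside each box and the interquartile range (IQR) is the box height. A minimal sketch under assumed sample data follows; the helper name `spread_by_group` is hypothetical and the scores are invented for illustration.

```python
import pandas as pd

# Hypothetical sample rows; in the notebook this would be titles_df.
sample = pd.DataFrame({
    "age_certification": ["PG"] * 4 + ["R"] * 4,
    "imdb_score": [6.0, 6.2, 6.4, 6.6, 3.0, 5.0, 7.0, 9.0],
})

def spread_by_group(df, group_col, value_col="imdb_score"):
    """Count, median and interquartile range of value_col per group."""
    grouped = df.groupby(group_col)[value_col]
    summary = grouped.agg(count="count", median="median")
    # IQR = 75th percentile minus 25th percentile (the height of the box).
    summary["iqr"] = grouped.quantile(0.75) - grouped.quantile(0.25)
    # Most consistent (smallest spread) groups first.
    return summary.sort_values("iqr")

summary = spread_by_group(sample, "age_certification")
print(summary.round(2))
```

Including `count` matters because a certification with only a handful of titles can look deceptively tight or wide.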

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.  
*  The insights from Chart 15 can help businesses understand how audience ratings vary by age certification, which directly informs content strategy and investment decisions.

*  Family-friendly certifications (PG, G, TV-PG) show more consistent IMDb scores with narrower spreads. This stability suggests that these titles are safer investments, as they are less likely to receive extremely low ratings.

*  PG-13 and TV-14 titles often have relatively high median scores, indicating strong audience approval. These categories balance accessibility with engaging content, making them commercially attractive for broad audiences.

*  R-rated and Unrated titles show wider variability in scores. While some achieve very high ratings, others perform poorly. This highlights both risk and opportunity: mature or unrestricted content can deliver breakout successes but also polarize audiences.

*  By leveraging these insights, businesses can design a balanced portfolio: investing in family-friendly titles for consistent returns, while strategically supporting mature/unrated projects for innovation and potential high-impact hits.

Yes, there are insights that lead to negative growth.

*  High variability in R-rated and Unrated titles: These categories carry risk because audience reception is inconsistent. Over-investing here could lead to negative growth if poorly rated titles dominate.

*  Lower medians in certain certifications: If a certification consistently shows lower median scores, focusing heavily on that category may reduce overall brand reputation and audience satisfaction.

*  Ignoring family-friendly categories: While they may not produce extreme highs, neglecting PG/G/TV-PG titles could alienate younger audiences and families, limiting reach and long-term growth.

✅ Justification
*  Positive Impact: PG-13 and TV-14 deliver strong median scores; family-friendly certifications provide stability; R/Unrated offer innovation potential.

*  Negative Growth Risk: Over-reliance on R/Unrated (high variability) or ignoring family-friendly categories (reduced reach) could harm sustainability.

#### Chart - 16

In [None]:
# ============================================================
# 📊 Chart 16 - Pairplot → Multi-Variable Relationship Exploration
# ============================================================

# Select numerical columns for pairplot
numeric_cols = ["runtime", "imdb_score", "imdb_votes"]

# Create pairplot
sns.pairplot(
    titles_df[numeric_cols],
    diag_kind="kde",      # density plots on diagonal
    plot_kws={"alpha":0.5, "color":"teal"}
)
plt.suptitle("Pairplot of Runtime, IMDb Score, and IMDb Votes", y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

I selected a pairplot because it provides a comprehensive way to explore multiple numerical variables simultaneously. Instead of focusing on one relationship at a time, the pairplot generates a grid of scatterplots and distribution plots, allowing us to visually compare every variable pair side by side. This makes it ideal for detecting correlation patterns, clusters, and anomalies across runtime, imdb_score, and imdb_votes. The diagonal plots also reveal the distribution of each variable, helping us understand skewness and spread. By combining both pairwise relationships and individual distributions in one visualization, the pairplot offers a holistic view of audience behavior metrics, making it a strong choice for multivariate analysis.

##### 2. What is/are the insight(s) found from the chart?

The pairplot reveals several important insights:

*  Runtime vs IMDb Score: The scatterplots confirm a weak or negligible correlation. Longer movies do not necessarily earn higher ratings, showing that audience satisfaction depends more on content quality than duration.

*  Runtime vs IMDb Votes: A slight upward trend is visible, suggesting that mainstream, longer titles may attract more votes. However, the effect is modest and not a strong predictor of engagement.

*  IMDb Score vs IMDb Votes: A clearer positive relationship emerges, with higher-scoring titles tending to receive more votes, reinforcing the idea that audience approval drives participation and visibility.

Distribution Patterns:

*  IMDb Votes are heavily skewed, with a small number of blockbuster titles receiving disproportionately high votes compared to the majority.

*  IMDb Scores cluster around mid-to-high ranges (6-8), showing that most titles achieve moderate approval rather than extreme ratings.

*  Runtime distribution shows concentration around typical movie lengths (90-120 minutes), with fewer very short or very long titles.
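
The skew in votes can be quantified instead of judged by eye. The following sketch uses a hypothetical vote series to compute sample skewness before and after a `log1p` transform, along with the share of all votes held by the single largest title. The numbers are illustrative assumptions, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical vote counts; in the notebook this would be titles_df["imdb_votes"].
votes = pd.Series([50, 120, 300, 800, 1500, 4000, 12000, 950000], name="imdb_votes")

# Sample skewness: values well above 0 indicate a long right tail.
raw_skew = votes.skew()

# log1p(x) = log(1 + x) compresses large values and is defined at zero votes.
log_skew = np.log1p(votes).skew()

# How much of the total engagement comes from the single top title.
top_share = votes.nlargest(1).sum() / votes.sum()

print(f"skew (raw votes): {raw_skew:.2f}")
print(f"skew (log votes): {log_skew:.2f}")
print(f"share of votes from top title: {top_share:.1%}")
```

Passing a log-transformed votes column to the pairplot is one option for making the scatter panels readable, since otherwise most points are compressed against the axis by a few blockbusters.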

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can create positive business impact:

*  Specific Reason - Engagement Drivers: The clear positive relationship between IMDb Score and IMDb Votes shows that well-rated titles attract more audience participation. Prime Video can leverage this by prioritizing marketing and recommendations for high-scoring titles, boosting watch hours and subscriber satisfaction.

*  Specific Reason - Runtime Strategy: The weak correlation between Runtime and IMDb Score confirms that duration does not strongly influence ratings. This helps Prime Video avoid over-investing in longer productions under the false assumption that they perform better, saving costs while focusing on storytelling quality.

*  Specific Reason - Distribution Awareness: The skewed distribution of IMDb Votes highlights that a few blockbuster titles dominate engagement. Recognizing this pattern allows Prime Video to balance its catalog by promoting mid-tier titles to reduce over-dependence on blockbusters.

Negative growth risks:

*  Over-reliance on Blockbusters: Since votes are heavily skewed toward a few titles, focusing only on blockbusters could lead to content fatigue and audience churn if variety is ignored.

*  Ignoring Niche Segments: The weak correlation between runtime and votes may cause Prime Video to deprioritize shorter or experimental formats. This risks losing niche audiences who value diverse storytelling styles.

*  Misinterpreting Correlation: Assuming that high scores always guarantee high votes could mislead strategy. Some critically acclaimed titles may not attract mass audiences, and over-marketing them could waste resources.

#### Chart - 17

In [None]:
# ============================================================
# 📊 Chart 17 - Violin Plot → IMDb Score Distribution by Genre
# ============================================================

# Step 1: Normalize genres → extract primary genre
def get_primary_genre(x):
    if isinstance(x, list):
        return x[0] if x else None
    if isinstance(x, str):
        try:
            parsed = ast.literal_eval(x)
            if isinstance(parsed, list) and parsed:
                return parsed[0]
            else:
                return x.strip()
        except (ValueError, SyntaxError):  # not a Python literal; fall back to the raw string
            return x.strip()
    return None

titles_df['primary_genre'] = titles_df['genres'].apply(get_primary_genre)

# Step 2: Clean data
titles_df = titles_df.dropna(subset=['primary_genre', 'imdb_score'])

# Step 3: Restrict to top 10 genres for readability
top_genres = titles_df['primary_genre'].value_counts().nlargest(10).index
filtered_df = titles_df[titles_df['primary_genre'].isin(top_genres)]

# Step 4: Plot violin chart
plt.figure(figsize=(14,8))
sns.violinplot(
    data=filtered_df,
    x="primary_genre",
    y="imdb_score",
    palette="muted",
    inner="quartile"   # shows median and quartiles inside violin
)
plt.title("Chart 17: IMDb Score Distribution by Top 10 Genres")
plt.xlabel("Genre")
plt.ylabel("IMDb Score")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

*  Distribution analysis: A violin plot is the most effective way to visualize the distribution of IMDb scores across genres. Unlike simple averages, it shows the full spread, density, and variability of scores.

*  Genre comparison: By plotting multiple genres side-by-side, the chart highlights differences in score distributions, for example whether Drama tends to have consistently higher scores or whether Comedy has wider variability.

*  Clarity of insights: The violin plot combines the benefits of a box plot (median and quartiles) with kernel density estimation, making it easy to spot skewness, outliers, and genre-specific score patterns.

*  Business relevance: This chart helps producers and investors understand not just which genres score higher on average, but also how reliable those scores are. Genres with tight, high-score distributions (like Drama or Thriller) may be safer bets, while those with wide spreads (like Comedy) carry more risk but also potential upside.

##### 2. What is/are the insight(s) found from the chart?

*  Drama and Thriller show strong score distributions: These genres tend to have consistently higher IMDb scores, with dense clustering around mid-to-high ranges. This indicates both quality and audience appreciation.

*  Comedy and Romance exhibit wider variability: Their distributions are more spread out, meaning scores fluctuate significantly. Some titles perform very well, but many fall into average ranges, reflecting mixed audience reception.

*  Action scores are relatively lower: The violin shape for Action is narrower and skewed toward mid-to-low scores, suggesting weaker critical reception compared to Drama or Thriller.

*  Genre reliability differs: Genres like Drama and Thriller show tighter, higher distributions (safer bets for consistent quality), while genres like Comedy and Romance carry more risk but also potential upside if executed well.

*  Business implication: Producers can rely on Drama and Thriller for consistent audience satisfaction, while Comedy and Romance may require careful positioning to maximize success. Action's weaker score distribution signals caution for heavy investment without innovation.
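
One way to make the "safer bet" idea concrete is to measure, per genre, the fraction of titles that clear a chosen score threshold. The sketch below is an illustrative assumption rather than part of the original analysis: the sample scores, the 7.0 cutoff, and the helper name `share_above` are all hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; in the notebook this would be filtered_df.
sample = pd.DataFrame({
    "primary_genre": ["drama"] * 4 + ["action"] * 4,
    "imdb_score": [7.1, 7.4, 6.8, 7.9, 5.0, 6.1, 7.2, 4.8],
})

def share_above(df, threshold=7.0):
    """Fraction of titles in each genre with imdb_score >= threshold."""
    cleared = df["imdb_score"].ge(threshold)  # boolean Series
    # The mean of booleans within each genre is the share that cleared the bar.
    return cleared.groupby(df["primary_genre"]).mean().sort_values(ascending=False)

hit_rate = share_above(sample, threshold=7.0)
print(hit_rate)
```

A genre with a high hit rate and a narrow violin is the low-risk case described above, while a genre with a low hit rate but a long upper tail is the high-variance case.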

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

📈 Positive Business Impact

*  Reliable genres for consistent quality: Drama and Thriller show higher and tighter score distributions, meaning they consistently deliver strong audience satisfaction. Investing in these genres is likely to yield positive business outcomes because they balance critical acclaim with broad appeal.

*  Strategic positioning of variable genres: Comedy and Romance have wider score variability. While riskier, they also offer upside potential when executed well. Businesses can leverage these genres by focusing on quality scripts and targeted marketing to maximize success.

*  Data-driven genre prioritization: The chart helps producers and investors identify which genres are safer bets (Drama, Thriller) versus those requiring more careful risk management (Comedy, Romance, Action). This supports smarter resource allocation.

📉 Potential Negative Growth Insights

*  Action genre underperformance: Action shows relatively lower score distributions, suggesting weaker audience reception. Heavy investment in Action without innovation could lead to negative growth, as it may not deliver consistent satisfaction.

*  Risk in highly variable genres: Comedy and Romance, while popular, show inconsistent score ranges. Over-reliance on these genres without quality control could result in uneven audience reception and reduced profitability.

*  Ignoring distribution spread: Focusing only on average scores without considering variability could mislead businesses. Wide spreads mean higher risk: some titles succeed, but many underperform.

✅ Justification
*  Positive impact: Insights highlight Drama and Thriller as reliable genres for consistent audience satisfaction, guiding producers toward profitable investments.

*  Negative growth risk: Action's weaker scores and the variability in Comedy/Romance show that over-investment in these genres without innovation or quality control could reduce profitability.

#### Chart - 18

In [None]:
# ============================================================
# 📊 Chart 18 - Horizontal Bar Chart → Genre Contribution to Total IMDb Votes
# ============================================================

# Step 1: Normalize genres → extract primary genre
def get_primary_genre(x):
    if isinstance(x, list):
        return x[0] if x else None
    if isinstance(x, str):
        try:
            parsed = ast.literal_eval(x)
            if isinstance(parsed, list) and parsed:
                return parsed[0]
            else:
                return x.strip()
        except (ValueError, SyntaxError):  # not a Python literal; fall back to the raw string
            return x.strip()
    return None

titles_df['primary_genre'] = titles_df['genres'].apply(get_primary_genre)

# Step 2: Clean data
titles_df = titles_df.dropna(subset=['primary_genre', 'imdb_votes'])

# Step 3: Aggregate votes by primary genre
genre_votes = titles_df.groupby("primary_genre")["imdb_votes"].sum().reset_index()

# Step 4: Restrict to top 10 genres for readability
top_genres = genre_votes.nlargest(10, "imdb_votes")
top_genres = top_genres.sort_values("imdb_votes", ascending=True)

# Step 5: Plot horizontal bar chart
plt.figure(figsize=(12,8))
sns.barplot(
    data=top_genres,
    x="imdb_votes",
    y="primary_genre",
    palette="viridis"
)
plt.title("Chart 18: Top 10 Genres Contribution to Total IMDb Votes")
plt.xlabel("Total IMDb Votes")
plt.ylabel("Genre")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

*  Horizontal bar chart is ideal when comparing categories (genres) with long labels. It avoids clutter and makes it easy to read genre names.

*  By sorting the bars, you immediately see the ranking of genres from least to most votes.

*  Unlike stacked or grouped charts, this chart focuses on one metric (total votes), keeping the insight clear and uncluttered.

##### 2. What is/are the insight(s) found from the chart?


*  Drama and Thriller dominate audience engagement: These two genres contribute the highest share of IMDb votes, showing they are the most popular and widely discussed categories among viewers.

*  Action is relatively weak in votes: Despite being a mainstream genre, Action ranks near the bottom (4th lowest), suggesting it doesn't generate as much audience participation as Drama or Thriller.

*  Genre popularity is uneven: While a few genres (Drama, Thriller) capture the bulk of votes, others (like Action and niche genres) lag behind, highlighting differences in audience interest.

*  Business implication: Content producers and investors should recognize that Drama and Thriller are strong drivers of audience engagement, while Action may require additional marketing or differentiation to compete effectively.
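
The "share of IMDb votes" mentioned above can be made explicit by converting the per-genre totals into percentages. The following is a minimal sketch on hypothetical toy data (`toy_df` and its values are assumptions for illustration, not figures from the Prime catalog); in the notebook, the same steps would presumably run on `titles_df`.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use titles_df instead.
toy_df = pd.DataFrame({
    "primary_genre": ["drama", "drama", "thriller", "action", "comedy"],
    "imdb_votes": [5000, 3000, 4000, 1000, 2000],
})

# Sum votes per genre, then express each total as a percentage of all votes.
genre_votes = toy_df.groupby("primary_genre")["imdb_votes"].sum()
vote_share = (genre_votes / genre_votes.sum() * 100).sort_values(ascending=False)

# The cumulative share shows how quickly the leading genres account for the total.
summary = pd.DataFrame({
    "share_pct": vote_share.round(1),
    "cumulative_pct": vote_share.cumsum().round(1),
})
print(summary)
```

A cumulative column like this makes a statement such as "a few genres capture the bulk of votes" checkable against a concrete threshold.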

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



📈 Positive Business Impact
*  Drama and Thriller as high-engagement genres: Since these genres attract the highest IMDb votes, they represent strong audience interest. Investing in content within these genres is likely to generate higher visibility, stronger word-of-mouth, and better ROI because they already have proven popularity.

*  Clear prioritization for producers/investors: The chart helps decision-makers focus resources on genres that maximize audience participation, ensuring efficient allocation of budgets toward content with higher demand.

*  Strategic marketing leverage: Knowing which genres dominate votes allows platforms to highlight them in promotions, increasing subscriber satisfaction and retention.

📉 Potential Negative Growth Insights
*  Action genre underperformance: Despite being traditionally popular, Action ranks 4th lowest in IMDb votes. This suggests that audience engagement is weaker than expected. Heavy investment in Action without differentiation could lead to wasted resources and slower growth.

*  Risk of ignoring niche genres: While Drama and Thriller dominate, over-focusing on them may cause platforms to neglect smaller genres. This could alienate niche audiences and reduce overall diversity, which is important for long-term sustainability.

✅ Justification
*  Positive impact: High votes = high engagement → better business outcomes when aligned with audience demand.

*  Negative growth risk: Genres with low votes (like Action) show weaker audience traction → over-investment here could reduce profitability unless supported by innovation or targeted campaigns.

#### Chart - 19

In [None]:
# ============================================================
# 📊 Chart 19 - Stacked Bar Chart of Genre Contribution to IMDb Votes by Decade (1940–2020)
# ============================================================
import ast  # parses stringified genre lists such as "['drama', 'comedy']"

# Step 1: Normalize genres → extract primary genre
def get_primary_genre(x):
    """Return the first genre of a list or stringified list, else the stripped string."""
    if isinstance(x, list):
        return x[0] if x else None
    if isinstance(x, str):
        try:
            parsed = ast.literal_eval(x)
        except (ValueError, SyntaxError):
            # Not a Python literal (e.g. a plain genre name): keep the string itself
            return x.strip()
        if isinstance(parsed, list):
            # An empty list ("[]") has no primary genre
            return parsed[0] if parsed else None
        return x.strip()
    return None

titles_df['primary_genre'] = titles_df['genres'].apply(get_primary_genre)

# Step 2: Clean data (filtered copy, so the decade filter below does not leak into later charts)
decade_df = titles_df.dropna(subset=['primary_genre', 'release_year', 'imdb_votes']).copy()

# Step 3: Create decade column (cast to int so tick labels read 1940, not 1940.0)
decade_df['decade'] = ((decade_df['release_year'] // 10) * 10).astype(int)

# ✅ Step 4: Restrict to 1940–2020
decade_df = decade_df[(decade_df['decade'] >= 1940) & (decade_df['decade'] <= 2020)]

# Step 5: Aggregate votes by decade and primary genre
genre_decade_votes = (
    decade_df.groupby(["decade", "primary_genre"])["imdb_votes"]
    .sum()
    .unstack(fill_value=0)
)

# Step 6: Restrict to top 10 genres (by number of titles) for readability
top_genres = decade_df['primary_genre'].value_counts().nlargest(10).index
genre_decade_votes = genre_decade_votes[top_genres]

# Step 7: Plot stacked bar chart
genre_decade_votes.plot(
    kind="bar",
    stacked=True,
    figsize=(14,8),
    colormap="tab20"
)

plt.title("Chart 19: Genre Contribution to IMDb Votes by Decade (1940–2020)")
plt.xlabel("Release Decade")
plt.ylabel("Total IMDb Votes")
plt.xticks(rotation=45, ha='right')
plt.legend(title="Genres", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

*  Time-based analysis: A stacked bar chart by decade allows us to see how audience engagement (IMDb votes) shifts across time. This makes it easy to identify long-term trends in genre popularity.

*  Genre comparison in context: By stacking genres within each decade, the chart shows not only which genres dominate overall but also how their relative contributions change over different historical periods.

*  Clarity and readability: A stacked bar chart is ideal here because it highlights both the total votes per decade and the composition of genres within that total. Other chart types (like line plots) would obscure the relative contributions.

*  Business relevance: This chart helps producers and investors understand how genre preferences evolve over decades, guiding decisions about which genres have sustained appeal versus those that rise or fall in popularity.

##### 2. What is/are the insight(s) found from the chart?

*  Genre dominance shifts over time: Drama consistently contributes the largest share of IMDb votes across decades, showing its long-standing appeal. Thriller emerges strongly in later decades, reflecting changing audience preferences toward suspense and intensity.

*  Rise of modern genres: Comedy and Romance maintain steady contributions, but genres like Thriller and Sci-Fi gain more traction post-1980, indicating diversification of audience interests.

*  Action underperformance: Despite being a mainstream genre, Action contributes relatively fewer votes compared to Drama and Thriller, suggesting weaker audience engagement in terms of voting activity.

*  Decade-wise engagement growth: Overall IMDb votes increase significantly after 2000, showing that newer decades have higher audience participation, likely due to digital platforms and broader accessibility.

*  Business implication: Producers can rely on Drama and Thriller for consistent engagement, while recognizing growth opportunities in Sci-Fi and Thriller in recent decades. Action may need repositioning or innovation to capture stronger audience interest.
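
Because total votes rise sharply in recent decades, raw stacked bars make the early decades hard to compare. One way to read the composition claims above independently of the totals is to normalize each decade to 100%. The sketch below uses hypothetical toy data (`toy_df` and its numbers are assumptions for illustration only) and mirrors the decade-by-genre pivot used for the chart.

```python
import pandas as pd

# Hypothetical toy data standing in for the decade/genre/votes columns.
toy_df = pd.DataFrame({
    "decade": [1990, 1990, 2000, 2000, 2000, 2010, 2010],
    "primary_genre": ["drama", "action", "drama", "thriller",
                      "action", "drama", "thriller"],
    "imdb_votes": [600, 400, 2000, 1500, 500, 5000, 5000],
})

# Same pivot shape as the chart: one row per decade, one column per genre.
decade_votes = (
    toy_df.groupby(["decade", "primary_genre"])["imdb_votes"]
    .sum()
    .unstack(fill_value=0)
)

# Divide each row by its own total so every decade sums to 100%.
decade_share = decade_votes.div(decade_votes.sum(axis=1), axis=0) * 100
print(decade_share.round(1))
```

Passing `decade_share` to `.plot(kind="bar", stacked=True)` would give a 100% stacked chart, where a genre's rise or decline shows up as a change in its band height rather than being masked by overall vote growth.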

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


📈 Positive Business Impact
*  Drama and Thriller as consistent drivers: These genres show strong contributions across multiple decades, proving their long-term appeal. Investing in these genres ensures stable audience engagement and reliable returns.

*  Emerging opportunities in modern decades: Genres like Sci-Fi and Thriller gain traction post-1980, reflecting evolving audience tastes. Recognizing these shifts allows producers to align with new demand and capture growth markets.

*  Decade-wise growth in engagement: The sharp rise in votes after 2000 highlights how digital platforms and global accessibility have expanded audience participation. Businesses can leverage this trend by producing content that resonates with modern, digitally active viewers.

📉 Potential Negative Growth Insights
*  Action genre underperformance: Despite being a mainstream category, Action consistently contributes fewer votes compared to Drama and Thriller. Heavy investment in Action without innovation could lead to wasted resources and slower growth.

*  Risk of over-reliance on dominant genres: While Drama and Thriller are safe bets, focusing too narrowly on them may reduce diversity. This could alienate niche audiences and limit long-term sustainability.

*  Decline of certain traditional genres: Some genres show stagnation or decline in later decades, suggesting waning audience interest. Continued investment in these areas may not yield strong returns.

✅ Justification
*  Positive impact: Insights highlight which genres (Drama, Thriller, Sci-Fi) consistently or increasingly drive audience engagement, guiding producers toward profitable investments.

*  Negative growth risk: Genres with low or declining vote contributions (like Action or certain traditional categories) show weaker traction, meaning over-investment here could reduce profitability unless supported by innovation or targeted campaigns.

#### Chart - 20

In [None]:
# ============================================================
# 📊 Chart 20 - Scatter Plot → Relationship Between IMDb Score and IMDb Votes by Genre
# ============================================================
import ast  # parses stringified genre lists such as "['drama', 'comedy']"

# Step 1: Normalize genres → extract primary genre
def get_primary_genre(x):
    """Return the first genre of a list or stringified list, else the stripped string."""
    if isinstance(x, list):
        return x[0] if x else None
    if isinstance(x, str):
        try:
            parsed = ast.literal_eval(x)
        except (ValueError, SyntaxError):
            # Not a Python literal (e.g. a plain genre name): keep the string itself
            return x.strip()
        if isinstance(parsed, list):
            # An empty list ("[]") has no primary genre
            return parsed[0] if parsed else None
        return x.strip()
    return None

titles_df['primary_genre'] = titles_df['genres'].apply(get_primary_genre)

# Step 2: Clean data (filtered copy, so titles_df itself is left unchanged)
scatter_df = titles_df.dropna(subset=['primary_genre', 'imdb_score', 'imdb_votes'])

# Step 3: Restrict to top 8 genres (by number of titles) for readability
top_genres = scatter_df['primary_genre'].value_counts().nlargest(8).index
filtered_df = scatter_df[scatter_df['primary_genre'].isin(top_genres)]

# Step 4: Plot scatter chart
plt.figure(figsize=(12,8))
sns.scatterplot(
    data=filtered_df,
    x="imdb_score",
    y="imdb_votes",
    hue="primary_genre",
    alpha=0.7,
    palette="tab10"   # matches up to 10 genres
)

plt.title("Chart 20: Relationship Between IMDb Score and IMDb Votes by Top Genres")
plt.xlabel("IMDb Score")
plt.ylabel("IMDb Votes")
plt.legend(title="Genres", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

*  Relationship analysis: A scatter plot is the most effective way to visualize the relationship between two continuous variables, in this case IMDb Score and IMDb Votes. It shows whether higher scores tend to attract more votes or if popularity (votes) is independent of ratings.

*  Genre differentiation: By coloring points by genre, the chart reveals whether certain genres cluster differently (e.g., Drama and Thriller may have high votes across a wide score range, while niche genres may cluster tightly).

*  Clarity of distribution: Scatter plots highlight outliers (titles with unusually high votes or scores) and spread, which other chart types (like bar or line charts) would obscure.

*  Business relevance: This chart helps identify which genres combine both quality (high scores) and audience engagement (high votes), guiding producers toward genres that balance critical acclaim with popularity.


##### 2. What is/are the insight(s) found from the chart?


*  No strong correlation between score and votes: The scatter plot shows that higher IMDb scores do not always translate into higher votes. Some low-scoring titles still attract large vote counts, while some high-scoring titles remain less voted.

Genre clustering patterns:

*  Drama and Thriller dominate the high-vote region, meaning these genres consistently attract large audience participation regardless of score.

*  Comedy and Romance tend to cluster in mid-score ranges with moderate votes, showing steady but less explosive engagement.

*  Action appears scattered with relatively fewer votes, reinforcing its weaker audience traction compared to Drama and Thriller.

*  Outliers are visible: A few titles stand out with exceptionally high votes, often in Drama or Thriller, suggesting blockbuster or cult status that drives disproportionate audience interaction.

*  Business implication: Genres like Drama and Thriller balance both quality (scores) and popularity (votes), making them strong candidates for investment. Action's weaker vote presence indicates risk unless paired with innovation or cross-genre appeal.
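
The "no strong correlation" reading above is visual; a rank correlation gives it a number. The sketch below is an illustration on hypothetical toy data (`toy_df`, its values, and the helper name `spearman` are assumptions, not part of the original notebook). Spearman's coefficient is used rather than Pearson's because vote counts are heavily skewed, and it is computed here as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical toy data; in the notebook, filtered_df would take its place.
toy_df = pd.DataFrame({
    "primary_genre": ["drama"] * 4 + ["action"] * 3,
    "imdb_score": [6.0, 7.0, 8.0, 9.0, 5.0, 6.5, 7.5],
    "imdb_votes": [100, 5000, 300, 20000, 900, 400, 100],
})

def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    return x.rank().corr(y.rank())

# One coefficient for all titles, then one per genre.
overall = spearman(toy_df["imdb_score"], toy_df["imdb_votes"])
per_genre = {
    genre: spearman(group["imdb_score"], group["imdb_votes"])
    for genre, group in toy_df.groupby("primary_genre")
}

print(f"overall: {overall:.2f}")
for genre, rho in per_genre.items():
    print(f"{genre}: {rho:.2f}")
```

Values near 0 would support the claim that scores and votes are largely independent, while per-genre coefficients can reveal that the relationship differs between genres even when the overall figure is weak.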

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

📈 Positive Business Impact
*  Identifying high-engagement genres: Drama and Thriller consistently attract large vote counts across score ranges. This means they combine both popularity (votes) and credibility (scores), making them strong candidates for investment and marketing focus.

*  Balanced strategy between quality and reach: The scatter plot shows that some genres can achieve both high scores and high votes. Producers can prioritize these genres to maximize both critical acclaim and audience traction.

*  Spotting blockbuster potential: Outliers with exceptionally high votes (often in Drama or Thriller) highlight titles that achieve cult or mainstream success. Recognizing these patterns helps businesses replicate success factors in future projects.

📉 Potential Negative Growth Insights
*  Action genre underperformance: Despite being a mainstream genre, Action appears scattered with relatively fewer votes. Heavy investment in Action without innovation or cross-genre appeal could lead to wasted resources and slower growth.

*  Scores don't guarantee popularity: The chart shows that high IMDb scores alone do not ensure high votes. Over-focusing on critically acclaimed but low-vote genres may result in limited commercial success.

*  Risk of ignoring niche clusters: Genres like Comedy and Romance cluster in mid-score ranges with moderate votes. Neglecting them could alienate steady audience bases, reducing diversity and long-term sustainability.

✅ Justification
*  Positive impact: Insights highlight Drama and Thriller as reliable drivers of both engagement and quality, guiding producers toward profitable investments.

*  Negative growth risk: Action's weaker vote presence and the disconnect between scores and votes show that over-investment in certain genres or over-reliance on ratings could reduce profitability unless supported by targeted strategies.

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain Briefly.

*  Prioritize high-performing genres: Focus investments on Drama and Thriller, which consistently show strong IMDb scores and high vote counts, ensuring both quality and popularity.

*  Leverage modern audience engagement: Recent decades attract significantly higher votes due to digital accessibility. Producers should emphasize contemporary releases with strong marketing to maximize visibility and participation.

*  Balance risk with innovation: Genres like Comedy and Romance show variability: they can succeed if paired with quality scripts and targeted campaigns. Action requires innovation or cross-genre blending to avoid underperformance.

*  Capitalize on historical strengths: Revive storytelling styles from golden eras (1960s–1980s), when films had consistently higher scores, but adapt them to modern tastes for commercial success.

*  Data-driven resource allocation: Use score distributions and vote trends to guide investment decisions, ensuring funds are directed toward genres and release strategies with proven audience traction.

*  Business impact focus: This approach ensures consistent quality, maximized engagement, and reduced risk, aligning creative choices with profitability and long-term growth.
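
As one possible starting point for the data-driven allocation suggested above, a per-genre summary can place quality and engagement side by side. This is a sketch under stated assumptions: `toy_df` and its values are hypothetical, and the choice of medians alongside totals is an illustrative design decision rather than the method used in the charts. Total votes grow with the number of titles in a genre, so the median per title is included to separate catalog size from typical engagement.

```python
import pandas as pd

# Hypothetical toy data; the notebook would use titles_df with the same columns.
toy_df = pd.DataFrame({
    "primary_genre": ["drama", "drama", "drama", "action", "action"],
    "imdb_score": [7.0, 8.0, 6.0, 5.5, 6.5],
    "imdb_votes": [1000, 9000, 2000, 500, 1500],
})

# Named aggregation: one row per genre with size, quality, and engagement columns.
genre_summary = (
    toy_df.groupby("primary_genre")
    .agg(
        titles=("imdb_score", "size"),
        median_score=("imdb_score", "median"),
        median_votes=("imdb_votes", "median"),
        total_votes=("imdb_votes", "sum"),
    )
    .sort_values("total_votes", ascending=False)
)
print(genre_summary)
```

A genre that ranks high on `total_votes` but low on `median_votes` owes its position mainly to having many titles, which is worth knowing before treating it as a high-engagement investment.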

# **Conclusion**

The analysis of Charts 1–20 provides a comprehensive view of IMDb data across genres, decades, scores, and votes. The findings consistently highlight Drama and Thriller as the strongest performers, with high ratings and audience engagement. Other genres such as Comedy, Romance, and Action show variability, requiring careful handling and innovation.

Justification:  
*  Drama and Thriller serve as anchor genres: they are reliable, low-risk investments that ensure stable returns and audience satisfaction. At the same time, modern decades (post-2000) demonstrate a surge in votes, reflecting digital accessibility and global reach. This creates opportunities for producers to maximize visibility through contemporary releases. Historical "golden eras" (1960s–1980s) also stand out with consistently high scores, offering inspiration for remakes or revivals when adapted to modern tastes.

Business Impact:  
*  The insights guide a balanced strategy:

*  Anchor investments in Drama and Thriller to secure consistent quality and profitability.

*  Diversify cautiously into Comedy and Romance, where variability can yield breakout successes with strong scripts and targeted marketing.

*  Innovate in Action to capture niche or global markets.

*  Leverage modern engagement trends by focusing on contemporary releases with strong digital marketing.

*  Adapt golden-era storytelling to modern contexts, blending nostalgia with current audience expectations.

This approach ensures positive business impact by combining reliability with innovation, maximizing audience engagement, and minimizing risk. It positions the client to achieve sustainable growth, balancing critical acclaim with commercial success.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***