# **Project Name**    - Amazon Prime TV Shows and Movies



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** Neetu Singh


# **Project Summary -**

**📋 Project Summary:**
This project conducts an in-depth Exploratory Data Analysis (EDA) of Amazon Prime's TV Shows and Movies dataset. The goal is to uncover key insights about content distribution, viewer preferences, and content performance on the platform. Through 15 charts, the project explores trends in genres, ratings, content types, and more. The analysis also lays the groundwork for a machine learning model that predicts the likelihood of a show or movie receiving a high IMDb rating, which would support Amazon Prime's content strategy and user engagement.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**❓ Problem Statement:**
Amazon Prime offers a vast library of TV shows and movies, but understanding what drives user engagement and high content ratings remains a challenge. With an ever-growing catalog, it's essential to identify patterns that contribute to successful content and predict the likelihood of future content performing well.

The core problem is to analyze the existing Amazon Prime content library and build a predictive model that can help the platform make data-driven decisions for content acquisition and recommendations.

#### **Define Your Business Objective?**

**🎯 Business Objective:**
The primary objective of this project is to:

*  Analyze Amazon Prime's TV Shows and Movies to identify patterns in content types, genres, IMDb ratings, and release trends.

*  Predict whether a show or movie is likely to receive a high IMDb score (≥7) using machine learning techniques.

*  Support Decision-Making for content acquisition, curation, and personalized recommendations based on data insights.

*  Improve Viewer Engagement by understanding user preferences and optimizing the content library accordingly.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
titles = pd.read_csv('titles.csv')
credits = pd.read_csv('credits.csv')

### Dataset First View

In [None]:
# Dataset First Look
display(titles.head())
display(credits.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Titles dataset shape:", titles.shape)
print("Credits dataset shape:", credits.shape)


### Dataset Information

In [None]:
# Dataset Info
print("Titles dataset info:")
titles.info()
print("\nCredits dataset info:")
credits.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Duplicate values in titles:", titles.duplicated().sum())
print("Duplicate values in credits:", credits.duplicated().sum())


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Missing values in titles:\n", titles.isnull().sum())
print("\nMissing values in credits:\n", credits.isnull().sum())


In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(titles.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values in Titles Dataset')
plt.show()

plt.figure(figsize=(10, 6))
sns.heatmap(credits.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values in Credits Dataset')
plt.show()


In [None]:
# Merge datasets (if common key exists)
if 'id' in titles.columns and 'id' in credits.columns:
    df = pd.merge(titles, credits, on='id', how='left')
else:
    df = titles.copy()

### What did you know about your dataset?

*   **Titles dataset:** Contains information about movies and TV shows, including title, type, release year, rating, genres, etc. It has some missing values in columns like 'age_certification', 'runtime', 'genres', etc.
*   **Credits dataset:** Contains information about the cast and crew of the titles, including the person's ID, their role, and the title's ID. It has some missing values in the 'role' column.
*   The datasets will need to be merged for comprehensive analysis.
*   Data cleaning and handling missing values will be necessary before proceeding with further analysis.
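
One consequence of the merge is worth checking before charting. The sketch below uses small made-up frames (`titles_demo` and `credits_demo` are hypothetical stand-ins for the real CSVs) to show that a left merge on `id` produces one row per title and person, so counting rows in the merged frame counts credits rather than titles.

```python
import pandas as pd

# Toy stand-ins for titles.csv and credits.csv (hypothetical values).
titles_demo = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "type": ["MOVIE", "SHOW", "MOVIE"],
    "imdb_score": [6.5, 8.1, 5.2],
})
credits_demo = pd.DataFrame({
    "id": ["t1", "t1", "t2"],
    "name": ["A. Actor", "D. Director", "B. Actor"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# validate="one_to_many" raises if 'id' is not unique in the left frame.
merged = pd.merge(titles_demo, credits_demo, on="id", how="left",
                  validate="one_to_many")

# One row per (title, person): t1 appears twice, t3 has no credits.
print(merged.shape)  # (4, 5)

# Collapse back to one row per title before counting titles.
title_level = merged.drop_duplicates(subset="id")
print(title_level["type"].value_counts())
```

If the real merged `df` shows the same inflation, title-level charts (content type, score, runtime) are probably better drawn from a frame deduplicated on `id`.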

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Titles dataset columns:", titles.columns)
print("Credits dataset columns:", credits.columns)


In [None]:
# Dataset Describe
display(titles.describe())
display(credits.describe())


### Variables Description

**Answer Here:**

**Titles Dataset** (columns used in this notebook; the `titles.columns` output above is the authoritative list):
*   **id:** Unique identifier for each title; the key used to merge with the credits dataset.
*   **title:** Title of the show or movie.
*   **type:** Whether the content is a 'SHOW' or a 'MOVIE'.
*   **description:** Brief description of the content.
*   **release_year:** Year of release.
*   **age_certification:** Content rating (e.g., PG-13, TV-MA).
*   **runtime:** Length of the movie or episode.
*   **genres:** Genres the content is listed under.
*   **production_countries:** Countries that produced the title.
*   **imdb_score:** IMDb rating of the title.
*   **tmdb_popularity:** Popularity score from TMDB.

**Credits Dataset:**
*   **person_id:** Unique identifier for each person involved in the production.
*   **id:** Title ID, linking to the 'titles' dataset.
*   **name:** Name of the person (actor, director, etc.).
*   **character:** Character played by the actor (if applicable).
*   **role:** Role of the person (e.g., ACTOR, DIRECTOR).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in titles.columns:
    print(f"Unique values in '{column}': {titles[column].nunique()}")

for column in credits.columns:
    print(f"Unique values in '{column}': {credits[column].nunique()}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# ----------------------------------------------
#  1. Handling Missing Values
# ----------------------------------------------
# Fill missing IMDb scores with mean
df['imdb_score'] = df['imdb_score'].fillna(df['imdb_score'].mean())

# Replace missing genres with 'Unknown'
df['genres'] = df['genres'].fillna('Unknown')

# Fill missing runtime with median
df['runtime'] = df['runtime'].fillna(df['runtime'].median())

# ----------------------------------------------
#  2. Removing Duplicates
# ----------------------------------------------
# Step 1: Convert Lists to Hashable Tuples Before Deduplication
# ---------------------------------------------------
for col in df.columns:
    if df[col].apply(lambda x: isinstance(x, list)).any():
        df[col] = df[col].apply(lambda x: tuple(x) if isinstance(x, list) else x)

# ---------------------------------------------------
# Step 2: Drop Duplicates Safely
# ---------------------------------------------------
# drop_duplicates must be called and its result assigned back, otherwise nothing is removed
df = df.drop_duplicates()

# ---------------------------------------------------
#  Step 3: (Optional) Convert Tuples Back to Lists for Further Analysis
# ---------------------------------------------------
for col in df.columns:
    if df[col].apply(lambda x: isinstance(x, tuple)).any():
        df[col] = df[col].apply(lambda x: list(x) if isinstance(x, tuple) else x)

#  Final Check
print("✅ Duplicates removed successfully without errors!")
display(df.head())

# ----------------------------------------------
# 3. Data Type Conversion
# ----------------------------------------------
# Convert release year to datetime (plain column assignment so the new dtype replaces the old one)
df['release_year'] = pd.to_datetime(df['release_year'], format='%Y', errors='coerce')

# Convert IMDb scores to numeric
df['imdb_score'] = pd.to_numeric(df['imdb_score'], errors='coerce')

# ----------------------------------------------
# 4. Outlier Detection & Handling
# ----------------------------------------------
# Visualize runtime outliers
plt.figure(figsize=(8,5))
sns.boxplot(x=df['runtime'])
plt.title('Runtime Outlier Detection')
plt.show()

# Remove extreme outliers (above 95th percentile)
runtime_threshold = df['runtime'].quantile(0.95)
# .copy() avoids SettingWithCopyWarning when new columns are added below
df = df[df['runtime'] <= runtime_threshold].copy()

# ----------------------------------------------
# 5. Text Data Processing
# ----------------------------------------------
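# Split multi-value fields into clean lists; stripping brackets and quotes also
# handles list-like strings such as "['drama', 'comedy']"
df['genre_list'] = df['genres'].apply(
    lambda x: [g.strip(" []'\"") for g in x.split(',') if g.strip(" []'\"")] if isinstance(x, str) else [])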

# ----------------------------------------------
# 6. Feature Engineering
# ----------------------------------------------

# Create a 'content_age' feature
df['content_age'] = 2024 - pd.to_datetime(df['release_year'], errors='coerce').dt.year

# Count number of genres per title
df['genre_count'] = df['genre_list'].apply(lambda x: len(x) if isinstance(x, list) else 0)

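# Categorize IMDb score ranges; right=False gives bins [0, 5), [5, 7), [7, inf),
# so a score of exactly 7 counts as 'High', matching the ≥7 business objective
df['rating_category'] = pd.cut(df['imdb_score'],
                               bins=[0, 5, 7, np.inf],
                               labels=['Low', 'Average', 'High'],
                               right=False)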

# ----------------------------------------------
# 7. Final Check After Wrangling
# ----------------------------------------------
print("🔍 Final Dataset Overview:")
df.info()  # info() prints directly and returns None, so it is not wrapped in display()
print("\nMissing Values:\n", df.isnull().sum())
print("\nSample Data:\n")
display(df.head())
# ----------------------------------------------
# 8. Quick Visualization for Validation
# ----------------------------------------------
# IMDb Score Distribution
plt.figure(figsize=(8,5))
sns.histplot(df['imdb_score'], bins=20, kde=True, color='teal')
plt.title('IMDb Score Distribution After Cleaning')
plt.show()

# Content Age Distribution
plt.figure(figsize=(8,5))
sns.histplot(df['content_age'], bins=20, kde=True, color='red')
plt.title('Content Age Distribution')
plt.show()

# Genre Count Distribution
plt.figure(figsize=(8,5))
sns.countplot(y='genre_count', data=df, order=df['genre_count'].value_counts().index, hue='genre_count', legend=False, palette='coolwarm')
plt.title('Number of Genres per Title')
plt.show()

print("Data Wrangling Completed Successfully!")

### What all manipulations have you done and insights you found?

**Answer Here:**

**Manipulations Done:**
1. **Missing Values:** Filled missing IMDb scores with the mean, genres with 'Unknown', and runtime with the median.
2. **Duplicates Removal:** Converted lists to tuples, dropped duplicates, and reverted tuples back to lists.
3. **Data Type Conversion:** Converted release_year to datetime and imdb_score to numeric.
4. **Outlier Handling:** Removed extreme runtime outliers above the 95th percentile.
5. **Text Processing:** Split multi-genre fields into lists (genre_list) for detailed analysis.
6. **Feature Engineering:** Created content_age, genre_count, and categorized IMDb scores into Low, Average, and High.

**Key Insights:**
* Drama and Documentary are the most common genres.
* Recent releases (post-2015) dominate the library, although Chart 10 shows their average IMDb score declining.
* Movies dominate, but TV shows often have better ratings.
* After removing runtime outliers, most movies run 90–120 mins.
* USA & UK lead content production, but foreign films have niche followings.

**Conclusion:**
The data wrangling process cleaned and prepared the dataset for deeper EDA and model building.
The new features (content_age, genre_count, rating_category) allow for more insightful visualizations and will enhance the predictive power of future machine learning models.
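
The text-processing step depends on how `genres` is stored in the CSV. As a minimal sketch, assuming the column may hold either list-like strings such as `"['drama', 'comedy']"` or plain comma-separated text, a small helper (the name `parse_list_field` is hypothetical) can turn both into clean lists:

```python
import ast
import pandas as pd

def parse_list_field(value):
    """Return a list of labels from a list-like string, or [] if nothing usable."""
    if isinstance(value, list):
        return value
    if not isinstance(value, str):
        return []  # NaN or other non-string input
    try:
        parsed = ast.literal_eval(value)
        if isinstance(parsed, (list, tuple)):
            return [str(item).strip() for item in parsed]
    except (ValueError, SyntaxError):
        pass
    # Fall back to a plain comma-separated string such as "drama, comedy".
    return [part.strip() for part in value.split(",") if part.strip()]

# Made-up examples covering the cases the helper is meant to handle.
genres_demo = pd.Series(["['drama', 'comedy']", "documentary", "[]", None])
genre_lists = genres_demo.apply(parse_list_field)
genre_count = genre_lists.apply(len)

print(genre_lists.tolist())  # [['drama', 'comedy'], ['documentary'], [], []]
print(genre_count.tolist())  # [2, 1, 0, 0]
```

Parsing with `ast.literal_eval` keeps each genre label free of stray brackets and quotes, which matters once the lists are exploded and counted.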

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# ----------------------------------------------
# 📊 Univariate Analysis
# ----------------------------------------------

# 1️⃣ Content Type Distribution
plt.figure(figsize=(8,5))
#sns.countplot(data=df, x='type', palette='coolwarm')
sns.countplot(data=df, x='type', hue='type', palette='coolwarm', legend=False)  # Assign 'type' to 'hue'
plt.title('Content Type Distribution')
plt.show()



##### 1. Why did you pick the specific chart?

**Answer Here:** To understand the proportion of movies vs. TV shows available on Amazon Prime.



##### 2. What is/are the insight(s) found from the chart?

**Answer Here:** Movies outnumber TV shows on the platform. Ratings by type are compared in Chart 6, where TV shows tend to score higher.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer Here:** Yes. Understanding content distribution guides Amazon Prime in balancing its library around viewer preferences. Overemphasizing movies could neglect potential TV show audiences.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# 2️⃣ Top 10 Genres
df['genres'] = df['genres'].fillna('Unknown')
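# Strip brackets and quotes so list-like strings such as "['drama', 'comedy']" split cleanly
df['genre_list'] = df['genres'].apply(
    lambda x: [g.strip(" []'\"") for g in x.split(',') if g.strip(" []'\"")] if isinstance(x, str) else [])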
genres_exploded = df.explode('genre_list')
top_genres = genres_exploded['genre_list'].value_counts().head(10)

plt.figure(figsize=(10,5))
top_genres.plot(kind='bar', color='purple')
plt.title('Top 10 Genres')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer Here:** To identify the genres that appear most often in the catalog.

##### 2. What is/are the insight(s) found from the chart?

**Answer Here:** Drama, Documentary, and Comedy are the most common genres.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer Here:** Yes. Focusing on high-demand genres can improve user engagement. Over-saturation in popular genres could lead to repetitive content fatigue.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# 3️⃣ IMDb Score Distribution
plt.figure(figsize=(8,5))
sns.histplot(df['imdb_score'], bins=20, kde=True, color='teal')
plt.title('IMDb Score Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer Here:** To assess the quality of content based on IMDb ratings.

##### 2. What is/are the insight(s) found from the chart?

**Answer Here:** The majority of content falls between IMDb scores of 5 and 7.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer Here:** Yes. It highlights the need for quality improvement in new content. Low-rated content might drive user dissatisfaction.
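
When reading this histogram, it may help to remember that missing scores were filled with the mean during wrangling. The sketch below, on made-up scores, illustrates how mean imputation piles values into the middle of the distribution and can inflate the share of titles that appear to sit between 5 and 7 (`share_between` is a hypothetical helper).

```python
import numpy as np
import pandas as pd

# Made-up scores: six observed values and four missing ones.
scores = pd.Series([4.0, 5.5, 6.0, 6.5, 7.5, 8.0,
                    np.nan, np.nan, np.nan, np.nan])
filled = scores.fillna(scores.mean())  # same strategy as the wrangling step

def share_between(s, low=5, high=7):
    """Fraction of non-missing values inside [low, high]."""
    s = s.dropna()
    return s.between(low, high).mean()

share_observed = share_between(scores)  # based on observed values only
share_filled = share_between(filled)    # after mean imputation

print(f"Observed: {share_observed:.2f}, after imputation: {share_filled:.2f}")
print(f"Std observed: {scores.std():.2f}, after imputation: {filled.std():.2f}")
```

Plotting the histogram on `titles['imdb_score'].dropna()` alongside the imputed version is one way to check how much of the central peak comes from imputation.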



#### Chart - 4

In [None]:
# Chart - 4 visualization code
# 4️⃣ Runtime Distribution
plt.figure(figsize=(8,5))
sns.boxplot(data=df, x='runtime', color='coral')
plt.title('Runtime Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer Here:** To analyze typical content lengths and detect outliers.

##### 2. What is/are the insight(s) found from the chart?

**Answer Here:** Most movies fall within the 90–120-minute range, with a few runtime outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer Here:** Yes. Understanding viewer preferences for content length helps tailor future productions. Extreme outliers may lead to lower engagement.
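
The wrangling step removed runtimes above the 95th percentile. As an alternative for comparison only, the sketch below applies the common 1.5 × IQR rule (the same rule a box plot uses for its whiskers) to made-up runtimes. Unlike a percentile cutoff, it flags values on both ends and removes nothing when no value is extreme.

```python
import pandas as pd

# Made-up runtimes in minutes.
runtime = pd.Series([22, 45, 88, 90, 95, 100, 105, 110, 118, 120, 125, 300])

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = runtime.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = runtime[(runtime < lower) | (runtime > upper)]

# Percentile cutoff used in the wrangling step, for comparison.
p95 = runtime.quantile(0.95)
p95_removed = runtime[runtime > p95]

print("IQR bounds:", lower, upper)                 # 46.0 162.0
print("IQR outliers:", iqr_outliers.tolist())      # [22, 45, 300]
print("Above 95th pct:", p95_removed.tolist())     # [300]
```

Which rule fits better depends on whether very short runtimes (for example, short episodes) should count as outliers at all.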

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# 5️⃣ Content Age Distribution
df['release_year'] = pd.to_datetime(df['release_year'], format='%Y', errors='coerce')
df['content_age'] = 2024 - df['release_year'].dt.year

plt.figure(figsize=(8,5))
sns.histplot(df['content_age'], bins=20, kde=True, color='gold')
plt.title('Content Age Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer Here:** To analyze the age of content and trends in releases.

##### 2. What is/are the insight(s) found from the chart?

**Answer Here:** Recent releases (post-2015) dominate the library, with a long tail of older classics.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer Here:** Yes. It helps in planning future acquisitions and highlights the importance of maintaining a balance between new content and evergreen classics.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# ----------------------------------------------
# 📊 Bivariate Analysis
# ----------------------------------------------

# 6️⃣ IMDb Scores by Content Type
plt.figure(figsize=(8,5))
# sns.boxplot(data=df, x='type', y='imdb_score', palette='Set2')
sns.boxplot(data=df, x='type', y='imdb_score', hue='type', palette='Set2', legend=False) # Assign 'type' to 'hue'
plt.title('IMDb Scores by Content Type')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer Here:** To compare IMDb scores across Movies and TV Shows.

##### 2. What is/are the insight(s) found from the chart?

**Answer Here:** TV shows generally have slightly higher IMDb ratings compared to movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer Here:** Yes. Investing in high-rated TV shows can boost engagement. However, an overemphasis on TV shows could alienate movie fans.
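
A box plot comparison is easier to quote when backed by numbers. A minimal sketch on made-up scores (the values below are illustrative, not taken from the dataset) summarises each content type with its count, median, and mean:

```python
import pandas as pd

# Made-up scores for illustration only.
demo = pd.DataFrame({
    "type": ["MOVIE"] * 5 + ["SHOW"] * 4,
    "imdb_score": [5.0, 6.0, 6.5, 7.0, 5.5, 7.0, 7.5, 6.5, 8.0],
})

# Count, median, and mean per content type.
summary = demo.groupby("type")["imdb_score"].agg(["count", "median", "mean"])
median_gap = summary.loc["SHOW", "median"] - summary.loc["MOVIE", "median"]

print(summary)
print("Median gap (SHOW - MOVIE):", median_gap)  # 1.25
```

Running the same `groupby` on a title-level version of `df` would show how large the "slightly higher" gap actually is.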



#### Chart - 7

In [None]:
# Chart - 7 visualization code
# 7️⃣ Runtime vs. IMDb Score
plt.figure(figsize=(8,5))
sns.scatterplot(data=df, x='runtime', y='imdb_score', hue='type')
plt.title('Runtime vs. IMDb Score')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer Here:** To explore if longer runtimes correlate with higher IMDb scores.

##### 2. What is/are the insight(s) found from the chart?

**Answer Here:** No strong correlation was found, but extremely long or short runtimes often have lower ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer Here:** Yes. It helps optimize runtimes to match user preferences, improving engagement and satisfaction.
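
Both parts of this reading can be checked numerically. The sketch below, on made-up data shaped like the pattern described (low scores at both runtime extremes), computes a Pearson and a rank-based (Spearman-style) correlation and then the mean score per runtime band. The band edges of 60 and 120 minutes are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up data: scores peak at mid-length runtimes.
demo = pd.DataFrame({
    "runtime": [10, 25, 60, 90, 100, 110, 120, 150, 200, 240],
    "imdb_score": [5.0, 5.5, 6.5, 7.0, 7.2, 7.1, 6.9, 6.4, 5.6, 5.1],
})

# Linear correlation, and correlation of ranks (equivalent to Spearman's rho).
pearson_r = demo["runtime"].corr(demo["imdb_score"])
spearman_r = demo["runtime"].rank().corr(demo["imdb_score"].rank())

# Mean score per runtime band; edges are hypothetical.
bands = pd.cut(demo["runtime"], bins=[0, 60, 120, 300],
               labels=["short", "medium", "long"])
band_means = demo.groupby(bands, observed=True)["imdb_score"].mean()

print(f"Pearson r: {pearson_r:.2f}, rank correlation: {spearman_r:.2f}")
print(band_means)
```

A near-zero correlation together with a higher mean in the middle band is what an inverted-U relationship looks like, which a single correlation coefficient cannot capture.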

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# 8️⃣ Top Countries by Content Count
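df['production_countries'] = df['production_countries'].fillna('Unknown')
# Parse each entry into a list so explode() yields one row per country;
# stripping brackets and quotes handles list-like strings such as "['US', 'GB']"
df['country_list'] = df['production_countries'].apply(
    lambda x: [c.strip(" []'\"") for c in x.split(',') if c.strip(" []'\"")] if isinstance(x, str) else [])
countries_exploded = df.explode('country_list')
top_countries = countries_exploded['country_list'].value_counts().head(10)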

plt.figure(figsize=(10,5))
top_countries.plot(kind='bar', color='green')
plt.title('Top Countries by Content Count')
plt.ylabel('Number of Titles')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer Here:** To identify key countries contributing to Amazon Prime’s content.

##### 2. What is/are the insight(s) found from the chart?

**Answer Here:** USA and UK dominate content production, but other countries contribute niche titles.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer Here:** Yes. It supports regional content strategies and helps diversify the library for a global audience.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# 9️⃣ Genre vs. IMDb Score
plt.figure(figsize=(12,6))
sns.boxplot(data=genres_exploded, x='genre_list', y='imdb_score')
plt.title('Genre vs. IMDb Score')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

**Answer Here:** To understand how different genres perform in terms of user ratings.

##### 2. What is/are the insight(s) found from the chart?

**Answer Here:** Genres like Documentary and Drama consistently achieve higher IMDb ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer Here:** Yes. It helps Amazon focus on genres that yield higher user satisfaction. Underperforming genres could indicate areas needing quality improvement.



#### Chart - 10

In [None]:
# Chart - 10 visualization code
# 🔟 Release Year vs. IMDb Score
plt.figure(figsize=(10,6))
sns.lineplot(data=df, x='release_year', y='imdb_score')
plt.title('Release Year vs. IMDb Score')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer Here:** To analyze how content ratings have evolved over time.

##### 2. What is/are the insight(s) found from the chart?

**Answer Here:** Content released after 2015 shows a decline in average IMDb scores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer Here:** Yes. It indicates a need for quality checks on recent content. A downward trend in ratings could negatively affect user retention.
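
A line plot of yearly means hides how many titles sit behind each point. The sketch below, on made-up data with `release_year` kept as a plain integer (an assumption; the notebook converts it to datetime), compares the title count and mean score before and after 2015.

```python
import pandas as pd

# Made-up data for illustration only.
demo = pd.DataFrame({
    "release_year": [1995, 2005, 2010, 2014, 2016, 2016, 2018, 2019, 2020, 2021],
    "imdb_score":   [7.5, 7.0, 6.8, 7.1, 6.0, 6.4, 5.9, 6.2, 5.8, 6.1],
})

# Label each title by period, then summarise count and mean per period.
demo["period"] = demo["release_year"].gt(2015).map(
    {True: "post-2015", False: "2015 or earlier"})
period_stats = demo.groupby("period")["imdb_score"].agg(["count", "mean"])

print(period_stats)
```

Reporting the counts alongside the means helps separate a real decline from noise in years with only a handful of titles.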

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# ----------------------------------------------
# 📊 Multivariate Analysis
# ----------------------------------------------

# 1️⃣1️⃣ Content Type vs. Genre Popularity
# Explode the 'genre_list' column to create individual rows for each genre
genres_exploded = df.explode('genre_list')

# Now create the crosstab using the exploded DataFrame
plt.figure(figsize=(12, 6))
sns.heatmap(pd.crosstab(genres_exploded['genre_list'], genres_exploded['type']), annot=True, cmap='YlGnBu')
plt.title('Content Type vs. Genre Popularity')
plt.show()


##### 1. Why did you pick the specific chart?

**Answer Here:** To explore the relationship between content type (Movie/Show) and genre distribution.



##### 2. What is/are the insight(s) found from the chart?

**Answer Here:** Movies dominate action and thriller genres, while TV shows focus on drama and comedy.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer Here:** Yes. It helps tailor content strategies based on genre preferences per content type.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# 1️⃣2️⃣ Top Directors & Their IMDb Averages
director_scores = df[df['role'] == 'DIRECTOR'].groupby('name')['imdb_score'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(10,5))
director_scores.plot(kind='bar', color='skyblue')
plt.title('Top Directors by Average IMDb Score')
plt.ylabel('Average IMDb Score')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer Here:** To identify high-performing directors who consistently deliver top-rated content.

##### 2. What is/are the insight(s) found from the chart?

**Answer Here:** Certain directors consistently produce higher IMDb-rated content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer Here:** Yes. Amazon can prioritize collaborations with high-performing directors for new content.
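
A ranking by mean score alone tends to be led by directors with a single highly rated title, which says little about consistency. One possible refinement, sketched on made-up data, is to require a minimum number of titles per director before ranking. The threshold `MIN_TITLES` is a hypothetical choice.

```python
import pandas as pd

# Made-up director credits for illustration only.
demo = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4", "t5", "t6"],
    "name": ["Dir A", "Dir B", "Dir B", "Dir B", "Dir C", "Dir C"],
    "role": ["DIRECTOR"] * 6,
    "imdb_score": [9.0, 7.5, 7.8, 7.2, 6.0, 6.4],
})

MIN_TITLES = 2  # hypothetical threshold; tune to the real data

# One row per (title, director), then count titles and average scores.
directors = demo[demo["role"] == "DIRECTOR"].drop_duplicates(subset=["id", "name"])
stats = directors.groupby("name")["imdb_score"].agg(
    n_titles="count", mean_score="mean")

# Keep only directors with enough titles, ranked by mean score.
ranked = stats[stats["n_titles"] >= MIN_TITLES].sort_values(
    "mean_score", ascending=False)

print(ranked)
```

Here "Dir A" has the highest score but only one title, so it drops out of the ranking.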

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# 1️⃣3️⃣ Content Age vs. IMDb Score by Genre
plt.figure(figsize=(12,6))
sns.violinplot(data=genres_exploded, x='content_age', y='imdb_score', hue='genre_list')
plt.title('Content Age vs. IMDb Score by Genre')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer Here:** To see how content age impacts ratings across different genres.



##### 2. What is/are the insight(s) found from the chart?

**Answer Here:** Documentaries and dramas tend to maintain high ratings over time.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer Here:** Yes. It supports long-term value assessments of different genres, aiding in content retention strategies.
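
With one violin per age value and genre, the plot can become crowded. A tabular alternative, sketched here on made-up data, is to bin `content_age` into bands and tabulate the mean score per band and genre. The band edges and labels are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up exploded rows (one row per title and genre).
demo = pd.DataFrame({
    "genre_list": ["drama", "drama", "drama", "comedy", "comedy", "comedy",
                   "documentary", "documentary"],
    "content_age": [2, 10, 40, 3, 12, 25, 4, 35],
    "imdb_score": [6.5, 7.0, 7.6, 6.0, 6.2, 6.1, 7.2, 7.8],
})

# Hypothetical age bands in years.
demo["age_band"] = pd.cut(demo["content_age"], bins=[0, 5, 15, 30, 200],
                          labels=["0-5", "6-15", "16-30", "31+"],
                          include_lowest=True)

# Mean IMDb score for each age band and genre; NaN where no titles exist.
pivot = demo.pivot_table(index="age_band", columns="genre_list",
                         values="imdb_score", aggfunc="mean", observed=True)

print(pivot)
```

The resulting table can also be passed to `sns.heatmap` for a compact view of how each genre's ratings hold up with age.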

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# 1️⃣4️⃣ Genre Count vs. IMDb Rating Category
df['genre_count'] = df['genre_list'].apply(lambda x: len(x) if isinstance(x, list) else 0)
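# right=False gives bins [0, 5), [5, 7), [7, inf), so a score of exactly 7 counts as 'High'
df['rating_category'] = pd.cut(df['imdb_score'], bins=[0, 5, 7, np.inf], labels=['Low', 'Average', 'High'], right=False)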

# Create a crosstab to aggregate the data for the heatmap
heatmap_data = pd.crosstab(df['genre_count'], df['rating_category'])

plt.figure(figsize=(10,6))
sns.heatmap(data=heatmap_data, annot=True, cmap='coolwarm', fmt='d') # fmt='d' to display counts as integers
plt.title('Genre Count vs. IMDb Rating Category')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer Here:** To check if titles with more genres perform better or worse in ratings.



##### 2. What is/are the insight(s) found from the chart?

**Answer Here:** Titles with 2–3 genres tend to have higher ratings compared to single-genre content.

##### 3. Will the gained insights help create a positive business impact?


**Answer Here:** Yes. It encourages more genre-diverse content, enhancing user appeal.
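
Raw counts in a crosstab mostly reflect how many titles have each genre count, so comparing rating categories across rows is easier with row-wise shares. A minimal sketch on made-up data uses `normalize="index"` so that each row sums to 1:

```python
import pandas as pd

# Made-up genre counts and rating categories for illustration only.
demo = pd.DataFrame({
    "genre_count": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "rating_category": ["Low", "Average", "Average", "High",
                        "Average", "High", "High", "Average",
                        "High", "Average"],
})

# Share of each rating category within each genre_count row.
shares = pd.crosstab(demo["genre_count"], demo["rating_category"],
                     normalize="index")

print(shares.round(2))
```

If the share of 'High' rises with `genre_count` in the real data, that would support the reading above more directly than the raw counts.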

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# 1️⃣5️⃣ Pair Plot
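# pairplot creates its own figure, so no plt.figure() call is needed;
# dropna() keeps only rows where all four variables are present
sns.pairplot(df[['imdb_score', 'runtime', 'content_age', 'tmdb_popularity']].dropna(), diag_kind='kde')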
plt.suptitle('Pair Plot of Key Variables', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

**Answer Here:** To analyze pairwise relationships between key variables like IMDb score, runtime, content age, and popularity.



##### 2. What is/are the insight(s) found from the chart?

**Answer Here:** There is a moderate positive correlation between IMDb score and tmdb_popularity. Correlations between the other variables are weak.


##### 3. Will the gained insights help create a positive business impact?


**Answer Here:** Yes. It guides feature selection for machine learning models and highlights variables most affecting viewer engagement.
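
The correlations read off the pair plot can be quantified with a correlation matrix. The sketch below uses made-up values for the same four columns and pandas' default Pearson correlation; the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up values for the four pair-plot variables.
demo = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 6.5, 7.0, 7.5, 8.0],
    "runtime": [95, 30, 110, 100, 45, 120],
    "content_age": [3, 10, 1, 25, 7, 15],
    "tmdb_popularity": [1.0, 3.0, 2.5, 6.0, 8.0, 12.0],
})

# Pairwise Pearson correlations between all numeric columns.
corr = demo.corr()

print(corr.round(2))
```

On the real data, passing `corr` to `sns.heatmap(corr, annot=True)` would give a correlation heatmap to accompany the pair plot.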

## **5. Solution to Business Objective**


#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

**Answer Here:**
1. **Optimize Content Acquisition:** Focus on high-rated genres (Drama, Documentary) and top-performing directors.
2. **Enhance Viewer Engagement:** Balance between Movies and TV Shows and promote niche content.
3. **Improve Recommendations:** Use ML models to suggest high-rated content and personalize user experiences.
4. **Targeted Marketing:** Promote new releases in trending genres and revive older, high-rated titles.
5. **Continuous Monitoring:** Adapt strategies based on user preferences and engagement trends.

# **Conclusion**

The **Amazon Prime TV Shows and Movies EDA** provided valuable insights into the platform's content landscape, highlighting user preferences, content performance, and potential growth areas. The analysis uncovered key patterns in genre popularity, content ratings, and production trends, enabling Amazon to make **data-driven decisions** for content acquisition and user engagement strategies.

By focusing on **high-rated genres, enhancing content diversity**, and leveraging **machine learning models** for personalized recommendations, Amazon Prime can significantly improve **viewer satisfaction** and **retention rates**. Continuous data analysis and adapting to viewer trends will ensure the platform remains competitive and user-centric.

With these insights, Amazon Prime is better positioned to optimize its content library, increase user engagement, and drive long-term growth.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***