<a href="https://colab.research.google.com/github/krishna280104/AMAZON-PRIME-EDA/blob/main/Amazon_Prime_EDA_Krishnapriya.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - AMAZON PRIME TV & MOVIE TRENDS: AN EDA APPROACH



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

The project titled “Amazon Prime TV and Movie Trends – An EDA Approach” focuses on uncovering hidden patterns, insights, and trends within Amazon Prime’s publicly available TV show and movie dataset using Exploratory Data Analysis (EDA). As a growing OTT platform, Amazon Prime continuously adds content in various genres, formats, and regions. To maintain user satisfaction and optimize content strategy, it’s crucial to understand what types of content perform well, which genres dominate, and how user ratings and other metadata relate to viewership trends.

This analysis was conducted individually using Python and libraries such as Pandas, NumPy, Matplotlib, Seaborn, and Plotly for data cleaning, manipulation, and visualization. The main objective was to derive meaningful business insights that can guide Amazon Prime in its content creation, recommendation strategies, and user engagement improvements.

The project began with data loading and initial exploration, identifying missing values, incorrect data types, and anomalies. A thorough data cleaning process was performed, which included handling null values, converting data types (like dates), and standardizing column values (for genres, seasons, etc.). This helped prepare the dataset for deeper analysis.

Next, univariate and bivariate visualizations were used to explore individual columns and relationships between them. For instance, charts were created to analyze:
- The number of movies vs. TV shows
- Top contributing countries to Amazon Prime’s content
- Most frequent genres and their IMDb scores
- Relationship between runtime and ratings
- Seasonal trends in content release
- Distribution of IMDb scores across content types and genres

These visualizations helped highlight some important insights. For example, movies dominate over TV shows in quantity, but TV shows tend to have slightly higher IMDb ratings on average. It was also observed that genres like Drama, Comedy, and Documentary are the most frequently available, suggesting a content strategy aligned with mass appeal. However, niche genres still maintain consistent viewer ratings.

A correlation heatmap was used to identify numerical relationships between variables such as IMDb score, number of votes, seasons, and runtime. A pair plot was used to further explore multi-variable relationships, which added depth to the interpretation of these numerical fields.

The final section involved connecting the findings to real-world business objectives. Based on the data, it was clear that Amazon Prime could enhance its recommendation engine by combining genre, runtime, release year, and user rating data. Personalized recommendations based on such features could improve engagement and retention.

Other suggested solutions include better content scheduling based on seasonal trends, investing in genres that consistently receive high ratings, and focusing on regional content that shows strong engagement. The project ultimately offers actionable recommendations that, if applied, can help Amazon Prime strengthen its market position by making informed, data-driven content decisions.

# **GitHub Link -**

https://github.com/krishna280104/AMAZON-PRIME-EDA

# **Problem Statement**




With the explosion of digital content and growing competition in the OTT (Over-the-Top) industry, platforms like Amazon Prime are constantly striving to improve user engagement, retention, and satisfaction. However, understanding which factors influence viewer preferences—such as content type, genre, country of origin, release year, and user ratings—is a complex task due to the sheer volume and diversity of data.

Amazon Prime hosts thousands of movies and TV shows globally, but lacks readily interpretable insights from this data that could guide decision-making. Without a thorough analysis of content trends and user engagement patterns, the platform may miss opportunities to:
- Optimize its content strategy,
- Recommend the right content to the right audience,
- Identify underperforming areas for improvement, and
- Leverage high-performing categories to drive growth.





This project aims to address these challenges by conducting Exploratory Data Analysis (EDA) on Amazon Prime’s content dataset to uncover patterns, trends, and relationships that can help improve content quality, personalization, and strategic decisions.

#### **Define Your Business Objective?**

The main objective of this Exploratory Data Analysis (EDA) project is to extract meaningful insights from Amazon Prime’s TV shows and movies dataset to support better decision-making and enhance business outcomes. By analyzing key attributes like content type, genre, release year, runtime, IMDb ratings, and popularity, the project seeks to uncover trends and patterns that reflect viewer behavior and content performance.

The specific goals are to:
1.	Understand Content Distribution: Analyze the proportion of movies vs. TV shows and examine how content is distributed across genres, countries, and age certifications.
2.	Identify Viewer Preferences: Investigate which genres, runtimes, and release years receive higher IMDb ratings and engagement.
3.	Track Popularity Trends: Explore how content popularity varies with time, genre, and other metadata features.
4.	Support Data-Driven Personalization: Provide insights that can help recommend relevant content to users based on trends and ratings.
5.	Assist Strategic Planning: Help Amazon Prime identify what type of content to acquire or produce more of, based on data-backed insights, in order to improve viewer satisfaction and maximize engagement.

Ultimately, this analysis enables Amazon Prime to make informed, data-driven decisions that improve user experience, drive content success, and strengthen its position in the competitive OTT market.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
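# Import Libraries
import pandas as pd                # data loading and manipulation
import matplotlib.pyplot as plt    # base plotting
import seaborn as sns              # statistical charts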

### Dataset Loading

In [None]:
import pandas as pd
titles = pd.read_csv('titles.csv')
credits = pd.read_csv('credits.csv')

### Dataset First View

In [None]:
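# display() renders both previews; a bare expression is only shown when it is the last line of a cell
display(titles.head())
display(credits.head())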

### Dataset Rows & Columns count

In [None]:
print("Titles Shape:", titles.shape)
print("Credits Shape:", credits.shape)
print(titles.columns)
print(credits.columns)

### Dataset Information

In [None]:
titles.info()
credits.info()

#### Duplicate Values

In [None]:
print("Duplicate rows in titles:", titles.duplicated().sum())
print("Duplicate rows in credits:", credits.duplicated().sum())

titles.drop_duplicates(inplace=True)
credits.drop_duplicates(inplace=True) #removed duplicate rows

#### Missing Values/Null Values

In [None]:
print(titles.isnull().sum())
print(credits.isnull().sum())

In [None]:
!pip install missingno
import missingno as msno
import matplotlib.pyplot as plt
#visualize missing values in titles dataset
msno.matrix(titles)
plt.show()
#visualize missing values in credits
msno.matrix(credits)
plt.show()
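
The matrix plots give a visual impression of where values are missing. A numeric summary can complement them. The sketch below is only an illustration: it uses a small made-up DataFrame in place of `titles` (the sample values are not taken from the dataset) and reports the count and percentage of missing values per column.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for `titles`; values are illustrative only.
sample = pd.DataFrame({
    "title": ["A", "B", None, "D"],
    "imdb_score": [6.5, np.nan, np.nan, 7.2],
    "runtime": [95, 110, 45, 130],
})

# Count and percentage of missing values per column, worst first
missing_summary = pd.DataFrame({
    "n_missing": sample.isna().sum(),
    "pct_missing": (sample.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing_summary)
```

Replacing `sample` with `titles` or `credits` would give the same table for the real data.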

### What did you know about your dataset?

The titles

## ***2. Understanding Your Variables***

In [None]:
print(titles.columns)
print(credits.columns)

In [None]:
display(titles.describe())
display(credits.describe())

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
print(titles.nunique())
print(credits.nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
#make a copy so original stays safe
titles_clean = titles.copy()
credits_clean = credits.copy()

#drop rows with missing titles (can't analyze without names)
titles_clean = titles_clean.dropna(subset=['title'])

#fill missing IMDb scores and runtimes with the column mean
#(assign the result back; calling fillna(inplace=True) on a single column triggers warnings in newer pandas)
titles_clean['imdb_score'] = titles_clean['imdb_score'].fillna(titles_clean['imdb_score'].mean())
titles_clean['runtime'] = titles_clean['runtime'].fillna(titles_clean['runtime'].mean())

#replace missing age certifications with 'Unknown'
titles_clean['age_certification'] = titles_clean['age_certification'].fillna('Unknown')

#replace missing genres with 'Unknown'
titles_clean['genres'] = titles_clean['genres'].fillna('Unknown')

#drop columns that are not needed for the analysis (here: description)
titles_clean = titles_clean.drop(columns=['description'])

#reset index after cleaning
titles_clean = titles_clean.reset_index(drop=True)

#confirm it's clean
print("Null values after wrangling:\n", titles_clean.isnull().sum())

### What all manipulations have you done and insights you found?

1. Removed rows with missing title values, since titles are essential for the analysis.

2. Filled missing values in imdb_score and runtime with their respective column means.

3. Replaced missing values in age_certification and genres with 'Unknown'.

4. Removed duplicate rows to ensure clean and consistent data.

5. Dropped the description column.

6. Reset the index after cleaning to maintain proper row order.

INITIAL INSIGHTS

1. The data contains more movies than TV shows.

2. Most content was released after 2010, showing a modern catalog.

3. 'Unknown' or missing age ratings are quite frequent, which could affect age-based analysis.

4. IMDb scores are mostly between 5 and 7, with a few highly rated titles.

5. The most common roles in credits are actor and director, which can be useful for cast analysis later.
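
The initial insights above can be checked with a few one-line aggregations. This is a minimal sketch on a made-up sample: the column names follow the notebook, but the rows and the `'MOVIE'`/`'SHOW'` labels are assumptions for illustration, and `summarize` is a hypothetical helper.

```python
import pandas as pd

# Made-up stand-in for titles_clean; rows and type labels are assumptions.
sample = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "SHOW", "MOVIE"],
    "release_year": [1995, 2012, 2018, 2021],
    "imdb_score": [6.1, 5.4, 7.8, 6.6],
    "age_certification": ["Unknown", "R", "TV-14", "Unknown"],
})

def summarize(df):
    """Return the share of rows behind each initial insight."""
    return {
        "movie_share": (df["type"] == "MOVIE").mean(),
        "post_2010_share": (df["release_year"] > 2010).mean(),
        "unknown_cert_share": (df["age_certification"] == "Unknown").mean(),
        "score_5_to_7_share": df["imdb_score"].between(5, 7).mean(),
    }

summary = summarize(sample)
print(summary)
```

Calling `summarize(titles_clean)` would turn statements such as "mostly between 5 and 7" into concrete percentages for the real data.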

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create a figure for the plot with a specified size
plt.figure(figsize=(6,4))

# Generate a count plot of content types (Movie vs Show)
sns.countplot(data=titles_clean, x='type')

# Set the title of the plot
plt.title('Count of Movies vs TV Shows on Amazon Prime')

# Set the label for the x-axis
plt.xlabel('Content Type')

# Set the label for the y-axis
plt.ylabel('Count')

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart because it is the clearest way to compare counts of a categorical variable such as type (Movies vs TV shows). It makes the comparison visually easy to understand.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that Amazon Prime has more movies than TV shows. This indicates that movies form a larger portion of its content catalog.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**POSITIVE INSIGHT**

The platform appears to be movie-dominated, which can attract users looking for quick, standalone content. This suits casual viewers who want something they can finish in one sitting.

**NEGATIVE INSIGHTS**

The lower number of TV shows could be a missed opportunity, since long-format content can increase viewer retention. Amazon may consider investing in more series-based content to balance the offering.

#### Chart - 2

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create a figure for the plot with a specified size
plt.figure(figsize=(6,4))

# Generate a count plot of age certifications, ordered by frequency
sns.countplot(
    data=titles_clean,
    x='age_certification',
    order=titles_clean['age_certification'].value_counts().index,
    palette='Set2' # Set the color palette for the bars
)

# Set the title of the plot
plt.title('Distribution of content by age certification')

# Set the label for the x-axis
plt.xlabel('Age Certification')

# Set the label for the y-axis
plt.ylabel('Count')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I used a bar chart again because we are analyzing a categorical variable, age_certification. A bar chart makes it easy to compare the volume of content across different age ratings.

##### 2. What is/are the insight(s) found from the chart?



*   A significant portion of the content has an "Unknown" age certification.
*   Among the labeled ones, "16+" and "18+" have the highest content volume.
*   Content for children and family is relatively low.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**POSITIVE INSIGHTS**

The platform caters more towards teen and adult audiences, which matches the global OTT trend towards mature and more drama-driven content.

**NEGATIVE INSIGHTS**

The high number of 'Unknown' certifications might reduce trust among parents or cautious viewers. Limited content for children could also mean losing out on family subscribers.

#### Chart - 3

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate the number of titles released per year and sort by year
yearly_counts = titles_clean['release_year'].value_counts().sort_index()

# Create a figure for the plot with a specified size
plt.figure(figsize=(10,6))

# Generate a line plot showing the number of titles released per year
sns.lineplot(x=yearly_counts.index, y=yearly_counts.values, color='green')

# Set the title of the plot
plt.title('Number of Titles Released per Year')

# Set the label for the x-axis
plt.xlabel('Release Year')

# Set the label for the y-axis
plt.ylabel('Number of titles')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Add a grid to the plot
plt.grid(True)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a line chart to clearly visualize the number of titles released over the years. It helps show whether Amazon Prime's catalog is growing or declining over time.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that content release increased significantly after a certain year (e.g. post-2015), possibly reflecting Amazon's strategic push into original content. There may also be noticeable dips in certain years, such as during the pandemic.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

POSITIVE INSIGHTS

Growth over time indicates Amazon is expanding its catalog, potentially increasing user engagement and subscription value.

NEGATIVE INSIGHTS

If there is a sharp drop in recent years, it may hint at reduced production or acquisition, which can affect platform variety.
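
Growth and dips are easier to judge with year-over-year numbers than by eye. The sketch below is a hedged illustration on made-up release years (not the dataset): it derives the per-year counts as in the chart code and then computes the percentage change from the previous year.

```python
import pandas as pd

# Made-up release years standing in for titles_clean['release_year'].
release_year = pd.Series(
    [2017, 2017, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2020, 2020]
)

# Titles per year, in chronological order (same idea as the chart code)
yearly_counts = release_year.value_counts().sort_index()

# Percentage change relative to the previous year present in the data
yoy_change = yearly_counts.pct_change() * 100

trend = pd.DataFrame({"titles": yearly_counts, "yoy_pct": yoy_change.round(1)})
print(trend)

# Years in which fewer titles were released than in the year before
drop_years = trend.index[trend["yoy_pct"] < 0].tolist()
print(drop_years)
```

Two caveats apply to the real data: a year with no titles simply does not appear in `yearly_counts`, and the most recent year may be incomplete, so a final drop is not necessarily a real decline.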

#### Chart - 4

In [None]:
import ast
from collections import Counter
#function to clean and extract genres
def clean_genres(genres_str):
    try:
        #convert a stringified list to an actual list (if it looks like a list)
        genres = ast.literal_eval(genres_str)
    except (ValueError, SyntaxError):
        #not a list literal (e.g. the 'Unknown' placeholder): return it as a single-item list
        return [genres_str]
    #literal_eval can also return a non-list value; wrap that case as well
    return genres if isinstance(genres, list) else [genres_str]

#split genre strings and flatten the list
genres_series = titles_clean['genres'].dropna().apply(clean_genres)
flat_genres = [genre  for sublist in genres_series for genre in sublist]

#count and get top 10
genre_counts = Counter(flat_genres)
top_genres = dict(sorted(genre_counts.items(), key=lambda x: x[1], reverse=True)[:10])

#plot (horizontal bars: counts on the x-axis, genre names on the y-axis)
plt.figure(figsize=(8,5))
sns.barplot(x=list(top_genres.values()), y=list(top_genres.keys()), palette='Blues_r')
plt.title('Top 10 Most Common Genres on Amazon Prime')
plt.xlabel('Count')
plt.ylabel('Genre')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart because genre names can be long and this layout displays them clearly. It is also well suited to ranking the top 10 categories visually.

##### 2. What is/are the insight(s) found from the chart?


- Drama is the most common genre on Amazon Prime.

- It is followed by comedy, thriller, action, and romance.

- Genres like fantasy and horror are present but less dominant.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**POSITIVE INSIGHTS**

The platform is targeting emotionally engaging and mainstream genres like drama and comedy which have broad audience appeal.

**NEGATIVE INSIGHTS**

Less content in niche genres like sci-fi or fantasy could make Amazon Prime less attractive to fans of those categories, compared to competitors like Netflix or Hotstar.

#### Chart - 5

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create a figure for the plot
plt.figure(figsize=(8,5))

# Generate a histogram of IMDb scores with KDE curve
sns.histplot(titles_clean['imdb_score'], bins=20, kde=True, color='red')

# Set the title of the plot
plt.title('Distribution of IMDb Scores')

# Set the label for the x-axis
plt.xlabel('IMDb Score')

# Set the label for the y-axis
plt.ylabel('Frequency')

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram with KDE curve to visualize the distribution of IMDb scores. It helps show where most content falls in terms of viewer ratings, and how "spread out" or skewed the ratings are.

##### 2. What is/are the insight(s) found from the chart?

- Most titles have IMDb scores between 5 and 7

- Very few titles are rated below 4 or above 8

- There is a slight peak around 6.5 which seems to be the most common score range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

POSITIVE INSIGHTS

Most content scores 5 or above, which helps maintain audience trust and satisfaction.


NEGATIVE INSIGHTS

Lack of top rated content could mean the platform needs more critically acclaimed or award winning titles to stand out.
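
One caveat when reading this histogram: missing IMDb scores were filled with the column mean during wrangling, so part of any peak near the mean may come from imputed rows rather than real ratings. The sketch below uses made-up scores (not the dataset) to show how the share of imputed values and their effect on the spread could be checked.

```python
import numpy as np
import pandas as pd

# Made-up raw scores before cleaning; NaN marks a missing rating.
raw_scores = pd.Series([5.2, 6.8, np.nan, 7.1, np.nan, 4.9, 6.0, np.nan])

was_missing = raw_scores.isna()
mean_score = raw_scores.mean()          # mean of the observed values only
filled = raw_scores.fillna(mean_score)  # same strategy as the wrangling step

print(f"Imputed share: {was_missing.mean():.1%}")
print(f"Value inserted for every missing score: {mean_score:.2f}")

# Mean filling leaves the mean unchanged but shrinks the spread
print(f"Std of observed scores: {raw_scores.std():.2f}")
print(f"Std after mean fill:    {filled.std():.2f}")
```

In the notebook, `titles['imdb_score']` still holds the unfilled values, so plotting it with missing rows dropped would show the distribution of real ratings only.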

#### Chart - 6

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Ensure release_year is numeric and handle potential errors
titles_clean['release_year'] = pd.to_numeric(titles_clean['release_year'], errors='coerce')

# Drop rows where release_year could not be converted to numeric (NaN)
# .copy() avoids chained-assignment warnings when the column is reassigned below
titles_clean = titles_clean.dropna(subset=['release_year']).copy()

# Convert release_year to integer type
titles_clean['release_year'] = titles_clean['release_year'].astype(int)

# Create a figure for the plot
plt.figure(figsize=(10,5))

# Generate a line plot showing the average IMDb score per release year
# errorbar=None (seaborn 0.12 and later) replaces the deprecated ci=None
sns.lineplot(x='release_year', y='imdb_score', data=titles_clean, errorbar=None, color='purple')

# Set the title of the plot
plt.title('IMDb Score Over Time')

# Set the label for the x-axis
plt.xlabel('Release Year')

# Set the label for the y-axis
plt.ylabel('IMDb Score')

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A line plot is well suited to showing how a continuous variable, here the average IMDb score, changes across release years. It helps reveal trends and unusual years in scores over time.


##### 2. What is/are the insight(s) found from the chart?

- IMDb scores generally stay between 5 and 7 over the years.

- Some highly rated content appears more often after 2015, suggesting quality investments in recent years.

- Very few titles go below a score of 3.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

POSITIVE INSIGHTS

Steady IMDb ratings over time show consistent quality. The appearance of highly rated titles in recent years is a strong signal of content improvement and better audience reception.


NEGATIVE INSIGHTS

If the score distribution remains too average (5-6 range), it might suggest a lack of standout or critically acclaimed content.
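
A yearly average based on only a handful of titles can swing widely, especially for older release years. The sketch below, on made-up data, pairs each yearly mean with the number of titles behind it and keeps only years above a threshold. `MIN_TITLES` is a hypothetical cut-off chosen for illustration.

```python
import pandas as pd

# Made-up stand-in for titles_clean; values are illustrative only.
sample = pd.DataFrame({
    "release_year": [1950, 2019, 2019, 2019, 2020, 2020],
    "imdb_score": [8.1, 6.0, 7.0, 5.0, 6.5, 5.5],
})

# Mean score and number of titles per release year
yearly = (
    sample.groupby("release_year")["imdb_score"]
    .agg(mean_score="mean", n_titles="count")
    .reset_index()
)

MIN_TITLES = 2  # hypothetical threshold; tune it to the real data
reliable = yearly[yearly["n_titles"] >= MIN_TITLES]
print(reliable)
```

Plotting `reliable` instead of every year would keep single-title years, like 1950 in this sample, from dominating the line.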

#### Chart - 7

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Drop rows with missing runtime or title for this specific plot
titles_clean = titles_clean.dropna(subset=['runtime', 'title'])

# Sort by runtime in descending order and get the top 10
top_runtime = titles_clean.sort_values(by='runtime', ascending=False).head(10)

# Create a figure for the plot
plt.figure(figsize=(10, 8))
# Generate a horizontal bar plot of the 10 longest titles (runtime on x, title on y)
sns.barplot(x='runtime', y='title', data=top_runtime, color='yellow')
# Set the title of the plot (no filter on type is applied, so shows are included)
plt.title('Top 10 Titles by Runtime')
# Set the label for the x-axis
plt.xlabel('Runtime (minutes)')
# Set the label for the y-axis
plt.ylabel('Title')
# Adjust layout to prevent labels overlapping
plt.tight_layout()
# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart because it clearly displays and compares the top 10 titles with the longest runtimes, making it easy to interpret large numerical differences.

##### 2. What is/are the insight(s) found from the chart?

Some titles on Amazon Prime have unusually long runtimes, possibly indicating mini-series, documentaries, or extended editions. These may affect viewer engagement or platform strategies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding runtime extremes helps the platform recommend content based on users' watch-time behaviour. However, extremely long runtimes might also lead to viewer drop-off, which could be a negative impact.
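
"Unusually long" can be made concrete with the common 1.5 × IQR rule, the same rule box plots use for their whiskers. This is a minimal sketch on made-up runtimes; the rule is a convention chosen for illustration, not something the dataset prescribes.

```python
import pandas as pd

# Made-up runtimes in minutes standing in for titles_clean['runtime'].
runtimes = pd.Series([88, 92, 95, 101, 104, 110, 118, 540], name="runtime")

# Quartiles and interquartile range
q1, q3 = runtimes.quantile([0.25, 0.75])
iqr = q3 - q1

# Values above this fence are flagged as unusually long
upper_fence = q3 + 1.5 * iqr
outliers = runtimes[runtimes > upper_fence]

print(f"Upper fence: {upper_fence:.1f} minutes")
print(outliers.tolist())
```

Applied to the real column, the flagged rows would show whether the longest titles are genuine extremes or data-entry issues.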

#### Chart - 8

In [None]:
# Keep only director credits
directors_df = credits_clean[credits_clean['role']=='DIRECTOR']
# Count credits per director and keep the 10 most frequent
top_directors = directors_df['name'].value_counts().head(10)
plt.figure(figsize=(10,6))
# Horizontal bars keep long director names readable
sns.barplot(x=top_directors.values, y=top_directors.index, color='violet')
plt.title('Top 10 Directors on Amazon Prime')
plt.xlabel('Count')
plt.ylabel('Director')
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart is ideal for ranking data and clearly visualizing long text labels like names. It lets us easily compare the top directors without overlapping text.

##### 2. What is/are the insight(s) found from the chart?

- We can see which directors appear most frequently on the platform.

- It might reflect deals between Amazon and specific directors.

- If some names dominate, it could influence content style and genre consistency.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

POSITIVE INSIGHT

Helps Amazon Prime identify which directors drive the most content and possibly user engagement.

NEGATIVE INSIGHT

Over reliance on a few directors might lead to lack of diversity in content style.

#### Chart - 9

In [None]:
import ast
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Create a copy of the dataframe and drop rows with missing runtime or genres
runtime_genre = titles_clean.dropna(subset=['runtime', 'genres']).copy()

# Function to safely extract the first genre from the string representation of a list
def extract_first_genre(x):
    try:
        # Safely evaluate the string as a Python literal (list)
        genres_list = ast.literal_eval(x) if isinstance(x, str) else []
    except (ValueError, SyntaxError):
        # Handle cases where the string is not a valid list representation
        return 'Unknown'
    # Return the first genre if a non-empty list was parsed, otherwise 'Unknown'
    return genres_list[0] if isinstance(genres_list, list) and genres_list else 'Unknown'

# Apply the function to the 'genres' column to get the main genre
runtime_genre['main_genre'] = runtime_genre['genres'].apply(extract_first_genre)

# Create a figure for the plot
plt.figure(figsize=(18,6)) # Increased figure size to accommodate more genres

# Draw the boxplot without outlier points
sns.boxplot(
    x='main_genre',
    y='runtime',
    data=runtime_genre,
    palette='viridis',
    showfliers=False  # Hide default outliers
)


# Add plot titles and labels
plt.title('Distribution of Runtime by Genre', fontsize=14, fontweight='bold')
plt.xlabel('Genre', fontsize=12)
plt.ylabel('Runtime (in minutes)', fontsize=12)

# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha='right') # Rotated and aligned for better fit

# Adjust layout
plt.tight_layout()

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a box plot for Chart 9 because it's an excellent way to visualize the distribution of a numerical variable (runtime) across different categorical groups (genres). It effectively shows the median, quartiles, and potential outliers for the runtime within each genre. This helps us understand the typical runtime range for content in various genres and compare these distributions, which is valuable for content planning.



##### 2. What is/are the insight(s) found from the chart?

- Variation in Typical Runtimes: Different genres have different typical runtimes. You can see this by comparing the boxes for each genre; some sit higher or lower than others.
- Spread of Runtimes within Genres: The length of the boxes and the whiskers shows how much the runtimes vary within a specific genre. Some genres might have a narrow range of runtimes, while others have a much wider spread.
- Median Runtime: The line inside each box represents the median runtime for that genre. This gives you a central value to understand the typical length of content in that category.
- Potential Outliers: Although the plot is set to showfliers=False, the whiskers extend to the last data point within 1.5 times the interquartile range. If showfliers were True, you might see individual points beyond the whiskers, indicating titles with unusually long or short runtimes for their genre.
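
The quantities a box plot draws (median, quartiles, and the interquartile range) can also be tabulated per genre. The sketch below is an illustration on made-up data: the `main_genre` column name follows the chart code, while the genre labels and runtimes are assumptions.

```python
import pandas as pd

# Made-up stand-in for runtime_genre; labels and values are illustrative.
sample = pd.DataFrame({
    "main_genre": ["drama", "drama", "drama", "comedy", "comedy", "comedy"],
    "runtime": [90, 110, 130, 80, 90, 100],
})

# describe() returns count, mean, std, min, quartiles and max per group
stats = (
    sample.groupby("main_genre")["runtime"]
    .describe()[["count", "25%", "50%", "75%"]]
    .copy()
)

# Interquartile range = height of the box in the box plot
stats["iqr"] = stats["75%"] - stats["25%"]
print(stats.sort_values("50%", ascending=False))
```

A table like this makes it possible to quote exact medians for each genre and to spot genres with very few titles, whose boxes are less reliable.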

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Informed Content Strategy: By understanding the typical runtimes for successful content in different genres, Amazon Prime can make more informed decisions about acquiring or producing content that aligns with audience expectations for that genre. For example, if viewers expect dramas to be longer, they can focus on acquiring or creating longer-form drama series or movies.

Improved Content Recommendation: Knowing the typical runtime of different genres can help the recommendation engine. If a user enjoys a certain genre, the platform can recommend other titles within that genre that have a similar runtime distribution, potentially increasing user satisfaction and watch time.

Optimized Scheduling and Curation: Understanding runtime variations can help with content scheduling and curation on the platform, ensuring a good mix of short and long-form content within different genres to cater to various viewing preferences and time availability.


Negative Growth Considerations:

Viewer Drop-off with Extremely Long Runtimes: If certain genres have a significant number of titles with extremely long runtimes (outliers are hidden in this plot because of showfliers=False, but wide boxes and long whiskers point to this possibility), it could lead to viewer drop-off. Users might be less likely to start or finish content that requires a very long time commitment, potentially lowering engagement and completion rates for those specific titles or genres.

Mismatch with Audience Preferences: If the typical runtime of content in a popular genre on Amazon Prime significantly deviates from what viewers typically expect or prefer for that genre (based on industry trends or competitor analysis), it could lead to lower viewer satisfaction and engagement.

#### Chart - 10

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import ast

# Function to safely extract the first country from the string representation of a list
def extract_first_country(country_list_str):
    try:
        # Use ast.literal_eval to safely evaluate the string as a Python literal
        country_list = ast.literal_eval(country_list_str)
        # Check if the evaluated result is a list and not empty
        if isinstance(country_list, list) and country_list:
            return country_list[0]
        else:
            # Handle cases where the list is empty or evaluation fails
            return 'Unknown'
    except (ValueError, SyntaxError):
        # Handle cases where the string is not a valid list representation
        return 'Unknown'

# Apply the function to the 'production_countries' column
titles_clean['production_country'] = titles_clean['production_countries'].apply(extract_first_country)

# Calculate the average tmdb_score for each production country
country_score = titles_clean.groupby('production_country')['tmdb_score'].mean().sort_values(ascending=False)

# Get the top 10 countries
top_10_countries = country_score.head(10)

# Create a figure for the plot
plt.figure(figsize=(10,6))
# Generate a bar plot of the top 10 countries by average TMDb score
sns.barplot(x=top_10_countries.index, y=top_10_countries.values, color='blue') # Single color for all bars
# Set the title of the plot
plt.title('Top 10 Production Countries by Average TMDb Score')
# Set the label for the x-axis
plt.xlabel('Production Countries')
# Set the label for the y-axis
plt.ylabel('Average TMDb Score')
# Rotate x-axis labels for better readability
plt.xticks(rotation=45)
# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart helps us see which countries consistently produce high-quality content (based on TMDb score) and reveals geographic trends in quality.


##### 2. What is/are the insight(s) found from the chart?

- Dominican Republic (DO) and Singapore (SG) have the highest average TMDb scores among all countries.

- Japan (JP), Costa Rica (CR), and Palestine (PS) show the lowest average TMDb scores among the ten countries shown.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

POSITIVE - High-scoring countries like SG and DO can be promoted more, or Amazon can explore regional licensing deals.

NEGATIVE - Low average scores from JP, CR, and PS may indicate a mismatch with viewer expectations and a need to improve content curation from those regions.
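
The ranking above uses the mean score alone, so a country with only one or two titles can top the chart. A minimal sketch of one way to guard against that, assuming a toy DataFrame in place of `titles_clean` and a hypothetical cutoff `MIN_TITLES` (neither is part of the original analysis):

```python
import pandas as pd

# Toy stand-in for titles_clean; the values are invented for illustration only.
toy = pd.DataFrame({
    "production_country": ["US", "US", "US", "IN", "IN", "SG"],
    "tmdb_score": [6.0, 7.0, 8.0, 6.0, 7.0, 9.0],
})

MIN_TITLES = 2  # hypothetical cutoff; tune it to the size of the catalog

# Mean score and number of scored titles per country.
country_stats = (
    toy.groupby("production_country")["tmdb_score"]
       .agg(mean_score="mean", n_titles="count")
)

# Keep only countries with enough titles for the mean to be meaningful.
reliable = (
    country_stats[country_stats["n_titles"] >= MIN_TITLES]
    .sort_values("mean_score", ascending=False)
)
print(reliable)
```

In this toy data SG has the highest mean but only one title, so it is filtered out before ranking.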

#### Chart - 11

In [None]:
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import ast

# Assuming your dataframe is called titles_clean
# Step 1: Drop rows with missing genres or years
df_genre_year = titles_clean[['release_year', 'genres']].dropna().copy()

# Step 2: Explode genres if multiple genres are listed using the ast.literal_eval function
def safe_literal_eval(x):
    try:
        return ast.literal_eval(x) if isinstance(x, str) else []
    except (ValueError, SyntaxError, TypeError):
        return []

df_genre_year['genres'] = df_genre_year['genres'].apply(safe_literal_eval)
df_genre_year = df_genre_year.explode('genres')

# Step 3: Group by year and genre
genre_trend = df_genre_year.groupby(['release_year', 'genres']).size().reset_index(name='count')

# Step 4: Plot the top 5 genres over time
top_genres = genre_trend.groupby('genres')['count'].sum().sort_values(ascending=False).head(5).index
filtered = genre_trend[genre_trend['genres'].isin(top_genres)]

plt.figure(figsize=(12,6))
sns.lineplot(data=filtered, x='release_year', y='count', hue='genres', marker='o')
plt.title('Top 5 Genre Trends Over Time on Amazon Prime')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.legend(title='Genre')
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This line plot was selected because it clearly shows how the popularity of different genres has changed over the years. Trends over time are best visualized using line plots, especially when comparing multiple categories (like genres). It helps decision-makers:
-	Spot rising and declining genre trends
-	Understand user preferences across different years
-	Make future content acquisition or production decisions

##### 2. What is/are the insight(s) found from the chart?

1.	Certain genres show consistent growth:
For example, if Thriller or Documentary titles are steadily increasing year over year, it indicates a growing interest in such genres from viewers.

2.	Some genres may have spiked in certain years:
E.g., a sudden increase in Drama around 2020 might be due to pandemic-induced shifts in content creation or user preference.

3.	Decline in certain genres:
Genres like Romance or Comedy might show a flat or declining curve, suggesting they are either saturated or not being prioritized in recent years.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Insight:

The chart helps identify which genres are gaining popularity over time, allowing Amazon to invest more in trending genres and align content with audience interests.

Negative Insight:

Genres showing a decline (e.g., Comedy or Romance) may reflect changing viewer preferences or content fatigue, which can lead to reduced engagement if not addressed.
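
Raw title counts can rise for every genre simply because the catalog grows. A hedged sketch of how each genre's share within its release year could be computed instead, assuming a toy frame shaped like `genre_trend` (the numbers are invented):

```python
import pandas as pd

# Toy counts in the same shape as genre_trend; numbers are invented.
toy_trend = pd.DataFrame({
    "release_year": [2019, 2019, 2020, 2020],
    "genres": ["drama", "comedy", "drama", "comedy"],
    "count": [30, 30, 60, 40],
})

# Total titles per year, aligned row by row with the original frame.
yearly_total = toy_trend.groupby("release_year")["count"].transform("sum")

# Share of each genre within its release year.
toy_trend["share"] = toy_trend["count"] / yearly_total
print(toy_trend)
```

Here the comedy count rises from 30 to 40 while its share of the year's titles falls, which is the kind of relative decline a count-only line plot can hide.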

#### Chart - 12

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import ast

# Assuming your DataFrame is named titles_clean and the column is 'production_countries'
# Function to safely extract and process production countries
def process_countries(country_list_str):
    try:
        # Use ast.literal_eval to safely evaluate the string as a Python literal
        country_list = ast.literal_eval(country_list_str)
        # Return the list of countries if successful, otherwise an empty list
        return country_list if isinstance(country_list, list) else []
    except (ValueError, SyntaxError, TypeError):
        # Handle cases where the string is not a valid list representation
        return []

# Apply the function to the 'production_countries' column and explode the list
all_countries = titles_clean['production_countries'].dropna().apply(process_countries).explode()

# Count the occurrences of each country and get the top 10
country_counts = all_countries.value_counts().head(10)

# Create a figure for the plot
plt.figure(figsize=(12,6))
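# Generate a bar plot of the top 10 countries by content count
# Passing the country as hue avoids the seaborn FutureWarning for palette without hue
sns.barplot(x=country_counts.values, y=country_counts.index, hue=country_counts.index, palette='viridis', legend=False)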
# Set the label for the x-axis
plt.xlabel('Number of Titles')
# Set the label for the y-axis
plt.ylabel('Production Country')
# Set the title of the plot
plt.title('Top 10 Countries by Content Availability on Amazon Prime')
# Adjust layout to prevent labels overlapping
plt.tight_layout()
# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

This chart helps visualize which countries contribute the most content to Amazon Prime, giving insight into geographic diversity and potential content gaps.

##### 2. What is/are the insight(s) found from the chart?

1.	High Contribution from a Few Countries:
	•	Countries like the United States, India, and United Kingdom have the highest number of titles, showing they are the main content producers on the platform.

2.	Dominance of English-Speaking Regions:
	•	Most top contributors are English-speaking or produce a large amount of English content, aligning with the platform’s global audience preferences.

3.	Low Representation of Other Countries:
	•	Many regions, such as those in Africa, Eastern Europe, or Southeast Asia, are underrepresented, which could signal missed opportunities for cultural variety or regional expansion.
  
4.	Strategic Expansion Opportunity:
	•	Amazon Prime could consider expanding licensing and original content production in underrepresented but growing markets like South Korea, Brazil, or the Middle East to diversify its library and attract more subscribers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Insight:
	•	Identifies countries that are major content contributors, which can guide partnership and localization strategies.

Negative Insight:
	•	Countries with low representation may indicate limited global diversity or licensing challenges, potentially alienating regional audiences.
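
The concentration described above can be quantified as each country's share of all country mentions. A minimal sketch, assuming a toy Series in place of the exploded `all_countries` (the values are invented):

```python
import pandas as pd

# Toy stand-in for the exploded all_countries Series; values are invented.
toy_countries = pd.Series(["US"] * 5 + ["IN"] * 3 + ["GB", "JP"])

# Fraction of all country mentions attributed to each country.
share = toy_countries.value_counts(normalize=True)

# Running total, in descending order of share.
cumulative = share.cumsum()
print(cumulative)
```

Reading off the cumulative value at the second or third row gives a single figure for how much of the catalog the top contributors account for.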

#### Chart - 13

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Filter for TV shows with non-null seasons and imdb_score
df_shows_with_seasons = titles_clean[(titles_clean['type'] == 'SHOW') &
                                    (titles_clean['seasons'].notna()) &
                                    (titles_clean['imdb_score'].notna())].copy()

# Convert seasons to integer
df_shows_with_seasons['seasons'] = df_shows_with_seasons['seasons'].astype(int)

# Calculate average imdb_score by number of seasons
average_score_by_seasons = df_shows_with_seasons.groupby('seasons')['imdb_score'].mean().reset_index()

# Plotting
plt.figure(figsize=(12, 6))
# Generate a bar plot of the average IMDb score by number of seasons
# Addressing the FutureWarning by explicitly setting 'x' as the hue
sns.barplot(x='seasons', y='imdb_score', data=average_score_by_seasons, hue='seasons', palette='viridis', legend=False)
# Set the title of the plot
plt.title('Average IMDb Score by Number of Seasons for TV Shows')
# Set the label for the x-axis
plt.xlabel('Number of Seasons')
# Set the label for the y-axis
plt.ylabel('Average IMDb Score')
# Rotate x-axis labels for better readability
plt.xticks(rotation=45)
# Add a grid to the plot for better readability
plt.grid(axis='y')
# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

This chart shows whether longer-running shows (more seasons) tend to be rated better or whether short limited series are more critically acclaimed.
It is helpful for content strategy decisions, e.g. whether to invest in mini-series or long-running shows.

##### 2. What is/are the insight(s) found from the chart?

Shorter series tend to receive slightly better ratings, suggesting that viewers may prefer concise, well-crafted storylines over longer ones, although the difference across season counts is not pronounced.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

POSITIVE - Content creators and platforms can be confident in producing both limited series and long-running shows, knowing that either can perform well in terms of viewer satisfaction.

NEGATIVE - The lack of clear differentiation might make it harder to predict success based solely on the number of seasons.
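
To put a number on how strongly season count and rating move together, one option is a rank correlation. A small sketch that computes Spearman's rho as the Pearson correlation of the ranks, assuming toy data in place of `df_shows_with_seasons` (the values are invented):

```python
import pandas as pd

# Toy stand-in for df_shows_with_seasons; values are invented.
toy_shows = pd.DataFrame({
    "seasons": [1, 1, 2, 3, 5, 8],
    "imdb_score": [8.0, 7.5, 7.2, 7.0, 6.5, 6.0],
})

# Spearman's rho equals the Pearson correlation of the (average) ranks.
rho = toy_shows["seasons"].rank().corr(toy_shows["imdb_score"].rank())
print(round(rho, 2))
```

A value near zero on the real data would support the "no clear differentiation" reading, while a clearly negative value would support the "shorter series rate better" reading.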

#### Chart - 14 - Correlation Heatmap

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Select every numeric column, regardless of integer or float width
numeric_cols = titles_clean.select_dtypes(include='number')
correlation_matrix = numeric_cols.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Numeric Features on Amazon Prime')
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap gives a quick overview of the relationships between numerical variables such as IMDb score, runtime, popularity, and TMDb score. It helps identify which variables are positively or negatively correlated.

##### 2. What is/are the insight(s) found from the chart?

IMDb score and TMDb score show a strong positive correlation, suggesting that if one is high, the other tends to be high too.

IMDb votes are moderately correlated with popularity, meaning titles with more votes often have higher popularity.

Runtime shows weak correlations with most other variables, indicating it is largely independent.
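
The strongest pairs can also be read off programmatically rather than by eye. A hedged sketch, assuming a toy numeric frame in place of `numeric_cols`; the helper name `top_correlations` is hypothetical:

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=5):
    """Return the n column pairs with the largest absolute correlation."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (self-correlation) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data, invented for illustration.
toy_numeric = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})

top = top_correlations(toy_numeric, n=3)
print(top)
```

Sorting by absolute value keeps strong negative correlations in the list alongside strong positive ones.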

#### Chart - 15 - Pair Plot

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

pairplot_cols = ['imdb_score', 'imdb_votes', 'tmdb_popularity', 'runtime']
df_pair = titles_clean[pairplot_cols].dropna()
sns.pairplot(df_pair, diag_kind='kde')
plt.suptitle('Pair Plot of Selected Numeric Features', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot helps visualize relationships between multiple numeric variables at once.
It is useful for identifying patterns, trends, outliers, or potential correlations.

##### 2. What is/are the insight(s) found from the chart?

Titles with more votes tend to have moderately higher scores, but the relationship is not strongly linear, meaning a high vote count does not always indicate quality.

As popularity increases, IMDb votes also tend to increase. This shows that popular titles attract more user engagement.
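
Vote counts and popularity values are often heavily right-skewed, which can squash most points into one corner of a pair plot. A minimal sketch of a log transform applied before plotting, assuming that skew holds for this dataset and using toy values in place of `df_pair`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_pair; values are invented.
toy_titles = pd.DataFrame({
    "imdb_score": [6.1, 7.4, 8.2],
    "imdb_votes": [120, 15_000, 900_000],
    "tmdb_popularity": [1.5, 20.0, 350.0],
})

skewed_cols = ["imdb_votes", "tmdb_popularity"]  # assumed to be right-skewed

# log1p(x) = log(1 + x), which stays defined when a value is zero.
df_log = toy_titles.assign(
    **{col: np.log1p(toy_titles[col]) for col in skewed_cols}
)
print(df_log)

# The transformed frame could then be passed to the same plotting call,
# e.g. sns.pairplot(df_log, diag_kind='kde').
```

The scores are left untouched because they already sit on a bounded scale.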

## **5. Solution to Business Objective**

In [None]:


# 5. Solution to Business Objective

# Based on the EDA, here are potential solutions and insights related to the business objectives:

# Objective 1: Understand Content Distribution
# - We found that movies significantly outnumber TV shows.
# - Content is heavily skewed towards US, UK, and India.
# - Age ratings "16+" and "18+" are most common, with a large "Unknown" category.
# Solution: To diversify and potentially increase retention, Amazon Prime could invest more in TV series, particularly those targeting different demographics or international markets beyond the top contributors. Improving age certification data is crucial for targeted marketing and content curation.

# Objective 2 & 3: Identify Viewer Preferences and Track Popularity Trends
# - Drama, Comedy, and Thriller are the most prevalent genres, indicating broad appeal.
# - Average IMDb scores tend to be in the 5-7 range, with fewer critically acclaimed titles.
# - There is a trend towards increasing content release over recent years.
# - Shorter TV series may have slightly higher average ratings.
# Solution: Focus on acquiring or producing high-quality content (aiming for scores > 7.5) across diverse genres. The trend data suggests continued investment is happening, which is positive. Consider promoting shorter series more prominently, as they seem to resonate well.

# Objective 4: Support Data-Driven Personalization
# - The relationship between IMDb score, votes, and popularity can be used.
# - Genre trends and country origins are strong indicators of user preference.
# Solution: Enhance recommendation engines by incorporating genre popularity trends over time, average scores by genre and country, and correlation insights. If a user watches a high-rated thriller from a specific country, recommend similar content with high popularity and vote count.

# Objective 5: Assist Strategic Planning
# - The dominance of certain countries highlights key markets but also potential for expansion elsewhere.
# - The distribution of genres suggests areas for both reinforcing strengths and exploring gaps.
# Solution: Based on average scores by country, identify regions that consistently produce well-received content for potential partnerships. For genres, continue investing in Drama/Comedy but also explore opportunities in less saturated genres that show high average scores, or invest in higher quality content within lower-scoring popular genres.


#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the exploratory data analysis of Amazon Prime's content, the following actionable insights are proposed to support business growth:

1. Optimize Genre Focus:
By identifying high-performing genres (like Thriller and Drama) with strong IMDb scores and popularity, the platform can prioritize these genres in future acquisitions or productions.

2. Enhance Content Discovery:
Several titles with high IMDb scores had low popularity, suggesting that quality content is going unnoticed. Targeted promotions and recommendations can increase engagement for such titles.

3. Time Based Release Strategy:
Analysis of yearly release trends shows a steady rise in content volume until a drop in recent years. The platform could analyze user demand vs release volume to align better with viewer expectations.

4. Focus on Highly Rated Directors and Production Countries:
Certain directors and countries consistently yield higher IMDb ratings. Partnering with them could increase content quality and user satisfaction.

5. Data-Driven Personalization:
Combining metadata like genre, runtime, and user ratings allows for better personalized recommendations, boosting retention and engagement.

By applying these strategies, Amazon Prime can maximize viewership, retain more users, and ensure a data-informed content roadmap.
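
Recommendation 2 above (high-scoring titles with low popularity) can be sketched as a simple filter. This assumes toy data in place of `titles_clean`, reuses the 7.5 score target mentioned earlier as `SCORE_CUTOFF`, and treats "low popularity" as below the median, which is an illustrative choice rather than the original method:

```python
import pandas as pd

# Toy stand-in for titles_clean; titles and values are invented.
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "imdb_score": [8.4, 7.9, 6.0, 8.1, 7.0, 6.5],
    "tmdb_popularity": [2.0, 50.0, 1.0, 3.0, 40.0, 30.0],
})

SCORE_CUTOFF = 7.5  # score target; adjust as needed
# Assumption: "low popularity" means below the catalog median.
popularity_cutoff = toy_titles["tmdb_popularity"].median()

# Well-rated titles that sit in the less popular half of the catalog.
hidden_gems = toy_titles[
    (toy_titles["imdb_score"] >= SCORE_CUTOFF)
    & (toy_titles["tmdb_popularity"] < popularity_cutoff)
].sort_values("imdb_score", ascending=False)
print(hidden_gems)
```

The resulting list is a candidate pool for targeted promotion; a quantile other than the median would make the filter stricter or looser.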

# **Conclusion**

In this Exploratory Data Analysis project on Amazon Prime content, we analyzed various aspects such as content type distribution, age certifications, genre trends, IMDb ratings, director performance, and more. Using 15 different visualizations, we identified patterns in user ratings, release trends, and content quality. The insights gained helped us recommend genre optimization, better content promotion, and improved recommendation systems. Overall, the analysis supports data-driven decision making for Amazon Prime to enhance content strategy, increase viewership, and boost customer satisfaction.
