<a href="https://colab.research.google.com/github/iamanantalok/Netflix-Content-Clustering/blob/main/Capstone_Project_4_Netflix_Movies_And_TV_Shows_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Netflix Movies and TV Shows Clustering**  



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Name**            - Anant Alok


# **Project Summary -**

With over 83 million subscribers and a global presence spanning 190 countries, Netflix reigns as the world's foremost Internet television network. Daily, users consume more than 125 million hours of TV and film content, including original series, documentaries, and feature films. Netflix grants members the freedom to watch as much as they desire, at their convenience, on virtually any internet-connected screen. Members can pause, play, and resume without interruption or commitment.

This dataset, compiled in 2019 from Flixable, a third-party Netflix search engine, encompasses Netflix's collection of TV shows and movies. Intriguingly, a 2018 report revealed that Netflix's TV show catalog has nearly tripled since 2010, while the number of movies has declined by over 2,000 titles during the same period. This dataset provides ample opportunity to uncover additional insights.

In this project, our focus revolved around solving a text clustering challenge. Our goal was to categorize Netflix movies and shows into clusters based on similarity, ensuring that shows within a cluster shared common characteristics, while those in different clusters diverged.

Our project encompassed the following tasks:

1. **Exploratory Data Analysis (EDA):** We began by addressing missing data and conducting EDA.

2. **Content Analysis by Country:** We sought to understand the types of content available in different countries.

3. **Shift towards TV:** An examination of Netflix's focus on TV content versus movies in recent years.

4. **Clustering by Text-Based Features:** We employed attributes like cast, country, genre, director, rating, and description for clustering. Utilizing TF-IDF vectorization, we tokenized, preprocessed, and vectorized these attributes.

5. **Dimensionality Reduction:** To address dimensionality issues, we applied Principal Component Analysis (PCA).

6. **Cluster Creation:** Using K-Means Clustering and Agglomerative Hierarchical Clustering, we constructed two distinct types of clusters. We determined the optimal number of clusters using methods such as the elbow method, silhouette score, and dendrogram analysis.

7. **Content-Based Recommender System:** We computed pairwise cosine similarity between the vectorized titles and used the resulting similarity matrix to build a content-based recommender system. Users receive ten recommendations based on their viewing preferences.

This comprehensive analysis and recommendation system aim to enhance user satisfaction and subsequently improve Netflix's retention rates.
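
As a rough illustration of steps 4 to 6, the sketch below runs TF-IDF vectorization, PCA, and K-Means with a silhouette check on a tiny made-up corpus. The documents in `toy_docs`, the number of PCA components, and the range of `k` values are assumptions chosen for the example; they are not the settings used on the Netflix data.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus standing in for the combined text features
toy_docs = [
    "space adventure alien planet rocket",
    "alien invasion space battle rocket",
    "romantic comedy wedding love story",
    "love story romance wedding drama",
    "cooking show chef kitchen recipe",
    "chef recipe kitchen baking contest",
]

# Step 4: turn each document into a TF-IDF vector
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(toy_docs).toarray()

# Step 5: reduce dimensionality with PCA (3 components is an arbitrary choice here)
X_reduced = PCA(n_components=3, random_state=0).fit_transform(X)

# Step 6: fit K-Means for several candidate k and record the silhouette score
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_reduced)
    scores[k] = silhouette_score(X_reduced, labels)

# The k with the highest silhouette score is one reasonable candidate
best_k = max(scores, key=scores.get)
print(scores)
print("Candidate number of clusters:", best_k)
```

On real data the silhouette score is usually read alongside the elbow plot and the dendrogram rather than on its own.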

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Netflix, the world's largest online streaming service provider, boasts a staggering 220 million subscribers as of Q2 2022. In order to retain subscribers and provide an improved user experience, it's crucial for Netflix to efficiently categorize the shows available on its platform. This categorization allows for a deeper understanding of show similarities and differences, which, in turn, can be used to offer personalized show recommendations tailored to individual preferences.

The primary objective of this project is to group Netflix shows into clusters, ensuring that shows within the same cluster exhibit similarity while those in different clusters diverge significantly. This clustering endeavor aims to enhance the overall user experience and reduce subscriber churn.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Standard Libraries
import string
import re
import warnings

# Third-party Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import missingno as msno
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

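# Download NLTK data (tokenizer models plus the corpora behind the stopwords and WordNetLemmatizer imports above)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')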

# Disable warnings
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
# Mount Google Drive in Google Colab
from google.colab import drive
drive.mount('/content/drive')

In [None]:
netflix_project = pd.read_csv('/content/drive/MyDrive/Capstone_Project_4-Netflix-Movies-and-shows-clustering/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')
netflix_df = netflix_project.copy()

### Dataset First View

In [None]:
# Concatenate the first five and last five rows of the Netflix dataset
concatenated_df = pd.concat([netflix_df.head(), netflix_df.tail()])

# Display the concatenated DataFrame
print(concatenated_df)


### Dataset Rows & Columns count

In [None]:
# Print the number of rows and columns in the Netflix dataset
print(f'Number of rows: {netflix_df.shape[0]}\nNumber of columns: {netflix_df.shape[1]}')

### Dataset Information

In [None]:
# Display information about the Netflix dataset
netflix_df.info()

#### Duplicate Values

In [None]:
# Calculate and print the number of duplicate rows in the Netflix dataset
duplicate_value = netflix_df.duplicated().sum()
print("The number of duplicate rows in the dataset is =", duplicate_value)

#### Missing Values/Null Values

In [None]:
# Display the count of null values in each column of the Netflix dataset
print("-" * 32)
print("Null value count in each column:")
print("-" * 32)
null_count_by_column = netflix_df.isna().sum()
print(null_count_by_column)
print("-" * 42)

# Calculate and display the percentage of null values in each column
print("Percentage of null values in each column:")
print("-" * 42)
percentage_null_by_column = (null_count_by_column / len(netflix_df)) * 100
print(percentage_null_by_column)
print("-" * 42)

In [None]:
# Visualize missing values using the missingno library
msno.bar(netflix_df, color='green', sort='ascending', figsize=(10, 3), fontsize=15)

# Create a bar plot to show the count of missing values in each column
plt.figure(figsize=(15, 8))
plots = sns.barplot(x=netflix_df.columns, y=netflix_df.isna().sum())
plt.grid(linestyle='--', linewidth=0.3)

# Annotate the bar plot with the count of missing values
for bar in plots.patches:
    plots.annotate(bar.get_height(),
                   (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                   size=12, xytext=(0, 8),
                   textcoords='offset points')

# Display the plot
plt.show()


### What did you know about your dataset?

The "Netflix Movies and TV Shows Clustering" dataset consists of 12 columns, with just one column containing integer data. Notably, there are no duplicate values present, but there are null values in five columns: director, cast, country, date_added, and rating.

This dataset serves as a valuable resource for investigating trends in Netflix's extensive collection of movies and TV shows. Furthermore, it offers an opportunity to construct clustering models that group similar titles together based on common attributes like genre, country of origin, and rating.
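
The null counts and percentages can also be combined into a single table. The helper below is a minimal sketch; `missing_summary` and `toy_df` are hypothetical names introduced for this example and are not part of the notebook.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return null count and null percentage for columns that have missing values."""
    null_count = df.isna().sum()
    summary = pd.DataFrame({
        "null_count": null_count,
        "null_pct": (null_count / len(df) * 100).round(2),
    })
    # Keep only columns with at least one null, most affected first
    return summary[summary["null_count"] > 0].sort_values("null_count", ascending=False)

# Hypothetical toy frame used only to demonstrate the helper
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [None, "X", None, None],
    "rating": ["TV-MA", None, "PG", "R"],
})
print(missing_summary(toy_df))
```

Calling `missing_summary(netflix_df)` would give the same view for the actual dataset.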

## ***2. Understanding Your Variables***

In [None]:
# Display the list of available columns in the Netflix dataset
print(f"Available columns:\n{netflix_df.columns.to_list()}")

In [None]:
# Generate a summary of statistics for the Netflix dataset (including all columns)
summary_stats = netflix_df.describe(include='all').T
print(summary_stats)

### Variables Description

The "**Netflix Movies and TV Shows Clustering**" dataset contains the following variables:

1. **show_id:** A unique identifier for each movie or TV show.
2. **type:** Indicates whether the entry is a movie or a TV show.
3. **title:** The name of the movie or TV show.
4. **director:** The name(s) of the director(s) of the movie or TV show.
5. **cast:** The names of the actors and actresses featured in the movie or TV show.
6. **country:** The country or countries where the movie or TV show was produced.
7. **date_added:** The date when the movie or TV show was added to Netflix.
8. **release_year:** The year when the movie or TV show was originally released.
9. **rating:** The TV rating or movie rating of the movie or TV show.
10. **duration:** The length of the movie or TV show, either in minutes or seasons.
11. **listed_in:** The categories or genres of the movie or TV show.
12. **description:** A brief synopsis or summary of the movie or TV show.

### Check Unique Values for each variable.

In [None]:
# Check and print the number of unique values for each variable (column) in the Netflix dataset
for column in netflix_df.columns.tolist():
    unique_count = netflix_df[column].nunique()
    print(f"No. of unique values in {column} is {unique_count}")

I've narrowed down our focus to key columns: 'type', 'title', 'director', 'cast', 'country', 'rating', 'listed_in', and 'description'.

Here's what's on the horizon:

1. **Clustering:** I'll create a 'cluster' column using K-means and Hierarchical clustering methods to group similar data points.

2. **Recommendations:** We're developing a personalized content-based recommendation system that considers user preferences and viewing history.

This strategy is all about insights and user satisfaction.
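
To make the recommendation idea concrete, here is a minimal sketch of a content-based recommender built on TF-IDF vectors and cosine similarity. The catalog, the titles, and the `recommend` function are hypothetical and exist only for this example; the actual system would be built on the preprocessed Netflix text features.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalog: made-up titles with a short text field each
toy_catalog = pd.DataFrame({
    "title": ["Star Quest", "Galaxy Wars", "Spring Romance", "Wedding Bells", "Kitchen Duel"],
    "text": [
        "space alien rocket adventure",
        "space alien battle rocket",
        "romance wedding love comedy",
        "wedding love family drama",
        "chef kitchen cooking contest",
    ],
})

# Vectorize the text and compute the pairwise cosine-similarity matrix
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_catalog["text"])
similarity = cosine_similarity(tfidf_matrix)

def recommend(title: str, n: int = 10) -> list:
    """Return up to n titles most similar to the given title, excluding the title itself."""
    idx = toy_catalog.index[toy_catalog["title"] == title][0]
    ranked = similarity[idx].argsort()[::-1]          # most similar first
    ranked = [i for i in ranked if i != idx][:n]      # drop the query title
    return toy_catalog["title"].iloc[ranked].tolist()

print(recommend("Star Quest", n=2))
```

This sketch ranks by similarity to a single selected title; it does not model a user's viewing history, which the dataset does not contain.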

## 3. ***Data Wrangling***

### Data Wrangling Code

Handling Missing Data in Each Feature

In [None]:
# Display the count of null values in each column of the Netflix dataset
print("-" * 32)
print("Null value count in each column:")
print("-" * 32)
null_count_by_column = netflix_df.isna().sum()
print(null_count_by_column)
print("-" * 42)

# Calculate and display the percentage of null values in each column
print("Percentage of null values in each column:")
print("-" * 42)
percentage_null_by_column = (null_count_by_column / len(netflix_df)) * 100
print(percentage_null_by_column)
print("-" * 42)

In [None]:
# Count the occurrences of each unique value in the "date_added" column
date_added_counts = netflix_df["date_added"].value_counts()
print(date_added_counts)

In [None]:
# Count the occurrences of each unique value in the "rating" column
rating_counts = netflix_df['rating'].value_counts()
print(rating_counts)

In [None]:
# Count the occurrences of each unique value in the "country" column
country_counts = netflix_df['country'].value_counts()
print(country_counts)

-  Because only a small percentage of 'date_added' and 'rating' values are null, we drop those rows rather than impute them, which avoids introducing guessed values into the clustering features.

-  For 'director' and 'cast', where the share of null values is relatively high and the true values cannot be recovered from the dataset, we replace the missing entries with 'Unknown'.

-  For 'country', only about 6% of the values are missing and the majority of movies/shows originate from the US, so we fill the null values with the mode.

In [None]:
# Impute 'director' and 'cast' columns with "Unknown" for missing values
netflix_df[['director', 'cast']] = netflix_df[['director', 'cast']].fillna("Unknown")

# Impute missing values in the 'country' column with the mode (most common country)
netflix_df['country'] = netflix_df['country'].fillna(netflix_df['country'].mode()[0])

# Drop rows with missing values in 'date_added' and 'rating' columns
netflix_df.dropna(subset=['date_added', 'rating'], inplace=True)


In [None]:
# Display the count of null values in each column after imputation
print("-" * 50)
print("Null value count in each column after imputation:")
print("-" * 50)
null_count_by_column = netflix_df.isna().sum()
print(null_count_by_column)
print("-" * 59)

# Calculate and display the percentage of null values in each column after imputation
print("Percentage of null values in each column after imputation:")
print("-" * 59)
percentage_null_by_column = (null_count_by_column / len(netflix_df)) * 100
print(percentage_null_by_column)
print("-" * 59)


*Country and Listed_in :*

In [None]:
# Count and display the number of movies/TV shows per country
print("Top countries by the number of movies/TV shows:")
print('-'*47)
country_counts = netflix_df['country'].value_counts()
print(country_counts)

# Count and display the number of movies/TV shows per genre
print("\nGenres of shows:")
print('-'*16)
genre_counts = netflix_df['listed_in'].value_counts()
print(genre_counts)

# Find entries with multiple countries listed
multiple_countries = netflix_df[netflix_df['country'].str.contains(',', na=False)]

# Find entries with multiple genres listed
multiple_genres = netflix_df[netflix_df['listed_in'].str.contains(',', na=False)]

# Print movies/TV shows with multiple countries listed
print("\nMovies/TV Shows Filmed in Multiple Countries:")
print('-'*45)
print(multiple_countries[['title', 'country']])

# Print movies/TV shows with multiple genres listed
print("\nMovies/TV Shows with Multiple Genres:")
print('-'*36)
print(multiple_genres[['title', 'listed_in']])


To streamline the analysis, let's focus solely on the primary filming location of each movie or TV show and the primary genre.

In [None]:
# Function to extract the primary value (first value in a comma-separated list)
def extract_primary(value):
    if isinstance(value, str):
        return value.split(',')[0]
    return value

# Apply the function to 'country' and 'listed_in' columns to consider only the primary values
netflix_df['country'] = netflix_df['country'].apply(extract_primary)
netflix_df['listed_in'] = netflix_df['listed_in'].apply(extract_primary)

# Print the DataFrame with simplified values
netflix_df

*Data Handling for date_added Column :*

In [None]:
# Typecast 'date_added' from string to datetime (strip any stray whitespace first so every value parses consistently)
netflix_df["date_added"] = pd.to_datetime(netflix_df['date_added'].str.strip())

# Find the first and last date on which a show was added on Netflix
min_date_added = netflix_df.date_added.min().strftime('%Y-%m-%d')
max_date_added = netflix_df.date_added.max().strftime('%Y-%m-%d')

# Print the range of dates when shows were added on Netflix
print(f"The shows were added on Netflix between {min_date_added} and {max_date_added}.")

# Adding new attributes for day, month, and year of date added
netflix_df['day_added'] = netflix_df['date_added'].dt.day
netflix_df['month_added'] = netflix_df['date_added'].dt.month
netflix_df['year_added'] = netflix_df['date_added'].dt.year

# Dropping the original 'date_added' column
netflix_df.drop('date_added', axis=1, inplace=True)

*Transforming Ratings into Age-Based Content Restrictions :*

In [None]:
# Create a countplot to visualize the age ratings for shows on Netflix
plt.figure(figsize=(10, 5))
sns.countplot(x='rating', data=netflix_df)

# Add a title and axis labels
plt.title("Age Ratings for Shows on Netflix")
plt.xlabel("Age Rating")
plt.ylabel("Number of Shows")

# Calculate the count for the most frequent rating
most_common_rating = netflix_df['rating'].mode()[0]
count_most_common_rating = (netflix_df['rating'] == most_common_rating).sum()

# Add the observation as text on the graph
plt.text(0.5, count_most_common_rating + 10, f"Most Common Rating: {most_common_rating}", ha='center', fontsize=8)

plt.show()



In [None]:
# Display the unique age ratings in the 'rating' column before mapping
unique_ratings_before = netflix_df.rating.unique()
print("Unique ratings before mapping:", unique_ratings_before)

# Define a mapping to change the values in the 'rating' column
rating_map = {
    'TV-MA': 'Adults',
    'R': 'Adults',
    'PG-13': 'Teens',
    'TV-14': 'Young Adults',
    'TV-PG': 'Older Kids',
    'NR': 'Adults',
    'TV-G': 'Kids',
    'TV-Y': 'Kids',
    'TV-Y7': 'Older Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'NC-17': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'UR': 'Adults'
}

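# Replace the values in the 'rating' column using the mapping
# (assign the result back; inplace=True on a column selection is unreliable in newer pandas versions)
netflix_df['rating'] = netflix_df['rating'].replace(rating_map)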

# Display the unique age ratings in the 'rating' column after mapping
unique_ratings_after = netflix_df['rating'].unique()
print("Unique ratings after mapping:", unique_ratings_after)

# Create a countplot to visualize the new age ratings for shows on Netflix
plt.figure(figsize=(10, 5))
sns.countplot(x='rating', data=netflix_df)

plt.show()


**Around 50% of shows on Netflix are produced for an adult audience, followed by young adults, older kids, and kids. Netflix has the fewest shows specifically produced for teenagers compared to other age groups.**
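
Percentage shares like the ones quoted above can be read directly from `value_counts(normalize=True)`. The sketch below uses a small made-up series, `toy_ratings`, whose proportions are an assumption for illustration and not the real Netflix counts.

```python
import pandas as pd

# Hypothetical audience labels (10 titles), not the actual dataset
toy_ratings = pd.Series(
    ["Adults"] * 5 + ["Young Adults"] * 2 + ["Older Kids", "Kids", "Teens"],
    name="rating",
)

# Share of each audience group as a percentage of all titles
rating_share = (toy_ratings.value_counts(normalize=True) * 100).round(1)
print(rating_share)
```

Running the same expression on `netflix_df['rating']` after the mapping would give the exact share for each audience group.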

*Preparing Duration Data :*

In [None]:
# Split the 'duration' column and change its datatype to integer
netflix_df['duration'] = netflix_df['duration'].apply(lambda x: int(x.split()[0]))

# Print the number of seasons for TV shows
tv_show_season_counts = netflix_df[netflix_df['type'] == 'TV Show']['duration'].value_counts()
print("Number of seasons for TV shows:")
print(tv_show_season_counts)

# Print the unique movie lengths in minutes
unique_movie_lengths = netflix_df[netflix_df['type'] == 'Movie']['duration'].unique()
print("Unique movie lengths in minutes:")
print(unique_movie_lengths)

# Check the datatype of the 'duration' column
duration_datatype = netflix_df['duration'].dtype
print("Datatype of duration:", duration_datatype)


### What all manipulations have you done and insights you found?

The dataset has 12 attributes. Two of them, 'date_added' and 'duration', were stored as strings, so we converted 'date_added' to datetime (and split it into day, month, and year) and 'duration' to an integer. The 'type' feature shows that there are more movies than TV shows. We create a word cloud image to identify common words in titles.

For movies and TV shows filmed in multiple countries and with multiple genres, we focus only on the primary country and genre. The most common rating on Netflix is TV-MA, followed by TV-14 and TV-PG. We map the original ratings to age-based audience groups: Adults, Young Adults, Teens, Older Kids, and Kids.

Approximately 50% of Netflix shows target adult audiences, followed by young adults, older kids, and kids. Netflix has fewer shows specifically produced for teenagers compared to other age groups.
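
Converting 'duration' to a plain integer puts minutes (movies) and seasons (TV shows) in the same column. One possible alternative, sketched below on made-up rows, keeps the unit in a separate column. The column names `duration_value` and `duration_unit` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical rows mimicking the two duration formats in the dataset
toy_duration = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
    "duration": ["90 min", "1 Season", "4 Seasons", "123 min"],
})

# Split "<number> <unit>" into two named groups
parts = toy_duration["duration"].str.extract(
    r"^(?P<duration_value>\d+)\s+(?P<duration_unit>\w+)$"
)

toy_duration["duration_value"] = parts["duration_value"].astype(int)
# Normalize "Season"/"Seasons" to "season" and keep "min" as is
toy_duration["duration_unit"] = parts["duration_unit"].str.rstrip("s").str.lower()

print(toy_duration)
```

Keeping the unit makes it harder to accidentally compare a 4-season show with a 4-minute film in later steps.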

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### **Examining the Content Variety on Netflix**

In [None]:
# Create a 1x2 subplots with a specified figure size
fig, ax = plt.subplots(1, 2, figsize=(14, 5))

# Countplot to show the count of 'type' values (Movies and TV Shows)
countplot_graph = sns.countplot(x='type', data=netflix_df, ax=ax[0])
countplot_graph.set_title('Count of Values', size=20)

# Pie chart to show the percentage distribution of 'type'
netflix_df['type'].value_counts().plot(kind='pie', autopct='%1.2f%%', ax=ax[1], figsize=(15, 6), startangle=90)
ax[1].set_title('Percentage Distribution', size=20)

# Ensure tight layout for better visualization
plt.tight_layout()

# Show the combined visualization
plt.show()

##### 1. Why did you pick the specific chart?

The selection of a countplot for displaying the precise counts of "Movies" and "TV Shows" offers a straightforward comparison of the content types within our dataset. Furthermore, we have opted for a pie chart to illustrate the percentage distribution of these content categories, providing insight into the proportion of movies and TV shows in the entire dataset.

##### 2. What is/are the insight(s) found from the chart?

"Movies" dominate Netflix's content offerings, while "TV Shows" represent a smaller share. The pie chart visually emphasizes this distribution, highlighting the significant presence of "Movies" in the overall content.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Content Strategy: The data suggests a significant prevalence of "Movies" in Netflix's content, which can guide content acquisition and production strategies. Focusing on acquiring diverse and popular movie titles can broaden Netflix's appeal.

User Engagement: Recognizing the dominance of "Movies" allows for tailored marketing and engagement efforts. Targeted promotional campaigns for specific movie genres can attract and retain subscribers effectively.

Retention Strategies: Customized recommendations and curated collections centered around movies can enhance user satisfaction and prolong subscription durations.

Negative Growth Insights:

While the visualizations don't directly indicate negative growth, the limited presence of "TV Shows" may present challenges:

Content Diversity: Potential dissatisfaction among subscribers who prefer TV series due to the scarcity or variety of available content.

Market Competition: Competition from platforms with a broader TV show selection may affect Netflix's market share.

Subscription Tier Adjustment: To align with user preferences, Netflix may need to optimize its subscription tiers, potentially impacting perceived value.

#### **Leading Nations in Content Production**

In [None]:
# Grouping and aggregating the data to get the top 10 countries with the most unique titles
df_country = netflix_df.groupby(['country']).agg({'title': 'nunique'}).reset_index().sort_values(by=['title'], ascending=False)[:10]

# Create a bar chart to show the top 10 countries for content creation
plt.figure(figsize=(15, 6))
barplot = sns.barplot(y="country", x='title', data=df_country)
plt.xticks(rotation=60)
plt.title('Top 10 Countries for Content Creation on Netflix')
plt.grid(linestyle='--', linewidth=0.3)

# Show the chart
plt.show()


##### 1. Why did you pick the specific chart?

A bar plot suits the distribution of primary filming countries in Netflix content because the data is categorical. Each country is represented by a horizontal bar, so show counts are easy to compare. The bars are sorted in descending order to highlight the top-contributing countries, and the x-axis values (show counts) can be read directly from the bar lengths.


##### 2. What is/are the insight(s) found from the chart?

The list of top countries with the most shows on Netflix provides valuable insights into the content distribution based on primary filming locations:

1. **Content Production Leaders**: The United States leads with the highest number of shows, signifying its significant contribution to Netflix's content library, likely due to its robust entertainment industry.

2. **Global Diversity**: Countries like India, the United Kingdom, Canada, and Japan also have substantial show counts, reflecting a diverse range of content from around the world, catering to various viewer preferences and cultures.

3. **Language and Localization**: The presence of shows from different countries demonstrates Netflix's commitment to offering content in multiple languages and localizing it for global audiences, attracting a broader subscriber base.

4. **Regional Appeal**: South Korea, Spain, and Mexico are notable contributors, indicating the popularity of content from these regions and a growing interest in international shows.

5. **Audience Segmentation**: The country-wise distribution helps Netflix tailor content for specific regions, catering to local tastes and preferences.

6. **Collaborative Productions**: Co-productions between countries contribute to higher numbers, fostering diversity and engaging content through resource and talent sharing.

7. **Market Penetration**: The number of shows from a country may reflect Netflix's market presence, with higher numbers indicating a stronger focus in certain regions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

1. **Global Market Expansion**: Diverse content attracts a global audience, expanding Netflix's subscriber base.

2. **Localized Content**: Tailoring content for top countries boosts engagement and satisfaction.

3. **Cultural Relevance**: Catering to cultural preferences fosters inclusivity and connection.

4. **Strategic Partnerships**: Successful collaborations strengthen Netflix's market position.

5. **Data-Informed Decisions**: Insights optimize content strategy and resource allocation.

Potential Challenges and Negative Impact:

1. **Market Saturation**: Overreliance on content from a few countries can limit global appeal.

2. **Cultural Misalignment**: Inaccurate adaptation risks backlash and attrition.

3. **Competition and Differentiation**: Lack of diversity may hinder differentiation from competitors.

4. **Localization Complexity**: Managing content for multiple countries strains budgets and efficiency.

5. **Content Diversity**: Focusing on high-show-count countries can overlook smaller contributors.

#### **Content Development Across Years**

In [None]:
# Filter data by type (TV Show or Movie)
tv_show = netflix_df[netflix_df["type"] == "TV Show"]
movie = netflix_df[netflix_df["type"] == "Movie"]

col = "year_added"

# Count content added each year for TV Shows and Movies
# (rename_axis + reset_index(name=...) yields the same two columns across pandas versions)
content_1 = tv_show[col].value_counts().rename_axis(col).reset_index(name="count")
content_1 = content_1.sort_values(col)

content_2 = movie[col].value_counts().rename_axis(col).reset_index(name="count")
content_2 = content_2.sort_values(col)

# Create traces for TV Shows and Movies
trace1 = go.Scatter(x=content_1[col], y=content_1["count"], name="TV Shows", marker=dict(color="#db0000"))
trace2 = go.Scatter(x=content_2[col], y=content_2["count"], name="Movies", marker=dict(color="#564d4d"))

data = [trace1, trace2]
layout = go.Layout(
    title="Content Added Over the Years",
    xaxis=dict(title="Year"),
    yaxis=dict(title="Count"),
    legend=dict(x=0.4, y=1.1, orientation="h")
)
fig = go.Figure(data, layout=layout)

# Display the figure using Plotly
fig.show()


##### 1. Why did you pick the specific chart?

A line chart (here, a Plotly scatter trace with connected points) is well suited to showing how the number of TV shows and movies added changes from year to year, and plotting both series on the same axes makes the two trends easy to compare.

##### 2. What is/are the insight(s) found from the chart?

TV Shows:

- Limited TV show content was added in the early years (2008-2010), possibly signaling the onset of Netflix's original content creation.
- Substantial TV show growth began around 2015, steadily increasing in subsequent years.
- The peak TV show content growth occurred from 2016 to 2020, with 2020 reaching 697 additions.
- A significant decline in TV show additions was noted in 2021 compared to prior years.

Movies:

- Like TV shows, the early years (2008-2010) had relatively few movie additions.
- Movie additions increased notably from 2014, with a significant rise in 2016.
- The growth continued from 2016 to 2019, peaking at 1497 additions in 2019.
- A minor dip occurred in movie additions in 2020, followed by a much lower count in 2021.

Overall Insights:

- Both TV shows and movies experienced growth, notably from around 2015-2016.
- The active years for content additions were 2018 and 2019 for both categories.
- The drop in 2020 additions may be attributed to COVID-19-related production delays.
- The lower 2021 additions for both TV shows and movies could signify a shift in strategy or ongoing external factors.
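
The yearly figures quoted above can be tabulated rather than read off the chart. The sketch below builds a year-by-type count table with `pd.crosstab` and the year-over-year change with `diff`. The rows in `toy_added` are made up for illustration and do not reproduce the real counts.

```python
import pandas as pd

# Hypothetical additions: one row per title with its type and year added
toy_added = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "year_added": [2019, 2019, 2019, 2020, 2020, 2020],
})

# Rows are years, columns are content types, values are title counts
yearly_counts = pd.crosstab(toy_added["year_added"], toy_added["type"])

# Change in additions compared with the previous year
yearly_change = yearly_counts.diff()

print(yearly_counts)
print(yearly_change)
```

Applied to `netflix_df`, the same two calls would show the exact counts behind each peak and dip.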

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

1. **Strategic Decision-Making**: Insights into growth years (e.g., 2018-2019) inform successful strategies, aiding future decision-making.

2. **Content Investment**: Recognizing growth trends allows resource allocation, focusing on popular content types like TV shows.

3. **Subscriber Retention and Attraction**: Consistent growth attracts and retains subscribers, reducing churn rates.

4. **Global Events Impact**: Understanding the pandemic's effect on 2020 content additions helps manage expectations during external disruptions.

Potential Negative Impact:

1. **Decline in Content Additions**: A drop in 2021 could lead to reduced subscriber engagement and satisfaction.

2. **Competition and Variety**: Overgrowth without maintaining content quality and variety might overwhelm users, impacting engagement.

3. **Production Delays**: Delays due to unforeseen events (e.g., pandemic) can reduce content availability, affecting user satisfaction.

4. **Content Quality**: Maintaining high-quality content alongside growth is crucial to avoid subscriber dissatisfaction. Quantity should not compromise quality.

#### **Busiest Month for Netflix Content Releases**

In [None]:
# Create a DataFrame to store month values and counts of content added
# (rename_axis + reset_index(name=...) gives 'month' and 'count' columns across pandas versions)
months_df = netflix_df['month_added'].value_counts().rename_axis('month').reset_index(name='count')

# Create a bar chart using Plotly
fig = px.bar(months_df, x="month", y="count", text_auto=True, color='count', color_continuous_scale=['#db0000', '#564d4d'])
fig.update_layout(
    title={
        'text': 'Month-wise Addition of Movies and Shows to the Platform',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    autosize=False,
    width=1000,
    height=500)

# Show the figure using Plotly
fig.show()


##### 1. Why did you pick the specific chart?

A bar chart is the ideal choice for visualizing this data because it efficiently presents the distribution of content additions across months, enabling straightforward comparison and interpretation of the information.

##### 2. What is/are the insight(s) found from the chart?

High-Volume Months: December (833), October (785), and January (757) are the busiest months for content additions on Netflix, indicating a spike in new content during these periods.

End-of-Year Peaks: December's prominence as the top month aligns with the holiday season, catering to increased streaming activity and diverse content preferences during this festive period.

Release Patterns: The top months for content additions may correspond to strategic release timings, taking advantage of holidays, school breaks, or cultural events to attract viewers.

Mid-Year Dips: May (543) and June (542) exhibit lower content additions, likely due to factors such as production schedules, vacation seasons, or a focus on promoting existing content.

Consistent Activity: Months from March to August (542 to 669) maintain a steady pace of content additions, ensuring a continuous flow of new material throughout the year.

Potential Seasonal Patterns: The clustering of December, October, and January suggests potential seasonal patterns influenced by holidays, changing weather, or viewer behavior.

Varied Peaks: Besides the top months, November (738) also sees a high number of additions, whereas February (472) falls below even the mid-year months.
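
Month numbers are easier to compare when they are shown in calendar order with their names. The sketch below does this with the standard-library `calendar` module on a made-up series, `toy_months`, which is an assumption for illustration only.

```python
import calendar
import pandas as pd

# Hypothetical month numbers for a handful of titles
toy_months = pd.Series([12, 12, 1, 10, 12, 1, 2], name="month_added")

# Count titles per month, sort by calendar order, then swap numbers for names
monthly_counts = (
    toy_months.value_counts()
    .sort_index()
    .rename(index=lambda m: calendar.month_name[m])
)

print(monthly_counts)
print("Busiest month:", monthly_counts.idxmax())
```

The same chain on `netflix_df['month_added']` would list all twelve months from January to December.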

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

1. **Strategic Content Releases**: Insights into peak months (e.g., December, October, January) inform strategic content releases, maximizing viewer engagement and subscription growth.

2. **Subscriber Engagement**: Releasing content during high-engagement months enhances user satisfaction and prolongs subscriptions, improving business outcomes.

3. **Marketing Campaigns**: High-content addition months can be targeted for effective marketing campaigns, attracting new and retaining existing subscribers.

4. **Revenue Generation**: Optimizing content release schedules leads to increased user engagement and revenue through subscription growth and retention.

Potential Negative Impact:

1. **Content Oversaturation**: Overemphasizing high-volume months may lead to content oversaturation, potentially overwhelming users and hindering content engagement.

2. **Neglecting Low-Volume Months**: Focusing too much on peak months might neglect low-volume months, risking dissatisfaction among users with fewer new content options.

3. **Unpredictable Viewer Behavior**: While trends are evident, viewer behavior can be unpredictable, requiring flexibility in content release strategies.

4. **Competition**: Competition for viewer attention during peak months may fragment the audience, potentially affecting subscription numbers.

5. **Quality Over Quantity**: Prioritizing content quantity during high-volume months should not compromise quality, as maintaining viewer satisfaction and loyalty is paramount.

#### **Dominant Days for Netflix Content Releases**

In [None]:
# Create a DataFrame to store day values and counts of content added
# rename_axis/reset_index(name=...) yields 'day' and 'count' columns regardless of pandas version
days_df = netflix_df['day_added'].value_counts().rename_axis('day').reset_index(name='count')

# Create a bar chart using Plotly
fig = px.bar(days_df, x="day", y="count", text_auto=True, color='count', color_continuous_scale=['#db0000', '#564d4d'])
fig.update_layout(
    title={
        'text': 'Prominent Days for Content Additions',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    autosize=False,
    width=1200,
    height=600)

# Show the figure using Plotly
fig.show()


##### 1. Why did you pick the specific chart?

A bar chart is the ideal choice for visualizing this data because it efficiently presents the distribution of content additions across days of the month, enabling straightforward comparison and interpretation of the information.

##### 2. What is/are the insight(s) found from the chart?

Early-Month Concentration: Days 1 to 5 of the month exhibit significantly higher content additions than days 6 and 7. Because `day_added` records the day of the month, these values do not correspond to weekdays or weekends.

Day 1 Peak: The first day of the month (day 1) boasts the highest content additions (2069), potentially indicating a trend of starting the month with new content.

Mid-Month Peaks: Days around the 15th of the month (days 15 and 16) witness relatively high content additions (644 and 240, respectively), hinting at a mid-month content addition trend.

End-of-Month Surges: Content additions spike toward the end of the month (days 30 and 31), likely associated with content releases before month-end.

Lower Counts on Days 6 and 7: Days 6 and 7 of the month record lower content additions (165 and 162, respectively). These are calendar days, not Saturday and Sunday, so the dip does not by itself show a reduced focus on weekends.

Consistency in Numbers: Days in the mid-range (days 18 to 28) maintain consistent content additions, indicating a steady flow of new content throughout the month.

Influence of Viewer Behavior: Higher content additions at the beginning and middle of the month might reflect viewer behavior patterns, such as increased engagement at the start of a new month and around mid-month paydays.
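
Since `day_added` is a day of the month, a weekday-versus-weekend comparison needs the day of the week instead. The following is a minimal sketch, assuming the dataset has a `date_added` column in a format such as "January 1, 2020"; the toy frame and the `weekday_added` and `is_weekend` columns are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for netflix_df (hypothetical rows; date_added format is an assumption)
toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'date_added': ['January 1, 2020', 'January 4, 2020',
                   'March 15, 2019', 'March 16, 2019'],
})

# Parse the date strings into datetimes
toy_df['date_added'] = pd.to_datetime(toy_df['date_added'], format='%B %d, %Y')

# Derive the weekday name and a weekend flag (pandas: Monday=0 ... Sunday=6)
toy_df['weekday_added'] = toy_df['date_added'].dt.day_name()
toy_df['is_weekend'] = toy_df['date_added'].dt.dayofweek >= 5

# Additions per weekday and the share of titles added on a weekend
weekday_counts = toy_df['weekday_added'].value_counts()
weekend_share = toy_df['is_weekend'].mean()

print(weekday_counts)
print("Weekend share:", weekend_share)
```

Only a breakdown like `weekday_counts` on the real data could support or refute a claim about weekday and weekend releases.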

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

1. **Strategic Content Releases**: Insights into high-content addition days (e.g., the first of the month, mid-month, end-of-month) inform strategic content releases, maximizing viewer engagement and subscriptions.

2. **Optimized User Engagement**: Aligning content releases with peak addition days (e.g., the start of the month) enhances user engagement and extends subscription durations.

3. **Viewer Satisfaction**: Consistent content additions throughout the month boost viewer satisfaction, preventing content gaps and retaining user interest.

4. **Content Variety**: Lower addition days (e.g., days 6 and 7) offer opportunities to diversify content and cater to different viewer preferences during those times.

Potential Negative Impact:

1. **Neglecting Low-Volume Days**: Concentrating additions on a few days of the month might leave stretches with little new content, potentially leading to dissatisfaction among users who look for something new in between.

2. **Viewer Fatigue**: Concentrating content additions on specific days might overwhelm viewers and result in oversaturation, reducing engagement.

3. **Content Quality Over Quantity**: Prioritizing specific days for content additions should not compromise content quality, as maintaining viewer satisfaction is crucial.

4. **Neglecting Viewer Diversity**: Viewer behavior varies, and exclusive reliance on insights from specific days might overlook users with different preferences and schedules.

5. **Competition**: Similar patterns of concentrated content releases by other platforms could increase competition for viewer attention, possibly fragmenting the audience.

#### **Genre Rankings: Top and Bottom 10**

In [None]:
# Split the 'listed_in' column to extract genres for analysis
genres = netflix_df['listed_in'].str.split(', ', expand=True).stack()

# Count the occurrences of each genre
genres = genres.value_counts().reset_index().rename(columns={'index': 'genre', 0: 'count'})

# Create subplots for top 10 and last 10 genres
fig, ax = plt.subplots(1, 2, figsize=(15, 6))

# Top 10 genres
top = sns.barplot(x='genre', y='count', data=genres[:10], ax=ax[0])
top.set_title('Top 10 Genres on Netflix', size=20)
plt.setp(top.get_xticklabels(), rotation=90)

# Last 10 genres
bottom = sns.barplot(x='genre', y='count', data=genres[-10:], ax=ax[1])
bottom.set_title('Last 10 Genres on Netflix', size=20)
plt.setp(bottom.get_xticklabels(), rotation=90)

# Ensure tight layout for better visualization
plt.tight_layout()

# Show the charts
plt.show()

##### 1. Why did you pick the specific chart?

The bar plot is apt for visualizing genre distribution in the "listed_in" column due to its suitability for categorical data comparison.

##### 2. What is/are the insight(s) found from the chart?

Top 10 Genres:

1. Diverse Genre Offerings: The top genres span dramas, comedies, documentaries, and action & adventure, reflecting Netflix's diverse content efforts.

2. Mainstream Appeal: Popular genres like dramas, comedies, and documentaries indicate broad viewer appeal.

3. Global Audience: "International TV Shows" suggests a focus on global content diversity.

4. Family and Kids' Content: Inclusion of "Children & Family Movies," "Kids' TV," and "Animation" caters to family audiences.

5. Entertainment Variety: "Stand-Up Comedy" and "Music & Musicals" diversify entertainment choices.

Last 10 Genres:

1. Niche and Specialized Content: "Cult Movies," "TV Horror," and "Sci-Fi & Fantasy" cater to specialized audiences.

2. Limited Appeal: Genres like "LGBTQ Movies," "Sports Movies," and "Spanish-Language TV Shows" may have niche appeal.

3. Highly Specific Content: "TV Sci-Fi & Fantasy" and "TV Horror" target specific genre enthusiasts.

4. Limited Availability: Some genres (e.g., "Sports Movies") indicate limited content offerings.

5. Viewer Diversity: "TV Shows" and "Romantic Movies" serve diverse interests despite lower counts.

6. Content Focus: Lower counts in certain genres suggest resource allocation toward more mainstream options.
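
The genre rankings above rest on counting each genre in a multi-valued column. The following is a minimal sketch of that counting step using `explode`, shown as an alternative to the `split(..., expand=True).stack()` approach in the cell above; the toy frame and the names `genre_counts`, `top_genres`, and `bottom_genres` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for netflix_df (hypothetical rows)
toy_df = pd.DataFrame({
    'listed_in': [
        'Dramas, Comedies',
        'Dramas, International Movies',
        'Comedies',
        'Dramas',
    ],
})

# Split each comma-separated entry into a list, give each genre its own row, then count
genre_counts = (
    toy_df['listed_in']
    .str.split(', ')
    .explode()
    .value_counts()
    .rename_axis('genre')
    .reset_index(name='count')
)

# value_counts sorts in descending order, so head/tail give the top and bottom genres
top_genres = genre_counts.head(2)
bottom_genres = genre_counts.tail(1)

print(genre_counts)
```

Naming the columns through `rename_axis` and `reset_index(name=...)` avoids depending on the default column names, which differ between pandas versions.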

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

1. **Viewer Engagement**: Offering diverse, popular genres can boost viewer engagement, lengthen viewing sessions, and increase subscription renewals.

2. **Global Audience**: Inclusion of "International TV Shows" can expand Netflix's global audience and increase international subscriptions.

3. **Family-Friendly Content**: Genres like "Children & Family Movies" and "Kids' TV" attract families, driving subscriptions and positive word-of-mouth recommendations.

4. **Entertainment Variety**: A mix of genres, including "Stand-Up Comedy" and "Music & Musicals," appeals to diverse entertainment seekers, extending engagement.

5. **Niche Audience Catering**: Lower-count genres may satisfy niche audiences with passionate fan bases, fostering loyalty and positive reviews.

Potential Negative Impact:

1. **Neglected Genres**: Overemphasizing popular genres may neglect those with lower counts, potentially decreasing engagement from viewers who prefer these genres.

2. **Oversaturation**: Focusing heavily on popular genres might lead to oversaturation, overwhelming viewers with content choices and reducing engagement.

3. **Limited Niche Content**: Prioritizing niche genres exclusively could limit viewership and result in negative growth if niche genres lack a sustainable audience.

4. **Quality Over Quantity**: Prioritizing quantity over quality in specific genres may lead to viewer dissatisfaction, negative reviews, and potential churn.

5. **Missed Opportunities**: Neglecting certain genres (e.g., LGBTQ Movies, Spanish-Language TV Shows) may miss opportunities to capture specific viewer segments, potentially leading to negative growth within those segments.

6. **Competition**: Neglecting or not curating genres well might drive viewers to competing platforms with more diverse and tailored genre options.

#### **Netflix's Yearly Content Influx: Shows and Movies**

In [None]:
# Set the figure size using Seaborn
sns.set(rc={'figure.figsize': (15, 7)})

# Create a countplot to visualize the total shows/movies added each year
sns.countplot(x='year_added', data=netflix_df, palette="Set1")

# Set the title and formatting for the plot
plt.title('Total Shows/Movies Added Each Year on Netflix', size=15, fontweight="bold")

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

The countplot is an appropriate choice for visualizing the distribution of shows and movies added to Netflix each year. It effectively displays the frequency of content additions for each year, facilitating straightforward comparison and data interpretation.

##### 2. What is/are the insight(s) found from the chart?

Rapid Growth: Netflix experienced substantial content growth in recent years, with 2019 and 2020 as peak years, adding 2153 and 2009 shows/movies, respectively.

Consistent Expansion: The trend continued with 1685 additions in 2018, indicating steady growth efforts.

Earlier Acceleration: Netflix added 443 shows/movies in 2016 and 1225 in 2017, so additions nearly tripled in a single year before the later peak.

Recent Decline: There was a noticeable drop to 117 additions in 2021, but it's important to consider potential data incompleteness and evolving trends.

Early Years: Content additions were lower in 2014 and earlier, reflecting Netflix's smaller content library during its early years.
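
Whether the 2021 drop reflects incomplete data can be checked by looking at the latest date in the dataset and at how many months each year covers. The following is a minimal sketch, assuming a parsed `date_added` column is available; the toy frame and the names `last_date` and `months_covered` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for netflix_df (hypothetical dates)
toy_df = pd.DataFrame({
    'date_added': pd.to_datetime([
        '2020-01-05', '2020-06-12', '2020-12-30',
        '2021-01-10', '2021-01-16',
    ]),
})
toy_df['year_added'] = toy_df['date_added'].dt.year
toy_df['month_added'] = toy_df['date_added'].dt.month

# Latest addition date present in the data
last_date = toy_df['date_added'].max()

# Number of distinct months with additions in each year
months_covered = toy_df.groupby('year_added')['month_added'].nunique()

print("Last date in data:", last_date.date())
print(months_covered)
```

If the final year covers only a month or two, its total is not comparable with full years and should not be read as a decline.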

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**

1. **Content Growth**: Recent years (2019, 2020) saw rapid content growth, attracting and retaining subscribers.

2. **Subscriber Retention**: Consistent additions in 2017 and 2018 enhance subscriber satisfaction and loyalty.

3. **Competitive Edge**: Regular updates make Netflix competitive by offering diverse content.

4. **Market Expansion**: High additions show Netflix's global market expansion efforts.

5. **Original Strategy**: Content growth aligns with Netflix's original content strategy.

**Potential Negative Impact:**

1. **Interpreting the 2021 Drop**: The low 2021 count may simply reflect incomplete data; if additions really did fall, it could signal a slowdown that risks viewer dissatisfaction.

2. **Subscription Risks**: Reduced additions could lead to subscriber attrition and fewer new sign-ups.

3. **Saturation**: Oversaturation can overwhelm viewers, reducing engagement.

4. **Missed Opportunities**: Lower early-year additions might have missed subscriber base growth.

5. **Competition**: Slower catalog growth makes it harder to compete for viewer attention.

6. **Staleness**: Fewer additions might lead to content staleness and lower engagement.

#### **Rating Trends in Netflix's Catalog**

In [None]:
# Create a figure with a specified size
plt.figure(figsize=(10, 6))

# Create a grouped bar chart to compare ratings and content types
sns.countplot(x="rating", hue="type", data=netflix_df)

# Set the title and labels for the chart
plt.title("Rating vs. Type")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.legend(title="Type")

# Print count values on the bars
ax = plt.gca()
for p in ax.patches:
    height = p.get_height()
    # Skip empty bars (NaN or zero height) so no stray labels are drawn
    if height > 0:
        ax.annotate(f'{int(height)}', (p.get_x() + p.get_width() / 2., height),
                    ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                    textcoords='offset points')

# Show the chart
plt.show()


##### 1. Why did you pick the specific chart?

The countplot is a variant of the bar plot, tailor-made for visualizing the distribution of a categorical variable. It excels at displaying the frequency of various categories and is especially handy for comparing the occurrences of different categories within the dataset.


##### 2. What is/are the insight(s) found from the chart?

"Adults" Rating: There are significantly more movies (2595) than TV shows (1025) with the "Adults" rating, indicating a preference for movies in this category.

"Teens" Rating: Only movies (386) are available with the "Teens" rating, while there are no TV shows, suggesting a focus on movie content for teenagers.

"Young Adults" Rating: The distribution between movies (1272) and TV shows (659) is relatively balanced for the "Young Adults" rating, providing a diverse range of content for this audience.

"Older Kids" and "Kids" Ratings: Both "Older Kids" and "Kids" ratings have more movies than TV shows. "Older Kids" has 852 movies and 478 TV shows, while "Kids" has 267 movies and 246 TV shows, catering to children and older children with a preference for movies.

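The movie and TV show counts quoted above can also be read from a cross-tabulation instead of from the bar labels. The following is a minimal sketch with a toy frame; the rating labels mirror the grouped categories used in this notebook, while the rows and the names `rating_by_type` and `movie_share` are illustrative.

```python
import pandas as pd

# Toy stand-in for netflix_df (hypothetical rows)
toy_df = pd.DataFrame({
    'rating': ['Adults', 'Adults', 'Teens', 'Kids', 'Kids', 'Adults'],
    'type': ['Movie', 'TV Show', 'Movie', 'Movie', 'TV Show', 'Movie'],
})

# Rows are ratings, columns are content types, cells are title counts
rating_by_type = pd.crosstab(toy_df['rating'], toy_df['type'])

# Share of movies within each rating category
movie_share = rating_by_type['Movie'] / rating_by_type.sum(axis=1)

print(rating_by_type)
print(movie_share)
```

A rating with no TV shows appears as an explicit zero in the table, which makes gaps such as the "Teens" category easy to spot.
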
##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

1. Targeted Content Allocation: Knowing which content ratings are dominant in specific content types allows Netflix to allocate resources effectively. Producing more content within popular rating categories can attract and retain subscribers, contributing to growth.

2. Diverse Audience Engagement: Balanced distribution of the "Young Adults" rating between movies and TV shows caters to a diverse audience within that age group, enhancing engagement and satisfaction.

Negative Impact:

1. Missed Opportunity for Teens: The absence of TV shows for the "Teens" rating might lead to missed opportunities to attract younger viewers seeking TV show content. This could result in negative growth among teenagers.

2. Limited Children's TV Shows: The higher count of movies compared to TV shows in the "Older Kids" and "Kids" categories could limit options for younger viewers who prefer TV shows. This might lead to negative growth among families looking for TV show content for children.

#### **TV Shows Breakdown by Season**

In [None]:
# Filter the DataFrame to include only TV Shows
tv_df = netflix_df[netflix_df['type'] == 'TV Show']

# Count the number of TV shows for each duration (number of seasons)
# rename_axis/reset_index(name=...) keeps the column names stable across pandas versions
tv_duration_counts = tv_df['duration'].value_counts().rename_axis('seasons').reset_index(name='count')

# Create a pie chart using Plotly
fig = px.pie(tv_duration_counts, values='count', names='seasons', color_discrete_sequence=px.colors.sequential.Greens)

# Update the layout and appearance of the pie chart
fig.update_layout(title="Season-Wise Distribution of TV Shows")
fig.update_traces(
    textposition='inside',
    textinfo='percent+label',
    textfont_size=20,
    marker=dict(line=dict(color='RebeccaPurple', width=2))
)

# Show the figure using Plotly
fig.show()


##### 1. Why did you pick the specific chart?

A pie chart is an effective choice for visualizing the distribution of TV shows on Netflix based on the number of seasons. It provides a clear and concise representation of the proportion of TV shows in each season category, allowing for easy visual comparison of these proportions.
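
The proportions shown in the pie chart can also be computed directly as a table. The following is a minimal sketch, assuming `duration` holds strings such as "1 Season" or "2 Seasons" for TV shows; the toy frame and the names `toy_tv`, `seasons`, and `season_share` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for tv_df (hypothetical rows; the string format is an assumption)
toy_tv = pd.DataFrame({
    'duration': ['1 Season', '1 Season', '2 Seasons', '1 Season', '3 Seasons'],
})

# Extract the numeric season count from each duration string
toy_tv['seasons'] = toy_tv['duration'].str.extract(r'(\d+)', expand=False).astype(int)

# Share of shows per season count, ordered by number of seasons
season_share = toy_tv['seasons'].value_counts(normalize=True).sort_index()

print(season_share)
```

Sorting by season count rather than by frequency makes it easier to see where the distribution thins out or has gaps.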

##### 2. What is/are the insight(s) found from the chart?

Positive Business Impact:

1. Diverse Content Strategy: Netflix's diverse content strategy, including both single-season and multi-season shows, caters to a wide range of viewer preferences, potentially attracting and retaining a broader audience.

2. Emphasis on Shorter Formats: The dominance of single-season shows aligns with viewer trends favoring concise storytelling, potentially increasing viewer satisfaction and engagement.

3. Variety in Multi-Season Shows: Offering multi-season shows in various ranges provides viewers with a choice of ongoing series and longer story arcs, enhancing viewer loyalty and engagement.

4. Viewer Engagement with Long-Running Shows: The presence of TV shows with many seasons suggests that some series have successfully maintained viewer engagement over time, potentially leading to long-term subscriber retention.

Potential Negative Impact:

1. Production Costs and Sustained Engagement: Longer-running shows require sustained resources and consistent audience interest. A drop in production quality or viewer engagement in these shows could lead to negative growth.

2. Content Quality and Viewer Satisfaction: Maintaining a balance between quantity and quality is crucial. An excessive focus on producing short or long shows at the expense of quality might lead to viewer dissatisfaction and churn.

3. Changing Viewer Preferences: The shifting distribution of show durations could indicate evolving viewer preferences. Failure to adapt content offerings to these changing preferences may result in negative growth in certain segments.

4. Content Gaps: The gaps in the distribution, such as the absence of mid-range shows (7-9 seasons), might signify missed opportunities to capture viewer interest. Neglecting these gaps could lead to potential negative growth within specific audience segments.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Netflix's content strategy should prioritize viewer behavior and feedback analysis to gauge content preferences. Ongoing competitor analysis ensures Netflix remains on-trend and identifies market gaps. Market research and surveys provide insight into viewer demographics, preferences, and emerging trends. Content testing through A/B trials and pilot programs minimizes risk and optimizes content choices.

Strategic partnerships with content creators, studios, and production companies foster exclusive and innovative offerings. Investment in original content maintains a competitive edge. Data analytics guides content decisions, such as renewals, acquisitions, and production. Flexibility and adaptability to changing viewer preferences are essential. Balancing quantity and quality ensures diverse and high-quality content. Global expansion targets diverse audiences, broadening the subscriber base. Personalization, driven by data, enhances engagement and retention.

#### **Netflix's Top 10 Directors**

In [None]:
# Separate the DataFrame into movies and TV shows
df_movies = netflix_df[netflix_df['type'] == 'Movie']
df_tvshows = netflix_df[netflix_df['type'] == 'TV Show']

# Create a figure with a specified size
plt.figure(figsize=(23, 8))

# Loop through movies and TV shows dataframes
for dataframe, content_type, subplot_index in ((df_movies, 'Movies', 0), (df_tvshows, 'TV Shows', 1)):
    # Create a subplot
    plt.subplot(1, 2, subplot_index + 1)

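    # Group by director and count the number of unique titles.
    # The slice skips the first row (assumed here to be the placeholder for missing directors) and keeps the next 10.
    df_director = dataframe.groupby(['director']).agg({'title': 'nunique'}).reset_index().sort_values(by=['title'], ascending=False)[1:11]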

    # Create a barplot for the top 10 directors
    plots = sns.barplot(y="director", x='title', data=df_director, palette='Paired')

    # Set the title and formatting for the plot
    plt.title(f'Directors Appeared in Most of the {content_type}')
    plt.grid(linestyle='--', linewidth=0.3)


# Show the subplots
plt.show()


##### 1. Why did you pick the specific chart?

The horizontal bar chart with subplots provides a visually effective means of comparing the top directors in both TV shows and movies. This chart offers insights into their contributions to Netflix's content library across different categories.

##### 2. What is/are the insight(s) found from the chart?

**Top 10 TV Show Directors:**

1. **Diverse Directorship:** Alastair Fothergill leads with 3 shows directed, while others have 2 each.

2. **Variety in Content:** Multiple directors contribute to Netflix's TV shows, indicating diversity.

3. **Documentaries and Series:** Directors like Alastair Fothergill and Ken Burns excel in documentaries.

4. **Continuity in Series:** The presence of directors such as Shin Won-ho, Iginio Straffi, and Rob Seidenglanz with multiple titles suggests successful series continuations.

**Top 10 Movie Directors:**

1. **Highly Prolific Directors:** Raúl Campos and Jan Suter top the list with 18 movies, followed closely by others.

2. **Comedy and Stand-Up:** Marcus Raboy, Jay Karas, and Jay Chapman specialize in comedy, including stand-up.

3. **Diverse Genres:** Directors like Cathy Garcia-Molina and Youssef Chahine cover various movie genres.

4. **Renowned Filmmakers:** Martin Scorsese and Steven Spielberg appear in the list, indicating that Netflix's catalog carries many films by renowned filmmakers.

5. **Variety in Style:** Diverse director styles contribute to Netflix's broad movie portfolio.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

**Diverse Content Portfolio:** Partnering with directors specializing in various genres can enrich Netflix's content offering, attracting a broader audience and potentially increasing subscribers.

**Renowned Filmmakers:** Carrying films by respected directors like Scorsese and Spielberg enhances Netflix's reputation, drawing subscribers seeking high-quality content.

**Prolific Directors:** Experienced directors ensure a consistent stream of fresh content, maintaining subscriber engagement.

**Catering to Audience Preferences:** A mix of directors covering different genres caters to global audience preferences.

Potential Negative Impact:

**Overemphasis on Quantity:** Prioritizing quantity over quality may lead to content fatigue and compromise viewer satisfaction.

**Lack of Focus:** An extensive director roster could result in a fragmented content strategy, causing viewer confusion.

**Risk of Exclusivity:** Dependence on a few renowned directors may lead to content gaps if they work with other platforms.

**Niche Versus Mainstream:** The director mix can tilt towards niche or mainstream content, necessitating a balance for diverse audience segments.

**Disproportionate Focus:** Dominance of a select group of directors may overshadow emerging talent and innovative storytelling.

**Limited Originality:** Excessive reliance on certain directors might hinder originality in content, resulting in repetitive themes and narratives.

#### **Leading Performers in Television and Film**

In [None]:
# Filter out rows with 'unknown' cast entries
filtered_netflix_df = netflix_df[~netflix_df['cast'].str.contains('unknown', case=False, na=False)]

# Create a figure with two subplots
fig, ax = plt.subplots(1, 2, figsize=(14, 5))

# Separate TV show actors from the 'cast' column
top_TVshows_actor = filtered_netflix_df[filtered_netflix_df['type'] == 'TV Show']['cast'].str.split(', ', expand=True).stack()

# Create a horizontal bar chart for the top 10 TV show actors
a = top_TVshows_actor.value_counts().head(10).plot(kind='barh', ax=ax[0])
a.set_title('Top 10 TV Show Actors', size=15)

# Separate movie actors from the 'cast' column
top_movie_actor = filtered_netflix_df[filtered_netflix_df['type'] == 'Movie']['cast'].str.split(', ', expand=True).stack()

# Create a horizontal bar chart for the top 10 movie actors
b = top_movie_actor.value_counts().head(10).plot(kind='barh', ax=ax[1])
b.set_title('Top 10 Movie Actors', size=15)

# Adjust the layout and spacing of the subplots
plt.tight_layout(pad=1.2, rect=[0, 0, 0.95, 0.95])

# Show the subplots
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart with subplots is an effective visual tool for comparing how often the top actors appear in Netflix TV shows and movies. Long actor names stay readable on the vertical axis, and the two panels keep the categories separate.

##### 2. What is/are the insight(s) found from the chart?

**Top 10 TV Show Actors:**

1. **Japanese Voice Actors:** The presence of Japanese voice actors like Takahiro Sakurai, Yuki Kaji, Daisuke Ono, and Ai Kayano in the top TV show actors indicates a strong representation of anime content on Netflix.

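2. **Anime Catalog Depth:** The frequent appearance of these voice actors suggests that anime makes up a sizable share of the TV catalog. They are credited as the original Japanese voice cast, so the counts reflect the number of anime titles rather than dubbing.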

3. **Frequent Collaborations:** Junichi Suwabe, Yoshimasa Hosoya, and Yuichi Nakamura also feature prominently, implying consistent collaborations or recurring roles in TV shows.

4. **Diverse Genres:** While known for anime, these actors' diverse appearances hint at involvement in various genres beyond animation.

**Top 10 Movie Actors:**

1. **Bollywood Dominance:** Bollywood stars like Shah Rukh Khan, Akshay Kumar, and Amitabh Bachchan dominate the list, reflecting a significant presence of Indian cinema on Netflix.

2. **Indian Cinema Showcase:** High counts for actors like Anupam Kher, Om Puri, Naseeruddin Shah, and Paresh Rawal underscore the platform's focus on showcasing classic and contemporary Indian cinema.

3. **Versatile Actors:** These actors exhibit versatility across various genres in Indian cinema.

4. **Global Appeal of Bollywood:** The popularity of these actors indicates that Bollywood films have a global audience on Netflix.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**

1. **Diverse Audience Appeal:** Featuring Japanese voice actors and Bollywood stars broadens Netflix's global audience appeal.

2. **Boosted Engagement:** High actor appearances deepen viewer engagement and attract new subscribers.

3. **Regional Focus:** Prominent Bollywood actors highlight Netflix's regional content commitment, attracting subscribers in regions keen on Indian cinema.

4. **Effective Collaborations:** Consistent actor appearances suggest fruitful collaborations, resulting in high-quality content and positive audience reception.

**Negative Impact:**

1. **Overreliance on Actors:** Overemphasizing a few actors can lead to viewer fatigue and perceptions of repetitiveness.

2. **Cultural Balance Concerns:** Underrepresentation of specific regions or cultures can dissatisfy certain audience groups.

3. **Competitive Limitations:** Heavy reliance on specific actors may hinder competition with platforms having exclusive content deals.

4. **Long-Term Success:** Platform success depends on factors beyond actor popularity, including content quality and viewer experience.



#### **Annual Count of Movies and TV Shows Released and Added to Netflix**

In [None]:
# Create a figure with two subplots for Movies and TV Shows
plt.figure(figsize=(20, 6))

for i, j, k in ((df_movies, 'Movies', 0), (df_tvshows, 'TV Shows', 1)):
    # Create a subplot
    plt.subplot(1, 2, k + 1)

    # Group data by release year and aggregate unique titles, keeping the 15 most recent years
    df_release_year = i.groupby(['release_year']).agg({'title': 'nunique'}).reset_index().sort_values(
        by=['release_year'], ascending=False)[:15]

    # Create a bar plot
    plots = sns.barplot(x='release_year', y='title', data=df_release_year, palette='husl')

    # Set title and labels
    plt.title(f'{j} released by year')
    plt.ylabel(f"Number of {j} released")

    # Add a grid
    plt.grid(linestyle='--', linewidth=0.3)

    # Annotate each bar with its height
    for bar in plots.patches:
        plots.annotate(bar.get_height(),
                        (bar.get_x() + bar.get_width() / 2,
                         bar.get_height()), ha='center', va='center',
                        size=12, xytext=(0, 8),
                        textcoords='offset points')

# Show the figure with subplots for Movies and TV Shows
plt.show()

# Create another figure with two subplots for Movies and TV Shows
plt.figure(figsize=(20, 6))

for i, j, k in ((df_movies, 'Movies', 0), (df_tvshows, 'TV Shows', 1)):
    # Create a subplot
    plt.subplot(1, 2, k + 1)

    # Group data by year added and aggregate unique titles
    df_year_added = i.groupby(['year_added']).agg({'title': 'nunique'}).reset_index().sort_values(
        by=['year_added'], ascending=False)

    # Create a bar plot
    plots = sns.barplot(x='year_added', y='title', data=df_year_added, palette='husl')

    # Set title and labels
    plt.title(f'{j} added to Netflix by year')
    plt.ylabel(f"Number of {j} added on Netflix")

    # Add a grid
    plt.grid(linestyle='--', linewidth=0.3)

    # Annotate each bar with its height
    for bar in plots.patches:
        plots.annotate(bar.get_height(),
                        (bar.get_x() + bar.get_width() / 2,
                         bar.get_height()), ha='center', va='center',
                        size=12, xytext=(0, 8),
                        textcoords='offset points')

# Show the figure with subplots for Movies and TV Shows
plt.show()


##### 1. Why did you pick the specific chart?

Paired bar plots for movies and TV shows offer a side-by-side view of how many titles were released in each year and how many were added to Netflix in each year. This facilitates straightforward comparisons between the two content types, the recognition of trends over time, and the extraction of insights into Netflix's content strategy.

##### 2. What is/are the insight(s) found from the chart?

Release Year Insights:

- Recent years, especially from 2016 to 2020, witnessed a surge in both TV show and movie releases, likely due to the rise of streaming platforms and original content.
  
Top 15 Years for TV Shows:

- In 2020, the highest number of TV shows were released, closely followed by 2019 and 2018, indicating a trend of increased TV show production in recent years.

Top 15 Years for Movies:

- Similarly, 2017 saw the most movie releases, followed by 2018, 2016, and 2019, pointing to a notable rise in movie production in recent times.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



**Positive Business Impact:**

1. **Increased Content Production:** Netflix's growing content library attracts and retains subscribers with diverse choices.

2. **Original Content Focus:** Emphasizing original content sets Netflix apart and engages subscribers.

3. **Genre Variety:** Diverse genres cater to a broader audience.

4. **Global Reach:** International content expansion broadens Netflix's audience.

5. **Expanding Library:** Consistent content growth maintains user engagement.

**Challenges and Considerations:**

1. **Quality vs. Quantity:** Balancing content quantity with quality is crucial.

2. **Viewer Fatigue:** Overwhelming releases may reduce engagement.

3. **Market Saturation:** Increasing competition poses acquisition challenges.


#### **Countries with the Broadest Netflix Content Selection**

In [None]:
# Create a figure for the first plot
plt.figure(figsize=(18, 5))
plt.grid(linestyle='--', linewidth=0.3)

# Find the top 15 countries with the most content
top_countries = netflix_df['country'].value_counts().index[:15]

# Create a countplot to visualize content distribution by country and type
sns.countplot(x=netflix_df['country'], order=top_countries, hue=netflix_df['type'], palette="Set1")
plt.xticks(rotation=50)
plt.title('Top 15 countries with the most content', fontsize=15, fontweight='bold')
plt.legend(title='Type')

# Create a figure for the second set of plots
plt.figure(figsize=(20, 8))

# Separate the data into Movies and TV Shows
df_movies = netflix_df[netflix_df['type'] == 'Movie']
df_tvshows = netflix_df[netflix_df['type'] == 'TV Show']

for df, content_type in [(df_movies, 'Movies'), (df_tvshows, 'TV Shows')]:
    # Create subplots for Movies and TV Shows
    plt.subplot(1, 2, 1 if content_type == 'Movies' else 2)

    # Count the top 10 countries with the most content of the current type
    df_country = df['country'].value_counts().head(10).reset_index()
    df_country.columns = ['country', 'count']

    # Create a barplot to visualize the top countries for the current type
    plots = sns.barplot(y="country", x='count', data=df_country, palette='Set1')
    plt.title(f'Top 10 countries launching {content_type}', fontsize=15, fontweight='bold')
    plt.grid(linestyle='--', linewidth=0.3)

    # Add labels to the bars
    for i, value in enumerate(df_country['count']):
        plots.text(value + 10, i, str(value), ha='center', va='center')

# Adjust layout for better visualization
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

1. **Top 15 Countries by Content:** A countplot illustrates the content distribution among the top 15 countries on Netflix. It offers a quick comparison of content types in these countries.

2. **Top 10 Countries for Movie Releases:** A bar chart ranks the 10 countries with the most movie titles, making it easy to compare production volume across countries.

3. **Top 10 Countries for TV Show Releases:** A second bar chart ranks the 10 countries with the most TV show titles, highlighting where TV production is most active.

##### 2. What is/are the insight(s) found from the chart?



1. **United States Dominance:** The United States leads in content production, producing over double the number of TV shows compared to the United Kingdom and more than triple the number of movies compared to India.

2. **Indian Growth:** India is the second-largest TV show producer and experiencing rapid content consumption growth due to streaming popularity, a growing middle class, and rising incomes.

3. **South Korean Influence:** South Korea is a major global player known for popular dramas and comedies, with the "Korean Wave" boosting the visibility of its TV shows worldwide.

4. **Canadian Impact:** Canada is a significant TV show producer, hosting popular series like "Schitt's Creek" and "The Handmaid's Tale," with government support attracting foreign investment and job creation.

5. **Chinese Media Control:** China, with a massive population, has a rising appetite for content but faces media control limitations, resulting in limited access to foreign TV shows and movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. **Market Expansion:** Identify new markets like India with growing content consumption for targeted production.

2. **Global Strategies:** Learn from the success of South Korean TV shows to create content with international appeal.

3. **Strategic Partnerships:** Collaborate with streaming services to broaden content distribution.

**Challenges and Considerations:**

1. **Chinese Media Control:** Stringent media control in China may hinder content distribution opportunities.

2. **Competitive Landscape:** Increasing global competition may pose challenges for smaller production companies in the industry.

These insights can guide strategic decisions for content production companies, opening up opportunities for growth while being mindful of potential challenges.

#### **Netflix's Trending Genres: Viewer Preferences**

In [None]:
# Create a figure for the first plot
plt.figure(figsize=(23, 8))

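# Group the full dataset by listed_in (genres), count unique titles, and keep the 10 largest groups
# (netflix_df is used here; `df` would still point at the TV-show subset left over from the previous loop)
df_genre = netflix_df.groupby(['listed_in']).agg({'title': 'nunique'}).reset_index().sort_values(by=['title'], ascending=False)[:10]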

# Create a bar plot to visualize the most popular genres
plots = sns.barplot(y="listed_in", x='title', data=df_genre)
plt.title('Most popular genres on Netflix')
plt.grid(linestyle='--', linewidth=0.3)

# Add labels to the bars
plots.bar_label(plots.containers[0])

# Show the first plot
plt.show()

# Create a figure for the second set of plots
plt.figure(figsize=(23, 8))

for i, j, k in ((df_movies, 'Movies', 0), (df_tvshows, 'TV Shows', 1)):
    # Create subplots for Movies and TV Shows
    plt.subplot(1, 2, k + 1)

    # Group data by listed_in (genres) and aggregate unique titles, then sort by popularity
    df_genre = i.groupby(['listed_in']).agg({'title': 'nunique'}).reset_index().sort_values(by=['title'], ascending=False)[:10]

    # Create a bar plot to visualize the most popular genres for the current type
    plots = sns.barplot(y="listed_in", x='title', data=df_genre, palette='Set1')
    plt.title(f'Most popular genres of {j}')
    plt.grid(linestyle='--', linewidth=0.3)

    # Add labels to the bars
    plots.bar_label(plots.containers[0])

    # Rotate y-axis labels for better readability
    plt.yticks(rotation=45)

# Adjust layout for better visualization of subplots
plt.tight_layout()

# Show the second set of plots
plt.show()


##### 1. Why did you pick the specific chart?

Bar plots excel at visualizing categorical data, making them ideal for showcasing Netflix's most popular genres.

##### 2. What is/are the insight(s) found from the chart?

1. **International TV Shows:** The top genre for TV shows indicates a global appetite for content from diverse cultures, possibly due to increasing globalization and the availability of international content on streaming platforms.

2. **Crime TV Shows:** Ranking second, crime TV shows offer universal appeal with their suspenseful and exciting nature, drawing viewers of all ages.

3. **Kids' TV:** The popularity of kids' TV shows is unsurprising, as they engage young audiences through relatable stories and educational content.

4. **British TV Shows:** British TV's high quality, originality, and awards recognition make it a strong fourth, attracting viewers seeking exceptional content.

5. **Documentaries:** In fifth place, the interest in documentaries reflects a desire for informative, educational, and entertaining content, fostering a better understanding of the world.

#### **Monthly Netflix Content Additions: Movies and TV Shows**

In [None]:
# Create a figure for the set of plots
plt.figure(figsize=(23, 8))

for i, j, k in ((df_movies, 'Movies', 0), (df_tvshows, 'TV Shows', 1)):
    # Create subplots for Movies and TV Shows
    plt.subplot(1, 2, k + 1)

    # Group data by the month_added and aggregate unique titles
    df_month = i.groupby(['month_added']).agg({'title': 'nunique'}).reset_index().sort_values(by=['month_added'], ascending=False)

    # Create a bar plot to visualize the number of content added by month
    plots = sns.barplot(x='month_added', y='title', data=df_month, palette='husl')
    plt.title(f'{j} added to Netflix by month')
    plt.ylabel(f"Number of {j} added on Netflix")
    plt.grid(linestyle='--', linewidth=0.3)

    # Annotate each bar with its height
    for bar in plots.patches:
        plots.annotate(bar.get_height(),
                       (bar.get_x() + bar.get_width() / 2,
                        bar.get_height()), ha='center', va='center',
                       size=12, xytext=(0, 8),
                       textcoords='offset points')

# Show the set of plots
plt.show()


##### 1. Why did you pick the specific chart?

We created this graph to identify the months with the highest and lowest content additions for movies and TV shows.

##### 2. What is/are the insight(s) found from the chart?

- **TV Shows**: October, November, and December are the top months for additions.
- **Movies**: January, October, and December see the most additions.
- **Lowest Activity**: February experiences the lowest additions for both movies and TV shows on Netflix.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**Hypothetical Statement 1:**

Null Hypothesis: There is no substantial disparity in the rating proportions between drama and comedy movies available on Netflix.

Alternative Hypothesis: A significant difference exists in the rating proportions between drama and comedy movies on Netflix.

**Hypothetical Statement 2:**

Null Hypothesis: The average duration of TV shows introduced in 2020 on Netflix does not differ significantly from the average duration of TV shows introduced in 2021.

Alternative Hypothesis: The average duration of TV shows added in 2020 on Netflix significantly differs from the average duration of TV shows added in 2021.

**Hypothetical Statement 3:**

Null Hypothesis: The proportion of American-produced TV shows added to Netflix is not substantially distinct from the proportion of American-produced movies added to Netflix.

Alternative Hypothesis: There is a significant difference in the proportion of American-produced TV shows and movies added to Netflix.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: There is no substantial disparity in the rating proportions between drama and comedy movies available on Netflix.

Alternative Hypothesis: A significant difference exists in the rating proportions between drama and comedy movies on Netflix.

#### 2. Perform an appropriate statistical test.

In [None]:
# Import the necessary library for the z-test
from statsmodels.stats.proportion import proportions_ztest

# Flag titles whose genre list mentions Dramas or Comedies
# (the substring match also catches 'TV Dramas' and 'TV Comedies')
is_drama = netflix_df['listed_in'].str.contains('Dramas')
is_comedy = netflix_df['listed_in'].str.contains('Comedies')

# Subset the data to titles listed under at least one of the two genres
subset = netflix_df[is_drama | is_comedy]

# Calculate the proportion of dramas and comedies in the subset
drama_prop = is_drama.sum() / len(subset)
comedy_prop = is_comedy.sum() / len(subset)

# Set up the parameters for the z-test
# Counts come straight from the boolean masks; int(proportion * n) can lose a title to float rounding
count = [int(is_drama.sum()), int(is_comedy.sum())]  # Number of successes (dramas and comedies)
nobs = [len(subset), len(subset)]  # Number of observations (total count)
alternative = 'two-sided'  # Two-sided test to compare proportions

# Perform the z-test
z_stat, p_value = proportions_ztest(count=count, nobs=nobs, alternative=alternative)

# Print the z-statistic and p-value
print('z-statistic: ', z_stat)
print('p-value: ', p_value)

# Set the significance level (alpha)
alpha = 0.05

# Print the results of the z-test based on the p-value and significance level
if p_value < alpha:
    print(f"Reject the null hypothesis.")
else:
    print(f"Fail to reject the null hypothesis.")


##### Which statistical test have you done to obtain P-Value?

I have conducted a statistical analysis using the z-test for proportions to calculate the p-value.

##### Why did you choose the specific statistical test?

The z-test for proportions was selected because it is suited to comparing a proportion between two groups (here, titles listed under Dramas and titles listed under Comedies). It assesses whether the observed difference in proportions is statistically significant by testing the null hypothesis that the two proportions are equal. The test treats the two groups as independent samples, so titles listed under both genres make the result approximate.
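To make the test statistic less of a black box, here is a minimal sketch of the textbook pooled two-proportion z-test using only the standard library. The function name `two_proportion_ztest` and the counts are made up for illustration; this is not the statsmodels implementation, only the formula it is commonly described with.

```python
import math

def two_proportion_ztest(count1, n1, count2, n2):
    """Two-sided z-test for the difference between two proportions (pooled variance)."""
    p1, p2 = count1 / n1, count2 / n2
    # Pooled proportion under the null hypothesis p1 == p2
    p_pool = (count1 + count2) / (n1 + n2)
    # Standard error of the difference under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts, purely for illustration: 60 of 100 versus 40 of 100
z_demo, p_demo = two_proportion_ztest(60, 100, 40, 100)
print(f"z = {z_demo:.4f}, p = {p_demo:.4f}")
```

A large absolute z (and a p-value below the chosen alpha) indicates that the gap between the two proportions is unlikely under the null hypothesis.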

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: The average duration of TV shows introduced in 2020 on Netflix does not differ significantly from the average duration of TV shows introduced in 2021.

Alternative Hypothesis: The average duration of TV shows added in 2020 on Netflix significantly differs from the average duration of TV shows added in 2021.

#### 2. Perform an appropriate statistical test.

In [None]:
# Import the necessary library for the t-test
from scipy.stats import ttest_ind

# Create separate dataframes for TV shows added to Netflix in 2020 and 2021
# (the hypothesis is about the year a title was added, so filter on 'year_added' rather than 'release_year')
tv_2020 = netflix_df[(netflix_df['type'] == 'TV Show') & (netflix_df['year_added'] == 2020)]
tv_2021 = netflix_df[(netflix_df['type'] == 'TV Show') & (netflix_df['year_added'] == 2021)]

# Perform a two-sample t-test to compare the average durations
# equal_var=False assumes unequal variances between the two groups
t, p = ttest_ind(tv_2020['duration'].astype(int), tv_2021['duration'].astype(int), equal_var=False)
print('t-value: ', t)
print('p-value: ', p)

# Print the results of the t-test
if p < 0.05:
    print('Reject null hypothesis.')
    print('The average duration of TV shows added in the year 2020 on Netflix is significantly different from the average duration of TV shows added in the year 2021.')
else:
    print('Failed to reject null hypothesis.')
    print('The average duration of TV shows added in the year 2020 on Netflix is not significantly different from the average duration of TV shows added in the year 2021.')

##### Which statistical test have you done to obtain P-Value?

The P-value was obtained using a two-sample t-test.

##### Why did you choose the specific statistical test?

The two-sample t-test was selected to compare the means of two independent samples (TV shows added in 2020 vs. 2021). We do not assume equal variances (`equal_var=False`, i.e. Welch's t-test), since the two samples are unlikely to share the same variance.
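As a rough illustration of what the unequal-variance (Welch) form computes, the sketch below derives the t statistic and the Welch-Satterthwaite degrees of freedom by hand. The helper name `welch_t` and the sample values are hypothetical; the p-value step is left to `scipy.stats.ttest_ind`, as in the cell above.

```python
import math
import statistics

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(a), len(b)
    # Variance of each sample mean (sample variance divided by sample size)
    v_a = statistics.variance(a) / n_a
    v_b = statistics.variance(b) / n_b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(v_a + v_b)
    # Degrees of freedom shrink when the two variances are very different
    dof = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t, dof

# Made-up season counts, purely for illustration
t_demo, dof_demo = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f"t = {t_demo:.4f}, dof = {dof_demo:.2f}")
```

Unlike the pooled t-test, the degrees of freedom here are generally not an integer, which is expected.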

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: The proportion of American-produced TV shows added to Netflix is not substantially distinct from the proportion of American-produced movies added to Netflix.

Alternative Hypothesis: There is a significant difference in the proportion of American-produced TV shows and movies added to Netflix.

#### 2. Perform an appropriate statistical test.

In [None]:
# Import the necessary library for the z-test
from statsmodels.stats.proportion import proportions_ztest

# Count the TV shows and movies whose country field includes the United States
us_tv_count = int(df_tvshows['country'].str.contains('United States').sum())
us_movie_count = int(df_movies['country'].str.contains('United States').sum())

# Calculate the corresponding proportions
tv_proportion = us_tv_count / len(df_tvshows)
movie_proportion = us_movie_count / len(df_movies)

# Set up the parameters for the z-test
# Counts are passed directly; rebuilding them as int(proportion * n) can lose a title to float rounding
count = [us_tv_count, us_movie_count]  # Number of successes (TV shows and movies produced in the US)
nobs = [len(df_tvshows), len(df_movies)]  # Number of observations (total count of TV shows and movies)
alternative = 'two-sided'  # Two-sided test to compare proportions

# Perform the z-test
z_stat, p_value = proportions_ztest(count=count, nobs=nobs, alternative=alternative)

# Print the z-statistic and p-value
print('z-statistic: ', z_stat)
print('p-value: ', p_value)

# Set the significance level (alpha)
alpha = 0.05

# Print the results of the z-test based on the p-value and significance level
if p_value < alpha:
    print(f"Reject the null hypothesis.")
else:
    print(f"Fail to reject the null hypothesis.")


##### Which statistical test have you done to obtain P-Value?

The P-value was obtained using a two-sample proportion test.

##### Why did you choose the specific statistical test?

We selected this test because it's suitable for comparing two proportions and helps us assess whether the observed difference is statistically significant or a result of chance.
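One practical detail when preparing the `count` argument for a proportion test: recovering a count as `int(proportion * n)` can silently drop one observation because of floating-point rounding, so summing the boolean mask directly is the safer route. A small sketch with made-up numbers (`n_titles` and `n_hits` are hypothetical):

```python
n_titles = 100   # hypothetical group size
n_hits = 29      # hypothetical number of matching titles

proportion = n_hits / n_titles

# int() truncates toward zero, so a product just below 29 becomes 28
rebuilt = int(proportion * n_titles)

# round() recovers the intended count, but using n_hits directly avoids the issue entirely
rounded = round(proportion * n_titles)

print(n_hits, rebuilt, rounded)
```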

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Outliers

In [None]:
# Create a subplot with two graphs side by side
fig, ax = plt.subplots(1, 2, figsize=(15, 5))

# Plot a histogram with a density curve for the 'release_year' column
# (histplot replaces the deprecated distplot)
sns.histplot(netflix_df['release_year'], kde=True, ax=ax[0])
ax[0].set_title('Distribution Plot for Release Year')

# Plot a box plot to visualize outliers in the 'release_year' column only
sns.boxplot(x=netflix_df['release_year'], ax=ax[1])
ax[1].set_title('Box Plot for Release Year')

# Display the subplots
plt.show()


##### What all outlier treatment techniques have you used and why did you use those techniques?

Since nearly all the data is in textual format except for the release year, and the data needed for clustering/modeling is in textual form, outlier handling is unnecessary.

### 2. Textual Data Preprocessing

#### 1. Textual Columns

In [None]:
# Drop unnecessary columns from the DataFrame
columns_to_drop = ['month_added', 'day_added', 'year_added']
netflix_df.drop(columns=columns_to_drop, inplace=True)

# Create a new feature "content_detail" by combining values from other textual attributes
netflix_df["content_detail"] = netflix_df["cast"] + " " + netflix_df["director"] + " " + netflix_df["listed_in"] + " " + netflix_df["type"] + " " + netflix_df["rating"] + " " + netflix_df["country"] + " " + netflix_df["description"]

# Check the DataFrame to see the changes
netflix_df.head(5)

#### 2. Lower Casing

In [None]:
# Convert the "content_detail" column values to lowercase
netflix_df['content_detail'] = netflix_df['content_detail'].str.lower()

# Checking the result for a specific row (e.g., row 281)
content_detail_281 = netflix_df.iloc[281]['content_detail']
content_detail_281

#### 3. Removing Punctuations

In [None]:
# The 'string' module provides a string containing all ASCII punctuation marks.
import string

def remove_punctuations(text):
    '''Remove all punctuation marks from the given sentence.'''

    # Build a translation table that maps each punctuation mark to None (deletes it).
    translator = str.maketrans('', '', string.punctuation)

    # Return the input text with punctuation marks removed.
    return text.translate(translator)


In [None]:
# Applying the remove_punctuations function to the "content_detail" column
netflix_df['content_detail'] = netflix_df['content_detail'].apply(remove_punctuations)

# Checking the result for a specific row (e.g., row 281)
content_detail_281 = netflix_df.iloc[281]['content_detail']
content_detail_281

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
def remove_url_and_numbers(text):
    """
    Remove URLs and numbers from the given text.

    Args:
        text (str): The input text from which URLs and numbers will be removed.

    Returns:
        str: The text with URLs and numbers removed.
    """
    # Regular expression pattern to match URLs
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    # Remove URLs by replacing them with an empty string
    text = re.sub(url_pattern, '', text)

    # Remove digits and non-alphabet characters by replacing them with spaces
    text = re.sub('[^a-zA-Z]', ' ', text)

    return text


In [None]:
# Apply the 'remove_url_and_numbers' function to the 'content_detail' column
netflix_df['content_detail'] = netflix_df['content_detail'].apply(remove_url_and_numbers)

# Check the result for a specific observation (e.g., row 281)
specific_observation = netflix_df.iloc[281]['content_detail']

# Print or inspect the result
print(specific_observation)

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Download the NLTK English stopwords corpus
nltk.download('stopwords')
from nltk.corpus import stopwords

# Create a set of English stopwords
stop_words = set(stopwords.words('english'))

# Display the set of stopwords
print(stop_words)


In [None]:
def remove_stopwords_and_whitespaces(text):
    """
    Remove stopwords and extra whitespaces from the given sentence.

    Args:
        text (str): The input sentence.

    Returns:
        str: The sentence with stopwords removed and extra whitespaces reduced.
    """
    # Split the sentence on whitespace and filter out stopwords
    # (uses the precomputed `stop_words` set; calling stopwords.words() for every word is very slow)
    words = [word for word in text.split() if word.lower() not in stop_words]

    # Join the filtered words back into a sentence with a single space separator
    cleaned_text = " ".join(words)

    # Remove extra whitespaces
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()

    return cleaned_text


In [None]:
# Apply the 'remove_stopwords_and_whitespaces' function to the 'content_detail' column
netflix_df['content_detail'] = netflix_df['content_detail'].apply(remove_stopwords_and_whitespaces)

# Check the result for a specific observation (e.g., row 281)
specific_observation = netflix_df.iloc[281]['content_detail']

# Print or inspect the result
print(specific_observation)

#### 6. Tokenization

In [None]:
# Download the NLTK 'punkt' dataset for tokenization
nltk.download('punkt')

# Tokenize the 'content_detail' column
netflix_df['content_detail'] = netflix_df['content_detail'].apply(nltk.word_tokenize)

# Check the result for a specific observation (e.g., row 281)
specific_observation = netflix_df.iloc[281]['content_detail']

# Print or inspect the result
print(specific_observation)

#### 7. Text Normalization

In [None]:
# Import the WordNetLemmatizer from the nltk.stem module
from nltk.stem import WordNetLemmatizer

# Create an instance of the WordNetLemmatizer
wordnet = WordNetLemmatizer()

In [None]:
def lemmatize_sentence(tokens):
    """
    Lemmatize a tokenized sentence.

    Args:
        tokens (list of str): The word tokens of the sentence.

    Returns:
        str: The lemmatized words joined into a single sentence.
    """
    # Lemmatize each token (WordNetLemmatizer treats words as nouns unless a POS is given)
    text = [wordnet.lemmatize(word) for word in tokens]

    # Join the lemmatized words back into a sentence with a space separator
    text = " ".join(text)

    return text


In [None]:
# Download the NLTK data: 'wordnet' for lemmatization, 'averaged_perceptron_tagger' for the later POS tagging step
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Apply the 'lemmatize_sentence' function to the 'content_detail' column
netflix_df['content_detail'] = netflix_df['content_detail'].apply(lemmatize_sentence)

# Check the result for a specific observation (e.g., row 281)
specific_observation = netflix_df.iloc[281]['content_detail']

# Print or inspect the result
print(specific_observation)

##### Which text normalization technique have you used and why?

I have chosen Lemmatization over Stemming for our project for the following reasons:

1. Enhanced Accuracy: Unlike Stemming, which simply trims word suffixes, Lemmatization considers word meanings and context, resulting in a more precise base form.

2. Handling Varied Inflections: Lemmatization can manage diverse inflections such as plurals, verb tenses, and comparisons, making it valuable for natural language processing tasks.

3. Creation of Real Words: Lemmatization consistently generates valid dictionary words, simplifying the interpretation of text analysis outcomes.

4. Improved Text Comprehension: By reducing words to their base forms, Lemmatization aids in better comprehension of sentence context and meaning.

5. Multilingual Support: Lemmatizers are available for many languages, so the approach carries over to non-English text, although the WordNet lemmatizer used here targets English.
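The "real words" point in the list above can be illustrated with a deliberately simplified toy. The suffix rules and the lookup table below are hypothetical stand-ins, not NLTK's stemming or lemmatization algorithms; they only show why suffix stripping can yield non-words while dictionary lookup does not.

```python
def crude_stem(word):
    """Toy stemmer: strip the first matching suffix, with no dictionary check."""
    for suffix in ("ies", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-written lemma table standing in for a real dictionary such as WordNet
TOY_LEMMAS = {"studies": "study", "stories": "story", "families": "family"}

def toy_lemmatize(word):
    """Toy lemmatizer: look the word up and fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ("studies", "stories", "families"):
    print(w, "->", crude_stem(w), "|", toy_lemmatize(w))
```

The stemmed forms are truncated fragments, whereas the lemmatized forms remain readable words, which helps when inspecting the top terms of a cluster.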

#### 8. Part of speech tagging

In [None]:
# Tokenize the 'content_detail' column into words and apply POS tagging
netflix_df['pos_tags'] = netflix_df['content_detail'].apply(nltk.word_tokenize).apply(nltk.pos_tag)

# Display the first 5 rows of the DataFrame to check the result
print(netflix_df.head(5))


#### 9. Text Vectorization

In [None]:
# Import the necessary library
from sklearn.feature_extraction.text import TfidfVectorizer

# Create an instance of the TF-IDF vectorizer with a maximum of 30000 features to avoid memory issues
tfidf_vectorizer = TfidfVectorizer(max_features=30000)

# Fit the TF-IDF vectorizer on the 'content_detail' column of the DataFrame
x = tfidf_vectorizer.fit_transform(netflix_df['content_detail'])

# Print the shape of the resulting document-term matrix
print(x.shape)

##### Which text vectorization technique have you used and why?

I opted for TF-IDF vectorization instead of Bag of Words because it weights each word by its importance within a document relative to the whole corpus. TF-IDF down-weights words that appear in most titles and gives greater weight to words that are rare across the corpus, so distinctive terms contribute more to the representation.
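To show how the weighting behaves, here is a small hand-rolled sketch on a made-up three-document corpus. It uses the smoothed IDF form, ln((1 + N) / (1 + df)) + 1, followed by L2 normalization, which is how scikit-learn documents its default settings; the corpus and the helper name `tfidf` are assumptions for illustration.

```python
import math
from collections import Counter

# Tiny made-up corpus, purely for illustration
docs = [
    "crime drama detective",
    "crime comedy family",
    "crime documentary nature",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return L2-normalized TF-IDF weights for one document."""
    counts = Counter(tokens)
    # Term count times smoothed inverse document frequency
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t, c in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

weights_doc0 = tfidf(tokenized[0])
print(weights_doc0)
```

In the first document, "crime" occurs in every document and therefore receives a lower weight than "drama" or "detective", which is the behaviour that makes TF-IDF attractive for clustering.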

### 4. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

In textual data processing, I had to deal with the creation of 30,000 attributes during text vectorization, resulting in a vast number of columns that posed challenges for my local machine. To address this issue, I decided to employ Principal Component Analysis (PCA) techniques to effectively reduce the dimensions of this large sparse matrix.

In [None]:
# Import PCA from sklearn
from sklearn.decomposition import PCA

# Create a PCA object; with no n_components given, all components are kept
pca = PCA()

# Fit the PCA model on the TF-IDF matrix (convert to dense array using toarray() if necessary)
pca.fit(x.toarray())

# Calculate the percentage of variance explained by each component
variance = pca.explained_variance_ratio_

# Print the explained variance for each component
print(f"Explained variance by each component: {variance}")


In [None]:
# Create a figure and axis for the plot
fig, ax = plt.subplots()

# Plot the cumulative explained variance ratio versus the number of components
ax.plot(range(1, len(variance) + 1), np.cumsum(pca.explained_variance_ratio_))

# Set labels and title
ax.set_xlabel('Number of Components')
ax.set_ylabel('Percent of Variance Captured')
ax.set_title('PCA Analysis')

# Add gridlines for clarity
plt.grid(linestyle='--', linewidth=0.3)

# Show the plot
plt.show()


From the plot displayed above, it's evident that 7770 principal components are sufficient to capture 100% of the variance. However, for our specific case, we will focus on retaining only the number of principal components necessary to capture 95% of the variance.

In [None]:
# Define a PCA object with n_components set to capture 95% of variance
pca_tuned = PCA(n_components=0.95)

# Fit and transform the PCA model on the TF-IDF matrix (convert to dense array using toarray() if necessary)
pca_tuned.fit(x.toarray())
x_transformed = pca_tuned.transform(x.toarray())

# Check the shape of the transformed matrix
transformed_shape = x_transformed.shape

# Print the shape to see the dimensions
print(f"Shape of transformed matrix: {transformed_shape}")

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

I chose to implement PCA (Principal Component Analysis) for dimensionality reduction in our project. PCA is a commonly used technique for reducing the dimensionality of high-dimensional datasets while preserving the essential information present in the original data.

The core concept behind PCA involves identifying the principal components of the data, which are linear combinations of the original features that capture the maximum variance within the dataset. Through the projection of data onto these principal components, PCA effectively reduces the number of dimensions while retaining the majority of the original data's important characteristics.

PCA is a favored choice for dimensionality reduction due to its simplicity of implementation, computational efficiency, and widespread availability in various data analysis software packages. Moreover, it has undergone extensive research and has a solid theoretical foundation, which establishes it as a dependable and well-understood method.
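The idea described above can be sketched with plain NumPy: center the data, take a singular value decomposition, and keep the smallest number of components whose cumulative explained variance reaches 95%. The data here is synthetic and the variable names are assumptions; this is an illustration of the principle, not scikit-learn's implementation.

```python
import numpy as np

# Synthetic data, purely for illustration: 200 samples, 5 features with very different spreads
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

# Center the data, then take the SVD; the rows of Vt are the principal directions
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance explained by each component, as a fraction of the total
explained_ratio = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative ratio reaches 95%
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

# Project the centered data onto the retained components
X_reduced = X_centered @ Vt[:n_keep].T
print(explained_ratio.round(3), n_keep, X_reduced.shape)
```

Because two of the five synthetic features carry almost all of the spread, only a couple of components are needed here; on a TF-IDF matrix the variance is far more evenly distributed, which is why thousands of components are retained.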

## ***7. ML Model Implementation***

### K-Means Clustering - ML Model

In [None]:
# Import the necessary libraries
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Create an instance of the K-Means clustering model with a specified random state
model = KMeans(random_state=0)

# Instantiate the KElbowVisualizer with a range of K values (1 to 15 here; the upper bound of the tuple is exclusive)
visualizer = KElbowVisualizer(model, k=(1, 16), locate_elbow=False)

# Fit the data (transformed by PCA) to the visualizer
visualizer.fit(x_transformed)

# Finalize and display the figure
visualizer.show()

Here, it appears that there might be an elbow forming at the 2-cluster point. However, before making a definitive decision, let's create another chart that iterates over the same range of cluster numbers and calculates the Silhouette Score at each point.

But what exactly is the Silhouette Score?

The Silhouette Score serves as a metric to assess how closely an object aligns with its own cluster in comparison to other clusters. It plays a crucial role in evaluating the quality of clustering, with a higher score indicating that objects are more similar to their respective clusters and less similar to clusters nearby.

The Silhouette Score ranges from -1 to 1, where a score of 1 signifies that the object is an excellent match for its own cluster and poorly matches neighboring clusters. Conversely, a score of -1 suggests that the object is a poor match for its own cluster but aligns well with neighboring clusters.
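For a concrete feel of the definition, the sketch below computes per-sample silhouette values by brute force, using s = (b - a) / max(a, b), where a is the mean distance to the sample's own cluster and b is the mean distance to the nearest other cluster. The function name and the four toy points are made up; in the notebook itself the score comes from `sklearn.metrics.silhouette_score`.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-sample silhouette values, computed by brute force (needs at least 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # singleton cluster: the silhouette is defined as 0
        a = dist[i, same].mean()  # mean distance to the point's own cluster
        # Mean distance to the nearest other cluster
        b = min(
            dist[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores

# Two tight, well-separated toy clusters
toy_points = [[0, 0], [0, 1], [10, 0], [10, 1]]
toy_labels = [0, 0, 1, 1]
toy_scores = silhouette_values(toy_points, toy_labels)
print(toy_scores.mean())
```

Well-separated clusters give values close to 1; overlapping clusters pull the average toward 0.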

In [None]:
# Import the necessary libraries
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Create an instance of the K-Means clustering model
model = KMeans(random_state=0)

# Instantiate the KElbowVisualizer with a range of K values (2 to 15 here; the upper bound of the tuple is exclusive)
# Specify metric='silhouette' to use the silhouette score for evaluation
# Specify timings=True to measure the time taken for each K value
visualizer = KElbowVisualizer(model, k=(2, 16), metric='silhouette', timings=True, locate_elbow=False)

# Fit the transformed data (from PCA) to the visualizer
visualizer.fit(x_transformed)

# Finalize and display the figure
visualizer.show()


In [None]:
# Import the necessary libraries
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Define the range of K values you want to evaluate
k_range = range(2, 7)

# Loop through each K value and compute the Silhouette score
for k in k_range:
    # Create a K-Means clustering model with the current K value
    # (random_state is fixed so the scores are reproducible across runs)
    Kmodel = KMeans(n_clusters=k, random_state=0)

    # Fit the model to the transformed data (from PCA) and get cluster labels
    labels = Kmodel.fit_predict(x_transformed)

    # Compute the Silhouette score for the current K value
    score = silhouette_score(x_transformed, labels)

    # Print the Silhouette score for the current K value
    print("k=%d, Silhouette score=%f" % (k, score))


Taken together, the Elbow plot and the Silhouette scores above point to 4 clusters as a favorable choice. Therefore, we will proceed with a K-Means analysis using 4 clusters.
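The elbow criterion used above relies on inertia, the sum of squared distances from each point to its assigned cluster center. As a rough illustration, here is a bare-bones Lloyd's algorithm in NumPy run on synthetic blobs. The function `lloyd_kmeans`, the random initialization, and the data are all assumptions for the sketch; scikit-learn's `KMeans` is more elaborate (k-means++ initialization, multiple restarts).

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Bare-bones Lloyd's algorithm; returns labels, centers and inertia."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct data points (plain random init, not k-means++)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest center for every point
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: move each center to the mean of its points (keep it if the cluster is empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Inertia: sum of squared distances from each point to its assigned center
    inertia = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, inertia

# Synthetic data, purely for illustration: three Gaussian blobs in 2D
rng = np.random.default_rng(42)
blobs = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(50, 2))
    for c in ([0, 0], [5, 5], [0, 5])
])

inertias = {k: lloyd_kmeans(blobs, k)[2] for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 1))
```

Inertia tends to fall as k grows, so the elbow is read as the point where further increases in k stop paying off; a single random start can land in a local optimum, which is one reason the Silhouette score is checked as well.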

Now, let's visualize how our data points appear once they have been assigned to their respective clusters.

In [None]:
# Create a K-Means clustering model with 4 clusters, using the 'k-means++' initialization method
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=0)

# Predict the cluster labels for each data point and store them in the 'label' variable
label = kmeans.fit_predict(x_transformed)

# Create a figure for plotting
plt.figure(figsize=(10, 6), dpi=120)

# Get unique cluster labels
unique_labels = np.unique(label)

# Plot the results by iterating through each unique cluster label
for i in unique_labels:
    plt.scatter(x_transformed[label == i, 0], x_transformed[label == i, 1], label=i)

# Add a legend to the plot
plt.legend()

# Show the plot
plt.show()


The plot shows four distinct clusters, but only in two dimensions. To gain a better understanding of the data, let's create a 3D visualization using Matplotlib's mplot3d toolkit. This will allow us to examine the separation between clusters more effectively.

In [None]:
# Import the necessary library for 3D visualization
from mpl_toolkits.mplot3d import Axes3D

# Create a 3D plot
fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(111, projection='3d')

# Define colors for each cluster
colors = ['r', 'g', 'b', 'y']

# Plot the data points in 3D for each cluster
for i in range(len(colors)):
    ax.scatter(
        x_transformed[kmeans.labels_ == i, 2],
        x_transformed[kmeans.labels_ == i, 0],
        x_transformed[kmeans.labels_ == i, 1],
        c=colors[i]
    )

# Rotate the 3D plot for better visibility
ax.view_init(elev=20, azim=-120)

# Set labels for each axis
ax.set_xlabel('x-axis')
ax.set_ylabel('y-axis')
ax.set_zlabel('z-axis')

# Show the 3D plot
plt.show()


Great! It's clear that we can visually differentiate all four clusters in the 3D plot.

Now, to finalize the assignment, let's add a new attribute to the final dataframe and assign each 'Content' to its respective cluster. This will help organize and analyze the data more effectively.

In [None]:
# Add the K-Means cluster labels to the DataFrame
netflix_df['kmeans_cluster'] = kmeans.labels_

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Import the necessary libraries for word cloud generation
from wordcloud import WordCloud, STOPWORDS

def kmeans_wordcloud(cluster_number, column_name):
    '''
    Function for building a word cloud for the movie/shows in a specified cluster.

    Args:
        cluster_number (int): The cluster number for which you want to create the word cloud.
        column_name (str): The name of the column containing text data (e.g., 'content_detail').

    Returns:
        numpy.ndarray: A numpy array representing the generated word cloud image.
    '''

    # Filter the DataFrame by the specified cluster number and column name, removing NaN and empty strings
    df_wordcloud = netflix_df[['kmeans_cluster', column_name]].dropna()
    df_wordcloud = df_wordcloud[df_wordcloud['kmeans_cluster'] == cluster_number]
    df_wordcloud = df_wordcloud[df_wordcloud[column_name].str.len() > 0]

    # Combine all text documents into a single string
    text = " ".join(word for word in df_wordcloud[column_name])

    # Create the word cloud
    wordcloud = WordCloud(stopwords=set(STOPWORDS), background_color="black").generate(text)

    # Convert the word cloud to a numpy array
    image_array = wordcloud.to_array()

    # Return the numpy array representing the word cloud image
    return image_array


In [None]:
# Create subplots for plotting word clouds
fig, axs = plt.subplots(nrows=4, ncols=4, figsize=(20, 15))

# Loop through clusters and attributes to generate and plot word clouds
for i in range(4):
    for j, col in enumerate(['description', 'listed_in', 'country', 'title']):
        axs[j][i].imshow(kmeans_wordcloud(i, col))
        axs[j][i].axis('off')
        axs[j][i].set_title(f'Cluster {i}, {col}', fontsize=14, fontweight='bold')

# Adjust the layout for better visualization
plt.tight_layout()

# Show the plots
plt.show()


### Hierarchical Clustering - ML Model

In [None]:
# Import the necessary libraries for hierarchical clustering and dendrogram plotting
from scipy.cluster.hierarchy import linkage, dendrogram

# Perform hierarchical clustering on the transformed data using Ward linkage and Euclidean distance
distances_linkage = linkage(x_transformed, method='ward', metric='euclidean')

# Create a figure for the dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('All Films/TV Shows')
plt.ylabel('Euclidean Distance')

# Plot the dendrogram without labels for each observation
dendrogram(distances_linkage, no_labels=True)

# Show the dendrogram
plt.show()

A dendrogram is a tree-like diagram used in clustering analysis to visualize how data points are successively merged into clusters. The height of each horizontal link shows the distance at which two clusters are merged. To determine a suitable number of clusters, look for the longest stretch of vertical lines that is not crossed by any horizontal link; cutting the tree there suggests a natural break in the data's hierarchy.
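
The "largest gap" rule can also be applied numerically. The sketch below is one possible way to do it, assuming synthetic blobs in place of the project data: it reads the merge heights from the linkage matrix, finds the largest jump between consecutive merges, and cuts the tree there with `fcluster`. The names `X_demo`, `Z`, and `cut_labels` are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Three well-separated synthetic blobs (an assumption, not the project data)
X_demo, _ = make_blobs(
    n_samples=120,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Column 2 of the linkage matrix holds the distance at which each merge happens
Z = linkage(X_demo, method="ward")
merge_heights = Z[:, 2]

# Largest jump between consecutive merge heights
gaps = np.diff(merge_heights)
largest_gap = int(np.argmax(gaps))

# After merges 0..largest_gap have happened, n - (largest_gap + 1) clusters remain
n_clusters = len(X_demo) - (largest_gap + 1)

# Cut the tree so that it yields that many flat clusters
cut_labels = fcluster(Z, t=n_clusters, criterion="maxclust")
print("suggested number of clusters:", n_clusters)
```

This heuristic is only a starting point; on real data the largest gap is often at the very top of the tree, so it is worth comparing the result with the silhouette scores as well.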



In [None]:
# Import the necessary libraries for Agglomerative Clustering and Silhouette score
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Define the range of K values you want to evaluate
k_range = range(2, 10)

# Loop through each K value and compute the Silhouette score
for k in k_range:
    # Create an Agglomerative Clustering model with the current K value
    model = AgglomerativeClustering(n_clusters=k)

    # Fit the model to the transformed data (from PCA) and get cluster labels
    labels = model.fit_predict(x_transformed)

    # Compute the Silhouette score for the current K value
    score = silhouette_score(x_transformed, labels)

    # Print the Silhouette score for the current K value
    print("k=%d, Silhouette score=%f" % (k, score))


Based on the silhouette scores above, the optimal number of clusters is 2, since it yields the maximum silhouette score. The dendrogram supports this conclusion: the two top-level branches are joined at the largest distance in the tree.

Now, let's proceed by plotting the chart once more to visually examine the two distinct clusters that have been formed. This visualization will provide us with a clearer understanding of the data partitioning.

In [None]:
# Create an Agglomerative Clustering model with 2 clusters using Ward linkage (Ward always uses Euclidean distance)
# The 'affinity' argument has been removed from recent scikit-learn releases, so it is omitted here
Agmodel = AgglomerativeClustering(n_clusters=2, linkage='ward')

# Predict the cluster labels for each data point and store them in the 'label' variable
label = Agmodel.fit_predict(x_transformed)

# Create a figure for plotting
plt.figure(figsize=(10, 6), dpi=120)

# Get unique cluster labels
unique_labels = np.unique(label)

# Plot the results by iterating through each unique cluster label
for i in unique_labels:
    plt.scatter(
        x_transformed[label == i, 0],
        x_transformed[label == i, 1],
        label=i
    )

# Add a legend to the plot
plt.legend()

# Show the plot
plt.show()


Let's replot the 3D graph to get a clearer view of the clusters.

In [None]:
# Import the necessary library for 3D visualization
from mpl_toolkits.mplot3d import Axes3D

# Create a 3D plot
fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(111, projection='3d')

# Define colors for each cluster (only two clusters here)
colors = ['r', 'g']

# Plot the data points in 3D for each cluster
for i in range(len(colors)):
    ax.scatter(
        x_transformed[Agmodel.labels_ == i, 0],
        x_transformed[Agmodel.labels_ == i, 1],
        x_transformed[Agmodel.labels_ == i, 2],
        c=colors[i]
    )

# Set labels for each axis
ax.set_xlabel('x-axis')
ax.set_ylabel('y-axis')
ax.set_zlabel('z-axis')

# Show the 3D plot
plt.show()


The two clusters are easy to distinguish visually. To proceed, let's assign each title (movie or TV show) to its cluster by adding one more attribute to the final dataframe.

In [None]:
# Add the Agglomerative Clustering cluster labels to the DataFrame
netflix_df['agglomerative_cluster'] = Agmodel.labels_


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Import the necessary libraries for word cloud generation
from wordcloud import WordCloud, STOPWORDS

def agglomerative_wordcloud(cluster_number, column_name):
    '''
    Function for building a word cloud for the movie/shows in a specified cluster using Agglomerative Clustering.

    Args:
        cluster_number (int): The cluster number for which you want to create the word cloud.
        column_name (str): The name of the column containing text data (e.g., 'content_detail').

    Returns:
        WordCloud: A WordCloud object representing the generated word cloud.
    '''

    # Filter the DataFrame by the specified cluster number and column name, removing NaN
    df_wordcloud = netflix_df[['agglomerative_cluster', column_name]].dropna()
    df_wordcloud = df_wordcloud[df_wordcloud['agglomerative_cluster'] == cluster_number]

    # Combine all text documents into a single string
    text = " ".join(word for word in df_wordcloud[column_name])

    # Create the word cloud
    wordcloud = WordCloud(stopwords=set(STOPWORDS), background_color="black").generate(text)

    # Return the WordCloud object representing the word cloud
    return wordcloud


In [None]:
# Create subplots for plotting word clouds
fig, axs = plt.subplots(nrows=4, ncols=2, figsize=(20, 15))

# Loop through clusters and attributes to generate and plot word clouds
for i in range(2):
    for j, col in enumerate(['description', 'listed_in', 'country', 'title']):
        axs[j][i].imshow(agglomerative_wordcloud(i, col))
        axs[j][i].axis('off')
        axs[j][i].set_title(f'Cluster {i}, {col}', fontsize=14, fontweight='bold')

# Adjust the layout for better visualization
plt.tight_layout()

# Show the plots
plt.show()


### Building a Recommendation System

In [None]:
# Import the necessary libraries for cosine similarity and TF-IDF vectorization
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF vectorizer object and transform the text data
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(netflix_df['content_detail'])

# Compute the cosine similarity matrix between all program descriptions
cosine_sim = cosine_similarity(tfidf_matrix)

def recommend_content(title, cosine_sim=cosine_sim, data=netflix_df):
    '''
    Function to recommend content similar to a given title based on cosine similarity.

    Args:
        title (str): The title of the content for which recommendations are sought.
        cosine_sim (array-like, optional): The cosine similarity matrix.
        data (DataFrame, optional): The DataFrame containing the content data.

    Returns:
        DataFrame: A DataFrame containing the top 10 recommended titles and their similarity scores.
    '''

    # Get the index of the input title in the program list
    programme_list = data['title'].to_list()
    index = programme_list.index(title)

    # Create a list of (index, similarity score) tuples
    # between the input title and all programs in the dataset
    sim_scores = list(enumerate(cosine_sim[index]))

    # Sort by similarity score in descending order and keep the top 10,
    # skipping the first entry (the input title itself)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:11]

    # Get the recommended movie titles and their similarity scores
    recommend_index = [i[0] for i in sim_scores]
    rec_movie = data['title'].iloc[recommend_index]
    rec_score = [round(i[1], 4) for i in sim_scores]

    # Create a pandas DataFrame to display the recommendations
    rec_table = pd.DataFrame(list(zip(rec_movie, rec_score)), columns=['Recommendation', 'Similarity_score(0-1)'])

    return rec_table


Now, let's spot-check the recommender system with a few titles.

In [None]:
# Testing Indian movie
indian_movie_recommendations = recommend_content('Zindagi Na Milegi Dobara')
print("Recommendations for Indian Movie 'Zindagi Na Milegi Dobara':")
print(indian_movie_recommendations)

# Testing non-Indian movie
non_indian_movie_recommendations = recommend_content('THE RUM DIARY')
print("\nRecommendations for Non-Indian Movie 'THE RUM DIARY':")
print(non_indian_movie_recommendations)

# Testing Indian TV show
indian_tv_show_recommendations = recommend_content('Humsafar')
print("\nRecommendations for Indian TV Show 'Humsafar':")
print(indian_tv_show_recommendations)

# Testing non-Indian TV show
non_indian_tv_show_recommendations = recommend_content('The World Is Yours')
print("\nRecommendations for Non-Indian TV Show 'The World Is Yours':")
print(non_indian_tv_show_recommendations)
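
To see why TF-IDF plus cosine similarity surfaces related titles, here is a small self-contained sketch on a made-up four-row catalogue. The titles, descriptions, and the helper `top_matches` are invented for illustration and are not part of the Netflix dataset or the notebook. Unlike `recommend_content`, the helper drops the query title by name instead of slicing off the first sorted entry.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny invented catalogue (hypothetical titles and descriptions)
toy_df = pd.DataFrame({
    "title": ["Space Quest", "Galaxy Wars", "Baking Masters", "Pastry Kings"],
    "content_detail": [
        "astronauts explore a distant galaxy in a spaceship",
        "a spaceship crew fights a war across the galaxy",
        "home bakers compete to bake the best cake",
        "pastry chefs compete to bake cake and bread",
    ],
})

# Vectorize the text and compute all pairwise cosine similarities
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_df["content_detail"])
toy_sim = cosine_similarity(toy_tfidf)


def top_matches(title, n=2):
    """Return the n most similar titles to `title`, excluding the title itself."""
    idx = toy_df.index[toy_df["title"] == title][0]
    scores = pd.Series(toy_sim[idx], index=toy_df["title"]).drop(title)
    return scores.sort_values(ascending=False).head(n)


space_matches = top_matches("Space Quest")
print(space_matches)
```

Titles that share no vocabulary after stop-word removal get a similarity of 0, which is why the two baking shows rank below the other space title here.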


### 1. Which Evaluation metrics did you consider for a positive business impact and why?

We've opted for the Silhouette Score as our evaluation metric for several reasons. The Silhouette Score measures how effectively each data point within a cluster is separated from other clusters. It operates on a scale from -1 to 1, with higher scores indicating better cluster separation. A score close to 1 suggests that a data point is well-suited to its own cluster and poorly suited to neighboring clusters. Conversely, a score near 0 implies a data point is at or very close to the boundary between two clusters, while a score close to -1 suggests the data point might be wrongly assigned.

There are advantages to using the Silhouette Score over the Distortion Score (also known as inertia or sum of squared distances):

1. The Silhouette Score takes both cohesion (similarity among data points within a cluster) and separation (dissimilarity between data points in different clusters) into account. In contrast, the Distortion Score only considers cluster compactness.

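2. The Silhouette Score does not depend on cluster centroids, so it can be computed for any clustering algorithm, including hierarchical clustering. This makes it less tied to perfectly spherical clusters than the Distortion Score, although it still tends to favor compact, convex clusters.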

3. The Silhouette Score assigns a score to each data point, providing more detailed and interpretable results compared to the Distortion Score, which provides a single value for the entire clustering solution.

This choice allows us to comprehensively assess the quality of our clustering results.
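
The difference between the two metrics can be seen side by side. The following sketch, which assumes synthetic blobs rather than the project data, fits K-means once and prints the single distortion value (`inertia_`) next to the per-point silhouette values summarized by cluster. `X_demo`, `km`, and `per_point` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for the PCA-transformed features (an assumption)
X_demo, _ = make_blobs(n_samples=200, centers=4, cluster_std=1.0, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)

# One silhouette value per data point, plus their overall mean
per_point = silhouette_samples(X_demo, km.labels_)
overall = silhouette_score(X_demo, km.labels_)

# Distortion is a single number for the whole clustering
print(f"inertia (distortion): {km.inertia_:.2f}")
print(f"mean silhouette: {overall:.3f}")

# Per-cluster summaries show which clusters contain poorly placed points
for c in np.unique(km.labels_):
    cluster_scores = per_point[km.labels_ == c]
    print(f"cluster {c}: mean={cluster_scores.mean():.3f}, min={cluster_scores.min():.3f}")
```

A low or negative minimum within a cluster points to individual titles that sit near a boundary, which the distortion value alone cannot reveal.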

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

We've chosen K-means as our final clustering model for several reasons:

1. **High Silhouette Score:** K-means clustering has provided us with a comparatively high Silhouette Score, indicating that the clusters are well-separated and data points are appropriately assigned to clusters.

2. **Practical Advantages:** K-means also offers several practical benefits:

   - **Speed**: It is typically faster than hierarchical clustering on large datasets because of its simplicity and lower computational cost.
   
   - **Ease of Use**: K-means is straightforward to implement and interpret, with few parameters to tune, such as the number of clusters. It provides a clear partitioning of the data.
   
   - **Scalability**: K-means is scalable and can handle datasets with a large number of variables or dimensions, which is beneficial when dealing with high-dimensional data.
   
   - **Independence of Clusters**: K-means produces non-overlapping clusters, which can be preferable for applications where clear separation is needed.

While K-means has its advantages, it's important to note that the choice of clustering algorithm should depend on the specific characteristics of your data and the goals of your analysis. Different algorithms may perform better in different scenarios, so it's essential to consider the nature of your data and your objectives when selecting a clustering method.
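
One simple way to compare candidate algorithms on equal terms is to score them with the same metric on the same data. The sketch below does this for K-means and Ward agglomerative clustering on synthetic blobs; the data, the fixed choice of 3 clusters, and the names `candidates` and `comparison` are assumptions for illustration, not results from this project.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the PCA-transformed features (an assumption)
X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=0)

# Candidate models, all asked for the same number of clusters
candidates = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3, linkage="ward"),
}

# Fit each model and record its mean silhouette score
rows = []
for name, model in candidates.items():
    labels = model.fit_predict(X_demo)
    rows.append({"model": name, "silhouette": silhouette_score(X_demo, labels)})

comparison = pd.DataFrame(rows).sort_values("silhouette", ascending=False)
print(comparison)
```

On the real features, the same loop could be run over several values of `n_clusters` for each model before settling on a final choice.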

# **Conclusion**

In summary, both the exploratory data analysis (EDA) and the machine learning model have provided valuable insights into Netflix's content distribution, production trends, viewer preferences, and the effectiveness of clustering techniques for recommendation. Here are the key conclusions:

**From EDA:**

1. **Content Diversity and Global Reach:** Netflix's content library is diverse and caters to a global audience, emphasizing international TV shows and popular genres like crime and kids' TV.

2. **Production Trends:** Netflix has experienced significant growth in content production, adapting to the evolving streaming landscape by creating more original content.

3. **Global Influences:** The dominance of the United States and the rise of Indian content highlight regional and global factors shaping the industry.

4. **Regional Success Stories:** South Korean dramas and Canadian financial support for TV shows have had a substantial impact on the platform.

5. **Viewer Engagement:** Viewer preferences vary widely, with interests spanning Japanese voice actors, crime TV shows, kids' TV, British TV shows, and documentaries.

6. **Quality and Collaboration:** Netflix collaborates with prolific directors and actors, emphasizing quality and collaboration within and beyond traditional entertainment.

In essence, Netflix aims to serve a global audience with diverse, high-quality content, adapt to production trends, and engage viewers across cultures and genres.

**From ML Model:**

1. **Clustering Results:** K-Means clustering suggests an optimal number of 4 clusters, while Agglomerative Hierarchical Clustering indicates 2 clusters as optimal.

2. **Evaluation Metric Choice:** Silhouette Score was chosen over Distortion Score because it is interpretable at the level of individual data points and accounts for both cohesion and separation.

3. **Recommendation System:** A content-based recommendation system was built using TF-IDF vectors and cosine similarity. For a given title, it returns the 10 most similar titles with their similarity scores, with the aim of enhancing user experience and reducing subscriber churn.

In conclusion, the combination of data exploration and machine learning techniques has provided Netflix with valuable insights into its content and clustering solutions for content recommendations, enabling the platform to stay competitive in the ever-evolving world of streaming entertainment.
