## **IMPORTS AND DATA LOADING**

In [None]:
import numpy as np # linear algebra operations
import pandas as pd # used for data preparation
import plotly.express as px #used for data visualization
from textblob import TextBlob #used for sentiment analysis

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/dataset/mubi01.csv')

## **BASIC DATA EXPLORATION**

In [None]:
df.shape

(1450, 12)

In [None]:
df.head()

| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Duck the Halls: A Mickey Mouse Christmas Special | Alonso Ramirez Ramos, Dave Wasson | Chris Diamantopoulos, Tony Anselmo, Tress MacN... | | November 26, 2021 | 2016 | TV-G | 23 min | Animation, Family | Join Mickey and the gang as they duck the halls! |
| 1 | s2 | Movie | Ernest Saves Christmas | John Cherry | Jim Varney, Noelle Parker, Douglas Seale | | November 26, 2021 | 1988 | PG | 91 min | Comedy | Santa Claus passes his magic bag to a new St. ... |
| 2 | s3 | Movie | Ice Age: A Mammoth Christmas | Karen Disher | Raymond Albert Romano, John Leguizamo, Denis L... | United States | November 26, 2021 | 2011 | TV-G | 23 min | Animation, Comedy, Family | Sid the Sloth is on Santa's naughty list. |
| 3 | s4 | Movie | The Queen Family Singalong | Hamish Hamilton | Darren Criss, Adam Lambert, Derek Hough, Alexa... | | November 26, 2021 | 2021 | TV-PG | 41 min | Musical | This is real life, not just fantasy! |
| 4 | s5 | TV Show | The Beatles: Get Back | | John Lennon, Paul McCartney, George Harrison, ... | | November 25, 2021 | 2021 | | 1 Season | Docuseries, Historical, Music | A three-part documentary from Peter Jackson ca... |


In [None]:
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

## **DISTRIBUTION OF CONTENT TYPES**

In [None]:
type_distribution = df['type'].value_counts().reset_index()
type_distribution.columns = ['Type', 'Count']
fig = px.pie(type_distribution, values='Count', names='Type', title='Distribution of Content Types')
fig.show()

**Insight:** If the platform has a skewed distribution towards one content type (e.g., mostly movies or TV shows), adding content of the less represented type might attract a broader audience.

**Action:** If movies are underrepresented, adding high-quality movies could diversify the offerings and attract more users.

## **CUSTOMER ID AND DURATION ANALYSIS**

In [None]:
df.rename(columns={'show_id': 'cus_id'}, inplace=True)
df['cus_id'] = df['cus_id'].astype('category').cat.codes

# Scatter plot of duration vs cus_id
fig = px.scatter(df, x='cus_id', y='duration', title='Duration vs. Customer ID')
fig.update_xaxes(title='Customer ID')
fig.update_yaxes(title='Duration')
fig.show()

**Insight:** This analysis helps identify if there’s a preference for short or long-duration content. If the scatter shows a preference towards specific durations, it suggests users’ consumption patterns.

**Action:** Add movies that align with the preferred duration range to match user expectations and increase engagement.
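The `duration` column holds strings such as `23 min` and `1 Season`, so a scatter plot treats each distinct string as its own category rather than as a number. A minimal sketch of one way to separate the numeric value from the unit before plotting is shown here. The helper name `parse_duration` and the regular expression are assumptions made for illustration, based only on the formats visible in the `df.head()` output.

```python
import re

# Assumed formats, taken from the sample rows: "23 min", "1 Season", "2 Seasons".
_DURATION_RE = re.compile(r"^\s*(\d+)\s*(min|seasons?)\s*$", re.IGNORECASE)


def parse_duration(raw):
    """Return (value, unit) for a duration string, or (None, None) if it cannot be parsed."""
    if not isinstance(raw, str):
        return (None, None)  # a missing value in pandas arrives as a float NaN
    match = _DURATION_RE.match(raw)
    if match is None:
        return (None, None)
    value = int(match.group(1))
    unit = "min" if match.group(2).lower() == "min" else "season"
    return (value, unit)


print(parse_duration("23 min"))    # (23, 'min')
print(parse_duration("1 Season"))  # (1, 'season')
```

Applied with something like `df['duration'].apply(parse_duration)`, this would allow minutes (movies) and seasons (TV shows) to be plotted on separate numeric axes instead of being mixed together.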

## **AVERAGE RATING BY CONTENT TYPE**

In [None]:
avg_rating_by_type = df.groupby('type')['rating'].apply(lambda x: pd.Series(pd.Categorical(x).codes).mean()).reset_index(name='Average Rating')
fig = px.bar(avg_rating_by_type, x='type', y='Average Rating', title='Average Rating by Content Type')
fig.show()


**Insight:** Higher average ratings for a specific content type indicate user satisfaction. If movies have high ratings, focusing on quality over quantity might be key.

**Action:** Prioritize adding movies with similar traits to those that have received high ratings, as these are likely to be well-received.
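One caveat on the calculation above: `pd.Categorical(x).codes` numbers the labels in sorted order and assigns `-1` to missing values, so the mean of the codes reflects the alphabetical position of labels such as `PG` and `TV-G`. If an ordered scale is wanted, a sketch along the following lines may be closer to the intent. The list `RATING_ORDER` is a hypothetical ordering chosen for illustration, not something taken from the dataset, and should be replaced with a ranking that suits the actual labels.

```python
# Hypothetical ordering from least to most restrictive; adjust to the labels in the data.
RATING_ORDER = ["G", "TV-G", "PG", "TV-PG", "PG-13", "TV-14"]
RATING_RANK = {label: rank for rank, label in enumerate(RATING_ORDER)}


def mean_rating_rank(ratings):
    """Average the ordinal rank of known rating labels, skipping missing or unknown ones."""
    ranks = [RATING_RANK[r] for r in ratings if r in RATING_RANK]
    return sum(ranks) / len(ranks) if ranks else None


# Labels taken from the sample rows; None stands in for a missing rating.
print(mean_rating_rank(["TV-G", "PG", "TV-G", "TV-PG", None]))
```

Unknown and missing labels are skipped here rather than counted as `-1`, so they no longer pull the average down.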

## **DISTRIBUTION OF RATINGS**

In [None]:
rating_counts = df['rating'].value_counts().reset_index()
rating_counts.columns = ['Rating', 'Count']
fig = px.pie(rating_counts, values='Count', names='Rating', title='Distribution of Ratings')
fig.show()

**Insight:** A wide distribution in ratings suggests varying content quality. If most content has high ratings, adding movies that match these high standards is crucial.

**Action:** Focus on acquiring critically acclaimed or audience-approved movies with similar ratings to maintain the platform’s quality perception.

## **TOP DIRECTORS BY CONTENT COUNT**

In [None]:
df['director'] = df['director'].fillna('Director not specified')
# Split multi-director entries and strip the space left after each comma
directors_list = df['director'].str.split(',', expand=True).stack().str.strip().to_frame(name='Director')
directors_counts = directors_list['Director'].value_counts().reset_index()
directors_counts.columns = ['Director', 'Count']
directors_counts = directors_counts[directors_counts['Director'] != 'Director not specified']

fig = px.bar(directors_counts.head(10), x='Count', y='Director', title='Top 10 Directors by Content Count')
fig.show()

**Insight:** Certain directors consistently produce content that resonates with the audience. Movies by these directors might attract more viewers.

**Action:** Consider acquiring new releases or popular past movies by top directors identified in this analysis.

## **TOP ACTORS BY CONTENT COUNT**

In [None]:
df['cast'] = df['cast'].fillna('No cast specified')
# Split each cast string and strip the space left after each comma
cast_list = df['cast'].str.split(',', expand=True).stack().str.strip().to_frame(name='Actor')
cast_counts = cast_list['Actor'].value_counts().reset_index()
cast_counts.columns = ['Actor', 'Count']
cast_counts = cast_counts[cast_counts['Actor'] != 'No cast specified']

fig = px.bar(cast_counts.head(10), x='Count', y='Actor', title='Top 10 Actors by Content Count')
fig.show()

**Insight:** Actors with a high presence and popularity are likely to draw viewers. Movies featuring these actors could be more profitable.

**Action:** Secure movies starring these top actors, as their fan base could drive significant viewership.

## **CONTENT ADDED OVER TIME**

In [None]:
df['date_added'] = pd.to_datetime(df['date_added'], format='%B %d, %Y', errors='coerce')
df['year_added'] = df['date_added'].dt.year
content_added_over_time = df.groupby('year_added').size().reset_index(name='Total Count')

fig = px.line(content_added_over_time, x='year_added', y='Total Count', title='Content Added Over Time')
fig.show()

**Insight:** Identifying years with a surge in popular content can help understand audience preferences for certain periods.

**Action:** Add movies from eras that saw high engagement, especially if these movies align with current trends or are considered classics.

In [None]:
print(df.columns)

Index(['cus_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description',
       'year_added'],
      dtype='object')


## **SCATTER PLOT WITH ANNOTATION**

In [None]:
fig = px.scatter(df, x='cus_id', y='duration', text='cus_id', title='Duration by ID with Annotations')
fig.update_traces(textposition='top center')
fig.show()

**Insight:** If specific customer IDs correlate with high-duration content consumption, this suggests a dedicated viewer base.

**Action:** Target these users with longer-duration movies or series, as they are likely to engage with this type of content.

## **DURATION BY CONTENT TYPE**

In [None]:
# Box plot of duration by content type
fig = px.box(df, x='type', y='duration', title='Duration by Content Type')
fig.update_xaxes(title='Content Type')
fig.update_yaxes(title='Duration')
fig.show()

**Insight:** If certain content types have a broader duration range that performs well, it indicates flexibility in viewer tolerance.

**Action:** Add movies that offer a balance in duration – not too short or too long – to cater to the widest audience range.

## **SENTIMENT ANALYSIS OF DESCRIPTION**

In [None]:
from textblob import TextBlob

def analyze_sentiment(description):
    testimonial = TextBlob(description)
    return 'Positive' if testimonial.sentiment.polarity > 0 else 'Negative' if testimonial.sentiment.polarity < 0 else 'Neutral'

df['Sentiment'] = df['description'].fillna('').apply(analyze_sentiment)
sentiment_counts = df['Sentiment'].value_counts().reset_index()
sentiment_counts.columns = ['Sentiment', 'Count']

fig = px.pie(sentiment_counts, values='Count', names='Sentiment', title='Sentiment Distribution of Descriptions')
fig.show()

**Insight:** Positive sentiment in content descriptions suggests higher engagement. Negative or neutral sentiments may not attract as much attention.

**Action:** Add movies with positive descriptions or those that can be marketed with uplifting or intriguing narratives to attract viewers.
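TextBlob reports polarity as a float in the range [-1.0, 1.0], and the function above labels a description Neutral only when the score is exactly 0. A small sketch of a variant with a tolerance band around zero follows. The `neutral_band` parameter and its default of 0.05 are assumptions for illustration, not values from the notebook.

```python
def label_polarity(polarity, neutral_band=0.05):
    """Map a polarity score in [-1.0, 1.0] to a sentiment label.

    Scores within +/- neutral_band of zero count as Neutral (hypothetical threshold).
    """
    if polarity > neutral_band:
        return "Positive"
    if polarity < -neutral_band:
        return "Negative"
    return "Neutral"


for score in (0.5, 0.01, 0.0, -0.3):
    print(score, label_polarity(score))
```

Setting `neutral_band=0` reproduces the three-way split used in the cell above.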

## **ACTOR COLLABORATION**

In [None]:
actor_collab_df = df[['cast']].copy()
# Preview the 'cast' column; each value is the full comma-separated cast list of one title
print(actor_collab_df['cast'].head())
actor_collaborations = actor_collab_df.groupby('cast').size().reset_index(name='Count')
actor_collaborations = actor_collaborations.sort_values(by='Count', ascending=False)

fig = px.bar(actor_collaborations.head(10), x='Count', y='cast', title='Top 10 Actors by Collaboration Count')
fig.show()

0    Chris Diamantopoulos, Tony Anselmo, Tress MacN...
1             Jim Varney, Noelle Parker, Douglas Seale
2    Raymond Albert Romano, John Leguizamo, Denis L...
3    Darren Criss, Adam Lambert, Derek Hough, Alexa...
4    John Lennon, Paul McCartney, George Harrison, ...
Name: cast, dtype: object


**Insight:** Frequent collaborations between popular actors could indicate successful formulas. Movies featuring such collaborations might draw more attention.

**Action:** Acquire or promote movies that feature successful actor pairings identified in the analysis.
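The grouping above counts how often an identical full cast string appears. To count how often two specific actors appear together, one possible approach is to split each cast string and tally every pair, as in the sketch below. The function name `count_actor_pairs` is hypothetical, and the second sample row is invented purely to show a repeated pairing.

```python
from collections import Counter
from itertools import combinations


def count_actor_pairs(cast_strings):
    """Count how many titles each unordered pair of actors shares."""
    pair_counts = Counter()
    for cast in cast_strings:
        if not isinstance(cast, str):
            continue  # skip missing values
        # De-duplicate and sort so (A, B) and (B, A) land on the same key.
        actors = sorted({name.strip() for name in cast.split(",") if name.strip()})
        pair_counts.update(combinations(actors, 2))
    return pair_counts


sample = [
    "Jim Varney, Noelle Parker, Douglas Seale",  # from the sample rows
    "Jim Varney, Douglas Seale",                 # invented row for illustration
]
for pair, count in count_actor_pairs(sample).most_common(3):
    print(pair, count)
```

A placeholder such as `No cast specified` contains a single name, so it produces no pairs and drops out of the tally on its own.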

## **RATING WITH CAST SIZE**

In [None]:
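# Missing casts were filled with 'No cast specified' earlier, so count those as 0
df['cast_size'] = df['cast'].apply(lambda x: 0 if pd.isnull(x) or x == 'No cast specified' else len(x.split(',')))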
rating_cast_size = px.scatter(df, x='cast_size', y='rating', title='Rating vs Cast Size')
rating_cast_size.show()

**Insight:** A larger cast size might correlate with higher ratings, suggesting that star-studded movies perform well.

**Action:** Focus on adding movies with large, well-known casts, as these might appeal more to the audience.

## **DIRECTOR IMPACT ON CONTENT TYPE**

In [None]:
director_type_df = df[['director', 'type']].copy()
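# Split multi-director entries into one row per director, keeping each row's type aligned
director_type_df['director'] = director_type_df['director'].str.split(',')
director_type_df = director_type_df.explode('director')
director_type_df['director'] = director_type_df['director'].str.strip()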
director_type_counts = director_type_df.groupby(['director', 'type']).size().reset_index(name='Count')

fig = px.sunburst(director_type_counts, path=['director', 'type'], values='Count', title='Director Impact on Content Type')
fig.show()

**Insight:** Certain directors might be strongly associated with specific content types. This association can guide content acquisition decisions.

**Action:** Target movies from directors known to excel in the content type that aligns with the platform’s user base.

## **GENRE POPULARITY OVER TIME**

In [None]:
df['listed_in'] = df['listed_in'].fillna('Unknown')
genre_trend = df[['release_year', 'listed_in']].copy()
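# Split the comma-separated genre string so explode() yields one row per genre
genre_trend['listed_in'] = genre_trend['listed_in'].str.split(',')
genre_trend = genre_trend.explode('listed_in')
genre_trend['listed_in'] = genre_trend['listed_in'].str.strip()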
genre_trend = genre_trend.groupby(['release_year', 'listed_in']).size().reset_index(name='Count')

fig = px.line(genre_trend, x='release_year', y='Count', color='listed_in', title='Genre Popularity Over Time')
fig.show()


**Insight:** Tracking genre popularity trends helps in identifying which genres are currently in demand or making a comeback.

**Action:** Add movies from trending genres to capitalize on current viewer interests and preferences.

## **CONTENT ADDITION FREQUENCY BY GENRE**

In [None]:
df['listed_in'] = df['listed_in'].fillna('Unknown')
genre_addition_frequency = df[['listed_in', 'date_added']].copy()
genre_addition_frequency['month_added'] = genre_addition_frequency['date_added'].dt.to_period('M').astype(str)
genre_addition_frequency = genre_addition_frequency.groupby(['listed_in', 'month_added']).size().reset_index(name='Total Count')

fig = px.line(genre_addition_frequency, x='month_added', y='Total Count', color='listed_in', title='Content Addition Frequency by Genre')
fig.show()

**Insight:** Understanding when genres are typically added can help in planning content releases to avoid saturation.

**Action:** Time the addition of movies in less saturated periods to maximize visibility and engagement.

## **CONTENT POPULARITY BY RELEASE YEAR AND RATING**

In [None]:
popularity_by_year_rating = df.groupby(['release_year', 'rating']).size().reset_index(name='Total Count')
fig = px.line(popularity_by_year_rating, x='release_year', y='Total Count', color='rating', title='Content Popularity by Release Year and Rating')
fig.show()

**Insight:** Popular content from specific years that also holds high ratings can indicate a nostalgic or sustained interest in that period.

**Action:** Consider adding highly rated movies from the identified popular release years to tap into this ongoing interest.

## **SENTIMENT ANALYSIS OVER TIME**

In [None]:
df3 = df[['release_year', 'description']].copy()  # copy so later assignments do not touch a view of df
df3 = df3.rename(columns = {'release_year':'Release Year', 'description':'Description'})
df3['Description'] = df3['Description'].fillna('')  # TextBlob needs a string, not NaN
for index, row in df3.iterrows():
  d=row['Description']
  testimonial = TextBlob(d)
  p = testimonial.sentiment.polarity
  if p==0:
    sent = 'Neutral'
  elif p>0:
    sent = 'Positive'
  else:
    sent = 'Negative'
  # Label only the current row
  df3.loc[index, 'Sentiment']=sent

df3 = df3.groupby(['Release Year', 'Sentiment']).size().reset_index(name = 'Total Count')

df3 = df3[df3['Release Year']>2005]
barGraph = px.bar(df3, x="Release Year", y="Total Count", color = "Sentiment", title = "Sentiment Analysis of Content")
barGraph.show()

**Insight:** Positive sentiment over the years suggests consistency in content quality and reception.

**Action:** Add movies that align with the sentiment trends observed, ensuring they resonate with current user expectations.

## **MONTHLY TRENDS IN CONTENT ADDITION**

In [None]:
df['month_added'] = df['date_added'].dt.month
monthly_trends = df.groupby(['month_added']).size().reset_index(name='Total Count')
fig = px.line(monthly_trends, x='month_added', y='Total Count', title='Monthly Trends in Content Additions')
fig.show()

**Insight:** Identifying peak times for content consumption can guide when to add new movies to maximize visibility.

**Action:** Plan to add major movie releases during peak consumption months to boost initial viewership.
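The monthly counts depend on parsing `date_added` strings such as `November 26, 2021`. As a rough, library-free sketch of the same idea, the snippet below parses that format with the standard library and tallies titles per calendar month. The helper name `month_counts` is hypothetical. Unparseable or missing values are skipped, in the same spirit as the `errors='coerce'` setting used when the column was converted earlier, and the month name is assumed to be in English.

```python
from collections import Counter
from datetime import datetime


def month_counts(date_strings, fmt="%B %d, %Y"):
    """Count entries per calendar month (1-12), ignoring values that do not parse."""
    counts = Counter()
    for raw in date_strings:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except (AttributeError, TypeError, ValueError):
            continue  # missing or malformed date
        counts[parsed.month] += 1
    return counts


# The first two values come from the sample rows; the last two show skipped inputs.
print(month_counts(["November 26, 2021", "November 25, 2021", "not a date", None]))
```

Because the result is keyed by month number only, additions from the same month in different years are pooled, as they are in the cell above.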

## **GENRE TRENDS OVER TIME**

In [None]:
genre_trends = df.groupby(['release_year', 'listed_in']).size().reset_index(name='Total Count')
genre_trends_chart = px.area(genre_trends, x='release_year', y='Total Count', color='listed_in', title='Genre Trends Over Time')
genre_trends_chart.show()

**Insight:** Certain genres may rise or fall in popularity over time, indicating shifting viewer preferences.

**Action:** Focus on adding movies from genres that are currently rising in popularity, ensuring alignment with evolving tastes.
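Note that the grouping above treats each full `listed_in` string, for example `Animation, Comedy, Family`, as a single genre. If counts per individual genre are wanted, one possible sketch is to split the string first and tally each (year, genre) pair. The helper name `genre_counts_by_year` is an assumption, and the sample rows reuse values from the `df.head()` output.

```python
from collections import Counter


def genre_counts_by_year(rows):
    """Tally (release_year, genre) pairs from (release_year, listed_in) rows."""
    counts = Counter()
    for year, listed_in in rows:
        if not isinstance(listed_in, str):
            continue  # skip missing genre strings
        for genre in listed_in.split(","):
            genre = genre.strip()
            if genre:
                counts[(year, genre)] += 1
    return counts


rows = [
    (2016, "Animation, Family"),
    (2011, "Animation, Comedy, Family"),
    (1988, "Comedy"),
]
for key, count in sorted(genre_counts_by_year(rows).items()):
    print(key, count)
```

With the real data, the rows could be supplied with something like `zip(df['release_year'], df['listed_in'])`.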

## **PREDICTED RATING BY GENRE**

In [None]:
genre_rating_df = df[['listed_in', 'rating']].copy()
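# Split the comma-separated genre string so explode() yields one row per genre
genre_rating_df['listed_in'] = genre_rating_df['listed_in'].str.split(',')
genre_rating_df = genre_rating_df.explode('listed_in')
genre_rating_df['listed_in'] = genre_rating_df['listed_in'].str.strip()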
genre_rating_mean = genre_rating_df.groupby('listed_in')['rating'].apply(lambda x: pd.Series(pd.Categorical(x).codes).mean()).reset_index(name='Rating')
fig = px.bar(genre_rating_mean, x='listed_in', y='Rating', title='Predicted Rating by Genre')
fig.show()

**Insight:** Identifying which genres tend to receive higher ratings can help prioritize content that will likely perform well.

**Action:** Add movies from these highly rated genres to ensure a positive reception and maximize business profits.

# **Final Recommendation:**
To maximize business profits, the platform should focus on adding high-rated movies from trending genres, starring popular actors, and directed by renowned directors. These movies should ideally align with current user preferences in terms of duration, content type, and sentiment. Additionally, timing the release of these movies during peak consumption months and avoiding genre saturation can further enhance visibility and profitability.