
### **Netflix Data Content Analysis**

In this project, I explore Netflix catalog data to understand what the platform offers and what that suggests about its business. From the data, I plan to analyze:

1. What content is available
2. The similarities between the content
3. The network between actors and directors
4. What Netflix is focusing on, along with a sentiment analysis of the content available on Netflix


In [None]:
#importing the dataset and the Python libraries:

import numpy as np # linear algebra
import pandas as pd # for data preparation
import plotly.express as px # for data visualization
from textblob import TextBlob # for sentiment analysis

dff=pd.read_csv('netflix_titles.csv')
dff.shape

(8807, 12)

In [None]:
#look at the column names:
dff.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

**To begin the task of analyzing Netflix data, I’ll start by looking at the distribution of content ratings on Netflix:**

In [None]:
distribution = dff.groupby(['rating']).size().reset_index(name='counts')
pieChart = px.pie(distribution, values='counts', names='rating',
                  title='Distribution of Content Ratings on Netflix',
                  color_discrete_sequence=px.colors.qualitative.Set3)
pieChart.show()

From the graph above, I can conclude that “TV-MA” is the most common rating on Netflix. This means that the largest share of the content available on Netflix is intended for mature, adult audiences.
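To back a statement like this with numbers rather than by reading the pie chart, the share of each rating can be computed directly. The following is a minimal sketch on a small hypothetical Series (the values are invented for illustration and stand in for `dff['rating']`); it also checks whether the top rating actually exceeds 50% of titles.

```python
import pandas as pd

# Hypothetical stand-in for dff['rating']; the real values come from netflix_titles.csv.
ratings = pd.Series(['TV-MA', 'TV-MA', 'TV-14', 'PG-13', 'TV-MA', 'TV-14', 'R', None])

# value_counts drops missing ratings by default; normalize=True returns fractions.
rating_share = ratings.value_counts(normalize=True).mul(100).round(1)

# value_counts sorts in descending order, so the first entry is the most common rating.
top_rating = rating_share.index[0]
is_majority = bool(rating_share.iloc[0] > 50)

print(rating_share)
print(top_rating, is_majority)
```

On the real data, replacing `ratings` with `dff['rating']` should give the percentages behind the pie chart.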

**Looking at the top 5 directors by number of titles on this platform:**

In [None]:
# Fill missing directors with a placeholder so they can be filtered out later
dff['director']=dff['director'].fillna('No Director Specified')
# Split multi-director entries into one row per name and strip the space left after each comma
filtered_directors=dff['director'].str.split(',',expand=True).stack().str.strip()
filtered_directors=filtered_directors.to_frame()
filtered_directors.columns=['Director']
# Count titles per director and drop the placeholder
directors=filtered_directors.groupby(['Director']).size().reset_index(name='Total Content')
directors=directors[directors.Director !='No Director Specified']
directors=directors.sort_values(by=['Total Content'],ascending=False)
directorsTop5=directors.head()
# Sort ascending so the largest bar appears at the top of the horizontal chart
directorsTop5=directorsTop5.sort_values(by=['Total Content'])
fig1=px.bar(directorsTop5,x='Total Content',y='Director',title='Top 5 Directors on Netflix')
fig1.show()

From the above graph, I can conclude that the five directors with the most titles on this platform are:
1. Raul Campos
2. Jan Suter
3. Jay Karas
4. Marcus Raboy
5. Jay Chapman
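
The counting step can also be written with `explode` and `value_counts`. This is a sketch on a few hypothetical rows that mimic the comma-separated `director` column, not the notebook's actual data. It shows why whitespace handling matters: splitting on `','` leaves a leading space on every name after the first, so the same person can be counted under two different strings.

```python
import pandas as pd

# Hypothetical rows mimicking the comma-separated 'director' column.
sample = pd.DataFrame({
    'director': ['Jan Suter, Raul Campos', 'Raul Campos, Jan Suter', 'Jay Karas', None]
})

# One row per name, without stripping: ' Jan Suter' and 'Jan Suter' stay distinct.
split_names = sample['director'].dropna().str.split(',').explode()
distinct_unstripped = split_names.nunique()

# Stripping whitespace merges the variants before counting.
director_counts = split_names.str.strip().value_counts()

print(distinct_unstripped)
print(director_counts)
```

The same pattern applies to the `cast` column when counting actors.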

**Looking at the top 5 actors by number of titles on this platform:**

In [None]:
# Fill missing cast lists with a placeholder so they can be filtered out later
dff['cast']=dff['cast'].fillna('No Cast Specified')
# Split each cast list into one row per actor and strip the space left after each comma
filtered_cast=dff['cast'].str.split(',',expand=True).stack().str.strip()
filtered_cast=filtered_cast.to_frame()
filtered_cast.columns=['Actor']
# Count titles per actor and drop the placeholder
actors=filtered_cast.groupby(['Actor']).size().reset_index(name='Total Content')
actors=actors[actors.Actor !='No Cast Specified']
actors=actors.sort_values(by=['Total Content'],ascending=False)
actorsTop5=actors.head()
# Sort ascending so the largest bar appears at the top of the horizontal chart
actorsTop5=actorsTop5.sort_values(by=['Total Content'])
fig2=px.bar(actorsTop5,x='Total Content',y='Actor', title='Top 5 Actors on Netflix')
fig2.show()

From the above graph, I can conclude that the five actors with the most titles on this platform are:
1. Anupam Kher
2. Om Puri
3. Shah Rukh Khan
4. Takahiro Sakurai
5. Boman Irani

**Analyzing the trend of production over the years on Netflix:**

In [None]:
df1=dff[['type','release_year']]
df1=df1.rename(columns={"release_year": "Release Year"})
df2=df1.groupby(['Release Year','type']).size().reset_index(name='Total Content')
df2=df2[df2['Release Year']>=2010]
fig3 = px.line(df2, x="Release Year", y="Total Content", color='type',title='Trend of content produced over the years on Netflix')
fig3.show()

The above line graph shows that, for both movies and TV shows, the number of titles has declined for release years after 2018.
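
A decline like this can be confirmed numerically by putting each type in its own column and taking the year-over-year difference. The sketch below uses hypothetical counts in the same shape as `df2` (the numbers are invented for illustration); negative values in the result mark years where the count fell.

```python
import pandas as pd

# Hypothetical counts in the same shape as df2 (one row per year and type).
toy = pd.DataFrame({
    'Release Year': [2017, 2017, 2018, 2018, 2019, 2019],
    'type': ['Movie', 'TV Show'] * 3,
    'Total Content': [10, 4, 12, 6, 9, 5],
})

# One column per type, one row per year.
by_year = toy.pivot(index='Release Year', columns='type', values='Total Content')

# Difference from the previous year; the first year has no predecessor, so it is NaN.
yearly_change = by_year.diff()

print(yearly_change)
```

Note that `release_year` records when a title was released, not when it was added to Netflix, so the most recent years in a catalog snapshot may be incomplete.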

**Finally, I will analyze the sentiment of content on Netflix, based on each title's description:**

In [None]:
dfx=dff[['release_year','description']]
dfx=dfx.rename(columns={'release_year':'Release Year'})

def classify_sentiment(text):
    # TextBlob polarity ranges from -1 (most negative) to 1 (most positive)
    p=TextBlob(text).sentiment.polarity
    if p==0:
        return 'Neutral'
    elif p>0:
        return 'Positive'
    else:
        return 'Negative'

# Label each title by the sentiment of its description
dfx['Sentiment']=dfx['description'].apply(classify_sentiment)

dfx=dfx.groupby(['Release Year','Sentiment']).size().reset_index(name='Total Content')

dfx=dfx[dfx['Release Year']>=2010]
fig4 = px.bar(dfx, x="Release Year", y="Total Content", color="Sentiment", title="Sentiment of content on Netflix")
fig4.show()

The above graph shows that, in every release year from 2010 onward, titles with positive descriptions outnumber those with neutral and negative descriptions combined.
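
Because stacked bars can be hard to compare by eye, the same claim can be checked year by year. This sketch uses hypothetical counts in the same shape as the grouped `dfx` (the numbers are invented and do not reflect the real results); it flags each year in which positive titles outnumber neutral and negative titles combined.

```python
import pandas as pd

# Hypothetical counts in the same shape as the grouped dfx.
toy = pd.DataFrame({
    'Release Year': [2019, 2019, 2019, 2020, 2020, 2020],
    'Sentiment': ['Negative', 'Neutral', 'Positive'] * 2,
    'Total Content': [3, 2, 6, 4, 3, 5],
})

# One column per sentiment; a missing combination counts as zero titles.
wide = toy.pivot(index='Release Year', columns='Sentiment', values='Total Content').fillna(0)

# True where positive titles outnumber neutral and negative titles combined.
positive_dominates = wide['Positive'] > (wide['Neutral'] + wide['Negative'])

print(positive_dominates)
```

Calling `positive_dominates.all()` on the real counts would show whether the statement holds for every year in the chart.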