# Netflix EDA and Data Visualization
This notebook performs exploratory data analysis (EDA) and data visualization on a Netflix dataset covering content added to Netflix from 2008 to 2021.

The purpose of this analysis is to explore and clean the data, and then to create visualizations from it.

## Resources
* Netflix dataset on Kaggle (https://www.kaggle.com/datasets/shivamb/netflix-shows)
* Pie chart from Python Charts (https://python-charts.com/part-whole/pie-chart-matplotlib/#colors)
* Matplotlib color formats (https://matplotlib.org/stable/users/explain/colors/colors.html)
* TV Parental Guidelines (https://en.wikipedia.org/wiki/TV_Parental_Guidelines#TV-G)


## Import the Libraries and Read the Data

In [61]:
import itertools
import re
import string
from collections import Counter
from datetime import datetime

import imageio.v2 as imageio
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from wordcloud import WordCloud, ImageColorGenerator

# NLTK's tokenizer, stopword and WordNet resources must be downloaded once with nltk.download() before use
from nltk.tokenize import word_tokenize  # or: from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer  # or: from nltk.stem.snowball import SnowballStemmer

In [62]:
movies = pd.read_csv('netflix_titles.csv')

In [63]:
movies.head(10)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
5,s6,TV Show,Midnight Mass,Mike Flanagan,"Kate Siegel, Zach Gilford, Hamish Linklater, H...",,"September 24, 2021",2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries",The arrival of a charismatic young priest brin...
6,s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,"September 24, 2021",2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...
7,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s..."
8,s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",United Kingdom,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...


## Delete the title, director, cast, release_year, and duration Columns

In [64]:
movies = movies.drop(['title', 'director', 'cast', 'duration', 'release_year'], axis=1)

## Rename and Capitalize the Columns

In [65]:
movies = movies.rename(columns={'date_added': 'date', 'listed_in':'category', 'rating': 'classification'})
movies.columns = movies.columns.str.capitalize()

In [66]:
movies.columns

Index(['Show_id', 'Type', 'Country', 'Date', 'Classification', 'Category',
       'Description'],
      dtype='object')

In [67]:
movies.head()

Unnamed: 0,Show_id,Type,Country,Date,Classification,Category,Description
0,s1,Movie,United States,"September 25, 2021",PG-13,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,South Africa,"September 24, 2021",TV-MA,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,,"September 24, 2021",TV-MA,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,,"September 24, 2021",TV-MA,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,India,"September 24, 2021",TV-MA,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


## Check for NaNs

In [68]:
pd.isnull(movies).sum()

Show_id             0
Type                0
Country           831
Date               10
Classification      4
Category            0
Description         0
dtype: int64
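
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up toy DataFrame (the `toy` frame and its values are assumptions for illustration, not taken from the Netflix file) showing how `isna().mean()` gives the percentage of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Netflix data (values are made up for illustration)
toy = pd.DataFrame({
    "Country": ["United States", np.nan, "India", np.nan],
    "Date": ["September 25, 2021", "September 24, 2021", np.nan, "July 1, 2019"],
})

# isna() marks missing cells; the column mean of those booleans is the missing fraction
missing_share = toy.isna().mean().mul(100).round(1)
print(missing_share)
```

On the real data the same expression would be applied to `movies` instead of `toy`.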

## Delete NaNs in the Date and Classification Columns

In [69]:
movies = movies.dropna(subset=['Classification', 'Date'])

In [70]:
pd.isnull(movies).sum()

Show_id             0
Type                0
Country           829
Date                0
Classification      0
Category            0
Description         0
dtype: int64

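## Deal with the NaNs in the Country Column

Because the Country column has many NaNs, dropping those rows would discard a large part of the dataset. I replaced them with 'Unknown' instead, so the rows remain available for the rest of the analysis.

In [71]:
movies['Country'] = movies['Country'].fillna('Unknown')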

In [72]:
movies

Unnamed: 0,Show_id,Type,Country,Date,Classification,Category,Description
0,s1,Movie,United States,"September 25, 2021",PG-13,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,South Africa,"September 24, 2021",TV-MA,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Unknown,"September 24, 2021",TV-MA,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Unknown,"September 24, 2021",TV-MA,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,India,"September 24, 2021",TV-MA,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...
8802,s8803,Movie,United States,"November 20, 2019",R,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Unknown,"July 1, 2019",TV-Y7,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,United States,"November 1, 2019",R,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,United States,"January 11, 2020",PG,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


## Transform the Date Column into Datetime

In [73]:
movies['Date'] = movies['Date'].str.strip()  # remove leading and trailing spaces
movies['Date'] = pd.to_datetime(movies['Date'], format="%B %d, %Y")

In [74]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8793 entries, 0 to 8806
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Show_id         8793 non-null   object        
 1   Type            8793 non-null   object        
 2   Country         8793 non-null   object        
 3   Date            8793 non-null   datetime64[ns]
 4   Classification  8793 non-null   object        
 5   Category        8793 non-null   object        
 6   Description     8793 non-null   object        
dtypes: datetime64[ns](1), object(6)
memory usage: 549.6+ KB
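
To illustrate why the whitespace is stripped before parsing, here is a small sketch on made-up strings (the `raw` Series is an assumption for illustration). It also passes `errors="coerce"`, which the notebook does not use, to show how unparseable values could be turned into `NaT` instead of raising an error:

```python
import pandas as pd

# Made-up date strings: one with a stray leading space, one that is not a date
raw = pd.Series([" August 4, 2017", "September 25, 2021", "not a date"])

# Strip whitespace first so every value matches the "%B %d, %Y" format exactly;
# errors="coerce" converts anything that still fails to parse into NaT
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")
print(parsed)
```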


In [75]:
movies

Unnamed: 0,Show_id,Type,Country,Date,Classification,Category,Description
0,s1,Movie,United States,2021-09-25,PG-13,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,South Africa,2021-09-24,TV-MA,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Unknown,2021-09-24,TV-MA,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Unknown,2021-09-24,TV-MA,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,India,2021-09-24,TV-MA,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...
8802,s8803,Movie,United States,2019-11-20,R,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Unknown,2019-07-01,TV-Y7,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,United States,2019-11-01,R,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,United States,2020-01-11,PG,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


## Create a New DataFrame and Explode the Country and Category Columns into Rows

In [76]:
# Split the comma-separated 'Country' values and explode them into individual rows,
# then strip the whitespace before and after each value
movies_clean = movies.assign(Country=movies['Country'].str.split(',')).explode('Country')
movies_clean['Country'] = movies_clean['Country'].str.strip()

# Do the same for 'Category'
movies_clean = movies_clean.assign(Category=movies_clean['Category'].str.split(',')).explode('Category')
movies_clean['Category'] = movies_clean['Category'].str.strip()

# Every row after the first for a given Show_id was created by explode:
# set 'Type', 'Classification', 'Date' and 'Description' to None there so each title is counted once in those columns
movies_clean.loc[movies_clean.duplicated(subset=['Show_id'], keep='first'), ['Type', 'Classification', 'Date', 'Description']] = None
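
The split, explode and blank-out pattern is easier to follow on a tiny example. The sketch below uses a made-up two-row frame (the `toy` data is an assumption for illustration) and passes the duplicate mask as a NumPy array, which sidesteps any alignment on the repeated index that `explode` produces:

```python
import pandas as pd

# Made-up frame: the first title lists two countries in one comma-separated string
toy = pd.DataFrame({
    "Show_id": ["s1", "s2"],
    "Type": ["Movie", "TV Show"],
    "Country": ["United States, Ghana", "India"],
})

# Split the string into a list, then give each list element its own row
exploded = toy.assign(Country=toy["Country"].str.split(",")).explode("Country")
exploded["Country"] = exploded["Country"].str.strip()

# Mark every row after the first for each Show_id and blank its 'Type'
dup = exploded.duplicated(subset=["Show_id"], keep="first").to_numpy()
exploded.loc[dup, "Type"] = None
print(exploded)
```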

## Check if it worked

In [77]:
movies_clean

Unnamed: 0,Show_id,Type,Country,Date,Classification,Category,Description
0,s1,Movie,United States,2021-09-25,PG-13,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,South Africa,2021-09-24,TV-MA,International TV Shows,"After crossing paths at a party, a Cape Town t..."
1,s2,,South Africa,NaT,,TV Dramas,
1,s2,,South Africa,NaT,,TV Mysteries,
2,s3,TV Show,Unknown,2021-09-24,TV-MA,Crime TV Shows,To protect his family from a powerful drug lor...
...,...,...,...,...,...,...,...
8805,s8806,Movie,United States,2020-01-11,PG,Children & Family Movies,"Dragged from civilian life, a former superhero..."
8805,s8806,,United States,NaT,,Comedies,
8806,s8807,Movie,India,2019-03-02,TV-14,Dramas,A scrappy but poor boy worms his way into a ty...
8806,s8807,,India,NaT,,International Movies,


In [42]:
movies_clean['Country'].value_counts().head(10)

Country
United States     6768
India             2804
United Kingdom    1780
Unknown           1720
France             916
Canada             877
Japan              729
South Korea        632
Spain              591
Germany            511
Name: count, dtype: int64
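
Note that these are row counts, not title counts: because Category is exploded after Country, a title with three categories contributes three rows to each of its countries. If the number of distinct titles per country is wanted instead, one possible approach is to count unique `Show_id` values per country. A minimal sketch on made-up rows (the `rows` frame is an assumption for illustration):

```python
import pandas as pd

# Made-up exploded rows: s1 has two categories, s2 has three, s3 has one
rows = pd.DataFrame({
    "Show_id": ["s1", "s1", "s2", "s2", "s2", "s3"],
    "Country": ["United States", "United States", "India", "India", "India", "United States"],
    "Category": ["Dramas", "Comedies", "Dramas", "International Movies", "Comedies", "Thrillers"],
})

# value_counts counts rows, so titles with many categories weigh more
row_counts = rows["Country"].value_counts()

# nunique on Show_id counts each title once per country
title_counts = rows.groupby("Country")["Show_id"].nunique().sort_values(ascending=False)
print(row_counts)
print(title_counts)
```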

In [43]:
movies_clean['Category'].value_counts().head(10)

Category
International Movies        3513
Dramas                      3201
Comedies                    1981
International TV Shows      1463
Action & Adventure          1182
Documentaries               1118
Independent Movies          1040
TV Dramas                    851
Children & Family Movies     845
Thrillers                    806
Name: count, dtype: int64

In [44]:
movies_clean['Type'].value_counts()

Type
Movie      6129
TV Show    2664
Name: count, dtype: int64

In [45]:
movies_clean['Classification'].value_counts().head()

Classification
TV-MA    3205
TV-14    2157
TV-PG     861
R         799
PG-13     490
Name: count, dtype: int64

## Save the movies_clean DataFrame as CSV

In [47]:
# movies_clean is already a DataFrame, so it can be written directly.
# The result of to_csv is not assigned back because it returns None when a path is given.
movies_clean.to_csv('movies_clean.csv', index=False)
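
One detail worth keeping in mind when saving: `DataFrame.to_csv` returns `None` when it is given a file path, so assigning its result back to the DataFrame variable would replace the DataFrame with `None`. The sketch below demonstrates this on a made-up frame written to a temporary directory (the `toy` data and file name are assumptions for illustration):

```python
import os
import tempfile

import pandas as pd

# Made-up frame for illustration
toy = pd.DataFrame({"Show_id": ["s1", "s2"], "Country": ["United States", "India"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    # With a path argument, to_csv writes the file and returns None
    result = toy.to_csv(path, index=False)
    # Reading the file back confirms the data was written
    reloaded = pd.read_csv(path)

print(result)
print(reloaded)
```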

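## Word-Frequency Analysis of the Description Column

Analyze the text of the descriptions to identify the most common words. Note that this counts word frequencies; it does not score the descriptions as positive or negative.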

In [48]:
movies_desc = movies['Description']

In [49]:
movies_desc

0       As her father nears the end of his life, filmm...
1       After crossing paths at a party, a Cape Town t...
2       To protect his family from a powerful drug lor...
3       Feuds, flirtations and toilet talk go down amo...
4       In a city of coaching centers known to train I...
                              ...                        
8802    A political cartoonist, a crime reporter and a...
8803    While living alone in a spooky town, a young g...
8804    Looking to survive in a world taken over by zo...
8805    Dragged from civilian life, a former superhero...
8806    A scrappy but poor boy worms his way into a ty...
Name: Description, Length: 8793, dtype: object

## Clean the Column

The following steps are performed:
* Tokenize
* Convert to lowercase
* Remove punctuation and whitespace
* Remove standard English and custom stopwords
* Lemmatize

In [50]:
# Build the stopword set and the lemmatizer once instead of on every call.
# Tokens are lowercased before filtering, so the custom stopwords must be lowercase too.
ST_WORDS = set(stopwords.words("english"))
ADDITIONAL_STOPWORDS = {"movie", "film", "story", "plot", "character", "the", "when", "but",
                        "must", "after", "home", "one", "two", "three", "four", "this"}
ALL_STOPWORDS = ST_WORDS | ADDITIONAL_STOPWORDS
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Tokenize the text and convert each token to lowercase
    words = [word.lower() for word in word_tokenize(text)]

    # Remove punctuation (including apostrophes)
    words = [re.sub(r'[^\w\s]', '', word) for word in words]
    # Remove empty or whitespace-only tokens
    words = [word for word in words if word.strip()]

    # Remove stopwords
    words = [word for word in words if word not in ALL_STOPWORDS]

    # Lemmatize the remaining words
    return [lemmatizer.lemmatize(word) for word in words]

desc_clean = movies_desc.apply(preprocess_text)

In [51]:
desc_clean.head()

0    [father, nears, end, life, filmmaker, kirsten,...
1    [crossing, path, party, cape, town, teen, set,...
2    [protect, family, powerful, drug, lord, skille...
3    [feud, flirtation, toilet, talk, go, among, in...
4    [city, coaching, center, known, train, india, ...
Name: Description, dtype: object

## Flatten the List, Count the Frequencies and Create a DataFrame

In [52]:
# Flatten the list of token lists so that each token can be counted individually
flat_list = list(itertools.chain.from_iterable(desc_clean))
# Count the frequency of each word
word_freq = Counter(flat_list)
# Create a DataFrame from the word frequencies and sort it by frequency, highest first
desc_final = pd.DataFrame.from_dict(word_freq, orient='index').reset_index()
desc_final = desc_final.sort_values(by=0, ascending=False)
# Rename the columns
desc_final.columns = ['Word', 'Frequency']

In [53]:
desc_final.head(100)

Unnamed: 0,Word,Frequency
3,life,1061
77,young,728
31,family,709
57,new,699
53,woman,661
...,...,...
298,game,141
565,career,140
538,move,138
593,marriage,138
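
When only the top few words are needed, the DataFrame step can be skipped: `Counter.most_common` returns the most frequent items directly. A minimal sketch on made-up token lists (the `token_lists` values are assumptions for illustration, not the notebook's data):

```python
from collections import Counter
from itertools import chain

# Made-up token lists standing in for the cleaned descriptions
token_lists = [["life", "family"], ["life", "young"], ["life", "family"]]

# Flatten the nested lists and count each token
word_freq = Counter(chain.from_iterable(token_lists))

# most_common(n) returns the n most frequent (word, count) pairs, highest first
top_two = word_freq.most_common(2)
print(top_two)  # [('life', 3), ('family', 2)]
```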


## Visualize the Words with WordCloud

In [None]:
# WordCloud expects a single string of text, so join the tokens of every description
text = ' '.join(' '.join(words) for words in desc_clean)
# Create a WordCloud object with custom parameters
wordcloud = WordCloud(width=1000, height=400,    # Canvas width and height in pixels
                      max_words=100,             # Maximum number of words to display
                      background_color='white',  # Background color
                      colormap='Reds',           # Colormap for the word colors
                      contour_color='black',     # Contour color (only drawn when a mask is used)
                      contour_width=1,           # Contour width (only drawn when a mask is used)
                      stopwords=None).generate(text)  # None falls back to WordCloud's built-in stopword list
# Display the WordCloud
plt.figure(figsize=(10, 5))  # Adjust the figure size
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

## Add the Netflix logo as a mask to the WordCloud

In [None]:
# WordCloud expects a single string of text, so join the tokens of every description
text = ' '.join(' '.join(words) for words in desc_clean)
# Read the mask image
mask = imageio.imread("netflix.jpg")

# Create a WordCloud object with custom parameters
# (width and height are omitted because the mask's shape is used when a mask is given)
wordcloud = WordCloud(max_words=100,             # Maximum number of words to display
                      background_color='white',  # Background color
                      colormap='Reds',           # Colormap (overridden by the recolor step below)
                      contour_color='red',       # Color of the outline drawn around the mask shape
                      contour_width=1,           # Width of the outline
                      mask=mask,                 # Shape in which the words are drawn
                      stopwords=None).generate(text)  # None falls back to WordCloud's built-in stopword list

# Create a color generator from the mask image
color_gen = ImageColorGenerator(mask)

# Recolor the words with the colors of the mask image
wordcloud.recolor(color_func=color_gen)

# Display the WordCloud
plt.figure(figsize=(20, 10))  # Adjust the figure size
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

## Classification Reference

The values in the Classification column have the following meanings:

* TV-MA (Mature Audience Only): May not be suitable for children under 17.
* TV-14 (Parents Strongly Cautioned): Some material may not be suitable for children under 14.
* TV-PG (Parental Guidance Suggested): Some material may not be suitable for young children.
* R (Restricted): Viewers under 17 must be accompanied by a parent or adult guardian.
* PG-13 (Parents Strongly Cautioned): Some material may be inappropriate for children under 13.
* TV-Y7 (Directed to Older Children): Intended for children age 7 and above.
* TV-Y (All Children): Suitable for all children.
* PG (Parental Guidance Suggested): Some material may not be suitable for young children.
* TV-G (General Audience): Suitable for all ages.
* NR (Not Rated): The content has not been assigned a specific rating.
* G (General Audiences): Suitable for all ages.
* TV-Y7-FV (Directed to Older Children, Fantasy Violence): Intended for children age 7 and above; contains fantasy violence.
* NC-17 (Adults Only): No one 17 and under admitted.
* UR (Unrated): The content has not been assigned a specific rating or is an unrated version of a rated film.

## Save movies as CSV

In [155]:
# Do not assign the result back: to_csv returns None when a path is given
movies.to_csv('movies.csv', index=False)