<a href="https://colab.research.google.com/github/priyankashinde-DS/Capstone_project-Netflix_Movies-TV_Shows/blob/main/NETFLIX_MOVIES_AND_TV_SHOWS_CLUSTERING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement**
---

This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.

In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets such as IMDb ratings or Rotten Tomatoes scores could also yield many interesting findings.

## <b>In this project, you are required to do the following</b>
1. Exploratory Data Analysis

2. Understanding what type of content is available in different countries

3. Determining whether Netflix has increasingly focused on TV shows rather than movies in recent years

4. Clustering similar content by matching text-based features



# **Attribute Information**

**1.show_id :** Unique ID for every Movie / TV Show

**2.type :** Identifier - A Movie or TV Show

**3.title :** Title of the Movie / TV Show

**4.director :** Director of the movie / show

**5.cast :** Actors involved in the movie / show

**6.country :** Country where the movie / show was produced

**7.date_added :** Date it was added to Netflix

**8.release_year :** Actual release year of the movie / show

**9.rating :** TV rating of the movie / show

**10.duration :** Total duration - in minutes (movies) or number of seasons (TV shows)

**11.listed_in :** Genre

**12.description :** The summary description
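Because `duration` mixes two units (minutes for movies, seasons for TV shows), it usually has to be split before it can be used numerically. The following is a minimal sketch on a small toy frame, not part of the original notebook; the column names `duration_value` and `duration_unit` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Toy rows mimicking the `type` and `duration` columns (illustrative values only).
sample = pd.DataFrame({
    "type": ["TV Show", "Movie", "Movie", "TV Show"],
    "duration": ["4 Seasons", "93 min", "123 min", "1 Season"],
})

# Pull the leading integer out of strings such as "93 min" or "4 Seasons".
sample["duration_value"] = (
    sample["duration"].str.extract(r"(\d+)", expand=False).astype(int)
)

# The unit depends on the title type: minutes for movies, seasons for TV shows.
sample["duration_unit"] = sample["type"].map({"Movie": "min", "TV Show": "seasons"})

print(sample)
```

Keeping the unit in its own column avoids accidentally comparing minutes with seasons later in the analysis.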

In [1]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **1.Importing Libraries**

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

#for nlp
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, KFold
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics import silhouette_samples
import scipy.cluster.hierarchy as sch

import warnings
warnings.filterwarnings('ignore')

# **2.Data importing**

In [3]:
netflix_df=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/CapstoneProject/Unsupervised ML-Clustring/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

# **3.Data Exploration**

In [4]:
# Let's take a look at the top rows of the dataset
netflix_df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [5]:
netflix_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64 
 8   rating        7780 non-null   object
 9   duration      7787 non-null   object
 10  listed_in     7787 non-null   object
 11  description   7787 non-null   object
dtypes: int64(1), object(11)
memory usage: 730.2+ KB
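The summary above shows `date_added` stored as `object` (plain strings) rather than as a datetime type. A possible conversion is sketched below on toy values in the same "Month day, year" style as the rows shown earlier. The leading whitespace in one value and the use of `errors="coerce"` are assumptions made for illustration, not something the notebook confirms.

```python
import pandas as pd

# Toy values in the same style as the `date_added` column (illustrative only).
raw_dates = pd.Series(["August 14, 2020", " December 23, 2016", None])

# Strip stray whitespace, then parse; unparseable or missing values become NaT.
parsed = pd.to_datetime(raw_dates.str.strip(), format="%B %d, %Y", errors="coerce")

# Year and month can then be derived for trend analysis.
year_added = parsed.dt.year
month_added = parsed.dt.month

print(parsed)
```

Parsing the dates this way makes it straightforward to group titles by the year or month they were added.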


In [6]:
# Shape of the dataset
print( 'Number of Rows: {}'.format( netflix_df.shape[0] ) )
print( 'Number of Columns: {}'.format( netflix_df.shape[1] ) )

Number of Rows: 7787
Number of Columns: 12


In [7]:
netflix_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
release_year,7787.0,2013.93258,8.757395,1925.0,2013.0,2017.0,2018.0,2021.0


In [8]:
netflix_df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

# **4. Data Cleaning**

## **4.1 Checking NaN values**

In [9]:
# Check NaN values: total rows, missing count, and missing percentage per column
NaN_df = pd.DataFrame({"No Of Total Values": netflix_df.shape[0], "Missing values count": netflix_df.isnull().sum(),
                    "%age of NaN values": round((netflix_df.isnull().sum() / netflix_df.shape[0]) * 100, 2)})
NaN_df.sort_values("Missing values count", ascending=False)

Unnamed: 0,No Of Total Values,Missing values count,%age of NaN values
director,7787,2389,30.68
cast,7787,718,9.22
country,7787,507,6.51
date_added,7787,10,0.13
rating,7787,7,0.09
show_id,7787,0,0.0
type,7787,0,0.0
title,7787,0,0.0
release_year,7787,0,0.0
duration,7787,0,0.0


**5 columns have missing values, with director missing in nearly one-third (30.68%) of the rows.**
---

* **The director column has the most missing values: 30.68% of its entries are NaN.**

* **The cast column has 9.22% NaN values.**

* **The country, date_added, and rating columns also contain missing values (6.51%, 0.13%, and 0.09%, respectively).**
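The percentages above are each column's missing count divided by the number of rows (for example, 2389 / 7787 ≈ 30.68%). As a minimal sketch, the same summary can be computed with `isna().mean()`, shown here on a toy frame with illustrative values. The names `toy` and `summary` are hypothetical and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (illustrative values only).
toy = pd.DataFrame({
    "director": ["A", np.nan, np.nan, "B"],
    "cast": ["X", "Y", np.nan, "Z"],
    "title": ["t1", "t2", "t3", "t4"],
})

# isna().sum() counts missing cells; isna().mean() gives the missing fraction.
missing_count = toy.isna().sum()
missing_pct = (toy.isna().mean() * 100).round(2)

# Keep only columns that actually have missing values, largest first.
summary = (
    pd.DataFrame({"missing_count": missing_count, "missing_pct": missing_pct})
    .query("missing_count > 0")
    .sort_values("missing_count", ascending=False)
)

print(summary)
```

Filtering out complete columns keeps the table focused on the columns that need cleaning.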

## **4.2 Dealing with NaN values**

In [10]:
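# Replace missing directors and cast with explicit placeholders.
# Assigning the result back avoids relying on inplace=True on a selected column,
# which newer pandas versions warn about and may not apply to the DataFrame.
netflix_df['director'] = netflix_df['director'].fillna("No Director")
netflix_df['cast'] = netflix_df['cast'].fillna("No Cast")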
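
As an alternative sketch, both placeholders can be applied in one call by passing a column-to-value mapping to `DataFrame.fillna`. The toy frame below uses illustrative values only; `toy` and `placeholders` are hypothetical names, and the `country` column is left untouched to show that unlisted columns keep their NaN values.

```python
import numpy as np
import pandas as pd

# Toy frame with missing director, cast, and country values (illustrative only).
toy = pd.DataFrame({
    "director": [np.nan, "B"],
    "cast": ["X", np.nan],
    "country": [np.nan, "Mexico"],
})

# Map each column to its placeholder; columns not listed are not filled.
placeholders = {"director": "No Director", "cast": "No Cast"}
toy = toy.fillna(value=placeholders)

print(toy)
```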

## **4.3 Let's check NaN values in date_added**

In [11]:
data_added_Nan_df = netflix_df[netflix_df['date_added'].isna()]
data_added_Nan_df.head(3)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
258,s259,TV Show,A Young Doctor's Notebook and Other Stories,No Director,"Daniel Radcliffe, Jon Hamm, Adam Godley, Chris...",United Kingdom,,2013,TV-MA,2 Seasons,"British TV Shows, TV Comedies, TV Dramas","Set during the Russian Revolution, this comic ..."
549,s550,TV Show,Anthony Bourdain: Parts Unknown,No Director,Anthony Bourdain,United States,,2018,TV-PG,5 Seasons,Docuseries,This CNN original series has chef Anthony Bour...
2263,s2264,TV Show,Frasier,No Director,"Kelsey Grammer, Jane Leeves, David Hyde Pierce...",United States,,2003,TV-PG,11 Seasons,"Classic & Cult TV, TV Comedies",Frasier Crane is a snooty but lovable Seattle ...


In [12]:
data_added_Nan_df.shape

(10, 12)

* **Only 10 observations contain NaN values in the date_added column.**

In [13]:
print(f"Before dropping the NaN values from date_added the shape was {netflix_df.shape}")
netflix_df.dropna(subset = [ 'date_added' ], inplace = True)
print(f"After dropping the NaN values from date_added now the shape is {netflix_df.shape}")

Before dropping the NaN values from date_added the shape was (7787, 12)
After dropping the NaN values from date_added now the shape is (7777, 12)


## **4.4 Checking duplicate values**

In [14]:
df_duplicate = netflix_df[netflix_df.duplicated()]
print("Let's print all the duplicated rows as a dataframe")
df_duplicate

Let's print all the duplicated rows as a dataframe


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description


* **No duplicate rows are present in this dataset.**
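
Note that `duplicated()` with no arguments compares entire rows, so a unique `show_id` on every row means no full-row duplicate can be found even if the same title was listed twice. A stricter check, sketched below on a toy frame, compares only content columns. The choice of `title`, `type`, and `release_year` as the key is an assumption made for illustration and is not taken from the notebook.

```python
import pandas as pd

# Toy frame: s1 and s3 describe the same title under different IDs (illustrative only).
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["Alpha", "Beta", "Alpha"],
    "type": ["Movie", "Movie", "Movie"],
    "release_year": [2019, 2020, 2019],
})

# Full-row comparison finds nothing because show_id differs on every row.
full_row_dupes = toy.duplicated().sum()

# Comparing only the content columns exposes the repeated title.
key_cols = ["title", "type", "release_year"]  # hypothetical key
content_dupes = toy[toy.duplicated(subset=key_cols, keep=False)]

print(full_row_dupes)
print(content_dupes)
```

Passing `keep=False` marks every member of a duplicate group, which makes the repeated entries easier to inspect side by side.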