<a href="https://colab.research.google.com/github/kanakpriyatiwari/Netflix-And-TV-Show-Clustering/blob/main/Netflix_And_TV_Show_Clustering_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name -**   **Netflix And TV Show Clustering**



---






# **Project type -** **Unsupervised Clustering and Recommendation System**


# **Contribution - Individual**

# **Name -** **Kanak Priya Tiwari**


# **Problem Statement** - 



---
This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.

In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets such as IMDb ratings and Rotten Tomatoes scores could also yield interesting findings.


**Our Goal**

---


By creating clusters, we can see which shows are alike and which differ from one another. These clusters can then be used to give customers individualized show recommendations based on their preferences.

This project aims to group Netflix shows into clusters such that shows in the same cluster are similar to one another and shows in different clusters are dissimilar.


# **In this project, you are required to do**

**Exploratory Data Analysis**

**Understanding what type of content is available in different countries**

**Determining whether Netflix has increasingly focused on TV rather than movies in recent years**

**Clustering similar content by matching text-based features**
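
To make the last task concrete, here is a minimal sketch of clustering by text features. It is not the notebook's final pipeline: the four descriptions are hypothetical toy strings, and `k=2` is picked by hand for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy descriptions (hypothetical, not taken from the dataset).
toy_descriptions = [
    "A detective hunts a killer in a dark city",
    "A police detective hunts a killer in the city",
    "Talking animals go on a funny family adventure",
    "Talking animals share a funny adventure with kids",
]

# Turn each description into a TF-IDF vector, dropping English stop words.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(toy_descriptions)

# Group the vectors into two clusters; k=2 is an assumption for this toy set.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = kmeans.fit_predict(tfidf_matrix)

for text, label in zip(toy_descriptions, toy_labels):
    print(label, text)
```

Descriptions that share vocabulary end up with the same label, which is the behaviour the project relies on when it clusters real titles.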

# **Attribute Information**

1. show_id : Unique ID for every Movie / TV Show

2. type : Identifier - A Movie or TV Show

3. title : Title of the Movie / TV Show

4. director : Director of the Movie

5. cast : Actors involved in the movie / show

6. country : Country where the movie / show was produced

7. date_added : Date it was added on Netflix

8. release_year : Actual release year of the movie / show

9. rating : TV Rating of the movie / show

10. duration : Total Duration - in minutes or number of seasons

11. listed_in : Genre

12. description : The summary description
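
Because `duration` mixes two units (minutes for movies, seasons for TV shows), it usually needs to be split before any numeric analysis. The sketch below shows one possible way to do this on a small toy frame; the column names `duration_value` and `duration_unit` are hypothetical and not part of the dataset.

```python
import pandas as pd

# Toy rows mimicking the `type` and `duration` columns described above.
toy = pd.DataFrame({
    "type": ["TV Show", "Movie", "TV Show"],
    "duration": ["4 Seasons", "93 min", "1 Season"],
})

# Pull the leading integer out of strings such as "93 min" or "4 Seasons".
toy["duration_value"] = (
    toy["duration"].str.extract(r"(\d+)", expand=False).astype(int)
)

# Record the unit so that minutes and seasons are never compared directly.
toy["duration_unit"] = toy["type"].map({"Movie": "min", "TV Show": "season"})

print(toy)
```

Keeping the unit in its own column makes it easy to analyse movies and TV shows separately later on.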

# **GitHub Link -**


## ***Let's Begin !***

## **Importing Required libraries**

---



In [2]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Word Cloud library
from wordcloud import WordCloud, STOPWORDS

# libraries used to process textual data
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

# libraries used to implement clusters
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as shc

# library used to construct a recommendation system
# (TfidfVectorizer is already imported above)
from sklearn.metrics.pairwise import cosine_similarity

# suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


**Loading the dataset**

In [3]:
from google.colab import drive 
drive.mount('/content/drive/')

Mounted at /content/drive/


In [4]:
# LOADING DATASET 
Netflix_df = pd.read_csv("/content/sample_data/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv")

### Dataset First View

In [5]:
# LET'S SEE TOP 5 ROWS 
Netflix_df.head()

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [7]:
# LET'S SEE BOTTOM 5 ROWS 
Netflix_df.tail(5)

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
2712,s2713,TV Show,Herrens veje,,"Lars Mikkelsen, Ann Eleonora Jorgensen, Simon ...",Denmark,"June 1, 2019",2018,TV-MA,2 Seasons,"International TV Shows, TV Dramas",A family with a storied history of service to ...
2713,s2714,Movie,Hey Arnold! The Jungle Movie,"Raymie Muzquiz, Stu Livingston","Mason Vale Cotton, Benjamin Flores Jr., France...","United States, South Korea, Japan","November 2, 2019",2017,TV-PG,81 min,"Children & Family Movies, Comedies",When Arnold and his crew win a trip to San Lor...
2714,s2715,TV Show,"Hi Bye, Mama!",,"Kim Tae-hee, Lee Kyoo-hyung, Go Bo-gyeol, Shin...",South Korea,"February 23, 2020",2020,TV-14,1 Season,"International TV Shows, Korean TV Shows, Roman...",When the ghost of a woman gains a second chanc...
2715,s2716,TV Show,Hi Score Girl,,"Kohei Amasaki, Sayumi Suzushiro, Yuuki Hirose,...",Japan,"April 9, 2020",2019,TV-14,2 Seasons,"Anime Series, International TV Shows, Romantic...",A chronic gamer abysmally inept in academics a...
2716,s2717,TV Show,Hibana: Spark,,"Kento Hayashi, Kazuki Namioka, Mugi Kadowaki, ...",Japan,"June 2, 2016",2016,TV-MA,1 Season,"International TV Shows, TV Dramas",A dramatic series about friendship and conflic...


**Dataset Rows & Columns Count**

In [8]:
Netflix_df.shape

(2717, 12)

In [17]:
# Use an f-string so the counts print as plain numbers rather than as sets
print(f"Number of Rows {Netflix_df.shape[0]}\nNumber of Columns {Netflix_df.shape[1]}")

Number of Rows 2717
Number of Columns 12


In [20]:
# COLUMN NAMES 
Netflix_df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

### Dataset Information

In [18]:
# Dataset Info
Netflix_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2717 entries, 0 to 2716
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       2717 non-null   object
 1   type          2717 non-null   object
 2   title         2717 non-null   object
 3   director      1899 non-null   object
 4   cast          2466 non-null   object
 5   country       2534 non-null   object
 6   date_added    2712 non-null   object
 7   release_year  2717 non-null   int64 
 8   rating        2715 non-null   object
 9   duration      2717 non-null   object
 10  listed_in     2717 non-null   object
 11  description   2717 non-null   object
dtypes: int64(1), object(11)
memory usage: 254.8+ KB
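
The summary above shows `date_added` stored as `object` (plain strings), so any time-based analysis needs it converted first. The following is a minimal sketch of that conversion on toy values; the `format` string is an assumption based on the "Month D, YYYY" pattern visible in the first rows of the dataset.

```python
import pandas as pd

# Toy values in the same "Month D, YYYY" format seen in the rows above.
toy_dates = pd.Series(["August 14, 2020", "December 23, 2016", None])

# Strip stray whitespace, then parse; missing or unparseable entries become NaT.
parsed = pd.to_datetime(
    toy_dates.str.strip(), format="%B %d, %Y", errors="coerce"
)

print(parsed)
```

With `errors="coerce"`, bad values do not raise an exception; they turn into `NaT` and can be counted with `parsed.isna().sum()`.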


**Duplicate Values**


Removing duplicate rows saves time and compute, because the machine learning model is not fed the same record more than once.

In [23]:
# LET'S SEE DUPLICATE VALUES

# duplicated() flags rows that exactly repeat an earlier row
duplicate = Netflix_df.duplicated().sum()
print("The number of duplicate rows is:", duplicate)

The number of duplicate rows is: 0


We found that there were no duplicate entries in the above data.
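
One caveat: `duplicated()` with no arguments compares entire rows, and `show_id` is unique by definition, so a title listed twice under different IDs would not be flagged. A stricter check, sketched here on hypothetical toy data, compares only the content columns.

```python
import pandas as pd

# Toy frame: the first two rows differ only in show_id.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "Movie", "TV Show"],
    "title": ["Example Title", "Example Title", "Another Title"],
})

# Full-row comparison: show_id makes every row unique.
full_row_dupes = toy.duplicated().sum()

# Content-only comparison: ignores show_id and catches the repeated title.
content_dupes = toy.duplicated(subset=["type", "title"]).sum()

print(full_row_dupes, content_dupes)
```

Which columns belong in `subset` is a judgment call; `type` and `title` are used here only as an example.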

**Missing Values/Null Values**


Real-world data frequently contains missing values, typically because records were corrupted or never collected. Since many machine-learning algorithms do not support missing values, they must be handled during pre-processing. We therefore begin by counting the missing values in each column.

In [26]:
null_values = Netflix_df.isnull().sum()
# Print the label on its own line so the per-column counts stay aligned
print("Null values per column:")
print(null_values)

Null values per column:
show_id           0
type              0
title             0
director        818
cast            251
country         183
date_added        5
release_year      0
rating            2
duration          0
listed_in         0
description       0
dtype: int64


In [27]:
# Missing Values Percentage
round(Netflix_df.isna().sum()/len(Netflix_df)*100, 2)

show_id          0.00
type             0.00
title            0.00
director        30.11
cast             9.24
country          6.74
date_added       0.18
release_year     0.00
rating           0.07
duration         0.00
listed_in        0.00
description      0.00
dtype: float64
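
The percentages suggest two different treatments: `director`, `cast`, and `country` have too many gaps to drop, while `date_added` and `rating` are missing in well under 1% of rows. The sketch below illustrates one common strategy on toy data. It is an assumption for illustration, not necessarily the approach taken later in this notebook, and the placeholder value `"Unknown"` is hypothetical.

```python
import pandas as pd

# Toy frame with gaps in the same columns that have nulls in the dataset.
toy = pd.DataFrame({
    "director": ["Jorge Michel Grau", None, None],
    "cast": ["Demián Bichir", None, "Elijah Wood"],
    "country": ["Mexico", "Brazil", None],
    "date_added": ["December 23, 2016", "August 14, 2020", None],
    "rating": ["TV-MA", "TV-MA", "PG-13"],
})

# Columns with many gaps: keep the rows and mark the gap explicitly.
fill_cols = ["director", "cast", "country"]
toy[fill_cols] = toy[fill_cols].fillna("Unknown")

# Columns with very few gaps: dropping the affected rows loses little data.
cleaned = toy.dropna(subset=["date_added", "rating"]).reset_index(drop=True)

print(cleaned)
```

Filling with a placeholder keeps every title available for clustering, at the cost of introducing an artificial "Unknown" category that should be excluded from per-director or per-country counts.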