### This dataset has two files: one with the titles (titles.csv) and one with the cast and directors (credits.csv) for each title.

### titles.csv contains more than 5,000 unique titles on Netflix, with 15 columns of information:

- id: The title ID on JustWatch.
- title: The name of the title.
- type: SHOW or MOVIE.
- description: A brief description.
- release_year: The release year.
- age_certification: The age certification.
- runtime: The length of the episode (SHOW) or movie.
- genres: A list of genres.
- production_countries: A list of countries that produced the title.
- seasons: Number of seasons if it's a SHOW.
- imdb_id: The title ID on IMDB.
- imdb_score: Score on IMDB.
- imdb_votes: Votes on IMDB.
- tmdb_popularity: Popularity on TMDB.
- tmdb_score: Score on TMDB.

### credits.csv contains more than 77,000 credits of actors and directors on Netflix titles, with 5 columns of information:

- person_id: The person ID on JustWatch.
- id: The title ID on JustWatch.
- name: The actor or director's name.
- character: The character name.
- role: ACTOR or DIRECTOR.

In [1]:
## import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ast import literal_eval
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# read the data
df_titles = pd.read_csv("netflix/titles.csv")
df_credits = pd.read_csv("netflix/credits.csv")

### EDA TASKS

- How many movies/shows are released each year?
- List the different genres in the dataset.
- Understand what content is produced in different countries.
- Identify similar content by matching text-based features.
- Run a network analysis of actors and directors to find interesting insights.
- Has Netflix increasingly focused on TV rather than movies in recent years?
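
The first two tasks can be sketched with a `groupby` and an `explode`. The block below is only an illustration: `toy_titles` is a hypothetical four-row stand-in that borrows the column names of titles.csv, so the numbers it prints say nothing about the real data.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical rows; only the column names match titles.csv
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "type": ["MOVIE", "SHOW", "MOVIE", "MOVIE"],
    "release_year": [2019, 2019, 2020, 2020],
    "genres": ["['crime', 'drama']", "['comedy']", "['drama']", "[]"],
})

# Titles released per year, split by type (rows: year, columns: MOVIE / SHOW)
per_year = (
    toy_titles.groupby(["release_year", "type"])
    .size()
    .unstack(fill_value=0)
)

# Distinct genres: parse the stringified lists, flatten them, drop the
# NaN left behind by empty lists, and de-duplicate
all_genres = sorted(
    toy_titles["genres"].apply(literal_eval).explode().dropna().unique()
)

print(per_year)
print(all_genres)
```

Swapping `toy_titles` for `df_titles` should give the real counts, assuming the `genres` column is still stored as strings at that point.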

In [3]:
# View the first five rows of the titles dataset
df_titles.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,48,['documentation'],['US'],1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,113,"['crime', 'drama']",['US'],,tt0075314,8.3,795222.0,27.612,8.2
2,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['comedy', 'fantasy']",['GB'],,tt0071853,8.2,530877.0,18.216,7.8
3,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,['comedy'],['GB'],,tt0079470,8.0,392419.0,17.505,7.8
4,tm190788,The Exorcist,MOVIE,12-year-old Regan MacNeil begins to adapt an e...,1973,R,133,['horror'],['US'],,tt0070047,8.1,391942.0,95.337,7.7


In [4]:
df_titles[df_titles["id"]=="tm371959"]

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
2501,tm371959,6 Balloons,MOVIE,"Over the course of one night, a woman drives a...",2018,,75,['drama'],['US'],,tt6142496,5.9,4087.0,10.88,6.1


In [5]:
# view the first five rows of the credits data set
df_credits.head()

Unnamed: 0,person_id,id,name,character,role
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR
1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR
2,7064,tm84618,Albert Brooks,Tom,ACTOR
3,3739,tm84618,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR
4,48933,tm84618,Cybill Shepherd,Betsy,ACTOR


In [6]:
# view the dimensions of the titles data set
print(f"There are {df_titles.shape[0]} rows and {df_titles.shape[1]} columns in the titles dataset")

There are 5806 rows and 15 columns in the titles dataset


In [7]:
# view the dimensions of the credits data set
print(f"There are {df_credits.shape[0]} rows and {df_credits.shape[1]} columns in the credits dataset")

There are 77213 rows and 5 columns in the credits dataset


## Merge the datasets

In [8]:
unique_ids = df_credits["id"].unique()
ids = df_credits["id"]
roles = df_credits["role"]
names = df_credits["name"]

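# Lists to store the actors and directors of each title
actors = []
directors = []

# Position in the credits table where the next title's rows begin.
# This single pass assumes all rows for a given id are contiguous.
start = 0

for id_unique in unique_ids:
    title_actors = []
    title_directors = []
    for i in range(start, len(ids)):
        if ids[i] == id_unique:
            if roles[i] == "ACTOR":
                title_actors.append(names[i])
            elif roles[i] == "DIRECTOR":
                title_directors.append(names[i])
        else:
            # First row of the next title: resume from here on the next pass
            start = i
            break
    actors.append(title_actors)
    directors.append(title_directors)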
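
The same per-title lists can also be built with a `groupby`, which does not depend on the credit rows for a title being contiguous. This is a sketch of that alternative, not the approach used in this notebook; `toy_credits` and `names_by_role` are hypothetical names introduced for the example.

```python
import pandas as pd

# Hypothetical credits; note the last "tm1" row is not next to the others
toy_credits = pd.DataFrame({
    "id": ["tm1", "tm1", "tm1", "ts2", "tm1"],
    "name": ["Ann", "Bob", "Cy", "Dee", "Eve"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR", "ACTOR", "ACTOR"],
})


def names_by_role(credits, role, ids):
    """Return, for each id in `ids`, the list of names credited with `role`."""
    subset = credits[credits["role"] == role]
    lists = subset.groupby("id")["name"].agg(list)
    # Titles with no credit in this role become NaN after reindex; use [] instead
    lists = lists.reindex(ids).apply(lambda v: v if isinstance(v, list) else [])
    return lists.tolist()


toy_ids = toy_credits["id"].unique()
toy_people = pd.DataFrame({
    "id": toy_ids,
    "actors": names_by_role(toy_credits, "ACTOR", toy_ids),
    "directors": names_by_role(toy_credits, "DIRECTOR", toy_ids),
})
print(toy_people)
```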

In [9]:
data = {'id': unique_ids, 'actors': actors, 'directors': directors}
new_df = pd.DataFrame(data = data)

In [10]:
df = df_titles.merge(new_df, how='inner', on='id')

df['genres'] = df['genres'].apply(literal_eval)
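
An inner merge keeps only the titles that have at least one credit, which would explain why the titles file has 5806 rows while the similarity matrix computed later is 5434 × 5434. One way to see which titles are dropped is a left merge with `indicator=True`; the frames below are hypothetical toy data.

```python
import pandas as pd

# Hypothetical data: title "C" has no credits at all
toy_titles = pd.DataFrame({"id": ["tm1", "ts2", "tm3"], "title": ["A", "B", "C"]})
toy_people = pd.DataFrame({"id": ["tm1", "ts2"], "actors": [["Ann"], ["Dee"]]})

# indicator=True adds a "_merge" column saying where each row came from
checked = toy_titles.merge(toy_people, how="left", on="id", indicator=True)

# Rows marked "left_only" exist in titles but not in credits,
# so an inner merge would silently drop them
dropped_by_inner = checked.loc[checked["_merge"] == "left_only", "title"].tolist()
print(dropped_by_inner)
```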

In [11]:
df.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,actors,directors
0,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,113,"[crime, drama]",['US'],,tt0075314,8.3,795222.0,27.612,8.2,"[Robert De Niro, Jodie Foster, Albert Brooks, ...",[Martin Scorsese]
1,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"[comedy, fantasy]",['GB'],,tt0071853,8.2,530877.0,18.216,7.8,"[Graham Chapman, John Cleese, Eric Idle, Terry...","[Terry Jones, Terry Gilliam]"
2,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,[comedy],['GB'],,tt0079470,8.0,392419.0,17.505,7.8,"[Graham Chapman, John Cleese, Terry Gilliam, E...",[Terry Jones]
3,tm190788,The Exorcist,MOVIE,12-year-old Regan MacNeil begins to adapt an e...,1973,R,133,[horror],['US'],,tt0070047,8.1,391942.0,95.337,7.7,"[Ellen Burstyn, Linda Blair, Max von Sydow, Le...",[William Friedkin]
4,ts22164,Monty Python's Flying Circus,SHOW,A British sketch comedy series with the shows ...,1969,TV-14,30,"[comedy, european]",['GB'],4.0,tt0063929,8.8,72895.0,12.919,8.3,"[Graham Chapman, Michael Palin, Terry Jones, E...",[]


In [12]:
# select the columns used as features by the recommender system
columns = ["actors", "directors", "genres", "title"]

In [13]:
df.dtypes

id                       object
title                    object
type                     object
description              object
release_year              int64
age_certification        object
runtime                   int64
genres                   object
production_countries     object
seasons                 float64
imdb_id                  object
imdb_score              float64
imdb_votes              float64
tmdb_popularity         float64
tmdb_score              float64
actors                   object
directors                object
dtype: object

In [14]:
df[columns] = df[columns].astype('str')

In [15]:
df[columns].head()

Unnamed: 0,actors,directors,genres,title
0,"['Robert De Niro', 'Jodie Foster', 'Albert Bro...",['Martin Scorsese'],"['crime', 'drama']",Taxi Driver
1,"['Graham Chapman', 'John Cleese', 'Eric Idle',...","['Terry Jones', 'Terry Gilliam']","['comedy', 'fantasy']",Monty Python and the Holy Grail
2,"['Graham Chapman', 'John Cleese', 'Terry Gilli...",['Terry Jones'],['comedy'],Life of Brian
3,"['Ellen Burstyn', 'Linda Blair', 'Max von Sydo...",['William Friedkin'],['horror'],The Exorcist
4,"['Graham Chapman', 'Michael Palin', 'Terry Jon...",[],"['comedy', 'european']",Monty Python's Flying Circus


In [16]:
# remove the surrounding list brackets left over from the string conversion
for column in columns:
    df[column] = df[column].str.strip('[]')

In [17]:
# check if any missing values in the important columns
df[columns].isnull().sum()
    

actors       0
directors    0
genres       0
title        0
dtype: int64

In [18]:
df['title'] = df['title'].fillna("")

In [19]:
# create a function to combine values of important columns into one string
def get_important_features(data):
    important_features = []
    for i in range(0, data.shape[0]):
        important_features.append(data["actors"][i]+" "+data["directors"][i]+" "+data["genres"][i]+" "+data["title"][i])
    return important_features

In [20]:
# create column to hold combined string
df["important_features"] = get_important_features(df)
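
The loop in `get_important_features` looks rows up by label (`data["actors"][i]`), which works here because the merged frame has the default 0..n-1 index. A vectorized concatenation avoids that dependency. The sketch below uses a hypothetical `toy` frame with a deliberately non-default index.

```python
import pandas as pd

# Hypothetical rows shaped like the stringified feature columns
toy = pd.DataFrame(
    {
        "actors": ["'Ann Lee', 'Bob Ray'", "'Dee Fox'"],
        "directors": ["'Cy Orr'", ""],
        "genres": ["'crime', 'drama'", "'comedy'"],
        "title": ["Night Shift", "Day Off"],
    },
    index=[10, 20],  # non-default index on purpose
)

# Column-wise string concatenation aligns on the index, so no positional loop is needed
combined = (
    toy["actors"] + " " + toy["directors"] + " " + toy["genres"] + " " + toy["title"]
)
print(combined.loc[10])
```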

In [21]:
df.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,actors,directors,important_features
0,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,113,"'crime', 'drama'",['US'],,tt0075314,8.3,795222.0,27.612,8.2,"'Robert De Niro', 'Jodie Foster', 'Albert Broo...",'Martin Scorsese',"'Robert De Niro', 'Jodie Foster', 'Albert Broo..."
1,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"'comedy', 'fantasy'",['GB'],,tt0071853,8.2,530877.0,18.216,7.8,"'Graham Chapman', 'John Cleese', 'Eric Idle', ...","'Terry Jones', 'Terry Gilliam'","'Graham Chapman', 'John Cleese', 'Eric Idle', ..."
2,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,'comedy',['GB'],,tt0079470,8.0,392419.0,17.505,7.8,"'Graham Chapman', 'John Cleese', 'Terry Gillia...",'Terry Jones',"'Graham Chapman', 'John Cleese', 'Terry Gillia..."
3,tm190788,The Exorcist,MOVIE,12-year-old Regan MacNeil begins to adapt an e...,1973,R,133,'horror',['US'],,tt0070047,8.1,391942.0,95.337,7.7,"'Ellen Burstyn', 'Linda Blair', 'Max von Sydow...",'William Friedkin',"'Ellen Burstyn', 'Linda Blair', 'Max von Sydow..."
4,ts22164,Monty Python's Flying Circus,SHOW,A British sketch comedy series with the shows ...,1969,TV-14,30,"'comedy', 'european'",['GB'],4.0,tt0063929,8.8,72895.0,12.919,8.3,"'Graham Chapman', 'Michael Palin', 'Terry Jone...",,"'Graham Chapman', 'Michael Palin', 'Terry Jone..."


In [22]:
df["important_features"].head()

0    'Robert De Niro', 'Jodie Foster', 'Albert Broo...
1    'Graham Chapman', 'John Cleese', 'Eric Idle', ...
2    'Graham Chapman', 'John Cleese', 'Terry Gillia...
3    'Ellen Burstyn', 'Linda Blair', 'Max von Sydow...
4    'Graham Chapman', 'Michael Palin', 'Terry Jone...
Name: important_features, dtype: object

# Recommendation System

In [23]:
# convert text to matrix of token counts

cm = CountVectorizer().fit_transform(df["important_features"])


In [24]:
# Get cosine matrix from the count matrix
cs = cosine_similarity(cm)
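
To make the two steps concrete, here is a tiny worked example on three hypothetical feature strings (`docs` is invented for illustration). Two strings that share one of their three words score 1/3, and strings with no words in common score 0.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature strings, three words each
docs = ["nolan war drama", "nolan scifi thriller", "cooking show host"]

# Each row counts how often each vocabulary word occurs in one string
counts = CountVectorizer().fit_transform(docs)

# Cosine similarity = dot product of two rows / product of their lengths
sim = cosine_similarity(counts)

# docs[0] and docs[1] share only "nolan": 1 / (sqrt(3) * sqrt(3)) = 1/3
print(sim.round(3))
```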

In [25]:
# print the similarity matrix
print(cs)

[[1.         0.00819232 0.03665083 ... 0.         0.01796053 0.01581139]
 [0.00819232 1.         0.53045108 ... 0.         0.01471384 0.01295319]
 [0.03665083 0.53045108 1.         ... 0.         0.02194228 0.        ]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.01796053 0.01471384 0.02194228 ... 0.         1.         0.        ]
 [0.01581139 0.01295319 0.         ... 0.         0.         1.        ]]


In [26]:
# Get the shape of the cosine similarity matrix

cs.shape

(5434, 5434)

In [27]:
# Construct a reverse map from title to row index, keeping the first row of any duplicated title
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]
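
A note on duplicated titles in a title-to-index lookup like `indices`: `Series.drop_duplicates()` compares the values (here the row indices, which are already unique), not the index labels, so it does not remove repeated titles. The sketch below shows the difference on a hypothetical three-row frame.

```python
import pandas as pd

# Hypothetical catalogue with one repeated title
toy = pd.DataFrame({"title": ["Legend", "Argo", "Legend"]})
lookup = pd.Series(toy.index, index=toy["title"])

# The values 0, 1, 2 are all distinct, so nothing is dropped
values_deduped = lookup.drop_duplicates()

# Looking up a repeated title therefore returns a Series, not a single integer
repeated = lookup["Legend"]

# De-duplicating on the index labels keeps the first row of each title
first_only = lookup[~lookup.index.duplicated(keep="first")]
print(first_only)
```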

In [28]:
# Function that takes a title as input and prints the most similar titles
def get_recommendations(title, cs=cs):
    # Get the row index of the title
    idx = indices[title]

    # Pair every row index with its similarity score to the input title
    scores = list(enumerate(cs[idx]))

    # Sort the pairs by similarity score, highest first
    sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar titles, skipping position 0 (normally the title itself)
    sorted_scores = sorted_scores[1:11]

    # Print the ten most similar titles with their rank
    print("The 10 most recommended movies to ", title, "are:\n")
    for rank, (movie_idx, _) in enumerate(sorted_scores, start=1):
        print(rank, df["title"].iloc[movie_idx])
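
For reuse outside the notebook, a variant that returns the titles instead of printing them can be handy. The following is a minimal sketch under stated assumptions, not a replacement for the function above: `top_similar`, `toy_titles` and `toy_sim` are hypothetical, and the similarity values are made up.

```python
import numpy as np
import pandas as pd


def top_similar(title, titles, sim, n=10):
    """Return up to `n` titles most similar to `title`, excluding the title itself."""
    matches = np.flatnonzero(titles.to_numpy() == title)
    if matches.size == 0:
        raise KeyError(f"{title!r} is not in the catalogue")
    idx = matches[0]  # use the first row if the title is duplicated
    # Sort positions by descending similarity, then drop the query row explicitly
    order = np.argsort(-sim[idx], kind="stable")
    order = order[order != idx][:n]
    return titles.iloc[order].tolist()


# Hypothetical titles and a made-up symmetric similarity matrix
toy_titles = pd.Series(["Inception", "Dunkirk", "Argo", "Top Gun"])
toy_sim = np.array([
    [1.0, 0.6, 0.2, 0.1],
    [0.6, 1.0, 0.3, 0.4],
    [0.2, 0.3, 1.0, 0.0],
    [0.1, 0.4, 0.0, 1.0],
])

print(top_similar("Inception", toy_titles, toy_sim, n=2))
```

Removing the query row by position, rather than assuming it sorts first, keeps the result correct even when another title has an identical feature string.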

In [29]:
get_recommendations("Fatherhood")

The 10 most recommended movies to  Fatherhood are:

1 Legend
2 Kevin Hart's Guide to Black History
3 Julie's Greenroom
4 She's Gotta Have It
5 Horse Girl
6 Good Sam
7 BoJack Horseman
8 Pee-wee's Big Holiday
9 Harold & Kumar Go to White Castle
10 Kevin Hart: Irresponsible


In [30]:
get_recommendations("Inception")

The 10 most recommended movies to  Inception are:

1 Dunkirk
2 Teenage Mutant Ninja Turtles II: The Secret of the Ooze
3 Argo
4 Blade: Trinity
5 Alien Warfare
6 Body of Lies
7 LEGO DC Comics Super Heroes: Batman Be-Leaguered
8 Top Gun
9 The Imitation Game
10 The Frozen Ground


In [31]:
get_recommendations("Dunkirk")

The 10 most recommended movies to  Dunkirk are:

1 Inception
2 Darkest Hour
3 The King
4 Talking Tom and Friends
5 The Imitation Game
6 LEGO DC Comics Super Heroes: Batman Be-Leaguered
7 Django Unchained
8 American Ultra
9 The Irishman
10 Top Gun


In [32]:
get_recommendations("Darkest Hour")

The 10 most recommended movies to  Darkest Hour are:

1 The Irishman
2 The Dig
3 Saudi Arabia Uncovered
4 Sherlock Holmes
5 Contagion
6 The Bourne Ultimatum
7 Dunkirk
8 In a Valley of Violence
9 Public Enemies
10 Les Misérables
