# Machine Learning: Building a Similarity-Based Recommender System for Films
#### by: Nadia Fitriana Latifah

## Overview
#### In the previous machine learning project, a film recommender was built from the IMDB average rating: the films were sorted by that value in descending order to find the film or films that viewers rated most highly.
#### In this project, the recommender uses the content (features) of each film instead. It computes the similarity between one film and every other film, so that selecting one film returns other films that resemble it. This approach is called a Content Based Recommender System.
#### Similarity can be based on genre, for example: if a viewer likes Narnia, a Content Based Recommender System would recommend films such as Harry Potter or The Lord of the Rings, which have similar genres.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df_film_rating=pd.read_csv('movie_rating_df.csv')

In [3]:
df_film_rating

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0000001,short,Carmencita,Carmencita,0,1894.0,,1.0,"Documentary,Short",5.6,1608
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892.0,,5.0,"Animation,Short",6.0,197
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892.0,,4.0,"Animation,Comedy,Romance",6.5,1285
3,tt0000004,short,Un bon bock,Un bon bock,0,1892.0,,12.0,"Animation,Short",6.1,121
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893.0,,1.0,"Comedy,Short",6.1,2050
...,...,...,...,...,...,...,...,...,...,...,...
751609,tt9916538,movie,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,0,2019.0,,123.0,,8.4,5
751610,tt9916544,short,My Sweet Prince,My Sweet Prince,0,2019.0,,12.0,"Drama,Short",7.2,19
751611,tt9916576,tvEpisode,Destinee's Story,Destinee's Story,0,2019.0,,85.0,,6.0,9
751612,tt9916720,short,The Nun 2,The Nun 2,0,2019.0,,10.0,"Comedy,Horror,Mystery",5.6,49


In [4]:
df_film_rating.shape

(751614, 11)

In [5]:
df_film_rating.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0000001,short,Carmencita,Carmencita,0,1894.0,,1.0,"Documentary,Short",5.6,1608
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892.0,,5.0,"Animation,Short",6.0,197
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892.0,,4.0,"Animation,Comedy,Romance",6.5,1285
3,tt0000004,short,Un bon bock,Un bon bock,0,1892.0,,12.0,"Animation,Short",6.1,121
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893.0,,1.0,"Comedy,Short",6.1,2050


#### Check the data type of each column in movie_rating_df

In [6]:
df_film_rating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 751614 entries, 0 to 751613
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          751614 non-null  object 
 1   titleType       751614 non-null  object 
 2   primaryTitle    751614 non-null  object 
 3   originalTitle   751614 non-null  object 
 4   isAdult         751614 non-null  int64  
 5   startYear       751614 non-null  float64
 6   endYear         16072 non-null   float64
 7   runtimeMinutes  751614 non-null  float64
 8   genres          486766 non-null  object 
 9   averageRating   751614 non-null  float64
 10  numVotes        751614 non-null  int64  
dtypes: float64(4), int64(2), object(5)
memory usage: 63.1+ MB


### Add Actors DataFrame
#### The previous machine learning project produced a list of films with metadata such as isAdult, runtimeMinutes, and genres for each film.
#### This project adds further metadata, such as the actors and actresses who appear in each film. This metadata comes from separate DataFrames that are later joined with the df_film_rating DataFrame.

In [7]:
df_nama_aktor=pd.read_csv('actor_name.csv')

In [8]:
df_nama_aktor

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm1774132,Nathan McLaughlin,1973,\N,"special_effects,make_up_department","tt0417686,tt1713976,tt1891860,tt0454839"
1,nm10683464,Bridge Andrew,\N,\N,actor,tt7718088
2,nm1021485,Brandon Fransvaag,\N,\N,miscellaneous,tt0168790
3,nm6940929,Erwin van der Lely,\N,\N,miscellaneous,tt4232168
4,nm5764974,Svetlana Shypitsyna,\N,\N,actress,tt3014168
...,...,...,...,...,...,...
995,nm7596674,Paul Whitrow,\N,\N,actor,"tt4118352,tt9104322,tt4447090,tt4892804"
996,nm5938546,Wendy Ponce,\N,\N,,tt2125666
997,nm2101810,Ans Brugmans,\N,\N,costume_designer,tt0488280
998,nm5245804,Eliza Jenkins,\N,\N,,tt1464058


In [9]:
df_nama_aktor.shape

(1000, 6)

In [10]:
df_nama_aktor.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm1774132,Nathan McLaughlin,1973,\N,"special_effects,make_up_department","tt0417686,tt1713976,tt1891860,tt0454839"
1,nm10683464,Bridge Andrew,\N,\N,actor,tt7718088
2,nm1021485,Brandon Fransvaag,\N,\N,miscellaneous,tt0168790
3,nm6940929,Erwin van der Lely,\N,\N,miscellaneous,tt4232168
4,nm5764974,Svetlana Shypitsyna,\N,\N,actress,tt3014168


#### Check the data type of each column in actor_name

In [11]:
df_nama_aktor.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   nconst             1000 non-null   object
 1   primaryName        1000 non-null   object
 2   birthYear          1000 non-null   object
 3   deathYear          1000 non-null   object
 4   primaryProfession  891 non-null    object
 5   knownForTitles     1000 non-null   object
dtypes: object(6)
memory usage: 47.0+ KB
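
#### The info output above reports birthYear and deathYear as fully non-null object columns, even though the preview shows the placeholder \N in many rows. The sketch below shows one way to have pandas treat that marker as missing while reading. The toy rows and the name df_toy are hypothetical and are not taken from actor_name.csv.

```python
import io
import pandas as pd

# Toy CSV that mimics the layout of actor_name.csv (illustrative rows only)
raw = io.StringIO(
    "nconst,primaryName,birthYear,deathYear\n"
    "nm0000001,Actor A,1973,\\N\n"
    "nm0000002,Actor B,\\N,\\N\n"
)

# Treat the literal two-character marker \N as a missing value while parsing
df_toy = pd.read_csv(raw, na_values=['\\N'])

print(df_toy.dtypes)        # birthYear is now numeric instead of object
print(df_toy.isna().sum())  # missing values are counted per column
```

#### This project only keeps nconst, primaryName, and knownForTitles from the actor table, so the placeholder does not affect the later steps; the sketch matters mainly if the year columns are reused.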


### Add Directors and Writers DataFrame
#### This step adds a DataFrame that contains the directors and writers of each film.

In [12]:
df_director_writer=pd.read_csv('directors_writers.csv')

In [13]:
df_director_writer

Unnamed: 0,tconst,director_name,writer_name
0,tt0011414,David Kirkland,"John Emerson,Anita Loos"
1,tt0011890,Roy William Neill,"Arthur F. Goodrich,Burns Mantle,Mary Murillo"
2,tt0014341,"Buster Keaton,John G. Blystone","Jean C. Havez,Clyde Bruckman,Joseph A. Mitchell"
3,tt0018054,Cecil B. DeMille,Jeanie Macpherson
4,tt0024151,James Cruze,"Max Miller,Wells Root,Jack Jevne"
...,...,...,...
981,tt9236688,Kai Wessel,Christian Jeltsch
982,tt9278408,Bahadir Ince,"Levent Cantek,Ali Demirel,Baris Erdogan"
983,tt9285882,Rapman,Rapman
984,tt9310372,Sujoy Ghosh,"Sujoy Ghosh,Raj Vasant,Pratim D. Gupta,Suresh ..."


In [14]:
df_director_writer.shape

(986, 3)

In [15]:
df_director_writer.head()

Unnamed: 0,tconst,director_name,writer_name
0,tt0011414,David Kirkland,"John Emerson,Anita Loos"
1,tt0011890,Roy William Neill,"Arthur F. Goodrich,Burns Mantle,Mary Murillo"
2,tt0014341,"Buster Keaton,John G. Blystone","Jean C. Havez,Clyde Bruckman,Joseph A. Mitchell"
3,tt0018054,Cecil B. DeMille,Jeanie Macpherson
4,tt0024151,James Cruze,"Max Miller,Wells Root,Jack Jevne"


#### Check the data type of each column in directors_writers

In [16]:
df_director_writer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 986 entries, 0 to 985
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   tconst         986 non-null    object
 1   director_name  986 non-null    object
 2   writer_name    986 non-null    object
dtypes: object(3)
memory usage: 23.2+ KB


### Converting into List Data
#### The info output for directors_writers shows that the DataFrame contains no NULL values. The next step is to convert director_name and writer_name from comma-separated strings into lists.
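
#### As a minimal sketch of the string-to-list conversion, the example below uses the vectorised .str.split accessor on a toy DataFrame. Unlike a lambda that calls row.split(','), this accessor leaves a missing value as missing instead of raising an error, which is useful when a column is not guaranteed to be free of NULL values. The names and IDs in df_toy are hypothetical.

```python
import pandas as pd

# Toy data: two comma-separated strings and one missing value
df_toy = pd.DataFrame({
    'tconst': ['tt001', 'tt002', 'tt003'],
    'writer_name': ['Writer A,Writer B', 'Writer C', None],
})

# Split each string on commas; the missing entry stays missing
df_toy['writer_name'] = df_toy['writer_name'].str.split(',')

print(df_toy)
```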

In [17]:
df_director_writer['director_name']=df_director_writer['director_name'].apply(lambda row: row.split(','))
df_director_writer['writer_name']=df_director_writer['writer_name'].apply(lambda row: row.split(','))

In [18]:
df_director_writer

Unnamed: 0,tconst,director_name,writer_name
0,tt0011414,[David Kirkland],"[John Emerson, Anita Loos]"
1,tt0011890,[Roy William Neill],"[Arthur F. Goodrich, Burns Mantle, Mary Murillo]"
2,tt0014341,"[Buster Keaton, John G. Blystone]","[Jean C. Havez, Clyde Bruckman, Joseph A. Mitc..."
3,tt0018054,[Cecil B. DeMille],[Jeanie Macpherson]
4,tt0024151,[James Cruze],"[Max Miller, Wells Root, Jack Jevne]"
...,...,...,...
981,tt9236688,[Kai Wessel],[Christian Jeltsch]
982,tt9278408,[Bahadir Ince],"[Levent Cantek, Ali Demirel, Baris Erdogan]"
983,tt9285882,[Rapman],[Rapman]
984,tt9310372,[Sujoy Ghosh],"[Sujoy Ghosh, Raj Vasant, Pratim D. Gupta, Sur..."


In [19]:
df_director_writer.head()

Unnamed: 0,tconst,director_name,writer_name
0,tt0011414,[David Kirkland],"[John Emerson, Anita Loos]"
1,tt0011890,[Roy William Neill],"[Arthur F. Goodrich, Burns Mantle, Mary Murillo]"
2,tt0014341,"[Buster Keaton, John G. Blystone]","[Jean C. Havez, Clyde Bruckman, Joseph A. Mitc..."
3,tt0018054,[Cecil B. DeMille],[Jeanie Macpherson]
4,tt0024151,[James Cruze],"[Max Miller, Wells Root, Jack Jevne]"


### Cleaning and Processing Table Cast
#### This stage is for removing data with NULL values and dropping the columns that are not used.
#### For df_nama_aktor, only the nconst, primaryName, and knownForTitles columns are needed.

In [20]:
df_nama_aktor=df_nama_aktor[['nconst','primaryName','knownForTitles']]

In [21]:
df_nama_aktor

Unnamed: 0,nconst,primaryName,knownForTitles
0,nm1774132,Nathan McLaughlin,"tt0417686,tt1713976,tt1891860,tt0454839"
1,nm10683464,Bridge Andrew,tt7718088
2,nm1021485,Brandon Fransvaag,tt0168790
3,nm6940929,Erwin van der Lely,tt4232168
4,nm5764974,Svetlana Shypitsyna,tt3014168
...,...,...,...
995,nm7596674,Paul Whitrow,"tt4118352,tt9104322,tt4447090,tt4892804"
996,nm5938546,Wendy Ponce,tt2125666
997,nm2101810,Ans Brugmans,tt0488280
998,nm5245804,Eliza Jenkins,tt1464058


In [22]:
df_nama_aktor.head()

Unnamed: 0,nconst,primaryName,knownForTitles
0,nm1774132,Nathan McLaughlin,"tt0417686,tt1713976,tt1891860,tt0454839"
1,nm10683464,Bridge Andrew,tt7718088
2,nm1021485,Brandon Fransvaag,tt0168790
3,nm6940929,Erwin van der Lely,tt4232168
4,nm5764974,Svetlana Shypitsyna,tt3014168


### Actors per Film
#### The next step is to find out how the number of films per actor varies.
#### An actor can appear in more than one film, so a table is needed that relates each actor to each film title one-to-one. The table therefore has to be unnested.
#### The steps in this stage are:
#### 1. Check how the number of films per actor varies.
#### 2. Convert the 'knownForTitles' column from a string into a list.

In [23]:
#Check how the number of films per actor varies
print(df_nama_aktor['knownForTitles'].apply(lambda x: len(x.split(','))).unique())

[4 1 2 3]


In [24]:
#Convert the 'knownForTitles' column from a comma-separated string into a list.
#assign returns a new DataFrame, which avoids the SettingWithCopyWarning raised when writing into a slice.
df_nama_aktor=df_nama_aktor.assign(knownForTitles=df_nama_aktor['knownForTitles'].apply(lambda x: x.split(',')))


In [25]:
df_nama_aktor

Unnamed: 0,nconst,primaryName,knownForTitles
0,nm1774132,Nathan McLaughlin,"[tt0417686, tt1713976, tt1891860, tt0454839]"
1,nm10683464,Bridge Andrew,[tt7718088]
2,nm1021485,Brandon Fransvaag,[tt0168790]
3,nm6940929,Erwin van der Lely,[tt4232168]
4,nm5764974,Svetlana Shypitsyna,[tt3014168]
...,...,...,...
995,nm7596674,Paul Whitrow,"[tt4118352, tt9104322, tt4447090, tt4892804]"
996,nm5938546,Wendy Ponce,[tt2125666]
997,nm2101810,Ans Brugmans,[tt0488280]
998,nm5245804,Eliza Jenkins,[tt1464058]


In [26]:
df_nama_aktor.head()

Unnamed: 0,nconst,primaryName,knownForTitles
0,nm1774132,Nathan McLaughlin,"[tt0417686, tt1713976, tt1891860, tt0454839]"
1,nm10683464,Bridge Andrew,[tt7718088]
2,nm1021485,Brandon Fransvaag,[tt0168790]
3,nm6940929,Erwin van der Lely,[tt4232168]
4,nm5764974,Svetlana Shypitsyna,[tt3014168]


### Building the One-to-One Relation
#### The previous output shows that one actor can be known for 1 to 4 films, so a table is needed with a one-to-one correspondence from each actor to each of those film titles.

In [27]:
#Prepare a bucket for the intermediate DataFrames
df_uni=[]

for x in ['knownForTitles']:
    #Repeat each row index once per title in that row's list
    idx=df_nama_aktor.index.repeat(df_nama_aktor['knownForTitles'].str.len())
    
    #Flatten the lists of all rows into a single column of a new DataFrame
    df_value_list=pd.DataFrame({
        x: np.concatenate(df_nama_aktor[x].values)
    })
    
    #Replace the index with idx so that each title points back to its source row
    df_value_list.index=idx
    
    #Append the resulting DataFrame to the bucket
    df_uni.append(df_value_list)
    
#Combine all DataFrames into one
df_concat=pd.concat(df_uni, axis=1)

#Left join with the remaining columns of the original DataFrame
#(axis is passed by keyword; recent pandas versions no longer accept the positional form drop([...], 1))
df_unnested=df_concat.join(df_nama_aktor.drop(['knownForTitles'], axis=1), how='left')

#Select the columns in the same order as the original DataFrame
df_unnested=df_unnested[df_nama_aktor.columns.tolist()]
df_unnested

Unnamed: 0,nconst,primaryName,knownForTitles
0,nm1774132,Nathan McLaughlin,tt0417686
0,nm1774132,Nathan McLaughlin,tt1713976
0,nm1774132,Nathan McLaughlin,tt1891860
0,nm1774132,Nathan McLaughlin,tt0454839
1,nm10683464,Bridge Andrew,tt7718088
...,...,...,...
998,nm5245804,Eliza Jenkins,tt1464058
999,nm0948460,Greg Yolen,tt0436869
999,nm0948460,Greg Yolen,tt0476663
999,nm0948460,Greg Yolen,tt0109723


In [28]:
df_unnested.shape

(1918, 3)
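
#### The same unnesting can also be sketched with DataFrame.explode, which repeats the other columns once per list element. This is an alternative to the loop above, not the method used in this notebook, and the rows in df_toy are hypothetical.

```python
import pandas as pd

# Toy actor table: one actor with three titles, one actor with a single title
df_toy = pd.DataFrame({
    'nconst': ['nm01', 'nm02'],
    'primaryName': ['Actor A', 'Actor B'],
    'knownForTitles': [['tt01', 'tt02', 'tt03'], ['tt02']],
})

# One output row per (actor, title) pair; the original row index is repeated
df_toy_unnested = df_toy.explode('knownForTitles')

print(df_toy_unnested)
print(df_toy_unnested.shape)
```

#### As in df_unnested, the index values repeat for actors with more than one title, which is why the index 0 appears several times in the output above.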

In [29]:
df_unnested.head()

Unnamed: 0,nconst,primaryName,knownForTitles
0,nm1774132,Nathan McLaughlin,tt0417686
0,nm1774132,Nathan McLaughlin,tt1713976
0,nm1774132,Nathan McLaughlin,tt1891860
0,nm1774132,Nathan McLaughlin,tt0454839
1,nm10683464,Bridge Andrew,tt7718088


In [30]:
df_unnested.head(20)

Unnamed: 0,nconst,primaryName,knownForTitles
0,nm1774132,Nathan McLaughlin,tt0417686
0,nm1774132,Nathan McLaughlin,tt1713976
0,nm1774132,Nathan McLaughlin,tt1891860
0,nm1774132,Nathan McLaughlin,tt0454839
1,nm10683464,Bridge Andrew,tt7718088
2,nm1021485,Brandon Fransvaag,tt0168790
3,nm6940929,Erwin van der Lely,tt4232168
4,nm5764974,Svetlana Shypitsyna,tt3014168
5,nm8621807,Utku Arslan,tt5493404
5,nm8621807,Utku Arslan,tt7661932


### Nesting primaryName Grouped by knownForTitles
#### The names are grouped back into one row per title, because the recommender works at the film level.

In [31]:
drop_unnested=df_unnested.drop(['nconst'], axis=1)

#Prepare a bucket for the intermediate DataFrames
df_uni=[]

for kolom_player in ['primaryName']:
    #Aggregate the primaryName column into a list for each knownForTitles value
    df_kolom=drop_unnested.groupby(['knownForTitles'])[kolom_player].apply(list)
    #Append the result to the bucket
    df_uni.append(df_kolom)
    
df_grouping=pd.concat(df_uni, axis=1).reset_index()
df_grouping.columns=['knownForTitles','cast_name']
df_grouping

Unnamed: 0,knownForTitles,cast_name
0,tt0008125,[Charles Harley]
1,tt0009706,[Charles Harley]
2,tt0010304,[Natalie Talmadge]
3,tt0011414,[Natalie Talmadge]
4,tt0011890,[Natalie Talmadge]
...,...,...
1893,tt9610496,[Stefano Baffetti]
1894,tt9714030,[Kevin Kain]
1895,tt9741820,[Caroline Plyler]
1896,tt9759814,[Ethan Francis]


In [32]:
df_grouping.shape

(1898, 2)
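
#### The grouping reduces the 1918 actor-title rows to 1898 rows because rows that share a title are merged into one list. A minimal sketch of that behaviour on hypothetical toy data, written as a single groupby chain instead of the loop above:

```python
import pandas as pd

# Toy unnested table: title tt02 is shared by two people
df_toy = pd.DataFrame({
    'primaryName': ['Actor A', 'Actor A', 'Actor B'],
    'knownForTitles': ['tt01', 'tt02', 'tt02'],
})

# Collect the names of each title into a list and name the result cast_name
df_toy_grouped = (
    df_toy.groupby('knownForTitles')['primaryName']
    .agg(list)
    .reset_index(name='cast_name')
)

print(df_toy_grouped)
```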

In [33]:
df_grouping.head(20)

Unnamed: 0,knownForTitles,cast_name
0,tt0008125,[Charles Harley]
1,tt0009706,[Charles Harley]
2,tt0010304,[Natalie Talmadge]
3,tt0011414,[Natalie Talmadge]
4,tt0011890,[Natalie Talmadge]
5,tt0014341,[Natalie Talmadge]
6,tt0016622,[Booth Grainge]
7,tt0018054,[Reeka Roberts]
8,tt0024151,[James Hackett]
9,tt0025981,[Bernard Loftus]


### Joining with Movie Table
#### The joins performed in this stage are:
#### 1. Join the movie table with the cast table (on the knownForTitles and tconst fields).
#### 2. Join df_base with df_director_writer (on the tconst field of both tables).

In [34]:
#Join the cast table with the movie table; the inner join keeps only titles found in both
df_base=pd.merge(df_grouping, df_film_rating, left_on='knownForTitles', right_on='tconst', how='inner')

#Join df_base with df_director_writer; the left join keeps every row of df_base
df_base=pd.merge(df_base, df_director_writer, left_on='tconst', right_on='tconst', how='left')

In [35]:
df_base

Unnamed: 0,knownForTitles,cast_name,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,director_name,writer_name
0,tt0011414,[Natalie Talmadge],tt0011414,movie,The Love Expert,The Love Expert,0,1920.0,,60.0,"Comedy,Romance",4.9,136,[David Kirkland],"[John Emerson, Anita Loos]"
1,tt0011890,[Natalie Talmadge],tt0011890,movie,Yes or No,Yes or No,0,1920.0,,72.0,,6.3,7,[Roy William Neill],"[Arthur F. Goodrich, Burns Mantle, Mary Murillo]"
2,tt0014341,[Natalie Talmadge],tt0014341,movie,Our Hospitality,Our Hospitality,0,1923.0,,65.0,"Comedy,Romance,Thriller",7.8,9621,"[Buster Keaton, John G. Blystone]","[Jean C. Havez, Clyde Bruckman, Joseph A. Mitc..."
3,tt0018054,[Reeka Roberts],tt0018054,movie,The King of Kings,The King of Kings,0,1927.0,,155.0,"Biography,Drama,History",7.3,1826,[Cecil B. DeMille],[Jeanie Macpherson]
4,tt0024151,[James Hackett],tt0024151,movie,I Cover the Waterfront,I Cover the Waterfront,0,1933.0,,80.0,"Drama,Romance",6.3,455,[James Cruze],"[Max Miller, Wells Root, Jack Jevne]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1055,tt9246600,[Vanessa Hanson],tt9246600,tvSeries,UFC on ESPN,UFC on ESPN,0,2019.0,,180.0,,8.1,38,,
1056,tt9278408,[Utku Arslan],tt9278408,tvMiniSeries,Bozkir,Bozkir,0,2018.0,2019.0,50.0,"Crime,Drama,Mystery",8.2,1231,[Bahadir Ince],"[Levent Cantek, Ali Demirel, Baris Erdogan]"
1057,tt9285882,[Jonathon Deering],tt9285882,movie,Blue Story,Blue Story,0,2019.0,,91.0,"Crime,Drama",5.5,1411,[Rapman],[Rapman]
1058,tt9310372,[Sandini Dhar],tt9310372,tvSeries,Typewriter,Typewriter,0,2019.0,,48.0,"Horror,Thriller",6.5,2895,[Sujoy Ghosh],"[Sujoy Ghosh, Raj Vasant, Pratim D. Gupta, Sur..."


In [36]:
df_base.shape

(1060, 15)
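
#### The inner join keeps only the titles that exist in both tables (1898 grouped titles become 1060 rows), while the left join keeps all of those rows and leaves director_name and writer_name empty where no match exists. The sketch below reproduces both effects on hypothetical toy tables and uses the indicator option of pd.merge to label unmatched rows.

```python
import pandas as pd

# Toy tables; all IDs, names, and titles are made up for illustration
df_cast = pd.DataFrame({
    'knownForTitles': ['tt01', 'tt02', 'tt03'],
    'cast_name': [['Actor A'], ['Actor B'], ['Actor C']],
})
df_movies = pd.DataFrame({
    'tconst': ['tt01', 'tt02'],
    'primaryTitle': ['Film One', 'Film Two'],
})
df_crew = pd.DataFrame({
    'tconst': ['tt01'],
    'director_name': [['Director A']],
})

# Inner join: tt03 has no movie record, so it is dropped
df_inner = pd.merge(df_cast, df_movies, left_on='knownForTitles',
                    right_on='tconst', how='inner')

# Left join: tt02 has no crew record, so its director_name becomes NaN
df_left = pd.merge(df_inner, df_crew, on='tconst', how='left', indicator=True)

print(df_left[['tconst', 'primaryTitle', 'director_name', '_merge']])
```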

In [37]:
df_base.head()

Unnamed: 0,knownForTitles,cast_name,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,director_name,writer_name
0,tt0011414,[Natalie Talmadge],tt0011414,movie,The Love Expert,The Love Expert,0,1920.0,,60.0,"Comedy,Romance",4.9,136,[David Kirkland],"[John Emerson, Anita Loos]"
1,tt0011890,[Natalie Talmadge],tt0011890,movie,Yes or No,Yes or No,0,1920.0,,72.0,,6.3,7,[Roy William Neill],"[Arthur F. Goodrich, Burns Mantle, Mary Murillo]"
2,tt0014341,[Natalie Talmadge],tt0014341,movie,Our Hospitality,Our Hospitality,0,1923.0,,65.0,"Comedy,Romance,Thriller",7.8,9621,"[Buster Keaton, John G. Blystone]","[Jean C. Havez, Clyde Bruckman, Joseph A. Mitc..."
3,tt0018054,[Reeka Roberts],tt0018054,movie,The King of Kings,The King of Kings,0,1927.0,,155.0,"Biography,Drama,History",7.3,1826,[Cecil B. DeMille],[Jeanie Macpherson]
4,tt0024151,[James Hackett],tt0024151,movie,I Cover the Waterfront,I Cover the Waterfront,0,1933.0,,80.0,"Drama,Romance",6.3,455,[James Cruze],"[Max Miller, Wells Root, Jack Jevne]"


### Cleaning Data
#### After joining the tables, the next step is to clean the resulting data.

In [38]:
#Drop the 'knownForTitles' column
drop_base=df_base.drop(['knownForTitles'], axis=1)
print(drop_base.info())

#Replace NULL values in the 'genres' column with 'Unknown'
drop_base['genres']=drop_base['genres'].fillna('Unknown')

#Count the NULL values in each column
print(drop_base.isnull().sum())

#Replace NULL values in director_name and writer_name with 'Unknown'
#Note: these cells then hold the plain string 'Unknown', not a list
drop_base[['director_name','writer_name']]=drop_base[['director_name','writer_name']].fillna('Unknown')

#The genres column holds multiple comma-separated values, so convert it into a list
drop_base['genres']=drop_base['genres'].apply(lambda x: x.split(','))

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1060 entries, 0 to 1059
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   cast_name       1060 non-null   object 
 1   tconst          1060 non-null   object 
 2   titleType       1060 non-null   object 
 3   primaryTitle    1060 non-null   object 
 4   originalTitle   1060 non-null   object 
 5   isAdult         1060 non-null   int64  
 6   startYear       1060 non-null   float64
 7   endYear         110 non-null    float64
 8   runtimeMinutes  1060 non-null   float64
 9   genres          745 non-null    object 
 10  averageRating   1060 non-null   float64
 11  numVotes        1060 non-null   int64  
 12  director_name   986 non-null    object 
 13  writer_name     986 non-null    object 
dtypes: float64(4), int64(2), object(8)
memory usage: 124.2+ KB
None
cast_name           0
tconst              0
titleType           0
primaryTitle        0
originalTitle   

### Reformat on df_base table
#### The next step is to reformat the df_base table, from which several columns have already been dropped.

In [39]:
#Drop the tconst, isAdult, endYear, and originalTitle columns
drop_base_kedua=drop_base.drop(['tconst', 'isAdult', 'endYear', 'originalTitle'], axis=1)

drop_base_kedua=drop_base_kedua[['primaryTitle', 'titleType', 'startYear', 'runtimeMinutes', 'genres', 
                                'averageRating', 'numVotes', 'cast_name', 'director_name', 'writer_name']]

drop_base_kedua.columns=['title', 'type', 'start', 'duration', 'genres', 'rating', 'votes', 'cast_name', 
                         'director_name', 'writer_name']

drop_base_kedua

Unnamed: 0,title,type,start,duration,genres,rating,votes,cast_name,director_name,writer_name
0,The Love Expert,movie,1920.0,60.0,"[Comedy, Romance]",4.9,136,[Natalie Talmadge],[David Kirkland],"[John Emerson, Anita Loos]"
1,Yes or No,movie,1920.0,72.0,[Unknown],6.3,7,[Natalie Talmadge],[Roy William Neill],"[Arthur F. Goodrich, Burns Mantle, Mary Murillo]"
2,Our Hospitality,movie,1923.0,65.0,"[Comedy, Romance, Thriller]",7.8,9621,[Natalie Talmadge],"[Buster Keaton, John G. Blystone]","[Jean C. Havez, Clyde Bruckman, Joseph A. Mitc..."
3,The King of Kings,movie,1927.0,155.0,"[Biography, Drama, History]",7.3,1826,[Reeka Roberts],[Cecil B. DeMille],[Jeanie Macpherson]
4,I Cover the Waterfront,movie,1933.0,80.0,"[Drama, Romance]",6.3,455,[James Hackett],[James Cruze],"[Max Miller, Wells Root, Jack Jevne]"
...,...,...,...,...,...,...,...,...,...,...
1055,UFC on ESPN,tvSeries,2019.0,180.0,[Unknown],8.1,38,[Vanessa Hanson],Unknown,Unknown
1056,Bozkir,tvMiniSeries,2018.0,50.0,"[Crime, Drama, Mystery]",8.2,1231,[Utku Arslan],[Bahadir Ince],"[Levent Cantek, Ali Demirel, Baris Erdogan]"
1057,Blue Story,movie,2019.0,91.0,"[Crime, Drama]",5.5,1411,[Jonathon Deering],[Rapman],[Rapman]
1058,Typewriter,tvSeries,2019.0,48.0,"[Horror, Thriller]",6.5,2895,[Sandini Dhar],[Sujoy Ghosh],"[Sujoy Ghosh, Raj Vasant, Pratim D. Gupta, Sur..."


In [40]:
drop_base_kedua.shape

(1060, 10)

In [41]:
drop_base_kedua.head()

Unnamed: 0,title,type,start,duration,genres,rating,votes,cast_name,director_name,writer_name
0,The Love Expert,movie,1920.0,60.0,"[Comedy, Romance]",4.9,136,[Natalie Talmadge],[David Kirkland],"[John Emerson, Anita Loos]"
1,Yes or No,movie,1920.0,72.0,[Unknown],6.3,7,[Natalie Talmadge],[Roy William Neill],"[Arthur F. Goodrich, Burns Mantle, Mary Murillo]"
2,Our Hospitality,movie,1923.0,65.0,"[Comedy, Romance, Thriller]",7.8,9621,[Natalie Talmadge],"[Buster Keaton, John G. Blystone]","[Jean C. Havez, Clyde Bruckman, Joseph A. Mitc..."
3,The King of Kings,movie,1927.0,155.0,"[Biography, Drama, History]",7.3,1826,[Reeka Roberts],[Cecil B. DeMille],[Jeanie Macpherson]
4,I Cover the Waterfront,movie,1933.0,80.0,"[Drama, Romance]",6.3,455,[James Hackett],[James Cruze],"[Max Miller, Wells Root, Jack Jevne]"


### Creating Content Based Recommender System
#### In this stage the metadata used as features is selected: genres, primaryName (cast_name), director_name, and writer_name.

In [42]:
df_feature=drop_base_kedua[['title', 'cast_name', 'genres', 'director_name', 'writer_name']]

In [43]:
df_feature

Unnamed: 0,title,cast_name,genres,director_name,writer_name
0,The Love Expert,[Natalie Talmadge],"[Comedy, Romance]",[David Kirkland],"[John Emerson, Anita Loos]"
1,Yes or No,[Natalie Talmadge],[Unknown],[Roy William Neill],"[Arthur F. Goodrich, Burns Mantle, Mary Murillo]"
2,Our Hospitality,[Natalie Talmadge],"[Comedy, Romance, Thriller]","[Buster Keaton, John G. Blystone]","[Jean C. Havez, Clyde Bruckman, Joseph A. Mitc..."
3,The King of Kings,[Reeka Roberts],"[Biography, Drama, History]",[Cecil B. DeMille],[Jeanie Macpherson]
4,I Cover the Waterfront,[James Hackett],"[Drama, Romance]",[James Cruze],"[Max Miller, Wells Root, Jack Jevne]"
...,...,...,...,...,...
1055,UFC on ESPN,[Vanessa Hanson],[Unknown],Unknown,Unknown
1056,Bozkir,[Utku Arslan],"[Crime, Drama, Mystery]",[Bahadir Ince],"[Levent Cantek, Ali Demirel, Baris Erdogan]"
1057,Blue Story,[Jonathon Deering],"[Crime, Drama]",[Rapman],[Rapman]
1058,Typewriter,[Sandini Dhar],"[Horror, Thriller]",[Sujoy Ghosh],"[Sujoy Ghosh, Raj Vasant, Pratim D. Gupta, Sur..."


In [44]:
df_feature.shape

(1060, 5)

In [45]:
df_feature.head()

Unnamed: 0,title,cast_name,genres,director_name,writer_name
0,The Love Expert,[Natalie Talmadge],"[Comedy, Romance]",[David Kirkland],"[John Emerson, Anita Loos]"
1,Yes or No,[Natalie Talmadge],[Unknown],[Roy William Neill],"[Arthur F. Goodrich, Burns Mantle, Mary Murillo]"
2,Our Hospitality,[Natalie Talmadge],"[Comedy, Romance, Thriller]","[Buster Keaton, John G. Blystone]","[Jean C. Havez, Clyde Bruckman, Joseph A. Mitc..."
3,The King of Kings,[Reeka Roberts],"[Biography, Drama, History]",[Cecil B. DeMille],[Jeanie Macpherson]
4,I Cover the Waterfront,[James Hackett],"[Drama, Romance]",[James Cruze],"[Max Miller, Wells Root, Jack Jevne]"


### Creating a function to strip spaces from every row and every element:

#### Create a function that builds the metadata soup (all features joined into one sentence) for each film title.

In [46]:
#The columns used are cast_name, genres, director_name, and writer_name

def feature_soup(x):
    return ' '.join(x['cast_name']) + ' ' + ' '.join(x['genres']) + ' ' + ' '.join(x['director_name']) + ' ' + ' '.join(x['writer_name'])

#Store the metadata soup in a new column; assign returns a new DataFrame, which avoids
#the SettingWithCopyWarning raised when writing into a slice of drop_base_kedua
df_feature=df_feature.assign(soup=df_feature.apply(feature_soup, axis=1))
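
#### Two details of the soup are worth illustrating. First, ' '.join applied to the plain string 'Unknown' (the fill value used for missing directors and writers) joins its characters, giving 'U n k n o w n'. Second, a name such as 'Natalie Talmadge' is later tokenised into two separate words unless the inner space is removed. The sketch below is one possible space-stripping helper under those assumptions; sanitize, toy_soup, and df_toy are hypothetical names, and applying the helper to the real data would change the similarity scores shown in this notebook.

```python
import pandas as pd

def sanitize(value):
    """Return a list of lower-cased items with inner spaces removed."""
    # A fill value such as 'Unknown' is a plain string, not a list;
    # wrap it so that it is kept as one token.
    if isinstance(value, str):
        value = [value]
    return [str(item).replace(' ', '').lower() for item in value]

def toy_soup(row, columns):
    """Join the sanitised items of the given columns into one string."""
    return ' '.join(' '.join(sanitize(row[col])) for col in columns)

# Toy rows; the second row mimics a film with no director record
df_toy = pd.DataFrame({
    'cast_name': [['Actor One'], ['Actor Two']],
    'genres': [['Comedy', 'Romance'], ['Unknown']],
    'director_name': [['Director A'], 'Unknown'],
})
cols = ['cast_name', 'genres', 'director_name']
df_toy = df_toy.assign(soup=df_toy.apply(lambda row: toy_soup(row, cols), axis=1))

print(' '.join('Unknown'))       # joining a string separates its characters
print(df_toy['soup'].tolist())
```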


### Preparing CountVectorizer (stop_words = 'english') and fitting it to the metadata soup defined above.
#### CountVectorizer is the simplest type of vectorizer: it counts how often each token occurs in each document. It is easiest to explain through the example below:

In [47]:
#Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

#### Define the CountVectorizer and convert the metadata soup above into vectors.

In [48]:
count=CountVectorizer(stop_words='english')
matrix_count=count.fit_transform(df_feature['soup'])

print(matrix_count)

  (0, 6343)	1
  (0, 8502)	1
  (0, 1697)	1
  (0, 7445)	1
  (0, 1967)	1
  (0, 4639)	1
  (0, 4300)	1
  (0, 2489)	1
  (0, 260)	1
  (0, 5306)	1
  (1, 6343)	1
  (1, 8502)	1
  (1, 8888)	1
  (1, 7511)	1
  (1, 9286)	1
  (1, 6380)	1
  (1, 363)	1
  (1, 3266)	1
  (1, 1205)	1
  (1, 5559)	1
  (1, 5683)	1
  (1, 6258)	1
  (2, 6343)	1
  (2, 8502)	1
  (2, 1697)	1
  :	:
  (1057, 2258)	1
  (1057, 1813)	1
  (1057, 4312)	1
  (1057, 2017)	1
  (1057, 7198)	2
  (1058, 8634)	1
  (1058, 3903)	1
  (1058, 7157)	1
  (1058, 3439)	1
  (1058, 8400)	1
  (1058, 3130)	2
  (1058, 7656)	1
  (1058, 2107)	1
  (1058, 8389)	2
  (1058, 8957)	1
  (1058, 7028)	1
  (1058, 6307)	1
  (1059, 1697)	1
  (1059, 3903)	1
  (1059, 2640)	1
  (1059, 4996)	1
  (1059, 4299)	2
  (1059, 3434)	1
  (1059, 8627)	1
  (1059, 6521)	2


In [49]:
print(matrix_count.shape)

(1060, 9639)


### Membuat Model Similarity Antara Matrix Count
#### This step computes the cosine similarity score for every pair of titles. With the 1060 rows of matrix_count, the result is a 1060 x 1060 matrix in which the cell at row i and column j holds the similarity score between title i and title j. The matrix is symmetric and every element on its diagonal is 1, because that is the similarity of a title with itself.
#### The model in this section is built with the cosine similarity formula. The cosine score is useful and cheap to compute.
#### The formula for the cosine similarity between two texts is:

$$ \operatorname{cosine}(x,y) = \frac{x \cdot y^T}{\|x\| \, \|y\|} $$

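#### In general the result lies between -1 and 1: a score close to 1 means the two entities are very similar, while a score close to -1 means they point in opposite directions. Because the count vectors used here contain no negative values, the scores in this project fall between 0 and 1, where 0 means the two titles share no tokens.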
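
#### A worked example of the formula, using two hypothetical count vectors that share exactly one token. The dot product is 1 and both norms are the square root of 3, so the score is 1/3. The sketch computes it by hand with NumPy and compares the result with scikit-learn's cosine_similarity.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two toy count vectors (row vectors) with one token in common
x = np.array([[1, 0, 1, 0, 1]])
y = np.array([[0, 1, 1, 1, 0]])

# Manual computation: dot product divided by the product of the norms
manual = (x @ y.T).item() / (np.linalg.norm(x) * np.linalg.norm(y))

# Library computation for comparison
library = cosine_similarity(x, y)[0, 0]

print(manual, library)
```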

In [50]:
#Import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

In [51]:
#Compute the cosine similarity between all rows of matrix_count
cosine_similarity_output=cosine_similarity(matrix_count, matrix_count)

print(cosine_similarity_output)

[[1.         0.18257419 0.40824829 ... 0.         0.         0.08451543]
 [0.18257419 1.         0.1490712  ... 0.         0.         0.        ]
 [0.40824829 0.1490712  1.         ... 0.         0.06085806 0.06900656]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.         0.06085806 ... 0.         1.         0.06299408]
 [0.08451543 0.         0.06900656 ... 0.         0.06299408 1.        ]]


### Building the Content Based Recommender System
#### The next step is to build a reverse mapping that uses the title as its index.

In [52]:
#Map each title to its row index; keep only the first row when a title occurs more than once
indices=pd.Series(df_feature.index, index=df_feature['title'])
indices=indices[~indices.index.duplicated(keep='first')]

def content_recommender(title):
    #Get the index of the requested film title
    idx=indices[title]

    #Pair every film index with its similarity score against the requested film
    similarity_score=list(enumerate(cosine_similarity_output[idx]))

    #Sort the films from the highest similarity to the lowest
    similarity_score=sorted(similarity_score,key=lambda x: x[1],reverse=True)

    #Take the 2nd to 11th items; the 1st is normally the requested film itself (score 1)
    similarity_score=similarity_score[1:11]

    #Collect the indices of the films in similarity_score
    indices_film=[i[0] for i in similarity_score]

    #Use iloc to look up the matching rows in df_base
    return df_base.iloc[indices_film]

### Applying the function above to search for recommendations.

In [53]:
content_recommender('The Lion King')

Unnamed: 0,knownForTitles,cast_name,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,director_name,writer_name
848,tt3040964,[Cristina Carrión Márquez],tt3040964,movie,The Jungle Book,The Jungle Book,0,2016.0,,106.0,"Adventure,Drama,Family",7.4,250994,[Jon Favreau],"[Justin Marks, Rudyard Kipling]"
530,tt0848456,[Jessi Castro],tt0848456,video,Sodom 2: The Bottom Feeder,Sodom 2: The Bottom Feeder,1,2006.0,,162.0,"Adult,Adventure,Drama",5.0,5,,
822,tt2552322,[Gizem Gülen],tt2552322,movie,Paddle Pop Adventures 2: Journey Into the Kingdom,Paddle Pop Adventures 2: Journey Into the Kingdom,0,2012.0,,88.0,"Action,Adventure,Animation",4.7,23,,
917,tt4498760,[Kaleigh Phillips],tt4498760,movie,Khali the Killer,Khali the Killer,0,2017.0,,89.0,"Crime,Drama",3.6,234,[Jon Matthews],[Jon Matthews]
835,tt2798920,[Dan Churchill],tt2798920,movie,Annihilation,Annihilation,0,2018.0,,115.0,"Adventure,Drama,Horror",6.9,260155,[Alex Garland],"[Alex Garland, Jeff VanderMeer]"
158,tt0100192,[William Holden Jr.],tt0100192,tvMovie,Mother Goose Rock 'n' Rhyme,Mother Goose Rock 'n' Rhyme,0,1990.0,,96.0,"Adventure,Family,Fantasy",7.6,717,[Jeff Stein],"[Rod Ash, Mark Curtiss, Linda Engelsiepen, Hil..."
383,tt0286336,[Francisco Bretas],tt0286336,tvSeries,The Animals of Farthing Wood,The Animals of Farthing Wood,0,1993.0,1995.0,25.0,"Adventure,Animation,Drama",8.3,3057,"[Elphin Lloyd-Jones, Philippe Leclerc]","[Valerie Georgeson, Colin Dann, Jenny McDade, ..."
237,tt0119796,[Jeff Kurzner],tt0119796,tvMovie,"My Stepson, My Lover","My Stepson, My Lover",0,1997.0,,93.0,"Drama,Thriller",4.4,289,[Mary Lambert],[Ron Cutler]
664,tt1518812,[Andrew Pope],tt1518812,movie,Meek's Cutoff,Meek's Cutoff,0,2010.0,,104.0,"Drama,Western",6.5,11340,[Kelly Reichardt],[Jonathan Raymond]
815,tt2481480,[Yana Karin],tt2481480,movie,Rob the Mob,Rob the Mob,0,2014.0,,104.0,"Crime,Drama",6.3,10463,[Raymond De Felitta],[Jonathan Fernandez]


In [54]:
content_recommender('Annihilation')

Unnamed: 0,knownForTitles,cast_name,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,director_name,writer_name
843,tt2963090,[Navid Taheri],tt2963090,short,Echoes in an Empty Apartment,Echoes in an Empty Apartment,0,2014.0,,23.0,"Drama,Mystery,Short",8.4,5,"[Cara E. Brewer, Alex Reyme]","[Lindsay Lane, Alex Reyme, Ko Wills]"
530,tt0848456,[Jessi Castro],tt0848456,video,Sodom 2: The Bottom Feeder,Sodom 2: The Bottom Feeder,1,2006.0,,162.0,"Adult,Adventure,Drama",5.0,5,,
616,tt1291580,[Harvey J. Alperin],tt1291580,movie,Behind the Candelabra,Behind the Candelabra,0,2013.0,,118.0,"Biography,Drama,Music",7.0,39228,[Steven Soderbergh],"[Richard LaGravenese, Scott Thorson, Alex Thor..."
637,tt1375666,[Dan Churchill],tt1375666,movie,Inception,Inception,0,2010.0,,148.0,"Action,Adventure,Sci-Fi",8.8,1950039,[Christopher Nolan],[Christopher Nolan]
799,tt2333598,[Alex],tt2333598,movie,7 Boxes,7 cajas,0,2012.0,,105.0,"Adventure,Crime,Drama",7.1,5002,"[Juan Carlos Maneglia, Tana Schémbori]","[Juan Carlos Maneglia, Tito Chamorro, Tana Sch..."
511,tt0796366,"[Matthew Fuchs, Aida Caefer]",tt0796366,movie,Star Trek,Star Trek,0,2009.0,,127.0,"Action,Adventure,Sci-Fi",7.9,567224,[J.J. Abrams],"[Gene Roddenberry, Roberto Orci, Alex Kurtzman]"
974,tt6105098,[Rainy Kala],tt6105098,movie,The Lion King,The Lion King,0,2019.0,,118.0,"Adventure,Animation,Drama",6.9,185808,[Jon Favreau],"[Jonathan Roberts, Jeff Nathanson, Irene Mecch..."
979,tt6236554,[Colin Burroughs],tt6236554,movie,Imperial Blue,Imperial Blue,0,2019.0,,90.0,"Drama,Fantasy,Thriller",7.9,16,[Dan Moss],"[David Cecil, Dan Moss]"
630,tt1339050,[Jeff Randy],tt1339050,movie,Aswang: A Journey Into Myth,Aswang: A Journey Into Myth,0,2008.0,,81.0,"Drama,Horror,Mystery",6.9,26,[Jordan Clark],"[Jordan Clark, Janice Santos Valdez]"
237,tt0119796,[Jeff Kurzner],tt0119796,tvMovie,"My Stepson, My Lover","My Stepson, My Lover",0,1997.0,,93.0,"Drama,Thriller",4.4,289,[Mary Lambert],[Ron Cutler]


In [55]:
content_recommender('Sodom 2: The Bottom Feeder')

Unnamed: 0,knownForTitles,cast_name,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,director_name,writer_name
73,tt0075147,[Joaquín Parra],tt0075147,movie,Robin and Marian,Robin and Marian,0,1976.0,,106.0,"Adventure,Drama,Romance",6.5,10830,[Richard Lester],[James Goldman]
232,tt0119051,[Chris Kosloski],tt0119051,movie,The Edge,The Edge,0,1997.0,,117.0,"Action,Adventure,Drama",6.9,65673,[Lee Tamahori],[David Mamet]
9,tt0028657,[Bernard Loftus],tt0028657,movie,Boss of Lonely Valley,Boss of Lonely Valley,0,1937.0,,60.0,"Action,Adventure,Drama",6.2,41,[Ray Taylor],"[Frances Guihan, Forrest Brown]"
908,tt4276752,[Figo Li],tt4276752,movie,Xun long jue,Xun long jue,0,2015.0,,127.0,"Action,Adventure,Drama",6.0,3288,[Wuershan],"[Chia-Lu Chang, Muye Zhang]"
848,tt3040964,[Cristina Carrión Márquez],tt3040964,movie,The Jungle Book,The Jungle Book,0,2016.0,,106.0,"Adventure,Drama,Family",7.4,250994,[Jon Favreau],"[Justin Marks, Rudyard Kipling]"
638,tt1377521,[Logan Olson],tt1377521,short,The Macabre World of Lavender Williams,The Macabre World of Lavender Williams,0,2009.0,,26.0,"Adventure,Drama,Fantasy",7.4,49,[Nick Delgado],[Nick Delgado]
803,tt2356464,[Sina Müller],tt2356464,movie,Ostwind,Ostwind,0,2013.0,,101.0,"Adventure,Drama,Family",6.8,1350,[Katja von Garnier],"[Kristina Magdalena Henn, Lea Schmidbauer]"
954,tt5525846,[Vishal Sharma],tt5525846,movie,Yeh Hai India,Yeh Hai India,0,2017.0,,128.0,"Action,Adventure,Drama",5.4,174,[Lom Harsh],[Lom Harsh]
15,tt0036400,[Tomohisa Higuchi],tt0036400,movie,Sanshiro Sugata,Sugata Sanshirô,0,1943.0,,79.0,"Action,Adventure,Drama",6.8,4067,[Akira Kurosawa],"[Akira Kurosawa, Tsuneo Tomita]"
835,tt2798920,[Dan Churchill],tt2798920,movie,Annihilation,Annihilation,0,2018.0,,115.0,"Adventure,Drama,Horror",6.9,260155,[Alex Garland],"[Alex Garland, Jeff VanderMeer]"


### A search is performed by typing a film title; the function then returns the 10 films most similar to that title.
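
#### Because the lookup depends on the typed title matching a stored title exactly, a small helper can make the search more forgiving. The sketch below is an assumption, not part of the original notebook: `find_title` and `toy_indices` are hypothetical names, and the helper only normalizes case and surrounding whitespace before the title is passed on to the recommender.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's title -> index mapping.
toy_indices = pd.Series(
    [0, 1, 2], index=["The Lion King", "Annihilation", "Inception"]
)

def find_title(query, indices):
    """Return the stored title that matches query case-insensitively, or None."""
    wanted = query.strip().lower()
    for stored in indices.index:
        if stored.lower() == wanted:
            return stored
    return None

print(find_title("  the lion king ", toy_indices))
print(find_title("Unknown Film", toy_indices))
```

#### When the helper returns None, the caller can report that the title is not in the dataset instead of letting the lookup raise a KeyError.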