# Machine Learning: Membangun Sistem Rekomendasi Untuk Tayangan Film
#### by: Nadia Fitriana Latifah

## Overview
#### Pada pembuatan sistem rekomendasi ini menggunakan database film dari IMDB dan menggunakan dua jenis sistem rekomendasi:
#### 1. Sistem rekomendasi sederhana.
#### 2. Sistem rekomendasi berdasarkan konten dan fitur.
#### Pada sistem rekomendasi sederhana, hanya mengurutkan film berdasarkan dari perhitungan 5 film terbaik.
#### Dalam pembuatan sistem rekomendasi untuk tayangan film ini mengurutkan film berdasarkan dari rating tertinggi, vote terbanyak, dan membuat metric baru dari metric yang telah ada, kemudian melakukan sorting untuk metric baru dari urutan tertinggi sampai terendah.
#### Rekomendasi yang digunakan adalah "Simple Recommender Engine using Weighted Average" yang merupakan rekomendasi umum untuk semua user berdasarkan popularitas film ataupun genre film.
#### Formula dari IMDB dengan Weighted Rating adalah sebagai berikut:

$$ Weighted Rating (WR) = (v/(v+m).R) + (m/(v+m).C) $$

#### dengan:
#### v = jumlah votes pada film,
#### m = jumlah minimum votes yang diperlukan agar dapat masuk dalam chart kategori,
#### R = rata-ratang rating film,
#### C = rata-rata jumlah votes dari seluruh genre film.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df_film=pd.read_csv('title_basics.csv')
df_rating=pd.read_csv('title_ratings.csv', sep='\t')

In [3]:
df_film

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0221078,short,"Circle Dance, Ute Indians","Circle Dance, Ute Indians",0,1898,\N,\N,"Documentary,Short"
1,tt8862466,tvEpisode,"¡El #TeamOsos va con todo al ""Reality del amor""!","¡El #TeamOsos va con todo al ""Reality del amor""!",0,2018,\N,\N,"Comedy,Drama"
2,tt7157720,tvEpisode,Episode #3.41,Episode #3.41,0,2016,\N,29,"Comedy,Game-Show"
3,tt2974998,tvEpisode,Episode dated 16 May 1987,Episode dated 16 May 1987,0,1987,\N,\N,News
4,tt2903620,tvEpisode,Frances Bavier: Aunt Bee Retires,Frances Bavier: Aunt Bee Retires,0,1973,\N,\N,Documentary
...,...,...,...,...,...,...,...,...,...
9020,tt3984412,tvEpisode,"I'm Not Going to Come Last, I'm Just Going to ...",0,2014,\N,\N,Reality-TV,
9021,tt8740950,tvEpisode,Weight Loss Resolution Restart - Ins & Outs of...,0,2015,\N,\N,Reality-TV,
9022,tt9822816,tvEpisode,Zwischen Vertuschung und Aufklärung - Missbrau...,0,2019,\N,\N,\N,
9023,tt9900062,tvEpisode,The Direction of Yuu's Love: Hings Aren't Goin...,0,1994,\N,\N,"Animation,Comedy,Drama",


In [4]:
df_film.shape

(9025, 9)

In [5]:
df_film.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0221078,short,"Circle Dance, Ute Indians","Circle Dance, Ute Indians",0,1898,\N,\N,"Documentary,Short"
1,tt8862466,tvEpisode,"¡El #TeamOsos va con todo al ""Reality del amor""!","¡El #TeamOsos va con todo al ""Reality del amor""!",0,2018,\N,\N,"Comedy,Drama"
2,tt7157720,tvEpisode,Episode #3.41,Episode #3.41,0,2016,\N,29,"Comedy,Game-Show"
3,tt2974998,tvEpisode,Episode dated 16 May 1987,Episode dated 16 May 1987,0,1987,\N,\N,News
4,tt2903620,tvEpisode,Frances Bavier: Aunt Bee Retires,Frances Bavier: Aunt Bee Retires,0,1973,\N,\N,Documentary


In [6]:
df_film.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9025 entries, 0 to 9024
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          9025 non-null   object
 1   titleType       9025 non-null   object
 2   primaryTitle    9011 non-null   object
 3   originalTitle   9011 non-null   object
 4   isAdult         9025 non-null   int64 
 5   startYear       9025 non-null   object
 6   endYear         9025 non-null   object
 7   runtimeMinutes  9025 non-null   object
 8   genres          9014 non-null   object
dtypes: int64(1), object(8)
memory usage: 634.7+ KB


## Cleaning Table Movie
#### Dalam hal ini dilakukan pengecekan terhadap data yang memiliki nilai NULL. Pengecekan dilakukan pada tabel movie(df_film) yang memiliki data bernilai NULL yang harus dihapus.

In [7]:
df_film.isnull().sum()

tconst             0
titleType          0
primaryTitle      14
originalTitle     14
isAdult            0
startYear          0
endYear            0
runtimeMinutes     0
genres            11
dtype: int64

#### Pengecekan data bernilai NULL juga dilakukan pada kolom 'primaryTitle' dan 'originalTitle' yang terdapat banyak data bernilai NULL.

In [8]:
df_film.loc[(df_film['primaryTitle'].isnull()) | (df_film['originalTitle'].isnull())]

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
9000,tt10790040,tvEpisode,,,0,2019,\N,\N,\N
9001,tt10891902,tvEpisode,,,0,2020,\N,\N,Crime
9002,tt11737860,tvEpisode,,,0,2020,\N,\N,"Comedy,Drama,Romance"
9003,tt11737862,tvEpisode,,,0,2020,\N,\N,"Comedy,Drama,Romance"
9004,tt11737866,tvEpisode,,,0,2020,\N,\N,"Comedy,Drama,Romance"
9005,tt11737872,tvEpisode,,,0,2020,\N,\N,\N
9006,tt11737874,tvEpisode,,,0,2020,\N,\N,"Comedy,Drama,Romance"
9007,tt1971246,tvEpisode,,,0,2011,\N,\N,Biography
9008,tt2067043,tvEpisode,,,0,1965,\N,\N,Music
9009,tt4404732,tvEpisode,,,0,2015,\N,\N,Comedy


#### Menghapus data dengan nilai NULL.
#### Dalam tabel diatas diketahui bahwa semua data tidak terdapat judul film sehingga data-data tersebut harus dihapus.

In [9]:
df_film=df_film.loc[(df_film['primaryTitle'].notnull()) & (df_film['originalTitle'].notnull())]

In [10]:
df_film

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0221078,short,"Circle Dance, Ute Indians","Circle Dance, Ute Indians",0,1898,\N,\N,"Documentary,Short"
1,tt8862466,tvEpisode,"¡El #TeamOsos va con todo al ""Reality del amor""!","¡El #TeamOsos va con todo al ""Reality del amor""!",0,2018,\N,\N,"Comedy,Drama"
2,tt7157720,tvEpisode,Episode #3.41,Episode #3.41,0,2016,\N,29,"Comedy,Game-Show"
3,tt2974998,tvEpisode,Episode dated 16 May 1987,Episode dated 16 May 1987,0,1987,\N,\N,News
4,tt2903620,tvEpisode,Frances Bavier: Aunt Bee Retires,Frances Bavier: Aunt Bee Retires,0,1973,\N,\N,Documentary
...,...,...,...,...,...,...,...,...,...
9020,tt3984412,tvEpisode,"I'm Not Going to Come Last, I'm Just Going to ...",0,2014,\N,\N,Reality-TV,
9021,tt8740950,tvEpisode,Weight Loss Resolution Restart - Ins & Outs of...,0,2015,\N,\N,Reality-TV,
9022,tt9822816,tvEpisode,Zwischen Vertuschung und Aufklärung - Missbrau...,0,2019,\N,\N,\N,
9023,tt9900062,tvEpisode,The Direction of Yuu's Love: Hings Aren't Goin...,0,1994,\N,\N,"Animation,Comedy,Drama",


#### Data dari film yang tidak memiliki judul film sudah dihapus terlihat terdapat pengurangan data dari 9025 data menjadi 9011 data.

#### Dalam hal ini dilakukan pengecekan terhadap data yang memiliki nilai NULL. Pengecekan dilakukan pada kolom 'genres' yang memiliki data bernilai NULL yang harus dihapus.

In [11]:
df_film.loc[df_film['genres'].isnull()]

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
9014,tt10233364,tvEpisode,Rolling in the Deep Dish\tRolling in the Deep ...,0,2019,\N,\N,Reality-TV,
9015,tt10925142,tvEpisode,The IMDb Show on Location: Star Wars Galaxy's ...,0,2019,\N,\N,Talk-Show,
9016,tt10970874,tvEpisode,Die Bauhaus-Stadt Tel Aviv - Vorbild für die M...,0,2019,\N,\N,\N,
9017,tt11670006,tvEpisode,...ein angenehmer Unbequemer...\t...ein angene...,0,1981,\N,\N,Documentary,
9018,tt11868642,tvEpisode,GGN Heavyweight Championship Lungs With Mike T...,0,2020,\N,\N,Talk-Show,
9019,tt2347742,tvEpisode,No sufras por la alergia esta primavera\tNo su...,0,2004,\N,\N,\N,
9020,tt3984412,tvEpisode,"I'm Not Going to Come Last, I'm Just Going to ...",0,2014,\N,\N,Reality-TV,
9021,tt8740950,tvEpisode,Weight Loss Resolution Restart - Ins & Outs of...,0,2015,\N,\N,Reality-TV,
9022,tt9822816,tvEpisode,Zwischen Vertuschung und Aufklärung - Missbrau...,0,2019,\N,\N,\N,
9023,tt9900062,tvEpisode,The Direction of Yuu's Love: Hings Aren't Goin...,0,1994,\N,\N,"Animation,Comedy,Drama",


#### Dalam tabel diatas diketahui bahwa semua data tidak terdapat genre film sehingga data-data tersebut harus dihapus.

In [12]:
df_film=df_film.loc[df_film['genres'].notnull()]

In [13]:
df_film

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0221078,short,"Circle Dance, Ute Indians","Circle Dance, Ute Indians",0,1898,\N,\N,"Documentary,Short"
1,tt8862466,tvEpisode,"¡El #TeamOsos va con todo al ""Reality del amor""!","¡El #TeamOsos va con todo al ""Reality del amor""!",0,2018,\N,\N,"Comedy,Drama"
2,tt7157720,tvEpisode,Episode #3.41,Episode #3.41,0,2016,\N,29,"Comedy,Game-Show"
3,tt2974998,tvEpisode,Episode dated 16 May 1987,Episode dated 16 May 1987,0,1987,\N,\N,News
4,tt2903620,tvEpisode,Frances Bavier: Aunt Bee Retires,Frances Bavier: Aunt Bee Retires,0,1973,\N,\N,Documentary
...,...,...,...,...,...,...,...,...,...
8995,tt1357878,tvEpisode,Poison,Poison,0,2004,\N,\N,Documentary
8996,tt2252371,tvEpisode,Episode dated 20 February 2012,Episode dated 20 February 2012,0,2012,\N,\N,Talk-Show
8997,tt6934076,tvEpisode,Episode #1.59,Episode #1.59,0,2012,\N,\N,Talk-Show
8998,tt11988828,tvEpisode,Episode #1.263,Episode #1.263,0,\N,\N,\N,Drama


#### Data dari film yang tidak memiliki jenis genre film sudah dihapus terlihat terdapat pengurangan data dari 9011 data menjadi 9000 data

#### Mengubah data yang mempunyai nilai '\N' .
#### Pada kolom 'startYear', 'endYear', and 'runtimeMinutes' terdapat data dengan nilai '\N'.
#### Dimana nilai '\N' ini sama dengan NULL.
#### Sehingga nilai dari '\N' ini harus diubah menjadi np.nan dan melakukan casting pada kolom 'startYear', 'endYear', dan 'runtimeMinutes' menjadi float64.

In [14]:
df_film['startYear']=df_film['startYear'].replace('\\N', np.nan)
df_film['startYear']=df_film['startYear'].astype('float64')

df_film['endYear']=df_film['endYear'].replace('\\N', np.nan)
df_film['endYear']=df_film['endYear'].astype('float64')

df_film['runtimeMinutes']=df_film['runtimeMinutes'].replace('\\N', np.nan)
df_film['runtimeMinutes']=df_film['runtimeMinutes'].astype('float64')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_film['startYear']=df_film['startYear'].replace('\\N', np.nan)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_film['startYear']=df_film['startYear'].astype('float64')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_film['endYear']=df_film['endYear'].replace('\\N', np.nan)
A value is trying t

In [15]:
df_film

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0221078,short,"Circle Dance, Ute Indians","Circle Dance, Ute Indians",0,1898.0,,,"Documentary,Short"
1,tt8862466,tvEpisode,"¡El #TeamOsos va con todo al ""Reality del amor""!","¡El #TeamOsos va con todo al ""Reality del amor""!",0,2018.0,,,"Comedy,Drama"
2,tt7157720,tvEpisode,Episode #3.41,Episode #3.41,0,2016.0,,29.0,"Comedy,Game-Show"
3,tt2974998,tvEpisode,Episode dated 16 May 1987,Episode dated 16 May 1987,0,1987.0,,,News
4,tt2903620,tvEpisode,Frances Bavier: Aunt Bee Retires,Frances Bavier: Aunt Bee Retires,0,1973.0,,,Documentary
...,...,...,...,...,...,...,...,...,...
8995,tt1357878,tvEpisode,Poison,Poison,0,2004.0,,,Documentary
8996,tt2252371,tvEpisode,Episode dated 20 February 2012,Episode dated 20 February 2012,0,2012.0,,,Talk-Show
8997,tt6934076,tvEpisode,Episode #1.59,Episode #1.59,0,2012.0,,,Talk-Show
8998,tt11988828,tvEpisode,Episode #1.263,Episode #1.263,0,,,,Drama


In [16]:
df_film.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0221078,short,"Circle Dance, Ute Indians","Circle Dance, Ute Indians",0,1898.0,,,"Documentary,Short"
1,tt8862466,tvEpisode,"¡El #TeamOsos va con todo al ""Reality del amor""!","¡El #TeamOsos va con todo al ""Reality del amor""!",0,2018.0,,,"Comedy,Drama"
2,tt7157720,tvEpisode,Episode #3.41,Episode #3.41,0,2016.0,,29.0,"Comedy,Game-Show"
3,tt2974998,tvEpisode,Episode dated 16 May 1987,Episode dated 16 May 1987,0,1987.0,,,News
4,tt2903620,tvEpisode,Frances Bavier: Aunt Bee Retires,Frances Bavier: Aunt Bee Retires,0,1973.0,,,Documentary


In [17]:
print(df_film['startYear'])

0       1898.0
1       2018.0
2       2016.0
3       1987.0
4       1973.0
         ...  
8995    2004.0
8996    2012.0
8997    2012.0
8998       NaN
8999    2019.0
Name: startYear, Length: 9000, dtype: float64


In [18]:
print(df_film['startYear'].unique()[:10])

[1898. 2018. 2016. 1987. 1973. 1951. 2006. 2015. 1998. 2014.]


In [19]:
print(df_film['endYear'])

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
        ..
8995   NaN
8996   NaN
8997   NaN
8998   NaN
8999   NaN
Name: endYear, Length: 9000, dtype: float64


In [20]:
print(df_film['endYear'].unique()[:10])

[  nan 2005. 1955. 2006. 1999. 2018. 1978. 1997. 2017. 2001.]


In [21]:
print(df_film['runtimeMinutes'])

0        NaN
1        NaN
2       29.0
3        NaN
4        NaN
        ... 
8995     NaN
8996     NaN
8997     NaN
8998     NaN
8999     NaN
Name: runtimeMinutes, Length: 9000, dtype: float64


In [22]:
print(df_film['runtimeMinutes'].unique()[:10])

[nan 29.  7. 23. 85. 45. 52. 11. 22. 90.]


#### Mengubah tipe dari kolom 'genres'menjadi list dengan membuat fungsi bernama transform_to_list.

In [23]:
def transform_to_list(a):
    if ',' in a:
        #apabila ada , dalam kolom genres diubah menjadi list.
        return a.split(',')
    else:
        #jika tidak ada , dalam kolom genres diubah menjadi list yang kosong.
        return []
    
df_film['genres']=df_film['genres'].apply(lambda a: transform_to_list(a))   

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_film['genres']=df_film['genres'].apply(lambda a: transform_to_list(a))


### Cleaning NULL data in Ratings File

In [24]:
df_rating

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.6,1608
1,tt0000002,6.0,197
2,tt0000003,6.5,1285
3,tt0000004,6.1,121
4,tt0000005,6.1,2050
...,...,...,...
1030004,tt9916576,6.0,9
1030005,tt9916578,8.4,17
1030006,tt9916720,5.6,49
1030007,tt9916766,6.8,13


In [25]:
df_rating.shape

(1030009, 3)

In [26]:
df_rating.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.6,1608
1,tt0000002,6.0,197
2,tt0000003,6.5,1285
3,tt0000004,6.1,121
4,tt0000005,6.1,2050


In [27]:
df_rating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030009 entries, 0 to 1030008
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1030009 non-null  object 
 1   averageRating  1030009 non-null  float64
 2   numVotes       1030009 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 23.6+ MB


### Joining Basics File and Ratings File
#### Digunakan inner join untuk menggabungkan Basics File dengan Ratings File. (inner join untuk 2 buah fungsi dari df_film dan df_rating untuk mendapatkan rating pada setiap film yang terdapat di file tersebut, sehingga diperoleh 5 film teratas dan tipe data dari tiap kolom yang ada).

In [28]:
df_film_rating=pd.merge(df_film, df_rating, on='tconst', how='inner')

In [29]:
df_film_rating

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0043745,short,Lion Down,Lion Down,0,1951.0,,7.0,"[Animation, Comedy, Family]",7.1,459
1,tt0167491,video,Wicked Covergirls,Wicked Covergirls,1,1998.0,,85.0,[],5.7,7
2,tt6574096,tvEpisode,Shadow Play - Part 2,Shadow Play - Part 2,0,2017.0,,22.0,"[Adventure, Animation, Comedy]",8.5,240
3,tt6941700,tvEpisode,RuPaul Roast,RuPaul Roast,0,2017.0,,,[],8.0,11
4,tt7305674,video,UCLA Track & Field Promo,UCLA Track & Field Promo,0,2017.0,,,"[Short, Sport]",9.7,7
...,...,...,...,...,...,...,...,...,...,...,...
1371,tt4027946,movie,Alone in the Universe,Alone in the Universe,0,2015.0,,97.0,"[Comedy, Drama, Romance]",4.1,22
1372,tt1119633,tvSpecial,UFC 67 Countdown,UFC 67 Countdown,0,2007.0,,60.0,[],8.0,13
1373,tt0290419,movie,Andru Kanda Mugam,Andru Kanda Mugam,0,1968.0,,164.0,[],6.4,5
1374,tt0522596,tvEpisode,The Clampetts Play Cupid,The Clampetts Play Cupid,0,1967.0,,30.0,"[Comedy, Family]",7.5,38


In [30]:
df_film_rating.shape

(1376, 11)

In [31]:
df_film_rating.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0043745,short,Lion Down,Lion Down,0,1951.0,,7.0,"[Animation, Comedy, Family]",7.1,459
1,tt0167491,video,Wicked Covergirls,Wicked Covergirls,1,1998.0,,85.0,[],5.7,7
2,tt6574096,tvEpisode,Shadow Play - Part 2,Shadow Play - Part 2,0,2017.0,,22.0,"[Adventure, Animation, Comedy]",8.5,240
3,tt6941700,tvEpisode,RuPaul Roast,RuPaul Roast,0,2017.0,,,[],8.0,11
4,tt7305674,video,UCLA Track & Field Promo,UCLA Track & Field Promo,0,2017.0,,,"[Short, Sport]",9.7,7


#### Menampilkan tipe data untuk tiap kolom.

In [32]:
df_film_rating.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1376 entries, 0 to 1375
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          1376 non-null   object 
 1   titleType       1376 non-null   object 
 2   primaryTitle    1376 non-null   object 
 3   originalTitle   1376 non-null   object 
 4   isAdult         1376 non-null   int64  
 5   startYear       1376 non-null   float64
 6   endYear         26 non-null     float64
 7   runtimeMinutes  1004 non-null   float64
 8   genres          1376 non-null   object 
 9   averageRating   1376 non-null   float64
 10  numVotes        1376 non-null   int64  
dtypes: float64(4), int64(2), object(5)
memory usage: 129.0+ KB


### Memperkecil Ukuran Tabel
#### Cara untuk memperkecil ukuran tabel yaitu dengan cara menghilangkan semua nilai NULL dari kolom startYear dan runtimeMinutes, karena setiap film pasti diketahui tahun rilis film dan durasi film.

In [33]:
df_film_rating=df_film_rating.dropna(subset=['startYear','runtimeMinutes'])

In [34]:
df_film_rating.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1004 entries, 0 to 1374
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          1004 non-null   object 
 1   titleType       1004 non-null   object 
 2   primaryTitle    1004 non-null   object 
 3   originalTitle   1004 non-null   object 
 4   isAdult         1004 non-null   int64  
 5   startYear       1004 non-null   float64
 6   endYear         17 non-null     float64
 7   runtimeMinutes  1004 non-null   float64
 8   genres          1004 non-null   object 
 9   averageRating   1004 non-null   float64
 10  numVotes        1004 non-null   int64  
dtypes: float64(4), int64(2), object(5)
memory usage: 94.1+ KB


### Membangun Simple Recommender System
#### Dengan menggunakan formula rumus dari IMDB Weighted Average, disini sudah diketahui nilai dari v dan R, sehingga perlu dicari terlebih dahulu nilai dari C dan m.

In [35]:
C=df_film_rating['averageRating'].mean()

In [36]:
print(C)

6.829581673306767


In [37]:
m=df_film_rating['numVotes'].quantile(0.8)

In [38]:
print(m)

229.0


#### Membut Fungsi dari IMDB Weighted Average

In [39]:
def imdb_weighted_filmrating(df_imdb, var = 0.8):
    v=df_imdb['numVotes']
    R=df_imdb['averageRating']
    C=df_imdb['averageRating'].mean()
    m=df_imdb['numVotes'].quantile(var)
    df_imdb['value']=(v/(m+v))*R + (m/(m+v))*C
    return df_imdb['value']

In [40]:
imdb_weighted_filmrating(df_film_rating)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_imdb['value']=(v/(m+v))*R + (m/(m+v))*C


0       7.009992
1       6.796077
2       7.684380
5       6.921384
6       6.869089
          ...   
1369    6.867943
1371    6.590335
1372    6.892455
1373    6.820403
1374    6.924997
Name: value, Length: 1004, dtype: float64

In [41]:
df_film_rating

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,value
0,tt0043745,short,Lion Down,Lion Down,0,1951.0,,7.0,"[Animation, Comedy, Family]",7.1,459,7.009992
1,tt0167491,video,Wicked Covergirls,Wicked Covergirls,1,1998.0,,85.0,[],5.7,7,6.796077
2,tt6574096,tvEpisode,Shadow Play - Part 2,Shadow Play - Part 2,0,2017.0,,22.0,"[Adventure, Animation, Comedy]",8.5,240,7.684380
5,tt2262289,movie,The Pin,The Pin,0,2013.0,,85.0,[],7.7,27,6.921384
6,tt0874027,tvEpisode,Episode #32.9,Episode #32.9,0,2006.0,,29.0,"[Comedy, Game-Show, News]",8.0,8,6.869089
...,...,...,...,...,...,...,...,...,...,...,...,...
1369,tt8870226,video,Sublime: Santeria,Sublime: Santeria,0,1996.0,,4.0,"[Music, Short]",7.6,12,6.867943
1371,tt4027946,movie,Alone in the Universe,Alone in the Universe,0,2015.0,,97.0,"[Comedy, Drama, Romance]",4.1,22,6.590335
1372,tt1119633,tvSpecial,UFC 67 Countdown,UFC 67 Countdown,0,2007.0,,60.0,[],8.0,13,6.892455
1373,tt0290419,movie,Andru Kanda Mugam,Andru Kanda Mugam,0,1968.0,,164.0,[],6.4,5,6.820403


In [42]:
df_film_rating.shape

(1004, 12)

In [43]:
df_film_rating.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,value
0,tt0043745,short,Lion Down,Lion Down,0,1951.0,,7.0,"[Animation, Comedy, Family]",7.1,459,7.009992
1,tt0167491,video,Wicked Covergirls,Wicked Covergirls,1,1998.0,,85.0,[],5.7,7,6.796077
2,tt6574096,tvEpisode,Shadow Play - Part 2,Shadow Play - Part 2,0,2017.0,,22.0,"[Adventure, Animation, Comedy]",8.5,240,7.68438
5,tt2262289,movie,The Pin,The Pin,0,2013.0,,85.0,[],7.7,27,6.921384
6,tt0874027,tvEpisode,Episode #32.9,Episode #32.9,0,2006.0,,29.0,"[Comedy, Game-Show, News]",8.0,8,6.869089


### Cara Pembuatan Simple Recommender System
#### Dari perhitungan IMDB Weigted Average diperoleh kolom tambahan berupa value. Langkah pertama adalah memfilter numVotes yang lebih dari m kemudia mengurutkan value dari tertinggi ke terendah untuk diambil beberapa value teratas.

In [44]:
def simple_recommender(df_recommender, top = 100):
    df_recommender=df_recommender.loc[df_recommender['numVotes'] >= m]
    df_recommender=df_recommender.sort_values(by='value', ascending=False) #untuk mengurutkan dari value tertinggi ke value terndah
    
    #mengambil 100 data dengan value teratas
    df_recommender=df_recommender[:100]
    return df_recommender

#### Mengambil 25 data dengan value teratas.

In [45]:
simple_recommender(df_film_rating, top=25)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,value
68,tt4110822,tvEpisode,S.O.S. Part 2,S.O.S. Part 2,0,2015.0,,43.0,"[Action, Adventure, Drama]",9.4,3820,9.254624
236,tt2200252,video,Attack of the Clones Review,Attack of the Clones Review,0,2010.0,,86.0,[],9.3,1411,8.955045
1181,tt7697962,tvEpisode,Chapter Seventeen: The Missionaries,Chapter Seventeen: The Missionaries,0,2019.0,,54.0,"[Drama, Fantasy, Horror]",9.2,1536,8.892450
326,tt7124590,tvEpisode,Chapter Thirty-Four: Judgment Night,Chapter Thirty-Four: Judgment Night,0,2018.0,,42.0,"[Crime, Drama, Mystery]",9.1,1859,8.850993
1045,tt0533506,tvEpisode,The Prom,The Prom,0,1999.0,,60.0,"[Action, Drama, Fantasy]",8.9,2740,8.740308
...,...,...,...,...,...,...,...,...,...,...,...,...
1105,tt0098325,movie,Sidewalk Stories,Sidewalk Stories,0,1989.0,,97.0,[],7.1,287,6.979989
728,tt0519603,tvEpisode,Moon of the Wolf,Moon of the Wolf,0,1992.0,,22.0,"[Action, Adventure, Animation]",7.0,926,6.966211
453,tt0091658,movie,Nutcracker,Nutcracker,0,1986.0,,89.0,"[Family, Fantasy, Music]",7.0,838,6.963425
758,tt0095305,movie,High Tide,High Tide,0,1987.0,,101.0,[],7.0,765,6.960739


#### Berikutnya adalah cara membuat simple recommender system dengan menggunakan user preferences.
#### Dari simple recommender system sebelumnya sudah didapatkan daftar film yang telah diurutkan dari value tertinggi sampai value terendah.
#### Film dengan averageRating tinggi tidak selalu mendapatkan posisi urutan tinggi dii value dibandingkan dengan averageRating yang rendah, hal ini disebabkan oleh pertimbangan dari perhitungan faktor banyaknya votes.
#### Sistem rekomendasi untuk film ini masih dapat ditingkatkan lagi dengan menambahka filter yang lebih spesifik untuk 'titleType', 'startYear', 'genres', ataupun yang lainnya.
#### Selanjutnya, membuat function untuk melakukan pemfilteran berdasarkan isAdult, startYear, dan genres.

In [46]:
df_next_recommender=df_film_rating.copy()

def user_preference_recommender(df_next_recommender, ask_adult, ask_start_year, ask_genre, top=100):
    #bagian untuk ask_adult dengan output yes atau no.
    if ask_adult.lower() == 'yes':
        df_next_recommender=df_next_recommender.loc[df_next_recommender['isAdult'] == 1]
    elif ask_adult.lower() == 'no':
        df_next_recommender=df_next_recommender.loc[df_next_recommender['isAdult'] == 0]
        
    #bagian untuk ask_start_year dengan output berupa data numeric
    df_next_recommender=df_next_recommender.loc[df_next_recommender['startYear'] >= int(ask_start_year)]
    
    #bagian untuk ask_genre dengan output berupa all genre atau yang lain
    if ask_genre.lower() == 'all':
        df_next_recommender=df_next_recommender
    else:
        def filter_genre(g):
            if ask_genre.lower() in str(g).lower():
                return True
            else:
                return False
        df_next_recommender=df_next_recommender.loc[df_next_recommender['genres'].apply(lambda g: filter_genre(g))]
        
    #bagian untuk output dari numVotes berdasarkan value dari perhitungan dengan IMDB Weighted Average.
    df_next_recommender=df_next_recommender.loc[df_next_recommender['numVotes'] >= m] #mengambil data dengan m yang lebih besar daripada numVotes.
    df_next_recommender=df_next_recommender.sort_values(by='value', ascending=False)
    
    #Pengambilan untuk 100 film teratas
    df_next_recommender=df_next_recommender[:top]
    return df_next_recommender

In [47]:
user_preference_recommender(df_next_recommender,
                           ask_adult = 'no',
                           ask_start_year = 2005,
                           ask_genre ='drama')

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,value
68,tt4110822,tvEpisode,S.O.S. Part 2,S.O.S. Part 2,0,2015.0,,43.0,"[Action, Adventure, Drama]",9.4,3820,9.254624
1181,tt7697962,tvEpisode,Chapter Seventeen: The Missionaries,Chapter Seventeen: The Missionaries,0,2019.0,,54.0,"[Drama, Fantasy, Horror]",9.2,1536,8.89245
326,tt7124590,tvEpisode,Chapter Thirty-Four: Judgment Night,Chapter Thirty-Four: Judgment Night,0,2018.0,,42.0,"[Crime, Drama, Mystery]",9.1,1859,8.850993
71,tt8399426,tvEpisode,Savages,Savages,0,2018.0,,58.0,"[Drama, Fantasy, Romance]",9.0,1428,8.700045
1234,tt2843830,tvEpisode,VIII.,VIII.,0,2014.0,,57.0,"[Adventure, Drama]",8.9,1753,8.660784
1054,tt2503932,tvEpisode,Trial and Error,Trial and Error,0,2013.0,,43.0,"[Drama, Fantasy, Horror]",8.6,2495,8.451165
1281,tt3166390,tvEpisode,Looking for a Plus-One,Looking for a Plus-One,0,2014.0,,28.0,"[Comedy, Drama, Romance]",8.7,396,8.014679
151,tt3954426,tvEpisode,Bleeding Kansas,Bleeding Kansas,0,2014.0,,42.0,"[Drama, Western]",8.6,437,7.991253
1344,tt6644294,tvEpisode,The Hostile Hospital: Part Two,The Hostile Hospital: Part Two,0,2018.0,,40.0,"[Adventure, Comedy, Drama]",8.3,812,7.976536
357,tt4084774,tvEpisode,Trial and Punishment,Trial and Punishment,0,2015.0,,56.0,"[Adventure, Drama]",8.8,289,7.928908


In [48]:
user_preference_recommender(df_next_recommender,
                           ask_adult = 'yes',
                           ask_start_year = 1990,
                           ask_genre ='action')

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,value


In [49]:
user_preference_recommender(df_next_recommender,
                           ask_adult = 'yes',
                           ask_start_year = 2005,
                           ask_genre ='horror')

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,value


In [50]:
user_preference_recommender(df_next_recommender,
                           ask_adult = 'no',
                           ask_start_year = 2005,
                           ask_genre ='horror')

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,value
1181,tt7697962,tvEpisode,Chapter Seventeen: The Missionaries,Chapter Seventeen: The Missionaries,0,2019.0,,54.0,"[Drama, Fantasy, Horror]",9.2,1536,8.89245
1054,tt2503932,tvEpisode,Trial and Error,Trial and Error,0,2013.0,,43.0,"[Drama, Fantasy, Horror]",8.6,2495,8.451165
708,tt2751234,tvEpisode,Resurrection,Resurrection,0,2014.0,,43.0,"[Crime, Drama, Horror]",8.0,1077,7.794774
910,tt3348270,tvEpisode,Bloodline,Bloodline,0,2014.0,,44.0,"[Drama, Horror, Mystery]",7.7,436,7.400262
298,tt9597142,tvEpisode,Skincrawlers/By the Silver Water of Lake Champ...,Skincrawlers/By the Silver Water of Lake Champ...,0,2019.0,,44.0,"[Fantasy, Horror]",7.1,294,6.981595
902,tt3820128,short,The Herd,The Herd,0,2014.0,,21.0,"[Horror, Short, Thriller]",6.5,230,6.664432
1367,tt4031126,movie,Lycan,Lycan,0,2017.0,,87.0,"[Horror, Thriller]",5.0,959,5.352672
583,tt7039000,movie,American Nightmares,Mr. Malevolent,0,2018.0,,90.0,"[Comedy, Horror]",4.1,518,4.936779
484,tt3138376,video,Joy Ride 3: Road Kill,Joy Ride 3: Road Kill,0,2014.0,,95.0,"[Crime, Horror, Thriller]",4.7,4191,4.810334
677,tt8923408,tvEpisode,#JinnHunter,#JinnHunter,0,2019.0,,24.0,"[Drama, Fantasy, Horror]",3.6,592,4.500821
