## 02. Netflix Movie Recommendation System: Content-Based Filtering

- Business Understanding <br><br>
Netflix movie recommendation system: content-based filtering <br>

- Data Understanding

1. Data Load

In [2]:
import pandas as pd
df = pd.read_csv('data/netflix_titles.csv')
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


2. EDA

In [3]:
df.info() # some columns contain null values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [4]:
df.describe()

Unnamed: 0,release_year
count,8807.0
mean,2014.180198
std,8.819312
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0


In [5]:
df.describe(include='object')

Unnamed: 0,show_id,type,title,director,cast,country,date_added,rating,duration,listed_in,description
count,8807,8807,8807,6173,7982,7976,8797,8803,8804,8807,8807
unique,8807,2,8807,4528,7692,748,1767,17,220,514,8775
top,s1,Movie,Dick Johnson Is Dead,Rajiv Chilaka,David Attenborough,United States,"January 1, 2020",TV-MA,1 Season,"Dramas, International Movies","Paranormal activity at a lush, abandoned prope..."
freq,1,6131,1,19,19,2818,109,3207,1793,362,4


- Data Preparation

In [6]:
# Rows whose director field contains ',' (more than one director): 614
df['director'].fillna('').apply(lambda x: ',' in x).sum()

614

In [7]:
# 4,406 unique values when only the first listed director is kept (the count includes the empty string used for missing directors)
df['director'][df['director'].fillna('').apply(lambda x: ',' in x)]

6                           Robert Cullen, José Luis Ucha
16          Pedro de Echave García, Pablo Azorín Williams
23                                Alex Woo, Stanley Moore
30      Ashwiny Iyer Tiwari, Abhishek Chaubey, Saket C...
68      Hanns-Bruno Kammertöns, Vanessa Nöcker, Michae...
                              ...                        
8727                            Ritu Sarin, Tenzing Sonam
8728                      Heidi Brandenburg, Mathew Orzel
8737               Milla Harrison-Hansley, Alicky Sussman
8739                          Frank Capra, Anatole Litvak
8765    Jovanka Vuckovic, Annie Clark, Roxanne Benjami...
Name: director, Length: 614, dtype: object

In [8]:
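# Conclusion: the data has only about 8,800 rows, so adding roughly 4,400 one-hot columns would bring little benefit -> this column is not used
# On average, the feature would surface only about one more title by the same director
# Alternative: use only directors with at least a certain number of titles (e.g. 5 or more) as features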
df['director'].fillna('').apply(lambda x: x.split(',')[0].strip()).nunique()

4406

In [9]:
pd.DataFrame(df['director'].fillna('').apply(lambda x: x.split(',')[0].strip()).value_counts())

Unnamed: 0,director
,2634
Rajiv Chilaka,22
Raúl Campos,18
Suhas Kadav,16
Marcus Raboy,16
...,...
Jung Ji-woo,1
Matt D'Avella,1
Parthiban,1
Scott McAboy,1
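
The alternative mentioned above (keeping only directors with enough titles) could be sketched as follows. This is a minimal illustration on made-up data, not the notebook's method: the `directors` Series, the `min_titles` threshold, and the `director_dummy` name are all assumptions introduced here.

```python
import pandas as pd

# Toy stand-in for df['director']; the names are made up for illustration
directors = pd.Series(["A, B", "A", None, "C", "A", "C, D", "E"])

# Keep only the first listed director, as in the cells above
first_director = directors.fillna("").apply(lambda x: x.split(",")[0].strip())

min_titles = 2  # hypothetical threshold; the note above suggests e.g. 5
counts = first_director[first_director != ""].value_counts()
frequent = counts[counts >= min_titles].index

# Directors below the threshold (and missing values) become NaN,
# which get_dummies turns into all-zero rows
director_dummy = pd.get_dummies(first_director.where(first_director.isin(frequent)))
print(director_dummy.shape)
```

With a threshold like this, the number of director columns is bounded by the number of frequent directors rather than by all unique names.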


> Cast column

In [10]:
df['cast'].fillna('').apply(lambda x: x.split(','))

0                                                      []
1       [Ama Qamata,  Khosi Ngema,  Gail Mabalane,  Th...
2       [Sami Bouajila,  Tracy Gotoas,  Samuel Jouy,  ...
3                                                      []
4       [Mayur More,  Jitendra Kumar,  Ranjan Raj,  Al...
                              ...                        
8802    [Mark Ruffalo,  Jake Gyllenhaal,  Robert Downe...
8803                                                   []
8804    [Jesse Eisenberg,  Woody Harrelson,  Emma Ston...
8805    [Tim Allen,  Courteney Cox,  Chevy Chase,  Kat...
8806    [Vicky Kaushal,  Sarah-Jane Dias,  Raaghav Cha...
Name: cast, Length: 8807, dtype: object

In [11]:
cast_dict = {}

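# cast_list is the list of actors in one row
for cast_list in df['cast'].fillna('').apply(lambda x: x.split(',')):
    # Note: ''.split(',') returns [''], so this check is always true and
    # rows with no cast are counted under the empty-string key
    if cast_list:
        # cast is a single actor name
        for cast in cast_list:
            if cast.strip() in cast_dict.keys():
                cast_dict[cast.strip()] += 1
            else:
                # first occurrence: add the key
                cast_dict[cast.strip()] = 1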

In [12]:
len(cast_dict.keys())

36440
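
The counting loop above can also be written with `collections.Counter`. The sketch below uses made-up cast strings and, unlike the notebook's loop, skips empty names, so its counts would differ from the notebook's by the one empty-string key. `cast_strings` and `cast_counter` are names introduced here for illustration.

```python
from collections import Counter

# Toy cast strings standing in for df['cast'].fillna(''); names are made up
cast_strings = ["Ann Lee, Bo Kim", "", "Bo Kim,  Cy Ray", "Ann Lee"]

cast_counter = Counter(
    name.strip()
    for cast in cast_strings
    for name in cast.split(",")
    if name.strip()  # skip the '' produced by a missing cast
)
print(cast_counter.most_common(2))
```

Because the empty string never enters the counter, a later step would not need to slice it off.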

In [13]:
cast_df = pd.DataFrame(cast_dict.items(), columns=['cast','freq'])
cast_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36440 entries, 0 to 36439
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   cast    36440 non-null  object
 1   freq    36440 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 569.5+ KB


In [14]:
cast_df.describe()

Unnamed: 0,freq
count,36440.0
mean,1.782409
std,4.706956
min,1.0
25%,1.0
50%,1.0
75%,2.0
max,825.0


In [15]:
cast_df.sort_values(by='freq', ascending=False).head()

Unnamed: 0,cast,freq
0,,825
1434,Anupam Kher,43
783,Shah Rukh Khan,35
304,Julie Tejwani,33
1635,Naseeruddin Shah,32


In [16]:
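# gt: greater than (strictly)
# 3,247 entries have 4 or more titles (freq > 3), including the empty-string entry -> the remaining 3,246 actors are one-hot encoded
# Criterion: keep the number of columns below the number of rows (about 8,800)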
cast_df[cast_df['freq'].gt(3)].sort_values(by='freq', ascending=False)

Unnamed: 0,cast,freq
0,,825
1434,Anupam Kher,43
783,Shah Rukh Khan,35
304,Julie Tejwani,33
4943,Takahiro Sakurai,32
...,...,...
1220,Tom Waits,4
9454,Thomas Middleditch,4
9449,Tom Papa,4
9445,Felipe Esparza,4


In [17]:
cast_list_enc = cast_df['cast'][cast_df['freq'].gt(3)].tolist()[1:] # [1:] drops the empty-string entry at position 0
len(cast_list_enc)

3246

In [18]:
cast_list_enc[:10]

['Mayur More',
 'Henry Thomas',
 'Rahul Kohli',
 'Annabeth Gish',
 'Michael Trucco',
 'Vanessa Hudgens',
 'Kimiko Glenn',
 'James Marsden',
 'Ken Jeong',
 'Elizabeth Perkins']

In [19]:
cast_list_enc.sort() # in-place sort; equivalent to cast_list_enc = sorted(cast_list_enc)
cast_list_enc[:10]

['50 Cent',
 'A.K. Hangal',
 'Aakash Dabhade',
 'Aamir Bashir',
 'Aamir Khan',
 'Aaron Abrams',
 'Aaron Eckhart',
 'Aaron Paul',
 'Aaron Taylor-Johnson',
 'Aaron Yan']

In [20]:
cast_one_hot_list = []

# cast_list: the list of actors for each title
for cast_list in df['cast'].fillna('').apply(lambda x: x.split(',')):
    tmp_list = [0] * len(cast_list_enc) # start with 3,246 zeros
    if cast_list:
        # cast: a single actor name
        for cast in cast_list:
            if cast.strip() in cast_list_enc:
                tmp_list[cast_list_enc.index(cast.strip())] = 1 # set the actor's column to 1
    cast_one_hot_list.append(tmp_list)

len(cast_one_hot_list)

8807
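
The loop above searches `cast_list_enc` twice per actor (`in` and `.index`), and each search scans the list. A possible variant, sketched here on toy data, precomputes a name-to-column dictionary so each lookup is a single hash access. `col_index`, `cast_strings`, and `rows` are illustrative names, not part of the notebook.

```python
# Toy vocabulary standing in for the sorted cast_list_enc; names are made up
cast_list_enc = sorted(["Cy Ray", "Ann Lee", "Bo Kim"])
col_index = {name: i for i, name in enumerate(cast_list_enc)}

# Toy cast strings; "Dee Fox" is not in the vocabulary and is ignored
cast_strings = ["Ann Lee, Dee Fox", "", "Cy Ray, Bo Kim"]

rows = []
for cast in cast_strings:
    row = [0] * len(cast_list_enc)
    for name in cast.split(","):
        i = col_index.get(name.strip())  # None when the actor is not encoded
        if i is not None:
            row[i] = 1
    rows.append(row)

print(rows)
```

The resulting rows have the same layout as `cast_one_hot_list`, one 0/1 entry per encoded actor.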

In [21]:
# Each column indicates whether that actor appears in the given movie/TV show
cast_one_hot_df = pd.DataFrame(cast_one_hot_list, columns=cast_list_enc)
cast_one_hot_df.head()

Unnamed: 0,50 Cent,A.K. Hangal,Aakash Dabhade,Aamir Bashir,Aamir Khan,Aaron Abrams,Aaron Eckhart,Aaron Paul,Aaron Taylor-Johnson,Aaron Yan,...,Zoe Saldana,Zoey Deutch,Zooey Deschanel,Àlex Monner,Álvaro Cervantes,Ángela Molina,Ólafur Darri Ólafsson,Özge Borak,İbrahim Büyükak,İpek Bilgin
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
# Sanity check
cast_one_hot_df.sum(axis=1).value_counts() # a row sum of 0 means the title has no frequently appearing actor, not that it has no cast

0     3144
1     1522
2      957
3      749
4      615
5      545
6      397
7      336
8      205
9      119
10      87
11      42
12      29
13      21
14      13
18       5
20       5
23       4
15       3
16       2
17       2
19       1
21       1
26       1
24       1
33       1
dtype: int64

In [23]:
df[cast_one_hot_df.sum(axis=1) == 24]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
3774,s3775,TV Show,Black Mirror,,"Jesse Plemons, Cristin Milioti, Jimmi Simpson,...",United Kingdom,"June 5, 2019",2019,TV-MA,5 Seasons,"British TV Shows, International TV Shows, TV D...",This sci-fi anthology series explores a twiste...


> Country

In [24]:
df['country'].nunique()

748

In [25]:
# Number of rows whose country field contains ','
df['country'].fillna('').str.contains(',').sum()

1320

In [26]:
df['country'][df['country'].fillna('').str.contains(',')]

7       United States, Ghana, Burkina Faso, United Kin...
12                                Germany, Czech Republic
29                           United States, India, France
38                           China, Canada, United States
46                     South Africa, United States, Japan
                              ...                        
8788                Croatia, Slovenia, Serbia, Montenegro
8794                                        Egypt, France
8795                                        Japan, Canada
8797        United States, France, South Korea, Indonesia
8801                         United Arab Emirates, Jordan
Name: country, Length: 1320, dtype: object

In [27]:
# When several countries are listed, keep only the first one
df['country'].fillna('country_na').apply(lambda x: x.split(',')[0].strip()).nunique()

87

In [28]:
country_dummy = pd.get_dummies(df['country'].fillna('country_na').apply(lambda x: x.split(',')[0].strip())).drop(columns=['','country_na'])
country_dummy

Unnamed: 0,Argentina,Australia,Austria,Bangladesh,Belarus,Belgium,Brazil,Bulgaria,Cambodia,Cameroon,...,Turkey,Ukraine,United Arab Emirates,United Kingdom,United States,Uruguay,Venezuela,Vietnam,West Germany,Zimbabwe
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8802,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
8803,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8804,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
8805,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
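
The first-country rule used above can be checked on a few hand-written values. The sketch below is an illustration only: `first_country` is a hypothetical helper, and the sample strings are made up. It also shows one way an empty-string category can arise (a value that starts with a comma), which would be consistent with the `''` column being dropped above.

```python
def first_country(value):
    """Return the first listed country, or 'country_na' for a missing value."""
    if value is None:
        return "country_na"
    return value.split(",")[0].strip()

samples = ["United States, Ghana", "India", ", France", None]
firsts = [first_country(v) for v in samples]
print(firsts)
```

Both `''` and `'country_na'` carry no country information, which is why dropping them leaves all-zero rows for those titles.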


> date_added

In [29]:
df['date_added'].head() # date the title was added to Netflix; only the year is used

0    September 25, 2021
1    September 24, 2021
2    September 24, 2021
3    September 24, 2021
4    September 24, 2021
Name: date_added, dtype: object

In [30]:
df['date_added'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 8807 entries, 0 to 8806
Series name: date_added
Non-Null Count  Dtype 
--------------  ----- 
8797 non-null   object
dtypes: object(1)
memory usage: 68.9+ KB


In [31]:
# Fill the null values
# Null values are filled with the release year
df[df['date_added'].isnull()].head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
6066,s6067,TV Show,A Young Doctor's Notebook and Other Stories,,"Daniel Radcliffe, Jon Hamm, Adam Godley, Chris...",United Kingdom,,2013,TV-MA,2 Seasons,"British TV Shows, TV Comedies, TV Dramas","Set during the Russian Revolution, this comic ..."
6174,s6175,TV Show,Anthony Bourdain: Parts Unknown,,Anthony Bourdain,United States,,2018,TV-PG,5 Seasons,Docuseries,This CNN original series has chef Anthony Bour...
6795,s6796,TV Show,Frasier,,"Kelsey Grammer, Jane Leeves, David Hyde Pierce...",United States,,2003,TV-PG,11 Seasons,"Classic & Cult TV, TV Comedies",Frasier Crane is a snooty but lovable Seattle ...
6806,s6807,TV Show,Friends,,"Jennifer Aniston, Courteney Cox, Lisa Kudrow, ...",United States,,2003,TV-14,10 Seasons,"Classic & Cult TV, TV Comedies",This hit sitcom follows the merry misadventure...
6901,s6902,TV Show,Gunslinger Girl,,"Yuuka Nanri, Kanako Mitsuhashi, Eri Sendai, Am...",Japan,,2008,TV-14,2 Seasons,"Anime Series, Crime TV Shows","On the surface, the Social Welfare Agency appe..."


In [32]:
# Copy release_year into the rows where date_added is missing
df.loc[df['date_added'].isnull(), 'date_added'] = df[df['date_added'].isnull()]['release_year']

In [33]:
df[df['date_added'].isnull()]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description


In [34]:
df['date_added'].head()

0    September 25, 2021
1    September 24, 2021
2    September 24, 2021
3    September 24, 2021
4    September 24, 2021
Name: date_added, dtype: object

In [35]:
df['year'] = df['date_added'].apply(lambda x: str(x)[-4:]).astype('int')
df['year'].value_counts()

2019    2016
2020    1879
2018    1650
2021    1498
2017    1188
2016     430
2015      84
2014      24
2011      13
2013      12
2012       4
2008       3
2009       2
2003       2
2010       2
Name: year, dtype: int64
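
The year extraction above relies on `str(x)[-4:]`, which assumes every value ends in exactly the four-digit year. A slightly more defensive variant, sketched under the assumption that some strings might carry surrounding whitespace, uses a regular expression. `extract_year` is a hypothetical helper introduced here.

```python
import re

def extract_year(value):
    """Return the trailing four-digit year of a date string or an integer year."""
    match = re.search(r"(\d{4})\s*$", str(value))
    if match is None:
        raise ValueError(f"No trailing year found in {value!r}")
    return int(match.group(1))

# A date string, a padded date string, and an integer year (as filled from release_year)
samples = ["September 25, 2021", " August 4, 2017 ", 2013]
years = [extract_year(v) for v in samples]
print(years)
```

Raising on an unparseable value makes unexpected formats visible instead of silently producing a wrong year.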

> rating column (categorical -> one-hot encoding)

In [36]:
df['rating'].unique() # data quality issue: '74 min', '84 min', '66 min' look like durations (ignored in this exercise)

array(['PG-13', 'TV-MA', 'PG', 'TV-14', 'TV-PG', 'TV-Y', 'TV-Y7', 'R',
       'TV-G', 'G', 'NC-17', '74 min', '84 min', '66 min', 'NR', nan,
       'TV-Y7-FV', 'UR'], dtype=object)

In [38]:
# Check the most frequent value (TV-MA)
df['rating'].describe()

count      8803
unique       17
top       TV-MA
freq       3207
Name: rating, dtype: object

In [39]:
# Fill null values with the most frequent value
rating_dummy = pd.get_dummies(df['rating'].fillna('TV-MA'), prefix='rating')
rating_dummy.head()

Unnamed: 0,rating_66 min,rating_74 min,rating_84 min,rating_G,rating_NC-17,rating_NR,rating_PG,rating_PG-13,rating_R,rating_TV-14,rating_TV-G,rating_TV-MA,rating_TV-PG,rating_TV-Y,rating_TV-Y7,rating_TV-Y7-FV,rating_UR
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
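
The notebook ignores the `'74 min'`-style values in `rating`. If one wanted to repair them instead, a possible approach is sketched below on toy data. It assumes these values are running times entered in the wrong column and that the affected rows are the ones with a missing `duration`; neither assumption is verified in this notebook. The `toy` frame and `misplaced` mask are illustrative names.

```python
import pandas as pd

# Toy rows: the second one has a running time in the rating column
toy = pd.DataFrame({
    "rating": ["TV-MA", "74 min", None],
    "duration": ["90 min", None, "1 Season"],
})

# Rows where rating looks like a duration and duration itself is missing
misplaced = toy["rating"].fillna("").str.endswith(" min") & toy["duration"].isnull()

# Move the value to duration and clear the rating
toy.loc[misplaced, "duration"] = toy.loc[misplaced, "rating"]
toy.loc[misplaced, "rating"] = None
print(toy)
```

After such a repair, the cleared ratings would be filled like any other missing rating, and the `rating_66 min`-style dummy columns would no longer appear.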


> duration (binning into categories)

In [40]:
df['duration'].value_counts()

1 Season     1793
2 Seasons     425
3 Seasons     199
90 min        152
94 min        146
             ... 
16 min          1
186 min         1
193 min         1
189 min         1
191 min         1
Name: duration, Length: 220, dtype: int64

In [41]:
df['duration'].unique()

array(['90 min', '2 Seasons', '1 Season', '91 min', '125 min',
       '9 Seasons', '104 min', '127 min', '4 Seasons', '67 min', '94 min',
       '5 Seasons', '161 min', '61 min', '166 min', '147 min', '103 min',
       '97 min', '106 min', '111 min', '3 Seasons', '110 min', '105 min',
       '96 min', '124 min', '116 min', '98 min', '23 min', '115 min',
       '122 min', '99 min', '88 min', '100 min', '6 Seasons', '102 min',
       '93 min', '95 min', '85 min', '83 min', '113 min', '13 min',
       '182 min', '48 min', '145 min', '87 min', '92 min', '80 min',
       '117 min', '128 min', '119 min', '143 min', '114 min', '118 min',
       '108 min', '63 min', '121 min', '142 min', '154 min', '120 min',
       '82 min', '109 min', '101 min', '86 min', '229 min', '76 min',
       '89 min', '156 min', '112 min', '107 min', '129 min', '135 min',
       '136 min', '165 min', '150 min', '133 min', '70 min', '84 min',
       '140 min', '78 min', '7 Seasons', '64 min', '59 min', '139 min',
    

In [42]:
def apply_duration(x):
    if 'Season' in x:
        return 'season'
    elif 'min' in x:
        min_num = int(x.split()[0])
        if min_num < 60:
            return 'short_duration'
        elif min_num < 120:
            return 'middle_duration'
        elif min_num < 180:
            return 'long_duration'
        else:
            return 'longer_duration'
    else:
        raise ValueError(f'Unexpected duration value: {x!r}')

In [43]:
df['duration'].isnull().sum() # matches the null count reported by df.info()

3

In [44]:
# Fill missing values with an assumed typical running time (150 min, which falls in long_duration)
df['duration'].fillna('150 min').apply(apply_duration).value_counts()

middle_duration    4472
season             2676
long_duration      1152
short_duration      458
longer_duration      49
Name: duration, dtype: int64

In [45]:
duration_dummy = pd.get_dummies(df['duration'].fillna('150 min').apply(apply_duration))
duration_dummy

Unnamed: 0,long_duration,longer_duration,middle_duration,season,short_duration
0,0,0,1,0,0
1,0,0,0,1,0
2,0,0,0,1,0
3,0,0,0,1,0
4,0,0,0,1,0
...,...,...,...,...,...
8802,1,0,0,0,0
8803,0,0,0,1,0
8804,0,0,1,0,0
8805,0,0,1,0,0
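
For the minute-based part of `apply_duration`, the same bin edges can be expressed with `pd.cut`. This is a sketch on made-up minute values, not a replacement for the function above: season entries would still need separate handling, and `minutes`, `labels`, and `binned` are names introduced here.

```python
import pandas as pd

# Boundary values chosen to exercise each bin edge
minutes = pd.Series([59, 60, 119, 120, 179, 180])
labels = ["short_duration", "middle_duration", "long_duration", "longer_duration"]

# right=False makes the bins [0, 60), [60, 120), [120, 180), [180, inf),
# matching the strict '<' comparisons in apply_duration
binned = pd.cut(minutes, bins=[0, 60, 120, 180, float("inf")], right=False, labels=labels)
print(binned.astype(str).tolist())
```

Keeping the edges in one list makes it easy to see and adjust where each category starts.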


> listed in 

In [46]:
df['listed_in'].unique()

array(['Documentaries', 'International TV Shows, TV Dramas, TV Mysteries',
       'Crime TV Shows, International TV Shows, TV Action & Adventure',
       'Docuseries, Reality TV',
       'International TV Shows, Romantic TV Shows, TV Comedies',
       'TV Dramas, TV Horror, TV Mysteries', 'Children & Family Movies',
       'Dramas, Independent Movies, International Movies',
       'British TV Shows, Reality TV', 'Comedies, Dramas',
       'Crime TV Shows, Docuseries, International TV Shows',
       'Dramas, International Movies',
       'Children & Family Movies, Comedies',
       'British TV Shows, Crime TV Shows, Docuseries',
       'TV Comedies, TV Dramas', 'Documentaries, International Movies',
       'Crime TV Shows, Spanish-Language TV Shows, TV Dramas',
       'Thrillers',
       'International TV Shows, Spanish-Language TV Shows, TV Action & Adventure',
       'International TV Shows, TV Action & Adventure, TV Dramas',
       'Comedies, International Movies',
       'Comedies, 

In [47]:
# When several genres are listed, keep only the first one
df['listed_in'].isnull().sum()

0

In [48]:
df['listed_in'].apply(lambda x: x.split(',')[0].strip())

0                  Documentaries
1         International TV Shows
2                 Crime TV Shows
3                     Docuseries
4         International TV Shows
                  ...           
8802                 Cult Movies
8803                    Kids' TV
8804                    Comedies
8805    Children & Family Movies
8806                      Dramas
Name: listed_in, Length: 8807, dtype: object

In [49]:
listed_in_dummy = pd.get_dummies(df['listed_in'].apply(lambda x: x.split(',')[0].strip()))
listed_in_dummy

Unnamed: 0,Action & Adventure,Anime Features,Anime Series,British TV Shows,Children & Family Movies,Classic & Cult TV,Classic Movies,Comedies,Crime TV Shows,Cult Movies,...,Sports Movies,Stand-Up Comedy,Stand-Up Comedy & Talk Shows,TV Action & Adventure,TV Comedies,TV Dramas,TV Horror,TV Sci-Fi & Fantasy,TV Shows,Thrillers
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8802,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
8803,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8804,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
8805,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
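
The notebook keeps only the first genre. An alternative that keeps every listed genre is a multi-hot encoding, sketched below with `Series.str.get_dummies`. This is not what the notebook does, and it assumes the genres are consistently separated by a comma followed by a space. The `genres` Series is made up for illustration.

```python
import pandas as pd

# Toy stand-in for df['listed_in']
genres = pd.Series([
    "Dramas, International Movies",
    "Comedies, Dramas",
    "Documentaries",
])

# One column per distinct genre; a row gets 1 for every genre it lists
genre_multi_hot = genres.str.get_dummies(sep=", ")
print(genre_multi_hot)
```

With this encoding, two titles that share a second or third genre would still overlap in the feature vector, which the first-genre rule cannot capture.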


> Combining the data

In [50]:
# Keep only the columns used from the original df (plus dummy columns for type: Movie / TV Show)
df_concat = pd.get_dummies(df, columns=['type'])[['release_year', 'year', 'type_Movie', 'type_TV Show']]
df_concat.head()

Unnamed: 0,release_year,year,type_Movie,type_TV Show
0,2020,2021,1,0
1,2021,2021,0,1
2,2021,2021,0,1
3,2021,2021,0,1
4,2021,2021,0,1


In [51]:
# The one-hot columns already consist of 0 and 1, so scaling the whole DataFrame leaves them unchanged
# In effect, min-max scaling is applied to the year columns
from sklearn.preprocessing import minmax_scale
# Rebuild the DataFrame so that the column names and index are preserved
df_concat = pd.DataFrame(minmax_scale(df_concat), columns=df_concat.columns, index=df_concat.index)
df_concat.head()

Unnamed: 0,release_year,year,type_Movie,type_TV Show
0,0.989583,1.0,1.0,0.0
1,1.0,1.0,0.0,1.0
2,1.0,1.0,0.0,1.0
3,1.0,1.0,0.0,1.0
4,1.0,1.0,0.0,1.0


In [52]:
df_concat.describe()

Unnamed: 0,release_year,year,type_Movie,type_TV Show
count,8807.0,8807.0,8807.0,8807.0
mean,0.92896,0.881294,0.696151,0.303849
std,0.091868,0.089039,0.459944,0.459944
min,0.0,0.0,0.0,0.0
25%,0.916667,0.833333,0.0,0.0
50%,0.958333,0.888889,1.0,0.0
75%,0.979167,0.944444,1.0,1.0
max,1.0,1.0,1.0,1.0


In [53]:
# Concatenate all feature blocks column-wise
df_concat = pd.concat([df_concat, listed_in_dummy, cast_one_hot_df, country_dummy, duration_dummy, rating_dummy], axis=1)
df_concat.head()

Unnamed: 0,release_year,year,type_Movie,type_TV Show,Action & Adventure,Anime Features,Anime Series,British TV Shows,Children & Family Movies,Classic & Cult TV,...,rating_PG-13,rating_R,rating_TV-14,rating_TV-G,rating_TV-MA,rating_TV-PG,rating_TV-Y,rating_TV-Y7,rating_TV-Y7-FV,rating_UR
0,0.989583,1.0,1.0,0.0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,1.0,1.0,0.0,1.0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,1.0,1.0,0.0,1.0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,1.0,1.0,0.0,1.0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,1.0,1.0,0.0,1.0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [54]:
embeddings = df_concat.values
embeddings.shape

(8807, 3393)

- Modeling

In [55]:
# Given two matrices, cosine_similarity computes the similarity between every pair of rows
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity_matrix = cosine_similarity(embeddings, embeddings)
cosine_similarity_matrix.shape

(8807, 8807)

In [56]:
cosine_similarity_matrix[:2, :2] # similarity between item 0 and item 1 (the diagonal is each item with itself)

array([[1.        , 0.28464788],
       [0.28464788, 1.        ]])
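
To make the computation behind the matrix above concrete, here is a minimal NumPy sketch of row-wise cosine similarity on toy vectors. It illustrates the formula (dot product divided by the product of the norms) and is not scikit-learn's implementation. `cosine_similarity_rows` is a hypothetical helper, and leaving all-zero rows unnormalized is a choice made for this sketch.

```python
import numpy as np

def cosine_similarity_rows(X):
    """Cosine similarity between every pair of rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = X / norms         # each non-zero row now has length 1
    return unit @ unit.T     # dot products of unit vectors

# Rows 0 and 1 point in the same direction; row 2 is orthogonal to both
X = [[1, 0, 1], [2, 0, 2], [0, 3, 0]]
sim = cosine_similarity_rows(X)
print(sim)
```

Because only the direction of each row matters, rows 0 and 1 get similarity 1 even though their magnitudes differ.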

In [57]:
import pickle 
with open('data/cosine_similarity_matrix.pickle','wb') as fw:
    pickle.dump(cosine_similarity_matrix, fw)
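
Before pickling a dense 8807 x 8807 matrix, it can help to estimate its size. The sketch below does the arithmetic, assuming the matrix is stored as float64, and shows a pickle round trip on a small toy matrix cast to float32. The file name and the choice of float32 are illustrative assumptions, not part of the notebook.

```python
import os
import pickle
import tempfile

import numpy as np

n_items = 8807
bytes_float64 = n_items * n_items * np.dtype(np.float64).itemsize
bytes_float32 = n_items * n_items * np.dtype(np.float32).itemsize
print(f"float64: {bytes_float64 / 1e6:.1f} MB, float32: {bytes_float32 / 1e6:.1f} MB")

# Round trip of a toy matrix through pickle, stored as float32
toy = np.eye(3)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_similarity.pickle")
    with open(path, "wb") as fw:
        pickle.dump(toy.astype(np.float32), fw)
    with open(path, "rb") as fr:
        restored = pickle.load(fr)
```

Halving the precision halves the file size, at the cost of small rounding differences in the similarity scores.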

In [58]:
df_copy = df.copy()

def most_similar(idx, top_n=10):
    df_copy['cosine_similarity'] = cosine_similarity_matrix[idx]
    return df_copy.sort_values(by='cosine_similarity', ascending=False)[:top_n]

In [59]:
df_copy.to_csv('data/cb_df.csv', index=False)

- Evaluation

In [60]:
# Operation Chromite (index 7670)
most_similar(7670, top_n=4)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year,cosine_similarity
7670,s7671,Movie,Operation Chromite,John H. Lee,"Jung-jae Lee, Beom-su Lee, Liam Neeson, Se-yeo...",South Korea,"January 15, 2018",2016,NR,111 min,"Action & Adventure, Dramas, International Movies",To pave the way for a major amphibious invasio...,2018,1.0
4192,s4193,Movie,Revenger,Lee Seung-won,"Bruce Khan, Park Hee-soon, Yoon Jin-seo, Kim I...",South Korea,"January 15, 2019",2018,TV-MA,102 min,"Action & Adventure, International Movies",Hell-bent on avenging the murder of his family...,2019,0.791724
4918,s4919,Movie,Psychokinesis,Sang-ho Yeon,"Ryu Seung-ryong, Shim Eun-kyung, Jung-min Park...",South Korea,"April 25, 2018",2018,TV-MA,102 min,"Action & Adventure, Comedies, International Mo...","Suddenly possessed with supernatural powers, a...",2018,0.79089
6190,s6191,Movie,Asura: The City of Madness,Sung-soo Kim,"Jung-min Hwang, Do-won Kwak, Man-sik Jung, Woo...",South Korea,"February 15, 2018",2016,NR,133 min,"Action & Adventure, Dramas, International Movies",Caught between a corrupt mayor and a prosecuto...,2018,0.79049
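
In the result above, the first row is the query title itself (similarity 1.0). A variant that excludes the query before ranking is sketched below on a toy similarity matrix. `top_n_similar` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def top_n_similar(similarity_matrix, idx, top_n=3):
    """Return (index, score) pairs of the top_n most similar items, excluding idx."""
    scores = np.asarray(similarity_matrix[idx], dtype=float).copy()
    scores[idx] = -np.inf  # make sure the query never ranks
    order = np.argsort(-scores, kind="stable")[:top_n]
    return [(int(i), float(scores[i])) for i in order]

# Toy symmetric similarity matrix for three items
sim = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.5],
    [0.9, 0.5, 1.0],
])
result = top_n_similar(sim, idx=0, top_n=2)
print(result)
```

The returned indices can then be used to look up the corresponding rows of the original DataFrame.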


- Deployment

In [61]:
df.index[df['title'] == 'Operation Chromite'][0]

7670

In [62]:
def get_index(title):
    index = df.index[df['title'] == title][0]
    return index

In [63]:
index = get_index('Operation Chromite')
index

7670
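
`get_index` above raises an `IndexError` when the title is not found and requires an exact match. A more forgiving lookup is sketched below using a plain dictionary over a toy title list. It assumes titles are unique after lowercasing and trimming (the EDA above shows unique titles only for the exact strings), and `get_index_safe` and `title_to_index` are names introduced for this illustration.

```python
# Toy stand-in for df['title']
titles = ["Operation Chromite", "Revenger", "Psychokinesis"]

# Map a normalized title to its row position
title_to_index = {t.strip().casefold(): i for i, t in enumerate(titles)}

def get_index_safe(title):
    """Return the row position of a title, ignoring case and outer whitespace."""
    key = title.strip().casefold()
    if key not in title_to_index:
        raise KeyError(f"Title not found: {title!r}")
    return title_to_index[key]

print(get_index_safe("  operation chromite "))
```

An explicit `KeyError` with the offending title is easier to handle in a serving layer than a bare `IndexError`.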

In [64]:
most_similar(index, top_n=4)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year,cosine_similarity
7670,s7671,Movie,Operation Chromite,John H. Lee,"Jung-jae Lee, Beom-su Lee, Liam Neeson, Se-yeo...",South Korea,"January 15, 2018",2016,NR,111 min,"Action & Adventure, Dramas, International Movies",To pave the way for a major amphibious invasio...,2018,1.0
4192,s4193,Movie,Revenger,Lee Seung-won,"Bruce Khan, Park Hee-soon, Yoon Jin-seo, Kim I...",South Korea,"January 15, 2019",2018,TV-MA,102 min,"Action & Adventure, International Movies",Hell-bent on avenging the murder of his family...,2019,0.791724
4918,s4919,Movie,Psychokinesis,Sang-ho Yeon,"Ryu Seung-ryong, Shim Eun-kyung, Jung-min Park...",South Korea,"April 25, 2018",2018,TV-MA,102 min,"Action & Adventure, Comedies, International Mo...","Suddenly possessed with supernatural powers, a...",2018,0.79089
6190,s6191,Movie,Asura: The City of Madness,Sung-soo Kim,"Jung-min Hwang, Do-won Kwak, Man-sik Jung, Woo...",South Korea,"February 15, 2018",2016,NR,133 min,"Action & Adventure, Dramas, International Movies",Caught between a corrupt mayor and a prosecuto...,2018,0.79049


In [None]:
# # 678 items
# # item 0: ~ 10 RMSE
# [5 4.9 ]

# RMSE

# # Movie 1: the rating for this movie is 4 or higher
#       Movie1 Movie2 Movie3 Movie4 Movie5 Movie6 ...
# User   [4     0     5     4     0     3 ... ]
# [0   4   5 4  0 3]

---