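### Content-based movie filtering using genre attributes
#### Content-based filtering
- If a user watched and liked a particular movie, recommend other movies that share similar characteristics, attributes, or components.
    - Ex. If the user enjoyed *Inception*, recommend highly rated movies in Inception's genres (action, science fiction), or other movies directed by Christopher Nolan.
- Similarity between items is judged from the content that makes up a movie: genre, director, cast, rating, keywords, and overview.

- The analysis below compares movies by similarity on the genres column and then recommends those with high ratings.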

#### STEP1. Loading and preprocessing the data

In [4]:
import pandas as pd
import numpy as np
import warnings; warnings.filterwarnings('ignore')

movies = pd.read_csv("tmdb_5000_movies.csv")
print(movies.shape) 
movies.head(1)

(4803, 20)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [5]:
movies.columns 

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')

- Build a new dataset for the recommender system

In [6]:
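movies_df = movies[['id', 'title', 'genres', 'vote_average', 'vote_count', 'popularity',
                   'keywords', 'overview']].copy()  # .copy() avoids SettingWithCopyWarning when columns are assigned later
movies_df.head(1) # genres and keywords are strings holding a list of dicts, so they need preprocessing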

Unnamed: 0,id,title,genres,vote_average,vote_count,popularity,keywords,overview
0,19995,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",7.2,11800,150.437577,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di..."


In [7]:
pd.set_option('max_colwidth', 100)
movies_df[['genres','keywords']][:1]

Unnamed: 0,genres,keywords
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""name"": ""Fantasy""}, {...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"": 2964, ""name"": ""future""}, {""id"": 3386, ""name"": ""sp..."


In [8]:
from ast import literal_eval

# literal_eval parses a string containing a Python literal into the matching object
# (here, a string such as '[{"id": 28, ...}]' becomes a list of dicts).
# Without this step the columns would remain plain strings.
movies_df['genres'] = movies_df['genres'].apply(literal_eval)
movies_df['keywords'] = movies_df['keywords'].apply(literal_eval)

# Keep only the 'name' value from each dict
movies_df['genres'] = movies_df['genres'].apply(lambda x: [ i['name'] for i in x])
movies_df['keywords'] = movies_df['keywords'].apply(lambda x: [ i['name'] for i in x])
movies_df[['genres', 'keywords']][:1]

# genres and keywords now hold just the names, as lists.

Unnamed: 0,genres,keywords
0,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colony, society, space travel, futuristic, romance, spa..."
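To make the parsing step concrete, here is a minimal sketch on a single hand-written string. The `raw` value is a shortened stand-in for one genres cell, not a row copied from the dataset.

```python
from ast import literal_eval

# A shortened stand-in for one raw genres cell (a string, not a list)
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)            # string -> list of dicts
names = [d["name"] for d in parsed]   # keep only the genre names

print(type(raw).__name__, "->", type(parsed).__name__)
print(names)
```

Because this sample string is also valid JSON, `json.loads` would parse it just as well; `literal_eval` is simply the choice made in this notebook.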


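####  STEP2. Measuring genre content similarity
- 1. Convert the genres column (stored as lists) into a count-based feature vector matrix.
- 2. Compare movies by cosine similarity.
- 3. Among movies with high genre similarity, recommend those with the highest ratings.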

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer expects text, so join each genre list into a single space-separated string
movies_df['genres_literal'] = movies_df['genres'].apply(lambda x: ' '.join(x)) # apply + lambda is a handy idiom for this
movies_df.head(1)

Unnamed: 0,id,title,genres,vote_average,vote_count,popularity,keywords,overview,genres_literal
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]",7.2,11800,150.437577,"[culture clash, future, space war, space colony, society, space travel, futuristic, romance, spa...","In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, ...",Action Adventure Fantasy Science Fiction


In [10]:
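count_vect = CountVectorizer(min_df = 1, ngram_range=(1,2)) 
    # min_df : minimum document frequency for a term to enter the vocabulary
    #          (1 keeps every term; recent scikit-learn versions may reject an integer 0)
    # ngram_range : sizes of the token n-grams used to build the vocabulary
        # (1,2) keeps unigrams (single tokens) and bigrams (two adjacent tokens joined into one term)

genre_mat = count_vect.fit_transform(movies_df['genres_literal'])
print(genre_mat.shape)
    # rows: movies, columns: genre unigram/bigram terms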

(4803, 276)
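The following is a simplified, pure-Python imitation of how a unigram-plus-bigram vocabulary arises from one genre string. It is not scikit-learn's implementation: the real tokenizer uses a regular expression and has other defaults, and `word_ngrams` is a hypothetical helper written only for this illustration.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Lowercase, split on whitespace, and emit n-grams for each n in the range."""
    tokens = text.lower().split()
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

terms = word_ngrams("Action Adventure Fantasy Science Fiction")
print(terms)
```

Two side effects are worth noticing. The genre "Science Fiction" is split into the separate tokens `science` and `fiction`, and bigrams such as `fantasy science` straddle two different genres. This is one reason the matrix has far more columns than there are genres.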


In [12]:
from sklearn.metrics.pairwise import cosine_similarity

genre_sim = cosine_similarity(genre_mat, genre_mat)
    # Pairwise genre similarity between every pair of movies (rows) in movies_df.
    # genre_mat is the feature-vectorized genres_literal column; genre_sim is its row-by-row similarity matrix.
print(genre_sim.shape)
print(genre_sim)

(4803, 4803)
[[1.         0.59628479 0.4472136  ... 0.         0.         0.        ]
 [0.59628479 1.         0.4        ... 0.         0.         0.        ]
 [0.4472136  0.4        1.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]
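As a worked illustration of what each entry of the matrix means, the sketch below computes cosine similarity by hand. The count vectors are hypothetical, built over an assumed four-term vocabulary `[drama, crime, drama crime, comedy]`. Returning 0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0   # convention used here for an all-zero vector
    return dot / (norm_u * norm_v)

# Hypothetical count vectors over the terms [drama, crime, drama crime, comedy]
drama_crime = [1, 1, 1, 0]
drama_only  = [1, 0, 0, 0]
comedy_only = [0, 0, 0, 1]
no_genre    = [0, 0, 0, 0]

print(cosine(drama_crime, drama_only))    # partial overlap
print(cosine(drama_crime, comedy_only))   # no shared terms
print(cosine(no_genre, no_genre))         # empty genre list
```

A movie with an empty genre list has an all-zero vector, so its similarity to every movie (itself included) is 0. That is a plausible explanation for the 0 on the diagonal in the second-to-last printed row above.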


In [16]:
# genre_sim_sorted_ind = genre_sim.argsort() # ascending (least similar first)
genre_sim_sorted_ind = genre_sim.argsort()[:, ::-1] 
    # descending: for each row, column indices ordered from most to least similar
print(genre_sim_sorted_ind)

genre_sim_sorted_ind[0]

[[   0 3494  813 ... 3038 3037 2401]
 [ 262    1  129 ... 3069 3067 2401]
 [   2 1740 1542 ... 3000 2999 2401]
 ...
 [4800 3809 1895 ... 2229 2230    0]
 [4802 1594 1596 ... 3204 3205    0]
 [4802 4710 4521 ... 3140 3141    0]]


array([   0, 3494,  813, ..., 3038, 3037, 2401], dtype=int64)

genre_sim_sorted_ind[0]
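- Result: [   0, 3494,  813, ..., 3038, 3037, 2401]
- Meaning: apart from itself (index 0), movie 0 is most similar to movie 3494, while movie 2401 comes last in the ordering and is therefore among the least similar. Movies with equal similarity are ordered arbitrarily.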
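The descending-argsort step can be checked on a tiny example. The 3x3 similarity matrix below is hypothetical and has no ties, so the ordering is unambiguous.

```python
import numpy as np

# Hypothetical 3x3 similarity matrix (rows and columns are movies)
sim = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.5],
    [0.7, 0.5, 1.0],
])

# argsort sorts ascending within each row; reversing the columns gives descending order
sorted_ind = sim.argsort()[:, ::-1]
print(sorted_ind)
```

Each row starts with the movie's own index here because its self-similarity of 1.0 is the unique maximum. When several movies share the maximum similarity, as happens with identical genre lists, the movie itself is not guaranteed to come first.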

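#### STEP3. Recommending movies with genre content filtering
- find_sim_movie: the recommendation function
- Inputs: the DataFrame, the sorted similarity index matrix, the movie title, and the number of movies to recommend.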

In [38]:
def find_sim_movie(df, sorted_ind, title_name, top_n = 10):
    title_movie = df[df['title'] == title_name]    # row(s) whose title matches title_name
    title_index = title_movie.index.values         # row index of that movie as an array (see the reference section below)
    similar_indexes = sorted_ind[title_index, :(top_n)] # from the sorted index matrix, take the movie's row
                                                        # and its first top_n columns (the most similar movies)
    print(similar_indexes)
    
    similar_indexes = similar_indexes.reshape(-1)  # flatten to a 1-D array
    
    return df.iloc[similar_indexes]

In [39]:
find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather', 10)

[[2731 1243 3636 1946 2640 4065 1847 4217  883 3866]]


Unnamed: 0,id,title,genres,vote_average,vote_count,popularity,keywords,overview,genres_literal
2731,240,The Godfather: Part II,"[Drama, Crime]",8.3,3338,105.792936,"[italo-american, cuba, vororte, melancholy, praise, revenge, mafia, lawyer, blood, corrupt polit...","In the continuing saga of the Corleone crime family, a young Vito Corleone grows up in Sicily an...",Drama Crime
1243,203,Mean Streets,"[Drama, Crime]",7.2,345,17.002096,"[epilepsy, protection money, secret love, money, redemption]","A small-time hood must choose from among love, friendship and the chance to rise within the mob.",Drama Crime
3636,36351,Light Sleeper,"[Drama, Crime]",5.7,15,6.063868,"[suicide, drug dealer, redemption, addict, existentialism]",A drug dealer with upscale clientele is having moral problems going about his daily deliveries. ...,Drama Crime
1946,11699,The Bad Lieutenant: Port of Call - New Orleans,"[Drama, Crime]",6.0,326,17.339852,"[police brutality, organized crime, policeman, illegal drugs, murder investigation, corrupt cop]","Terrence McDonagh, a New Orleans Police sergeant, who starts out as a good cop, receiving a meda...",Drama Crime
2640,400,Things to Do in Denver When You're Dead,"[Drama, Crime]",6.7,85,6.932221,"[father son relationship, bounty hunter, boat, way of life, coffin, denver, godmother, paranoia,...",A mafia film in Tarantino style with a star-studded cast. Jimmy’s “The Saint” gangster career ha...,Drama Crime
4065,364083,Mi America,"[Drama, Crime]",0.0,0,0.039007,"[new york state, hate crime]","A hate-crime has been committed in a the small city of Braxton, N.Y. Five migrant laborers have ...",Drama Crime
1847,769,GoodFellas,"[Drama, Crime]",8.2,3128,63.654244,"[prison, based on novel, florida, 1970s, mass murder, irish-american, drug traffic, biography, b...","The true story of Henry Hill, a half-Irish, half-Sicilian Brooklyn kid who is adopted by neighbo...",Drama Crime
4217,9344,Kids,"[Drama, Crime]",6.8,279,13.291991,"[puberty, first time]",A controversial portrayal of teens in New York City which exposes a deeply disturbing world of s...,Drama Crime
883,640,Catch Me If You Can,"[Drama, Crime]",7.7,3795,73.944049,"[con man, biography, fbi agent, overhead camera shot, attempted jailbreak, engagement party, mis...","A true story about Frank Abagnale Jr. who, before his 19th birthday, successfully conned million...",Drama Crime
3866,598,City of God,"[Drama, Crime]",8.1,1814,44.356711,"[male nudity, street gang, brazilian, photographer, 1970s, puberty, ghetto, gang war, coming of ...",Cidade de Deus is a shantytown that started during the 1960s and became one of Rio de Janeiro’s ...,Drama Crime


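`Interpreting the result`
- The Godfather: Part II is recommended first, followed by Mean Streets and others.
- Many of the titles are obscure, and Mi America has a rating of 0.0 (with 0 votes).
- The recommender therefore needs improvement.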

##### Reference: how find_sim_movie works

In [24]:
# title_index: the row index of the movie we are looking up

title_movie = movies_df[movies_df['title'] == "The Godfather"] # row matching the given title_name
title_movie
    # The index label here is 3337; title_movie.index.values (an attribute, not a function)
    # returns it as a NumPy array that can be used to index the sorted similarity matrix.

Unnamed: 0,id,title,genres,vote_average,vote_count,popularity,keywords,overview,genres_literal
3337,238,The Godfather,"[Drama, Crime]",8.4,5893,143.659698,"[italy, love at first sight, loss of father, patriarch, organized crime, mafia, lawyer, italian ...","Spanning the years 1945 to 1955, a chronicle of the fictional Italian-American Corleone crime fa...",Drama Crime


In [31]:
# Result of similar_indexes

title_index = title_movie.index.values
similar_indexes = genre_sim_sorted_ind[title_index, :10]
similar_indexes # 2-D array, i.e. nested as [[...]]

array([[2731, 1243, 3636, 1946, 2640, 4065, 1847, 4217,  883, 3866]],
      dtype=int64)

In [37]:
# similar_indexes after conversion to a 1-D array

similar_indexes = similar_indexes.reshape(-1) # flatten to 1-D, giving [...]
similar_indexes

array([2731, 1243, 3636, 1946, 2640, 4065, 1847, 4217,  883, 3866],
      dtype=int64)
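The reason the result is two-dimensional before the reshape can be seen on a small, hypothetical index matrix: indexing the rows with an array (which is what `.index.values` provides) keeps a leading axis even when only one row matches.

```python
import numpy as np

# Hypothetical sorted-index matrix for three movies
sorted_ind = np.array([[0, 2, 1],
                       [1, 2, 0],
                       [2, 0, 1]])

title_index = np.array([1])            # shape (1,), like .index.values for one matching row
picked = sorted_ind[title_index, :2]   # array index + slice keeps a leading axis
flat = picked.reshape(-1)              # collapse to 1-D for use with df.iloc

print(picked.shape, flat.shape)
```

If a title matched more than one row, `picked` would contain one row per match and `reshape(-1)` would concatenate them, so the function implicitly assumes titles are unique.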

In [36]:
movies_df.iloc[similar_indexes]

Unnamed: 0,id,title,genres,vote_average,vote_count,popularity,keywords,overview,genres_literal
2731,240,The Godfather: Part II,"[Drama, Crime]",8.3,3338,105.792936,"[italo-american, cuba, vororte, melancholy, praise, revenge, mafia, lawyer, blood, corrupt polit...","In the continuing saga of the Corleone crime family, a young Vito Corleone grows up in Sicily an...",Drama Crime
1243,203,Mean Streets,"[Drama, Crime]",7.2,345,17.002096,"[epilepsy, protection money, secret love, money, redemption]","A small-time hood must choose from among love, friendship and the chance to rise within the mob.",Drama Crime
3636,36351,Light Sleeper,"[Drama, Crime]",5.7,15,6.063868,"[suicide, drug dealer, redemption, addict, existentialism]",A drug dealer with upscale clientele is having moral problems going about his daily deliveries. ...,Drama Crime
1946,11699,The Bad Lieutenant: Port of Call - New Orleans,"[Drama, Crime]",6.0,326,17.339852,"[police brutality, organized crime, policeman, illegal drugs, murder investigation, corrupt cop]","Terrence McDonagh, a New Orleans Police sergeant, who starts out as a good cop, receiving a meda...",Drama Crime
2640,400,Things to Do in Denver When You're Dead,"[Drama, Crime]",6.7,85,6.932221,"[father son relationship, bounty hunter, boat, way of life, coffin, denver, godmother, paranoia,...",A mafia film in Tarantino style with a star-studded cast. Jimmy’s “The Saint” gangster career ha...,Drama Crime
4065,364083,Mi America,"[Drama, Crime]",0.0,0,0.039007,"[new york state, hate crime]","A hate-crime has been committed in a the small city of Braxton, N.Y. Five migrant laborers have ...",Drama Crime
1847,769,GoodFellas,"[Drama, Crime]",8.2,3128,63.654244,"[prison, based on novel, florida, 1970s, mass murder, irish-american, drug traffic, biography, b...","The true story of Henry Hill, a half-Irish, half-Sicilian Brooklyn kid who is adopted by neighbo...",Drama Crime
4217,9344,Kids,"[Drama, Crime]",6.8,279,13.291991,"[puberty, first time]",A controversial portrayal of teens in New York City which exposes a deeply disturbing world of s...,Drama Crime
883,640,Catch Me If You Can,"[Drama, Crime]",7.7,3795,73.944049,"[con man, biography, fbi agent, overhead camera shot, attempted jailbreak, engagement party, mis...","A true story about Frank Abagnale Jr. who, before his 19th birthday, successfully conned million...",Drama Crime
3866,598,City of God,"[Drama, Crime]",8.1,1814,44.356711,"[male nudity, street gang, brazilian, photographer, 1970s, puberty, ghetto, gang war, coming of ...",Cidade de Deus is a shantytown that started during the 1960s and became one of Rio de Janeiro’s ...,Drama Crime


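### An improved approach
- The previous approach only computed the similarity matrix and recommended the most similar movies. This time, widen the candidate list and then factor in ratings as well.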

In [49]:
# First, take a look at the rating columns, sorted by vote_average.

movies_df[['title','vote_average','vote_count']].sort_values('vote_average', ascending=False)[:10]
    # Stiff Upper Lips, for example, has a 10.0 rating, but it comes from a single vote, so the average is unreliable.

Unnamed: 0,title,vote_average,vote_count
3519,Stiff Upper Lips,10.0,1
4247,Me You and Five Bucks,10.0,2
4045,"Dancer, Texas Pop. 81",10.0,1
4662,Little Big Top,10.0,1
3992,Sardaarji,9.5,2
2386,One Man's Hero,9.3,2
2970,There Goes My Baby,8.5,2
1881,The Shawshank Redemption,8.5,8205
2796,The Prisoner of Zenda,8.4,11
3337,The Godfather,8.4,5893


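- Use a weighted rating
    - Weighted rating = (v/(v+m))*R + (m/(v+m))*C
    - v: vote_count (number of votes for the movie)
    - m: vote-count threshold that controls the weighting; the further a movie's vote count exceeds m, the more its own average counts
    - R: vote_average (the movie's own average rating)
    - C: mean rating across all movies (movies_df['vote_average'].mean())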

In [53]:
m= movies_df['vote_count'].quantile(0.6) # 60th percentile of vote_count: 60% of movies have this many votes or fewer
C= movies_df['vote_average'].mean() 

print("m:",m, ", C:",C)

m: 370.1999999999998 , C: 6.092171559442011
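Plugging the printed `m` and `C` into the formula shows how the weighting behaves. The sketch uses two rows from the rating table above: The Godfather (5893 votes, average 8.4) and Stiff Upper Lips (1 vote, average 10.0). `weighted_rating` is a standalone helper for this illustration, with `m` rounded to 370.2.

```python
m = 370.2                  # vote-count threshold printed above (rounded)
C = 6.092171559442011      # mean vote_average printed above

def weighted_rating(v, R, m=m, C=C):
    """Blend a movie's own average R with the global mean C according to its vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

godfather = weighted_rating(v=5893, R=8.4)   # many votes: stays close to its own average
one_vote  = weighted_rating(v=1, R=10.0)     # one vote: pulled almost entirely to the mean

print(round(godfather, 3), round(one_vote, 3))
```

With thousands of votes the score barely moves from the movie's own average, while a single 10.0 vote ends up only slightly above the global mean.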


In [58]:
# Function that computes the weighted rating

percentile = 0.6
m= movies_df['vote_count'].quantile(percentile) 
C= movies_df['vote_average'].mean() 

def weighted_vote_average(record):
    v= record['vote_count']
    R= record['vote_average']
    
    return (v/(v+m))*R + (m/(v+m))*C
    # (v/(v+m))*R : the more votes a movie has, the more weight its own average receives
    # (m/(v+m))*C : the fewer votes it has, the more the result is pulled toward the overall mean C

# Create the new column
movies_df['weighted_vote'] = movies_df.apply(weighted_vote_average, axis = 1)

In [59]:
movies_df[['title','vote_average','weighted_vote','vote_count']].sort_values('weighted_vote', ascending = False)[:10]

Unnamed: 0,title,vote_average,weighted_vote,vote_count
1881,The Shawshank Redemption,8.5,8.396052,8205
3337,The Godfather,8.4,8.263591,5893
662,Fight Club,8.3,8.216455,9413
3232,Pulp Fiction,8.3,8.207102,8428
65,The Dark Knight,8.2,8.13693,12002
1818,Schindler's List,8.3,8.126069,4329
3865,Whiplash,8.3,8.123248,4254
809,Forrest Gump,8.2,8.105954,7927
2294,Spirited Away,8.3,8.105867,3840
2731,The Godfather: Part II,8.3,8.079586,3338


- Much better. Spirited Away (Korean title: 센과 치히로의 행방불명) makes the list!
- So let's use this rating, weighted by the number of votes.

- As before, first filter by similarity, then take the top_n movies ordered by weighted_vote.
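The two-stage idea (widen the candidate pool by similarity, then rank by weighted rating) can be sketched without pandas. Everything below is hypothetical toy data, and `recommend` is an illustrative helper rather than the notebook's function.

```python
# Hypothetical catalogue: index -> (title, weighted_vote)
catalog = {
    0: ("Query Movie", 8.0),
    1: ("Movie A", 6.1),
    2: ("Movie B", 7.9),
    3: ("Movie C", 5.4),
    4: ("Movie D", 7.2),
    5: ("Movie E", 6.8),
}
# Hypothetical similarity ordering for the query movie (most similar first)
ordered = [1, 0, 3, 2, 4, 5]

def recommend(query_idx, ordered, catalog, top_n=2):
    candidates = ordered[: top_n * 2]                           # stage 1: widen the pool
    candidates = [i for i in candidates if i != query_idx]      # drop the query itself
    candidates.sort(key=lambda i: catalog[i][1], reverse=True)  # stage 2: rank by weighted vote
    return [catalog[i][0] for i in candidates[:top_n]]

result = recommend(0, ordered, catalog)
print(result)
```

Movie D has a higher weighted vote than Movie A but is never considered, because it falls outside the `top_n * 2` pool. The pool size is therefore a trade-off between staying close in genre and surfacing well-rated titles.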

In [64]:
# Same function as before; comments mark only what changed.

def find_sim_movie(df, sorted_ind, title_name, top_n = 10):
    title_movie = df[df['title'] == title_name]    
    title_index = title_movie.index.values       
    similar_indexes = sorted_ind[title_index, :(top_n*2)] # unlike before, take top_n*2 candidates
                                                       
    print(similar_indexes)
    
    similar_indexes = similar_indexes.reshape(-1)  
    
    #---- added / changed
    similar_indexes = similar_indexes[similar_indexes != title_index] # drop the query movie itself (the first version did not do this)
    # return df.iloc[similar_indexes] # previous return value
    return df.iloc[similar_indexes].sort_values('weighted_vote', ascending= False)[:top_n]
        # finally, sort by weighted_vote and keep the top_n rows

In [68]:
similar_movies= find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather', 10)
similar_movies[['title','vote_average', 'vote_count','weighted_vote']]
    # weighted_vote tends to be high when both vote_count and vote_average are reasonably high.

[[2731 1243 3636 1946 2640 4065 1847 4217  883 3866 3112 4041  588 3337
  3378  281 1663 1464 1149 2839]]


Unnamed: 0,title,vote_average,vote_count,weighted_vote
2731,The Godfather: Part II,8.3,3338,8.079586
1847,GoodFellas,8.2,3128,7.976937
3866,City of God,8.1,1814,7.759693
1663,Once Upon a Time in America,8.2,1069,7.657811
883,Catch Me If You Can,7.7,3795,7.557097
281,American Gangster,7.4,1502,7.141396
4041,This Is England,7.4,363,6.739664
1149,American Hustle,6.8,2807,6.717525
1243,Mean Streets,7.2,345,6.626569
2839,Rounders,6.9,439,6.530427


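#### Interpreting the result
- The recommendations are better than before.
- Only genre and rating were used here; other factors such as cast and director were not considered.
- Next step: look at some related work that improves on this.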