# Content-Based Filtering

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import ast
from ast import literal_eval as lit
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

In [2]:
data = pd.read_csv('tmdb_5000_movies.csv')
data.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [4]:
data = data[['id','genres','vote_average','vote_count','popularity','title','keywords','overview']]
data.head(2)

Unnamed: 0,id,genres,vote_average,vote_count,popularity,title,keywords,overview
0,19995,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",7.2,11800,150.437577,Avatar,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di..."
1,285,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",6.9,4500,139.082615,Pirates of the Caribbean: At World's End,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha..."


### Feature Explanation
- genres : movie genres
- keywords : keywords describing the movie
- original_language : original language of the movie
- title : title
- vote_average : average rating
- vote_count : number of ratings
- popularity : popularity score
- overview : plot overview

### Derived variable for the average rating (vote_average)
- Unfairness of the raw variable : when vote_count is low, the average rating is unreliable
- Create a derived variable that takes vote_count into account
  
<b><center> WR = (v / (v + m)) x R + (m / (v + m)) x C   </center></b>
  
- R : average rating of the individual movie
- v : number of votes for the movie
- m : minimum votes required to be listed in the Top N (the quoted formula uses 25,000; this notebook uses the 90th percentile of vote_count)
- C : mean rating across all movies
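As a rough illustration of how the formula behaves, the sketch below plugs in the values this notebook prints further down (m = 1838.4, C = 6.092) together with Avatar's row from the table above. The second movie, with only 10 votes, is a hypothetical example added for contrast and is not in the dataset.

```python
def weighted_rating(v: float, R: float, m: float, C: float) -> float:
    """WR = (v / (v + m)) * R + (m / (v + m)) * C"""
    return (v / (v + m)) * R + (m / (v + m)) * C


# m and C are the rounded values printed by the notebook.
m, C = 1838.4, 6.092

# Avatar: many votes, so WR stays close to its own average rating R = 7.2.
avatar = weighted_rating(v=11800, R=7.2, m=m, C=C)

# Hypothetical movie with very few votes: WR is pulled toward the global mean C.
few_votes = weighted_rating(v=10, R=9.5, m=m, C=C)

print(f"Avatar:    {avatar:.2f}")     # about 7.05
print(f"Few votes: {few_votes:.2f}")  # about 6.11
```

The more votes a movie has relative to m, the more weight its own average R receives; with few votes the score stays near C no matter how high R is.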



In [6]:
# assigning values for each parameter
m = data['vote_count'].quantile(0.90)
C = data['vote_average'].mean()
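# .copy() makes data_90 an independent DataFrame, so adding columns below does not raise SettingWithCopyWarning
data_90 = data[data['vote_count'] >= m].copy()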
print("C : %.3f" %C)
print("m : %.3f" %m)

# Calculating Weighted Rating
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    return (v / (v + m) * R) + (m / (v + m) * C)

# Calculating score using Weighted Rating
data_90['score'] = data_90.apply(weighted_rating, axis = 1)


C : 6.092
m : 1838.400




In [7]:
data_90.head(2)

Unnamed: 0,id,genres,vote_average,vote_count,popularity,title,keywords,overview,score
0,19995,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",7.2,11800,150.437577,Avatar,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di...",7.050669
1,285,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",6.9,4500,139.082615,Pirates of the Caribbean: At World's End,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha...",6.665696


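## Recommendation Algorithm
- Content-based filtering recommends items that are similar to each other
- To select similar items (content) in a movie recommender, recommendations are based on <b>genres and keywords</b> (the similarity computed in this notebook uses genres only)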
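To make "similar items" concrete, here is a minimal pure-Python sketch of cosine similarity on genre strings taken from the tables in this notebook. It is a simplification and not the notebook's method: it counts single words only, whereas the `CountVectorizer` used below also counts 2- and 3-word sequences. The helper name `cosine` is made up for this illustration.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between two space-separated genre strings (word counts only)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(n * n for n in ca.values()))
    norm_b = math.sqrt(sum(n * n for n in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


target = "Drama Action Crime Thriller"  # The Dark Knight
candidates = {
    "Scarface": "Action Crime Drama Thriller",
    "Taxi Driver": "Crime Drama",
    "Avatar": "Action Adventure Fantasy Science Fiction",
}

# Rank candidates from most to least similar to the target.
scores = {title: cosine(target, genres) for title, genres in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['Scarface', 'Taxi Driver', 'Avatar']
```

Movies that share every genre with the target score 1.0, and the score drops as fewer genre words overlap.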

In [8]:
# Data Preprocessing
data_90['genres'] = data_90['genres'].apply(ast.literal_eval)
data_90['keywords'] = data_90['keywords'].apply(ast.literal_eval)

data_90['genres_str'] = data_90['genres'].apply(lambda x : [d['name'] for d in x]).apply(lambda x : " ".join(x))
data_90['keywords_str'] = data_90['keywords'].apply(lambda x : [d['name'] for d in x]).apply(lambda x : " ".join(x))

data_90.head(2)


Unnamed: 0,id,genres,vote_average,vote_count,popularity,title,keywords,overview,score,genres_str,keywords_str
0,19995,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",7.2,11800,150.437577,Avatar,"[{'id': 1463, 'name': 'culture clash'}, {'id':...","In the 22nd century, a paraplegic Marine is di...",7.050669,Action Adventure Fantasy Science Fiction,culture clash future space war space colony so...
1,285,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",6.9,4500,139.082615,Pirates of the Caribbean: At World's End,"[{'id': 270, 'name': 'ocean'}, {'id': 726, 'na...","Captain Barbossa, long believed to be dead, ha...",6.665696,Adventure Fantasy Action,ocean drug abuse exotic island east india trad...


In [9]:
# content based filtering
count_vector = CountVectorizer(ngram_range = (1, 3))
c_vector_genres = count_vector.fit_transform(data_90['genres_str'])
c_vector_genres.shape

(481, 364)

In [10]:
# For each movie, store the row positions of all movies sorted by descending cosine similarity
genre_cosine_sim = cosine_similarity(c_vector_genres, c_vector_genres).argsort()[:, ::-1]
genre_cosine_sim.shape

(481, 481)

### Function notes
#### - argsort() : returns the positions of the elements in ascending order of value
e.g. x = np.array([2, 4, 3, 5]) -> x.argsort() returns [0, 2, 1, 3]  
#### - reshape(-1) : flattens an m x n matrix into a 1-D array of length m x n
e.g.  
[[1, 2, 3],  ->  [1, 2, 3, 4, 5, 6]  
[4, 5, 6]]
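The two functions can be tried on a small example. The 3 x 3 matrix `sim` below is made-up data standing in for a cosine-similarity matrix; the slicing mirrors the `argsort()[:, ::-1]` and `reshape(-1)` pattern used in this notebook.

```python
import numpy as np

x = np.array([2, 4, 3, 5])
order = x.argsort()  # positions in ascending order of value
print(order)         # [0 2 1 3]

# Made-up similarity matrix: sim[i, j] is the similarity of item i to item j.
sim = np.array([[1.0, 0.2, 0.7],
                [0.2, 1.0, 0.5],
                [0.7, 0.5, 1.0]])

# argsort sorts each row ascending; [:, ::-1] reverses the columns so that
# the most similar item comes first in every row.
ranked = sim.argsort()[:, ::-1]
print(ranked[0])     # [0 2 1]

# Selecting rows with an index array keeps the result 2-D;
# reshape(-1) flattens it to a 1-D array of positions.
top2 = ranked[np.array([0]), :2].reshape(-1)
print(top2)          # [0 2]
```

Each row of `ranked` starts with the item itself, since an item's similarity to itself is the largest value in its row here; this is why the recommendation function has to exclude the target movie.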

In [11]:
def get_recommend_movie_list(df, movie_title, top=30):

    # Find the row position of the target movie. Positions (not index labels) are
    # required because genre_cosine_sim is indexed by row position within df.
    target_movie_index = np.flatnonzero((df['title'] == movie_title).to_numpy())

    # Take the positions of the `top` movies most similar to the target
    sim_index = genre_cosine_sim[target_movie_index, :top].reshape(-1)

    # Exclude the target movie itself
    sim_index = sim_index[~np.isin(sim_index, target_movie_index)]

    # Build the result, sort by weighted score, and return the top 10
    result = df.iloc[sim_index].sort_values('score', ascending=False)[:10]

    return result

In [12]:
get_recommend_movie_list(data_90, movie_title = 'The Dark Knight Rises')

Unnamed: 0,id,genres,vote_average,vote_count,popularity,title,keywords,overview,score,genres_str,keywords_str
65,155,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",8.2,12002,187.322927,The Dark Knight,"[{'id': 849, 'name': 'dc comics'}, {'id': 853,...",Batman raises the stakes in his war on crime. ...,7.92002,Drama Action Crime Thriller,dc comics crime fighter secret identity scarec...
2091,274,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",8.1,4443,18.174804,The Silence of the Lambs,"[{'id': 818, 'name': 'based on novel'}, {'id':...","FBI trainee, Clarice Starling ventures into a ...",7.512362,Crime Drama Thriller,based on novel psychopath horror suspense seri...
351,1422,"[{'id': 18, 'name': 'Drama'}, {'id': 53, 'name...",7.9,4339,63.429157,The Departed,"[{'id': 1568, 'name': 'undercover'}, {'id': 16...","To take down South Boston's Irish Mafia, the p...",7.361989,Drama Thriller Crime,undercover boston police friends mafia underco...
2760,264644,"[{'id': 18, 'name': 'Drama'}, {'id': 53, 'name...",8.1,2757,66.11334,Room,"[{'id': 818, 'name': 'based on novel'}, {'id':...",Jack is a young boy of 5 years old who has liv...,7.296764,Drama Thriller,based on novel carpet isolation kidnapping imp...
1850,111,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",8.0,2948,70.105981,Scarface,"[{'id': 416, 'name': 'miami'}, {'id': 417, 'na...",After getting a green card in exchange for ass...,7.267226,Action Crime Drama Thriller,miami corruption capitalism cuba prohibition b...
828,24,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",7.7,4949,79.754966,Kill Bill: Vol. 1,"[{'id': 233, 'name': 'japan'}, {'id': 732, 'na...",An assassin is shot at the altar by her ruthle...,7.264512,Action Crime,japan coma martial arts kung fu underworld yak...
1051,146233,"[{'id': 18, 'name': 'Drama'}, {'id': 53, 'name...",7.9,3085,88.496873,Prisoners,"[{'id': 904, 'name': 'pennsylvania'}, {'id': 1...",When Keller Dover's daughter and her friend go...,7.224956,Drama Thriller Crime,pennsylvania kidnapping maze vigilante rural s...
119,272,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",7.5,7359,115.040024,Batman Begins,"[{'id': 486, 'name': 'himalaya'}, {'id': 779, ...","Driven by tragedy, billionaire Bruce Wayne ded...",7.2186,Action Crime Drama,himalaya martial arts dc comics crime fighter ...
4337,103,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",8.0,2535,58.845025,Taxi Driver,"[{'id': 422, 'name': 'vietnam veteran'}, {'id'...",A mentally unstable Vietnam War veteran works ...,7.198026,Crime Drama,vietnam veteran taxi obsession drug dealer nig...
3701,641,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",7.9,2443,11.573034,Requiem for a Dream,"[{'id': 1803, 'name': 'drug addiction'}, {'id'...",The hopes and dreams of four ambitious people ...,7.123732,Crime Drama,drug addiction junkie heroin speed diet unsoci...


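### Movies recommended next to a viewer who watched The Dark Knight Rises
- The Dark Knight
- The Silence of the Lambs
- The Departed
- Room
- Scarface
- Kill Bill: Vol. 1
- Prisoners
- Batman Begins
- Taxi Driver
- Requiem for a Dream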