# Content-Based Movie Filtering Using Genre Attributes

## Loading and Preparing the Data
- TMDB 5000 dataset: metadata for about 5,000 major movies from the movie information site The Movie Database (TMDB), processed and distributed as a dataset on Kaggle
- https://www.kaggle.com/tmdb/tmdb-movie-metadata

In [3]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings(action='ignore')

movies = pd.read_csv('./dataset/tmdb_5000_movies.csv')
print(movies.shape)
display(movies.head())

(4803, 20)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


- Extract the main columns used in the analysis
  - id, title, genres, vote_average (average rating), vote_count (number of votes), popularity, keywords, overview (plot summary)

In [4]:
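# .copy() makes movies_df an independent DataFrame, so later column assignments do not write to a slice of movies
movies_df = movies[['id', 'title', 'genres', 'vote_average', 'vote_count', 'popularity', 'keywords', 'overview']].copy()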

In [5]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            4803 non-null   int64  
 1   title         4803 non-null   object 
 2   genres        4803 non-null   object 
 3   vote_average  4803 non-null   float64
 4   vote_count    4803 non-null   int64  
 5   popularity    4803 non-null   float64
 6   keywords      4803 non-null   object 
 7   overview      4800 non-null   object 
dtypes: float64(2), int64(2), object(4)
memory usage: 300.3+ KB


In [9]:
pd.set_option('display.max_colwidth', 200)

In [10]:
movies_df[['genres', 'keywords']][:1]

Unnamed: 0,genres,keywords
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""name"": ""Fantasy""}, {""id"": 878, ""name"": ""Science Fiction""}]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"": 2964, ""name"": ""future""}, {""id"": 3386, ""name"": ""space war""}, {""id"": 3388, ""name"": ""space colony""}, {""id"": 3679, ""name"": ""society""}, {""id"": 3801, ""name..."


In [11]:
movies_df['genres'].values[0] # looks like a list, but the value is actually a string

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

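- eval() and literal_eval()
  - eval() evaluates a Python expression given as a string, so it can also call functions and create arbitrary objects
  - literal_eval() differs from eval() in that it only converts Python literals (strings, numbers, tuples, lists, dicts, sets, booleans, None), which makes it safer for parsing data

- Use literal_eval() to convert the values in the genres and keywords columns into list objects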
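
As a minimal sketch of the difference described above, the snippet below parses a genres-style string with literal_eval() and shows that a function-call expression is rejected. The `raw` string is a shortened, illustrative value, not a row copied from the dataset.

```python
from ast import literal_eval

# A genres-style value as stored in the CSV: a string, not a list (illustrative sample).
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)               # becomes a list of dicts
names = [g["name"] for g in parsed]      # keep only the genre names
print(type(parsed).__name__, names)

# literal_eval() refuses anything that is not a plain literal, such as a function call.
try:
    literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print("call expression rejected:", rejected)
```

eval() would have executed the second expression, which is why literal_eval() is the safer choice for untrusted column values.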

In [12]:
from ast import literal_eval
movies_df['genres'] = movies_df['genres'].apply(literal_eval)
movies_df['keywords'] = movies_df['keywords'].apply(literal_eval)

In [13]:
movies_df['genres'].values[0]  # now a list object rather than a string

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]

- Extract only the value of the name key from each element of the genres and keywords columns and build a list of names

In [17]:
movies_df['genres'] = movies_df['genres'].apply(lambda x : [ y['name'] for y in x ])
movies_df['keywords'] = movies_df['keywords'].apply(lambda x : [ y['name'] for y in x ])

In [18]:
movies_df['genres'].values[0]

['Action', 'Adventure', 'Fantasy', 'Science Fiction']

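## Measuring Genre Content Similarity
- The genres column, now a list, is converted to count-based feature vectors
- Use sklearn's CountVectorizer
- The feature matrix built from the genre strings is compared pairwise using cosine similarity
- Movies with high genre similarity are then recommended in order of rating
- Feature vectorization: each movie's genres are encoded as a sparse vector (1 for the genre tokens the movie has, 0 otherwise), and the vectors are stacked into a sparse matrix

[Note] CountVectorizer
- Counts the occurrences of each token (word) in a text and turns the counts into a numeric vector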

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

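vectorizer = CountVectorizer(ngram_range=(1,1))
# ngram_range: range of n-gram sizes to extract as features, (min_n, max_n); n > 1 preserves some word order
# (1,1): extract single words only
# (1,2): extract single tokens and also pairs of consecutive tokens
# Useful when two adjacent words combine into a new meaning
# The sample documents read 'first document test' and 'second document test'
vectorizer.fit(['첫번째 문서 테스트', '두번째 문서 테스트']) # fit a CountVectorizer that learns a 4-word vocabulary
print(vectorizer.vocabulary_) # each unique word is assigned its own index
counts = vectorizer.transform(['직접 첫번째 테스트 두번째 테스트']) # count words in a new document using the learned vocabulary; the unseen word '직접' is ignored
print(counts)  # sparse matrix: (row, column index) followed by the count
print('두번째:0, 문서:1, 첫번째:2, 테스트:3')
print(counts.toarray()) # dense array form

{'첫번째': 2, '문서': 1, '테스트': 3, '두번째': 0}
  (0, 0)	1
  (0, 2)	1
  (0, 3)	2
두번째:0, 문서:1, 첫번째:2, 테스트:3
[[1 0 1 2]]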


In [24]:
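# CountVectorizer expects strings, so join each genre list into a single space-separated string
movies_df['genres_literal'] = movies_df['genres'].apply(lambda x : ' '.join(x))
count_vect = CountVectorizer(min_df=0.0, ngram_range=(1,2))
# min_df: drop terms whose document frequency is below this threshold (int = document count, float = proportion)
# min_df=0.0: no term is dropped; an integer 0 is rejected by recent scikit-learn versions
genre_mat = count_vect.fit_transform(movies_df['genres_literal'])
print(genre_mat.shape)
print(genre_mat.toarray()[:1])
# 4,803 movies (rows) by 276 features (columns)
# The features are unigrams and bigrams of genre words, not 276 distinct genres
# In the first movie's row, the features present in its genre string are 1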

(4803, 276)
[[1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


In [27]:
# Compute cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

genre_sim = cosine_similarity(genre_mat, genre_mat) # entry [i, j] is the similarity between movie i and movie j
print(genre_sim.shape)
print(genre_sim[:3])
# pairwise similarities among all 4,803 movies

(4803, 4803)
[[1.         0.59628479 0.4472136  ... 0.         0.         0.        ]
 [0.59628479 1.         0.4        ... 0.         0.         0.        ]
 [0.4472136  0.4        1.         ... 0.         0.         0.        ]]
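
As a worked check of what cosine_similarity computes, the sketch below vectorizes two genre strings and compares a manual dot-product calculation with the library result. The first string is the genre list shown earlier for the first movie; the second is a hypothetical movie chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Action Adventure Fantasy Science Fiction",  # genres of the first movie shown above
    "Adventure Fantasy Action",                  # hypothetical second movie
]
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs).toarray()

a, b = X[0], X[1]
# cosine similarity = dot product divided by the product of the vector lengths
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
library = cosine_similarity(X)[0, 1]
print(manual, library)
```

The first string produces 9 features, the second 5, and they share 4 of them, so both values equal 4 / (sqrt(9) * sqrt(5)), roughly 0.596.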


In [28]:
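# Sort each row in descending order of similarity and keep the resulting column indices
genre_sim_sorted_idx = genre_sim.argsort()[:,::-1]
print(genre_sim_sorted_idx[:1]) # indices of movies ordered from most to least similar to the first movie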

[[   0 3494  813 ... 3038 3037 2401]]
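
The argsort()[:, ::-1] idiom can be traced on a small matrix. The 3x3 `toy_sim` below is made up for illustration: argsort() returns column indices in ascending order of value for each row, and reversing the columns turns that into descending order.

```python
import numpy as np

# Hypothetical similarity matrix for three items.
toy_sim = np.array([
    [1.0, 0.2, 0.8],
    [0.2, 1.0, 0.5],
    [0.8, 0.5, 1.0],
])

ascending = toy_sim.argsort()      # per-row indices, smallest value first
descending = ascending[:, ::-1]    # reverse each row: largest value first
print(descending)
```

When several entries in a row tie, for example several movies with similarity 1.0, their relative order carries no meaning, so a movie is not guaranteed to appear first in its own row.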


## Recommending Movies with Genre Content Filtering

In [32]:
movies_df[movies_df['title'] == 'The Dark Knight Rises']

Unnamed: 0,id,title,genres,vote_average,vote_count,popularity,keywords,overview,genres_literal
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]",7.6,9106,112.31295,"[dc comics, crime fighter, terrorist, secret identity, burglar, hostage drama, time bomb, gotham city, vigilante, cover-up, superhero, villainess, tragic hero, terrorism, destruction, catwoman, ca...","Following the death of District Attorney Harvey Dent, Batman assumes responsibility for Dent's crimes to protect the late attorney's reputation and is subsequently hunted by the Gotham City Police...",Action Crime Drama Thriller


In [33]:
def find_sim_movie(df, sorted_idx, title_name, top_n=10): # (DataFrame, sorted index matrix, reference movie title, number of recommendations)
    target_movie = df[df['title'] == title_name]

    title_index = target_movie.index.values
    similar_index = sorted_idx[title_index, :top_n] # from the reference movie's row of the similarity-sorted index matrix, take the first top_n entries
    print(similar_index)
    similar_index = similar_index.reshape(-1) # the extracted top_n indices are 2-D, so flatten them to a 1-D vector

    return df.iloc[similar_index]

In [34]:
find_sim_movie(movies_df, genre_sim_sorted_idx, 'The Dark Knight Rises')

[[2195 1850 3316 2218 2435 3073 1503 1470 4230  629]]


Unnamed: 0,id,title,genres,vote_average,vote_count,popularity,keywords,overview,genres_literal
2195,4597,Armored,"[Action, Crime, Drama, Thriller]",5.5,208,15.21229,"[robbery, homeless person, bank, armored car, truck, heist]","A crew of officers at an armored transport security firm risk their lives when they embark on the ultimate heist against their own company. Armed with a seemingly fool-proof plan, the men plan on ...",Action Crime Drama Thriller
1850,111,Scarface,"[Action, Crime, Drama, Thriller]",8.0,2948,70.105981,"[miami, corruption, capitalism, cuba, prohibition, brother sister relationship, loss of sister, cocaine, cult film, bitterness]","After getting a green card in exchange for assassinating a Cuban government official, Tony Montana stakes a claim on the drug trade in Miami. Viciously murdering anyone who stands in his way, Tony...",Action Crime Drama Thriller
3316,11022,Narc,"[Action, Crime, Drama, Thriller]",6.8,142,8.526635,"[assertion, investigation, internal affairs, narcotics cop]","An undercover narc dies, the investigation stalls, so the Detroit P.D. brings back Nick Tellis, fired 18-months ago when a stray bullet hits a pregnant woman. Tellis teams with Henry Oak, a friend...",Action Crime Drama Thriller
2218,11835,Death Sentence,"[Action, Crime, Drama, Thriller]",6.5,297,12.643703,"[loss of son, repayment, revenge, murder, gang, police officer killed, hospital, extreme violence, justice, hoodlum, semiautomatic pistol, finger gun]","Nick Hume is a mild-mannered executive with a perfect life, until one gruesome night he witnesses something that changes him forever. Transformed by grief, Hume eventually comes to the disturbing ...",Action Crime Drama Thriller
2435,7304,Running Scared,"[Action, Crime, Drama, Thriller]",7.0,331,19.311572,"[ice hockey, racism, pedophile, throat slitting, shot in the stomach, head blown off, police investigation, pistol whip, shot in the shoulder, child uses gun, ankle holster, breaking finger]","After a drug-op gone bad, Joey Gazelle is put in charge of disposing the gun that shot a dirty cop. But things goes wrong for Joey after the neighbor kid stole the gun and used it to shoot his abu...",Action Crime Drama Thriller
3073,2088,Romeo Is Bleeding,"[Action, Crime, Drama, Thriller]",5.7,36,4.850402,"[police operation, sex addiction, police, mafia boss, suspense, bad cop, hitwoman]",A corrupt cop gets in over his head when he tries to assassinate a beautiful Russian hit-woman.,Action Crime Drama Thriller
1503,22907,Takers,"[Action, Crime, Drama, Thriller]",6.0,394,18.47242,[heist],"A seasoned team of bank robbers, including Gordon Jennings (Idris Elba), John Rahway (Paul Walker), A.J. (Hayden Christensen), and brothers Jake (Michael Ealy) and Jesse Attica (Chris Brown) succe...",Action Crime Drama Thriller
1470,127493,Stolen,"[Action, Crime, Drama, Thriller]",5.1,344,16.830184,"[taxi driver, thief, fbi agent]","A former thief frantically searches for his missing daughter, who has been kidnapped and locked in the trunk of a taxi.",Action Crime Drama Thriller
4230,507,Killing Zoe,"[Action, Crime, Drama, Thriller]",6.1,111,5.817519,"[paris, prostitute, robbery, drug abuse, aids, bank, jazz, hostage, night life, kidnapping, vault, junkie, bank robber, heroin, friendship, murder, independent film, pistol, violence, drug, bank r...",Zed (Eric Stoltz) is an American vault-cracker who travels to Paris to meet up with his old friend Eric (Jean-Hugues Anglade). Eric and his gang have planned to raid the only bank in the city whic...,Action Crime Drama Thriller
629,136797,Need for Speed,"[Action, Crime, Drama, Thriller]",6.1,1520,54.81489,"[street race, super cars, super speed, car, based on video game, duringcreditsstinger, 3d]","The film revolves around a local street-racer who partners with a rich and arrogant business associate, only to find himself framed by his colleague and sent to prison. After he gets out, he joins...",Action Crime Drama Thriller
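
The results above are ordered by genre similarity only, and several have low ratings. One possible way to also rank by rating, as the approach outlined earlier intends, is sketched below. The function name `find_sim_movie_by_rating`, the choice of twice top_n candidates, and the toy data are all assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def find_sim_movie_by_rating(df, sorted_idx, title_name, top_n=10):
    """Hypothetical variant: take genre-similar candidates, drop the query movie, rank by vote_average."""
    title_index = df[df['title'] == title_name].index.values
    # Take twice as many candidates so that ranking by rating has room to choose.
    candidates = sorted_idx[title_index, :top_n * 2].reshape(-1)
    # Do not recommend the query movie itself.
    candidates = candidates[candidates != title_index[0]]
    return df.iloc[candidates].sort_values('vote_average', ascending=False)[:top_n]

# Toy data, made up for illustration only.
toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'vote_average': [7.0, 5.5, 8.1, 6.2, 9.0],
})
toy_sim = np.array([
    [1.0, 0.9, 0.8, 0.7, 0.0],
    [0.9, 1.0, 0.6, 0.5, 0.1],
    [0.8, 0.6, 1.0, 0.4, 0.2],
    [0.7, 0.5, 0.4, 1.0, 0.3],
    [0.0, 0.1, 0.2, 0.3, 1.0],
])
toy_sorted_idx = toy_sim.argsort()[:, ::-1]

result = find_sim_movie_by_rating(toy_df, toy_sorted_idx, 'A', top_n=2)
print(result)
```

Here movie E has the highest rating but no genre similarity to A, so it never enters the candidate pool. Sorting by raw vote_average still favors movies with very few votes, so a rating weighted by vote_count may be a better ranking key.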
