<a href="https://colab.research.google.com/github/gayoooon1/2021-RecSys/blob/main/Genre_Based_RecSys_using_1M_MovieLens_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Genre-Based Recommendation System using 1M MovieLens Data


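## Method
This week, let's recommend movies by finding each user's taste at the genre level.
1. Data preprocessing: keep only users who have watched at least a suitable number of movies (for example 3 or 5)
2. Training-testing split
3. Compute per-genre popularity rankings from the training data
4. Find each user's genre preferences from the training data
5. Recommend according to genre preference; if a user's genres are diverse, consider using the max genre or the average genre
(Compare performance with the earlier recommendation model that ignores genre)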
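
Step 1 can be sketched as follows. This is a minimal example on a toy frame, assuming the `user_id` column name used when `ratings.dat` is loaded below; `filter_active_users` and `min_ratings` are hypothetical names, not part of the notebook.

```python
import pandas as pd

def filter_active_users(ratings, min_ratings=5, user_col="user_id"):
    """Keep only rows from users with at least `min_ratings` ratings."""
    # transform("count") broadcasts each user's row count back onto that user's rows
    counts = ratings.groupby(user_col)[user_col].transform("count")
    return ratings[counts >= min_ratings]

# Toy data (made up): user 2 has a single rating and is dropped
toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 3, 3],
    "movie_id": [10, 11, 12, 10, 11, 13],
    "rating":   [5, 4, 4, 5, 4, 5],
})
active = filter_active_users(toy, min_ratings=2)
print(sorted(active["user_id"].unique().tolist()))
```

Whether the threshold should be applied before or after the `rating > 3` filter is a design choice the list above leaves open.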

## Data preprocessing

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
cd drive/MyDrive/DEEPLEARNING/ml-1m

/content/drive/MyDrive/DEEPLEARNING/ml-1m


In [None]:
import pandas as pd
import warnings
warnings.filterwarnings(action='ignore')

In [None]:
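# Genre names passed as extra column labels when reading movies.dat.
# Note: ml-1m's movies.dat holds only movie_id::title::genres, so these columns load as NaN.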
genre_cols = [
    "Action", "Adventure", "Animation", "Children", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"
]

In [None]:
movies = pd.read_csv('movies.dat', header=None, sep='::', names=["movie_id", "title", "genres"]+genre_cols, encoding='latin-1', engine='python')
ratings = pd.read_csv('ratings.dat', header=None, sep='::', names = ['user_id', 'movie_id', 'rating', 'unix_timestamp'], encoding='latin-1', engine='python')

In [None]:
movies.head()

Unnamed: 0,movie_id,title,genres,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),Animation|Children's|Comedy,,,,,,,,,,,,,,,,,,
1,2,Jumanji (1995),Adventure|Children's|Fantasy,,,,,,,,,,,,,,,,,,
2,3,Grumpier Old Men (1995),Comedy|Romance,,,,,,,,,,,,,,,,,,
3,4,Waiting to Exhale (1995),Comedy|Drama,,,,,,,,,,,,,,,,,,
4,5,Father of the Bride Part II (1995),Comedy,,,,,,,,,,,,,,,,,,


In [None]:
movies['genres'] = movies['genres'].str.split('|') # split the genre string into a list

In [None]:
ratings.drop('unix_timestamp', inplace=True, axis=1)

In [None]:
ratings = ratings[ratings.rating > 3]

In [None]:
# Join with movies to get the title column
rating_movies = pd.merge(ratings, movies, on='movie_id')

In [None]:
# Compute the number of movies to which a genre is assigned.
genre_occurences = movies[genre_cols].sum().to_dict()
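
Because the `genre_cols` columns come back empty (see the `movies.head()` output above), summing them carries no information. A sketch of an alternative, assuming `genres` has already been split into lists as in the cell above; the toy frame and the name `genre_counts` are illustrative only:

```python
import pandas as pd

# Toy stand-in for `movies` after the genre string has been split into lists
toy_movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [["Animation", "Children's", "Comedy"],
               ["Adventure", "Children's", "Fantasy"],
               ["Comedy", "Romance"]],
})

# explode() yields one row per (movie, genre); value_counts() then counts movies per genre
genre_counts = toy_movies["genres"].explode().value_counts().to_dict()
print(genre_counts["Comedy"], genre_counts["Children's"])
```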

In [None]:
for a in movies['genres']:
  print(a)

['Animation', "Children's", 'Comedy']
['Adventure', "Children's", 'Fantasy']
['Comedy', 'Romance']
['Comedy', 'Drama']
['Comedy']
['Action', 'Crime', 'Thriller']
['Comedy', 'Romance']
['Adventure', "Children's"]
['Action']
['Action', 'Adventure', 'Thriller']
['Comedy', 'Drama', 'Romance']
['Comedy', 'Horror']
['Animation', "Children's"]
['Drama']
['Action', 'Adventure', 'Romance']
['Drama', 'Thriller']
['Drama', 'Romance']
['Thriller']
['Comedy']
['Action']
['Action', 'Comedy', 'Drama']
['Crime', 'Drama', 'Thriller']
['Thriller']
['Drama', 'Sci-Fi']
['Drama', 'Romance']
['Drama']
['Drama']
['Romance']
['Adventure', 'Sci-Fi']
['Drama']
['Drama']
['Drama', 'Sci-Fi']
['Adventure', 'Romance']
["Children's", 'Comedy', 'Drama']
['Drama', 'Romance']
['Drama']
['Documentary']
['Comedy']
['Comedy', 'Romance']
['Drama']
['Drama', 'War']
['Action', 'Crime', 'Drama']
['Drama']
['Action', 'Adventure']
['Comedy', 'Drama']
['Drama', 'Romance']
['Crime', 'Thriller']
['Animation', "Children's", 'Musica

In [None]:
movies.head()

Unnamed: 0,movie_id,title,genres,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,genre,all_genres
0,1,Toy Story (1995),Other,,,,,,,,,,,,,,,,,,,Other,Other
1,2,Jumanji (1995),Other,,,,,,,,,,,,,,,,,,,Other,Other
2,3,Grumpier Old Men (1995),Other,,,,,,,,,,,,,,,,,,,Other,Other
3,4,Waiting to Exhale (1995),Other,,,,,,,,,,,,,,,,,,,Other,Other
4,5,Father of the Bride Part II (1995),Other,,,,,,,,,,,,,,,,,,,Other,Other


In [None]:
rating_movies['genres'][0][0]
#rating_movies['genres']= rating_movies['genres'].apply(lambda x:(' ').join(x))

'Drama'

In [None]:
rating_movies['genres'].isnull().sum()

0

In [None]:
rating_movies['genres'].unique()

array(['Drama', "Animation Children's Comedy", 'Action Adventure Drama',
       'Comedy Drama', "Animation Children's Musical",
       "Adventure Children's Drama Musical", 'Musical', 'Comedy',
       "Animation Children's", 'Comedy Fantasy', 'Comedy Sci-Fi',
       'Drama War', "Animation Children's Musical Romance",
       "Children's Drama Fantasy Sci-Fi", 'Drama Romance',
       "Adventure Animation Children's Comedy Musical",
       "Animation Children's Comedy Musical", 'Thriller',
       'Action Crime Romance', 'Action Adventure Fantasy Sci-Fi',
       "Children's Comedy Musical", 'Action Drama War',
       "Children's Drama", 'Crime Drama Thriller', 'Action Crime Drama',
       'Action Adventure Mystery', 'Crime Drama',
       'Action Adventure Romance Sci-Fi War', 'Action Drama',
       'Comedy Drama Western', 'Comedy Drama War', 'Action Thriller',
       'Action Comedy Western', 'Adventure Comedy Drama',
       'Drama Thriller', 'Action Adventure Sci-Fi Thriller',
       'Com

## Training-Test split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(rating_movies, test_size = 0.05, random_state = 42)

In [None]:
X_train.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp,title,genres
278708,606,3175,5,975892863,Galaxy Quest (1999),"[Adventure, Comedy, Sci-Fi]"
174622,3531,1674,5,966997879,Witness (1985),"[Drama, Romance, Thriller]"
13843,2015,1270,5,974677243,Back to the Future (1985),"[Comedy, Sci-Fi]"
136181,1031,377,5,974999133,Speed (1994),"[Action, Romance, Thriller]"
208617,2784,529,4,973108029,Searching for Bobby Fischer (1993),[Drama]
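
The cells from here on refer to columns as `UserID`, `MovieID`, `Rating`, `Title` and `Genres`, while the loading code above uses `user_id`, `movie_id`, `rating`, `title` and `genres`. One way to reconcile the two is a rename, sketched here on a toy frame. The mapping `COLUMN_MAP` is an assumption, not something the notebook defines.

```python
import pandas as pd

# Hypothetical mapping from the snake_case names to the names used by later cells
COLUMN_MAP = {
    "user_id": "UserID",
    "movie_id": "MovieID",
    "rating": "Rating",
    "title": "Title",
    "genres": "Genres",
}

toy_train = pd.DataFrame({
    "user_id": [606],
    "movie_id": [3175],
    "rating": [5],
    "title": ["Galaxy Quest (1999)"],
    "genres": [["Adventure", "Comedy", "Sci-Fi"]],
})
toy_train = toy_train.rename(columns=COLUMN_MAP)
print(list(toy_train.columns))
```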


In [None]:
X_train['Genres'].head()

40785                                  [Drama]
639739      [Action, Horror, Sci-Fi, Thriller]
157675    [Action, Adventure, Comedy, Romance]
416999                    [Action, Drama, War]
200061                               [Western]
Name: Genres, dtype: object

In [None]:
X_train['Genres'] = X_train['Genres'].apply(lambda x:(' ').join(x))
X_train.head()

Unnamed: 0,UserID,MovieID,Rating,Title,Genres
40785,1980,150,5,Apollo 13 (1995),Drama
639739,3158,2288,2,"Thing, The (1982)",Action Horror Sci-Fi Thriller
157675,3499,380,4,True Lies (1994),Action Adventure Comedy Romance
416999,2570,1233,5,"Boat, The (Das Boot) (1981)",Action Drama War
200061,2857,1266,4,Unforgiven (1992),Western


In [None]:
X_train.groupby('Genres').size()

Genres
Action                                           11740
Action Adventure                                  9886
Action Adventure Animation                         319
Action Adventure Animation Children's Fantasy      128
Action Adventure Animation Horror Sci-Fi           589
                                                 ...  
Sci-Fi Thriller War                                263
Sci-Fi War                                        1297
Thriller                                         16947
War                                                946
Western                                           5380
Length: 301, dtype: int64

In [None]:
X_train[X_train['Genres'] == 'Action'].groupby('Title').size().sort_values(ascending=False).index[:10]

Index(['Goldfinger (1964)', 'Shanghai Noon (2000)',
       'From Russia with Love (1963)', 'Dr. No (1962)',
       'Spy Who Loved Me, The (1977)', 'Under Siege (1992)',
       'Man with the Golden Gun, The (1974)', 'Single White Female (1992)',
       'Live and Let Die (1973)', 'For Your Eyes Only (1981)'],
      dtype='object', name='Title')

## Computing per-genre popularity rankings from the training data

In [None]:
def genres_rank(X_train):
  for genre in X_train['Genres'].unique() :
    ranking = X_train[X_train['Genres'] == genre].groupby('Title').size().sort_values(ascending=False).index[:10].tolist()
    print(genre, '\n',ranking, '\n')

In [None]:
def get_genre_rank(genre): # return the popularity ranking for a single genre
    ranking = X_train[X_train['Genres'] == genre].groupby('Title').size().sort_values(ascending=False).index[:10].tolist()
    return ranking
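
Both functions above re-filter `X_train` once per genre. A sketch of computing every genre's top titles in a single pass is shown below, assuming the `Genres` and `Title` columns used above; `build_genre_rankings` is a hypothetical helper, and ties are broken alphabetically by title here, which may differ from the ordering the cells above produce.

```python
import pandas as pd

def build_genre_rankings(train, top_n=10):
    """Map each genre string to its `top_n` most-rated titles."""
    counts = (train.groupby(["Genres", "Title"]).size()
                   .reset_index(name="n")
                   .sort_values(["Genres", "n", "Title"],
                                ascending=[True, False, True]))
    # head() keeps the first top_n rows of each genre in the sorted order
    top = counts.groupby("Genres").head(top_n)
    return top.groupby("Genres")["Title"].apply(list).to_dict()

# Toy data (made up for illustration)
toy_train = pd.DataFrame({
    "Genres": ["Drama", "Drama", "Drama", "Western", "Western"],
    "Title":  ["Amadeus (1984)", "Amadeus (1984)", "Rain Man (1988)",
               "Unforgiven (1992)", "Unforgiven (1992)"],
})
rankings = build_genre_rankings(toy_train, top_n=10)
print(rankings["Drama"])
```

A dictionary lookup such as `rankings.get(genre, [])` could then stand in for repeated calls that filter the whole training frame.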

In [None]:
genres_rank(X_train)

Drama 
 ['Shawshank Redemption, The (1994)', "One Flew Over the Cuckoo's Nest (1975)", 'Good Will Hunting (1997)', 'Fight Club (1999)', 'Amadeus (1984)', 'Rain Man (1988)', 'Erin Brockovich (2000)', 'Apollo 13 (1995)', 'Boogie Nights (1997)', 'Citizen Kane (1941)'] 

Action Horror Sci-Fi Thriller 
 ['Alien (1979)', 'Alien³ (1992)', 'Thing, The (1982)'] 

Action Adventure Comedy Romance 
 ['Princess Bride, The (1987)', 'True Lies (1994)', 'Romancing the Stone (1984)', 'Jewel of the Nile, The (1985)'] 

Action Drama War 
 ['Saving Private Ryan (1998)', 'Braveheart (1995)', 'Full Metal Jacket (1987)', 'Patriot, The (2000)', 'Glory (1989)', 'Boat, The (Das Boot) (1981)', 'Thin Red Line, The (1998)', 'G.I. Jane (1997)', 'Guns of Navarone, The (1961)', 'Longest Day, The (1962)'] 

Western 
 ['Unforgiven (1992)', 'Tombstone (1993)', 'High Noon (1952)', 'Outlaw Josey Wales, The (1976)', 'For a Few Dollars More (1965)', 'High Plains Drifter (1972)', 'Pale Rider (1985)', 'Wyatt Earp (1994)', 'Wi

## Finding each user's genre preferences from the training data

In [None]:
X_train[X_train['UserID'] == 4317].groupby('Genres').size().sort_values(ascending=False).index[:10].tolist()

['Drama Sci-Fi',
 'Action Sci-Fi Thriller',
 'Action Sci-Fi',
 'Action Adventure Sci-Fi Thriller',
 'Comedy',
 'Comedy Fantasy',
 'Adventure Comedy Sci-Fi',
 'Sci-Fi',
 'Action Adventure Drama Sci-Fi War',
 'Action Adventure Fantasy']
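
The list above treats each genre combination (for example `Action Sci-Fi`) as its own category. As one possible reading of the "max genre" idea from the Method section, the combinations can be split back into single genres and counted per user. This is a sketch on toy data, assuming space-joined genre strings as in `X_train['Genres']`; `user_genre_profile` is a hypothetical helper.

```python
import pandas as pd

def user_genre_profile(train, user):
    """Fraction of the user's genre tags (one per rating-genre pair) on each single genre."""
    genres = train.loc[train["UserID"] == user, "Genres"].str.split(" ").explode()
    return genres.value_counts(normalize=True)

# Toy data (made up for illustration)
toy_train = pd.DataFrame({
    "UserID": [4317, 4317, 4317, 1],
    "Genres": ["Drama Sci-Fi", "Action Sci-Fi", "Comedy", "Drama"],
})
profile = user_genre_profile(toy_train, 4317)
print(profile.idxmax())  # the single most frequent genre for this user
```

Splitting on spaces relies on the ml-1m genre names themselves containing no spaces (`Sci-Fi`, `Film-Noir` and `Children's` are single tokens).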

## Recommending by genre preference; if genres are diverse, consider using the max genre or the average genre (compare performance with the earlier model that ignores genre)

The `rec_genre` function recommends by genre:
1. Extract the user's top 10 genre preferences from the training data
2. Return the popular movies from those top 10 genres

In [None]:
def rec_genre(X_test):
  for user in X_test['UserID'].unique():
    fav_genre = X_train[X_train['UserID'] == user].groupby('Genres').size().sort_values(ascending=False).index[:10].tolist()
    rec_movies = []
    for g in fav_genre:
      # extend() keeps one flat list and avoids reusing the name of the global `movies` DataFrame
      rec_movies.extend(get_genre_rank(g))
    print(user, rec_movies)

In [None]:
rec_genre(X_test[:2])

5755 ['Shawshank Redemption, The (1994)', "One Flew Over the Cuckoo's Nest (1975)", 'Good Will Hunting (1997)', 'Fight Club (1999)', 'Amadeus (1984)', 'Rain Man (1988)', 'Erin Brockovich (2000)', 'Apollo 13 (1995)', 'Boogie Nights (1997)', 'Citizen Kane (1941)', 'Being John Malkovich (1999)', 'Airplane! (1980)', 'Monty Python and the Holy Grail (1974)', 'Election (1999)', "Ferris Bueller's Day Off (1986)", 'Raising Arizona (1987)', 'Austin Powers: The Spy Who Shagged Me (1999)', 'Clerks (1994)', 'High Fidelity (2000)', 'American Pie (1999)', 'Titanic (1997)', 'Edward Scissorhands (1990)', 'Jerry Maguire (1996)', 'Graduate, The (1967)', 'Leaving Las Vegas (1995)', 'Chasing Amy (1997)', 'Sense and Sensibility (1995)', 'Dangerous Liaisons (1988)', 'Like Water for Chocolate (Como agua para chocolate) (1992)', "Breakfast at Tiffany's (1961)", 'Sixth Sense, The (1999)', 'Fatal Attraction (1987)', 'Nikita (La Femme Nikita) (1990)', 'Cape Fear (1991)', 'Bone Collector, The (1999)', 'Arlington 

In [None]:
def rec_genre_user(user): # per user
    fav_genre = X_train[X_train['UserID'] == user].groupby('Genres').size().sort_values(ascending=False).index[:10].tolist()
    rec_movies = []
    for g in fav_genre:
      # extend() keeps one flat list and avoids reusing the name of the global `movies` DataFrame
      rec_movies.extend(get_genre_rank(g))
    # empty list when the user has no rows in X_train
    return rec_movies[:10]

In [None]:
print(rec_genre_user(4317))

['Twelve Monkeys (1995)', 'Close Encounters of the Third Kind (1977)', 'Contact (1997)', 'Powder (1995)', 'Day the Earth Stood Still, The (1951)', 'Nineteen Eighty-Four (1984)', 'Brother from Another Planet, The (1984)', 'Solaris (Solyaris) (1972)', 'Until the End of the World (Bis ans Ende der Welt) (1991)', 'Conceiving Ada (1997)']


## Hitrate

In [None]:
X_test.set_index('UserID', inplace=True)

In [None]:
def find_watched_list(user):
  # Look the user up directly in the UserID index instead of scanning every group
  if user not in X_test.index: return None
  return X_test.loc[[user], 'Title'].tolist()

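The `hitrate` function reports how closely the recommended movies match what the user actually watched.
It compares the two lists, stores the number of matches in the `hit` variable, and expresses that as a ratio.   
Drawbacks: it takes too long to run, and the performance is worse than the previous model.
The slowness may come from the repeated function calls or from the `for` loops. The lower performance needs more thought.   
Possible improvement: reduce the `for` loops and group genres so that a suitable number of movies is recommended? But then, couldn't a user like Action Adventure more than Action?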
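
Two things may contribute to the slowness: `find_watched_list` is called once per user on the whole test frame, and each recommendation re-filters `X_train`. A sketch of the same per-user ratio, hits divided by the number of recommended titles, computed from precomputed dictionaries follows. The names `hit_rate`, `recs` and `watched` are hypothetical and the data is made up for illustration.

```python
def hit_rate(recs, watched):
    """Mean over users of |recommended ∩ watched| / |recommended|."""
    rates = []
    for user, rec_list in recs.items():
        if not rec_list:                   # no recommendations: count as zero
            rates.append(0.0)
            continue
        seen = set(watched.get(user, []))  # set membership is O(1) on average
        hits = sum(1 for title in rec_list if title in seen)
        rates.append(hits / len(rec_list))
    return sum(rates) / len(rates) if rates else 0.0

# user 1: one of four recommendations was watched; user 2: none of two
recs = {1: ["A", "B", "C", "D"], 2: ["A", "E"]}
watched = {1: ["B", "Z"], 2: ["Q"]}
print(hit_rate(recs, watched))
```

Building `recs` and `watched` once, outside the loop, is what removes the repeated DataFrame filtering.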

In [None]:
def hitrate(X_test): # how many of the predictions hit?
  user_list = []
  rates = []  # renamed so the list does not shadow the function name
  for user in X_test.index.unique():
    hit = 0
    watch = find_watched_list(user)
    gen = rec_genre_user(user)
    for tmp in watch:
      if tmp in gen:
        hit +=1
    user_list.append(user)
    # guard against an empty recommendation list
    rates.append(hit/len(gen) if len(gen) else 0.0)
  mean_total = pd.DataFrame({'hitrate': rates}, index=user_list)
  return mean_total.mean()

In [None]:
hitrate(X_test)

hitrate    0.019229
dtype: float64
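
The Method section calls for a comparison against a recommender that ignores genre. That baseline is not shown here, so the following is only a sketch of what it might look like: recommend the same globally most-rated training titles to every user. `popularity_baseline` is a hypothetical name and the toy data is illustrative.

```python
import pandas as pd

def popularity_baseline(train, top_n=10):
    """Top-`top_n` titles by number of ratings in `train`, ignoring genre."""
    return train["Title"].value_counts().head(top_n).index.tolist()

# Toy data (made up): "A" has three ratings, "B" two, "C" one
toy_train = pd.DataFrame({"Title": ["A", "A", "A", "B", "B", "C"]})
print(popularity_baseline(toy_train, top_n=2))
```

Scoring this list with the same hit-rate measure would give a reference point for the 0.019229 figure above.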