# Movie Recommendation with Collaborative Filtering

### A system that computes the correlations between movies from users' movie preferences and recommends the movies with the highest correlation coefficients

In [2]:
import pandas as pd
import numpy as np 

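# MovieLens 1M uses "::" as the field separator. A multi-character separator
# needs the Python parsing engine, and movies.dat is Latin-1 encoded.
movies_df = pd.read_table('ml-1m/movies.dat', header=None, sep='::', engine='python', encoding='latin-1', names=['movie_id', 'movie_title', 'movie_genre'])
ratings_df = pd.read_table('ml-1m/ratings.dat', header=None, sep='::', engine='python', names=['user_id', 'movie_id', 'rating', 'timestamp'])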



In [3]:
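# One-hot encode the pipe-separated genres and append them as extra columns
movies_df = pd.concat([movies_df, movies_df.movie_genre.str.get_dummies(sep="|")], axis=1)
# The timestamp is not used for the recommendations
del ratings_df['timestamp']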

In [4]:
ratings_df = pd.merge(ratings_df,movies_df,on='movie_id')[['user_id','movie_title','movie_id','rating']]
ratings_df.head()

| | user_id | movie_title | movie_id | rating |
|---|---|---|---|---|
| 0 | 1 | One Flew Over the Cuckoo's Nest (1975) | 1193 | 5 |
| 1 | 2 | One Flew Over the Cuckoo's Nest (1975) | 1193 | 5 |
| 2 | 12 | One Flew Over the Cuckoo's Nest (1975) | 1193 | 4 |
| 3 | 15 | One Flew Over the Cuckoo's Nest (1975) | 1193 | 4 |
| 4 | 17 | One Flew Over the Cuckoo's Nest (1975) | 1193 | 5 |


## The ratings form a matrix of users and movies, so we convert ratings_df to a matrix with one user per row and one movie per column.

#### pd.pivot_table(values = column whose values fill the table, index = column to use as the table's index, columns = column whose distinct values become the table's columns)

In [5]:
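ratings_mtx_df = ratings_df.pivot_table(values='rating', index='user_id', columns='movie_title')
# Unrated movies are NaN after the pivot; fill them with 0 so np.corrcoef can be applied
ratings_mtx_df.fillna(0, inplace=True)
# Keep the column labels so matrix positions can be mapped back to movie titles
movie_index = ratings_mtx_df.columns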
ratings_mtx_df.head()

| user_id / movie_title | $1,000,000 Duck (1971) | 'Night Mother (1986) | 'Til There Was You (1997) | 'burbs, The (1989) | ...And Justice for All (1979) | 1-900 (1994) | 10 Things I Hate About You (1999) | 101 Dalmatians (1961) | 101 Dalmatians (1996) | 12 Angry Men (1957) | ... | Young Poisoner's Handbook, The (1995) | Young Sherlock Holmes (1985) | Young and Innocent (1937) | Your Friends and Neighbors (1998) | Zachariah (1971) | Zed & Two Noughts, A (1985) | Zero Effect (1998) | Zero Kelvin (Kjærlighetens kjøtere) (1995) | Zeus and Roxanne (1997) | eXistenZ (1999) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
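
To make the reshaping concrete, here is a minimal sketch on a made-up table of four ratings. The `toy_ratings` frame and its values are assumptions for illustration, not MovieLens data; only the `pivot_table` and `fillna` calls mirror the cell above.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair.
toy_ratings = pd.DataFrame({
    'user_id':     [1, 1, 2, 3],
    'movie_title': ['A', 'B', 'A', 'B'],
    'rating':      [5, 3, 4, 2],
})

# One row per user, one column per movie; unrated cells come back as NaN.
toy_mtx = toy_ratings.pivot_table(values='rating', index='user_id', columns='movie_title')

# Same choice as the notebook: treat "not rated" as 0.
toy_mtx = toy_mtx.fillna(0)
print(toy_mtx)
```

User 2 never rated movie B, so that cell becomes 0 after `fillna`. Filling with 0 is a modelling choice: the correlation step then treats "not rated" as the lowest possible score.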


## Pearson Product-Moment Correlation Coefficient (PMCC)
#### It is probably easiest to think of this as a covariance matrix normalized by the standard deviations
### A measure of the linear correlation between two variables X and Y. It has a value between +1 and −1.

Link : https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

Use the numpy.corrcoef function, which calculates the Pearson Product-Moment Correlation Coefficient (PMCC) between each pair of items.
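
As a sanity check on what the coefficient measures, the following sketch computes it by hand for two made-up rating vectors and compares the result with `np.corrcoef`. The helper `pearson` and the vectors `movie_a` and `movie_b` are hypothetical names introduced here for illustration.

```python
import numpy as np

def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()   # centre each vector on its mean
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# Made-up ratings of two movies by the same five users.
movie_a = [5, 4, 0, 1, 3]
movie_b = [4, 5, 1, 0, 3]

by_hand = pearson(movie_a, movie_b)
by_numpy = np.corrcoef(movie_a, movie_b)[0, 1]
print(by_hand, by_numpy)
```

The two values should agree up to floating-point error, since `np.corrcoef` normalizes the covariance matrix in the same way.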

In [20]:
corr_matrix = np.corrcoef(ratings_mtx_df.T)
corr_matrix.shape

(3706, 3706)

***Note: We use the transposed ratings matrix to calculate the correlation matrix so that it gives back the correlation between movies, which are the rows of the transposed matrix. If we used the ratings matrix without transposing it, np.corrcoef would return the correlation between users.*** <br><br>
***Because we want the relationships between movies, the movies must be the rows.***
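
The effect of the transpose is easy to see on a small array. This is a sketch on a made-up 4 x 3 matrix (the values in `toy` are assumptions); it also shows `rowvar=False`, an existing `np.corrcoef` parameter that has the same effect as transposing first.

```python
import numpy as np

# Hypothetical ratings matrix: 4 users (rows) x 3 movies (columns).
toy = np.array([
    [5, 4, 0],
    [4, 5, 1],
    [0, 1, 5],
    [1, 0, 4],
], dtype=float)

movie_corr = np.corrcoef(toy.T)   # rows of toy.T are movies -> 3 x 3
user_corr = np.corrcoef(toy)      # rows of toy are users    -> 4 x 4

# rowvar=False tells NumPy that the columns are the variables,
# which is equivalent to transposing the input.
movie_corr_alt = np.corrcoef(toy, rowvar=False)
print(movie_corr.shape, user_corr.shape)
```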

Now, if we want to find movies similar to a specific movie, it's just a matter of returning the movies that have a high correlation coefficient with that one.

In [39]:
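favoured_movie_title = 'Toy Story (1995)'
favoured_movie_index = list(movie_index).index(favoured_movie_title)
# Row of correlations between the favoured movie and every movie
P = corr_matrix[favoured_movie_index]

# Keep movies whose correlation exceeds 0.4; P < 1.0 drops the favoured
# movie itself, whose correlation with itself is 1
list(movie_index[(P > 0.4) & (P < 1.0)])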

['Aladdin (1992)',
 "Bug's Life, A (1998)",
 'Groundhog Day (1993)',
 'Lion King, The (1994)',
 'Toy Story 2 (1999)']
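
A fixed threshold such as 0.4 returns a variable number of movies. One alternative, sketched below under assumed toy data, is to rank by correlation and keep the top N. The helper `top_similar`, the titles, and the rating values are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def top_similar(corr, titles, title, n=2):
    """Return the n titles most correlated with `title`, excluding itself."""
    idx = titles.get_loc(title)
    scores = pd.Series(corr[idx], index=titles).drop(title)
    return scores.sort_values(ascending=False).head(n)

# Toy ratings: 5 users (rows) x 4 movies (columns); the values are made up.
toy_titles = pd.Index(['A', 'B', 'C', 'D'])
toy = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
    [5, 5, 0, 0],
], dtype=float)
toy_corr = np.corrcoef(toy.T)

print(top_similar(toy_corr, toy_titles, 'A', n=2))
```

Dropping the movie by label avoids relying on its self-correlation being exactly 1.0 in floating point.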

We take the list of movies that a user has rated. Then we sum the correlations of those movies with all the other movies and return the other movies sorted by their total correlation with the user's rated movies.

In [51]:
def get_movies_similarity(movie_title):
    '''Return the correlation vector for a movie.'''
    movie_idx = list(movie_index).index(movie_title)
    return corr_matrix[movie_idx]

def get_movie_recommendations(user_movies):
    '''Given the titles a user has rated, return all other movies sorted by
    their summed correlation with those titles.'''
    movie_similarities = np.zeros(corr_matrix.shape[0])

    # Accumulate one correlation row per rated movie
    for movie_title in user_movies:
        movie_similarities = movie_similarities + get_movies_similarity(movie_title)

    similarities_df = pd.DataFrame({'movie_title': movie_index, 'sum_similarities': movie_similarities})
    # Drop the movies the user has already rated
    similarities_df = similarities_df[~(similarities_df.movie_title.isin(user_movies))]
    similarities_df = similarities_df.sort_values(by=['sum_similarities'], ascending=False)

    return similarities_df
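
The summation can be followed by hand on a small case. The sketch below is a self-contained variant that takes the correlation matrix and titles as arguments instead of reading the notebook's globals. The function `recommend` and the hand-written `toy_corr` matrix are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def recommend(corr, titles, user_movies):
    """Sum the correlation rows of the user's movies and rank the remaining titles."""
    rows = [titles.get_loc(t) for t in user_movies]
    summed = corr[rows].sum(axis=0)
    scores = pd.Series(summed, index=titles)
    return scores.drop(user_movies).sort_values(ascending=False)

# Hand-written toy correlation matrix (symmetric, ones on the diagonal).
toy_titles = pd.Index(['A', 'B', 'C', 'D'])
toy_corr = np.array([
    [ 1.0,  0.8, -0.2,  0.1],
    [ 0.8,  1.0,  0.3, -0.4],
    [-0.2,  0.3,  1.0,  0.5],
    [ 0.1, -0.4,  0.5,  1.0],
])

# A user who has rated A and B: C scores -0.2 + 0.3, D scores 0.1 - 0.4.
ranked = recommend(toy_corr, toy_titles, ['A', 'B'])
print(ranked)
```

As in the notebook's function, every rated movie contributes equally to the sum. The rating value itself is not used as a weight.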

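Series.str.contains(" ") => returns a boolean Series that acts as a filter on a DataFrame; takes a single string <br>
Series.isin(array) => returns a boolean Series that acts as a filter; accepts an array of values<br>
list.index(" ") -> returns the index of the element that matches the string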
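
The three calls noted above can be tried on a small Series. This is a minimal sketch; the `titles` Series is made up for illustration.

```python
import pandas as pd

titles = pd.Series(['Toy Story (1995)', 'Toy Story 2 (1999)', 'Aladdin (1992)'])

# str.contains: one pattern in, boolean Series out, usable as a row filter.
# regex=False makes the match a plain substring test.
toy_mask = titles.str.contains('Toy Story', regex=False)

# isin: membership test against a list of values, also a boolean Series.
in_mask = titles.isin(['Aladdin (1992)', 'Toy Story (1995)'])

# list.index: position of the first element equal to the argument.
pos = list(titles).index('Aladdin (1992)')

print(titles[toy_mask].tolist(), titles[in_mask].tolist(), pos)
```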

# Application

In [48]:
simple_user = 21
ratings_df[ratings_df.user_id == simple_user].sort_values(by=['rating'],ascending=False)

| | user_id | movie_title | movie_id | rating |
|---|---|---|---|---|
| 583304 | 21 | Titan A.E. (2000) | 3745 | 5 |
| 707307 | 21 | Princess Mononoke, The (Mononoke Hime) (1997) | 3000 | 5 |
| 70742 | 21 | Star Wars: Episode VI - Return of the Jedi (1983) | 1210 | 5 |
| 239644 | 21 | South Park: Bigger, Longer and Uncut (1999) | 2700 | 5 |
| 487530 | 21 | Mad Max Beyond Thunderdome (1985) | 3704 | 4 |
| 707652 | 21 | Little Nemo: Adventures in Slumberland (1992) | 2800 | 4 |
| 708015 | 21 | Stop! Or My Mom Will Shoot (1992) | 3268 | 3 |
| 706889 | 21 | Brady Bunch Movie, The (1995) | 585 | 3 |
| 623947 | 21 | Iron Giant, The (1999) | 2761 | 3 |
| 619784 | 21 | Wild Wild West (1999) | 2701 | 3 |


In [52]:
sample_user_movies = ratings_df[ratings_df.user_id == simple_user].movie_title.tolist()
recommendations = get_movie_recommendations(sample_user_movies)

In [53]:
recommendations.head(20)

| | movie_title | sum_similarities |
|---|---|---|
| 1939 | Lion King, The (1994) | 5.453611 |
| 324 | Beauty and the Beast (1991) | 5.384934 |
| 1948 | Little Mermaid, The (1989) | 4.967455 |
| 3055 | Snow White and the Seven Dwarfs (1937) | 4.954111 |
| 647 | Charlotte's Web (1973) | 4.948065 |
| 679 | Cinderella (1950) | 4.917892 |
| 1002 | Dumbo (1941) | 4.90908 |
| 301 | Batman (1989) | 4.878468 |
| 3250 | Sword in the Stone, The (1963) | 4.851537 |
| 303 | Batman Returns (1992) | 4.831879 |


# Pros: No manual work is needed when new items are added. Recommendation quality can be improved by adding new factors.

# Cons: Cold Start

## Cold Start issue

Cold start is a potential problem in computer-based information systems which involve a degree of automated data modelling. Specifically, it concerns the issue that the system cannot draw any inferences for users or items about which it has not yet gathered sufficient information.
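
One way the problem can show up with a correlation-based approach is sketched below, assuming a newly added movie is present as a column that nobody has rated yet. The `toy` matrix is made up for illustration. A constant column has zero variance, so its Pearson correlation with every other movie is undefined.

```python
import numpy as np

# Toy ratings: 4 users (rows) x 3 movies (columns). The last movie is new
# and unrated, so its column is all zeros after filling missing values with 0.
toy = np.array([
    [5, 4, 0],
    [4, 5, 0],
    [1, 0, 0],
    [0, 2, 0],
], dtype=float)

# Zero variance leads to 0/0 in the normalization step.
# errstate silences the resulting floating-point warnings.
with np.errstate(invalid='ignore', divide='ignore'):
    toy_corr = np.corrcoef(toy.T)

new_movie_row = toy_corr[2]
print(new_movie_row)
```

Every entry for the new movie is NaN, so it cannot be ranked against other movies until some ratings arrive. Movies with at least some varied ratings, such as the first two columns, still get a finite coefficient.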