<h2>Movie recommendations</h2>

In [None]:
import pandas as pd

# Import the MovieLens dataset (with an additional user)
r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('ml-100k/u.data.user.added', sep='\t', names=r_cols, usecols=range(3))
ratings.head()

In [None]:
#"ISO-8859-1" is an encoding format for text data. It supports most Western European languages.
m_cols = ['movie_id', 'title']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, usecols=range(2), encoding="ISO-8859-1")
movies.head()

In [None]:
# Merge the two dataframes (inner join)
ratings = pd.merge(movies, ratings, on="movie_id")
ratings.head()

<b>Create a user/movie rating matrix and compute the similarities among movies</b>

In [None]:
movieRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')
movieRatings.head()
# NaN indicates missing data (movies that a specific user didn't rate)

In [None]:
# Compute the correlation score (similarity) between every pair of movies
# (pairs with too few users in common to compute a correlation show up as NaN)
corrMatrix = movieRatings.corr(method='pearson')
corrMatrix.head()
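
To see what the pivot and the correlation step do, here is a minimal sketch on a made-up dataset. The ratings and the names `toy`, `toy_matrix` and `toy_corr` are illustrative assumptions, not part of MovieLens.

```python
import pandas as pd

# Hypothetical toy ratings (not from MovieLens): three users, three movies
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3, 3],
    'title':   ['A', 'B', 'C', 'A', 'B', 'A', 'B'],
    'rating':  [5, 4, 1, 3, 2, 1, 1],
})

# One row per user, one column per movie; unrated movies become NaN
toy_matrix = toy.pivot_table(index='user_id', columns='title', values='rating')

# Pearson correlation between every pair of movie columns,
# computed only over the users who rated both movies
toy_corr = toy_matrix.corr(method='pearson')

print(toy_matrix)
print(toy_corr)
```

Movies A and B were rated by all three users, so they get a proper score. Movie C was rated by a single user, so every correlation involving it comes out as NaN.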

<b>Are these correlation scores reliable?</b>

What if we based our analysis on movies that were watched (and rated) only by a handful of people?
Movies rated by a few people are likely to produce spurious results...let's remove them!
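
A small sketch of the problem, using made-up numbers (the movie names and ratings below are assumptions for illustration only). With just two users in common, two movies can look perfectly (anti-)correlated by pure chance. The `min_periods` argument of `DataFrame.corr` turns such pairs into NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings matrix: rows are users, columns are movies
toy_matrix = pd.DataFrame({
    'Popular 1': [5, 4, 2, 1, 3],
    'Popular 2': [4, 5, 1, 2, 3],
    'Obscure':   [1, np.nan, 5, np.nan, np.nan],
})

# Default behaviour: any pair with enough data to compute a value gets a score
loose = toy_matrix.corr(method='pearson')

# Require at least 3 users who rated both movies of a pair
strict = toy_matrix.corr(method='pearson', min_periods=3)

print(loose.loc['Popular 1', 'Obscure'])    # based on only two shared users
print(strict.loc['Popular 1', 'Obscure'])   # NaN: too few shared users
print(strict.loc['Popular 1', 'Popular 2']) # kept: five shared users
```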

In [None]:
# Get the correlation score (similarity) only for pairs of movies that at least 100 users rated in common
corrMatrix = movieRatings.corr(method='pearson', min_periods=100)
corrMatrix.head()

<b>Get recommendations for me (user_id: 0)</b>

In [None]:
myRatings = movieRatings.loc[0].dropna()
myRatings

In [None]:
simCandidates = pd.Series(dtype='float64')
for title, rating in myRatings.items():
    # Get similar movies to the one I rated
    print ("Getting similar movies for " + title + "...")
    sims = corrMatrix[title].dropna()
    # Scale its similarity by how well I rated this movie
    # (movies similar to ones I liked count more than movies similar to ones I did not like)
    sims = sims * rating
    # Add the score to the list of similarity candidates
    simCandidates = pd.concat([simCandidates, sims])
    
#Look at the results
simCandidates.sort_values(inplace = True, ascending = False)
print (simCandidates.head(20))


In [None]:
# Some of the same movies came up more than once, because they were similar to more than one movie I rated.
# We can sum up the scores for movies that showed up more than once, so they'll count more
simCandidates = simCandidates.groupby(simCandidates.index).sum()
simCandidates.sort_values(inplace = True, ascending = False)
print (simCandidates.head(20))
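
The summing step is easier to follow on a tiny example. This is a sketch with made-up scores (`candidates` and the movie names are hypothetical): a title that appears twice ends up with the sum of its two scores.

```python
import pandas as pd

# Hypothetical candidate scores: 'Movie X' was suggested by two of my rated movies
candidates = pd.Series([4.0, 2.5, 1.5], index=['Movie X', 'Movie Y', 'Movie X'])

# Group the entries by title (the index) and add up the scores of duplicates
totals = candidates.groupby(candidates.index).sum().sort_values(ascending=False)
print(totals)
```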

<b>Done! Remember to filter out the movies I have already rated!</b>
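
One possible way to do that filtering, sketched on stand-in data. The series `sim_candidates` and `my_ratings` below are hypothetical substitutes for the notebook's `simCandidates` and `myRatings`; the same `drop` call should work on the real ones.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's simCandidates and myRatings
sim_candidates = pd.Series({'Movie A': 7.5, 'Movie B': 6.0, 'Movie C': 4.2})
my_ratings = pd.Series({'Movie A': 5.0, 'Movie D': 1.0})

# Drop every candidate I already rated; errors='ignore' skips rated titles
# that are not among the candidates (here 'Movie D') instead of raising
filtered = sim_candidates.drop(my_ratings.index, errors='ignore')
print(filtered)
```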