# Movie ratings

The idea for this example comes from this post: http://www.gregreda.com/2013/10/26/using-pandas-on-the-movielens-dataset/.

The data can be found here: http://grouplens.org/datasets/movielens/.

Documentation on the Blaze library can be found here: http://blaze.pydata.org/.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import blaze as bz

sns.set_context('talk')
pd.set_option('float_format', '{:6.2f}'.format)

%matplotlib inline

Load the ratings into a Blaze Data object.

In [2]:
path = '/home/khrapov/ml-latest/'
ratings = bz.Data(path + 'ratings.csv')

print(bz.odo(ratings.head(5), pd.DataFrame))
#print(ratings.head(5))

   userId  movieId  rating   timestamp
0       1       50    4.00  1329753504
1       1      296    4.00  1329753602
2       1      318    4.50  1329753494
3       1      527    4.50  1329753507
4       1      541    3.00  1329753607


In contrast to Pandas, you cannot simply print the result; it has to be computed first. A Blaze expression stores only the sequence of operations to perform. Nothing is evaluated until you either call the ``bz.compute`` function or convert the result to a Pandas DataFrame with ``bz.odo``.

In [3]:
print(bz.compute(ratings.count()))

21622187
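
The deferred evaluation described above can be illustrated with a toy class in plain Python. This is only a sketch of the general idea, not how Blaze is implemented: the `Lazy` class and its `compute` method are hypothetical names, and the sample list reuses the five ratings from the preview above.

```python
class Lazy:
    """Toy deferred expression: records an operation instead of running it."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Evaluate any deferred arguments first, then apply the recorded function.
        values = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*values)


# The five ratings shown in the preview of ratings.csv above.
ratings_sample = [4.0, 4.0, 4.5, 4.5, 3.0]

# Building the expressions does no work yet.
count_expr = Lazy(len, ratings_sample)
mean_expr = Lazy(lambda total, n: total / n, Lazy(sum, ratings_sample), count_expr)

# Work happens only when compute() is called.
print(count_expr.compute())  # 5
print(mean_expr.compute())   # 4.0
```

Building `mean_expr` touches no data; the sum and the count are evaluated only inside `compute()`, which is the same division of labour as between a Blaze expression and ``bz.compute``.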


Read movie titles file.

In [4]:
movies = pd.read_csv(path + 'movies.csv', index_col='movieId')

print(movies.head(5))

                                      title  \
movieId                                       
1                          Toy Story (1995)   
2                            Jumanji (1995)   
3                   Grumpier Old Men (1995)   
4                  Waiting to Exhale (1995)   
5        Father of the Bride Part II (1995)   

                                              genres  
movieId                                               
1        Adventure|Animation|Children|Comedy|Fantasy  
2                         Adventure|Children|Fantasy  
3                                     Comedy|Romance  
4                               Comedy|Drama|Romance  
5                                             Comedy  


Split the genre string of each movie and place every genre on its own row.

In [5]:
genres = movies['genres'].str.split('|').apply(pd.Series).stack()
genres.index = genres.index.droplevel(-1)
genres.name = 'genre'

print(genres.head(15))

movieId
1    Adventure
1    Animation
1     Children
1       Comedy
1      Fantasy
2    Adventure
2     Children
2      Fantasy
3       Comedy
3      Romance
4       Comedy
4        Drama
4      Romance
5       Comedy
6       Action
Name: genre, dtype: object
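
The same one-genre-per-row reshaping can be sketched with `Series.explode`, assuming a pandas version that provides it (0.25 or later). The small frame below is built by hand from three of the movies shown above, and `movies_sample` and `genres_sample` are names introduced only for this illustration.

```python
import pandas as pd

# Three rows copied from the movies preview above.
movies_sample = pd.DataFrame(
    {'genres': ['Adventure|Children|Fantasy', 'Comedy|Romance', 'Comedy']},
    index=pd.Index([2, 3, 5], name='movieId'),
)

# Split each string into a list, then emit one row per list element.
# The movieId index value is repeated for every genre of the same movie.
genres_sample = movies_sample['genres'].str.split('|').explode()
genres_sample.name = 'genre'

print(genres_sample)
```

Because `explode` keeps the original index, there is no extra index level to drop afterwards.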


Separate movie titles.

In [6]:
movie_titles = movies.loc[:, 'title']

print(movie_titles.head())

movieId
1                      Toy Story (1995)
2                        Jumanji (1995)
3               Grumpier Old Men (1995)
4              Waiting to Exhale (1995)
5    Father of the Bride Part II (1995)
Name: title, dtype: object


Which movies have received the most ratings?

In [7]:
most_rated = bz.by(ratings['movieId'],
                   count=ratings['movieId'].count())

most_rated = bz.odo(most_rated,
                    pd.DataFrame).set_index('movieId')

most_rated = most_rated.join(movie_titles)

most_rated.sort_values(by='count',
                       ascending=False, inplace=True)

print(most_rated.head())

         count                             title
movieId                                         
356      75018               Forrest Gump (1994)
296      74418               Pulp Fiction (1994)
593      71490  Silence of the Lambs, The (1991)
318      70754  Shawshank Redemption, The (1994)
480      66348              Jurassic Park (1993)
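
What the group-and-count step does can be seen on a tiny example in plain Python. The pairs below are invented for illustration (only the movie IDs and titles are taken from the output above), and `collections.Counter` stands in for the Blaze grouping; it is not what Blaze uses internally.

```python
from collections import Counter

# Hypothetical (userId, movieId) pairs; the counts are made up.
rated = [(1, 296), (1, 318), (2, 296), (2, 356), (3, 296), (3, 318)]

# Titles as they appear in the output above.
titles = {
    296: 'Pulp Fiction (1994)',
    318: 'Shawshank Redemption, The (1994)',
    356: 'Forrest Gump (1994)',
}

# Count how many times each movieId occurs, i.e. how many ratings it received.
counts = Counter(movie_id for _, movie_id in rated)

# most_common() returns (movieId, count) pairs sorted by descending count.
for movie_id, n in counts.most_common():
    print(n, titles[movie_id])
```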


Which movies have the highest mean rating, considering only movies that have more ratings than the average movie?

In [8]:
movie_stats = bz.by(ratings['movieId'],
                    mean_rating=ratings['rating'].mean(),
                    count=ratings['rating'].count())

movie_stats = movie_stats[movie_stats['count']
                          > movie_stats['count'].mean()]

movie_stats = bz.odo(movie_stats, pd.DataFrame).set_index('movieId')

movie_stats = movie_stats.join(movie_titles)

movie_stats.sort_values(by='mean_rating',
                        ascending=False, inplace=True)

print(movie_stats.head())

         count  mean_rating                             title
movieId                                                      
318      70754         4.44  Shawshank Redemption, The (1994)
858      46077         4.36             Godfather, The (1972)
50       49728         4.33        Usual Suspects, The (1995)
527      55613         4.30           Schindler's List (1993)
1221     30165         4.27    Godfather: Part II, The (1974)
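
The effect of the count filter is easier to see on a small made-up table. The sketch below reproduces the same logic with a Pandas `groupby` instead of Blaze; the ratings are invented and `toy`, `stats`, and `popular` are hypothetical names.

```python
import pandas as pd

# Made-up ratings: movie 1 has four ratings, movie 2 has three,
# movies 3 and 4 have a single rating each.
toy = pd.DataFrame({
    'movieId': [1, 1, 1, 1, 2, 2, 2, 3, 4],
    'rating':  [4.0, 5.0, 4.0, 3.0, 5.0, 4.0, 5.0, 5.0, 2.0],
})

# Mean rating and number of ratings per movie.
stats = toy.groupby('movieId')['rating'].agg(['mean', 'count'])
stats = stats.rename(columns={'mean': 'mean_rating'})

# Keep only movies rated more often than the average movie (9 / 4 = 2.25 here).
popular = stats[stats['count'] > stats['count'].mean()]
popular = popular.sort_values(by='mean_rating', ascending=False)

print(popular)
```

Movie 3 has a perfect mean of 5.0 from a single rating, yet it is excluded; without the threshold, rarely rated movies like it would dominate the top of the list.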


Extract year from movie title.

In [9]:
import re

df = pd.DataFrame(movie_titles)

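def find_year(x):
    # Take the last parenthesized number in the title; -999 marks a missing year.
    try:
        return int(re.findall(r'\((\d+)', x)[-1])
    except (IndexError, TypeError):
        # IndexError: no match; TypeError: the title is not a string (e.g. NaN).
        return -999
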
df['year'] = df['title'].apply(find_year)
df.sort_values(by='year', inplace=True)

print(df.head())
print(df.dtypes)

                                           title  year
movieId                                               
113190                        Slaying the Badger  -999
125571      The Court-Martial of Jackie Robinson  -999
79607    Millions Game, The (Das Millionenspiel)  -999
128612                                Body/Cialo  -999
136880                              Vaastupurush  -999
title    object
year      int64
dtype: object
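
The pattern above accepts any run of digits that follows an opening parenthesis. A stricter variant, sketched below, requires exactly four digits enclosed in parentheses. `find_year_strict` is a hypothetical helper, and the last title is invented to show where the two patterns would differ; whether such titles occur in the dataset has not been checked here.

```python
import re

# Exactly four digits, with both the opening and the closing parenthesis.
YEAR_PATTERN = re.compile(r'\((\d{4})\)')


def find_year_strict(title, missing=-999):
    """Return the last '(YYYY)' found in title, or `missing` if there is none."""
    matches = YEAR_PATTERN.findall(title)
    return int(matches[-1]) if matches else missing


print(find_year_strict('Toy Story (1995)'))                         # 1995
print(find_year_strict('Slaying the Badger'))                       # -999
print(find_year_strict('Millions Game, The (Das Millionenspiel)'))  # -999
# Invented title: the looser pattern '\((\d+)' would return 3 here.
print(find_year_strict('Hypothetical Title (3 Days Later)'))        # -999
```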
