# Movie ratings

The idea for this example comes from this post: http://www.gregreda.com/2013/10/26/using-pandas-on-the-movielens-dataset/.

The data can be found here: http://grouplens.org/datasets/movielens/.

This notebook uses Dask dataframes together with ``odo`` from the Blaze ecosystem. Documentation on the Blaze library can be found here: http://blaze.pydata.org/.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import dask.dataframe as dd

from dask.diagnostics import ProgressBar
from odo import odo

sns.set_context('notebook')
pd.set_option('float_format', '{:6.2f}'.format)

Load ratings into a Dask DataFrame.

In [2]:
path = '/home/khrapov/ml-latest-small/'

ratings = dd.read_csv(path + 'ratings.csv')

with ProgressBar():
    print(ratings.head(5))

[########################################] | 100% Completed |  0.1s
   userId  movieId  rating   timestamp
0       1       31    2.50  1260759144
1       1     1029    3.00  1260759179
2       1     1061    3.00  1260759182
3       1     1129    2.00  1260759185
4       1     1172    4.00  1260759205


In contrast to Pandas, you cannot simply print the result of an operation; it has to be computed first. A Dask DataFrame stores only the graph of operations to perform. To run them, call the ``compute`` method on the result or convert it to a Pandas DataFrame with ``odo``.

In [3]:
with ProgressBar():
    print('Count method:\n', ratings.count())
    print('Compute method:\n', ratings.count().compute())

Count method:
 dd.Series<datafra..., npartitions=1>
[########################################] | 100% Completed |  0.1s
Compute method:
 userId       100004
movieId      100004
rating       100004
timestamp    100004
dtype: int64


Read the movie titles file into a Pandas DataFrame.

In [4]:
movies = pd.read_csv(path + 'movies.csv', index_col='movieId')

print(movies.head(5))

                                      title  \
movieId                                       
1                          Toy Story (1995)   
2                            Jumanji (1995)   
3                   Grumpier Old Men (1995)   
4                  Waiting to Exhale (1995)   
5        Father of the Bride Part II (1995)   

                                              genres  
movieId                                               
1        Adventure|Animation|Children|Comedy|Fantasy  
2                         Adventure|Children|Fantasy  
3                                     Comedy|Romance  
4                               Comedy|Drama|Romance  
5                                             Comedy  


Extract all genres for each movie and place each genre on a separate row.

In [5]:
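# Split the pipe-delimited genre string, expand it into columns, then stack into one row per genre.
genres = movies['genres'].str.split('|').apply(pd.Series).stack()
# Drop the inner index level created by stack so that only movieId remains.
genres.index = genres.index.droplevel(-1)
genres.name = 'genre'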

print(genres.unique())
print(genres.head(15))

['Adventure' 'Animation' 'Children' 'Comedy' 'Fantasy' 'Romance' 'Drama'
 'Action' 'Crime' 'Thriller' 'Horror' 'Mystery' 'Sci-Fi' 'Documentary'
 'IMAX' 'War' 'Musical' 'Western' 'Film-Noir' '(no genres listed)']
movieId
1    Adventure
1    Animation
1     Children
1       Comedy
1      Fantasy
2    Adventure
2     Children
2      Fantasy
3       Comedy
3      Romance
4       Comedy
4        Drama
4      Romance
5       Comedy
6       Action
Name: genre, dtype: object
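
As a possible alternative to the split, apply and stack chain above, here is a minimal sketch that uses ``Series.explode`` (available in pandas 0.25 and later). The table ``movies_toy`` and its contents are hypothetical stand-ins for ``movies.csv``.

```python
import pandas as pd

# Hypothetical toy data with the same layout as movies.csv.
movies_toy = pd.DataFrame(
    {'title': ['Movie A (1995)', 'Movie B (1995)'],
     'genres': ['Adventure|Comedy', 'Drama']},
    index=pd.Index([1, 2], name='movieId'),
)

# Split the pipe-delimited string into lists, then emit one row per list element.
# The movieId index is repeated for every genre of the same movie.
genres_toy = movies_toy['genres'].str.split('|').explode()
genres_toy.name = 'genre'

print(genres_toy)
```

The result keeps ``movieId`` as its index, so no ``droplevel`` step is needed.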


Extract the movie titles as a separate Series.

In [6]:
movie_titles = movies.loc[:, 'title']

print(movie_titles.head())

movieId
1                      Toy Story (1995)
2                        Jumanji (1995)
3               Grumpier Old Men (1995)
4              Waiting to Exhale (1995)
5    Father of the Bride Part II (1995)
Name: title, dtype: object


Which movies have the most ratings?

In [7]:
most_rated = ratings.groupby('movieId').count()
most_rated = most_rated.join(odo(movie_titles, dd.DataFrame, npartitions=1))
most_rated = most_rated.set_index('title')
df = most_rated.nlargest(n=5, columns=['rating'])

with ProgressBar():
    print(df.compute()['rating'])

[########################################] | 100% Completed |  0.2s
title
Forrest Gump (1994)                          341
Pulp Fiction (1994)                          324
Shawshank Redemption, The (1994)             311
Silence of the Lambs, The (1991)             304
Star Wars: Episode IV - A New Hope (1977)    291
Name: rating, dtype: int64


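Which movies have the highest mean rating among those rated more often than the average movie? The five lowest rated movies under the same restriction are printed as well.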

In [8]:
grouped = ratings.groupby('movieId')[['rating']]

def summary(group):
    # Each group is a DataFrame holding the ratings of one movie.
    rating = group['rating']
    return pd.Series({'mean_rating': rating.mean(), 'count': rating.count()})

movie_stats = grouped.apply(summary, meta={'mean_rating': 'f8', 'count': 'i4'})
movie_stats = movie_stats[movie_stats['count'] > movie_stats['count'].mean()]
movie_stats = movie_stats.join(odo(movie_titles, dd.DataFrame, npartitions=1))
movie_stats = movie_stats.set_index('title')
df = movie_stats.nlargest(n=5, columns=['mean_rating'])

with ProgressBar():
    print(df.compute())

df = odo(movie_stats, pd.DataFrame).nsmallest(n=5, columns=['mean_rating'])

print(df)

[########################################] | 100% Completed |  4.9s
                                  count  mean_rating
title                                               
Inherit the Wind (1960)           12.00         4.54
Godfather, The (1972)            200.00         4.49
Shawshank Redemption, The (1994) 311.00         4.49
Tom Jones (1963)                  12.00         4.46
On the Waterfront (1954)          29.00         4.45
                                           count  mean_rating
title                                                        
Battlefield Earth (2000)                   19.00         1.21
Speed 2: Cruise Control (1997)             23.00         1.65
Police Academy 6: City Under Siege (1989)  12.00         1.71
Super Mario Bros. (1993)                   17.00         1.74
Blade: Trinity (2004)                      12.00         1.79
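
The same statistic can be sketched in plain Pandas with ``agg`` instead of a custom ``apply`` function. The toy ratings below are hypothetical; the filter keeps movies whose rating count exceeds the mean count, as in the cell above.

```python
import pandas as pd

# Hypothetical toy ratings: movie 1 has three ratings, movie 2 two, movie 3 one.
ratings_toy = pd.DataFrame({
    'movieId': [1, 1, 1, 2, 2, 3],
    'rating': [5.0, 4.0, 4.5, 2.0, 3.0, 5.0],
})

# Compute mean and count per movie in one pass.
stats = ratings_toy.groupby('movieId')['rating'].agg(['mean', 'count'])
stats = stats.rename(columns={'mean': 'mean_rating'})

# Keep only movies rated more often than the average movie.
popular = stats[stats['count'] > stats['count'].mean()]

# Highest mean rating among the remaining movies.
best = popular.nlargest(1, 'mean_rating')
print(best)
```

Movie 3 has a perfect mean rating but only one rating, so the count filter removes it before ranking.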


Extract the release year from each movie title. Titles without a year get the placeholder value -999.

In [9]:
import re

df = pd.DataFrame(movie_titles)

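def find_year(x):
    # Return the last parenthesized number in the title, or -999 if there is none.
    try:
        return int(re.findall(r'\((\d+)', x)[-1])
    except (IndexError, TypeError):
        return -999
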
df['year'] = df['title'].apply(find_year)
df.sort_values(by='year', inplace=True)

print(df.head())
print(df.dtypes)

                                                     title  year
movieId                                                         
164979                               Women of '69, Unboxed  -999
162376                                     Stranger Things  -999
151307                           The Lovers and the Despot  -999
143410                                          Hyena Road  -999
32898    Trip to the Moon, A (Voyage dans la lune, Le) ...  1902
title    object
year      int64
dtype: object
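
A vectorized variant can be sketched with ``Series.str.extract``. Note that the pattern here is an assumption and stricter than the one in ``find_year``: it requires a four-digit year in parentheses at the very end of the title. The titles are toy examples taken from the output above.

```python
import pandas as pd

titles_toy = pd.Series(
    ['Toy Story (1995)',
     'Stranger Things',
     'Trip to the Moon, A (Voyage dans la lune, Le) (1902)'],
    name='title',
)

# Capture a four-digit year in parentheses at the end of the title.
# Titles without a match yield a missing value.
year = titles_toy.str.extract(r'\((\d{4})\)\s*$', expand=False)

# Convert to numbers, then fill missing years with the same -999 placeholder.
year = pd.to_numeric(year, errors='coerce').fillna(-999).astype(int)
print(year)
```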


Which movies are rated highest within each genre?

In [10]:
df = ratings.join(odo(genres, dd.DataFrame, npartitions=1))

grouped = df.groupby(['genre', 'movieId'])[['rating']]

def summary(group):
    # Each group is a DataFrame holding the ratings of one (genre, movie) pair.
    rating = group['rating']
    return pd.Series({'mean_rating': rating.mean(), 'count': rating.count()})

movie_stats = grouped.apply(summary, meta={'mean_rating': 'f8', 'count': 'i4'})
movie_stats = movie_stats[movie_stats['count'] > movie_stats['count'].mean()]
movie_stats = movie_stats.join(odo(movie_titles, dd.DataFrame, npartitions=1))
with ProgressBar():
    movie_stats = odo(movie_stats, pd.DataFrame)

df = movie_stats.reset_index().sort_values(by=['genre', 'mean_rating'], ascending=False)
df = df.groupby('genre')[['title', 'mean_rating']].apply(lambda x: x.head())

print(df.loc['Sci-Fi'])

[########################################] | 100% Completed |  8.1s
[########################################] | 100% Completed |  7.6s
                                            title  mean_rating
2603                      Schindler's List (1993)         5.00
2606            Terminator 2: Judgment Day (1991)         5.00
2610                    African Queen, The (1951)         5.00
2614  Wallace & Gromit: The Wrong Trousers (1993)         5.00
2624     Treasure of the Sierra Madre, The (1948)         5.00
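
The cell above joins ``ratings`` and ``genres`` with the default index-to-index ``join``. Since ``ratings`` carries a plain row index while ``genres`` is indexed by ``movieId``, it may be worth checking that rows are matched on the movie identifier. The following is a minimal Pandas sketch on hypothetical toy data that joins on the ``movieId`` column explicitly and then picks the best movie per genre.

```python
import pandas as pd

# Hypothetical toy ratings with a default row index, as in ratings.csv.
ratings_toy = pd.DataFrame({
    'movieId': [1, 1, 2, 2, 3],
    'rating': [5.0, 4.0, 3.0, 2.0, 5.0],
})
# One row per (movie, genre) pair, indexed by movieId; movie 1 has two genres.
genres_toy = pd.Series(
    ['Sci-Fi', 'Drama', 'Sci-Fi', 'Drama'],
    index=pd.Index([1, 1, 2, 3], name='movieId'),
    name='genre',
)

# on='movieId' matches the ratings column against the index of genres_toy,
# so every rating is repeated once per genre of its movie.
joined = ratings_toy.join(genres_toy, on='movieId')

# Mean rating per (genre, movie) pair.
means = (joined.groupby(['genre', 'movieId'])['rating']
               .mean()
               .reset_index(name='mean_rating'))

# Sort so that the best movie comes first within each genre, then keep one row per genre.
top = (means.sort_values(['genre', 'mean_rating'], ascending=[True, False])
            .groupby('genre')
            .head(1))
print(top)
```

With the explicit ``on`` argument a movie can only appear under the genres listed for it.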
