# Movies dataset

In this Notebook you will explore the contents of the Movies dataset using the tools you have used previously to 
manipulate `DataFrames`. 
You will be using this dataset, which contains data about movies, their actors and directors, and audience and 
critics' ratings, as a relational database in some of the practical activities in Parts 9-12.

This dataset is derived from the [MovieLens + IMDb/Rotten Tomatoes](http://grouplens.org/datasets/hetrec-2011/) dataset 
made available at the *2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems* 
([HetRec 2011](http://ir.ii.uam.es/hetrec2011)) at the *5th ACM Conference on Recommender Systems* 
([RecSys 2011](http://recsys.acm.org/2011)). 
It is an extension of the [MovieLens 10M](http://grouplens.org/datasets/movielens/) 
dataset containing additional data from the 
[Internet Movie Database (IMDb)](http://www.imdb.com/) and the [Rotten Tomatoes (RT)](http://www.rottentomatoes.com/) 
movie review system.

This dataset comprises the following five individual datasets:

`movie (movie_id, title, year, rt_all_critics_rating, rt_top_critics_rating, rt_audience_rating, ml_user_rating)`

Each row records the following data about a particular movie identified by the `movie_id` primary key (PK) column.

column | description
------ | -----------
movie_id  (PK) | movie identifier
title | movie title
year | year of release
rt_all_critics_rating | Rotten Tomatoes - all critics: average rating
rt_top_critics_rating | Rotten Tomatoes - top critics: average rating
rt_audience_rating | Rotten Tomatoes - audience: average rating
ml_user_rating | MovieLens - users: average rating


`movie_actor (movie_id, actor_name, ranking)`

Each movie features one or more actors. Each row records a particular actor featuring in a particular movie 
identified by the `movie_id` and `actor_name` primary key columns.


column | description
------ | -----------
movie_id  (PK) | movie identifier
actor_name  (PK) | actor's name
ranking | position of actor on the movie's cast list

`movie_country (movie_id, country)`

Each movie has one country of origin. Each row records the country of origin of a particular movie 
identified by the `movie_id` primary key column.

column | description
------ | -----------
movie_id  (PK) | movie identifier
country | country of origin

`movie_director (movie_id, director_name)`

Each movie has one director. Each row records the director of a particular movie 
identified by the `movie_id` primary key column.


column | description
------ | -----------
movie_id  (PK) | movie identifier
director_name | director's name

`movie_genre (movie_id, genre)`

Each movie is categorised as belonging to one or more movie genres. Each row records a particular genre that 
categorises a particular movie identified by the `movie_id` and `genre` primary key columns.


column | description
------ | -----------
movie_id  (PK) | movie identifier
genre  (PK) | movie genre
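
The primary key statements in the tables above can be checked once the data is in a `DataFrame`. The following is a minimal sketch, assuming pandas and a few made-up rows in the shape of `movie_actor`; the helper name `rows_violating_key` is hypothetical and is not part of the module materials.

```python
import pandas as pd

# Made-up rows in the shape of movie_actor; the last row deliberately
# repeats the (movie_id, actor_name) pair of the row before it.
movie_actor_sample = pd.DataFrame({
    "movie_id": [1, 1, 2, 2],
    "actor_name": ["Tom Hanks", "Tim Allen", "Example Actor", "Example Actor"],
    "ranking": [1, 2, 1, 3],
})

def rows_violating_key(df, key_columns):
    """Return every row whose key-column values occur more than once."""
    return df[df.duplicated(subset=key_columns, keep=False)]

# A composite primary key holds only if this result is empty.
duplicates = rows_violating_key(movie_actor_sample, ["movie_id", "actor_name"])
print(duplicates)
```

The same call with `["movie_id"]` as the key would apply to tables such as `movie`, where the identifier alone is stated to be the primary key.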



#### Create a separate DataFrame from each individual Movies dataset

In [74]:
import pandas as pd

`movie (movie_id, title, year, rt_all_critics_rating, rt_top_critics_rating, rt_audience_rating, ml_user_rating)`

In [75]:
# Create the DataFrame 'movie' from the CSV data file 'movie.csv'.
movie = pd.read_csv('data/movie.csv')
# Display data about the movie with the movie identifier of '1'.
movie[movie.movie_id==1]

Unnamed: 0,movie_id,title,year,rt_all_critics_rating,rt_top_critics_rating,rt_audience_rating,ml_user_rating
0,1,Toy Story,1995,9,8.5,3.7,3.9


`movie_actor (movie_id, actor_name, ranking)`

In [76]:
# Create the DataFrame 'movie_actor' from the CSV data file 'movie_actor.csv'.
movie_actor = pd.read_csv('data/movie_actor.csv')
# Display the actors featuring in the movie with the movie identifier of '1', in the order given on the cast list.
movie_actor[movie_actor.movie_id==1].sort_values(by=['ranking'])

Unnamed: 0,movie_id,actor_name,ranking
22,1,Tom Hanks,1
21,1,Tim Allen,2
2,1,Don Rickles,3
7,1,Jim Varney,4
23,1,Wallace Shawn,5
5,1,Jack Angel,6
20,1,Sherry Lynn,7
13,1,Laurie Metcalf,8
14,1,Patrick Pinney,9
0,1,Annie Potts,10


`movie_country (movie_id, country)`

In [77]:
# Create the DataFrame 'movie_country' from the CSV data file 'movie_country.csv'.
movie_country = pd.read_csv('data/movie_country.csv')
# Display the country of origin of the movie with the movie identifier of '1'.
movie_country[movie_country.movie_id==1]

Unnamed: 0,movie_id,country
0,1,USA


`movie_director (movie_id, director_name)`

In [78]:
# Create the DataFrame 'movie_director' from the CSV data file 'movie_director.csv'.
movie_director = pd.read_csv('data/movie_director.csv')
# Display the director of the movie with the movie identifier of '1'.
movie_director[movie_director.movie_id==1]

Unnamed: 0,movie_id,director_name
0,1,John Lasseter


`movie_genre (movie_id, genre)`

In [79]:
# Create the DataFrame 'movie_genre' from the CSV data file 'movie_genre.csv'.
movie_genre = pd.read_csv('data/movie_genre.csv')
# Display the genres of the movie with the movie identifier of '1'.
movie_genre[movie_genre.movie_id==1]

Unnamed: 0,movie_id,genre
0,1,Adventure
1,1,Animation
2,1,Children
3,1,Comedy
4,1,Fantasy
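
Because all five DataFrames share the `movie_id` column, rows from different DataFrames can be matched on it. The following is a minimal sketch using small made-up samples rather than the CSV files; the names `movie_sample` and `movie_genre_sample`, and the second movie's title, are illustrative assumptions.

```python
import pandas as pd

# Made-up rows in the shape of the movie and movie_genre tables.
movie_sample = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Toy Story", "Example Movie B"],
    "year": [1995, 1995],
})
movie_genre_sample = pd.DataFrame({
    "movie_id": [1, 1, 2],
    "genre": ["Animation", "Comedy", "Adventure"],
})

# Match rows on the shared movie_id column. A movie with several genres
# appears once per genre in the result.
movie_with_genre = movie_sample.merge(movie_genre_sample, on="movie_id", how="inner")
print(movie_with_genre)
```

An inner merge drops any movie that has no matching genre row, so it is not suitable on its own for finding missing data.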


## Activity
Using the tools that you have used previously to manipulate `DataFrames`, characterise the Movies dataset by answering the following questions about the data recorded in the dataset:

    1 How many movies, actors, directors and countries are there?
    2 How many unique movie titles are there?
    3 What are the earliest and latest years of release?
    4 What are the ranges of values for critics, audience and user ratings?
    5 How many movies are classified under each genre? 
    6 Missing data - How many movies are recorded without:
        6.1 a title?
        6.2 a year of release?
        6.3 critics, audience or user ratings?
        6.4 any actors?
        6.5 a director?
        6.6 a country of origin?
        6.7 any genres?
        

In [90]:
# Note: a Notebook cell displays only the value of its last expression.
# Wrap any other result you want to see in print(), or move it to its own cell.

# How many movies? (One row per movie.)
len(movie)

# How many unique movie titles? (Missing titles are not counted.)
movie['title'].nunique()

# Earliest and latest years of release.
movie['year'].min()
movie['year'].max()

# Ranges of values for the ratings (see the min and max rows).
movie[['rt_all_critics_rating','rt_top_critics_rating','rt_audience_rating','ml_user_rating']].describe()

# How many movies are classified under each genre?
movie_genre.pivot_table(index='genre', values='movie_id', aggfunc='count')

# Which movies are recorded without a rating?
movie[movie['rt_all_critics_rating'].isnull()]
movie[movie['rt_top_critics_rating'].isnull()]
movie[movie['rt_audience_rating'].isnull()]
# Number of movies without a MovieLens user rating (the value displayed below).
len(movie[movie['ml_user_rating'].isnull()])

# Number of movies with a missing value in any column:
#len(movie[movie.isnull().any(axis=1)])

4
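
Questions 6.4 to 6.7 ask about movies that have no matching row in one of the other DataFrames, which `isnull()` on `movie` alone cannot reveal. The following is one possible sketch, not the module's solution, using made-up sample rows; the helper name `rows_without_match` is hypothetical.

```python
import pandas as pd

# Made-up rows in the shape of the movie and movie_director tables.
movie_sample = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Toy Story", "Example Movie B", "Example Movie C"],
})
movie_director_sample = pd.DataFrame({
    "movie_id": [1, 3],
    "director_name": ["John Lasseter", "Example Director"],
})

def rows_without_match(parent, child, key="movie_id"):
    """Return the rows of parent whose key value never appears in child."""
    has_match = parent[key].isin(child[key])
    return parent[~has_match]

movies_without_director = rows_without_match(movie_sample, movie_director_sample)
print(len(movies_without_director))
```

Because `isin` only tests membership, the same helper should also work for the one-to-many DataFrames such as `movie_actor` and `movie_genre`, where a movie may have several matching rows.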

Solutions can be found in the `08.1.soln Movies dataset` Notebook, but please DO attempt the activity yourself before looking at these solutions.

## Summary
In this Notebook you have explored the Movies dataset in order to familiarise yourself with the data that you will be
using as a relational database in some of the practical activities in Parts 9-12. In particular, you have looked for 
missing data, which you may need to accommodate when writing SQL queries against the data.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to the `Part 9` Notebooks.