# Day 5 - Using collections

Today we look at a code challenge from PyBites (https://pybit.es/codechallenge13.html). The challenge: find the 20 highest-rated directors based on the average IMDB rating of their movies.

Output should follow a format similar to the one below:

In [167]:
from urllib.request import urlretrieve
from collections import Counter, namedtuple, defaultdict
import unicodedata
import pandas as pd
import numpy as np
from operator import itemgetter

PyBites have provided a download link to the dataset.  We'll grab the dataset and save it in our current working directory. 

In [2]:
movie_data = 'https://raw.githubusercontent.com/pybites/challenges/solutions/13/movie_metadata.csv'
movies_csv = 'movies.csv'
urlretrieve(movie_data, movies_csv)

('movies.csv', <http.client.HTTPMessage at 0x106433b70>)

In [180]:
# read file
imdb_movie_list = pd.read_csv('movies.csv')

# drop rows containing missing data 
imdb_movie_list.dropna(subset=['director_name', 'title_year', 'imdb_score'], inplace=True)

imdb_movie_list.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4935 entries, 0 to 5042
Data columns (total 28 columns):
color                        4920 non-null object
director_name                4935 non-null object
num_critic_for_reviews       4894 non-null float64
duration                     4923 non-null float64
director_facebook_likes      4935 non-null float64
actor_3_facebook_likes       4917 non-null float64
actor_2_name                 4925 non-null object
actor_1_facebook_likes       4928 non-null float64
gross                        4156 non-null float64
genres                       4935 non-null object
actor_1_name                 4928 non-null object
movie_title                  4935 non-null object
num_voted_users              4935 non-null int64
cast_total_facebook_likes    4935 non-null int64
actor_3_name                 4917 non-null object
facenumber_in_poster         4922 non-null float64
plot_keywords                4795 non-null object
movie_imdb_link              4935 non-

## Data Restrictions

The challenge places two restrictions on the data:
* only consider directors with a minimum of 4 movies
* only consider movies made in 1960 or later

Let's tackle these requirements.

First - limit the dataset to movies made in 1960 or later.

In [181]:
# we only want movies made >= 1960
imdb_movie_list = imdb_movie_list[imdb_movie_list.title_year >= 1960]

Second - let's include only those directors that have made at least 4 movies

In [182]:
# we only want directors that have made at least 4 movies.  We can do this using a Counter
director_counts = Counter(imdb_movie_list.director_name)
director_list = [d for d, n in director_counts.items() if n >= 4]

# filter overall list to only include directors with at least 4 movies
imdb_movie_list = imdb_movie_list[imdb_movie_list.director_name.isin(director_list)]
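
To see the Counter step in isolation, here is a minimal sketch on a toy list of names. The sample data and the `min_movies` name are made up for illustration and are not part of the challenge dataset.

```python
from collections import Counter

# Hypothetical sample data: one entry per movie, holding the director's name
sample_directors = ['Nolan', 'Nolan', 'Tarantino', 'Nolan', 'Nolan', 'Tarantino']

# Assumed threshold, mirroring the "at least 4 movies" rule
min_movies = 4

# Count how many movies each director has
counts = Counter(sample_directors)

# Keep only the directors who meet the threshold
prolific = [name for name, n in counts.items() if n >= min_movies]

print(prolific)  # ['Nolan']
```

A Counter behaves like a dictionary of name-to-count pairs, so filtering it is an ordinary loop over `items()`.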

Third - it looks like the movie names were not interpreted correctly. Each one ends with a stray \xa0 (a non-breaking space). Let's remove this.

In [183]:
imdb_movie_list.movie_title.values

array(['Avatar\xa0', "Pirates of the Caribbean: At World's End\xa0",
       'Spectre\xa0', ..., 'Rampage\xa0', 'Slacker\xa0',
       'El Mariachi\xa0'], dtype=object)

In [184]:
# Remove bogus characters
imdb_movie_list['movie_title'] = imdb_movie_list.movie_title.apply(lambda d: unicodedata.normalize('NFKD', d).strip())
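
The cleanup can be tried on a single string first. This is a small sketch, and `raw_title` is a hypothetical value shaped like the titles printed in the previous cell.

```python
import unicodedata

# Hypothetical title with a trailing non-breaking space, as seen in the dataset
raw_title = 'Avatar\xa0'

# NFKD maps the non-breaking space (U+00A0) to a regular space (U+0020)
normalized = unicodedata.normalize('NFKD', raw_title)

# strip() then removes the trailing whitespace
clean_title = normalized.strip()

print(repr(clean_title))  # 'Avatar'
```

Keep in mind that NFKD also decomposes accented letters into a base letter plus a combining mark, so titles containing accents will change form as well.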

## Data Organization
We will represent each movie using a namedtuple, and group the movies by director, storing the results in a defaultdict.

In [185]:
# declare our movie tuple
Movie = namedtuple('Movie', ['name', 'year', 'imdb_rating'])

# declare dictionary to hold data
movie_dict = defaultdict(list)

# Fill dictionary with each director:movie pairing
for index, item in imdb_movie_list.iterrows():

    # create movie tuple
    m = Movie(item.movie_title, int(item.title_year), item.imdb_score)

    # store result
    movie_dict[item.director_name].append(m)    
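
The same grouping pattern works without pandas. The sketch below uses a short hand-written list of rows (the `rows` and `by_director` names are illustrative assumptions) to show how `defaultdict(list)` creates each director's list on first use.

```python
from collections import namedtuple, defaultdict

Movie = namedtuple('Movie', ['name', 'year', 'imdb_rating'])

# Hypothetical rows: (director, title, year, rating)
rows = [
    ('Christopher Nolan', 'Inception', 2010, 8.8),
    ('Quentin Tarantino', 'Pulp Fiction', 1994, 8.9),
    ('Christopher Nolan', 'Memento', 2000, 8.5),
]

# A missing key gets an empty list automatically, so no "if key in dict" check is needed
by_director = defaultdict(list)
for director, title, year, rating in rows:
    by_director[director].append(Movie(title, year, rating))

print(len(by_director['Christopher Nolan']))     # 2
print(by_director['Quentin Tarantino'][0].name)  # Pulp Fiction
```

Because each entry is a namedtuple, fields are read by name (`movie.imdb_rating`) rather than by position.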

## Data Presentation
With our data organized, we can now display our results.  

We will first sort our dictionary based on each director's average rating.  Once we've found our top 20 directors, we can display the results.

In [186]:
def calc_avg_rating(m_list):
    '''
        Given a list of movies, finds the average movie rating. 
    '''
    
    rating = []

    # get rating for each movie
    for m in m_list:
        rating.append(m.imdb_rating)

    return np.mean(rating)

def find_avg_director_rating(tup):
    '''
        Given a director:movie_list object, finds the average rating for the 
        director. 
    '''

    # break apart tuple
    director, m_list = tup
    
    return calc_avg_rating(m_list)


In [187]:
# Sort the (director, movie list) pairs from highest to lowest average rating
sorted_directors = sorted(movie_dict.items(), key=find_avg_director_rating, reverse=True)
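
As a rough illustration of sorting dictionary items by a computed average, the sketch below ranks three made-up directors. It uses `statistics.mean` from the standard library in place of `np.mean`, and all names and ratings are hypothetical.

```python
from statistics import mean

# Hypothetical director -> list of ratings
ratings_by_director = {
    'Director A': [7.0, 8.0],
    'Director B': [9.0, 8.5],
    'Director C': [6.0, 6.5],
}

# items() yields (director, ratings) pairs; the key function averages the ratings
ranked = sorted(ratings_by_director.items(),
                key=lambda pair: mean(pair[1]),
                reverse=True)

# Slicing the sorted list gives the top N entries
top_two = [name for name, _ in ranked[:2]]

print(top_two)  # ['Director B', 'Director A']
```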

Now that the directors are ranked, all we have to do is print the top 20.

In [188]:
# Loop through the top 20 directors
for idx, (director, m_list) in enumerate(sorted_directors[:20]):

    # find the average movie rating for director
    avg_rating = calc_avg_rating(m_list)

    # print director summary
    print('{:02}. {:<52} {:.1f}'.format(idx+1, director, avg_rating))
    print('-'*60)

    # print movie list summary
    for item in sorted(m_list, key=lambda x: x.imdb_rating, reverse=True):

        # we truncate any movie names over 50 characters in length
        title = (item.name[:47] + '...') if len(item.name)>50 else item.name
        
        print('{}] {:<50} {}'.format(item.year, title, item.imdb_rating))
    
    print()

01. Christopher Nolan                                    8.4
------------------------------------------------------------
2008] The Dark Knight                                    9.0
2010] Inception                                          8.8
2014] Interstellar                                       8.6
2012] The Dark Knight Rises                              8.5
2006] The Prestige                                       8.5
2000] Memento                                            8.5
2005] Batman Begins                                      8.3
2002] Insomnia                                           7.2

02. Quentin Tarantino                                    8.2
------------------------------------------------------------
1994] Pulp Fiction                                       8.9
2012] Django Unchained                                   8.5
1992] Reservoir Dogs                                     8.4
2009] Inglourious Basterds                               8.3
2003] Kill Bill: Vol. 1

We have used our knowledge of Counters, defaultdicts, and namedtuples to complete today's task.