# W2 Lab Assignment

The [Internet Movie Database (IMDb)](http://www.imdb.com/) provides information about movies, such as total budgets, lengths, actors, and user ratings. The data are publicly available [here](http://www.imdb.com/interfaces). In this lab, let's explore a processed dataset named 'imdb.csv', which contains some basic information about movies.

Download the file from Canvas. It has 4 tab-separated columns:

1. Title: title of the movie;
1. Year: release year;
1. Rating: average IMDb user rating;
1. Votes: number of IMDb users who rated this movie.

Things to note:

1. Let's use Python 3.5;
2. There are 313,012 lines in the file. When printing things, print selectively.


## Q1: What are the first and last years in this dataset? How many movies were released in each year during the whole time period?

To do this, we first need to read the CSV file. Python provides the [csv](https://docs.python.org/3.5/library/csv.html) module to read and write CSV files. The [`csv.reader`](https://docs.python.org/3.5/library/csv.html#csv.reader) function returns a Python object which will iterate over lines in the given file. Each line is returned as a list of strings, so we can access a particular column using a list index. If we want to ignore the first line, we can use [`islice`](https://docs.python.org/3.5/library/itertools.html#itertools.islice). It is like slicing a list, but it can slice an iterator (e.g. a file stream). For instance, `islice(reader, 0, 5)` means "give me the first 5 items from the `reader`". `islice(reader, 1, 5)` means "give me the 4 items starting from the second item". `islice(reader, 1, None)` means "skip the first item and give me all the rest".

A basic usage example that reads the first 5 lines of 'imdb.csv':

In [2]:
import csv
from itertools import islice

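# 'with' closes the file automatically when the block ends
with open('imdb.csv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in islice(reader, 0, 5):
        print(row)     # the whole row, as a list of strings
        print(row[1])  # the second column (Year)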

['Title', 'Year', 'Rating', 'Votes']
Year
['!Next?', '1994', '5.4', '5']
1994
['#1 Single', '2006', '6.1', '61']
2006
['#7DaysLater', '2013', '7.1', '14']
2013
['#Bikerlive', '2014', '6.8', '11']
2014
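
To skip the header row, the slice can start at index 1 and use `None` as the stop value so that it runs to the end of the input. The sketch below is only an illustration: it reads from a small in-memory sample (made-up rows wrapped in `io.StringIO`) instead of 'imdb.csv', so that it runs without the data file.

```python
import csv
import io
from itertools import islice

# Hypothetical sample standing in for 'imdb.csv' (tab-separated, with a header row).
sample = "Title\tYear\tRating\tVotes\nMovie A\t1994\t5.4\t5\nMovie B\t2006\t6.1\t61\n"

with io.StringIO(sample) as f:
    reader = csv.reader(f, delimiter='\t')
    # Start at index 1 to skip the header; stop=None means "until the end".
    rows = list(islice(reader, 1, None))

for row in rows:
    # Every field is a string, so convert before doing arithmetic.
    print(row[0], int(row[1]))
```

With the real file, the same `islice(reader, 1, None)` call would apply to a reader created from `open('imdb.csv', ...)`.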


There are many ways to do Q1. One way is to use [dictionaries](https://docs.python.org/2/tutorial/datastructures.html#dictionaries) where the key: value pairs are:

- key: year
- value: a list of movie titles, or the number of movies


In [3]:
dt = {}
year = 1972
if year not in dt:
    dt[year] = 1
else:
    dt[year] += 1
print(dt)

{1972: 1}


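The [`Counter`](https://docs.python.org/3.4/library/collections.html#collections.Counter) class from the `collections` module automates the bookkeeping above: a key that has not been seen yet counts as 0, so no `if`/`else` check is needed.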

In [4]:
from collections import Counter

movie_counter = Counter()
movie_counter[1972] += 1
print(movie_counter[1972])
print(movie_counter[1970])

1
0
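
A `Counter` can also be built directly from an iterable, which counts every item in a single call. This is a minimal sketch, assuming a hypothetical list of year strings shaped like the values in the Year column:

```python
from collections import Counter

# Hypothetical year values, as strings, the way csv.reader returns them.
years = ['1994', '2006', '1994', '2013', '2006', '1994']

# Convert to int so that the keys compare numerically, not alphabetically.
year_counter = Counter(int(y) for y in years)

print(year_counter[1994])
print(year_counter[2006])
print(year_counter[1900])  # a missing key counts as 0 and is not added
```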


Once all lines are read, we want to print the dictionary, which can be done by iterating over its key: value pairs.

In [5]:
for key,val in dt.items():
    print(key,val)
for key,val in movie_counter.items():
    print(key,val)

1972 1
1972 1
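
In Python 3.5 a dictionary does not guarantee any particular key order, so the years may not print chronologically. One possible way to get them in order is to wrap the items in `sorted()`. The counts below are made up for illustration:

```python
from collections import Counter

# Hypothetical counts per year, for illustration only.
counts = Counter({1980: 5, 1972: 1, 2015: 1})

# sorted() compares the (key, value) tuples by key first, i.e. by year.
ordered = sorted(counts.items())
for year, n in ordered:
    print(year, n)
```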


You can get the keys (the years) with the `.keys()` method.

In [6]:
movie_counter[1980] += 5
movie_counter[2015] += 1
movie_counter.keys()

dict_keys([1980, 1972, 2015])

You also have convenient functions like [`min()`](https://docs.python.org/2/library/functions.html#min) and [`max()`](https://docs.python.org/2/library/functions.html#max) for finding the smallest and largest values in a list or other iterable.

In [7]:
alist = [23,3,5,4,2,1,1,0,1000]
print(min(alist))
print(max(alist))

0
1000
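
Because `min()` and `max()` accept any iterable, they can be applied to the keys of a dictionary or `Counter` directly. A small sketch with made-up counts:

```python
from collections import Counter

# Hypothetical counts per year, for illustration only.
counts = Counter({1980: 5, 1972: 1, 2015: 1})

first_year = min(counts)        # iterating over a dict yields its keys
last_year = max(counts.keys())  # equivalent, with the keys made explicit
print(first_year, last_year)
```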


**Code for Q1**

In [8]:
# implement below


## Q2: What are the average ratings and average votes?

We can store the ratings/votes column as a list and then calculate various basic statistics (mean, median, etc.). To do this, we can use the [NumPy](http://www.numpy.org/) library and call the functions [`numpy.mean`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html) and [`numpy.median`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.median.html). For example,

In [9]:
import numpy as np

alist = [1,3,6,2,5,2]
print(np.mean(alist))
print(np.median(alist))

3.16666666667
2.5
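
Since `csv.reader` returns every field as a string, the values need to be converted to numbers before computing statistics. The sketch below assumes two short lists of strings, taken from the four sample rows printed earlier; the variable names are only illustrative.

```python
import numpy as np

# Rating and Votes values from the sample rows above, still as strings.
ratings = ['5.4', '6.1', '7.1', '6.8']
votes = ['5', '61', '14', '11']

# Convert each string to a number before passing the list to NumPy.
ratings_f = [float(r) for r in ratings]
votes_i = [int(v) for v in votes]

mean_rating = np.mean(ratings_f)
median_votes = np.median(votes_i)
print(mean_rating)
print(median_votes)
```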


**Code for Q2**

In [10]:
# implement below


## Q3: What are the 5 movies that have the highest ratings and highest votes, respectively?

Store the movie titles and ratings as a dictionary:

- key: movie title
- value: movie rating

Then, we can sort the dictionary's items by value, which returns a list of [tuples](https://docs.python.org/2/tutorial/datastructures.html#tuples-and-sequences). Remember to print only the top 5 movies.

In [11]:
import operator

dt = {1971: 2, 1975: 10, 1962: 1, 1980: 50, 1981: 55}
sorted_x_by_val = sorted(dt.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_x_by_val)
for elem in sorted_x_by_val:
    print(elem[0],elem[1])

[(1981, 55), (1980, 50), (1975, 10), (1971, 2), (1962, 1)]
1981 55
1980 50
1975 10
1971 2
1962 1
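
To keep only the top entries, the sorted list can be sliced. This is a minimal sketch with a hypothetical title-to-rating dictionary (the titles and ratings are made up):

```python
import operator

# Hypothetical title -> rating mapping, for illustration only.
ratings = {'A': 7.1, 'B': 5.4, 'C': 9.0, 'D': 6.8, 'E': 8.2, 'F': 6.1, 'G': 7.9}

# Sort the (title, rating) pairs by rating, highest first.
by_rating = sorted(ratings.items(), key=operator.itemgetter(1), reverse=True)

# The slice keeps only the first five tuples of the sorted list.
top5 = by_rating[:5]
for title, rating in top5:
    print(title, rating)
```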


**Code for Q3**

In [12]:
# implement below


#### Name the .ipynb file 'lab02_lastname_firstname' and upload it to Canvas under [w2] lab assignment.
