# W2 Lab Assignment

The [Internet Movie Database (IMDb)](http://www.imdb.com/) provides information about movies, such as budgets, running times, actors, and user ratings. The data are publicly available [here](http://www.imdb.com/interfaces). In this lab, let's explore a processed dataset named 'imdb.csv', which contains some basic information about movies.

Download the file from Canvas. It has 4 tab-separated columns:

1. Title: title of the movie;
1. Year: release year;
1. Rating: average IMDb user rating;
1. Votes: number of IMDb users who rated this movie.

Things to note:

1. Let's use Python 3.5;
2. There are 313,012 lines in the file. When printing things, print selectively.


## Q1: What is the first and last year in this dataset? How many movies were released in each year during the whole time period?

To do this, we first need to read the CSV file. Python provides the [csv](https://docs.python.org/3.5/library/csv.html) module to read and write CSV files. The [`csv.reader`](https://docs.python.org/3.5/library/csv.html#csv.reader) function returns an object that iterates over the lines of the given file. Each line is returned as a list of strings, so we can access a particular column by list index. If we want to ignore the first line (the header), we can use [`islice`](https://docs.python.org/3.5/library/itertools.html#itertools.islice). It is like slicing a list, but it works on any iterator (e.g. a file stream). For instance, `islice(reader, 0, 5)` means "give me the first 5 items from the `reader`", `islice(reader, 1, 5)` means "give me the 4 items starting from the second item", and `islice(reader, 1, None)` means "skip the first item and give me everything after it".

A basic usage example that reads the first 5 lines of 'imdb.csv':

In [2]:
import csv
from itertools import islice

# The with-block closes the file automatically once reading is done.
with open('imdb.csv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in islice(reader, 0, 5):
        print(row)
        print(row[1])

['Title', 'Year', 'Rating', 'Votes']
Year
['!Next?', '1994', '5.4', '5']
1994
['#1 Single', '2006', '6.1', '61']
2006
['#7DaysLater', '2013', '7.1', '14']
2013
['#Bikerlive', '2014', '6.8', '11']
2014
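
To keep the header row out of the data, the slice can start at index 1. The following is a minimal sketch, assuming a small in-memory sample (the `sample` variable is hypothetical and stands in for 'imdb.csv'):

```python
import csv
import io
from itertools import islice

# Hypothetical in-memory sample standing in for 'imdb.csv'
# (tab-separated, with a header row).
sample = io.StringIO(
    "Title\tYear\tRating\tVotes\n"
    "!Next?\t1994\t5.4\t5\n"
    "#1 Single\t2006\t6.1\t61\n"
    "#7DaysLater\t2013\t7.1\t14\n"
)

reader = csv.reader(sample, delimiter='\t')

# Start at index 1 to skip the header; stop=None means "read until the end".
data_rows = list(islice(reader, 1, None))

for row in data_rows:
    print(row[0], row[1])
```

With the real file, the same `islice(reader, 1, None)` call would be applied to a reader created from `open('imdb.csv')`.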


There are many ways to do Q1. One way is to use [dictionaries](https://docs.python.org/2/tutorial/datastructures.html#dictionaries) where the key: value pairs are:

- key: year
- value: a list of movie titles or number of movies


In [3]:
dt = {}
year = 1972
if year not in dt:
    dt[year] = 1
else:
    dt[year] += 1
print(dt)

{1972: 1}
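
If the values should be lists of movie titles rather than counts, the same pattern applies. This is a sketch under the assumption that the (title, year) pairs have already been extracted; the `movies` list and the `titles_by_year` name are made up for illustration:

```python
# Hypothetical (title, year) pairs; in the lab these would come from the CSV rows.
movies = [
    ("The Shawshank Redemption", 1994),
    ("Inception", 2010),
    ("Pulp Fiction", 1994),
]

titles_by_year = {}
for title, year in movies:
    # setdefault returns the list already stored for this year,
    # or stores and returns a new empty list if the year is unseen.
    titles_by_year.setdefault(year, []).append(title)

print(titles_by_year[1994])
print(len(titles_by_year[1994]))
```

The number of movies in a year is then simply the length of that year's list.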


Python's [`Counter`](https://docs.python.org/3.4/library/collections.html#collections.Counter) automates the bookkeeping above: a missing key counts as 0, so no membership check is needed.

In [4]:
from collections import Counter

movie_counter = Counter()
movie_counter[1972] += 1
print(movie_counter[1972])
print(movie_counter[1970])

1
0
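
Putting the reader and the counter together might look like the sketch below. It assumes a tiny in-memory sample in place of 'imdb.csv' (the `sample` variable is hypothetical) and converts the year column to `int` so the keys sort numerically:

```python
import csv
import io
from collections import Counter
from itertools import islice

# Hypothetical in-memory sample standing in for 'imdb.csv'.
sample = io.StringIO(
    "Title\tYear\tRating\tVotes\n"
    "!Next?\t1994\t5.4\t5\n"
    "Fight Club\t1999\t8.9\t1189053\n"
    "Pulp Fiction\t1994\t8.9\t1177471\n"
    "Inception\t2010\t8.8\t1285905\n"
)

reader = csv.reader(sample, delimiter='\t')
movie_counter = Counter()

# Skip the header row, then count one movie per data row.
for row in islice(reader, 1, None):
    year = int(row[1])  # csv.reader yields strings, so convert explicitly
    movie_counter[year] += 1

print(movie_counter[1994])
print(movie_counter[1970])
```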


Once all lines are read, we want to print the dictionary, which can be done by iterating over its key: value pairs.

In [5]:
for key,val in dt.items():
    print(key,val)
for key,val in movie_counter.items():
    print(key,val)

1972 1
1972 1


You can get the keys (the years) by using the `.keys()` method.

In [6]:
movie_counter[1980] += 5
movie_counter[2015] += 1
movie_counter.keys()

dict_keys([1980, 1972, 2015])

Python also provides the built-in functions [`min()`](https://docs.python.org/2/library/functions.html#min) and [`max()`](https://docs.python.org/2/library/functions.html#max), which return the smallest and largest item of a list or other iterable.

In [7]:
alist = [23,3,5,4,2,1,1,0,1000]
print(min(alist))
print(max(alist))

0
1000
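
Because iterating over a dictionary or `Counter` yields its keys, `min()` and `max()` can be applied to it directly to find the first and last year. A small sketch with made-up counts (the numbers below are hypothetical, not taken from the dataset):

```python
from collections import Counter

# Hypothetical per-year counts; in the lab these come from reading the file.
movie_counter = Counter({1994: 2, 2010: 1, 1999: 1})

# min()/max() iterate over the keys, i.e. the years.
first_year = min(movie_counter)
last_year = max(movie_counter)
print(first_year, last_year)

# sorted() also iterates over the keys, giving a chronological listing.
for year in sorted(movie_counter):
    print(year, movie_counter[year])
```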


**Code for Q1**

In [2]:
# Q1, Part 1: What is the first and last year in this dataset?
import pandas as pd

df = pd.read_csv('imdb.csv', sep='\t')
earliest = df.Year.min()
latest = df.Year.max()
print("The earliest year: ", earliest)
print("The latest year: ", latest)

The earliest year:  1874
The latest year:  2017


In [3]:
# Q1, Part 2: How many movies were released in each year during the whole time period?
years = df.Year
counts = years.groupby(years).size()
print("Year\tCount")
print("----\t-----")
for year in counts.index:
    print(year,"\t",counts[year])

Year	Count
----	-----
1874 	 1
1878 	 1
1887 	 1
1888 	 5
1889 	 2
1890 	 5
1891 	 9
1892 	 9
1893 	 2
1894 	 94
1895 	 116
1896 	 678
1897 	 479
1898 	 321
1899 	 242
1900 	 265
1901 	 254
1902 	 217
1903 	 261
1904 	 214
1905 	 177
1906 	 182
1907 	 197
1908 	 267
1909 	 405
1910 	 389
1911 	 309
1912 	 376
1913 	 311
1914 	 315
1915 	 361
1916 	 328
1917 	 317
1918 	 286
1919 	 313
1920 	 323
1921 	 345
1922 	 328
1923 	 393
1924 	 466
1925 	 508
1926 	 554
1927 	 581
1928 	 609
1929 	 671
1930 	 836
1931 	 939
1932 	 1026
1933 	 1024
1934 	 1120
1935 	 1174
1936 	 1235
1937 	 1245
1938 	 1230
1939 	 1162
1940 	 1160
1941 	 1169
1942 	 1193
1943 	 1105
1944 	 969
1945 	 876
1946 	 952
1947 	 1010
1948 	 1084
1949 	 1208
1950 	 1283
1951 	 1318
1952 	 1316
1953 	 1393
1954 	 1397
1955 	 1476
1956 	 1479
1957 	 1604
1958 	 1533
1959 	 1572
1960 	 1567
1961 	 1623
1962 	 1669
1963 	 1635
1964 	 1823
1965 	 1896
1966 	 2025
1967 	 2086
1968 	 2199
1969 	 2320
1970 	 2240
1971 	 2370
197

## Q2: What are the average ratings and average votes?

We can store the ratings/votes column as a list and then calculate various basic statistics (mean, median, etc.). To do this, we can use the [NumPy](http://www.numpy.org/) library and call the functions [`numpy.mean`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html) and [`numpy.median`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.median.html). For example,

In [9]:
import numpy as np

alist = [1,3,6,2,5,2]
print(np.mean(alist))
print(np.median(alist))

3.16666666667
2.5
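
One detail to keep in mind: `csv.reader` returns every field as a string, so the rating and vote columns need to be converted to numbers before averaging. The sketch below uses the standard-library `statistics` module instead of NumPy so that it runs without extra installs; `np.mean` and `np.median` accept the same lists. The `rows` list is a hypothetical stand-in for rows read from the file:

```python
import statistics

# Hypothetical rows as csv.reader would return them (all fields are strings).
rows = [
    ['The Shawshank Redemption', '1994', '9.3', '1511933'],
    ['The Dark Knight', '2008', '9.0', '1487023'],
    ['Inception', '2010', '8.8', '1285905'],
]

# Convert the string columns to numbers before computing statistics.
ratings = [float(row[2]) for row in rows]
votes = [int(row[3]) for row in rows]

mean_rating = statistics.mean(ratings)
median_votes = statistics.median(votes)

print(mean_rating)
print(median_votes)
```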


**Code for Q2**

In [4]:
# Q2: What are the average ratings and average votes?
avgRating = df.Rating.mean()
avgVotes = df.Votes.mean()
print("Average Rating: ", avgRating)
print("Average Votes: ", avgVotes)

Average Rating:  6.296195341377723
Average Votes:  1691.2317746021706


## Q3: What are the 5 movies that have the highest ratings and highest votes, respectively?

Store the movie titles and ratings as a dictionary:

- key: movie title
- value: movie rating

Then, we can sort the dictionary by its values, which returns a list of [tuples](https://docs.python.org/2/tutorial/datastructures.html#tuples-and-sequences). Remember to print only the top 5 movies.

In [11]:
import operator

dt = {1971: 2, 1975: 10, 1962: 1, 1980: 50, 1981: 55}
sorted_x_by_val = sorted(dt.items(), key=operator.itemgetter(1), reverse=True )
print(sorted_x_by_val)
for elem in sorted_x_by_val:
    print(elem[0],elem[1])

[(1981, 55), (1980, 50), (1975, 10), (1971, 2), (1962, 1)]
1981 55
1980 50
1975 10
1971 2
1962 1
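
To keep only the top 5 entries, the sorted list can be sliced. A minimal sketch, assuming a small hypothetical title-to-rating dictionary in place of the full dataset:

```python
import operator

# Hypothetical title -> rating dictionary; the lab version has one entry per movie.
ratings = {
    'The Shawshank Redemption': 9.3,
    'The Dark Knight': 9.0,
    'Inception': 8.8,
    'Fight Club': 8.9,
    'Pulp Fiction': 8.9,
    '#1 Single': 6.1,
    '!Next?': 5.4,
}

# Sort (title, rating) tuples by rating, highest first.
ranked = sorted(ratings.items(), key=operator.itemgetter(1), reverse=True)

# Slicing keeps the first 5 tuples and leaves the rest unprinted.
top5 = ranked[:5]
for title, rating in top5:
    print(title, rating)
```

Note that a dictionary keyed by title keeps only one entry per title, so movies that share a title would overwrite each other; keying by (title, year) is one possible way around that.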


**Code for Q3**

In [6]:
# Q3, Part 1: Top 5 movies, by rating
top5byRating = df.sort_values(by='Rating',ascending=False).head(5)
print("Top 5 movies, by rating")
print(top5byRating)

Top 5 movies, by rating
                                        Title  Year  Rating  Votes
57863   Adolfo Perez Esquivel: Rivers of Hope  2015     9.9      9
42123                   The Red Shirt Diaries  2014     9.8      6
140553                              High-Rise  2015     9.8      5
131241                     Girls Loving Girls  1996     9.8      5
24902        Mari White Presents the Newsboys  2011     9.7      6


In [7]:
# Q3, Part 2: Top 5 movies, by votes
top5byVotes = df.sort_values(by='Votes', ascending=False).head(5)
print("Top 5 movies, by votes")
print(top5byVotes)

Top 5 movies, by votes
                           Title  Year  Rating    Votes
279320  The Shawshank Redemption  1994     9.3  1511933
264590           The Dark Knight  2008     9.0  1487023
149895                 Inception  2010     8.8  1285905
122656                Fight Club  1999     8.9  1189053
223981              Pulp Fiction  1994     8.9  1177471


#### Name the .ipynb file 'lab02_lastname_firstname' and upload it to Canvas under [w2] lab assignment.
