# Lab 1: Pandas Refresher

Welcome to lab 1! Most of you were in DATA 118 and are familiar with `pandas`, but this lab will serve as a refresher of some of the basics. Remember that a great thing about `pandas` is that it is *heavily* documented online - most issues you come across are easily googleable. Feel free to use your resources wisely in this course! The [pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) is also useful, and later parts of this lab refer to it.

First, import libraries by running the cell below.

In [1]:
import numpy as np
import pandas as pd

The file `imdb.csv` contains a table of information about the 250 highest-rated movies on IMDb. The following cell will load this file and store it in a dataframe called `imdb`. Then, we use the `.head()` operation to display the first 5 rows.

In [2]:
imdb = pd.read_csv('imdb.csv')
imdb.head()

Unnamed: 0,Votes,Rating,Title,Year,Decade
0,88355,8.4,M,1931,1930
1,132823,8.3,Singin' in the Rain,1952,1950
2,74178,8.3,All About Eve,1950,1950
3,635139,8.6,Léon,1994,1990
4,145514,8.2,The Elephant Man,1980,1980


## 1. Analyzing datasets

With just a few pandas methods, we can answer some interesting questions about the IMDb dataset.

If we want just the ratings of the movies, there are two ways to select a column in pandas (interact with the cell below to confirm that they are the same). 

The dot syntax (`imdb.Rating`) does not work if your column name has a space in it or if your column has the same name as a pandas method, like "mean" or "plot." The bracket syntax (`imdb["Rating"]`) works for any column name.


In [3]:
imdb["Rating"]
imdb.Rating

0      8.4
1      8.3
2      8.3
3      8.6
4      8.2
      ... 
245    8.7
246    8.1
247    8.2
248    8.1
249    8.3
Name: Rating, Length: 250, dtype: float64
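To make the limits of the dot syntax concrete, here is a minimal sketch on a small made-up table. The table `toy` and its column names are assumptions chosen for illustration, not part of `imdb.csv`.

```python
import pandas as pd

# Hypothetical toy table: one column name contains a space, and one
# shares its name with the DataFrame method .mean().
toy = pd.DataFrame({
    "Rating": [8.4, 8.3, 8.6],
    "Box Office": [1.2, 3.4, 5.6],
    "mean": [1, 2, 3],
})

# For an ordinary column name, both spellings return the same Series.
same = toy["Rating"].equals(toy.Rating)
print(same)  # True

# Bracket syntax handles the awkward names without trouble.
box_office = toy["Box Office"]
mean_column = toy["mean"]

# Dot syntax finds the DataFrame method here, not the column.
print(callable(toy.mean))                  # True
print(isinstance(mean_column, pd.Series))  # True
```

When in doubt, the bracket syntax is the safer default.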

The object returned is a pandas Series - this is a specific pandas object. You can turn it into a list or a numpy array as follows and then use operations you know on those objects.

In [4]:
rating_list = list(imdb.Rating)
rating_array = imdb.Rating.values
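As a sketch of what you can do after converting, the example below uses a small made-up Series (the values are assumptions, not the real ratings). It also shows `.tolist()` and `.to_numpy()`, which are alternative spellings of the two conversions above.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, standing in for imdb.Rating.
ratings = pd.Series([8.4, 8.3, 8.6], name="Rating")

rating_list = ratings.tolist()     # plain Python list of floats
rating_array = ratings.to_numpy()  # NumPy array, like .values

lowest = min(rating_list)          # built-in Python function on the list
average = np.mean(rating_array)    # NumPy function on the array

print(lowest)  # 8.3
print(type(rating_list), type(rating_array))
```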

Alternatively, there are many operations you can apply directly to a pandas Series (or an entire dataframe) to summarize the data - check the "Summarize Data" section of the cheat sheet for more details. The following code finds the sum of all the "Votes" in the `imdb` dataframe:

In [5]:
imdb.Votes.sum()

86266786
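A few other summary operations work the same way. The sketch below runs them on a tiny made-up table (the numbers are assumptions for illustration), first on a single column and then on the whole dataframe.

```python
import pandas as pd

# Hypothetical toy table with two numeric columns.
toy = pd.DataFrame({"Votes": [100, 250, 50], "Rating": [8.4, 8.3, 8.6]})

total_votes = toy.Votes.sum()        # 400
fewest_votes = toy.Votes.min()       # 50
median_rating = toy.Rating.median()  # 8.4

# Applied to the whole dataframe, a summary returns one value per column.
column_means = toy.mean()

# describe() collects count, mean, std, min, quartiles, and max in one table.
summary = toy.describe()
print(summary)
```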

**Question 1.1.** Find the rating of the highest-rated movie in the dataset. You can either apply an operation directly to the dataframe, or convert the Ratings column to an array to use a function on that.

In [None]:
highest_rating = ...
highest_rating

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The correct answer is 9.2, and one possible way to get it is `imdb.Rating.max()`.
</p>
</details>

That's not very useful, though. You'd probably want to know the *name* of the movie whose rating you found!  To do that, we can sort the entire table by rating, which ensures that the ratings and titles will stay together.

In [None]:
imdb.sort_values("Rating")

Well, that actually doesn't help much, either -- we sorted the movies from lowest -> highest ratings.  To look at the highest-rated movies, sort in reverse order:

In [None]:
imdb.sort_values("Rating", ascending=False)

(The `ascending=False` bit is called an *optional argument*. It has a default value of `True`, so when you explicitly tell the function `ascending=False`, then the function will sort in descending order.)

So there are actually 2 highest-rated movies in the dataset: *The Shawshank Redemption* and *The Godfather*.

Some details about sorting dataframes:

1. The first argument to `sort_values` is the name of a column to sort by.
2. If the column has strings in it, `sort_values` will sort alphabetically; if the column has numbers, it will sort numerically.
3. The value of `imdb.sort_values("Rating")` is a *copy of `imdb`*; the `imdb` table doesn't get modified. For example, if we called `imdb.sort_values("Rating")`, then running `imdb` by itself would still return the unsorted table.
4. Rows always stick together when a table is sorted.  It wouldn't make sense to sort just one column and leave the other columns alone.  For example, in this case, if we sorted just the "Rating" column, the movies would all end up with the wrong ratings.
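The sorting details above can be checked on a small made-up table. This is only a sketch; the titles and ratings are assumptions, not rows from `imdb.csv`.

```python
import pandas as pd

# Hypothetical toy table.
toy = pd.DataFrame({
    "Title": ["B movie", "A movie", "C movie"],
    "Rating": [8.1, 8.9, 8.5],
})

by_rating = toy.sort_values("Rating", ascending=False)  # numeric sort, descending
by_title = toy.sort_values("Title")                     # alphabetical sort

# The original table is unchanged: sort_values returned sorted copies.
print(toy.Title.tolist())        # ['B movie', 'A movie', 'C movie']
print(by_title.Title.tolist())   # ['A movie', 'B movie', 'C movie']

# Rows stick together: each title keeps its own rating.
print(by_rating.Title.tolist())   # ['A movie', 'C movie', 'B movie']
print(by_rating.Rating.tolist())  # [8.9, 8.5, 8.1]
```

To keep a sorted table around, assign the result to a name, as `by_rating` and `by_title` do here.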

**Question 1.2.** Create a version of `imdb` that's sorted chronologically, with the earliest movies first.  Call it `imdb_by_year`.

In [None]:
imdb_by_year = ...
imdb_by_year

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `imdb.sort_values("Year")`.
</p>
</details>

**Question 1.3.** What's the title of the earliest movie in the dataset?  You could just look this up from the output of the previous cell.  Instead, write Python code to find out.

*Hint:* Starting with `imdb_by_year`, extract the Title column to get a Series, then use `iloc` to access the first item; refer to the cheat sheet for help.

In [None]:
earliest_movie_title = ...
earliest_movie_title

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `imdb_by_year["Title"].iloc[0]`.
</p>
</details>

## 2. Finding pieces of a dataset
Suppose you're interested in movies from the 1940s.  Sorting the table by year doesn't help you, because the 1940s are in the middle of the dataset.

Instead, we need to index into the table. In pandas, you do this with the `.loc[row, col]` indexer. The `row` argument allows you to filter the rows you want to select (in this case, we are interested in rows where the decade is 1940), and the `col` argument allows you to select certain columns by name (to select by integer position instead, use `.iloc`). If you only include one argument to `.loc[]`, it will filter the rows and select all columns.

The following retrieves all rows where the decade is 1940. You can use any logical conditional you want in the row selection argument.

In [None]:
forties = imdb.loc[imdb.Decade == 1940]
forties
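Here is a minimal sketch of the two-argument form of `.loc` and of combining conditions, using a made-up table (the rows are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy table.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Rating": [8.1, 8.9, 8.5, 8.7],
    "Decade": [1940, 1990, 1940, 2000],
})

# Row filter plus a single column name returns a Series.
forties_titles = toy.loc[toy.Decade == 1940, "Title"]

# Row filter plus a list of column names returns a smaller dataframe.
forties_subset = toy.loc[toy.Decade == 1940, ["Title", "Rating"]]

# Combine conditions with | (or) and & (and), wrapping each in parentheses.
forties_or_2000s = toy.loc[(toy.Decade == 1940) | (toy.Decade == 2000)]

print(forties_titles.tolist())          # ['A', 'C']
print(forties_subset.shape)             # (2, 2)
print(forties_or_2000s.Title.tolist())  # ['A', 'C', 'D']
```

The parentheses matter because `|` and `&` bind more tightly than `==` in Python.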

**Question 2.1.** Compute the average rating of movies from the 1940s.

In [None]:
average_rating_in_forties = ...
average_rating_in_forties

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The answer is approximately 8.26, and one possible way to get it is
`forties.Rating.mean()`.
</p>
</details>

**Question 2.2.** Find all the movies with a rating higher than 8.5.  Put their data in a table called `really_highly_rated`.

In [None]:
really_highly_rated = ...
really_highly_rated

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `imdb.loc[imdb.Rating>8.5]`
</p>
</details>

**Question 2.3.** Find the average rating for movies released in the 20th century and the average rating for movies released in the 21st century for the movies in `imdb`.

*Hint*: Think of the steps you need to do (take the average, find the ratings, find movies released in 20th/21st centuries), and try to put them in an order that makes sense.

In [None]:
average_20th_century_rating = ...
average_21st_century_rating = ...
print("Average 20th century rating:", average_20th_century_rating)
print("Average 21st century rating:", average_21st_century_rating)

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
Possible code is `imdb.loc[imdb.Decade<2000].Rating.mean()` and `imdb.loc[imdb.Decade>=2000].Rating.mean()`
</p>
</details>

The property `shape` tells you how many (rows, columns) are in a table.  (A "property" is an attribute that you read without parentheses, unlike a method, which you call by adding parentheses.) It returns a tuple; to get just the number of rows, access the first element of the tuple.

In [None]:
num_movies_in_dataset = imdb.shape[0]
num_movies_in_dataset
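As a quick sketch of how `shape` behaves, the example below uses a made-up three-row table (an assumption for illustration). It also shows `len`, which gives the row count directly.

```python
import pandas as pd

# Hypothetical toy table with 3 rows and 2 columns.
toy = pd.DataFrame({"Title": ["A", "B", "C"], "Year": [1931, 1952, 1994]})

num_rows, num_cols = toy.shape  # unpack the (rows, columns) tuple
print(num_rows, num_cols)       # 3 2

# len() of a dataframe is its number of rows.
print(len(toy) == toy.shape[0])  # True

# A Series also has a shape, with a single entry.
print(toy.Year.shape)  # (3,)
```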

**Question 2.4.** Use `shape` (and arithmetic) to find the *proportion* of movies in the dataset that were released in the 20th century, and the proportion from the 21st century.

*Hint:* The *proportion* of movies released in the 20th century is the *number* of movies released in the 20th century, divided by the *total number* of movies.

In [None]:
proportion_in_20th_century = ...
proportion_in_21st_century = ...
print("Proportion in 20th century:", proportion_in_20th_century)
print("Proportion in 21st century:", proportion_in_21st_century)

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `imdb.loc[imdb.Decade<2000].shape[0]/num_movies_in_dataset` and `imdb.loc[imdb.Decade>=2000].shape[0]/num_movies_in_dataset`.
</p>
</details>

**Question 2.5.** Here's a challenge: Find the number of movies that came out in *even* years.

*Hint:* The operator `%` computes the remainder when dividing by a number.  So `5 % 2` is 1 and `6 % 2` is 0.  A number is even if the remainder is 0 when you divide by 2.

In [None]:
num_even_year_movies = ...
num_even_year_movies

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The correct answer is 127, and one possible way to get it is `imdb.loc[imdb.Year%2==0].shape[0]`.
</p>
</details>

## 3. Other stuff

Let's say we want to add a new column to our dataframe that scales the rating column by 10. The syntax to do this is as follows:

In [None]:
imdb['scaled_rating'] = imdb.Rating*10
imdb.head()

Notice that this operation actually changes the underlying dataframe. Also notice that applying * to the Rating series performed the operation element-wise. In general, you create a new column in a table by defining the name of the column in brackets after the dataframe and then setting that equal to an object that is the same length as the dataframe (it could also be a list or an array). 
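The sketch below shows column assignment from a Series, a list, and an array on a made-up table (all names and values are assumptions for illustration). It also shows what happens when the lengths do not match.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table.
toy = pd.DataFrame({"Rating": [8.4, 8.3, 8.6]})

toy["rating_out_of_5"] = toy.Rating / 2  # element-wise arithmetic on a Series
toy["my_rank"] = [2, 3, 1]               # a list of the same length
toy["zeros"] = np.zeros(3)               # a NumPy array of the same length

# A list of the wrong length is rejected with a ValueError.
try:
    toy["too_short"] = [1, 2]
except ValueError as err:
    print("Could not add column:", err)

print(list(toy.columns))
```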

**Question 3.1.**

Add a column to `imdb` called `votes_per_million` that computes the number of votes each movie received, in millions (that is, the vote count divided by 1,000,000).

In [None]:
imdb['votes_per_million'] = ...
imdb.head()

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
`imdb['votes_per_million'] = imdb.Votes/1000000`
</p>
</details>

It is also often useful to change the shape of the data or group by certain elements of the data to get more insight. For example, we might want to know the number of movies released in a given decade. A handy operation that gives you the count of the number of rows with each unique value of a variable is `value_counts`, and is used as follows:

In [6]:
imdb.Decade.value_counts()

2000    50
1990    42
1980    31
1950    30
2010    29
1960    22
1970    21
1940    14
1930     7
1920     4
Name: Decade, dtype: int64
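By default the counts come back ordered from most to least common. The sketch below, on a made-up Series of decades (an assumption for illustration), shows two common variations: reordering by decade with `sort_index` and returning proportions with `normalize=True`.

```python
import pandas as pd

# Hypothetical decades, standing in for imdb.Decade.
decades = pd.Series([1990, 2000, 1990, 1950, 2000, 1990], name="Decade")

counts = decades.value_counts()  # most common value first
print(counts.index.tolist())     # [1990, 2000, 1950]

by_decade = counts.sort_index()  # chronological order instead
print(by_decade.tolist())        # [1, 3, 2]

# normalize=True returns proportions that sum to 1 instead of counts.
shares = decades.value_counts(normalize=True)
print(shares.loc[1990])          # 0.5
```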

You can achieve the same result using the `groupby` operation - see if you can use the cheat sheet (or Google) to figure out how to do this!

**Question 3.2.**
Use `groupby` to find the number of movies per decade and make sure it matches what `value_counts` gave you. You may notice that your first guess returns multiple columns with the same values - think about why and see if you can select just the first one.

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
`imdb.groupby('Decade').count()['Votes']`
</p>
</details>

Now, feel free to explore more of the operations on the cheat sheet! It is a pretty concise resource and you should know how to do everything on it. Feel free to ask your TA questions if you don't understand something.

In [None]:
...

All done!