# Introduction to pandas

This tutorial sheet is designed to:

* Introduce you to pandas, a Python library for data analysis that's especially suited to tabular data.
* Give you some more practice with data analysis using a different dataset.
* Encourage you to use library documentation to figure out functions in unfamiliar libraries. pandas documentation is available at https://pandas.pydata.org/docs/index.html

As we did yesterday, we'll first import pandas and load in our dataset, containing nearly 1000 different movies from IMDb, into a `DataFrame` called `movies`:

In [9]:
import pandas as pd # another shortened name to make it easier to type
movies = pd.read_csv('data/imdb_1000.csv')

# Your tasks for today

## Displaying our dataset

You can just print out `movies` like any other variable, but this gives you output that's a little messy and hard to read. This dataset also has nearly 1000 rows, so it'll be massive! In the following cell, use the `head` method in pandas to display the first few rows of our dataset. You may need to refer to the pandas documentation to figure out how to use the `head` method.

In [58]:
# your code here

You should see that every row consists of a few different values:

* A star rating, out of 10
* The title of the movie
* The content rating of the movie (using the American ratings system)
* The genre of the movie
* The duration of the movie in minutes
* A few of the main actors in the movie

Note that each column already has a title! You can refer to that title when you're slicing and indexing this `DataFrame`, rather than having to remember column numbers. Also, note that you can combine numeric data (`star_rating` and `duration`) with non-numeric data (`title`, `content_rating` and `genre`). A column can even hold Python lists, though note that `read_csv` loads `actors_list` as text that looks like a list rather than as an actual list.
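As a minimal sketch of selecting columns by title, the cell below builds a small made-up `DataFrame`. The `toy` variable and its contents are hypothetical and are not part of the movies dataset.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the movies dataset
toy = pd.DataFrame({
    "name": ["apple", "banana", "cherry"],
    "price": [0.5, 0.25, 3.0],
    "in_stock": [True, False, True],
})

# A single column title in square brackets gives a Series
prices = toy["price"]

# A list of column titles gives a smaller DataFrame
subset = toy[["name", "price"]]

print(prices)
print(subset)
```

The same square-bracket pattern should work with any column title in a `DataFrame`.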

## Double checking our DataFrame

When importing a new dataset, it's always good to double check that the data is laid out as expected and that each column has the right type. This can ensure you don't run into any really confusing bugs later on! In the next cell, output the shape of `movies`, and the data type of each column of `movies`.
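If you'd like to see what this kind of check looks like before trying it on `movies`, here is a rough sketch on a made-up `DataFrame` (the `toy` variable is hypothetical). It uses the `shape` and `dtypes` attributes.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the movies dataset
toy = pd.DataFrame({
    "name": ["apple", "banana", "cherry"],
    "price": [0.5, 0.25, 3.0],
    "in_stock": [True, False, True],
})

# shape is a (rows, columns) tuple
print(toy.shape)  # (3, 3)

# dtypes lists the data type pandas chose for each column
print(toy.dtypes)
```

Both are attributes rather than methods, so there are no brackets after them.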

In [57]:
# your code here

## Slicing and indexing

Let's start with some simple questions about our dataset. These should feel pretty similar to yesterday's tutorial sheet, though you'll implement them slightly differently in pandas. Make sure you use the pandas documentation!
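As a starting point, here is a minimal sketch of filtering rows and finding the largest value in a column. It uses a made-up `DataFrame` (the `toy` variable and its columns are hypothetical), so you'll still need to adapt the ideas to `movies`.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the movies dataset
toy = pd.DataFrame({
    "name": ["apple", "banana", "cherry", "carrot"],
    "kind": ["fruit", "fruit", "fruit", "vegetable"],
    "price": [0.5, 0.25, 3.0, 0.1],
})

# A comparison on a column gives a True/False Series (a boolean mask)
is_fruit = toy["kind"] == "fruit"

# Indexing with the mask keeps only the rows where it is True
fruit = toy[is_fruit]

# Aggregations such as max() work on a single column
highest_price = fruit["price"].max()

# idxmax() gives the index label of the largest value; loc looks that row up
priciest_name = fruit.loc[fruit["price"].idxmax(), "name"]

# isin() builds a mask that matches any value in a list
apple_or_carrot = toy[toy["name"].isin(["apple", "carrot"])]

print(highest_price, priciest_name)
print(apple_or_carrot)
```

Building the mask on its own line is a style choice here. Writing the comparison directly inside the square brackets should behave the same way.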

In [None]:
# How long is the longest film in our dataset?

In [None]:
# What's the title of the longest film in our dataset?

In [None]:
# What's the best (highest-rated) Animation film in our dataset?

In [None]:
# What's the best (highest-rated) film rated either G or PG in our dataset?

## Grouping and aggregation

Another very useful operation is grouping together rows in our dataset based on the value of one of the columns. The rest of this exercise will give you some practice in doing this. Remember to search the pandas documentation: there are built-in functions that will do this for you without using `for` or `while` loops.
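To give a feel for what grouping looks like, here is a small sketch using `unique` and `groupby` on a made-up `DataFrame`. The `toy` variable and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the movies dataset
toy = pd.DataFrame({
    "kind": ["fruit", "fruit", "vegetable", "vegetable", "fruit"],
    "price": [0.5, 0.25, 0.1, 0.3, 3.0],
})

# unique() lists each distinct value in a column once
kinds = toy["kind"].unique()

# groupby() splits the rows by a column's value;
# mean() then averages the chosen column within each group
mean_price = toy.groupby("kind")["price"].mean()

print(kinds)
print(mean_price)
```

The result of the `groupby` line is a Series indexed by the group names, so `mean_price["fruit"]` looks up the average for a single group.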

In [None]:
# Output every genre of film in our dataset. Each genre should only appear once, without any duplicates.

In [None]:
# For each genre of film, what's the average length of films in that genre?

In [None]:
# For each content rating, what's the average star rating of films with that content rating?

## numpy vs pandas

This question doesn't involve any code - this is mainly an opportunity for you to think about numpy and pandas.

What did you like about using pandas? Is there anything you found easier or harder to use compared to numpy? How do you think this tutorial sheet might be different if you tried to use numpy?

## What else were they in?

Note that this question is a bit more complicated than the other questions so far, so don't worry if you don't get it in the tutorial!

Write a function that, given the name of an actor, returns every film in our data that said actor appears in.
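If you get stuck, the sketch below shows one possible approach on a made-up `DataFrame`. It assumes the names are stored as text in a single column. The `toy` variable, its columns, and the `books_by` function are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the movies dataset
toy = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "authors": ["['Ann Lee', 'Bo Chen']", "['Cy Diaz']", "['Bo Chen']"],
})

def books_by(author):
    """Return the rows of toy whose authors text mentions the given name."""
    # regex=False treats the name as plain text rather than a pattern
    mask = toy["authors"].str.contains(author, regex=False)
    return toy[mask]

print(books_by("Bo Chen"))
```

This is a simple substring match, so a short name could also match part of a longer one. Whether that matters depends on your data.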

In [56]:
# your code here