# Introduction

Data processing pipelines move and transform data from one source to another so that it can be stored, used for analytics or Machine Learning, or combined with other data. In COMP20008, we will cover an end-to-end process to preprocess data, conduct analysis, perform Machine Learning tasks, and communicate findings.

Here's a diagram depicting the "Data Science" pipeline we will cover in this subject:

<img src="images/pipeline.png" width="600" height="300"> 

## Workshop 1 Overview
- In your first week as an intern, your manager wants you to familiarise yourself with the data, practise loading the data into Pandas Series and DataFrames, and answer some basic summary questions about ticket sales in 2018.
- The dataset this week is a summary of ticket sales for 30 movies in 2018 through the EODP system.
- These sales are summed up from more than 2000 movie sessions (which you will be presented with next week).

Here's an example of what the dataset looks like:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>movie_name</th>
      <th>classification</th>
      <th>tickets_sold</th>
      <th>max_capacity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>A Quiet Place</td>
      <td>M</td>
      <td>103813</td>
      <td>427725</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Alpha</td>
      <td>PG</td>
      <td>103596</td>
      <td>422525</td>
    </tr>
    <tr>
      <th>2</th>
      <td>An Interview with God</td>
      <td>PG</td>
      <td>104182</td>
      <td>426575</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Animal World</td>
      <td>G</td>
      <td>108293</td>
      <td>427300</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Ant-Man and the Wasp</td>
      <td>PG</td>
      <td>104631</td>
      <td>429350</td>
    </tr>
    <tr>
      <th>5</th>
      <td>Aquaman</td>
      <td>M</td>
      <td>102655</td>
      <td>423100</td>
    </tr>
    <tr>
      <th>6</th>
      <td>Avengers: Infinity War</td>
      <td>M</td>
      <td>112178</td>
      <td>424325</td>
    </tr>
    <tr>
      <th>7</th>
      <td>A-X-L</td>
      <td>PG</td>
      <td>99339</td>
      <td>423200</td>
    </tr>
    <tr>
      <th>8</th>
      <td>Between Worlds</td>
      <td>MA15+</td>
      <td>103208</td>
      <td>423375</td>
    </tr>
    <tr>
      <th>9</th>
      <td>Black Panther</td>
      <td>M</td>
      <td>108831</td>
      <td>423375</td>
    </tr>
  </tbody>
</table>

As you can see, we have 4 features (also known as *columns* or *attributes*): Movie Name, Classification, Total Tickets Sold, and Total Capacity.

## Learning objectives
Become proficient in manipulating tabular data using Python's `pandas` package. `pandas` introduces powerful data structures for data analysis, time series, and statistical modelling. 

- Understand the data structures in the `Pandas` library: `Series`, `DataFrame`
- Construct or load a Series or DataFrame using `Pandas`
- Slicing and indexing using the `.loc[]` and `.iloc[]` methods
- How to work with Series and DataFrames using *methods* and *attributes*
- 5 Number Summary Statistics
- Sorting, Filtering, and Grouping DataFrames
- Problem Solving using a given dataset. 
- Learn to view and find functions using the API Documentation

## Workshop Overview

1. Load the `week1_booking_summary.csv` using `pandas`. Previously in COMP10001 you would have used the `csv` library.
2. Calculate the occupancy rate for each movie.
3. Get the classification rating of `'Ralph Breaks the Internet'`.
4. Query and return the movie with the highest number of tickets sold.
5. Sort the dataframe by Classification (ascending), Occupancy (descending), then Number of Tickets Sold (descending). Avoid outputting the Max Capacity feature. 
6. Filter the data to *only* show PG-rated movies.
7. How many movies are there *in each classification category*? For each category, which movie has the *highest* sales? Which one has the *lowest* sales?

# Getting Started with Jupyter Notebook
- Jupyter Notebook is an extremely useful tool for developing and presenting projects (particularly in Python). You can include code segments and view their output directly in your browser. You can also add rich text, visualisations, equations and more.

- The difference between this and Grok (from COMP10001) is that you can run your code cell by cell (without having to run all of your code at once for an output).

## Cells
Jupyter notebooks contain two main types of cells:
- Markdown cells: These can be used to contain text, equations and other non-code items. The cell that you're reading right now is a markdown cell. You can use [Markdown](https://www.markdownguide.org/) to format your text. If you prefer, you can also format your text using <b>HTML</b>. Clicking the <button class='btn btn-default btn-xs'><i class="fa-play fa"></i><span class="toolbar-btn-label">Run</span></button> button will format and display your text.
- Code cells: These contain code segments that can be executed individually.  When executed, the output of the code will be displayed below the code cell.

## Keyboard Shortcuts
Cell Running shortcuts:
- _You can tell you have a cell selected for editing when its outline is green_
- `shift + enter` : Run current cell - keyboard shortcut for the <button class='btn btn-default btn-xs'><i class="fa-play fa"></i><span class="toolbar-btn-label">Run</span></button> button
- `ctrl + enter` : Run selected cells

Command mode (press `esc` to enter):
- _You can tell you are in Command Mode when the cell outline is blue_
- `a` to create a cell **above**
- `b` to create a cell **below**
- `dd` (double d) to **delete** a cell
- `m` to make the cell render in **markdown**
- `r` to make the cell render in **raw** text
- `y` to make the cell render python code
- `enter` to "edit" the cell

Code Shortcuts:
- `shift + tab` : brings function/method arguments up

# Pandas
Depending on the use case, data come in various shapes and structures. One of the most common forms is *tabular data*, or data tables (think Excel spreadsheets or SQL tables). It's both human-readable and machine-readable, and it's easy to *vectorize* any transformation to our data. Here's a visualisation of what a DataFrame looks like:
![Dataframe](images/dataframe.jpg)

To work with tabular data in Python, we use the library `pandas`. We **strongly recommend** you bookmark the [API reference (Documentation)](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) which will serve as a bible for this subject.

In case you require additional reading material:
- [Intro to Data Structures](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html) 
- There are also a number of step-by-step tutorials such as [this one by DataCamp](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python).

## Installing Packages
***If you are running this on the Ed Forum or via Anaconda (Python), you won't need to do this.***

Depending on the installation of Jupyter Notebook and your OS (Operating System), you may need to *install* the `pandas` package. To do so, run the command that matches your system:
```bash
# for Windows with only Python installed
pip install pandas

# for Mac or Linux with Python3 installed
pip3 install pandas
```

As with `collections` in COMP10001, we import `pandas` with an `import` statement. The only difference is that we give it the alias `pd` to shorten the code we need to write (`pd.DataFrame()` vs `pandas.DataFrame()`).

In [1]:
import pandas as pd

## <u>Concept: Series</u>
A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels called the *index*. It's best to think of a Series as a single column in Excel, or a vertical `list`-like object. Here's a visual example of what it looks like:

![Series](images/series.jpg)

### Creating a Pandas Series

In [2]:
# define a list of values
sales_list = [107512, 103208, 99388, 103838, 104631]

# create a Pandas series
sales_series = pd.Series(sales_list)

In [3]:
# notice how we can just "display" the variable without printing it
sales_series

0    107512
1    103208
2     99388
3    103838
4    104631
dtype: int64

### <u>Concept: Attributes and Methods of a Python object</u>
The Pandas `Series` also comes with useful attributes and methods. To be specific:  
1. Attributes are variables stored on the object (accessed without parentheses).
2. Methods are functions that are pre-defined for that object (called with parentheses).

Examples of `series` attributes:
- `series.index` attribute (returns the index field like `dict.keys()`)
- `series.values` attribute (returns the values like `dict.values()`)

Examples of `series` methods:
- `series.mean()` method (computes the average)
- `series.sum()` method (computes the grand total sum)

To get all the attributes and methods available, you can call `help(pd.Series)`. 


There are a lot of functions, methods and attributes in the `pandas` library, so we won't be covering all of them in this subject. We encourage students to look up the [API Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) if you wish to use something outside the scope of this subject.
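
One rough way to tell attributes and methods apart in code is to check whether the name is callable. The sketch below uses a small made-up Series (`demo` and its values are illustrative only, not part of the workshop dataset):

```python
import pandas as pd

# Illustrative values only (not from the workshop dataset)
demo = pd.Series([10, 20, 30])

# Attributes are accessed without parentheses
print(demo.index)    # the index labels
print(demo.values)   # the underlying array of values

# Methods are called with parentheses
print(demo.mean())   # average of the values
print(demo.sum())    # total of the values

# callable() is True for a method and False for a plain attribute
is_mean_callable = callable(demo.mean)
is_values_callable = callable(demo.values)
print(is_mean_callable, is_values_callable)
```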



In [4]:
# The default indexing starts from zero
print(sales_series.index)

# Retrieve the values of the series
print(sales_series.values)

RangeIndex(start=0, stop=5, step=1)
[107512 103208  99388 103838 104631]


In [5]:
# Create your own index using lists
# Indexes don't have to be integers!
new_index = ['The Kissing Booth', 
            'Between Worlds', 
            'Sicario: Day of the Soldado', 
            'Spider-Man: Into the Spider-Verse', 
            'Ant-Man and the Wasp']

sales_series.index  = new_index

# Verify the index has been changed
print(sales_series)

The Kissing Booth                    107512
Between Worlds                       103208
Sicario: Day of the Soldado           99388
Spider-Man: Into the Spider-Verse    103838
Ant-Man and the Wasp                 104631
dtype: int64


Like dictionaries, we can access values using `[]`

In [6]:
# Access the sales values based on index
print(sales_series['Ant-Man and the Wasp'])

104631


In [7]:
# Create a series from a python dict
sales_dict = {'Dragon Ball Super: Origin of the Saiyans': 105982,
              'Animal World': 108293,
              'Avengers: Infinity War': 112178,
              'A Quiet Place': 103813,
              'Bumblebee': 106562}

sales_series_dict = pd.Series(sales_dict)
print(sales_series_dict)

Dragon Ball Super: Origin of the Saiyans    105982
Animal World                                108293
Avengers: Infinity War                      112178
A Quiet Place                               103813
Bumblebee                                   106562
dtype: int64


If we want to concatenate two series together, we can use `pd.concat([LIST OF SERIES], axis='rows')`.

In [8]:
# Vertically concatenate two series
sales_series = pd.concat([sales_series, sales_series_dict], axis='rows')
print(sales_series)

The Kissing Booth                           107512
Between Worlds                              103208
Sicario: Day of the Soldado                  99388
Spider-Man: Into the Spider-Verse           103838
Ant-Man and the Wasp                        104631
Dragon Ball Super: Origin of the Saiyans    105982
Animal World                                108293
Avengers: Infinity War                      112178
A Quiet Place                               103813
Bumblebee                                   106562
dtype: int64


<blockquote style="padding: 10px; background-color: #ebf5fb;">

## Class Discussion Question  
Is `pd.concat` a method or attribute?

Here are some operations that we can do (such as filtering the data).
- Here, we use `.loc[]` to *locate* values by their index labels, or by a boolean condition that they fulfil.
- `.iloc[]` can also be used, which *locates* values by their integer *positions* in the slice provided.
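
A detail that is easy to miss is that label slices with `.loc[]` include the end label, while position slices with `.iloc[]` exclude the end position (like ordinary Python slicing). A minimal sketch on a made-up Series (`demo` and its labels are illustrative only):

```python
import pandas as pd

# Illustrative values only (not from the workshop dataset)
demo = pd.Series([5, 6, 7, 8], index=['a', 'b', 'c', 'd'])

# .loc slices by label and INCLUDES the end label
by_label = demo.loc['a':'c']      # rows 'a', 'b' and 'c'

# .iloc slices by integer position and EXCLUDES the end position
by_position = demo.iloc[0:2]      # rows at positions 0 and 1

# .loc also accepts a boolean mask with the same index as the series
by_mask = demo.loc[demo > 6]      # rows 'c' and 'd'

print(by_label)
print(by_position)
print(by_mask)
```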

In [9]:
# Slicing the series using a boolean array operation 
sales_series.loc[sales_series < 100000]

Sicario: Day of the Soldado    99388
dtype: int64

In [10]:
# Slicing the series using index range
sales_series.loc['Ant-Man and the Wasp':'A Quiet Place']

Ant-Man and the Wasp                        104631
Dragon Ball Super: Origin of the Saiyans    105982
Animal World                                108293
Avengers: Infinity War                      112178
A Quiet Place                               103813
dtype: int64

In [11]:
# Slicing the series using iloc
sales_series.iloc[0:5]

The Kissing Booth                    107512
Between Worlds                       103208
Sicario: Day of the Soldado           99388
Spider-Man: Into the Spider-Verse    103838
Ant-Man and the Wasp                 104631
dtype: int64

In [12]:
# Doubling the values of the series object
doubled = sales_series * 2
doubled

The Kissing Booth                           215024
Between Worlds                              206416
Sicario: Day of the Soldado                 198776
Spider-Man: Into the Spider-Verse           207676
Ant-Man and the Wasp                        209262
Dragon Ball Super: Origin of the Saiyans    211964
Animal World                                216586
Avengers: Infinity War                      224356
A Quiet Place                               207626
Bumblebee                                   213124
dtype: int64

In [13]:
# Finding the average value of the series
sales_series.mean()

105540.5

In [14]:
# Defining the column name
sales_series.name = 'Total tickets sold'
# Defining the name of the index
sales_series.index.name = 'Movie Name'

print(sales_series)

Movie Name
The Kissing Booth                           107512
Between Worlds                              103208
Sicario: Day of the Soldado                  99388
Spider-Man: Into the Spider-Verse           103838
Ant-Man and the Wasp                        104631
Dragon Ball Super: Origin of the Saiyans    105982
Animal World                                108293
Avengers: Infinity War                      112178
A Quiet Place                               103813
Bumblebee                                   106562
Name: Total tickets sold, dtype: int64


<blockquote style="padding: 10px; background-color: #FFD392;">

## Exercise
1. Find all movies that sold under 100,000 movie tickets.  
_Hint: You may want to use `.loc[]` for this._  


2. Find the grand total number of movie tickets sold.

In [15]:
### ANSWER 1 HERE
sales_series.loc[sales_series < 100000]

Movie Name
Sicario: Day of the Soldado    99388
Name: Total tickets sold, dtype: int64

In [16]:
### ANSWER 2 HERE
sales_series.sum()

1055405

## <u>Concept: Dataframe</u>
A DataFrame is a two-dimensional table that has both a row index (like a `Series`) and column labels, and it contains many useful methods to aid your analysis. 
- [API reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) details all of the functionality provided by `pandas`.  
- You will particularly need to consult the [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) reference page.


Here's a visualisation of what a DataFrame looks like:
![Dataframe](images/dataframe.jpg)

### Working with Pandas Dataframe

In [17]:
tickets_sold_dict =  {'The Kissing Booth': 107512,
                        'Between Worlds': 103208,
                        'Sicario: Day of the Soldado': 99388,
                        'Spider-Man: Into the Spider-Verse': 103838,
                        'Ant-Man and the Wasp': 104631, 
                        'Dragon Ball Super: Origin of the Saiyans': 105982,
                        'Animal World': 108293,
                        'Avengers: Infinity War': 112178,
                        'A Quiet Place': 103813,
                        'Bumblebee': 106562}

tickets_sold = pd.Series(tickets_sold_dict)

In [18]:
max_capacity_dict = {'A Quiet Place': 427725,
                      'Animal World': 427300,
                      'Ant-Man and the Wasp': 429350,
                      'Avengers: Infinity War': 424325,
                      'Between Worlds': 423375,
                      'Bumblebee': 427950,
                      'Dragon Ball Super: Origin of the Saiyans': 423225,
                      'Sicario: Day of the Soldado': 427950,
                      'Spider-Man: Into the Spider-Verse': 428375,
                      'The Kissing Booth': 418750}

max_capacity = pd.Series(max_capacity_dict)

In [19]:
# create a DataFrame object from the series objects
sales_df = pd.DataFrame({'tickets_sold': tickets_sold, 
                         'max_capacity': max_capacity})
sales_df

|  | tickets_sold | max_capacity |
| --- | --- | --- |
| A Quiet Place | 103813 | 427725 |
| Animal World | 108293 | 427300 |
| Ant-Man and the Wasp | 104631 | 429350 |
| Avengers: Infinity War | 112178 | 424325 |
| Between Worlds | 103208 | 423375 |
| Bumblebee | 106562 | 427950 |
| Dragon Ball Super: Origin of the Saiyans | 105982 | 423225 |
| Sicario: Day of the Soldado | 99388 | 427950 |
| Spider-Man: Into the Spider-Verse | 103838 | 428375 |
| The Kissing Booth | 107512 | 418750 |


As you can see, a DataFrame is essentially made up of several `Series` (i.e. columns or features).
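
To make the relationship concrete, here is a minimal sketch using made-up values (`sold`, `capacity` and the movie labels are illustrative only). It shows that selecting a column gives back a Series, and that the columns are lined up by index label rather than by the order in which values were written:

```python
import pandas as pd

# Illustrative values only (not from the workshop dataset)
sold = pd.Series({'Movie A': 100, 'Movie B': 250})
capacity = pd.Series({'Movie B': 500, 'Movie C': 400})

demo_df = pd.DataFrame({'sold': sold, 'capacity': capacity})

# Selecting one column gives back a Series
column = demo_df['sold']
print(type(column))

# Rows are matched by index label; a label missing from a series becomes NaN
print(demo_df)
missing_sold = int(demo_df['sold'].isna().sum())   # 'Movie C' has no 'sold' value
print(missing_sold)
```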

In [20]:
# access a specific column (like dict[key])
sales_df['tickets_sold']

A Quiet Place                               103813
Animal World                                108293
Ant-Man and the Wasp                        104631
Avengers: Infinity War                      112178
Between Worlds                              103208
Bumblebee                                   106562
Dragon Ball Super: Origin of the Saiyans    105982
Sicario: Day of the Soldado                  99388
Spider-Man: Into the Spider-Verse           103838
The Kissing Booth                           107512
Name: tickets_sold, dtype: int64

In [21]:
# find movies which did not sell 100k tickets
sales_df.loc[sales_df['tickets_sold'] < 100000]

|  | tickets_sold | max_capacity |
| --- | --- | --- |
| Sicario: Day of the Soldado | 99388 | 427950 |


<blockquote style="padding: 10px; background-color: #ebf5fb;">

## Class Discussion Question  
What do you notice about the order of the movies?

## Reading and saving CSVs using Pandas
Previously in COMP10001, you would have used the `csv` library. Putting those days behind us, we can now introduce a simpler way of reading in files: `pd.read_csv()`.


Here, we will use the `df.head()` method which displays the first 5 rows by default. Correspondingly, the `df.tail()` method displays the last 5 rows by default.

In [22]:
# create a DataFrame from a csv file
total_sales = pd.read_csv('booking_summary.csv')
total_sales.head(10)

|  | movie_name | classification | tickets_sold | max_capacity |
| --- | --- | --- | --- | --- |
| 0 | A Quiet Place | M | 103813 | 427725 |
| 1 | Alpha | PG | 103596 | 422525 |
| 2 | An Interview with God | PG | 104182 | 426575 |
| 3 | Animal World | G | 108293 | 427300 |
| 4 | Ant-Man and the Wasp | PG | 104631 | 429350 |
| 5 | Aquaman | M | 102655 | 423100 |
| 6 | Avengers: Infinity War | M | 112178 | 424325 |
| 7 | A-X-L | PG | 99339 | 423200 |
| 8 | Between Worlds | MA15+ | 103208 | 423375 |
| 9 | Black Panther | M | 108831 | 423375 |


In [23]:
# save a DataFrame as a csv file
last_ten = total_sales.tail(10)
last_ten.to_csv('last_ten.csv')

# Read it back in. 
pd.read_csv('last_ten.csv')

|  | Unnamed: 0 | movie_name | classification | tickets_sold | max_capacity |
| --- | --- | --- | --- | --- | --- |
| 0 | 20 | Ralph Breaks the Internet | PG | 103909 | 425500 |
| 1 | 21 | Rampage | M | 102746 | 420575 |
| 2 | 22 | Siberia | G | 107617 | 423925 |
| 3 | 23 | Sicario: Day of the Soldado | MA15+ | 99388 | 427950 |
| 4 | 24 | Spider-Man: Into the Spider-Verse | PG | 103838 | 428375 |
| 5 | 25 | The Darkest Minds | M | 101663 | 432075 |
| 6 | 26 | The Kissing Booth | M | 107512 | 418750 |
| 7 | 27 | The Meg | M | 108652 | 422375 |
| 8 | 28 | The Predator | MA15+ | 104824 | 424350 |
| 9 | 29 | Venom | M | 110053 | 424200 |


Notice the `Unnamed: 0` column? That's the index column that was written out when we exported the CSV. To stop Pandas from saving the index column, pass `index=False` to the `.to_csv()` method.

In [24]:
# save a DataFrame as a csv file
last_ten = total_sales.tail(10)
last_ten.to_csv('last_ten.csv', index=False)

# Read it back in. 
pd.read_csv('last_ten.csv')

|  | movie_name | classification | tickets_sold | max_capacity |
| --- | --- | --- | --- | --- |
| 0 | Ralph Breaks the Internet | PG | 103909 | 425500 |
| 1 | Rampage | M | 102746 | 420575 |
| 2 | Siberia | G | 107617 | 423925 |
| 3 | Sicario: Day of the Soldado | MA15+ | 99388 | 427950 |
| 4 | Spider-Man: Into the Spider-Verse | PG | 103838 | 428375 |
| 5 | The Darkest Minds | M | 101663 | 432075 |
| 6 | The Kissing Booth | M | 107512 | 418750 |
| 7 | The Meg | M | 108652 | 422375 |
| 8 | The Predator | MA15+ | 104824 | 424350 |
| 9 | Venom | M | 110053 | 424200 |
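
The effect of `index=False` can also be checked without writing a file to disk. The sketch below is only an illustration: `demo_df` holds made-up values, and `io.StringIO` stands in for a real CSV file.

```python
import io
import pandas as pd

# Illustrative values only (not from the workshop dataset)
demo_df = pd.DataFrame({'movie_name': ['Movie A', 'Movie B'],
                        'tickets_sold': [100, 250]})

# Default: the index is written as an extra, unnamed first column
with_index = demo_df.to_csv()
# index=False: only the real columns are written
without_index = demo_df.to_csv(index=False)

# Read both back from memory (io.StringIO stands in for a file on disk)
cols_with_index = list(pd.read_csv(io.StringIO(with_index)).columns)
cols_without_index = list(pd.read_csv(io.StringIO(without_index)).columns)

print(cols_with_index)
print(cols_without_index)
```

If a CSV has already been saved with its index, passing `index_col=0` to `pd.read_csv()` is another option: it reads the first column back in as the index instead of as a data column.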


<blockquote style="padding: 10px; background-color: #FFD392;">

## Exercise
Count the number of movies in each classification. 

Hint: Try to search up _"pandas count values in dataframe"_

In [25]:
### ANSWER HERE
total_sales['classification'].value_counts()

M        15
PG        8
MA15+     5
G         2
Name: classification, dtype: int64

We will now set the `movie_name` as our index.

**Follow-up Question: Why would we want to set the name of the movie as the index?**

In [26]:
# set the name of movie as the index
total_sales = total_sales.set_index('movie_name')
total_sales

| movie_name | classification | tickets_sold | max_capacity |
| --- | --- | --- | --- |
| A Quiet Place | M | 103813 | 427725 |
| Alpha | PG | 103596 | 422525 |
| An Interview with God | PG | 104182 | 426575 |
| Animal World | G | 108293 | 427300 |
| Ant-Man and the Wasp | PG | 104631 | 429350 |
| Aquaman | M | 102655 | 423100 |
| Avengers: Infinity War | M | 112178 | 424325 |
| A-X-L | PG | 99339 | 423200 |
| Between Worlds | MA15+ | 103208 | 423375 |
| Black Panther | M | 108831 | 423375 |


<blockquote style="padding: 10px; background-color: #FFD392;">

## Exercise
1. Calculate the occupancy rate for each movie. The occupancy rate is the number of tickets sold divided by the max capacity. Output this to an `'occupancy_rate'` column. Round the result to two decimal places using the `round()` function.

2. Return only the `classification` and `'occupancy_rate'` of `'Ralph Breaks the Internet'` 

3. Get the data row (known as an *instance*) of the movie with the highest number of tickets sold. You may want to sort your values first using `df.sort_values(by=COLUMN)`.

4. Find the subset of movies that have a `PG` classification.

In [27]:
### ANSWER 1 HERE
total_sales['occupancy_rate'] = round(total_sales['tickets_sold'] / total_sales['max_capacity'], 2)
total_sales

| movie_name | classification | tickets_sold | max_capacity | occupancy_rate |
| --- | --- | --- | --- | --- |
| A Quiet Place | M | 103813 | 427725 | 0.24 |
| Alpha | PG | 103596 | 422525 | 0.25 |
| An Interview with God | PG | 104182 | 426575 | 0.24 |
| Animal World | G | 108293 | 427300 | 0.25 |
| Ant-Man and the Wasp | PG | 104631 | 429350 | 0.24 |
| Aquaman | M | 102655 | 423100 | 0.24 |
| Avengers: Infinity War | M | 112178 | 424325 | 0.26 |
| A-X-L | PG | 99339 | 423200 | 0.23 |
| Between Worlds | MA15+ | 103208 | 423375 | 0.24 |
| Black Panther | M | 108831 | 423375 | 0.26 |


In [28]:
### ANSWER 2 HERE
total_sales.loc['Ralph Breaks the Internet', ['classification', 'occupancy_rate']]

classification      PG
occupancy_rate    0.24
Name: Ralph Breaks the Internet, dtype: object

In [29]:
### ANSWER 3 HERE
# sort by tickets sold (not occupancy rate), largest first, and keep the top row
total_sales.sort_values('tickets_sold', ascending=False).head(1)

| movie_name | classification | tickets_sold | max_capacity | occupancy_rate |
| --- | --- | --- | --- | --- |
| Avengers: Infinity War | M | 112178 | 424325 | 0.26 |


In [30]:
### ANSWER 4 HERE
total_sales[total_sales['classification'] == 'PG']

| movie_name | classification | tickets_sold | max_capacity | occupancy_rate |
| --- | --- | --- | --- | --- |
| Alpha | PG | 103596 | 422525 | 0.25 |
| An Interview with God | PG | 104182 | 426575 | 0.24 |
| Ant-Man and the Wasp | PG | 104631 | 429350 | 0.24 |
| A-X-L | PG | 99339 | 423200 | 0.23 |
| Hotel Transylvania 3: Summer Vacation | PG | 103477 | 430400 | 0.24 |
| Peter Rabbit | PG | 111164 | 429075 | 0.26 |
| Ralph Breaks the Internet | PG | 103909 | 425500 | 0.24 |
| Spider-Man: Into the Spider-Verse | PG | 103838 | 428375 | 0.24 |


### Advanced: Sort the data over multiple columns
To sort values over multiple columns, you can pass through a `list` of columns (in order) to the `by=` argument.

Here's an example of sorting by:
1. Classification ascending
2. Occupancy rate descending
3. Tickets sold descending

In [31]:
total_sales.sort_values(['classification', 'occupancy_rate', 'tickets_sold'],
                       ascending=[True, False, False]).drop(['max_capacity'], axis='columns')

| movie_name | classification | tickets_sold | occupancy_rate |
| --- | --- | --- | --- |
| Animal World | G | 108293 | 0.25 |
| Siberia | G | 107617 | 0.25 |
| Avengers: Infinity War | M | 112178 | 0.26 |
| Venom | M | 110053 | 0.26 |
| Black Panther | M | 108831 | 0.26 |
| The Meg | M | 108652 | 0.26 |
| The Kissing Booth | M | 107512 | 0.26 |
| Bumblebee | M | 106562 | 0.25 |
| Dragon Ball Super: Origin of the Saiyans | M | 105982 | 0.25 |
| Maze Runner: The Death Cure | M | 104793 | 0.25 |


## <u>Concept: Group by</u>
The `groupby` method lets you separate the data into different groups based on shared characteristics (akin to `itertools.groupby`). For example, we could group countries by region or income range, then analyse those groups individually.  

The official documentation on groupby can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html). [This tutorial](https://www.marsja.se/python-pandas-groupby-tutorial-examples/) is also well worth reading.
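
As a rough mental model, `groupby` splits the rows into groups, applies a calculation to each group, and combines the results. The sketch below does this by hand and then with `groupby` on a made-up DataFrame (`demo_df`, `rating` and `tickets` are illustrative names and values, not the workshop dataset):

```python
import pandas as pd

# Illustrative values only (not from the workshop dataset)
demo_df = pd.DataFrame({'rating': ['G', 'PG', 'G', 'PG', 'M'],
                        'tickets': [10, 20, 30, 40, 50]})

# Split-apply-combine done by hand: filter each group, then sum it
manual_totals = {}
for rating in demo_df['rating'].unique():
    group = demo_df.loc[demo_df['rating'] == rating]
    manual_totals[rating] = int(group['tickets'].sum())

# The same totals in one line with groupby
grouped_totals = demo_df.groupby('rating')['tickets'].sum()

print(manual_totals)
print(grouped_totals)
```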


Here's an example of finding the total number of tickets sold for each classification.

In [32]:
total_sales.groupby('classification')['tickets_sold'].sum()

classification
G         215910
M        1578844
MA15+     521932
PG        834136
Name: tickets_sold, dtype: int64

<blockquote style="padding: 10px; background-color: #FFD392;">

## Exercise
1. How many movies are there in each classification category? Think of key search terms such as "size" or "count".

2. Then, for each category, what is the `mean` number of tickets sold?

In [33]:
### ANSWER 1 HERE
total_sales.groupby('classification').size()

classification
G         2
M        15
MA15+     5
PG        8
dtype: int64

In [34]:
### ANSWER 2 HERE
total_sales.groupby('classification').mean()

| classification | tickets_sold | max_capacity | occupancy_rate |
| --- | --- | --- | --- |
| G | 107955.0 | 425612.5 | 0.25 |
| M | 105256.266667 | 424893.333333 | 0.248 |
| MA15+ | 104386.4 | 424120.0 | 0.246 |
| PG | 104267.0 | 426875.0 | 0.2425 |


Finally, here is an advanced use case of `groupby`. Here, we are getting:
- the max capacity possible for the classification;
- the average number of tickets sold for the classification;
- and the average occupancy rate.

To do this, pass a dictionary to the `.agg()` method, where the:
- `key` corresponds to the column
- `value` corresponds to the type of aggregation

View more here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html

In [35]:
total_sales.groupby('classification').agg({'max_capacity': 'max', 'tickets_sold': 'mean', 'occupancy_rate': 'mean'})

| classification | max_capacity | tickets_sold | occupancy_rate |
| --- | --- | --- | --- |
| G | 427300 | 107955.0 | 0.25 |
| M | 432225 | 105256.266667 | 0.248 |
| MA15+ | 427950 | 104386.4 | 0.246 |
| PG | 430400 | 104267.0 | 0.2425 |
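
To see what the dictionary passed to `.agg()` does on a smaller scale, the sketch below compares it with computing each column separately. The DataFrame `demo_df` and its values are made up for illustration:

```python
import pandas as pd

# Illustrative values only (not from the workshop dataset)
demo_df = pd.DataFrame({'rating': ['G', 'G', 'PG', 'PG'],
                        'tickets': [10, 30, 20, 40],
                        'capacity': [100, 120, 90, 110]})

# One aggregation per column: key = column name, value = aggregation name
summary = demo_df.groupby('rating').agg({'capacity': 'max', 'tickets': 'mean'})

# The same numbers computed one column at a time
max_capacity = demo_df.groupby('rating')['capacity'].max()
mean_tickets = demo_df.groupby('rating')['tickets'].mean()

print(summary)
```

Each column of `summary` matches the corresponding single-column result, so the dictionary form is mainly a convenience for collecting several aggregations into one table.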


# Challenge questions

We don't give the answers to these questions, but we encourage students to discuss them among themselves using the Ed forum. Some questions require the use of functions or methods not covered in this tutorial, and some questions are open-ended (there is no fixed answer; it depends on your argument). We have provided these to give students a chance to get used to searching the documentation.

1. Suppose that the average purchase price per ticket is `$22.00`, what's the average dollar sales for `MA15+` movies? Compare this to the median dollar sales for `M`-rated movies.

2. How many movies have a title that begins with `"T"`? 

3. Which movies tend to have better occupancy rate: low sales with low capacity, or high sales with high capacity?

4. How many movies have a below-average occupancy rate in each classification category?