### The University of Melbourne, School of Computing and Information Systems
# COMP90049 Introduction to Machine Learning, 2022 Semester 2

## Week 2 - Introduction to Pandas

## Cells
Jupyter notebooks contain two main types of cells:
- Markdown cells: These can be used to contain text, equations and other non-code items.  The cell that you're reading right now is a markdown cell.  You can use [Markdown](https://www.markdownguide.org/) to format your text.  If you prefer, you can also format your text using <b>HTML</b>.  Clicking the <button class='btn btn-default btn-xs'><i class="fa-play fa"></i><span class="toolbar-btn-label">Run</span></button> button will format and display your text.
- Code cells: These contain code segments that can be executed individually.  When executed, the output of the code will be displayed below the code cell.

## Keyboard Shortcuts
Cell Running shortcuts:
- _You can tell you are selecting a cell when the outline is colored green_
- `shift + enter` : Run current cell - keyboard shortcut for the <button class='btn btn-default btn-xs'><i class="fa-play fa"></i><span class="toolbar-btn-label">Run</span></button> button
- `ctrl + enter` : Run selected cells

Command mode (press `esc` to enter):
- _You can tell you are in Command Mode when the outline is colored blue_
- Enter command mode by pressing `esc` (blue highlight)
- `a` to create a cell **above**
- `b` to create a cell **below**
- `dd` (double d) to **delete** a cell
- `m` to make the cell render in **markdown**
- `r` to make the cell render in **raw** text
- `y` to make the cell render python code
- `enter` to "edit" the cell

Code Shortcuts:
- `shift + tab` : brings function/method arguments up

# Pandas
Depending on the use case, data come in various shapes and structures. One of the most common forms is *tabular data*, or data tables (think Excel spreadsheets or SQL tables). It's both human-readable and machine-readable, and it's easy to *vectorize* any transformation to our data. Here's a visualisation of what a DataFrame looks like:
![Dataframe](images/dataframe.jpg)



To work with tabular data in Python, we use the library `pandas`. We **strongly recommend** you bookmark the [API reference (Documentation)](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) which will serve as a bible for this subject.

## Installing Packages

Depending on the installation of Jupyter Notebook and your OS (Operating System), you may need to *install* the `pandas` package. To do so, run the command that matches your system:
```bash
# for Windows with only Python installed
pip install pandas

# for Mac or Linux with Python3 installed
pip3 install pandas
```

Then we import `pandas`. Note that we give it the alias `pd` to shorten the amount of code we need to write (`pd.DataFrame()` vs `pandas.DataFrame()`).

In [70]:
import pandas as pd

## <u>Concept: Series</u>
A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index. It's best to think of a Series as a single column in Excel, or a vertical `list`-like object. Here's a visual example of what it looks like:

![Series](images/series.jpg)

### Creating a Pandas Series

In [71]:
# define a list of values
sales_list = [107512, 103208, 99388, 103838, 104631]

# create a Pandas series
sales_series = pd.Series(sales_list)

In [72]:
# notice how we can just "display" the variable without printing it
sales_series

0    107512
1    103208
2     99388
3    103838
4    104631
dtype: int64

### <u>Concept: Attributes and Methods of a Python object</u>
The Pandas `series` also comes with useful attributes and methods. To be specific:  
1. Attributes are values stored on the object, accessed without parentheses.
2. Methods are functions that are pre-defined for that object, called with parentheses.

Examples of `series` attributes:
- `series.index` attribute (returns the index field like `dict.keys()`)
- `series.values` attribute (returns the values like `dict.values()`)

Examples of `series` methods:
- `series.mean()` method (computes the average)
- `series.sum()` method (computes the total)

To get all the attributes and methods available, you can call `help(pd.Series)`. 


There are a lot of functions, methods and attributes in the `pandas` library, so we won't be covering all of them in this subject. If you wish to use something outside the scope of this subject, we encourage you to look up the [API Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html).
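To make the attribute/method distinction concrete, here is a minimal sketch. The series `toy` and its values are made up for illustration and are not part of the movie data used in this notebook.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Attributes: no parentheses, they simply expose stored information
toy_index = list(toy.index)    # the labels: ['a', 'b', 'c']
toy_values = list(toy.values)  # the data: [10, 20, 30]
toy_shape = toy.shape          # (3,)

# Methods: parentheses are required, they compute something when called
toy_sum = toy.sum()    # 60
toy_mean = toy.mean()  # 20.0
toy_max = toy.max()    # 30

print(toy_index, toy_values, toy_shape)
print(toy_sum, toy_mean, toy_max)
```

Forgetting the parentheses on a method (e.g. `toy.sum`) does not raise an error; it just gives you the method object back instead of the result.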



In [73]:
# The default indexing starts from zero
print(sales_series.index)

# Retrieve the values of the series
print(sales_series.values)

RangeIndex(start=0, stop=5, step=1)
[107512 103208  99388 103838 104631]


In [74]:
# Create your own index using lists
# Indexes don't have to be integers!
new_index = ['The Kissing Booth', 
            'Between Worlds', 
            'Sicario: Day of the Soldado', 
            'Spider-Man: Into the Spider-Verse', 
            'Ant-Man and the Wasp']

sales_series.index = new_index

# Verify the index has been changed
print(sales_series)

The Kissing Booth                    107512
Between Worlds                       103208
Sicario: Day of the Soldado           99388
Spider-Man: Into the Spider-Verse    103838
Ant-Man and the Wasp                 104631
dtype: int64


As with dictionaries, we can access values using `[]`.

In [75]:
# Access the sales values based on index
print(sales_series['Ant-Man and the Wasp'])

104631
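The dictionary analogy extends a little further. The sketch below, using a made-up series `toy_sales`, shows membership tests with `in` and safe lookups with `.get()`, both of which operate on the index labels.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy_sales = pd.Series({'Movie A': 120, 'Movie B': 95})

# `in` checks the index labels, not the values (just like dict keys)
has_a = 'Movie A' in toy_sales   # True
has_c = 'Movie C' in toy_sales   # False

# [] raises a KeyError for a missing label; .get() returns a default instead
value_b = toy_sales['Movie B']           # 95
missing = toy_sales.get('Movie C', 0)    # 0

print(has_a, has_c, value_b, missing)
```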


In [76]:
# Create a series from a python dict
sales_dict = {'Dragon Ball Super: Origin of the Saiyans': 105982,
              'Animal World': 108293,
              'Avengers: Infinity War': 112178,
              'A Quiet Place': 103813,
              'Bumblebee': 106562}

sales_series_dict = pd.Series(sales_dict)
print(sales_series_dict)

Dragon Ball Super: Origin of the Saiyans    105982
Animal World                                108293
Avengers: Infinity War                      112178
A Quiet Place                               103813
Bumblebee                                   106562
dtype: int64


If we want to concatenate two series together, we can use `pd.concat([LIST OF SERIES], axis='rows')`.

In [77]:
# Vertically concatenate two series
sales_series = pd.concat([sales_series, sales_series_dict], axis='rows')
print(sales_series)

The Kissing Booth                           107512
Between Worlds                              103208
Sicario: Day of the Soldado                  99388
Spider-Man: Into the Spider-Verse           103838
Ant-Man and the Wasp                        104631
Dragon Ball Super: Origin of the Saiyans    105982
Animal World                                108293
Avengers: Infinity War                      112178
A Quiet Place                               103813
Bumblebee                                   106562
dtype: int64
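One thing to keep in mind is that `pd.concat` keeps the original index labels, so duplicates are possible. A minimal sketch with made-up series `first` and `second`:

```python
import pandas as pd

# Hypothetical toy data: note that the label 'b' appears in both series
first = pd.Series([1, 2], index=['a', 'b'])
second = pd.Series([3, 4], index=['b', 'c'])

# Default behaviour: stack vertically and keep the labels as they are
combined = pd.concat([first, second])
print(list(combined.index))   # ['a', 'b', 'b', 'c']

# A duplicated label returns every matching row, not a single value
both_b = combined['b']
print(list(both_b))           # [2, 3]

# ignore_index=True discards the labels and renumbers from zero
renumbered = pd.concat([first, second], ignore_index=True)
print(list(renumbered.index)) # [0, 1, 2, 3]
```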


Here are some operations that we can do (such as filtering the data).
- Here, we use `.loc` to *locate* values by index label, or by a boolean condition that the values must fulfil.
- `.iloc` can also be used, which *locates* values by integer position (the *indices* that correspond to the slice provided).
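A subtle difference between the two is how slices are handled: label slices with `.loc` include the end label, while position slices with `.iloc` exclude the end position, like ordinary Python lists. The series `toy` below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# .loc slices by label and INCLUDES the end label 'c'
by_label = toy.loc['a':'c']
print(list(by_label))      # [10, 20, 30]

# .iloc slices by position and EXCLUDES the end position 2
by_position = toy.iloc[0:2]
print(list(by_position))   # [10, 20]

# Negative positions work as they do for lists
last = toy.iloc[-1]
print(last)                # 40
```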

In [78]:
# Selecting a value from the series by its index label
sales_series.loc['Ant-Man and the Wasp']

104631

In [79]:
# This is an alternative syntax
sales_series['Ant-Man and the Wasp']

104631

In [80]:
# Slicing the series using a boolean array operation 
sales_series.loc[sales_series < 100000]

Sicario: Day of the Soldado    99388
dtype: int64
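It can help to look at the boolean condition on its own. A comparison on a series produces another series of `True`/`False` values (often called a mask), and `.loc` keeps only the rows where the mask is `True`. The sketch below uses a made-up series `toy`.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.Series([5, 15, 25], index=['x', 'y', 'z'])

# The comparison is applied element-wise and returns a boolean series
mask = toy > 10
print(list(mask))             # [False, True, True]

# .loc keeps the rows where the mask is True
selected = toy.loc[mask]
print(list(selected.index))   # ['y', 'z']

# ~ inverts the mask, keeping the rows that failed the condition
rejected = toy.loc[~mask]
print(list(rejected.index))   # ['x']
```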

In [81]:
# Slicing the series using iloc
sales_series.iloc[0:5]

The Kissing Booth                    107512
Between Worlds                       103208
Sicario: Day of the Soldado           99388
Spider-Man: Into the Spider-Verse    103838
Ant-Man and the Wasp                 104631
dtype: int64

In [82]:
# Doubling the values of the series object
doubled = sales_series * 2
doubled

The Kissing Booth                           215024
Between Worlds                              206416
Sicario: Day of the Soldado                 198776
Spider-Man: Into the Spider-Verse           207676
Ant-Man and the Wasp                        209262
Dragon Ball Super: Origin of the Saiyans    211964
Animal World                                216586
Avengers: Infinity War                      224356
A Quiet Place                               207626
Bumblebee                                   213124
dtype: int64
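Arithmetic also works between two series. In that case pandas matches values by index label rather than by position, and a label missing from either side yields `NaN`. A minimal sketch with made-up series `sold`, `capacity` and `partial`:

```python
import pandas as pd

# Hypothetical toy data; the two series list the movies in different orders
sold = pd.Series({'Movie A': 50, 'Movie B': 80})
capacity = pd.Series({'Movie B': 100, 'Movie A': 200})

# Values are paired up by label, not by position
remaining = capacity - sold
print(remaining['Movie A'])   # 150
print(remaining['Movie B'])   # 20

# A label present on only one side produces NaN for that label
partial = pd.Series({'Movie A': 5})
with_gap = sold + partial
print(with_gap['Movie A'])            # 55.0
print(pd.isna(with_gap['Movie B']))   # True
```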

In [83]:
# Finding the average value of the series
sales_series.mean()

105540.5

In [84]:
# Defining the column name
sales_series.name = 'Total tickets sold'
# Defining the name of the index
sales_series.index.name = 'Movie Name'

print(sales_series)

Movie Name
The Kissing Booth                           107512
Between Worlds                              103208
Sicario: Day of the Soldado                  99388
Spider-Man: Into the Spider-Verse           103838
Ant-Man and the Wasp                        104631
Dragon Ball Super: Origin of the Saiyans    105982
Animal World                                108293
Avengers: Infinity War                      112178
A Quiet Place                               103813
Bumblebee                                   106562
Name: Total tickets sold, dtype: int64




## Exercise
1. Find all movies that sold more than 100,000 but under 105,000 movie tickets.  
_Hint: You may want to use `.loc[]` for this._  


2. Find the grand total number of movie tickets sold.

In [85]:
### ANSWER 1 HERE


In [86]:
### ANSWER 2 HERE


## <u>Concept: Dataframe</u>
A DataFrame has both a row index and a column index (a `series` has only a row index) and contains many useful methods to aid your analysis. 
- [API reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) details all of the functionality provided by `pandas`.  
- In particular, you will need to consult the [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) reference page.


Here's a visualisation of what a DataFrame looks like:
![Dataframe](images/dataframe.jpg)

### Working with Pandas Dataframe

In [87]:
tickets_sold_dict =  {'The Kissing Booth': 107512,
                        'Between Worlds': 103208,
                        'Sicario: Day of the Soldado': 99388,
                        'Spider-Man: Into the Spider-Verse': 103838,
                        'Ant-Man and the Wasp': 104631, 
                        'Dragon Ball Super: Origin of the Saiyans': 105982,
                        'Animal World': 108293,
                        'Avengers: Infinity War': 112178,
                        'A Quiet Place': 103813,
                        'Bumblebee': 106562}

tickets_sold = pd.Series(tickets_sold_dict)

In [88]:
max_capacity_dict = {'A Quiet Place': 427725,
                      'Animal World': 427300,
                      'Ant-Man and the Wasp': 429350,
                      'Avengers: Infinity War': 424325,
                      'Between Worlds': 423375,
                      'Bumblebee': 427950,
                      'Dragon Ball Super: Origin of the Saiyans': 423225,
                      'Sicario: Day of the Soldado': 427950,
                      'Spider-Man: Into the Spider-Verse': 428375,
                      'The Kissing Booth': 418750}

max_capacity = pd.Series(max_capacity_dict)

In [89]:
# create a DataFrame object from the series objects
sales_df = pd.DataFrame({'tickets_sold': tickets_sold, 
                         'max_capacity': max_capacity})
sales_df

| | tickets_sold | max_capacity |
|---|---|---|
| A Quiet Place | 103813 | 427725 |
| Animal World | 108293 | 427300 |
| Ant-Man and the Wasp | 104631 | 429350 |
| Avengers: Infinity War | 112178 | 424325 |
| Between Worlds | 103208 | 423375 |
| Bumblebee | 106562 | 427950 |
| Dragon Ball Super: Origin of the Saiyans | 105982 | 423225 |
| Sicario: Day of the Soldado | 99388 | 427950 |
| Spider-Man: Into the Spider-Verse | 103838 | 428375 |
| The Kissing Booth | 107512 | 418750 |


As you can see, a DataFrame is essentially made up of several `series` (i.e., columns or features).
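As a quick check of this idea, the sketch below builds a small made-up DataFrame `toy_df`, confirms that a single column is a Series, and derives a new column from two existing ones. The `occupancy` column name is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy_df = pd.DataFrame({'tickets_sold': [50, 80],
                       'max_capacity': [200, 100]},
                      index=['Movie A', 'Movie B'])

# Pulling out one column gives back a Series
column = toy_df['tickets_sold']
is_series = isinstance(column, pd.Series)
print(is_series)   # True

# Series arithmetic lets us build a new column from existing ones
toy_df['occupancy'] = toy_df['tickets_sold'] / toy_df['max_capacity']
print(toy_df)
```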

In [90]:
# access a specific column (like dict[key])
sales_df['tickets_sold']

A Quiet Place                               103813
Animal World                                108293
Ant-Man and the Wasp                        104631
Avengers: Infinity War                      112178
Between Worlds                              103208
Bumblebee                                   106562
Dragon Ball Super: Origin of the Saiyans    105982
Sicario: Day of the Soldado                  99388
Spider-Man: Into the Spider-Verse           103838
The Kissing Booth                           107512
Name: tickets_sold, dtype: int64

In [91]:
# access multiple columns (note the order)
sales_df[['max_capacity', 'tickets_sold']]

| | max_capacity | tickets_sold |
|---|---|---|
| A Quiet Place | 427725 | 103813 |
| Animal World | 427300 | 108293 |
| Ant-Man and the Wasp | 429350 | 104631 |
| Avengers: Infinity War | 424325 | 112178 |
| Between Worlds | 423375 | 103208 |
| Bumblebee | 427950 | 106562 |
| Dragon Ball Super: Origin of the Saiyans | 423225 | 105982 |
| Sicario: Day of the Soldado | 427950 | 99388 |
| Spider-Man: Into the Spider-Verse | 428375 | 103838 |
| The Kissing Booth | 418750 | 107512 |


In [92]:
# find movies which did not sell 100k tickets
sales_df.loc[sales_df['tickets_sold'] < 100000]

| | tickets_sold | max_capacity |
|---|---|---|
| Sicario: Day of the Soldado | 99388 | 427950 |
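With a DataFrame, `.loc` also accepts a second argument that selects columns, so rows and columns can be filtered in one step. A minimal sketch on a made-up DataFrame `toy_df`:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy_df = pd.DataFrame({'tickets_sold': [50, 80, 65],
                       'max_capacity': [200, 100, 130]},
                      index=['Movie A', 'Movie B', 'Movie C'])

# Rows only: every column is kept for the matching rows
busy = toy_df.loc[toy_df['tickets_sold'] > 60]
print(list(busy.index))        # ['Movie B', 'Movie C']

# Rows and one column: the result is a Series
busy_capacity = toy_df.loc[toy_df['tickets_sold'] > 60, 'max_capacity']
print(list(busy_capacity))     # [100, 130]

# A single row label and column label returns one value
single = toy_df.loc['Movie A', 'tickets_sold']
print(single)                  # 50
```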




## For more practice you can use
    
- [10 Minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html): A brief introduction to pandas for new users on the official pandas site.

- [Pandas cheat sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf): A one-page summary of some of the most common pandas functions.