<img src="https://courses.edx.org/asset-v1:ACCA+ML001+2T2021+type@asset+block@acca-logo.jpg" alt="ACCA logo" style="width: 400px;"/>

# Python for data analysis
## Part 2 - First steps with pandas

* **Course:** __Machine learning with Python for finance professionals__ by ACCA
* **Instructor:** [Coefficient](https://coefficient.ai) / [@CoefficientData](https://twitter.com/CoefficientData)

---

<div class="alert alert-block alert-info" style="background-color: #BA001E; border: 0px; -moz-border-radius: 10px; -webkit-border-radius: 10px;">
<h2 style="color: white">
pandas 101
</h2><br>
</div>

<img src="https://courses.edx.org/asset-v1:ACCA+ML001+2T2021+type@asset+block@pandas.png" alt="pandas" style="width: 300px;"/>

> **[pandas](https://pandas.pydata.org) is a Python library for data analysis & manipulation**. The name is a contraction of "[panel data](https://en.wikipedia.org/wiki/Panel_data) analysis" and refers to the kind of tabular data common in financial applications. Wes McKinney began developing it in 2008 (it was open-sourced in 2009), and it has been called "_[the most important tool in data science](https://qz.com/1126615/the-story-of-the-most-important-tool-in-data-science/)_".
>
> pandas is built on top of NumPy and enables the storage and manipulation of Excel-like tables in Python. These special tables are called DataFrames, the primary object in `pandas`.

In [None]:
# We will import pandas using the alias "pd" (it's shorter, i.e. quicker to type)
import pandas as pd

In [None]:
pd.read_excel?

In [None]:
# Let's read in the Dream Destination hotel data.
orders = pd.read_excel(
    "Hotel Industry - Orders Database - 2019.xlsx", sheet_name="Order Database"
)

In [None]:
# What does it look like?
orders

In [None]:
# This DataFrame has 9501 rows and 19 columns
orders.shape

In [None]:
# We can use the Python "length" function, len(), to count the length of things
len({"a": 1, "b": 2, "c": 3})

In [None]:
# The "length" of a DataFrame is the number of rows it has
len(orders)
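
The link between `.shape` and `len()` can be checked on a small made-up table. This is a minimal sketch: the `toy` DataFrame and its values are illustrative assumptions and are not part of the hotel dataset.

```python
import pandas as pd

# A hypothetical 3-row, 2-column table (illustrative values only)
toy = pd.DataFrame({"Age": [25, 41, 33], "Gender": ["Female", "Male", "Female"]})

n_rows, n_cols = toy.shape   # .shape is a (rows, columns) tuple
print(n_rows, n_cols)        # 3 2
print(len(toy) == n_rows)    # True: len() counts rows
print(len(toy.columns))      # 2: one way to count columns
```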

---

In this next section we will aim to answer the following questions:
1. How do we select a single column?
2. How do we select several columns?
3. How do we select the top 5 rows? The top 10 rows?
4. How do we select the bottom 5 rows?
5. How do we select only bookings made by women? How about only bookings made by people under 20 years old? How about bookings made by men aged 40-49?

### 1. How do we select a single column?

In [None]:
# Remember that square brackets are used for "selecting things" in Python.
numbers = [1, 2, 3]
numbers[0]

In [None]:
# Select the value associated with a dictionary key...
capitals = {"Germany": "Berlin", "France": "Paris", "Slovenia": "Ljubljana", "Tanzania": "Dodoma"}
capitals["Slovenia"]

In [None]:
orders

In [None]:
# The same is true for DataFrames. Here we get the single column with the name "Location".
orders['Location']

In [None]:
# We can also use the "dot" notation to access the same column.
orders.Age

In [None]:
# We can turn the above into a normal Python list
orders.Location.tolist()

In [None]:
# Dot notation doesn't work with column names containing spaces; we must use square brackets here.
orders['Destination Country']

In [None]:
# We call this a "pandas Series"
type(orders['Location'])

A [Series](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#series) is a single column from a DataFrame. A DataFrame is made up of many Series (i.e. columns), each with its own column name.
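
To make the relationship concrete, here is a small sketch on a made-up table. The `toy` DataFrame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical two-column table (illustrative values only)
toy = pd.DataFrame({"Location": ["London", "Paris"], "Age": [25, 41]})

location = toy["Location"]       # pull out one column
print(type(location).__name__)   # Series
print(location.name)             # Location: the Series remembers its column name
print(list(toy.columns))         # ['Location', 'Age']: one Series per column
```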

### 2. How do we select several columns?

In [1]:
# We can select multiple columns by passing in a list of column names
# into the first set of square brackets.
orders[['Location', 'Destination Country']]

In [None]:
# Note there's nothing special about double brackets [[]] in pandas!

# It's just a Python list...
columns = ['Location', 'Destination Country']

# ...being placed inside the normal pandas "selector" brackets.
orders[columns]

In [None]:
orders[['Location']]
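
Selecting with a one-item list is not quite the same as selecting with a bare column name. The sketch below, using a made-up `toy` DataFrame (an assumption, not the hotel data), shows that a string returns a Series while a list returns a DataFrame, even when the list holds only one name.

```python
import pandas as pd

# Hypothetical table (illustrative values only)
toy = pd.DataFrame({"Location": ["London", "Paris"], "Age": [25, 41]})

as_series = toy["Location"]     # string inside the brackets -> Series
as_frame = toy[["Location"]]    # list inside the brackets -> DataFrame

print(type(as_series).__name__, as_series.shape)   # Series (2,)
print(type(as_frame).__name__, as_frame.shape)     # DataFrame (2, 1)
```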

### 3. How do we select the top 5 rows? The top 10 rows?

In [None]:
# We can use the pandas .head() method for this
orders.head()

In [None]:
# Let's take a look at the inline help - you can see the default is 5
orders.head?

In [None]:
# Let's try to get 10 rows
orders.head(n=10)
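
One detail worth knowing: asking for more rows than exist is not an error. A minimal sketch on a made-up three-row table (the `toy` name and values are assumptions):

```python
import pandas as pd

# Hypothetical table with only 3 rows (illustrative values only)
toy = pd.DataFrame({"Booking": [1, 2, 3]})

print(toy.head(2)["Booking"].tolist())   # [1, 2]
print(len(toy.head()))                   # 3: the default asks for 5, but only 3 rows exist
print(len(toy.head(n=10)))               # 3
```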

### 4. How do we select the bottom 5 rows?

In [None]:
# The equivalent command for the bottom rows is .tail()
orders.tail?

> ### 🚩 Exercise
> Use the `.tail()` method to get the bottom 2 rows only.

In [None]:
# ✏️ ENTER YOUR SOLUTION HERE




> ### 🚩 Exercise
> Select the `Gender`, `Age` and `Destination Country` columns only. Add on `.head()` to the end to get the top 5 rows for just these three columns.

In [None]:
# ✏️ ENTER YOUR SOLUTION HERE




### 5. How do we select only bookings made by women? How about only bookings by people under 20 years old? How about bookings made by men aged 40-49?

Here we need to know how to query our data based on a condition. There are two ways of applying this "conditional filter" in pandas:
1. Using square brackets ("masking").
2. Using the `.query()` method.
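
As a quick sketch, the two approaches can be compared side by side on a made-up table. The `toy` DataFrame, its values and the "over 30" condition are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical table (illustrative values only)
toy = pd.DataFrame({"Gender": ["Female", "Male", "Female"], "Age": [25, 41, 33]})

# Option 1: build a boolean mask (one True/False per row), then select with it
mask = toy["Age"] > 30
print(mask.tolist())               # [False, True, True]
by_mask = toy[mask]

# Option 2: describe the same condition as a string
by_query = toy.query("Age > 30")

print(by_mask.equals(by_query))    # True: both keep the same rows
```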

#### Filter to only bookings made by women: option 1 (using a mask)

In [None]:
# Let's create a "mask" filter. This contains True where the condition is matched.

mask = (orders.Gender == "Female")  # the round brackets are optional but may aid readability
mask

In [None]:
# When you pass the mask into the pandas DataFrame selector brackets,
# it returns only the rows containing True, i.e. the rows where Gender is "Female"
orders[mask]

In [None]:
# This is usually done all in one go
orders[orders.Gender == "Female"]

#### Filter to only bookings made by women: option 2 (using the `.query()` method)

In [None]:
# .query() takes a string, which pandas interprets as a filter condition.
# For example, bookings made by people under 20 years old:
orders.query("Age < 20")

In [None]:
# Notice here we need double equals (for equality) and single quotes
orders.query("Gender == 'Female'")

#### Bookings made by men aged 40-49: option 1 (using a mask)

In [None]:
# Round brackets and the & symbol are both essential here
orders[(orders.Gender == "Male") & (orders.Age >= 40) & (orders.Age <= 49)]

#### Bookings made by men aged 40-49: option 2 (using the `.query()` method)

In [None]:
# We can use the "and" keyword here
orders.query("Gender == 'Male' and Age >= 40 and Age <= 49")
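
Why are the round brackets essential in the mask version? In Python, `&` is evaluated before comparisons such as `>=`, so leaving the brackets out changes the meaning of the expression. The sketch below uses a made-up `toy` DataFrame (an assumption) to show the bracketed form working and the unbracketed form raising an error.

```python
import pandas as pd

# Hypothetical table (illustrative values only)
toy = pd.DataFrame({"Gender": ["Male", "Male", "Female"], "Age": [45, 52, 41]})

# With brackets: each comparison is evaluated first, then combined with &
in_forties = toy[(toy["Age"] >= 40) & (toy["Age"] <= 49)]
print(in_forties["Age"].tolist())   # [45, 41]

# Without brackets Python reads: toy["Age"] >= (40 & toy["Age"]) <= 49
try:
    toy[toy["Age"] >= 40 & toy["Age"] <= 49]
    failed = False
except (ValueError, TypeError) as err:
    failed = True
    print("Raised:", type(err).__name__)
```

Inside `.query()` this problem does not arise, because `and` has lower precedence than the comparisons.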

> ### 🚩 Exercise
> Find all bookings made by women aged 30 whose destination country was Italy.
> 
> _**Tip**: [you can use backticks](https://stackoverflow.com/a/56157729/3279076) inside `.query()` to reference columns containing a space._
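
Here is a minimal sketch of the backtick syntax on a made-up table. The `toy` DataFrame and its values are assumptions; only the column names `Hotel Rating` and `Rooms` are borrowed from the course dataset.

```python
import pandas as pd

# Hypothetical table with a space in one column name (illustrative values only)
toy = pd.DataFrame({"Hotel Rating": [3.5, 4.8, 4.1], "Rooms": [1, 2, 1]})

# Backticks tell .query() that the two words form a single column name
high_rated_single_room = toy.query("`Hotel Rating` > 4 and Rooms == 1")
print(high_rated_single_room)
```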

In [None]:
# ✏️ ENTER YOUR SOLUTION HERE




> ### 🚩 Exercise
> **How many** bookings by people aged 50 had a destination country of either Germany OR France?
> 
> _**Tips**:_
>   - _You may want to use brackets to help keep the logic clear._
>   - _The keyword in a `.query()` for "A or B" is `or`._
>   - _The "or" equivalent of `&` is `|` a.k.a. the "pipe operator"._

In [None]:
# ✏️ ENTER YOUR SOLUTION HERE




<div class="alert alert-block alert-info" style="background-color: #BA001E; border: 0px; -moz-border-radius: 10px; -webkit-border-radius: 10px;">
<h2 style="color: white">
Adding columns with map & apply
</h2><br>
</div>

In [None]:
# Adding a column to a pandas DataFrame is like adding a new key-value pair to a dictionary
orders['Continent'] = 'PLACEHOLDER'

In [None]:
# We can derive new columns from existing columns
orders['No. Of People'] / orders['Rooms']

In [None]:
# We need to save this information into a new column if we want to use it later
orders['People Per Room'] = orders['No. Of People'] / orders['Rooms']

In [None]:
# Let's take a look at our new column
orders.head(3)
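
Arithmetic between two columns is carried out row by row. A small sketch on a made-up table (the `toy` DataFrame and its values are assumptions) makes this easy to verify by hand:

```python
import pandas as pd

# Hypothetical table (illustrative values only)
toy = pd.DataFrame({"No. Of People": [4, 3, 2], "Rooms": [2, 1, 2]})

# Each row's people count is divided by the same row's room count
toy["People Per Room"] = toy["No. Of People"] / toy["Rooms"]
print(toy["People Per Room"].tolist())   # [2.0, 3.0, 1.0]
```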

### The `.map()` method

This is somewhat similar to a VLOOKUP in Excel. You must supply a "lookup dictionary".

In [None]:
continent_lookup = {
    # Africa
    'Egypt': 'Africa',
    'Kenya': 'Africa',
    # Asia
    'China': 'Asia',
    'India': 'Asia',
    'Israel': 'Asia',
    'Iran': 'Asia',
    'Japan': 'Asia',
    'Maldives': 'Asia',
    'Nepal': 'Asia',
    # Australia
    'Australia': 'Australia',
    'New Zealand': 'Australia',
    # Europe
    'Denmark': 'Europe',
    'France': 'Europe',
    'Germany': 'Europe',
    'Iceland': 'Europe',
    'Ireland': 'Europe',
    'Italy': 'Europe',
    # North America
    'Canada': 'North America',
    'Mexico': 'North America',
    # South America
    'Brazil': 'South America',
    'Colombia': 'South America',
}

In [None]:
# We can now "look up" the continent (i.e. dictionary lookup) associated
# with the dictionary key for e.g. Brazil
continent_lookup['Brazil']

The format of `.map()` is:

```python
DATAFRAME[COLUMN].map(DICTIONARY)
```
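
One behaviour to be aware of: values that are missing from the dictionary are mapped to `NaN` rather than raising an error. The sketch below uses a made-up Series and a deliberately incomplete lookup (both are assumptions for illustration).

```python
import pandas as pd

# Hypothetical data: "Atlantis" is deliberately absent from the lookup
countries = pd.Series(["Japan", "Brazil", "Atlantis"])
lookup = {"Japan": "Asia", "Brazil": "South America"}

mapped = countries.map(lookup)
print(mapped.tolist())       # the unmatched value becomes NaN

# A quick way to list the values that had no match
unmatched = countries[mapped.isna()].tolist()
print(unmatched)             # ['Atlantis']
```

Checking for unmatched values like this is a useful habit after any `.map()`, much like checking for `#N/A` after a VLOOKUP.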

In [None]:
# Let's map the Destination Country of our orders DataFrame
# using the continent_lookup dictionary.
orders['Destination Country'].map(continent_lookup)

In [None]:
# This looks good, but it isn't saved. Let's save it into a new column called "Continent".
orders['Continent'] = orders['Destination Country'].map(continent_lookup)

In [None]:
# Great, time to take a look!
orders.tail(10)

> ### 🚩 Exercise
> Let's further group the continents into "[Old World](https://en.wikipedia.org/wiki/Old_World)" and "[New World](https://en.wikipedia.org/wiki/New_World)". The mapping dictionary is provided to you below.
> 
> Create a new column called `World` with the continents mapped according to the dictionary below.

In [None]:
world_lookup = {
    # Old World
    "Africa": "Old World",
    "Asia": "Old World",
    "Europe": "Old World",
    
    # New World
    "North America": "New World",
    "South America": "New World",
    
    # Australia
    "Australia": "Australia",
}

In [None]:
world_lookup['Europe']

In [None]:
# ✏️ ENTER YOUR SOLUTION HERE




### The `.apply()` method
The `.map()` method cycles through the values in the specified column, and then runs each single value through a dictionary lookup.

In contrast, the `.apply()` method cycles through the values in the specified column, and then runs each single value through a function. Let's see it in action.

In [None]:
# First let's see how the Python round() function works
for number in [4.2, 4.4, 4.5, 4.8, 5.5]:
    print(number, "rounds to", round(number))

> _Sidenote: Python implements "[Banker's Rounding](https://www.mathsisfun.com/numbers/rounding-methods.html)", whereby values exactly halfway between two integers (e.g. 4.5) are rounded to the nearest even integer. This reduces bias in calculations performed on the rounded numbers. This is the standard method of rounding taught at school in some countries._
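
The effect is easiest to see on a run of halfway values. The sketch below also shows one way to get "round half up" behaviour using Python's standard `decimal` module; that alternative is our addition for comparison and is not used elsewhere in this notebook.

```python
from decimal import Decimal, ROUND_HALF_UP

halves = [0.5, 1.5, 2.5, 3.5]

# Built-in round(): halfway values go to the nearest even integer
bankers = [round(x) for x in halves]
print(bankers)   # [0, 2, 2, 4]

# "Round half up" via the decimal module, for comparison
half_up = [
    int(Decimal(str(x)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))
    for x in halves
]
print(half_up)   # [1, 2, 3, 4]
```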

In [None]:
# Let's apply the round function to the Hotel Rating column
orders['Hotel Rating'].apply(round)

The format of `.apply()` is:

```python
DATAFRAME[COLUMN].apply(FUNCTION)
```
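
`.apply()` is not limited to built-in functions such as `round`; it also accepts functions you write yourself. In the sketch below, `rating_label`, its threshold of 4 and the sample ratings are all hypothetical, chosen only to illustrate the pattern.

```python
import pandas as pd


def rating_label(rating):
    """Hypothetical helper: label a rating of 4 or above as "high"."""
    if rating >= 4:
        return "high"
    return "standard"


# Illustrative values only
ratings = pd.Series([3.5, 4.8, 4.0])

# Pass the function's name without brackets; pandas calls it once per value
labels = ratings.apply(rating_label)
print(labels.tolist())   # ['standard', 'high', 'high']
```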

In [None]:
# As before, let's save our calculation into a new column in the DataFrame
orders['Hotel Rating (rounded)'] = orders['Hotel Rating'].apply(round)

In [None]:
orders.head(3)

---

> ### 🚩 Exercise
> Create a function that generates an age range string from an integer. We've broken this exercise out step by step for you.

In [None]:
# Let's import the floor() function from Python's math module
from math import floor

The `floor()` function from [Python's math module](https://docs.python.org/3/library/math.html) rounds a decimal number down to the nearest integer.

In [None]:
print(floor(3.1))
print(floor(3.9999999))

In [None]:
# We can use this to round down to the nearest decade.
age = 33
floor(age / 10) * 10

In [None]:
# We can format strings in Python using "f-strings". These inject anything inside the curly brackets {}
# into the string and are great for taking variables and formatting them into the string.
age = 33
print(f"age is {age}")

In [None]:
# You can put any Python code inside the curly brackets. It will be evaluated,
# and then the result will be inserted into the string.
f"{age}-{age + 10}"

> Now it's your turn! We've given you all the clues. We've even created the function outline for you. You should be able to take the `age` input to the function, calculate the nearest decade _lower_ than the age, calculate the age range, construct a format string, and **don't forget to return the formatted string.**
> 
> _Example: if `age` is 59, the nearest decade below this is 50, so the age bracket is "50-59"._

In [None]:
# ✏️ ENTER YOUR SOLUTION HERE




In [None]:
# Let's test out the function. This should display "30-39".
calculate_age_bracket(age=33)

In [None]:
# This should display "30-39"
calculate_age_bracket(age=39)

In [None]:
# This should display "40-49"
calculate_age_bracket(age=40)

> ### 🚩 Exercise
> Apply this function, using the `.apply()` method, to the `Age` column of the `orders` DataFrame, and create a new column called `Age Group`. Double-check the previous examples; it may be easiest to copy, paste and adapt one of them into the cell below to get started. Remember to enter just the name of the function you want to "apply" inside the apply method's brackets, i.e. `.apply(calculate_age_bracket)`.

In [None]:
# ✏️ ENTER YOUR SOLUTION HERE




---
<div class="alert alert-block alert-info">
    <b>Please proceed to the next part of the course when you are ready.</b> We recommend you download a copy of the <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf"><b>pandas cheatsheet</b></a> and start taking some notes on which methods and techniques you've seen so far.
</div>