# Lab 1 - Data Manipulation with Pandas


<div>
<img src="../../images/lab01/pandas_logo.png" width="700"/>
</div>

_(Adapted from [CS109a: Introduction to Data Science](https://harvard-iacs.github.io/2019-CS109A/), [Pandas: Getting Started](https://pandas.pydata.org/docs/getting_started/index.html) & [GitHub: pandas_exercises](https://github.com/guipsamora))_


# 1. Quick Overview


In [None]:
import pandas as pd

from pathlib import Path
from typing import List

# Initialize a base path for us to use
BASE_PATH = Path.cwd()

BASE_PATH

## How is a DataFrame structured?

<div>
<img src="../../images/lab01/pandas_structure.png" width="700"/>
</div>


Getting started with using pandas


In [None]:
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

df

In [None]:
df["Age"]

When selecting a single column of a pandas **`DataFrame`**, the result is a pandas **`Series`**.


In [None]:
type(df["Age"])

A pandas **`Series`** has no column labels, as it is just a single column of a **`DataFrame`**. A Series does have row labels.
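To make the idea of row labels concrete, here is a minimal sketch using a toy **`Series`**. The string labels `a`, `b`, `c` are made up for this example and are not part of the Titanic data. It contrasts label-based access (**`loc`**) with position-based access (**`iloc`**).

```python
import pandas as pd

# Toy Series with explicit (hypothetical) row labels instead of the default 0, 1, 2
ages = pd.Series([22, 35, 58], index=["a", "b", "c"], name="Age")

# Label-based access: look up the row labelled "b"
by_label = ages.loc["b"]

# Position-based access: look up the second row, whatever its label is
by_position = ages.iloc[1]

# A boolean condition keeps the row labels of the matching rows
older_than_30 = ages.loc[ages > 30]

print(by_label, by_position)
print(older_than_30)
```

Both lookups return the same value here, but they answer different questions: `loc` asks "which row has this label?", while `iloc` asks "which row sits at this position?".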


In [None]:
# Access the series by the index (row label)
series = df["Age"]

series.loc[series.index % 2 == 0]

## How do we get data inside a DataFrame?

<div>
<img src="../../images/lab01/pandas_read_data.png" width="700"/>
</div>

Pretty simple, just use the (hopefully existing) **`read_<file_extension>`** method:


In [None]:
DATA_PATH = BASE_PATH / "data"

titanic = pd.read_csv(DATA_PATH / "titanic" / "titanic.csv", index_col=0)
titanic

The great thing about this modular approach is that if the file extension maps one-to-one to an existing pandas **`read_*`** function, we have nothing more to worry about.

_Note: If we were working with something like `xls` or `xlsx` (Microsoft Excel formats), we would need to map the extension to the corresponding method, `read_excel`._
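One possible way to handle extensions that do not match a reader name is a small lookup table. The sketch below is only an illustration: the names `EXTENSION_TO_READER` and `resolve_reader` are hypothetical helpers invented here, not part of pandas.

```python
import pandas as pd

# Hypothetical lookup table: extensions whose pandas reader has a different suffix
EXTENSION_TO_READER = {
    "xls": "excel",
    "xlsx": "excel",
}


def resolve_reader(file_extension: str):
    """Return the pandas read function for an extension, or None if there is none."""
    # Fall back to the extension itself when no special mapping is needed (e.g. "csv")
    reader_suffix = EXTENSION_TO_READER.get(file_extension, file_extension)
    # The default of None avoids an AttributeError for unknown extensions
    return getattr(pd, f"read_{reader_suffix}", None)


print(resolve_reader("xlsx"))
print(resolve_reader("csv"))
print(resolve_reader("unknown"))
```

Passing a default to `getattr` means an unsupported extension yields `None` instead of raising, so the caller can decide whether to skip the file or report it.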


In [None]:
def load_data(data_path: Path) -> List[pd.DataFrame]:
    """Loads all readable data files from a given directory into pandas DataFrames.

    Args:
        data_path (Path): Path object representing the base directory
            containing the data files.

    Returns:
        List[pd.DataFrame]: A list of pandas DataFrames, one per successfully
            loaded file.
    """
    files_found = [path for path in data_path.glob("*") if path.is_file()]

    result = []
    for found in files_found:
        # Give us the file extension (.<ext>) and then remove the '.' leaving us only with <ext>
        file_extension = found.suffix.lstrip(".")

        # Default to None so that unsupported extensions are skipped instead of raising
        read_method = getattr(pd, f"read_{file_extension}", None)
        if callable(read_method):
            result.append(read_method(found))

    return result

In [None]:
data = load_data(DATA_PATH / "titanic")

print(f"Found: '{len(data)}' DataFrames")
data[2]

To check how pandas interpreted the data type of each column, request the **`dtypes`** attribute:


In [None]:
titanic.dtypes

Here, the data type used for each column is listed. The data types in this **`DataFrame`** are integers (**`int64`**), floats (**`float64`**) and strings (**`object`**).

What is the (potential) consequence of **`dtype`** being **`object`** for strings? <br>
$\rightarrow$ Might not be the fastest approach & we also can't simply apply numerical operations
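A minimal sketch of the second point, using made-up values: numbers stored as strings in an **`object`** column are not treated numerically until they are converted, here with `pd.to_numeric`.

```python
import pandas as pd

# Toy data: numbers that were read in as strings (explicit object dtype)
amounts = pd.Series(["1", "2", "3"], dtype=object)

# Summing strings concatenates them instead of adding numbers
concatenated = amounts.sum()

# After conversion to a numeric dtype, the sum is arithmetic
numeric_total = pd.to_numeric(amounts).sum()

print(repr(concatenated))
print(numeric_total)
```

This is why it is worth checking **`dtypes`** before aggregating a column that looks numeric.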

###### _Note:_ (_Starting with pandas v3.0, a dedicated string data type, backed by pyarrow when it is installed, becomes the default for string data. For more, see:_ https://pandas.pydata.org/docs/user_guide/migration-3-strings.html#background)


In [None]:
import pandas as pd

n = 1_000_000
series_obj = pd.Series(["hello"] * n, dtype=object) # Numpy ndarray
series_arrow = series_obj.astype("string[pyarrow]") # pyarrow string

print(series_obj.dtype)
print(series_arrow.dtype)

print("\nBenchmarking .str.upper() ...")

print("object dtype:")
%timeit series_obj.str.upper()

print("string[pyarrow] dtype:")
%timeit series_arrow.str.upper()

## How can you work with pandas DataFrames?


The Titanic data set consists of the following data columns:

- **`PassengerId`**: Id of every passenger (implicit index of the row).

- **`Survived`**: Indication whether passenger survived. 0 for no and 1 for yes.

- **`Pclass`**: One out of the 3 ticket classes: Class 1, Class 2 and Class 3.

- **`Name`**: Name of passenger.

- **`Sex`**: Gender of passenger.

- **`Age`**: Age of passenger in years.

- **`SibSp`**: Number of siblings or spouses aboard.

- **`Parch`**: Number of parents or children aboard.

- **`Ticket`**: Ticket number of passenger.

- **`Fare`**: Passenger fare.

- **`Cabin`**: Cabin number of passenger.

- **`Embarked`**: Port of embarkation.


<div>
<img src="../../images/lab01/pandas_columns.png" width="700"/>
</div>


Why did this fail? Are we sure we got the columns right?


The columns don't match our expected specifications, but we can adjust this easily


In [None]:
# Change the columns to match our specification from above

<div>
<img src="../../images/lab01/pandas_rows.png" width="700"/>
</div>


In [None]:
# Let's see how many passengers on the Titanic were older than 35 years

The condition inside the selection brackets **`titanic["Age"] > 35`** checks for which rows the **`Age`** column has a value larger than 35, so:

```py
titanic["Age"] > 35
0      False
1       True
2      False
3      False
4      False
       ...
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool
```

returns a pandas **`Series`** of boolean values, which are either **`True`** or **`False`**, with the same number of rows as the original **`DataFrame`**.
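The same mechanism can be sketched on a toy **`DataFrame`** (the names and ages below are illustrative, not the full Titanic data): the boolean **`Series`** acts as a mask, and only rows where it is **`True`** are kept.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Name": ["Owen", "William", "Elizabeth"],
        "Age": [22, 35, 58],
    }
)

# Boolean mask with one True/False entry per row
is_older = people["Age"] > 35

# Passing the mask inside the selection brackets keeps only the True rows
older = people[is_older]

# True counts as 1, so summing the mask counts the matching rows
n_older = int(is_older.sum())

print(older)
print(n_older)
```

Note that a passenger aged exactly 35 is excluded, because the comparison is strict.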


<div>
<img src="../../images/lab01/pandas_specify.png" width="700"/>
</div>


Let's say we are only interested in the names of passengers who were older than 35 years.


In this case, a subset of both rows and columns is made in one go and just using selection brackets **`[]`** is not sufficient anymore. The **`loc`**/**`iloc`** operators are required in front of the selection brackets **`[]`**.

When using **`loc`**/**`iloc`**, the part before the comma is the **rows** you want, and the part after the comma is the **columns** you want to select.

For both the part before and after the comma, you can use a single label, a **list** of labels, a **slice** of labels, a **conditional expression** or a **colon**. Using a colon specifies you want to select all rows or columns.
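A short sketch of these selector types with **`loc`**, again on an illustrative toy **`DataFrame`** rather than the Titanic data:

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Name": ["Owen", "William", "Elizabeth"],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# Conditional expression for the rows, single label for the column
names_of_older = people.loc[people["Age"] > 35, "Name"]

# Colon for the rows (all of them), list of labels for the columns
name_and_sex = people.loc[:, ["Name", "Sex"]]

# Slices of labels; with loc, both ends of a label slice are included
first_two = people.loc[0:1, "Name":"Age"]

print(names_of_older)
print(name_and_sex)
print(first_two)
```

The inclusive end of a label slice is a common surprise: `people.loc[0:1]` returns two rows, not one.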


When specifically interested in certain rows and/or columns **based on their position** in the table, use the **`iloc`** operator in front of the selection brackets **`[]`**.
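As a rough sketch of position-based selection, assuming the same kind of toy **`DataFrame`** as before (illustrative values only):

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Name": ["Owen", "William", "Elizabeth"],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# Rows 0 and 1, columns 0 and 1 (the end of a positional slice is excluded)
top_left = people.iloc[0:2, 0:2]

# A single cell: third row, first column
last_name = people.iloc[2, 0]

# Mixing both ideas: pick rows by position, but name the column explicitly
first_two_ages = people.loc[people.index[0:2], "Age"]

print(top_left)
print(last_name)
print(first_two_ages)
```

Naming the column explicitly protects against selecting the wrong column if the column order of the table ever changes.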


In [None]:
# Of course you can also mix the ideas of iloc and loc, which makes it easier to avoid accidental column selections

<div>
<img src="../../images/lab01/pandas_groupby.png" width="700"/>
</div>

What is the average age for male versus female Titanic passengers?


Since we are interested in the average age for each gender, we first select these two columns (**`titanic[["Sex", "Age"]]`**). Next, we apply the **`groupby()`** method on the **`Sex`** column to create one group per category (since there are only two values in the column, two groups are created). Last, the average age for each category is calculated and returned.

This approach is the general **`split-apply-combine`** pattern:

- **Split** the data into groups
- **Apply** a function to each group independently
- **Combine** the results into a data structure
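The three steps above can be sketched on a small made-up table (the passenger values are illustrative). The one-line **`groupby()`** call and the hand-written loop are meant to produce the same numbers.

```python
import pandas as pd

passengers = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Sex": ["male", "male", "female", "female"],
        "Age": [22, 35, 58, 30],
    }
)

# Split, apply and combine in a single expression
mean_age = passengers[["Sex", "Age"]].groupby("Sex").mean()

# The same steps spelled out by hand
manual = {}
for sex, group in passengers.groupby("Sex"):  # split
    manual[sex] = group["Age"].mean()         # apply
manual = pd.Series(manual, name="Age")        # combine

# With non-numeric columns present, restrict the aggregation to numeric ones
mean_numeric = passengers.groupby("Sex").mean(numeric_only=True)

print(mean_age)
print(manual)
print(mean_numeric)
```

Selecting the columns first and passing `numeric_only=True` are two routes to the same goal: keeping the string column `Name` out of a numeric aggregation.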


In [None]:
# Why can't we just apply the groupby operation directly?

In [None]:
# Recalling the dtypes, we are applying a numeric operation on types that are incompatible with the operation.


# We can avoid this by passing `numeric_only=True`

<div>
<img src="../../images/lab01/pandas_count.png" width="700"/>
</div>

What is the number of passengers in each of the cabin classes?


The **`value_counts()`** method counts the number of records for each distinct value in a column. It is a shortcut: under the hood it is a groupby operation combined with counting the number of records within each group:

```py
titanic.groupby("Pclass")["Pclass"].count()
Pclass
1    216
2    184
3    491
Name: Pclass, dtype: int64
```
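To compare the two spellings side by side, here is a minimal sketch on a handful of made-up ticket classes. The counts agree; only the ordering of the result differs.

```python
import pandas as pd

tickets = pd.DataFrame({"Pclass": [3, 1, 3, 2, 3, 1]})

# Shortcut: ordered by count, most frequent value first
counts = tickets["Pclass"].value_counts()

# Explicit groupby + count: ordered by the group key
grouped = tickets.groupby("Pclass")["Pclass"].count()

print(counts)
print(grouped)

# After sorting by class, both approaches give the same numbers
print(counts.sort_index().tolist() == grouped.tolist())
```

If the order by class matters more than the order by frequency, `sort_index()` on the `value_counts()` result is one way to get it.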


# 2. Exercises

Summary of operations & Documentation available at: https://pandas.pydata.org/docs/user_guide/10min.html


### Give the percentage of survivors


### What is the average age and gender of the survivors compared to the people that didn't survive?

**Note**: pandas automatically excludes NaN values from aggregation functions. If the only value in the column is NaN, the mean is taken over an empty set, which results in NaN.
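A small sketch of this behaviour with made-up ages; `skipna=False` is shown only to make the default visible:

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 40.0])

# Default: NaN is skipped, so the mean is taken over 20.0 and 40.0
mean_skipping_nan = ages.mean()

# Opting out of skipping: a single NaN makes the result NaN
mean_with_nan = ages.mean(skipna=False)

# Only NaN values: nothing is left to average, so the result is NaN
mean_all_nan = pd.Series([np.nan]).mean()

print(mean_skipping_nan, mean_with_nan, mean_all_nan)
```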


### Create a new column, called `AgeGroup`, which classifies the person based on their **`Age`** as follows:

- If 0 < **`Age`** <= 1, then classify them as **`Infant`**
- If 1 < **`Age`** <= 3, then classify them as **`Toddler`**
- If 3 < **`Age`** <= 12, then classify them as **`Child`**
- If 12 < **`Age`** <= 18, then classify them as **`Teen`**
- If 18 < **`Age`** <= 30, then classify them as **`YoungAdult`**
- If 30 < **`Age`** <= 50, then classify them as **`Adult`**
- If 50 < **`Age`** <= 80, then classify them as **`Senior`**
- If 80 < **`Age`** <= 130, then classify them as **`Urgestein`**


### Now, measure the survival rate by age group


## Working with Chipotle


Load the **`chipotle.csv`** from the **`data/chipotle`** directory.

Tip: perhaps some detail in the documentation is necessary to load the file.


In [None]:
chipo_path = ...

### Inspect the first 10 entries


### What is the number of columns in the dataset?


### Print the name of all the columns.


### How is the dataset indexed?


### What were the ten most-ordered items? And how often were they ordered?


In [None]:
# Solution: Chicken Bowl: 761, ...

### How many items were ordered in total?


### How much was the revenue for the period in the dataset?

Tip: if you are running into issues, check the type of the column(s) that you need to work with. Perhaps you need some preprocessing before proceeding with some steps.


In [None]:
revenue = ...

print("Revenue was: $" + str(round(revenue, 2)))
# Solution: Revenue was: $39237.02

### How many orders were made in the period?


### What is the lowest, average, and highest revenue per order?


In [None]:
# Solution: mean: 21.394...; min: 10.08; max: 1074.24; median: 16.65

### How many different items are sold?


In [None]:
# Solution: 50

### How many products cost more than $10.00 ?

Tip: Inspect the item_price column for a specific item to see how the price and item_name relate to each other.


In [None]:
# item_name and choice_description appear as pairs multiple times, so we must drop the duplicates to avoid distorting our results

# Solution: 707 rows

### How many different product prices exist?


In [None]:
# Solution: 37

### What is the quantity of the most expensive **item** ordered?


### How many times did someone order more than one Canned Soda?


### List the full order of the person that wanted the most canned sodas.


### (Advanced): Create a profitability report about the menu, which includes for each item:

- Total quantity sold
- Total revenue generated
- Number of **unique** orders containing the item
- Average selling price per unit

At the end, rank the items by their revenue contribution (% of total revenue)


### (Advanced): Are there price inconsistencies? If so, list them.

Background: Some items on our menu may have been sold at different prices, e.g. depending on add-ons. Find all items that have more than one unique price and list their price ranges.
