# Pandas Basics
---
In this notebook, we will learn about the Pandas library, which is not a library about animals, but about working with [**pan**el **da**ta](https://en.wikipedia.org/wiki/Panel_data).

Pandas leverages many of the concepts and features of NumPy, and is built on top of it. Great! All we have learned last week will be useful today. On the other hand, if you skipped last week's class and you are not yet familiar with NumPy, this notebook will be a bit more challenging and you should go back to the previous notebook and work through it before proceeding.

## DataFrames
---
Just like NumPy had `np.ndarray` as its core data structure, Pandas has `pd.DataFrame` as its core data structure. Typically, you can think of a `pd.DataFrame` as a table with rows and columns, e.g.,

| Name | Age | Gender | Height |
| --- | --- | --- | --- |
| Alice | 25 | Female | 165 |
| Bob | 30 | Male | 175 |
| Charlie | 35 | Male | 172 |

Each row in the table represents an observation, and each column represents a variable. The variables are also called features or attributes. The rows are also called records or observations. We will mostly be calling columns *features* and rows *observations*, but you should be aware that there are other terminologies used in the literature, some of which are more common in certain fields.

Pandas and NumPy play well together. In fact, you can think of a `pd.DataFrame` as a generalization of a 2D NumPy array. In particular, columns of a DataFrame can be of different types, not necessarily just numbers.

You can also create a DataFrame from a NumPy array and vice versa, as we will see below.
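
To make the "columns of different types" point concrete, here is a minimal sketch that builds the small table shown above from a dictionary. The variable names `people` and `mean_age` are made up for this illustration.

```python
import pandas as pd

# Build the example table from a dictionary (column name -> list of values)
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Gender": ["Female", "Male", "Male"],
    "Height": [165, 175, 172],
})

# Each column keeps its own data type (text for Name, integers for Age, ...)
print(people.dtypes)

# Numeric columns support the usual NumPy-style operations
mean_age = people["Age"].mean()
print(mean_age)
```

A single 2D NumPy array could not hold this table without converting every value to one common type.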

In [46]:
# Import NumPy as we learned previously
import numpy as np

# Now we also import Pandas under the alias pd, which is the standard alias
import pandas as pd

In [47]:
# Create a 2D numpy array (2 rows, 3 columns)
X = np.array([[1, 2, 3], [4, 5, 6]])

# Create a DataFrame from the numpy array
df = pd.DataFrame(X)

# Display the DataFrame
df

|  | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |


We can also pass column names and row indices to the `pd.DataFrame` constructor.

In [48]:
df = pd.DataFrame(X, columns=['A', 'B', 'C'], index=['x', 'y'])
df

|  | A | B | C |
| --- | --- | --- | --- |
| x | 1 | 2 | 3 |
| y | 4 | 5 | 6 |


As we mentioned above, Pandas and NumPy play well together, so we can reuse some of what we learned last week. Notice, however, that there will be some differences.

In [49]:
# Similar to NumPy, we have access to the shape of the DataFrame
df.shape

(2, 3)

In [50]:
# We can also compute the mean, but notice how the default is column-wise!
df.mean()

A    2.5
B    3.5
C    4.5
dtype: float64

In [51]:
# In contrast, NumPy would compute the mean over all elements by default
X.mean()

np.float64(3.5)

So what if we want to compute the mean of the whole dataframe? Well typically we don't want to do this, but if you really want to, there are a few options.

First, we can simply use the `np.mean` function, instead of calling the `.mean()` method of the dataframe.

In [52]:
# A dataframe can also be the input to a NumPy method
np.mean(df)

np.float64(3.5)

Second, we can also pass the `axis=None` argument to the `.mean()` method, which will compute the mean of all the elements in the dataframe.

In [53]:
# axis=None means that the method is applied to the whole array
df.mean(axis=None)

np.float64(3.5)

In fact, we can think of the difference between NumPy and Pandas methods as NumPy having default argument `axis=None`, which means that the method is applied to the whole array, while Pandas methods have default argument `axis=0`, which means that the method is applied to each column by default.
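
As a small side-by-side sketch of the `axis` defaults described above (the names `df_demo`, `overall`, `per_column` and `per_row` are only used for this illustration):

```python
import numpy as np
import pandas as pd

X = np.array([[1, 2, 3], [4, 5, 6]])
df_demo = pd.DataFrame(X, columns=["A", "B", "C"], index=["x", "y"])

# NumPy default (axis=None): a single number for the whole array
overall = np.mean(X)

# Pandas default (axis=0): one value per column
per_column = df_demo.mean()

# axis=1: one value per row
per_row = df_demo.mean(axis=1)

print(overall)
print(per_column)
print(per_row)
```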

Third, we can call the `.mean` method twice. This is an approach you should generally avoid, but it is worth thinking about why it works here.

In [54]:
# Be careful, this works due to some properties of the mean operator, can
# you think about a case where we cannot just call the method twice?
df.mean().mean()

np.float64(3.5)
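
If you want to check your answer to the question in the comment, here is one possible case, sketched with a made-up dataframe `df_med`: the median. The median of the column medians is generally not the median of all values.

```python
import numpy as np
import pandas as pd

df_med = pd.DataFrame({"A": [1, 3], "B": [2, 4], "C": [10, 20]})

# Column medians are 2, 3 and 15, so their median is 3
median_of_medians = df_med.median().median()

# The median of all six values (1, 2, 3, 4, 10, 20) is 3.5
overall_median = np.median(df_med.to_numpy())

print(median_of_medians, overall_median)
```

The trick with the mean only works because every column holds the same number of values. With missing values in some columns, calling `.mean()` twice would also give a different result than the overall mean.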

Lastly, when we want to convert a dataframe to a numpy array, we can use the `.to_numpy()` method. Note that you will also often see the `.values` attribute, which is equivalent. 

In fact, you will probably see `.values` more often, as it is more convenient. However, when [looking at the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html), you will notice that `.to_numpy()` is the more up-to-date method, and it is recommended to use it.

In [55]:
# The recommended way to convert a dataframe to a numpy array
df.to_numpy()

array([[1, 2, 3],
       [4, 5, 6]])

In [56]:
# A possible but not recommended alternative
df.values

array([[1, 2, 3],
       [4, 5, 6]])

## Loading Data
---

While programming and mathematics are fun in their own right, this class is about data science and thus, we will now start working with data.

Data can be stored in a variety of formats. Most commonly, you will encounter CSV which stands for Comma-Separated Values. This is a text-based format that is used to store tabular data, such as a spreadsheet or a database.

CSVs are plain text files, which makes them easy to read and write to. It also means that they are readable by any text editor and we can open them with Excel as well.

We will be working with the [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), which is a classic dataset in pattern recognition and statistics. It contains measurements of the sepal length, sepal width, petal length, and petal width of 150 iris flowers, together with their species.


If you don't know what petals and sepals are, here is an illustration:
![](https://cdn.britannica.com/39/91239-004-44353E32/Diagram-flowering-plant.jpg)

In [57]:
# Loading a CSV file is as easy as calling pd.read_csv
df = pd.read_csv("../data/iris.csv")

# Show the first 10 rows of the dataframe (use df.tail(n) to show the last n rows)
df.head(10)

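|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
| --- | --- | --- | --- | --- | --- |
| 0 | 7.7 | 2.6 | 6.9 | 2.3 | virginica |
| 1 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
| 2 | 7.7 | 2.8 | 6.7 | 2.0 | virginica |
| 3 | 4.6 | 3.6 | 1.0 | 0.2 | setosa |
| 4 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
| 5 | 5.8 | 4.0 | 1.2 | 0.2 | setosa |
| 6 | 5.0 | 3.2 | 1.2 | 0.2 | setosa |
| 7 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
| 8 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 9 | 4.4 | 3.2 | 1.3 | 0.2 | setosa |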


Notice that we have a total of 5 columns with 4 of them being numeric and the last one `species` being a string.

So if we were to try `df.mean()`, we would get an error, as we cannot compute the mean of a non-numeric column.

However, there is a useful method called `.describe()` that gives us summary statistics of all numerical columns without us having to specify which columns to ignore.

In [58]:
# Describe will give us the summary statistics
df.describe()

|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
| --- | --- | --- | --- | --- |
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |
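
If we only need the mean and not the full summary, there are two common ways to skip the non-numeric columns. The sketch below uses a small made-up dataframe `mixed` rather than the Iris data.

```python
import pandas as pd

mixed = pd.DataFrame({
    "length": [1.0, 2.0, 3.0],
    "width": [0.5, 1.5, 2.5],
    "label": ["a", "b", "a"],
})

# Option 1: ask .mean() to ignore non-numeric columns
means_a = mixed.mean(numeric_only=True)

# Option 2: keep only the numeric columns first, then take the mean
means_b = mixed.select_dtypes(include="number").mean()

print(means_a)
print(means_b)
```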


What about the `species` column? What type of statistics can we be interested in for a categorical column?

Well, one thing we could be interested in is the number of samples per category. We can do this by calling the `.value_counts()` method on the column.

In [59]:
# Count the number of observations (rows) per category
df["species"].value_counts()

species
virginica     50
setosa        50
versicolor    50
Name: count, dtype: int64
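
Sometimes the share of each category is more informative than the raw count. As a small sketch on a made-up series `labels`, passing `normalize=True` to `.value_counts()` returns fractions that sum to 1.

```python
import pandas as pd

labels = pd.Series(
    ["setosa", "virginica", "setosa", "versicolor"], name="species"
)

# Absolute number of observations per category
counts = labels.value_counts()

# Relative frequency of each category
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```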

Notice how we accessed the `species` column as we would access a dictionary key. If we want to select multiple columns, we can pass a list instead of a single string.

In [60]:
# Select two columns by passing a list as a key
df[["species", "sepal length (cm)"]]

|  | species | sepal length (cm) |
| --- | --- | --- |
| 0 | virginica | 7.7 |
| 1 | virginica | 7.7 |
| 2 | virginica | 7.7 |
| 3 | setosa | 4.6 |
| 4 | setosa | 4.3 |
| ... | ... | ... |
| 145 | versicolor | 5.7 |
| 146 | versicolor | 5.7 |
| 147 | versicolor | 6.2 |
| 148 | versicolor | 5.1 |


Sometimes it's easier to access columns by their position instead of their names. We can use the `.columns` attribute to do so.

In [61]:
# Select the first two columns
df[df.columns[:2]]

|  | sepal length (cm) | sepal width (cm) |
| --- | --- | --- |
| 0 | 7.7 | 2.6 |
| 1 | 7.7 | 3.8 |
| 2 | 7.7 | 2.8 |
| 3 | 4.6 | 3.6 |
| 4 | 4.3 | 3.0 |
| ... | ... | ... |
| 145 | 5.7 | 3.0 |
| 146 | 5.7 | 2.9 |
| 147 | 6.2 | 2.9 |
| 148 | 5.1 | 2.5 |


Lastly, we can use the `.drop` method to exclude specific columns and keep all the others.

In [62]:
# Select all columns except the sepal length and width columns
df.drop(columns=["sepal length (cm)", "sepal width (cm)"])

|  | petal length (cm) | petal width (cm) | species |
| --- | --- | --- | --- |
| 0 | 6.9 | 2.3 | virginica |
| 1 | 6.7 | 2.2 | virginica |
| 2 | 6.7 | 2.0 | virginica |
| 3 | 1.0 | 0.2 | setosa |
| 4 | 1.1 | 0.1 | setosa |
| ... | ... | ... | ... |
| 145 | 4.2 | 1.2 | versicolor |
| 146 | 4.2 | 1.3 | versicolor |
| 147 | 4.3 | 1.3 | versicolor |
| 148 | 3.0 | 1.1 | versicolor |


#### ➡️ ✏️ Your turn

Compute the 10th percentile of all numeric columns.

*Hint*: Pandas has a `.quantile(q)` method that can be used to compute the q-th quantile of a column. Do you remember the relationship between quantiles and percentiles?

In [63]:
# ➡️ Your code here...




#### Solution

In [64]:
# We remove the species column and compute the 0.1 quantile (10th percentile)
df.drop(columns=["species"]).quantile(.1)

sepal length (cm)    4.8
sepal width (cm)     2.5
petal length (cm)    1.4
petal width (cm)     0.2
Name: 0.1, dtype: float64

## Subsetting Data
---

So far, we have seen how to load data from a CSV file and how to select particular columns. Of course, this is not enough for real-world data analysis tasks. More often than not, we want to select rows based on certain conditions, e.g., we might want to select all observations for a given species or all observations with a sepal length greater than a certain value.

Recall how in NumPy we indexed our arrays as follows:

```python
X[row_indices, column_indices]
```

In Pandas, we can do the same using the `.loc` accessor, which allows us to select rows and columns by their labels, i.e., the row index values and the column names.

```python
df.loc[row_indices, column_indices]
```

We can also use the `.iloc` accessor, which allows us to select rows and columns by their integer location.

```python
df.iloc[row_indices, column_indices]
```

This can be slightly confusing at first, so let's make sure we understand the difference between the two.

In [65]:
# Let us print the first 5 rows for reference
df.head(5)

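|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
| --- | --- | --- | --- | --- | --- |
| 0 | 7.7 | 2.6 | 6.9 | 2.3 | virginica |
| 1 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
| 2 | 7.7 | 2.8 | 6.7 | 2.0 | virginica |
| 3 | 4.6 | 3.6 | 1.0 | 0.2 | setosa |
| 4 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |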


In [66]:
# Access the observation with index 0
df.loc[0, :]

sepal length (cm)          7.7
sepal width (cm)           2.6
petal length (cm)          6.9
petal width (cm)           2.3
species              virginica
Name: 0, dtype: object

In [67]:
# Access the observation in the 0th row
df.iloc[0, :]

sepal length (cm)          7.7
sepal width (cm)           2.6
petal length (cm)          6.9
petal width (cm)           2.3
species              virginica
Name: 0, dtype: object

But these are the same?! That is because, in this case, the index and the row number are the same. 

⚠️ Be very careful with `.loc` and `.iloc` as they behave differently. In truth, it's very rare that you will need to use `.iloc` as it is more error-prone. We typically want to access rows by their index, not by their position!

In [68]:
# An example where they differ
df2 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=[1, 2, 3])
df2

|  | A | B |
| --- | --- | --- |
| 1 | 1 | 4 |
| 2 | 2 | 5 |
| 3 | 3 | 6 |


In [69]:
# Access the observation with index 1, notice this is the 0th row
df2.loc[1, :]

A    1
B    4
Name: 1, dtype: int64

In [70]:
# Access the observation at position 1 (the second row), which has index 2!
df2.iloc[1, :]

A    2
B    5
Name: 2, dtype: int64
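
The two accessors also differ when slicing, which is a common source of off-by-one surprises. A minimal sketch using the same `df2` as above (the names `by_label` and `by_position` are only for this illustration):

```python
import pandas as pd

df2 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=[1, 2, 3])

# .loc slices by label and includes the end label: rows with index 1 and 2
by_label = df2.loc[1:2, :]

# .iloc slices by position and excludes the end, like NumPy: position 1 only
by_position = df2.iloc[1:2, :]

print(by_label)
print(by_position)
```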

#### 🤔 The beauty of indices

It can take a while to appreciate the power of indices and they can be tricky to understand at first. However, as you proceed in your data science journey, you will come to find that indices are a very powerful tool to have at your disposal.

For instance, indices need not be numbers from 0 to N-1, but can be any value we want, e.g., when working with time-series data, we will often use the date as the index, which will allow us to instantly select observations for a given date or even a range of dates. But these are more advanced topics that we will not cover here. However, they might come up in your projects, so you can always come back to this notebook for reference.

üôÄ ü§Ø Don't worry about the `pd.to_datetime` function for now, we will see more of it in the next notebooks.

```python 
# Create a dataframe with dates as the index
df3 = pd.DataFrame({"A": [1, 2, 3]}, index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]))

# Access the observation with index '2020-01-01'
df3.loc["2020-01-01", :]
```
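
Selecting a range of dates works with the same slicing syntax. The following is a small sketch that rebuilds `df3` so it can run on its own; `first_two_days` is a name chosen for this example.

```python
import pandas as pd

# Same dataframe as above, with dates as the index
df3 = pd.DataFrame(
    {"A": [1, 2, 3]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
)

# Label-based slices include both endpoints, so this returns two rows
first_two_days = df3.loc["2020-01-01":"2020-01-02", :]
print(first_two_days)
```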

Okay, once again, a quick recap is in order. We have seen how to select columns and rows in a dataframe. In particular, we can select rows by their index using the `.loc` accessor.

But we would like to do more, in fact, just as we did with NumPy, we would like to be able to filter rows based on certain conditions. For instance, we might want to select all observations for a given species, say the species **virginica**.

In [71]:
# Select all observations for the species virginica
df[df["species"] == "virginica"]

|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
| --- | --- | --- | --- | --- | --- |
| 0 | 7.7 | 2.6 | 6.9 | 2.3 | virginica |
| 1 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
| 2 | 7.7 | 2.8 | 6.7 | 2.0 | virginica |
| 53 | 4.9 | 2.5 | 4.5 | 1.7 | virginica |
| 54 | 6.0 | 3.0 | 4.8 | 1.8 | virginica |
| 55 | 6.2 | 2.8 | 4.8 | 1.8 | virginica |
| 56 | 5.6 | 2.8 | 4.9 | 2.0 | virginica |
| 57 | 6.3 | 2.7 | 4.9 | 1.8 | virginica |
| 58 | 6.1 | 3.0 | 4.9 | 1.8 | virginica |
| 59 | 6.0 | 2.2 | 5.0 | 1.5 | virginica |


⚠️ Notice that we didn't use the `.loc` accessor here. However, we could have! The code works just the same. This is a bit of a shortcut that Pandas provides to make our lives easier.

We have to be a bit careful though. If we want to do something more complex, such as selecting specific observations and columns at the same time, we will have to use the `.loc` accessor. Here is how one would do it.

Notice that the pattern is the same as in NumPy: we separate the row condition and the column condition with a comma.

```python
df.loc[row_condition, column_condition]
```

In [72]:
# Obtain only the sepal length and width for the virginica species
df.loc[df["species"] == "virginica", ["sepal length (cm)", "sepal width (cm)"]]

|  | sepal length (cm) | sepal width (cm) |
| --- | --- | --- |
| 0 | 7.7 | 2.6 |
| 1 | 7.7 | 3.8 |
| 2 | 7.7 | 2.8 |
| 53 | 4.9 | 2.5 |
| 54 | 6.0 | 3.0 |
| 55 | 6.2 | 2.8 |
| 56 | 5.6 | 2.8 |
| 57 | 6.3 | 2.7 |
| 58 | 6.1 | 3.0 |
| 59 | 6.0 | 2.2 |
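
Row conditions can also be combined, just like boolean masks in NumPy. The sketch below uses a tiny made-up dataframe `flowers` and shows the element-wise operators `&` (and), `|` (or) and `~` (not).

```python
import pandas as pd

flowers = pd.DataFrame({
    "sepal length (cm)": [7.7, 4.6, 6.0, 5.1],
    "species": ["virginica", "setosa", "virginica", "versicolor"],
})

# Each condition is a boolean Series with one True/False per row
is_virginica = flowers["species"] == "virginica"
is_long = flowers["sepal length (cm)"] > 6.5

# Rows where both conditions hold
both = flowers.loc[is_virginica & is_long, :]

# Rows where at least one condition holds
either = flowers.loc[is_virginica | is_long, :]

# Species of all rows that are NOT virginica
not_virginica = flowers.loc[~is_virginica, "species"]

print(both)
print(either)
print(not_virginica)
```

When writing the conditions inline, wrap each one in parentheses, e.g. `(a == 1) & (b > 2)`, because `&` and `|` bind more tightly than comparisons in Python.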


#### ➡️ ✏️ Your turn

1. Compute the mean sepal length for the virginica species and for the setosa species. Which species has a larger sepal length on average?
2. For the versicolor species, compute a rough estimate of the petal area, which you may assume to be the product of the petal length and width. What is the mean petal area for the versicolor species? *Hint*: You can do mathematical operations in Pandas just as you would do in NumPy.

In [73]:
# ➡️ Your code here...




#### Solution

In [74]:
# Part 1: Compute the average sepal length for the virginica species
df.loc[df["species"] == "virginica", "sepal length (cm)"].mean()

np.float64(6.587999999999999)

In [75]:
# Part 1: Compute the average sepal length for the setosa species
df.loc[df["species"] == "setosa", "sepal length (cm)"].mean()

np.float64(5.006)

The virginica species has a larger sepal length on average, by about 1.6 cm: 6.588 vs. 5.006.

In [76]:
# Part 2: Compute a rough estimate of the petal area for the versicolor species
width = df.loc[df["species"] == "versicolor", "petal width (cm)"]
length = df.loc[df["species"] == "versicolor", "petal length (cm)"]
area = width * length
area.mean()

np.float64(5.7204)

Finally, we close this section by learning how to add new columns to a dataframe. This is a task we will often want to do and it is very easy to do in Pandas.

We will add a new column called `petal area` to our dataframe which consists of the product of the petal length and width.

In [77]:
# Add a new column called petal area to the dataframe
df["petal area"] = df["petal length (cm)"] * df["petal width (cm)"]

# Show the first 5 rows of the dataframe
df.head(5)

|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | petal area |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 7.7 | 2.6 | 6.9 | 2.3 | virginica | 15.87 |
| 1 | 7.7 | 3.8 | 6.7 | 2.2 | virginica | 14.74 |
| 2 | 7.7 | 2.8 | 6.7 | 2.0 | virginica | 13.4 |
| 3 | 4.6 | 3.6 | 1.0 | 0.2 | setosa | 0.2 |
| 4 | 4.3 | 3.0 | 1.1 | 0.1 | setosa | 0.11 |


## Grouping and Aggregation
---

Today's last topic is grouping and aggregation. This is where the true power of Pandas gets to shine.

It was a bit tedious to compute the mean for each individual species above. Of course, now that you know enough Python, you might come up with the idea to use a for-loop to iterate over the species and compute the mean sepal length for each species, which is not a bad idea at all.
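
For reference, such a for-loop could look roughly like the sketch below. It uses a tiny made-up dataframe `flowers` instead of the full Iris data, and collects the results in a plain dictionary called `loop_means`.

```python
import pandas as pd

flowers = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal length (cm)": [5.0, 5.2, 6.4, 6.6],
})

loop_means = {}
for name in flowers["species"].unique():
    # Select the sepal lengths of the current species only
    subset = flowers.loc[flowers["species"] == name, "sepal length (cm)"]
    # Store the mean under the species name
    loop_means[name] = subset.mean()

print(loop_means)
```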

However, Pandas provides us with a more convenient way to do this, which uses the concept of *grouping* and *aggregating*.

Grouping and aggregating go hand in hand. We can think of *grouping* as the process of dividing data into groups based on certain criteria, while *aggregating* is the process of applying a function to each group and aggregating the different sub-groups back into a single object.

üôÄ ü§Ø Sometimes, one only wants to group without aggregating. There are good reasons to do this, but this is a more advanced topic that you will probably not need for a while. As such we don't cover it here and you might not see it until much later down your data science path.


So how would we go about computing the mean sepal length for each species using these new concepts? It's actually pretty simple, just remember the two steps:

1. Group the data by a specified column (or multiple columns).
2. Apply a function to each group or to specific columns in each group.

In [78]:
# Step 1: Group the data by the species column
dfg = df.groupby("species")

# Step 2: Apply the mean function to the sepal length column
dfg["sepal length (cm)"].mean()

species
setosa        5.006
versicolor    5.936
virginica     6.588
Name: sepal length (cm), dtype: float64

Et voilà! We have computed the mean sepal length for each species. Of course, we can also write the above code a bit more concisely by chaining the methods together.

In [79]:
df.groupby("species")["sepal length (cm)"].mean()

species
setosa        5.006
versicolor    5.936
virginica     6.588
Name: sepal length (cm), dtype: float64

… and we get the same result. Moreover, we can compute the mean sepal width for each species at the same time!

In [80]:
# Notice the double brackets, this is because the key is a list!
df.groupby("species")[["sepal length (cm)", "sepal width (cm)"]].mean()

| species | sepal length (cm) | sepal width (cm) |
| --- | --- | --- |
| setosa | 5.006 | 3.428 |
| versicolor | 5.936 | 2.77 |
| virginica | 6.588 | 2.974 |


… and we could also apply the mean to all the columns at once.

In [81]:
df.groupby("species").mean()

| species | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | petal area |
| --- | --- | --- | --- | --- | --- |
| setosa | 5.006 | 3.428 | 1.462 | 0.246 | 0.3656 |
| versicolor | 5.936 | 2.77 | 4.26 | 1.326 | 5.7204 |
| virginica | 6.588 | 2.974 | 5.552 | 2.026 | 11.2962 |


Of course, we can also apply other aggregation functions, such as the sum, max, min, etc.

In [82]:
# Find the maximum value for each column by species
df.groupby("species").max()

| species | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | petal area |
| --- | --- | --- | --- | --- | --- |
| setosa | 5.8 | 4.4 | 1.9 | 0.6 | 0.96 |
| versicolor | 7.0 | 3.4 | 5.1 | 1.8 | 8.64 |
| virginica | 7.9 | 3.8 | 6.9 | 2.5 | 15.87 |


In [83]:
# Find the standard deviation for each column by species
df.groupby("species").std()

| species | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | petal area |
| --- | --- | --- | --- | --- | --- |
| setosa | 0.35249 | 0.379064 | 0.173664 | 0.105386 | 0.181155 |
| versicolor | 0.516171 | 0.313798 | 0.469911 | 0.197753 | 1.368403 |
| virginica | 0.63588 | 0.322497 | 0.551895 | 0.27465 | 2.157412 |


Lastly, we can also be more creative with our aggregation functions. For instance, we may want to compute both the mean and the standard deviation of the sepal length for each species. We cannot do this easily with the methods we have been using so far, but we can use the `.agg` method to achieve this.

In [84]:
df.groupby("species")["sepal length (cm)"].agg(["mean", "std"])

| species | mean | std |
| --- | --- | --- |
| setosa | 5.006 | 0.35249 |
| versicolor | 5.936 | 0.516171 |
| virginica | 6.588 | 0.63588 |


Notice that we have passed a list of aggregation functions (strings that represent method names, to be precise) to the `.agg` method.

Finally, we can also apply different aggregation functions to different columns. To do so, we simply pass a dictionary to the `.agg` method, with the keys being the columns and the values being the aggregation functions.

In [85]:
df.groupby("species").agg({
    "sepal length (cm)": ["mean", "std"],
    "sepal width (cm)": ["min", "max"],
    "petal length (cm)": "median"
})

| species | sepal length (cm), mean | sepal length (cm), std | sepal width (cm), min | sepal width (cm), max | petal length (cm), median |
| --- | --- | --- | --- | --- | --- |
| setosa | 5.006 | 0.35249 | 2.3 | 4.4 | 1.5 |
| versicolor | 5.936 | 0.516171 | 2.0 | 3.4 | 4.35 |
| virginica | 6.588 | 0.63588 | 2.2 | 3.8 | 5.55 |
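
The dictionary form produces a two-level column header (column name and aggregation function), which can be awkward to work with later. One alternative is to pass keyword arguments of the form `new_name=(column, function)` to `.agg`, which gives a flat header with names of our choosing. The sketch below uses a tiny made-up dataframe `flowers`; the output names `mean_length` and `max_width` are arbitrary.

```python
import pandas as pd

flowers = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal length (cm)": [5.0, 5.2, 6.4, 6.6],
    "sepal width (cm)": [3.4, 3.6, 2.8, 3.0],
})

# Each keyword defines one output column: name=(input column, function)
summary = flowers.groupby("species").agg(
    mean_length=("sepal length (cm)", "mean"),
    max_width=("sepal width (cm)", "max"),
)

print(summary)
```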


#### ➡️ ✏️ Your turn

1. Compute the mean and standard deviation of the petal length for each species. Does the standard deviation increase or decrease with the average petal length?
2. Add a new column called `ratio` to the dataframe that contains the ratio of the petal length to the sepal length. (`petal length / sepal length`)
3. Add a new column called `ratio_category` to the dataframe that contains the value `large` if the ratio is greater than 0.75 and `small` otherwise. *Hint*: Recall how to use `np.where` to create an array by applying an if-else condition to each element of the array.
4. Group by the list `["species", "ratio_category"]` and compute the mean sepal width for each group. Does the sepal width increase or decrease with the petal length ratio?

In [86]:
# ➡️ Your code here...



#### Solution

In [87]:
# 1. Compute the mean and standard deviation of the petal length for each species.
df.groupby("species")["petal length (cm)"].agg(["mean", "std"])

| species | mean | std |
| --- | --- | --- |
| setosa | 1.462 | 0.173664 |
| versicolor | 4.26 | 0.469911 |
| virginica | 5.552 | 0.551895 |


In [88]:
# 2. Add a new column called `ratio` to the dataframe that contains the ratio of the petal length to the sepal length. (`petal length / sepal length`)
df["ratio"] = df["petal length (cm)"] / df["sepal length (cm)"]
df

|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | petal area | ratio |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7.7 | 2.6 | 6.9 | 2.3 | virginica | 15.87 | 0.896104 |
| 1 | 7.7 | 3.8 | 6.7 | 2.2 | virginica | 14.74 | 0.870130 |
| 2 | 7.7 | 2.8 | 6.7 | 2.0 | virginica | 13.40 | 0.870130 |
| 3 | 4.6 | 3.6 | 1.0 | 0.2 | setosa | 0.20 | 0.217391 |
| 4 | 4.3 | 3.0 | 1.1 | 0.1 | setosa | 0.11 | 0.255814 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 5.7 | 3.0 | 4.2 | 1.2 | versicolor | 5.04 | 0.736842 |
| 146 | 5.7 | 2.9 | 4.2 | 1.3 | versicolor | 5.46 | 0.736842 |
| 147 | 6.2 | 2.9 | 4.3 | 1.3 | versicolor | 5.59 | 0.693548 |
| 148 | 5.1 | 2.5 | 3.0 | 1.1 | versicolor | 3.30 | 0.588235 |


In [93]:
# 3. Add a new column called `ratio_category` to the dataframe that contains the value `large` if the ratio is greater than 0.75 and `small` otherwise.
df["ratio_category"] = np.where(df["ratio"] > 0.75, "large", "small")
df

|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | petal area | ratio | ratio_category |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7.7 | 2.6 | 6.9 | 2.3 | virginica | 15.87 | 0.896104 | large |
| 1 | 7.7 | 3.8 | 6.7 | 2.2 | virginica | 14.74 | 0.870130 | large |
| 2 | 7.7 | 2.8 | 6.7 | 2.0 | virginica | 13.40 | 0.870130 | large |
| 3 | 4.6 | 3.6 | 1.0 | 0.2 | setosa | 0.20 | 0.217391 | small |
| 4 | 4.3 | 3.0 | 1.1 | 0.1 | setosa | 0.11 | 0.255814 | small |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 5.7 | 3.0 | 4.2 | 1.2 | versicolor | 5.04 | 0.736842 | small |
| 146 | 5.7 | 2.9 | 4.2 | 1.3 | versicolor | 5.46 | 0.736842 | small |
| 147 | 6.2 | 2.9 | 4.3 | 1.3 | versicolor | 5.59 | 0.693548 | small |
| 148 | 5.1 | 2.5 | 3.0 | 1.1 | versicolor | 3.30 | 0.588235 | small |


In [96]:
# 4. Group by the list `["species", "ratio_category"]` and compute the mean sepal width for each group. Does the sepal width increase or decrease with the petal length ratio?
df.groupby(["species", "ratio_category"])["sepal width (cm)"].mean()

species     ratio_category
setosa      small             3.428000
versicolor  large             2.836364
            small             2.751282
virginica   large             2.971429
            small             3.100000
Name: sepal width (cm), dtype: float64