# Lecture 3: Pandas [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) 1

* How to create a `DataFrame`
* How to access a `DataFrame`'s elements

## Imports

In [1]:
import pandas as pd

## How to Create a `DataFrame`

There are many ways to create a `DataFrame` manually. One way (which we will see later) is to use a dictionary, whose keys are the column names and whose values are the column contents. Another way (which we will see now) is to use a 2D list.

As an example, let's create a simple 2D list with five rows and three columns:

In [2]:
l = list(zip(range(5), range(5, 10), range(10, 15)))
l

[(0, 5, 10), (1, 6, 11), (2, 7, 12), (3, 8, 13), (4, 9, 14)]

As you can see, a quick way to do that is to `zip` together three separate iterables, each of which forms one column.

You can see the structure more clearly when I print each row on a single line:

In [3]:
for row in l:
    print(row)

(0, 5, 10)
(1, 6, 11)
(2, 7, 12)
(3, 8, 13)
(4, 9, 14)


* The first column contains the numbers 0 to 4 (i.e., `range(5)`);
* the second column contains the numbers 5 to 9 (i.e., `range(5, 10)`);
* and the last column contains the numbers 10 to 14 (i.e., `range(10, 15)`).

(Remember, `range(a, b)` yields the integers in the half-open interval $[a, b)$, i.e., `b` itself is excluded.)
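The row/column structure described above can be checked by undoing the `zip`. This is a minimal sketch; the name `columns` is illustrative and not part of the lecture.

```python
# Rebuild the list of row tuples from the lecture.
l = list(zip(range(5), range(5, 10), range(10, 15)))

# Unpacking the rows into zip() groups the values by column again,
# i.e. it transposes the list of rows.
columns = list(zip(*l))

for column in columns:
    print(column)
```

Each printed tuple corresponds to one of the three `range` objects that were zipped together.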

1) We can use the list `l` to create a `DataFrame` by supplying it to [`pd.DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)'s `data` argument:

In [4]:
df = pd.DataFrame(data=l)
df

   0  1   2
0  0  5  10
1  1  6  11
2  2  7  12
3  3  8  13
4  4  9  14


Note that the explicit index is identical to the implicit index (0 to 4), since we did not specify an explicit index. Also note that the columns, too, are labeled with their implicit column index (0 to 2).
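One way to verify this is to inspect the row and column labels directly. This is a minimal sketch; the names `df_default`, `row_labels` and `column_labels` are illustrative.

```python
import pandas as pd

# Rebuild the 5x3 list of row tuples from the lecture.
l = list(zip(range(5), range(5, 10), range(10, 15)))
df_default = pd.DataFrame(data=l)

# Without `index` or `columns`, pandas falls back to positional labels.
row_labels = list(df_default.index)
column_labels = list(df_default.columns)

print(row_labels)
print(column_labels)
```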

2) Let's provide an explicit column index, i.e. column names. We can do that by providing the `columns` argument:

In [5]:
df = pd.DataFrame(data=l,
                  columns=['col0', 'col1', 'col2'])
df

   col0  col1  col2
0     0     5    10
1     1     6    11
2     2     7    12
3     3     8    13
4     4     9    14


3) Let's do the same for the row index, which uses the `index` argument, just as with `Series`:

In [6]:
df = pd.DataFrame(data=l,
                  index=list('abcde'),
                  columns=['col0', 'col1', 'col2'])
df

   col0  col1  col2
a     0     5    10
b     1     6    11
c     2     7    12
d     3     8    13
e     4     9    14


As with `Series`, we can access a `DataFrame`'s underlying data as a NumPy array through the `.values` attribute:

In [7]:
df.values

array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]], dtype=int64)

This is often useful if we need to do numerical computations such as matrix operations, for example transposition ...

In [8]:
df.values.T

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]], dtype=int64)

... or matrix multiplication using the [`@` operator](https://numpy.org/doc/stable/reference/routines.linalg.html#the-operator), which NumPy arrays support:

In [9]:
df.values.T @ df.values

array([[ 30,  80, 130],
       [ 80, 255, 430],
       [130, 430, 730]], dtype=int64)

A quick reminder:
![](https://upload.wikimedia.org/wikipedia/commons/thumb/e/eb/Matrix_multiplication_diagram_2.svg/330px-Matrix_multiplication_diagram_2.svg.png)
Let $A$ be an $m \times n$ matrix and $B$ an $n \times p$ matrix; then $C = AB$ is an $m \times p$ matrix. [Wikipedia](https://en.wikipedia.org/wiki/Matrix_multiplication)
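The shape rule can be checked on the lecture's data. The sketch below uses `DataFrame.to_numpy()`, which the pandas documentation recommends over `.values`; that substitution and the names `A`, `C` and `D` are assumptions of this example, not part of the lecture.

```python
import pandas as pd

l = list(zip(range(5), range(5, 10), range(10, 15)))
df = pd.DataFrame(data=l,
                  index=list('abcde'),
                  columns=['col0', 'col1', 'col2'])

A = df.to_numpy()   # 5 x 3 array, same data as df.values

# (3 x 5) @ (5 x 3): the inner dimensions match, the result is 3 x 3.
C = A.T @ A

# Swapping the operands gives (5 x 3) @ (3 x 5), i.e. a 5 x 5 result.
D = A @ A.T

print(A.shape, C.shape, D.shape)
```

Multiplying `A @ A` would raise an error, because the inner dimensions (3 and 5) do not match.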

## Accessing a `DataFrame`'s Elements

### `DataFrame[i]` and `DataFrame[[i,j]]`
#### Explicit Column Index
Single column:

In [10]:
df['col1']

a    5
b    6
c    7
d    8
e    9
Name: col1, dtype: int64

This is a **column `Series`**. You can see that the column's explicit index in the `DataFrame` has become the `Series`' `name`. It can be accessed as `Series.name`.

Multiple columns:

In [11]:
df[['col1', 'col2']]

   col1  col2
a     5    10
b     6    11
c     7    12
d     8    13
e     9    14


A single implicit (positional) column index does not work, because `[]` looks the key up among the column labels:

In [12]:
df[1]

KeyError: 1
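If a column is needed by position, two possible workarounds are sketched below. They rely on `.iloc[]` and `df.columns`, which are covered later in this lecture; the variable names are illustrative.

```python
import pandas as pd

l = list(zip(range(5), range(5, 10), range(10, 15)))
df = pd.DataFrame(data=l,
                  index=list('abcde'),
                  columns=['col0', 'col1', 'col2'])

# df[1] searches the column labels for the key 1 and fails,
# because the labels here are the strings 'col0', 'col1', 'col2'.
try:
    df[1]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Option 1: positional access with .iloc (all rows, column position 1).
second_by_position = df.iloc[:, 1]

# Option 2: translate the position into a label first.
second_by_label = df[df.columns[1]]

print(second_by_position)
```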

### `DataFrame[i:j]`
#### Explicit Row Index
Note: with explicit labels, the end label `j` is included.

In [13]:
df['b':'c']

   col0  col1  col2
b     1     6    11
c     2     7    12


#### Implicit Row Index
Note: with implicit positions, the end position `j` is excluded.

In [14]:
df[1:3]

   col0  col1  col2
b     1     6    11
c     2     7    12
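The two slices above return the same rows although their end points differ. A minimal sketch of the contrast (the names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

l = list(zip(range(5), range(5, 10), range(10, 15)))
df = pd.DataFrame(data=l,
                  index=list('abcde'),
                  columns=['col0', 'col1', 'col2'])

# Label slice: both end points are included, so this yields rows 'b' and 'c'.
by_label = df['b':'c']

# Positional slice: the end point is excluded, so this yields positions 1 and 2.
by_position = df[1:3]

print(by_label.equals(by_position))
```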


### `DataFrame[bitmask]`
As with `Series`, but now we have to specify the column on which we wish to select:

In [15]:
df['col0'] == 3

a    False
b    False
c    False
d     True
e    False
Name: col0, dtype: bool

We put the bitmask into the `DataFrame`'s `[]` to perform the selection:

In [16]:
df[df['col0'] == 3]

   col0  col1  col2
d     3     8    13


Note that masking returned a `DataFrame` even though only one row matched, because the mask selects whole rows and each row spans more than one column.

We can build a bitmask from the explicit index, too, as we did with `Series`:

In [17]:
(df.index == 'c') | (df.index == 'e')

array([False, False,  True, False,  True])
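The resulting boolean array can be passed to `[]` just like a column-based mask. The sketch below also shows `Index.isin(...)` as a shorter way to build the same mask; using `isin` is a suggestion of this example, not something the lecture relies on.

```python
import pandas as pd

l = list(zip(range(5), range(5, 10), range(10, 15)))
df = pd.DataFrame(data=l,
                  index=list('abcde'),
                  columns=['col0', 'col1', 'col2'])

# Mask built from the explicit row index, as in the cell above.
mask = (df.index == 'c') | (df.index == 'e')
selected = df[mask]

# Equivalent mask built with Index.isin.
selected_isin = df[df.index.isin(['c', 'e'])]

print(selected)
```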

### `DataFrame.loc[...]`
Explicit index only.
#### Rows Only

In [18]:
df.loc['a']

col0     0
col1     5
col2    10
Name: a, dtype: int64

This is a **row `Series`**. You can see that the row's explicit index in the `DataFrame` has become the `Series`' `name`. It can be accessed as `Series.name`.
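Row and column `Series` can be compared side by side to see where the `name` and the index of each come from. A minimal sketch with illustrative variable names:

```python
import pandas as pd

l = list(zip(range(5), range(5, 10), range(10, 15)))
df = pd.DataFrame(data=l,
                  index=list('abcde'),
                  columns=['col0', 'col1', 'col2'])

row = df.loc['a']      # row Series
column = df['col1']    # column Series

# The row Series is named after its row label and indexed by the column names.
print(row.name, list(row.index))

# The column Series is named after its column label and indexed by the row labels.
print(column.name, list(column.index))
```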

In [19]:
df.loc['a':'b']

   col0  col1  col2
a     0     5    10
b     1     6    11


In [20]:
df.loc[['a', 'c']]

   col0  col1  col2
a     0     5    10
c     2     7    12


#### Rows + Columns

In [21]:
df.loc['a', 'col0']

0

In [22]:
df.loc['b':'d', 'col0':'col1']

   col0  col1
b     1     6
c     2     7
d     3     8


In [23]:
df.loc[['a', 'c'], ['col0', 'col2']]

   col0  col2
a     0    10
c     2    12


#### Bitmask

In [24]:
df.loc[df['col0'] == 3, 'col1']

d    8
Name: col1, dtype: int64

Note that combining a bitmask with a single column label always returns a `Series`, even if only a single element matches. To get its raw value, we need to use the standard `Series` API:

In [25]:
df.loc[df['col0'] == 3, 'col1'].iloc[0]

8
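Because the mask may match zero, one or several rows, it can help to be explicit about which case is expected. The sketch below contrasts `.iloc[0]` with `Series.item()`, which raises a `ValueError` unless exactly one element is present; the value 99 is a made-up example of a key that does not occur.

```python
import pandas as pd

l = list(zip(range(5), range(5, 10), range(10, 15)))
df = pd.DataFrame(data=l,
                  index=list('abcde'),
                  columns=['col0', 'col1', 'col2'])

matches = df.loc[df['col0'] == 3, 'col1']

# First match; silently ignores any further matches.
first_value = matches.iloc[0]

# Strict variant: only succeeds if there is exactly one match.
only_value = matches.item()

# A mask without matches yields an empty Series,
# on which .iloc[0] would raise an IndexError.
no_matches = df.loc[df['col0'] == 99, 'col1']

print(first_value, only_value, len(no_matches))
```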

### `DataFrame.iloc[...]`

Implicit index only.

#### Rows Only

In [26]:
df.iloc[0]

col0     0
col1     5
col2    10
Name: a, dtype: int64

In [27]:
df.iloc[0:2]

   col0  col1  col2
a     0     5    10
b     1     6    11


In [28]:
df.iloc[[0,3]]

   col0  col1  col2
a     0     5    10
d     3     8    13


#### Rows + Columns

In [29]:
df.iloc[0, 0]

0

In [30]:
df.iloc[0:2, 1:3]

   col1  col2
a     5    10
b     6    11


In [31]:
df.iloc[[0, 3], [0, 2]]

   col0  col2
a     0    10
d     3    13


### Mixing Explicit and Implicit Row and Column Indices
#### Implicit rows, explicit columns
* with `.loc[]` and [`df.index[]`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.index.html):

In [32]:
df.loc[df.index[[0, 2]], 'col0']

a    0
c    2
Name: col0, dtype: int64

* by chaining the implicit and explicit selections (in any order):

In [33]:
df['col0'].iloc[[0, 2]]

a    0
c    2
Name: col0, dtype: int64

In [34]:
df.iloc[[0, 2]]['col0']

a    0
c    2
Name: col0, dtype: int64

Chaining can lead to errors if you want to modify a `DataFrame`, because the assignment may be applied to a temporary copy instead of the original. Generally, be careful with chaining.
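When the goal is to modify values, a single `.loc[]` or `.iloc[]` call avoids the problem. This is a minimal sketch, assuming we want to overwrite `col0` at row positions 0 and 2; the copies `df_loc` and `df_iloc` and the value `-1` are illustrative. Depending on the pandas version and its copy-on-write setting, a chained assignment such as `df['col0'].iloc[[0, 2]] = -1` may emit a warning or leave `df` unchanged.

```python
import pandas as pd

l = list(zip(range(5), range(5, 10), range(10, 15)))
df = pd.DataFrame(data=l,
                  index=list('abcde'),
                  columns=['col0', 'col1', 'col2'])

# Work on copies so that the lecture's df stays intact.
df_loc = df.copy()
df_iloc = df.copy()

# Implicit rows, explicit column, in one .loc call.
df_loc.loc[df_loc.index[[0, 2]], 'col0'] = -1

# The same assignment in one .iloc call.
df_iloc.iloc[[0, 2], df_iloc.columns.get_loc('col0')] = -1

print(df_loc)
```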

* with `.iloc[]` and [`Index.get_loc(...)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.get_loc.html) for a *single* column:

In [35]:
df.iloc[[0, 2], df.columns.get_loc('col0')]

a    0
c    2
Name: col0, dtype: int64

#### Explicit rows, implicit columns
* with `.loc[]` and `df.columns[]`:

In [36]:
df.loc['a':'b', df.columns[[0, 2]]]

   col0  col2
a     0    10
b     1    11


* with `.iloc[]` and [`Index.get_loc(...)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.get_loc.html) for a *single* row:

In [37]:
df.iloc[df.index.get_loc('a'), [0, 2]]

col0     0
col2    10
Name: a, dtype: int64

* with `.iloc[]` and [`Index.get_indexer(...)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.get_indexer.html) for multiple rows (but not slices):

In [38]:
df.iloc[df.index.get_indexer(['a', 'b']), [0, 2]]

   col0  col2
a     0    10
b     1    11
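For label slices, one possible counterpart is `Index.slice_indexer(...)`, which translates a start and end label into a positional slice. This is a sketch beyond what the lecture covers, and it assumes a unique, sorted row index like the one used here.

```python
import pandas as pd

l = list(zip(range(5), range(5, 10), range(10, 15)))
df = pd.DataFrame(data=l,
                  index=list('abcde'),
                  columns=['col0', 'col1', 'col2'])

# Convert the label slice 'a':'b' (end label included) into a positional slice.
row_slice = df.index.slice_indexer('a', 'b')

# Explicit row slice combined with implicit column positions.
subset = df.iloc[row_slice, [0, 2]]

print(subset)
```

The result matches `df.loc['a':'b', df.columns[[0, 2]]]` from the `.loc[]` variant above.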


© 2023 Philipp Cornelius