# Tools of the Trade. Part 2


In this second part of the workshop, we will work with external, structured data with the library **Pandas**.

## Pandas

Pandas is one of the most popular libraries for data analysis in Python because of its ability to handle structured data with ease. It is directly inspired by `R`'s DataFrame manipulation. You will see that everything we learned about `np` will come in handy here.

In [4]:
# import pandas
import pandas as pd
print(pd.__version__)
# and numpy
import numpy as np

1.2.1


As we did with NumPy, we use an `alias` for Pandas, `pd`. This is very common.

### The `pd.Series` Class

Pandas has two classes that are of particular interest for us: the `pd.Series` and the `pd.DataFrame`. Since a `pd.DataFrame` is composed of `pd.Series`, we start from the `pd.Series`. What is it? A `pd.Series` is an **ordered and labeled unidimensional collection of homogeneous elements**. Let's compare this to a `np.array`:

1. *Collection*, both the `np.array` and the `pd.Series` can contain more than one element (but also 1 or 0 elements);
2. *Ordered*, both the `np.array` and the `pd.Series` contain information about the order of the elements they contain;
3. *Unidimensional*, unlike a `np.array`, a `pd.Series` always has a unidimensional shape: it is a vector;
4. *Homogeneous*, both the `pd.Series` and the `np.array` can contain only elements of the same type (for example, `float` or `integer`), following their `dtype`;
5. *No fixed dimensions*, the size of a `pd.Series` can easily change after it is instantiated -- even if it is still more efficient to instantiate an object once.

The two classes are indeed quite similar, with a major restriction for `pd.Series` (only one dimension), but also a brand new feature: labels -- more precisely, the `index`.

Let's create our first two `pd.Series`.

In [38]:
# You can proceed from a list
first_ser = [1, 2, 3, 5, 7, 11]
first_ser = pd.Series(first_ser)
# But also from a np.array
# Here we create a random boolean vector
seed = 12345
rng = np.random.default_rng(seed)
second_ser = rng.choice([1,0], 6)
# dtype argument!
second_ser = pd.Series(second_ser, dtype='bool')

print(f"This is my first Series:\n{first_ser}\n")
print(f"This is my second Series:\n{second_ser}\n")

This is my first Series:
0     1
1     2
2     3
3     5
4     7
5    11
dtype: int64

This is my second Series:
0    False
1     True
2    False
3     True
4     True
5    False
dtype: bool



After our introduction, you will hardly be surprised to learn that the `pd.Series` has many of the same features as the `np.array`. You have already noticed that it has a `dtype`. Well, it also has a `shape` (which will always be unidimensional). Moreover, you can apply to a `pd.Series` all of the `np` universal functions you have already encountered.

<div class="alert alert-block alert-success">
    <b>New <code>dtype</code>s</b> Pandas supports all of the NumPy-defined <code>dtype</code>s. It also adds some <code>dtype</code>s of its own, such as the <a href=https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>Categorical</a> <code>dtype</code> and the <a href=https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html>StringDtype</a>.
</div>
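As a minimal sketch of these two Pandas-specific `dtype`s, we can request them through the `dtype` argument when creating a `pd.Series`. The values and the names `colors` and `names` are made up for illustration.

```python
import pandas as pd

# Hypothetical data, used only for illustration
colors = pd.Series(["red", "green", "red", "blue"], dtype="category")
names = pd.Series(["Ada", "Grace", None], dtype="string")

# A categorical Series stores each distinct value only once
print(colors.dtype)
print(colors.cat.categories)

# A string Series keeps track of missing entries
print(names.dtype)
print(names.isna())
```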

In [39]:
# Some old friends
print(f"This is the shape of our second Series {second_ser.shape}")
print(f"This is the dtype of our second Series {second_ser.dtype}")
# You can convert the dtype of a Series using the usual .astype() method
second_ser = second_ser.astype("int16")
print(f"This is the new dtype of our second Series {second_ser.dtype}")
print(f"And this is how our second Series looks now\n{second_ser}")
print(f"This is the arcsin of our second Series in PIs:\n{np.arcsin(second_ser)/np.pi}")
print(f"The sums of our two Series:\n{first_ser + second_ser}")

This is the shape of our second Series (6,)
This is the dtype of our second Series bool
This is the new dtype of our second Series int16
And this is how our second Series looks now
0    0
1    1
2    0
3    1
4    1
5    0
dtype: int16
This is the arcsin of our second Series in PIs:
0    0.0
1    0.5
2    0.0
3    0.5
4    0.5
5    0.0
dtype: float32
The sums of our two Series:
0     1
1     3
2     3
3     6
4     8
5    11
dtype: int64


### The `index` of a `pd.Series`

The `pd.Series` has a lot in common with the `np.array`, but there is also a fundamental difference: a `pd.Series` associates a label to each of its elements -- like a basic Python dictionary. The labels are stored in the `index` attribute. The behavior of the `pd.Series` depends crucially on these labels, as we shall see below.

By default, the `Index` of a `pd.Series` (and a `pd.DataFrame`!) is a simple range starting from `0`.

In [40]:
print(f"This is the index of the first Series: {first_ser.index}")

This is the index of the first Series: RangeIndex(start=0, stop=6, step=1)


But we can easily change that. Here we assign a random label to each element in the `first_ser` -- the `index` does not have to be numeric, but we will stick with that here.

In [41]:
new_ind = rng.choice([0, 1, 2, 3, 4, 5], size=first_ser.shape, replace=False)
first_ser.index = new_ind
print(f"This is the new index of the first Series: {first_ser.index}")

This is the new index of the first Series: Int64Index([5, 3, 0, 4, 2, 1], dtype='int64')
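As an aside, here is a small sketch of a non-numeric `index`. The labels can be passed directly at creation time through the `index` argument, or assigned later as we did above. The Series `primes` and its labels are hypothetical.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
primes = pd.Series([2, 3, 5], index=["first", "second", "third"])

print(primes.index)
print(primes.loc["second"])

# The index can also be replaced after creation, as long as the length matches
primes.index = ["a", "b", "c"]
print(primes)
```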


Notice how the behavior of `sum` (or `+`) changes!

In [42]:
print(f"This is our first Series:\n{first_ser}")
print(f"This is our second Series:\n{second_ser}")
print(f"Is this their sum?!\n{first_ser+second_ser}")

This is our first Series:
5     1
3     2
0     3
4     5
2     7
1    11
dtype: int64
This is our second Series:
0    0
1    1
2    0
3    1
4    1
5    0
dtype: int16
Is this their sum?!
0     3
1    12
2     7
3     3
4     6
5     1
dtype: int64


What's happening here? `pd` is matching the `index` before summing. **Elements with the same label get summed**. When a label appears in only one of the two `pd.Series`, we get a `np.nan`. The general rule is: *`pd` will match the index of `pd.Series` and `pd.DataFrame` when an operation involves two or more of them* -- so, take care to pass `pd.Series` and `pd.DataFrame` with the right `index`.

In [43]:
# Re-index the first Series again
new_ind = rng.choice([0, 1, 2, 3, 7, 6], size=first_ser.shape, replace=False)
first_ser.index = new_ind

print(f"This is our first Series, re-indexed:\n{first_ser}")
print(f"This is our second Series:\n{second_ser}")
print(f"Their new sum:\n{first_ser+second_ser}")

This is our first Series, re-indexed:
0     1
6     2
2     3
3     5
7     7
1    11
dtype: int64
This is our second Series:
0    0
1    1
2    0
3    1
4    1
5    0
dtype: int16
Their new sum:
0     1.0
1    12.0
2     3.0
3     6.0
4     NaN
5     NaN
6     NaN
7     NaN
dtype: float64
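
If label matching is not what you want, there are a couple of ways around it. The sketch below uses two small hypothetical Series (`left` and `right`) and shows the default aligned sum, the `.add()` method with `fill_value`, and a purely positional sum obtained by converting to `np.array` first.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with partially overlapping labels
left = pd.Series([1, 2, 3], index=[0, 1, 2])
right = pd.Series([10, 20, 30], index=[1, 2, 3])

# Default behavior: labels are matched, unmatched labels give NaN
aligned = left + right
print(aligned)

# .add() with fill_value treats a label missing on one side as 0
filled = left.add(right, fill_value=0)
print(filled)

# Converting to np.array drops the labels, so the sum is purely positional
positional = left.to_numpy() + right.to_numpy()
print(positional)
```

Which of these is appropriate depends on whether the labels carry meaning in your data; when they do, the default alignment is usually the safest choice.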


### Selection (and Changing the Elements of a `pd.Series`)

Selection is a pivotal part of Pandas. It turns out the selection process is similar to `np` selection, with the difference that you have to be explicit about the feature you are using for selection: are you selecting on positions -- `pd.Series` are ordered -- or on the `index`? We use the `.iloc[]` selector for positions and `.loc[]` for labels.

Exactly as with the `np.array`, we often select elements from a `pd.Series` in order to replace them.

In [47]:
print(f"Again, this is the first Series:\n{first_ser}")

# We select the second element in the Series
print(f"This is the second element in first Series: {first_ser.iloc[1]}")
# We select the element with label 1 in the Series
print(f"This is the element with label \"1\" in first Series: {first_ser.loc[1]}")

Again, this is the first Series:
0     1
6     2
2     3
3     5
7     7
1    11
dtype: int64
This is the second element in first Series: 2
This is the element with label "1" in first Series: 11
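
The same selectors can appear on the left-hand side of an assignment to replace an element. This is a minimal sketch on a hypothetical Series `ser`, whose integer labels are deliberately different from the positions.

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions
ser = pd.Series([10, 20, 30], index=[2, 0, 1])

# Replace the element in the first position (its label is 2)
ser.iloc[0] = -1
# Replace the element with label 0 (it sits in the second position)
ser.loc[0] = -2

print(ser)
```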


We can use all the same techniques we used for selection on `np.array`, slicing, boolean masks and fancy indexing.

#### 1. Slicing

You can slice both the `index`, using `.loc[]`, and the positions, using `.iloc[]`. The behavior of the two selectors when slicing is somewhat different.

<div class="alert alert-block alert-success">
    <b>Slicing in <code>loc[]</code> and <code>iloc[]</code></b> The ways <code>loc[]</code> and <code>iloc[]</code> deal with slicing differ in subtle ways. We will not explore all of them here, but look at <a href=https://stackoverflow.com/questions/31593201/how-are-iloc-and-loc-different>this wonderful Stackoverflow answer</a> for a detailed comparison.
</div>

In [56]:
print(f"The first Series:\n{first_ser}")

# Slicing. Notice the different behavior
print(f"Slicing on positions:\n{first_ser.iloc[:2]}")
print(f"Slicing on index:\n{first_ser.loc[:2]}")

The first Series:
0     1
6     2
2     3
3     5
7     7
1    11
dtype: int64
Slicing on positions:
0    1
6    2
dtype: int64
Slicing on index:
0    1
6    2
2    3
dtype: int64
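
The difference is perhaps easier to see with non-numeric labels. In this sketch (the Series `letters` is hypothetical), the positional slice excludes its stop, as in plain Python, while the label slice includes it.

```python
import pandas as pd

# Hypothetical Series with string labels
letters = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Positions 0 and 1: the stop position is excluded
by_position = letters.iloc[0:2]
print(by_position)

# Labels "a", "b" and "c": the stop label is included
by_label = letters.loc["a":"c"]
print(by_label)
```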


#### 2. Masking

As for `np.array`, we can use `bool`-valued `pd.Series` and `np.array` for selection. We pass our `bool`-valued object inside a `loc[]` selector. Beware that `pd` will match the `index` of the `bool` `pd.Series` with the `index` of the selected `pd.Series`: once again, when two `pd.Series` interact, labels are matched. On the other hand, `pd` matches a `bool` `np.array` with the selected `pd.Series` based on positions. An example will clarify.

We will use the self-explanatory method `to_numpy()` to transform the `pd.Series` into a `np.array`. We will also replace the selected values with new values. Nothing new here with respect to what we saw for the `np.array`.

In [72]:
print(f"The first Series:\n{first_ser}")

# We build a Boolean mask starting from the second Series. A Series mask only works with loc
mask = second_ser.astype(bool)
# We have to match the index of the mask with the index of the Series
mask.index = rng.choice(first_ser.index, size=(6,), replace=False)
print(f"This is the mask as a Series:\n{mask}")
# Notice how pd has matched indices
print(f"This is the first Series masked by a Series:\n{first_ser.loc[mask]}")
mask = mask.to_numpy()
print(f"This is the mask as an Array: {mask}")
print(f"This is first Series masked by an Array:\n{first_ser.loc[mask]}")
# Replace the masked values with NaNs
first_ser.loc[mask] = np.nan
print(f"This is first Series after replacement of values:\n{first_ser}")

The first Series:
0     1
6     2
2     3
3     5
7     7
1    11
dtype: int64
This is the mask as a Series:
6    False
1     True
2    False
7     True
0     True
3    False
dtype: bool
This is the first Series masked by a Series:
0     1
7     7
1    11
dtype: int64
This is the mask as an Array: [False  True False  True  True False]
This is first Series masked by an Array:
6    2
3    5
7    7
dtype: int64
This is first Series after replacement of values:
0     1.0
6     NaN
2     3.0
3     NaN
7     NaN
1    11.0
dtype: float64
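
In practice, a mask is often built from a condition on the Series itself. In that case the mask inherits the `index` of the Series, so the label matching is automatically correct. A minimal sketch, with a hypothetical Series `ser` and an arbitrary threshold:

```python
import pandas as pd

# Hypothetical Series with shuffled integer labels
ser = pd.Series([1, 2, 3, 5, 7, 11], index=[0, 6, 2, 3, 7, 1])

# The comparison returns a bool Series with the same index as ser
is_large = ser > 4
print(is_large)

# Select the elements where the mask is True
large = ser.loc[is_large]
print(large)

# Replace the selected elements with a new value
ser.loc[is_large] = 0
print(ser)
```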


#### 3. Fancy Indexing

Finally, we can go with fancy indexing on `pd.Series` as well -- even if I do not think this is the official name. This works both with `iloc[]` and with `loc[]`.

**Beware**: there is **no** fancy indexing on `pd.DataFrame` in the same way as there is on `np.array`. If you are in a situation where you need to use fancy indexing on a `pd.DataFrame`, your best option is to convert the `pd.DataFrame` to a `np.array`, select from the `np.array`, then convert the results to a `pd.DataFrame` or `pd.Series` if needed. We will come back to this.

Also, this is a good time to notice that an `index` does not have to contain unique labels...but it should! We can use the method `.reset_index()` to replace the `index` with a fresh default range. The old labels are kept as a column, so the result is a `pd.DataFrame`, unless we pass `drop=True`.

In [73]:
# Randomly sample (with repetition) from the first Series using fancy indexing
print(f"The first Series:\n{first_ser}")

# We create a random order based on labels by sampling the indices of the series
random_order = rng.choice(first_ser.index, size=(10,))
print(f"This is a random selection based on label:\n{first_ser.loc[random_order]}\n")
# We create a random order based on positions by sampling the positions of the series
# Notice the use of np.arange and shape
random_order = rng.choice(np.arange(first_ser.shape[0]), size=(10,))
selected_first = first_ser.iloc[random_order]
print(f"This is a random selection based on position:\n{selected_first}")
# Reset the index
selected_first = selected_first.reset_index()
print(f"This is the same selection, but with indices resetted:\n{selected_first}")

The first Series:
0     1.0
6     NaN
2     3.0
3     NaN
7     NaN
1    11.0
dtype: float64
This is a random selection based on label:
3     NaN
7     NaN
7     NaN
0     1.0
1    11.0
1    11.0
7     NaN
2     3.0
6     NaN
3     NaN
dtype: float64

This is a random selection based on position:
7     NaN
6     NaN
7     NaN
6     NaN
6     NaN
0     1.0
6     NaN
6     NaN
7     NaN
1    11.0
dtype: float64
This is the same selection, but with indices resetted:
   index     0
0      7   NaN
1      6   NaN
2      7   NaN
3      6   NaN
4      6   NaN
5      0   1.0
6      6   NaN
7      6   NaN
8      7   NaN
9      1  11.0
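
To check whether an `index` contains repeated labels, and to get back a `pd.Series` (rather than a `pd.DataFrame`) with a clean `index`, something along these lines should work. The Series `dup` is hypothetical.

```python
import pandas as pd

# Hypothetical Series with a repeated label
dup = pd.Series([1.0, 2.0, 3.0], index=[7, 6, 7])

# is_unique tells us whether every label appears only once
print(dup.index.is_unique)

# Selecting a repeated label returns all of the matching elements
print(dup.loc[7])

# drop=True discards the old labels, so the result is still a Series
clean = dup.reset_index(drop=True)
print(clean)
```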


#### New Elements

The selector `loc[]` can be used to add one new element to a `pd.Series`. If we need to add multiple elements, the most efficient way (when possible) is to store the new elements in another `pd.Series` and then use the <a href=https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html>`pd.concat`</a> function, which can concatenate two or more `pd.Series` as well as `pd.DataFrame` (both along rows and columns!).

In [87]:
print(f"The first Series before adding a new element:\n{first_ser}")
# Add an element with label "new"
first_ser.loc['new'] = -99
print(f"The first Series after adding a new element:\n{first_ser}")
# Concatenation of the two Series. ignore_index=True automatically resets the index
conc_series = pd.concat([first_ser, second_ser], ignore_index=True)
print(f"This is the concatenated Series: {conc_series}")

The first Series before adding a new element:
0        1.0
6        NaN
2        3.0
3        NaN
7        NaN
1       11.0
new    -99.0
new_   -99.0
dtype: float64
The first Series after adding a new element:
0        1.0
6        NaN
2        3.0
3        NaN
7        NaN
1       11.0
new    -99.0
new_   -99.0
dtype: float64
This is the concatenated Series: 0      1.0
1      NaN
2      3.0
3      NaN
4      NaN
5     11.0
6    -99.0
7    -99.0
8      0.0
9      1.0
10     0.0
11     1.0
12     1.0
13     0.0
dtype: float64
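
The `axis` argument of `pd.concat` controls the direction of the concatenation. As a sketch with two small hypothetical Series (`a` and `b`): along rows the Series are stacked one after the other, while along columns they are placed side by side, with labels matched, and the result is a `pd.DataFrame`.

```python
import pandas as pd

# Hypothetical named Series with partially overlapping labels
a = pd.Series([1, 2, 3], index=["x", "y", "z"], name="a")
b = pd.Series([10, 20], index=["x", "y"], name="b")

# Along rows (the default): the original labels are kept, repeats included
stacked = pd.concat([a, b])
print(stacked)

# Along rows, with a fresh default index
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered)

# Along columns: labels are matched, and a missing label gives NaN
side_by_side = pd.concat([a, b], axis=1)
print(side_by_side)
```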


### Our Last Stop: The `pd.DataFrame`

You made it to the last section of the workshop.