Lecture: AI I - Basics 

Previous:
[**Chapter 3.1: Numpy**](../03_data/01_numpy.ipynb)

---

# Chapter 3.2: Pandas

- [The Pandas Series Object](#the-pandas-series-object)
- [The Pandas DataFrame Object](#the-pandas-dataframe-object)
- [The Pandas Index Object](#the-pandas-index-object)
- [Indexing, Selection, and Assignment of Data](#indexing-selection-and-assignment-of-data)
- [Reading Series and DataFrames](#reading-series-and-dataframes)
- [Ufuncs and Aggregation](#ufuncs-and-aggregation)
- [Group By](#group-by)
- [Aggregate, Filter, Transform, Apply](#aggregate-filter-transform-apply)


Pandas is a library for fast and efficient computation on large datasets.  
As in NumPy, many operations in Pandas are vectorized, making them efficient and fast.

Pandas is a newer package built on top of NumPy that provides an efficient implementation of a DataFrame.  
DataFrames are essentially multidimensional arrays with attached row and column labels, often containing heterogeneous types and/or missing data.  
In addition to providing a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks (→ relational algebra) and spreadsheet programs.

As we have seen, NumPy’s ndarray data structure provides essential features for the kind of clean, well-organized data typically encountered in numerical computing tasks.  
While it serves this purpose very well, its limitations become apparent when we need more flexibility (e.g., attaching labels to data, handling missing values, etc.) or when we attempt operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.).  
These are each key components of analyzing the less structured data that exists in many forms in the world around us.  

Pandas, and in particular its Series and DataFrame objects, build on the NumPy array structure and provide efficient access to these kinds of “data manipulation” tasks, which take up a large portion of a data scientist’s time.

Just as we usually import NumPy as `np`, we import Pandas under the alias `pd`.  
We also import NumPy, since we will often need it when working with Pandas:


In [1]:
import numpy as np
import pandas as pd

## The Pandas `Series` Object

A Pandas `Series` is a one-dimensional array with indexed data.  
It can be created from a list or an array as follows:


In [2]:
# Series with missing values 
data = pd.Series([0.25, 0.5, np.nan, 1.0])
data

0    0.25
1    0.50
2     NaN
3    1.00
dtype: float64

In [3]:
type(data)

pandas.core.series.Series

In [4]:
data.values, type(data.values)

(array([0.25, 0.5 ,  nan, 1.  ]), numpy.ndarray)

The index is an array-like object of type `pd.Index`:


In [5]:
data.index, type(data.index), list(data.index)

(RangeIndex(start=0, stop=4, step=1),
 pandas.core.indexes.range.RangeIndex,
 [0, 1, 2, 3])

As with a list or a NumPy array, the data can be accessed via the associated index using the familiar Python square-bracket notation:


In [6]:
data[1:3]

1    0.5
2    NaN
dtype: float64

In [7]:
type(data[1])

numpy.float64

In [8]:
print(dir(data))

['T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__bool__', '__class__', '__column_consortium_standard__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pandas_priority__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfl

### Series as a Generalized NumPy Array

From what we’ve seen so far, the Series object might appear to be essentially interchangeable with a one-dimensional NumPy array.  
The key difference, however, is the presence of the index: while a NumPy array has an implicitly defined integer index used to access its values, a Pandas Series has an explicitly defined index that is linked to its values.

This explicit index definition gives the Series object additional capabilities.  
For example, the index does not have to be an integer; it can consist of values of any type.  
If we want, we can use strings as the index:


In [9]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'd', 'c'])
data

a    0.25
b    0.50
d    0.75
c    1.00
dtype: float64

In [10]:
data.index = list("AbCD")
data

A    0.25
b    0.50
C    0.75
D    1.00
dtype: float64

In [11]:
data["b"] == data.iloc[1] == data.values[1]

np.True_

In [12]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[3, 7, 2, 4])
data

3    0.25
7    0.50
2    0.75
4    1.00
dtype: float64

In [13]:
data.index

Index([3, 7, 2, 4], dtype='int64')

When an explicit index is present, it takes precedence! (*As long as we are not slicing!*)


In [14]:
data[3]

np.float64(0.25)

In [15]:
data[1:3]

7    0.50
2    0.75
dtype: float64

In [16]:
type(data[3])

numpy.float64

Note that explicit indices do not have to be unique!


In [17]:
d2 = pd.concat([data, data])
d2

3    0.25
7    0.50
2    0.75
4    1.00
3    0.25
7    0.50
2    0.75
4    1.00
dtype: float64

In [18]:
d2[7]

7    0.5
7    0.5
dtype: float64

In [19]:
d2[7] = 2

In [20]:
d2

3    0.25
7    2.00
2    0.75
4    1.00
3    0.25
7    2.00
2    0.75
4    1.00
dtype: float64
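
Because duplicate labels change what label-based access returns (a Series instead of a scalar), it can help to check for them up front. The following is a minimal sketch on a small made-up Series `dup` (not the `d2` object above), using `Index.is_unique` and `Index.duplicated`:

```python
import pandas as pd

# Hypothetical example data: the label 7 appears twice.
dup = pd.Series([0.25, 0.5, 0.75, 0.5], index=[3, 7, 2, 7])

# is_unique reports whether every index label occurs exactly once.
has_unique_labels = dup.index.is_unique

# duplicated() marks every repeated label after its first occurrence.
repeated = dup.index.duplicated()

# Label-based access returns a Series when the label is duplicated
# and a scalar when it is unique.
both_sevens = dup.loc[7]
single_three = dup.loc[3]

print(has_unique_labels)
print(repeated)
print(type(both_sevens).__name__, type(single_three).__name__)
```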

### Series as a Specialized Dictionary

In this sense, you can think of a Pandas Series as a specialized version of a Python dictionary.  
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, while a Series is a structure that maps typed keys to a set of typed values.  

This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient for certain operations than a Python list, the type information of a Pandas Series makes it much more efficient than a Python dictionary for specific operations.
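
As a rough illustration of this point, the sketch below applies the same arithmetic to a plain dictionary (one Python-level operation per item) and to a Series (one vectorized operation). The names `prices_dict` and `prices` are made up for this example.

```python
import pandas as pd

# Hypothetical example data.
prices_dict = {'apple': 1.0, 'banana': 0.5, 'cherry': 4.0}
prices = pd.Series(prices_dict)

# Dictionary: every value is handled individually in a Python loop.
doubled_dict = {key: value * 2 for key, value in prices_dict.items()}

# Series: a single vectorized operation on the typed values;
# the index labels are carried along unchanged.
doubled = prices * 2

print(prices.dtype)  # one dtype shared by all values
print(doubled.to_dict() == doubled_dict)
```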

The Series-as-dictionary analogy becomes even clearer when a Series object is constructed directly from a Python dictionary:


In [21]:
population_dict = {
    'California': 38332521,
    'Texas': 26448193,
    'New York': 19651127,
    'Florida': 19552860,
    'Illinois': 12882135
}

population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [22]:
population['Texas']

np.int64(26448193)

In contrast to a dictionary, however, a Series also supports array-like operations such as slicing:


In [23]:
population['California':'Illinois']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

Note that Illinois is included!


### Creating Series Objects

Data can be a scalar:


In [24]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

Data can be a dictionary:


In [25]:
ser = pd.Series({2:'a', 1:'b', 3:'c'})
ser

2    a
1    b
3    c
dtype: object

In [26]:
ser.to_dict()

{2: 'a', 1: 'b', 3: 'c'}

In [27]:
list(ser.values)

['a', 'b', 'c']

## The Pandas DataFrame Object

The next fundamental structure in Pandas is the DataFrame.  
Like the Series object, it can be thought of either as a generalization of a NumPy array or as a specialization of a Python dictionary.  
We’ll now take a look at each of these perspectives.


### DataFrame as a Generalized NumPy Array

If a Series is analogous to a one-dimensional array with flexible indices, then a DataFrame is analogous to a two-dimensional array with flexible row indices and flexible column names.  
Just as you can think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects.  
By “aligned” we mean that they share the same index.

To demonstrate this, let’s first construct a new Series that lists the area of each of the five states discussed in the previous section:


In [28]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

Now that we have this along with the population Series from earlier, we can use a dictionary to construct a single two-dimensional object containing this information:


In [29]:
states = pd.DataFrame({'population': population,
                       'area': area,
                       'country': 'USA'})
states

Unnamed: 0,population,area,country
California,38332521,423967,USA
Texas,26448193,695662,USA
New York,19651127,141297,USA
Florida,19552860,170312,USA
Illinois,12882135,149995,USA


In [30]:
print(states.dtypes)

population     int64
area           int64
country       object
dtype: object


That looks like a generalized dictionary!  
The keys are the state names, and the values are like a list `[population, area, country]`.


In [31]:
states.sort_values(by="population", ascending=False)

Unnamed: 0,population,area,country
California,38332521,423967,USA
Texas,26448193,695662,USA
New York,19651127,141297,USA
Florida,19552860,170312,USA
Illinois,12882135,149995,USA


In [32]:
states['population'], type(states['population'])

(California    38332521
 Texas         26448193
 New York      19651127
 Florida       19552860
 Illinois      12882135
 Name: population, dtype: int64,
 pandas.core.series.Series)

Retrieve the "key" (index label) of the row in which `population` has its maximum value:


In [33]:
states["population"].idxmax()

'California'

Return the row at that index label as a Series:

In [34]:
states.loc[states["population"].idxmax()]

population    38332521
area            423967
country            USA
Name: California, dtype: object

In contrast, `max()` returns the maximum of each column separately, so the values do not necessarily come from the same row:


In [35]:
states.max()

population    38332521
area            695662
country            USA
dtype: object

In [36]:
try:
    states['California']
except KeyError as e:
    print("KeyError:", e)

KeyError: 'California'


In [37]:
states.loc['California']

population    38332521
area            423967
country            USA
Name: California, dtype: object

In [38]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [39]:
states.columns

Index(['population', 'area', 'country'], dtype='object')

In [40]:
states.values

array([[38332521, 423967, 'USA'],
       [26448193, 695662, 'USA'],
       [19651127, 141297, 'USA'],
       [19552860, 170312, 'USA'],
       [12882135, 149995, 'USA']], dtype=object)

In [41]:
type(states.values)

numpy.ndarray

You can think of a DataFrame as a generalization of a two-dimensional NumPy array, where both rows and columns have a generalized index for accessing the data.


### DataFrame as a Specialized Dictionary

Similarly, we can think of a DataFrame as a specialization of a dictionary.  
Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data.  
For example, if you request the column `"area"`, you get the `Series` object containing the areas we saw earlier.

Note that indexing a DataFrame with square brackets returns the *column*!


In [42]:
states["area"]

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [43]:
type(states["area"])

pandas.core.series.Series

### Creating DataFrame Objects

A Pandas DataFrame can be constructed in several ways:


#### From a Single Series Object

A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from just one Series:


In [44]:
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [45]:
pd.DataFrame(population, columns=['population'])

Unnamed: 0,population
California,38332521
Texas,26448193
New York,19651127
Florida,19552860
Illinois,12882135


#### From Multiple Series Objects


In [46]:
s1 = pd.Series(['100', '200', 'python', '300.12', '400'])
s2 = pd.Series(['10', '20', 'php', '30.12', '40'])
df = pd.concat([s1, s2], axis='columns')
df

Unnamed: 0,0,1
0,100,10
1,200,20
2,python,php
3,300.12,30.12
4,400,40


Note that many functions in Pandas take the `axis` argument—in this case, you can choose between 0/`index` and 1/`columns`.  
If you want to be explicit, I recommend using the string version!
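
A small sketch of how the two spellings of `axis` behave, here with `sum` and `pd.concat` on a made-up DataFrame `small` (the data and names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical example data.
small = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# axis='index' (same as axis=0) collapses the rows: one result per column.
per_column = small.sum(axis='index')

# axis='columns' (same as axis=1) collapses the columns: one result per row.
per_row = small.sum(axis='columns')

# For concat, the axis names the direction along which objects are joined.
stacked = pd.concat([small, small], axis='index')         # 6 rows, 2 columns
side_by_side = pd.concat([small, small], axis='columns')  # 3 rows, 4 columns

print(per_column.to_dict())
print(per_row.tolist())
print(stacked.shape, side_by_side.shape)
```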


#### From a List of Dictionaries

Any list of dictionaries can be turned into a DataFrame.  
Here we use two small dictionaries with partly different keys.  
If some keys are missing from a dictionary, Pandas fills the corresponding entries with NaN (i.e., “not a number”) values:


In [47]:
df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}], index=["first_dict", "second_dict"])
df

Unnamed: 0,a,b,c
first_dict,1.0,2,
second_dict,,3,4.0


Since each column must have a consistent data type and `np.nan` is a float, the integer columns that contain missing values are converted to floats:


In [48]:
df['a']

first_dict     1.0
second_dict    NaN
Name: a, dtype: float64

In [49]:
df['b']

first_dict     2
second_dict    3
Name: b, dtype: int64

In [50]:
type(np.nan)

float

In [51]:
df.dtypes

a    float64
b      int64
c    float64
dtype: object
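
If the integer values should stay integers despite missing entries, one option is the nullable extension dtype `'Int64'` (capital I), which represents missing values as `pd.NA`. A minimal sketch, rebuilding the same two-row DataFrame under the assumed name `df_nullable`:

```python
import pandas as pd

df_nullable = pd.DataFrame(
    [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}],
    index=['first_dict', 'second_dict'],
)

# Columns 'a' and 'c' were upcast to float64 because of the NaN entries.
# Casting to the nullable 'Int64' dtype restores integers and keeps the gaps.
df_nullable = df_nullable.astype({'a': 'Int64', 'c': 'Int64'})

print(df_nullable.dtypes)
print(df_nullable)
```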

When we retrieve a row, Pandas has to convert all of its values to a common dtype (here `float64`), because the resulting Series can only have one:


In [52]:
df

Unnamed: 0,a,b,c
first_dict,1.0,2,
second_dict,,3,4.0


In [53]:
df.loc['first_dict']

a    1.0
b    2.0
c    NaN
Name: first_dict, dtype: float64

#### From a Two-Dimensional NumPy Array

From a two-dimensional array of data, we can create a DataFrame with specified column and index names.  
If either is omitted, a default integer index is used for that axis:


In [54]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.852177,0.987077,-0.877352,0.004228
2013-01-02,-0.572042,1.202998,-0.314829,-1.395134
2013-01-03,-1.612946,2.251727,0.507596,-0.933851
2013-01-04,-0.639619,0.648964,-1.885012,0.883314
2013-01-05,0.567475,-1.159135,-1.139363,0.151764
2013-01-06,-0.680184,-0.858035,0.401916,1.408603


## The Pandas Index Object

We’ve seen that both the Series and the DataFrame objects contain an explicit index that allows you to reference and modify data.  
This Index object is an interesting structure in its own right, and can be thought of as an immutable array:


In [55]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Index([2, 3, 5, 7, 11], dtype='int64')

In [56]:
try:
    ind[0] = 1
except TypeError as e:
    print("TypeError:", e)

TypeError: Index does not support mutable operations


In [57]:
sr = pd.Series(0, index=ind)
sr

2     0
3     0
5     0
7     0
11    0
dtype: int64

Index objects have a name:


In [58]:
ind.names = ['indexx']
ind

Index([2, 3, 5, 7, 11], dtype='int64', name='indexx')

In [59]:
sr = pd.Series(np.zeros_like(ind), index=ind)
sr

indexx
2     0
3     0
5     0
7     0
11    0
dtype: int64

In [60]:
df = pd.DataFrame(np.zeros_like(ind), index=ind, columns=['first'])
df

indexx,first
2,0
3,0
5,0
7,0
11,0


In [61]:
df.index.names = [None]
df

Unnamed: 0,first
2,0
3,0
5,0
7,0
11,0


Index objects also have many of the attributes familiar from NumPy arrays:


In [62]:
ind.size, ind.shape, ind.ndim, ind.dtype

(5, (5,), 1, dtype('int64'))

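While thinking of indices as immutable lists is natural, indices also support set operations.  
In current Pandas versions these are provided by the methods `intersection`, `union`, and `symmetric_difference`; the operators `&`, `|`, and `^` act element-wise (bitwise) instead:


In [63]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [64]:
indA.intersection(indB)

Index([3, 5, 7], dtype='int64')

In [65]:
indA.union(indB)

Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [66]:
indA.symmetric_difference(indB)

Index([1, 2, 9, 11], dtype='int64')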

## Indexing, Selection, and Assignment of Data

From the NumPy notebook, we are already familiar with indexing, slicing, masking, and fancy indexing:


In [67]:
a = np.arange(16).reshape(4,4)
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

Select the values from the second and fourth columns that are divisible by 3:

In [68]:
a[:, [1, 3]][a[:, [1, 3]] % 3 == 0]

array([ 3,  9, 15])

Here we will look at similar ways of accessing and modifying values in Pandas Series and DataFrame objects.  
The corresponding patterns in Pandas are very similar to those in NumPy, although there are a few special considerations to keep in mind.

We’ll start with the simple case of the one-dimensional Series object, and then move on to the slightly more complex two-dimensional DataFrame object.


### Data Selection in Series

As we saw in the previous section, a Series object behaves in many ways like a one-dimensional NumPy array and in many ways like a standard Python dictionary.  
Keeping these two overlapping analogies in mind will help us understand the patterns of data indexing and selection in these arrays.


#### Series as a Dictionary

Like a dictionary, the Series object maps a collection of keys to a collection of values, which means that most of the corresponding functions work just as well for it:


In [69]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [70]:
data.__contains__('b')

True

In [71]:
'b' in data

True

In [72]:
np.array_equal(data.keys(), data.index)

True

In [73]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [74]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [75]:
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

#### Series as a One-Dimensional Array

Series builds on this dictionary-like interface and provides array-style element selection using the same basic mechanisms as NumPy arrays—namely slicing, masking, and fancy indexing.  
Examples include the following:

* Slicing by explicit index


In [76]:
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

* Slicing by implicit integer index

(Note that when slicing with an explicit index, e.g. `data['a':'c']`, the final index is included in the slice, whereas when slicing with an implicit index, e.g. `data[0:2]`, the final index is excluded from the slice.)


In [77]:
data[0:2]

a    0.25
b    0.50
dtype: float64

In [78]:
(data > 0.3) & (data < 0.8)

a    False
b     True
c     True
d    False
e    False
dtype: bool

* Masking

In [79]:
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

* Fancy indexing

In [80]:
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

In [81]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[1, 2, 3, 4])
data

1    0.25
2    0.50
3    0.75
4    1.00
dtype: float64

In [82]:
data[1:3]

2    0.50
3    0.75
dtype: float64

**If your Series has an explicit integer index, an indexing operation like `data[1]` uses the explicit index, whereas a slicing operation like `data[1:3]` uses the implicit Python-style index.**


In [83]:
data = pd.Series(['a', 'b', 'c'], index=[1, 5, 3])
data

1    a
5    b
3    c
dtype: object

Explicit index when indexing:


In [84]:
data[1]

'a'

Implicit index when slicing:


In [85]:
data[1:3]

5    b
3    c
dtype: object

The `loc` attribute enables indexing and slicing that *always* refer to the explicit index:


In [86]:
data.loc[1]

'a'

In [87]:
data.loc[1:3]

1    a
5    b
3    c
dtype: object

Note that slicing with `loc` may, but does not always, raise a `KeyError`. Here the index is not sorted, so a slice bound that is missing from the index cannot be located:


In [88]:
data = pd.Series(['a', 'b', 'c'], index=[1, 5, 3])
data

1    a
5    b
3    c
dtype: object

In [89]:
try:
    data.loc[3:10]
except KeyError as e:
    print("KeyError:", e)

KeyError: 10


In [90]:
try:
    data.loc['a':'z']
except KeyError as e:
    print("KeyError:", e)

KeyError: 'a'


The `iloc` attribute enables indexing and slicing that always refer to the implicit Python-style index:


In [91]:
data.iloc[1]

'b'

In [92]:
data.iloc[1:3]

5    b
3    c
dtype: object

Please save yourself the pain and always be explicit about what you are doing: use `.loc` and `.iloc`.

**Explicit is better than implicit.**

The statement does **not** mean that explicit indices are better than implicit ones,  
but rather that you should be explicit about which one you are using.


### Addendum: Indexing

From [the docs][1]: In earlier versions, using `.loc[list-of-labels]` worked as long as at least one of the keys was found (otherwise a `KeyError` was raised).  
This behavior is no longer supported: a `KeyError` is now raised if any of the labels is missing.

[1]: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike


In [93]:
s = pd.Series([1, 2, 3])
s

0    1
1    2
2    3
dtype: int64

In [94]:
try:
    s.loc[[1, 2, 3]]
except KeyError as e:
    print("KeyError:", e)

KeyError: '[3] not in index'


Instead, you should use [`reindex`][2], which aligns the Series/DataFrame to a new index with optional filling logic.

[2]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html


In [95]:
s.reindex([1, 2, 3])

1    2.0
2    3.0
3    NaN
dtype: float64
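
The "optional filling logic" can be sketched with the `fill_value` and `method` parameters of `reindex`. The Series below repeats the one from the cells above; the result names are chosen for this example.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Missing labels get a constant instead of NaN, so the dtype stays int64.
filled_constant = s.reindex([1, 2, 3], fill_value=0)

# Missing labels are filled from the last preceding label (forward fill).
# This requires a monotonically increasing or decreasing index.
filled_forward = s.reindex([1, 2, 3], method='ffill')

print(filled_constant)
print(filled_forward)
```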

### Data Selection in DataFrames

Remember that a DataFrame behaves in many ways like a two-dimensional or structured array,  
and in other ways like a dictionary of Series (columns) that share the same index.  
These analogies can be helpful when exploring data selection within this structure.


In [96]:
area = pd.Series({
    'California': 423967, 
    'Texas': 695662,
    'New York': 141297, 
    'Florida': 170312,
    'Illinois': 149995
})
pop = pd.Series({
    'California': 38332521, 
    'Texas': 26448193,
    'New York': 19651127, 
    'Florida': 19552860,
    'Illinois': 12882135
})
data = pd.DataFrame({'area':area, 'pop':pop, 'Country':'USA'})
data

Unnamed: 0,area,pop,Country
California,423967,38332521,USA
Texas,695662,26448193,USA
New York,141297,19651127,USA
Florida,170312,19552860,USA
Illinois,149995,12882135,USA


Note that when we index a DataFrame, we are indexing the **column**!  
Dictionary-style indexing returns a Series.



In [97]:
print(type(data["area"]))
data["area"]

<class 'pandas.core.series.Series'>


California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

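We can also use attribute access (dot notation) for columns whose names are valid Python identifiers. This is merely a shortcut and can give surprising results: if a column name collides with an existing DataFrame attribute or method (such as `pop`), the attribute wins.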


In [98]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
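
One concrete pitfall, sketched on a reduced copy of the DataFrame above (here called `frame`, a name chosen for this example): the column `pop` shares its name with the DataFrame method `pop`, so attribute access returns the method rather than the column.

```python
import pandas as pd

frame = pd.DataFrame(
    {'area': [423967, 695662], 'pop': [38332521, 26448193]},
    index=['California', 'Texas'],
)

# 'area' does not clash with any DataFrame attribute, so both forms agree.
same_column = frame.area.equals(frame['area'])

# 'pop' is also a DataFrame method; attribute access finds the method first.
attribute_is_method = callable(frame.pop)

# Square brackets always return the column.
pop_column = frame['pop']

print(same_column, attribute_is_method)
print(pop_column)
```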

In [99]:
type(data.values)

numpy.ndarray

With this picture in mind, many familiar array-like operations can be performed directly on the DataFrame itself.  
For example, we can transpose the entire DataFrame to swap rows and columns:


In [100]:
data.T

Unnamed: 0,California,Texas,New York,Florida,Illinois
area,423967,695662,141297,170312,149995
pop,38332521,26448193,19651127,19552860,12882135
Country,USA,USA,USA,USA,USA


For array-style indexing, Pandas again uses the previously mentioned `loc` and `iloc` indexers.  
With the `iloc` indexer, we can index the DataFrame as if it were a plain NumPy array (using the implicit Python-style index),  
**but the DataFrame index and column labels are preserved in the result**.  
For comparison, the first cell indexes the underlying NumPy array directly:


In [101]:
data.values[:3, :2]

array([[423967, 38332521],
       [695662, 26448193],
       [141297, 19651127]], dtype=object)

In [102]:
data.iloc[:3, :2]

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127


In [103]:
data

Unnamed: 0,area,pop,Country
California,423967,38332521,USA
Texas,695662,26448193,USA
New York,141297,19651127,USA
Florida,170312,19552860,USA
Illinois,149995,12882135,USA


In [104]:
data.loc[:'Illinois', :'pop']

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


In [105]:
data.loc[:,['area','pop']]

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


This way we get a Series:


In [106]:
data.loc["California", :]

area         423967
pop        38332521
Country         USA
Name: California, dtype: object

Adding a new column (using vectorized computations):


In [107]:
data['density'] = data['pop'] / data['area']
data

Unnamed: 0,area,pop,Country,density
California,423967,38332521,USA,90.413926
Texas,695662,26448193,USA,38.01874
New York,141297,19651127,USA,139.076746
Florida,170312,19552860,USA,114.806121
Illinois,149995,12882135,USA,85.883763


We can combine masking with fancy indexing:


In [108]:
data.loc[data.density > 100, ['pop', 'density']]

Unnamed: 0,pop,density
New York,19651127,139.076746
Florida,19552860,114.806121


If you want to combine explicit and implicit indexing, you need to chain them:


In [109]:
data.iloc[1:4].loc[:, ['pop', 'density']]

Unnamed: 0,pop,density
Texas,26448193,38.01874
New York,19651127,139.076746
Florida,19552860,114.806121
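
As an alternative to chaining, positions can be translated into labels first, so that a single `loc` call suffices. The sketch below uses a reduced, made-up DataFrame `frame` and assumes that the index labels are unique (with duplicate labels, `loc` would return every matching row).

```python
import pandas as pd

frame = pd.DataFrame(
    {'area': [10, 20, 30, 40], 'pop': [1, 2, 3, 4], 'density': [0.1, 0.1, 0.1, 0.1]},
    index=['A', 'B', 'C', 'D'],
)

# Chained version: rows by position, then columns by label.
chained = frame.iloc[1:3].loc[:, ['pop', 'density']]

# Single call: frame.index[1:3] turns the positions into row labels.
single = frame.loc[frame.index[1:3], ['pop', 'density']]

print(single)
print(chained.equals(single))
```

A single indexing call is also the safer choice when assigning values, since an assignment through a chained selection may act on a temporary copy.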


**While indexing refers to columns, slicing refers to rows:**


In [110]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [111]:
data['Florida':'Illinois']

Unnamed: 0,area,pop,Country,density
Florida,170312,19552860,USA,114.806121
Illinois,149995,12882135,USA,85.883763


Again, be explicit with indexing to save yourself a lot of confusion.


In [112]:
try:
    data['area':'pop']
except KeyError as e:
    print("KeyError:", e)

KeyError: 'area'


In [113]:
data.loc[:, 'area':'pop']

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


Quick access to a single element with `at`:


In [114]:
%%timeit
data.loc['Florida', 'pop']

3.78 μs ± 36.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [115]:
%%timeit
data.at['Florida', 'pop']

1.73 μs ± 9.98 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
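
`at` is the label-based scalar accessor; its position-based counterpart is `iat`. Both read and write exactly one cell. A short sketch on a made-up two-row DataFrame `frame`:

```python
import pandas as pd

frame = pd.DataFrame(
    {'area': [170312, 149995], 'pop': [19552860, 12882135]},
    index=['Florida', 'Illinois'],
)

# Label-based access to a single cell.
by_label = frame.at['Florida', 'pop']

# Position-based access to the same cell (row 0, column 1).
by_position = frame.iat[0, 1]

# Both accessors can also assign a single value.
frame.at['Illinois', 'area'] = 150000

print(by_label, by_position)
print(frame)
```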


In [116]:
data

Unnamed: 0,area,pop,Country,density
California,423967,38332521,USA,90.413926
Texas,695662,26448193,USA,38.01874
New York,141297,19651127,USA,139.076746
Florida,170312,19552860,USA,114.806121
Illinois,149995,12882135,USA,85.883763


### Reindexing

In [117]:
index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
df = pd.DataFrame(
    {
        'http_status': [200, 200, 404, 404, 301],
        'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]
    },
    index=index
)
df

Unnamed: 0,http_status,response_time
Firefox,200,0.04
Chrome,200,0.02
Safari,404,0.07
IE10,404,0.08
Konqueror,301,1.0


In [118]:
new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']
df.reindex(new_index)

Unnamed: 0,http_status,response_time
Safari,404.0,0.07
Iceweasel,,
Comodo Dragon,,
IE10,404.0,0.08
Chrome,200.0,0.02


In [119]:
df.reindex(columns=['http_status', 'user_agent'])

Unnamed: 0,http_status,user_agent
Firefox,200,
Chrome,200,
Safari,404,
IE10,404,
Konqueror,301,


**Renaming indices** 

In [120]:
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 5, 7, 8]})
df = df.rename(mapper={'b': 'c'}, axis='columns')
df

Unnamed: 0,a,c
0,1,2
1,2,5
2,3,7
3,4,8
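
`rename` also accepts the `index` and `columns` keywords directly, each taking either a mapping or a function. A minimal sketch with a made-up DataFrame `frame`:

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])

# A mapping renames only the listed labels; a function is applied to all.
renamed = frame.rename(index={'x': 'first'}, columns=str.upper)

print(renamed)
print(list(frame.columns))  # the original is unchanged
```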


### Boolean Indexing

In [121]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df['E'] = ["one", "two", "three"] * 2
df

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.835259,-0.352185,-1.162691,-1.604661,one
2013-01-02,1.605014,0.396155,-1.206858,-1.163563,two
2013-01-03,1.83733,-1.054564,0.598938,-0.640433,three
2013-01-04,0.256055,0.032875,1.809133,1.309378,one
2013-01-05,0.309312,-2.20599,-0.355302,0.642557,two
2013-01-06,0.605752,0.218389,1.200453,0.945817,three


In [122]:
df['A'] > 0.5

2013-01-01    False
2013-01-02     True
2013-01-03     True
2013-01-04    False
2013-01-05    False
2013-01-06     True
Freq: D, Name: A, dtype: bool

In [123]:
df[df['A'] > 0]

Unnamed: 0,A,B,C,D,E
2013-01-02,1.605014,0.396155,-1.206858,-1.163563,two
2013-01-03,1.83733,-1.054564,0.598938,-0.640433,three
2013-01-04,0.256055,0.032875,1.809133,1.309378,one
2013-01-05,0.309312,-2.20599,-0.355302,0.642557,two
2013-01-06,0.605752,0.218389,1.200453,0.945817,three


In [124]:
df.query('A > 0')

Unnamed: 0,A,B,C,D,E
2013-01-02,1.605014,0.396155,-1.206858,-1.163563,two
2013-01-03,1.83733,-1.054564,0.598938,-0.640433,three
2013-01-04,0.256055,0.032875,1.809133,1.309378,one
2013-01-05,0.309312,-2.20599,-0.355302,0.642557,two
2013-01-06,0.605752,0.218389,1.200453,0.945817,three


[Alternative syntax](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) using `query`.
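
Inside a `query` string, conditions can be combined with `and`/`or`, and local Python variables can be referenced with the `@` prefix. A sketch with fixed, made-up values instead of random numbers (the names `frame` and `threshold` are assumptions for this example):

```python
import pandas as pd

frame = pd.DataFrame({
    'A': [-0.8, 1.6, 1.8, 0.3],
    'B': [-0.4, 0.4, -1.1, 0.0],
})
threshold = 0.5

# Boolean-mask version.
masked = frame[(frame['A'] > threshold) & (frame['B'] < 0)]

# Equivalent query string; @threshold refers to the local variable.
queried = frame.query('A > @threshold and B < 0')

print(queried)
print(masked.equals(queried))
```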


In [125]:
df['E'].isin(['one','two'])

2013-01-01     True
2013-01-02     True
2013-01-03    False
2013-01-04     True
2013-01-05     True
2013-01-06    False
Freq: D, Name: E, dtype: bool

In [126]:
df[df['E'].isin(['one','two'])] = np.nan
df

Unnamed: 0,A,B,C,D,E
2013-01-01,,,,,
2013-01-02,,,,,
2013-01-03,1.83733,-1.054564,0.598938,-0.640433,three
2013-01-04,,,,,
2013-01-05,,,,,
2013-01-06,0.605752,0.218389,1.200453,0.945817,three


In [127]:
pd.isna(df)

Unnamed: 0,A,B,C,D,E
2013-01-01,True,True,True,True,True
2013-01-02,True,True,True,True,True
2013-01-03,False,False,False,False,False
2013-01-04,True,True,True,True,True
2013-01-05,True,True,True,True,True
2013-01-06,False,False,False,False,False


In [128]:
pd.isna(df).any(axis=1)

2013-01-01     True
2013-01-02     True
2013-01-03    False
2013-01-04     True
2013-01-05     True
2013-01-06    False
Freq: D, dtype: bool

In [129]:
df[~df.isna().any(axis=1)]

Unnamed: 0,A,B,C,D,E
2013-01-03,1.83733,-1.054564,0.598938,-0.640433,three
2013-01-06,0.605752,0.218389,1.200453,0.945817,three


In [130]:
df.dropna(how="any")

Unnamed: 0,A,B,C,D,E
2013-01-03,1.83733,-1.054564,0.598938,-0.640433,three
2013-01-06,0.605752,0.218389,1.200453,0.945817,three
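
Besides dropping every incomplete row, `dropna` can be restricted to certain columns with `subset`, and `fillna` replaces missing values instead of removing rows. A sketch on a small made-up DataFrame `frame`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'A': [1.0, np.nan, np.nan],
    'B': [np.nan, np.nan, 6.0],
})

# Drop a row only if every value in it is missing (row 1).
all_missing_dropped = frame.dropna(how='all')

# Drop rows with a missing value in column 'A' only (rows 1 and 2).
a_present = frame.dropna(subset=['A'])

# Replace missing values with a constant instead of dropping rows.
filled = frame.fillna(0.0)

print(list(all_missing_dropped.index), list(a_present.index))
print(filled)
```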


### Data Assignment


In [131]:
df = pd.DataFrame(
    {'temp_c': [17.0, 25.0]},
    index=['Portland', 'Berkeley']
)
df

Unnamed: 0,temp_c
Portland,17.0
Berkeley,25.0


### Assigning Columns


In [132]:
df['country'] = 'USA'
df

Unnamed: 0,temp_c,country
Portland,17.0,USA
Berkeley,25.0,USA


In [133]:
df['temp_c'] <= 18

Portland     True
Berkeley    False
Name: temp_c, dtype: bool

In [134]:
df['too_cold'] = df['temp_c'] <= 18
df

Unnamed: 0,temp_c,country,too_cold
Portland,17.0,USA,True
Berkeley,25.0,USA,False


These assignments, however, modify `df` in place.  
To obtain a new DataFrame with the additional columns and leave the original unchanged, use `assign`:


In [135]:
df = pd.DataFrame(
    {'temp_c': [17.0, 25.0]},
    index=['Portland', 'Berkeley']
)
df

Unnamed: 0,temp_c
Portland,17.0
Berkeley,25.0


In [136]:
df2 = df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
df2

Unnamed: 0,temp_c,temp_f
Portland,17.0,62.6
Berkeley,25.0,77.0


In [137]:
# same result without a lambda: pass the computed column directly
df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)

Unnamed: 0,temp_c,temp_f
Portland,17.0,62.6
Berkeley,25.0,77.0


In [138]:
df.assign(
    temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
    temp_k=lambda x: (x['temp_f'] +  459.67) * 5 / 9
)

Unnamed: 0,temp_c,temp_f,temp_k
Portland,17.0,62.6,290.15
Berkeley,25.0,77.0,298.15


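As the previous cell shows, multiple assignments in one call are possible, and a later one may refer to a column created earlier in the same call (here, `temp_k` uses `temp_f`). The original `df` is left unchanged: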


In [139]:
df

Unnamed: 0,temp_c
Portland,17.0
Berkeley,25.0


### Row Assignments


In [140]:
df.loc['Berkeley', 'temp_c'] = 26.0
df

Unnamed: 0,temp_c
Portland,17.0
Berkeley,26.0


In [141]:
type(df.loc['Portland'])

pandas.core.series.Series

In [142]:
df.loc['Portland'] = pd.Series({'temp_c': 99})
df

Unnamed: 0,temp_c
Portland,99.0
Berkeley,26.0


In [143]:
df.loc['Osnabruck', 'temp_c'] = 18
df

Unnamed: 0,temp_c
Portland,99.0
Berkeley,26.0
Osnabruck,18.0


In [144]:
df = pd.concat([df, df])
df

Unnamed: 0,temp_c
Portland,99.0
Berkeley,26.0
Osnabruck,18.0
Portland,99.0
Berkeley,26.0
Osnabruck,18.0


In [145]:
df.loc['Osnabruck', 'temp_c'] = 25
df

Unnamed: 0,temp_c
Portland,99.0
Berkeley,26.0
Osnabruck,25.0
Portland,99.0
Berkeley,26.0
Osnabruck,25.0


In [146]:
try:
    df.loc['Osnabruck'] = pd.Series({'temp_c': 99})
except Exception as e:
    print(f"Error: {e}")
df

Error: setting an array element with a sequence.


Unnamed: 0,temp_c
Portland,99.0
Berkeley,26.0
Osnabruck,25.0
Portland,99.0
Berkeley,26.0
Osnabruck,25.0


In [147]:
type(df.loc['Osnabruck'])

pandas.core.frame.DataFrame

In [148]:
np.where(df.index == 'Osnabruck')

(array([2, 5]),)

In [149]:
df.iloc[np.where(df.index == 'Osnabruck')[0][0]]

temp_c    25.0
Name: Osnabruck, dtype: float64

In [150]:
df.iloc[np.where(df.index == 'Osnabruck')[0][0]] = pd.Series({'temp_c': 99})
df

Unnamed: 0,temp_c
Portland,99.0
Berkeley,26.0
Osnabruck,99.0
Portland,99.0
Berkeley,26.0
Osnabruck,25.0
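
When index labels are not unique, `loc` addresses all matching rows at once, so changing a single one of them requires positional access. The sketch below is one possible way to do this for a single cell, assuming the duplicated table from above; `positions` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"temp_c": [99.0, 26.0, 25.0]},
    index=["Portland", "Berkeley", "Osnabruck"],
)
df = pd.concat([df, df])  # every label now appears twice

# Integer positions of all rows labelled 'Osnabruck'
positions = np.flatnonzero(df.index == "Osnabruck")

# Overwrite one cell by position: first matching row, column 'temp_c'
df.iloc[positions[0], df.columns.get_loc("temp_c")] = 99.0
print(df)
```

Checking `df.index.is_unique` before label-based row assignment is a cheap way to notice this situation early.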


### Multi-Indexing

Although Pandas provides objects that can natively handle three- and four-dimensional data, in practice it is far more common to use *hierarchical indexing* (also known as *multi-indexing*) to incorporate multiple index levels within a single index.  
This allows higher-dimensional data to be represented compactly within the familiar one-dimensional Series and two-dimensional DataFrame objects.


In [151]:
index = [
    ('California', 2000),
    ('California', 2010),
    ('New York', 2000),
    ('New York', 2010),
    ('Texas', 2000),
    ('Texas', 2010)
]
populations = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [152]:
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

In [153]:
index.names = ['state', 'year']

In [154]:
pop = pop.reindex(index)
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [155]:
pop['California', 2000], pop['California', 2010]

(np.int64(33871648), np.int64(37253956))

In [156]:
pop.iloc[0]

np.int64(33871648)

In [157]:
pop.iloc[1]

np.int64(37253956)
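
Besides addressing a single value with a full `(state, year)` key, a hierarchical index also allows selecting on one level only. The following sketch rebuilds the population Series with `MultiIndex.from_product` (an alternative constructor, used here only to keep the example short) and shows partial indexing and a cross-section.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["California", "New York", "Texas"], [2000, 2010]],
    names=["state", "year"],
)
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=index,
)

# Partial indexing on the outer level: a Series indexed by year
california = pop.loc["California"]

# Cross-section on the inner level: a Series indexed by state
year_2010 = pop.xs(2010, level="year")

print(california)
print(year_2010)
```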


In [158]:
pop.unstack()

year,2000,2010
state,Unnamed: 1_level_1,Unnamed: 2_level_1
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [159]:
index.names = [None, None]
pop = pop.reindex(index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [160]:
pop.unstack()

Unnamed: 0,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [161]:
pop.index.names = [None, None]
pop.unstack().T

Unnamed: 0,California,New York,Texas
2000,33871648,18976457,20851820
2010,37253956,19378102,25145561


In [162]:
popdf = pop.unstack(level=0)
popdf

Unnamed: 0,California,New York,Texas
2000,33871648,18976457,20851820
2010,37253956,19378102,25145561


In [163]:
popdf.stack()

2000  California    33871648
      New York      18976457
      Texas         20851820
2010  California    37253956
      New York      19378102
      Texas         25145561
dtype: int64
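
Note that `stack` puts the former column level innermost, so the result above is ordered by year first. A small sketch of the full round trip, assuming the index levels are named `state` and `year` (the names `wide`, `long`, and `restored` are illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["California", "New York", "Texas"], [2000, 2010]],
    names=["state", "year"],
)
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=index,
)

wide = pop.unstack(level="state")  # rows: year, columns: state
long = wide.stack()                # index levels: (year, state)

# Swap the levels back and sort to recover the original ordering
restored = long.swaplevel().sort_index()
print(restored)
```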

### Setting and Resetting the Index

Another way to rearrange hierarchical data is to convert the index labels into columns; this can be done with the `reset_index` method.

Calling this method on the population data results in a `DataFrame` with a *state* and *year* column containing the information that was previously in the index.

For clarity, we can pass `name` to label the column that will hold the Series values:


In [164]:
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [165]:
pop.index.names = ['state', 'year']
print(type(pop))
pop

<class 'pandas.core.series.Series'>


state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [166]:
pop_flat = pop.reset_index(name='population')
pop_flat

Unnamed: 0,state,year,population
0,California,2000,33871648
1,California,2010,37253956
2,New York,2000,18976457
3,New York,2010,19378102
4,Texas,2000,20851820
5,Texas,2010,25145561


When working with real-world data, the raw input often looks like this, and it is useful to create a `MultiIndex` from the column values.

This can be done with the `set_index` method of a `DataFrame`, which returns a multi-indexed `DataFrame`:


In [167]:
pop_df = pop_flat.set_index(['state', 'year'])
pop_df

Unnamed: 0_level_0,Unnamed: 1_level_0,population
state,year,Unnamed: 2_level_1
California,2000,33871648
California,2010,37253956
New York,2000,18976457
New York,2010,19378102
Texas,2000,20851820
Texas,2010,25145561


In [168]:
pop_df.rename_axis([None, None])

Unnamed: 0,Unnamed: 1,population
California,2000,33871648
California,2010,37253956
New York,2000,18976457
New York,2010,19378102
Texas,2000,20851820
Texas,2010,25145561


In [169]:
asdf = pop_df.rename_axis([None, None]).unstack()
asdf

Unnamed: 0_level_0,population,population
Unnamed: 0_level_1,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [170]:
asdf.columns

MultiIndex([('population', 2000),
            ('population', 2010)],
           )

In [171]:
asdf["area"] = 999
asdf

Unnamed: 0_level_0,population,population,area
Unnamed: 0_level_1,2000,2010,Unnamed: 3_level_1
California,33871648,37253956,999
New York,18976457,19378102,999
Texas,20851820,25145561,999


In [172]:
asdf.columns

MultiIndex([('population', 2000),
            ('population', 2010),
            (      'area',   '')],
           )

In [173]:
print(type(asdf["area"]))
asdf["area"]

<class 'pandas.core.series.Series'>


California    999
New York      999
Texas         999
Name: area, dtype: int64

In [174]:
print(type(asdf["population"]))
asdf["population"]

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [175]:
pop_flat

Unnamed: 0,state,year,population
0,California,2000,33871648
1,California,2010,37253956
2,New York,2000,18976457
3,New York,2010,19378102
4,Texas,2000,20851820
5,Texas,2010,25145561


In [176]:
pop_df2 = pop_flat.set_index('state').rename_axis(None)
pop_df2

Unnamed: 0,year,population
California,2000,33871648
California,2010,37253956
New York,2000,18976457
New York,2010,19378102
Texas,2000,20851820
Texas,2010,25145561


In [177]:
pop_df

Unnamed: 0_level_0,Unnamed: 1_level_0,population
state,year,Unnamed: 2_level_1
California,2000,33871648
California,2010,37253956
New York,2000,18976457
New York,2010,19378102
Texas,2000,20851820
Texas,2010,25145561


In [178]:
pop_df.reset_index()

Unnamed: 0,state,year,population
0,California,2000,33871648
1,California,2010,37253956
2,New York,2000,18976457
3,New York,2010,19378102
4,Texas,2000,20851820
5,Texas,2010,25145561


## Reading Series and DataFrames


In [179]:
df = pd.read_csv("data/pandas/Pokemon.csv")

Imagine someone hands you a random dataset. You know nothing about its contents.  
What are the first steps you would take?


In [180]:
df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


In [181]:
df.describe()

Unnamed: 0,#,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation
count,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0
mean,362.81375,435.1025,69.25875,79.00125,73.8425,72.82,71.9025,68.2775,3.32375
std,208.343798,119.96304,25.534669,32.457366,31.183501,32.722294,27.828916,29.060474,1.66129
min,1.0,180.0,1.0,5.0,5.0,10.0,20.0,5.0,1.0
25%,184.75,330.0,50.0,55.0,50.0,49.75,50.0,45.0,2.0
50%,364.5,450.0,65.0,75.0,70.0,65.0,70.0,65.0,3.0
75%,539.25,515.0,80.0,100.0,90.0,95.0,90.0,90.0,5.0
max,721.0,780.0,255.0,190.0,230.0,194.0,230.0,180.0,6.0


In [182]:
df["Type 1"].value_counts()

Type 1
Water       112
Normal       98
Grass        70
Bug          69
Psychic      57
Fire         52
Rock         44
Electric     44
Ground       32
Ghost        32
Dragon       32
Dark         31
Poison       28
Fighting     27
Steel        27
Ice          24
Fairy        17
Flying        4
Name: count, dtype: int64

In [183]:
df["Legendary"].value_counts()

Legendary
False    735
True      65
Name: count, dtype: int64

When working with CSVs, you should always pay attention to whether you want the first column to serve as the index column!


In [184]:
df = pd.read_csv("data/pandas/Pokemon.csv", index_col=0)
df.tail()

Unnamed: 0_level_0,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
#,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True
721,Volcanion,Fire,Water,600,80,110,120,130,90,70,6,True


In [185]:
df.reset_index().tail()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True
799,721,Volcanion,Fire,Water,600,80,110,120,130,90,70,6,True


In [186]:
df.reset_index().drop_duplicates(subset="#").tail()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
793,717,Yveltal,Dark,Flying,680,126,131,95,131,98,99,6,True
794,718,Zygarde50% Forme,Dragon,Ground,600,108,100,121,81,95,95,6,True
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
799,721,Volcanion,Fire,Water,600,80,110,120,130,90,70,6,True


In [187]:
df = df[df['Name'] != 'Volcanion']
df.tail()

Unnamed: 0_level_0,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
#,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
718,Zygarde50% Forme,Dragon,Ground,600,108,100,121,81,95,95,6,True
719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


In [188]:
no_duplicates = df.reset_index().drop_duplicates(subset="#").reset_index(drop=True)
no_duplicates.tail()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
715,716,Xerneas,Fairy,,680,126,131,95,131,98,99,6,True
716,717,Yveltal,Dark,Flying,680,126,131,95,131,98,99,6,True
717,718,Zygarde50% Forme,Dragon,Ground,600,108,100,121,81,95,95,6,True
718,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
719,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True


In [189]:
no_duplicates.set_index("#").to_csv('data/pandas/Pokemon_no_duplicates.csv')

In [190]:
gen_one = no_duplicates[no_duplicates["Generation"] == 1].set_index("#")
gen_one.tail()

Unnamed: 0_level_0,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
#,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
147,Dratini,Dragon,,300,41,64,45,50,50,50,1,False
148,Dragonair,Dragon,,420,61,84,65,70,70,70,1,False
149,Dragonite,Dragon,Flying,600,91,134,95,100,100,80,1,False
150,Mewtwo,Psychic,,680,106,110,90,154,90,130,1,True
151,Mew,Psychic,,600,100,100,100,100,100,100,1,False


In [191]:
first_gen_dict = gen_one["Name"].to_dict()

In [192]:
[f"{key} : {val}" for key, val in list(first_gen_dict.items())[:9]]

['1 : Bulbasaur',
 '2 : Ivysaur',
 '3 : Venusaur',
 '4 : Charmander',
 '5 : Charmeleon',
 '6 : Charizard',
 '7 : Squirtle',
 '8 : Wartortle',
 '9 : Blastoise']

**Documentation!**

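`read_csv` accepts a large number of arguments (separator, header handling, index column, dtypes, missing-value markers, and more), so it is worth checking the documentation before writing custom parsing code: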

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
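
A few of these arguments can be tried without any file on disk. The sketch below reads a small hand-written sample in the style of the Pokémon file from an in-memory string; the sample text is an assumption made for illustration, not an excerpt of the real file.

```python
import io

import pandas as pd

csv_text = """#,Name,Type 1,Type 2,Total
1,Bulbasaur,Grass,Poison,318
4,Charmander,Fire,,309
7,Squirtle,Water,,314
"""

sample = pd.read_csv(
    io.StringIO(csv_text),                       # any file-like object works
    index_col="#",                               # use the '#' column as index
    usecols=["#", "Name", "Type 1", "Type 2"],   # read only these columns
)
print(sample)
```

Empty fields such as the missing second type are read as `NaN` by default.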


## Ufuncs and Aggregation


### Aggregation in Pandas

Aggregations are functions in which one or more dimensions of data are reduced to a single value, such as the `max`, `sum`, or `mean` functions.

Statistical operations generally *exclude* missing data.
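
A minimal sketch of what excluding missing data means in practice, using a made-up three-element Series:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, np.nan, 3.0])

mean_default = ser.mean()             # NaN is skipped: (1 + 3) / 2
mean_strict = ser.mean(skipna=False)  # NaN propagates into the result
n_valid = ser.count()                 # counts only non-missing values

print(mean_default, mean_strict, n_valid)
```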


#### For Series


In [193]:
a = np.arange(7)
ser = pd.Series(a**2, index=a)
ser

0     0
1     1
2     4
3     9
4    16
5    25
6    36
dtype: int64

In [194]:
ser.sum()

np.int64(91)

In [195]:
ser.mean()

np.float64(13.0)

In [196]:
ser.median()

np.float64(9.0)

In [197]:
ser.min()

np.int64(0)

In [198]:
ser.max()

np.int64(36)

#### For DataFrames


In [199]:
df = pd.DataFrame({'A': a ** 2, 'B': a ** 3})
df

Unnamed: 0,A,B
0,0,0
1,1,1
2,4,8
3,9,27
4,16,64
5,25,125
6,36,216


In [200]:
df.mean()

A    13.0
B    63.0
dtype: float64

In [201]:
df.mean(axis=0)

A    13.0
B    63.0
dtype: float64

In [202]:
df.mean(axis='rows')

A    13.0
B    63.0
dtype: float64

In [203]:
df.mean(axis=1)

0      0.0
1      1.0
2      6.0
3     18.0
4     40.0
5     75.0
6    126.0
dtype: float64

The following table summarizes some additional built-in Pandas aggregations:

| Aggregation              | Description                                         |
|--------------------------|-----------------------------------------------------|
| `count()`                | Number of non-missing elements                      |
| `first()`, `last()`      | First and last element (available on grouped data)  |
| `mean()`, `median()`     | Mean and median                                     |
| `min()`, `max()`         | Minimum and maximum                                 |
| `std()`, `var()`         | Standard deviation and variance                     |
| `mad()`                  | Mean absolute deviation (removed in pandas 2.0)     |
| `prod()`                 | Product of all elements                             |
| `sum()`                  | Sum of all elements                                 |

Apart from the exceptions noted in the table, these are methods of both `DataFrame` and `Series` objects.


### Ufuncs

We already know ufuncs from NumPy: these are vectorized functions that apply to all elements of an array simultaneously.  

Pandas does the same, with a nice twist:  
For unary operations such as negation and trigonometric functions, these ufuncs *preserve* the index and column labels in the output.  
For binary operations such as addition and multiplication, Pandas automatically *aligns* the indices when the objects are passed to the ufunc.

This means that preserving data context and combining data from different sources, both of which can be error-prone tasks with raw NumPy arrays, become much harder to get wrong with Pandas.


In [204]:
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
0,5,0,3,3
1,7,9,3,5
2,2,4,7,6


In [205]:
np.exp(df)

Unnamed: 0,A,B,C,D
0,148.413159,1.0,20.085537,20.085537
1,1096.633158,8103.083928,20.085537,148.413159
2,7.389056,54.59815,1096.633158,403.428793


### Ufuncs: Index Alignment

For binary operations on two `Series` or `DataFrame` objects, Pandas aligns the indices during the operation.

This is very convenient when working with incomplete data.


In [206]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662, 'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127}, name='population')
area

Alaska        1723337
Texas          695662
California     423967
Name: area, dtype: int64

In [207]:
population

California    38332521
Texas         26448193
New York      19651127
Name: population, dtype: int64

In [208]:
area & population


Alaska        False
California     True
New York      False
Texas          True
dtype: bool

In [209]:
area / population

Alaska             NaN
California    0.011060
New York           NaN
Texas         0.026303
dtype: float64

In [210]:
"divide" in dir(pd.DataFrame)

True

In [211]:
popdens = area.divide(population, fill_value=0)
popdens

Alaska             inf
California    0.011060
New York      0.000000
Texas         0.026303
dtype: float64

In [212]:
popdens = popdens.replace([np.inf, -np.inf], np.nan)
popdens.dropna()

California    0.011060
New York      0.000000
Texas         0.026303
dtype: float64

In [213]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))
A

Unnamed: 0,A,B
0,12,1
1,6,7


In [214]:
B = pd.DataFrame(rng.randint(0, 20, (3, 3)), columns=list('ABC'))
B

Unnamed: 0,A,B,C
0,14,17,5
1,13,8,9
2,19,16,19


In [215]:
A+B

Unnamed: 0,A,B,C
0,26.0,18.0,
1,19.0,15.0,
2,,,


In [216]:
A.add(B, fill_value=0)

Unnamed: 0,A,B,C
0,26.0,18.0,5.0
1,19.0,15.0,9.0
2,19.0,16.0,19.0


#### More Index-Alignment

In [217]:
df = pd.DataFrame({'a': np.random.randint(3, size=10)}, index=np.arange(1, 20, 2))
df

Unnamed: 0,a
1,0
3,2
5,1
7,0
9,2
11,2
13,1
15,1
17,2
19,2


Let’s add a new column to this DataFrame!


In [218]:
tmp = pd.Series([0]*len(df.index))
tmp

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
dtype: int64

In [219]:
#df['new'] = tmp   #changes the original one
df.assign(new=tmp) #creates a copy

Unnamed: 0,a,new
1,0,0.0
3,2,0.0
5,1,0.0
7,0,0.0
9,2,0.0
11,2,
13,1,
15,1,
17,2,
19,2,


In [220]:
df.index.union(tmp.index)

Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 13, 15, 17, 19], dtype='int64')

In [221]:
old_aligned, new_aligned = df.align(tmp, axis=0)
old_aligned

Unnamed: 0,a
0,
1,0.0
2,
3,2.0
4,
5,1.0
6,
7,0.0
8,
9,2.0


In [222]:
new_aligned

0     0.0
1     0.0
2     0.0
3     0.0
4     0.0
5     0.0
6     0.0
7     0.0
8     0.0
9     0.0
11    NaN
13    NaN
15    NaN
17    NaN
19    NaN
dtype: float64

In [223]:
old_aligned.assign(new=new_aligned)

Unnamed: 0,a,new
0,,0.0
1,0.0,0.0
2,,0.0
3,2.0,0.0
4,,0.0
5,1.0,0.0
6,,0.0
7,0.0,0.0
8,,0.0
9,2.0,0.0


In [224]:
tmp = pd.Series([0]*len(df.index), index=df.index)
tmp

1     0
3     0
5     0
7     0
9     0
11    0
13    0
15    0
17    0
19    0
dtype: int64

In [225]:
df['new'] = tmp
df

Unnamed: 0,a,new
1,0,0
3,2,0
5,1,0
7,0,0
9,2,0
11,2,0
13,1,0
15,1,0
17,2,0
19,2,0


If you want to apply more than one operation (ufunc/aggregation), use `agg()`:


In [226]:
df = pd.DataFrame(
    [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9],
        [np.nan, np.nan, np.nan]
    ],
    columns=['A', 'B', 'C']
)
df

Unnamed: 0,A,B,C
0,1.0,2.0,3.0
1,4.0,5.0,6.0
2,7.0,8.0,9.0
3,,,


In [227]:
df.agg(['sum', 'min'])

Unnamed: 0,A,B,C
sum,12.0,15.0,18.0
min,1.0,2.0,3.0


You can also apply different aggregations to different columns:


In [228]:
df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})

Unnamed: 0,A,B
sum,12.0,
min,1.0,2.0
max,,8.0


This also works for ufuncs:


In [229]:
df.agg({'A' : 'exp', 'B' : [np.exp, 'sqrt']})


Unnamed: 0_level_0,A,B,B
Unnamed: 0_level_1,exp,exp,sqrt
0,2.718282,7.389056,1.414214
1,54.59815,148.413159,2.236068
2,1096.633158,2980.957987,2.828427
3,,,


#### apply()

While some operations (such as `cumsum` or `exp`) are available directly in Pandas or NumPy, the `apply` method lets you run an arbitrary function on a Series (element by element) or on a DataFrame (column by column by default).


In [230]:
a = np.arange(7)
df = pd.DataFrame({'A': a ** 2, 'B': a ** 3})
df

Unnamed: 0,A,B
0,0,0
1,1,1
2,4,8
3,9,27
4,16,64
5,25,125
6,36,216


In [231]:
df.cumsum()

Unnamed: 0,A,B
0,0,0
1,1,1
2,5,9
3,14,36
4,30,100
5,55,225
6,91,441


In [232]:
df.apply(np.cumsum)

Unnamed: 0,A,B
0,0,0
1,1,1
2,5,9
3,14,36
4,30,100
5,55,225
6,91,441


In [233]:
df["A_cumsum"] = df.cumsum()["A"]
df["B_cumsum"] = df.apply(np.cumsum)["B"]
df

Unnamed: 0,A,B,A_cumsum,B_cumsum
0,0,0,0,0
1,1,1,1,1
2,4,8,5,9
3,9,27,14,36
4,16,64,30,100
5,25,125,55,225
6,36,216,91,441


With lambda functions, we can combine `apply` with any function.  
Note that with `DataFrame.apply` and the default `axis=0`, the argument passed to the function is an entire column (a `Series`); with `axis=1`, it is an entire row.


In [234]:
df.apply(lambda x: print(x, end='\n\n'))

0     0
1     1
2     4
3     9
4    16
5    25
6    36
Name: A, dtype: int64

0      0
1      1
2      8
3     27
4     64
5    125
6    216
Name: B, dtype: int64

0     0
1     1
2     5
3    14
4    30
5    55
6    91
Name: A_cumsum, dtype: int64

0      0
1      1
2      9
3     36
4    100
5    225
6    441
Name: B_cumsum, dtype: int64



A           None
B           None
A_cumsum    None
B_cumsum    None
dtype: object
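
The direction in which `DataFrame.apply` walks over the data is controlled by its `axis` argument. A minimal sketch on a reduced version of the table above:

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 4], "B": [0, 1, 8]})

# Default axis=0: the function receives one column at a time
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives one row at a time
row_sum = df.apply(lambda row: row["A"] + row["B"], axis=1)

print(col_range)
print(row_sum)
```

For a simple case like the row sum, `df["A"] + df["B"]` gives the same result without a Python-level loop.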

In [235]:
df

Unnamed: 0,A,B,A_cumsum,B_cumsum
0,0,0,0,0
1,1,1,1,1
2,4,8,5,9
3,9,27,14,36
4,16,64,30,100
5,25,125,55,225
6,36,216,91,441


In [236]:
df['A'] + 1

0     1
1     2
2     5
3    10
4    17
5    26
6    37
Name: A, dtype: int64

In [237]:
df.apply(lambda x: x + 1)

Unnamed: 0,A,B,A_cumsum,B_cumsum
0,1,1,1,1
1,2,2,2,2
2,5,9,6,10
3,10,28,15,37
4,17,65,31,101
5,26,126,56,226
6,37,217,92,442


In [238]:
def my_more_complex_func(ser):
    res = []
    for elem in ser:
        print(elem if elem > 16 else -elem)
        res.append(elem if elem > 16 else -elem)
    return res

In [239]:
df.apply(my_more_complex_func)

0
-1
-4
-9
-16
25
36
0
-1
-8
27
64
125
216
0
-1
-5
-14
30
55
91
0
-1
-9
36
100
225
441


Unnamed: 0,A,B,A_cumsum,B_cumsum
0,0,0,0,0
1,-1,-1,-1,-1
2,-4,-8,-5,-9
3,-9,27,-14,36
4,-16,64,30,100
5,25,125,55,225
6,36,216,91,441


In [240]:
df

Unnamed: 0,A,B,A_cumsum,B_cumsum
0,0,0,0,0
1,1,1,1,1
2,4,8,5,9
3,9,27,14,36
4,16,64,30,100
5,25,125,55,225
6,36,216,91,441


In [241]:
df.apply(lambda x: x.max() - x.min())

A            36
B           216
A_cumsum     91
B_cumsum    441
dtype: int64

In [242]:
df["A"].apply(lambda x: print(x))

0
1
4
9
16
25
36


0    None
1    None
2    None
3    None
4    None
5    None
6    None
Name: A, dtype: object

In [243]:
df["A_normed"] = df["A"].apply(lambda x: x / df["A"].max())
df

Unnamed: 0,A,B,A_cumsum,B_cumsum,A_normed
0,0,0,0,0,0.0
1,1,1,1,1,0.027778
2,4,8,5,9,0.111111
3,9,27,14,36,0.25
4,16,64,30,100,0.444444
5,25,125,55,225,0.694444
6,36,216,91,441,1.0


We can even use dictionaries with the `apply` function!


In [244]:
z_moves = {
    "Normal": "Breakneck Blitz", 
    "Fighting": "All-Out Pummeling", 
    "Flying": "Supersonic Skystrike", 
    "Poison": "Acid Downpour", 
    "Ground": "Tectonic Rage", 
    "Rock": "Continental Crush", 
    "Bug": "Savage Spin-Out", 
    "Ghost": "Never-Ending Nightmare",
    "Steel": "Corkscrew Crash", 
    "Fire": "Inferno Overdrive", 
    "Water": "Hydro Vortex", 
    "Grass": "Bloom Doom", 
    "Electric": "Gigavolt Havoc", 
    "Psychic": "Shattered Psyche", 
    "Ice": "Subzero Slammer", 
    "Dragon": "Devastating Drake", 
    "Dark": "Black Hole Eclipse", 
    "Fairy": "Twinkle Tackle"
}
df = pd.read_csv("data/pandas/Pokemon.csv")
df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


In [245]:
df["Z-Move"] = df["Type 1"].apply(lambda x:z_moves[x])
df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Z-Move
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False,Bloom Doom
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False,Bloom Doom
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False,Bloom Doom
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False,Bloom Doom
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False,Inferno Overdrive
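
For a pure dictionary lookup, `Series.map` is a common alternative to `apply` with a lambda. The sketch below uses a shortened, illustrative version of the dictionary; unlike `z_moves[x]`, a key that is missing from the dictionary yields `NaN` instead of raising a `KeyError`.

```python
import pandas as pd

z_moves_small = {"Grass": "Bloom Doom", "Fire": "Inferno Overdrive"}
types = pd.Series(["Grass", "Fire", "Water"])

# Each value is looked up in the dictionary; unknown keys become NaN
moves = types.map(z_moves_small)
print(moves)
```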


With `apply`, we can also turn a Series of lists into a DataFrame by converting each list into a Series, which then becomes one row of the result:


In [246]:
s = pd.Series([ ['Red', 'Green', 'White'], ['Red', 'Black'], ['Yellow']]) 
print(type(s))
s

<class 'pandas.core.series.Series'>


0    [Red, Green, White]
1           [Red, Black]
2               [Yellow]
dtype: object

In [247]:
pd.Series([1, 2, 3])

0    1
1    2
2    3
dtype: int64

In [248]:
df = s.apply(pd.Series)
print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,0,1,2
0,Red,Green,White
1,Red,Black,
2,Yellow,,
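
The same table can also be built without `apply` by passing the nested lists to the `DataFrame` constructor. This is only a sketch of an alternative, not a statement about which variant the lecture prefers; shorter lists are padded with missing values.

```python
import pandas as pd

s = pd.Series([["Red", "Green", "White"], ["Red", "Black"], ["Yellow"]])

# One row per list; the original index of the Series is kept
direct = pd.DataFrame(s.tolist(), index=s.index)
print(direct)
```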


**Note on Performance:**

According to ([1]), `apply()` is about twice as fast as looping through a DataFrame with `iterrows()` and about 8 times as fast as looping through Python lists.

However, keep in mind that while `apply()` is much faster for iterating over the rows of your DataFrame/Series (thanks to internal optimizations such as the use of iterators in Cython), it is still fundamentally a loop: whatever you apply is executed once per row.  

Therefore, whenever possible, use vectorized operations instead, since they run as optimized compiled loops over the whole array. For example, in ([1]) replacing the row-wise Haversine distance computation with its vectorized counterpart led to a **50× speed improvement**!

[1]: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
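
To get a feeling for the difference on your own machine, the following sketch times an element-wise `apply` against the equivalent vectorized expression. The function `scale` and the array size are arbitrary choices for illustration, and the measured times will vary from system to system.

```python
import timeit

import numpy as np
import pandas as pd

ser = pd.Series(np.arange(100_000, dtype="float64"))


def scale(x):
    # Works on a single number as well as on a whole Series
    return x * 9 / 5 + 32


via_apply = ser.apply(scale)  # calls scale once per element
vectorized = scale(ser)       # one call on the whole Series

t_apply = timeit.timeit(lambda: ser.apply(scale), number=3)
t_vec = timeit.timeit(lambda: scale(ser), number=3)
print(f"apply: {t_apply:.4f}s, vectorized: {t_vec:.4f}s")
```

Both variants produce the same values; only the place where the loop runs differs.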


## Group-By

### Split-Apply-Combine

While simple operations are already predefined in Pandas, custom aggregations and operations can be carried out using **group-by**.  
The group-by operation can be described in the following steps:

* **Split:** Divide the data into groups based on some criteria (splitting and grouping depending on the value of a key)  
* **Apply:** Apply a function independently to each group (aggregation, transformation, filtering, …)  
* **Combine:** Merge the results into a data structure  

A typical example, where the *apply* step is a summation aggregation, is shown here:


In [249]:
tmp = np.array([list("ABCABC"), np.arange(1, 7)]).T
tmp

array([['A', '1'],
       ['B', '2'],
       ['C', '3'],
       ['A', '4'],
       ['B', '5'],
       ['C', '6']], dtype='<U21')

In [250]:
df = pd.DataFrame(tmp, columns=["key", "data"])
df["data"] = pd.to_numeric(df["data"])
df

Unnamed: 0,key,data
0,A,1
1,B,2
2,C,3
3,A,4
4,B,5
5,C,6


In [251]:
df.groupby("key")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7d3423e6f2c0>

Note that what is returned is not a set of “DataFrames,” but rather a `DataFrameGroupBy` object.  
This object is where the magic happens: you can think of it as a special view of the `DataFrame`, ready to dig into the groups, but no actual computation is performed until an aggregation is applied.  

This “lazy evaluation” approach means that common aggregates can be implemented very efficiently and almost transparently to the user.  

To produce a result, we can apply an aggregate to this `DataFrameGroupBy` object, which performs the necessary apply/combine steps to generate the desired outcome:


In [252]:
df.groupby("key").sum().reset_index()

Unnamed: 0,key,data
0,A,5
1,B,7
2,C,9


We can perform column indexing just as we would with a regular DataFrame:


In [253]:
df.groupby("key")["data"].sum()

key
A    5
B    7
C    9
Name: data, dtype: int64
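
To make the three steps visible, they can also be written out by hand. The sketch below is purely didactic (the names `groups`, `sums`, and `manual` are illustrative); `groupby` performs the same work far more efficiently.

```python
import pandas as pd

df = pd.DataFrame({"key": list("ABCABC"), "data": range(1, 7)})

# Split: one sub-DataFrame per key value
groups = {key: df[df["key"] == key] for key in df["key"].unique()}

# Apply: aggregate each group independently
sums = {key: group["data"].sum() for key, group in groups.items()}

# Combine: collect the per-group results in one Series
manual = pd.Series(sums, name="data").rename_axis("key")

grouped = df.groupby("key")["data"].sum()
print(manual)
print(grouped)
```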

### Iterating Over Groups

The `GroupBy` object supports direct iteration over groups, returning each group as a `Series` or `DataFrame`:


In [254]:
df

Unnamed: 0,key,data
0,A,1
1,B,2
2,C,3
3,A,4
4,B,5
5,C,6


In [255]:
for (key, _) in df.groupby("key"):
    print(key)
    
print()
for (_, group) in df.groupby("key"):
    print(group, "\n")

A
B
C

  key  data
0   A     1
3   A     4 

  key  data
1   B     2
4   B     5 

  key  data
2   C     3
5   C     6 



In [256]:
pkm = pd.read_csv('data/pandas/Pokemon.csv')
pkm.groupby('Generation')['Total'].mean()

Generation
1    426.813253
2    418.283019
3    436.225000
4    459.016529
5    434.987879
6    436.378049
Name: Total, dtype: float64

### Dispatch Methods

Any method that is not explicitly implemented by the `GroupBy` object is passed through and called on the groups, whether they are `DataFrame` or `Series` objects.  
For example, you can use the `describe()` method of `DataFrame` to perform a set of aggregations that describe each group in the data:

In [257]:
df.describe()

Unnamed: 0,data
count,6.0
mean,3.5
std,1.870829
min,1.0
25%,2.25
50%,3.5
75%,4.75
max,6.0


In [258]:
df.groupby("key").describe()

Unnamed: 0_level_0,data,data,data,data,data,data,data,data
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
key,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
A,2.0,2.5,2.12132,1.0,1.75,2.5,3.25,4.0
B,2.0,3.5,2.12132,2.0,2.75,3.5,4.25,5.0
C,2.0,4.5,2.12132,3.0,3.75,4.5,5.25,6.0


In [259]:
df = pd.read_csv("data/pandas/Pokemon_no_duplicates.csv", index_col=0)
df.groupby('Generation')["Name"].nunique()

Generation
1    151
2    100
3    135
4    107
5    156
6     71
Name: Name, dtype: int64

## Aggregate, Filter, Transform, Apply

So far, we have focused on aggregation for the combine operation, but there are more possibilities.  
In particular, `GroupBy` objects have the methods `aggregate()`, `filter()`, `transform()`, and `apply()`, which efficiently implement a wide variety of useful operations before the grouped data is combined.

For the purposes of the following subsections, we will use this `DataFrame`:


In [260]:
def create_df():
    rng = np.random.RandomState(0)
    df = pd.DataFrame(
        {
            'key': ['A', 'B', 'C', 'A', 'B', 'C'],
            'data1': range(6),
            'data2': rng.randint(0, 10, 6)
        },
        columns = ['key', 'data1', 'data2']
    )
    return df
    
df = create_df()
df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


### Aggregation

While we have already used some *aggregation functions*, the `aggregate` function is the explicit version of this.  
It can take a string, a function, or a list of them, and compute all the aggregations at once.


In [261]:
df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [262]:
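# String names avoid the FutureWarning that recent pandas versions emit for np.median and the builtin max
df.groupby('key').agg(['min', 'median', 'max'])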


Unnamed: 0_level_0,data1,data1,data1,data2,data2,data2
Unnamed: 0_level_1,min,median,max,min,median,max
key,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
A,0,1.5,3,3,4.0,5
B,1,2.5,4,0,3.5,7
C,2,3.5,5,3,6.0,9


Another useful approach is to pass a dictionary that maps (existing) column names to the operations that should be applied to that column:


In [263]:
df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0,5
B,1,7
C,2,9


In [264]:
# 'sum' as a string replaces np.sum; the lambda computes the sample standard deviation (ddof=1)
df.groupby('key').aggregate({'data1': 'sum', 'data2': lambda x: np.std(x, ddof=1)})


Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,3,1.414214
B,5,4.949747
C,7,4.242641


### Named Aggregation

To support column-specific aggregation with control over the *output column names*, Pandas accepts so-called named aggregation in `GroupBy.agg()`, where:

* The keywords are the names of the output columns  
* The values are tuples, where the first element is the column to select and the second element is the aggregation to apply to that column  
    * Alternatively, you can use the named tuple `pandas.NamedAgg` with the fields `column` and `aggfunc` to make the arguments more explicit.


In [265]:
animals = pd.DataFrame({
    'kind': ['cat', 'dog', 'cat', 'dog'],
    'height': [9.1, 6.0, 9.5, 34.0],
    'weight': [7.9, 7.5, 9.9, 198.0]
})
animals

Unnamed: 0,kind,height,weight
0,cat,9.1,7.9
1,dog,6.0,7.5
2,cat,9.5,9.9
3,dog,34.0,198.0


In [266]:
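# Compare the full major version number, not just the first character of the version string
assert int(pd.__version__.split('.')[0]) >= 1, 'Your version of pandas is too old for this!'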

In [267]:
animals.groupby("kind").agg(
    min_height = pd.NamedAgg(column='height', aggfunc='min'),
    max_height = pd.NamedAgg(column='height', aggfunc='max'),
    average_weight = pd.NamedAgg(column='weight', aggfunc='mean'),
)


Unnamed: 0_level_0,min_height,max_height,average_weight
kind,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
cat,9.1,9.5,8.9
dog,6.0,34.0,102.75
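
The plain tuple form described in the list above can be sketched as follows. This is a minimal, self-contained example that rebuilds the `animals` frame; the variable name `summary` is only chosen for illustration.

```python
import pandas as pd

animals = pd.DataFrame({
    'kind': ['cat', 'dog', 'cat', 'dog'],
    'height': [9.1, 6.0, 9.5, 34.0],
    'weight': [7.9, 7.5, 9.9, 198.0],
})

# Each keyword names an output column;
# each value is a tuple (input column, aggregation).
summary = animals.groupby('kind').agg(
    min_height=('height', 'min'),
    max_height=('height', 'max'),
    average_weight=('weight', 'mean'),
)
print(summary)
```

Both spellings should produce the same table; `pd.NamedAgg` is simply the more self-documenting variant.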


### Filtering

Filtering allows you to exclude data based on group properties.  
For example, we might want to keep only those groups where the standard deviation is greater than a critical value:


In [268]:
def filter_func(x):
    return x['data2'].std() > 4

df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [269]:
df.groupby('key').std()

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,2.12132,1.414214
B,2.12132,4.949747
C,2.12132,4.242641


Note that filtering is not an aggregation: the result keeps the columns of the original `DataFrame`, but all rows belonging to groups that fail the test are dropped (here group A, whose standard deviation in `data2` is below 4).


In [270]:
df.groupby('key').filter(filter_func)

Unnamed: 0,key,data1,data2
1,B,1,0
2,C,2,3
4,B,4,7
5,C,5,9
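
Where `filter()` drops whole groups, `transform()` keeps every row and returns a result with the same index as the input. The following is a minimal sketch, assuming the same data as `create_df()`; centering each column on its group mean is just one possible choice of transformation, and the name `centered` is illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': range(6),
    'data2': rng.randint(0, 10, 6),
})

# Subtract the group-wise mean from every value.
# The lambda is called once per column and group; the output
# has one row per input row, in the original order.
centered = df.groupby('key')[['data1', 'data2']].transform(lambda x: x - x.mean())
print(centered)
```

Because the index is preserved, the result can be assigned straight back to columns of the original `DataFrame`.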


### The `apply()` Method

With the `apply()` method, you can apply an arbitrary function to each group.  
The function should take a `DataFrame` as input and return either a Pandas object (e.g., `DataFrame`, `Series`) or a scalar; the combine operation will then be tailored to the type of the returned output.

First, let’s recall our earlier use of `apply()`:


In [271]:
create_df()

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [272]:
df = create_df()
df["data1"] = df["data1"].apply(lambda x: x / df["data1"].max())
df

Unnamed: 0,key,data1,data2
0,A,0.0,5
1,B,0.2,0
2,C,0.4,3
3,A,0.6,3
4,B,0.8,7
5,C,1.0,9


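Note that `groupby` itself neither copies nor changes the data; it returns a lazy `GroupBy` object that refers to the original `DataFrame`.  
To see what the grouped `apply()` saves us, here is first a manual loop that normalizes the (grouped) first column by the sum of the (grouped) second column: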


In [None]:
import warnings
warnings.filterwarnings('ignore')

try:
    del newdf  # noqa: F821
except:  # noqa: E722
    pass

In [None]:
df = create_df()
sums = df.groupby('key')["data2"].sum()
print(sums, '\n\n\n')
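# Collect the normalized groups in a list and concatenate once at the end:
# DataFrame.append was removed in pandas 2.0, and growing a frame step by step is slow.
groups = []
for key, group in df.groupby('key'):
    # assign() returns a new frame, so the original df stays untouched
    group = group.assign(data1=group["data1"] / sums[key])
    groups.append(group)
    print(group, '\n')

newdf = pd.concat(groups)
newdf

key
A     8
B     7
C    12
Name: data2, dtype: int64 



  key  data1  data2
0   A  0.000      5
3   A  0.375      3 

  key     data1  data2
1   B  0.142857      0
4   B  0.571429      7 

  key     data1  data2
2   C  0.166667      3
5   C  0.416667      9 



Unnamed: 0,key,data1,data2
0,A,0.0,5
3,A,0.375,3
1,B,0.142857,0
4,B,0.571429,7
2,C,0.166667,3
5,C,0.416667,9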


In [275]:
def norm_by_data2(x):
    # x is a DataFrame holding the rows of one group
    x = x.copy()  # work on a copy instead of modifying the group in place
    x['data1'] = x['data1'] / x['data2'].sum()
    return x

df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [276]:
df.groupby('key').apply(norm_by_data2)

Unnamed: 0_level_0,Unnamed: 1_level_0,key,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,0,A,0.0,5
A,3,A,0.375,3
B,1,B,0.142857,0
B,4,B,0.571429,7
C,2,C,0.166667,3
C,5,C,0.416667,9
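
How the combine step adapts to the return type can be sketched with two small functions. This is a minimal example using the same values as `create_df()`; the names `spread` and `totals` are illustrative, and the value columns are selected explicitly so that the grouping column is not passed to the function.

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': [0, 1, 2, 3, 4, 5],
    'data2': [5, 0, 3, 3, 7, 9],
})
grouped = df.groupby('key')[['data1', 'data2']]

# The function returns a scalar per group:
# the results are combined into a Series indexed by the group key.
spread = grouped.apply(lambda g: g['data2'].max() - g['data2'].min())

# The function returns a Series per group:
# the results are combined into a DataFrame with one row per group.
totals = grouped.apply(lambda g: g.sum())

print(spread)
print(totals)
```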


### Specifying the Split Key

In the simple examples shown earlier, we split the `DataFrame` by a single column name.  
This is just one of many options for defining groups, and here we will go through some other possibilities for specifying the grouping key.


### A List, an Array, a Series, or an Index Providing the Grouping Keys

The key can be any Series or list whose length matches that of the `DataFrame`.  
For example:


In [277]:
df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [278]:
L = [0, 1, 0, 1, 2, 0]
df.groupby(L).sum()

Unnamed: 0,key,data1,data2
0,ACC,7,17
1,BA,4,3
2,B,4,7
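
The same works for a NumPy array of labels. The following is a minimal sketch, assuming the values from `create_df()`; grouping by the parity of `data1` is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': [0, 1, 2, 3, 4, 5],
    'data2': [5, 0, 3, 3, 7, 9],
})

# One label per row, computed from the data itself
parity = np.where(df['data1'] % 2 == 0, 'even', 'odd')

# Sum only the numeric columns within each label
by_parity = df.groupby(parity)[['data1', 'data2']].sum()
print(by_parity)
```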


### A Dictionary or Series Mapping the Index to the Group

Another method is to provide a dictionary that maps index values to group keys:


In [279]:
df2 = df.set_index('key')
df2

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0,5
B,1,0
C,2,3
A,3,3
B,4,7
C,5,9


In [280]:
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
df2.groupby(mapping).sum()

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
consonant,12,19
vowel,3,8
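
A `Series` can play the same role as the dictionary: its index holds the index labels of the `DataFrame` and its values are the group names. This is a minimal sketch that rebuilds `df2` by hand; the name `mapping_series` is illustrative.

```python
import pandas as pd

df2 = pd.DataFrame(
    {'data1': [0, 1, 2, 3, 4, 5], 'data2': [5, 0, 3, 3, 7, 9]},
    index=pd.Index(['A', 'B', 'C', 'A', 'B', 'C'], name='key'),
)

# Index labels -> group names; pandas aligns the Series with df2's index
mapping_series = pd.Series({'A': 'vowel', 'B': 'consonant', 'C': 'consonant'})
result = df2.groupby(mapping_series).sum()
print(result)
```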


Several valid keys can be combined in a list; grouping by more than one key creates a hierarchical index (`MultiIndex`).


In [281]:
df2.groupby([mapping, "key"]).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key,key,Unnamed: 2_level_1,Unnamed: 3_level_1
consonant,B,5,7
consonant,C,7,12
vowel,A,3,8


---

Lecture: AI I - Basics 

Exercise: [**Exercise 3.2: Pandas**](../03_data/exercises/02_pandas.ipynb)

Next: [**Chapter 3.3: Visualisation with Matplotlib**](../03_data/03_matplotlib.ipynb)