Lecture: AI I - Basics 

Previous:
[**Chapter 3.1: Numpy**](../03_data/01_numpy.ipynb)

---

# Chapter 3.2: Pandas

- [The Pandas Series Object](#the-pandas-series-object)
- [The Pandas DataFrame Object](#the-pandas-dataframe-object)
- [The Pandas Index Object](#the-pandas-index-object)
- [Indexing, Selection, and Assignment of Data](#indexing-selection-and-assignment-of-data)
- [Reading Series and DataFrames](#reading-series-and-dataframes)
- [Ufuncs and Aggregation](#ufuncs-and-aggregation)
- [Group By](#group-by)
- [Aggregate, Filter, Transform, Apply](#aggregate-filter-transform-apply)


Pandas is a library for fast and efficient computation on large datasets.  
As in NumPy, many operations in Pandas are vectorized, making them efficient and fast.

Pandas is a newer package built on top of NumPy that provides an efficient implementation of a DataFrame.  
DataFrames are essentially multidimensional arrays with attached row and column labels, often containing heterogeneous types and/or missing data.  
In addition to providing a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks (→ relational algebra) and spreadsheet programs.

As we have seen, NumPy’s ndarray data structure provides essential features for the kind of clean, well-organized data typically encountered in numerical computing tasks.  
While it serves this purpose very well, its limitations become apparent when we need more flexibility (e.g., attaching labels to data, handling missing values, etc.) or when we attempt operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.).  
These are each key components of analyzing the less structured data that exists in many forms in the world around us.  

Pandas, and in particular its Series and DataFrame objects, build on the NumPy array structure and provide efficient access to these kinds of “data manipulation” tasks, which take up a large portion of a data scientist’s time.
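
As a rough illustration of what labels and missing-value handling add over a plain array, the following sketch adds two small Series whose indices only partly overlap. The names `q1`, `q2`, and `total` and all values are made up for this example.

```python
import numpy as np
import pandas as pd

# Two Series whose labels only partly overlap (illustrative values).
q1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
q2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Arithmetic aligns on labels, not on positions; labels that occur in
# only one operand produce NaN in the result.
total = q1 + q2
print(total)

# Aggregations skip NaN by default.
print(total.sum())
```

With a plain NumPy array, the same addition would pair the values by position and silently combine unrelated entries.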

Just as we usually import NumPy as `np`, we import Pandas under the alias `pd`.  
We also import NumPy, since we will often need it when working with Pandas:


In [1]:
import numpy as np
import pandas as pd

## The Pandas `Series` Object

A Pandas `Series` is a one-dimensional array with indexed data.  
It can be created from a list or an array as follows:


In [2]:
# Series with missing values 
data = pd.Series([0.25, 0.5, np.nan, 1.0])
data

0    0.25
1    0.50
2     NaN
3    1.00
dtype: float64

In [3]:
type(data)

pandas.core.series.Series

In [4]:
data.values, type(data.values)

(array([0.25, 0.5 ,  nan, 1.  ]), numpy.ndarray)

The index is an array-like object of type `pd.Index`:


In [5]:
data.index, type(data.index), list(data.index)

(RangeIndex(start=0, stop=4, step=1),
 pandas.core.indexes.range.RangeIndex,
 [0, 1, 2, 3])

As with a list or a NumPy array, the data can be accessed via the associated index using the familiar Python square-bracket notation:


In [6]:
data[1:3]

1    0.5
2    NaN
dtype: float64

In [7]:
type(data[1])

numpy.float64

In [8]:
print(dir(data))

['T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__bool__', '__class__', '__column_consortium_standard__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pandas_priority__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfl

### Series as a Generalized NumPy Array

From what we’ve seen so far, the Series object might appear to be essentially interchangeable with a one-dimensional NumPy array.  
The key difference, however, is the presence of the index: while a NumPy array has an implicitly defined integer index used to access its values, a Pandas Series has an explicitly defined index that is linked to its values.

This explicit index definition gives the Series object additional capabilities.  
For example, the index does not have to be an integer; it can consist of values of any type.  
If we want, we can use strings as the index:


In [9]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'd', 'c'])
data

a    0.25
b    0.50
d    0.75
c    1.00
dtype: float64

In [10]:
data.index = list("AbCD")
data

A    0.25
b    0.50
C    0.75
D    1.00
dtype: float64

In [11]:
# Positional access goes through .iloc; data[1] on a string index is deprecated
data["b"] == data.iloc[1] == data.values[1]

np.True_

In [12]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[3, 7, 2, 4])
data

3    0.25
7    0.50
2    0.75
4    1.00
dtype: float64

In [13]:
data.index

Index([3, 7, 2, 4], dtype='int64')

When an explicit integer index is present, `data[key]` looks up the label rather than the position! (*Slicing is the exception: it stays positional.*)


In [14]:
data[3]

np.float64(0.25)

In [15]:
data[1:3]

7    0.50
2    0.75
dtype: float64

In [16]:
type(data[3])

numpy.float64

Note that explicit indices do not have to be unique!


In [17]:
d2 = pd.concat([data, data])
d2

3    0.25
7    0.50
2    0.75
4    1.00
3    0.25
7    0.50
2    0.75
4    1.00
dtype: float64

In [18]:
d2[7]

7    0.5
7    0.5
dtype: float64

In [19]:
d2[7] = 2

In [20]:
d2

3    0.25
7    2.00
2    0.75
4    1.00
3    0.25
7    2.00
2    0.75
4    1.00
dtype: float64
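
Because a lookup on a duplicated label returns a Series rather than a scalar, it can be worth checking for duplicates before relying on scalar access. A minimal sketch, using made-up Series `base` and `dup`:

```python
import pandas as pd

base = pd.Series([0.25, 0.5], index=[3, 7])
dup = pd.concat([base, base])   # every label now appears twice

print(base.index.is_unique)     # labels of the original Series
print(dup.index.is_unique)      # labels after concatenation

# Boolean mask marking the second and later occurrences of each label.
print(dup.index.duplicated())

# Keep only the first occurrence of each label.
deduped = dup[~dup.index.duplicated(keep="first")]
print(deduped)
```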

### Series as a Specialized Dictionary

Because the index maps labels to values, you can think of a Pandas Series as a specialized version of a Python dictionary.  
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, while a Series is a structure that maps typed keys to a set of typed values.  

This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient for certain operations than a Python list, the type information of a Pandas Series makes it much more efficient than a Python dictionary for specific operations.
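
The difference can be sketched by doubling a few values once with a dictionary and once with a Series. The `prices` data is invented for illustration, and the example only shows the style of computation, not a benchmark.

```python
import pandas as pd

prices = {"apple": 1.20, "pear": 0.80, "plum": 2.50}

# Plain dictionary: an explicit Python-level loop over the items.
doubled_dict = {name: 2 * value for name, value in prices.items()}

# Series: one vectorized operation on the typed value array;
# the labels are carried along unchanged.
price_series = pd.Series(prices)
doubled_series = 2 * price_series

print(doubled_series)
print(price_series.dtype)   # a single dtype shared by all values
```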

The Series-as-dictionary analogy becomes even clearer when a Series object is constructed directly from a Python dictionary:


In [21]:
population_dict = {
    'California': 38332521,
    'Texas': 26448193,
    'New York': 19651127,
    'Florida': 19552860,
    'Illinois': 12882135
}

population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [22]:
population['Texas']

np.int64(26448193)

In contrast to a dictionary, however, a Series also supports array-like operations such as slicing:


In [23]:
population['California':'Illinois']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

Note that Illinois is included: unlike positional slicing, a label-based slice contains its endpoint!


### Creating Series Objects

Data can be a scalar:


In [24]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

Data can be a dictionary:


In [25]:
ser = pd.Series({2:'a', 1:'b', 3:'c'})
ser

2    a
1    b
3    c
dtype: object

In [26]:
ser.to_dict()

{2: 'a', 1: 'b', 3: 'c'}

In [27]:
list(ser.values)

['a', 'b', 'c']

## The Pandas DataFrame Object

The next fundamental structure in Pandas is the DataFrame.  
Like the Series object, it can be thought of either as a generalization of a NumPy array or as a specialization of a Python dictionary.  
We’ll now take a look at each of these perspectives.


### DataFrame as a Generalized NumPy Array

If a Series is analogous to a one-dimensional array with flexible indices, then a DataFrame is analogous to a two-dimensional array with flexible row indices and flexible column names.  
Just as you can think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects.  
By “aligned” we mean that they share the same index.

To demonstrate this, let’s first construct a new Series that lists the area of each of the five states discussed in the previous section:


In [28]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

Now that we have this along with the population Series from earlier, we can use a dictionary to construct a single two-dimensional object containing this information:


In [29]:
states = pd.DataFrame({'population': population,
                       'area': area,
                       'country': 'USA'})
states

            population    area country
California    38332521  423967     USA
Texas         26448193  695662     USA
New York      19651127  141297     USA
Florida       19552860  170312     USA
Illinois      12882135  149995     USA


In [30]:
print(states.dtypes)

population     int64
area           int64
country       object
dtype: object


That looks like a generalized dictionary!  
The keys are the state names, and the values are like a list `[population, area, country]`.


In [31]:
states.sort_values(by="population", ascending=False)

            population    area country
California    38332521  423967     USA
Texas         26448193  695662     USA
New York      19651127  141297     USA
Florida       19552860  170312     USA
Illinois      12882135  149995     USA


In [32]:
states['population'], type(states['population'])

(California    38332521
 Texas         26448193
 New York      19651127
 Florida       19552860
 Illinois      12882135
 Name: population, dtype: int64,
 pandas.core.series.Series)

Retrieve the index label (the "key") of the row where "population" has its maximum value:


In [33]:
states["population"].idxmax()

'California'

Return the row at that index label as a Series:

In [34]:
states.loc[states["population"].idxmax()]

population    38332521
area            423967
country            USA
Name: California, dtype: object

Compute the maximum of each column:


In [35]:
states.max()

population    38332521
area            695662
country            USA
dtype: object

In [36]:
try:
    states['California']
except KeyError as e:
    print("KeyError:", e)

KeyError: 'California'


In [37]:
states.loc['California']

population    38332521
area            423967
country            USA
Name: California, dtype: object

In [38]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [39]:
states.columns

Index(['population', 'area', 'country'], dtype='object')

In [40]:
states.values

array([[38332521, 423967, 'USA'],
       [26448193, 695662, 'USA'],
       [19651127, 141297, 'USA'],
       [19552860, 170312, 'USA'],
       [12882135, 149995, 'USA']], dtype=object)

In [41]:
type(states.values)

numpy.ndarray

You can think of a DataFrame as a generalization of a two-dimensional NumPy array, where both rows and columns have a generalized index for accessing the data.
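
To make the comparison concrete, the sketch below wraps a small NumPy array in a DataFrame and reads the same element three ways. The labels `r0`..`r2` and `x`, `y` are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])
frame = pd.DataFrame(arr, index=["r0", "r1", "r2"], columns=["x", "y"])

# The same element, addressed by position in the array and by
# labels or positions in the DataFrame.
print(arr[1, 0])
print(frame.loc["r1", "x"])
print(frame.iloc[1, 0])

# Shape and underlying values match the original array.
print(frame.shape)
print(np.array_equal(frame.to_numpy(), arr))
```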


### DataFrame as a Specialized Dictionary

Similarly, we can think of a DataFrame as a specialization of a dictionary.  
Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data.  
For example, if you request the key `"area"`, you get the `Series` object containing the areas we saw earlier.

Note that indexing a DataFrame with square brackets returns the *column*!


In [42]:
states["area"]

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [43]:
type(states["area"])

pandas.core.series.Series
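
The dictionary analogy extends to membership tests, iteration, and assignment, all of which operate on column names. A small sketch with a two-row stand-in for the `states` table (the `density` column is added purely for illustration):

```python
import pandas as pd

frame = pd.DataFrame(
    {"area": [423967, 695662], "population": [38332521, 26448193]},
    index=["California", "Texas"],
)

# Membership tests and iteration work on column names, as with dict keys.
print("area" in frame)          # a column name
print("California" in frame)    # a row label, not a column
print(list(frame))              # iterating yields the column names

# Assigning to a new key adds a column, like adding a dict entry.
frame["density"] = frame["population"] / frame["area"]
print(list(frame.columns))
```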

### Creating DataFrame Objects

A Pandas DataFrame can be constructed in several ways:


#### From a Single Series Object

A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from just one Series:


In [44]:
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [45]:
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


#### From Multiple Series Objects


In [46]:
s1 = pd.Series(['100', '200', 'python', '300.12', '400'])
s2 = pd.Series(['10', '20', 'php', '30.12', '40'])
df = pd.concat([s1, s2], axis='columns')
df

        0      1
0     100     10
1     200     20
2  python    php
3  300.12  30.12
4     400     40


Note that many functions in Pandas take an `axis` argument. Here you can choose between 0/`index` (stack the Series on top of each other) and 1/`columns` (place them side by side).  
If you want to be explicit, I recommend using the string version!
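
The same `axis` convention applies to aggregations. As a small sketch on an invented DataFrame, summing along each axis looks like this:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis="index" (or 0): collapse the rows, one result per column.
col_sums = frame.sum(axis="index")
print(col_sums)

# axis="columns" (or 1): collapse the columns, one result per row.
row_sums = frame.sum(axis="columns")
print(row_sums)
```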


#### From a List of Dictionaries

Any list of dictionaries can be turned into a DataFrame.  
Here we pass two small dictionaries directly.  
Even if some keys are missing from the dictionaries, Pandas will fill those with NaN (i.e., “not a number”) values:


In [47]:
df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}], index=["first_dict", "second_dict"])
df

               a  b    c
first_dict   1.0  2  NaN
second_dict  NaN  3  4.0


Since each column must have a consistent data type and `np.nan` is a float, the columns that contain missing values are converted to floats:


In [48]:
df['a']

first_dict     1.0
second_dict    NaN
Name: a, dtype: float64

In [49]:
df['b']

first_dict     2
second_dict    3
Name: b, dtype: int64

In [50]:
type(np.nan)

float

In [51]:
df.dtypes

a    float64
b      int64
c    float64
dtype: object

When we retrieve a row, Pandas has to coerce its values to a single dtype, so the integer in column `b` is upcast to a float:


In [52]:
df

               a  b    c
first_dict   1.0  2  NaN
second_dict  NaN  3  4.0


In [53]:
df.loc['first_dict']

a    1.0
b    2.0
c    NaN
Name: first_dict, dtype: float64
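
If the upcast to float is unwanted, one option (an aside, not part of the lecture material above) is Pandas' nullable integer dtype `"Int64"`, which stores missing entries as `<NA>` and keeps the remaining values as integers. A minimal sketch:

```python
import pandas as pd

frame = pd.DataFrame([{"a": 1, "b": 2}, {"b": 3, "c": 4}])
print(frame.dtypes)             # "a" and "c" were upcast to float64

# Cast to the nullable integer dtype (note the capital "I").
nullable = frame.astype("Int64")
print(nullable)
print(nullable.dtypes)
```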

#### From a Two-Dimensional NumPy Array

From a two-dimensional array of data, we can create a DataFrame with specified column and index names.  
If none are provided, default integer labels (0, 1, 2, ...) are used for both the rows and the columns:


In [54]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

                   A         B         C         D
2013-01-01 -0.852177  0.987077 -0.877352  0.004228
2013-01-02 -0.572042  1.202998 -0.314829 -1.395134
2013-01-03 -1.612946  2.251727  0.507596 -0.933851
2013-01-04 -0.639619  0.648964 -1.885012  0.883314
2013-01-05  0.567475 -1.159135 -1.139363  0.151764
2013-01-06 -0.680184 -0.858035  0.401916  1.408603
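
The default labels can be seen by leaving out `index` and `columns`. The sketch below also uses a seeded generator so that the random values are reproducible; the seed and the shape are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)    # seed chosen arbitrarily
values = rng.standard_normal((3, 2))

# No index or columns given: both default to 0, 1, 2, ...
frame = pd.DataFrame(values)
print(frame.index)
print(frame.columns)

# The same data with explicit labels.
labeled = pd.DataFrame(values, index=["x", "y", "z"], columns=["A", "B"])
print(labeled)
```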


## The Pandas Index Object

We’ve seen that both the Series and the DataFrame objects contain an explicit index that allows you to reference and modify data.  
This Index object is an interesting structure in its own right, and can be thought of as an immutable array:


In [55]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Index([2, 3, 5, 7, 11], dtype='int64')

In [56]:
try:
    ind[0] = 1
except TypeError as e:
    print("TypeError:", e)

TypeError: Index does not support mutable operations
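
Since an Index cannot be modified in place, "changing" it means building a new one. A minimal sketch using the `delete` and `insert` methods, which return new Index objects:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Index methods return new objects instead of modifying in place.
changed = ind.delete(0).insert(0, 1)   # drop position 0, then insert 1 there
print(changed)
print(ind)                             # the original is untouched
```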


In [57]:
sr = pd.Series(0, index=ind)
sr

2     0
3     0
5     0
7     0
11    0
dtype: int64

Index objects have a name:


In [58]:
ind.names = ['indexx']
ind

Index([2, 3, 5, 7, 11], dtype='int64', name='indexx')

In [59]:
sr = pd.Series(np.zeros_like(ind), index=ind)
sr

indexx
2     0
3     0
5     0
7     0
11    0
dtype: int64

In [60]:
df = pd.DataFrame(np.zeros_like(ind), index=ind, columns=['first'])
df

        first
indexx
2           0
3           0
5           0
7           0
11          0


In [61]:
df.index.names = [None]
df

    first
2       0
3       0
5       0
7       0
11      0


Index objects also have many of the attributes familiar from NumPy arrays:


In [62]:
ind.size, ind.shape, ind.ndim, ind.dtype

(5, (5,), 1, dtype('int64'))

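While thinking of indices as immutable lists is natural, indices also support set operations such as intersection, union, and symmetric difference.  
In current Pandas versions these are available as methods; the operators `&`, `|`, and `^` act element-wise (bitwise) on the values instead:


In [63]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [64]:
indA.intersection(indB)

Index([3, 5, 7], dtype='int64')

In [65]:
indA.union(indB)

Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [66]:
indA.symmetric_difference(indB)

Index([1, 2, 9, 11], dtype='int64')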

## Indexing, Selection, and Assignment of Data

From the NumPy notebook, we are already familiar with indexing, slicing, masking, and fancy indexing:


In [67]:
a = np.arange(16).reshape(4,4)
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

Select the values from the second and fourth columns that are divisible by 3:

In [68]:
a[:, [1, 3]][a[:, [1, 3]] % 3 == 0]

array([ 3,  9, 15])

Here we will look at similar ways of accessing and modifying values in Pandas Series and DataFrame objects.  
The corresponding patterns in Pandas are very similar to those in NumPy, although there are a few special considerations to keep in mind.

We’ll start with the simple case of the one-dimensional Series object, and then move on to the slightly more complex two-dimensional DataFrame object.


### Data Selection in Series

As we saw in the previous section, a Series object behaves in many ways like a one-dimensional NumPy array and in many ways like a standard Python dictionary.  
Keeping these two overlapping analogies in mind will help us understand the patterns of data indexing and selection in these arrays.


#### Series as a Dictionary

Like a dictionary, the Series object maps a collection of keys to a collection of values, which means that most of the corresponding functions work just as well for it:


In [69]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [70]:
data.__contains__('b')

True

In [71]:
'b' in data

True

In [72]:
np.array_equal(data.keys(), data.index)

True

In [73]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [74]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [75]:
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64
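
Two further dictionary-style operations are sketched below on a small made-up Series: `get` with a default value for missing labels, and overwriting an existing entry.

```python
import pandas as pd

data = pd.Series([0.25, 0.5], index=["a", "b"])

# get() returns a default instead of raising KeyError for a missing label.
present = data.get("a")
missing = data.get("z", 0.0)
print(present, missing)

# Assigning to an existing label overwrites the value in place.
data["a"] = 9.0
print(data.to_dict())
```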

#### Series as a One-Dimensional Array

Series builds on this dictionary-like interface and provides array-style element selection using the same basic mechanisms as NumPy arrays—namely slicing, masking, and fancy indexing.  
Examples include the following:

* Slicing by explicit index


In [76]:
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

* Slicing by implicit integer index

(Note that when slicing with an explicit index, e.g. `data['a':'c']`, the final index is included in the slice, whereas when slicing with an implicit index, e.g. `data[0:2]`, the final index is excluded from the slice.)


In [77]:
data[0:2]

a    0.25
b    0.50
dtype: float64

In [78]:
(data > 0.3) & (data < 0.8)

a    False
b     True
c     True
d    False
e    False
dtype: bool

* Masking

In [79]:
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

* Fancy indexing

In [80]:
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

In [81]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[1, 2, 3, 4])
data

1    0.25
2    0.50
3    0.75
4    1.00
dtype: float64

In [82]:
data[1:3]

2    0.50
3    0.75
dtype: float64

**If your Series has an explicit integer index, an indexing operation like `data[1]` uses the explicit index, whereas a slicing operation like `data[1:3]` uses the implicit Python-style index.**


In [83]:
data = pd.Series(['a', 'b', 'c'], index=[1, 5, 3])
data

1    a
5    b
3    c
dtype: object

Explicit index when indexing:


In [84]:
data[1]

'a'

Implicit index when slicing:


In [85]:
data[1:3]

5    b
3    c
dtype: object

The `loc` attribute enables indexing and slicing that *always* refer to the explicit index:


In [86]:
data.loc[1]

'a'

In [87]:
data.loc[1:3]

1    a
5    b
3    c
dtype: object

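Note that `loc` may, but does not always, raise a `KeyError` when slicing. If the index is not sorted, as here, both slice bounds must be labels that actually occur in the index: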


In [88]:
data = pd.Series(['a', 'b', 'c'], index=[1, 5, 3])
data

1    a
5    b
3    c
dtype: object

In [89]:
try:
    data.loc[3:10]
except KeyError as e:
    print("KeyError:", e)

KeyError: 10


In [90]:
try:
    data.loc['a':'z']
except KeyError as e:
    print("KeyError:", e)

KeyError: 'a'
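
For comparison, the sketch below repeats the slice from above after sorting the index. With a sorted index, a bound that is not a label is accepted and the slice is simply clipped to the labels that exist.

```python
import pandas as pd

data = pd.Series(["a", "b", "c"], index=[1, 5, 3])

# Unsorted index: the slice bound 10 is not a label, so .loc raises.
try:
    data.loc[3:10]
    raised = False
except KeyError:
    raised = True
print(raised)

# Sorted index: out-of-range bounds are allowed and clip the slice.
ordered = data.sort_index()
clipped = ordered.loc[3:10]
print(clipped)
```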


The `iloc` attribute enables indexing and slicing that always refer to the implicit Python-style index:


In [91]:
data.iloc[1]

'b'

In [92]:
data.iloc[1:3]

5    b
3    c
dtype: object

Please, save yourself the pain and always be explicit about what you are doing—always use `.loc` and `.iloc`.

**Explicit is better than implicit.**

The statement does **not** mean that explicit indices are better than implicit ones,  
but rather that you should be explicit about which one you are using.


### Addendum: Indexing

From [the docs][1]: In earlier versions, using `.loc[list-of-labels]` worked as long as at least one of the keys was found (otherwise a `KeyError` was raised).  
This behavior has since been removed: if any label in the list is missing, a `KeyError` is raised.

[1]: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike


In [93]:
s = pd.Series([1, 2, 3])
s

0    1
1    2
2    3
dtype: int64

In [94]:
try:
    s.loc[[1, 2, 3]]
except KeyError as e:
    print("KeyError:", e)

KeyError: '[3] not in index'


Instead, you should use [`reindex`][2], which aligns the Series/DataFrame to a new index with optional filling logic.

[2]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html


In [95]:
s.reindex([1, 2, 3])

1    2.0
2    3.0
3    NaN
dtype: float64
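
Two variations on this pattern, sketched under the assumption that missing labels should either be filled with a placeholder or be dropped: `reindex` with `fill_value`, and selecting only the labels that exist via `Index.intersection`. The list `wanted` is a name introduced for this example.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
wanted = [1, 2, 3]              # label 3 does not exist in s

# Option 1: reindex with a fill value instead of NaN (dtype stays integer).
filled = s.reindex(wanted, fill_value=0)
print(filled)

# Option 2: keep only the labels that actually exist.
present = s.index.intersection(wanted)
subset = s.loc[present]
print(subset)
```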

---

Exercise: [**Exercise 3.2: Pandas**](../03_data/exercises/02_pandas.ipynb)

Next: [**Chapter 3.3: Visualisation with Matplotlib**](../03_data/03_matplotlib.ipynb)