# Pandas

Pandas is built on top of NumPy and provides a higher-level view of data. Its Series and DataFrame objects (as well as Index) allow attaching row and column labels to multidimensional arrays, working with missing data, and performing powerful data operations familiar from databases and spreadsheets.

If you use the Anaconda Python stack, pandas is already installed.

In [1]:
import numpy as np
print("NumPy version:", np.__version__)

import pandas as pd
print("Pandas version:", pd.__version__)

NumPy version: 1.16.2
Pandas version: 0.24.2
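
As a small preview of the labels and missing-data handling mentioned above, the following sketch adds two labeled arrays whose labels only partly overlap. The names `a`, `b` and `total` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Two labeled one-dimensional arrays with partly different labels
a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0], index=['y', 'z'])

# Arithmetic aligns on labels; 'x' has no partner in b and becomes NaN
total = a + b
print(total)

# Missing values can be detected and replaced
print(total.isna())
filled = total.fillna(0.0)
print(filled)
```

The values are matched by label rather than by position, which is the main difference from adding two plain NumPy arrays.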


## Series

A Series object is a one-dimensional array of indexed data.

It provides both a sequence of values (NumPy array) as well as a sequence of indices.

In [2]:
s = pd.Series([64.2, 274.3, 93.21, 52.87])
print(s)

0     64.20
1    274.30
2     93.21
3     52.87
dtype: float64


In [3]:
print(s.values)
print(type(s.values))

[ 64.2  274.3   93.21  52.87]
<class 'numpy.ndarray'>


In [None]:
print(s.index)
print(type(s.index))

In [None]:
print(s[2])
print(type(s[2]))

In [None]:
print(s[0:2])
print(type(s[0:2]))

One way to think of a Series object is as a one-dimensional NumPy array with an explicitly defined index. This explicit index offers additional ways to access the data.

In [None]:
s = pd.Series([9.13, 5.89, 7.37, 1.93], index=['a', 'b', 'c', 'd'])
print(s)

In [None]:
print(s['a'])
print(s['c'])

In [None]:
s['b'] = 8.98
print(s)

In [None]:
print(s['a':'c'])
print('\n', s[0:2])

**Why does the first slice ['a':'c'] include c while the second one [0:2] does not?**

Slicing with the explicit index (pandas style) is inclusive, whereas slicing with the implicit index (Python style) is exclusive. This can be confusing when the explicit index consists of the integers 0, 1, 2, ..., n-1: a plain slice then still uses the implicit index (Python style) and excludes the stop value. See loc and iloc further down for how to handle this.

In [None]:
s2 = pd.Series([9.13, 5.89, 7.37, 1.93], index=[0, 1, 2, 3])
print(s2, '\n')
print(s2[0:3])
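
To make the difference between the two slicing styles visible on an integer index, here is a minimal sketch that applies the same slice once by position (iloc) and once by label (loc). The variable names `by_position` and `by_label` are only illustrative.

```python
import pandas as pd

s2 = pd.Series([9.13, 5.89, 7.37, 1.93], index=[0, 1, 2, 3])

# Implicit (Python-style) positions: the stop value is excluded
by_position = s2.iloc[0:2]

# Explicit labels: the stop label is included
by_label = s2.loc[0:2]

print(by_position)
print(by_label)
```

The positional slice has two elements, the label slice has three.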

Another way to think of a Series object is as a specialization of a dictionary.

In [None]:
pop_dict = {'Berlin': 3613495, 'Munich': 1456039, 'Cologne': 1080394, 'Hamburg': 1834823, 'Frankfurt a.M.': 746878 }
pop = pd.Series(pop_dict)
print(pop)

In [None]:
print(pop['Hamburg'])

In [None]:
print(pop['Munich':'Hamburg'])
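
The dictionary analogy can be taken a bit further. The sketch below shows a few dictionary-style operations on the population Series, plus one thing a plain dictionary cannot do. The variable `partial_sum` is introduced only for this example.

```python
import pandas as pd

pop = pd.Series({'Berlin': 3613495, 'Munich': 1456039, 'Cologne': 1080394,
                 'Hamburg': 1834823, 'Frankfurt a.M.': 746878})

# Membership tests look at the index labels, like dictionary keys
print('Hamburg' in pop)
print('Bremen' in pop)

# keys() returns the index, items() yields (label, value) pairs
print(list(pop.keys()))
for city, inhabitants in pop.items():
    print(city, inhabitants)

# Unlike a dictionary, a Series is ordered and supports label slices
# and array-style operations such as sum()
partial_sum = pop.loc['Munich':'Hamburg'].sum()
print(partial_sum)
```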

## DataFrame

A DataFrame is a two-dimensional array with both flexible row indices and flexible column names. It can be thought of as a sequence of aligned Series objects that share the same index.

In [None]:
area_dict = {'Berlin': 891.68, 'Munich': 310.70, 'Cologne': 405.02, 'Hamburg': 755.22, 'Frankfurt a.M.': 248.31 }
area = pd.Series(area_dict)
print(area)

In [None]:
cities = pd.DataFrame({'population':pop, 'area in km²':area})
cities

In [None]:
print(cities.index)

In [None]:
print(cities.columns)

A DataFrame can also be regarded as a specialized dictionary that maps a column name (key) to a Series (value). Columns can be accessed with the index operator using the column name. The return type is a Series object.

In [None]:
print(cities['area in km²'])
print('\n', type(cities['area in km²']))

**Be careful with the index operator: for a two-dimensional NumPy array called arr, arr[0] returns the first row. For a DataFrame object called df, df['col0'] returns the column with the label col0 (as a Series object).**

In [None]:
a = np.random.randint(0, 10, (3,3))
print(a, '\n')
print(a[0])
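
For comparison, the following sketch puts the NumPy and the DataFrame behavior side by side. The labels `r0`, `r1` and `col0` to `col2` are invented for this illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3],
                [4, 5, 6]])
df = pd.DataFrame(arr, index=['r0', 'r1'], columns=['col0', 'col1', 'col2'])

first_row_arr = arr[0]      # NumPy: the index operator selects a row
col0 = df['col0']           # pandas: the index operator selects a column
first_row_df = df.iloc[0]   # pandas: a row by position needs iloc

print(first_row_arr)
print(col0)
print(first_row_df)
```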

Further Series objects can be added to an existing DataFrame using the index operator with a new key, which becomes the column label of the inserted Series. The Series is aligned to the row index of the DataFrame on insertion.

In [None]:
vehicle_dict = {'Berlin':'B', 'Munich':'M', 'Frankfurt a.M.': 'F', 'Bremen':'HB'}
veh = pd.Series(vehicle_dict)
cities['vehicle number'] = veh
cities
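
The alignment that happens during such an insertion can be seen in a reduced example. This is only a sketch with a smaller, made-up DataFrame `df` and Series `codes`; it is not the `cities` object from above.

```python
import pandas as pd

df = pd.DataFrame({'population': [3613495, 1456039, 1080394]},
                  index=['Berlin', 'Munich', 'Cologne'])
codes = pd.Series({'Berlin': 'B', 'Munich': 'M', 'Bremen': 'HB'})

# The Series is matched to the existing row labels of df:
# 'Cologne' has no entry in codes and gets NaN,
# 'Bremen' is not a row of df and is silently dropped.
df['code'] = codes
print(df)
```

Inserting a column never adds rows. Labels that exist only in the inserted Series are ignored.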

### Data Indexing and Selection

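With the plain index operator, an explicit integer index always takes precedence: an integer key is looked up as a label. If the explicit index does not consist of integers, an integer key is interpreted as a position in the implicit integer index. This describes the pandas version used here; newer pandas releases deprecate this positional fallback in favor of iloc.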

In [None]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
print(data)

In [None]:
print(data[1])

In [None]:
print(s, '\n')
print(s[1])

Using an integer that is not contained in the explicit integer index raises a KeyError:

In [None]:
# print(data[0])  # raises KeyError: 0 is not a label in the index
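
The failing lookup can be demonstrated without aborting the notebook by catching the exception. A minimal sketch, where the flag `raised` and the variable `first` are only illustrative:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# 0 is not one of the labels 1, 3, 5, so the lookup fails
try:
    data[0]
    raised = False
except KeyError:
    raised = True
print("KeyError raised:", raised)

# The first element by position is available through iloc
first = data.iloc[0]
print(first)
```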

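Slicing, in contrast, uses the implicit positional index here, which may not be what you expect: data[1:3] returns the elements at positions 1 and 2, not the labels 1 to 3.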

In [None]:
print(data, '\n')
print(data[1:3])

In [None]:
cities

If the columns have string labels, these labels can also be used like attributes of the DataFrame to access the corresponding Series objects. This only works if the label is a valid Python identifier (so not for 'area in km²') and does not conflict with an existing attribute or method name of the DataFrame.

In [None]:
print(cities['population'])

In [None]:
cities_pop = cities.population
print(cities_pop)
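
A naming conflict of this kind is easy to produce. The sketch below uses a hypothetical column called `size`, which collides with the existing `DataFrame.size` attribute.

```python
import pandas as pd

# 'size' is a made-up column name chosen to collide with DataFrame.size
df = pd.DataFrame({'population': [3613495, 1456039],
                   'size': [891.68, 310.70]},
                  index=['Berlin', 'Munich'])

# No conflict: attribute access returns the column
print(df.population)

# Conflict: this is the number of elements (2 rows x 2 columns), not the column
print(df.size)

# The index operator always returns the column
print(df['size'])
```

For this reason the index operator is the safer choice in code that has to work with arbitrary column names.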

A new column can be inserted as the result of an element-wise computation, here a division that works like a NumPy universal function:

In [None]:
cities['density'] = cities['population'] / cities['area in km²']
cities
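
Element-wise operations on pandas objects keep the labels and align on them. The following sketch uses two small, deliberately differently ordered Series to show both effects. The names `density` and `log_pop` are chosen for this example.

```python
import numpy as np
import pandas as pd

area = pd.Series({'Berlin': 891.68, 'Munich': 310.70})
pop = pd.Series({'Munich': 1456039, 'Berlin': 3613495})  # different order on purpose

# Binary operations match values by label, not by position
density = pop / area
print(density)

# A NumPy universal function returns a Series with the same index
log_pop = np.log10(pop)
print(log_pop)
```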

In [None]:
cities.values

In [None]:
cities.T

Accessing a specific row:

In [None]:
cities.values[1]

Accessing a specific column:

In [None]:
cities['population']
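
Note that going through `values` returns a bare NumPy array without any labels. A row that keeps its labels can be obtained with the loc and iloc indexers described in the next sections. A minimal sketch with a reduced, made-up DataFrame `df`:

```python
import pandas as pd

df = pd.DataFrame({'population': [3613495, 1456039],
                   'area in km²': [891.68, 310.70]},
                  index=['Berlin', 'Munich'])

row_by_pos = df.iloc[1]            # second row, selected by position
row_by_label = df.loc['Munich']    # same row, selected by label

# Both are Series whose index consists of the column labels
print(row_by_pos)
print(row_by_label)
```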

### loc and iloc indexers for Series objects

The **loc** attribute allows indexing and slicing that always references the explicit index.

In [None]:
print(data)

In [None]:
print(data.loc[1])

In [None]:
data.loc[1:3]

The **iloc** attribute allows indexing and slicing that always references the implicit Python-style index.

In [None]:
data.iloc[1]

In [None]:
data.iloc[1:3]

**The indexers loc and iloc state explicitly whether labels or positions are meant. They make code cleaner and are the recommended choice, especially with explicit integer indexes!**

### loc and iloc for DataFrame objects

In [None]:
cities

The iloc indexer allows indexing the underlying array as if it were a plain NumPy array.

In [None]:
cities.iloc[1:4, :2]

The loc indexer uses the explicit row and column labels of the DataFrame; both ends of a label slice are included.

In [None]:
cities.loc[:'Cologne', 'vehicle number':'density']

### What else is there?

Masking and fancy indexing can be combined:

In [None]:
cities.loc[cities.density > 3000, ['population', 'density']]

In [None]:
cities.density > 3000

In [None]:
cities.loc[[True, True, False, False, True], ['population', 'density']]
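
Boolean masks can also be combined with each other before they are passed to loc. The sketch below rebuilds a reduced version of the table (the column name `area` and the population threshold are chosen only for this example) and selects cities that are dense but have fewer than two million inhabitants.

```python
import pandas as pd

df = pd.DataFrame({'population': [3613495, 1456039, 1080394, 1834823],
                   'area': [891.68, 310.70, 405.02, 755.22]},
                  index=['Berlin', 'Munich', 'Cologne', 'Hamburg'])
df['density'] = df['population'] / df['area']

# Element-wise "and" uses &, and each comparison needs its own parentheses
mask = (df['density'] > 3000) & (df['population'] < 2000000)

selected = df.loc[mask, ['population', 'density']]
print(selected)
```

Python's `and` and `or` keywords do not work element-wise on Series, so `&` and `|` are used instead.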