# Pandas

In [1]:
import pandas as pd

## DataFrame

In [2]:
users = {
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'lastname': ['Faschingbauer', 'Faschingbauer', 'Faschingbauer', 'Lichtenberger'],
    'svnr': ['1037190666', '1234110695', '2345250497', '345606041986'],
    'age': [56, 27, 25, 37]
}

In [3]:
users['firstname']

['Joerg', 'Johanna', 'Caro', 'Philipp']

In [4]:
df = pd.DataFrame(users)

In [5]:
df

  firstname       lastname          svnr  age
0     Joerg  Faschingbauer    1037190666   56
1   Johanna  Faschingbauer    1234110695   27
2      Caro  Faschingbauer    2345250497   25
3   Philipp  Lichtenberger  345606041986   37


In [6]:
df['firstname']

0      Joerg
1    Johanna
2       Caro
3    Philipp
Name: firstname, dtype: object

## Filters

### Simple Equality

In [7]:
df['lastname'] == 'Faschingbauer'

0     True
1     True
2     True
3    False
Name: lastname, dtype: bool

In [8]:
flt = df['lastname'] == 'Faschingbauer'

In [9]:
flt

0     True
1     True
2     True
3    False
Name: lastname, dtype: bool

In [10]:
type(flt)

pandas.core.series.Series

In [11]:
df[flt]

  firstname       lastname        svnr  age
0     Joerg  Faschingbauer  1037190666   56
1   Johanna  Faschingbauer  1234110695   27
2      Caro  Faschingbauer  2345250497   25


**Better:** use ``loc[]``. Plain ``df[...]`` is also how columns are addressed, whereas ``loc[]`` makes it explicit that we are selecting *rows*.

In [12]:
df.loc[flt]

  firstname       lastname        svnr  age
0     Joerg  Faschingbauer  1037190666   56
1   Johanna  Faschingbauer  1234110695   27
2      Caro  Faschingbauer  2345250497   25


**Better again:** ``loc[]`` takes a second argument that selects the columns we want

In [13]:
df.loc[flt, 'firstname']

0      Joerg
1    Johanna
2       Caro
Name: firstname, dtype: object
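
The column argument of ``loc[]`` does not have to be a single label. The following is a minimal sketch, assuming a reduced version of the ``users`` data from above (the ``svnr`` column is left out for brevity): a list of labels gives a DataFrame instead of a Series.

```python
import pandas as pd

# Reduced copy of the example data (assumption: 'svnr' omitted for brevity)
df = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'lastname': ['Faschingbauer', 'Faschingbauer', 'Faschingbauer', 'Lichtenberger'],
    'age': [56, 27, 25, 37],
})

flt = df['lastname'] == 'Faschingbauer'

# A single column label gives a Series ...
names = df.loc[flt, 'firstname']

# ... a list of column labels gives a DataFrame with just those columns.
names_and_ages = df.loc[flt, ['firstname', 'age']]
print(names_and_ages)
```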

### Boolean Expressions

In [14]:
flt = (df['firstname'] == 'Joerg') & (df['lastname'] == 'Faschingbauer') | (df['firstname'] == 'Philipp')

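**Attention**: the parentheses are important because ``&`` binds more tightly than ``==``. Without them, Python would try to evaluate ``'Joerg' & df['lastname']`` first, which is not what we want. Likewise, ``&`` binds more tightly than ``|``, so the expression above reads as ``(A & B) | C``.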
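
To see the precedence problem in isolation, here is a small sketch on a reduced copy of the data (an assumption made for brevity). The unparenthesized variant is expected to raise; the exact exception type is not pinned down here because it may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'lastname': ['Faschingbauer', 'Faschingbauer', 'Faschingbauer', 'Lichtenberger'],
})

# With parentheses: each comparison is evaluated first, then combined.
ok = (df['firstname'] == 'Joerg') & (df['lastname'] == 'Faschingbauer')

# Without parentheses Python groups the expression as
#   df['firstname'] == ('Joerg' & df['lastname']) == 'Faschingbauer'
# which fails. The exception type may depend on the pandas version.
try:
    df['firstname'] == 'Joerg' & df['lastname'] == 'Faschingbauer'
    failed = False
except Exception as exc:
    failed = True
    print(type(exc).__name__, exc)
```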

In [15]:
df[flt]

  firstname       lastname          svnr  age
0     Joerg  Faschingbauer    1037190666   56
3   Philipp  Lichtenberger  345606041986   37


**Negation**: ``~``

In [16]:
df.loc[~flt]

  firstname       lastname        svnr  age
1   Johanna  Faschingbauer  1234110695   27
2      Caro  Faschingbauer  2345250497   25


### Neat Helpers

In [17]:
flt = df['firstname'].isin(['Caro', 'Philipp'])
df[flt]

  firstname       lastname          svnr  age
2      Caro  Faschingbauer    2345250497   25
3   Philipp  Lichtenberger  345606041986   37


In [18]:
flt = df['firstname'].str.startswith('J')
df[flt]

  firstname       lastname        svnr  age
0     Joerg  Faschingbauer  1037190666   56
1   Johanna  Faschingbauer  1234110695   27
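
Two more helpers in the same spirit, sketched on a reduced copy of the data (an assumption for brevity): ``Series.between()`` for numeric ranges and ``Series.str.contains()`` for substring matches. Like ``isin()`` and ``str.startswith()``, both return a boolean Series that can be used as a filter.

```python
import pandas as pd

df = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'age': [56, 27, 25, 37],
})

# between() includes both boundaries by default
flt_age = df['age'].between(25, 30)

# regex=False treats the pattern as a plain substring
flt_name = df['firstname'].str.contains('ar', regex=False)

print(df[flt_age])
print(df[flt_name])
```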



### Updating

#### Straightforward: assign modified copy of Series back into DataFrame

In [19]:
df2 = df.copy()

In [20]:
df2['firstname'] = df2['firstname'].str.upper()

In [21]:
df2

  firstname       lastname          svnr  age
0     JOERG  Faschingbauer    1037190666   56
1   JOHANNA  Faschingbauer    1234110695   27
2      CARO  Faschingbauer    2345250497   25
3   PHILIPP  Lichtenberger  345606041986   37
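
The same assignment can be restricted to selected rows by combining it with a filter and ``loc[]``. This is a sketch on a reduced copy of the data (an assumption for brevity), not the only way to do it.

```python
import pandas as pd

df = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'lastname': ['Faschingbauer', 'Faschingbauer', 'Faschingbauer', 'Lichtenberger'],
})

# Work on a copy so that the original stays untouched
df2 = df.copy()

# Uppercase the first name only where the filter is True
flt = df2['lastname'] == 'Faschingbauer'
df2.loc[flt, 'firstname'] = df2.loc[flt, 'firstname'].str.upper()

print(df2)
```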


#### Apply On Series

In [22]:
df['firstname'].apply(len)

0    5
1    7
2    4
3    7
Name: firstname, dtype: int64

In [23]:
def upper(s):
    return s.upper()

In [24]:
df['firstname'].apply(upper)

0      JOERG
1    JOHANNA
2       CARO
3    PHILIPP
Name: firstname, dtype: object
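
For common string operations, ``apply()`` is not strictly needed: the ``.str`` accessor offers vectorized equivalents. A minimal comparison, assuming only the ``firstname`` column from above:

```python
import pandas as pd

firstname = pd.Series(['Joerg', 'Johanna', 'Caro', 'Philipp'], name='firstname')

# Element by element, via a Python function
len_via_apply = firstname.apply(len)
upper_via_apply = firstname.apply(lambda s: s.upper())

# The same through the .str accessor
len_via_str = firstname.str.len()
upper_via_str = firstname.str.upper()

print(len_via_str.tolist())
print(upper_via_str.tolist())
```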

#### Apply On Entire DataFrame

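Default direction: ``axis='index'`` (alias ``'rows'`` or ``0``). The function is called once per column -> length of each column, i.e. the number of rows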

In [25]:
df.apply(len)

firstname    4
lastname     4
svnr         4
age          4
dtype: int64

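With ``axis='columns'`` the function is called once per row -> length of each row, i.e. the number of columns

In [26]: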
df.apply(len, axis='columns')

0    4
1    4
2    4
3    4
dtype: int64
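
A row-wise ``apply()`` becomes more useful when the function combines several columns. The sketch below builds a full name per row; the reduced DataFrame and the name ``fullnames`` are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'lastname': ['Faschingbauer', 'Faschingbauer', 'Faschingbauer', 'Lichtenberger'],
})

# With axis='columns' the function receives one row (a Series) at a time
fullnames = df.apply(
    lambda row: f"{row['firstname']} {row['lastname']}",
    axis='columns',
)

# For simple cases like this one, plain column arithmetic does the same
fullnames_vectorized = df['firstname'] + ' ' + df['lastname']

print(fullnames)
```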

#### ``applymap``: Each Element of DataFrame
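
A possible illustration, since no example is given here: ``applymap()`` calls a function on every single element. Recent pandas versions (2.1 and later) deprecate ``applymap()`` in favour of ``DataFrame.map()``, so the sketch picks whichever is available. The name ``elementwise`` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'svnr': ['1037190666', '1234110695', '2345250497', '345606041986'],
    'age': [56, 27, 25, 37],
})

# DataFrame.map() exists from pandas 2.1 on; fall back to applymap() otherwise
elementwise = df.map if hasattr(df, 'map') else df.applymap

# Length of the string representation of every element
lengths = elementwise(lambda value: len(str(value)))
print(lengths)
```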

### Groups, Aggregation

#### Naive Approach

In [41]:
flt = df['lastname'] == 'Faschingbauer'
df.loc[flt, 'age'].mean()

36.0

In [42]:
flt = df['lastname'] == 'Lichtenberger'
df.loc[flt, 'age'].mean()

37.0

#### ``groupby``

In [29]:
lastnames = df.groupby('lastname')

In [43]:
lastnames

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f7f5fd6b130>

**To give an idea of what it is ...**

Converted to a list, it turns out to be a sequence of ``(lastname, DataFrame)`` tuples ...

In [31]:
list(lastnames)

[('Faschingbauer',
    firstname       lastname        svnr  age
  0     Joerg  Faschingbauer  1037190666   56
  1   Johanna  Faschingbauer  1234110695   27
  2      Caro  Faschingbauer  2345250497   25),
 ('Lichtenberger',
    firstname       lastname          svnr  age
  3   Philipp  Lichtenberger  345606041986   37)]

Converted further into a ``dict`` ... the values look like data frames

In [32]:
d = dict(list(lastnames))

In [33]:
d

{'Faschingbauer':   firstname       lastname        svnr  age
 0     Joerg  Faschingbauer  1037190666   56
 1   Johanna  Faschingbauer  1234110695   27
 2      Caro  Faschingbauer  2345250497   25,
 'Lichtenberger':   firstname       lastname          svnr  age
 3   Philipp  Lichtenberger  345606041986   37}

In [44]:
type(d['Faschingbauer'])

pandas.core.frame.DataFrame
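
The detour via ``list`` and ``dict`` is only for inspection. A ``groupby`` object can be iterated directly, giving one ``(key, DataFrame)`` pair per group. A small sketch on a reduced copy of the data (an assumption for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'lastname': ['Faschingbauer', 'Faschingbauer', 'Faschingbauer', 'Lichtenberger'],
    'age': [56, 27, 25, 37],
})

sizes = {}
for lastname, group in df.groupby('lastname'):
    # 'group' is a DataFrame holding only the rows of this lastname
    sizes[lastname] = len(group)
    print(lastname, len(group), group['age'].mean())
```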

#### Working with ``groupby``

``get_group()`` gives a DataFrame

In [45]:
f = lastnames.get_group('Faschingbauer')

In [46]:
f

  firstname       lastname        svnr  age
0     Joerg  Faschingbauer  1037190666   56
1   Johanna  Faschingbauer  1234110695   27
2      Caro  Faschingbauer  2345250497   25


In [52]:
f['age'].mean()

36.0
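
Instead of picking each group by hand, the aggregation can be applied to all groups at once. The following sketch reproduces the two means from the naive approach in a single call, and uses ``agg()`` for several aggregations side by side. The reduced DataFrame is an assumption for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    'lastname': ['Faschingbauer', 'Faschingbauer', 'Faschingbauer', 'Lichtenberger'],
    'age': [56, 27, 25, 37],
})

# One mean per lastname; the result is a Series indexed by lastname
mean_age = df.groupby('lastname')['age'].mean()
print(mean_age)

# Several aggregations at once; the result is a DataFrame
stats = df.groupby('lastname')['age'].agg(['min', 'max', 'mean'])
print(stats)
```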