# Short introduction to pandas

pandas is the main Python package for manipulating tabular data. By convention it is imported under the alias <code>pd</code>:

In [1]:
import pandas as pd

The main class in pandas is the <code>DataFrame</code>, which represents a table of data. A <code>DataFrame</code> can be created in a number of ways. The most common are reading from a text file or converting a list of lists or a dictionary into a <code>DataFrame</code>. Let's build one from a list of lists.

In [2]:
data=[[1.76,70,'blue'],[1.46,32,'brown'],[1.65,47,'black'],[1.32,26,'brown']]
df=pd.DataFrame(data)
df

      0   1      2
0  1.76  70   blue
1  1.46  32  brown
2  1.65  47  black
3  1.32  26  brown
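
The other two routes mentioned above, a dictionary and a text file, can be sketched in the same way. This is only a minimal sketch: the short column names <code>height</code>, <code>weight</code> and <code>eye</code> are made up for illustration, and <code>io.StringIO</code> stands in for a real CSV file on disk.

```python
import io
import pandas as pd

# From a dictionary: keys become column names, values become the columns.
df_from_dict = pd.DataFrame({
    'height': [1.76, 1.46],
    'weight': [70, 32],
    'eye': ['blue', 'brown'],
})

# From text: io.StringIO is an in-memory stand-in for a CSV file here.
csv_text = "height,weight,eye\n1.76,70,blue\n1.46,32,brown\n"
df_from_text = pd.read_csv(io.StringIO(csv_text))

print(df_from_dict)
print(df_from_text)
```

With a real file, the path would be passed to <code>pd.read_csv</code> in place of the <code>StringIO</code> object.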


Data frames have columns and rows, and both may have names. To set the column names, we pass them to the data frame when it is created. The row names are called the <code>index</code> of the data frame. Let's introduce both.

In [3]:
columns=['height (m)','weight (kg)','eye colour']
index=['Arnaud','Beatrix','Camille','Derk']
df=pd.DataFrame(data,columns=columns,index=index)
df

         height (m)  weight (kg)  eye colour
Arnaud         1.76           70        blue
Beatrix        1.46           32       brown
Camille        1.65           47       black
Derk           1.32           26       brown
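
Names do not have to be supplied at creation time. As a small sketch, the same result can be reached by assigning to the <code>columns</code> and <code>index</code> attributes of an existing data frame:

```python
import pandas as pd

data = [[1.76, 70, 'blue'], [1.46, 32, 'brown'],
        [1.65, 47, 'black'], [1.32, 26, 'brown']]
df = pd.DataFrame(data)

# Assign the names after creation; the lengths must match the table's shape.
df.columns = ['height (m)', 'weight (kg)', 'eye colour']
df.index = ['Arnaud', 'Beatrix', 'Camille', 'Derk']

print(df)
```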


Column names and the index can be directly accessed:

In [4]:
df.columns

Index(['height (m)', 'weight (kg)', 'eye colour'], dtype='object')

Don't be confused by the complicated output. The columns essentially behave like a list of strings:

In [5]:
df.columns[2]

'eye colour'

In [6]:
df.index

Index(['Arnaud', 'Beatrix', 'Camille', 'Derk'], dtype='object')
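
To illustrate the list-like behaviour, here is a short sketch on a reduced two-row version of the table (the reduction is only to keep the example small). It converts the column names to a plain list, tests membership, and iterates over them:

```python
import pandas as pd

df = pd.DataFrame([[1.76, 70], [1.46, 32]],
                  columns=['height (m)', 'weight (kg)'],
                  index=['Arnaud', 'Beatrix'])

column_names = list(df.columns)            # plain Python list of strings
has_weight = 'weight (kg)' in df.columns   # membership test, as for a list
n_rows = len(df.index)                     # number of rows

for name in df.columns:                    # iterating yields the column names
    print(name)
```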

Individual columns can be selected like this:

In [7]:
df['weight (kg)']

Arnaud     70
Beatrix    32
Camille    47
Derk       26
Name: weight (kg), dtype: int64

We can also select several columns at once. To do so, we pass a list of the selected column names to the data frame:

In [8]:
selected_columns=['height (m)','eye colour']
df[selected_columns]

         height (m)  eye colour
Arnaud         1.76        blue
Beatrix        1.46       brown
Camille        1.65       black
Derk           1.32       brown


This can be done in one go. Note the double brackets:

In [9]:
df[['height (m)','eye colour']]

         height (m)  eye colour
Arnaud         1.76        blue
Beatrix        1.46       brown
Camille        1.65       black
Derk           1.32       brown


Columns can be combined with standard operations. For instance, to calculate the body mass index:

In [10]:
df['weight (kg)']/df['height (m)']**2

Arnaud     22.598140
Beatrix    15.012197
Camille    17.263545
Derk       14.921947
dtype: float64

(Note that this is made-up data, so there is no need to check for unhealthy BMIs.)

We can then add this to the table as a new column:

In [11]:
df['bmi']=df['weight (kg)']/df['height (m)']**2
df

         height (m)  weight (kg)  eye colour        bmi
Arnaud         1.76           70        blue  22.598140
Beatrix        1.46           32       brown  15.012197
Camille        1.65           47       black  17.263545
Derk           1.32           26       brown  14.921947


Next, we can select rows by their index label with <code>df.loc</code>:

In [12]:
df.loc['Derk']

height (m)        1.32
weight (kg)         26
eye colour       brown
bmi            14.9219
Name: Derk, dtype: object

In fact, <code>df.loc</code> can also be used to look up the value in a given row and column:

In [13]:
df.loc['Derk','eye colour']

'brown'
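
Beyond single labels, <code>df.loc</code> also accepts lists of labels and a bare colon for "everything along this axis". A minimal sketch, rebuilding the table from the start so that it runs on its own:

```python
import pandas as pd

data = [[1.76, 70, 'blue'], [1.46, 32, 'brown'],
        [1.65, 47, 'black'], [1.32, 26, 'brown']]
df = pd.DataFrame(data,
                  columns=['height (m)', 'weight (kg)', 'eye colour'],
                  index=['Arnaud', 'Beatrix', 'Camille', 'Derk'])

# A list of row labels and a list of column labels selects a sub-table.
subset = df.loc[['Arnaud', 'Derk'], ['height (m)', 'eye colour']]

# A colon selects everything along that axis: here, all rows of one column.
heights = df.loc[:, 'height (m)']

print(subset)
print(heights)
```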

We can also change values:

In [14]:
df.loc['Arnaud','weight (kg)']=71
df

         height (m)  weight (kg)  eye colour        bmi
Arnaud         1.76           71        blue  22.598140
Beatrix        1.46           32       brown  15.012197
Camille        1.65           47       black  17.263545
Derk           1.32           26       brown  14.921947


(Note that the <code>bmi</code> column doesn't update automatically: it was computed once from the old weight.)
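
A derived column is a snapshot of the values at the time it was computed, so after a change it has to be recomputed by hand. The following sketch, on a reduced two-row table, shows the stale value and the refreshed one (the variable names <code>stale_bmi</code> and <code>fresh_bmi</code> are only for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1.76, 70], [1.46, 32]],
                  columns=['height (m)', 'weight (kg)'],
                  index=['Arnaud', 'Beatrix'])
df['bmi'] = df['weight (kg)'] / df['height (m)']**2

df.loc['Arnaud', 'weight (kg)'] = 71
stale_bmi = df.loc['Arnaud', 'bmi']      # still based on the old 70 kg

# Recompute the column so that it reflects the new weight.
df['bmi'] = df['weight (kg)'] / df['height (m)']**2
fresh_bmi = df.loc['Arnaud', 'bmi']      # now based on 71 kg

print(stale_bmi, fresh_bmi)
```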

Next, we can select rows by column values. To do so, note that we can obtain a boolean vector by checking some condition for a column:

In [15]:
df['eye colour']=='brown'

Arnaud     False
Beatrix     True
Camille    False
Derk        True
Name: eye colour, dtype: bool

...or 

In [16]:
df['height (m)']>1.5

Arnaud      True
Beatrix    False
Camille     True
Derk       False
Name: height (m), dtype: bool

This boolean vector can be used to select rows, namely those where the value is <code>True</code>:

In [17]:
selection_vector=df['eye colour']=='brown'
df[selection_vector]

         height (m)  weight (kg)  eye colour        bmi
Beatrix        1.46           32       brown  15.012197
Derk           1.32           26       brown  14.921947


...or slightly shorter:

In [18]:
df[df['height (m)']>1.5]

         height (m)  weight (kg)  eye colour        bmi
Arnaud         1.76           71        blue  22.598140
Camille        1.65           47       black  17.263545


Different criteria may be combined by operating on the boolean selection vectors with logical operators such as <code>&</code> (and) and <code>|</code> (or).

In [19]:
eye_colour= df['eye colour']!='brown' ## "!=" is "not equal"
weight= df['weight (kg)']<60

df[eye_colour & weight]

         height (m)  weight (kg)  eye colour        bmi
Camille        1.65           47       black  17.263545


In [20]:
df[eye_colour | weight]

         height (m)  weight (kg)  eye colour        bmi
Arnaud         1.76           71        blue  22.598140
Beatrix        1.46           32       brown  15.012197
Camille        1.65           47       black  17.263545
Derk           1.32           26       brown  14.921947
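
When the conditions are written inline rather than stored in variables first, each comparison needs its own parentheses, because <code>&</code> and <code>|</code> bind more tightly than comparisons such as <code>!=</code> and <code><</code>. The sketch below also uses <code>~</code>, which negates a boolean vector. It rebuilds the table with the original weights so that it runs on its own:

```python
import pandas as pd

data = [[1.76, 70, 'blue'], [1.46, 32, 'brown'],
        [1.65, 47, 'black'], [1.32, 26, 'brown']]
df = pd.DataFrame(data,
                  columns=['height (m)', 'weight (kg)', 'eye colour'],
                  index=['Arnaud', 'Beatrix', 'Camille', 'Derk'])

# Parentheses around each comparison are required when combining inline.
light_non_brown = df[(df['eye colour'] != 'brown') & (df['weight (kg)'] < 60)]

# ~ flips True and False, so this selects everyone without brown eyes.
not_brown = df[~(df['eye colour'] == 'brown')]

print(light_non_brown)
print(not_brown)
```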


Row selection by value and column selection can be combined. For this, we need to use <code>df.loc</code> again.

In [21]:
df.loc[eye_colour & weight,"bmi"]

Camille    17.263545
Name: bmi, dtype: float64

If you just want the raw values without the index, use the <code>values</code> attribute.

In [22]:
df.loc[eye_colour & weight,"bmi"].values

array([17.26354454])

Note that this returns a NumPy array rather than a single number, as more than one row could satisfy the selection criteria.
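
As far as I know, the pandas documentation recommends the <code>to_numpy()</code> method over the <code>values</code> attribute, and <code>tolist()</code> gives a plain Python list if that is what is needed. A short sketch, assuming the original made-up data and a selection vector called <code>mask</code> (a name chosen here for illustration):

```python
import pandas as pd

data = [[1.76, 70, 'blue'], [1.46, 32, 'brown'],
        [1.65, 47, 'black'], [1.32, 26, 'brown']]
df = pd.DataFrame(data,
                  columns=['height (m)', 'weight (kg)', 'eye colour'],
                  index=['Arnaud', 'Beatrix', 'Camille', 'Derk'])
df['bmi'] = df['weight (kg)'] / df['height (m)']**2

mask = (df['eye colour'] != 'brown') & (df['weight (kg)'] < 60)

bmi_array = df.loc[mask, 'bmi'].to_numpy()   # NumPy array of the selected values
bmi_list = df.loc[mask, 'bmi'].tolist()      # plain Python list of floats
first_bmi = bmi_array[0]                     # a single number; needs at least one match

print(bmi_array, bmi_list, first_bmi)
```

Indexing with <code>[0]</code> fails if no row matches, so it is worth checking the length of the result first.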

pandas can do a lot more: you can compute statistics on tables, perform complicated data manipulation, fill in missing values, and so on. For details, see the [pandas documentation](https://pandas.pydata.org/docs/).
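
As a small taste of the statistics and missing-value handling mentioned above, here is a minimal sketch. The missing height for Camille is introduced artificially for the example, and filling it with the column mean is just one possible choice, not a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'height (m)': [1.76, 1.46, np.nan, 1.32],   # NaN marks a missing value
                   'weight (kg)': [70, 32, 47, 26]},
                  index=['Arnaud', 'Beatrix', 'Camille', 'Derk'])

mean_weight = df['weight (kg)'].mean()    # average of one column
summary = df.describe()                   # count, mean, std, min, quartiles, max per column

# Replace the missing height by the mean of the known heights.
filled_height = df['height (m)'].fillna(df['height (m)'].mean())

print(mean_weight)
print(summary)
print(filled_height)
```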