# First steps with pandas

Klaus-Dieter Warzecha

![giant panda](img/panda640.jpg "Giant panda")


## pandas: Python Data Analysis Library

http://pandas.pydata.org/

> pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.


# How to import it?

In [None]:
import numpy as np
import pandas as pd

# Major objects
  - **Series**
  - **DataFrame**

# Series

## A first example

In [None]:
length_in_d = (865, 750, 727, 647, 569, 560, 413, 384, 382, 371)

print(length_in_d)
print("The data structure is of type {}.".format(type(length_in_d)))
print("The first element is of type {}.".format(type(length_in_d[0])))

In [None]:
ser = pd.Series(length_in_d) # read tuple into a pandas Series object
print(ser)

### Lesson learned

A pandas Series

  - wraps a one-dimensional NumPy array (ndarray)
  - has an additional index that labels each value
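
To make the two parts visible, here is a minimal sketch that pulls a Series apart into its values and its index. The name `demo` and its three numbers are illustrative assumptions, not part of the data used above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen only for illustration.
demo = pd.Series([865, 750, 727])

values = demo.to_numpy()  # the underlying data as a NumPy array
labels = demo.index       # the default index counts from 0

print(type(values))
print(labels)
print(demo.dtype)
```

Without an explicit index, pandas falls back to a zero-based integer range.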

## How to change the index?

In [None]:
names = ['Rhein', 'Weser', 'Elbe', 'Donau', 'Main', 
         'Havel', 'Saale', 'Neckar', 'Spree', 'Ems']

ser.index = names
print(ser)

### Lesson learned

A pandas Series

  - can use strings (labels) as its index instead of integers

## Indexing and slicing

In [None]:
rivers = ser.copy() # just to be sure ;-)
rivers

In [None]:
rivers[['Saale', 'Weser']] # we can pass a list of index labels

In [None]:
rivers['Elbe':'Spree':2] # label-based slices work too, and they include the end label

In [None]:
rivers[-1: 3: -1] # the numeric positional index is still accessible

In [None]:
other = pd.Series([2210, 518, 368], index=['Donau', 'Elbe', 'Rhein']) # some rivers just pass through Germany
other

## Series can be added

In [None]:
total = rivers.add(other, fill_value=0)

In [None]:
total

### Lesson learned
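Series are aligned on their index labels before any arithmetic; `fill_value=0` stands in for labels that are missing on one side.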
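
As a small sketch of what alignment means in practice, compare the plain `+` operator with `.add(..., fill_value=0)`. The Series `a` and `b` and their values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical values chosen for illustration.
a = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
b = pd.Series([1, 2], index=['z', 'x'])

plain = a + b                    # 'y' has no partner in b, so the result is NaN there
filled = a.add(b, fill_value=0)  # a missing partner is treated as 0

print(plain)
print(filled)
```

The order of the labels in `b` does not matter: values are matched by label, not by position.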

## A Game of <strike>Thrones</strike> Indices

In [None]:
import string

# just add some characters
chars = list(string.ascii_lowercase + string.digits + string.ascii_uppercase)
ascii_codes = [ord(char) for char in chars]

### Build and inspect a simple Series of characters

In [None]:
stuff = pd.Series(chars)

print(stuff.head())
print(stuff.tail())
print(stuff.size)

### Use the ASCII code of the characters as index

In [None]:
stuff.index = ascii_codes

In [None]:
stuff.head()

In [None]:
stuff.tail()

In [None]:
stuff[86] # label lookup: returns the element labelled 86 ('V'), not the 87th element

In [None]:
stuff[0] # raises a KeyError: no element carries the label 0

### Are we doomed?

## Can I still use the zero-based positional index?

In [None]:
stuff.iloc[0]

In [None]:
stuff.iloc[-1]

In [None]:
stuff.iloc[[2,12,3]]

In [None]:
stuff.iloc[3:30:7]

### Lesson learned

  - We can use **.iloc** for 0-based positional indexing!
  - Check out the other accessors in the docs, e.g. **.loc** (label-based) and **.at** / **.iat** (fast scalar access). The older **.ix** was removed in pandas 1.0.
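
The difference between label-based and position-based access can be sketched on a tiny Series with integer labels that do not start at 0, similar to the ASCII-code index above. The name `s` and its contents are assumptions made for this example.

```python
import pandas as pd

# Integer labels that do not start at 0, similar to the ASCII-code index.
s = pd.Series(['a', 'b', 'c'], index=[97, 98, 99])

by_label = s.loc[97]        # label-based: the element whose label is 97
by_position = s.iloc[0]     # position-based: the first element
scalar_label = s.at[99]     # scalar access by label
scalar_position = s.iat[2]  # scalar access by position

# Label slices include the end point, positional slices do not.
label_slice = s.loc[97:98]    # two elements
position_slice = s.iloc[0:1]  # one element
```

Spelling out `.loc` or `.iloc` avoids any ambiguity when the index itself consists of integers.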

# DataFrames

In [None]:
!head data/alice.csv

## Read data from a CSV file into a data frame

In [None]:
stuff = pd.read_csv('data/alice.csv')

In [None]:
stuff.dtypes

In [None]:
jogging = pd.read_csv('data/alice.csv', parse_dates=True, index_col=0)
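
What `parse_dates=True` together with `index_col=0` changes can be sketched without the file on disk. The CSV text below is a hypothetical stand-in; the real `data/alice.csv` may have a different layout.

```python
import io
import pandas as pd

# Hypothetical CSV content; the real data/alice.csv may look different.
csv_text = """date,distance,minutes
2015-01-01,5.0,30
2015-01-03,7.5,48
"""

# Default call: the dates stay an ordinary column of strings.
raw = pd.read_csv(io.StringIO(csv_text))

# First column becomes the index, and pandas tries to parse it as dates.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col=0)

print(raw.dtypes)
print(type(parsed.index))
```

A date-based index is what makes time-aware operations such as resampling possible later on.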

## Get some info on the DataFrame

In [None]:
jogging.info()

In [None]:
jogging.columns

In [None]:
jogging.index

In [None]:
jogging.head()

## Advice: use column titles **without** spaces

In [None]:
jogging['minutes'][:10:2]
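
The reason behind this advice is attribute access: `df.some_column` only works if the column title is a valid Python identifier. A minimal sketch, assuming a hypothetical frame `df` with a space in one title:

```python
import pandas as pd

# Hypothetical frame with a space in one column title.
df = pd.DataFrame({'distance km': [5.0, 7.5], 'minutes': [30, 48]})

# Bracket access always works, even with spaces in the title.
first = df['distance km']

# Replace spaces with underscores so that attribute access becomes possible.
df.columns = df.columns.str.replace(' ', '_')
second = df.distance_km
```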

In [None]:
jogging.tail()

## Get some statistics

In [None]:
jogging.describe()

In [None]:
jogging['pace'] = jogging.minutes / jogging.distance  # pace: minutes per unit of distance

In [None]:
jogging.head()

In [None]:
jogging.describe()

## Let's make a copy of the data frame

In [None]:
alice = jogging.copy()

In [None]:
alice.head()

In [None]:
alice.distance.sum()

In [None]:
%matplotlib inline
alice.distance.plot(kind='hist', bins=10, figsize=(10,6))

In [None]:
weekly = alice.distance.resample('W-MON').sum()
weekly.plot(figsize=(10,6))
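
To see what `resample('W-MON')` does, here is a sketch on made-up data: fourteen days with a distance of 1.0 each. The names `days`, `dist` and `weekly_demo` are assumptions for this example only.

```python
import pandas as pd

# Hypothetical daily distances over two weeks, starting on a Monday.
days = pd.date_range('2015-01-05', periods=14, freq='D')
dist = pd.Series(1.0, index=days)

# Group the days into weekly bins that end on a Monday, then sum each bin.
weekly_demo = dist.resample('W-MON').sum()

print(weekly_demo)
```

Each row of the result is labelled with the Monday that closes its bin, and the weekly sums add up to the overall total.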

# Let's plot some UV data

In [None]:
absorb = pd.read_csv('data/022-abs.txt', skiprows=23, header=None, delimiter='\t', 
                     index_col=0, names=['wavelength', 'absorbance'])

In [None]:
absorb.head()

In [None]:
%matplotlib inline
absorb.plot()

Issues with the default plot:

  - wavelength range too large
  - legend is not helpful
  - no y-label
  - no title

The `df.plot()` method returns a matplotlib `Axes` object (`matplotlib.axes.Axes`), so the plot can be customized afterwards :-)

In [None]:
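import matplotlib.pyplot as plt

plt.style.use('ggplot')  # use the ggplot style sheet

# boolean mask: keep only index values (wavelengths) between 280 and 420
relevant = (absorb.index > 280) & (absorb.index < 420)

# DataFrame.plot creates its own figure and returns the Axes
ax = absorb[relevant].plot(legend=False,
                           figsize=(10, 8),
                           title='UV absorption of anthracene')

ax.set_ylabel('absorbance / A.U.')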