# Numpy + Pandas I

## Numpy

pandas is built atop `NumPy`, historically and in the actual library, so it's helpful to have a basic understanding of it.

### ndarray

The core of NumPy is the `ndarray`, an N-dimensional array. These are singly-typed, fixed-length data containers.
NumPy also provides many convenient and fast methods implemented on the `ndarray`.

In [1]:
import numpy as np
# per convention

In [2]:
x = np.array([1, 2, 3])
x

array([1, 2, 3])

In [3]:
x.dtype

dtype('int64')

In [4]:
y = np.array([[True, False], [False, True]])
y

array([[ True, False],
       [False,  True]], dtype=bool)

In [5]:
z = np.array([1., 2, 3])
z

array([ 1.,  2.,  3.])

In [6]:
s = np.array(['hello', 1])
s

array(['hello', '1'], 
      dtype='<U5')

In [7]:
x = np.array([1, 2, 3])
x.shape

(3,)

In [8]:
y = np.array([[1., 2., 3.], [4., 5., 6.]])
print(y)
print(y.shape)

[[ 1.  2.  3.]
 [ 4.  5.  6.]]
(2, 3)


### dtypes

Unlike python lists, NumPy arrays care about the type of data stored within.
The full list of NumPy dtypes can be found in the [NumPy documentation](http://docs.scipy.org/doc/numpy/user/basics.types.html).

![dtypes](http://docs.scipy.org/doc/numpy/_images/dtype-hierarchy.png)

We sacrifice the convenience of mixing bools, ints, and floats within an array for much better performance.
However, an unexpected `dtype` change will probably bite you at some point in the future.

The three biggest things to remember are

- Missing values (NaN) cast integer or boolean arrays to floats
- NumPy arrays only have a single dtype for every element
- the object dtype is the fallback

You'll want to avoid object dtypes. They're typically slow.
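
The dtype rules listed above can be checked directly. The following is a minimal sketch; the array names are made up for illustration, and the integer width (`int64` vs `int32`) depends on the platform.

```python
import numpy as np

ints = np.array([1, 2, 3])                      # integer dtype (width is platform dependent)
with_nan = np.array([1, 2, np.nan])             # NaN is a float, so the whole array is float64
bools_nan = np.array([True, False, np.nan])     # booleans are upcast to float64 as well
mixed = np.array([1, "a", None], dtype=object)  # object dtype holds arbitrary Python objects

for arr in (ints, with_nan, bools_nan, mixed):
    print(arr.dtype, arr)
```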

In [9]:
np.nan

nan

In [10]:
np.nan * 1.

nan

In [11]:
type(np.nan)

float

### Vectorization

_dtypes_ and _vectorization_ are part of what makes NumPy fast.

In [12]:
x = np.random.randint(0, 10, 10)
y = np.random.randint(0, 10, 10)
x, y

(array([9, 0, 8, 8, 2, 5, 4, 9, 8, 1]), array([7, 7, 2, 8, 4, 4, 0, 1, 3, 0]))

In [13]:
[x[i] + y[i] for i in range(10)]

[16, 7, 10, 16, 6, 9, 4, 10, 11, 1]

In [14]:
x + y

array([16,  7, 10, 16,  6,  9,  4, 10, 11,  1])

In [15]:
a = np.random.rand(100000)
b = np.random.rand(100000)

In [16]:
%timeit [a[i] + b[i] for i in range(a.shape[0])]

10 loops, best of 3: 33.4 ms per loop


In [17]:
%timeit a + b

The slowest run took 19.52 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 107 µs per loop


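### 1 ms = 1000 $\mu$s

So the vectorized `a + b` (107 µs) ran roughly 300 times faster than the list comprehension (33.4 ms).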

### Broadcasting

It's super cool and super useful. 

The one-line explanation is that when doing elementwise operations, things expand to the "correct" shape.

In [18]:
x = np.arange(5)
x

array([0, 1, 2, 3, 4])

Adding scalars

In [19]:
x + 1

array([1, 2, 3, 4, 5])

In [20]:
y = np.random.uniform(size=(2, 5))
y

array([[ 0.85499094,  0.20632132,  0.06903616,  0.14402684,  0.81976805],
       [ 0.70942004,  0.47069417,  0.31006931,  0.42700851,  0.77755918]])

In [21]:
y + 1

array([[ 1.85499094,  1.20632132,  1.06903616,  1.14402684,  1.81976805],
       [ 1.70942004,  1.47069417,  1.31006931,  1.42700851,  1.77755918]])

Adding other arrays

In [22]:
print(x)
print(y)

[0 1 2 3 4]
[[ 0.85499094  0.20632132  0.06903616  0.14402684  0.81976805]
 [ 0.70942004  0.47069417  0.31006931  0.42700851  0.77755918]]


In [23]:
x + y

array([[ 0.85499094,  1.20632132,  2.06903616,  3.14402684,  4.81976805],
       [ 0.70942004,  1.47069417,  2.31006931,  3.42700851,  4.77755918]])

In [24]:
z = x + y
print(z.dtype, z.shape)

float64 (2, 5)


In [25]:
a = np.ones((5,2))
print(a)
b = np.ones((2,5))
print(b)

[[ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]]
[[ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]]


In [26]:
a + b

ValueError: operands could not be broadcast together with shapes (5,2) (2,5) 

In [27]:
print(b)
print(b.T)

[[ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]]
[[ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]]


In [28]:
a + b.T

array([[ 2.,  2.],
       [ 2.,  2.],
       [ 2.,  2.],
       [ 2.,  2.],
       [ 2.,  2.]])
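
The shape rule behind the examples above can be sketched in a few lines: shapes are compared from the trailing dimension backwards, and two dimensions are compatible when they are equal or one of them is 1. The helper `broadcast_shape` is a hypothetical name written for this illustration, not a NumPy function; `np.broadcast` is used only to cross-check it.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Return the shape two arrays would broadcast to, or raise ValueError."""
    result = []
    ndim = max(len(shape_a), len(shape_b))
    # Walk both shapes from the last dimension to the first
    for i in range(1, ndim + 1):
        a = shape_a[-i] if i <= len(shape_a) else 1  # missing leading dims act like 1
        b = shape_b[-i] if i <= len(shape_b) else 1
        if a != b and a != 1 and b != 1:
            raise ValueError(f"incompatible shapes {shape_a} and {shape_b}")
        result.append(b if a == 1 else a)
    return tuple(reversed(result))

print(broadcast_shape((5,), (2, 5)))                    # (2, 5)
print(np.broadcast(np.ones(5), np.ones((2, 5))).shape)  # NumPy's own answer
```

Calling `broadcast_shape((5, 2), (2, 5))` raises `ValueError`, matching the failed `a + b` above.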

### Element-wise operations

- Arithmetic: `+`, `-`, `*`, `/`, `**`
- Comparisons: `==`, `!=`, `<`, `>`, `<=`, `>=`

In [29]:
x * y

array([[ 0.        ,  0.20632132,  0.13807232,  0.43208051,  3.2790722 ],
       [ 0.        ,  0.47069417,  0.62013861,  1.28102554,  3.11023673]])

In [30]:
x**2

array([ 0,  1,  4,  9, 16])

In [31]:
x < y

array([[ True, False, False, False, False],
       [ True, False, False, False, False]], dtype=bool)

In [32]:
x < 2

array([ True,  True, False, False, False], dtype=bool)

### Array methods

- `.sum()`, `.mean()`, `.min()`, `.max()`, `.prod()`, `.cumsum()`, etc
- `np.exp`, `np.log`, `np.sin`, `np.sqrt` etc

In [34]:
y

array([[ 0.85499094,  0.20632132,  0.06903616,  0.14402684,  0.81976805],
       [ 0.70942004,  0.47069417,  0.31006931,  0.42700851,  0.77755918]])

In [35]:
y.sum()

4.7888945249725277

In [36]:
y.sum(axis=0)

array([ 1.56441098,  0.67701549,  0.37910547,  0.57103535,  1.59732723])

In [37]:
y.sum(axis=1)

array([ 2.0941433 ,  2.69475122])

In [38]:
y.prod()

4.9429668903378281e-05

In [39]:
y.mean()

0.47888945249725279

In [40]:
print(y.min(), y.max())

0.0690361610206 0.854990938989


In [41]:
print(x)
print(x.cumsum())

[0 1 2 3 4]
[ 0  1  3  6 10]


In [42]:
print(y)
print(y.cumsum(axis=0))

[[ 0.85499094  0.20632132  0.06903616  0.14402684  0.81976805]
 [ 0.70942004  0.47069417  0.31006931  0.42700851  0.77755918]]
[[ 0.85499094  0.20632132  0.06903616  0.14402684  0.81976805]
 [ 1.56441098  0.67701549  0.37910547  0.57103535  1.59732723]]


In [43]:
z = np.arange(1,6)
print(z)
print(np.log(z))

[1 2 3 4 5]
[ 0.          0.69314718  1.09861229  1.38629436  1.60943791]


In [44]:
np.exp(z)

array([   2.71828183,    7.3890561 ,   20.08553692,   54.59815003,
        148.4131591 ])
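
The `axis` argument used above names the dimension that gets collapsed. A small sketch on a hypothetical 2x3 integer array makes the results easy to check by hand:

```python
import numpy as np

m = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
col_totals = m.sum(axis=0)       # collapse the rows: one value per column
row_totals = m.sum(axis=1)       # collapse the columns: one value per row

print(col_totals)  # [3 5 7]
print(row_totals)  # [ 3 12]
```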

### Numpy indexing and slicing

Indexing works similarly to list indexing, but across multiple dimensions.

In [45]:
x = np.arange(50)
x

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])

In [46]:
y = x.reshape((5,10))
y

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]])

In [47]:
y[0]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [48]:
y[1]

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [49]:
y[10]

IndexError: index 10 is out of bounds for axis 0 with size 5

In [50]:
y[1][0]

10

In [51]:
y[1,0]

10

In [52]:
y[0:,0]

array([ 0, 10, 20, 30, 40])

In [53]:
y[0,0:]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [54]:
y[2:4,4:8]

array([[24, 25, 26, 27],
       [34, 35, 36, 37]])

In [55]:
y[2:4,4:8:2]

array([[24, 26],
       [34, 36]])

In [56]:
y[-1,]

array([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])

In [57]:
z = np.arange(60).reshape((3,4,5))
z

array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]],

       [[20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29],
        [30, 31, 32, 33, 34],
        [35, 36, 37, 38, 39]],

       [[40, 41, 42, 43, 44],
        [45, 46, 47, 48, 49],
        [50, 51, 52, 53, 54],
        [55, 56, 57, 58, 59]]])

In [58]:
z[1,1,1]

26

In [59]:
z[1,:,1]

array([21, 26, 31, 36])

### Boolean indexing

If you pass in an array of boolean values of a compatible shape, you can index that way, too.

In [60]:
x = np.arange(5)
x

array([0, 1, 2, 3, 4])

In [61]:
x[np.array([False,True,False,True,False])]

array([1, 3])

In [62]:
even = (x % 2 == 0)
even

array([ True, False,  True, False,  True], dtype=bool)

In [63]:
x[even]

array([0, 2, 4])

In [64]:
lt3 = x < 3
lt3

array([ True,  True,  True, False, False], dtype=bool)

In [65]:
x[lt3]

array([0, 1, 2])
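
Boolean masks can also be combined and used for assignment. This is a short sketch, assuming the same `x = np.arange(5)` as above; note that `&` and `|` need parentheses around each comparison.

```python
import numpy as np

x = np.arange(5)

both = x[(x % 2 == 0) & (x < 3)]    # even AND less than 3
either = x[(x % 2 == 0) | (x < 3)]  # even OR less than 3

clipped = x.copy()                  # copy so that x itself is left untouched
clipped[clipped > 2] = 0            # assign through a mask

print(both)     # [0 2]
print(either)   # [0 1 2 4]
print(clipped)  # [0 1 2 0 0]
```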

**pandas** is a Python package providing fast, flexible, and expressive data structures designed to work with both *relational* and *labeled* data. 

It is a fundamental high-level building block for doing practical, real world data analysis in Python. 

## Why pandas?

NumPy is great. But it lacks a few things that are conducive to doing statistical analysis. By building on top of NumPy, pandas provides:

- labeled arrays

- heterogeneous data types within a table

- "better" missing data handling

- convenient methods (`groupby`, `rolling`, `resample`)

- more data types (Categorical, Datetime)

- indexing handles data alignment

- built in functionality for reading/writing many kinds of files

In [66]:
from IPython.display import HTML
HTML('<iframe src="http://pandas.pydata.org" width="1024" height="500"></iframe>')

## Pandas Data Structures

### Series

A **Series** is a single vector of data with an *index* that labels each element in the vector

In [67]:
import pandas as pd

In [68]:
counts = pd.Series([632, 1638, 569, 115])
counts

0     632
1    1638
2     569
3     115
dtype: int64

If an index is not specified, a default sequence of integers is assigned as the index. A numpy array comprises the values of the `Series`, while the index is a pandas `Index` object.

In [69]:
counts.values

array([ 632, 1638,  569,  115])

In [70]:
counts.index

RangeIndex(start=0, stop=4, step=1)

We can assign meaningful labels to the index, if they are available:

In [71]:
bacteria = pd.Series([632, 1638, 569, 115], 
    index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

In [72]:
bacteria

Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
dtype: int64

These labels can be used to refer to the values in the `Series`.

In [73]:
bacteria['Actinobacteria']

569

In [74]:
bacteria['Firmicutes':'Actinobacteria']

Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
dtype: int64

Notice that the indexing operation preserved the association between the values and the corresponding indices.
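
One detail of the label slice above is worth spelling out: slicing by label includes the end label, whereas slicing by position excludes the end position. A minimal sketch using the explicit `.loc` and `.iloc` accessors:

```python
import pandas as pd

bacteria = pd.Series(
    [632, 1638, 569, 115],
    index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'],
)

by_label = bacteria.loc['Firmicutes':'Actinobacteria']  # end label is included
by_position = bacteria.iloc[0:2]                        # end position is excluded

print(len(by_label))     # 3
print(len(by_position))  # 2
```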

We can still use positional indexing if we wish.

In [75]:
bacteria[0]

632

Or we can index with boolean arrays/lists:

In [76]:
bac = [name.endswith('bacteria') for name in bacteria.index]
bac

[False, True, True, False]

In [77]:
bacteria[bac]

Proteobacteria    1638
Actinobacteria     569
dtype: int64

We can give both the array of values and the index meaningful labels themselves:

In [78]:
bacteria.name = 'counts'
bacteria.index.name = 'phylum'
bacteria

phylum
Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
Name: counts, dtype: int64

We can also filter according to the values in the `Series`:

In [79]:
bacteria[bacteria > 1000]

phylum
Proteobacteria    1638
Name: counts, dtype: int64

In [80]:
bacteria > 1000

phylum
Firmicutes        False
Proteobacteria     True
Actinobacteria    False
Bacteroidetes     False
Name: counts, dtype: bool

A `Series` can be thought of as an ordered key-value store. In fact, we can create one from a `dict`:

In [81]:
bacteria_dict = {'Firmicutes': 632, 
                 'Proteobacteria': 1638, 
                 'Actinobacteria': 569, 
                 'Bacteroidetes': 115}
pd.Series(bacteria_dict)

Actinobacteria     569
Bacteroidetes      115
Firmicutes         632
Proteobacteria    1638
dtype: int64

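Notice that the `Series` is created in key-sorted order. This reflects the pandas version used here; newer pandas releases preserve the dict's insertion order instead.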

If we pass a custom index to `Series`, it will select the corresponding values from the dict, and treat indices without corresponding values as missing. Pandas uses the `NaN` (not a number) value to mark missing values.

In [82]:
bacteria2 = pd.Series(bacteria_dict, index=['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria'])
bacteria2

Cyanobacteria        NaN
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64

In [83]:
bacteria2.isnull()

Cyanobacteria      True
Firmicutes        False
Proteobacteria    False
Actinobacteria    False
dtype: bool

Critically, the labels are used to **align data** when used in operations with other Series objects:

In [84]:
bacteria + bacteria2

Actinobacteria    1138.0
Bacteroidetes        NaN
Cyanobacteria        NaN
Firmicutes        1264.0
Proteobacteria    3276.0
dtype: float64
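
If `NaN` for unmatched labels is not what you want, the method form of the operator accepts a `fill_value`. The sketch below uses two small made-up Series; treating a missing count as 0 is an assumption that only makes sense for some data.

```python
import pandas as pd

a = pd.Series({'Firmicutes': 632, 'Proteobacteria': 1638})
b = pd.Series({'Proteobacteria': 10, 'Cyanobacteria': 5})

plain = a + b                  # labels present in only one Series give NaN
filled = a.add(b, fill_value=0)  # a missing side is treated as 0 before adding

print(plain)
print(filled)
```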

### DataFrame

Inevitably, we want to be able to store, view and manipulate data that is *multivariate*, where for every index there are multiple fields or columns of data, often of varying data type.

A `DataFrame` is a tabular data structure, encapsulating multiple series like columns in a spreadsheet. Data are stored internally as a 2-dimensional object, but the `DataFrame` allows us to represent and manipulate higher-dimensional data.

In [85]:
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient':[1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
                               'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 
                               'Actinobacteria', 'Bacteroidetes']})

In [86]:
data

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,115
4,2,Firmicutes,433
5,2,Proteobacteria,1130
6,2,Actinobacteria,754
7,2,Bacteroidetes,555


Notice the `DataFrame` is sorted by column name (newer pandas releases keep the dict's insertion order instead). We can change the order by indexing them in the order we desire:

In [87]:
data[['phylum','value','patient']]

Unnamed: 0,phylum,value,patient
0,Firmicutes,632,1
1,Proteobacteria,1638,1
2,Actinobacteria,569,1
3,Bacteroidetes,115,1
4,Firmicutes,433,2
5,Proteobacteria,1130,2
6,Actinobacteria,754,2
7,Bacteroidetes,555,2


A `DataFrame` has a second index, representing the columns:

In [88]:
data.columns

Index(['patient', 'phylum', 'value'], dtype='object')

If we wish to access columns, we can do so either by dict-like indexing or by attribute:

In [89]:
data['value']

0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64

In [90]:
data.value

0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64

In [91]:
type(data.value)

pandas.core.series.Series

In [92]:
type(data[['value']])

pandas.core.frame.DataFrame

Notice this is different than with `Series`, where dict-like indexing retrieved a particular element (row). 

### Indexing `DataFrame`s

If we want access to a row in a `DataFrame`, we index its `loc` or `iloc` attribute.

In [93]:
np.random.seed(0)
data = pd.DataFrame({'v1': np.random.rand(4), 
                     'v2': np.random.rand(4)}, 
                    index=['w','x','y','z'])
data.head()

Unnamed: 0,v1,v2
w,0.548814,0.423655
x,0.715189,0.645894
y,0.602763,0.437587
z,0.544883,0.891773


### `.loc`

To access something by its _label_, use `.loc[]`:

In [94]:
data.loc['y']

v1    0.602763
v2    0.437587
Name: y, dtype: float64

In [95]:
data.loc[['x','y']]

Unnamed: 0,v1,v2
x,0.715189,0.645894
y,0.602763,0.437587


In [96]:
data.loc[['x','y'],'v1']

x    0.715189
y    0.602763
Name: v1, dtype: float64

In [97]:
data.loc[['x','y'],'v2']

x    0.645894
y    0.437587
Name: v2, dtype: float64

In [98]:
data.loc['w':'y',['v1','v2']]

Unnamed: 0,v1,v2
w,0.548814,0.423655
x,0.715189,0.645894
y,0.602763,0.437587


### `.iloc`

To access something by its integer position (row/column number) instead, use `.iloc[]`:

In [99]:
data.iloc[2]

v1    0.602763
v2    0.437587
Name: y, dtype: float64

In [100]:
data.iloc[:2]

Unnamed: 0,v1,v2
w,0.548814,0.423655
x,0.715189,0.645894


In [101]:
data.iloc[:2,0]

w    0.548814
x    0.715189
Name: v1, dtype: float64

In [102]:
data.iloc[0,:]

v1    0.548814
v2    0.423655
Name: w, dtype: float64

### Boolean indexing

`.loc` _also_ supports boolean indices

In [103]:
data.v1 > .6

w    False
x     True
y     True
z    False
Name: v1, dtype: bool

In [104]:
data.loc[data.v1 > .6]

Unnamed: 0,v1,v2
x,0.715189,0.645894
y,0.602763,0.437587


In [105]:
data.loc[(data.v1 > .6) & (data.v2 < .5)]

Unnamed: 0,v1,v2
y,0.602763,0.437587


In [106]:
data.loc[(data.v1 > .6) & (data.v2 < .5), 'v2']

y    0.437587
Name: v2, dtype: float64

### `.ix`

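If you want to live dangerously, you can use `.ix[]`, which lets you mix integers and labels but does some guesswork. Note that `.ix` was deprecated in pandas 0.20 and removed in pandas 1.0, so the cells below only run on older versions; prefer `.loc` and `.iloc`.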

Why am I even telling you this? 

# 👴

In [107]:
data.ix['w']

v1    0.548814
v2    0.423655
Name: w, dtype: float64

In [108]:
data.ix[0]

v1    0.548814
v2    0.423655
Name: w, dtype: float64

In [109]:
data.ix['w', 0]

0.54881350392732475

In [110]:
data.ix[2:, 'v2']

y    0.437587
z    0.891773
Name: v2, dtype: float64

In [111]:
data.ix[data.index == 'w', 'v2']

w    0.423655
Name: v2, dtype: float64
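
Since `.ix` is not available in current pandas, the mixed label/position lookups above can be written with `.loc` and `.iloc` by converting positions to labels first. This is one possible translation, rebuilt on the same seeded `data` frame; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
data = pd.DataFrame({'v1': np.random.rand(4),
                     'v2': np.random.rand(4)},
                    index=['w', 'x', 'y', 'z'])

by_label = data.loc['w']                               # like data.ix['w']
by_position = data.iloc[0]                             # like data.ix[0]
label_then_position = data.loc['w', data.columns[0]]   # like data.ix['w', 0]
position_then_label = data.loc[data.index[2:], 'v2']   # like data.ix[2:, 'v2']

print(label_then_position)
print(position_then_label)
```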

### .query()

- A convenient way to use SQL-like syntax for selecting

- But unfortunately does not allow assignment...

- 😎

- More info in [the docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html)

In [112]:
data.query('v1 < .6')

Unnamed: 0,v1,v2
w,0.548814,0.423655
z,0.544883,0.891773


In [113]:
data.query('v1 < .6 & v2 > .7')

Unnamed: 0,v1,v2
z,0.544883,0.891773
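
A `.query()` string selects the same rows as the equivalent boolean mask, and local Python variables can be referenced with `@`. Because `.query()` cannot assign, a masked `.loc` is one way to do the update. In this sketch the cutoff `threshold` and the replacement value are made-up examples.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
data = pd.DataFrame({'v1': np.random.rand(4),
                     'v2': np.random.rand(4)},
                    index=['w', 'x', 'y', 'z'])

threshold = 0.6  # hypothetical cutoff

via_query = data.query('v1 < @threshold')      # @ refers to the local variable
via_loc = data.loc[data['v1'] < threshold]     # same rows with a boolean mask

# Assignment needs .loc rather than .query()
data.loc[data['v1'] < threshold, 'v2'] = 0.0
print(data)
```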


## Operating on `DataFrame`s

Operations on `DataFrame`s are very similar to operations on `Series`, including the ability to automatically align axes.

In [114]:
df1 = pd.DataFrame.from_dict({'USA': {'lat': 37.1, 'lon': 95.7},
                              'CAN': {'lat': 56.1, 'lon': 106.3},
                              'MEX': {'lat': 23.6, 'lon': 102.6}}, orient='index')
df1

Unnamed: 0,lat,lon
CAN,56.1,106.3
MEX,23.6,102.6
USA,37.1,95.7


In [115]:
df2 = pd.DataFrame.from_dict({'USA': {'temp': 70},
                              'CAN': {'temp': 60},
                              'MEX': {'temp': 50}}, orient='index')
df2

Unnamed: 0,temp
CAN,60
MEX,50
USA,70


In [116]:
df1['lat'] / df2['temp']

CAN    0.935
MEX    0.472
USA    0.530
dtype: float64

Do we have to sort values?

In [117]:
df1 = pd.DataFrame.from_dict({'USA': {'lat': 37.1, 'lon': 95.7},
                              'CAN': {'lat': 56.1, 'lon': 106.3},
                              'MEX': {'lat': 23.6, 'lon': 102.6}}, orient='index')
df2 = pd.DataFrame.from_dict({'CAN': {'temp': 60},
                              'USA': {'temp': 70},
                              'MEX': {'temp': 50}}, orient='index')

In [118]:
df1['lat'] / df2['temp']

CAN    0.935
MEX    0.472
USA    0.530
dtype: float64

What about missing values?

In [119]:
df3 = pd.DataFrame.from_dict({'USA': {'temp': 70},
                              'CAN': {'temp': 60},
                              'MEX': {'temp': 50},
                              'GRL': {'temp': 40}}, orient='index')
df3

Unnamed: 0,temp
CAN,60
GRL,40
MEX,50
USA,70


In [120]:
df1['lat'] / df3['temp']

CAN    0.935
GRL      NaN
MEX    0.472
USA    0.530
dtype: float64
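
When the unmatched labels are not of interest, the `NaN` rows can be dropped after the operation, or the shared labels can be found up front. A small sketch with standalone Series holding the same numbers as above:

```python
import pandas as pd

lat = pd.Series({'USA': 37.1, 'CAN': 56.1, 'MEX': 23.6})
temp = pd.Series({'USA': 70, 'CAN': 60, 'MEX': 50, 'GRL': 40})

ratio = lat / temp        # GRL has no latitude, so its ratio is NaN
complete = ratio.dropna()  # keep only the labels present on both sides

common = lat.index.intersection(temp.index)  # labels shared by both Series
print(complete)
print(list(common))
```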

## Adding to and changing `DataFrame`s

In [121]:
data = pd.DataFrame.from_dict({0: {'patient': 1, 'phylum': 'Firmicutes', 'value': 632},
                               1: {'patient': 1, 'phylum': 'Proteobacteria', 'value': 1638},
                               2: {'patient': 1, 'phylum': 'Actinobacteria', 'value': 569},
                               3: {'patient': 1, 'phylum': 'Bacteroidetes', 'value': 115},
                               4: {'patient': 2, 'phylum': 'Firmicutes', 'value': 433},
                               5: {'patient': 2, 'phylum': 'Proteobacteria', 'value': 1130},
                               6: {'patient': 2, 'phylum': 'Actinobacteria', 'value': 754},
                               7: {'patient': 2, 'phylum': 'Bacteroidetes', 'value': 555}}, orient='index')
data

Unnamed: 0,phylum,patient,value
0,Firmicutes,1,632
1,Proteobacteria,1,1638
2,Actinobacteria,1,569
3,Bacteroidetes,1,115
4,Firmicutes,2,433
5,Proteobacteria,2,1130
6,Actinobacteria,2,754
7,Bacteroidetes,2,555


It's important to note that the Series returned when a DataFrame is indexed is merely a **view** on the DataFrame, and not a copy of the data itself. So you must be cautious when manipulating this data:

In [122]:
vals = data.value
vals

0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64

In [123]:
vals[5] = 0
vals

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


0     632
1    1638
2     569
3     115
4     433
5       0
6     754
7     555
Name: value, dtype: int64

In [124]:
data

Unnamed: 0,phylum,patient,value
0,Firmicutes,1,632
1,Proteobacteria,1,1638
2,Actinobacteria,1,569
3,Bacteroidetes,1,115
4,Firmicutes,2,433
5,Proteobacteria,2,0
6,Actinobacteria,2,754
7,Bacteroidetes,2,555


In [125]:
vals = data.value.copy()
vals[5] = 1000
data

Unnamed: 0,phylum,patient,value
0,Firmicutes,1,632
1,Proteobacteria,1,1638
2,Actinobacteria,1,569
3,Bacteroidetes,1,115
4,Firmicutes,2,433
5,Proteobacteria,2,0
6,Actinobacteria,2,754
7,Bacteroidetes,2,555
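
A pattern that behaves the same whether or not pandas hands back a view is to take an explicit `.copy()` for scratch work and to write changes to the frame through `.loc`. This is a sketch on a reduced, made-up two-row frame:

```python
import pandas as pd

data = pd.DataFrame({'phylum': ['Firmicutes', 'Proteobacteria'],
                     'value': [632, 1638]})

vals = data['value'].copy()      # independent copy of the column
vals.loc[1] = 0                  # changes only the copy
unchanged = data.loc[1, 'value']  # the frame still holds the original number

data.loc[1, 'value'] = 1000      # explicit, unambiguous write to the frame
print(unchanged, data.loc[1, 'value'])
```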


We can modify existing values by assignment:

In [126]:
data.loc[3, 'value'] = 14
data

Unnamed: 0,phylum,patient,value
0,Firmicutes,1,632
1,Proteobacteria,1,1638
2,Actinobacteria,1,569
3,Bacteroidetes,1,14
4,Firmicutes,2,433
5,Proteobacteria,2,0
6,Actinobacteria,2,754
7,Bacteroidetes,2,555


In [127]:
data.iloc[0, 1] = 2
data

Unnamed: 0,phylum,patient,value
0,Firmicutes,2,632
1,Proteobacteria,1,1638
2,Actinobacteria,1,569
3,Bacteroidetes,1,14
4,Firmicutes,2,433
5,Proteobacteria,2,0
6,Actinobacteria,2,754
7,Bacteroidetes,2,555


In [128]:
data['patient'] = data['patient'] + 1
data

Unnamed: 0,phylum,patient,value
0,Firmicutes,3,632
1,Proteobacteria,2,1638
2,Actinobacteria,2,569
3,Bacteroidetes,2,14
4,Firmicutes,3,433
5,Proteobacteria,3,0
6,Actinobacteria,3,754
7,Bacteroidetes,3,555


And we can also create new variables by assignment:

In [129]:
data['year'] = 2013
data

Unnamed: 0,phylum,patient,value,year
0,Firmicutes,3,632,2013
1,Proteobacteria,2,1638,2013
2,Actinobacteria,2,569,2013
3,Bacteroidetes,2,14,2013
4,Firmicutes,3,433,2013
5,Proteobacteria,3,0,2013
6,Actinobacteria,3,754,2013
7,Bacteroidetes,3,555,2013


In [130]:
data['age'] = np.repeat([5,10],4)
data

Unnamed: 0,phylum,patient,value,year,age
0,Firmicutes,3,632,2013,5
1,Proteobacteria,2,1638,2013,5
2,Actinobacteria,2,569,2013,5
3,Bacteroidetes,2,14,2013,5
4,Firmicutes,3,433,2013,10
5,Proteobacteria,3,0,2013,10
6,Actinobacteria,3,754,2013,10
7,Bacteroidetes,3,555,2013,10


In [131]:
data['iq'] = np.arange(5)

ValueError: Length of values does not match length of index

We can use attribute indexing of columns to assign values:

In [132]:
data.year = 2014
data

Unnamed: 0,phylum,patient,value,year,age
0,Firmicutes,3,632,2014,5
1,Proteobacteria,2,1638,2014,5
2,Actinobacteria,2,569,2014,5
3,Bacteroidetes,2,14,2014,5
4,Firmicutes,3,433,2014,10
5,Proteobacteria,3,0,2014,10
6,Actinobacteria,3,754,2014,10
7,Bacteroidetes,3,555,2014,10


But not to add new columns:

In [133]:
data.treatment = 1
data

Unnamed: 0,phylum,patient,value,year,age
0,Firmicutes,3,632,2014,5
1,Proteobacteria,2,1638,2014,5
2,Actinobacteria,2,569,2014,5
3,Bacteroidetes,2,14,2014,5
4,Firmicutes,3,433,2014,10
5,Proteobacteria,3,0,2014,10
6,Actinobacteria,3,754,2014,10
7,Bacteroidetes,3,555,2014,10


In [134]:
data.treatment

1

Specifying a `Series` as a new column causes its values to be added according to the `DataFrame`'s index:

In [135]:
treatment = pd.Series([0]*4 + [1]*2)
treatment

0    0
1    0
2    0
3    0
4    1
5    1
dtype: int64

In [136]:
data['treatment'] = treatment
data

Unnamed: 0,phylum,patient,value,year,age,treatment
0,Firmicutes,3,632,2014,5,0.0
1,Proteobacteria,2,1638,2014,5,0.0
2,Actinobacteria,2,569,2014,5,0.0
3,Bacteroidetes,2,14,2014,5,0.0
4,Firmicutes,3,433,2014,10,1.0
5,Proteobacteria,3,0,2014,10,1.0
6,Actinobacteria,3,754,2014,10,
7,Bacteroidetes,3,555,2014,10,
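
The `NaN` entries above also turned the column into floats. If a default is acceptable for the unmatched rows, the Series can be reindexed to the frame's index first. A minimal sketch follows; filling with 0 is an assumption made here for illustration, not something the data implies.

```python
import pandas as pd

data = pd.DataFrame({'value': [632, 1638, 569, 115, 433, 1130, 754, 555]})
treatment = pd.Series([0] * 4 + [1] * 2)  # only six labels: 0..5

data['aligned'] = treatment                                   # rows 6 and 7 become NaN
data['filled'] = treatment.reindex(data.index, fill_value=0)  # rows 6 and 7 become 0

print(data.dtypes)
```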


### `.eval()`

- Another new(ish) method similar to `.query()`
- Takes advantage of some tricks to speed things up
- See [the docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.eval.html) for more info

In [137]:
data = pd.DataFrame.from_dict({0: {'patient': 1, 'phylum': 'Firmicutes', 'value': 632},
                               1: {'patient': 1, 'phylum': 'Proteobacteria', 'value': 1638},
                               2: {'patient': 1, 'phylum': 'Actinobacteria', 'value': 569},
                               3: {'patient': 1, 'phylum': 'Bacteroidetes', 'value': 115},
                               4: {'patient': 2, 'phylum': 'Firmicutes', 'value': 433},
                               5: {'patient': 2, 'phylum': 'Proteobacteria', 'value': 1130},
                               6: {'patient': 2, 'phylum': 'Actinobacteria', 'value': 754},
                               7: {'patient': 2, 'phylum': 'Bacteroidetes', 'value': 555}}, orient='index')
data

Unnamed: 0,phylum,patient,value
0,Firmicutes,1,632
1,Proteobacteria,1,1638
2,Actinobacteria,1,569
3,Bacteroidetes,1,115
4,Firmicutes,2,433
5,Proteobacteria,2,1130
6,Actinobacteria,2,754
7,Bacteroidetes,2,555


In [138]:
data.eval('patient + value')

0     633
1    1639
2     570
3     116
4     435
5    1132
6     756
7     557
dtype: int64

In [139]:
data.eval('newvar = patient + value', inplace=True)
data

Unnamed: 0,phylum,patient,value,newvar
0,Firmicutes,1,632,633
1,Proteobacteria,1,1638,1639
2,Actinobacteria,1,569,570
3,Bacteroidetes,1,115,116
4,Firmicutes,2,433,435
5,Proteobacteria,2,1130,1132
6,Actinobacteria,2,754,756
7,Bacteroidetes,2,555,557
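
For comparison, the same derived column can be built without modifying the frame in place. The sketch below uses a reduced, made-up frame and assumes a recent pandas, where `.eval()` with an assignment returns a new frame unless `inplace=True` is passed.

```python
import pandas as pd

data = pd.DataFrame({'patient': [1, 1, 2, 2],
                     'value': [632, 1638, 433, 1130]})

with_assign = data.assign(newvar=data['patient'] + data['value'])  # returns a new frame
with_eval = data.eval('newvar = patient + value')                  # also a new frame here

print(with_assign)
print('newvar' in data.columns)  # the original frame is untouched
```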


Other Python data structures (ones without an index) need to be the same length as the `DataFrame`:

In [140]:
month = ['Jan', 'Feb', 'Mar', 'Apr']
data['month'] = month

ValueError: Length of values does not match length of index

In [141]:
data['month'] = ['Jan']*len(data)
data

Unnamed: 0,phylum,patient,value,newvar,month
0,Firmicutes,1,632,633,Jan
1,Proteobacteria,1,1638,1639,Jan
2,Actinobacteria,1,569,570,Jan
3,Bacteroidetes,1,115,116,Jan
4,Firmicutes,2,433,435,Jan
5,Proteobacteria,2,1130,1132,Jan
6,Actinobacteria,2,754,756,Jan
7,Bacteroidetes,2,555,557,Jan


We can use `del` to remove columns, in the same way `dict` entries can be removed:

In [142]:
del data['month']
data

Unnamed: 0,phylum,patient,value,newvar
0,Firmicutes,1,632,633
1,Proteobacteria,1,1638,1639
2,Actinobacteria,1,569,570
3,Bacteroidetes,1,115,116
4,Firmicutes,2,433,435
5,Proteobacteria,2,1130,1132
6,Actinobacteria,2,754,756
7,Bacteroidetes,2,555,557


Or `.drop()` can be used:

In [None]:
data.drop?

In [143]:
data['month'] = ['Jan']*len(data)
data.drop('month', axis=1, inplace=True)
data

Unnamed: 0,phylum,patient,value,newvar
0,Firmicutes,1,632,633
1,Proteobacteria,1,1638,1639
2,Actinobacteria,1,569,570
3,Bacteroidetes,1,115,116
4,Firmicutes,2,433,435
5,Proteobacteria,2,1130,1132
6,Actinobacteria,2,754,756
7,Bacteroidetes,2,555,557


We can extract the underlying data as a simple `ndarray` by accessing the `values` attribute:

In [144]:
data.values

array([['Firmicutes', 1, 632, 633],
       ['Proteobacteria', 1, 1638, 1639],
       ['Actinobacteria', 1, 569, 570],
       ['Bacteroidetes', 1, 115, 116],
       ['Firmicutes', 2, 433, 435],
       ['Proteobacteria', 2, 1130, 1132],
       ['Actinobacteria', 2, 754, 756],
       ['Bacteroidetes', 2, 555, 557]], dtype=object)

Notice that because of the mix of string and integer values, the dtype of the array is `object`. The dtype will automatically be chosen to be as general as needed to accommodate all the columns.

In [145]:
df = pd.DataFrame({'foo': [1,2,3], 'bar':[0.4, -1.0, 4.5]})
df.values

array([[ 0.4,  1. ],
       [-1. ,  2. ],
       [ 4.5,  3. ]])

In [146]:
df.values.dtype

dtype('float64')

Pandas uses a custom data structure to represent the indices of `Series` and `DataFrame`s.

In [147]:
data.index

Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64')

Index objects are immutable:

In [148]:
data.index[0] = 15

TypeError: Index does not support mutable operations

This is so that Index objects can be shared between data structures without fear that they will be changed.

In [149]:
bacteria2.index = bacteria.index

In [150]:
bacteria2

phylum
Firmicutes           NaN
Proteobacteria     632.0
Actinobacteria    1638.0
Bacteroidetes      569.0
dtype: float64
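
Because individual index entries cannot be changed, relabeling means replacing the whole index (as above) or asking for a relabeled copy. A short sketch with a made-up Series, using `rename` with a mapping:

```python
import pandas as pd

s = pd.Series([632, 1638, 569],
              index=['Firmicutes', 'Proteobacteria', 'Actinobacteria'])

try:
    s.index[0] = 'Cyanobacteria'   # item assignment on an Index is not allowed
    mutated = True
except TypeError:
    mutated = False

renamed = s.rename({'Firmicutes': 'Cyanobacteria'})  # new Series with a new Index
print(mutated)
print(renamed)
```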

## Importing data

Pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a `DataFrame` object. These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported.

Let's start with some more bacteria data, stored in csv format.

In [151]:
!head -n10 data/microbiome.csv

Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,1174,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605
Firmicutes,6,693,50
Firmicutes,7,718,717
Firmicutes,8,173,33
Firmicutes,9,228,80


This table can be read into a DataFrame using `read_csv`:

In [152]:
mb = pd.read_csv("data/microbiome.csv")
mb.shape

(75, 4)

In [153]:
mb.head()

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946
4,Firmicutes,5,831,8605


In [154]:
mb.tail()

Unnamed: 0,Taxon,Patient,Tissue,Stool
70,Other,11,203,6
71,Other,12,392,6
72,Other,13,28,25
73,Other,14,12,22
74,Other,15,305,32


Notice that `read_csv` automatically considered the first row in the file to be a header row.

We can override default behavior by customizing some of the arguments, like `header`, `names` or `index_col`.

In [155]:
pd.read_csv("data/microbiome.csv", header=None).head()

Unnamed: 0,0,1,2,3
0,Taxon,Patient,Tissue,Stool
1,Firmicutes,1,632,305
2,Firmicutes,2,136,4182
3,Firmicutes,3,1174,703
4,Firmicutes,4,408,3946


`read_csv` is just a convenience function for `read_table`, since csv is such a common format:

In [156]:
mb = pd.read_table("data/microbiome.csv", sep=',')
mb.head()

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946
4,Firmicutes,5,831,8605


The `sep` argument can be customized as needed to accommodate arbitrary separators. For example, we can use a regular expression to define a variable amount of whitespace, which is unfortunately very common in some data formats:
    
    sep='\s+'
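
As a quick illustration of the whitespace separator, the sketch below parses a small in-memory table whose columns are padded unevenly. The text in `raw` is made up for this example and is not the microbiome file.

```python
import io
import pandas as pd

# Hypothetical whitespace-delimited text with irregular padding
raw = (
    "Taxon      Patient   Tissue\n"
    "Firmicutes    1        632\n"
    "Firmicutes  2   136\n"
)

df = pd.read_csv(io.StringIO(raw), sep=r'\s+')  # raw string avoids escape warnings
print(df)
```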

For a more useful index, we can specify the first two columns, which together provide a unique index to the data.

In [157]:
mb = pd.read_csv("data/microbiome.csv", index_col=['Taxon','Patient'])
mb.head()

Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,1174,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605


This is called a *hierarchical* index, which we will revisit later.

If we have sections of data that we do not wish to import (for example, known bad data), we can populate the `skiprows` argument:

In [158]:
pd.read_csv("data/microbiome.csv", skiprows=[3,4,6]).head()

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,5,831,8605
3,Firmicutes,7,718,717
4,Firmicutes,8,173,33


Conversely, if we only want to import a small number of rows from, say, a very large data file we can use `nrows`:

In [159]:
pd.read_csv("data/microbiome.csv", nrows=4)

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946


Alternatively, if we want to process our data in reasonable chunks, the `chunksize` argument will return an iterable object that can be employed in a data processing loop. For example, our microbiome data are organized by bacterial phylum, with 15 patients represented in each:

In [160]:
data_chunks = pd.read_csv("data/microbiome.csv", chunksize=15)

# .iloc[0] takes the first row of each chunk regardless of its index labels
mean_tissue = {chunk.Taxon.iloc[0]: chunk.Tissue.mean() for chunk in data_chunks}
    
mean_tissue

{'Actinobacteria': 449.06666666666666,
 'Bacteroidetes': 599.6666666666666,
 'Firmicutes': 684.4,
 'Other': 198.8,
 'Proteobacteria': 2943.0666666666666}
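
The same chunked pattern can be tried without the data file. The sketch below uses a hypothetical four-row sample (the last value is made up) and a chunk size of 2. It takes the first taxon of each chunk by position with `.iloc[0]`, since the row labels keep counting across chunks rather than restarting at 0.

```python
import io
import pandas as pd

# Hypothetical miniature data set: two taxa, two patients each
raw = """Taxon,Patient,Tissue
Firmicutes,1,632
Firmicutes,2,136
Proteobacteria,1,1638
Proteobacteria,2,2469
"""

mean_tissue = {}
for chunk in pd.read_csv(io.StringIO(raw), chunksize=2):
    # Each chunk is an ordinary DataFrame holding at most 2 rows
    mean_tissue[chunk["Taxon"].iloc[0]] = chunk["Tissue"].mean()

print(mean_tissue)
```

This only works as a per-taxon summary because the chunk size lines up with the number of rows per taxon; for data that is not laid out so regularly, a `groupby` on the full table is the safer choice.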

Most real-world data is incomplete, with values missing due to incomplete observation, data entry or transcription error, or other reasons. Pandas will automatically recognize and parse common missing data indicators, including `NA` and `NULL`.

In [161]:
!head -n12 data/microbiome_missing.csv

Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605
Firmicutes,6,693,50
Firmicutes,7,718,717
Firmicutes,8,173,33
Firmicutes,9,228,NA
Firmicutes,10,162,3196
Firmicutes,11,372,-99999


In [162]:
pd.read_csv("data/microbiome_missing.csv").head(12)

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632.0,305.0
1,Firmicutes,2,136.0,4182.0
2,Firmicutes,3,,703.0
3,Firmicutes,4,408.0,3946.0
4,Firmicutes,5,831.0,8605.0
5,Firmicutes,6,693.0,50.0
6,Firmicutes,7,718.0,717.0
7,Firmicutes,8,173.0,33.0
8,Firmicutes,9,228.0,
9,Firmicutes,10,162.0,3196.0


Above, Pandas recognized `NA` and an empty field as missing data.

In [163]:
pd.isnull(pd.read_csv("data/microbiome_missing.csv")).head(12)

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,False,False,False,False
1,False,False,False,False
2,False,False,True,False
3,False,False,False,False
4,False,False,False,False
5,False,False,False,False
6,False,False,False,False
7,False,False,False,False
8,False,False,False,True
9,False,False,False,False


Unfortunately, the conventions for marking missing data are not always consistent. In this example, there is a question mark "?" and a large negative number where there should have been a positive integer. We can specify additional symbols with the `na_values` argument:
   

In [164]:
pd.read_csv("data/microbiome_missing.csv", na_values=['?', -99999]).head(12)

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632.0,305.0
1,Firmicutes,2,136.0,4182.0
2,Firmicutes,3,,703.0
3,Firmicutes,4,408.0,3946.0
4,Firmicutes,5,831.0,8605.0
5,Firmicutes,6,693.0,50.0
6,Firmicutes,7,718.0,717.0
7,Firmicutes,8,173.0,33.0
8,Firmicutes,9,228.0,
9,Firmicutes,10,162.0,3196.0


These can be specified on a column-wise basis using an appropriate dict as the argument for `na_values`.
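
A minimal sketch of the column-wise form, assuming a hypothetical in-memory sample rather than `microbiome_missing.csv`: the dict keys are column names and the values are lists of markers to treat as missing in that column only.

```python
import io
import pandas as pd

# Hypothetical sample with two different missing-data markers
raw = """Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,?,4182
Firmicutes,3,-99999,-99999
"""

# Only the Tissue column treats "?" and "-99999" as missing
df = pd.read_csv(io.StringIO(raw), na_values={"Tissue": ["?", "-99999"]})
print(df)
```

Here `-99999` becomes `NaN` in `Tissue` but is kept as an ordinary integer in `Stool`, which is useful when a sentinel value is only meaningful for some columns.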

### Microsoft Excel

Since so much scientific data ends up in Excel spreadsheets, Pandas' ability to import them directly is valuable. This support depends on one of two optional dependencies, according to the file format: `xlrd` for legacy `.xls` files and `openpyxl` for `.xlsx` files (both can be installed with `pip`).

Importing Excel data to Pandas is a two-step process. First, we create an `ExcelFile` object using the path of the file:                                             

In [165]:
mb_file = pd.ExcelFile('data/microbiome/MID1.xls')
mb_file

<pandas.io.excel.ExcelFile at 0x112563d30>

Then, since modern spreadsheets consist of one or more "sheets", we parse the sheet with the data of interest:

In [166]:
mb1 = mb_file.parse("Sheet 1", header=None)
mb1.columns = ["Taxon", "Count"]
mb1.head()

Unnamed: 0,Taxon,Count
0,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",7
1,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2
2,"Archaea ""Crenarchaeota"" Thermoprotei Sulfoloba...",3
3,"Archaea ""Crenarchaeota"" Thermoprotei Thermopro...",3
4,"Archaea ""Euryarchaeota"" ""Methanomicrobia"" Meth...",7


There is also a `read_excel` convenience function in Pandas that combines these steps into a single call:

In [167]:
mb2 = pd.read_excel('data/microbiome/MID2.xls', sheet_name='Sheet 1', header=None)
mb2.columns = ['Taxon', 'Count']
mb2.head()

Unnamed: 0,Taxon,Count
0,"Archaea ""Crenarchaeota"" Thermoprotei Acidiloba...",2
1,"Archaea ""Crenarchaeota"" Thermoprotei Acidiloba...",14
2,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",23
3,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",1
4,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2


## Saving data

In [168]:
data

Unnamed: 0,phylum,patient,value,newvar
0,Firmicutes,1,632,633
1,Proteobacteria,1,1638,1639
2,Actinobacteria,1,569,570
3,Bacteroidetes,1,115,116
4,Firmicutes,2,433,435
5,Proteobacteria,2,1130,1132
6,Actinobacteria,2,754,756
7,Bacteroidetes,2,555,557


In [169]:
data.to_csv?

In [170]:
data.to_csv('data/test.csv')
!cat data/test.csv

,phylum,patient,value,newvar
0,Firmicutes,1,632,633
1,Proteobacteria,1,1638,1639
2,Actinobacteria,1,569,570
3,Bacteroidetes,1,115,116
4,Firmicutes,2,433,435
5,Proteobacteria,2,1130,1132
6,Actinobacteria,2,754,756
7,Bacteroidetes,2,555,557


In [171]:
data.to_csv('data/test.csv', index=False)
!cat data/test.csv

phylum,patient,value,newvar
Firmicutes,1,632,633
Proteobacteria,1,1638,1639
Actinobacteria,1,569,570
Bacteroidetes,1,115,116
Firmicutes,2,433,435
Proteobacteria,2,1130,1132
Actinobacteria,2,754,756
Bacteroidetes,2,555,557


In [172]:
data.to_csv('data/test.csv', index=False, header=False)
!cat data/test.csv

Firmicutes,1,632,633
Proteobacteria,1,1638,1639
Actinobacteria,1,569,570
Bacteroidetes,1,115,116
Firmicutes,2,433,435
Proteobacteria,2,1130,1132
Actinobacteria,2,754,756
Bacteroidetes,2,555,557
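
The `to_csv` calls above write to a file, but the method returns the text as a string when no path is given, which is handy for checking the output format. The sketch below also uses the `sep` and `na_rep` arguments to write tab-separated text with an explicit missing-value marker; the small `df` is a hypothetical example, not the `data` table above.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row table with one missing value
df = pd.DataFrame({
    "phylum": ["Firmicutes", "Proteobacteria"],
    "value": [632.0, np.nan],
})

# No path given, so to_csv returns the CSV text instead of writing a file
text = df.to_csv(index=False, sep="\t", na_rep="NA")
print(text)
```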


There are many other types of files that Pandas can open and save, which we'll cover further in the future.

## Exercise 3

Open up [Lecture 3/Exercise 3.ipynb](./Exercise 3.ipynb) in your Jupyter notebook server.

Solutions are at [Lecture 3/Exercise 3 - Solutions.ipynb](./Exercise 3 - Solutions.ipynb)

## References

Slide materials inspired by and adapted from [Chris Fonnesbeck](https://github.com/fonnesbeck/statistical-analysis-python-tutorial) and [Tom Augspurger](https://github.com/TomAugspurger/pydata-chi-h2t)