# Numpy + Pandas I

## Numpy

pandas is built atop `NumPy`, both historically and in its actual implementation, so it's helpful to have a basic understanding of it.

### ndarray

The core of NumPy is the `ndarray`, an N-dimensional array. These are singly-typed, fixed-length data containers.
NumPy also provides many convenient and fast methods implemented on the `ndarray`.

In [2]:
import numpy as np
# per convention

In [2]:
x = np.array([1, 2, 3])
x

array([1, 2, 3])

In [3]:
x.dtype

dtype('int64')

In [4]:
y = np.array([[True, False], [False, True]])
y

array([[ True, False],
       [False,  True]], dtype=bool)

In [5]:
z = np.array([1., 2, 3])
z

array([ 1.,  2.,  3.])

In [3]:
s = np.array(['hello', 'tt'])
s

array(['hello', 'tt'], dtype='<U5')

In [5]:
x = np.array([[1], [2], [3]])
x.shape

(3, 1)

In [8]:
y = np.array([[1., 2., 3.], [4., 5., 6.]])
print(y)
print(y.shape)

[[ 1.  2.  3.]
 [ 4.  5.  6.]]
(2, 3)


### dtypes

Unlike Python lists, NumPy arrays care about the type of data stored within.
The full list of NumPy dtypes can be found in the [NumPy documentation](http://docs.scipy.org/doc/numpy/user/basics.types.html).

![dtypes](http://docs.scipy.org/doc/numpy/_images/dtype-hierarchy.png)

We sacrifice the convenience of mixing bools, ints, and floats within an array for much better performance.
However, an unexpected `dtype` change will probably bite you at some point in the future.

The three biggest things to remember are:

- Missing values (NaN) cast integer or boolean arrays to floats
- NumPy arrays only have a single dtype for every element
- the object dtype is the fallback

You'll want to avoid the object dtype. It's typically slow.

In [9]:
np.nan

nan

In [10]:
np.nan * 1.

nan

In [11]:
type(np.nan)

float
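Because `np.nan` is a float, introducing it into integer or boolean data forces a float array, and values with no common numeric type fall back to `object`. A minimal sketch of these promotions (the array names are illustrative, not from the original notebook):

```python
import numpy as np

ints = np.array([1, 2, 3])                        # integer dtype (int64 on most platforms)
ints_with_nan = np.array([1, 2, np.nan])          # NaN is a float, so the array is float64
bools_with_nan = np.array([True, False, np.nan])  # booleans are promoted to float64 too
fallback = np.array([1, None])                    # no common numeric type: object dtype

print(ints.dtype, ints_with_nan.dtype, bools_with_nan.dtype, fallback.dtype)
```

Checking `.dtype` after loading or combining data is a cheap way to catch an unexpected promotion early.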

In [6]:
a = np.ones((5,2))
print(a)


[[1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]]


In [9]:
b = np.zeros((5,2))
b

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

In [10]:
print(b)
print('*******')
print(b.T)

[[0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]]
*******
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]


### Matrix multiplication

In [15]:
np.matmul(b , b.T)

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [16]:
b*b

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])
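With all-zero arrays the difference between `np.matmul` and `*` only shows up in the output shape. The sketch below uses two small non-zero matrices (chosen here purely for illustration) so the values differ as well:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[0, 1],
              [1, 0]])

matrix_product = np.matmul(a, b)   # rows of a dotted with columns of b
same_product = a @ b               # the @ operator is shorthand for matmul
elementwise = a * b                # multiplies matching entries only

print(matrix_product)
print(elementwise)
```

Here the matrix product swaps the columns of `a`, while the element-wise product simply zeroes the diagonal.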

### Element-wise operations

- Arithmetic: `+`, `-`, `*`, `/`, `**`
- Comparisons: `==`, `!=`, `<`, `>`, `<=`, `>=`
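The cells that follow operate on arrays `x` and `y` whose definitions are not shown. Their outputs are consistent with a length-5 integer range and a 2x5 array of random floats, so the sketch below assumes those shapes; the generator and seed are arbitrary choices, so the random numbers will differ from the ones printed. It also illustrates broadcasting: the shape `(5,)` array is combined with each row of the `(2, 5)` array.

```python
import numpy as np

x = np.arange(5)                  # array([0, 1, 2, 3, 4])
rng = np.random.default_rng(0)    # seeded generator; an arbitrary choice
y = rng.random((2, 5))            # 2x5 floats in [0, 1)

product = x * y       # x is broadcast across both rows of y
squares = x ** 2      # element-wise power
below_two = x < 2     # element-wise comparison gives a boolean array

print(product.shape)
print(squares)
print(below_two)
```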

In [29]:
x * y

array([[ 0.        ,  0.20632132,  0.13807232,  0.43208051,  3.2790722 ],
       [ 0.        ,  0.47069417,  0.62013861,  1.28102554,  3.11023673]])

In [30]:
x**2

array([ 0,  1,  4,  9, 16])

In [31]:
x < y

array([[ True, False, False, False, False],
       [ True, False, False, False, False]], dtype=bool)

In [32]:
x < 2

array([ True,  True, False, False, False], dtype=bool)

### Array methods

- `.sum()`, `.mean()`, `.min()`, `.max()`, `.prod()`, `.cumsum()`, etc
- `np.exp`, `np.log`, `np.sin`, `np.sqrt` etc
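The `axis` argument decides which dimension a reduction collapses. A small sketch on a hand-checkable array (the array `m` is illustrative):

```python
import numpy as np

m = np.array([[1., 2., 3.],
              [4., 5., 6.]])

total = m.sum()               # all elements
column_sums = m.sum(axis=0)   # collapse rows: one value per column
row_sums = m.sum(axis=1)      # collapse columns: one value per row
running = m.cumsum()          # cumulative sum over the flattened array
roots = np.sqrt(m)            # applied element-wise; shape is unchanged

print(total, column_sums, row_sums)
```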

In [34]:
y

array([[ 0.85499094,  0.20632132,  0.06903616,  0.14402684,  0.81976805],
       [ 0.70942004,  0.47069417,  0.31006931,  0.42700851,  0.77755918]])

In [35]:
y.sum()

4.7888945249725277

In [36]:
y.sum(axis=0)

array([ 1.56441098,  0.67701549,  0.37910547,  0.57103535,  1.59732723])

In [37]:
y.sum(axis=1)

array([ 2.0941433 ,  2.69475122])

In [38]:
y.prod()

4.9429668903378281e-05

In [39]:
y.mean()

0.47888945249725279

In [40]:
print(y.min(), y.max())

0.0690361610206 0.854990938989


### Numpy indexing and slicing

Works similarly to list indexing, but with one index or slice per dimension, separated by commas.

In [19]:
np.linspace(0,10, 10)


array([ 0.        ,  1.11111111,  2.22222222,  3.33333333,  4.44444444,
        5.55555556,  6.66666667,  7.77777778,  8.88888889, 10.        ])

In [20]:
x = np.linspace(0,10, 50)
x

array([ 0.        ,  0.20408163,  0.40816327,  0.6122449 ,  0.81632653,
        1.02040816,  1.2244898 ,  1.42857143,  1.63265306,  1.83673469,
        2.04081633,  2.24489796,  2.44897959,  2.65306122,  2.85714286,
        3.06122449,  3.26530612,  3.46938776,  3.67346939,  3.87755102,
        4.08163265,  4.28571429,  4.48979592,  4.69387755,  4.89795918,
        5.10204082,  5.30612245,  5.51020408,  5.71428571,  5.91836735,
        6.12244898,  6.32653061,  6.53061224,  6.73469388,  6.93877551,
        7.14285714,  7.34693878,  7.55102041,  7.75510204,  7.95918367,
        8.16326531,  8.36734694,  8.57142857,  8.7755102 ,  8.97959184,
        9.18367347,  9.3877551 ,  9.59183673,  9.79591837, 10.        ])

In [26]:
y = x.reshape((5,10))
y

array([[ 0.        ,  0.20408163,  0.40816327,  0.6122449 ,  0.81632653,
         1.02040816,  1.2244898 ,  1.42857143,  1.63265306,  1.83673469],
       [ 2.04081633,  2.24489796,  2.44897959,  2.65306122,  2.85714286,
         3.06122449,  3.26530612,  3.46938776,  3.67346939,  3.87755102],
       [ 4.08163265,  4.28571429,  4.48979592,  4.69387755,  4.89795918,
         5.10204082,  5.30612245,  5.51020408,  5.71428571,  5.91836735],
       [ 6.12244898,  6.32653061,  6.53061224,  6.73469388,  6.93877551,
         7.14285714,  7.34693878,  7.55102041,  7.75510204,  7.95918367],
       [ 8.16326531,  8.36734694,  8.57142857,  8.7755102 ,  8.97959184,
         9.18367347,  9.3877551 ,  9.59183673,  9.79591837, 10.        ]])

In [47]:
y[0]

array([0.        , 0.20408163, 0.40816327, 0.6122449 , 0.81632653,
       1.02040816, 1.2244898 , 1.42857143, 1.63265306, 1.83673469])

In [27]:
y[1,0]

2.0408163265306123

In [30]:
y[3:,0]

array([6.12244898, 8.16326531])

In [32]:
y[:,2:]

array([[ 0.40816327,  0.6122449 ,  0.81632653,  1.02040816,  1.2244898 ,
         1.42857143,  1.63265306,  1.83673469],
       [ 2.44897959,  2.65306122,  2.85714286,  3.06122449,  3.26530612,
         3.46938776,  3.67346939,  3.87755102],
       [ 4.48979592,  4.69387755,  4.89795918,  5.10204082,  5.30612245,
         5.51020408,  5.71428571,  5.91836735],
       [ 6.53061224,  6.73469388,  6.93877551,  7.14285714,  7.34693878,
         7.55102041,  7.75510204,  7.95918367],
       [ 8.57142857,  8.7755102 ,  8.97959184,  9.18367347,  9.3877551 ,
         9.59183673,  9.79591837, 10.        ]])

In [33]:
y[2:4,4:8]

array([[4.89795918, 5.10204082, 5.30612245, 5.51020408],
       [6.93877551, 7.14285714, 7.34693878, 7.55102041]])
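The same indexing patterns are easier to check by eye on a small integer grid. A sketch, assuming a 3x4 array built with `np.arange` (the names are illustrative):

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

first_row = grid[0]         # a single index selects along the first axis
one_element = grid[1, 0]    # row 1, column 0
column_tail = grid[1:, 0]   # rows 1 onward, column 0
right_part = grid[:, 2:]    # every row, columns 2 onward
block = grid[1:3, 1:3]      # rows 1-2, columns 1-2

print(first_row, one_element, column_tail)
```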

**pandas** is a Python package providing fast, flexible, and expressive data structures designed to work with both *relational* and *labeled* data. 

It is a fundamental high-level building block for doing practical, real world data analysis in Python. 

## Why pandas?

NumPy is great, but it lacks a few things that are conducive to doing statistical analysis. By building on top of NumPy, pandas provides:

- labeled arrays

- heterogeneous data types within a table

- "better" missing data handling

- convenient methods (`groupby`, `rolling`, `resample`)

- more data types (Categorical, Datetime)

- indexing handles data alignment

- built in functionality for reading/writing many kinds of files

## Pandas Data Structures

### Series

A **Series** is a single vector of data with an *index* that labels each element in the vector.

In [37]:
import pandas as pd

In [38]:
counts = pd.Series([632, 1638, 569, 115])
counts

0     632
1    1638
2     569
3     115
dtype: int64
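By default the index is the integers 0 to N-1, but any labels can be supplied. A small sketch, assuming phylum names as labels for the counts above (the variable name `bacteria` is illustrative):

```python
import pandas as pd

bacteria = pd.Series([632, 1638, 569, 115],
                     index=['Firmicutes', 'Proteobacteria',
                            'Actinobacteria', 'Bacteroidetes'])

one_count = bacteria['Actinobacteria']                   # dict-like lookup by label
two_counts = bacteria[['Firmicutes', 'Bacteroidetes']]   # a list of labels gives a Series
labels = list(bacteria.index)

print(one_count)
print(two_counts)
```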

### DataFrame

Inevitably, we want to be able to store, view and manipulate data that is *multivariate*, where for every index there are multiple fields or columns of data, often of varying data type.

A `DataFrame` is a tabular data structure, encapsulating multiple series like columns in a spreadsheet. Data are stored internally as a 2-dimensional object, but the `DataFrame` allows us to represent and manipulate higher-dimensional data.

In [41]:
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient':[1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
                               'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 
                               'Actinobacteria', 'Bacteroidetes']})

In [42]:
data

|  | value | patient | phylum |
|---|---|---|---|
| 0 | 632 | 1 | Firmicutes |
| 1 | 1638 | 1 | Proteobacteria |
| 2 | 569 | 1 | Actinobacteria |
| 3 | 115 | 1 | Bacteroidetes |
| 4 | 433 | 2 | Firmicutes |
| 5 | 1130 | 2 | Proteobacteria |
| 6 | 754 | 2 | Actinobacteria |
| 7 | 555 | 2 | Bacteroidetes |


Notice the columns appear in the order the dictionary keys were given (older versions of pandas sorted them by name). We can change the order by indexing them in the order we desire:

In [43]:
data[['phylum','value','patient']]

|  | phylum | value | patient |
|---|---|---|---|
| 0 | Firmicutes | 632 | 1 |
| 1 | Proteobacteria | 1638 | 1 |
| 2 | Actinobacteria | 569 | 1 |
| 3 | Bacteroidetes | 115 | 1 |
| 4 | Firmicutes | 433 | 2 |
| 5 | Proteobacteria | 1130 | 2 |
| 6 | Actinobacteria | 754 | 2 |
| 7 | Bacteroidetes | 555 | 2 |


A `DataFrame` has a second index, representing the columns:

In [88]:
data.columns

Index(['value', 'patient', 'phylum'], dtype='object')

If we wish to access columns, we can do so either by dict-like indexing or by attribute:

In [89]:
data['value']

0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64

In [90]:
data.value

0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64

In [91]:
type(data.value)

pandas.core.series.Series

In [92]:
type(data[['value']])

pandas.core.frame.DataFrame

Notice this is different from a `Series`, where dict-like indexing retrieves a particular element (row).

### Indexing `DataFrame`s

If we want access to a row in a `DataFrame`, we index its `loc` or `iloc` attribute.

In [93]:
np.random.seed(0)
data = pd.DataFrame({'v1': np.random.rand(4), 
                     'v2': np.random.rand(4)}, 
                    index=['w','x','y','z'])
data.head()

|  | v1 | v2 |
|---|---|---|
| w | 0.548814 | 0.423655 |
| x | 0.715189 | 0.645894 |
| y | 0.602763 | 0.437587 |
| z | 0.544883 | 0.891773 |


### `.loc`

To access something by its _label_, use `.loc[]`:

In [94]:
data.loc['y']

v1    0.602763
v2    0.437587
Name: y, dtype: float64

In [95]:
data.loc[['x','y']]

|  | v1 | v2 |
|---|---|---|
| x | 0.715189 | 0.645894 |
| y | 0.602763 | 0.437587 |


In [96]:
data.loc[['x','y'],'v1']

x    0.715189
y    0.602763
Name: v1, dtype: float64

In [97]:
data.loc[['x','y'],'v2']

x    0.645894
y    0.437587
Name: v2, dtype: float64

In [98]:
data.loc['w':'y',['v1','v2']]

|  | v1 | v2 |
|---|---|---|
| w | 0.548814 | 0.423655 |
| x | 0.715189 | 0.645894 |
| y | 0.602763 | 0.437587 |


### `.iloc`

To access something by its row/column position instead, use `.iloc[]`:

In [99]:
data.iloc[2]

v1    0.602763
v2    0.437587
Name: y, dtype: float64

In [100]:
data.iloc[:2]

|  | v1 | v2 |
|---|---|---|
| w | 0.548814 | 0.423655 |
| x | 0.715189 | 0.645894 |


In [101]:
data.iloc[:2,0]

w    0.548814
x    0.715189
Name: v1, dtype: float64

In [102]:
data.iloc[0,:]

v1    0.548814
v2    0.423655
Name: w, dtype: float64
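The distinction matters most for slices and for integer labels. In the sketch below (the frame and names are illustrative), a label slice with `.loc` includes the end label, while a position slice with `.iloc` excludes the end position, as in ordinary Python slicing.

```python
import pandas as pd

frame = pd.DataFrame({'v1': [1.0, 2.0, 3.0, 4.0]},
                     index=['w', 'x', 'y', 'z'])

by_label = frame.loc['w':'y']     # labels w, x and y: end label included
by_position = frame.iloc[0:2]     # positions 0 and 1: end position excluded

# With integer labels, .loc and .iloc can point at different rows.
shuffled = pd.Series([10, 20, 30], index=[2, 0, 1])
label_zero = shuffled.loc[0]      # the value labelled 0
position_zero = shuffled.iloc[0]  # the first value in the Series

print(len(by_label), len(by_position), label_zero, position_zero)
```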

### Boolean indexing

`.loc` _also_ supports boolean indices:

In [103]:
data.v1 > .6

w    False
x     True
y     True
z    False
Name: v1, dtype: bool

In [104]:
data.loc[data.v1 > .6]

|  | v1 | v2 |
|---|---|---|
| x | 0.715189 | 0.645894 |
| y | 0.602763 | 0.437587 |


In [105]:
data.loc[(data.v1 > .6) & (data.v2 < .5)]

|  | v1 | v2 |
|---|---|---|
| y | 0.602763 | 0.437587 |


In [106]:
data.loc[(data.v1 > .6) & (data.v2 < .5), 'v2']

y    0.437587
Name: v2, dtype: float64
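Boolean masks are combined with `&`, `|` and `~` rather than `and`, `or` and `not`, and each comparison needs its own parentheses because these operators bind more tightly than `>` or `<`. A short sketch with made-up values (not the random numbers above):

```python
import pandas as pd

frame = pd.DataFrame({'v1': [0.55, 0.72, 0.61, 0.54],
                      'v2': [0.42, 0.65, 0.44, 0.89]},
                     index=['w', 'x', 'y', 'z'])

high_v1 = frame.v1 > 0.6
either = frame.loc[high_v1 | (frame.v2 > 0.8)]   # rows meeting at least one condition
not_high = frame.loc[~high_v1]                   # rows where the condition is False

print(either)
print(not_high)
```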

## Operating on `DataFrame`s

Operations on `DataFrame`s are very similar to operations on `Series`, including the ability to automatically align axes.

In [46]:
df1 = pd.DataFrame.from_dict({'USA': {'lat': 37.1, 'lon': 95.7},
                              'CAN': {'lat': 56.1, 'lon': 106.3},
                              'MEX': {'lat': 23.6, 'lon': 102.6}}, orient='index')
df1

|  | lat | lon |
|---|---|---|
| USA | 37.1 | 95.7 |
| CAN | 56.1 | 106.3 |
| MEX | 23.6 | 102.6 |


In [115]:
df2 = pd.DataFrame.from_dict({'USA': {'temp': 70},
                              'CAN': {'temp': 60},
                              'MEX': {'temp': 50}}, orient='index')
df2

|  | temp |
|---|---|
| CAN | 60 |
| MEX | 50 |
| USA | 70 |


In [116]:
df1['lat'] / df2['temp']

CAN    0.935
MEX    0.472
USA    0.530
dtype: float64

Do the rows have to be in the same order? No: pandas aligns on index labels before dividing.

In [117]:
df1 = pd.DataFrame.from_dict({'USA': {'lat': 37.1, 'lon': 95.7},
                              'CAN': {'lat': 56.1, 'lon': 106.3},
                              'MEX': {'lat': 23.6, 'lon': 102.6}}, orient='index')
df2 = pd.DataFrame.from_dict({'CAN': {'temp': 60},
                              'USA': {'temp': 70},
                              'MEX': {'temp': 50}}, orient='index')

In [118]:
df1['lat'] / df2['temp']

CAN    0.935
MEX    0.472
USA    0.530
dtype: float64

What about labels that appear in only one of the operands? They produce `NaN` in the result:

In [119]:
df3 = pd.DataFrame.from_dict({'USA': {'temp': 70},
                              'CAN': {'temp': 60},
                              'MEX': {'temp': 50},
                              'GRL': {'temp': 40}}, orient='index')
df3

|  | temp |
|---|---|
| CAN | 60 |
| GRL | 40 |
| MEX | 50 |
| USA | 70 |


In [120]:
df1['lat'] / df3['temp']

CAN    0.935
GRL      NaN
MEX    0.472
USA    0.530
dtype: float64
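Once alignment has introduced `NaN`, the unmatched labels can be inspected or dropped. A minimal sketch using `Series` with the same values as above (the variable names are illustrative):

```python
import pandas as pd

lat = pd.Series({'USA': 37.1, 'CAN': 56.1, 'MEX': 23.6})
temp = pd.Series({'CAN': 60, 'USA': 70, 'MEX': 50, 'GRL': 40})

ratio = lat / temp                # aligned on labels; GRL has no latitude, so NaN
unmatched = ratio[ratio.isna()]   # labels that lacked a partner
complete = ratio.dropna()         # keep only labels present in both

print(ratio)
print(list(unmatched.index))
```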

## Adding to and changing `DataFrame`s

In [121]:
data = pd.DataFrame.from_dict({0: {'patient': 1, 'phylum': 'Firmicutes', 'value': 632},
                               1: {'patient': 1, 'phylum': 'Proteobacteria', 'value': 1638},
                               2: {'patient': 1, 'phylum': 'Actinobacteria', 'value': 569},
                               3: {'patient': 1, 'phylum': 'Bacteroidetes', 'value': 115},
                               4: {'patient': 2, 'phylum': 'Firmicutes', 'value': 433},
                               5: {'patient': 2, 'phylum': 'Proteobacteria', 'value': 1130},
                               6: {'patient': 2, 'phylum': 'Actinobacteria', 'value': 754},
                               7: {'patient': 2, 'phylum': 'Bacteroidetes', 'value': 555}}, orient='index')
data

|  | phylum | patient | value |
|---|---|---|---|
| 0 | Firmicutes | 1 | 632 |
| 1 | Proteobacteria | 1 | 1638 |
| 2 | Actinobacteria | 1 | 569 |
| 3 | Bacteroidetes | 1 | 115 |
| 4 | Firmicutes | 2 | 433 |
| 5 | Proteobacteria | 2 | 1130 |
| 6 | Actinobacteria | 2 | 754 |
| 7 | Bacteroidetes | 2 | 555 |


We can modify existing values by assignment:

In [126]:
data.loc[3, 'value'] = 14
data

|  | phylum | patient | value |
|---|---|---|---|
| 0 | Firmicutes | 1 | 632 |
| 1 | Proteobacteria | 1 | 1638 |
| 2 | Actinobacteria | 1 | 569 |
| 3 | Bacteroidetes | 1 | 14 |
| 4 | Firmicutes | 2 | 433 |
| 5 | Proteobacteria | 2 | 1130 |
| 6 | Actinobacteria | 2 | 754 |
| 7 | Bacteroidetes | 2 | 555 |


In [127]:
data.iloc[0, 1] = 2
data

|  | phylum | patient | value |
|---|---|---|---|
| 0 | Firmicutes | 2 | 632 |
| 1 | Proteobacteria | 1 | 1638 |
| 2 | Actinobacteria | 1 | 569 |
| 3 | Bacteroidetes | 1 | 14 |
| 4 | Firmicutes | 2 | 433 |
| 5 | Proteobacteria | 2 | 1130 |
| 6 | Actinobacteria | 2 | 754 |
| 7 | Bacteroidetes | 2 | 555 |


In [128]:
data['patient'] = data['patient'] + 1
data

|  | phylum | patient | value |
|---|---|---|---|
| 0 | Firmicutes | 3 | 632 |
| 1 | Proteobacteria | 2 | 1638 |
| 2 | Actinobacteria | 2 | 569 |
| 3 | Bacteroidetes | 2 | 14 |
| 4 | Firmicutes | 3 | 433 |
| 5 | Proteobacteria | 3 | 1130 |
| 6 | Actinobacteria | 3 | 754 |
| 7 | Bacteroidetes | 3 | 555 |
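Assigning to a column name that does not exist yet adds a new column. The sketch below uses a cut-down frame; the column names `newvar` and `study` and the value `'pilot'` are hypothetical, chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'patient': [1, 1, 2, 2],
                   'value': [632, 1638, 433, 1130]})

df['newvar'] = df['value'] + df['patient']   # new column computed from existing ones
df['study'] = 'pilot'                        # a scalar is repeated for every row
trimmed = df.drop(columns=['study'])         # drop returns a new frame; df is unchanged

print(df)
print(trimmed.columns.tolist())
```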


## Importing data

Pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a `DataFrame` object. These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported.

A CSV file can be read into a `DataFrame` using `read_csv`:

In [152]:
mb = pd.read_csv("sample_csv_file.csv")
mb.shape

(75, 4)

In [153]:
mb.head()

|  | Taxon | Patient | Tissue | Stool |
|---|---|---|---|---|
| 0 | Firmicutes | 1 | 632 | 305 |
| 1 | Firmicutes | 2 | 136 | 4182 |
| 2 | Firmicutes | 3 | 1174 | 703 |
| 3 | Firmicutes | 4 | 408 | 3946 |
| 4 | Firmicutes | 5 | 831 | 8605 |


In [154]:
mb.tail()

|  | Taxon | Patient | Tissue | Stool |
|---|---|---|---|---|
| 70 | Other | 11 | 203 | 6 |
| 71 | Other | 12 | 392 | 6 |
| 72 | Other | 13 | 28 | 25 |
| 73 | Other | 14 | 12 | 22 |
| 74 | Other | 15 | 305 | 32 |


Notice that `read_csv` automatically considered the first row in the file to be a header row.

We can override default behavior by customizing some of the arguments, like `header`, `names` or `index_col`.

In [155]:
pd.read_csv("data/microbiome.csv", header=None).head()

|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | Taxon | Patient | Tissue | Stool |
| 1 | Firmicutes | 1 | 632 | 305 |
| 2 | Firmicutes | 2 | 136 | 4182 |
| 3 | Firmicutes | 3 | 1174 | 703 |
| 4 | Firmicutes | 4 | 408 | 3946 |
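The `names` and `index_col` arguments can be tried without a file on disk by reading from an in-memory string. A sketch, assuming a two-row excerpt in the same layout as the microbiome file (the lowercase column names are illustrative):

```python
import io
import pandas as pd

csv_text = (
    "Taxon,Patient,Tissue,Stool\n"
    "Firmicutes,1,632,305\n"
    "Firmicutes,2,136,4182\n"
)

default = pd.read_csv(io.StringIO(csv_text))                # first row becomes the header
no_header = pd.read_csv(io.StringIO(csv_text), header=None) # first row is treated as data
renamed = pd.read_csv(io.StringIO(csv_text), header=0,
                      names=['taxon', 'patient', 'tissue', 'stool'])  # replace the header
indexed = pd.read_csv(io.StringIO(csv_text),
                      index_col=['Taxon', 'Patient'])       # use two columns as the index

print(default.shape, no_header.shape)
print(indexed)
```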


`read_csv` is essentially `read_table` with a comma as the default separator, since CSV is such a common format:

In [156]:
mb = pd.read_table("data/microbiome.csv", sep=',')
mb.head()

|  | Taxon | Patient | Tissue | Stool |
|---|---|---|---|---|
| 0 | Firmicutes | 1 | 632 | 305 |
| 1 | Firmicutes | 2 | 136 | 4182 |
| 2 | Firmicutes | 3 | 1174 | 703 |
| 3 | Firmicutes | 4 | 408 | 3946 |
| 4 | Firmicutes | 5 | 831 | 8605 |


The `sep` argument can be customized as needed to accommodate arbitrary separators. For example, we can use a regular expression to match a variable amount of whitespace, which is unfortunately very common in some data formats:

    sep=r'\s+'
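A quick way to see the whitespace separator at work is to parse a small in-memory table whose columns are padded with uneven runs of spaces. The data below is made up for illustration:

```python
import io
import pandas as pd

text = (
    "Taxon      Patient   Tissue\n"
    "Firmicutes 1         632\n"
    "Other      2         136\n"
)

# A raw string keeps the backslash intact for the regular expression.
spaced = pd.read_csv(io.StringIO(text), sep=r'\s+')

print(spaced)
```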

In [169]:
data.to_csv?

In [170]:
data.to_csv('data/test.csv')


,phylum,patient,value,newvar
0,Firmicutes,1,632,633
1,Proteobacteria,1,1638,1639
2,Actinobacteria,1,569,570
3,Bacteroidetes,1,115,116
4,Firmicutes,2,433,435
5,Proteobacteria,2,1130,1132
6,Actinobacteria,2,754,756
7,Bacteroidetes,2,555,557
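The leading comma in the header line of the file is the unnamed index column. A sketch of how that column can be dropped on writing or restored on reading; it relies on `to_csv` returning a string when no path is given, and the small frame is illustrative:

```python
import io
import pandas as pd

df = pd.DataFrame({'phylum': ['Firmicutes', 'Proteobacteria'],
                   'value': [632, 1638]})

with_index = df.to_csv()                 # no path given, so the CSV text is returned
without_index = df.to_csv(index=False)   # omit the index column entirely
round_trip = pd.read_csv(io.StringIO(with_index), index_col=0)  # restore the index

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

Passing `index=False` is usually the simpler choice when the index is just the default row numbers.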


## References

Slide materials inspired by and adapted from [Chris Fonnesbeck](https://github.com/fonnesbeck/statistical-analysis-python-tutorial) and [Tom Augspurger](https://github.com/TomAugspurger/pydata-chi-h2t).