# A short introduction to pandas

https://pandas.pydata.org/

Import pandas

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Series
Series is a **one-dimensional labeled array** capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic way to create a Series is to call `pd.Series(data, index=index)`.

Create a Series of random numbers. The passed index is a list of axis labels.

In [2]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -2.174334
b    1.110806
c   -0.240837
d    0.159714
e    1.370581
dtype: float64

### Series is ndarray
Series acts very similarly to an `ndarray` and is a valid argument to most NumPy functions. However, operations such as slicing also slice the index. **Logical indexing** (selection with a boolean mask) is supported.

In [3]:
# positional access; plain s[0] is deprecated for a label-indexed Series
s.iloc[0]

-2.1743339401635815

In [4]:
s[s>s.median()]

b    1.110806
e    1.370581
dtype: float64

In [5]:
np.exp(s)

a    0.113684
b    3.036804
c    0.785970
d    1.173176
e    3.937638
dtype: float64

In [6]:
s + s*2

a   -6.523002
b    3.332417
c   -0.722510
d    0.479143
e    4.111743
dtype: float64

In [7]:
# positional selection of several elements
s.iloc[[4, 3, 1]]

e    1.370581
d    0.159714
b    1.110806
dtype: float64

### Series is dict-like

In [9]:
s['a']

-2.1743339401635815

In [10]:
s.a

-2.1743339401635815

In [11]:
s['e'] = 12

In [12]:
'e' in s

True

In [13]:
s.b

1.1108055571491553

### Vectorized operations and label alignment with Series
A key difference between Series and ndarray is that operations between Series **automatically align the data based on label**. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

In [14]:
s[1:] + s[:-1]

a         NaN
b    2.221611
c   -0.481674
d    0.319429
e         NaN
dtype: float64

The result of an operation between unaligned Series has the union of the indexes involved. If a label is not found in one Series or the other, the result for that label is marked as missing (NaN). **Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research**. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.
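
The union-and-NaN behaviour is easiest to see on two small Series with partially overlapping labels. The following is a minimal sketch with made-up values; the names `left` and `right` are illustrative only. `Series.add` with `fill_value` is shown as one way to treat a label missing on one side as zero instead of producing NaN.

```python
import pandas as pd

# Two Series whose labels only partially overlap (illustrative values).
left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# The + operator aligns on labels; the result index is the union a, b, c, d.
# Labels present on only one side ('a' and 'd') become NaN.
total = left + right
print(total)

# Series.add with fill_value substitutes 0 for a label missing on one side.
filled = left.add(right, fill_value=0)
print(filled)
```

Whether NaN or a fill value is the right choice depends on the analysis: NaN keeps the missing information visible, while a fill value silently treats absent data as a number.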

## DataFrame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

* Dict of 1D ndarrays, lists, dicts, or Series
* 2-D numpy.ndarray
* Structured or record ndarray
* A Series
* Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.
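
As a small sketch of how the `index` and `columns` arguments restrict the result, the dict of Series below is built with illustrative values, and `'three'` is a deliberately missing column name chosen for the example.

```python
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

# Only rows 'd', 'b', 'a' are kept, in that order; row 'c' is discarded.
subset = pd.DataFrame(d, index=['d', 'b', 'a'])
print(subset)

# A requested column that is missing from the dict ('three') is filled with NaN,
# and column 'one' is dropped because it was not requested.
with_extra = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
print(with_extra)
```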


In [15]:
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

In [16]:
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [17]:
# access to index
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [18]:
df.columns

Index(['one', 'two'], dtype='object')

### Column selection, addition, deletion

In [19]:
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [20]:
df['three'] = df.one * df.two
df['flag'] = df.one > 2
df

   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False


In [21]:
del df['two']

### More complex examples

In [22]:
# More complex example
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo


In [24]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

### Viewing data

In [25]:
dates = pd.DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

                   A         B         C         D
2013-01-01  1.022337 -0.651306 -0.342759  1.003470
2013-01-02 -0.588523 -0.143568  0.807798 -1.227523
2013-01-03  0.172962 -1.298321  0.939256 -1.135721
2013-01-04  0.255404 -0.349999  0.436032 -0.612105
2013-01-05 -0.905853  0.329141 -0.876243 -0.046370
2013-01-06  0.023173 -0.757701  1.207235  0.798595


In [26]:
df.head()

                   A         B         C         D
2013-01-01  1.022337 -0.651306 -0.342759  1.003470
2013-01-02 -0.588523 -0.143568  0.807798 -1.227523
2013-01-03  0.172962 -1.298321  0.939256 -1.135721
2013-01-04  0.255404 -0.349999  0.436032 -0.612105
2013-01-05 -0.905853  0.329141 -0.876243 -0.046370


In [27]:
df.tail(3)

                   A         B         C         D
2013-01-04  0.255404 -0.349999  0.436032 -0.612105
2013-01-05 -0.905853  0.329141 -0.876243 -0.046370
2013-01-06  0.023173 -0.757701  1.207235  0.798595


In [28]:
df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.003416 -0.478626  0.361887 -0.203276
std    0.679370  0.558730  0.810159  0.956063
min   -0.905853 -1.298321 -0.876243 -1.227523
25%   -0.435599 -0.731102 -0.148061 -1.004817
50%    0.098068 -0.500652  0.621915 -0.329238
75%    0.234794 -0.195176  0.906391  0.587354
max    1.022337  0.329141  1.207235  1.003470


In [29]:
df.sort_values(by="B")

                   A         B         C         D
2013-01-03  0.172962 -1.298321  0.939256 -1.135721
2013-01-06  0.023173 -0.757701  1.207235  0.798595
2013-01-01  1.022337 -0.651306 -0.342759  1.003470
2013-01-04  0.255404 -0.349999  0.436032 -0.612105
2013-01-02 -0.588523 -0.143568  0.807798 -1.227523
2013-01-05 -0.905853  0.329141 -0.876243 -0.046370


In [30]:
# columns
df.A

2013-01-01    1.022337
2013-01-02   -0.588523
2013-01-03    0.172962
2013-01-04    0.255404
2013-01-05   -0.905853
2013-01-06    0.023173
Freq: D, Name: A, dtype: float64

In [31]:
# rows
df[0:3]

                   A         B         C         D
2013-01-01  1.022337 -0.651306 -0.342759  1.003470
2013-01-02 -0.588523 -0.143568  0.807798 -1.227523
2013-01-03  0.172962 -1.298321  0.939256 -1.135721


In [32]:
# boolean indexing
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2

                   A         B         C         D      E
2013-01-01  1.022337 -0.651306 -0.342759  1.003470    one
2013-01-02 -0.588523 -0.143568  0.807798 -1.227523    one
2013-01-03  0.172962 -1.298321  0.939256 -1.135721    two
2013-01-04  0.255404 -0.349999  0.436032 -0.612105  three
2013-01-05 -0.905853  0.329141 -0.876243 -0.046370   four
2013-01-06  0.023173 -0.757701  1.207235  0.798595  three


In [33]:
df2[df2['E'].isin(['two','four'])]

                   A         B         C         D     E
2013-01-03  0.172962 -1.298321  0.939256 -1.135721   two
2013-01-05 -0.905853  0.329141 -0.876243 -0.046370  four


In [34]:
df2[df2.A>0]

                   A         B         C         D      E
2013-01-01  1.022337 -0.651306 -0.342759  1.003470    one
2013-01-03  0.172962 -1.298321  0.939256 -1.135721    two
2013-01-04  0.255404 -0.349999  0.436032 -0.612105  three
2013-01-06  0.023173 -0.757701  1.207235  0.798595  three


### Operations

In [35]:
df.mean()

A   -0.003416
B   -0.478626
C    0.361887
D   -0.203276
dtype: float64

In [36]:
df.mean(axis=1)

2013-01-01    0.257936
2013-01-02   -0.287954
2013-01-03   -0.330456
2013-01-04   -0.067667
2013-01-05   -0.374831
2013-01-06    0.317825
Freq: D, dtype: float64

In [None]:
# applying functions to data
df.apply(np.cumsum)
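
For a reproducible view of what `apply` does, the sketch below uses a small frame with fixed, made-up values instead of random numbers; the names `small`, `running`, and `spread` are illustrative. It shows both a function that keeps the column length and one that reduces each column to a single value.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]})

# A function that returns an array of the same length is applied column by column,
# so the result has the same shape as the input.
running = small.apply(np.cumsum)
print(running)

# A function that returns a scalar reduces each column to one value,
# so the result is a Series indexed by column name.
spread = small.apply(lambda col: col.max() - col.min())
print(spread)
```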

### Grouping
By **group by** we are referring to a process involving one or more of the following steps:
* Splitting the data into groups based on some criteria
* Applying a function to each group independently
* Combining the results into a data structure
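
The three steps can be spelled out by hand and compared with the single `groupby` call. This is only a sketch on made-up data; the frame `sales` and its column names are hypothetical.

```python
import pandas as pd

# Small illustrative frame (values are made up).
sales = pd.DataFrame({'key': ['x', 'y', 'x', 'y', 'x'],
                      'value': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Split: iterating over a GroupBy yields (group label, sub-DataFrame) pairs.
# Apply: the sum of each group's 'value' column is computed independently.
pieces = {label: group['value'].sum() for label, group in sales.groupby('key')}

# Combine: the per-group results are assembled into a single Series.
manual = pd.Series(pieces)

# The same three steps in one call.
builtin = sales.groupby('key')['value'].sum()

print(manual)
print(builtin)
```

In practice the one-line form is preferable; the manual loop is shown only to make the split, apply, and combine stages visible.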

In [37]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                   'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                   'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df

Unnamed: 0,A,B,C,D
0,foo,one,-0.593893,-1.816897
1,bar,one,-0.298509,0.638461
2,foo,two,1.272251,-0.487022
3,bar,three,1.149135,1.311743
4,foo,two,-0.352958,-0.246426
5,bar,two,-0.49451,-1.295404
6,foo,one,0.436829,-1.013869
7,foo,three,0.993858,2.207088


In [38]:
df.A.unique()

array(['foo', 'bar'], dtype=object)

Grouping and then applying the function `mean` to the resulting groups.

In [40]:
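# numeric_only=True skips the string column B, which recent pandas versions would otherwise reject
df.groupby('A').mean(numeric_only=True)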

            C         D
A
bar  0.118705  0.218267
foo  0.351217 -0.271425


Grouping by multiple columns forms a hierarchical index, and the function is then applied to each group.

In [None]:
df.groupby(['A','B']).mean()
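
To make the hierarchical index concrete, here is a minimal sketch with fixed, made-up values so the result is reproducible; `frame`, `means`, `bar_one`, and `flat` are illustrative names. It shows how one group can be selected with a tuple of labels and how the index levels can be turned back into ordinary columns.

```python
import pandas as pd

# Fixed values instead of random numbers so the result is reproducible.
frame = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                      'B': ['one', 'one', 'two', 'one'],
                      'C': [1.0, 2.0, 3.0, 4.0]})

means = frame.groupby(['A', 'B']).mean()

# The result is indexed by (A, B) pairs; the level names come from the grouping columns.
print(means.index.names)

# Select one group with a tuple of labels: the two ('bar', 'one') rows average to 3.0.
bar_one = means.loc[('bar', 'one'), 'C']
print(bar_one)

# reset_index turns the index levels back into ordinary columns.
flat = means.reset_index()
print(flat)
```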

### Pivot Tables
A pivot table is a table that summarizes data in another table, and is made by applying an operation such as sorting, averaging, or summing to data in the first table, typically including grouping of the data.

pandas builds pivot tables with `pd.pivot_table`; when several rows fall into the same cell, they are averaged by default.

In [41]:
dfp = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
dfp

       A  B    C         D         E
0    one  A  foo  0.318145  0.342288
1    one  B  foo  0.782149  0.540207
2    two  C  foo -0.152158 -0.240358
3  three  A  bar -0.078900  0.078295
4    one  B  bar -0.336452 -1.771042
5    one  C  bar -0.397852  0.230086
6    two  A  foo  0.883048 -0.396742
7  three  B  foo -0.789048  0.116781
8    one  C  foo -0.278658 -0.556337
9    one  A  bar -0.572947 -0.480353


In [42]:
pd.pivot_table(dfp, values='D', index=['A', 'B'], columns=['C'])

C             bar       foo
A     B
one   A -0.572947  0.318145
      B -0.336452  0.782149
      C -0.397852 -0.278658
three A -0.078900       NaN
      B       NaN -0.789048
      C  0.557545       NaN
two   A       NaN  0.883048
      B  0.531895       NaN
      C       NaN -0.152158
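
The aggregation step of a pivot table only becomes visible when several rows share the same cell. The sketch below uses made-up data with a duplicated (`A`, `C`) combination; the names `data`, `by_mean`, `by_sum`, and `via_groupby` are illustrative. It compares the default mean with `aggfunc='sum'` and with a `groupby` followed by `unstack`, which gives roughly the same table.

```python
import pandas as pd

# Illustrative data: the combination ('one', 'foo') occurs twice.
data = pd.DataFrame({'A': ['one', 'one', 'one', 'two'],
                     'C': ['foo', 'foo', 'bar', 'foo'],
                     'D': [1.0, 3.0, 5.0, 7.0]})

# Default aggregation is the mean: the two ('one', 'foo') rows collapse to 2.0.
by_mean = pd.pivot_table(data, values='D', index='A', columns='C')
print(by_mean)

# aggfunc selects a different aggregation, here the sum (1.0 + 3.0 = 4.0).
by_sum = pd.pivot_table(data, values='D', index='A', columns='C', aggfunc='sum')
print(by_sum)

# Roughly the same table as by_mean, built from groupby followed by unstack.
via_groupby = data.groupby(['A', 'C'])['D'].mean().unstack('C')
print(via_groupby)
```

Combinations that never occur in the data, such as (`two`, `bar`) here, appear as NaN in all three results.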
