# Pandas 

This is a short introduction to pandas, geared mainly toward new users. You can find more complex recipes in the cookbook.

Customarily, we import as follows:

In [15]:
import pandas as pd

In [16]:
import numpy as np

In [17]:
import matplotlib.pyplot as plt

# Object Creation

Creating a Series by passing a list of values, letting pandas create a default integer index:

In [18]:
s = pd.Series([1,3,5,np.nan,6,8])

In [19]:
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
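The missing value is the reason the Series above has dtype float64 instead of an integer dtype: NaN is a floating-point value. The following is a small sketch of that effect; the names with_nan and without_nan are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same values as above, with and without the missing entry.
with_nan = pd.Series([1, 3, 5, np.nan, 6, 8])
without_nan = pd.Series([1, 3, 5, 6, 8])

print(with_nan.dtype)          # float64, because NaN is a float
print(without_nan.dtype)       # an integer dtype (typically int64)
print(with_nan.isna().sum())   # number of missing entries
```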

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [20]:
dates = pd.date_range('20210505', periods=6)

In [21]:
dates

DatetimeIndex(['2021-05-05', '2021-05-06', '2021-05-07', '2021-05-08',
               '2021-05-09', '2021-05-10'],
              dtype='datetime64[ns]', freq='D')

In [22]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [23]:
df

                   A         B         C         D
2021-05-05 -0.554748 -0.516329 -1.221866 -0.889811
2021-05-06 -2.152828  0.885595  1.361189  0.331374
2021-05-07  0.100295  0.390659  1.195424  0.573039
2021-05-08  0.447972  0.895910  1.083364  2.296458
2021-05-09 -0.403090 -0.133006 -0.103473  0.175212
2021-05-10  0.415369 -0.852440  0.389363 -0.490938
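Because np.random.randn draws fresh values on every run, the numbers you see will differ from the ones printed here. If reproducible values are wanted, one option is a seeded NumPy generator. This is only a sketch: the seed 0 and the names rng and df_seeded are arbitrary choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed 0 is an arbitrary, illustrative choice
dates = pd.date_range('20210505', periods=6)

# Same shape, index, and column labels as df above, but repeatable.
df_seeded = pd.DataFrame(rng.standard_normal((6, 4)),
                         index=dates, columns=list('ABCD'))
print(df_seeded.shape)
```

Re-running this cell with the same seed produces the same frame each time.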


Creating a DataFrame by passing a dict of objects that can be converted to something Series-like:

In [24]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20210506'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

In [25]:
df2

     A          B    C  D      E    F
0  1.0 2021-05-06  1.0  3   test  foo
1  1.0 2021-05-06  1.0  3  train  foo
2  1.0 2021-05-06  1.0  3   test  foo
3  1.0 2021-05-06  1.0  3  train  foo


The columns of the resulting DataFrame have specific dtypes:

In [26]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float64
D             int32
E          category
F            object
dtype: object
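Once the dtypes are known, they can be used to pick out columns. The sketch below rebuilds df2 and filters it with DataFrame.select_dtypes; the variable names numeric and categorical are illustrative only.

```python
import numpy as np
import pandas as pd

# Same construction as df2 above.
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20210506'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

numeric = df2.select_dtypes(include='number')        # float and int columns
categorical = df2.select_dtypes(include='category')  # categorical columns

print(list(numeric.columns))      # ['A', 'C', 'D']
print(list(categorical.columns))  # ['E']
```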

If you're using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here's a subset of the attributes that will be completed.

In [37]:
df2.<TAB>

Here <TAB> stands for pressing the Tab key after typing df2. in the cell. Entering the characters literally and running the cell raises SyntaxError: invalid syntax.

In [38]:
df2.abs

<bound method NDFrame.abs of      A          B    C  D      E    F
0  1.0 2021-05-06  1.0  3   test  foo
1  1.0 2021-05-06  1.0  3  train  foo
2  1.0 2021-05-06  1.0  3   test  foo
3  1.0 2021-05-06  1.0  3  train  foo>

In the completion list, the columns A, B, C, and D are tab-completed automatically, and E is there as well, alongside public attributes such as abs shown above. The rest of the attributes are omitted for brevity.

# Viewing Data

View the top and bottom rows of the frame:

In [39]:
df.head()

                   A         B         C         D
2021-05-05 -0.554748 -0.516329 -1.221866 -0.889811
2021-05-06 -2.152828  0.885595  1.361189  0.331374
2021-05-07  0.100295  0.390659  1.195424  0.573039
2021-05-08  0.447972  0.895910  1.083364  2.296458
2021-05-09 -0.403090 -0.133006 -0.103473  0.175212


In [40]:
df.tail(3)

                   A         B         C         D
2021-05-08  0.447972  0.895910  1.083364  2.296458
2021-05-09 -0.403090 -0.133006 -0.103473  0.175212
2021-05-10  0.415369 -0.852440  0.389363 -0.490938


Display the index, the columns, and the underlying NumPy data:

In [41]:
df.index

DatetimeIndex(['2021-05-05', '2021-05-06', '2021-05-07', '2021-05-08',
               '2021-05-09', '2021-05-10'],
              dtype='datetime64[ns]', freq='D')

In [42]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [43]:
df.values

array([[-0.55474789, -0.51632915, -1.22186606, -0.8898114 ],
       [-2.15282772,  0.88559464,  1.36118852,  0.33137448],
       [ 0.10029484,  0.39065939,  1.1954242 ,  0.57303908],
       [ 0.44797162,  0.8959099 ,  1.08336417,  2.29645799],
       [-0.40309029, -0.13300588, -0.10347346,  0.17521169],
       [ 0.41536921, -0.85244019,  0.38936303, -0.49093808]])
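The pandas documentation recommends DataFrame.to_numpy() over the values attribute for getting the underlying array. A minimal sketch on a small hand-made frame follows; the names frame and arr are illustrative and not from the original notebook.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20210505', periods=6)
frame = pd.DataFrame(np.arange(24.0).reshape(6, 4),
                     index=dates, columns=list('ABCD'))

arr = frame.to_numpy()   # plain NumPy array; index and column labels are dropped
print(type(arr))
print(arr.shape)         # (6, 4)
```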

describe() shows a quick statistical summary of your data:

In [44]:
df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.357838  0.111731  0.450667  0.332556
std    0.971584  0.731517  0.989163  1.105606
min   -2.152828 -0.852440 -1.221866 -0.889811
25%   -0.516833 -0.420498  0.019736 -0.324401
50%   -0.151398  0.128827  0.736364  0.253293
75%    0.336601  0.761861  1.167409  0.512623
max    0.447972  0.895910  1.361189  2.296458
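Each row of the summary corresponds to a per-column statistic that can also be computed directly. The sketch below checks a few of them on a small hand-made frame, so the numbers are easy to verify by hand; df_small and summary are illustrative names.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                         'B': [10.0, 20.0, 30.0, 40.0]})
summary = df_small.describe()

# Rows of the summary are addressed by label with .loc.
print(summary.loc['mean', 'A'])   # 2.5
print(summary.loc['50%', 'A'])    # 2.5, the median

# The 'std' row is the sample standard deviation (ddof=1).
print(np.isclose(summary.loc['std', 'A'], df_small['A'].std(ddof=1)))  # True
```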


Transposing your data.

In [45]:
df.T

   2021-05-05  2021-05-06  2021-05-07  2021-05-08  2021-05-09  2021-05-10
A   -0.554748   -2.152828    0.100295    0.447972   -0.403090    0.415369
B   -0.516329    0.885595    0.390659    0.895910   -0.133006   -0.852440
C   -1.221866    1.361189    1.195424    1.083364   -0.103473    0.389363
D   -0.889811    0.331374    0.573039    2.296458    0.175212   -0.490938


Sorting by an axis

In [50]:
df.sort_index(axis=1, ascending=False)

                   D         C         B         A
2021-05-05 -0.889811 -1.221866 -0.516329 -0.554748
2021-05-06  0.331374  1.361189  0.885595 -2.152828
2021-05-07  0.573039  1.195424  0.390659  0.100295
2021-05-08  2.296458  1.083364  0.895910  0.447972
2021-05-09  0.175212 -0.103473 -0.133006 -0.403090
2021-05-10 -0.490938  0.389363 -0.852440  0.415369


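Sorting by values. The DataFrame.sort method has been removed from pandas; sort_values replaces it.

In [51]:
df.sort_values(by='B')

                   A         B         C         D
2021-05-10  0.415369 -0.852440  0.389363 -0.490938
2021-05-05 -0.554748 -0.516329 -1.221866 -0.889811
2021-05-09 -0.403090 -0.133006 -0.103473  0.175212
2021-05-07  0.100295  0.390659  1.195424  0.573039
2021-05-06 -2.152828  0.885595  1.361189  0.331374
2021-05-08  0.447972  0.895910  1.083364  2.296458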
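A minimal sketch of value sorting with DataFrame.sort_values in both directions, on a small hand-made frame so the expected order is easy to check. The frame scores and the result names are illustrative, not from the original notebook.

```python
import pandas as pd

scores = pd.DataFrame({'A': [3, 1, 2], 'B': [0.5, -1.0, 0.25]},
                      index=list('xyz'))

by_b = scores.sort_values(by='B')                        # ascending by column B
by_b_desc = scores.sort_values(by='B', ascending=False)  # descending by column B

print(list(by_b.index))       # ['y', 'z', 'x']
print(list(by_b_desc.index))  # ['x', 'z', 'y']
```

sort_values returns a new frame, so scores keeps its original row order.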

# Selection


## Getting

Selecting a single column, which yields a Series. This is equivalent to df.A:

In [54]:
df['A']

2021-05-05   -0.554748
2021-05-06   -2.152828
2021-05-07    0.100295
2021-05-08    0.447972
2021-05-09   -0.403090
2021-05-10    0.415369
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows.

In [55]:
df[0:3]

                   A         B         C         D
2021-05-05 -0.554748 -0.516329 -1.221866 -0.889811
2021-05-06 -2.152828  0.885595  1.361189  0.331374
2021-05-07  0.100295  0.390659  1.195424  0.573039


In [57]:
df['2021-05-05':'2021-05-07']

                   A         B         C         D
2021-05-05 -0.554748 -0.516329 -1.221866 -0.889811
2021-05-06 -2.152828  0.885595  1.361189  0.331374
2021-05-07  0.100295  0.390659  1.195424  0.573039
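The two slices above return the same three rows, but they follow different rules: the integer slice is positional and excludes its end position, while the label slice includes both endpoints. A small sketch with a deterministic frame follows; frame, by_position, and by_label are illustrative names.

```python
import pandas as pd

dates = pd.date_range('20210505', periods=6)
frame = pd.DataFrame({'A': range(6)}, index=dates)

by_position = frame[0:3]                      # positions 0, 1, 2 (end excluded)
by_label = frame['2021-05-05':'2021-05-07']   # both endpoint labels included

print(len(by_position), len(by_label))   # 3 3
print(by_label.index[-1])                # 2021-05-07 00:00:00
```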


## Selecting by Label

Getting a cross section using a label:

In [58]:
df.loc[dates[0]]

A   -0.554748
B   -0.516329
C   -1.221866
D   -0.889811
Name: 2021-05-05 00:00:00, dtype: float64
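loc also accepts a column selector after the row label, which allows picking a single cell or a labeled sub-block. This is a sketch on a hand-made frame with known values; the names frame, row, cell, and block are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20210505', periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4),
                     index=dates, columns=list('ABCD'))

row = frame.loc[dates[0]]                               # Series indexed by column label
cell = frame.loc[dates[0], 'A']                         # single scalar value
block = frame.loc['20210506':'20210508', ['A', 'B']]    # label slice on rows, list of columns

print(cell)          # 0
print(block.shape)   # (3, 2)
```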