# Pandas 10-Minute Tutorial Notes
By Michael Wu

[Reference](https://pandas.pydata.org/pandas-docs/stable/10min.html)

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

## Import

In [16]:
import numpy as np

In [17]:
import pandas as pd

In [18]:
import matplotlib.pyplot as plt

## Series

In [19]:
s = pd.Series([1,3,5,np.nan,6,8])

In [20]:
print(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64


In [23]:
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

**Note that a Series is a one-dimensional labeled array; you can think of it as a single column of n values.**
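
To make the single-column picture concrete, here is a minimal sketch that inspects the shape and index of the same Series and then converts it to a one-column DataFrame. The column name `value` is only an illustrative choice.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# A Series is one-dimensional, so its shape has a single entry.
print(s.shape)  # (6,)

# Without an explicit index, pandas labels the rows 0..n-1.
print(list(s.index))

# to_frame() wraps the Series in a DataFrame with exactly one column.
frame = s.to_frame(name="value")
print(frame.shape)  # (6, 1)
```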

## DataFrame
A two-dimensional table with a row index and labeled columns.

In [26]:
dates = pd.date_range('20180101', periods=6)

In [27]:
dates

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06'],
              dtype='datetime64[ns]', freq='D')
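
As a small aside on `pd.date_range`, the sketch below builds the same six-day index from an explicit end date and shows the `freq` argument with a two-day step. The variable names are illustrative.

```python
import pandas as pd

# Six consecutive days, given a start and a number of periods.
by_periods = pd.date_range("20180101", periods=6)

# The same range, given a start and an end date instead.
by_end = pd.date_range(start="2018-01-01", end="2018-01-06")
print(by_periods.equals(by_end))  # True

# freq controls the step between entries; "2D" means every second day.
every_other = pd.date_range("20180101", periods=3, freq="2D")
print(list(every_other.strftime("%Y-%m-%d")))
```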

#### 1. By passing a NumPy array, an index and labeled columns:

In [28]:
df = pd.DataFrame(np.random.randn(6, 4), index = dates, columns = list('ABCD'))

In [29]:
df

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2018-01-01 | -0.359872 | -0.97346 | -0.69747 | -0.047439 |
| 2018-01-02 | -1.157497 | 0.716557 | 0.078953 | 0.697488 |
| 2018-01-03 | -0.237439 | -0.866224 | 0.506165 | -0.141225 |
| 2018-01-04 | 1.262685 | 0.600326 | 1.557483 | 0.177108 |
| 2018-01-05 | -0.952798 | -1.793948 | -0.075188 | -1.285164 |
| 2018-01-06 | 0.290557 | -1.614248 | 0.067387 | 1.663238 |


#### 2. By passing a dict of objects that can each be converted to a Series-like structure:

In [31]:
df2 = pd.DataFrame({
    'A' : 1.,
    'B' : pd.Timestamp('20180101'),
    'C' : pd.Series(1, index = list(range(4)), dtype = 'float32'),
    'D' : np.array([3] * 4, dtype = 'int32'),
    'E' : pd.Categorical(["test", "train", "happy", "chill"]),
    'F' : 'foo'
})

In [32]:
df2

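| | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 2018-01-01 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2018-01-01 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2018-01-01 | 1.0 | 3 | happy | foo |
| 3 | 1.0 | 2018-01-01 | 1.0 | 3 | chill | foo |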


**Check the data types of each column:**

In [34]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
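
Once the dtypes are known, they can be used to pick or convert columns. The following is a minimal sketch on a trimmed-down version of `df2` (columns B and F are left out because their reported dtype can differ between pandas versions); `numeric` and `wider` are illustrative names.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "A": 1.0,  # a scalar is broadcast to every row
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "happy", "chill"]),
})

# Keep only the numeric columns (floats and ints, not the categorical).
numeric = df2.select_dtypes(include="number")
print(list(numeric.columns))  # ['A', 'C', 'D']

# astype returns a new frame with the requested column dtypes.
wider = df2.astype({"C": "float64"})
print(wider["C"].dtype)  # float64
```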

**TAB completion for column names:**

df2.<TAB>

## Viewing Data

**View the top and bottom rows of df:**

In [36]:
df.head(2)

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2018-01-01 | -0.359872 | -0.97346 | -0.69747 | -0.047439 |
| 2018-01-02 | -1.157497 | 0.716557 | 0.078953 | 0.697488 |


In [37]:
df.tail(3)

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2018-01-04 | 1.262685 | 0.600326 | 1.557483 | 0.177108 |
| 2018-01-05 | -0.952798 | -1.793948 | -0.075188 | -1.285164 |
| 2018-01-06 | 0.290557 | -1.614248 | 0.067387 | 1.663238 |


**View the index, columns and underlying NumPy data values:**

In [38]:
df.index

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06'],
              dtype='datetime64[ns]', freq='D')

In [39]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [40]:
df.values

array([[-0.35987201, -0.97345952, -0.69747016, -0.0474386 ],
       [-1.15749712,  0.71655681,  0.0789534 ,  0.69748783],
       [-0.23743894, -0.86622372,  0.50616537, -0.14122489],
       [ 1.26268459,  0.60032644,  1.55748349,  0.17710786],
       [-0.95279835, -1.79394821, -0.07518802, -1.28516407],
       [ 0.290557  , -1.61424837,  0.06738673,  1.66323767]])
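
The pandas documentation recommends `DataFrame.to_numpy()` over `.values` for getting the underlying array. Here is a minimal sketch on two small made-up frames, showing that a frame with mixed column types falls back to an `object` array.

```python
import numpy as np
import pandas as pd

# All-float frame: the result is a plain float64 array.
df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
arr = df.to_numpy()
print(arr.shape, arr.dtype)

# Mixed float and string columns: NumPy needs one dtype for the
# whole array, so the result is an object array.
mixed = pd.DataFrame({"A": [1.0, 2.0], "B": ["x", "y"]})
print(mixed.to_numpy().dtype)  # object
```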

**Quick view of statistics:**

In [41]:
df.describe()

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | -0.192394 | -0.655166 | 0.239555 | 0.177334 |
| std | 0.882067 | 1.078926 | 0.753728 | 0.976665 |
| min | -1.157497 | -1.793948 | -0.69747 | -1.285164 |
| 25% | -0.804567 | -1.454051 | -0.039544 | -0.117778 |
| 50% | -0.298655 | -0.919842 | 0.07317 | 0.064835 |
| 75% | 0.158558 | 0.233689 | 0.399362 | 0.567393 |
| max | 1.262685 | 0.716557 | 1.557483 | 1.663238 |
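
Each row of `describe()` corresponds to an ordinary column statistic. A minimal sketch on a tiny made-up column checks a few of them against the direct method calls:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})
summary = df.describe()

# The summary is itself a DataFrame, indexed by statistic name.
print(list(summary.index))

# "mean" and "50%" match mean() and median() on the column.
print(summary.loc["mean", "A"])  # 2.5
print(summary.loc["50%", "A"])   # 2.5

# "std" is the sample standard deviation, the same value std() returns.
print(summary.loc["std", "A"], df["A"].std())
```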


**Transpose:**

In [42]:
df.T

| | 2018-01-01 00:00:00 | 2018-01-02 00:00:00 | 2018-01-03 00:00:00 | 2018-01-04 00:00:00 | 2018-01-05 00:00:00 | 2018-01-06 00:00:00 |
| --- | --- | --- | --- | --- | --- | --- |
| A | -0.359872 | -1.157497 | -0.237439 | 1.262685 | -0.952798 | 0.290557 |
| B | -0.97346 | 0.716557 | -0.866224 | 0.600326 | -1.793948 | -1.614248 |
| C | -0.69747 | 0.078953 | 0.506165 | 1.557483 | -0.075188 | 0.067387 |
| D | -0.047439 | 0.697488 | -0.141225 | 0.177108 | -1.285164 | 1.663238 |


**Sorting by an axis (here `axis = 1`, so the column labels are sorted, in descending order):**

In [43]:
df.sort_index(axis = 1, ascending = False)

| | D | C | B | A |
| --- | --- | --- | --- | --- |
| 2018-01-01 | -0.047439 | -0.69747 | -0.97346 | -0.359872 |
| 2018-01-02 | 0.697488 | 0.078953 | 0.716557 | -1.157497 |
| 2018-01-03 | -0.141225 | 0.506165 | -0.866224 | -0.237439 |
| 2018-01-04 | 0.177108 | 1.557483 | 0.600326 | 1.262685 |
| 2018-01-05 | -1.285164 | -0.075188 | -1.793948 | -0.952798 |
| 2018-01-06 | 1.663238 | 0.067387 | -1.614248 | 0.290557 |
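
The `axis` argument decides which set of labels is reordered. A minimal sketch on a small made-up frame contrasts the two choices:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=["x", "y"], columns=["A", "B"])

# axis=1 reorders the column labels; the rows stay where they are.
by_cols = df.sort_index(axis=1, ascending=False)
print(list(by_cols.columns), list(by_cols.index))

# axis=0 (the default) reorders the row labels instead.
by_rows = df.sort_index(axis=0, ascending=False)
print(list(by_rows.columns), list(by_rows.index))
```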


**Sorting by values (of one particular column):**

In [44]:
df.sort_values(by = 'B')

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2018-01-05 | -0.952798 | -1.793948 | -0.075188 | -1.285164 |
| 2018-01-06 | 0.290557 | -1.614248 | 0.067387 | 1.663238 |
| 2018-01-01 | -0.359872 | -0.97346 | -0.69747 | -0.047439 |
| 2018-01-03 | -0.237439 | -0.866224 | 0.506165 | -0.141225 |
| 2018-01-04 | 1.262685 | 0.600326 | 1.557483 | 0.177108 |
| 2018-01-02 | -1.157497 | 0.716557 | 0.078953 | 0.697488 |
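
`sort_values` returns a new, reordered frame and leaves the original untouched. A minimal sketch on a small made-up frame, also showing descending order:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [0.5, -1.0, 2.0]})

# Ascending by column B (the default direction).
asc = df.sort_values(by="B")
print(list(asc.index))

# Descending by column B.
desc = df.sort_values(by="B", ascending=False)
print(list(desc.index))

# The original frame keeps its row order.
print(list(df.index))
```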


**Copy:**

In [49]:
df3 = df.copy()

**Add a new column:**

In [51]:
df3['E'] = ['one', 'two', 'three', 'four', 'five', 'six']

In [52]:
df3

| | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2018-01-01 | -0.359872 | -0.97346 | -0.69747 | -0.047439 | one |
| 2018-01-02 | -1.157497 | 0.716557 | 0.078953 | 0.697488 | two |
| 2018-01-03 | -0.237439 | -0.866224 | 0.506165 | -0.141225 | three |
| 2018-01-04 | 1.262685 | 0.600326 | 1.557483 | 0.177108 | four |
| 2018-01-05 | -0.952798 | -1.793948 | -0.075188 | -1.285164 | five |
| 2018-01-06 | 0.290557 | -1.614248 | 0.067387 | 1.663238 | six |
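
The reason for copying first is that the new column then exists only on the copy. A minimal sketch on a small made-up frame shows this, and also that a list of the wrong length is rejected; the column name `F` and the flag `mismatch_rejected` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})
df3 = df.copy()

# The list must supply one value per row.
df3["E"] = ["one", "two", "three"]
print("E" in df3.columns, "E" in df.columns)  # True False

# A list with too few values raises ValueError.
mismatch_rejected = False
try:
    df3["F"] = ["a", "b"]  # only 2 values for 3 rows
except ValueError as err:
    mismatch_rejected = True
    print("length mismatch:", err)
```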


## Selection

### NOTICE: positional slicing
With `iloc`, the slice `3:5` selects positions 3 and 4. The end position 5 is excluded, as in ordinary Python slicing.

In [45]:
df.iloc[3:5, 0:2]

| | A | B |
| --- | --- | --- |
| 2018-01-04 | 1.262685 | 0.600326 |
| 2018-01-05 | -0.952798 | -1.793948 |
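
For comparison, label-based slicing with `loc` includes both end points. The sketch below uses a frame filled with the integers 0 to 23 (an illustrative stand-in for the random data) and selects the same block both ways.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20180101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Positional: rows 3 and 4, columns 0 and 1 (end points excluded).
by_pos = df.iloc[3:5, 0:2]

# Label-based: both end labels are included in the result.
by_label = df.loc["2018-01-04":"2018-01-05", "A":"B"]

print(by_pos)
print(by_pos.equals(by_label))  # True
```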


### Boolean Indexing

In [47]:
df[df.A > 0]

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2018-01-04 | 1.262685 | 0.600326 | 1.557483 | 0.177108 |
| 2018-01-06 | 0.290557 | -1.614248 | 0.067387 | 1.663238 |


In [48]:
df[df > 0]

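| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2018-01-01 | NaN | NaN | NaN | NaN |
| 2018-01-02 | NaN | 0.716557 | 0.078953 | 0.697488 |
| 2018-01-03 | NaN | NaN | 0.506165 | NaN |
| 2018-01-04 | 1.262685 | 0.600326 | 1.557483 | 0.177108 |
| 2018-01-05 | NaN | NaN | NaN | NaN |
| 2018-01-06 | 0.290557 | NaN | 0.067387 | 1.663238 |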
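
The two forms behave differently: a boolean Series filters rows, while a boolean DataFrame masks individual cells. A minimal sketch on a small made-up frame, also combining two conditions with `&` (each condition needs its own parentheses):

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 2.0, 3.0], "B": [5.0, -6.0, 7.0]})

# A boolean Series keeps only the rows where it is True.
rows = df[df["A"] > 0]
print(list(rows.index))

# Combine conditions element-wise with & (and) or | (or).
both = df[(df["A"] > 0) & (df["B"] > 0)]
print(list(both.index))

# A boolean DataFrame keeps the shape and puts NaN where it is False;
# this gives the same result as df.where(df > 0).
masked = df[df > 0]
print(masked)
```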


Some parts of the "Selection" section are skipped in these notes. For more, see the official tutorial:

https://pandas.pydata.org/pandas-docs/stable/10min.html