# Introduction to Pandas 🐼

Adapted from the official "10 minutes to pandas" guide: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

In [2]:
import numpy as np # Never wrong to have NumPy on your side as well
import pandas as pd

**Pandas** is the standard Python package for working with tabular data. It provides two main data structures: `pd.Series` for a single column of data and `pd.DataFrame` for tables of one or more columns.

## Object Creation
Creating a Series by passing a list of values, letting pandas create a default integer index:

In [4]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

(See the row index on the left side of the output)
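
The index does not have to be the default integer range. As a minimal sketch, assuming some made-up string labels (`"a"` to `"d"` are purely illustrative), a Series can be given an explicit index and then queried by label:

```python
import numpy as np
import pandas as pd

# Hypothetical labels, chosen only for illustration
s_labeled = pd.Series([1, 3, 5, np.nan], index=["a", "b", "c", "d"])

print(s_labeled["b"])            # label-based lookup of a single value
print(s_labeled.index.tolist())  # the labels that replaced 0, 1, 2, 3
```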

Creating a DataFrame by passing a dict of objects that can be converted to Series-like columns (scalars are repeated for every row):

In [9]:
df1 = pd.DataFrame(
    {
        'A': 1.,
        'B': pd.Series([1, 2, 3, 4]),
        'C': np.array([3] * 4),
        'D': pd.Categorical(["test", "train", "test", "train"]),
        'E': 'foo'
    }
)
df1

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 1.0 | 1 | 3 | test | foo |
| 1 | 1.0 | 2 | 3 | train | foo |
| 2 | 1.0 | 3 | 3 | test | foo |
| 3 | 1.0 | 4 | 3 | train | foo |
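
Each column of a DataFrame can have its own dtype. The following sketch rebuilds the same frame and inspects its shape and column types; the exact integer width of column `C` can differ between platforms, so it is not spelled out here:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {
        "A": 1.0,                       # scalar, repeated for every row
        "B": pd.Series([1, 2, 3, 4]),   # determines the number of rows
        "C": np.array([3] * 4),
        "D": pd.Categorical(["test", "train", "test", "train"]),
        "E": "foo",                     # scalar, repeated for every row
    }
)

print(df1.shape)   # (number of rows, number of columns)
print(df1.dtypes)  # one dtype per column
```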


## Viewing Data

Here is how to view the top and bottom rows of the frame:

In [22]:
df2 = pd.DataFrame(np.arange(100).reshape((25, -1)))
df2.head() # Display the first n rows, default: 5

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 8 | 9 | 10 | 11 |
| 3 | 12 | 13 | 14 | 15 |
| 4 | 16 | 17 | 18 | 19 |


In [23]:
df2.tail()

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 20 | 80 | 81 | 82 | 83 |
| 21 | 84 | 85 | 86 | 87 |
| 22 | 88 | 89 | 90 | 91 |
| 23 | 92 | 93 | 94 | 95 |
| 24 | 96 | 97 | 98 | 99 |
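
Both `head` and `tail` accept the number of rows to return. A short sketch using the same 25 x 4 frame (the variable names `first_two` and `last_three` are just illustrative):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.arange(100).reshape((25, -1)))

first_two = df2.head(2)   # first 2 rows instead of the default 5
last_three = df2.tail(3)  # last 3 rows

print(first_two)
print(last_three)
```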


Display the index ("row names") and the columns:

In [24]:
df1.index

RangeIndex(start=0, stop=4, step=1)

In [25]:
df1.columns

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

Get a NumPy representation of the underlying data:

In [29]:
df2.head().values # This returns a numpy ndarray!

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])
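
In addition to the `.values` attribute, pandas offers the `DataFrame.to_numpy()` method for the same purpose. A minimal sketch:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.arange(100).reshape((25, -1)))

arr = df2.head().to_numpy()  # plain ndarray; index and column labels are dropped

print(type(arr))
print(arr.shape)
```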

`describe` shows a quick statistical summary of the data (by default, only numeric columns are included):

In [33]:
df1.describe()

|   | A | B | C |
|---|---|---|---|
| count | 4.0 | 4.0 | 4.0 |
| mean | 1.0 | 2.5 | 3.0 |
| std | 0.0 | 1.290994 | 0.0 |
| min | 1.0 | 1.0 | 3.0 |
| 25% | 1.0 | 1.75 | 3.0 |
| 50% | 1.0 | 2.5 | 3.0 |
| 75% | 1.0 | 3.25 | 3.0 |
| max | 1.0 | 4.0 | 3.0 |
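
To also summarize non-numeric columns, `describe` takes an `include` argument. The sketch below uses a small stand-in frame (an assumption for brevity, not the full `df1`) and passes `include="all"`:

```python
import pandas as pd

# Small stand-in frame with one numeric and one categorical column
df = pd.DataFrame(
    {
        "B": [1, 2, 3, 4],
        "D": pd.Categorical(["test", "train", "test", "train"]),
    }
)

summary = df.describe(include="all")  # adds rows such as unique/top/freq
print(summary)
```

Statistics that do not apply to a column (for example the mean of a categorical column) show up as `NaN`.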


Sort DataFrame by the values of one or more columns:

In [37]:
df1.sort_values(["D", "B"])

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 1.0 | 1 | 3 | test | foo |
| 2 | 1.0 | 3 | 3 | test | foo |
| 1 | 1.0 | 2 | 3 | train | foo |
| 3 | 1.0 | 4 | 3 | train | foo |
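
The sort direction can be set per column through the `ascending` argument. A sketch on a reduced stand-in frame (only columns `B` and `D`, an assumption for brevity), sorting `D` ascending and `B` descending:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "B": [1, 2, 3, 4],
        "D": pd.Categorical(["test", "train", "test", "train"]),
    }
)

# One boolean per sort column: D ascending, B descending
sorted_df = df.sort_values(["D", "B"], ascending=[True, False])
print(sorted_df)
```

Like most pandas operations, `sort_values` returns a new DataFrame and leaves `df` untouched.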


## Selection + Slicing

Selecting a single column, which yields a Series:

In [38]:
df1['A']

0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64

Note: this returns a Series!

Selecting multiple columns using a list, which yields a DataFrame:

In [47]:
df1[['A', 'D']]

|   | A | D |
|---|---|---|
| 0 | 1.0 | test |
| 1 | 1.0 | train |
| 2 | 1.0 | test |
| 3 | 1.0 | train |


Slicing with `[]` selects rows (the end of the slice is excluded):

In [43]:
df1[1:3]

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 1 | 1.0 | 2 | 3 | train | foo |
| 2 | 1.0 | 3 | 3 | test | foo |


Selecting by lists of row labels and column names, similar to NumPy-style indexing:

In [60]:
df1.loc[[1, 2], ["B", "D"]]

|   | B | D |
|---|---|---|
| 1 | 2 | train |
| 2 | 3 | test |


### Selection by Label

Common syntax: `df.loc[row_index, columns]`

In [50]:
df1.loc[:, ['A', 'B']]

|   | A | B |
|---|---|---|
| 0 | 1.0 | 1 |
| 1 | 1.0 | 2 |
| 2 | 1.0 | 3 |
| 3 | 1.0 | 4 |


In [52]:
df1.loc[2:4, ['A', 'B']]

|   | A | B |
|---|---|---|
| 2 | 1.0 | 3 |
| 3 | 1.0 | 4 |


### Selection by Position

In [53]:
df1.iloc[3]

A        1
B        4
C        3
D    train
E      foo
Name: 3, dtype: object

In [54]:
df1.iloc[2:4, 0:2]

|   | A | B |
|---|---|---|
| 2 | 1.0 | 3 |
| 3 | 1.0 | 4 |
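
One difference between the two accessors is easy to miss: slices in `.loc` include the end label, while slices in `.iloc` exclude the end position, as in regular Python slicing. A minimal sketch on a small illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40]}, index=[0, 1, 2, 3])

by_label = df.loc[1:2]      # labels 1 and 2: the end label is included
by_position = df.iloc[1:2]  # position 1 only: the end position is excluded

print(by_label)
print(by_position)
```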


### Boolean indexing / filtering

Using a single column’s values to select data.

In [64]:
df1[df1.B > 2]

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 2 | 1.0 | 3 | 3 | test | foo |
| 3 | 1.0 | 4 | 3 | train | foo |


Or the values from multiple columns:

In [67]:
df1[
    (df1.B > 2) &
    (df1.D == "test")
]

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 2 | 1.0 | 3 | 3 | test | foo |


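Note the parentheses around each condition! They are required because `&` binds more tightly than comparison operators such as `>` and `==`.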
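
Conditions can also be combined with `|` (or) and inverted with `~` (not). A sketch on a reduced stand-in frame (only columns `B` and `D`, an assumption for brevity; the name `mask` is illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "B": [1, 2, 3, 4],
        "D": pd.Categorical(["test", "train", "test", "train"]),
    }
)

# Each comparison is wrapped in parentheses before combining
mask = (df["B"] > 2) | (df["D"] == "train")

either = df[mask]    # rows where at least one condition holds
neither = df[~mask]  # rows where neither condition holds

print(either)
print(neither)
```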

Using the isin() method for filtering:

In [68]:
df1[df1['B'].isin([1, 3])]

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 1.0 | 1 | 3 | test | foo |
| 2 | 1.0 | 3 | 3 | test | foo |


### Changing Values in DataFrames
Values can be changed using the indexing methods shown above:

In [69]:
df1.loc[df1.B > 2, "C"] = 5
df1

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 1.0 | 1 | 3 | test | foo |
| 1 | 1.0 | 2 | 3 | train | foo |
| 2 | 1.0 | 3 | 5 | test | foo |
| 3 | 1.0 | 4 | 5 | train | foo |
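
The same pattern works for single cells. As a small sketch on an illustrative frame, `.loc` with a boolean mask changes several rows at once, while `.at` sets one cell addressed by row label and column name:

```python
import pandas as pd

df = pd.DataFrame({"B": [1, 2, 3, 4], "C": [3, 3, 3, 3]})

df.loc[df["B"] > 2, "C"] = 5  # conditional assignment to several rows
df.at[0, "C"] = 0             # single cell: row label 0, column "C"

print(df)
```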


Columns can be added (or overwritten) just like values in a dictionary:

In [70]:
df1["F"] = [3, 7, 3, 1]
df1

|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 1 | 3 | test | foo | 3 |
| 1 | 1.0 | 2 | 3 | train | foo | 7 |
| 2 | 1.0 | 3 | 5 | test | foo | 3 |
| 3 | 1.0 | 4 | 5 | train | foo | 1 |
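
New columns are often computed from existing ones. The sketch below (with made-up column names `G` and `H`) shows the dictionary-style assignment, which modifies the frame in place, next to `DataFrame.assign`, which returns a new frame instead:

```python
import pandas as pd

df = pd.DataFrame({"B": [1, 2, 3, 4], "F": [3, 7, 3, 1]})

df["G"] = df["B"] * df["F"]           # element-wise product, added in place
df_with_h = df.assign(H=df["G"] + 1)  # new DataFrame; df itself gets no "H"

print(df)
print(df_with_h)
```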


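Appending a row to a DataFrame is a little trickier. `DataFrame.append` was removed in pandas 2.0, so wrap the new row in a DataFrame and use `pd.concat`:

In [80]:
new_row = pd.DataFrame([{"A": 1, "B": 2, "D": "test"}])  # One-row DataFrame; missing columns become NaN
pd.concat([df1, new_row], ignore_index=True)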

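|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 1 | 3.0 | test | foo | 3.0 |
| 1 | 1.0 | 2 | 3.0 | train | foo | 7.0 |
| 2 | 1.0 | 3 | 5.0 | test | foo | 3.0 |
| 3 | 1.0 | 4 | 5.0 | train | foo | 1.0 |
| 4 | 1.0 | 2 | NaN | test | NaN | NaN |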


Keep in mind that this operation does not modify the DataFrame in place; it returns a new DataFrame!
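
To keep the extended frame, the result has to be assigned to a variable. A minimal sketch using `pd.concat` on a small illustrative frame (the names `new_row` and `extended` are assumptions for this example):

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 1.0], "B": [1, 2]})

# One-row DataFrame; column "B" is left out on purpose
new_row = pd.DataFrame([{"A": 1.0}])

extended = pd.concat([df, new_row], ignore_index=True)

print(len(df))        # the original frame is unchanged
print(extended)       # the missing "B" value shows up as NaN
```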

## Basic operations

In [74]:
df1.mean(numeric_only=True) # Mean of each column (axis=0 is the default); non-numeric columns are skipped

A    1.0
B    2.5
C    4.0
F    3.5
dtype: float64

In [75]:
df1.mean(axis=1, numeric_only=True) # Mean of each row, using the numeric columns only

0    2.00
1    3.25
2    3.00
3    2.75
dtype: float64
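
The `axis` argument is a common source of confusion, so here is a small sketch on an illustrative two-column frame: `axis=0` yields one result per column, `axis=1` one result per row.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

col_means = df.mean(axis=0)  # one value per column, indexed by column name
row_means = df.mean(axis=1)  # one value per row, indexed by row label

print(col_means)
print(row_means)
```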

In [81]:
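df1.sum() # Strings are concatenated; newer pandas versions may raise on the categorical column D unless numeric_only=True is passed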

A                     4
B                    10
C                    16
D    testtraintesttrain
E          foofoofoofoo
F                    14
dtype: object

In [84]:
df1.count()

A    4
B    4
C    4
D    4
E    4
F    4
dtype: int64

## Apply functions

In [78]:
df1.apply(np.cumsum) # Cumulative sum per column; newer pandas versions may raise on the categorical column D

|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 1 | 3 | test | foo | 3 |
| 1 | 2.0 | 3 | 6 | testtrain | foofoo | 10 |
| 2 | 3.0 | 6 | 11 | testtraintest | foofoofoo | 13 |
| 3 | 4.0 | 10 | 16 | testtraintesttrain | foofoofoofoo | 14 |
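
`apply` also accepts custom functions and an `axis` argument. A sketch on a small numeric stand-in frame (an assumption for brevity), applying one lambda per column and another per row:

```python
import pandas as pd

df = pd.DataFrame({"B": [1, 2, 3, 4], "F": [3, 7, 3, 1]})

# Default axis=0: the function receives each column as a Series
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_total = df.apply(lambda row: row["B"] + row["F"], axis=1)

print(col_range)
print(row_total)
```

For simple arithmetic like the row total, the vectorized form `df["B"] + df["F"]` gives the same result and is usually faster; `apply` is mainly useful when no vectorized equivalent exists.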
