# Pandas Tutorial
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

**Agenda**
- What is a Data Frame?
- What is a Data Series?
- Different operations in Pandas.

In [1]:
# First import pandas and numpy

import pandas as pd
import numpy as np

The `index` argument sets the row labels and the `columns` argument sets the column (variable) names.

In [2]:
# Play with DataFrame

df = pd.DataFrame(np.arange(0, 20).reshape(5, 4),
                  index=["R1", "R2", "R3", "R4", "R5"],
                  columns=["C1", "C2", "C3", "C4"])

df

    C1  C2  C3  C4
R1   0   1   2   3
R2   4   5   6   7
R3   8   9  10  11
R4  12  13  14  15
R5  16  17  18  19


If `index` and `columns` are omitted, pandas labels the rows and columns with default integers starting at 0:

In [3]:
df1 = pd.DataFrame(np.arange(0, 20).reshape(5, 4))
df1

    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19


In [4]:
df.head()

    C1  C2  C3  C4
R1   0   1   2   3
R2   4   5   6   7
R3   8   9  10  11
R4  12  13  14  15
R5  16  17  18  19


### Save to a file
For example, write the DataFrame to a CSV file:

In [6]:
df.to_csv("04. df1.csv")

Check if this file has been created in the environment.
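
One way to confirm the write succeeded is to check that the path exists and then read the file back. The sketch below is only an illustration: it writes to a temporary directory under a hypothetical file name, `df_demo.csv`, rather than the name used above, and it assumes the row labels should be restored from the first CSV column with `index_col=0`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 20).reshape(5, 4),
                  index=["R1", "R2", "R3", "R4", "R5"],
                  columns=["C1", "C2", "C3", "C4"])

# Hypothetical path inside a temporary directory, so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "df_demo.csv")
    df.to_csv(csv_path)

    # Confirm that the file was actually written.
    file_exists = os.path.exists(csv_path)

    # index_col=0 restores the row labels as the index.
    df_roundtrip = pd.read_csv(csv_path, index_col=0)

    # Without index_col, the labels come back as an ordinary column,
    # which pandas names "Unnamed: 0" because its header cell is empty.
    df_flat = pd.read_csv(csv_path)

print(file_exists)
print(df_roundtrip.index.tolist())
print(df_flat.columns.tolist())
```

Reading the file back without `index_col=0` is a common source of a stray `Unnamed: 0` column.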

## Indexing Data Frames
**To access the elements:**
1. `.loc[]`: selects by row and column labels
2. `.iloc[]`: selects by integer position
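
The difference between the two accessors can be sketched as follows. This is a minimal example that rebuilds the same `df` so it runs on its own; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 20).reshape(5, 4),
                  index=["R1", "R2", "R3", "R4", "R5"],
                  columns=["C1", "C2", "C3", "C4"])

# .loc selects by label: row "R2", column "C3".
by_label = df.loc["R2", "C3"]

# .iloc selects by integer position: second row, third column.
by_position = df.iloc[1, 2]

# Label slices include the end label; position slices exclude the end.
label_slice = df.loc["R1":"R3", "C1":"C2"]   # rows R1..R3, columns C1..C2
position_slice = df.iloc[0:3, 0:2]           # rows 0..2, columns 0..1

print(by_label, by_position)                 # both are 6
print(label_slice.equals(position_slice))    # True
```

Note the slicing asymmetry: `.loc["R1":"R3"]` includes `R3`, while `.iloc[0:3]` stops before position 3.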

In [7]:
df

    C1  C2  C3  C4
R1   0   1   2   3
R2   4   5   6   7
R3   8   9  10  11
R4  12  13  14  15
R5  16  17  18  19


In [10]:
df.loc["R1"]

C1    0
C2    1
C3    2
C4    3
Name: R1, dtype: int32

In [11]:
type(df.loc["R1"])

pandas.core.series.Series

In [14]:
df.iloc[:,1:]

    C2  C3  C4
R1   1   2   3
R2   5   6   7
R3   9  10  11
R4  13  14  15
R5  17  18  19


In [15]:
type(df.iloc[:, 1:])

pandas.core.frame.DataFrame

In [20]:
df.iloc[1, :]

C1    4
C2    5
C3    6
C4    7
Name: R2, dtype: int32

In [17]:
type(df.iloc[1, :])

pandas.core.series.Series

### Important Note:
One behavior of Data Frames can be counterintuitive: with `.iloc`, the slice `1:2` and the integer `1` select the same column, yet they return different types. The slice returns a Data Frame with one column, whereas the integer returns a Series.

In [21]:
df.iloc[:,1:2]

    C2
R1   1
R2   5
R3   9
R4  13
R5  17


In [23]:
type(df.iloc[:,1:2])

pandas.core.frame.DataFrame

In [22]:
df.iloc[:,1]

R1     1
R2     5
R3     9
R4    13
R5    17
Name: C2, dtype: int32

In [24]:
type(df.iloc[:,1])

pandas.core.series.Series
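
When one specific type is needed regardless of how the selection was written, the two forms can be converted into each other. The following is a minimal sketch, rebuilding the same `df` so it is self-contained. It also shows that passing a list such as `[1]` is another way to keep the Data Frame shape.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 20).reshape(5, 4),
                  index=["R1", "R2", "R3", "R4", "R5"],
                  columns=["C1", "C2", "C3", "C4"])

as_frame = df.iloc[:, 1:2]     # slice   -> DataFrame with one column
as_series = df.iloc[:, 1]      # integer -> Series
list_frame = df.iloc[:, [1]]   # list    -> DataFrame with one column

# Convert a one-column DataFrame to a Series, and a Series to a DataFrame.
frame_to_series = as_frame.squeeze("columns")
series_to_frame = as_series.to_frame()

print(type(list_frame).__name__)
print(type(frame_to_series).__name__)
print(type(series_to_frame).__name__)
```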

**Important**: A Data Frame can also be converted to a NumPy array with the `.values` attribute:

In [26]:
df.iloc[:,1:].values

array([[ 1,  2,  3],
       [ 5,  6,  7],
       [ 9, 10, 11],
       [13, 14, 15],
       [17, 18, 19]])

In [27]:
df.iloc[:,1:].values.shape

(5, 3)
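
The pandas documentation recommends the `to_numpy()` method over the `.values` attribute for this conversion. The sketch below assumes the same `df` as above and compares the two; for this all-integer frame they produce the same array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 20).reshape(5, 4),
                  index=["R1", "R2", "R3", "R4", "R5"],
                  columns=["C1", "C2", "C3", "C4"])

arr_values = df.iloc[:, 1:].values      # attribute form used above
arr_numpy = df.iloc[:, 1:].to_numpy()   # method form

# Both are plain NumPy arrays; the row and column labels are dropped.
print(type(arr_numpy).__name__)
print(arr_numpy.shape)
print(np.array_equal(arr_values, arr_numpy))
```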

In [28]:
df["C1"].value_counts()

0     1
8     1
4     1
16    1
12    1
Name: C1, dtype: int64

## How to check for null or NA values

In [30]:
df

    C1  C2  C3  C4
R1   0   1   2   3
R2   4   5   6   7
R3   8   9  10  11
R4  12  13  14  15
R5  16  17  18  19


In [29]:
df.isnull().sum()

C1    0
C2    0
C3    0
C4    0
dtype: int64
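
Because `df` has no missing values, every count above is zero. To see the check report something, the sketch below works on a hypothetical copy, `df_missing`, with three values deliberately set to `NaN`. The copy and the positions of the missing values are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 20).reshape(5, 4),
                  index=["R1", "R2", "R3", "R4", "R5"],
                  columns=["C1", "C2", "C3", "C4"])

# Hypothetical copy with missing values; cast to float so NaN can be stored.
df_missing = df.astype("float64")
df_missing.loc["R2", "C1"] = np.nan
df_missing.loc["R4", "C1"] = np.nan
df_missing.loc["R3", "C3"] = np.nan

# Count missing values per column, then in total.
nulls_per_column = df_missing.isnull().sum()
nulls_total = int(nulls_per_column.sum())

# Keep only the rows that contain at least one missing value.
rows_with_nulls = df_missing[df_missing.isnull().any(axis=1)]

print(nulls_per_column)
print(nulls_total)
print(rows_with_nulls.index.tolist())
```

`isna()` is an alias of `isnull()`, so either name can be used.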

In [31]:
df["C1"].value_counts()

0     1
8     1
4     1
16    1
12    1
Name: C1, dtype: int64

In [32]:
df["C1"].unique()

array([ 0,  4,  8, 12, 16])

### Subset multiple columns with column names

In [34]:
df[["C1", "C2"]]

    C1  C2
R1   0   1
R2   4   5
R3   8   9
R4  12  13
R5  16  17
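
The double brackets matter: the inner pair is a Python list of column names. The sketch below rebuilds the same `df` to stay self-contained, and contrasts single and double brackets. It also shows an equivalent `.loc` form.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 20).reshape(5, 4),
                  index=["R1", "R2", "R3", "R4", "R5"],
                  columns=["C1", "C2", "C3", "C4"])

single = df["C1"]            # one name          -> Series
single_frame = df[["C1"]]    # list with one name -> DataFrame
subset = df[["C1", "C2"]]    # list of names      -> DataFrame

# The same selection written with .loc: all rows, listed columns.
subset_loc = df.loc[:, ["C1", "C2"]]

# The list order determines the column order of the result.
reordered = df[["C4", "C1"]]

print(type(single).__name__, type(single_frame).__name__)
print(subset.equals(subset_loc))
print(reordered.columns.tolist())
```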
