# Pandas

Pandas is a newer package built on top of NumPy that provides efficient implementations of the DataFrame and Series objects. A DataFrame is essentially a two-dimensional array with attached row and column labels, often holding heterogeneous types and/or missing data, while a Series is a one-dimensional labeled array.

In [70]:
import pandas as pd
import numpy as np

## 1. Series Objects

A Series is analogous to a one-dimensional NumPy array of indexed data.

In [71]:
data = pd.Series([2,5,6,5,8])
data

0    2
1    5
2    6
3    5
4    8
dtype: int64

In [72]:
data.values     #returns all the data values in the Series

array([2, 5, 6, 5, 8], dtype=int64)

In [73]:
data.index      #returns indices

RangeIndex(start=0, stop=5, step=1)

In [74]:
data2 = pd.Series([1,5,6,9],index = ['a','b','c','d'])
data2

a    1
b    5
c    6
d    9
dtype: int64

In [75]:
data2.iloc[2]   # position-based access; plain data2[2] on a label index is deprecated in recent pandas

6
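Because `data2` has string labels, a value can be reached either by its label or by its position. The following is a minimal sketch of both routes, using the standard `.loc` (label-based) and `.iloc` (position-based) accessors; the variable names are illustrative only.

```python
import pandas as pd

data2 = pd.Series([1, 5, 6, 9], index=['a', 'b', 'c', 'd'])

by_label = data2['c']              # dictionary-style lookup by label
by_label_loc = data2.loc['c']      # explicit label-based lookup
by_position = data2.iloc[2]        # explicit position-based lookup
label_slice = data2.loc['b':'c']   # label slices include the end label

print(by_label, by_label_loc, by_position)
print(label_slice)
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity about whether an integer key means a label or a position.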

In [76]:
# A Series is a bit like a specialization of a Python dictionary.
dic = {"a":0,
      "b":1,
      "d":5,
      "c":2}
dic

{'a': 0, 'b': 1, 'd': 5, 'c': 2}

In [77]:
dic_series = pd.Series(dic)
dic_series

a    0
b    1
d    5
c    2
dtype: int64
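The dictionary analogy goes a little further than construction. As a rough sketch, a Series built from a dictionary supports membership tests, `keys()`, and assignment to new labels, and, unlike a plain dictionary, label-based slicing. The label `'e'` below is a made-up example value.

```python
import pandas as pd

dic_series = pd.Series({"a": 0, "b": 1, "d": 5, "c": 2})

has_a = 'a' in dic_series            # membership tests check the index labels
keys = list(dic_series.keys())       # keys() returns the index
sliced = dic_series.loc['b':'d']     # label slicing, which a dict cannot do

dic_series['e'] = 7                  # assigning to a new label extends the Series

print(has_a, keys)
print(sliced)
print(dic_series)
```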

### Constructing Series objects

- pd.Series(data, index=index)

In [78]:
# 1. data is a list or NumPy array
s = pd.Series([1,2,3,4,5,6])
s

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

In [79]:
# 2. data is dictionary
s = pd.Series(dic)
s

a    0
b    1
d    5
c    2
dtype: int64

In [80]:
# 3. data is scalar value
s = pd.Series(10, index = [1,5,6])
s

1    10
5    10
6    10
dtype: int64

In [81]:
# Selecting particular keys from a dictionary by passing an explicit index
s = pd.Series(dic, index =['a','c'])
s

a    0
c    2
dtype: int64
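When an explicit index is passed along with a dictionary, the index decides both which keys are kept and in what order. A small sketch of what happens when a requested label is absent from the dictionary (the label `'z'` is a hypothetical example): pandas fills the slot with `NaN`, which also turns the result into floating point.

```python
import pandas as pd

dic = {"a": 0, "b": 1, "d": 5, "c": 2}

# 'z' is not a key of dic, so its value is filled with NaN
s = pd.Series(dic, index=['c', 'a', 'z'])

print(s)
print(s.dtype)
```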

## 2. DataFrame Objects

A DataFrame is analogous to a 2D NumPy array with both flexible row indices and flexible column names.

In [82]:
age = pd.Series({'john':22, 'mary':20, 'zing':21, 'amar':19})
mark = pd.Series({'john':520, 'mary':550, 'zing':521, 'amar':589})

# A DataFrame is a collection of 1D Series that share the same index
df = pd.DataFrame({'age': age, 'mark': mark})
df

|      | age | mark |
|------|-----|------|
| john | 22  | 520  |
| mary | 20  | 550  |
| zing | 21  | 521  |
| amar | 19  | 589  |


In [83]:
df.columns

Index(['age', 'mark'], dtype='object')
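Since a DataFrame is a collection of aligned Series, each column can be pulled back out as a Series, and individual cells can be addressed by row and column label. A brief sketch, assuming the same `age` and `mark` data as above:

```python
import pandas as pd

age = pd.Series({'john': 22, 'mary': 20, 'zing': 21, 'amar': 19})
mark = pd.Series({'john': 520, 'mary': 550, 'zing': 521, 'amar': 589})
df = pd.DataFrame({'age': age, 'mark': mark})

age_column = df['age']               # a column is a Series sharing the row index
row_labels = list(df.index)          # row labels come from the Series indices
mary_mark = df.loc['mary', 'mark']   # single cell by row label and column name

print(type(age_column).__name__)
print(row_labels)
print(mary_mark)
```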

### Constructing DataFrame objects

- pd.DataFrame(data, index=index, columns=columns)

In [84]:
# 1. data is a single Series object
df = pd.DataFrame(age,columns=['age'])
df

|      | age |
|------|-----|
| john | 22  |
| mary | 20  |
| zing | 21  |
| amar | 19  |


In [85]:
# 2. data is a dictionary (scalar values require an explicit index)
dic = {'a':1, 'c':5, 'h':6}
df = pd.DataFrame(dic, index=[0])
df

|   | a | c | h |
|---|---|---|---|
| 0 | 1 | 5 | 6 |


In [86]:
# 3. data is a 2D NumPy array
pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])

|   | foo      | bar      |
|---|----------|----------|
| a | 0.923901 | 0.972205 |
| b | 0.24253  | 0.207385 |
| c | 0.581826 | 0.82245  |
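One more construction route that may be worth knowing is a list of dictionaries, where each dictionary becomes a row. The sketch below uses a hypothetical `records` list; keys missing from a given row are filled with `NaN`.

```python
import pandas as pd

# Hypothetical example data: each dict is one row
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
df = pd.DataFrame(records)

print(df)
print(df.isna().sum())   # number of missing values per column
```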


## 3. Index Objects

Both Series and DataFrame carry an explicit Index object that is used to reference and modify the data values. An Index can be thought of as either an immutable array or an ordered set (strictly speaking a multiset, since an Index may contain repeated values).

In [87]:
i = pd.Index([1,2,8,3,8,9])
i

Int64Index([1, 2, 8, 3, 8, 9], dtype='int64')

#### Index as immutable array
The Index object in many ways operates like an array. For example, we can use standard Python indexing notation to retrieve values or slices.

In [88]:
i[5]

9

In [89]:
i[2:]

Int64Index([8, 3, 8, 9], dtype='int64')

In [90]:
i[::]

Int64Index([1, 2, 8, 3, 8, 9], dtype='int64')

One key difference between a NumPy array and a pandas Index is that the Index is immutable: its values cannot be modified in place.

In [91]:
index = pd.Index([1,2,3])
arr = np.array([1,2,3])

index, arr

(Int64Index([1, 2, 3], dtype='int64'), array([1, 2, 3]))

In [92]:
arr[2] = 5

In [93]:
index[2] = 5          #error

TypeError: Index does not support mutable operations
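Because in-place assignment is rejected, changing an Index means building a new one. The following sketch catches the `TypeError` and then derives a modified copy with the `Index.delete` and `Index.insert` methods; the names `mutated` and `new_index` are illustrative.

```python
import pandas as pd

index = pd.Index([1, 2, 3])

try:
    index[2] = 5          # in-place assignment is not allowed
    mutated = True
except TypeError:
    mutated = False

# Build a new Index instead: drop position 2, then insert 5 at position 2
new_index = index.delete(2).insert(2, 5)

print(mutated)
print(list(index))        # the original Index is unchanged
print(list(new_index))
```

Immutability is what makes it safe to share one Index between several Series and DataFrames without side effects.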

#### Index as ordered set

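Index objects support set operations such as union, intersection, and symmetric difference. The operator forms shown here (`|`, `&`, `^`) reflect the pandas version used for this notebook; they were later deprecated, and as of pandas 2.0 they act as element-wise logical operations, so newer code should use the `union()`, `intersection()`, and `symmetric_difference()` methods instead.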

In [94]:
index1 = pd.Index([1,2,3,6,5,8])
index2 = pd.Index([5,9,8,1,2,8])

In [95]:
# UNION
index1 | index2

Int64Index([1, 2, 3, 5, 6, 8, 9], dtype='int64')

In [96]:
# INTERSECTION
index1 & index2

Int64Index([1, 2, 5, 8, 8], dtype='int64')

In [97]:
# SYMMETRIC DIFFERENCE (elements found in exactly one of the two indexes)
index1 ^ index2

Int64Index([3, 6, 9], dtype='int64')

In [98]:
# ELEMENT-WISE DIFFERENCE (arithmetic subtraction, not a set operation)
index1 - index2

Int64Index([-4, -7, -5, 5, 3, 0], dtype='int64')
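The same set operations are also available as explicit methods, which behave consistently across pandas versions. A minimal sketch follows; it assumes two indexes without repeated values (the second one drops the duplicate `8` used above), since the handling of duplicates in set operations has varied between releases.

```python
import pandas as pd

index1 = pd.Index([1, 2, 3, 6, 5, 8])
index2 = pd.Index([5, 9, 8, 1, 2])    # assumed variant without the repeated 8

union = index1.union(index2)                      # labels in either index
intersection = index1.intersection(index2)        # labels in both indexes
sym_diff = index1.symmetric_difference(index2)    # labels in exactly one index
diff = index1.difference(index2)                  # labels in index1 but not index2

print(sorted(union))
print(sorted(intersection))
print(sorted(sym_diff))
print(sorted(diff))
```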