# Pandas 

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It has three basic data structures:
* Series
* DataFrame
* Panel

In [44]:
# importing pandas and numpy
import pandas as pd
import numpy as np

## Data Structures

## 1. Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.
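As a minimal sketch of the definition above, the following builds a Series from a dict and looks a value up by its index label. The names `prices` and `mixed` and their contents are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# A Series built from a dict: the keys become the index labels.
prices = pd.Series({"apple": 1.5, "banana": 0.25, "cherry": 4.0}, name="price")

# Label-based lookup through the index.
banana_price = prices["banana"]

# A Series can also hold arbitrary Python objects; the dtype is then `object`.
mixed = pd.Series([1, "two", 3.0], index=["x", "y", "z"])

print(prices.index.tolist())
print(banana_price)
print(mixed.dtype)
```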

In [45]:
# generating random data and index
data = np.random.randn(5)
index = ['a', 'b', 'c', 'd', 'e']

# pandas series 
series = pd.Series(data, index=index, name = "Example Series")
type(series)

pandas.core.series.Series

In [46]:
# inspecting the index and name of a pandas Series object
print(series.index)
print(series.name)

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Example Series


In [47]:
# Vectorized operations on series
series + series 

a   -2.803834
b    0.111187
c   -0.043370
d    6.386727
e    3.875181
Name: Example Series, dtype: float64

In [48]:
series * series 

a     1.965372
b     0.003091
c     0.000470
d    10.197569
e     3.754257
Name: Example Series, dtype: float64

In [49]:
np.exp(series)

a     0.246125
b     1.057168
c     0.978549
d    24.370255
e     6.942004
Name: Example Series, dtype: float64

## 2. DataFrame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input.

* Dict of 1D ndarrays, lists, dicts, or Series
* 2-D numpy.ndarray
* A pandas Series
* Another DataFrame
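The input kinds listed above can be sketched roughly as follows. The variable names and sample values are made up for illustration; only the `pd.DataFrame` constructor itself comes from pandas.

```python
import numpy as np
import pandas as pd

# 1. Dict of lists: the keys become column names.
from_dict = pd.DataFrame({"col1": [1, 2, 3], "col2": [4.0, 5.0, 6.0]})

# 2. 2-D NumPy array: column names are supplied explicitly.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["col1", "col2"])

# 3. A single Series: becomes a one-column DataFrame named after the Series.
from_series = pd.DataFrame(pd.Series([10, 20, 30], name="col1"))

# 4. Another DataFrame: copy=True requests an independent copy.
from_frame = pd.DataFrame(from_dict, copy=True)

for frame in (from_dict, from_array, from_series, from_frame):
    print(frame.shape, list(frame.columns))
```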

In [50]:
# defining the data for dataframe (index is an optional argument)
data = {
'col1' : pd.Series([1.0, 2.0, 3.0, 4.0], index=['a','b','c','d']),
'col2' : pd.Series([1.0, 2.0, 3.0, 4.0], index=['a','b','c','d']),
'col3' : pd.Series([1.0, 2.0, 3.0, 4.0], index=['a','b','c','d']),
}

In [51]:
# creating dataframe
df = pd.DataFrame(data)
print(df)
type(df)

   col1  col2  col3
a   1.0   1.0   1.0
b   2.0   2.0   2.0
c   3.0   3.0   3.0
d   4.0   4.0   4.0


pandas.core.frame.DataFrame

In [52]:
# another way of defining a dataframe (index is an optional argument)
data = np.random.random(4)
index = ['a','b','c','d']
col_name = ['col1']

In [53]:
# creating dataframe
df1 = pd.DataFrame(data,index=index,columns=col_name)
print(df1)
type(df1)

       col1
a  0.283012
b  0.049261
c  0.885213
d  0.162878


pandas.core.frame.DataFrame

In [54]:
# Column selection, addition, deletion
df['col1']

a    1.0
b    2.0
c    3.0
d    4.0
Name: col1, dtype: float64

In [55]:
# attribute-style access selects the same column
df.col1

a    1.0
b    2.0
c    3.0
d    4.0
Name: col1, dtype: float64

In [56]:
# performing mathematical operation on dataframe
df3 = df * df
print(df3)

   col1  col2  col3
a   1.0   1.0   1.0
b   4.0   4.0   4.0
c   9.0   9.0   9.0
d  16.0  16.0  16.0


In [57]:
# performing logical operation on dataframe
df['flag'] = df['col1'] > 1
print(df)

   col1  col2  col3   flag
a   1.0   1.0   1.0  False
b   2.0   2.0   2.0   True
c   3.0   3.0   3.0   True
d   4.0   4.0   4.0   True


In [58]:
# delete a column
del df['flag']
print(df)

   col1  col2  col3
a   1.0   1.0   1.0
b   2.0   2.0   2.0
c   3.0   3.0   3.0
d   4.0   4.0   4.0
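The cells above add and delete a column in place. As a hedged alternative sketch, `assign` and `drop` return new DataFrames and leave the original untouched. The `frame` variable here is a stand-in for `df`, rebuilt so the block runs on its own.

```python
import pandas as pd

frame = pd.DataFrame({"col1": [1.0, 2.0, 3.0, 4.0]}, index=["a", "b", "c", "d"])

# assign() returns a new DataFrame that includes the extra column.
with_flag = frame.assign(flag=frame["col1"] > 1)

# drop(columns=...) returns a new DataFrame without that column.
without_flag = with_flag.drop(columns=["flag"])

print(list(frame.columns), list(with_flag.columns), list(without_flag.columns))
```

Which style to prefer is a design choice: in-place edits are shorter, while the non-mutating forms make it easier to keep the original data around.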


In [59]:
# select row by label name
df.loc['b']

col1    2.0
col2    2.0
col3    2.0
Name: b, dtype: float64

In [60]:
# select row by integer position
df.iloc[0]

col1    1.0
col2    1.0
col3    1.0
Name: a, dtype: float64
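The difference between `.loc` and `.iloc` is easiest to see when the index labels are themselves integers. The sketch below uses a small made-up frame whose labels are deliberately out of order, so label and position disagree.

```python
import pandas as pd

# Integer labels deliberately out of order.
frame = pd.DataFrame({"value": [10.0, 20.0, 30.0]}, index=[2, 0, 1])

by_label = frame.loc[0]      # the row whose index label is 0
by_position = frame.iloc[0]  # the first row, whatever its label

print(by_label["value"], by_position["value"])
```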

In [61]:
# Transpose the dataframe
df.T

        a    b    c    d
col1  1.0  2.0  3.0  4.0
col2  1.0  2.0  3.0  4.0
col3  1.0  2.0  3.0  4.0


In [62]:
# display the basic information about dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 3 columns):
col1    4 non-null float64
col2    4 non-null float64
col3    4 non-null float64
dtypes: float64(3)
memory usage: 288.0+ bytes


## 3. Panel

Panel is a somewhat less-used, but still important container for 3-dimensional data. The term panel data is derived from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s.

> Panel was deprecated in pandas 0.20.0 and removed in 0.25.0.
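Since Panel is deprecated, one commonly suggested way to hold 3-dimensional data is a DataFrame with a two-level `MultiIndex` on its rows. The sketch below is only one possible layout; the axis names (`item`, `date`) and the data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-D data: 2 items x 3 dates x 2 measurements.
cube = np.arange(12, dtype=float).reshape(2, 3, 2)
items = ["item1", "item2"]
dates = pd.date_range("2020-01-01", periods=3)
measures = ["open", "close"]

# Fold the first two axes into a two-level row index.
row_index = pd.MultiIndex.from_product([items, dates], names=["item", "date"])
stacked = pd.DataFrame(cube.reshape(-1, 2), index=row_index, columns=measures)

# Selecting one item gives back an ordinary 2-D DataFrame indexed by date.
item1 = stacked.loc["item1"]
print(stacked.shape, item1.shape)
```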

## Essential Basic Functionality

In [63]:
# to view first n values of pandas data structure (default = 5)
df.head(3)

   col1  col2  col3
a   1.0   1.0   1.0
b   2.0   2.0   2.0
c   3.0   3.0   3.0


In [64]:
# to view last n values of pandas data structure (default = 5)
df.tail(2)

   col1  col2  col3
c   3.0   3.0   3.0
d   4.0   4.0   4.0


In [65]:
# return the shape of the dataframe
df.shape

(4, 3)

In [66]:
# return the column's values as a NumPy array
df['col1'].values

array([ 1.,  2.,  3.,  4.])
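Note that `.values` gives a NumPy array, not a Python list. As a small sketch (the `column` Series is a stand-in for `df['col1']`), `to_numpy()` is the method the pandas documentation recommends for getting the array, and `tolist()` gives an actual list.

```python
import numpy as np
import pandas as pd

column = pd.Series([1.0, 2.0, 3.0, 4.0], name="col1")

as_array = column.to_numpy()  # NumPy array
as_list = column.tolist()     # plain Python list

print(type(as_array).__name__, type(as_list).__name__)
```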

In [67]:
# element-wise comparison returns a boolean dataframe
df > 1

    col1   col2   col3
a  False  False  False
b   True   True   True
c   True   True   True
d   True   True   True


In [68]:
# Boolean reductions: .any() reduces each column to a single bool (.all() works the same way)
(df > 1).any()

col1    True
col2    True
col3    True
dtype: bool

In [69]:
# chaining a second .any() reduces the whole dataframe to one bool
(df > 1).any().any()

True
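To make the contrast between the two reductions concrete, here is a short sketch on a made-up two-column frame where `.any()` and `.all()` give different answers.

```python
import pandas as pd

frame = pd.DataFrame({"col1": [1.0, 2.0], "col2": [3.0, 4.0]})
mask = frame > 1

any_per_column = mask.any()  # True if at least one value in the column is True
all_per_column = mask.all()  # True only if every value in the column is True
all_overall = mask.all().all()

print(any_per_column.tolist())
print(all_per_column.tolist())
print(all_overall)
```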