<a href="https://colab.research.google.com/github/lblogan14/python_data_analysis/blob/master/Chapter4_Pandas_Intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The pandas package is built on top of the NumPy library. It introduces two new data structures: the Series and the DataFrame.

**Install pandas from Anaconda**

    conda list pandas
    conda install pandas
**Install pandas from PyPI**

    pip install pandas

**Install on Linux**

    sudo apt-get install python-pandas

In [0]:
'''Import necessary packages'''
import pandas as pd
import numpy as np

##Introduction to pandas Data Structures##

###The Series###

The Series is designed to represent one-dimensional data and is composed of two arrays associated with each other.

The main array holds the data, and each of its elements is associated with a label held in the other array, called the index.

####Series() constructor####

In [3]:
s = pd.Series([12, -4.7, 7, 9])
s #['index', 'value']

0    12.0
1    -4.7
2     7.0
3     9.0
dtype: float64

Create a series with meaningful labels by passing the **index** option

In [4]:
s = pd.Series([12, -4.7, 7, 9], index=['a','b','c','d'])
s

a    12.0
b    -4.7
c     7.0
d     9.0
dtype: float64

You can individually see the two arrays by using the index and the values attributes.

In [5]:
s.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [6]:
s.values

array([12. , -4.7,  7. ,  9. ])

####Selecting the Internal Elements####

Select a single element by its position, as with a NumPy array,

In [7]:
s[1]

-4.7

or the label corresponding to the same position

In [8]:
s['b']

-4.7

Select multiple items:

In [9]:
s[0:2]

a    12.0
b    -4.7
dtype: float64

In [10]:
s[['b','c']] #specify a list of labels in an array

b   -4.7
c    7.0
dtype: float64
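The two access styles above can be made explicit with the **iloc** (position) and **loc** (label) indexers, which remove any ambiguity when the labels are themselves integers. A minimal sketch, rebuilding the same series under the assumed name `s_demo` so that it runs on its own:

```python
import pandas as pd

# Same data as the series above, rebuilt so the example is self-contained
s_demo = pd.Series([12, -4.7, 7, 9], index=['a', 'b', 'c', 'd'])

by_position = s_demo.iloc[1]      # second element, by integer position
by_label = s_demo.loc['b']        # same element, by label

first_two = s_demo.iloc[0:2]      # positional slice: the end is excluded
a_to_b = s_demo.loc['a':'b']      # label slice: the end is included

print(by_position, by_label)
print(first_two.equals(a_to_b))
```

Note the different slice semantics: a positional slice stops before its end, while a label slice includes it.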

####Assigning Values####

Assign a new value by selecting the element via its position or its label

In [11]:
s[1] = 0
s

a    12.0
b     0.0
c     7.0
d     9.0
dtype: float64

In [12]:
s['b'] = 1
s

a    12.0
b     1.0
c     7.0
d     9.0
dtype: float64

####Defining a Series from others####

In [13]:
s4 = pd.Series(s)
s4

a    12.0
b     1.0
c     7.0
d     9.0
dtype: float64

In [14]:
arr = np.array([1,2,3,4])
s3 = pd.Series(arr)
s3

0    1
1    2
2    3
3    4
dtype: int64

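Note that the array is not copied here: the Series refers to the same data, so a change in arr also changes the elements of s3 (newer pandas versions with copy-on-write enabled may copy the array instead):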

In [15]:
arr[2] = -2
s3

0    1
1    2
2   -2
3    4
dtype: int64
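If the series should not follow later changes to the array, one option is to copy the array explicitly before building the series. A small sketch under that assumption (the names `arr_demo` and `s_independent` are made up for this example):

```python
import numpy as np
import pandas as pd

arr_demo = np.array([1, 2, 3, 4])

# Copy the array explicitly so the series owns its own data
s_independent = pd.Series(arr_demo.copy())

arr_demo[2] = -2                  # modify the original array afterwards

print(arr_demo)
print(s_independent.tolist())     # the series still holds the original values
```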

####Filtering####

In [16]:
s[s > 8]

a    12.0
d     9.0
dtype: float64

####Operations####

In [17]:
s / 2

a    6.0
b    0.5
c    3.5
d    4.5
dtype: float64

In [18]:
np.log(s)

a    2.484907
b    0.000000
c    1.945910
d    2.197225
dtype: float64

####Evaluating####

The **unique()** function returns the distinct values of the series, with duplicates removed

In [19]:
serd = pd.Series([1,0,2,1,2,3], index=['white', 'white', 'blue', 'green', 'green', 'yellow'])
serd

white     1
white     0
blue      2
green     1
green     2
yellow    3
dtype: int64

In [20]:
serd.unique()

array([1, 0, 2, 3])

The **value_counts()** function returns the unique values together with the number of times each one occurs in the series

In [21]:
serd.value_counts()

2    2
1    2
3    1
0    1
dtype: int64

The **isin()** function returns a Boolean for each element, telling whether it is contained in the given list of values

In [22]:
serd.isin([0,3]) # True where the value is 0 or 3

white     False
white      True
blue      False
green     False
green     False
yellow     True
dtype: bool

You can use this to filter data

In [23]:
serd[serd.isin([0,3])]

white     0
yellow    3
dtype: int64
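The three functions above combine naturally. The sketch below rebuilds the same data under the assumed name `serd_demo`, orders the counts by value so the result does not depend on how ties are ordered, and uses `~` to invert the membership mask:

```python
import pandas as pd

serd_demo = pd.Series([1, 0, 2, 1, 2, 3],
                      index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

distinct = serd_demo.unique()                    # distinct values, in order of first appearance
counts = serd_demo.value_counts().sort_index()   # occurrences, ordered by value
mask = serd_demo.isin([0, 3])                    # True where the value is 0 or 3

print(distinct)
print(counts)
print(serd_demo[~mask])                          # keep everything except 0 and 3
```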

####NaN Values####

To define a missing value, use **np.nan** (the **np.NaN** alias was removed in NumPy 2.0)

In [24]:
s2 = pd.Series([5, -3, np.nan, 14])
s2

0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64

Use **isnull()** and **notnull()** to identify the elements with and without a missing value

In [25]:
s2.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [26]:
s2.notnull()

0     True
1     True
2    False
3     True
dtype: bool

This can be used to filter data

In [27]:
s2[s2.notnull()]

0     5.0
1    -3.0
3    14.0
dtype: float64

In [28]:
s2[s2.isnull()]

2   NaN
dtype: float64

####Series as Dictionary####

In [29]:
mydict = {'red':2000, 'blue':1000, 'yellow':500, 'orange':1000}
myseries = pd.Series(mydict)
myseries

blue      1000
orange    1000
red       2000
yellow     500
dtype: int64

To define the index labels separately, pass them through the **index** option; labels that are missing from the dict get **NaN**

In [30]:
colors = ['red','yellow','orange','blue','green']
myseries = pd.Series(mydict, index=colors)
myseries

red       2000.0
yellow     500.0
orange    1000.0
blue      1000.0
green        NaN
dtype: float64

####Operations between Series####

Values with the same label are added together; a label present in only one of the two series yields **NaN**

In [31]:
mydict2 = {'red':400, 'yellow':1000, 'black':700}
myseries2 = pd.Series(mydict2)
myseries + myseries2

black        NaN
blue         NaN
green        NaN
orange       NaN
red       2400.0
yellow    1500.0
dtype: float64
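When a missing label should count as zero rather than produce **NaN**, the **add()** method accepts a **fill_value** option. A minimal sketch with the same dictionaries, under the assumed names `stock_a` and `stock_b`:

```python
import pandas as pd

stock_a = pd.Series({'red': 2000, 'blue': 1000, 'yellow': 500, 'orange': 1000})
stock_b = pd.Series({'red': 400, 'yellow': 1000, 'black': 700})

# Labels present in only one series are treated as 0 instead of producing NaN
total = stock_a.add(stock_b, fill_value=0)

print(total)
```

A value that is already **NaN** in one series and absent from the other still gives **NaN**, since there is nothing to add the fill value to.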

###The DataFrame###

The DataFrame is a tabular data structure similar to a spreadsheet.

The dataframe has two index arrays. The first is associated with the rows: each of its labels is associated with all the values in that row. The second contains a series of labels, each associated with a particular column.

The dataframe can be understood as a dict of series, where the keys are the column names and the values are the series that will form the columns of the dataframe.

####Defining a DataFrame####

Pass a dict object to the **DataFrame()** constructor

In [32]:
data = {'color': ['blue','green','yellow','red','white'],
       'object': ['ball','pen','pencil','paper','mug'],
       'price': [1.2, 1.0, 0.6, 0.9, 1.7]}
frame = pd.DataFrame(data)
frame

    color  object  price
0    blue    ball    1.2
1   green     pen    1.0
2  yellow  pencil    0.6
3     red   paper    0.9
4   white     mug    1.7


You can select a sequence of columns by using the **columns** option

In [33]:
frame2 = pd.DataFrame(data, columns=['object','price'])
frame2

   object  price
0    ball    1.2
1     pen    1.0
2  pencil    0.6
3   paper    0.9
4     mug    1.7


You can assign the labels in the index array using the **index** option

In [34]:
frame2 = pd.DataFrame(data, index=['one','two','three','four','five'])
frame2

        color  object  price
one      blue    ball    1.2
two     green     pen    1.0
three  yellow  pencil    0.6
four      red   paper    0.9
five    white     mug    1.7


To create a dataframe without using a dict object, you can specify the labels using the **columns** and **index** options when assigning the data matrix

In [35]:
frame3 = pd.DataFrame(np.arange(16).reshape(4,4),
                     index = ['red','blue','yellow','white'],
                     columns = ['ball','pen','pencil','paper'])
frame3

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15


####Selecting Elements####

In [36]:
frame

Unnamed: 0,color,object,price
0,blue,ball,1.2
1,green,pen,1.0
2,yellow,pencil,0.6
3,red,paper,0.9
4,white,mug,1.7


Name of all the columns

In [37]:
frame.columns

Index(['color', 'object', 'price'], dtype='object')

List of indices

In [38]:
frame.index

RangeIndex(start=0, stop=5, step=1)

Entire dataset

In [39]:
frame.values

array([['blue', 'ball', 1.2],
       ['green', 'pen', 1.0],
       ['yellow', 'pencil', 0.6],
       ['red', 'paper', 0.9],
       ['white', 'mug', 1.7]], dtype=object)

Selecting the column of interest

In [40]:
frame['price']

0    1.2
1    1.0
2    0.6
3    0.9
4    1.7
Name: price, dtype: float64

In [41]:
frame.price

0    1.2
1    1.0
2    0.6
3    0.9
4    1.7
Name: price, dtype: float64

Selecting the rows of interest by using the **loc** attribute

In [42]:
frame.loc[2] #the row labeled 2, i.e. the third row

color     yellow
object    pencil
price        0.6
Name: 2, dtype: object

In [43]:
frame.loc[2:4] #label slice, end included: from the third to the fifth row

Unnamed: 0,color,object,price
2,yellow,pencil,0.6
3,red,paper,0.9
4,white,mug,1.7


In [44]:
frame[2:4] #positional slice, end excluded: the third and fourth rows

Unnamed: 0,color,object,price
2,yellow,pencil,0.6
3,red,paper,0.9


To return a single value, specify the column name and then the index label

In [45]:
frame['object'][3]

'paper'
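A single cell can also be reached in one step instead of chaining two selections. The sketch below, on a copy of the same data under the assumed name `frame_demo`, compares three equivalent lookups:

```python
import pandas as pd

frame_demo = pd.DataFrame({'color': ['blue', 'green', 'yellow', 'red', 'white'],
                           'object': ['ball', 'pen', 'pencil', 'paper', 'mug'],
                           'price': [1.2, 1.0, 0.6, 0.9, 1.7]})

by_label = frame_demo.loc[3, 'object']    # row label 3, column 'object'
by_scalar = frame_demo.at[3, 'object']    # same lookup, meant for a single cell
by_position = frame_demo.iloc[3, 1]       # fourth row, second column, by position

print(by_label, by_scalar, by_position)
```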

####Assigning Values####

The index and the columns can each be given a label through their **name** attribute

In [46]:
frame

Unnamed: 0,color,object,price
0,blue,ball,1.2
1,green,pen,1.0
2,yellow,pencil,0.6
3,red,paper,0.9
4,white,mug,1.7


In [47]:
frame.index.name = 'id'
frame.columns.name = 'item'
frame

item   color  object  price
id
0       blue    ball    1.2
1      green     pen    1.0
2     yellow  pencil    0.6
3        red   paper    0.9
4      white     mug    1.7


Add a new column

In [48]:
frame['new'] = 12
frame

item   color  object  price  new
id
0       blue    ball    1.2   12
1      green     pen    1.0   12
2     yellow  pencil    0.6   12
3        red   paper    0.9   12
4      white     mug    1.7   12


In [49]:
frame['new'] = [3.0, 1.3, 2.2, 0.8, 1.1]
frame

item   color  object  price  new
id
0       blue    ball    1.2  3.0
1      green     pen    1.0  1.3
2     yellow  pencil    0.6  2.2
3        red   paper    0.9  0.8
4      white     mug    1.7  1.1


In [50]:
ser = pd.Series(np.arange(5))
ser

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [51]:
frame['new'] = ser
frame

item   color  object  price  new
id
0       blue    ball    1.2    0
1      green     pen    1.0    1
2     yellow  pencil    0.6    2
3        red   paper    0.9    3
4      white     mug    1.7    4


Change a single value. Chained assignment such as **frame['price'][2] = 3.3** raises a SettingWithCopyWarning and may fail to modify the frame, so select the cell with **loc**, giving the row label and then the column name

In [52]:
frame.loc[2, 'price'] = 3.3
frame


item   color  object  price  new
id
0       blue    ball    1.2    0
1      green     pen    1.0    1
2     yellow  pencil    3.3    2
3        red   paper    0.9    3
4      white     mug    1.7    4


####Membership of a Value####

The **isin()** function also applies to a DataFrame

In [53]:
frame.isin([1.0, 'pen'])

item  color  object  price    new
id
0     False   False  False  False
1     False    True   True   True
2     False   False  False  False
3     False   False  False  False
4     False   False  False  False


In [54]:
frame[frame.isin([1.0, 'pen'])]

item color object  price  new
id
0      NaN    NaN    NaN  NaN
1      NaN    pen    1.0  1.0
2      NaN    NaN    NaN  NaN
3      NaN    NaN    NaN  NaN
4      NaN    NaN    NaN  NaN


####Deleting a Column####

In [55]:
del frame['new']
frame

item   color  object  price
id
0       blue    ball    1.2
1      green     pen    1.0
2     yellow  pencil    3.3
3        red   paper    0.9
4      white     mug    1.7


####Filtering####

In [56]:
frame[frame < 1.2]

item   color  object  price
id
0       blue    ball    NaN
1      green     pen    1.0
2     yellow  pencil    NaN
3        red   paper    0.9
4      white     mug    NaN
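The element-wise filter above keeps the shape of the frame and blanks out the cells that fail the test. A common alternative is to keep or drop whole rows based on one column. The sketch below uses the assumed name `frame_demo` with the prices as they stand at this point; it also avoids comparing the string columns with a number, which may raise a TypeError in recent pandas versions:

```python
import pandas as pd

frame_demo = pd.DataFrame({'color': ['blue', 'green', 'yellow', 'red', 'white'],
                           'object': ['ball', 'pen', 'pencil', 'paper', 'mug'],
                           'price': [1.2, 1.0, 3.3, 0.9, 1.7]})

# Boolean mask built from a single numeric column
is_cheap = frame_demo['price'] < 1.2

cheap = frame_demo[is_cheap]                             # whole rows where price < 1.2
cheap_items = frame_demo.loc[is_cheap, ['object', 'price']]   # same rows, selected columns

print(cheap)
print(cheap_items)
```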


####DataFrame from Nested dict####

In [57]:
nestdict = {'red': {2012:22, 2013:33},
           'white': {2011:12, 2012:22, 2013:16},
           'blue': {2011:17, 2012:27, 2013:18}}
frame2 = pd.DataFrame(nestdict)
frame2

      blue   red  white
2011    17   NaN     12
2012    27  22.0     22
2013    18  33.0     16


####Transposition of a DataFrame####

In [58]:
frame.T

id         0      1       2      3      4
item
color   blue  green  yellow    red  white
object  ball    pen  pencil  paper    mug
price    1.2      1     3.3    0.9    1.7


###The Index Objects###


In [59]:
ser = pd.Series([5,0,3,8,4], index=['red','blue','yellow','white','green'])
ser.index

Index(['red', 'blue', 'yellow', 'white', 'green'], dtype='object')

**idxmin()** returns the index with the lowest value

In [60]:
ser.idxmin()

'blue'

**idxmax()** returns the index with the highest value

In [61]:
ser.idxmax()

'white'

**is_unique** attribute tells you if there are indices with duplicate labels

In [62]:
serd.index.is_unique

False

In [63]:
frame.index.is_unique

True

###Other Functionalities on Indexes###

####Reindexing####

In [64]:
ser = pd.Series([2,5,7,4], index= ['one','two','three','four'])
ser

one      2
two      5
three    7
four     4
dtype: int64

The **reindex()** function will reindex in the order provided by the index array

In [65]:
ser.reindex(['three','one','four','two'])

three    7
one      2
four     4
two      5
dtype: int64

You can also use **reindex()** to drop labels and add new ones; a label that does not exist in the original series gets **NaN**

In [66]:
ser.reindex(['three','four','five','one'])

three    7.0
four     4.0
five     NaN
one      2.0
dtype: float64

Reindexing can also fill in the missing values automatically.

In [67]:
ser3 = pd.Series([1,5,6,3], index= [0,3,5,6])
ser3

0    1
3    5
5    6
6    3
dtype: int64

As you can see, the indexes 1, 2, and 4 are missing.

You can use **reindex()** with the **method** option set to **ffill**.

The **ffill** method fills each missing label with the value of the *nearest* preceding label in the original series

In [68]:
ser3.reindex(range(6), method= 'ffill')

0    1
1    1
2    1
3    5
4    5
5    6
dtype: int64

The **bfill** method fills each missing label with the value of the *nearest* following label in the original series

In [69]:
ser3.reindex(range(6), method= 'bfill')

0    1
1    5
2    5
3    5
4    6
5    6
dtype: int64
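Filling while reindexing can be compared with a two-step variant that reindexes first and then fills the resulting **NaN** values. A short sketch, using the assumed name `ser_gaps` for a copy of the series above:

```python
import pandas as pd

ser_gaps = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])

# One step: holes created by the new index are filled immediately
filled_direct = ser_gaps.reindex(range(7), method='ffill')

# Two steps: reindexing introduces NaN first, then ffill() propagates values forward
filled_two_step = ser_gaps.reindex(range(7)).ffill()

print(filled_direct.tolist())
print(filled_two_step.tolist())
```

The values agree, but the two-step result passes through **NaN** and therefore ends up as float64, while the one-step result keeps the integer dtype.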

Extend this idea to the DataFrame structure

In [70]:
frame

item   color  object  price
id
0       blue    ball    1.2
1      green     pen    1.0
2     yellow  pencil    3.3
3        red   paper    0.9
4      white     mug    1.7


In [71]:
frame.reindex(range(5), method= 'ffill', columns= ['color','price','new','object'])

item   color  price     new  object
id
0       blue    1.2    blue    ball
1      green    1.0   green     pen
2     yellow    3.3  yellow  pencil
3        red    0.9     red   paper
4      white    1.7   white     mug


####Dropping####

Use the **drop()** function to delete a row or a column; it returns a new object without the given labels

In [72]:
ser = pd.Series(np.arange(4.), index=['red','blue','yellow','white'])
ser

red       0.0
blue      1.0
yellow    2.0
white     3.0
dtype: float64

In [73]:
ser.drop('yellow')

red      0.0
blue     1.0
white    3.0
dtype: float64

In [74]:
ser.drop(['blue','white'])

red       0.0
yellow    2.0
dtype: float64

Extend the idea to DataFrame

In [75]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)), 
                     index=['red','blue','yellow','white'], 
                     columns=['ball','pen','pencil','paper'])
frame

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


To delete rows,

In [76]:
frame.drop(['blue','yellow'])

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
white,12,13,14,15


To delete columns, you need to set **axis=1**

In [77]:
frame.drop(['pen','pencil'], axis = 1)

Unnamed: 0,ball,paper
red,0,3
blue,4,7
yellow,8,11
white,12,15
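Rows and columns can also be dropped in a single call through the **index** and **columns** keywords of **drop()**. The sketch below, on the assumed copy `frame_demo`, also shows that the original frame is left untouched:

```python
import numpy as np
import pandas as pd

frame_demo = pd.DataFrame(np.arange(16).reshape((4, 4)),
                          index=['red', 'blue', 'yellow', 'white'],
                          columns=['ball', 'pen', 'pencil', 'paper'])

# Drop one row and two columns at once; a new frame is returned
trimmed = frame_demo.drop(index=['blue'], columns=['pen', 'pencil'])

print(trimmed)
print(frame_demo.shape)   # the original frame still has 4 rows and 4 columns
```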


####Alignment####

In [78]:
s1 = pd.Series([3,2,5,1], ['white','yellow','green','blue'])
s2 = pd.Series([1,4,7,2,1], ['white', 'yellow','black','blue','brown'])
s1 + s2

black     NaN
blue      3.0
brown     NaN
green     NaN
white     4.0
yellow    6.0
dtype: float64

The labels that appear in both series have their values added; all the others get the value **NaN**

In [79]:
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)), 
                      index=['red','blue','yellow','white'],
                      columns=['ball','pen','pencil','paper'])
frame1

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [80]:
frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
                      index=['blue','green','white','yellow'],
                      columns=['mug','pen','ball'])
frame2

Unnamed: 0,mug,pen,ball
blue,0,1,2
green,3,4,5
white,6,7,8
yellow,9,10,11


In [81]:
frame1 + frame2

        ball  mug  paper   pen  pencil
blue     6.0  NaN    NaN   6.0     NaN
green    NaN  NaN    NaN   NaN     NaN
red      NaN  NaN    NaN   NaN     NaN
white   20.0  NaN    NaN  20.0     NaN
yellow  19.0  NaN    NaN  19.0     NaN


###Operations between Data Structures###

####Flexible Arithmetic Methods####

    add()
    sub()
    div()
    mul()

In [82]:
frame1.add(frame2)

        ball  mug  paper   pen  pencil
blue     6.0  NaN    NaN   6.0     NaN
green    NaN  NaN    NaN   NaN     NaN
red      NaN  NaN    NaN   NaN     NaN
white   20.0  NaN    NaN  20.0     NaN
yellow  19.0  NaN    NaN  19.0     NaN
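Called as above, **add()** gives the same result as the **+** operator. The flexible methods become useful through their extra options, such as **fill_value**, which stands in for a value missing on one side. A sketch with the same two frames under the assumed names `frame_a` and `frame_b`:

```python
import numpy as np
import pandas as pd

frame_a = pd.DataFrame(np.arange(16).reshape((4, 4)),
                       index=['red', 'blue', 'yellow', 'white'],
                       columns=['ball', 'pen', 'pencil', 'paper'])
frame_b = pd.DataFrame(np.arange(12).reshape((4, 3)),
                       index=['blue', 'green', 'white', 'yellow'],
                       columns=['mug', 'pen', 'ball'])

# A cell missing in one frame is replaced by 0 before adding
total = frame_a.add(frame_b, fill_value=0)

print(total)
```

A cell that is missing in both frames, such as ('green', 'paper'), still comes out as **NaN**.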


####Operations between DataFrame and Series####

In [83]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=['red','blue','yellow','white'],
                     columns=['ball','pen','pencil','paper'])
frame

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [84]:
ser = pd.Series(np.arange(4), index=['ball','pen','pencil','paper'])
ser

ball      0
pen       1
pencil    2
paper     3
dtype: int64

In [85]:
frame - ser

Unnamed: 0,ball,pen,pencil,paper
red,0,0,0,0
blue,4,4,4,4
yellow,8,8,8,8
white,12,12,12,12


As you can see, each element of the series is subtracted from every value in the column with the matching label, regardless of the row index.

If a label is present in only one of the two data structures, **NaN** shows up for that label.

In [87]:
ser['mug'] = 9
ser

ball      0
pen       1
pencil    2
paper     3
mug       9
dtype: int64

In [88]:
frame - ser

        ball  mug  paper  pen  pencil
red        0  NaN      0    0       0
blue       4  NaN      4    4       4
yellow     8  NaN      8    8       8
white     12  NaN     12   12      12
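The **-** operator matches the series labels against the column labels. To match them against the row labels instead, the flexible **sub()** method takes **axis=0**. A minimal sketch on the assumed copy `frame_demo`, subtracting the 'ball' column from every column:

```python
import numpy as np
import pandas as pd

frame_demo = pd.DataFrame(np.arange(16).reshape((4, 4)),
                          index=['red', 'blue', 'yellow', 'white'],
                          columns=['ball', 'pen', 'pencil', 'paper'])

# The series frame_demo['ball'] is indexed by row label, so align on the rows
relative = frame_demo.sub(frame_demo['ball'], axis=0)

print(relative)
```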


###Function Application and Mapping###

####Functions by Element####

In [89]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=['red','blue','yellow','white'],
                     columns=['ball','pen','pencil','paper'])
frame

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [90]:
np.sqrt(frame)

Unnamed: 0,ball,pen,pencil,paper
red,0.0,1.0,1.414214,1.732051
blue,2.0,2.236068,2.44949,2.645751
yellow,2.828427,3.0,3.162278,3.316625
white,3.464102,3.605551,3.741657,3.872983


####Functions by Row or Column####

You can write your own function and apply it to a dataframe with the **apply()** function, which passes one column or one row at a time to it.

There are two ways to define such a function: as a **lambda** expression or with a traditional **def** statement.

In [0]:
f = lambda x: x.max() - x.min()

In [0]:
def f(x):
  return x.max() - x.min()

In [96]:
frame

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


By default, **apply()** passes each column to the function

In [95]:
frame.apply(f)

ball      12
pen       12
pencil    12
paper     12
dtype: int64

To apply the function to each row instead, add **axis=1**

In [97]:
frame.apply(f, axis=1)

red       3
blue      3
yellow    3
white     3
dtype: int64

You can also use the user-defined function to return a series:

In [98]:
def f(x):
  return pd.Series([x.min(), x.max()], index=['min','max'])

frame.apply(f)

Unnamed: 0,ball,pen,pencil,paper
min,0,1,2,3
max,12,13,14,15
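For built-in reductions such as minimum and maximum, the same table can be obtained without writing a function, by passing the names of the reductions to **agg()**. This is an alternative shown for comparison, on the assumed copy `frame_demo`:

```python
import numpy as np
import pandas as pd

frame_demo = pd.DataFrame(np.arange(16).reshape((4, 4)),
                          index=['red', 'blue', 'yellow', 'white'],
                          columns=['ball', 'pen', 'pencil', 'paper'])

# One row per requested reduction, one column per original column
summary = frame_demo.agg(['min', 'max'])

print(summary)
```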


####Statistics Functions####

**sum()**

In [103]:
frame.sum(axis=0)

ball      24
pen       28
pencil    32
paper     36
dtype: int64

**mean()**

In [102]:
frame.mean(axis=1)

red        1.5
blue       5.5
yellow     9.5
white     13.5
dtype: float64

**describe()** allows you to obtain summary statistics at once.

In [107]:
frame.describe()

Unnamed: 0,ball,pen,pencil,paper
count,4.0,4.0,4.0,4.0
mean,6.0,7.0,8.0,9.0
std,5.163978,5.163978,5.163978,5.163978
min,0.0,1.0,2.0,3.0
25%,3.0,4.0,5.0,6.0
50%,6.0,7.0,8.0,9.0
75%,9.0,10.0,11.0,12.0
max,12.0,13.0,14.0,15.0


###Sorting and Ranking###

The **sort_index()** function sorts the elements according to their index labels

In [108]:
ser = pd.Series([5,0,3,8,4], index=['red','blue','yellow','white','green'])
ser

red       5
blue      0
yellow    3
white     8
green     4
dtype: int64

In [109]:
ser.sort_index()

blue      0
green     4
red       5
white     8
yellow    3
dtype: int64

In [111]:
ser.sort_index(ascending= False)

yellow    3
white     8
red       5
green     4
blue      0
dtype: int64

For DataFrame, **sort_index()** will order by row, **sort_index(axis=1)** will order by column

In [112]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=['red','blue','yellow','white'],
                     columns=['ball','pen','pencil','paper'])
frame

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [113]:
frame.sort_index()

Unnamed: 0,ball,pen,pencil,paper
blue,4,5,6,7
red,0,1,2,3
white,12,13,14,15
yellow,8,9,10,11


In [114]:
frame.sort_index(axis=1)

Unnamed: 0,ball,paper,pen,pencil
red,0,3,1,2
blue,4,7,5,6
yellow,8,11,9,10
white,12,15,13,14


**sort_values()** will sort the values according to the values themselves

In [115]:
ser.sort_values()

blue      0
yellow    3
green     4
red       5
white     8
dtype: int64

To order the values in a dataframe, you also need to provide the name of the column on which to sort

In [116]:
frame.sort_values(by= 'pen') #sorted by the values in pen column

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [117]:
frame.sort_values(by= ['pen','pencil'])

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


**rank()** assigns a rank to each element of the series, starting from 1 for the lowest value

In [118]:
ser.rank()

red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64

In [119]:
ser.rank(method= 'first')

red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64

In [120]:
ser.rank(ascending=False)

red       2.0
blue      5.0
yellow    4.0
white     1.0
green     3.0
dtype: float64
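The series above has no repeated values, so the **method** option makes no visible difference there. It matters when values tie. A small sketch with invented data (`scores`) containing one tie:

```python
import pandas as pd

scores = pd.Series([7, 3, 7, 1], index=['a', 'b', 'c', 'd'])

avg = scores.rank()                   # default: tied values share the mean of their ranks
first = scores.rank(method='first')   # ties ranked in order of appearance
low = scores.rank(method='min')       # ties all get the lowest rank of the group

print(avg.tolist())
print(first.tolist())
print(low.tolist())
```

The two sevens occupy ranks 3 and 4, so they receive 3.5 each by default, 3 and 4 with 'first', and 3 each with 'min'.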

###Correlation and Covariance###

Correlation - **corr()**

Covariance - **cov()**

In [0]:
seq2 = pd.Series([3,4,3,4,5,4,3,2],['2006','2007','2008',
                                    '2009','2010','2011','2012','2013'])
seq = pd.Series([1,2,3,4,4,3,2,1],['2006','2007','2008',
                                   '2009','2010','2011','2012','2013'])

In [122]:
seq.corr(seq2)

0.7745966692414835

In [123]:
seq.cov(seq2)

0.8571428571428571
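The two results are related: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on the same values, under the assumed names `seq_a` and `seq_b` and with the default integer index:

```python
import pandas as pd

seq_a = pd.Series([1, 2, 3, 4, 4, 3, 2, 1])
seq_b = pd.Series([3, 4, 3, 4, 5, 4, 3, 2])

covariance = seq_a.cov(seq_b)                          # sample covariance
manual_corr = covariance / (seq_a.std() * seq_b.std()) # normalize by both spreads

print(covariance)
print(manual_corr)
print(seq_a.corr(seq_b))
```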

Both can also be applied to a single dataframe, where they return a matrix of the pairwise results between columns

In [124]:
frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],
                      index=['red','blue','yellow','white'],
                      columns=['ball','pen','pencil','paper'])
frame2

Unnamed: 0,ball,pen,pencil,paper
red,1,4,3,6
blue,4,5,6,1
yellow,3,3,1,5
white,4,1,6,4


In [127]:
frame2.corr()

Unnamed: 0,ball,pen,pencil,paper
ball,1.0,-0.276026,0.57735,-0.763763
pen,-0.276026,1.0,-0.079682,-0.361403
pencil,0.57735,-0.079682,1.0,-0.692935
paper,-0.763763,-0.361403,-0.692935,1.0


In [128]:
frame2.cov()

Unnamed: 0,ball,pen,pencil,paper
ball,2.0,-0.666667,2.0,-2.333333
pen,-0.666667,2.916667,-0.333333,-1.333333
pencil,2.0,-0.333333,6.0,-3.666667
paper,-2.333333,-1.333333,-3.666667,4.666667


The **corrwith()** method can be used to calculate the pairwise correlations between the columns or rows of a dataframe and a series or another DataFrame

In [129]:
ser = pd.Series([0,1,2,3,9],index=['red','blue','yellow','white','green'])
ser

red       0
blue      1
yellow    2
white     3
green     9
dtype: int64

In [131]:
frame2.corrwith(ser)

ball      0.730297
pen      -0.831522
pencil    0.210819
paper    -0.119523
dtype: float64

In [132]:
frame2.corrwith(frame)

ball      0.730297
pen      -0.831522
pencil    0.210819
paper    -0.119523
dtype: float64

###"Not a Number" Data###

####Assigning a NaN Value####

Use **np.nan** (the **np.NaN** alias was removed in NumPy 2.0)

In [133]:
ser = pd.Series([0,1,2,np.nan,9], index=['red','blue','yellow','white','green'])
ser

red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64

In [134]:
ser['white'] = None
ser

red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64

####Filtering out NaN Values####

Use the **dropna()** function

In [135]:
ser.dropna()

red       0.0
blue      1.0
yellow    2.0
green     9.0
dtype: float64

or apply conditional selection with **notnull()**

In [136]:
ser[ser.notnull()]

red       0.0
blue      1.0
yellow    2.0
green     9.0
dtype: float64

Be careful when using **dropna()** on a DataFrame: by default a single **NaN** is enough to eliminate its entire row (or its entire column when **axis=1** is given).

In [137]:
frame3 = pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],
                      index = ['blue','green','red'],
                      columns = ['ball','mug','pen'])
frame3

       ball  mug  pen
blue    6.0  NaN  6.0
green   NaN  NaN  NaN
red     2.0  NaN  5.0


In [138]:
frame3.dropna()

Empty DataFrame
Columns: [ball, mug, pen]
Index: []


To avoid this issue, you should specify the **how** option, assigning a value of **all** to it. This tells the **dropna()** function to delete *only the rows or columns in which all elements are NaN*.

In [140]:
frame3.dropna(how= 'all')

      ball  mug  pen
blue   6.0  NaN  6.0
red    2.0  NaN  5.0
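The same **how** option works along the columns with **axis=1**, and the **thresh** option keeps only the rows that have at least a given number of non-missing values. A sketch on a copy of the frame above, under the assumed name `frame_nan`:

```python
import numpy as np
import pandas as pd

frame_nan = pd.DataFrame([[6, np.nan, 6],
                          [np.nan, np.nan, np.nan],
                          [2, np.nan, 5]],
                         index=['blue', 'green', 'red'],
                         columns=['ball', 'mug', 'pen'])

no_empty_cols = frame_nan.dropna(axis=1, how='all')   # drop columns that are entirely NaN
at_least_two = frame_nan.dropna(thresh=2)             # keep rows with 2 or more real values

# Chaining both directions removes the empty row and the empty column
cleaned = frame_nan.dropna(how='all').dropna(axis=1, how='all')

print(no_empty_cols)
print(at_least_two)
print(cleaned)
```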


####Filling in NaN Occurrences####

Use the **fillna()** function

In [141]:
frame3.fillna(0)

Unnamed: 0,ball,mug,pen
blue,6.0,0.0,6.0
green,0.0,0.0,0.0
red,2.0,0.0,5.0


You can also replace **NaN** with different values depending on the column, specifying one by one the indexes and the associated values

In [142]:
frame3.fillna({'ball':1, 'mug':0, 'pen':99}) # one fill value per column, given in a dict

Unnamed: 0,ball,mug,pen
blue,6.0,0.0,6.0
green,1.0,0.0,99.0
red,2.0,0.0,5.0


###Hierarchical Indexing and Leveling###

Hierarchical indexing allows you to have multiple levels of indexes on a single axis.

Example of a series indexed by two arrays (which means *creating a structure with two levels*)

In [143]:
mser = pd.Series(np.random.rand(8),
                index= [['white','white','white','blue','blue','red','red','red'], #first level
                       ['up','down','right','up','down','up','down','left']])      #second level
mser

white  up       0.466076
       down     0.175193
       right    0.554998
blue   up       0.845832
       down     0.646697
red    up       0.673806
       down     0.466067
       left     0.615963
dtype: float64

The indexes now are multiple-level indexes

In [144]:
mser.index

MultiIndex(levels=[['blue', 'red', 'white'], ['down', 'left', 'right', 'up']],
           labels=[[2, 2, 2, 0, 0, 1, 1, 1], [3, 0, 2, 3, 0, 3, 0, 1]])

To select the first index,

In [145]:
mser['white']

up       0.466076
down     0.175193
right    0.554998
dtype: float64

To select the second index,

In [146]:
mser[:,'up'] # use ':' to select all index in the first level

white    0.466076
blue     0.845832
red      0.673806
dtype: float64

To select a specific value,

In [147]:
mser['white','up']

0.46607579222204687

To reshape the data and present it as a pivot table, you can use the **unstack()** function.

**unstack()** converts a series with a hierarchical index into a simple dataframe, where the second set of indexes is converted into a new set of columns

In [148]:
mser.unstack()

           down      left     right        up
blue   0.646697       NaN       NaN  0.845832
red    0.466067  0.615963       NaN  0.673806
white  0.175193       NaN  0.554998  0.466076
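By default the innermost level moves to the columns; passing a level number to **unstack()** chooses another one. The sketch below uses small fixed values (the name `mser_demo` and the numbers are invented) so the result is reproducible, and also converts the table back:

```python
import pandas as pd

mser_demo = pd.Series([1.0, 2.0, 3.0, 4.0],
                      index=[['white', 'white', 'blue', 'blue'],   # first level
                             ['up', 'down', 'up', 'down']])        # second level

table = mser_demo.unstack()          # second level becomes the columns
table_first = mser_demo.unstack(0)   # first level becomes the columns instead
back = table.stack()                 # columns go back into the index

print(table)
print(table_first)
print(back)
```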


To perform a reverse operation, use **stack()** function

In [149]:
frame

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [150]:
frame.stack()

red     ball       0
        pen        1
        pencil     2
        paper      3
blue    ball       4
        pen        5
        pencil     6
        paper      7
yellow  ball       8
        pen        9
        pencil    10
        paper     11
white   ball      12
        pen       13
        pencil    14
        paper     15
dtype: int64

For DataFrame, you can define a hierarchical index using the **index** and **columns** options

In [151]:
mframe = pd.DataFrame(np.random.randn(16).reshape(4,4),
                      index=[['white','white','red','red'],     #row - first level
                             ['up','down','up','down']],        #row - second level
                      columns=[['pen','pen','paper','paper'],   #column - first level
                               [1,2,1,2]])                      #column - second level
mframe

                 pen               paper
                   1         2         1         2
white up    1.249320  0.433371  1.014253 -1.419111
      down -1.102312 -0.210819 -0.040815 -1.830725
red   up    0.390486  1.032475 -1.199623  0.396890
      down -1.536510  1.592459  1.379996  0.063854


###Reordering and Sorting Levels###

**swaplevel()** function interchanges two levels without changing the data

In [152]:
mframe.columns.names = ['object','id']
mframe.index.names = ['colors','status']
mframe

object              pen               paper
id                    1         2         1         2
colors status
white  up      1.249320  0.433371  1.014253 -1.419111
       down   -1.102312 -0.210819 -0.040815 -1.830725
red    up      0.390486  1.032475 -1.199623  0.396890
       down   -1.536510  1.592459  1.379996  0.063854


In [154]:
mframe.swaplevel('colors','status')

object              pen               paper
id                    1         2         1         2
status colors
up     white   1.249320  0.433371  1.014253 -1.419111
down   white  -1.102312 -0.210819 -0.040815 -1.830725
up     red     0.390486  1.032475 -1.199623  0.396890
down   red    -1.536510  1.592459  1.379996  0.063854


**sort_index()** orders the data by a single level when that level is passed as the **level** parameter

In [155]:
mframe.sort_index(level= 'colors') # sort the color index only

object              pen               paper
id                    1         2         1         2
colors status
red    down   -1.536510  1.592459  1.379996  0.063854
       up      0.390486  1.032475 -1.199623  0.396890
white  down   -1.102312 -0.210819 -0.040815 -1.830725
       up      1.249320  0.433371  1.014253 -1.419111


###Summary Statistic by Level###

Many descriptive statistics and summary statistics have a **level** option in the pandas version used here (the option was removed from these methods in pandas 2.0)

If you want to create a statistic at a *row* level, you have to specify the **level** option with the level name

In [156]:
mframe.sum(level='colors')

object       pen               paper
id             1         2         1         2
colors
white   0.147009  0.222552  0.973438 -3.249836
red    -1.146024  2.624934  0.180373  0.460744


To create a statistic at a *column* level, you must also specify **axis=1**

In [157]:
mframe.sum(level='id', axis= 1)

id                    1         2
colors status
white  up      2.263573 -0.985739
       down   -1.143127 -2.041544
red    up     -0.809138  1.429364
       down   -0.156514  1.656313
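On pandas versions where the aggregation methods no longer accept **level**, the same per-level sums can be written with **groupby()**. The sketch below is one possible equivalent, not the notebook's own code; it uses the assumed name `mframe_demo` and fixed integers instead of random numbers so the results can be checked by hand:

```python
import numpy as np
import pandas as pd

mframe_demo = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=pd.MultiIndex.from_arrays(
        [['white', 'white', 'red', 'red'], ['up', 'down', 'up', 'down']],
        names=['colors', 'status']),
    columns=pd.MultiIndex.from_arrays(
        [['pen', 'pen', 'paper', 'paper'], [1, 2, 1, 2]],
        names=['object', 'id']))

# Row level: group the rows by 'colors'; sort=False keeps the original order
by_color = mframe_demo.groupby(level='colors', sort=False).sum()

# Column level: transpose, group by 'id', then transpose back
by_id = mframe_demo.T.groupby(level='id').sum().T

print(by_color)
print(by_id)
```

Transposing before grouping is used here for the column level because it relies only on row-wise grouping.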
