Book Reference: page 123-of **Python for Data Analysis Book by Wes McKinney**

# pandas

- contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python
- Difference between pandas and NumPy:
    - pandas is designed for working with tabular or heterogeneous data
    - NumPy is best suited for working with homogeneous numerical array data
- became an open source project in 2010
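
To make the heterogeneous vs. homogeneous distinction above concrete, here is a minimal sketch. The array and column names (`arr`, `df`, `count`, `label`) are made up for illustration and do not come from the book.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype: mixing ints and a string
# coerces every element to a string.
arr = np.array([1, 2, 'x'])
print(arr.dtype.kind)   # 'U' (unicode string)

# A DataFrame keeps a separate dtype per column, so numbers stay numbers.
df = pd.DataFrame({'count': [1, 2, 3], 'label': ['a', 'b', 'c']})
print(df.dtypes)
```
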

In [61]:
import pandas as pd

# or 
# from pandas import Series, DataFrame

### 2 Workhorse data structures
1. Series
    - a one-dimensional array-like object containing a sequence of values and an associated array of data labels (its index)
    - can also be thought of as a fixed-length, ordered dict mapping index labels to values
2. DataFrame
    - represents a rectangular table of data
    - contains an ordered collection of columns (which may be of different value types)
    - has both a row and column index
    - can be thought of as a dict of Series all sharing the same index
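
The two "dict" analogies above can be sketched as follows. The data (`prices`, `stock`) is hypothetical and only serves the illustration.

```python
import pandas as pd

# Hypothetical example data, not from the book.
prices = pd.Series({'apple': 1.2, 'pear': 0.8})
stock = pd.Series({'apple': 10, 'pear': 4})

# Dict-like behaviour of a Series: lookup by label, keys() returns the index.
apple_price = prices['apple']
labels = list(prices.keys())

# A DataFrame built from a dict of Series: each Series becomes a column,
# and all columns share one index.
table = pd.DataFrame({'price': prices, 'stock': stock})
shared_index = table['price'].index.equals(table['stock'].index)
print(table)
```
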

In [109]:
obj = pd.Series([1,2,3,'ha'])
obj

0     1
1     2
2     3
3    ha
dtype: object

In [110]:
obj2 = pd.Series([4, 7, -5, 3])
obj2

0    4
1    7
2   -5
3    3
dtype: int64

In [111]:
obj2.values

array([ 4,  7, -5,  3], dtype=int64)

In [112]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [113]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [114]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [115]:
# use labels in the index when selecting single values or a set of values
obj2['a']

-5

In [116]:
obj2['d'] = 6

In [117]:
obj2[['c', 'a', 'd']]  # ['c', 'a', 'd'] is interpreted as a list of index labels

c    3
a   -5
d    6
dtype: int64

NumPy-like operations (the link between index labels and values is preserved)
- filtering with a boolean array
- scalar multiplication
- applying math functions

In [118]:
obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

In [119]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [120]:
import numpy as np
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [121]:
# dict-like operations
'b' in obj2

True
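
One detail worth spelling out: like a dict, the `in` operator on a Series tests the index labels, not the values. A small sketch, using a fresh Series `s` with the same data as `obj2`:

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

label_hit = 'b' in s         # True: 'b' is an index label
value_as_label = 7 in s      # False: 7 is a value, not a label
value_hit = 7 in s.values    # True: search the underlying array instead
```
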

In [122]:
# convert Python dict to Series

sdata = {'Ohio': 35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [123]:
# override the default key order by passing the labels, in the desired order, as the index
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
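
The output above shows three effects at once. The following sketch separates them; `from_dict` and `reindexed` are illustrative names, the data is the same as in the cells above.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

from_dict = pd.Series(sdata)                 # integer values
reindexed = pd.Series(sdata, index=states)   # 'California' has no value -> NaN

print(from_dict.dtype)        # an integer dtype
print(reindexed.dtype)        # float64, because NaN is a floating-point value
print('Utah' in reindexed)    # False: 'Utah' is not in `states`, so it is dropped
```
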

In [124]:
# check for missing (NA) values
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [125]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [126]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [127]:
# data alignment: arithmetic matches values by index label, not by position
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
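
A smaller sketch of the same alignment behaviour, with made-up Series `a` and `b`. The `fill_value` argument of `Series.add` is not covered in the notes above; it is included only as one possible way to avoid NaN for labels present on one side.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['y', 'x'])    # different order, no 'z'

total = a + b                     # matched by label: x = 1 + 20, y = 2 + 10, z = NaN
filled = a.add(b, fill_value=0)   # treat the missing 'z' in b as 0, so z = 3
```
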

Both the Series object itself and its index have a `name` attribute


In [128]:
obj4.name = 'population'

In [129]:
obj4.index.name = 'state'

In [130]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [131]:
obj

0     1
1     2
2     3
3    ha
dtype: object

In [132]:
# Series's index can be altered in-place by assignment
obj.index = ['Jennie', 'Lisa', 'Rose', 'Jisoo']

In [133]:
obj

Jennie     1
Lisa       2
Rose       3
Jisoo     ha
dtype: object

### DataFrame

In [134]:
data = {
    'singer': ['Blackpink', 'NCT', 'Seventeen', 'SF9'],
    'year': [2018, 2017, 2019, 2017],
    'title': ['As If It\'s Your Last', 'Cherry Bomb', 'Fear', 'O Sole Mio']
}

In [135]:
frame = pd.DataFrame(data)

In [136]:
frame

|   | singer | title | year |
|---|---|---|---|
| 0 | Blackpink | As If It's Your Last | 2018 |
| 1 | NCT | Cherry Bomb | 2017 |
| 2 | Seventeen | Fear | 2019 |
| 3 | SF9 | O Sole Mio | 2017 |


In [137]:
frame.head() # displays first five rows

|   | singer | title | year |
|---|---|---|---|
| 0 | Blackpink | As If It's Your Last | 2018 |
| 1 | NCT | Cherry Bomb | 2017 |
| 2 | Seventeen | Fear | 2019 |
| 3 | SF9 | O Sole Mio | 2017 |


In [138]:
pd.DataFrame(data, columns=['year', 'title', 'singer'])

|   | year | title | singer |
|---|---|---|---|
| 0 | 2018 | As If It's Your Last | Blackpink |
| 1 | 2017 | Cherry Bomb | NCT |
| 2 | 2019 | Fear | Seventeen |
| 3 | 2017 | O Sole Mio | SF9 |


In [139]:
frame2 = pd.DataFrame(data, columns=['year', 'title', 'singer', 'rating'], 
            index=['one', 'two', 'three', 'four'])
frame2

|   | year | title | singer | rating |
|---|---|---|---|---|
| one | 2018 | As If It's Your Last | Blackpink | NaN |
| two | 2017 | Cherry Bomb | NCT | NaN |
| three | 2019 | Fear | Seventeen | NaN |
| four | 2017 | O Sole Mio | SF9 | NaN |


In [140]:
frame2.columns

Index(['year', 'title', 'singer', 'rating'], dtype='object')

In [141]:
frame2['singer']

one      Blackpink
two            NCT
three    Seventeen
four           SF9
Name: singer, dtype: object

In [142]:
frame2.title # this only works when the column name is a valid Python variable name

one      As If It's Your Last
two               Cherry Bomb
three                    Fear
four               O Sole Mio
Name: title, dtype: object
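
A short sketch of when attribute access is not enough. The DataFrame `df` and its column names are hypothetical. The clash with an existing method (`count`) is an additional caveat not mentioned in the notes above.

```python
import pandas as pd

# Hypothetical columns chosen to show the limits of attribute access.
df = pd.DataFrame({'title': ['Fear'], 'release year': [2019], 'count': [1]})

title_col = df.title            # works: 'title' is a plain identifier
year_col = df['release year']   # `df.release year` would be a SyntaxError
count_attr = df.count           # resolves to the DataFrame.count method, not the column
count_col = df['count']         # bracket access always returns the column
```
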

In [143]:
frame2.loc['three']  # select a row by its index label

year           2019
title          Fear
singer    Seventeen
rating          NaN
Name: three, dtype: object

Columns can be modified by assignment

In [144]:
frame2['rating'] = 3.5

In [145]:
frame2

|   | year | title | singer | rating |
|---|---|---|---|---|
| one | 2018 | As If It's Your Last | Blackpink | 3.5 |
| two | 2017 | Cherry Bomb | NCT | 3.5 |
| three | 2019 | Fear | Seventeen | 3.5 |
| four | 2017 | O Sole Mio | SF9 | 3.5 |


In [146]:
frame2['rating'] = np.arange(4.)

In [147]:
frame2

|   | year | title | singer | rating |
|---|---|---|---|---|
| one | 2018 | As If It's Your Last | Blackpink | 0.0 |
| two | 2017 | Cherry Bomb | NCT | 1.0 |
| three | 2019 | Fear | Seventeen | 2.0 |
| four | 2017 | O Sole Mio | SF9 | 3.0 |


In [148]:
# a list or array must match the DataFrame's length; a Series is instead
# aligned on its labels, and rows without a matching label become NaN
rating_val = pd.Series([4,5], index=['two', 'four'])

In [149]:
frame2['rating'] = rating_val

In [150]:
frame2

|   | year | title | singer | rating |
|---|---|---|---|---|
| one | 2018 | As If It's Your Last | Blackpink | NaN |
| two | 2017 | Cherry Bomb | NCT | 4.0 |
| three | 2019 | Fear | Seventeen | NaN |
| four | 2017 | O Sole Mio | SF9 | 5.0 |
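
The difference between assigning a list and assigning a Series can be sketched like this. The frame `df` is a reduced stand-in for `frame2`, and `length_error` is just a flag for the illustration.

```python
import pandas as pd

df = pd.DataFrame({'year': [2018, 2017, 2019, 2017]},
                  index=['one', 'two', 'three', 'four'])

# A list (or array) must have exactly as many elements as the DataFrame has rows.
try:
    df['rating'] = [4, 5]
    length_error = False
except ValueError:
    length_error = True

# A Series is aligned on its labels instead; unmatched rows become NaN.
df['rating'] = pd.Series([4, 5], index=['two', 'four'])
print(df)
```
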


In [151]:
# add new column
frame2['girl_group'] = frame2.singer == 'Blackpink'

In [152]:
frame2

|   | year | title | singer | rating | girl_group |
|---|---|---|---|---|---|
| one | 2018 | As If It's Your Last | Blackpink | NaN | True |
| two | 2017 | Cherry Bomb | NCT | 4.0 | False |
| three | 2019 | Fear | Seventeen | NaN | False |
| four | 2017 | O Sole Mio | SF9 | 5.0 | False |


In [153]:
# delete column
del frame2['girl_group']

In [154]:
frame2

|   | year | title | singer | rating |
|---|---|---|---|---|
| one | 2018 | As If It's Your Last | Blackpink | NaN |
| two | 2017 | Cherry Bomb | NCT | 4.0 |
| three | 2019 | Fear | Seventeen | NaN |
| four | 2017 | O Sole Mio | SF9 | 5.0 |


In [155]:
frame2.columns

Index(['year', 'title', 'singer', 'rating'], dtype='object')

The column returned from indexing a DataFrame is a view on the
underlying data, not a copy. Thus, any in-place modifications to the
Series will be reflected in the DataFrame (this holds for pandas
versions without Copy-on-Write; with Copy-on-Write enabled, such
modifications do not propagate). The column can be
explicitly copied with the Series’s **copy method**.
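
A minimal sketch of the explicit copy, using a made-up one-column frame `df`. It only demonstrates the safe direction: changes to a copy never reach the DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'rating': [1.0, 2.0, 3.0]})

safe = df['rating'].copy()   # independent copy of the column
safe.iloc[0] = 99.0          # modifies only the copy

print(df['rating'].tolist())   # [1.0, 2.0, 3.0]: the DataFrame is unchanged
```
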

In [156]:
# nested dict of dicts
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [157]:
frame3 = pd.DataFrame(pop)

In [158]:
frame3

|   | Nevada | Ohio |
|---|---|---|
| 2000 | NaN | 1.5 |
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |
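
How the nested dict maps onto the table can be checked directly: outer keys become columns, inner keys become row labels, and a missing inner key yields NaN. In this sketch the names `columns`, `rows`, and `missing` are illustrative, and the row labels are sorted explicitly because their order may differ between pandas versions.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
df = pd.DataFrame(pop)

columns = list(df.columns)                  # outer keys
rows = sorted(df.index)                     # union of the inner keys
missing = pd.isna(df.loc[2000, 'Nevada'])   # Nevada has no entry for 2000
```
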


In [159]:
frame3.T # transpose

|   | 2000 | 2001 | 2002 |
|---|---|---|---|
| Nevada | NaN | 2.4 | 2.9 |
| Ohio | 1.5 | 1.7 | 3.6 |


In [160]:
frame3['Ohio'][:-1]

2000    1.5
2001    1.7
Name: Ohio, dtype: float64

In [162]:
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

| state | Nevada | Ohio |
|---|---|---|
| year |  |  |
| 2000 | NaN | 1.5 |
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |


In [163]:
frame3.values

array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

In [164]:
frame2.values # see how dtype is object

array([[2018, "As If It's Your Last", 'Blackpink', nan],
       [2017, 'Cherry Bomb', 'NCT', 4.0],
       [2019, 'Fear', 'Seventeen', nan],
       [2017, 'O Sole Mio', 'SF9', 5.0]], dtype=object)
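
Why the dtype differs between the two `.values` outputs above: the result is a single NumPy array, so pandas has to pick one dtype that fits every column. A sketch with two made-up frames, using `to_numpy()` (the method counterpart of `.values`):

```python
import pandas as pd

numeric = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
mixed = pd.DataFrame({'a': [1.0, 2.0], 'label': ['x', 'y']})

print(numeric.to_numpy().dtype)   # float64: all columns share one dtype
print(mixed.to_numpy().dtype)     # object: the only dtype that fits floats and strings
```
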

### Index Objects

In [165]:
obj = pd.Series(range(3), index=['a','b','c'])

In [166]:
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')
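
One property of Index objects not shown in the cells above is that they are immutable, which makes them safe to share between data structures. A minimal sketch, with `immutable` as a flag used only for the illustration:

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

try:
    index[1] = 'd'        # Index objects do not support item assignment
    immutable = False
except TypeError:
    immutable = True

first_two = index[:2]     # slicing works and returns another Index
```
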