Sarah Kerman shk83@pitt.edu 9/12/17
Pandas Notes

In [7]:
#import using alias like numpy
import numpy as np
import pandas as pd

- pandas objects are like numpy arrays but can be referenced with labels instead of indices
- Series object: 1-dim array. can be created from python list or numpy array. has index and values attributes
    - series.values is a numpy array of the values
    - elements can be accessed/sliced with square brackets like a numpy array
    - series.index is an Index object holding the labels
    - labels will be integers (a RangeIndex) if unspecified (basically a 1d numpy array) but can specify index labels
    - labels can be any hashable value, like python dictionary keys
    - series can be constructed directly from python dictionary: keys become the index, values become the data (mixed types are allowed but fall back to dtype object)
    - can add/select items using key (index label) like dictionary, use .keys()/.items(), but can also slice like array
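
A minimal sketch of the dictionary-like behavior listed above. The `scores` Series and its values are made up for illustration; `.iloc` is used for the positional slice so the intent is unambiguous.

```python
import pandas as pd

# hypothetical data: build a Series directly from a dictionary
scores = pd.Series({'a': 10, 'b': 20, 'c': 30})

# dictionary-style selection and assignment by label
b_value = scores['b']
scores['d'] = 40                    # adds a new item, like a dict

# dictionary-style helper
labels = list(scores.keys())

# array-style slicing
label_slice = scores['a':'c']       # label slices include the end label
position_slice = scores.iloc[0:2]   # positional slices exclude the end position

print(b_value)
print(labels)
print(label_slice.tolist())
print(position_slice.tolist())
```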


In [8]:
#a pandas Series with unspecified indices
data=pd.Series([4,4434,675,24,892])
print(data)
print(data.values)
print(data.index)

0       4
1    4434
2     675
3      24
4     892
dtype: int64
[   4 4434  675   24  892]
RangeIndex(start=0, stop=5, step=1)


In [11]:
# a pandas Series with specified indices (labels)
data2=pd.Series([12,435,574,678],index=['a','b','c','d'])
data2

a     12
b    435
c    574
d    678
dtype: int64

- DataFrame object:
    - like a 2d array with flexible row/column labels, or multiple Series objects that share same labels on one axis
    - can construct from:
        - dictionary containing new column labels as keys, and Series objs as vals
        - list of dictionaries
        - a 2d numpy array
    - has .index attr which lists index labels
    - has .columns attr which is an index obj holding column labels
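
The cell below builds a DataFrame from a dict of Series; the other two construction routes can be sketched as follows. All names and values here (`rows`, `arr`, the labels) are invented for illustration.

```python
import numpy as np
import pandas as pd

# from a list of dictionaries: each dict becomes one row
rows = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
df_rows = pd.DataFrame(rows)

# from a 2D numpy array, with explicit row and column labels
arr = np.arange(6).reshape(3, 2)
df_arr = pd.DataFrame(arr, columns=['x', 'y'], index=['r1', 'r2', 'r3'])

print(df_rows)
print(df_arr.index)     # row labels
print(df_arr.columns)   # column labels
```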

In [9]:
#DataFrame constructed from dict of Series showing 4 of my characters, their class, and their raid dps
dps=pd.Series({'Phavena':6000,'Aezat':3000,"V'thia":5000,'Teishaa':3500})
player_class=pd.Series({'Phavena':'Sorcerer','Aezat':'Juggernaut',"V'thia":'Sage','Teishaa':'Sniper'})
team=pd.DataFrame({'dps':dps,'class':player_class})
team

              class   dps
Aezat    Juggernaut  3000
Phavena    Sorcerer  6000
Teishaa      Sniper  3500
V'thia         Sage  5000


Indexers for when integer labels might not match the implicit python positions (e.g. labels that are not sequential)
- .loc[x] always references the **explicit** index label (if your labels are 1, 3, 5, .loc[1] returns the first element, not the second)
- .iloc[x] always references the **implicit** (python) position (.iloc[1] returns the second element, regardless of what is actually labeled 1)
- "explicit is better than implicit" keep your code clean
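
A small sketch of the difference, using the 1, 3, 5 labels from the bullets above. The Series `s` and its values are hypothetical.

```python
import pandas as pd

# hypothetical Series whose integer labels are not 0, 1, 2
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = s.loc[1]       # explicit label 1 -> first element
by_position = s.iloc[1]   # implicit position 1 -> second element

# slices differ too: .loc includes the end label, .iloc excludes the end position
loc_slice = s.loc[1:3]
iloc_slice = s.iloc[1:3]

print(by_label, by_position)
print(loc_slice.tolist())
print(iloc_slice.tolist())
```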


Data Selection in DataFrame
- can use dictionary key format (df['column']) to get a column from a DF as a Series
- can use .iloc (implicit python indices) to slice DF like a 2d array (labels are maintained in output)
- can use .loc to slice DF using explicit row/column labels
- .ix is deprecated (and removed entirely in pandas 1.0), don't use it

In [13]:
#using implicit indices to slice DF like 2D array
print(team.iloc[:3,:2])

#using explicit indices to slice DF
print(team.loc[:'Phavena',:'class'])

              class   dps
Aezat    Juggernaut  3000
Phavena    Sorcerer  6000
Teishaa      Sniper  3500
              class
Aezat    Juggernaut
Phavena    Sorcerer
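
The dictionary-style column access from the list above is not exercised in the cell; a minimal sketch, rebuilding the same `team` frame so the block runs on its own:

```python
import pandas as pd

# same data as the team DataFrame defined earlier in these notes
dps = pd.Series({'Phavena': 6000, 'Aezat': 3000, "V'thia": 5000, 'Teishaa': 3500})
player_class = pd.Series({'Phavena': 'Sorcerer', 'Aezat': 'Juggernaut',
                          "V'thia": 'Sage', 'Teishaa': 'Sniper'})
team = pd.DataFrame({'dps': dps, 'class': player_class})

# dictionary-style access returns one column as a Series
dps_column = team['dps']

# .loc with a row label and a column label returns a single value
phavena_dps = team.loc['Phavena', 'dps']

print(type(dps_column).__name__)
print(phavena_dps)
```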


Missing Data
- pandas uses sentinels (special indicators) for missing data
- float NaN, or python None object (only usable in arrays of type "object")
- any math with np.nan will result in nan
    - aggregate functions on an array containing nan are still defined (no error), but return nan
    - use nan-safe aggregate functions (np.nansum(), np.nanmin(), np.nanmax()...)
- pandas will convert between None and nan where appropriate
- integer arrays will be upcast to float arrays to fit nan if present
- functions for null data:
    - isnull() returns boolean mask over data to indicate null value
    - notnull() opposite of isnull()
    - dropna() return version with nulls filtered out
        - for a DF, will drop any entire row containing a null value (or column, if specified with .dropna(axis='columns'))
        - can specify how='all' to only drop a row/column that is ALL null
        - can specify thresh=x to keep only rows/columns that have at least x non-null values
    - fillna() return copy of data with missing values filled
        - .fillna(x) to fill nulls with specific value (like 0, -1, or 9999)
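        - .fillna(method='ffill') forward-fill uses the previous valid value to fill a null (newer pandas versions deprecate the method argument in favor of .ffill())
            - if no previous value is available (like 1st element in a row) the null remains
        - .fillna(method='bfill') back-fill uses the next valid value to fill a null (newer pandas: .bfill())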
        - can specify axis for DF
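
A short sketch tying the points above together. The data is made up, and `.ffill()` is used in place of `fillna(method='ffill')`, since newer pandas versions deprecate the `method` argument. Note that `thresh` counts non-null values.

```python
import numpy as np
import pandas as pd

# hypothetical data; None is converted to NaN and the dtype becomes float64
s = pd.Series([1, np.nan, 3, None])

plain_sum = np.sum(s.values)      # ordinary aggregate propagates nan
safe_sum = np.nansum(s.values)    # nan-safe version ignores nan

mask = s.isnull()                 # boolean mask, True where missing
dropped = s.dropna()              # only the non-null values
filled_zero = s.fillna(0)         # nulls replaced with a fixed value
filled_fwd = s.ffill()            # previous valid value carried forward

df = pd.DataFrame([[1, np.nan, 2],
                   [np.nan, np.nan, np.nan],
                   [3, 4, 5]])
rows_any = df.dropna()              # drops every row that has any null
rows_all = df.dropna(how='all')     # drops only rows that are entirely null
rows_thresh = df.dropna(thresh=2)   # keeps rows with at least 2 non-null values

print(plain_sum, safe_sum)
print(filled_fwd.tolist())
print(list(rows_any.index), list(rows_all.index), list(rows_thresh.index))
```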
