# DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. A DataFrame can be thought of as a collection of Series objects that share the same index. Let's use pandas to explore this topic!
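
To make the "Series sharing an index" picture concrete, here is a minimal sketch. The names `w`, `x`, and `frame` and the values are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Two Series with the same index labels (illustrative values)
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([10.0, 20.0, 30.0], index=['A', 'B', 'C'])

# A dict of Series becomes a DataFrame: the keys are the column names
# and the Series are aligned on their shared index
frame = pd.DataFrame({'W': w, 'X': x})

print(frame)
print(type(frame['W']))  # each column comes back as a Series
```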

In [1]:
import pandas as pd
import numpy as np

In [2]:
from numpy.random import randn
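# Uncomment to seed the generator for reproducible values. Without a seed the
# numbers change on every run, which is why some outputs below differ between cells.
#np.random.seed(89)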

In [3]:
df = pd.DataFrame(randn(5,4),index=['A','B','C','D','E'],columns=['W','X','Y','Z'])
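
If repeatable values are wanted, one option is a seeded NumPy Generator. This is a sketch under that assumption, not what the notebook itself runs; `rng`, `frame`, and `again` are illustrative names.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded Generator replaces the notebook's unseeded randn
# so that the same values come back on every run
rng = np.random.default_rng(seed=89)
frame = pd.DataFrame(
    rng.standard_normal((5, 4)),
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)

# Rebuilding with the same seed gives an identical frame
again = pd.DataFrame(
    np.random.default_rng(seed=89).standard_normal((5, 4)),
    index=frame.index,
    columns=frame.columns,
)
print(frame.equals(again))
```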

In [4]:
randn(5,4)

array([[-0.91233679, -0.23352934,  0.7965508 , -1.75510931],
       [ 2.34386695,  0.61371325, -0.55532915,  0.75503369],
       [-0.76473965, -0.33124533,  0.56518436, -1.11211525],
       [-0.56176619,  1.60317492, -0.33201428,  0.84232979],
       [ 1.11120591, -0.57021894, -0.42833331,  0.17594559]])

In [5]:
df

          W         X         Y         Z
A  0.926472  0.967269 -0.613100 -0.924886
B  1.611991  0.223651  2.814554 -0.232918
C -0.267676  0.626875  0.791482 -0.090819
D -0.671311  1.210892 -0.471212  1.891740
E -0.102502  1.167327  0.100831 -1.276812


## Selection and Indexing

Let's learn the various ways to grab data from a DataFrame.

In [6]:
df['W']

A    0.926472
B    1.611991
C   -0.267676
D   -0.671311
E   -0.102502
Name: W, dtype: float64

In [7]:
# Pass a list of column names
df[['W','Z']]

          W         Z
A  0.926472 -0.924886
B  1.611991 -0.232918
C -0.267676 -0.090819
D -0.671311  1.891740
E -0.102502 -1.276812


In [8]:
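# Attribute (dot) access (NOT RECOMMENDED!): it fails for column names that
# contain spaces or that clash with DataFrame methods such as count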
df.W

A   -1.693730
B   -0.305779
C   -0.182533
D    0.344946
E   -0.870106
Name: W, dtype: float64
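
A short sketch of why dot access is fragile. The frame `sales` and its column names are hypothetical and chosen to trigger the problem.

```python
import pandas as pd

# Hypothetical frame: one column name matches a DataFrame method,
# the other contains a space
sales = pd.DataFrame({'count': [3, 5], 'unit price': [1.5, 2.0]})

# Bracket notation always returns the column
print(sales['count'])

# Dot notation finds the DataFrame.count method, not the column
print(callable(sales.count))

# A name with a space cannot be written as an attribute at all,
# so brackets are the only option
print(sales['unit price'])
```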

DataFrame columns are just Series:

In [9]:
type(df['W'])

pandas.core.series.Series

In [10]:
# Use square brackets with iloc to take the first value by position
type(df['W'].iloc[0])

numpy.float64

In [11]:
# Assigning to a name that does not exist yet creates the column
df['p'] = df['X'] + df['W']

In [12]:
df['p']

A   -1.695851
B   -0.265229
C   -1.062109
D   -0.280339
E   -0.471958
Name: p, dtype: float64

In [13]:
df

          W         X         Y         Z         p
A -1.693730 -0.002121 -0.425892 -0.255590 -1.695851
B -0.305779  0.040550 -0.068354 -1.202224 -0.265229
C -0.182533 -0.879576 -0.243248  0.674031 -1.062109
D  0.344946 -0.625285 -0.206452 -0.871597 -0.280339
E -0.870106  0.398148 -0.665255  0.133677 -0.471958


**Creating a new column** (assigning to a name that already exists, as below, overwrites that column):

In [14]:
df['p'] = df['W'] + df['Y']

In [15]:
df

          W         X         Y         Z         p
A -1.693730 -0.002121 -0.425892 -0.255590 -2.119622
B -0.305779  0.040550 -0.068354 -1.202224 -0.374133
C -0.182533 -0.879576 -0.243248  0.674031 -0.425781
D  0.344946 -0.625285 -0.206452 -0.871597  0.138495
E -0.870106  0.398148 -0.665255  0.133677 -1.535361
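
The create-then-overwrite behaviour can be sketched on a small frame with known values. The frame `scores` and its numbers are illustrative. `assign` is shown as an alternative that returns a new frame instead of modifying the original.

```python
import pandas as pd

scores = pd.DataFrame({'W': [1.0, 2.0], 'Y': [10.0, 20.0]}, index=['A', 'B'])

# Assigning to a name that does not exist yet creates the column;
# the addition is element-wise, aligned on the index labels
scores['total'] = scores['W'] + scores['Y']

# Assigning to the same name again overwrites the column
scores['total'] = scores['total'] * 2

# assign() returns a new DataFrame and leaves scores untouched
with_diff = scores.assign(diff=scores['Y'] - scores['W'])
print(with_diff)
```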


**Removing Columns**

In [16]:
df.drop('p',axis=1,inplace=True)

In [17]:
# drop() does not modify df unless inplace=True is passed (as it was above)
df

          W         X         Y         Z
A -1.693730 -0.002121 -0.425892 -0.255590
B -0.305779  0.040550 -0.068354 -1.202224
C -0.182533 -0.879576 -0.243248  0.674031
D  0.344946 -0.625285 -0.206452 -0.871597
E -0.870106  0.398148 -0.665255  0.133677


Dropping a label that does not exist (here a label named 'new') raises a KeyError:

KeyError: "['new'] not found in axis"

In [19]:
df

          W         X         Y         Z
A -1.693730 -0.002121 -0.425892 -0.255590
B -0.305779  0.040550 -0.068354 -1.202224
C -0.182533 -0.879576 -0.243248  0.674031
D  0.344946 -0.625285 -0.206452 -0.871597
E -0.870106  0.398148 -0.665255  0.133677


You can also drop rows this way:

In [None]:
df.drop('E',axis=0)
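
A compact sketch of how drop behaves, using a small illustrative frame (`frame` and its values are not from the notebook). It shows that the result is a new DataFrame, that `columns=` is an alternative to `axis=1`, and that `errors='ignore'` avoids the KeyError for a missing label.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2], 'X': [3, 4]}, index=['A', 'B'])

# drop returns a new DataFrame; frame itself is unchanged
no_x = frame.drop('X', axis=1)   # remove a column
no_a = frame.drop('A', axis=0)   # remove a row

# The columns= keyword is an alternative to axis=1
same_as_no_x = frame.drop(columns='X')

# errors='ignore' skips labels that are missing instead of raising KeyError
unchanged = frame.drop('new', axis=1, errors='ignore')

print(frame.columns.tolist(), no_x.columns.tolist(), no_a.index.tolist())
```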

**Selecting Rows**

In [None]:
df.loc['A']

Or select by integer position instead of label:

In [None]:
df.iloc[2]

**Selecting a subset of rows and columns**

In [None]:
df.loc['B','Y']

In [None]:
df.loc[['A','B'],['W','Y']]
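
The difference between label-based `loc` and position-based `iloc` can be sketched on a small frame with known values (the frame below is illustrative). One detail worth seeing: label slices include the end point, position slices do not.

```python
import pandas as pd

frame = pd.DataFrame(
    {'W': [1, 2, 3], 'Y': [4, 5, 6]},
    index=['A', 'B', 'C'],
)

row_by_label = frame.loc['C']      # row labelled 'C'
row_by_position = frame.iloc[2]    # third row, whatever its label is

# Label slices include the end label; position slices exclude the end
label_slice = frame.loc['A':'B']   # rows A and B
position_slice = frame.iloc[0:2]   # rows at positions 0 and 1

cell = frame.loc['B', 'Y']         # a single value: row B, column Y
print(cell)
```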

### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [None]:
df

In [None]:
df>0

In [24]:
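# Values that fail the condition become NaN. Note that this overwrites df,
# so the cells below work on the masked frame.
df = df[df > 0]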

In [26]:
df

          W         X         Y         Z
A       NaN       NaN       NaN       NaN
B       NaN  0.040550       NaN       NaN
C       NaN       NaN       NaN  0.674031
D  0.344946       NaN       NaN       NaN
E       NaN  0.398148       NaN  0.133677


In [30]:
# Fill every NaN, in every column, with the mean of column W
df.fillna(df['W'].mean())

          W         X         Y         Z
A  0.344946  0.344946  0.344946  0.344946
B  0.344946  0.040550  0.344946  0.344946
C  0.344946  0.344946  0.344946  0.674031
D  0.344946  0.344946  0.344946  0.344946
E  0.344946  0.398148  0.344946  0.133677


In [29]:
df.dropna()

Empty DataFrame
Columns: [W, X, Y, Z]
Index: []
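
The cells above fill every gap with the mean of W and then drop every row with any missing value. As a hedged sketch of two common variations, the example below fills each column with its own mean and drops only rows that are entirely missing. The frame and its values are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {'W': [1.0, np.nan, 3.0], 'X': [np.nan, np.nan, 6.0]},
    index=['A', 'B', 'C'],
)

# frame.mean() is a Series of per-column means; passing it to fillna
# fills each column with its own mean
filled = frame.fillna(frame.mean())

# dropna() drops every row that contains at least one NaN
any_dropped = frame.dropna()

# how='all' only drops rows in which every value is missing
all_dropped = frame.dropna(how='all')

print(filled)
```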


In [None]:
df[df['W']>0]

In [None]:
df[df['W']>0]['Y']

In [None]:
df[df['W']>0][['Y','X']]
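
Chaining two sets of brackets works for reading data, but the same selection can be written as a single `.loc` call with a row condition and a column list. A minimal sketch on an illustrative frame:

```python
import pandas as pd

frame = pd.DataFrame(
    {'W': [-1.0, 2.0, 3.0], 'X': [7, 8, 9], 'Y': [4, 5, 6]},
    index=['A', 'B', 'C'],
)

# Two steps: filter the rows, then pick the columns
chained = frame[frame['W'] > 0][['Y', 'X']]

# One step: row condition and column list in a single .loc call
single = frame.loc[frame['W'] > 0, ['Y', 'X']]

print(single)
```

The single `.loc` form is generally the safer habit when the selection is later used for assignment, since it avoids operating on an intermediate copy.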

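For two conditions you can use | and & with parentheses. Python's and/or keywords do not work element-wise on a Series, and the parentheses are needed because | and & bind more tightly than comparison operators: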

In [33]:
df[(df['W']>0) | (df['Y']>5)]

          W         X         Y         Z
D  0.344946       NaN       NaN       NaN
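
A sketch of combining conditions on a small frame with known values (illustrative data). It also shows `~` for negation and what happens when Python's `and` is used by mistake.

```python
import pandas as pd

frame = pd.DataFrame(
    {'W': [-1, 2, 3, -4], 'Y': [6, 0, 9, 1]},
    index=['A', 'B', 'C', 'D'],
)

both = frame[(frame['W'] > 0) & (frame['Y'] > 5)]        # row C
either = frame[(frame['W'] > 0) | (frame['Y'] > 5)]      # rows A, B, C
neither = frame[~((frame['W'] > 0) | (frame['Y'] > 5))]  # row D

# Python's `and` asks for one truth value for a whole Series,
# which pandas refuses to guess, so it raises ValueError
try:
    frame[(frame['W'] > 0) and (frame['Y'] > 5)]
except ValueError as err:
    message = str(err)
    print(message)
```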


## More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [None]:
df

In [None]:
# Reset to default 0,1...n index
df.reset_index()

In [None]:
newind = ['CA','NY','WY','OR','CO']

In [None]:
df['States'] = newind

In [None]:
df

In [None]:
# Returns a new DataFrame; df itself is unchanged
df.set_index('States')

In [None]:
df

In [None]:
# inplace=True modifies df directly
df.set_index('States', inplace=True)

In [None]:
df
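
The index operations above can be sketched on a two-row frame (illustrative data). The sketch reassigns results instead of using inplace=True, which is one common style, and shows the `drop=True` option of reset_index.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2]}, index=['A', 'B'])
frame['States'] = ['CA', 'NY']

# reset_index moves the old index into a column named 'index'
reset = frame.reset_index()

# drop=True discards the old index instead of keeping it as a column
reset_dropped = frame.reset_index(drop=True)

# set_index returns a new frame; keep the result by assigning it
by_state = frame.set_index('States')

print(by_state)
```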

## Multi-Index and Index Hierarchy

Let us go over how to work with a Multi-Index. First we'll create a quick example of what a Multi-Indexed DataFrame looks like:

In [None]:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)

In [None]:
hier_index

In [None]:
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df

Now let's show how to index this! For an index hierarchy on the rows we use df.loc[]; if the hierarchy were on the columns axis, you would use normal bracket notation df[]. Selecting one level of the index returns the sub-DataFrame:

In [None]:
df.loc['G1']

In [None]:
df.loc['G1'].loc[1]

In [None]:
df.index.names

In [None]:
df.index.names = ['Group','Num']

In [None]:
df

In [None]:
df.xs('G1')

In [None]:
# Pass a tuple to select across several levels
# (recent pandas versions reject a list key here)
df.xs(('G1',1))

In [None]:
df.xs(1,level='Num')
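
For comparison, here is a sketch of the same kinds of selection on a Multi-Indexed frame with known values. It builds the index with `pd.MultiIndex.from_product`, an alternative to the `from_tuples` call used above, and the data is illustrative.

```python
import pandas as pd

# from_product builds every (Group, Num) combination and names the levels
index = pd.MultiIndex.from_product(
    [['G1', 'G2'], [1, 2, 3]],
    names=['Group', 'Num'],
)
frame = pd.DataFrame({'A': range(6), 'B': range(10, 16)}, index=index)

group_one = frame.loc['G1']          # sub-DataFrame indexed by Num
one_row = frame.loc[('G1', 2)]       # Series for Group G1, Num 2
same_row = frame.xs(('G1', 2))       # the same row via xs with a tuple
num_one = frame.xs(1, level='Num')   # rows with Num == 1, indexed by Group

print(num_one)
```

The `level` argument is what makes xs convenient: it reaches an inner level directly, which plain `loc['G1']` style selection on the outer level does not.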

# Great Job!