# DataFrames


In [1]:
import pandas as pd
import numpy as np

In [8]:
from numpy.random import randn


In [6]:
df = pd.DataFrame(randn(3,2),index='Ram Sham Geeta'.split(),columns='X Y'.split())

In [7]:
df

              X         Y
Ram    0.495227 -0.334108
Sham  -0.057518 -0.592710
Geeta -0.810741  1.017984
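
Because `randn` draws fresh values on every run, the numbers above will differ on your machine. Below is a minimal sketch of a reproducible version. It assumes NumPy's `default_rng` generator and an arbitrary seed of 101, neither of which appears in the original notebook, and it uses the name `df_seeded` so the notebook's `df` is left alone.

```python
import numpy as np
import pandas as pd

# Assumption: the seed 101 is arbitrary; any fixed integer makes the draw repeatable.
rng = np.random.default_rng(101)

df_seeded = pd.DataFrame(
    rng.standard_normal((3, 2)),     # 3 rows x 2 columns of standard-normal draws
    index='Ram Sham Geeta'.split(),  # row labels
    columns='X Y'.split(),           # column labels
)
print(df_seeded)
```

Re-running this cell prints the same values each time, which makes later outputs easier to compare.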


## Selection and Indexing

Let's look at the main ways to select data from a DataFrame.

In [9]:
df['X']

Ram      0.495227
Sham    -0.057518
Geeta   -0.810741
Name: X, dtype: float64

In [10]:
# Pass a list of column names
df[['X','Y']]

              X         Y
Ram    0.495227 -0.334108
Sham  -0.057518 -0.592710
Geeta -0.810741  1.017984


DataFrame columns are just Series:

In [11]:
type(df['X'])

pandas.core.series.Series
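
To make that concrete, the sketch below contrasts single and double brackets. It uses a stand-in frame called `demo` (a hypothetical name), hard-coded from the values printed above so the example runs on its own.

```python
import pandas as pd

# Stand-in frame with the values printed above (rounded to six decimals).
demo = pd.DataFrame(
    {'X': [0.495227, -0.057518, -0.810741],
     'Y': [-0.334108, -0.592710, 1.017984]},
    index=['Ram', 'Sham', 'Geeta'],
)

col = demo['X']      # single brackets return a Series
frame = demo[['X']]  # a list with one name returns a one-column DataFrame

print(type(col).__name__)             # Series
print(type(frame).__name__)           # DataFrame
print(col.name)                       # X
print(col.index.equals(demo.index))   # True: the Series keeps the row labels
```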

**Creating a new column:**

In [14]:
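# Both columns appear in the output below; each is the row-wise sum of X and Y
df['new'] = df['X'] + df['Y']
df['W'] = df['X'] + df['Y']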

In [15]:
df

              X         Y       new         W
Ram    0.495227 -0.334108  0.161119  0.161119
Sham  -0.057518 -0.592710 -0.650228 -0.650228
Geeta -0.810741  1.017984  0.207243  0.207243
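
Bracket assignment modifies the DataFrame in place. When a modified copy is preferable, `DataFrame.assign` returns a new frame and leaves the original alone. Here is a small sketch of that alternative, using a stand-in frame `demo` (a hypothetical name) built from the values printed above.

```python
import pandas as pd

# Stand-in frame with the values printed above (rounded to six decimals).
demo = pd.DataFrame(
    {'X': [0.495227, -0.057518, -0.810741],
     'Y': [-0.334108, -0.592710, 1.017984]},
    index=['Ram', 'Sham', 'Geeta'],
)

# assign() returns a new DataFrame with the extra column
with_w = demo.assign(W=demo['X'] + demo['Y'])

print(list(with_w.columns))  # ['X', 'Y', 'W']
print(list(demo.columns))    # ['X', 'Y'] (the original is untouched)
```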


**Removing columns:**

In [16]:
df.drop('new',axis=1)

              X         Y         W
Ram    0.495227 -0.334108  0.161119
Sham  -0.057518 -0.592710 -0.650228
Geeta -0.810741  1.017984  0.207243


In [17]:
# drop() returns a new DataFrame; df itself is unchanged unless inplace=True
df

              X         Y       new         W
Ram    0.495227 -0.334108  0.161119  0.161119
Sham  -0.057518 -0.592710 -0.650228 -0.650228
Geeta -0.810741  1.017984  0.207243  0.207243


In [18]:
df.drop('new',axis=1,inplace=True)

In [19]:
df

              X         Y         W
Ram    0.495227 -0.334108  0.161119
Sham  -0.057518 -0.592710 -0.650228
Geeta -0.810741  1.017984  0.207243


You can also drop rows this way, using axis=0:

In [20]:
df.drop('Ram',axis=0)

              X         Y         W
Sham  -0.057518 -0.592710 -0.650228
Geeta -0.810741  1.017984  0.207243
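
`drop` also accepts `columns=` and `index=` keywords, which some readers find clearer than `axis=1` and `axis=0`. The sketch below uses a stand-in frame `demo` (a hypothetical name) built from the values above, and keeps the result in a new variable rather than passing `inplace=True`.

```python
import pandas as pd

# Stand-in frame with the values printed above (rounded to six decimals).
demo = pd.DataFrame(
    {'X': [0.495227, -0.057518, -0.810741],
     'Y': [-0.334108, -0.592710, 1.017984],
     'W': [0.161119, -0.650228, 0.207243]},
    index=['Ram', 'Sham', 'Geeta'],
)

no_w = demo.drop(columns='W')    # same result as demo.drop('W', axis=1)
no_ram = demo.drop(index='Ram')  # same result as demo.drop('Ram', axis=0)

print(list(no_w.columns))  # ['X', 'Y']
print(list(no_ram.index))  # ['Sham', 'Geeta']
print(demo.shape)          # (3, 3): demo itself is unchanged
```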


**Selecting rows:**

In [21]:
df.loc['Sham']

X   -0.057518
Y   -0.592710
W   -0.650228
Name: Sham, dtype: float64

Or select by integer position instead of by label:

In [22]:
df.iloc[2]

X   -0.810741
Y    1.017984
W    0.207243
Name: Geeta, dtype: float64
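
One difference worth keeping in mind: label slices with `.loc` include the end label, while positional slices with `.iloc` exclude the end position, as ordinary Python slices do. Here is a brief sketch on a stand-in frame `demo` (a hypothetical name) built from the values above.

```python
import pandas as pd

# Stand-in frame with the values printed above (rounded to six decimals).
demo = pd.DataFrame(
    {'X': [0.495227, -0.057518, -0.810741],
     'Y': [-0.334108, -0.592710, 1.017984],
     'W': [0.161119, -0.650228, 0.207243]},
    index=['Ram', 'Sham', 'Geeta'],
)

by_label = demo.loc['Ram':'Sham']  # both endpoints included: 2 rows
by_pos = demo.iloc[0:1]            # end position excluded: 1 row

print(list(by_label.index))  # ['Ram', 'Sham']
print(list(by_pos.index))    # ['Ram']

# The same row reached by label and by position
print(demo.loc['Geeta'].equals(demo.iloc[2]))  # True
```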

**Selecting a subset of rows and columns:**

In [23]:
df.loc['Ram','X']

0.49522719891177125
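
Passing lists instead of single labels returns a sub-DataFrame rather than a scalar. The sketch below shows the label-based and position-based forms side by side, again on a stand-in frame `demo` (a hypothetical name) built from the values above.

```python
import pandas as pd

# Stand-in frame with the values printed above (rounded to six decimals).
demo = pd.DataFrame(
    {'X': [0.495227, -0.057518, -0.810741],
     'Y': [-0.334108, -0.592710, 1.017984],
     'W': [0.161119, -0.650228, 0.207243]},
    index=['Ram', 'Sham', 'Geeta'],
)

# Rows Ram and Sham, columns X and W, selected by label
by_label = demo.loc[['Ram', 'Sham'], ['X', 'W']]

# The same block selected by integer position
by_pos = demo.iloc[0:2, [0, 2]]

print(by_label)
print(by_label.equals(by_pos))  # True
```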

### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

In [25]:
df

              X         Y         W
Ram    0.495227 -0.334108  0.161119
Sham  -0.057518 -0.592710 -0.650228
Geeta -0.810741  1.017984  0.207243


In [26]:
df>0

           X      Y      W
Ram     True  False   True
Sham   False  False  False
Geeta  False   True   True


In [27]:
df[df>0]

              X         Y         W
Ram    0.495227       NaN  0.161119
Sham        NaN       NaN       NaN
Geeta       NaN  1.017984  0.207243
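
Indexing with a boolean DataFrame keeps the original shape and puts NaN wherever the condition is False. `DataFrame.where` expresses the same operation explicitly. A short sketch, using a stand-in frame `demo` (a hypothetical name) built from the values above:

```python
import pandas as pd

# Stand-in frame with the values printed above (rounded to six decimals).
demo = pd.DataFrame(
    {'X': [0.495227, -0.057518, -0.810741],
     'Y': [-0.334108, -0.592710, 1.017984],
     'W': [0.161119, -0.650228, 0.207243]},
    index=['Ram', 'Sham', 'Geeta'],
)

masked = demo[demo > 0]           # bracket form
explicit = demo.where(demo > 0)   # equivalent explicit form

print(masked.equals(explicit))          # True
print(int(masked.isna().sum().sum()))   # 5 entries were not positive
```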


In [28]:
df[df['W']>0]

              X         Y         W
Ram    0.495227 -0.334108  0.161119
Geeta -0.810741  1.017984  0.207243


In [29]:
df[df['W']>0]['Y']

Ram     -0.334108
Geeta    1.017984
Name: Y, dtype: float64

In [30]:
df[df['W']>0][['Y','X']]

              Y         X
Ram   -0.334108  0.495227
Geeta  1.017984 -0.810741
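
Chaining two sets of brackets works for reading data. The same selection can also be written as one `.loc` call, with the row mask first and the column list second. That is generally the safer form if you later want to assign to the selection, since chained indexing may operate on a temporary copy. A sketch on a stand-in frame `demo` (a hypothetical name) built from the values above:

```python
import pandas as pd

# Stand-in frame with the values printed above (rounded to six decimals).
demo = pd.DataFrame(
    {'X': [0.495227, -0.057518, -0.810741],
     'Y': [-0.334108, -0.592710, 1.017984],
     'W': [0.161119, -0.650228, 0.207243]},
    index=['Ram', 'Sham', 'Geeta'],
)

chained = demo[demo['W'] > 0][['Y', 'X']]      # filter rows, then pick columns
single = demo.loc[demo['W'] > 0, ['Y', 'X']]   # both steps in one .loc call

print(single)
print(chained.equals(single))  # True
```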


To combine two conditions, use | (or) and & (and), wrapping each condition in parentheses:

In [31]:
df[(df['W']>0) & (df['Y'] > 1)]

              X         Y         W
Geeta -0.810741  1.017984  0.207243
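
The parentheses matter because `&` and `|` bind more tightly than comparisons such as `>`. Python's own `and` and `or` do not work here: they try to reduce each Series to a single True/False and raise a `ValueError`. The sketch below shows the `|` and `~` (not) forms and the failure mode, on a stand-in frame `demo` (a hypothetical name) built from the values above.

```python
import pandas as pd

# Stand-in frame with the values printed above (rounded to six decimals).
demo = pd.DataFrame(
    {'X': [0.495227, -0.057518, -0.810741],
     'Y': [-0.334108, -0.592710, 1.017984],
     'W': [0.161119, -0.650228, 0.207243]},
    index=['Ram', 'Sham', 'Geeta'],
)

cond = (demo['W'] > 0) | (demo['Y'] > 1)  # element-wise OR

either = demo[cond]     # rows where at least one condition holds
neither = demo[~cond]   # ~ negates the mask element-wise

print(list(either.index))   # ['Ram', 'Geeta']
print(list(neither.index))  # ['Sham']

# Using the keyword `and` instead of `&` fails
and_failed = False
try:
    demo[(demo['W'] > 0) and (demo['Y'] > 1)]
except ValueError as err:
    and_failed = True
    print('ValueError:', err)
```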
