# Pandas Introduction - DataFrame part 2

In [72]:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd

In [73]:
pop_data = {'Nevada': {2004: 0.0, 2003: 0.0, 2002: 2.9, 2001: 2.4},
            'Ohio': {2004: 0.0, 2003: 0.0, 2002: 3.6, 2001: 1.7, 2000: 1.5},
            'California': {2004: 0.0, 2003: 0.0, 2002: 0.0, 2001: 0.0, 2000: 0.0},
            'Texas': {2004: 0.0, 2003: 0.0, 2002: 0.0, 2001: 0.0, 2000: 0.0},
           }
           
df3 = DataFrame(pop_data)

# set the index and column names to something meaningful
df3.index.name = 'Year'
df3.columns.name = 'State'

## Accessing data in a Dataframe

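Using "dictionary-style" references allows access to columns.  

The *loc* indexer allows selection of both rows and columns by label.

Unlike slicing in normal Python, label-based slicing with *loc* on a Series or DataFrame is *inclusive* of the endpoint (position-based slicing with *iloc* follows the usual Python rule and excludes it).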
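
To make the endpoint rule concrete, here is a minimal sketch on a small, made-up Series (the values and the name `s` are illustrative and not part of the notebook's data). It compares a label slice through *loc* with a positional slice through *iloc*.

```python
import pandas as pd

# Hypothetical toy Series with integer year labels, for illustration only
s = pd.Series([10, 20, 30, 40], index=[2003, 2002, 2001, 2000])

by_label = s.loc[2003:2001]    # label slice: both endpoints are included
by_position = s.iloc[0:2]      # positional slice: the stop position is excluded

print(list(by_label.index))     # [2003, 2002, 2001]
print(list(by_position.index))  # [2003, 2002]
```

The label slice returns three rows because 2001 is included, while the positional slice stops before position 2.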

In [74]:
df3

State  Nevada  Ohio  California  Texas
Year
2004      0.0   0.0         0.0    0.0
2003      0.0   0.0         0.0    0.0
2002      2.9   3.6         0.0    0.0
2001      2.4   1.7         0.0    0.0
2000      NaN   1.5         0.0    0.0


The first positional argument to *loc* is an index (row) specifier.

In [75]:
df3.loc[2002]

State
Nevada        2.9
Ohio          3.6
California    0.0
Texas         0.0
Name: 2002, dtype: float64

Slicing returns a contiguous set of rows...

In [76]:
df3.loc[2003:2001]

State  Nevada  Ohio  California  Texas
Year
2003      0.0   0.0         0.0    0.0
2002      2.9   3.6         0.0    0.0
2001      2.4   1.7         0.0    0.0


Specific individual indexes (rows) can be returned by passing a list of index values.

In [78]:
df3.loc[[2003, 2001]]

State  Nevada  Ohio  California  Texas
Year
2003      0.0   0.0         0.0    0.0
2001      2.4   1.7         0.0    0.0


The second parameter to *loc* references columns.  Specifying both index and column information returns subsets of the original dataframe.

In [79]:
df3

State  Nevada  Ohio  California  Texas
Year
2004      0.0   0.0         0.0    0.0
2003      0.0   0.0         0.0    0.0
2002      2.9   3.6         0.0    0.0
2001      2.4   1.7         0.0    0.0
2000      NaN   1.5         0.0    0.0


In [80]:
df3.loc[2004:2002, 'Ohio']

Year
2004    0.0
2003    0.0
2002    3.6
Name: Ohio, dtype: float64

In [81]:
df3.loc[2004:2002, 'Ohio':'Texas']

State  Ohio  California  Texas
Year
2004    0.0         0.0    0.0
2003    0.0         0.0    0.0
2002    3.6         0.0    0.0


In [82]:
df3.loc[2004:2002, ['Ohio', 'Texas']]

State  Ohio  Texas
Year
2004    0.0    0.0
2003    0.0    0.0
2002    3.6    0.0


## Selecting Values from a Dataframe

Selection of data can be more sophisticated than simple slices and index/column references.

In [83]:
df3

State  Nevada  Ohio  California  Texas
Year
2004      0.0   0.0         0.0    0.0
2003      0.0   0.0         0.0    0.0
2002      2.9   3.6         0.0    0.0
2001      2.4   1.7         0.0    0.0
2000      NaN   1.5         0.0    0.0


In [84]:
df4 = df3.T

In [85]:
df4

Year        2004  2003  2002  2001  2000
State
Nevada       0.0   0.0   2.9   2.4   NaN
Ohio         0.0   0.0   3.6   1.7   1.5
California   0.0   0.0   0.0   0.0   0.0
Texas        0.0   0.0   0.0   0.0   0.0


As we have seen, we can reference a specific cell by its row and column key values using the *loc* indexer.

In [86]:
df4.loc['Nevada', 2000]

nan

You can also reference rows satisfying a boolean expression.  

Note the reference to the specific column 2002 from dataframe df4.  The comparison is applied to every value in that column, producing a boolean Series with one entry per row, and the rows where that entry is True are returned.  With Series, the series name was sufficient in boolean expressions, but with DataFrames, you specify the column to be used for the comparison when filtering rows.

In [87]:
df4[df4[2002] > 0]

Year    2004  2003  2002  2001  2000
State
Nevada   0.0   0.0   2.9   2.4   NaN
Ohio     0.0   0.0   3.6   1.7   1.5
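
The filtering step can be split in two to show what the expression inside the brackets actually is. The sketch below rebuilds a small frame shaped like df4 (the name `df` and the reduced set of rows and columns are assumptions made for illustration) and stores the boolean Series in a variable before using it.

```python
import numpy as np
import pandas as pd

# Small stand-in for df4: states as rows, years as columns (illustrative only)
df = pd.DataFrame(
    {2002: [2.9, 3.6, 0.0], 2000: [np.nan, 1.5, 0.0]},
    index=['Nevada', 'Ohio', 'California'],
)

mask = df[2002] > 0      # boolean Series, one entry per row
selected = df[mask]      # keeps only the rows where the mask is True

print(mask.tolist())              # [True, True, False]
print(selected.index.tolist())    # ['Nevada', 'Ohio']
```

Naming the mask can make longer conditions easier to read and reuse.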


Compound boolean conditions may be specified. 

Note that Pandas does ***not*** use Python boolean operators (e.g., **and** and **or**) to combine conditions element by element, but instead uses the operators **&** (and), **|** (or), and **~** (not).  

Be careful with evaluation precedence in these expressions: **&** and **|** bind more tightly than comparison operators such as **>**.  Use parentheses around each comparison to ensure the intended evaluation order.

In [88]:
df4

Year        2004  2003  2002  2001  2000
State
Nevada       0.0   0.0   2.9   2.4   NaN
Ohio         0.0   0.0   3.6   1.7   1.5
California   0.0   0.0   0.0   0.0   0.0
Texas        0.0   0.0   0.0   0.0   0.0


In [89]:
df4[(df4[2002] > 0) & (df4[2000] > 0)]

Year   2004  2003  2002  2001  2000
State
Ohio    0.0   0.0   3.6   1.7   1.5


In [90]:
df4[(df4[2002] > 0) | (df4[2000] > 0)]

Year    2004  2003  2002  2001  2000
State
Nevada   0.0   0.0   2.9   2.4   NaN
Ohio     0.0   0.0   3.6   1.7   1.5
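
The precedence warning is easiest to appreciate by leaving the parentheses out. The following sketch uses a small stand-in for df4 (the name `df` and its contents are illustrative assumptions) and shows the parenthesized form working while the unparenthesized form raises an exception.

```python
import numpy as np
import pandas as pd

# Small stand-in for df4 (illustrative only)
df = pd.DataFrame(
    {2002: [2.9, 3.6, 0.0], 2000: [np.nan, 1.5, 0.0]},
    index=['Nevada', 'Ohio', 'California'],
)

# Parenthesized: each comparison is evaluated first, then combined with &
both = df[(df[2002] > 0) & (df[2000] > 0)]
print(both.index.tolist())   # ['Ohio']

# Unparenthesized: & binds tighter than >, so Python reads this as
# df[2002] > (0 & df[2000]) > 0, which fails instead of filtering.
try:
    df[df[2002] > 0 & df[2000] > 0]
    failed = False
except (TypeError, ValueError):
    failed = True
print(failed)   # True
```

Nevada is excluded from the first result because a comparison against NaN evaluates to False.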


The *isnull* and *notnull* methods can be used to test for missing values.

In [91]:
df4

Year        2004  2003  2002  2001  2000
State
Nevada       0.0   0.0   2.9   2.4   NaN
Ohio         0.0   0.0   3.6   1.7   1.5
California   0.0   0.0   0.0   0.0   0.0
Texas        0.0   0.0   0.0   0.0   0.0


In [92]:
df4[(df4[2002] > 0) & (df4[2000].isnull())]

Year    2004  2003  2002  2001  2000
State
Nevada   0.0   0.0   2.9   2.4   NaN


In [93]:
df4[(df4[2002] > 0) & (df4[2000].notnull())]

Year   2004  2003  2002  2001  2000
State
Ohio    0.0   0.0   3.6   1.7   1.5


You can check the entire dataframe to see if any values are missing.

In [94]:
df4.isnull().values.any()

True
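
Beyond a single yes/no answer, it can be useful to know how many values are missing and where. As a minimal sketch on a made-up two-row frame (the name `df` and its contents are assumptions, not the notebook's data), summing the boolean result of *isnull* counts the missing cells per column.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame, for illustration only
df = pd.DataFrame(
    {2002: [2.9, 3.6], 2000: [np.nan, 1.5]},
    index=['Nevada', 'Ohio'],
)

has_missing = bool(df.isnull().values.any())   # one True/False for the whole frame
per_column = df.isnull().sum()                 # count of missing cells in each column
total = int(per_column.sum())                  # count of missing cells overall

print(has_missing)           # True
print(per_column.tolist())   # [0, 1]
print(total)                 # 1
```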

Methods may return a "truth" dataframe for the condition.  For example, the following expression returns a dataframe where the cell value is True if the cell has a missing value.

In [95]:
df4

Year        2004  2003  2002  2001  2000
State
Nevada       0.0   0.0   2.9   2.4   NaN
Ohio         0.0   0.0   3.6   1.7   1.5
California   0.0   0.0   0.0   0.0   0.0
Texas        0.0   0.0   0.0   0.0   0.0


In [96]:
df4.isnull()

Year         2004   2003   2002   2001   2000
State
Nevada      False  False  False  False   True
Ohio        False  False  False  False  False
California  False  False  False  False  False
Texas       False  False  False  False  False


Check to see if any row has any column that has a missing value.

In [98]:
df4.isnull().any(axis='columns')

State
Nevada         True
Ohio          False
California    False
Texas         False
dtype: bool

Rows that have at least one missing value return True.  Let's take a look at those rows.  In this case, the expression inside the brackets is a boolean Series, and all rows where it is True are selected.  We can use that to reference part of the dataframe.

In [None]:
df4[df4.isnull().any(axis=1)]

Explicitly set a cell to NaN.

In [99]:
df4

Year        2004  2003  2002  2001  2000
State
Nevada       0.0   0.0   2.9   2.4   NaN
Ohio         0.0   0.0   3.6   1.7   1.5
California   0.0   0.0   0.0   0.0   0.0
Texas        0.0   0.0   0.0   0.0   0.0


In [100]:
df4.loc['Texas', 2003] = np.nan

In [101]:
df4

Year        2004  2003  2002  2001  2000
State
Nevada       0.0   0.0   2.9   2.4   NaN
Ohio         0.0   0.0   3.6   1.7   1.5
California   0.0   0.0   0.0   0.0   0.0
Texas        0.0   NaN   0.0   0.0   0.0


Now find the rows where any column has a missing value.

In [None]:
df4[df4.isnull().any(axis=1)]

In [None]:
df4.isnull()

Replace all missing values with -1.  Note that since this is a direct assignment, it modifies the contents of the dataframe in place.

In [None]:
df4[df4.isnull()] = -1.0

In [None]:
df4
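
When the original frame should stay untouched, *fillna* is an alternative to direct assignment, since it returns a new DataFrame by default. The sketch below uses a made-up frame (the name `df` and its values are illustrative assumptions) to show that the source still holds its NaNs afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing cells, for illustration only
df = pd.DataFrame(
    {2003: [0.0, np.nan], 2000: [np.nan, 0.0]},
    index=['Nevada', 'Texas'],
)

filled = df.fillna(-1.0)    # returns a new DataFrame; df itself is not modified

print(filled.loc['Nevada', 2000])      # -1.0
print(int(df.isnull().sum().sum()))    # 2, the original still has its NaNs
```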

Put the NaNs back!  (This works here only because the data contains no other negative values.)

In [None]:
df4[df4 < 0] = np.nan

In [None]:
df4

Use *loc* to combine a condition on indexes with column selection.

In [None]:
df4.loc[df4.isnull().any(axis='columns'), [2004, 2002]]

In [None]:
df4.loc[df4[2002] > 0, [2004, 2002]]

Check to see if a cell value is within a particular set of values.

In [None]:
df4[df4.isin([0.0, np.nan]).any(axis='columns')]

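Without a condition, *any* tests the values themselves: 0 counts as False, any other number counts as True, and NaN is skipped by default.  The result has one boolean per column, or one per row when axis='columns' is given.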

In [None]:
df4

In [None]:
df4.any()

In [None]:
df4.any(axis='columns')
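
As a minimal sketch of how *any* treats zeros and missing values, consider the made-up frame below (the name `df` and its contents are assumptions for illustration). It relies on the default behaviour of skipping NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing zeros, NaN, and one non-zero value
df = pd.DataFrame(
    {2002: [2.9, 0.0, 0.0], 2000: [np.nan, 0.0, np.nan]},
    index=['Nevada', 'California', 'Texas'],
)

per_column = df.any()               # one boolean per column
per_row = df.any(axis='columns')    # one boolean per row

print(per_column.tolist())   # [True, False]
print(per_row.tolist())      # [True, False, False]
```

Column 2000 and the Texas row come out False because they contain only zeros and NaN.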

That boolean result can be used to select from the dataframe.

In [None]:
df4[df4.any(axis='columns')]

In [None]:
df4[df4[2002] > 2.0]

In [None]:
df4.columns

In [None]:
df4[df4[df4.columns] > 2.0]

Find values satisfying a condition in any column...

In [None]:
df4[(df4 > 2.0).any(axis='columns')]

In [None]:
df4[(df4 > 3.0).any(axis='columns')]
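
The last few cells use two different kinds of selection, and the difference is easy to miss. The sketch below, on a made-up frame (the name `df` and its values are illustrative assumptions), contrasts indexing with a boolean DataFrame, which keeps the shape and blanks out non-matching cells, with indexing by a per-row boolean Series, which drops whole rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for part of df4
df = pd.DataFrame(
    {2002: [2.9, 3.6, 0.0], 2001: [2.4, 1.7, 0.0]},
    index=['Nevada', 'Ohio', 'California'],
)

masked = df[df > 2.0]                         # same shape; non-matching cells become NaN
rows = df[(df > 2.0).any(axis='columns')]     # only rows with at least one match

print(masked.shape)                        # (3, 2)
print(int(masked.isnull().sum().sum()))    # 3
print(rows.index.tolist())                 # ['Nevada', 'Ohio']
```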