# [AHA! Activity Health Analytics](http://casas.wsu.edu/)
[Center for Advanced Studies of Adaptive Systems (CASAS)](http://casas.wsu.edu/)

[Washington State University](https://wsu.edu)
# L6 Pandas DataFrame: Part 1

## Learner Objectives
At the conclusion of this lesson, participants should have an understanding of:
* The Pandas library
* Working with Pandas `DataFrame` objects

## Acknowledgments
Content used in this lesson is based upon information in the following sources:
* [Pandas website](http://pandas.pydata.org/)
* Python for Data Analysis by Wes McKinney

## `DataFrame`
`DataFrame` is a two-dimensional labeled data structure. Like a `Series`, a `DataFrame` has an index. Each column label *maps* to a `Series`, and all of the columns share the same index. You can think of a `DataFrame` like an Excel spreadsheet, a SQL table, or a dict of `Series` objects. The index labels the rows and each `Series` represents a column.
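
To make the dict-of-`Series` picture concrete, here is a minimal sketch. The frame `small_df` and its values are made up for illustration and are not part of the lesson's data:

```python
import pandas as pd

# Hypothetical data: two columns that share the index ["r1", "r2"]
small_df = pd.DataFrame({"x": [1, 2], "y": [3.5, 4.5]}, index=["r1", "r2"])

print(small_df.index)    # row labels
print(small_df.columns)  # column labels

# Selecting one column returns a Series that carries the same index
x_col = small_df["x"]
print(type(x_col))
print(x_col.index.equals(small_df.index))
```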

Like `Series`, `DataFrame` accepts many different kinds of input:
* Dictionary of 1-D array-like objects (`ndarrays`, lists, dictionaries, or `Series`)
* 2-D `ndarray`
* Structured or record `ndarray`
* A `Series`
* Another `DataFrame`
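
As a rough sketch of the input kinds listed above, the snippet below builds one small `DataFrame` from each. All variable names and values are hypothetical and chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Dictionary of 1-D array-likes: the keys become column labels
from_dict = pd.DataFrame({"a": [1, 2], "b": np.array([3.0, 4.0])})

# 2-D ndarray: rows and columns get default 0-based labels
from_2d = pd.DataFrame(np.zeros((2, 3)))

# Structured ndarray: the field names become column labels
rec = np.array([(1, 2.5), (2, 3.5)], dtype=[("id", "i4"), ("score", "f8")])
from_rec = pd.DataFrame(rec)

# A named Series: becomes a single column labeled with the Series name
ser = pd.Series([10, 20], index=["x", "y"], name="value")
from_ser = pd.DataFrame(ser)

# Another DataFrame
from_df = pd.DataFrame(from_dict)

for frame in (from_dict, from_2d, from_rec, from_ser, from_df):
    print(frame, end="\n\n")
```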

### `DataFrame` from Lists
Let's expand our Washington city population `Series` example. Suppose we want to store the four most populous cities in Washington, Idaho, and Oregon. Let's declare one list per state to store this new information. Then we will create a `DataFrame` from the nested list to represent the cities of all three states:

In [7]:
import pandas as pd

washington = ["Seattle", "Spokane", "Tacoma", "Vancouver"]
idaho = ["Boise", "Nampa", "Meridian", "Idaho Falls"]
oregon = ["Portland", "Eugene", "Salem", "Gresham"]
pops = [washington, idaho, oregon]
df = pd.DataFrame(pops)
print(df)

          0        1         2            3
0   Seattle  Spokane    Tacoma    Vancouver
1     Boise    Nampa  Meridian  Idaho Falls
2  Portland   Eugene     Salem      Gresham


Pandas stacks the nested list into a 2-dimensional `DataFrame`. By default, the index and columns are labeled as 0-based indices. Instead, we want to provide labels to help with indexing later:

In [8]:
import numpy as np
df = pd.DataFrame(pops, index=["WA", "ID", "OR"], columns=np.arange(1, len(washington) + 1))
print("Population DataFrame #1")
print(df)

Population DataFrame #1
           1        2         3            4
WA   Seattle  Spokane    Tacoma    Vancouver
ID     Boise    Nampa  Meridian  Idaho Falls
OR  Portland   Eugene     Salem      Gresham
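
With labels in place, values can be looked up by row and column label. As a short preview, the sketch below uses the `.loc` indexer, which this lesson has not introduced at this point. The frame is rebuilt so the example is self-contained:

```python
import numpy as np
import pandas as pd

washington = ["Seattle", "Spokane", "Tacoma", "Vancouver"]
idaho = ["Boise", "Nampa", "Meridian", "Idaho Falls"]
oregon = ["Portland", "Eugene", "Salem", "Gresham"]
df = pd.DataFrame(
    [washington, idaho, oregon],
    index=["WA", "ID", "OR"],
    columns=np.arange(1, len(washington) + 1),
)

# Row "WA", column 1: the first listed Washington city
print(df.loc["WA", 1])
# An entire row comes back as a Series indexed by the column labels
print(df.loc["ID"])
```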


### `DataFrame` from Dictionaries
Let's rework the above example to build the `DataFrame` from a dictionary of lists. This can be useful because the dictionary keys will be used as the `DataFrame` column labels:

In [9]:
pops_dict = {"WA": washington, "ID": idaho, "OR": oregon}
df2 = pd.DataFrame(pops_dict)
print(df2)

            ID        OR         WA
0        Boise  Portland    Seattle
1        Nampa    Eugene    Spokane
2     Meridian     Salem     Tacoma
3  Idaho Falls   Gresham  Vancouver
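
The left-to-right column order for dictionary input can differ between pandas versions. Older releases sorted the keys, which is what the output above shows, while newer ones generally keep the dictionary's insertion order. If a specific order matters, one option is to request it explicitly with the `columns` argument. A minimal sketch, reusing the lesson's data (`df2_ordered` is a name introduced here):

```python
import pandas as pd

washington = ["Seattle", "Spokane", "Tacoma", "Vancouver"]
idaho = ["Boise", "Nampa", "Meridian", "Idaho Falls"]
oregon = ["Portland", "Eugene", "Salem", "Gresham"]
pops_dict = {"WA": washington, "ID": idaho, "OR": oregon}

# `columns` selects which dictionary keys to use and fixes their order
df2_ordered = pd.DataFrame(pops_dict, columns=["WA", "ID", "OR"])
print(df2_ordered)
```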


We can then update the index to start at 1:

In [10]:
df2.index += 1
print("Population DataFrame #2")
print(df2)

Population DataFrame #2
            ID        OR         WA
1        Boise  Portland    Seattle
2        Nampa    Eugene    Spokane
3     Meridian     Salem     Tacoma
4  Idaho Falls   Gresham  Vancouver
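
An alternative to shifting the index after the fact is to supply the row labels when the `DataFrame` is created. This is only a sketch of that option (`df2_alt` is a name introduced here), not a replacement for the approach above:

```python
import pandas as pd

washington = ["Seattle", "Spokane", "Tacoma", "Vancouver"]
idaho = ["Boise", "Nampa", "Meridian", "Idaho Falls"]
oregon = ["Portland", "Eugene", "Salem", "Gresham"]
pops_dict = {"WA": washington, "ID": idaho, "OR": oregon}

# For a dictionary of lists, `index` labels the rows; its length must match the lists
df2_alt = pd.DataFrame(pops_dict, index=range(1, len(washington) + 1))
print(df2_alt)
```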


Now, `df` (Population `DataFrame` #1) and `df2` (Population `DataFrame` #2) are transposes of each other. After sorting both by row label, an elementwise comparison confirms it:

In [11]:
df2T = df2.T # transpose
# re-order
df = df.sort_index()
df2T = df2T.sort_index()
print(df)
print(df2T)
print(df == df2T)

           1        2         3            4
ID     Boise    Nampa  Meridian  Idaho Falls
OR  Portland   Eugene     Salem      Gresham
WA   Seattle  Spokane    Tacoma    Vancouver
           1        2         3            4
ID     Boise    Nampa  Meridian  Idaho Falls
OR  Portland   Eugene     Salem      Gresham
WA   Seattle  Spokane    Tacoma    Vancouver
       1     2     3     4
ID  True  True  True  True
OR  True  True  True  True
WA  True  True  True  True
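
For elementwise `==`, pandas expects both frames to be identically labeled, which is why the cell above sorts both indexes first. To reduce the grid of booleans to a single answer, one option is to chain `.all()` twice. A small sketch with a hypothetical two-state subset of the data:

```python
import pandas as pd

by_state = pd.DataFrame(
    [["Seattle", "Spokane"], ["Boise", "Nampa"]],
    index=["WA", "ID"],
    columns=[1, 2],
)
by_rank = pd.DataFrame(
    {"WA": ["Seattle", "Spokane"], "ID": ["Boise", "Nampa"]},
    index=[1, 2],
)

# Sort both so the row labels line up before comparing elementwise
left = by_state.sort_index()
right = by_rank.T.sort_index()

# The first .all() collapses each column; the second collapses that result to one value
same = bool((left == right).all().all())
print(same)
```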


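What happens if the dictionaries used to create a `DataFrame` do not have the same keys? Just like with `Series`, the `DataFrame` index will be the union of the keys. Any cell with no value for its column is filled with `NaN`, which is a floating-point value, so the integer populations are displayed as floats: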

In [12]:
washington = {"Seattle": 652405, "Spokane": 210721, "Bellevue": 133992, "Leavenworth": 1992}
idaho = {"Boise": 205671, "Nampa": 81557, "Coeur d'Alene": 44137, "Moscow": 23800}
oregon = {"Portland": 583776, "Eugene": 156185, "Hillsboro": 91611, "Corvallis": 54462}
pops = {"WA": washington, "ID": idaho, "OR": oregon}
df = pd.DataFrame(pops)
print(df)

                     ID        OR        WA
Bellevue            NaN       NaN  133992.0
Boise          205671.0       NaN       NaN
Coeur d'Alene   44137.0       NaN       NaN
Corvallis           NaN   54462.0       NaN
Eugene              NaN  156185.0       NaN
Hillsboro           NaN   91611.0       NaN
Leavenworth         NaN       NaN    1992.0
Moscow          23800.0       NaN       NaN
Nampa           81557.0       NaN       NaN
Portland            NaN  583776.0       NaN
Seattle             NaN       NaN  652405.0
Spokane             NaN       NaN  210721.0
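
Because every city appears under only one state, most cells above are missing. The sketch below uses a hypothetical two-city subset of the data to count the missing cells and show one way to fill them. Using 0 as the fill value is an assumption made only for illustration:

```python
import pandas as pd

pops_small = {"WA": {"Seattle": 652405}, "ID": {"Boise": 205671}}
df_small = pd.DataFrame(pops_small)
print(df_small)

# Count the missing values in each column
print(df_small.isna().sum())

# Replace NaN with 0, then convert the columns back to integers
filled = df_small.fillna(0).astype(int)
print(filled)
```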


### `DataFrame` from `ndarray`
As another example, let's create a `DataFrame` from random data stored in an `ndarray`:

In [13]:
from numpy.random import randn
rand_data = randn(3, 4)
rand_df = pd.DataFrame(rand_data, index=["a", "b", "c"], columns=["col1", "col2", "col3", "col4"])
print(rand_df)

       col1      col2      col3      col4
a  1.234266  2.095078 -0.096673  0.383125
b  1.406528 -0.556563  0.620373 -0.026869
c  0.375379  0.930456 -0.251887  0.328747
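
Since `randn` draws new values on every run, the numbers above will differ each time the cell executes. If repeatable output is needed, one option (not used in the lesson itself) is NumPy's `default_rng` generator with a fixed seed. The seed value 0 and the names `seeded_data` and `seeded_df` are arbitrary choices for this sketch:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
seeded_data = rng.standard_normal((3, 4))
seeded_df = pd.DataFrame(
    seeded_data,
    index=["a", "b", "c"],
    columns=["col1", "col2", "col3", "col4"],
)
print(seeded_df)

# A fresh generator with the same seed reproduces the same draws
repeat = np.random.default_rng(0).standard_normal((3, 4))
print(np.array_equal(seeded_data, repeat))
```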


### Working with Columns
You can treat a `DataFrame` semantically like a dictionary of like-indexed `Series` objects. Getting, setting, and deleting columns works with the same syntax as the analogous dictionary operations:
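
The dictionary analogy also covers membership tests and iteration, both of which operate on column labels. A minimal sketch with a hypothetical frame `demo_df`:

```python
import pandas as pd

demo_df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

print("col1" in demo_df)     # membership checks the column labels
print(list(demo_df))         # iterating yields the column labels
print(list(demo_df.keys()))  # keys() returns the column labels

# items() yields (label, Series) pairs, like dict.items()
for label, column in demo_df.items():
    print(label, column.sum())
```

The cell that follows applies the get, set, and delete operations to random data: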

In [14]:
rand_data = randn(3, 4)
rand_df = pd.DataFrame(rand_data, index=["a", "b", "c"], columns=["col1", "col2", "col3", "col4"])
print(rand_df)

# index column
print(rand_df["col2"])
# update column
rand_df["col4"] = 100 # 100 is propagated to fill the column
print(rand_df)
# add columns (inserted at end)
rand_df["col5"] = rand_df["col1"] > rand_df["col2"]
print(rand_df)
rand_df["sum"] = rand_df.sum(axis="columns")
print(rand_df)
# add columns at location
rand_df.insert(2, "ones", 1)
print(rand_df)
# delete columns
del rand_df["col5"]
print(rand_df)
sum_ser = rand_df.pop("sum")
print(rand_df)
print("Popped column is a Series:")
print(sum_ser)

       col1      col2      col3      col4
a  0.035687  2.499510  0.345747  0.397402
b  0.439195  0.602978 -0.546295 -0.656553
c  2.730834 -1.801815 -0.802332 -0.083538
a    2.499510
b    0.602978
c   -1.801815
Name: col2, dtype: float64
       col1      col2      col3  col4
a  0.035687  2.499510  0.345747   100
b  0.439195  0.602978 -0.546295   100
c  2.730834 -1.801815 -0.802332   100
       col1      col2      col3  col4   col5
a  0.035687  2.499510  0.345747   100  False
b  0.439195  0.602978 -0.546295   100  False
c  2.730834 -1.801815 -0.802332   100   True
       col1      col2      col3  col4   col5         sum
a  0.035687  2.499510  0.345747   100  False  102.880943
b  0.439195  0.602978 -0.546295   100  False  100.495879
c  2.730834 -1.801815 -0.802332   100   True  101.126687
       col1      col2  ones      col3  col4   col5         sum
a  0.035687  2.499510     1  0.345747   100  False  102.880943
b  0.439195  0.602978     1 -0.546295   100  False  100.495879
c  2.730834 -1