# [AHA! Activity Health Analytics](http://casas.wsu.edu/)
[Center for Advanced Studies of Adaptive Systems (CASAS)](http://casas.wsu.edu/)

[Washington State University](https://wsu.edu)
# L7 Pandas DataFrame: Part 2

## Learner Objectives
At the conclusion of this lesson, participants should have an understanding of:
* The Pandas library
* Working with Pandas `DataFrame` objects

## Acknowledgments
Content used in this lesson is based upon information in the following sources:
* [Pandas website](http://pandas.pydata.org/)
* Python for Data Analysis by Wes McKinney

## Pandas DataFrame Continued

In [1]:
import pandas as pd

### Indexing
From the [Pandas website](http://pandas.pydata.org/), the basics of indexing are as follows:

|Operation|Syntax|Result|
|-|-|-|
|Select column|`df[col]`|`Series`|
|Select row by label|`df.loc[label]`|`Series`|
|Select row by integer location|`df.iloc[loc]`|`Series`|
|Slice rows|`df[5:10]`|`DataFrame`|
|Select rows by boolean vector|`df[bool_vec]`|`DataFrame`|

In [15]:
import numpy as np

# 3x4 array of samples from the standard normal distribution
rand_data = np.random.randn(3, 4)
rand_df = pd.DataFrame(rand_data, index=["a", "b", "c"], columns=["col1", "col2", "col3", "col4"])
print(rand_df)

# row indexing by label
print(rand_df.loc["b"])
# row indexing by location
print(rand_df.iloc[1])
# row slicing by location
print(rand_df[0:2])

       col1      col2      col3      col4
a -1.294607 -1.335656 -0.818923  2.005748
b -0.952310  0.221592 -0.609447  2.572125
c -0.357690 -0.057465  0.177202  1.420419
col1   -0.952310
col2    0.221592
col3   -0.609447
col4    2.572125
Name: b, dtype: float64
col1   -0.952310
col2    0.221592
col3   -0.609447
col4    2.572125
Name: b, dtype: float64
       col1      col2      col3      col4
a -1.294607 -1.335656 -0.818923  2.005748
b -0.952310  0.221592 -0.609447  2.572125
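
The session above covers row selection. The remaining two table entries, column selection and boolean-vector selection, might look like the following minimal sketch. The frame `grid_df` and its values are made up for illustration (fixed integers rather than random numbers, so the result is reproducible).

```python
import numpy as np
import pandas as pd

# Hypothetical 3x4 frame holding the integers 0..11 (not the lesson's data)
grid_df = pd.DataFrame(np.arange(12).reshape(3, 4),
                       index=["a", "b", "c"],
                       columns=["col1", "col2", "col3", "col4"])

# column selection returns a Series
col1 = grid_df["col1"]
print(type(col1))

# a boolean vector (one True/False per row) keeps only the True rows
bool_vec = grid_df["col1"] > 0
positive_rows = grid_df[bool_vec]
print(positive_rows)
```

Here only rows `b` and `c` survive the filter, because row `a` holds a 0 in `col1`.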


### Combining `DataFrame`s
Pandas supports many ways to combine `DataFrame`s together, including merging, joining, and concatenating. For simplicity, we will focus on concatenation with the `concat` function in the main Pandas namespace. 

Suppose we have three `DataFrame`s with the same column labels that we want to combine into a single `DataFrame`. We can use `pd.concat(<list of DataFrames>)` to combine them. The following example is from the Pandas documentation on [merging](http://pandas.pydata.org/pandas-docs/stable/merging.html):

In [16]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']},
index=[8, 9, 10, 11])

frames = [df1, df2, df3]
result = pd.concat(frames)
print(result.tail(2))
# help() prints its own text and returns None, so it is not wrapped in print()
help(result.tail)

      A    B    C    D
10  A10  B10  C10  D10
11  A11  B11  C11  D11
Help on method tail in module pandas.core.generic:

tail(n=5) method of pandas.core.frame.DataFrame instance
    Returns last n rows


The resulting `DataFrame` is `df3` concatenated to the end of `df2`, which is concatenated to the end of `df1`:
![](http://pandas.pydata.org/pandas-docs/stable/_images/merging_concat_basic.png)
(image from [http://pandas.pydata.org/pandas-docs/stable/_images/merging_concat_basic.png](http://pandas.pydata.org/pandas-docs/stable/_images/merging_concat_basic.png))
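
In the example above the three frames have distinct index labels (0 through 11), so the combined index has no repeats. As a hedged sketch of what happens otherwise, the hypothetical frames `left` and `right` below both use the default labels 0 and 1. By default `pd.concat` keeps the original labels, and the `ignore_index=True` keyword renumbers the rows instead.

```python
import pandas as pd

# Two hypothetical frames whose default indexes overlap (both are 0, 1)
left = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
right = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})

# By default, concat keeps the original row labels, so 0 and 1 appear twice
stacked = pd.concat([left, right])
print(stacked.index.tolist())

# ignore_index=True discards the old labels and renumbers the rows
renumbered = pd.concat([left, right], ignore_index=True)
print(renumbered.index.tolist())
```

Which behavior is preferable depends on whether the row labels carry meaning or are just positions.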

### Viewing Large `DataFrame`s
In this class we will be working with some big `DataFrame`s. When a large `DataFrame` is printed, Pandas condenses the output and uses `...` to mark the omitted rows and columns. There are also object methods to view shortened or summarized `DataFrame` information:
* `describe()`: Generate various summary statistics, excluding NaN values
* `head(n=5)`: Returns first `n` rows 
* `tail(n=5)`: Returns the last `n` rows
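
To see the condensed output and the shortening methods on a frame that is actually large, here is a small sketch. The frame `big_df` is hypothetical, and `to_string(max_rows=...)` is used only to make the row limit explicit rather than relying on the display settings of a particular environment.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 1000 rows, far too many to print in full
big_df = pd.DataFrame({"value": np.arange(1000)})

# to_string(max_rows=...) renders at most that many rows, eliding the middle
condensed = big_df.to_string(max_rows=10)
print(condensed)

# head() and tail() return ordinary, smaller DataFrames
first_rows = big_df.head(3)
last_rows = big_df.tail(3)
print(first_rows)
print(last_rows)
```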

In [17]:
print(result.describe())
print("\n")
print(result.head(n=2))
print("\n")
print(result.tail(n=2))

         A   B    C   D
count   12  12   12  12
unique  12  12   12  12
top     A4  B5  C10  D4
freq     1   1    1   1


    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1


      A    B    C    D
10  A10  B10  C10  D10
11  A11  B11  C11  D11
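
Because `result` holds strings, `describe()` reports `count`, `unique`, `top`, and `freq`. For numeric columns it reports a different set of statistics, and NaN values are left out of them. The following is a minimal sketch with a made-up frame, `numeric_df`:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame; column x contains one missing value
numeric_df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0],
                           "y": [10.0, 20.0, 30.0, 40.0]})

# For numeric data, describe() reports count, mean, std, min, quartiles, max
summary = numeric_df.describe()
print(summary)

# The NaN in x is excluded, so x has a count of 3 while y has a count of 4
print(summary.loc["count"])
```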


### File I/O
With Pandas, we can easily write our data frames out to a csv (comma separated value) file to save for later use after our program terminates. The `DataFrame` method [`to_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html) writes a data frame to a csv file. The rows and columns of the data frame become the rows and columns of the csv file.

For example, suppose we want to save the `result` data frame from the concatenation example to a file. We can do this with a single method call:

In [22]:
# forward slashes work on Windows, macOS, and Linux; the files folder must already exist
fname = "files/to_csv_example_df.csv"
result.to_csv(fname)

If we open [to_csv_example_df.csv](https://raw.githubusercontent.com/gsprint23/aha/master/lessons/files/to_csv_example_df.csv) with Microsoft Excel, we see the following table:
<img src="https://raw.githubusercontent.com/gsprint23/aha/master/lessons/figures/to_csv_example_df.png" width="400">
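
To inspect the csv text without opening a spreadsheet program, one option is to call `to_csv()` with no path, in which case it returns the text as a string. The sketch below uses a hypothetical two-row frame, `small_df`, in place of `result`, and also shows the `index=False` keyword, which leaves the row labels out.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the larger `result` frame
small_df = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})

# With no path argument, to_csv() returns the csv text instead of writing a file
csv_text = small_df.to_csv()
print(csv_text)

# index=False leaves the row labels out of the output
csv_no_index = small_df.to_csv(index=False)
print(csv_no_index)
```

In the first string the header line begins with a comma: the index is written as the first column, and that column has no name.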

We can also load data from a csv file into a data frame. To do this, we use the [`read_csv()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) Pandas function:

In [24]:
df = pd.read_csv(fname)
print(df)
print(df.columns)

    Unnamed: 0    A    B    C    D
0            0   A0   B0   C0   D0
1            1   A1   B1   C1   D1
2            2   A2   B2   C2   D2
3            3   A3   B3   C3   D3
4            4   A4   B4   C4   D4
5            5   A5   B5   C5   D5
6            6   A6   B6   C6   D6
7            7   A7   B7   C7   D7
8            8   A8   B8   C8   D8
9            9   A9   B9   C9   D9
10          10  A10  B10  C10  D10
11          11  A11  B11  C11  D11
Index(['Unnamed: 0', 'A', 'B', 'C', 'D'], dtype='object')


However, we see some less-than-desirable output. The first column in the csv file is our index, but Pandas reads it as an ordinary column named "Unnamed: 0" and assigns the data frame a new index. We can explicitly tell Pandas that the first column is the index with the keyword `index_col`. It is also good practice to explicitly tell Pandas that the first row is our header row and contains the column labels. We can do this with the keyword `header`.

In [25]:
# another attempt at reading in the csv data
df = pd.read_csv(fname, index_col=0, header=0)
print(df)
print(df.columns)

      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11
Index(['A', 'B', 'C', 'D'], dtype='object')
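
The write-then-read round trip can also be checked in a self-contained way. The sketch below is an illustration under a few assumptions: the frame `original` and the file name `round_trip.csv` are made up, and a temporary directory is used so that nothing is left on disk.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame with a non-default index so the round trip is visible
original = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[10, 11])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Path's / operator builds a path that is valid on any operating system
    csv_path = Path(tmp_dir) / "round_trip.csv"
    original.to_csv(csv_path)
    # index_col=0 restores the first csv column as the index
    restored = pd.read_csv(csv_path, index_col=0, header=0)

print(restored)
print(restored.index.tolist() == original.index.tolist())
```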


## `Panel`
From the [Pandas website](http://pandas.pydata.org/):
>Panel is a somewhat less-used, but still important container for 3-dimensional data. The term panel data is derived from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s. The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data and, in particular, econometric analysis of panel data. However, for the strict purposes of slicing and dicing a collection of DataFrame objects, you may find the axis names slightly arbitrary:
* items: axis 0, each item corresponds to a DataFrame contained inside
* major_axis: axis 1, it is the index (rows) of each of the DataFrames
* minor_axis: axis 2, it is the columns of each of the DataFrames

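We will not officially cover `Panel`s at this point in the course. You are welcome to read up on them if you would like, but note that `Panel` was deprecated in Pandas 0.20 and removed in Pandas 0.25, so the description above applies only to older versions.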
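
`Panel` was removed in Pandas 0.25, so the quoted description cannot be tried directly in recent versions. One possible way to keep a collection of `DataFrame` objects together is a single `DataFrame` with a two-level row index. The sketch below is only an illustration of that idea, with hypothetical frames `frame_a` and `frame_b`; it is not the `Panel` API.

```python
import pandas as pd

# Two hypothetical frames with the same rows and columns
frame_a = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r0", "r1"])
frame_b = pd.DataFrame({"x": [5, 6], "y": [7, 8]}, index=["r0", "r1"])

# Passing a dict to concat uses its keys as an outer index level,
# giving a two-level (MultiIndex) row index: (item, row)
stacked = pd.concat({"first": frame_a, "second": frame_b})
print(stacked)

# Selecting on the outer level recovers one of the original frames
second = stacked.loc["second"]
print(second)
```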

## Summary
We have covered quite a bit of information on Pandas, but we have only scratched the surface! I highly encourage you to read *Python for Data Analysis* by Wes McKinney and practice working with `Series` and `DataFrame` objects. For the rest of the class, we will learn new Pandas functionality as we go. 