# The Pandas Library

Pandas is a library for organizing and processing data.  It is built around one main structure, the `DataFrame`, which can be described as a "dictionary of dictionaries": each column name maps to a set of values keyed by row label.  Earlier examples showed that dictionaries can be nested; Pandas takes this a step further and provides tools to organize, manipulate, combine, compare, and store these multi-layer structures.  Pandas also includes built-in Excel spreadsheet export, so you can move your data into Excel and share it more easily with collaborators.
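To make the "dictionary of dictionaries" description concrete, the sketch below builds a small dataframe from a nested dictionary and converts it back with `to_dict()`.  The names `nested` and `people` are illustrative only and are not used elsewhere in this section.

```python
import pandas as pd

# Outer keys are column names; inner keys are row labels.
nested = {
    "Age": {0: 40, 1: 45},
    "Team": {0: "Blue", 1: "Red"},
}

people = pd.DataFrame(nested)

# to_dict() returns the same column -> {row label -> value} layout.
round_trip = people.to_dict()
print(round_trip)
```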

Let's look at an example of a simple dataframe with just a few rows and columns.

In [44]:
import pandas as pd

df = pd.DataFrame([{"First Name":"Zee","Age":40,"Team":"Blue"},
                   {"First Name":"Charlotte","Age":45,"Team":"Red"},
                   {"First Name":"Wilbur","Age":50,"Team":"Green"}])

In [45]:
display(df)

|   | First Name | Age | Team |
| --- | --- | --- | --- |
| 0 | Zee | 40 | Blue |
| 1 | Charlotte | 45 | Red |
| 2 | Wilbur | 50 | Green |


In [46]:
print(df)

  First Name  Age   Team
0        Zee   40   Blue
1  Charlotte   45    Red
2     Wilbur   50  Green


Notice the difference between `display` and `print` in the example above: `display` renders the dataframe as a formatted table, while `print` writes a plain-text version.  Note that `display` is provided by IPython and is only available automatically in notebook environments.  Calling it in a plain Python terminal session raises a `NameError`.
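If the same code has to run both in a notebook and in a plain terminal, one possible approach is to fall back to `print` when IPython is not installed.  This is only a sketch: the alias `show` and the dataframe `df_demo` are names made up for this example.

```python
import pandas as pd

try:
    # IPython ships the display function used by notebooks.
    from IPython.display import display as show
except ImportError:
    # No IPython available: fall back to plain-text output.
    show = print

df_demo = pd.DataFrame([{"First Name": "Zee", "Age": 40, "Team": "Blue"}])
show(df_demo)
```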

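You can add to an existing dataframe by appending new rows.  Each row is first built as a dictionary, wrapped in a one-row dataframe, and then joined to the original with `pd.concat`.


In [47]:
newrow = {"First Name":"Iroh","Age":99,"Team":"Red"}
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
df = pd.concat([df, pd.DataFrame([newrow])], ignore_index=True)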

In [48]:
display(df)

|   | First Name | Age | Team |
| --- | --- | --- | --- |
| 0 | Zee | 40 | Blue |
| 1 | Charlotte | 45 | Red |
| 2 | Wilbur | 50 | Green |
| 3 | Iroh | 99 | Red |
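When several rows need to be added, one common pattern is to collect the dictionaries in a list and combine them in a single `pd.concat` call, since every call builds a whole new dataframe.  A minimal sketch, using the hypothetical names `roster` and `new_rows`:

```python
import pandas as pd

roster = pd.DataFrame([{"First Name": "Zee", "Age": 40, "Team": "Blue"}])

# Collect all new rows first, then combine them in one step.
new_rows = [
    {"First Name": "Charlotte", "Age": 45, "Team": "Red"},
    {"First Name": "Wilbur", "Age": 50, "Team": "Green"},
]

# ignore_index=True renumbers the row labels 0, 1, 2, ...
roster = pd.concat([roster, pd.DataFrame(new_rows)], ignore_index=True)
print(roster)
```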


Notice that we had to reassign the dataframe when we added the new row.  This is because most dataframe operations return a *new* dataframe object rather than modifying the original.  This behavior ensures that dataframes are not overwritten by accident and that data is not lost unintentionally.  Also, we now have two rows that share a value in the "Team" column.  We can sort the rows by a specific column, which places rows with the same value next to each other.  We can also sort by multiple columns for finer-grained organization.
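The "returns a new object" behavior can be checked directly.  The sketch below uses a separate, made-up dataframe named `scores` so that it does not interfere with `df`.

```python
import pandas as pd

scores = pd.DataFrame({"Name": ["Zee", "Charlotte", "Wilbur"],
                       "Age": [40, 45, 50]})

# sort_values returns a new dataframe; `scores` itself is left unchanged.
oldest_first = scores.sort_values("Age", ascending=False)

print(list(scores["Age"]))        # original row order
print(list(oldest_first["Age"]))  # sorted copy
print(oldest_first is scores)     # False: two distinct objects
```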

In [49]:
df2 = df.sort_values("Team",ascending=True)
display(df2)

|   | First Name | Age | Team |
| --- | --- | --- | --- |
| 0 | Zee | 40 | Blue |
| 2 | Wilbur | 50 | Green |
| 1 | Charlotte | 45 | Red |
| 3 | Iroh | 99 | Red |


In [50]:
df3 = df.sort_values(["Team","Age"],ascending=[True,False])
display(df3)

|   | First Name | Age | Team |
| --- | --- | --- | --- |
| 0 | Zee | 40 | Blue |
| 2 | Wilbur | 50 | Green |
| 3 | Iroh | 99 | Red |
| 1 | Charlotte | 45 | Red |


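In the first cell, we sort only by the column "Team" and put the results in ascending order.  Pandas sorts a column according to its data type: numeric columns are sorted numerically, while string columns are sorted alphabetically (character by character, with uppercase letters before lowercase ones).  Numbers stored as strings are therefore sorted as text, not by value.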

In the second cell, we sort by a list of columns: the whole dataframe is sorted by the first column in the list, rows that tie on that column are then sorted by the second, and so on.  The second list, passed as `ascending`, sets the ascending/descending order for each of those columns in turn.
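The difference between alphabetical and numerical sorting is easiest to see on a small example.  The sketch below uses a made-up dataframe `items` and the `key` argument of `sort_values`, which assumes pandas 1.1 or newer.

```python
import pandas as pd

items = pd.DataFrame({
    "Label": ["apple", "Banana", "cherry"],
    "Code": ["10", "9", "100"],  # numbers stored as strings
})

# Default string sort: uppercase letters come before lowercase ones.
by_label = items.sort_values("Label")

# Case-insensitive sort: compare lowercased copies of the labels.
by_label_ci = items.sort_values("Label", key=lambda col: col.str.lower())

# Strings that look like numbers are still compared as text.
by_code_text = items.sort_values("Code")

# Converting to integers inside `key` sorts them by value instead.
by_code_num = items.sort_values("Code", key=lambda col: col.astype(int))

print(by_label["Label"].tolist())
print(by_label_ci["Label"].tolist())
print(by_code_text["Code"].tolist())
print(by_code_num["Code"].tolist())
```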

What if we wanted just the values in one column?  We can index the dataframe as we would a dictionary, using square brackets and a column name.  This returns a pandas `Series`; its `.values` attribute gives the underlying NumPy array.


In [51]:
print(df["Age"])

0    40
1    45
2    50
3    99
Name: Age, dtype: int64


In [52]:
print(df["Age"].values)

[40 45 50 99]
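Building on the column selection above, the following sketch shows two related forms: passing a list of column names to get a smaller dataframe, and calling `tolist()` to get a plain Python list.  The dataframe `df_demo` is a stand-in created just for this example.

```python
import pandas as pd

df_demo = pd.DataFrame([
    {"First Name": "Zee", "Age": 40, "Team": "Blue"},
    {"First Name": "Charlotte", "Age": 45, "Team": "Red"},
])

ages = df_demo["Age"]                    # one column name -> Series
subset = df_demo[["First Name", "Age"]]  # list of names -> DataFrame
age_list = df_demo["Age"].tolist()       # plain Python list of values

print(type(ages).__name__)
print(type(subset).__name__)
print(age_list)
```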


We can also select individual rows by position with the `.iloc` indexer.  `.iloc` is indexed with square brackets rather than called like a function, and `df.iloc[2]` returns the row at position 2 (counting from 0).  In the original `df` dataframe object, that is the third row.

In [53]:
df.iloc[2]

First Name    Wilbur
Age               50
Team           Green
Name: 2, dtype: object
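Position and row label are not always the same thing.  After sorting, the labels travel with their rows, so `.iloc` (position) and `.loc` (label) can return different rows.  A small sketch with a made-up dataframe `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame([
    {"First Name": "Zee", "Age": 40},
    {"First Name": "Charlotte", "Age": 45},
    {"First Name": "Wilbur", "Age": 50},
])

# Sorting reorders the rows but keeps each row's original label.
by_age = df_demo.sort_values("Age", ascending=False)

first_by_position = by_age.iloc[0]  # first row in the sorted order
first_by_label = by_age.loc[0]      # row whose label is 0

print(first_by_position["First Name"])
print(first_by_label["First Name"])
```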