# Pandas 🐼 🐼 🐼

[Pandas Website](https://pandas.pydata.org/)

Pandas is a data analysis, manipulation, and visualization library. It's become a go-to tool for data scientists. Pandas not only provides many useful methods for working with data, but it's also quite fast because it vectorizes many operations, applying them to a whole column at once in optimized compiled code rather than looping over values one by one in Python.
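To make "vectorized" a little more concrete, here is a minimal sketch that compares an explicit loop with a single whole-column expression. The numbers are made up for illustration, and the point is only that both approaches give the same answer.

```python
import pandas as pd

# Made-up values standing in for a numeric column
durations = pd.Series([11, 94, 35, 60, 75])

# One value at a time, with an explicit Python loop
looped = pd.Series([d * 2 for d in durations])

# Vectorized: a single expression applied to the whole Series
vectorized = durations * 2

print(vectorized.equals(looped))
```

On five values the speed difference is invisible, but on large columns the vectorized form is usually much faster and also easier to read.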

There have been whole books written about Pandas and its feature set is vast. Here, my goal is to demonstrate at a basic level _what Pandas can do_ in terms of loading, filtering, and visualizing data so you can determine if it's a tool you want to learn more about.

The [Pandas API documentation](https://pandas.pydata.org/pandas-docs/stable/reference/index.html#api) is excellent and each method has a set of concrete examples below its list of parameters. There's a lot of nuance to some of these concepts, and many functions can do what we want but only if we use a few of their many parameters.


## Loading Data

Let's import pandas and load data from a CSV file. I've included some anonymized reference interactions in this repo as an example.

In [57]:
import pandas as pd # common convention to reference pandas as "pd"
from pathlib import Path

df = pd.read_csv(Path('assets', 'reference_interactions.csv')) # read the data from the csv file
df.head() # head(N) displays the first N rows of the dataframe (N=5 by default)

Unnamed: 0,datetime,email,type,mode,patron category,topics,location,duration
0,9/1/2023 10:54:50,calypso@cca.edu,Directional,Chat,Faculty,,Aeaea,11
1,9/1/2023 11:09:25,circe@cca.edu,Reference,Email,Faculty,,Aeaea,94
2,9/1/2023 13:18:22,calypso@cca.edu,Directional,Chat,Undergrad,,Aeaea,35
3,9/5/2023 8:53:28,circe@cca.edu,Service,Email,Other,,Aeaea,60
4,9/5/2023 11:27:44,circe@cca.edu,Service,Email,Staff,,Aeaea,75


Pandas can load data from many different sources, including CSV, Excel, SQL, and more. The functions are all named `read_{format}` like `read_excel()`, `read_sql()`, `read_json()`, `read_pickle()` (Python object serialization).

The data we load must be _tabular_ in nature, as in we can interpret it as having rows and columns like a spreadsheet. A deeply nested JSON file would not work, for example. Data is read into a `DataFrame` object, the primary data structure in Pandas, which we will discuss further below.
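As a rough illustration of the `read_{format}` family, the sketch below reads the same two rows from CSV text and from a flat JSON list of records. The strings are made-up stand-ins for files, wrapped in `StringIO` so the example runs without anything on disk.

```python
from io import StringIO

import pandas as pd

# Made-up sample data in two formats
csv_text = "email,mode,duration\ncirce@cca.edu,Email,94\ncalypso@cca.edu,Chat,11\n"
json_text = (
    '[{"email": "circe@cca.edu", "mode": "Email", "duration": 94},'
    ' {"email": "calypso@cca.edu", "mode": "Chat", "duration": 11}]'
)

from_csv = pd.read_csv(StringIO(csv_text))
# orient="records" says the JSON is a list of row-like objects
from_json = pd.read_json(StringIO(json_text), orient="records")

print(from_csv.shape, from_json.shape)
```

Both calls produce a DataFrame with the same columns because each source is already shaped like rows and columns.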

There are a few other DataFrame methods which give us a peek at our data, such as `tail()` (same as head except it's the final N rows), `info()`, and `describe()`.

`DataFrame.info()` summarizes our data, showing our columns, how many non-null values they have, and their data types. We will discuss data types more below (what looks wrong in this output?).

In [61]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   datetime         301 non-null    object
 1   email            301 non-null    object
 2   type             301 non-null    object
 3   mode             301 non-null    object
 4   patron category  286 non-null    object
 5   topics           134 non-null    object
 6   location         301 non-null    object
 7   duration         301 non-null    int64 
dtypes: int64(1), object(7)
memory usage: 18.9+ KB


`DataFrame.describe()` shows summary statistics for our columns, such as the number of unique values and the value that appears most frequently.

In [65]:
df.describe(exclude=[int])
# exclude=[int] excludes the integer (duration) column from the describe() output
# otherwise describe() really only looks at the one "duration" column

Unnamed: 0,datetime,email,type,mode,patron category,topics,location
count,301,301,301,301,286,134,301
unique,301,8,4,6,6,12,2
top,9/1/2023 10:54:50,calypso@cca.edu,Technical/Computing,Email,Faculty,Moodle,Aeaea
freq,1,114,122,140,166,70,238
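For comparison, here is a small sketch of how `describe()` behaves with and without its `include` parameter. The three-row frame is made up, and `include="all"` is just one of several ways to pull non-numeric columns into the summary.

```python
import pandas as pd

# Made-up frame with one text column and one numeric column
sample = pd.DataFrame({
    "mode": ["Chat", "Email", "Email"],
    "duration": [11, 94, 60],
})

# Default: only numeric columns are summarized
numeric_summary = sample.describe()

# include="all": every column, with count/unique/top/freq for text
# columns and mean/std/min/max for numeric ones
full_summary = sample.describe(include="all")

print(numeric_summary.columns.tolist())
print(full_summary.loc[["top", "freq"], "mode"])
```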


`DataFrame.shape` shows how many rows and columns we have as a `(rows, columns)` tuple.

In [66]:
df.shape

(301, 8)

## Dataframes

When data is loaded into Pandas, it's turned into a `DataFrame`. What is a `DataFrame`? It's a two-dimensional data structure _with an index_ column. Look at the result from `df.head()` above and notice the leftmost, unlabelled column. That wasn't in our original CSV file (open it to see). Pandas added it to give each row a unique identifier.

The [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) documentation page lists its methods.
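Here is a tiny sketch of that automatically created index, using a made-up three-row frame built from a dictionary rather than our CSV.

```python
import pandas as pd

# Made-up rows; no index is specified, so Pandas supplies one
frame = pd.DataFrame({
    "email": ["calypso@cca.edu", "circe@cca.edu", "calypso@cca.edu"],
    "duration": [11, 94, 35],
})

print(frame.index)            # RangeIndex(start=0, stop=3, step=1)

# The index labels are what .loc uses to look up rows
print(frame.loc[1, "email"])  # circe@cca.edu
```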

## Series

A `Series` is a one-dimensional data structure in Pandas. It's a single column of a DataFrame. When we select a single column from a DataFrame, we get a Series. The [Series](https://pandas.pydata.org/pandas-docs/stable/reference/series.html) documentation page lists its methods.

A column can be selected as a Series with bracket notation, like `df['column_name']`, or dot notation, like `df.column_name`. The former is preferred because it works with all column names, even those that contain spaces (such as `patron category`) or conflict with DataFrame methods and attributes.

In [67]:
emails = df['email']
print(type(emails))
emails

<class 'pandas.core.series.Series'>


0         calypso@cca.edu
1           circe@cca.edu
2         calypso@cca.edu
3           circe@cca.edu
4           circe@cca.edu
              ...        
296       calypso@cca.edu
297       calypso@cca.edu
298    telemachus@cca.edu
299       calypso@cca.edu
300    telemachus@cca.edu
Name: email, Length: 301, dtype: object
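To see why bracket notation is the safer habit, consider this sketch. The `shape` column is hypothetical, chosen only because it shares a name with the `DataFrame.shape` attribute.

```python
import pandas as pd

sample = pd.DataFrame({
    "email": ["calypso@cca.edu", "circe@cca.edu"],
    "patron category": ["Faculty", "Staff"],
    "shape": ["round", "square"],  # hypothetical column name, clashes with DataFrame.shape
})

# Bracket notation always returns the column
print(type(sample["shape"]))

# Dot notation returns the DataFrame attribute, not our column
print(sample.shape)  # (2, 3)

# A name with a space can only be reached with brackets
print(sample["patron category"].tolist())
```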

Series share some of the investigative methods of DataFrames, like `head()`, `tail()`, and `describe()`.

In [68]:
emails.describe()

count                 301
unique                  8
top       calypso@cca.edu
freq                  114
Name: email, dtype: object

## Modifying Columns

TODO remove @cca.edu from emails and rename to username

TODO convert datetime string to actual datetime objects
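Until this section is written, here is one rough sketch of how those two changes might look. It uses two rows copied from the `head()` output above, and the date format string is an assumption: it guesses the values are month/day/year, which the sample alone cannot confirm.

```python
import pandas as pd

sample = pd.DataFrame({
    "datetime": ["9/1/2023 10:54:50", "9/5/2023 8:53:28"],
    "email": ["calypso@cca.edu", "circe@cca.edu"],
})

# Strip the domain as a literal string (regex=False), then rename the column
sample["email"] = sample["email"].str.replace("@cca.edu", "", regex=False)
sample = sample.rename(columns={"email": "username"})

# Parse the strings into real datetimes (assumed month/day/year order)
sample["datetime"] = pd.to_datetime(sample["datetime"], format="%m/%d/%Y %H:%M:%S")

print(sample.dtypes)
```

Once the column holds real datetimes, the `.dt` accessor exposes pieces like `sample["datetime"].dt.month`.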

## Filtering

TODO only interactions in a certain mode or a certain type
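As a placeholder, this sketch shows one common way to filter rows: build a boolean mask and pass it inside brackets. The rows are the five from the `head()` output above, trimmed to three columns.

```python
import pandas as pd

sample = pd.DataFrame({
    "type": ["Directional", "Reference", "Directional", "Service", "Service"],
    "mode": ["Chat", "Email", "Chat", "Email", "Email"],
    "duration": [11, 94, 35, 60, 75],
})

# Rows in a single mode
email_only = sample[sample["mode"] == "Email"]

# Combine conditions with & (and) or | (or); each needs its own parentheses
service_emails = sample[(sample["mode"] == "Email") & (sample["type"] == "Service")]

# Match any of several values with isin()
not_service = sample[sample["type"].isin(["Directional", "Reference"])]

print(len(email_only), len(service_emails), len(not_service))
```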

## Aggregating

TODO count interactions by user, mode, and/or type
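In the meantime, here is a rough sketch of two counting approaches, `value_counts()` for a single column and `groupby()` for combinations. It reuses the five rows from the `head()` output above.

```python
import pandas as pd

sample = pd.DataFrame({
    "email": ["calypso@cca.edu", "circe@cca.edu", "calypso@cca.edu",
              "circe@cca.edu", "circe@cca.edu"],
    "type": ["Directional", "Reference", "Directional", "Service", "Service"],
    "mode": ["Chat", "Email", "Chat", "Email", "Email"],
    "duration": [11, 94, 35, 60, 75],
})

# Interactions per user
per_user = sample["email"].value_counts()

# Interactions per combination of mode and type
per_mode_type = sample.groupby(["mode", "type"]).size()

# Groups can be summarized in other ways too, such as total duration
duration_by_mode = sample.groupby("mode")["duration"].sum()

print(per_user)
print(per_mode_type)
```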

## Visualization

TODO time-series plot of interactions by month?
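A plot like that needs monthly counts first, so here is a sketch of just the preparation step. The first two timestamps come from the data above; the rest are made up so there is more than one month, and the month/day/year format is again an assumption.

```python
import pandas as pd

timestamps = pd.Series([
    "9/1/2023 10:54:50",
    "9/5/2023 8:53:28",
    "10/2/2023 9:00:00",     # made up
    "10/9/2023 14:30:00",    # made up
    "10/20/2023 16:45:10",   # made up
    "11/3/2023 11:15:00",    # made up
])

parsed = pd.to_datetime(timestamps, format="%m/%d/%Y %H:%M:%S")

# Reduce each timestamp to its month, then count rows per month
monthly = parsed.dt.to_period("M").value_counts().sort_index()
print(monthly)

# With matplotlib installed, the counts can be drawn directly:
# monthly.plot(kind="bar")
```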

## Custom Indices


If our data has an identifier column, we can set it to function as the index.

In [69]:
# set the datetime column to be the index
df = df.set_index('datetime')
df.head()

datetime,email,type,mode,patron category,topics,location,duration
9/1/2023 10:54:50,calypso@cca.edu,Directional,Chat,Faculty,,Aeaea,11
9/1/2023 11:09:25,circe@cca.edu,Reference,Email,Faculty,,Aeaea,94
9/1/2023 13:18:22,calypso@cca.edu,Directional,Chat,Undergrad,,Aeaea,35
9/5/2023 8:53:28,circe@cca.edu,Service,Email,Other,,Aeaea,60
9/5/2023 11:27:44,circe@cca.edu,Service,Email,Staff,,Aeaea,75


Most methods return a new DataFrame rather than modifying the original one, which is why we write `df = df.set_index('datetime')` to re-assign the new DataFrame to the same variable. Many methods also have an `inplace` parameter which will modify the original DataFrame instead. Below, we reset the index back to the numeric row numbers Pandas created for us originally.

In [70]:
# reset the index
df.reset_index(inplace=True)
df.head()

Unnamed: 0,datetime,email,type,mode,patron category,topics,location,duration
0,9/1/2023 10:54:50,calypso@cca.edu,Directional,Chat,Faculty,,Aeaea,11
1,9/1/2023 11:09:25,circe@cca.edu,Reference,Email,Faculty,,Aeaea,94
2,9/1/2023 13:18:22,calypso@cca.edu,Directional,Chat,Undergrad,,Aeaea,35
3,9/5/2023 8:53:28,circe@cca.edu,Service,Email,Other,,Aeaea,60
4,9/5/2023 11:27:44,circe@cca.edu,Service,Email,Staff,,Aeaea,75
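The difference between the two styles is easy to check on a small frame. This sketch uses two rows from the output above with only two of the columns.

```python
import pandas as pd

original = pd.DataFrame({
    "datetime": ["9/1/2023 10:54:50", "9/1/2023 11:09:25"],
    "duration": [11, 94],
})

# set_index() returns a new DataFrame; the original is untouched
indexed = original.set_index("datetime")
print(original.index.name)  # None
print(indexed.index.name)   # datetime

# inplace=True changes the object itself and returns None
indexed.reset_index(inplace=True)
print(indexed.equals(original))
```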


We can even specify multiple columns to function as the index. The `set_index()` documentation shows [an example of using month and year](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html#:~:text=as%20other%20DataFrame.-,Examples,-%3E%3E%3E%20df%20%3D) columns together to create a multi-level index (a `MultiIndex`).
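Passing a list of column names to `set_index()` produces a `MultiIndex`. Here is a minimal sketch with made-up monthly totals; the `interactions` column and its numbers are invented for the example.

```python
import pandas as pd

# Made-up monthly totals
totals = pd.DataFrame({
    "year": [2023, 2023, 2024],
    "month": [9, 10, 1],
    "interactions": [120, 98, 143],
})

by_month = totals.set_index(["year", "month"])

# Look up one row with a (year, month) tuple
print(by_month.loc[(2023, 10), "interactions"])  # 98

# Or select everything under the first level
print(by_month.loc[2023])
```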

## Concrete Example: Video Content

A couple years ago, our video platform moved from an unlimited storage model to a tiered model with quotas for different types of hours of content. I had to analyze our current content to find a way to get under the quotas. That work is [on GitHub](https://github.com/cca/panopto-session-data) in [a notebook](https://github.com/cca/panopto-session-data/blob/main/notes.ipynb) that uses Pandas. Some of the facets of Pandas used include:

- Joining dataframes from two CSVs with a common key
- Creating derivative columns
- Converting a string column to dates
- Looking at different slices of the data in tables
- Creating filtered data frames based on conditions and time periods
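To give a flavor of those steps without the original reports, here is a compact sketch on invented data. None of these column names or values come from the actual notebook; they are hypothetical stand-ins for two CSVs that share a `folder_id` key.

```python
import pandas as pd

# Hypothetical stand-ins for two CSV reports
sessions = pd.DataFrame({
    "session_id": [1, 2, 3],
    "folder_id": ["a", "a", "b"],
    "created": ["1/15/2019", "6/3/2021", "9/1/2023"],
    "duration_seconds": [3600, 1800, 5400],
})
folders = pd.DataFrame({
    "folder_id": ["a", "b"],
    "folder_name": ["Lectures", "Events"],
})

# Join the two frames on their common key
merged = sessions.merge(folders, on="folder_id", how="left")

# Create a derivative column
merged["hours"] = merged["duration_seconds"] / 3600

# Convert the string column to dates (assumed month/day/year)
merged["created"] = pd.to_datetime(merged["created"], format="%m/%d/%Y")

# Filter by a time period, then summarize that slice
old = merged[merged["created"] < pd.Timestamp("2022-01-01")]
hours_by_folder = old.groupby("folder_name")["hours"].sum()
print(hours_by_folder)
```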

I didn't do any visualization because the raw figures were more useful, and it's hard to rerun the notebook without access to the specific reports I had, but it gives a sense of how to approach a problem with Pandas.