# 1. Introduction

### a) What is Pandas?

* [Software Library for Python](https://pandas.pydata.org/) (not a [bear](https://en.wikipedia.org/wiki/Giant_panda), [extra-cute bear](https://en.wikipedia.org/wiki/Red_panda), or [Kung Fu bear](https://m.media-amazon.com/images/M/MV5BODJkZTZhMWItMDI3Yy00ZWZlLTk4NjQtOTI1ZjU5NjBjZTVjXkEyXkFqcGdeQXVyODE5NzE3OTE@._V1_.jpg), contrary to popular belief).
* Data structures and functions to manipulate data in tables and time series
* Built on top of NumPy (recall: large, multi-dimensional arrays and matrices, with mathematical functions to operate on them).


### b) (Very) Brief history

* Development started in 2008 by Wes McKinney at AQR Capital (a global investment management firm).
* They needed a tool to perform quantitative analyses on financial data.
* Name is from "Panel Data" (Econometrics term: multidimensional data involving measurements for the same subject over time)
* Went Open Source in 2009.
* Official book [(HTML link)](https://wesmckinney.com/book/) was published in 2012, and more developers started joining.
* 2013 and onwards: started gathering serious momentum in the Python userbase, got sponsored, etc...

### c) Why/When Pandas?

* Python: Huge growth since 2012 & uptake in academic settings.
* Popular tool with an active community of contributors who keep it in good shape.
* Well-documented:
  * [Documentation](https://pandas.pydata.org/docs/), with tutorials and API reference
  * Official book [(HTML link)](https://wesmckinney.com/book/) 
  * Plenty of StackOverflow threads, books [(A good HTML one)](https://jakevdp.github.io/PythonDataScienceHandbook/index.html), online guides, tutorials, etc..
* Rich, user-friendly API.
* Effective tool for data exploration.
* Computationally faster and less of a headache than straightforward approaches (e.g. your 'for' loops and Python lists), because NumPy arrays are implemented in C.
  * BUT...Not very fast compared to many database-like tools popular in open-source data science, which you'll want when your table is huge.
  * For tables larger than your machine's memory you might want to check out other libraries or software; a few are benchmarked [here](https://h2oai.github.io/db-benchmark/).
  * Pandas is meant for 2D tables. For labelled multi-dimensional arrays, check out [xarray](https://docs.xarray.dev/en/stable/).
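
To make the "loops versus NumPy-backed operations" point concrete, here is a minimal sketch that computes a sum of squares both ways and checks that the results agree. The data (`values`) is made up for illustration. No timings are shown because they vary by machine, but wrapping each approach in `timeit` should show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: the integers 0..99999 (int64 to avoid overflow)
values = pd.Series(np.arange(100_000, dtype="int64"))

# Plain-Python approach: loop over a list, one element at a time
loop_total = 0
for v in values.tolist():
    loop_total += v * v

# Vectorized approach: the multiplication and the sum run in compiled NumPy code
vectorized_total = (values * values).sum()

# Both approaches give the same answer; only the speed differs
print(loop_total == vectorized_total)
```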
  
### d) Reference & Further Reading

* This module is partially based on the free [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html). For a more in-depth coverage of Pandas (and a few of the other course modules) I highly recommend checking it out!
* We also like the [Neurohackademy lecture](https://neurohackademy.org/course/complex-data-structures/) on Pandas.

In [1]:
import pandas as pd  # Importing the pandas module, bind to the variable 'pd' in the local scope
import numpy as np  # Importing numpy to explain some of the underlying structures

# 2. Pandas Objects

* Enhanced versions of [NumPy structured arrays](https://numpy.org/doc/stable/user/basics.rec.html).
* Rows and columns are identified with labels instead of integer indices.

### a) The Series Object

* One-dimensional array of indexed data.
* Consists of a sequence of values and a sequence of indices.
* Values are a NumPy array.
* Index is an array-like object of type pd.Index (discussed in detail soon)
* Can access Values/Index through attributes

In [2]:
# Creating a Series from a traditional Python List:
my_list = [3.0, 5.0, 7.2, 9.5]
data = pd.Series(my_list)
print(data)

# Accessing the values and index of the Series
print("\nSeries values:")
print(data.values)

print("\nSeries Index:")
print(data.index)

0    3.0
1    5.0
2    7.2
3    9.5
dtype: float64

Series values:
[3.  5.  7.2 9.5]

Series Index:
RangeIndex(start=0, stop=4, step=1)


* At this point you may be thinking: "So...What's the big deal? It just seems like a NumPy array".
* A NumPy array has an implicitly defined integer index.
* The Series has an explicitly defined index...It can be of any type you wish (not just integers).
* If you don't provide an index, it defaults to an integer sequence, as seen above.

In [3]:
# Explicitly setting string indices
data = pd.Series(my_list, index=["a", "lol", "g", "bbbb"])
print(data)

a       3.0
lol     5.0
g       7.2
bbbb    9.5
dtype: float64


* This should remind you of the key-value structure of a Dictionary in Python.
* In fact, you can also create a Series from a dictionary!

In [4]:
my_dict = {"a": 0.5, "lol": 2.2, "g": 3.14, "gg": 42.0}
data = pd.Series(my_dict)
print(data)

a       0.50
lol     2.20
g       3.14
gg     42.00
dtype: float64
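
The dictionary analogy can be pushed a little further. The sketch below reuses the values from the cell above and shows dictionary-style operations on a Series, plus label slicing, which a plain dictionary does not offer.

```python
import pandas as pd

data = pd.Series({"a": 0.5, "lol": 2.2, "g": 3.14, "gg": 42.0})

print("lol" in data)      # membership test checks the index labels, like dict keys
print(data["gg"])         # label lookup, like dict[key]
print(list(data.keys()))  # keys() returns the index

# Unlike a dictionary, a Series can be sliced by label (both end labels are included)
print(data["lol":"gg"])
```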


### b) The Dataframe Object

* If we think of a Series as a one-dimensional array with generalizable indices, then a Dataframe is a two-dimensional array with generalizable row indices and column names.
* It's a sequence of aligned Series (Series share the same Index).
* Usually used to represent tabular data (tables).

In [5]:
# Constructing a dataframe from multiple Series with a common Index

# Number of faces, edges, and corners of some common 3D shapes, in Dictionaries
faces_dict = {"cuboid": 6, "cylinder": 4, "pyramid": 5, "cone": 2, "sphere": 1}
edges_dict = {"cuboid": 12, "cylinder": 2, "pyramid": 8, "cone": 1, "sphere": 0}
corners_dict = {"cuboid": 8, "cylinder": 0, "pyramid": 5, "cone": 1, "sphere": 0}

# Converting the Dictionaries into Series
faces_series = pd.Series(faces_dict)
edges_series = pd.Series(edges_dict)
corners_series = pd.Series(corners_dict)

# Make a dataframe out of the Series - A dictionary where the keys are the column name, and the values are the Series
shapes = pd.DataFrame({"faces": faces_series, "edges": edges_series, "corners": corners_series})
display(shapes)
# display() prints the dataframe in a pretty way with HTML - note this won't work outside of a Notebook.

# Attributes
print("\nIndex, Column names, Column Series:\n")
print(shapes.index)  # Getting the Index labels
print(shapes.columns)  # Getting the column names
print(
    shapes["faces"]
)  # Note that querying for the 'faces' column returns a Series! We could use this to construct a new dataframe...

Unnamed: 0,faces,edges,corners
cuboid,6,12,8
cylinder,4,2,0
pyramid,5,8,5
cone,2,1,1
sphere,1,0,0



Index, Column names, Column Series:

Index(['cuboid', 'cylinder', 'pyramid', 'cone', 'sphere'], dtype='object')
Index(['faces', 'edges', 'corners'], dtype='object')
cuboid      6
cylinder    4
pyramid     5
cone        2
sphere      1
Name: faces, dtype: int64


* Continuing the Dictionary analogy:
  * Dictionary maps a key to a value
  * Dataframe maps a column name to a Series of column data

* There are several ways to create a Dataframe, and they are all instantiated similarly. Due to time constraints, detailed examples can be found in the reference/further reading material:
  * A Dataframe can be constructed from a dictionary of Series
  * A Dataframe can be constructed from any list of dictionaries. Missing keys will be automatically replaced with 'NaN' (Not a Number).
  * A Dataframe can also be constructed from a two-dimensional NumPy array, and NumPy structured arrays.
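
As a quick illustration of the last two options, here is a minimal sketch. The row contents and labels are chosen for the example only.

```python
import numpy as np
import pandas as pd

# From a list of dictionaries: each dictionary becomes one row.
# The second dictionary has no "edges" key, so that cell is filled with NaN.
rows = [{"faces": 6, "edges": 12}, {"faces": 1}]
from_dicts = pd.DataFrame(rows, index=["cuboid", "sphere"])
print(from_dicts)

# From a two-dimensional NumPy array: supply the row and column labels yourself.
array = np.array([[6, 12], [4, 2]])
from_array = pd.DataFrame(array, index=["cuboid", "cylinder"], columns=["faces", "edges"])
print(from_array)
```

Note that the NaN forces the `edges` column in `from_dicts` to a floating-point dtype, since NaN is a float.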

### c) The Index Object

* Last object we'll see today.
* We now know that the Series and DataFrame objects contain an explicit Index object to reference entries.
* It can be useful to think of it as an immutable array - one that can't be modified after creation. You can perform similar referencing and slicing as an array.

In [6]:
# Creating an Index
ind = pd.Index([2, 3, 5, 7])

print(ind)
print(ind[2])
print(ind[1:3])
print(ind.size, ind.shape, ind.ndim, ind.dtype)  # NumPy array attributes

Index([2, 3, 5, 7], dtype='int64')
5
Index([3, 5], dtype='int64')
4 (4,) 1 int64


In [7]:
# Note it can't be modified...TypeError
ind[2] = 2

TypeError: Index does not support mutable operations

* The Index can also be thought of as a type of set data structure - but we won't go into this in detail. For the more advanced students, it is useful to be aware that Index objects provide set-style methods such as `union`, `intersection`, and `difference`, analogous to [Python Set Methods](https://docs.python.org/2/library/sets.html).
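
The set-like behaviour of an Index can be sketched with two small, made-up indices. This uses the explicit methods rather than operators, since the methods behave consistently across pandas versions.

```python
import pandas as pd

ind_a = pd.Index([1, 3, 5, 7, 9])
ind_b = pd.Index([2, 3, 5, 7, 11])

print(ind_a.intersection(ind_b))  # labels present in both
print(ind_a.union(ind_b))         # labels present in either
print(ind_a.difference(ind_b))    # labels in ind_a but not in ind_b
```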

# 3. Pandas Wrangler Essentials

### a) Data I/O (reading/writing)

In [8]:
# Read in a CSV file with Pandas...Specify a web address or a local path
# There are methods to read in other file formats too...See the documentation!
csv_data = pd.read_csv(
    "https://raw.githubusercontent.com/codeforamerica/ohana-api/master/data/sample-csv/addresses.csv"
)
display(
    csv_data.head(10)
)  # The 'head' method returns the first few rows of the Dataframe (default 5 rows)

Unnamed: 0,id,location_id,address_1,address_2,city,state_province,postal_code,country
0,1,1,2600 Middlefield Road,,Redwood City,CA,94063,US
1,2,2,24 Second Avenue,,San Mateo,CA,94401,US
2,3,3,24 Second Avenue,,San Mateo,CA,94403,US
3,4,4,24 Second Avenue,,San Mateo,CA,94401,US
4,5,5,24 Second Avenue,,San Mateo,CA,94401,US
5,6,6,800 Middle Avenue,,Menlo Park,CA,94025-9881,US
6,7,7,500 Arbor Road,,Menlo Park,CA,94025,US
7,8,8,800 Middle Avenue,,Menlo Park,CA,94025-9881,US
8,9,9,2510 Middlefield Road,,Redwood City,CA,94063,US
9,10,10,1044 Middlefield Road,,Redwood City,CA,94063,US


In [9]:
# What if I only want to read in a few columns I need?
csv_data_mini = pd.read_csv(
    "https://raw.githubusercontent.com/codeforamerica/ohana-api/master/data/sample-csv/addresses.csv",
    usecols=["id", "postal_code"],
)
display(csv_data_mini.head())

Unnamed: 0,id,postal_code
0,1,94063
1,2,94401
2,3,94403
3,4,94401
4,5,94401


In [10]:
# Export using the to_csv method on your Dataframe (to_csv returns None when a path is given)
csv_data_mini.to_csv("./out.csv", index=False)
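
If you want to experiment with reading and writing without touching the network or the disk, a rough sketch is to round-trip through an in-memory string. The CSV text below is invented for the example, and forcing `postal_code` to be read as text is an assumption about what you would want for postal codes.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV text standing in for a file on disk
csv_text = "id,postal_code\n1,94063\n2,94025-9881\n"

# read_csv accepts file-like objects as well as paths and URLs.
# dtype keeps the postal codes as text instead of letting pandas guess a type.
df = pd.read_csv(io.StringIO(csv_text), dtype={"postal_code": str})
print(df)

# With no path argument, to_csv returns the CSV text as a string
round_trip = df.to_csv(index=False)
print(round_trip)
```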

### b) Selection & Indexers, and some other nuts and bolts
* So you've read in your tabular data, great! The next step now will be to select the data you want to access.

#### Selecting with Indexers

* While selecting entries in Series and DataFrames is possible with List-style Indexing and Slicing, it is bad practice, and should be avoided. Why is this?
* Slicing and indexing can be a source of confusion.
* In a Series with an existing explicit integer index, data[1] will use the explicit indices, while data[1:3] will use the implicit index.
* Using indexer attributes makes this clearer
* 'loc': always references the explicit index
* 'iloc': always references the implicit index

In [11]:
# Series
data = pd.Series(["hi", "hello", "hey"], index=[1, 3, 5])

print(data)
print("\n")
print(data[1])  # explicit index when indexing
print("\n")
print(data[1:3])  # implicit index when slicing
# ...Confusion ensues

1       hi
3    hello
5      hey
dtype: object


hi


3    hello
5      hey
dtype: object


In [12]:
# Series examples - loc vs iloc

print(data.loc[1])
print(data.iloc[1])

print("\n")

print(data.loc[1:3])
print(data.iloc[1:3])
# Less confusing

hi
hello


1       hi
3    hello
dtype: object
3    hello
5      hey
dtype: object


In [13]:
# Can set values with loc and iloc

data.loc[1] = "sup"
print(data)

1      sup
3    hello
5      hey
dtype: object


In [14]:
# Dataframe loc vs iloc

display(shapes.iloc[:3, :2])  # as for a simple NumPy array
display(shapes.loc[:"pyramid", :"edges"])  # Equivalent using explicit index and column names

Unnamed: 0,faces,edges
cuboid,6,12
cylinder,4,2
pyramid,5,8


Unnamed: 0,faces,edges
cuboid,6,12
cylinder,4,2
pyramid,5,8


#### Deep and shallow copies...Beware!

* Let's go back to that Addresses CSV we read into Pandas.

In [15]:
display(csv_data_mini.head())

Unnamed: 0,id,postal_code
0,1,94063
1,2,94401
2,3,94403
3,4,94401
4,5,94401


In [16]:
# Retrieving a series from a dataframe
code_series = csv_data_mini.loc[:, "postal_code"]  # [row, column]

print(code_series.head())
print(code_series.keys())  # Keys/Indices
print(list(code_series.items()))  # Values (index, postal code)

0    94063
1    94401
2    94403
3    94401
4    94401
Name: postal_code, dtype: object
RangeIndex(start=0, stop=21, step=1)
[(0, '94063'), (1, '94401'), (2, '94403'), (3, '94401'), (4, '94401'), (5, '94025-9881'), (6, '94025'), (7, '94025-9881'), (8, '94063'), (9, '94063'), (10, '94061'), (11, '94063'), (12, '94065'), (13, '94063'), (14, '94110'), (15, '94087'), (16, '94080'), (17, '94063'), (18, '94403'), (19, '94002'), (20, '94103')]


In [17]:
code_series.loc[0] = "666"
print(code_series.head())

0      666
1    94401
2    94403
3    94401
4    94401
Name: postal_code, dtype: object


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  code_series.loc[0] = '666'


* However...Modifying an entry gives us a warning...what is this?
* Returning a view versus a copy
* View (or shallow copy) doesn’t have its own data. It “views” the data contained in the original array. Modifying the view modifies the original.
* Copy (or deep copy): the original and the copy are two separate instances.
* In Pandas (and Python in general!) make sure to check if the method returns a deep or shallow copy of an object...Or you might end up modifying things you didn't intend to.
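
The view/copy distinction is easiest to see in plain NumPy, where the rules are fixed. Whether a given pandas selection returns a view or a copy depends on the operation and the pandas version, so treat this as a sketch of the concept rather than a statement about any particular pandas call.

```python
import numpy as np

original = np.array([1, 2, 3, 4])

view = original[:2]         # basic slicing returns a view onto the same memory
copy = original[:2].copy()  # .copy() allocates independent memory

view[0] = 100  # also changes original[0]
copy[1] = -1   # leaves original untouched

print(original)
print(np.shares_memory(original, view), np.shares_memory(original, copy))
```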

In [18]:
# See how the value has changed in our original dataframe! Not just the series. We were manipulating a view.
display(csv_data_mini.head())

Unnamed: 0,id,postal_code
0,1,666
1,2,94401
2,3,94403
3,4,94401
4,5,94401


In [19]:
# If we wanted to leave our original Dataframes intact while modifying the extracted Series...
# Make a copy of the Series

csv_data_mini = pd.read_csv(
    "https://raw.githubusercontent.com/codeforamerica/ohana-api/master/data/sample-csv/addresses.csv",
    usecols=["id", "postal_code"],
)

code_series = csv_data_mini.loc[
    :, "postal_code"
].copy()  # Making a DEEP COPY series from our Dataframe
code_series.loc[0] = "666"

# Still intact!
display(csv_data_mini.head())

Unnamed: 0,id,postal_code
0,1,94063
1,2,94401
2,3,94403
3,4,94401
4,5,94401

#### Additional useful stuff...

In [20]:
display(shapes)

Unnamed: 0,faces,edges,corners
cuboid,6,12,8
cylinder,4,2,0
pyramid,5,8,5
cone,2,1,1
sphere,1,0,0


In [21]:
# Can add a column
shapes["ratio"] = (
    shapes["faces"] / shapes["edges"]
)  # Add a column with our ratio of faces to edges, np array division
display(shapes)

# And delete/drop a column
shapes = shapes.drop(
    ["ratio"], axis=1
)  # axis = drop labels from index (0 or ‘index’) or columns (1 or ‘columns’).
display(shapes)

Unnamed: 0,faces,edges,corners,ratio
cuboid,6,12,8,0.5
cylinder,4,2,0,2.0
pyramid,5,8,5,0.625
cone,2,1,1,2.0
sphere,1,0,0,inf


Unnamed: 0,faces,edges,corners
cuboid,6,12,8
cylinder,4,2,0
pyramid,5,8,5
cone,2,1,1
sphere,1,0,0


In [22]:
# Iterating through the rows of a dataframe
# Usually not recommended (not efficient), but sometimes necessary for complex data processing

for index, row in shapes.iterrows():
    print(index, row)

cuboid faces       6
edges      12
corners     8
Name: cuboid, dtype: int64
cylinder faces      4
edges      2
corners    0
Name: cylinder, dtype: int64
pyramid faces      5
edges      8
corners    5
Name: pyramid, dtype: int64
cone faces      2
edges      1
corners    1
Name: cone, dtype: int64
sphere faces      1
edges      0
corners    0
Name: sphere, dtype: int64
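
When a row loop really is needed, `itertuples()` is a commonly used alternative to `iterrows()`, and many row-wise computations can be written without a loop at all. A minimal sketch on a cut-down version of the shapes table:

```python
import pandas as pd

shapes = pd.DataFrame(
    {"faces": [6, 4, 5], "edges": [12, 2, 8]},
    index=["cuboid", "cylinder", "pyramid"],
)

# itertuples() yields one namedtuple per row; the row label is in the 'Index' field
for row in shapes.itertuples():
    print(row.Index, row.faces + row.edges)

# The same computation without an explicit loop
totals = shapes["faces"] + shapes["edges"]
print(totals)
```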


### c) Filtering
* Filtering a Dataframe to find rows that satisfy a specific condition

In [23]:
# Recall the shapes Dataframe...
display(shapes)

Unnamed: 0,faces,edges,corners
cuboid,6,12,8
cylinder,4,2,0
pyramid,5,8,5
cone,2,1,1
sphere,1,0,0


In [24]:
# Simple query examples (many ways to do the same thing)
display(shapes.loc[shapes.faces > 3])  # Using attribute access of columns
display(shapes.loc[shapes["faces"] > 3])  # Using Equivalent Dictionary-style access of columns
display(shapes.query("faces > 3"))  # Using the query method

Unnamed: 0,faces,edges,corners
cuboid,6,12,8
cylinder,4,2,0
pyramid,5,8,5


Unnamed: 0,faces,edges,corners
cuboid,6,12,8
cylinder,4,2,0
pyramid,5,8,5


Unnamed: 0,faces,edges,corners
cuboid,6,12,8
cylinder,4,2,0
pyramid,5,8,5


In [25]:
# Composite query satisfying multiple conditions (you can chain many of them) - using dictionary style indexing.
query = shapes.loc[(shapes["faces"] > 3) & (shapes["edges"] != 12)]
display(query)


# NOTE: We use the & (bitwise AND) and | (bitwise OR) to chain conditions instead of Python's logical 'and'/'or'
# Using 'and' or 'or' would try to convert the Series to booleans and raise an error.

# Aside - but if you are briefly interested in why:
# 'and' needs a single True/False value for each operand - ambiguous for a whole Series, hence the error
# '&' computes the (bitwise AND) of two arrays element-wise

Unnamed: 0,faces,edges,corners
cylinder,4,2,0
pyramid,5,8,5
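
The other element-wise operators work the same way. The sketch below rebuilds part of the shapes table, then shows `|` (OR), `~` (NOT), and the error raised by Python's `and`. The mask names are just for illustration.

```python
import pandas as pd

shapes = pd.DataFrame(
    {"faces": [6, 4, 5, 2, 1], "edges": [12, 2, 8, 1, 0]},
    index=["cuboid", "cylinder", "pyramid", "cone", "sphere"],
)

many_faces = shapes["faces"] > 3  # boolean Series, one entry per row
no_edges = shapes["edges"] == 0

print(shapes.loc[many_faces | no_edges])  # OR: either condition holds
print(shapes.loc[~many_faces])            # NOT: invert the mask

# Python's 'and' asks for a single truth value of the whole Series, which is ambiguous
try:
    many_faces and no_edges
except ValueError as err:
    print("ValueError:", err)
```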


### d) Combining Dataframes
* Combining different data sources. For example, what would you do if you want to filter across two dataframes?
* Some different types of combination:
  * Concat: concatenation (splice together) along an axis.
  * Merge: combine dataframes, aligning the rows from each based on common attributes/columns.
* There are more types of combinations...Like join (shorthand for a merge() that defaults to joining on common indices). The older append method (shorthand for concat) was deprecated in pandas 1.4 and removed in pandas 2.0, so use concat instead.
* Merge is powerful: implements a subset of 'relational algebra' (a procedural query language)
  * Formal rules for manipulating relational data
  * Primitive operations are building blocks of more complex operations
  * Conceptual foundation of operations available in most database software.
* Examples nabbed from the [Pandas Docs](https://pandas.pydata.org/docs/user_guide/merging.html)
* We'll see some basic examples but won't go into depth due to time restrictions.

In [26]:
# Concat

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)

df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)

frames = [df1, df2]
result = pd.concat(frames, axis=0)

display(df1)
display(df2)
display(result)

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
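
Two variations on the concat call are worth knowing. This is a small sketch with made-up frames: `ignore_index=True` renumbers the rows when the original labels clash, and `axis=1` places frames side by side, aligned on their index.

```python
import pandas as pd

top = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
bottom = pd.DataFrame({"A": ["A2", "A3"]}, index=[0, 1])
side = pd.DataFrame({"B": ["B0", "B1"]}, index=[0, 1])

stacked = pd.concat([top, bottom], axis=0)  # keeps the duplicate labels 0, 1, 0, 1
renumbered = pd.concat([top, bottom], axis=0, ignore_index=True)  # fresh labels 0..3
wide = pd.concat([top, side], axis=1)  # columns A and B, rows matched by index label

print(stacked)
print(renumbered)
print(wide)
```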


In [27]:
# Merge

left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)

right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

# Help(pd.merge)
# Play around with Merge!
result = pd.merge(left, right, on=["key1", "key2"], how="inner")

# On: "Column or index level names to join on. These must be found in both DataFrames"
# How: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’},

display(left)
display(right)
display(result)

Unnamed: 0,key1,key2,A,B
0,K0,K0,A0,B0
1,K0,K1,A1,B1
2,K1,K0,A2,B2
3,K2,K1,A3,B3


Unnamed: 0,key1,key2,C,D
0,K0,K0,C0,D0
1,K1,K0,C1,D1
2,K1,K0,C2,D2
3,K2,K0,C3,D3


Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2


* inner: returns only rows from left and right which share a common key


* outer: uses the keys from both frames, and NaNs are inserted for missing rows in both.


* left: only keys from left are used, and missing data from right is replaced by NaN.


* right: keys from right are used, and missing data from left is replaced by NaN.


* cross: Cartesian product of the two data frames, creating combinations of all rows from left and right.
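
To compare the `how` options directly, here is a minimal sketch with two small frames (invented for the example) that share only some keys. It prints the number of rows each merge type produces, then uses `indicator=True` to show where each row of an outer merge came from.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "C": ["C1", "C2", "C3"]})

# Row counts for the key-based merge types
row_counts = {}
for how in ["inner", "outer", "left", "right"]:
    merged = pd.merge(left, right, on="key", how=how)
    row_counts[how] = len(merged)

# A cross merge takes no 'on' argument: every left row is paired with every right row
cross = pd.merge(left, right, how="cross")
row_counts["cross"] = len(cross)
print(row_counts)

# indicator=True adds a '_merge' column recording the source of each row
outer = pd.merge(left, right, on="key", how="outer", indicator=True)
print(outer)
```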

### e) Inbuilt Aggregations
* Efficient summarization of data is important for analysis.
* Computing aggregations such as the mean, sum, median, etc.
* We will cover some basic inbuilt ones today, but we won't have time to cover how to make your own. For that I suggest you read about the concept of a 'groupby' and 'split, apply, combine' in Pandas.
  * Don't worry too much about the formalities! They make life more efficient for developers, but you can achieve the same results using only your intuition.
  * For example, you already know how to copy a series from a dataframe, filter it to find the parts you are interested in, compute something on them, and add a column to a dataframe with your computation results :)
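
For the curious, here is a rough sketch of what 'split, apply, combine' looks like next to the manual filtering approach described above. The `kind` column is a hypothetical grouping added for this example; it is not part of the shapes table used elsewhere.

```python
import pandas as pd

shapes = pd.DataFrame(
    {
        "kind": ["flat", "curved", "flat", "curved", "curved"],  # hypothetical grouping column
        "faces": [6, 4, 5, 2, 1],
    },
    index=["cuboid", "cylinder", "pyramid", "cone", "sphere"],
)

# Manual approach: filter the rows for one group, then aggregate
flat_mean_manual = shapes.loc[shapes["kind"] == "flat", "faces"].mean()

# groupby: split the rows by 'kind', apply mean() to each group, combine into one Series
mean_faces = shapes.groupby("kind")["faces"].mean()

print(flat_mean_manual)
print(mean_faces)
```

Both routes give the same number for the "flat" group. The groupby version simply handles every group in one call.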

In [28]:
# The describe() method shows many common aggregations
a_series = pd.Series([1, 2, 3, 3])
print(a_series.describe())
print(a_series.value_counts())  # Unique value count

# Also works on a dataframe
print(shapes.describe())

count    4.000000
mean     2.250000
std      0.957427
min      1.000000
25%      1.750000
50%      2.500000
75%      3.000000
max      3.000000
dtype: float64
3    2
1    1
2    1
Name: count, dtype: int64
          faces      edges   corners
count  5.000000   5.000000  5.000000
mean   3.600000   4.600000  2.800000
std    2.073644   5.176872  3.563706
min    1.000000   0.000000  0.000000
25%    2.000000   1.000000  0.000000
50%    4.000000   2.000000  1.000000
75%    5.000000   8.000000  5.000000
max    6.000000  12.000000  8.000000


In [29]:
print(shapes.sum())
print(shapes.median())
print(shapes.count())  # Count non-NA cells for each column or row.

# What if I want to aggregate across a row in a dataframe?
print(shapes.mean(axis="columns"))

faces      18
edges      23
corners    14
dtype: int64
faces      4.0
edges      2.0
corners    1.0
dtype: float64
faces      5
edges      5
corners    5
dtype: int64
cuboid      8.666667
cylinder    2.000000
pyramid     6.000000
cone        1.333333
sphere      0.333333
dtype: float64


# 4. Bonus: Actual Pandas

![panda](https://i.redd.it/m1pr821ubod01.jpg)
![panda2](https://www.rd.com/wp-content/uploads/2018/04/shutterstock_688280269.jpg)
![panda3](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e6/Red_Panda_%2824986761703%29.jpg/1200px-Red_Panda_%2824986761703%29.jpg)