# Comparison with pandas

[pandas](https://pandas.pydata.org) is a popular Python library for data analysis and manipulation. 

GreenplumPython strives to provide a pandas-like interface so that people can get started with it quickly.

In [1]:
import pandas as pd
import greenplumpython as gp

gp


<module 'greenplumpython' from '/home/gpadmin/.local/lib/python3.9/site-packages/greenplumpython/__init__.py'>

However, due to the discrepancy in the purposes they serve, that is,

- GreenplumPython is an interface to a remote database system, while
- pandas is a library for manipulating local in-memory data

there are still some important differences between their interfaces.

This document covers the similarities and differences between GreenplumPython and pandas, as well as the rationale behind them.

## Data Structure

The core data structure of GreenplumPython, `DataFrame`, is fundamentally similar to `DataFrame` in pandas in that

- Data are organized into rows and columns;
- Columns can be of different types, and can be accessed by name;
- Rows are of the same type, and are iterable.

Next, we will see similarities and differences between them in detail with examples.

### Getting Access to the Structure

For example, suppose we have some information about students, including their names and ages:

In [2]:
students = [("alice", 18), ("bob", 19), ("carol", 19)]
students


[('alice', 18), ('bob', 19), ('carol', 19)]

To analyze this data, we might want to create a pandas `DataFrame` as follows:

In [3]:
pd_df = pd.DataFrame.from_records(students, columns=["name", "age"])
pd_df


Unnamed: 0,name,age
0,alice,18
1,bob,19
2,carol,19


We can also create a `DataFrame` in GreenplumPython from the same data in a very similar way:

In [4]:
db = gp.database("postgresql://localhost/gpadmin")
gp_df = gp.DataFrame.from_rows(students, column_names=["name", "age"], db=db)
gp_df


name,age
alice,18
bob,19
carol,19


But here is an important **difference**:

- a `DataFrame` in GreenplumPython must be created in a database, while
- pandas does not have the concept of "database".

A `Database` in GreenplumPython is like a directory in a file system: it helps to avoid name conflicts when data is persisted.

A GreenplumPython `DataFrame` can be saved persistently in the database system as a table with `save_as()`:

In [5]:
gp_df.save_as("student", column_names=["name", "age"], temp=True)


name,age
alice,18
bob,19
carol,19


This is similar to how a pandas `DataFrame` is persisted as a file:

In [6]:
pd_df.to_csv("/tmp/student.csv")


Saving a `DataFrame` with `temp=True` in GreenplumPython is similar to saving a `DataFrame` into the `/tmp` directory in pandas:

- GreenplumPython `DataFrame`s saved with `temp=True` will be dropped automatically by the database system when the database session is terminated, while
- pandas `DataFrame`s saved in `/tmp` will be cleaned up automatically by the operating system.

To access the data of a table in the database, we can create a `DataFrame` from the table:

In [7]:
student = db.create_dataframe(table_name="student")
student


name,age
carol,19
alice,18
bob,19


This is similar to loading a `DataFrame` from a file in pandas, that is,

In [8]:
pd.read_csv("/tmp/student.csv")


Unnamed: 0.1,Unnamed: 0,name,age
0,0,alice,18
1,1,bob,19
2,2,carol,19
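
The `Unnamed: 0` column in the output above appears because `to_csv()` writes the index by default, and `read_csv()` then reads it back as an ordinary column. The following is a minimal sketch of a round trip that avoids this. It writes to an in-memory buffer instead of `/tmp/student.csv`; the buffer is only an assumption made to keep the example self-contained.

```python
import io

import pandas as pd

students = [("alice", 18), ("bob", 19), ("carol", 19)]
pd_df = pd.DataFrame.from_records(students, columns=["name", "age"])

# Write without the index so that it is not stored as an extra column.
buffer = io.StringIO()
pd_df.to_csv(buffer, index=False)

# Rewind the buffer and read the data back.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```

A GreenplumPython `DataFrame` has no index, so this question does not arise when saving it as a table.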


### Accessing Data in Rows and Columns

In both GreenplumPython `DataFrame`s and pandas `DataFrame`s, rows can be accessed by iterating over the dataframe.

For example, in GreenplumPython:

In [9]:
for row in gp_df:
    print(row["name"], row["age"])


alice 18
bob 19
carol 19


This is similar to how rows in a pandas `DataFrame` can be accessed:

In [10]:
for row in pd_df.iterrows():
    print(row[1]["name"], row[1]["age"])


alice 18
bob 19
carol 19


Similar to pandas, `Row` in GreenplumPython is `dict`-like. The value of each column can be accessed by name.

In conclusion, from the user's perspective, `DataFrame` in GreenplumPython is very similar to `DataFrame` in pandas. We expect this to make it easier for anyone interested to get started with GreenplumPython.

## Data Selection

Data selection is probably the most fundamental set of operations on data.

In both pandas and GreenplumPython, data selection is done primarily with the `[]` operator.

### Selecting Columns

In both pandas and GreenplumPython, columns are accessed by name.

For example, to select a subset of columns, such as `name` and `age`, from the dataframe containing student info, in pandas we can do

In [11]:
pd_df[["name", "age"]]


Unnamed: 0,name,age
0,alice,18
1,bob,19
2,carol,19


The result of the `[]` operator is a new pandas `DataFrame`. In GreenplumPython, this is exactly the same:

In [12]:
student[["name", "age"]]


name,age
alice,18
bob,19
carol,19


The result is a new GreenplumPython `DataFrame` containing the selected columns.

### Accessing a Single Column

To refer to a single column, we can use the `[]` operator with the column name in both pandas and GreenplumPython.

For example, to access the names of the students in pandas, we can do:

In [13]:
pd_df["name"]


0    alice
1      bob
2    carol
Name: name, dtype: object

In GreenplumPython, the column can be referred to in the same way:

In [14]:
gp_df["name"]


<greenplumpython.col.Column at 0x7f104ed5d100>

But you might notice the **difference** here:

- for pandas, using the `[]` operator gives us the data immediately if we refer to a column, while
- for GreenplumPython, it only gives a **symbolic** `Column` object. `Column` is supposed to be used for computation rather than for observing data.

The reasons behind this difference are:

- The database systems behind GreenplumPython do not provide a native one-dimensional data structure like `Series` in pandas.
- It is much more efficient to retrieve all the columns needed in a GreenplumPython `DataFrame` at once than one at a time.

We will see later how to add new columns to a GreenplumPython `DataFrame` so that they can be retrieved all at once.
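
To make the idea of a symbolic column more concrete, here is a minimal sketch in plain Python. `ToyColumn` is a hypothetical class written only for this illustration; it is not how GreenplumPython implements `Column`, and the expression text it builds is not necessarily the SQL that GreenplumPython generates. The point is only that operators on a symbolic object can build a description of a computation instead of computing on data.

```python
class ToyColumn:
    """Hypothetical symbolic column: it records an expression, not data."""

    def __init__(self, expr: str) -> None:
        self.expr = expr

    def __eq__(self, other: object) -> "ToyColumn":  # type: ignore[override]
        # Comparing builds a larger expression instead of returning a bool.
        return ToyColumn(f"({self.expr} = {other!r})")

    def __add__(self, other: object) -> "ToyColumn":
        # Arithmetic is recorded in the same way.
        return ToyColumn(f"({self.expr} + {other!r})")


name = ToyColumn("name")
age = ToyColumn("age")

predicate = name == "alice"
next_age = age + 1

# No data has been touched; we only hold descriptions of computations.
print(predicate.expr)
print(next_age.expr)
```

Because nothing is evaluated until the whole expression is known, such a design lets the system send one combined request to the database rather than one request per column.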

### Selecting Rows by Predicates

The `[]` operator can also be used to select a subset of rows, a.k.a. filtering.

Say we want the information of the student named "alice". With pandas, we can do

In [15]:
pd_df[lambda df: df["name"] == "alice"]


Unnamed: 0,name,age
0,alice,18


With GreenplumPython, we can do it in exactly the same way:

In [16]:
student[lambda t: t["name"] == "alice"]


name,age
alice,18


Here we see how a column in GreenplumPython, `t["name"]` in this case, is used for computation to form a more complex expression.

In this example, when the expression `t["name"] == "alice"` is evaluated, `t` will be bound to the **current** dataframe, i.e. `student`.

GreenplumPython provides such a functional interface so that the user does not have to refer to the possibly long intermediate variable name like `student` again and again when the expression becomes complicated.
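
The binding of the callable's parameter to the current dataframe can be sketched in plain Python. `ToyFrame` and its `column_equals()` helper are hypothetical names invented for this illustration, and the sketch evaluates the predicate eagerly on in-memory rows, which is an assumption that differs from GreenplumPython's symbolic evaluation.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class ToyFrame:
    """Hypothetical in-memory frame, used only to illustrate binding."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows

    def column_equals(self, name: str, value: object) -> List[bool]:
        # One boolean per row, computed eagerly for simplicity.
        return [row[name] == value for row in self.rows]

    def __getitem__(
        self, predicate: Callable[["ToyFrame"], List[bool]]
    ) -> "ToyFrame":
        # The frame passes itself to the callable, so the lambda's
        # parameter is bound to whatever frame is being indexed.
        mask = predicate(self)
        return ToyFrame([row for row, keep in zip(self.rows, mask) if keep])


student = ToyFrame(
    [
        {"name": "alice", "age": 18},
        {"name": "bob", "age": 19},
        {"name": "carol", "age": 19},
    ]
)

# In the second filter, `t` is the intermediate result of the first
# filter, which never had to be given a name.
result = student[lambda t: t.column_equals("age", 19)][
    lambda t: t.column_equals("name", "bob")
]
print(result.rows)
```

Chaining works without naming the intermediate result, which is the convenience the functional interface is meant to provide.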

### Selecting Rows by Slices

We can get a quick glance at the data by selecting the first several rows. This can be achieved with `slice` in Python.

Like many built-in data structures in Python, such as `list` and `tuple`, `DataFrame` in GreenplumPython supports slicing.

For example, if we want only the first two rows of the `DataFrame` of students in GreenplumPython, we can do

In [17]:
student[:2]


name,age
carol,19
alice,18


In pandas, we can do exactly the same thing on a `DataFrame`:

In [18]:
pd_df[:2]


Unnamed: 0,name,age
0,alice,18
1,bob,19


But you might notice the **difference**: when selecting rows,

- for pandas, rows in the output `DataFrame` preserve the same order as the input, while
- for GreenplumPython, the order of rows in `DataFrame` might not be preserved.

The difference is due to the fact that the underlying database system does not guarantee the order of rows unless an order is explicitly specified.
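
The effect of an explicit order can be seen at the SQL level. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for the remote database. This is an assumption made so that the example runs anywhere: SQLite is not Greenplum, and the SQL text here is illustrative rather than what GreenplumPython generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("alice", 18), ("bob", 19), ("carol", 19)],
)

# Without ORDER BY, SQL leaves the order of the returned rows unspecified,
# so which two rows LIMIT keeps is not guaranteed in general.
unordered = conn.execute("SELECT name, age FROM student LIMIT 2").fetchall()

# With ORDER BY, the first two rows are well defined.
ordered = conn.execute(
    "SELECT name, age FROM student ORDER BY name LIMIT 2"
).fetchall()
print(ordered)
conn.close()
```

A single-node, in-memory database will often return rows in insertion order anyway; on a distributed system, where rows are spread across several segments, relying on that is much less safe.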


## Data Transformation

Data transformation is about changing the data to a desired form.

Like pandas, GreenplumPython provides powerful building blocks to make transformation easier.

### Data Ordering

Having the data sorted in a desired order makes it convenient for many analytical tasks, such as statistics.

pandas supports sorting the data (a.k.a values) of a `DataFrame` by columns.

For example, in pandas we can sort the dataframe of student info by "age" and then "name", both in descending order, with

In [19]:
pd_df.sort_values(["age", "name"], ascending=[False, False])


Unnamed: 0,name,age
2,carol,19
1,bob,19
0,alice,18


In GreenplumPython, the order of data can be defined with the `order_by()` method:

In [20]:
student.order_by("age", ascending=False).order_by("name", ascending=False)[:]


name,age
carol,19
bob,19
alice,18


There are some important **differences** compared with pandas:

- GreenplumPython does not provide something like `DataFrame.sort_index()` in pandas, because `DataFrame` in GreenplumPython does not have an "index column".
- In GreenplumPython, slicing is required after `order_by()` to get an ordered `DataFrame` due to the limitations of relational database systems.

### Column Transformation

Column transformation means transforming one or more existing columns into a new one of the same length.

A new column may contain data resulting from whatever computation we want.

Both GreenplumPython and pandas support transforming columns by adding a new column. Specifically, we need to:

- define the transformation as an expression, and then
- bind the expression to a new column of the source `DataFrame` to form a new one.

We can use the `assign()` method to add new columns in both packages. For example, suppose we would like to know the year of birth for each student in the previous example.

In pandas, we can add a new column named `year_of_birth` like this

In [21]:
import datetime

this_year = datetime.date.today().year
pd_df.assign(year_of_birth=lambda df: -df["age"] + this_year)


Unnamed: 0,name,age,year_of_birth
0,alice,18,2005
1,bob,19,2004
2,carol,19,2004


In GreenplumPython, we can do exactly the same:

In [22]:
student.assign(year_of_birth=lambda t: -t["age"] + this_year)


name,age,year_of_birth
alice,18,2005
bob,19,2004
carol,19,2004


The column data can result from any expression, which can contain complex computations.

For example, in order to hide the names of students to protect privacy, we can write a function transforming names to something not human-readable.

In [23]:
from hashlib import sha256

@gp.create_function
def hash_name(name: str) -> str:
    return sha256(name.encode("utf-8")).hexdigest()


The `gp.create_function` decorator converts a Python function into a User-Defined Function (UDF) in the database so that it can be applied to `Column`s.

With the function defined, we can then apply it to generate a new `Column`:

In [24]:
student.assign(name_=lambda t: hash_name(t["name"]))


name,age,name_
alice,18,2bd806c97f0e00af1a1fc3328fa763a9269723c8db8fac4f93af71db186d6e90
bob,19,81b637d8fcd2c6da6359e6963113a1170de795e4b725b84d1e0b4cfd9ec58ce9
carol,19,4c26d9074c27d89ede59270c0ac14b71e071b15239519f75474b2f3ba63481f5


After adding the new column, we can select the columns we care about into a new `DataFrame` with the `[]` operator.

To be more concise, both GreenplumPython and pandas support transforming columns directly into a new `DataFrame` by `apply()`-ing the function.

In the previous example, using `apply()`, we can obtain the GreenplumPython `DataFrame` with the original names hidden:

In [25]:
from dataclasses import dataclass, asdict


@dataclass
class Student:
    name: str
    age: int


def hide_name(name: str, age: int) -> Student:
    return Student(name=sha256(name.encode("utf-8")).hexdigest(), age=age)


student.apply(lambda t: gp.create_function(hide_name)(t["name"], t["age"]), expand=True)



We can directly apply the same Python function without any modification to the `DataFrame` in pandas:

In [None]:
pd_df.apply(
    lambda df: asdict(hide_name(df["name"], df["age"])),
    axis=1,
    result_type="expand"
)


But there are still some important **differences** between the two cases:

- In pandas, what we apply to a `DataFrame` is a Python function, while in GreenplumPython, a Python function must be converted to a database function before being applied to a `DataFrame`.
- pandas supports applying a function along different axes, while GreenplumPython only supports applying a function to each row because of the limitations of the underlying database system.


### Data Grouping

Like ordering, grouping data based on the distinct sets of values of columns can also facilitate analytical tasks.

Data grouping is often associated with aggregate functions to obtain data summaries.

For example, suppose we want to know the number of students of each age. In pandas, we can do

In [None]:
import numpy as np

pd_df.groupby("age").apply(lambda df: np.count_nonzero(df["name"]))


In GreenplumPython, what we need to do is

In [None]:
count = gp.aggregate_function("count")

student.group_by("age").apply(lambda t: count(t["name"]))


### Data Deduplication

Data deduplication returns a new data structure containing only the distinct set of values in the selected columns.

This operation is well supported in GreenplumPython, in a way very similar to pandas.

For example, suppose we want to draw a representative sample containing one student for each distinct age.

With pandas, we can do

In [None]:
pd_df.drop_duplicates("age")


With GreenplumPython, what we need to do is

In [None]:
student.distinct_on("age")


Moreover, GreenplumPython also supports aggregation on only the distinct values.

Suppose we want to know the number of different ages of the students. We can do

In [None]:
student.apply(lambda t: count.distinct(t["age"]))
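
For readers who think in SQL, counting per group and counting distinct values roughly correspond to `GROUP BY` and `count(DISTINCT ...)`. The sketch below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in for the remote database, and the SQL text is hand-written rather than what GreenplumPython generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("alice", 18), ("bob", 19), ("carol", 19)],
)

# Number of students of each age (ordered only to make the output stable).
per_age = conn.execute(
    "SELECT age, count(name) FROM student GROUP BY age ORDER BY age"
).fetchall()

# Number of different ages among the students.
(distinct_ages,) = conn.execute(
    "SELECT count(DISTINCT age) FROM student"
).fetchone()

print(per_age, distinct_ages)
conn.close()
```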


### Joins

Joins are operations that combine two data structures horizontally in a sensible way.

This makes it easier and more efficient to query one data structure based on the other.

For example, suppose we want to retrieve all pairs of students of the same age.

In pandas, we can join the `DataFrame` with itself on the "age" column:

In [None]:
pd_df.merge(pd_df, on="age", suffixes=("", "_2"))


Similarly, in GreenplumPython, we can do

In [None]:
student.join(student, on="age", other_columns={"name": "name_2"})
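
Note that a self-join on "age" returns every ordered pair, including each student paired with themselves. If only the distinct pairs are wanted, one possible follow-up in pandas is to filter the joined result, as in the minimal sketch below. The filtering step is an addition for illustration and is not part of the original example.

```python
import pandas as pd

pd_df = pd.DataFrame.from_records(
    [("alice", 18), ("bob", 19), ("carol", 19)], columns=["name", "age"]
)

# Self-join on "age": one row per ordered pair of students of the same age.
joined = pd_df.merge(pd_df, on="age", suffixes=("", "_2"))

# Keep each unordered pair once and drop the self-pairs.
pairs = joined[joined["name"] < joined["name_2"]]
print(pairs[["name", "name_2", "age"]])
```

Since a GreenplumPython `DataFrame` also accepts a predicate in the `[]` operator, a similar filter could presumably be applied to the joined `DataFrame` there as well.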


When it comes to queries that relate one data structure to another, there is an important **difference** between GreenplumPython and pandas.

pandas allows querying one `DataFrame` based on another without previously joining them.

For example, suppose we have two pandas `DataFrame`s of numbers. The operation below is *legal* even though it is not *sensible*:

In [None]:
num_1 = pd.DataFrame({"val": [1, 3, 5, 7, 9]})
num_2 = pd.DataFrame({"val": [2, 4, 6, 8, 10]})

num_1[num_2["val"] % 2 == 0]  # Even numbers?


To avoid this kind of misuse, an expression in GreenplumPython cannot refer to any `DataFrame` other than the "current" one, except through the `in_()` expression.

This is because GreenplumPython only accepts a `Callable` as the argument for an expression and binds it to the current `DataFrame` automatically.
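
To see why the pandas operation on `num_1` and `num_2` is not sensible, it helps to look at what it actually returns. The minimal sketch below repeats that example and inspects the result; the variable names `mask` and `selected` are introduced here only for illustration.

```python
import pandas as pd

num_1 = pd.DataFrame({"val": [1, 3, 5, 7, 9]})
num_2 = pd.DataFrame({"val": [2, 4, 6, 8, 10]})

# The mask is computed from num_2 only; every value there is even.
mask = num_2["val"] % 2 == 0

# pandas matches the mask to num_1 by index label (0..4), so every
# row of num_1 is kept even though none of its values is even.
selected = num_1[mask]
print(selected["val"].tolist())
```

The filter silently selects odd numbers with a predicate about even ones, which is the kind of mistake that binding expressions to the current `DataFrame` is meant to rule out.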