### Selecting the Database That Contains Your Data

To begin with, we need to select the database that contains the data we want:

In [1]:
import greenplumpython as gp

db = gp.database(host="localhost", dbname="gpadmin")

### Accessing a Table in the Database

After selecting the database, we can access one of its tables by specifying the table's name:

In [2]:
t = gp.table("demo", db=db)
t

i,j,k
6,6,6
9,9,9
1,1,1
2,2,2
3,3,3
7,7,7
4,4,4
5,5,5
8,8,8
10,10,10


We can also `SELECT` the first N rows of a table in a given order, for example ordered by column `i`:

In [3]:
t.order_by(t["i"]).head(10)

i,j,k
1,1,1
2,2,2
3,3,3
4,4,4
5,5,5
6,6,6
7,7,7
8,8,8
9,9,9
10,10,10
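
To make the semantics concrete, here is a minimal plain-Python sketch of "order by a column, then keep the first N rows". It does not use GreenplumPython or a database: `demo_rows` simply copies the values displayed above, and `ordered_head` is a hypothetical helper introduced only for this illustration.

```python
# Plain-Python model of "order by column i, then keep the first N rows".
# demo_rows copies the demo table in the order it was displayed above.
demo_rows = [(6, 6, 6), (9, 9, 9), (1, 1, 1), (2, 2, 2), (3, 3, 3),
             (7, 7, 7), (4, 4, 4), (5, 5, 5), (8, 8, 8), (10, 10, 10)]


def ordered_head(rows, key_index, n):
    """Sort rows by one column and return the first n (hypothetical helper)."""
    return sorted(rows, key=lambda row: row[key_index])[:n]


first_three = ordered_head(demo_rows, key_index=0, n=3)
print(first_three)  # [(1, 1, 1), (2, 2, 2), (3, 3, 3)]
```

Because the rows are sorted before the cut, the result is the same no matter what order the input rows arrive in.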


### Basic Data Manipulation

Now that we have a table, we can do basic data manipulation on it, just as in SQL.

For example, we can `SELECT` a subset of its columns:

In [4]:
t_ij = t[["i", "j"]]
t_ij

i,j
6,6
9,9
1,1
2,2
3,3
7,7
4,4
5,5
8,8
10,10


We can also `SELECT` a subset of its rows. Say we want only the rows where `i` is even:

In [5]:
t_even = t_ij[t_ij["i"] % 2 == 0]
t_even

i,j
6,6
2,2
4,4
8,8
10,10
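
The two operations above correspond to projection (choosing columns) and selection (choosing rows). The following plain-Python sketch models them on a copy of the demo data. It is only an illustration of the semantics, not of how GreenplumPython executes the query, and the variable names are assumptions made for this example.

```python
# Plain-Python model of column projection followed by row filtering.
# demo_rows copies the (i, j, k) values of the demo table shown earlier.
demo_rows = [(6, 6, 6), (9, 9, 9), (1, 1, 1), (2, 2, 2), (3, 3, 3),
             (7, 7, 7), (4, 4, 4), (5, 5, 5), (8, 8, 8), (10, 10, 10)]

# Projection: keep only columns i and j, like t[["i", "j"]].
ij_rows = [(i, j) for (i, j, _k) in demo_rows]

# Selection: keep only rows whose i is even, like t_ij[t_ij["i"] % 2 == 0].
even_rows = [row for row in ij_rows if row[0] % 2 == 0]

print(even_rows)  # [(6, 6), (2, 2), (4, 4), (8, 8), (10, 10)]
```

A Python list keeps its input order, so the output order here is fixed. A database makes no such promise unless an order is requested explicitly.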


For a quick glance, we can `SELECT` N rows of a table without specifying an order. Because no order is given, which rows come back is not guaranteed:

In [6]:
t_n = t_even[:3]
t_n

i,j
6,6
4,4
8,8


Finally, when we are done, we can save the resulting table to the database, either temporarily or permanently. Since `t_n` was selected without an order, the saved rows may differ from those displayed earlier:

In [7]:
t_n.save_as(table_name="t_n", temp=True)

i,j
4,4
8,8
10,10


### `JOIN`-ing Two Tables

We can also `JOIN` two tables with GreenplumPython. For example, suppose we have two tables like this:

In [8]:
rows = [(1, "'a'"), (2, "'b'"), (3, "'c'"), (4, "'d'")]
t1 = gp.values(rows, db=db, column_names=["id", "val"])
t1

id,val
1,'a'
2,'b'
3,'c'
4,'d'


In [9]:
rows = [(1, "'a'"), (2, "'b'"), (3, "'a'"), (4, "'b'")]
t2 = gp.values(rows, db=db, column_names=["id", "val"])
t2

id,val
1,'a'
2,'b'
3,'a'
4,'b'


We can `JOIN` the two tables like this:

In [10]:
t_join = t1.inner_join(
    t2,
    cond=t1["val"] == t2["val"],
    targets=[
        t1["id"].rename("t1_id"),
        t1["val"].rename("t1_val"),
        t2["id"].rename("t2_id"),
        t2["val"].rename("t2_val"),
    ],
)
t_join

t1_id,t1_val,t2_id,t2_val
1,'a',3,'a'
1,'a',1,'a'
2,'b',4,'b'
2,'b',2,'b'
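
To see why the result has exactly these four rows, the inner join on `val` can be modelled in plain Python as a filtered cross product. This is a sketch of the join semantics only. `inner_join_on_val` is a hypothetical helper, and the strings are written without the embedded SQL quotes used above.

```python
# Plain-Python model of an inner join on the "val" column.
t1_rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
t2_rows = [(1, "a"), (2, "b"), (3, "a"), (4, "b")]


def inner_join_on_val(left, right):
    """Pair every left row with every right row that has the same val."""
    return [
        (l_id, l_val, r_id, r_val)
        for (l_id, l_val) in left
        for (r_id, r_val) in right
        if l_val == r_val
    ]


joined = inner_join_on_val(t1_rows, t2_rows)
for row in joined:
    print(row)
# (1, 'a', 1, 'a')
# (1, 'a', 3, 'a')
# (2, 'b', 2, 'b')
# (2, 'b', 4, 'b')
```

The rows of `t1` with `val` equal to `c` or `d` have no partner in `t2`, so an inner join drops them.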


### Creating and Calling Functions

Calling functions is essential for data analytics. GreenplumPython supports creating Greenplum user-defined functions (UDFs) and user-defined aggregates (UDAs) from Python functions and calling them from Python.

Suppose we have a table of numbers:

In [11]:
rows = [(i,) for i in range(10)]
numbers = gp.values(rows, db=db, column_names=["val"])
numbers

val
0
1
2
3
4
5
6
7
8
9


If we want to get the square of each number, we can write a function to do that:

In [12]:
@gp.create_function
def square(a: int) -> int:
    return a ** 2

square(numbers["val"], as_name="result", db=db).to_table()

result
0
1
4
9
16
25
36
49
64
81


Note that the function is called much like an ordinary Python function, apart from the extra `as_name` and `db` keyword arguments.

If we also want the sum of these numbers, we can write an aggregate function like this:

In [13]:
@gp.create_aggregate
def my_sum(result: int, val: int) -> int:
    if result is None:
        return val
    return result + val

my_sum(numbers["val"], as_name="result", db=db).to_table()

result
45
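
The body of `my_sum` reads as a state-transition function: it receives the running result and the next value and returns the new result. The sketch below models that in plain Python with `functools.reduce`. It assumes the aggregate is invoked once per row with the state starting as `None`, which is consistent with the `result is None` check above but is an assumption about the execution model rather than something shown in this section. The undecorated copy of `my_sum` exists only so the example runs without a database.

```python
from functools import reduce


def my_sum(result, val):
    """Undecorated copy of the transition function shown above."""
    if result is None:
        # First row: no running result yet, so the value becomes the state.
        return val
    return result + val


values = list(range(10))  # same numbers as the "numbers" table

# Fold the transition function over the values, starting from None.
total = reduce(my_sum, values, None)
print(total)  # 45
```

Starting from `None` instead of `0` lets the same pattern work for aggregates that have no natural zero value, such as a maximum.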
