# Basic Usage

### Prerequisites

To run this tutorial, we need:

- A database that we are authorized to access, and
- A table named `demo`, created with the following SQL command:

    ```sql
    CREATE TABLE demo AS
    SELECT n AS i, n AS j, n AS k
    FROM generate_series(0,9) AS n;
    ```

To create this table from a shell, we can use [psql](https://www.postgresql.org/docs/current/app-psql.html).

Alternatively, inside a Jupyter Notebook, the SQL command can be executed directly in cells with [ipython-sql](https://pypi.org/project/ipython-sql/), as shown below.

First, we connect to the database (`gpadmin` in our example) by passing its URI to the `%sql` magic:

In [16]:
%load_ext sql
%sql postgresql://localhost/gpadmin



Authentication methods and credentials can be specified in the URI. See the [libpq documentation](https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING) for details.

In [17]:
%%sql

DROP TABLE IF EXISTS demo;

CREATE TABLE demo AS
SELECT n AS i, n AS j, n AS k
FROM generate_series(0,9) AS n;

 * postgresql://localhost/gpadmin
Done.
10 rows affected.


[]

With the table created successfully, we are now good to go! 

### Getting Access to the Database

To get access to the database, we create a database object from its URI:

In [18]:
import greenplumpython as gp

db = gp.database(uri="postgresql://localhost/gpadmin")

Here, the `uri` follows the same specification as above, described in the [libpq documentation](https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING).

As another example, if a password is required, the `uri` might look like `postgresql://user:password@hostname/dbname`.
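
If the user name or password contains characters with special meaning in a URI (such as `@` or `/`), libpq expects them to be percent-encoded. The following is a minimal sketch using only the Python standard library; the helper name `build_uri` and the example credentials are hypothetical and not part of GreenplumPython.

```python
from urllib.parse import quote


def build_uri(user: str, password: str, host: str, dbname: str) -> str:
    """Build a libpq-style URI (hypothetical helper, not a GreenplumPython API)."""
    # safe="" makes quote() percent-encode every reserved character,
    # including "/" and "@", which would otherwise break the URI structure.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}/{dbname}"
    )


# Example credentials are made up for illustration.
uri = build_uri("gpadmin", "p@ss/word", "localhost", "gpadmin")
print(uri)
```

The resulting string can then be passed as the `uri` argument shown above.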

### Accessing a DataFrame in the Database

After connecting to the database, we can access one of its tables as a dataframe by specifying the table name:

In [19]:
t = db.create_dataframe(table_name="demo")
t

i,j,k
1,1,1
3,3,3
4,4,4
5,5,5
6,6,6
9,9,9
0,0,0
2,2,2
7,7,7
8,8,8


Note that the rows above come back in no particular order. To `SELECT` the first N rows in a defined order, we can sort the dataframe before slicing it, like this:

In [20]:
t.order_by("i")[:10]

i,j,k
0,0,0
1,1,1
2,2,2
3,3,3
4,4,4
5,5,5
6,6,6
7,7,7
8,8,8
9,9,9
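
As a rough mental model, sorting and then slicing behaves like `sorted(...)[:n]` on a plain Python list. The sketch below only illustrates this on local data copied from the unordered output earlier in this section; it does not use GreenplumPython or touch the database.

```python
# Rows in the arbitrary order returned by the database in the earlier output.
rows = [
    (1, 1, 1), (3, 3, 3), (4, 4, 4), (5, 5, 5), (6, 6, 6),
    (9, 9, 9), (0, 0, 0), (2, 2, 2), (7, 7, 7), (8, 8, 8),
]

n = 3
# Sort by the first column (i), then keep the first n rows.
first_n = sorted(rows, key=lambda row: row[0])[:n]
print(first_n)
```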


### Basic Data Manipulation

Now we have a dataframe. We can do basic data manipulation on it, just like in SQL.

For example, we can `SELECT` a subset of its columns:

In [21]:
t_ij = t[["i", "j"]]
t_ij

i,j
5,5
6,6
9,9
1,1
3,3
4,4
0,0
2,2
7,7
8,8


We can also `SELECT` a subset of its rows. Say we want only the rows where `i` is even:

In [22]:
t_even = t_ij[lambda t: t["i"] % 2 == 0]
t_even

i,j
6,6
4,4
0,0
2,2
8,8


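For a quick glance, we can `SELECT` N rows of a dataframe without specifying an order. Because no order is given, which rows are returned is not guaranteed and may differ between evaluations: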

In [23]:
t_n = t_even[:3]
t_n

i,j
4,4
6,6
0,0
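
The three steps above (column selection, row filtering, and taking the first N rows) can be pictured with plain Python lists. This is only an analogy on local data, not how GreenplumPython works internally: a Python list has a fixed order, whereas the database may return unordered rows in any order.

```python
# Local stand-in for the demo table: ten rows with i = j = k = 0..9.
rows = [{"i": n, "j": n, "k": n} for n in range(10)]

# Column selection, analogous to t[["i", "j"]].
t_ij = [{"i": r["i"], "j": r["j"]} for r in rows]

# Row filter, analogous to t_ij[lambda t: t["i"] % 2 == 0].
t_even = [r for r in t_ij if r["i"] % 2 == 0]

# First N rows, analogous to t_even[:3] (deterministic here, unlike in SQL).
t_n = t_even[:3]
print(t_n)
```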


Finally, when we are done, we can save the resulting dataframe to the database, either temporarily or persistently. Here, `temp=True` saves it as a temporary table:

In [24]:
t_n.save_as(table_name="t_n", column_names=["i", "j"], temp=True)

i,j
6,6
2,2
0,0


### Joining Two DataFrames

We can also `JOIN` two dataframes with GreenplumPython. For example, suppose we have two dataframes like this:

In [25]:
rows = [
    (1, "'a'"),
    (2, "'b'"),
    (3, "'c'"),
    (4, "'d'"),
]
t1 = db.create_dataframe(rows=rows, column_names=["id", "val"])
t1

id,val
1,'a'
2,'b'
3,'c'
4,'d'


In [26]:
rows = [
    (1, "'a'"),
    (2, "'b'"),
    (3, "'a'"),
    (4, "'b'"),
]
t2 = db.create_dataframe(rows=rows, column_names=["id", "val"])
t2

id,val
1,'a'
2,'b'
3,'a'
4,'b'


We can `JOIN` the two dataframes on the `val` column, renaming the columns from each side so they remain distinguishable in the result:

In [27]:
t_join = t1.join(
    t2,
    on="val",
    self_columns={"id": "t1_id", "val": "t1_val"},
    other_columns={"id": "t2_id", "val": "t2_val"},
)
t_join

t1_id,t1_val,t2_id,t2_val
1,'a',3,'a'
1,'a',1,'a'
2,'b',4,'b'
2,'b',2,'b'
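
The output above is consistent with an inner join on `val`: rows `'c'` and `'d'` from `t1` have no match in `t2` and therefore do not appear. The following sketch reproduces the same result in plain Python to make those semantics explicit; the function `inner_join_on_val` is a hypothetical illustration, not part of GreenplumPython.

```python
t1_rows = [(1, "'a'"), (2, "'b'"), (3, "'c'"), (4, "'d'")]
t2_rows = [(1, "'a'"), (2, "'b'"), (3, "'a'"), (4, "'b'")]


def inner_join_on_val(left, right):
    """Pair every left row with every right row that has the same val."""
    return [
        (l_id, l_val, r_id, r_val)
        for l_id, l_val in left
        for r_id, r_val in right
        if l_val == r_val
    ]


# Columns correspond to t1_id, t1_val, t2_id, t2_val.
joined = inner_join_on_val(t1_rows, t2_rows)
for row in joined:
    print(row)
```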


### Creating and Calling Functions

Calling functions is essential for data analytics. GreenplumPython supports creating Greenplum user-defined functions (UDFs) and user-defined aggregates (UDAs) from Python functions and calling them in Python.

Suppose we have a dataframe of numbers:

In [28]:
rows = [(i,) for i in range(10)]
numbers = db.create_dataframe(rows=rows, column_names=["val"])
numbers

val
0
1
2
3
4
5
6
7
8
9


If we want to get the square of each number, we can write a function to do that:

In [29]:
@gp.create_function
def square(a: int) -> int:
    return a**2


numbers.apply(lambda t: square(t["val"]), column_name="square")

square
0
1
4
9
16
25
36
49
64
81


Note that this function is called in exactly the same way as an ordinary Python function; the only difference is that its argument is a column of the dataframe rather than a single value.

To get the sum of these numbers, we can write an aggregate function like this:
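
Since the function body is ordinary Python, its logic can be sanity-checked locally before it is registered in the database. The sketch below uses an undecorated copy of `square` on the same values `0..9`; it runs entirely in the local interpreter and does not involve GreenplumPython.

```python
def square(a: int) -> int:
    # Same body as the UDF above, but without the @gp.create_function decorator.
    return a**2


# Apply the function to each value, mirroring the one-row-at-a-time evaluation.
squares = [square(v) for v in range(10)]
print(squares)
```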

In [30]:
@gp.create_aggregate
def my_sum(result: int, val: int) -> int:
    if result is None:
        return val
    return result + val


numbers.apply(lambda t: my_sum(t["val"]), column_name="sum")

sum
45
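
One way to read `my_sum` is as a fold: the first parameter carries the running result, and the function is called once per row with the next value. The sketch below mimics this locally with `functools.reduce`, assuming the running result starts as `None` (as the `result is None` check suggests). It is an illustration of the idea, not a description of how Greenplum executes the aggregate.

```python
from functools import reduce
from typing import Optional


def my_sum(result: Optional[int], val: int) -> int:
    # Undecorated copy of the aggregate body shown above.
    if result is None:
        return val
    return result + val


# Fold my_sum over 0..9, starting from an empty (None) running result.
total = reduce(my_sum, range(10), None)
print(total)
```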
