# Tutorial - QFrame

## What is a QFrame?
QFrame is a class that generates an SQL statement. It stores information about the fields in the `QFrame.data` attribute, which is a dictionary.

`QFrame.data` has a `select` key, under which it stores the `fields` that we want to have in our SQL statement. Each field must have a `type`, which can be 'dim' if the variable is a dimension variable or 'num' if the variable is a numeric variable. Let's take a look at all the options that we can set under the `select` and `fields` keys.

```json
{
  "select": {
    "table": "table",
    "schema": "schema",
    "fields": {
      "column": {
        "type": "dim",
        "as": "",
        "group_by": "",
        "order_by": "",
        "expression": "",
        "select": "",
        "custom_type": ""
      }
    },
    "where": "",
    "distinct": "",
    "having": "",
    "limit": ""
  }
}
```

- `table` - Name of the table.
- `schema` - Name of the schema.
- `fields`, in each field:
    - `type` - Type of the column. Options:

        - 'dim' - VARCHAR(500)  
        - 'num' - FLOAT
     
     Every column must have a type specified. If you want to specify another type, see `custom_type`.
    - `as` - Column alias (name).

    - `group_by` - Aggregation type. Possibilities:

        - 'group' - This field will go to the GROUP BY statement.
        - {'sum', 'count', 'min', 'max', 'avg'} - This field will be aggregated in the specified way.

     If you don't want to aggregate fields, leave `group_by` empty in each field.
    - `order_by` - Put the field in the ORDER BY statement. Options:
    
        - 'ASC'
        - 'DESC'
        
    - `expression` - Expression, e.g. a CASE statement, a column operation or a CONCAT statement.
    - `select` - Set to 0 if you don't want to put this field in the SELECT statement.
    - `custom_type` - Specify a custom SQL data type, e.g. DATE.
- `where` - Add a WHERE statement, e.g. 'sales>100'.
- `distinct` - Set to 1 to add DISTINCT to the SELECT statement.
- `having` - Add a HAVING statement, e.g. 'sum(sales)>100'.
- `limit` - Add a limit, e.g. 100.
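
The rules in this list can be checked mechanically. The following is a minimal sketch in plain Python and is not part of grizly: the helper name `check_fields` is hypothetical, and it only verifies what the list above states, namely that every field has a `type` of 'dim' or 'num' and that `group_by` and `order_by` use one of the listed options.

```python
# Hypothetical helper, not part of grizly: validates a QFrame-style
# dictionary against the options listed above.
ALLOWED_TYPES = {"dim", "num"}
ALLOWED_GROUP_BY = {"", "group", "sum", "count", "min", "max", "avg"}
ALLOWED_ORDER_BY = {"", "ASC", "DESC"}


def check_fields(data):
    """Return a list of problems found in data['select']['fields']."""
    problems = []
    for name, field in data["select"]["fields"].items():
        if field.get("type") not in ALLOWED_TYPES:
            problems.append(f"{name}: 'type' must be 'dim' or 'num'")
        if field.get("group_by", "") not in ALLOWED_GROUP_BY:
            problems.append(f"{name}: unknown 'group_by' value")
        if field.get("order_by", "") not in ALLOWED_ORDER_BY:
            problems.append(f"{name}: unknown 'order_by' value")
    return problems


example = {
    "select": {
        "table": "table",
        "schema": "schema",
        "fields": {
            "customer": {"type": "dim", "group_by": "group"},
            "sales": {"type": "num", "group_by": "sum", "order_by": "DESC"},
            "broken": {"group_by": "median"},  # no type, unsupported aggregation
        },
    }
}

problems = check_fields(example)
for problem in problems:
    print(problem)
```

Only the `broken` field is reported; the other two fields use options from the list.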

## How to create a QFrame?
You can create a QFrame manually, by passing the data directly to QFrame, or automatically, by using the `initiate` function.

In [None]:
from grizly import (
    get_path, 
    QFrame
)

### Manually - using dictionary

This is the most direct way of creating a QFrame. To use it you need to know the structure of `QFrame.data`. From the following dictionary

In [None]:
data = {
  "select": {
    "table": "table",
    "schema": "schema",
    "fields": {
      "col": {
        "type": "dim"
      }
    }
  }
}

QFrame will generate a simple SQL statement

In [None]:
qf = QFrame().read_dict(data)
qf.get_sql()

Here we also used the simple method `.get_sql()`, which prints the SQL saved in the QFrame.
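
To give a feel for how such a dictionary maps onto a statement, here is a deliberately simplified sketch in plain Python. It is not grizly's implementation, and the exact text returned by `get_sql()` may differ. The function name `sketch_sql` is hypothetical, and it handles only the `fields`, `schema`, `table`, `as`, `select`, `where` and `limit` options.

```python
def sketch_sql(data):
    """Build a basic SELECT statement from a QFrame-style dictionary (illustration only)."""
    select = data["select"]
    columns = []
    for name, field in select["fields"].items():
        if field.get("select") == 0:
            continue  # field is excluded from the SELECT list
        alias = field.get("as", "")
        columns.append(f"{name} AS {alias}" if alias else name)

    source = select["table"]
    if select.get("schema"):
        source = f"{select['schema']}.{source}"

    sql = f"SELECT {', '.join(columns)} FROM {source}"
    if select.get("where"):
        sql += f" WHERE {select['where']}"
    if select.get("limit"):
        sql += f" LIMIT {select['limit']}"
    return sql


data = {
    "select": {
        "table": "table",
        "schema": "schema",
        "fields": {"col": {"type": "dim"}},
    }
}

print(sketch_sql(data))  # SELECT col FROM schema.table
```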

### Manually - using JSON file

We use a `.json` file to conveniently manipulate information about columns, renames and other things that would be very verbose to manipulate in Python code. We can edit the JSON file in a JSON editor like http://jsoneditoronline.org/ more conveniently than in Python code.

After editing `store.json` we can read it back into a QFrame using `read_json()`.

This means we can use our JSON file as our main `store` of verbose information and Python as our main way to manipulate that information.

In [None]:
json_path = get_path("dev", "grizly", "notebooks","store.json")
qf.save_json(json_path=json_path, subquery="my_query_1")

qf = QFrame().read_json(json_path=json_path, subquery="my_query_1")
qf.get_sql()
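
The save, edit and reload round trip can be pictured with the standard `json` module alone. This sketch assumes, for illustration only, that the file holds one entry per subquery name; the real layout written by `save_json()` may differ, and the temporary file path is made up.

```python
import json
import os
import tempfile

data = {
    "select": {
        "table": "table",
        "schema": "schema",
        "fields": {"col": {"type": "dim"}},
    }
}

# Made-up location for the example; the tutorial uses get_path(...) instead.
path = os.path.join(tempfile.mkdtemp(), "store.json")

# Save the dictionary under a subquery name (assumed file layout).
with open(path, "w") as f:
    json.dump({"my_query_1": data}, f, indent=2)

# Simulate an edit made outside Python, e.g. in a JSON editor: add an alias.
with open(path) as f:
    store = json.load(f)
store["my_query_1"]["select"]["fields"]["col"]["as"] = "column_alias"
with open(path, "w") as f:
    json.dump(store, f, indent=2)

# Read the edited entry back.
with open(path) as f:
    restored = json.load(f)["my_query_1"]

print(restored["select"]["fields"]["col"])
```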

### Automatically - using initiate function

The other way to generate a QFrame is to use the `initiate` function. You can use it in two ways. The first is to pass the column names directly.

In [None]:
from grizly import initiate

initiate(columns=["col1", "col2"], 
         schema="schema", 
         table="table", 
         json_path=json_path,
         subquery="my_query_2")

qf = QFrame().read_json(json_path=json_path, subquery="my_query_2")
qf.get_sql()
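
As a rough mental model of what gets generated from a list of column names, the sketch below builds a QFrame-style dictionary in plain Python. It is not a description of what `initiate` actually writes to the file: `make_data` is a hypothetical name, and giving every column the type 'dim' is an assumption made for the example.

```python
def make_data(columns, schema, table):
    """Build a QFrame-style dictionary with one 'dim' field per column name."""
    return {
        "select": {
            "table": table,
            "schema": schema,
            # Assumption for this sketch: every column starts as a 'dim' field.
            "fields": {column: {"type": "dim"} for column in columns},
        }
    }


data = make_data(["col1", "col2"], schema="schema", table="table")
print(data["select"]["fields"])
```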

The second way is to use the `get_columns` function, which imports the names of all the columns in the given table, along with their types.

In [None]:
from grizly import get_columns

columns, col_types = get_columns(table='table_tutorial',
                                 schema='administration',
                                 column_types=True,
                                 db='redshift')
initiate(columns=columns,
         col_types=col_types,
         schema="administration", 
         table="table_tutorial", 
         json_path=json_path,
         subquery="my_query_3")

qf = QFrame(engine="mssql+pyodbc://redshift_acoe").read_json(json_path=json_path, subquery="my_query_3")
qf.get_sql()

## Working with the QFrame
There are many methods that you can use to edit a QFrame. You can find them in the QFrame docs. In this tutorial we will show only some of them.

### Doing some basic SQL stuff
Let's now add a `where` statement, rename some fields, add a calculated field and remove some fields.

In [None]:
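qf.query("col2 > 1")  # adds the WHERE condition
qf.rename({"col1": "items", "col2": "price"})  # sets the aliases of col1 and col2
qf.assign(calculated_field="col4*2", type='num', custom_type='double precision')  # adds a calculated field
qf.remove(["col3", "col4"])  # removes these fields from the QFrame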
qf.get_sql()

Be aware that the `rename()` method doesn't change the name of the field, only the alias (final name) of the column.
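
The difference between a field name and its alias is easy to see on the dictionary itself. The sketch below works on a plain dictionary shaped like `QFrame.data`; `rename_sketch` is a hypothetical stand-in written for this example, not grizly's `rename()`.

```python
def rename_sketch(data, aliases):
    """Set the alias ('as') of each listed field; the field keys stay untouched."""
    for field, alias in aliases.items():
        data["select"]["fields"][field]["as"] = alias
    return data


data = {
    "select": {
        "table": "table",
        "fields": {"col1": {"type": "dim"}, "col2": {"type": "num"}},
    }
}

rename_sketch(data, {"col1": "items", "col2": "price"})

print(list(data["select"]["fields"]))          # ['col1', 'col2']
print(data["select"]["fields"]["col2"]["as"])  # price
```

This is why later calls still refer to the fields by their original names, such as `col2`, even after an alias has been set.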

Now you can check how the data changed by inspecting the `data` attribute.

In [None]:
qf.data

You can see that we now also have a `sql_blocks` key. You can ignore it. This key is used to build the SQL statement and is generated every time the `get_sql()` method is called.

### Forking

Forking QFrames can be important if your data workflow needs to take the same SQL table and apply different transformations to it.

Sometimes we want to fork, do some transforms, then union the QFrames back together, which results in an append operation on the data side.

Let's create two copies of one QFrame.

In [None]:
qf1 = qf.copy()
qf2 = qf.copy()
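
The point of copying, rather than assigning the same object to two names, is that each fork can then be changed independently. The sketch below shows the idea on plain dictionaries with the standard `copy.deepcopy`; whether `QFrame.copy()` works this way internally is an assumption.

```python
import copy

base = {
    "select": {
        "table": "table",
        "fields": {"col1": {"type": "dim"}, "col2": {"type": "num"}},
    }
}

# Two independent forks of the same starting point.
fork_a = copy.deepcopy(base)
fork_b = copy.deepcopy(base)

# Each fork receives its own transformation.
fork_a["select"]["where"] = "col2 > 1"
fork_b["select"]["fields"]["col2"]["group_by"] = "sum"

# The original is unaffected by either change.
print("where" in base["select"])                        # False
print("group_by" in base["select"]["fields"]["col2"])   # False
```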

## Unioning data

There are two ways of unioning two QFrames - we can union by the position of the field or by the final name of the columns (that means the alias). 

In [None]:
from grizly import union

qf1.rename({"col2": "price_1", "calculated_field": "price_2"})
qf2.rename({"col2": "price_2", "calculated_field": "price_1"})

#### Union by the position

In [None]:
uqf_pos = union(qframes=[qf1, qf2], union_type="UNION ALL", union_by='position')
uqf_pos.get_sql()

#### Union by the column names

In [None]:
uqf_name = union(qframes=[qf1, qf2], union_type="UNION ALL", union_by='name')
uqf_name.get_sql()

You can see that in this case union changes the order of the columns. 
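
The two modes can be illustrated without a database. In the sketch below each query is a list of `(field, alias)` pairs; the field order `col1`, `col2`, `calculated_field` is an assumption about the QFrames above, and both helper functions are hypothetical illustrations rather than grizly code.

```python
first = [("col1", "items"), ("col2", "price_1"), ("calculated_field", "price_2")]
second = [("col1", "items"), ("col2", "price_2"), ("calculated_field", "price_1")]


def order_by_position(first, second):
    """Keep the second query's fields in their own order."""
    if len(first) != len(second):
        raise ValueError("both queries need the same number of columns")
    return list(second)


def order_by_name(first, second):
    """Reorder the second query's fields to follow the first query's aliases."""
    by_alias = {alias: (field, alias) for field, alias in second}
    return [by_alias[alias] for _, alias in first]


print(order_by_position(first, second))
print(order_by_name(first, second))
```

By position, `col2` of the second query (aliased `price_2`) lands under `price_1` of the first. By name, the second query's fields are rearranged so that equal aliases line up.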

## Joining data

In [None]:
from grizly import join

We will be using `Chinook.sqlite` to visualize data.

In [None]:
engine_string = "sqlite:///" + get_path("dev", "grizly", "tests", "Chinook.sqlite")

### Simple join

The first table is the `Track` table.

In [None]:
tracks = {  'select': {
                'fields': {
                    'TrackId': { 'type': 'dim'},
                    'Name': {'type': 'dim'},
                    'AlbumId': {'type': 'dim'},
                    'Composer': {'type': 'dim'},
                    'UnitPrice': {'type': 'num'}
                },
                'table': 'Track'
            }
}
tracks_qf = QFrame(engine=engine_string).read_dict(tracks)
tracks_qf.get_sql()

In [None]:
tracks_qf.to_df().sample(5)

The second table is the `PlaylistTrack` table.

In [None]:
playlist_track = { "select": {
                        "fields":{
                            "PlaylistId": {"type" : "dim"},
                            "TrackId": {"type" : "dim"}
                        },
                        "table" : "PlaylistTrack"
                    }
                }

playlist_track_qf = QFrame(engine=engine_string).read_dict(playlist_track)
playlist_track_qf.get_sql()

In [None]:
playlist_track_qf.to_df().sample(5)

Now let's join them on `TrackId`.

In [None]:
joined_qf = join([tracks_qf,playlist_track_qf], join_type="left join", on="sq1.TrackId=sq2.TrackId")

joined_qf.get_sql()

In [None]:
joined_qf.to_df().sample(5)
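
For readers without the Chinook database at hand, the same kind of left join can be tried with Python's built-in `sqlite3` module on a tiny in-memory stand-in. The rows below are made up for illustration and the query is hand-written; it is not the SQL produced by `join()`, although it borrows the `sq1` and `sq2` aliases used in the `on` condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Track (TrackId INTEGER, Name TEXT, UnitPrice REAL);
CREATE TABLE PlaylistTrack (PlaylistId INTEGER, TrackId INTEGER);
INSERT INTO Track VALUES (1, 'Song A', 0.99), (2, 'Song B', 0.99), (3, 'Song C', 1.99);
INSERT INTO PlaylistTrack VALUES (10, 1), (10, 2), (20, 1);
""")

# Each table is wrapped in a subquery and joined on TrackId.
rows = conn.execute("""
SELECT sq1.TrackId, sq1.Name, sq1.UnitPrice, sq2.PlaylistId
FROM (SELECT TrackId, Name, UnitPrice FROM Track) sq1
LEFT JOIN (SELECT PlaylistId, TrackId FROM PlaylistTrack) sq2
ON sq1.TrackId = sq2.TrackId
ORDER BY sq1.TrackId, sq2.PlaylistId
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

Track 3 is on no playlist, so the left join keeps it with an empty `PlaylistId`, while track 1 appears once per playlist.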

As you can see in this example, `UnitPrice` is taken from the first table. By default, the join function takes all fields from the first QFrame, then all the fields from the second QFrame that are not in the first, and so on. If you want to keep all fields from each QFrame, you have to set `unique_col=False`. We will see how this works in the next example.
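
The field selection rule can be sketched in a few lines of plain Python. `pick_fields` is a hypothetical helper written for this illustration; it works on lists of field names and only mirrors the behaviour described in this section, prefixing each field with `sq1`, `sq2` and so on.

```python
def pick_fields(qframes_fields, unique_col=True):
    """Return the fields a join would keep, prefixed with sq1, sq2, ..."""
    selected = []
    seen = set()
    for index, fields in enumerate(qframes_fields, start=1):
        for name in fields:
            if unique_col and name in seen:
                continue  # already taken from an earlier QFrame
            seen.add(name)
            selected.append(f"sq{index}.{name}")
    return selected


tracks = ["TrackId", "Name", "AlbumId", "Composer", "UnitPrice"]
playlist_track = ["PlaylistId", "TrackId"]

print(pick_fields([tracks, playlist_track]))
print(pick_fields([tracks, playlist_track], unique_col=False))
```

With the default setting `TrackId` appears once, taken from the first list; with `unique_col=False` it appears twice, as `sq1.TrackId` and `sq2.TrackId`.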

### Multiple join

Now let's use one more table to see what a multiple join looks like.

In [None]:
playlists = { "select": {
                    "fields": {
                        "PlaylistId": {"type" : "dim"},
                        "Name": {"type" : "dim"}
                    },
                    "table" : "Playlist"
                }
            }

playlists_qf = QFrame(engine=engine_string).read_dict(playlists)
playlists_qf.get_sql()

In [None]:
playlists_qf.to_df().sample(5)

Now if we want to join the `Track`, `PlaylistTrack` and `Playlist` tables we can use `TrackId` and `PlaylistId`. We can see that the `Track` and `Playlist` tables both have a `Name` column. Let's try the option `unique_col=False` and analyse the duplicated columns.

In [None]:
joined_qf = join(
    qframes=[tracks_qf, playlist_track_qf, playlists_qf],
    join_type=['left join', 'left join'],
    on=['sq1.TrackId=sq2.TrackId', 'sq2.PlaylistId=sq3.PlaylistId'],
    unique_col=False,
)

In [None]:
joined_qf.show_duplicated_columns()

We can see that three columns each occur in two different tables. We will remove the `sq2.TrackId` and `sq2.PlaylistId` fields and rename the `Name` columns.
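
The duplicates can also be found by hand from the three field lists. The sketch below is plain Python and does not reproduce the output format of `show_duplicated_columns()`; the helper name `duplicated_columns` is hypothetical.

```python
from collections import defaultdict


def duplicated_columns(qframes_fields):
    """Map each field name that occurs in more than one QFrame to its sq aliases."""
    owners = defaultdict(list)
    for index, fields in enumerate(qframes_fields, start=1):
        for name in fields:
            owners[name].append(f"sq{index}")
    return {name: aliases for name, aliases in owners.items() if len(aliases) > 1}


tracks = ["TrackId", "Name", "AlbumId", "Composer", "UnitPrice"]
playlist_track = ["PlaylistId", "TrackId"]
playlists = ["PlaylistId", "Name"]

duplicates = duplicated_columns([tracks, playlist_track, playlists])
for name, aliases in duplicates.items():
    print(name, aliases)
```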

In [None]:
joined_qf.remove(['sq2.TrackId', 'sq2.PlaylistId']).rename({'sq1.Name': 'TrackName', 'sq3.Name': 'PlaylistType'})
joined_qf.get_sql()

In [None]:
joined_qf.to_df().sample(5)