In [1]:
import matplotlib.cbook

import warnings
import plotnine
warnings.filterwarnings(module='plotnine*', action='ignore')
warnings.filterwarnings(module='matplotlib*', action='ignore')

%matplotlib inline

# Querying SQL (intro)

## Reading in data

In this tutorial, we'll use the mtcars data ([source](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html)) that comes packaged with siuba. This data contains information about 32 cars, like their miles per gallon (`mpg`) and number of cylinders (`cyl`). siuba ships this data as a pandas DataFrame.

In [2]:
from siuba.data import mtcars

mtcars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


First, we'll use sqlalchemy's `create_engine` function and the pandas method `to_sql` to copy the data into a SQLite table.
Once we have that, `siuba` can use a class called `LazyTbl` to connect to the table.

In [3]:
from sqlalchemy import create_engine
from siuba.sql import LazyTbl

# copy into an in-memory sqlite database
engine = create_engine("sqlite:///:memory:")
mtcars.to_sql("mtcars", engine, if_exists = "replace")

# connect with siuba
tbl_mtcars = LazyTbl(engine, "mtcars")

tbl_mtcars

Unnamed: 0,index,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


Notice that `siuba` by default prints a glimpse of the current data, along with some extra information about the database we're connected to. The preview shows only the first 5 rows, though, and the table holds all 32 cars. To get all of the data back as a pandas DataFrame, we need to `collect()` it.

## Connecting to existing database

While we used `sqlalchemy.create_engine` to connect to a database in the previous section, `LazyTbl` also accepts a connection string as its first argument, followed by a table name.

This is shown below with placeholder values, like "username" and "password". See the [SQLAlchemy documentation on database URLs](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) for more.

```python
tbl = LazyTbl(
    "postgresql://username:password@localhost:5432/dbname",
    "tablename"
    )
```
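
To make the parts of that connection string concrete, here is a small sketch that splits the placeholder URL with Python's standard `urllib.parse`. SQLAlchemy does its own URL parsing, so this only illustrates the layout (dialect, credentials, host, port, database name). It is not a description of how `LazyTbl` handles the string internally, and the `url_parts` dictionary is a name made up for this example.

```python
from urllib.parse import urlsplit

# Placeholder URL from the example above; none of these values are real credentials.
url = "postgresql://username:password@localhost:5432/dbname"

parts = urlsplit(url)

# Hypothetical helper dictionary, used only to label each piece of the URL.
url_parts = {
    "dialect": parts.scheme,              # which database backend to talk to
    "username": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,                   # parsed as an int
    "database": parts.path.lstrip("/"),   # path component without the leading slash
}

for name, value in url_parts.items():
    print(f"{name}: {value}")
```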

## Collecting data and previewing queries

In [4]:
from siuba import head, collect, show_query

tbl_mtcars >> head(2) >> collect()

Unnamed: 0,index,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4


In [5]:
tbl_mtcars >> head(2) >> show_query()

SELECT mtcars."index", mtcars.mpg, mtcars.cyl, mtcars.disp, mtcars.hp, mtcars.drat, mtcars.wt, mtcars.qsec, mtcars.vs, mtcars.am, mtcars.gear, mtcars.carb 
FROM mtcars
 LIMIT 2 OFFSET 0


Unnamed: 0,index,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
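
Because `show_query()` prints ordinary SQL, the statement can also be run without siuba. The sketch below does that with Python's built-in `sqlite3` module. The three-row, four-column table is a cut-down stand-in for `mtcars`, built only for this illustration, and the query keeps the shape of the printed one with a shorter column list.

```python
import sqlite3

# Stand-in table: the first three mtcars rows shown earlier, limited to a few columns.
rows = [
    (0, 21.0, 6, 110),
    (1, 21.0, 6, 110),
    (2, 22.8, 4, 93),
]

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE mtcars ("index" INTEGER, mpg REAL, cyl INTEGER, hp INTEGER)')
con.executemany("INSERT INTO mtcars VALUES (?, ?, ?, ?)", rows)

# Same shape as the query printed by show_query(), with fewer columns.
query = """
SELECT mtcars."index", mtcars.mpg, mtcars.cyl, mtcars.hp
FROM mtcars
 LIMIT 2 OFFSET 0
"""

first_two = con.execute(query).fetchall()
print(first_two)
con.close()
```

The `LIMIT 2` clause is what `head(2)` turns into: the database returns two rows instead of sending the whole table back to Python.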


## Basic queries

A core goal of `siuba` is to make sure most column operations and methods that work on a pandas DataFrame also work with a SQL table. As a result, the examples in these docs also work when applied to SQL.

This is shown below for `filter`, `summarize`, `select`, and `mutate`.

In [6]:
from siuba import _, filter, select, group_by, summarize, mutate

tbl_mtcars >> filter(_.cyl == 6)

Unnamed: 0,index,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
3,5,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
4,9,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4


In [7]:
(tbl_mtcars
  >> group_by(_.cyl)
  >> summarize(avg_mpg = _.mpg.mean())
  )

Unnamed: 0,cyl,avg_mpg
0,4,26.663636
1,6,19.742857
2,8,15.1
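
For readers more familiar with SQL, the grouped summary above corresponds to a `GROUP BY` query. The sketch below is a hand-written equivalent run with the built-in `sqlite3` module, not the exact statement siuba generates (use `show_query()` to see that). It uses only the seven `(cyl, mpg)` pairs printed in this tutorial, so the averages differ from the full-table result above.

```python
import sqlite3

# Stand-in data: (cyl, mpg) pairs taken from the rows printed in this tutorial.
rows = [(6, 21.0), (6, 21.0), (4, 22.8), (6, 21.4), (8, 18.7), (6, 18.1), (6, 19.2)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mtcars (cyl INTEGER, mpg REAL)")
con.executemany("INSERT INTO mtcars VALUES (?, ?)", rows)

# Hand-written counterpart of group_by(_.cyl) >> summarize(avg_mpg = _.mpg.mean()).
query = """
SELECT cyl, AVG(mpg) AS avg_mpg
FROM mtcars
GROUP BY cyl
ORDER BY cyl
"""

# Map each cylinder count to its average mpg.
avg_by_cyl = dict(con.execute(query).fetchall())
print(avg_by_cyl)
con.close()
```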


In [8]:
tbl_mtcars >> select(_.mpg, _.cyl, _.endswith('t'))

Unnamed: 0,mpg,cyl,drat,wt
0,21.0,6,3.9,2.62
1,21.0,6,3.9,2.875
2,22.8,4,3.85,2.32
3,21.4,6,3.08,3.215
4,18.7,8,3.15,3.44


In [9]:
tbl_mtcars >> \
  mutate(feetpg = _.mpg * 5280, inchpg = _.feetpg * 12)

Unnamed: 0,index,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,feetpg,inchpg
0,0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,110880.0,1330560.0
1,1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,110880.0,1330560.0
2,2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,120384.0,1444608.0
3,3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,112992.0,1355904.0
4,4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,98736.0,1184832.0
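
In this `mutate`, `inchpg` is computed from `feetpg`, a column created in the same call. Standard SQL does not let one expression in a `SELECT` list refer to an alias defined in that same list, so one common way to express the dependency is a nested query. The sketch below shows that pattern with the built-in `sqlite3` module, using 5280 feet per mile and three `mpg` values from the rows above. The statement siuba actually emits may be structured differently; `show_query()` will show it.

```python
import sqlite3

FEET_PER_MILE = 5280
INCHES_PER_FOOT = 12

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mtcars (mpg REAL)")
con.executemany("INSERT INTO mtcars VALUES (?)", [(21.0,), (22.8,), (18.7,)])

# The inner query defines feetpg; the outer query can then refer to it by name.
query = f"""
SELECT mpg, feetpg, feetpg * {INCHES_PER_FOOT} AS inchpg
FROM (
    SELECT mpg, mpg * {FEET_PER_MILE} AS feetpg
    FROM mtcars
) AS with_feetpg
"""

converted = con.execute(query).fetchall()
for mpg, feetpg, inchpg in converted:
    print(mpg, feetpg, inchpg)
con.close()
```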


Note that the two SQL implementations supported are PostgreSQL and SQLite. Support for window and aggregate functions is currently limited.

## Diving deeper

**TODO**

* using raw SQL
* how methods and function calls are translated
* uses sqlalchemy column objects