# Data Retrieval III (SQL)

In this notebook, we will work with the following:

1. SELECT statements.
2. Aggregation.
3. Window functions.
4. Joins.

In [None]:
from os import environ
import urllib.parse

from wrds.sql import (
    WRDS_POSTGRES_HOST,
    WRDS_POSTGRES_DB,
    WRDS_POSTGRES_PORT,
)

In [None]:
environ["DATABASE_URL"] = (
    f"postgresql://{environ['WRDS_USER']}:"
    f"{urllib.parse.quote_plus(environ['WRDS_PASS'])}@"
    f"{WRDS_POSTGRES_HOST}:{WRDS_POSTGRES_PORT}/"
    f"{WRDS_POSTGRES_DB}"
)

In [None]:
%load_ext sql
%config SqlMagic.autopandas=True
%config SqlMagic.displaycon=False
%config SqlMagic.feedback=False

In [None]:
# Uncomment and run the line below to see options.
# %config SqlMagic

In [None]:
%sql

# SQL

SQL stands for **structured query language**. It is how we tell a database management system ("DBMS") what data we would like it to return to us, and in what form.
This is another deep topic, but like the others, we can accomplish a lot for research with some well-chosen basics.

A DBMS generally stores data in tables, which are 2D datasets like the pandas dataframes or stats software datasets that we are accustomed to using.
These tables are related to each other using keys in one-to-one, one-to-many, and many-to-one relationships, hence the name "relational database."

SQL in research is most helpful in two particular cases:

1. Retrieving data from a data service that runs a DBMS for us (e.g., WRDS).
2. Creating and using a local database to help deal with big data that is more granular than we ultimately need.

We will focus below on the first case.

# SELECT statements

A `SELECT` statement tells the DBMS that we would like to select certain data from a table.
Its basic anatomy is quite simple:

```sql
SELECT *
  FROM comp.funda;
```

Above, `SELECT *` means that we want to select every column.
This is generally bad form, because, in practice, we rarely need all of the columns.
`FROM comp.funda` tells the DBMS that we want the `comp.funda` table, which is the Compustat Daily Updates - Fundamentals Annual.
When using WRDS, the database names are available at the top of the variable descriptions for a given table/query form.

The semicolon at the end signifies the end of the query.
Unlike Python, SQL does not use whitespace as syntax, though there are style [conventions](https://www.sqlstyle.guide).
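
As a small illustration of the point about whitespace, the sketch below runs the same `SELECT` twice against a toy in-memory SQLite database that stands in for the WRDS server. The rows are made up, and attaching a second in-memory database under the name `comp` is only a convenience (an assumption of this sketch) so that the table can be addressed as `comp.funda`.

```python
import sqlite3

# Toy stand-in for the WRDS server: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
# Attach a second in-memory database named "comp" so "comp.funda" resolves.
conn.execute("ATTACH DATABASE ':memory:' AS comp")
conn.execute("CREATE TABLE comp.funda (gvkey TEXT, fyear INTEGER, conm TEXT)")
conn.executemany(
    "INSERT INTO comp.funda VALUES (?, ?, ?)",
    [("000001", 2019, "ALPHA CORP"), ("000002", 2019, "BETA INC")],  # made-up rows
)

# The same query, laid out on one line and then across several lines.
one_line = conn.execute("SELECT * FROM comp.funda;").fetchall()
multi_line = conn.execute(
    """
    SELECT *
      FROM comp.funda;
    """
).fetchall()

# Layout does not change the result; only the keywords and their order do.
print(one_line == multi_line)
```

The multi-line layout is purely for the human reader, which is why following a style convention is worthwhile even though the DBMS does not require it.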

In [None]:
%%sql
SELECT *
  FROM comp.funda
 LIMIT 10


If we want to keep a result, we can use the syntax below.
After the `%%sql` cell magic, we give the name that we want to assign the results to (e.g., `df01`), followed by a space and `<<`, and then the query on the lines that follow.

In combination with the `autopandas` setting above, our result will be a pandas dataframe.

In [None]:
%%sql df01 <<
SELECT *
  FROM comp.funda
 LIMIT 10


In [None]:
df01.head()  # noqa: F821

## *Aside: code testing*

(Feel free to skip)

You may have noticed the comment, `# noqa: F821`, in the method call above.

This is a specially formatted comment that tells my testing infrastructure to ignore an error of type F821 on this line (`noqa` is short for "no quality assurance").
F821 flags an undefined name, which corresponds with a `NameError`.
The reason we need to suppress it is that the code testing tool doesn't understand the `%%sql` magic commands, so it can't find where `df01` was previously defined.

For a project of this scope and update frequency, it helps to have some automated testing that helps me catch when things stop working.
One part of that is suppressing errors that happen for some technical or intended reason, to isolate real problems.

In [None]:
# We can look at all of these column names if we like.
# df01.columns.to_list()

Note two things in particular in the query and results above.

First, I used the `LIMIT` keyword with a value of `10`.
Compustat is a huge dataset, and retrieving everything would be a big download.
When we are experimenting or iterating on a query, `LIMIT` asks the server to return at most the given number of rows.
This is a strong norm when using this kind of data, as it dramatically reduces the load on the server.
`LIMIT` becomes more important as we ask the server to do transformation work for us, which increases the computational demand.

Second, there are 948 columns in this dataset.
Chances are, this is many more than we want, so we should narrow down to the variables of interest.
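
Both points can be seen on a small scale with a toy in-memory SQLite table. This is a sketch only: the table has 50 made-up rows and four columns rather than Compustat's size, and the `comp` name comes from an `ATTACH` used purely so that the query reads like the ones in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS comp")  # so "comp.funda" resolves
conn.execute(
    "CREATE TABLE comp.funda (gvkey TEXT, fyear INTEGER, at REAL, lt REAL)"
)
# 50 made-up rows standing in for a much larger table.
rows = [(f"{i:06d}", 2019, 100.0 * i, 40.0 * i) for i in range(1, 51)]
conn.executemany("INSERT INTO comp.funda VALUES (?, ?, ?, ?)", rows)

cur = conn.execute("SELECT * FROM comp.funda LIMIT 10")
limited = cur.fetchall()
# cursor.description holds one entry per returned column; the name comes first.
column_names = [d[0] for d in cur.description]

print(len(limited))   # at most 10 rows come back, although the table has 50
print(column_names)   # SELECT * returns every column in the table
```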

In [None]:
%%sql
SELECT gvkey, fyear, conm, tic, cusip
       , at, lt
  FROM comp.funda
 WHERE (datafmt = 'STD')
 LIMIT 10


There are two changes above.
First, we picked explicit column names.

Second, we added a `WHERE` clause to impose a condition on the data that we want back.
In this case, we asked for rows where the column `datafmt` has a value of `STD`.
The default query form for Compustat returns only these standard data formats, so we recreate that here.
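
To make the effect of `WHERE` concrete, here is a minimal sketch against a toy in-memory SQLite table. The rows are made up, and `'OTHER'` is a placeholder value used only to have something for the filter to exclude; it is not meant to be an actual Compustat format code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS comp")  # so "comp.funda" resolves
conn.execute(
    "CREATE TABLE comp.funda (gvkey TEXT, fyear INTEGER, datafmt TEXT, at REAL)"
)
conn.executemany(
    "INSERT INTO comp.funda VALUES (?, ?, ?, ?)",
    [
        ("000001", 2019, "STD", 100.0),
        ("000001", 2019, "OTHER", 101.0),  # placeholder format, filtered out below
        ("000002", 2019, "STD", 250.0),
    ],
)

query = """
SELECT gvkey, fyear, at
  FROM comp.funda
 WHERE (datafmt = 'STD')
 LIMIT 10
"""
std_rows = conn.execute(query).fetchall()

# Only the rows whose datafmt equals 'STD' survive the filter.
print(std_rows)
```

Note that `datafmt` can appear in `WHERE` without being listed in `SELECT`: the condition is applied to the table's rows, not to the columns we ask to see.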

In [None]:
%%sql
SELECT gvkey, fyear, conm, tic
       , cusip AS cusip9
       , SUBSTRING(cusip, 1, 8) AS cusip8
       , at, lt
  FROM comp.funda
 WHERE (datafmt = 'STD') AND
       (fyear BETWEEN 2000 AND 2020)
 LIMIT 10


Here, we made three more changes.
First, we asked for the `cusip` column to be called `cusip9` in our results using `AS`.
Second, we used a function to transform the `cusip` column (using the `SUBSTRING()` function) to give us only eight characters and to name it `cusip8`.
This is a simple example of having the server do prep work for us.
Finally, we added a second condition to `WHERE`: a year restriction.
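
The three changes can be checked on toy data. The sketch below uses an in-memory SQLite table with made-up identifiers. It assumes SQLite's `SUBSTR` as the counterpart of `SUBSTRING` (older SQLite builds do not accept the longer spelling), and it adds an `ORDER BY`, which the queries in this notebook do not use, only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS comp")  # so "comp.funda" resolves
conn.execute("CREATE TABLE comp.funda (gvkey TEXT, fyear INTEGER, cusip TEXT)")
conn.executemany(
    "INSERT INTO comp.funda VALUES (?, ?, ?)",
    [
        ("000001", 1999, "111111111"),  # made-up 9-character identifiers
        ("000001", 2000, "111111111"),
        ("000002", 2020, "22222222X"),
        ("000002", 2021, "22222222X"),
    ],
)

cur = conn.execute(
    """
    SELECT gvkey, fyear
           , cusip AS cusip9
           , SUBSTR(cusip, 1, 8) AS cusip8
      FROM comp.funda
     WHERE (fyear BETWEEN 2000 AND 2020)
     ORDER BY fyear
    """
)
result = cur.fetchall()
column_names = [d[0] for d in cur.description]

print(column_names)  # the AS aliases become the column names of the result
print(result)        # BETWEEN is inclusive: 2000 and 2020 stay, 1999 and 2021 go
```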

# Aggregation

Sometimes, the data in a table is more granular than the data that we want returned to us.
So, we can ask the server to aggregate it for us, returning an aggregated dataset.

There are a few important things to know:

1. We use `GROUP BY` to tell the DBMS how to group rows before aggregating.
2. Every column must either be in the `GROUP BY` or have an aggregation function applied. A notable example here is that we ask for the `MAX` of the company name. If the name changes in the rows of the search, the DBMS would need to know how to choose. However, this is enforced as a general rule, not only when there is an actual conflict to resolve.
3. The order of the clauses matters. For example, `WHERE` needs to be after `FROM` and before `GROUP BY`. I've ordered them correctly here, so the query will work, but this is a topic better explored in an introductory book on SQL.
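
These rules can be tried out on a toy in-memory SQLite table before sending anything to the server. The firms and numbers below are made up, and the `ORDER BY` is an addition of this sketch so that the groups come back in a predictable order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS comp")  # so "comp.funda" resolves
conn.execute(
    "CREATE TABLE comp.funda "
    "(gvkey TEXT, fyear INTEGER, conm TEXT, at REAL, ni REAL)"
)
conn.executemany(
    "INSERT INTO comp.funda VALUES (?, ?, ?, ?, ?)",
    [
        ("000001", 2019, "ALPHA CORP", 100.0, 10.0),  # made-up firm-years
        ("000001", 2020, "ALPHA CORP", 200.0, 30.0),
        ("000002", 2019, "BETA INC", 50.0, -5.0),
    ],
)

grouped = conn.execute(
    """
      SELECT gvkey
             , MAX(conm) AS co_name
             , AVG(at) AS assets_avg
             , SUM(ni) AS netincome_total
        FROM comp.funda
       WHERE (fyear BETWEEN 2000 AND 2020)
    GROUP BY gvkey
    ORDER BY gvkey
    """
).fetchall()

# Three firm-year rows collapse to one row per gvkey.
for row in grouped:
    print(row)
```

For the first firm, the average of 100 and 200 is 150 and the sum of 10 and 30 is 40, so the two input rows become a single row carrying those values.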

In [None]:
%%sql
  SELECT gvkey
         , MAX(conm) AS co_name
         , AVG(at) AS assets_avg
         , SUM(ni) AS netincome_total
    FROM comp.funda
   WHERE (datafmt = 'STD') AND
         (fyear BETWEEN 2000 AND 2020)
GROUP BY gvkey
   LIMIT 10


# Window functions

Sometimes, we want data at the level of the table, but we would also like aggregated measures.
SQL has something called **window functions**, which aggregate data like we did before, but then **broadcast** the result back to every row at the level of the original table.

In [None]:
%%sql
SELECT gvkey, fyear, conm, tic
       , cusip AS cusip9
       , SUBSTRING(cusip, 1, 8) AS cusip8
       , at, lt
       , AVG(at) OVER(PARTITION BY gvkey) AS assets_avg
       , SUM(ni) OVER(PARTITION BY gvkey) AS netincome_total
  FROM comp.funda
 WHERE (datafmt = 'STD') AND
       (fyear BETWEEN 2000 AND 2020)
 LIMIT 10


Notice a few things about using window functions:

1. We're broadcasting back to the original row level, so there's no need to provide aggregation on the name.
2. We removed `GROUP BY`.
3. Instead, each aggregation function is followed by an `OVER` clause (which tells the DBMS that we want a window function), and, inside it, `PARTITION BY` defines the groups over which the aggregation is done.

Window functions are very useful for a lot of the work we do, and they can easily push work to the server that we might otherwise have to do after retrieving the data.
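
The broadcasting behavior is easiest to see on a handful of rows. The sketch below uses a toy in-memory SQLite table with made-up values. It assumes an SQLite build of version 3.25 or later, which is when SQLite gained window functions, and the `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS comp")  # so "comp.funda" resolves
conn.execute(
    "CREATE TABLE comp.funda (gvkey TEXT, fyear INTEGER, at REAL, ni REAL)"
)
conn.executemany(
    "INSERT INTO comp.funda VALUES (?, ?, ?, ?)",
    [
        ("000001", 2019, 100.0, 10.0),  # made-up firm-years
        ("000001", 2020, 200.0, 30.0),
        ("000002", 2019, 50.0, -5.0),
    ],
)

# Window functions need SQLite 3.25 or later.
windowed = conn.execute(
    """
    SELECT gvkey, fyear, at
           , AVG(at) OVER (PARTITION BY gvkey) AS assets_avg
           , SUM(ni) OVER (PARTITION BY gvkey) AS netincome_total
      FROM comp.funda
     ORDER BY gvkey, fyear
    """
).fetchall()

# Still three rows: each firm-year keeps its own "at" value and also
# carries the firm-level average and total.
for row in windowed:
    print(row)
```

Compared with `GROUP BY`, the row count does not shrink: both rows for the first firm show the same firm-level average of 150 next to their own yearly values.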

# Joining data

A `JOIN` combines one table with another (or with several others) so that we can query the combined data.
This is a fairly deep topic, though we are going to work through a simple example.

In [None]:
%%sql
SELECT f.gvkey, f.fyear, f.conm, f.tic
        , f.cusip AS cusip9
        , SUBSTRING(f.cusip, 1, 8) AS cusip8
        , f.at, f.lt
        , AVG(f.at) OVER(PARTITION BY f.gvkey) AS assets_avg
        , SUM(f.ni) OVER(PARTITION BY f.gvkey) AS netincome_total
        , c.city
        , c.state
  FROM comp.funda AS f
  JOIN comp.company AS c
    ON f.gvkey = c.gvkey
 WHERE (f.datafmt = 'STD') AND
        (f.fyear BETWEEN 2000 AND 2020)
 LIMIT 10


There are a number of changes here to make the `JOIN` work.

1. Notice that we added a table prefix to every column name. Without these qualifiers, a column name that appears in both tables (such as `gvkey`) is ambiguous.
1. Otherwise, most things look similar until the `JOIN`.
1. The `JOIN` itself has two parts: the `JOIN` specifying the other table we want to join, and the `ON` specifying how to join (or merge) the two. In this case, we are using `f.gvkey` and `c.gvkey`.
1. As before, we're using `AS`, this time to give short names to the tables (to make those prefixes easier to type).

Joins are powerful, and they can allow us to push a lot of our prep work onto the server. In addition, with copyrighted data like this, sharing a query with someone else is a way of transmitting exactly (or close to) what you pulled, while letting them rely on their own licensed access to the data.
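
A toy version of the join, including the ambiguity that the prefixes resolve, can be sketched with two in-memory SQLite tables. All firms and locations below are made up, the tables carry only a few of the real tables' columns, and the `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS comp")  # so "comp.*" names resolve
conn.execute("CREATE TABLE comp.funda (gvkey TEXT, fyear INTEGER, at REAL)")
conn.execute("CREATE TABLE comp.company (gvkey TEXT, city TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO comp.funda VALUES (?, ?, ?)",
    [("000001", 2019, 100.0), ("000001", 2020, 200.0), ("000002", 2019, 50.0)],
)
conn.executemany(
    "INSERT INTO comp.company VALUES (?, ?, ?)",
    [("000001", "Toytown", "AA"), ("000002", "Sampleville", "BB")],  # made up
)

joined = conn.execute(
    """
    SELECT f.gvkey, f.fyear, f.at
           , c.city
           , c.state
      FROM comp.funda AS f
      JOIN comp.company AS c
        ON f.gvkey = c.gvkey
     ORDER BY f.gvkey, f.fyear
    """
).fetchall()
for row in joined:
    print(row)  # each firm-year row now carries its firm's city and state

# Dropping the prefix on a column that exists in both tables is an error.
try:
    conn.execute(
        "SELECT gvkey FROM comp.funda AS f "
        "JOIN comp.company AS c ON f.gvkey = c.gvkey"
    )
    ambiguous_error = ""
except sqlite3.OperationalError as exc:
    ambiguous_error = str(exc)
print(ambiguous_error)
```

Each row of `comp.company` is matched to every `comp.funda` row with the same `gvkey`, so the one-row-per-firm location data is repeated across that firm's years.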

# Breakout Exercises (time permitting)

If time permits, do the following exercise.

## EX1: customize a query

Choose one of the queries above, and edit to make two changes:

1. Restrict the results to Apple and Microsoft, two firms we've used as examples before. (Hint: the ticker symbols may be helpful)
2. Add an additional item of your choice to retrieve an additional column or aggregated variable.

In [None]:
%%sql
SELECT 'Replace with your answer to 1-1.' AS exercise