# Lecture 7.1: SQL and Importing Data
<div style="border: 1px double black; padding: 10px; margin: 10px">

**After today's lecture you will:**
* Know how to execute basic [SQL commands](#SQL-Commands)
* Understand how to import data from various sources

    
</div>

This lecture corresponds to Chapter 13 of your textbook.

In [1]:
library(tidyverse)
library(nycflights13)

    ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

    ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
    ✔ tibble  3.1.3     ✔ dplyr   1.0.7
    ✔ tidyr   1.1.3     ✔ stringr 1.4.0
    ✔ readr   2.0.1     ✔ forcats 0.5.1

    ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ✖ dplyr::filter() masks stats::filter()
    ✖ dplyr::lag()    masks stats::lag()



# SQL Queries

SQL stands for "Structured Query Language". Many large databases are relational databases that are queried with SQL, and you will probably encounter one if you work on big data and/or at a large company.

To introduce SQL we're going to use the `sqldf` package, which lets us run SQL queries on R tibbles/data frames. Also, to make things go faster, we'll operate on a subset of `flights` consisting of a random 1% sample of its rows.
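A minimal sketch of that setup, assuming `sqldf` is installed. The name `flights_small` and the seed are choices made for this sketch, not names from the lecture.

```r
library(dplyr)
library(nycflights13)
library(sqldf)

# Keep a random 1% of the rows so that queries run quickly.
# `flights_small` is a hypothetical name used only in this sketch.
set.seed(1)
flights_small <- slice_sample(flights, prop = 0.01)

# sqldf() takes a SQL string; table names in the query are looked up
# among the data frames visible in the R session.
n_rows <- sqldf("SELECT COUNT(*) AS n FROM flights_small")
n_rows
```

The query result comes back as an ordinary data frame, so it can be passed on to other R functions.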

### Selecting data from a table
The SQL syntax for selecting column(s) from a table is
```{sql}
SELECT <col1>, <col2>, ..., <coln> FROM <table>
```
Note the similarity to the corresponding `tidyverse` command:
```{r}
select(<table>, <col1>, <col2>, ..., <coln>)
```

The special keyword `*` means "select everything" and is equivalent to `dplyr`'s `everything()`.

If you have a really big table, SQL allows you to `LIMIT` the number of rows it returns.
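Here is a short sketch of column selection, `*`, and `LIMIT` together. It queries the full `flights` table from `nycflights13` through `sqldf`; the object names `some_cols` and `all_cols` are made up for the example.

```r
library(nycflights13)
library(sqldf)

# Pick three columns and return only the first five rows.
some_cols <- sqldf("SELECT year, month, day FROM flights LIMIT 5")
some_cols

# `*` returns every column; LIMIT again caps the number of rows returned.
all_cols <- sqldf("SELECT * FROM flights LIMIT 5")
ncol(all_cols) == ncol(flights)
```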

### Filtering

The SQL syntax for filtering rows in a table uses the `WHERE` clause:
```{sql}
SELECT * FROM <table> WHERE dest = 'IAH'
```
This is the same as:
```{r}
filter(<table>, dest == "IAH")
```
Note that SQL uses a single `=` to check equality!
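To see the two versions side by side, the sketch below runs the same filter through `sqldf` and through `dplyr` on the full `flights` table. The string literal is written with single quotes, which is the standard SQL form; the object names are illustrative.

```r
library(dplyr)
library(nycflights13)
library(sqldf)

# SQL: a single `=` and a single-quoted string literal.
iah_sql <- sqldf("SELECT * FROM flights WHERE dest = 'IAH'")

# dplyr: a double `==`.
iah_dplyr <- filter(flights, dest == "IAH")

# Both approaches should keep the same number of rows.
nrow(iah_sql) == nrow(iah_dplyr)
```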

### Missing data
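In SQL, missing data is coded as `NULL`, a special value analogous to `NA` in R. As with `NA`, comparing against `NULL` with `=` never returns true; test for it with `IS NULL` or `IS NOT NULL` instead.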
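A small sketch of how `NA` and `NULL` line up when a data frame is queried through `sqldf`. The `delays` data frame is a toy example invented for this illustration.

```r
library(sqldf)

# Toy data (made up for this sketch): the second flight has a missing delay.
delays <- data.frame(flight = c(1, 2, 3), dep_delay = c(5, NA, -2))

# R's NA arrives in SQL as NULL, so IS NULL finds it.
missing <- sqldf("SELECT flight FROM delays WHERE dep_delay IS NULL")
present <- sqldf("SELECT flight FROM delays WHERE dep_delay IS NOT NULL")

# Comparing with `= NULL` is never true, so this matches no rows.
wrong <- sqldf("SELECT flight FROM delays WHERE dep_delay = NULL")
nrow(wrong)
```

This mirrors R, where `x == NA` is never `TRUE` and `is.na(x)` is the right test.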

### Summarizing

The SQL syntax for summarizing uses aggregate functions such as `AVG` together with the `GROUP BY` clause:
```{sql}
SELECT <group cols>, AVG(<col>) AS avg_col FROM <table> GROUP BY <group cols>
```
This is the same as the following (`AVG` skips `NULL`s, hence the `na.rm = TRUE`):
```{r}
<table> %>% group_by(<group cols>) %>% summarize(avg_col = mean(<col>, na.rm = TRUE))
```
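As a concrete instance, the sketch below computes the average departure delay per carrier both ways on the full `flights` table. The `ORDER BY` and `arrange()` calls are added here only so that the two results can be compared row by row.

```r
library(dplyr)
library(nycflights13)
library(sqldf)

# SQL: AVG() ignores NULLs, so no extra argument is needed.
by_carrier_sql <- sqldf("
  SELECT carrier, AVG(dep_delay) AS avg_delay
  FROM flights
  GROUP BY carrier
  ORDER BY carrier
")

# dplyr: mean() needs na.rm = TRUE to match AVG().
by_carrier_dplyr <- flights %>%
  group_by(carrier) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(carrier)

all.equal(by_carrier_sql$avg_delay, by_carrier_dplyr$avg_delay)
```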

### Joins

The SQL syntax for joins is:
```{sql}
SELECT * FROM <table> LEFT JOIN <other_table> ON <left_key_col> = <right_key_col>
```
This is the same as:
```{r}
left_join(<table>, <other_table>, by = c("<left_key_col>" = "<right_key_col>"))
```

Note that when the tables being joined share column names, SQL requires us to be explicit about which columns we are `SELECT`ing: each such column name must be prefixed with the name of the table in which it resides, as in `<table>.<col>`.
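A sketch of a left join through `sqldf`, using `flights` and `airlines` from `nycflights13`. Both tables have a `carrier` column, so it is qualified with a table alias; the aliases `f` and `a` are arbitrary.

```r
library(nycflights13)
library(sqldf)

# `carrier` exists in both tables, so every use of it is prefixed
# with the alias of the table it comes from.
joined <- sqldf("
  SELECT f.flight, f.carrier, a.name
  FROM flights f
  LEFT JOIN airlines a ON f.carrier = a.carrier
  LIMIT 5
")
joined
```

The `dplyr` counterpart would be `left_join(flights, airlines, by = "carrier")`, followed by a `select()` of the same three columns.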

## Advanced joins in SQL
SQL is more general in how the join condition can be specified. Whereas in tidyverse the condition must be equality on key columns, in
SQL it can be any logical condition.

## Example
For every airport in `airports`, what is its nearest neighbor?

What is the nearest neighbor to `DTW`?
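The lecture's own solution is not shown here. One possible sketch for the `DTW` question joins `airports` to itself on an inequality, which is exactly the kind of condition an equality join cannot express. It assumes that squared distance in degrees of latitude and longitude is a good enough proxy for geographic distance; the column names `origin`, `neighbor`, and `dist2` are invented for the example.

```r
library(nycflights13)
library(sqldf)

# Self-join: pair DTW (alias a) with every other airport (alias b).
# dist2 is the squared distance in degrees, a rough proxy chosen for this sketch.
dtw_nearest <- sqldf("
  SELECT a.faa AS origin,
         b.faa AS neighbor,
         (a.lat - b.lat) * (a.lat - b.lat) +
         (a.lon - b.lon) * (a.lon - b.lon) AS dist2
  FROM airports a
  JOIN airports b ON a.faa <> b.faa
  WHERE a.faa = 'DTW'
  ORDER BY dist2
  LIMIT 1
")
dtw_nearest
```

Answering the question for every airport follows the same pattern without the `WHERE` clause, keeping the closest `b` for each `a`, at the cost of comparing every pair of airports.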

## Types of data
You will encounter data in many different formats. Here are a few of the most common ones:

### Comma-separated value data
Comma-separated value (CSV) is one of the most common formats for sharing data. It has the advantage of being human-readable. The disadvantage is that there is no universally followed standard for reading or writing these files (RFC 4180 exists, but many files deviate from it)!

Here's an example of CSV data on heights:
    
    "earn","height","sex","ed","age","race"
    50000,74.4244387818035,"male",16,45,"white"
    60000,65.5375428255647,"female",16,58,"white"
    30000,63.6291977374349,"female",16,29,"white"
    50000,63.1085616752971,"female",16,91,"other"
    51000,63.4024835710879,"female",17,39,"white"
    9000,64.3995075440034,"female",15,26,"white"
    
The first row (usually) has a *header* giving the column names. Subsequent rows give the actual data. Strings are (usually) quoted.

You might also see these data come in the format:
    
    earn,height,sex,ed,age,race
    50000,74.4244387818035,male,16,45,white
    60000,65.5375428255647,female,16,58,white
    30000,63.6291977374349,female,16,29,white
    50000,63.1085616752971,female,16,91,other
    51000,63.4024835710879,female,17,39,white
    9000,64.3995075440034,female,15,26,white
    
No quotes!

Or even:

    50000,74.4244387818035,male,16,45,white
    60000,65.5375428255647,female,16,58,white
    30000,63.6291977374349,female,16,29,white
    50000,63.1085616752971,female,16,91,other
    51000,63.4024835710879,female,17,39,white
    9000,64.3995075440034,female,15,26,white
    
No column names!

The `read_csv` command is designed to read this type of file. Note that this command is part of `tidyverse` and is different from `read.csv` in base R! You generally want to use `read_csv` over `read.csv` since:
- It is much faster.
- It outputs nicely formatted `tibble`s which you can pass into other tidyverse functions.
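As an illustration, the sketch below reads the first lines of the quoted heights example. Since the path of the lecture's data file is not given, the text is supplied as literal data; wrapping it in `I()` marks it as data rather than a file name, which assumes readr 2.0 or later.

```r
library(readr)

# Header plus the first two rows of the quoted heights example above.
heights_csv <- '"earn","height","sex","ed","age","race"
50000,74.4244387818035,"male",16,45,"white"
60000,65.5375428255647,"female",16,58,"white"
'

# I() tells read_csv() that the string is the data itself, not a path.
heights <- read_csv(I(heights_csv))
heights
```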

Here `read_csv` has told us which columns it found and which data type it guessed for each. Generally these guesses will be correct, but we will see examples later where it guesses wrongly and we have to override them manually.

Here is another version of `heights`, where we are not lucky enough to have a header telling us which columns came from where:

Now `read_csv()` has erroneously assumed that the first row of data is the header. To override this behavior we need to specify the column names by hand:
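A sketch of the fix on the first three rows of the headerless example, again supplied as literal data via `I()` (readr 2.0 or later) because the file path is not given. The column names are the ones from the header shown earlier.

```r
library(readr)

# The headerless variant of the heights data: every line is a data row.
headerless_csv <- "50000,74.4244387818035,male,16,45,white
60000,65.5375428255647,female,16,58,white
30000,63.6291977374349,female,16,29,white
"

# Supplying col_names stops read_csv() from consuming the first row as a header.
heights <- read_csv(
  I(headerless_csv),
  col_names = c("earn", "height", "sex", "ed", "age", "race")
)
heights
```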

To create short examples illustrating `read_csv`'s behavior, we can specify the contents of a csv file inline.

In [6]:
read_csv(
    "a, b, c
     1, 2, 3
     4, 5, 6
")

    Rows: 2 Columns: 3

    ── Column specification ────────────────────────────────────────────────────────
    Delimiter: ","
    dbl (3): a, b, c

    ℹ Use `spec()` to retrieve the full column specification for this data.
    ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

        a     b     c
    <dbl> <dbl> <dbl>
        1     2     3
        4     5     6


You might want to skip a few rows at the beginning that contain metadata, using the `skip` argument.

In [7]:
read_csv(
"# First row to skip
// Second row to skip
% Third row to skip
a, b, c
1, 2, 3
4, 5, 6
", skip = 3)

    Rows: 2 Columns: 3

    ── Column specification ────────────────────────────────────────────────────────
    Delimiter: ","
    dbl (3): a, b, c

    ℹ Use `spec()` to retrieve the full column specification for this data.
    ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

        a     b     c
    <dbl> <dbl> <dbl>
        1     2     3
        4     5     6


Some CSVs will come with comments, typically in the form of lines prefaced by `#`. You can skip comment lines by specifying a comment character with the `comment` argument.
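A minimal sketch of the `comment` argument on inline data, with the literal text wrapped in `I()` (readr 2.0 or later). The comment line and the object name `commented` are made up for the example.

```r
library(readr)

# The line starting with `#` is dropped before the header is read.
commented <- read_csv(
  I("# A comment line to skip\na,b,c\n1,2,3\n4,5,6"),
  comment = "#"
)
commented
```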

Set `col_names = FALSE` when you don't have column names in the file. The column names are then set to X1, X2, ...
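For instance, on a two-line inline example (literal data wrapped in `I()`, readr 2.0 or later; the object name is illustrative):

```r
library(readr)

# With col_names = FALSE the first line is treated as data,
# and the columns are named X1, X2, X3 automatically.
no_header <- read_csv(I("1,2,3\n4,5,6"), col_names = FALSE)
names(no_header)
```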

You can specify your own column names.

You can specify how missing values are represented in the file.
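A short sketch of the `na` argument, assuming a file in which a lone `.` marks a missing value. The data and the object name are invented for the example.

```r
library(readr)

# Every "." in the input is read as NA instead of as a string.
with_missing <- read_csv(I("a,b,c\n1,.,3\n4,5,."), na = ".")
with_missing
```

Because the `.` entries are treated as missing, columns `b` and `c` are still parsed as numbers rather than as character strings.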

You can write a tibble to a csv file using `write_csv()`.
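As a sketch, the following writes a small tibble to a temporary file and reads it back. The tibble and the temporary path are made up for the example.

```r
library(readr)
library(tibble)

small <- tibble(a = c(1, 4), b = c(2, 5), c = c(3, 6))

# Write to a temporary file, then read the file back in.
path <- tempfile(fileext = ".csv")
write_csv(small, path)
round_trip <- read_csv(path, show_col_types = FALSE)
round_trip
```

Keep in mind that a CSV file stores only text, so column types are guessed again when the file is read back.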

## Reading data from the Internet
These days, it's increasingly common to pull data from online sources. For example, say I wanted to know the population of European countries. This is [easily found](https://en.wikipedia.org/wiki/Demographics_of_Europe#Population_by_country) on Wikipedia. How can I get these data into R and analyze them?

We will use the package `htmltab` for this purpose.

In [19]:
#install.packages("htmltab")
library(htmltab)



The syntax of this command is:

```
htmltab(<url>, <table identifier>)
```

Let's try it with the Wikipedia page above, calling `htmltab()` on the URL without a table identifier:

In [20]:
url <- "http://en.wikipedia.org/wiki/Demographics_of_Europe#Population_by_country"



    Argument 'which' was left unspecified. Choosing first table.

    # A tibble: 8 × 2
      Year  `Population(% of world total)`
      <chr> <chr>
    1 AD 1  34 (15%)
    2 1000  40 (15%)
    3 1500  78 (18%)
    4 1600  112 (20%)
    5 1700  127 (21%)
    6 1820  224 (21%)
    7 1913  498 (28%)
    8 2000  742 (13%)


This did not produce what we want. The reason is that there are many tables on this page, and by default `htmltab()` just takes the first one it finds. We can pass a number as the second argument (`which`) in order to take the second, third, etc.:

To get `europe.pop` into a usable format we need to do a bit more work:
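The cleaning code for `europe.pop` is not shown here, and the live table may have changed since the lecture. As an illustration of the kind of work involved, the sketch below cleans three rows copied from the first table printed above, where numbers arrive as text such as `"742 (13%)"`. The names `raw`, `pop`, `population`, and `world_share` are invented for the example.

```r
library(dplyr)
library(readr)
library(stringr)
library(tibble)

# Three rows copied from the first table above; both columns are text.
raw <- tibble(
  Year = c("AD 1", "1000", "2000"),
  pop  = c("34 (15%)", "40 (15%)", "742 (13%)")
)

clean <- raw %>%
  mutate(
    # parse_number() keeps the first number in the string and drops the rest.
    population = parse_number(pop),
    # The digits immediately before the % sign give the share of the world total.
    world_share = as.numeric(str_extract(pop, "\\d+(?=%)"))
  )
clean
```

The same two steps, extracting the number and converting it from text, are usually what a scraped table needs before it can be plotted or summarized.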