## Importing data from databases (Part 2)

Importing an entire table from a database when you only need a small part of it is a lot of unnecessary work. In this chapter, you'll learn about SQL queries, which make imports more efficient by performing some of the computation on the database side.

### Query tweater (1)
In your life as a data scientist, you'll often work with huge databases that contain tables with millions of rows. Many analyses need only a fraction of that data. In this case, it's a good idea to send SQL queries to your database and import only the data you actually need into R.

dbGetQuery() is what you need. As usual, you first pass the connection object to it. The second argument is an SQL query in the form of a character string. This example selects the age variable from the people dataset where gender equals "male":

dbGetQuery(con, "SELECT age FROM people WHERE gender = 'male'")
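The people table in this example is not part of the tweater database, so the call above cannot be run against it. A minimal sketch of the same query follows, assuming the RSQLite package is installed and using an in-memory SQLite database as a stand-in for a real server. The `con_demo` connection and the sample rows are made up for illustration.

```r
library(DBI)

# Assumption: an in-memory SQLite database (RSQLite) stands in for a real server
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical sample rows for the people table
people <- data.frame(
  name   = c("ann", "bob", "carl", "dina"),
  age    = c(34, 27, 45, 52),
  gender = c("female", "male", "male", "female"),
  stringsAsFactors = FALSE
)
dbWriteTable(con_demo, "people", people)

# The filtering happens in the database; only the age column of the
# matching rows is returned to R as a data frame
male_ages <- dbGetQuery(con_demo, "SELECT age FROM people WHERE gender = 'male'")
print(male_ages)

dbDisconnect(con_demo)
```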

In [1]:
# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Import tweat_id column of comments where user_id is 1: elisabeth
elisabeth = dbGetQuery(con, "SELECT tweat_id FROM comments WHERE user_id = 1")

# Print elisabeth
print(elisabeth)

  tweat_id
1       87
2       49
3       77
4       77


### Query tweater (2)
Apart from checking equality, you can also check for less than and greater than relationships, with < and >, just like in R.
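As a rough illustration of both operators, the sketch below runs one date comparison and one numeric comparison. It assumes RSQLite is installed and uses an in-memory database. The `posts` table, its `post_date` column and its rows are hypothetical. The dates are stored as `yyyy-mm-dd` text, which sorts chronologically when compared as strings.

```r
library(DBI)

# Assumption: in-memory SQLite database as a stand-in for a real server
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table; dates are ISO 8601 text, so string comparison
# matches chronological order
posts <- data.frame(
  id        = 1:4,
  post_date = c("2015-09-14", "2015-09-21", "2015-09-22", "2015-09-23"),
  stringsAsFactors = FALSE
)
dbWriteTable(con_demo, "posts", posts)

# Greater than: posts strictly after 21 September 2015
after <- dbGetQuery(con_demo, "SELECT id FROM posts WHERE post_date > '2015-09-21'")

# Less than: posts whose id is below 3
low_ids <- dbGetQuery(con_demo, "SELECT id FROM posts WHERE id < 3")

print(after)
print(low_ids)

dbDisconnect(con_demo)
```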

con, a connection to the tweater database, is again available.

In [2]:
# Import post column of tweats where date is later than '2015-09-21': latest
latest = dbGetQuery(con, "SELECT post FROM tweats WHERE date > '2015-09-21'")

# Print latest
print(latest)

                                                                 post
1               open and crush avocado. add shrimps. perfect starter.
2 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
3                              just eat an apple. simply and healthy.


### Query tweater (3)
Suppose that you have a people table, with a bunch of information. This time, you want to find out the age and country of married males. Provided that there is a married column that's 1 when the person in question is married, the following query would work.

SELECT age, country FROM people WHERE gender = 'male' AND married = 1
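To see the two conditions working together, here is a small sketch that runs this query on made-up data. It assumes RSQLite is installed. The in-memory database and the rows in `people` are hypothetical.

```r
library(DBI)

# Assumption: in-memory SQLite database as a stand-in for a real server
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical rows; married is 1 for married people and 0 otherwise
people <- data.frame(
  age     = c(34, 27, 45, 52),
  country = c("BE", "US", "IN", "UK"),
  gender  = c("female", "male", "male", "male"),
  married = c(1, 0, 1, 1),
  stringsAsFactors = FALSE
)
dbWriteTable(con_demo, "people", people)

# A row is returned only if both conditions hold
married_males <- dbGetQuery(
  con_demo,
  "SELECT age, country FROM people WHERE gender = 'male' AND married = 1"
)
print(married_males)

dbDisconnect(con_demo)
```

The married woman and the unmarried man are both filtered out, because each of them fails one of the two conditions.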

Can you use a similar approach for a more specialized query on the tweater database?

In [3]:
# Create data frame specific
specific = dbGetQuery(con, "SELECT message FROM comments WHERE tweat_id = 77 AND user_id > 4")

# Print specific
print(specific)

  message
1  great!


### Query tweater (4)
There are also dedicated SQL functions that you can use in the WHERE clause of an SQL query. For example, CHAR_LENGTH() returns the number of characters in a string.
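The available functions depend on the database system. The sketch below assumes RSQLite is installed and uses an in-memory SQLite database, where the counterpart of MySQL's CHAR_LENGTH() for text is LENGTH(). The `users` rows are made up. For comparison, it also applies the same filter in R with `nchar()` after importing every row.

```r
library(DBI)

# Assumption: in-memory SQLite database as a stand-in for a real server
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical users table
users <- data.frame(
  id   = 1:4,
  name = c("elisabeth", "mike", "thea", "thomas"),
  stringsAsFactors = FALSE
)
dbWriteTable(con_demo, "users", users)

# Database-side filter: SQLite uses LENGTH() where MySQL uses CHAR_LENGTH()
short_names <- dbGetQuery(con_demo, "SELECT id, name FROM users WHERE LENGTH(name) < 5")

# R-side filter: import all rows first, then subset with nchar()
all_users  <- dbGetQuery(con_demo, "SELECT id, name FROM users")
short_in_r <- all_users[nchar(all_users$name) < 5, ]

print(short_names)

dbDisconnect(con_demo)
```

Both approaches select the same users, but the first one transfers only the matching rows.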

In [4]:
# Create data frame short
short = dbGetQuery(con, "SELECT id, name FROM users WHERE CHAR_LENGTH(name) < 5")

# Print short
print(short)

  id name
1  2 mike
2  3 thea
3  6 kate


### Join the query madness!
Of course, SQL does not stop with the three keywords SELECT, FROM and WHERE. Another frequently used keyword is JOIN, and more specifically INNER JOIN, which combines the rows of two tables that satisfy the ON condition. Take these two calls, for example:

query1 = dbGetQuery(con, "SELECT name, post FROM users INNER JOIN tweats ON users.id = user_id WHERE date > '2015-09-19'")

query2 = dbGetQuery(con, "SELECT post, message FROM tweats INNER JOIN comments ON tweats.id = tweat_id WHERE tweat_id = 77")
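To make the matching behaviour of an inner join concrete, here is a minimal sketch on two tiny tables. It assumes RSQLite is installed. The in-memory database and all rows are hypothetical. A user without any tweats has no matching row and therefore does not appear in the result.

```r
library(DBI)

# Assumption: in-memory SQLite database as a stand-in for a real server
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical users; mike has not posted anything
users <- data.frame(
  id   = 1:3,
  name = c("elisabeth", "mike", "thea"),
  stringsAsFactors = FALSE
)

# Hypothetical tweats; user_id refers to users.id
tweats <- data.frame(
  id      = c(10L, 11L, 12L),
  user_id = c(1L, 1L, 3L),
  post    = c("first post", "second post", "third post"),
  stringsAsFactors = FALSE
)

dbWriteTable(con_demo, "users", users)
dbWriteTable(con_demo, "tweats", tweats)

# Each tweat is paired with the user whose id equals its user_id
joined <- dbGetQuery(
  con_demo,
  "SELECT name, post FROM users INNER JOIN tweats ON users.id = tweats.user_id"
)
print(joined)

dbDisconnect(con_demo)
```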

In [8]:
# Run query1 and print the result
query1 = dbGetQuery(con, "SELECT name, post FROM users INNER JOIN tweats ON users.id = user_id WHERE date > '2015-09-19'")
print(query1)

       name                                                                post
1 elisabeth nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
2    oliver               open and crush avocado. add shrimps. perfect starter.
3      kate                       2 slices of bread. add cheese. grill. heaven.
4    anjali                              just eat an apple. simply and healthy.


In [9]:
# Run query2 and print the result
query2 = dbGetQuery(con, "SELECT post, message FROM tweats INNER JOIN comments ON tweats.id = tweat_id WHERE tweat_id = 77")
print(query2)

                                           post            message
1 2 slices of bread. add cheese. grill. heaven.             great!
2 2 slices of bread. add cheese. grill. heaven.      not my thing!
3 2 slices of bread. add cheese. grill. heaven. couldn't be better
4 2 slices of bread. add cheese. grill. heaven.       saved my day


### Send - Fetch - Clear
You've used dbGetQuery() multiple times now. This is a virtual function from the DBI package, but is actually implemented by the RMySQL package. Behind the scenes, the following steps are performed:

1. Sending the specified query with dbSendQuery();
2. Fetching the result of executing the query on the database with dbFetch();
3. Clearing the result with dbClearResult().

Let's not use dbGetQuery() this time and implement the steps above. This is tedious to write, but it gives you the ability to fetch the query's result in chunks rather than all at once. You can do this by specifying the n argument inside dbFetch().
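One way to use chunked fetching is a loop that keeps calling dbFetch() until the result is exhausted. The sketch below assumes RSQLite is installed and uses an in-memory database with made-up `comments` rows. `dbHasCompleted()` is the DBI function that reports whether all rows have been fetched.

```r
library(DBI)

# Assumption: in-memory SQLite database as a stand-in for a real server
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical comments table with five rows
comments <- data.frame(
  id      = 1:5,
  user_id = c(5L, 6L, 7L, 5L, 6L),
  message = c("nice!", "great!", "love it", "yuck!", "so easy!"),
  stringsAsFactors = FALSE
)
dbWriteTable(con_demo, "comments", comments)

# Step 1: send the query; nothing is imported yet
res <- dbSendQuery(con_demo, "SELECT * FROM comments WHERE user_id > 4")

# Step 2: fetch at most two rows per iteration until no rows are left
chunk_sizes <- integer(0)
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 2)
  # Process the chunk here; this sketch only records its size
  chunk_sizes <- c(chunk_sizes, nrow(chunk))
}

# Step 3: release the result set, then close the connection
dbClearResult(res)
dbDisconnect(con_demo)

print(chunk_sizes)
```

Processing one chunk at a time keeps memory use bounded by the chunk size rather than by the size of the full result.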

In [10]:
# Send query to the database
res <- dbSendQuery(con, "SELECT * FROM comments WHERE user_id > 4")

# Use dbFetch() twice
print(dbFetch(res, n = 2))

print(dbFetch(res))

# Clear res
dbClearResult(res)

    id tweat_id user_id message
1 1022       87       7   nice!
2 1000       77       7  great!
    id tweat_id user_id  message
1 1011       49       5  love it
2 1010       88       6    yuck!
3 1030       75       6 so easy!


### Be polite and ...
Every time you connect to a database using dbConnect(), you're creating a new connection to the database you're referencing. RMySQL automatically specifies a maximum of open connections and closes some of the connections for you, but still: it's always polite to manually disconnect from the database afterwards. You do this with the dbDisconnect() function.
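One possible way to make sure the disconnect always happens is to register it with `on.exit()` inside a function. This is a sketch under assumptions, not part of the exercise: it requires RSQLite, the helper `run_query()` is hypothetical, and the `tweats` rows are made up.

```r
library(DBI)

# Hypothetical helper: connect, run one query, and always disconnect
run_query <- function(sql) {
  # Assumption: in-memory SQLite database as a stand-in for a real server
  con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")

  # on.exit() runs when the function ends, even if the query fails,
  # so the connection is not left open
  on.exit(dbDisconnect(con_demo), add = TRUE)

  # Hypothetical sample rows
  tweats <- data.frame(
    post = c("just eat an apple.",
             "wash strawberries. add ice. blend. enjoy."),
    stringsAsFactors = FALSE
  )
  dbWriteTable(con_demo, "tweats", tweats)

  dbGetQuery(con_demo, sql)
}

# LENGTH() is SQLite's counterpart of MySQL's CHAR_LENGTH() for text
result <- run_query("SELECT post FROM tweats WHERE LENGTH(post) > 40")
print(result)
```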

The code that connects you to the database is already available. Can you finish the script?

In [11]:
# Create the data frame long_tweats
long_tweats = dbGetQuery(con, "SELECT post, date FROM tweats WHERE CHAR_LENGTH(post) > 40")

# Print long_tweats
print(long_tweats)

# Disconnect from the database
dbDisconnect(con)

                                                                 post
1                           wash strawberries. add ice. blend. enjoy.
2                       2 slices of bread. add cheese. grill. heaven.
3               open and crush avocado. add shrimps. perfect starter.
4 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
        date
1 2015-09-14
2 2015-09-21
3 2015-09-22
4 2015-09-22
