# Practice SQL

Okay, our `SQL` feet are wet. 
We have seen how to load databases and how to query them, so let's step a little further into the `SQL` pool and query a larger database. 
In the previous notebooks, we were using a `SQLite` engine but `SQLite` is...well...lite. Instead, in this practice we are going to work with a `PostgreSQL` database, which is one of the most advanced open source `SQL` databases available. 
Before we begin, let's talk about the data.



### Twitter Data

We'll be working with "some" Twitter data in this practice. 
"Some" is in quotes only because we are working with over 250 million rows (250 with 6 ZEROS) and that is only "some" of what is being collected. 
This particular copy of the database holds tweets collected from 50 different US cities from January 2017 until mid-May 2017. 

--- 

Given the volume of data, this database server is a good introduction to some other database concepts we have yet to touch on. 
Again, `pandas` has some pretty neat tools for interacting with databases of all flavors. All we have to do is establish the connection and we can start making requests of the database.

In [None]:
import psycopg2
import pandas as pd

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='pgsql.dsa.lan' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
except psycopg2.Error as e:
    # catch only database errors and show the real message alongside our guess
    print("Something went wrong...probably the wrong permissions:", e)

`psycopg2` is a package that allows `Python` to connect to a `Postgres` database.

We didn't state this explicitly, but a database lets us query more than just the contents of its tables. 
For example, we can write a query to get the names of the tables in the database. 
Let's do that now so we can get an idea of what types of tables we will be working with.

In [None]:
statement = """SELECT table_name 
    FROM information_schema.tables
    WHERE table_schema = 'twitter'"""

pd.read_sql_query(statement,conn)

So we are working with a `tweet`, `hashtag`, `mention`, `url`, and `job` table, but that doesn't tell us a whole lot about the contents of any given table. 
At most, we can infer from the table names what type of data might be in them. 
But it would be better to know the attributes of each table and what data types they have. 

Again, we will be referencing the database's `information_schema` in order to pull this information out. 
Study the following query and compare it with the last.

In [None]:
statement = """                              
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'hashtag';
"""

pd.read_sql_query(statement,conn)

Now we know a bit about the contents of the `hashtag` table, which is good because we will be using it quite a bit in this practice. 

**Take a moment and contemplate the `CREATE TABLE` command that was used to create this table.**

Moving on, what types of attributes are in the `tweet` data?

**Activity 1:** Write a query to look at the `column_name`, `data_type`, and `character_maximum_length` of each column in the `tweet` table.

In [None]:
# Your code for Activity 1 goes here
# --------------








**Activity 2:** How might you get **all** of the column characteristics (i.e., `column_name`, `data_type`, etc.) for a single table?

In [None]:
# Your code for Activity 2 goes here
# --------------







That's enough surface level exploring for now. 
Let's get our hands dirty a bit. 
To do that, we have to dig ;) into the data. 

Oh! But before we dig in, we need to cover the idea of `LIMIT`. 
There are several reasons to put a `LIMIT` on the number of rows returned by a query. 
For example, if we just want to get an idea of what our data looks like, we might specify a `LIMIT` of `5`. 
This is similar to running `head(dataframe_name)` in `R` or `dataframe_name.head()` in `Python`. 
But we are using it here mainly for performance. 
Remember, this is a fairly large database, and trying to pull millions of rows into `pandas` without a limit could slow the operation down to a pace that is beyond the limit of your patience. 

So let's introduce `LIMIT` right now.

```SQL
SELECT <column_names>
FROM <table_name>
LIMIT <number_of_rows>
```

In [None]:
statement = """
SELECT * 
FROM twitter.hashtag
LIMIT 1000
"""

pd.read_sql_query(statement,conn)
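
If you want to poke at `LIMIT` without touching the big server, here is a minimal sketch on a throwaway table. It is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` instead of `PostgreSQL`, and the `demo` connection and the 50 made-up rows are hypothetical stand-ins for `twitter.hashtag`. The point is that `LIMIT` is an upper bound on the number of rows, not a guarantee.

```python
import sqlite3

# Throwaway in-memory database (assumption: SQLite stands in for the PostgreSQL server)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE hashtag (tweet_id INTEGER, text TEXT)")

# 50 made-up rows, purely for illustration
demo.executemany(
    "INSERT INTO hashtag VALUES (?, ?)",
    [(i, f"tag{i % 7}") for i in range(50)],
)

all_rows = demo.execute("SELECT * FROM hashtag").fetchall()
first_five = demo.execute("SELECT * FROM hashtag LIMIT 5").fetchall()
# Asking for more rows than exist simply returns every row
too_many = demo.execute("SELECT * FROM hashtag LIMIT 500").fetchall()

print(len(all_rows), len(first_five), len(too_many))
demo.close()
```

Keep in mind that without an `ORDER BY`, the database is free to choose which rows satisfy the `LIMIT`, so "the first 1000 rows" really means "some 1000 rows".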

#### Nested Queries

Using `LIMIT` for performance's sake makes this a good time to cover nested queries. 
Nested queries are a good way to extract further information after performing some operation. 
Take the example below. 
We are trying to return a table that displays the unique hashtags in one column and how many times they occur in another. 
But remember, this database is large and we don't want to perform this over the entire table, 
so instead, we are going to perform this query over another query nested inside of it where we limit the number of rows to 1000. 

We can break this query apart, working from the inner part to the outer part.

```SQL
SELECT text
FROM twitter.hashtag
LIMIT 1000
```

We know what this query is going to return. 
It's going to return a table of only the `text` column from the `hashtag` table, 
but only the first 1000 rows. 
We will nest this table inside of another query so that the aggregation runs over only a subset (1000 rows) of the data:

```SQL
SELECT DISTINCT text, COUNT(*)
FROM( ...
...) AS t1
GROUP BY text
```

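In this outer statement, we want the unique terms and their counts for the 1000 rows we pull. 
Note that `GROUP BY text` already produces one row per distinct `text`, so the `DISTINCT` here is redundant (though harmless). 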
The `AS t1` just names the nested query. 
Now we can take a look at what this actually returns.

In [None]:
statement = """
SELECT DISTINCT text, COUNT(*)
FROM(
    SELECT text
    FROM twitter.hashtag
    LIMIT 1000) AS t1
GROUP BY text
"""

pd.read_sql_query(statement,conn)
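
One way to convince yourself that the inner `LIMIT` really is applied before the aggregation: the counts in the outer result have to add up to the number of rows the inner query let through. Here is a small sketch of that check. The assumptions are loud ones: it runs on an in-memory `sqlite3` table rather than the `PostgreSQL` server, and the hashtags and the `inner_limit` value are made up for illustration.

```python
import sqlite3

# Toy stand-in for twitter.hashtag (assumption: SQLite, made-up hashtags)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE hashtag (text TEXT)")
tags = ["mizzou", "stl", "kc", "mizzou", "stl", "mizzou"] * 5  # 30 rows in total
demo.executemany("INSERT INTO hashtag VALUES (?)", [(t,) for t in tags])

inner_limit = 10  # hypothetical limit, far smaller than the 1000 used above

rows = demo.execute(
    """
    SELECT text, COUNT(*) AS n      -- GROUP BY gives one row per text
    FROM (
        SELECT text
        FROM hashtag
        LIMIT ?) AS t1
    GROUP BY text
    """,
    (inner_limit,),
).fetchall()

# The outer query only ever saw `inner_limit` rows, so the counts sum to it
total_counted = sum(n for _, n in rows)
print(rows, total_counted)
demo.close()
```

The table holds 30 rows, yet the counts total 10, because the grouping never saw the other 20.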

Take a look at the column names of the returned data frame: `text` and `count`.  
Right now there is no order and that can get a little annoying...


**Activity 3:** Order the query above by `count`. 


In [None]:
# Your code for Activity 3 goes here
# --------------








And of course we want to know the most popular hashtags by `count`.

**Activity 4:** Order the same query above from greatest to least count.

In [None]:
# Your code for Activity 4 goes here
# ----------------







Many databases also give the user the ability to perform operations on the columns within the query. 
These operations are very helpful in removing data carpentry steps from later on in our analysis. 

You may have noticed that some hashtags appear twice (or thrice or more) despite using `DISTINCT` on the `text` column. 
Can you identify why this is? 
Well, it is because the comparison behind `DISTINCT` is case sensitive, so `Jobs` and `jobs` count as different values, and some hashtags are capitalized while others aren't. 
Fortunately, there is an operation that we can perform on the `text` column to transform these values all to one case on the fly.

Take a look at the `LOWER()` operation...

In [None]:
statement = """
SELECT LOWER(text) 
FROM twitter.hashtag
LIMIT 1000
"""

pd.read_sql_query(statement,conn)
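
To see how much difference the case folding makes, you can compare the number of distinct values before and after `LOWER()`. The sketch below does this on five made-up hashtags in an in-memory `sqlite3` table; both the data and the use of SQLite in place of `PostgreSQL` are assumptions for illustration only.

```python
import sqlite3

# Toy table with the same hashtag written in different cases (made-up data)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE hashtag (text TEXT)")
demo.executemany(
    "INSERT INTO hashtag VALUES (?)",
    [("Jobs",), ("jobs",), ("JOBS",), ("hiring",), ("Hiring",)],
)

# Case-sensitive comparison: every spelling counts as its own value
raw_distinct = demo.execute(
    "SELECT COUNT(DISTINCT text) FROM hashtag"
).fetchone()[0]

# Folding to lower case first collapses the spellings together
lowered_distinct = demo.execute(
    "SELECT COUNT(DISTINCT LOWER(text)) FROM hashtag"
).fetchone()[0]

print(raw_distinct, lowered_distinct)
demo.close()
```

Five spellings collapse into two hashtags once the case is normalized.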

**Activity 5:** Now, similar to the queries above, count the number of distinct hashtag `text`s but this time transform `text` to lower. Remember to put a `LIMIT` of 1000 in the nested query. Be sure to put the count in descending order.

*HINT*: `LOWER()` should go in the nested portion.

In [None]:
# Your code for Activity 5 goes here
# --------------












And there is so much more than just the `COUNT` of rows. 
You can also find the average (arithmetic mean) of a numeric column with `AVG`. 
Keep in mind that the query below is `GROUP`ed `BY` `text`.

In [None]:
statement = """
SELECT DISTINCT text, AVG(index_start)
FROM(
    SELECT lower(text) AS text, index_start
    FROM twitter.hashtag
    LIMIT 1000) AS t1
GROUP BY text
ORDER BY avg DESC
"""

pd.read_sql_query(statement,conn)
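
The `ORDER BY avg` above works because `PostgreSQL` names an unaliased `AVG(...)` column `avg`. Other engines name it differently, so giving the aggregate your own alias is a safer habit. Here is a minimal sketch of the same grouped average with an explicit alias. It runs on an in-memory `sqlite3` table with made-up rows, and the alias `avg_start` is a hypothetical name chosen for this example.

```python
import sqlite3

# Toy stand-in for twitter.hashtag (assumption: SQLite, made-up values)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE hashtag (text TEXT, index_start INTEGER)")
sample = [("Jobs", 10), ("jobs", 20), ("hiring", 5), ("Hiring", 15), ("jobs", 30)]
demo.executemany("INSERT INTO hashtag VALUES (?, ?)", sample)

rows = demo.execute(
    """
    SELECT text, AVG(index_start) AS avg_start
    FROM (
        SELECT LOWER(text) AS text, index_start
        FROM hashtag) AS t1
    GROUP BY text
    ORDER BY avg_start DESC
    """
).fetchall()

# jobs:   (10 + 20 + 30) / 3 = 20.0
# hiring: (5 + 15) / 2       = 10.0
print(rows)
demo.close()
```

Each group's average is computed only over the rows that share the same lower-cased `text`.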

We're going to return to the world of `JOIN`s now. 
There is one thing we should add about `JOIN`s that we didn't have the opportunity to discuss in the lab. 
It is possible to perform an inner join without using the `JOIN` keyword. 
All you have to do is list the tables in the `FROM` clause and specify which columns to match in the `WHERE` clause. 

Take the example below. 
We want to return the rows where the tweet ID matches in both the `hashtag` and `tweet` tables. 
Keep in mind that in the `tweet` table this column is named `tweet_id_str`.

In [None]:
statement = """
SELECT t.tweet_id_str, t.text AS tweet_text, h.text AS hashtag_text
FROM twitter.tweet t, twitter.hashtag h
WHERE t.tweet_id_str = h.tweet_id
LIMIT 100;
"""

pd.read_sql_query(statement,conn)
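
A comma-separated `FROM` with a matching condition in `WHERE` behaves like an `INNER JOIN`: rows without a partner in the other table are dropped. The sketch below checks this on two tiny tables. Everything in it is an assumption for illustration: the in-memory `sqlite3` database, the made-up tweets, and storing both ID columns as text.

```python
import sqlite3

# Toy stand-ins for twitter.tweet and twitter.hashtag (made-up rows)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE tweet (tweet_id_str TEXT, text TEXT)")
demo.execute("CREATE TABLE hashtag (tweet_id TEXT, text TEXT)")
demo.executemany(
    "INSERT INTO tweet VALUES (?, ?)",
    [("1", "now hiring #jobs"), ("2", "no hashtags here"), ("3", "go #tigers")],
)
demo.executemany(
    "INSERT INTO hashtag VALUES (?, ?)",
    [("1", "jobs"), ("3", "tigers"), ("99", "orphan")],
)

# Implicit join: tables listed in FROM, match condition in WHERE
implicit = demo.execute(
    """
    SELECT t.tweet_id_str, t.text, h.text
    FROM tweet t, hashtag h
    WHERE t.tweet_id_str = h.tweet_id
    ORDER BY t.tweet_id_str
    """
).fetchall()

# Explicit join: same match condition, written with INNER JOIN ... ON
explicit = demo.execute(
    """
    SELECT t.tweet_id_str, t.text, h.text
    FROM tweet t
    INNER JOIN hashtag h ON t.tweet_id_str = h.tweet_id
    ORDER BY t.tweet_id_str
    """
).fetchall()

print(implicit == explicit, implicit)
demo.close()
```

Tweet `2` (no hashtag) and hashtag `99` (no tweet) appear in neither result, which is exactly the inner-join behavior.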

# Save your notebook, then `File > Close and Halt`