<a href="https://colab.research.google.com/github/isb-cgc/Community-Notebooks/blob/master/TeachingMaterials/2021-10-NIHLibrarySession/ISB_Cancer_Gateway_in_the_Cloud_(ISB_CGC)_SQL_Reference_Guide.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ISB Cancer Gateway in the Cloud (ISB-CGC) SQL Reference Guide 


This guide is intended as a quick refresher on SQL to help users understand and follow the hands-on portion of our Cloud Computing training session.

In this notebook, we will use SQLite3, a small self-contained package, to create a database and practice querying. To learn more about SQLite3, please visit the [SQLite home page](https://www.sqlite.org/index.html).





### What is SQL?

SQL, or Structured Query Language, is a language for talking to databases. It provides a predictable, consistent syntax for asking about the contents of a database and getting useful results back. It lets you select specific data and build complex reports. Today, SQL is a universal language of data. We use SQL to retrieve data from ISB-CGC BigQuery tables directly in the Google BigQuery console.



## Notebook Setup

### Import Dependencies

In [1]:
import sqlite3

### Initialize Database and Connect

In [2]:
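# Open (or create) the SQLite database file proteins.db
conn = sqlite3.connect('proteins.db')
cur = conn.cursor()
# Note: cur was created before row_factory is set, so cur still returns plain tuples
conn.row_factory = sqlite3.Row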
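
As a minimal sketch of what `row_factory` does: when it is set to `sqlite3.Row` before a cursor is created, rows fetched through that cursor can be read by column name. The names `demo_conn` and `demo_cur` and the in-memory database are illustrative assumptions, chosen so the notebook's own `conn` and `cur` are left untouched.

```python
import sqlite3

# Throwaway in-memory database; demo_* names are illustrative only
demo_conn = sqlite3.connect(":memory:")
demo_conn.row_factory = sqlite3.Row  # set BEFORE the cursor is created
demo_cur = demo_conn.cursor()

demo_cur.execute("CREATE TABLE proteins (id TEXT, name TEXT)")
demo_cur.execute("INSERT INTO proteins VALUES ('P26939', 'TLNI')")

row = demo_cur.execute("SELECT * FROM proteins").fetchone()
row_keys = row.keys()    # column names of the result
row_id = row["id"]       # access a value by column name
row_tuple = tuple(row)   # a Row can still be converted to a plain tuple
print(row_keys, row_id, row_tuple)

demo_conn.close()
```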

### How do I create a table in SQL?

Databases contain tables, which can be visualized as a spreadsheet. There's a series of rows (called records in a database) and columns. The intersection of a row and a column is called a field.

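You can create a table with the CREATE TABLE statement. It's useful to combine this with the IF NOT EXISTS clause, which avoids an error when the table already exists and leaves the existing table untouched. The cell below instead runs DROP TABLE IF EXISTS first, so the notebook always starts from empty tables.

In this example, we will create a table called proteins containing two columns, id and name.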

SQLite has five data types (also referred to as storage classes):

TEXT: a text string

INTEGER: a whole number

REAL: a floating-point number (stored as an 8-byte IEEE value, so precision is limited)

BLOB: binary data (for instance, a .jpeg or .webp image)

NULL: a null value

In our example, we will use the data type TEXT for id and name. Similarly, we will create another table called properties containing the columns id and location, both of type TEXT.

To prevent a record from being created without data in a specified field, you can add the NOT NULL directive.

The SQL to create this field is: name TEXT NOT NULL
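
To illustrate the NOT NULL directive, here is a small sketch that declares both columns NOT NULL and then tries to insert a record with a missing name. It runs against a throwaway in-memory database; the `demo_conn` and `demo_cur` names are assumptions made for this example, and the tables created in this notebook do not use NOT NULL.

```python
import sqlite3

# Throwaway in-memory database; demo_* names are illustrative only
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute(
    "CREATE TABLE IF NOT EXISTS proteins (id TEXT NOT NULL, name TEXT NOT NULL)"
)

# A complete record is accepted
demo_cur.execute("INSERT INTO proteins VALUES ('P26939', 'TLNI')")

# A record without a name violates the NOT NULL constraint
try:
    demo_cur.execute("INSERT INTO proteins (id) VALUES ('Q99PL5')")
    rejected = False
except sqlite3.IntegrityError as err:
    rejected = True
    print("Rejected:", err)

row_count = demo_cur.execute("SELECT COUNT(*) FROM proteins").fetchone()[0]
print(row_count)
demo_conn.close()
```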


In [3]:
# Create proteins table
query = """
  DROP TABLE IF EXISTS proteins
"""
cur.execute(query)

query = """
  CREATE TABLE proteins (
    id TEXT,
    name TEXT
  )
"""
cur.execute(query)

# Create properties table
query = """
  DROP TABLE IF EXISTS properties
"""
cur.execute(query)

query = """
  CREATE TABLE properties (
    id TEXT,
    location TEXT
  )
"""
cur.execute(query)

<sqlite3.Cursor at 0x7f4211bbbf10>

### How do I add data to a table?

You can populate your new tables with some sample data by using the INSERT INTO statement.

You can list the columns you want to fill after the table name. If you supply a value for every column of the table, in the order the columns were defined, you can omit the column names and use the VALUES keyword alone, as shown below.

The VALUES keyword expects a list in parentheses but can take multiple lists separated by commas.
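
The variations described above can be sketched as follows: one INSERT with several value lists, one INSERT that names its columns, and Python's `executemany` with `?` placeholders as a convenient alternative for many rows. The in-memory database and the `demo_conn`, `demo_cur`, and `more_rows` names are assumptions for illustration.

```python
import sqlite3

# Throwaway in-memory database; demo_* names are illustrative only
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE proteins (id TEXT, name TEXT)")

# One INSERT statement carrying two value lists
demo_cur.execute(
    "INSERT INTO proteins VALUES ('P26939', 'TLNI'), ('Q99PL5', 'PRBP')"
)

# Naming the columns explicitly
demo_cur.execute("INSERT INTO proteins (id, name) VALUES ('PPR8PP', 'BRCC')")

# Many rows from a Python list, using ? placeholders
more_rows = [("QQ9PL6", "TN11"), ("P36969", "PPQL")]
demo_cur.executemany("INSERT INTO proteins VALUES (?, ?)", more_rows)
demo_conn.commit()  # make the inserts permanent

inserted = demo_cur.execute(
    "SELECT id, name FROM proteins ORDER BY id"
).fetchall()
print(inserted)
demo_conn.close()
```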
 



In [4]:
# Insert into proteins table
cur.execute("INSERT INTO proteins VALUES ('P26939', 'TLNI')")
cur.execute("INSERT INTO proteins VALUES ('Q99PL5', 'PRBP')")
#cur.execute("INSERT INTO proteins VALUES ('Q6PB66', 'PGBM')")
#cur.execute("INSERT INTO proteins VALUES ('P11275', 'PPPE')")
#cur.execute("INSERT INTO proteins VALUES ('P26935', 'TLNP')")
cur.execute("INSERT INTO proteins VALUES ('PPR8PP', 'BRCC')")
cur.execute("INSERT INTO proteins VALUES ('QQ9PL6', 'TN11')")
cur.execute("INSERT INTO proteins VALUES ('QQPB66', 'TNNP')")
cur.execute("INSERT INTO proteins VALUES ('P11275', 'TINH')")
cur.execute("INSERT INTO proteins VALUES ('P36969', 'PPQL')")

# Insert into properties table
cur.execute("INSERT INTO properties VALUES ('P26939', 'cytoskeleton')")
cur.execute("INSERT INTO properties VALUES ('Q99PL5', 'cytosol')")
cur.execute("INSERT INTO properties VALUES ('Q6PB66', 'nucleus')")
cur.execute("INSERT INTO properties VALUES ('P11275', 'plasma')")
cur.execute("INSERT INTO properties VALUES ('P26935', 'plasma')")
cur.execute("INSERT INTO properties VALUES ('PPR8PP', 'cytoskeleton')")
cur.execute("INSERT INTO properties VALUES ('QQ9PL6', 'cytosol')")
#cur.execute("INSERT INTO properties VALUES ('QQPB66', 'nucleus')")
#cur.execute("INSERT INTO properties VALUES ('P11275', 'plasma')")
cur.execute("INSERT INTO properties VALUES ('P36969', 'plasma')")

<sqlite3.Cursor at 0x7f4211bbbf10>

### How do I select/fetch data in SQL? 

SELECT queries are used to fetch data from a database table.

The syntax for retrieving/fetching data using SELECT keyword is:

>SELECT column1, column2, column3, ... FROM table_name;

To select all data:

>SELECT * FROM table_name;

To select only distinct values (no duplicate records):
>SELECT DISTINCT column1, column2, column3, ... FROM table_name;

A simple SELECT example would be this query below, which will return all the columns and rows from the table.

Example:

SELECT * FROM `isb-cgc-bq.TCGA_versioned.clinical_gdc_r24` LIMIT 1000

The asterisk (*) means that we want to grab all the columns, without excluding anything. Since SQL databases usually consist of more than one table, the FROM keyword is required to specify which table we want to look in. 

The LIMIT keyword can be used to specify how many rows we would like to fetch.

Remember: The order of clauses matters in SQL. A query is written as SELECT, then FROM, then LIMIT, while the database evaluates it logically in the order FROM, SELECT, LIMIT.





SELECT Example:

In [5]:
cur.execute("SELECT * FROM proteins")
print(cur.fetchall())

[('P26939', 'TLNI'), ('Q99PL5', 'PRBP'), ('PPR8PP', 'BRCC'), ('QQ9PL6', 'TN11'), ('QQPB66', 'TNNP'), ('P11275', 'TINH'), ('P36969', 'PPQL')]


In [6]:
cur.execute("SELECT id FROM proteins LIMIT 2")
print(cur.fetchall())

[('P26939',), ('Q99PL5',)]
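
The DISTINCT form of SELECT is not exercised by the cells above, so here is a minimal sketch comparing a plain SELECT with SELECT DISTINCT. It uses a throwaway in-memory table holding a few of the properties rows; the `demo_conn` and `demo_cur` names are assumptions for this example.

```python
import sqlite3

# Throwaway in-memory database; demo_* names are illustrative only
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE properties (id TEXT, location TEXT)")
demo_cur.executemany("INSERT INTO properties VALUES (?, ?)", [
    ("P26939", "cytoskeleton"),
    ("Q99PL5", "cytosol"),
    ("P11275", "plasma"),
    ("P36969", "plasma"),
])

# Plain SELECT returns one row per record, duplicates included
all_locations = demo_cur.execute(
    "SELECT location FROM properties ORDER BY location"
).fetchall()

# DISTINCT collapses duplicate values into a single row
distinct_locations = demo_cur.execute(
    "SELECT DISTINCT location FROM properties ORDER BY location"
).fetchall()

print(all_locations)
print(distinct_locations)
demo_conn.close()
```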


### How do I filter/retrieve data in SQL based on a condition?

The WHERE keyword allows us to filter data in SQL depending on a condition.

The syntax for WHERE clause is:

>SELECT column1, column2, column3, ... FROM table_name WHERE condition;

In the example below, we count how many distinct tumor samples have a mutation recorded in KRAS.

WHERE Example:

SELECT COUNT(DISTINCT(sample_barcode_tumor)) AS numSamples
FROM `isb-cgc-bq.TCGA_versioned.somatic_mutation_hg38_gdc_r10`
WHERE Hugo_Symbol="KRAS"


SELECT Example combined with WHERE Clause 

In [7]:
cur.execute("SELECT * FROM properties WHERE location='cytoskeleton'")
print(cur.fetchall())

[('P26939', 'cytoskeleton'), ('PPR8PP', 'cytoskeleton')]


In [8]:
cur.execute("SELECT id FROM properties WHERE location='plasma'")
print(cur.fetchall())

[('P11275',), ('P26935',), ('P36969',)]


In the above example, we have used the '=' operator. You can use other operators such as '>', '<', '>=', '<=', IN, LIKE, and BETWEEN, depending on your condition.
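
A short sketch of some of these other operators follows. Because numeric comparisons need a numeric column, the demo table has an extra `length` column filled with made-up values; that column, its values, and the `demo_conn` and `demo_cur` names are assumptions for illustration and are not part of the notebook's data.

```python
import sqlite3

# Throwaway in-memory database; demo_* names are illustrative only
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
# The length column and its values are invented for this example
demo_cur.execute("CREATE TABLE proteins (id TEXT, name TEXT, length INTEGER)")
demo_cur.executemany("INSERT INTO proteins VALUES (?, ?, ?)", [
    ("P26939", "TLNI", 120),
    ("Q99PL5", "PRBP", 350),
    ("PPR8PP", "BRCC", 480),
    ("QQ9PL6", "TN11", 910),
])

def names(where_clause):
    """Return the protein names matching a WHERE condition, sorted by length."""
    sql = f"SELECT name FROM proteins WHERE {where_clause} ORDER BY length"
    return [row[0] for row in demo_cur.execute(sql)]

greater_than = names("length > 300")               # comparison operator
in_list = names("name IN ('TLNI', 'TN11')")        # membership in a list
like_prefix = names("name LIKE 'T%'")              # pattern match, % is a wildcard
between = names("length BETWEEN 120 AND 480")      # inclusive range

print(greater_than, in_list, like_prefix, between)
demo_conn.close()
```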

### How do I group data in SQL based on a condition?

The GROUP BY and ORDER BY clauses let us group and sort the rows returned by a query.

GROUP BY in SQL is used to arrange rows that share the same values into groups, usually so that an aggregate function such as COUNT can be applied to each group. The GROUP BY clause follows the WHERE clause and comes before the ORDER BY clause.

The ORDER BY clause is used to sort the data in ascending (ASC) or descending (DESC) order.

Syntax:

>SELECT column1, column2, ...

>FROM table_name

>WHERE [condition]

>GROUP BY column1, column2

>ORDER BY column1, column2;

In the example below, the query filters the proteins table to rows where name is 'BRCC', groups them by id, and sorts the result by name in ascending (ASC) order.




GROUP BY combined with ORDER BY Example:

In [9]:
cur.execute("SELECT * FROM proteins WHERE name='BRCC' GROUP BY id ORDER BY name ASC")
print(cur.fetchall())

[('PPR8PP', 'BRCC')]
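
GROUP BY is most often paired with an aggregate function. As a hedged sketch, the query below counts how many records fall in each location and sorts the groups by that count. It rebuilds the properties rows in a throwaway in-memory database; the `demo_conn` and `demo_cur` names and the `n` alias are assumptions for this example.

```python
import sqlite3

# Throwaway in-memory database; demo_* names are illustrative only
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE properties (id TEXT, location TEXT)")
demo_cur.executemany("INSERT INTO properties VALUES (?, ?)", [
    ("P26939", "cytoskeleton"), ("Q99PL5", "cytosol"), ("Q6PB66", "nucleus"),
    ("P11275", "plasma"), ("P26935", "plasma"), ("PPR8PP", "cytoskeleton"),
    ("QQ9PL6", "cytosol"), ("P36969", "plasma"),
])

# One output row per location, with the number of records in that group
counts = demo_cur.execute("""
    SELECT location, COUNT(*) AS n
    FROM properties
    GROUP BY location
    ORDER BY n DESC, location ASC
""").fetchall()

print(counts)
demo_conn.close()
```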


### How do I query from more than one table?

More complex databases usually contain several tables that are connected to each other in some way. Linking the tables logically allows us to use the information stored in both tables at the same time.

JOINs are used for fetching information from more than one table. A JOIN is a means for combining fields from two tables by using values common to each. These common variables are referred to as keys.

The commonly used types of joins are:

INNER JOIN 

The most important and frequently used of the joins is the INNER JOIN. When the join condition is an equality test, as in the examples here, it is also referred to as an equijoin.

The INNER JOIN compares each row in one table with every row in the other table and returns only the rows whose values match in both tables. This lets you query rows that contain columns from both tables.

The INNER JOIN is an optional clause of the SELECT statement. It appears immediately after the FROM clause. 

The basic syntax of the INNER JOIN is as follows:

>SELECT table1.column1, table2.column2...
FROM table1
INNER JOIN table2
ON table1.common_field = table2.common_field;

INNER JOIN Example:

In [10]:
cur.execute("SELECT proteins.name,properties.location FROM proteins INNER JOIN properties ON proteins.id = properties.id")
print(cur.fetchall())

[('TLNI', 'cytoskeleton'), ('PRBP', 'cytosol'), ('BRCC', 'cytoskeleton'), ('TN11', 'cytosol'), ('TINH', 'plasma'), ('PPQL', 'plasma')]


INNER JOIN combined with GROUP BY Clause Example:

In [11]:
cur.execute("SELECT proteins.name,properties.location FROM proteins INNER JOIN properties ON proteins.id = properties.id GROUP BY proteins.id;")
print(cur.fetchall())

[('TINH', 'plasma'), ('TLNI', 'cytoskeleton'), ('PPQL', 'plasma'), ('BRCC', 'cytoskeleton'), ('PRBP', 'cytosol'), ('TN11', 'cytosol')]


LEFT JOIN 

The LEFT JOIN returns all rows from the left table, even if there are no matches in the right table. This means that if the ON clause matches 0 (zero) records in the right table, the join will still return a row in the result, but with NULL (shown as None in Python) in each column from the right table.

LEFT JOIN keeps rows that are present in the left (first) table and drops those that are present only in the right table.

The basic syntax of a LEFT JOIN is as follows:

>SELECT table1.column1, table2.column2...
FROM table1
LEFT JOIN table2
ON table1.common_field = table2.common_field;

Here, the condition after ON can be any expression that suits your requirement.


LEFT JOIN Example:

In [12]:
cur.execute("SELECT proteins.name,properties.location FROM proteins LEFT JOIN properties ON proteins.id = properties.id")
print(cur.fetchall())

[('TLNI', 'cytoskeleton'), ('PRBP', 'cytosol'), ('BRCC', 'cytoskeleton'), ('TN11', 'cytosol'), ('TNNP', None), ('TINH', 'plasma'), ('PPQL', 'plasma')]


RIGHT JOIN

The RIGHT JOIN returns all rows from the right table, even if there are no matches in the left table. This means that if the ON clause matches 0 (zero) records in the left table, the join will still return a row in the result, but with NULL in each column from the left table.

RIGHT JOIN keeps rows that are present in the right (second) table and drops those that are present only in the left table.

The basic syntax of a RIGHT JOIN is as follows:

>SELECT table1.column1, table2.column2...
FROM table1
RIGHT JOIN table2
ON table1.common_field = table2.common_field;

RIGHT JOIN Example (requires SQLite 3.39.0 or later, which added RIGHT JOIN and FULL JOIN):

>cur.execute("SELECT proteins.name,properties.location FROM proteins RIGHT JOIN properties ON proteins.id = properties.id")
print(cur.fetchall())

Expected rows (row order is not guaranteed without ORDER BY):

>[('TLNI', 'cytoskeleton'), ('PRBP', 'cytosol'), (None, 'nucleus'), ('TINH', 'plasma'), (None, 'plasma'), ('BRCC', 'cytoskeleton'), ('TN11', 'cytosol'), ('PPQL', 'plasma')]

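On older SQLite builds that lack RIGHT JOIN, the same rows can be obtained by swapping the two tables and using a LEFT JOIN. The sketch below does this against an in-memory copy of the notebook's sample data; the `demo_conn` and `demo_cur` names are assumptions for this example, and the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

# Throwaway in-memory copy of the sample data; demo_* names are illustrative only
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE proteins (id TEXT, name TEXT)")
demo_cur.execute("CREATE TABLE properties (id TEXT, location TEXT)")
demo_cur.executemany("INSERT INTO proteins VALUES (?, ?)", [
    ("P26939", "TLNI"), ("Q99PL5", "PRBP"), ("PPR8PP", "BRCC"), ("QQ9PL6", "TN11"),
    ("QQPB66", "TNNP"), ("P11275", "TINH"), ("P36969", "PPQL"),
])
demo_cur.executemany("INSERT INTO properties VALUES (?, ?)", [
    ("P26939", "cytoskeleton"), ("Q99PL5", "cytosol"), ("Q6PB66", "nucleus"),
    ("P11275", "plasma"), ("P26935", "plasma"), ("PPR8PP", "cytoskeleton"),
    ("QQ9PL6", "cytosol"), ("P36969", "plasma"),
])

# properties is now the LEFT table, so every properties row is kept
right_join_rows = demo_cur.execute("""
    SELECT proteins.name, properties.location
    FROM properties
    LEFT JOIN proteins ON proteins.id = properties.id
    ORDER BY properties.id
""").fetchall()

print(right_join_rows)
demo_conn.close()
```

Rows whose id appears only in properties come back with None in the name column.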

FULL JOIN

A FULL JOIN joins two tables into a single result containing all rows from both, using keys (id is the key in this example) to match rows across the tables. In short, a FULL JOIN keeps all observations.

The SQL FULL JOIN combines the results of both left and right outer joins.

The joined table will contain all records from both the tables and fill in NULLs for missing matches on either side.

The basic syntax of a FULL JOIN is as follows:

>SELECT table1.column1, table2.column2...
FROM table1
FULL JOIN table2
ON table1.common_field = table2.common_field;

FULL JOIN Example (requires SQLite 3.39.0 or later, which added RIGHT JOIN and FULL JOIN):

>cur.execute("SELECT proteins.name,properties.location FROM proteins FULL JOIN properties ON proteins.id = properties.id")
print(cur.fetchall())

Expected rows (row order is not guaranteed without ORDER BY):

>[('TLNI', 'cytoskeleton'), ('PRBP', 'cytosol'), ('BRCC', 'cytoskeleton'), ('TN11', 'cytosol'), ('TNNP', None), ('TINH', 'plasma'), ('PPQL', 'plasma'), (None, 'nucleus'), (None, 'plasma')]

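Where FULL JOIN is not available, one common workaround is to combine a LEFT JOIN with the unmatched rows of the other table using UNION ALL. The sketch below shows this on an in-memory copy of the notebook's sample data; the `demo_conn` and `demo_cur` names are assumptions for this example.

```python
import sqlite3

# Throwaway in-memory copy of the sample data; demo_* names are illustrative only
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE proteins (id TEXT, name TEXT)")
demo_cur.execute("CREATE TABLE properties (id TEXT, location TEXT)")
demo_cur.executemany("INSERT INTO proteins VALUES (?, ?)", [
    ("P26939", "TLNI"), ("Q99PL5", "PRBP"), ("PPR8PP", "BRCC"), ("QQ9PL6", "TN11"),
    ("QQPB66", "TNNP"), ("P11275", "TINH"), ("P36969", "PPQL"),
])
demo_cur.executemany("INSERT INTO properties VALUES (?, ?)", [
    ("P26939", "cytoskeleton"), ("Q99PL5", "cytosol"), ("Q6PB66", "nucleus"),
    ("P11275", "plasma"), ("P26935", "plasma"), ("PPR8PP", "cytoskeleton"),
    ("QQ9PL6", "cytosol"), ("P36969", "plasma"),
])

# Part 1: every proteins row, matched or not.
# Part 2: only the properties rows that found no matching protein.
full_join_rows = demo_cur.execute("""
    SELECT proteins.name, properties.location
    FROM proteins
    LEFT JOIN properties ON proteins.id = properties.id
    UNION ALL
    SELECT proteins.name, properties.location
    FROM properties
    LEFT JOIN proteins ON proteins.id = properties.id
    WHERE proteins.id IS NULL
""").fetchall()

print(full_join_rows)
demo_conn.close()
```

The WHERE clause in the second part keeps matched rows from being listed twice.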

SUMMARY

The concepts and syntax learned in this notebook are considered standard SQL and will be applied in the hands-on practice session using ISB-CGC BigQuery table data in Google Cloud.