## SQL - part 2

#### What is a database? 
* A collection of data, for example:
    * Dictionary: `{"EGFR": 6.8, "MYC": 4.5, "WNT1": 11.7}`
    * Tab-separated text file, or pd.DataFrame:


| GeneID  | GeneSymbol  | ExpressionValue  |
|---------|-------------|------------------|
| 7471    | WNT1        |             11.7 |
| 4609    | MYC         |              4.5 |
| 1956    | EGFR        |              6.8 |
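
To make the comparison concrete, here is a minimal sketch that stores the same three values first as a dictionary and then as rows of a table. It uses a throwaway in-memory SQLite database; the table name `expression` is made up for this illustration and is not part of the course dataset.

```python
import sqlite3

# The same data as a dictionary: gene symbol -> expression value
expression = {"EGFR": 6.8, "MYC": 4.5, "WNT1": 11.7}

# ...and as rows of a table in a temporary in-memory database
# (the table name `expression` is hypothetical)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE expression (GeneSymbol TEXT, ExpressionValue REAL)")
demo.executemany("INSERT INTO expression VALUES (?, ?)", expression.items())

rows = demo.execute(
    "SELECT GeneSymbol, ExpressionValue FROM expression ORDER BY GeneSymbol"
).fetchall()
print(rows)
demo.close()
```

Each dictionary item becomes one row, and each row comes back from the query as a tuple.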


#### Relational Database Management Systems
* Software programs such as Oracle, MySQL, SQL Server, DB2, PostgreSQL, SQLite 
* They handle the data storage, indexing, logging, tracking and security (access)  

#### Why use databases and Relational Database Management Systems?
* Easy, efficient, secure, collaborative management of data that maintains data integrity

#### What is the Structured Query Language (SQL) ?
* SQL is the standard language for relational database management systems
* SQL is used to communicate with a database

#### Why SQLite?
SQLite is a C library that provides a lightweight disk-based database that doesn’t require a separate server process and allows accessing the database using a nonstandard variant of the SQL query language. Some applications can use SQLite for internal data storage. 
* SQLite is often the technology of choice for small applications, particularly those of embedded systems and devices like phones and tablets, smart appliances, and instruments.
* It’s also possible to prototype an application using SQLite and then port the code to a larger database such as PostgreSQL or Oracle.

#### sqlite3
The sqlite3 module in the Python standard library provides a SQL interface to communicate with databases.<br>
https://docs.python.org/3/library/sqlite3.html

Once you have a `Connection`, you can create a `Cursor` object and call its execute() method to perform SQL commands.

`Cursor` objects represent a database cursor, which is used to manage the context of a fetch/retrieval operation: after `execute()` runs a query, the resulting rows are read back through the same cursor.
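
The connection, cursor, and execute steps can be tried without touching the course database. The sketch below assumes nothing beyond the standard library and uses a temporary in-memory database; the names `demo_connection` and `demo_cursor` are chosen only for this example.

```python
import sqlite3

# ":memory:" creates a temporary database that lives only in RAM,
# so this example does not touch the course database file
demo_connection = sqlite3.connect(":memory:")
demo_cursor = demo_connection.cursor()

# execute() sends one SQL statement to the database
demo_cursor.execute("SELECT 1 + 1 AS answer")

# the cursor now holds the result set; fetchone() returns the next row as a tuple
first_row = demo_cursor.fetchone()
print(first_row)

demo_connection.close()
```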

In [1]:
from sqlite3 import connect

'''    
    Establish a connection to the database.
    This statement creates the file at the given path if it does not exist.
    The file was provided so the statement should just establish the connection.
'''
connection = connect('../datasets/org.Hs.eg.sqlite')
cursor = connection.cursor()


In [2]:
for elem in dir(connection):
    if not elem.startswith("__"):
        print(elem)

DataError
DatabaseError
Error
IntegrityError
InterfaceError
InternalError
NotSupportedError
OperationalError
ProgrammingError
backup
close
commit
create_aggregate
create_collation
create_function
cursor
enable_load_extension
execute
executemany
executescript
in_transaction
interrupt
isolation_level
iterdump
load_extension
rollback
row_factory
set_authorizer
set_progress_handler
set_trace_callback
text_factory
total_changes


In [3]:
for elem in dir(cursor):
    if not elem.startswith("__"):
        print(elem)

arraysize
close
connection
description
execute
executemany
executescript
fetchall
fetchmany
fetchone
lastrowid
row_factory
rowcount
setinputsizes
setoutputsize


#### Major SQL commands: SELECT, INSERT, DELETE, UPDATE
#### SELECT - Retrieves data from one or more tables and doesn’t change the data at all 

* SELECT  * (means all columns), or the comma separated names of the columns of data you wish to return
    * Returns columns (left to right) in the order received. 
    * '*' selects ALL columns and returns them in the order in which they are defined in the table; which rows are returned is decided by the WHERE clause
* FROM is the table source or sources (comma separated)
* WHERE (optional) is the predicate clause: conditions for the query
    * Evaluates to True or False for each row
    * This clause almost always includes Column-Value pairs.
    * Omitting the Where clause returns ALL the records in that table.
    * Note: string comparison with `=` is case sensitive by default (in SQLite, `LIKE` is case insensitive for ASCII letters)
* ORDER BY (optional) indicates a sort order for the output data 
    * without ORDER BY the order of the returned rows is not guaranteed (in practice it often follows row_id, which can be very non-intuitive)  
    * ASCending or DESCending can be appended to change the sort order.  (ASC is default)
* GROUP BY (optional) groups by a column and creates summary data for a different column
* HAVING (optional) allows restrictions on the groups produced by GROUP BY (WHERE filters rows before grouping, HAVING filters the groups afterwards)
    * HAVING normally follows a GROUP BY clause
* LIMIT (optional) reduces the number of rows retrieved to the number provided after this clause
* In most SQL clients, the ";" indicates the end of a statement and requests execution
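
The clauses listed above can be seen working together in one query. This is only a sketch: it runs against a small in-memory table named `annotations` whose rows are made up for illustration, not against the course database.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE annotations (gene TEXT, go_id TEXT, evidence TEXT)")
demo.executemany(
    "INSERT INTO annotations VALUES (?, ?, ?)",
    [
        ("A1BG", "GO:0008150", "ND"),
        ("A2M", "GO:0001869", "IDA"),
        ("A2M", "GO:0002576", "TAS"),
        ("A2M", "GO:0007597", "TAS"),
        ("NAT1", "GO:0006805", "TAS"),
        ("NAT2", "GO:0006805", "TAS"),
        ("NAT2", "GO:0008150", "IEA"),
    ],
)

sql = """
SELECT gene, count(*) AS n_terms   -- columns to return
FROM annotations                   -- source table
WHERE evidence <> 'ND'             -- keep only the rows that satisfy the condition
GROUP BY gene                      -- one output row per gene
HAVING n_terms >= 2                -- keep only the groups that satisfy the condition
ORDER BY n_terms DESC              -- sort the output
LIMIT 2;                           -- return at most 2 rows
"""
clause_demo = demo.execute(sql).fetchall()
print(clause_demo)
demo.close()
```

The clauses must be written in this order, even though the database evaluates WHERE before GROUP BY and HAVING after it.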


In [8]:
sql = '''SELECT *
FROM sqlite_master 
LIMIT 5;'''
cursor.execute(sql)
cursor.description

(('type', None, None, None, None, None, None),
 ('name', None, None, None, None, None, None),
 ('tbl_name', None, None, None, None, None, None),
 ('rootpage', None, None, None, None, None, None),
 ('sql', None, None, None, None, None, None))

In [9]:
# In every SQLite database, there is a special table: sqlite_master
# sqlite_master -  describes the contents of the database

sql = '''SELECT type, name 
FROM sqlite_master 
LIMIT 5;'''
cursor.execute(sql)

<sqlite3.Cursor at 0x12236bb90>

In [10]:
# See the result header

cursor.description

(('type', None, None, None, None, None, None),
 ('name', None, None, None, None, None, None))

In [11]:
def get_header(cursor):
    '''
    Makes a tab delimited header row from the cursor description.
    Arguments:
        cursor: a cursor after a select query
    Returns:
        string: A string consisting of the column names separated by tabs, no new line
    '''
    return '\t'.join([row[0] for row in cursor.description])


#### Important for homework 8
#### Different ways to retrieve results - observe the different data structures displayed

In [12]:
# See the result

cursor.execute(sql)
print("Iterate through the cursor:")
for row in cursor: 
    print(row)
    
print()

cursor.execute(sql)
print("Use the Cursor fetchall() method:")
cursor.fetchall()

Iterate through the cursor:
('table', 'metadata')
('index', 'sqlite_autoindex_metadata_1')
('table', 'map_metadata')
('table', 'map_counts')
('index', 'sqlite_autoindex_map_counts_1')

Use the Cursor fetchall() method:


[('table', 'metadata'),
 ('index', 'sqlite_autoindex_metadata_1'),
 ('table', 'map_metadata'),
 ('table', 'map_counts'),
 ('index', 'sqlite_autoindex_map_counts_1')]
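
Besides iterating and `fetchall()`, the cursor also offers `fetchone()` and `fetchmany()` (both appear in the `dir(cursor)` listing above). The sketch below uses a made-up in-memory table `numbers` to show that each call continues where the previous one stopped.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo_cursor = demo.cursor()
demo_cursor.execute("CREATE TABLE numbers (n INTEGER)")
demo_cursor.executemany("INSERT INTO numbers VALUES (?)", [(i,) for i in range(1, 6)])

demo_cursor.execute("SELECT n FROM numbers ORDER BY n")
one = demo_cursor.fetchone()     # a single tuple
some = demo_cursor.fetchmany(2)  # a list with up to 2 tuples
rest = demo_cursor.fetchall()    # a list with all the remaining tuples
print(one, some, rest)
demo.close()
```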

In [13]:
def get_results(cursor):
    '''
    Makes a tab delimited table from the cursor results.
    Arguments:
        cursor: a cursor after a select query
    Returns:
        string: A string consisting of the result rows, one row per line,
                with the values in each row separated by tabs
    ''' 
    res = list()
    for row in cursor.fetchall():
        # convert every value to a string (None becomes 'None') before joining
        res.append('\t'.join(map(str, row)))
    return "\n".join(res)

In [14]:
cursor.execute(sql)

print(get_header(cursor))
print(get_results(cursor))

type	name
table	metadata
index	sqlite_autoindex_metadata_1
table	map_metadata
table	map_counts
index	sqlite_autoindex_map_counts_1


In [15]:
# WHERE clause example (-- denotes comment)
# more examples later

sql = '''
SELECT name
FROM sqlite_master 
WHERE type = 'table'; -- condition that allows the selection of specific rows
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

name
metadata
map_metadata
map_counts
genes
gene_info
chromosomes
accessions
cytogenetic_locations
omim
refseq
pubmed
unigene
chrlengths
go_bp
go_mf
go_cc
go_bp_all
go_mf_all
go_cc_all
kegg
ec
chromosome_locations
pfam
prosite
alias
ensembl
ensembl2ncbi
ncbi2ensembl
ensembl_prot
ensembl_trans
uniprot
ucsc
sqlite_stat1
sqlite_stat4
sqlite_sequence


In [16]:
sql = '''
SELECT *
FROM go_bp LIMIT 10;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

_id	go_id	evidence
1	GO:0002576	TAS
1	GO:0008150	ND
1	GO:0043312	TAS
2	GO:0001869	IDA
2	GO:0002576	TAS
2	GO:0007597	TAS
2	GO:0010951	IEA
2	GO:0022617	TAS
2	GO:0048863	IEA
2	GO:0051056	TAS


Aliasing column names to make them easier to understand 

In [17]:
sql = '''
SELECT _id "Gene ID", symbol Symbol, gene_name Name 
FROM gene_info LIMIT 5;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

Gene ID	Symbol	Name
1	A1BG	alpha-1-B glycoprotein
2	A2M	alpha-2-macroglobulin
3	A2MP1	alpha-2-macroglobulin pseudogene 1
4	NAT1	N-acetyltransferase 1
5	NAT2	N-acetyltransferase 2


In [20]:
#select all from go_bp
sql = '''
SELECT *
FROM go_bp
LIMIT 5;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))
 

_id	go_id	evidence
1	GO:0002576	TAS
1	GO:0008150	ND
1	GO:0043312	TAS
2	GO:0001869	IDA
2	GO:0002576	TAS


http://geneontology.org/docs/guide-go-evidence-codes/
* Inferred from Experiment (EXP)
* Inferred from Direct Assay (IDA)
* Inferred from Physical Interaction (IPI)
* Inferred from Mutant Phenotype (IMP)
* Inferred from Genetic Interaction (IGI)
* Inferred from Expression Pattern (IEP)
* Inferred from High Throughput Experiment (HTP)
* Inferred from High Throughput Direct Assay (HDA)
* Inferred from High Throughput Mutant Phenotype (HMP)
* Inferred from High Throughput Genetic Interaction (HGI)
* Inferred from High Throughput Expression Pattern (HEP)
* Inferred from Biological aspect of Ancestor (IBA)
* Inferred from Biological aspect of Descendant (IBD)
* Inferred from Key Residues (IKR)
* Inferred from Rapid Divergence (IRD)
* Inferred from Sequence or structural Similarity (ISS)
* Inferred from Sequence Orthology (ISO)
* Inferred from Sequence Alignment (ISA)
* Inferred from Sequence Model (ISM)
* Inferred from Genomic Context (IGC)
* Inferred from Reviewed Computational Analysis (RCA)
* Traceable Author Statement (TAS)
* Non-traceable Author Statement (NAS)
* Inferred by Curator (IC)
* No biological Data available (ND)
* Inferred from Electronic Annotation (IEA)


# COUNT returns a single number, which is the count of all rows in the table

sql = '''
SELECT count(*) FROM genes;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

In [22]:
sql = '''
SELECT count(_id) AS 'Number of genes' 
FROM genes;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

Number of genes
61521


In [23]:
sql = '''
SELECT count(*) AS 'Number of genes' 
FROM genes;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

Number of genes
61521


In [24]:
# DISTINCT selects  non-duplicated elements (rows)

sql = '''
SELECT _id FROM go_bp LIMIT 5;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

print()

sql = '''
SELECT DISTINCT _id FROM go_bp LIMIT 5;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

_id
1
1
1
2
2

_id
1
2
4
5
7


In [25]:
# count the number of rows on go_bp
sql = '''
SELECT count(_id) AS 'Number of associations' 
FROM go_bp;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))


Number of associations
130884


In [26]:
# DISTINCT before count() only de-duplicates the (single) result row, so this still counts all the rows in go_bp
sql = '''
SELECT DISTINCT count(_id) AS 'Number of associations' 
FROM go_bp;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))


Number of associations
130884


In [27]:
# count the number of distinct genes in go_bp: DISTINCT goes inside count()
sql = '''
SELECT count(DISTINCT _id) AS 'Number of associations' 
FROM go_bp;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

Number of associations
17913


#### WHERE clause operators
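https://www.sqlite.org/lang_expr.html

| Operator            | Meaning                                                      |
|---------------------|--------------------------------------------------------------|
| `<>`, `!=`          | inequality                                                   |
| `<`                 | less than                                                    |
| `<=`                | less than or equal                                           |
| `=`                 | equal                                                        |
| `>`                 | greater than                                                 |
| `>=`                | greater than or equal                                        |
| `BETWEEN v1 AND v2` | tests that a value lies in a given range (end points included) |
| `EXISTS`            | tests for the existence of rows matching a query             |
| `IN`                | tests if a value falls within a given set or query result    |
| `IS [ NOT ] NULL`   | is or is not null                                            |
| `[ NOT ] LIKE`      | tests whether a value matches (or does not match) a pattern  |

`%` is the wildcard in SQL, used in conjunction with LIKE; it matches any sequence of characters, while `_` matches exactly one character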
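
A few of these operators can be tried on a small example before using them on the real tables. The sketch below assumes a made-up in-memory table `gene_demo`; the helper `ids` exists only for this illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE gene_demo (_id INTEGER, symbol TEXT, alias TEXT)")
demo.executemany(
    "INSERT INTO gene_demo VALUES (?, ?, ?)",
    [(1, "A1BG", "ABG"), (2, "A2M", None), (9, "NAT1", "AAC1"), (10, "NAT2", None)],
)

def ids(where_clause):
    """Return the _id values of the rows matching a WHERE clause, in _id order."""
    query = f"SELECT _id FROM gene_demo WHERE {where_clause} ORDER BY _id"
    return [row[0] for row in demo.execute(query)]

between_ids = ids("_id BETWEEN 2 AND 9")     # both end points are included
in_ids = ids("_id IN (1, 10)")
like_ids = ids("symbol LIKE 'NAT%'")         # % matches any sequence of characters
single_char_ids = ids("symbol LIKE 'A_M'")   # _ matches exactly one character
null_ids = ids("alias IS NULL")              # NULL is tested with IS, not with =
print(between_ids, in_ids, like_ids, single_char_ids, null_ids)
demo.close()
```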


In [28]:
sql = '''
SELECT * FROM go_bp 
WHERE _id = 1;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

_id	go_id	evidence
1	GO:0002576	TAS
1	GO:0008150	ND
1	GO:0043312	TAS


In [30]:
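gene_ids = (1, 5, 7)
# Build one "?" placeholder per value and pass the values separately to execute();
# this also works for a single value and avoids pasting data into the SQL text
placeholders = ','.join('?' * len(gene_ids))
sql = f'''
SELECT * FROM go_bp 
WHERE _id IN ({placeholders});
'''
cursor.execute(sql, gene_ids)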
print(get_header(cursor))
print(get_results(cursor))

_id	go_id	evidence
1	GO:0002576	TAS
1	GO:0008150	ND
1	GO:0043312	TAS
5	GO:0006805	TAS
7	GO:0002576	TAS
7	GO:0006953	IEA
7	GO:0006954	NAS
7	GO:0019216	NAS
7	GO:0030277	NAS
7	GO:0043312	TAS


In [31]:
sql = '''
SELECT * FROM go_bp 
WHERE _id IN (1,5,7);
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

_id	go_id	evidence
1	GO:0002576	TAS
1	GO:0008150	ND
1	GO:0043312	TAS
5	GO:0006805	TAS
7	GO:0002576	TAS
7	GO:0006953	IEA
7	GO:0006954	NAS
7	GO:0019216	NAS
7	GO:0030277	NAS
7	GO:0043312	TAS


In [33]:
sql = '''
SELECT * FROM go_bp 
WHERE evidence = 'ND' AND _id BETWEEN 20 AND 2000 
--LIMIT 10
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

_id	go_id	evidence
170	GO:0008150	ND
514	GO:0008150	ND
516	GO:0008150	ND
618	GO:0008150	ND
619	GO:0008150	ND
667	GO:0008150	ND
702	GO:0008150	ND
815	GO:0008150	ND
1153	GO:0008150	ND
1181	GO:0008150	ND
1253	GO:0008150	ND
1407	GO:0008150	ND
1575	GO:0008150	ND
1577	GO:0008150	ND
1688	GO:0008150	ND
1693	GO:0008150	ND
1708	GO:0008150	ND
1765	GO:0008150	ND
1773	GO:0008150	ND
1891	GO:0008150	ND


In [34]:
sql = '''
SELECT * 
FROM go_bp
WHERE go_id LIKE '%0081%' 
LIMIT 10;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

_id	go_id	evidence
1	GO:0008150	ND
163	GO:0008104	IDA
170	GO:0008150	ND
399	GO:0060081	IEA
451	GO:0008156	IMP
479	GO:0008104	ISS
487	GO:2000811	IMP
495	GO:0008104	IEA
514	GO:0008150	ND
516	GO:0008150	ND


In [None]:
# Retrieve rows from go_bp where the go_id is GO:0008104 and evidence is IEA or IDA

In [35]:
sql = '''
SELECT *
FROM go_bp 
WHERE go_id = 'GO:0008104' AND evidence IN ('IEA', 'IDA');
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

_id	go_id	evidence
163	GO:0008104	IDA
495	GO:0008104	IEA
807	GO:0008104	IDA
1394	GO:0008104	IEA
2279	GO:0008104	IEA
3422	GO:0008104	IEA
3802	GO:0008104	IEA
3900	GO:0008104	IDA
3922	GO:0008104	IEA
5279	GO:0008104	IEA
5819	GO:0008104	IDA
7071	GO:0008104	IEA
7722	GO:0008104	IDA
8990	GO:0008104	IDA
10012	GO:0008104	IDA
10366	GO:0008104	IEA
10395	GO:0008104	IEA
11969	GO:0008104	IDA
13897	GO:0008104	IEA
13976	GO:0008104	IEA
14535	GO:0008104	IEA
16381	GO:0008104	IDA
17218	GO:0008104	IDA
19318	GO:0008104	IEA
19647	GO:0008104	IDA
26292	GO:0008104	IDA


In [36]:
sql = '''
SELECT *
FROM go_bp 
WHERE go_id = 'GO:0008104' AND (evidence = 'IEA' OR evidence = 'IDA');
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

_id	go_id	evidence
163	GO:0008104	IDA
495	GO:0008104	IEA
807	GO:0008104	IDA
1394	GO:0008104	IEA
2279	GO:0008104	IEA
3422	GO:0008104	IEA
3802	GO:0008104	IEA
3900	GO:0008104	IDA
3922	GO:0008104	IEA
5279	GO:0008104	IEA
5819	GO:0008104	IDA
7071	GO:0008104	IEA
7722	GO:0008104	IDA
8990	GO:0008104	IDA
10012	GO:0008104	IDA
10366	GO:0008104	IEA
10395	GO:0008104	IEA
11969	GO:0008104	IDA
13897	GO:0008104	IEA
13976	GO:0008104	IEA
14535	GO:0008104	IEA
16381	GO:0008104	IDA
17218	GO:0008104	IDA
19318	GO:0008104	IEA
19647	GO:0008104	IDA
26292	GO:0008104	IDA


In [37]:
# ORDER BY (optional) - indicates a sort order for the output data: ASC or DESC

sql = '''
SELECT _id, go_id
FROM go_bp 
WHERE evidence = 'ND'
ORDER BY _id  DESC
LIMIT 20;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

_id	go_id
37596	GO:0008150
37166	GO:0008150
36360	GO:0008150
33578	GO:0008150
30423	GO:0008150
29930	GO:0008150
29554	GO:0008150
29476	GO:0008150
28708	GO:0008150
28584	GO:0008150
28574	GO:0008150
28534	GO:0008150
28193	GO:0008150
28180	GO:0008150
28072	GO:0008150
27996	GO:0008150
27829	GO:0008150
27646	GO:0008150
27583	GO:0008150
27576	GO:0008150


In [40]:
sql = '''
SELECT _id, go_id
FROM go_bp 
WHERE evidence = 'ND'
ORDER BY _id  DESC
LIMIT 20;
'''
cursor.execute(sql)
cursor.fetchall()[0][0]


37596

SQLite also has PRAGMA statements <br>
A PRAGMA is an SQL extension specific to SQLite that is used to modify the operation of the SQLite library or to query the SQLite library for internal (non-table) data <br>
https://www.sqlite.org/pragma.html <br>
The code below shows how to get the schema of a table (its columns and the information about each column)

In [41]:
sql = 'PRAGMA table_info("go_bp")'
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

cid	name	type	notnull	dflt_value	pk
0	_id	INTEGER	1	None	0
1	go_id	CHAR(10)	1	None	0
2	evidence	CHAR(3)	1	None	0


In [42]:
sql = '''SELECT * FROM pragma_table_info("go_bp")  '''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

cid	name	type	notnull	dflt_value	pk
0	_id	INTEGER	1	None	0
1	go_id	CHAR(10)	1	None	0
2	evidence	CHAR(3)	1	None	0


In [43]:
sql = '''
    SELECT DISTINCT _id 
    FROM go_bp
    WHERE go_id = 'GO:0008104'
    '''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

_id
163
479
495
703
705
807
1394
1471
1679
2279
3422
3456
3802
3900
3922
3973
5279
5819
5900
6619
6647
7006
7071
7084
7218
7631
7722
7724
8218
8502
8759
8868
8903
8990
9114
9117
9354
9515
9786
10012
10366
10395
11715
11969
13232
13897
13976
14535
15305
16381
16608
16674
17218
18628
18658
19318
19647
24489
26292


In [44]:
# SUB-QUERY - we can have a query in a query

sql = '''
SELECT _id, symbol, gene_name 
FROM gene_info
WHERE _id IN
    (SELECT DISTINCT _id 
    FROM go_bp
    WHERE go_id = 'GO:0008104'); 
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

_id	symbol	gene_name
163	NR0B1	nuclear receptor subfamily 0 group B member 1
479	BBS2	Bardet-Biedl syndrome 2
495	BCL6	BCL6 transcription repressor
703	CAV1	caveolin 1
705	CAV3	caveolin 3
807	CD81	CD81 molecule
1394	DHCR24	24-dehydrocholesterol reductase
1471	DRD2	dopamine receptor D2
1679	ERCC3	ERCC excision repair 3, TFIIH core complex helicase subunit
2279	GNGT1	G protein subunit gamma transducin 1
3422	MECP2	methyl-CpG binding protein 2
3456	MGAT3	mannosyl (beta-1,4-)-glycoprotein beta-1,4-N-acetylglucosaminyltransferase
3802	NEDD8	NEDD8 ubiquitin like modifier
3900	NPM1	nucleophosmin 1
3922	NRCAM	neuronal cell adhesion molecule
3973	ODF2	outer dense fiber of sperm tails 2
5279	SLC9A2	solute carrier family 9 member A2
5819	TP53	tumor protein p53
5900	TSC2	TSC complex subunit 2
6619	ULK1	unc-51 like autophagy activating kinase 1
6647	DOC2B	double C2 domain beta
7006	SQSTM1	sequestosome 1
7071	PLOD3	procollagen-lysine,2-oxoglutarate 5-dioxygenase 3
7084	HAP1	huntingtin associated pro

In [46]:
# GROUP BY groups by a column and creates summary data for a different column
# count entries for each GO term

sql = '''
SELECT go_id, count(*) 
FROM go_bp 
GROUP BY go_id 
LIMIT 10;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

go_id	count(*)
GO:0000002	11
GO:0000003	1
GO:0000012	9
GO:0000018	6
GO:0000019	2
GO:0000022	2
GO:0000023	3
GO:0000027	7
GO:0000028	4
GO:0000038	13


In [47]:
# specify column in aggregate function and alias the name of the columns

sql = '''
SELECT go_id as "GO Term ID", count(_id) as "Gene Number" 
FROM go_bp 
GROUP BY go_id 
LIMIT 10;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

GO Term ID	Gene Number
GO:0000002	11
GO:0000003	1
GO:0000012	9
GO:0000018	6
GO:0000019	2
GO:0000022	2
GO:0000023	3
GO:0000027	7
GO:0000028	4
GO:0000038	13


In [50]:
# HAVING allows restrictions on the groups (WHERE filters rows before grouping, HAVING filters the groups afterwards)
# HAVING normally follows a GROUP BY clause

sql = '''
SELECT go_id, count(_id) as gene_no 
FROM go_bp 
GROUP BY go_id
HAVING gene_no>500;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

go_id	gene_no
GO:0000122	770
GO:0006357	924
GO:0006396	544
GO:0006915	547
GO:0007165	830
GO:0007186	942
GO:0008150	619
GO:0008284	525
GO:0016567	519
GO:0035195	854
GO:0043066	519
GO:0045892	542
GO:0045893	628
GO:0045944	1161
GO:0055114	507


In [52]:
# Select gene ids with more than 100 biological processes associated


sql = '''
SELECT _id, count(go_id) AS go_term_no 
FROM go_bp 
GROUP BY _id
HAVING go_term_no>100
ORDER BY go_term_no DESC;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))



_id	go_term_no
5710	199
534	163
5791	159
175	152
1229	149
5819	148
6084	148
5209	144
9499	128
4058	121
5382	119
487	116
294	114
4607	112
2137	109
1832	108
2021	108
5427	108
296	106
20	105
6042	105
17545	105
2906	103
5712	103
8982	103
1578	102


#### A `PRIMARY KEY` is a very important concept to understand.  
* It is the designation for a column or a set of columns from a table.
* It is recommended to be a serial value and not something related to the business needs of the data in the table.

* A primary key is used to uniquely identify a row of data; combined with a column name, uniquely locates a data entry
* A primary key by definition must be `UNIQUE` and `NOT NULL` 
* The primary key of a table, should be a (sequential) non-repeating and not null value  
* Primary keys are generally identified at time of table creation  
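* A common method for generating a primary key is to set the datatype to `INTEGER` and declare `AUTOINCREMENT`, so that a new value is generated when a row is inserted without one (in SQLite an `INTEGER PRIMARY KEY` column is filled in automatically even without `AUTOINCREMENT`; the keyword additionally prevents the reuse of old values)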
* Primary keys can be a composite of 2 or more columns that uniquely identify the data in the table



#### A `FOREIGN KEY` is a column(s) that points to the `PRIMARY KEY` of another table 

* The purpose of the foreign key is to ensure referential integrity of the data. 
In other words, only values that are supposed to appear in the database are permitted.<br>
Only the values that exist in the `PRIMARY KEY` column are allowed to be present in the `FOREIGN KEY` column
(SQLite enforces this only when `PRAGMA foreign_keys = ON` is set for the connection).<br>
Example: in this database, the `_id` column of the `go_bp` table is declared as a `FOREIGN KEY` that references the `_id` column of the `genes` table, so every GO term annotation is associated with a gene.

Foreign keys are also the underpinning of how tables are joined and relationships portrayed in the database
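
Referential integrity can be observed directly. The sketch below uses two made-up in-memory tables (`genes_demo` and `go_demo`) and assumes the SQLite library bundled with Python was built with foreign key support, which is the usual case; note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON` has been run on the connection.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on for the connection
demo.execute("PRAGMA foreign_keys = ON")
demo.execute("CREATE TABLE genes_demo (_id INTEGER PRIMARY KEY, symbol TEXT)")
demo.execute("""
CREATE TABLE go_demo (
    _id INTEGER NOT NULL,
    go_id TEXT NOT NULL,
    FOREIGN KEY (_id) REFERENCES genes_demo (_id)
)""")
demo.execute("INSERT INTO genes_demo VALUES (1, 'A1BG')")

# _id 1 exists in genes_demo, so this row is accepted
demo.execute("INSERT INTO go_demo VALUES (1, 'GO:0008150')")

# _id 99 does not exist in genes_demo, so the insert is rejected
try:
    demo.execute("INSERT INTO go_demo VALUES (99, 'GO:0008150')")
    rejected = False
except sqlite3.IntegrityError as error:
    rejected = True
    print("Rejected:", error)

n_rows = demo.execute("SELECT count(*) FROM go_demo").fetchone()[0]
demo.close()
```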


#### JOIN tables

* Multiple tables contain different data that we want to retrieve from a single query
* In order to assemble data as part of a query, a JOIN between tables is needed
* This is a very common practice, since it’s rare for all the data you want to be in a single table


* INNER JOIN - return only those rows where there is matching content in BOTH tables (is the default when JOIN is used)
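* OUTER JOIN - also returns the rows that have no match in the other table, with the missing columns filled with NULL (LEFT, RIGHT or FULL, depending on which table's unmatched rows are kept)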
* SELF JOIN - can be used to join a table to itself (through aliasing), to compare data internal to the table

```sql
SELECT ... FROM table1 [INNER] JOIN table2 ON conditional_expression
```
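
The difference between an inner and an outer join shows up as soon as one table has a row without a partner in the other. This sketch uses two made-up in-memory tables, `gene_demo` and `go_demo`, in which the third gene has no GO annotation.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE gene_demo (_id INTEGER PRIMARY KEY, symbol TEXT)")
demo.execute("CREATE TABLE go_demo (_id INTEGER, go_id TEXT)")
demo.executemany(
    "INSERT INTO gene_demo VALUES (?, ?)", [(1, "A1BG"), (2, "A2M"), (3, "A2MP1")]
)
demo.executemany(
    "INSERT INTO go_demo VALUES (?, ?)", [(1, "GO:0008150"), (2, "GO:0001869")]
)

# INNER JOIN: only the genes that have at least one matching row in go_demo
inner_rows = demo.execute("""
SELECT g.symbol, go.go_id
FROM gene_demo AS g
INNER JOIN go_demo AS go ON g._id = go._id
ORDER BY g._id
""").fetchall()

# LEFT OUTER JOIN: every gene; go_id is NULL (None in Python) when there is no match
left_rows = demo.execute("""
SELECT g.symbol, go.go_id
FROM gene_demo AS g
LEFT OUTER JOIN go_demo AS go ON g._id = go._id
ORDER BY g._id
""").fetchall()
print(inner_rows)
print(left_rows)
demo.close()
```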


In [53]:
sql = '''
SELECT symbol,go_id, evidence
FROM gene_info AS gi
INNER JOIN go_bp AS go
ON gi._id = go._id
WHERE evidence = 'ND'
LIMIT 5;'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

symbol	go_id	evidence
A1BG	GO:0008150	ND
CRYBG1	GO:0008150	ND
BFSP1	GO:0008150	ND
BGN	GO:0008150	ND
TMEM50B	GO:0008150	ND


In [54]:
sql = '''
SELECT gi._id, go._id, gi.symbol, go.go_id, go.evidence
FROM gene_info AS gi
INNER JOIN go_bp AS go
ON gi._id = go._id
WHERE evidence = 'ND'
LIMIT 5;'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

_id	_id	symbol	go_id	evidence
1	1	A1BG	GO:0008150	ND
170	170	CRYBG1	GO:0008150	ND
514	514	BFSP1	GO:0008150	ND
516	516	BGN	GO:0008150	ND
618	618	TMEM50B	GO:0008150	ND


In [55]:
sql = '''
SELECT *
FROM gene_info AS gi
INNER JOIN go_bp AS go
ON gi._id = go._id
WHERE evidence = 'ND'
LIMIT 5;'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

_id	gene_name	symbol	_id	go_id	evidence
1	alpha-1-B glycoprotein	A1BG	1	GO:0008150	ND
170	crystallin beta-gamma domain containing 1	CRYBG1	170	GO:0008150	ND
514	beaded filament structural protein 1	BFSP1	514	GO:0008150	ND
516	biglycan	BGN	516	GO:0008150	ND
618	transmembrane protein 50B	TMEM50B	618	GO:0008150	ND


#### See the create table statement

In [56]:
# sql column in the sqlite_master table

sql = '''
SELECT sql
FROM sqlite_master 
WHERE type = 'table' AND name = 'go_bp'
LIMIT 2;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

sql
CREATE TABLE go_bp (
      _id INTEGER NOT NULL,                         -- REFERENCES  genes 
      go_id CHAR(10) NOT NULL,                      -- GO ID
      evidence CHAR(3) NOT NULL,                    -- GO evidence code
      FOREIGN KEY (_id) REFERENCES  genes  (_id)
    )



### CREATE TABLE  - statement
https://www.sqlitetutorial.net/sqlite-create-table/

```sql
CREATE TABLE [IF NOT EXISTS] [schema_name].table_name (
    column_1 data_type PRIMARY KEY,
    column_2 data_type NOT NULL,
    column_3 data_type DEFAULT 0,
    table_constraints
) [WITHOUT ROWID];
```

In this syntax:

* First, specify the name of the table that you want to create after the CREATE TABLE keywords. The name of the table cannot start with sqlite_ because it is reserved for the internal use of SQLite.
* Second, use `IF NOT EXISTS` option to create a new table if it does not exist. Attempting to create a table that already exists without using the IF NOT EXISTS option will result in an error.
* Third, optionally specify the schema_name to which the new table belongs. The schema can be the main database, temp database or any attached database.
* Fourth, specify the column list of the table. Each column has a name, data type, and the column constraint. SQLite supports `PRIMARY KEY, UNIQUE, NOT NULL`, and `CHECK` column constraints.
* Fifth, specify the table constraints such as PRIMARY KEY, FOREIGN KEY, UNIQUE, and CHECK constraints.
* Finally, optionally use the `WITHOUT ROWID` option. By default, a row in a table has an implicit column, which is referred to as the rowid, oid or _rowid_ column. The rowid column stores a 64-bit signed integer key that uniquely identifies the row inside the table. If you don’t want SQLite to create the rowid column, specify the WITHOUT ROWID option. A table that contains the rowid column is known as a rowid table. Note that the WITHOUT ROWID option is only available in SQLite 3.8.2 or later.

https://www.sqlite.org/syntaxdiagrams.html#create-table-stmt

<img src = "https://www.sqlite.org/images/syntax/create-table-stmt.gif" width="800"/>

Each value stored in an SQLite database (or manipulated by the database engine) has one of the following storage classes:
https://www.sqlite.org/datatype3.html
* `NULL`. The value is a NULL value.
* `INTEGER`. The value is a signed integer, stored in 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value.
* `REAL`. The value is a floating point value, stored as an 8-byte IEEE floating point number.
* `TEXT`. The value is a text string, stored using the database encoding (UTF-8, UTF-16BE or UTF-16LE).
* `BLOB`. The value is a blob of data, stored exactly as it was input.
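
SQLite's built-in `typeof()` function reports the storage class of a value, so the five classes can be checked directly. A minimal sketch, using an in-memory database and literal values chosen only for illustration:

```python
import sqlite3

demo = sqlite3.connect(":memory:")
storage_classes = demo.execute(
    "SELECT typeof(NULL), typeof(42), typeof(4.5), typeof('MYC'), typeof(x'00ff')"
).fetchone()
print(storage_classes)
demo.close()
```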


The `sqlite_master` table has the following create statement: 
```sql
CREATE TABLE sqlite_master ( type TEXT, name TEXT, tbl_name TEXT, rootpage INTEGER, sql TEXT );
```

#### Create the table `go_bp_ALT`

##### The `connection` object methods can be used to save or revert/reset the changes after a command that makes changes to the database
##### `COMMIT` - save the changes 
##### `ROLLBACK` - revert the changes 


In [57]:
sql='''
CREATE TABLE IF NOT EXISTS go_bp_ALT (
      gene_go_id INTEGER PRIMARY KEY AUTOINCREMENT,
      gene_id INTEGER NOT NULL,                     -- REFERENCES  genes _id 
      go_id CHAR(10) NOT NULL,                      -- GO ID
      evidence CHAR(30) NOT NULL,                   -- GO evidence information
      FOREIGN KEY (gene_id) REFERENCES  genes  (_id)
    );
'''
try:
    cursor.execute(sql)
except connection.DatabaseError:
    print("Creating the go_bp_ALT table resulted in a database error!")
    connection.rollback()
    raise
else:
    connection.commit()
finally:
    print("done!")
    
    

done!


##### Similar error handling, as seen above, can be used when executing any statement that changes the database.
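
The effect of `commit()` and `rollback()` can be checked on a small example. This sketch uses a made-up in-memory table `notes` and assumes the default transaction handling of the `sqlite3` module, in which an INSERT implicitly opens a transaction.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE notes (txt TEXT)")
demo.commit()

demo.execute("INSERT INTO notes VALUES ('kept')")
demo.commit()      # the first row is now saved

demo.execute("INSERT INTO notes VALUES ('discarded')")
demo.rollback()    # the second row is thrown away

saved = demo.execute("SELECT txt FROM notes").fetchall()
print(saved)
demo.close()
```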

##### Check if the new table appears in the `sqlite_master` table 

In [60]:
sql = '''
SELECT name, sql
FROM sqlite_master 
WHERE name LIKE 'go_bp%'
LIMIT 4;
'''
cursor.execute(sql)
print(cursor.fetchall())

[('go_bp', 'CREATE TABLE go_bp (\n      _id INTEGER NOT NULL,                         -- REFERENCES  genes \n      go_id CHAR(10) NOT NULL,                      -- GO ID\n      evidence CHAR(3) NOT NULL,                    -- GO evidence code\n      FOREIGN KEY (_id) REFERENCES  genes  (_id)\n    )'), ('go_bp_all', 'CREATE TABLE go_bp_all (\n      _id INTEGER NOT NULL,                         -- REFERENCES  genes \n      go_id CHAR(10) NOT NULL,                      -- GO ID\n      evidence CHAR(3) NOT NULL,                    -- GO evidence code\n      FOREIGN KEY (_id) REFERENCES  genes  (_id)\n    )'), ('go_bp_ALT', 'CREATE TABLE go_bp_ALT (\n      gene_go_id INTEGER PRIMARY KEY AUTOINCREMENT,\n      gene_id INTEGER NOT NULL,                     -- REFERENCES  genes _id \n      go_id CHAR(10) NOT NULL,                      -- GO ID\n      evidence CHAR(30) NOT NULL,                   -- GO evidence information\n      FOREIGN KEY (gene_id) REFERENCES  genes  (_id)\n    )')]


  
<br><br> 
The `sqlite_sequence` table is created and initialized automatically whenever a regular table is created if it has a column with the `AUTOINCREMENT` option set.<br>
https://www.sqlite.org/autoinc.html
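
What `sqlite_sequence` holds can be inspected after a couple of inserts. The sketch below uses a made-up in-memory table `samples`; it is an illustration, not the course database.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE samples (sample_id INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT)"
)
demo.executemany("INSERT INTO samples (label) VALUES (?)", [("a",), ("b",)])

# sqlite_sequence stores, per AUTOINCREMENT table, the largest key handed out so far
sequence_rows = demo.execute("SELECT name, seq FROM sqlite_sequence").fetchall()
print(sequence_rows)
demo.close()
```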


### INDEXING

Indexes are lookup tables, like the index of a book.
They are usually created for columns that have unique or less redundant values, and they provide a way to quickly search 
the values.<br>
Indexing creates a copy of the indexed columns together with a link to the location of the additional information.<br> 
The index data is stored in a data structure that allows for fast searching, e.g. a balanced tree (every leaf is at most n nodes away from the root). <br>
Queries that filter or join on an indexed column can be answered by searching the index instead of scanning the whole table


* One important function in Relational Databases is to be able to create indexes on columns in tables  
* These indexes are pre-calculated and stored in the database 
* Indexes should be created on columns that are used in queries and joins   
* They can greatly speed up queries, at the cost of extra storage and somewhat slower inserts and updates

To create an index use the following command:

```sql
CREATE INDEX indexName ON tableName (columnName)
```
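
To see whether a query can actually use an index, SQLite offers `EXPLAIN QUERY PLAN`. The sketch below is a minimal illustration on a throwaway in-memory database; the table `go_demo` and the index `go_demo_go_idx` are made-up names for this example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

# In-memory database, so no real file is touched
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical demo table, loosely modelled on go_bp_ALT
cursor.execute("""
CREATE TABLE go_demo (
    gene_go_id INTEGER PRIMARY KEY AUTOINCREMENT,
    gene_id INTEGER NOT NULL,
    go_id CHAR(10) NOT NULL
)
""")
rows = [(i, f"GO:{i:07d}") for i in range(1000)]
cursor.executemany("INSERT INTO go_demo (gene_id, go_id) VALUES (?, ?)", rows)
connection.commit()

query = "EXPLAIN QUERY PLAN SELECT * FROM go_demo WHERE go_id = ?"

# Without an index the planner has to scan the whole table
plan_before = cursor.execute(query, ("GO:0000500",)).fetchall()[0][-1]

cursor.execute("CREATE INDEX go_demo_go_idx ON go_demo (go_id)")

# With the index in place the planner can search it instead
plan_after = cursor.execute(query, ("GO:0000500",)).fetchall()[0][-1]

print(plan_before)
print(plan_after)
connection.close()
```

The first plan should mention a scan of `go_demo`, the second a search using `go_demo_go_idx`. Note that a column declared `INTEGER PRIMARY KEY` is already SQLite's row key, so lookups on it are fast without an extra index; indexes pay off on other columns used in `WHERE` and `JOIN` clauses, such as `go_id` here.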

In [61]:
sql = '''
CREATE INDEX gene_go_idx 
ON go_bp_ALT (gene_go_id)
'''
cursor.execute(sql)
connection.commit()


##### Check if the new index appears in the `sqlite_master` table 

In [63]:
sql = '''
SELECT name, sql
FROM sqlite_master 
WHERE type= "index" AND name = "gene_go_idx";
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

name	sql
gene_go_idx	CREATE INDEX gene_go_idx 
ON go_bp_ALT (gene_go_id)



#### Remove the index

In [64]:
sql = '''
DROP INDEX gene_go_idx 
'''
cursor.execute(sql)
connection.commit()


##### Check if the index was removed from the `sqlite_master` table 

In [65]:
sql = '''
SELECT name, sql
FROM sqlite_master 
WHERE type= "index" AND name = "gene_go_idx";
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

name	sql



### INSERT - statement

Makes changes to the database table<br>
Adds new data to a table (if the constraints are met)<br>
Constraint examples: 
* The values in a column, or group of columns, designated as the Primary Key must be unique
* A value inserted into a column that has a Foreign Key constraint must already exist in the column it refers to

```sql
INSERT INTO <tablename> (<column1>, <column2>, <column3>) VALUES (value1, value2, value3);
```
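
The two constraint examples above can be tried out in a small sketch. It uses an in-memory database with two hypothetical tables, `genes_demo` and `go_demo`, which only imitate the notebook's tables. One assumption worth stating: by default SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON` has been run on the connection.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Foreign key checks are off by default in SQLite; switch them on
cursor.execute("PRAGMA foreign_keys = ON")

# Hypothetical tables for illustration only
cursor.execute("CREATE TABLE genes_demo (_id INTEGER PRIMARY KEY, symbol TEXT)")
cursor.execute("""
CREATE TABLE go_demo (
    gene_go_id INTEGER PRIMARY KEY AUTOINCREMENT,
    gene_id INTEGER NOT NULL,
    go_id CHAR(10) NOT NULL,
    FOREIGN KEY (gene_id) REFERENCES genes_demo (_id)
)
""")
cursor.execute("INSERT INTO genes_demo (_id, symbol) VALUES (?, ?)", (1956, "EGFR"))
connection.commit()

attempts = [
    # Primary key 1956 is already taken
    ("INSERT INTO genes_demo (_id, symbol) VALUES (?, ?)", (1956, "MYC")),
    # Gene 9999 does not exist in genes_demo
    ("INSERT INTO go_demo (gene_id, go_id) VALUES (?, ?)", (9999, "GO:1234")),
]
errors = []
for sql, values in attempts:
    try:
        cursor.execute(sql, values)
    except sqlite3.IntegrityError as error:
        errors.append(str(error))

# This insert satisfies both constraints
cursor.execute("INSERT INTO go_demo (gene_id, go_id) VALUES (?, ?)", (1956, "GO:1234"))
connection.commit()

gene_count = cursor.execute("SELECT COUNT(*) FROM genes_demo").fetchone()[0]
go_count = cursor.execute("SELECT COUNT(*) FROM go_demo").fetchone()[0]
print(errors)
connection.close()
```

Both rejected inserts raise `sqlite3.IntegrityError`, and neither adds a row.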

##### One simple INSERT command adds 1 row of data at a time into an existing table  

##### The `Connection` object allows us to:
* ##### COMMIT - save the changes 
* ##### ROLLBACK - revert/discard the changes made since the last commit
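
A minimal sketch of the difference between the two calls, again on an in-memory database with a hypothetical `go_demo` table: the first insert is committed and stays, the second is rolled back and disappears.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE go_demo (gene_id INTEGER NOT NULL, go_id CHAR(10) NOT NULL)")

insert_sql = "INSERT INTO go_demo (gene_id, go_id) VALUES (?, ?)"

cursor.execute(insert_sql, (1234, "GO:1234"))
connection.commit()      # the first row is now saved

cursor.execute(insert_sql, (1235, "GO:1235"))
connection.rollback()    # the second row is discarded

saved = cursor.execute("SELECT gene_id, go_id FROM go_demo").fetchall()
print(saved)             # [(1234, 'GO:1234')]
connection.close()
```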

<br>

##### Let's see what is in the table (it should be nothing):

In [None]:
sql = '''
SELECT *
FROM go_bp_ALT;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

<br>

##### Let's try an insert:
```sql
INSERT INTO <tablename> (<column1>, <column2>, <column3>) VALUES (value1, value2, value3);
```

In [None]:
values_list = [1234,"GO:1234","CM_EV"]

sql = '''
INSERT INTO go_bp_ALT (gene_id, go_id, evidence) 
VALUES (?,?,?);
'''
cursor.execute(sql,values_list)
connection.commit()

In [None]:
# lastrowid holds the row id of the last row inserted through this cursor
# Here that is the automatically generated gene_go_id

id_value = cursor.lastrowid
id_value

<br>


##### We have a row in the table!!! And the gene_go_id was automatically generated.

In [None]:
sql = '''
SELECT *
FROM go_bp_ALT ;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

#### You can have a Python "table" structure (list of lists) of insert values and insert them all in one command with `executemany()`; each sublist must have the correct number of values.


In [None]:
values_tbl = [[1235,"GO:1235","CM_EV"], [1236,"GO:1236","CM_EV"], [1236,"GO:1237","CM_EV"]]

sql = '''
INSERT INTO go_bp_ALT (gene_id, go_id, evidence) 
VALUES (?,?,?);
'''
cursor.executemany(sql,values_tbl)
connection.commit()


In [None]:
sql = '''
SELECT *
FROM go_bp_ALT ;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))
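
`executemany()` accepts any iterable of parameter sequences, not only a list of lists. As a hedged sketch, the lines below stand in for a tab-separated file (the data and the `go_demo` table are made up for this example) and are parsed by a generator as they are inserted. `cursor.rowcount` then reports the total number of inserted rows.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("""
CREATE TABLE go_demo (
    gene_go_id INTEGER PRIMARY KEY AUTOINCREMENT,
    gene_id INTEGER NOT NULL,
    go_id CHAR(10) NOT NULL,
    evidence CHAR(30) NOT NULL
)
""")

insert_sql = "INSERT INTO go_demo (gene_id, go_id, evidence) VALUES (?, ?, ?)"

# Single insert: lastrowid holds the generated key
cursor.execute(insert_sql, (1234, "GO:1234", "CM_EV"))
first_id = cursor.lastrowid

# Tab-separated lines standing in for the contents of a file
lines = ["1235\tGO:1235\tCM_EV", "1236\tGO:1236\tCM_EV", "1236\tGO:1237\tCM_EV"]

def parse(text_lines):
    """Yield one (gene_id, go_id, evidence) tuple per line."""
    for line in text_lines:
        gene_id, go_id, evidence = line.split("\t")
        yield int(gene_id), go_id, evidence

cursor.executemany(insert_sql, parse(lines))
inserted = cursor.rowcount   # rows added by executemany
connection.commit()

ids = [row[0] for row in cursor.execute("SELECT gene_go_id FROM go_demo ORDER BY gene_go_id")]
print(first_id, inserted, ids)
connection.close()
```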

#### UPDATE - statement - changes the table rows



Modifies data (already in a table)  in all rows matching the WHERE clause 

```sql
UPDATE table_name 
SET column1 = value1, column2 = value2...., columnN = valueN
WHERE [condition];
```

`UPDATE` changes every row that matches the WHERE clause, which can be one row or many. <br>
If the WHERE clause is too broad or missing, many (or all) rows are updated, whether you intended it or not!

The following statement updates the evidence for all entries for all genes associated with the 2 biological processes 
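
A minimal sketch of such an update, run against a throwaway in-memory table. The table `go_demo` and the new evidence value `EV_NEW` are assumptions for illustration, not values from the notebook's own database.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("""
CREATE TABLE go_demo (
    gene_go_id INTEGER PRIMARY KEY AUTOINCREMENT,
    gene_id INTEGER NOT NULL,
    go_id CHAR(10) NOT NULL,
    evidence CHAR(30) NOT NULL
)
""")
cursor.executemany(
    "INSERT INTO go_demo (gene_id, go_id, evidence) VALUES (?, ?, ?)",
    [
        (1234, "GO:1234", "CM_EV"),
        (1235, "GO:1235", "CM_EV"),
        (1236, "GO:1236", "CM_EV"),
        (1236, "GO:1237", "CM_EV"),
    ],
)
connection.commit()

# Change the evidence only for rows that belong to two GO terms
update_sql = """
UPDATE go_demo
SET evidence = ?
WHERE go_id IN (?, ?);
"""
cursor.execute(update_sql, ("EV_NEW", "GO:1234", "GO:1236"))
updated = cursor.rowcount   # number of rows the UPDATE changed
connection.commit()

rows = cursor.execute("SELECT go_id, evidence FROM go_demo ORDER BY go_id").fetchall()
print(updated)
print(rows)
connection.close()
```

Checking `cursor.rowcount` before calling `commit()` is a cheap way to notice that a WHERE clause matched more rows than expected; `connection.rollback()` can then undo the change.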

#### DELETE - statement - deletes table rows

* MAKES CHANGES TO THE DATA
* Deletion works at the row level: a DELETE removes whole rows, never individual column values (use UPDATE to change a single value)

```sql
DELETE FROM <tablename> WHERE <column> = <value>
```

* The WHERE predicate is the same as for the SELECT statement, that is, it determines which rows will be deleted  



In [None]:
sql = '''
DELETE FROM go_bp_ALT 
WHERE go_id IN ("GO:1234","GO:1236");
'''
cursor.execute(sql)
connection.commit()


In [None]:
sql = '''
SELECT *
FROM go_bp_ALT ;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

```sql
DELETE FROM <tablename>; 
```

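* This deletes all rows of data from a table.
* Preserves table structure (the table still exists)
* Optimized for speed in SQLite: when there is no WHERE clause, the rows are erased without visiting them one by one.
* The table is still listed in `sqlite_master`, so a check such as `DROP TABLE IF EXISTS <table_name>` still finds it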
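
The points above can be verified with a short sketch on an in-memory database (the table `go_demo` is a made-up stand-in). It also shows one consequence of `AUTOINCREMENT`: the counter kept in `sqlite_sequence` is not reset by the delete, so old key values are not reused.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("""
CREATE TABLE go_demo (
    gene_go_id INTEGER PRIMARY KEY AUTOINCREMENT,
    go_id CHAR(10) NOT NULL
)
""")
cursor.executemany("INSERT INTO go_demo (go_id) VALUES (?)", [("GO:1234",), ("GO:1235",)])
connection.commit()

# Remove every row but keep the table itself
cursor.execute("DELETE FROM go_demo")
connection.commit()

remaining = cursor.execute("SELECT COUNT(*) FROM go_demo").fetchone()[0]
still_listed = cursor.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table' AND name = 'go_demo'"
).fetchone()[0]

# A new row continues the AUTOINCREMENT sequence instead of restarting at 1
cursor.execute("INSERT INTO go_demo (go_id) VALUES (?)", ("GO:1236",))
new_id = cursor.lastrowid
connection.commit()

print(remaining, still_listed, new_id)   # 0 1 3
connection.close()
```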


In [None]:
sql = '''
DELETE FROM go_bp_ALT;
'''
cursor.execute(sql)
connection.commit()


In [None]:
sql = '''
SELECT *
FROM go_bp_ALT ;
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

<br>

#### `DROP TABLE` - statement - removes a table (permanently)

In [None]:
sql = '''
DROP TABLE IF EXISTS go_bp_ALT;
'''
cursor.execute(sql)
connection.commit()

In [None]:
sql = '''
SELECT name AS "TABLE NAME"
FROM sqlite_master 
WHERE name LIKE "go_bp%";
'''
cursor.execute(sql)
print(get_header(cursor))
print(get_results(cursor))

In [None]:
# And close()

cursor.close()
connection.close()

#### To remove the database, delete the .sqlite file.
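
The same can be done from Python. This is a small sketch that creates a throwaway database file in a temporary directory (the file name `demo.sqlite` is hypothetical) and then deletes it; the connection should be closed before the file is removed.

```python
import os
import sqlite3
import tempfile

# Throwaway database file in a fresh temporary directory
db_dir = tempfile.mkdtemp()
db_path = os.path.join(db_dir, "demo.sqlite")

connection = sqlite3.connect(db_path)
connection.execute("CREATE TABLE demo (x INTEGER)")
connection.commit()
connection.close()          # close before deleting the file

existed = os.path.exists(db_path)
os.remove(db_path)          # the database is gone once its file is deleted
exists_after = os.path.exists(db_path)
os.rmdir(db_dir)            # clean up the now-empty temporary directory

print(existed, exists_after)   # True False
```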