# Day5 exercise 1

- 18.03.2022
- Kriti Amin

# MySQL Exercise

In this exercise we want to create UniProt tables inside our **biodb** database.

## Create database, user and assign rights

Execute the following statements line by line in the MySQL CLI as root:

```sql
DROP DATABASE IF EXISTS biodb;
CREATE DATABASE biodb;
SHOW DATABASES like 'biodb';
CREATE USER IF NOT EXISTS 'biodb_user'@'localhost' IDENTIFIED BY 'biodb_password';
SELECT User FROM mysql.user WHERE User LIKE 'biodb_user';
GRANT ALL ON `biodb`.* TO 'biodb_user'@'localhost';
FLUSH PRIVILEGES;
```

## Creating tables

The following SQL statements define the tables:
1. uniprot
2. uniprot_function
3. uniprot_organism

```sql
CREATE TABLE `uniprot` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `accession` varchar(20) DEFAULT NULL,
  `name` varchar(100) NOT NULL,
  `recommended_name` varchar(255) DEFAULT NULL,
  `taxid` int(11) NOT NULL,
  `function_id` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `name` (`name`),
  UNIQUE KEY `accession` (`accession`),
  KEY `function_id` (`function_id`),
  KEY `ix_uniprot_taxid` (`taxid`),
  CONSTRAINT `uniprot_ibfk_1` FOREIGN KEY (`taxid`) REFERENCES `uniprot_organism` (`taxid`),
  CONSTRAINT `uniprot_ibfk_2` FOREIGN KEY (`function_id`) REFERENCES `uniprot_function` (`id`)
);


CREATE TABLE `uniprot_function` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `description` text DEFAULT NULL,
  PRIMARY KEY (`id`)
);

CREATE TABLE `uniprot_organism` (
  `taxid` int(11) NOT NULL AUTO_INCREMENT,
  `scientific_name` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`taxid`)
);
```

&#127947; Check the [documentation](https://dev.mysql.com/doc/refman/8.0/en/create-table.html) and be prepared to explain each part of the SQL statement. 

&#127947; Try to create the *uniprot* table with the first SQL statement. Which error message do you see, and why?

### Referential integrity and Key checks

Referential integrity refers to the consistency of relationships between tables. The primary key of a table uniquely identifies each of its rows. When a column in another table refers to that primary key, the referring column is called a foreign key. Referential integrity means that every foreign key value must match an existing primary key value in the referenced table (or be NULL if the column allows it).
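The effect of a foreign key can be tried out without a MySQL server. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in (an assumption for illustration only: SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`, and its error messages differ from MySQL's). The two tables are reduced versions of `uniprot_organism` and `uniprot`, and `try_insert_protein` is a hypothetical helper.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("CREATE TABLE uniprot_organism (taxid INTEGER PRIMARY KEY, scientific_name TEXT)")
conn.execute(
    "CREATE TABLE uniprot ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " taxid INTEGER NOT NULL REFERENCES uniprot_organism (taxid))"
)


def try_insert_protein(connection, name, taxid):
    """Return True if the row was inserted, False if a constraint rejected it."""
    try:
        connection.execute("INSERT INTO uniprot (name, taxid) VALUES (?, ?)", (name, taxid))
        return True
    except sqlite3.IntegrityError:
        return False


# No organism with taxid 9606 exists yet, so the foreign key rejects the row.
rejected_first = not try_insert_protein(conn, "2A5D_HUMAN", 9606)

# Once the referenced row exists, the same insert is accepted.
conn.execute("INSERT INTO uniprot_organism VALUES (9606, 'Homo sapiens')")
accepted_second = try_insert_protein(conn, "2A5D_HUMAN", 9606)

print(rejected_first, accepted_second)
```

The same principle applies to the MySQL tables in this exercise: a row in `uniprot` can only point to a `taxid` or `function_id` that already exists in the referenced table.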

&#127947; Now try to create all 3 tables, but this time execute the following beforehand

```sql 
SET FOREIGN_KEY_CHECKS=0;
```

and the following afterwards

```sql
SET FOREIGN_KEY_CHECKS=1;
```

Why does this work now?

### Insert data

Data for the first 3 human proteins are stored in:

1. [uniprot.csv](../uniprot/exercise/csv/uniprot.csv)
2. [uniprot_function.csv](../uniprot/exercise/csv/uniprot_function.csv)
3. [uniprot_organism.csv](../uniprot/exercise/csv/uniprot_organism.csv)

&#127947; Create SQL statements to insert the data from all 3 files (use the [MySQL insert reference](https://dev.mysql.com/doc/refman/8.0/en/insert.html)). 

&#127947; Why does the insert fail when you load the data in the order of files 1 to 3?

&#127947; Reverse the order and insert again. Why does this work?
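One way to produce the INSERT statements is to generate them from the CSV header. The sketch below is only an illustration: `build_insert` is a hypothetical helper, and the inline CSV text is assumed to mirror the layout of `uniprot_organism.csv` (header `taxid,scientific_name`). It uses `%s` placeholders, which is the parameter style pymysql expects.

```python
import csv
import io


def build_insert(table, columns):
    """Build a parameterized INSERT statement with pymysql-style %s placeholders."""
    column_list = ", ".join(f"`{c}`" for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO `{table}` ({column_list}) VALUES ({placeholders})"


# Inline sample standing in for uniprot_organism.csv (assumed header and content).
organism_csv = "taxid,scientific_name\n9606,Homo sapiens\n"

reader = csv.DictReader(io.StringIO(organism_csv))
rows = list(reader)
statement = build_insert("uniprot_organism", reader.fieldnames)
parameters = [tuple(row[c] for c in reader.fieldnames) for row in rows]

print(statement)
print(parameters)

# With an open pymysql cursor the rows could then be inserted with:
#     cursor.executemany(statement, parameters)
```

Passing the values separately from the statement lets the driver handle quoting, so text containing quotes or commas does not break the SQL.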

## Execute SQL and query with pymysql
Use [pymysql](https://pypi.org/project/PyMySQL/)

In [1]:
from getpass import getpass
import pymysql

In [2]:
biodb_user_password = getpass(prompt='MySQL biodb_user password: ')

MySQL biodb_user password: ········


In [27]:
connection = pymysql.connect(host='localhost',
                             user='biodb_user',
                             password=biodb_user_password,
                             database='biodb',
                             charset='utf8mb4')
cursor = connection.cursor()
cursor_dict = connection.cursor(cursor=pymysql.cursors.DictCursor)

&#127947; Create a cursor and a function `get_tables` that returns the names of all tables

In [28]:
def get_tables(cursor) -> list:
    cursor.execute("show tables")
    tables = cursor.fetchall()
    return [i[0] for i in tables]

In [29]:
get_tables(cursor)

['uniprot', 'uniprot_function', 'uniprot_organism']

In [30]:
for i in get_tables(cursor):
    cursor.execute(f"select count(*) from {i}")
    print(f'{i}\t:\t{cursor.fetchall()[0][0]}')

uniprot	:	3
uniprot_function	:	2
uniprot_organism	:	1


&#127947; Create a function `clean_database` that uses `get_tables` to iterate over the list of tables and deletes all entries

In [7]:
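def clean_database(cursor):
    """Delete all rows from every table in the database."""
    # `uniprot` is listed first by get_tables here, so the referencing rows
    # are deleted before the rows they point to in the other two tables.
    for i in get_tables(cursor):
        cursor.execute(f'delete from {i}')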

In [8]:
clean_database(cursor)

&#127947; Iterate over the list of tables (use `get_tables`) and print the number of entries as integer.

In [9]:
for i in get_tables(cursor):
    cursor.execute(f"select count(*) from {i}")
    print(f'{i}\t:\t{cursor.fetchall()[0][0]}')

uniprot	:	0
uniprot_function	:	0
uniprot_organism	:	0


In [10]:
connection.commit()

## MySQL and pandas

&#127947; Create [SQLAlchemy engine](https://docs.sqlalchemy.org/en/13/core/connections.html) for the **biodb** database

Use the following schema for the connection string:
> mysql+pymysql://user:password@host/database
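A connection string following this schema can be assembled in code. The sketch below is a hedged illustration, not part of the exercise solution: `build_mysql_url` is a hypothetical helper, and it escapes user and password with `urllib.parse.quote_plus` so that characters such as `@` or `/` do not break the URL.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, database):
    """Assemble a mysql+pymysql connection string, escaping user and password."""
    return f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


# Credentials from the setup section of this exercise.
url = build_mysql_url("biodb_user", "biodb_password", "localhost", "biodb")
print(url)

# A made-up password with special characters, to show the escaping.
tricky_url = build_mysql_url("biodb_user", "p@ss/word", "localhost", "biodb")
print(tricky_url)
```

The resulting string can be passed to `create_engine` unchanged.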

In [11]:
from sqlalchemy import create_engine
import pandas as pd

In [12]:
path = "C:\\Users\\kriti\\Desktop\\BioDB\\biodb-2022-teaching\\material\\uniprot\\exercise\\csv\\"
# Reuse the password entered via getpass instead of hard-coding it in the URL
engine = create_engine(f'mysql+pymysql://biodb_user:{biodb_user_password}@localhost/biodb')

&#127947; Insert the data from the csv files with pandas. Call `clean_database` first. Tip: set the primary key (see table definition) as the index of the DataFrame. Don't replace the already existing tables.

In [13]:
uniprot = pd.read_csv(path+"uniprot.csv").set_index('id')
uniprot

| id | accession | name | recommended_name | taxid | function_id |
|----|-----------|------|------------------|-------|-------------|
| 82 | Q14738 | 2A5D_HUMAN | Serine/threonine-protein phosphatase 2A 56 kDa... | 9606 | 10 |
| 87 | Q16537 | 2A5E_HUMAN | Serine/threonine-protein phosphatase 2A 56 kDa... | 9606 | 10 |
| 91 | Q13362 | 2A5G_HUMAN | Serine/threonine-protein phosphatase 2A 56 kDa... | 9606 | 14 |


In [14]:
uniprot_function = pd.read_csv(path+"uniprot_function.csv").set_index('id')
uniprot_function

| id | description |
|----|-------------|
| 10 | The B regulatory subunit might modulate substr... |
| 14 | The B regulatory subunit might modulate substr... |


In [15]:
uniprot_organism = pd.read_csv(path+"uniprot_organism.csv").set_index('taxid')
uniprot_organism

| taxid | scientific_name |
|-------|-----------------|
| 9606 | Homo sapiens |


In [16]:
uniprot_organism.to_sql('uniprot_organism', engine, if_exists='append')

1

In [17]:
uniprot_function.to_sql('uniprot_function', engine, if_exists='append')

2

In [18]:
uniprot.to_sql('uniprot', engine, if_exists='append')

3

## Queries with pymysql and pandas

&#127947; Write an SQL statement to get

1. uniprot.recommended_name
2. uniprot_organism.scientific_name
3. uniprot_function.description

for the entry with uniprot.accession = 'Q14738'. Use pymysql.

In [26]:
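st1 = """SELECT uniprot.recommended_name, uniprot_organism.scientific_name, uniprot_function.description, uniprot.accession
FROM uniprot
INNER JOIN uniprot_organism ON uniprot.taxid = uniprot_organism.taxid
INNER JOIN uniprot_function ON uniprot.function_id = uniprot_function.id
WHERE uniprot.accession = 'Q14738'"""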

In [None]:
cursor_dict.execute(st1)
cursor_dict.fetchall()
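Join conditions matter for a query over several tables: listing tables separated by commas without linking them returns every combination of rows. The sketch below illustrates this with Python's built-in `sqlite3` as a stand-in for MySQL (an assumption for illustration). The function descriptions are placeholders, not the real UniProt texts, and SQLite uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for the MySQL biodb database
conn.executescript("""
CREATE TABLE uniprot_organism (taxid INTEGER PRIMARY KEY, scientific_name TEXT);
CREATE TABLE uniprot_function (id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE uniprot (id INTEGER PRIMARY KEY, accession TEXT, name TEXT,
                      taxid INTEGER, function_id INTEGER);
INSERT INTO uniprot_organism VALUES (9606, 'Homo sapiens');
-- Placeholder descriptions, not the real UniProt texts
INSERT INTO uniprot_function VALUES (10, 'placeholder description 10'),
                                    (14, 'placeholder description 14');
INSERT INTO uniprot VALUES (82, 'Q14738', '2A5D_HUMAN', 9606, 10),
                           (87, 'Q16537', '2A5E_HUMAN', 9606, 10),
                           (91, 'Q13362', '2A5G_HUMAN', 9606, 14);
""")

select = "SELECT u.name, o.scientific_name, f.description "

# Comma-separated tables without join conditions: every combination of rows.
cross_rows = conn.execute(
    select + "FROM uniprot u, uniprot_organism o, uniprot_function f WHERE u.accession = ?",
    ("Q14738",),
).fetchall()

# Explicit join conditions: only rows that belong together are combined.
joined_rows = conn.execute(
    select
    + "FROM uniprot u"
    + " JOIN uniprot_organism o ON u.taxid = o.taxid"
    + " JOIN uniprot_function f ON u.function_id = f.id"
    + " WHERE u.accession = ?",
    ("Q14738",),
).fetchall()

print(len(cross_rows), len(joined_rows))
print(joined_rows)
```

Without the join conditions the protein is paired with both function rows, so one of the two results carries a description that does not belong to it.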

&#127947; Do the same with pandas

In [20]:
uni_func = uniprot.set_index('function_id').join(uniprot_function, how='inner')
uni_func

|    | accession | name | recommended_name | taxid | description |
|----|-----------|------|------------------|-------|-------------|
| 10 | Q14738 | 2A5D_HUMAN | Serine/threonine-protein phosphatase 2A 56 kDa... | 9606 | The B regulatory subunit might modulate substr... |
| 10 | Q16537 | 2A5E_HUMAN | Serine/threonine-protein phosphatase 2A 56 kDa... | 9606 | The B regulatory subunit might modulate substr... |
| 14 | Q13362 | 2A5G_HUMAN | Serine/threonine-protein phosphatase 2A 56 kDa... | 9606 | The B regulatory subunit might modulate substr... |


In [21]:
uni_func_org = uni_func.set_index('taxid').join(uniprot_organism, how='inner')
uni_func_org

| taxid | accession | name | recommended_name | description | scientific_name |
|-------|-----------|------|------------------|-------------|-----------------|
| 9606 | Q14738 | 2A5D_HUMAN | Serine/threonine-protein phosphatase 2A 56 kDa... | The B regulatory subunit might modulate substr... | Homo sapiens |
| 9606 | Q16537 | 2A5E_HUMAN | Serine/threonine-protein phosphatase 2A 56 kDa... | The B regulatory subunit might modulate substr... | Homo sapiens |
| 9606 | Q13362 | 2A5G_HUMAN | Serine/threonine-protein phosphatase 2A 56 kDa... | The B regulatory subunit might modulate substr... | Homo sapiens |


In [22]:
uni_func_org[uni_func_org['accession'] == 'Q14738']

| taxid | accession | name | recommended_name | description | scientific_name |
|-------|-----------|------|------------------|-------------|-----------------|
| 9606 | Q14738 | 2A5D_HUMAN | Serine/threonine-protein phosphatase 2A 56 kDa... | The B regulatory subunit might modulate substr... | Homo sapiens |


&#127947; Close the pymysql connection

In [33]:
connection.close()
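Closing can also be tied to a `with` block so that it happens even when an error occurs in between. The following is a minimal sketch using `contextlib.closing`, which works with any object that has a `close()` method. Here `sqlite3` stands in for `pymysql.connect(...)` (an assumption for illustration, so the example runs without a MySQL server).

```python
import sqlite3
from contextlib import closing

# sqlite3 stands in for pymysql.connect(...); closing() only needs a .close() method.
with closing(sqlite3.connect(":memory:")) as connection:
    with closing(connection.cursor()) as cursor:
        cursor.execute("SELECT 1")
        value = cursor.fetchone()[0]

# After the with block the connection is closed, so further use raises an error.
try:
    connection.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(value, still_open)
```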