# Homework 5 - part 1

## Create chebi database

# Download database files

- Download [ChEBI SQL file](http://ftp.ebi.ac.uk/pub/databases/chebi/generic_dumps/mysql_create_tables.sql)
- Download the ***3star*** version of each of the following files from [here](http://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/) (all files except references and structures):
    1. chemical_data
    2. comments
    3. compound_origins
    4. compounds
    5. database_accession
    6. names
    7. relation

# Create and populate database

## Create database, user and assign rights

- Create a new database *chebi* as root.
- Create a new user *chebi_user* (password: *chebi_password*) as root and grant it all rights on *chebi*. Don't forget to flush the privileges.

```sql
CREATE DATABASE chebi;
SHOW DATABASES like 'chebi';
CREATE USER IF NOT EXISTS 'chebi_user'@'localhost' IDENTIFIED BY 'chebi_password';
SELECT User FROM mysql.user WHERE User LIKE 'chebi_user';
GRANT ALL ON `chebi`.* TO 'chebi_user'@'localhost';
FLUSH PRIVILEGES;
```

- Create the database structure from the downloaded SQL file:

```bash
mysql -u chebi_user -pchebi_password chebi < mysql_create_tables.sql
```


## Populate database

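- Import the data with the pandas `to_sql` function. The order matters: a table that is referenced by foreign keys (see the SQL file) has to be filled before the tables that refer to it.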
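A minimal sketch of how such an import order could be derived, using the standard-library `graphlib` module (Python 3.9+). The `dependencies` map is a hypothetical one that assumes every other table refers to `compounds`; the real constraints have to be read from `mysql_create_tables.sql`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each table lists the tables it references.
# Assumption: every other table points at `compounds` through a foreign key;
# check mysql_create_tables.sql for the real constraints.
dependencies = {
    "compounds": set(),
    "chemical_data": {"compounds"},
    "comments": {"compounds"},
    "compound_origins": {"compounds"},
    "database_accession": {"compounds"},
    "names": {"compounds"},
    "relation": {"compounds"},
}

# static_order() yields each table only after all tables it depends on.
load_order = list(TopologicalSorter(dependencies).static_order())
print(load_order[0])  # compounds
```

Under this assumption only the position of `compounds` is fixed; the remaining tables can be loaded in any order.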

Create a SQLAlchemy engine for the **chebi** database:

In [1]:
from sqlalchemy import create_engine
import pandas as pd

In [2]:
path = "C:\\Users\\kriti\\Desktop\\BioDB\\2-kriti\\Homework\\Day_5\\chebi\\"
engine = create_engine('mysql+pymysql://chebi_user:chebi_password@localhost/chebi')

Insert the data from the TSV files with pandas. Tip: set the primary key (see the table definition) as the index of the DataFrame. Don't replace the already existing tables; append to them.
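The tip above can be tried out on a small scale first. The sketch below is only an illustration: an in-memory SQLite database stands in for the MySQL `chebi` database, and the table `demo_names` with its three columns is a made-up, reduced version of a real table. It shows that `if_exists='append'` inserts into a table that already exists and that the DataFrame index named `ID` is written to the `ID` column.

```python
import sqlite3

import pandas as pd

# Assumption: in-memory SQLite replaces the MySQL `chebi` database here.
conn = sqlite3.connect(":memory:")

# The table exists before the import, as it does after running the SQL file.
conn.execute(
    "CREATE TABLE demo_names ("
    "ID INTEGER PRIMARY KEY, COMPOUND_ID INTEGER NOT NULL, NAME TEXT NOT NULL)"
)

# The primary key becomes the DataFrame index, as suggested in the tip.
frame = pd.DataFrame(
    {
        "ID": [2, 3],
        "COMPOUND_ID": [18357, 18357],
        "NAME": ["Noradrenaline", "L-Noradrenaline"],
    }
).set_index("ID")

# 'append' keeps the existing table definition; 'replace' would drop it.
frame.to_sql("demo_names", conn, if_exists="append")

rows = conn.execute("SELECT ID, NAME FROM demo_names ORDER BY ID").fetchall()
print(rows)
```

With `if_exists='replace'` pandas would drop the table and recreate it from the DataFrame dtypes, which would lose the keys and constraints defined in the SQL file.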

In [4]:
chemical_data = None
comments = None
compound_origins = None
compounds = None
database_accession = None
names = None
relation = None

# Preprocess TSV files

- Check for null values and fill them where the column is declared `NOT NULL` (see the .sql file).
- Reorder the columns to match the column order of the SQL tables (see the .sql file).
- Check that the column names match those in the .sql file.
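The three checks above could be bundled into one helper. This is a sketch, not part of the original notebook: the function name `align_with_table`, the column order passed to it, and the fill value are assumptions, and the real column lists must be taken from the .sql file.

```python
import pandas as pd


def align_with_table(frame, table_columns, fill_values):
    """Return a copy of `frame` with the given column order and NaNs filled.

    table_columns: column names in the order used by the SQL table (assumed).
    fill_values:   mapping of NOT NULL column name -> replacement value.
    """
    # Check the column names against the table definition.
    missing = [col for col in table_columns if col not in frame.columns]
    if missing:
        raise KeyError(f"columns missing from DataFrame: {missing}")
    # Reorder the columns, then fill nulls in the NOT NULL columns.
    return frame[table_columns].fillna(value=fill_values)


# Two rows shaped like chemical_data, with the columns deliberately shuffled.
raw = pd.DataFrame(
    {
        "TYPE": ["MASS", "FORMULA"],
        "CHEMICAL_DATA": [None, "C10H16"],
        "COMPOUND_ID": [33517, 7],
        "SOURCE": ["ChEBI", "KEGG COMPOUND"],
    },
    index=pd.Index([32968, 6], name="ID"),
)

clean = align_with_table(
    raw,
    ["COMPOUND_ID", "SOURCE", "TYPE", "CHEMICAL_DATA"],  # hypothetical order
    {"CHEMICAL_DATA": "None"},
)
print(clean.isnull().sum().sum())  # 0
```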

In [4]:
# compounds = pd.read_csv(path+'compounds_3star.tsv.gz', sep='\t', index_col='ID', low_memory=False)
# print('Rows X Columns : ',compounds.shape, '\nRows added:')
# compounds.to_sql('compounds', engine, if_exists='append')
# #compounds

Rows X Columns :  (77678, 9) 
Rows added:


77678

In [6]:
chemical_data = pd.read_csv(path+'chemical_data_3star.tsv', sep='\t', index_col='ID')
chemical_data.fillna(value={'CHEMICAL_DATA' : 'None'}, inplace=True)
# print('Rows X Columns : ',chemical_data.shape, '\nRows added:')
# chemical_data.to_sql('chemical_data', engine, if_exists='append')
chemical_data

| ID | COMPOUND_ID | SOURCE | TYPE | CHEMICAL_DATA |
|---|---|---|---|---|
| 1 | 18357 | KEGG COMPOUND | FORMULA | C8H11NO3 |
| 3 | 28234 | KEGG COMPOUND | FORMULA | C13H12O2 |
| 4 | 15399 | KEGG COMPOUND | FORMULA | C10H14O |
| 6 | 7 | KEGG COMPOUND | FORMULA | C10H16 |
| 7 | 8 | KEGG COMPOUND | FORMULA | C15H22O |
| ... | ... | ... | ... | ... |
| 2661024 | 85516 | ChEBI | MONOISOTOPIC MASS | 6829.19599 |
| 2661025 | 85516 | ChEBI | MASS | 6832.506 |
| 2661026 | 85516 | ChEBI | FORMULA | C217H269N92O126P21 |
| 2661035 | 6618 | ChEBI | FORMULA | C27H35N5O7S |


In [16]:
chemical_data.isnull().sum(), chemical_data.isna().sum()

(COMPOUND_ID      0
 SOURCE           0
 TYPE             0
 CHEMICAL_DATA    1
 dtype: int64,
 COMPOUND_ID      0
 SOURCE           0
 TYPE             0
 CHEMICAL_DATA    1
 dtype: int64)

In [13]:
# Show the rows whose CHEMICAL_DATA value is missing
chemical_data[chemical_data['CHEMICAL_DATA'].isnull()]

| ID | COMPOUND_ID | SOURCE | TYPE | CHEMICAL_DATA |
|---|---|---|---|---|
| 32968 | 33517 | ChEBI | MASS | NaN |


In [17]:
kiki = chemical_data.fillna(value={'CHEMICAL_DATA' : 'None'})
kiki.isnull().sum()

COMPOUND_ID      0
SOURCE           0
TYPE             0
CHEMICAL_DATA    0
dtype: int64

In [18]:
comments = pd.read_csv(path+'comments_3star.tsv', sep='\t', index_col='ID')
# print('Rows X Columns : ',comments.shape, '\nRows added:')
# comments.to_sql('comments', engine, if_exists='append')
comments

| ID | COMPOUND_ID | CREATED_ON | DATATYPE_ID | DATATYPE | TEXT |
|---|---|---|---|---|---|
| 8 | 15561 | 2007-04-24 | 12789 | General | Name encompasses both (1R,2S) and (1S,2R) isom... |
| 12 | 15576 | 2003-12-23 | 264 | CompoundName | 8,25-dien name wrong in IUBMB list (now correc... |
| 14 | 15635 | 2004-01-09 | 15635 | General | The natural product is the 6S stereoisomer. |
| 15 | 15636 | 2004-01-09 | 15636 | General | The naturally occurring compound is the 6R ste... |
| 16 | 15638 | 2004-01-09 | 15638 | General | The naturally occurring compound is the tetrah... |
| ... | ... | ... | ... | ... | ... |
| 5521 | 144708 | 2022-02-04 | 1103551 | CompoundName | ambiguous synonym |
| 5522 | 189677 | 2022-02-10 | 189677 | General | For synthesis see E. V. Sukhova et al., J. Car... |
| 5523 | 189873 | 2022-02-25 | 189873 | General | Please note that this is the entry for materia... |
| 5524 | 190008 | 2022-02-28 | 190008 | General | Please note that this is the entry for materia... |


In [19]:
comments.isnull().sum()

COMPOUND_ID    0
CREATED_ON     0
DATATYPE_ID    0
DATATYPE       0
TEXT           0
dtype: int64

In [21]:
compound_origins = pd.read_csv(path+'compound_origins_3star.tsv', sep='\t', encoding= 'unicode_escape').set_index('ID')
# print('Rows X Columns : ',compound_origins.shape, '\nRows added:')
# compound_origins.to_sql('compound_origins', engine, if_exists='append')
compound_origins

| ID | STATUS | CHEBI_ACCESSION | SOURCE | PARENT_ID | NAME | DEFINITION | MODIFIED_ON | CREATED_BY | STAR |
|---|---|---|---|---|---|---|---|---|---|
| 65354 | Abacopteris penangiana | IPNI:17367310-1 | rhizome | BTO:0001181 | NaN | NaN | PubMed Id | 16499328 | NaN |
| 65355 | Abacopteris penangiana | IPNI:17367310-1 | rhizome | BTO:0001181 | NaN | NaN | PubMed Id | 16499328 | NaN |
| 65356 | Abacopteris penangiana | IPNI:17367310-1 | rhizome | BTO:0001181 | NaN | NaN | PubMed Id | 16499328 | NaN |
| 65357 | Abacopteris penangiana | IPNI:17367310-1 | rhizome | BTO:0001181 | NaN | NaN | PubMed Id | 16499328 | NaN |
| 65358 | Erythrina abyssinica | NCBI:txid1237573 | stem | BTO:0001300 | NaN | NaN | PubMed Id | 18484536 | Previous component: stem bark; |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7051 | Nicotiana tabacum | NCBI:txid4097 | NaN | NaN | NaN | NaN | PubMed Id | 18136963 | NaN |
| 189660 | Shinella sp. HZN7 | NCBI:txid879274 | NaN | NaN | NaN | NaN | PubMed Id | 27568381 | NaN |
| 190009 | Streptomyces himastatinicus ATCC 53653 | NCBI:txid457427 | NaN | NaN | NaN | NaN | PubMed Id | 2211363 | NaN |
| 29702 | Streptomyces coelicolor | NCBI:txid1902 | NaN | NaN | NaN | NaN | PubMed Id | 21222119 | NaN |


In [22]:
compound_origins.isnull().sum()

STATUS                 0
CHEBI_ACCESSION       35
SOURCE             12398
PARENT_ID          12453
NAME               16559
DEFINITION         17541
MODIFIED_ON            0
CREATED_BY             0
STAR               12994
dtype: int64

In [23]:
compound_origins.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17541 entries, 65354 to 29702
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   STATUS           17541 non-null  object 
 1   CHEBI_ACCESSION  17506 non-null  object 
 2   SOURCE           5143 non-null   object 
 3   PARENT_ID        5088 non-null   object 
 4   NAME             982 non-null    object 
 5   DEFINITION       0 non-null      float64
 6   MODIFIED_ON      17541 non-null  object 
 7   CREATED_BY       17541 non-null  object 
 8   STAR             4547 non-null   object 
dtypes: float64(1), object(8)
memory usage: 1.3+ MB


In [24]:
database_accession = pd.read_csv(path+'database_accession_3star.tsv', sep='\t', index_col='ID')
database_accession.fillna(value={'ACCESSION_NUMBER' : 'None'}, inplace=True)
# print('Rows X Columns : ',database_accession.shape, '\nRows added:')
# database_accession.to_sql('database_accession', engine, if_exists='append')
database_accession

| ID | COMPOUND_ID | SOURCE | TYPE | ACCESSION_NUMBER |
|---|---|---|---|---|
| 15233 | 27531 | KEGG COMPOUND | KEGG COMPOUND accession | C06095 |
| 15256 | 67986 | KEGG COMPOUND | KEGG COMPOUND accession | C08945 |
| 15257 | 67986 | KEGG COMPOUND | CAS Registry Number | 52286-58-5 |
| 15296 | 5381 | KEGG COMPOUND | KEGG COMPOUND accession | C09753 |
| 15297 | 5381 | KEGG COMPOUND | CAS Registry Number | 87440-56-0 |
| ... | ... | ... | ... | ... |
| 1104554 | 17826 | Europe PMC | PubMed citation | 23891734 |
| 1104555 | 17826 | Europe PMC | PubMed citation | 18701095 |
| 1104556 | 17826 | Europe PMC | PubMed citation | 19095002 |
| 1104557 | 17826 | Europe PMC | PubMed citation | 27327130 |


In [25]:
database_accession.isnull().sum()

COMPOUND_ID         0
SOURCE              0
TYPE                0
ACCESSION_NUMBER    1
dtype: int64

In [26]:
kiki = database_accession.fillna(value={'ACCESSION_NUMBER' : 'None'})
kiki.isnull().sum()

COMPOUND_ID         0
SOURCE              0
TYPE                0
ACCESSION_NUMBER    0
dtype: int64

In [27]:
names = pd.read_csv(path+'names_3star.tsv.gz', sep='\t', index_col='ID')
names.fillna(value={'NAME' : 'None'}, inplace=True)
# print('Rows X Columns : ',names.shape, '\nRows added:')
# names.to_sql('names', engine, if_exists='append')
names

| ID | COMPOUND_ID | TYPE | SOURCE | NAME | ADAPTED | LANGUAGE |
|---|---|---|---|---|---|---|
| 2 | 18357 | SYNONYM | KEGG COMPOUND | Noradrenaline | F | en |
| 3 | 18357 | SYNONYM | KEGG COMPOUND | L-Noradrenaline | F | en |
| 4 | 18357 | SYNONYM | KEGG COMPOUND | Norepinephrine | F | en |
| 5 | 18357 | SYNONYM | KEGG COMPOUND | Arterenol | F | en |
| 9 | 28234 | SYNONYM | KEGG COMPOUND | (+)-(3S,4R)-cis-3,4-Dihydroxy-3,4-dihydrofluorene | F | en |
| ... | ... | ... | ... | ... | ... | ... |
| 1104302 | 28024 | SYNONYM | ChEBI | acide cyanique | F | fr |
| 1104303 | 28024 | SYNONYM | ChEBI | acido cianico | F | es |
| 1104304 | 18421 | SYNONYM | ChEBI | superoxido | F | es |
| 1104305 | 18421 | SYNONYM | ChEBI | hiperoxido | F | es |


In [28]:
names.isnull().sum()

COMPOUND_ID    0
TYPE           0
SOURCE         0
NAME           1
ADAPTED        0
LANGUAGE       0
dtype: int64

In [29]:
kiki = names.fillna(value={'NAME' : 'None'})
kiki.isnull().sum()

COMPOUND_ID    0
TYPE           0
SOURCE         0
NAME           0
ADAPTED        0
LANGUAGE       0
dtype: int64

In [30]:
relation = pd.read_csv(path+'relation_3star.tsv', sep='\t', index_col='ID')
# print('Rows X Columns : ',relation.shape, '\nRows added:')
# relation.to_sql('relation', engine, if_exists='append')
relation

| ID | TYPE | INIT_ID | FINAL_ID | STATUS |
|---|---|---|---|---|
| 3 | is_a | 24431 | 23367 | C |
| 18 | is_a | 23855 | 22663 | E |
| 19 | is_a | 23855 | 23315 | E |
| 20 | is_a | 23855 | 23514 | C |
| 22 | is_a | 23855 | 24322 | E |
| ... | ... | ... | ... | ... |
| 1127895 | has_role | 35610 | 6618 | C |
| 1127896 | has_role | 35480 | 6618 | C |
| 1127898 | has_role | 77746 | 6618 | C |
| 1127899 | has_role | 55322 | 6618 | C |


In [31]:
relation.isnull().sum()

TYPE        0
INIT_ID     0
FINAL_ID    0
STATUS      0
dtype: int64

7. Use the [ER model](http://ftp.ebi.ac.uk/pub/databases/chebi/DataModel.png) and the [SQL database model](http://ftp.ebi.ac.uk/pub/databases/chebi/generic_dumps/mysql_create_tables.sql) to create your SQLAlchemy model.

8. Design the [ChEBI](https://www.ebi.ac.uk/chebi/) database as a SQLAlchemy model.

9. Create the same example queries with SQLAlchemy as in the exercises.
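
As a starting point for tasks 7 to 9, here is a minimal sketch of what two mapped tables and one query might look like. It is an illustration under assumptions, not the expected solution: the classes `Compound` and `Name` carry only a reduced, assumed subset of columns, the column types are guesses, and an in-memory SQLite database replaces the MySQL `chebi` database. The sample rows are taken from the `names` output above.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, Text, create_engine, select
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class Compound(Base):
    """Reduced, assumed mapping of the `compounds` table."""
    __tablename__ = "compounds"
    id = Column(Integer, primary_key=True)
    chebi_accession = Column(String(30), nullable=False)
    names = relationship("Name", back_populates="compound")


class Name(Base):
    """Reduced, assumed mapping of the `names` table."""
    __tablename__ = "names"
    id = Column(Integer, primary_key=True)
    compound_id = Column(Integer, ForeignKey("compounds.id"), nullable=False)
    name = Column(Text, nullable=False)
    type = Column(Text, nullable=False)
    compound = relationship("Compound", back_populates="names")


# Assumption: SQLite in memory instead of mysql+pymysql://.../chebi
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(
        Compound(
            id=18357,
            chebi_accession="CHEBI:18357",
            names=[
                Name(id=2, name="Noradrenaline", type="SYNONYM"),
                Name(id=4, name="Norepinephrine", type="SYNONYM"),
            ],
        )
    )
    session.commit()

    # Example query: all names recorded for one compound.
    stmt = (
        select(Name.name)
        .join(Name.compound)
        .where(Compound.id == 18357)
        .order_by(Name.id)
    )
    synonyms = session.execute(stmt).scalars().all()

print(synonyms)  # ['Noradrenaline', 'Norepinephrine']
```

Against the real database, `create_all` would be skipped because the tables already exist, and the engine would be the `mysql+pymysql` engine created earlier.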