# Homework week5

**author:** Mehmet Can Ay <br>
2023-11-23

In [1]:
## uncomment this if needed
#!pip install -r requirements.txt

## Import

In [2]:
import os
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

## Getting the Data

### Pathway Table

In [3]:
list_of_table: list[pd.DataFrame] = pd.read_html("https://www.wikipathways.org/browse/table.html")
pathways: pd.DataFrame = list_of_table[0]
pathways.rename(columns={0: "Pathway Title", 
                      1: "ID", 
                      2: "Organism", 
                      3: "Last Edited", 
                      4: "Communities", 
                      5: "Pathway Terms", 
                      6: "Disease Terms", 
                      7: "Cell Types"}, inplace=True)

### Pathway Components for Homo Sapiens

The Homo sapiens database was downloaded from the [Downloads](https://data.wikipathways.org/current/gpml/) section of WikiPathways. Unfortunately, the files do not follow the usual conventions of an .xml file, so reading them is troublesome.

In [4]:
# An empty list to hold dataframes
dfs: list[pd.DataFrame] = []

# A path to database folder
xml_folder_path: str = "./data/wikipathways/"

# Listing every file in the folder (no extension filter is applied,
# so the folder is expected to contain only the pathway .XML files)
xml_files: list[str] = os.listdir(xml_folder_path)

# Creating a dataframe with each .XML file and appending them to the list of dfs
for xml in xml_files:
    path: str = os.path.join(xml_folder_path, xml)
    df: pd.DataFrame = pd.read_xml(path, namespaces={"doc": "http://pathvisio.org/GPML/2013a"})
    dfs.append(df)

# Concatenating all dfs in the list
homo_sapiens: pd.DataFrame = pd.concat(dfs)

# Resetting the index of the database dataframe
homo_sapiens.reset_index(drop=True, inplace=True)

# Replacing artifacts with NaN.
homo_sapiens.replace({"\n      ": np.nan}, inplace=True)
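With its default settings, `pd.read_xml` turns each direct child of the root element into one row, so attributes stored on the root `<Pathway>` element itself do not end up in the table. One possible alternative is to walk the tree with the standard library's `xml.etree.ElementTree` and attach the pathway-level attributes to every node row. The sketch below is only an illustration: the embedded document is a made-up, heavily shortened example, the helper name `parse_gpml` is hypothetical, and the element and attribute names (`DataNode`, `Xref`, `Name`, `Version`) are assumptions that should be checked against the real files.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the pd.read_xml call in this notebook
GPML_NS = {"doc": "http://pathvisio.org/GPML/2013a"}

# Hypothetical, heavily shortened document used only for illustration
SAMPLE_GPML = """
<Pathway xmlns="http://pathvisio.org/GPML/2013a"
         Name="Example pathway" Version="20230804" Organism="Homo sapiens">
  <DataNode TextLabel="GENE1" GraphId="a1" Type="GeneProduct">
    <Xref Database="Ensembl" ID="ENSG00000000001" />
  </DataNode>
  <DataNode TextLabel="GENE2" GraphId="a2" Type="GeneProduct">
    <Xref Database="Ensembl" ID="ENSG00000000002" />
  </DataNode>
</Pathway>
""".strip()


def parse_gpml(xml_text: str) -> list[dict]:
    """Return one dict per DataNode, carrying the pathway-level attributes along."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall("doc:DataNode", GPML_NS):
        xref = node.find("doc:Xref", GPML_NS)
        rows.append(
            {
                # Attributes of the root element (assumed names)
                "PathwayName": root.get("Name"),
                "Version": root.get("Version"),
                "Organism": root.get("Organism"),
                # Attributes of the node itself
                "TextLabel": node.get("TextLabel"),
                "Type": node.get("Type"),
                "GraphId": node.get("GraphId"),
                # Cross-reference, if the node has one
                "XrefDatabase": xref.get("Database") if xref is not None else None,
                "XrefID": xref.get("ID") if xref is not None else None,
            }
        )
    return rows


rows = parse_gpml(SAMPLE_GPML)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` and concatenated across files in the same way as the `dfs` list above.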

## Exporting as .csv file

In [5]:
# Saving the pathways as .csv file
pathways.to_csv("./data/pathways.csv", index=False)

In [6]:
# Selecting the first 10000 rows
homo_sapiens = homo_sapiens.iloc[0: 10000]

# Saving the component as .csv file
homo_sapiens.to_csv("./data/homo_sapiens.csv", index=False)

del pathways
del homo_sapiens

## Loading the Data with Pandas

In [7]:
# Reading the pathways table back from the .csv file
pathways: pd.DataFrame = pd.read_csv("./data/pathways.csv")

In [8]:
# Reading the sampled database
homo_sapiens: pd.DataFrame = pd.read_csv("./data/homo_sapiens.csv")

# To simplify the dataframe, all columns that contain only NaN values are dropped.
homo_sapiens.dropna(axis=1, how="all", inplace=True)

# Removing '\n' from the entire DataFrame
homo_sapiens = homo_sapiens.map(lambda x: x.replace('\n', '') if isinstance(x, str) else x)

## Saving as SQL Database

In [9]:
# Creating an SQL database
engine = create_engine("sqlite:///data/pathways.db", echo=False)

# Writing to the created SQL database
pathways.to_sql("pathways", con=engine, index=False)

1922

In [10]:
# Creating an SQL database
engine = create_engine("sqlite:///data/homo_sapiens.db", echo=False)

# Writing to the created SQL database
homo_sapiens.to_sql("homo_sapiens", con=engine, index=False)

del pathways
del homo_sapiens

## Opening the Database with SQL

In [11]:
%%capture
%load_ext sql
%sql sqlite:///data/pathways.db

Inspection of the first 10 rows in the database

In [12]:
%%sql
Select * from pathways limit 10

 * sqlite:///data/pathways.db
Done.


Pathway Title,ID,Organism,Last Edited,Communities,Pathway Terms,Disease Terms,Cell Types
Hfe effect on hepcidin production,WP3673,Mus musculus,13 Dec 2016,,"regulatory pathway, iron homeostasis pathway",,
Lipids measured in liver metastasis from breast cancer,WP4627,Mus musculus,29 Nov 2019,,"lipid metabolic pathway, classic metabolic pathway","breast cancer, disease of cellular proliferation",
10q11.21q11.23 copy number variation syndrome,WP5352,Homo sapiens,04 Aug 2023,"Diseases, RareDiseases",disease pathway,,
10q22q23 copy number variation,WP5402,Homo sapiens,18 Aug 2023,,disease pathway,"chromosomal duplication syndrome, chromosomal deletion syndrome, genetic disease",
11p11.2 copy number variation syndrome,WP5348,Homo sapiens,05 Aug 2023,"Diseases, RareDiseases",disease pathway,,
13q12 or CRYL1 copy number variation,WP5405,Homo sapiens,07 Aug 2023,,disease pathway,"chromosomal duplication syndrome, chromosomal deletion syndrome, genetic disease",
13q12.12 copy number variation,WP5406,Homo sapiens,08 Aug 2023,,disease pathway,"chromosomal deletion syndrome, chromosomal duplication syndrome, genetic disease",
15q11.2 copy number variation syndrome,WP4940,Homo sapiens,18 Jan 2023,RareDiseases,disease pathway,"genetic disease, chromosome 15q11.2 deletion syndrome",
15q11q13 copy number variation,WP5407,Homo sapiens,10 Aug 2023,,disease pathway,"chromosomal deletion syndrome, chromosomal duplication syndrome, genetic disease",
15q13.3 copy number variation syndrome,WP4942,Homo sapiens,12 Mar 2021,RareDiseases,disease pathway,"chromosome 15q13.3 microdeletion syndrome, genetic disease",


Inspecting the distinct values of the "Disease Terms" column.

In [13]:
%%sql
Select distinct "Disease Terms" from pathways limit 10

 * sqlite:///data/pathways.db
Done.


Disease Terms
""
"breast cancer, disease of cellular proliferation"
"chromosomal duplication syndrome, chromosomal deletion syndrome, genetic disease"
"chromosomal deletion syndrome, chromosomal duplication syndrome, genetic disease"
"genetic disease, chromosome 15q11.2 deletion syndrome"
"chromosome 15q13.3 microdeletion syndrome, genetic disease"
"chromosome 15q25 deletion syndrome, genetic disease"
"chromosome 16p11.2 deletion syndrome, genetic disease"
"genetic disease, chromosome 16p11.2 deletion syndrome"
"Miller-Dieker lissencephaly syndrome, genetic disease"


Filtering based on the organism and disease term of interest

In [14]:
%%sql
Select "Pathway Title", ID, Organism, "Disease Terms" from pathways where
(Organism = 'Homo sapiens' and "Disease Terms" like '%cancer%') order by ID limit 10

 * sqlite:///data/pathways.db
Done.


Pathway Title,ID,Organism,Disease Terms
Folate-alcohol and cancer pathway hypotheses,WP1589,Homo sapiens,"oral cavity cancer, disease of cellular proliferation"
Fluoropyrimidine activity,WP1601,Homo sapiens,"cancer, disease of cellular proliferation"
TP53 network,WP1742,Homo sapiens,"cancer, disease of cellular proliferation"
Integrated cancer pathway,WP1971,Homo sapiens,"cancer, disease of cellular proliferation"
Glioblastoma signaling pathways,WP2261,Homo sapiens,"brain cancer, glioblastoma, cancer, central nervous system disease, disease of cellular proliferation"
Androgen receptor network in prostate cancer,WP2263,Homo sapiens,"prostate cancer, disease of cellular proliferation"
Irinotecan pathway,WP229,Homo sapiens,"neutropenia, diarrhea, cancer, , disease of cellular proliferation"
Deregulation of Rab and Rab effector genes in bladder cancer,WP2291,Homo sapiens,"urinary bladder cancer, disease of cellular proliferation"
Gastric cancer network 1,WP2361,Homo sapiens,"gastric adenocarcinoma, stomach cancer, disease of cellular proliferation"
Gastric cancer network 2,WP2363,Homo sapiens,"stomach cancer, gastric adenocarcinoma, cancer, disease of cellular proliferation"
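The `like` filter with `%cancer%` is a substring match, so it also keeps rows whose only matching term is a more specific one such as `prostate cancer`. The cells are comma-separated lists and can contain empty entries (see `cancer, , disease of cellular proliferation` in the output above). If an exact term is wanted, one option is to split the cells in Python. The following is a minimal sketch; the helper names `split_terms` and `has_term` are illustrative, and the three cells are copied from the output above.

```python
def split_terms(cell):
    """Split a comma-separated cell into clean terms, skipping empty entries."""
    if not cell:
        return []
    return [term.strip() for term in cell.split(",") if term.strip()]


def has_term(cell, wanted):
    """True only if `wanted` appears as a complete term in the cell."""
    return wanted in split_terms(cell)


# Disease Terms cells copied from the query output, keyed by pathway ID
rows = {
    "WP229": "neutropenia, diarrhea, cancer, , disease of cellular proliferation",
    "WP2263": "prostate cancer, disease of cellular proliferation",
    "WP1742": "cancer, disease of cellular proliferation",
}

# Keep only the pathways annotated with the exact term "cancer"
exact_hits = sorted(pid for pid, cell in rows.items() if has_term(cell, "cancer"))
```

Here `WP2263` is excluded because `prostate cancer` is a different term, while the substring filter in the SQL cell would keep it.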


Looking up a single pathway when its ID is known

In [15]:
%%sql
SELECT * from pathways where ID = 'WP5352'

 * sqlite:///data/pathways.db
Done.


Pathway Title,ID,Organism,Last Edited,Communities,Pathway Terms,Disease Terms,Cell Types
10q11.21q11.23 copy number variation syndrome,WP5352,Homo sapiens,04 Aug 2023,"Diseases, RareDiseases",disease pathway,,
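The same lookup can be done from plain Python with the standard library's `sqlite3` module, passing the ID as a query parameter instead of writing it into the SQL text. The sketch below uses an in-memory table with a reduced set of columns and two rows copied from the outputs above; the function name `get_pathway` is hypothetical, and in the notebook the connection would presumably point at `data/pathways.db` instead.

```python
import sqlite3


def get_pathway(conn, pathway_id):
    """Return the row for one pathway ID, or None if the ID is not present."""
    cursor = conn.execute("SELECT * FROM pathways WHERE ID = ?", (pathway_id,))
    return cursor.fetchone()


# In-memory stand-in for the pathways table (only three of its columns)
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE pathways ("Pathway Title" TEXT, ID TEXT, Organism TEXT)')
conn.executemany(
    "INSERT INTO pathways VALUES (?, ?, ?)",
    [
        ("10q11.21q11.23 copy number variation syndrome", "WP5352", "Homo sapiens"),
        ("Hfe effect on hepcidin production", "WP3673", "Mus musculus"),
    ],
)

row = get_pathway(conn, "WP5352")
missing = get_pathway(conn, "WP0000")  # made-up ID that is not in the table
conn.close()
```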


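Ordering by the "Last Edited" column. The dates are stored as text (for example `01 Apr 2023`), so the rows are sorted alphabetically rather than chronologically, as the output shows.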

In [16]:
%%sql
Select * from pathways order by "Last Edited" limit 10

 * sqlite:///data/pathways.db
Done.


Pathway Title,ID,Organism,Last Edited,Communities,Pathway Terms,Disease Terms,Cell Types
Biosynthesis and regeneration of tetrahydrobiopterin and catabolism of phenylalanine,WP4156,Homo sapiens,01 Apr 2023,"IEM, RareDiseases","tetrahydrobiopterin metabolic pathway, dopa responsive dystonia pathway, phenylketonuria pathway, phenylalanine degradation pathway, Segawa syndrome pathway, classic metabolic pathway, disease pathway","sepiapterin reductase deficiency, BH4-deficient hyperphenylalaninemia B, BH4-deficient hyperphenylalaninemia A, megaloblastic anemia, dystonia 5, phenylketonuria, aromatic L-amino acid decarboxylase deficiency, genetic disease,",
Disorders of folate metabolism and transport,WP4259,Homo sapiens,01 Apr 2023,"Diseases, IEM, RareDiseases","disease pathway, folate metabolic pathway, methylenetetrahydrofolate reductase deficiency pathway, regulatory pathway","glutamate formiminotransferase deficiency, megaloblastic anemia, , vitamin metabolic disorder, vitamin B12 deficiency, cerebral folate receptor alpha deficiency, genetic disease","central nervous system neuron, animal cell"
GABA metabolism (aka GHB),WP4157,Homo sapiens,01 Apr 2023,"IEM, RareDiseases","xenobiotic metabolic pathway, neurotransmitter metabolic pathway, gamma-aminobutyric acid metabolic pathway, classic metabolic pathway","succinic semialdehyde dehydrogenase deficiency, GABA aminotransferase deficiency, gamma-amino butyric acid metabolism disorder, genetic disease",
"Metabolic pathway of LDL, HDL and TG, including diseases",WP4522,Homo sapiens,01 Apr 2023,"Diseases, IEM, RareDiseases","triacylglycerol metabolic pathway, disease pathway, familial combined hyperlipidemia pathway, lipoprotein metabolic pathway, altered lipoprotein metabolic pathway, classic metabolic pathway","familial combined hyperlipidemia, autosomal recessive hypercholesterolemia, Tangier disease, hypobetalipoproteinemia, genetic disease",
Pyrimidine metabolism and related diseases,WP4225,Homo sapiens,01 Apr 2023,"IEM, RareDiseases","orotic aciduria 1 pathway, beta-ureidopropionase deficiency pathway, pyrimidine metabolic pathway, inborn error of purine-pyrimidine metabolism pathway, dihydropyrimidine dehydrogenase deficiency pathway, disease pathway, classic metabolic pathway","orotic aciduria, pyrimidine metabolic disorder, dihydropyrimidine dehydrogenase deficiency, genetic disease",
Vitamin B6-dependent and responsive disorders,WP4228,Homo sapiens,01 Apr 2023,"Diseases, IEM, RareDiseases","hypophosphatasia pathway, hyperprolinemia type II pathway, proline metabolic pathway, vitamin B6 metabolic pathway, lysine degradation pathway, disease pathway, classic metabolic pathway","hypophosphatasia, pyridoxine-dependent epilepsy, hyperprolinemia type 2, childhood hypophosphatasia, epilepsy, early-onset vitamin B6-dependent epilepsy, infantile hypophosphatasia, pyridoxamine 5'-phosphate oxidase deficiency, genetic disease, central nervous system disease","neural cell, animal cell"
Peptide GPCRs,WP1338,Danio rerio,01 Aug 2016,,"G protein mediated signaling pathway, signaling pathway",,
p53 signaling,WP2902,Mus musculus,01 Aug 2016,,"p53 signaling pathway, regulatory pathway",,
Actin cytoskeleton regulation,WP1062,Bos taurus,01 Feb 2022,,"regulatory pathway, cell adhesion signaling pathway, signaling pathway",,
FAS pathway and stress induction of HSP regulation,WP1019,Bos taurus,01 Feb 2022,,"stress response pathway, FasL mediated signaling pathway, regulatory pathway",,
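One way to get a chronological order is to store the dates in ISO 8601 form (`YYYY-MM-DD`), for which alphabetical and chronological order coincide. The sketch below illustrates this with the standard library only, using an in-memory SQLite table and four date values copied from the outputs above; the helper name `to_iso` is hypothetical, and parsing month abbreviations with `%b` assumes an English locale.

```python
import sqlite3
from datetime import datetime


def to_iso(date_text):
    """Convert a date such as '01 Apr 2023' to '2023-04-01'."""
    # %b parses abbreviated month names and depends on the current locale
    return datetime.strptime(date_text, "%d %b %Y").date().isoformat()


# (title, last edited) pairs copied from the query outputs in this notebook
rows = [
    ("Hfe effect on hepcidin production", "13 Dec 2016"),
    ("Peptide GPCRs", "01 Aug 2016"),
    ("Actin cytoskeleton regulation", "01 Feb 2022"),
    ("GABA metabolism (aka GHB)", "01 Apr 2023"),
]

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE pathways ("Pathway Title" TEXT, "Last Edited" TEXT)')
conn.executemany(
    "INSERT INTO pathways VALUES (?, ?)",
    [(title, to_iso(edited)) for title, edited in rows],
)

# With ISO dates, a plain text sort is also a chronological sort
ordered = conn.execute(
    'SELECT "Pathway Title", "Last Edited" FROM pathways ORDER BY "Last Edited"'
).fetchall()
conn.close()
```

In the notebook itself, a comparable conversion could be applied to the dataframe before calling `to_sql`, for example with `pd.to_datetime(pathways["Last Edited"], format="%d %b %Y")`.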


In [17]:
%sql sqlite:///data/homo_sapiens.db

Because the .xml files do not follow the usual XML conventions, the information in this table is incomplete. The filter here should have been based on Version (not shown in the table) rather than on the Comment column.

In [18]:
%%sql
SELECT * from homo_sapiens where Comment LIKE '10q11.21q11.23%' LIMIT 10

 * sqlite:///data/homo_sapiens.db
   sqlite:///data/pathways.db
Done.


Source,Comment,BoardWidth,BoardHeight,TextLabel,Type,GraphId,GroupRef,BiopaxRef,GroupId,Style,CenterX,CenterY,GraphRef,Href
WikiPathways-description,"10q11.21q11.23 copy number variation (CNV) syndrome is a rare genetic disorder caused by a deletion or duplication of genetic material on chromosome 10. The exact genetic location chr10:49,390,199-51,058,796 (GRCh37) was taken from Kirov et al. 2014 and literature cited there.",,,,,,,,,,,,,


## How the data should have looked

![How the table should have looked 1](./images/correct_table_1.png)

![How the table should have looked 2](./images/correct_table_2.png)