<a href="https://colab.research.google.com/github/mejian1/ExopherGeneExpressionProfiling/blob/main/docs/pywikipathways_Overview.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kozo2/pywikipathways/blob/main/docs/pywikipathways_Overview.ipynb)

**by Kozo Nishida and Alexander Pico**

**pywikipathways 0.0.2**

*WikiPathways* is a well-known repository for biological pathways that provides unique tools to the research community for content creation, editing and utilization [1].

**Python** is a powerful programming language and environment for statistical and exploratory data analysis.

*pywikipathways* leverages the WikiPathways API to communicate between **Python** and WikiPathways, allowing any pathway to be queried, interrogated and downloaded in both data and image formats. Queries are typically performed based on “Xrefs”, standardized identifiers for genes, proteins and metabolites. Once you have identified a pathway, you can use its WPID (WikiPathways identifier) to make additional queries.

## Prerequisites
All you need is this **pywikipathways** package!
To install pywikipathways, run:

In [None]:
!pip install pywikipathways

In [None]:
import pywikipathways as pwpw

In [None]:
celegans_pathway_ids = pwpw.list_pathway_ids(organism='Caenorhabditis elegans')

## Getting started
Let’s first get oriented with what WikiPathways contains. For example, here’s how you check which species are currently supported by WikiPathways:

In [None]:
pwpw.list_organisms()

In [None]:
import pandas as pd

# Collect the genes (Entrez Gene IDs, system code 'L') for each C. elegans pathway
pathway_genes = {}
for pathway_id in celegans_pathway_ids:
    genes_in_pathway = pwpw.get_xref_list(pathway_id, 'L')
    print(f"Genes in pathway {pathway_id}: {genes_in_pathway}")
    pathway_genes[pathway_id] = genes_in_pathway

# Build the table once, after all pathways have been processed:
# one row per pathway, then transposed so that each pathway is a column
df = pd.DataFrame.from_dict(pathway_genes, orient='index')
df = df.transpose()
df.to_excel('celegans_pathway_genes.xlsx', index=False)


In [None]:
!pip install pywikipathways  # Install the pywikipathways package

import pywikipathways as pwpw  # Import after installation
import pandas as pd

celegans_pathway_ids = pwpw.list_pathway_ids(organism='Caenorhabditis elegans')

pathway_genes = {} # Initialize pathway_genes outside the loops

for pathway_id in celegans_pathway_ids:
    # Get the list of genes (Entrez Gene IDs) for the current pathway
    genes_in_pathway = pwpw.get_xref_list(pathway_id, 'L')
    print(f"Genes in pathway {pathway_id}: {genes_in_pathway}")

    # Store genes for the current pathway
    pathway_genes[pathway_id] = genes_in_pathway

# Create DataFrame after processing all pathways
df = pd.DataFrame.from_dict(pathway_genes, orient='index')
df = df.transpose()
df.to_excel('celegans_pathway_genes.xlsx', index=False)

In [None]:
!pip install biomart
from biomart import BiomartServer
server = BiomartServer("http://www.ensembl.org/biomart")
ensembl_mart = server.datasets['c_elegans_gene_ensembl']

In [None]:
server = BiomartServer("https://parasite.wormbase.org/parasite/mart.php")
mart = server.datasets['c_elegans_gene_ensembl']

In [None]:
from biomart import BiomartServer

# The URL needs to point to the BioMart registry file, which for WormBase
# ParaSite might be located at '...mart.php?config=gene'. This assumes that a
# specific configuration endpoint is available; refer to the WormBase ParaSite
# BioMart documentation for the correct registry file location.
server = BiomartServer("https://parasite.wormbase.org/parasite/mart.php?config=gene")
mart = server.datasets['c_elegans_gene_ensembl']

In [None]:
from google.colab import drive
drive.mount('/content/drive')

The call to *pwpw.list_organisms()* above should list 30 or more species. This list is useful for subsequent queries that take an *organism* argument, to avoid misspelling.
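
One way to use such a list as a guard against misspelled organism names is sketched below. The helper `check_organism` is hypothetical (it is not part of pywikipathways), and the short `supported` list stands in for whatever the species query actually returns.

```python
import difflib


def check_organism(name, supported):
    """Return name if it is in supported; otherwise raise with a suggestion."""
    if name in supported:
        return name
    # Look for the closest supported spelling to include in the error message
    suggestions = difflib.get_close_matches(name, supported, n=1)
    hint = f" Did you mean {suggestions[0]!r}?" if suggestions else ""
    raise ValueError(f"Unsupported organism {name!r}.{hint}")


# Stand-in for the real species list (assumption: a plain list of strings)
supported = ["Homo sapiens", "Mus musculus", "Caenorhabditis elegans"]

organism = check_organism("Mus musculus", supported)
print(organism)
```

Passing a misspelled name such as "Homo sapien" raises a `ValueError` that suggests the closest supported spelling, so the mistake surfaces before any query is sent.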

Next, let’s see how many pathways are available for Human:

In [None]:
hs_pathways = pwpw.list_pathways('Homo sapiens')

In [None]:
hs_pathways

Yikes! That is a lot of information.
Let’s break that down a bit:

In [None]:
help(pwpw.list_pathways)

In [None]:
hs_pathways.shape

Ok. The help docs tell us that for each Human pathway we are getting a lot of information.
The *shape* of the returned *pandas.DataFrame* (rows, columns) might be all you really want to know.
Or if you’re interested in just one particular piece of information, check out these functions:

In [None]:
help(pwpw.list_pathway_ids)

In [None]:
help(pwpw.list_pathway_names)

In [None]:
help(pwpw.list_pathway_urls)

These return simple lists containing just a particular piece of information for each pathway result.

Finally, there’s another way to find pathways of interest: by Xref. An Xref is simply a standardized identifier from an official source. WikiPathways relies on BridgeDb [2] to provide dozens of Xref sources for genes, proteins and metabolites. See the full list at https://github.com/bridgedb/datasources/blob/main/datasources.tsv

With **pywikipathways**, the approach is simple.
Take a supported identifier for a molecule of interest, e.g., the official HGNC gene symbol “TNF”, and check the *system code* for its datasource, e.g., HGNC = H (this comes from the second column of the datasources.tsv table linked above). Then form your query:

In [None]:
tnf_pathways = pwpw.find_pathways_by_xref('TNF','H')

In [None]:
tnf_pathways

Ack! That’s a lot of information. We provide not only the pathway information, but also the search result score in case you want to rank results, etc. Again, if all you’re interested in is WPIDs, names or URLs, then there are these handy alternatives that will just return simple lists:

In [None]:
help(pwpw.find_pathway_ids_by_xref)

In [None]:
help(pwpw.find_pathway_names_by_xref)

In [None]:
help(pwpw.find_pathway_urls_by_xref)

*Be aware*: a simple *len* call may be misleading here, since a given pathway will be listed multiple times if the Xref is present multiple times.
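
To count distinct pathways rather than hits, one option is to drop repeats while keeping the original order. The sketch below uses a made-up list of WPIDs in place of a real query result, and `unique_in_order` is a hypothetical helper, not part of pywikipathways.

```python
def unique_in_order(items):
    """Return items with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(items))


# Made-up result in which one pathway contains the Xref twice
wpids = ["WP554", "WP111", "WP554", "WP222"]
unique_wpids = unique_in_order(wpids)

print(len(wpids), len(unique_wpids))  # prints: 4 3
```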

## My favorite pathways
At this point, we should have one or more pathways identified from the queries above. Let’s assume we identified ‘WP554’, the ACE Inhibitor Pathway (https://wikipathways.org/instance/WP554). We will use its WPID (WP554) in subsequent queries.

First off, we can get information about the pathway (if we didn’t already collect it above):

In [None]:
pwpw.get_pathway_info('WP554')

Next, we can get all the Xrefs contained in the pathway, mapped to a datasource of our choice. How convenient! We use the same system codes as described above. So, for example, if we want all the genes listed as Entrez Genes from this pathway:

In [None]:
pwpw.get_xref_list('WP554','L')

Alternatively, if we want them listed as Ensembl IDs instead, then…

In [None]:
pwpw.get_xref_list('WP554', 'En')

And, if we want the metabolites, drugs and other small molecules associated with the pathway, then we’d simply provide the system code of a chemical database, e.g., Ch (HMDB), Ce (ChEBI) or Cs (Chemspider):

In [None]:
pwpw.get_xref_list('WP554', 'Ch')

In [None]:
pwpw.get_xref_list('WP554', 'Ce')

In [None]:
pwpw.get_xref_list('WP554', 'Cs')

It’s that easy!
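
Because the system codes are short and easy to mistype, it can help to keep the ones used in this overview in a small lookup table. This is only a sketch: `SYSTEM_CODES` and `datasource_for` are hypothetical names, not part of pywikipathways, and the table covers only the codes mentioned here rather than the full BridgeDb list.

```python
# System codes mentioned in this overview and the datasource each one denotes
SYSTEM_CODES = {
    "H": "HGNC",
    "L": "Entrez Gene",
    "En": "Ensembl",
    "Ch": "HMDB",
    "Ce": "ChEBI",
    "Cs": "Chemspider",
}


def datasource_for(code):
    """Return the datasource name for a system code in the table above."""
    try:
        return SYSTEM_CODES[code]
    except KeyError:
        known = ", ".join(sorted(SYSTEM_CODES))
        raise ValueError(
            f"Unknown system code {code!r}; known codes: {known}"
        ) from None


print(datasource_for("En"))  # prints: Ensembl
```

The lookup is case-sensitive on purpose, so a slip such as 'l' instead of 'L' fails loudly.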

## Give me more
We also provide methods for retrieving pathways as data files and as images. The native file format for WikiPathways is GPML, a custom XML specification. You can retrieve this format by…

In [None]:
gpml = pwpw.get_pathway('WP554')

In [None]:
gpml[:1000]
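
Since GPML is XML, it can be inspected with Python's standard XML tools. The sketch below makes several assumptions: that the retrieved GPML is a plain string (as the slicing above suggests), and that nodes are stored as `DataNode` elements carrying `TextLabel` and `Type` attributes plus an `Xref` child with `Database` and `ID` attributes. The sample document is hand-written for illustration and is not the real content of WP554.

```python
import xml.etree.ElementTree as ET

# Hand-written miniature in the style of GPML; NOT real WikiPathways data
SAMPLE_GPML = """
<Pathway xmlns="http://pathvisio.org/GPML/2013a" Name="Example pathway">
  <DataNode TextLabel="GeneA" Type="GeneProduct">
    <Xref Database="Entrez Gene" ID="1111"/>
  </DataNode>
  <DataNode TextLabel="MetaboliteB" Type="Metabolite">
    <Xref Database="ChEBI" ID="2222"/>
  </DataNode>
</Pathway>
""".strip()


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def list_data_nodes(gpml_text):
    """Return (label, type, database, id) for every DataNode element."""
    root = ET.fromstring(gpml_text)
    rows = []
    for element in root.iter():
        if local_name(element.tag) != "DataNode":
            continue
        # Take the first Xref child, if the node has one
        xref = next((c for c in element if local_name(c.tag) == "Xref"), None)
        rows.append((
            element.get("TextLabel"),
            element.get("Type"),
            xref.get("Database") if xref is not None else None,
            xref.get("ID") if xref is not None else None,
        ))
    return rows


for row in list_data_nodes(SAMPLE_GPML):
    print(row)
```

Matching on the local tag name rather than a fixed namespace keeps the sketch independent of the GPML schema version.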

WikiPathways also provides a monthly data release archived at http://data.wikipathways.org. The archive includes timestamped GPML, GMT and SVG collections, organized by organism. There’s a Python function for grabbing files from the archive…

In [None]:
pwpw.download_pathway_archive()

This will simply print the archive URL so you can look around (in case you don’t know what you are looking for). By default, it prints the latest collection of GPML files. However, if you provide an organism, then it will download that file to your current working directory or specified **destpath**. For example, here’s how you’d get the latest GMT file for mouse:

In [None]:
pwpw.download_pathway_archive(organism="Mus musculus", format="gmt")

You might also want to specify an archive date, so that you can easily share and reproduce your script at any time in the future and get the same result. Remember, new pathways are being added to WikiPathways every month and existing pathways are improved continuously!

In [None]:
pwpw.download_pathway_archive(date="20171010", organism="Mus musculus", format="gmt")
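
The archive date in the cell above is an eight-digit string. Assuming it follows a YYYYMMDD layout (inferred from "20171010", not confirmed against the package documentation), it can be built from a `datetime.date` so that a script states the pinned release explicitly. `archive_date` is a hypothetical helper, not part of pywikipathways.

```python
from datetime import date


def archive_date(d):
    """Format a date as an eight-digit YYYYMMDD string."""
    return d.strftime("%Y%m%d")


# Pin the release once, near the top of a script, and reuse it everywhere
release = archive_date(date(2017, 10, 10))
print(release)  # prints: 20171010
```

Only dates that match an actual release in the archive can be expected to resolve to a file.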

## References
1. Pico AR, Kelder T, Iersel MP van, Hanspers K, Conklin BR, Evelo C: **WikiPathways: Pathway editing for the people.** *PLoS Biol* 2008, **6**:e184.

2. Iersel M van, Pico A, Kelder T, Gao J, Ho I, Hanspers K, Conklin B, Evelo C: **The BridgeDb framework: Standardized access to gene, protein and metabolite identifier mapping services.** *BMC Bioinformatics* 2010, **11**:5.
