# Gene Wiki wikipedia URLs in MyGene.info coverage checker

After retrieving Wikipedia urls for AD genes for the Gene Wiki Review series and manually inspecting the AD genes without Wikipedia urls available in MyGene.info, it became apparent that the Wikipedia url annotation data in MyGene.info may not be up-to-date, this is to check what is missing.

In [1]:
import mygene
import pandas
import json
import urllib.request
import requests
from collections import OrderedDict
from pandas import read_csv

mg = mygene.MyGeneInfo()

infilepath = 'data/'
exppath = 'results/'

In [2]:
###############################################################################
## This module uses mygene.info python wrapper to obtain gene wikipedia 
## titles from entrez ids.
###############################################################################

def check_wiki_titles_for_geneids (genelist):
    print('obtaining wikipedia page titles for gene ids')
    entrezlist = mg.querymany(genelist, scopes='entrezgene', fields=('wikipedia.url_stub','symbol'), as_dataframe=True)
    ## Split the returned dataframe into two lists (one with a result, one without)
    entrezlist['wikipedia'] = entrezlist['wikipedia'].fillna('None')
    no_response = entrezlist.loc[entrezlist['wikipedia']=='None']
    url_found = entrezlist.loc[entrezlist['wikipedia']!='None']
    ## For gene ids that returned a Wikipedia stub, clean up result for querying Wikipedia for Wikipedia information
    url_found['wikipedia'] = url_found['wikipedia'].apply(lambda x: x['url_stub'])
    return(url_found,no_response)

Run the sparql query to retrieve all human genes with English Wikipedia URLs. 

In [3]:

url = 'https://query.wikidata.org/sparql'
query = """
SELECT 
  ?gene ?geneid ?protein ?article
WHERE {
  ?gene wdt:P31 wd:Q7187;  #Find genes
        wdt:P703 wd:Q15978631. #limit to humans
  ?article schema:about ?gene. #limit to genes with corresponding English Wikipedia articles
  ?article schema:inLanguage "en".
  ?article schema:isPartOf <https://en.wikipedia.org/>
  OPTIONAL { ?gene wdt:P688 ?protein } #Get associated proteins
  OPTIONAL { ?gene wdt:P351 ?geneid } #Get the geneid
  }
"""
r = requests.get(url, params = {'format': 'json', 'query': query})
data = r.json()
print("query completed")

query completed


Clean up the results of the SPARQL query. The SPARQL query will also pull any proteins that are encoded by the genes. Since the mapping of gene to protein isn't always 1:1, there maybe duplicate gene entries corresponding to different proteins or single gene entries corresponding to the same protein.

In [4]:
genes = []
for item in data['results']['bindings']:
    genes.append(OrderedDict({
        'gene_uri': item['gene']['value'],
        'gene_id': item['geneid']['value'] 
            if 'geneid' in item else None,
        'protein_uri': item['protein']['value'] 
            if 'protein' in item else None,
        'wiki_uri': item['article']['value'] 
            if 'article' in item else None}))

wikidata_genes_uri = pandas.DataFrame(genes)
wikidata_genes_uri['genes_wdid'] = wikidata_genes_uri['gene_uri'].astype(str).str.replace("http://www.wikidata.org/entity/","")
wikidata_genes_uri['protein_wdid'] = wikidata_genes_uri['protein_uri'].astype(str).str.replace("http://www.wikidata.org/entity/","")
wikidata_genes_uri['wiki_stub'] = wikidata_genes_uri['wiki_uri'].astype(str).str.replace("https://en.wikipedia.org/wiki/","")
wikidata_genes_uri.head()
print(len(wikidata_genes_uri))

#wikidata_genes_uri.to_csv(exppath+'wikidata_genes_uri.tsv',sep='\t',header=True)

16223


Pull all the unique gene ids for human genes with Wikipedia pages.

In [7]:
genedf = wikidata_genes_uri[['gene_id']].fillna(-1).astype(int)
genedf.drop_duplicates('gene_id', keep='first', inplace=True)
genedf['ids'] = genedf['gene_id'].astype(str)

genelist = genedf['ids'].tolist()
#print(wikidata_genes_uri.head(n=2))
print(genedf[0:2])

   gene_id   ids
0     2312  2312
1     5925  5925


Use MyGene.info to find the Wikipedia urls for those genes

In [8]:
url_found,no_response = check_wiki_titles_for_geneids(genelist)
print('Length of original gene list: ',len(genelist), 
      'length of list with urls in mygene: ',len(url_found))

obtaining wikipedia page titles for gene ids
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-12430...done.
Finished.
11 input query terms found no hit:
	['-1', '653365', '6025', '9193', '11217', '353293', '117153', '544437', '387281', '8142', '107984557
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
Length of original gene list:  12430 length of list with urls in mygene:  10855


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Upon manual inspection, the 10 gene ids in the not found list appear to be genes which have been discontinued or otherwise withdrawn and potentially replaced. In other words, these may be genes which should be marked in Wikidata as deprecated in some manner.

Comparing hte list of unique genes in the original list (12429) and the list returned from mygene (10855), it looks like there are about 1574 genes which are in Wikidata that have associated Wikipedia URLs, which are not found in MyGene.info.

To identify the the genes with Wikipedia URLS according to Wikidata, but no URLS according to MyGene, do a merge

In [9]:
mygene_results = url_found.reset_index()
mygene_to_merge = mygene_results.rename(columns={'_id':'ids'})
mygene_to_merge['ids'] = mygene_to_merge['ids'].astype(str)

gene_compare = wikidata_genes_uri.merge(genedf.merge(mygene_to_merge, on='ids',how='left'), on='gene_id', how='left')
print(gene_compare.head(n=2))

  gene_id                                gene_uri  \
0    2312  http://www.wikidata.org/entity/Q410688   
1    5925   http://www.wikidata.org/entity/Q40108   

                                protein_uri  \
0  http://www.wikidata.org/entity/Q21201832   
1  http://www.wikidata.org/entity/Q21111463   

                                            wiki_uri genes_wdid protein_wdid  \
0            https://en.wikipedia.org/wiki/Filaggrin    Q410688    Q21201832   
1  https://en.wikipedia.org/wiki/Retinoblastoma_p...     Q40108    Q21111463   

                wiki_stub  ids query  _score notfound symbol wikipedia  
0               Filaggrin  NaN   NaN     NaN      NaN    NaN       NaN  
1  Retinoblastoma_protein  NaN   NaN     NaN      NaN    NaN       NaN  


Filter for genes where the MyGene.info failed to retrieve a url.

In [10]:
missing_from_mygene = gene_compare.loc[gene_compare['wikipedia'].isnull()]
print(len(missing_from_mygene))
print(missing_from_mygene.head(n=2))

16223
  gene_id                                gene_uri  \
0    2312  http://www.wikidata.org/entity/Q410688   
1    5925   http://www.wikidata.org/entity/Q40108   

                                protein_uri  \
0  http://www.wikidata.org/entity/Q21201832   
1  http://www.wikidata.org/entity/Q21111463   

                                            wiki_uri genes_wdid protein_wdid  \
0            https://en.wikipedia.org/wiki/Filaggrin    Q410688    Q21201832   
1  https://en.wikipedia.org/wiki/Retinoblastoma_p...     Q40108    Q21111463   

                wiki_stub  ids query  _score notfound symbol wikipedia  
0               Filaggrin  NaN   NaN     NaN      NaN    NaN       NaN  
1  Retinoblastoma_protein  NaN   NaN     NaN      NaN    NaN       NaN  
