All external code used in this project is marked with the comments "START OF EXTERNAL CODE" and "END OF EXTERNAL CODE" and comes with an inline reference to its source.

# SNPedia Dataset

This notebook describes a procedure to retrieve and dump the [SNPedia](https://www.snpedia.com/) data and store it in CSV files for further processing. The data is distributed under a [Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License](http://creativecommons.org/licenses/by-nc-sa/3.0/us/). SNPedia explicitly allows scraping of the data and provides a [Bulk API](https://www.snpedia.com/index.php/Bulk) to do so.

## Requirements

Python 3.12 or higher is required (the notebook uses `itertools.batched`). Install the required packages:

```bash
pip install -r requirements.txt
```



Import the packages:

In [1]:
from itertools import batched
import pickle

import requests
import mwparserfromhell
from tqdm.auto import tqdm
import pandas as pd



Settings:

In [2]:
pd.set_option('display.max_columns', None)

Since we are scraping the data from the web, we need to be prepared for occasional HTTP errors. We use the `Retry` class from urllib3 to define a retry strategy so that the HTTP requests can recover from these errors:

In [3]:
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry_strategy = Retry(
    # max retries
    total=3,

    # 1 second initial backoff, 2x each subsequent backoff
    backoff_factor=1,

    # errors for which we retry
    status_forcelist=[429, 500, 502, 503, 504]
)

adapter = HTTPAdapter(max_retries=retry_strategy)

http = requests.Session()
http.mount("https://", adapter)
http.mount("http://", adapter)
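
To make the retry timing concrete, the following minimal sketch computes an idealized exponential backoff schedule that matches the comment in the cell above (first delay equal to `backoff_factor`, doubling afterwards). The helper `backoff_schedule` is hypothetical and not part of `requests` or `urllib3`; the delays that `Retry` actually applies depend on the installed urllib3 version, so the numbers are an illustration only.

```python
def backoff_schedule(backoff_factor: float, retries: int) -> list[float]:
    """Idealized delays in seconds before each retry: factor, 2*factor, 4*factor, ..."""
    return [backoff_factor * 2 ** attempt for attempt in range(retries)]


delays = backoff_schedule(backoff_factor=1, retries=3)
print(delays)       # [1, 2, 4]
print(sum(delays))  # 7
```

Under this assumption, a request that keeps failing with one of the listed status codes would wait roughly seven seconds in total before the error is raised.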

## Helper functions

In [4]:
def fetch_titles_in_category(category_name: str) -> list[str]:
    """
    Fetches all titles in a category from the SNPedia API
    """

    print(f"Fetching titles in category {category_name}", end="")

    titles = []
    cmcontinue = ""
    while True:
        print(".", end="")
        response = http.get(f'https://bots.snpedia.com/api.php?action=query&list=categorymembers&cmtitle=Category:{category_name}&cmlimit=500&format=json&cmcontinue={cmcontinue}')

        # ensure the API call was successful
        response.raise_for_status()

        # parse the JSON body once and reuse it
        data = response.json()

        # add the page titles to the list
        for member in data['query']['categorymembers']:
            titles.append(member['title'])

        # we use the cmcontinue value in the next API call to get the next page of the results
        if data.get('continue'):
            cmcontinue = data['continue']['cmcontinue']
        else:
            # stop iterating if there are no more pages to fetch
            break

        if cmcontinue == '0|0':
            break

    print("done")
    return titles
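
The continuation loop used by `fetch_titles_in_category` can be tried without network access. The sketch below replays the same pattern against a hand-written, two-page response; `FAKE_RESPONSES`, `fake_get` and `collect_titles` are hypothetical names introduced for illustration, and the response shape is copied from the keys the function above reads.

```python
from typing import Callable

# Hypothetical stand-in for the API: maps a continuation token to a response body.
FAKE_RESPONSES = {
    "": {
        "query": {"categorymembers": [{"title": "Rs1"}, {"title": "Rs2"}]},
        "continue": {"cmcontinue": "page|2"},
    },
    "page|2": {
        "query": {"categorymembers": [{"title": "Rs3"}]},
        # no "continue" key: this is the last page
    },
}


def fake_get(cmcontinue: str) -> dict:
    """Return the canned response for a continuation token."""
    return FAKE_RESPONSES[cmcontinue]


def collect_titles(get_page: Callable[[str], dict]) -> list[str]:
    """Follow cmcontinue tokens until the response carries no 'continue' block."""
    titles = []
    cmcontinue = ""
    while True:
        data = get_page(cmcontinue)
        titles.extend(member["title"] for member in data["query"]["categorymembers"])
        if "continue" not in data:
            break
        cmcontinue = data["continue"]["cmcontinue"]
    return titles


titles = collect_titles(fake_get)
print(titles)  # ['Rs1', 'Rs2', 'Rs3']
```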

In [5]:
def fetch_pages(titles: list[str]) -> list[dict]:
    """
    Fetches the content of a list of pages from the SNPedia API
    """

    # the caller must pass at most 50 titles per call (the maximum allowed per request)
    response = http.get('https://bots.snpedia.com/api.php?action=query&prop=revisions&rvslots=*&rvprop=content&format=json&titles={}'.format('|'.join(titles)))

    # ensure the API call was successful
    response.raise_for_status()

    pages = []
    for page in response.json()['query']['pages'].values():
        # title of the page
        title = page['title']

        # wikitext content of the page
        text = page['revisions'][0]['slots']['main']['*']

        # add the title and text to the list
        pages.append({'title': title, 'text': text})
    
    return pages
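
The nested lookup in `fetch_pages` is easier to follow on a concrete response. The sketch below applies the same extraction to a hand-written dictionary whose shape mirrors the keys accessed above. The helper `extract_pages` and the sample data are hypothetical, and skipping entries without a `revisions` key is an added assumption about how nonexistent pages are reported, not something the original function does.

```python
def extract_pages(data: dict) -> list[dict]:
    """Pull title and wikitext out of a query response, skipping pages without content."""
    pages = []
    for page in data["query"]["pages"].values():
        revisions = page.get("revisions")
        if not revisions:
            # assumption: pages that do not exist come back without "revisions"
            continue
        pages.append({
            "title": page["title"],
            "text": revisions[0]["slots"]["main"]["*"],
        })
    return pages


# Hand-written sample response, not real API output
sample = {
    "query": {
        "pages": {
            "101": {
                "title": "Rs53576",
                "revisions": [{"slots": {"main": {"*": "{{Rsnum|rsid=53576}}"}}}],
            },
            "-1": {"title": "Rs0", "missing": ""},
        }
    }
}

pages = extract_pages(sample)
print(pages)  # [{'title': 'Rs53576', 'text': '{{Rsnum|rsid=53576}}'}]
```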

In [6]:
def extract_template_params(text, template_name: str) -> dict:
    """
    Extracts all the parameters from a template in the wikitext
    """

    templates = mwparserfromhell.parse(text).filter_templates()

    matched_templates = [template for template in templates if template.name.strip().lower() == template_name.lower()]
    if not matched_templates:
        return {}

    template = matched_templates[0]
    return {param.name.strip(): param.value.strip() for param in template.params}

## Dump SNP pages

First, we need to fetch a list of all pages that describe SNPs:

In [7]:
snps = fetch_titles_in_category("Is_a_snp")

Fetching titles in category Is_a_snp................................................................................................................................................................................................................................done


Save the list of SNPs to a file to avoid fetching it again:

In [8]:
with open('dataset/snps.pkl', 'wb') as f:
    pickle.dump(snps, f)

In [9]:
with open('dataset/snps.pkl', 'rb') as f:
    snps = pickle.load(f)

Fetch the content of each SNP's page and store it in a pandas dataframe, one page per row:

In [11]:
pages = []

# split the list of snps into batches of 50
for batch in tqdm(batched(snps, 50)):
    pages.extend(fetch_pages(batch))

# build the dataframe once instead of concatenating inside the loop
df = pd.DataFrame(pages, columns=['title', 'text'])

2235it [17:48,  2.09it/s]
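
The batching step can be illustrated in isolation. The sketch below uses a hypothetical `batched_fallback` helper that mirrors the behaviour of `itertools.batched` so that it also runs on Python versions older than 3.12; the titles are placeholders. It shows that each batch is a tuple, that the last batch may be shorter, and that a batch can be joined with `|` as `fetch_pages` does.

```python
from itertools import islice


def batched_fallback(iterable, n):
    """Yield tuples of up to n items, like itertools.batched in Python 3.12+."""
    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        yield batch


titles = [f"Rs{i}" for i in range(1, 8)]  # placeholder titles
batches = list(batched_fallback(titles, 3))

print(batches)
# [('Rs1', 'Rs2', 'Rs3'), ('Rs4', 'Rs5', 'Rs6'), ('Rs7',)]
print("|".join(batches[0]))  # Rs1|Rs2|Rs3
```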


Save the dataframe in its initial form to a file to avoid fetching it again:

In [12]:
df.to_pickle('dataset/snps_pages.pkl')

In [24]:
df = pd.read_pickle('dataset/snps_pages.pkl')

In [25]:
df.rename(columns={'title': 'ID'}, inplace=True)
df.set_index('ID', inplace=True)
df.index = df.index.str.lower()

Add the plain page text (with all Wikitext markup stripped) as a "Description" column:

In [26]:
df["Description"] = df["text"].apply(lambda x: mwparserfromhell.parse(x).strip_code())


Rename the column to lowercase and save the dataframe to a CSV file:

In [27]:
df.rename(columns={'Description': 'description'}, inplace=True)

In [34]:
df.to_csv('dataset/snps.csv', columns=['description'])

### Parse Rsnum template

In [35]:
def parse_rsnum(item):
    return extract_template_params(item["text"], "Rsnum")

rsnum_df = df.apply(parse_rsnum, axis=1, result_type="expand")
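
The call above relies on `result_type="expand"` to turn the dictionary returned for each row into columns. A minimal sketch of that behaviour on toy data follows; `df_demo` and `fake_params` are hypothetical stand-ins for the real dataframe and the parsed template parameters.

```python
import pandas as pd

df_demo = pd.DataFrame({"text": ["a", "b", "c"]}, index=["rs1", "rs2", "rs3"])

# Hypothetical parser results keyed by the toy text values;
# the empty dict models a page without the template.
fake_params = {
    "a": {"rsid": "1", "Gene": "BRCA2"},
    "b": {},
    "c": {"rsid": "3"},
}

expanded = df_demo.apply(
    lambda row: fake_params[row["text"]],
    axis=1,
    result_type="expand",
)
print(expanded)
```

Each distinct parameter name becomes a column, and rows that lack a parameter get a missing value in that column. This is why the summary of the real data shows many sparsely populated columns, including misspelled parameter names that occur on only a few pages.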

Check the attributes of the `Rsnum` template:

In [36]:
rsnum_df.describe()

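| | rsid | Gene | Chromosome | position | Orientation | GMAF | Gene_s | Assembly | GenomeBuild | dbSNPBuild | geno1 | geno2 | geno3 | StabilizedOrientation | ReferenceAllele | MissenseAllele | summary | Summary | Flip36 | Flip37 | Status | Merged | effect1 | effect2 | effect3 | geno4 | geno5 | Magnitude | Chromsome | TaxonID | orientation | 1 | Condition | geno6 | gene | geno7 | Flip38 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 108872 | 98836 | 106866 | 106858 | 107247 | 27289.0 | 97998 | 102561 | 102560.0 | 102560 | 102870 | 102870 | 102870 | 106394 | 2434 | 2231 | 7 | 401 | 1 | 3 | 1768 | 1404 | 55.0 | 55.0 | 55.0 | 245 | 74 | 22 | 2 | 1 | 1 | 1 | 1 | 4 | 1 | 1 | 1 |
| unique | 108872 | 11849 | 33 | 103868 | 2 | 1688.0 | 12784 | 33 | 25.0 | 21 | 355 | 3389 | 2943 | 2 | 8 | 5 | 7 | 201 | 1 | 2 | 2 | 1323 | 1.0 | 1.0 | 1.0 | 41 | 30 | 1 | 2 | 1 | 1 | 1 | 1 | 4 | 1 | 1 | 1 |
| top | 10 | BRCA2 | 2 | 55174774 | plus | 0.0004591 | BRCA2 | GRCh38 | 38.1 | 141 | (A;A) | (A;G) | (T;T) | plus | G | A | may be associated with obesity-related phenoty... | Cystic Fibrosis related | true | true | Merged | 80357906 | | | | (G;T) | (T;T) | 0 | 1 | 9031 | minus | geno5-(AA;AA) | Hair Color | (I;I) | CLOCK | (CATTCATG;CATTCATG) | true |
| freq | 1 | 2762 | 8783 | 8 | 69383 | 381.0 | 2762 | 62710 | 62721.0 | 52699 | 40406 | 29098 | 43489 | 68409 | 844 | 696 | 1 | 154 | 1 | 2 | 1404 | 4 | 55.0 | 55.0 | 55.0 | 39 | 10 | 22 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |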


Save the dataframe to a CSV file:

In [37]:
rsnum_df.to_csv('dataset/rsnums.csv')

### Parse ClinVar template

In [38]:
def parse_clinvar(item):
    return extract_template_params(item["text"], "ClinVar")

clinvar_df = df.apply(parse_clinvar, axis=1, result_type="expand")

Check the attributes of the `ClinVar` template:

In [39]:
clinvar_df.describe()

| | ALT | CAF | CHROM | CLNACC | CLNALLE | CLNDBN | CLNDSDB | CLNDSDBID | CLNHGVS | CLNORIGIN | CLNREVSTAT | CLNSIG | CLNSRC | CLNSRCID | COMMON | Disease | FwdALT | FwdREF | GENEINFO | GENE_ID | GENE_NAME | REF | RSPOS | Reversed | SAO | SSR | Tags | VC | VP | WGT | dbSNPBuildID | rsid | GMAF | CLNCUI | RS | Risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 69584 | 6507 | 69584 | 66857.0 | 69583 | 67013 | 66180 | 66180 | 69584 | 67227 | 66568 | 69480 | 37301 | 36528.0 | 6507 | 66978 | 69175 | 69427 | 68997 | 68997 | 68997 | 69584 | 69584 | 69584 | 69583 | 69583 | 69584 | 69584 | 69583 | 69583 | 69584 | 69585 | 1705.0 | 10753.0 | 29 | 1 |
| unique | 1771 | 2333 | 25 | 66245.0 | 37 | 14337 | 1782 | 13491 | 69063 | 110 | 287 | 10 | 246 | 36120.0 | 2 | 12614 | 1471 | 2515 | 4731 | 4699 | 4731 | 3097 | 67045 | 2 | 3 | 3 | 5912 | 5 | 5916 | 2 | 65 | 69585 | 575.0 | 2770.0 | 29 | 1 |
| top | A | 0.9998; 0.0001997 | 2 | | 1 | not provided | MedGen | CN221809 | NC_000013.10:g.32907428dupA | 1 | single | 5 | OMIM Allelic Variant | | 1 | not provided | T | G | BRCA2:675 | 675 | BRCA2 | G | 55242466 | 0 | 1 | 0 | PM;NSF;REF;ASP;LSD | SNV | 0x050060001205000002100200 | 1 | 137 | 10010131 | 0.0005 | | 1143016 | G |
| freq | 16674 | 1485 | 6144 | 121.0 | 61219 | 10156 | 13377 | 10149 | 4 | 55397 | 25904 | 45359 | 11394 | 121.0 | 4289 | 10156 | 14777 | 21016 | 2691 | 2691 | 2691 | 19097 | 9 | 40132 | 41803 | 69436 | 3274 | 50377 | 5093 | 68904 | 7649 | 1 | 297.0 | 953.0 | 1 | 1 |


Save the dataframe to a CSV file:

In [40]:
clinvar_df.to_csv('dataset/clinvars.csv')

### Parse PMID templates

In [45]:
def parse_pmid_auto(template):
    """
    Parses a PMID Auto template and returns a dict with the PMID and Title
    """

    if template.has("Title"):
        title = template.get("Title").value.strip()
    else:
        title = ''

    return {
        "PMID": template.get("PMID").value.strip(),
        "Title": title
    }

def parse_pmids(item):
    wikitext = mwparserfromhell.parse(item["text"])
    templates = wikitext.filter_templates()

    # PMID templates
    pmids = [{"PMID": template.params[0].value.strip(), "Title": ''} for template in templates if template.name.strip() == "PMID"]

    # PMID Auto templates
    pmids += [parse_pmid_auto(template) for template in templates if template.name.strip() == "PMID Auto"]

    return pmids

pmids_df_orig = df.apply(parse_pmids, axis=1)

In [46]:
pmids_df = pd.DataFrame(pmids_df_orig.explode())
normalized_pmid_title = pd.json_normalize(pmids_df[0])

In [47]:
normalized_pmid_title.index = pmids_df.index

In [48]:
normalized_pmid_title.to_csv('dataset/pmids.csv')
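
The reshaping done in the three cells above (one list of dictionaries per SNP turned into one row per publication) can be sketched on toy data. The series `per_snp` and its values are made up for illustration; only the pandas calls mirror the notebook.

```python
import pandas as pd

# Toy stand-in for the output of df.apply(parse_pmids, axis=1)
per_snp = pd.Series({
    "rs1": [{"PMID": "111", "Title": "First paper"}, {"PMID": "222", "Title": ""}],
    "rs2": [{"PMID": "333", "Title": "Third paper"}],
})

# one row per list element; the SNP identifier is repeated in the index
exploded = per_snp.explode()

# turn the dictionaries into PMID / Title columns
flat = pd.json_normalize(exploded.tolist())

# restore the SNP identifiers, since the list passed above carries no index
flat.index = exploded.index

print(flat)
```

Restoring the index is what keeps every PMID linked to its SNP in the resulting CSV file.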

### Parse categories

In [49]:
def parse_categories(item):
    wikitext = mwparserfromhell.parse(item["text"])
    templates = wikitext.filter_templates()

    categories = []
    if [template for template in templates if template.name.strip().lower() == "interesting"]:
        categories.append("Interesting")

    on_chip_templates = [template for template in templates if template.name.strip().lower() == "on chip"]
    on_chip_categories = ["On chip " + template.params[0].value.strip() for template in on_chip_templates]
    categories += on_chip_categories

    return categories

categories_df_orig = df.apply(parse_categories, axis=1)

In [50]:
pd.DataFrame(categories_df_orig, columns=["name"]).explode("name").to_csv('dataset/categories.csv')

## Dump Genotype pages

Fetch a list of genotypes:

In [51]:
genotypes = fetch_titles_in_category("Is_a_genotype")

Fetching titles in category Is_a_genotype..................................................................................................................................................................................................................done


Save the list of genotypes to a file to avoid fetching it again:

In [52]:
with open('dataset/genotypes.pkl', 'wb') as f:
    pickle.dump(genotypes, f)

In [53]:
with open('dataset/genotypes.pkl', 'rb') as f:
    genotypes = pickle.load(f)

In [57]:
pages = []

# split the list of genotypes into batches of 50
for batch in tqdm(batched(genotypes, 50)):
    pages.extend(fetch_pages(batch))

# build the dataframe once instead of concatenating inside the loop
df = pd.DataFrame(pages, columns=['title', 'text'])

2097it [17:20,  2.01it/s]


In [58]:
df.to_pickle('dataset/genotypes_pages.pkl')

In [59]:
df = pd.read_pickle('dataset/genotypes_pages.pkl')

In [60]:
df.set_index('title', inplace=True)
df["genotype"] = df.index.str.extract(r'\((.*)\)', expand=False)
df["snp"] = df.index.str.extract(r'(.*)\(.*\)', expand=False)
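
The two regular expressions above split a genotype page title into the SNP identifier and the genotype. A minimal sketch with the standard `re` module shows the effect on a single title; `split_genotype_title` is a hypothetical helper, the sample title is illustrative, and the lowercasing mirrors the next cell.

```python
import re


def split_genotype_title(title: str) -> tuple[str, str]:
    """Split a title such as 'Rs53576(A;G)' into ('rs53576', 'A;G')."""
    genotype = re.search(r'\((.*)\)', title).group(1)
    snp = re.search(r'(.*)\(.*\)', title).group(1)
    return snp.lower(), genotype


print(split_genotype_title("Rs53576(A;G)"))  # ('rs53576', 'A;G')
```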

In [61]:
df["snp"] = df["snp"].str.lower()
df["description"] = df["text"].apply(lambda x: mwparserfromhell.parse(x).strip_code())

Extract the genotype information from SNPedia's `Genotype` template:

In [62]:
def extract_genotype_template_params(item):
    return extract_template_params(item["text"], "Genotype")

template_df = df.apply(extract_genotype_template_params, axis=1, result_type="expand")

In [None]:
df = pd.concat([df, template_df[["allele1", "allele2", "magnitude", "repute", "summary"]]], axis=1)

Write the genotype information to the resulting CSV file:

In [None]:
df.to_csv('dataset/genotypes.csv', index=False, columns=['snp', 'allele1', 'allele2', 'magnitude', 'repute', 'summary', 'description'])