# SNPedia Dataset

This notebook describes how to retrieve the [SNPedia](https://www.snpedia.com/) data and store it in CSV files for further processing. The data is distributed under a [Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License](http://creativecommons.org/licenses/by-nc-sa/3.0/us/). SNPedia explicitly allows scraping of the data and provides a [Bulk API](https://www.snpedia.com/index.php/Bulk) for that purpose.

First, install the necessary packages:

```bash
pip install -r requirements.txt
```



Import the packages:

In [1]:
from itertools import batched
import pickle
import time


import requests
import mwparserfromhell
from tqdm.auto import tqdm
import pandas as pd


In [2]:
from requests.adapters import HTTPAdapter
# import Retry from urllib3 directly rather than through the legacy requests.packages alias
from urllib3.util.retry import Retry

retry_strategy = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
http = requests.Session()
http.mount("https://", adapter)
http.mount("http://", adapter)

In [3]:
def fetch_titles_in_category(category_name: str) -> list[str]:
    """
    Fetches all titles in a category from the SNPedia API
    """

    print(f"Fetching titles in category {category_name}", end="")

    titles = []
    cmcontinue = ""
    while True:
        print(".", end="")
        # pass the query as params so that requests URL-encodes the values
        response = http.get(
            'https://bots.snpedia.com/api.php',
            params={
                'action': 'query',
                'list': 'categorymembers',
                'cmtitle': f'Category:{category_name}',
                'cmlimit': 500,
                'format': 'json',
                'cmcontinue': cmcontinue,
            },
        )

        # ensure the API call was successful
        response.raise_for_status()

        # parse the response body once
        data = response.json()

        # add the titles to the list
        for member in data['query']['categorymembers']:
            titles.append(member['title'])

        # we use the cmcontinue value in the next API call to get the next page of the results
        if data.get('continue'):
            cmcontinue = data['continue']['cmcontinue']
        else:
            # stop iterating if there are no more pages to fetch
            break

        if cmcontinue == '0|0':
            break

    print("done")
    return titles
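
The continuation loop above can be separated from the HTTP layer, which makes it possible to check the paging logic against canned responses. The following is a minimal sketch rather than the notebook's own code: `collect_category_members` and `fake_fetch` are hypothetical names, and the canned payloads only mimic the `query` / `continue` shape that `fetch_titles_in_category` reads.

```python
from typing import Callable, Optional


def collect_category_members(fetch_page: Callable[[Optional[str]], dict]) -> list[str]:
    """Follow cmcontinue tokens until the response no longer carries one."""
    titles: list[str] = []
    token: Optional[str] = None
    while True:
        data = fetch_page(token)
        titles.extend(member["title"] for member in data["query"]["categorymembers"])
        # a missing "continue" block means this was the last page
        token = data.get("continue", {}).get("cmcontinue")
        if not token:
            break
    return titles


# Canned two-page response (hypothetical data, for illustration only)
_PAGES = {
    None: {
        "query": {"categorymembers": [{"title": "Rs1"}, {"title": "Rs2"}]},
        "continue": {"cmcontinue": "page|2"},
    },
    "page|2": {"query": {"categorymembers": [{"title": "Rs3"}]}},
}


def fake_fetch(token: Optional[str]) -> dict:
    """Stand-in for the HTTP call; returns the canned page for a token."""
    return _PAGES[token]


titles = collect_category_members(fake_fetch)
print(titles)
```

In the notebook, `fetch_page` would presumably wrap the `http.get` call and pass the token as the `cmcontinue` parameter.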

In [4]:
def fetch_pages(titles: list[str]) -> list[dict]:
    # request up to 50 pages per call (the maximum allowed)
    response = http.get(
        'https://bots.snpedia.com/api.php',
        params={
            'action': 'query',
            'prop': 'revisions',
            'rvslots': '*',
            'rvprop': 'content',
            'format': 'json',
            'titles': '|'.join(titles),
        },
    )

    # ensure the API call was successful
    response.raise_for_status()

    pages = []
    for page in response.json()['query']['pages'].values():
        # title of the page
        title = page['title']

        # text is the wikitext content of the page
        text = page['revisions'][0]['slots']['main']['*']

        # add the title and text to the list
        pages.append({'title': title, 'text': text})

    return pages
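
`fetch_pages` assumes that every returned page carries a `revisions` entry. A title that no longer exists comes back without one, which would raise a `KeyError`. The sketch below shows one way to skip such entries; `extract_pages` is a hypothetical helper and the `sample` payload is made up to mirror the response shape used above.

```python
def extract_pages(payload: dict) -> list[dict]:
    """Collect title/text pairs, skipping pages that have no revisions."""
    pages = []
    for page in payload["query"]["pages"].values():
        revisions = page.get("revisions")
        if not revisions:
            # e.g. a title that does not exist on the wiki
            continue
        pages.append({
            "title": page["title"],
            "text": revisions[0]["slots"]["main"]["*"],
        })
    return pages


# Made-up payload: one regular page and one missing page
sample = {
    "query": {
        "pages": {
            "101": {
                "title": "Rs1",
                "revisions": [{"slots": {"main": {"*": "{{Rsnum|rsid=1}}"}}}],
            },
            "-1": {"title": "Nope", "missing": ""},
        }
    }
}

result = extract_pages(sample)
print(result)
```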

In [174]:
def find_template(templates, name:str):
    matched_templates = [template for template in templates if template.name.strip().lower() == name.lower()]
    if matched_templates:
        return matched_templates[0]
    return None

Next, fetch the list of all pages that describe SNPs:

In [None]:
snps = fetch_titles_in_category("Is_a_snp")

In [4]:
len(snps)

111725

Save the list of SNPs to a file to avoid fetching it again:

In [27]:
# use a context manager so the file is flushed and closed
with open('dataset/snps.pkl', 'wb') as f:
    pickle.dump(snps, f)

In [3]:
with open('dataset/snps.pkl', 'rb') as f:
    snps = pickle.load(f)
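
The save-then-load cells act as a manual cache. A small helper can fold both steps into one call, as sketched below. `load_or_fetch` and `fake_fetch` are hypothetical names, and the example writes to a temporary directory rather than to `dataset/`.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_fetch(path: Path, fetch):
    """Return the pickled object at `path`, calling `fetch` only on a cache miss."""
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(data, f)
    return data


calls = []


def fake_fetch():
    """Stand-in for a slow API call; records how often it runs."""
    calls.append(1)
    return ["Rs1", "Rs2"]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "dataset" / "snps.pkl"
    first = load_or_fetch(cache, fake_fetch)   # cache miss: calls fake_fetch
    second = load_or_fetch(cache, fake_fetch)  # cache hit: reads the pickle

print(first == second, len(calls))
```

With such a helper, a call like `load_or_fetch(Path('dataset/snps.pkl'), lambda: fetch_titles_in_category("Is_a_snp"))` could replace the separate fetch, dump and load cells.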

Create a Pandas dataframe to store the pages data in a row format:

In [27]:
df = pd.DataFrame(columns=['title', 'text'])

Fetch the content of each SNP's page and store it in the dataframe:

In [28]:
# split the list of snps into batches of 50
frames = []
for batch in tqdm(batched(snps, 50)):
    pages = fetch_pages(batch)

    # collect each batch; concatenating once at the end avoids repeated copying
    frames.append(pd.DataFrame(pages, columns=['title', 'text']))

df = pd.concat(frames, ignore_index=True)

2235it [29:57,  1.24it/s]
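
`itertools.batched` is only available from Python 3.12 onward. On older interpreters, a small generator built on `itertools.islice` gives the same batches. The sketch below uses the hypothetical name `batched_fallback`, and also checks that 111725 titles in batches of 50 give the 2235 iterations reported by the progress bar.

```python
import math
from itertools import islice


def batched_fallback(iterable, n):
    """Yield successive tuples of at most n items from iterable."""
    if n < 1:
        raise ValueError("n must be at least one")
    it = iter(iterable)
    # islice returns an empty tuple once the iterator is exhausted
    while batch := tuple(islice(it, n)):
        yield batch


chunks = list(batched_fallback(["Rs1", "Rs2", "Rs3", "Rs4", "Rs5"], 2))
print(chunks)

# number of API calls needed for the SNP list
n_batches = math.ceil(111725 / 50)
print(n_batches)
```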


In [None]:
df.rename(columns={'title': 'snp'}, inplace=True)
df.set_index('snp', inplace=True)

Save the dataframe to a file to avoid fetching it again:


In [34]:
df.to_pickle('dataset/snpedia.pkl')

In [5]:
df = pd.read_pickle('dataset/snpedia.pkl')

In [7]:
df

snp,text
I1000001,{{23andMe SNP\n|Magnitude=\n}}\n[[haplogroups]...
I1000003,{{23andMe SNP\n|Magnitude=\n}}\n\n{{on chip | ...
I1000004,{{23andMe SNP\n|Chromosome=MT\n|position=8869\...
I1000015,{{23andMe SNP\n|Chromosome=MT\n|position=6776\...
I3000001,{{23andMe SNP\n|iid=3000001\n|rsid=113993960\n...
...,...
Rs999905,{{Rsnum\n|rsid=999905\n|Gene=NTRK3\n|Chromosom...
Rs9999118,{{Rsnum\n|rsid=9999118\n|Chromosome=4\n|Orient...
Rs999943,{{Rsnum\n|rsid=999943\n|Gene=ITPR3\n|Chromosom...
Rs999986,{{Rsnum\n|rsid=999986\n|Chromosome=14\n|positi...


Fetch a list of genotypes:

In [8]:
genotypes = fetch_titles_in_category("Is_a_genotype")

Fetching titles in category Is_a_genotype..................................................................................................................................................................................................................done


In [9]:
len(genotypes)

104806

Save the list of genotypes to a file to avoid fetching it again:

In [10]:
# use a context manager so the file is flushed and closed
with open('dataset/genotypes.pkl', 'wb') as f:
    pickle.dump(genotypes, f)

In [11]:
with open('dataset/genotypes.pkl', 'rb') as f:
    genotypes = pickle.load(f)

In [12]:
df = pd.DataFrame(columns=['title', 'text'])

In [13]:
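# split the list of genotypes into batches of 50
frames = []
for batch in tqdm(batched(genotypes, 50)):
    # pause between requests to go easy on the server
    time.sleep(1)
    pages = fetch_pages(batch)

    # collect each batch; concatenating once at the end avoids repeated copying
    frames.append(pd.DataFrame(pages, columns=['title', 'text']))

df = pd.concat(frames, ignore_index=True)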

2097it [54:53,  1.57s/it]


In [None]:
df.set_index('title', inplace=True)
df["genotype"] = df.index.str.extract(r'\((.*)\)', expand=False)
df["snp"] = df.index.str.extract(r'(.*)\(.*\)', expand=False)

In [82]:
df["snp"] = df["snp"].str.lower()
df["description"] = df["text"].apply(lambda x: mwparserfromhell.parse(x).strip_code())
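
The two `str.extract` patterns and the lowercasing step split a genotype page title into its SNP identifier and allele pair. The same split can be tried on a single title with the standard `re` module, as in this minimal sketch (`split_genotype_title` is a hypothetical helper, not part of the notebook):

```python
import re

# everything before the last "(...)" is the SNP, the parenthesised part is the genotype
TITLE_RE = re.compile(r"^(?P<snp>.*)\((?P<genotype>.*)\)$")


def split_genotype_title(title: str):
    """Return (snp, genotype) for titles like 'Rs999737(C;C)', or None if no match."""
    match = TITLE_RE.match(title)
    if match is None:
        return None
    return match.group("snp").lower(), match.group("genotype")


print(split_genotype_title("Rs999737(C;C)"))   # ('rs999737', 'C;C')
print(split_genotype_title("I15006212(C;T)"))  # ('i15006212', 'C;T')
print(split_genotype_title("Rs999737"))        # None
```

Titles without a parenthesised genotype return `None` here, while `str.extract` yields `NaN` for the corresponding rows.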

Extract the genotype information from SNPedia's `Genotype` template:

In [229]:
# parameters of the Genotype template that we keep
GENOTYPE_PARAMS = ["iid", "allele1", "allele2", "magnitude", "repute", "summary"]

def extract_genotype_template_params(item):
    template = find_template(mwparserfromhell.parse(item["text"]).filter_templates(), "Genotype")
    if template is None:
        # No genotype template found
        return {}

    # extract the parameters from the template; absent parameters become None
    return {
        name: template.get(name).value.strip() if template.has(name) else None
        for name in GENOTYPE_PARAMS
    }

df = pd.concat([df, df.apply(extract_genotype_template_params, axis=1, result_type="expand")], axis=1)

In [231]:
df.to_pickle('dataset/genotypes_texts.pkl')

In [232]:
df = pd.read_pickle('dataset/genotypes_texts.pkl')

In [233]:
df

title,text,genotype,snp,description,iid,allele1,allele2,magnitude,repute,summary
I15006212(C;C),{{Genotype\n|iid=15006212\n|allele1=C\n|allele...,C;C,i15006212,,15006212,C,C,0,Good,normal
I15006212(C;T),{{Genotype\n|iid=15006212\n|allele1=C\n|allele...,C;T,i15006212,,15006212,C,T,4,Bad,Rhizomelic Chondrodysplasia Punctata Type 1 ca...
I15006212(T;T),{{Genotype\n|iid=15006212\n|allele1=T\n|allele...,T;T,i15006212,Rhizomelic chondrodysplasia punctata type 1 (R...,15006212,T,T,7,Bad,Rhizomelic Chondrodysplasia Punctata Type 1 (R...
I3000043(G;G),{{Genotype\n|iid=3000043\n|allele1=G\n|allele2...,G;G,i3000043,,3000043,G,G,,,
I3001801(C;C),{{Genotype\n|allele1=C\n|allele2=C\n|iid=30018...,C;C,i3001801,,3001801,C,C,0,Good,
...,...,...,...,...,...,...,...,...,...,...
Rs9986786(C;C),{{Genotype\n|rsid=9986786\n|allele1=C\n|allele...,C;C,rs9986786,,,C,C,0,Good,
Rs9987289(G;G),{{Genotype\n|rsid=9987289\n|allele1=G\n|allele...,G;G,rs9987289,,,G,G,0,Good,common on affy axiom data
Rs999380946(C;C),{{Genotype\n|rsid=999380946\n|allele1=C\n|alle...,C;C,rs999380946,,,C,C,0,Good,common in clinvar
Rs999737(C;C),{{Genotype\n|rsid=999737\n|allele1=C\n|allele2...,C;C,rs999737,,,C,C,0,Good,common on affy axiom data


In [None]:
df.to_csv('dataset/genotypes.csv', columns=['snp', 'genotype', 'allele1', 'allele2', 'magnitude', 'repute', 'summary', 'description'])