# SNPedia Dataset

[SNPedia](https://www.snpedia.com/) provides a [Bulk API](https://www.snpedia.com/index.php/Bulk) for fetching its content, which is distributed under a [Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License](http://creativecommons.org/licenses/by-nc-sa/3.0/us/).

First, install the necessary packages:

In [11]:
%pip install -r requirements.txt

Collecting pandas (from -r requirements.txt (line 4))
  Downloading pandas-2.1.4-cp312-cp312-macosx_11_0_arm64.whl.metadata (18 kB)
Collecting numpy<2,>=1.26.0 (from pandas->-r requirements.txt (line 4))
  Downloading numpy-1.26.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (61 kB)
Collecting pytz>=2020.1 (from pandas->-r requirements.txt (line 4))
  Downloading pytz-2023.3.post1-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.1 (from pandas->-r requirements.txt (line 4))
  Downloading tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Downloading pandas-2.1.4-cp312-cp312-macosx_11_0_arm64.whl (10.6 MB)

Import the packages:

In [12]:
from itertools import batched
import pickle


import requests
import mwparserfromhell
from tqdm.auto import tqdm
import pandas as pd
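
Note that `itertools.batched` only exists in Python 3.12 and later. On an older interpreter, a small stand-in along the following lines should be enough for this notebook; the name `batched_fallback` is hypothetical and not part of any library.

```python
from itertools import islice


def batched_fallback(iterable, n):
    """Yield successive tuples of at most n items.

    Hypothetical stand-in for itertools.batched on Python < 3.12.
    """
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    # islice pulls up to n items per pass; an empty tuple ends the loop
    while batch := tuple(islice(iterator, n)):
        yield batch
```

With this in place, `batched_fallback(snps, 50)` could be used wherever `batched(snps, 50)` appears.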

Next, fetch the list of all pages that describe SNPs. The API returns at most 500 titles per request, so we page through the results:

In [25]:
snps = []
cmcontinue = ""
while True:
    print("fetching {}".format(cmcontinue))
    response = requests.get('https://bots.snpedia.com/api.php?action=query&list=categorymembers&cmtitle=Category:Is_a_snp&cmlimit=500&format=json&cmcontinue={}'.format(cmcontinue))

    # ensure the API call was successful
    response.raise_for_status()

    # parse the JSON body once and reuse it below
    data = response.json()

    # add the snps to the list
    for snp in data['query']['categorymembers']:
        snps.append(snp['title'])

    # we use the cmcontinue value in the next API call to get the next page of the results
    if data.get('continue'):
        cmcontinue = data['continue']['cmcontinue']
    else:
        # stop iterating if there are no more pages to fetch
        break

    if cmcontinue == '0|0':
        break

fetching 
fetching page|4935303030383334|266945
fetching page|4935303039323437|229614
fetching page|4935303130353537|229765
fetching page|4936303036373739|181968
fetching page|4936303134353832|231269
fetching page|525331303034343637|11405
fetching page|52533130323037333932|70296
fetching page|5253313034343139|156414
fetching page|5253313034383836323838|63450
fetching page|5253313034383933393632|26337
fetching page|5253313034383934353036|26835
fetching page|5253313034383935333733|124415
fetching page|525331303531393230|45370
fetching page|525331303537353136333732|239033
fetching page|525331303537353136383736|237135
fetching page|525331303537353137333830|235295
fetching page|525331303537353138303136|238232
fetching page|525331303537353138383436|238694
fetching page|525331303537353139353633|236723
fetching page|525331303537353230323931|237764
fetching page|525331303537353234313634|232593
fetching page|525331303630343939393631|240282
fetching page|525331303630353032343832|237996
fetching p

In [4]:
len(snps)

111725
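
The pagination loop can also be written as a generator that merges the API's `continue` object into the next request, which is the general MediaWiki continuation pattern. The following is a minimal sketch, not the notebook's own method: `iter_category_members` and its `fetch_page` argument are hypothetical names, and `fetch_page` is assumed to take a dict of query parameters and return the decoded JSON response.

```python
def iter_category_members(fetch_page, category="Category:Is_a_snp", limit=500):
    """Yield page titles from a MediaWiki category, following `continue` tokens.

    `fetch_page` is a hypothetical callable: it receives a dict of query
    parameters and returns the decoded JSON response as a dict.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "format": "json",
    }
    while True:
        data = fetch_page(params)
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            # no continuation object means this was the last page
            break
        # carry every continuation token over to the next request
        params = {**params, **data["continue"]}
```

With `requests`, `fetch_page` could be something like `lambda p: requests.get('https://bots.snpedia.com/api.php', params=p).json()`, and `snps = list(iter_category_members(fetch_page))` would then collect the titles. Passing the fetcher in keeps the paging logic easy to test without network access.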

Save the list of SNPs to a file to avoid fetching it again:

In [27]:
# use a context manager so the file is flushed and closed
with open('dataset/snps.pkl', 'wb') as f:
    pickle.dump(snps, f)

In [3]:
# reload the cached list in a later session
with open('dataset/snps.pkl', 'rb') as f:
    snps = pickle.load(f)
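
The save and load cells can be folded into one helper that fetches only when no cached file exists. This is a sketch under the assumption that the cache lives at a path such as `dataset/snps.pkl`; `load_or_compute` is a hypothetical name.

```python
import pickle
from pathlib import Path


def load_or_compute(path, compute):
    """Return the pickled object at `path`, or call `compute()` and cache its result.

    Hypothetical helper; `compute` is any zero-argument callable.
    """
    path = Path(path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    # create the target directory if it does not exist yet
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.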

Create an empty pandas DataFrame that will hold one row per page, with the page title and its raw text:

In [27]:
df = pd.DataFrame(columns=['snp', 'text'])

Fetch the content of each SNP's page and store it in the dataframe:

In [28]:
# split the list of snps into batches of 50
for batch in tqdm(batched(snps, 50)):
    # request 50 pages at a time (the maximum allowed)
    response = requests.get('https://bots.snpedia.com/api.php?action=query&prop=revisions&rvslots=*&rvprop=content&format=json&titles={}'.format('|'.join(batch)))

    # ensure the API call was successful
    assert response.status_code == 200

    pages = []
    # the page ids (dict keys) are not needed, so iterate over the values only
    for page in response.json()['query']['pages'].values():
        # snp is the title of the page
        snp = page['title']

        # text is the content of the page
        text = page['revisions'][0]['slots']['main']['*']

        # add the snp and text to the list
        pages.append({'snp': snp, 'text': text})

    # add new data to the dataframe
    new_df = pd.DataFrame(pages, columns=['snp', 'text'])
    df = pd.concat([df, new_df])

2235it [29:57,  1.24it/s]
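
The loop above indexes `page['revisions']` directly, which raises a `KeyError` if the API returns an entry without revisions; MediaWiki does this for titles that do not exist. A more defensive extraction step might look like the sketch below, where `extract_pages` is a hypothetical helper operating on the decoded JSON of one batch response.

```python
def extract_pages(payload):
    """Split one batch response into content rows and skipped titles.

    Hypothetical helper: returns (rows, skipped), where rows is a list of
    {'snp': title, 'text': wikitext} dicts and skipped lists the titles
    that came back without any revision content.
    """
    rows, skipped = [], []
    for page in payload["query"]["pages"].values():
        revisions = page.get("revisions")
        if not revisions:
            # missing or invalid titles carry no 'revisions' key
            skipped.append(page.get("title"))
            continue
        text = revisions[0]["slots"]["main"]["*"]
        rows.append({"snp": page["title"], "text": text})
    return rows, skipped
```

Collecting the rows from all batches in a single list and calling `pd.DataFrame(all_rows, columns=['snp', 'text'])` once at the end would also avoid re-copying the growing DataFrame on every `pd.concat`.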


In [33]:
df

snp,text
I1000001,{{23andMe SNP\n|Magnitude=\n}}\n[[haplogroups]...
I1000003,{{23andMe SNP\n|Magnitude=\n}}\n\n{{on chip | ...
I1000004,{{23andMe SNP\n|Chromosome=MT\n|position=8869\...
I1000015,{{23andMe SNP\n|Chromosome=MT\n|position=6776\...
I3000001,{{23andMe SNP\n|iid=3000001\n|rsid=113993960\n...
...,...
Rs999905,{{Rsnum\n|rsid=999905\n|Gene=NTRK3\n|Chromosom...
Rs9999118,{{Rsnum\n|rsid=9999118\n|Chromosome=4\n|Orient...
Rs999943,{{Rsnum\n|rsid=999943\n|Gene=ITPR3\n|Chromosom...
Rs999986,{{Rsnum\n|rsid=999986\n|Chromosome=14\n|positi...


In [34]:
df.to_pickle('dataset/snpedia.pkl')