# Scraping Historical Snapshots from CoinMarketCap

In this step, we collect historical snapshots of the cryptocurrency market from CoinMarketCap, capturing the top-ranked coins by market capitalization on specific dates.

The purpose is to create a time-aware list of major coins that had significant market presence at different points in time. To focus on liquid assets and reduce noise, we **filter out all coins with a market capitalization below $100 million** at each snapshot.

This ensures that our selection reflects relevant, tradeable assets and avoids obscure or illiquid tokens.
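
As a minimal sketch of this filter, the snippet below parses market-cap strings and keeps only entries at or above the $100 million threshold. The helper names `parse_mkt_cap` and `filter_snapshot` and the sample rows are hypothetical, made up for illustration; they are not real CoinMarketCap data.

```python
MIN_MKT_CAP = 1e8  # $100 million threshold described above


def parse_mkt_cap(text):
    """Convert a string such as '$1,234,567.89' into a float."""
    return float(text.replace('$', '').replace(',', '').replace(' ', ''))


def filter_snapshot(rows, min_cap=MIN_MKT_CAP):
    """Keep (symbol, market cap) pairs at or above the threshold."""
    kept = []
    for symbol, cap_text in rows:
        cap = parse_mkt_cap(cap_text)
        if cap >= min_cap:
            kept.append((symbol, cap))
    return kept


# Made-up rows for illustration only
sample_rows = [
    ('AAA', '$2,500,000,000'),
    ('BBB', '$150,000,000.50'),
    ('CCC', '$99,999,999'),
]
filtered = filter_snapshot(sample_rows)
print(filtered)
```

Here `CCC` is dropped because its market cap falls just below the threshold.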

⚠️ **Note:** Web scraping scripts are inherently sensitive to changes in website structure.  
CoinMarketCap may update their HTML layout or protection mechanisms over time, which could break the scraper.  
If the script fails, inspect the webpage manually and adjust the parsing logic accordingly.

The output of this step is one CSV file per snapshot date, each listing the coin symbols (e.g., BTC, ETH, SOL) and market capitalizations that passed the filter. Together, these files form a time series of major coins.
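
Once the scraper has run, the saved snapshots can be read back into a date-to-symbols mapping. The sketch below assumes files named `YYYYMMDD.csv` with a `coin` column, as written by the scraping cell; the helper name `load_snapshots` is hypothetical, and the demo writes made-up files to a temporary folder rather than reading real data.

```python
import csv
import tempfile
from pathlib import Path


def load_snapshots(folder):
    """Return {date_str: [symbols]} from files named YYYYMMDD.csv, in date order."""
    snapshots = {}
    for path in sorted(Path(folder).glob('*.csv')):
        with open(path, newline='') as f:
            snapshots[path.stem] = [row['coin'] for row in csv.DictReader(f)]
    return snapshots


# Demo with made-up snapshot files in a temporary folder
demo_data = {
    '20200105': [('AAA', 2.5e9), ('BBB', 1.5e8)],
    '20200202': [('AAA', 2.7e9)],
}
with tempfile.TemporaryDirectory() as tmp:
    for date_str, rows in demo_data.items():
        with open(Path(tmp) / f'{date_str}.csv', 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['coin', 'mkt_cap'])
            writer.writerows(rows)
    snapshots = load_snapshots(tmp)

print(snapshots)
```

Because the file names sort lexicographically in date order, iterating over the resulting dictionary walks through the snapshots chronologically.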


In [1]:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import pandas as pd
from tqdm import tqdm
import time
import os

In [2]:
# Get the first available snapshot date of each month (not necessarily the 1st)
url = 'https://coinmarketcap.com/es/historical/'

driver = webdriver.Chrome()
driver.get(url)

time.sleep(5)

month_names = {
    'Enero': 1,
    'Febrero': 2,
    'Marzo': 3,
    'Abril': 4,
    'May': 5,
    'Junio': 6,
    'Julio': 7,
    'Agosto': 8,
    'Septiembre': 9,
    'Octubre': 10,
    'Noviembre': 11,
    'Diciembre': 12
}

date_strs = []

year_tables = driver.find_elements(By.XPATH, '//h1[contains(text(), "Instantánea de los datos históricos")]/../div/div')
for year_table in tqdm(year_tables):
    year = year_table.find_elements(By.XPATH, './div')[0].text
    months = year_table.find_elements(By.XPATH, './div/div')
    for month in months:
        month_text = month.find_elements(By.XPATH, './div/div')[0].text
        day_text = month.find_elements(By.XPATH, './div/a')[0].text

        date_str = f'{year}{str(month_names[month_text]).zfill(2)}{str(day_text).zfill(2)}'
        date_strs.append(date_str)

driver.quit()  # end the browser session and the underlying driver process
date_strs

100%|██████████████████████████████████████████████████████████████████████████████████| 13/13 [00:11<00:00,  1.18it/s]


['20130428',
 '20130505',
 '20130602',
 '20130707',
 '20130804',
 '20130901',
 '20131006',
 '20131103',
 '20131201',
 '20140105',
 '20140202',
 '20140302',
 '20140406',
 '20140504',
 '20140601',
 '20140706',
 '20140803',
 '20140907',
 '20141005',
 '20141102',
 '20141207',
 '20150104',
 '20150201',
 '20150301',
 '20150405',
 '20150503',
 '20150607',
 '20150705',
 '20150802',
 '20150906',
 '20151004',
 '20151101',
 '20151206',
 '20160103',
 '20160207',
 '20160306',
 '20160403',
 '20160501',
 '20160605',
 '20160703',
 '20160807',
 '20160904',
 '20161002',
 '20161106',
 '20161204',
 '20170101',
 '20170205',
 '20170305',
 '20170402',
 '20170507',
 '20170604',
 '20170702',
 '20170806',
 '20170903',
 '20171001',
 '20171105',
 '20171203',
 '20180107',
 '20180204',
 '20180304',
 '20180401',
 '20180506',
 '20180603',
 '20180701',
 '20180805',
 '20180902',
 '20181007',
 '20181104',
 '20181202',
 '20190106',
 '20190203',
 '20190303',
 '20190407',
 '20190505',
 '20190602',
 '20190707',
 '20190804',

In [3]:
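# Create the output folder (and any missing parent folders) if needed
os.makedirs('../data/market_cap_snapshots', exist_ok=True)

# Dates already saved; strip the '.csv' extension so the names match date_strs
saved_date_strs = {os.path.splitext(name)[0] for name in os.listdir('../data/market_cap_snapshots')}
for date_str in tqdm(date_strs):
    if date_str in saved_date_strs:
        continue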
    url = f'https://coinmarketcap.com/es/historical/{date_str}/'
    
    driver = webdriver.Chrome()
    driver.get(url)
    
    time.sleep(5)
    
    driver.execute_script("window.scrollBy(0, 300);")
    driver.execute_script("window.scrollBy(0, 300);")

    results = {
        'coin': [],
        'mkt_cap': []
    }
    rows = driver.find_elements(By.XPATH, '//tr[contains(@class, "cmc-table-row")]')
    for i in range(len(rows)):
        while True:
            try:
                row = rows[i]
                cols = row.find_elements(By.TAG_NAME, 'td')
                coin = cols[2].find_element(By.TAG_NAME, 'div').text
                mkt_cap_str = cols[3].find_element(By.TAG_NAME, 'div').text
                mkt_cap = float(mkt_cap_str.replace('$', '').replace(',', '').replace(' ', ''))
                break
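            except Exception:
                # The row is probably not rendered yet: scroll to trigger lazy loading,
                # re-query the rows, and retry (note: this loop has no attempt limit)
                driver.execute_script("window.scrollBy(0, 300);")
                driver.execute_script("window.scrollBy(0, 300);")
                rows = driver.find_elements(By.XPATH, '//tr[contains(@class, "cmc-table-row")]')
                time.sleep(1)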
                
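        # Stop at the first coin below $100 million; this assumes the rows
        # are sorted by descending market cap
        if mkt_cap < 1e8:
            break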
    
        results['coin'].append(coin)
        results['mkt_cap'].append(mkt_cap)

    driver.quit()  # end the session so driver processes do not pile up across dates
    pd.DataFrame(results).to_csv(f'../data/market_cap_snapshots/{date_str}.csv', index=False)

100%|████████████████████████████████████████████████████████████████████████████████| 145/145 [49:46<00:00, 20.59s/it]
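
The `while True` retry in the cell above never gives up, so a change in the page layout could leave it scrolling forever. One possible safeguard is a bounded retry, sketched below in plain Python. The helper `retry` and the stand-in functions `flaky_read` and `scroll_down` are hypothetical; in the scraper they would correspond to reading a table row and scrolling the page.

```python
import time


def retry(action, recover, max_attempts=5, delay=0.0):
    """Call action() until it succeeds, running recover() after each failure."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return action()
        except Exception as exc:
            last_error = exc
            recover()
            time.sleep(delay)
    raise RuntimeError(f'gave up after {max_attempts} attempts') from last_error


# Stand-ins for "read a row" and "scroll down"; the first two reads fail
calls = {'action': 0, 'recover': 0}


def flaky_read():
    calls['action'] += 1
    if calls['action'] < 3:
        raise IndexError('row not rendered yet')
    return ('AAA', 2.5e9)  # made-up values


def scroll_down():
    calls['recover'] += 1


result = retry(flaky_read, scroll_down, max_attempts=5)
print(result, calls)
```

If every attempt fails, `retry` raises a `RuntimeError` chained to the last underlying exception, which makes a broken selector visible instead of hanging the run.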


### 🔍 Note: Manual Alternative

This step uses web scraping to automatically collect historical CoinMarketCap snapshots. However, the same task can be performed manually if desired or if the scraping process breaks due to changes in the website's structure.

CoinMarketCap provides a historical archive of market snapshots at the following URL:

📅 https://coinmarketcap.com/es/historical/