# Introduction

This notebook explains how to run the spider program to fetch block logs from the Wikipedia API.

In [1]:
import requests
import pandas as pd

In [2]:
# append system path to import spider program
import sys
sys.path.append('../')
from utils.spiders import retrieve_block_logs_month # fetch data for a specific month in a given year
from utils.spiders import retrieve_block_logs_year # fetch data for a given year

# Code examples

## Retrieve data for 1 month

To retrieve block logs for a specific month, use the retrieve_block_logs_month function. It retrieves the logs and writes them to a local CSV file (format: year-month.csv).

It requires 4 parameters:
- year (int): 
    - The year for which block logs should be retrieved.
- month (int):
    - The month for which block logs should be retrieved.
- folder_path (str):
    - The folder path where the CSV file will be saved.
- compress (bool): 
    - Whether to compress the CSV file into a gz file.

**Note:**
- If the function fails, check whether the output file already exists on your local disk. The function stops automatically to avoid overwriting an existing local file.
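One way to act on this note is to look for leftover files before starting a retrieval. The sketch below is a minimal illustration, not part of the spider program: `existing_outputs` is a hypothetical helper, and it assumes the output name is the year and month joined by a hyphen (month either unpadded or zero-padded) followed by any extension, since the exact naming and the compressed extension are not specified here.

```python
from pathlib import Path


def existing_outputs(folder_path, year, month):
    """Return files in folder_path that look like output for the given month."""
    # Assumption: files are named "<year>-<month>" plus an extension, with the
    # month either unpadded ("2006-2") or zero-padded ("2006-02").
    stems = {f"{year}-{month}", f"{year}-{month:02d}"}
    folder = Path(folder_path)
    if not folder.is_dir():
        return []
    # Compare the part of the name before the first dot, so "2006-1" does not
    # accidentally match "2006-10.csv" or "2006-11.csv".
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.name.split(".", 1)[0] in stems
    )


leftovers = existing_outputs("../data/block_logs/", 2006, 2)
print(leftovers or "no existing output found")
```

If the list is not empty, move or delete those files before calling retrieve_block_logs_month again.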

**About runtime for data retrieval:**
- Before August 2021: retrieving the data for one month takes roughly 5-10 minutes at most.
- August 2021 onwards: retrieving the data for one month can take up to 30-50 minutes, even with a good internet connection.

In [3]:
# Set folder path to store fetched data
folder_path = '../data/block_logs/'

In [10]:
retrieve_block_logs_month(year=2006, month=2, folder_path=folder_path, compress=False)

500 records retrieved.
Search with continue index: 20060201215221|1579465, 500 records retrieved.
Search with continue index: 20060202151922|1586911, 500 records retrieved.
Search with continue index: 20060203033653|1593016, 500 records retrieved.
Search with continue index: 20060203215030|1600305, 500 records retrieved.
Search with continue index: 20060204223431|1609446, 500 records retrieved.
Search with continue index: 20060205215452|1618865, 500 records retrieved.
Search with continue index: 20060206175554|1627241, 500 records retrieved.
Search with continue index: 20060207025229|1632446, 500 records retrieved.
Search with continue index: 20060207210849|1640608, 500 records retrieved.
Search with continue index: 20060208035041|1644786, 500 records retrieved.
Search with continue index: 20060208221003|1653278, 500 records retrieved.
Search with continue index: 20060209195305|1663172, 500 records retrieved.
Search with continue index: 20060210175418|1673158, 500 records retrieved.
Se
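The "continue index" values in the output above have the same shape as the `lecontinue` tokens returned by the MediaWiki `list=logevents` query. The sketch below shows how such pagination typically works. It is an illustration under assumptions, not the spider's actual code: `iter_block_log_pages` and `fetch_from_wikipedia` are hypothetical names, and the English Wikipedia endpoint and the User-Agent string are placeholders chosen for the example.

```python
API_URL = "https://en.wikipedia.org/w/api.php"  # assumed endpoint


def fetch_from_wikipedia(params):
    """Send one request to the API and return the decoded JSON response."""
    import requests  # imported here so the pagination logic has no hard dependency

    headers = {"User-Agent": "block-log-example/0.1 (placeholder contact)"}
    response = requests.get(API_URL, params=params, headers=headers, timeout=30)
    response.raise_for_status()
    return response.json()


def iter_block_log_pages(fetch, start, end, limit=500):
    """Yield one list of block log events per API response."""
    params = {
        "action": "query",
        "format": "json",
        "list": "logevents",
        "letype": "block",
        "lestart": start,  # e.g. "2006-02-01T00:00:00Z"
        "leend": end,      # e.g. "2006-03-01T00:00:00Z"
        "ledir": "newer",  # oldest first, so start must be earlier than end
        "lelimit": limit,
    }
    while True:
        data = fetch(params)
        yield data.get("query", {}).get("logevents", [])
        if "continue" not in data:
            break  # no continuation token: this was the last page
        # Merge the continuation values (including lecontinue) into the next request.
        params = {**params, **data["continue"]}
```

Passing the fetch function as an argument keeps the loop easy to try offline with a stub; with network access, `iter_block_log_pages(fetch_from_wikipedia, "2006-02-01T00:00:00Z", "2006-03-01T00:00:00Z")` would request one page of up to 500 records at a time.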

## Retrieve data for 1 year

To retrieve block logs for a specific year, use the retrieve_block_logs_year function. It loops over the monthly retrieval function to download and store the logs for the given year. The data is stored in 12 local CSV files (format: year-month.csv), one per month.

It requires 3 parameters:
- year (int): 
    - The year for which block logs should be retrieved.
- folder_path (str):
    - The folder path where the CSV file will be saved.
- compress (bool): 
    - Whether to compress the CSV file into a gz file.

**Recommendations**:
- I advise against running this function to scrape a full year of data for 2022 onwards in one go. It can take several hours to scrape all the data.
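For those long-running years, one option is to retrieve a few months per session instead. The sketch below is a hypothetical wrapper, not part of the spider program: `retrieve_months` simply calls a month-level function once per month with the four documented parameters. It assumes that a failed or refused month surfaces as a Python exception, which may not match how the spider actually reports problems.

```python
def retrieve_months(retrieve_month, year, months, folder_path, compress=True):
    """Call retrieve_month once per month; return (succeeded, failed) month lists."""
    succeeded, failed = [], []
    for month in months:
        try:
            retrieve_month(year=year, month=month,
                           folder_path=folder_path, compress=compress)
        except Exception as error:
            # Assumption: a problem (e.g. an existing file) raises an exception.
            print(f"{year}-{month}: {error}")
            failed.append(month)
        else:
            succeeded.append(month)
    return succeeded, failed
```

For example, `retrieve_months(retrieve_block_logs_month, 2022, range(1, 4), folder_path)` would cover January to March only, and the returned lists show which months still need another attempt.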

**Note:**
- If the function fails, check whether an output file already exists on your local disk. The function stops automatically to avoid overwriting an existing local file.

In [None]:
# Set folder path to store fetched data
folder_path = '../data/block_logs/'

In [9]:
for i in range(2016, 2021, 1):
    retrieve_block_logs_year(year=i, folder_path=folder_path, compress=True)

500 records retrieved.
Search with continue index: 20160101200219|71517014, 500 records retrieved.
Search with continue index: 20160102154016|71532524, 500 records retrieved.
Search with continue index: 20160103075145|71546824, 500 records retrieved.
Search with continue index: 20160103171210|71554888, 500 records retrieved.
Search with continue index: 20160104051308|71565952, 500 records retrieved.
Search with continue index: 20160104230809|71584008, 500 records retrieved.
Search with continue index: 20160105071023|71593213, 500 records retrieved.
Search with continue index: 20160105234912|71610065, 500 records retrieved.
Search with continue index: 20160106151207|71624199, 500 records retrieved.
Search with continue index: 20160107000805|71633974, 500 records retrieved.
Search with continue index: 20160107151311|71646454, 500 records retrieved.
Search with continue index: 20160108010807|71657180, 500 records retrieved.
Search with continue index: 20160108061856|71661457, 500 records 