# Introduction

This notebook explains how to run the spider program to fetch block logs from the Wikipedia API.

In [1]:
import requests
import pandas as pd

In [2]:
# append system path to import spider program
import sys
sys.path.append('../')
from utils.spiders import retrieve_block_logs_month # fetch data for a specific month in a given year
from utils.spiders import retrieve_block_logs_year # fetch data for a given year
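
The internals of the spider are not shown in this notebook. As a rough illustration of what fetching block logs involves, the sketch below pages through the MediaWiki Action API (`list=logevents`, `letype=block`) and follows the `continue` values the API returns. The endpoint, the User-Agent string, and the names `iter_block_logs` and `_get_json` are assumptions made for this example. They are not taken from utils.spiders.

```python
API_URL = "https://en.wikipedia.org/w/api.php"  # assumed endpoint (English Wikipedia)


def _get_json(params):
    """Send one GET request to the API and return the decoded JSON body."""
    # Imported here so the paging logic below can be tried without the dependency.
    import requests

    headers = {"User-Agent": "block-log-notebook-example/0.1"}  # placeholder value
    response = requests.get(API_URL, params=params, headers=headers, timeout=30)
    response.raise_for_status()
    return response.json()


def iter_block_logs(start, end, get_json=_get_json):
    """Yield block log events from `start` to `end` (timestamps), oldest first."""
    params = {
        "action": "query",
        "list": "logevents",
        "letype": "block",   # only block log entries
        "lestart": start,
        "leend": end,
        "ledir": "newer",    # enumerate from older to newer
        "lelimit": 500,      # records per request
        "format": "json",
    }
    while True:
        data = get_json(params)
        yield from data.get("query", {}).get("logevents", [])
        if "continue" not in data:
            break  # no further pages
        # Merge the continuation values into the next request.
        params = {**params, **data["continue"]}
```

Calling `list(iter_block_logs("2006-02-01T00:00:00Z", "2006-03-01T00:00:00Z"))` would, under these assumptions, collect the block log events for February 2006. The `get_json` argument exists only so the paging loop can be exercised with a stand-in function instead of a live request.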

# Code examples

## Retrieve data for 1 month

To retrieve block logs for a specific month, use the retrieve_block_logs_month function. It retrieves the logs and writes them to a local CSV file named year-month.csv (for example, 2006-02.csv).

It requires 4 parameters:
- year (int): 
    - The year for which block logs should be retrieved.
- month (int):
    - The month for which block logs should be retrieved.
- folder_path (str):
    - The folder path where the CSV file will be saved.
- compress (bool): 
    - Whether to compress the CSV file into a gz file.

Note:
- If you run into problems when running the function, check whether the target file already exists on your local disk. The function stops automatically to avoid overwriting a local file.
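
One way to avoid the error is to check for the target file before calling the function. The sketch below builds the year-month.csv path described above and tests whether it is already on disk. The helper names are hypothetical, and the `.csv.gz` name for the compressed file is an assumption, since the notebook does not show what the compressed file is called.

```python
from pathlib import Path


def expected_csv_path(folder_path, year, month):
    """Build the year-month.csv path, e.g. 2006-02.csv (hypothetical helper)."""
    return Path(folder_path) / f"{year}-{month:02d}.csv"


def month_already_saved(folder_path, year, month):
    """Return True if the CSV, or an assumed .csv.gz variant, is already on disk."""
    csv_path = expected_csv_path(folder_path, year, month)
    gz_path = csv_path.with_name(csv_path.name + ".gz")  # assumption: compressed name
    return csv_path.exists() or gz_path.exists()
```

With these helpers, a call such as `month_already_saved('../data/block_logs/', 2006, 2)` can guard the retrieval call so that existing months are not requested again.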

In [3]:
# Set folder path to store fetched data
folder_path = '../data/block_logs/'

In [4]:
retrieve_block_logs_month(year=2006, month=2, folder_path=folder_path, compress=True)

../data/block_logs/2006-02.csv already exists. Cannot overwrite.


FileExistsError: Pls check if a ../data/block_logs/2006-02.csv already exists on local disk.

## Retrieve data for 1 year

To retrieve block logs for a specific year, use the retrieve_block_logs_year function. It loops over the monthly retrieve function to download and store the logs for the given year. The data is stored in 12 local CSV files (format: year-month.csv), one per month.

It requires 3 parameters:
- year (int): 
    - The year for which block logs should be retrieved.
- folder_path (str):
    - The folder path where the CSV file will be saved.
- compress (bool): 
    - Whether to compress the CSV file into a gz file.

Note:
- If you run into problems when running the function, check whether a target file already exists on your local disk. The function stops automatically to avoid overwriting a local file.
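
The month-by-month loop can be sketched as follows. This is not the code of retrieve_block_logs_year. It is a hypothetical variant that records months whose files already exist and continues with the rest, instead of stopping. The name `retrieve_year_sketch` and the `fetch_month` callback are assumptions made for the example.

```python
def retrieve_year_sketch(year, fetch_month):
    """Call fetch_month(year, month) for months 1 to 12 and return the skipped months."""
    skipped = []
    for month in range(1, 13):
        try:
            fetch_month(year, month)
        except FileExistsError:
            # The file for this month is already on disk: note it and move on.
            skipped.append(month)
    return skipped
```

In this notebook, the callback could wrap the monthly function, for example `retrieve_year_sketch(2009, lambda y, m: retrieve_block_logs_month(year=y, month=m, folder_path=folder_path, compress=True))`. The returned list then shows which months were left untouched.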

In [7]:
for i in range(2009, 2016, 1):
    retrieve_block_logs_year(year=i, folder_path=folder_path, compress=True)

2009
2010
2011
2012
2013
2014
2015


In [6]:
retrieve_block_logs_year(year=2009, folder_path=folder_path, compress=True)

500 records retrieved.
Search with continue index: 20080101231752|12850350, 500 records retrieved.
Search with continue index: 20080103010950|12870209, 500 records retrieved.
Search with continue index: 20080103220145|12884564, 500 records retrieved.
Search with continue index: 20080104193125|12899160, 500 records retrieved.
Search with continue index: 20080105065423|12907743, 500 records retrieved.
Search with continue index: 20080106063921|12926262, 500 records retrieved.
Search with continue index: 20080107135713|12945545, 500 records retrieved.
Search with continue index: 20080108003732|12953869, 500 records retrieved.
Search with continue index: 20080108181834|12964772, 500 records retrieved.
Search with continue index: 20080109133913|12988631, 500 records retrieved.
Search with continue index: 20080110001752|12996772, 500 records retrieved.
Search with continue index: 20080110173502|13007682, 500 records retrieved.
Search with continue index: 20080111075901|13022517, 500 records 