# Introduction

This notebook shows how to run the spider program that fetches abusefilter logs from the Wikipedia API.

In [3]:
import requests
import pandas as pd

In [4]:
# append system path to import spider program
import sys
sys.path.append('../')
from utils.spider_abusefilter import retrieve_abusefilter_logs_month # fetch data for a specific month in a given year
from utils.spider_abusefilter import retrieve_abusefilter_logs_year # fetch data for a given year

# Code examples

## Retrieve data for 1 month

To retrieve logs for a specific month, use the `retrieve_abusefilter_logs_month` function. It retrieves the logs and writes them to a local CSV file (format: year-month.csv).

It requires 4 parameters:
- request_session:
    - A request session used to send API requests.
- year (int):
    - The year for which abusefilter logs should be retrieved.
- month (int):
    - The month for which abusefilter logs should be retrieved.
- folder_path (str):
    - The folder path where the CSV file will be saved.
    - Example: './data/abusefilter_logs/'

**Note:**
- No CSV file is created if 0 records are retrieved.
- If the function fails to run, check whether the target file already exists on your local disk. The function stops automatically to avoid overwriting a local file.
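
Because the function stops when the target file already exists, it can help to look for that file before calling it. The following is a minimal sketch and not part of the spider program: `existing_month_files` is a hypothetical helper, and the file name is an assumption based on the year-month.csv format described above. Since the exact spelling of the month is not confirmed here, both the zero-padded and the unpadded form are checked.

```python
from pathlib import Path


def existing_month_files(folder_path, year, month):
    """Return CSV files already on disk that could belong to the given month.

    Assumption: files are named "<year>-<month>.csv". Both "2008-2.csv" and
    "2008-02.csv" are checked because the exact format is not confirmed.
    """
    folder = Path(folder_path)
    # A set removes the duplicate name when month >= 10.
    candidates = {f"{year}-{month}.csv", f"{year}-{month:02d}.csv"}
    return sorted(folder / name for name in candidates if (folder / name).is_file())


# Usage: an empty list means no candidate file was found for that month.
hits = existing_month_files('../data/abusefilter_logs/', 2008, 2)
```

If the list is not empty, moving or renaming the listed files before the call should avoid the early stop.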

In [6]:
# Set folder path to store fetched data
folder_path = '../data/abusefilter_logs/'

In [7]:
s = requests.Session()
retrieve_abusefilter_logs_month(request_session=s, year=2008, month=2, folder_path=folder_path)

Retrieving logevent type: abusefilter
0 records retrieved.
Retrieval complete for logs between 2008-02-01T00:00:00Z and 2008-03-01T00:00:00Z.
No record was retrieved or saved.
Retrieving logevent type: abusefilterblockeddomainhit
0 records retrieved.
Retrieval complete for logs between 2008-02-01T00:00:00Z and 2008-03-01T00:00:00Z.
No record was retrieved or saved.
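
The output above suggests that each log event type is queried separately over a window that runs from the first day of the month to the first day of the next month. The sketch below reproduces that window and shows what the request parameters might look like. It assumes the spider uses the MediaWiki Action API `list=logevents` module, which this notebook does not confirm; `month_window` and `logevents_params` are hypothetical names, not functions from `utils.spider_abusefilter`.

```python
from datetime import datetime, timezone


def month_window(year, month):
    """Return (start, end) ISO timestamps covering one calendar month."""
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    # The window ends at the first instant of the following month.
    if month == 12:
        end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        end = datetime(year, month + 1, 1, tzinfo=timezone.utc)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), end.strftime(fmt)


def logevents_params(log_type, start, end):
    """Build query parameters for one log type (assumed request shape)."""
    return {
        "action": "query",
        "list": "logevents",
        "letype": log_type,   # e.g. "abusefilter"
        "lestart": start,     # with ledir="newer", this is the earlier bound
        "leend": end,
        "ledir": "newer",
        "lelimit": "max",
        "format": "json",
    }


start, end = month_window(2008, 2)
params = logevents_params("abusefilter", start, end)
```

The parameters are only built here, not sent, so the sketch runs without network access.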


## Retrieve data for 1 year

To retrieve abusefilter logs for a specific year, use the `retrieve_abusefilter_logs_year` function. It loops over the month retrieval function to download and store the logs for the given year. The data is stored in up to 12 local CSV files (format: year-month.csv), one per month.

It requires 3 parameters:
- request_session:
    - A request session used to send API requests.
- year (int):
    - The year for which abusefilter logs should be retrieved.
- folder_path (str):
    - The folder path where the CSV file will be saved.

**Note:**
- No CSV file is created if 0 records are retrieved.
- The first abusefilter log entry dates from March 2009, so earlier months return 0 records.
- If the function fails to run, check whether the target file already exists on your local disk. The function stops automatically to avoid overwriting a local file.
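
Since the first log entry dates from March 2009, requests for earlier months only produce empty results. One way to skip them is to enumerate the (year, month) pairs from that point on and fetch month by month. This is a small sketch under that assumption; `months_from` is a hypothetical helper, not part of the spider program.

```python
def months_from(first_year, first_month, last_year, last_month=12):
    """Yield (year, month) pairs from the first to the last month, inclusive."""
    year, month = first_year, first_month
    while (year, month) <= (last_year, last_month):
        yield year, month
        month += 1
        if month > 12:
            # Roll over into January of the next year.
            year, month = year + 1, 1


# All months from March 2009 through December 2023.
pairs = list(months_from(2009, 3, 2023))
```

Each pair could then be passed to `retrieve_abusefilter_logs_month` together with the session and folder path used in this notebook.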

In [9]:
# Set folder path to store fetched data
folder_path = '../data/abusefilter_logs/'
s = requests.Session()

In [10]:
for year in range(2008, 2024, 1):
    retrieve_abusefilter_logs_year(request_session=s, year=year, folder_path=folder_path)

Retrieving logevent type: abusefilter
0 records retrieved.
Retrieval complete for logs between 2008-01-01T00:00:00Z and 2008-02-01T00:00:00Z.
No record was retrieved or saved.
Retrieving logevent type: abusefilterblockeddomainhit
0 records retrieved.
Retrieval complete for logs between 2008-01-01T00:00:00Z and 2008-02-01T00:00:00Z.
No record was retrieved or saved.
Retrieving logevent type: abusefilter
0 records retrieved.
Retrieval complete for logs between 2008-02-01T00:00:00Z and 2008-03-01T00:00:00Z.
No record was retrieved or saved.
Retrieving logevent type: abusefilterblockeddomainhit
0 records retrieved.
Retrieval complete for logs between 2008-02-01T00:00:00Z and 2008-03-01T00:00:00Z.
No record was retrieved or saved.
Retrieving logevent type: abusefilter
0 records retrieved.
Retrieval complete for logs between 2008-03-01T00:00:00Z and 2008-04-01T00:00:00Z.
No record was retrieved or saved.
Retrieving logevent type: abusefilterblockeddomainhit
0 records retrieved.
Retrieval com