## API: Downloading and Testing data
Download event, asset, and collection information through OpenSea's API.
 * Download the data.
 * Run tests and mark risks (missing or duplicated data): e.g. check the time delta between consecutive transactions, and mark it if it is unusually large.

In [1]:
# data processing related
import pandas as pd
import numpy as np
import json

# core libraries
from datetime import datetime, timedelta
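import importlib as imp  # the stdlib 'imp' module was removed in Python 3.12; importlib.reload replaces imp.reload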

# OS related
from os import listdir, makedirs, remove, path

# parallel programming related
from multiprocessing import Pool
import subprocess

# downloading related
import requests

# OpenSea related

# settings
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 100)

# own libraries
import src.config_API as ca
import src.api_dl as dl
import src.config as cf
import src.util as ut

In [2]:
ca = imp.reload(ca)
cf = imp.reload(cf)
dl = imp.reload(dl)
ut = imp.reload(ut)

## Function test: single thread download
Downloading/storing transactions within a time interval (with the given filters)

In [3]:
# Checking a few hours (single thread loader)

# dl._timeInterval_eventDL_oneThread(API_key=ca.apikey_1, filter_dict={'event_type':'successful'}, 
#                                    time_interval=['2021-01-02 00:00:00', '2021-01-03 00:00:00'], 
#                                    path_dumpdir=cf.PATH_00RAW_API_dump_main + 'test/', batchsize=300, 
#                                    _verbose=True, _timeout_limit=0)

In [4]:
# pdf_results = pd.read_pickle(cf.PATH_00RAW_API_dump_main + 'test/eventresponse_1609545600_1609632000.pickle')
# pdf_results.head(2)

In [5]:
dl = imp.reload(dl) # 1610064000
dl._timeInterval_eventDL_oneThread(API_key=ca.apikey_1, filter_dict={'event_type':'successful'}, 
                                   time_interval=[], time_interval_unixts=[1531526400, 1531612800],
                                   path_dumpdir=cf.PATH_00RAW_API_dump_main + 'test/', batchsize=300, 
                                   _verbose=True, _timeout_limit=0)

[Sun 15:00:06] Downloading events from 1531526400 to 1531612800...
[Sun 15:00:06] File already exists...


## Downloading Earlier Years and Recent Data
Early on, longer periods can be loaded at once (although there are some "busy days" in 2017, probably when the initial transactions/NFTs were uploaded). Later on, shorter periods should be used: days (time_batch_sec=86400), half days (time_batch_sec=43200), or hours (time_batch_sec=3600). 
If too many errors pile up in the errors folder (some days are not available, or some periods return an error), check the error messages first, since they may already explain the problem. Then two things can be tried:
 * Try to download the data again a few days or a week later. If the error is on OpenSea's side, it may have been fixed by then.
 * Use a lower time_batch_sec (e.g. 5 minutes = 300 sec), so that at least the "in-between" gaps get filled. The API loader automatically keeps track of the time intervals already covered, independently of time_batch_sec.
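To make the effect of time_batch_sec concrete, the following is a minimal sketch of how a time interval might be cut into batches. It is not the code of `dl.TimeIntervalEventDL`; the helper name `split_interval` is hypothetical, and the time strings are assumed to be UTC (which agrees with the Unix timestamps in the dump file names used earlier in this notebook).

```python
from datetime import datetime, timezone

def split_interval(start_str, end_str, time_batch_sec):
    """Split [start, end) into consecutive (start_ts, end_ts) Unix-timestamp pairs."""
    fmt = '%Y-%m-%d %H:%M:%S'
    # Assumption: the time strings are interpreted as UTC.
    start_ts = int(datetime.strptime(start_str, fmt).replace(tzinfo=timezone.utc).timestamp())
    end_ts = int(datetime.strptime(end_str, fmt).replace(tzinfo=timezone.utc).timestamp())

    batches = []
    cur = start_ts
    while cur < end_ts:
        # The last batch is clipped so it never runs past the end of the interval.
        nxt = min(cur + time_batch_sec, end_ts)
        batches.append((cur, nxt))
        cur = nxt
    return batches

daily = split_interval('2021-01-02 00:00:00', '2021-01-03 00:00:00', 86400)
five_min = split_interval('2021-01-02 00:00:00', '2021-01-03 00:00:00', 300)
print(daily)          # [(1609545600, 1609632000)]
print(len(five_min))  # 288
```

With 5-minute batches, a single failing request loses at most 300 seconds of data instead of a whole day, at the cost of many more requests.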

The following cells download the data for 2017 (not the full year), 2018, 2019, 2020, 2021 and the recent period.

In [5]:
# Earlier Years (daily batches)
dl = imp.reload(dl)
dl.TimeIntervalEventDL(list_of_API_keys=[ca.apikey_1, ca.apikey_2], filter_dict={'event_type':'successful'}, 
                       time_interval=['2017-01-01 00:00:00', '2018-01-01 00:00:00'], time_batch_sec=1*86400, 
                       path_out_dumpdir=cf.PATH_00RAW_API_dump_main, batchsize=300, _timeout_limit=0, 
                       _verbose=False)

In [5]:
# Earlier Years (daily batches)
dl = imp.reload(dl)
dl.TimeIntervalEventDL(list_of_API_keys=[ca.apikey_1, ca.apikey_2], filter_dict={'event_type':'successful'}, 
                       time_interval=['2018-01-01 00:00:00', '2019-01-01 00:00:00'], time_batch_sec=86400, 
                       path_out_dumpdir=cf.PATH_00RAW_API_dump_main, batchsize=300, _timeout_limit=0, 
                       _verbose=False)

In [7]:
dl = imp.reload(dl)
dl.TimeIntervalEventDL(list_of_API_keys=[ca.apikey_1, ca.apikey_2], filter_dict={'event_type':'successful'}, 
                       time_interval=['2019-01-01 00:00:00', '2020-01-01 00:00:00'], time_batch_sec=86400, 
                       path_out_dumpdir=cf.PATH_00RAW_API_dump_main, batchsize=300, _timeout_limit=0, 
                       _verbose=False)

In [5]:
# Downloading 2020 (multi-thread loader, daily batches)
dl = imp.reload(dl)
dl.TimeIntervalEventDL(list_of_API_keys=[ca.apikey_1, ca.apikey_2], filter_dict={'event_type':'successful'}, 
                       time_interval=['2020-01-01 00:00:00', '2021-01-01 00:00:00'], time_batch_sec=86400, 
                       path_out_dumpdir=cf.PATH_00RAW_API_dump_main, batchsize=300, _timeout_limit=0, 
                       _verbose=False)

In [6]:
# Downloading 2021 Q1 (multi-thread loader, half-day batches)
dl = imp.reload(dl)
dl.TimeIntervalEventDL(list_of_API_keys=[ca.apikey_1, ca.apikey_2], filter_dict={'event_type':'successful'}, 
                       time_interval=['2021-01-01 00:00:00', '2021-04-01 00:00:00'], time_batch_sec=43200, 
                       path_out_dumpdir=cf.PATH_00RAW_API_dump_main, batchsize=300, _timeout_limit=0, 
                       _verbose=False)

In [7]:
dl.TimeIntervalEventDL(list_of_API_keys=[ca.apikey_1, ca.apikey_2], filter_dict={'event_type':'successful'}, 
                       time_interval=['2021-04-01 00:00:00', '2021-07-01 00:00:00'], time_batch_sec=43200, 
                       path_out_dumpdir=cf.PATH_00RAW_API_dump_main, batchsize=300, _timeout_limit=0, 
                       _verbose=False)

In [3]:
dl = imp.reload(dl)
dl.TimeIntervalEventDL(list_of_API_keys=[ca.apikey_1, ca.apikey_2], filter_dict={'event_type':'successful'}, 
                       time_interval=['2021-07-01 00:00:00', '2021-10-01 00:00:00'], time_batch_sec=43200, 
                       path_out_dumpdir=cf.PATH_00RAW_API_dump_main, batchsize=300, _timeout_limit=0, 
                       _verbose=False)

In [22]:
dl.TimeIntervalEventDL(list_of_API_keys=[ca.apikey_1, ca.apikey_2], filter_dict={'event_type':'successful'}, 
                       time_interval=['2021-10-01 00:00:00', '2022-01-01 00:00:00'], time_batch_sec=21600, 
                       path_out_dumpdir=cf.PATH_00RAW_API_dump_main, batchsize=300, _timeout_limit=0, 
                       _verbose=False)

In [None]:
# dl.TimeIntervalEventDL(list_of_API_keys=[ca.apikey_1, ca.apikey_2], filter_dict={'event_type':'successful'}, 
#                        time_interval=['2021-05-01 00:00:00', '2021-06-01 00:00:00'], time_batch_sec=43200, 
#                        path_out_dumpdir=cf.PATH_00RAW_API_dump_main, batchsize=300, _timeout_limit=0, _verbose=False)

In [None]:
# dl.TimeIntervalEventDL(list_of_API_keys=[ca.apikey_1, ca.apikey_2], filter_dict={'event_type':'successful'}, 
#                        time_interval=['2021-06-01 00:00:00', '2021-07-01 00:00:00'], time_batch_sec=86400, 
#                        path_out_dumpdir=cf.PATH_00RAW_API_dump_main, batchsize=300, _timeout_limit=0, _verbose=False)

In [9]:
# dl.TimeIntervalEventDL(list_of_API_keys=[ca.apikey_1, ca.apikey_2], filter_dict={'event_type':'successful'}, 
#                        time_interval=['2021-07-01 00:00:00', '2021-08-01 00:00:00'], time_batch_sec=86400, 
#                        path_out_dumpdir=cf.PATH_00RAW_API_dump_main, batchsize=300, _timeout_limit=0, _verbose=False)

## Prepare some Tests
Testing the downloaded transactions to filter out possible errors: 
 * The first/last transactions near the "cut" (for every downloaded file). Compare their distance from the file limit to the average distance between transactions inside the file.
 * The daily number of transactions (compare it to the 30-day moving average).
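The second test can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `flag_unusual_days`, the thresholds `low_ratio` and `high_ratio`, and the sample counts are all hypothetical and not part of the notebook's libraries.

```python
import pandas as pd

def flag_unusual_days(daily_counts, window=30, low_ratio=0.5, high_ratio=2.0):
    """Flag days whose transaction count deviates strongly from the trailing moving average."""
    df = pd.DataFrame({'n_trx': daily_counts})
    # Average of the previous `window` days; shifted by one so a day is not compared to itself.
    df['moving_avg'] = df['n_trx'].rolling(window, min_periods=window).mean().shift(1)
    ratio = df['n_trx'] / df['moving_avg']
    # Days without a full window have a NaN ratio and are never flagged.
    df['risk'] = (ratio < low_ratio) | (ratio > high_ratio)
    return df

# Made-up data: 40 days with 100 transactions each, except one suspiciously quiet day.
idx = pd.date_range('2021-01-01', periods=40, freq='D')
counts = pd.Series(100, index=idx)
counts.iloc[35] = 10
flagged = flag_unusual_days(counts)
print(flagged[flagged['risk']])
```

A flagged day is not necessarily an error; it only marks a date worth checking against the errors folder or downloading again.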

### Testing the distance (in sec) from the file limit

In [7]:
# Testing a directory of dump files

def _libTester_event(path_in_dir):
    """
    Test the event dump files in a directory to identify potential errors.

    For each file, compare the time limits encoded in the file name with the
    timestamps of the oldest and newest transaction actually stored in it.
    """

    _list_of_files = [_f for _f in listdir(path_in_dir) if _f.endswith('.pickle')]
    _rows = []

    for _file in _list_of_files:
        # File name pattern: <prefix>_<min unix ts>_<max unix ts>.pickle
        _name_parts = _file.split('.')[0].split('_')
        _mindt, _maxdt = int(_name_parts[1]), int(_name_parts[2])
        _tmp_df = pd.read_pickle(path.join(path_in_dir, _file))
        # Events are assumed to be stored newest first, so the last row is the oldest
        _realmindt = pd.to_datetime(_tmp_df.iloc[-1].transaction['timestamp']).timestamp()
        _realmaxdt = pd.to_datetime(_tmp_df.iloc[0].transaction['timestamp']).timestamp()
        _rows.append({'expected_min_dt': _mindt, 'real_min_dt': _realmindt,
                      'expected_max_dt': _maxdt, 'real_max_dt': _realmaxdt})

    # DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
    out_df = pd.DataFrame(_rows, columns=['expected_min_dt', 'real_min_dt',
                                          'expected_max_dt', 'real_max_dt'])
    out_df['min_dt_diff'] = out_df.real_min_dt - out_df.expected_min_dt
    out_df['max_dt_diff'] = out_df.expected_max_dt - out_df.real_max_dt

    return out_df

In [1]:
# _libTester_event('./dump_data/events_20210401_000000_20210501_000000/')
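The distances from the file limits only become meaningful when compared with the typical spacing of transactions inside the file. A minimal sketch of such a comparison for a single file is shown below; the function `boundary_risk`, the `factor` threshold, and the synthetic timestamps are assumptions for illustration, and the timestamp list is assumed to be non-empty.

```python
from statistics import mean

def boundary_risk(timestamps, expected_min_ts, expected_max_ts, factor=10.0):
    """Compare the gaps at the file limits to the average gap between transactions."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    avg_gap = mean(gaps) if gaps else 0.0
    min_dt_diff = ts[0] - expected_min_ts    # distance of the oldest transaction from the lower limit
    max_dt_diff = expected_max_ts - ts[-1]   # distance of the newest transaction from the upper limit
    # Hypothetical rule: a boundary gap more than `factor` times the average gap is suspicious.
    limit = factor * avg_gap
    return {'avg_gap': avg_gap, 'min_dt_diff': min_dt_diff, 'max_dt_diff': max_dt_diff,
            'risk_at_start': min_dt_diff > limit, 'risk_at_end': max_dt_diff > limit}

# Synthetic file: one transaction every 60 s, but the last two hours of the day are missing.
start, end = 1609545600, 1609632000
ts = list(range(start + 30, end - 7200, 60))
result = boundary_risk(ts, start, end)
print(result)
```

Here the gap at the end of the file (7230 s) is far larger than ten times the average in-file gap (60 s), so only the upper limit is marked as a risk.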