# Data Collection
---

This project collects Ethereum transaction data for addresses of interest identified through Arkham Intelligence. The transactions for each address are pulled from the Etherscan.io API in two stages to improve efficiency, and historical Ethereum prices are retrieved through the Yahoo Finance API.

## Data Sources

1. **Arkham Intelligence:**
   - Ethereum addresses of interest are identified using Arkham Intelligence.

2. **Etherscan.io API:**
   - Used to collect detailed information about the transactions associated with each Ethereum address.
   - The data is collected in two stages to improve efficiency.

3. **Yahoo Finance API:**
   - Used to retrieve historical Ethereum prices.

In [1]:
import requests
import os
import pandas as pd
from dotenv import load_dotenv
import yfinance as yf

In [2]:
# Load the API key for Etherscan from the local .env file
load_dotenv('./.env')
key = os.getenv("API_KEY")


In [4]:
# Each file holds the transactions of a single address
list_of_files = ['arkham_txns.csv'] + [f'arkham_txns ({n}).csv' for n in range(1, 13)]


# Only the from and to addresses are kept, which gives a longer list of addresses to scan.
current_transacts = pd.concat(
    [pd.read_csv(f"../data/{name}") for name in list_of_files],
    ignore_index=True,
)
list_of_addresses = list(current_transacts['fromAddress']) + list(current_transacts['toAddress'])


In [5]:
# Check how many addresses will be scanned in this run
list_of_addresses = list(set(list_of_addresses))
len(list_of_addresses)

25
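
Ethereum addresses are hex strings whose mixed casing only encodes a checksum, so the same address written in two casings would survive `set()` as two entries. The sketch below is one way to guard against that. It assumes the Arkham exports may contain mixed casing or missing values, which has not been checked against the actual files, and `unique_addresses` is a hypothetical helper, not part of the notebook.

```python
import re

# Pattern for a 20-byte address written as lowercase hex
ADDRESS_RE = re.compile(r"0x[0-9a-f]{40}")

def unique_addresses(values):
    """Return the distinct addresses in `values`, ignoring case, blanks and non-strings."""
    seen = set()
    for value in values:
        if not isinstance(value, str):
            continue  # skips NaN/None coming from empty CSV cells
        candidate = value.strip().lower()
        if ADDRESS_RE.fullmatch(candidate):
            seen.add(candidate)
    # Sorting gives the same order on every run, unlike iterating over a set
    return sorted(seen)
```

It could be called with the combined from/to list, for example `unique_addresses(list_of_addresses)`.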

### Pulling Transaction Data
---

In [6]:
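url = 'https://api.etherscan.io/api'
all_transactions = []

for address in list_of_addresses:
    # Etherscan's documented examples start paging at 1
    page = 1
    while True:
        params = {
            'module': 'account',
            'action': 'txlist',
            'address': address,
            'startblock': 0,
            'endblock': 99999999,
            'page': page,
            'offset': 10,
            'sort': 'asc',
            'apikey': key
        }
        response = requests.get(url, params=params)

        # Stop paging for this address on an HTTP error
        if response.status_code != 200:
            break
        data = response.json()

        # On API errors 'result' is a message string rather than a list;
        # extending with a string would add single characters to all_transactions
        result = data.get('result')
        if not isinstance(result, list) or not result:
            break
        all_transactions.extend(result)
        page += 1
    # Status check
    print(f"still going {address}")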

            

still going 0x919f9173E2Dc833Ec708812B4f1CB11B1a17eFDe
still going 0x14C43dAC1D4268779279679210F24294D7B15Ed2
still going 0xD512Bc60EAD7b70c1645b1874eA59f295Ed62279
still going 0x8D9c8D005071f86bf557f170A0BBDf6561d4Ec74
still going 0x0DBEfc275a81034f13041B3BcAa3Ad1831f10d60
still going 0xCD531Ae9EFCCE479654c4926dec5F6209531Ca7b
still going 0x21bACe4927263D2F9932a2309bcB55B93590ddC7
still going 0x6000da47483062A0D734Ba3dc7576Ce6A0B645C4
still going 0xDF90C9B995a3b10A5b8570a47101e6c6a29eb945
still going 0x1D68124e65faFC907325e3EDbF8c4d84499DAa8b
still going 0x09aea4b2242abC8bb4BB78D537A67a245A7bEC64
still going 0x58edF78281334335EfFa23101bBe3371b6a36A51
still going 0xeA9946e10d77878B35749D085E38f91207944a5b
still going 0xB8614008D759A299cc9498DC6D8f3CD1705d2bA0
still going 0x75e89d5979E4f6Fba9F97c104c2F0AFB3F1dcB88
still going 0x122D06a722f3ee4AfA33D3b19aba0671bFC98581
still going 0xd91eFec7E42f80156d1D9f660a69847188950747
still going 0x4E5B2e1dc63F6b91cb6Cd759936495434C7e972F
still goin
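
One way to make the paging logic easier to test is to separate it from the HTTP call. The sketch below is a minimal illustration, not the notebook's method: `fetch_all_pages` and its `fetch_page` argument are hypothetical names, the page size and pause are arbitrary choices, and the cap on pages reflects the assumption that Etherscan rejects requests where page times offset exceeds 10,000.

```python
import time

def fetch_all_pages(fetch_page, page_size=1000, max_pages=10, pause=0.25):
    """Collect results page by page until a short or empty page is returned.

    fetch_page(page, page_size) is a caller-supplied function (hypothetical)
    that returns the decoded JSON response as a dict.
    """
    rows = []
    for page in range(1, max_pages + 1):  # page numbers assumed to start at 1
        data = fetch_page(page, page_size)
        result = data.get("result")
        if not isinstance(result, list):
            # Error responses carry a message string here instead of a list
            raise RuntimeError(f"unexpected result on page {page}: {result!r}")
        rows.extend(result)
        if len(result) < page_size:
            break  # a short page means there is nothing left to fetch
        time.sleep(pause)  # stay under the API rate limit (value is an assumption)
    return rows
```

In the notebook, `fetch_page` could wrap `requests.get(url, params=...)` with `'page'` and `'offset'` filled in from its two arguments and return `response.json()`.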

In [7]:
len(all_transactions) # checking the size of our data

176959

In [8]:
eth_df = pd.DataFrame(all_transactions) 

In [10]:
# Because of the sheer amount of data, this notebook was run twice, pulling half of the data each time.
# The first run was saved to eth_trans_data1.csv; this run is saved to eth_trans_data2.csv.
eth_df.to_csv('../data/eth_trans_data2.csv')
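
Splitting the work across two runs only covers every address once if both runs see the addresses in the same order, and a list built from `set()` is not guaranteed to keep its order between Python sessions. A minimal sketch of a deterministic split follows; `split_in_half` is a hypothetical helper, and how the notebook actually divided the addresses is not shown in the source.

```python
def split_in_half(items):
    """Split items into two halves after sorting, so the split is repeatable."""
    ordered = sorted(items)  # same order in every session
    middle = (len(ordered) + 1) // 2  # first half gets the extra item if odd
    return ordered[:middle], ordered[middle:]
```

Each run would then loop over one of the two halves. The two resulting files can later be combined with `pd.concat`, keeping in mind that a transaction between two scanned addresses may appear in both.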

### Ethereum Price Over the Years
---

In [None]:
# Daily ETH price in USD over the requested date range
eth_ticker = "ETH-USD"

start_d = "2015-08-01"
end_d = "2024-02-01"

price_eth = yf.download(eth_ticker, start=start_d, end=end_d)

[*********************100%%**********************]  1 of 1 completed


| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2017-11-09 | 308.644989 | 329.451996 | 307.056 | 320.884003 | 320.884003 | 893249984 |
| 2017-11-10 | 320.67099 | 324.717987 | 294.541992 | 299.252991 | 299.252991 | 885985984 |
| 2017-11-11 | 298.585999 | 319.453003 | 298.191986 | 314.681 | 314.681 | 842300992 |


In [None]:
# Sort by date and keep only the daily closing price
price_eth = price_eth.sort_index()
price_eth_close = price_eth['Close']
price_eth_close.to_csv('../data/eth_price_per_day.csv')
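
The daily closing prices can be used to put a dollar value on each transaction. The sketch below assumes the Etherscan `txlist` fields `timeStamp` (Unix seconds as a string) and `value` (an amount in wei as a string), and that the daily prices are keyed by UTC date. `transaction_usd_value` and `close_by_date` are hypothetical names, not part of the notebook.

```python
from datetime import datetime, timezone
from decimal import Decimal

WEI_PER_ETH = Decimal(10) ** 18  # 1 ETH = 10^18 wei

def transaction_usd_value(txn, close_by_date):
    """Value one transaction in USD at that day's closing price.

    txn is a dict with 'timeStamp' and 'value' strings; close_by_date maps
    'YYYY-MM-DD' to a closing price. Returns None when no price is available.
    """
    # Convert the Unix timestamp to a UTC calendar date
    day = datetime.fromtimestamp(int(txn["timeStamp"]), tz=timezone.utc).date().isoformat()
    close = close_by_date.get(day)
    if close is None:
        return None  # e.g. transactions dated before the price history begins
    # Decimal avoids float rounding on large wei amounts
    eth_amount = Decimal(txn["value"]) / WEI_PER_ETH
    return eth_amount * Decimal(str(close))
```

The rows shown above start on 2017-11-09, so transactions before the first available price date would come back as `None` and need separate handling.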