# Web Scraping and Introductory Data Analysis

Welcome to Homework 0, where we will delve into web scraping and perform an introductory data analysis. This homework will be a hands-on exercise that will help you become familiar with the process of extracting data from websites and conducting basic statistical analysis. 

## Objectives

By the end of this homework, you will be able to:

1. Set up a Python environment with the necessary libraries for web scraping and data analysis.
2. Write a web scraping script using Beautiful Soup and Selenium to collect data from a website.
3. Sample from the collected dataset and compare the statistics of the sample and the population.
   
## Tasks

1. **Environment Setup**: Install the required libraries such as Beautiful Soup, Selenium, pandas, numpy, matplotlib, and seaborn.

2. **Web Scraping**: Write a script to scrape transaction data from [Etherscan.io](https://etherscan.io/txs). Use Selenium to interact with the website and Beautiful Soup to parse the HTML content.

3. **Data Sampling**: Once the data is collected, create a sample from the dataset. Compare the sample statistics (mean and standard deviation) with the population statistics.


## Deliverables

1. A Jupyter notebook with all the code and explanations.
2. A detailed report on the findings, including the comparison of sample and population statistics.
Note: You can include the report in your notebook.

## Getting Started

Begin by setting up your Python environment and installing the necessary libraries. Then, proceed with the web scraping task, ensuring that you handle any potential issues such as rate limiting. Once you have the data, move on to the data sampling and statistical analysis tasks. 

Remember to document your process and findings in the Jupyter notebook, and to include visualizations where appropriate to illustrate your results. <br>
Good luck, and happy scraping!

## Data Collection (Etherscan)

In this section, we will use web scraping to gather transaction data from the Ethereum blockchain using the Etherscan block explorer. Our objective is to collect transactions from the **last 10 blocks** on Ethereum.

To accomplish this task, we will employ web scraping techniques to extract the transaction data from the Etherscan website. The URL we will be targeting for our data collection is:

[https://etherscan.io/txs](https://etherscan.io/txs)

### Steps

1. **Navigate to the URL**: Use Selenium to open the Etherscan transactions page in a browser.

2. **Locate the Transaction Data**: Identify the HTML elements that contain the transaction data for the specified block range.

3. **Extract the Data**: Write a script to extract the transaction details e.g. Hash, Method, Block, etc.

4. **Handle Pagination**: If the transactions span multiple pages, implement pagination handling to navigate through the pages and collect all relevant transaction data.

5. **Store the Data**: Save the extracted transaction data into a structured format, such as a CSV file or a pandas DataFrame, for further analysis.

### Considerations

- **Rate Limiting**: Be mindful of the website's rate limits to avoid being blocked. Implement delays between requests if necessary.
- **Dynamic Content**: The Etherscan website may load content dynamically. Ensure that Selenium waits for the necessary elements to load before attempting to scrape the data.
- **Data Cleaning**: After extraction, clean the data to remove any inconsistencies or errors that may have occurred during the scraping process.

### Resources

- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Selenium Documentation](https://selenium-python.readthedocs.io/)
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Ethereum](https://ethereum.org/en/)

In [111]:
# Generated by Selenium IDE
import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time


driver = webdriver.Firefox()
driver.get("https://etherscan.io/txs")
the_soup = BeautifulSoup(driver.page_source, 'html.parser')

The geckodriver version (0.33.0) detected in PATH at /usr/local/bin/geckodriver might not be compatible with the detected firefox version (123.0); currently, geckodriver 0.34.0 is recommended for firefox 123.*, so it is advised to delete the driver in PATH and retry


In [112]:
t = the_soup.find("tbody", attrs={"class":"align-middle text-nowrap"})

In [113]:
row = [i for i in t.children]
str(row[1])[:400]

'<tr>\n<td><button class="js-tnx-preview btn btn-sm btn-white fs-70x content-center mx-auto myFnExpandBox" data-bs-container="body" data-bs-content="&lt;i class=\'fas fa-circle-notch fa-spin text-primary fa-2x\'&gt;&lt;/i&gt;" data-bs-content-id="js-tnx-preview" data-bs-custom-class="popover-preview" data-bs-html="true" data-bs-placement="right" data-bs-toggle="popover" data-bs-trigger="manual" data-i'

In [114]:
for i in range(1, 10):
    print(i, row[i].findAll("td")[2].find("span").text)

1 Transfer
2 Transfer
3 Ccip Send
4 Execute
5 Execute
6 Transfer
7 Transfer
8 Sell To Uniswap
9 Sell To Uniswap


In [115]:
def convert_row_to_dict(row):
    cells = row.findAll("td")
    return {"Txn Hash": cells[1].find("a").text,
            "Method": cells[2].find("span").text, # TODO cant show full text
            "Block": cells[3].find("a").text,
            "Time": cells[4].find("span").text,
            "From": cells[7].find("a")["data-bs-title"] if cells[7].find("a").has_attr('data-bs-title')\
            else cells[7].find("span")["data-bs-title"],
            "Self": cells[8].text == "SELF",
            "To": cells[9].find("a")["data-bs-title"] if cells[9].find("a").has_attr('data-bs-title')\
            else cells[9].find("span")["data-bs-title"],
            "Value": cells[10].text,
            "Txn Fee": cells[11].text,
            "GasPrice": cells[12].text,
            "Error": cells[1].findAll("span")[0]["data-bs-title"]\
            if len(cells[1].findAll("span")) > 1 else ""}
convert_row_to_dict(row[1])

{'Txn Hash': '0xeba960e42576852a90760dadef2ea709fde92c8114da45edb91b70fed649f7b4',
 'Method': 'Transfer',
 'Block': '19347405',
 'Time': '2024-03-02 12:13:11',
 'From': '0x4648451b5f87ff8f0f7d622bd40574bb97e25980',
 'Self': False,
 'To': 'Public Tag: Tether: USDT Stablecoin<br/>(0xdac17f958d2ee523a2206206994597c13d831ec7)',
 'Value': '0 ETH',
 'Txn Fee': '0.00275637',
 'GasPrice': '43.60736268',
 'Error': ''}

In [116]:
def parse_transaction(the_soup, log=False):
    table = the_soup.find("tbody", attrs={"class":"align-middle text-nowrap"})
    rows = [i for i in t.children]
    if log:
        print(f"--------\nsample row: {str(row[1])[:300]}...\n----------\n")
    row_data = [convert_row_to_dict(rows[i]) for i in range(1, len(rows) - 1)]
    if log:
        print(f"--------\nsample data: {row_data[1]}\n----------\n")
    return row_data

page_data = parse_transaction(the_soup, True)

--------
sample row: <tr>
<td><button class="js-tnx-preview btn btn-sm btn-white fs-70x content-center mx-auto myFnExpandBox" data-bs-container="body" data-bs-content="&lt;i class='fas fa-circle-notch fa-spin text-primary fa-2x'&gt;&lt;/i&gt;" data-bs-content-id="js-tnx-preview" data-bs-custom-class="popover-preview" da...
----------

--------
sample data: {'Txn Hash': '0x12351f92b225a5abe8673dc514525c26cd5f45d8f9f76f8bad940248cb1c40df', 'Method': 'Transfer', 'Block': '19347405', 'Time': '2024-03-02 12:13:11', 'From': '0x474fc1f28b6394f957d471d4148db761f3559ae5', 'Self': False, 'To': '0x17a7227be96d8a3dfbe81bfa2bcea4f353d4274e', 'Value': '0.000939273 ETH', 'Txn Fee': '0.00091577', 'GasPrice': '43.60836135', 'Error': ''}
----------



In [117]:
def write_in_json_file(data_list):
    json_data = json.dumps(data_list, indent=4)

    file_path = f"data_block_{data_list[0]['Block']}_{data_list[-1]['Block']}.json"
    with open(file_path, "w") as file:
        file.write(json_data)

    print(f"Data has been saved to {file_path}")

In [118]:
def find_first_block(the_soup):
    return int(parse_transaction(the_soup)[1]["Block"])
find_first_block(the_soup)

19347405

In [119]:
def get_page_count(the_soup):
    return int(the_soup.find(class_="page-link text-nowrap").text.split()[3])
get_page_count(the_soup)

10000

In [120]:
def collect_block_data(driver, block):
    block_data = list()
    driver.get(f"https://etherscan.io/txs?block={block}")
    the_soup = BeautifulSoup(driver.page_source, 'html.parser')
    page_count = get_page_count(the_soup)
    for i in range(1, page_count + 1):
        driver.get(f"https://etherscan.io/txs?block={block}&p={i}")
        the_soup = BeautifulSoup(driver.page_source, 'html.parser')
        block_data.extend(parse_transaction(the_soup))
    return block_data

In [121]:
BLOCK_COUNT = 10
def scrape_data(driver):
    output = list()
    driver.get("https://etherscan.io/txs")
    the_soup = BeautifulSoup(driver.page_source, 'html.parser')
    first_block = find_first_block(the_soup)
    for i in range(BLOCK_COUNT):
        output.extend(collect_block_data(driver, first_block - i))
        print(f"sample block {first_block - i}: {output[-1]}\n")
        time.sleep(1)
    write_in_json_file(output)
scrape_data(driver)

sample block 19347405: {'Txn Hash': '0xde82fcee93a6f1ba7e86af9a5cacb1911a7d8c50d0e7a1de8bb1778b3c0d1008', 'Method': 'Reveal', 'Block': '19347405', 'Time': '2024-03-02 12:13:11', 'From': 'tigerbull.eth<br/>(0x0c3ea13e4f5597fb6e7ee2c6352f8afecc1bb9b1)', 'Self': False, 'To': '0x7037ae030238c688dd7bb421d4a4f78d7684533a', 'Value': '0 ETH', 'Txn Fee': '0.00472485', 'GasPrice': '43.63312337', 'Error': ''}

sample block 19347404: {'Txn Hash': '0xde82fcee93a6f1ba7e86af9a5cacb1911a7d8c50d0e7a1de8bb1778b3c0d1008', 'Method': 'Reveal', 'Block': '19347405', 'Time': '2024-03-02 12:13:11', 'From': 'tigerbull.eth<br/>(0x0c3ea13e4f5597fb6e7ee2c6352f8afecc1bb9b1)', 'Self': False, 'To': '0x7037ae030238c688dd7bb421d4a4f78d7684533a', 'Value': '0 ETH', 'Txn Fee': '0.00472485', 'GasPrice': '43.63312337', 'Error': ''}

sample block 19347403: {'Txn Hash': '0xde82fcee93a6f1ba7e86af9a5cacb1911a7d8c50d0e7a1de8bb1778b3c0d1008', 'Method': 'Reveal', 'Block': '19347405', 'Time': '2024-03-02 12:13:11', 'From': 'tigerbu

## Data Analysis

Now that we have collected the transaction data from Etherscan, the next step is to perform conduct an initial analysis. This task will involve the following steps:

1. **Load the Data**: Import the collected transaction data into a pandas DataFrame.

2. **Data Cleaning**: Clean the data by converting data types, removing any irrelevant information, and handling **duplicate** values.

3. **Statistical Analysis**: Calculate the mean and standard deviation of the population. Evaluate these statistics to understand the distribution of transaction values. The analysis and plotting will be on **Txn Fee** and **Value**.

4. **Visualization**: This phase involves the creation of visual representations to aid in the analysis of transaction values. The visualizations include:
    - A histogram for each data column, which provides a visual representation of the data distribution. The selection of bin size is crucial and should be based on the data's characteristics to ensure accurate representation. Provide an explanation on the bin size selection!
    - A normal distribution plot fitted alongside the histogram to compare the empirical distribution of the data with the theoretical normal distribution.
    - A box plot and a violin plot to identify outliers and provide a comprehensive view of the data's distribution.

### Deliverables

The project aims to deliver the following deliverables:

- A refined pandas DataFrame containing the transaction data, which has undergone thorough cleaning and is ready for analysis.
- A simple statistical analysis evaluating the population statistics, offering insights into the distribution of transaction values and fees.
- A set of visualizations showcasing the distribution of transaction values for the population. These visualizations include histograms, normal distribution plots, box plots, and violin plots, each serving a specific purpose in the analysis.

### Getting Started

The project starts with the importing of transaction data into a pandas DataFrame, setting the stage for data manipulation and analysis. Subsequent steps involve the cleaning of the data to ensure its quality and reliability. Followed by the calculation of population statistics. Finally, a series of visualizations are created to visually analyze the distribution of transaction values and fees.

In [122]:
# Your code here

## Data Sampling and Analysis

In this section, we will delve into the process of data sampling and perform an initial analysis on the transaction data we have collected. Our objective is to understand the distribution of transaction values by sampling the data and comparing the sample statistics with the population statistics.

### Steps

1. **Load the Data**: Import the collected transaction data into a pandas DataFrame.

2. **Data Cleaning**: Clean the data by handling missing values, converting data types, and removing any irrelevant information.

3. **Simple Random Sampling (SRS)**: Create a sample from the dataset using a simple random sampling method. This involves randomly selecting a subset of the data without regard to any specific characteristics of the data.

4. **Stratified Sampling**: Create another sample from the dataset using a stratified sampling method. This involves dividing the data into strata based on a specific characteristic (e.g., transaction value) and then randomly selecting samples from each stratum. Explain what you have stratified the data by and why you chose this column.

5. **Statistical Analysis**: Calculate the mean and standard deviation of the samples and the population. Compare these statistics to understand the distribution of transaction values.

6. **Visualization**: Plot the distribution of transaction values and fees for both the samples and the population to visually compare their distributions.

### Considerations

- **Sample Size**: The size of the sample should be large enough to represent the population accurately but not so large that it becomes impractical to analyze.
- **Sampling Method**: Choose the appropriate sampling method based on the characteristics of the data and the research question.

Explain the above considerations in your report.

In [123]:
# Your code here