# 1-Webcrawller Script
Date created: 13-04-2023
Created by: Jayden Dzierbicki
Last updated: 14-04-2023

The purpose of this notebook is to collect data on XRP, a cryptocurrency, from two different sources: historic price data and forum posts. The notebook scrapes data from the following websites:

- https://finance.yahoo.com/quote/XRP-AUD/history?p=XRP-AUD (for XRP price in AUD)
- https://www.cryptocompare.com/coins/xrp/forum (for XRP forum posts)
- https://www.investing.com/crypto/xrp/chat (for XRP forum posts)

All collected data is stored in a mySQL database. A CSV backup of the data is also created and saved to the task1 folder with this notebook.

It's important to note that the mySQL database schema includes tables for both price data and forum posts, with column names and data types specified for each. While this script only collects data from two sources, it demonstrates how web scraping can be used to collect data from multiple sources and store it in a database for analysis.

One potential issue with this data collection process is that there may be biases in the forum posts that are collected, as they are only from two websites. Additionally, there may be technical challenges in scraping the data from these websites. By acknowledging these limitations, readers can better interpret the data and its potential biases.

In [1]:
#Requirments to work
import time # Provides time-related functions
from datetime import datetime, timedelta, date # Provides date and time-related functions
from bs4 import BeautifulSoup # For parsing HTML and XML documents
import pandas as pd # For data manipulation and analysis
import dateparser # For parsing date strings
from selenium import webdriver # For automated web testing
from selenium.webdriver.common.keys import Keys # For simulating keyboard keys
from selenium.webdriver.chrome.options import Options # For setting up Chrome options
from tqdm import tqdm # For displaying progress bars
import getpass # For getting a password without displaying it on the screen
from sqlalchemy import create_engine # For connecting to SQL databases
from yahoo_fin.stock_info import get_data # For getting historical stock data from Yahoo Finance


# Function to load data into mySQL
def load_data_mySQL(database_name, table_name, df):
    user = 'root'

    # Prompt the user for a password
    password = getpass.getpass("Enter your MySQL password: ")
    host = 'localhost'
    port = 3306
    database = database_name
    engine = create_engine(f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}")

    # Write the DataFrame to a SQL table
    table_name =  table_name
    df.to_sql(table_name, engine, if_exists='replace', index=False)

    # Close the connection
    engine.dispose()

## cryptocompare.com crawler
Will need to undergo a simple ETL proccess
- Step 1: Extract data from cryptocompare.com
- Step 2: Transform data in prepration for initial storage (add ID, convert datetime to date and time)
- Step 3: Load data into mySQL

In [12]:
# Step 1: - MIGHT DELETE AS CREATED FUNCTION
scroll_duration = timedelta(minutes=35)  # Set the duration for scrolling

# Configure Chrome options for headless mode
chrome_options = Options()
chrome_options.add_argument("--headless")  # More efficient RAM usage

# Replace with the path to your ChromeDriver
driver = webdriver.Chrome(executable_path='C:/webdrivers/chromedriver', options=chrome_options)

url = f"https://www.cryptocompare.com/coins/xrp/forum"
driver.get(url)

# Calculate the number of iterations for the progress bar
scroll_interval = 6  # seconds based on sleep time
total_iterations = int(scroll_duration.total_seconds() // scroll_interval)

# Initialize the progress bar
progress_bar = tqdm(range(total_iterations), desc="Scraping posts")

start_time = datetime.now()
post_data2 = []
prev_post_count = 0
iteration = 0
while iteration < total_iterations:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_interval)

    html_content = driver.page_source
    soup = BeautifulSoup(html_content, "html.parser")
    posts = soup.find_all("div", {"class": "post-content"})

    if len(posts) > prev_post_count:  # Check if new posts are loaded
        for post in posts[prev_post_count:]:
            comment_element = post.find("div", {"class": "content-body"})
            date_element = post.find("div", {"class": "item-ago ng-binding"})
            if comment_element and date_element:
                date_str = date_element['title']
                parsed_date = dateparser.parse(date_str, settings={"RELATIVE_BASE": datetime.now()})
                comment = comment_element.text.replace('\n', '').strip()
                post_data2.append({"comment": comment, "date": parsed_date})

        prev_post_count = len(posts)

    iteration += 1
    progress_bar.update(1)  # Update the progress bar

driver.quit()
progress_bar.close()  # Close the progress bar

df2 = pd.DataFrame(post_data2)


# Save DataFrame as CSV file - backup for SQL
today_date = date.today()
df2.to_csv(f'cryptocompare_{today_date}.csv', index=False)
df2

  driver = webdriver.Chrome(executable_path='C:/webdrivers/chromedriver', options=chrome_options)
  now = self.get_local_tz().localize(now)
Scraping posts:  10%|██████                                                         | 34/350 [06:00<1:15:14, 14.29s/it]

KeyboardInterrupt: 

In [14]:
# Extract Data

class cryptocompareCrawler():
    def __init__(self, start_link):
        self.link_to_explore = start_link
        self.comments = pd.DataFrame(columns=['date', 'comment'])
        chrome_options = Options()
        chrome_options.add_argument("--headless")
        self.driver = webdriver.Chrome(executable_path='C:/webdrivers/chromedriver', options=chrome_options)
        self.pagecount = 1
        self.next = True

    def run(self):
        try:
            post_data = self.scrape_crypto_forum(self.link_to_explore)
            self.extract_data(post_data)
            self.save_data_to_file()
        except:
            print("Cannot get the page " + self.link_to_explore)
            raise
            

    def scrape_crypto_forum(self, url, scroll_duration=timedelta(minutes=32)):
        self.driver.get(url)

        scroll_interval = 6
        total_iterations = int(scroll_duration.total_seconds() // scroll_interval)

        progress_bar = tqdm(range(total_iterations), desc="Scraping posts")

        start_time = datetime.now()
        post_data = []
        prev_post_count = 0
        iteration = 0
        while iteration < total_iterations:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(scroll_interval)

            html_content = self.driver.page_source
            soup = BeautifulSoup(html_content, "html.parser")
            posts = soup.find_all("div", {"class": "post-content"})

            if len(posts) > prev_post_count:
                for post in posts[prev_post_count:]:
                    comment_element = post.find("div", {"class": "content-body"})
                    date_element = post.find("div", {"class": "item-ago ng-binding"})
                    if comment_element and date_element:
                        date_str = date_element['title']
                        parsed_date = dateparser.parse(date_str, settings={"RELATIVE_BASE": datetime.now()})
                        comment = comment_element.text.replace('\n', '').strip()
                        post_data.append({"comment": comment, "date": parsed_date})

                prev_post_count = len(posts)

            iteration += 1
            progress_bar.update(1)

        progress_bar.close()
        return post_data

    def extract_data(self, post_data):
        for data in post_data:
            comment_str = data['comment']
            standardized_date = data['date'].strftime('%Y-%m-%d %H:%M:%S')
            self.comments.loc[len(self.comments)] = [standardized_date, comment_str]

    def save_data_to_file(self):
        today_date = datetime.today().strftime('%Y-%m-%d')
        self.comments.to_csv(f'cryptocompare_{today_date}.csv', index=False)

    def close_spider(self):
        self.driver


In [15]:
# Extract data
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--no-sandbox')

# Initialize the class with the base URL
crawler = cryptocompareCrawler('https://www.cryptocompare.com/coins/xrp/forum')

# Run the crawler
crawler.run()

# Close the spider
crawler.close_spider()

  self.driver = webdriver.Chrome(executable_path='C:/webdrivers/chromedriver', options=chrome_options)

Scraping posts:   0%|                                                                          | 0/320 [00:00<?, ?it/s][A
Scraping posts:   0%|▏                                                                 | 1/320 [00:06<34:23,  6.47s/it][A
Scraping posts:   1%|▍                                                                 | 2/320 [00:15<41:17,  7.79s/it][A
Scraping posts:   1%|▌                                                                 | 3/320 [00:21<38:45,  7.34s/it][A
Scraping posts:   1%|▊                                                                 | 4/320 [00:28<37:42,  7.16s/it][A
Scraping posts:   2%|█                                                                 | 5/320 [00:35<37:25,  7.13s/it][A
Scraping posts:   2%|█▏                                                                | 6/320 [00:43<37:34,  7.18s/it][A
Scraping posts:   2%|█▍            

Scraping posts:  20%|████████████▌                                                  | 64/320 [15:23<1:44:16, 24.44s/it][A
Scraping posts:  20%|████████████▊                                                  | 65/320 [15:47<1:44:18, 24.54s/it][A
Scraping posts:  21%|████████████▉                                                  | 66/320 [16:13<1:45:02, 24.81s/it][A
Scraping posts:  21%|█████████████▏                                                 | 67/320 [16:41<1:48:31, 25.74s/it][A
Scraping posts:  21%|█████████████▍                                                 | 68/320 [17:09<1:51:50, 26.63s/it][A
Scraping posts:  22%|█████████████▌                                                 | 69/320 [17:37<1:52:13, 26.83s/it][A
Scraping posts:  22%|█████████████▊                                                 | 70/320 [18:05<1:54:15, 27.42s/it][A
Scraping posts:  22%|█████████████▉                                                 | 71/320 [18:38<1:59:57, 28.90s/it][A
Scraping posts: 

Scraping posts:  41%|████████████████████████▍                                   | 130/320 [1:08:39<3:47:22, 71.80s/it][A
Scraping posts:  41%|████████████████████████▌                                   | 131/320 [1:09:56<3:51:06, 73.37s/it][A
Scraping posts:  41%|████████████████████████▊                                   | 132/320 [1:11:12<3:53:01, 74.37s/it][A
Scraping posts:  42%|████████████████████████▉                                   | 133/320 [1:12:28<3:52:41, 74.66s/it][A
Scraping posts:  42%|█████████████████████████▏                                  | 134/320 [1:13:46<3:55:06, 75.84s/it][A
Scraping posts:  42%|█████████████████████████▎                                  | 135/320 [1:15:14<4:04:35, 79.32s/it][A
Scraping posts:  42%|█████████████████████████▌                                  | 136/320 [1:16:34<4:04:35, 79.76s/it][A
Scraping posts:  43%|█████████████████████████▋                                  | 137/320 [1:17:54<4:03:13, 79.75s/it][A
Scraping posts: 

Scraping posts:  61%|████████████████████████████████████▏                      | 196/320 [3:24:21<6:46:49, 196.85s/it][A
Scraping posts:  62%|████████████████████████████████████▎                      | 197/320 [3:27:20<6:32:22, 191.40s/it][A
Scraping posts:  62%|████████████████████████████████████▌                      | 198/320 [3:30:13<6:17:57, 185.88s/it][A
Scraping posts:  62%|████████████████████████████████████▋                      | 199/320 [3:32:58<6:01:59, 179.50s/it][A
Scraping posts:  62%|████████████████████████████████████▉                      | 200/320 [3:36:00<6:00:30, 180.25s/it][A
Scraping posts:  63%|█████████████████████████████████████                      | 201/320 [3:38:42<5:47:01, 174.97s/it][A
Scraping posts:  63%|█████████████████████████████████████▏                     | 202/320 [3:41:40<5:45:49, 175.84s/it][A
Scraping posts:  63%|█████████████████████████████████████▍                     | 203/320 [3:44:41<5:45:46, 177.32s/it][A
Scraping posts: 

Scraping posts:  82%|████████████████████████████████████████████████▎          | 262/320 [7:56:55<4:39:39, 289.29s/it][A
Scraping posts:  82%|████████████████████████████████████████████████▍          | 263/320 [8:01:43<4:34:25, 288.87s/it][A
Scraping posts:  82%|████████████████████████████████████████████████▋          | 264/320 [8:06:21<4:26:36, 285.65s/it][A
Scraping posts:  83%|████████████████████████████████████████████████▊          | 265/320 [8:11:13<4:23:40, 287.64s/it][A
Scraping posts:  83%|█████████████████████████████████████████████████          | 266/320 [8:16:42<4:30:02, 300.04s/it][A
Scraping posts:  83%|█████████████████████████████████████████████████▏         | 267/320 [8:21:50<4:27:00, 302.28s/it][A
Scraping posts:  84%|█████████████████████████████████████████████████▍         | 268/320 [8:27:03<4:24:43, 305.45s/it][A
Scraping posts:  84%|█████████████████████████████████████████████████▌         | 269/320 [8:31:51<4:15:17, 300.33s/it][A
Scraping posts: 

In [21]:
# Trasnform Data 

# Add an ID column to the DataFrame
df = pd.read_csv(f'cryptocompare_2023-04-15.csv')
df['id'] = df.index + 1
df['source'] = 'https://www.cryptocompare.com/coins/xrp/forum'
df[['date', 'time']] = df['date'].str.split(' ', n=1, expand=True)
df

Unnamed: 0,date,comment,id,source,time
0,2023-04-14,"moon soon, coon. mark my words (lol idk - no f...",1,https://www.cryptocompare.com/coins/xrp/forum,08:41:00
1,2023-04-14,Boring xrp at the Moment,2,https://www.cryptocompare.com/coins/xrp/forum,02:27:00
2,2023-04-14,So bankrupted FTX….they recover $7.3 billion i...,3,https://www.cryptocompare.com/coins/xrp/forum,01:20:00
3,2023-04-13,Is UMU what many thought XRP was going to be u...,4,https://www.cryptocompare.com/coins/xrp/forum,19:11:00
4,2023-04-13,Coinbase just deposited Flare into my account....,5,https://www.cryptocompare.com/coins/xrp/forum,00:49:00
...,...,...,...,...,...
6397,2021-02-02,I somewhat feel sorry for the newcommers. I gu...,6398,https://www.cryptocompare.com/coins/xrp/forum,02:30:00
6398,2021-02-02,Veyor..MGI,6399,https://www.cryptocompare.com/coins/xrp/forum,02:28:00
6399,2021-02-02,big jump in ODL volume since the sec lawsuitis...,6400,https://www.cryptocompare.com/coins/xrp/forum,02:27:00
6400,2021-02-02,Funny how humans brain works. In fomo Times ev...,6401,https://www.cryptocompare.com/coins/xrp/forum,02:11:00


In [22]:
load_data_mySQL('MA5851_A3', 'cryptocompare_xrp', df)

Enter your MySQL password: ········


## Investing.com crawler
Will need to undergo a simple ETL proccess
- Step 1: Extract data from Investing.com
- Step 2: Transform data in prepration for initial storage (add ID, convert datetime to date and time)
- Step 3: Load data into mySQL

In [6]:
# Code to extract data
# Scrape https://www.investing.com/crypto/xrp/chat 
# Need to crawl multiple pages over 1200 pages! with data from 2019

class investingCrawler():
    def __init__(self, start_link):
        self.link_to_explore = start_link
        self.comments = pd.DataFrame(columns = ['date','comment'])
        self.driver = webdriver.Chrome('chromedriver',chrome_options=chrome_options)           
        self.pagecount = 1
        self.next = True
        
    def run(self):
        with tqdm(total=1222, dynamic_ncols=True, desc="Scraping posts") as pbar:
            while self.next:
                if self.pagecount >= 1222:
                    self.save_data_to_file()
                    self.next = False
                try:
                    self.driver.get(self.link_to_explore + "/" + str(self.pagecount))
                    self.driver.implicitly_wait(15)
                    self.extract_data()      
                    self.pagecount = self.pagecount + 1
                    pbar.update(1)
                except:
                    print ("Cannot get the page " + self.link_to_explore)
                    self.next = False
                    raise

    def extract_data(self):
        html_content = self.driver.page_source
        soup = BeautifulSoup(html_content, "html.parser")

        comment_wrappers = soup.find_all("div", {"class": "commentInnerWrapper"})

        count = 0
        for wrapper in comment_wrappers:
            date_element = wrapper.find("span", {"class": "js-date"})
            date_str = date_element["comment-date-formatted"]

            # Parse date string and convert it to a standardized format
            parsed_date = dateparser.parse(date_str)
            standardized_date = parsed_date.strftime('%Y-%m-%d %H:%M:%S')

            comment_element = wrapper.find("span", {"class": "js-text"})
            comment_str = comment_element.text.strip()

            # Adding date and comment to the dataframe
            self.comments.loc[len(self.comments)] = [standardized_date, comment_str]
            count += 1

        return count



    def save_data_to_file(self):
    #we save the dataframe content to a CSV file
        today_date = date.today()
        self.comments.to_csv(f'investing_{today_date}.csv', index=False)
    def close_spider(self):
    #end the session
        self.driver.quit()

In [7]:
# Extract data
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--no-sandbox')

# Initialize the class with the base URL
crawler = investingCrawler('https://www.investing.com/crypto/xrp/chat')

# Run the crawler
crawler.run()

# Close the spider
crawler.close_spider()

  self.driver = webdriver.Chrome('chromedriver',chrome_options=chrome_options)
Scraping posts: 100%|██████████████████████████████████████████████████████████████| 1222/1222 [41:11<00:00,  2.02s/it]


In [8]:
# Transform data
df = pd.read_csv(f'investing_2023-04-14.csv')

# Split the date and time values into separate columns
df[['date', 'time']] = df['date'].str.split(' ', n=1, expand=True)

# Add an ID column & source URL to the DataFrame
df['id'] = df.index + 1
df['source'] = 'https://www.investing.com/crypto/xrp/chat'

# Print to confrim looks correct
df

Unnamed: 0,date,comment,time,id,source
0,2023-04-14,Still a bargain. Regards.,12:15:57,1,https://www.investing.com/crypto/xrp/chat
1,2023-04-12,btc going higher so will this,23:10:00,2,https://www.investing.com/crypto/xrp/chat
2,2023-04-12,Going back down to .40. No settlement. Market ...,14:39:00,3,https://www.investing.com/crypto/xrp/chat
3,2023-04-12,$1,11:01:00,4,https://www.investing.com/crypto/xrp/chat
4,2023-04-12,Back to sick,07:57:00,5,https://www.investing.com/crypto/xrp/chat
...,...,...,...,...,...
59743,2018-03-18,Same happened in 2013 ....many sold and few go...,03:29:00,59744,https://www.investing.com/crypto/xrp/chat
59744,2018-03-17,Sell it off before it’s too late,15:22:00,59745,https://www.investing.com/crypto/xrp/chat
59745,2018-03-17,No more up for crypto,15:22:00,59746,https://www.investing.com/crypto/xrp/chat
59746,2018-03-16,Now cardano it's going to ********up anytime!,18:43:00,59747,https://www.investing.com/crypto/xrp/chat


In [9]:
# Load data to mySQL ready for task 2
load_data_mySQL('MA5851_A3', 'investingcom_xrp', df)

Enter your MySQL password: ········


## Yahoofinance.com price extraction
Yahoo provide API and tools to obtain data which is much more efficent then scrapping via a built in function for python
Will need to undergo a simple ETL proccess
- Step 1: Extract data from yahoofinance.com
- Step 2: Transform data in prepration for initial storage (add ID, convert datetime to date and time)
- Step 3: Load data into mySQL

In [None]:
# Extract data

df= get_data("XRP-AUD", start_date="01/01/2017", end_date="14/04/2023", index_as_date = False, interval="1d")


In [None]:
# Transform data
# Add an ID column to the DataFrame
df['id'] = df.index + 1

# Create the connection to the MySQL database using sqlalchemy
user = 'root'

# Prompt the user for a password
password = getpass("Enter your MySQL password: ")
host = 'localhost'
port = 3306
database = 'MA5851_A3'

engine = create_engine(f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}")

# Write the DataFrame to a SQL table
table_name = 'cryptocompare_xrp'
df.to_sql(table_name, engine, if_exists='replace', index=False)

# Close the connection
engine.dispose()

In [None]:
# Load data
load_data_mySQL('MA5851_A3', 'xrp_price_yahoo', df)