### Scraping Twitter using Selenium (solution)

Welcome to this notebook, which is part of an introduction to web scraping with Selenium. Specifically, we are going to scrape tweets about bitcoin.

Disclaimer:
- Many improvements could be made to this code that would significantly improve the quality of the data obtained. This notebook has only one purpose: to explain the basics of web scraping with Selenium.

- For some parts, I have used Izzy Analytics on YouTube as inspiration. I recommend giving him a watch: https://www.youtube.com/watch?v=3KaffTIZ5II&t=289s

##### Task 1: Collecting our ingredients: (Guided) 

You need:
- A Python environment with Selenium
- Google Chrome
- ChromeDriver (Chromium)
- A Twitter account

How to obtain these is described in the presentation PDF, which is also in this repo.

Also, we need to import the following:

In [1]:
from time import sleep  # Will come in handy
from getpass import getpass  # For logging in to Twitter through Python
from selenium import webdriver  # Our WebDriver

# other, but necessary:
from selenium.webdriver.common.by import By  # For Crawling
from selenium.webdriver.common.keys import Keys  # For Crawling
from selenium.webdriver.chrome.options import (
    Options,
)  # For setting some options for the driver, see Appendix.
from selenium.common.exceptions import NoSuchElementException  # Avoiding ads
from selenium.webdriver.support import expected_conditions as EC  # Conditions
from selenium.webdriver.support.ui import (
    WebDriverWait,
)  # Make sure the element is loaded

##### Task 2: Setting up, and starting our driver: (Guided)

In [4]:
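# Selenium 4 expects the driver path wrapped in a Service object:
from selenium.webdriver.chrome.service import Service

# Path configurations:
DRIVER_PATH = "C:\\Program Files (x86)\\chromedriver.exe"

# Set some options: default
options = Options()

# Start driver:
driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=options)

Note: In recent Selenium 4 releases, `webdriver.Chrome(DRIVER_PATH)` no longer works. The first positional argument is now interpreted as the options object, so passing a path string raises `NoSuchDriverException` with the message `'str' object has no attribute 'capabilities'`. For documentation on this error, see https://www.selenium.dev/documentation/webdriver/troubleshooting/errors/driver_location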


##### Task 3: Open Twitter, and provide the notebook with your login: (Guided)

In [12]:
web_site = "https://twitter.com/home"
driver.get(web_site)

In [4]:
my_username = input("Provide a username: ")
my_password = getpass()

#### Extra: HTML and XPATH

What makes Selenium very powerful compared to more traditional web scraping frameworks is that we can easily extract just the parts of the HTML we want. This makes it easy to get clean data sets from the start.

Instead of downloading whole HTML pages and then cleaning the data, we can tell Selenium where the elements we want are located and extract only that information.

*HTML*, which stands for HyperText Markup Language, is the foundation of every website you see on the internet. It is a simple and powerful language used to create the structure and content of web pages. Think of HTML as the skeleton that gives a web page its shape.

Example:

    <div>
        First div
        <div>
            Second div
            <input type="text" placeholder="Middle input" />
        </div>
    </div>
    <div>
        Third div
    </div>

*XPath* is a query language used to navigate and select elements from an HTML document. It provides a concise way to locate specific elements or extract data based on their element structure, attributes, or content.

To get the input element in the code above, we would give Selenium the following XPath:
    
    /html/body/div[1]/div/input[@placeholder='Middle input']
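
To see the path in action without opening a browser, here is a minimal sketch that evaluates it with Python's built-in `xml.etree.ElementTree`. This is only a stand-in for Selenium, under a few assumptions: ElementTree supports just a subset of XPath, needs well-formed markup, and starts from the root element, so `/html/body/...` is written as `./body/...`. The `<html>` and `<body>` wrappers are added here for illustration.

```python
import xml.etree.ElementTree as ET

# The example above, wrapped in <html><body> (wrappers added for illustration).
PAGE = """
<html>
  <body>
    <div>
      First div
      <div>
        Second div
        <input type="text" placeholder="Middle input" />
      </div>
    </div>
    <div>
      Third div
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)  # root is the <html> element

# Full path, written relative to the root element.
full = root.find("./body/div[1]/div/input[@placeholder='Middle input']")

# Shorter path: any <input> with that placeholder, anywhere below the root.
short = root.find(".//input[@placeholder='Middle input']")

print(full.attrib)
print(full is short)
```

Both expressions lead to the same element, which is the same reason a short, attribute-based XPath is usually preferable to a long positional one.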

##### In our case...

The full XPath of the element where you enter your username on Twitter is:

    "/html/body/div[1]/div/div/div[1]/div/div/div/div/div/div/div[2]/div[2]/div/div/div[2]/div[2]/div/div/div/
    div[5]/label/div/div[2]/div/input"

But this also works:

    "//input[@name='text']"

This works because its `name` attribute is unique in the whole HTML document. As we can see, finding the right identifier takes some practice.



##### Task 4: Our first crawling by logging in: (Guided)

In [13]:
username = driver.find_element(By.XPATH, "//input[@name='text']")
username.send_keys(my_username)
username.send_keys(Keys.RETURN)

##### Task 5: Our second crawling: (Try yourself) - 10 min

In [14]:
password = driver.find_element(By.XPATH, "//input[@name='password']")
password.send_keys(my_password)
password.send_keys(Keys.RETURN)

##### Task 5: Search for tweets mentioning "bitcoin": (Guided)

In [15]:
search_box = driver.find_element(By.XPATH, "//input[@aria-label='Search query']")
search_box.send_keys("bitcoin")
search_box.send_keys(Keys.RETURN)

Note: If you make the browser window narrower, the search box may disappear from the page, and this lookup then fails with `NoSuchElementException`.

##### Extra: Twitter Advanced Search

Using Selenium lets us navigate pages, but it also forces us to think smart. We want our code to do as little as possible to save time. Take this example:

Bitcoin traded at about 50'000 dollars in October 2021.
Bitcoin traded at about 20'000 dollars in October 2022.

To search for particular dates, we can search for:

```"bitcoin" lang:en until:2021-10-15 since:2021-10-14 -filter:links -filter:replies```

and
 
```"bitcoin" lang:en until:2022-10-15 since:2022-10-14 -filter:links -filter:replies```

Here we have also filtered the results so that we get *only English tweets*, *no links* and *no replies*.

The same could be achieved by clicking "advanced search" and ticking the boxes we want. Typing the query directly into the search box saves a lot of time.
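
If we want to run the same kind of search for several dates, the query string can be assembled in code. The helper below is a minimal sketch: `build_search_query` and its parameters are hypothetical names introduced here, and it only reproduces the operators used in the query above.

```python
def build_search_query(term, since, until, lang="en",
                       exclude_links=True, exclude_replies=True):
    """Assemble a search string in the same format as the query above."""
    parts = [f'"{term}"', f"lang:{lang}", f"until:{until}", f"since:{since}"]
    if exclude_links:
        parts.append("-filter:links")
    if exclude_replies:
        parts.append("-filter:replies")
    return " ".join(parts)


query_2021 = build_search_query("bitcoin", since="2021-10-14", until="2021-10-15")
print(query_2021)
```

The resulting string can then be passed to `search_box.send_keys(...)` in place of the hard-coded query.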

In [16]:
search_box = driver.find_element(By.XPATH, "//input[@aria-label='Search query']")
search_box.send_keys(
    '"bitcoin" lang:en until:2021-10-15 since:2021-10-14 -filter:links -filter:replies'
)
search_box.send_keys(Keys.RETURN)

Question: The code snippet above might not work. Why?

##### Task 6: Click on Latest (Homework)
We want to look at the latest tweets. Try to click the tab by
1. Locating the element
2. Calling element.click()

In [17]:
driver.find_element(By.LINK_TEXT, "Latest").click()

If you have more time, try clicking "Top" again, or try clicking the "Tweet" button.

##### Task 7: Scraping tweets by locating tweets (cards), collecting them, and combining them into a deck of "cards": (Guided)

In [18]:
cards = driver.find_elements(By.XPATH, '//article[@data-testid="tweet"]')

In [None]:
for card in cards:
    print(card)

So far, the cards are just WebElement objects. We can pick one card and go a bit deeper.

In [None]:
card = cards[0]
card.text

##### Task 8: Finding the Twitter Handle (Name of Twitter Account, not username): (Guided)

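NOTE: Once we search inside a selected element, the XPath has to start with "." so that it is evaluated relative to that element. Without the dot, "//" searches the whole page again.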

In [None]:
handle = card.find_element(By.XPATH, ".//a/div/div[1]/span/span").text
print(handle)

##### Task 9: We can also find username and date: (Homework)

First, try yourself. Username is a bit easier than date. *Hint*: Try to look for a unique identifier / tag.

Selenium has the following ways of identifying elements:

    driver.find_element(By.ID, "id")
    driver.find_element(By.NAME, "name")
    driver.find_element(By.XPATH, "xpath")
    driver.find_element(By.LINK_TEXT, "link text")
    driver.find_element(By.PARTIAL_LINK_TEXT, "partial link text")
    driver.find_element(By.TAG_NAME, "tag name")
    driver.find_element(By.CLASS_NAME, "class name")
    driver.find_element(By.CSS_SELECTOR, "css selector")

In [22]:
username = card.find_element(By.XPATH, ".//span[contains(text(),'@')]").text
date = card.find_element(By.XPATH, ".//time").get_attribute(
    "datetime"
)  # Sponsored Content does not have this

##### Task 10: At last, let's collect the tweet itself (This is a bit more complicated):

In [23]:
tweet_body = card.find_elements(By.XPATH, ".//div/div[2]/div[2]/div[2]/div/span")
text_list = [span.text for span in tweet_body]
tweet_text = " "
tweet_text = tweet_text.join(text_list)

Let's extend our collection from one tweet to several.

##### Wrapping up: Make a function that executes all the steps above, and makes each tweet and the collected information into a tuple

In [24]:
def collect_tweet(card):
    try:
        date = card.find_element(By.XPATH, ".//time").get_attribute(
            "datetime"
        )  # Sponsored Content does not have this
    except NoSuchElementException:
        return False

    handle = card.find_element(By.XPATH, ".//a/div/div[1]/span/span").text
    username = card.find_element(By.XPATH, ".//span[contains(text(),'@')]").text

    tweet_text = _collect_text(card)

    tweet = (handle, username, date, tweet_text)
    return tweet


def _collect_text(card):
    tweet_body = card.find_elements(By.XPATH, ".//div/div[2]/div[2]/div[2]/div/span")
    text_list = [span.text for span in tweet_body]
    tweet_text = " "
    return tweet_text.join(text_list)

In [None]:
tweets = []
for card in cards:
    tweet = collect_tweet(card)
    if tweet:
        tweets.append(tweet)

tweets

To load more tweets, we need to scroll, which can be done with:

In [32]:
driver.execute_script("window.scroll(0,document.body.scrollHeight);")

The last part is inspired by @israel-dryer (GitHub) and updated to fit our case.

- In particular, each direct lookup such as

    ```
    driver.find_element(By.XPATH, "//input[@name='text']")
    ```

    is replaced by an explicit wait, which gives the page time to load the element:

    ```
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//input[@name='text']"))
        )
    ```

- I have also added a loading bar.
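
The idea behind an explicit wait can be sketched without a browser. The helper below is a simplified stand-in, not Selenium's implementation: `wait_until` and `FakePage` are hypothetical names introduced here. It polls a condition until it returns something truthy or the timeout passes, which is roughly what `WebDriverWait(...).until(...)` does.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy comes back within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met in time")
        time.sleep(poll)


class FakePage:
    """Pretends to be a page whose element shows up on the third lookup."""

    def __init__(self):
        self.lookups = 0

    def find_input(self):
        self.lookups += 1
        return "<input name='text'>" if self.lookups >= 3 else None


page = FakePage()
element = wait_until(page.find_input, timeout=2.0, poll=0.01)
print(element, "found after", page.lookups, "lookups")
```

Compared with a fixed `sleep`, polling returns as soon as the element is available and fails loudly when it never appears.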

In [26]:
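from selenium.webdriver.chrome.service import Service  # Selenium 4 takes the driver path via Service


def my_scraper(DRIVER_PATH, options, max_tweets):
    driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=options)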
    web_site = "https://twitter.com/home"
    driver.get(web_site)

    # Crawl:

    username = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//input[@name='text']"))
    )
    username.send_keys(my_username)
    username.send_keys(Keys.RETURN)

    password = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//input[@name='password']"))
    )
    password.send_keys(my_password)
    password.send_keys(Keys.RETURN)

    search_box = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located(
            (By.XPATH, "//input[@aria-label='Search query']")
        )
    )
    search_box.send_keys(
        '"bitcoin" lang:en until:2021-10-15 since:2021-10-14 -filter:links -filter:replies'
    )
    search_box.send_keys(Keys.RETURN)

    latest = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "Latest"))
    )
    latest.click()

    # Scrape:

    data = []
    tweet_ids = set()  # In order to not collect duplicates
    last_position = driver.execute_script("return window.pageYOffset;")
    scrolling = True

    while scrolling:
        page_cards = driver.find_elements(By.XPATH, '//article[@data-testid="tweet"]')
        for card in page_cards[-15:]:
            tweet = collect_tweet(card)

            if tweet:
                tweet_id = "".join(tweet)

                if tweet_id not in tweet_ids:
                    tweet_ids.add(tweet_id)
                    data.append(tweet)

        # Loading bar VISUALIZATION
        percent_done = int((len(data) / max_tweets) * 100)
        print(f"{percent_done}% ", end="", flush=True)

        scroll_attempt = 0

        while True:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            sleep(2)
            curr_position = driver.execute_script("return window.pageYOffset;")

            if last_position == curr_position:
                scroll_attempt += 1

                # end of scroll region
                if scroll_attempt >= 3:
                    scrolling = False
                    break

                else:
                    sleep(2)  # attempt another scroll

            else:
                last_position = curr_position
                break

        if len(data) >= max_tweets:
            scrolling = False

    # Quit the driver: closes the browser and ends the session
    driver.quit()
    return data

In [None]:
data = my_scraper(DRIVER_PATH, options, max_tweets=17)
data
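
The scroll-and-deduplicate logic inside `my_scraper` can be tried out without a browser. The sketch below is a simplified illustration under assumptions: `FakeTimeline` and `collect_all` are hypothetical names, and the fake page shows a sliding window of items so that consecutive views overlap, which is why the `seen` set is needed. Trimming the result to `max_items` is a choice made in this sketch.

```python
class FakeTimeline:
    """Stand-in for a page that shows a sliding window of items."""

    def __init__(self, items, window=4, step=2):
        self.items = items
        self.window = window
        self.step = step
        self.position = 0

    def visible(self):
        return self.items[self.position:self.position + self.window]

    def scroll(self):
        # Move down, but never past the start of the last full window.
        last_start = max(len(self.items) - self.window, 0)
        self.position = min(self.position + self.step, last_start)
        return self.position


def collect_all(page, max_items, max_attempts=3):
    data, seen = [], set()
    last_position = page.position
    while len(data) < max_items:
        for item in page.visible():
            key = "".join(item)  # same duplicate check as in my_scraper
            if key not in seen:
                seen.add(key)
                data.append(item)
        attempts = 0
        while True:
            current = page.scroll()
            if current != last_position:
                last_position = current  # the page moved, scrape again
                break
            attempts += 1
            if attempts >= max_attempts:  # stuck: end of the timeline
                return data[:max_items]
    return data[:max_items]


fake_tweets = [(f"user{i}", f"text{i}") for i in range(7)]
everything = collect_all(FakeTimeline(fake_tweets), max_items=100)
first_five = collect_all(FakeTimeline(fake_tweets), max_items=5)
print(len(everything), len(first_five))
```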

## Appendix








In [None]:
# Some noteworthy options:

options.add_experimental_option(
    "prefs",
    {
        "download.default_directory": PLACE_YOUR_DESIRED_PATH,
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True,
    },
)
# setDownloadPreferences: Sets the download preferences for the browser.
# Here, it specifies the default download directory, disables the download prompt,
# enables directory upgrade, and enables safe browsing.

options.add_argument("--headless=new")
# setHeadlessMode: Sets the browser in headless mode, which means it runs without a
# graphical user interface.

options.add_argument("--disable-gpu")
# disableGPU: Disables the use of the GPU (graphics processing unit) in the browser.

options.add_argument("--no-sandbox")
# disableSandbox: Disables the sandbox mode, which provides an extra layer of security for the browser.

options.add_argument("--disable-dev-shm-usage")
# disableDevShmUsage: Disables the use of /dev/shm for shared memory; Chrome uses /tmp instead.

options.add_argument("--log-level=3")
# setLogLevel: Sets the logging level for the browser. Here, it sets the log level to 3, which shows only fatal messages (the least verbose setting).

options.add_argument("--silent")
# setSilentMode: Sets the browser in silent mode, which suppresses most browser notifications and prompts.